
Glift: Generic Data Structures for Graphics Hardware

By

AARON ELIOT LEFOHN

B.A. Chemistry (Whitman College) 1997
M.S. Physical Chemistry (University of Utah) 2001
M.S. Computer Science (University of Utah) 2004

DISSERTATION

Submitted in partial satisfaction of the requirements for the degree of

DOCTOR OF PHILOSOPHY

in

Computer Science

in the

OFFICE OF GRADUATE STUDIES

of the

UNIVERSITY OF CALIFORNIA

DAVIS

Approved:

Committee in charge

2006

Glift: Generic Data Structures for Graphics Hardware

Copyright 2006

by

Aaron Eliot Lefohn

To Karen and Kaia.

Acknowledgments

I am humbled by the large number of people who have directly and indirectly contributed to this

thesis. Throughout my time at UC Davis, I have had the great fortune to work with an outstanding collection of colleagues at UC Davis, the University of Utah, Stanford, the University of North Carolina, the University of Virginia, and elsewhere.

I begin by thanking my advisor, John Owens. I consider the most important traits in a graduate

advisor to be trust and a passionate interest in my work. John has provided everything I needed

to succeed and repeatedly put my best interests before his own. His passion for and interest in my

work propelled me through the inevitable low points of research and encouraged me to take the risks

required for true innovation. I cannot thank John enough for promoting my work and giving me the

freedom to explore, even when it meant that I published four papers that are not part of this thesis.

I also want to thank my committee members, Ken Joy, Bernd Hamann, Oliver Staadt, and Nina

Amenta, for their feedback and comments. I additionally want to thank Ken Joy for his vision and continued determination to build the UC Davis Graphics Lab (officially known as the Institute

for Data Analysis and Visualization, or IDAV) into a top-tier graphics and visualization research

institute.

Next, I would like to thank my coauthors on the work that is presented in this thesis: Joe Kniss,

Shubhabrata “Shubho” Sengupta, Robert Strzodka, and Michael Kass. Joe Kniss entered my life as

my GPU programming mentor at the University of Utah in 2002. Since then, Joe and I have published thirteen papers, sketches, technical reports, and conference tutorials together. Joe has been an integral part of the inspiration, design, and implementation of Glift, introduced me to generic programming, and implemented most of the octree 3D paint application. I cannot thank Joe enough

for his friendship, time, and brilliance.

Shubho Sengupta has morphed from beginning OpenGL programmer to an invaluable contributor of core algorithms and implementations over the course of the last two years. Shubho implemented

large portions of several versions of the adaptive shadow map, resolution-matched shadow map,

and hybrid parallel scan algorithms. Shubho also helped significantly with the theory section for the

depth-of-field application. I thank him very much for his hard work, dedication, and insights.

I want to thank Robert Strzodka for his rigor and for being an integral part of the adaptive data structure development, especially with respect to correct filtering and the node-centered representation. Robert has been involved with the adaptive grid work from the beginning,

has contributed substantially to the Glift code base, and continues to push the usage and applications of the Glift work with his research on GPU-based adaptive partial differential equation solvers.

Fabio Pellacini, an assistant professor at Dartmouth College and formerly at Pixar Animation Studios, has had a large impact on the octree 3D paint and shadow applications in this thesis. It was Fabio who first pointed out that we might be able to implement adaptive shadow maps with an adaptation of our octree structure. Fabio has been a tremendous role model over the last three years, and I want to thank him for his mentorship, career advice, and research brainstorming sessions.

We’ve mulled over many great and not-so-great ideas together, and I greatly value our time together.

I am honored to have had the opportunity to work with Michael Kass at Pixar Animation Studios on the depth-of-field project. Michael is credited with the heat diffusion model for depth-of-field.

I want to thank him very much for his mentorship during our collaboration. I am indebted to the other contributors to the depth-of-field project, including the following people from Pixar Animation

Studios: Mark Adams built the flags and fence model, John Warren provided shots from Cars,

Mark Meyer and Kiril Vidimce integrated the algorithm into Pixar’s production tools, and Rob

Cook and Tony DeRose supported the project from conception through completion, including the collaboration with UC Davis. In addition, I want to thank Mike Kirby at the University of Utah for valuable discussions on parallel tridiagonal solver techniques.

In addition to the people mentioned so far, my colleagues at Pixar have had an enormous impact on my career and thinking. I want to thank Kiril Vidimce and Mark Leone for their support, friendship, and belief in my work during my time at Pixar. I also want to thank Dana Batali, Wayne Wooten,

David Laur, Jonathan Shade, and the other members of the RenderMan team for including me in the

group, supporting my work, and engaging me in wonderful arguments. I’ve learned a tremendous amount from working with the team and thank you very, very much for the opportunity.

The Glift work simply would not have been possible without the unprecedented level of support

we received from NVIDIA. Nick Triantos tirelessly answered questions about GPU architecture

and provided pre-release display drivers, Craig Kolb and Cass Everitt both added features to the

Cg runtime for Glift and spent many extra hours providing support to the project, and David Kirk generously provided GPU hardware and was a strong advocate for my work.

I want to give a special thanks to Randima Fernando at NVIDIA, the inventor of the adaptive shadow map algorithm, for being willing to answer many questions about his work and being supportive of our work. Having Randy as a resource proved absolutely invaluable as we first adapted, then revised and improved upon his groundbreaking work.

In addition, a number of other people have contributed to the Glift data structure work. Ross

Whitaker and Milan Ikits at the University of Utah provided early feedback and contributions to the adaptive data structure idea. James Ahrens and Patrick McCormick at the Advanced Computing

Laboratory (ACL) at Los Alamos National Lab have contributed significantly to my appreciation and understanding of data-parallel algorithmic primitives. Similarly, I want to thank Mark Harris at NVIDIA for many stimulating discussions about data-parallel algorithms, data structures, and machine models. Daniel Horn at Stanford provided detailed support during our implementation of his stream compaction algorithm as well as early feedback and encouragement on our improved algorithm. I also thank Kayvon Fatahalian and Mike Houston at Stanford for their constructive criticism and encouragement on Glift and the adaptive/resolution-matched shadow work. I also want to thank Naga Govindaraju, at the University of North Carolina, for providing source code for a custom version of GPUSort amidst the SIGGRAPH crunch. I also thank the following additional people for providing feedback on the work throughout its development: David Blythe at Microsoft,

Ian Buck at NVIDIA, Dominik Göddeke at Dortmund University, Matt Papakipos at PeakStream,

Matt Pharr at Neoptica, Mark Segal at ATI, Peter-Pike Sloan at Microsoft, Dan Wexler at NVIDIA,

Yao Zhang at Beijing Institute of Technology, and the anonymous reviewers.

I owe thanks to a number of my lab-mates in the Institute for Data Analysis and Visualization

(IDAV) at UC Davis for their friendship and help throughout this work. Adam Moerschell at UC

Davis selflessly agreed to make all of the movies for papers, sketches, and talks related to Glift. Yong Kil, Chris Co, Shubho Sengupta, Taylor Holliday, and Louis Feng provided various 3D models for

the shadow and depth-of-field projects. I also want to thank, in no particular order, Chris Co, Yong

Kil, Brian Budge, Serban Porumbescu, and Ben Gregorski for their friendship and support during

my time in IDAV.

A National Science Foundation (NSF) Graduate Fellowship funded all of the work presented in this dissertation. The NSF fellowship has given me pure and complete intellectual freedom throughout

my Ph.D. Most of this work simply would not have been possible without the fellowship. I have

truly reveled in the freedom gained by having my own research funding. Additional funding for travel, equipment, and conference costs was provided by grants from the Department of Energy, Chevron, and Los Alamos National Laboratory.

Lastly, I want to thank my family. My parents, Allen and Phyllis Lefohn, continue to be my role

models for living life with passion, creativity, and hard work. More than anyone else, however, I

want to thank my wife, Karen, and daughter, Kaia, for their patience and support throughout this

very long journey. They have sacrificed many nights without me, yet all of this would be for nothing

without them. They provide me with indescribable wisdom, perspective, and love.

Contents

List of Figures
List of Tables
Abstract

I Introduction

1 Introduction
  1.1 Graphics Processor Background
  1.2 Glift
    1.2.1 Example
  1.3 GPU Data Structures
  1.4 Applications
    1.4.1 Octree 3D Paint
    1.4.2 Adaptive Shadow Maps
    1.4.3 Resolution-Matched Shadow Maps
    1.4.4 Depth of Field
  1.5 Publications
2 Background
  2.1 Graphics Processor Architecture
    2.1.1 Classification of Parallel Processors
    2.1.2 Graphics Processors
    2.1.3 IBM Cell Processor
    2.1.4 Multicore CPUs
    2.1.5 Looking Ahead: The Convergence
  2.2 Programming Graphics Hardware
  2.3 CPU Data-Parallel Libraries and Languages

II Glift: Generic GPU Data Structures

3 An Abstraction for Random-Access GPU Data Structures
  3.1 The GPU Memory Model
  3.2 Glift Components
    3.2.1 Physical Memory
    3.2.2 Virtual Memory
    3.2.3 Address Translator
    3.2.4 Iterators
    3.2.5 Container Adaptors
4 Glift Programming
  4.1 Physical Memory
  4.2 Virtual Memory
  4.3 Address Translator
  4.4 Iterators
    4.4.1 Traversal
    4.4.2 Access Permissions
    4.4.3 Access Patterns
    4.4.4 GPU Iterators
    4.4.5 Address Iterators
  4.5 Container Adaptors
5 Glift Design and Implementation
  5.1 GPU Code Generation and Compilation
    5.1.1 Alternate Designs
  5.2 Cg Program Specialization
  5.3 Iterators
    5.3.1 Address Iterators
    5.3.2 Element Iterators
  5.4 Mapping Glift Data into CPU Memory
  5.5 Virtualized Range Operations
6 Results
  6.1 Static Analysis of Glift GPU Code
  6.2 Memory Access Coherency
7 Discussion
  7.1 Language Design
  7.2 Separation of Address Translation and Physical Data Memory
  7.3 Iterators
    7.3.1 Generalization of GPU Computation Model
    7.3.2 Encapsulation of Optimizations
    7.3.3 Explicit Memory Access Patterns
    7.3.4 Parallel Iteration Hardware
  7.4 Near-Term GPU Architectural Changes
  7.5 Limitations

III GPU Data Structures

8 Classification of GPU Data Structures
  8.1 Analytic ND-to-MD Translators
  8.2 Page Table Translators
  8.3 GPU Tree Structures
  8.4 Dynamic GPU Structures
  8.5 Limitations of the Abstraction
9 Example Glift Data Structures
  9.1 GPGPU 4D Array
  9.2 GPU Stack
  9.3 Dynamic Multiresolution Adaptive GPU Data Structures
    9.3.1 The Data Structure
    9.3.2 Adaptivity Implementation Details

IV Applications

10 Octree 3D Paint
  10.1 Introduction
  10.2 Data Structure
  10.3 Algorithm
  10.4 Results
11 Quadtree Shadow Maps
  11.1 Introduction and Background
    11.1.1 Recent Work in Shadow Maps
  11.2 Quadtree Shadow Map Data Structure
  11.3 Adaptive Shadow Maps on the GPU
    11.3.1 ASM Results
  11.4 Resolution-Matched Shadow Maps
    11.4.1 Algorithm
    11.4.2 Implementation
    11.4.3 Results and Discussion
12 A Heat Diffusion Model for Interactive Depth of Field Simulation
  12.1 Prior Work
  12.2 Theory and Algorithm
    12.2.1 Circle of Confusion
    12.2.2 Single-Layer Depth-of-Field Algorithm
    12.2.3 Separating the Background and Midground
    12.2.4 Solving the Foreground
    12.2.5 Automatically Generating Multiple Input Images
  12.3 Implementation
    12.3.1 Data Structures
    12.3.2 Algorithm Implementation
  12.4 Results
    12.4.1 Runtime and Analysis
    12.4.2 Limitations

V Conclusions and Future Work

13 Future Work
  13.1 Glift
    13.1.1 A Programming Model for Commodity Parallelism
    13.1.2 Generic Algorithms
    13.1.3 Impact of Future GPU Architectures
    13.1.4 Additional GPU Data Structures
  13.2 Octree 3D Paint
  13.3 Quadtree Shadow Maps
  13.4 Depth of Field
14 Conclusions

VI Appendix

A Glift C++ Source Code Example
B C++ Template Type Factories
  B.1 Introduction
  B.2 Template Type Factories
  B.3 Analysis
  B.4 Code Example
C Separating Types from Behavior in C++
  C.1 User-Visible Class Declaration
    C.1.1 Code Example
  C.2 Types Class
    C.2.1 Code Example
  C.3 Implementation Class
    C.3.1 Code Example
  C.4 Questions
  C.5 Complete Code Example

Bibliography

List of Figures

1.1 Dragon model interactively painted with octree texture
1.2 Interactive adaptive shadow map with effective resolution of 131,072²
1.3 Comparison of resolution-matched shadow maps to adaptive shadow maps
1.4 Interactive depth-of-field using a new heat diffusion model
2.1 The modern GPU pipeline
3.1 Glift components
6.1 Glift bandwidth analysis for paged structures
9.1 Glift stack memory layout
9.2 Glift stack pop
9.3 Glift multiresolution data structure
9.4 Glift node-centered multiresolution scheme
10.1 Details of octree 3D paint brushing and filtering
10.2 Octree 3D paint on dragon model
11.1 Projective aliasing addressed with resolution-matched shadow maps
11.2 Quadtree shadow map data structure
11.3 Interactive adaptive shadow map with 131,072² resolution
11.4 Resolution-matched shadow map with 4,000 self-shadowing hairs
11.5 Robot scene used as example for resolution-matched shadow maps
11.6 Comparison of resolution-matched shadow maps to adaptive shadow maps
11.7 Performance comparison for all scenes for ASM and RMSM
11.8 Skeleton scene used to evaluate resolution-matched shadow maps
11.9 City scene used to evaluate resolution-matched shadow maps
11.10 Shadow performance scaling for varying image resolutions
11.11 Performance scaling with GPU generations for resolution-matched shadow maps
11.12 Number of rendered superpages for ASMs and RMSMs
11.13 Memory consumption comparison for ASM and RMSM
11.14 Memory usage efficiency for resolution-matched shadow maps
11.15 Coherency of shadow data for resolution-matched shadow maps
11.16 Quality comparison of 8,192² to 32,768² resolution-matched shadow maps
11.17 Performance tuning results for resolution-matched shadow maps
12.1 Graph of circle of confusion
12.2 One-level heat-diffusion depth-of-field images
12.3 Problems with 1-layer and 2-layer depth-of-field solutions
12.4 Circle of confusion visualization
12.5 Flag and fence depth-of-field results
12.6 Robot, three-layer depth-of-field results

List of Tables

3.1 The GPU memory model
6.1 Glift static performance results
8.1 Glift taxonomy of previous GPU data structures

Abstract

This thesis presents Glift, an abstraction and generic template library for parallel, random-access data structures on graphics processing units (GPUs). Glift simplifies the description of new and existing GPU data structures, stimulates development of complex GPU algorithms, and performs equivalently to hand-coded implementations. Modern GPUs are the first commodity, desktop parallel processors. Although designed for interactive rendering, researchers in the field of general-purpose computation on graphics processors (GPGPU) are showing that the power, ubiquity, and low cost of GPUs make them an attractive alternative high-performance computing platform. The primitive GPU programming model, however, greatly limits the ability of both graphics and GPGPU programmers to build complex applications that take full advantage of the hardware.

This dissertation demonstrates the effectiveness of Glift in three ways. First, we characterize a large body of previously published GPU data structures in terms of Glift abstractions and present novel GPU data structures. Second, we show that our example Glift data structures perform comparably to handwritten implementations but require only a fraction of the programming effort. Third, we implement four novel high-quality interactive rendering applications with complex data structure requirements: octree 3D paint, adaptive shadow maps, resolution-matched shadow maps, and a new depth-of-field algorithm.

Professor John D. Owens
Dissertation Committee Chair


Part I

Introduction

Chapter 1

Introduction

The desktop computer hardware and software industries are amidst a revolution. After more than

twenty years, the trend of ever-increasing serial processor clock speeds ended abruptly in 2003 [122].

Until this change, software performance increased “for free” as processor speeds increased. Future software performance improvements require rewriting software to take advantage of increasing

amounts of parallelism. However, most programmers are unfamiliar with parallel programming

paradigms, and the new architectures contain capabilities not captured by current programming

models. The desktop parallel computing revolution requires new programming models that are

efficient, modular, and familiar to serial programmers.

This thesis demonstrates an efficient data structure abstraction for one of these new, high-performance parallel architectures: graphics processing units (GPUs). Although designed for interactive rendering, modern GPUs are quickly evolving into general-purpose parallel processors with tens of processors and high-performance, parallel memory systems. GPUs can outperform current CPUs by more than ten times; however, the primitive GPU programming model greatly limits the ability of programmers to build complex applications that take full advantage of the hardware. This thesis presents Glift, a GPU data structure abstraction that simplifies the description of new and existing data structures, eases development of complex GPU algorithms, and performs equivalently to hand-written implementations. Glift also defines the GPU computation model in terms of parallel iteration over data structure elements.

We demonstrate the effectiveness of Glift in three ways. First, we characterize a large body of previously published GPU data structures in terms of Glift abstractions, and present novel GPU data structures built with Glift. Second, we show that our example Glift data structures perform comparably to hand-written implementations but require only a fraction of the programming effort. Third, we describe four novel high-quality interactive rendering algorithms with complex data structure and iteration requirements: octree 3D paint, adaptive shadow maps, resolution-matched shadow maps, and a new depth-of-field algorithm. The octree 3D paint and adaptive shadow map applications are novel GPU adaptations of existing algorithms, whereas the resolution-matched shadow and depth-of-field applications are entirely new rendering algorithms.

These rendering algorithms also demonstrate that the next generation of real-time rendering algorithms will combine complex data structures with an inseparable mix of traditional graphics and data-parallel programming. The remainder of this chapter gives an overview of the thesis and its organization.

1.1 Graphics Processor Background

Current graphics processors are models of a class of parallel computers described as concurrent-read, exclusive-write (CREW) Parallel Random-Access Machines (PRAM) [77]. They use a data-parallel programming model in which a computation pass executes a single program (also called a kernel or shader) on all elements of a data stream in parallel. In rendering applications, the stream elements are either vertices or pixel fragments [85]. For general computation, the stream elements represent the data set for the particular problem [18]. Users initiate a GPU computation pass by sending data and commands via APIs such as OpenGL or DirectX, and write computational kernels in a GPU shading language such as Cg, GLSL, or HLSL.
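As a concrete, CPU-side illustration of this kernel-over-stream model, the following C++ sketch applies a “kernel” independently to every element of a stream; the independence of the per-element work is what allows a GPU to run all instances in parallel. This is an explanatory sketch only, and the function names are illustrative rather than part of any GPU API.

    #include <cstddef>
    #include <vector>

    // A "kernel" runs once per stream element; it may read freely but
    // writes only its own output slot (the CREW restriction).
    static float brighten(float pixel) { return pixel * 1.2f; }

    // One computation pass: a GPU conceptually applies the kernel to all
    // elements at once; this serial loop is only the CPU analogy.
    std::vector<float> runPass(const std::vector<float>& stream) {
        std::vector<float> out(stream.size());
        for (std::size_t i = 0; i < stream.size(); ++i)
            out[i] = brighten(stream[i]);   // one GPU thread per element
        return out;
    }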

The GPU programming model has quickly evolved from assembly languages to high-level shading and stream processing languages [18, 70, 89, 91, 92, 107]. Writing complex GPU algorithms and data structures, however, continues to be much more difficult than writing an equivalent CPU program. With modern CPU programming languages such as C++, programmers manage complexity by separating algorithms from data structures, using generic libraries such as the Standard Template Library (STL), and encapsulating low-level details into reusable components. In contrast, GPU programmers have very few equivalent abstraction tools available to them. As a result, GPU programs are often an inseparable tangle of algorithms and data structure accesses, are application-specific instead of generic, and rarely reuse existing code.

Chapter 2 gives an in-depth overview of current and near-term GPU architecture, paying particular attention to the memory system. The chapter also discusses related CPU and GPU programming systems.

1.2 Glift

The core contribution of this dissertation is Glift, an abstraction and generic C++ template library implementation that enables programmers to easily create, access, and traverse parallel, random-access GPU data structures. Like modern CPU data structure libraries such as the Standard Template Library (STL), Glift enables GPU programmers to separate algorithms from data structure definitions, thereby greatly simplifying algorithmic development and enabling reusable and interchangeable data structures.

The design goals of the abstraction and implementation include:

Simplifying the creation and use of random-access GPU data structures;

Creating a minimal abstraction of the GPU memory model;

Separating GPU data structures and algorithms;

Performing as efficiently as hand-coding;

Being easily extensible and incrementally adoptable.

Glift achieves these goals through its careful choice of abstraction level, generalization of STL’s iterator concepts, encapsulation of optimization strategies, and policy-based C++ template software design. Glift’s abstraction level is similar to the STL and is lower-level than GPU stream programming systems such as Brook [18], Scout [92], or Sh [91]. These systems also abstract GPU memory but provide only a small number of hard-coded data structure primitives and require users to adopt their entire system. In contrast, Glift supports a wide range of user-definable data structures and integrates into existing OpenGL/Cg/C++ GPU programming environments. This makes it easily adoptable for either graphics or general-purpose computation (GPGPU) programmers.

In order to separate GPU data structures into independent, composable components and simplify the explanation of complex structures, Glift factors GPU data structures into five core components: virtual memory, physical memory, address translator, iterators, and container adaptors (Figure 3.1).

The motivation for this factoring and descriptions of the components are explained in Chapter 3.
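To make the factoring concrete before the full discussion in Chapter 3, the following C++ sketch shows one plausible way the components compose; the type and member names here are illustrative stand-ins, not the actual Glift API.

    // Hypothetical component names, for illustration only.
    struct PhysMem2D   { /* a 2D texture holding the actual texels  */ };
    struct PageTable3D { /* maps virtual 3D pages to physical tiles */ };

    // A virtual container layers an address translator over physical
    // memory; iterators (not shown) traverse its virtual elements.
    template <class AddrTrans, class PhysMem>
    struct VirtMem {
        AddrTrans translator;  // virtual-to-physical address mapping
        PhysMem   physical;    // flat GPU texture memory
    };

    // Container adaptors build higher-level structures from containers,
    // e.g., a sparse volume from a page table plus a 2D texture.
    typedef VirtMem<PageTable3D, PhysMem2D> SparseVolume;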

1.2.1 Example

The following example Glift C++ code shows the declaration of the octree structure described in Chapter 10. To the best of our knowledge, this is the first GPU-based octree structure¹. The octree is generic and reusable for various value types and is built from reusable Glift components. Note that the syntax is similar to CPU-based C++ template libraries, and the intent of the code is clear. If not implemented with the Glift template library, the intent would be obscured by numerous low-level GPU memory operations, the code would be distributed between CPU and GPU code bases, and the data structure would be hard-coded for a specific application. This example allocates an octree that stores RGBA color values and has an effective maximum resolution of 2048³.

    typedef OctreeGpu OctreeType;                 // 3D addrs, RGBA values
    OctreeType octree( vec3i(2048, 2048, 2048) ); // effective size 2048^3

¹In parallel with our work, Lefebvre et al. performed complementary research on GPU-based octrees [79].

A 3D model can easily be textured by paint stored in the octree using the following Cg fragment program. Glift makes it possible for the program to be reused with any 3D structure, be it an octree, native 3D texture, or sparse volume. If implemented without Glift, this example would be hard-coded for the specific octree implementation, contain many parameters, and obscure the simple intent of reading a color from a 3D spatial data structure.

    float4 main( uniform VMem3D paint, float3 texCoord ) : COLOR
    {
        return paint.vTex3D( texCoord );
    }

By encapsulating the implementation of the entire octree structure, Glift enables portable shaders and efficient, hardware-specific implementations of data structures on current and future GPUs.

Part II of this thesis describes the design, implementation, and analysis of Glift. The contributions of Glift include:

The development of programmable address translation as a simple, powerful abstraction for composable, virtualized GPU data structures (see the sketch following this list);

The design, implementation, and analysis of the Glift template library to implement the abstraction;

The clarification of the GPU execution model in terms of data structure iterators.
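To illustrate the first contribution: programmable address translation amounts to making the mapping from a structure’s virtual domain to physical texture memory an ordinary, user-supplied function. The CPU-side sketch below shows one classic example of such a translator, flattening a 3D virtual address into a 2D texture of tiled slices; the names are illustrative, not the Glift API.

    struct vec3i { int x, y, z; };
    struct vec2i { int u, v; };

    // A user-programmable address translator: flattens a virtual 3D
    // address into a physical 2D texel address by laying out the
    // volume's Z slices as a grid of tiles on a 2D texture.
    struct Flat3DTo2D {
        int sizeX, sizeY, tilesPerRow;   // layout parameters
        vec2i translate(vec3i va) const {
            vec2i pa;
            pa.u = (va.z % tilesPerRow) * sizeX + va.x;
            pa.v = (va.z / tilesPerRow) * sizeY + va.y;
            return pa;
        }
    };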

1.3 GPU Data Structures

Part III of this thesis demonstrates Glift’s ability to express a large number of random-access GPU data structures in two ways. First, Chapter 8 presents an extensive taxonomy of previous GPU data structures in terms of Glift primitives and address translator characteristics. Second, Chapter 9 describes three Glift data structure implementations and uses in detail: a multidimensional array, an n-stack, and a dynamic adaptive array used to build quadtrees and octrees. The stack and dynamic adaptive array have not previously been demonstrated on GPUs.

1.4 Applications

The last section of this thesis demonstrates that a data structure abstraction such as Glift helps enable the creation of complex GPU algorithms. Without Glift, the complexity of GPU applications is limited by the requirement that programmers write algorithms in terms of the physical data layout in texture memory. Glift enables programmers to specify the data layout once, then write algorithms in terms of the virtual domain defined by their data structures. We demonstrate the use

of Glift to define four novel, complex interactive rendering applications: octree 3D paint, adaptive

shadow maps, resolution-matched shadow maps, and a new depth-of-field algorithm.

We describe the data structures for all four of these applications in terms of simple, reusable Glift components, and demonstrate the efficiency of Glift through application performance. The applications also demonstrate that higher-level programming abstractions such as Glift are beneficial not only for general-purpose GPU programming (GPGPU), but also for the next generation of interactive graphics algorithms. The rendering algorithms presented in this thesis use an inseparable mix of traditional graphics and data-parallel (GPGPU) programming.

Octree 3D paint and adaptive shadow maps are widely reported to be rigorous solutions to their respective problems of texture and shadow mapping, but their data structure complexity has previously made them impractical to implement on the GPU [22, 24, 90, 117]. This thesis describes interactive

GPU implementations of both of these algorithms using Glift data structures. Resolution-matched

shadow mapping is a novel shadow algorithm that uses the same shadow data structure as adaptive

shadow maps, yet results in higher-quality shadows and performs significantly better for dynamic

scenes. The depth-of-field chapter describes a new algorithm for approximating the focusing effects

of a real camera lens. The algorithm solves the depth-of-field problem via a heat diffusion model

and infinite impulse response (IIR) recursive filters. To make this possible at interactive rates, we

introduce a GPU-based, highly parallel tridiagonal solver that uses Glift data structures to define

arrays of tridiagonal matrices.

1.4.1 Octree 3D Paint

Interactive painting of complex or unparameterized surfaces is an important problem in the digital

film community. Many models used in production environments are either difficult to parameterize

or are unparameterized implicit surfaces.

Figure 1.1: Our interactive 3D paint application stores paint in a GPU-based octree-like data structure built with the Glift template library. These images show an 817k polygon model with paint stored in an octree with an effective resolution of 2048³ (using 15 MB of GPU memory, quadlinear filtered).

Texture atlases offer a partial solution to the problem [22] but cannot be easily applied to implicit surfaces. Octree textures [7, 34] offer a more general solution by using the model’s 3D coordinates as a texture parameterization. Christensen and Batali [28]

recently refined the octree texture concept by storing pages of voxels (rather than individual voxels)

at the leaves of the octree. While this texture format is now natively supported in Pixar’s Photorealistic RenderMan renderer, unfortunately, the lack of GPU support for this texture format has made authoring octree textures very difficult.

We implement a sparse and adaptive 3D painting application that stores paint in an octree-like Glift

data structure. The data structure is a 3D version of the structure described in Section 9.3 that

supports quadlinear (mipmap) filtering. We demonstrate interactive painting of an 817k polygon

model with effective paint resolutions varying between 64³ and 2048³ (see Figure 1.1).

Chapter 10 describes the octree 3D paint application. The contributions include:

Demonstrating that octree textures are possible at interactive frame rates on current GPUs with effective resolutions up to 2048³;

Supporting quadlinear, mipmap filtering on GPU-based octree texture;

Encapsulating octree texture as a Glift data structure.

Figure 1.2: This adaptive shadow map uses a GPU-based adaptive data structure built with the Glift template library. It has a maximum effective shadow map resolution of 131,072² (using 37 MB of GPU memory, trilinearly filtered). The top-right inset shows the ASM and the bottom-right inset shows a 2048² standard shadow map.

1.4.2 Adaptive Shadow Maps

Shadow maps, which are depth images rendered from the light position, offer an attractive solution to real-time shadowing because of their simplicity. Their use is plagued, however, by the problems of projective aliasing, perspective aliasing, and false self-shadowing [117, 119, 132]. Adaptive shadow maps [41] (ASMs) offer an attractive solution to projective and perspective shadow map aliasing while maintaining the simplicity of a purely image-based technique. However, the complexity of the ASM data structure, a quadtree of small shadow maps, has prevented full GPU-based implementations that support dynamic scenes until now.

Section 11.3 presents a novel implementation of adaptive shadow maps (ASMs) that performs all

shadow lookups and scene analysis on the GPU, enabling interactive rendering with ASMs while

moving both the light and camera. We support shadow map effective resolutions up to 131,072² and,

unlike previous implementations, provide smooth transitions between resolution levels by trilinearly

filtering the shadow lookups (see Figure 1.2). Our ASM data structure is an instantiation of the

general AdaptiveMem Glift structure defined in Section 9.3.

The adaptive shadow map algorithm has many complex data structure requirements. The scene analysis, quadtree node allocation, and writes to quadtree nodes exercise nearly all of Glift’s functionality. In fact, the only GPU data structure operation not currently exercised by the ASM application is generating data structure iterators on the GPU.

Figure 1.3: Comparison of shadows for a scene (left) with 4,000 hairs, each consisting of 12 line segments. Resolution-matched shadow maps (second from left) perform at interactive rates for dynamic scenes. Adaptive shadow maps (second from right) perform interactively for static scenes, but can miss shadow edges (note the error in the lower-left corner) and perform poorly for dynamic scenes. Standard shadow maps (right) perform well, but suffer from perspective and projective aliasing problems due to their fixed, low resolution.

1.4.3 Resolution-Matched Shadow Maps

Section 11.4 introduces an image-based shadow algorithm, resolution-matched shadow maps (RMSM),

that generates properly sampled hard shadows at interactive rates on current graphics hardware. Recent shadow literature has shown that rendering shadow samples at the exact locations required by the current view’s shadow coordinates produces alias-free shadows. This chapter introduces an algorithm that closely approximates this goal via a modified adaptive shadow map algorithm. While adaptive shadow maps offer an attractive solution to the projective and perspective aliasing problems of shadow maps, their practical use is plagued by an iterative refinement algorithm that produces unpredictable performance and is not guaranteed to converge to a correct solution.

We introduce a single-step, non-iterative technique for building resolution-matched, adaptive shadow maps that is guaranteed to be correct for the current camera position. We implement the technique using data-parallel algorithmic primitives—scan, gather, and sort—and describe a novel scan implementation that is up to 4 times faster than previous implementations. For the scenes described in this chapter, resolution-matched shadow maps are 2–5 times faster than the GPU-based adaptive shadow maps described in Section 11.3 and 1–4 times slower than standard shadow maps. We achieve 20–70 frames per second on static scenes and 12–30 frames per second on dynamic scenes for a 512² image and a maximum effective shadow resolution of 32,768² texels. Figure 1.3 shows a comparison of RMSMs, ASMs, and standard shadow maps.
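For readers unfamiliar with scan, the primitive named above is the all-prefix-sums operation; a serial reference version is sketched below for exposition. The GPU implementation in Section 11.4.2 computes the same result in parallel, so this code is a definition of the operation, not the dissertation’s implementation.

    #include <cstddef>
    #include <vector>

    // Exclusive scan (all-prefix-sums): out[i] = in[0] + ... + in[i-1].
    // Scan is the building block used to compact per-pixel shadow-page
    // requests into a dense list of pages to render.
    std::vector<int> exclusiveScan(const std::vector<int>& in) {
        std::vector<int> out(in.size());
        int sum = 0;
        for (std::size_t i = 0; i < in.size(); ++i) {
            out[i] = sum;       // sum of all earlier elements
            sum += in[i];
        }
        return out;
    }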

The development of resolution-matched shadow maps demonstrates the benefit of separating data structures and algorithms. After creating the Glift data structure for adaptive shadow maps, we were able to focus entirely on improving the shadow algorithm in order to address ASM’s shortcomings.

RMSMs represent a complex, new algorithm developed using a pre-existing Glift data structure.

1.4.4 Depth Of Field

The simulation of the depth-of-field lens effect is an important component of realistic imagery [106]. Standard computer graphics images simulate a pinhole camera, where features at all depths are in focus. In contrast, images from real cameras have a limited focal range that depends on the lens aperture, focal distance, and film size. Objects outside of this focal range appear blurry. High-quality software renderers simulate depth of field by rendering image samples from positions distributed across the area of a lens [30]. GPU-based, interactive renderers usually approximate depth-of-field effects as an image-based post-process [35]. The existing post-processing methods suffer from either objectionable artifacts or slow performance.

Chapter 12 presents a new interactive depth-of-field algorithm that addresses many of the shortcomings of existing post-processing solutions. The algorithm uses a heat diffusion model and variable-width, recursive filters to achieve an interactive depth-of-field solution whose cost is a function only of image size (see Figure 1.4). The key implementation challenge and contribution of the work is

the description of a GPU-compatible, direct tridiagonal linear solver, thereby enabling recursive filters as an algorithmic primitive for interactive rendering. While previous, CPU-based, data-parallel tridiagonal solvers exist, this chapter presents the refactoring required to express the algorithm using GPU iterators. Glift’s iterator abstraction helped significantly in this refactoring.
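As context for the solver mentioned above: each scanline of the diffusion problem reduces to a tridiagonal linear system, whose standard serial solution is the Thomas algorithm, sketched below. The contribution of Chapter 12 is a data-parallel reformulation of this kind of solve; the serial code here is only the baseline being parallelized.

    #include <cstddef>
    #include <vector>

    // Thomas algorithm: solve a tridiagonal system with sub-diagonal a,
    // diagonal b, super-diagonal c, and right-hand side d (length n).
    // O(n) work, but inherently serial in its dependency chain.
    std::vector<double> solveTridiagonal(std::vector<double> a,
                                         std::vector<double> b,
                                         std::vector<double> c,
                                         std::vector<double> d) {
        const std::size_t n = b.size();
        for (std::size_t i = 1; i < n; ++i) {     // forward elimination
            double m = a[i] / b[i - 1];
            b[i] -= m * c[i - 1];
            d[i] -= m * d[i - 1];
        }
        std::vector<double> x(n);
        x[n - 1] = d[n - 1] / b[n - 1];
        for (std::size_t i = n - 1; i-- > 0; )    // back substitution
            x[i] = (d[i] - c[i] * x[i + 1]) / b[i];
        return x;
    }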

Like resolution-matched shadow maps, this application is another example of a novel interactive

rendering algorithm whose implementation requires complex GPU data structures. It also provides

an additional example of a high-quality interactive rendering solution that is an inseparable mix of

general-purpose, data-parallel, and traditional graphics programming.

Figure 1.4: Chapter 12 introduces a new interactive rendering solution for depth of field. The original image (left) is blurred via a heat diffusion model and variable-width recursive filter to produce the final image (right). The algorithm uses Glift data structures to implement a highly parallel, GPU-based tridiagonal linear solver, running at 25 frames per second on a 512×512 image. The computational complexity of the algorithm is based only on image size and is independent of the blur size.

1.5 Publications

This dissertation encompasses six separate publications with five other authors: Michael Kass from

Pixar Animation Studios, Joe Kniss from the University of Utah, John Owens from the University of California Davis, Shubhabrata Sengupta from the University of California Davis, and Robert

Strzodka from Stanford University. The complete list of publications includes:

Glift: Generic, Efficient Random-Access GPU Data Structures, Aaron E. Lefohn, Joe Kniss, Robert Strzodka, Shubhabrata Sengupta, John D. Owens, ACM Transactions on Graphics, 25(1), pp. 60–99, Jan. 2006 (accepted to ACM SIGGRAPH 2005 with major revisions) [82],

Implementing Efficient Parallel Data Structures on GPUs, Aaron E. Lefohn, Joe Kniss, John D. Owens, in GPU Gems 2, Addison Wesley, pp. 521–545, 2005 [80],

Resolution Matched Shadow Maps, Aaron E. Lefohn, Shubhabrata Sengupta, John D. Owens, in preparation,

Interactive Depth of Field Using Simulated Diffusion on a GPU, Michael Kass, Aaron E. Lefohn, John D. Owens, in preparation,

Dynamic Adaptive Shadow Maps on Graphics Hardware, Aaron E. Lefohn, Shubhabrata Sengupta, Joe Kniss, Robert Strzodka, John D. Owens, Technical Sketch at ACM SIGGRAPH 2005 [81], and

Octree Textures on Graphics Hardware, Joe Kniss, Aaron E. Lefohn, Robert Strzodka, Shubhabrata Sengupta, John D. Owens, Technical Sketch at ACM SIGGRAPH 2005 [73].

Chapter 2

Background

2.1 Graphics Processor Architecture

This section gives an overview of graphics processing unit (GPU) architecture and relates it to multicore parallel CPUs and the IBM Cell processor. We also give an overview of parallel machine models and describe the architectures in terms of these more general models.

Parallel processors have existed for over forty years in high-end supercomputers, but, until recently, commodity computing has continued to use serial processors and serial programming models. However, in the last three years, power consumption and heat generation problems have prevented single-processor clock speeds from increasing further [122]. In order to achieve performance increases, CPU manufacturers are introducing parallel, multicore processors into commodity laptop and desktop computers.

In contrast, GPUs are a fundamentally parallel architecture that, instead of evolving from a serial model like CPUs, are evolving from fixed-function graphics engines into user-programmable general-purpose processors. However, this evolution is far from complete, and GPUs are missing many basic features that CPU programmers expect, such as virtual memory, efficient conditional execution, and relative addressing of registers. This section describes the high-level architecture and the low-level capabilities and limitations of current GPUs.

Kuck et al. [77] describe a taxonomy of parallel random-access machine models as well as the

limitations that each model places on algorithm design1. We briefly summarize this taxonomy and

use the terminology to categorize current commodity parallel processors, including GPUs, the IBM

Cell processor, and multicore CPUs.

We condense the four machine models presented by Kuck et al. into three classes of Parallel

Random-Access Machine (PRAM) models. The models are defined based on the memory access model and consist of:

Exclusive Read Exclusive Write (EREW). This model prohibits multiple processors from simultaneously accessing the same memory address for both read and write accesses.

Concurrent Read Exclusive Write (CREW). This model allows any set of processors to concurrently read the same memory address, but allows only exclusive, single-processor

memory writes.

Concurrent Read Concurrent Write (CRCW). This model permits concurrent reads and writes to memory. The resolution of concurrent writes are further classified into four cate-

gories:

– Common: All concurrent writes must be the same value for the value to be committed

to memory.

– Arbitrary: The write value is arbitrarily selected from the processors concurrently writing to the same memory address.

– Priority: The value from the processor with the highest priority is stored in memory.

Processor priority can be statically or dynamically determined.

– Combining: The final value is composited from all values via an associative and commutative operator such as addition, multiplication, or maximum (a short code sketch of this resolution rule follows below).

¹For the discussions in this dissertation, we define a parallel processor to be a single chip containing multiple programmable processing units.

Figure 2.1: The modern graphics hardware pipeline. The vertex and fragment processor stages are programmable by the user via a data-parallel programming model. Users specify a set of vertices or fragments to be processed by a small program called a kernel or shader. The GPU executes the kernel on all data elements in parallel. Data are stored in vertex buffers, textures, and framebuffers.
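To make the write-resolution categories above concrete, the C++ sketch below shows how a Combining-CRCW memory could resolve simultaneous writes to one address. This is an illustration of the machine model only, not of any real GPU feature.

    #include <functional>
    #include <vector>

    // Combining resolution: all values concurrently written to one
    // address are folded with an associative, commutative operator.
    int resolveCombining(const std::vector<int>& pendingWrites,
                         const std::function<int(int, int)>& op,
                         int identity) {
        int result = identity;
        for (int v : pendingWrites)
            result = op(result, v);
        return result;
    }
    // e.g., resolveCombining(writes, std::plus<int>(), 0) sums the writes;
    // Common, Arbitrary, and Priority would instead check equality, pick
    // any one value, or pick the highest-priority processor's value.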

2.1.2 Graphics Processors

Graphics processors (GPUs) have evolved from hardware implementations of fixed-function OpenGL/DirectX graphics pipelines to programmable parallel processors. Modern GPUs contain arrays of vertex processors and fragment processors operating in parallel. These processors use a data-parallel execution model and have memory systems designed to hide memory access latency via simultaneous execution of many threads. This section gives a brief overview of the architecture, memory systems, and programming models for current and near-term GPUs. Kilgariff et al. and Owens et al. provide complete overviews of current GPU architecture [72, 99].

The programming model exposed to users of current graphics processors is the Concurrent-Read, Exclusive-Write (CREW) Parallel Random-Access Machine (PRAM) model. Computational kernels may simultaneously read from any memory location, but only write to a single, pre-determined location. This limitation enables the GPU to process all data elements independently and in parallel, but excludes algorithms that require scatter operations. Scatter operations are operations that write to computed memory addresses, such as a[i] = v. GPU programmers have devised methods for emulating scatter on GPUs, but the operation remains inefficient and is not supported natively [16].
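In code form, the distinction above is between gather, which CREW permits, and scatter, which it excludes. The C++ sketch below shows both access patterns; it is illustrative only.

    #include <cstddef>
    #include <vector>

    // Gather (CREW-legal): each element reads from a computed address
    // but writes only to its own, pre-determined output location.
    void gather(const std::vector<float>& in, const std::vector<int>& idx,
                std::vector<float>& out) {
        for (std::size_t i = 0; i < out.size(); ++i)
            out[i] = in[idx[i]];           // out[i] = in[f(i)]
    }

    // Scatter (excluded by CREW): writes go to computed addresses, so
    // two elements may collide at the same location.
    void scatter(const std::vector<float>& in, const std::vector<int>& idx,
                 std::vector<float>& out) {
        for (std::size_t i = 0; i < in.size(); ++i)
            out[idx[i]] = in[i];           // out[f(i)] = in[i], i.e., a[i] = v
    }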

The GPU programming model conceals the number of actual parallel processors from the user. The programmer specifies only a kernel and a data stream over which the kernel is to be executed. The GPU then maps this data onto the available processors to compute the result. The user cannot query which processor was used, nor is the user allowed to perform explicit inter-processor communication.

An advantage of this model is programming simplicity and hardware manufacturer flexibility to change the number of processors without impacting existing programs. A disadvantage of this model is that it is often helpful to know the approximate number of processors when choosing an appropriate parallel algorithm. For example, Section 11.4.2 describes a GPU parallel scan [12] algorithm that outperforms previous implementations by 3–4 times. The speedup is obtained by dynamically selecting an appropriate parallel algorithm depending on the data set size and the number of GPU processors.

Current GPUs have Multiple-Instruction, Multiple-Data (MIMD) vertex processing units and Single-Program, Multiple-Data (SPMD) fragment processing units. GPUs execute batches of fragment threads in Single-Instruction, Multiple-Data (SIMD) execution style (one thread per pixel). GPUs support branching in the fragment shader by changing instructions only per batch of fragment threads. The size of a batch of threads varies from tens to hundreds depending on the architecture. Branch performance is highly dependent on the coherency within a batch. If all fragments take the same branch, performance remains high. If both branches are needed by the batch, all fragments in the batch will execute both code paths.
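The branching cost model just described can be stated simply: a batch pays for every code path that any of its fragments takes. The toy C++ model below captures that accounting for a single batch; it is a sketch of the behavior, not vendor code.

    #include <vector>

    // Toy SIMD-batch model: if all fragments agree on a branch, the batch
    // executes one path; if they diverge, it executes both paths.
    int instructionsForBatch(const std::vector<bool>& takesBranch,
                             int thenCost, int elseCost) {
        bool anyThen = false, anyElse = false;
        for (bool t : takesBranch) {
            if (t) anyThen = true;
            else   anyElse = true;
        }
        int cost = 0;
        if (anyThen) cost += thenCost;   // some fragment needs 'then'
        if (anyElse) cost += elseCost;   // some fragment needs 'else'
        return cost;                     // divergent batches pay for both
    }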

The GPU’s memory system creates a branch in a modern computer’s memory hierarchy. The GPU,

just like a CPU, has its own caches and registers to accelerate data access during computation.

GPUs, however, also have their own main memory with its own address space—a fact that means

programmers must explicitly copy data from CPU to GPU memory via the PCI-Express [100] bus

before beginning program execution. This transfer is often a bottleneck in GPU applications, and

the separate address space is one of the key challenges in creating a GPU memory abstraction.

Chapter 3 discusses this topic in more detail with respect to the development of Glift.

The following is an overview of the memory model that is exposed to GPU kernel/shader programmers²:

No CPU main memory access; no disk access.

No GPU stack or heap.

Random reads from global texture memory.

Reads from constant registers.

– Vertex programs can use relative indexing of constant registers.

Reads/writes to temporary registers.

– Registers are local to the stream element being processed.

– No relative indexing of registers.

Streaming reads from stream input registers.

– Vertex kernels read from vertex streams.

– Fragment kernels read from interpolant streams (rasterizer results).

Streaming writes (at end of kernel only).

– Write location is fixed by the position of the element in the stream. Cannot write to computed address (that is, no scatter).

– Vertex kernels write to vertex output streams. Can write up to 12 four-component floating-point values.

– Fragment kernels write to framebuffer streams. Can write up to 4 four-component floating-point values.

²The rules in this list are for vertex and fragment kernels on GPUs that support Pixel Shader 3.0 and Vertex Shader 3.0 functionality [93, 94].

2.1.3 IBM Cell Processor

Like the GPU, the IBM Cell processor is a desktop, commodity, parallel processor. The Sony

PlayStation 3 game console contains both an IBM Cell processor and an NVIDIA GPU. It is possible that the Cell will appear in future desktop or server computers and other devices. The Cell is a very flexible parallel processor and models the most general PRAM model of Concurrent-Read, Arbitrary Concurrent-Write (Arbitrary CRCW), although more restrictive models can be implemented through software configurations.

Cell processors consist of nine processors with fast interconnects between them [104]. One processor is a simplified IBM PowerPC core with a 512 kB L2 cache and in-order instruction execution. The other eight processors are identical Synergistic Processing Elements (SPEs). An SPE is an in-order processor with an explicit memory system. An SPE does not have a cache, but instead has a 256 kB memory called a local store and 128 128-bit registers. The local store must hold all instructions and data used by the SPE. The local store is essentially a software-managed cache. SPEs can load data into the local store from main memory, other SPEs’ local stores, or the L2 cache on the

PowerPC core. SPEs can hide memory latency by directly issuing asynchronous Direct Memory

Access (DMA) requests and maintaining many outstanding requests.
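The latency-hiding pattern described above is commonly written as double buffering: asynchronously fetch the next block into the local store while computing on the current one. The pseudo-C++ sketch below assumes hypothetical dmaGetAsync/dmaWait stand-ins for the Cell’s asynchronous DMA intrinsics; it shows the structure of the technique, not the real SDK calls.

    const int BLOCK = 4096;                  // sized to fit the local store

    // Hypothetical stand-ins for asynchronous DMA intrinsics.
    void dmaGetAsync(float* dst, const float* src, int n, int tag);
    void dmaWait(int tag);
    void compute(float* data, int n);

    void processStream(const float* mainMem, int nBlocks) {
        float bufA[BLOCK], bufB[BLOCK];
        float* cur = bufA;
        float* next = bufB;
        dmaGetAsync(cur, mainMem, BLOCK, 0);              // prefetch block 0
        for (int b = 0; b < nBlocks; ++b) {
            dmaWait(b % 2);                               // current block ready
            if (b + 1 < nBlocks)                          // overlap next fetch
                dmaGetAsync(next, mainMem + (b + 1) * BLOCK, BLOCK, (b + 1) % 2);
            compute(cur, BLOCK);                          // work on current block
            float* t = cur; cur = next; next = t;         // swap buffers
        }
    }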

The flexibility of the Cell processor makes it a challenging architecture to program. It is possible that

software layers will appear for the Cell that implement one or more of the more restricted parallel machine models in order to simplify writing efficient programs. The explicit memory hierarchy of the Cell, however, does open possibilities of new programming paradigms such as the one described in Fatahalian et al. [39].

2.1.4 Multicore CPUs

All of the major CPU vendors are now shipping dual-core processors. The cores are similar to their single-core ancestors, each with their own cache hierarchy and out-of-order execution units.

The expected programming model for these processors is to either run multiple serial programs or a single program that uses a small number of threads (one to two threads per processor). Unlike GPUs, multicore CPUs are not designed to hide memory access latency by low-level support for many hardware-managed threads. They can, however, run the same programs that run on single-processor machines. How the programming model and architectures will scale to many more processor cores is currently unknown and an area of active industry and academic research.

2.1.5 Looking Ahead: The Convergence

GPU architectures are evolving quickly, with major new features being added every 1.5–2 years.

Will GPUs evolve to a less restricted machine model than CREW? The most likely would be a Combine-CRCW with a programmable combination unit. If, however, arbitrary writes are permitted, what use will they be to programmers? Multicore CPUs are also evolving quickly, with quad-core CPUs coming within a year. Will multicore CPUs continue to use traditional, independent processors or will they use smaller, simplified compute cores, like GPUs, that are more tightly coupled together?

Graphics processors, IBM’s Cell processor, and AMD multicore CPUs represent the first time in history that parallel processors have been deployed to the commodity market. As GPU architectures mature and multicore CPUs add additional parallelism, the capabilities of the processors will quickly overlap. At the same time, the best programming model for these future processors is largely unknown and an area of active industry and academic research. Glift presents one contribution to this larger effort.

2.2 Programming Graphics Hardware

The instruction set of the microprocessor has been largely static over the past two decades, and its scalar programming model is well known to an entire generation of programmers. In contrast, the

GPU’s graphics-centric, data-parallel programming model is unfamiliar and complex for most of its users. The rapid advances in GPU architectures and features, while delivering higher performance and new capabilities with each generation, also contribute to the difficulty of GPU programming.

Before GPUs added programmable vertex and fragment stages, they were programmed via fixed-function graphics API calls from OpenGL or DirectX. Peercy et al. [102] showed that even fixed-function OpenGL could be treated as a general SIMD compute engine capable of executing one

SIMD instruction per render pass. The initial user-programmable GPUs were programmed with rudimentary assembly constructs such as register combiners or OpenGL API calls for each assembly instruction. The second generation of programmable GPUs supported textual assembly programs [97], but no commercial high-level language compilers were available. At the same time, academic researchers began exploring high-level languages for writing GPU programs. Proudfoot et al. allowed programmers to express shading computations using a single high-level “Real-Time Shading Language” [107]. RTSL spurred the development of industry-standard languages like Cg [89], the DirectX High-Level Shading Language (HLSL), and the OpenGL Shading Language [70] (GLSL).

Time Shading Language” [107]. RTSL spurred the development of industry-standard languages like Cg [89], the DirectX High-Level Shading Language (HLSL), and the OpenGL Shading Lan- guage [70] (GLSL).

Shading languages are designed primarily for shading calculations. The languages only permit de- scription of GPU computational kernels and do not encapsulate memory management or the execu- tion of GPU programs. To address this, Buck et al. , McCool et al. , and McCormick et al. developed abstractions for expressing more complex and general-purpose applications while hiding the com- plexity of the underlying implementation [18,91,92]. The Brook programming environment, in the authors’ words, “abstracts the GPU as a streaming processor.”

The combination of high-level shading languages and the stream programming model results in an effective abstraction for the problem of expressing computation on GPUs. However, no compara- ble abstraction effectively describes the complex data structures with which these algorithms must 22 interact. Although it is possible to build complex data structures atop the native multidimensional arrays provided by Brook, Sh, and Scout, the mix of CPU and GPU programming required by these languages makes it very difficult to create abstract data structures. The result is that GPU programs are an inseparable tangle of data structure access and algorithm details. The resulting lack of encap- sulation adds significant complexity to GPU programs, makes it nearly impossible to reuse or share code, and inhibits innovation. Glift attempts to solve this problem by providing a data structure abstraction that spans the entire GPU memory model. Section 3 describes both the CPU and GPU portions of the GPU memory model and how Glift abstracts them.

2.3 CPU Data-Parallel Libraries and Languages

Glift shares many similarities with generic and data-parallel data structure libraries designed for

CPUs. In fact, one of the goals of Glift is to show that ideas from CPU programming abstrac- tions can apply to GPU programming. This section gives a brief overview of generic programming concepts as well as data-parallel research efforts and their relationship to this dissertation.

Glift is a generic data structure library for graphics hardware. Generic programming uses static polymorphism to create abstract and efficient software components [60]. In general, polymorphism allows programmers to define a single function or class that can be used for a wide range of pa- rameter types. Static polymorphism restricts the definition to polymorphism in which all parameter types are resolved at compile time. Generic programming seeks to provide independent, compos- able software modules that result in code nearly as efficient as hand coding. In other words, generic programming seeks to use static polymorphism to create abstract software components without pay- ing a performance penalty for the abstraction. Generic programming replaces costly object-oriented programming constructs such as abstract base classes and virtual functions with concepts such as policies and traits. The most well-known generic programming library is the C++ Standard Tem- plate Library (STL); however, a number of other generic libraries exist.

The STL introduction [118], Alexandrescu [3], and Duret-Lutz et al. [36] provide detailed introductions to generic programming and generic implementations of design patterns [45]. Of the many generic design patterns, one of the most important and ubiquitous is the iterator concept. Iterators abstract the details of traversing aggregates of data and permit clean separation of data structures and algorithms. Glift introduces iterators to GPU programming and describes how other GPU programming models can be described in terms of this more general model.
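For readers unfamiliar with the iterator concept, the following standard C++ fragment shows the separation in miniature: one algorithm (std::accumulate) runs unchanged over two containers with very different internal layouts. This is ordinary STL usage, not Glift code:

    #include <cstdio>
    #include <list>
    #include <numeric>
    #include <vector>

    int main() {
        std::vector<int> v;
        std::list<int>   l;
        for (int i = 1; i <= 4; ++i) { v.push_back(i); l.push_back(i); }

        // The same algorithm traverses both containers because each
        // exposes the iterator interface; the algorithm never sees
        // the containers' internal memory layout.
        int sv = std::accumulate(v.begin(), v.end(), 0);
        int sl = std::accumulate(l.begin(), l.end(), 0);
        printf("%d %d\n", sv, sl);  // prints "10 10"
        return 0;
    }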

Veldhuizen et al. [129] describe how generic programming can be used to implement modular scientific computing components that offer nearly the same performance as raw C or Fortran code. They describe the idea of active libraries that participate in code generation rather than relying entirely on compiler optimizations. These generic libraries use C++ templates to implement generative programming to create efficient code before compiler optimizations are applied. They demonstrate that C++ numeric libraries such as Blitz++ can achieve performance comparable to carefully hand-tuned Fortran [127, 128]. Glift demonstrates that these same constructs are an effective tool for GPU programming, where performance is at least as critical as in CPU-based scientific programming.
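A toy example of the generative style these libraries rely on is compile-time loop unrolling via template recursion; this Dot metaprogram is illustrative only and is far simpler than the expression templates used by Blitz++:

    #include <cstdio>

    // The compiler expands Dot<4>::eval into straight-line code,
    // a[0]*b[0] + a[1]*b[1] + a[2]*b[2] + a[3]*b[3], with no loop.
    template <int N>
    struct Dot {
        static float eval(const float* a, const float* b) {
            return a[N - 1] * b[N - 1] + Dot<N - 1>::eval(a, b);
        }
    };

    template <>
    struct Dot<1> {  // base case terminates the template recursion
        static float eval(const float* a, const float* b) { return a[0] * b[0]; }
    };

    int main() {
        float a[4] = {1, 2, 3, 4}, b[4] = {5, 6, 7, 8};
        printf("%f\n", Dot<4>::eval(a, b));  // prints 70.0
        return 0;
    }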

A number of research efforts have focused on extending generic data structure and algorithm libraries to support parallelism. Many of these libraries support parallel iteration over both regular and irregular collections of data. The Standard Template Adaptive Parallel Library (STAPL) and the Parallel Standard Template Library (PSTL) [62] are parallel implementations of much of the C++ Standard Template Library (STL). An et al. [4] present STAPL as well as a thorough comparison of many parallel data structure libraries. Other generic parallel libraries include CHARM++ [64], CHAOS++ [25], NESL [13], and Parallel Object-Oriented Methods and Applications (POOMA) [61, 66]. Of these, Glift shares the most similarities with STAPL, but focuses on spatial data structures rather than the STL containers and targets GPU architectures rather than CPUs.

An alternate approach to expressing parallelism is to design a new language with built-in parallel computing primitives. Examples include data-parallel languages such as High-Performance Fortran (HPF) [88], Split-C [33], Titanium [134], Berkeley Unified Parallel C (UPC) [20], and ZPL [23]. Most of these languages implicitly define parallel iteration via expression of execution across a multidimensional range construct. Ranges, as defined in these works, are analogous to parallel iterators (including Glift’s) but are restricted to regular grid data structures.

Glift takes the library approach rather than the language approach, but shares the notion of parallel iteration over ranges with parallel languages. It defines data structures and execution of algorithms over the elements of those structures via parallel iterators. A significant advantage of the library approach is that it is more easily adoptable into existing programming systems. Glift takes advantage of this fact and integrates into an existing OpenGL and Cg development environment.

Part II

Glift: Generic GPU Data Structures

Chapter 3

An Abstraction for Random-Access GPU Data Structures

Glift is designed to simplify the creation and use of random-access (i.e., indexable) GPU data structures for both graphics and GPGPU programming. The abstraction combines common design patterns present in many existing GPU data structures with concepts from CPU-based data structure libraries such as the Standard Template Library (STL) and its parallel derivatives such as STAPL [4]. Unlike these CPU libraries, however, Glift is designed to express graphics and GPGPU data structures that are efficient on graphics hardware.

The design goals of the abstraction include:

Minimal abstraction of GPU memory model: The GPU memory model spans multiple processors and languages (CPU and GPU) and is inherently data-parallel. Glift makes it possible to create complex data structures that virtualize all operations defined by the GPU memory model with a small amount of code (see Table 3.1). Glift’s abstraction level is similar to CPU data structure libraries such as the STL and Boost [15]. It is higher-level than raw GPU memory, but low-level enough to easily integrate into existing graphics and GPGPU programming environments.

Separate GPU data structures and algorithms: Separation of data structures and algorithms is a proven complexity management strategy in CPU programming. Glift’s virtualization of the GPU memory model and support for generic data structure iterators bring this capability to GPU programming, thereby encouraging significantly more complex GPU applications.

Efficiency: GPU programming abstractions must not hinder performance. The Glift abstractions and implementation are designed to minimize abstraction penalty. The address translation and iterator abstractions generalize and encapsulate (not prevent) many existing optimization strategies.

Glift addresses challenges particular to GPU programming, including a cross-processor/cross-language memory model, heavily restricted memory access rules, graphics-centric programming systems, and graphics-centric applications. While inspiration is taken from previous parallel data structure libraries, the GPU architecture and programming environment present their own, unique set of challenges. One of the goals of Glift is to help bridge the gap between GPU programming and CPU parallel programming libraries such as STAPL.

3.1 The GPU Memory Model

Glift structures provide virtualized implementations of each operation supported by the GPU texture and vertex buffer memory model. Table 3.1 enumerates these operations and shows their syntax in OpenGL/Cg, Glift, Brook, and C. Note that the memory model contains both CPU and GPU interfaces. For example, users allocate GPU memory via a CPU-based API but read from textures in GPU-based shading code. This poses a significant challenge in creating a GPU data structure abstraction because the solution must span multiple processors and languages.

The CPU interface includes memory allocation and freeing, memory copies, binding for read or write, and mapping a GPU buffer to CPU memory. The GPU memory interface requires parameter declaration, random read access, stream read access, and stream write. Note that all copy operations are specified as parallel operations across contiguous multidimensional regions (e.g., users can copy a 16×16×16 data cube with one copy operation).

    Operation            | OpenGL/Cg               | Glift                  | Brook            | C
    ---------------------+-------------------------+------------------------+------------------+-------------------
    CPU Interface
    Allocate             | glTexImageND            | class constructor      | stream<>         | malloc
    Free                 | glDeleteTextures        | class destructor       | implicit         | free
    CPU → GPU transfer   | glTexSubImageND         | write                  | streamRead       | memcpy
    GPU → CPU transfer   | glGetTexSubImageND      | read                   | streamWrite      | memcpy
    GPU → GPU transfer   | glCopyTexSubImageND     | copy_from_framebuffer  | implicit         | memcpy
    Bind for GPU read    | glBindTextureND         | bind_for_read          | implicit         | implicit
    Bind for GPU write   | glFramebufferTextureND  | bind_for_write         | implicit         | implicit
    Map to CPU           | glMapBuffer             | map_cpu                | N/A              | N/A
    Map to GPU           | glUnmapBuffer           | unmap_cpu              | N/A              | N/A
    GPU Interface
    Shader declaration   | uniform samplerND       | uniform VMemND         | float<>, float[] | parameter decl.
    Random read          | texND(tex,coord)        | vTexND                 | array access     | array access
    Stream read          | texND(tex,streamIndex)  | input iterator         | stream access    | read-only pointer
    Stream write         | out floatN : COLOR      | output iterator        | out float<>      | write-only pointer

Table 3.1: The GPU memory model described in terms of OpenGL/Cg, Glift, Brook, and C memory primitives. Note that glGetTexSubImageND is not a true OpenGL function, but it can be emulated by attaching a texture to a framebuffer object and calling glReadPixels.

3.2 Glift Components

In order to separate GPU data structures into orthogonal, reusable components and simplify the explanation of complex structures, Glift factors GPU data structures into five components: virtual memory, physical memory, address translator, iterators, and container adaptors (Figure 3.1). This section introduces each of these components, and Chapter 4 presents Glift source code examples for each of them.

3.2.1 Physical Memory

The PhysMem component defines the data storage for the structure. It is a lightweight abstraction around GPU texture memory that supports the 1D, 2D, 3D, and cube-mapped physical memory available on current GPUs as well as mipmapped versions of each. A PhysMem instance supports all of the memory operations defined in Table 3.1. Users choose the type of PhysMem that lets them most efficiently exploit the hardware features required by a data structure. This choice is often made irrespective of the dimensionality of the virtual domain. For example, if a 3D algorithm requires efficient bind_for_write, current hardware mandates that the structure use 2D physical memory.


Figure 3.1: Block diagram of Glift components (shown in green/grey). Glift factors GPU data structures into a virtual (problem) domain, physical (data) domain, and an address translator that maps between them. Container adaptors are high-level structures that implement their behavior atop an existing structure. Glift structures support CPU and GPU iteration with address iterators (circled A) and element iterators (circled E). Applications can use high-level, complete Glift data structures or use Glift’s lower-level AddrTrans and PhysMem components separately. The Glift library is built on top of C++, Cg, and OpenGL primitives.

If, however, fast trilinear filtering is more important than fast write operations, the user selects 3D physical memory.

3.2.2 Virtual Memory

The VirtMem component defines the programmer’s interface to the data structure and is selected based on the algorithm (problem) domain. For example, if an algorithm requires a 3D data structure, the VirtMem component will be 3D, irrespective of the PhysMem type. In Glift, a generic VirtMem component combines a physical memory and address translator to create a virtualized structure that supports all of the operations of the GPU memory model listed in Table 3.1. An important consequence of this feature is that VirtMem and PhysMem components are interchangeable, making it possible for users to build complex structures by composing VirtMem types.

3.2.3 Address Translator

A Glift address translator is a mapping between the physical and virtual domains. While conceptually simple, address translators are the core of Glift data structures and define the small amount of code required to virtualize all of the GPU memory operations. Address translators support mapping of single points as well as contiguous ranges. Point translation enables random-access reads and writes, and range translation allows Glift to support efficient block operations such as copy, write, and iteration. Example address translators include the ND-to-2D translator used by Brook to represent N-D arrays, page-table based translators used in sparse structures, and tree-based translators recently presented in the GPU ray tracing literature.

Just as users of the STL choose an appropriate container based on the required interface and performance considerations, Glift users select a translator based on a set of features and performance criteria. Chapter 8 presents a taxonomy of previous GPU data structures in terms of Glift components and these characteristics. The following six characteristics of address translators were selected based on patterns in previous GPU data structures and the performance considerations for current graphics hardware:

Memory Complexity: Constant/Log/Linear. How much memory is required to represent the address translator function? Fully procedural mappings, for example, have O(1) memory complexity and page tables are O(n), where n is the size of the virtual address space.

Access Complexity: Constant/Log/Linear. What is the average computational cost of performing an address translation? Page tables exhibit O(1) complexity whereas tree traversals are O(log n).

Access Consistency: Uniform/Non-uniform. Does address translation always require the exact same sequence of instructions? This characteristic is especially important for SIMD-parallel architectures. Non-uniform translators, such as hash tables or trees, require a varying number of operations per translation.

Location: CPU/GPU. Is the address translator on the CPU or GPU? CPU translators are used to represent structures too complex for the GPU or as an optimization to pre-compute physical memory addresses. Note that applications whose memory access patterns are not known in advance must use GPU-based address translators.

Mapping: Total/Partial and One-to-one/Many-to-one. Is all of the virtual domain mapped to the physical domain? Sparse data structures use partial mappings. Does each point in the virtual domain map to a single unique point in the physical domain? Adaptive mappings optimize memory usage by mapping coarse regions of the virtual domain to less physical memory.

Invertible. Does the address translator support both virtual-to-physical and physical-to-virtual mappings? Virtual-to-physical translation is the most common and is used to read and write data with algorithms written using virtual addresses. Physical-to-virtual translation is required for GPU iteration over the structure (i.e., GPGPU computation) because GPU computation is implemented in terms of the physical address of each output element.

3.2.4 Iterators

The PhysMem, VirtMem and AddrTrans components are sufficient to build texture-like read-only data structures but lack the ability to specify computation over a structure’s elements. Glift iterators add support for this feature. Glift extends the iterator concepts from Boost and the STL to abstract traversal over GPU data structures. Iterators form a generic interface between algorithms and data structures by abstracting data traversal, access permission, and access patterns.

Glift iterators provide the following benefits:

- GPU and CPU iteration over complex data structures;
- generalization of the GPU computation model; and
- encapsulation of GPU optimizations.

Glift supports two types of iterators: element iterators and address iterators. Element iterators are STL-like iterators whose value is a data structure element. Element iterators create a pointer-like entity that can retrieve a value from a data structure without the user knowing the address. In contrast, address iterators are lower-level constructs that traverse the virtual or physical N-D addresses of a structure rather than its elements. Address iterators enumerate the N-D grid address spaces of the data structure. Most programmers will prefer the higher-level element iterators, but address iterators are important for algorithms that perform computation on virtual addresses or for users wanting to adopt Glift at a lower abstraction level. Chapter 7 provides a further discussion of Glift iterators, especially the relationship between Glift’s iteration model and the stream programming model popularized by Brook and Sh.

3.2.5 Container Adaptors

In addition to the four main components, Glift defines higher-level data structures as container adaptors. Container adaptors implement their behavior on top of an existing container. For example, in the STL, stacks are container adaptors built atop either a vector or a deque. Container adaptors are valuable because they enable users to easily create new data structures by leveraging existing structures. Chapter 4 describes a simple N-D array container adaptor, and Chapter 9 describes two new data structures built as container adaptors: a GPU stack built atop a Glift array and a generic quadtree/octree structure built atop the Glift page-table based address translator.
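The STL analogy is direct; standard C++ lets the user select the underlying container of a stack adaptor explicitly while the stack interface stays the same:

    #include <deque>
    #include <stack>
    #include <vector>

    int main() {
        std::stack<int>                    s1;  // defaults to std::deque storage
        std::stack<int, std::vector<int> > s2;  // same interface, vector storage
        s1.push(1);
        s2.push(1);
        return 0;
    }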

Chapter 4

Glift Programming

The Glift programming model is designed to be familiar to STL, OpenGL, and Cg programmers. In this section, we use the example of building a 4D array in Glift to describe the declaration, use, and capabilities of each Glift component.

Each Glift component defines both the C++ and Cg source code required to represent it on the CPU and GPU. The final source code for a Glift structure is generated by the composition of templated components. Each component is designed to be composited with other components or used as a standalone construct. Additionally, all Glift components have a CPU-based version, both to facilitate debugging and to enable algorithms for both the CPU and GPU.

We begin with a complete example, then break it down into its separate components. The C++ and Cg code for defining, instantiating, initializing, binding, and using the 4D Glift array is shown below:

    typedef glift::ArrayGpu<vec4i, vec4f> ArrayType;

    // Instantiate Glift shader parameter
    GliftType arrayTypeCg = glift::cgGetTemplateType<ArrayType>();
    glift::cgInstantiateNamedParameter(prog, "array", arrayTypeCg);

    // ... compile and load shader ...

    // Create Glift 4D array
    vec4i size(10, 10, 10, 10);
    ArrayType array(size);

    // Initialize all elements to random values
    std::vector<vec4f> data(10000);
    for (size_t i = 0; i < data.size(); ++i) {
        data[i] = vec4f(drand48(), drand48(), drand48(), drand48());
    }
    array.write(0, size, data);

    // Bind array to shader
    CGparameter param = cgGetNamedParameter(prog, "array");
    array.bind_for_read(param);

    // ... Bind shader and render ...

The type of the array is declared to be addressed by 4D integers and to store 4D float values.¹ Glift adds template support to Cg, and so the array type of the parameter to the Cg shader must be instantiated with cgInstantiateParameter. Next, the array is defined and initialized; then lastly, the array is bound as a shader argument using bind_for_read. This is analogous to binding a texture to a shader, but binds the entire Glift structure.

The Cg shader that reads from the 4D Glift array is as follows:

    float4 main( uniform VMem4D array, varying float4 texCoord ) : COLOR
    {
        return array.vTex4D( texCoord );
    }

¹ Note that the C++ code in this thesis uses vector types such as vec4f to indicate statically-sized tuples of either integers or floats. In this case, vec4f indicates a tuple of four floats. In the future, Glift will accept user-defined vector/tuple classes that conform to a minimal interface.

This is standard Cg syntax except that the array is declared as an abstract 4D type and vTex4D replaces native Cg texture accesses. Note that all Glift components are defined in the C++ namespace glift; however, all subsequent code examples in this dissertation exclude explicit namespace scoping for brevity.

4.1 Physical Memory

The PhysMem component encapsulates a GPU texture. It is most often instantiated by specifying an address type and value type. These address and value types are statically-sized, multidimensional vectors. For example, the PhysMem component for the 4D array example is declared to use 2D integer addressing (vec2i) and store 4D float values (vec4f) as follows:

    typedef PhysMemGPU<vec2i, vec4f> PhysMemType;

Just as the STL containers have many default template parameters that are effectively hidden from most users, Glift components have optional template parameters for advanced users. If left unspecified, Glift uses type inference or default values for these extra parameters. In addition to the address and value type, PhysMem components are also parameterized by an addressing mode and internal format. The addressing mode is either scaled (integer) or normalized (floating point) and is specified using a Glift type tag (see Appendix B).² The default addressing mode is determined based on the address data type (integer addresses default to scaled and floating-point addresses default to normalized). The internal format is the OpenGL enumerant for the internal texture format. The complete template specification for a PhysMem component is:

    typedef PhysMemGPU<AddrType, ValueType, AddrModeTag, InternalFormat> PhysMemType;

In the 4D array example, type inference determines the addressing mode to be ScaledAddressTag and the internal format to be GL_RGBAF32_ARB.

² Type tags are like C/C++ enumerants, but are more flexible and scalable for use in generic libraries.

4.2 Virtual Memory

In the 4D array example, the virtual domain is four-dimensional and the physical domain is two-dimensional. The generic VirtMem Glift component is parameterized by only the physical memory and address translator types:

    typedef VirtMemGPU<PhysMemType, AddrTransType> VirtMemType;

The memory copy operations (read, write, copy_from_framebuffer) are specified in terms of contiguous regions of the virtual domain. For example, in the 4D array example, the user reads and writes 4D sub-regions. The VirtMem class automatically converts the virtual region to a set of contiguous physical regions and performs the copies on these physical blocks.

4.3 Address Translator

An address translator defines the core of a Glift data structure. Address translators are primarily used to define a VirtMem component; however, they may also be used independently as first-class Glift objects to facilitate incremental adoption. Address translators are minimally parameterized by two types: the virtual and physical address types. For example, the address translator for the 4D array example is defined as:

    typedef NdTo2dAddrTransGPU<vec4i, vec2i> AddrTransType;

This typedef defines a 4D-to-2D address translator where both the virtual and physical domains use scaled addressing and no boundary conditions are applied.

For advanced users, address translators are parameterized by at least three additional types: the virtual boundary condition, the virtual addressing mode, and the physical addressing mode. The complete template prototype for an address translator is:

    typedef AddrTrans<VirtAddrType, PhysAddrType,
                      VirtBoundaryTag, VirtAddrModeTag, PhysAddrModeTag> AddrTransType;

The VirtBoundaryTag is a Glift type tag defining the boundary conditions to be applied to virtual addresses. It defaults to no boundary condition, but Glift supports all OpenGL wrap modes.

To create a new Glift address translator, users must define the two core translation functions:

    pa_type translate( const va_type& va );
    void translate_range( const va_type& origin, const va_type& size,
                          range_type& ranges );

The translate method maps a point in the virtual address space to a point in the physical address space. The translate_range method converts a contiguous region of the virtual domain into a set of contiguous physical ranges. Range translation is required for efficiency; it enables Glift to translate all addresses in the range using a small number of point translations followed by interpolation. Without range translation, the Glift copy operations would have to be executed separately for every point in the specified domain, a very inefficient operation. The VirtMem component uses these two methods to automatically create a virtualized GPU memory interface for any randomly-indexable structure.

4.4 Iterators

Iterators are a generalization of C/C++ pointers and abstract data structure traversal, access permissions, and access patterns. This section introduces the use and capabilities of Glift’s CPU and GPU element and address iterators. We begin with simple CPU examples and build up to GPU iterators.

4.4.1 Traversal

Iterators encapsulate traversal by enumerating data elements into a 1D, linear address space. With iterators, programmers can express per-element algorithms over any data structure. For example, a Glift programmer writes a CPU-based algorithm that negates all elements in the example 4D array as:

    ArrayType::iterator it;
    for (it = array.begin(); it != array.end(); ++it) {
        *it = -(*it);
    }

where ++it advances the iterator to the next element, *it retrieves the data value, begin obtains an iterator to the first element, and end returns an iterator one increment past the end of the array.

Glift programmers may also traverse sub-regions of structures by specifying the origin and size of a range. For example, to traverse the elements between virtual addresses (0,0,0,0) and (4,4,4,4) inclusive, the programmer writes:

    ArrayType::range r = array.range( vec4i(0,0,0,0), vec4i(5,5,5,5) );
    ArrayType::iterator it;
    for (it = r.begin(); it != r.end(); ++it) {
        *it = -(*it);
    }

Given that the primary goal of Glift is parallel execution, we need to replace the explicit for loop with an encapsulated traversal function that can be parallelized. For example, the STL’s transform construct can express the previous example as:

    ArrayType::range r = array.range( vec4i(0,0,0,0), vec4i(5,5,5,5) );
    std::transform( r.begin(), r.end(), r.begin(), std::negate<vec4f>() );

where the first two arguments specify the input iterators, the third specifies the corresponding output iterator, and the fourth argument is the computation to perform on each element. In stream programming terminology, this fourth argument is called the kernel.

While this example is executed serially on the CPU, the transform operation’s order-independent semantics make it trivially parallelizable. Glift leverages this insight to express GPU iteration as parallel traversal over a range of elements.

4.4.2 Access Permissions

In addition to defining data traversal, Glift iterators also express data access permissions. Access permissions control whether the value of an iterator is read-only, write-only, or read-write. For simplicity, the examples so far have used iterators with read-write access permissions. It is especially important to distinguish between access permissions on current GPUs, which prohibit kernels from reading and writing to the same memory location. The previous example can be re-written to obey such rules by using separate input and output arrays:

    // ... Declare arrayIn and arrayOut as identically sized arrays ...

    ArrayType::range r = arrayIn.range( vec4i(0,0,0,0), vec4i(5,5,5,5) );
    std::transform( r.begin_i(), r.end_i(),
                    arrayOut.begin_o(), std::negate<vec4f>() );

Note that begin_i and end_i return read-only (input) iterators and begin_o returns a write-only output iterator. One-sided communication models such as this avoid synchronization problems and are also known to be efficient on CPU-based parallel computers [69].

4.4.3 Access Patterns

Glift iterators support three types of data access patterns: single, neighborhood, and random. An iterator’s access pattern provides a mechanism by which programmers can explicitly declare their application’s memory access patterns. In turn, this information can enable compilers or runtime systems to hide memory access latency by pre-loading data from memory before it is needed. As noted in Purcell et al. [108], future architectures could use this information to optimize cache usage based on the declared memory access pattern. For example, the stream model permits only single-element access in kernels, enabling perfect pre-loading of data before it is needed. At the other extreme, random memory access affords little opportunity to anticipate which memory will be needed. Single-access iterators do not permit indexing and are analogous to Brook’s stream inputs and the STL’s non-random-access iterators. Neighborhood iterators permit relative indexing in a small, constant-sized region around the current element. These iterators are especially useful in image processing and grid-based simulation applications. Random-access iterators permit indexing into any portion of the data structure.

4.4.4 GPU Iterators

We now have enough machinery to express a GPU version of the CPU example shown in Section 4.4.1. To begin, the Cg source for negateKernel is:

    void main( SingleIter it, out float4 result : COLOR )
    {
        result = -( it.value() );
    }

SingleIter is a Glift Cg type for a single-element (stream) iterator, and it.value() dereferences the iterator to obtain the data value.

The C++ source that initiates the execution of the GPU kernel is:

    ArrayType::gpu_in_range inR =
        array1.gpu_in_range( vec4i(0,0,0,0), vec4i(5,5,5,5) );
    ArrayType::gpu_out_range outR =
        array2.gpu_out_range( vec4i(0,0,0,0), vec4i(5,5,5,5) );

    CGparameter arrayParam = cgGetNamedParameter(prog, "it");
    inR.bind_for_read( arrayParam );
    outR.bind_for_write( GL_COLOR_ATTACHMENT0_EXT );

    // ... Bind negateKernel fragment program ...

    exec_gpu_iterators( inR, outR );

The gpu_in_range and gpu_out_range methods create entities that specify GPU computation over a range of elements and deliver the values to the kernel via Cg iterators. Note that GPU range iterators are first-class Glift primitives that are bound to shaders in the same way as Glift data structures.

The exec_gpu_iterators call is a proof-of-concept GPU execution engine that executes negateKernel across the specified input and output iterators but is not part of core Glift. The goal of Glift is to provide generic GPU data structures and iterators, not to provide a runtime GPU execution environment. Glift’s iteration mechanism is designed to be a powerful back-end to GPU execution environments such as Brook, Scout, or Sh.

4.4.5 Address Iterators

In addition to the data element iterators already described, Glift also supports address iterators. Rather than iterate over the values stored in a data structure, address iterators traverse N-D virtual or physical addresses. Address iterators enable users to specify iteration using only an AddrTrans component.

A simple CPU example that uses an address iterator to add one to all elements of the example 4D array is:

    AddrTransType::range r = addrTrans.range( vec4i(0,0,0,0), vec4i(5,5,5,5) );
    AddrTransType::iterator ait;

    for (ait = r.begin(); ait != r.end(); ++ait) {
        vec4i va  = *ait + vec4i(1, 1, -1, -1);
        vec4f val = array.read( va );
        array.write( va, val + 1 );
    }

Note that iteration is now defined with an address translator rather than a virtual or physical container, and the value of the iterator is an index rather than a data value.

The Cg code for a GPU address iterator example is:

    float4 main( uniform VMem4D array, AddrIter4D it ) : COLOR
    {
        float4 va = it.value();
        return array.vTex4D( va );
    }

and the corresponding C++ code is:

    AddrTransType::gpu_range r =
        addrTrans.gpu_range( vec4i(0,0,0,0), vec4i(5,5,5,5) );

    // ... Bind Glift parameters ...
    // ... Bind Cg shader ...

    exec_gpu_iterators( r );

4.5 Container Adaptors

Container adaptors are higher-level containers that define a behavior atop an existing VirtMem structure. For example, the ArrayGpu template class shown in the example at the beginning of this chapter is a simple container adaptor built atop the set of typedefs developed in the preceding sections:

    typedef PhysMemGPU<vec2i, vec4f>               PhysMemType;
    typedef NdTo2dAddrTransGPU<vec4i, vec2i>       AddrTransType;
    typedef VirtMemGPU<PhysMemType, AddrTransType> VirtMemType;

The ArrayGpu container adaptor encapsulates this definition such that users can declare an array simply as:

    typedef glift::ArrayGpu<vec4i, vec4f> ArrayType;
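A skeleton of how such an adaptor might package these typedefs is sketched below; this is a simplified illustration of the structure, not the actual ArrayGpu source, and it compiles only against the Glift types it composes:

    // Simplified sketch of a container adaptor in the style of ArrayGpu.
    // It fixes the component choices and forwards its interface to the
    // composed VirtMem structure.
    template <typename VirtAddr, typename Value>
    class ArrayGpuSketch {
    public:
        typedef PhysMemGPU<vec2i, Value>               PhysMemType;
        typedef NdTo2dAddrTransGPU<VirtAddr, vec2i>    AddrTransType;
        typedef VirtMemGPU<PhysMemType, AddrTransType> VirtMemType;

        explicit ArrayGpuSketch(const VirtAddr& size) : m_virtMem(size) {}

        // Forward memory-model operations to the virtualized container.
        void write(const VirtAddr& origin, const VirtAddr& size,
                   const std::vector<Value>& data) {
            m_virtMem.write(origin, size, data);
        }
        void bind_for_read(CGparameter param) { m_virtMem.bind_for_read(param); }

    private:
        VirtMemType m_virtMem;
    };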

Chapter 9 describes more complex container adaptors. For example, the complexity of duplicating data to prepare an octree structure for native GPU filtering can be encapsulated in the write method of an octree texture container adaptor.

Chapter 5

Glift Design and Implementation

The Glift C++ template library maps the abstractions described in Chapter 3 to a C++/OpenGL/Cg GPU programming environment. This section describes the design and implementation of Glift, including how Glift unifies the CPU and GPU source code, adds template support to Cg, and virtualizes the multi-processor, multi-language GPU memory model.

To support the design goals of the abstraction presented in Chapter 3, the implementation of Glift is designed with the following goals:

- incremental adoption;
- extensibility;
- efficiency; and
- CPU and GPU interoperability.

Glift supports easy, incremental adoptability by providing familiar texture-like and STL-like interfaces, requiring minimal changes to shader authoring and compilation pipelines, and allowing Glift components to be used alone or in combination with others. Glift offers easy extensibility by providing three ways to create new, fully-virtualized Glift data structures: write a new address translator, change the behavior of an existing Glift component by writing a new policy,¹ or write a container adaptor that maps new behavior atop an existing structure. Glift’s efficiency mechanisms include static polymorphism, template specialization, program specialization, and leveraging optimizing GPU compilers. Lastly, Glift supports processor interoperability by supporting CPU and GPU memory mappings and iterators.

¹ Policy-based template design factors classes into orthogonal components, making it possible to change behavior by replacing one small module [3].

5.1 GPU Code Generation and Compilation

This section describes how Glift adds template-like support to a GPU shading language. We chose to make Glift a C++ template library based on our design goal of providing a high-performance programming abstraction. High-performance C++ libraries such as the STL, Boost, and POOMA [66] have demonstrated the power of static polymorphism for providing abstractions for performance-critical coding. The challenge, however, is that current GPU languages do not support templates.

We began by selecting Cg as the target GPU language. This was largely based on its support for primitive static polymorphism via its interface construct [105]. Interfaces allow shader writers to specify input parameter types in terms of an abstract interface whose concrete type is determined at compile time.

Because the GPU shading code could not be templatized, and because we wanted Glift structures to support both CPU and GPU usage, we integrated the GPU and CPU code bases. Each Glift component contains templatized C++ code and stringified GPU code. This design enables us to generate GPU code from C++ template instantiations. This generated Cg code becomes the concrete implementation of the interface type declared by a shader.

The goals for integrating templates into Cg include easy integration with the existing Cg shader compilation workflow and minimal extensions to the Cg API. The resulting system adds only two new Cg API calls:

    GliftType cgGetTemplateType<GliftCPlusPlusType>();
    CGprogram cgInstantiateParameter( CGprogram, const char*, GliftType );

The first call maps a C++ type definition to a runtime identifier (GliftType). This identifier is then passed to the second new call, cgInstantiateParameter, to insert the Glift source code into the shader. The first argument is the Cg program handle, the second is the name of the abstract interface parameter this Glift type will instantiate, and the third is the GliftType identifier. The call prepends the Glift Cg code to the shader and defines the GliftType to be the concrete implementation of the interface shader parameter. The returned Cg program is compiled and loaded like a standard Cg shader. The only other change to the shader pipeline is that Glift parameters are bound to shaders by calling bind_for_read instead of using Cg’s standard parameter value setting routines.

5.1.1 Alternate Designs

The above design was decided upon after multiple early design iterations. One of these earlier approaches obtained the Glift Cg source code from instances of Glift data structures. This approach proved burdensome and clumsy because it required the Glift data structure to be present and initialized at shader compile time. The final model instead requires only the GliftType identifier. This identifier can be obtained either from the C++ type definition or from a Glift object.

5.2 Cg Program Specialization

Glift leverages Cg’s program specialization capability to help create efficient code from generic implementations. Program specialization is a program transformation that takes a procedure and some of its arguments, and returns a procedure that is the special optimized case obtained by fixing the value of those arguments at compile time [50]. The Cg API supports specialization on shaders by having users set the value of uniform parameters, change their variability to constant using cgSetParameterVariability, then recompile the program to generate a specialized version of it.

Glift components support specialization by providing an analogous set_member_variability method. This call specializes (or un-specializes) all uniform parameters defined by the component. This feature allows users to “bake out” many parameters that will not change at run time, at the cost of additional compilation time and shader management.
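For reference, specializing a single uniform parameter with the standard Cg runtime looks roughly like the following; the parameter name pageSize is a hypothetical example:

    #include <Cg/cg.h>

    // Fix a uniform parameter's value, mark it as a literal constant,
    // and recompile so the compiler can fold it through the program.
    void specializePageSize(CGprogram prog, float pageSize) {
        CGparameter p = cgGetNamedParameter(prog, "pageSize");
        cgSetParameter1f(p, pageSize);
        cgSetParameterVariability(p, CG_LITERAL);
        cgCompileProgram(prog);  // regenerates a specialized program
    }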

5.3 Iterators

As described in Section 3.2.4, Glift supports two kinds of iterators: address iterators and element iterators. Here, we describe how these constructs map to current graphics hardware.

5.3.1 Address Iterators

Address iterators are an important concept for the Glift implementation for two reasons. First, address iterators can be efficiently implemented as GPU rasterizer interpolants (i.e., texture coordinates). Second, they are the iterator type supported by AddrTrans components. The concept enables address translators to specify iteration without having knowledge of physical data.

Glift address translators generate a stream of valid virtual addresses for a GPU data structure. They abstract the screen-space geometry used to initiate GPU computation as well as the physical-to-virtual translation required to map from fragment position to the virtual address domain. Glift represents address iterators using vertex data, vertex shader code and parameters, and a viewport specification.
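In fixed-function OpenGL terms, the idea can be pictured as drawing a viewport-sized quad whose interpolated texture coordinates enumerate the addresses, one per fragment. The sketch below uses legacy immediate-mode GL purely for clarity; the real representation also carries vertex shader code and parameters:

    #include <GL/gl.h>

    // Rasterizing this quad makes the hardware interpolator generate
    // one (s,t) address per fragment: a stream of 2D addresses.
    void drawAddressIterator(int width, int height) {
        glViewport(0, 0, width, height);  // one fragment per output element
        glBegin(GL_QUADS);
        glTexCoord2f(0.0f, 0.0f); glVertex2f(-1.0f, -1.0f);
        glTexCoord2f(1.0f, 0.0f); glVertex2f( 1.0f, -1.0f);
        glTexCoord2f(1.0f, 1.0f); glVertex2f( 1.0f,  1.0f);
        glTexCoord2f(0.0f, 1.0f); glVertex2f(-1.0f,  1.0f);
        glEnd();
    }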

Note that with this design, modifying a Glift data structure (e.g., inserting or deleting elements) requires writing to both the address translator memory and the iterator representation. The latter operation requires either render-to-vertex-array support or CPU involvement.

5.3.2 Element Iterators

Glift element iterators pose several challenges for implementation on current GPUs. First, element iterators are a pointer-like abstraction, and GPUs do not support pointers. Second, GPU data structure traversal is usually accomplished with the rasterizer, yet the rasterizer cannot generate true memory addresses. It can only generate N-D indices via texture coordinate interpolation (i.e., address iterators). As such, Glift implements element iterators by combining an address iterator with a Glift container (either PhysMem or VirtMem). An address translator provides a physical or virtual address to the Glift container, and the user sees only the result of the data access. The last challenge is that the design requires shading language structures that contain a mixture of uniform and varying members. This feature has recently been added to Cg and is available in Cg version 1.5.

5.4 Mapping Glift Data into CPU Memory

All Glift components are designed to support data access on both the CPU and GPU. Glift implements this feature by supporting explicit map_cpu and unmap_cpu functionality. Just as with mappable GPU memory like vertex buffer objects and pixel buffer objects, map copies the data into CPU memory while unmap copies data back into GPU memory. The default behavior is to transfer the entire data structure; however, Glift can optionally operate in a paged, lazy-transfer mode where only the required portions of the structure are copied. This mode reduces CPU read/write efficiency but can greatly reduce the amount of data sent between the CPU and GPU.

Initial designs attempted to automatically map data to the appropriate processor based on usage. This proved problematic, however, because it is not possible to detect all cases in which a GPU-to-CPU or CPU-to-GPU synchronization is required. For example, this can arise if a user binds a Glift structure to a shader, writes to the CPU mapping of the structure, then later re-binds the shader without explicitly re-binding the Glift structure. This is a perfectly legal OpenGL usage pattern, but one in which Glift cannot detect that the structure is now being read by the GPU. To support this automatic virtualization, drivers would need to support a query to discover whether a texture had been written, subsume a much larger portion of the GPU programming model (shader binding, etc.), or be integrated directly into GPU drivers.

Unfortunately, automatic synchronization is required to support virtualized CPU-to-GPU and GPU-to-CPU transfers for discrete address translators. The problem is that virtualized range operations (see Section 5.5) require a valid CPU-side address translator. If the GPU has modified data owned by the address translator, Glift must synchronize the CPU version before performing address translation. This subtle but important design fact indicates that drivers need to support a dirty primitive so that client code can query when a piece of GPU memory was last written.

5.5 Virtualized Range Operations

Providing generic support for virtualized range operations (read, write, copy) was one of the most difficult challenges in the design of Glift. The goal is for application programmers to be able to easily create new containers that have full support for these operations. The challenge is that the contiguous virtual region over which these operations are defined often maps to multiple physical regions. For example, in a paged address translator, a contiguous virtual domain will often map to many physical pages that are not adjacent in physical memory.

Early designs placed this functionality in the VirtMem component. This approach, however, required each data structure to implement a significant amount of redundant, complex code. Our solution is inspired by a concept from automatic parallelization research. In their work to create a parallel STL, Austern et al. describe a range partition adaptor as an entity that takes a range defined by a begin and end iterator and breaks it up into subranges which can be executed in parallel [5]. Applying this idea to Glift, the generic range operation problem is solved if address translators can translate a virtual range into an ordered sequence of physical ranges.

The prototypical usage of this operation is:

    // Input: origin and size of virtual rectangle
    vec3i virtOrigin(0), virtSize(1, 2, 7);

    // Output: corresponding physical ranges
    // - Allocate list of ranges
    AddrType::ranges_type ranges;
    addrTrans.translate_range( virtOrigin, virtSize, ranges );

    for (size_t i = 0; i < ranges.size(); ++i) {
        // Perform operation on ranges
        DoPhysRangeOp( ranges[i].origin, ranges[i].size );
    }

where DoPhysRangeOp performs an operation such as read/write/copy across a range of contiguous physical addresses. The generic VirtMem class template now uses this idiom to virtualize all of the range-based memory operations for any address translator.

While conceptually simple, the design had a profound effect on Glift: address translators are now composable. The design places all of the data structure complexity into the address translator, which in turn means that fully virtualized containers can be constructed by combining simpler structures with little or no new coding.

Chapter 6

Results

6.1 Static Analysis of Glift GPU Code

GPU programmers will not use abstractions if they impose a significant performance penalty over writing low-level code. Consequently, our framework must be able to produce code with comparable efficiency to handcoded routines. We evaluate the efficiency of our code by comparing instruction count metrics for three coding scenarios: a handcoded implementation of our data structure accesses and Glift code generated with Cg both before and after driver optimization. We compare these metrics in Table 6.1 on three address translators: a 1D→2D stream translator, a 1-level non-adaptive page table, and the adaptive page table lookup used for quadtree shadow maps in Chapter 11 combined with an additional call to compute the page offset. This last case is a particularly difficult one for optimization because it contains numerous redundant instructions executed by multiple method calls on the same structure. Table 6.1 shows both the number of assembly instructions reported by the Cg compiler and the number of microcode hardware instructions reported by NVShaderPerf. The latter is the actual number of hardware instructions whereas the former is the number of high-level, user-visible assembly instructions.

All results in this section were computed on a 2.8 GHz Pentium 4 AGP 8x system with 1 GB of RAM and an NVIDIA GeForce 6800 GT with 256 MB of RAM, running Windows XP. The NVIDIA driver version was 75.80 and the Cg compiler is version 1.4 beta.

    Method                                 Cg ops   HW ops
    Stream 1D→2D
      Glift, no specialization                8        5
      Glift, with specialization              5        4
      Brook (for reference)                   —        4
      Handcoded Cg                            4        3
    1D sparse, uniform 3D→3D page table
      Glift, no specialization               11        8
      Glift, with specialization              7        5
      Handcoded Cg                            6        5
    Adaptive shadow map + offset
      Glift, no specialization               31       10
      Glift, with specialization             27       10
      Handcoded Cg                           16        9

Table 6.1: Comparison of instruction counts for various compilation methods on 3 memory access routines. “Cg ops” indicates the number of operations reported by the Cg compiler before driver optimization; “HW ops” indicates the number of machine instructions (“cycles”), including multiple operations per cycle, after driver optimization, as reported by the NVIDIA shader performance tool NVShaderPerf.

The results for these address translators and others we have tested show that the performance gap between programming with Glift and handcoding is minimal, if any. The careful use of partial template specialization, Cg program specialization, and improved optimizations of recent compilers and drivers make it possible to express abstract structures with very little or no performance penalty.

6.2 Memory Access Coherency

Like their CPU counterparts, many GPU data structures use one or more levels of memory indirection to implement sparse or adaptive structures. This section assesses the cost of indirect GPU memory accesses and the relative importance of coherency in those accesses. Examples include page-table structures such as the adaptive data structure described in Section 9.3.1 and tree structures [42, 78, 125].

This evaluation provides guidance in choosing the page size and number of levels of indirection to use in an address translator. We evaluate the performance of 1- and 2-level page tables built with Glift as well as n-level chained indirect lookups with no address translation between lookups (see Figure 6.1).

Figure 6.1: Bandwidth (MB/s) as a function of page size (entries) for n-level chained indirect lookups (using no address computation) and for n-level page tables using our framework. Bandwidth figures only measure the data rate of the final lookup into physical memory and not the intermediate memory references. All textures are RGBA8 format. The results indicate that, for an NVIDIA GeForce 6800 GT, peak bandwidth can be achieved with pages larger than 8×8 for small levels of indirection and 16×16 for larger levels of indirection.

The experiment simulates memory accesses performed by paged structures by performing coherent accesses within page-sized regions, but randomly distributing the pages in physical memory. As such, the memory accesses are completely random when the page size is one (a single pixel). Accesses are perfectly coherent when the page size is the entire viewport. To validate our experimental setup, we also measure reference bandwidths for both fixed-position and sequential-position direct texture accesses. These results (20 GB/sec and 9.5 GB/sec, respectively) match the bandwidth tests reported by GPUBench [17].

We can draw several conclusions from Figure 6.1. First, the performance of an n-level page table is only slightly less than the performance of an n-level indirect lookup, indicating that our address translation code does not impose a significant performance penalty. The slightly better performance of the page table translator for larger page sizes is likely due to the page table being smaller than the indirection test’s address texture. The page table size scales inversely with the page size, whereas the indirection test uses a constant-sized address texture. The page table translators are bandwidth bound when the pages contain fewer than 64 entries and are bound by arithmetic operations beyond that. Lastly, higher levels of indirection require larger page sizes to maintain peak GPU bandwidth. The results show that pages must be at least 64 entries to maximize GPU bandwidth for a small number of indirections and 256 entries for higher levels of indirection.

These results show that it is not necessary to optimize the relative placement of physical pages in a paged data structure if the pages are larger than a minimum size (16×16 4-byte elements for the NVIDIA GeForce 6800 GT). This validates the design decision made in Section 9.3.1 not to solve the NP-hard bin packing problem when allocating N-D pages in an N-D buffer.

Chapter 7

Discussion

This section describes insights, limitations, and near-term future work arising from the Glift abstraction. Chapter 13 describes longer-term future work for Glift and GPU data structures.

7.1 Language Design

Results from high-performance computing [29], as well as our results for Glift, show that it is possible to express efficient data structures at a high level of abstraction by leveraging compile-time constructs such as static polymorphism (e.g., templates) and program specialization. We have shown how to add template-like support to an existing GPU language with minimal changes to the shader compilation pipeline. We strongly encourage GPU language designers to add full support for static polymorphism and program specialization to their languages.

7.2 Separation of Address Translation and Physical Data Memory

In our implementation, address translation memory is separate from physical application memory. This separation has several advantages:

Multiple physical buffers mapped by one address translator: Mapping data to multiple physical memory buffers with a single address translator lets Glift store an “unlimited” amount of data in data structure nodes.

Multiple address translators for one physical memory buffer: Using multiple address translators for a single physical memory buffer enables Glift to reinterpret data in-place. For example, Glift can “cast” a 4D array into a 1D stream by simply changing the address translator.

Efficient GPU writes: Separating address translator and physical data allows the data in each to be laid out contiguously, thus enabling efficient GPU reading and writing. The separation also clarifies whether the structure is being altered (write to address translator) or only the data is being modified (write to physical memory).

Future GPU architecture optimizations: In CPUs, access to application memory is optimized through data caches, while access to address translation memory uses the specialized, higher-performance translation lookaside buffer (TLB). Future GPU hardware may optimize different classes of memory accesses in a similar way.

7.3 Iterators

The Glift iterator abstraction clarifies the GPU computation model, encapsulates an important class of GPGPU optimizations, and provides insight into ways that future hardware can better support computation.

7.3.1 Generalization of GPU Computation Model

Glift’s iterator abstraction generalizes the stream computation model and clarifies the class of data structures that can be supported by current and future GPU computation. The stream model, popularized in graphics by Brook, describes computation of a kernel over elements of a stream. Brook defines a stream as a 1D–4D collection of records. In contrast, Glift defines GPU computation in terms of parallel iteration over a range of an arbitrary data structure. The structure may be as simple as an array or as complex as an octree. If the structure supports parallel iteration over its elements, it can be processed by the GPU. Given that the stream model is a subset of the Glift iterator model, GPU stream programming environments can use Glift as their underlying memory model to support more complex structures than arrays.

7.3.2 Encapsulation of Optimizations

Glift iterators capture an important class of optimizations for GPGPU applications. Many GPGPU papers describe elaborate computation strategies for pre-computing memory addresses and specifying computation domains by using a mix of CPU, vertex shader, and rasterizer calculations. Examples include pre-computing all physical addresses, drawing many small quads, sparse computation using stencil or depth culling, and Brook’s iterator streams [14, 18, 47, 84, 121]. In most cases, these techniques are encapsulated by Glift iterators and can therefore be separated from the algorithm description. An exciting avenue of future work is generating GPU iterators entirely on the GPU using render-to-vertex-array, vertex texturing, and future architectural features for generating geometry [9].

7.3.3 Explicit Memory Access Patterns

Glift’s current element iterator implementation performs random-access reads for all iterator access patterns (single, neighborhood, and random) because current hardware does not support this distinction. However, the abstraction to the programmer remains one of limited stream or neighbor access. While a current GPU may not be able to benefit from this abstraction, current CPUs, multiple-GPU configurations, future GPUs, and other parallel architectures may be able to schedule memory loads much more efficiently for programs written with these explicit access pattern constructs.

7.3.4 Parallel Iteration Hardware

Future GPUs could offer better support for computation over complex data structures by providing true pointers and native support for incrementing those pointers in parallel. Glift’s current GPU iterators give the illusion of pointers by combining textures and address iterators. This approach can require multiple address translation steps to map between the output domain, a data structure’s virtual domain, and a structure’s physical domain. It is possible that future architectures will optimize this by providing native support for parallel range iterators. Instead of generating computation with a rasterizer, computation would be generated via parallel incrementing of iterators.

7.4 Near-Term GPU Architectural Changes

We note how near-term suggested improvements in graphics hardware and drivers would impact the performance and flexibility of Glift:

Time-Stamped GPU Memory: As discussed in Section 5.4, supporting virtualized range operations requires the ability to query the GPU driver to find out when a region of GPU memory was last written. Adding this feature to drivers obviates the need for Glift to be implemented within the driver, thereby keeping drivers smaller and simpler.

Integers: Hardware support for integer data types would both simplify address translation code and avoid the precision difficulties with floating-point addressing [16].

Per-Fragment Mipmapping: In current NVIDIA GPUs, accessing mipmap textures with the explicit level-of-detail (LOD) instruction (TXL) results in all four fragments in a 2×2 pixel block accessing the same mipmap level, essentially ignoring three of the four LOD values. The explicit derivative texture instruction (TXD) does give correct, per-fragment results but performs significantly slower. Other workarounds for this behavior are error-prone and computationally expensive, limiting the usefulness of mipmaps as general data structure primitives.

7.5 Limitations

The address translation abstraction we have presented in this dissertation does not currently represent all possible data structures. In particular, we have not investigated GPU implementations of non-indexable structures such as linked lists, graphs, and mesh connectivity structures. Chapter 13 describes ways that Glift could be extended to possibly support these irregular structures.

The Glift implementation is a proof-of-concept with significant room for improvement. For example, Glift does not currently virtualize the total amount of GPU memory, virtualize memory on multiple GPUs, or provide file formats for its structures (although the textures can be saved using any compatible image format). In addition, the CPU side of Glift does not yet take advantage of SIMD instructions or support multithreading.

Part III

GPU Data Structures

Chapter 8

Classification of GPU Data Structures

This dissertation demonstrates the expressiveness of the Glift abstractions in three ways. First, this chapter characterizes a large number of existing GPU data structures in terms of Glift concepts.

Second, Chapter 9 introduces novel GPU data structures, and Part IV describes two applications not previously demonstrated on the GPU due to the complexity of their data structures. The classification presented in this chapter identifies common patterns in existing work, showing that many structures can be built out of a small set of common components. It also illuminates trends and holes in the existing body of work.

8.1 Analytic ND-to-MD Translators

Analytic ND-to-MD mappings are widely useful in GPU applications in which the dimensionality of the virtual domain differs from the dimensionality of the desired memory. Note that some of the structures in Table 8.1 use ND-to-MD mappings as part of more complex structures. These translators have the following properties: O(1) memory complexity, O(1) access complexity, uniform access consistency, GPU or CPU location, and a complete, one-to-one, invertible mapping.

The mappings are typically implemented in one of two ways. The first form linearizes the N-D space to 1D, then distributes the 1D space into M-D. C/C++ compilers and Brook support N-D arrays in this manner.

[Table 8.1: Characterization of existing GPU data structures in terms of the address translator characteristics described in Section 3.2.3. For each structure, the columns identify the virtual and physical domains; whether the mapping is complete (C) or sparse (S); uniform (U) or non-uniform (NU) access consistency; the location of the address translator (CPU or GPU); the memory and address translation complexity; and whether the structure is GPU-updatable. The table illuminates similarities between structures across the existing body of research.]

The second approach is to directly map the N-D space into M-D memory.

This is a less general approach but can result in a more efficient mapping that better preserves M-D spatial locality. For example, this approach is used in the implementation of flat 3D textures [46, 53, 83]. Either mapping can be computed wholly on the GPU. Lefohn et al. [80] detail

the implementation of both of these mappings for current GPUs. Glift currently provides a generic

ND-to-2D address translator called NdTo2dAddrTransGPU.
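To make the first, two-step form concrete, the following C++ sketch linearizes a 3D virtual address into 1D and then wraps it into a 2D physical domain. The class and member names are illustrative only, not Glift's actual NdTo2dAddrTransGPU interface:

// Illustrative CPU sketch of an analytic 3D-to-1D-to-2D translator;
// the names here are hypothetical, not Glift's NdTo2dAddrTransGPU API.
struct AddrTrans3dTo2d {
    int vsx, vsy, vsz;   // virtual (3D) domain size
    int psx;             // physical (2D) texture width

    // Linearize the 3D virtual address, then distribute it into 2D.
    void translate( int x, int y, int z, int& px, int& py ) const {
        int a1d = x + vsx * ( y + vsy * z );   // 3D -> 1D
        px = a1d % psx;                        // 1D -> 2D
        py = a1d / psx;
    }
};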

8.2 Page Table Translators

Many of the GPU-based sparse or adaptive structures in Table 8.1 are based on a page table design.

These structures are a natural fit for GPUs due to their uniform-grid representation; fast, uniform access; and support for dynamic updates. Page tables use a coarse, uniform discretization of a virtual

address space to map a subset of it to physical memory. Like the page tables used in the virtual

memory system of modern operating systems and microprocessors, page table data structures must

efficiently map a block-contiguous, sparsely allocated large virtual address space onto a limited

amount of physical memory [71, 84].

Page table address translators are characterized by O(N) memory complexity, O(1) access complex-

ity, uniform access consistency, GPU or CPU location, complete or sparse mapping, and one-to-one

or many-to-one mapping. Page tables are invertible if an inverse page table is also maintained.

Page tables have the disadvantage of requiring O(N) memory, where N is the size of the virtual

address space; however, multilevel page tables can dramatically reduce the required memory. In

fact, the multilevel page table idiom provides a continuum of structures from a 1-level page table to

a full tree. This was explored by Lefebvre et al. [78], who began with a full tree structure and found

the best performance was with a shallower-branching multilevel page table structure. Section 9.3

describes a new dynamic, multiresolution, adaptive GPU data structure that is based on a page table design.

The basic address translation calculation for an N-D page table translator is shown below:

vpn = va / pageSize          // va:  virtual address; vpn: virtual page number
pte = pageTable.read( vpn )  // pte: page table entry
ppn = pte.ppn()              // ppn: physical page number
ppa = ppn * pageSize         // ppa: physical page origin
off = va % pageSize          // off: offset into physical page
pa  = ppa + off              // pa:  physical address

Beginning with the above mapping, we can succinctly describe many complex structures simply as variants of this basic structure, including varying physical page sizes (grids of lists), multilevel page

tables, and adaptively sized virtual or physical pages. Glift provides a generic page table address

translator. The policy-based implementation is highly parameterized to maximize code reuse. As

demonstrated in Chapter 9, entirely new data structures can be created by writing a small amount of policy code for the existing class template.
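As a rough C++ illustration of this policy-based design (the template parameters and policy hooks below are hypothetical, not Glift's actual interface), a variant structure can reuse the translation skeleton by supplying only a small offset policy:

#include <vector>

// Hypothetical sketch of a policy-parameterized page table translator.
// Swapping OffsetPolicy yields variants (uniform pages, adaptive pages,
// grids of lists) without rewriting the lookup skeleton.
template <typename PTE, typename OffsetPolicy>
class PageTableTrans {
public:
    PageTableTrans( int virtSize, int pageSize )
        : pageSize_( pageSize ), table_( virtSize / pageSize ) {}

    int translate( int va ) const {
        int vpn = va / pageSize_;           // virtual page number
        const PTE& pte = table_[vpn];       // page table entry
        int ppa = pte.ppn() * pageSize_;    // physical page origin
        return ppa + OffsetPolicy::offset( va, pageSize_, pte );
    }

    std::vector<PTE>& table() { return table_; }

private:
    int pageSize_;
    std::vector<PTE> table_;
};

// Uniform-page policy: the plain modulo offset from the listing above.
struct UniformOffset {
    template <typename PTE>
    static int offset( int va, int pageSize, const PTE& ) {
        return va % pageSize;
    }
};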

8.3 GPU Tree Structures

Until 2004, nearly all GPU-based address translators had O(1) access complexity and uniform consistency. In the last two years, however, researchers have begun to implement tree structures such as k-d trees, bounding volume hierarchies, and N-trees. One notable precursor to these recent structures is Purcell's non-uniform-consistency grid-of-lists construct, used to build a ray tracing acceleration structure and kNN-grid photon map in 2002 and 2003, respectively [108, 109].

The change to non-uniform access consistency structures came largely as a result of NVIDIA releasing the NV40 architecture with support for looping in fragment programs. Support for this feature is, however, still primitive, and incoherent branching can greatly impact performance [52].

Purcell et al. [109] and Foley et al. [42] both avoided the need for fragment program looping by rendering one loop iteration per pass, but this approach significantly increases the bandwidth requirement for data structure traversal. As future GPUs provide more efficient branching support, these memory-efficient data structures will continue to improve in performance.

8.4 Dynamic GPU Structures

Including dynamic complex data structures in GPU applications is an area of active research. In fact, Purcell et al. are the only researchers to describe an entirely GPU-updated, dynamic sparse data structure [109]. Lefohn et al. [84] and Coombe et al. [31] both describe efficient GPU-based dynamic algorithms that use the CPU only as a memory manager. The adaptive shadow data structures described in Sections 11.3 and 11.4 of this dissertation are GPU-updated dynamic, adaptive structures that use the CPU only to generate the GPU iterators required to update the page table.

Clearly, GPU-updated sparse and adaptive structures will be an area of focus for the next couple of years, especially with the interest in ray tracing of dynamic scenes.

Section 9.3 introduces a new GPU-based dynamic sparse and adaptive structure, and Sections 11.3 and 11.4 describe its use for high-quality, interactive shadows. One of the conclusions from this work is that an efficient parallel scan algorithm is required to build structures with parallel computation [59]. This conclusion is also noted in the parallel computing literature [58].
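To illustrate why scan matters here: building a structure in parallel typically requires compacting a sparse set of flagged elements (e.g., page requests) into a dense stream, and an exclusive scan of the flags computes each survivor's output address. The sequential C++ reference below only sketches the idea, not the GPU algorithms of [58, 59]:

#include <vector>

// Sequential reference for scan-based stream compaction. The exclusive
// scan of the 0/1 flags gives each flagged element its output address;
// that scan is the step that parallelizes well on the GPU.
std::vector<int> compactStream( const std::vector<int>& data,
                                const std::vector<int>& flags ) {
    std::vector<int> addr( flags.size() );
    int sum = 0;
    for ( size_t i = 0; i < flags.size(); ++i ) {  // exclusive scan
        addr[i] = sum;
        sum += flags[i];
    }
    std::vector<int> out( sum );
    for ( size_t i = 0; i < flags.size(); ++i )    // scatter survivors
        if ( flags[i] ) out[addr[i]] = data[i];
    return out;
}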

8.5 Limitations of the Abstraction

Most of the structures listed in Table 8.1 are encapsulated well by the Glift abstractions. One structure that is challenging to express in Glift is the grid-of-lists used by Purcell et al. [109] and

Johnson et al. [63]. This structure is a page table with variable-sized pages. Interpreting this as a random-access container requires using a 1D index for elements in the variable-sized pages; thus a

3D grid-of-lists uses a 4D virtual address space.

A second challenge to the Glift abstraction is the interface for building each of the structures. The sparse and adaptive structures, in particular, require additional methods to allocate and free memory.

In practice, we have added these operations in the high-level container adaptor classes. They should, however, likely be a policy of the address translators, given that these operations modify the address translation function. In the future, we may be able to identify a small set of generic interfaces that abstract these operations.

Chapter 14 describes ideas for extending Glift to support non-random-access structures such as

linked lists, hash tables, graphs, etc. It also discusses the implications of future hardware on Glift

data structures.

Chapter 9

Example Glift Data Structures

This chapter describes how Glift is used to build three GPU data structures.

9.1 GPGPU 4D Array

After describing the Glift components required to build and use a 4D array in Chapter 3, we now

show a complete “before” and “after” transformation of source code with Glift. We use a GPGPU

computation example here because of its simplicity; note that it demonstrates the use of Glift element iterators.

Here, we show the Cg code for the Glift and non-Glift versions of this example. Appendix 1 shows

the corresponding C++ code for both examples. For each element in the 4D array, the following

kernel computes the finite discrete Laplacian. The Cg shader for the non-Glift example is:

float4 physToVirt( float2 pa, float2 physSize, float4 sizeConst4D )
{
    float4 addr4D;
    float  addr1D = pa.y * physSize.x + pa.x;
    addr4D.w = floor( addr1D / sizeConst4D.w );
    addr1D  -= addr4D.w * sizeConst4D.w;
    addr4D.z = floor( addr1D / sizeConst4D.z );
    addr1D  -= addr4D.z * sizeConst4D.z;
    addr4D.y = floor( addr1D / sizeConst4D.y );
    addr4D.x = addr1D - addr4D.y * sizeConst4D.y;
    return addr4D;
}

float2 virtToPhys( float4 va, float2 physSize, float4 sizeConst4D )
{
    float addr1D     = dot( va, sizeConst4D );
    float normAddr1D = addr1D / physSize.x;
    return float2( frac( normAddr1D ) * physSize.x, normAddr1D );
}

float4 main( uniform sampler2D array1,
             uniform float2    physSize,
             uniform float4    sizeConst,
             varying float2    winPos : WPOS ) : COLOR
{
    // Get virtual address for current fragment
    float2 pa = floor( winPos );
    float4 va = physToVirt( pa, physSize, sizeConst );

    // Finite difference discrete Laplacian
    float4 offset  = float4( 1, 0, 0, 0 );
    float4 laplace = -8 * tex2D( array1, pa );
    for ( float i = 0; i < 4; ++i ) {
        laplace += tex2D( array1, virtToPhys( va + offset, physSize, sizeConst ) );
        laplace += tex2D( array1, virtToPhys( va - offset, physSize, sizeConst ) );
        offset = offset.yzwx;
    }
    return laplace;
}

Note that this Cg program includes both a physical-to-virtual and a virtual-to-physical address translation. The former maps the output fragment position to the virtual domain of the 4D array, and

the latter maps the virtual array addresses to physical memory. While the functions encapsulate the

address translation, the data structure details obscure the algorithm, and the shader is hard-coded for this particular 4D array. Any changes to the structure require rewriting the shader.

In contrast, the Cg shader for the Glift version of the same example is:

#include <gliftCg.h>

float4 main( NeighborIter it ) : COLOR
{
    // Finite difference discrete Laplacian
    float4 offset  = float4( 1, 0, 0, 0 );
    float4 laplace = -8 * it.value( 0 );
    for ( float i = 0; i < 4; ++i ) {
        laplace += it.value( offset );
        laplace += it.value( -offset );
        offset = offset.yzwx;
    }
    return laplace;
}

The NeighborIter parameter is a neighborhood iterator that gives the kernel access to a limited window of data values surrounding the current stream element. Note the intent of the algorithm is much clearer here than in the non-Glift version, the code is much smaller, and the algorithm is defined completely separately from the data structure. The only requirement of the data structure is that it supports neighbor iterators with 4D offsets.

There is an additional subtle benefit to the Glift version of the code. The expensive address translation shown in the non-Glift Cg shader can be optimized without changing the C++ or Cg user code. The iterator can pre-compute the physical-to-virtual and virtual-to-physical address translations. These optimizations can be performed using pre-computed texture coordinates, the vertex processor, and/or the rasterizer. These types of optimization have been performed by hand in numerous GPGPU research publications [14, 46, 47, 84, 121], and Chapter 7 discusses this important property of iterators in more detail.

9.2 GPU Stack

The stack is a fundamental data structure in CPU programming, yet there has been little research on a GPU version of it. Applications of a GPU stack for graphics structures include k-d tree traversal for ray tracing [37] and GPU-based memory allocation for page-table structures (see Section 9.3).

This section introduces a GPU-compatible stack-like structure called the n-stack. The structure is a more general version of the multistack presented in work performed simultaneously with ours [37].

It is implemented as a Glift container adaptor atop a virtualized 1D array.

The n-stack can be thought of as n stacks combined into a single structure (see Figure 9.1). Each push and pop operation processes n elements in parallel. We implement the n-stack as a varying-size 1D array with a fixed maximum size.

A Glift n-stack of 4-component floats with n = 1000 is declared as:

const int n = 1000;
typedef StackGPU StackType;

int maxNumN = 10000;
StackType stack( maxNumN );

The stack can store at most maxNumN n-sized arrays. Users may push either CPU-based or GPU-based data onto the stack. The following code demonstrates a GPU stack push:

typedef ArrayGpu ArrayType;
ArrayType data( 1000 );

// ... initialize data array ...

ArrayType::gpu_in_range r = data.gpu_in_single_range( 0, 1000 );
stack.push( r );

Figure 9.1: The Glift n-stack container stores a stack of n-length arrays (a) to support n-way parallel push and pop operations. We implement the structure in Glift as a 1D virtual array stored in 2D physical memory (b).

Push writes a stream of n input values into the physical memory locations from stack.top() to stack.top() + n - 1 (see Figure 9.2). Note that the input to push is specified as a range of element iterators rather than a specific data structure. This makes push compatible with any Glift

data structure rather than just dense, 1D arrays.
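The index arithmetic behind these operations is simple; the following CPU-side C++ model (with hypothetical names, using plain 1D indices in place of GPU iterator ranges) mirrors what the container adaptor computes:

// CPU-side model of the n-stack's top-pointer arithmetic (names are
// hypothetical); the real container adaptor operates on GPU ranges.
struct NStackModel {
    int top;       // index of the next free element
    int n;         // elements moved per push/pop
    int maxElems;  // fixed maximum size

    NStackModel( int n_, int maxElems_ )
        : top( 0 ), n( n_ ), maxElems( maxElems_ ) {}

    // push: the n inputs occupy [top, top + n - 1], then top advances.
    bool push( int& first, int& last ) {
        if ( top + n > maxElems ) return false;
        first = top; last = top + n - 1;
        top += n;
        return true;
    }
    // pop: return the range [top - n, top - 1] and decrement top by n.
    bool pop( int& first, int& last ) {
        if ( top < n ) return false;
        top -= n;
        first = top; last = top + n - 1;
        return true;
    }
};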

Pop removes the top n elements from the stack and returns a GPU input range iterator (see Figure 9.2). This iterator can be bound as input to a GPU computation over the specified range. Example C++ source code that pops two streams from the stack, combines them in a subsequent kernel, and writes them to an array looks like:

StackType::gpu_in_range result1 = stack.pop();
StackType::gpu_in_range result2 = stack.pop();

// ... Create Cg shader and instantiate Glift types ...

CGparameter result1Param = cgGetNamedParameter( prog, "result1" );
result1.bind_for_read( result1Param );

CGparameter result2Param = cgGetNamedParameter( prog, "result2" );
result2.bind_for_read( result2Param );

ArrayType output( StackType::value_size );
ArrayType::gpu_out_range outR =
    output.gpu_out_range( 0, StackType::value_size );
outR.bind_for_write( GL_COLOR_ATTACHMENT0_EXT );

glift::execute_gpu_iterator( result1, result2, outR );

Figure 9.2: The Glift n-stack container supports pushing and popping of n elements simultaneously. The push operation writes elements from positions top to top + n - 1 (a). The pop operation returns a range iterator that points to elements from top to top - n, then decrements top by n (b).

The corresponding Cg shader is:

float4 main( SingleIter result1, SingleIter result2 ) : COLOR
{
    return result1.value() + result2.value();
}

Our current implementation does not support a Cg interface for push or pop because these operations modify the stack data structure. Such operations must be expressed as their own pass because current GPUs commit write operations only at the end of a kernel (whereas a Cg interface would require modifying the stack data structure at the point it is read rather than at the end of the pass). Current GPUs do not support read-write memory access within a rendering pass, and so push must execute in its own render pass. Note that pop returns only a GPU input iterator and therefore does not require its own render pass.

9.3 Dynamic Multiresolution Adaptive GPU Data Structures

In this section, we describe a novel dynamic multiresolution adaptive data structure. The data structure is defined as a Glift container adaptor, requiring only minor modifications to structures already presented thus far. Adaptive representations of texture maps, depth maps, and simulation

data are widely used in production-quality graphics and CPU-based scientific computation [7, 34,

41, 87]. Adaptive grids make it possible for these applications to efficiently support very large

resolutions by distributing grid samples based on the frequency of the stored data. Example effective

resolutions include a 524,288² adaptive shadow map that consumes only 16 MB of RAM and a 1024³ fluid simulation grid. Unfortunately, the complexity of adaptive-grid data structures (usually tree structures) has, for the most part, prevented their use in real-time graphics and GPU-based

simulations that require a dynamic adaptive representation.

Previous work on adaptive GPU data structures (see Table 8.1) includes CPU-based address translators [22, 31], static GPU-based address translators [10, 42, 74, 125], and a GPU-based dynamic

adaptive grid-of-lists [109]. In contrast, our structure is entirely GPU-based, supports dynamic updates, and leverages the GPU's native filtering to provide full mipmapping support (i.e., trilinear

[2D] and quadlinear [3D] filtering). Chapter 10 describes the differences between our structure and that of work done in parallel with ours by Lefebvre et al. [78].

9.3.1 The Data Structure

We define our dynamic, multiresolution, adaptive data structure as a Glift container adaptor, built atop the 1-level page table structure defined in Section 8.2. As such, the structure can be described as simply a new interpretation of a page table virtual memory container (in the same way that Section 9.2 defines a stack on top of a 1D virtual memory definition). The structure supports adaptivity via a many-to-one address translator that can map multiple virtual pages to the same physical page.

We avoid the bin-packing problem and enable mipmap filtering by making all physical pages the same size. We support multiresolution via a mipmap hierarchy of page tables. Figure 9.3 shows a

diagram of our structure.

The adaptive structure is predominantly built out of generic Glift components and requires only a small amount of new code. The new code includes several small policy classes for the generic page table address translator, a page allocator, and the GPU iterators. The resulting AdaptiveMem container adaptor is defined as:

typedef AdaptiveMem< VirtMemPageTableType, PageAllocator > AdaptiveMemType;

Figure 9.3: This figure shows the three components of the multiresolution, adaptive structure used to represent a quadtree and octree. In this adaptive shadow mapping example, the virtual domain (a) is formed by the shadow map coordinates. The address translator (b) is a mipmap hierarchy of page tables, and the physical memory (c) is a 2D buffer that stores identically sized memory pages. The address translator supports adaptivity by mapping a varying number of virtual pages to a single physical page.

VirtMemPageTableType is the type definition of the generic VirtMem component built as a 1-level page

table structure. The PageAllocator parameter defines the way in which virtual pages are mapped to physical memory. The choice of allocator determines if the structure is adaptive or uniform and whether or not it supports mipmapping. Note that the allocator is a similar construct to the allocators used in the STL to generically support multiple memory models.

The adaptive address translation function differs from the one presented in Section 8.2 in only two small ways. First, we add a mipmap level index to the page table read:

pte = pageTable.read( vpn, level )

Second, we support variable-sized virtual pages by storing the resolution level of virtual pages in

the page table and permitting redundant page table entries (Figure 9.3). We thus change the offset computation to:

off = ( va >> pte.level() ) % physicalPageSize

Figure 9.4: Depiction of our node-centered adaptive grid representation. Red nodes indicate T-junctions (hanging nodes). The white node emphasizes the complex situation at the junction of multiple levels of resolution.

The decision to represent adaptivity with variable-sized virtual pages and uniform-sized physical pages is in contrast to previous approaches. Our approach greatly simplifies support for native GPU

filtering (including mipmapping) at the cost of using slightly more physical memory. In addition, uniform physical pages simplify support for dynamic structures by avoiding the bin-packing problem encountered when allocating variable-sized physical pages. See Chapter 6 for memory system benchmarks that further indicate that complex page packing schemes are unnecessary.
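Combining the two modifications, the complete adaptive lookup can be sketched in C++ as follows (hypothetical types, 1D addresses, and a power-of-two page size for brevity; Glift's actual translator operates on N-D addresses through its page-table mipmap hierarchy):

#include <vector>

// Rough sketch of the adaptive translation. Each PTE stores the
// resolution level of its virtual page; the level-shifted offset lets
// several redundant PTEs map coarse virtual pages onto one physical page.
struct AdaptivePTE { int ppn; int level; };

int adaptiveTranslate( int va, int pageSize /* power of two */,
                       const std::vector<AdaptivePTE>& pageTable ) {
    int vpn = va / pageSize;                   // virtual page number
    const AdaptivePTE& pte = pageTable[vpn];   // page table entry
    int ppa = pte.ppn * pageSize;              // physical page origin
    int off = ( va >> pte.level ) % pageSize;  // level-scaled offset
    return ppa + off;
}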

9.3.2 Adaptivity Implementation Details

Correct representation of data on an adaptive grid presents several challenges irrespective of its

GPU or CPU implementation. Our node-centered implementation correctly and efficiently handles

T-junctions (i.e., hanging nodes), resolution changes across page boundaries, boundary conditions, and fast linear interpolation.

Adaptive grid representations generally store data at grid node positions rather than the cell-centered approach supported by OpenGL textures. A node-centered representation makes it possible to reconstruct data values at arbitrary virtual positions using a sampling scheme free of special cases, thus enabling us to correctly sample our adaptive structure using the GPU's native linear interpolation. Cell-centered approaches must take into account a number of special cases when sampling across resolution boundaries.

While the node-centered representation makes it possible to sample the data identically at all positions, discontinuities will still occur if we do not first correct the T-junction values. T-junctions (red nodes in Figure 9.4) arise at the boundary between resolution levels because we use the same refinement scheme on the entire grid. The data at these nodes must be the interpolated value of their neighboring coarser nodes. Before sampling our adaptive structure, we enforce this constraint by again using the GPU's native filtering to write interpolated values into the hanging nodes. This approach also works across multi-level resolution changes.

A node-centered representation also simplifies boundary condition support on adaptive grids. Unlike OpenGL textures, a node-centered representation contains samples on the exact boundary of the domain. As such, the position of data elements on the borders is identical and independent of the resolution (see edge nodes in Figure 9.4). Dirichlet and Neumann boundary conditions are easily supported by either writing fixed values into the boundary nodes or updating the values based on the neighboring internal nodes, respectively. Our implementation supports GL_CLAMP_TO_EDGE and GL_REPEAT boundary modes. The GL_CLAMP mode can also be supported, although it is not suitable

for adaptive representations because it samples between nodes. The resulting sample positions are

therefore irregular, depending on the resolution with which the virtual domain is represented.

The page table abstraction provides a convenient solution to the problem of reading neighboring

values across resolution changes (important for high-quality filtering and scientific computation).

Along each edge, we must know the distance to the next regular node, disregarding T-junctions; e.g.,

consider the regular black neighbors of the white node in Figure 9.4. For a given edge, we can

directly obtain the resolution level on either side of the edge from the page table. For disparate

resolutions, the finer resolution corresponds to the hanging node while the coarser resolution conveys the distance to the next regular node. In 2D, the overhead for this evaluation is one extra page table read; in 3D, the same is true for faces, and three additional reads are necessary at edges. This resolution information allows us to resolve hanging nodes across multi-level resolution jumps.

Lastly, in order to support native GPU linear filtering, we must share one layer of nodes between physical memory pages to ensure that samples are never read from disparate physical pages. If samples were allowed to be read from adjacent pages, the filtered result color would be incorrect because adjacent physical pages are not guaranteed to be adjacent in the virtual domain. We therefore copy one layer of nodes and offset the physical coordinates by half a texel to ensure correct hardware filtering. This approach has been described in previous literature on adaptive GPU structures [10, 22, 74]. Each time an application updates data values, the shared nodes must be updated similarly to the hanging-node update. Nodes that are both T-junctions and shared are correctly handled by the same interpolation scheme.
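As a sketch of the resulting coordinate math (an assumed layout, not Glift's exact formulation): for a page stored with one shared border layer, a normalized physical coordinate can be formed by adding a half-texel offset so that linear filtering always samples within the page:

// Assumed-layout sketch: compute a normalized 1D physical coordinate
// for a page stored with one shared border texel. The +0.5 centers the
// sample on a texel, so the hardware's linear filter never straddles
// two unrelated physical pages.
float physCoord( int pageOriginTexel,   // first texel of the page
                 float inPageOffset,    // in [0, pageSize]; border shared
                 int physTexSize ) {    // physical texture size in texels
    return ( pageOriginTexel + inPageOffset + 0.5f ) / physTexSize;
}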

Part IV

Applications

Chapter 10

Octree 3D Paint

This chapter describes the first of four applications of Glift data structures. We demonstrate an interactive 3D paint application that stores paint in a GPU-based, 3D octree-like structure built with the Glift framework. The structure is built using a number of existing Glift components and has an intuitive, texture-like Cg syntax. The implementation supports quadlinear filtering of paint texels and has minimal impact on application performance. We demonstrate interactive painting of an

817k polygon model with effective paint resolutions varying from 64³ to 2048³.

We begin with an introduction to the 3D painting problem and octree textures. Section 10.2 describes the data structure challenges and details, Section 10.3 describes the painting and mipmapping algorithms, and Section 10.4 gives performance results for the painting application.

10.1 Introduction

Interactive painting of complex or unparameterized surfaces is an important problem in the digital

film community. Many models used in production environments are either difficult to parameterize

or are unparameterized implicit surfaces. Texture atlases offer a partial solution to the problem [22]

but cannot be easily applied to implicit surfaces. Octree textures [7, 34] offer a more general solution by using the model's 3D coordinates as a texture parameterization. Christensen and Batali [28] recently refined the octree texture concept by storing pages of voxels, rather than individual voxels, at the leaves of the octree. While this texture format is now natively supported in Pixar's Photorealistic RenderMan renderer, the lack of GPU support for this texture format has made

authoring octree textures very difficult.

Despite the benefits of octree textures, the complexity of implementing them on GPUs has led

multiple authors to report that it is not possible. For example, Carr et al. [22] say, "Octree textures can adapt their resolution to the fine detail of surface painting and are fast enough to adapt per-stroke. However, octree textures are not well-suited to capitalize on the acceleration and anti-aliasing benefits of modern graphics hardware. . . " The data structure for a GPU octree texture

must efficiently support all of the operations listed in Table 3.1 (efficient writes, random access, etc.) and hardware-accelerated filtering, and it must be easy enough to use to be a "drop-in" replacement for traditional textures.

Simultaneously with our work, several other researchers published tree-like GPU data structures

(see Chapter 8 for a complete discussion). Foley et al. [42] and Thrane et al. [125] describe GPU-based k-d trees and bounding volume hierarchies. However, these structures are statically built on

the CPU, and do not support the GPU-based updates required for interactive painting. Lefebvre et

al. [78], in parallel with our work, implemented a GPU-based octree-like structure (see Table 8.1).

Their work implemented general n-tree traversals with logarithmic, non-uniform accesses. In contrast, our structure supports uniform, O(1) accesses and supports quadlinear filtering without impacting application frame rates, with the tradeoff of requiring more address translator memory. Our

work can be seen as a special case of Lefebvre’s work, with the page table and physical data stored

in separate textures and support added for quadlinear mipmapping.

10.2 Data Structure

Our data structure is a 3D version of the adaptive structure described in Section 9.3. Without Glift,

all of the details for writing, reading, and building the data structure would be spread across C++

and Cg code, and any shader that used the structure would be exposed to the internal data structure details. In contrast, using Glift, we can define the structure once using almost entirely predefined components, encapsulate new data structure development within a derived structure, and write shaders using a simple Cg interface that is general for any 3D, random-access data structure.

We describe the structure as an n-tree, where n is the number of texels in a physical page. For example, if we use a 2³ page, the structure is an octree, and if we use an 8³ page, the structure is a 512-tree. The high branching factor allows us to use a single-level page table and support constant-time accesses. The tradeoff is the additional memory consumed by the large physical pages and the page table.

We use 3D virtual and physical addresses with a mipmap hierarchy of page tables and a single, small

3D physical memory buffer. The 3D physical memory format enables the GPU to perform native trilinear filtering within physical pages. In our implementation, we unify the notion of mipmap level and brick resolution level. This is implemented within the Glift framework by mipmapping only the virtual page table and sharing a single physical memory buffer between all mipmap levels.

When the mipmap level is finer than or equal to the resolution of the painted texture, they utilize the

same physical brick. When the mipmap level is coarser than the painted texture resolution, we must

allocate a new page to store downsampled texels.

To define the Glift data structure, we first define structure-specific page-table entry (PTE) and page

allocator classes. We then combine these classes with pre-existing, generic Glift components to

create the structure as follows:

- generic Glift page table address translator,
    – generic Glift mipmapped physical memory (3D addresses, PTE values),
    – application-specific page allocator,
- generic physical memory (3D addresses, RGBA color values),
- generic virtual memory (combine physical memory and address translator), and
- application-specific container adaptor (encapsulate the entire VirtMem object to add higher-level functionality).

The result is a C++ class that entirely defines all operations required for the data structure and the required Cg source code to instantiate the structure in a shader. Finally, a Cg shader can use the octree structure similarly to a conventional 3D texture access:

float4 main( uniform VMem3D octreePaint, float3 objCoord ) : COLOR
{
    return octreePaint.vTex3D( objCoord );
}

In fact, the shader above can be used with any 3D Glift data structure, be it a standard 3D texture,

n-tree, etc.

10.3 Algorithm

With the data structure defined in Glift, the bulk of development time for the application was spent

implementing brushing techniques, proper mipmap filtering, and adaptive refinement. This section describes each of those steps and how they use the Glift data structure.

The process of painting into the octree involves several steps: texture coordinate identification, brush rasterization, page allocation, texel update, and mipmap filtering. Note that we use the normalized coordinates of the rest pose of the model as texture coordinates. The first step, texture coordinate identification, uses the GPU to rasterize the model’s texture coordinates for locating brush-model intersections.

We implement efficient and accurate brushes by using 2D brush profiles. The approach is motivated by the fact that octree voxel centers are rarely located directly on the surface being painted. This makes it difficult to create smooth 3D brush profiles without artifacts. To accommodate this, we project each texel into screen space and update the color using 2D brush profiles (Figure 10.1(a)).

Our system restricts the minimum brush size to be no smaller than one screen-space pixel, although this may translate into a larger brush in model space. We use the model-space brush size to automatically determine the resolution of the octree page. With this approach, larger brushes apply paint to coarser texture resolutions than would a smaller brush.

Figure 10.1: Brush profile and filtering detail for our painting system. (a) Our paint system uses smooth 2D brush profiles to paint into 3D texels. We intersect the model and brush by using the screen-space projection of the 3D texture coordinates. The texture coordinates are the normalized rest pose of the model. (b) Our octree paint structure supports quadlinear mipmapping. Mipmap generation requires special care because of the node-centered representation. This figure shows cell-centered versus node-centered filtering. Note the difference in filter support: cell-centered has a 2×2 support, whereas node-centered has 3×3, with some fine nodes shared by multiple coarse nodes.

Since octree textures are sparse data structures, before a texel can be written we must ensure that it has been allocated at the desired resolution. When the desired brush resolution is finer than the existing texture resolution, the brick is re-allocated at the finer resolution and the previous texture color is interpolated into the higher resolution brick (i.e., texture refinement). Recall that because we are leveraging native GPU trilinear filtering, texels on the border of a texture brick are replicated by the neighboring bricks. This necessitates that we allocate neighboring tiles and write to all replicated

texels when updating a brick-border texel. The common and tedious problem of allocation and

writing to shared texels was folded into a specialized subclass of the more general Glift sparse, N-D, paged data structure, which greatly simplifies the application-specific brushing algorithms. Since brushes always cover more than one texel and require blending with the previous texture color, we found it important to both cache the octree texture in CPU memory (eliminating unnecessary texture readback) and create a copy of the current brush at the appropriate brick resolution before blending it with the previous texture colors. Glift's ability to easily map sub-regions (individual pages) of the octree to CPU memory greatly simplified brushing updates.

Once a brush stroke is completed, we must update the mipmap hierarchy levels of the octree texture.

Downsampling texels in our octree structure is a non-trivial issue due to the fact that our representation is node-centered, as opposed to the cell-centered representation used by ordinary OpenGL textures (Figure 10.1(b)). With the node-centered scheme, some finer mipmap level texels are covered by the footprint of multiple coarser mipmap level texels. This requires the filtering to use a weighted average, with weights inversely proportional to the number of coarse mipmap level texels that share a specific finer mipmap level texel.
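In 1D, this weighting reduces to a (1/4, 1/2, 1/4) stencil; the C++ sketch below illustrates that case under the stated rule (shared fine nodes contribute with half weight) and ignores the painted-texel masking discussed next:

#include <vector>

// 1D illustration of node-centered downsampling (assumed weighting).
// Interior fine node 2i maps onto coarse node i; fine nodes 2i-1 and
// 2i+1 are shared by two coarse nodes and thus contribute with half
// weight, giving the (1/4, 1/2, 1/4) stencil after normalization.
std::vector<float> downsampleNodes( const std::vector<float>& fine ) {
    std::vector<float> coarse( ( fine.size() + 1 ) / 2 );
    for ( size_t i = 0; i < coarse.size(); ++i ) {
        size_t f = 2 * i;
        float sum = fine[f], wsum = 1.0f;
        if ( f > 0 )               { sum += 0.5f * fine[f - 1]; wsum += 0.5f; }
        if ( f + 1 < fine.size() ) { sum += 0.5f * fine[f + 1]; wsum += 0.5f; }
        coarse[i] = sum / wsum;
    }
    return coarse;
}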

Because we are painting 2D surfaces using 3D bricks, there may be many texels in a brick that are never painted, since they do not intersect the surface. It is important that these texels do not contribute to the mipmap filtering process. We mark texels as having been painted by assigning them non-zero alpha values, having initialized the texture with zero alpha values. We set the alpha value to the resolution level at which the texel was last painted. This is done to ensure correct filtering.

This subtle issue arises when combining texture refinement (interpolating coarse bricks into finer

ones during the painting process) and mipmap filtering. When coarser texels are refined/interpolated into regions of a finer brick that do not intersect the surface, we end up with texels that cannot be updated by the finer brush, thus potentially resulting in erroneous mipmap filtering. In the mipmap

filtering process, texels painted at finer resolutions are weighted higher than those that may have

been interpolated from a lower resolution. While this approach does not completely fix the problem, since non-surface-intersecting interpolated texels still have some contribution in the mipmap

filtering, it tends to work well in practice.

10.4 Results

Figure 10.2 shows an 817,000-polygon model painted with our system. The frame rates for viewing textured models in our 3D paint application were determined entirely by the complexity of geometry and varied between 15 and 80 fps with models ranging in complexity from 50k to 900k polygons. The octree texturing operation did not affect frame rates in any of the viewing scenarios we tested. Frame rates while painting depend on the size of the current brush, and we maintain

highly interactive rates during painting. We evaluate the performance of our structure by comparing

it to the speed of a conventional 3D texture and no texturing. We measured the performance impact

of our data structure using synthetic tests similar to those shown in Figure 6.1. These synthetic

results are invariant to changes in page sizes between 8³ and 32³. As such, lookups into our structure

are bound by the address translation instructions rather than the memory accesses (see Chapter 7).

The biggest limitation of our approach is the amount of GPU memory consumed by the page table

and unused pages. Smaller pages would provide a tighter fit to the surface, yet would require

more page table entries to support an equivalent virtual domain. In addition, redundantly storing

extra texels to support native filtering means that pages smaller than 8³ are impractical due to the

increased ratio of redundant to unique texels. As such, in order to reduce the memory requirement,

a significant performance penalty would be incurred from both a multilevel page table and shader-defined filtering.

Figure 10.2: Our interactive 3D paint application stores paint in a GPU-based octree-like data structure built with the Glift template library. These images show an 817k polygon model with paint stored in an octree with an effective resolution of 2048³ (using 15 MB of GPU memory, quadlinear filtered).

Chapter 11

Quadtree Shadow Maps

Despite an extensive body of research, interactive rendering of alias-free hard shadows for dynamic scenes remains a difficult and largely unsolved problem. This chapter contributes to this effort by describing two high-quality, interactive shadow algorithms not previously demonstrated on GPUs due to their data structure complexity. The algorithms, Adaptive Shadow Maps (ASMs) and Resolution-

Matched Shadow Maps (RMSMs), both store shadow data in a quadtree of small shadow maps. For dynamic scenes, the quadtree must be rebuilt each frame using GPU-based algorithms in order to achieve interactive frame rates. The required GPU data structures are built using Glift components, and these algorithms stress nearly every portion of the Glift framework.

The GPU quadtree data structure has similar requirements to the octree structure described in Chapter 10, except that all data remains on the GPU at all times. The structure must support fast parallel node insertion, node deletion, node writes, and natively trilinearly filtered (mipmapped) reads.

Section 11.2 gives an overview of how the structure is defined with Glift and how each of these operations is supported.

The first algorithm, Adaptive Shadow Maps (ASMs), was first presented by Fernando et al. [41] as a hybrid CPU-GPU algorithm that used a CPU-based quadtree data structure and achieved interactive frame rates for static scenes. ASMs are widely regarded as a robust solution to shadow map aliasing problems, but have been considered impossible to implement for dynamic scenes because of the need for a dynamic GPU quadtree structure. For example, Sen et al. [117] say, "Adaptive shadow maps [41] and perspective shadow maps [119] attempt to minimize visible aliasing by better matching the resolution represented in the shadow map to that of the final image. Of these two, only the latter seems practical to implement on current graphics hardware;" and Chan et al. [24] say, "the required data structures and host-based calculations preclude real-time performance for dynamic scenes." This dissertation presents the first known implementation of ASMs that uses an entirely GPU-based data structure and supports dynamic scenes, albeit slowly.

The second algorithm, Resolution-Matched Shadow Maps (RMSMs), is a modification of the ASM algorithm that is up to ten times faster for dynamic scenes while producing more accurate shadows.

The algorithm is novel to this dissertation and is based on the insight that a large amount of coherence exists between image and shadow samples for surfaces continuously visible from the eye.

We leverage this insight to simplify the ASM algorithm, greatly improving both performance and shadow quality.

11.1 Introduction and Background

Despite an expansive body of literature, efficient generation of high-quality shadows is still largely an unsolved problem. All three popular methods, ray tracing, shadow volumes, and shadow maps, suffer from various limitations and problems. This section gives a brief overview of the techniques, their limitations, and how the quadtree-based shadow map algorithms presented in this chapter address the problems. For more information, Woo et al. provide an early survey of shadow algorithms [133], with Hasenfratz et al.'s more recent survey encompassing both an overview of these techniques and more recent work [54].

The shadowing problem is easily described: given a point visible from the eye, determine if it is also visible from a light. The most elegant way of solving the problem is to trace a ray from the point to the light and check for intersections. Ray tracing methods are very high quality but are used sparingly even in offline renderers due to their cost. Examples of real-time ray-traced shadows exist [21, 27, 65, 108, 111, 123, 125, 130], but their widespread adoption is limited by the challenge of requiring random access to the scene database and efficient support for dynamic scenes.

Shadow volume techniques [32] are another popular approach in real-time applications. The algorithm renders polygons that extend from a light source into the scene and trace the silhouette edges seen from the light. The algorithm is popular, but it requires object-based analysis that can be costly, is difficult to combine with displacement/vertex shaders that procedurally deform geometry, and does not scale well with geometric complexity.

Shadow mapping techniques [131] are commonly used in both offline and interactive renderers due to their simplicity, support for procedural deformations, and native GPU support [43]. Briefly, the shadow map, first described by Lance Williams [131], is an image-space technique that renders the scene from the light's point of view, storing the distance from the light to the scene geometry in a shadow map. The scene is then rendered from the point of view of the camera, using the information in the shadow map to decide if portions of the image are directly illuminated by the light or are instead in shadow. While efficient, flexible, and easy to implement, shadow mapping techniques suffer from projective, perspective, and depth-precision aliasing [119]. The quadtree-based shadow algorithms presented in this chapter are a variant of the basic shadow map algorithm.

11.1.1 Recent Work in Shadow Maps

The classic shadow map algorithm suffers from three kinds of aliasing: perspective, projective, and depth-precision aliasing [119]. Perspective aliasing is caused by a mismatch between the sampling rate of screen-space pixels and shadow texels, projective aliasing occurs when the light is nearly parallel to an occluder (Figure 11.1), and depth-precision aliasing results in shadow acne due to false self-shadowing of surfaces. This chapter addresses perspective and projective aliasing and uses standard bias techniques to avoid depth-precision aliasing.

A number of recent approaches address perspective aliasing by computing the shadow map in a perspective-warped space rather than light space [119, 132]. While these techniques remedy many aliasing artifacts, they do not remove all aliasing, and have special cases that make them difficult to use in practice [43]. Recent work by Lloyd et al. helps reduce the special cases [86]. However, the perspective-based techniques do not address projective aliasing errors. Chong et al. [26] present an algorithm that generates correct shadows for user-selected planes; however, the required user intervention and lack of global shadow quality guarantees limit the effectiveness of the algorithm.

Figure 11.1: A difficult case for standard and perspective-shadow-map-based algorithms is projective aliasing, which occurs when the light is parallel to the occluder. In the picture at the left, the grazing angles of the light that cause the two shadow boundaries on the middle sphere result in severe projective aliasing artifacts with a standard shadow map (center). By matching shadow resolution to the current camera view, quadtree-based shadow map algorithms give a more accurate shadow boundary (right), although close inspection reveals that even the finest 32,768² resolution level we show here is insufficient; in fact, in the limiting case, infinite resolution would be required.

Other recent efforts include the hybrid shadow map/volume rendering algorithm of Chan and Durand [24], trapezoidal shadow maps [90] (described by Lloyd et al. [86] as better quality than perspective and light-space perspective shadow maps), and silhouette maps [116, 117], which apply a non-linear deformation to the shadow lookup based on an object-based edge analysis. Finally,

Forsyth’s shadow-map-based “shadow buffers” method [43] renders several shadow buffers per light in order to better match the resolution of the shadow map to the screen.

Another set of approaches eliminates shadow map aliasing by proposing new graphics hardware that rasterizes shadow data at the exact location required by the shadow coordinates in the current camera view [2, 63]. These approaches entirely eliminate aliasing and are equivalent to ray-traced hard shadows. The quadtree-based shadow map techniques described in this chapter closely approximate these methods and run in real time on current graphics hardware, but require more memory.

11.2 Quadtree Shadow Map Data Structure

Both shadow algorithms in this chapter store shadow data in the same quadtree-like GPU data

structure. We replace the traditional shadow map's uniform grid with a quadtree of small shadow

map pages, as shown in Figures 9.3 and 11.2. The GPU quadtree structure is largely a 2D version of

the page-table-based octree structure described in Chapter 10 and Section 9.3. In fact, the only new

data structure code written specifically for the shadow applications was the page allocation code,

thus demonstrating the large amount of code reuse possible with Glift.

The structure keeps all address translator and physical data on the GPU at all times and supports fast

parallel page insertions, deletions, writes, and reads. The virtual addresses are (s,t,z) shadow map

coordinates and the physical addresses are 2D. The physical page size is a user-configurable parameter (typical values are 16², 32², or 64²). Shadow lookups use the GPU's native depth-compare and

2×2 percentage-closer filtering (PCF) [110] to return a fractional scalar value indicating how much light reaches the pixel. We also support trilinear (mipmapped) shadow lookups, enabling our application to transition smoothly between resolution levels with no perceptible popping. As described

in Section 9.3.2, we support native hardware filtering by over-representing one row and column of

texels on the borders of each page. This is easily achieved when rendering shadow pages by simply

rendering pages of size (n + 1) × (n + 1) rather than n × n.

Figure 11.2 shows the factoring of the quadtree data structure into Glift components as well as a

visualization of the page borders mapped atop a rendered image. Cg shaders access the quadtree

shadow map in the same way as a traditional shadow map lookup. Below is an example of a Cg

shader that performs a quadtree shadow lookup:

float4 main( uniform VMem2D shadowMap, float3 shadowCoord ) : COLOR
{
    return shadowMap.vTex2Ds( shadowCoord );
}

Note that the above shader could actually be used for any 2D, random-access Glift structure, and the actual structure used depends on the type to which VMem2D is instantiated at shader compile

Figure 11.2: Adaptive Shadow Maps and Resolution-Matched Shadow Maps address shadow map aliasing problems by storing the shadow map in a quadtree-like structure. The Glift data structure adaptively maps (b, adaptive tiling) the virtual domain (shadow map space (a)) into uniformly sized physical pages (d) using a mipmap hierarchy of page tables (c). The resulting rendering (e) shows adaptive refinement of the virtual domain.

time.

The single-level page table structure naturally supports O(1), uniform-consistency read accesses.

However, it is equally important to support efficient writes and insertions. We implement parallel copies from other GPU buffers by rendering page-size quadrilaterals into the physical memory of the quadtree. Each quad is positioned over a range of contiguous physical addresses in the quadtree and reads data from the source texture.

We implement fast parallel insertions by rendering quadrilaterals into the page table mipmap hierarchy. The page allocation policy allows a page mapped at a resolution coarser than a given mipmap level to be entered into the page table only if no finer-resolution entry exists. We enforce this

condition, even while performing parallel allocations, by keeping a depth buffer for each page table.

We perform parallel allocation by rendering quadrilaterals (one for each new PTE) into each page table with a depth proportional to the new page resolution and the color being the value of the PTE.

The depth test enforces the allocation policy and resolves conflicts between the PTEs. In practice, this allocation scheme is very fast and has not been the bottleneck for any of our test scenes. It also

demonstrates the value of having an atomic test-and-set operation (the depth test) supported within

a parallel pipeline.
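A hypothetical sketch of this allocation pass in immediate-mode OpenGL (assuming an orthographic projection over the page-table texels and an application-defined PTE-to-color encoding) renders one quad per request and lets the depth test keep the finest-resolution entry wherever requests conflict:

#include <GL/gl.h>

// Hypothetical sketch of parallel PTE allocation via the depth test.
// Each request becomes a quad over the page-table texels it covers;
// depth encodes resolution so finer pages win conflicts under GL_LESS.
struct PageRequest {
    float x0, y0, x1, y1;   // page-table region covered by this PTE
    int   level;            // resolution level (0 = finest)
    float r, g, b, a;       // PTE value encoded as a color
};

void allocatePages( const PageRequest* reqs, int numReqs, int numLevels )
{
    glEnable( GL_DEPTH_TEST );
    glDepthFunc( GL_LESS );   // finer resolution (smaller depth) wins
    glBegin( GL_QUADS );
    for ( int i = 0; i < numReqs; ++i ) {
        const PageRequest& q = reqs[i];
        float depth = float( q.level ) / float( numLevels ); // depth from resolution
        glColor4f( q.r, q.g, q.b, q.a );                     // PTE payload
        glVertex3f( q.x0, q.y0, depth );
        glVertex3f( q.x1, q.y0, depth );
        glVertex3f( q.x1, q.y1, depth );
        glVertex3f( q.x0, q.y1, depth );
    }
    glEnd();
}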

Figure 11.3: This adaptive shadow map uses a GPU-based adaptive data structure built with the Glift template library. It has an effective shadow map resolution of 131,072² (using 37 MB of GPU memory, trilinearly filtered). The top-right inset shows the ASM, and the bottom-right inset shows a 2048² standard shadow map.

11.3 Adaptive Shadow Maps on the GPU

Adaptive shadow maps [41] offer a rigorous solution to projective and perspective shadow map

aliasing while maintaining the simplicity of a purely image-based technique. ASMs nearly eliminate

shadow map aliasing by ensuring that the projected area of a screen-space pixel into light space

matches the shadow map sample area. The complexity of the ASM data structure, however, has prevented a GPU-based implementation until now. We present a novel implementation of adaptive

shadow maps (ASMs) that performs all shadow lookups and scene analysis on the GPU, enabling

interactive rendering with ASMs while moving both the light and camera. We support shadow

map effective resolutions up to 131,072² and, unlike previous implementations, provide smooth

transitions between resolution levels by trilinearly filtering the shadow lookups. Example results

can be seen in Figure 11.3.

The ASM algorithm addresses both projective and perspective aliasing problems by adaptively sampling the shadow map and matching its resolution to the pixels in the current camera view. ASMs

build the quadtree via an iterative refinement algorithm that refines along shadow boundaries found

in the current camera view. Shadow map resolutions are estimated based on derivatives of the

shadow map coordinates in screen space, much in the same way that texture mapping computes

mipmap levels [115]. The refinement algorithm begins with a low-resolution seed shadow map.

During each iteration, the current quadtree is analyzed for shadow edges. Edge texels that are not at the correct resolution are refined to the correct resolution by generating new shadow pages and updating the data structure. If the ASM refinement algorithm finds all shadow edges in the current image and the required resolution does not exceed the maximum depth of the quadtree, it can generate accurate, alias-free hard shadows.

The ASM refinement algorithm proceeds as follows and is described in more detail in Algorithm 1.

void refineASM() {
    AnalyzeScene(...);      // Identify shadow pixels with resolution mismatch
    StreamCompaction(...);  // Pack these pixels into small stream
    CpuReadback(...);       // Read refinement request stream
    AllocPages(...);        // Render new PTEs into mipmap page tables
    CreatePages(...);       // Render depth into ASM for each new page
}

1: render low-resolution seed into quadtree memory
2: repeat
3:   for all pixels in image rendered from camera do
4:     calculate (s, t, z_l, ℓ) shadow map coords. and LOD
5:     lookup in quadtree memory
6:     if pixel on shadow edge and page not in ASM then
7:       convert (s, t, z_l, ℓ) to shadow page request
8:   transfer page requests to CPU
9:   remove invalid page requests
10:  generate unique page requests
11:  allocate new page in quadtree
12:  bin requests into superpages
13:  render shadow data into superpages
14:  copy shadow data from superpage to quadtree
15: until page requests == 0

Algorithm 1: The original, iterative ASM algorithm. The algorithm generates a quadtree of small shadow map pages.

The algorithm begins by performing a scene analysis to determine which camera-space pixels re-

quire refinement. A pixel requires refinement if it lies on a shadow boundary and its required

resolution is not in the current ASM. We use a Sobel edge detector to identify shadow boundaries

and compute required resolutions using derivatives of the shadow coordinates. We then pack the

pixels needing refinement into a small contiguous stream using a non-uniform reduction (stream compaction) algorithm [59]. The resulting small image is read back to the CPU to initiate new shadow data generation. The CPU removes duplicate requests and allocates new quadtree pages by rendering page allocations into the GPU-based page tables; we then render the scene geometry from the light into the new physical pages. We repeat this refinement algorithm to convergence.
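For reference, the Sobel operator used for shadow-boundary detection can be sketched as the following CPU routine; this is the standard 3×3 operator, not our actual fragment program, and the function name and threshold are illustrative.

// CPU sketch of Sobel edge detection over per-pixel shadow results
// (0 = lit, 1 = shadowed). A pixel lies on a shadow edge when the
// gradient magnitude exceeds a threshold.
#include <algorithm>
#include <cmath>
#include <vector>

bool onShadowEdge(const std::vector<float>& shadow, int w, int h,
                  int x, int y, float threshold = 0.5f) {
    auto s = [&](int i, int j) {             // clamped texture fetch
        i = std::max(0, std::min(w - 1, i));
        j = std::max(0, std::min(h - 1, j));
        return shadow[j * w + i];
    };
    float gx = -s(x-1,y-1) - 2*s(x-1,y) - s(x-1,y+1)
             +  s(x+1,y-1) + 2*s(x+1,y) + s(x+1,y+1);
    float gy = -s(x-1,y-1) - 2*s(x,y-1) - s(x+1,y-1)
             +  s(x-1,y+1) + 2*s(x,y+1) + s(x+1,y+1);
    return std::sqrt(gx * gx + gy * gy) > threshold;
}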

11.3.1 ASM Results

We tested our ASM implementation on an NVIDIA GeForce 6800 GT using a window size of 512²

(see Section 11.4.3 for ASM performance results on a GeForce 7800 GTX and a detailed comparison to resolution-matched shadow maps). For a 100k-polygon model and an effective shadow map resolution of 131,072², our implementation achieves 15–50 frames per second while the camera is moving. We achieve 4–10 fps while interactively moving the light for the same model, thus rebuilding the entire ASM each frame.

Data Structure Results

ASM lookup performance is 73–91% of that of a traditional 2048² shadow map. The table below lists the total frame rate including refinement (FPS) and the speed of ASM lookups relative to a standard 2048² shadow map for a 512² image window. We list results for the bilinearly filtered ASM (ASM L), the bilinearly filtered mipmapped ASM (ASM LMN), and the trilinearly filtered ASM (ASM LML).

Page Size   FPS    ASM L   ASM LMN   ASM LML
8²          13.7   91%     77%       74%
16²         15.6   90%     76%       73%
32²         12.1   89%     75%       73%
64²         12.9   89%     74%       73%

The memory consumed by the ASM is configurable and is the sum of the page table and physical

memory size. In the tests above, we vary these sizes from 16 MB to 85 MB and note that a 2-level page table might significantly reduce the page table memory requirements.

The ASM lookup rates are nearly as efficient as a standard shadow map, and are bound by the cost of

the address translation instructions. This shows that our address translator is not bandwidth bound

and that the paging does not significantly impact memory performance.

ASM Algorithm Results

The total frame rate of our initial implementation was dominated by the cost of the O(n log n)

stream compaction [56, 59] portion of the refinement algorithm. This computation greatly reduces

CPU read back cost at the expense of GPU computation (a net win), but clearly a more efficient algorithm for this operation would further improve our frame rates. We removed this bottleneck

from the application by devising the hybrid scan algorithm described in Section 11.4.2. The new compaction is 3–4 times faster than the Horn algorithm and is no longer a significant bottleneck.

An additional bottleneck is the readback and CPU traversal of the compacted image. The size of

this shadow page request image varies from tens to tens of thousands of pixels and is especially

large when the light position changes. We found that, for dynamic scenes, the readback becomes a

bottleneck above 15 frames per second. As such, we implemented an efficient uniquify operation on the GPU that removes redundancies from the shadow request stream before reading it back to the

CPU. This algorithm is described in detail in Section 11.4.1 and consists of a sort followed by an additional compaction.

For static scenes, ASMs require 1–2 iterations to converge (for both coherent and incoherent shadow receivers). In this case, the work is averaged over a number of frames because the results can be cached between frames. For dynamic scenes, ASMs perform very differently for coherent and incoherent receivers, requiring 15–20 iterations to converge for coherent receivers and

2–4 iterations to converge for incoherent receivers. Surprisingly, both cases result in the same performance. The dynamic, coherent-receiver case generates a small number of requests per iteration

(10–50), while the incoherent case generates a large number of requests per iteration (3000–4000).

This behavior is caused by the edge-finding algorithm used by ASMs. It is much easier to find edges

(and thus make a shadow request) from an incoherent receiver than from a smooth receiver that has no edges of its own. As a result, the performance of dynamic ASMs is similar for both coherent and incoherent receivers.

We also tried adding constraints to the algorithm such as those described in Fernando’s original paper. We tried fulfilling a limited number of shadow page requests per frame in order to bound the required time. The result is a partially-defined shadow that leads to objectionable temporal artifacts.

In addition, while the variance in frame times decreases, the average frame rate does not change significantly because the overall amount of work remains approximately the same. To avoid temporal artifacts, the only constraint we found that worked well was lowering the maximum shadow resolution uniformly for the entire image. This increases performance at the cost of lower-quality shadows overall.

Finally, the application becomes unnecessarily geometry bound with large models (we have tested up to one million triangles) because we lack the frustum-culling optimization used in Fernando et al.'s implementation. This is not a problem for smaller models (< 100k triangles) because we minimize the number of render passes required to generate new ASM data by coalescing page requests.

In conclusion, we achieve 15–50 fps for static scenes and 4–10 fps for dynamic scenes for a 512² image. The GPU scene analysis uses data-parallel algorithmic primitives such as scan, sort, and gather to send a small amount of data back to the CPU for new shadow generation. Overall, further performance improvements are limited by the iterative refinement algorithm. The fact that the number of iterations required to converge to a usable result is unknown and highly variable makes the algorithm problematic for real-time usage where both performance and frame-to-frame coherence must be maintained. However, the algorithm is a valid choice for interactive film preview rendering [103].

11.4 Resolution-Matched Shadow Maps

Although adaptive shadow maps [41], and our real-time adaptation of them (Section 11.3), offer an attractive solution to the projective and perspective aliasing problems of shadow maps, their practical use for dynamic scenes is plagued by an iterative edge-finding step that is costly, takes a highly variable amount of time per frame, and is not guaranteed to converge to a correct solution.

This section introduces a new shadow algorithm, resolution-matched shadow maps, that replaces

ASM’s iterative refinement step with a single-pass allocation scheme. This simplification is based

on the insight that the number of possible shadow page requests in a single frame has a small practical upper bound. The resulting algorithm is up to ten times faster than ASMs for dynamic scenes, has more predictable performance, and delivers more accurate shadows.

For the scenes described in this section, we achieve 20–70 frames per second on static scenes and

12–30 frames per second on dynamic scenes for 512² and 1024² images with a maximum effective shadow resolution of 32,768² texels. The algorithm requires 1–3× more memory than ASMs for

the same shadow resolution and therefore represents a space–time tradeoff.

11.4.1 Algorithm

Principle of Shadow Thrift

We begin our discussion of our design decisions with an observation about the relationship between shadow space and image space. In an adaptive shadow map, each image pixel is mapped to a

shadow texel of the correct shadow resolution, where the size of the texel is as close as possible to the size of the pixel. Consider two adjacent image pixels that are also adjacent samples of an object, and a shadow that is cast onto that surface. Because the pixels are adjacent in image and object space, they will likely be mapped to identical shadow resolutions. And because of the 1:1 mapping between image space and adaptive-shadow-map space, neighboring pixels in image space map to neighboring texels in the adaptive shadow map. We thus draw the conclusion that continuously visible surfaces in image space result in coherent accesses in adaptive-shadow-map space. We

name this observation the Principle of Shadow Thrift, as it is closely related to Peachey's Principle of Texture Thrift [101], which says that the number of mipmapped texels required to texture all visible surfaces in a scene is proportional to the number of pixels rather than the number of surfaces or the size of the textures. The principle of shadow thrift is vital to the success of our algorithm because we take advantage of this coherency in several ways to improve upon the original ASM algorithm.

Figure 11.4: Top: A very difficult scene for most shadow map algorithms. The scene consists of 4,000 self-shadowing hairs, each consisting of 12 line segments, being shadowed by a 32,768² resolution-matched shadow map. Our algorithm closely approximates the goal of generating shadow data at the exact location required by the current camera view, and the performance scales with the number of continuously visible surfaces. Bottom: Close-up of the shadow from the 4,000 hairs. With an image size of 1024², we render the image on the bottom at 30–35 frames per second (fps) when the light is static and 20–25 fps when the light is moving. The large number of occluded surfaces makes the top image more challenging, and we render it at 15–17 fps for static lights and 6–7 fps for moving lights. Our approach is up to 10 times faster than our highly-optimized, GPU-based, adaptive shadow map implementation.

Figure 11.5: We use this robot scene (left) with 66,000 polygons as an example to explain our design decisions. The middle image shows a shadow close-up using our non-iterative, resolution-matched shadow map algorithm, and the right image shows the same view with a standard, 2048² shadow map.

Do all scenes exhibit this coherency property? No; a scene with a different object at each pixel,

a light perpendicular to the eye, and depths distributed across the entire light frustum will exhibit no locality. We submit, however, that interesting scenes are ones that do have continuously visible surfaces and will exhibit this property. For humans to recognize an object, we must see enough of that object to draw a conclusion about it, which implies many pixels will be visible. Conversely, a scene with a different object at each pixel would be quite difficult to understand. In Section 11.4.3, we show that even highly complex, incoherent shadow receivers, such as the furball of Figure 11.4, exhibit substantial locality in shadow space. We take advantage of this locality in three ways: we organize our adaptive shadow map into shadow pages [82], use a single-step, non-iterative algorithm to request shadow pages, and introduce a connected-component step to optimize the scene analysis stage.

Optimizing the ASM Algorithm

As we describe above, ASM's iterative edge-finding algorithm imposes a high cost and high performance variability, and is error-prone (Figure 11.6); however, the advantage of the edge-finding step

is that it only refines on shadow boundaries, which reduces the number of shadow pages that need to

be processed and allocated to render an image. Fortunately, the image-shadow coherency property

described in Section 11.4.1 enables us to eliminate the iteration by requesting a shadow page for

every pixel in the image. Because of the coherency, many of these requests will be redundant, and the number of unique shadow pages is only slightly larger than the number required by the iterative method.

In order to make this single-step method practical, we use data-parallel, GPU-based algorithms to create a set of unique page requests before sending the requests to the CPU to initiate shadow data generation. This section describes our new algorithm (summarized in Algorithm 2) and the design process. In the following discussion, we use the example “robot” scene (Figure 11.5) to show how

each of our optimizations impacts performance for both our new, non-iterative algorithm and the

previous (iterative) ASM implementation.

To begin, if we simply read back all shadow requests to the CPU without any simplification, we achieve 0.5–1 frame per second due to the cost of the CPU processing all of the requests and the time required to transfer data from the GPU to CPU. Fortunately, we can take advantage of the

image-shadow coherency by performing a GPU connected-components analysis before transferring the data to the CPU. We eliminate redundant page requests between neighboring pixels by marking only requests whose immediate neighbor pixels below and to the left of the current pixel request a different page (in the best case, we only mark one request per page). This optimization improves the robot performance by 10–20 times and achieves an average frame rate of 10 fps.

Our next optimization step eliminates invalid page requests before transferring page requests to the

CPU. Invalid requests arise from pixels that do not request a shadow page due to the connected-

components pass, a scene that does not completely cover the viewport, or, for the iterative ASM

algorithm, a shadow page that is already allocated from a previous iteration. Eliminating invalid page requests on the GPU takes the average frame rate from 10 fps to 15 fps.

Our final optimization step is to eliminate all remaining redundant page requests on the GPU before

transferring them back to the CPU. This redundancy may be missed by the connected-components step if occlusion breaks up otherwise continuously visible surfaces. We perform this uniquify oper- ation by first sorting the page requests by page address using GPUSort [48], then marking unique elements by comparing each element with its immediate predecessor. Finally, we compact the sorted list to remove all non-unique elements. For the robot scene, this step results in a slight performance gain, increasing the average frame rate from 15 fps to 17 fps, but performance improves up to 2 times for scenes with significant occlusion, such as the furball scene in Figure 11.4. 103


Figure 11.6: We use a scene (a) with incoherent, fine geometry (4,000 hairs each consisting of 12 line segments) as a stress-test for our resolution-matched shadow map (RMSM) algorithm (b) to compare against traditional, iterative ASMs (c) and standard shadow maps (d). Note the refinement error in the lower-left corner of the ASM image (c).

For comparison, we implemented the same three optimizations for the iterative ASM algorithm. The

first optimization takes the average frame rate for the robot scene from 4 fps to 5 fps. The second optimization increases the average frame rate to 7 fps. The third optimization has no discernible effect on the average frame rate.

The final, resolution-matched shadow map algorithm is:

1: for all pixels in image rendered from camera do
2:   calculate (s, t, z_l, ℓ) shadow map coords. and LOD
3:   convert (s, t, z_l, ℓ) to shadow page request
4: eliminate redundant requests via connected-components
5: eliminate invalid requests (compaction)
6: sort page requests
7: compact again to generate unique page requests
8: transfer unique page requests to CPU
9: allocate new page in quadtree
10: bin requests into superpages
11: render shadow data into superpages
12: copy shadow data from superpage to quadtree memory

Algorithm 2: The resolution-matched shadow map (non-iterative) algorithm. The algorithm generates a quadtree of small shadow map pages. All steps except for 8 and 10 are GPU-based computations.

11.4.2 Implementation

Algorithm Phase 1: Generating Requests

Our first phase (Steps 1–3 of Algorithm 2) begins with a geometry pass that generates the shadow coordinates. At each pixel, we calculate the s, t, and z_l coordinates for the light-space shadow map, where s and t are standard shadow map coordinates and z_l is the depth of the current pixel transformed into light space. In order to better handle anisotropy, we have found it necessary to compute the shadow level-of-detail, ℓ, more accurately than OpenGL's mipmapping calculation.

We compute ℓ by

$$dX = \left(\frac{\partial s}{\partial x}, \frac{\partial t}{\partial x}\right), \qquad dY = \left(\frac{\partial s}{\partial y}, \frac{\partial t}{\partial y}\right), \qquad A = \left|dX \times dY\right|, \qquad \ell = \log_2\!\left(\sqrt{A}\right). \tag{11.1}$$

This computes the area of the parallelogram formed by dX and dY, rather than the area of a bounding square used by the OpenGL computation [115].
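The LOD computation of Equation 11.1 reduces to a few instructions. The following C++ transcription assumes the four screen-space derivatives are supplied as arguments; in a fragment program they would come from the hardware derivative instructions (e.g., ddx/ddy).

// Direct transcription of Equation 11.1: the shadow LOD is derived from
// the area of the parallelogram spanned by the screen-space derivatives
// of the shadow coordinates (s, t).
#include <cmath>

float shadowLod(float dsdx, float dtdx, float dsdy, float dtdy) {
    // |dX x dY| in 2D is the parallelogram area
    float area = std::fabs(dsdx * dtdy - dtdx * dsdy);
    return 0.5f * std::log2(area);   // log2(sqrt(A)) = 0.5 * log2(A)
}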

We create a set of unique shadow page requests (Steps 4–8 of Algorithm 2) using several data-

parallel algorithm primitives: sort, scan, and gather. The uniquify algorithm has four stages. First,

the connected-components step eliminates redundant requests between neighbors by marking only

requests whose immediate neighbor pixels below and to the left of the current pixel request a dif-

ferent page. Next, we compact the list to remove all unmarked page requests using a combination

of parallel-prefix scan and gather. Lefohn et al.’s original GPU ASM implementation uses Horn’s

O(n log n) compaction implementation [59] for this task, reporting that this single kernel alone took

85% of the runtime of the entire shadow-mapping pipeline [82]. We developed an alternate, more

efficient O(n) implementation of parallel compaction that is significantly faster. This new imple-

mentation is described in detail in Section 11.4.2. After compaction, we then sort the resulting

stream by request page address using GPUSort [48], then mark unique elements by comparing each

element with its immediate predecessor. Finally, we compact the sorted list to remove all non-unique

elements, and read the compacted stream back to the CPU for the next step.
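A serial C++ reference for this four-stage uniquify pipeline is sketched below, assuming each pixel's request has already been encoded as a single page identifier; on the GPU each stage is a separate data-parallel kernel (connected-components marking, scan-based compaction, GPUSort, and a final compaction), but the input/output behavior is the same.

// CPU reference for the uniquify pipeline (Steps 4-8 of Algorithm 2).
#include <algorithm>
#include <cstdint>
#include <vector>

std::vector<uint32_t> uniquify(const std::vector<uint32_t>& pageIds,
                               int w, int h) {
    std::vector<uint32_t> requests;
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x) {
            uint32_t p = pageIds[y * w + x];
            bool sameLeft  = x > 0 && pageIds[y * w + x - 1] == p;
            bool sameBelow = y > 0 && pageIds[(y - 1) * w + x] == p;
            if (!sameLeft || !sameBelow)          // connected-components mark
                requests.push_back(p);            // compaction of marked requests
        }
    std::sort(requests.begin(), requests.end());  // sort by page address
    requests.erase(std::unique(requests.begin(),  // mark unique elements and
                               requests.end()),   // compact away duplicates
                   requests.end());
    return requests;                              // stream read back to the CPU
}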

Algorithm Phase 2: Generating a quadtree of shadow maps

We generate a quadtree of shadow map pages by rendering into both the quadtree’s page tables and physical memory texture (Steps 9–12 of Algorithm 2). We insert new shadow pages into the GPU quadtree via the parallel memory allocation routine described in Section 11.2.

Next, we write shadow data into the newly allocated pages. In theory, we could render each page as

a separate pass, but this leads to a large number of geometry passes and requires a high-resolution

acceleration structure to achieve good performance. Instead, we again leverage the coherency of

shadow requests and bin pages into 1024×1024 superpages. Each superpage is rendered into a temporary buffer, and the requested shadow pages are copied into the quadtree physical memory by

drawing one quad for each page. All valid pages from a given superpage are processed in parallel.

We discuss performance implications of the superpage approach in Section 11.4.3.
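To illustrate the clustering, a minimal sketch of the binning step follows; it assumes each request is a page coordinate at a given quadtree level and that page and superpage sizes are powers of two. The key packing is illustrative.

// Bin shadow-page requests into 1024x1024-texel superpages: each page
// coordinate is reduced to the superpage containing it, and one geometry
// pass is rendered per occupied superpage.
#include <cstdint>
#include <set>
#include <utility>
#include <vector>

std::set<uint64_t> binIntoSuperpages(const std::vector<std::pair<int,int>>& pages,
                                     int pageSize /* e.g., 32 */, int level) {
    const int pagesPerSuperpage = 1024 / pageSize;
    std::set<uint64_t> superpages;
    for (const auto& p : pages) {
        uint64_t sx = uint64_t(p.first)  / pagesPerSuperpage;
        uint64_t sy = uint64_t(p.second) / pagesPerSuperpage;
        superpages.insert((uint64_t(level) << 40) | (sx << 20) | sy);
    }
    return superpages;   // one render pass per entry
}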

Optimizing for Static Scenes

The above algorithm recomputes shadows on every frame and is hence applicable to scenes with

dynamic geometry or lighting. Some scenes, such as architectural walkthroughs, may be wholly

static, where the only difference from frame to frame is the position of the camera. With a static

scene, the data in the quadtree remains valid between frames; our algorithm can also be used for

these scenes in what we call cached mode by retaining the data structure and incrementally updating

it with any newly requested shadow pixels. Because cached mode only adds information to the stored quadtree, memory usage grows over time, and it may be necessary to periodically flush the structure and rebuild it.

Quality vs. Runtime/Memory Tradeoffs

If the runtime for our shadow algorithm is unacceptably large, application developers may choose

to reduce the quality of the shadows in exchange for higher shadow performance. We provide two

orthogonal methods for doing so.

First, Phase 1 of our algorithm analyzes the shadow coordinates for every pixel and thus guarantees shadow correctness. Developers may choose to reduce the resolution of this analysis step. This reduces the cost of the sort and improves runtime while possibly missing some shadow pages. Note

that in this case, an additional ‘backup’ standard shadow map is necessary to handle lookup requests

not found in the quadtree. In practice, we find that downsampling 2x, 4x, and even 8x is possible

with little or no loss of shadow quality. Interestingly, the viewing scenarios most likely to result in missed shadows are views of incoherent receivers, on which the error may not be perceptible.

Second, developers may also choose to reduce the resolution of the finest level of the shadow map,

which reduces the number of shadow page requests as well as the amount of memory required for the data structure (we also automatically apply this technique if the amount of required shadow data exceeds the amount of physical memory allocated for shadow pages). The result of this operation is a uniform loss of resolution for detailed shadows. As the resolution is reduced, the resulting shadows degrade to be no worse than the perspective and projective aliasing artifacts found in standard shadow map methods. The two constraint techniques can be applied either separately or together.

A hybrid work- and step-efficient scan algorithm

The scan step of the stream compaction algorithm [59] was the main bottleneck of the GPU port of the initial ASM algorithm (see Section 11.3). Here, we describe an alternate implementation of a data-parallel scan operation that is 3–4 times faster than previous implementations. We notice that in each iteration i of the scan pass, only n/2^i elements are doing useful work, where n is the length of the input stream. To remedy this, we implement a hybrid work-efficient and step-efficient scan algorithm [12].

We start by describing the work-efficient scan algorithm. It proceeds in two stages. First, the reduce step shown below reduces the input stream a_0 to a single element (Algorithm 3), which is the sum of all elements in a_0. Thus we start with a_0 and generate the streams a_d that are needed in the down-sweep pass.

1: for d = 1 … log₂(n) do
2:   for i = 0 … n/2^d − 1 do
3:     a_d[i] ← a_{d−1}[2i] + a_{d−1}[2i+1]

Algorithm 3: The reduce (forward) stage of the work-efficient parallel scan algorithm. Note that this stage must store all intermediate results for use by the second phase.

Second, the down-sweep stage uses the streams a_d, which contain partial sums, to generate the final prefix sums (Algorithm 4).

1: for d = log₂(n) − 1 … 0 do
2:   for i = 0 … n/2^d − 1 do
3:     if i > 0 then
4:       if i is odd then
5:         a_d[i] ← a_{d+1}[i/2]
6:       else
7:         a_d[i] ← a_d[i] + a_{d+1}[(i/2) − 1]
8:     else
9:       a_d[i] ← a_d[i]

Algorithm 4: The down-sweep (backward) stage of the work-efficient parallel scan algorithm. Note that this uses the results of the first stage to create the final result.

In contrast to Horn’s step-efficient algorithm, the size of the stream is halved in each pass; however,

the number of passes is doubled. In spite of the increase in the number of passes, the run-time complexity of this algorithm is O(n), compared to the O(n log n) run-time complexity of Horn's algorithm.

For streams shorter than the number of fragments that the GPU can execute in parallel, the work-efficient algorithm is inefficient: it does not utilize the available parallelism and executes double the number of passes. Thus we use a hybrid algorithm that combines the step-efficient and work-efficient scan algorithms. We execute the reduce step of the work-efficient algorithm until the size of the stream is down to a configurable, predetermined value equal to the number of fragments the GPU can operate on in parallel. We then run the step-efficient algorithm to compute the prefix sum for the small stream. Finally, we run the down-sweep step to update the partial sums. Our hybrid algorithm uses the computational resources of the GPU efficiently: it does not do wasteful computation when the stream is large and does not execute wasteful passes when the stream is small enough for the GPU to operate on all fragments in parallel.
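For clarity, the following serial C++ reference implements the reduce and down-sweep stages of Algorithms 3 and 4, producing an inclusive prefix sum; it assumes a power-of-two stream length. The hybrid GPU version simply stops the reduce stage once the stream reaches the machine's parallel width, scans the small remaining stream with the step-efficient method, and then runs the down-sweep.

// Serial reference for the work-efficient scan (Algorithms 3 and 4).
// a[d] holds the level-d stream; a[0] ends up holding the inclusive
// prefix sums of the input.
#include <vector>

std::vector<int> inclusiveScan(const std::vector<int>& input) {
    int n = int(input.size());                    // assumed a power of two
    std::vector<std::vector<int>> a;
    a.push_back(input);
    for (int d = 1; (n >> d) >= 1; ++d) {         // reduce (Algorithm 3)
        std::vector<int> level(n >> d);
        for (int i = 0; i < (n >> d); ++i)
            level[i] = a[d-1][2*i] + a[d-1][2*i + 1];
        a.push_back(level);
    }
    for (int d = int(a.size()) - 2; d >= 0; --d)  // down-sweep (Algorithm 4)
        for (int i = int(a[d].size()) - 1; i > 0; --i)
            a[d][i] = (i % 2 == 1) ? a[d+1][i/2]
                                   : a[d][i] + a[d+1][i/2 - 1];
    return a[0];
}

For example, the input {1, 2, 3, 4} reduces to the streams {3, 7} and {10}; the down-sweep then produces the inclusive prefix sums {1, 3, 6, 10}.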

Our implementation of this hybrid algorithm is 3 times faster than Horn's algorithm on a 512×512 stream and 3.8 times faster on a 1024×1024 stream.

11.4.3 Results and Discussion

We evaluate the performance, memory usage, and quality of four variants of GPU adaptive shadow

maps (including RMSMs) and standard shadow maps for five scenes. The four ASM variants in-

clude the design tradeoffs discussed in Section 11.4.1 and are iterative versus non-iterative and CPU-

based versus GPU-based scene analysis. The resolution-matched shadow map algorithm refers to

using both the non-iterative and GPU-based scene analysis.

Unless otherwise noted, the ASMs use a maximum effective shadow resolution of 32,768² and a page size of 32² texels. The standard shadow map has 2048² texels and uses OpenGL hardware

shadow maps. All results were collected on a 2.4 GHz AMD Athlon system with 1 GB of RAM

running Microsoft Windows XP. The GPU is an NVIDIA GeForce 7800 GTX running NVIDIA

version 83.21 drivers.

The five scenes included in our evaluation are:

Scene       Number of Primitives   Image
Furball     45,000                 Figures 11.4 and 11.6
Robot       61,126                 Figure 11.5
Skeletons   80,000                 Figure 11.8
City        58,000                 Figure 11.9
Trees       48,000                 Figure 11.16

The robot, skeletons, and city scenes are representative of scenes used in recent shadow papers and represent a single character, a collection of detailed characters, and an outdoor scene with relatively simple models of various sizes. The trees scene is slightly more difficult, having a mixture of fine,

incoherent geometry and coarse objects. The furball is the most difficult scene, with almost all of the geometry being small hairs represented as individual lines. The many incoherent receivers result in many thin occluded surfaces and are a tough case for nearly all shadow map algorithms.

Performance Analysis

We evaluate the performance of our algorithm in three ways: across various scenes, across image sizes, and across multiple generations of graphics processors.

Figures 11.7(a) and 11.7(b) show the performance results for all five scenes for standard and adap-

tive shadow map algorithms at an image size of 1024². We achieve 30–38 frames per second (fps) for most static scenes and 13–28 fps for most dynamic scenes. The exception is the view of the furball scene in which the furball covers the entire viewport (such as Figure 11.6, left). This difficult scene has a large number of visible surfaces and therefore generates many more page requests than the other scenes. Note that only the combination of the non-iterative ASM algorithm and GPU-based, data-parallel scene analysis results in real-time performance.

For most scenes, resolution-matched shadows are 2–3 times faster than the iterative ASM method,

and up to 10 times faster in some cases. In no case is it slower than the iterative method. RMSMs

are approximately 2 times slower than standard shadow maps for static scenes and 2–4 times slower

for dynamic scenes.

The results also show that the data-parallel GPU scene analysis algorithms are required in order to

take advantage of the image-shadow coherence described in Section 11.4.1. Using the non-iterative

method with CPU-based analysis yields 0.5–1 frame per second because all pixels always generate

a shadow page request. Computing the unique page requests on the GPU substantially reduces the amount of data transferred to the CPU.

The performance of RMSMs scales much better with image size than the iterative or CPU-based

ASM methods. Figure 11.10 shows the results of the ASM variants running on two views of the furball scene at 512² and 1024² image resolution. The CPU-based methods scale linearly with the number of pixels; however, the GPU-based methods take only 1–2 times as long to shadow 4 times

as many pixels.

Resolution-matched shadow maps are arguably not fast enough for games released today; however, we show that the approach scales well with GPU hardware improvements and will therefore likely be viable in the near future. Figure 11.11 shows the performance of RMSMs running on two successive generations of GPUs. We observe a 2–5 times speedup between them.

Geometry Performance

Generating shadow data for the quadtree requires rendering portions of the scene at various resolutions. As described in Section 11.4.2, we generate shadow data by coalescing page requests into superpages. Each rendered superpage is one geometry pass over a subset of the scene. Figure 11.12(a) shows the number of superpages rendered to shadow various views of the five scenes. We see that

the iterative method requires 1–4 times more render passes than the non-iterative method.

Similarly, Figure 11.12(b) shows how the number of geometry passes scales with image size. Due to the increased coherency in the larger image, both techniques require less than two times as many passes to shadow four times as many pixels.

Memory Analysis

Removing the iterative edge-finding step from the ASM algorithm represents a space-time tradeoff.

While the previous section describes the performance benefits of the approach, this section describes the memory impact of the design. We analyze the memory consumption, utilization, and coherency for both iterative and non-iterative adaptive shadow maps.

Figure 11.13(a) shows the number of unique pages requested for various views of the five scenes with a dynamic light. The non-iterative method requires 1–3 times more shadow pages than the

edge-finding iterative method. Figure 11.13(b) shows how the memory requirements scale with image size.

RMSM Memory Summary. At a 512² image size, with effective resolution up to 32,768², a 2048² physical memory buffer (16.7 MB) is typically sufficient to store the shadow pages. With this resolution and a 32² page size, the page table occupies 4.3 MB, for a total of 21 MB. Increasing the image resolution to 1024², with the same shadow resolution, often requires a larger physical memory buffer for shadow pages. We tested up to the maximum size of 4096². A 2048² shadow buffer is sufficient for a 1024² image size if the maximum shadow resolution is limited to 8192².

Memory Usage Efficiency

In Section 11.4.1, we asserted that shadow references in real scenes exhibit substantial locality within a page. We expect that any shadow page we render will result in a significant number of references into that page from the scene.

Figure 11.14 shows that this assertion is correct for the majority of the camera path through our test scenes. Despite a large variation in the shadow complexity over these frames, all frames have substantial locality within a page, ranging from dozens to hundreds of unique shadow texel accesses per page. Because of this locality, a significant fraction (on average, about half) of all

generated shadow texels are actually used in the scene. Note that coherent shadow receivers such

as walls achieve the highest locality and most efficient use of shadow memory (50%–70%), while

incoherent receivers such as hair and leaves result in lower reuse (10%–30%).

Another measure of coherence is the ratio of unique page requests to rendered superpages. As described in Sections 11.4.2 and 11.4.3, we cluster the requested shadow pages into superpages and render all data for the superpage with a single geometry pass. For the uncached, non-iterative

ASM algorithm, superpages reduce the number of required geometry passes by a factor of 50 to

150 (see Figures 11.12(a) and 11.13(a)). In contrast, the cached ASM algorithm benefits less from

superpaging (achieving 2–20 times compression) and would benefit more from drawing individual

pages using aggressive frustum culling.

Figure 11.15 takes a closer look at two representative frames of our furball flythrough. Note that

very few of the possible quadtree pages are allocated or accessed. Those pages that are allocated, even with incoherent shadow receivers, exhibit substantial locality.

Limitations of Performance

Our implementation is limited in performance by two factors. The first is the sort step. Despite the initial compaction step that eliminates a large fraction of page requests, sort is an inherently expensive algorithm (up to 20 times as expensive as compaction) that takes a significant amount of the overall runtime. It is also a step that takes a variable amount of time; incoherent scenes will have a larger input to sort than coherent scenes. We analyzed the number of elements sorted in two camera paths for both cached and non-cached ASMs. For the furball scene, the average number of non-unique page requests in the coherent-receiver views is about 3000 pages per frame.

Sorting these elements takes less than 0.5 ms. When viewing the furball, however, the number of

requests is 140,000, taking about 30 ms to sort.

Thus, for scenes with simple geometry, sort is both our largest cost and our largest source of vari-

ance. Reducing this cost requires reducing the number of initial (non-unique) page requests, which

we can control in two ways: avoiding views of highly incoherent shadow receivers or reducing the

resolution of the shadow analysis (Section 11.4.2). We note that users can animate these parameters

of our algorithm based on camera paths if a performance guarantee is required for known paths.

Also, unlike ASM’s iterative refinement, the upper bound for our sort step is easily determined a

priori.

In addition to sorting shadow page requests, the other dominant cost of our algorithm is generating

shadow data. The cost of this stage depends on the number of unique shadow page requests, the

coherence of those pages in shadow space, and a geometry engine with efficient frustum culling

support. Our current implementation does not use frustum culling, but as noted in Lloyd et al. [86],

geometry engines with very efficient culling are available and would improve the performance of

our shadow data generation stage.

Quality-Runtime Tradeoffs

In Section 11.4.2, we describe two methods of trading off performance for quality: reducing the resolution of the shadow-coordinate analysis computation, or reducing the finest LOD in our quadtree data structure. Figure 11.17 shows that using either technique improves performance, and using both techniques together improves performance further. Figure 11.16 shows the results of

various shadow resolutions on a close-up of the tree scene. In general, reducing the resolution of the analysis gives a solid gain in performance for very little loss in visual detail (Figure 11.16), and reducing the resolution of the finest LOD is recommended for particularly incoherent receivers that would otherwise exhibit poor locality in the quadtree data structure.

Limitations

The largest limitation of our method is the additional memory and time cost over other shadow map approaches. We’ve shown the technique is capable of interactively rendering complex scenes at

1024² image resolution on current hardware. The method achieves a 2–5 times speedup across the last two generations of GPUs but will likely take 1–2 more generations before being appropriate for applications such as games. However, the technique is immediately useful for applications such as interactive film preview rendering [103].

Another limitation is that, while ASMs closely approximate the goal of generating alias-free hard

shadows by rendering depth samples at the position requested by the shadow coordinates from the

current image, the discretization of the levels-of-detail (LOD) means that the sample positions are

approximate and not exact. The errors arising from the discrete LODs are not usually visible, but

we have seen artifacts in shadows of very thin geometry.

Conclusions

In summary, we present an image-based shadow algorithm, resolution-matched shadow maps, that

delivers correctly-sampled hard shadows for dynamic and static scenes at interactive rates on current

graphics hardware. The technique is currently suitable for real-time film preview rendering and

will become viable for real-time interactive 3D applications such as games within 1–2 generations

of graphics hardware. The technique is also applicable to offline software renderers that support

shadow mapping.

Performance for static scene (1024² image, frames per second)
Scene       SM   A/I/C   A/N/C   A/I/G   A/N/G
Robot       60   6–7     0.1–1   11–15   30–32
Skeletons   70   7–9     0.5–1   16–20   34–38
Trees       70   8–9     0.5–1   12–18   30–35
City        70   6–8     0.5–1   11–13   20–25
Furball-W   70   8–9     0.5–1   20–30   30–35
Furball-F   70   5–6     0.5–1   7–10    15–17

(a)

Performance for dynamic scene (1024² image, frames per second)
Scene       SM   A/I/C   A/N/C   A/I/G   A/N/G
Robot       45   3–4     0.5–1   4–7     13–24
Skeletons   45   5–7     0.5–1   7–8     16–20
Trees       45   3–4     0.5–1   6–7     13–14
City        70   2–3     0.5–1   6–8     16–28
Furball-W   40   1–2     0.5–1   2–3     20–25
Furball-F   40   0.5–1   0.5–1   4–5     6–7

(b)

Figure 11.7: Performance comparison for both a static light (11.7(a)) and a dynamic light (11.7(b)) for a 2048² standard shadow map (SM) and four variants of 32,768² GPU adaptive shadow maps (A). The ASM variants include Iterative (I) versus Non-Iterative (N) and CPU-based (C) versus GPU-based (G) scene analysis. Furball-W is a view of the furball scene looking at the wall (Figure 11.4, right), and Furball-F is the same scene but looking at the furball (Figure 11.4, left). The right-most column shows the performance for our non-iterative, GPU-based ASM. It achieves highly interactive frame rates for all static scenes and is 2–3 times faster than other ASM methods for dynamic scenes. Note that only the combination of both the non-iterative algorithm and GPU-based scene analysis results in high performance. This combination is what we call resolution-matched shadow maps.

Figure 11.8: Top: the skeleton scene shadowed with a 32,768² maximum effective resolution RMSM. The scene consists of 80,000 primitives and, at 1024² image resolution, renders with an RMSM at 16–20 frames per second for a dynamic light and 34–38 frames per second for a static light. Bottom-left: a shadow closeup with the resolution-matched shadow map. Bottom-right: a 2048² standard shadow map.

Figure 11.9: Top: the city scene shadowed with a 32,768² maximum effective resolution RMSM. The scene consists of 58,000 primitives and, at 1024² image resolution, renders at 16–28 frames per second for a dynamic light and 20–25 frames per second for a static light. Bottom-left: a closeup of the bike rack and bench circled in the left image using resolution-matched shadow maps. Bottom-right: the same closeup with a 2048² standard shadow map.

Performance with image size (dynamic furball scene, frames per second)
512² image        SM   A/I/C   A/N/C   A/I/G   A/N/G
Look at wall      70   4–7     3–4     4–10    20–30
Look at furball   70   4–5     3–4     4–10    12–15
1024² image
Look at wall      40   1–2     0.5–1   2–3     20–25
Look at furball   40   0.5–1   0.5–1   4–5     6–7

Figure 11.10: Performance scaling with varying image resolution for a 2048² standard shadow map (SM) and four adaptive shadow map (A) variants: iterative (I), non-iterative (N), CPU-based analysis (C), and GPU-based analysis (G). Our optimized ASM algorithm (far right) is nearly the same speed for both image sizes with large, continuously visible surfaces. For the incoherent furball view, our method requires only twice as much time to shadow 4 times as many texels. Also note that the method is up to 10 times faster than the iterative algorithm.

Performance with GPU generations (1024² image, dynamic tree scene, frames per second)
View         GeForce 6800 GT   GeForce 7800 GTX
Zoomed out   5–6               10–12
Zoomed in    10–12             20–23
In tree      2–4               10–11

Figure 11.11: Performance scaling across two generations of graphics processing units (GPUs) for our non-iterative, GPU-based adaptive shadow map algorithm. Note that the performance improves 2–5 times with only one generation of GPU.

Rendered superpages: iterative vs. non-iterative (1024² image, dynamic light, number of superpages)
Scene       Iterative   Non-Iterative
Robot       35–61       31–32
Skeletons   30–63       35–64
Trees       40–104      32–77
City        53–154      59–149
Furball     45–155      35–39

(a)

Rendered superpages with image size (dynamic furball scene, number of superpages)
512² image        Iterative   Non-Iterative
Look at wall      79          15
Look at furball   42          34
1024² image
Look at wall      155         35
Look at furball   49          36

(b)

Figure 11.12: Comparison of the number of rendered 1024² superpages for dynamic scenes for the iterative (ASM) and non-iterative (RMSM) adaptive shadow map algorithms. The iterative edge-finding algorithm requires 1–4 times more superpages (i.e., culled geometry passes) to build the quadtree shadow map than the non-iterative solution (Figure 11.12(a)). Figure 11.12(b) shows that both methods scale similarly with image resolution, and, due to the increased coherency in the larger image, they require less than two times as many passes to shadow four times as many pixels. Note that, for static scenes where the quadtree shadow map is valid for multiple frames, a very small number of superpages are required per frame.

Memory consumption: iterative vs. non-iterative (1024² image, number of 32² pages)
Scene       Iterative   Non-Iterative
Robot       737–2467    3294–4666
Skeletons   1300–2200   2900–5600
Trees       2300–5500   3800–5900
City        1300–2900   3800–7100
Furball     1300–5000   4800–5900

(a)

Memory consumption with image size (furball scene, number of 32² pages)
512² image        Iterative   Non-Iterative
Look at wall      539         1376
Look at furball   3506        3948
1024² image
Look at wall      1446        4886
Look at furball   5056        5472

(b)

Figure 11.13: Comparison of memory consumption for iterative (ASM) and non-iterative (RMSM) 32,768² adaptive shadow maps. The iterative version refines only on shadow edges, thereby using less memory but performing more slowly. The faster, non-iterative method uses approximately 1–3 times more memory (Figure 11.13(a)). Figure 11.13(b) shows the scaling of memory usage with image size. To shadow four times more pixels in the wall view, the iterative and non-iterative methods require 2.7 and 3.5 times more memory, respectively. For the incoherent furball view, both methods require 1.4 times more memory. Note that static scenes, where the quadtree shadow map is valid for multiple frames, generate a very small number of page requests per frame (1–50).

[Figure 11.14: two graphs plotting, per frame, the fraction of generated texels touched (left axis, 0 to 0.8) and the number of touched texels per page (right axis, 0 to 800) against frame number for the furball (frames 100–250) and tree (frames 20–140) flythroughs.]

Figure 11.14: These graphs were measured on the portions of the furball and tree scenes that required a shadow lookup for every pixel. In these scenes, shadow pages each had 32×32 shadow entries. Over these scenes, we observe substantial locality of reference within a page. Coherent shadow receivers (e.g., planar objects) deliver the best locality, while incoherent receivers (e.g., hair or leaves) result in less coherence between shadow accesses and therefore less efficient use of shadow memory.

Figure 11.15: Visualization of pages allocated in the quadtree data structure we use to store shadow pages. We show levels-of-detail (LOD) for two frames of the furball flythrough. The left images are from frame 150, which had a fairly coherent shadow receiver. The right images, from frame 225, had a highly incoherent shadow receiver. The top images are visualizations of the entire LOD; the bottom images are closeups of their interesting regions. In the visualizations, pages that were never requested and are thus never allocated are colored green. Pages that were requested are colored from black (few texels in the page were accessed) to white (all texels in the page were accessed) according to the number of unique accesses made to that page. We conclude that even with incoherent shadow receivers, accesses into the quadtree data structure still exhibit substantial locality.

Figure 11.16: Quality comparison for a range of resolution-matched shadow maps. At left is a wide-angle view of the scene of interest, which contains 45,000 triangles. From top-left to bottom-right, we show three closeups of that scene with 32,768² maximum effective resolution, requiring 21.3 MB of GPU memory; 16,384² maximum effective resolution, requiring 17.3 MB of GPU memory; and 8,192² maximum effective resolution, requiring 16.3 MB of GPU memory. These examples render at 30–60 frames per second for static scenes and 12–30 frames per second for dynamic scenes at an image resolution of 512² on an NVIDIA GeForce 7800 GTX GPU.

[Figure 11.17: a graph of FPS, normalized to “base” (= 1), against frame number (0–300) of the furball flythrough for four configurations: base, downsample, LOD, and combo.]

Figure 11.17: We compare frames-per-second performance between three adjustments to our base algorithm. Higher numbers are faster. The graph is normalized to the performance of “base”, which performs full-resolution shadow coordinate analysis and supports shadow LODs down to a fine level of detail (32,768×32,768). “Downsample” reduces the size of the screen-space analysis pass by a factor of 4 in each direction; “LOD” reduces the resolution of the finest level of detail in the ASM to 8192×8192; and “combo” does both. The frames at the right of the graph, the slowest frames in the flythrough, correspond to frames with many incoherent shadow receivers: a challenging case for most shadow algorithms, but one we can counter by reducing shadow quality.

Chapter 12

A Heat Diffusion Model for Interactive Depth of Field Simulation

The previous two chapters described interactive rendering algorithms that were not previously

demonstrated on GPUs due to their data structure complexity. In contrast, this chapter describes

a new interactive rendering algorithm whose implementation has fairly simple data structure re-

quirements, yet was previously thought to be impossible due to the data access patterns required by tridiagonal linear solvers. Glift's iterator abstraction clarified the refactoring required to implement the first direct tridiagonal linear solver on the GPU and thereby enable a new algorithm

for interactive depth-of-field. The implementation also demonstrates an efficient GPU-based infi-

nite impulse response (IIR) filter that takes better advantage of the GPU’s parallelism than the one known previous implementation [49].

In real-world photography, lens focusing effects play an important role in image composition.

Blurry backgrounds focus a viewer's eye on the foreground objects and blurry foreground objects provide context for the in-focus subjects. Generally, the only time that an entire image is in focus is in brightly-lit, outdoor photographs where the lens aperture is very small. Not surprisingly, high-end, film-quality computer graphics uses depth-of-field effects in the same way as live-action cinematographers to direct the viewer's eye to important objects in a scene. Unfortunately, accurate simulation of depth-of-field is very costly, and thus interactive computer graphics traditionally ignores lens focusing effects and generates perfectly in-focus images. While approximate techniques exist for interactive depth-of-field effects, they are plagued by either slow performance or objectionable artifacts.

This chapter introduces a new model for interactively simulating depth-of-field (DOF) effects that overcomes the artifacts and performance problems of previous methods. Our solution casts the post-process depth-of-field problem in terms of a non-uniform heat diffusion model that can be solved in constant time per pixel. In order to use the approach for interactive rendering, we need an implementation built entirely on the GPU. The algorithm requires solving many tridiagonal linear systems in parallel. We begin by defining all matrices and vectors in terms of Glift components. We then describe a novel data-parallel, GPU-based, direct tridiagonal solver algorithm in terms of GPU iterators.

12.1 Prior work

Approaches to computing DOF vary in the detail with which they model the lens and light transport, their performance-quality tradeoffs and in their suitability to implementation on graphics hardware.

Demers provides a recent survey of approaches to the DOF problem [35].

In order to generate a high-accuracy result, a DOF computation must combine information about rays that pass through different parts of a lens. The accumulation buffer [51] takes this approach, simulating DOF effects by blending together the results of multiple renderings, each taken from slightly different viewpoints. Unfortunately, the method requires a large collection of renderings to achieve a pleasing result (Haeberli and Akeley [51] use 23 to 66 passes), and the enormous geometric complexity of film-quality scenes makes this prohibitive. It is not unusual for the geometry of film-quality scenes to exceed any available RAM, so doing multiple passes through the original geometry is out of the question for interactive film preview, leaving little choice other than to rely largely on the post-processing approach of Potmesil and Chakravarty [106]. Their work has inspired a variety of algorithms which can be divided into two major categories:

Scatter techniques (also known as forward-mapping techniques) iterate through the source color image, computing the circle of confusion for each source pixel and splatting its contributions to each destination pixel. Proper compositing requires a sort from back to front, and the blending must be done with high precision. Distributing energy properly in the face of occlusions is also a difficult task. Though scatter techniques are commonly used in non-real-time post-processing packages [35], they are not the techniques of choice for today’s real-time applications primarily because of the cost of the sort, the lack of high-precision blending on graphics hardware, and the difficulty of conserving total image energy.

Gather techniques (also known as reverse-mapping techniques) do the opposite: they iterate through the destination image, computing the circle of confusion for each destination pixel and with it, gathering information from each source pixel to form the final image. The gather operation is better suited for graphics hardware than scatter. Indeed, the most popular real-time DOF implementations today all use this technique [35, 75, 96, 112, 113, 135]. Gather methods generally suffer from edge bleed at depth discontinuities; however, various modifications can mitigate the problem for nearly-in-focus regions. All of the real-time methods listed above use various methods to avoid implementing large, nonuniform blur kernels, including using a fixed number of samples spread across a varying-size region, creating a mipmap hierarchy, or uniformly sampling a fixed-size region and applying weights to those samples based on the circle of confusion. While the approximations are acceptable for some applications, these techniques cannot support large blurs without bleeding, aliasing, or performance problems. Another problem with the gather method is that implementing the blur with a convolution instead of the over blending operation performed by the scatter methods results in incorrect depth ordering.

The work most closely related to ours is that of Bertalmio et al. [8]. They introduce the idea that the depth-of-field problem can be cast as a heat diffusion problem. However, they do not address the problem of blurring underneath sharp regions (see Section 12.2.3) and use an iterative, forward Euler solver that takes m² steps to achieve a blur of width m. We solve the blurring-underneath problem by introducing a specific-heat term into the model, and we provide an efficient, non-iterative implementation by recognizing that we can use an alternating direction implicit (ADI) solver and by inventing a data-parallel GPU implementation of the solver.

Even if efficiently implemented on the target hardware, standard gather and scatter techniques have poor asymptotic complexity because the amount of work they do is the product of the number of pixels in the image and the average area of the circle of confusion. For an n×n image, these algorithms have a worst-case complexity of O(n⁴), which is clearly problematic for high-resolution

film-quality images. Such worst-case examples can occur in two cases. First, when a blurry fore-

ground object is very close to the lens and covers a large portion of the image. Second, when the

user focuses the camera close to the lens and the background becomes very blurry. Most real-time

DOF approximations avoid these hard cases and support a limited amount of blur to cover common

cases. Such limitations are not acceptable for film preview applications.

The method described in this chapter is a gather method that supports large blurs with a constant

amount of work per pixel (independent of blur radius), respects boundaries between objects, and

runs at interactive rates on current GPUs. Like all gather methods, our method suffers from incorrect

compositing via convolution. Section 12.2.5 describes a layering method for limiting the impact of

this artifact.

12.2 Theory and Algorithm

The goal of post-process DOF solutions is to perform a variable-width spatial blur of the original image, where the width of blur for each pixel depends on the circle of confusion computed from the depth value. This section derives our algorithm for performing this variable-width blur with a constant amount of work per pixel, irrespective of the blur size, using recursive filters. The algorithm

achieves this goal using an alternating direction implicit (ADI) solution to the inhomogeneous heat diffusion equation. Another way of describing the solution is as a variable-width, recursive filter.

Recursive filters use the results of the previous step to compute the results of the current step, and as a result they can transfer information over a large area in a constant amount of time. Our implementation of recursive filters requires solving many tridiagonal linear systems in parallel on the GPU. We use Glift data structure abstractions to represent the arrays of vectors and tridiagonal matrices, and Glift iterators to implement the linear solver on these structures.
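To make the core numerical kernel concrete, here is a serial C++ sketch of the classic Thomas algorithm for a single tridiagonal system; the GPU implementation instead solves one such system per image row or column in parallel over Glift structures, so this sketch shows only the mathematics being solved, not our data-parallel formulation.

// Thomas algorithm for a tridiagonal system (n >= 1):
// a = sub-diagonal (a[0] unused), b = diagonal, c = super-diagonal,
// d = right-hand side. Arguments are copied and modified in place.
#include <vector>

std::vector<double> solveTridiagonal(std::vector<double> a, std::vector<double> b,
                                     std::vector<double> c, std::vector<double> d) {
    const int n = int(b.size());
    for (int i = 1; i < n; ++i) {            // forward elimination
        double m = a[i] / b[i - 1];
        b[i] -= m * c[i - 1];
        d[i] -= m * d[i - 1];
    }
    std::vector<double> x(n);
    x[n - 1] = d[n - 1] / b[n - 1];
    for (int i = n - 2; i >= 0; --i)         // back substitution
        x[i] = (d[i] - c[i] * x[i + 1]) / b[i];
    return x;
}

Note that both loops carry a serial dependence from one element to the next; this is exactly the recursive-filter access pattern discussed above.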

12.2.1 Circle of Confusion

We begin with the thin lens equation. This geometric optics equation computes the size of the circle of confusion based on physical lens properties and an object’s depth. The thin lens equation is

$$d = \left|\frac{A\,F\,(P - z)}{z\,(P - F)}\right|, \tag{12.1}$$

where d is the diameter of the circle of confusion, A is the lens aperture, F is the lens focal length,

P is the distance from the lens iris to the focal plane, and z is the distance of the object from the iris.

All units must be in a consistent distance unit (e.g., meters, millimeters, etc.). We convert d to pixels via the film size and the number of pixels in the image. We also assume an isotropic, circular circle of confusion, but the model can easily be extended to anisotropic ellipsoids of confusion. Note that A is usually determined by the f-number or f/stop. They are related by

$$A = \frac{F}{f/\mathrm{stop}}. \tag{12.2}$$

Equation 12.1 defines the circle of confusion for camera-space depth values that are pre-projective

and linearly spaced. However, the values in an OpenGL depth buffer are in a post-projective, normalized, non-linear space. In this case, the depth values must be converted to camera-space distance

before evaluating Equation 12.1 by

$$z = \frac{z_{far}\,z_{near}}{z_{GL}(z_{far} - z_{near}) - z_{far}}, \tag{12.3}$$

where z_near and z_far are the near and far planes in camera space, z_GL is the depth value from the OpenGL depth buffer, and z is the resulting camera-space depth value used in Equation 12.1.

Combining Equations 12.1 and 12.3 yields

$$d_{scale} = \frac{AFP(z_{far} - z_{near})}{(P - F)\,z_{near}\,z_{far}}, \qquad d_{bias} = \frac{AF(z_{near} - P)}{(P - F)\,z_{near}}, \qquad d = \left|z_{GL}\,d_{scale} + d_{bias}\right| \tag{12.4}$$

[Figure 12.1 appears here: a log-scale plot of the circle of confusion (mm), spanning roughly 0.001 to 1000 mm, versus depth (mm), over depths from 0 to 10,000 mm.]

Figure 12.1: Graph of the circle of confusion as a function of distance from the camera (Equation 12.1). The camera parameters are an f/stop of 2.0, a focal length, F, of 75 millimeters, and a focal plane, P, at 3 meters. Note that the circle of confusion goes to infinity at the camera, has a deep well near the focal plane (i.e., approaches and leaves zero quickly), and quickly approaches an asymptotic value for depths farther from the camera than P.

where d_scale and d_bias can be pre-computed based only on camera parameters [35]. Figure 12.1 shows a graph of Equation 12.1 for a camera with an f/stop of 2.0, a focal length, F, of 75 millimeters, and the focal plane, P, set to 3 meters. Note that the circle of confusion goes to infinity at the camera, has a deep well near the focal plane, and quickly approaches an asymptotic value for depths farther from the camera than P.
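To make the mapping concrete, the following minimal C++ sketch evaluates Equations 12.2 and 12.4; all names are illustrative rather than taken from our implementation, which performs this computation per pixel in a fragment program.

    // Illustrative CPU sketch of Equations 12.2 and 12.4 (hypothetical names).
    // All lengths must use one consistent unit.
    #include <cmath>

    struct CocTerms { float dScale, dBias; };

    // Pre-compute d_scale and d_bias from camera parameters (Equation 12.4).
    CocTerms precomputeCocTerms(float F,      // focal length
                                float fStop,  // f/stop; aperture A = F / fStop (Eq. 12.2)
                                float P,      // distance to the focal plane
                                float zNear, float zFar)
    {
        const float A = F / fStop;
        CocTerms t;
        t.dScale = (A * F * P * (zFar - zNear)) / ((P - F) * zNear * zFar);
        t.dBias  = (A * F * (zNear - P)) / ((P - F) * zNear);
        return t;
    }

    // Circle-of-confusion diameter for one pixel, directly from the OpenGL
    // depth value zGL; Equation 12.4 folds in the conversion of Equation 12.3.
    float cocDiameter(const CocTerms& t, float zGL)
    {
        return std::fabs(zGL * t.dScale + t.dBias);
    }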

For the purposes of this discussion, we define three depth regions: foreground, midground, and background. The foreground consists of regions closer to the camera than P, the midground of in-focus regions near P, and the background of regions farther from the camera than P.

We achieve a constant-time-per-pixel, variable-width spatial blur by casting depth-of-field as a heat

diffusion problem. We treat the RGB color values in the input image, x(u,v), as initial heat values

applied to a non-uniform material defined by the circles of confusion. The circles of confusion

define a thermal conductivity that varies across the material (i.e., at each pixel). We compute the

final DOF-blurred image, y(u,v), by letting the heat diffuse on the material. The final color of

a pixel is therefore a function of its initial color and the coupling to its neighbors defined by the magnitude of the circle of confusion at that pixel.

We describe the algorithm in two stages: building the material model and solving the heat diffusion

equation on the material. The material model is defined by the circles of confusion at each pixel,

and the initial heat distribution is defined by the input colors.

We begin with the two-dimensional, non-uniform heat diffusion equation,

$$\rho(u,v)\,\gamma(u,v)\,\frac{\partial h(u,v)}{\partial t} = \nabla \cdot \bigl(\beta(u,v)\,\nabla h(u,v)\bigr), \tag{12.5}$$

where ρ(u,v) is the material density, γ(u,v) is the specific heat, β(u,v) is the thermal conductivity, h(u,v) is the heat distribution at time t, and ∇ denotes the vector of partial derivatives in each spatial dimension. This equation describes how heat diffuses over time based on an initial heat distribution

and a material defined by a density, thermal conductivity, and specific heat. In our algorithm, the

initial heat distribution is the RGB values of the input image, the final heat distribution represents

the RGB values of the DOF-blurred image, and the material properties are defined by the circles of

confusion.

12.2.2 Single-Layer Depth-of-Field Algorithm

This section describes a simple, one-layer depth-of-field algorithm. The following sections build atop

this simple model to obtain multi-layer solutions.

The Material Model

We begin by defining the fictitious material onto which the input image will be applied by determining values for the density, specific heat, and thermal conductivity. Together, these terms define

the thermal diffusivity,

$$\kappa(u,v) = \frac{\beta(u,v)}{\rho(u,v)\,\gamma(u,v)}, \tag{12.6}$$

which defines how quickly heat diffuses in the material. Our initial model defines ρ(u,v) and γ(u,v) to be unity across the entire material. The material model is therefore entirely defined by the thermal conductivity, β(u,v), which we define in terms of the diameters of the circles of confusion. Intuitively,

a small circle of confusion leads to a small thermal diffusivity and the input color is preserved.

Conversely, large circles of confusion lead to large thermal diffusivities, resulting in large diffusion

of the initial color sample.

With the goal of achieving an efficient solution to Equation 12.5, we choose to solve it with a simple implicit solver, the backward Euler scheme, rather than an explicit forward Euler solver. Specifically, we use an Alternating Direction Implicit (ADI) solution. Using a backward Euler scheme means that we can directly compute a solution to Equation 12.5 at time t + 1 rather than having to compute it at many small time steps between t and t + 1. Both of these solver decisions are made for efficiency reasons. The ADI solver lets us solve the simpler, 1D heat equation twice rather than the 2D heat equation once, and the backward Euler scheme permits a single-step solution rather than requiring many iterations. See Baraff et al. [6] for a complete overview of the use of implicit numerical methods in computer graphics. Because we use an alternating direction solver, we simply solve the 1D heat equation separately for all rows (u) and all columns (v) of the image:

$$\rho(u)\,\gamma(u)\,\frac{\partial h}{\partial t} = \frac{\partial}{\partial u}\left(\beta(u)\,\frac{\partial h}{\partial u}\right). \tag{12.7}$$

We discretize Equation 12.7 as,

$$\rho_i\,\gamma_i\,\frac{\Delta h_i}{\Delta t} = \beta_i(h_{i+1} - h_i) - \beta_{i-1}(h_i - h_{i-1}), \tag{12.8}$$

where i ∈ [1, N] and β_0 = β_N = 0. Note that setting the thermal diffusivity, β, to zero at the boundaries means that heat will not diffuse out of the system. If we now assume unit, homogeneous

ρ and γ and set the initial conditions to our input image h^0(u,v), we get

$$h_i - h_i^0 = \beta_i(h_{i+1} - h_i) - \beta_{i-1}(h_i - h_{i-1}). \tag{12.9}$$

Note that this final equation for the linear system means that each pixel is coupled to its nearest neighbors, either above and below or left and right.

In order to derive an expression for β in terms of the circles of confusion, we look at the frequency

response of Equation 12.9 via the discrete Fourier transform of Equation 12.9 while assuming a uniform thermal conductivity, β. Clearly, assuming a uniform β is not entirely correct, given that we will have non-uniform values for β in our final material. This simplification allows us to derive

an expression for β, but it does result in artifacts that we address in Section 12.2.3.

To take the discrete Fourier transform of Equation 12.9, we note that Equation 12.9 is a discrete approximation of

$$h_i - h_i^0 = \beta\,\frac{\partial^2 h_i}{\partial u^2}, \tag{12.10}$$

and that

$$\widetilde{\left(\frac{\partial^2 h}{\partial u^2}\right)} = (i\omega)^2\,\tilde{h}, \tag{12.11}$$

where h̃ is the Fourier transform of h. See Algorithm 5 for the derivation of Equation 12.11.

If f(x) is a function and f̃(ω) is its Fourier transform, then we can write the inverse Fourier transform as

$$f(x) = \frac{1}{\sqrt{2\pi}} \int \tilde{f}(\omega)\,e^{i\omega x}\,d\omega. \tag{12.12}$$

If we now differentiate with respect to x, we get

$$\frac{df(x)}{dx} = \frac{1}{\sqrt{2\pi}} \int \frac{d}{dx}\left(\tilde{f}(\omega)\,e^{i\omega x}\right)d\omega = \frac{1}{\sqrt{2\pi}} \int \tilde{f}(\omega)\,\frac{d}{dx}\left(e^{i\omega x}\right)d\omega = \frac{1}{\sqrt{2\pi}} \int \left(i\omega\,\tilde{f}(\omega)\right)e^{i\omega x}\,d\omega. \tag{12.13}$$

Differentiating n times gives:

$$\frac{d^n f(x)}{dx^n} = \frac{1}{\sqrt{2\pi}} \int \left((i\omega)^n\,\tilde{f}(\omega)\right)e^{i\omega x}\,d\omega. \tag{12.14}$$

Algorithm 5: Proof that taking the nth derivative with respect to x of a function, f(x), is equivalent to multiplying the Fourier transform of f(x) by (iω)^n.

The discrete Fourier transform of Equation 12.9 is therefore

$$\tilde{h} - \tilde{h}^0 = \beta\,(i\omega)^2\,\tilde{h}. \tag{12.15}$$

Solving for h̃ to obtain the frequency response, we get

$$\tilde{h} = \frac{1}{1 + \beta\omega^2}\,\tilde{h}^0. \tag{12.16}$$

Setting

$$\beta = \frac{1}{\omega_c^2} \tag{12.17}$$

gives the frequency response of a low-pass Butterworth filter [98],

$$\tilde{h} = \frac{1}{1 + (\omega/\omega_c)^2}\,\tilde{h}^0. \tag{12.18}$$

Finally, given that the spatial width of a Butterworth filter corresponds to 1/ω_c, the final expression for the thermal conductivity is

$$\beta = d^2, \tag{12.19}$$

where d is the diameter of the circle of confusion.

We have now completely defined the material onto which the initial heat distribution (i.e., RGB

pixel values) will be applied and on which the depth-of-field diffusion takes place. Like previous post-process,

depth-of-field techniques, our method takes each pinhole image pixel and blurs it over a region

defined by the circle-of-confusion diameter. However, unlike previous methods, we can achieve

variable width blurs in a constant amount of work per pixel (irrespective of the blur size). The next

section describes how we achieve this via our linear solver.

Solving the Heat Diffusion Equation

We obtain a depth-of-field solution by applying the initial RGB pixel values to the material model

defined in Section 12.2.2. Each direction of the ADI solution must solve Equation 12.9 for all

elements in the row or column of the image. This defines the following linear system,

$$\begin{aligned}
h_1 - h_1^0 &= \beta_1(h_2 - h_1) \\
h_2 - h_2^0 &= \beta_2(h_3 - h_2) - \beta_1(h_2 - h_1) \\
h_3 - h_3^0 &= \beta_3(h_4 - h_3) - \beta_2(h_3 - h_2) \\
&\;\;\vdots \\
h_{N-1} - h_{N-1}^0 &= \beta_{N-1}(h_N - h_{N-1}) - \beta_{N-2}(h_{N-1} - h_{N-2}) \\
h_N - h_N^0 &= -\beta_{N-1}(h_N - h_{N-1}).
\end{aligned} \tag{12.20}$$

Rearranging the ith equation gives

$$h_i^0 = -\beta_{i-1}\,h_{i-1} + (1 + \beta_{i-1} + \beta_i)\,h_i - \beta_i\,h_{i+1}. \tag{12.21}$$

Note that the boundary conditions described in Section 12.2.2 are applied as part of the linear system solution. This leads to a tridiagonal linear system, Mh = h^0, with M defined as

$$\begin{pmatrix}
b_1 & c_1 &        &        & 0      \\
a_2 & b_2 & c_2    &        &        \\
    & a_3 & b_3    & c_3    &        \\
    &     & \ddots & \ddots & \ddots \\
0   &     &        & a_n    & b_n
\end{pmatrix}
\begin{pmatrix} h_1 \\ h_2 \\ h_3 \\ \vdots \\ h_n \end{pmatrix}
=
\begin{pmatrix} h_1^0 \\ h_2^0 \\ h_3^0 \\ \vdots \\ h_n^0 \end{pmatrix}, \tag{12.22}$$

with c_i = a_{i+1}. Equation 12.21 gives the expressions for the matrix entries:

$$a_i = -\beta_{i-1}, \qquad b_i = 1 + \beta_{i-1} + \beta_i, \qquad c_i = -\beta_i. \tag{12.23}$$

In order to preserve sharp features in the midground, we define β_i = min(d_i, d_{i+1})^2. Finally, we solve for the depth-of-field solution by solving the tridiagonal system, Mh = h^0, for all columns in the image followed by all rows.
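A minimal CPU sketch of assembling one system's coefficients from the circle-of-confusion diameters follows; the names are illustrative, and the GPU version builds the same values with Glift iterators (Section 12.3). For the background layer of Section 12.2.3, each β term would additionally be divided by the specific heat γ_i (Equation 12.25).

    // Illustrative sketch: build one row's tridiagonal coefficients (Equation
    // 12.23) from circle-of-confusion diameters d[0..n-1], using
    // beta_i = min(d_i, d_{i+1})^2 with insulating (zero) couplings at the ends.
    #include <algorithm>
    #include <vector>

    void buildTridiagonal(const std::vector<float>& d,
                          std::vector<float>& a,  // sub-diagonal:   a_i = -beta_{i-1}
                          std::vector<float>& b,  // diagonal:       b_i = 1 + beta_{i-1} + beta_i
                          std::vector<float>& c)  // super-diagonal: c_i = -beta_i
    {
        const int n = static_cast<int>(d.size());
        a.assign(n, 0.0f); b.assign(n, 0.0f); c.assign(n, 0.0f);

        // beta(i) couples element i to element i+1; zero outside the domain.
        auto beta = [&](int i) {
            if (i < 0 || i >= n - 1) return 0.0f;
            const float m = std::min(d[i], d[i + 1]);
            return m * m;                       // Equation 12.19
        };

        for (int i = 0; i < n; ++i) {
            a[i] = -beta(i - 1);
            b[i] = 1.0f + beta(i - 1) + beta(i);
            c[i] = -beta(i);
        }
    }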

The typical method for directly solving tridiagonal linear systems is LU-decomposition. The solver runs in constant time per entry and proceeds first by a step called forward substitution followed by a second step called backward substitution. The forward substitution step is equivalent to a forward, recursive infinite impulse response (IIR) filter, and the backward substitution is a backward, recursive infinite impulse response filter. The recurrence relation present in the solution to this tridiagonal system is the key feature of our algorithm that lets us achieve variable-width filters with a constant amount of work per pixel. As described in Section 12.3, this recurrence relation also creates the key challenge in creating a data-parallel, GPU implementation of the algorithm.
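For reference, the serial direct solver looks like the following sketch (illustrative names, one channel only); the forward loop is the forward recursive filter and the backward loop the backward recursive filter just mentioned. An ADI step runs this solver over every row, then every column, of each color channel.

    // Illustrative serial Thomas-algorithm (LU) solver for one tridiagonal
    // system M h = h0 (Equation 12.22). The forward sweep is a forward
    // recursive (IIR) filter; the back sweep is a backward one.
    #include <vector>

    std::vector<float> solveTridiagonal(const std::vector<float>& a,   // sub-diagonal (a[0] unused)
                                        const std::vector<float>& b,   // diagonal
                                        const std::vector<float>& c,   // super-diagonal (c[n-1] unused)
                                        const std::vector<float>& h0)  // right-hand side
    {
        const int n = static_cast<int>(b.size());
        std::vector<float> cp(n), hp(n), h(n);

        // Forward substitution (forward IIR recurrence).
        cp[0] = c[0] / b[0];
        hp[0] = h0[0] / b[0];
        for (int i = 1; i < n; ++i) {
            const float denom = b[i] - a[i] * cp[i - 1];
            cp[i] = c[i] / denom;
            hp[i] = (h0[i] - a[i] * hp[i - 1]) / denom;
        }

        // Backward substitution (backward IIR recurrence).
        h[n - 1] = hp[n - 1];
        for (int i = n - 2; i >= 0; --i)
            h[i] = hp[i] - cp[i] * h[i + 1];

        return h;
    }

Because each iteration depends on the previous one, this form does not map directly to a data-parallel GPU; Section 12.3 replaces it with cyclic reduction.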

Figure 12.2 shows the results for this depth-of-field algorithm. Note that it correctly preserves sharp, in-focus regions while blurring the out-of-focus regions with no bleeding across depth discontinuities. Section 12.3 describes the GPU implementation of a direct, tridiagonal solver and

Section 12.4 gives performance results for the algorithm. The solution requires appropriate data structures for arrays of tridiagonal matrices and a data-parallel implementation of the forward- and backward-substitution recurrence relations.

12.2.3 Separating the Background and Midground

The depth-of-field algorithm described in Section 12.2.2 works for some simple scenes (Figure 12.2), but generates objectionable artifacts in more complex scenarios. The upper-right image in Figure 12.3 shows artifacts arising from this basic algorithm. The problem is that in-focus, midground objects have a thermal diffusivity close to zero and act as perfect insulators. Background heat, which should diffuse over a wide region, is blocked by the midground insulators. When animated, the result is that midground, sharp features appear to disturb the blurry background. This section describes our solution to this problem. We now use the previous algorithm for a midground DOF solution, define a new material model for the background, and composite the results to create a final solution.

Figure 12.2: Left: original pinhole image. Right: result of the simple, one-level depth-of-field algorithm described in Section 12.2.2.

Figure 12.3: Our original image is at upper left. Single-layer diffusion (upper right) results in artifacts at the horizon adjacent to the leaves. This solution is used for the nearly in-focus, midground regions. We remove the artifacts by computing a separate background layer (lower left) and blending it with the midground solution for the final result (lower right).

Background Material Model

In the material model defined in Section 12.2.2, we set the specific heat to unity everywhere. We define a new material model for the background layer that uses a non-uniform specific heat. We also use the specific heat to define an alpha channel to composite the two solutions together.

Specific heat is defined as the amount of heat required to raise the temperature of a unit mass of material by one temperature unit. It can be thought of as the ability of a material to “store” heat. A low specific heat means that the material does not easily “store” heat and increases thermal diffusivity

(see Equation 12.6). A near-zero specific heat thus completely disregards the input heat and easily transmits neighboring heat across the region. We define specific heat to vary between γ_min and one, depending on the circles of confusion, as:

$$\gamma_i = \max\bigl(\gamma_{min},\,\mathrm{smoothstep}(d_L, d_H, d_i)\bigr), \tag{12.24}$$

where γ_min is the minimum specific heat (note that a zero specific heat results in an infinite thermal diffusivity), d_L is the lower cutoff circle-of-confusion diameter below which the specific heat is γ_min, d_H is the upper circle-of-confusion cutoff above which the circles of confusion are left unperturbed, and d_i is the circle of confusion at position i. The smoothstep(a, b, x) function returns zero if x < a, one if x > b, and smoothly ramps between zero and one for a ≤ x ≤ b. The cutoffs d_L and d_H are free parameters in our model and define the transition region between midground and background. It is also possible to define more than one background layer and composite them together. In practice, we have found that two layers suffice to remove nearly all artifacts.
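A direct transcription of Equation 12.24 follows (illustrative names; smoothstep is the standard Hermite ramp used by shading languages).

    // Illustrative sketch of Equation 12.24.
    #include <algorithm>

    float smoothstep(float a, float b, float x)   // standard Hermite ramp
    {
        const float t = std::clamp((x - a) / (b - a), 0.0f, 1.0f);
        return t * t * (3.0f - 2.0f * t);
    }

    // Background-layer specific heat: gammaMin in sharp regions, ramping to
    // one as the circle-of-confusion diameter d grows from dL to dH.
    float specificHeat(float d, float dL, float dH, float gammaMin)
    {
        return std::max(gammaMin, smoothstep(dL, dH, d));
    }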

The last step in defining the background material model is to redefine the thermal conductivity in regions where we mask out the in-focus, midground data. We accomplish this by applying a fixed-width IIR blur filter to the circles of confusion, where the width is d_H. This effectively fills in the thermal conductivity values in midground regions with values from the surrounding background.
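One simple way to realize such a fixed-width recursive blur is a first-order exponential filter run forward and then backward over each scanline; the sketch below is an illustrative stand-in, since the exact IIR kernel is not spelled out here.

    // Illustrative first-order IIR blur over one scanline: a forward pass then
    // a backward pass, with the smoothing factor derived from the blur width.
    // A stand-in for the fixed-width IIR filter described above, not its
    // exact kernel.
    #include <cstddef>
    #include <vector>

    void iirBlur1D(std::vector<float>& x, float width)
    {
        if (x.size() < 2 || width <= 0.0f) return;
        const float k = 1.0f / (1.0f + width);        // larger width -> more smoothing

        for (std::size_t i = 1; i < x.size(); ++i)    // forward (causal) pass
            x[i] = x[i - 1] + k * (x[i] - x[i - 1]);
        for (std::size_t i = x.size() - 1; i-- > 0; ) // backward (anti-causal) pass
            x[i] = x[i + 1] + k * (x[i] - x[i + 1]);
    }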

Figure 12.4 shows a visualization of the original and blurred circles of confusion.

An alternate, faster method than blurring the thermal conductivities is to instead run a single pass that clamps all circles of confusion smaller than d_H to be equal to d_H. With this model, we automatically set d_L and d_H as fractions of the circle of confusion at the far cutting plane; d_L and d_H are then no longer free parameters of the algorithm. This optimization results in a 1.5-times speedup and generally produces results indistinguishable from using an IIR blur filter on the thermal conductivities.

Solving the Background Heat Diffusion Equation

Once the background material is defined, heat diffusion is solved in almost exactly the same way as for the midground. The only difference is that Equation 12.21 is modified to include the non-uniform γ. The result is a linear system in which each equation is

$$h_i^0 = -\frac{\beta_{i-1}}{\gamma_i}\,h_{i-1} + \left(1 + \frac{\beta_{i-1}}{\gamma_i} + \frac{\beta_i}{\gamma_i}\right)h_i - \frac{\beta_i}{\gamma_i}\,h_{i+1}. \tag{12.25}$$

Figure 12.4: Visualization of the magnitude of the circles of confusion at each pixel, with dark regions corresponding to small circles (more in focus) and light regions corresponding to larger circles (less in focus). The left image here corresponds to Figure 12.3’s top right image, and the right image corresponds to Figure 12.3’s bottom left image.

One final detail is that the specific heat values used for the second ADI pass are the results of running the original specific heats through the first ADI pass.

Figure 12.3 shows the original pinhole image (top-left), the midground DOF solution (top-right), the background DOF solution (lower-left), and the composited result (lower-right). Note that the composited result (lower-right) does not have the artifacts present in the background portions of the midground (top-right) image.

12.2.4 Solving the Foreground

Section 12.2.3 describes how we separately solve the midground and background layers. Unfortunately, the same method will not work for blurry foreground objects. In fact, this is a fundamental limitation of all post-process depth-of-field solutions that use only a single RGBZ input image. The problem is that foreground objects that are significantly in front of the focal plane become transparent. A true depth-of-field solution would reveal midground and background objects underneath the blurry foreground. With a single RGBZ input image, however, the information from obscured midground and background objects is not present, and a “hole” will instead be visible. This section describes how our algorithm can be extended to provide correct depth-of-field if a second, separate

foreground RGBAZ input image is provided.

The algorithm described in Section 12.2.3 will actually produce a slightly different result than the one described in the previous paragraph. In the case where blurry foreground objects are over

blurry background objects, the two blurry regions will blend together and produce an acceptable result. However, in regions where blurry foreground objects intersect sharp, midground objects, the algorithm will blur correctly to the intersection, then show only the sharp midground. The

foreground will not blur over sharp features. It is arguable whether this artifact or the “hole” artifact

described in the previous paragraph is more objectionable.

The material model for the foreground solution is the same as the background solution, except for three differences. First, the circles of confusion diameters are blurred with a fixed-width IIR filter using the input alpha values as the specific heat. This defines thermal conductivities in regions outside of edges in the original pinhole image. Second, the input alpha values are blurred with a variable-width IIR filter that uses the circles of confusion as blur diameters. Similar to the first step, this operation fills in specific heat values outside the borders in the original input image. Third, the specific heat values used for the RGB solution are those obtained by blurring the alpha values.

The end result is an RGBA DOF-blurred foreground image that can be composited atop a midground and background solution. Figure 12.5 shows the results of using two input images: an RGBZ pinhole image that defines the midground and background (flags and mountain) and an RGBAZ pinhole image that defines the foreground (fence). Note that the foreground input includes an alpha mask that defines where the foreground is defined.

12.2.5 Automatically Generating Multiple Input Images

As described in the preceding sections, accurate depth-of-field with objects in the foreground requires multiple input images. The example given in Section 12.2.4 uses a separate foreground and midground/background image, each separately rendered with the geometry split by the user between the layers. In addition, separating the midground and background can lessen artifacts arising from using convolution instead of order-dependent blending. Unfortunately, most scenes cannot be

easily split into two or three layers for depth-of-field post-processing. The worst case is a single

object, such as a fence, that spans the foreground, midground, and background of an image. This section describes a simple method for automatically creating separate foreground, midground, and background images for any scene based on the current camera parameters.

The technique was first described in NVIDIA’s Mad Mod Mike depth-of-field demo [96], and we generalize it here for our DOF algorithm. The idea is simply to adjust the near and far clipping planes of the camera for each layer and render the scene three times. In practice, changing the frustum can lead to rasterization differences between the layers, and we find it better to use user-defined clipping planes.

We define the position of the cutoffs between the layers to be the depth values at a fraction of the circle-of-confusion size at infinity. We can use this asymptotic value because, as shown in

Figure 12.1, the circle-of-confusion equation (Equation 12.1) has the convenient property of quickly approaching an asymptotic value for depth values z > P. We can therefore decide where to make the layer cuts based on this asymptotic value.

We derive an expression for the asymptotic circle of confusion by taking the limit of Equation 12.1 as z → ∞:

$$d_{max} = \frac{AF}{P - F}. \tag{12.26}$$

Substituting a fraction c of this asymptotic circle of confusion for d in Equation 12.1 gives

$$c\,\frac{AF}{P - F} = \left|\frac{AF(P - z)}{z(P - F)}\right|, \tag{12.27}$$

where c is the fraction of the asymptotic value. Solving Equation 12.27 for z gives the following two

expressions for the depth where the circle of confusion is a fraction, c, of the asymptotic value:

$$z_f = \frac{P}{1 + c_f}, \quad z < P; \qquad\qquad z_b = \frac{P}{1 - c_b}, \quad z > P. \tag{12.28}$$

These two equations define depth values for cutting planes between foreground, midground, and

background layers based on only two intuitive parameters, c_f and c_b. These parameters are in the range [0, 1] and represent the fraction of the asymptotic circle of confusion at which the layer boundaries are defined. In practice, we find that c_f = 0.1 and c_b = 0.5 work well for many camera configurations.
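In code, the cut depths reduce to two divides (illustrative names):

    // Illustrative sketch of Equation 12.28: depths of the cutting planes
    // where the circle of confusion reaches fractions cf (foreground side)
    // and cb (background side) of its asymptotic value d_max (Equation 12.26).
    struct LayerCuts { float zForeground, zBackground; };

    LayerCuts layerCutDepths(float P, float cf, float cb)
    {
        return { P / (1.0f + cf),    // z < P cut
                 P / (1.0f - cb) };  // z > P cut
    }
    // Example: P = 3 m, cf = 0.1, cb = 0.5 places the cuts at about 2.73 m and 6 m.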

12.3 Implementation

We first describe the data structures used to implement the array of vectors and tridiagonal matrices in terms of Glift abstractions. We then describe the data-parallel algorithms used to simultaneously solve many tridiagonal systems in parallel on the GPU. In order to create a standalone application

with no external dependencies, the actual DOF code was not implemented with Glift; however, the Glift abstractions (especially GPU iterators) were invaluable in writing the code and creating a

GPU-compatible tridiagonal linear solver.

12.3.1 Data Structures

To begin, we can store each row of a tridiagonal matrix in a single RGB, 3-component texel. This

means that an entire tridiagonal matrix can be stored as a 1D array of 3-component floats. We therefore store all tridiagonal matrices for an entire image in a single 2D physical memory texture the size of the input image.

The virtual domain for the ADI solver consists of one index identifying the matrix and a second index identifying the current matrix row. We use a simple address translator to map both a row-wise array of 1D arrays and a column-wise array of 1D arrays onto the same physical memory buffer.

This lets both passes of the ADI solver use the same physical memory buffers for the matrices, input vectors, and output vectors; changing only the axis along which array offsets are applied.
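The essence of that dual-view address translator can be sketched as follows (illustrative code, not Glift's actual interface):

    // Illustrative sketch of the dual address translator: the same 2D
    // physical buffer is viewed either as a row-wise or a column-wise array
    // of 1D arrays, so both ADI passes run identical solver code over
    // different orientations.
    struct PhysicalAddr { int x, y; };

    enum class Orientation { Rows, Columns };

    // (system index, element index) -> texel address in the shared buffer.
    PhysicalAddr translate(Orientation o, int system, int element)
    {
        return (o == Orientation::Rows) ? PhysicalAddr{element, system}
                                        : PhysicalAddr{system, element};
    }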

We build the matrices on the GPU by using an iterator that traverses all rows of all matrices. We also use a neighbor iterator to obtain the circle-of-confusion values for elements i and i − 1 to compute the thermal conductivity and specific heat. The shader code for the vertical and horizontal ADI passes is identical; we simply change the address translators to switch between orientations.

The hierarchical cyclic reduction solver described below requires an array of temporary matrices and partial solutions. We therefore build a hierarchy of the data structure and iterators defined

above. Each level of the solver iterates over all entries in the current level (see Algorithm 6).

12.3.2 Algorithm Implementation

The key to a fast implementation of our DOF computation is the efficient solution of the array of

tridiagonal systems in Equation 12.22. While previous authors have successfully developed solvers

for a variety of linear systems on graphics hardware, none of the published algorithms are appropriate for this particular case. The general linear algebra framework of Krüger and Westermann [76]

supports memory layouts for both dense and sparse matrices, including banded matrices, but does

not provide direct solvers specific to tridiagonal matrices. Galoppo et al. support both Gauss-Jordan

elimination and LU decomposition on the GPU, but only for dense matrices [44]. GPU-based conjugate gradient solvers [14, 46, 55] are iterative and do not take advantage of the special properties

of banded or tridiagonal matrices. Moreover, for banded or tridiagonal systems, direct methods are

generally much faster and better conditioned than iterative techniques.

Since we are using an alternating direction solver, our implementation will first solve for all columns

and then use the results to solve all rows in parallel. Our algorithm requires four steps, all of which

exploit the parallelism of the GPU: construct tridiagonal matrices for each column in parallel, solve

the array of linear systems, then repeat those two steps on the rows. In the discussion below, we use

row terminology, but other than a necessary transpose, the procedure is the same for columns.

We begin by computing the tridiagonal matrix on the GPU. We can do so with a single GPU pass

that iterates over all rows of all matrices. We use a neighbor iterator for the thermal conductivity

and a single-element (stream) iterator for the specific heat (see Equation 12.25). At the end, for

an m × n image, we have n tridiagonal matrices of size m × m, each corresponding to a row of the input image. We solve each of these n systems in parallel to produce n solutions to the 1D heat diffusion

equation. For clarity below, we describe only the solution of a single row.

As we noted in Section 12.2, LU decomposition is the traditional method for solving a tridiagonal

system. Unfortunately, each iteration in the forward and back substitution steps of an LU decomposition relies on the previous step and hence cannot be parallelized. Instead, we use the method of

cyclic reduction [57, 67], a parallel-friendly algorithm often used on vector computers for solving tridiagonal systems, as a basis for our implementation.

Cyclic reduction works by recursively using Gaussian elimination on all the odd-numbered unknowns in parallel. Algorithm 6 contains pseudocode for the cyclic reduction algorithm. During elimination, each of the odd-numbered unknowns is expressed in terms of its neighboring even-numbered unknowns, resulting in a partial solution and a new tridiagonal system, each with half the number of equations (Algorithm 6, lines 1–8). The process is repeated for log m steps until only one equation remains (Algorithm 6, line 9), along with a hierarchy of partial solutions to the system.

Next, the solution to this equation is fed back into the partial solutions, and after log m steps to propagate the known results into the partial solutions (Algorithm 6, lines 10–16), the system is solved.

While cyclic reduction requires more arithmetic than an LU solver, it still takes only a constant time per unknown and is amenable to an efficient GPU implementation.

We refactor the traditional description of cyclic reduction [67] so that we can express it with the limited data access patterns provided by GPU iterators (see Section 4.4). In particular, current

GPUs support only single-element output iterators with no scatter or neighbor access. The cyclic reduction algorithm shown in Algorithm 6 is structured such that the computation of an output element k requires data from input elements 2k − 1, 2k, and 2k + 1. Expressing our computation in this way enables the GPU to iterate over contiguous domains at each level and requires only gather memory accesses.

We note that cyclic reduction represents a parallel solution to a recurrence relation and, as such, can be efficiently computed in parallel using the scan primitive [11]. In graphics, Horn recently used scan to implement an O(n log n) non-uniform reduction (i.e., stream compaction) primitive [59].

Section 11.4.2 of this thesis describes a fast, O(n) scan operation used to generate high-quality shadows. We note that the cyclic reduction solver used in our DOF algorithm can be cast in terms of a programmable scan operation and could leverage the same hybrid-optimized scan described in Section 11.4.2.

 1: for L = 1 ... log2(N+1) − 1 do
 2:   for j = 0 ... 2^{−L}(N+1) − 2 do
 3:     α ← −a^{L−1}_{2j+1} / b^{L−1}_{2j}
 4:     γ ← −c^{L−1}_{2j+1} / b^{L−1}_{2j+2}
 5:     a^L_j ← α a^{L−1}_{2j}
 6:     b^L_j ← b^{L−1}_{2j+1} + α c^{L−1}_{2j} + γ a^{L−1}_{2j+2}
 7:     c^L_j ← γ c^{L−1}_{2j+2}
 8:     h^L_j ← h^{L−1}_{2j+1} + α h^{L−1}_{2j} + γ h^{L−1}_{2j+2}
 9: h^{log2(N+1)−1}_0 ← h^{log2(N+1)−1}_0 / b^{log2(N+1)−1}_0
10: for L = log2(N+1) − 2 ... 0 do
11:   for j = 0 ... 2^{−L}(N+1) − 2 do
12:     jp ← ⌊j/2⌋
13:     if j is odd then
14:       h^L_j ← h^{L+1}_{jp}
15:     else
16:       h^L_j ← (h^L_j − c^L_j h^{L+1}_{jp} − a^L_j h^{L+1}_{jp−1}) / b^L_j

Algorithm 6: Pseudocode for our GPU-compatible cyclic reduction tridiagonal solver. a^L_j, b^L_j, and c^L_j refer to the matrix entries for row j in level L of the hierarchy of solutions, and h^L_j is an element of the solution vector. Lines 1–9 represent the forward substitution and lines 10–16 represent the backward substitution. Note that both of these computations are parallelizable and rely only on gather memory accesses.
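To make the data flow explicit, the following serial C++ sketch mirrors Algorithm 6; the code is illustrative, assumes N + 1 is a power of two, and the GPU version evaluates each level's inner loop in parallel with gather-only reads.

    // Serial rendering of Algorithm 6 for one tridiagonal system; level 0
    // holds the input coefficients, level L holds 2^{-L}(N+1) - 1 equations.
    #include <cmath>
    #include <vector>

    std::vector<float> cyclicReduction(std::vector<float> a, std::vector<float> b,
                                       std::vector<float> c, std::vector<float> h)
    {
        const int N = static_cast<int>(b.size());
        const int levels = static_cast<int>(std::log2(N + 1));  // N + 1 == 2^levels

        std::vector<std::vector<float>> A{a}, B{b}, C{c}, H{h};

        // Forward elimination (lines 1-8): halve the system at each level.
        for (int L = 1; L <= levels - 1; ++L) {
            const int m = ((N + 1) >> L) - 1;
            A.emplace_back(m); B.emplace_back(m); C.emplace_back(m); H.emplace_back(m);
            for (int j = 0; j < m; ++j) {
                const float alpha = -A[L-1][2*j + 1] / B[L-1][2*j];
                const float gamma = -C[L-1][2*j + 1] / B[L-1][2*j + 2];
                A[L][j] = alpha * A[L-1][2*j];
                B[L][j] = B[L-1][2*j + 1] + alpha * C[L-1][2*j] + gamma * A[L-1][2*j + 2];
                C[L][j] = gamma * C[L-1][2*j + 2];
                H[L][j] = H[L-1][2*j + 1] + alpha * H[L-1][2*j] + gamma * H[L-1][2*j + 2];
            }
        }

        H[levels - 1][0] /= B[levels - 1][0];  // single remaining equation (line 9)

        // Back-substitution (lines 10-16): propagate known values downward.
        for (int L = levels - 2; L >= 0; --L) {
            const int m = ((N + 1) >> L) - 1;
            for (int j = 0; j < m; ++j) {
                const int jp = j / 2;
                if (j % 2 == 1) {
                    H[L][j] = H[L + 1][jp];  // odd entries come from the level above
                } else {                     // boundary terms multiply zero coefficients
                    const float right = (jp < (int)H[L + 1].size()) ? H[L + 1][jp] : 0.0f;
                    const float left  = (jp > 0) ? H[L + 1][jp - 1] : 0.0f;
                    H[L][j] = (H[L][j] - C[L][j] * right - A[L][j] * left) / B[L][j];
                }
            }
        }
        return H[0];
    }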


12.4 Results

Figure 12.2 shows several images generated with our DOF algorithm; the inputs to our algorithm are input images with color and depth information at every pixel. Figure 12.5 shows a series of images generated with the 3-layer variant of our algorithm, where we composite the near-field fence atop the mid- and far-range flags and mountains. Lastly, Figure 12.6 shows images with continuously varying depth, generated by automatically rendering three input images based on the camera parameters (see Section 12.2.5).

Figure 12.5: The image at the top left is an in-focus view of a scene that we render with our 3-layer DOF algorithm. Clockwise from top right, we show the results of our algorithm with cameras with a narrow aperture and a mid-distance focal plane; with a wide aperture and a distant focal plane; and with a wide aperture and a near focal plane.

12.4.1 Runtime and Analysis

We implemented our DOF system on a 2.4 GHz Athlon 64 FX-53, PCI-Express 16x system running Windows XP with an NVIDIA GeForce 7800 GPU. Our implementation has image-space complexity, so its runtime is strictly a function of the image size. Using just the background and midground layers, on a 256 × 256 image we sustain 80–90 frames per second; on a 512 × 512 image, 24–26 frames per second; and on a 1024 × 1024 image, 8–10 frames per second. At 1024 × 768 resolution, with the separate foreground layer added for the flag/fence images, the frame rate drops to 5–6 frames per second.

The performance of our solver is limited by GPU memory bandwidth. The data accesses are coherent and sequential, but there is an insufficient amount of computation to hide the cost of the memory accesses. We measured this result by observing the response of our algorithm to varying GPU core clock speeds and GPU memory clock speeds. The application has very little response to GPU core clock speed changes, but a direct and linear response to changes in the GPU memory clock speed.

The performance of our algorithm is suitable for use in high-quality film preview applications such

as those we target with our work, and we expect that further improvements in next-generation GPUs

will soon allow this technique to be used in real-time entertainment applications such as games.

The running time of our algorithm is limited by the performance of the fragment program that

implements the tridiagonal solver.

12.4.2 Limitations

As we discussed in Section 12.2, our ADI solution of the heat equation yields solutions that approximate a Gaussian point-spread function across the circle of confusion. The photography term for the point-spread function in this context is “bokeh,” and it is determined by the combination of iris shape and lens optics. Because the Gaussian distribution arises as a fundamental property of the diffusion equation¹, generating other distributions is problematic for our technique. However, the Gaussian distribution is physically meaningful in photography; Buhler and Wexler indicate that such a distribution of light produces a “smooth” or “creamy” effect “similar to a Leica lens” [19].

Our ADI solution breaks up the computation into separate horizontal and vertical passes that could potentially produce anisotropies. Any issues with anisotropies could be mitigated by taking multiple diffusion steps; we have not seen any severe anisotropic effects in practice.

One of the biggest limitations of our method is common to all gather DOF methods: incorrect blending of image samples. Gather methods use a convolution rather than the correct, order-dependent

over blending operation used by scatter methods. If only one or two layers are used, the errors

can sometimes be quite objectionable. However, we have found that using separate layers for the

foreground, midground, and background greatly reduces the artifacts arising from this fundamental

limitation of all gather methods.

¹The Gaussian distribution is the Green’s function (and in this case, the impulse response) of the diffusion equation. This means that if a delta function (i.e., a point) is given as an input signal and allowed to diffuse, the resulting distribution will be Gaussian. The width of the Gaussian distribution is proportional to the square root of the diffusion time.

Another limitation of our method is that correct treatment of the foreground requires separate blurring of each depth layer between the eye and the midground. It is possible, albeit slow, to use depth peeling [38] to capture each depth layer in the foreground, blur the image with our heat diffusion solver, and composite all of the results. Hardware support for order-independent transparency

might also help, but this hardware is many years away and it is unclear if it would help, given the

requirement that the samples must be blurred before compositing.

In summary, this chapter introduces a new depth-of-field (DOF), post-process algorithm that uses a

heat diffusion model to calculate accurate DOF effects at real-time rates. Unlike previous methods,

our algorithm achieves high quality and interactive speed at the same time. It properly handles

boundaries between in-focus and out-of-focus regions, while attaining interactive frame rates and

constant computation time per pixel irrespective of blur size. Our implementation of the algorithm

also introduces cyclic reduction to the GPU community using a GPU-compatible, gather memory

access pattern. Glift iterators and data structure abstractions guided the refactoring of the solver.

The real-time performance of our system makes it suitable today for interactive film preview, and

continued advances in the performance of graphics hardware will likely make it attractive soon for

games and other real-time applications.

Figure 12.6: Depth-of-field (DOF) results when automatically generating three input images based on current camera parameters. We render the foreground, midground, and background separately, perform heat-diffusion DOF on each layer, then composite the results. Top shows the original, non-blurred input image. Bottom-left shows the DOF algorithm with all data in a single input image. Bottom-right shows the DOF algorithm with three separate layers. Note the correctly transparent foreground blur in the right-hand image.

Part V

Conclusions and Future Work

Chapter 13

Future Work

This chapter gives an overview of possible and promising future work based on Glift and each of the applications: octree 3D paint, quadtree shadow maps, and heat-diffusion depth-of-field.

13.1 Glift

The version of Glift in this thesis defines abstractions for building random-access GPU data structures on current hardware. Glift can be seen as a step toward building a library such as the Standard Template Library (STL) for GPUs, but defines only the building blocks rather than the final structures. It is our hope that Glift serves as a proof-of-concept for GPU language and abstraction designers, and that there will one day be a standard library of GPU data structures and algorithms.

This library will not only be a GPU implementation of the STL, but will include graphics-specific data structures such as octrees as well as the building blocks for users to easily create their own structures.

13.1.1 A Programming Model for Commodity Parallelism

As parallel computing moves from the machine room to the desktop, computer science is amidst a time of unique opportunity to radically redefine the programming model for commodity computing. GPUs offer one possible future platform, and much work remains to define an appropriate programming model (of which Glift is one small piece). However, other architectures such as IBM’s

Cell processor and multicore CPUs from AMD and Intel may instead be the processors of the future. Many open questions remain in defining how these commodity parallel processors will be programmed:

- Can existing serial programming paradigms be augmented to efficiently program these new architectures?

- Will software-managed threads be sufficient for hiding memory access latency, or will hardware-managed threads be required for good performance?

- Will libraries such as Glift provide enough infrastructure to allow languages such as C++ to be used for these processors?

- Will programmers have to explicitly program the memory hierarchy?

- Will programmers have to explicitly declare their memory access patterns to achieve reasonable parallel performance?

- Is there a small set of memory access patterns (i.e., new iterator types) that can be used to define a large number of algorithms?

For data-parallel programming models, one significant unaddressed problem is the automatic management of temporary buffers. Temporary buffers can be quite large, and it is important for a program to use as few as possible. Solutions such as breaking the computation into independent chunks, automatic allocation and reuse of temporary buffers, etc., are required to reduce the memory requirements of data-parallel algorithms and free the programmer from the tedium of manually managing buffers.

Leaving this problem unsolved is analogous to requiring a serial programmer to manually manage registers.

13.1.2 Generic Algorithms

Glift’s introduction of iterators to GPU programming opens the possibility of defining generic,

reusable GPU algorithms. The Standard Template Library (STL) defines generic algorithms in terms

of specific types of iterators. A promising next step is to create a set of fundamental data-parallel

algorithms such as scan, sort, compact, etc. in terms of Glift iterators. These algorithms could

operate on the elements stored in any Glift data structure that supports the appropriate iterators.

Unlike current GPU algorithm implementations, these generic algorithms could be easily distributed and reused.

13.1.3 Impact of Future GPU Architectures

The best resource for information about upcoming GPU architectures is information regarding the next Microsoft DirectX specification [9]. The information made public thus far indicates that near-term GPU hardware may support some subset of the following features: integer support, a new shader stage called a geometry shader, virtual memory, and unified shader cores. This section

discusses some possible implications for Glift on these upcoming changes.

To begin, the addition of integer data types and arithmetic in shaders will greatly simplify the specification of Glift address translators. While floating-point, normalized addressing simplifies multiresolution addressing, it also leads to insidious address-rounding errors. In practice, address translators for floating-point addressing are very difficult to make robust and usually consume more instructions than an integer/bit-operation equivalent.

The geometry shader is a new shader stage between vertex shading and rasterization. It enables programmable primitive assembly and opens up new possibilities. The most obvious implication for

Glift is that geometry shaders will make it much easier for GPUs to generate and modify their own iterators. Very little research has been done that includes GPU-generated iterators, because doing so requires generating geometry on the GPU. The geometry shader is designed to generate geometry, will make it much easier to generate GPU iterators, and will therefore enable new classes of complex, data-structure-modifying algorithms to execute efficiently entirely on the GPU.

As Section 8 describes, many recent GPU data structures are based on a multidimensional page

table structure. Page tables have the benefit of supporting constant time insertion, deletion, and

data access. Their downside is the memory required to represent the page table. It is reported

that next generation GPUs will support virtual memory and therefore have hardware page tables.

Will it be possible to leverage this hardware page table support to represent user-configurable data

structures such as the multiresolution data structure described in Section 9.3 and used in the Octree 3D Paint, Adaptive Shadow Map, and Resolution-Matched Shadow Map applications? Will

libraries like Glift therefore be implemented atop hardware primitives? Will users be able to build

hardware-accelerated data structures? This would require letting users render to page tables. Would

a Translation Lookaside Buffer (TLB) cache for the page table entries provide better performance

than the general texture cache being used by our current data structures?

Current GPUs contain separate shading cores for the vertex and fragment processors. If future GPUs

use a set of unified shader cores for vertex, geometry, and fragment processing, will the fragment

engine remain the substrate of choice for GPU computation? For example, will it become efficient

to implement GPU iterators as points processed by the vertex engine?

If future GPUs support arrays of input textures that can be relatively addressed, Glift could virtualize the presence of multiple physical memory buffers by adding an additional dimension to the physical

address that is the array index in an array of textures. This is not currently possible to do efficiently

because arrays of input textures must be indexed by compile-time indices.

The latest generation of ATI GPU is reported to support a scatter operation in the fragment engine.

This scatter does not have a resolve operation and so implements the Arbitrary-CRCW machine

model (see Section 2.1). While at first glance this looks like a much easier machine model to program than the more restricted CREW machines, parallel execution limits the utility of this operation. This scatter operation can only be used if a global address space is defined for the data being written. For example, this version of scatter does not help solve the stream compaction problem because the number of non-NULL elements is not known before the computation begins. Many parallel operations will require some sort of Combining-CRCW model to define the semantics of concurrent writes to the same location. Samet concludes that scan-based compaction is a more effective tool than scatter for building complex data structures in parallel [58]. Only if an ordering

(i.e., an address) is known a priori is scatter a useful primitive for modifying GPU data structures.

Lastly, ATI’s latest-generation GPU supports much more efficient fragment-level conditional execution than previous GPUs [1]. As GPU support for efficient branching improves, data structures with non-uniform access consistency may perform equivalently to those with uniform access consistency. As such, structures such as trees may become more attractive than memory-hungry page table structures.

13.1.4 Additional GPU Data Structures

This thesis describes only random-access data structures, yet much of computer science (and the

STL) regularly uses non-random-access structures such as sets, hash tables, and linked lists. These structures have not been demonstrated on GPUs and doing so is promising future work. This section gives an overview of ideas for implementing these structures.

To begin, as noted by Alexander Stepanov [120], parallel iteration over the elements of a data structure requires an address space. This means that a GPU representation of an irregular structure such as a linked list must have an address space that enumerates all of its elements. The elements need not be listed in any particular order, but parallel iteration must be possible. A data structure that supports only sequential iteration could support read-only GPU usage, but will not support GPU iteration over its elements.

GPU Set

We begin with a description of a GPU set structure. Given that we know we need an address space, we choose a bit-vector set representation. Such representations are efficient for densely populated sets for which a linear address space can be defined. In a bit-vector set, each bit represents an element in the set, and a one or zero value indicates whether that element is in the set or not. Bit-vector sets support constant-time insertion, deletion, and lookup but require an O(n) operation to generate an iterator for

all elements, where n is the number of bits in the address space rather than the number of elements in the set. We define the physical memory buffer of the set to be a single-component, 8-bit internal texture format. Each of the 8 bits is used as a set element. The virtual addresses of the set are bit indices (keys). The physical addresses are 2D values containing the byte number and the relative bit number (offset within the byte).

The bit-vector set supports parallel read access simply by reading the appropriate byte and checking the value of the appropriate bit. Note that this masking will be much easier when GPUs support bitwise operations and integers rather than only floating point operations.

Because the address space of the bit vector defines an absolute ordering, a bit-vector set easily supports parallel set insertion. Given a collection of unordered, non-unique keys to insert into the

set, all keys could be drawn as points with their position set to be the physical address of the byte containing the appropriate bit. The framebuffer is set to combine fragments with bitwise OR, and the value of each point is a power-of-two value representing the appropriate bit. Using the GPU in this way actually represents a Combine-CRCW machine model, where the framebuffer logical operation defines the Combine function. Although drawing points is not an efficient operation, the parallelism obtained may enable this operation to perform reasonably. If native scatter support, such as with the latest ATI chip, can be combined with framebuffer logical operations, this will likely provide a more efficient Combine-CRCW implementation than drawing points.

The parallel erase operation is similar to the set insertion, but the color value of each point should be the bitwise NOT of the bit being erased and the Combine mode (framebuffer logical operation) should be set to AND.
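A CPU sketch of the proposed structure makes the address mapping concrete (illustrative names; the GPU version would realize insert and erase with framebuffer logic ops as just described):

    // Illustrative bit-vector set: each byte of the "physical buffer" stores
    // eight elements. Insert ORs a bit in, erase ANDs its complement
    // (NOT + AND), and membership reads one byte and masks the appropriate bit.
    #include <cstdint>
    #include <vector>

    class BitVectorSet {
    public:
        explicit BitVectorSet(std::size_t maxElements)
            : bytes_((maxElements + 7) / 8, 0) {}

        void insert(std::size_t key)         { bytes_[key / 8] |=  bit(key); }
        void erase(std::size_t key)          { bytes_[key / 8] &= ~bit(key); }
        bool contains(std::size_t key) const { return (bytes_[key / 8] & bit(key)) != 0; }

    private:
        static std::uint8_t bit(std::size_t key)
        {
            return static_cast<std::uint8_t>(1u << (key % 8));  // offset within the byte
        }
        std::vector<std::uint8_t> bytes_;  // the physical memory buffer
    };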

Creating an iterator over all elements currently in the set is by far the most difficult operation to

implement. It requires a one-to-eight data amplification to generate up to eight elements from each one-byte texel. One possibility is to render to a buffer containing an element for each possible bit in the set (the entire address space), write the virtual address of the bit into the framebuffer, then

compact the result using the algorithm described in Section 11.4.2. This compacted array of virtual

addresses could then be efficiently iterated over. Note that this algorithm requires a memory buffer

that is 32 times the size of the bit vector. It is possible that the geometry processor will make it easy 157 and efficient to implement this operation.

GPU Hash Table

The hash table is arguably one of the most often-used data structures in general computer science, yet no GPU implementation has been previously described. We do not implement one, but speculate

how it might be possible.

We begin by defining the hash table as a grid of lists. Each list is composed of 1D pages of elements.

The idea is to implement a chained hash table, but grow the chains in pages of elements rather than

individual elements to amortize the cost of inserting and addressing elements. As such, the virtual

address would be a hash key, and the physical addresses would be the head of the list, the page

number, and the offset within the page.

Reading from the hash table requires hashing to the correct list, then traversing the pages and elements within the list. This is a linear-time (with respect to the length of the list), non-uniform-consistency traversal.
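A speculative CPU sketch of this layout illustrates the paged chaining; the design is hypothetical and does not reflect an existing GPU implementation.

    // Speculative sketch of a chained hash table that grows each bucket in
    // fixed-size pages of entries, as proposed above. Hypothetical design only.
    #include <cstddef>
    #include <functional>
    #include <vector>

    template <typename K, typename V, std::size_t PageSize = 8>
    class PagedHash {
        struct Entry { K key; V value; };
        using Page = std::vector<Entry>;               // holds at most PageSize entries

    public:
        explicit PagedHash(std::size_t buckets) : lists_(buckets) {}

        void insert(const K& k, const V& v)
        {
            auto& pages = lists_[std::hash<K>{}(k) % lists_.size()];
            if (pages.empty() || pages.back().size() == PageSize)
                pages.push_back({});                   // chain grows one page at a time
            pages.back().push_back({k, v});
        }

        // Linear-time, non-uniform traversal of the bucket's pages.
        const V* find(const K& k) const
        {
            for (const Page& p : lists_[std::hash<K>{}(k) % lists_.size()])
                for (const Entry& e : p)
                    if (e.key == k) return &e.value;
            return nullptr;
        }

    private:
        std::vector<std::vector<Page>> lists_;         // grid of paged lists
    };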

Parallel hash insertion requires resolving the number of collisions created in each list and allocating

the number of pages required to fulfill all collisions. The number of collisions could possibly be

computed by rendering the elements to insert as points to a buffer that is the size of the top-level

grid, using alpha blending to count the number of collisions. To insert the actual elements, each

element in the same list needs a unique address. For this, the elements may have to be sorted by grid

location or it is possible that a trick such as Purcell’s stencil routing [109] could be used to write

elements to the correct location.

Generating a GPU iterator over the elements could be done in constant time if a vertex buffer of

allocated pages were always kept up-to-date as a side-effect of the insert operation. The iterator

draws a quad over each active page and delivers the virtual address of each element as the value of

the iterator. This address (hash key) is trivial to deliver because, just as in a CPU hash table, the key

must be stored in the table.

GPU Linked List

Linked lists are a challenging case for GPUs for several reasons. We define a linked list to be an ordered collection of nodes that supports O(1) insertion and deletion in the middle of the list and

O(n) find operations (lookups). However, to support efficient parallel GPU iteration, our structure must support some sort of address space. Unlike sets and hash tables, linked lists have no native global address space. The ordering of elements depends on the data currently stored in the list.

We propose a structure that has a varying-length array of node pointers and a separate data array for storing the nodes. The varying-length array could be implemented with a structure similar to the stack described in Section 9.2. The ordering of the pointers is irrelevant with respect to the node

ordering. In fact, the layout could be ragged and sparse unless efficient GPU iteration were required, in which case the pointers could be compacted with a stream compaction operation.

Parallel find is a linear-time operation and is trivially supported by following pointers starting from the head of the list. Parallel insertion and deletion are much more difficult. To begin, because the ordering of a list depends on the elements in the list, we propose supporting only the insertion of contiguous sub-lists. The sub-list would be created by sorting the pending elements and linking their pointers together to create a list. This list can then be inserted into the larger list by changing pointers at the beginning and end of the sub-list. It is unclear how to define the ordering semantic of a parallel list insertion if arbitrary, individual nodes are allowed to be inserted in parallel. The problem is that, unlike a set, the order of elements in a linked list depends only on the insertion order and not on any key or other property of the data.

13.2 Octree 3D Paint

Our octree texture implementation described in Chapter 10 demonstrates that the algorithm and data structure can be implemented at interactive rates on current graphics hardware. A number of interesting challenges exist for future work. First, our implementation does not include support for multiple data nodes per octree node, such as the normal-keyed additional entries described in the original papers [7, 34]. Lefebvre et al. did have some support for this feature in their work performed simultaneously with ours [79]. Adding support for this feature simply means adding another dimension to the address translator and storing an array or small hash of elements in each octree node rather than a single texel value. Second, further improvements in fast brushing and fast, correct mipmap generation are worth investigating. Lastly, we did not define file formats for our data structure, but doing so should be easy because the page table and physical data are stored as textures. These textures would be enough to recreate a static octree texture. Reconstructing an editable octree texture would require only a small additional amount of metadata. It might also be interesting to have an interactive authoring system write out files in the brick map texture format

supported by Pixar’s RenderMan renderer. Our paged octree textures are very similar to that used

by brick maps.

13.3 Quadtree Shadow Maps

Chapter 11 introduces GPU adaptive shadow maps and the resolution-matched shadow map (RMSM) algorithm, which generates nearly alias-free hard shadows at interactive rates on current graphics hardware. There are a number of interesting extensions and improvements to these approaches, especially the RMSM algorithm, that may be worth exploring.

First, the dominant bottleneck of the algorithm is the sort operation used to perform a uniquify operation on the GPU. Is there a faster method for performing uniquify than sorting? We considered scattering the virtual page numbers (vpns) into the domain of all possible vpns, but discounted it due to the cost of scatter (i.e., via rendering points) and the cost of the required compaction. However, it is worth revisiting this option, especially with hardware that supports scatter, since an Arbitrary-CRCW machine model suffices for this operation. Perhaps there is another, faster way to perform uniquify on current or future GPUs?

Second, it is possible that a different data structure may be more memory efficient yet still perform nearly as well as the structure we've selected. We currently use a single-level page table to represent the quadtree of shadow map pages. The benefits include fast and uniformly consistent lookups, and fast insertions, deletions, and writes. The downside is the memory footprint of the page table and the limited resolution that results. We sometimes find that going to effective resolutions

of greater than 32,000^2 is required to avoid aliasing, yet doing so requires either larger pages or a

larger page table. Would a two-level page table solve this problem? Would a tree structure solve the

problem or would it be too complex to update in parallel at interactive rates? In addition, will we

be able to leverage the virtual memory support in future hardware to represent a sparse, enormous

mipmap hierarchy instead of having to explicitly keep the data structure ourselves?

Third, we do not support approximate soft shadows via percentage-closer filtering (PCF) in our current implementation. However, it should be straightforward to implement the algorithm described

in Fernando et al. [40] atop our structure. The hierarchical structure should make it possible to

greatly accelerate the blocker search, and the virtual addressing should make it easy to implement

large PCFs without having to worry about the resolution at which the data are actually stored. The

only consideration is that the current algorithm does not define shadow data outside of the pages

requested by the current camera view, yet large PCFs might sample outside this region. As such, a

“backup” fixed-resolution shadow map might be required to provide shadow data that is not present

in the hierarchical structure. The other question is, at what rate should the resolution of the shadow

data decrease as a function of distance from the center sample?
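For context, the core of fixed-kernel PCF is the small averaging loop sketched below (CPU-style C++ for clarity, not our shader code). The blocker search of Fernando [40] would choose the radius per pixel, and the out-of-bounds case is exactly where the "backup" map just described would be consulted:

// Average a binary depth test over a (2*radius+1)^2 window of shadow texels.
float pcf(const float* shadowMap, int w, int h,
          int sx, int sy, float receiverDepth, int radius) {
    float lit = 0.0f;
    int count = 0;
    for (int dy = -radius; dy <= radius; ++dy) {
        for (int dx = -radius; dx <= radius; ++dx) {
            int x = sx + dx, y = sy + dy;
            if (x < 0 || y < 0 || x >= w || y >= h)
                continue;                 // would fall back to the backup map
            lit += (receiverDepth <= shadowMap[y * w + x]) ? 1.0f : 0.0f;
            ++count;
        }
    }
    return count > 0 ? lit / count : 1.0f;
}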

Fourth, our shadow algorithm has many similarities to the multilevel ray tracing algorithm presented at ACM SIGGRAPH 2005 [111], insofar as we generate coherent pages of "ray hits." We use a rasterizer to generate these ray hits at the correct resolution, but what are the other similarities between our resolution-matched shadow mapping algorithm and hierarchical ray tracing? Both require similar acceleration structures to perform well, and both take advantage of ray coherency. Could a variant of our algorithm be used to efficiently generate alias-free samples for direct viewing rays rather than only shadow rays?

Fifth, under what circumstances is it possible to jettison the adaptive data structure entirely and simply allocate a cropped mipmap hierarchy of shadow buffers based on the minimum and maximum shadow coordinates in the current image? In the worst case, this approach could require an enormous amount of memory (if the quadrilateral representing the screen pixels in shadow space is very distorted). Could an alternate projection scheme redistribute the shadow samples into a rectangle, or would this bring back all of the special-case problems of perspective-shadow-map-based approaches?

Lastly, our approach is applicable not only to interactive rendering, but could also be used in offline renderers such as Pixar's RenderMan. An implementation could simply generate the shadow pages on demand and keep the quadtree of shadow pages live throughout the rendering. Given the coherency that we observe in our tests, it is very likely that many shading samples would request the same shadow page. The results would be equivalent to ray tracing sharp shadows (especially if the quadtree did not have a maximum effective resolution), but would be much more efficient. The quadtree could also be saved to disk as a new shadow map file format.

13.4 Depth of Field

The heat diffusion model for post-process depth of field (DOF) presented in Chapter 12 solves many of the problems with other post-process DOF methods. The algorithm requires a constant amount of computation per pixel, regardless of the blur size. It also respects boundaries at depth discontinuities, avoiding the edge-bleeding problems of other methods. It is likely that the results of this interactive rendering algorithm are good enough to be used as a fast, approximate depth-of-field solution for software renderers if the foreground and midground/background pinhole camera input images could be generated automatically.

The implementation introduces infinite impulse response (IIR) filters and a direct solver for tridiagonal systems to the real-time rendering and GPU communities. There are a number of other promising applications of fast IIR filters and direct tridiagonal solvers, including shallow water simulation [68] and large Gaussian blurs. In addition, it would be interesting to explore whether variable-width texture filtering (e.g., anisotropic filtering or percentage-closer filtering for shadows) could be recast in terms of a variable-width IIR filter.
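For readers unfamiliar with direct tridiagonal solvers, the serial form is the classic forward-elimination/back-substitution recurrence sketched below. This is a CPU reference only; the GPU implementation performs the equivalent solve in parallel:

#include <vector>

// Solve a tridiagonal system in O(n): a, b, c are the sub-, main-, and
// super-diagonals; d is the right-hand side and is overwritten with the
// solution. Assumes the system is well conditioned (no pivoting).
void solveTridiagonal(std::vector<double>& a, std::vector<double>& b,
                      std::vector<double>& c, std::vector<double>& d) {
    const int n = static_cast<int>(d.size());
    for (int i = 1; i < n; ++i) {             // forward elimination
        const double m = a[i] / b[i - 1];
        b[i] -= m * c[i - 1];
        d[i] -= m * d[i - 1];
    }
    d[n - 1] /= b[n - 1];                     // back substitution
    for (int i = n - 2; i >= 0; --i)
        d[i] = (d[i] - c[i] * d[i + 1]) / b[i];
}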

Chapter 14

Conclusions

This thesis demonstrates that a data structure abstraction for graphics processing units (GPUs) can simplify the description of new and existing data structures, stimulate development of complex GPU algorithms, and perform equivalently to hand-coded implementations. We define the GPU computation model in terms of parallel iteration over data structure elements and demonstrate iteration over complex structures. This thesis also presents a case that future interactive rendering solutions will be an inseparable mix of general-purpose, data-parallel GPU programming (GPGPU) and traditional graphics programming.

We introduce an abstraction, Glift, for easily creating, accessing, and traversing parallel, random-access data structures on graphics hardware. The abstraction factors GPU data structures into four components: physical memory, programmable address translation, virtual memory, and iterators. We implement the abstraction as a C++ and Cg template library and show that the resulting code is nearly as efficient as low-level, hand-written code.

Glift makes it possible to separate GPU data structure and algorithm development and description. We demonstrate how this separation reduces programming complexity in multiple ways. First, we present simple examples that demonstrate the reduction in code complexity and the clear separation of data structures and algorithms. Second, we characterize a large number of previous GPU data structures in terms of Glift abstractions, thereby illuminating many similarities between seemingly diverse structures. Third, we describe novel, complex GPU data structures (a GPU stack, quadtree, and octree) in terms of generic Glift components. Lastly, we demonstrate four novel interactive rendering algorithms built using Glift data structures: octree 3D paint, adaptive shadow maps, resolution-matched shadow maps, and a new depth-of-field algorithm. The implementation and description of these applications are greatly simplified by the Glift abstraction and the separation of algorithm from data structures.

In the same way that efficient implementations of data structure libraries such as the Standard Template Library (STL) and Boost have become integral to CPU program development, an efficient GPU data structure abstraction makes it possible for vendors to offer implementations optimized for their architecture and for application developers to more easily create complex applications. In addition, Glift's parallel iteration model helps bridge the gap between CPU and GPU programming models and defines a potentially unified approach to expressing computation on disparate commodity parallel architectures.

Part VI

Appendix

Appendix A

Glift C++ Source Code Example

This appendix presents the complete C++ source code for the 4D array example in Section 9.1. The

purpose of this appendix is to show the source code before and after transforming it with Glift.

The C++ source for the non-Glift example is:

// ... Initialize OpenGL rendering context ...

// Compute sizes
vec4i virtArraySize(10, 10, 10, 10);
int numElts = 1;
for (int i = 0; i < virtArraySize.size(); ++i) {
    numElts *= virtArraySize[i];   // total element count is the product of the dimensions
}
vec2i physArraySize = int(ceilf(sqrtf(numElts)));

vec4f sizeConst(1, 1, 1, 1);
sizeConst[1] = virtArraySize[0];
sizeConst[2] = virtArraySize[0] * virtArraySize[1];
sizeConst[3] = virtArraySize[0] * virtArraySize[1] * virtArraySize[2];

// Allocate 2 arrays that hold vec4f values
GLuint array1_texId;
glGenTextures(1, &array1_texId);
glBindTexture(GL_TEXTURE_2D, array1_texId);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_NEAREST);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_NEAREST);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_WRAP_S, GL_CLAMP_TO_EDGE);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_WRAP_T, GL_CLAMP_TO_EDGE);
glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA32F_ARB, physArraySize.x(), physArraySize.y(),
             0, GL_RGBA, GL_FLOAT, NULL);

GLuint array2_texId;
glGenTextures(1, &array2_texId);
glBindTexture(GL_TEXTURE_2D, array2_texId);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_NEAREST);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_NEAREST);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_WRAP_S, GL_CLAMP_TO_EDGE);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_WRAP_T, GL_CLAMP_TO_EDGE);
glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA32F_ARB, physArraySize.x(), physArraySize.y(),
             0, GL_RGBA, GL_FLOAT, NULL);

// Create Cg shader
CGprogram prog = cgCreateProgramFromFile(cgCreateContext(), CG_SOURCE,
                                         "laplacian.cg", CG_PROFILE_FP40,
                                         "main", NULL);

// Bind shader and enable programmable fragment pipeline
cgEnableProfile(CG_PROFILE_FP40);
cgGLBindProgram(prog);

// Bind parameters to shader
CGparameter array1Param = cgGetNamedParameter(prog, "array1");
cgGLSetTextureParameter(array1Param, array1_texId);
cgGLEnableTextureParameter(array1Param);

CGparameter virtSizeParam = cgGetNamedParameter(prog, "virtSize");
cgSetParameter4iv(virtSizeParam, virtArraySize.data());

CGparameter physSizeParam = cgGetNamedParameter(prog, "physSize");
cgSetParameter2iv(physSizeParam, physArraySize.data());

CGparameter sizeConstParam = cgGetNamedParameter(prog, "sizeConst");
cgSetParameter4fv(sizeConstParam, sizeConst.data());

// Specialize size parameters
cgSetParameterVariability(virtSizeParam, CG_LITERAL);
cgSetParameterVariability(physSizeParam, CG_LITERAL);
cgSetParameterVariability(sizeConstParam, CG_LITERAL);

// Compile and load shader
cgCompileProgram(prog);
cgGLLoadProgram(prog);

// Create framebuffer object and attach "array2" to COLOR0
GLuint fboId;
glGenFramebuffersEXT(1, &fboId);
glBindFramebufferEXT(GL_FRAMEBUFFER_EXT, fboId);
glFramebufferTexture2DEXT(GL_FRAMEBUFFER_EXT, GL_COLOR_ATTACHMENT0_EXT,
                          GL_TEXTURE_2D, array2_texId, 0);

// Render screen-aligned quad of size (physArraySize.x(), physArraySize.y())
glMatrixMode(GL_PROJECTION);
glLoadIdentity();
glMatrixMode(GL_MODELVIEW);
glLoadIdentity();
glViewport(0, 0, physArraySize.x(), physArraySize.y());
glBegin(GL_QUADS);
glVertex2f(-1.0f, -1.0f);
glVertex2f(+1.0f, -1.0f);
glVertex2f(+1.0f, +1.0f);
glVertex2f(-1.0f, +1.0f);
glEnd();

In contrast, the C++ code for the Glift version is:

// ... Initialize OpenGL rendering context ...
typedef glift::ArrayGpu<vec4i, vec4f> ArrayType;

// Allocate 2 arrays that hold vec4f values
vec4i virtArraySize(10, 10, 10, 10);
ArrayType array1(virtArraySize);
ArrayType array2(virtArraySize);

// Create Cg shader
CGprogram prog = cgCreateProgramFromFile(cgCreateContext(), CG_SOURCE,
                                         "laplacianGlift.cg", CG_PROFILE_FP40,
                                         "main", NULL);

// Instantiate Glift Cg types
GliftType arrayTypeCg = glift::cgGetTemplateType<ArrayType>();
prog = glift::cgInstantiateParameter(prog, "array1", arrayTypeCg);

GliftType iterTypeCg = glift::cgGetTemplateType<ArrayType::gpu_iterator>();
prog = glift::cgInstantiateParameter(prog, "it", iterTypeCg);

// Get GPU range iterator for all elements in array2
// - This is a neighborhood iterator that permits relative
//   indexing within the specified neighborhood.
vec4i origin(0, 0, 0, 0);
vec4i size = virtArraySize;
vec4i minNeighborhood(-1, -1, -1, -1);
vec4i maxNeighborhood(+1, +1, +1, +1);
ArrayType::gpu_neighbor_range rit =
    array2.gpu_neighbor_range(origin, size, minNeighborhood, maxNeighborhood);

// Bind parameters to shader
CGparameter array1Param = cgGetNamedParameter(prog, "array1");
array1.bind_for_read(array1Param);

CGparameter iterParam = cgGetNamedParameter(prog, "it");
rit.bind_for_read(iterParam);

// Specialize size parameters
array1.set_member_variability(array1Param, CG_LITERAL);
rit.set_member_variability(iterParam, CG_LITERAL);

// Compile and load program
cgCompileProgram(prog);
cgGLLoadProgram(prog);

// Bind shader and enable programmable fragment pipeline
cgEnableProfile(CG_PROFILE_FP40);
cgGLBindProgram(prog);

// Create framebuffer object and attach "array2" to COLOR0
GLuint fboId;
glGenFramebuffersEXT(1, &fboId);
glBindFramebufferEXT(GL_FRAMEBUFFER_EXT, fboId);
array2.bind_for_write(fboId, GL_COLOR_ATTACHMENT0_EXT);

// Compute across domain specified by address iterator
glift::exec_gpu_iterators(rit);

Note that the Glift version of this simple example is approximately half the number of lines of code. The savings are significantly greater for complex structures containing multiple address translation and data textures. The intent of the Glift version is much clearer than that of the raw OpenGL/Cg version, yet Glift remains a low-enough-level library that it can be easily integrated into existing C++/OpenGL/Cg programming environments.

Appendix B

C++ Template Type Factories

B.1 Introduction

Glift parameterizes each component using C++ templates. To maintain independent, composable modules, the template parameters to Glift classes must be orthogonal. The client of a Glift class must not be expected to understand the complex relationship between the different template arguments.

This appendix describes a novel generic programming design pattern called template type factories

used by Glift to support orthogonal and deferred creation of types. The pattern improves upon and

replaces the template-template parameter mechanism that C++ provides for this purpose.

One method of parameterizing a C++ class template by orthogonal types is to use template-template parameters. Template-template parameters are an advanced feature of C++ that provides an additional level of indirection beyond that provided by standard template parameters (see Alexandrescu [3] for a description of how template-template parameters are used in generic programming). They permit classes to specify template classes as type parameters. The result is that a class can defer the complete specification of types until after all template arguments are known. For example, a class could take a template-template parameter that is a container template (such as vector) and another parameter that is the data type for the container. These two orthogonal types can be combined to define the complete container type within the class.

Unfortunately, the powerful promise of lazy creation of types made by template-template parameters is tarnished by several shortcomings. We introduce template type factories as an alternative to template-template parameters. Template type factories improve upon template-template parameters and do not require any language extensions. The problems with template-template parameters include:

1. The template parameter list for template-template parameters must match exactly. Default template parameters cannot be hidden.

2. Template-template parameters are private to the class to which they are passed. They cannot be published to clients of the class in the same way that non-type and type template parameters can be published (via const static int, enum, and typedef). This is an undesirable behavior that differs from the other kinds of template parameters.

3. Parameter lists of template-template parameters cannot contain template-template parameters. This means that the level of indirection provided by template-template parameters can be applied only once, rather than recursively. Moreover, this restriction further makes template-template parameters frustratingly different from other types of template parameters.

Template type factories provide a simple solution to these problems.

B.2 Template Type Factories

The problem that template-template parameters unsuccessfully attempt to solve is simply a compile-time version of the problem that Abstract Object Factories solve at run time. In both cases, the programmer wants to defer the complete specification of a type until some later time. In the case of object factories, we want to defer the decision about exactly what type to create until run time. In the case of template-template parameters, we want to defer the specification of what type to create until template instantiation time (i.e., until the other template arguments are present).

Given the connection between object factories and template-template parameters, we can leverage object factory idioms to solve the template-template parameter problem. What we really want are template type factories: we pass around tags that refer to types instead of template-template parameters, and then use a type factory to create the type once all arguments have been collected.

Type factories are simply type functions (as defined by Vandevoorde and Josuttis [126]) that take a type tag (an empty struct) and a set of template parameters and produce a non-templated type. Put another way, type factories are fixed-length type lists of types from the same family that support constant-time rather than linear-time lookup. Type factories can also be thought of as traits classes [95] of the type tag structures. However, the combination of type tags and factories creates a new, powerful pattern that goes well beyond the usual usage of traits classes.

B.3 Analysis

Type factories solve problem 1 by hiding the actual template parameter list of each type in the type family. Default template parameters can be used with type factories as expected.

Problem 2 is solved by the fact that classes can publish the type tag. This allows clients of the class

to leverage the template-template information.

Problem 3 is solved by the fact that the type tag is a perfectly legal parameter in a template-template parameter list. Moreover, using type tags instead of template-template parameters means that you often do not need template-template parameters at all: you simply pass around the type tag. Type factories can be arbitrarily nested, and so the level of indirection promised by template-template parameters can be truly recursive.

B.4 Code Example

#include <vector>
#include <list>
#include <iostream>
#include <iterator>
#include <string>
#include <algorithm>

////////////////////////////////
// Definition of Type Factory
////////////////////////////////
template <class Tag, class T, class S = int>
struct ContainerFactory {};

////////////////////////////////
// Definition of Members of "Container" Type Family
////////////////////////////////

//////////// Vector /////////////////
// 1) Define "tag" type
struct VectorTag {};

// 2) Define partial specialization of factory
template <class T>
struct ContainerFactory<VectorTag, T> {
    typedef std::vector<T> Type;
};

//////////// List /////////////////
struct ListTag {};
template <class T>
struct ContainerFactory<ListTag, T> {
    typedef std::list<T> Type;
};

// Erroneous tag definition (no ContainerFactory specialization)
struct ErrTag {};

/////////////////////////////////////////////////////
// Example of using a "Tag" as a template parameter
/////////////////////////////////////////////////////
template <class T, class ContainerTagParam>
struct Foo {
    typedef T ValueType;
    typedef ContainerTagParam ContainerTag;
    typedef typename ContainerFactory<ContainerTag, T>::Type ContainerType;

    static void Print(T val) {
        ContainerType myContainer;
        myContainer.push_back(val);
        std::copy(myContainer.begin(), myContainer.end(),
                  std::ostream_iterator<T>(std::cout, "\n"));
    }
};

int main() {
    typedef Foo<int, VectorTag> VecType;
    typedef Foo<int, ListTag>   ListType;

    VecType::Print(5);
    ListType::Print(6);

    // ERROR: Passing "float" instead of a ContainerTag
    //typedef Foo<int, float> ErrType;
    //ErrType::Print(5);

    // Client usage of ContainerTag
    typedef ContainerFactory<VecType::ContainerTag, std::string>::Type ContainerType;

    ContainerType myContainer;
    myContainer.push_back("clientTagUse");

    std::copy(myContainer.begin(), myContainer.end(),
              std::ostream_iterator<std::string>(std::cout, "\n"));

    return 0;
}

Appendix C

Separating Types from Behavior in C++

Glift uses C++ templates quite heavily. As the number of template parameters grows, managing them using standard C++ coding practices becomes quite cumbersome. Unchecked proliferation of

template parameters can lead to the following problems.

1. Adding a new template parameter is difficult because the definition of all out-of-class methods must also be updated.

2. Tracking dependencies between template parameters is complex, error-prone, and ends up cluttering the default parameter declarations.

3. Separating private and public type definitions is nearly impossible. This is because, unlike member declarations, typedefs used in a class must be declared before use. In addition, type derivations often require intermediate private types before a public type can be derived.

4. Being constrained to a single template parameter interface and set of default types for a class is too restrictive. The class implementation should be separate from its type parameter list. A class should publish its type requirements (part of the Concept for the class), but its declaration should not be coupled to a single, fixed type parameter declaration.

Glift solves the above template parameter management problem by factoring class definitions into three separate components:

1. User-visible class declaration,

2. Implementation class, and

3. Types class.

C.1 User-Visible Class Declaration

The User-Visible class declaration is the class (with associated template parameters) that clients will instantiate, and its name is the name you would give to a standard class definition. The User-Visible class declaration has the following properties:

- The template parameters must be orthogonal, with no dependencies between them. Type factory tags can help a great deal with this.
- Default types may be defined.
- The class publicly inherits from the Implementation class.
- The class is empty other than mirroring the constructors provided by the Implementation class and publishing methods from the Implementation base class via using declarations.

C.1.1 Code Example

template <typename T, typename S>
class MyClass : public MyClass_Impl< MyClass_Types<T, S> > {
public:
    // ... Must mirror the constructors from MyClass_Impl ...
};

It is important to note that class authors can create multiple User-Visible classes for a single Implementation class. This allows a single Implementation (behavior) class to be used with different default type parameters, different parameter lists, etc., based on the specific use.

C.2 Types Class

The Types class is a type traits class [126] that contains only type information. All type derivation from the template parameters is contained within this class (usually a struct). It is tightly coupled to the Implementation class. A Types class has the following properties:

- It contains only type information (no data members or methods).
- It has the same template parameter signature as the User-Visible class.

Separating the Types class from the implementation solves the following problems.

- To add a new user-visible template parameter, programmers only need to change the User-Visible class and the Types class (this is a constant, small amount of code to change because it does not involve the Implementation).
- The Types class encapsulates all dependencies between the input template parameters and derives all types used by the Implementation class.
- The Types class can compute arbitrarily complex type dependencies without cluttering the definition of the Implementation class.

C.2.1 Code Example

template <class T, class S>
struct MyClass_Types {
    // ... define all typedefs and type derivations needed
    // by the Implementation class
};

C.3 Implementation Class

The Implementation class defines the behavior of the class and contains the bulk of the code normally specified by a class definition. It is a standard class definition except that it takes only one template parameter: the Types class. Building the Implementation class in this way has the following benefits.

- The fact that the Implementation class contains only a single template parameter makes it much easier to read (the reader can understand the code's behavior independently from the type information and dependencies).
- Using a Types class to pass in all template parameters makes it possible to add or remove template parameters without affecting code that is independent of the change (just as using an object or struct in structured programming allows the number of parameters to change without altering the interface).
- All public typedefs required by the Concept can be declared at the top of the class definition, making the type requirements of the Concept very clear.

C.3.1 Code Example

template <class Types>
class MyClass_Impl {
public:
    // ... get all template parameter information, including
    // derived types, from Types. The implementation class
    // can even be derived from a base class type defined in
    // Types, or choose to derive from Types.

    // ... Standard class definition ...
};

C.4 Questions

Q) Why can't the User-Visible class and the Types class be the same class?

A) The problem is that base classes cannot be parameterized by their derived classes. This means it is impossible for the Implementation class to obtain a typedef defined in the derived (Types) class.

Q) Does the current C++ language fully support this design pattern?

A) Almost. One significant shortcoming is that the public interface of templated base classes is not automatically added to the public interface of the derived class. This forces authors of User-Visible classes to redeclare typedefs and methods. The code example below does not re-publish this information because Visual Studio 2003 permitted this behavior; however, the stricter gcc 4.x compilers require explicit declarations to publish base class methods and types. In the Glift code base, I have used macros to lessen this burden, but that is unsatisfactory for all of the usual reasons that macros are problematic.

Q) Why not have users directly instantiate the Types class and get rid of the User-Visible class entirely? Wouldn't this be a more flexible pattern than having a fixed interface defined in the User-Visible class?

A) I considered that but found the resulting client code unacceptably complex. I was also not able to find precedent for this idiom in Boost. In fact, most Boost libraries do define a traditional-looking template class that the client instantiates.

The pattern does not exclude this usage, however. It is possible to bypass the User-Visible class, create a custom Types class, and directly instantiate the Implementation class. For example:

typedef MyCustomTypes da_types;
typedef MyClass_Impl<da_types> da_class;

The resulting class will be fully compatible with all other uses of MyClass. This feature is actually quite significant because it demonstrates that the template parameters are truly decoupled from the implementation.

C.5 Complete Code Example

#include <iostream>
using namespace std;

// Define template parameters here
// - NOTE: This is the ONLY place that they are declared!!! :) :) :)
#define PARAM_LIST class P0, class P1
#define ARG_LIST   P0, P1

// Forward decls
template <PARAM_LIST> class MyClass;
template <PARAM_LIST> struct MyClass_Types;
template <class Types> class MyClass_Impl;

// User-visible convenience class
template <PARAM_LIST>
class MyClass : public MyClass_Impl< MyClass_Types<ARG_LIST> > {
public:
    // Mirror MyClass_Impl's constructors
    MyClass(P0 a, P1 b) : MyClass_Impl< MyClass_Types<ARG_LIST> >(a, b) {}
};

// Type information for implementation
// - Resolves all dependencies of types, etc. Users
//   can also define their own version of this guy if they need
//   to customize the type derivation for their application.
template <PARAM_LIST>
struct MyClass_Types {
    typedef P0 first_type;
    typedef P1 second_type;
};

// Implementation of class
// - Not normally instantiated directly by user but can be if they
//   want to create a custom Types trait class.
template <class Types>
class MyClass_Impl {
public:
    typedef typename Types::first_type  first_type;
    typedef typename Types::second_type second_type;

    MyClass_Impl(first_type a, second_type b) : m_a(a), m_b(b) {}

    first_type  a() const { return m_a; }
    second_type b() const { return m_b; }

private:
    first_type  m_a;
    second_type m_b;
};

int main() {
    MyClass<int, float> m(3, 3.1415f);
    cout << m.a() << "," << m.b() << endl;

    return 0;
}

Bibliography

[1] ATI All-In-Wonder x1900 256MB Review. http://www.beyond3d.com/reviews/ati/aiwx1900/, 2006.

[2] Timo Aila and Samuli Laine. Alias-free shadow maps. In Proceedings of the Eurographics Symposium on Rendering, pages 161–166. Eurographics Association, June 2004.

[3] Andrei Alexandrescu. Modern C++ Design: Generic Programming and Design Patterns Applied. Addison-Wesley, 2001.

[4] Ping An, Alin Jula, Silvius Rus, Steven Saunders, Tim Smith, Gabriel Tanase, Nathan Thomas, Nancy Amato, and Lawrence Rauchwerger. STAPL: An adaptive, generic parallel C++ library. In Workshop on Languages and Compilers for Parallel Computing, pages 193–208, August 2001.

[5] Matthew H. Austern, Ross A. Towle, and Alexander A. Stepanov. Range partition adaptors: a mechanism for parallelizing STL. SIGAPP Appl. Comput. Rev., 4(1):5–6, 1996.

[6] David Baraff, John Anderson, Michael Kass, and . Physically based modelling. ACM SIGGRAPH Course Notes, July 2003.

[7] David Benson and Joel Davis. Octree textures. ACM Transactions on Graphics, 21(3):785–790, July 2002.

[8] Marcelo Bertalmio, Pere Fort, and Daniel Sanchez-Crespo. Real-time, accurate depth of field using anisotropic diffusion and programmable graphics cards. In Second International Symposium on 3D Data Processing, Visualization and Transmission (3DPVT'04), pages 767–773, 2004.

[9] Beyond3D. DirectX Next Early Preview. http://www.beyond3d.com/articles/directxnext/, 2003.

[10] Alécio P. D. Binotto, João L. D. Comba, and Carla M. D. Freitas. Real-time volume rendering of time-varying data using a fragment-shader compression approach. In IEEE Symposium on Parallel and Large-Data Visualization and Graphics, pages 69–75, October 2003.

[11] G. E. Blelloch. Scans as primitive parallel operations. IEEE Transactions on Computers, 38(11):1526–1538, 1989.

[12] Guy E. Blelloch. Prefix sums and their applications. Technical Report CMU-CS-90-190, School of Computer Science, Carnegie Mellon University, November 1990.

[13] Guy E. Blelloch. NESL: A nested data-parallel language (version 2.6). Technical Report CMU-CS-93-129, School of Computer Science, Carnegie Mellon University, April 1993.

[14] Jeff Bolz, Ian Farmer, Eitan Grinspun, and Peter Schröder. Sparse matrix solvers on the GPU: Conjugate gradients and multigrid. ACM Transactions on Graphics, 22(3):917–924, July 2003.

[15] Boost. Boost C++ libraries. http://www.boost.org/, 2005.

[16] Ian Buck. Taking the plunge into GPU computing. In Matt Pharr, editor, GPU Gems 2, chapter 32, pages 509–519. Addison Wesley, March 2005.

[17] Ian Buck, Kayvon Fatahalian, and Pat Hanrahan. GPUBench: Evaluating GPU performance for numerical and scientific applications. In 2004 ACM Workshop on General-Purpose Computing on Graphics Processors, pages C–20, August 2004.

[18] Ian Buck, Tim Foley, Daniel Horn, Jeremy Sugerman, Kayvon Fatahalian, Mike Houston, and Pat Hanrahan. Brook for GPUs: Stream computing on graphics hardware. ACM Transactions on Graphics, 23(3):777–786, August 2004.

[19] Juan Buhler and Dan Wexler. A phenomenological model for bokeh rendering. In ACM SIGGRAPH 2002 Conference Abstracts and Applications, July 2002. http://www.flarg.com/Graphics/Bokeh.html.

[20] William W. Carlson, Jesse M. Draper, David E. Culler, Kathy Yelick, Eugene Brooks, and Karren Warren. Introduction to UPC and language specification. Technical Report CCS-TR-99-157, IDA Center for Computing Sciences, Bowie, MD, USA, 1999.

[21] Nathan A. Carr, Jesse D. Hall, and John C. Hart. The ray engine. In Graphics Hardware 2002, pages 37–46, September 2002.

[22] Nathan A. Carr and John C. Hart. Painting detail. ACM Transactions on Graphics, 23(3):845–852, August 2004.

[23] Bradford L. Chamberlain, Sung-Eun Choi, E Christopher Lewis, Calvin Lin, Lawrence Snyder, and W. Derrick Weathersby. The case for high level parallel programming in ZPL. IEEE Computational Science and Engineering, 5(3):76–86, July/September 1998.

[24] Eric Chan and Frédo Durand. An efficient hybrid shadow rendering algorithm. In Proceedings of the Eurographics Symposium on Rendering, pages 185–195. Eurographics Association, 2004.

[25] Chialin Chang, Alan Sussman, and Joel Saltz. Object-oriented runtime support for complex distributed data structures. Technical Report UMIACS-TR-95-35, Univ. of Maryland Institute for Advanced Computer Studies, College Park, MD, USA, 1995.

[26] Hamilton Y. Chong and Steven J. Gortler. A lixel for every pixel. In Rendering Techniques 2004: 15th Eurographics Workshop on Rendering, pages 167–172, June 2004.

[27] Martin Christen. Ray tracing on GPU. Master’s thesis, University of Applied Sciences Basel, 2005.

[28] Per H. Christensen and Dana Batali. An irradiance atlas for global illumination in complex production scenes. In Rendering Techniques 2004, pages 133–141, June 2004.

[29] M. Cole and S. Parker. Dynamic compilation of C++ template code. In Scientific Programming, volume 11, pages 321–327. IOS Press, 2003.

[30] Robert L. Cook, Thomas Porter, and Loren Carpenter. Distributed ray tracing. In Computer Graphics (Proceedings of SIGGRAPH 84), volume 18, pages 137–145, July 1984.

[31] Greg Coombe, Mark J. Harris, and Anselmo Lastra. Radiosity on graphics hardware. In Proceedings of the 2004 Conference on Graphics Interface, pages 161–168, May 2004.

[32] Franklin C. Crow. Shadow algorithms for computer graphics. In Computer Graphics (Proceedings of SIGGRAPH 77), volume 11, pages 242–248, July 1977.

[33] David E. Culler, Andrea C. Arpaci-Dusseau, Seth Copen Goldstein, Arvind Krishnamurthy, Steven Lumetta, Thorsten von Eicken, and Katherine A. Yelick. Parallel programming in Split-C. In Supercomputing, pages 262–273, 1993.

[34] David (grue) DeBry, Jonathan Gibbs, Devorah DeLeon Petty, and Nate Robins. Painting and rendering textures on unparameterized models. ACM Transactions on Graphics, 21(3):763–768, July 2002.

[35] Joe Demers. Depth of field: A survey of techniques. In Randima Fernando, editor, GPU Gems, pages 375–390. Addison Wesley, March 2004.

[36] Alexandre Duret-Lutz, Thierry Géraud, and Akim Demaille. Design patterns for generic programming in C++. In Proceedings of the 6th USENIX Conference on Object-Oriented Technologies and Systems (COOTS), pages 189–202, San Antonio, Texas, USA, January–February 2001. USENIX Association.

[37] Manfred Ernst, Christian Vogelgsang, and Günther Greiner. Stack implementation on programmable graphics hardware. In Proceedings of Vision, Modeling, and Visualization, pages 255–262, November 2004.

[38] Cass Everitt. Interactive order-independent transparency. Technical report, NVIDIA Corporation, May 2001. http://developer.nvidia.com/object/Interactive_Order_Transparency.html.

[39] Kayvon Fatahalian, Timothy J. Knight, Mike Houston, Mattan Erez, Daniel Reiter Horn, Larkhoon Leem, Ji Young Park, Manman Ren, Alex Aiken, William J. Dally, and Pat Hanrahan. Sequoia: Programming the memory hierarchy. In Proceedings of the 2006 ACM/IEEE Conference on Supercomputing, November 2006. To appear. http://graphics.stanford.edu/projects/sequoia.

[40] Randima Fernando. Percentage-closer soft shadows. In ACM SIGGRAPH 2005 Conference Abstracts and Applications, August 2005.

[41] Randima Fernando, Sebastian Fernandez, Kavita Bala, and Donald P. Greenberg. Adaptive shadow maps. In Proceedings of ACM SIGGRAPH 2001, Computer Graphics Proceedings, Annual Conference Series, pages 387–390, August 2001.

[42] Tim Foley and Jeremy Sugerman. KD-Tree acceleration structures for a GPU raytracer. In Graphics Hardware 2005, July 2005. To appear.

[43] Tom Forsyth. Practical shadows. In Game Developers Conference 2004, March 2004. http://www.eelpi.gotdns.org/papers/papers.html.

[44] Nico Galoppo, Naga K. Govindaraju, Michael Henson, and Dinesh Manocha. LU-GPU: Efficient algorithms for solving dense linear systems on graphics hardware. In Proceedings of the 2005 ACM/IEEE Conference on Supercomputing, page 3, 2005.

[45] Erich Gamma, Richard Helm, Ralph Johnson, and John Vlissides. Design patterns: Elements of reusable object-oriented software. Professional Computing Series, Addison Wesley, 1995.

[46] Nolan Goodnight, Cliff Woolley, Gregory Lewin, David Luebke, and Greg Humphreys. A multigrid solver for boundary value problems using programmable graphics hardware. In Graphics Hardware 2003, pages 102–111, July 2003.

[47] Naga K. Govindaraju, Nikunj Raghuvanshi, Michael Henson, David Tuft, and Dinesh Manocha. A cache-efficient sorting algorithm for database and data mining computations using graphics processors. Technical Report TR05-016, University of North Carolina, 2005.

[48] GPUSort: A high performance GPU sorting library. http://gamma.cs.unc.edu/GPUSORT/, 2005.

[49] Simon Green. Image processing tricks in OpenGL. http://download.nvidia.com/developer/presentations/2005/GDC/OpenGL_Day/OpenGL_Image_Processing_Tricks.pdf, 2005.

[50] Brian Guenter, Todd B. Knoblock, and Erik Ruf. Specializing shaders. In Proceedings of SIGGRAPH 95, Computer Graphics Proceedings, Annual Conference Series, pages 343–350, August 1995.

[51] Paul E. Haeberli and Kurt Akeley. The accumulation buffer: Hardware support for high-quality rendering. In Computer Graphics (Proceedings of SIGGRAPH 90), volume 24, pages 309–318, August 1990.

[52] Mark Harris and Ian Buck. GPU flow control idioms. In Matt Pharr, editor, GPU Gems 2, chapter 34, pages 547–555. Addison Wesley, March 2005.

[53] Mark J. Harris, William Baxter III, Thorsten Scheuermann, and Anselmo Lastra. Simulation of cloud dynamics on graphics hardware. In Graphics Hardware 2003, pages 92–101, July 2003.

[54] J.-M. Hasenfratz, M. Lapierre, N. Holzschuch, and F. Sillion. A survey of real-time soft shadows algorithms. Computer Graphics Forum, 22(4):753–774, December 2003.

[55] Karl E. Hillesland, Sergey Molinov, and Radek Grzeszczuk. Nonlinear optimization framework for image-based modeling on programmable graphics hardware. ACM Transactions on Graphics, 22(3):925–934, July 2003.

[56] W. Daniel Hillis and Guy L. Steele Jr. Data parallel algorithms. Communications of the ACM, 29(12):1170–1183, December 1986.

[57] R. W. Hockney. A fast direct solution of Poisson’s equation using Fourier analysis. Journal of the ACM, 12(1):95–113, January 1965.

[58] E. G. Hoel and H. Samet. Data-parallel primitives for spatial operations. In ICPP95, pages III:184–191, August 1995.

[59] Daniel Horn. Stream reduction operations for GPGPU applications. In Matt Pharr, editor, GPU Gems 2, chapter 36, pages 573–589. Addison Wesley, March 2005.

[60] Mehdi Jazayeri. Component programming: a fresh look at software components. In European Software Engineering Conference, pages 457–478, September 1995.

[61] John V. W. Reynders III and Julian C. Cummings. The POOMA framework. Comput. Phys., 12(5):453–459, 1998.

[62] Elizabeth Johnson and Dennis Gannon. HPC++: Experiments with the parallel standard template library. In International Conference on Supercomputing, pages 124–131, 1997.

[63] Gregory S. Johnson, Juhyun Lee, Christopher A. Burns, and William R. Mark. The irregular Z-buffer: Hardware acceleration for irregular data structures. ACM Transactions on Graphics, 24(4):1462–1482, October 2005.

[64] L. V. Kale and Sanjeev Krishnan. Charm++: Parallel Programming with Message-Driven Objects. In Gregory V. Wilson and Paul Lu, editors, Parallel Programming using C++, pages 175–213. MIT Press, 1996.

[65] Filip Karlsson and Carl Johan Ljungstedt. Ray tracing fully implemented on programmable graphics hardware. Master’s thesis, Chalmers University of Technology, 2004.

[66] Steve Karmesin, Scott Haney, Bill Humphrey, Julian Cummings, Tim Williams, Jim Crotinger, Stephen Smith, and Eugene Gavrilov. POOMA: Parallel object-oriented methods and applications. http://acts.nersc.gov/pooma/, 2002.

[67] George Em Karniadakis and Robert M. Kirby II. Parallel Scientific Computing in C++ and MPI: A Seamless Approach to Parallel Algorithms and their Implementation. Cambridge University Press, 1st edition, 2003.

[68] Michael Kass and Gavin Miller. Rapid, stable fluid dynamics for computer graphics. In Computer Graphics (Proceedings of SIGGRAPH 90), volume 24, pages 49–57, August 1990.

[69] Ricky A. Kendall, Masha Sosonkina, William D. Gropp, Robert W. Numrich, and Thomas Sterling. Parallel programming models applicable to cluster computing and beyond. In Are Magnus Bruaset and Aslak Tveito, editors, Numerical Solution of Partial Differential Equations on Parallel Computers, volume 51 of Lecture Notes in Computational Science and Engineering. Springer-Verlag, 2005.

[70] John Kessenich, Dave Baldwin, and Randi Rost. The OpenGL Shading Language version 1.10.59. http://www.opengl.org/documentation/oglsl.html, April 2004.

[71] T. Kilburn, D. B. G. Edwards, M. J. Lanigan, and F. H. Sumner. One-level storage system. IRE Transactions on Electronic Computers, EC-11:223–235, April 1962.

[72] Emmett Kilgariff and Randima Fernando. The GeForce 6 series GPU architecture. In Matt Pharr, editor, GPU Gems 2, chapter 30, pages 471–491. Addison Wesley, March 2005.

[73] Joe Kniss, Aaron Lefohn, Robert Strzodka, Shubhabrata Sengupta, and John D. Owens. Octree textures on graphics hardware. In ACM SIGGRAPH 2005 Conference Abstracts and Applications, August 2005.

[74] Martin Kraus and Thomas Ertl. Adaptive texture maps. In Graphics Hardware 2002, pages 7–16, September 2002.

[75] Jaroslav Křivánek, Jiří Žára, and Kadi Bouatouch. Fast depth of field rendering with surface splatting. In Computer Graphics International, pages 196–201, 2003.

[76] Jens Krüger and Rüdiger Westermann. Linear algebra operators for GPU implementation of numerical algorithms. ACM Transactions on Graphics, 22(3):908–916, July 2003.

[77] David J. Kuck. A survey of parallel machine organization and programming. ACM Comput. Surv., 9(1):29–59, 1977.

[78] Sylvain Lefebvre, Samuel Hornus, and Fabrice Neyret. All-purpose texture sprites. Technical Report 5209, INRIA, May 2004.

[79] Sylvain Lefebvre, Samuel Hornus, and Fabrice Neyret. Octree textures on the GPU. In Matt Pharr, editor, GPU Gems 2, chapter 37, pages 595–613. Addison Wesley, March 2005.

[80] Aaron Lefohn, Joe Kniss, and John Owens. Implementing efficient parallel data structures on GPUs. In Matt Pharr, editor, GPU Gems 2, chapter 33, pages 521–545. Addison Wesley, March 2005.

[81] Aaron Lefohn, Shubhabrata Sengupta, Joe Kniss, Robert Strzodka, and John D. Owens. Dynamic adaptive shadow maps on graphics hardware. In ACM SIGGRAPH 2005 Conference Abstracts and Applications, August 2005.

[82] Aaron E. Lefohn, Joe Kniss, Robert Strzodka, Shubhabrata Sengupta, and John D. Owens. Glift: Generic, efficient, random-access GPU data structures. ACM Transactions on Graphics, 25(1):60–99, January 2006.

[83] Aaron E. Lefohn, Joe M. Kniss, Charles D. Hansen, and Ross T. Whitaker. Interactive deformation and visualization of level set surfaces using graphics hardware. In IEEE Visualization 2003, pages 75–82, October 2003.

[84] Aaron E. Lefohn, Joe M. Kniss, Charles D. Hansen, and Ross T. Whitaker. A streaming narrow-band algorithm: Interactive computation and visualization of level-set surfaces. IEEE Transactions on Visualization and Computer Graphics, 10(4):422–433, July/August 2004.

[85] Erik Lindholm, Mark J. Kilgard, and Henry Moreton. A user-programmable vertex engine. In Proceedings of ACM SIGGRAPH 2001, Computer Graphics Proceedings, Annual Conference Series, pages 149–158, August 2001.

[86] Brandon Lloyd, Sung-eui Yoon, David Tuft, and Dinesh Manocha. Subdivided shadow maps. Technical Report TR05-024, University of North Carolina at Chapel Hill, 2005.

[87] Frank Losasso, Frédéric Gibou, and Ron Fedkiw. Simulating water and smoke with an octree data structure. ACM Transactions on Graphics, 23(3):457–462, August 2004.

[88] David B. Loveman. High Performance Fortran. IEEE Parallel & Distributed Technology, 1(1):25–42, 1993.

[89] William R. Mark, R. Steven Glanville, Kurt Akeley, and Mark J. Kilgard. Cg: A system for programming graphics hardware in a C-like language. ACM Transactions on Graphics, 22(3):896–907, July 2003.

[90] Tobias Martin and Tiow-Seng Tan. Anti-aliasing and continuity with trapezoidal shadow maps. In Rendering Techniques 2004: 15th Eurographics Workshop on Rendering, pages 153–160, June 2004.

[91] Michael McCool, Stefanus Du Toit, Tiberiu Popa, Bryan Chan, and Kevin Moule. Shader algebra. ACM Transactions on Graphics, 23(3):787–795, August 2004.

[92] Patrick S. McCormick, Jeff Inman, James P. Ahrens, Chuck Hansen, and Greg Roth. Scout: A hardware-accelerated system for quantitatively driven visualization and analysis. In IEEE Visualization 2004, pages 171–178, October 2004.

[93] DirectX pixel shader 3.0 specification. http://msdn.microsoft.com/library/default.asp?url=/library/en-us/directx9_c/directx/graphics/reference/assemblylanguageshaders/pixelshaders/ps_3_0.asp, 2004.

[94] DirectX vertex shader 3.0 specification. http://msdn.microsoft.com/library/default.asp?url=/library/en-us/directx9_c/directx/graphics/reference/assemblylanguageshaders/vertexshaders/vs_3_0.asp, 2004.

[95] Nathan Myers. Traits: a new and useful template technique. C++ Report, 7(5):32–35, June 1995. http://www.cantrip.org/traits.html.

[96] GPU programming exposed: The naked truth behind NVIDIA's demos. http://download.nvidia.com/developer/presentations/2005/SIGGRAPH/Truth_About_NVIDIA_Demos.pdf, 2005.

[97] OpenGL Architecture Review Board. ARB fragment program. Revision 26. http://oss.sgi.com/projects/ogl-sample/registry/ARB/fragment_program.txt, 22 August 2003.

[98] A.V. Oppenheim and R.W. Schafer. Discrete-Time Signal Processing. Prentice-Hall, 1989.

[99] John Owens. Streaming architectures and technology trends. In Matt Pharr, editor, GPU Gems 2, chapter 29, pages 457–470. Addison Wesley, March 2005.

[100] PCI Special Interest Group. PCI Express: Performance scalability for the next decade. http://www.pcisig.com/specifications/pciexpress/.

[101] Darwyn Peachey. Texture on demand. Pixar Animation Studios Technical Memo, 1990.

[102] Mark S. Peercy, Marc Olano, John Airey, and P. Jeffrey Ungar. Interactive multi-pass programmable shading. In Proceedings of ACM SIGGRAPH 2000, Computer Graphics Proceedings, Annual Conference Series, pages 425–432, July 2000.

[103] Fabio Pellacini, Kiril Vidimce, Aaron Lefohn, Alex Mohr, Mark Leone, and John Warren. Lpics: A hardware-accelerated relighting engine for computer cinematography. ACM SIGGRAPH 2005, ACM Transactions on Graphics, 2005.

[104] D. Pham, S. Asano, M. Bolliger, M. N. Day, H. P. Hofstee, C. Johns, J. Kahle, A. Kameyama, J. Keaty, Y. Masubuchi, M. Riley, D. Shippy, D. Stasiak, M. Wang, J. Warnock, S. Weitzel, D. Wendel, T. Yamazaki, and K. Yazawa. The design and implementation of a first-generation CELL processor. In Proceedings of the International Solid-State Circuits Conference, pages 184–186, February 2005.

[105] Matt Pharr. An introduction to shader interfaces. In Randima Fernando, editor, GPU Gems, chapter 32, pages 537–550. Addison Wesley, March 2004.

[106] M. Potmesil and I. Chakravarty. A lens and aperture camera model for synthetic image generation. In Computer Graphics (Proceedings of SIGGRAPH 81), volume 15, pages 297–305, August 1981.

[107] Kekoa Proudfoot, William R. Mark, Svetoslav Tzvetkov, and Pat Hanrahan. A real-time procedural shading system for programmable graphics hardware. In Proceedings of ACM SIGGRAPH 2001, Computer Graphics Proceedings, Annual Conference Series, pages 159–170, August 2001.

[108] Timothy J. Purcell, Ian Buck, William R. Mark, and Pat Hanrahan. Ray tracing on programmable graphics hardware. ACM Transactions on Graphics, 21(3):703–712, July 2002.

[109] Timothy J. Purcell, Craig Donner, Mike Cammarano, Henrik Wann Jensen, and Pat Hanrahan. Photon mapping on programmable graphics hardware. In Graphics Hardware 2003, pages 41–50, July 2003.

[110] William T. Reeves, David H. Salesin, and Robert L. Cook. Rendering antialiased shadows with depth maps. In Computer Graphics (Proceedings of SIGGRAPH 87), volume 21, pages 283–291, July 1987.

[111] Alexander Reshetov, Alexei Soupikov, and Jim Hurley. Multi-level ray tracing algorithm. ACM Transactions on Graphics, 24(3):1176–1185, August 2005.

[112] Guennadi Riguer, Natalya Tatarchuk, and John Isidoro. Real-time depth of field simulation. In Wolfgang F. Engel, editor, ShaderX2: Shader Programming Tips and Tricks with DirectX 9, chapter 4.7, pages 529–556. Wordware, 2003.

[113] Thorsten Scheuermann and Natalya Tatarchuk. Improved depth-of-field rendering. In Wolfgang Engel, editor, ShaderX3: Advanced Rendering with DirectX and OpenGL, chapter 4.4, pages 363–377. Charles River Media, 2004.

[114] Jens Schneider and Rüdiger Westermann. Compression domain volume rendering. In IEEE Visualization 2003, pages 293–300, October 2003.

[115] Mark Segal and Kurt Akeley. The OpenGL Graphics System: A Specification (Version 2.0 - October 22, 2004), October 2004.

[116] Pradeep Sen. Silhouette maps for improved texture magnification. In Graphics Hardware 2004, pages 65–74, August 2004.

[117] Pradeep Sen, Michael Cammarano, and Pat Hanrahan. Shadow silhouette maps. ACM Transactions on Graphics, 22(3):521–526, July 2003.

[118] SGI. The Standard Template Library: Introduction. http://www.sgi.com/tech/stl/stl_introduction.html, 1994.

[119] Marc Stamminger and George Drettakis. Perspective shadow maps. ACM Transactions on Graphics, 21(3):557–562, July 2002.

[120] Alexander Stepanov and Al Stevens. Al Stevens interviews Alex Stepanov. http://www.sgi.com/tech/stl/drdobbs-interview.html, 1995.

[121] Robert Strzodka and Alexandru Telea. Generalized distance transforms and skeletons in graphics hardware. In Proceedings of EG/IEEE TCVG Symposium on Visualization (VisSym ’04), pages 221–230, 2004.

[122] Herb Sutter. The free lunch is over: A fundamental turn toward concurrency in software. Dr. Dobb’s Journal, 30(3), March 2005.

[123] László Szirmay-Kalos, Barnabás Aszódi, István Lazányi, and Mátyás Premecz. Approximate ray-tracing on the GPU with distance impostors. Computer Graphics Forum, 24(3), September 2005. To appear.

[124] Marco Tarini, Kai Hormann, Paolo Cignoni, and Claudio Montani. PolyCube-Maps. ACM Transactions on Graphics, 23(3):853–860, August 2004.

[125] Niels Thrane and Lars O. Simonsen. A comparison of acceleration structures for GPU assisted ray tracing. Master's thesis, University of Aarhus, August 2005.

[126] David Vandevoorde and Nicolai M. Josuttis. C++ Templates: The Complete Guide. Addison Wesley, 2002.

[127] T. L. Veldhuizen and M. E. Jernigan. Will C++ be faster than Fortran? In Proceedings of the 1st International Scientific Computing in Object-Oriented Parallel Environments (ISCOPE'97), Lecture Notes in Computer Science. Springer-Verlag, 1997.

[128] Todd L. Veldhuizen. Scientific computing: C++ versus Fortran: C++ has more than caught up. Dr. Dobb’s Journal of Software Tools, 22(11):34, 36–38, 91, November 1997.

[129] Todd L. Veldhuizen and Dennis Gannon. Active libraries: Rethinking the roles of compilers and libraries. In Proceedings of the SIAM Workshop on Object Oriented Methods for Interoperable Scientific and Engineering Computing (OO'98). SIAM Press, 1998.

[130] Ingo Wald, Timothy J. Purcell, Jörg Schmittler, Carsten Benthin, and Philipp Slusallek. Realtime ray tracing and its use for interactive global illumination. In Eurographics 2003, State of the Art Reports, pages 85–122, September 2003.

[131] Lance Williams. Casting curved shadows on curved surfaces. In Computer Graphics (Proceedings of SIGGRAPH 78), volume 12, pages 270–274, August 1978.

[132] Michael Wimmer, Daniel Scherzer, and Werner Purgathofer. Light space perspective shadow maps. In Eurographics Symposium on Rendering, pages 143–151, June 2004.

[133] Andrew Woo, Pierre Poulin, and Alain Fournier. A survey of shadow algorithms. IEEE Computer Graphics & Applications, 10(6):13–32, November 1990.

[134] Katherine A. Yelick, Luigi Semenzato, Geoff Pike, Carleton Miyamoto, Ben Liblit, Arvind Krishnamurthy, Paul N. Hilfinger, Susan L. Graham, David Gay, Phillip Colella, and Alexander Aiken. Titanium: A high-performance Java dialect. Concurrency: Practice and Experience, 10(11/13):825–836, September/November 1998.

[135] Tin-Tin Yu. Depth of field implementation with OpenGL. Journal of Computing Sciences in Colleges, 20(1):136–146, October 2004.