Delay Streams for Graphics Hardware
Total Page:16
File Type:pdf, Size:1020Kb
Rendering commands from host Higher-order Tessellator primitives Vertex shader Vertex attributes Clipping and back-face culling Causal occlusion test & write Tile depth values scenes may have millions ofrealism triangles but and increase a the fill high rate depthing requirements complexity, complex even further. pixel Realistic shadertive programs. 3D applications Shadows are beginning improveuncommon to the and compute frame visual the buffers illuminationrisen have us- dramatically. high Screen dynamic resolutions ranges.number of 1600x1200 of Interac- pixels pixels are andable not the fill cost rate of ratherapplications, rendering than such individual the as pixels geometry state-of-the-art have processingwidths computer of power. games, almost 20GB/s. is Both Still the the Modern the avail- bottleneck consumer-level in graphics real-time rendering cards have video memory1 band- Introduction independent transparency, antialiasing, stream processing Keywords: and Realism—Hidden line/surface removal I.3.7 [COMPUTER GRAPHICS]:GRAPHICS]: Three-Dimensional Picture/Image Graphics ware Generation—Display Architecture—GraphicsCR algorithms; Processors; Categories: I.3.3tivity information. [COMPUTER ing samples in adaptive supersampling arecan thus be replaced detected by in connec- hardware.using Previously delay used heuristics streams. for collaps- Finally,of we order-independent describe transparency how can discontinuity bemark edges substantially scenes. reduced We by alsocessing demonstrate and how the video memoryWe memory requirements show bandwidth two- usage totriangle in fourfold may common be efficiency culled improvements bench- sequent by in triangles primitives generate pixel that occlusion pro- processing were information. units. submitted While As a a triangle result, resides in the the delaymay stream, change the sub- interpretation ofthe current traditional events. causal renderingtransparency and architectures, edge because antialiasing cannot futurewell-known be problems, data properly such as solved occlusion using culling,In order-independent causal processes decisions do not depend onAbstract future data. Many visible We propose adding a ∗ ‡ † [email protected] wili@hybrid.fi [email protected].fi r Occlusion Discontinuity edge detection e l l information Compressed o r t n Delay stream FIFO vertices and o render states c y r 3D graphics hardware, occlusion culling, order- o Delayed occlusion test Helsinki University of Technology and m Min/max Z e visible for full 8x8 O m F o I pixel tiles t e Rasterizer F n I.3.1 [COMPUTER GRAPHICS]: Hard- d i e delay stream m h d V t a n Hybrid Graphics, Ltd. p e e r e Pixel shader t p d s e s d d y e n c n i h a - n Alpha, stencil and depth test c Timo Aila r e e a e r Delay Streams for Graphics Hardware c a d m r r p a e s r O f Y Order-independent transparency? f f n between the vertex and pixel , u a e r b t r N u t x e Blending ∗ T Compressed vertices, render states, pixel masks after it. Hybrid Graphics, Ltd. and University of Helsinki would have a greater impact ifangles triangles cannot that be sorted byin order the to application. render Also, them correctly.portant occlusion However, culling algorithms hardware-generated tri- is limited.global Transparent information triangles about must the be scene,rasterized sorted the in effectiveness the of order certain im- they are submitted. Asto the be hardware sent has to no theto hardware. the application so that objectstives that and are pixels. entirely These hidden units dohave also not supply have been visibility information introduced back limited for to avoiding the rasterization edgespressed of of prior hidden to triangles. transmission primi- to Separatemats. video occlusion Color memory. and culling Antialiasing depth can units values be higher-order of pixels are primitives. cached on-chip andvideo Textures com- memory or are generated on-the-fly storedwidth from displacement in and maps and compressed pixel for- processing work.pixel. Triangle data isfor stored antialiasing in in turn increase the theometry number consumes of samples more used memory for and each i.e., is the prone average to number aliasing. of Demands times each pixel is drawn. Detailed ge- where triangles are placed afterthe geometry processing. overall performance. Thehardware most design important for addition reducingintroduce is the several the bandwidth relatively delay minor requirements stream Figure modifications and 1: (marked improving in gray) into the could be used to occlude triangles submitted before them. Ville Miettinen Established rendering semantics dictate that triangles must be A number of approaches are used for reducing the memory band- Architecture of a modern, DirectX9-level graphics hardware. We † Petri Nordlund Bitboys Oy are going to be ‡ in front Contributions This paper concentrates on optimizing the pixel jority of bandwidth problems are avoided but the limited amount of processing performance and bandwidth requirements of certain key video memory makes capturing large scenes impractical. areas in consumer graphics hardware. Our main contribution is us- Several experimental architectures have been proposed. Saar- ing the concept of delay streams for increasing rendering perfor- COR [Schmittler et al. 2002] uses ray casting [Appel 1968] for de- mance (see Figure 1). By introducing a delay between the geometry termining the visible surfaces and performs occlusion culling im- processing and rasterization stages of a graphics pipeline, we en- plicitly. A few architectures consist of multiple rasterization nodes hance load-balancing and gather knowledge about the global scene and create the final image by using composition [Fuchs et al. 1989; structure. This information is used for improving the performance Molnar et al. 1992; Torborg and Kajiya 1996; Deering and Naegle of several algorithms: 2002]. The sort-last nature of these architectures makes the issue of 1. We introduce an additional occlusion culling test after the de- occlusion culling difficult to address efficiently. lay to further reduce depth complexity. Compared to exist- Zhang et al. [1997] note that an occlusion query can be split into ing hardware implementations [Morein 2000] our approach two sub-tests: one for coverage and one for depth. The result is renders 1.8–4 times fewer pixels in our test scenes. In fill conservatively correct, as long as the coverage test is performed us- rate bounded applications this translates almost directly to the ing full accuracy. Their lower resolution depth test uses a depth frame rate (Section 3). estimation buffer (DEB) which conserves memory by storing a sin- gle conservative depth value for a block of pixels. We utilize DEB 2. We improve existing work [Wittenbrink 2001] for rendering in our occlusion culling unit. order-independent transparency. In our test scenes the storage Durand [1999] provides a comprehensive survey on visibility requirements are an order of magnitude smaller than in the determination. Akenine-Moller¨ and Haines [2002] cover modern previous method (Section 4). hardware occlusion culling algorithms as well as application-level 3. We demonstrate a new hardware algorithm that can identify culling techniques. silhouette edges of objects using geometric hashing. This in- formation is used to reduce the number of pixels that are an- tialiased. For our test models this approach supersamples only Order-independent transparency In order to correctly han- 0.3–2.1% of the screen pixels (Section 5). dle all OpenGL [1999] and DirectX [2002] blending modes, the visible transparent surfaces have to be blended in a back-to-front We have implemented the delay stream and occlusion culling order. Techniques such as volume splatting [Westover 1990] and units in VHDL. The other components have been simulated us- surface splatting [Zwicker et al. 2001] use transparent surfaces ex- ing behavioral models. We present results for all three algorithms tensively. along with bandwidth measurements and compare them to existing The Z-buffer algorithm cannot handle transparent surfaces cor- approaches in several publicly available test scenes and industry- rectly [Catmull 1974]. To overcome this limitation the A-buffer standard benchmark applications. algorithm stores a linked list of fragments for each pixel [Carpenter 1984]. The A-buffer also incorporates subpixel coverage masks for antialiasing of triangle edges. The final color of a pixel is computed 2 Related Work by first sorting the fragments according to increasing depth and then recursively blending the fragments from back to front using accu- Occlusion culling Greene and Kass [1993] avoid most per- mulated coverage masks. Molnar et al. [1992] discuss difficulties of pixel depth comparisons by organizing the depth buffer into a hi- implementing an A-buffer in hardware. Nevertheless, a few imple- erarchy. The scene is represented as an octree that provides an ap- mentations exist [Chauvin 1994; Lee and Kim 2000]. The handling proximate front-to-back traversal. For each octree node the occlu- of interpenetrating surfaces is improved by Schilling et al. [1993] sion query ‘Would this node contribute to the final image if ren- and Jouppi and Chang [1999]. The latter describe an architecture dered?’ is asked. If not, all triangles inside the node and its chil- that maintains as few as three active fragments per pixel and cre- dren can be skipped. Occlusion queries were first supported in ates convincing images with a reasonable storage cost. However, Denali GB graphics hardware [Greene and Kass 1993]. Extend- combinations of different blending modes are not considered. ing OpenGL to handle occlusion queries is discussed by Bartz et Mammen [1989] and more recently Everitt [2001] describe a al. [1998]. The queries are now implemented in consumer graph- multi-pass algorithm that peels transparent layers one at a time and ics hardware, e.g., NVIDIA GeForce3. Meißner et al. [2001] ex- blends them to the frame buffer. Two depth buffers are used concur- tend the queries by storing visibility masks of bounding volumes rently for determining the farthest unprocessed transparent surface for subsequent culling of individual triangles and blocks of pixels.