97 Siggraph Final Version.Mword
Hardware Accelerated Rendering Of Antialiasing Using A Modified A-buffer Algorithm

Stephanie Winner*, Mike Kelley†, Brent Pease**, Bill Rivard*, and Alex Yen†
Apple Computer

* 3Dfx Interactive, San Jose, CA USA, [email protected], [email protected]
† Silicon Graphics Computer Systems, Mountain View, CA USA, [email protected], [email protected]
** Bungie West, San Jose, CA USA, [email protected]

ABSTRACT

This paper describes algorithms for accelerating antialiasing in 3D graphics through low-cost custom hardware. The rendering architecture employs a multiple-pass algorithm to perform front-to-back hidden surface removal and shading. Coverage mask evaluation is used to composite objects in 3D. The key advantage of this approach is that antialiasing requires no additional memory and decreases rendering performance by only 30-40% for typical images. The system is image partition based and is scalable to satisfy a wide range of performance and cost constraints.

CR Categories and Subject Descriptors: I.3.1 [Computer Graphics]: Hardware Architecture - raster display devices; I.3.3 [Computer Graphics]: Picture/Image Generation - display algorithms; I.3.7 [Computer Graphics]: Three-Dimensional Graphics and Realism - visible surface algorithms

Additional Key Words and Phrases: scanline, antialiasing, transparency, texture mapping, plane equation evaluation, image partitioning

1 INTRODUCTION

This paper describes a low-cost hardware accelerator for rendering 3D graphics with antialiasing. It is based on a previous architecture described by Kelley [10]. The hardware implements an innovative algorithm based on the A-buffer [3] that combines high performance front-to-back compositing of 3D objects with coverage mask evaluation. The hardware also performs triangle setup, depth sorting, texture mapping, transparency, shadows, and Constructive Solid Geometry (CSG) operations. Rasterization speed without antialiasing is 100M pixels/second, providing throughput of 2M texture-mapped triangles/second¹. The degradation in speed when antialiasing is enabled for a complex scene is 30%, resulting in 70M pixels/second.

¹ 50 pixel triangles, with tri-linearly interpolated mip-mapped textures.

Several hardware algorithms have been developed which maintain either high quality or performance while reducing or eliminating the large memory requirement of supersampling [11,8]. An accumulation buffer requires only a fraction of the memory of supersampling, but requires several passes of the object data (one pass per subpixel sample) through the hardware rendering pipeline. The resulting image is very high quality, but the performance degrades in proportion to the number of subpixel samples used by the filter function.

An A-buffer implementation does not require several passes of the object data, but does require sorting objects by depth before compositing them. The amount of memory required to store the sorted layers is limited to the number of subpixel samples, but it is significant since the color, opacity and mask data are needed for each layer. The compositing operation uses a blending function which is based on three possible subpixel coverage components and is more computationally intensive than the accumulation buffer blending function. The difficulty of implementing the A-buffer algorithm in hardware is described by Molnar [12].

The A-buffer hardware implementation described in this paper maintains the high performance of the A-buffer using a limited amount of memory. Multiple passes of the object data are sometimes required to composite the data from front-to-back even when antialiasing is disabled. The number of passes required to rasterize a partition increases when antialiasing is used. However, only in the worst case is the number of passes equal to the number of subpixel samples (9, in our system). It is possible to enhance the algorithm as described in [2, 3] to correctly render intersecting objects. The current implementation does not include that enhancement. Furthermore, the algorithm correctly renders images of moderate complexity which have overlapping transparent objects without imposing any constraints on the order in which transparent objects are submitted.

2 SYSTEM OVERVIEW

The hardware accelerator is a single ASIC which performs the 3D rendering and triangle setup. It provides a low-cost solution for high performance 3D acceleration in a personal computer. A second ASIC is used to interface to the system bus or PCI/AGP. The rasterizer uses a screen partitioning algorithm with a partition size of 16x32 pixels. Screen partitioning reduces the memory required for depth sorting and image compositing to a size which can be accommodated inexpensively on-chip. No off-chip memory is needed for the z buffer and dedicated image buffer. The high bandwidth, low latency path between the rasterizer and the on-chip buffers improves performance.

The system's design was guided by three principles. We strove to:

1. Balance the computation between the processor and hardware 3D accelerator;
2. Minimize processor interrupts and system bus bandwidth; and
3. Provide good performance with as little as 2 MB of dedicated memory, but to have performance scale up in higher memory configurations.

The principles inspired the following features:

• The hardware accelerator implements triangle setup to reduce required system bandwidth and balance the computational load between the accelerator and the host processor(s). Multiple rendering ASICs can operate in parallel to match CPU performance.
• The hardware accelerator only interrupts the processor when it has finished processing a frame. This leaves the CPU free to perform geometry, clipping, and shading operations for the next frame while the ASIC is rasterizing the current frame.
• The partition size is 16x32 pixels so that a double-buffered z buffer and image buffer can be stored on-chip. This reduces cost and required memory bandwidth while improving performance. External memory is required for texture map storage, so texture map rendering performance scales with that memory's speed and bandwidth.

In addition to these three design principles, another goal was to provide hardware support for antialiased rendering. Two types of antialiasing quality were desired: a fast mode for interactive rendering, and a slower, high quality mode for producing final images.

For high quality antialiasing, the ASIC uses a traditional accumulation buffer method to antialias each partition by rendering the partition at every subpixel offset and accumulating the results in an off-chip buffer. Because this algorithm is well known [8], this high-quality antialiasing mode is not discussed in this paper.

The more challenging goal was to also provide high quality antialiasing for interactive rendering in less than double the time needed to render a non-antialiased image. We assumed that this type of antialiasing would only be used for playback or previewing, so it could only consume a small portion of the die area. Therefore the challenge in implementing antialiasing was how to properly antialias without maintaining the per pixel coverage and opacity data for each of the layers individually.

Our solution to this problem involves having the ASIC perform Z-ordered shading using a multiple pass algorithm (see the Appendix for pseudo-code of the rendering algorithm). This permits an unlimited number of layers to be rendered for each pixel as in the architecture presented by Mammen [11]. However, because Mammen's architecture performs antialiasing by integrating area samples in multiple passes to successively antialias the image, the number of passes is equal to the number of subpixel positions in the filter kernel. For example, rendering an antialiased image using a typical filter kernel of 8 samples would require 8 times as long as rendering it without antialiasing. Obviously this is too high a performance penalty for use in interactive rendering.

With our modified A-buffer algorithm, the number of passes required to antialias an image is a function of image complexity (opacity and subpixel coverage) in each partition, not the number of subpixel samples. The worst case arises when there are at least 8 layers which have 8 different coverage masks which each cover only one subpixel. This rarely, if ever, occurs in practice. In fact, we have found that an average of only 1.4 passes is required when rendering with a 16x32 partition and an 8 bit mask.

A discussion of the details of the system architecture follows the discussion of the antialiasing algorithm implementation.

2.1 Front-to-Back Antialiasing

The antialiasing algorithm is distributed among three of the major functional blocks of the ASIC (see Figure 1): the Plane Equation Setup, Hidden Surface Removal, and Composite Blocks.

[Figure 1 block diagram. Labels: Triangles, Resubmit, Display List Traversal, system I/O, SDRAM controller, Plane Equation Setup, Scan Conversion, Pixels, Hidden Surface Removal, 32x16 pixel RAM, CSG and Shadow, Texture Cache, Shading and Texture Mapping, 32x16 pixel RAM, Composite, Scanout I/O, image buffer]

Figure 1. Rasterization ASIC pipeline

The Plane Equation Setup calculates plane equation parameters for each triangle and stores them for later evaluation in the relevant processing blocks. The Scan Conversion generates the subpixel coverage masks for each pixel fragment and outputs them to the rendering pipeline. During the Hidden Surface Removal, fragments of tessellated objects are flagged for specific blend operations during shading. The Composite Block shades pixels by merging the coverage masks and alpha values.

2.2 Coverage Mask Generation

We use a staggered subpixel mask, as shown in Figure 2. Each pixel is divided into 16 subpixels, but only half of the samples are used. The mask is stored as an 8 bit value using the bit as-

one layer contains the depth of the data composited during previous passes and the second layer contains the front-most depth of the data which has yet to be composited.

Pixels which are completely covered by opaque objects are resolved in a single pass.
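
To illustrate why a fully covered opaque pixel resolves immediately, while the worst case needs one layer per subpixel sample, the following is a minimal software sketch of front-to-back compositing with 8 bit coverage masks. It is a simplified per-sample model and not the ASIC's blending function: the names `Fragment`, `SAMPLES`, and `composite_front_to_back` are hypothetical, the assignment of mask bits to sample positions is arbitrary, and an unweighted average over the 8 samples is assumed.

```python
from dataclasses import dataclass

SAMPLES = 8  # subpixel samples per pixel, matching the 8 bit mask in the text


@dataclass
class Fragment:
    mask: int     # 8 bit coverage mask; bit i set means sample i is covered
    color: tuple  # (r, g, b), each component in [0, 1]
    alpha: float  # opacity; 1.0 is fully opaque


def composite_front_to_back(fragments):
    """Composite fragments that are already sorted nearest-first.

    Returns (rgb, layers_used), where layers_used counts the fragments
    examined before every sample became fully opaque. This per-pixel
    count is only loosely analogous to the per-partition passes
    described in the text (assumption of this sketch).
    """
    transmittance = [1.0] * SAMPLES  # fraction of light still passing each sample
    rgb = [0.0, 0.0, 0.0]
    layers_used = 0
    for frag in fragments:
        if all(t == 0.0 for t in transmittance):
            break  # every sample is blocked: the pixel is resolved
        layers_used += 1
        for i in range(SAMPLES):
            if (frag.mask >> i) & 1:
                # Contribution of this sample, attenuated by nearer layers.
                weight = transmittance[i] * frag.alpha / SAMPLES
                for c in range(3):
                    rgb[c] += weight * frag.color[c]
                transmittance[i] *= 1.0 - frag.alpha
    return tuple(rgb), layers_used
```

In this model, a full-coverage opaque fragment in front resolves the pixel after one layer, whereas eight opaque fragments that each cover a different single sample require all eight layers, mirroring the worst case described in Section 2.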