Hardware Accelerated Rendering Of Antialiasing Using A Modified A-buffer Algorithm

Stephanie Winner*, Mike Kelley†, Brent Pease**, Bill Rivard*, and Alex Yen†
Apple Computer

* 3Dfx Interactive, San Jose, CA USA, [email protected], [email protected]
† Computer Systems, Mountain View, CA USA, [email protected], [email protected]
** Bungie West, San Jose, CA USA, [email protected]

ABSTRACT

This paper describes algorithms for accelerating antialiasing in 3D graphics through low-cost custom hardware. The rendering architecture employs a multiple-pass algorithm to perform front-to-back hidden surface removal and shading. Coverage mask evaluation is used to composite objects in 3D. The key advantage of this approach is that antialiasing requires no additional memory and decreases rendering performance by only 30-40% for typical images. The system is image partition based and is scalable to satisfy a wide range of performance and cost constraints.

CR Categories and Subject Descriptors: I.3.1 [Computer Graphics]: Hardware Architecture - raster display devices; I.3.3 [Computer Graphics]: Picture/Image Generation - display algorithms; I.3.7 [Computer Graphics]: Three-Dimensional Graphics and Realism - visible surface algorithms

Additional Key Words and Phrases: scanline, antialiasing, transparency, texture mapping, plane equation evaluation, image partitioning

1 INTRODUCTION

This paper describes a low-cost hardware accelerator for rendering 3D graphics with antialiasing. It is based on a previous architecture described by Kelley [10]. The hardware implements an innovative algorithm based on the A-buffer [3] that combines high performance front-to-back compositing of 3D objects with coverage mask evaluation. The hardware also performs triangle setup, depth sorting, texture mapping, transparency, shadows, and Constructive Solid Geometry (CSG) operations. Rasterization speed without antialiasing is 100M pixels/second, providing throughput of 2M texture-mapped triangles/second¹. The degradation in speed when antialiasing is enabled for a complex scene is 30%, resulting in 70M pixels/second.

¹ 50 pixel triangles, with tri-linearly interpolated mip-mapped textures.

Several hardware algorithms have been developed which maintain either high quality or performance while reducing or eliminating the large memory requirement of supersampling [11,8]. An accumulation buffer requires only a fraction of the memory of supersampling, but requires several passes of the object data (one pass per subpixel sample) through the hardware rendering pipeline. The resulting image is very high quality, but the performance degrades in proportion to the number of subpixel samples used by the filter function.

An A-buffer implementation does not require several passes of the object data, but does require sorting objects by depth before compositing them. The amount of memory required to store the sorted layers is limited to the number of subpixel samples, but it is significant since the color, opacity and mask data are needed for each layer. The compositing operation uses a blending function which is based on three possible subpixel coverage components and is more computationally intensive than the accumulation buffer blending function. The difficulty of implementing the A-buffer algorithm in hardware is described by Molnar [12].

The A-buffer hardware implementation described in this paper maintains the high performance of the A-buffer using a limited amount of memory. Multiple passes of the object data are sometimes required to composite the data from front-to-back even when antialiasing is disabled. The number of passes required to rasterize a partition increases when antialiasing is used. However, only in the worst case is the number of passes equal to the number of subpixel samples (9, in our system). It is possible to enhance the algorithm as described in [2, 3] to correctly render intersecting objects. The current implementation does not include that enhancement. Furthermore, the algorithm correctly renders images of moderate complexity which have overlapping transparent objects without imposing any constraints on the order in which transparent objects are submitted.

2 SYSTEM OVERVIEW

The hardware accelerator is a single ASIC which performs the 3D rendering and triangle setup. It provides a low-cost solution for high performance 3D acceleration in a personal computer. A second ASIC is used to interface to the system bus or PCI/AGP. The rasterizer uses a screen partitioning algorithm with a partition size of 16x32 pixels. Screen partitioning reduces the memory required for depth sorting and image compositing to a size which can be accommodated inexpensively on-chip. No off-chip memory is needed for the z buffer and dedicated image buffer. The high bandwidth, low latency path between the rasterizer and the on-chip buffers improves performance.

The system's design was guided by three principles. We strove to:
1. Balance the computation between the processor and hardware 3D accelerator;
2. Minimize processor interrupts and system bus bandwidth; and
3. Provide good performance with as little as 2 MB of dedicated memory, but to have performance scale up in higher memory configurations.

The principles inspired the following features:
• The hardware accelerator implements triangle setup to reduce required system bandwidth and balance the computational load between the accelerator and the host processor(s). Multiple rendering ASICs can operate in parallel to match CPU performance.
• The hardware accelerator only interrupts the processor when it has finished processing a frame. This leaves the CPU free to perform geometry, clipping, and shading operations for the next frame while the ASIC is rasterizing the current frame.
• The partition size is 16x32 pixels so that a double-buffered z buffer and image buffer can be stored on-chip. This reduces cost and required memory bandwidth while improving performance. External memory is required for texture map storage, so texture map rendering performance scales with that memory's speed and bandwidth.
In addition to these three design principles, another goal was to provide hardware support for antialiased rendering. Two types of antialiasing quality were desired: a fast mode for interactive rendering, and a slower, high quality mode for producing final images.

For high quality antialiasing, the ASIC uses a traditional accumulation buffer method to antialias each partition by rendering the partition at every subpixel offset and accumulating the results in an off-chip buffer. Because this algorithm is well known [8], this high-quality antialiasing mode is not discussed in this paper.

The more challenging goal was to also provide high quality antialiasing for interactive rendering in less than double the time needed to render a non-antialiased image. We assumed that this type of antialiasing would only be used for playback or previewing, so it could only consume a small portion of the die area. Therefore the challenge in implementing antialiasing was how to properly antialias without maintaining the per pixel coverage and opacity data for each of the layers individually.

Our solution to this problem involves having the ASIC perform Z-ordered shading using a multiple pass algorithm (see the Appendix for pseudo-code of the rendering algorithm). This permits an unlimited number of layers to be rendered for each pixel as in the architecture presented by Mammen [11]. However, because Mammen's architecture performs antialiasing by integrating area samples in multiple passes to successively antialias the image, the number of passes is equal to the number of subpixel positions in the filter kernel. For example, rendering an antialiased image using a typical filter kernel of 8 samples would require 8 times as long as rendering it without antialiasing. Obviously this is too high a performance penalty for use in interactive rendering.

With our modified A-buffer algorithm, the number of passes required to antialias an image is a function of image complexity (opacity and subpixel coverage) in each partition, not the number of subpixel samples. The worst case arises when there are at least 8 layers which have 8 different coverage masks which each cover only one subpixel. This rarely, if ever, occurs in practice. In fact, we have found that an average of only 1.4 passes is required when rendering with a 16x32 partition and an 8 bit mask.

A discussion of the details of the system architecture follows the discussion of the antialiasing algorithm implementation.

2.1 Front-to-Back Antialiasing

The antialiasing algorithm is distributed among three of the major functional blocks of the ASIC (see Figure 1): the Plane Equation Setup, Hidden Surface Removal, and Composite Blocks.

[Figure 1. Rasterization ASIC pipeline: Display List Traversal (with a Resubmit path back from the Composite block) feeds Plane Equation Setup, Scan Conversion, Hidden Surface Removal (32x16 pixel RAM; CSG and Shadow), Shading and Texture Mapping (Texture Cache; SDRAM controller; system I/O), Composite (32x16 pixel RAM), and Scanout I/O to the image buffer.]

The Plane Equation Setup calculates plane equation parameters for each triangle and stores them for later evaluation in the relevant processing blocks. The Scan Conversion generates the subpixel coverage masks for each pixel fragment and outputs them to the rendering pipeline. During the Hidden Surface Removal, fragments of tessellated objects are flagged for specific blend operations during shading. The Composite Block shades pixels by merging the coverage masks and alpha values.

2.2 Coverage Mask Generation

We use a staggered subpixel mask, as shown in Figure 2. Each pixel is divided into 16 subpixels, but only half of the samples are used. The mask is stored as an 8 bit value using the bit assignments shown in Figure 2.

[Figure 2. Staggered sub-pixel mask: the pixel is divided into a 4x4 grid of subpixels, and eight of the sixteen positions are sampled in a staggered pattern, numbered 0 through 7.]

This staggered mask is similar to the mask used in the triangle processor [5]. It uses only half the memory a grid-aligned 4x4 mask requires but offers nearly the same quality of antialiasing. Better antialiasing quality can be achieved by increasing the subpixel samples to 64 and using a 32 bit mask. To support that, the on-chip image buffer would require nearly 60% more capacity.

The mask generation is performed by treating each scanline as 4 subscanlines and computing 4 coverage segments by using the scan conversion parameters. The triangle edge intersection with the scanline is calculated first. The edge intersection solves the following linear equations:

    Xbegin = SlopeBegin * (CurrentY - Y0) + X0
    Xend   = SlopeEnd * (CurrentY - Y1) + X1

where (X0, Y0) and (X1, Y1) are the end points of the edges. Then the begin and end values for the 4 subscanlines are calculated and each pixel is clipped against those segments. The associated coverage mask bit is asserted if the subpixel is not clipped by the segment.

Figure 3 shows an example in which each color represents one pixel's 16 subpixel samples. The column on the left represents the [begin, end] values of each segment. A subpixel's coverage mask bit is asserted when its position is greater than or equal to the begin value and less than the end value. The coverage mask for each pixel is shown in the bottom of the figure.

[Figure 3. Coverage Mask Generation: the four subscanline segments have [begin, end] values of [2,10], [1,10], [1,10], and [0,10]; the resulting coverage masks for the three pixels are 0xE8, 0xFF, and 0x55.]

A single set of linear equation evaluators can achieve one pixel per clock output even for small triangles such as 2x2. In hardware, solving a linear equation is more efficient in terms of speed and area than using lookup tables as was done by Schilling [16]. As in Schilling's design [16] and the RealityEngine [2], we exploited the fact that the mask generation is closely related to scan conversion and reused much of the circuitry between those functions.
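To make the mask computation concrete, the following C sketch derives one pixel's 8-bit mask from the four subscanline coverage segments. It is only an illustration of the rule described above: the staggered sample positions in kSampleX and the fixed-point units of the [begin, end] segments are assumptions for the example, not the ASIC's actual layout.

    /* Hypothetical subpixel x offsets (in 1/4-pixel units) for mask bits 0-7,
     * two staggered samples per subscanline, in the spirit of Figure 2.     */
    static const int kSampleX[8] = { 0, 2,   1, 3,   0, 2,   1, 3 };

    /* begin[i], end[i]: coverage segment of subscanline i, in subpixel units
     * (as in Figure 3).  pixelX is the pixel column within the partition.   */
    unsigned char CoverageMask(const int begin[4], const int end[4], int pixelX)
    {
        unsigned char mask = 0;
        for (int sub = 0; sub < 4; sub++) {          /* 4 subscanlines per scanline */
            for (int s = 0; s < 2; s++) {            /* 2 staggered samples per row */
                int bit = sub * 2 + s;
                int x = pixelX * 4 + kSampleX[bit];  /* sample position in subpixels */
                /* Bit is asserted when the sample falls in the [begin, end) segment. */
                if (x >= begin[sub] && x < end[sub])
                    mask |= (unsigned char)(1 << bit);
            }
        }
        return mask;
    }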

2.3 Fragment Merging

The Hidden Surface Removal block includes an on-chip buffer for two layers of pixel depth data (depth value and shadow and CSG state information). When objects are rendered in a single pass, only one layer is used. When multiple layers are needed, one layer contains the depth of the data composited during previous passes and the second layer contains the front-most depth of the data which has yet to be composited.

Pixels which are completely covered by opaque objects are resolved in a single pass. When a pixel contains portions of two or more triangles, it is desirable to merge the pixel fragments so that the pixel can be fully composited in one pass. We considered and explored several methods, but did not find a satisfactory solution which permits processing of pixel fragments in a single pass.

We considered using object tags that could be compared during sorting [3], but rejected that approach because of its limitations and the burden it places on software. Object tags require extra memory in the rasterizer as they must be stored along with the depth data for each pixel. The number of unique tags is thus limited by hardware memory. Software must assign a unique tag to each object and must determine how to best reuse tags when the number of objects exceeds the number of tags the hardware supports.

Another method which has been used for combining pixel fragments is to identify ones with similar depths and combine them if their colors are similar [17]. In our architecture the colors are not available during hidden surface removal, so the method of combining pixel fragments can only use the depth data. It is difficult to determine when two depths are similar and should be considered equal. Some software renderers use the minimum of the depth gradient to determine a tolerance within which objects are considered to have equal depth. Since the depth gradients in x and y are readily available, namely the a and b plane equation parameters (see Section 3.2), this seems to be a perfect option.

Unfortunately, in practice it is possible to create scenes in which small triangles, approaching a pixel in size, have large gradients which can not be properly sorted. The gradients output by the Plane Equation Setup are only accurate if the pixel is completely covered, so they are not representative of the actual gradient of a pixel fragment. It is more difficult to compute the true gradient of a pixel fragment, so we tried using a fixed tolerance. However, even when a fixed tolerance is used, pixel data is incorrectly discarded during the multiple pass front-to-back depth sorting operation.

We decided that rather than implement a solution which causes serious artifacts, we would prefer to have a robust solution and compromise on performance. The solution we used is to only consider two depths to be equal when they match precisely.

2.4 Shading For Antialiasing

Shading is implemented in the Composite Block, which, like the Hidden Surface Removal block, has a two-layer buffer for storing pixel colors and masks. Since the buffers occupy on-chip memory it is necessary to minimize the state information stored in them. Consequently, the pixel color, alpha, and mask for the previously composited data is stored as a single 40-bit value. In addition, a single bit controls the blending functions for color, alpha, and mask. The details of how the composite block functions as part of the pipeline are described later in this paper.

Consider the example of 3 layers of data as shown in Figure 4. Object A's data completely covers the pixel, having a mask of 0xFF and an alpha of 0.5. At the end of the first pass the first layer contains A's color, alpha, and mask.

[Figure 4. 3 layer composite example: objects A (mask = 0xFF, alpha = 0.5), B (mask = 0xE8, alpha = 1.0), and C (mask = 0xE8, alpha = 1.0), in order of increasing depth.]

B's data is opaque, but does not completely cover the pixel, having a mask of 0xE8. First, B's color and alpha are scaled by its mask coverage, 0.5. The result is blended with the data from the previous pass using an AoverB operation [14] where:

    I = IFront + (1 - αFront) * IBack

IBack is the color component intensity of the back object (B), IFront is the color component intensity of the front object (A) pre-multiplied by αFront (A's alpha), and (1 - αFront) is the transmission coefficient of the front object.

Blending the mask is a more complex operation. In the A-buffer algorithm the new mask is the bitwise OR of the two masks. In our algorithm, the coverage and transmission coefficients of the two objects are compared and the masks are either bitwise ORed or one mask is selected depending on the results of the comparison. When the front-most layer's mask completely covers the new mask and the new data is more opaque than the front-most layer's, the new mask replaces the previous mask. The opaqueMask flag is asserted for that pixel and is stored in the RAM.

Object C is composited during the third pass. It is opaque and has a mask of 0xE8. When the opaqueMask flag is asserted, C's mask is clipped by the previously composited data's mask, resulting in a clipped mask of 0x0. Then C's color and alpha are scaled by the clipped mask coverage and blended behind the frontmost layer using the AoverB function.
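The walkthrough above can be condensed into a small per-fragment routine. The C sketch below mirrors the composite step under simplifying assumptions: colors are kept in floating point rather than the packed 40-bit format, the accumulated alpha stands in for the front-most layer's opacity in the mask comparison, and the type and function names are ours, not the hardware's.

    typedef struct {
        float         r, g, b, a;   /* accumulated premultiplied color and alpha */
        unsigned char mask;         /* coverage of the data composited so far    */
        int           opaqueMask;   /* set when the stored mask clips later data */
    } Layer;

    static float Coverage(unsigned char m)     /* fraction of the 8 samples set  */
    {
        int n = 0;
        while (m) { n += m & 1; m >>= 1; }
        return n / 8.0f;
    }

    void CompositeFragment(Layer *acc, float r, float g, float b, float a,
                           unsigned char mask)
    {
        if (acc->opaqueMask)
            mask &= (unsigned char)~acc->mask;  /* clip behind opaque coverage   */
        if (mask == 0)
            return;                             /* fully hidden, contributes nothing */

        float frontA = acc->a;                  /* front-most opacity (simplified) */
        float cov    = Coverage(mask);          /* scale new data by its coverage  */
        float t      = 1.0f - frontA;           /* transmission of the front data  */

        acc->r += t * (r * a) * cov;            /* AoverB: I = IFront + (1-αFront)*IBack */
        acc->g += t * (g * a) * cov;
        acc->b += t * (b * a) * cov;
        acc->a += t * a * cov;

        /* Mask blend: if the stored mask covers the new one and the new data is
           more opaque, replace the mask and assert opaqueMask; otherwise OR.   */
        if (((acc->mask & mask) == mask) && (a > frontA)) {
            acc->mask = mask;
            acc->opaqueMask = 1;
        } else {
            acc->mask |= mask;
        }
    }

With the Figure 4 values this reproduces the walkthrough: B is scaled by 0.5, replaces the stored mask with 0xE8 and asserts opaqueMask, and C is clipped to zero coverage and contributes nothing.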

Figure 5a shows an image generated with the opaqueMask flag. The scene contains a mostly-transparent layer covering an opaque black triangle which is closer than an identical opaque white triangle. The background is an opaque red layer. Figure 5b was generated with the opaqueMask feature disabled. Notice the fringing artifacts that result when the mask of the white triangle is not clipped by the black triangle.

Figure 5. Overlapping triangles (a) using the opaqueMask flag and (b) without using the opaqueMask flag.

Implementing the A-buffer algorithm in hardware requires saving the mask for each layer. Instead of combining the masks, each layer's mask and alpha are saved to properly shade subsequent layers. Unfortunately, the number of layers is unbounded, as is the memory required to store them, so this is not an option. Using the opaqueMask flag to control the blending of the colors and masks allowed us to conserve memory and produce high quality images with some transparency.

Figure 6 shows a more common type of scene rendered with antialiasing enabled. The artifacts which appear where the blue cone overlaps the red cone are a result of the loss of per-layer mask and alpha data when the mask coverage is combined with the alpha as each layer is composited from front-to-back. As mentioned in [2], it is not possible to correctly antialias the intersection of the cones using an alpha antialiasing algorithm.

Figure 6. Antialiasing artifacts.

Figure 7 shows a more complex scene rendered with and without antialiasing.

Figure 7. Not antialiased (a) and antialiased (b).

2.5 Antialiased Intersections

It is possible to antialias the intersections by calculating or reconstructing subpixel depth values for each layer. Methods for doing this are described by [2] and [3]. The method described in the A-buffer algorithm works for the intersection of 2 objects, but breaks down when more than two objects intersect unless the Zmin and Zmax values are saved for each layer.

The method for antialiasing intersections used in the RealityEngine uses the x and y slope values and a single depth sample per layer to reconstruct subpixel depth values. This is more accurate than the A-buffer method, but is computationally intensive and requires storing subpixel depth values in the z-buffer.

Neither of these solutions was feasible to implement in an architecture with limited memory, particularly since antialiasing CSG and shadows requires maintaining subpixel CSG and shadow data in the depth buffer. A single CSG and shadow sample requires 12 bits of data, so 8 sub-samples would add 84 bits of data for each of the two layers for a total of 168 bits for each pixel. In order to accommodate that much data on-chip we would have to reduce the partition size, which would decrease the performance of non-antialiased scenes.

If Perfection Is Required

There are two methods which can be used to render a high quality antialiased image. Supersampling can be achieved by rendering each partition at higher resolution and filtering it as a post-processing operation (using a separate image processing ASIC). An alternative is to use an accumulation buffer method to antialias the scene by rendering it several times using different subpixel offsets. The result of each rendering pass can be accumulated in an off-chip buffer for each partition.
2.6 Hidden Surface Removal

With this architecture, unlimited visible layers can be rasterized using multiple passes as in the algorithms described by [11] and [10]. Mammen's algorithm requires that all of the opaque objects be rasterized before transparent objects are rendered. As with the architecture described by Kelley, this does not require transparent objects to be rasterized separately from opaque objects. This is particularly important since a texture mapped object's alpha values cannot be determined until they are retrieved from memory.

Unlike the architecture described by Kelley, objects in a partition are not depth sorted before shading. Kelley's architecture stored four layers of depth and required multiple passes to sort additional layers. Consequently, the number of layers that can occupy the same depth is limited to 3 or 4. This architecture can render an unlimited number of layers at any depth.

First Pass Sorting

Scenes that contain only opaque data and no shadows or CSG can be resolved in a single pass through the pipeline. Otherwise multiple passes are required to resolve the final color for each pixel in the scene. The operations that occur in the second and subsequent passes through the rendering pipeline differ from those that occur in the first.

During the first pass each input pixel depth is compared with the depth of the frontmost object received so far for the pass (see the Appendix for pseudo-code). The depth is a 25 bit floating point value with a 19 bit mantissa and a 6 bit exponent. Objects are not sorted before compositing begins. Instead, any object which passes the depth sort test is passed down the pipeline immediately.

When the compositing engine receives the coverage mask, opacity, and color data for a pixel, it stores the data in the image buffer. Any data already in the buffer is overwritten (discarded) since it would fall behind the new data. Before being discarded, though, it is examined to determine if it would have contributed to the final pixel color. If so, a flag is set for that pixel which is used at the end of the pass to initiate another rendering pass of the same partition.

After the last object enters the pipeline, the Display List Traversal sends a synchronization token before moving to the next partition. The Composite block interrupts the Display List Traversal when it receives that synchronization token (labeled Resubmit in Figure 1) and the Display List Traversal determines whether another pass is required before moving to the next partition. The latency incurred by waiting for the interrupt can be minimized with a predictive algorithm that begins re-fetching the data for another pass or pre-fetching data for the next partition.

Subsequent Pass Sorting

The sort and composite operations are modified when multiple passes are required. The data in the depth and image buffers is retained at the end of each pass. Second depth and image buffers are used for storing the input pixel data.

During the depth sort, each object is compared with the final depth of the previous pass and the front-most depth of the current pass. If the object's depth falls between the two, or if it matches the front-most depth of the current pass, it is passed to the composite block.

The composite block blends the colors of any objects which are equal in depth using an AoverB blend. The masks are combined and the results are stored in the second buffer. As in the case of the first pass, writing the second buffer sometimes causes data which was composited earlier during the same pass to be overwritten. It is necessary to determine if that discarded data would have contributed to the final pixel color. Again, a flag is set for that pixel which is used at the end of the pass to initiate another rendering pass of the same partition.

When the input object's depth is equal to that of the previously composited data, its coverage mask is clipped by the previous pass's coverage mask and blended (AoverB) with the data in the composite buffer.

This architecture requires all data for a partition to be submitted during each pass. In the case of equal depths it is also important that the data arrive in the same order, since the first object at a particular depth is considered to be in front of any objects at the same depth which are received later (first-come-first-rendered).

Unlimited Equal Depth Layers

It is possible to composite an unlimited number of layers which have equal depth in a single pass. Equal layers are identified during the depth sort and are composited using additive blending after the new layer is clipped by the coverage mask of the existing layer.

In the current implementation, equality occurs when the depth values match precisely. Performance would be improved if a robust method for combining objects which have nearly equal depths could be used (refer to the previous section on fragment merging).
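Read together with the Appendix pseudo-code, the subsequent-pass depth test reduces to a window comparison. The sketch below is one reading of the rules above; the enum, the function name, and the exact treatment of the boundary cases are illustrative assumptions rather than the hardware's signals.

    /* firstDepth is the depth resolved in the previous pass; depthBuf holds the
       front-most depth accepted so far in the current pass.                    */
    typedef enum { DISCARD, NEW_FRONTMOST, EQUAL_DEPTH } SortResult;

    SortResult SubsequentPassSort(float z, float firstDepth, float *depthBuf)
    {
        if (z <= firstDepth)
            return DISCARD;          /* already resolved in an earlier pass      */
        if (z > *depthBuf)
            return DISCARD;          /* behind the current front-most layer      */
        if (z == *depthBuf)
            return EQUAL_DEPTH;      /* clip by the existing mask, blend AoverB  */
        *depthBuf = z;               /* strictly in between: new front-most data */
        return NEW_FRONTMOST;
    }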
2.7 Image Partitioning

Several screen partition based rasterizers have been proposed or built [13,6,15,18]. A motivating factor in using image partitioning is that the depth and image buffers can be stored on-chip. This improves performance and reduces the pin count of the ASIC (assuming the buffers would otherwise have dedicated ports for performance reasons). As in the case of other partition based renderers [10,17], performance is also improved since multiple passes are used to resolve the portions of the final image which contain the greatest depth complexity.

The primary disadvantage of partition based renderers is the inherent latency resulting from the need to construct a bucket sorted display list. Another disadvantage of partition based rasterizers is that data which appears in multiple partitions must be transferred from system memory multiple times or cached locally. Some partition based designs [5,10] used a one dimensional partition. In the best case an object had to be transferred once for every scanline it touched. It is important to exploit the inherent 2 dimensional image coherence to reduce the system memory bandwidth required to transfer the object data.

Two Dimensional Bucket Sorting

After the triangles are projected into screen space they are partitioned into each 16x32 partition that intersects with the triangle. Since determining which partitions a triangle belongs in can be computationally expensive, 2 different algorithms are used depending on the size of the triangle. Triangles that are approximately the size of a single partition or smaller are unlikely to span more than 1 or 2 partitions and are hence easier to sort. Small triangles are included in the bucket for each partition which is overlapped by the triangle's bounding box.

Triangles that are much larger than a partition are more difficult since their orientation will affect which partitions they intersect. It is important to avoid adding triangles to partitions that the triangle does not intersect, since doing so wastes memory, bandwidth and performance. The edge slopes of large triangles are used to compute which partitions they cover.

In order to reduce the size of the display list, triangles can share vertices. Through the use of the QuickDraw™ 3D and QuickDraw™ 3D RAVE TriMesh data structures, vertex sharing is easily achieved even after an object has been clipped and projected onto the screen. In the best case, vertex sharing permits each new vertex to define two new triangles. In that case the number of triangles is double the number of vertices.
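As an illustration of the small-triangle case, the C sketch below adds a triangle to the bucket of every partition overlapped by its screen-space bounding box. The partition dimensions (16 wide by 32 high), the bucket callback, and the omission of the large-triangle edge walk are assumptions made for the example.

    #define PART_W 16
    #define PART_H 32

    typedef struct { float x, y; } Vec2;

    void BucketSortSmallTriangle(const Vec2 v[3], int partsPerRow,
                                 void (*AddToBucket)(int partIndex, int triIndex),
                                 int triIndex)
    {
        float minX = v[0].x, maxX = v[0].x, minY = v[0].y, maxY = v[0].y;
        for (int i = 1; i < 3; i++) {
            if (v[i].x < minX) minX = v[i].x;
            if (v[i].x > maxX) maxX = v[i].x;
            if (v[i].y < minY) minY = v[i].y;
            if (v[i].y > maxY) maxY = v[i].y;
        }
        int px0 = (int)minX / PART_W, px1 = (int)maxX / PART_W;
        int py0 = (int)minY / PART_H, py1 = (int)maxY / PART_H;

        for (int py = py0; py <= py1; py++)          /* every partition overlapped */
            for (int px = px0; px <= px1; px++)      /* by the bounding box        */
                AddToBucket(py * partsPerRow + px, triIndex);
    }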
3 SYSTEM ARCHITECTURE

The rendering tasks are divided between the host CPU and 3D accelerator to balance the overall system. The host CPU performs the transformation, clipping and shading functions. The algorithms which perform these functions are described in detail in [1,4,9]. It also generates the display list using a linked list structure to link the object lists for each partition. The hardware accelerator performs the rasterization by following the linked list. It DMAs the object data from system memory and only interrupts the CPU at the end of the list/frame.

The ASIC rasterizer is a pipelined design shown in Figure 1. It reads triangle data and outputs texture mapped, shaded pixels. The depth and image buffers are on-chip to minimize latency and maximize bandwidth. The ASIC is clocked at 100MHz.

3.1 Display List Traversal

This module reads the triangle vertex data by following a linked list which was constructed by software during the partition sort.

3.2 Plane Equation Setup

The scan conversion of the triangles is performed using plane equation evaluation as in the PixelPlanes design [7]. A plane equation is used to describe the relationship between a plane in screen space and any three points inside the plane. Algebraically, a plane can be described using this equation:

    z = a * x + b * y + c

The equation can be evaluated for any point (x, y, z) in screen space. The linear relationship between the three points (x0, y0, z0), (x1, y1, z1), (x2, y2, z2) and the above plane equation is

    [z0]   [x0  y0  1] [a]
    [z1] = [x1  y1  1] [b]
    [z2]   [x2  y2  1] [c]

The plane equation setup module takes the vertex data and generates the coefficients a, b, and c for each parameter's plane equation. The parameters include the color, alpha, depth, and texture map coordinates.

The plane equation setup eliminates the standard two passes of linear interpolation (lirp) in both x and y directions and is more accurate. The coefficient calculation is implemented in a systolic array, so the internal bandwidth is greater than a two pass lirp implementation [10].

Once the coefficients are calculated, they are passed down the pipeline and stored in the module where they will be used to evaluate the associated parameters; for example, the depth coefficients are stored in the Hidden Surface Removal module. The coefficients can be passed down a separate pipeline at a rate of one coefficient per clock cycle and double-buffered in the evaluation module, thus minimizing the overhead and bandwidth associated with plane equation parameter passing. The use of plane equation evaluation for a given shading parameter at a specific pipeline stage is more efficient than passing down the evaluated parameters for each pixel during every clock cycle.
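For a single interpolated parameter, the setup above amounts to solving the 3x3 system for a, b, and c. The sketch below does this directly with Cramer's rule for illustration; the ASIC's systolic-array formulation and its handling of degenerate triangles are not shown.

    /* x[i], y[i] are the screen positions of the three vertices and z[i] the
       parameter values (depth, a color channel, or a texture coordinate).   */
    int PlaneEquation(const float x[3], const float y[3], const float z[3],
                      float *a, float *b, float *c)
    {
        /* Determinant of [[x0 y0 1],[x1 y1 1],[x2 y2 1]] = twice the signed area. */
        float det = x[0] * (y[1] - y[2]) - y[0] * (x[1] - x[2])
                  + (x[1] * y[2] - x[2] * y[1]);
        if (det == 0.0f)
            return 0;                                   /* degenerate triangle    */

        *a = (z[0] * (y[1] - y[2]) - y[0] * (z[1] - z[2])
              + (z[1] * y[2] - z[2] * y[1])) / det;
        *b = (x[0] * (z[1] - z[2]) - z[0] * (x[1] - x[2])
              + (x[1] * z[2] - x[2] * z[1])) / det;
        *c = (x[0] * (y[1] * z[2] - y[2] * z[1]) - y[0] * (x[1] * z[2] - x[2] * z[1])
              + z[0] * (x[1] * y[2] - x[2] * y[1])) / det;
        return 1;
    }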
3.3 Hidden Surface Removal

This module depth sorts a pixel per clock cycle. There is enough storage for two layers of depth, shadow, and CSG data. During the first pass of a partition only one layer is used.

3.4 Texture Map Lookup

The texture mapping module implements traditional mip-mapped bilinear and trilinear texturing [19]. A target square or non-square texture map can be up to 2048 texels on a side and either 16 or 32 bits per texel. Each texture mapped pixel produced by this system results from applying a filter function to either four (bilinear) or eight (trilinear) texels. The filter function is a linear interpolation between texel samples, weighted by the fractional portion of the horizontal (u) and vertical (v) lookup indices.
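The filter function itself is straightforward. The following C sketch shows the bilinear case described above; the Texel type and the FetchTexel callback are illustrative stand-ins for the texel cache and SDRAM path, and trilinear filtering would blend two such results from adjacent mip levels.

    typedef struct { float r, g, b, a; } Texel;

    /* Bilinear filter: a weighted average of four texels using the fractional
       parts of u and v.  Assumes non-negative texture coordinates.           */
    Texel BilinearFilter(Texel (*FetchTexel)(int u, int v, int level),
                         float u, float v, int level)
    {
        int   u0 = (int)u,        v0 = (int)v;
        float fu = u - (float)u0, fv = v - (float)v0;   /* fractional weights */

        Texel t00 = FetchTexel(u0,     v0,     level);
        Texel t10 = FetchTexel(u0 + 1, v0,     level);
        Texel t01 = FetchTexel(u0,     v0 + 1, level);
        Texel t11 = FetchTexel(u0 + 1, v0 + 1, level);

        Texel out;
        out.r = (1-fu)*(1-fv)*t00.r + fu*(1-fv)*t10.r + (1-fu)*fv*t01.r + fu*fv*t11.r;
        out.g = (1-fu)*(1-fv)*t00.g + fu*(1-fv)*t10.g + (1-fu)*fv*t01.g + fu*fv*t11.g;
        out.b = (1-fu)*(1-fv)*t00.b + fu*(1-fv)*t10.b + (1-fu)*fv*t01.b + fu*fv*t11.b;
        out.a = (1-fu)*(1-fv)*t00.a + fu*(1-fv)*t10.a + (1-fu)*fv*t01.a + fu*fv*t11.a;
        return out;
    }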

0 1 2 3 4 5 6 7 8 x=6 X x=7 x=8

Due to the 10 to 12 cycle read latency of the SDRAM memory system, it was necessary to split the cache tags and cache data in the pipeline. The cache tag control module shown in Figure 10 contains all of the cache tag state for data that will arrive and be cached some time later in the cache data controller. The synchronizer module aligns the arrival of texel color data from the SDRAM memory system with the arrival of cache tag data from the Texel Cache Tag Control module.

[Figure 10. Split cache tag-data architecture: the Texture Lookup ALUs feed the Texel Cache Tag Control module, which drives the Pipelined SDRAM Memory System and the Tag and Data FIFOs; a synchronizer aligns them at the Texel Cache Data Control module, which feeds the Color Blend ALUs.]

The Texture Lookup ALUs in Figure 10 calculate texel sample addresses and pass these addresses to the Texel Cache Tag Controller. Four addresses per cycle are generated for each bilinear pixel. Eight addresses per cycle are generated for each trilinear pixel. At the bottom of the texel cache pipeline a corresponding number of colors (four or eight) are presented to the Color Blend ALUs, where color highlights and color modulation are applied to the raw bilinear or trilinear pixel color. The final rendered pixel color is passed further down the pipeline for compositing with other pixels.

3.5 Front-to-back Compositing

The final pixel processing is performed by compositing the incoming pixel layers in front-to-back order, after which the resulting ARGB values are output to the frame buffer. There is enough storage for 2 layers of image data. During the first pass only one layer is used. The second layer contains the final image data for the previous partition and can be scanned out of the ASIC while the next partition is being composited (at least during the first pass). This is a standard double-buffering technique.
3.6 Scalable Performance

A low cost, single card implementation of the 3D accelerator is shown in Figure 11. Texture map data is stored in the SDRAM connected to the 3D accelerator. The SDRAM connected to the I/O interface is optional; it can be used to store additional texture map data or vertex data. Storing the texture map data locally reduces the PCI bandwidth when texture mapping is used. Storing vertex data locally reduces the PCI bandwidth when triangles cross partition boundaries, since they will only be loaded onto the card once. It also reduces PCI bandwidth if resubmission is required to resolve the final pixel values for a partition.

[Figure 11. Low cost card implementation: PCI connects to the I/O interface ASIC (with an optional 2 MB SDRAM), which drives the 3D accelerator and its 2 MB of SDRAM.]

To improve performance it is possible to use multiple rasterizer ASICs in parallel as described by Fuchs in [6]. The I/O interface ASIC can drive up to 4 rasterizer ASICs as shown in Figure 12. The image buffer outputs of each rasterizer are merged by a Frame Buffer Interface (not designed as part of this project) which transfers each partition to the frame buffer (VRAM). The frame buffer must be local to the rasterizer ASIC since PCI could not sustain the bandwidth required to support a 1280x1152 display at 30fps (177 MBytes/sec). A typical input bandwidth is 100 MBytes/sec, which can be sustained by 32 bit 33MHz PCI.

[Figure 12. High performance card implementation: PCI connects to the I/O Interface ASIC with 64 MB of SDRAM, which drives four 3D accelerators (64 MB SDRAM each); their outputs are merged by a Frame Buffer Interface into VRAM for display.]

4 CONCLUSIONS

These are the main design goals met by the system:

High Performance Antialiasing

The performance when the modified A-buffer antialiasing is used is only 40% slower than when antialiasing is disabled. This is much better than the performance degradation required for using an accumulation buffer or supersample method of antialiasing.

Low Memory Bandwidth and Capacity

The off-chip memory requirement for the depth buffer is eliminated. There is an on-chip image buffer so that the output is a write-only path to the frame buffer memory. Implementing these buffers on chip also reduces memory bandwidth. The on-chip texture cache reduces the memory bandwidth needed for texture mapping. Dedicated memory can be used for texture and vertex data to further reduce the system memory bandwidth and improve rasterization performance.

Balance Between CPU and Rendering ASIC

The ASIC only interrupts the CPU at the end of each frame. This is required to process the 3D geometry as quickly as possible. Using dedicated memory with the 3D accelerator will also improve the system performance since it reduces the bandwidth load on the system memory.

5 FUTURE WORK

As mentioned, it is necessary to develop a robust method for merging pixel fragments. This will reduce the number of passes required to perform antialiasing, improving image quality and performance. The quality of antialiasing can be further improved by increasing the number of subpixel samples from 16 to 64. A method of antialiasing interpenetrating objects must also be incorporated. Finally, additional data should be included in the z and image buffers to properly antialias shadows and CSG.

ACKNOWLEDGMENTS

The authors wish to thank Paul Baker and Jack McHenry for supporting this project in Apple's Interactive 3D Graphics group. Thanks to Bill Garrett and Sun-Inn Shih for their reviews.

APPENDIX: PSEUDO-CODE

The following pseudo-code summarizes the rendering algorithm:

    RenderFrame()
    {
        /* object loop: transform, shade, sort */
        foreach (object)
        {
            Transform (object);
            Shade (object);
            PartitionSort (object);
        }

        /* partition loop: rasterize */
        foreach (partition)
        {
            InitPartition (partition);
            Rasterize (partition);
        }
    }

The object loop is executed by the host CPU and the partition loop functionality is embodied in the ASIC.

The following pseudo-code is a simplified version of the multi-pass rasterization loop:

    /* rasterize loop: first layer */
    foreach (object)
    {
        foreach (pixel)
        {
            if (depth_pixel[x][y] <= depth_buf[x][y])
            {
                depth_buf[x][y] = depth_pixel[x][y];
                composite_buf[x][y] = Composite(pixel);
            }
        }
    }

    while ( Resubmit )
    {
        firstDepth = depth_buf;
        firstComposite = composite_buf;
        clear(depth_buf);
        clear(composite_buf);

        foreach (object)
        {
            foreach (pixel)
            {
                if (depth_pixel[x][y] > firstDepth[x][y])
                {
                    if (depth_pixel[x][y] <= depth_buf[x][y])
                    {
                        depth_buf[x][y] = depth_pixel[x][y];
                        composite_buf[x][y] = CompositeFirst(pixel);
                    }
                }
            }
        }
    }

REFERENCES

[1] Kurt Akeley and T. Jermoluk. High-Performance Polygon Rendering. Computer Graphics (SIGGRAPH 88 Conference Proceedings), volume 22, number 4, pages 239-246. August 1988.
[2] Kurt Akeley. RealityEngine Graphics. SIGGRAPH 93 Conference Proceedings, pages 109-116. August 1993. ISBN 0-89791-601-8.
[3] Loren Carpenter. The A-buffer, an Antialiased Hidden Surface Method. Computer Graphics (SIGGRAPH 84 Conference Proceedings), volume 18, number 3, pages 103-108. July 1984. ISBN 0-89791-138-5.
[4] Michael Deering and S. Nelson. Leo: A System for Cost Effective 3D Shaded Graphics. SIGGRAPH 93 Conference Proceedings, pages 101-108. August 1993.
[5] Michael Deering, S. Winner, B. Schediwy, C. Duffy and N. Hunt. The Triangle Processor and Normal Vector Shader: A VLSI System for High Performance Graphics. Computer Graphics (SIGGRAPH 88 Conference Proceedings), volume 22, number 4, pages 21-30. August 1988.
[6] Henry Fuchs. Distributing a Visible Surface Algorithm over Multiple Processors. Proceedings of the 6th ACM-IEEE Symposium on Computer Architecture, pages 58-67. April 1979.
[7] Henry Fuchs et al. Fast Spheres, Shadows, Textures, Transparencies, and Image Enhancements in Pixel-Planes. Computer Graphics (SIGGRAPH 85 Conference Proceedings), volume 19, number 3, pages 111-120. July 1985.
[8] Paul Haeberli and Kurt Akeley. The Accumulation Buffer: Hardware Support for High-Quality Rendering. Computer Graphics (SIGGRAPH 90 Conference Proceedings), volume 24, number 4, pages 309-318. August 1990. ISBN 0-89791-344-2.
[9] Chandlee Harrell and F. Fouladi. Graphics Rendering Architecture for a High Performance Desktop Workstation. SIGGRAPH 93 Conference Proceedings, pages 93-100. August 1993.
[10] Michael Kelley, K. Gould, B. Pease, S. Winner, and A. Yen. Hardware Accelerated Rendering of CSG and Transparency. SIGGRAPH 94 Conference Proceedings, pages 177-184. 1994.
[11] Abraham Mammen. Transparency and Antialiasing Algorithms Implemented with the Virtual Pixel Maps Technique. IEEE Computer Graphics and Applications, 9(4), pages 43-55. July 1989. ISSN 0272-1716.
[12] Steven Molnar, John Eyles, and John Poulton. PixelFlow: High-Speed Rendering Using Image Composition. Computer Graphics (SIGGRAPH 92 Conference Proceedings), volume 26, number 2, pages 231-240. July 1992.
[13] F. Parke. Simulation and Expected Performance Analysis of Multiple Processor Z-Buffer Systems. Computer Graphics (SIGGRAPH 80 Conference Proceedings), pages 48-56. 1980.
[14] Thomas Porter and Tom Duff. Compositing Digital Images. Computer Graphics (SIGGRAPH 84 Conference Proceedings), volume 18, number 3, pages 253-259. July 1984. ISBN 0-89791-138-5.
[15] PowerVR, NEC/VideoLogic, 1996.
[16] Andreas Schilling. A New Simple and Efficient Antialiasing with Subpixel Masks. Computer Graphics (SIGGRAPH 91 Conference Proceedings), volume 25, number 4, pages 133-141. July 1991.
[17] Jay Torborg and James Kajiya. Talisman: Commodity Realtime 3D Graphics for the PC. SIGGRAPH 96 Conference Proceedings, pages 353-363. 1996.
[18] G. Watkins. A Real-Time Visible Surface Algorithm. Computer Science Department, University of Utah, UTEC-CS-70-101. June 1970.
[19] Lance Williams. Pyramidal Parametrics. SIGGRAPH 83 Conference Proceedings, pages 1-11. July 1983.
The A-buffer, an Antialiased Hidden [17] Jay Torborg and James Kajiya. Talisman: Commodity Surface Method. Computer Graphics, (SIGGRAPH 84 Realtime 3D Graphics for the PC. SIGGRAPH 96 Conference Proceedings), volume 18, number 3, pages Conference Proceedings, pages 353-363. 1996. 103-108. July 1984. ISBN 0-89791-138-5. [18] G. Watkins. A Real-Time Visible Surface Algorithm. [4] Michael Deering and S. Nelson, Leo: A System for Cost Computer Science Department, University of Utah, Effective 3D Shaded Graphics. SIGGRAPH 93 Conference UTECH-CSC-70-101. June 1970. Proceedings, pages 101-108. August 1993. [19] Lance Williams. Pyramidal Parametrics. SIGGRAPH 83 [5] Michael Deering, S. Winner, B. Schediwy, C. Duffy and Conference Proceedings, pages 1-11. July 1983. N. Hunt. The Triangle Processor and Normal Vector Shader: A VLSI System for High Performance Graphics. Computer Graphics, (SIGGRAPH 88 Conference Proceedings), volume 22, number 4, pages 21-30. August 1988 [6] Henry Fuchs. Distributing a Visible Surface Algorithm over Multiple Processors. Preceeding of the 6th ACM- IEEE Symposium on Computer Architecture, pages 58- 67. April, 1979. [7] Henry Fuchs et al. Fast Spheres, Shadows, Textures, Transparencies, and Image Enhancements in Pixel- Planes. Computer Graphics, (SIGGRAPH 85 Conference Proceedings), volume 19, number 3, pages 111-120. July 1985. [8] and Kurt Akeley. The Accumulation Buffer: Hardware Support for High-Quality Rendering. Computer Graphics, (SIGGRAPH 90 Conference Proceedings), volume 24, number 4, pages 309-318. August 1990. ISBN 0-89791-344-2. [9] Chandlee Harrell and F. Fouladi. Graphics Rendering Architecture for a High Performance Desktop Workstation. SIGGRAPH 93 Conference Proceedings, pages 93-100. August 1993. [10] Michael Kelley, K. Gould, B. Pease, S. Winner, and A. Yen. Hardware Accelerated Rendering of CSG and Transparency. SIGGRAPH 94 Conference Proceedings, pages 177-184. 1994. [11] Abraham Mammen. Transparency and Antialiasing Algorithms Implemented with the Virtual Pixel Maps Technique. IEEE Computer Graphics and Applications, 9(4), pages 43-55. July 1989. ISBN 0272-17-16. [12] Steven Molnar, John Eyles, and John Poulton. PixelFlow: High-Speed Rendering Using Image Composition. Computer Graphics, (SIGGRAPH 92