Institutionen för systemteknik Department of Electrical Engineering

Examensarbete

A Modular 3D Graphics Accelerator for FPGA

Examensarbete utfört i Datateknik vid Tekniska högskolan vid Linköpings universitet av

Jakob Fries, Simon Johansson

LiTH-ISY-EX--11/4479--SE Linköping 2011

Department of Electrical Engineering
Linköpings universitet
SE-581 83 Linköping, Sweden

A Modular 3D Graphics Accelerator for FPGA

Examensarbete utfört i Datateknik vid Tekniska högskolan i Linköping av

Jakob Fries, Simon Johansson

LiTH-ISY-EX--11/4479--SE

Handledare: Andreas Ehliar isy, Linköpings universitet Examinator: Olle Seger isy, Linköpings universitet

Linköping, 5 July, 2011

Division of Computer Engineering, Department of Electrical Engineering
Linköpings universitet, SE-581 83 Linköping, Sweden
Date: 2011-07-05
URL för elektronisk version: http://www.da.isy.liu.se, http://www.ep.liu.se

Titel: En modulär 3D-grafikaccelerator för FPGA
Title: A Modular 3D Graphics Accelerator for FPGA

Författare / Author: Jakob Fries, Simon Johansson

Nyckelord / Keywords: Tile-based Rendering, 3D Graphics Accelerators, FPGA, Customizable Graphics Accelerator

Abstract

A modular and area-efficient 3D graphics accelerator for tile based rendering in FPGA systems has been designed and implemented. The accelerator supports a subset of OpenGL, with features such as mipmapping, multitexturing and blending. The accelerator consists of a software component for projection and clipping of triangles, as well as a hardware component for rasterization, coloring and video output. Trade-offs made between area, performance and functionality have been described and justified. In order to evaluate the functionality and performance of the accelerator, it has been tested with two different applications.

Sammanfattning

En modulär och utrymmeseffektiv 3D-grafikaccelerator för tile-baserad rendering i FPGA-system har designats och implementerats. Acceleratorn stöder en delmängd av OpenGL med funktioner som mipmapping, multitexturering och blending. Acceleratorn är uppdelad i en mjukvarudel för projektion och klippning av trianglar och en hårdvarudel för rastrering, färgsättning och utritning till skärm. Avvägningar som gjorts mellan area, prestanda och funktionalitet har beskrivits och motiverats. För att evaluera funktionalitet och prestanda har acceleratorn testats med två olika applikationer.


Acknowledgments

We would like to thank our examiner Olle Seger and our supervisor Andreas Ehliar for the opportunity to work with this interesting and challenging project. We would also like to thank our opponents Jesper Eriksson and Johan Holmér for providing comments and feedback on our report.


Contents

1 Introduction
  1.1 Background
  1.2 Purpose
  1.3 Scope
  1.4 Report Outline

2 3D Graphics
  2.1 Triangle Rasterization
    2.1.1 Fill Convention
    2.1.2 Edge Function Approach
    2.1.3 Scanline Conversion Approach
    2.1.4 Linear Interpolation
  2.2 Depth Testing
  2.3 Texture Mapping
    2.3.1 Mipmapping
    2.3.2 Texture filtering
    2.3.3 Multitexturing
  2.4 Blending

3 System Architecture
  3.1 Platform Assumptions
  3.2 Implications of Platform Limitations
  3.3 Partitioning of the Graphics Pipeline

4 Hardware Implementation
  4.1 Proposed Configuration
    4.1.1 Notes About the Limitations of the Configuration
  4.2 Module Descriptions
    4.2.1 GPU
    4.2.2 Video Out
    4.2.3 VBlank Swap Helper
    4.2.4 Frame Renderer
    4.2.5 Triangle Fetcher
    4.2.6 Tile Renderer
    4.2.7 Triangle Handler
    4.2.8 Fragmenter
    4.2.9 Triangle Parameter Calculator
    4.2.10 Paramgen
    4.2.11 Fragment Generator
    4.2.12 Depth Buffer
    4.2.13 Fragment Queue
    4.2.14 Colorizer
    4.2.15 Texturer
    4.2.16 Texture Unit
    4.2.17 Blender
    4.2.18 Color Buffer Dumper
  4.3 Communication Protocols
    4.3.1 Four-phase req/ack
    4.3.2 Strobe and Busy Signals
    4.3.3 Simple Burst Protocol
  4.4 Interfacing with the GPU
    4.4.1 Registers

5 Software Implementation
  5.1 Software Component
  5.2 Mesa Hooks
  5.3 Triangle Clipping

6 Tools
  6.1 Software Renderer
  6.2 Scheduler

7 Evaluation by Simulation
  7.1 Evaluation Platform
  7.2 Data Acquisition
  7.3 Evaluation of Tailored Accelerators
    7.3.1 3
    7.3.2 Teeworlds

8 Evaluation using FPGA
  8.1 Evaluation of Tailored Accelerator

9 Conclusions

10 Future Work
  10.1 Faster clipping
  10.2 Texture prefetching
  10.3 Improved Scheduler
  10.4 Generated Fragment Shader
  10.5 Automatic Configuration Generation
  10.6 Central OpenGL State Storage
  10.7 Parallelization

Bibliography

A Data Formats
  A.1 Triangle Data Format
  A.2 Texture Format
  A.3 Framebuffer Format

Chapter 1

Introduction

1.1 Background

Powerful embedded systems are becoming more common. When a short time-to-market is important, or when the systems are produced in low volume, FPGAs can be used to add custom hardware functionality. If such a system is to display graphics, it is cost-efficient to have the graphics functionality inside the FPGA. A variety of options exist for displaying 2D graphics in this manner, but there are few options when it comes to 3D graphics.

1.2 Purpose

The purpose of this thesis is to explore how to design a hardware graphics accelerator that is adaptable to the application it is used for, in order to use a minimal amount of resources while generating graphics adequately fast. An architecture is proposed and implemented, and its performance is evaluated. The implementation is intended for a realistic system with a single external memory. Applications running on this kind of system are typically not very advanced, so only a subset of OpenGL is supported: a fixed-function pipeline with mipmapped multitexturing, depth testing and blending.

1.3 Scope

The scope of this report is to provide some context for a 3D graphics accelerator and to describe the architecture of a triangle rasterizer that is highly tunable for performance vs. area cost. The architecture is then evaluated using real-world applications.


1.4 Report Outline

General background information about rendering 3D graphics is given in chapter 2.

Assumptions about the target platform are described in chapter 3.

The hardware and software implementations of the accelerator are described in chapters 4 and 5, respectively.

Some tools have been developed to aid in the development. They are described in chapter 6.

Chapters 7, 8 and 9 contain the evaluation, results and conclusions of the work, and possible future improvements are listed in chapter 10.

Chapter 2

3D Graphics

2.1 Triangle Rasterization

The most common way of rendering and displaying 3D graphics on the screen is by representing geometry as a set of triangles, then projecting them onto the screen plane and lastly filling some pixels in a frame buffer with the correct color. This section describes how to convert an already projected triangle into a set of points in the plane. There are two main approaches to triangle rasterization: the edge function approach and the scanline conversion approach. In addition, fill convention and linear interpolation are important parts of rasterization. These topics are described below.

2.1.1 Fill Convention

When triangles share a common edge, one wants to avoid any pixels on this edge being missed, but also to avoid drawing the same pixel twice. This can be done by using a fill convention, a set of rules that describe which pixels should and should not be drawn for any triangle [2]. A common fill convention is to keep pixels that are inside the triangle or precisely on the “top” or “left” edge, while avoiding pixels precisely on the “bottom” or “right” edge. Assuming anti-clockwise winding, and a coordinate system as seen in figure 2.1, the vector a = v1 − v0 can be used to characterize the edges of the triangle as follows: a is a “top” or “left” edge if ay < 0 ∨ (ay = 0 ∧ ax < 0); conversely, if ay > 0 ∨ (ay = 0 ∧ ax > 0), a is a “bottom” or “right” edge. According to these criteria, the edge in figure 2.1 should be considered a “bottom” edge.

Figure 2.1: Characterization of an edge as a “bottom” edge
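To make the rule concrete, the classification above can be written as a small predicate. The following Python sketch is an illustration of ours, not part of the thesis implementation (which is done in hardware); the function name is invented:

```python
def is_top_left(ax, ay):
    """Classify the edge vector a = v1 - v0 of an anti-clockwise triangle.

    Returns True for "top" or "left" edges, whose boundary pixels are
    drawn, and False for "bottom" or "right" edges, whose boundary
    pixels are skipped, so that a shared edge is drawn exactly once.
    """
    return ay < 0 or (ay == 0 and ax < 0)
```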

2.1.2 Edge Function Approach

In order to decide whether a point in the plane is inside a triangle, one can employ edge functions [8]. These are simply functions that assign negative numbers to points on one side of a line and positive numbers to points on the other side, while points on the line are assigned zero. By specifying one such function for the line given by each side of the triangle, the signs of these functions at any point indicate whether the point is inside the triangle or not. For a line given by the points a and b, an edge function can be given as the z component of the cross product of the vector from a to a point (x, y) and the vector from a to b. This gives the formula

    Eab(x, y) = (x − ax)(by − ay) − (y − ay)(bx − ax)    (2.1)

which can be written incrementally as

    Eab(x + ∆x, y + ∆y) = Eab(x, y) + ∆x(by − ay) − ∆y(bx − ax)    (2.2)

Using edge functions, one can calculate the value of E for a corner of the bounding box of a triangle, then incrementally for all pixels in the bounding box, requiring only one addition per step. In order to further limit the number of pixels visited, blocks of pixels can be considered and tested against the edge functions, using the “Linear Edge Function Test” [4]. This test is also useful for triangle clipping, see section 5.3. There are also more complex ways to traverse the pixels of a triangle, as described in [8].
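The bounding-box traversal can be sketched as follows. This Python snippet is an illustration of ours, not the thesis's hardware: it evaluates eq. (2.1) directly at each pixel and uses a winding-agnostic inside test; hardware would instead use the incremental form (2.2), and a fill convention would break ties on shared edges.

```python
def edge(a, b, p):
    """Edge function E_ab(p) of eq. (2.1): the z component of
    cross(p - a, b - a).  Zero exactly on the line through a and b."""
    return (p[0] - a[0]) * (b[1] - a[1]) - (p[1] - a[1]) * (b[0] - a[0])

def rasterize(v0, v1, v2):
    """Return the integer points of the triangle's bounding box whose
    three edge functions agree in sign, i.e. the points inside the
    triangle (fill convention omitted for brevity)."""
    xs, ys = [v0[0], v1[0], v2[0]], [v0[1], v1[1], v2[1]]
    inside = []
    for y in range(min(ys), max(ys) + 1):
        for x in range(min(xs), max(xs) + 1):
            e0 = edge(v0, v1, (x, y))
            e1 = edge(v1, v2, (x, y))
            e2 = edge(v2, v0, (x, y))
            if (e0 >= 0 and e1 >= 0 and e2 >= 0) or \
               (e0 <= 0 and e1 <= 0 and e2 <= 0):
                inside.append((x, y))
    return inside
```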

2.1.3 Scanline Conversion Approach

Another approach to triangle rasterization is the scanline conversion approach. A scanline is a horizontal line of pixels in a frame, and for each such line passing through the triangle, there is one span of pixels considered to be inside the triangle. This span can be found by traversing the edges of the triangle in parallel, thus keeping track of the left and right ends of the span. If the triangle has a horizontal edge, there are only two edges to follow in this manner and all is well. In the general case, however, one will have to switch to another edge part of the way into the rasterization. Several techniques can be used to traverse an edge of the triangle, but to our knowledge all of them involve a division per edge of the triangle in order to calculate the slope. For a nice description of scanline conversion, see [5].

2.1.4 Linear Interpolation

Given values for some parameter, say l, at each vertex of a two-dimensional triangle, the value of l at each point inside the triangle should be calculated. Assuming that the parameter varies linearly across the triangle, this can be done through linear interpolation. Picture the triangle in a three-dimensional space, where the third dimension corresponds to the parameter l. The normal of the plane in which the triangle lies can be found by calculating the cross product of two of the sides:

    n = (x1 − x0, y1 − y0, l1 − l0) × (x2 − x0, y2 − y0, l2 − l0)
      = ((y1 − y0)(l2 − l0) − (l1 − l0)(y2 − y0),
         (l1 − l0)(x2 − x0) − (x1 − x0)(l2 − l0),
         (x1 − x0)(y2 − y0) − (y1 − y0)(x2 − x0))    (2.3)

Note that the third component of the normal, nl, is independent of l, and thus is the same for all parameters of the triangle. This per-triangle constant is referred to as c. Through the equation for the plane

    nx·x + ny·y + nl·l = D    (2.4)

the rate of change of l with respect to x and y can be found as follows:

    l = (D − nx·x − ny·y) / nl    (2.5)

    ∂l/∂x = −nx / nl    (2.6)

    ∂l/∂y = −ny / nl    (2.7)

Given these rates of change, the value of l can be calculated at any point in the xy-plane as

    l(x, y) = l0 + (x − x0)·∂l/∂x + (y − y0)·∂l/∂y    (2.8)

or incrementally as

    l(x + ∆x, y + ∆y) = l(x, y) + ∆x·∂l/∂x + ∆y·∂l/∂y    (2.9)

While the triangles, once projected, have two-dimensional screen coordinates, the perspective projection causes parameters to vary nonlinearly across the triangles. By linear interpolation of l/z, where z is the z-coordinate before projection, and subsequent division by a linearly interpolated 1/z, the correct value of l can be recovered. When applied to texture coordinates, this is referred to as perspective-correct texture mapping.
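The gradient setup of eqs. (2.3)–(2.9) can be sketched numerically. This Python sketch is an illustration of ours; for perspective-correct texture mapping one would interpolate l/z and 1/z this way and divide per fragment.

```python
def gradients(v0, v1, v2):
    """Rates of change of a linearly varying parameter l across a
    triangle whose vertices are (x, y, l), per eqs. (2.3), (2.6), (2.7)."""
    (x0, y0, l0), (x1, y1, l1), (x2, y2, l2) = v0, v1, v2
    nx = (y1 - y0) * (l2 - l0) - (l1 - l0) * (y2 - y0)
    ny = (l1 - l0) * (x2 - x0) - (x1 - x0) * (l2 - l0)
    nl = (x1 - x0) * (y2 - y0) - (y1 - y0) * (x2 - x0)  # the constant c
    return -nx / nl, -ny / nl

def interpolate(v0, dldx, dldy, x, y):
    """Value of l at (x, y), eq. (2.8)."""
    x0, y0, l0 = v0
    return l0 + (x - x0) * dldx + (y - y0) * dldy
```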

2.2 Depth Testing

It is important that the displayed image looks as intended. In 3D scenes this often means that objects close to the camera are not occluded by objects further away. One way of achieving correct occlusion is to first sort all objects in the scene and render them back-to-front, overdrawing whole or partial objects when needed. This algorithm is called the painter's algorithm and works well in most scenes. However, it wastes precious time by applying color to many fragments that, because of overdrawing, will not be used in the final image. A faster and more flexible alternative involves using a depth buffer to keep track of how far “into the scene” each pixel in the framebuffer is. Before a fragment is colorized, its depth value is compared to the depth value stored at the corresponding location in the depth buffer. Typically, if the fragment passes (often this is equivalent to it being closer to the camera than the previous fragment), the depth buffer is updated with the new depth value and the fragment is passed down the pipeline for further processing; otherwise the fragment is discarded.
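The per-fragment test can be sketched as follows. This Python illustration is ours and assumes a LESS-style comparison; OpenGL allows other comparison functions as well.

```python
def depth_test(depth_buffer, x, y, frag_depth):
    """Compare a fragment's depth with the stored value; on a pass,
    update the buffer and keep the fragment, otherwise discard it."""
    if frag_depth < depth_buffer[y][x]:  # smaller depth = closer to camera
        depth_buffer[y][x] = frag_depth
        return True   # pass the fragment down the pipeline
    return False      # discard the fragment
```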

2.3 Texture Mapping

An important technique used to increase detail in rendered images without increasing the complexity of the geometry is called texture mapping. To apply a texture map (usually a regular 2D image) to the surface of a triangle, a texture coordinate is associated with each of the triangle's vertices. To determine the influence of the texture map on a certain point in the triangle, a texture coordinate at that point is calculated and used to determine which pixel (in the context of texture mapping: texel) in the texture map to look up. Figures 2.2a and 2.2b illustrate the huge difference this technique makes by showing the same triangle without and with texture mapping, respectively.

(a) Triangle without texture mapping    (b) Triangle with texture mapping

Figure 2.2: Demonstration of texture mapping

2.3.1 Mipmapping

To have high detail in textures, the texture maps should be large. When mapping a large texture onto a small triangle (e.g. when viewing geometry from afar), the sample points in the texture will be spread far apart. This introduces obvious aliasing effects. To eliminate the aliasing effects, the texture can be sampled after applying an averaging (low-pass) filter. With the mipmapping technique, this is done as a preprocessing step for each relevant texture. The texture is processed by halving its width and height, typically using bilinear interpolation, multiple times, generating mipmap levels with dimensions w/2 × h/2, w/4 × h/4, w/8 × h/8 and so forth. A texture and its mipmap levels are depicted in figure 2.3. When applying a texture to a triangle, the mipmap level that comes closest to satisfying one texture texel per screen pixel is selected and used. It might seem wasteful to store multiple versions of the same texture, but it can be shown that the storage space of the additional mipmap levels is actually only one third of the storage space of the original texture level. Additionally, in systems employing a texture cache, being able to use small textures is a great advantage.

Figure 2.3: A texture and its mipmaps
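The one-third figure follows from the geometric series 1/4 + 1/16 + 1/64 + … = 1/3, which a short computation confirms. This Python sketch is our illustration only:

```python
def mipmap_chain(w, h):
    """Texel count of each mipmap level, halving width and height
    until reaching 1x1."""
    levels = [w * h]
    while w > 1 or h > 1:
        w, h = max(w // 2, 1), max(h // 2, 1)
        levels.append(w * h)
    return levels
```

For a 256 × 256 texture the extra levels hold 21845 texels against 65536 in the base level, i.e. just under one third of the base storage.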

2.3.2 Texture filtering

Since textures are mapped onto geometry without regard to the rendered frame's resolution, it is very likely that the texture coordinate of a fragment does not point at the middle of one texel, but rather in between texels. When this happens, a color must still be applied to the fragment. It is common to interpolate the colors of the neighboring texels linearly. A faster (but worse, quality-wise) alternative is to just pick the color of the nearest texel. Similarly, when the ideal mipmap level to sample is not an integer, the conflict can be resolved by interpolating sample values from the two nearest mipmap levels, or by sampling just one of them. Depending on the settings used to resolve these conflicts, coloring a fragment takes from 1 to 8 actual texture lookups. For comparison, see figures 2.4a and 2.4b, which use nearest texel/nearest mipmap lookup and interpolated texel/interpolated mipmap, respectively. The figures are sized to reflect how the effects would appear on a 3" monitor with a resolution of 320 × 240.

(a) Using nearest/nearest filtering    (b) Using linear/linear filtering

Figure 2.4: Demonstration of the effects of filter modes
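The lookup counts follow directly from the filter combination, and the linear texel filter itself is a weighted sum of the four nearest texels. The following Python sketch is our illustration (single-channel texels, clamped at the texture border):

```python
# texture lookups per fragment for (texel filter, mipmap filter)
LOOKUPS = {
    ("nearest", "nearest"): 1,  # one texel from one mipmap level
    ("linear",  "nearest"): 4,  # four texels from one mipmap level
    ("nearest", "linear"):  2,  # one texel from each of two levels
    ("linear",  "linear"):  8,  # four texels from each of two levels
}

def bilinear(tex, u, v):
    """Blend the four texels around the continuous coordinate (u, v)."""
    x0, y0 = int(u), int(v)
    fx, fy = u - x0, v - y0
    x1 = min(x0 + 1, len(tex[0]) - 1)
    y1 = min(y0 + 1, len(tex) - 1)
    top = tex[y0][x0] * (1 - fx) + tex[y0][x1] * fx
    bot = tex[y1][x0] * (1 - fx) + tex[y1][x1] * fx
    return top * (1 - fy) + bot * fy
```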

2.3.3 Multitexturing

Several textures can be mapped onto the same geometry at the same time. This is called multitexturing. It may sound nonsensical to apply colors from two or more texture maps to the same object, but texture maps can represent more than just colors. It is common to apply light to static scenes by precalculating the light each piece of geometry receives, storing it in a special texture map called a light map, and then applying it as a texture on the geometry. To demonstrate how powerful this technique can be, compare figures 2.5a and 2.5b.

(a) Scene without light mapping

(b) Scene with light mapping

Figure 2.5: Demonstration of multitexturing through light mapping

(a) No transparency (b) Full transparency (c) Half transparency

Figure 2.6: Examples of blending in action

2.4 Blending

When a fragment color is to be written to the framebuffer, the target pixel may already have been assigned a color. In this case, one can choose either to replace the color, or to blend the two colors in some way. A common use of blending is transparency. For example, text is often drawn as texture-mapped rectangles, one for each character. The textures have a foreground color, say white, and a background color, say black, see figure 2.6a. When drawing text on top of other graphics, oftentimes only the foreground color, the character itself, is desired. In this case, one can use a fourth color channel, called alpha, to cause the background of the text to be transparent, as seen in figure 2.6b. An alpha value of 0 results in the black background color not being written to the framebuffer at all, while a value somewhere between 0 and 1 will cause the background and framebuffer colors to be blended by that ratio. The latter is useful if one wants a semi-transparent text background, as seen in figure 2.6c. Other applications of this type of blending include smoke and muzzle flash effects.

Chapter 3

System Architecture

3.1 Platform Assumptions

The purpose of the graphics processor developed in this master's thesis is to offer a way of adding hardware-accelerated 3D rendering capabilities to an existing embedded FPGA system. It is assumed that the target platform has a large, fast single-port RAM connected to the FPGA, and that the FPGA is programmed with a soft CPU and an array of peripherals. See figure 3.1 for an idea of how the graphics processor fits into the platform. The triangles to be rendered are fetched by the graphics processor using DMA. The part of the graphics accelerator running on the CPU is responsible for packing triangles generated by the accelerated application into a form suitable for the graphics processor. Since the FPGA will have a set of peripherals already programmed onto it, it is important that the graphics processor requires as little hardware, in the form of block RAMs, DSP blocks (multipliers) and LUTs, as possible. As the target platform only has one external memory with one port, the graphics processor will share this port with the CPU and all other peripherals. The embedded system will only be used for one application, so the software generating triangles will not be replaced over time and the required feature set of the GPU will remain the same.

3.2 Implications of Platform Limitations

As has already been mentioned, there is little free area on the FPGA. Furthermore, many modules share the same memory port of an already high-latency memory. Therefore, using few hardware resources and minimizing memory accesses is crucial. This section treats how to design the system in order to meet these requirements.

Figure 3.1: The target platform, accelerator components in grey

Figure 3.2: A frame with tiles and a triangle. Shaded tiles overlap with the triangle

In order to render a triangle, fragments are in general generated, depth tested, colored and finally blended with the framebuffer. This means that one must have access, for each fragment, to the corresponding depth and color values from the depth and frame buffers, respectively. These buffers are what “buffers” refers to in the text below.

Triangles are generated in sequence by some software. One approach to rendering is to, as each triangle arrives at the GPU, read the buffers from memory, render the triangle, then write the buffers back to memory. This is of course a horribly bad approach given the limitations mentioned above. Not only would it cause excessive memory accesses, but it would also require on-chip memory the size of the buffers, i.e. 525 kibibytes or about 260 block RAMs, with a frame size of 320 × 240 pixels. The reading of the buffers can be eliminated by keeping them in on-chip memory until the frame is finished, then writing the finished frame to memory. This is not nearly good enough, however, since the large on-chip memory is still required.

Another approach is to figure out which parts of a frame could possibly be modified by the triangle, read those parts of the buffers, render the triangle, then write them back. Since the largest possible triangle is as large as half the screen, this is only a small improvement. If the parts that divide the frame form a rectangular grid, as seen in figure 3.2, which is preferable for several reasons, they are called tiles. Next, one might decide to split the triangle into smaller ones that fit within tiles, read one tile from memory, render the triangles which lie in that tile, then write the tile back, repeating this for each relevant tile. This requires reading and writing just as much data as the previous approach, but costs only enough on-chip memory to hold one tile of each buffer. This is a big improvement, but there is still more to be done.
In order to eliminate the reading of tiles, one can split and sort the generated triangles into tiles before sending all triangles for one tile to the GPU at once. This is similar to keeping the whole buffers in on-chip memory, except that it is done at tile level, so only a tile-sized on-chip memory is required, at the cost of some preprocessing and main memory storage. This final approach is what was chosen for implementation.

When a textured fragment arrives at the colorizer, the correct texture data must be fetched. Always fetching this data from external memory is very slow because of the latency of the memory. Fetching many neighboring texels at the same time and caching this data allows for fast texture mapping, as well as potentially fewer memory accesses because of cache hits. Usually, small steps are taken in the texture, resulting in the same or neighboring texels being accessed soon after each other. Since the stepping direction in the texture is unknown, caching square blocks of texels is a good way to make cache misses less likely. While texture caching costs on-chip memory, it is necessary in order to get the performance needed for texture mapping. For a more in-depth analysis of the impact of tile-based rendering and texture caching on area, memory bandwidth utilization and power requirements, see [3].

Some of the values used to describe the triangles, such as texture coordinates, are not explicitly bounded in range or precision, and in practice use floating point representation. However, floating point calculations in FPGA hardware are impractical, since a floating point unit takes much area, and more than one would be needed for parallel computations. Therefore, fixed point calculations, which require far fewer hardware resources, are used in this work. Fixed point numbers are limited in range and precision, so these attributes must be carefully chosen to best fit the most commonly used values.
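The chosen approach, splitting and sorting triangles by tile before rendering, amounts to a binning pass in software. The following Python sketch is our illustration only: the 32-pixel tile size and the bounding-box overlap test are assumptions, and the real software component also clips triangles to tile borders.

```python
TILE = 32  # assumed tile size in pixels; a configuration choice

def bin_triangles(triangles, frame_w, frame_h):
    """Assign each triangle to every tile its bounding box overlaps,
    so that all triangles for one tile can be sent to the GPU at once."""
    cols = (frame_w + TILE - 1) // TILE
    rows = (frame_h + TILE - 1) // TILE
    bins = {(tx, ty): [] for ty in range(rows) for tx in range(cols)}
    for tri in triangles:
        xs = [v[0] for v in tri]
        ys = [v[1] for v in tri]
        for ty in range(max(min(ys) // TILE, 0),
                        min(max(ys) // TILE, rows - 1) + 1):
            for tx in range(max(min(xs) // TILE, 0),
                            min(max(xs) // TILE, cols - 1) + 1):
                bins[(tx, ty)].append(tri)
    return bins
```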

3.3 Partitioning of the Graphics Pipeline

A conceptual illustration of the graphics pipeline is shown in figure 3.3. The grey area shows what parts of the pipeline are the most computationally intensive and for that reason have been implemented in hardware. The triangle projection will be done in software.

Figure 3.3: A graphics pipeline. The grey area shows what functionality has been implemented in hardware

Chapter 4

Hardware Implementation

4.1 Proposed Configuration

The proposed configuration of the system supports the following features:

• Multitexturing with two texture units, modulate used for combining two texels

• Mipmapping with 7 mipmap levels, only supported filtering modes for texel lookup/mipmap selection are nearest/nearest

• Blending, all modes from OpenGL

• Depth testing, 24 bit depth buffer

• The only renderable primitive is the triangle

• Texture width and height are limited to powers of 2 between 4 and 256, inclusive

• 320 × 240 resolution with 24 bit color

• VSynced buffer swapping

The modules and their hardware requirements are described in the next section.

4.1.1 Notes About the Limitations of the Configuration

More filtering modes than nearest/nearest could be added by extending the texture headers by a few bits to contain a filtering function selection. Additionally, logic would need to be added to support multiple texture lookups and interpolation between multiple texels for each fragment. The main reasons for not supporting more modes are the additional cycle time these lookups would take, as well as the expensive hardware (multipliers) needed for interpolation.


The fact that texture dimensions are powers of 2 is heavily exploited to avoid multiplications when calculating which texel to look up for a given texture coordinate, as well as when calculating which block a texel resides in. The lower bound of texture dimensions (4) is set so that pixel blocks are always fully occupied, to simplify logic, and the upper bound (256) is set to guarantee good enough precision when interpolating texture coordinates. The texture dimension limitations are non-trivial to overcome. The supported resolution is fixed mainly because the module generating the video signal is rather inflexible. The vast majority of the logic for rendering triangles is based on tiles rather than frames and is not affected at all if the frame resolution is changed. If the module responsible for the video signal is modified, the resolution that the accelerator supports could be increased easily, although of course with a performance hit.
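As an illustration of how the power-of-two constraint removes multipliers, both the texel address and the cache-block coordinates reduce to shifts. The Python sketch below is ours; the 4 × 4 block size is an assumption for the example, not a figure from the thesis.

```python
def texel_index(u, v, log2_w):
    """Flattened index of texel (u, v) in a texture whose width is the
    power of two 2**log2_w: a shift and an OR replace the multiply."""
    return (v << log2_w) | u

BLOCK_LOG2 = 2  # assumed 4x4 texel cache blocks (block size is a design choice)

def texel_block(u, v):
    """Cache-block coordinates of texel (u, v), again using only shifts."""
    return (u >> BLOCK_LOG2, v >> BLOCK_LOG2)
```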

Table 4.1: Hardware utilization per module

Module                          LUTs   CLB slices   Dffs   18 Kib BRAMs   DSP48Es
GPU                              164          189     76              0         0
Frame Renderer                     0            0      0              0         0
Triangle Fetcher                 320          166    924              0         0
Tile Renderer                     27          243    133              0         0
Triangle Handler                 125          483    793              0         0
Fragmenter                         0            0      0              0         0
Triangle Parameter Calculator    205          182    527              0         0
Paramgen                        1706          971   3274              0         2
Fragment Generator              1190          324   1118              0         0
Depth Buffer                     174           99    506              0         0
Fragment Queue                   187          180    344              4         0
Colorizer                         52           26    263              0         4
Texturer                         265          137    290              4         0
Texture Unit 0                   606          234    439              3         0
Texture Unit 1                   161           70    235              2         0
Blender                          281           91    219              0         1
Frame Buffer Tile Memory          67           22     66              2         0
Color Buffer Dumper             1019          493   1407              0         0
Video Out                       1409          727   1320              1         0

4.2 Module Descriptions

The focus of this work has been to develop a 3D graphics accelerator that uses few hardware resources, while being modular and easily configurable. This section allows the reader to understand how the accelerator works by describing each of the modules that make up the GPU. The function and interface of each module are described, as well as possible trade-offs (mostly performance vs. hardware usage trade-offs). These trade-offs are of interest as they aid in tailoring the accelerator to a specific set of requirements. The hardware used by each module, not counting any modules residing inside that module, is listed in table 4.1. Area optimizations have been made chiefly with regard to block RAMs and multipliers, since this is relatively straightforward compared to minimizing LUT usage, which would take significantly more time and effort. The total resource utilization for some specific configurations of the accelerator is shown in table 7.1. Note that the total utilization was reported by a build tool, while the per-module utilization was extracted using a custom script, so some discrepancies may be present.

Figure 4.1: Diagram of the modules of the accelerator

4.2.1 GPU

Function

The GPU reads triangles from memory one tile at a time, producing a frame buffer in memory which is then shown on an external screen. The required inputs are as follows:

• ≥ 75 MHz clock signal
• triangle data in memory, as specified in appendix A.1
• the base address of the triangle data
• the base addresses of the frame buffers

The GPU needs both read and write access to some memory, but is agnostic about what type of memory is used, or whether several memories are used. This is accomplished by employing the Simple Burst Protocol described in section 4.3.3. The modules directly underlying the GPU are Video Out, VBlank Swap Helper and Frame Renderer. These are very loosely connected, using VBlank Swap Helper to make sure that frame buffers are swapped while no pixels are being drawn on the screen, to eliminate tearing.

4.2.2 Video Out

Function

The video out module is responsible for generating the video signals needed to display a framebuffer on a monitor. Internally, it has a memory big enough to fit all pixels on one line. The line is conceptually split into blocks of 32 pixels. Once the signal corresponding to all pixels in such a block has been generated, the block is marked as old. The video out module will then copy 32 pixels from main memory to the line memory and mark the block as updated. It is necessary to prefetch pixels in this manner because main memory access latencies are hard to predict, and it is a catastrophe if pixels have not been fetched in time.

Hardware-Saving Trade-offs

It is difficult to reduce block ram usage of this module further as some pixels need to be prefetched. One might want to use more block rams to prefetch more pixels if the main memory latency is very unpredictable. If an upper bound can be guaranteed on the latency of memory reads, e.g. by using a priority based arbitration scheme on the bus, the amount of prefetching can be limited, and less on-chip memory used.

4.2.3 VBlank Swap Helper

Function

Tearing can be avoided by modifying the frame buffer connected to video out only in the blanking period. Another approach, which does not put an extra constraint on how long it takes to render a frame, is double buffering: while the renderer writes to one frame buffer (the back buffer), the last frame buffer generated (the front buffer) is displayed on the screen. When the renderer is finished generating the frame, the addresses of the front and back buffers are swapped, with the effect that the newly generated frame is displayed on the screen. The address swap must occur when the video generating unit is not generating a signal, i.e. in the blanking period, to avoid tearing. This module takes as input the swap request from Frame Renderer as well as the signal from Video Out that indicates whether the Video Out module is in VBlank. It then applies the bitwise AND operation on these to produce a second swap request. Logic in GPU will swap the frame buffer addresses when both of these swap signals are high. This module is very cheap but has great utility. It is not recommended to remove it.
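The swap condition described above can be sketched as follows. This is an illustrative model, not the actual HDL; the function and signal names are invented for the example.

```python
def swap_allowed(renderer_swap_request: bool, in_vblank: bool) -> bool:
    # Bitwise AND of the swap request from Frame Renderer and the VBlank
    # indicator from Video Out: a swap is only allowed while no pixels
    # are being drawn, which prevents tearing.
    return renderer_swap_request and in_vblank

def maybe_swap(front, back, swap_request, in_vblank):
    # Logic in GPU swaps the front and back buffer addresses when both
    # swap signals are high; the newly rendered frame becomes visible.
    if swap_allowed(swap_request, in_vblank):
        return back, front
    return front, back
```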

4.2.4 Frame Renderer

Function

This module contains everything necessary to render a complete frame. It mainly connects other modules. Frame Buffer Tile Memory is double-buffered so that Tile Renderer can render one tile while Color Buffer Dumper dumps the last rendered tile to main frame buffer memory.

Hardware-Saving Trade-offs

All modules contained within this module are necessary for correct operation. The only saving possible on this level is to modify Frame Buffer Tile Memory to be single-buffered to save a block ram; this is possible if synchronizing logic is added to make sure that Tile Renderer and Color Buffer Dumper do not access it at the same time.

4.2.5 Triangle Fetcher

Function

It is the responsibility of the triangle fetcher to read triangles from memory and pass them on to the rest of the system. The format of the triangles in memory is described in appendix A.1. Triangles are passed on to the tile renderer via a req/ack interface, where the triangle fetcher is the passive part. In order to signify that the last triangle of a tile has been sent, a nack signal is used.

Hardware-Saving Trade-offs

A triangle entry in memory is six 64-bit words long. Such entries are currently read, one at a time, with burst reads. This saves on-chip storage, since only one entry has to be kept in the triangle fetcher at once. If longer bursts are desired — in order to minimize the overall time this module occupies the bus — one probably wants to use a block ram for storage of a block of triangles.

4.2.6 Tile Renderer

Function

Tile Renderer makes sure that all the triangles in one tile are rendered. It does this by cleverly delegating all work to other modules contained inside it. Its layout is best described visually, see figure 4.1. The purpose of having a queue for fragments between Tile Renderer and Colorizer is that even if one of these modules suddenly stalls, the other can still continue working.

Hardware-Saving Trade-offs

This module contains very little hardware by itself, but big savings can be made by modifying the configuration of the modules contained within. For instance, if texture mapping is not needed, Texturer can be removed. Similarly, if blending is not needed, Blender can be removed. If few enough fragments will be processed per frame, it is possible to remove Fragment Queue and let fragments be generated at the exact same pace as they are colored.

4.2.7 Triangle Handler

Function

The triangle handler processes triangles into fragments which are ready to be colored. The depth test has been placed here, before the coloring, so as to decrease the number of fragments which are colored but not visible. The triangle handler actively fetches triangles from the triangle fetcher via a req/ack protocol, and writes fragments which have passed the depth test to the fragment queue.

Hardware-Saving Trade-offs

The triangle handler uses little hardware apart from a register containing the next triangle to be processed. There is a check in the triangle handler which makes sure that the triangles have a nonempty bounding box. This check involves a few adders and comparisons. If one can ensure that no empty triangles are passed to the hardware, for example by checking this in software, the above-mentioned check can be removed.

4.2.8 Fragmenter

Function

The fragmenter breaks triangles into fragments for depth testing and shading. Its directly underlying modules are the triangle parameter calculator and the fragment generator. These modules communicate through a strobe/busy protocol, the triangle parameter calculator being the active part. The triangle parameter calculator fetches triangles from outside the fragmenter, so the fragmenter simply forwards these signals.

4.2.9 Triangle Parameter Calculator

Function

The triangle parameter calculator takes as input a triangle and produces interpolation parameters for the per-vertex attributes of the triangle, as well as for the half-space functions describing the edges and for the parameters used to decide the mipmap level for texture mapping. Most of these calculations are handled by its only underlying module, called Paramgen. Further processing of mipmap level parameters and correction of half-space function values for the fill convention is then done by the triangle parameter calculator. These two modules communicate through a modified strobe/busy protocol, where a strobe is sent to Paramgen to order the start of calculation, and a strobe is received to indicate that the calculation is done. This is fine, since the triangle parameter calculator actively waits for Paramgen to finish. For fetching triangles, the triangle parameter calculator uses the req/ack protocol, playing the active part. The strobe/busy protocol is used for sending the calculated parameters on to the fragment generator.

Hardware-Saving Trade-offs

Four adders are used for the mipmap level calculations. Three subtractors are used for the fill convention compensation of the half-space function values. If mipmapping is not desired, the mipmap level adders can be removed. The hardware usage can be lowered by doing calculations in sequence instead of in parallel.

4.2.10 Paramgen

Function

This unit performs calculations of interpolation parameters of per-vertex attributes. It was generated from lisp-like code by the scheduler, described in section 6.2. The values calculated are, more precisely, the initial value for interpolants, at one corner of the bounding box of the triangle, and the stepping values for the interpolants in the x and y directions. These are then used in the fragment generator for linear interpolation of the attributes. See section 2.1.4 for information on linear interpolation.
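The math behind these calculations can be sketched as follows. The real Paramgen works in fixed point with a schedule generated from the lisp-like description, so this floating-point model only illustrates what is computed: the plane equation of an attribute over the triangle yields the x and y stepping values and the initial value at the bounding box corner.

```python
def interpolation_params(v0, v1, v2, corner):
    # Each vertex is (x, y, attr). Solving the plane equation of the
    # attribute over the triangle gives the stepping values in the x
    # and y directions, plus the initial value at the given corner of
    # the bounding box.
    (x0, y0, a0), (x1, y1, a1), (x2, y2, a2) = v0, v1, v2
    det = (x1 - x0) * (y2 - y0) - (x2 - x0) * (y1 - y0)
    dadx = ((a1 - a0) * (y2 - y0) - (a2 - a0) * (y1 - y0)) / det
    dady = ((a2 - a0) * (x1 - x0) - (a1 - a0) * (x2 - x0)) / det
    bx, by = corner
    initial = a0 + (bx - x0) * dadx + (by - y0) * dady
    return initial, dadx, dady
```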

Hardware-Saving Trade-offs

Note that, even though two multipliers are required, they work together as one, handling multiplication of larger numbers. Synthesis of Paramgen was done with Precision. In order to get a pipelined multiplier, several registers were added on the output of the multiplier and retiming was enabled. Unfortunately, the retiming resulted in other registers being moved and other functional units limiting the clock frequency. For this reason, the “preserve_signal” attribute was set on the result registers of all functional units, and the maximum clock frequency increased. This resulted in some unnecessarily wide registers in Paramgen, but the synthesis tool managed to limit the widths of the functional units themselves, so the result is acceptable.

There are many ways to save hardware or increase performance by modifying this unit. For one, the number of available functional units, given as inputs to the scheduler, can be changed arbitrarily, as long as at least one unit for each necessary operation is present. Of course, the fewer functional units are used, the more cycles it takes to do the calculation. Furthermore, depending on the number of texture units used, and on whether mipmapping is enabled, calculations for texture coordinates and mipmap levels can be skipped. The calculations require, for each interpolant, 8 multiplications and 6 additions, which adds about 8 cycles to the calculations for each triangle. The fragments generated using these parameters vary in number, and some fail the depth test, so, in order to keep the rest of the pipeline working, the parameter calculations cannot take too long. This means that in order to have more interpolants, one must increase the number of functional units, multipliers in particular.

4.2.11 Fragment Generator

Function

The fragment generator iterates over the fragments of a triangle’s bounding box, incrementally calculating interpolant and half-space function values. It determines whether a fragment is inside the triangle, and in that case passes it on to the next module. Initially, each interpolant’s value at the top left corner is known. The value at the start of the next line is calculated by adding the interpolation step in the y direction. For each step in the x direction, the current value can be calculated by simply incrementing by the interpolation step. When the edge of the bounding box is reached, the value at the start of the next line is used, and these steps are repeated until all fragments have been examined.
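The incremental traversal can be sketched as follows. The interface and the sign convention (a fragment is inside when all half-space values are non-negative) are assumptions for the example, not taken from the HDL.

```python
def generate_fragments(bbox_w, bbox_h, edges):
    # edges: (value_at_top_left, step_x, step_y) per half-space function.
    # Only additions are needed per fragment; no multiplications.
    fragments = []
    row = [e[0] for e in edges]                # values at start of line
    for y in range(bbox_h):
        cur = list(row)                        # values at current fragment
        for x in range(bbox_w):
            if all(v >= 0 for v in cur):       # inside all three edges
                fragments.append((x, y))
            cur = [v + e[1] for v, e in zip(cur, edges)]   # step in x
        row = [v + e[2] for v, e in zip(row, edges)]       # step in y
    return fragments
```

The same stepping scheme applies to the interpolants; only the inside test is specific to the half-space functions.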

Hardware-Saving Trade-offs

For simplicity and speed, each interpolant has its own adders for incrementing by the interpolation step. Ideally, the fragment generator should examine one fragment each cycle, since many will end up failing the half-space function test or the depth test. If one can afford more cycles per fragment, fewer adders could be used. Additionally, any unused interpolant saves two adders and two registers.

4.2.12 Depth Buffer

Function

This module is responsible for determining whether a fragment passes the depth test. A hardware schematic of the module can be seen in figure 4.2. The following parameters are passed to the module for each test and are necessary for its operation:

• the fragment’s (x, y) position relative to the position of the tile

• the fragment’s depth value

• which depth function to use

• the number of times the depth buffer had been cleared when the fragment’s triangle was sent for rendering, its clear count

• whether to write the fragment’s depth value and clear count to the depth buffer if the depth test passes

The module has an internal memory, a depth buffer, with one slot for each pixel in a frame buffer tile. Each such slot contains a depth value and a clear count. When a fragment is to be tested, the depth value and clear count corresponding to the fragment’s position are fetched and then compared. The depth test passes when

DepthFunc(new_depth, old_depth) ∨ (new_clear_count > old_clear_count)

When the application running on the CPU requests a clear of the depth buffer, a counter is increased instead of actually clearing the depth buffer. The value of the counter at the time of a triangle render request is attached to the triangle. A triangle with a higher clear count than another triangle will be drawn on top of it, regardless of depth values. By using the clear count technique, it is not necessary to explicitly clear the depth buffer whilst rendering a tile, saving precious processing cycles. Typical depth functions are Always, which always passes the depth test regardless of depth values, and Less, which only passes fragments that are closer to the camera than the one stored in the depth buffer.

Figure 4.2: Hardware schematic of the Depth Test module
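The pass condition can be expressed directly in code; the names below are illustrative.

```python
def depth_test_passes(depth_func, new_depth, old_depth,
                      new_clear_count, old_clear_count):
    # A fragment passes when the depth function accepts it, or when its
    # clear count is newer than the one stored in the depth buffer, in
    # which case the stored depth value belongs to an older "cleared" frame.
    return depth_func(new_depth, old_depth) or new_clear_count > old_clear_count

always = lambda new, old: True     # passes regardless of depth values
less = lambda new, old: new < old  # passes fragments closer to the camera
```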

Hardware-Saving Trade-offs

If it is known that the application does not need a depth buffer, for example if it uses painter’s algorithm to sort the 3D geometry before rendering, this module can be replaced by logic that always passes the tested fragment. No block rams would then be needed.

4.2.13 Fragment Queue

Function

The fragment queue serves as a buffer between the triangle handler and the colorizer. When the fragmenter halts because new interpolation parameters must be calculated, the queue still feeds the colorizer with fragments. When the colorizer needs to wait for texture data because of a cache miss, the queue allows the fragmenter to continue working.

Hardware-Saving Trade-offs

The large number of block rams stems from the data width of fragments, and the desire to be able to push and pop fragments in one cycle. The number of bits needed to represent a fragment directly influences the number of block rams used. By sacrificing some features, such as multitexturing or mipmapping, the fragment can be made smaller and thus the number of block rams required lowered. The number of block rams can also be made smaller by allowing pushes and pops to and from the queue to take more than one cycle, storing a fragment in several rows of a block ram.

By measuring the utilization — the number of used slots — of the queue over time, one can decide on optimizations to be made. If the queue is often full, fragments are produced faster than necessary, and the part of the pipeline before the queue can be slowed down, usually for hardware utilization gains. If the queue is often empty, fragments are consumed faster than necessary, which, although unlikely, means that the part of the pipeline after the queue can be slowed down. In the event that the utilization of the queue varies very little around some value, the depth of the queue can be adjusted to match that value more closely. By matching the depth of the queue to the amplitude of the variation in utilization, one can greatly decrease the size of the queue while taking a minimal performance hit. In order to visualize queue utilization over time, with one producer and one consumer working independently, a simple simulation program was written in MATLAB.

Figure 4.3: Simulation of utilization of the fragment queue. (a) Queue depth 25. (b) Queue depth 4.

Figure 4.3a shows a simulation of the utilization of a queue with 25 slots. The production and consumption of 200 items takes 262 time units, and the amplitude of the “noise” is about 4. Running the simulation again with a queue with 4 slots yields the result shown in figure 4.3b. The process still takes 262 time units. Further lowering the queue depth causes a performance hit; a depth of 1 causes the process to take 304 time units. Of course, if this resizing of the queue ever causes the queue to become empty where before it was not, the performance will be worse.
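A similar simulation is easy to reproduce. The sketch below is not the original MATLAB program; the production and consumption probabilities are invented for illustration.

```python
import random

def simulate_queue(depth, items=200, p=0.8, seed=1):
    # Producer pushes and consumer pops with probability p per time unit;
    # the producer stalls on a full queue, the consumer on an empty one.
    # Returns the total time and the utilization trace.
    rng = random.Random(seed)
    produced = consumed = used = time = 0
    trace = []
    while consumed < items:
        if produced < items and used < depth and rng.random() < p:
            produced += 1
            used += 1
        if used > 0 and rng.random() < p:
            consumed += 1
            used -= 1
        trace.append(used)
        time += 1
    return time, trace
```

With a deep queue the producer and consumer rarely stall each other, while a depth of 1 forces them into near lockstep and increases the total time.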

4.2.14 Colorizer

Function

Colorizer takes a fragment as input and calculates its final color. In the implementation of the graphics processor, all fragments are textured with one or two textures, so the functionality of this module is to request the texture lookups and multiply the resulting colors. If this graphics processor supported dynamic fragment shaders, the behavior of this module would be configurable at run time.

Hardware-Saving Trade-offs

By sequencing the multiplications needed for multiplying the texture colors, only one multiplier needs to be used in this module.

4.2.15 Texturer

Function

The texturer wraps two instances of the texture unit module and exposes a convenient interface for performing texture lookups. A req/ack protocol is implemented to allow lookups to take an arbitrary amount of time, which is convenient because lookups where a cache miss occurs take longer than those where a cache hit occurs. At the beginning of each frame, all texture headers are read and stored in internal memory and the texture units are requested to clear their caches.

Hardware-Saving Trade-offs

By reducing the number of textures supported (currently 2048), the number of block rams required by this module can be greatly reduced. Also, a smarter implementation of this module could cache recently used texture headers instead of storing all of them internally, further reducing the number of block rams required. Two texture unit instances are used so that each can have its own cache. In applications where only one texture unit is required, or where few enough texture lookups are performed so that some cache thrashing is acceptable, only one texture unit needs to be instantiated.

4.2.16 Texture Unit

Function

The texture unit is responsible for looking up a color in a texture. The texture id, texture size, texture coordinates as well as the derivatives of the texture coordinates (how quickly they change between neighboring pixels) are passed to it and are used to select which mipmap level of the texture to select as well as which texel in the mipmap to look up. Combined, the texture id, mipmap level, and texel coordinates are used to calculate the address in main memory where the texel color can be found. The texel color is fetched from main memory unless it already resides in a local cache. When a cache miss occurs, a block of 4 × 4 texels containing the requested texel and its neighbors is fetched from the selected mipmap and stored in the cache. Due to the nature of texture lookups, it is very likely that the same texel or texels close to it are fetched soon afterwards, which motivates fetching blocks of texels at a time to reduce the amortized fetch cost per texel. For a detailed description of how lookups are performed, see [6].

Figure 4.4: Simplified hardware schematic of the texture cache

The hardware of the core functionality of the texture unit is depicted in figure 4.4. To avoid cluttering the schematic, only signals relevant to a successful cache lookup are included. If a cache miss occurs, the FSM will make sure that the correct texel block is read from memory and that the tag and texel memories are updated accordingly before trying again.
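The lookup path can be modelled roughly as below. The way the block identifier is packed into tag and index, and all field widths, are invented for the example; the thesis and [6] describe the real layout.

```python
def split_block_id(texture_id, mipmap, block_x, block_y, index_bits=7):
    # Pack the block identifier, then split it into a cache index (low
    # bits) and a tag (remaining bits). Field widths are illustrative.
    key = (texture_id << 24) | (mipmap << 20) | (block_y << 10) | block_x
    return key >> index_bits, key & ((1 << index_bits) - 1)

def lookup(cache, texture_id, mipmap, u, v):
    # Direct-mapped lookup of the 4 x 4 texel block containing (u, v).
    # Returns True on a cache hit. On a miss the FSM would fetch the
    # block from main memory; here we just install the tag.
    tag, index = split_block_id(texture_id, mipmap, u // 4, v // 4)
    if cache.get(index) == tag:
        return True
    cache[index] = tag
    return False
```

Because neighboring fragments usually fall in the same 4 × 4 block, the second and following lookups near a fetched texel hit the cache.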

Hardware-Saving Trade-offs

In the proposed configuration there are two instances of Texture Unit, with a capacity of 128 and 32 cached blocks, respectively. The number of block rams necessary for a texture unit is n + m, where n = ⌈cacheable_blocks/32⌉ block rams are necessary for texel storage and m = ⌈cacheable_blocks/512⌉ for tag storage. Modifying the number of cacheable blocks for the texture units has a direct impact on the number of block rams used. By increasing the cache size, performance can be increased at the cost of block rams. Of course, block ram usage can be reduced by reducing the cache size. If the graphics processor could handle more than the simplest texture filtering mode (nearest mipmap, nearest texel), this unit would require at least one multiplier.
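The formula translates directly to code:

```python
import math

def texture_unit_block_rams(cacheable_blocks):
    # n block rams for texel storage, m for tag storage.
    n = math.ceil(cacheable_blocks / 32)
    m = math.ceil(cacheable_blocks / 512)
    return n + m
```

For the two configured units this gives 5 block rams (128 blocks) and 2 block rams (32 blocks).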

4.2.17 Blender

Function

It is the responsibility of this module to blend incoming fragment colors with the pixel color at the same position in the frame buffer tile. In principle, this is done by reading the color value at the position (destination color) and blending it with the incoming fragment color (source color) by first scaling them and then adding the products. The scale factors are calculated by functions passed to this module along with the fragment. Blending modes where the source/destination colors are multiplied by 0 or 1 can be handled specially to reduce time spent multiplying or waiting for memory to be read.
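Per channel, the blend operation looks like this. The 8-bit channel range and the clamping are assumptions for the example.

```python
def blend_channel(src, dst, src_factor, dst_factor):
    # Scale the source (incoming fragment) and destination (frame buffer
    # tile) colors and add the products, clamping to the 8-bit range.
    return min(255, round(src * src_factor + dst * dst_factor))
```

When a factor is 0 the corresponding multiplication can be skipped entirely, and a destination factor of 0 also makes the frame buffer read unnecessary, which is the special case mentioned above.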

Hardware-Saving Trade-offs

In the worst case, 8 multiplications need to be performed. In some blending-heavy applications (e.g. applications using lots of transparency) one might opt to use more multipliers to reduce blending time. The implemented blender performs the multiplications sequentially and thus requires only one multiplier.

4.2.18 Color Buffer Dumper

Function

The color buffer dumper writes a rendered tile to the framebuffer in memory. It does this by repeatedly accumulating a line of the tile in a register, then burst writing the line. The address of the start of each line of the tile in the framebuffer must be calculated, as well as the address of the start of the next tile to be dumped.

4.3 Communication Protocols

There is a lot of communication between the modules described above. For a typical triangle, data flows through 13 of the modules before all of the pixel locations and corresponding color values for that triangle have been calculated. Two protocols are used for synchronizing modules in our system. They can be used when sending data from one module (the sender) to another (the receiver), and this is the context in which they are described below. There is also a protocol called Simple Burst Protocol, for communication with external units.

4.3.1 Four-phase req/ack

Before a transaction between two modules is initiated, both req and ack are low. The sender drives req high to indicate that it wants to send data to the receiver. The sender holds any data it wants to send to the receiver until the receiver drives ack high. The sender then drives req low and waits until the receiver drives ack low before the transaction is concluded. This protocol allows two modules to be developed completely independently of each other, as it places no constraints on the relative speed of the modules; the scenario where the sender produces data faster than the receiver can consume it will never happen. One disadvantage of this protocol is that it takes at least 4 clock cycles for every datum to be transmitted.

If care is taken in the implementation of this protocol, it can be used to communicate between different clock domains. In our case it could allow different parts of the accelerator to be clocked at different speeds, potentially increasing the data throughput by clocking crucial modules faster and lowering power consumption by clocking less crucial modules slower. This opportunity has not been investigated further in this work.
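The four phases can be listed explicitly; this trace model is only an illustration of the handshake order, not of any actual module.

```python
def four_phase_transaction():
    # Returns the (req, ack) pairs after each phase of one transaction.
    # The sender's data must be held stable from phase 1 until the
    # receiver raises ack in phase 2.
    trace = []
    req, ack = 1, 0
    trace.append((req, ack))   # 1. sender raises req, data valid
    ack = 1
    trace.append((req, ack))   # 2. receiver latches data, raises ack
    req = 0
    trace.append((req, ack))   # 3. sender releases req
    ack = 0
    trace.append((req, ack))   # 4. receiver releases ack, transaction done
    return trace
```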

Table 4.2: Simple Burst Protocol

Signal Name                Direction  Description
address[32]                out        Start address of transaction
data_out[64]               out        Word to write
data_in[64]                in         Read word
strobe_start_transaction   out        Indicates start of transaction
read_not_write             out        Indicates read or write
word_count[9]              out        Number of words to burst
strobe_word_delivered      in         Indicates that a word was read or written
strobe_transfer_finished   in         Indicates end of transaction

4.3.2 Strobe and Busy Signals

An alternative method is to use a strobe signal that pulses to indicate that data is available (the data is only guaranteed to be valid for that same cycle!). Before the sender is allowed to strobe, it needs to make sure that it does not strobe in the same cycle as the receiver’s busy signal is asserted. This simpler protocol allows for faster transactions (one datum can be transmitted every cycle!), but is more error prone, as the receiver always needs to be ready to handle data in every cycle where the busy signal is not set. Additionally, the implementation of the sender gets trickier, as a Moore machine is not able to implement this protocol in the scenario where the receiver suddenly pulls busy high in the same cycle as strobe is asserted. These limitations resulted in this protocol only being used in the most time critical paths.

4.3.3 Simple Burst Protocol

For communication with external units, memories in particular, we designed a simple protocol supporting burst reads and writes. The signals of the resulting protocol are described in table 4.2. The rate of transfer of individual words is controlled by the external unit, so differences in word length or speed do not affect the protocol. It is up to the external unit to keep count of the words in a burst, and to be ready to initiate a transaction at any time. When interfacing with the PLB bus, as described in section 4.4, the maximum supported burst length is 2^12 − 1 bytes, which is why word_count in the simple burst protocol has a maximum value of 2^9 − 1.

4.4 Interfacing with the GPU

In order to fit the GPU into the platform described in section 3.1, the simple burst protocol must be translated to the one used by the platform. In the case of a MicroBlaze based system, the bus of choice is PLB. Because PLB is rather complex, an IP core called PLBV46 Master Burst was used for this translation. The Master Burst, in turn, requires another adapter to speak the simple burst protocol, which has been named sbp2mb (simple burst protocol to master burst). The addresses of the frame buffers, texture headers and triangle buffer, as well as the req/ack signals for a new triangle buffer, are exposed as registers on the PLB bus through the use of a PLB slave IP core called PLBV46 Slave Single. In a typical system, the software running on the CPU will allocate large contiguous buffers and pass pointers to these to the GPU via the registers. Some notes about these registers are in order.

4.4.1 Registers

In the system implemented on the platform described in section 7.1, the addresses passed to registers 4.2, 4.3, 4.4 and 4.5 have in common that they must be aligned to an 8-byte word size, meaning that their 3 least significant bits must be 0. This is because the start address of burst reads and writes from and to the memory on this platform must be aligned in this way. The Status register 4.1 is used to start a frame rendering; when the Triangle Buffer Address and Texture Headers Address registers are set up correctly, writing a 1 to the lowest bit of the Status register will tell the GPU that new data is waiting to be rendered. The bit will remain 1 until the GPU accepts the new data and has copied the values of the registers. It is then possible to change the mentioned registers. The data pointed at will be used by the GPU for a while after that. The Framebuffer Address registers are read continuously by the GPU and any changes to them will take effect immediately. One of the framebuffers is displayed on the connected monitor, while the other one is written to by the GPU when a frame is rendered. Which of the framebuffers is used for what is unknown to the CPU, and they can be swapped at any time.
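The alignment requirement is easy to check in driver code; this helper is hypothetical and not part of the implemented software.

```python
def is_valid_buffer_address(addr):
    # Burst start addresses must be 8-byte aligned, i.e. the 3 least
    # significant bits must be 0.
    return (addr & 0x7) == 0
```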

Register 4.1: Status (0x00)

Bit(s)  Field           Reset
31–2    Reserved        unspecified
1       VBlank Req/Ack  0
0       Sync Swaps      0

Register 4.2: Triangle Buffer Address (0x04)

Bit(s)  Field    Reset
31–0    Address  0x00000000

Register 4.3: Texture Headers Address (0x08)

Bit(s)  Field    Reset
31–0    Address  0x00000000

Register 4.4: Framebuffer 0 Address (0x0c)

Bit(s)  Field    Reset
31–0    Address  0x00000000

Register 4.5: Framebuffer 1 Address (0x10)

Bit(s)  Field    Reset
31–0    Address  0x00000000

Chapter 5

Software Implementation

5.1 Software Component

The GPU expects triangle and texture data to be present in memory. It is the responsibility of the soft CPU to write this data to memory. In order to generate the data, one can of course run software on the soft CPU itself. Another solution — suitable for testing and development — is to stream the data via a serial port or Ethernet from an external computer. In order to minimize memory writes, blocks of triangles are written at a time. These blocks are organized in linked lists, one per tile. This means that the first block of each list has a fixed location in memory, while the following blocks can be written at the next free space once they become full. A more detailed description of the triangle storage format can be found in appendix A.1.

All triangles for a tile must be present once rendering of that tile begins, so all triangles of a frame must be written to memory before the GPU can start working on that frame. The base address of the triangle data is written to the corresponding register on the GPU, and a buffer swap is requested. This allows the CPU to write new triangle data while the GPU is working on the old data in another location in memory. If one wishes to sacrifice some performance for memory utilization gains, one can instead wait for the GPU to finish processing all the triangle data before rendering of the next frame begins, so that the old data is overwritten. This requires periodically reading the status register of the GPU, which indicates whether it has finished working on the current triangle data.

Before the start of each frame, the GPU fetches texture headers from memory. This allows textures to change in between, but not during, frames. For information on the texture storage format, see appendix A.2.
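The per-tile linked lists can be sketched as follows. The block capacity and fields here are invented for the sketch; they are not the format of appendix A.1.

```python
class TriangleBlock:
    # Triangles for one tile are stored in fixed-size blocks forming a
    # linked list; only the first block of each tile's list has a fixed
    # address in memory.
    CAPACITY = 16  # illustrative, not the real block size

    def __init__(self):
        self.triangles = []
        self.next = None

def append_triangle(head, triangle):
    # Walk to the last block of the tile's list; start a new block at
    # the next free space once the last one is full.
    block = head
    while block.next is not None:
        block = block.next
    if len(block.triangles) == TriangleBlock.CAPACITY:
        block.next = TriangleBlock()
        block = block.next
    block.triangles.append(triangle)
```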

5.2 Mesa Hooks

One way to generate the triangle data required by the GPU is to run an OpenGL application, storing information about the triangles instead of rendering them.

37 38 Software Implementation

Figure 5.1: The result of clipping the triangle in figure 3.2. Dotted lines show the cuts that were made

Using Mesa’s software rasterizer, all rendered primitives eventually arrive at some triangle rasterization function. Which one is used is chosen depending on the OpenGL state, i.e. whether depth testing, texture mapping, or blending is enabled, etc. By hooking this function, one can instead record data about the triangle. The last thing that a program does for a frame is to swap the frame buffers (assuming double buffering), so once the corresponding Mesa function is called, it is safe to assume that all triangles for the frame have been passed to Mesa. The function has been modified to flush the OpenGL pipeline, so as to ensure that all triangles for the frame have been recorded.

In addition to per-vertex data — coordinates and texture coordinates — each triangle entry contains texture ids, depth test function, depth test mask, blending functions, “clear count” — see section 4.2.12 — and the constant 1/c, as described in section 2.1.4. While much of this state is accessible in the hooked triangle rasterization function, the clear count needs to be tracked by intercepting calls to glClear, and the constant 1/c must be calculated. The latter is done in software because the CPU is assumed to have a division instruction, and implementing it in the GPU would cause hardware duplication.

At any time, the OpenGL application may decide to upload a texture to the GPU. By hooking the corresponding Mesa function, the texture data can be intercepted and recorded. Here, one can also do texture up- or downscaling to keep within the hardware limits on texture size. This way of generating triangle and texture data can of course be used on an external computer, but also on a soft CPU running some OpenGL application.

5.3 Triangle Clipping

Because the GPU uses tile-based rendering, the triangles are placed in buckets according to the tiles they belong to. If a triangle belongs to more than one tile, it is cut into smaller triangles, each of which belongs to one tile. This is done in software, which allows the rendering hardware to be simple, since it only needs to handle triangles.

In order to detect whether a triangle intersects a tile, the LET test [4] is used. Each tile in the bounding box of the triangle is checked in this way. When an intersecting tile is found, the part of the triangle that is inside the tile is computed, by use of the Sutherland-Hodgman algorithm [9]. The resulting polygon is convex, and is thus easily triangulated by connecting one of the vertices to all others. Finally, the triangles are transformed to tile-relative coordinates before being stored in a tile. The result of clipping the triangle shown in figure 3.2 is shown in figure 5.1, where dotted lines show the cuts.
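The clipping step can be sketched as follows, assuming floating-point vertices and axis-aligned tiles; the actual implementation works on the triangle parameters described in appendix A, but the control flow is the same: clip against the four tile edges in turn, then fan-triangulate the convex result.

```python
def clip_to_tile(poly, x0, y0, x1, y1):
    """Sutherland-Hodgman: clip a convex polygon to the axis-aligned tile
    [x0, x1] x [y0, y1]. poly is a list of (x, y) vertices."""
    def clip_edge(poly, inside, intersect):
        out = []
        for i, cur in enumerate(poly):
            prev = poly[i - 1]
            if inside(cur):
                if not inside(prev):
                    out.append(intersect(prev, cur))
                out.append(cur)
            elif inside(prev):
                out.append(intersect(prev, cur))
        return out

    def isect_x(bound):
        def f(p, q):
            t = (bound - p[0]) / (q[0] - p[0])
            return (bound, p[1] + t * (q[1] - p[1]))
        return f

    def isect_y(bound):
        def f(p, q):
            t = (bound - p[1]) / (q[1] - p[1])
            return (p[0] + t * (q[0] - p[0]), bound)
        return f

    # One clipping pass per tile boundary.
    for inside, isect in [
        (lambda p: p[0] >= x0, isect_x(x0)),
        (lambda p: p[0] <= x1, isect_x(x1)),
        (lambda p: p[1] >= y0, isect_y(y0)),
        (lambda p: p[1] <= y1, isect_y(y1)),
    ]:
        poly = clip_edge(poly, inside, isect)
        if not poly:
            return []
    return poly

def fan_triangulate(poly):
    """The clipped polygon is convex, so connect vertex 0 to all others."""
    return [(poly[0], poly[i], poly[i + 1]) for i in range(1, len(poly) - 1)]
```

For example, clipping the triangle (0,0)-(10,0)-(0,10) against the tile [0,4] x [0,4] yields a four-vertex polygon, which the fan step turns into two triangles.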

Chapter 6

Tools

6.1 Software Renderer

A software renderer has been developed to help evaluate the importance of render features by allowing rapid feature implementation and testing with target applications. The software renderer only has access to the same data as the hardware implementation, and it is thus guaranteed that the results produced by the software renderer can also be achieved in the hardware implementation. When the same problem can be solved in several ways, the software renderer has been of great help as a framework for quickly integrating implementations in a working renderer. In particular, it was helpful when selecting which triangle rasterization approach was best suited for hardware implementation.

As described in section 5.2, Mesa has been hooked to allow for accumulation of triangles and state during a frame. Instead of sending this data to the GPU, Mesa can call the software renderer to create an image to be displayed on the screen. This is done through further modification of Mesa's function for swapping buffers. After the image has been created, the other Mesa hooks are temporarily disabled, the rendered image is displayed using OpenGL functions, and the Mesa hooks are enabled again.

Concrete examples of what the software renderer was used for include determining that mipmapping is essential for image quality and that vertex color support is (perhaps somewhat surprisingly) not strictly necessary in some applications (e.g. Quake 3).

6.2 Scheduler

There are many calculations to be done for each triangle: the bounding box, and initial and stepping values for interpolation of depth, texture coordinates and half-space functions. These calculations involve many multiplications, 46 in total, assuming two texture units. By doing the multiplications in series, the number of required multipliers can be greatly reduced, at the cost of calculations taking more cycles. Furthermore, to reach acceptable clock frequencies, the multipliers must be pipelined, so that each individual multiplication spans several cycles. Manually scheduling and keeping track of all these calculations is both impractical and error prone. Automatically generating such hardware from a high-level description is known as High-Level Synthesis (HLS), and several HLS tools are available. After attempting to use Mentor Graphics' Catapult C tool, and not managing to produce the required hardware, we decided to develop our own, custom tool for the purpose of HLS.

The scheduler is a tool which takes as input a description of the calculations to be performed and a list of available functional units, and outputs VHDL code that does the job. The calculation description is given in a simple, Lisp-like language, which supports constants, temporary variables and function definitions. Example 6.1 shows some code turning x and y into magic, where magic = 3.5·x·y = (7·x/2)·y. The resulting schematic, after synthesis with Precision, can be seen in figure 6.1. In addition to the hardware shown in the schematic, a state register, some control logic and a strobe interface are generated.

In order to simplify the scheduler, several assumptions are made about the input, output and intermediary values. Inputs and outputs are assumed to be 32 bit integers. The multipliers in a Virtex 4 FPGA can handle 18 bit signed operands, or one 35 bit and one 18 bit signed operand using two multipliers. For backwards compatibility with Virtex 4, such multipliers are instantiated by the scheduler. The result of such a multiplication is 53 bits, so for simplicity all intermediary values are 53 bit signed. Once the generated hardware is connected to the rest of the system, so that actual bit widths of the operands are known, the synthesis tool can limit the widths of signals, registers and functional units. See section 4.2.10 for a description of a generated module.

Example 6.1: Example of Lisp-like code

(in x)
(in y)
(def three-and-a-half-x (>> (+ (* x "7") "1") "1"))
(def magic (* three-and-a-half-x y))
(out magic)

[Schematic of the generated datapath: x is multiplied by the constant 7, 1 is added, the sum is shifted right by 1, and the result is multiplied by y to produce magic]

Figure 6.1: Example of generated hardware
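To make the semantics of the language concrete, the following is a minimal interpreter for it, written in Python. It is a sketch for checking what a description computes, not the scheduler itself (which schedules the operations onto pipelined functional units and emits VHDL), and it supports only the operators used in example 6.1.

```python
def tokenize(src):
    return src.replace('(', ' ( ').replace(')', ' ) ').split()

def parse(tokens):
    """Parse one s-expression from the front of the token list."""
    tok = tokens.pop(0)
    if tok == '(':
        lst = []
        while tokens[0] != ')':
            lst.append(parse(tokens))
        tokens.pop(0)
        return lst
    return tok

def parse_program(src):
    tokens = tokenize(src)
    forms = []
    while tokens:
        forms.append(parse(tokens))
    return forms

def evaluate(program, inputs):
    """Evaluate a calculation description: in/def/out forms, with the
    operators +, * and >> (shift right). Quoted atoms are constants."""
    env = {}
    outputs = {}

    def ev(expr):
        if isinstance(expr, list):
            op, *args = expr
            vals = [ev(a) for a in args]
            if op == '+':
                return vals[0] + vals[1]
            if op == '*':
                return vals[0] * vals[1]
            if op == '>>':
                return vals[0] >> vals[1]
            raise ValueError('unknown operator ' + op)
        if expr.startswith('"'):        # quoted constant, e.g. "7"
            return int(expr.strip('"'))
        return env[expr]                # variable reference

    for form in parse_program(program):
        head = form[0]
        if head == 'in':
            env[form[1]] = inputs[form[1]]
        elif head == 'def':
            env[form[1]] = ev(form[2])
        elif head == 'out':
            outputs[form[1]] = env[form[1]]
    return outputs
```

Running example 6.1 through this interpreter with x = 4 and y = 3 gives magic = ((4 * 7 + 1) >> 1) * 3 = 42, which agrees with 3.5 * 4 * 3.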

Chapter 7

Evaluation by Simulation

7.1 Evaluation Platform

For evaluation of the accelerator, a system with a MicroBlaze soft processor was built, as described in section 4.4. A development platform from Digilent called ML505, housing a Virtex 5 XC5VLX110T speed grade 1 FPGA, was used. The memory connected to this FPGA is a 256 MiB DDR2 memory with a theoretical maximum bandwidth of roughly 3 GiB/s. The system clock frequency was set to 125 MHz.

7.2 Data Acquisition

Hardware counters were added in the appropriate places in order to count the time taken for rendering one frame, the number of fragments in the fragment queue, the amount of data read and written from and to memory, and the number of cache hits and misses. The system was then simulated using ModelSim, with triangle and texture data acquired from running the desired application. The simulated memory had a latency of 100 cycles for initiating read and write bursts, to mimic the access latency of off-chip memory. The simulation was repeated for different cache sizes, and the results tabulated.

A 100 cycle latency for starting a memory access is pessimistic but still somewhat realistic; a test frame was rendered using a build of the accelerator on the target platform. The render took approximately 2500000 cycles and approximately 27500 bursts were initiated by the accelerator, resulting in an upper bound of 91 cycles per memory access burst.

For counting triangles before and after clipping, as well as the sizes of the bounding boxes of clipped triangles, the software rasterizer code was modified.

7.3 Evaluation of Tailored Accelerators

The triangle setup time, the time taken for calculation of parameters for attribute interpolation, should ideally be smaller than the time it takes to generate the number of fragments of a typical triangle, in order to keep the fragmenter busy. With the half-space function approach employed in the implementation, all fragments in the bounding box of a triangle are examined. Therefore, the number of triangles with a given number of fragments in their bounding box is an interesting measure. Figure 7.2 shows graphs of this measure for two applications. Looking at this data, one might draw the conclusion that a setup time larger than the time it takes to generate about 27 fragments would be detrimental to performance. However, as will be shown shortly, this is not the bottleneck, and additional hardware costs need not be imposed to increase the performance of the triangle setup.

The graphics accelerator was tailored for two different applications with different requirements, Quake 3 and Teeworlds, and evaluated. Ideally, many frames should have been evaluated to produce more general results. However, due to a lack of time, only one representative frame was picked from each application and used for evaluation. Common to both applications was that the test frames were 320 × 240 pixels. For hardware resource utilization as well as the fastest possible clock frequency (according to the Precision synthesis tool), see table 7.1.

Table 7.1: Hardware utilization and max frequency of tailored configurations

Application   LUTs   CLB slices   Dffs    18 Kib BRAMs   DSP48Es   Max freq.
Quake 3       8290   3096         12384   19             7         164 MHz
Teeworlds     6896   2508         10029   16             6         155 MHz

It is worth noting that these tests were run on a GPU completely insulated from the negative effects on memory bandwidth that a concurrently running CPU would have. Furthermore, the Frames Per Second (FPS) measurement should not be used to predict the FPS the game would have running on a MicroBlaze. To attain the stated FPS, the CPU would have to crunch through all game logic calculations and generate all triangles for a frame faster than the GPU is able to render all triangles from the last frame, something that might prove difficult to achieve given the MicroBlaze's limited performance.
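The half-space approach referred to above can be sketched as follows; this is a simplified software model (integer vertices, no fill rules or incremental edge-function stepping), but it shows why every fragment position in the bounding box costs fragmenter work even when the triangle covers only part of the box.

```python
def rasterize_halfspace(v0, v1, v2):
    """Generate fragments of a triangle by testing every pixel in its
    bounding box against the three half-space (edge) functions."""
    def edge(a, b, p):
        # Positive when p lies to the left of the directed edge a->b.
        return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])

    # Ensure counter-clockwise winding so all edge functions agree in sign.
    if edge(v0, v1, v2) < 0:
        v1, v2 = v2, v1

    xs = [v[0] for v in (v0, v1, v2)]
    ys = [v[1] for v in (v0, v1, v2)]
    fragments = []
    examined = 0
    for y in range(min(ys), max(ys) + 1):
        for x in range(min(xs), max(xs) + 1):
            examined += 1          # every bounding-box position is visited
            p = (x, y)
            if (edge(v0, v1, p) >= 0 and edge(v1, v2, p) >= 0
                    and edge(v2, v0, p) >= 0):
                fragments.append(p)
    return fragments, examined
```

For a right triangle with legs of length 4, 25 candidate positions are examined but only 15 fragments are produced.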

7.3.1 Quake 3

Quake 3 is a game of the 3D shooter genre. It was picked as an example of an application with complex 3D scenes, where the focus is more on complex geometry than on detailed textures. This game requires features such as mipmapping as well as multitexturing. The scene seen in figure 7.1a was used for measurements. It consists of 3719 triangles before clipping and 5869 triangles after clipping. There are about 400 loaded textures, requiring about 10 mebibytes of memory.

Tables 7.2(a)-(f) show performance measurements for different configurations of the cache sizes. The unit used for cache size is blocks of 4 × 4 texels. It is obvious from the tables that the size of texture cache 0 is more important for performance than the size of texture cache 1. This is because texture unit 1 is used for light maps, which are few, and of which small parts are used, which leads to many cache hits regardless of the cache size. Configuring the texture caches to hold 32 blocks each results in good performance while still using the minimum amount of block RAMs (2 per cache).

The number of fragments in the fragment queue during the rendering of the Quake 3 frame, with cache 0 and cache 1 both sized 32, is shown in figure 7.3a. It is seen that the fragment queue is almost never empty. This means that the colorizer virtually always has work to do. On the other hand, it is not always full either, so it plays its intended role as a buffer between the fragmenter and the colorizer. Because the fragment queue is seldom empty, it can be concluded that the fragmenter is not the bottleneck when running Quake 3, as stated earlier. The fragment queue could probably be reduced in size when running Quake 3, but removing it completely would affect performance negatively.

7.3.2 Teeworlds

Teeworlds is a game that uses 3D acceleration even though it is a 2D game. It draws large triangles with detailed textures and was picked as an example of an application where texture lookup performance is more important than fast triangle throughput. Teeworlds does not require mipmapping or multitexturing, and the graphics accelerator has been configured without these features when running this game. The scene shown in figure 7.1b was used for evaluation. It consists of 2330 triangles before clipping and 5835 triangles after clipping. There are 18 loaded textures, requiring about 3 mebibytes of memory.

Table 7.3 displays performance measurements for different configurations of the cache size. The unit used for cache size is blocks of 4 × 4 texels. As expected (because Teeworlds is much more texture heavy than Quake 3), a larger texture cache is required. Great performance increases are observed with larger texture caches. A texture cache size of 64 is good for performance while still not using too many block RAMs; if a texture cache size of 64 is picked, 3 block RAMs will be used.

The number of fragments in the fragment queue during the rendering of the Teeworlds frame, with a cache size of 64, is shown in figure 7.3b. It is seen that the fragment queue is virtually always completely full. This means that fragments are generated at a faster pace than they can be consumed by the colorizer, and that the fragmenter is stalled for a large number of cycles. Thus, the bottleneck when running this application is clearly not the fragmenter, but rather the modules responsible for coloring fragments. The fragment queue is seemingly a lot less important when running Teeworlds than it is when running Quake 3. It could be reduced in size or perhaps even removed completely.

(a) Quake 3

(b) Teeworlds

Figure 7.1: Screenshots of the games used for evaluation

[Histograms of the number of triangles versus the number of fragments in the bounding box, for (a) Quake 3 and (b) Teeworlds]

Figure 7.2: Triangle count versus bounding box size after clipping

[Graphs of the number of fragments in the fragment queue over time: (a) Quake 3, (b) Teeworlds]

Figure 7.3: Utilization of fragment queue

Table 7.2: Effects of texture cache size on rendering of one frame of Quake 3

(a) Render time (ms) vs. cache size

                              Texture Cache 0 Size
                          2      4      8      16     32     64
Texture Cache 1 Size  2   40.0   32.2   30.3   26.8   26.3   25.8
                      8   37.0   29.3   27.4   23.9   23.5   23.0
                      32  36.2   28.5   26.7   23.1   22.7   22.2

(b) FPS vs. cache size

                              Texture Cache 0 Size
                          2      4      8      16     32     64
Texture Cache 1 Size  2   25.0   31.1   33.0   37.4   38.0   38.7
                      8   27.0   34.2   36.5   41.9   42.6   43.6
                      32  27.6   35.1   37.5   43.2   44.0   45.0

(c) Bus usage (MiB) vs. cache size

                              Texture Cache 0 Size
                          2      4      8      16     32     64
Texture Cache 1 Size  2   2.28   1.79   1.67   1.45   1.42   1.39
                      8   2.09   1.60   1.49   1.26   1.24   1.20
                      32  2.04   1.55   1.44   1.21   1.19   1.15

(d) Bus usage (MiB/s) vs. cache size

                              Texture Cache 0 Size
                          2      4      8      16     32     64
Texture Cache 1 Size  2   56.9   55.6   55.1   54.1   54.0   53.7
                      8   56.5   54.8   54.2   52.9   52.7   52.4
                      32  56.3   54.5   53.9   52.3   52.1   51.8

(e) Cache hit rate vs. cache size (Texture Unit 0)

Cache Size   Hit Rate (%)
2            82.7
4            89.6
8            91.3
16           94.4
32           94.8
64           95.3

(f) Cache hit rate vs. cache size (Texture Unit 1)

Cache Size   Hit Rate (%)
2            89.2
8            95.1
32           96.8

Table 7.3: Effects of texture cache size on rendering of one frame of Teeworlds

Cache Size   Render Time (ms)   FPS    Bus Usage (MiB)   Bus Usage (MiB/s)   Hit Rate (%)
2            87.3               11.5   4.35              49.8                62.4
4            75.9               13.2   3.63              47.9                69.7
8            72.4               13.8   3.41              47.1                71.9
16           51.9               19.3   2.12              40.9                85.0
32           50.2               19.9   2.02              40.2                86.1
64           43.8               22.8   1.61              36.8                90.3
128          42.8               23.4   1.55              36.2                90.9
256          41.7               24.0   1.48              35.5                91.6

Chapter 8

Evaluation using FPGA

Eventually, a variant of Linux called MicroBlaze Linux, maintained by Xilinx, was built for the platform described in section 7.1. Quake 3 was modified to use a custom version of OSMesa [1], and cross-compiled for MicroBlaze. Finally, the system could be tested on the FPGA using a real-time application. Teeworlds was not tested, since making it run on the platform would require extensive modifications of the Teeworlds code, which would take too much time.

8.1 Evaluation of Tailored Accelerator

A test of the accelerator tailored for Quake 3 was performed by running Quake 3 on the MicroBlaze CPU, both with and without the accelerator. When measuring the performance of the accelerator, the time spent on running the application, triangle clipping, and rasterization was measured separately. The results of the measurements using the accelerator are shown in table 8.1. As can be seen, running Quake 3 on this platform is not viable. However, the total time taken for software rendering of a frame is 36 seconds, so the accelerator improves performance considerably. It is also evident that the main bottleneck in the accelerated system is the triangle clipping.

The triangle clipping can be sped up by optimizing the software, likely by finding a better algorithm, but building an accelerator for the task is probably more straightforward and would yield a larger performance increase. An alternative to making the clipping faster is to simply not do it at all. This would cause some increase in area because of larger values of the triangle parameters. In the current software implementation, triangle parameters are interpolated using perspective correct interpolation when clipping, so that the hardware can do linear interpolation with small image quality impact. If the clipping is not done in software, one can choose to either implement full perspective correct interpolation in hardware, requiring a division per pixel, or to do linear interpolation only. Further investigating this issue is left as future work.


Table 8.1: Quake 3 render time measurements, using the tailored accelerator

Render step                  Time
Application and projection   750 ms
Clipping                     1310 ms
Rasterization                20 ms
Total                        2080 ms

Chapter 9

Conclusions

The project went well, and most of what we set out to do has been accomplished. The accelerator supports most OpenGL features used by Quake 3, and is capable of rendering frames at more than 30 FPS, which makes it suitable for real-time execution of applications with similar complexity. The accelerator has been tested on triangle and texture data from single frames on the ML505 board with good results. It would have been interesting to try running an application like Quake 3 on the soft CPU, or streaming data from an external computer, but due to lack of time, we have not been able to do so.


Chapter 10

Future Work

10.1 Faster clipping

The clipping has been identified as the main bottleneck. To improve performance, one can either accelerate the clipping by implementing it in hardware, or remove clipping entirely, modifying the fragmenter accordingly. Both approaches trade area for performance, but the latter is probably easier to implement, while also simplifying the system. Currently, perspective correct interpolation of triangle vertex parameters is done while clipping. If perspective correct interpolation is desired, this calculation will have to be moved to hardware, requiring a per-pixel division.

10.2 Texture prefetching

When a texture cache miss occurs, the texturing of fragments is stalled until the texture data has been fetched from memory. In order to improve texture lookup performance, the appropriate texture data can be prefetched when a fragment enters the fragment queue, so that it is already present in the cache when the lookup is performed. This will require some additional on-chip memory for the texel block queue. This approach is described in [6].
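The idea can be illustrated with a toy model, assuming an LRU cache of texel blocks; the names and structure here are invented for illustration, and a real implementation would also need a texel block queue and tracking of in-flight memory requests as described in [6].

```python
from collections import deque, OrderedDict

class TexelCache:
    """Tiny LRU cache of 4x4 texel blocks, keyed by block address."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()
        self.misses = 0

    def fetch(self, addr):
        if addr in self.blocks:
            self.blocks.move_to_end(addr)        # refresh LRU position
        else:
            self.misses += 1                     # would start a memory burst
            if len(self.blocks) >= self.capacity:
                self.blocks.popitem(last=False)  # evict least recently used
            self.blocks[addr] = True

def run(fragment_blocks, cache, prefetch):
    """Push fragments through a queue. With prefetching, the texel block is
    requested when the fragment is enqueued, not when it is textured."""
    queue = deque()
    for addr in fragment_blocks:
        if prefetch:
            cache.fetch(addr)   # fetch overlaps with time spent in the queue
        queue.append(addr)
    stalls = 0
    while queue:
        addr = queue.popleft()
        if addr not in cache.blocks:   # lookup at texturing time
            stalls += 1                # colorizer waits for the memory fetch
        cache.fetch(addr)
    return stalls
```

In the model, misses still occur with prefetching, but they overlap with the time the fragment spends in the queue instead of stalling the colorizer.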

10.3 Improved Scheduler

The scheduler could be improved by adding type awareness and signal bit width optimizations. This would yield better results and also allow the scheduler to be used for a broader set of tasks.


10.4 Generated Fragment Shader

In modern GPUs, the fragment pipeline can be configured by the use of fragment shader programs, which control how the resulting fragment color is calculated. By generating the colorizer from code written in the Lisp-like language, the pipeline can be configured, though at build time only, in a manner similar to what fragment shader programs allow. This would require extensions to the language, to allow communication with other modules, such as a texture unit.

10.5 Automatic Configuration Generation

An ultimate goal of a continuation of this work would be to automatically generate a hardware configuration for a given target application and given resource constraints. This would require models for hardware usage and performance for all relevant modules, as well as automatic analysis of application feature and performance requirements. A tool somewhat similar to this is described in [7].

10.6 Central OpenGL State Storage

Currently, all relevant OpenGL state is stored in each triangle or fragment. Instead, state snapshots can be stored separately in memory and referenced by index or pointers. This would decrease the required data transfers as well as the size of on-chip memories.
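A sketch of the idea: state snapshots are interned, so every triangle carries only a small index. The names here are illustrative.

```python
class StateStore:
    """Interns OpenGL state snapshots: identical snapshots are stored once
    and referenced by a small index instead of being copied per triangle."""
    def __init__(self):
        self.snapshots = []
        self.index_of = {}

    def intern(self, snapshot):
        # snapshot is a hashable tuple of state, e.g.
        # (tex0, tex1, depth_fn, depth_mask, src_blend, dst_blend)
        if snapshot not in self.index_of:
            self.index_of[snapshot] = len(self.snapshots)
            self.snapshots.append(snapshot)
        return self.index_of[snapshot]
```

Since consecutive triangles in a frame usually share state, the number of distinct snapshots stays far below the number of triangles.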

10.7 Parallelization

Large performance gains can be made by duplicating parts of, or the whole of, the tile renderer. For the applications examined in this thesis, texture lookup is the main bottleneck, but duplication of the texturer does not solve this problem. Instead, one could gain performance by duplicating the tile renderer, letting two or more tiles be processed at once. For applications where fragment generation speed is limiting performance, the fragmenter might be duplicated, but care would need to be taken to handle the resulting fragments in the right order.

Bibliography

[1] OSMesa: Off-screen Rendering. http://www.mesa3d.org/osmesa.html.

[2] Michael Abrash. How do you fit polygons together? In Graphics Programming Black Book, pages 712-713. 2001.

[3] I. Antochi. Suitability of Tile-Based Rendering for Low-Power 3D Graphics Accelerators. PhD thesis, Delft University of Technology, 2007.

[4] I. Antochi, B. Juurlink, S. Vassiliadis, and P. Liuha. Scene Management Models and Overlap Tests for Tile-Based Rendering. In Proceedings of the EUROMICRO Symposium on Digital System Design (DSD'04), 2004.

[5] Kurt Fleischer. Polygon Scan Conversion Derivations. Technical report, California Institute of Technology, Pasadena, CA, USA, 1995.

[6] Homan Igehy, Matthew Eldridge, and Kekoa Proudfoot. Prefetching in a Texture Cache Architecture. In Proceedings of the 1998 Eurographics/SIGGRAPH Workshop on Graphics Hardware, 1998.

[7] B. Juurlink, I. Antochi, D. Crisu, S. Cotofana, and S. Vassiliadis. GRAAL: A Framework for Low-Power 3D Graphics Accelerators. IEEE Computer Graphics and Applications, 28(4):63-73, July-Aug. 2008.

[8] Juan Pineda. A Parallel Algorithm for Polygon Rasterization. In SIGGRAPH '88: Proceedings of the 15th Annual Conference on Computer Graphics and Interactive Techniques, 1988.

[9] Ivan Sutherland and Gary Hodgman. Reentrant Polygon Clipping. Commu- nications of the ACM, 17(1):32–42, 1974.


Appendix A

Data Formats

This appendix describes the data formats of inputs and outputs of the GPU. Tables are used to represent packed binary data. Multibyte values are always written in big-endian order.

A.1 Triangle Data Format

Triangles are stored in linked lists, one list per tile, with 7 triangles per list node. Table A.1 shows the format of a triangle. The notation s:m:f is used for fractional numbers. s indicates whether there is a sign bit or not, m is the number of magnitude bits, and f is the number of fractional bits. The list nodes have the format described in table A.2. There is an array containing the first node of each list, and the address of this array is what is passed to the GPU. The nodes are aligned to 64-bit boundaries.
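The s:m:f encoding can be illustrated with a small Python sketch (the rounding and saturation behavior here are assumptions; the thesis does not specify them):

```python
def to_fixed(value, sign, magnitude, frac):
    """Encode a number in the s:m:f fixed-point format of table A.1:
    an optional sign bit, `magnitude` integer bits, `frac` fractional bits.
    Returns the raw bit pattern (two's complement for signed formats)."""
    raw = round(value * (1 << frac))
    total = sign + magnitude + frac
    if sign:
        lo, hi = -(1 << (total - 1)), (1 << (total - 1)) - 1
    else:
        lo, hi = 0, (1 << total) - 1
    raw = max(lo, min(hi, raw))        # saturate to the representable range
    return raw & ((1 << total) - 1)    # mask to the field width

def from_fixed(bits, sign, magnitude, frac):
    """Decode an s:m:f bit pattern back to a number."""
    total = sign + magnitude + frac
    if sign and bits & (1 << (total - 1)):
        bits -= 1 << total             # undo two's complement
    return bits / (1 << frac)
```

For instance, an x coordinate in the 0:6:4 format stores 12.25 as the 10-bit pattern 196, and an s0 texture coordinate in the 1:5:8 format stores -1.5 as a 14-bit two's-complement pattern.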

A.2 Texture Format

Textures are stored in a 32 bpp format, 8 bits for each color and alpha. Textures are required to have power-of-two sizes in both dimensions. Each texture has a header, the format of which is described in table A.3. The headers are located in an array, the address of which is passed to the GPU, and point to the start of their respective texture data. If a texture is mipmapped, the header will indicate so, and data for up to 6 mipmap levels can follow immediately after the data of level 0. Each mipmap is expected to be half as large in both dimensions as the one on the level below it.
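The resulting memory layout can be sketched as follows; the helper names are invented here:

```python
def mipmap_levels(width_exp, height_exp, max_levels=7):
    """Sizes of a mipmap chain: level 0 is 2^width_exp x 2^height_exp texels,
    and each further level is half as large in both dimensions (level 0 plus
    up to 6 mipmap levels)."""
    w, h = 1 << width_exp, 1 << height_exp
    levels = []
    while len(levels) < max_levels and w >= 1 and h >= 1:
        levels.append((w, h))
        w //= 2
        h //= 2
    return levels

def texture_bytes(width_exp, height_exp, mipmapped):
    """Total texture size in bytes at 32 bpp, with or without the chain."""
    n = 7 if mipmapped else 1
    return sum(4 * w * h for w, h in mipmap_levels(width_exp, height_exp, n))
```

For example, a mipmapped 4 × 4 texture occupies 4 x (16 + 4 + 1) = 84 bytes for the levels 4 × 4, 2 × 2 and 1 × 1.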

A.3 Framebuffer Format

Each framebuffer is 320 × 240 pixels large. Each pixel has full 32-bit RGBA color, which means there are 8 bits per channel, and the channels are stored in the order RGBA. The video out module does not take alpha into account, but reading and writing 32-bit values is easier than packing 24-bit pixels.

Table A.1: Triangle format

(a) Blending functions

Value   OpenGL Constant
0       GL_ZERO
1       GL_ONE
2       GL_SRC_COLOR
3       GL_ONE_MINUS_SRC_COLOR
4       GL_DST_COLOR
5       GL_ONE_MINUS_DST_COLOR
6       GL_SRC_ALPHA
7       GL_ONE_MINUS_SRC_ALPHA
8       GL_DST_ALPHA
9       GL_ONE_MINUS_DST_ALPHA
10      GL_SRC_ALPHA_SATURATE

(b) Depth functions

Value   OpenGL Constant
0       GL_NEVER
1       GL_LESS
2       GL_EQUAL
3       GL_LEQUAL
4       GL_GREATER
5       GL_NOTEQUAL
6       GL_GEQUAL
7       GL_ALWAYS

(c) Vertex data format

Field      Type       Size (bits)
x          0:6:4      10
y          0:5:4      9
z over w   unsigned   24
s0         1:5:8      14
t0         1:5:8      14
s1         1:5:8      14
t1         1:5:8      14

(d) Triangle data format

Field                        Type        Size (bits)
vertices                     vertex[3]   297
1 over c                     1:9:8       18
texture id 0                 unsigned    11
texture id 1                 unsigned    11
clear count                  unsigned    4
depth mask                   boolean     1
depth function               unsigned    3
source blend function        unsigned    4
destination blend function   unsigned    4
padding                      —           31

Table A.2: List node format

(a) List node header format

Field               Type       Size (bits)
padding             —          29
triangles in node   unsigned   3
next node address   unsigned   32

(b) List node data format

Field       Type          Size (bits)
header      header        64
triangles   triangle[7]   2688

Table A.3: Texture header format

Field             Type       Size (bits)
width exponent    unsigned   4
height exponent   unsigned   4
mipmapped         boolean    1
padding           —          23
start address     unsigned   32

Upphovsrätt

Detta dokument hålls tillgängligt på Internet — eller dess framtida ersättare — under 25 år från publiceringsdatum under förutsättning att inga extraordinära omständigheter uppstår.

Tillgång till dokumentet innebär tillstånd för var och en att läsa, ladda ner, skriva ut enstaka kopior för enskilt bruk och att använda det oförändrat för ickekommersiell forskning och för undervisning. Överföring av upphovsrätten vid en senare tidpunkt kan inte upphäva detta tillstånd. All annan användning av dokumentet kräver upphovsmannens medgivande. För att garantera äktheten, säkerheten och tillgängligheten finns det lösningar av teknisk och administrativ art.

Upphovsmannens ideella rätt innefattar rätt att bli nämnd som upphovsman i den omfattning som god sed kräver vid användning av dokumentet på ovan beskrivna sätt samt skydd mot att dokumentet ändras eller presenteras i sådan form eller i sådant sammanhang som är kränkande för upphovsmannens litterära eller konstnärliga anseende eller egenart.

För ytterligare information om Linköping University Electronic Press se förlagets hemsida http://www.ep.liu.se/

Copyright

The publishers will keep this document online on the Internet — or its possible replacement — for a period of 25 years from the date of publication barring exceptional circumstances.

The online availability of the document implies a permanent permission for anyone to read, to download, to print out single copies for his/her own use and to use it unchanged for any non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional on the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/

© Jakob Fries, Simon Johansson