Master of Science Thesis Lund, spring 2010

Load balancing in a tiling rendering pipeline for a many-core CPU

Rasmus Barringer* Engineering Physics, Lund University

Supervisor: Tomas Akenine-Möller†, Lund University/Intel Corporation

Examiner: Michael Doggett‡, Lund University

Abstract

A tiling rendering architecture subdivides an image into smaller parts to be rendered separately. This approach extracts parallelism since different tiles can be processed independently. It also allows for efficient cache utilization for localized data, such as a tile's portion of the frame buffer. These are all important properties to allow efficient execution on a highly parallel many-core CPU. This master thesis evaluates the traditional two-stage pipeline, consisting of a front-end and a back-end, and discusses several drawbacks of this approach. In an attempt to remedy these drawbacks, two new schemes are introduced: one that is based on conservative screen-space bounds of geometry and another that splits expensive tiles into smaller sub-tiles.

* [email protected]
† [email protected]
‡ [email protected]

Acknowledgements

I would like to thank my supervisor, Tomas Akenine-Möller, for the opportunity to work on this project as well as for giving me valuable guidance and supervision along the way. Further thanks go to the people at Intel for a lot of help and valuable discussions: thanks Jacob, Jon, Petrik and Robert! I would also like to thank my examiner, Michael Doggett.

Table of Contents

1 Introduction, aim and scope
2 Background
2.1 Overview of the rasterization pipeline
2.2 GPUs, CPUs and many-core architectures
2.3 Tiled rendering
3 Implementation and testing environment
3.1 Scheduling and dependency analysis
4 Case studies
4.1 F.E.A.R.
4.2 Unreal Tournament 3
4.3 Simple scene
4.4 Conclusions
5 The pre-front-end
5.1 Screen space bounds
5.2 Dependency analysis and the pre-front-end
5.3 Preserving submission order
5.4 Results and discussion
6 Tile splitting
6.1 Per-tile cost estimation
6.2 Front-end counters
6.3 Split heuristic
6.4 Dispatch heuristic
6.5 Special rasterizer
6.6 Results and discussion
7 Conclusion and future work
8 References

1 Introduction, aim and scope

The goal of this project was initially to realize an idea that was conceived at Intel concerning load balancing of tiling rendering pipelines. First, the conventional two-stage pipeline was analyzed and then a pre-front-end was added, in an attempt to remedy some of the issues discovered. The purpose was to improve performance. As the project evolved, another idea was conceived that concerns cost estimation and splitting of tiles.

This report is organized into the following sections: background, implementation and testing environment, case studies, pre-front-end, tile splitting, and conclusion and future work. The background section contains information on rasterization pipelines as well as the graphics processing unit (GPU), the central processing unit (CPU), and the many-core CPU. The implementation and testing environment section describes the system used for implementing our new algorithms and for evaluating performance. The case studies section describes a number of scenes that are used to highlight some of the problems in the existing pipeline. The following two sections explain our two novel schemes aiming to improve the overall performance of a tiling rendering pipeline. The conclusion and future work section discusses some lessons learned as well as things that may be interesting to research more in-depth in the future.

2 Background

2.1 Overview of the rasterization pipeline

In a rasterization based three-dimensional graphics pipeline, geometry is mapped to screen-space using some transformation from three-dimensional to two-dimensional space. This two-dimensional geometry can then be rasterized in the form of picture elements on the screen, i.e., pixels in the frame buffer. In the most basic case, the geometry consists of triangles and the transformation consists of a series of matrices: a world matrix, a view matrix and a projection matrix. A triangle consists of three vertices that contain their positions and optionally other attributes such as color values.

The world matrix, $\mathbf{W}$, represents the transformation from object-space to world-space; if a group of triangles is moving around it makes sense to use a transformation matrix for its movement rather than moving the individual coordinates of each triangle. The view matrix, $\mathbf{V}$, corresponds to the viewer's position and rotation in the scene. The projection matrix, $\mathbf{P}$, projects the vertices on the two-dimensional screen. The combined matrix $\mathbf{M}$ can now be formed:

$$\mathbf{M} = \mathbf{P}\mathbf{V}\mathbf{W}$$

Transforming a point $\mathbf{p}$ is now reduced to a simple matrix-vector multiplication:

$$\hat{\mathbf{p}} = \mathbf{M}\mathbf{p}$$

where $\hat{\mathbf{p}}$ represents the screen-space position in homogeneous coordinates (before division by $w$). Once the triangles are projected on the screen, half-plane equations¹ can be formulated for the edges of each triangle [8], called edge equations. Given a triangle with the screen-space vertex positions $\mathbf{p}_0 = (x_0, y_0)$, $\mathbf{p}_1 = (x_1, y_1)$ and $\mathbf{p}_2 = (x_2, y_2)$, the edge equations take the form (cf. cross product):

$$e_0(x, y) = (x_1 - x_0)(y - y_0) - (y_1 - y_0)(x - x_0)$$
$$e_1(x, y) = (x_2 - x_1)(y - y_1) - (y_2 - y_1)(x - x_1)$$
$$e_2(x, y) = (x_0 - x_2)(y - y_2) - (y_0 - y_2)(x - x_2)$$

Given that $\mathbf{p}_0$, $\mathbf{p}_1$ and $\mathbf{p}_2$ are defined in a counterclockwise manner, these equations are all positive when a point is inside the triangle (see Figure 1 below). This property can be used to determine whether pixels (or regions thereof) are inside a triangle. One possibility is to use the recursive descent algorithm described by Greene [4].

¹ The term half-plane equation comes from the fact that each equation divides the plane in two parts; one where it is positive and one where it is negative.

Figure 1. A triangle and its edge functions $e_0(x, y)$, $e_1(x, y)$ and $e_2(x, y)$. A point $(x, y)$ is inside the triangle if all edge functions are positive.

Pixels that are inside should then be colored according to the appearance of the triangle. This appearance can be described using various color values, lighting calculations and texture lookups. At the dawn of graphics processing units (GPUs), which are used to accelerate real-time graphics, the graphics pipeline was limited to a fixed function path much like the one described. Transformations were limited to simple affine transforms described by matrices. The coloring consisted of simple per-vertex local illumination models coupled with texture lookups. In more recent times, both the vertex processing and the pixel processing have become programmable. A GPU now executes specialized programs, called vertex shaders, for each vertex, determining its screen-space position as well as other attributes needed for per-pixel processing. For each pixel inside a triangle another program, a pixel shader, is executed that, given the interpolated per-vertex attributes, determines the final color of the pixel. The trend is clearly moving towards a more programmable pipeline as more recent hardware includes a geometry shader capable of generating triangles on the fly. Next generation GPUs include programmable tessellation support in the form of hull and domain shaders. Pixels may also be rasterized to off-screen surfaces that are later used for texturing. The general term that includes both frame buffers and such off-screen surfaces is render target.

2.2 GPUs, CPUs and many-core architectures

Modern GPUs include some fixed function logic, but mostly they consist of floating-point stream processor units (SPUs). These SPUs are specifically designed to run highly parallel shader programs. As GPU programmability increases, they are able to facilitate an increasing number of applications traditionally reserved for CPUs. Many parallel compute intensive applications have been able to benefit from the parallel processing capabilities of GPUs. This has coined the term general purpose computing on GPU (GPGPU). Still, there are limitations to the memory model of GPUs that render them unable to achieve good performance in certain scenarios, e.g. when using irregular data structures. Further, limitations in the programming model make specialized hardware necessary for some parts of the pipeline. On workloads where this hardware has low utilization, this on-die real estate could have been better spent on e.g. more SPUs. If the hardware, on the other hand, has very high utilization, it might present a bottleneck.

Ordinary CPUs are optimized for single threaded code and include logic for out-of-order execution that exploits parallelism of a single thread. This is accomplished by identifying independent instruction streams that can be executed in parallel. For already parallel workloads this logic provides little benefit and requires a large amount of on-die real estate. Contrary to GPUs, ordinary CPUs handle irregular data structures very well due to a general purpose memory hierarchy.

The next step towards a fully programmable platform may be the many-core CPU. It consists of many instantiations of an in-order core equipped with a full-featured instruction set, such as IA-32². In order to facilitate visual computing workloads, such as a rasterization pipeline, each core should incorporate a wide vector processing unit (VPU). Such a unit can work more or less independently from the general purpose logic. As such, vector instructions could be executed in parallel with scalar code. One problem in-order cores face is the inability to hide pipeline stalls and cache misses. These effects can be reduced by using several hardware threads for each core, at the expense of context switching. Only one thread is executed at a time; whenever a thread stalls, another thread resumes execution. This is efficient as long as there is enough parallel work. Since the hardware threads share the same cache, they should ideally work on the same dataset.

As previously mentioned, GPUs often resort to specialized hardware for different parts of the pipeline such as rasterization, alpha blending and depth testing. Each of these parts needs to be designed with peak throughput in mind. Inevitably, either hardware resources will be wasted or will present a bottleneck for the entire pipeline. On a many-core CPU, this logic can be implemented in software. This means that when e.g. alpha blending is disabled, the same resources can be used for increased shading performance.

² Intel Architecture, 32-bit. Often generically called x86.

2.3 Tiled rendering

Special considerations need to be made when designing a rendering pipeline for a many-core CPU. First and foremost, the task needs to be parallel with minimal synchronization. Secondly, we need to optimize memory bandwidth by making efficient use of the cache hierarchy.

There exist many different rasterization methods. One way of differentiating them is based on where sorting of geometry, required due to parallelism, is performed [6].

Modern GPUs often use sort-last fragment, meaning that individual fragments are sorted prior to depth testing and blending. Pixel shader results are put into small screen aligned buffers used for sorting. It is then possible to depth test and blend the results in correct order using a single memory access to the frame buffers for each pixel. However, if a screen aligned buffer gets full, multiple memory accesses are required for a single rendering pass. This is typically the case when depth complexity is high. This rendering method could be difficult to implement efficiently on a many-core architecture depending on how memory is shared between different cores. A general purpose cache hierarchy might evict sort buffers not used for a period of time, thus increasing memory bandwidth. Also, when several cores access the same buffer, synchronization to enforce cache coherency might introduce considerable latency.

Another approach is used by sort-middle algorithms. Here, individual primitives, in our case triangles, are vertex shaded and sorted into screen aligned regions. This method is used by tiling renderers as described by Fuchs et al. [1] and Seiler et al. [9]. A tiling rendering architecture subdivides a computer graphics image into smaller parts to be rendered separately. This is efficient for a many-core architecture since the different tiles can be processed in parallel by different cores. Further, the screen space spatial coherence allows for data such as a tile's part of the frame buffer to be efficiently stored in a core's cache. In order for each tile to be efficiently processed in isolation, the triangles overlapping a given tile must be known. The set of overlapping triangles is called a bin; hence, the process of distributing triangles amongst the tiles is termed binning.

Traditionally, the tiled rendering pipeline consists of two separate stages: the front-end and the back-end. When a user application submits commands to a graphics API³ (e.g. OpenGL or Direct3D) it is customary to output triangles on the screen through draw calls. A draw call either contains vertices and triangles or references existing buffers containing such data. These buffers usually reside in dedicated graphics memory and have been uploaded while loading scene assets. A draw call uses the current rendering state, such as shaders and blend mode, to transform, rasterize and shade the geometry onto the frame buffer. The draw call is therefore the input primitive of the graphics API, for which rendering work is performed.

As described, the rendering pipeline receives individual draw calls as input, together with their rendering states.

³ Application Programming Interface. An interface implemented by one software program that enables interaction with other programs.

It proceeds to split those draw calls into individual geometry batches of a suitable size. These geometry batches are then sent to the front-end. The front-end distributes individual geometry batches to the available cores so that each geometry batch can be processed in parallel. Each core performs vertex shading of individual vertices of the geometry batch and performs the binning of the actual triangles. Vertex shading of non-positional components may be delayed until the back-end. While one geometry batch is being vertex shaded, triangles may spread to different tiles. With the intention of reducing inter-core synchronization, one may utilize per-core queues for each tile that together form its bin. These separate queues are then merged prior to rasterization while preserving draw call submission order. The front-end is performed to completion before any back-end work is begun, i.e., there is a synchronization point between the front-end and the back-end.

The back-end performs rasterization as well as shading of the resulting fragments for each tile, in parallel on different cores. By using tiles of a size that fits in a core's cache, scenes with arbitrary depth complexity can be rendered using a single memory access to the frame buffers for each pixel. A problem with this approach is that a single primitive may be binned to multiple tiles, which in turn increases memory bandwidth. This effect is limited as long as triangles are small compared to the tile size [6].
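As a rough illustration of the binning step, the sketch below bins one transformed triangle into the per-core queue of every tile its screen-space bounding box overlaps. The types, the tile size and the queue layout are assumptions made for the example, not details of the actual framework.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical types; the actual framework is not public.
struct ScreenTriangle { float minX, minY, maxX, maxY; /* vertex data ... */ };

struct Tile {
    // One triangle queue per front-end core; the queues are merged in
    // submission order before the back-end rasterizes the tile.
    std::vector<std::vector<ScreenTriangle>> perCoreQueue;
};

constexpr int kTileSize = 128;  // pixels; an assumed tile size

// Bin one transformed triangle into every tile its screen-space bounding box
// overlaps. Each core writes only to its own queue, so no locking is needed.
void binTriangle(const ScreenTriangle& tri, int coreId,
                 std::vector<Tile>& tiles, int tilesX, int tilesY) {
    int x0 = std::max(0, static_cast<int>(tri.minX) / kTileSize);
    int y0 = std::max(0, static_cast<int>(tri.minY) / kTileSize);
    int x1 = std::min(tilesX - 1, static_cast<int>(tri.maxX) / kTileSize);
    int y1 = std::min(tilesY - 1, static_cast<int>(tri.maxY) / kTileSize);
    for (int ty = y0; ty <= y1; ++ty)
        for (int tx = x0; tx <= x1; ++tx)
            tiles[ty * tilesX + tx].perCoreQueue[coreId].push_back(tri);
}
```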

3 Implementation and testing environment

The algorithms have been implemented in a framework simulating a potential many-core architecture. The timings for each frame have been normalized to a nominal value of 1 that represents the frame time of the unmodified simulation, so absolute timings are not reported. The relative speedup provided by the algorithms is, however, meaningful. The framework is a commercial trade secret of the Intel Corporation, which is why some details are omitted. Scenes from games have been evaluated using trace-files. The trace-files contain logs of all relevant information passed from the game to the Direct3D API while playing. Thus, by parsing a trace-file, the simulator is able to recreate the exact sequence of frames offline. Simpler scenes have been executed immediately using the user application that generates them. All scenes have been rendered using 32 simulated cores; one scheduler core and 31 worker cores.

3.1 Scheduling and dependency analysis

The primary method to enhance the load balance among all cores, originally present in the framework, is to process multiple render targets and/or have multiple frames in flight at the same time. Such a system needs a scheduler to schedule the work from different render targets. Further, the scheduler needs to be aware of dependencies between render targets and their pipeline stages. This amounts to a dependency graph where each render target is represented by one or more nodes. A similar system is described by Seiler et al. [9], although they use a slightly different nomenclature.

In the context of scheduling, a bin-set is a collection of front-end work, a set of bins and a render target. The graphics pipeline may start working on a render target before all draw calls targeting the render target are submitted by the user application, thus a single render target may result in several bin-sets. In the ideal case the user application will have a lead of at least one render target with respect to the graphics pipeline. In this case there will be a 1:1 mapping between render targets and bin-sets. For this reason the terms bin-set and render target will be used interchangeably throughout the remainder of this report, in the context of scheduling.

The basic scheduling unit is a node. A bin-set consists of two scheduling nodes: the front-end node and the back-end node. The scheduler has a list of nodes that are waiting to be processed, the wait-list, as well as a list containing nodes that are being processed, the active-list. When a bin-set is put into the wait-list, dependencies to other bin-sets need to be determined. For example, if two bin-sets affect the same render target, there will be a dependency between their back-ends. If one bin-set renders to a texture that is used to shade an object in another bin-set's back-end there will, again, be a dependency between their back-ends.

It is rare for front-ends to have dependencies, but it does happen, e.g. when one back-end renders to a texture that is used to displace geometry in a front-end. There is always a dependency added between a bin-set's front-end and back-end. When all dependencies for a node in the wait-list are resolved, it is promoted to the active-list.

The scheduler runs on a dedicated core, and the rest of the cores are worker cores. The active-list is constantly polled for work by the worker cores. Worker cores traverse the list starting with the oldest node, asking it for work. If work is found, they exit and perform the work. If there is no work available at that node at the moment, the core checks the next node in the list. This way multiple nodes can be processed at the same time while keeping the latency as low as possible. Once a node has finished all its work, the scheduler is signaled to remove this node from the list and resolve any dependencies for other nodes. This usually promotes other dependent nodes to the active-list.
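A minimal sketch of the worker-side polling loop described above could look as follows; the Node interface and the use of std::function are assumptions made for illustration and do not reflect the framework's real types.

```cpp
#include <deque>
#include <functional>
#include <optional>

// A hypothetical scheduler node; the real framework's interface is not public.
struct Node {
    // Returns a work item if one is available right now, otherwise nothing.
    std::function<std::optional<std::function<void()>>()> tryGetWork;
};

// Worker-side polling of the active-list: start with the oldest node to keep
// latency low, and fall through to newer nodes if no work is available there.
void workerPoll(const std::deque<Node*>& activeList) {
    for (Node* node : activeList) {
        if (auto work = node->tryGetWork()) {
            (*work)();  // perform the work item, then return to polling
            return;
        }
        // No work in this node at the moment; check the next node in the list.
    }
    // Nothing to do anywhere: the core yields until new work appears.
}
```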

4 Case studies

In order to assess the current framework, different workloads, both games and smaller applications, were analyzed. We found that a workload often falls into one of two categories: those with very few render targets and those with many. In this section two such workloads are presented. One is from the game F.E.A.R. and one is from the game Unreal Tournament 3. Additionally, a simple test scene was constructed to showcase the problems found in the real world workloads.

4.1 F.E.A.R.

F.E.A.R. is a game that makes heavy use of shadow volumes, a technique originally described by Crow [2]. Such scenes often employ multiple passes affecting the same render target; one pass for each light. The game utilizes a single render target to perform the majority of work in the scene. A less significant portion consists of post-process effects as well as BLIT⁴ operations. The BLIT operation is typically used to transfer the rendered buffer to the screen. The frame chosen for measurements is shown in Figure 2 below. This scene is particularly interesting since it showcases a very large shadow volume covering a big portion of the screen; there is a shadow caster (in this case a fan) behind the viewer, projecting the volume down the corridor. A view of the fan is shown in Figure 3.

⁴ Block Image Transfer; an operation where at least two surfaces are combined into one using a raster operator.


Figure 2. The frame from F.E.A.R. used for measurements.

Figure 3. A frame from F.E.A.R. showing the fan that casts a shadow down the corridor in Figure 2 above.


Figure 4. The major render target of a single frame in F.E.A.R. measured in isolation. The dark grey bars represent front-end (FE) work and the light grey bars represent back-end (BE) work.

The major work of a frame is shown in isolation in Figure 4 above. By isolating it we disallow other render targets and/or frames to be processed concurrently. The purpose of this is to get a clear view of the proportions of the work involved, as well as the variation in processing time for different work items. For this game, the back-end is significantly more expensive than the front-end. The variation in processing time of different work items is great, resulting in a lot of idle time for most cores. Note the dependency between the front-end and the back-end; the back-end cannot start before all work in the front-end is finished.

In order to fully analyze a single isolated frame, post-processing and BLIT operations need to be investigated (shown in Figure 5 below). There are three different bin-sets for this purpose. Their back-ends are all dependent on each other, forcing one to finish before the next can begin. These bin-sets are very different from the major bin-set of the frame. They contain significantly less work, and there is no front-end work to speak of. Typically, a full screen quad is rendered, resulting in a very short front-end and a not too expensive back-end. The bin-set in the middle has lots of idle time. This is probably a result of lock contention when each core "steals" tasks from the scheduler; the individual work items are too short to efficiently utilize all cores.


Figure 5. Post-processing and BLIT operations of a single frame in F.E.A.R. The dark grey bars represent front-end (FE) work and the light grey bars represent back-end (BE) work. Note that the front-end work items are hardly visible.

Summing up the processing times for the major render target as well as post-processing and BLIT, we get 1.04 time units per frame.


Figure 6. A repeated frame in F.E.A.R. The timeline encompasses a single frame from start to end. Note the overlap with previous and following frames. The dark grey bars represent front-end (FE) work and the light grey bars represent back-end (BE) work.

In an attempt to utilize the scheduler to enhance the load balance, the same frame is repeated multiple times and the scheduler is activated to attempt to process multiple frames at the same time. Measurements from such a frame are shown in Figure 6 above. This reduces the frame time to 1.00 time units per frame (this is our basis for normalization). However, the latency per frame (the time elapsed from the start of the frame until it is finished) is increased from 1.04 to 1.50 time units. Since both frames access the same render target, the framework is unable to execute their back-ends concurrently. It would generally require too much memory to decouple them by duplicating the render target; a single high-resolution MSAA frame buffer may already be pushing the limit of the graphics memory. The front-ends are, however, completely decoupled allowing them to be executed concurrently. The result is that the front-end of the next frame is executed when cores are starved for work in the previous back-end. The front-end is executed “for free” since the cores would otherwise be idle. The back-end is unchanged and, since it is the most expensive pipeline stage, the benefit is not great. What does happen is that the latency of a frame is greatly increased. This happens because the next frame is pulled in early, before the previous frame is finished. This may very well increase the response time the player experiences while playing the game.

4.2 Unreal Tournament 3

Unreal Tournament 3 falls into the category of scenes utilizing a lot of render targets. The frame used for measurements is shown in Figure 7 below. Measurements from a single frame repeated multiple times are shown in Figure 8.

Figure 7. The frame of Unreal Tournament 3 used for measurements.


Figure 8. A repeated frame in Unreal Tournament 3. The timeline encompasses a single frame from start to end. Note the overlap with previous and following frames. The dark grey bars represent front-end (FE) work and the light grey bars represent back-end (BE) work.

Here the dependency analysis really pays off. On this granularity, it is hard to notice any significant idle times. All render targets overlap beautifully. The effects of greater latency from having multiple frames in flight should be significantly lower for this game since the scheduler should be able to stay busy with all render targets within a frame until the very end of it. A close-up of a part of the timeline is shown in Figure 9 below. Here it becomes apparent that there is a significant amount of idle time after all. Since there is work available, the conclusion is once again lock contention, in both the front-end and the back-end.


Figure 9. A part of a repeated frame in Unreal Tournament 3. The dark grey bars represent front-end (FE) work and the light grey bars represent back-end (BE) work.

4.3 Simple scene

The simple scene was constructed to showcase a scene where the current framework has bad load balancing and performance. It uses a single render target and everything is drawn using the same moderately complex pixel shader. The pixel shader uses multiple samples of a three-dimensional noise texture for both coloring and normal mapping (using finite differences). The vertex shader performs nothing but a few simple transformations. The frame used for measurements is depicted in Figure 10 below.


Figure 10. The frame of the simple scene used for measurements.

The terrain is moderately complex and the smallest sphere in the bottom right is the most tessellated. The current framework dispatches tiles to the available cores from left to right and top to bottom. The result is that the last tile started by a core is also the most expensive. To simplify the analysis, this scene is rendered with long pauses between frames. The measurements from a single frame are shown in Figure 11 below.


Figure 11. A single frame of the simple scene. The dark grey bars represent front-end (FE) work and the light grey bars represent back-end (BE) work.

The result is quite similar to that of F.E.A.R. The front-end has some idle time but there is a lot of idle time in the back-end. Actually, more cores are idle than active over the course of the entire back-end. Following the back-end for rendering the scene is a series of small back-end tasks. These represent BLIT operations that transfer the render target to the screen. The total time required to render a single frame has been normalized to 1.00 time units.

4.4 Conclusions

The current framework does a good job of handling scenes with a lot of independent render targets. When this is not the case, the results are less than ideal. Both the front-end and the back-end may suffer severe load imbalance. Further, processing multiple frames at the same time may significantly increase the latency of a frame while sometimes only providing a small benefit in throughput. In some cases, the work items were found to be too cheap to achieve good load balance. It may be beneficial to detect these cases and concatenate such small tasks into larger ones. In reality, the lock contention may be exaggerated by the measurements since the process of performing the measurements introduces additional inter-core synchronization.

5 The pre-front-end

Our first attempt to improve the load balance was to extend the rendering pipeline to include a pre-front-end⁵. The basic idea is to avoid the synchronization between the front-end and the back-end and start processing a tile as soon as all overlapping geometry has been binned. The front-end will focus its efforts on finishing all geometry batches overlapping the most expensive tile remaining. Once all such overlapping geometry batches are finished in the front-end, the back-end can start processing the tile immediately.

In order to determine which tiles a certain geometry batch overlaps, we need to have conservative screen space bounds encapsulating all triangles it contains. How these are generated is explained in more detail in Section 5.1 below.

In order to know which tile the front-end should focus on, we need a per-tile quantity that can be used to compare the cost of one tile to another. A simple cost estimation quantity is a triangle density estimate. A triangle density estimate can be associated with each geometry batch using the number of triangles it contains divided by the screen space area of its bounding box. This estimate can then be accumulated in all overlapping tiles. The result is a per-tile triangle density estimate.

The pre-front-end consists of binning the geometry batches based on their respective bounding box. During this process the per-tile triangle density estimates are accumulated as explained above. The front-end and back-end are essentially unaltered. The novel idea is how they can be performed in parallel. In our initial implementation, each core behaved according to the following scheme:

1. Find the highest density tile with all overlapping geometry batches processed by the front-end, and perform its back-end work. Once this is done, the data of the tile can be freed and its memory reused. The core then reiterates step 1.

2. If no tile can be found in step 1, instead find the highest density tile with overlapping geometry batches still to be processed and perform front-end work for the overlapping geometry batch first submitted to the pipeline. The core then reiterates step 1.

This scheme will cause tiles to be processed by the back-end as soon as possible. However, this is not always what we want. Consider a scene where a complex object occupies a single tile and the rest of the tiles are simply cleared with a background color. Even the expensive tile needs to be cleared, so the clear command will be the first item to be processed in the front-end.

⁵ The pre-front-end algorithm was conceived by the Advanced Rendering Technology group at Intel in Lund: Robert Toth, Jacob Munkberg, Jon Hasselgren, Petrik Clarberg and Tomas Akenine-Möller.

Immediately after the clear has been processed, all empty tiles will be ready for back-end processing and will then be processed. The expensive tile would then be the last one to be processed, making the load more imbalanced. The same reasoning can easily be adapted to other scenes.

The scheme we ended up using tries to fetch both front-end and back-end work. If only one of the items is found, it is executed. However, if both are found, the density of the back-end work is compared to the density of the most expensive tile the front-end work overlaps. The most expensive work item is then executed.

It is not a strict requirement to have explicit synchronization between the pre-front-end and the front-end. Front-end items can be dispatched before the pre-front-end is finished, but it is not possible to use the triangle density estimate of a tile before the pre-front-end is finished. This may lead to a larger memory footprint and worse performance in some cases, and needs to be investigated. For the purpose of this report, the pre-front-end is performed on a single core, mainly due to intimate memory access patterns when binning bounds and accumulating tile density. If front-end tasks are started before the pre-front-end is finished they are dispatched in submission order. Immediately after the pre-front-end is finished, the triangle density based ordering takes effect, making the front-end focus on the most expensive tiles.

Using the pre-front-end, memory requirements can be lowered for the bins; we only need to keep geometric data for active tiles in memory. The data is not generated by the front-end until it is really needed, and when a tile is fully processed in the back-end, its memory can be reclaimed and reused. Another effect is that once the pre-front-end is complete, the pipeline is made more parallel since the front-end and back-end can execute concurrently. Load balancing of the back-end is improved by processing tiles with a lot of geometry as early as possible.
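The work-selection rule described above can be summarized in a small sketch; the types and density fields are placeholders, and the surrounding code that finds the two candidate work items is omitted.

```cpp
#include <optional>

// Hypothetical candidates produced by the tile/batch selection code (omitted).
struct BackEndTile  { int tileId;  float density; };
struct FrontEndItem { int batchId; float maxOverlappedTileDensity; };

// One scheduling decision for a worker core: 'be' is the densest tile whose
// overlapping geometry batches are all front-end processed (if any); 'fe' is
// front-end work for the densest tile that still has unprocessed batches.
// Returns true if the back-end item should run, false if the front-end item
// should run (callers handle the case where neither exists).
bool preferBackEnd(const std::optional<BackEndTile>& be,
                   const std::optional<FrontEndItem>& fe) {
    if (!be) return false;  // only front-end work was found
    if (!fe) return true;   // only back-end work was found
    // Both found: run whichever touches the most expensive (densest) tile.
    return be->density >= fe->maxOverlappedTileDensity;
}
```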

5.1 Screen space bounds

Calculating the bounds of each draw call in screen space has been deemed outside the scope of this thesis. It is our belief that such bounds will be needed by other algorithms in the near future, e.g. when binning higher order primitives such as Bézier patches. If this assumption is true, the calculation of such bounds can be assumed to be "free". However, it is interesting to know that there exist efficient methods for providing bounding boxes without executing the vertex shader for each vertex in the geometry batch [5, 7].

In order to evaluate the effectiveness of this algorithm, the measured frame has been rendered in two passes: the first pass renders the scene using the conventional pipeline while tagging and outputting screen space bounds for each geometry batch. The next pass uses the bounds from the previous pass to render the frame with a pipeline including a pre-front-end.

Since the bounds are known, the pre-front-end is extremely fast. The measurements are still relevant because they evaluate the algorithm at its best: using tight bounds that are cheap to acquire.

5.2 Dependency analysis and the pre-front-end

When including a pre-front-end, the dependency analysis must be modified. The dependencies of the pre-front-end should typically be the same as the front-end. Thus, it is natural to put them in the same node in the dependency graph (see Figure 12 below).


Figure 12. Putting the pre-front-end and front-end in the same node in the dependency graph.

Note that the strict dependency between the front-end and back-end node disappears (indicated using a dashed line) since they now operate in parallel. When using separate nodes for the front-end and the back-end, some form of communication is needed between them. To begin with, the back-end cannot start a work item until all overlapping front-end work is done. Secondly, the scheduler needs to be able to interleave front-end and back-end work. Recall, from Section 3.1, that our scheduler keeps a number of independent nodes in a list of active work. Work is acquired by going through the list, in order of first submitted node to reduce latency, terminating when a node has work available. In order to interleave front-end and back-end work, the front-end must signal that no work is available (even when there is) for the back-end to get a chance to proceed. A dependency graph is shown in Figure 13 below illustrating its structure before and after the pre-front-end is added.


Figure 13. A normal dependency graph is shown to the left. To the right, the pre-front-end is added using separate scheduler nodes for the front-end and back-end.

Another solution is to put everything in a single node, and to add a dummy node representing the back-end dependencies. The combined front-end/back-end node will then promise not to perform any back-end work until the dummy node has all dependencies resolved. Also, the dummy node must not be completed (and its dependents resolved) until the combined front-end/back-end node is. This reduces the need for communication since the dummy node does not do anything; it always reports that no work is available. A modified version of the dependency graph is shown in Figure 14.


Figure 14. A normal dependency graph is shown to the left. To the right, the pre-front-end is added, putting all logic in one node and keeping track of back-end dependencies using a dummy node.

To reduce the amount of new code required, a variation of the dummy method was chosen as the basis for our implementation. In this variation there is a strict dependency between the combined node and the dummy node. The combined node will not execute any back-end work as long as the dummy node has more than one dependency, i.e., it will only execute back-end work if there is no dependency but the one from the combined node.

5.3 Preserving submission order

It is generally important for the back-end to process triangles in the correct order, i.e., in the order they were submitted to the pipeline. If a transparent blend mode is active, the drawing order may have a significant impact on the result. Another example is when z-buffering (depth buffering) is disabled, e.g. when drawing in two dimensions. The submission order then determines which triangle ends up in front.

The traditional front-end always distributes front-end work in submission order to the different cores. Each core then puts triangles in a per-core triangle list in each overlapping tile. The result is that each per-core triangle list is in itself correctly ordered. When processing triangles in the back-end, the core can simply pick the triangle with the lowest submission tag from the per-core triangle lists.

The situation is not quite so easy when using the pre-front-end. The idea is to let the front-end focus on finishing the most expensive tile as soon as possible. Even though the overlapping front-end work items are performed in order, triangles may spread to nearby tiles. Thus, submission order is no longer guaranteed within a single per-core triangle list. In order to alleviate this situation, jump commands are inserted into the triangle lists. Each such jump command contains the fields needed to build a red-black tree: left, right and color. The jump commands are inserted when a geometry batch bins its first triangle to a triangle list. At the same time, the command is inserted into the red-black tree, as a node, to preserve order. There is one red-black tree for each per-core triangle list. Before the back-end starts, all jump commands are traversed in submission order while setting the destination of each jump command to point to the next geometry batch in submission order. The back-end is altered to simply follow the jumps when processing triangles.
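A simplified sketch of the jump-command bookkeeping is given below. It uses std::map (typically implemented as a red-black tree) as a stand-in for the hand-built red-black tree described above, and the field names are illustrative rather than those of the actual implementation.

```cpp
#include <cstdint>
#include <map>
#include <vector>

// A per-core triangle list for one tile. Triangles from one geometry batch are
// contiguous; a jump command is written when a batch bins its first triangle.
struct JumpCommand {
    uint32_t firstTriangle;  // index of the batch's first triangle in the list
    int32_t  nextJump = -1;  // patched later: next batch in submission order
};

struct PerCoreTriangleList {
    std::vector<uint32_t> triangles;      // binned triangles (indices)
    std::vector<JumpCommand> jumps;       // one per geometry batch
    // Stand-in for the red-black tree of the text: std::map is typically
    // implemented as one, and keeps the jumps sorted by submission tag.
    std::map<uint32_t, uint32_t> bySubmissionTag;  // tag -> index into jumps
};

// Run before the back-end starts: traverse the jump commands in submission
// order and let each one point to the next, so the back-end can simply follow
// the jumps and still see the geometry batches in submission order.
void linkJumps(PerCoreTriangleList& list) {
    int32_t prev = -1;
    for (const auto& [tag, jumpIndex] : list.bySubmissionTag) {
        if (prev >= 0)
            list.jumps[prev].nextJump = static_cast<int32_t>(jumpIndex);
        prev = static_cast<int32_t>(jumpIndex);
    }
}
```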

5.4 Results and discussion

The screen-space bounding boxes generated for the simple scene are shown in Figure 15 below. The resemblance to the original scene is clear. The application submitted the three spheres and the terrain in one draw call each. The pipeline has split each draw call into multiple geometry batches, each with their own bounding box.


Figure 15. To the left: The screen-space bounding boxes of the geometry batches in the simple scene. The green box shows the extents of the viewport. To the right: The simple scene.

Figure 16. The triangle density of each tile in the simple scene. The scale is linear, ranging from 0 to 4·10⁴ triangles per tile.

Note that we have at least one geometry batch that overlaps the entire viewport. The reason for this is that some triangles intersect the near clip plane. An automatic bounding algorithm would probably have a hard time generating bounds in this case, the fallback being to assume that the batch covers the whole screen. Because of this, our implementation does the same thing. In this case, it is a part of the terrain that spans the near clip plane. Since part of the same geometry batch is inside the view frustum, it is hard to cull. If world-space bounds were known for the geometry batch, the intersection with the near plane might be determined to occur outside the frustum, but no such algorithm has been employed in this case. The triangle density of each tile in the scene is shown in Figure 16. It is clear that the sphere in the bottom right corner is most expensive, reaching more than 40 000 triangles/tile.

Measurements from running the simple scene while utilizing the pre-front-end are shown in Figure 17 below. In this case there is no synchronization between the pre-front-end and the front-end, allowing front-end work items to be started in submission order before the pre-front-end is finished.

Figure 17. The simple scene run with a pre-front-end without synchronization between the pre-front-end and the front-end. Notice how the front-end and back-end overlap. The dark grey bars represent front-end (FE) work and the light grey bars represent back-end (BE) work. The striped bars represent pre-front-end (PFE) work.

The total time required to render a single frame is reduced from 1.00 to 0.86 time units. This is a 14% speedup compared to the original framework. The overlap between the front-end and the back-end is pretty small, probably due to the fact that the front-end is too short (or the granularity of the work items too coarse). The density estimation kicks in too late to have any effect, which means that the front-end is not concentrating on getting tiles ready for the back-end as soon as possible. This will probably be less of a problem in more complex scenes as the number of front-end work items will go up. For completeness, the same scene is rendered using a synchronization between the pre-front-end and the front-end (see Figure 18).


Figure 18. The simple scene run with a pre-front-end with synchronization between the pre-front-end and the front-end. Note how the front-end cannot start until the pre-front-end is finished. The dark grey bars represent front-end (FE) work and the light grey bars represent back-end (BE) work. The striped bars represent pre-front-end (PFE) work.

The overlap looks more promising in this case. By sacrificing a small amount of time in the beginning, not a single core is starved for work between the front-end and the back-end. However, the total time of execution for the entire frame is not shorter (it is approximately the same). The reason for this is that the most expensive tiles also depend on most of the front-end work. The highly tessellated spheres amount to many front-end items. The load imbalance is pushed from the front-end towards the back-end. This is a very important observation, which is used in Section 6 to devise a new algorithm.

In the case study of F.E.A.R., it was shown that a load imbalance in the front-end can easily be hidden by having multiple frames in flight at the same time. The reason is that there are seldom any dependencies that need to be met before a front-end can execute. Therefore, the benefit of this algorithm seems limited in terms of load balance. In terms of memory requirements it may have a higher benefit, but the required memory bandwidth is unchanged. Some applications, such as offline movie rendering, may actually be limited by the memory footprint. More tiles could potentially be processed at the same time in this case, if a pre-front-end is used.

6 Tile splitting

Using the pre-front-end may result in moderate performance gains, mostly due to better load balance in the front-end. Our investigations have shown that most front-ends are independent, allowing them to execute concurrently. It is thus easier to hide load imbalance in the front-end than in the back-end. Further, it should be relatively easy to get an even load in the front-end since the number of primitives each work item contains can be specified arbitrarily. A smart front-end could, as an example, consider the complexity of the vertex shader as well as how much front-end work remains when determining the number of primitives in each work item. Due to these premises, the tile splitting scheme will focus on load balancing the back-end.

A natural way to balance the load of a single tile is to allow it to be split into several sub-tiles. This allows a single tile to be processed by multiple cores in parallel. Splitting the tiles in the back-end does present some challenges:

• We need a way to estimate the time required for a single core to process a given tile in the back-end. For this purpose, we developed an inexpensive cost estimation model.
• We need to gather information in the front-end to feed the cost estimation model. For this purpose, front-end counters are incorporated into the pipeline.
• A heuristic of which tiles to split and how many splits to perform is needed. This is called the split heuristic.
• A strategy is needed for how to distribute the (sub-)tiles among the available cores. This is called the dispatch heuristic.
• The triangles may have to be re-binned after a split. We avoid this problem by using a special rasterizer.

6.1 Per-tile cost estimation

The only way to accurately find out how much time is required for a single core to perform the back-end work of a tile is to actually perform the work. We need to know approximately how long the required time is without actually doing the work. To that end, we develop an inexpensive cost estimation model that can calculate how much time is required to perform the back-end work for a tile. When the cost has been estimated for all tiles, there is a higher probability of distributing the work of all the tiles evenly among the available cores.

The cost estimation model uses data that can be recorded from a typical front-end to give an estimate for the time required to process a given tile in the back-end. The data is recorded using counters. In order to estimate the cost of a single triangle, we need to know approximately how many samples it covers.

Ideally the area of the intersection between the "parent" tile and the triangle should be used as a measure. This is not feasible to compute since it requires expensive clipping of each triangle to the borders of the parent tile. Instead, each triangle can be roughly classified using observations made in the front-end when determining overlapping tiles, e.g. if the triangle is covering the whole tile. Additionally, really small triangles might use a special code path in the rasterizer, motivating the need for a special classification for such triangles. The different classifications are called triangle types. Note that the triangle types are independent of the current rendering state such as pixel shading or z-buffer (depth buffer) mode. The data for a given tile may include, but is not limited to:

• The number of triangles of a certain triangle type binned to the tile.
• The sum of the pixel shading cost, e.g. cycles required to execute a certain pixel shader for a single fragment, for each triangle of a certain triangle type binned to the tile.
• The number of triangles of a certain triangle type binned to the tile adhering to a specific rendering state, e.g. stencil-only or with early-z-cull enabled.

It is also useful to include counters for higher level constructs, such as the number of geometry batches containing triangles that overlap a tile. Each geometry batch will typically incur a certain amount of overhead.

The model uses this data to form linear and logarithmic terms whose weighted sum represents the time it takes to process the tile. The logarithmic terms are used to model occlusion, i.e., when a triangle is visible it will usually take a longer time to process that triangle since all shading needs to be computed for the pixels covered by the triangle, and when the triangle is occluded (obscured) by previously rendered triangles, execution will be less expensive. Cox and Hanrahan [1] create a model for this that converges to the logarithm of the number of overlapping triangles per pixel. The weights are determined by fitting the model to measured timings. This fitting can either be performed at runtime at suitable intervals or offline using data from numerous scenes. The logarithmic function used may be the floor of the 2-logarithm, which is very efficient to calculate for integers.

Two different models were tested; one non-linear and one linear. Our non-linear model is:

$$t = a + \sum_i b_i x_i + c \log\left(1 + \sum_i d_i x_i\right), \qquad (1)$$

where $t$ is the processing time and $x_i$ are the values of the counters. $a$, $b_i$, $c$ and $d_i$ are constants to be found through fitting. Since this model is non-linear, it is suitable for offline fitting. The value of 1 is added within the logarithm to ensure that the resulting value is zero when there aren't any contributing counters. A linear model is:

$$t = a + \sum_i b_i x_i + \sum_i c_i \log(1 + x_i). \qquad (2)$$

This model is suitable for runtime fitting since it is comparatively inexpensive to perform linear fitting. Neither of the models should be considered more correct than the other.

Geometry will either be drawn with early-z-cull enabled or disabled. If it is enabled, this means that each sample will be tested against the z-buffer prior to shading. In this way, shading computations can be skipped if the sample is occluded. If this state is disabled, the depth value will be tested after the sample has been shaded. In this case, we will not benefit from occlusion. This happens, e.g., when the pixel shader changes the depth value; in this case, it is impossible to know the depth value until at least a part of the shader has been executed. In other circumstances, z-buffering might be disabled altogether. In these cases early-z-cull will be disabled since there can be no occlusion. The correctness of the models depends on the order in which these different types of geometry are drawn. If all early-z-cull enabled geometry is drawn first, and with the assumption that their depth is evenly distributed, their cost is proportional to:

$$1 + \frac{1}{2} + \frac{1}{3} + \cdots + \frac{1}{n}.$$

According to Cox and Hanrahan [1] this grows approximately as $\log n$, where $n$ is the number of overlapping triangles per pixel. With the same reasoning, if all early-z-cull disabled geometry, say $m$ triangles, is drawn first (which may occlude early-z-cull enabled geometry), the cost of the $n$ early-z-cull enabled triangles is proportional to:

$$\frac{1}{m+1} + \frac{1}{m+2} + \cdots + \frac{1}{m+n},$$

which grows as $\log(m+n) - \log(m)$. A better model should therefore be a combination of the linear and non-linear models, but the basic assumptions are not strong enough to justify such a complex model, e.g. the depth values of geometry are hardly evenly distributed.

It may not make sense to include logarithmic terms for all counters. Since the goal is to model occlusion, which saves shading time when a fragment is occluded, shading based counters for geometry with early-z-cull enabled should be included.

The triangle counters of geometry with early-z-cull disabled should also be included since they may occlude the early-z-cull enabled geometry. The constants for other logarithmic terms should be zero. It is up to the implementer of the rendering pipeline to decide which terms to include in the cost estimation.

If the linear model is used (Equation 2), all information can be gathered when rendering frame $k$, and before rendering of frame $k+1$ starts, the coefficients are recomputed based on the gathered information. These coefficients are then used in Equation 2 to estimate the cost of each tile. One can also update the coefficients with a sliding average update, such as:

$$\mathbf{c} = \alpha\, \mathbf{c}_{\text{new}} + (1 - \alpha)\, \mathbf{c}_{\text{old}},$$

where $\mathbf{c}$ is a vector containing all the constants of the linear model. This approach avoids sudden jumps in the cost estimation model that otherwise may cause values to oscillate even when rendering an identical frame multiple times. The oscillation is possible since changing the model may change how tiles are split. This may in turn change the measurements used to correct the model. The value of $\alpha$ is up to the user to set; it needs to be in the range $[0, 1]$. When $\alpha = 1$, we do not include the previous values of $\mathbf{c}$, and when $\alpha = 0$, we do not include the new value (which is therefore meaningless). The value should be somewhere in-between, e.g. $\alpha = 0.5$.
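As a concrete sketch, the linear model of Equation 2 and the sliding average update could be implemented roughly as follows; the counter layout and coefficient storage are assumptions made for the example, not the framework's actual representation.

```cpp
#include <cmath>
#include <vector>

// Evaluate the linear model of Equation 2 for one tile:
//   t = a + sum_i b_i * x_i + sum_i c_i * log2(1 + x_i)
// where x are the accumulated per-tile counters and a, b, c come from fitting.
float estimateTileCost(const std::vector<float>& x, float a,
                       const std::vector<float>& b,
                       const std::vector<float>& c) {
    float t = a;
    for (size_t i = 0; i < x.size(); ++i)
        t += b[i] * x[i] + c[i] * std::log2(1.0f + x[i]);
    return t;
}

// Sliding average update of the fitted coefficients,
//   c = alpha * cNew + (1 - alpha) * cOld,
// applied element-wise to avoid oscillation between frames.
void updateCoefficients(std::vector<float>& coeffs,
                        const std::vector<float>& newFit,
                        float alpha /* e.g. 0.5 */) {
    for (size_t i = 0; i < coeffs.size(); ++i)
        coeffs[i] = alpha * newFit[i] + (1.0f - alpha) * coeffs[i];
}
```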

6.2 Front-end counters

The front-end is modified to include per-tile counters for the information required by the cost estimation model. Prior to the front-end, all such counters are initialized to zero. Each time a triangle is binned to a tile, its triangle type is determined and the counter for that triangle type incremented. Other counters, e.g. those containing the sum of the pixel shader cost for a certain triangle type, are also modified accordingly. Note that if the weights are known beforehand, e.g. from offline fitting, all counters for linear terms can be collapsed into a single score by pre-multiplying the weights, thus reducing the storage requirements. The same reduction is true for linear combinations of counters within the logarithm in Equation 1, reducing the storage requirements to two variables for the whole expression. This is not the case for Equation 2, as each logarithmic term requires a separate counter.

Since several cores typically access the same counters, it may be beneficial to have a unique set of counters for each core. This way, inter-core synchronization can be avoided. When estimating the cost for a tile, these per-core counters need to be accumulated into a single set of counters used by the cost estimation model. Such a setup is depicted in Figure 19. A simple example of how the front-end increments its counters, when a triangle is binned to a tile, is shown in Figure 20.

Figure 19. Illustration of a system where each core has a unique set of counters for each tile. The results of the per-core counters are summed prior to estimating the cost.

Figure 20. A flowchart showing how the per-tile counters are modified when a triangle is binned to a tile. This simple pipeline has four different triangle types (covers full tile, normal triangle, covers 16x16 samples, covers 2x2 samples) and one rendering state (early-z-cull).
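The counter update of Figure 20 can be sketched in code as follows, using the counter names from the flowchart; the exact classification of triangle types and the data layout in the real pipeline may differ.

```cpp
#include <cstdint>

// Per-tile (and per-core) counters for the simple pipeline of Figure 20:
// four triangle types and one tracked rendering state (early-z-cull).
struct TileCounters {
    uint32_t batchCount = 0;            // geometry batches touching the tile
    uint32_t triangleCount[4] = {};     // per triangle type
    uint32_t shadingCost[4] = {};       // sum of pixel shader lengths
    uint32_t earlyZCount[4] = {};       // triangles with early-z-cull enabled
    uint32_t earlyZShadingCost[4] = {};
};

enum TriangleType { kFullTile = 0, kNormal = 1, kCovers16x16 = 2, kCovers2x2 = 3 };

// Called from the front-end each time a triangle is binned to a tile.
void countTriangle(TileCounters& c, TriangleType type,
                   bool firstTriangleOfBatchInTile, bool earlyZCullEnabled,
                   uint32_t currentShaderLength) {
    if (firstTriangleOfBatchInTile)
        c.batchCount += 1;  // per-batch overhead term
    c.triangleCount[type] += 1;
    c.shadingCost[type] += currentShaderLength;
    if (earlyZCullEnabled) {
        c.earlyZCount[type] += 1;
        c.earlyZShadingCost[type] += currentShaderLength;
    }
}
```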

6.3 Split heuristic

A simple scheme is to always split a tile in half and to treat the cost of a tile as uniformly distributed: that is, if the tile is split into two parts, each part is assumed to have half the cost of the original tile. The splitting is continued recursively until the cost of each tile is considered low enough or until a smallest possible tile size is reached. In our case, the smallest possible tile size is chosen as 32x32 due to access patterns of swizzled render targets.

After the front-end, the splitting heuristic is used to determine which tiles to split. As an example, if a tile covers 128x128 pixels, a tile split could be such that the tile is split into two non-overlapping 64x128 sub-tiles. The idea is that the cost for rendering one such sub-tile will be approximately half of the rendering time of the full tile. Hence, tile splitting can potentially reduce the time required to render a tile to 50% if the tile is split, and the sub-tiles' back-end work is performed on two cores in parallel.

The first step of the splitting heuristic is to estimate the cost of all tiles using the cost estimation model and the per-tile counters. The $N$ most expensive tiles are then selected and split recursively until the cost of each sub-tile is below a certain threshold $h$ (with the assumption that the cost of a sub-tile relative to the cost of the whole tile is in direct relation to their areas in pixels).

There is generally a certain overhead associated with splitting a tile. It is therefore crucial to only split when it is actually needed. Over-splitting can lead to worse performance. If a scene has several independent render targets and/or multiple frames in flight at the same time, it might not be beneficial to split even expensive tiles. Because of this, the threshold $h$ should be modified according to the amount of work in concurrent render targets. In our implementation, $N$ is set to 16 and $h$ is set to 1/3 of the highest cost tile (or 5 if this value is larger). If there are more than 8 concurrent render targets, splitting is disabled.
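A sketch of the recursive halving described above is shown below; the SubTile type and the uniform-cost assumption mirror the text, while the exact bookkeeping is simplified.

```cpp
#include <vector>

struct SubTile {
    int x, y, width, height;  // pixel region within the render target
    float estimatedCost;      // from the cost estimation model
};

constexpr int kMinTileSize = 32;  // smallest sub-tile, due to swizzled targets

// Recursively split a (sub-)tile in half along its longer axis until its
// estimated cost falls below the threshold h or the minimum size is reached.
// The cost is assumed to be uniformly distributed over the tile's area.
void splitTile(const SubTile& t, float h, std::vector<SubTile>& out) {
    bool canSplit = t.width > kMinTileSize || t.height > kMinTileSize;
    if (t.estimatedCost <= h || !canSplit) {
        out.push_back(t);
        return;
    }
    SubTile a = t, b = t;
    if (t.width >= t.height) {  // split along the longer axis
        a.width = b.width = t.width / 2;
        b.x += a.width;
    } else {
        a.height = b.height = t.height / 2;
        b.y += a.height;
    }
    a.estimatedCost = b.estimatedCost = t.estimatedCost * 0.5f;
    splitTile(a, h, out);
    splitTile(b, h, out);
}
```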

6.4 Dispatch heuristic

The ordering heuristic attempts to get expensive tiles started with their back-end work as early as possible. This reduces the load imbalance at the end of the back-end. Therefore, the tiles are sorted based on their estimated cost, after splitting. They are then dispatched to available cores in that order, starting with the most expensive (sub-)tiles. If the tiles were sorted during the splitting phase, it may be unnecessary to sort them again. In this case, the sub-tiles could all be inserted at an appropriate location in the work queue to ensure approximate cost based ordering.
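The dispatch ordering itself amounts to a sort by estimated cost, as in the following small sketch (the DispatchItem type is illustrative):

```cpp
#include <algorithm>
#include <vector>

struct DispatchItem { int tileId; float estimatedCost; };

// Dispatch order: most expensive (sub-)tiles first, so that an expensive tile
// is never the last piece of back-end work a core picks up.
void sortForDispatch(std::vector<DispatchItem>& items) {
    std::sort(items.begin(), items.end(),
              [](const DispatchItem& a, const DispatchItem& b) {
                  return a.estimatedCost > b.estimatedCost;
              });
}
```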

6.5 Special rasterizer

We also propose to use a special rasterizer developed specifically for tile splitting. After splitting, one could redistribute a tile's triangles among its sub-tiles, i.e., test which sub-tiles a triangle overlaps and, for each such sub-tile, put the triangle in the sub-tile's triangle list. However, this does not fit well into the front-end/back-end divided pipeline. A better way is to let each core working on a sub-tile go through the entire triangle list of the "parent" tile. The special rasterizer must then discard triangles outside the sub-tile's region, and it must also efficiently discard fragments outside the sub-tile during scan conversion. We have developed a special rasterizer that does exactly this. During hierarchical rasterization, it simply terminates the hierarchical traversal when it reaches pixel regions outside the sub-tile's region. This keeps the required changes to existing pipelines small, which is highly desirable. The rasterizer also performs a triangle bounding-box test prior to scan conversion in order to quickly reject triangles that lie outside the sub-tile altogether. The current implementation always splits a tile in half along its longer axis, which results in dimensions that are always powers of two (assuming that the parent tile's dimensions are powers of two). Most rejection tests can thus be implemented using efficient shift operations.

Shading and blending code may be precompiled and optimized for a certain tile size, which is not well suited for adaptive tile splitting. A simple workaround is to perform the work on a "hot tile" of the original size that is resident in cache; when storing to and loading from the actual render target, only the sub-tile is accessed. Letting each core go through the entire triangle list of the parent tile may increase memory bandwidth, which can have an adverse effect on performance and may offset the gains from better load balancing. This, in combination with increased bin overlap, accounts for the overhead of splitting a tile.
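The rejection tests can be sketched as follows, assuming power-of-two sub-tile dimensions so that containment checks reduce to shifts; the SubTile structure and function names are illustrative and do not correspond to the actual rasterizer code.

    #include <cstdint>

    // Sub-tile region within the parent tile. Width and height are powers of
    // two, stored as log2 values so containment tests can use shifts.
    struct SubTile {
        uint32_t x0, y0;        // top-left corner in pixels
        uint32_t log2W, log2H;  // log2 of width and height in pixels
    };

    // Reject a triangle whose screen-space bounding box lies entirely outside
    // the sub-tile, before any scan conversion work is done.
    inline bool bboxOutsideSubTile(uint32_t minX, uint32_t minY,
                                   uint32_t maxX, uint32_t maxY,
                                   const SubTile& st) {
        uint32_t x1 = st.x0 + (1u << st.log2W);
        uint32_t y1 = st.y0 + (1u << st.log2H);
        return maxX < st.x0 || minX >= x1 || maxY < st.y0 || minY >= y1;
    }

    // During hierarchical rasterization, traversal of a pixel block is stopped
    // as soon as the block's corner falls outside the sub-tile region
    // (unsigned wrap-around makes the shifted value non-zero below x0/y0).
    inline bool blockInsideSubTile(uint32_t blockX, uint32_t blockY,
                                   const SubTile& st) {
        return ((blockX - st.x0) >> st.log2W) == 0 &&
               ((blockY - st.y0) >> st.log2H) == 0;
    }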

6.6 Results and discussion

It proved difficult to fit the coefficients of the non-linear model; because of this, the linear model was chosen. One approach to cost estimation is, as explained above, to perform offline fitting using various scenes. The result of such a fit using the linear model is shown in Figure 21. The model has been trained using scenes from Call of Duty 4, 3DMark06, World in Conflict and Unreal Tournament 3. The individual counters used in the model are shown in Table 1. In order to limit the counters' storage requirements, the model was limited to 7 logarithmic terms, making a total of 8 variables needed to store the counters. Performing runtime fitting of the cost estimation model has been left as future work. During runtime fitting, a much smaller number of counters should suffice, e.g. only linear terms of triangle and shading counters, since the coefficient values are then tailored to the specific scene.
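For illustration, a fitted model of this kind could be evaluated per tile roughly as follows. The split into a constant, linear terms and logarithmic terms mirrors the A, B and C coefficients of Table 1, but the log(1 + x) form and all names are assumptions made for this sketch rather than the exact model used.

    #include <cmath>
    #include <cstddef>
    #include <vector>

    // Evaluate a fitted per-tile cost model. 'linearTerms' and 'logTerms' are
    // values derived from the per-tile counters (triangle counts, shading
    // counters, products with tile size, and so on); 'linearCoeff' and
    // 'logCoeff' are the fitted B and C coefficients, 'constant' is A.
    float estimateTileCost(float constant,
                           const std::vector<float>& linearCoeff,
                           const std::vector<float>& linearTerms,
                           const std::vector<float>& logCoeff,
                           const std::vector<float>& logTerms) {
        float cost = constant;
        for (std::size_t i = 0; i < linearCoeff.size(); ++i)
            cost += linearCoeff[i] * linearTerms[i];
        for (std::size_t i = 0; i < logCoeff.size(); ++i)
            cost += logCoeff[i] * std::log(1.0f + logTerms[i]);
        return cost;
    }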


Figure 21. The linear cost estimation model trained on several different scenes: Call of Duty 4, Crysis, 3DMark06, World in Conflict and Unreal Tournament 3. Each mark represents a tile from one of the render targets in one of the scenes. The graph shows measured vs. predicted timings. The red line shows where a mark would lie if its error were zero.

Coefficient   Type                                  Multiplier

A             (constant)                            1

B             batch count                           1
B             bin size (in bytes)                   1
B             full, normal, 16x16, 4x4, 8x4, 4x8    1
B             full, normal, 16x16, 4x4, 8x4, 4x8    shader
B             full, normal, 16x16, 4x4, 8x4, 4x8    tile size
B             full, normal, 16x16, 4x4, 8x4, 4x8    z-buffer fast path
B             full, normal, 16x16, 4x4, 8x4, 4x8    back-end vs
B             full, normal, 16x16, 4x4, 8x4, 4x8    early-z-cull
B             full, normal, 16x16, 4x4, 8x4, 4x8    early-z-cull · shader
B             full, normal, 16x16, 4x4, 8x4, 4x8    tile size · shader
B             full, normal, 16x16, 4x4, 8x4, 4x8    early-z-cull · tile size · shader

C             normal, 16x16                         1
C             normal, 16x16, 4x4                    tile size
C             4x4                                   early-z-cull
C             normal                                early-z-cull · shader

Table 1. Counters and coefficients used in the cost estimation.

In order to validate the model, real measured timings have been compared to those predicted by the model for our test cases. The results for the simple scene, F.E.A.R. and Unreal Tournament 3 are shown in Figure 22, Figure 23 and Figure 24 respectively.

[Plot: Time (Normalized) vs. Tile # (about 100 tiles); curves for Measurement and Estimation.]

Figure 22. Measured vs. predicted timings for tiles in the simple scene. The tiles are sorted according to increasing predicted timing.

[Plot: Time (Normalized) vs. Tile # (about 500 tiles); curves for Measurement and Estimation.]

Figure 23. Measured vs. predicted timings for tiles in F.E.A.R. The tiles are sorted according to increasing predicted timing.

[Plot: Time (Normalized) vs. Tile # (about 400 tiles); curves for Measurement and Estimation.]

Figure 24. Measured vs. predicted timings for tiles in Unreal Tournament 3. The tiles are sorted according to increasing predicted timing.

Note the large differences in processing time between tiles. Unreal Tournament 3 has a relatively even cost distribution, while F.E.A.R. is very uneven; the simple scene lies somewhere in between. The model fits the scenes well enough for the purpose of finding the 16 most expensive tiles and calculating a suitable threshold. Now that the coefficients for estimating the cost of a tile are known, we can apply tile splitting to the scenes. Measurements from the simple scene and F.E.A.R. are shown in Figure 25 and Figure 26, respectively. The scene from Unreal Tournament 3 uses so many render targets that no core is ever starved for work; the splitting heuristic therefore disables tile splitting for this scene and performance is unchanged. Should the splitting heuristic ignore the other render targets and split anyway, performance would degrade, since splitting a tile incurs a certain amount of overhead.


Figure 25. The simple scene run with tile splitting (above) and without tile splitting (below). The dark grey bars represent front-end (FE) work and the light grey bars represent back-end (BE) work.


Figure 26. A repeated frame in F.E.A.R. run with tile splitting. The timeline encompasses two frames from start to end. The dark grey bars represent front-end (FE) work and the light grey bars represent back-end (BE) work.

The load imbalance in the back-end is much reduced in both the simple scene and F.E.A.R. The time to process a frame in the simple scene is down from 1.00 to 0.67 time units compared to the original framework, a 33% reduction in frame time. The frame time of F.E.A.R. is down from 1.00 to 0.82 time units, an 18% reduction. Note that the measured frame from F.E.A.R. spans two frames in Figure 26. This is because the first frame contains the front-ends of two full frames: the scheduler keeps two frames in flight and tries to hide the load imbalance in the front-end of the first frame by starting to execute the front-end of the next one. The next frame is apparently not readily available to hide the entire load imbalance. The second frame then contains no front-end work at all, since it was already executed during the first. Averaging two such frames yields the frame time of 0.82 time units.

7 Conclusion and future work

The pre-front-end has been shown to be able to hide some of the potential load imbalance in the front-end. However, such imbalance can also be hidden by using a smart scheduling algorithm. In addition, the pre-front-end will probably not be usable until screen-space bounds for geometry batches are an integral part of the graphics pipeline; calculating them solely for the purpose of the pre-front-end is unlikely to improve performance. If such bounds are available, however, the pre-front-end has a very low overhead. This report featured limited results regarding the performance of the pre-front-end, as focus was shifted towards the splitting of tiles. A more rigorous investigation is left as future work. In particular, a more elaborate cost estimation of tiles may be needed that includes the shader complexity of each geometry batch. Still, the information available in the pre-front-end is very limited, making a sophisticated cost model, such as the one used for tile splitting, infeasible.

The tile splitting scheme has been shown to be successful at balancing the load of the back-end without adverse effects on latency. Some scenes benefit greatly from this: F.E.A.R. saw an 18% reduction in frame time and the simple scene a 33% reduction. Other scenes, such as those with many render targets, may not benefit at all; in that case, tile splitting is disabled and performance is unchanged.

In order to balance the load of both the front-end and the back-end, the pre-front-end and tile splitting schemes could potentially be combined. This would require a new splitting heuristic, since the most expensive tiles cannot be identified up front when different tiles become ready for back-end processing at different times. Alternatively, a scheme similar to tile splitting could be adopted for the front-end, which could use the complexity of the current vertex shader as well as the amount of work remaining to determine the size of each geometry batch. The case studies section also highlighted another problem: some tasks were too small and caused lock contention. To alleviate this, very small tasks should be merged into larger ones, both in the front-end and in the back-end. The tile splitting scheme could easily be extended to include such logic.
