Load Balancing in a Tiling Rendering Pipeline for a Many-Core CPU
Master of Science Thesis
Lund, spring 2010

Load balancing in a tiling rendering pipeline for a many-core CPU

Rasmus Barringer*, Engineering Physics, Lund University
Supervisor: Tomas Akenine-Möller†, Lund University/Intel Corporation
Examiner: Michael Doggett‡, Lund University

Abstract

A tiling rendering architecture subdivides a computer graphics image into smaller parts to be rendered separately. This approach extracts parallelism, since different tiles can be processed independently. It also allows for efficient cache utilization for localized data, such as a tile's portion of the frame buffer. These are all important properties for efficient execution on a highly parallel many-core CPU. This master thesis evaluates the traditional two-stage pipeline, consisting of a front-end and a back-end, and discusses several drawbacks of this approach. In an attempt to remedy these drawbacks, two new schemes are introduced: one that is based on conservative screen-space bounds of geometry, and another that splits expensive tiles into smaller sub-tiles.

* [email protected]
† [email protected]
‡ [email protected]

Acknowledgements

I would like to thank my supervisor, Tomas Akenine-Möller, for the opportunity to work on this project as well as for giving me valuable guidance and supervision along the way. Further thanks go to the people at Intel for a lot of help and valuable discussions: thanks Jacob, Jon, Petrik and Robert! I would also like to thank my examiner, Michael Doggett.

Table of Contents

1 Introduction, aim and scope
2 Background
  2.1 Overview of the rasterization pipeline
  2.2 GPUs, CPUs and many-core architectures
  2.3 Tiled rendering
3 Implementation and testing environment
  3.1 Scheduling and dependency analysis
4 Case studies
  4.1 F.E.A.R.
  4.2 Unreal Tournament 3
  4.3 Simple scene
  4.4 Conclusions
5 The pre-front-end
  5.1 Screen space bounds
  5.2 Dependency analysis and the pre-front-end
  5.3 Preserving submission order
  5.4 Results and discussion
6 Tile splitting
  6.1 Per-tile cost estimation
  6.2 Front-end counters
  6.3 Split heuristic
  6.4 Dispatch heuristic
  6.5 Special rasterizer
  6.6 Results and discussion
7 Conclusion and future work
8 References

1 Introduction, aim and scope

The goal of this project was initially to realize an idea, conceived at Intel, concerning load balancing of tiling rendering pipelines. First, the conventional two-stage pipeline was analyzed, and then a pre-front-end was added in an attempt to remedy some of the issues discovered. The purpose was to improve performance. As the project evolved, another idea was conceived that concerns cost estimation and splitting of tiles.

This report is organized into the following sections: background; implementation and testing environment; case studies; the pre-front-end; tile splitting; and conclusion and future work. The background section contains information on rasterization pipelines as well as the graphics processing unit (GPU), the central processing unit (CPU), and the many-core CPU. The implementation and testing environment section describes the system used for implementing our new algorithms and for evaluating performance. The case studies section describes a number of scenes that are used to highlight some of the problems in the existing pipeline. The following two sections explain our two novel schemes aiming to improve the overall performance of a tiling rendering pipeline. The conclusion and future work section discusses some lessons learned as well as things that may be interesting to research more in-depth in the future.

2 Background

2.1 Overview of the rasterization pipeline

In a rasterization-based three-dimensional graphics pipeline, geometry is mapped to screen space using some transformation from three-dimensional to two-dimensional space. This two-dimensional geometry can then be rasterized in the form of pixels on the screen, i.e., pixels in the frame buffer. In the most basic case, the geometry consists of triangles and the transformation consists of a series of matrices: a world matrix, a view matrix and a projection matrix. A triangle consists of three vertices that contain their position and optionally other attributes such as color values.

The world matrix, W, represents the transformation from object space to world space; if a group of triangles is moving around, it makes sense to use a transformation matrix for its movement rather than moving the individual coordinates of each triangle. The view matrix, V, corresponds to the viewer's position and rotation in the scene. The projection matrix, P, projects the vertices onto the two-dimensional screen. The combined matrix can now be formed:

    M = PVW

Transforming a point p is now reduced to a simple matrix-vector multiplication:

    p̂ = Mp

where p̂ represents the screen-space position in homogeneous coordinates (before division by w).

Once the triangles are projected on the screen, half-plane equations¹ can be formulated for the edges of each triangle [8], called edge equations. Given a triangle with the screen-space vertex positions p0 = (x0, y0), p1 = (x1, y1) and p2 = (x2, y2), the edge equations take the form (cf. cross product):

    e0(x, y) = (x1 − x0)(y − y0) − (y1 − y0)(x − x0)
    e1(x, y) = (x2 − x1)(y − y1) − (y2 − y1)(x − x1)
    e2(x, y) = (x0 − x2)(y − y2) − (y0 − y2)(x − x2)

Given that p0, p1 and p2 are defined in a counterclockwise manner, these equations are all positive when a point is inside the triangle (see Figure 1 below). This property can be used to determine whether pixels (or regions thereof) are inside a triangle. One possibility is to use the recursive descent algorithm described by Greene [4].

¹ The term half-plane equation comes from the fact that each equation divides the plane into two parts: one where it is positive and one where it is negative.

[Figure 1. A triangle and its edge functions. A point (x, y) is inside the triangle if all edge functions are positive.]
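To make the edge-function test concrete, below is a minimal C++ sketch (not the pipeline implementation evaluated in this thesis) that evaluates the three edge equations for a point. The names Vec2, edgeFunction and pointInTriangle are illustrative placeholders, and the vertices are assumed to already be transformed to screen space, as described above, and to be wound counterclockwise.

    #include <cstdio>

    struct Vec2 { float x, y; };

    // Edge function for the directed edge a -> b, evaluated at point p.
    // This is the z-component of the cross product (b - a) x (p - a); it is
    // positive on one side of the edge's supporting line and negative on the
    // other.
    static float edgeFunction(const Vec2& a, const Vec2& b, const Vec2& p)
    {
        return (b.x - a.x) * (p.y - a.y) - (b.y - a.y) * (p.x - a.x);
    }

    // With counterclockwise winding, a point is inside the triangle when all
    // three edge functions are positive.
    static bool pointInTriangle(const Vec2& p0, const Vec2& p1, const Vec2& p2,
                                const Vec2& p)
    {
        return edgeFunction(p0, p1, p) > 0.0f &&
               edgeFunction(p1, p2, p) > 0.0f &&
               edgeFunction(p2, p0, p) > 0.0f;
    }

    int main()
    {
        const Vec2 p0 = {0.0f, 0.0f}, p1 = {4.0f, 0.0f}, p2 = {0.0f, 4.0f};
        const Vec2 a = {1.0f, 1.0f}, b = {5.0f, 5.0f};
        std::printf("a inside: %d, b inside: %d\n",
                    pointInTriangle(p0, p1, p2, a),
                    pointInTriangle(p0, p1, p2, b));
        return 0;
    }

Hierarchical rasterizers, such as the recursive descent approach by Greene referenced above, apply conservative versions of the same test to whole screen-space regions before descending to individual pixels.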
Pixels that are inside should then be colored according to the appearance of the triangle. This appearance can be described using various color values, lighting calculations and texture lookups.

At the dawn of graphics processing units (GPUs), which are used to accelerate real-time graphics, the graphics pipeline was limited to a fixed-function path much like the one described. Transformations were limited to simple affine transforms described by matrices. The coloring consisted of simple per-vertex local illumination models coupled with texture lookups. In more recent times, both the vertex processing and the pixel processing have become programmable. A GPU now executes specialized programs, called vertex shaders, for each vertex, determining its screen-space position as well as other attributes needed for per-pixel processing. For each pixel inside a triangle another program, a pixel shader, is executed that, given the interpolated per-vertex attributes, determines the final color of the pixel. The trend is clearly moving towards a more programmable pipeline, as more recent hardware includes a geometry shader capable of generating triangles on the fly. Next-generation GPUs include programmable tessellation support in the form of hull and domain shaders. Pixels may also be rasterized to off-screen surfaces that are later used for shading. The general term that includes both frame buffers and such off-screen surfaces is render target.

2.2 GPUs, CPUs and many-core architectures

Modern GPUs include some fixed-function logic, but mostly they consist of floating-point stream processor units (SPUs). These SPUs are specifically designed to run highly parallel shader programs. As GPU programmability increases, they are able to facilitate an increasing number of applications traditionally reserved for the CPU.
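As a rough illustration of the programmable vertex and pixel stages described above, the sketch below models them as C++ callables, roughly the way a software pipeline running on a CPU can expose such shader programs. The type and variable names (Vertex, Varyings, VertexShaderFn, PixelShaderFn) are hypothetical and are not the interface used in this thesis.

    #include <array>
    #include <functional>

    struct Vertex   { std::array<float, 3> position; std::array<float, 3> color; };
    struct Varyings { std::array<float, 4> clipPosition; std::array<float, 3> color; };
    using Color = std::array<float, 4>;

    // Vertex shader: runs once per vertex and outputs a position plus any
    // attributes that should be interpolated across the triangle.
    using VertexShaderFn = std::function<Varyings(const Vertex&)>;

    // Pixel shader: runs once per covered pixel and turns the interpolated
    // attributes into the final color written to the render target.
    using PixelShaderFn = std::function<Color(const Varyings&)>;

    // Example: a pass-through vertex shader and a pixel shader that simply
    // outputs the interpolated vertex color.
    const VertexShaderFn vertexShader = [](const Vertex& v) {
        return Varyings{{v.position[0], v.position[1], v.position[2], 1.0f},
                        v.color};
    };
    const PixelShaderFn pixelShader = [](const Varyings& in) {
        return Color{in.color[0], in.color[1], in.color[2], 1.0f};
    };

In a fixed-function pipeline these two functions are effectively hard-wired; making them user-supplied programs is what the programmability discussed above refers to.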