POWER OPTIMIZATIONS FOR GRAPHICS PROCESSORS
B.V.N.SILPA
DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING INDIAN INSTITUTE OF TECHNOLOGY DELHI
March 2011
POWER OPTIMIZATIONS FOR GRAPHICS PROCESSORS
by
B.V.N.SILPA
Department of Computer Science and Engineering
Submitted in fulfillment of the requirements of the degree of Doctor of Philosophy
to the
Indian Institute of Technology Delhi
March 2011
Certificate
This is to certify that the thesis titled Power Optimizations for Graphics Processors being submitted by B V N Silpa for the award of Doctor of Philosophy in Computer Science & Engg. is a record of bona fide work carried out by her under my guidance and supervision at the Department of Computer Science & Engineering, Indian Institute of Technology Delhi. The work presented in this thesis has not been submitted elsewhere, either in part or full, for the award of any other degree or diploma.
Preeti Ranjan Panda Professor Dept. of Computer Science & Engg. Indian Institute of Technology Delhi
Acknowledgment
It is with immense gratitude that I acknowledge the support and help of my supervisor, Professor Preeti Ranjan Panda, in guiding me through this thesis. I would like to thank Professors M. Balakrishnan, Anshul Kumar, G.S. Visweswaran and Kolin Paul for their valuable feedback, suggestions and help in all respects. I am indebted to my dear friend G Krishnaiah for being my constant support and an impartial critic. I would like to thank Neeraj Goel, Anant Vishnoi and Aryabartta Sahu for their technical and moral support. I would also like to thank the staff of the Philips, FPGA and Intel laboratories at IIT Delhi for their help. I owe my deepest gratitude to Microsoft Research India for supporting my research by granting me the MSR fellowship. I would also like to thank Intel India Pvt. Limited for funding my research. I would like to specially thank Kumar S S Vemuri for the mentorship I received from him. This thesis is dedicated to my family members, who have shown immense patience and provided me with great support during the course of my work.
B V N Silpa
ABSTRACT
Advances in computer graphics have led to the creation of sophisticated scenes with realistic characters and fascinating effects. As a consequence, the computational complexity of graphics applications has also increased tremendously. With increasing interest in sophisticated graphics capabilities in mobile systems, the energy consumption of graphics hardware is becoming a major design concern in addition to the traditional performance enhancement criteria. This motivates us to focus on designing low power graphics processors for mobile devices. We present the first comprehensive power optimization work targeting the computer graphics rendering pipeline. The power minimization is targeted at different levels of abstraction: component level, compiler level, and full system level. The main contributions of this thesis are the following:
• A custom memory architecture for low power texture memory sub-system.
• A code optimization technique that reduces the computational complexity and hence the power consumption of the geometry engine.
• System level power optimization by Dynamic Voltage and Frequency Scaling for tiled graphics processors.
Among the different steps in the graphics processing pipeline, we observe that memory accesses during texture mapping, a highly memory-intensive phase, contribute 30-40% of the energy consumed in typical embedded graphics processors. This makes the texture mapping subsystem an attractive candidate for energy optimization. We argue that a standard cache hierarchy, commonly used by researchers and commercial graphics processors for texture mapping, is wasteful of energy, and propose the Texture Filter Memory, an energy-efficient architecture that exploits locality and the relatively high degree of predictability in texture memory access patterns. Our architecture consumes 75% less energy for texturing in a fixed function pipeline and about 85% less energy in parallel rasterization hardware. It also achieves 7% more hits than the partitioned cache generally used for multitexturing. Interestingly, our proposed architecture also achieves higher performance than conventional texture mapping hardware. We also demonstrate that the introduction of these filter buffers helps greatly in reducing the leakage power consumption of the texture memory subsystem: our proposed drowsy texture L1 with predictive wake-up achieves 80% leakage power savings at the cost of less than 1% performance loss.

We have observed that the geometry engine also contributes significantly to the total power consumption in modern games. This is because the creation of scenes with increasing levels of detail is escalating the amount of geometry per frame, making the geometry engine one of the computationally intensive stages of the pipeline. In this thesis we propose a mechanism to reduce the amount of computation in the geometry engine, thereby reducing its power consumption and at the same time speeding up geometry processing.
This is achieved by partitioning the vertex shader into position-variant and position-invariant parts and executing the position-invariant part of the shader only on those triangles that pass the trivial reject test. Our main contributions here are: (i) a partitioning algorithm that attempts to minimize the duplication of code between the two partitions of the shader, and (ii) an adaptive mechanism that enables vertex shader partitioning so as to minimize the overhead incurred due to thread setup of the second stage of the shader. From the results we observe a saving of up to 50% of vertex shader instructions and hence a speed-up of up to 15%. Given the significant savings in vertex shader instructions, we can expect attractive savings in the power consumed by the geometry engine.

From a study of various modern games we observe that the workload varies significantly with time, and hence can benefit from dynamic voltage and frequency scaling (DVFS), which reduces the system-level power consumption of the GPU. Since the visual quality of graphics applications is highly dependent on the rate at which frames are processed, it is important to devise a DVFS scheme that minimizes deadline misses due to inaccuracies in workload prediction. We demonstrate that tiled graphics renderers exhibit substantial advantages over immediate-mode renderers in obtaining access to frame parameters that help in enhancing workload estimation accuracy. We also show that operating at the finer granularity of tiles, as opposed to frames, allows early detection of, and corrective action on, a mis-prediction. We propose an accurate workload estimation technique and two DVFS schemes for tiled-rendering architectures, namely (i) tile-history based DVFS and (ii) tile-rank based DVFS. The proposed schemes are demonstrated to be more efficient in terms of power and performance than the frame-level DVFS schemes proposed in recent literature.
On a system with 8 DVFS levels, our tile-history based DVFS scheme results in a 60% improvement in quality (deadline misses) over the frame-history based DVFS schemes and gives a 58% saving in energy. The more sophisticated tile-rank based scheme achieves a 75% improvement in quality over the frame-history based DVFS scheme and also results in a 58% saving in energy. We have also compared the efficiency of the proposed tile-level DVFS schemes with frame-level schemes for increasing numbers of DVFS levels, and found that while the frame-level schemes suffer from increasing deadline misses as the number of frequency levels increases, the impact on our tile-level schemes is negligible. The energy per frame-rate for our scheme is the minimum, indicating that it delivers the best performance-energy results.
Contents
List of Figures
List of Tables

1 Introduction
  1.1 Introduction to Graphics Processing
    1.1.1 Application Stage
    1.1.2 Geometry
    1.1.3 Triangle Setup
    1.1.4 Rasterization
    1.1.5 Display
  1.2 Graphics Processor Architecture
    1.2.1 Immediate Mode Rendering Engines
    1.2.2 Tiled Graphics Engines
  1.3 Power Dissipation in a Graphics Processor
  1.4 Our Contribution
  1.5 Thesis Outline

2 Literature Survey
  2.1 Programmable Units
    2.1.1 Clock Gating
    2.1.2 Fixed Function ALUs
    2.1.3 Predictive Shutdown
  2.2 Texture Unit
    2.2.1 Low power cache configurations
    2.2.2 Texture Compression
    2.2.3 Clock Gating
  2.3 Frame Buffer
    2.3.1 Depth Buffer Compression
    2.3.2 Color Buffer Compression
  2.4 System Level Power Management
    2.4.1 Power Modes
    2.4.2 Dynamic Voltage and Frequency Scaling
    2.4.3 Multiple Power Domains
  2.5 Miscellaneous

3 Texture Filter Memory
  3.1 Introduction
  3.2 Texture Mapping Access Pattern
  3.3 Architecture of Texture Filter Memory
    3.3.1 Texture Buffer Array
    3.3.2 Address Comparators
    3.3.3 Controller
  3.4 Static Power Reduction due to Texture Filter Memory
    3.4.1 Predictive wake-up
  3.5 Extension to other architectures and filters
    3.5.1 Anisotropic Filtering
    3.5.2 Parallel Texturing
    3.5.3 Multi Texturing
  3.6 Experiments and Results
    3.6.1 Evaluation of the TFM architecture
    3.6.2 Leakage Power and Delay Comparison
    3.6.3 Parallel rasterization architecture
    3.6.4 Multitexturing
  3.7 Summary

4 Vertex Shader Partitioning
  4.1 Introduction
  4.2 Shader Compiler
    4.2.1 Partitioning with Minimum Duplication
    4.2.2 Comparison with the naïve algorithm
    4.2.3 Selective Partitioning of Vertex Shader
  4.3 Framework
    4.3.1 Partitioning the assembly code
    4.3.2 Vertex Input Buffer
    4.3.3 Feeder Unit
  4.4 Experiments and Results
    4.4.1 Performance Improvement
    4.4.2 Energy Reduction
  4.5 Summary

5 DVFS for Tiled GPUs
  5.1 Introduction
  5.2 Workload Estimation of a Tiled Graphics Processor
    5.2.1 Frame Rank Computation
    5.2.2 Tile Rank Computation
    5.2.3 Extraction of Ranks
  5.3 DVFS Scheme
    5.3.1 Tile History Based DVFS
    5.3.2 Tile Rank Based DVFS
  5.4 Experiments and Results
    5.4.1 Performance Impact
    5.4.2 Energy Saving
  5.5 Summary

6 Conclusion and Future Work
  6.1 Summary and conclusion
  6.2 Future Work
Bibliography

List of Figures
1.1 Evolution of character modeling over the years [1]
1.2 Increasing transistor counts in GPUs [1]
1.3 Power break-up in a typical desktop computer [2]
1.4 Power break-up in a typical mobile computer [2]
1.5 Graphics Pipeline
1.6 Three different poses of the same character in a 3D game
1.7 Transformations on an object
1.8 Object space culling
1.9 Space-to-space transformations on an object
1.10 Types of light sources
1.11 Scan line conversion
1.12 Texture mapping example
1.13 Antialiasing
1.14 Double buffering
1.15 The CPU – GPU interface
1.16 Fixed function GPU
1.17 Triangles sharing vertices
1.18 Indexed addressing into vertex buffer
1.19 Tiled triangle traversal
1.20 Unified shader architecture for graphics processor
1.21 Tiled rendering
1.22 Tiled graphics pipeline
1.23 Tiled graphics processor architecture
1.24 State management in tiled GPUs
1.25 Fraction of energy consumed in texture unit
1.26 Footprint of the die of GT200 (approximately to scale)

2.1 Processing Element (PE)
2.2 S3TC texture compression
2.3 PID controller
2.4 PID controller based DVFS for graphics processor
2.5 Signature based DVFS for graphics processor

3.1 Texture mapping to model a globe
3.2 Oblique traversal of scanlines in texture space
3.3 Footprint of a Bilinear filter
3.4 Scenarios to which the texture footprint could be mapped
3.5 Blocked representation of texture
3.6 Distribution of texture accesses between various cases
3.7 TFM architecture
3.8 Pre wake-up – Case 1
3.9 Pre wake-up – Case 2
3.10 Pre wake-up – Case 3
3.11 Pre wake-up – Case 4
3.12 Pre-wakeup prediction accuracy
3.13 TFM for parallel texturing
3.14 Hit rate comparison for texture memory architectures
3.15 Average access energy comparison of texture memory architectures
3.16 Average access time comparison of texture memory architectures
3.17 Area of several texture memory architectures
3.18 Leakage power consumption of various texture memory architectures
3.19 Delay overhead of various drowsy policies
3.20 Hit rate in parallel texture cache architecture
3.21 Average access energy in parallel texture cache architecture
3.22 Area of the parallel texture caching architectures
3.23 Hit rates for multitexturing

4.1 Vertex shader partitioning example
4.2 Pipeline modified to support vertex shader partitioning
4.3 DAG representing the data-flow in a vertex shader
4.4 Vertex shader partitioning algorithm: Case 1
4.5 Vertex shader partitioning algorithm: Case 2
4.6 Comparison of proposed algorithm with the existing one
4.7 Variation of Trivial Rejects across frames of UT2004
4.8 ATTILA architecture
4.9 Modified ATTILA architecture
4.10 Feeder unit architecture
4.11 % Instructions saved due to adaptive vertex shader partitioning
4.12 % Cycles saved due to adaptive vertex shader partitioning
4.13 Frame captures from the games

5.1 Workload variation in games
5.2 I/O buffering for video decoder
5.3 Buffer occupancy based DVFS for video decoder
5.4 Buffer occupancy based correction of workload prediction
5.5 Consecutive frames of UT2004
5.6 Consecutive frames of UT2004
5.7 Accuracy of tile history based prediction
5.8 Consecutive frames of UT2004
5.9 Accuracy of frame rank based predictions
5.10 Area of overlap as a measure of pixel count
5.11 Accuracy of tile rank based prediction
5.12 Tile level DVFS
5.13 Deadline misses at target frame rate of 60 fps
5.14 Frame rate at target of 60 fps
5.15 Normalized energy at 60 fps
5.16 Normalized energy at 30 fps
5.17 Energy per frame-rate at 30 fps

List of Tables
2.1 Exponential encoding for color buffer compression

4.1 % Trivial rejects per frame
4.2 % Instructions saved due to vertex shader partitioning

5.1 DVFS results at 60 FPS
5.2 DVFS results at 30 FPS
Chapter 1
Introduction
The field of computer graphics has advanced profoundly in recent years, leading to the creation of realistic characters and fascinating effects. What started as a tool to generate 2D pixel art has now evolved to be capable of generating complex 3D images incorporating intricate details of objects. The evolution of video game character animation over the years, shown in Figure 1.1, demonstrates this fact. The real-time generation of these complex images in computer graphics applications was made possible by advances in the semiconductor industry, which provided more and more transistors in each technology node. Initially, graphics applications were run on general-purpose CPUs. However, the gradual increase in the complexity of these applications led to the development of hardware accelerators for graphics processing called Graphics Processing Units (GPUs). Since then, each generation of GPUs has witnessed a tremendous increase in transistor density. This can be observed from Figure 1.2, which shows the transistor counts of several generations of nVidia graphics cards [1]. With aggressive technology scaling over the years, the computational capacity of mobile platforms has also increased tremendously. As a result, traditional 3D graphics applications that were developed for desktops or dedicated gaming consoles, such as games, GPS-backed maps, screen savers, and animated chats, are emerging as possible applications for mobile devices as well. The challenge in porting complex 3D graphics applications onto mobile platforms is posed not so much by performance as by power consumption. Since mobile devices are powered by batteries, and battery capacity is not increasing on par with the processing power of chips, the gap between the demand and supply of power is widening. Moreover, due to constraints
Figure 1.1: Evolution of character modeling over the years [1]: (a) Pacman (1980s): basic 3D image; (b) Toy Soldier (2000): real-time reflection and shadow; (c) Zoltar (2001): real-time lip movements; (d) Dawn (2003): realistic features; (e) Nalu (2004): soft shadows on skin and hair; (f) Mad Mod Mike (2005): realistic clothes; (g) Adrianne (2006): complex shading and deformation; (h) Human Head (2007): realistic skin texture and deformation.
Figure 1.2: Increasing transistor counts in GPUs [1]
on the size of these devices, the cooling solutions that can be used in them are limited. In fact, current-generation semiconductor technology is said to have hit the "power wall". As a consequence, the problem of inadequate cooling is not limited to mobile devices, but also applies to desktop systems. The increasing popularity of graphics applications has thus introduced an additional constraint of power into the design of a graphics subsystem, alongside the existing ones of performance and quality. A recent study of PC energy efficiency trends by Intel [2] shows that the graphics processor is a major contributor to the total energy dissipated in desktop and mobile computers, as shown in Figure 1.3 and Figure 1.4 respectively.
From the figures we can see that the GPU consumes as much power as the CPU in desktop computers, while in mobile computers the GPU consumes double the power of the CPU, making it the major source of power dissipation in these systems. It is therefore very important to focus our attention on low power solutions for graphics processing. In this thesis, we investigate the major sources of power consumption in a graphics processor and propose optimizations for the same.
Figure 1.3: Power break-up in a typical desktop computer [2]: Monitor 56%, power supply loss 22%, Other 7%, Graphics 6%, CPU 4%, HDD/DVD 4%, VR 1%.
1.1 Introduction to Graphics Processing
The aim of graphics processing is to generate the view of a scene on a display device. The pipeline processes the complex geometry present in the scene, represented using several smaller primitives such as triangles, lines, etc., to produce the color corresponding to each position on the 2D screen, called a pixel. Several operations are applied in sequential order on the data representing the mathematical model of an object to create its graphical representation on the screen. The high-level view of the flow of these operations, generally called the Graphics Pipeline, is illustrated in Figure 1.5.
1.1.1 Application Stage
The application layer acts as an interface between the player and the game engine. Based on the inputs from the player, the application layer places the view camera, which defines
Figure 1.4: Power break-up in a typical mobile computer [2]: 14.1" LCD 33%, Graphics 14%, with the remainder split among the CPU, chipset, HDD/DVD, cooling fan, power supply loss, and the rest of the platform.
the position of the "eye" of the viewer in 3D space. The application also maintains the geometry database, which is the repository of the 3D world and the objects used in the game, represented as geometric primitives (triangles, lines, points, etc.). Every object is associated with a position attribute defining its placement in the world space, and a pose defining the orientation of the object and its movable parts with respect to a fixed point on the object, as shown in Figure 1.6. Animation is the process of changing the position and pose of objects from frame to frame so as to create the visual effect of motion. The movement of objects in a frame can be brought about by a combination of translation, rotation, scaling, and skewing operations, as shown in Figure 1.7:
• Translation: Displacement of an object from one position to another.
• Rotation: Movement of an object around an axis, causing angular displacement.
Figure 1.5: Graphics Pipeline: Application (position camera; animation; frustum culling; occlusion culling; LOD calculation) → Geometry (translate, rotate and scale; transform to view space; lighting; perspective divide; clipping and culling) → Triangle Setup (slope and delta calculation; scan-line conversion) → Rasterization (shading; texturing; fog; alpha, depth and stencil tests; antialiasing) → Display (swap buffers; screen refresh).
Figure 1.6: Three different poses of the same character in a 3D game
• Scaling: Resizing of an object to cause the perception of depth. The object appears to magnify as it approaches the viewer and to diminish as it moves away.
• Skewing: Reshaping an object by scaling it along one or more axes, effectively changing the pose of the object.
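In practice, all four of these operations are expressed as 4x4 matrices acting on homogeneous vertex coordinates, so a chain of movements collapses into a single matrix multiply. The following NumPy sketch is purely illustrative; the helper names are ours, not part of any graphics API:

```python
import numpy as np

def translate(tx, ty, tz):
    """4x4 homogeneous translation matrix."""
    m = np.eye(4)
    m[:3, 3] = [tx, ty, tz]
    return m

def scale(sx, sy, sz):
    """4x4 scaling matrix (per-axis factors)."""
    return np.diag([sx, sy, sz, 1.0])

def rotate_z(theta):
    """4x4 rotation about the Z axis by theta radians."""
    c, s = np.cos(theta), np.sin(theta)
    m = np.eye(4)
    m[0, 0], m[0, 1] = c, -s
    m[1, 0], m[1, 1] = s, c
    return m

# One vertex in homogeneous coordinates: translate it, then rotate
# the result by 90 degrees about Z -- both folded into one matrix m.
v = np.array([1.0, 0.0, 0.0, 1.0])
m = rotate_z(np.pi / 2) @ translate(2.0, 0.0, 0.0)
print(np.round(m @ v, 6))   # the translated point (3,0,0) rotates to (0,3,0)
```

Skewing would simply be one more matrix of the same shape with off-diagonal terms; composing the matrices once per object is what makes per-vertex transformation cheap.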
The application associates the objects in the geometry database with transformations as determined by the game play. The actual transformation of the objects' primitives happens in the next stage of the pipeline, the geometry stage. In addition, the application layer identifies possible collisions among objects in the frame and generates a response in accordance with the game play. This layer is also responsible for processing the artificial intelligence (AI), physics, audio, networking, etc.

Since graphics processing is highly computation intensive, efforts are made to reduce the workload by avoiding operations that would result in no actual or perceivable change in what is displayed on the screen. One such technique, called Frustum Culling, is employed by almost all game engines to avoid rendering objects that fall totally outside the view frustum, as shown in Figure 1.8 (Object A). Another is Occlusion Culling, a visibility test performed by the application to identify objects that are totally obstructed by some other object in the scene, as shown in Figure 1.8 (Object C). Since these objects would not be visible on the screen anyway, processing them can be avoided, saving computation time. Another popular workload reduction technique is to adjust the precision at which objects are rendered based on their distance from the view camera: as the distance between an object and the eye increases, the precision with which its details can be perceived decreases, so a far-off object can be modeled with fewer primitives than a nearer one. This Level of Detail (LOD) based technique results in loss of model detail, whereas frustum culling and occlusion culling are lossless.

Figure 1.7: Transformations on an object: (a) translation; (b) rotation; (c) scaling; (d) skewing

Figure 1.8: Object space culling: (a) view space; (b) screen space
1.1.2 Geometry
The geometry engine receives as inputs from the application stage the vertices representing the primitives of the objects. The first step in the geometry stage is to apply to the primitives the transformations associated with them in the application stage; the various transformations are illustrated in Figure 1.7. In newer pipeline implementations, the geometry engine is also capable of animating the primitives; in this case, the transformations are generated and applied by the geometry engine itself. In addition to these transformations, the geometry engine also needs to apply space-to-space transformations to the primitives. The various spaces used to represent a scene are illustrated in Figure 1.9 and discussed below:
• Model Space: where each object is described with respect to a co-ordinate system centered at a point on the object.
• World Space: where all the objects that form the scene are placed in a common co-ordinate space.
• View Space: where the camera/eye forms the center of the world, thus representing the world as seen by the viewer.

Figure 1.9: Space-to-space transformations on an object: (a) model space; (b) world space; (c) view space
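The three spaces above are linked by matrix transforms of the same form as the animation transforms, so the model-to-world and world-to-view steps can be pre-multiplied into one model-view matrix per object. A minimal sketch, assuming for simplicity a camera placed by a pure translation (all names are illustrative):

```python
import numpy as np

def translation(t):
    """4x4 homogeneous translation by vector t."""
    m = np.eye(4)
    m[:3, 3] = t
    return m

# Model space -> world space: the object sits at (5, 0, 0).
model_to_world = translation([5.0, 0.0, 0.0])

# World space -> view space: the camera sits at (0, 0, 10); the
# viewing transform is the inverse of the camera's own placement.
camera_pose = translation([0.0, 0.0, 10.0])
world_to_view = np.linalg.inv(camera_pose)

# Transform in two steps, or pre-multiply once per object.
model_view = world_to_view @ model_to_world
v = np.array([0.0, 0.0, 0.0, 1.0])   # the object-space origin
print(model_view @ v)                # world point (5,0,0) lands at view-space z = -10
```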
To transform the primitives from model space to view space, they are either first transformed to world space and then to view space, or transformed directly to view space. In terms of operations, these space-to-space transformations are again combinations of translations and rotations.

The next step is lighting the vertices, taking into account the light sources present in the scene as well as the reflections from the objects in the scene. The lighting of primitives can be done either at vertex level or at pixel level. Though pixel-level shading produces better effects, its downside is a heavy computational workload; the choice between per-vertex and per-pixel shading is thus a trade-off between accuracy and workload. Various kinds of light sources are considered in 3D graphics, as shown in Figure 1.10:
• Directional Light: This refers to a source of light placed at an infinite distance from the object. This kind of light illuminates all the objects facing the source equally.
• Point Light: This is light emanating from a point source but spreading in all directions. Its illumination decreases with distance from the source.
Figure 1.10: Types of light sources
• Spot Light: This refers to directional light emanating from a point source, illuminating only the objects that fall within the conical region into which it spreads its light.
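The three source types differ mainly in how the light direction and attenuation enter the lighting equation. A hedged sketch of the distinction, using a simple Lambertian diffuse term and an assumed 1/(1 + r^2) falloff (not the specific lighting model of any pipeline discussed here):

```python
import numpy as np

def diffuse(normal, light_dir, intensity):
    """Lambertian diffuse term: clamp(N dot L, 0) * intensity."""
    return max(np.dot(normal, light_dir), 0.0) * intensity

n = np.array([0.0, 1.0, 0.0])     # surface normal at the lit point

# Directional light: one fixed direction, no falloff with distance.
d = diffuse(n, np.array([0.0, 1.0, 0.0]), 1.0)

# Point light: intensity attenuates with distance r from the source.
r = 2.0
p = diffuse(n, np.array([0.0, 1.0, 0.0]), 1.0 / (1.0 + r * r))

# Spot light: a point light that contributes nothing outside the cone
# defined by its axis `spot_dir` and its cut-off half-angle.
spot_dir = np.array([0.0, -1.0, 0.0])     # axis the spot points along
to_surface = np.array([0.0, -1.0, 0.0])   # source-to-surface direction
cutoff = np.cos(np.radians(30.0))
in_cone = np.dot(to_surface, spot_dir) >= cutoff
s = diffuse(n, -to_surface, 1.0 / (1.0 + r * r)) if in_cone else 0.0

print(d, p, s)   # 1.0 0.2 0.2
```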
After the per-vertex operations of transformation and lighting are done, the vertices are assembled into triangles. Before the triangles are sent down the pipeline for further processing, primitives that would not contribute to the pixels finally displayed on the screen are discarded, so as to reduce the workload on the pixel processor. As a first step, the geometry engine identifies the triangles that fall partially or totally outside the view frustum. Primitives that fall totally outside the frustum are trivially rejected; primitives that are only partially outside are divided into smaller primitives so that the parts falling outside the frustum can be clipped off. In addition, triangles that face away from the camera are also trivially rejected; this process is called back-face culling. For example, independent of the viewing point, half of a solid sphere's surface is always invisible and can hence be discarded.
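Back-face culling itself reduces to a sign test: with a consistent winding convention, the sign of a triangle's screen-space signed area tells whether it faces the camera. An illustrative sketch, assuming counter-clockwise winding is front-facing:

```python
def is_back_facing(v0, v1, v2):
    """Cull test for a triangle given its screen-space vertices.

    With counter-clockwise winding defined as front-facing, a
    non-positive signed area means the triangle faces away from
    the camera and can be trivially rejected.
    """
    (x0, y0), (x1, y1), (x2, y2) = v0, v1, v2
    signed_area = (x1 - x0) * (y2 - y0) - (x2 - x0) * (y1 - y0)
    return signed_area <= 0.0

# A counter-clockwise triangle is kept; swapping two vertices
# flips the winding, and the triangle is culled.
print(is_back_facing((0, 0), (1, 0), (0, 1)))   # False
print(is_back_facing((0, 0), (0, 1), (1, 0)))   # True
```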
1.1.3 Triangle Setup
So far in the pipeline, the scene is represented in terms of triangles, lines, and points, but what is finally displayed on the screen is a 2D array of points called pixels. In order to progress towards this final objective, the triangles are first divided into a set of parallel horizontal lines called scan-lines, as shown in Figure 1.11. These lines are further divided into points, forming the 2D array of points called fragments. The scan-line conversion of the triangles occurs during the triangle setup phase; the scan-lines are then passed on to the rasterization engine, which generates pixels from these lines. While dividing the triangles into the corresponding scan-lines, the triangle setup unit calculates the attributes (depth, color, lighting factor, texture co-ordinates, normals, etc.) of the end points of the lines through interpolation of the vertex attributes of the triangle.

Figure 1.11: Scan line conversion
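The per-scanline increments of Figure 1.11, where each edge endpoint advances by the inverse slope 1/m on every line, can be sketched as follows (an illustrative edge walker, not any particular setup unit's implementation):

```python
def edge_x_per_scanline(x_top, y_top, x_bot, y_bot):
    """Walk one triangle edge, yielding its x intersection per scanline.

    As in Figure 1.11, the x endpoint advances by the inverse slope
    1/m on every scanline, so each row costs a single addition.
    """
    inv_slope = (x_bot - x_top) / (y_bot - y_top)
    x = x_top
    for y in range(int(y_top), int(y_bot)):
        yield y, x
        x += inv_slope

# Left edge of a triangle running from (0, 0) down to (4, 4).
print(list(edge_x_per_scanline(0.0, 0.0, 4.0, 4.0)))
# [(0, 0.0), (1, 1.0), (2, 2.0), (3, 3.0)]
```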
1.1.4 Rasterization
The raster engine generates the pixels from the scan-lines received from the setup unit. Each pixel is associated with a color stored in a color buffer and depth stored in a depth buffer. These two buffers together form the framebuffer of the graphics processor. The aim of the pixel processor is to compute the color of the pixel displayed on the screen. The various operations involved in this processing are enumerated below:
Shading: In this step, the lighting values of the pixels are computed, either by assigning a weighted average of the vertex lighting values or, for greater accuracy, by actually computing the lighting at each pixel, using one of the following models:
• Flat Shading: The lighting value of a pixel is assigned the average of the lighting values of the vertices (lit in the geometry stage) of the primitive. This is the simplest of the shading models, but also highly inaccurate.
• Gouraud Shading: In this model, the lighting value of a pixel is computed as the weighted average of the lighting values of the end points of the scanline. This model gives good quality results with relatively small computation overhead, and has hence been the most popular shading technique.
• Phong Shading: In this model, the shading normal of the pixel is generated as the interpolation of the shading normals associated with the ends of the scanline. The generated normal is used in the computations involved in lighting the pixel.
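The difference between the first two models can be made concrete with a small sketch (illustrative scalar lighting values; the function names are ours):

```python
def flat_shade(l0, l1, l2):
    """Flat: one lighting value per triangle, averaged from its vertices."""
    return (l0 + l1 + l2) / 3.0

def gouraud_shade(l_left, l_right, x, x_left, x_right):
    """Gouraud: interpolate the scanline endpoints' lighting at pixel x."""
    t = (x - x_left) / (x_right - x_left)
    return (1.0 - t) * l_left + t * l_right

print(flat_shade(0.25, 0.5, 0.75))              # 0.5 for every pixel of the triangle
print(gouraud_shade(0.0, 1.0, 2.5, 0.0, 10.0))  # 0.25, a quarter along the scanline
```

Phong shading would interpolate the normals the same way `gouraud_shade` interpolates the lighting values, and then evaluate the full lighting model per pixel.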
Texturing: Texture mapping is a technique for adding surface detail, texture, or color to an object, and helps significantly in adding realism to the scene. The process can be visualized as wrapping a patterned paper around a plain white sphere, as shown in Figure 1.12. The color associated with each pixel of the image is looked up from a stored 2D image called the texture; the mapping between the pixel and a point on the texture, called the texel, is based on a predefined mathematical function. This technique is very popular and commonly used because it makes it possible for a graphics pipeline to render objects with surface irregularities, such as the bumps on the surface of the moon, or with surface texture, such as the grain of a wooden plank.
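The texel lookup described above can be sketched as follows. This assumes a nearest-neighbor filter with repeat-mode wrapping, which is only one of several possible filtering and addressing choices:

```python
def sample_nearest(texture, u, v):
    """Fetch the texel nearest to normalized coordinates (u, v).

    `texture` is a row-major 2D list; coordinates outside [0, 1)
    wrap around (repeat mode).
    """
    h, w = len(texture), len(texture[0])
    x = int(u * w) % w
    y = int(v * h) % h
    return texture[y][x]

checker = [[0, 1],
           [1, 0]]   # a 2x2 checkerboard "texture"
print(sample_nearest(checker, 0.25, 0.25))   # 0 -- top-left texel
print(sample_nearest(checker, 0.75, 0.25))   # 1 -- top-right texel
print(sample_nearest(checker, 1.25, 0.25))   # 0 -- u wraps around
```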
Fog: After texturing, fog is added to the scene, giving the viewer a perception of depth. The fogging effect is simulated by increasing the haziness of objects with increasing distance from the camera. The fog factor is thus a function of the z-value of a pixel, and can increase either linearly or exponentially; it is applied to the pixel by blending it with the color computed in the shading and texturing steps. Another application of fogging is to make the clipping of objects at the far clipping plane less obvious by fading their disappearance rather than abruptly cutting them out of the scene.
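The fog factor and blend just described can be sketched directly; the constants here are illustrative, and the linear and exponential variants follow the z-dependence mentioned in the text:

```python
import math

def linear_fog(z, z_start, z_end):
    """Fog factor rising linearly from 0 at z_start to 1 at z_end."""
    return min(max((z - z_start) / (z_end - z_start), 0.0), 1.0)

def exp_fog(z, density):
    """Exponential fog: factor = 1 - e^(-density * z)."""
    return 1.0 - math.exp(-density * z)

def apply_fog(color, fog_color, f):
    """Blend the shaded/textured color toward the fog color by factor f."""
    return tuple((1.0 - f) * c + f * fc for c, fc in zip(color, fog_color))

# A red pixel halfway into a linear fog band turns half grey.
f = linear_fog(z=50.0, z_start=0.0, z_end=100.0)
print(apply_fog((1.0, 0.0, 0.0), (0.5, 0.5, 0.5), f))   # (0.75, 0.25, 0.25)
```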
(a) Texture (b) Textured sphere
Figure 1.12: Texture mapping example
Alpha and Depth: The alpha value is one of the attributes of a vertex and is used to model the opacity of the vertex. It is required to model transparency and translucency of objects, for example in simulating water, lenses, etc. An opaque object occludes the objects behind it. Thus, if a pixel is opaque and the z-value of the pixel is less than the value present in the depth buffer at the position corresponding to the pixel, the depth buffer and color buffer are updated with the attributes of the pixel. However, if the object is transparent, then depending on its transparency, the color of the occluded objects has to be blended with the color of the object to simulate the effect of transparency. In addition to the depth and color buffers, a graphics pipeline also has a stencil buffer. Generally, this buffer stores a value of 0 or 1 per pixel to indicate whether the pixel is to be masked. It is used to create many effects such as shadowing, highlighting, and outline drawing. The operations involved in these three tests together can be summarized as follows.
Algorithm 1 Alpha, depth and stencil test
1: if StencilBuffer(x, y) ≠ 0 then
2:   if Alpha ≠ 0 then
3:     if DepthBuffer(x, y) ≥ z then
4:       ColorBuffer(x, y) ← color
5:       DepthBuffer(x, y) ← z
6:     end if
7:   end if
8: end if

Anti-aliasing: When an oblique line is rendered, it appears jagged on the screen, as shown in Figure 1.13(a). This is a result of the discretization of a continuous function (the line) by sampling it over a discrete space (the screen). One way to alleviate this effect is to render the image at a resolution higher than the required resolution and then filter down to the screen resolution. This technique is called full screen anti-aliasing (Figure 1.13(b)). The problem with this method is that it increases the load due to pixel processing. Hence an optimization called multi-sampling is generally used, which identifies the edges of the objects on the screen and applies anti-aliasing only to the edges.
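The alpha, depth, and stencil tests of Algorithm 1 can be transcribed directly into code (a per-pixel sketch; the buffer layout is illustrative):

```python
def alpha_depth_stencil_test(x, y, z, color, alpha,
                             stencil_buf, depth_buf, color_buf):
    """Update the color and depth buffers for one pixel, per Algorithm 1."""
    if stencil_buf[y][x] != 0:          # stencil test: pixel not masked
        if alpha != 0:                  # alpha test: pixel not fully transparent
            if depth_buf[y][x] >= z:    # depth test: incoming pixel is nearer
                color_buf[y][x] = color
                depth_buf[y][x] = z

# One-pixel buffers for illustration
stencil = [[1]]
depth = [[10.0]]
colors = [[(0, 0, 0)]]
alpha_depth_stencil_test(0, 0, 5.0, (255, 0, 0), 1, stencil, depth, colors)
```

After the call, the red pixel at depth 5.0 has replaced the buffer contents; a subsequent fragment at depth 8.0 would fail the depth test and leave both buffers unchanged.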
(a) Line (b) Anti-aliased line
Figure 1.13: Antialiasing
1.1.5 Display
When a new frame is to be displayed, the screen is first cleared; then the driver reads the new frame from the framebuffer and prints it on the screen. Generally, a screen refresh rate of 60 frames per second is targeted. If only one buffer is used for writing (by the GPU) and reading (by the display driver), artifacts such as flickering are common, because the GPU could update the contents of the frame before they are displayed on the screen. To overcome this problem, double buffering is generally used, as shown in Figure 1.14, wherein the display driver reads the fully processed frame from the front buffer while the GPU writes the next frame to the back buffer. The front and back buffers are swapped once the
(a) GPU writes to B1 and the display driver reads from B2
(b) Swap Buffers: GPU writes to B2 and the driver reads from B1
Figure 1.14: Double buffering

read and write operations to the front and back buffers, respectively, are completed. The obvious drawback of double buffering is performance loss: a frame that needs slightly more than 16.67 msec (1/60 of a second) is displayed only in the next refresh cycle, and the GPU cannot start processing the next frame until then. In such cases, the overall frame rate may fall to half the targeted frame rate even when the load is only slightly increased. To counter this problem, triple buffering, using three buffers (one front and two back buffers), can be used. The GPU can now write to the additional buffer while the other buffer holds the frame to be refreshed. The choice between double and triple buffering depends on the availability of memory space.
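The halving effect with double buffering can be checked with a small calculation (a simplified model that ignores driver overhead; the formula is an illustration, not from the thesis):

```python
import math

def effective_fps(frame_time_ms, refresh_hz=60):
    """With double buffering, a finished frame is displayed only on a refresh
    boundary, so the effective rate is quantized to refresh_hz / k."""
    period_ms = 1000.0 / refresh_hz
    refreshes_needed = math.ceil(frame_time_ms / period_ms)
    return refresh_hz / refreshes_needed
```

A frame time of 16 ms sustains 60 fps, but a frame time of 17 ms, barely over the 16.67 ms budget, drops the displayed rate all the way to 30 fps.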
1.2 Graphics Processor Architecture
The computational workload of 3D graphics applications is so high that to achieve real time rendering rates, hardware acceleration for graphics processing is almost always necessary. Generally, the application layer executes on the CPU and the rest of the graphics processing is offloaded to Graphics Processing Units (GPUs). To enable ease of development and also application portability, an Application Programming Interface (API) is used to abstract the hardware from the application. The device driver that forms the interface between the CPU and GPU receives the API calls from the application and translates them for the GPU. The interaction between the CPU and GPU is shown in Figure 1.15.
Figure 1.15: The CPU – GPU interface
The commands from the application running on the CPU are passed on to the GPU through a ring buffer interface. The data associated with these commands, such as vertex attributes, textures, and shader programs, are transferred from system memory to VRAM through Direct Memory Access (DMA) transfers. In addition to acting as temporary storage for input data, VRAM also needs to store the processed frames that are ready for display. The area in VRAM reserved for storing the processed frames is essentially the framebuffer. Since the GPU need not send the processed frames to the CPU, the CPU need not wait for the GPU to complete the processing before issuing the next GPU command. This helps the CPU and GPU work in parallel, thus increasing the processing speed. In graphics applications, we observe that the input data set is operated upon by a large number of sequential operations. Hence, GPUs are generally deeply pipelined to enhance throughput. Moreover, the data set is huge, and operations on one data element are independent of operations on other data elements. Hence, each stage in the pipeline consists of multiple function units to support parallel processing of the data streaming into it. Commercial implementations of graphics processors are mainly of two types: immediate rendering engines and tiled rendering engines. The classification is based on the order in which primitives are rendered in the pipeline. In this section we describe the immediate mode rendering engines in detail, followed by the tiled mode GPUs.
1.2.1 Immediate Mode Rendering Engines
High-end discrete graphics architectures from nVidia and ATI follow immediate mode rendering. Figure 1.16 shows a high level architectural view of these graphics processors. The Host Interface acts as an interface between the host CPU and the GPU pipeline. It maintains the state of the pipeline and, on receiving commands from the driver, updates the state and issues appropriate control signals to the other units in the pipeline. It also initiates the required DMA transfers from system memory to GPU memory to fill the vertex buffer and index buffer, load the shader program, load textures, etc. The vertex buffer is generally implemented as a cache, since re-use of vertices is expected.

Figure 1.16: Fixed function GPU

The first block in the pipeline, Transformation and Lighting, is responsible for performing the transformation and lighting computations on the vertices. The vertex input cache and index buffer are used to buffer the inputs to this block. The primitives of an object are generally found to share vertices, as shown in Figure 1.17, where vertex 3 is common to triangles T1, T2, and T3. Indexed addressing of the vertices requires less CPU-GPU transfer bandwidth than transferring the actual vertices when vertices are reused [3].
Figure 1.17: Triangles sharing vertices
Vertex to index mapping: v1→1, v2→2, v3→3, v4→4, v5→5
(a) Triangles represented in terms of vertices: T1 = (v1, v2, v3), T2 = (v2, v3, v4), T3 = (v3, v4, v5)
(b) Indexed triangle representation: T1 = (1, 2, 3), T2 = (2, 3, 4), T3 = (3, 4, 5)
Figure 1.18: Indexed addressing into vertex buffer
For the example shown in Figure 1.17, if we send the vertices forming each of the triangles T1, T2, and T3 to the GPU, we need to send 9 vertices, as shown in Figure 1.18. If each vertex is made up of N attributes, where each attribute is a four-component vector (e.g., x, y, z, w components for position; R, G, B, A components for color), a bandwidth of 9 × 4 × N floating point values is required. Instead, we could assign each vertex an index (a pointer to the vertex, of integer data type) and send 9 indices and only 5 vertices to the GPU. Thus, in indexed mode we send only 9 integers and 5 × 4 × N floating point values. Indexed mode for vertex transfer and the resulting bandwidth saving are depicted in Figure 1.18.

The indices of the vertices to be processed are buffered in the index buffer, and the attributes of these vertices are fetched into the vertex input cache, since they are expected to be reused [4]. The processed vertices are also cached in a vertex output cache so that they can be reused. Before a new vertex is processed, it is first looked up in the vertex output cache [4]. If the result is a hit, the processed vertex can be fetched from the cache and sent down the pipeline, thereby avoiding the processing cost.

The transformed and lit vertices are sent to the Primitive Assembly unit, which assembles them into triangles. These triangles are sent to the Clipper, where trivial rejection, back-face culling, and clipping take place. The triangles are then sent to the Triangle Setup unit, which generates fragments from the triangles. Plain scan-line conversion of triangles does not exploit the spatial locality of accesses to the framebuffer and the texture memory. Hence tiled rasterization, as shown in Figure 1.19, is generally employed.
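The saving from the indexed representation of Figure 1.18 can be made concrete with a small back-of-the-envelope script (the 4-byte sizes assumed for floats and indices are illustrative):

```python
def vertex_bytes(n_attributes, n_vertices):
    # Each attribute is a 4-component vector of 4-byte floats
    return n_vertices * n_attributes * 4 * 4

def direct_mode(n_attributes):
    # 3 triangles x 3 vertices = 9 vertices sent verbatim
    return vertex_bytes(n_attributes, 9)

def indexed_mode(n_attributes):
    # 9 four-byte indices plus only the 5 unique vertices
    return 9 * 4 + vertex_bytes(n_attributes, 5)
```

With N = 2 attributes per vertex, direct mode needs 288 bytes while indexed mode needs 196, and the gap widens as N grows.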
In this technique, the screen is divided into rectangular tiles, and triangles are fragmented such that the pixels belonging to one tile are generated before proceeding to pixels falling in a different tile. The accesses to the framebuffer and texture cache are also matched to this tile size so that accesses to the memories can be localized [5]. The next unit is the pixel processor, which shades and textures the pixels. Since texture accesses exhibit high spatial and temporal reuse, a dedicated cache called the Texture Cache is used in this unit to store the textures. Most architectures use depth based optimizations prior to pixel processing, because a large number of fragments are often culled in the depth test that follows pixel processing; the time spent on shading and texturing such fragments is wasted. However, it is not always possible to conduct the depth test prior to pixel processing, because the pixel processor can potentially change the depth or transparency of a pixel. In circumstances where it
Figure 1.19: Tiled triangle traversal
is known that the pixel processor would not change these parameters, we can always perform the depth test prior to pixel processing. This is known as the early-Z test [6]. It is generally observed that if a pixel fails the depth test, its neighboring pixels also fail the depth test with high probability. This property is exploited by the Hierarchical Z-Buffer algorithm to identify groups of pixels that can be culled, thus reducing the number of per-pixel z-tests [7, 8]. After being shaded and textured, the pixels are sent to the Render Output Processor (ROP) for the depth, stencil, and alpha tests, followed by blending and, finally, writing to the framebuffer. Generally, a z-cache and a color cache are used in this block to exploit spatial locality in the accesses to the off-chip framebuffer.

The initial generations of GPUs were completely hardwired. However, with rapid advances in computer graphics, there was a need to support a large number of newer operations on vertices and pixels. Fixed function implementations have been found inadequate to support the evolving features in the field of graphics processing due to their restricted vertex and pixel processing capabilities. Programmable units to handle vertex and pixel processing have been introduced in the programmable graphics processors of recent years. The vertex and pixel programs that run on these programmable units are called Shaders. By changing the shader code, we can now generate various effects on the same graphics processing unit.

A study of the workload characteristics of various applications on modern programmable processors reveals that the relative load due to vertex processing and pixel processing varies across applications and also within an application [9]. This results in durations when the vertex processors are overloaded while the pixel processors are idle, and vice-versa, leading to inefficient usage of resources.
The resource utilization efficiency can be improved by balancing the load of both vertex and pixel processing across the same set of programmable units, leading to faster overall processing. This is illustrated in Figure 1.20. Modern games are expected to have a primitive count of about a million, resulting in tens of millions of pixels. The operations on these millions of vertices and pixels offer scope for a very high degree of parallelism. Moreover, large batches of these vertices and pixels share the same vertex shader and pixel shader programs, respectively. Hence, the programmable units are generally designed as wide SIMT (Single Instruction Multiple Thread) processors. In Figure 1.20, we observe that the GPU consists of multiple programmable units, each consisting of several processing elements (PEs). Different threads could be running on different programmable units, but within a programmable unit, the same thread is executed on a different data element in every PE. All these PEs can hence share the same instruction memory and decoder. This results not only in area savings, but also in considerable power savings, because the costs of instruction fetch and decode are amortized over the group of threads running in tandem on the PEs of the programmable unit.
1.2.2 Tiled Graphics Engines
Tiled rendering is the process of dividing the task of rendering a frame into smaller sub-tasks of rendering the regular grid of tiles that the frame is composed of. This is the preferred rendering mode in very high performance graphics cards such as Pixel Planes [10], Xbox 360 [11], Microsoft Talisman [12], and Intel Larrabee [13], because multiple tiles can be rendered in parallel on multiple cores of a high-performance GPU. Interestingly, this mode of rendering is also attractive in low power, small form factor mobile graphics processors such as Mali [14] and PowerVR [15], because rendering one tile at a time requires fewer resources than rendering the whole frame in a single pass. In the conventional immediate mode rendering engines explained above, since the rendering order of the objects is not sorted and consecutive objects could span any region of
Figure 1.20: Unified shader architecture for graphics processor
the frame, as illustrated in Figure 1.21, these caches need to be large to exploit data reuse. Mobile graphics processors, owing to stringent area constraints, cannot afford such huge on-chip caches. Hence, to address the memory bandwidth requirement, they generally use tiled mode rendering instead of the traditional immediate mode rendering approach discussed above [16]. In this technique, the scene is divided into sub-regions called tiles (shown in Figure 1.21), and the primitives falling into each of these tiles are divided into bins after geometry processing. Each bin is then rendered independently and requires only small on-chip depth and color caches, just sufficient to hold the depth and color values of the tile. After the tile is completely
Figure 1.21: Tiled rendering

rendered, the contents of the on-chip depth and color buffers are transferred to the off-chip framebuffer. Figure 1.22 shows a high level view of the operations in a tiled rendering architecture.
Figure 1.22: Tiled graphics pipeline
Figure 1.23 shows the architecture of a mobile graphics processor with tiled rendering. The primitives of the scene data are fetched from off-chip memory by the geometry engine. After the primitives are transformed, lit, and culled, they are subdivided in screen space. The screen is divided into tiles of, say, 32 × 32 pixels. Each primitive is placed into the bin(s) corresponding to the tile(s) it overlaps with, as shown in Figure 1.21. This overlap is detected by testing the intersection of the bounding box of the primitive with the tiles of the frame. Each tile is associated with a data structure called the tile list, which maintains the list of primitives present in the tile along with the state of the primitives. The state defines the operations that are to be applied to the primitive in the rasterization phase of processing, such as: (i) the textures associated with the primitive and the type of texture filtering to be used; (ii) depth test enable/disable; and (iii) the blending mode to be used. An example of state management in a tiled architecture is shown in Figure 1.24. Triangle T1 belongs to Tile 1, and triangles T2 and T3 to Tile 2. Depth test is enabled for all the triangles, but texturing is enabled only for T3. Hence, while binning the triangles into tiles, we associate T1 and T2 with their state (Enable Depth) and T3 with its state (Enable Depth and Enable Texture).
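The bounding-box binning test described above can be sketched as follows (a minimal version; the 32-pixel tile size follows the text, while the data layout is illustrative):

```python
def bin_primitives(primitives, screen_w, screen_h, tile=32):
    """Assign each primitive to every tile its bounding box overlaps.
    primitives: list of (name, [(x, y), ...]) screen-space vertex lists."""
    tiles_x = (screen_w + tile - 1) // tile
    tiles_y = (screen_h + tile - 1) // tile
    bins = {(tx, ty): [] for ty in range(tiles_y) for tx in range(tiles_x)}
    for name, verts in primitives:
        xs = [x for x, _ in verts]
        ys = [y for _, y in verts]
        # Range of tiles covered by the primitive's bounding box
        for ty in range(int(min(ys)) // tile, int(max(ys)) // tile + 1):
            for tx in range(int(min(xs)) // tile, int(max(xs)) // tile + 1):
                bins[(tx, ty)].append(name)
    return bins

# A small triangle inside one tile, and a larger one straddling four tiles
bins = bin_primitives([("T1", [(5, 5), (20, 10), (10, 25)]),
                       ("T3", [(30, 30), (50, 34), (40, 60)])], 64, 64)
```

Note that the bounding-box test is conservative: a primitive may be binned into a tile its box overlaps even if the triangle itself does not cover any pixel there.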
Figure 1.23: Tiled graphics processor architecture
At the end of geometry processing of each primitive, the processed vertices are stored back in off-chip memory and the corresponding tile list is updated. Once geometry processing and tiling of all the primitives of the frame are completed, the tiles are processed in sequence. In this architecture, small color and depth buffers are sufficient, as opposed to the huge depth and color caches used in immediate mode rendering, because we need to store the values of only 32 × 32 pixels on the chip. Once a tile is processed, the buffer contents are transferred to the external framebuffer memory.
1.3 Power Dissipation in a Graphics Processor
Figure 1.24: State management in tiled GPUs

Analysis of the operations in a graphics pipeline demonstrates that most of the computations are concentrated in the programmable units, texture units, and ROP units. Programmable units execute a large number of floating point vector operations; texture units use large memory bandwidth to move textures from VRAM to cache and perform a large number of floating point operations for filtering texels; and ROP units are memory intensive, needing multiple reads and writes to the color and depth buffers. To estimate the relative power consumption of the different pipeline stages of a GPU, we have used the Qsilver simulator [17], which gives a high level estimate of the power consumed by various operations in the graphics pipeline. A plot of the energy consumed by different benchmarks at various stages of the graphics processor pipeline, obtained using Qsilver simulation, is shown in Figure 1.25.
Figure 1.25: Normalized energy consumed in different pipeline stages (Frame Buffer Write, z-Test, Texture Mapping, Setup and Rasterize, Transform and Lighting) for the City, Fire, Teapot, and Tunnel benchmarks and their mean

From Figure 1.25 we observe that vertex processing, pixel processing, texturing, and raster operations are the main sources of power dissipation in the GPU, and that the power consumed by the primitive assembly, clipping, and triangle setup stages can be ignored. This observation is further strengthened by the fact that most of the real estate of popular commercial GPUs is also occupied by PUs, texture units, and ROPs. Figure 1.26 shows the footprint of nVidia's GT200 GPU targeted for laptop computers [18].
Figure 1.26: Footprint of the die of GT200 (approximately to scale)
1.4 Our Contribution
In this thesis we have proposed and evaluated several techniques to optimize the power consumption of graphics processors. Since the visual quality of graphical applications is highly dependent on the rate at which frames are processed, major compromises on performance are unacceptable. Hence, care is taken to optimize power without affecting the overall performance of the graphics processor. The main contributions of the thesis are:
• We propose a customized memory architecture for the texture cache, named Texture Filter Memory, which exploits the spatial locality and predictability of texture accesses.
By buffering the blocks of texture in registers, we have replaced the high power cache lookups with low power register reads. We have proposed a smart lookup mechanism to maximize the hits into the registers with a small number of arithmetic operations.
• We propose a code optimization technique to avoid complex operations, such as lighting and texturing, on primitives that would be trivially rejected after vertex shading. We propose to partition the vertex shader code into position-dependent and position-independent parts and defer the position-invariant part to the post-trivial-reject stage of the pipeline. Since almost 50% of the vertices are expected to be trivially rejected in most applications, our technique results in large power savings for geometry dominated applications.
• We propose a Dynamic Voltage and Frequency Scaling (DVFS) technique to exploit the variation in workload of graphics applications. We demonstrate that tiled-graphics processors offer substantially more flexibility than immediate-mode renderers in obtaining access to frame parameters that help enhance workload estimation accuracy. We also show that operating at the finer granularity of "tiles" as opposed to "frames" allows early detection of, and corrective action on, a mis-prediction. We propose an accurate workload estimation technique and two DVFS schemes, namely (i) tile-history based DVFS and (ii) tile-rank based DVFS, for tiled-rendering architectures.
1.5 Thesis Outline
In this thesis we investigate opportunities to reduce the power consumption of graphics processors. Chapter 2 presents a detailed literature survey of known power optimization techniques that are applicable to GPUs. In Chapter 3, we study the access patterns of the texture cache and propose a customized low power memory architecture that conserves both the dynamic and leakage power of the texture memory subsystem. In Chapter 4, we propose a vertex shader code optimization technique to increase the power-performance efficiency of the geometry engine. In Chapter 5, we present a sophisticated workload estimation technique for tiled GPUs and propose a system level power optimization scheme of
DVFS based on the same.

Chapter 2
Literature Survey
In recent years, low power GPU designs have drawn a lot of attention in both research and commercial communities. In this chapter we present a survey of component level power optimizations such as using custom memory architectures, scaling down the complexity of functional units for better performance per power ratio, data compression to reduce power due to memory transfers, clock gating, etc. We also study system level power optimizations like block level power gating, dynamic voltage and frequency scaling, etc. We first present the unit level optimizations for the power hungry blocks of the GPU, followed by the system level power optimizations.
2.1 Programmable Units
2.1.1 Clock Gating
A high level view of a processing element (PE) in the Programmable Unit (PU) is shown in Figure 2.1. Each processing element in the programmable unit consists of a SIMD ALU working on floating point vectors. In addition to the SIMD ALU, there is also a scalar ALU that implements special functions such as logarithmic, trigonometric, etc. The ALU supports multi-threading so as to hide the texture cache miss latency. Context switches between threads in a conventional processor cause some overhead, since the current state (consisting of the inputs and the auxiliaries generated) needs to be stored in memory and the state of the next thread has to be loaded from memory. In order to support seamless context switches between threads, the PUs in a graphics processor store the thread
Figure 2.1: Processing Element (PE)
state in registers. The register file in the shader has four banks, one each to store the input attributes, output attributes, constants, and intermediate results of the program. The constant register bank is shared by all threads, whereas separate input/output and temporary registers are allocated to each thread. The instruction memory is implemented either as a scratch pad memory, where the driver assumes the responsibility of code transfer, or through a regular cache hierarchy.
Clock gating of the various sub-blocks of a programmable unit presents a huge power saving opportunity. Since the PUs support a large number of threads and use registers to save their state, large register files are needed. However, since only one thread is active at any given instant, it is sufficient to clock only the registers allotted to the active thread and gate the clock to the remaining registers. Similarly, the special function units in the ALU are infrequently used, and hence can be activated only when the decoder stage confirms the need.
2.1.2 Fixed-Point ALUs

Mobile platforms generally use GPUs with separate PUs for vertex and pixel shaders. Hence, there is scope for customizing these to the operations specific to vertex and pixel processing respectively. It is observed that vertex shaders give comparable quality at much lower precision than pixel shaders. Thus researchers have explored the option of using integer or fixed-point ALUs rather than floating point units for vertex shading [19, 20]. The observation has been that integer ALUs do not provide the required quality, but fixed-point implementations yield the required quality at much better power budgets in comparison to floating point implementations. A few research efforts also report the benefit of using asynchronous function units in the ALUs. In [21], the authors suggest replacing the Booth multiplier in the MAC unit of the shader pipeline with a low power asynchronous multiplier.
2.1.3 Predictive Shutdown
Predictive shutdown is an effective technique for reducing power loss due to leakage in the idle components of a system. Due to workload variations, not all programmable units are fully utilized in every frame. Advance information about a frame's workload can help estimate the number of cores required to process it within the time budget. By activating only the required number of cores and powering down the surplus ones, leakage power from the idle cores can be avoided, thereby leading to substantial power savings. A history based method could be used to estimate the utilization of the PUs [22]. Let the number of active cores used to process the n-th frame be S_n, and the rate at which it was processed be FPS_n. Then the maximum rate at which each of the cores processed the frame is FPS_n / S_n. Similarly, the number of cores required to process the (n+1)-th frame can be calculated as

    S_{n+1} = (target frame rate for the (n+1)-th frame) / (minimum rate at which a core is estimated to process the frame)    (2.1)

The expected rate at which a core processes a frame can be approximated by the minimum per-core processing rate observed over a window of the m previous frames. Based on this history, the number of active cores S_{n+1} required to process the (n+1)-th frame is given by the equation:

    S_{n+1} = (FPS_target + α) / min{ FPS_n/S_n, FPS_{n-1}/S_{n-1}, ..., FPS_{n-m+1}/S_{n-m+1} }    (2.2)
The factor α is introduced so as to slightly overestimate the core requirement, so that small variations in the workload can be accommodated without missing the deadline. The above formula assumes that the entire duration of processing the frame is spent on the PUs, which is generally not true; the frame could as well be texture intensive or ROP intensive. The α factor also serves to reduce the effect of deadline misses due to underestimation of the workload.
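Equation (2.2) can be sketched directly in code (a minimal illustration; rounding the core count up with ceil is an assumption, since a fractional core cannot be activated):

```python
import math

def cores_for_next_frame(fps_history, cores_history, fps_target, alpha, m):
    """Estimate S_{n+1} from the last m frames' rates and core counts."""
    per_core_rates = [f / s for f, s in zip(fps_history[-m:], cores_history[-m:])]
    worst_rate = min(per_core_rates)   # slowest observed per-core rate
    return math.ceil((fps_target + alpha) / worst_rate)
```

For instance, with rates of 60, 55, and 50 fps achieved on 4 cores, a 60 fps target, and α = 2, the estimator activates 5 cores for the next frame.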
2.2 Texture Unit
2.2.1 Low power cache configurations
Hakura et al. [23] were the first to observe that texture accesses have high spatial and temporal locality. They demonstrated the ability of a texture cache to reduce the memory bandwidth requirement of texture mapping and hence improve the power and performance of texture memory. Later it was observed that the L1 texture cache, due to its small size, only captures intra-primitive locality and not inter-triangle or inter-frame locality. Hence in [24] the authors suggest an external texture cache (L2 cache) between the internal L1 cache and the texture memory. They organize the L2 cache as a virtual memory, with a mechanism to translate texture addresses to physical addresses. Texture cache behavior in parallel rasterization architectures was studied by Igehy et al. [25]. Serial rasterization architectures benefit from the texture cache because of locality of accesses; in an architecture with parallel rasterization units with local texture caches, the spatial locality within each cache is not as high as in architectures with a single texture unit. Hence they propose a shared texture memory architecture for effective bandwidth utilization by avoiding duplication of textures. In [26] the authors evaluate the effect of three hybrid-access cache systems, namely victim cache, half-and-half cache, and cooperative cache, on conflict misses. They observed that the results varied considerably with the size of the cache. For an 8KB cache, the victim cache performs better than the others, but for a 16KB cache, the performance of the victim cache and half-and-half cache is comparable. In [27], the authors reduce power consumption by replacing a 16KB 2-way associative cache with a very small (128-256 byte) direct mapped cache. This reduces average power, as the direct-mapped cache avoids the multiple tag comparisons of set-associative caches, but at a considerable performance penalty (50%) due to high miss rates.
2.2.2 Texture Compression
To reduce the power consumed in fetching textures from off-chip memory, the texture memory bandwidth is reduced by transferring compressed textures from the off-chip texture memory to the texture cache. Since texture accesses have a very high impact on system performance, the main requirement of a texture compression system is that it allow fast random access to the texture data. Block compression schemes such as JPEG are not suitable for textures even though they give high compression ratios: since the accesses to texture memory are non-affine, it cannot be ensured that decompressed data is used up before the next block is fetched and decompressed. In cases where consecutive texture accesses alternate between a few texture blocks, the same block would have to be fetched and decompressed multiple times, resulting in increased block fetch and decompression overhead. Hence, we require compression schemes in which the texels of a block can be decompressed independently of the other elements of the block. The S3TC compression technique is commonly used for this purpose [28]. In the S3TC technique, for a block of texels, two reference values, together with a few values generated by interpolation of the reference values, are chosen such that each texel in the block can be approximated by one of the chosen values with the least loss in accuracy. For example, if four values are to be used to represent the colors of a texture block, and c0 and c1 are the chosen reference values, two other colors (c2 and c3) are generated by interpolation of c0 and c1, as shown in Figure 2.2. For each texel in the block, the closest of the colors among c0 to c3 is chosen. Thus, a 4 × 4 tile requires only 2 reference values and 16 2-bit selectors, instead of 16 full texel values. Based on this principle, five modes of compression named DXT1 to DXT5 have been proposed, with varying accuracy and compression ratios.
[Figure: the colors c0, c2, c3, c1 lie one unit apart along the line joining the reference colors, with c2 = (2c0 + c1)/3 and c3 = (c0 + 2c1)/3.]
Figure 2.2: S3TC texture compression
1. DXT1 gives the highest compression ratio among all the variants of DXT compression. Texels are usually represented as 32-bit values with the R, G, B, A components allotted 8 bits each. However, most of the time, textures do not need 32-bit accuracy. Hence, DXT1 uses a 16-bit representation for the RGB components (5:6:5) of the reference colors and allows a choice of 0 or 255 for transparency.
The colors that can be generated from the two 16-bit reference values, together with the 2 bits per texel that determine the interpolation weights, are shown below:
If c0 and c1 are the reference values, the other two colors are calculated as
If c0 > c1:
c2 = (2c0 + c1)/3 and c3 = (c0 + 2c1)/3
else:
c2 = (c0 + c1)/2 and c3 = 0
For a 4 × 4 tile size, this scheme needs 64 bits per tile giving 8:1 compression.
2. The DXT2 and DXT3 compression schemes encode alpha values in addition to color values; the color compression scheme is the same as that of DXT1. Thus, for a 4 × 4 tile, they need 64 bits for color as in DXT1 and an additional 64 bits for alpha values, giving 4:1 compression. In DXT2, color data is assumed to be pre-multiplied by alpha, which is not the case in DXT3.
3. In the DXT4 and DXT5 schemes, color components are compressed as in DXT2/3, and for alpha compression two 8-bit reference values are used, with six other alpha values interpolated from them, giving 8 alpha values to choose from. The alpha encoding is as shown below:
If α0 > α1:
α2 = (6α0 + α1)/7, α3 = (5α0 + 2α1)/7, α4 = (4α0 + 3α1)/7,
α5 = (3α0 + 4α1)/7, α6 = (2α0 + 5α1)/7, α7 = (α0 + 6α1)/7
else:
α2 = (4α0 + α1)/5, α3 = (3α0 + 2α1)/5, α4 = (2α0 + 3α1)/5,
α5 = (α0 + 4α1)/5, α6 = 0, α7 = 255
DXT4 is used when color is pre-multiplied with alpha and DXT5 is used when it is not. DXT4/5 also give 4:1 compression, but produce superior results for alpha values.
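To make these schemes concrete, the sketch below builds the DXT1 color palette, decodes one texel of a 64-bit block, and builds the DXT4/5 alpha palette. The assumed block layout (two 16-bit 5:6:5 reference colors followed by 32 bits of 2-bit indices) follows the common DXT1 convention rather than anything stated in this chapter, and all helper names are ours:

```python
def rgb565_to_rgb888(c):
    """Expand a 16-bit 5:6:5 color into an (r, g, b) tuple of 8-bit values."""
    r, g, b = (c >> 11) & 0x1F, (c >> 5) & 0x3F, c & 0x1F
    # Replicate the high bits into the low bits to cover the full 0..255 range.
    return ((r << 3) | (r >> 2), (g << 2) | (g >> 4), (b << 3) | (b >> 2))

def dxt1_palette(c0, c1):
    """Build the 4-entry color palette from the two 16-bit reference colors."""
    p0, p1 = rgb565_to_rgb888(c0), rgb565_to_rgb888(c1)
    if c0 > c1:  # opaque mode: two interpolants at 1/3 and 2/3
        p2 = tuple((2 * a + b) // 3 for a, b in zip(p0, p1))
        p3 = tuple((a + 2 * b) // 3 for a, b in zip(p0, p1))
    else:        # transparent mode: midpoint plus a black/transparent entry
        p2 = tuple((a + b) // 2 for a, b in zip(p0, p1))
        p3 = (0, 0, 0)
    return [p0, p1, p2, p3]

def dxt1_texel(c0, c1, indices, sx, sy):
    """Decode texel (sx, sy) of a 4 x 4 DXT1 block; the 32-bit `indices`
    word holds one 2-bit palette index per texel."""
    idx = (indices >> (2 * (4 * sy + sx))) & 0x3
    return dxt1_palette(c0, c1)[idx]

def dxt5_alpha_palette(a0, a1):
    """Build the 8-entry alpha palette of DXT4/5 from two 8-bit references."""
    if a0 > a1:
        # Six alphas interpolated between a0 and a1.
        return [a0, a1] + [((7 - i) * a0 + i * a1) // 7 for i in range(1, 7)]
    # Four interpolants, plus fully transparent and fully opaque entries.
    return [a0, a1] + [((5 - i) * a0 + i * a1) // 5 for i in range(1, 5)] + [0, 255]
```

A decoder of this form needs only shifts, masks and small integer arithmetic per texel, which is what makes S3TC suitable for random access during filtering.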
2.2.3 Clock Gating
Clock gating is a powerful technique that can be used to conserve the power dissipated in a texture unit. Textures are generally stored such that the odd and even mipmap [29] levels map to different cache banks, so that the texels can be fetched in parallel during tri-linear [29] interpolation. Moreover, the addressing and filtering units are also present in pairs so that texels can be filtered in parallel, facilitating faster texture sampling. However, when texels are filtered in bilinear mode, half of these units and texture banks are idle. There may also be intervals during which the vertex or pixel shader threads do not use texturing at all. Since the texturing requirement and the type of filter used are part of the state information that the driver sends to the GPU, texture enable and filtering modes are set before the processing of a batch starts. Ideally, half of these units could be powered off when bilinear filtering [29] is used, and the entire texture module could be switched off when texturing is not used. However, since the intervals between switching from one condition to another may not be long enough to merit powering down these circuits, clock gating is generally used to conserve the power associated with clocking them.
2.3 Frame Buffer
Raster operations are highly memory intensive, since they need multiple reads and writes to the off-chip framebuffer. Since off-chip accesses are slow and power hungry, these framebuffer accesses affect both the power consumption and the performance of the system. Reducing the memory bandwidth between the GPU and the framebuffer is therefore a very important power-performance optimization. The major techniques generally employed to reduce the required bandwidth between the GPU and VRAM are:
• Re-use the data fetched from VRAM to the maximum extent before it is replaced by other data. Efforts made in this direction include extensive on-chip caching and blocked data accesses to maximize cache hits.
• Send compressed data to the GPU from VRAM and decompress it on-chip so as to decrease the memory traffic. The decoder has to be simple enough that the savings due to compressed data transfers outweigh the decompression cost in terms of power and performance.
In this section we discuss the data compression strategies for memory bandwidth reduction of color and depth buffer.
2.3.1 Depth Buffer Compression
As described in Figure 1.19, the fragments are generated in tiles to exploit spatial locality of accesses to color, depth, and texture data. Several tiles of depth information are cached in the on-chip depth cache; whenever there is a miss in this cache, a tile of data is fetched into it from the off-chip VRAM. To reduce the memory bandwidth due to these transfers, VRAM stores and transfers compressed tiles, which are decompressed on-the-fly before they are stored in the on-chip depth cache. Differential Differential Pulse Code Modulation (DDPCM) is one of the popular compression techniques used for depth buffer compression [30]. It is based on the principle that, since the depth values of the fragments of a triangle are generated by interpolating the depth values of its vertices, if a tile is completely covered by a single triangle, the second order differentials of the depth values across the tile are all zero. The steps in the compression scheme are enumerated below:
1. Start with a tile of depth values. Assuming the tile is covered by a single triangle, the interpolated depth values vary linearly along the rows and columns of the tile.
2. Compute the column wise first order differentials, and then repeat the same step to obtain column-wise second order differentials.
3. Follow it up with row-wise second order differential computation.
We see that in the best case (i.e., when a single triangle covers the tile), we need to store only one z value and two differentials. Thus, for an 8 × 8 block (which would originally need 64 × 32 bits), the compressed form needs 32 bits for the reference z value and 2 × 33 bits for the differentials. Since the depth values are generally interpolated at a higher precision than they are stored in, the second order differentials are one of the values 0, −1, and 1. Hence two bits are sufficient to encode the value of a second differential. Thus, with an additional 61 × 2 bits for the second differentials, a total of 32 + 2 × 33 + 61 × 2 = 220 bits is required instead of 2048 bits. With the two bits used to represent the differentials, four values can be encoded. Since only the values 0, −1, and 1 are needed, the fourth value can be used to indicate the case where a differential takes a value other than 0, −1, and 1. In this case a fixed number of second order differentials are stored at higher precision and picked up in order each time a violation is indicated.
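The planarity condition underlying DDPCM can be sketched as follows, assuming integer depth values; `second_diffs` and `tile_is_planar` are hypothetical helper names, not part of the scheme in [30]:

```python
def second_diffs(vals):
    """Second-order differences of a 1-D sequence of depth values."""
    first = [b - a for a, b in zip(vals, vals[1:])]
    return [b - a for a, b in zip(first, first[1:])]

def tile_is_planar(tile):
    """True when every column-wise and row-wise second-order differential of
    the tile is zero, i.e. the tile is covered by a single planar triangle
    and compresses to one z value plus two first-order differentials."""
    cols = [[row[x] for row in tile] for x in range(len(tile[0]))]
    return all(d == 0 for line in list(tile) + cols for d in second_diffs(line))
```

For a tile built from a plane such as z = 100 + 3x + 5y the test holds, while perturbing any one depth value breaks it, which is exactly the case the escape encoding handles.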
2.3.2 Color Buffer Compression
The transfers from the color buffer to the color cache are also done in tiles, so as to exploit spatial locality of accesses. Hence, block based compression schemes are used for these data transfers. Since color values are not always interpolated from vertex colors (they could be textured), the compression scheme used for depth buffer compression is not very efficient for the color buffer. The difference between the color values of neighboring pixels is small, and this makes variable length encoding of the differences a suitable option for color buffer compression [31]. This compression technique is called exponent coding, since the numbers are represented as s(2^x − y), where s is the sign bit and y ∈ [0, 2^(x−1) − 1]. The value x + 1 is unary coded and concatenated with the sign s and with y coded in normal binary to give the compressed value. For example, the value 3 is represented as (2^2 − 1). Here x + 1 = 3, which is 1110 in unary coding, s = 0 and y = 1; hence the code for 3 is 111001. Table 2.1 shows the coded values for numbers in the range [−32, 32].
Value range Code
0          0
±1         10s
±2         110s
±[3, 4]    1110sx
±[5, 8]    11110sxx
±[9, 16]   111110sxxx
±[17, 32]  1111110sxxxx
otherwise  escape prefix, s, 8-bit absolute value
Table 2.1: Exponential encoding for color buffer compression
From the table we see that smaller numbers are represented with fewer bits than larger numbers. Since in most cases the differentials are observed to be small, significant compression ratios can be expected. Color values are used both by the GPU and by the display controller. Compression therefore also helps reduce the bandwidth between the framebuffer and the display controller; hence the display controller also needs a decompressor to decode the compressed color values read from the framebuffer.
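The exponent coder described above can be sketched in a few lines; it reproduces the worked example (the value 3 encodes to 111001) and treats the unary code of n as n ones followed by a terminating zero, as that example implies. The escape case for values outside the table's range is omitted:

```python
def exp_encode(v):
    """Exponent-code an integer v as unary(x + 1), sign bit, then y in binary,
    where |v| = 2**x - y and y lies in [0, 2**(x - 1) - 1]."""
    if v == 0:
        return "0"
    sign = "0" if v > 0 else "1"
    a = abs(v)
    x = (a - 1).bit_length()           # smallest x with 2**x >= |v|
    y = 2 ** x - a
    code = "1" * (x + 1) + "0" + sign  # unary(x + 1) followed by the sign bit
    y_bits = max(x - 1, 0)             # y needs x - 1 bits (none for x <= 1)
    if y_bits:
        code += format(y, "0{}b".format(y_bits))
    return code
```

Small differentials thus cost only a few bits (for example ±1 costs three), which is where the compression comes from.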
2.4 System Level Power Management
In addition to the architectural power optimization techniques discussed so far, system level power management techniques can also be effective in reducing power consumption by minimizing the wastage of power in a graphics subsystem. Techniques such as system level power gating, Vdd and Vth scaling, and DVFS are effective in saving power. These techniques, as applicable to GPUs, are discussed in detail in this section.
2.4.1 Power Modes
Graphics processors are used to accelerate various kinds of applications such as word processors, GUIs of tools such as internet browsers, games, etc. Since the amount of graphics processing varies greatly from application to application, the GPU workload due to these applications also varies greatly. Moreover, there could be large intervals of time during which no application requires graphics processing, leaving the GPU idle. Since the GPU is not always required to operate at peak performance, a few power modes with varying performance levels are generally supported. For example, when the GPU is idle, it can be operated at Vdd and Vth levels that minimize dynamic and leakage power. However, when 3D games, which use heavy graphics processing, are running on the system, the GPU can be operated in the maximum performance mode. Performance monitors are used to gauge the utilization of the GPU (similar to monitoring CPU utilization), and the operating system switches the GPU to the power mode that delivers the required performance level with minimum power consumption.
2.4.2 Dynamic Voltage and Frequency Scaling
In the case of power management by mode switching, since the switching overhead is high, there is a relatively large difference between the thresholds that cause a transition between power modes, and the observation intervals are also large. However, applications such as games have been shown to exhibit significant variation in the workload presented by different frames. Fine tuning the computational capacity of the GPU in response to such workload variations has a huge power saving potential. Dynamic voltage and frequency scaling is a popular power optimization technique used by processors to match their computational capacity with the varying workloads of the applications running on them by adjusting the frequency and operating voltage at run time. Commercial processors such as the Transmeta Crusoe, Intel Pentium Mobile, and ARM processors provide support for DVFS. To control the operating point of the CPU, simple schemes such as PAST [32] and Aged Averages [33] are used to predict the expected utilization of the CPU in the current time interval based on the average CPU utilization observed in recent intervals. Since these schemes have no knowledge of the applications, and decisions are taken only on the basis of the history of observed CPU workloads, they can suffer from significant performance degradation due to mis-predictions. DVFS schemes that have application knowledge have shown better power and performance characteristics [34]. Since the quality of service in games is highly sensitive to the frame rate, it is important to predict the workload accurately in order to minimize the number of frames missing their deadlines. Some techniques use the workload history to predict the expected workload of the current frame, while others attempt to extract hints from the frame state information to guide the workload prediction.
Various prediction techniques proposed in literature are discussed in more detail in the following sections.
History based Workload Estimation
The history based workload estimation technique predicts the workload of the current frame from the workloads of previously rendered frames [35]. The simplest way to do this is to approximate the workload of the current frame by that of the previous frame. However, doing so would result in frequent voltage-frequency changes, which are undesirable, since switching from one voltage-frequency level to another imposes a stabilization time overhead. To minimize the number of transitions, the average workload over a window of previous frames is used to estimate the workload of the current frame. A large window size helps reduce the number of voltage changes but, at the same time, leads to a larger number of frames missing their deadlines as a result of the slower correction mechanism. This history based workload prediction can be extended to estimate the workload of all the voltage islands in the design, and the operating point of each island can be tuned to match the workload changes experienced by that island.
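A window-averaging predictor of this kind can be sketched in a few lines; the window size of 4 and the 60 fps frame period are illustrative parameters, not values from the cited work:

```python
def predict_next(history, window=4):
    """Predict the next frame's workload (in cycles) as the mean of the
    workloads observed over the last `window` frames."""
    recent = history[-window:]
    return sum(recent) / len(recent)

def min_frequency(predicted_cycles, frame_period=1 / 60):
    """Lowest clock frequency (Hz) that completes the predicted workload
    within one frame period; a real DVFS governor would round this up to
    the nearest supported voltage-frequency level."""
    return predicted_cycles / frame_period
```

The tension described above is visible here: a larger `window` smooths the prediction (fewer voltage switches) but reacts more slowly when the true workload jumps.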
Control Theory based Workload Estimation
Control theory based DVFS takes into account the previous prediction error along with the previously predicted workload to predict the workload of the current frame [36]. Since it can adapt faster to workload changes, it results in fewer frames missing their deadlines. In a control based DVFS scheme, a simple Proportional Integral Derivative (PID) controller, as shown in Figure 2.3, is used as a closed loop feedback mechanism to adjust the predicted workload of the current frame based on the prediction errors of some of the previously rendered frames. The workload of the current frame wi is expressed as
wi = wi−1 + ∆w (2.3)
where ∆w is the output from the PID controller. The proportional control regulates the speed at which the predicted workload responds to the prediction error of the previous frame. The integral control determines how the workload prediction reacts to the prediction errors accumulated over a few of the recently processed frames. The differential control adjusts the workload based on the rate at which the prediction errors have changed over a few of the recent frames. Thus the correction value generated by the PID
[Figure: closed loop in which the error between the set point and the process output feeds three parallel paths, proportional (Kp × Error), integral (Ki × Error), and differential (Kd × Error), whose sum drives the process.]
Figure 2.3: PID controller
controller can be expressed as
∆w = Kp × Error + Ki × Σ Error + Kd × ∆Error (2.4)
The contribution of each of the Proportional, Integral, and Differential components of the controller can be tuned by varying the coefficients Kp, Ki, and Kd respectively. The flow of operations that take place in a PID based DVFS scheme can be summarized as shown in Figure 2.4. Based on the difference between the actual workload and the predicted workload (Error) of the current frame, the PID controller estimates the workload for the next frame. The voltage and frequency of the system are scaled to match the computational capacity of the system with the predicted workload of the next frame. The frame is processed at this operating point and actual workload of the frame is observed to generate the Error value that drives the PID controller.
[Figure: feedback loop. The PID controller generates a correction from the error; the predicted workload is the previous workload plus the correction; voltage and frequency are scaled accordingly; the frame is processed and its workload measured; the error is the predicted minus the measured workload.]
Figure 2.4: PID controller based DVFS for graphics processor
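The loop of Figure 2.4 can be sketched as below. The sign convention (an over-prediction produces a positive error that lowers the next prediction) and the gain values used in the usage note are our assumptions; a real governor would also clamp the prediction to the supported operating points:

```python
class PidPredictor:
    """PID-corrected workload predictor for DVFS (equations 2.3 and 2.4):
    each new prediction is the previous one adjusted by a correction term
    computed from the prediction error."""

    def __init__(self, kp, ki, kd, initial_workload):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.predicted = initial_workload
        self.err_sum = 0.0    # integral term: accumulated error
        self.prev_err = 0.0   # remembered for the differential term

    def observe(self, measured_workload):
        """Feed back the measured workload of the frame just rendered and
        return the prediction for the next frame."""
        err = self.predicted - measured_workload   # Error = predicted - measured
        self.err_sum += err
        d_err = err - self.prev_err
        self.prev_err = err
        correction = self.kp * err + self.ki * self.err_sum + self.kd * d_err
        self.predicted -= correction   # positive error (over-prediction) lowers the guess
        return self.predicted
```

With only a proportional gain of 0.5, an initial guess of 120 against a steady workload of 100 halves the error each frame (120, 110, 105, ...), illustrating the faster convergence claimed for control based schemes.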
Frame Structure based Workload Estimation
In all the methods discussed above, the workload of a frame is estimated from the history of previously processed frames. Hence the prediction is good only when the scene remains almost the same across consecutive frames. The workload is bound to be mis-predicted when there is a significant change in the scene, which may result in frames missing their deadlines. To alleviate this problem, the frame structure based estimation technique bases its prediction on the structure of the frame and the properties of the objects present in the frame [37]. Since this information is available prior to the processing of the frame, the workload prediction can be based on the properties of the current frame rather than on the workload of previous frames. In this approach, a set of parameters impacting the workload is identified and an analytical model of the workload as a function of these parameters is constructed. During the execution of the application, each frame is parsed to obtain these parameters and the pre-computed workload model is used to predict the expected workload of the current frame. For example, the basic elements that make up a frame in the Quake game engine can be enumerated as follows.
• Brush models used to construct the world space. The complexity of a brush model is determined by the number of polygons present in the model. If the average workload for processing a polygon is w, the workload W presented by n brush models each consisting of p polygons is represented as:
W = n × p × w (2.5)
• Alias models are used to build the characters and objects such as monsters, soldiers, weapons, etc. Alias models consist of geometry and the skin texture of the entity being modeled. The skin can be rendered in one of two modes: opaque and alpha blend. Since the geometry consists essentially of triangles, its workload is characterized in terms of the number of triangles and the average triangle area. Since alpha blending and opaque rendering present different workloads, the workload is parametrized for both modes. If the workload of processing a single pixel with blending is wt and without blending is wo, the workload W due to alias models consisting of Nt triangles with blending and No opaque triangles, of average area A, is given by:
W = Nt × A × wt + No × A × wo (2.6)
• Textures applied to the surfaces of brush models to give a realistic appearance like that of wood, a brick wall, etc. The workload W due to applying Nt textures, where w is the workload for applying a single texture on N polygons with average area A, is given by:
W = Nt × N × A × w (2.7)
• Light maps to create lighting effects in the scene. Since they are similar to texture maps, the workload due to light-maps is estimated similar to the estimation for texture maps.
• Particles to create bullets, debris, dust, etc. The workload W due to rendering N particles, where the number of pixels in particle i is Pi and the workload for rendering one such pixel is w, is given by:
W = Σ(i=1..N) Pi × w (2.8)
Finally the total workload of the frame is the sum total of the workloads computed above.
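Summing the per-element models gives the frame estimate; a minimal sketch follows, in which the dictionary keys and the constants used in the test are illustrative stand-ins for calibrated per-unit workloads, not values from [37]:

```python
def frame_workload(brush, alias, texture, particles, w):
    """Total frame workload as the sum of the element models (eqs. 2.5-2.8).
    `w` maps unit names to calibrated per-unit workload constants."""
    total = brush["n_models"] * brush["polys_per_model"] * w["poly"]       # eq. 2.5: brush models
    total += alias["blend_tris"] * alias["avg_area"] * w["blend_pixel"]    # eq. 2.6: alpha-blended alias triangles
    total += alias["opaque_tris"] * alias["avg_area"] * w["opaque_pixel"]  # eq. 2.6: opaque alias triangles
    total += (texture["n_textures"] * texture["n_polys"]
              * texture["avg_area"] * w["texel"])                          # eq. 2.7: textures (light maps alike)
    total += sum(pixels * w["particle_pixel"] for pixels in particles)     # eq. 2.8: particles
    return total
```

The per-unit constants would be measured offline for the target GPU; at run time only the frame's parameters need to be parsed before plugging them into this model.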
Signature based Workload Estimation
The signature based estimation technique estimates the workload using properties of the frame in addition to the history of cycles expended in processing previous frames [38]. Every frame is associated with a signature composed from its properties, such as the number of triangles in the frame, the average height and area of the triangles, the number of vertices in the frame, etc. A signature table records the actual observed workload of a frame against the signature of the frame. Prior to rendering a frame, its signature is computed and the predicted workload of the frame is picked from the signature table. On rendering the frame, if there is a discrepancy between the observed and predicted workloads, the signature table is updated with the observed workload value. To compute the signature of the frame, we need the vertex count, the triangle count, and the area and height of the triangles. The pipeline has to be modified to facilitate signature extraction, since the triangle information can be obtained only after triangle culling and clipping are performed. The modified pipeline is shown in Figure 2.5. The geometry stage is divided into vertex transformation and lighting stages. Triangle clipping and culling are now performed prior to lighting, and a signature buffer is inserted prior to the lighting stage to collect the frame statistics. Since the information of the entire frame is needed to compute a meaningful signature, the buffer should be big enough to absorb one frame of delay. Signature based prediction works on the assumption that the computational intensity of the pre-signature stage is negligible and that it can be performed on the CPU without hardware acceleration. For every signature generated, the best matching signature in the table must be looked up. The distance metric shown in Equation 2.9 is used to locate the signature that is closest to the current signature.
For a signature S consisting of parameters s1,s2, ...sd and a signature T comprising t1, ...td in the signature table, the distance D(S, T ) is defined as
D(S, T) = Σ(i=1..d) |si − ti| / si (2.9)
The signature that is at a minimum distance from the current signature can be located either by linear search or by a more sophisticated search mechanism.
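The table lookup reduces to minimizing the distance of Equation 2.9; a linear-search sketch, with signatures represented as tuples of positive parameters (the function names are ours):

```python
def signature_distance(s, t):
    """Relative distance D(S, T) of eq. 2.9: the sum over the d parameters
    of |s_i - t_i| / s_i (each s_i must be nonzero)."""
    return sum(abs(si - ti) / si for si, ti in zip(s, t))

def closest_signature(s, table):
    """Linear search for the stored signature minimizing D(S, T)."""
    return min(table, key=lambda t: signature_distance(s, t))
```

Normalizing each term by s_i makes the metric scale-free, so a parameter with large absolute values (e.g. vertex count) does not dominate one with small values (e.g. average triangle height).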
[Figure: (a) conventional pipeline: Transform → Lighting → Clipping → Rasterization → Pixel Processing. (b) pipeline enhanced with signature based DVFS: Transform & Clip → Signature Buffer → Lighting → Rasterization → Pixel Processing, with a feedback path that extracts the signature, looks up the signature table, scales voltage and frequency, monitors performance, and updates the signature table.]
Figure 2.5: Signature based DVFS for graphics processor
2.4.3 Multiple Power Domains
From the discussion in Section 1.3, it is clear that PUs, texture units, and ROPs are the major power consuming components in the graphics processor. From the workload analysis of games it has been observed that some frames use a lot of texturing, others load the programmable units, and still others require a large number of ROP operations. Hence these three modules can be designed to have different sets of power and clock signals. The voltage and frequency of each of these domains can then be varied independently in accordance with its load, leading to power savings.
2.5 Miscellaneous
In [39], the authors propose processing the difference of adjacent pixel values instead of operating directly on the pixel values. Spatial correlation in typical images leads to the difference being small on average. Similar tonal locality observations are also exploited in [40, 41] to assign codes in order to reduce the total bit transition count during serial transmission over the Liquid Crystal Display (LCD) bus to the LCD display device. In [42, 43], the authors observe that the eye's visual perception depends on the intensity and the transmittance characteristic of the LCD panel. It is possible to adjust these parameters without affecting the perceived quality; specifically, it is possible to reduce the intensity and thereby reduce power. The use of fixed point arithmetic instead of floating point has been suggested for low power implementations of graphics processors in mobile applications [44]. Low power features in such implementations include using the valid instruction signal in the vertex shader to clock-gate the register files, preventing writes (reads can still proceed).
Chapter 3
Texture Filter Memory
3.1 Introduction
From the power analysis of a typical graphics pipeline using Qsilver [17], shown in Figure 1.25, we have seen that texture mapping is one of the components that contribute significantly to the total power consumption. The texture memory sub-system consumes up to 38% of the total dynamic energy, making it a potential candidate for optimization. One of the main side-effects of technology scaling has been increasing levels of leakage power. Typically in caches, since cache lines leak power for most of their lifetime, the leakage power of caches is much higher than their dynamic power. It has been observed that in a 70nm cache, 77% of the total power consumption is due to the leakage component, whereas the dynamic power accounts for only 23% [45]. Leakage power consumption is directly proportional to the area of the chip. Since a significant portion of GPU real estate is occupied by texture memories, as seen in Figure 1.26, texture memory is also a major contributor to the leakage power of the GPU. In this chapter we aim to design a custom memory architecture for texture mapping that optimizes both the dynamic and static power consumption of the texture memory sub-system, resulting in significant overall power savings. Texture mapping is the process of mapping an image (in texture space) onto a surface (in object space) using some mapping function [29]. The process can be explained with the simple example shown in Figure 3.1. Consider modeling a globe. One way to do this is to represent the sphere as a large
(a) Object (b) Texture (c) Textured object
Figure 3.1: Texture mapping to model a globe
number of tiny triangles and associate the vertices of these triangles with appropriate colors, so that after the triangles are passed through the pipeline, what finally appears on the screen looks like a globe. The modeling effort in this case is so huge that it makes rendering such models almost impossible. Things would be easier if we could just define the mapping of a few points on the sphere to points on a 2-D world map, and the pipeline had the capability to associate the pixels with appropriate colors from the world map. This process of mapping the pixels on a 3D model to points on a 2D texture, called texels, is called texture mapping or texturing. Figure 3.2 shows the process of texture mapping a primitive. From the figure we observe that texture space and object space can be at an arbitrary distance and orientation with respect to each other. As a result, there is no one-to-one correspondence between the pixels of the object and the texels of the texture. This necessitates a texture filtering mechanism to attribute the best color to a pixel. Hence several texture filtering techniques are used to improve the quality of texture mapped images.
Point filtering: This is the simplest of the filtering methods, where the color of the texel nearest to the pixel center is picked as the color of the pixel. This is the fastest but crudest form of texture filtering.
Linear filtering: This is slightly more refined than point filtering. The two texels closest to the pixel center are weighted-averaged to obtain the color of the pixel. Though this is better than point filtering, it still doesn't yield acceptable
Figure 3.2: Oblique traversal of scanlines in texture space
quality.
Bilinear filtering: This is one of the most common filtering techniques used for texture mapping. In bilinear filtering, the weighted average of the four texels nearest to the pixel center gives the color of the pixel (Figure 3.3).
Trilinear filtering: In order to produce good results at the varying levels of detail (lod) at which the object could be viewed, the texture image is stored at various resolutions called mip-maps [29], and the nearest one is picked at run time based on the lod. In bilinear filtering, abrupt changes from one mip-map level to another result in noticeable changes in image quality. To avoid this artifact, trilinear interpolation averages the bilinearly interpolated values from the two nearest mip-map levels to give the color of the pixel.
Anisotropic filtering: When the rendered object is at an oblique viewing angle with respect to the camera, bilinear and trilinear filtering do not give satisfactory image quality. Since their filter pattern is a square, they perform well only when the object is viewed head-on, and lead to blurring when the object is viewed at an angle. In such cases, anisotropic filtering [29] is generally used, in which the footprint of the filter is generated at run time depending on the obliqueness of the viewing angle. Current commercial implementations of anisotropic filtering may require up to 128 texels to generate the color of a single pixel. Since this filtering incurs a heavy performance cost, it is limited to very few pixels of the scene.
3.2 Texture Mapping Access Pattern
Since almost all of the common filtering methods inherently use bilinear filtering, we study the access pattern of bilinear filtering in detail. Texture mapping with bilinear filtering exhibits high spatial and temporal locality. This is because:
• to compute the color of a pixel we need to fetch four neighboring texels,
• consecutive pixels on the scan line map to neighboring texels, and
• consecutive scanlines of a primitive share texels.
In addition to locality, texture mapping also exhibits predictability in its access pattern. As seen in Figure 3.3, an access to texel t1 (the base texel) is followed by accesses to texels t2, t3 and t4. Thus the access to texel t1 gives us information about the accesses to the next three texels. A conventional 4-way associative cache architecture for texture memory, as suggested in [23], is oblivious to such predictability. The four reads from the tag and data arrays, in addition to the four tag comparisons for each texel fetch, make it very power hungry. We propose a customized memory architecture that exploits the spatial locality and predictability of the access stream, resulting in a low power solution without compromising on performance.
[Figure: the pixel center falls among the four nearest texels t1 (tx, ty), t2 (tx+1, ty), t3 (tx, ty+1) and t4 (tx+1, ty+1); t1 is the base texel.]
Figure 3.3: Footprint of a bilinear filter
CASE 1 CASE 2 CASE 3 CASE 4
Figure 3.4: Scenarios to which the texture footprint could be mapped
Since the direction of accesses in texture memory is arbitrary, a blocked representation of texture maps in memory is generally used (illustrated in Figure 3.5). Each block resides in contiguous memory space. The algorithm for computing the texel address from the texel co-ordinates is shown in Algorithm 2. The overhead of the extra additions and shifts in the block address computation is offset by the performance gained from the reduced cache miss rates when the cache line size is chosen equal to the block size [23].
Figure 3.5: Blocked representation of texture
Algorithm 2 Computation of texel address
Input: Texel co-ordinates (tu, tv), base: starting address of texture
Output: Texel address
1: lbw ← log2(bw)
2: lbh ← log2(bh)
3: rs ← log2(width · bh)
4: bs ← log2(bw · bh)
5: bx ← tu >> lbw
6: by ← tv >> lbh
7: sx ← tu & (bw − 1)
8: sy ← tv & (bh − 1)
9: block address ← (by << rs) + (bx << bs)
10: offset ← (sy << lbw) + sx
11: texel address ← base + block address + offset
12: return texel address
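Algorithm 2 translates directly into code; the sketch below assumes bw, bh and width are powers of two, so shifts and masks replace divisions and modulos:

```python
def texel_address(tu, tv, base, width, bw, bh):
    """Address of texel (tu, tv) in a blocked texture layout (Algorithm 2)."""
    lbw = bw.bit_length() - 1              # log2(bw)
    lbh = bh.bit_length() - 1              # log2(bh)
    rs = (width * bh).bit_length() - 1     # log2(width * bh): texels per row of blocks
    bs = (bw * bh).bit_length() - 1        # log2(bw * bh): texels per block
    bx, by = tu >> lbw, tv >> lbh          # block coordinates
    sx, sy = tu & (bw - 1), tv & (bh - 1)  # offsets within the block
    block_address = (by << rs) + (bx << bs)
    offset = (sy << lbw) + sx
    return base + block_address + offset
```

For example, in a 16-texel-wide texture with 4 × 4 blocks, texel (5, 6) falls in block (1, 1) at intra-block offset 9, i.e. address base + 89.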
Since texture mapping exhibits high spatial locality, we propose to buffer the blocks of texture expected to be accessed in the near future in a set of registers. The number of blocks to be buffered depends on the type of filter being used. In a bilinear filtering operation, the texels could lie in one, two, or four of the neighboring blocks, as shown in Figure 3.4. Also, the next set of texels falls in one of these four blocks with high probability. Hence we need to buffer up to four blocks of texture. For trilinear filtering we would need to buffer eight blocks: four from each of the two nearest mipmap levels.
A standard cache based memory architecture could be used for texture memory accesses, but this is expensive in terms of power, as each access results in a lookup operation in which both the tag and data arrays of the cache are read, with the number of such array accesses proportional to the associativity (lower power cache architectures exist, but they compromise on performance). The predictability of the texture access pattern can be used to reduce the average number of memory accesses. We propose a novel memory architecture for textures, in which cache-style lookups are minimized by modifying the conventional kernel for bilinear filtering shown in Algorithm 3 into the one shown in Algorithm 4. The information about which of the four cases of Figure 3.4 applies to a texture access is obtained by comparing the block co-ordinates of the texels (lines 3 and 4 of Algorithm 4). If the accesses belong to case 1, where all the texels map to the same block, a lookup for texel 1 can be followed by fetching texels 2, 3, and 4 from the same block (lines 8-11 of Algorithm 4). Thus, we need only one lookup to fetch four texels. Similarly, for cases 2 (lines 12-17 of Algorithm 4) and 3 (lines 18-23 of Algorithm 4), two lookups are sufficient for the four texel accesses. Only case 4 (lines 24-25 of Algorithm 4) requires four lookups. For this to be possible, our buffering unit should be designed such that it allows both a cache-like lookup operation and a direct register access.
Algorithm 3 Kernel for Bilinear Filtering
Input: Texel co-ordinates (tu, tv), Base - starting address of texture
Output: Color
1: Compute texel addresses corresponding to texel co-ordinates (tu, tv), (tu+1, tv), (tu, tv+1) and (tu+1, tv+1)
2: for I = 1 to 4 do
3: texelI ← CacheLookup(texel addressI)
4: end for
5: color ← WeightedAverage(texel1, texel2, texel3, texel4)
6: return color
Though we use two additional comparison operations to classify the accesses into the different cases, we at the same time reduce the number of block address computations and eliminate the texel address computations. From our experiments on various benchmarks we observed (Figure 3.6) that, on average, 58% of the accesses are to the same block (case 1), 36% to two blocks (cases 2 and 3), and 6% to four blocks (case 4). Thus, only one lookup is required for 58% of the texture accesses. Even though this single lookup can consume the same power as a lookup in an associative cache (though smaller in magnitude, because our buffers have only 4 registers), the remaining three accesses require no lookup/comparison operation because the register containing the block is already known. On average, the number of memory accesses and comparisons is therefore drastically reduced.
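A back-of-the-envelope check of the lookup savings follows from the measured distribution. The percentages are the ones reported above; the cost model of one lookup per distinct block touched is a simplification:

```python
# Measured distribution of bilinear fetches over the four cases (Figure 3.6).
p_one, p_two, p_four = 0.58, 0.36, 0.06  # one, two, and four distinct blocks

# Expected buffer lookups per four-texel bilinear fetch.
expected_lookups = p_one * 1 + p_two * 2 + p_four * 4   # = 1.54

# A conventional cache performs one lookup per texel, i.e. 4 per fetch.
savings = 1 - expected_lookups / 4                      # ≈ 61.5% fewer lookups
```

Under this model, roughly 1.54 lookups suffice where a conventional cache performs 4, consistent with the drastic reduction claimed above.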
Algorithm 4 Modified Bilinear Texture Filtering
Input: Texel co-ordinates (tu, tv), Base - starting address of texture
Output: Color
1: bx ← tu >> lbw
2: by ← tv >> lbh
3: bx1 ← (tu + 1) >> lbw
4: by1 ← (tv + 1) >> lbh
5: c0 ← (bx = bx1) ? 0 : 1
6: c1 ← (by = by1) ? 0 : 1
7: Calculate offset1, offset2, offset3 and offset4
8: if c0 = 0 and c1 = 0 then
9:   compute block address1
10:  texel1 ← LookupBuffer(block address1, offset1)
11:  Read texels 2, 3 and 4 from the same block
12: else if c0 = 0 and c1 = 1 then
13:  compute block address1 and block address3
14:  texel1 ← LookupBuffer(block address1, offset1)
15:  Read texel 2 from the same block
16:  texel3 ← LookupBuffer(block address3, offset3)
17:  Read texel 4 from the same block
18: else if c0 = 1 and c1 = 0 then
19:  compute block address1 and block address2
20:  texel1 ← LookupBuffer(block address1, offset1)
21:  Read texel 3 from the same block
22:  texel2 ← LookupBuffer(block address2, offset2)
23:  Read texel 4 from the same block
24: else
25:  compute block addresses of all the four texels
26:  for I = 1 to 4 do
27:    texelI ← LookupBuffer(block addressI, offsetI)
28:  end for
29: end if
30: color ← WeightedAverage(texel1, texel2, texel3, texel4)
31: return color
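The case classification at the heart of Algorithm 4 can be sketched as follows. The function below is a hedged illustration, not the hardware implementation: `lbw` and `lbh` are log2 of the (assumed 4 × 4) block dimensions, and it returns the number of distinct blocks touched and the number of buffer lookups needed:

```python
def classify(tu, tv, lbw=2, lbh=2):
    """Classify a 2x2 bilinear footprint anchored at (tu, tv).

    Returns (blocks_touched, lookups_needed)."""
    c0 = int(tu >> lbw != (tu + 1) >> lbw)  # footprint crosses a vertical block edge
    c1 = int(tv >> lbh != (tv + 1) >> lbh)  # footprint crosses a horizontal block edge
    if c0 == 0 and c1 == 0:
        return 1, 1   # case 1: all four texels in one block, one lookup
    if c0 != c1:
        return 2, 2   # cases 2 and 3: two blocks, two lookups
    return 4, 4       # case 4: four blocks, four lookups
```

For example, a footprint well inside a block classifies as case 1, while one anchored at a block corner classifies as case 4.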
3.3 Architecture of Texture Filter Memory
In a conventional texturing unit, the address generator computes the block address and the offsets of the four texels to be bilinearly filtered. The four texels are fetched from the cache by the fetch unit, and the filtering unit performs the bilinear interpolation. There are two filtering units so that, during trilinear interpolation, the bilinear filtering of both the mipmap levels can be done in parallel.
Figure 3.6: Distribution of texture accesses between various cases

We include a Texture Filter Memory (TFM) in the texturing unit, which acts as an interface between the texturing unit and the texture memory. The architecture of the TFM is described in this section.
TFM consists of three components (Figure 3.7): (i) Texture Buffer Array, (ii) Address Comparators, and (iii) Controller.
3.3.1 Texture Buffer Array
A block of texture consists of 4 × 4 texels. Hence we need a buffer of 16 registers to hold one block of texture, an array of four buffers to store the four blocks needed for bilinear filtering, and eight buffers to also support trilinear filtering. We arrange the eight buffers as two sets of four buffers each. For bilinear filtering we use only one of the sets and turn off the other in order to reduce power. In trilinear filtering, we map the two sets to the two different mipmap levels. By doing so, each texel lookup searches only four buffers instead of eight. The address bus width of the TBA is 7 bits: one bit to select between the two sets, two bits to select the buffer within a set, and a four-bit offset into the buffer.
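The 7-bit TBA address can be pictured as three packed fields. The exact field ordering below is an assumption made for illustration; the thesis specifies only the field widths:

```python
def tba_address(set_sel, buffer_idx, offset):
    """Pack a TBA address: 1 set-select bit, 2 buffer-select bits, 4 offset bits."""
    assert 0 <= set_sel < 2      # one of two sets (mipmap level in trilinear mode)
    assert 0 <= buffer_idx < 4   # one of four buffers per set
    assert 0 <= offset < 16      # one of 16 texels in a 4x4 block
    return (set_sel << 6) | (buffer_idx << 4) | offset
```

The maximum packed value is 127, confirming that 7 bits suffice for the full address space of 2 × 4 × 16 = 128 texel registers.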
Figure 3.7: TFM architecture
3.3.2 Address Comparators
The TFM has two blocks of address comparators, each associated with one set of buffers in the TBA. A comparator block consists of four registers that store the addresses of the blocks present in the buffers of that set. When the address comparator receives an address, it compares it in parallel with the addresses saved in the four registers. The outputs of the comparators are sent to the controller and are also encoded to give, in case of a hit, the address of the buffer in which the texture block resides.
3.3.3 Controller
The texel fetch unit provides the block address and the offset of the texel, along with the mipmap level from which the texel is to be fetched. An access to the TFM can be: (i) a direct access, when the fetch is to the same block as the previous texel, or (ii) a lookup, when it is not known in which of the buffers the texel resides. The texel fetch unit determines the type of each access as shown in Algorithm 4 and provides this information to the controller. In case of a lookup, the controller enables the appropriate address comparator and TBA entry. For bilinear filtering, only one of the banks and its associated comparator are enabled. For trilinear filtering, the controller compares the mipmap level of the current access with that of the previous one and, when it changes, toggles the comparator enable and bank select signals. Thus the texels of two different mipmap levels are always mapped to two different sets, reducing interference. The controller combines the outputs of the address comparator to determine whether the lookup resulted in a hit or a miss.
• Upon a hit, the buffer address generated by the comparator is registered, so that for successive accesses to the same block it is sufficient to provide only the offset, and the costly comparisons can be avoided. These accesses are called direct accesses. Thus we achieve the hit rate of a fully associative cache, but with far fewer comparisons per access. It might appear that accesses needing an address comparison take two cycles, but the design is inherently pipelined: since the output of the comparator is registered, we can achieve one fetch every cycle.
• When there is a TBA miss, the controller issues a read signal to the L1 cache and a block of texture is moved from the L1 cache to the TBA. A pseudo-LRU policy [46] is used by the comparator to select the block to be replaced. The controller issues a load signal to the corresponding address register in the comparator so that it now stores the address of the new block. It also issues a load signal to the buffer in the TBA, and the registers in the buffer are loaded in parallel. A 512-bit internal bus between the L1 cache and the TBA fills the buffer. From synthesis of the TFM, we observed that its access time is about half that of the cache, and hence we can fill the buffer in two cycles on a miss.
In high-throughput texture cache designs with multiple banks, the TFM can likewise be divided into as many banks as the L1 texture cache, associating each L1 bank with the corresponding TFM bank.
3.4 Static Power Reduction due to Texture Filter Memory
A line in a texture cache is retained even after the spatially local accesses to it have completed, in anticipation of temporal reuse. A study of the life cycle of a typical line in a conventional texture cache shows that it undergoes long intervals of no activity interleaved with bursts of high activity. As a result, each cache line leaks power for the majority of its lifetime.

Various circuit-level techniques have been proposed in the literature to tackle the problem of leakage in caches. One of the most popular is selectively putting cache lines into a low-power mode called the drowsy mode. In the drowsy state, a cache line retains its data; however, to access the line it must first be transitioned back to its normal power state, which incurs a delay of a few cycles. Thus, there is a trade-off between the power saved and the corresponding delay incurred. One popular technique for reducing the leakage power of CPU caches is to put all cache lines into drowsy mode periodically, after a defined window of cycles [47]. Though this technique is simple to implement and results in significant power saving, it also degrades performance due to the overhead of waking up the lines that are accessed. Hence a more sophisticated technique is also used [47]: a register associated with each cache line tracks the line's activity, and the line is switched to drowsy mode when its activity falls below a threshold. But this incurs hardware overhead for the tracking registers and switching logic, and each mis-prediction still incurs power and performance penalties. Another important design decision is whether to put the tag array into drowsy mode along with the data array; doing so incurs the penalty of an extra cycle to wake up the tag array along with the data array. All the above techniques trade performance for power.
However, the quality of graphics applications is quite sensitive to performance. Hence it is not possible to directly adapt leakage power optimization techniques targeted at CPU caches to conventional texture caches. In this section, we demonstrate how the Texture Filter Memory can be exploited to reduce the leakage power consumption of the texture memory sub-system.

In the proposed texture cache architecture with a TFM, consecutive accesses to a block of texture are directed to the buffers rather than to the texture L1 cache. Since accesses to the L1 are at the granularity of blocks, the duration of activity of each cache line decreases further, and the line leaks more power than in a conventional texture cache. But we show that, due to the presence of the TFM, a smart technique can reduce the leakage power of the texture memory sub-system to levels below that consumed by a conventional texture cache. Since all consecutive accesses to a cache line hit the TFM and not the L1 cache, we propose to maintain the texture L1 cache in the drowsy state at all times. The L1 is woken up only when a TFM miss occurs and an access is therefore made to the L1 cache. Both the data and tag arrays of the cache are woken up on an access and put back into the drowsy state immediately afterwards. By doing so, however, every TFM miss incurs the additional overhead of waking up the data and tag arrays of the L1 cache. To reduce this overhead we use a predictive wake-up mechanism, explained in the section below.
3.4.1 Predictive wake-up
• Since bilinear interpolation might access the four nearest neighbouring texture blocks, when we encounter an access to the block of a base texel in the L1 cache, we wake up its three nearest neighbours as well.
• On every access to the block of a base texel in the L1 cache, we predict the next base block and wake up that block as well. To predict the next base block, we track the direction of progression of consecutive texels in the x and y directions, ∆x and ∆y respectively. If ∆x is positive and ∆y is positive, we expect the future accesses to fall in one of the three blocks shown in Figure 3.8 and wake them up.
Figure 3.8: Pre wake up – Case 1
If ∆x is negative and ∆y is positive, we wake up the three blocks as shown in Figure 3.9.
Figure 3.9: Pre wake up – Case 2
If ∆x is negative and ∆y is negative, we wake up the three blocks as shown in Figure 3.10.
If ∆x is positive and ∆y is negative, we wake up the three blocks as shown in Figure 3.11.
• On every new prediction, we wake up the new set of lines and put the rest to sleep.
Figure 3.10: Pre wake up – Case 3
Figure 3.11: Pre wake up – Case 4
Predictive wake-up of a line can be pipelined with accesses to an active line, effectively hiding the wake-up delay.
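The quadrant selection behind the four cases in Figures 3.8 to 3.11 can be summarized compactly: the signs of the texel gradient pick the three neighbouring blocks, in block-relative co-ordinates, to pre-wake. This is a sketch of the selection logic only; the tie-breaking at a zero gradient component is an assumption:

```python
def blocks_to_wake(dx, dy):
    """Given the texel gradient (dx, dy), return the three block offsets,
    relative to the current base block, predicted to be accessed next."""
    sx = 1 if dx >= 0 else -1   # direction of progression in x
    sy = 1 if dy >= 0 else -1   # direction of progression in y
    # The horizontal, vertical, and diagonal neighbours in that quadrant.
    return [(sx, 0), (0, sy), (sx, sy)]
```

For a positive gradient in both directions this yields the right, upper, and upper-right neighbours, matching Case 1; flipping a sign selects the mirrored quadrant of the corresponding figure.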
With our technique, at most 6 L1 cache lines are active at any instant. In a 16KB L1 cache, this amounts to a negligible 2.5%. The accuracy of our prediction is as high as 95%, as shown in Figure 3.12.
Interestingly, our prediction mechanism for waking up cache lines can also be used for pre-fetching lines into the L1 cache. Since all the spatially local accesses to a line are served by the filter memory, we can pre-fetch lines into the L1 cache without the risk of replacing lines that may be accessed in the near future.
Figure 3.12: Pre-wakeup prediction accuracy for the Unreal, Doom, Prey and Quake benchmarks
3.5 Extension to other architectures and filters
In this section we describe how the TFM can be used to optimize the power of other common filtering methods, and how it can be scaled to support parallel and multi-texturing architectures.
3.5.1 Anisotropic Filtering
Since there is little predictability in the access patterns of anisotropic filtering, we cannot use our earlier search mechanism. Instead, we propose to configure the TFM like a block-buffered fully associative cache [48]. Because of high spatial locality, we expect a large number of texels to map to the same texture block, resulting in a single comparison, rather than four, for all the texels falling in a block. On a miss from the buffered block, we search the buffer array. This architecture has two hit times: a fast hit, when the texel is in the same block as the previous texel, and a slow hit, when the texel is in the buffer array but not in the block to which the previous texel mapped. A fast hit in the buffer array needs one cycle and a slow hit needs two. But since the access time of the buffer is half that of the L1 cache, we do not lose performance compared to an architecture without the buffer unit.
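The two-level hit behaviour described above can be sketched as follows. The cycle counts come from the text; the lookup interface itself is a hypothetical simplification:

```python
def lookup(block_addr, mru_addr, buffered_addrs):
    """Classify an access in the block-buffered fully associative TFM mode.

    block_addr:     block of the requested texel
    mru_addr:       block of the previously accessed texel (most recently used)
    buffered_addrs: set of block addresses currently held in the buffer array
    """
    if block_addr == mru_addr:
        return "fast_hit"   # 1 cycle: no associative search needed
    if block_addr in buffered_addrs:
        return "slow_hit"   # 2 cycles: search the buffer array
    return "miss"           # fetch the block from the L1 cache
```

Since a slow hit costs two buffer cycles, which equal one L1 cache access time, even the worst-case hit is no slower than going directly to the L1.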
3.5.2 Parallel Texturing
In conventional parallel rasterization hardware, there are multiple rasterization units and multiple texturing units connected through an interconnection network, with all the texturing units sharing an L1 cache. For efficient dynamic load balancing, it should be possible to schedule a texel fetch operation from any rasterizer to any texturing unit; as a result, the texel fetches of a texturing unit do not exhibit the spatial locality discussed earlier to the same extent. To retain the benefit of the TFM in our proposed architecture, we need to ensure high spatial locality among the texels being filtered in each texturing unit. The simplest way of achieving this is to tie each texturing unit to a rasterizer and select the rasterization algorithm such that the texturing units are uniformly loaded. The proposed parallel rasterization hardware with the modified texturing unit is shown in Figure 3.13. Each texturing unit has a Texture Addressing unit (TA), two Bilinear Interpolation Units (BIUs), and one TFM.
Figure 3.13: TFM for parallel texturing
We analyze below the impact of our proposed architecture on the following common tiled rasterization algorithms.
Tiled Rasterization The screen space is sub-divided into fixed-size tiles, and each rasterizer is responsible for a fraction of the tiles. Since the tiles can contain varying numbers of fragments, there can be load imbalances among the rasterizers.
Object Space Subdivision The object space is subdivided into groups of primitives and distributed among the rasterizers in a round robin fashion. For efficient dynamic load balancing larger primitives could be divided into smaller primitives.
Striped Rasterization The fragments are divided according to an image space subdivision into 2-pixel wide vertical stripes. These stripes can be assigned to the rasterizers in round-robin order for efficient load balancing. In case the lengths of scanlines vary significantly, sub-division into equal-length sub-scanlines could also be considered.
In each of the above schemes, spatially local fragments are rendered by each of the parallel raster units, so as to maximize locality in the texture and framebuffer memories. We propose to introduce a TFM into each texturing unit to exploit the spatial locality and predictability of the texture accesses.
3.5.3 Multi Texturing
Multitexturing is the process of applying more than one texture to a primitive [29]. Since several textures are fetched into the cache, the number of conflict misses increases considerably during multitexturing. In [49], the authors suggest a partitioned cache in which each active texture is allotted a partition. In our architecture, we observed that by buffering two blocks of each active texture in the TFM we can reduce conflict misses in the L1 cache significantly. This is because 94% of the time the four bilinearly interpolated texels fall in either one or two of the neighboring blocks of the texture (Figure 3.4). Hence by buffering two blocks per texture, we can eliminate most of the accesses to the L1, thereby reducing conflicts. We observed that the TFM, along with a 4-way associative L1 cache behind it, reduces misses considerably and achieves a hit rate equal to that of a partitioned cache. Hence, by introducing the TFM we can use a conventional L1 cache and thus eliminate the overhead of cache partitioning. Since we have 8 buffers per texturing unit, we can support multitexturing with bilinear interpolation of up to 4 simultaneous textures, which is sufficient for most current applications.
3.6 Experiments and Results
We describe in this section the experiments conducted to validate our proposed architecture on typical rendering examples used for evaluating graphics hardware. We have developed a trace-driven simulator for the proposed architecture and instrumented the Mesa code (a software renderer for the graphics pipeline [50]) to generate the memory traces. Our experimental platform for validating the power optimizations in the proposed architecture is as follows: we developed a synthesizable VHDL model for the design and used Synopsys Design Compiler and PrimePower for synthesis and power/energy simulation. We also used CACTI models [51] for estimating the energy of the caches and SRAMs in the designs.
3.6.1 Evaluation of the TFM architecture
The overall energy of a texture memory architecture depends on two main factors:
• Hit rates to the lower/smaller levels of the hierarchy
• Access energy of the corresponding levels
In our proposed architecture, the hit rates are lower than those obtained with a large L1 cache, but the energy per access is much smaller. The overall energy for the TFM is the lowest among the evaluated architectures.
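This trade-off can be captured by a first-order energy model: every access pays the small structure's energy, and misses additionally pay the L1 energy. The per-access energies below are placeholder values for illustration, not the synthesis results reported in this section:

```python
def avg_energy(hit_rate, e_small, e_l1):
    """First-order average energy per texel access for a two-level hierarchy.

    hit_rate: fraction of accesses served by the small front-end structure
    e_small:  energy of one access to the small structure (e.g. TFM)
    e_l1:     energy of one access to the L1 cache behind it
    """
    return e_small + (1 - hit_rate) * e_l1

# Illustrative placeholder numbers: a TFM at a 90% hit rate with a tenth of
# the L1's access energy averages 0.2 units per access, versus 1.0 for
# hitting a large L1 directly.
tfm_avg = avg_energy(0.9, 0.1, 1.0)
```

The model shows why a modest drop in hit rate can still be a net win when the front-end structure's access energy is an order of magnitude smaller.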
Hit rate comparison
Figure 3.14 shows the hit rate into TFM and compares it with the hit rates obtained by the following architectures:
1. A conventional 16KB, 2-way set associative L1 cache with a 256KB, 4-way set associative L2 cache
2. 512B direct mapped cache as L1 and 256 KB L2 cache
3. 512B direct mapped filter cache along with L1 and L2
4. 512B fully associative filter cache along with L1 and L2