POWER OPTIMIZATIONS FOR GRAPHICS PROCESSORS

B.V.N.SILPA

DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING
INDIAN INSTITUTE OF TECHNOLOGY DELHI

March 2011

POWER OPTIMIZATIONS FOR GRAPHICS PROCESSORS

by

B.V.N.SILPA

Department of Computer Science and Engineering

Submitted in fulfillment of the requirements of the degree of Doctor of Philosophy

to the

Indian Institute of Technology Delhi

March 2011

Certificate

This is to certify that the thesis titled Power optimizations for graphics processors being submitted by B V N Silpa for the award of Doctor of Philosophy in Computer Science & Engg. is a record of bona fide work carried out by her under my guidance and supervision at the Department of Computer Science & Engineering, Indian Institute of Technology Delhi. The work presented in this thesis has not been submitted elsewhere, either in part or full, for the award of any other degree or diploma.

Preeti Ranjan Panda
Professor
Dept. of Computer Science & Engg.
Indian Institute of Technology Delhi

Acknowledgment

It is with immense gratitude that I acknowledge the support and help of my advisor, Professor Preeti Ranjan Panda, in guiding me through this thesis. I would like to thank Professors M. Balakrishnan, Anshul Kumar, G.S. Visweswaran and Kolin Paul for their valuable feedback, suggestions and help in all respects. I am indebted to my dear friend G Krishnaiah for being my constant support and an impartial critic. I would like to thank Neeraj Goel, Anant Vishnoi and Aryabartta Sahu for their technical and moral support. I would also like to thank the staff of the Philips and FPGA laboratories and of IIT Delhi for their help. I owe my deepest gratitude to Microsoft Research India for supporting my research by granting me the MSR fellowship. I would also like to thank Intel India Pvt. Limited for funding my research. I would like to specially thank Kumar S S Vemuri for the mentorship I received from him. This thesis is dedicated to my family members, who have shown immense patience and provided me with great support during the course of my work.

B V N Silpa

ABSTRACT

Advances in computer graphics have led to the creation of sophisticated scenes with realistic characters and fascinating effects. As a consequence, the computational complexity of graphics applications has also increased tremendously. With increasing interest in sophisticated graphics capabilities in mobile systems, energy consumption of graphics hardware is becoming a major design concern in addition to the traditional performance enhancement criteria. This motivates us to focus on designing low power graphics processors for mobile devices. We present the first comprehensive power optimization work targeting the computer graphics rendering pipeline. The power minimization is targeted at different levels of abstraction: component level, compiler level, and full system level. The main contributions of this thesis are the following:

• A custom memory architecture for low power texture memory sub-system.

• A code optimization technique that reduces the computational complexity and hence the power consumption of the geometry engine.

• System level power optimization by Dynamic Voltage and Frequency Scaling for tiled graphics processors.

Among the different steps in the graphics processing pipeline, we observe that memory accesses during texture mapping – a highly memory intensive phase – contribute 30-40% of the energy consumed in typical embedded graphics processors. This makes the texture mapping subsystem an attractive candidate for energy optimization. We argue that a standard cache hierarchy, commonly used by researchers and commercial graphics processors for texture mapping, is wasteful of energy, and propose the Texture Filter Memory, an energy efficient architecture that exploits locality and the relatively high degree of predictability in texture memory access patterns. Our architecture consumes 75% less energy for texturing in a fixed function pipeline and about 85% less energy in a parallel rasterization hardware. It also achieves 7% more hits than a partitioned cache generally used for multitexturing. Interestingly, our proposed architecture also achieves higher performance than conventional texture mapping hardware. We also demonstrate that the introduction of these filter buffers helps greatly in reducing the leakage power consumption of the texture memory sub-system. Our proposed drowsy texture L1 with predictive wake-up helps in achieving 80% leakage power savings at the cost of less than 1% performance loss.

We have observed that the geometry engine also contributes significantly towards the total power consumption in modern games. This is because the creation of scenes with increasing levels of detail is escalating the amount of geometry per frame, making the geometry engine one of the computationally intensive stages of the pipeline. In this thesis we propose a mechanism to reduce the amount of computation in the geometry engine, thereby reducing the power consumption of the geometry engine and at the same time speeding up the geometry processing. This is achieved by partitioning the vertex shader into position-variant and position-invariant parts and executing the position-invariant part of the shader only on those triangles that pass the trivial reject test. Our main contributions here are: (i) a partitioning algorithm that attempts to minimize the duplication of code between the two partitions of the shader, and (ii) an adaptive mechanism to enable the vertex shader partitioning so as to minimize the overhead incurred due to thread-setup of the second stage of the shader. From the results we observe a saving of up to 50% of vertex shader instructions, and hence a speed-up of up to 15%. Due to the significant savings in the number of vertex shader instructions, we can expect attractive savings in the power consumed by the geometry engine.

From the study of various modern games we observe that the workload varies significantly with time and hence can benefit from dynamic voltage and frequency scaling (DVFS), which reduces the system level power consumption of the GPU. Since the visual quality of graphics applications is highly dependent on the rate at which frames are processed, it is important to devise a DVFS scheme that minimizes deadline misses due to inaccuracies in workload prediction. We demonstrate that tiled-graphics renderers exhibit substantial advantages over immediate-mode renderers in obtaining access to frame parameters that help in enhancing the workload estimation accuracy. We also show that operating at the finer granularity of "tiles" as opposed to "frames" allows early detection and corrective action in case of a mis-prediction.
We propose an accurate workload estimation technique and two DVFS schemes, namely (i) tile-history based DVFS and (ii) tile-rank based DVFS, for tiled-rendering architectures. The proposed schemes are demonstrated to be more efficient in terms of power and performance than the frame level DVFS schemes proposed in recent literature. For a system with 8 DVFS levels, our tile-history based DVFS scheme results in a 60% improvement in quality (fewer deadline misses) over the frame history based DVFS schemes and gives a 58% saving in energy. The more sophisticated tile-rank based scheme achieves a 75% improvement in quality over the frame history based DVFS scheme and results in a 58% saving in energy. We have also compared the efficiency of the proposed tile-level DVFS schemes with frame-level schemes with an increasing number of DVFS levels, and found that while the frame-level schemes suffer from increasing deadline misses as the frequency levels increase, the impact on our tile-level schemes is negligible. The energy per frame-rate for our scheme is the minimum, indicating that it delivers the best performance-energy results.

Contents

List of Figures

List of Tables

1 Introduction
1.1 Introduction to Graphics Processing
1.1.1 Application Stage
1.1.2 Geometry
1.1.3 Triangle Setup
1.1.4 Rasterization
1.1.5 Display
1.2 Graphics Processor Architecture
1.2.1 Immediate Mode Rendering Engines
1.2.2 Tiled Graphics Engines
1.3 Power Dissipation in a Graphics Processor
1.4 Our Contribution
1.5 Thesis Outline

2 Literature Survey
2.1 Programmable Units
2.1.1 Clock Gating
2.1.2 Fixed Function ALUs
2.1.3 Predictive Shutdown
2.2 Texture Unit
2.2.1 Low power cache configurations


2.2.2 Texture Compression
2.2.3 Clock Gating
2.3 Frame Buffer
2.3.1 Depth Buffer Compression
2.3.2 Color Buffer Compression
2.4 System Level Power Management
2.4.1 Power Modes
2.4.2 Dynamic Voltage and Frequency Scaling
2.4.3 Multiple Power Domains
2.5 Miscellaneous

3 Texture Filter Memory
3.1 Introduction
3.2 Texture Mapping Access Pattern
3.3 Architecture of Texture Filter Memory
3.3.1 Texture Buffer Array
3.3.2 Address Comparators
3.3.3 Controller
3.4 Static Power Reduction due to Texture Filter Memory
3.4.1 Predictive wake-up
3.5 Extension to other architectures and filters
3.5.1 ......
3.5.2 Parallel Texturing
3.5.3 Multi Texturing
3.6 Experiments and Results
3.6.1 Evaluation of the TFM architecture
3.6.2 Leakage Power and Delay Comparison
3.6.3 Parallel rasterization architecture
3.6.4 Multitexturing
3.7 Summary

4 Vertex Shader Partitioning
4.1 Introduction
4.2 Shader Compiler
4.2.1 Partitioning with Minimum Duplication
4.2.2 Comparison with the naïve algorithm
4.2.3 Selective Partitioning of Vertex Shader
4.3 Framework
4.3.1 Partitioning the assembly code
4.3.2 Vertex Input Buffer
4.3.3 Feeder Unit
4.4 Experiments and Results
4.4.1 Performance Improvement
4.4.2 Energy Reduction
4.5 Summary

5 DVFS for Tiled GPUs
5.1 Introduction
5.2 Workload Estimation of a Tiled Graphics Processor
5.2.1 Frame Rank Computation
5.2.2 Tile Rank Computation
5.2.3 Extraction of Ranks
5.3 DVFS Scheme
5.3.1 Tile History Based DVFS
5.3.2 Tile Rank Based DVFS
5.4 Experiments and Results
5.4.1 Performance Impact
5.4.2 Energy Saving
5.5 Summary

6 Conclusion and Future Work
6.1 Summary and conclusion
6.2 Future Work

Bibliography

List of Figures

1.1 Evolution of character modeling over the years [1]
1.2 Increasing transistor counts in GPUs [1]
1.3 Power break-up in a typical desktop computer [2]
1.4 Power break-up in a typical mobile computer [2]
1.5 Graphics Pipeline
1.6 Three different poses of the same character in a 3D game
1.7 Transformations on an object
1.8 Object space culling
1.9 Space-Space transformations on an object
1.10 Types of light sources
1.11 Scan line conversion
1.12 Texture mapping example
1.13 Antialiasing
1.14 Double buffering
1.15 The CPU – GPU interface
1.16 Fixed function GPU
1.17 Triangles sharing vertices
1.18 Indexed addressing into vertex buffer
1.19 Tiled triangle traversal
1.20 Unified shader architecture for graphics processor
1.21 Tiled rendering
1.22 Tiled graphics pipeline
1.23 Tiled graphics processor architecture
1.24 State management in tiled GPUs


1.25 Fraction of energy consumed in texture unit
1.26 Footprint of the die of GT200 (approximately to the scale)

2.1 Processing Element (PE)
2.2 S3TC texture compression
2.3 PID controller
2.4 PID controller based DVFS for graphics processor
2.5 Signature based DVFS for graphics processor

3.1 Texture mapping to model a globe
3.2 Oblique traversal of scanlines in texture space
3.3 Footprint of a Bilinear filter
3.4 Scenarios to which the texture footprint could be mapped
3.5 Blocked representation of texture
3.6 Distribution of texture accesses between various cases
3.7 TFM architecture
3.8 Pre wake up – Case 1
3.9 Pre wake up – Case 2
3.10 Pre wake up – Case 3
3.11 Pre wake up – Case 4
3.12 Pre-wakeup prediction accuracy
3.13 TFM for parallel texturing
3.14 Hit rate comparison for texture memory architectures
3.15 Average access energy comparison of texture memory architectures
3.16 Average access times comparison of texture memory architectures
3.17 Area of several texture memory architectures
3.18 Leakage power consumption of various texture memory architectures
3.19 Delay overhead of various drowsy policies
3.20 Hit rate in parallel texture cache architecture
3.21 Average access energy in parallel texture cache architecture
3.22 Area of the parallel texture caching architectures
3.23 Hit rates for multitexturing

4.1 Vertex shader partitioning example
4.2 Pipeline modified to support vertex shader partitioning
4.3 DAG representing the data-flow in a vertex shader
4.4 Vertex shader partitioning algorithm: Case 1
4.5 Vertex shader partitioning algorithm: Case 2
4.6 Comparison of proposed algorithm with the existing one
4.7 Variation of Trivial Rejects across frames of UT2004
4.8 ATTILA architecture
4.9 Modified ATTILA architecture
4.10 Feeder unit architecture
4.11 % Instructions saved due to adaptive vertex shader partitioning
4.12 % Cycles saved due to adaptive vertex shader partitioning
4.13 Frame captures from the games

5.1 Workload variation in games
5.2 I/O buffering for video decoder
5.3 Buffer occupancy based DVFS for video decoder
5.4 Buffer occupancy based correction of workload prediction
5.5 Consecutive frames of UT2004
5.6 Consecutive frames of UT2004
5.7 Accuracy of tile history based prediction
5.8 Consecutive frames of UT2004
5.9 Accuracy of frame rank based predictions
5.10 Area of overlap as a measure of count
5.11 Accuracy of tile rank based prediction
5.12 Tile level DVFS
5.13 Deadline Misses at target of 60fps
5.14 Frame Rate at target of 60fps
5.15 Normalized Energy at 60fps
5.16 Normalized Energy at 30fps
5.17 Energy per Frame-Rate at 30fps

List of Tables

2.1 Exponential encoding for color buffer compression

4.1 % Trivial rejects per frame
4.2 % Instructions saved due to vertex shader partitioning

5.1 DVFS results at 60 FPS
5.2 DVFS results at 30 FPS

Chapter 1

Introduction

The field of computer graphics has advanced profoundly in recent years, leading to the creation of realistic characters and fascinating effects. What started as a tool to generate 2D pixel art has now evolved to be capable of generating complex 3D images incorporating intricate details of objects. The evolution of video game character animation over the years, as shown in Figure 1.1, demonstrates this fact. The real time generation of these complex images in computer graphics applications was made possible by advancements in the semiconductor industry, which provided more and more transistors in each technology node. Initially, graphics applications were run on general purpose CPUs. However, the gradual increase in complexity of these applications has led to the development of hardware accelerators for graphics processing called Graphics Processing Units (GPUs). Since then, each generation of GPU has witnessed a tremendous increase in transistor density. This can be observed from Figure 1.2, which shows the transistor counts of several generations of graphics cards [1]. With aggressive technology scaling over the years, the computational capacity of mobile platforms has also increased tremendously. As a result, traditional 3D graphics applications which have been developed for either desktops or dedicated gaming consoles, such as gaming, GPS-backed maps, screen savers, and animated chats, are emerging as possible applications for mobile devices as well. The challenge in porting complex 3D graphics applications onto mobile platforms is posed not as much by performance as by power consumption. Since mobile devices are powered by battery, and battery capacity is not increasing on par with the processing power of chips, the gap between the demand and supply of power is widening.


Figure 1.1: Evolution of character modeling over the years [1]. Panels: (a) Pacman (1980s): basic 3D image; (b) Toy Soldier (2000): real-time reflection and shadow; (c) Zoltar (2001): real-time lip movements; (d) Dawn (2003): realistic features; (e) Naluu (2004): soft shadows on skin and hair; (f) Mad Mod Mike (2005): realistic clothes; (g) Adrianne (2006): complex … and deformation; (h) Human Head (2007): realistic skin texture and deformation.

Figure 1.2: Increasing transistor counts in GPUs [1] (log-scale transistor counts, in millions, for NV1, GeForce 256, GeForce 4, GeForce FX, GeForce 6, GeForce 7, GeForce 8, GTX 200, and GTX 400)

Moreover, due to constraints on the size of these devices, the cooling solutions that can be used in these devices are limited. In fact, the current generation semiconductor technology is said to have hit the "power wall". As a consequence, the problem of inadequate cooling is not only limited to mobile devices, but is also applicable to desktop systems. The increasing popularity of graphics applications has thus introduced an additional constraint of power in the design of a graphics subsystem, to the already existing ones of performance and quality. A recent study on PC energy efficiency trends by Intel Technologies [2] shows that the graphics processor is a major component of the total energy dissipated in desktop and mobile computers, as shown in Figure 1.3 and Figure 1.4 respectively.

From the figures we can see that the GPU consumes as much power as the CPU in desktop computers. In mobile computers, the GPU consumes double the power of the CPU, making it the major source of power dissipation in these systems. Hence, it is very important to focus our attention on low power solutions for graphics processing. In this thesis, we investigate the major sources of power consumption in a graphics processor and propose optimizations for the same.

Figure 1.3: Power break-up in a typical desktop computer [2] (Monitor 56%, power supply loss 22%, other 7%, graphics 6%, CPU 4%, HDD/DVD 4%, VR 1%)

1.1 Introduction to Graphics Processing

The aim of graphics processing is to generate the view of a scene on a display device. The pipeline processes the complex geometry present in the scene, which is represented using several smaller primitives such as triangles, lines, etc., to produce the color corresponding to each position on a 2D screen called a pixel. Several operations are applied in sequential order, on the data representing the mathematical model of an object, to create its graphical representation on the screen. The high level view of the flow of these operations, generally called the Graphics Pipeline, is illustrated in Figure 1.5.

1.1.1 Application Stage

The application layer acts as an interface between the player and the game engine. Based on the inputs from the player, the application layer places the view camera, which defines the position of the "eye" of the viewer in 3D space.

Figure 1.4: Power break-up in a typical mobile computer [2] (14.1' LCD 33%, graphics 14%, chipset 13%, rest of the platform 13%, HDD/DVD 9%, CPU 7%, power supply loss 7%, cooling fan 4%)

The application also maintains the geometry database, which is the repository of the 3D world and the objects used in the game, represented as geometric primitives (triangles, lines, points, etc.). Every object is associated with a position attribute defining its placement in the world space, and a pose defining the orientation of the object and movable parts of the object with respect to a fixed point on the object, as shown in Figure 1.6. Animation is the process of changing the position and pose of the objects from frame to frame, so as to cause the visual effect of motion. The movement of objects in a frame can be brought about by a combination of translation, rotation, scaling, and skewing operations, as shown in Figure 1.7:

• Translation: Displacement of an object from one position to another.

• Rotation: Movement of an object around an axis causing angular displacement.

Figure 1.5: Graphics Pipeline. The stages and their main operations: Application (1. position camera, 2. animation, 3. frustum culling, 4. occlusion culling, 5. calculate LOD); Geometry (1. translate, rotate & scale, 2. transform to view space, 3. lighting, 4. perspective divide, 5. clipping and culling); Triangle Setup (1. slope and delta calculation, 2. scan-line convert); Rasterization (1. shading, 2. texturing, 3. fog, 4. alpha, depth & stencil test, 5. antialiasing); Display (1. swap buffers, 2. screen refresh).

Figure 1.6: Three different poses of the same character in a 3D game

• Scaling: Resizing of an object to cause the perception of depth. The object appears to magnify when it is approaching the viewer and diminishes as it moves away.

• Skewing: Reshaping an object by scaling it along one or more axes, effectively changing the pose of the object.

The application associates the objects in the geometry database with transformations as determined by the game play. The actual transformation operation on the primitives of the object happens in the next stage – the geometry stage of the pipeline. In addition, the application layer also identifies possible collisions among objects in the frame and generates a response in accordance with the game play. This layer is also responsible for processing the artificial intelligence (AI), physics, audio, networking, etc. Since graphics processing is highly computation intensive, efforts are made to reduce the workload by avoiding operations that would result in no actual or perceivable change in what is displayed on the screen. One such technique, called Frustum Culling, is generally employed by almost all game engines to avoid rendering objects that fall totally outside the view frustum, as shown in Figure 1.8 (Object A). Another such technique is Occlusion Culling, which is a visibility test performed by the application to identify the objects that are totally obstructed by some other object in the scene, as shown in Figure 1.8 (Object C). Since these objects would not be visible on the screen anyway, we can avoid processing them, thereby saving computation time. Another popular technique used for graphics workload reduction is adjustment of the precision at which objects are rendered, based on their distance from the view camera. This is based on the observation that, as the distance between the object and the eye increases, the precision with which its details can be perceived decreases. Hence a far-off object can be modeled with fewer primitives than those required to model a nearer object. This Level of Detail (LOD) based technique results in loss of model detail, whereas frustum culling and occlusion culling are non-lossy.
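The LOD adjustment described above amounts to picking a mesh resolution from the object's distance to the camera. The following is a minimal C sketch of such a selection; the distance thresholds and the number of levels are illustrative assumptions, not values from any particular game engine:

/* A minimal sketch of distance-based LOD selection; thresholds and the
 * number of levels are hypothetical, not taken from the thesis. */
#include <stdio.h>
#include <math.h>

typedef struct { float x, y, z; } Vec3;

/* Pick a mesh resolution level from the object's distance to the camera:
 * level 0 is the full-detail mesh, higher levels use fewer primitives. */
static int select_lod(Vec3 cam, Vec3 obj, const float *cutoffs, int levels) {
    float dx = obj.x - cam.x, dy = obj.y - cam.y, dz = obj.z - cam.z;
    float dist = sqrtf(dx * dx + dy * dy + dz * dz);
    for (int l = 0; l < levels - 1; ++l)
        if (dist < cutoffs[l])
            return l;
    return levels - 1;           /* farthest objects get the coarsest mesh */
}

int main(void) {
    const float cutoffs[] = { 10.0f, 50.0f, 200.0f };  /* hypothetical */
    Vec3 cam = { 0, 0, 0 }, obj = { 0, 0, 75.0f };
    printf("LOD level: %d\n", select_lod(cam, obj, cutoffs, 4));
    return 0;
}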

Figure 1.7: Transformations on an object ((a) Translation, (b) Rotation, (c) Scaling, (d) Skewing)

Figure 1.8: Object space culling ((a) View Space, (b) Screen Space)

1.1.2 Geometry

The geometry engine receives the vertices representing the primitives of the objects as inputs from the application stage. The first step in the geometry stage is to apply the transformations (associated with them in the application stage) on the primitives. The various transformations are illustrated in Figure 1.9. In newer pipeline implementations, the geometry engine is also capable of animating the primitives. In this case, the transformations are generated and applied by the geometry engine itself. In addition to these transformations, the geometry engine also needs to apply space-space transformations on the primitives. Various spaces used to represent a scene are illustrated in Figure 1.9 and discussed below:

• Model Space: where each object is described with respect to a co-ordinate system centered at a point on the object.

• World Space: where all the objects that form the scene are placed in a common co-ordinate space.

• View Space: where the camera/eye forms the center of the world, thus representing the world as seen by the viewer.

Figure 1.9: Space-Space transformations on an object ((a) Model Space, (b) World Space, (c) View Space)
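For concreteness, the model-to-world step can be written in the standard homogeneous-coordinate form used in graphics textbooks (this equation is supplied here for illustration; it is not stated in the original text):

\[
\begin{bmatrix} x_w \\ y_w \\ z_w \\ 1 \end{bmatrix}
=
\begin{bmatrix} R_{3\times 3} & t \\ 0^{\top} & 1 \end{bmatrix}
\begin{bmatrix} x_m \\ y_m \\ z_m \\ 1 \end{bmatrix}
\]

where $R_{3\times 3}$ is a rotation (possibly combined with scaling) and $t$ a translation. The world-to-view transformation has the same form, built from the inverse of the camera's pose.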

To transform the primitives from model space to view space, they are either first transformed to world space and then to view space, or directly transformed to view space. In terms of operations, these transformations are also a combination of translations and rotations. The next step is lighting the vertices, taking into account the light sources present in the scene and also the reflections from the objects present in the scene. The lighting of primitives could be done either at vertex level or pixel level. Though pixel level shading results in better effects, the downside of per-pixel lighting is the resulting heavy computational workload. Thus, the choice between per-vertex and per-pixel shading is a trade-off between accuracy and workload. Various kinds of light sources are considered in 3D graphics, as shown in Figure 1.10.

• Directional Light: This refers to a source of light placed at an infinite distance from the object. This kind of light illuminates all the objects facing the source equally.

• Point Light: This is light emanating from a point source but spreading into all the directions. The illumination of this kind of light decreases with the distance from the light.

Figure 1.10: Types of light sources ((a) Directional Light, (b) Point Light, (c) Spot Light, characterized by its cut-off angle and direction of light)

• Spot Light: This refers to directional light emanating from a point source, illuminating the objects that fall in the conical region in which it spreads its light.

After the per-vertex operations of transformation and lighting are done, the vertices are assembled into triangles. Before the triangles are sent to the next stages of the pipeline for further processing, the primitives that would not contribute to the pixels that are finally displayed on the screen are discarded, so as to reduce the workload on the pixel processor. As a first step, the geometry engine identifies the triangles that fall partially or totally outside the view frustum. The primitives that fall totally outside the frustum are trivially rejected. If the primitives are not completely outside the frustum, they are divided into smaller primitives so that the part falling outside the frustum can be clipped off. In addition to the primitives falling totally outside the view frustum, the triangles that face away from the camera are also trivially rejected. This process is called back-face culling. For example, independent of the viewing point, half of a solid sphere's surface is always invisible and hence can be discarded.

1.1.3 Triangle Setup

So far in the pipeline, the scene is represented in terms of triangles, lines, and points. But what is finally displayed on the screen is a 2D array of points called pixels. In order to progress towards this final objective, the triangles are first divided into a set of parallel horizontal lines called scan-lines, as shown in Figure 1.11. These lines are further divided into points, thus forming the 2D array of points called the fragments. The scan-line conversion of the triangles occurs during the triangle setup phase; the scan lines are then passed on to the rasterization engine, which then generates pixels from these lines. While dividing the triangles into the corresponding scan lines, the triangle setup unit calculates the attributes – depth, color, lighting factor, texture co-ordinates, normals, etc. – of the end points of the lines through interpolation of the vertex attributes of the triangle.

Figure 1.11: Scan line conversion (the x coordinate of each triangle edge advances by the inverse slope 1/m of that edge from one scanline to the next, e.g. from x1 to x1 + 1/m1)
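The edge-walking arithmetic shown in Figure 1.11, where each scan-line endpoint advances by the inverse slope 1/m of its edge, can be sketched in C as follows. The flat-bottom triangle and the printed spans are simplifications for illustration, not the thesis's actual setup logic:

/* A sketch of the edge-walking step from Figure 1.11: for each scanline,
 * the left/right x endpoints advance by the inverse slope 1/m of the
 * triangle edges; attributes would be interpolated the same way.
 * Simplified to a flat-bottom triangle. */
#include <stdio.h>

static void scan_convert(float x_apex, float y_apex,
                         float x_left, float x_right, float y_base) {
    /* inverse slopes of the two edges running from the apex to the base */
    float inv_m_l = (x_left  - x_apex) / (y_base - y_apex);
    float inv_m_r = (x_right - x_apex) / (y_base - y_apex);
    float xl = x_apex, xr = x_apex;
    for (float y = y_apex; y <= y_base; y += 1.0f) {
        printf("scanline y=%.0f spans x=[%.2f, %.2f]\n", y, xl, xr);
        xl += inv_m_l;           /* x + 1/m per scanline, as in the figure */
        xr += inv_m_r;
    }
}

int main(void) {
    scan_convert(5.0f, 0.0f, 0.0f, 10.0f, 4.0f);
    return 0;
}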

1.1.4 Rasterization

The raster engine generates the pixels from the scan-lines received from the setup unit. Each pixel is associated with a color stored in a color buffer and depth stored in a depth buffer. These two buffers together form the framebuffer of the graphics processor. The aim of the pixel processor is to compute the color of the pixel displayed on the screen. The various operations involved in this processing are enumerated below:

Shading: In this step, the lighting values of the pixels are computed. This is done by either assigning the weighted average of vertex lighting values or, for greater accuracy, actually computing the lighting at each pixel using one of the following models:

• Flat Shading: The lighting value of a pixel is assigned the average of the lighting values of the vertices (lit in the geometry stage) of the primitive. This is the simplest of the shading models, but is highly inaccurate.

• Gouraud Shading: In this model, the lighting value of the pixels is computed as the weighted average of the lighting values of the end points of the scanline. This model gives good quality results with relatively small computation overhead, and hence has been the most popular shading technique.

• Phong Shading: In this model, the shading normal of the pixel is generated as the interpolation of the shading normals associated with the ends of the scanline. The generated normal is used in the computations involved in lighting the pixel.

Texturing: Texture mapping is a technique for adding surface detail, texture, or color to an object, and helps significantly in adding realism to the scene. This process can be visualized as something similar to wrapping a patterned paper around a plain white sphere, as shown in Figure 1.12. In this technique, the color associated with each of the pixels of the image is looked up from a stored 2D image called the texture, and the mapping between the pixel and a point on the texture, called the texel, is based on a predefined mathematical function. This technique is very popular and most commonly used, since it has made possible, in a graphics pipeline, the generation of objects with surface irregularities such as the bumps on the surface of the moon, and objects with surface texture such as that on a wooden plank.
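A minimal sketch of the pixel-to-texel mapping described above, using nearest-neighbour filtering and wrap addressing; real texture units add bilinear or trilinear filtering and blocked storage, so this is illustrative rather than any particular hardware's lookup:

/* Map texture coordinates (u,v) in [0,1] to a texel in a W x H texture,
 * with nearest-neighbour filtering and wrap addressing. Illustrative. */
#include <stdio.h>
#include <math.h>

static unsigned texel_fetch(const unsigned *texture, int w, int h,
                            float u, float v) {
    /* wrap addressing: keep only the fractional part of the coordinate */
    u -= floorf(u);
    v -= floorf(v);
    int x = (int)(u * (float)w) % w;     /* nearest texel column */
    int y = (int)(v * (float)h) % h;     /* nearest texel row */
    return texture[y * w + x];
}

int main(void) {
    unsigned tex[4] = { 0xFF0000, 0x00FF00, 0x0000FF, 0xFFFFFF }; /* 2x2 */
    printf("color = 0x%06X\n", texel_fetch(tex, 2, 2, 0.9f, 0.1f));
    return 0;
}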

Fog: After texturing, fog is added to the scene, giving the viewer a perception of depth. The fogging effect is simulated by increasing the haziness of the objects with increasing distance from the camera. The fog factor is thus a function of the z-value of a pixel and can increase either linearly or exponentially. The fog factor is then applied to the pixel by blending it with the color computed in the shading and texturing steps. Another application of fogging is to make the clipping of objects at the far clipping plane less obvious by fading their disappearance rather than abruptly cutting them out of the scene.
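The fog computation described above can be sketched as follows; the linear and exponential fog-factor forms are the standard ones, and the constants in main are arbitrary illustrations:

/* A sketch of the fog blend: the fog factor is a function of the pixel's
 * z value (linear or exponential) and mixes the shaded color with the
 * fog color. Constants are illustrative. */
#include <stdio.h>
#include <math.h>

/* f = 1 at the near end (no fog), f = 0 deep in the fog */
static float fog_linear(float z, float z_start, float z_end) {
    float f = (z_end - z) / (z_end - z_start);
    return f < 0.0f ? 0.0f : (f > 1.0f ? 1.0f : f);
}

static float fog_exp(float z, float density) {
    return expf(-density * z);
}

/* blend a single color channel with the fog color */
static float apply_fog(float f, float color, float fog_color) {
    return f * color + (1.0f - f) * fog_color;
}

int main(void) {
    float f1 = fog_linear(60.0f, 10.0f, 100.0f);
    float f2 = fog_exp(60.0f, 0.02f);
    printf("linear f=%.2f exp f=%.2f red=%.2f\n",
           f1, f2, apply_fog(f1, 0.8f, 0.5f));
    return 0;
}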

Figure 1.12: Texture mapping example ((a) Texture, (b) Textured sphere)

Alpha and Depth: The alpha value is one of the attributes of a vertex, used to model the opacity of the vertex. This is required to model transparency and translucency of objects, for example in simulating water, lenses, etc. An opaque object occludes the objects that are behind it. Thus, if a pixel is opaque and the z-value of the pixel is less than the value present in the depth buffer at the position corresponding to the pixel, then the depth buffer and color buffer are updated with the attributes of the pixel. However, if the object is transparent, then depending on its transparency, the color of the occluded objects has to be blended with the color of the object to simulate the effect of transparency. In addition to depth and color buffers, a graphics pipeline also has a stencil buffer. Generally, this buffer stores a value of 0/1 per pixel to indicate whether the pixel has to be masked or not. This is used to create many effects such as shadowing, highlighting, and outline drawing. The operations involved in these three tests together are summarized in Algorithm 1.
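The transparency blend described above is conventionally written as the standard alpha-blending equation (supplied here for illustration; the thesis does not state it explicitly):

\[
C_{\text{out}} = \alpha\, C_{\text{pixel}} + (1 - \alpha)\, C_{\text{framebuffer}}, \qquad \alpha \in [0,1],
\]

where $\alpha = 1$ corresponds to a fully opaque pixel that simply overwrites the color buffer, and smaller values mix the incoming color with the color already present.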

Algorithm 1 Alpha, depth and stencil test
1: if StencilBuffer(x,y) ≠ 0 then
2:   if Alpha ≠ 0 then
3:     if DepthBuffer(x,y) ≥ z then
4:       ColorBuffer(x,y) ← color
5:       DepthBuffer(x,y) ← z
6:     end if
7:   end if
8: end if

Anti-aliasing: When an oblique line is rendered, it appears jagged on the screen, as shown in Figure 1.13(a). This is a result of the discretization of a continuous function (the line) by sampling it over a discrete space (the screen). One way to alleviate this effect is to render the image at a resolution higher than the required resolution and then filter down to the screen resolution. This technique is called full screen anti-aliasing (Figure 1.13(b)). The problem with this method is that it increases the load due to pixel processing. Hence an optimization called multi-sampling is generally used, which identifies the edges of the objects in the screen and applies anti-aliasing only to the edges.

Figure 1.13: Antialiasing ((a) Line, (b) Anti-aliased line)

1.1.5 Display

When a new frame is to be displayed, the screen is first cleared; then the driver reads the new frame from the framebuffer and prints it on the screen. Generally a screen refresh rate of 60 fps is targeted. If only one buffer is used for writing (by the GPU) and reading (by the display driver), artifacts such as flickering are common, because the GPU could update the contents of the frame before they are displayed on the screen. To overcome this problem, double buffering is generally used, as shown in Figure 1.14, wherein the display driver reads the fully processed frame from the front buffer while the GPU writes the next frame to the back buffer. The front and back buffers are swapped once the read and write operations to the front and back buffers, respectively, are completed.

Figure 1.14: Double buffering ((a) GPU writes frame N to back buffer B1 while the display driver reads frame N−1 from front buffer B2; (b) Swap buffers: GPU writes frame N+1 to B2 while the driver reads frame N from B1)

The obvious drawback of double buffering is performance loss. Since a frame which needs slightly more than 16.67 ms (1/60 of a second) would be updated only in the next refresh cycle, the GPU cannot start processing the next frame until then. In such cases, the overall frame rate may fall to half the targeted frame rate even when the load is only slightly increased. To counter this problem, triple buffering using three buffers (one front and two back buffers) can be used. The GPU can now write to the additional buffer while the other buffer holds the frame to be refreshed. The choice of double buffering or triple buffering depends on the availability of memory space.
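A minimal sketch of the double-buffering handshake described above, with rendering and scan-out reduced to stubs; this illustrates the swap discipline only, and is not driver code from any GPU:

/* GPU renders into the back buffer while the display scans out the front
 * buffer; the two are swapped once both operations complete. */
#include <stdio.h>

typedef struct { unsigned pixels[4]; } Frame;   /* tiny stand-in frame */

static Frame buffers[2];
static int front = 0;            /* index scanned out by the display */

static void render(Frame *back, unsigned frame_no) {
    for (int i = 0; i < 4; ++i)
        back->pixels[i] = frame_no;      /* pretend to draw frame N */
}

static void scan_out(const Frame *f) {
    printf("display shows frame %u\n", f->pixels[0]);
}

int main(void) {
    for (unsigned n = 1; n <= 3; ++n) {
        render(&buffers[1 - front], n);  /* GPU writes the back buffer */
        scan_out(&buffers[front]);       /* display reads the front buffer */
        front = 1 - front;               /* swap once both complete */
    }
    return 0;
}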

1.2 Graphics Processor Architecture

The computational workload of 3D graphics applications is so high that, to achieve real time rendering rates, hardware acceleration for graphics processing is almost always necessary. Generally, the application layer executes on the CPU and the rest of the graphics processing is offloaded to Graphics Processing Units (GPUs). To enable ease of development and also application portability, an Application Programming Interface (API) is used to abstract the hardware from the application. The device driver that forms the interface between the CPU and GPU receives the API calls from the application and interprets them for the GPU. The interaction between the CPU and GPU is shown in Figure 1.15.

Figure 1.15: The CPU – GPU interface (the application, API, and device driver run on the host CPU with its RAM and hard disk; commands reach the GPU and VRAM on the graphics card through a ring buffer interface and DMA transfers, and the GPU drives the display)

The commands from the application running on the CPU are passed on to the GPU through a ring buffer interface. The data associated with these commands, such as vertex attributes, textures, and shader programs, are transferred from system memory to VRAM through Direct Memory Access (DMA) transfers. In addition to acting as temporary storage for input data, VRAM also needs to store the processed frames that are ready for display. The area in VRAM that is reserved for storing the processed frames is essentially the framebuffer. Since the GPU need not send the processed frames to the CPU, the CPU need not wait for the GPU to complete the processing before issuing the next GPU command. This helps the CPU and GPU work in parallel, thus increasing the processing speed.

In graphics applications, we observe that the input data set is operated upon by a large number of sequential operations. Hence, GPUs are generally deeply pipelined to enhance the throughput. Moreover, the data set is huge and operations on one data element are independent of operations on other data elements. Hence, each stage in the pipeline consists of multiple function units to support parallel processing of the data streaming into it.

Commercial implementations of graphics processors are mainly of two types: immediate mode rendering engines and tiled rendering engines. The classification is based on the order in which primitives are rendered in the pipeline. In this section we describe the immediate mode rendering engines in detail, followed by the tiled mode GPUs.
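The ring buffer interface mentioned above is essentially a producer-consumer queue between the driver and the GPU. A minimal sketch follows; the command format, queue size, and opcodes are hypothetical:

/* Driver (producer) enqueues GPU commands at the head while the GPU
 * (consumer) drains them from the tail, letting the CPU run ahead. */
#include <stdio.h>

#define RING_SIZE 8                       /* power of two, hypothetical */

typedef struct { unsigned opcode, arg; } GpuCmd;

static GpuCmd ring[RING_SIZE];
static unsigned head = 0, tail = 0;       /* head: driver, tail: GPU */

static int ring_push(GpuCmd c) {          /* driver side */
    if (head - tail == RING_SIZE) return 0;   /* full: CPU must wait */
    ring[head++ % RING_SIZE] = c;
    return 1;
}

static int ring_pop(GpuCmd *c) {          /* GPU side */
    if (head == tail) return 0;               /* empty: GPU idles */
    *c = ring[tail++ % RING_SIZE];
    return 1;
}

int main(void) {
    ring_push((GpuCmd){ 1, 100 });        /* e.g. "load vertex buffer" */
    ring_push((GpuCmd){ 2, 0 });          /* e.g. "draw" */
    GpuCmd c;
    while (ring_pop(&c))
        printf("GPU executes opcode %u (arg %u)\n", c.opcode, c.arg);
    return 0;
}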

1.2.1 Immediate Mode Rendering Engines

High-end discrete graphics architectures from nVidia and ATI follow immediate mode rendering. Figure 1.16 shows the high level architectural view of these graphics processors. The Host Interface acts as an interface between the host CPU and the GPU pipeline. It maintains the state of the pipeline, and on receiving the commands from the driver, updates the state and issues appropriate control signals to the other units in the pipeline. It also initiates the required DMA transfers from system memory to GPU memory to fill the vertex buffer and index buffer, load the shader program, load textures, etc. The vertex buffer is generally implemented as a cache, since re-use of vertices is expected. The first block in the pipeline, Transformation and Lighting, is responsible for performing the transformation and lighting computations on the vertices.

Figure 1.16: Fixed function GPU (the pipeline runs from the host interface through transformation and lighting with vertex input/output caches and an index buffer, primitive assembly, clipping and culling, triangle setup, hierarchical Z-test with HZ buffer, early Z-test, the pixel processor with texture and depth caches, Z-test, and color blend with color cache, down to VRAM)

The vertex input cache and index buffer are used to buffer the inputs to this block. The primitives of an object are generally found to share vertices, as shown in Figure 1.17 – vertex 3 is common to triangles T1, T2, and T3. Indexed mode for addressing the vertices requires less CPU-GPU transfer bandwidth than transferring the actual vertices, in the presence of vertex reuse [3].

Figure 1.17: Triangles sharing vertices (vertices 1-5 form triangles T1, T2, and T3; vertex 3 is shared by all three)

Figure 1.18: Indexed addressing into vertex buffer ((a) Triangles represented in terms of vertices: T1 = (v1, v2, v3), T2 = (v2, v3, v4), T3 = (v3, v4, v5); (b) Indexed triangle representation: T1 = (1, 2, 3), T2 = (2, 3, 4), T3 = (3, 4, 5), with the vertex-to-index mapping v1-v5 → 1-5)

For the example shown in Figure 1.17, if we send the vertices forming each of the triangles T1, T2, and T3 to the GPU, we need to send 9 vertices, as shown in Figure 1.18. If each vertex is made of N attributes, where each attribute is a four component vector (e.g., x, y, z, w components for position; R, G, B, A components of color), a bandwidth of 9 × 4 × N floating point data is required. Instead, we could assign each vertex an index (a pointer to the vertex – of integer data type), and send 9 indices and only 5 vertices to the GPU. Thus, in indexed mode we send only 9 integers and 5 × 4 × N floating point data. Indexed mode for vertex transfer and the resulting bandwidth saving are depicted in Figure 1.18. The indices of the vertices to be processed are buffered into the index buffer, and the attributes of these vertices are fetched into the vertex input cache, since they are expected to be reused [4]. The processed vertices are also cached into a vertex output cache so as to reuse the processed vertices. Before processing a new vertex, it is first looked up in the vertex output cache [4]. If the result is a hit, the processed vertex can be fetched from the cache and sent down the pipeline, thereby avoiding the processing cost.

The transformed and lit vertices are sent to the Primitive Assembly unit, which assembles them into triangles. These triangles are sent to the Clipper, where trivial rejection, back-face culling, and clipping take place. The triangles are then sent to the Triangle Setup unit, which generates fragments from the triangles. The scan-line conversion of triangles into lines does not exploit the spatial locality of accesses into the framebuffer and the texture memory. Hence tiled rasterization, as shown in Figure 1.19, is generally employed. In this technique, the screen is divided into rectangular tiles, and triangles are fragmented such that the pixels belonging to the same tile are generated first, before proceeding to pixels falling in a different tile. The accesses to the framebuffer and texture cache are also matched to this tile size so that accesses to memories can be localized [5].

The next unit is the pixel processor, which shades and textures the pixels. Since texture accesses exhibit high spatial and temporal reuse, a dedicated cache called the Texture Cache is used in this unit to store the textures. Most architectures use depth based optimizations prior to pixel processing, because a large number of fragments are often culled in the depth test that follows pixel processing; the time spent on shading and texturing such fragments is actually wasted. However, it is not in general possible to conduct the depth test prior to pixel processing, because the pixel processor can potentially change the depth or transparency of the pixel. In circumstances where it is known that the pixel processor would not change these parameters, we can always perform the depth test prior to pixel processing. This is known as the early-Z test [6].
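The indexed-mode bandwidth arithmetic worked through above can be checked with a few lines of C; the attribute count N = 2 and the 4-byte float/integer sizes are illustrative assumptions:

/* Indexed vertex addressing: 3 triangles sharing vertices need 9 raw
 * vertices, but only 5 unique vertices plus 9 integer indices. */
#include <stdio.h>

int main(void) {
    int n_attr = 2;                 /* N four-component vertex attributes */
    long raw     = 9L * 4 * n_attr * 4;            /* 9 vertices, bytes   */
    long indexed = 9L * 4 + 5L * 4 * n_attr * 4;   /* 9 ints + 5 vertices */
    printf("raw: %ld bytes, indexed: %ld bytes, saving: %ld bytes\n",
           raw, indexed, raw - indexed);
    return 0;
}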

Figure 1.19: Tiled triangle traversal

It is generally observed that if a pixel fails the depth test, the pixels neighboring it also fail the depth test with a high probability. This property is exploited by the Hierarchical Z Buffer algorithm in identifying groups of pixels that could be culled, thus reducing the number of per-pixel z-tests [7, 8]. After being shaded and textured, the pixels are sent to the Render Output Processor (ROP) for depth, stencil, and alpha tests followed by blending, and finally, writing to the framebuffer. Generally, a z-cache and a color cache are used in this block to exploit spatial locality in the accesses to the off-chip framebuffer.

The initial generations of GPUs were completely hardwired. However, with rapid advances in computer graphics, there was a need to support a large number of newer operations on vertices and pixels. Fixed function implementations have been found inadequate to support the evolving features in the field of graphics processing, due to their restricted vertex and pixel processing capabilities. Programmable units to handle vertex and pixel processing have been introduced in the programmable graphics processors of recent years. The vertex and pixel programs that run on these programmable units are called shaders. By changing the shader code, we can now generate various effects on the same geometry.

A study of the workload characteristics of various applications on modern programmable processors reveals that the relative load due to vertex processing and pixel processing varies with applications and also within an application [9]. This results in durations when the vertex processors are overloaded while the pixel processors are idle, and vice-versa, leading to inefficient usage of resources. The resource utilization efficiency can be improved by balancing the load for both vertex and pixel processing on the same set of programmable units, leading to faster overall processing. This is illustrated in Figure 1.20.

Modern games are expected to have a primitive count of about a million, resulting in tens of millions of pixels. The operations on these millions of vertices and pixels offer the scope for a very high degree of parallelism. Moreover, large batches of these vertices and pixels share the same vertex shader and pixel shader programs respectively. Hence, the programmable units are generally designed as wide SIMT (Single Instruction Multiple Thread) processors. In Figure 1.20, we observe that the GPU consists of multiple programmable units, each consisting of several processing elements (PEs). Different threads could be running on different programmable units, but within a programmable unit, the same thread is executed on a different data element in every PE. All these PEs can hence share the same instruction memory and decoder. This results not only in optimization of area, but also in considerable power savings, because the costs of instruction fetch and decode are amortized over the group of threads running in tandem on the PEs of the programmable unit.

1.2.2 Tiled Graphics Engines

Tiled rendering is the process of dividing the task of rendering a frame into smaller sub-tasks of rendering the regular grid of tiles that the frame is composed of. This is the preferred rendering mode in very high performance graphics cards such as Pixel Planes [10], [11], Microsoft Talisman [12], and Intel Larrabee [13], because multiple tiles can be rendered in parallel on multiple cores of a high-performance GPU. Interestingly, this mode of rendering is also attractive in low power, low form factor mobile graphics processors such as Mali [14] and PowerVR [15], because rendering a tile at a time requires fewer resources than those required to render the whole frame in a single pass. In the conventional immediate mode rendering engines explained above, since the rendering order of the objects is not sorted and consecutive objects could span any region of the frame, as illustrated in Figure 1.21, these caches need to be large to exploit data reuse.

Figure 1.20: Unified shader architecture for graphics processor (the host feeds an input assembler and a setup/rasterize/Z-cull unit; vertex and pixel thread issue logic dispatches work to programmable units, each an array of PEs with texture filter (TF) units and L1 caches, backed by ROPs, L2 caches, and the frame buffer)

Mobile graphics processors, owing to stringent area constraints, cannot afford to accommodate these huge on-chip caches. Hence, to address the issue of memory bandwidth requirement, they generally use tiled mode rendering instead of the traditional immediate mode rendering approach discussed above [16]. In this technique, the scene is divided into sub-regions called tiles (shown in Figure 1.21), and the primitives falling into each of these tiles are divided into bins after geometry processing. Each bin is now rendered independently, and requires only small on-chip depth and color caches that are just sufficient to hold the depth and color values of the tile. After the tile is completely rendered, the contents of the on-chip depth and color buffers are transferred to the off-chip framebuffer.

Figure 1.21: Tiled rendering (the frame is divided into tiles 1-4, and the primitives A, B, and C overlapping each tile are collected into the corresponding bins 1-4)

Figure 1.22 shows a high level view of the operations in a tiled rendering architecture.

Figure 1.22: Tiled graphics pipeline (CPU → geometry processing → tiling → rasterization → pixel processing → frame buffer)

Figure 1.23 shows the architecture of a mobile graphics processor with tiled rendering. The primitives of the scene data are fetched from off-chip memory by the geometry engine. Transformation, lighting, and culling of the primitives are followed by the screen-space subdivision of the primitives. The screen is divided into tiles of, say, 32 × 32 pixels. Each primitive is placed into the bin(s) corresponding to the tile(s) it overlaps with, as shown in Figure 1.21. This overlap is detected by testing the intersection of the bounding box of the primitive with the tiles of the frame. Each tile is associated with a data-structure called the tile list, which maintains the list of primitives present in the tile along with the state of the primitives. State defines the operations that are to be applied on the primitive in the rasterization phase of the processing, such as: (i) the textures associated with the primitive and the type of texture filtering to be used; (ii) depth test enable/disable; and (iii) the blending mode to be used. An example showing the state management in a tiled architecture is shown in Figure 1.24. Triangle T1 belongs to Tile 1, and triangles T2 and T3 to Tile 2. Depth test is enabled for all the triangles, but texture is enabled only for T3. Hence, while binning the triangles into tiles, we associate T1 and T2 with their state (Enable Depth) and T3 with its state (Enable Depth and Enable Texture).
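A minimal sketch of the bounding-box binning step described above; the tile and screen dimensions are illustrative, and bins are reduced to per-tile primitive counts rather than full tile lists with state:

/* Insert a triangle into the bin of every 32x32-pixel tile that its
 * screen-space bounding box overlaps. */
#include <stdio.h>

#define TILE 32
#define TILES_X 4                     /* a 128x128 screen, hypothetical */
#define TILES_Y 4

static int bins[TILES_Y][TILES_X];    /* primitives binned per tile */

static void bin_triangle(float min_x, float min_y,
                         float max_x, float max_y) {
    int tx0 = (int)(min_x / TILE), tx1 = (int)(max_x / TILE);
    int ty0 = (int)(min_y / TILE), ty1 = (int)(max_y / TILE);
    for (int ty = ty0; ty <= ty1 && ty < TILES_Y; ++ty)
        for (int tx = tx0; tx <= tx1 && tx < TILES_X; ++tx)
            bins[ty][tx]++;           /* a real tile list also records
                                         the primitive and its state */
}

int main(void) {
    bin_triangle(20.0f, 20.0f, 70.0f, 40.0f);   /* overlaps 6 tiles */
    for (int ty = 0; ty < TILES_Y; ++ty) {
        for (int tx = 0; tx < TILES_X; ++tx)
            printf("%d ", bins[ty][tx]);
        printf("\n");
    }
    return 0;
}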

Figure 1.23: Tiled graphics processor architecture (the geometry engine fetches primitives from off-chip memory; binning writes tile lists back to off-chip memory; the pixel processor reads textures and renders into on-chip color and depth buffers, which are flushed to the off-chip frame buffer)

At the end of geometry processing of each primitive, the processed vertices are stored back in off-chip memory and the corresponding tile list is updated. Once geometry processing and tiling of all the primitives of the frame are completed, the tiles are processed in sequence. In this architecture, small color and depth buffers are sufficient, as opposed to the huge depth and color caches used in immediate mode rendering, because we need to store the values of only 32 × 32 pixels on the chip. Once the tile is processed, the buffer contents are transferred to the external framebuffer memory.

1.3 Power Dissipation in a Graphics Processor

Analysis of the operations in a graphics pipeline demonstrates that most of the computations are concentrated in the programmable units, texture units, and ROP units.

Figure 1.24: State management in tiled GPUs (the input API calls enable depth for triangles 1-3 and texture for triangle 3; the state stored for Tile 1 is Enable Depth, and for Tile 2 is Enable Depth and Enable Texture)

Programmable units execute a large number of floating point vector operations; texture units use large memory bandwidth to move the textures from VRAM to cache, and perform a large number of floating point operations for filtering the texels; ROP units are memory intensive, needing multiple reads and writes to the color and depth buffers. To estimate the relative power consumption of the different pipeline stages of a GPU, we have used the Qsilver simulator [17]. Qsilver gives a high level estimate of power consumption for various operations in the graphics pipeline. A plot of the energy consumption of different benchmarks at various stages of the graphics processor pipeline, obtained using Qsilver [17] simulation, is shown in Figure 1.25.

Figure 1.25: Fraction of energy consumed in texture unit (normalized energy for the benchmarks City, Fire, Teapot, and Tunnel, and their mean, broken down into frame buffer write, z-test, texture mapping, setup and rasterize, and transform and lighting)

From Figure 1.25 we observe that vertex processing, pixel processing, texturing, and raster operations are the main sources of power dissipation in the GPU, and that the power consumed by the primitive assembly, clipping, and triangle setup stages can be ignored. This observation is further strengthened by the fact that most of the real estate of popular commercial GPUs is also occupied by PUs, texture units, and ROPs. Figure 1.26 shows the footprint of nVidia's GPU GT200, targeted for laptop computers [18].

Figure 1.26: Footprint of the die of GT200 (approximately to the scale; programmable units (PUs), texture units, ROPs, and the frame buffer interfaces occupy most of the die)

1.4 Our Contribution

In this thesis we have proposed and evaluated several techniques to optimize the power consumption of graphics processors. Since the visual quality of graphical applications is highly dependent on the rate at which the frames are processed, major compromises on performance are unacceptable. Hence, care is taken to optimize power without affecting the overall performance of the graphics processor. The main contributions of the thesis are

• We propose a customized memory architecture, named Texture Filter Memory, for the texture cache, which exploits spatial locality and predictability in texture accesses.

By buffering the blocks of texture in registers, we have replaced the high power cache lookups with low power register reads. We have proposed a smart lookup mechanism to maximize the hits into the registers with a small number of arithmetic operations.

• We propose a code optimization technique to avoid complex operations such as lighting, texturing, etc., on primitives that would be trivially rejected after vertex shading. We propose to partition the vertex shader code into position dependent and position independent parts and defer the position-invariant part to the post-trivial-reject stage of the pipeline. Since almost 50% of the vertices are expected to be trivially rejected in most applications, our technique results in huge power savings for geometry dominated applications.

• We propose a Dynamic Voltage and Frequency Scaling technique to exploit the variation in workload of graphics applications. We demonstrate that tiled-graphics processors exhibit substantial flexibility over immediate-mode renderers in obtaining access to some of the frame parameters that help in enhancing the workload estimation accuracy. We also show that operating at the finer granularity of "tiles" as opposed to "frames" allows early detection and corrective action in case of a mis-prediction. We propose an accurate workload estimation technique and two DVFS schemes, namely (i) tile-history based DVFS and (ii) tile-rank based DVFS, for tiled-rendering architectures.

1.5 Thesis Outline

In this thesis we investigate the opportunities to reduce the power consumption of graphics processors. Chapter 2 presents a detailed literature survey of known power optimization techniques that are applicable to GPUs. In Chapter 3, we study the access patterns to the texture cache and propose a customized low power memory architecture that conserves both dynamic and leakage power of the texture memory subsystem. In Chapter 4, we propose a vertex shader code optimization technique to increase the power-performance efficiency of the geometry engine. In Chapter 5, we present a sophisticated workload estimation technique for tiled GPUs and propose a system level power optimization scheme of DVFS based on the same.

Chapter 2

Literature Survey

In recent years, low power GPU design has drawn a lot of attention in both the research and commercial communities. In this chapter we present a survey of component level power optimizations, such as using custom memory architectures, scaling down the complexity of functional units for a better performance per power ratio, data compression to reduce the power due to memory transfers, clock gating, etc. We also study system level power optimizations such as block level power gating, dynamic voltage and frequency scaling, etc. We first present the unit level optimizations for the power hungry blocks of the GPU, followed by the system level power optimizations.

2.1 Programmable Units

2.1.1 Clock Gating

The high level view of a processing element (PE) in the Programmable Unit (PU) is shown in Figure 2.1. Each processing element in the programmable unit consists of a SIMD ALU working on floating point vectors. In addition to the SIMD ALU, there is also a scalar ALU that implements special functions such as logarithms and trigonometric operations. The ALU supports multi-threading so as to hide the texture cache miss latency. Context switches between threads in a conventional processor cause some overhead, since the current state (consisting of the inputs and the auxiliaries generated) needs to be stored in memory and the state of the next thread has to be loaded from memory. In order to support seamless context switches between threads, the PUs in a graphics processor store the thread state in registers.

Figure 2.1: Processing Element (PE)

The register file in the shader has four banks, one each to store input attributes, output attributes, constants, and intermediate results of the program. The constant register bank is shared by all threads, whereas separate input/output and temporary registers are allocated to each thread. The instruction memory is implemented either as a scratch pad memory, where the driver assumes the responsibility of code transfer, or through a regular cache hierarchy.

Clock gating of the various sub-blocks of a programmable unit presents a huge power saving opportunity. Since the PUs support a large number of threads and use registers to save their state, large register files are needed. However, since only one thread is active at any given instant, it is sufficient to clock only the registers allotted to the active thread and gate the clock to the remaining registers. Similarly, the special function units in the ALU are infrequently used, and hence can be activated only when the decoder stage confirms the need.

2.1.2 Fixed Point ALUs

Mobile platforms generally use GPUs with separate PUs for vertex and pixel shaders. Hence, there is scope for customizing these to the operations specific to vertex and pixel processing respectively. It is observed that vertex shaders give comparable quality at much lower precision than pixel shaders. Thus researchers have explored the option of using integer or fixed point ALUs, rather than floating point units, for vertex shading [19, 20]. The observation has been that integer ALUs do not provide the required quality, but fixed point implementations yield the required quality at much better power budgets in comparison to floating point implementations. A few research efforts also report the benefit of using asynchronous functional units in the ALUs. In [21], the authors suggest replacing the Booth multiplier in the MAC unit of the shader pipeline with a low power asynchronous multiplier.

2.1.3 Predictive Shutdown

Predictive shutdown is an effective technique for reducing power loss due to leakage in the idle components of a system. Due to workload variations, not all programmable units are fully utilized in every frame. Advance information about a frame's workload can help estimate the number of cores required to process it within the time budget. By activating only the required number of cores and powering down the surplus ones, leakage power from the idle cores can be avoided, thereby leading to substantial power savings. A history based method could be used to estimate the utilization of the PUs [22]. Let the number of active cores used to process the n-th frame be S_n and the rate at which it was processed be FPS_n. Then the maximum rate at which each of the cores processed the frame is given as FPS_n/S_n. Similarly, the maximum number of cores required to process the (n+1)-th frame can be calculated as

S_{n+1} = (target frame rate for the (n+1)-th frame) / (minimum rate at which a core is estimated to process the frame)    (2.1)

The expected rate at which a core processes a frame can be approximated by the minimum processing rate observed over a window of the m previous frames.

Based on this history, the number of active cores S_{n+1} required to process the (n+1)-th frame is given by the equation:

S_{n+1} = ⌈ (FPS_target + α) / min{ FPS_n/S_n, FPS_{n−1}/S_{n−1}, ..., FPS_{n−m+1}/S_{n−m+1} } ⌉    (2.2)

The factor α is introduced so as to slightly overestimate the core requirement, so that small variations in the workload can be absorbed without missing the deadline. In the above formula, it is assumed that the entire duration of frame processing is spent on the PUs, which is generally not true; the frame could as well be texture intensive or ROP intensive. The α factor also serves to reduce the effect of deadline misses due to underestimation of the workload.
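As a rough illustration, the estimate of Equation 2.2 can be sketched in a few lines (a minimal sketch: the window size m, the margin α, and the rounding-up of the quotient follow the description above; the function and variable names are ours):

    import math

    # Sketch of history-based core-count estimation (Equation 2.2).
    # fps_hist[i] and cores_hist[i] hold FPS and active core count for
    # recent frames; m is the history window, alpha the safety margin.
    def cores_for_next_frame(fps_hist, cores_hist, fps_target, alpha, m):
        # Per-core processing rates observed over the last m frames
        rates = [f / s for f, s in zip(fps_hist[-m:], cores_hist[-m:])]
        # Pessimistic estimate: assume the slowest recent per-core rate
        return math.ceil((fps_target + alpha) / min(rates))

For example, with fps_hist = [58, 61, 60], cores_hist = [4, 4, 3], a 60 FPS target, α = 2 and m = 3, the slowest per-core rate is 14.5 FPS and the estimate rounds up to 5 cores.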

2.2 Texture Unit

2.2.1 Low power cache configurations

Hakura et al. [23] were the first to observe that texture accesses have high spatial and temporal locality. They demonstrated the ability of a texture cache to reduce the memory bandwidth requirement of texture mapping and hence improve the power and performance of texture memory. Later it was observed that the texture L1 cache, due to its small size, only captures intra-primitive locality and not inter-triangle or inter-frame locality. Hence in [24] the authors suggest an external texture cache (L2 cache) between the internal L1 cache and the texture memory. They organize the L2 cache as a virtual memory, with a mechanism to translate from texture addresses to physical addresses. Texture caching in parallel rasterization architectures was studied by Igehy et al. [25]. Serial rasterization architectures benefit from a texture cache because of the locality of accesses. In an architecture with parallel rasterization units and local texture caches, the spatial locality within each cache is not as high as in architectures with a single texture unit. Hence they propose a shared texture memory architecture for effective bandwidth utilization by avoiding duplication of textures. In [26] the authors evaluate the effect of three hybrid access cache systems (victim cache, half-and-half cache, and cooperative cache) on conflict misses. They observed that the results varied considerably with the size of the cache: for an 8KB cache, the victim cache performs better than the others, but for a 16KB cache, the performance of the victim cache and the half-and-half cache are comparable. In [27], the authors reduce power consumption by replacing a 16KB 2-way associative cache with a very small (128-256 byte) direct mapped cache. This reduces average power, as the direct-mapped cache avoids the multiple tag comparisons of set-associative caches, but at a considerable performance penalty (50%) due to high miss rates.

2.2.2 Texture Compression

To reduce the power consumed in fetching textures from off-chip memory, the texture memory bandwidth is reduced by transferring compressed textures from the off-chip texture memory to the texture cache. Since texture accesses have a very high impact on system performance, the main requirement of a texture compression system is that it should allow fast random access to texture data. Block compression schemes such as JPEG are not suitable for textures even though they give high compression ratios: since the accesses to texture memory are non-affine, it cannot be assumed that the decompressed data is used up before the next block is fetched and decompressed. In cases where consecutive texture accesses alternate between a few texture blocks, the same block would have to be fetched and decompressed multiple times, increasing the block fetch and decompression overhead. Hence, we require compression schemes where the texels in a block can be decompressed independently of the other elements of the block. The S3TC compression technique is commonly used for this purpose [28]. In the S3TC technique, for a block of texels, two reference values and a few values generated by interpolation of the reference values are chosen such that each texel in the block can be approximated by one of the chosen values with the least loss in accuracy. For example, if four values are to be used to represent the colors of a texture block, and c0 and c1 are the chosen reference values, two other colors (c2 and c3) are generated by interpolation of c0 and c1 as shown in Figure 2.2. For each texel in the block, the closest of the colors among c0 to c3 is chosen. Thus, a 4 × 4 tile requires 2 reference values and 16 2-bit indices that select among the reference colors and their interpolants, instead of 16 full texel values. Based on this principle, five modes of compression named DXT1 to DXT5 have been proposed, with varying accuracy and compression ratios.

c2 = (2·c0 + c1)/3,  c3 = (c0 + 2·c1)/3

Figure 2.2: S3TC texture compression

1. DXT1 gives the highest compression ratio among all the variants of DXT compression. Texels are usually represented as 32-bit values, with the R, G, B, and A components allotted 8 bits each. However, most of the time, textures do not need 32-bit accuracy. Hence, DXT1 uses a 16-bit representation (5:6:5) for the RGB components of the reference colors and allows a choice of 0 or 255 for transparency.

The colors that can be generated from the two 16-bit reference values, together with the 2 bits per texel that select the interpolation weights, are shown below. If c0 and c1 are the reference values, the other two colors are calculated as:

If c0 > c1:
    c2 = (2·c0 + c1)/3 and c3 = (c0 + 2·c1)/3
else:
    c2 = (c0 + c1)/2 and c3 = 0

For a 4 × 4 tile, this scheme needs 64 bits per tile, giving 8:1 compression. (A small decoding sketch follows this list.)

2. The DXT2 and DXT3 schemes encode alpha values in addition to color values; the color compression is similar to that of DXT1. Thus for a 4 × 4 tile, they need 64 bits for color as in DXT1 and an additional 64 bits for alpha values, giving 4:1 compression. In DXT2, color data is assumed to be pre-multiplied by alpha, which is not the case in DXT3.

3. In the DXT4 and DXT5 schemes, color components are compressed as in DXT2/3, and for alpha compression, two 8-bit reference values are used and six other alphas are interpolated from them, giving 8 alpha values to choose from. The alpha encoding is as shown below:

If α0 > α1:
    α2 = (6α0 + α1)/7, α3 = (5α0 + 2α1)/7, α4 = (4α0 + 3α1)/7,
    α5 = (3α0 + 4α1)/7, α6 = (2α0 + 5α1)/7, α7 = (α0 + 6α1)/7
else:
    α2 = (4α0 + α1)/5, α3 = (3α0 + 2α1)/5, α4 = (2α0 + 3α1)/5,
    α5 = (α0 + 4α1)/5, α6 = 0, α7 = 255

DXT4 is used in case color is pre-multiplied with alpha, and DXT5 is used if it is not. DXT4/5 also give 4:1 compression, but produce superior results for alpha values.
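To make the palette construction concrete, the following sketch derives the four colors of a DXT1 block and decodes its sixteen 2-bit indices (a simplification: colors are handled as 8-bit (r, g, b) tuples, ignoring the 5:6:5 packing, and the c0 > c1 test is done on the tuples rather than on the packed 16-bit values):

    # Sketch of DXT1 palette derivation and block decoding.
    def dxt1_palette(c0, c1):
        if c0 > c1:                      # four-color mode
            c2 = tuple((2 * a + b) // 3 for a, b in zip(c0, c1))
            c3 = tuple((a + 2 * b) // 3 for a, b in zip(c0, c1))
        else:                            # three colors plus transparent
            c2 = tuple((a + b) // 2 for a, b in zip(c0, c1))
            c3 = (0, 0, 0)
        return [c0, c1, c2, c3]

    def decode_block(c0, c1, indices):
        # indices: 16 two-bit values, one per texel of the 4 x 4 tile
        palette = dxt1_palette(c0, c1)
        return [palette[i] for i in indices]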

2.2.3 Clock Gating

Clock gating is a powerful technique that can be used to conserve the power dissipated in a texture unit. Textures are generally stored such that the odd and even mipmap [29] levels map to different cache banks, so that texels can be fetched in parallel during tri-linear [29] interpolation. Moreover, the addressing and filtering units are also present in pairs so that texels can be filtered in parallel, facilitating faster texture sampling. However, when texels are filtered in bilinear mode, half of these units and texture banks are idle. There could also be intervals during which the vertex or pixel shader threads do not use texturing at all. Since the texturing requirement and the type of filter used are part of the state information that the driver sends to the GPU, texture enable and filtering modes are set before the processing of a batch starts. Ideally, half of these units could be powered off when bilinear filtering [29] is used, and the entire texture module switched off when texturing is not used. However, since the intervals between switching from one condition to another may not be large enough to merit powering down these circuits, clock gating is generally used to conserve the power associated with clocking these units.

2.3 Frame Buffer

Raster operations are highly memory intensive since they need multiple reads and writes to the off-chip framebuffer. Since off-chip accesses are slow and power hungry, these framebuffer accesses affect the power consumption and also the performance of the system. Reducing the memory bandwidth between the GPU and the framebuffer is therefore a very important power-performance optimization. The major techniques generally employed to reduce the required bandwidth between the GPU and VRAM are:

• Re-use the data fetched from VRAM to the maximum extent before it is replaced by other data. Efforts made in this direction include extensive on-chip caching and blocked data accesses to maximize cache hits.

• Send compressed data from VRAM to the GPU and decompress it on-chip so as to decrease the memory traffic. The decoder has to be simple enough that the savings due to compressed data transfers dominate the decompression cost in terms of power and performance.

In this section we discuss data compression strategies for memory bandwidth reduction of the color and depth buffers.

2.3.1 Depth Buffer Compression

As described in Figure 1.19, fragments are generated in tiles to exploit the spatial locality of accesses to color, depth, and texture data. Several tiles of depth information are cached in the on-chip depth cache; whenever there is a miss in this cache, a tile of data is fetched into it from the off-chip VRAM. To reduce the memory bandwidth due to these transfers, the VRAM stores and transfers compressed tiles, which are decompressed on-the-fly before they are stored in the on-chip depth cache. Differential Differential Pulse Code Modulation (DDPCM) is one of the popular techniques used for depth buffer compression [30]. It is based on the principle that, since the depth values of the fragments of a triangle are generated by interpolation of the depth values of its vertices, if a tile is completely covered by a single triangle, the second order differentials of the depth values across the tile would all be zero. The steps in the compression scheme are enumerated below:

1. Start with a tile of depth values. If the tile is covered by a single triangle, the depth values in the tile form a plane, since they are generated by interpolation.

2. Compute the column-wise first order differentials, and then repeat the same step to obtain the column-wise second order differentials.

3. Follow this up with the row-wise second order differential computation.

We see that in the best case (i.e., when a single triangle covers the tile), we need to store only one z value and two first order differentials. Thus, for an 8 × 8 block (which would originally need 64 × 32 bits), the compressed form needs 32 bits for the reference z value and 2 × 33 bits for the two first order differentials. Since the depth values are generally interpolated at a higher precision than they are stored at, each second order differential is one of the values 0, -1, and 1; hence two bits suffice to encode it. Thus, with an additional 61 × 2 bits for the second order differentials, a total of 32 + 2 × 33 + 61 × 2 = 220 bits is required instead of 2048 bits. The two bits used to represent a differential can encode four values; since only 0, -1, and 1 are needed, the fourth value can be used to indicate the case where a differential takes some other value. In this case, a fixed number of second order differentials are stored at higher precision and picked up in order each time a violation is indicated.
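A minimal sketch of the core of this scheme, testing whether the column-wise second order differentials of a tile all fall in {0, -1, 1}, is shown below (the row-wise pass and the escape mechanism for violations are omitted):

    # Sketch of DDPCM-style second-order differentials for a tile of
    # depth values, given as a list of rows of equal length.
    def second_order_differentials(tile):
        d1 = [[tile[y][x] - tile[y - 1][x] for x in range(len(tile[0]))]
              for y in range(1, len(tile))]        # first-order, column-wise
        return [[d1[y][x] - d1[y - 1][x] for x in range(len(d1[0]))]
                for y in range(1, len(d1))]        # second-order, column-wise

    def compressible(tile):
        # True if every second differential can be coded in two bits
        return all(v in (-1, 0, 1)
                   for row in second_order_differentials(tile) for v in row)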

2.3.2 Color Buffer Compression

The transfers from the color buffer to the color cache are also done in tiles, so as to exploit the spatial locality of accesses. Hence, block based compression schemes are used for these data transfers. Since color values are not always interpolated from vertex colors (they could be textured), the compression scheme used for the depth buffer is not very efficient for the color buffer. The difference between the color values of neighboring pixels is small, and this makes variable length encoding of the differences a suitable option for color buffer compression [31]. This compression technique is called exponent coding, since the numbers are represented as s(2^x − y), where s is the sign bit and y ∈ [0, 2^(x−1) − 1]. x + 1 is unary coded and concatenated with the sign and with y coded in normal binary to give the compressed value. For example, the value 3 is represented as (2^2 − 1). Here x + 1 = 3, which is 1110 in unary coding, s = 0 and y = 1; hence the code for 3 is 111001. Table 2.1 shows the coded values for numbers in the range [−32, 32].

Value range            Code
0                      0b
±1                     10sb
±2                     110sb
±[3, 4]                1110sxb
±[5, 8]                11110sxxb
±[9, 16]               111110sxxxb
±[17, 32]              1111110sxxxxb
8-bit absolute value   11111110sxxxxxxxxb

Table 2.1: Exponential encoding for color buffer compression

From the table we see that smaller numbers are represented with fewer bits than larger numbers. Since in most cases the differentials are observed to be small, significant compression ratios can be expected. Color values are used by both the GPU and the display controller; compression therefore also helps reduce the bandwidth between the framebuffer and the display controller. Hence the display controller also needs a decompressor to decode the compressed color values read from the framebuffer.
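The encoding of a single differential, following the s(2^x − y) scheme and the worked example above, might be sketched as follows (the bit-string return value is for illustration; the escape code for values outside ±32 is omitted):

    # Sketch of exponent coding for one color differential.
    def exponent_encode(v):
        if v == 0:
            return "0"
        s = "1" if v < 0 else "0"
        m = abs(v)
        x = 0
        while (1 << x) < m:              # smallest x with 2**x >= |v|
            x += 1
        y = (1 << x) - m                 # |v| = 2**x - y
        unary = "1" * (x + 1) + "0"      # x + 1 coded in unary
        ybits = format(y, "b").zfill(x - 1) if x > 1 else ""
        return unary + s + ybits

As in the example above, exponent_encode(3) yields "111001": x + 1 = 3 in unary, the sign bit, and y = 1 in one binary bit.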

2.4 System Level Power Management

In addition to the architectural power optimization techniques discussed so far, system level power management techniques can be effective in reducing power consumption by minimizing the wastage of power in a graphics subsystem. Techniques such as system level power gating, Vdd and Vth scaling, and DVFS are efficient in saving power. These techniques, as applicable to GPUs, are discussed in detail in this section.

2.4.1 Power Modes

Graphics processors are used for accelerating various kinds of applications such as word processors, GUIs of tools such as internet browsers, games, etc. Since the amount of graphics processing varies greatly from application to application, the GPU workload due to these applications also varies greatly. Moreover, there could be large intervals of time during which no application requires graphics processing, leaving the GPU idle. Since it is not always necessary to operate the GPU at peak performance, a few power modes with varying performance levels are generally supported. For example, when the GPU is idle, it can be operated in the lowest-power mode, with Vdd scaled down to save dynamic power and Vth raised to save leakage power. However, when 3D games, which use heavy graphics processing, are running on the system, the GPU can be operated in the maximum performance mode. Performance monitors are used to gauge the utilization of the GPU (similar to monitoring CPU utilization), and the operating system switches the GPU to the power mode that delivers the required performance level with minimum power consumption.

2.4.2 Dynamic Voltage and Frequency Scaling

In power management by mode switching, since the switching overhead is high, there is a relatively large difference between the thresholds that cause a transition between power modes, and the observation intervals are also large. However, applications such as games have been shown to exhibit significant variation in the workload presented by different frames. Fine tuning the computational capacity of the GPU in response to such workload variations has a huge power saving potential. Dynamic voltage and frequency scaling is a popular power optimization technique used by processors to match their computational capacity to the varying workloads of the applications running on them, by adjusting the frequency and operating voltage at run time. Commercial processors such as the Transmeta Crusoe, Intel Pentium Mobile, and ARM processors provide support for DVFS. To control the operating point of the CPU, simple schemes such as PAST [32] and Aged Averages [33] are used to predict the expected utilization of the CPU in the current time interval based on the average CPU utilization observed in recent intervals. Since these schemes have no knowledge of the applications, and decisions are taken only on the basis of the history of observed CPU workloads, they can suffer from significant performance degradation due to mis-predictions. DVFS schemes that have application knowledge have shown better power and performance characteristics [34]. Since the quality of service in games is highly sensitive to the frame rate, it is important to predict the workload accurately in order to minimize the number of frames missing their deadlines. Some techniques use the workload history to predict the expected workload of the current frame, while others attempt to extract hints from the frame state information to guide the workload prediction. The various prediction techniques proposed in the literature are discussed in more detail in the following sections.

History based Workload Estimation

The history based workload estimation technique predicts the workload of the current frame from the workloads of previously rendered frames [35]. The simplest and most straightforward approach is to approximate the workload of the current frame by that of the previous frame. However, doing so results in frequent voltage-frequency changes, which are undesirable, since switching from one voltage-frequency level to another imposes a stabilization-time overhead. To minimize the number of transitions, the average workload over a window of previous frames is used to estimate the workload of the current frame. A large window size helps in reducing the number of voltage changes but, at the same time, leads to a larger number of frames missing their deadlines, as a result of the slower correction mechanism. This history based workload prediction can be extended to estimate the workload of all the voltage islands in the design, and the operating point of each island can be tuned to match the workload changes experienced by that island.
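A minimal sketch of such a predictor, together with the frequency selection it drives, might look as follows (the window size and the set of frequency levels are assumptions):

    # Sketch of window-based workload prediction for DVFS.
    def predict_workload(history, window):
        recent = history[-window:]       # workloads of recent frames
        return sum(recent) / len(recent)

    def pick_frequency(predicted_cycles, frame_time, freq_levels):
        # Lowest frequency that can retire the predicted cycles in time
        for f in sorted(freq_levels):
            if predicted_cycles <= f * frame_time:
                return f
        return max(freq_levels)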

Control Theory based Workload Estimation

Control theory based DVFS takes into account the previous prediction error, along with the previously predicted workload, to predict the workload of the current frame [36]. Since it can adapt faster to workload changes, it results in fewer frames missing their deadlines. In a control based DVFS scheme, a simple Proportional Integral Derivative (PID) controller, as shown in Figure 2.3, is used as a closed loop feedback mechanism to adjust the predicted workload of the current frame based on the prediction errors of some of the previously rendered frames. The workload of the current frame, w_i, is expressed as

w_i = w_{i−1} + Δw    (2.3)

where Δw is the output of the PID controller. The proportional control regulates the speed at which the predicted workload responds to the prediction error of the previous frame. The integral control determines how the workload prediction reacts to the prediction errors accumulated over a few of the recently processed frames. The derivative control adjusts the workload based on the rate at which the prediction errors have changed over a few of the recent frames.

Figure 2.3: PID controller

Thus the correction value generated by the PID controller can be expressed as

Δw = K_p × Error + K_i × Σ Error + K_d × ΔError    (2.4)

The contributions of the proportional, integral, and derivative components of the controller can be tuned by varying the coefficients K_p, K_i, and K_d respectively. The flow of operations in a PID based DVFS scheme is summarized in Figure 2.4. Based on the difference between the actual and predicted workloads (the error) of the current frame, the PID controller estimates the workload of the next frame. The voltage and frequency of the system are scaled to match the computational capacity of the system to the predicted workload of the next frame. The frame is processed at this operating point, and the actual workload of the frame is observed to generate the error value that drives the PID controller.

The feedback loop iterates as follows: the PID controller generates a correction from the error; the predicted workload is the previous workload plus the correction; the voltage and frequency are scaled accordingly; the frame is processed and its workload measured; and the error (predicted minus measured workload) feeds the next iteration.

Figure 2.4: PID controller based DVFS for graphics processor
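The loop of Figure 2.4 might be sketched as follows (set_operating_point and process_frame stand in for platform hooks and are hypothetical; the error sign convention is folded into the gains):

    # Sketch of PID-based DVFS (Equations 2.3 and 2.4).
    def pid_dvfs_loop(frames, kp, ki, kd, w0, set_operating_point, process_frame):
        predicted, integral, prev_error = w0, 0.0, 0.0
        for frame in frames:
            set_operating_point(predicted)       # scale voltage/frequency
            measured = process_frame(frame)      # observed workload
            error = measured - predicted
            integral += error
            # delta(w) = Kp*Error + Ki*sum(Error) + Kd*(change in Error)
            delta = kp * error + ki * integral + kd * (error - prev_error)
            predicted += delta                   # w_i = w_{i-1} + delta(w)
            prev_error = error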

Frame Structure based Workload Estimation

In all the methods discussed above, the workload of a frame is estimated from the history of previously processed frames. Hence the prediction is good only when the scene remains almost the same across consecutive frames. The workload is bound to be mis-predicted when there is a significant change in the scene, which may result in frames missing their deadlines. To alleviate this problem, the frame structure based estimation technique bases its prediction on the structure of the frame and the properties of the objects present in the frame [37]. Since this information is obtained prior to processing of the frame, the workload prediction can be based on the properties of the current frame rather than on the workload of previous frames. In this approach, a set of parameters impacting the workload is identified and an analytical model of the workload as a function of these parameters is constructed. During the execution of the application, each frame is parsed to obtain these parameters, and the pre-computed workload model is used to predict the expected workload of the current frame. For example, the basic elements that make up a frame in the Quake game engine can be enumerated as follows.

• Brush models used to construct the world space. The complexity of a brush model is determined by the number of polygons present in the model. If the average workload for processing a polygon is w, the workload W presented by n brush models each consisting of p polygons is represented as:

W = n × p × w (2.5)

• Alias models used to build characters and objects such as monsters, soldiers, weapons, etc. Alias models consist of the geometry and the skin texture of the entity being modeled. The skin can be rendered in one of two modes, opaque or alpha blend. Since the geometry consists essentially of triangles, its workload is characterized in terms of the number of triangles and the average triangle area. Since alpha blending and opaque rendering present different workloads, the workload is parametrized for both modes. If the workload of processing a single pixel with blending is w_t and without blending is w_o, the workload W due to alias models consisting of N_t blended triangles and N_o opaque triangles, of average area A, is given by:

W = N_t × A × w_t + N_o × A × w_o    (2.6)

• Textures applied to the surfaces of brush models to give a realistic appearance such as that of wood, a brick wall, etc. The workload W due to applying N_t textures, where w is the workload for applying a single texture on N polygons of average area A, is given by:

W = N_t × N × A × w    (2.7)

• Light maps to create lighting effects in the scene. Since they are similar to texture maps, the workload due to light maps is estimated similarly to that for texture maps.

• Particles to create bullets, debris, dust, etc. If the number of pixels in particle i is P_i and the workload for rendering one such pixel is w, the workload W due to rendering N particles is given by:

W = Σ_{i=1}^{N} P_i × w    (2.8)

Finally, the total workload of the frame is the sum of the workloads computed above.
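A sketch of the resulting model, summing Equations 2.5 to 2.8, is given below (the frame parameter names are illustrative, not taken from the Quake engine, and the per-unit costs w_* would be calibrated offline):

    # Sketch of frame-structure based workload estimation.
    def frame_workload(f, w_poly, w_blend, w_opaque, w_tex, w_particle):
        w = f["brush_models"] * f["polys_per_model"] * w_poly             # (2.5)
        w += (f["blend_tris"] * w_blend +
              f["opaque_tris"] * w_opaque) * f["avg_tri_area"]            # (2.6)
        w += f["textures"] * f["tex_polys"] * f["avg_poly_area"] * w_tex  # (2.7)
        w += sum(f["particle_pixels"]) * w_particle                       # (2.8)
        return w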

Signature based Workload Estimation

The signature based estimation technique aims to estimate the workload using properties of the frame, in addition to the history of cycles expended in processing previous frames [38]. Every frame is associated with a signature composed from its properties, such as the number of triangles in the frame, the average height and area of the triangles, the number of vertices in the frame, etc. A signature table records the actual observed workload of a frame against the signature of the frame. Prior to rendering a frame, its signature is computed and the predicted workload of the frame is picked from the signature table. On rendering the frame, if there is a discrepancy between the observed and predicted workloads, the signature table is updated with the observed workload value.

To compute the signature of a frame, we need the vertex count, the triangle count, and the area and height of the triangles. The pipeline has to be modified to facilitate signature extraction, since the triangle information can be obtained only after triangle culling and clipping are performed. The modified pipeline is shown in Figure 2.5. The geometry stage is divided into vertex transformation and lighting stages. Triangle clipping and culling are now performed prior to lighting, and a signature buffer is inserted prior to the lighting stage to collect the frame statistics. Since the information of the entire frame is needed to compute a meaningful signature, the buffer should be big enough to hold one frame. Signature based prediction works on the assumption that the computational intensity of the pre-signature stages is negligible, so that they can be performed on the CPU without hardware acceleration.

For every signature generated, the best matching signature in the table has to be looked up. The distance metric shown in Equation 2.9 is used to locate the signature that is closest to the current signature. For a signature S consisting of parameters s_1, s_2, ..., s_d and a signature T comprising t_1, ..., t_d in the signature table, the distance D(S, T) is defined as

D(S, T) = Σ_{i=1}^{d} |s_i − t_i| / s_i    (2.9)

The signature at minimum distance from the current signature can be found either by linear search or by a more sophisticated search mechanism.
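A linear-search lookup using this metric might be sketched as follows (the table is assumed to map signature tuples to their observed workloads, and the parameters s_i are assumed to be non-zero):

    # Sketch of signature-table lookup with the metric of Equation 2.9.
    def distance(s, t):
        return sum(abs(si - ti) / si for si, ti in zip(s, t))

    def predict_workload(signature, table):
        best = min(table, key=lambda t: distance(signature, t))
        return table[best]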

Figure 2.5: Signature based DVFS for graphics processor: (a) the conventional pipeline (transform, lighting, clipping, rasterization, pixel processing); (b) the pipeline enhanced with signature based DVFS, with a signature buffer inserted between the transform & clip and lighting stages, and a feedback loop that extracts the signature, looks up and updates the signature table, scales voltage and frequency, and monitors performance

2.4.3 Multiple Power Domains

From the discussion in Section 1.3, it is clear that the PUs, texture units, and ROPs are the major power consuming components of the graphics processor. From the workload analysis of games, it has been observed that some frames use a lot of texturing, others load the programmable units, and still others require a large number of ROP operations. Hence these three modules could be designed to have different sets of power and clock signals. The voltage and frequency of each of these domains can then be varied independently in accordance with its load, leading to power savings.

2.5 Miscellaneous

In [39], the authors propose processing the difference of adjacent pixel values instead of operating directly on the pixel values; spatial correlation in typical images leads to the difference being a small number on average. Similar tonal locality observations are also exploited in [40, 41] to assign codes so as to reduce the total bit transition count during serial transmission over the Liquid Crystal Display (LCD) bus to the LCD display device. In [42, 43], the authors observe that the eye's visual perception depends on the intensity and the transmittance characteristic of the LCD panel. It is possible to adjust these parameters without affecting the perceived quality; specifically, it is possible to reduce the intensity and thereby reduce power. The use of fixed point arithmetic instead of floating point has been suggested for low power implementations of graphics processors in mobile applications [44]. Low power features in such implementations include using the valid instruction signal in the vertex shader to clock-gate the register files, preventing writes (reads can still proceed).

Chapter 3

Texture Filter Memory

3.1 Introduction

From the power analysis of a typical graphics pipeline using Qsilver [17], as shown in Figure 1.25, we have seen that texture mapping is one of the components that contributes significantly to total power consumption. The texture memory sub-system consumes up to 38% of the total dynamic energy, making it a potential candidate for optimization. One of the main side-effects of technology scaling has been increasing levels of leakage power. Typically, since cache lines leak power for most of their lifetime, the leakage power of caches is much higher than their dynamic power. It has been observed that in a 70nm cache, 77% of the total power consumption comes from the leakage component, whereas dynamic power accounts for only 23% [45]. Leakage power consumption is directly proportional to chip area. Since a significant fraction of GPU real estate is occupied by texture memories, as seen in Figure 1.26, texture memory is also a major contributor to the leakage power of the GPU. In this chapter we aim to devise a custom memory architecture for texture mapping that optimizes both the dynamic and static power consumption of the texture memory sub-system, resulting in significant overall power savings.

Texture mapping is the process of mapping an image (in texture space) onto a surface (in object space) using some mapping function [29]. The process can be explained with the simple example shown in Figure 3.1. Consider modeling a globe. One way to do this is to represent the sphere as a large number of tiny triangles and associate the vertices of these triangles with appropriate colors so that, after the triangles are passed through the pipeline, what finally appears on the screen looks like a globe.


Figure 3.1: Texture mapping to model a globe: (a) object, (b) texture, (c) textured object

The modeling effort in this case is so huge that it makes rendering such models almost impossible. Things would be easier if we could just define the mapping of a few points on the sphere to points on a 2-D world map, and the pipeline had the capability to associate the pixels with appropriate colors from the world map. This process of mapping the pixels of a 3D model to points (texels) on a 2D texture is called texture mapping or texturing. Figure 3.2 shows the process of texture mapping a primitive. From the figure we observe that the texture space and object space can be at an arbitrary distance and orientation with respect to each other. As a result, there is no one-to-one correspondence between the pixels of the object and the texels of the texture. This necessitates the use of a texture filtering mechanism to attribute the best color to a pixel. Hence several texture filtering techniques are used to improve the quality of texture mapped images.

Point filtering: This is the simplest of the filtering methods, where the color of the texel nearest to the pixel center is picked as the color of the pixel. This is the fastest but crudest form of texture filtering.

Linear filtering: This is slightly more refined than point filtering. The two texels closest to the pixel center are weight-averaged to obtain the color of the pixel. Though this is better than point filtering, it still does not yield acceptable quality.

Figure 3.2: Oblique traversal of scanlines in texture space

Bilinear filtering: This is one of the most common filtering techniques used for texture mapping. In bilinear filtering, the weighted average of the four texels nearest to the pixel center gives the color of the pixel (Figure 3.3).

Trilinear filtering: In order to produce good results at the varying levels of detail (lod) at which an object can be viewed, the texture image is stored at various resolutions called mip-maps [29], and the nearest one is picked for filtering at run time based on the lod. In bilinear filtering, abrupt transitions from one mip-map level to the next result in noticeable changes in image quality. To avoid this artifact, trilinear interpolation averages the bilinearly interpolated values from the two nearest mip-map levels to give the color of the pixel.

Anisotropic filtering: When the rendered object is at an oblique viewing angle with respect to the camera, bilinear and trilinear filtering do not give satisfactory image quality: their square filter pattern performs well only when the object is viewed head-on, and leads to blurring when it is viewed at an angle. In such cases, anisotropic filtering [29] is generally used, in which the footprint of the filter is generated at run time depending on the obliqueness of the viewing angle. Current commercial implementations of anisotropic filtering may require up to 128 texels to generate the color of a single pixel. Since this filtering incurs a heavy performance cost, it is limited to very few pixels of the scene.

3.2 Texture Mapping Access Pattern

Since almost all of the common filtering methods inherently use bilinear filtering, we study the access pattern of bilinear filtering in detail. Texture mapping with bilinear filtering exhibits high spatial and temporal locality. This is because:

• to compute the color of a pixel we need to fetch four neighboring texels,

• consecutive pixels on the scan line map to neighboring texels, and

• consecutive scanlines of a primitive share texels.

In addition to locality, texture mapping also exhibits predictability in its access pattern. As seen in Figure 3.3, an access to texel t1 (the base texel) is followed by accesses to texels t2, t3 and t4. Thus the access to texel t1 gives us information about the next three texel accesses. A conventional 4-way associative cache architecture for texture memory, as suggested in [23], is oblivious to such predictability. The four reads from the tag and data arrays, in addition to the four tag comparisons for each texel fetch, make it very power hungry. We propose a customized memory architecture that exploits the spatial locality and predictability of the access stream, resulting in a low power solution without compromising performance.

Figure 3.3: Footprint of a bilinear filter: the pixel center falls among the base texel t1 at (tx, ty), t2 at (tx+1, ty), t3 at (tx, ty+1), and t4 at (tx+1, ty+1)

Figure 3.4: Scenarios (cases 1-4) to which the texture footprint could be mapped

Since the direction of accesses to texture memory is arbitrary, a blocked representation of texture maps in memory is generally used (illustrated in Figure 3.5). Each block resides in contiguous memory space. The computation of the texel address from the texel co-ordinates is shown in Algorithm 2. The overhead of the extra additions and shifts in the block address computation is offset by the performance gained from the reduced cache miss rate when the line size is chosen equal to the block size [23].

Figure 3.5: Blocked representation of a texture of the given width, divided into bw × bh blocks; (bx, by) are the block co-ordinates and (sx, sy) the offsets of texel (tu, tv) within its block

Algorithm 2 Computation of texel address
Input: Texel co-ordinates (tu, tv), Base - starting address of texture
Output: Texel address
1: lbw ← log2(bw)
2: lbh ← log2(bh)
3: rs ← log2(width · bh)
4: bs ← log2(bw · bh)
5: bx ← tu >> lbw
6: by ← tv >> lbh
7: sx ← tu & (bw − 1)
8: sy ← tv & (bh − 1)
9: block address ← (by << rs) + (bx << bs)
10: offset ← (sy << lbw) + sx
11: texel address ← base + block address + offset
12: return texel address
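A direct transcription of Algorithm 2, assuming power-of-two block and texture dimensions, is shown below:

    # Runnable sketch of Algorithm 2 (texel address computation).
    def texel_address(tu, tv, base, width, bw, bh):
        lbw = bw.bit_length() - 1              # log2(bw)
        lbh = bh.bit_length() - 1              # log2(bh)
        rs = (width * bh).bit_length() - 1     # log2(width * bh)
        bs = (bw * bh).bit_length() - 1        # log2(bw * bh)
        bx, by = tu >> lbw, tv >> lbh          # block co-ordinates
        sx, sy = tu & (bw - 1), tv & (bh - 1)  # offsets within the block
        block_address = (by << rs) + (bx << bs)
        offset = (sy << lbw) + sx
        return base + block_address + offset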

Since texture mapping exhibits high spatial locality, we propose to buffer the blocks of texture expected to be accessed in the near future in a set of registers. The number of blocks to be buffered depends on the type of filter being used. In a bilinear filtering operation, the texels can lie in one, two, or four of the neighboring blocks, as shown in Figure 3.4. Also, the next set of texels falls in one of these four blocks with high probability. Hence we need to buffer up to four blocks of texture. For trilinear filtering we need to buffer eight blocks: four blocks from each of the two nearest mipmap levels.

A standard cache based memory architecture could be used for texture memory accesses, but this is expensive in terms of power, as each access results in a lookup operation where both the tag and data arrays of the cache are read, with the number of such memory accesses proportional to the associativity (lower power cache architectures exist, but they compromise on performance). The predictability of the texture access pattern can be used to reduce the average number of memory accesses. We propose a novel memory architecture for textures, where cache-style lookups are minimized by modifying the conventional kernel for bilinear filtering shown in Algorithm 3 to the one shown in Algorithm 4. The information about which of the four cases in Figure 3.4 applies to a texture access is obtained by comparing the block co-ordinates of the texels (lines 3 and 4 of Algorithm 4). If the access belongs to case 1, where all the texels map to the same block, a lookup for texel 1 can be followed by fetching texels 2, 3, and 4 from the same block (lines 8-11 of Algorithm 4). Thus, we need only one lookup to fetch four texels. Similarly, for cases 2 (lines 12-17 of Algorithm 4) and 3 (lines 18-23 of Algorithm 4), two lookups are sufficient for the four texel accesses. Only case 4 (lines 24-25 of Algorithm 4) requires four lookups. For this to be possible, our buffering unit should be designed such that it allows both a cache-like lookup operation and a direct register access.

Algorithm 3 Kernel for bilinear filtering
Input: Texel co-ordinates (tu, tv), Base - starting address of texture
Output: Color
1: Compute the texel addresses corresponding to texel co-ordinates (tu, tv), (tu+1, tv), (tu, tv+1) and (tu+1, tv+1)
2: for I = 1 to 4 do
3:   texelI ← CacheLookup(texel addressI)
4: end for
5: color ← WeightedAverage(texel1, texel2, texel3, texel4)
6: return color

Though we use two additional comparison operations to classify the accesses into the different cases, we reduce the number of block address computations and eliminate the texel address computations. From our experiments on various benchmarks we observed (Figure 3.6) that, on average, 58% of the accesses fall in the same block (case 1), 36% in two blocks (cases 2 and 3), and 6% in four blocks (case 4). Thus, only one lookup is required for 58% of the texture accesses. Even though this single lookup can consume the same power as a lookup in an associative cache (though smaller in magnitude, because our buffers have only 4 registers), the remaining three accesses do not require any lookup/comparison operation, because the register containing the block is already known. On average, the number of memory accesses and comparisons is drastically reduced.
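Concretely, with these measured frequencies, the expected number of buffer lookups per bilinearly filtered pixel is about 0.58 × 1 + 0.36 × 2 + 0.06 × 4 = 1.54, compared to the four cache lookups per pixel of the conventional kernel in Algorithm 3.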

Algorithm 4 Modified bilinear filtering
Input: Texel co-ordinates (tu, tv), Base - starting address of texture
Output: Color
1: bx ← tu >> lbw
2: by ← tv >> lbh
3: bx1 ← (tu + 1) >> lbw
4: by1 ← (tv + 1) >> lbh
5: c0 ← (bx = bx1) ? 0 : 1
6: c1 ← (by = by1) ? 0 : 1
7: Calculate offset1, offset2, offset3 and offset4
8: if c0 = 0 and c1 = 0 then
9:   compute block address1
10:  texel1 ← LookupBuffer(block address1, offset1)
11:  Read texels 2, 3 and 4 from the same block
12: else if c0 = 0 and c1 = 1 then
13:  compute block address1 and block address3
14:  texel1 ← LookupBuffer(block address1, offset1)
15:  Read texel 2 from the same block
16:  texel3 ← LookupBuffer(block address3, offset3)
17:  Read texel 4 from the same block
18: else if c0 = 1 and c1 = 0 then
19:  compute block address1 and block address2
20:  texel1 ← LookupBuffer(block address1, offset1)
21:  Read texel 3 from the same block
22:  texel2 ← LookupBuffer(block address2, offset2)
23:  Read texel 4 from the same block
24: else
25:  compute the block addresses of all four texels
26:  for I = 1 to 4 do
27:    texelI ← LookupBuffer(block addressI, offsetI)
28:  end for
29: end if
30: color ← WeightedAverage(texel1, texel2, texel3, texel4)
31: return color

3.3 Architecture of Texture Filter Memory

In a conventional texturing unit, the address generator computes the block address and the offsets of the four texels to be bilinearly filtered. The four texels are fetched from the cache by the fetch unit, and the filtering unit performs the bilinear interpolation. There are two filtering units so that, during trilinear interpolation, the bilinear filtering of both mipmap levels can be done in parallel. We include a Texture Filter Memory (TFM) in the texturing unit, which acts as an interface between the texturing unit and the texture memory. The architecture of the TFM is described in this section.

Figure 3.6: Distribution of texture accesses between the various cases

The TFM consists of three components (Figure 3.7): (i) a Texture Buffer Array (TBA), (ii) Address Comparators, and (iii) a Controller.

3.3.1 Texture Buffer Array

A block of texture consists of 4 × 4 texels. Hence we need a buffer of 16 registers to hold one block of texture, an array of four such buffers to hold the four blocks needed for bilinear filtering, and eight buffers to also support trilinear filtering. We arrange the eight buffers as two sets of four buffers each. For bilinear filtering we use only one of the sets and turn off the other in order to reduce power. In trilinear filtering, we map the two sets to the two different mipmap levels. By doing so, each texel lookup requires a search of four buffers instead of eight. The address bus width of the TBA is 7 bits: one bit to select between the two sets, two bits to select the buffer within a set, and a four-bit offset into the buffer.
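The corresponding address packing might be sketched as below (the field ordering, set bit first, is our assumption):

    # Sketch of the 7-bit TBA address: 1 set bit, 2 buffer bits, 4 offset bits.
    def tba_address(set_sel, buf_sel, offset):
        assert 0 <= set_sel < 2 and 0 <= buf_sel < 4 and 0 <= offset < 16
        return (set_sel << 6) | (buf_sel << 4) | offset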

Figure 3.7: TFM architecture: a controller, two 256-byte banks of the texture buffer array fed by a 512-bit bus from the L1 cache, and per-bank address comparators (ACOMP) whose hit outputs are encoded into a buffer index; a 32-bit port delivers texels to the shader unit

3.3.2 Address Comparators

The TFM has two blocks of address comparators, each associated with one set of buffers in the TBA. A comparator block consists of four registers that store the addresses of the blocks present in the buffers of that set. When the address comparator receives an address, it compares it with the addresses saved in the four registers in parallel. The outputs of the comparators are sent to the controller and also encoded to give the address of the buffer in which the texture block resides, in case of a hit.

3.3.3 Controller

The texel fetch unit provides the block address and the offset of the texel, along with the mipmap level from which the texel is to be fetched. An access to the TFM can be: (i) a direct access, when the fetch is to the same block as the previous texel, or (ii) a lookup, when it is not known in which of the buffers the texel resides. The texel fetch unit determines the type of each access as shown in Algorithm 4 and provides this information to the controller. In case of a lookup, the controller enables the appropriate address comparator and TBA entry. For bilinear filtering, only one of the banks and its associated comparator are enabled. For trilinear filtering, the controller compares the mipmap level of the current access with that of the previous one and, when it changes, toggles the comparator enable and bank select signals. Thus the texels of two different mipmap levels are always mapped to two different sets, reducing interference. The controller combines the outputs of the address comparator to determine whether the lookup resulted in a hit or a miss.

• Upon a hit, the buffer address generated by the comparator is registered, so that for successive accesses to the same block it is sufficient to provide only the offset, and the costly comparisons can be avoided. These accesses are called direct accesses. Thus we achieve the hit rate of a fully associative cache but with far fewer comparisons per access. It may appear that accesses needing an address comparison take two cycles, but the design is inherently pipelined, since we register the output of the comparator, and hence we can achieve one fetch every cycle.

• When there is a TBA miss, the controller issues a read signal to the L1 cache, and a block of texture is moved from the L1 cache to the TBA. A pseudo-LRU policy [46] is used by the comparator to select the block to be replaced. The controller issues a load signal to the corresponding address register in the comparator, which then stores the address of the new block. It also issues a load signal to the buffer in the TBA, and the registers in the buffer are loaded in parallel. A 512-bit internal bus between the L1 cache and the TBA fills the buffer. From synthesis of the TFM, we observed that the access time of the TFM is about half that of the cache; hence we can fill the buffer in two cycles on a miss.

In high throughput texture cache designs with multiple banks, we can divide the TFM into as many banks as the L1 texture cache and associate each L1 bank with the corresponding TFM bank.

3.4 Static Power Reduction due to Texture Filter Memory

A line in the texture cache is retained in the cache even after the spatially local accesses to it complete, in anticipation of temporal reuse. A study of the life cycle of a typical line in a conventional texture cache shows that it undergoes large intervals of inactivity interleaved with bursts of high activity. As a result, each cache line leaks power for the majority of its lifetime. Various circuit level techniques have been proposed in the literature to tackle the problem of leakage in caches. One of the most popular is selectively putting cache lines into a low power mode called the drowsy mode. In the drowsy state, the cache line retains its data; however, to access the line, it must first be transitioned back to its normal power state, which incurs a delay of a few cycles. Thus, there is a trade-off between the power saved and the corresponding delay incurred.

One popular technique to reduce the leakage power of CPU caches is to put all the cache lines into drowsy mode periodically, after a defined window of cycles [47]. Though this technique is quite simple to implement and results in significant power savings, it also degrades performance due to the overhead of waking up the lines that are accessed. Hence a more sophisticated technique is also used, which maintains a register with each cache line to track the activity of the line and switches the line to drowsy mode when its activity falls below a threshold [47]. But this incurs hardware overhead for the tracking and switching logic, and each mis-prediction still incurs power-performance penalties. Another important design decision is whether to put the tag array into sleep mode along with the data array. Putting the tag array into drowsy mode incurs the penalty of an extra cycle to wake it up along with the data array. All the above techniques trade performance for power. However, the quality of graphics applications is quite sensitive to performance. Hence it is not possible to directly adopt leakage power optimization techniques targeted at CPU caches for conventional texture caches. In this section, we demonstrate how the texture filter memory can be exploited to reduce the leakage power consumption of the texture memory sub-system.

In the proposed texture cache architecture with the Texture Filter Memory (TFM), consecutive accesses to a block of texture are directed to the buffers rather than to the texture L1 cache. Since the accesses to the L1 are at the granularity of blocks, the duration of activity of each cache line decreases further, and the cache leaks even more power than a conventional texture cache. But we show that, due to the presence of the TFM, a smart technique can reduce the leakage power of the texture memory sub-system below the level consumed by a conventional texture cache. Since all consecutive accesses to a cache line hit the TFM and not the L1 cache, we propose to maintain the texture L1 cache in the drowsy state at all times. The L1 is woken up only when there is a TFM miss and hence an access is made to the L1 cache. Both the data and tag arrays of the cache are woken up upon an access and put back into the drowsy state immediately afterwards. By doing so, however, every TFM miss incurs the additional overhead of waking up the data and tag arrays of the L1 cache. To reduce this overhead we use a predictive wake-up mechanism, explained in the section below.

3.4.1 Predictive wake-up

• Since bilinear interpolation may access the four nearest neighboring texture blocks, when we encounter an access to a base texel in the L1 cache, we wake up its three nearest neighbors as well.

• On every access to a block corresponding to a base texel in the L1 cache, we predict the next base block and wake up that block too. To predict the next base block, we track the direction of progression of consecutive texels in the x and y directions, ∆x and ∆y respectively. If ∆x is positive and ∆y is positive, we expect the future accesses to go to one of the three blocks shown in Figure 3.8, and wake them up.

Figure 3.8: Pre wake up – Case 1

If ∆x is negative and ∆y is positive, we wake up the three blocks as shown in Figure 3.9.

Figure 3.9: Pre wake up – Case 2

If ∆x is negative and ∆y is negative, we wake up the three blocks as shown in Figure 3.10.

If ∆x is positive and ∆y is negative, we wake up the three blocks as shown in Figure 3.11.

• On every new prediction, we wake up the new set of lines and put the rest to sleep.

Figure 3.10: Pre wake up – Case 3

Figure 3.11: Pre wake up – Case 4

Predictive wake-up of a line can be pipelined with accesses to an active line, effectively hiding the wake-up delay.
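The quadrant selection of Figures 3.8 to 3.11 reduces to the signs of the gradient, as the following sketch shows ((bx, by) is the current base block in block co-ordinates):

    # Sketch of predictive wake-up: pick the three neighboring blocks
    # in the quadrant pointed to by the texel gradient (dx, dy).
    def blocks_to_wake(bx, by, dx, dy):
        sx = 1 if dx >= 0 else -1
        sy = 1 if dy >= 0 else -1
        return [(bx + sx, by), (bx, by + sy), (bx + sx, by + sy)]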

In our technique, a maximum of only 6 L1 cache lines are active at any instant. In a 16KB L1 cache, this amounts to a negligible 2.5%. The accuracy of our technique is as high as 95%, as shown in Figure 3.12.

Interestingly, our prediction mechanism for waking up cache lines can also be used for pre-fetching lines into the L1 cache. Since all the spatially local accesses to a line are served by the filter memory, we can pre-fetch lines into the L1 cache without the risk of replacing lines that could be accessed in the near future.

Figure 3.12: Pre-wakeup prediction accuracy (percentage) for the Unreal, Doom, Prey, and Quake benchmarks

3.5 Extension to other architectures and filters

In this section we describe how the TFM can be used to optimize the power of other common filtering methods, and how it can be scaled to support parallel texturing and multitexturing architectures.

3.5.1 Anisotropic Filtering

Since there is little predictability in the access pattern in this case, we cannot use our earlier search mechanism. Instead, we propose to configure the TFM like a block buffered fully associative cache [48]. Because of the high spatial locality, we expect a large number of texels to map to the same texture block, resulting in a single comparison, rather than four, for all the texels falling in a block. When there is a miss in the buffered block, we search the buffer array. In this architecture we have two hit times: a fast hit, when the texel is present in the same block as the previous texel, and a slow hit, when the texel is in the buffer array but not in the block to which the previous texel was mapped. A fast hit in the buffer array needs one cycle and a slow hit needs two cycles. But since the access time of the buffer is half that of the L1 cache, we do not lose performance compared to an architecture without the buffer unit.

3.5.2 Parallel Texturing

In conventional parallel rasterization hardware, there are multiple rasterization units and multiple texturing units connected through an interconnection network. All the texturing units share an L1 cache. For efficient dynamic load balancing, it should be possible to schedule a texel fetch operation from any rasterizer to any texturing unit. Thus the texel fetches from all texturing units do not exhibit the spatial locality discussed earlier to the same extent. However, to retain the benefit of the TFM in our proposed texturing unit, we need to ensure that there is high spatial locality among the texels being filtered in each texturing unit. The simplest way of achieving this is to tie the texturing unit to the rasterizer and select the rasterization algorithm such that the texturing units are uniformly loaded. The proposed parallel rasterization hardware with the modified texturing unit is shown in Figure 3.13. Each texturing unit has a Texture Addressing unit (TA), two Bilinear Interpolation Units (BIUs), and one TFM.

Figure 3.13: TFM for parallel texturing (rasterizers #1 to #N feed texturing units, each with a TA and two BIUs backed by a TFM, connected through an interconnect to the shared texture cache)

We analyze below the impact of our proposed architecture on the following common parallel rasterization algorithms.

Tiled Rasterization The screen space is sub-divided into fixed-size tiles and each rasterizer is responsible for a fraction of the tiles. Since the tiles can have a varying number of fragments, there could be load imbalances among the rasterizers.

Object Space Subdivision The object space is subdivided into groups of primitives which are distributed among the rasterizers in a round robin fashion. For efficient dynamic load balancing, larger primitives could be divided into smaller ones.

Striped Rasterization The fragments are divided according to an image space subdivision into 2-pixel wide vertical stripes. These stripes can be assigned to the rasterizers in round-robin order for efficient load balancing. In case the lengths of scanlines vary significantly, sub-division into equal-length sub-scanlines could also be considered.

In each of the above architectures, spatially local fragments are rendered in each of the parallel raster units. This maximizes the locality in the texture and framebuffer memories. We propose to introduce a TFM into each of the texturing units to exploit the spatial locality and predictability of the texture accesses.

3.5.3 Multi Texturing

Multitexturing is the process of applying more than one texture to a primitive [29]. Since several textures are fetched into the cache, the number of conflict misses increases considerably during multitexturing. In [49], the authors suggest a partitioned cache such that each active texture is allotted a partition of the cache. In our architecture, we observed that by buffering two blocks of each active texture in the TFM, we can reduce conflict misses in the L1 cache significantly. This is because 94% of the time, the four bilinearly interpolated texels fall in either one or two of the neighbouring blocks of the texture (Figure 3.4). Hence, by buffering two blocks per texture, we can filter out most of the accesses to L1, thereby reducing conflicts. We observed that the TFM, along with a 4-way associative L1 cache in the background, reduces the misses considerably and achieves a hit rate equal to that of a partitioned cache. Hence, by introducing the TFM we can use a conventional L1 cache and thus eliminate the overhead of cache partitioning. Since we have 8 buffers per texturing unit, we can support multitexturing with bilinear interpolation of a maximum of 4 simultaneous textures; this is good enough to support most current applications.

3.6 Experiments and Results

We describe in this section the experiments conducted to validate our proposed architecture on typical rendering examples used for evaluating graphics hardware. We have developed a trace driven simulator for the proposed architecture. We have instrumented the Mesa code (a software renderer for the graphics pipeline [50]) to generate the memory traces. Our experimental platform for validating the power optimizations in the proposed architecture is as follows: we developed a synthesizable VHDL model for the design and used Synopsys Design Compiler and PrimePower for synthesis and power/energy simulation. We have also used CACTI models [51] for estimating the energy of the caches and SRAMs in the designs.

3.6.1 Evaluation of the TFM architecture

The overall energy of a texture memory architecture depends on two main factors:

• Hit rates to the lower/smaller levels of the hierarchy

• Access energy of the corresponding levels

In the case of our proposed architecture, the hit rates are lower than those obtained using a large L1 cache, but the energy per access is much smaller. The overall energy for the TFM is the lowest among the evaluated architectures.

Hit rate comparison

Figure 3.14 shows the hit rate into TFM and compares it with the hit rates obtained by the following architectures:

1. The conventional 16KB, 2-way set associative cache as L1 and 256KB 4-way set associative L2 cache

2. 512B direct mapped cache as L1 and 256 KB L2 cache

3. 512B direct mapped filter cache along with L1 and L2

4. 512B fully associative filter cache along with L1 and L2

5. Texture Filter Memory along with L1 and L2

Figure 3.14: Hit rate comparison for texture memory architectures (hit rate per benchmark: FIRE, TEAPOT, TUNNEL, GLOSS, GEARBOX, SPHERE; configurations: 16KB 2-way associative, 512B direct mapped filter, 512B fully associative filter, TFM)

Case 1 above is the original proposal in [23]. Case 2 is the architecture proposed in [27] and cases 3-4 are variants of the proposal in [52]. From Figure 3.14, we observe that about 80% of the accesses hit in the TFM, which is significant in comparison to the 96% hit rate of the L1 cache. Since the majority of accesses go to the smaller TFM rather than the cache, we achieve a significant energy saving. We find that the TFM gives about a 4.5% better hit rate than a direct mapped cache of the same size used as a filter, and equals the miss rate of a fully associative cache of the same size, with far fewer comparisons per access than the fully associative cache.

Access energy comparison

Comparing the access energy of various architectures for texture memory (Figure 3.15), we observe that buffering the texels in the texturing unit yields a significant reduction in the average access energy. The average access energy with the TFM is 25% less than with a direct mapped cache of the same size, and 60% less than with a fully associative cache at a lower level of the hierarchy (below L1). We also see that using a direct mapped cache instead of a 2-way associative cache does not reduce energy, owing to its lower hit rate. An architecture with TFM consumes about 77% lower energy than the conventional architecture. The energy overhead due to the extra computation in the TFM is included in the reported energy numbers.

Figure 3.15: Average access energy comparison of texture memory architectures (average access energy in nJ per benchmark: FIRE, TEAPOT, TUNNEL, GLOSS, GEARBOX, SPHERE; configurations: 16KB 2-way associative, 512B direct mapped, with 512B direct mapped filter, with 512B fully associative filter, with TFM)

Access time comparison

The comparison of access times in Figure 3.16 shows that the TFM performs better than the other architectures in terms of access time as well, owing to its simpler circuit. Thus, our proposed architecture achieves both lower power and better performance than other existing proposals.

Area comparison

Figure 3.17 shows that by adding the TFM we incur an area overhead of only 0.48% over the conventional texture memory (L1: 16 KB, 2-way; L2: 256 KB, 4-way).

Figure 3.16: Average access time comparison of texture memory architectures (access time in ns for: 16KB 2-way, 8KB direct, with 512B fully associative filter, with TFM, with 512B direct mapped filter)

Figure 3.17: Area of several texture memory architectures (area for: 16KB 2-way, 512B direct map, with 512B direct map filter, with 512B fully associative filter, with TFM)

Figure 3.18: Leakage power consumption of various texture memory architectures (percentage savings in leakage power w.r.t. the conventional 16KB L1 texture cache, for: with TFM; drowsy with TFM; drowsy with TFM and predictive wakeup)

3.6.2 Leakage Power and Delay Comparison

Figure 3.19: Delay overhead of various drowsy policies (percentage delay overhead w.r.t. the conventional 16KB L1 texture cache, for: with TFM; drowsy with TFM; drowsy with TFM and predictive wakeup)

From Figure 3.18 we see that introduction of the TFM into the conventional texture memory sub-system leads to a small increase of 1% in leakage power. However, keeping the L1 texture cache always in the drowsy state (with 1 cycle wakeup latency, as in [47]) results in an 84% saving in leakage power at the cost of a 10% delay overhead. The more sophisticated technique of a drowsy L1 with predictive pre-wakeup results in 80% energy savings at less than 1% delay overhead, as shown in Figure 3.19.

3.6.3 Parallel rasterization architecture

In a parallel rasterization architecture with four texturing units, a 4-way associative cache is used in most commercial architectures instead of a 2-way cache to maintain a good hit rate. We have included a TFM in each of the texturing units and found that, out of the 96% hits to the texture cache subsystem, about 86% are to the TFMs, as plotted in Figure 3.20.

Figure 3.20: Hit rate in the parallel texture cache architecture (hit rates on GLOSS, GEARBOX, SPHERE for: 512B direct mapped cache, 16KB 4-way associative cache, 512B fully associative filter, TFM)

We find that an architecture with TFM consumes about 85% less access energy than the conventional architecture (Figure 3.21). The area overhead in this case is found to be as small as 0.27% (Figure 3.22).

Figure 3.21: Average access energy in the parallel texture cache architecture (access energy in nJ on GLOSS, GEARBOX, SPHERE for: 16KB 4-way associative cache, 512B direct mapped cache, with 512B fully associative filter, with 512B direct mapped filter, with TFM)

Figure 3.22: Area of the parallel texture caching architectures (area in square mm for: 16KB 4-way, 512B direct map, with 512B direct map filter, with 512B fully associative filter, with TFM)

3.6.4 Multitexturing

In the case of multitexturing, by using the TFM in addition to a 16KB 4-way associative cache we obtain a 15% improvement in hit rate. This performs about 7% better than a partitioned cache of the same size, and slightly better than a TFM backed by a partitioned cache (Figure 3.23). This is because buffering two blocks of each active texture is sufficient to reduce the conflict misses.

Figure 3.23: Hit rates for multitexturing (16KB 4-way, 16KB partitioned, TFM with 4-way, TFM with partitioned)

3.7 Summary

In this chapter we have seen that, in addition to high spatial and temporal locality, texture mapping in the graphics rendering pipeline also exhibits considerable predictability in its memory access pattern. We have utilized these properties of texture mapping to build a low power memory architecture for texture mapping. By buffering blocks of texture in registers, we have replaced high power cache lookups with low power register reads. We have proposed a smart lookup mechanism to maximize the hits into the registers with a small number of arithmetic operations. Our architecture consumes 75% less energy than existing texture cache architectures in a pipeline with a single texturing unit, and 85% less energy in parallel texturing. We also demonstrated that our architecture achieves about a 7% reduction in miss rate over the partitioned cache for multitexturing. We further demonstrated that our TFM architecture can be exploited to save up to 80% of the leakage power with a negligible delay overhead of 1%. Thus, for pixel texturing, we observe that our proposed architecture consumes lower energy than those reported in the literature, with no performance overhead and insignificant area overhead.

Chapter 4

Vertex Shader Partitioning

4.1 Introduction

In the previous chapter we elaborated on a custom memory architecture for a low power texture memory sub-system. In this chapter we present a compiler optimization technique for a low power geometry engine. Geometry, a measure of the number of objects present in the scene and the detail at which they are modeled, is one of the most important aspects determining the complexity and visual realism of a scene. With increasing amounts of geometry in a scene, the number of primitives per frame is growing greatly, since modeling at finer levels of granularity requires the objects to be represented with a larger number of smaller primitives. In older generation graphics systems, the geometry was processed on the CPU, and hence the amount of geometry that could be accommodated in a scene was constrained by the computational capacity of the CPU. Acceleration of vertex processing in newer generation graphics cards, by programmable vertex shading in hardware, facilitates advanced geometry processing, thus paving the way to the generation of realistic images. In the modern workloads used for benchmarking the performance of graphics cards, it has been observed that:

• There is a surge in the polygon count per frame – the polygon count in 3DMark05 is a few million polygons/frame, in contrast to 10-30K polygons/frame in yesteryear games like Quake3 or Doom3.


• Complexity of vertex shaders is increasing – It is now possible to apply advanced transformations to vertex position, use complex per-vertex lighting models and also render the surfaces with realistic material models.

With increasing vertex counts and vertex shader complexities, vertex shading has become one of the factors that significantly impacts the overall performance of a graphics application. We propose to reduce the amount of computation on geometry and hence reap the benefits of performance gain and power saving.

Table 4.1: % Trivial rejects per frame

Benchmark/Game           Trivial Reject Rate (%)
3DMark05                 50
Chronicles of Riddick    61
Unreal Tournament        54
Prey                     47
Quake4                   56
Doom3                    36

SHADER PROGRAM:
dp4 o0.x, c0, i0
dp4 o0.y, c1, i0
dp4 o0.z, c2, i0
dp4 o0.z, c3, i0
dp4 o6.z, c4, i0
dp4 o6.z, c5, i0
dp4 o7.z, c6, i0
dp4 o7.z, c7, i0
mov o1, i3

PARTITION 1 (VS1), position dependent partition:
dp4 o0.x, c0, i0
dp4 o0.y, c1, i0
dp4 o0.z, c2, i0
dp4 o0.z, c3, i0

PARTITION 2 (VS2), position independent partition:
dp4 o6.z, c4, i0
dp4 o6.z, c5, i0
dp4 o7.z, c6, i0
dp4 o7.z, c7, i0
mov o1, i3

Figure 4.1: Vertex shader partitioning example

It has been observed from the simulation of games and benchmarks, as shown in Table 4.1, that on average about 50% of triangles are trivially rejected in each frame.

Trivial rejects account for the triangles that fall totally outside the viewing frustum, and also for front/back face culled triangles. Since testing for trivial rejection requires only the position information of the vertex, the time spent on processing the non-transformation part of the vertex shader on these vertices is wasteful. Instead, if we partition the vertex shader into position-variant (transformation) and position-invariant (lighting and texture mapping) parts, and defer the position-invariant part of the vertex shader until after the trivial reject stage of the pipeline, we can achieve significant savings in the cycles and energy expended on processing these rejected vertices. An example illustrating vertex shader partitioning is shown in Figure 4.1.

Figure 4.2 shows the changes to be incorporated in the conventional graphics pipeline to introduce partitioned vertex shading. In the modified pipeline, the Vertex Shader 1 (VS1) stage computes only the position-variant part of the vertex shader and the rest of the vertex processing is deferred to the Vertex Shader 2 (VS2) stage. The clipper stage is divided into Trivial Reject and Must Clip stages. The triangles that pass the trivial reject test are disassembled into vertices to feed the VS2 stage (since the vertex shader can only work on vertices). These vertices, after being processed in VS2, are assembled back into triangles to be sent to the Must Clip stage.

Table 4.2 shows the improvement, in terms of instructions saved, on some vertex shader programs by partitioning them. The savings are calculated assuming about a 50% trivial reject rate. The results are obtained by compiling the shaders to the instruction set implemented in the ATTILA framework [53]. ATTILA is a cycle-level simulation framework for an architecture that closely resembles a modern graphics processor architecture. From the results we see that vertex shader partitioning leads to a significant reduction in instruction count, thus motivating the adoption of vertex shader partitioning into a graphics pipeline. We have implemented the same in ATTILA and rendered frames from games like Unreal Tournament, Chronicles of Riddick, and Quake to illustrate the benefits of our proposal.

In [58], the authors report the power benefits achieved if the API supported lighting and texturing of the vertices after the trivial reject stage, but we observed that such a hard partitioning of the vertex shader is not always beneficial. This could be due to one of the two reasons discussed below.

Figure 4.2: Pipeline modified to support vertex shader partitioning ((a) conventional pipeline; (b) modified pipeline with Vertex Shader 1, Trivial Reject, Primitive Disassembly, Vertex Shader 2, Primitive Assembly, and Must Clip stages)

Table 4.2: % Instructions saved due to vertex shader partitioning

Shader                               Inst Count   VS1 Inst Count   VS2 Inst Count   % Inst Saved
Hatching [54]                        54           4                50               46.3
Cartoon [54]                         22           4                18               41
Directional Lighting [55]            34           8                30               32.4
3 lights [55]                        78           8                74               42.3
8 lights [55]                        185          8                181              46.8
Move Light [55]                      21           4                25               21.4
Crystal [56]                         19           4                15               39.4
Water [56]                           77           26               74               18.2
Bubble [1]                           62           23               58               16.12
Vertex Blending with Lighting [57]   51           22               51               6.86

• The thread setup overhead for the second vertex shader could overshadow the advantage of deferring the position-invariant part of the shader code. Hence we propose an adaptive algorithm for vertex shader partitioning, which takes the decision based on the trade-off between the setup overhead and the cycles saved due to partitioning.

• Moreover, we observe that there could be a significant number of instructions common to the position-variant and position-invariant parts of the vertex shader. Our algorithm identifies the set of intermediate values to be transferred from the VS1 stage to VS2 so as to minimize the amount of code duplication resulting from partitioning the shader.

Shader partitioning has been studied in [59, 60] in the context of code generation for multi-pass shaders for execution on GPUs that are constrained by the availability of resources. Virtualization of GPU resources is achieved by dividing the shader into multiple smaller programs, and the number of such passes is to be minimized to maximize performance. In our case, the number of code partitions is fixed at two, and the aim is to minimize the amount of code duplicated between the two partitions.

4.2 Shader Compiler

We propose to include the vertex shader partitioning pass in the code generation phase of the shader compiler. The binaries of all the compiled shaders reside in the GPU memory and the device driver sends the appropriate shader to the pipeline as a part of the state. Code is generated for both the non-partitioned and the partitioned shader during compilation, and the decision to enable or disable partitioning is taken dynamically by the driver, as explained later in the chapter. Since compilation is only a one-time process, incorporation of this additional pass results in only a small software overhead. The algorithm used to partition the vertex shader code is explained herein.

4.2.1 Partitioning with Minimum Duplication

A DAG (Directed Acyclic Graph) representing the data-flow in the vertex shader program is the input to the partitioning phase. Each node in the graph represents an operand, and a directed edge between two nodes exists if there is a data dependency between them. A principal input (PI) has no incoming edge and a principal output (PO) has no outgoing edge. We introduce a source node S connected to all the PIs (I0, I1, I2 and I3 in Figure 4.3) and a destination node T connecting all the POs (O0, O1 and O2 in Figure 4.3), and call this new graph N. Trivial code partitioning can be done by a simple BFS (Breadth First Search) traversal of the graph starting from the destination node, separating all nodes reachable from the PO representing the position attribute (O0 in Figure 4.3) of the vertex program into VS1, and all the nodes that are reachable from the rest of the POs (O1 and O2 in Figure 4.3) into VS2. But this partitioning results in duplication of the instructions that are common to both VS1 and VS2 (the shaded nodes in Figure 4.3). The duplication of code can be minimized by sending some of the intermediate values to VS2 along with the position output attribute, and representing the VS2 program as a function of the primary inputs (PI) and outputs from VS1 (AUX):

VS2 = f(PI, AUX)

Let i be the number of PIs to the VS2 stage and k be the number of auxiliaries sent from VS1 to VS2. If a vertex shader program can accept a maximum of T inputs, then

i + k = T

Figure 4.3: DAG representing the data-flow in a vertex shader (source node S feeding PIs I0-I3, internal nodes 1-11, and POs O0-O2 feeding the sink node T)

Now the problem is to identify, from the set of m intermediate values, the set of k values to be transferred from VS1 to VS2 such that the number of instructions duplicated in VS2 is minimized. A heuristic that uses an iterative approach to solve this problem is presented below.

We start with the DAG of the vertex shader N and separate the common code shared by VS1 and VS2 by a BFS traversal of the graph starting from the leaves (POs), separating all the nodes that are reachable from the PO of VS1 (the position output attribute of the shader program) and the POs of VS2 (the rest of the output attributes generated by the shader program) into a graph M, as shown in Algorithm 5.

Algorithm 5 PASS 1: Separate the Duplicated Code
Input: DAG of the vertex shader
Output: DAG of the duplicated code
1: Start BFS traversal of the DAG from the destination node.
2: Color the node corresponding to o0 red and all other outputs blue.
3: If a node is reachable only from red colored nodes, color it red.
4: If a node is reachable only from blue colored nodes, color it blue.
5: If a node is reachable from both red colored nodes and blue colored nodes, color it green.
6: The DAG of green colored nodes represents the duplicated code.

The leaf nodes of this graph represent all the temporaries that are common to both VS1 and VS2, and hence the instructions generating them need duplication in VS2 if they are not transferred from VS1. We assign to each node a weight equal to one more than the sum of the weights of all its parents, starting from the principal inputs, which are given a weight of zero, as shown in Algorithm 6. Thus the weight of a node gives a measure of the number of instructions required to generate that temporary value. Since at this stage in compilation, high level instructions are already divided into micro-operations, it is acceptable to assume that each operation takes an equal amount of time to execute. In a system with variable instruction latencies, the weight would be calculated as the sum of the latencies of its parents plus the delay of the instruction represented by the node.

Algorithm 6 PASS 2: Label the DAG
Input: DAG from PASS 1
Output: DAG with nodes labeled with weights
1: BFS traversal of the DAG starting from the source node.
2: Assign all the PIs a weight of 0.
3: On reaching a node,
4: if node is not green then
5:   Drop it.
6: else
7:   Weight of node ← 1 + sum of weights of its parents
8: end if
9: If a leaf node is reached, insert it into a max heap (sorted by weight)
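Passes 1 and 2 can be realized directly from the adjacency information of the DAG. The following Python sketch is our illustration, assuming the graph is given as parents/children dictionaries keyed by node id (it is not the thesis's implementation):

from collections import deque

def color_nodes(parents, pos_out, other_outs):
    # PASS 1: reverse BFS from the outputs. A node reachable only from the
    # position output is red, only from the other outputs blue, and from
    # both green; the green nodes form the duplicated code.
    color = {pos_out: {'red'}}
    for o in other_outs:
        color[o] = {'blue'}
    work = deque([pos_out] + list(other_outs))
    while work:
        n = work.popleft()
        for p in parents[n]:                    # walk towards the inputs
            before = set(color.get(p, set()))
            color.setdefault(p, set()).update(color[n])
            if color[p] != before:
                work.append(p)
    return {n for n, c in color.items() if c == {'red', 'blue'}}

def label_weights(children, parents, principal_inputs, green):
    # PASS 2: forward BFS. PIs get weight 0; every green node gets
    # 1 + sum of its parents' weights, i.e. the number of instructions
    # needed to recompute it from the PIs.
    weight = {pi: 0 for pi in principal_inputs}
    work = deque(principal_inputs)
    while work:
        n = work.popleft()
        for c in children.get(n, ()):
            if c in green and c not in weight and all(p in weight for p in parents[c]):
                weight[c] = 1 + sum(weight[p] for p in parents[c])
                work.append(c)
    return weight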

We sort the leaf nodes in order of weight and divide them into two sets A and B, such that set A contains the k nodes of largest weight (line 1 of Algorithm 7). Taking set A as an initial solution, we iteratively replace nodes in set A with nodes from the rest of the graph M, such that each replacement reduces the residual computation required to generate all the leaf nodes from the PIs and the nodes in the solution set. Starting from the leaf nodes, we traverse up the graph until we reach a node (say node R) which is reachable from at least two of the leaf nodes (lines 2-4 of Algorithm 7). Consider the following three scenarios.

1. The node R is reachable from at least two nodes from set A. Consider the scenario shown in Figure 4.4. Instead of sending P and Q, we can send node R and node D. By sending node R we can compute P and Q in VS2 at a cost of 2 instructions, while saving a cost of 5 by sending D. If node R has a fan-out higher than two belonging to set A, we choose the two nodes of lowest weight for replacement, since they can be recomputed at the lowest cost. Let P and Q be the nodes of smallest weight from set A that are reachable from R (as computed in lines 5-11 of Algorithm 7), and let D be the node of largest weight from set B (as computed in lines 12-14 of Algorithm 7).

Figure 4.4: Vertex shader partitioning algorithm: Case 1 (node R of weight 10 reaching nodes P and Q of weight 11 in set A; node D of weight 5 in set B)

Condition 1: Wr + Wd > Wp + Wq − 2Wr
Action: Replace P and Q from set A with R and D

In case the above condition fails, as in the scenario shown in Figure 4.5, we check whether the fan-out of node R has nodes in set B, and compute the decrease in the cost of generating these nodes when R is sent to VS2. If this difference is greater than the cost incurred in computing node P from R, we replace node P with node R (lines 19-21 of Algorithm 7).

Figure 4.5: Vertex shader partitioning algorithm: Case 2 (node R of weight 6 reaching nodes P of weight 14 and Q of weight 10 in set A; node D of weight 9 in set B)

Condition 2: Wr > Wp − Wr
Action: Replace node P with node R

The same is shown in lines 22-25 of Algorithm 7.

2. The node R is reachable from only one node P from set A. We use Condition 2 to evaluate a possible replacement of node P with node R.

3. The node R is not reachable from any of the nodes from set A. No replacement is possible in this case.

The BFS traversal is carried out until the root node is reached, and whenever a node is encountered that is common to two or more leaf nodes, we use the conditions enumerated above to find a possible replacement that would improve the solution. The pseudo code for the algorithm discussed above is presented below. The algorithm requires three BFS traversals of the graph. In each BFS traversal, the nodes and edges of the graph are visited once and a constant amount of work is done at each node and edge. If e is the number of edges and m is the number of nodes in the graph, the time complexity of the algorithm is O(e + m).

Algorithm 7 PASS 3: Partitioning the DAG
Input: DAG from PASS 2 and the heap of the leaf nodes sorted in the descending order of weight of the nodes. Each node has fields w1, w2, w3 set to zero and n1, n2 pointing to NULL.
Output: Set A containing k intermediate nodes to be transferred from VS1 to VS2
1: The first k nodes de-queued from the heap form set A and the rest of the nodes form set B. Set A represents the solution set. Insert all the leaf nodes in an empty queue Q.
2: while Q not empty do
3:   N ← dequeue(Q), w ← weight(N)
4:   Reach all the parents of N.
5:   if N′ is reachable from N, such that N′ ∈ A then
6-18: [garbled in the source; these lines evidently record in n1(N), n2(N) the two smallest-weight set-A leaves reachable from N (weights w1(N), w2(N)) and in w3(N) the weight of the set-B candidate N3]
19:   if w(N) + w3(N) > w1(N) + w2(N) − 2w(N) then
20:     replace nodes N1 and N2 from set A, with nodes N and N3. Delete N3 from the heap.
21:   end if
22: else if n1(N) is not NULL and w3(N) is not zero then
23:   if w(N) > w1(N) − w(N) then
24:     replace node N1 from set A with node N.
25:   end if
26: end if
27: end if
28: end while
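The two replacement tests that drive PASS 3 are plain weight comparisons; the following sketch illustrates them with the numbers recoverable from Figure 4.4 (the function names are ours):

def condition1(wR, wP, wQ, wD):
    # Swap {P, Q} in set A for {R, D}: profitable when the weight gained by
    # sending D exceeds the cost of recomputing P and Q from R in VS2.
    return wR + wD > wP + wQ - 2 * wR

def condition2(wR, wP):
    # Swap P in set A for its ancestor R: profitable when recomputing P
    # from R (cost wP - wR) is cheaper than the weight wR that R carries.
    return wR > wP - wR

# Figure 4.4: wP = wQ = 11, wR = 10, wD = 5. Recomputing P and Q from R
# costs 11 + 11 - 2*10 = 2 instructions, while sending D saves 5,
# so the swap is taken.
assert condition1(10, 11, 11, 5)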

4.2.2 Comparison with the naïve algorithm

Figure 4.6 shows the reduction in duplicated instructions, and hence the increase in the percentage of instructions saved, by using our partitioning algorithm over the trivial partitioning algorithm used in [58]. The savings are reported assuming a trivial reject rate of 50%.

Figure 4.6: Comparison of the proposed algorithm with the existing one (% instructions saved and % duplication, for the naïve algorithm and PMD, on Water, Bubble, and Vertex Blending with Lighting)

4.2.3 Selective Partitioning of Vertex Shader

Spawning a thread on the PU incurs some thread setup overhead, the value of which depends on the micro-architecture of the thread setup unit. This could include the idle time waiting for the availability of resources, the time spent on loading the inputs, the time spent on transmission of outputs, etc. Partitioning of the vertex shader results in the extra overhead of VS2 thread setup. Hence it is very important to weigh the benefit of cycles saved on rejected vertices against the overhead incurred on thread setup for the vertices that are not rejected. The cost incurred to process a batch of vertices (B) without vertex shader partitioning is given as

Cost_no-part = B × (VS Thread Setup overhead + Execution time of VS)

If we assume C to be the rate at which vertices are trivially rejected, then cost incurred to process the batch with partitioning is given as

Cost_part = B × (VS1 Thread Setup overhead + Execution time of VS1) + B × (1 − C) × (VS2 Thread Setup overhead + Execution time of VS2)

Vertex shader partitioning is profitable only if Cost_part is less than Cost_no-part. Hence the driver takes as inputs the architecture-specific thread setup overheads for VS1 and VS2 and the predicted trivial reject rate for the frame. The execution time of a program is approximated by the number of instructions in the program, and the partitioning decision is taken. If partitioning results in an advantage, the driver sends the partitioned code to the instruction memory of the shader and enables the VS2 stage of the pipeline. Otherwise, the driver sends the unpartitioned code to the VS1 stage and disables the VS2 stage. From studying the frames of various games, as shown in Figure 4.7, we have observed that the clip rates of consecutive frames are similar. Hence this information can be used to predict the trivial reject rate of the next frame to be processed. We have observed that this history based adaptive partitioning algorithm results in attractive performance benefits in comparison to hard partitioning of the vertex shader.

Figure 4.7: Variation of trivial rejects across frames of UT2004 (TR rate for frames #60 - #200)
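The driver's run-time decision then reduces to evaluating the two cost expressions above, with the previous frame's trivial reject rate standing in for C. A sketch follows (the parameter names and example numbers are illustrative, not taken from the thesis):

def use_partitioning(batch, setup_vs, len_vs, setup_vs1, len_vs1,
                     setup_vs2, len_vs2, predicted_tr_rate):
    # Enable the VS2 stage only when the predicted cost with partitioning
    # is lower; execution time is approximated by instruction count.
    cost_no_part = batch * (setup_vs + len_vs)
    cost_part = (batch * (setup_vs1 + len_vs1)
                 + batch * (1 - predicted_tr_rate) * (setup_vs2 + len_vs2))
    return cost_part < cost_no_part

# Example: 1000 vertices, 10-cycle thread setup, a 50-instruction shader
# split into a 12-instruction VS1 and a 42-instruction VS2, 50% rejects:
# 60000 cycles unpartitioned vs. 22000 + 26000 = 48000 partitioned.
print(use_partitioning(1000, 10, 50, 10, 12, 10, 42, 0.5))  # True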

4.3 Framework

We have implemented vertex shader partitioning in the ATTILA simulator framework (shown in Figure 4.8), and the details of the same are discussed here. ATTILA is an open source simulation framework that models a modern GPU micro-architecture. Though the implementation details of our proposal are with reference to the ATTILA framework, the ideas are generic enough to be incorporated into any micro-architecture with minor variations. In this section, the architectural details pertaining to vertex processing on the graphics processor modeled in the ATTILA framework are explained.

Figure 4.8: ATTILA architecture (Application → GL_Interceptor → gl2attila → Driver → GPU Simulator)

The Command Processor acts as an interface between the driver and the pipeline. It maintains the state of the pipeline and, on receiving commands from the driver, updates the state of the pipeline and issues appropriate control signals to the rest of the units in the pipeline. At every context switch, the command processor:

• Loads the control registers: it updates the registers representing the render state.

• Initiates the transfer of data to GPU memory: it sets up the transactions to fill the vertex buffer and index buffer, and to load the shader programs, textures, etc., from system memory to GPU memory.

The Streamer is responsible for reading the vertices from the GPU memory and setting up the vertex shader threads to process them. Vertices and indices are buffered in FIFOs, and the shader loader spawns a vertex shader thread whenever a programmable unit is free. When the indexed mode is used to address the vertices, a post-shading vertex cache is used for reusing the shaded vertices. The shaded vertices are sent to the primitive assembly unit. The Unified Shader architecture implemented in ATTILA is based on the ISA described in the ARB vertex and fragment program OpenGL extensions. The ALU is a 4-way SIMD working on 4-component floating point vectors. The execution pipeline has single cycle Fetch and Decode stages. The Execution and Write-back stages together can take a latency of 1-9 cycles, depending on the instruction. The instruction memory is implemented as a scratchpad memory which is divided into three logical partitions, one each to hold the vertex, triangle setup, and fragment shader programs. The driver is responsible for loading the shader code into the scratchpad (if it is not already present) whenever there is a state change. The register file in the shader has four banks: a read-only bank to store the input attributes, a write-only bank for storing output attributes, another read-only bank to hold the constants, and a read-write bank to hold the intermediate results of the program. The constant register bank is shared by all threads, while 16 each of input/output and temporary registers are allocated per thread. The changes incorporated in the ATTILA framework to support partitioned vertex shading are illustrated in Figure 4.9.

Figure 4.9: Modified ATTILA architecture (streamer (VS1) → primitive assembly → trivial reject → feeder (VS2) with vertex input buffer → must clip → primitive setup → fragment generation → z-test → shading/interpolation → blend, backed by four memory controllers)

4.3.1 Partitioning the assembly code

Since the source code for the ATTILA device driver is not openly available, we could not incorporate the partitioning algorithm in the driver. Instead, whenever the command processor receives a request to transfer a vertex shader program from GPU memory to the shader instruction memory, we dump the binary of the vertex shader program, disassemble it, and run the shader partitioning algorithm on the assembly code. The command processor takes the adaptive decision of enabling/disabling the partitioning of the vertex shader and generates the appropriate control signals for the Streamer and the Feeder units. It also loads the appropriate machine code into the instruction memory of the programmable units.

4.3.2 Vertex Input Buffer

We introduce a small fully associative cache to buffer the vertex input attributes read at the VS1 stage for reuse at VS2 stage. Since the vertices reach the VS2 stage in the same order in which they are processed at VS1 stage, we choose to use FIFO replacement policy for the buffer. The trivially rejected vertices would not be processed at the VS2 stage. Hence the de-allocation signal to the buffer from the trivial reject stage helps free the lines that carry the trivially rejected vertices, thus increasing the hit rate into the cache.
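A behavioural sketch of this buffer follows (the FIFO eviction mechanism and the method names are our assumptions):

from collections import OrderedDict

class VertexInputBuffer:
    # Small fully associative buffer for vertex input attributes, filled
    # at the VS1 stage and read back at VS2. FIFO replacement; lines of
    # trivially rejected vertices are freed early by the reject stage.
    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()            # vertex index -> attributes
    def fill(self, vertex_idx, attributes):   # called at VS1
        if len(self.lines) >= self.capacity:
            self.lines.popitem(last=False)    # evict the oldest line
        self.lines[vertex_idx] = attributes
    def lookup(self, vertex_idx):             # called at VS2
        return self.lines.get(vertex_idx)     # None on a miss
    def deallocate(self, vertex_idx):         # trivial-reject signal
        self.lines.pop(vertex_idx, None)      # free the line early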

4.3.3 Feeder Unit

The micro-architecture of the feeder unit is shown in Figure 4.10. The triangle buffer holds the triangles accepted after passing through the Trivial Reject stage of the pipeline. A triangle is a set of three vertices, each associated with an index to address it and the position attribute computed by VS1. If vertex shader partitioning is disabled for the current shader, the triangles are sent to the 3D clipping unit from the triangle buffer. If partitioning is enabled, the triangle is disassembled into vertices and sent to the shader loader. A vertex output cache is used to reuse the shaded vertices. When the shader loader receives a vertex, it looks the vertex up in the vertex output cache. On a cache miss, the shader loader reads the input attributes of the vertex from the vertex input buffer and spawns a VS2 thread for processing the vertex. The triangle assembly unit receives the position output attribute of the vertices from the triangle buffer and the rest of the attributes from the programmable shader. After triangle assembly, the triangles are sent to the 3D clipping unit.

Figure 4.10: Feeder unit architecture (streamer, trivial reject, vertex I/P buffer, memory controller, shader loader, shader O/P cache, triangle buffer, triangle assembly, shader enable, must clip)

4.4 Experiments and Results

4.4.1 Performance Improvement

To illustrate the advantages of vertex shader partitioning, we have taken a few frames from the games Unreal Tournament 2004 (UT), Chronicles of Riddick (CR) and Quake4 and rendered them on the ATTILA simulator modified to support vertex shader partitioning. Some of the sample frames rendered on the modified architecture are shown in Figure 4.13. The frames rendered by the modified architecture are compared with the frames rendered on the basic architecture to validate the correctness of our implementation. Figure 4.11 shows the saving in vertex shader instructions and Figure 4.12 shows the corresponding improvement achieved in cycles spent on geometry processing by adopting our algorithm. The results are compared against those achieved by the naïve fragmenting algorithm proposed in [58]. We observe that due to the hard fragmentation of the vertex shader as proposed by them, there is a degradation of performance on those frames with a smaller clip rate and on those that have so few instructions in the VS2 stage that the thread setup overhead overshadows the advantage of partitioning. This can be observed in the case of the frame rendered from the game Quake4. In Quake4 most of the effects are achieved by texturing, and vertex shaders are used only for simple transformations. Hence in this case, hard partitioning results in a small saving in instructions, but leads to a negative impact on performance. In contrast, our adaptive algorithm has an almost negligible effect on performance in such frames. Similarly, in frames containing shaders that have a lot of common code between the VS1 and VS2 stages, hard partitioning leads to a negative impact on performance due to duplication of instructions. This can be observed from the frames rendered from Chronicles of Riddick. This frame uses shaders for vertex blending and lighting. This shader, when partitioned naïvely, would lead to a large number of instructions being duplicated, as reported in Table 4.2. Our approach of reusing in VS2 the intermediate values generated at VS1 leads to better results. In frames having higher trivial rejects and using a large number of instructions in the VS2 stage, our algorithm gives better performance improvement, and the saving in instructions is comparable to that achieved by the hard partitioning method. This can be seen from the scenes rendered from Unreal Tournament. From the graph we see that the number of instructions saved by our algorithm is slightly less than that achieved by the naïve method. But this is intentional, since we avoid partitioning the shaders that would result in greater thread setup overhead, and hence achieve better performance than the naïve method.

Figure 4.11: % Instructions saved due to adaptive vertex shader partitioning (naïve vs. adaptive partitioning on UT(50), UT(80), UT(117), CR(325), CR(500), Quake4(150))

Figure 4.12: % Cycles saved due to adaptive vertex shader partitioning (naïve vs. adaptive partitioning on UT(50), UT(80), UT(117), CR(325), CR(500), Quake4(150))

From simulations of several frames from Unreal Tournament, Chronicles of Riddick, and Quake4, we observe that adaptive vertex shader partitioning can save up to 50% of instructions, which results in up to 15% performance improvement and also a significant improvement in energy.

4.4.2 Energy Reduction

In addition to the saving due to the reduction in the number of instructions, another advantage of vertex shader partitioning is that it is not required to fetch the input attributes needed by the second stage of the vertex shader for the trivially rejected vertices. This results in additional energy savings due to the reduction in memory accesses. From our experiments we observe that up to 60% of input attribute fetches can be saved due to vertex shader partitioning. To estimate the overall energy savings due to vertex shader partitioning, we use an energy model where the energy of an operation is the sum of the energy consumed in the data memory, the instruction memory, and the floating point unit. We assume that the number of accesses to the instruction memory is equal to the floating point operation count. We consider an architecture with a 512KB 4-way associative data memory and a 32KB instruction memory, and estimate their energy consumption using CACTI [61]. For the floating point unit, we use the ARM vector floating point architecture and obtain its energy from its specifications [62]. For a 50% reduction in instructions and a 60% reduction in data memory accesses, as observed from our experiments, we achieve a 56% saving in energy.

Figure 4.13: Frame captures from the games ((a) Unreal Tournament frame 50; (b) Chronicles of Riddick frame 500; (c) Quake frame 100)

4.5 Summary

In this chapter we have seen that, in contrast to older generation games which were predominantly pixel processing intensive, the newer ones tend to aim at greater realism by incorporating more and more geometry, thus increasing the load on vertex processing. Correspondingly, we have observed an increasing number of primitives per frame and also an increasing size of vertex shader programs. Based on the observation that about 50% of the primitives are trivially rejected in each frame in games, we proposed a mechanism for partitioning the vertex shader and deferring the position-invariant part of the shader until after the trivial reject stage of the pipeline. We also identified that such partitioning can have a negative impact on performance in some cases, and hence proposed an adaptive partitioning scheme that applies partitioning only in scenarios that result in benefits. From our experiments we have observed up to 50% saving in shader instructions and a 60% reduction in input attribute fetches due to vertex shader partitioning. This results in about a 15% speed-up and up to a 56% energy saving in the geometry engine.

Chapter 5

DVFS for Tiled GPUs

5.1 Introduction

So far we have seen component level power optimizations for two of the main blocks of the GPU, namely the geometry engine and the texture memory unit. In this chapter we present the system level power optimization technique of Dynamic Voltage and Frequency Scaling. This DVFS scheme exploits features specific to tiled rendering GPUs and achieves better power savings than schemes that are independent of the rendering order. Figure 5.1 shows the variation in processing time of consecutive frames in two games: Unreal Tournament and Quake4. We observe that the workload of frames varies considerably with time. In Figure 5.1(a) we notice that the workload of frames 71-96 is about one-fifth of the workload of frames 6-50. Since the workload variations are large, a significant amount of power can be saved by tuning the computational capacity of the graphics processor to the varying workloads of the frames. The properties of an application that make it amenable to DVFS are:

1. varying workload, and

2. accurate predictability of workload.

Though games show significant variation of workload from frame to frame, they require an accurate prediction scheme to be able to take advantage of DVFS by anticipating workloads in advance. Games, being interactive, require fast on-line workload prediction schemes to make it possible for the system to adapt to abrupt changes in the workload. Off-line workload characterization schemes, as used in applications such as video decoders [63], cannot be employed for games, since the frame content is determined only at run-time.

Figure 5.1: Workload variation in games ((a) UT2004; (b) Quake4; cycles per frame plotted against frame number)

To understand why interactivity makes games distinct from any other real-time application in the context of DVFS, let us see how DVFS is applied to a video decoder. The MPEG video decoding standard has been used extensively in the literature to make a case for DVFS [64, 65]. A video stream is generally composed of a series of still images which, when displayed sequentially at a constant rate, create an illusion of motion. These images are called frames, and are stored in compressed form so as to minimize the storage and transfer requirements. MPEG is a popular standard used for compression of video streams. MPEG compression divides the video stream into a sequence of Groups of Pictures (GOPs), each of which comprises several frames. Each frame is further divided into vertical strips called slices, and each slice is divided into several macroblocks which comprise a 16×16 pixel area of the image. Header information is associated with each structure in this hierarchical representation of the video stream; this information is used for workload prediction [66]. During the decoding process, all frames in a GOP are first buffered, the workload of the entire GOP is estimated, and the optimum value of voltage and frequency at which the GOP is to be decoded to meet the deadline is determined. Since the first frame is displayed only after the entire GOP is decoded, an output buffer is required to hold all the decoded frames of the GOP. Since the input frames keep streaming in at a constant rate, while decoding time could vary from frame to frame, an input buffer is used to store the incoming frames, as shown in Figure 5.2.

Figure 5.2: I/O buffering for video decoder (incoming frames → input buffer → video decoder → output buffer → decoded frames)

The example shown in Figure 5.3 clearly demonstrates the advantage of buffer based DVFS over DVFS without buffering [66]. If DVFS is applied over a set of buffered frames instead of varying the operating point online on a per-frame prediction basis, the slack over the set of frames can be accumulated and distributed among all the frames, leading to a lower power solution. An additional advantage of the buffer based mechanism is that it is possible to correct the losses due to mis-predictions to some extent. Errors in predicting the workload result in accumulation of frames in these buffers [67]. Hence the buffer occupancy is constantly monitored to correct the prediction inaccuracies, and the operating voltage and frequency are scaled accordingly, as shown in Figure 5.4.

Figure 5.3: Buffer occupancy based DVFS for video decoder. (a) Without DVFS: both frames run at (V, F); E = V^2 × F × (1.5T). (b) Per-frame DVFS without buffering: E1 = V^2 × F × T + (V/2)^2 × (F/2) × T = 0.75E. (c) With buffering: both frames run at (3V/4, 3F/4) over 2T; E2 = (3V/4)^2 × (3F/4) × 2T = 0.56E.
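The ratios quoted in Figure 5.3 follow from the usual E ∝ V²·f·t energy model; writing the arithmetic out (frame 1 needs time T at (V, F), frame 2 needs T/2 of work, and the pair shares a 2T deadline):

\begin{align*}
E   &= V^2 F (1.5T)\\
E_1 &= V^2 F T + \left(\tfrac{V}{2}\right)^2 \tfrac{F}{2} T
     = V^2 F T \left(1 + \tfrac{1}{8}\right) = 0.75\,E\\
E_2 &= \left(\tfrac{3V}{4}\right)^2 \tfrac{3F}{4} (2T)
     = \tfrac{27}{32}\, V^2 F T = 0.5625\,E \approx 0.56\,E
\end{align*}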

Figure 5.4: Buffer occupancy based correction of workload prediction ((a) predicted deadlines for frames 1 and 2; (b) frame 1 under-predicted, frame 2 accelerated; (c) frame 1 over-predicted, frame 2 slowed down)

In the case of games, the application has to respond to the user's inputs, and ideally the response needs to be instantaneous. For instance, when the user is playing a shooter game and has fired at a target, he would want this to reflect immediately on the screen rather than after some latency. Here we do not have the liberty to buffer a set of frames and then correct the prediction inaccuracies of one frame by adjusting the operating voltage of the next frame.

Interactivity also gives the impression that workload prediction based on previous frames' behavior would not be viable for games. However, this is not true, owing to the fact that consecutive frames of games exhibit high levels of correlation in their pixel values. This is because, to maintain continuity of motion, the positions of the objects in the frame can be displaced only by small amounts. Thus, the workloads do exhibit large but infrequent variations, making games excellent candidates for DVFS.

We notice in Figure 5.1 that although the workload variations are large, the workload of consecutive frames is mostly comparable (in Figure 5.1(a) the workload does not change much for about 5-6 neighbouring frames). Slow variation in the workload makes it possible to employ history based schemes for workload prediction. However, since these schemes result in mis-predictions at transitions of the workload, some of the frames can miss their deadlines, resulting in considerable degradation of visual quality. We demonstrate that tiled rendering architectures have an advantage over immediate mode rendering architectures in the context of DVFS. This is mainly due to the following reasons. Since pixel processing, which is the major component of the workload of games, is started only after the geometry processing of the whole frame is completed, it is possible to extract parameters of the frame during the geometry stage which can aid in predicting the rasterization workload of the frame with greater accuracy. We can also apply DVFS at the granularity of tiles, as opposed to frames, in these architectures. Operating at a finer granularity allows early corrective action in case of under-predictions and a more efficient utilization of slack in case of over-prediction. The main contributions of this work are as follows:

• an efficient workload prediction in tiled graphics architectures, which exploits the knowledge of the geometry processing results for the entire frame while predicting the workload of an individual frame.

• a DVFS scheme that is more power-efficient and incurs fewer deadline misses than traditional frame-level power optimization mechanisms, by continuously tracking prediction inaccuracies and taking corrective action after processing each tile.

5.2 Workload Estimation of a Tiled Graphics Processor

In 3D graphics, an illusion of motion is created by displaying static pictures that capture the progressive displacement of objects in the scene at a rate faster than the eye can perceive. For example, in shooter games, the player views the action through the eyes of the player character as it races through the environment. Hence, the frames capture the view seen by the character at close intervals as the character advances on its path, as shown in Figure 5.5. Similarly, racing games project the view as seen from a point positioned behind the vehicle looking in the direction of motion, as the vehicle progresses on its track. To emulate character animation in games, the frames are displayed such that the character changes its position and pose incrementally from those in the previous frame.

In all the above cases, the rate of displacement between the frames is kept small to simulate continuity of motion and avoid jerkiness in movements. As a result, consecutive frames exhibit a high level of correlation in their content, and hence their workload is also comparable. This is evident from the workload characteristics presented in Figure 5.1 as well. As a consequence of this correlation, when the frames are divided into tiles, the workload of a tile also remains similar to the workload of the corresponding tile in the previous frame. Interestingly, even when the workload of consecutive frames varies considerably, a large number of tiles still have workloads that are comparable to their workloads in the previous frame. For example, consider the frame sequence shown in Figure 5.6. Frame 7 has a higher workload than frame 6 due to the addition of a new object in frame 7. However, since the environment remains almost the same in both frames, a large number of tiles still have the same workload.

From our analysis of several modern 3D games, we found that the tile level correlation of workload is considerably high. Figure 5.7 shows the percentage of tiles whose workload is estimated at different levels of accuracy by the history based predictor for some benchmark applications. We observe that the workload of more than 80% of tiles can be predicted with an error of less than 10%. Hence, as a rough estimate, the workload of a tile can be approximated to be equal to the workload of the same tile in the previous frame, and this can be used as the workload prediction for dynamic voltage and frequency scaling techniques.

Figure 5.5: Consecutive frames of UT2004 ((a) frame 34; (b) frame 35; (c) frame 36; (d) frame 37)

Figure 5.6: Consecutive frames of UT2004 ((a) frame 6; (b) frame 7)

In the next section, we show that this history based workload estimation, when combined with tile-level DVFS control, performs much better in terms of avoiding deadline misses and also exploits available slack to a much greater extent than frame-level DVFS.
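A minimal sketch of this last-value tile predictor, with a simple frequency selection step, is given below (the data layout and the discrete frequency levels are our assumptions; the actual DVFS controller is developed in the following sections):

def predict_tile_workloads(prev_frame_tiles):
    # Last-value prediction: each tile's estimate for the coming frame is
    # its measured workload (in cycles) from the previous frame.
    return dict(prev_frame_tiles)

def frequency_for_tile(predicted_cycles, time_budget_s, freq_levels_hz):
    # Pick the lowest available frequency that still finishes the tile
    # within its share of the frame deadline.
    required = predicted_cycles / time_budget_s
    for f in sorted(freq_levels_hz):
        if f >= required:
            return f
    return max(freq_levels_hz)  # budget cannot be met; run flat out

# Example: a tile that took 120,000 cycles last frame, with a 0.5 ms budget,
# needs 240 MHz; the lowest level at or above that is 300 MHz.
print(frequency_for_tile(120_000, 0.5e-3, [100e6, 200e6, 300e6, 400e6]))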

In tiled architectures, the prediction accuracy can be further enhanced by information extracted from the geometry processing and tiling stages. Based on the state associated with each primitive of the frame, we can associate with each frame a parameter called rank, which helps in estimating whether the workload of the frame has increased, decreased, or is comparable to the workload of the previous frame. Similarly, each tile can also be associated with a rank that indicates the workload change of the current tile with respect to the workload of the same tile of the previous frame. Thus, frame rank and tile rank help us identify (i) whether the frame workload is likely to change substantially; and (ii) the tiles of the current frame whose workload is comparable to their corresponding workload in the previous frame. We now show how this information is used to design an efficient DVFS scheme.

Figure 5.7: Accuracy of tile history based prediction (% tiles vs. % error in prediction at 5%, 10%, 15%, and 20%, for UT2004, Doom3, and Quake)

5.2.1 Frame Rank Computation

In [38], the authors used primitive count and primitive area to assign each frame a unique signature which helps in estimating the pixel workload of the frame. From simulations of various games we find that frame workloads can vary to a great extent in spite of the primitive counts and total primitive areas of consecutive frames being similar. As a simple example, consider the frames shown in Figure 5.8. Though the same vehicle is displayed in both frames, leading to equal primitive count and area, the workload differs because the vehicle uses a different number of texture maps in each of the frames. We conclude that primitive count and area are not sufficient to estimate frame workload accurately, and define a new metric called Rank for capturing the pixel level workload of a frame. Rank is a 4-component vector represented as (Rg, Rp, Rt, Rr), capturing measures of the (i) geometry, (ii) pixel shading, (iii) texturing, and (iv) raster-operation workloads.

• Geometry Workload The workload in the geometry stage is due to vertex shading, clipping, and binning. If Nv is the instruction length of the vertex shader program, which usually has only datapath-oriented operations with no conditionals, and V is the total number of vertices in the frame, the total number of vertex shader instructions Wv in the frame is given as:

Wv = V × Nv    (5.1)

Figure 5.8: Consecutive frames of UT2004. (a) Frame 9; (b) Frame 10

Wv is indicative of the vertex shader workload. The workload due to clipping and binning is proportional to the primitive count P. Hence, the geometry workload is given as

Wg = Wv + P = V × Nv + P    (5.2)

• Pixel Shading Workload The length Np of the pixel shader program is taken as reflective of the pixel shading workload. We take the area A of the bounding box of the primitive as a rough estimate of the number of pixels generated by the primitive. Hence the measure of workload due to pixel shading Wp can be expressed as:

Wp = A × Np    (5.3)

• Texture Workload The workload due to texturing is captured by computing the number of texels that are to be read for each pixel. To compute the color of a pixel, texels from more than one texture could be read and blended. The number of texels that are filtered also varies with the kind of filtering used: nearest neighbor, linear, bi-linear, and tri-linear filtering require 1, 2, 4, and 8 texels respectively, while anisotropic filtering needs up to 16 texels or more depending on the degree of anisotropy. If each pixel is associated with M textures and needs t samples depending on the filtering mode, the measure of texture workload can be estimated as

Wt = A × M × t    (5.4)

Although in the graphics API, the depth test follows pixel shading and texturing, some architectures use the early-z test, where the z-test precedes pixel shading. By doing so, cycles spent on shading and texturing of occluded pixels can be avoided. However, this is possible only in some circumstances – the pixel shader must not change the pixels’ depth value and also the pixels must not be translucent. In cases when early z-test is enabled, we found from our experiments that approximately half of the pixels would be discarded at this stage and hence the pixel shading workload and texturing workload would be halved. Hence, when early-z is enabled,

Wp = 0.5 × A × Np  and  Wt = 0.5 × A × M × t    (5.5)

• Raster Operations Workload Raster operations mainly load the frame-buffer bandwidth. Each of the operations such as depth test, stencil test, and color blending involves reading from and writing to the frame-buffer memory. Hence, if R raster operations are enabled for a primitive, each pixel of the primitive would need 2 × R accesses to the frame-buffer. The workload due to raster operations Wr can thus be captured by the equation

Wr = A × 2 × R    (5.6)

The frame-rank is computed by component-wise accumulation of the ranks of each primitive in the frame.

Rg = Wg ,  Rp = Σ Wp ,  Rt = Σ Wt  and  Rr = Σ Wr    (5.7)

where the summations run over the primitives of the frame.

Since shading, texturing, and raster operations are performed in parallel, the speed of processing is determined by the slowest of the three components. Hence, an increase in any one of the components might result in a workload increase.

Through a component-wise comparison of the ranks of consecutive frames, we can quantitatively compare the workloads of the frames. This helps us scale the voltage and frequency for the current frame accordingly, resulting in better power-performance results. We use rank only to compare the workloads of consecutive frames, and not to estimate the workload of a frame as in [38]. From Figure 5.9, we see that frame rank based prediction has an almost negligible number of under-predictions compared to the frame signature based workload prediction scheme.
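The rank computation reduces to a per-primitive accumulation; the following Python sketch summarizes equations (5.1) through (5.7). The primitive record fields and the early_z flag are illustrative assumptions, since in our design these quantities come from driver state and tiling-stage counters.

```python
# A sketch of the per-primitive rank computation of equations (5.1)-(5.7).
# Primitive records are illustrative stand-ins for driver/tiling state.

def primitive_rank(area, Np, M, t, R, early_z):
    """Return (Wp, Wt, Wr) for one primitive with bounding-box area `area`."""
    scale = 0.5 if early_z else 1.0        # halving of eq. (5.5)
    Wp = scale * area * Np                 # pixel shading, eq. (5.3)
    Wt = scale * area * M * t              # texturing,     eq. (5.4)
    Wr = area * 2 * R                      # raster ops,    eq. (5.6)
    return Wp, Wt, Wr

def frame_rank(primitives, V, Nv):
    """Accumulate the rank vector (Rg, Rp, Rt, Rr) over a frame, eq. (5.7)."""
    Rg = V * Nv + len(primitives)          # geometry workload, eq. (5.2)
    Rp = Rt = Rr = 0.0
    for p in primitives:
        Wp, Wt, Wr = primitive_rank(p["area"], p["Np"], p["M"],
                                    p["t"], p["R"], p["early_z"])
        Rp += Wp
        Rt += Wt
        Rr += Wr
    return (Rg, Rp, Rt, Rr)
```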

Figure 5.9: Accuracy of frame rank based predictions (% under-predictions of primitive based rank and pixel based rank, for UT2004, Doom3, and Quake4)

5.2.2 Tile Rank Computation

Since the tiles are generated after the geometry processing stage, the tile rank is a 3-component vector comprising pixel shading, texturing, and raster operations components. Tile rank computation is very similar to frame rank computation: we accumulate the component-wise ranks of the primitives in the tile, the only difference being that we consider the area of overlap between the bounding box of the primitive and the tile as the measure of the number of pixels, as shown in Figure 5.10. The area of overlap can be obtained by a simple extension to the bounding box-tile overlap test (a sketch follows Figure 5.10). Our experiments show that tile rank is more efficient in identifying the workload change of the current tile than the tile history based prediction. From Figure 5.11 we see that less than 4% of the tiles suffer from under-predictions when tile rank based prediction is used.

Figure 5.10: Area of overlap as a measure of pixel count (a primitive's bounding box overlapping Tiles 1-4)
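A minimal sketch of the overlap computation of Figure 5.10, assuming axis-aligned rectangles in pixel coordinates (the coordinates used here are illustrative):

```python
# Rectangles are (x0, y0, x1, y1). A primitive contributes its overlap
# area with each tile, not its full bounding-box area, to the tile rank.

def overlap_area(bbox, tile):
    """Area of intersection of a primitive's bounding box with a tile."""
    x0 = max(bbox[0], tile[0])
    y0 = max(bbox[1], tile[1])
    x1 = min(bbox[2], tile[2])
    y1 = min(bbox[3], tile[3])
    return max(0, x1 - x0) * max(0, y1 - y0)

bbox = (20, 20, 44, 44)                # spans four 32x32 tiles
tile1 = (0, 0, 32, 32)
print(overlap_area(bbox, tile1))       # 144 of the 576-pixel bounding box
```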

5.2.3 Extraction of Ranks

To extract the geometry rank, we use a frame structure based technique in which the application traverses the scene graph to estimate the vertex shader instructions and primitive count. The application passes this information to the GPU driver, which uses it to scale the performance level of the geometry engine. Pixel count cannot be estimated at the application level, since the trivial rejection and back-face culling in the geometry stage can significantly impact the pixel count. Hence, we add counters to the tiling stage of the geometry pipeline for tracking the bounding box area of the primitives. We have modified the driver so that, each time it encounters a state change, it calculates the per-pixel rank that the current state would result in (the factors Np, M × t, and 2 × R). As a batch of primitives associated with the current state undergoes geometry processing, the bounding box areas and the overlap area per tile are accumulated. At the end of processing each batch, the driver multiplies the per-pixel rank coefficients with the accumulated areas to compute the frame and tile ranks for the current batch, as sketched below. The accumulated ranks of all the batches of the frame give the frame and tile ranks.
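The driver-side bookkeeping just described can be sketched as follows; the class and method names are illustrative, not the actual driver interface.

```python
# A sketch of the per-batch rank accumulation: on each state change the
# per-pixel coefficients are computed once, then multiplied with the
# accumulated areas when a batch finishes geometry processing.

class RankAccumulator:
    def __init__(self, num_tiles):
        self.coeff = (0.0, 0.0, 0.0)                 # (Np, M*t, 2*R)
        self.frame_rank = [0.0, 0.0, 0.0]            # (Rp, Rt, Rr)
        self.tile_rank = [[0.0, 0.0, 0.0] for _ in range(num_tiles)]

    def on_state_change(self, Np, M, t, R):
        # Per-pixel rank coefficients for the new state.
        self.coeff = (Np, M * t, 2 * R)

    def on_batch_done(self, frame_area, tile_areas):
        # frame_area: accumulated bounding-box area of the batch;
        # tile_areas: accumulated overlap area per tile, from the
        # counters added to the tiling stage.
        for i, c in enumerate(self.coeff):
            self.frame_rank[i] += c * frame_area
        for tid, area in enumerate(tile_areas):
            for i, c in enumerate(self.coeff):
                self.tile_rank[tid][i] += c * area
```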

5.3 DVFS Scheme

In this section we present a simple tile history based DVFS scheme, followed by a more sophisticated tile-rank based DVFS strategy for the pixel processor.

Figure 5.11: Accuracy of tile rank based prediction (% under-predictions for UT2004, Doom3, and Quake)

5.3.1 Tile History Based DVFS

The idea behind the tile history based DVFS scheme is depicted in Figure 5.12. The advantage of a tile based DVFS scheme is that, at the end of processing each tile, we can check for the occurrence of a mis-prediction and take immediate corrective action. For instance, in the scenario presented in Figure 5.12(b), we see that Tile 1 exceeds its deadline. Hence, the frequency of Tile 2 is increased so that the combined deadline is met. In the scenario presented in Figure 5.12(c), we find that processing of Tile 1 completes well before its deadline. Hence, Tile 2 uses up this slack by running at a lower frequency and thus saves overall energy.

Algorithm 8 details the tile history based DVFS scheme. In this scheme, we start with the assumption that the current frame's workload is equal to the workload of the previous frame. Hence, if W is the workload of the previous frame (measured in the number of clock cycles) and Tf is the time (in clock cycles) allotted for processing the frame, the frequency for the current frame is calculated as (line 1 of Algorithm 8)

F = Fmax × W / Tf    (5.8)


Figure 5.12: Tile level DVFS. (a) Predicted deadlines for Tiles 1 and 2; (b) Tile 1 under-predicted, hence Tile 2 is accelerated; (c) Tile 1 over-predicted, hence Tile 2 is slowed down

If the geometry stage takes time Tg for completion, the remaining time is divided into (i) pixel processing time Tp and (ii) transfer time Tt of the color and depth buffers to the off-chip frame buffer (line 2 of Algorithm 8). If the predicted pixel processing workload of the frame (i.e., the observed workload of the previous frame) is Wp, the frequency at which the frame is to be processed is given by (line 3 of Algorithm 8)

F = Fmax × Wp / Tp    (5.9)

If frame-level DVFS is adopted, all tiles of the frame would be operated at frequency F. The advantage of operating at tile-level is that at the end of processing each tile, we can check for prediction accuracy and take corrective action by either increasing or decreasing the frequency. We have observed that when the workloads of consecutive frames match, the workloads of the constituent tile pairs of the frames are also comparable. Hence, we can initialize the deadlines for processing each tile of the frame by assuming that the tile workload is equal to its workload in the previous frame. If Wp(ti) is the predicted workload of tile ti, then its deadline D(ti) is given by the equation (lines 5 and 6 of Algorithm 8):

D(ti) = Wp(ti) / F    (5.10)

If the actual processing time of the tile is observed to be P(ti), the slack S(ti) is given by:

S(ti) = D(ti) − P(ti)    (5.11)

This slack can be compensated for by distributing it among the remaining tiles. If W(ti) is the observed workload of the current tile, and W′p is the predicted workload of the remaining tiles, the adjusted frequency with which the next tile is to be processed is re-computed to account for the slack as shown below:

Fnew = Fmax × W′p / (Tp − P(ti))    (5.12)
     = Fmax × W′p / (Tp − (D(ti) − S(ti)))    (5.13)

Since Tp − D(ti) is the predicted processing time for the remaining tiles, we can express the frequency for the remaining tiles as:

Fnew = Fmax × (Predicted workload) / (Predicted Processing Time + Slack)    (5.14)

Positive slack means over-prediction of the workload of the ith tile. The surplus time allotted to this tile is thus used by the remaining tiles by effectively decreasing their frequency. Similarly, negative slack means under-prediction and is compensated by accelerating the remaining tiles. Thus, mis-predictions can be corrected at every tile boundary so as to minimize their effect. We can complete the calculation of the frequency for the next tile while the contents of the tile buffer are transferred to the frame buffer, and hence the computation of the operating point does not cause any extra overhead.

5.3.2 Tile Rank Based DVFS

The problem with the above tile history based DVFS is that a tile could miss its deadline by such a magnitude that it is not possible to recover the lost time even by running all the future tiles at maximum frequency. We use the frame and tile ranks discussed in the above section to efficiently counter such cases. In this scheme, the geometry component of the current frame's rank is compared with that of the previous frame's to check for frame correlation. If the ranks are comparable, we take the workload of the previous frame as a safe estimate of the workload of the current frame. If the ranks vary by more than a threshold level (either an increase or a decrease), it indicates that the frame content is too dissimilar, and hence we start processing the frame at Fmax. In our experiments, we empirically fixed this threshold at 10% of the geometry workload. Though this might be an over-estimate, we are assured that the frame will not miss the deadline. Once the geometry processing is completed, we use the other three components of the frame rank and the tile ranks to determine the frequency at which the tiles are processed. Thus, the slack caused by an overestimated geometry workload can still be reclaimed. On the basis of tile-rank, we can divide the tiles into two groups:

• Light Tiles: Tiles for which the workload is expected to be less than or equal to their workload in the previous frame (Rank(ti) <= Rank(ti−1)).

• Heavy Tiles: Tiles whose workload is expected to be greater than their workload in the previous frame (Rank(ti) > Rank(ti−1)).

We can safely assume that the workload of each of the light tiles is equal to its workload in the previous frame, but the workload of the heavy tiles is unpredictable. The tile rank based DVFS scheme is outlined in Algorithm 9. From a comparison of the rank of the current frame with that of the previous one, we can determine whether the processing speed of the frame has to be increased or decreased. For each of these two cases, the operating point is managed as described below:

Case 1: Rank(Fi) > Rank(Fi−1). In this case, the workload of the current frame is expected to be greater than the workload of the previous frame. Since the increased workload is contributed by the heavy tiles, we process the heavy tiles at maximum frequency. If the total time required for processing the heavy tiles is Th, the remaining time Tl = Tp − Th is left for processing the light tiles. We can employ tile-history based DVFS to determine the minimum frequency at which each light tile is to be operated so that the processing of the tiles completes just within the allotted time Tl (lines 7 and 8 of Algorithm 9).

Case 2: Rank(Fi) <= Rank(Fi−1). In this case, since the current frame workload is expected not to exceed the workload of the previous frame, the workload of the present frame can be safely approximated as the workload of the previous frame. Hence, we process the heavy tiles at the frequency F given by the equation (line 10 of Algorithm 9)

F = Fmax × W / Tp    (5.16)

Once the heavy tiles are processed, we can employ the tile history based DVFS scheme for the light tiles, since their workload is predictable (lines 11 and 12 of Algorithm 9).

Implementation of this DVFS technique incurs an overhead of five additions and multiplications per primitive for workload estimation and calculation of frequency. This amounts to a negligible computational overhead of less than 0.008%. Storing tile ranks requires an additional memory of 32 (# of tiles) × 3 (rank components) × 4 bytes, which equals 384 bytes. This amounts to a storage overhead of 0.0075% for a system with 512MB memory. Since the frequency of operation for the next tile can be computed immediately after the pixel processing of the current tile completes, the switching delay can be hidden behind the time required to transfer the contents of the local color buffer to the framebuffer.

Algorithm 8 Tile History Based DVFS
Input: Workload of the previous frame W, pixel processing workload of the previous frame Wp, and predicted workloads of the tiles Wp(ti). The maximum supported frequency is Fmax and the time allotted for the frame is Tf.
Output: Per-tile voltage and frequency
1: Compute the average frequency for the current frame: F ← Fmax × W/Tf
2: If geometry takes time Tg, the time left for pixel processing is Tp ← Tf − Tg
3: The average frequency for processing the tiles is F ← Fmax × Wp/Tp
4: for all tiles ti do
5:    Process tile ti at frequency F and note the tile processing time P(ti) and the workload of the tile W(ti)
6:    If W′p is the predicted workload of the remaining tiles, compute the frequency for the remaining tiles as Tp ← Tp − P(ti) and F ← Fmax × W′p/Tp
7: end for
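A minimal Python sketch of Algorithm 8 follows. It assumes an idealized continuous frequency range clipped at Fmax, whereas a real controller would round up to the nearest supported discrete level; process_tile is a stand-in for the pixel pipeline, and workloads and times are expressed in Fmax-cycles as in the text.

```python
# A sketch of Algorithm 8 (tile history based DVFS); not the simulator code.

def tile_history_dvfs(W_tiles_prev, Fmax, Tp, process_tile):
    """W_tiles_prev: per-tile workloads observed in the previous frame.
    Tp: time left for pixel processing after the geometry stage.
    process_tile(i, F) runs tile i at frequency F and returns the time
    it actually took. Returns the list of per-tile frequencies used."""
    Wp = sum(W_tiles_prev)                    # predicted pixel workload
    F = min(Fmax, Fmax * Wp / Tp)             # eq. (5.9)
    freqs = []
    W_remaining = Wp
    for i, w in enumerate(W_tiles_prev):
        freqs.append(F)
        P_ti = process_tile(i, F)             # observed processing time
        Tp -= P_ti                            # time left for later tiles
        W_remaining -= w                      # predicted remaining workload
        if Tp > 0 and W_remaining > 0:
            # Recompute, absorbing positive or negative slack: eq. (5.12).
            F = min(Fmax, Fmax * W_remaining / Tp)
    return freqs
```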

5.4 Experiments and Results

To evaluate the efficiency of our proposed DVFS schemes, we have modified the ATTILA simulation framework [68] to emulate tiled mode rendering. In the ATTILA simulation framework, the GL Interceptor records the graphics API calls when a high level graphics application is executed on the CPU. These API calls are converted to simulator commands by the gl2attila driver. We have modified the driver and the architecture modeled in the simulator to emulate a tiled rendering graphics processor. We have chosen a tile size of 32 × 32 for our experiments, based on published results on optimum tile size [69]. We have instrumented the simulator to extract traces of the frame rank, tile ranks, and the execution times of the tiles.

Algorithm 9 Tile Rank Based DVFS
Input: Frame ranks and workloads for the current and previous frames, workloads of heavy and light tiles
Output: Frequency for each tile
1: if (Abs(Rg(i) − Rg(i−1)) < 0.1 × Rg(i)) then
2:    F ← Fmax × W/Tf
3: else
4:    F ← Fmax
5: end if
6: If geometry processing takes time Tg, the time left for pixel processing is Tp ← Tf − Tg
7: if (Rp(i) > Rp(i−1) or Rt(i) > Rt(i−1) or Rr(i) > Rr(i−1)) then
8:    Process heavy tiles at frequency F ← Fmax
9: else
10:   Process heavy tiles at frequency F ← Fmax × Wp/Tp
11: end if
12: If the heavy tiles take a processing time of TH, the time left for the light tiles is TL ← TD − TH
13: Process the light tiles within time TL using the tile history based DVFS scheme
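A companion Python sketch of the pixel-processing part of Algorithm 9 is given below; it reuses the tile_history_dvfs controller sketched after Algorithm 8, and all names are illustrative.

```python
# A sketch of Algorithm 9 (tile rank based DVFS) for the pixel processor.
# The geometry-stage frequency selection (lines 1-6) is assumed to have
# already produced the pixel-processing budget Tp (in Fmax-cycles).
# Rank vectors here are (Rp, Rt, Rr).

def tile_rank_dvfs(rank_prev, rank_curr, heavy_tiles, light_workloads,
                   Wp, Fmax, Tp, process_tile):
    """heavy_tiles: ids of tiles whose rank increased over the previous
    frame; light_workloads: predicted workloads of the remaining tiles.
    Wp: the previous frame's pixel workload. Assumes TL stays positive."""
    frame_heavier = any(c > p for c, p in zip(rank_curr, rank_prev))
    # Heavy tiles have unpredictable workload: run them at Fmax when the
    # frame rank increased, else at the previous frame's average frequency.
    F_heavy = Fmax if frame_heavier else min(Fmax, Fmax * Wp / Tp)
    TH = sum(process_tile(i, F_heavy) for i in heavy_tiles)
    TL = Tp - TH                      # time left for the predictable tiles
    # Light tiles are predictable, so the history based scheme fills TL.
    return F_heavy, tile_history_dvfs(light_workloads, Fmax, TL, process_tile)
```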

These traces were collected for a few hundred frames of games such as UT2004, Quake4, and Doom3, and they form the inputs to the estimation tool that models the proposed tile level DVFS schemes as well as the existing frame level schemes. We have generated experimental results for processors supporting 5, 8, and 11 voltage and frequency levels for DVFS. We have generated results for target output frame rates of 30 and 60 fps, to observe the efficiency of the schemes in two contexts: (i) utilizing the slack in the case of a lighter frame rate requirement, and (ii) the ability to meet deadlines in the case of a higher frame rate requirement. The comparison of performance and energy results presented below clearly establishes the efficiency of our proposed tile-level DVFS schemes over the frame level schemes.

5.4.1 Performance Impact

Figure 5.13 shows the percentage deadline misses of the different DVFS schemes (ND: No DVFS; FH: Frame History based; FS: Frame Signature based; TH: Tile History based; TR: Tile Rank based) for UT2004 when the target frame rate is 60 fps. We observe that the frame history based DVFS scheme encounters the maximum number of deadline misses.

Figure 5.13: Deadline misses at a target frame rate of 60 fps (% deadline misses of the No DVFS, Frame History, Frame Sign, Tile History, and Tile Rank schemes at 5, 8, and 11 levels)

Signature based frame workload prediction works better than history based prediction, but still leads to significant deadline misses, indicating that the frame signature is also not able to capture the workload accurately. The ability to take early corrective measures makes our tile history based DVFS scheme work better than both frame level DVFS schemes. The tile rank based DVFS scheme works better than the tile history based DVFS since the number of under-predictions is minimized. We also observe that as the number of DVFS levels increases, the number of deadline misses of the frame level schemes increases. The reason for this can be understood with the following small example. Let the workload of the frame be 0.8 times the maximum workload and the predicted workload be 0.6 times the maximum workload. In a five level DVFS scheme, a frequency of 3 × Fmax/5 would be chosen, leading to a deadline miss. However, in a 2-level DVFS scheme, a frequency of Fmax would be chosen, saving the frame from missing the deadline but resulting in an under-utilization of slack. Our tile level schemes handle both these situations with higher efficiency; this quantization effect is illustrated in the sketch below. The impact of deadline misses on the achieved frame-rate is shown in Figure 5.14. We observe that the tile level schemes achieve a frame rate that is nearly equal to the required frame rate, whereas the frame level DVFS schemes suffer a significant drop in frame rate. From Table 5.1, we notice that the history based schemes can have a drop of up to 10 FPS, whereas the tile rank scheme has a maximum drop of 1 FPS.
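The quantization effect of the example above can be made concrete with a small sketch (frequencies and workloads as fractions of the maximum; the pick_level helper is illustrative):

```python
# With predicted workload 0.6 and actual workload 0.8, a 5-level scheme
# picks 3*Fmax/5 = 0.6*Fmax and misses the deadline, while a 2-level
# scheme is forced up to Fmax and meets it, wasting slack.

def pick_level(required, levels):
    """Smallest available frequency >= the required frequency."""
    return min(f for f in levels if f >= required)

Fmax = 1.0
five_levels = [Fmax * k / 5 for k in range(1, 6)]   # 0.2, 0.4, ..., 1.0
two_levels = [Fmax / 2, Fmax]

predicted, actual = 0.6, 0.8
for levels in (five_levels, two_levels):
    F = pick_level(predicted * Fmax, levels)
    misses = F < actual * Fmax          # too slow for the real workload
    print(levels, F, "miss" if misses else "meets deadline")
```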

Figure 5.14: Frame rate at a target of 60 fps (achieved FPS of the five schemes at 5, 8, and 11 levels)

5.4.2 Energy Saving

Figure 5.15: Normalized energy at 60 fps (for the No DVFS, Frame History, Frame Sign, Tile History, and Tile Rank schemes at 5, 8, and 11 levels)

Frame level DVFS schemes are expected to be less efficient than the tile-level schemes in conserving energy. This is because, for DVFS, we need to choose from the available discrete voltage and frequency levels. Since the required frequency F most often lies between two discrete levels F1 and F2 (F1 < F < F2), the frame would ideally need to be processed at F1 and F2 for time durations t1 and t2, such that

F1 × t1 + F2 × t2 = F × TD    (5.17)

t1 + t2 = TD    (5.18)
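Solving these two equations for the ideal split is straightforward, as the following small sketch (with illustrative frequency values) shows.

```python
# Solve equations (5.17) and (5.18) for the ideal two-level split: run at
# F1 for t1 and at F2 for t2 so the average frequency over TD equals F.

def two_level_split(F, F1, F2, TD):
    """Return (t1, t2) with F1*t1 + F2*t2 = F*TD and t1 + t2 = TD."""
    t2 = (F - F1) * TD / (F2 - F1)
    return TD - t2, t2

# e.g. F = 0.7*Fmax between the levels F1 = 0.6*Fmax and F2 = 0.8*Fmax
t1, t2 = two_level_split(0.7, 0.6, 0.8, TD=1.0)
print(t1, t2)                          # (0.5, 0.5): half the time at each
```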

However, the above scheme results in an unacceptable number of frames missing their deadlines, often due to small under-predictions of the workload. Our experiments show that more than 40% of the frames miss deadlines with this approach. However, in tile based DVFS schemes, the slack obtained by processing a tile at F2 can be used up by future tiles by slowing them down. Effectively, we can process the frame at an average frequency that is closer to F. Moreover, tile based schemes can also use up the slack caused by over-prediction of the workload. This can be observed from the results for frames processed at 60 fps, shown in Figure 5.15. However, from the results for frames processed at 30 fps, shown in Figure 5.16, we observe that the tile rank based DVFS scheme consumes slightly more energy than the history based scheme. This is not because the history based schemes are more energy efficient.

Figure 5.16: Normalized energy at 30 fps (ND, FH, FS, TH, and TR schemes at 5, 8, and 11 levels)

Rather, workload under-predictions in the history based schemes result in the choice of a lower than required frequency, leading to deadline misses. For a fair comparison of the energy efficiency of the schemes, we compare the Normalized Energy per Normalized Frame-Rate, which is similar to the popular Energy-Delay product used to jointly compare system performance and energy. We observe from the results shown in Figure 5.17 that the Energy per Frame-Rate is minimum for our tile-rank based DVFS scheme, making it the best choice. The tile-history based scheme is also found to fare better than the frame-level DVFS schemes.

Figure 5.17: Energy per Frame-Rate at 30 fps (for the No DVFS, Frame History, Frame Sign, Tile History, and Tile Rank schemes at 5, 8, and 11 levels)

The results for the other applications, at frame rate requirements of both 30 FPS and 60 FPS, are presented in Tables 5.1 and 5.2. From the results presented in the tables we can conclude that, across different applications, the tile based DVFS schemes are consistently better than the frame level schemes, and the energy-performance behaviour of the tile rank based scheme is superior to the rest of the schemes.

5.5 Summary

In this chapter we have seen that consecutive frames in 3D graphics applications typically show a high degree of correlation, as a result of which the workload of consecutive frames is often comparable. We also saw that the workloads of the individual tiles of consecutive frames exhibit high levels of correlation. We observe that even in cases when the frame-level workloads differ, a large number of tiles still have a workload that is comparable to their workload in the previous frame.

Table 5.1: DVFS results at 60 FPS

                           5 Levels            8 Levels            11 Levels
                        UT    Q4    D3      UT    Q4    D3      UT    Q4    D3
Deadline Misses
  ND                     0     2     3       0     2     3       0     2     3
  FH                    12     9    12      16    10    17      24    16    19
  FS                     8     9    10       9     9    12      11    12    12
  TH                     3     4     8       3     5     8       3     9     7
  TR                     1     4     5       1     4     5       1     4     5
Frame Rate
  ND                  60.0  58.8  58.2    60.0  58.8  58.2    60.0  58.8  58.2
  FH                  53.6  55.0  53.6    51.7  54.5  51.3    48.4  51.7  50.4
  FS                  55.5  55.0  55.5    55.0  55.0  52.0    54.0  53.6  53.5
  TH                  58.2  57.7  55.5    58.2  57.1  55.5    58.2  55.0  56.1
  TR                  59.4  57.7  57.1    59.4  57.7  57.1    59.4  57.7  57.1
Normalised Energy
  ND                   1.0   1.0   1.0     1.0   1.0   1.0     1.0   1.0   1.0
  FH                   0.6   0.9   0.9     0.5   0.9   0.9     0.5   0.8   0.9
  FS                   0.6   0.9   1.0     0.5   1.0   0.9     0.5   0.9   0.9
  TH                   0.5   0.9   0.9     0.6   0.9   0.9     0.6   0.8   0.9
  TR                   0.5   0.8   0.9     0.5   0.8   0.9     0.5   0.8   0.9
Normalised Energy/Frame-Rate
  ND                  1.00  1.02  1.03    1.00  1.02  1.03    1.00  1.02  1.03
  FH                  0.67  1.02  1.03    0.60  1.01  1.05    0.59  0.94  1.03
  FS                  0.63  1.03  1.03    0.57  1.04  1.07    0.56  0.99  1.02
  TH                  0.54  0.94  0.98    0.58  0.94  0.97    0.57  0.90  0.99
  TR                  0.52  0.85  0.91    0.50  0.83  0.91    0.49  0.82  0.91

Based on these observations, and by exploiting the fact that tiled architectures provide access to additional parameters, we proposed an accurate workload prediction scheme for tiled graphics architectures. We also proposed tile-history based and tile-rank based DVFS schemes that are more efficient than frame-level DVFS schemes published previously. In an eight level DVFS system, our tile-history based DVFS scheme resulted in a 60% improvement in quality (deadline misses) over the frame history based DVFS schemes and in 58% energy savings. The more sophisticated tile-rank based scheme achieved a 75% improvement in quality and resulted in a 58% saving in energy. We also saw that, unlike the frame-level schemes, the quality of our schemes does not suffer with an increasing number of frequency levels.

Table 5.2: DVFS results at 30 FPS

                           5 Levels            8 Levels            11 Levels
                        UT    Q4    D3      UT    Q4    D3      UT    Q4    D3
Deadline Misses
  ND                     0     0     0       0     0     0       0     0     0
  FH                    15     7     9      12    11    13      17    14    15
  FS                     7     7     6       8     7     9       7    12    10
  TH                     3     4     2       4     6     2       3     7     4
  TR                     0     1     1       0     1     1       0     1     2
Frame Rate
  ND                  30.0  30.0  30.0    30.0  30.0  30.0    30.0  30.0  30.0
  FH                  26.1  28.0  27.5    26.8  27.0  26.5    25.6  26.3  26.1
  FS                  28.0  28.0  28.3    27.8  28.0  27.5    28.0  26.8  27.3
  TH                  29.1  28.8  29.4    28.8  28.3  29.4    29.1  28.0  28.8
  TR                  30.0  29.7  29.7    30.0  29.7  29.7    30.0  29.7  29.4
Normalised Energy
  ND                  1.00  1.00  1.00    1.00  1.00  1.00    1.00  1.00  1.00
  FH                  0.30  0.50  0.49    0.30  0.42  0.58    0.28  0.45  0.50
  FS                  0.34  0.54  0.54    0.27  0.47  0.60    0.29  0.50  0.53
  TH                  0.34  0.52  0.52    0.31  0.47  0.51    0.31  0.47  0.49
  TR                  0.32  0.51  0.50    0.31  0.46  0.49    0.30  0.46  0.48
Normalised Energy/Frame-Rate
  ND                  1.00  1.00  1.00    1.00  1.00  1.00    1.00  1.00  1.00
  FH                  0.35  0.54  0.53    0.34  0.47  0.65    0.32  0.51  0.57
  FS                  0.37  0.58  0.57    0.29  0.51  0.66    0.31  0.56  0.58
  TH                  0.35  0.54  0.53    0.32  0.50  0.52    0.31  0.51  0.51
  TR                  0.32  0.52  0.51    0.31  0.47  0.49    0.30  0.47  0.48

The Energy per Frame-rate for our schemes is minimum, indicating that our schemes deliver the best performance-energy results.

Chapter 6

Conclusion and Future Work

Advancements in the semiconductor industry have packed enough computational capacity onto very small form-factor dies that it is now possible to port complex applications to mobile and handheld devices. This has increased the power consumption and power density of these devices to such an extent that they are limited by the battery capacity and cooling capacity of current technology. As a consequence, power has become one of the major constraints in the design of any semiconductor device. Since the graphics processor is becoming one of the most power-consuming semiconductor components in today's mobile and handheld devices, this thesis is directed towards the design of a low power graphics processor architecture. In this thesis we have identified the texture memory and the geometry engine to be significant contributors to the power consumption of a typical GPU, and have proposed several low power optimizations targeting them. We have also observed that graphics applications exhibit variation in workload; hence there is scope for system level power optimization by dynamic voltage and frequency scaling.

6.1 Summary and conclusion

The main contributions of the thesis are summarized below:

• In addition to high spatial and temporal locality, texture mapping in the graphics rendering pipeline also exhibits considerable predictability with respect to the memory access pattern. We have utilized these properties of texture mapping to build a low power memory architecture for texture mapping. By buffering blocks of texture in registers, we have replaced the high power cache lookups with low power register reads. We have proposed a smart lookup mechanism to maximize the hits into the registers with a small number of arithmetic operations. Our architecture performs 75% better than existing texture cache architectures in a pipeline with a single texturing unit and consumes 85% less energy in parallel texturing. We also demonstrate about a 7% reduction in miss rate over the partitioned cache for multitexturing. Thus, for pixel texturing, we claim that our proposed architecture consumes lower energy than those reported in the literature, with no performance overhead and insignificant area overhead. We also demonstrate that our TFM architecture can be exploited to save up to 80% of leakage power with a negligible delay overhead of 1%.

• In contrast to older generation games, which were predominantly pixel processing intensive, newer ones aim towards achieving greater realism by incorporating more and more geometry, thus increasing the load on vertex processing. This is a result of the increasing number of primitives per frame and the increasing size of vertex shader programs. Based on the observation that about 50% of the primitives are trivially rejected in each frame, we proposed a mechanism for partitioning the vertex shader and deferring the position-invariant part of the shader to the post-trivial-reject stage of the pipeline. We also identify that such partitioning can have a negative impact on performance in some cases, and hence propose an adaptive partitioning scheme that applies partitioning only to scenarios that would result in benefits.

From the experiments on the ATTILA framework, we observe up to 50% saving in shader instructions due to vertex shader partitioning, leading to a 15% speed-up and up to 56% energy saving on computations in the geometry engine.

• We have seen that consecutive frames in 3D graphics applications typically show a high degree of correlation, as a result of which the workload of consecutive frames is often comparable. We also saw that the workloads of the individual tiles of consecutive frames exhibit high levels of correlation. We observe that even in cases when the frame-level workloads differ, a large number of tiles still have a workload that is comparable to their workload in the previous frame. Based on these observations, and by exploiting the fact that tiled architectures provide access to additional parameters, we proposed an accurate workload prediction scheme for tiled graphics architectures. We also proposed tile-history based and tile-rank based DVFS schemes that are more efficient than frame-level DVFS schemes published previously. In an eight level DVFS system, our tile-history based DVFS scheme resulted in a 60% improvement in quality (deadline misses) over the frame history based DVFS schemes and in 58% energy savings. The more sophisticated tile-rank based scheme achieved a 75% improvement in quality and resulted in a 58% saving in energy. We also saw that, unlike the frame-level schemes, the quality of our schemes does not suffer with an increasing number of frequency levels. The Energy per Frame-rate for our schemes is minimum, indicating that our schemes deliver the best performance-energy results.

6.2 Future Work

The possible future extensions to the work are suggested below:

• Currently our texture memory architecture supports only pixel texturing. The vertex texturing phase [29] cannot take advantage of our proposed architecture, as it does not exhibit similar spatial locality. The architecture needs to be adapted to reduce power for vertex texturing as well.

• The proposed vertex shader partitioning algorithm can be extended to analyze shaders with dynamic branch control. It is also possible to compute a power based cost metric and base the partitioning decision on it.

• The proposed rank based workload estimation technique can be utilized for energy-efficient thread scheduling in a multicore implementation of a low power pixel processor. The possibility of devising an efficient rank based dynamic load balancing technique for high performance tiled rendering architectures can be explored.

Bibliography

[1] http://www.nVidia.com/.

[2] PC energy-efficiency trends and technologies. http://www.intel.com/.

[3] Hugues Hoppe. Optimization of mesh locality for transparent vertex caching. In SIGGRAPH '99: Proceedings of the 26th annual conference on Computer graphics and interactive techniques, pages 269–276, New York, NY, USA, 1999. ACM Press/Addison-Wesley Publishing Co.

[4] Kyusik Chung, Chang-Hyo Yu, and Lee-Sup Kim. Vertex cache of programmable geometry processor for mobile multimedia application. In Circuits and Systems, 2006. ISCAS 2006. Proceedings. 2006 IEEE International Symposium on, 2006.

[5] Joel McCormack and Robert McNamara. Tiled polygon traversal using half-plane edge functions. In HWWS '00: Proceedings of the ACM SIGGRAPH/EUROGRAPHICS workshop on Graphics hardware, pages 15–21, New York, NY, USA, 2000. ACM.

[6] Woo-Chen Park, Kil-Whan Lee, Il-San Kim, Tack-Don Han, and Sung-Bong Yang. A mid-texturing pixel rasterization pipeline architecture for 3D rendering processors. In Application-Specific Systems, Architectures and Processors, 2002. Proceedings. The IEEE International Conference on, pages 173–182, 2002.

[7] Ned Greene, Michael Kass, and Gavin Miller. Hierarchical Z-buffer visibility. In SIGGRAPH ’93: Proceedings of the 20th annual conference on Computer graphics and interactive techniques, pages 231–238, New York, NY, USA, 1993. ACM.


[8] Cheng-Hsien Chen and Chen-Yi Lee. Two-level hierarchical Z-buffer for 3D graphics hardware. In Circuits and Systems, 2002. ISCAS 2002. IEEE International Symposium on, volume 2, pages II-253–II-256, 2002.

[9] Victor Moya Del Barrio, Carlos González, Jordi Roca, Agustin Fernández, and Roger Espasa. A single (unified) shader GPU microarchitecture for embedded systems. In HiPEAC, pages 286–301, 2005.

[10] Henry Fuchs, John Poulton, John Eyles, Trey Greer, Jack Goldfeather, David Ellsworth, Steve Molnar, Greg Turk, Brice Tebbs, and Laura Israel. Pixel-planes 5: a heterogeneous multiprocessor graphics system using processor-enhanced memories. SIGGRAPH Comput. Graph., 23(3):79–88, 1989.

[11] Jeff Andrews and Nick Baker. Xbox 360 system architecture. IEEE Micro, 26(2):25–37, 2006.

[12] Jay Torborg and James T. Kajiya. Talisman: commodity realtime 3d graphics for the pc. In SIGGRAPH ’96: Proceedings of the 23rd annual conference on Computer graphics and interactive techniques, pages 353–363, New York, NY, USA, 1996. ACM.

[13] Larry Seiler, Doug Carmean, Eric Sprangle, Tom Forsyth, Michael Abrash, Pradeep Dubey, Stephen Junkins, Adam Lake, Jeremy Sugerman, Robert Cavin, Roger Espasa, Ed Grochowski, Toni Juan, and Pat Hanrahan. Larrabee: a many-core x86 architecture for visual computing. ACM Trans. Graph., 27(3):1–15, 2008.

[14] http://www.arm.com/products/multimedia/mali-graphics-hardware/-200.php.

[15] PowerVR MBX technology overview. http://www.arm.com/products/multimedia/mali-graphics-hardware/mali-200.php.

[16] Tolga Capin, Kari Pulli, and Tomas Akenine-Möller. The state of the art in mobile graphics research. IEEE Comput. Graph. Appl., 28(4):74–84, 2008.

[17] J. W. Sheaffer, D. Luebke, and K. Skadron. A flexible simulation framework for graphics architectures. In HWWS '04: Proceedings of the ACM SIGGRAPH/EUROGRAPHICS conference on Graphics hardware, pages 85–94, New York, NY, USA, 2004. ACM.

[18] http://www.anandtech.com/.

[19] Ju-Ho Sohn, Ramchan Woo, and Hoi-Jun Yoo. A programmable vertex shader with fixed-point datapath for low power wireless applications. In HWWS '04: Proceedings of the ACM SIGGRAPH/EUROGRAPHICS conference on Graphics hardware, pages 107–114, New York, NY, USA, 2004. ACM.

[20] Ju-Ho Sohn, Yong-Ha Park, Chi-Weon Yoon, R. Woo, Se-Jeong Park, and Hoi-Jun Yoo. Low-power 3D graphics processors for mobile terminals. Communications Magazine, IEEE, 43(12):90–99, Dec. 2005.

[21] K.P. Acken, M.J. Irwin, R.M. Owens, and A.K. Garga. Architectural optimizations for a floating point multiply-accumulate unit in a graphics pipeline. In ASAP, page 65, 1996.

[22] Po-Han Wang, Yen-Ming Chen, Chia-Lin Yang, and Yu-Jung Cheng. A predictive shutdown technique for GPU shader processors. IEEE Computer Architecture Letters, 8:9–12, 2009.

[23] Ziyad S. Hakura and Anoop Gupta. The design and analysis of a cache architecture for texture mapping. SIGARCH Comput. Archit. News, 25(2):108–120, 1997.

[24] Michael Cox, Narendra Bhandri, and Michael Shantz. Multi-level texture caching for 3d graphics hardware. In ISCA, pages 86–97, 1998.

[25] Homan Igehy, Matthew Eldridge, and Pat Hanrahan. Parallel texture caching. In HWWS ’99: Proceedings of the ACM SIGGRAPH/EUROGRAPHICS workshop on Graphics hardware, pages 95–106, New York, NY, USA, 1999. ACM.

[26] C.J. Choi, G.H. Park, J.H. Lee, W.C. Park, and T.D. Han. Performance comparison of various cache systems for texture mapping. The Fourth International Conference/Exhibition on High Performance Computing in the Asia-Pacific Region, 2000. Proceedings., 1:374–379, 2000.

[27] I. Antochi, B.H.H. Juurlink, A. G. M. Cilio, and P. Liuha. Trading efficiency for energy in a texture cache architecture. In Proceedings of the 2002 Euromicro conference on Massively-parallel systems, pages 189–196, April 2002.

[28] Konstantine I. Iourcha, Krishna S. Nayak, and Zhou Hong. System and method for fixed-rate block-based image compression with inferred pixel values. Patent 5956431, September 1999. http://www.freepatentsonline.com/5956431.html.

[29] Tomas Möller and Eric Haines. Real-time rendering. A. K. Peters, Ltd., Natick, MA, USA, 1999.

[30] Jon Hasselgren and Tomas Akenine-Möller. Efficient depth buffer compression. In GH '06: Proceedings of the 21st ACM SIGGRAPH/EUROGRAPHICS symposium on Graphics hardware, pages 103–110, New York, NY, USA, 2006. ACM.

[31] Jim Rasmusson, Jacob Ström, and Tomas Akenine-Möller. Error-bounded lossy compression of floating-point color buffers using quadtree decomposition. Vis. Comput., 26(1):17–30, 2009.

[32] Mark Weiser, Brent Welch, Alan Demers, and Scott Shenker. Scheduling for reduced CPU energy. In OSDI ’94: Proceedings of the 1st USENIX conference on Operating Systems Design and Implementation, page 2, Berkeley, CA, USA, 1994. USENIX Association.

[33] Kinshuk Govil, Edwin Chan, and Hal Wasserman. Comparing algorithms for dynamic speed-setting of a low-power CPU. In MobiCom '95: Proceedings of the 1st annual international conference on Mobile computing and networking, pages 13–25, New York, NY, USA, 1995. ACM.

[34] Xiaotao Liu, Prashant Shenoy, and Mark D. Corner. Chameleon: Application-level power management. IEEE Transactions on Mobile Computing, 7:995–1010, 2007.

[35] Yan Gu, S. Chakraborty, and Wei Tsang Ooi. Games are up for DVFS. In Design Automation Conference, 2006 43rd ACM/IEEE, pages 598–603, 2006.

[36] Yan Gu and Samarjit Chakraborty. Control theory-based DVS for interactive 3D games. In DAC '08: Proceedings of the 45th annual Design Automation Conference, pages 740–745, New York, NY, USA, 2008. ACM.

[37] Yan Gu and S. Chakraborty. Power management of interactive 3D games using frame structures. In VLSI Design, 2008. VLSID 2008. 21st International Conference on, pages 679–684, Jan. 2008.

[38] Bren C. Mochocki, Kanishka Lahiri, Srihari Cadambi, and X. Sharon Hu. Signature-based workload estimation for mobile 3D graphics. In DAC '06: Proceedings of the 43rd annual Design Automation Conference, pages 592–597, New York, NY, USA, 2006. ACM.

[39] Uzi Zangi and Ran Ginosar. A low power video processor. In ISLPED ’98: Pro- ceedings of the 1998 international symposium on Low power electronics and design, pages 136–138, Monterey, USA, 1998.

[40] Wei-Chung Cheng and Massoud Pedram. Chromatic encoding: A low power encoding technique for digital visual interface. In DATE ’03: Proceedings of the conference on Design, Automation and Test in Europe, page 10694, 2003.

[41] Sabino Salerno, Alberto Bocca, Enrico Macii, and Massimo Poncino. Limited intra-word transition codes: an energy-efficient bus encoding for LCD display interfaces. In ISLPED '04: Proceedings of the 2004 international symposium on Low power electronics and design, pages 206–211, Newport Beach, USA, 2004.

[42] Yen-Jen Chang, Shanq-Jang Ruan, and Feipei Lai. Design and analysis of low-power cache using two-level filter scheme. IEEE Trans. VLSI, 11(4):568–580, 2003.

[43] Ali Iranli and Massoud Pedram. DTM: dynamic tone mapping for backlight scaling. In DAC ’05: Proceedings of the 42nd annual conference on Design automation, pages 612–617, San Diego, USA, 2005.

[44] Ju-Ho Sohn, Jeong-Ho Woo, Min-Wuk Lee, Hye-Jung Kim, Ramchan Woo, and Hoi-Jun Yoo. A 155-mW 50-Mvertices/s graphics processor with fixed-point programmable vertex shader for mobile applications. IEEE Journal of Solid-State Circuits, 41(2):1081–1091, May 2006.

[45] Nam Sung Kim, Krisztián Flautner, David Blaauw, and Trevor Mudge. Drowsy instruction caches: leakage power reduction using dynamic voltage scaling and cache sub-bank prediction. In Proceedings of the 35th annual ACM/IEEE international symposium on Microarchitecture, MICRO 35, pages 219–230, Los Alamitos, CA, USA, 2002. IEEE Computer Society Press.

[46] J. Jeong and M. Dubois. Cost-sensitive cache replacement algorithms. In HPCA, 2002.

[47] Krisztián Flautner, Nam Sung Kim, Steve Martin, David Blaauw, and Trevor Mudge. Drowsy caches: simple techniques for reducing leakage power. SIGARCH Comput. Archit. News, 30:148–157, May 2002.

[48] Ching-Long Su and A.M. Despain. Cache designs for energy efficiency. System Sciences, 1995. Proceedings of the Twenty-Eighth Hawaii International Conference on, 1:306–315 vol.1, 3-6 Jan 1995.

[49] http://www.patentstorm.us/patents/7069387-claims.html.

[50] http://www.mesa3d.org/.

[51] Steve J.E. Wilton and Norman P. Jouppi. An enhanced access and cycle time model for on-chip caches. HP Labs Technical Report, WRL-93-5, 1994.

[52] Johnson Kin, Munish Gupta, and William H. Mangione-Smith. The filter cache: An energy efficient memory structure. In International Symposium on Microarchitecture, pages 184–193, 1997.

[53] V.M. del Barrio, C. Gonzalez, J. Roca, A. Fernandez, and E. Espasa. Attila: a cycle-level execution-driven simulator for modern GPU architectures. Performance Analysis of Systems and Software, IEEE International Symposium on, 0:231–241, 2006.

[54] http://ati.amd.com/developer/shaderx/ShaderX NPR.pdf.

[55] http://attila.ac.upc.edu/wiki/index.php/Traces.

[56] http://amd-ad.bandrdev.com/samples/Pages/default.aspx.

[57] http://www.flipcode.com/archives/Geometry Skinning Blending and Vertex Lighting-Using Programmable Vertex Shaders and DirectX 80.shtml.

[58] You-Ming Tsao, Shao-Yi Chien, Chin-Hsiang Chang, Chung-Jr Lian, and Liang-Gee Chen. Low power programmable shader with efficient graphics and video acceleration capabilities for mobile multimedia applications. pages 395–396, Jan. 2006.

[59] Eric Chan, Ren Ng, Pradeep Sen, Kekoa Proudfoot, and Pat Hanrahan. Efficient partitioning of fragment shaders for multipass rendering on programmable graphics hardware. In HWWS '02: Proceedings of the ACM SIGGRAPH/EUROGRAPHICS conference on Graphics hardware, pages 69–78, Aire-la-Ville, Switzerland, 2002. Eurographics Association.

[60] Alan Heirich. Optimal automatic multi-pass shader partitioning by dynamic programming. In HWWS '05: Proceedings of the ACM SIGGRAPH/EUROGRAPHICS conference on Graphics hardware, pages 91–98, New York, NY, USA, 2005. ACM.

[61] http://quid.hpl.hp.com:9081/cacti/.

[62] http://www.arm.com/products/processors/technologies/vector-floating-point.php.

[63] Ying Tan, Parth Malani, Qinru Qiu, and Qing Wu. Workload prediction and dynamic voltage scaling for MPEG decoding. In ASP-DAC '06: Proceedings of the 2006 Asia and South Pacific Design Automation Conference, pages 911–916, Piscataway, NJ, USA, 2006. IEEE Press.

[64] Krisztián Flautner and Trevor Mudge. Vertigo: automatic performance-setting for Linux. SIGOPS Oper. Syst. Rev., 36(SI):105–116, 2002.

[65] Zhijian Lu, Jason Hein, Marty Humphrey, Mircea Stan, John Lach, and Kevin Skadron. Control-theoretic dynamic frequency and voltage scaling for multimedia workloads. In CASES ’02: Proceedings of the 2002 International Conference on Compilers, architecture, and synthesis for embedded systems, pages 156–163, New York, NY, USA, 2002. ACM.

[66] Ying Tan, Parth Malani, Qinru Qiu, and Qing Wu. Workload prediction and dynamic voltage scaling for MPEG decoding. In ASP-DAC '06: Proceedings of the 2006 Asia and South Pacific Design Automation Conference, pages 911–916, Piscataway, NJ, USA, 2006. IEEE Press.

[67] Kihwan Choi, Ramakrishna Soma, and Massoud Pedram. Off-chip latency-driven dynamic voltage and frequency scaling for an MPEG decoding. In DAC '04: Proceedings of the 41st annual Design Automation Conference, pages 544–549, New York, NY, USA, 2004. ACM.

[68] V.M. del Barrio, C. Gonzalez, J. Roca, A. Fernandez, and E. Espasa. Attila: a cycle-level execution-driven simulator for modern GPU architectures. Performance Analysis of Systems and Software, IEEE International Symposium on, 0:231–241, 2006.

[69] Iosif Antochi, Ben Juurlink, and Stamatis Vassiliadis. Selecting the optimal tile size for low-power tile-based rendering. In Proceedings of ProRISC 2002, pages 1–6, 2002.

Research Publications

1. Texture Filter Memory: A Power-efficient and Scalable Texture Memory Architecture for Mobile Graphics Processors, B. V. N. Silpa, Anjul Patney, Tushar Krishna, Preeti Ranjan Panda, G. S. Visweswaran, IEEE/ACM International Conference on Computer Aided Design (ICCAD '08), San Jose, November 2008.

2. Low Power High Throughput Texture Cache, B.V.N.Silpa, Poster Presented at Intel Scholar Program (ISP’08). (Selected for Best Poster Award)

3. Adaptive Partitioning of Vertex Shader for Low Power High Performance Geometry Engine, B.V.N.Silpa, Kumar S.S.Vemuri, Preeti Ranjan Panda, Springer-Verlag/LNCS, International Symposium on Visual Computing (ISVC'09), Las Vegas, November 2009.

4. Vertex Shader Partitioning, B.V.N.Silpa, Kumar S.S.Vemuri, Preeti Ranjan Panda, Poster Presented at High Performance Graphics (HPG'09), New Orleans, August 2009.

5. Rank based dynamic voltage and frequency scaling for tiled graphics processors, B.V.N Silpa, Gummidipudi Krishnaiah, and Preeti Ranjan Panda, In Proceedings of the eighth IEEE/ACM/IFIP international conference on Hardware/software codesign and system synthesis (CODES/ISSS '10). ACM, New York, NY, USA. (Best Paper Candidate)

6. FastFwd: an efficient hardware acceleration technique for trace-driven network-on-chip simulation, Gummidipudi Krishnaiah, B. V. N. Silpa, Preeti Ranjan Panda, Anshul Kumar, In Proceedings of the eighth IEEE/ACM/IFIP international conference on Hardware/software codesign and system synthesis (CODES/ISSS '10). ACM, New York, NY, USA.

7. Power-efficient System Design, Preeti Ranjan Panda, Aviral Shrivastava, B. V. N. Silpa, Krishnaiah Gummidipudi, book published by Springer, New York, 2010.

Biography

B.V.N.Silpa received her MTech degree in VLSI Design Tools and Technology from Indian Institute of Technology Delhi in 2006. She received her Bachelor's degree in Electronics and Communications from University College of Engineering, Osmania University in 2004. She received the Microsoft Research PhD fellowship for the years 2007 to 2010 for her research on Power Optimizations for Graphics Processors. She is currently working for nVidia as a senior architect.