UPTEC IT10017 Examensarbete 30 hp Juni 2010

Energy efficient graphics

Making the rendering process power aware

Johan Bergman

Abstract Energy efficient graphics

Johan Bergman

Teknisk-naturvetenskaplig fakultet, UTH-enheten

Today, it is possible to produce computer generated graphics with amazing realism, even in embedded systems. Embedded systems, such as mobile phones, are characterized by limited battery power, and as graphics become more complex it becomes necessary to find a solution that provides the means to control the energy consumption of graphics at run-time. When energy resources are scarce it would be desirable to be able to limit how much energy is spent generating graphics, so that other, more important, system components may continue to operate for a longer time. This thesis examines how the rendering process can be made power aware and energy efficient.

The proposed solution to achieve power awareness without modification to existing hardware and software is a library interposer on top of the OpenGL API. The design and implementation of the interposer library show that it is possible to limit energy consumption with high precision through a relatively simple algorithm. The interposer limits the amount of time the processing units are actively rendering graphics, and since energy consumption displays a linear correlation with CPU and GPU utilization, energy is preserved at the expense of frame rate or image quality. To preserve an acceptable frame rate, certain visual effects are turned off to reduce the frame rendering time. Lowering image quality makes it possible to increase the frame rate while keeping utilization constant. Measurements show that energy consumption remains stable at lowered image quality and higher frame rate. The conclusion of this thesis contains thoughts on how to incorporate such a system in existing frameworks for power management, and how power management frameworks could be improved to better exploit the possibilities presented by a power aware rendering process. During the research for this master thesis it has become apparent that a scalable rendering process is desirable not only for power management but can be used for other purposes as well.

Handledare: Barbro Claesson, Detlef Scholle Ämnesgranskare: Stefanos Kaxiras Examinator: Anders Jansson ISSN: 1401-5749, UPTEC IT10017

Contents

1 Introduction 5 1.1 Background ...... 5 1.2 Problem statement ...... 6 1.2.1 Energy efficient ...... 6 1.2.2 Computer graphics and power management ...... 6 1.2.3 Design and implementation of a power aware graphics system . . . 6 1.3 Method ...... 6 1.4 Limitations and time constraints ...... 7

2 Computer graphics 8 2.1 Graphical Processing Unit ...... 8 2.2 The graphics pipeline ...... 9 2.3 Tile-based rendering ...... 10 2.4 Ray casting and ray tracing ...... 11 2.5 Graphics library ...... 11 2.6 Unified shader architecture and General Purpose GPUs ...... 11 2.7 Graphics in embedded systems ...... 12 2.8 Summary ...... 12

3 Power management 14 3.1 Static power management ...... 14 3.2 Dynamic power management ...... 15 3.3 Power management policy ...... 15 3.3.1 Break even schemes and time-out policies ...... 15 3.3.2 Predictive wake-up ...... 16 3.4 The Advanced Configuration and Power Interface (ACPI) power manage- ment framework ...... 16 3.5 Power management module ...... 16 3.6 Operating system power management ...... 17 3.6.1 ECOSystem ...... 17 3.6.2 EQoS ...... 18 3.7 Power manageable components ...... 19 3.7.1 Power management at the application level ...... 19


3.7.2 Power management at the driver level ...... 20 3.8 Workload prediction for graphics applications ...... 20 3.8.1 Signal-based workload prediction ...... 21 3.9 Hardware support for dynamic power management ...... 21 3.9.1 Power modes ...... 22 3.9.2 Dynamic Voltage and Frequency Scaling (DVFS) ...... 23 3.10 Summary ...... 24

4 Power management methods in computer graphics software 25 4.1 Energy efficient graphics applications ...... 25 4.2 Level of detail ...... 27 4.2.1 Simplification models and progressive meshes ...... 28 4.2.2 Level of detail control ...... 29 4.3 Power awareness in the rendering pipeline ...... 29 4.3.1 Per-vertex transform and lighting ...... 29 4.3.2 Clipping and culling ...... 31 4.3.3 Fragment ...... 32 4.3.4 Fog ...... 33 4.3.5 Texturing ...... 33 4.3.6 Anti-aliasing ...... 35 4.4 Energy efficient tile-based rendering ...... 36 4.5 Frame rate ...... 37 4.6 Summary ...... 38 4.6.1 Reduce computational complexity ...... 38 4.6.2 Reduce external memory accesses ...... 39 4.6.3 Power efficiency in the graphics pipeline ...... 40

5 Design and implementation of a power aware graphics system 41 5.1 Requirements ...... 41 5.2 Specification ...... 44 5.2.1 User perspective ...... 44 5.2.2 Graphics application perspective ...... 44 5.2.3 Power management module perspective ...... 45 5.2.4 Hardware perspective ...... 45 5.3 Design and implementation ...... 45 5.4 The interposer library ...... 46 5.5 Energy consumption and workload ...... 48 5.6 The utilization limit ...... 48 5.7 Optimization of image quality ...... 49 5.8 Graphics system operation ...... 52 5.8.1 Intra-process communication (IPC) ...... 52 5.8.2 Threaded architecture ...... 53 5.8.3 Interposed OpenGL functions of note ...... 54 5.8.4 Rendering timer ...... 55


5.8.5 Rendering buffer ...... 56 5.8.6 Measuring energy consumption ...... 56 5.8.7 Workload prediction ...... 57

6 Results 58 6.1 Demos ...... 58 6.2 Measurements ...... 59 6.2.1 Estimating utilization ...... 59 6.3 Summary of the measurements ...... 61

7 Conclusion 63 7.0.1 Energy consumption of tile-based rendering ...... 63 7.0.2 Library interposition as a method for power management . . . . . 63 7.0.3 Power management and the OpenGL API ...... 64 7.0.4 Power management frameworks and power aware rendering . . . . 64 7.1 Future work ...... 66 7.1.1 Extend the functionality of the library interposer ...... 66 7.1.2 OpenGL library interposition for other reasons ...... 66 7.1.3 Mapping the power / performance trade-off ...... 67

Bibliography 68


Chapter 1

Introduction

This master thesis will investigate how to best increase energy efficiency for computer generated graphics in embedded systems, in particular the possibility of scaling the rendering process in order to trade performance for power. This chapter describes the background, goals and limitations of this thesis.

1.1 Background

The performance of computer-generated graphics in embedded systems, such as mobile phones, has increased steadily over the last decade. The limited energy resources of these systems mean that power consumption is becoming a more and more pressing issue. Many embedded systems today include Graphics Processing Units (GPUs) dedicated to performing complex computations for computer graphics, and high-resolution displays. 3D graphics has become an integral part of mobile phones, but this comes at the cost of increased energy consumption, and reducing the energy footprint of the processing units involved in generating graphics is vital. Efficient use of limited resources in the computer science field is referred to as Quality of Service (QoS). QoS can have different meanings in different contexts; in this project, QoS refers to the relationship between energy consumption and performance. Another concept used in this thesis work is power awareness. A power aware device has several modes of operation, where each mode has a different level of energy consumption but also of performance. Making a device or module power aware will provide the system with tools that enable it to use energy more efficiently. Minimizing power consumption while maintaining as much performance as possible is referred to as graceful degradation. This master thesis will be performed at ENEA AB in collaboration with Uppsala University, in the spirit of the GEODES1 project. The GEODES project aims to provide the design techniques that embedded software needs to face the challenge of long power-autonomy in feature-rich, and possibly life-critical, systems.

1GEODES - Global Energy Optimization for Distributed Embedded System


1.2 Problem statement

There are several topics of interest for this thesis; these are listed below. Some parts of this master thesis work are purely theoretical, while other areas of research form the basis for the design and implementation.

1.2.1 Energy efficient computer graphics The main topic of this thesis work is a comprehensive survey of how to maximize the performance of computer generated graphics under power constraints. As a first step towards power awareness it is necessary to understand how computer graphics actually works: how does computer graphics work, and what are the benefits of having it? It is also necessary to know when and where energy is being consumed in a graphics system and, perhaps most importantly of all, how energy can be saved.

1.2.2 Computer graphics and power management A related subject is how the rendering of computer graphics can be included in a power management framework. It is possible to design a graphics system that is able to deliver top quality graphics when requested, but continues to operate, albeit at reduced performance, when energy resources are scarce. This thesis contains a comparison of different power management frameworks and how these relate to computer graphics. What does a power management framework that includes power aware computer graphics look like? What are the difficulties that need to be overcome in order to achieve a power aware rendering process?

1.2.3 Design and implementation of a power aware graphics system The practical part of this thesis consists of the specification, design and implementation of a power aware graphics system on the i.MX31 hardware platform. The implementation is a prototype and only has a subset of features proposed in the design. It was used to test and demonstrate some of the reasoning behind the design. The design aims to describe how power awareness can be introduced in a system with minimal changes to existing software and hardware. This thesis work attempts to explain all of the options that are available when it comes to trading performance for power and also how to choose the method that provides the best trade-off.

1.3 Method

This thesis project was conducted in two distinct phases. The thesis work began with a literature study of academic research papers, manuals, technical documentation and reports. To focus the research, the research subjects were updated on a weekly basis as new knowledge was gained. The second phase built upon the knowledge gained from

the literature study to design and implement the power aware graphics system for the i.MX31 board. Although the implementation is only a prototype, some measurements were carried out in order to validate parts of the design and as a starting point for discussion. This document was continuously written and re-written throughout the entire process.

1.4 Limitations and time constraints

It was decided early on that the development platform for the power aware system was going to be the i.MX31 development board running Linux 2.6. The only graphics library considered was OpenGL. The time frame for the literature study was ten weeks, after which another ten weeks were set aside for design and implementation. From the beginning it was also the intention that one of the research subjects would be display power management, since the display is usually the component with the highest energy consumption in an embedded system, but that study was cut from the final report since it was not complete.

Chapter 2

Computer graphics

This chapter contains a brief description of a general computer graphics system: how a combination of dedicated hardware and software enables the projection of a three-dimensional scene onto a screen.

2.1 Graphical Processing Unit

Figure 2.1: Overview of a traditional GPU in an embedded system.

A GPU is a microprocessor dedicated to performing computations associated with computer graphics. Traditionally, the main task of the GPU is to render graphics. Rendering is the process of synthesizing an image from the description of a scene. Special-purpose hardware can always perform a given task, such as 3D rendering, more efficiently than a general-purpose CPU[8], and using dedicated graphics hardware helps embedded systems get by with lower-clock-rate CPUs. The scene description consists of geometric primitives in Euclidean 3-dimensional space. The light, color, reflection properties and the viewer's position are also taken as input to calculate the image. The rendering process has several stages where complex calculations can be performed in parallel, and it is this property which makes GPUs so efficient at computer graphics computations. A GPU features a highly parallel structure and fast memory

access. Together, these features enable a GPU to speed up the rendering process considerably. GPUs are optimized for high throughput and not low latency, as a CPU is[22]. The GPU may or may not provide acceleration for the entire rendering process. In many systems where the space on chip is limited the GPU only supports a subset of graphics operations, and the rest of the rendering process has to be carried out on the CPU.

2.2 The graphics pipeline

Below follows a short description of the stages of the traditional GPU pipeline, which is also referred to as scanline rendering. The graphics pipeline can be divided into two main stages.

• Geometry stage. During the geometry stage the vertices that are the input to the graphics pipeline are transformed into a stream of triangles, in a common 3D space with the viewer located at the origin.

• Fragment stage. The fragment stage is responsible for generating the pixel value of each pixel on the display.

Each stage consists of several sub-stages which may be performed in slightly different order depending on the system architecture. A flow-chart of a typical graphics pipeline is presented in figure 2.2.

Figure 2.2: The stages of a typical graphics pipeline.

In many embedded systems the geometry stage computations are carried out on the CPU[2]. The possibilities for optimization and graceful degradation in the rendering pipeline are a large part of this thesis work and will be explained in detail in chapter 4.3. According to the GPU, the world is made up of triangles, and before any computations are done any complex shape has to be split into triangles. OpenGL or some other graphics library is used to push each triangle into the graphics pipeline one vertex at a time. To fully take advantage of the parallelism in the GPU, all objects that will be put into the image first have to be transformed into the same coordinate system. Lighting is then added to the scene on a per-vertex basis by combining information about all the light-sources in the scene. In the next step the vertices are projected onto the virtual camera's film plane. Ideally, only objects that are visible from the camera are considered when calculating the pixel value. When it comes to determining which visible screen-space triangles overlap pixels on the display, each pixel can be treated individually, which allows the GPU to fully utilize its parallelization capabilities. This process is called rasterization. Lighting is not enough to give an image the impression of realism. By draping images called textures


over the geometry an illusion of detail is added to the image. Texturing requires a huge amount of memory accesses in quick succession, and GPUs are equipped with high-speed memory as well as excellent caching to provide fast access to textures[22]. Several of these stages require information about how far away the closest object is from the viewpoint of the virtual camera for each pixel. For this reason a depth buffer, called the z-buffer, is kept; each time a triangle is closer than any previous triangle the minimum distance and the pixel value are updated. For complex scenes where many triangles overlap, the number of reads from the z-buffer can become very large.

Figure 2.3: Overview of a tile-based GPU in an embedded system.
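The z-buffer comparison described above can be sketched as follows. The tiny 4x4 framebuffer, the flat color/depth arrays and the integer color values are illustrative assumptions, not the thesis implementation:

```c
#include <assert.h>
#include <float.h>

#define W 4
#define H 4

/* One depth value per pixel; smaller means closer to the camera. */
static float zbuf[W * H];
static unsigned color[W * H];

/* Reset every pixel to "infinitely far away". */
static void clear_zbuffer(void) {
    for (int i = 0; i < W * H; i++) {
        zbuf[i] = FLT_MAX;
        color[i] = 0;
    }
}

/* Classic z-test: the fragment only wins the pixel if it is
 * closer than every triangle rasterized there so far. */
static int depth_test(int x, int y, float z, unsigned c) {
    int i = y * W + x;
    if (z < zbuf[i]) {
        zbuf[i] = z;   /* update minimum distance */
        color[i] = c;  /* update pixel value      */
        return 1;      /* fragment passed         */
    }
    return 0;          /* fragment occluded       */
}
```

A nearer fragment overwrites a farther one, while a farther fragment is discarded; every overlapping triangle costs a z-buffer read, which is why overdraw-heavy scenes read the buffer so often.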

2.3 Tile-based rendering

Tile-based rendering is a technique where the 3D scene is decomposed into regions, or tiles. It was originally developed to speed up the rendering process by allowing multiple triangles to be rendered in parallel, but is now used in low-power embedded graphics systems to provide sequential rendering of large scenes, maximizing utilization of limited hardware acceleration[3]. The geometry stage of tile-based rendering is the same as in the traditional graphics pipeline. Geometry operations can be carried out on the CPU or on special-purpose hardware. The processed vertices are sent to the rasterizer, but before they reach it the triangles are sorted into bins that correspond to different tiles. Since a triangle might span several tiles it may have to be put into multiple bins and sent to the fragment stage repeatedly. The sorting might be performed as part of the rasterization process, or it might be performed by the CPU as part of the geometry stage. The tiles are then rendered one by one and the pixel values written to the frame buffer. In a traditional graphics pipeline the depth values of each fragment and the textures have to be placed in off-chip memory, because the depth and texture buffers would otherwise have to be unfeasibly large to store all data. The benefit of a tile-based rendering system is that the depth values and textures for one tile can be stored on chip, which allows very fast and efficient memory access without large buffers[2]. The i.MX31 development board used for the implementation part of this thesis uses a tile-based rendering technique.
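The binning step described above can be sketched with a conservative bounding-box overlap test. The 32-pixel tile size, the 4x4 tile grid and all type names are illustrative assumptions, not the i.MX31's actual scheme:

```c
#include <assert.h>

#define TILE 32          /* tile edge in pixels (illustrative) */
#define TILES_X 4
#define TILES_Y 4

typedef struct { float x, y; } Vec2;
typedef struct { Vec2 v[3]; } Tri;

/* Put the triangle into every bin whose tile its screen-space
 * bounding box touches.  Returns the number of bins used; a
 * triangle spanning several tiles is processed once per bin. */
static int bin_triangle(const Tri *t, int bins[TILES_Y][TILES_X]) {
    float minx = t->v[0].x, maxx = t->v[0].x;
    float miny = t->v[0].y, maxy = t->v[0].y;
    for (int i = 1; i < 3; i++) {
        if (t->v[i].x < minx) minx = t->v[i].x;
        if (t->v[i].x > maxx) maxx = t->v[i].x;
        if (t->v[i].y < miny) miny = t->v[i].y;
        if (t->v[i].y > maxy) maxy = t->v[i].y;
    }
    int tx0 = (int)(minx / TILE), tx1 = (int)(maxx / TILE);
    int ty0 = (int)(miny / TILE), ty1 = (int)(maxy / TILE);
    int n = 0;
    for (int ty = ty0; ty <= ty1; ty++)
        for (int tx = tx0; tx <= tx1; tx++)
            if (tx >= 0 && tx < TILES_X && ty >= 0 && ty < TILES_Y) {
                bins[ty][tx]++;   /* triangle queued for this tile */
                n++;
            }
    return n;
}
```

A small triangle lands in a single bin, while one whose bounding box crosses a tile border is duplicated into each touched bin and will be sent through the fragment stage once per tile.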


2.4 Ray casting and ray tracing

The overwhelming majority of computer graphics systems use the rasterization rendering process described above. An alternative rendering process is ray casting and its relative, ray tracing. Ray casting is a technique where the pixel value is calculated by simulating the path of light through a 3D environment. The pixel value is calculated based on the first surface that crosses the path of the ray projected from the viewpoint for each pixel. Ray tracing is a more advanced technique where the ray is allowed to bounce several times within the scene, thereby producing effects such as reflection and shadows. It is used to create images with incredibly realistic lighting and color, at the expense of increased computational complexity. Ray traced images typically take several seconds to render and are therefore unsuitable for interactive applications. Hardware acceleration of ray tracing is still in development, and it will take some time before the first interactive ray traced applications hit the market.

2.5 Graphics library

A graphics library enables programmers to utilize the graphics accelerator hardware of the GPU and provides a common API to applications that wish to render 3D graphics. There exist several graphics libraries. OpenGL ES is a cross-platform API for 2D and 3D graphics applications that is adapted to suit the limited hardware resources of embedded systems. It is the graphics library used for the practical part of this master thesis. The graphics library enables developers of graphics applications to access dedicated hardware without having to consider differences between platforms. The graphics library resides in the operating system and translates API calls into actions executed on either the CPU or the GPU.

2.6 Unified shader architecture and General Purpose GPUs

GPUs today implement the geometry and rasterization stages of the graphics pipeline using programmable hardware called vertex- and fragment shaders. A unified shader is created by implementing functionality for both vertex- and fragment shaders. A unified shader architecture is more desirable for applications with uneven workload between the different stages of the graphics pipeline. Moya et al.[25] have measured the benefits of using a unified shader architecture in a comprehensive simulated environment. The amount of space needed on chip for shader hardware could be reduced by as much as 30% using a conservative estimate. Since GPUs outperform CPUs when it comes to computational power, a lot of effort has been put into using them for calculations other than graphics. This is not as straightforward as it may sound and, up until a couple of years ago, was not even possible due to hardware constraints. Processes with large computational requirements, high parallelism, and whose throughput is more important than low latency are suitable for

running on a GPU[22]. A GPU which performs operations traditionally handled by a CPU is referred to as a General Purpose GPU (GPGPU). GPGPU computing is outside the scope of this thesis, but the unified shader model and general purpose GPUs serve as examples of the ongoing trend that GPUs are becoming more programmable and versatile. Using hardware acceleration not only improves the execution time of applications but also has a positive impact on power consumption. The development of new hardware is very important to achieve high system performance and low power consumption.

2.7 Graphics in embedded systems

Embedded systems with computer graphics, including mobile phones, differ from more powerful computer systems in several aspects. Handheld devices have a very limited power supply compared to a stationary computer, which requires innovation at both the hardware and the software level to increase the life-time of such systems through smart design and energy efficiency. Another limitation of handheld devices is their size. Most mobile phones are small, and even if the power supply is increased the extra power turns into heat, which could potentially damage circuits unless thermal design aspects are considered[8]. Small memory bandwidth and limited chip area for dedicated hardware such as GPUs are some of the challenges when designing embedded systems, apart from the demands for low power consumption. Small displays which are held close to the eye of the viewer actually result in higher demands on image quality than in a desktop system[1]. The limited rendering capabilities of embedded systems are really stretched by applications that are developed for PCs and ported to embedded systems. Computer games are often distributed to PCs and embedded systems simultaneously, which requires embedded systems to reduce the level of detail in order to provide acceptable frame rates[24, 9]. It is important to make a distinction between interactive and non-interactive graphics. Non-interactive graphics can be rendered as simple bitmaps on other devices and usually does not place as calculation-heavy demands on the GPU[8].

2.8 Summary

The graphics rendering process takes the description of a 3D scene and projects it onto a 2-dimensional screen. Scanline rendering is the name of the process used in almost every system today. Tile-based rendering is an alternative technique which splits the scene into regions and renders each region separately; it is mostly found in embedded systems. The graphics pipeline can be divided into a number of stages that all contribute to the final image. A GPU provides hardware acceleration of the computations that are needed to render graphics. The reason why a GPU is able to speed up the rendering process by several orders of magnitude is that the rendering pipeline contains several stages where many elements can be processed in parallel. Applications that request computer generated graphics can call a graphics library like OpenGL to access hardware acceleration. Using


GPUs increases the energy consumption and speed of the rendering process considerably. A trend in GPU technology is towards more general-purpose hardware that can be used for other than rendering purposes. Energy efficiency and power awareness are especially important in embedded systems, where the battery and the space on chip are limited.

Chapter 3

Power management

Power management (PM) is the process of efficiently directing power to different components of a system. It is especially important in embedded systems that rely on battery power. This chapter aims to provide a definition for various power management concepts used throughout this thesis. It will also describe the power management control techniques that are available at different levels of the hardware/software stack and how they relate to system components such as computer graphics. Power management is used not only to prolong battery life-time but also to reduce noise and cooling requirements for integrated circuits. Luca Benini et al.[5] have presented a comprehensive survey of the various system-level power management techniques available. They look at several aspects of dynamic power management, such as what type and how much information should be exchanged between the manager and the system components.

3.1 Static power management

The best solution to power management issues is of course to reduce power consumption without any degradation of performance. Static power management refers to the minimization of leakage current and other power consumption characteristics of hardware circuits at the base power consumption level, but could also refer to a more general minimization of power consumption at various levels of performance. The base power consumption of a system can be defined as the sum of the energy consumed by components without power management and the lowest possible power state for each power manageable component[35]. What all static power management has in common is that it is implemented at design time and usually does not require as much software support as dynamic power management, in the shape of complex algorithms and architectures. Static power management and dynamic power management are not mutually exclusive, and both are required to improve the life-time of feature-rich embedded systems.
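The definition above reduces to a simple sum; splitting the components into two arrays (unmanaged components at their fixed power, manageable components at their lowest power state) is an illustrative assumption:

```c
#include <assert.h>

/* Base power consumption, per the definition above: components
 * without power management contribute their fixed power draw,
 * while power manageable components contribute their lowest
 * possible power state. */
static double base_power_w(const double *fixed_w, int nf,
                           const double *lowest_state_w, int nm) {
    double sum = 0.0;
    for (int i = 0; i < nf; i++) sum += fixed_w[i];       /* unmanaged  */
    for (int i = 0; i < nm; i++) sum += lowest_state_w[i]; /* manageable */
    return sum;
}
```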


3.2 Dynamic power management

Any electronic design needs to be able to deliver peak performance when requested[5]. This is why static power management alone cannot suffice. At peak performance, power consumption is high, even though most components will not be running at full capacity all the time. Without some way of restricting the power consumed by inactive or partially active components, the battery will either have to be impractically large or the system will have a very limited life-time. Dynamic Power Management (DPM) is a way for dynamically reconfigurable systems to provide requested services with a minimum of power consumption. DPM techniques include methods to turn off or reduce the performance of system components when they are not used to their full capacity. Most systems experience fluctuations in workload during runtime, and any DPM is based on the assumption that it is possible to predict future workload with a degree of certainty[5].

3.3 Power management policy

How a system is configured using DPM is called the power management policy. A power management policy can be designed off-line by the developer, or it can be implemented as a general adaptive solution which dynamically reconfigures itself at run-time. Many DPM policies are a combination of both. Predictive techniques for setting power management policy do not guarantee optimal solutions. All predictive techniques use information about past events to make reliable predictions about the future. Regardless of the predictive algorithm used, the quality is dependent on the correlation between past and current events, which is always beyond the control of the designer[5]. Most dynamic power management research has focused on optimizing power under performance constraints. Power is a global system resource, which means that the challenges of power-constrained QoS are different from other types of QoS.

3.3.1 Break even schemes and time-out policies The time-out policy is widely used in laptops and handheld devices. It is a simple policy which shuts down a component after some period of inactivity. This policy is clearly sub-optimal and can even prove counterproductive in specific situations[5]. An example where a simple policy with a fixed time-out is ineffective is a system with a periodic workload whose period is slightly longer than the time-out value. A processor in such a system would enter a sleep mode when the time-out expires, but the energy savings, if any, would be too small to justify the transition energy overhead. This example highlights the need to set the time-out correctly. The break even scheme is one of the most commonly used schemes to determine the optimal time-out value. The processor is put into a sleep mode when the expected energy gain is greater than the mode transition cost. It is a scheme that is intuitive, easy to implement and performs relatively well for components with few power modes[5].
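The break even scheme reduces to a simple inequality: sleep only when the expected idle period exceeds the transition energy overhead divided by the power saved while asleep. A minimal sketch with illustrative names and units (watts, joules, seconds):

```c
#include <assert.h>

/* Break-even idle time: sleeping saves (P_active - P_sleep) joules
 * per second but costs a fixed transition energy, so sleeping only
 * pays off for idle periods longer than the break-even point. */
static double break_even_s(double p_active_w, double p_sleep_w,
                           double transition_j) {
    return transition_j / (p_active_w - p_sleep_w);
}

/* Break-even decision: enter sleep mode only when the predicted
 * idle period is longer than the break-even time. */
static int should_sleep(double predicted_idle_s, double p_active_w,
                        double p_sleep_w, double transition_j) {
    return predicted_idle_s > break_even_s(p_active_w, p_sleep_w,
                                           transition_j);
}
```

With a 2 W active power, 0.5 W sleep power and 0.3 J transition overhead, the break-even point is 0.2 s: the periodic-workload trap above corresponds to idle periods that trigger the time-out yet fall below this threshold.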


Using the break even scheme on the GPU and the rendering process as a whole requires that the arrival time of the next request can be predicted with accuracy. Lu et al. have tested a simple time-out technique using different schemes to determine the optimal time-out. They show that it is possible to save power with an efficient time-out scheme. The adaptive algorithm they use outperforms the other methods, indicating that adaptive algorithms are well suited for power management[21].

3.3.2 Predictive wake-up When the processor is in sleep mode and a request comes in, it will take some time for the processor to get up to speed. If the power manager gets accurate information about the expected future use of an application, it can employ the predictive wake-up scheme to reduce the performance penalties and energy losses caused by waking up[21]. Rendering of computer graphics is often done at highly periodic intervals, which makes graphics applications prime targets for predictive wake-up schemes.
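For a periodic workload such as frame rendering, a simple predictor is the average of recent inter-frame periods: the next request is expected one average period after the last one, and the wake-up is started early enough to hide the wake-up latency. The averaging window and names are illustrative assumptions:

```c
#include <assert.h>

/* Predict the arrival of the next frame request from the average
 * inter-arrival period of the last n intervals.  The arrivals array
 * holds n + 1 timestamps in seconds, oldest first, so the average
 * period is (last - first) / n. */
static double predict_next_arrival(const double *arrivals, int n) {
    return arrivals[n] + (arrivals[n] - arrivals[0]) / n;
}

/* Predictive wake-up: start waking the processor one wake-up
 * latency before the predicted request, so it is up to speed
 * when the frame actually arrives. */
static double wakeup_time(const double *arrivals, int n,
                          double wakeup_latency_s) {
    return predict_next_arrival(arrivals, n) - wakeup_latency_s;
}
```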

3.4 The Advanced Configuration and Power Interface (ACPI) power management framework

ACPI defines a platform-independent open interface for device configuration and power management of individual devices and entire systems. ACPI removes device power management responsibilities from firmware interfaces and describes a set of standardized power management states and methods as guidelines for developers[10]. Most platforms implement ACPI[29]. The ACPI specification contains information about how to set the operating modes of many kinds of hardware, such as CPUs and other processing units. ACPI-based power management algorithms spend less than 1% of computations on power management, according to Benini et al[5]. ACPI does not guarantee that power management will be efficient or that it will be computationally feasible. ACPI provides the interface; it is up to the designer to implement the power management.

3.5 Power management module

Most suggested power management design solutions feature a power management module. The module is responsible for gathering system status information, making intelligent DPM decisions and actuating commands to the individual subsystems. The module's three main tasks are sometimes concretized as sub-modules within the module. The sensor module gathers system information, such as which devices are registered and what their power management capabilities are. It also monitors the current state of each system component at run-time. This information is then relayed to the policy manager, which contains the decision-making algorithms. The output of the policy manager is state change commands. The actuator module takes these commands and relays them to the affected components. In some systems, the policy manager and actuator

16 Power management is combined into one module. This is by no means the only possible architecture to handle power management but it serves as an example and a starting point for further discussion.

Figure 3.1: The basic power management module architecture.

3.6 Operating system power management

Power management can be implemented at various levels in the system architecture. Simple embedded systems with few features often have power management in the form of a dedicated hardware circuit. It is also possible to implement power management as an application or as middleware. However, because most power management policies require the power management module to access low-level drivers and hardware, the vast majority of power management architectures put the policy manager in the operating system. Operating system power management is a hardware/software co-design challenge, since both hardware resources and applications need to facilitate power management functionality for the system to be effective[5]. To facilitate power management, the platform should ideally provide support for measuring the power consumption of each individual component[35]. To make accurate decisions, the power management module needs accurate and timely updates about system and subsystem power consumption. The power manager can then measure the remaining battery at regular intervals and continuously adjust the power budget to fit the desired life-time of the system. Without this information, the manager has to rely on the feedback provided by the components themselves, which may be delayed or inaccurate. Accurately measuring current at run-time is a costly feature to implement, and most embedded systems today therefore lack it.

3.6.1 ECOSystem
Zeng et al.[35] have produced one of the relatively few research papers that treats energy as the first-priority resource. Since this is also the purpose of this thesis, it is interesting to describe their proposed solution in detail. In ECOSystem, each application is given an amount of currentcy that can be used to purchase the right to consume energy on hardware devices. One currentcy unit represents a certain amount of energy within a certain time frame. The ECOSystem resource management policies have two main goals. The policy tries to eliminate waste by using each device as efficiently as possible, and at the same time it limits the offered workload of the system to ensure a minimum life-time. One of the strengths of ECOSystem and the currentcy approach is that it does not require devices or applications to be power aware, although application involvement in power management is facilitated. Currentcy is allocated to applications at specific time intervals. The authors found that a one-second period is sufficient to achieve smooth energy allocation. The amount of total currentcy available determines the maximum power consumption in that time frame and is proportional to the estimated battery model and desired life-time of the system. Each task (application) is then given an amount of currentcy depending on its priority relative to other tasks. ECOSystem is designed with a device interface flexible enough to support a wide variety of devices. Each device has its own charging policy.
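The allocation scheme just described can be sketched in a few lines. The unit conventions and numbers below are illustrative assumptions, not values from the ECOSystem paper; one currentcy unit here stands for one milliwatt-second of energy.

```python
def total_currentcy_per_interval(battery_mwh, lifetime_hours, interval_s=1.0):
    """Currentcy minted each interval so the battery lasts lifetime_hours."""
    budget_mws = battery_mwh * 3600.0               # total energy in mW-seconds
    intervals = lifetime_hours * 3600.0 / interval_s
    return budget_mws / intervals

def allocate_currentcy(total_currentcy, task_priorities):
    """Split the per-interval currentcy among tasks by relative priority."""
    total_priority = sum(task_priorities.values())
    return {task: total_currentcy * prio / total_priority
            for task, prio in task_priorities.items()}
```

A 1000 mWh battery that must last 10 hours yields 100 units per one-second interval; a task with three times the priority of another then receives three times the currentcy, capping the average power each task may consume.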

3.6.2 EQoS
Another example of a system that tries to maximize performance under power constraints is the Energy-aware Quality of Service (EQoS) framework described by Pillai et al.[29], which incorporates both a limited energy budget and real-time tasks. They introduce utility as a way of maximizing system performance under power constraints. Utility is a measure of the value or benefit gained from a particular task or application. Quality of Service (QoS) is originally a traffic engineering term used in packet-switched telecommunication networks to denote resource reservation control mechanisms. In that context, QoS refers to the ability to provide different priorities to different applications, users or data flows; high-priority data flows get a larger share of the network resources. The EQoS concept has the same basic idea, but uses power instead of bandwidth as the primary system resource. For some tasks, assigning utility is straightforward. For applications that provide interactive graphics the authors admit that the assignment of utility is somewhat arbitrary; they suggest using a combination of objective image quality measures and common sense. Moreover, utility assignment is insensitive to task dependencies, which makes it a question of some intricacy. Tasks running in the EQoS system have to be able to degrade gracefully; how this is achieved is up to the developer. At the extreme, some tasks will not be allowed to run at all, since their execution would violate the run-time demands. The earliest-deadline-first scheduler used for CPU scheduling in the EQoS system ensures that all tasks above a certain utilization are allowed to run.


3.7 Power manageable components

Power management on the component level is a feature provided by some electronic equipment to turn itself off or go into a low-power mode when resources are scarce. A power manageable component (PMC) is defined as a functional block; the power manager makes no assumptions about its internal structure. A PMC can be an application, a driver or some other system component. What separates a PMC from other components is the availability of multiple power modes. Components can have both external and internal power management. An internal power management policy has to be more conservative, since the component usually lacks observability of overall system operation[5], but internal management has the benefit of direct access to the hardware registers of the managed device, enabling a fast feedback loop. Performance degradation has to be implemented in such a way that it does not impair overall system performance more than the component's importance warrants. Effective power management requires high flexibility in the types of resources that can be integrated as power manageable components[5]. Rendering of computer graphics can be considered a functional block: it has a clearly defined purpose in a system, even though the rendering process affects several hardware devices.
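A minimal sketch of how a PMC might be modeled follows. The interface is hypothetical and not taken from any cited framework; it only captures the defining property above: an opaque block exposing a set of named power modes and nothing about its internals.

```python
class PowerManageableComponent:
    """Opaque functional block exposing only its power modes."""

    def __init__(self, name, modes):
        # modes: mapping of mode name -> power draw in milliwatts (assumed unit)
        self.name = name
        self.modes = modes
        self.mode = max(modes, key=modes.get)  # start in the fully-on mode

    def available_modes(self):
        # Modes ordered from lowest to highest power draw.
        return sorted(self.modes, key=self.modes.get)

    def set_mode(self, mode):
        if mode not in self.modes:
            raise ValueError(f"{self.name} has no mode {mode!r}")
        self.mode = mode

    def power_draw(self):
        return self.modes[self.mode]
```

An external power manager would only ever call `available_modes`, `set_mode` and `power_draw`; whether the component is a GPU driver, an application or the rendering process itself is invisible to it.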

3.7.1 Power management at the application level
As mentioned above, power management decision-making could potentially be implemented at the application level. This section assumes that the power management module resides in the operating system and that applications exist in the system as PMCs. Some forms of power management can only be implemented in the application layer. Carla Schlatter Ellis[12] makes a good argument for higher-level power management. She claims that the application should play an important role in power management and that the operating system should provide the application with a power management interface.

Figure 3.2: An example of the input/output of the power management module. Lafruit et al. [18]

The operating system should provide the application with updates about the power state of the system. The application itself has the most knowledge about which services it should offer to the user and about the importance of different aspects of those services. Trade-offs exist at the application layer specifically, but not exclusively, for computer graphics. A GPS application, for instance, has a non-trivial trade-off between processor idle time and polling frequency[12]. A power management interface between operating system and application is potentially beneficial to both parties. When an application communicates its future requests, the operating system is able to incorporate this information in its decision-making process. Applications can also request task-specific power management by communicating with the PM module[21]. A power-aware application is able to scale its service content in an efficient way, since it has full knowledge of its own operation. Power-aware applications can therefore be less conservative in their graceful degradation schemes.

3.7.2 Power management at the driver level
A device driver is relatively easy to model as a PMC. Reducing the maximum utilization or performance of a hardware device often corresponds to a separate system feature, making drivers ideal places to implement power management. The information the PM needs from drivers includes the power requirements of each state, the transition energy and the transition delay[21]. These parameters can be submitted during a handshake phase at system startup, or when a new device is connected to the system. The current status of a device should be made visible to the power manager so that it can make accurate power management decisions, even if the device has internal management[35].

3.8 Workload prediction for graphics applications

If it is possible to predict the workload of the rendering process, then predictive DPM techniques (Section: 3.3) can be used to save energy. Prediction is also vital if the system is to enforce constraints on the amount of rendering that is allowed. The frame structure of a 3D application offers a rich set of structural information which can be used to predict future workload. Gu et al.[17] have shown that it is possible to predict the future workload of a computer game frame with high accuracy. Important rendering parameters include average triangle area, triangle count, average triangle height and vertex count[24]. According to Lafruit et al.[18], the number of vertices and the number of rendered pixels of an object are the most significant parameters when calculating the complexity of the rendering stages. The workload of processing a frame is almost linearly correlated with its rasterization workload, and all primitives of the same type can be rendered in approximately the same time for 3D games and other applications[17]. An example of on-the-fly calculation time prediction has been implemented by Lafruit et al.[18]: their Computational Graceful Degradation technique uses scene description parameters to predict the workload of each individual scene, and the content is then scaled accordingly to fit the rendering within the allotted time frame. A performance heuristic often used[33] for modeling the rendering pipeline execution time is shown in Equation 3.1.


T(x) = max(c1 · V(x), c2 · P(x))    (3.1)

where x is the object being rendered, V(x) is the number of vertices of the object, P(x) is the number of projection fragments of the object, and c1 and c2 are constant per-vertex and per-fragment costs.
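Equation 3.1 is trivial to evaluate, which is exactly why it is popular for on-the-fly prediction. In the sketch below the constants c1 and c2 are made-up values for illustration; in practice they would be calibrated for the target platform.

```python
def render_time_estimate(vertices, fragments, c1=1e-6, c2=2e-8):
    """Equation 3.1: frame time is bounded by the slower of the
    geometry stage (c1 per vertex) and the raster stage (c2 per fragment)."""
    return max(c1 * vertices, c2 * fragments)
```

With these assumed constants, an object with 10,000 vertices covering 1,000,000 fragments is raster-bound (20 ms), while one with 100,000 vertices and the same coverage is geometry-bound (100 ms), reflecting the max() structure of the heuristic.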

3.8.1 Signal-based workload prediction
Analytical schemes for workload prediction are often computationally expensive. The developer often has to compromise on computational complexity; otherwise the computational overhead might cause the rendering to miss frame deadlines. Analytical prediction models may also have problems with specific applications. Workload estimation techniques that take additional input about the executing application are one way to address this[24]. To reduce computational complexity and improve the accuracy of workload estimation, Mochocki et al.[24] suggest using a signal-based prediction scheme. The model does not require an elaborate system model; instead it uses cause-and-effect reasoning to assign each frame signature a workload taken from actual measurements. When a new frame arrives in the pipeline, a signature containing a subset of graphics rendering parameters is compared with previously stored signals, and if there is a good enough match, the workload of the stored signal becomes the estimate for the frame to be rendered. The signal-based approach demands a minor adjustment to the regular 3D graphics pipeline: in order to collect enough information about each frame to calculate its signature, a signature buffer has to be implemented in the middle of the geometry rendering stage. The benefits compared to analytical methods are substantial in terms of accuracy. The signal-based scheme is easy to understand, has a tolerable computational complexity and is sensitive to application-specific workloads. For a 3D graphics benchmark rendering scene, the prediction error never exceeded 3%, a substantial improvement compared to analytical methods[24].
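The matching step can be sketched as below. The distance metric, the threshold and the tuple representation of a signature are all assumptions made for illustration; the actual signatures in [24] are built from specific rendering parameters collected in the signature buffer.

```python
def predict_workload(signature, history, threshold=0.1):
    """Return the stored workload of the closest past signature, or None.

    signature: tuple of normalized rendering parameters for the new frame.
    history: list of (signature, measured_workload) pairs from past frames.
    """
    best, best_dist = None, float("inf")
    for past_sig, workload in history:
        # Simple normalized L1 distance between parameter tuples (assumed metric).
        dist = sum(abs(a - b) for a, b in zip(signature, past_sig)) / len(signature)
        if dist < best_dist:
            best, best_dist = workload, dist
    # Only trust the match if it is close enough; otherwise fall back
    # to some default estimator (signalled here by returning None).
    return best if best_dist <= threshold else None
```

After the frame is rendered, its measured workload would be appended to the history together with its signature, so the table improves as the application runs.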

3.9 Hardware support for dynamic power management

Without hardware support, the power management capabilities of a system are limited. Hardware support for power management in embedded systems includes dynamic process and temperature compensation (DPTC), active well-bias, clock gating, sleep modes and dynamic voltage and frequency scaling (DVFS). The platform used for the practical part of this thesis supports all of these features, but DVFS is unfortunately not available for the GPU. The DPTC mechanism measures the circuit's speed dependency on the process technology and operating temperature and lowers the voltage to the minimum level needed to support the required operating frequency. Active well-bias minimizes leakage current by lowering the well power to the transistors in the circuit. Both DPTC and active well-bias are hardware support for static power management and are considered outside the scope of this thesis, which focuses on the use of different power modes and, to some extent, DVFS. Equation 3.2 describes the dynamic power consumption of a processor.

P = C · V² · F    (3.2)

where P is the dynamic power, C is the capacitance switched per clock cycle, V is the supply voltage and F is the switching frequency.
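Equation 3.2 in code form shows why voltage scaling matters more than frequency scaling. The capacitance, voltage and frequency values below are illustrative, not i.MX31 figures.

```python
def dynamic_power(capacitance, voltage, frequency):
    """Dynamic power P = C * V^2 * F, in watts."""
    return capacitance * voltage ** 2 * frequency

full = dynamic_power(1e-9, 1.4, 500e6)      # full speed: 0.98 W
# Halving the frequency alone halves the power...
half_f = dynamic_power(1e-9, 1.4, 250e6)    # 0.49 W
# ...but if the lower frequency also permits a lower voltage,
# the quadratic term compounds the savings.
half_fv = dynamic_power(1e-9, 1.0, 250e6)   # 0.25 W
```

Note that the slower clock also lengthens execution time, so the energy of a fixed computation does not drop linearly with power; the net saving comes from the voltage reduction that the lower frequency enables.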

3.9.1 Power modes
A power mode is a power-performance trade-off mode of operation for a system component, which can be either hardware or software. In principle, a PMC could offer a continuous range of power modes, enabling the power manager to fully utilize the power saving capabilities of the device and of the system as a whole. In practice, the hardware overhead and increased design complexity that such fine control requires mean that most components offer a very limited number of power modes. Mode transitions typically come at a non-negligible cost in performance, delay, or both, and these transition costs have to be taken into account when designing power management systems. High power states generally have smaller transition latency and higher performance than low power states[5].
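The transition cost argument leads to the classic break-even calculation: entering a lower mode only pays off if the idle period is long enough to recoup the transition overhead. The sketch below is a textbook illustration with made-up parameters, not a policy from the cited works.

```python
def break_even_time(p_active, p_sleep, e_transition, t_transition):
    """Minimum idle length (seconds) for which a low-power mode saves energy.

    p_active, p_sleep: power draw in the high and low mode (watts).
    e_transition: total energy cost of entering and leaving the mode (joules).
    t_transition: total time spent transitioning, doing no useful work.
    """
    saved_per_second = p_active - p_sleep
    # The transition energy must be recouped by the per-second savings,
    # and the idle period must at least cover the transition itself.
    return max(t_transition, e_transition / saved_per_second)

def should_enter_mode(predicted_idle, p_active, p_sleep, e_tr, t_tr):
    return predicted_idle > break_even_time(p_active, p_sleep, e_tr, t_tr)
```

Combined with a workload predictor (Section: 3.8), this lets a power manager decide per idle interval whether a transition is worthwhile.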

Clock gating Integrated circuits consist of multiple components, each of which may have several power domains; a GPU usually has sub-modules that can be gated individually. Pruning the clock tree so that the flip-flops of a sub-module are halted is called clock gating. Clock gating is ideally suited for internal power management. The low performance overhead of stopping the clock makes it possible to use clock gating for very short idle periods without affecting performance in any significant way. Clock gating does not, however, eliminate the dynamic power dissipation of the external clock circuitry or the leakage current[5].

Sleep modes An ACPI-compatible device (Section: 3.4) typically features at least one sleep mode. When a component is in sleep mode, power is shut off completely to some parts of the circuit. Turning off power completely to a component requires controllable switches, as well as handling the potentially large wake-up time: when powering up, the component's operation must be reinstated, which takes even longer if there are mechanical parts that have to come up to speed[5]. Sleep modes are definitely useful for power savings, but they are less effective in real-time operating systems, which are more or less always in an active state[29].


Mode transition The increasing need for power management in embedded systems has led to the construction of devices featuring many intermediate power modes, where they previously had only two: on and off. Components such as CPUs and GPUs today typically feature several operational and even non-operational modes. Modern hard drives have dozens of operational modes, and some devices feature scalable power levels. Hard timing constraints and subsystem dependencies have to be considered when designing a mode transition algorithm[19]. The optimal mode transition sequence is hard for programmers to identify, since it quickly becomes complex as the number of modes increases, and the results can sometimes seem counterintuitive. Liu et al.[19] have analyzed idle-mode transition sequences at the component and system level. Their algorithm calculates the optimal mode transition sequence for an arbitrary number of devices and low-power modes in logarithmic time. The first stage of the algorithm identifies the optimal mode transition under timing constraints. The second stage calculates the system-wide energy savings potential by using the constraints of all subsystems and choosing the mode transition that provides the largest energy savings at the system level. The power manager is not aware of the optimal sequence for each resource; this information is handled internally by each driver. In systems where the idle energy cost equals the active energy cost, the algorithm outperforms traditional optimization schemes by 30%-50% in system-wide energy consumption. The algorithm only considers idle-mode transitions and can be combined with algorithms for active-mode optimization.

3.9.2 Dynamic Voltage and Frequency Scaling (DVFS)
Simultaneous scaling of processor voltage and frequency has been used extensively for many years to reduce the power consumption of CPUs and other processing units. Equation 3.2 indicates that frequency has a linear relationship with power consumption, while the supply voltage has a quadratic relationship. A processor that experiences slack time will have a lower power consumption if it is able to run at a lower voltage and frequency: the time it takes for a calculation to complete increases, but the overall power consumption is reduced. If the processor is able to stay in a busy state at a low voltage/frequency, then the overhead of mode transitions can be avoided. Using DVFS on a GPU or a CPU performing graphics rendering will increase the frame latency (Section: 4.5) slightly but reduce power consumption significantly. Park et al.[27] have used DVFS on a development board with performance similar to the i.MX31 to achieve energy savings of up to 46% compared to a non-DVFS scheme, taking advantage of both intra-frame and inter-frame slack to scale down the CPU frequency in times of reduced workload. Intra-frame DVFS conservatively identifies rendering slack time caused by the imbalance between different stages in the pipeline, while inter-frame DVFS picks up any remaining slack between frames.
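A minimal sketch of frame-based DVFS follows: given a predicted frame workload, pick the lowest available frequency that still meets the frame deadline. The frequency table is hypothetical and the scheme is deliberately simpler than the intra-/inter-frame algorithm of Park et al.[27].

```python
def select_frequency(workload_cycles, deadline_s, freqs_hz):
    """Return the lowest frequency that finishes the predicted workload
    before the deadline; fall back to the highest frequency if none does."""
    for f in sorted(freqs_hz):
        if workload_cycles / f <= deadline_s:
            return f
    return max(freqs_hz)
```

At a 60 Hz frame rate (deadline about 16.7 ms), a frame predicted to need 8 million cycles forces the highest setting of an assumed {133, 266, 400, 532} MHz table, while a 1-million-cycle frame can run at the lowest, spending the slack at reduced voltage instead of idling at full speed.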


3.10 Summary

Embedded systems rely on battery power to operate. A combination of static and dynamic power management is needed to prolong the system life-time. Most power management frameworks use a power management module that resides in the operating system[12]. The power management module is responsible for monitoring system operation and actuating DPM policy. A good PM framework will include both power aware applications and hardware drivers as power manageable components[5]. How a system is configured to achieve its power consumption objectives is called the power management policy. Most power management policies optimize power under performance constraints, but for some systems it is more important to guarantee system life-time than performance[29]. Power aware system components provide several power modes; a low-power mode usually has some form of reduced performance characteristics. The rendering process can be considered a functional block, and if it is able to scale its operation and thereby reduce energy consumption then it can be modeled as a PMC. The power management frameworks studied for this thesis only consider PMCs with a restricted range of functionality; they fail to address how a component as complex as a graphics rendering system can be modeled. Hardware support is required to perform dynamic power management. Dynamic voltage and frequency scaling (Section: 3.9.2) is a technique that enables processing units to run at reduced clock frequency in a low power mode.

Chapter 4

Power management methods in computer graphics software

Modern embedded systems require both high performance 3D graphics and low power consumption. The rendering of 3D graphics is a computationally heavy and memory intensive task which consumes a lot of energy in the CPU, GPU and memory. The major part of this chapter is dedicated to describing the basic operations that are required to render graphics, and whether there exist alternative rendering techniques that produce similar results. If there exists more than one method, and the methods come at different computational complexity, then a less expensive method can be used, thereby reducing the rendering workload. Modern rendering solutions comprise an almost infinite number of visual effects. It would not be possible to describe every effect in this thesis; instead this chapter lists the basic operations of the graphics pipeline in an embedded system, such as the one found on the i.MX31 board. When designing a power aware graphics driver it is only natural to target the processing units specifically. The GPU consumes power when it is working, but also whenever it has to fetch data from external memory. External memory access is one of the most energy consuming operations in embedded systems[3]. To achieve energy efficiency in an embedded system, both the rendering workload and the number of memory accesses have to be addressed. This chapter describes how to produce energy efficient graphics in an embedded system. The methods are divided into sections in a top-down approach.

4.1 Energy efficient graphics applications

3D graphics rendering is a powerful tool for creating visually pleasing and effective applications, but rendering 3D content comes at the cost of increased computational workload. When the GPU and CPU are running at high capacity, a lot of power is consumed. Developers have to be careful not to use superfluous animations and should make power consumption a part of the development process. This is especially true for embedded systems, where battery power is in limited supply.


Adding detail to scenes Most 3D scenes contain more objects than a human can categorize and remember instantaneously[7]. Developers of interactive 3D applications must determine whether adding detail to a scene is justified. More objects in a scene will result in higher rendering times and higher energy consumption. Removing objects from a scene will sometimes not only save time and power but also make the scene less cluttered and improve user acceptance.

Graphics application development Developers of graphics applications have the ability to affect the power consumption of 3D rendering by writing efficient code. Developers who are aware of the inner workings of the platform they are developing for will be able to balance the workload properly across the different pipeline stages. How calls are made to the graphics library can affect how fast a scene is rendered, even if the images produced are identical. For instance, trying to read from a texture or vertex buffer which is currently locked by the GPU causes the CPU to stall; the GPU might likewise have to stall because it requires a resource that is currently locked by the CPU. In general, applications should avoid accessing resources that the GPU might need during processing. Applications should also strive to maximize batch size as much as possible. A batch is a set of primitives submitted to the graphics library in a single API call, and every batch causes a small CPU overhead. The number of batches can be drastically reduced by intelligent application design[14].
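The batching idea can be illustrated API-independently: consecutive draw calls that share the same render state can be merged into one submission, cutting the per-call CPU overhead. The `(state_id, primitives)` representation below is an assumption made for illustration, not an OpenGL structure.

```python
def batch_draw_calls(draw_calls):
    """Merge consecutive draw calls that share render state.

    draw_calls: list of (state_id, primitives) pairs in submission order.
    Returns a shorter list of (state_id, merged_primitives) batches.
    """
    batches = []
    for state, prims in draw_calls:
        if batches and batches[-1][0] == state:
            # Same state as the previous call: grow the current batch
            # instead of paying the overhead of another API call.
            batches[-1][1].extend(prims)
        else:
            batches.append((state, list(prims)))
    return batches
```

In a real application the bigger win usually comes from sorting draw calls by state before merging, so that mergeable calls become adjacent in the first place.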

Improving user productivity When designing 3D graphics applications it is not enough to minimize power consumption for a given computational task; improving user productivity is equally important to energy efficiency. Zhong et al.[36] have come up with a useful definition of energy efficiency from the user's perspective: not only the lifetime of the battery, but also how much the user is able to accomplish before the battery runs out. Rendering techniques such as occlusion, shadows, contrast and perspective are powerful tools for creating more user-friendly applications in handheld systems[8]. Both navigation and text readability can be improved by making better use of such methods. There exists a trade-off between user productivity and power consumption. With intelligent use of 3D animations, user productivity can be increased: even though the instantaneous power consumption may be higher while the user is active, the user is able to accomplish a task in less time, enabling the system to go into sleep mode earlier, which results in lower overall energy consumption. Interactive applications typically have a point where additional power allocation does not add to user-perceived quality, because of the time it takes for the user to respond to visual input[35]. An energy efficient graphical user interface is often one that enables the system to accomplish a task while waiting as little as possible for user input. A more aggressive approach is to try to predict user behavior and have the result ready even before the next input; the auto-complete feature in many search fields today is an excellent example of this approach[36].


4.2 Level of detail

Even if unnecessary rendering should be avoided, it is not desirable to avoid rendering completely; for many applications the whole point is to provide the user with 2D or 3D graphics. With the growing complexity of polygonal models comes a need for fidelity and quality control. If visual appearance can be maintained with fewer polygons, then a substantial amount of rendering computation can be removed. The level of detail (LoD) concept involves decreasing the complexity of a 3D object representation according to metrics such as object importance or position. Level of detail techniques increase the efficiency of rendering by decreasing the workload on the graphics pipeline stages, especially vertex transformations. Nicolas Tack et al.[33] have written one of the papers that describe an algorithm combining demands for minimum frame rate and image quality with power management. The algorithm lets the user set a requested maximum error and exploits the remaining time to reduce power consumption. They use a geometric error model based on the mean squared error (MSE) to approximate the impact each object has on perceived visual quality, and iteratively increase the LoD of the object that yields the highest distortion reduction at the smallest cost until the desired error is obtained. The time remaining after rendering the frame can be spent in idle mode to save power. The algorithm provides energy savings between 30% and 76% compared to traditional optimization techniques.

Figure 4.1: Trade-off between the geometric error and computational cost, Tack et al.
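The greedy refinement loop can be sketched as follows. The per-object error/cost tables are made-up numbers, and this simplified version sums object errors rather than using the MSE model of Tack et al.; it only illustrates the "best distortion reduction per unit cost" selection rule.

```python
def refine_lods(objects, error_target, cost_budget):
    """Greedy LoD refinement.

    objects: {name: [(error, cost), ...]} with LoD levels ordered coarse
             to fine; error falls and cost rises with each level.
    Returns (chosen LoD per object, total error, total cost).
    """
    lod = {name: 0 for name in objects}
    total_cost = sum(levels[0][1] for levels in objects.values())
    total_error = sum(levels[0][0] for levels in objects.values())
    while total_error > error_target:
        best, best_gain = None, 0.0
        for name, levels in objects.items():
            i = lod[name]
            if i + 1 >= len(levels):
                continue  # already at the finest level
            d_err = levels[i][0] - levels[i + 1][0]
            d_cost = levels[i + 1][1] - levels[i][1]
            gain = d_err / d_cost if d_cost > 0 else float("inf")
            if gain > best_gain and total_cost + d_cost <= cost_budget:
                best, best_gain = name, gain
        if best is None:
            break  # no affordable refinement left
        i = lod[best]
        total_error -= objects[best][i][0] - objects[best][i + 1][0]
        total_cost += objects[best][i + 1][1] - objects[best][i][1]
        lod[best] = i + 1
    return lod, total_error, total_cost
```

Any rendering time left within the budget after the error target is met can then be spent in idle mode, which is where the energy saving comes from.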


Figure 4.2: (a) Spheres rendered with 23284800 vertices. (b) Spheres rendered with 1094400 vertices. The 21-times reduction in level of detail is barely perceptible. http://en.wikipedia.org/wiki/Level_of_detail, 2010-03-25


The reason why LoD control has received such attention from the research community is its high potential for time and energy savings. Figure 4.1 serves as an illustrative example: as the number of triangles used to render the bunny object increases, the geometric error curve soon levels out. A small increase in the visual quality of a scene can therefore lead to a large increase in execution time. The reduced visual quality of an object often goes unnoticed, because it has little effect on the object's appearance when the object is distant or moving fast, or simply because the user's attention is focused elsewhere in the scene.

4.2.1 Simplification models and progressive meshes

Figure 4.3: An aeroplane model at different levels of simplification. Watson et al.[34]

The level of detail that is actually necessary to preserve convincing realism varies within a scene. Objects that are close to the camera have to be modeled with very high vertex resolution to be convincing. It is therefore useful to have several representations of the same object with varying degrees of resolution; that way, applications can choose to render some objects at reduced complexity. Simplification models take an object with a high vertex count and try to compute an approximation using fewer vertices while preserving image fidelity[15]. Using a simplification method it is possible to produce progressive meshes. A progressive mesh is a representation of an object at various levels of detail: even a low vertex count progressive mesh describes the overall shape of the object, and as more data is added, new vertices increase the level of detail. Producing progressive meshes is computationally heavy and is therefore done offline.


4.2.2 Level of detail control
High-performance graphics systems have used level of detail control for several years to speed up rendering while maintaining image quality[11], and with the advanced rendering capabilities of today's embedded systems, LoD control is making its way into mobile phones. LoD control typically works in one of two ways: either a method for mesh simplification is built into the graphics application itself, or a scene graph toolkit is used to balance LoD against other parts of the rendering process. Both solutions require the developer to construct a custom system for LoD control, and this might be the reason why there have been only partially successful attempts to create a unified interface for LoD control. The GLOD API developed by Cohen et al.[11] can be used to control LoD in a standardized fashion, which enables developers to focus on application development while leaving LoD control to the GLOD system. GLOD is meant to be flexible enough to work alongside different versions of the OpenGL API. Not only does GLOD provide a method for LoD control on an object-to-object basis, but it can also adapt the LoD for separate parts of large objects that span large parts of the display. Adjusting the LoD of such a large object will cause its distant parts to be rendered at lower LoD, thereby saving precious CPU cycles. LoD control is certainly an excellent example of how graceful degradation exists in the rendering pipeline. The subject is far from trivial, and due to time restrictions, implementation details and a more in-depth study have to be left as future work.

4.3 Power awareness in the rendering pipeline

The application can become power aware by making intelligent use of a limited rendering process, but the rendering process can also become power aware itself. Power aware rendering requires a combination of dynamic power management methods and optimizations that do not decrease performance. The headings in the following sections do not directly correspond to the pipeline stages mentioned in Chapter 2. The different stages and the power-saving methods will be described in the order they are handled by the graphics pipeline (Section: 2.2).

4.3.1 Per-vertex transform and lighting
Vertices in the 3D scene are lit according to color and the light sources within the scene. Vertex lighting is usually computed using the Phong[28] or Blinn-Phong[6] lighting model. Phong lighting requires that the reflective properties of each object are described in terms of ambient, specular and diffuse reflection. For each light within the scene, the specular term is calculated from the dot product of the reflection vector R and the viewer direction V. Computing the reflection vector is computationally heavy, and Blinn's modification to the Phong model avoids this operation by instead comparing the surface normal N with the halfway vector H, which approximates the relationship between R and V. The halfway vector is found by adding the light direction L and the viewer direction V and normalizing the result.

29 Power management methods in computer graphics software

Figure 4.4: The vectors needed to calculate Phong and Blinn-Phong lighting.

It is possible to use vertex lighting to interpolate the values of individual pixels, but it is also possible to calculate lighting on a per-pixel basis. Whichever method is used, the process is called shading and will be explained in section 4.3.3. Lighting on a per-vertex basis is computationally heavy for scenes containing a lot of geometry data. Calculating vertex normals and tangents on the CPU and storing them reduces the complexity of vertex processing but requires slightly more vertex fetching[14].

Vertex caching Vertex caching is a transform and lighting optimization technique employed by, among others, Park et al.[27]. Neighboring triangles often share vertices in the rendering pipeline. Before a vertex enters the transformation stage the vertex cache is searched, and if the same vertex has been processed before the results are reused, thus avoiding performing transformation and lighting for the same vertex twice.

Many lights For each light source additional information is required to calculate lighting. Iterating over many light sources can increase the computational requirements for lighting, and using only a subset of local light sources can speed up the process. Choosing only the lights closest to the viewpoint, or only the strongest light sources, are two possible strategies, based on the assumption that these lights contribute the most to the overall appearance of the scene.

Light maps Lighting calculations per vertex can easily become computationally heavy as the number of vertices in a scene increases. Even with hardware support, per-vertex lighting such as Phong lighting, or even worse, ray tracing might lead to frame rendering times of up to several seconds. Light maps are per-pixel lighting algorithms which use textures containing lighting information and multiple rendering passes to perform another type of hardware accelerated lighting. Illumination maps are best used for global lighting and are not as good at handling local light sources[?]. Light maps can be used as a substitute for vertex lighting, or the lighting methods can work in parallel. Some objects within a scene might be lit using Phong or some other lighting method while other objects are lit using a light map. If the object is stationary and there are no moving light sources within the scene then a light map can provide excellent lighting at low computational cost.

4.3.2 Clipping and culling Only objects that are visible to the user need to be rendered. Clipping is the process of removing vertices outside the current view. The more vertices that can be clipped, the less time the rendering process will take and the more power will be saved. Clipping is non-trivial. Objects that are partially within the field of vision must be cut off where the object intersects the border of the view volume. There are different methods for removing the part of the object that is not visible. A technique related to clipping is occlusion culling. While clipping is concerned only with removing objects that are outside the field of vision, culling is the process of identifying objects that can safely be omitted from the rest of the rendering process because they are completely or partially occluded by other objects. It is also unnecessary to draw the parts of an object that face away from the camera, since they are occluded by the front of the object. The issue of partially occluded objects applies to culling as well. The algorithms used to identify content that can be culled are themselves computationally heavy, and programmers have to take care when implementing culling so that the culling process does not require more power than it saves. To reduce the complexity of the culling process many culling algorithms are heuristics and approximations. One way to achieve graceful degradation is to use a less conservative approximation when identifying objects that can be removed. That way, more computations can be saved by not rendering, but sometimes objects that are within the field of vision will be removed as well, occasionally resulting in images of severely lower quality. Culling is useful to both scanline- and tile-based rendering; in both cases it reduces the number of depth comparisons and the amount of depth information that has to be stored in memory.

Z-max culling Z-max culling is a commonly used occlusion culling algorithm. For each tile of e.g. 8 ∗ 8 pixels the maximum z depth value is stored. If the stored value is smaller than the smallest z value of a triangle, the triangle is hidden and is not rendered in that tile. To update z-max for a tile all the z values of that tile have to be read, which can be expensive[1]. To increase the effectiveness of z-max culling the scene should be rendered in a roughly front-to-back order[14].

31 Power management methods in computer graphics software

Z-min culling A complementary culling technique is z-min culling. The idea is to determine if a triangle is definitely in front of all previously rendered geometry. In that case, there is no need to perform z-buffer reads for the tile. It works similarly to z-max culling, but instead of storing the maximum z value of a tile it stores the minimum value. If the maximum z value of the triangle is smaller than the z-min value of the tile then the whole triangle is in front and all pixels in the triangle will be rendered. The z-min culling technique reduces the total bandwidth needed for depth reads. It also affects the internal / external memory bandwidth ratio positively, resulting in even lower power consumption[1].

4.3.3 Fragment shading Lighting has to be added on a per-pixel basis to make a 3D scene more realistic. The rasterization process uses different shading techniques to calculate the impact lighting has on pixel color. The absence of shading is referred to as flat shading, in which case the pixel color is the same for each pixel within a triangle. A pixel shader in OpenGL is referred to as a fragment shader, which is the name used in this thesis. The traditional method for fragment shading is to interpolate the pixel value using vertex lighting data. Gouraud shading is a technique that is used for this purpose. An alternative to Gouraud shading is Phong shading. Phong shading is basically a per-pixel generalization of the Phong vertex lighting algorithm (Section: 4.3.1). Phong shading assumes that the curvature of a triangle is uniform between vertices and performs Phong lighting using the approximated normal at each pixel. The computational cost of Gouraud shading is much less than that of Phong shading. The power consumption ratio can be described using equation 4.1.

P_phong / P_gouraud = (20p + 6x + 24) / (p + 2x + 35)    (4.1)

where x is the number of pixels along a side of a triangle and p is the number of pixels within the triangle[13].

Depth first rendering Depth first rendering is a technique that can be used to speed up the fragment shading stage of the graphics pipeline. The first rendering pass is done without any color or shading information. When all geometry has been rendered, color and shading are added to the scene. This way, no extra calculations are needed to render invisible surfaces. Depth first rendering also has a positive effect on frame-buffer bandwidth[14].

Adaptive shading Jeongseon Euh and Wayne Burleson[13] present an adaptive shading algorithm which enables the rendering process to make an intelligent decision about the optimal shading technique. The algorithm considers both graphics content and human perception, and sets the shading technique individually for each object in a scene. The algorithm provides power savings of around 80% for the shading part of the pipeline. The paper also describes an approximate Phong algorithm which mimics the results of Phong shading while reducing power consumption by 37.5%. This is an additional option when choosing shading method.

4.3.4 Fog Fog is a visual effect employed in some applications. Distant objects are blended with a fog color to give the illusion of distance. Disabling fog saves a relatively small amount of computation, with a potentially large impact on perceived image quality even though very little structural information is lost.

4.3.5 Texturing Two-dimensional images called textures are draped over the geometry using interpolation, in a process called texturing. The fundamental building blocks of textures are called texels. All modern graphics hardware implements perspective correct texturing. It is possible to use less accurate interpolation methods, but these methods might result in noticeable artifacts on triangles that are at an angle to the plane of the screen. Multi-texturing is the process of applying several textures to a polygon in one pass without the need for texture blending. Multi-texturing is supported in OpenGL|ES and most modern rendering architectures. Textures are not only useful for adding detail to a scene but can also be used for a number of visual effects such as illumination (Section: 4.3.1) and bump mapping.

Texture caching Textures are stored in high-speed caches close to the GPU. The GPU may have to make several accesses to each texture since a texture might span several triangles and pixels. The order in which vertices are processed and the method for loading textures into the cache affect the speed of rendering. By processing vertices in an order according to their position in the 3D scene, the chance that the appropriate texture is already loaded into the cache memory is increased. Decompression of textures is usually cheap compared to the extra texture fetching bandwidth that uncompressed textures would require, but it does add some computation[14]. Caching is an effective technique to reduce the number of off-chip memory accesses and also hides the latency of texture fetching from memory[1].

Texture compression The core idea behind texture compression is to compute lossy compressed versions of a texture. Decompression of the texture is performed at run-time. Texture compression is usually done offline and can therefore be allowed to take more time than decompression. Using compressed textures increases the effectiveness of the texture cache since it enables more textures to be stored in the cache at the same time, albeit at lower quality. Texture compression reduces the need for memory bandwidth, which in turn can be used by hardware designers to reduce gate count on chip[32]. As a textured polygon moves farther away, a larger part of the texture falls within each pixel. Accurately calculating the pixel value requires averaging all texels that fall within the pixel. Mipmapping solves this problem by storing the texture at several pre-filtered resolutions. Mipmapping is the texture equivalent of level of detail control (Section: 4.2). Compression saves storage space and reduces the amount of data that has to be sent over the network.

Nearest-neighbor interpolation is the fastest texture filtering method. It sets the pixel value to the texel that is closest to the pixel center. Nearest-neighbor results in a high number of artifacts. Mipmapping improves the performance of nearest-neighbor interpolation significantly, since each mipmap level contains an average of the nearest texels appropriate for the viewing distance. Bilinear and trilinear filtering further improve the quality of distant textures. Bilinear filtering uses the average of the four texels closest to the pixel center to set the color. Trilinear filtering calculates the pixel color by interpolating the results of bilinear filtering at two different mipmap levels. Filtering techniques that use square sample footprints will produce good results only as long as the viewer is looking at the texture straight on. Anisotropic filtering samples textures in a trapezoidal shape, which produces better results when the textured surface disappears into the distance. Anisotropic filtering can be used in combination with the other filtering methods.

Developers have adopted Ericsson Texture Compression (ETC) as a new codec for OpenGL|ES[8]. It was first presented under the name iPACKMAN in a paper by Jacob Ström and Tomas Akenine-Möller[32], along with a description of the required hardware architecture. ETC is a compression technique that provides higher accuracy than bilinear mipmapping, while at the same time not being as computationally heavy as trilinear mipmapping.
The ETC compression system uses blocks of 4 ∗ 4 pixels, which enables spatial redundancy in a large area to be exploited. The strength of the ETC technique is its versatility. The average color of each block of 2 ∗ 4 pixels is stored using 12 bits. Another 16 bits are used at decompression to retrieve the relative luminance of each pixel by searching a table of stored numbers. 3 bits are used to decide which lookup table to use; 3 bits means there are 2³ = 8 tables to choose from. Each table entry has 4 values. If the luminance is evenly distributed within the 4 ∗ 4 block, the average color value of each of the 2 ∗ 4 blocks is stored using twice the resolution (24 bits). This enables high-resolution color at the expense of compression resolution. One bit is used to store whether a block should be stored with high color precision or as two separate color blocks of 2 ∗ 4 pixels. The 4 ∗ 4 block can thus be stored using 2 ∗ (12 + 16 + 3) + 1 = 63 bits. The remaining bit is used to indicate the direction of the 2 ∗ 4 blocks within the 4 ∗ 4 block, which can be either vertical or horizontal. ETC uses an error metric to decide if the block should be split in two. The perceptual error can be calculated using equation 4.2.

e_perception(u, v)² = w_r²(u_r − v_r)² + w_g²(u_g − v_g)² + w_b²(u_b − v_b)²    (4.2)

where u and v are the vectors of the two colors and w is the vector (√0.299, √0.587, √0.114). w is chosen so that the luminance of the projected color appears exactly the same as the desired color to the human visual perception system, in which green is a more important color than red and blue[32]. The images in figures 4.5 and 4.6 display the relative strengths and weaknesses of ETC compared to the S3TC compression algorithm.

Figure 4.5: ETC is relatively weak in regions with smooth chrominance transitions (panels, left to right: original, S3TC, ETC).

Figure 4.6: ETC shows its strength when it comes to preserving luminance detail (panels, left to right: original, S3TC, ETC).

4.3.6 Anti-aliasing Low cost anti-aliasing has not been researched extensively. Brute force anti-aliasing schemes are common in graphics cards. Rendering is done at four times the resolution and then filtered down using the average color of 2 ∗ 2 regions of the rendered scene. Memory fetching is four times as high for these schemes, and they perform poorly for near-vertical and near-horizontal lines. Quincunx generates half as many samples as brute force full screen anti-aliasing but uses the values of neighboring pixels to perform anti-aliasing. Rotated grid super-sampling (RGSS) is just as computationally expensive as brute force, generating four samples per pixel, but performs better for near-vertical and near-horizontal edges. FLIPQUAD is a multi-sampling scheme that combines the good features of Quincunx and rotated grid super-sampling. It produces only two samples per pixel but rotates these in the same way as RGSS to improve edge anti-aliasing[1]. A series of explanatory images can be seen in figures 4.7 and 4.8.


Figure 4.7: Simple, low-cost sampling patterns (left to right): one sample per pixel (no anti-aliasing), 2 ∗ 2 super sampling (brute force), Quincunx sampling pattern (gray samples are borrowed from neighboring pixels), and rotated grid supersampling (RGSS).

Figure 4.8: Left: the RGSS pattern is adapted so that sample sharing between neighbors can potentially occur. Right: the sample pattern has to be reflected when moving to neighboring pixels. This results in the FLIPQUAD sampling pattern. Ström and Akenine-Möller[1].

4.4 Energy efficient tile-based rendering

Tile-based renderers reduce the amount of external memory accesses at the cost of increased internal memory accesses. With a larger tile size, fewer triangles have to be sent more than once to the rasterizer engine, but at the same time the number of depth comparisons increases since the number of triangles within a tile increases[3]. The increased amount of geometry data sent to the rasterizer in tile-based rendering is proportional to the amount of overlap. Overlap is the average number of tiles covered by a triangle. Overdraw is a major issue in conventional scanline renderers. It is defined as the number of fragments written to the frame buffer divided by the image size. Scenes with high overdraw and a small amount of overlap are ideal for tile-based rendering[3]. The amount of overdraw is likely to increase in the future[20].

Choosing the optimal tile size Antochi et al.[3] have identified the optimal tile size to be 32 ∗ 32 pixels, using a 3D graphics benchmark suite. They have measured the data traffic between the rasterizing engine, CPU and frame buffer. The geometry and state change data sent to the rasterizer increased by a factor 2.66, but the total external memory accesses decreased by a factor 1.96. The tile-based rendering approach almost eliminates the traffic between the rasterizer and the frame- and z-buffers.

Figure 4.9: The bounding box overlap test produces false intersections for large triangles. Antochi et al.[4]


Primitive sorting and overlap tests One of the drawbacks of tile-based rendering is the need for a large scene buffer to store the primitives of each tile. Sorting and sending the primitives to the rasterizer in an efficient manner is key to limiting memory requirements and computational complexity. Sorting the primitives as they are being buffered is one possibility, which requires additional memory space to store the bounding box parameters of each primitive. The other possibility is to sort the primitives as they are sent to the rasterizer. The bounding box test calculates overlap by comparing the bounding box of a triangle with the tile boundaries. If the bounding box intersects the tile boundary then the primitive is considered to be overlapping. The two-step method calculates and stores the bounding box of a triangle in the scene buffer. The primitives are then sorted as they are sent to the rasterizer. The bounding box test generates up to 30% false intersections for large triangles that span several tiles[4]. The computational complexity / memory trade-off regarding primitive sorting has been mapped by Antochi et al.[4] Sorting the primitives as they are sent to the rasterizer is 6 times slower than storing the primitives sorted in the frame buffer, but reduces the scene buffer size requirement by a factor 3.2. Which method performs best is application and hardware dependent. They also introduce a linear edge function test as a complement to the bounding box test commonly used to calculate overlap. The linear edge function test results in fewer false intersections, which reduces the load on the rasterizer.

4.5 Frame rate

All the above methods are concerned with image quality. For interactive applications the frame rate is an equally important performance characteristic of the rendering process. Frame rate is the update frequency of the display, measured in frames per second (FPS). Low frame rates can make it hard to follow the movement of objects within a scene and generally lower the perceived quality of the computer generated graphics. A frame rate above a certain limit does not improve the user experience[17]. This limit varies greatly depending on the application. The 24 FPS standard has been used for years in the movie industry as an acceptable rate for motion pictures. A common misconception is that high motion content requires high frame rates. This misconception has been challenged by research on streaming video. McCarthy et al.[23] have written one of the research papers stating that a frame rate above 6 FPS is acceptable for viewing streamed football video regardless of display size, even though low frame rates are slightly less acceptable on small displays. The quality of the individual frames is more important to user acceptance, and quantization effects have a much heavier impact on user acceptance than frame rates. Another interesting aspect is that reducing frame rate seems to have a linear correlation with acceptance, while reducing image quality has a logarithmic correlation.

On the other hand, interactive applications where quick reflexes are crucial, such as computer games, require significantly higher frame rates to achieve optimal user performance. Claypool et al.[9] have measured the impact of frame rate and resolution on the performance and enjoyment of users playing a 3D interactive computer game. Both frame rate and resolution have a significant impact on user acceptance, but only frame rate affected user performance. Even a frame rate reduction from 60 to 30 FPS has a slight impact on performance, and frame rates below 10 FPS have a significant impact. The required response time for an interactive application makes the minimal acceptable frame rate different from that of passively watching video. A related issue for interactive applications is frame latency. Latency is the time it takes for the system to render a frame and put it on the screen from the moment the frame is sent from the application. Much work in the scientific community has been dedicated to optimizing image quality for a given frame rate[33], but not as much has been done to do the same when power consumption is the limit. The time the processing units spend rendering is linearly correlated with workload, but the workload in 3D graphics applications varies across frames by an order of magnitude[18], and changes in frame rendering complexity happen quickly[24].

4.6 Summary

It is always important to utilize the available system resources as effectively as possible. In a system where the performance is limited by power constraints it becomes even more important to optimize the rendering process. There are two ways in which the rendering process can be made more power efficient.

4.6.1 Reduce computational complexity A GPU or CPU that is switching at peak capacity consumes a lot of power. Any rendering system has to consider how many visual effects should be included and how advanced the techniques used to calculate each effect should be. Finding rendering methods that offer high performance at low computational complexity is important. Blinn-Phong lighting and ETC texture compression are two methods that have been adopted in a vast number of systems because they provide a favorable performance / power trade-off. Graphics applications are often geared towards high performance, and it is reasonable to assume that the images produced by a constrained rendering process will still be useful to the user. A rendering process that has a complexity or power limit will most of the time not be able to deliver peak visual performance. It is possible to save power by ignoring function calls or performing less complex calculations and taking advantage of the extra slack time to enter a low-power mode. The easiest way to save CPU cycles is to reduce the frame rate of an application. Reducing the number of frames that are rendered is a sure way of decreasing the workload of the rendering process, because of the linear relationship between frame rate and workload. Changing the visual quality of an image can be disturbing to the user. It is therefore better to provide a uniform image quality between frames if at all possible[33].


While limiting frame rate is usually the safest way to save energy, low frame rates will reduce perceived quality. At low frame rates it might be preferable to reduce the quality of individual frames instead. Research shows that it is sometimes possible to achieve considerable savings without a loss of performance. If there is a way to determine which calculations contribute the least to image quality, then those can sometimes be omitted without the user noticing. The list below describes the different approaches that can be used to reduce complexity.

• Use the same method at reduced complexity. Some stages in the pipeline are calculated using methods that are scalable. Reducing the number of iterations[14], or using a less conservative heuristic, can reduce complexity. Scalable methods are generally a good solution since they provide power management policies with a large range of options.

• Use a less complex method. If there are two or more methods for computing the same visual effect then a complex method can be substituted with a less complex one. An example could be the Phong- and Gouraud shading algorithms (Section: 4.3.3).

• Turning off a visual effect. Skipping some steps or stages of the rendering pipeline is a method that can drastically reduce the amount of calculations needed. Unfortunately, it can also have a substantial negative impact on image quality. The implications of turning off an effect completely are hard to predict because of dependencies between the different pipeline stages.

4.6.2 Reduce external memory accesses One of the most energy-consuming operations in embedded systems is external memory access by the processing units. Some visual effects require more of these operations than others, and it is therefore necessary to find alternative solutions for embedded systems where power is in limited supply. External memory access is partly a hardware-related issue. The size and effectiveness of the on-chip cache memories are of high importance to power efficiency in this regard. Tile-based rendering is an approach that specifically targets the problem; the external memory accesses are reduced by a factor 1.96 for a typical software and hardware configuration[3]. The amount of memory accesses is also affected by software design. Some visual effects are more memory-intensive than others. Optimizing memory-intensive operations is especially important since the potential energy savings are large. Texturing is an operation that requires a high amount of memory accesses in quick succession. A texture cache can reduce the number of texture fetches and reduce power consumption. Multi-texturing is a feature that enables the texturing process to apply multiple textures in one pass. In advanced rendering systems that support bump mapping, light maps and other visual effects requiring multiple texture reads per pixel, multi-texturing support provides substantial energy savings.


4.6.3 Power efficiency in the graphics pipeline Below is a list of pipeline stages and the power management options that are available at each stage. A power aware graphics driver will have to optimize the rendering process by finding a combination of rendering operations that provide the highest performance given the power limit.

• Level of detail. Level of detail control is a technique that can currently only be implemented at the application level and is thus technically not part of the graphics pipeline, but the same concept of reducing computational complexity applies to LoD control. Using progressive meshes it is possible to reduce the complexity of 3D content without affecting user acceptance. Objects that are considered less important can be represented using fewer vertices, which will reduce the workload of the rendering pipeline.

• Vertex lighting. The vertices are lit according to light sources within the 3D environment. The Blinn-Phong lighting calculation is an approximation of the more computationally heavy Phong equation. Other things that affect the complexity of vertex lighting are the number of light sources and the effectiveness of the vertex cache, if there is one.

• Fragment shading. The individual pixel values are calculated in a process called rasterization. The commonly used Gouraud shading method uses interpolation to calculate pixel values within a triangle. This is an efficient method compared to Phong shading or other shading methods that calculate each pixel value individually. A radically different approach to traditional shading methods is using light maps. Light maps are textures that describe the light reflection on a surface.

• Texturing. Textures are images that are draped over the objects in a 3D scene. Texturing requires multiple accesses to external memory, but the increased power consumption is warranted since textures add realism to a scene. To reduce the memory accesses a texture cache is often used. Multi-texturing and texture caches reduce the energy consumption of the texturing stage.

• Anti-aliasing. No anti-aliasing will cause jagged edges in an image when the edges are not aligned with the pixel borders. Regular "brute force" anti-aliasing computes pixel values based on the mean value of a 2 ∗ 2 square of an image rendered at 4 times the resolution. The number of samples determines the computational complexity. Using an anti-aliasing scheme that only generates one or two samples will be less expensive in terms of rendering time.

Chapter 5

Design and implementation of a power aware graphics system

Graceful degradation can be introduced to the graphics system in the shape of a power aware graphics system. As a power manageable component the driver must be able to scale and monitor its own operation and provide an interface towards the power management module. This chapter contains a design description of a power aware graphics system intended to function as a PMC in a framework for power management. It describes the requirements the PMC will have to satisfy in order to provide this functionality, followed by an informal specification of the behavior of the driver as observed by the user, the power management module and the hardware components involved in the rendering process. Finally, the proposed system architecture and a description of the internal power management scheme are presented. Several features that would be desirable in the graphics system require further research before a complete design can be attempted. These features are included as requirements and discussed, but the design is not complete. The resulting implementation features only a subset of the functionality described in the requirements section below.

5.1 Requirements

Below is a list of the requirements of the power aware graphics driver. The full motivation behind each requirement can be found in chapters 3 and 4, but to give some understanding each requirement comes with a short motivating text. The requirement to use the OpenGL graphics library was part of the limitations for this thesis work. The first two requirements describe the software architecture that the proposed solution will have to conform to.

Req1 OpenGL|ES is one of the most commonly used graphics libraries in embedded systems. The power aware graphics system will implement the OpenGL|ES 1.1 API and provide the same functionality.


Req2 The proposed solution should be transparent to developers of graphics applications. The idea behind using a graphics library is that it should hide the underlying hardware architecture, and this property should be kept intact for the graphics system to be useful.

To be included in a power management framework the graphics system needs to conform to the PMC template. The following requirements describe how the graphics system can be modeled as a PMC and how it communicates with the power management module.

Req3 To provide support for dynamic power management the graphics driver must become power aware and feature several power modes: at least one passive and at least two active power modes.

Req4 In a system where life-time demands must not be violated, system components need to be able to guarantee that a power limit is not exceeded. The driver should be able to guarantee that the power consumption of the rendering process will not exceed a limit over a specified period of time.

Req5 To make accurate power management decisions the power management module needs constant updates about the working state of hardware components within the system. Updates about the mode of operation (working / idle) of the rendering process will be sent to the power management module.

Req6 A power manageable component should be able to communicate its power / performance characteristics to the power management module at startup. These characteristics will include the number of available power modes and, for each power mode, the following properties:

Req6.1 Utility.
Req6.2 Maximum power consumption.
Req6.3 Transition delay to other power modes¹.
Req6.4 Transition energy overhead to other power modes.

Req7 Different levels of power savings will be achieved by creating increased slack time in the rendering process. The slack time will enable the processing units in the system to enter a low-power sleep mode, thereby lowering power consumption.

Req8 Extra slack time can be created by reducing the workload of the rendering process. Below is a list of the methods that the graphics driver should be able to apply at run-time.

Req8.1 Use a less complex vertex lighting method; reduce the number of light sources.

¹How transition delay and energy overhead are defined depends on the power management framework. Some frameworks request the maximum delay and energy while others have a policy that uses the average.


Req8.2 Disable fragment shading or use a less complex shading method.
Req8.3 Disable texturing or restrict texture size.
Req8.4 Disable anti-aliasing.
Req8.5 Reduce frame rate.

Any number of visual effects could be included in the list of operations that the driver should be able to turn on / off. An area of particular interest is the possibility of integrating a system for LoD control. GLOD (section 4.2.2) or some other library for LoD control could be added to the driver to provide yet another powerful tool for performance optimization. The following requirements all have to do with optimization of the rendering process: if there is a limit on the amount of calculations that the rendering process is allowed to perform, how can the performance of graphics be optimized?

Req9 Performing optimization of the rendering process as well as more advanced forms of workload prediction requires that all OpenGL function calls needed to render a frame are stored in a trace. The frame trace contains all information needed to render the frame and can thus be used to replay the frame after it has been rendered. The graphics system will implement an OpenGL tracer.

Req10 The driver should be able to make intelligent decisions about the rendering of individual objects within the scene. In particular, the following parameters should be set according to importance. Once again, it would be very interesting to find a way to include LoD control in the rendering optimization process, but how this can be done is left as future work.

Req10.1 Which light sources should be considered when calculating lighting.
Req10.2 Which shading method to use for individual objects.

Req11 Changes in frame rate and image quality are distressing to the user. The graceful degradation algorithm will use "lazy" updates to its operation: it will only change settings when the potential performance increase is large enough.

Req12 Image quality degradation shall be optimized based on the performance / power trade-off offered by the various techniques that trade performance for power.

To further improve the effectiveness of power management of the rendering process it is possible to scale the voltage and frequency of the processing units in the system based on the predicted workload. This is possible since rendering is not a real-time task and has a workload that can be predicted with high accuracy.

Req13 The graphics driver should be able to predict the workload of a frame with a degree of certainty using an analytic approach. Studies show that analytical prediction models are generally more accurate than history-based prediction schemes.


Req14 The workload of the individual pipeline stages mentioned in section 4.3 shall be estimated as a first step towards optimization of the rendering pipeline.

Req15 The graphics system will be integrated with the DVFS driver of the development platform to produce more tangible results in terms of energy saved.

5.2 Specification

The specification of a power aware graphics driver describes the behavior of the system from a user perspective. Because the system, in this case, is not accessible to the user directly, the behavior of the driver will only be perceptible as changes in the performance of graphics applications. This complicates matters, since there exists an almost infinite range of applications that could potentially be served by a power aware graphics driver. The specification tries to describe how the system reacts to different kinds of requests.

5.2.1 User perspective

A system which features a power aware graphics system will scale graphics content to suit the system life-time demands. The system might start out running at peak capacity, but as the user realizes that the system needs to be operational for a longer time than what is possible at the current power consumption, he chooses to switch to a low-power mode which promises that the system will be operational for some additional hours. The graphics of interactive 3D applications will, at this point, no longer be able to operate at peak capacity. Reduced frame rate and image quality may be imperceptible to the user: if the frame rate drops from 60 to 30 fps the system will be running at approximately half the power but with barely noticeable quality reduction. At other times, especially when the system is experiencing a shortage of power and the demands for energy preservation are high, the degradation of graphics may be observable immediately. There is a risk that extreme deterioration of rendering quality will generate images that are unappealing, unacceptable, unusable or even unintelligible. It is up to the user and the PM policy to prioritize between system tasks and shut down rendering altogether if the graphics system causes such problems.

5.2.2 Graphics application perspective

One of the main strengths of the proposed architecture is that graphics applications will be completely unaware of the graceful degradation and will be able to operate exactly as normal. An application that is power aware will be able to scale its content. As long as the workload is sufficiently reduced by the application itself the driver will not affect the graphics content. Only if the application is unable to reduce its rendering demands enough will the driver step in and further reduce the rendering quality.


5.2.3 Power management module perspective

The driver will function as a PMC in a power management framework, featuring a PM module that will send signals to the driver process requesting that the driver change its operation in order to satisfy system-level demands for power consumption. In order to make informed decisions, the PM module needs information about the characteristics of the driver as a PMC; the driver has to communicate these parameters to the PM module. A PMC that affects several other system components is not as straightforward to manage as one that has a limited scope of operation, even though it is a clearly defined functional block. Rendering of graphics is definitely a functional block, but a PM framework that involves a graphics driver like the one proposed in this master thesis has to be able to manage complex dependencies between the graphics driver and other components. This would require additional information from the driver, but managing PMC dependencies is not part of this thesis and will not be considered further. How the power level decisions are made is irrelevant to the driver; it will blindly follow all requests for new levels sent by the PM module. The driver will always try to maximize performance and utilization of the available resources, given the power level. Choosing how to conserve energy by graceful degradation of the rendering process can be considered a form of dynamic power management. This creates a hierarchy of power management. The motivation behind such an architecture can be found in chapter 4. It is important to make the distinction between the performance of the rendering process as a whole and the performance of individual stages of the OpenGL pipeline.

5.2.4 Hardware perspective

Running graphics applications takes a heavy toll on the processing units of the system. The graphics acceleration card on the i.MX31 performs some parts of the rendering process with hardware acceleration, but the geometry stage and some synchronization operations are carried out by the CPU. The processing units utilize the system memory, buses and other components, which makes graphics rendering a feature that involves large parts of the system. The graphics library is situated in the operating system, enabling access to hardware registers and device drivers. Most of the time there will be no need to read or write registers, since most rendering operations are carried out by either the CPU scheduler or OpenGL. The operation of the graphics driver affects large parts of the system but does not affect the operation of hardware devices, except in rare cases. One such case could be retrieving some information about the intermediate stages of OpenGL operation; another is direct manipulation of the DVFS driver.

5.3 Design and implementation

A sample implementation of the graphics driver has been designed and implemented. This chapter describes how this implementation of the driver functions. The design has a wider perspective than the actual implementation. The implementation should be considered a proof-of-concept, created in order to make some preliminary measurements and validate the basic idea behind the design. This section aims to specify the main objectives of the graphics driver and the overall structure of the algorithms involved, but also contains implementation-specific information about the implementation of a power aware graphics library on an i.MX31 development board running Linux.

5.4 The interposer library

Figure 5.1: The hardware / software architecture of the power aware graphics driver.

OpenGL is hardware and platform dependent and the source code is often proprietary, which makes it difficult to find a power management solution that is applicable to most systems. Intercepting function calls is an efficient technique to control the operation of a library when the source code is unavailable. The interception of library function calls is referred to as library interposition. Interposition has been used extensively to control and modify OpenGL operation for a multitude of purposes. The BuGLe[16] software is an example of an OpenGL control and trace tool, and could possibly have been extended to provide the desired functionality of the power aware graphics system. Given the extremely tight time schedule and the presumed difficulties of cross-compiling, the idea of using an existing interposer toolkit was discarded. Graphics applications will be unaware that they are not communicating with the graphics library directly. How library interposition is implemented differs between operating systems, but an interposer solution is possible in most of them. Myers et al.[26] describe how interposition can be achieved in Windows, Mac and UNIX-type operating systems. UNIX uses the LD_PRELOAD environment variable to dynamically link in a file that contains a copy of the function call that is to be interposed. Before the operating system

checks any other library for the function, the file specified by LD_PRELOAD will be checked first. It is still possible to access the original version of the interposed function by using the dlsym() function. dlsym() takes two arguments: the first, the handle, can be set to RTLD_NEXT, in which case the operating system will skip the first match it comes across when resolving the symbol named by the second argument; the first match being the interposition library's version of the function and the second the original. The interposition library created for this master thesis work has a copy of each OpenGL|ES function called. Most functions only contain a link to the original function call and an if-statement that checks whether the underlying original OpenGL function should be called when the interposer function is called. For the driver to be able to make workload predictions or to implement more advanced forms of rendering optimization techniques it would be necessary to capture the function calls for an entire frame and then analyze the data of those calls before the frame is rendered. Apart from the function name, the calls can be broken down into defining components such as return value and parameter data types. Storing the calls before a frame is run will cause a computational overhead. Just like the rest of the power management calculations, the overhead has to be small enough that the potential energy savings can be justified.
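The interposition pattern described above can be sketched for a single entry point. This is an illustrative sketch, not the thesis implementation: the GLbitfield typedef stands in for the real OpenGL|ES header, and the deniedCalls counter exists only for this example. The sketch would be compiled into a shared object and activated with LD_PRELOAD (linking with -ldl on older glibc versions).

```c
/* Sketch of one interposed entry point: forward to the real glClear()
 * (resolved lazily via dlsym/RTLD_NEXT) only when rendering is allowed. */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <signal.h>

typedef unsigned int GLbitfield;          /* stand-in for <GLES/gl.h> */

volatile sig_atomic_t acceptCalls = 1;    /* gate shared with the pmegl logic */
static unsigned long deniedCalls = 0;     /* bookkeeping for this sketch only */

void glClear(GLbitfield mask)
{
    /* Resolve the original implementation once, skipping this library. */
    static void (*real_glClear)(GLbitfield) = 0;
    if (!real_glClear)
        real_glClear = (void (*)(GLbitfield))dlsym(RTLD_NEXT, "glClear");

    if (!acceptCalls || !real_glClear) {  /* frame not yet allowed to render */
        deniedCalls++;
        return;
    }
    real_glClear(mask);
}
```

Most interposed functions in the library follow this shape: a lazy dlsym() lookup plus one if-statement deciding whether the call passes through.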


Figure 5.2: A visual representation of some of the calculations carried out to control the driver's operation.


5.5 Energy consumption and workload

The workload of 3D graphics rendering can be defined as the difference between the start and end times for rendering a frame[24]. Calculating the frame rendering time, or workload, with the OpenGL|ES (section 2.5) library is usually done by taking the time difference between the eglSwapBuffers() and glClear() function calls[27].

Tframe = t_eglSwapBuffers() − t_glClear() (5.1)

In equation 5.1 the capital letter T represents an interval while the lower-case letter t represents a point in time. During the rendering of a frame the load on the processing units within the system will be heavy. The assumption is that the energy consumption of the rendering process is highly correlated with the workload.

5.6 The utilization limit

The driver controls the power consumption of the graphics by limiting the computational complexity of the rendering process. This is achieved by denying or modifying requests to render geometry data. The idea behind the energy conservation techniques is that an incoming rendering request can only be serviced if it has been preceded by an idle interval proportional to the desired performance level and the predicted rendering time of the frame. How long the preceding interval has to be is shown in equation 5.2.

Tpreceding = Trendering * (1/U − 1) (5.2)

This model will guarantee that the power consumption limit is not broken, as long as the rendering time is accurately predicted. All predictions are subject to some uncertainty and it is impossible to know the exact rendering time of a frame before it starts rendering. The graphics system will guarantee power consumption over one second time intervals. Frame rates are measured in frames per second, and one second intervals have been used in power management frameworks before[29, 35], which makes one second a suitable interval. Frame rendering time is the time interval between the glClear() and eglSwapBuffers() function calls. In UNIX the gettimeofday() function returns the current system time. gettimeofday() is called each time glClear() or eglSwapBuffers() is called, and by comparing the time stamps the frame rendering time can be calculated with high precision. This method will not describe exactly how much energy is consumed, or even how much time the processing units spend processing graphics, but it can be used to get a good estimate of the workload if graphics rendering is the most intensive task running. Time stamping can also provide the driver with a mechanism for intra-process communication that is not dependent on the timely delivery of messages. The pmegl thread will signal the interposer when it is allowed to run again through a time stamp in shared

memory. This approach has several benefits. The time stamps ensure that the utilization is controlled down to the microsecond.
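Equation 5.2 can be written out directly. A sketch under the stated model, with an illustrative function name; the clamp at full utilization is this sketch's addition:

```c
/* Equation 5.2: the idle interval that must precede a frame so that the
 * utilization stays at level U. At U = 1.0 no idle time is required;
 * lower utilization levels demand proportionally longer sleep. */
double preceding_interval(double t_rendering, double utilization)
{
    if (utilization >= 1.0)
        return 0.0;                            /* full speed: no idle time */
    return t_rendering * (1.0 / utilization - 1.0);
}
```

For example, at U = 0.5 a 10 ms frame must be preceded by 10 ms of idle time, and at U = 0.25 by 30 ms, so the processing units are idle 50 % and 75 % of the time respectively.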

5.7 Optimization of image quality

The primary metrics of video quality are image quality, frame rate and spatial resolution, where spatial resolution is the number of pixels per frame. Image quality and frame rate can be changed at run-time to degrade performance and save energy. Image quality can in turn be degraded using several methods. Choosing the degradation method that results in the lowest quality decrease is not trivial for a power aware driver, since the optimal method is highly application dependent.


Figure 5.3: The rendering process running at full capacity.


Figure 5.4: Reducing the computational complexity can be achieved either by (a) dropping frames or (b) reducing the complexity (and quality) of rendering.

Figures 5.4a and 5.4b visualize how the workload can be adjusted using the two main methods for reducing computational complexity. In figure 5.5b the pipeline stages have been adjusted so that the rendering process spends more time executing in the stage that contributes the most to image quality. Using the methods for image quality reduction listed in requirement 8 there is a large number of possible combinations of techniques. The optimization problem consists of finding the combination that provides the largest reduction in execution time compared to the reduction in image quality. The utility and power consumption of each pipeline



Figure 5.5: (a) A combination of the techniques in 5.4a and 5.4b. (b) The workload of the individual pipeline stages may not scale uniformly. An optimal combination of low-complexity rendering methods has to be found.

stage can be estimated using a benchmark suite of graphics applications and hardcoded into the driver header files. Alternatively, the power / performance trade-off can be mapped using data analysis of previously rendered frames. Objective image quality assessment methods can estimate the impact various visual effects have on perceived image quality, but implementing them at run-time is difficult. Generating a high quality reference image and comparing it to images generated using various combinations of image quality reduction techniques is counterproductive unless the future power savings are substantial, which might not be the case. Objective image quality methods can be used in conjunction with subjective measurements to assign utility values to the different methods. The second part of the trade-off is less complicated to calculate using an adaptive approach. In practice this is done by inserting hooks into the OpenGL architecture, signaling where one stage has finished and the next one begun. Where this is ambiguous an estimate has to be found. Using this information will enable some run-time optimization, although the optimization will at best be best-effort and not optimal in a true sense. The flow-chart in figure 5.6 describes how the iterative image degradation algorithm operates. The visual effects are ordered with the most effective method at the top. If the most effective method is active then it is turned off and the next frame is run without it. If the most effective method is not active then the second most effective is chosen, and so on until all visual effects have been turned off. If rendering images at reduced quality is not enough then the frame rate will drop even further.
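The selection loop of the iterative degradation algorithm can be sketched in a few lines. This is an illustration of the priority-ordered scan described above, not the thesis code; the fixed effect count and function name are assumptions:

```c
/* One step of the iterative degradation from figure 5.6: effects are
 * ordered most effective first; each step disables the first still-active
 * effect. Returns the index that was disabled, or -1 if everything is
 * already off (at which point the frame rate is reduced instead). */
#define N_EFFECTS 3

int degrade_one_step(int active[N_EFFECTS])
{
    for (int i = 0; i < N_EFFECTS; i++) {  /* i = 0: most effective method */
        if (active[i]) {
            active[i] = 0;                 /* run the next frame without it */
            return i;
        }
    }
    return -1;                             /* nothing left: drop frame rate */
}
```

Repeated calls walk down the priority list until all effects are off, matching the flow-chart's fall-through to frame rate reduction.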

Frame rate

Many applications run at frame rates that are unnecessarily high. A frame rate of 6 FPS is acceptable for users passively watching video[23], and frame rates as low as 10 FPS are enough for interactive applications such as computer games[9]. The first step to reduce the computational complexity of the rendering process is therefore to reduce the frame rates of applications running in the system.

Image quality

Restricting the number of frames per second provides potentially large energy savings while not being too disturbing to the user. Reducing image quality is not as straightforward as a frame rate limit and should be reserved for times when the


Figure 5.6: The image quality reduction methods are prioritized according to relative value. The visual effects are turned off one after the other until the desired frame complexity is reached.

system is going into a really low-powered mode. In such cases it might be better to reduce the computational complexity of each frame instead of dropping more frames. The flow chart presented above (figure 5.6) shows how and where a decision is made to reduce the quality of a frame. While running in a low-powered mode it is possible to encounter frames with a computational complexity that would cause the frame rate to drop below what is acceptable. When the rendering time of a frame is predicted to cause an unacceptably low frame rate, the OpenGL state will be modified so that rendering time and image quality drop. Even though some rendering stages are inherently computationally heavier than others, the complexity is also application dependent. Both the computational complexity and the impact of a particular visual effect vary between applications and even between frames.


5.8 Graphics system operation

Figure 5.7: The interposer architecture.

5.8.1 Intra-process communication (IPC)

Different methods of IPC are used to synchronize the operation of the program.

Shared memory

A thread is able to share memory with the process that starts it. Synchronization issues when using shared memory can be avoided using semaphores and mutexes, but because of lack of time these features are missing from the implemented solution. The design does however strive to minimize the risk of memory corruption by only using atomic integers that are written to by the thread and read by the parent process or thread. Shared memory is efficient in terms of speed because the overhead of reading and writing is almost non-existent. In the implemented solution shared memory is used both for the filter struct that contains the settings for the interposer and for the utilization level, which is updated when a message arrives from the power management module and read every time a frame has stopped rendering.
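The single-writer, single-reader scheme described above can be sketched with C11 atomics as the "atomic integer" mechanism. The struct layout and field names below follow the text but are otherwise this sketch's assumptions; the thesis implementation may lay out its filter struct differently:

```c
/* Sketch of the shared filter struct: one side writes each field, the
 * other only reads it, so a reader sees either the old or the new value,
 * never a torn one, without any locking. */
#include <stdatomic.h>

struct filter {
    atomic_int acceptCalls;     /* written by pmegl thread, read by interposer */
    atomic_int enableFog;       /* 0 = deny attempts to enable fog             */
    atomic_int enableLighting;
    atomic_int enableTexturing;
    atomic_int utilizationPct;  /* updated on messages from the PM module      */
};

/* Reader side, as called from an interposed OpenGL function. */
int fog_allowed(struct filter *f)
{
    return atomic_load(&f->enableFog) != 0;
}
```

Each field has exactly one writer, which is what makes the lock-free design safe despite the missing semaphores.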

UNIX message queue

The interposer sends updates about its operation to a UNIX message queue whenever a frame has stopped rendering. A message queue stores the message until the receiving process is ready to process it. In the implemented solution a message queue is used because it has the desirable feature that a process (in this case the

pmegl thread) can block until a message is received, wake up and do what it is supposed to do, and then block again until a new message is received. The type of message queue is not important since the program only sends and receives messages. Using the LINX system for the IPC between the interposer and the pmegl thread would certainly be an option.
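The frame-notification queue can be sketched with the System V message queue API. The two-timestamp message layout follows the text; the struct itself, the private queue and the round-trip demo function are this sketch's assumptions, not the thesis code:

```c
/* Sketch: send one frame report through a System V queue and receive it
 * again, as the interposer -> pmegl thread notification would. */
#include <sys/ipc.h>
#include <sys/msg.h>
#include <sys/time.h>

struct frame_msg {
    long mtype;                       /* required by msgsnd(); must be > 0 */
    struct timeval renderingStarted;  /* timestamps from the interposer    */
    struct timeval renderingDone;
};

/* Returns the frame time in microseconds, or -1 on error. */
long roundtrip_frame_us(long start_s, long start_us, long done_s, long done_us)
{
    int q = msgget(IPC_PRIVATE, IPC_CREAT | 0600);
    if (q < 0)
        return -1;

    struct frame_msg out = { 1, { start_s, start_us }, { done_s, done_us } };
    struct frame_msg in;
    long us = -1;

    if (msgsnd(q, &out, sizeof out - sizeof out.mtype, 0) == 0 &&
        msgrcv(q, &in, sizeof in - sizeof in.mtype, 0, 0) > 0) {
        /* The receiver (pmegl thread) computes the frame rendering time: */
        us = (in.renderingDone.tv_sec - in.renderingStarted.tv_sec) * 1000000L
           + (in.renderingDone.tv_usec - in.renderingStarted.tv_usec);
    }
    msgctl(q, IPC_RMID, 0);           /* remove the queue again */
    return us;
}
```

In the real driver the msgrcv() call sits in the pmegl thread's main loop, giving exactly the block/wake/block behavior the text describes.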

LINX

The LINX protocol[?] is used for the communication between the interposer and the power management module. LINX enables two processes to communicate with each other as if they were running on the same system, which they may or may not be in this case. The PM module has to be running and listening for incoming signals for the power management protocol to work. As soon as the graphics driver process is created it will send its power characteristics to a predefined address which represents the power management module. The signal contains the information needed to perform power management. The most important characteristics are the levels of power consumption and their associated performance levels.

5.8.2 Threaded architecture

When the first OpenGL function call is made and the interposer loaded, two threads are started. The pmegl thread is where the algorithm for power management is located, and the input thread handles communication with the power management module. The threads are POSIX threads and are created using the pthread_create() function. POSIX threads are supported in many operating systems, including Linux and Mac OS, and a subset of POSIX thread operations is available in Windows. Figure 5.7 visualizes the major components of the interposer architecture: the interposer threads, function calls made by external modules, internal dependencies and IPC mechanisms, all of which are explained in detail below.

pmegl thread

The pmegl thread receives updates about the operation of the application through a UNIX message queue and signals the interposer when it is allowed to continue rendering through the filter struct that is passed as an argument when the thread is started. After the thread initialization phase the pmegl thread enters the main loop, blocking until a message is received. A message is sent from the interposer library whenever a frame has stopped rendering. The message contains the timestamp when the frame started and the timestamp when it stopped. By subtracting the former from the latter the frame rendering time is calculated. This interval is multiplied by (1/U − 1) and subtracted from the available rendering buffer (section 5.8.5). The buffer is then increased by the time that elapsed before the frame started rendering. The difference between the desired buffer and the actual buffer is the time interval that OpenGL has to stay dormant to keep the utilization level. A new frame is allowed from the point in time that is calculated using equation 5.3.


newFrame = (desiredBuffer − actualBuffer) + Tcurrent + tframe (5.3)

input thread

The input thread handles communication with the power management module. When the input thread is started it hunts for the power manager process. If it is found, a message is sent that registers the interposer as a PMC. The message contains all the information the power management module needs to make its decisions for the interposer. The input thread then enters a loop, listening for messages from the PM module. The first message should be an acknowledgement that the registration was successful. Unless an acknowledgement is received the input thread will send a new registration message at regular intervals. After the acknowledgement the input thread will listen for requests for new power levels. Should the connection to the manager be lost, the input thread will once again start hunting for the power management process. For demonstration purposes, the input can be simulated using terminal input.
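The pmegl thread's bookkeeping and equation 5.3 can be written out as plain microsecond arithmetic. Variable names follow the text; the function signatures and the charge/refill split are this sketch's choices and the thesis implementation may order its bookkeeping differently:

```c
/* Charge one frame against the rendering buffer (section 5.8.5): the
 * frame time scaled by (1/U - 1) is withdrawn, preceding idle time is
 * deposited back. */
long update_buffer(long buffer_us, long frame_us, long idle_before_us,
                   double utilization)
{
    buffer_us -= (long)(frame_us * (1.0 / utilization - 1.0));
    buffer_us += idle_before_us;      /* idle time refills the buffer */
    return buffer_us;
}

/* Equation 5.3: the earliest point in time the next frame may start. */
long new_frame_time(long desired_buffer_us, long actual_buffer_us,
                    long t_current_us, long t_frame_us)
{
    return (desired_buffer_us - actual_buffer_us) + t_current_us + t_frame_us;
}
```

For example, at U = 0.5 a 10 ms frame withdraws 10 ms from the buffer, and a buffer deficit of 1 ms postpones the next frame by 1 ms beyond its nominal start time.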

5.8.3 Interposed OpenGL functions of note

Figure 5.8: The flow-chart of the graphics driver operation: (a) glClear(), (b) eglSwapBuffers(), (c) glEnable().

Some OpenGL functions have been modified to provide more control over the operation. Calls to glClear() and eglSwapBuffers() can be used to indicate when a frame starts and stops respectively, and glEnable() turns visual effects on and off. Figure 5.2 describes some of the calculations used to limit utilization, while figures 5.8a through 5.8c describe the operation of some of the OpenGL functions that have been modified using the interposer.

glClear()

When a frame starts rendering, a call to glClear() is made which clears all data from the previous frame. glClear() compares the current system time to the timestamp in shared memory which specifies the point in time when the next frame may start rendering. If the current time is greater than the newFrame timestamp then the atomic integer acceptCalls is set to 1 (TRUE), which means that every OpenGL function will be allowed to execute. If acceptCalls is TRUE then the global timeval variable renderingStarted is assigned the current time. Next, the rendering timer is started with a value equal to the current timestamp plus a value that is read from the filter struct in shared memory.

eglSwapBuffers()

A call to eglSwapBuffers() indicates that a frame is finished. Whenever eglSwapBuffers() is called, the first thing that happens is that the rendering timer is reset. Then the global timeval variable renderingStarted is compared to another global timeval variable, renderingDone. If renderingStarted is greater than renderingDone it means that a frame has started but not finished yet. If this is the case then the pmegl thread has to be notified of the frame rendering time by sending a UNIX message containing the timeval values of renderingStarted and renderingDone. It would of course be possible to just check the acceptCalls variable to see whether OpenGL is actively rendering, but this would fail to catch the expiration of the rendering timer. When a message has been sent, renderingDone is assigned the current time.

glEnable()

Even if the rest of the OpenGL library is in an active state because the acceptCalls variable is TRUE, there are some functions corresponding to visual effects that need to be turned off to keep a minimum frame rate.
These functions are controlled by updating the OpenGL state, in particular the capabilities GL_FOG, GL_LIGHTING and GL_TEXTURE_2D. The filter struct contains three integers that indicate whether a call to glEnable() should be allowed to enable fog, lighting or texturing. For example, if the variable enableFog is equal to 0 (FALSE) then calls to enable GL_FOG will be denied and fog disabled.
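The glEnable() filtering decision can be sketched as a small lookup. The capability values are the real OpenGL|ES 1.1 enums; the filter fields follow the text, while the helper function and the forwarding (via dlsym, elided here) are this sketch's assumptions:

```c
/* Sketch of the interposed glEnable() filter: decide whether a capability
 * may be passed through to the real glEnable(). */
#define GL_FOG         0x0B60
#define GL_LIGHTING    0x0B50
#define GL_TEXTURE_2D  0x0DE1

struct filter { int enableFog, enableLighting, enableTexturing; };

/* Returns 1 if the call may be forwarded to the original glEnable(). */
int glEnable_allowed(const struct filter *f, unsigned int cap)
{
    switch (cap) {
    case GL_FOG:        return f->enableFog;
    case GL_LIGHTING:   return f->enableLighting;
    case GL_TEXTURE_2D: return f->enableTexturing;
    default:            return 1;   /* other capabilities pass through */
    }
}
```

Capabilities the driver does not manage, such as blending, are always forwarded unchanged.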

5.8.4 Rendering timer

A solution has to be found for the times when the rendering time exceeds the prediction. For this reason, a timer is started as soon as the frame starts rendering. If the frame is not finished rendering when the timer expires then execution will be stopped and OpenGL operation halted. Software timer interrupts can potentially disrupt the operation of a process due to memory corruption and synchronization issues. The rendering timer was implemented as a way of making sure that the rendering time of a frame does not exceed a certain threshold. A timer in UNIX is started using the setitimer() function call. setitimer() takes three arguments: the first specifies which kind of timer should be used, the second the interval before the timer expires, and the third optionally returns the previous timer value. The rendering timer is started every time a frame starts rendering. When a UNIX timer

55 Design and implementation of a power aware graphics system expires a SIGALRM UNIX signal is sent to the process itself. The signal handler set a flag which will cause the next, and all subsequent OpenGL function calls to be ignored until a new frame is allowed. The flag is a global variable atomic integer. Using an atomic variable for the signal handler ensures that there will be no memory corruption.

5.8.5 Rendering buffer

Halting OpenGL operation is costly and should only be performed as a last resort. The rendering timer will only expire if the frame being rendered takes more time than expected. If the workload is stochastic, this will happen for roughly half of the frames, which is an unacceptably high number. Therefore, a rendering buffer is introduced: frames that require more time than expected are given some extra time to finish rendering before OpenGL is put to sleep. The size of the rendering buffer should be set large enough that it almost eliminates the possibility that operation has to be stopped. The buffer should be proportional to the predicted rendering time of a frame, based on the assumption that the absolute variation in frame rendering time is correlated with the absolute rendering time.
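Under this proportionality assumption, the value handed to the rendering timer can be derived from the prediction. A sketch; the buffer ratio is a tunable constant whose value is not specified here.

```c
/* Timer deadline for a frame: predicted rendering time plus a rendering
 * buffer proportional to the prediction. A ratio of 0.25 gives frames
 * 25% extra time before OpenGL operation is halted. */
long timer_deadline_usec(long predicted_usec, double buffer_ratio)
{
    return predicted_usec + (long)(predicted_usec * buffer_ratio);
}
```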

5.8.6 Measuring energy consumption

Measuring the energy consumption of the rendering process is not straightforward since it involves several hardware components. The rendering process directly affects the energy consumption of the CPU, GPU and memory. It is likely that a power management framework will consider these components as PMCs in their own right. This could prove problematic since the power consumption of the rendering process might be reported twice, through the other components it affects. A PM solution containing different types of PMCs must be able to handle PMC dependencies. How this should be done is outside the scope of this thesis; the proposed driver operates under the assumption that the PM policies are able to handle these dependencies in a satisfactory manner.

The maximum power consumption of the rendering process is defined as the difference between the power consumption of the entire embedded system when it is idle and when it is rendering graphics at maximum capacity. The power consumption will not be modeled completely accurately because the graphics application will also consume clock cycles and affect the measurements, but if the application is small enough its effect on power consumption should be negligible.

Eframe = (t_eglSwapBuffers() − t_glClear()) ∗ Pmaxrendering (5.4)

The worst-case energy consumption of rendering a frame can be estimated using equation 5.4. The total energy consumed is equal to the time interval between the OpenGL function calls glClear() and eglSwapBuffers() multiplied by the maximum power consumption of the rendering process.
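Equation 5.4 maps directly onto the timestamps the interposer records in glClear() and eglSwapBuffers(). A sketch, with Pmaxrendering obtained from equation 5.5:

```c
#include <sys/time.h>

/* Energy estimate for one frame (joules): time between glClear() and
 * eglSwapBuffers() multiplied by the maximum rendering power (watts). */
double frame_energy(struct timeval t_clear, struct timeval t_swap,
                    double p_max_rendering)
{
    double seconds = (t_swap.tv_sec  - t_clear.tv_sec)
                   + (t_swap.tv_usec - t_clear.tv_usec) / 1e6;
    return seconds * p_max_rendering;
}
```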

Pmaxrendering = Pmaxsystem − Pidlesystem (5.5)


Runtime updates The working state of the rendering process can be communicated to the PM module so that it can incorporate the information in its decision-making policy.

5.8.7 Workload prediction

The initial driver architecture utilizes a history-based approach for workload estimation. History-based predictors are built around the assumption that the workload of a frame is correlated with recent workloads. The driver will feature the possibility to set how many frames the history-based predictor considers when making the prediction, as well as different weighting methods. Workload varies greatly over time, and a one-frame predictor will make poor predictions for fluctuating workloads. A predictor that considers several frames will perform better in these cases but will instead react more slowly to workload changes.

As a final step towards power awareness, an analytical scheme for workload prediction will be added to the driver. The workload of a frame, and also of the individual pipeline stages, can be predicted by analyzing the 3D scene data. The vertex count is the main source of information about frame rendering time. The analytical predictor will, as a first step, assume that the workload of the rendering process can be described as:

Trenderingframe = c1 + c2 ∗ V (5.6)

where c1 and c2 are constants and V is the number of vertices in a frame. Using an analytical scheme will enable a more proactive approach to image quality degradation. Instead of reducing the quality of consecutive frames it is possible to take action before the next frame starts rendering and adjust the quality if it is predicted that the frame rendering time will violate the minimum frame rate limit. If the prediction is that the limit will be reached even at lowered image quality then the OpenGL rendering call can be denied and the driver will start waiting for the next frame. Workload prediction can possibly increase the accuracy of the optimization scheme if it is possible to predict the impact different rendering parameters have on the different pipeline stages. The more accurate the predictor is the better the driver will be able to react to sudden changes in workload. The rendering buffer (Section: 5.8.5) will not have to be as large.
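The two prediction schemes can be sketched as follows; the exponential weighting and the decay parameter are illustrative choices, and the constants c1 and c2 would have to be fitted for the target platform.

```c
/* History-based prediction: exponentially weighted average of the last
 * n frame rendering times, with the newest frame weighted heaviest. */
double predict_history(const double *times, int n, double decay)
{
    double sum = 0.0, wsum = 0.0, w = 1.0;
    for (int i = n - 1; i >= 0; i--) {   /* newest entry is times[n-1] */
        sum  += w * times[i];
        wsum += w;
        w    *= decay;                   /* 0 < decay <= 1 */
    }
    return wsum > 0.0 ? sum / wsum : 0.0;
}

/* Analytical prediction, equation 5.6: T = c1 + c2 * V,
 * where V is the vertex count of the upcoming frame. */
double predict_analytical(double c1, double c2, long vertices)
{
    return c1 + c2 * (double)vertices;
}
```

A decay of 1.0 degenerates to a plain moving average; a small decay approximates the one-frame predictor discussed above.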

Chapter 6

Results

In order to validate parts of the design and implementation an effort was made to measure the utilization and energy consumption of OpenGL graphics applications running on the i.MX31 board while using the interposer library.

6.1 Demos


Figure 6.1: (a) The EvilSkull demo. (b) The Vase demo. (c) The SkyBox demo.

The demo applications were obtained from the Imagination Technologies software development kit[30] for the i.MX31 graphics system. Three demos were used for the test cases. The demo applications are simple 3D animations that are non-interactive and in general do not display the fluctuations in workload that would be expected from an interactive application. An even workload over time minimizes the error of the history-based predictor that was used for the implemented utilization limiter algorithm. All applications use the OpenGL interface to render graphics and are designed to highlight features of the PowerVR tile-based rendering technology.

EvilSkull Demo featuring a floating skull on a flaming background. Workload varies very little over time and between frames. The background is generated using multitexturing (section: 4.3.5). Rendering of a frame is intense but fast. The EvilSkull demo has the lowest frame rendering time of the three applications.


Vase A floating vase object that displays transparency and reflections. Like the EvilSkull demo, the workload is almost constant.

SkyBox The demo that has the most uneven workload. An air balloon featuring texture compression (section: 4.3.5) hovering over a landscape of lakes and mountains. Frame rendering time and frame content varies slightly as the balloon travels.

6.2 Measurements

The main purpose of these measurements is to verify that it is possible to save power by limiting the amount of 3D content and the frame refresh rate, and that energy consumption is closely related to the utilization level. The measurements consist of both measurements on hardware and terminal output.

6.2.1 Estimating utilization

The utilization of the processing units can be estimated using timestamping and terminal output. The timestamps were used to measure the frame rendering time as described in equation 5.4. Calculating frame rendering time in this manner and comparing it to the time between frames, when the rendering process is sleeping, gives a good indication of the percentage of time the processing units are busy rendering graphics. No attempts were made to separate the CPU load due to rendering from the load caused by background tasks, the graphics application itself and the utilization limiter. The estimate is accurate as long as the workload is dominated by rendering calculations that occur after a function call to OpenGL. The results obtained, and the difference in utilization and rendering time, correspond very well to the results one would expect if all other tasks contribute only a minor portion of the total workload.
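The utilization estimate described above reduces to a ratio of measured intervals (a sketch; the times can be in any common unit):

```c
/* Fraction of time the processing units are busy rendering: frame
 * rendering time divided by the full frame-to-frame period (rendering
 * time plus the sleep interval between frames). */
double estimate_utilization(double render_time, double sleep_time)
{
    double period = render_time + sleep_time;
    return period > 0.0 ? render_time / period : 0.0;
}
```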

Frame rendering time

Demo        Full performance    Lowered image quality    Difference
EvilSkull   11.20               7.52                     49%
Vase        16.36               14.80                    11%
SkyBox      16.51               9.98                     65%

Measuring energy consumption

Demo        CPU utilization
            100%     75%      50%      25%      10%
EvilSkull   495.0    492.0    373.3    226.0    137.3
Vase        505.7    476.3    345.0    209.4    131.4
SkyBox      497.1    423.1    316.1    197.2    126.1


Figure 6.2: Energy consumption of the CPU at various power levels using three demo applications.

Figure 6.3: Energy consumption of the GPU.

Demo        GPU utilization
            100%     75%      50%      25%      10%
EvilSkull   149.5    149.6    133.3    108.0    92.6
Vase        151.7    149.4    130.4    106.5    91.3
SkyBox      147.7    146.5    127.9    104.7    91.9

The energy consumption measurements were performed by measuring the voltage drop over a shunt resistor using an oscilloscope and loading the data into Matlab for visualization and analysis. The energy consumption at each power level was measured over a 20 second interval. The reference document[31] for how to carry out the measurements and calculate energy consumption was only accessible through direct contact with the Freescale Semiconductor support team and can be found as a reference for this report.

The reason why it is possible to achieve any energy savings at all for the CPU is that the Linux BSP for the i.MX31 board comes with DVFS (section: 3.9.2). The DVFS is not integrated with the rendering process in any way but is able to save energy by identifying when CPU utilization drops and adjusting the voltage and frequency accordingly. This is not the same as a DVFS scheduler that uses workload prediction to scale the operation; such an algorithm would increase the effectiveness of the utilization limiter in terms of energy saved.

The GPU does not display the same energy savings as utilization decreases. The GPU lacks intermediate modes of operation and is only able to save energy by going into an idle mode. The transition delay means that the GPU is unable to fully exploit the diminishing workload, and as a result the energy savings are not proportional to the CPU savings. If a separate DVFS circuit were present for the graphics hardware acceleration, the energy savings would presumably be substantially larger.

Although it would have been interesting to measure how the energy consumption of other components, such as RAM and cache memory, is affected by the utilization limiter, the implemented solution is aimed at reducing the load on the processing units. The largest gain in energy consumption is likely to be found when measuring these components. The implemented solution was installed as a network file system, and to be able to measure the memory energy consumption the file system would have to be moved to the board.

6.3 Summary of the measurements

From the measurements it is clear that utilization is highly correlated with energy consumption. In fact, it seems reasonable to assume that it is the single most contributing factor. If utilization can be controlled with some accuracy, it is an excellent means to limit energy consumption. At the same time, it is apparent that utilization is not the only thing that affects energy consumption. Both of these properties are in line with what the literature suggests. The EvilSkull and SkyBox demos both display a linear correlation between energy consumption and utilization, but the slope of the line differs. The DVFS algorithm that exists for the i.MX31 is designed so that it provides greater savings for bursty processor utilization. An application that has a longer frame rendering time will also have a longer time between frames. The larger the idle interval, the lower the DVFS algorithm can push the voltage and frequency to achieve greater savings.


Power awareness at different levels The reason for the comparatively small difference between the energy consumption of an application running at 100% and at 75% is that the applications used for these measurements have a built-in limit of 60 FPS. When the applications run on this particular hardware platform, the 60 FPS limit is reached around 75% utilization, and this is why the energy consumption at 100% utilization has been omitted from the graph. The interposer does not limit the utilization of the application if it is already running at an acceptable level. This shows that it is possible to combine power management at the application level with operating system power management. If the application itself is power aware and able to stay within its allotted utilization limit, the interposer should not interfere with its operation. It is assumed that the application has better knowledge of which rendering stages contribute the most to its perceived benefit.

Image degradation By disabling fog and textures and using flat shading, the rendering time of a frame could be significantly shortened, even for these very simple animations. The biggest difference in frame rendering time was reported for the SkyBox demo: at reduced image quality the frame only took 60% of the time compared to a frame rendered at full quality. As a result, the utilization limiter was able to run at a higher frame rate without breaking the utilization limit.

The energy measurements showed little difference with and without image degradation at the same utilization level. This is satisfying since it indicates that limiting utilization is a reliable method to preserve energy, but also somewhat surprising, since both lighting and texturing are hardware-accelerated on the i.MX31 and it would be reasonable to assume that disabling these effects would affect the energy versus rendering time trade-off. Utilization seems to be a more important factor than the kind of rendering operation that is running when estimating energy consumption, but more measurements are required before any conclusions can be reached.

No attempts were made to measure the reduction in image quality caused by disabling these visual effects. Image quality optimization under power constraints is a complex issue, and the implemented solution lacks the required functionality to control and optimize performance.

Chapter 7

Conclusion

This chapter summarizes the conclusions that can be drawn from the literature study as well as the information gained from analyzing the implementation.

7.0.1 Energy consumption of tile-based rendering

Hardware accelerated graphics is energy efficient graphics. Special-purpose hardware is able to produce high-performance graphics at low energy cost, but in embedded systems it is limited by space on chip and has to compete for that space with other hardware that might be equally important. Still, the rendering process can be made several times more energy-efficient with good hardware support[22].

Tile-based rendering (section: 2.3) is a good solution for energy efficient graphics. Limited space on chip means that hardware acceleration of graphics needs to use as little chip area as possible. Tile-based rendering maximizes the usage of hardware acceleration by rendering the 3D scene sequentially, and minimizes off-chip memory accesses since, most of the time, all geometry data fits in the on-chip cache memory. More about energy optimization of tile-based rendering is found in section 4.4.

When it comes to scaling the performance of graphics in a system like the one on the i.MX31 platform, which only features hardware acceleration of a subset of the pipeline stages, it is desirable to decrease the performance of the stages that lack acceleration first, since hardware-accelerated graphics usually delivers better performance relative to power consumption. On the i.MX31 the geometry stage (section: 2.2) lacks hardware acceleration and is the best candidate for degradation of performance. The potential energy savings of level of detail control (section: 4.2) would have been very interesting to investigate given the circumstances, but this was outside the scope of this thesis.

7.0.2 Library interposition as a method for power management

From the implementation part of this thesis work it is apparent that library interposition is ideal for power management. Library interposition can be thought of as a way of adding power awareness to applications without modification of, or even access to, the application source code. It is highly desirable to have power awareness not only at the hardware driver level but also at the application level[12, 5]. Library interposition achieves power awareness for any number of graphics applications without having to re-invent power awareness for each application. OpenGL is a prime target for this kind of approach since its API is widely used and the library functions called are so expensive in terms of computational complexity, but there is no particular reason why the interposition cannot be extended to other libraries.

7.0.3 Power management and the OpenGL API

OpenGL is a library under constant development, and new functionality is being added continually. When it comes to power management through library interposition, the best part is that so much power management is possible just by tracing OpenGL function calls. What complicates matters is that some of the parameters that affect the rendering process are beyond the control of the interposer. Level of detail is handled completely by the application itself, and a unified framework for level of detail control is one of the things that would enhance the effectiveness of power management. Level of detail control could also be handled using library interposition. An API that already has several hundred functions still has room for improvement when it comes to more fine-grained control of the rendering process. While it is possible to set the technique that shall be used for most stages of the pipeline, some stages are harder to control, at least in OpenGL|ES 1.1.

As computer-generated graphics for embedded systems become more complex, the number of rendering techniques to choose between increases, and so do the possibilities for optimization. Something that would enable feedback about the impact of various visual effects on image quality is the ability to obtain the image rendered halfway through the rendering pipeline. If it was possible to retrieve an image after each stage of the rendering process, it would be possible to take optimization to a new level. By measuring the difference between images at various stages, the visual effect that contributes the least to the final image can be determined and turned off.

7.0.4 Power management frameworks and power aware rendering

The rendering pipeline (Section: 2.2) consists of several stages that each provide a specific visual effect to the final image. It is possible to envision a system where each stage is a separate PMC; the DPM decisions would then be centralized in the power management module. But it is also possible to describe the whole rendering process as a single functional block with the purpose of generating an image from the description of a scene. Each approach has its own advantages. The power management decision process becomes more complex as the number of managed devices increases, but at the same time it is possible to exert more fine-grained control. In a system where each pipeline stage is a PMC, the power management module could possibly take less conservative PM decisions and at the same time account for system dependencies. However, if it is assumed that the pipeline stages are mostly dependent upon each other, then a graphics driver is equally adept at making PM decisions, as long as it does not do anything that could upset system operation. Internal power management and external power management can work together and achieve positive results[5, 21].

Figure 7.1: (a) The power management module treats each pipeline stage as a PMC. (b) The PMC called graphics driver uses an internal power management policy to decide which stage should reduce its performance in order to save energy.

Graphics applications are built to provide the user with the best possible visual performance and often stress the capabilities of the processing units in the system[8]. A power aware graphics driver enables the system to limit the workload caused by these applications, while still providing the user with useful visual information. The main benefit of this approach is that application developers do not have to consider power management issues in the development process.

It is possible to predict the future workload of the rendering process with a high degree of precision using analytical or signal-based schemes[17]. Predictive PM techniques are ideal for rendering power management. Because rendering is performed using frame structures (Section: 4.5), computer graphics experience a highly periodic workload, which makes DVFS well suited for graphics applications and increases the power savings from slack time in the rendering process by a factor of 3[29].

Several papers that describe frameworks for power management stress that both applications and hardware drivers should be included as PMCs, to create a truly effective framework for DPM. Although each paper mentions the possibility, they often fail to give examples of how to deal with the dependencies between PMCs that occur in a system with both power aware applications and drivers.
Knowing what different operations cost in terms of power consumption is crucial to power management[12]. Another thing that the DPM frameworks fail to address is how to include a PMC that has levels of operation that come at different energy costs, but where the cost is almost impossible to calculate. An interposer library with power awareness presents researchers of power management policy optimization with added challenges. For an application that is made power aware through library interposition, it is very difficult to calculate how much energy is consumed by the application. Possible solutions to this problem are either an extremely complex monitoring feedback algorithm that calculates the energy consumption of the application from feedback from all affected hardware devices, or an extremely conservative estimate of the energy consumption, which might result in inaccurate power management decisions because the estimate is far from the actual power consumption. What is apparent is that there is a demand for a power management framework that is able to incorporate PMCs that do not fit within the accepted PMC template.

7.1 Future work

To really be able to evaluate the performance of the proposed solution, several more measurements are needed. First of all, the energy consumption of the entire system should be measured, not just that of the processing units. To continue work on the interposer, the first step would be to remove the i.MX31 board from the process. The problems of setting up and using the cross-compiler have added considerable overhead to the development process, and OpenGL|ES can run just as well or better on a much more powerful computer.

7.1.1 Extend the functionality of the library interposer

Many of the features of the solution proposed in chapter 5 require that the interposer is extended with a frame capture tracer: the interposer would store a linked list of commands and analyze those commands and their associated data before the commands are run and the frame rendered. Analytical and signal-based workload prediction is only possible using this approach, and the same can be said about making an intelligent decision about light sources, lighting scheme or texturing scheme on an object-to-object basis. Creating an OpenGL tracer takes time, and the best solution is perhaps to modify an existing trace tool such as BuGLe[16].

7.1.2 OpenGL library interposition for other purposes

The idea of a library interposer with frame tracing capabilities presents a number of interesting possibilities. The function calls to OpenGL that are used to create a frame contain a substantial amount of information that can be analyzed. It would be interesting to investigate how library interposition could be used to optimize rendering in ways that are not only aimed at creating power awareness.

Scaling graphics applications for portability Today, 3D applications have to be implemented in such a way that the 3D content is scaled depending on the platform for which it is intended. This is in opposition to the OpenGL philosophy, which is intended to hide differences in hardware between platforms to facilitate portability of applications. A library interposer can be used to scale the operation of graphics applications so that the application can find a good computational balance between frame rate and various visual effects. Some applications enable the user to choose the complexity of rendering and to turn visual effects on and off, but many do not. Implementing such a feature in a graphics application requires extra effort on the developer's side and increases time-to-market. This study shows that it is possible to control the complexity of the rendering process without access to the source code of the application or the OpenGL library. Control can be exerted just by analyzing and modifying OpenGL function calls. Compared to implementing this feature as part of the application, much more effort can be put into creating an interposer, which could result in a more efficient kind of control. An interposer that is able to scale 3D content can be used to provide the user with a way of optimizing graphics rendering for any number of applications.

Platform-dependent optimization A related study is the possibility of tailoring rendering to a specific hardware platform. OpenGL provides programmers with an interface that hides the underlying hardware architecture used for rendering, but the truth is that the capabilities of graphics hardware acceleration differ greatly between platforms. An effect that has hardware acceleration support is often desirable to turn on because it is many times cheaper in terms of time and energy than an effect that lacks hardware support. When porting applications to different platforms, an interposer library can be used to optimize rendering. Basically, an interposer can maximize the use of hardware acceleration and turn off effects that lack support.

7.1.3 Mapping the power / performance trade-off To optimize rendering, understanding what factors affect energy consumption and visual performance is necessary. Both power and performance are quantities that are extremely difficult to evaluate. It would be very interesting to try and find a way to compare these quantities in the context of the rendering process. It would be interesting to investigate the trade-off for a particular system but more interesting would be to create a framework for evaluating both quantities and to find the correlation between energy consumption and different pipeline stages.

Bibliography

[1] Tomas Akenine-Möller and Jacob Ström. Graphics for the masses: a hardware rasterization architecture for mobile phones. In SIGGRAPH ’03: ACM SIGGRAPH 2003 Papers, pages 801–808, New York, NY, USA, 2003. ACM.

[2] I. Antochi. Suitability of Tile-Based Rendering for Low-Power 3D Graphics Accelerators. PhD thesis, October 2007.

[3] Iosif Antochi, Ben H. H. Juurlink, Stamatis Vassiliadis, and Petri Liuha. Memory bandwidth requirements of tile-based rendering. In SAMOS, pages 323–332, 2004.

[4] Iosif Antochi, Ben H. H. Juurlink, Stamatis Vassiliadis, and Petri Liuha. Scene management models and overlap tests for tile-based rendering. In DSD, pages 424– 431, 2004.

[5] L. Benini, A. Bogliolo, and G. De Micheli. A survey of design techniques for system-level dynamic power management. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 8(3):299–316, June 2000.

[6] James F. Blinn. Models of light reflection for computer synthesized pictures. SIGGRAPH Comput. Graph., 11(2):192–198, 1977.

[7] Kirsten Cater, Alan Chalmers, and Patrick Ledda. Selective quality rendering by exploiting human inattentional blindness: looking but not seeing. In VRST ’02: Proceedings of the ACM symposium on Virtual reality software and technology, pages 17–24, New York, NY, USA, 2002. ACM.

[8] Tolga K. Çapin, Kari Pulli, and Tomas Akenine-Möller. The state of the art in mobile graphics research. IEEE Computer Graphics and Applications, 28(4):74–84, 2008.

[9] Mark Claypool, Kajal Claypool, and Feissal Damaa. The effects of frame rate and resolution on users playing first person shooter games. volume 6071, page 607101. SPIE, 2006.

[10] Hewlett-Packard Corporation, Intel Corporation, Microsoft Corporation, Phoenix Technologies Ltd., and Toshiba Corporation. Advanced Configuration and Power Interface Specification, 2009.


[11] Jonathan D. Cohen and David Luebke. GLOD: level of detail for the masses, 2003.

[12] Carla Schlatter Ellis. The case for higher-level power management. In Workshop on Hot Topics in Operating Systems, pages 162–167, 1999.

[13] Jeongseon Euh, Jeevan Chittamuru, and Wayne Burleson. Power-aware 3d computer graphics rendering. VLSI Signal Processing, 39(1-2):15–33, 2005.

[14] Randima Fernando. GPU Gems: Programming Techniques, Tips and Tricks for Real-Time Graphics. Pearson Higher Education, 2004.

[15] Michael Garland and Paul S. Heckbert. Surface simplification using quadric error metrics. In SIGGRAPH ’97: Proceedings of the 24th annual conference on Computer graphics and interactive techniques, pages 209–216, New York, NY, USA, 1997. ACM Press/Addison-Wesley Publishing Co.

[16] Khronos Group. OpenGL software development kit, 2009.

[17] Yan Gu and Samarjit Chakraborty. Power management of interactive 3d games using frame structures. In International Conference on VLSI Design, pages 679–684, 2008.

[18] G. Lafruit, L. Nachtergaele, K. Denolf, and J. Bormans. 3d computational graceful degradation. volume 3, pages 547–550, 2000.

[19] Jinfeng Liu and P. H. Chou. Optimizing mode transition sequences in idle intervals for component-level and system-level energy minimization. In ICCAD ’04: Proceedings of the 2004 IEEE/ACM International conference on Computer-aided design, pages 21–28, Washington, DC, USA, 2004. IEEE Computer Society.

[20] Imagination Technologies Ltd. PowerVR MBX technology overview, May 2009.

[21] Yung-Hsiang Lu, Tajana Šimunić, and Giovanni De Micheli. Software controlled power management. In CODES ’99: Proceedings of the seventh international workshop on Hardware/software codesign, pages 157–161, New York, NY, USA, 1999. ACM.

[22] David P. Luebke and Greg Humphreys. How gpus work. IEEE Computer, 40(2):96–100, 2007.

[23] John D. McCarthy, M. Angela Sasse, and Dimitrios Miras. Sharp or smooth?: comparing the effects of quantization vs. frame rate for streamed video. In CHI ’04: Proceedings of the SIGCHI conference on Human factors in computing systems, pages 535–542, New York, NY, USA, 2004. ACM.

[24] Bren C. Mochocki, Kanishka Lahiri, Srihari Cadambi, and X. Sharon Hu. Signature-based workload estimation for mobile 3d graphics. In DAC ’06: Proceedings of the 43rd annual Design Automation Conference, pages 592–597, New York, NY, USA, 2006. ACM.


[25] Victor Moya, Carlos Gonzalez, Jordi Roca, Agustin Fernandez, and Roger Espasa. Shader performance analysis on a modern gpu architecture. In MICRO 38: Proceedings of the 38th annual IEEE/ACM International Symposium on Microarchitecture, pages 355–364, Washington, DC, USA, 2005. IEEE Computer Society.

[26] Daniel S. Myers, Adam L. Bazinet, and Michael P. Cummings. Intercepting arbitrary functions on windows, unix, and macintosh os x platforms. Technical report, Center for Bioinformatics and Computational Biology, Institute for Advanced Computer Studies, University of Maryland, 2004.

[27] Chanmin Park, Hyunhee Kim, and Jihong Kim. A low-power implementation of 3d graphics system for embedded mobile systems. pages 53–58, October 2006.

[28] Bui Tuong Phong. Illumination for computer generated pictures. Commun. ACM, 18(6):311–317, 1975.

[29] Padmanabhan Pillai, Hai Huang, and Kang G. Shin. Energy-aware quality of service adaptation. Technical report, University of Michigan, 2003.

[30] PowerVR. PowerVR OpenGL ES SDK User Guide, August 2009.

[31] Freescale Semiconductor. MX31 CPU BOARD WITH DDR, 2006.

[32] Jacob Ström and Tomas Akenine-Möller. ipackman: high-quality, low-complexity texture compression for mobile phones. In HWWS ’05: Proceedings of the ACM SIGGRAPH/EUROGRAPHICS conference on Graphics hardware, pages 63–70, New York, NY, USA, 2005. ACM.

[33] N. Tack, G. Lafruit, F. Catthoor, and R. Lauwereins. A content quality driven energy management system for mobile 3d graphics. pages 278–283, November 2005.

[34] Benjamin Watson, Alinda Friedman, and Aaron McGaffey. Measuring and predicting visual fidelity. In SIGGRAPH ’01: Proceedings of the 28th annual conference on Computer graphics and interactive techniques, pages 213–220, New York, NY, USA, 2001. ACM.

[35] Heng Zeng, Carla S. Ellis, Alvin R. Lebeck, and Amin Vahdat. Ecosystem: managing energy as a first class operating system resource. SIGPLAN Not., 37(10):123–132, 2002.

[36] Lin Zhong and Niraj K. Jha. Energy efficiency of handheld computer interfaces: limits, characterization and practice. In MobiSys ’05: Proceedings of the 3rd international conference on Mobile systems, applications, and services, pages 247–260, New York, NY, USA, 2005. ACM.
