
TECHNICAL UNIVERSITY OF CATALONIA School of Informatics

Master on Architecture, Networks and Systems

Master Thesis

High Performance, Ultra-Low Power Streaming Systems

Student: José María Arnau

Advisor: Joan-Manuel Parcerisa

Advisor: Polychronis Xekalakis

Advisor: Antonio González

Date: September 20, 2011

Abstract

Smartphones are emerging as one of the fastest growing markets, with new devices and improvements in their operating systems appearing every few months. The design of a CPU/GPU for such mobile devices is challenging due to users' demands for a truly mobile experience, including highly responsive user interfaces, uncompromised web browsing performance and visually compelling gaming experiences, and due to the power constraints imposed by the limited capacity of the battery. In recent years, the power demand of these mobile devices has increased much faster than battery technology has improved.

Our key ambition is to design a CPU/GPU for such a system, minimizing the power consumed while achieving the highest performance possible. We first analyze commercial Android workloads and establish that the most demanding applications in terms of performance are, as expected, games. We show that because these systems are based on OpenGL ES, the CPU is idle the vast majority of the time. In fact, we find that the GPU is much more active than the CPU and that the major performance limitation for these systems is the use of memory by the GPU.

We thus focus on the GPU and more specifically on its memory behavior. We show that for most of the caches employed in these systems, traditional prefetchers provide significant benefits. The exception is the texture cache, for which the access patterns are irregular, especially for 3D games. We then demonstrate how we can alleviate this issue by using a decoupled access/execute-like architecture. We also show that an important part of the power consumed can be reduced by carefully moving data around and by orchestrating the accesses to the L2 cache. The end design achieves performance similar to a more traditional many-warp system while consuming only a fraction of its power. Our experimental results, using the latest version of Android and a commercial set of games, prove this claim. More specifically, our proposed system achieves 29% improvements over state-of-the-art prefetchers while consuming 6% less power.

Keywords

Prefetching, GPU, Android, rasterization.


Contents

1 Introduction
1.1 Motivation
1.2 Objectives and contributions
1.3 Organization

2 Related work
2.1 Rasterization
2.2 Android
2.2.1 Android Software Renderer
2.3 State of the art architectures for mobile devices
2.3.1 Qualcomm Snapdragon
2.3.2 PowerVR chipsets
2.3.3 NVIDIA Tegra 2
2.4 Data Cache Prefetching
2.4.1 CPU prefetchers
2.4.2 GPU prefetchers

3 Problem statement: Memory Wall for Low Power GPUs
3.1 Hiding memory latency on a modern low power mobile GPU

4 Proposal: Decoupled Access Execute Prefetching
4.1 Ultra-low power decoupled prefetcher
4.1.1 Baseline GPU Architecture
4.1.2 Decoupled prefetcher
4.1.3 Decoupled prefetcher improvements

5 Evaluation methodology
5.1 Simulation infrastructure
5.1.1 GPU trace generation
5.1.2 Cycle accurate GPU simulator

6 Experimental results
6.1 Workload characterization
6.2 State of the art prefetchers performance
6.3 Ultra-low power decoupled prefetcher performance

7 Conclusions

List of Figures

1.1 Smartphone sales vs desktop and notebook sales. Data obtained from [1].
1.2 Energy need vs energy available in a standard size battery. Two days of battery life cannot be achieved with current batteries and the gap is getting bigger. Data obtained from [2].
2.1 Initial scene, intermediate results produced by the different stages of the rasterization process and the final result.
  (a) 3D triangles plot
  (b) 2D triangles plot
  (c) Clipped 2D triangles plot
  (d) Pixels after rasterization
  (e) Visible pixels after Z-test
  (f) Shaded and textured pixels after the pixel stage
2.2 Rasterization pipeline.
2.3 Android architecture.
2.4 Qualcomm Snapdragon System on Chip.
2.5 PowerVR GPU architecture.
2.6 NVIDIA Tegra 2 architecture.
2.7 Ultra-low power GeForce architecture.
2.8 Stride prefetching table.
2.9 Markov prefetching. The left side of the figure shows the state of the correlation table after processing the miss address stream shown at the top of the figure. The right side illustrates the Markov transition graph that corresponds to the example miss address stream.
2.10 Distance prefetching. The address delta stream corresponds to the sequence of addresses used in the example of figure 2.9.
2.11 Global History Buffer.
2.12 Distance prefetcher implemented by using a Global History Buffer. The Head Pointer points to the last inserted address in the GHB.
2.13 An overview of the baseline GPGPU architecture.
2.14 An example of memory addresses with/without warp interleaving.
  (a) Accesses by warps
  (b) Accesses seen by a hardware prefetcher
2.15 Many-thread aware hardware prefetcher.
2.16 Throttling heuristics.
2.17 Baseline architecture for texture mapping.
2.18 Texture cache prefetcher architecture.
3.1 Effectiveness of multithreading for hiding memory latency. As we increase the number of warps on each processor we obtain better performance.
3.2 Power consumed by the GPU main register file for different configurations.
4.1 Baseline GPU architecture (based on the ultra-low power GeForce GPU in the NVIDIA Tegra 2 chipset).
4.2 Decoupled prefetcher architecture.
4.3 Improved decoupled prefetcher.
5.1 GPU trace generation system.
5.2 GPU architecture modelled by the cycle accurate simulator.
6.1 CPU configuration for the experiments.
6.2 CPI stacks for several Android applications. iCommando, Shooting Range 3D and PolyBreaker 3D are commercial games from the Android market.
6.3 Misses per 1000 instructions for the different caches in the GPU.
6.4 Texture and pixel cache analysis.
6.5 Analysis of the strides of the cache misses in the Pixel and Texture cache of one Streaming Processor when running the 2D game iCommando. In the Sequitur grammars non-terminal symbols (rules) are represented by numbers and terminal symbols (strides) are represented by numbers in square brackets. After each rule we show the number of times the rule is applied to form the input sequence of strides. We only show the 5 most frequent rules of the grammar.
6.6 Analysis of the strides of the cache misses in the Pixel and Texture cache of one Streaming Processor when running the 3D game PolyBreaker 3D. For each cache the figure shows the 5 most frequent rules of the grammar and the 5 most frequent strides.
6.7 GPU configuration for the experiments. The baseline GPU architecture is the one illustrated in figure 5.2.
6.8 Speedups for different state-of-the-art prefetchers.
6.9 Normalized power consumption for different state-of-the-art prefetchers.
6.10 Ultra-low power decoupled prefetcher compared with state-of-the-art prefetchers.
6.11 Ultra-low power decoupled prefetcher compared with the distance prefetcher implemented with GHB.
6.12 Decoupled prefetcher power consumption.
6.13 Normalized energy-delay product.
6.14 Prefetch queue size evaluation. The graph shows the speedup achieved by the decoupled prefetcher over the baseline GPU without prefetching for different sizes of the prefetch queue, for the game Shooting Range 3D.


1 Introduction

1.1 Motivation

Mobile devices such as smartphones and tablets have become ubiquitous in the last few years. This kind of general purpose but battery limited device has experienced huge growth in both computing capabilities and market share. Regarding the user experience, making calls is just one of the many features these phones offer, since the user can also browse the web, play high-definition videos or play complex 3D games. In regard to the market share, the total number of smartphones sold in 2008 exceeded the total number of desktop PCs, and the gap is increasing each year [1]. Furthermore, the forecast for the coming years predicts that the smartphone market will exceed the notebook and desktop markets by 2012, as shown in figure 1.1.

The design of a CPU/GPU system for smartphones is very challenging due to the user expectations for what these devices should do and the important power limitations. On the one hand, users are demanding a truly mobile computing experience: highly responsive user interfaces, uncompromised web browsing performance, visually compelling online and offline gaming experiences... On the other hand, the power demand is increasing faster than battery improvements, as shown in figure 1.2. The combination of these two factors, the demand for complex applications and the power constraints, is putting considerable pressure on the CPU/GPU, which must provide high performance without breaking the small power budget of mobile devices.

The clock rate of mobile CPUs and GPUs has increased significantly in recent years. Nowadays, smartphones achieve clock rates between 1 GHz and 1.5 GHz. Furthermore, companies such as Qualcomm and NVIDIA have announced CPU/GPU chipsets with clock rates between 2 GHz and 2.5 GHz for 2012. Hence smartphones are going to hit the memory wall, and the latency to access main memory is going to be one of the main performance limiting factors. Thus, the use of techniques to hide the memory latency will be necessary to provide high performance.

Figure 1.1: Smartphone sales vs desktop and notebook sales. Data obtained from [1].

Figure 1.2: Energy need vs energy available in a standard size battery. Two days of battery life cannot be achieved with current batteries and the gap is getting bigger. Data obtained from [2].


Prefetching is one of the main techniques for hiding memory latency. Although prefetching has been extensively studied in CPUs, as far as we know its use has not been evaluated in low-power mobile GPUs running graphics workloads.

1.2 Objectives and contributions

The first objective is to gain a better understanding of the applications available for smartphones. We want to evaluate the behavior of the CPU/GPU when running these applications and to identify the most demanding workloads. We focus on Android [3], since it is one of the most popular platforms for mobile devices and it is open source.

Another objective is to propose a technique that increases the performance of smartphone Graphics Processing Units (GPUs) while keeping the power consumption within the limits of the small power budget. Since games are the most demanding applications for smartphones and memory is one of the main limiting factors in the GPU, as we describe in section 6.1, it is necessary to find a mechanism to hide the latency of main memory. We want to explore the use of prefetching in a low-power mobile GPU, evaluate the performance and power consumption of current state-of-the-art prefetchers and, if necessary, propose a new low-power prefetching technique specifically designed for mobile devices.

In this report we make the following contributions:

1. We perform a characterization of smartphone applications. More specifically, we characterize the behavior of multiple 2D and 3D games on the Android platform.

2. We develop a methodology to evaluate the performance and power consumption of mobile Graphics Processing Units. We propose a technique to identify the code executed by the GPU in the Android software stack. Furthermore, we develop a cycle-accurate GPU simulator which models a mobile GPU similar to the NVIDIA Tegra 2 [4]; the simulator includes performance and power models.

3. We evaluate the effectiveness of state-of-the-art CPU and GPU prefetchers in reducing the memory latency of smartphone GPUs.

4. We propose our ultra-low power decoupled prefetcher, which outperforms previous proposals when running graphics workloads on a low-power mobile GPU. Furthermore, the ultra-low power decoupled prefetcher provides performance improvements while reducing energy consumption.

1.3 Organization

The remainder of this report is organized as follows. In chapter 2 we provide basic background information on the rasterization process. Furthermore, we review the Android platform, some of the state-of-the-art architectures for mobile devices and the most efficient prefetching techniques for CPUs and GPUs. In chapter 3 we describe the problem to solve, and in chapter 4 we explain our solution: the ultra-low power decoupled prefetcher. In chapter 5 we describe the evaluation methodology and present the GPU trace generation system and the cycle-accurate GPU simulator. In chapter 6 we show the experimental results; this chapter includes a workload characterization of several smartphone applications, a comparison of different state-of-the-art CPU and GPU prefetchers and the performance and power results of the ultra-low power decoupled prefetcher. Finally, in chapter 7 we present the main conclusions of the report.

2 Related work

2.1 Rasterization

Rasterization is the process of taking an image described in a vector graphics format (polygons) and converting it into a raster image (pixels or fragments) for output on the screen [5]. Nowadays rasterization is the most popular technique for producing real-time 3D graphics. In comparison to other rendering techniques such as ray tracing [6], rasterization is exceptionally fast. Computers usually include specialized graphics hardware to carry out the task of rasterizing 3D models onto a 2D plane for display on the screen.

In its most basic form, rasterization takes as input a set of 3D polygons and renders them onto a 2D surface, usually a frame buffer. Polygons are described as a collection of 3D triangles, and these 3D triangles are represented by three vertices in 3D space. Basically, rasterizers take a stream of 3D vertices, transform them into corresponding 2-dimensional points on the viewer's screen and fill in the transformed 2-dimensional triangles as appropriate by processing the corresponding pixels.

The rasterization algorithm consists of several stages, each of which produces a partial result, as shown in figure 2.1. The rendering process starts with a vectorial description of a 3D scene (figure 2.1a). All the objects in the scene are described as a collection of triangles. In turn, triangles are defined by 3 vertices in 3D space. Different attributes are specified for each triangle: position, normal (for lighting computations), color, one or several texture coordinates (for texture mapping)... Therefore, all the 3D vertices with all the per-vertex information describe the scene and form the input for the first stage of the rasterization process.

The vertex stage is the first step in the rasterization algorithm. The input for this phase is the set of 3D vertices with all the per-vertex information (figure 2.1a).

Figure 2.1: Initial scene, intermediate results produced by the different stages of the rasterization process and the final result: (a) 3D triangles, (b) 2D triangles, (c) clipped 2D triangles, (d) pixels, (e) visible pixels, (f) shaded and textured pixels.

Several operations are applied to each vertex in the vertex stage. First, vertices are transformed; the main transformations are translation, scaling and rotation. All the transformations are described by a transformation matrix, so transforming a vertex consists of multiplying its 3D coordinates by this transformation matrix. Second, vertices are lit according to the defined locations of light sources, reflectance and other surface properties. Finally, vertices are projected from 3D space onto a 2D plane; this projection is done by multiplying each transformed vertex by a projection matrix. The result of the vertex stage is a set of 2D triangles, as shown in figure 2.1b.
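To make the vertex stage concrete, the following C++ sketch applies a 4x4 transformation matrix to a vertex in homogeneous coordinates. The matrix layout and the function names are illustrative only; they are not taken from any particular driver or GPU.

```cpp
#include <array>
#include <cstdio>

using Vec4 = std::array<float, 4>;                    // homogeneous (x, y, z, w)
using Mat4 = std::array<std::array<float, 4>, 4>;     // row-major 4x4 matrix

// Multiply a 4x4 matrix by a homogeneous vertex: v' = M * v.
Vec4 transform(const Mat4& m, const Vec4& v) {
    Vec4 r{0.f, 0.f, 0.f, 0.f};
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j)
            r[i] += m[i][j] * v[j];
    return r;
}

int main() {
    // Model-view matrix: a translation by (1, 2, 0).
    Mat4 modelview = {{{1, 0, 0, 1},
                       {0, 1, 0, 2},
                       {0, 0, 1, 0},
                       {0, 0, 0, 1}}};
    Vec4 vertex{3.f, 4.f, 5.f, 1.f};

    Vec4 eye = transform(modelview, vertex);  // vertex in eye space
    // Applying a projection matrix and dividing by w would then yield
    // the 2D screen-space position consumed by the rasterizer.
    std::printf("(%f, %f, %f, %f)\n", eye[0], eye[1], eye[2], eye[3]);
    return 0;
}
```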

Once 3D vertices have been transformed to their corresponding 2D locations, some of these locations may be outside the viewing window, the area on the screen to which pixels will actually be written. For instance, in figure 2.1b the vertices V3 and V5 are outside the screen. So the next stage of the rasterization process is clipping, the task of truncating triangles to fit them inside the viewing area. The most common technique is the Sutherland-Hodgman clipping algorithm [7]. After clipping, triangles are truncated so that all the vertices fit on the screen (figure 2.1c).

The next step of the rasterization process is to fill the 2D triangles that now lie on the screen; this stage is also known as raster conversion or scan conversion. Raster conversion consists of converting the vectorial 2D clipped triangles (figure 2.1c) into pixels (figure 2.1d). There are a number of algorithms to fill the pixels inside a triangle, the most popular of which is the scanline algorithm [8]. During raster conversion all the attributes of the 2D vertices (color, texture coordinates...) are interpolated across the triangle.

After raster conversion, the rasterization algorithm must ensure that pixels close to the viewer are not overwritten by pixels farther away; this issue is known as the visibility problem. A Z-buffer [9] is the most popular solution. The Z-buffer is a 2D array, corresponding to the image plane, which stores a depth value for each pixel. Each time a pixel is drawn, the Z-buffer is updated with the pixel's depth value. Any new pixel must check its depth value against the Z-buffer value before it is drawn: closer pixels are drawn and farther pixels are disregarded. This process of checking the depth value of each pixel against the value stored in the Z-buffer is called the depth test. Figure 2.1d shows an example input to the depth test, and the corresponding output is shown in figure 2.1e.
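As a minimal illustration of the depth test, the sketch below (our own simplification, assuming a "smaller depth is closer" convention and a flat row-major buffer layout) only writes a pixel when it is closer than the value already stored in the Z-buffer.

```cpp
#include <cstdint>
#include <limits>
#include <vector>

struct FrameBuffers {
    int width, height;
    std::vector<float>    depth;  // Z-buffer: one depth value per pixel
    std::vector<uint32_t> color;  // color buffer: one RGBA8 value per pixel

    FrameBuffers(int w, int h)
        : width(w), height(h),
          depth(w * h, std::numeric_limits<float>::max()),
          color(w * h, 0) {}

    // Depth test: draw the pixel only if it is closer than the one
    // already stored at (x, y); otherwise the pixel is disregarded.
    void drawPixel(int x, int y, float z, uint32_t rgba) {
        float& stored = depth[y * width + x];
        if (z < stored) {                  // closer to the viewer: passes the Z-test
            stored = z;                    // update the Z-buffer
            color[y * width + x] = rgba;   // write the pixel's color
        }
    }
};
```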

Finally, all the pixels that pass the depth test (visible pixels) are processed in the last stage of the rasterization algorithm: the pixel stage. To compute a pixel's color, pixels are textured and shaded in the pixel stage. Let's briefly review how textures are applied. A texture map is a bitmap that is applied to a triangle to define its look. Each triangle vertex is associated with a texture and a texture coordinate (u, v) (for normal 2D textures) in addition to its position coordinate. Whenever a pixel on a triangle is rendered, the corresponding texel (or texture element) in the texture must be found. This is accomplished by interpolating between the texture coordinates associated with the triangle's vertices, weighted by the pixel's on-screen distance from the vertices. Moreover, lighting computations are also performed in the pixel stage. The result of this stage is the final image with textures and per-pixel lighting (figure 2.1f). A diagram of the whole rasterization process is shown in figure 2.2.

2.2 Android

Android [3] is a software stack for mobile devices such as mobile telephones and tablet computers, developed by Google and the Open Handset Alliance. Android consists of a mobile operating system based on the Linux kernel, with middleware, libraries and APIs written in C, and application software running on an application framework which includes Java-compatible libraries. Android uses the Dalvik virtual machine [10] with just-in-time compilation to run compiled Java code. Android has a large community of developers writing applications that extend the functionality of the devices; these developers write primarily in Java.

Figure 2.2: Rasterization pipeline.

The architecture of Android is shown in figure 2.3. Regarding the applications, Android provides a set of core applications including an email client, SMS program, calendar, maps, browser, contacts and others. All applications are written using the Java programming language.

Regarding the application framework, by providing an open development platform Android offers developers the ability to build rich applications. Developers are free to take advantage of the device hardware, access location information, run background services, set alarms... Developers have full access to the same framework APIs used by the core applications. The application architecture is designed to simplify the reuse of components: any application can publish its capabilities, and any other application may then make use of those capabilities. This same mechanism allows components to be replaced by the user.

On the other hand, Android includes a set of C/C++ libraries used by various components of the Android system. These capabilities are exposed to developers through the Android application framework. These libraries include an implementation of the standard C system library (libc), media libraries to support playback and recording of many popular audio and video formats, a relational database engine (SQLite) and much other functionality. An implementation of the OpenGL ES API [11] is also provided as one of these libraries. This 3D library uses either hardware 3D acceleration (where available) or the included, highly optimized 3D software rasterizer, as described in section 2.2.1.

Regarding the Android Runtime, Android includes a set of core libraries that provides most of the functionality available in the core libraries of the Java programming language.

Figure 2.3: Android architecture.

Every Android application runs in its own process, with its own instance of the Dalvik virtual machine. Dalvik has been written so that a device can run multiple VMs efficiently. The VM is register-based and runs classes compiled by a Java language compiler. The Dalvik VM relies on the Linux kernel for underlying functionality such as threading and low-level memory management.

Finally, Android relies on Linux version 2.6 for core system services such as security, memory management, process management, network stack, and driver model. The kernel also acts as an abstraction layer between the hardware and the rest of the software stack.

2.2.1 Android Software Renderer

Android supports the rendering of 3D graphics by providing an implementation of the OpenGL ES API [11]. The rasterization process, described in section 2.1, can be done in hardware by using a specialized graphics processor or in software on the CPU. When Android runs on a device provided with a GPU (for instance, the NVIDIA Tegra 2 described in section 2.3.3), the GPU driver is employed and the rasterization is hardware accelerated. On the contrary, when Android is executed on a device without specialized graphics hardware, the Android software renderer performs the rasterization on the CPU. The software renderer is also employed when executing Android on top of an emulator such as QEMU, as we will see in the section describing our simulation infrastructure.

The Android software renderer is a library that provides support for 3D graphics; it is an implementation of the OpenGL ES 1.0 API. The software renderer is of special interest for several reasons. First, it performs the rasterization when executing Android on top of a simulator. Second, the Android software renderer, unlike GPU drivers, is open source. Since the source code of the software renderer is available, we can review and modify it. This means that we can, for example, instrument the software renderer to collect interesting information about the rasterization process. For instance, we can count the number of vertices processed, the number of triangles or the number of pixels generated for each triangle, or even record the memory addresses of the pixels that are accessed in the color buffer or in the textures. Therefore, by instrumenting the Android software renderer we can generate traces of the OpenGL ES rendering commands and feed these traces to a cycle-accurate GPU simulator, as described in section 5.1.

In this section we briefly describe the structure of the Android software renderer, and in section 5.1.1 we describe the instrumentation of this library. The Android software renderer consists of two static libraries:

• libagl.a: This is the Android OpenGL library. This library provides all the functions in the OpenGL ES 1.0 API. It implements the vertex processing and clipping stages of the rasterization pipeline (figure 2.2).

• libpixelflinger.a: This library implements the raster conversion, depth test and pixel processing stages of the rasterization pipeline (figure 2.2).

The libagl.a library source code is located in the directory /frameworks/base/opengl/libagl of the Android distribution. This library implements all the functions in the OpenGL ES 1.0 API; these functions are called from the applications. Regarding the rasterization process, this library implements the vertex processing and clipping stages of the rasterization pipeline shown in figure 2.2. The remaining stages are implemented in the libpixelflinger.a library. So the libagl.a library has classes to handle vertices, triangles, lights, transformation matrices and everything else necessary for vertex processing.

In the OpenGL ES API we can identify two types of functions: functions to configure the rendering pipeline (set the number of lights, set transformation matrices...) and functions to render polygons. The rasterization process explained in section 2.1 is triggered when the application calls a function of the second type (a rendering function). There are only a few rendering functions in the OpenGL ES API. First, glDrawArrays and glDrawElements are employed to render 3D triangles. Second, the glDrawTex function is used to render textured 2D rectangles, usually in 2D games. Section 5.1.1 describes how these rendering functions are instrumented to generate GPU traces.
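For reference, here is a minimal OpenGL ES 1.x fragment that drives this rendering path through glDrawArrays; it assumes a valid EGL context has already been made current and omits all error handling.

```cpp
// Assumes an EGL context is already current; error handling omitted.
#include <GLES/gl.h>

static const GLfloat triangle[] = {
    -0.5f, -0.5f, 0.0f,
     0.5f, -0.5f, 0.0f,
     0.0f,  0.5f, 0.0f,
};

void drawFrame(void) {
    glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);
    glEnableClientState(GL_VERTEX_ARRAY);
    glVertexPointer(3, GL_FLOAT, 0, triangle);
    // Rendering function: this call triggers the rasterization process
    // (vertex stage, clipping, raster conversion, depth test, pixel stage).
    glDrawArrays(GL_TRIANGLES, 0, 3);
    glDisableClientState(GL_VERTEX_ARRAY);
}
```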

The libpixelflinger.a library source code is located in the directory /system/core/libpixelflinger of the Android distribution. This library implements the raster conversion, depth test and pixel processing stages of the rasterization pipeline shown in figure 2.2. The functions of this library are called from the libagl.a library to render 2D clipped triangles (figure 2.1). Although this library can also be used directly from the applications, developers usually employ the libagl.a library to do the rendering. The libpixelflinger.a library performs the pixel generation and processing: conversion from vectorial triangles to pixels, visibility determination by using a depth buffer [9], texture mapping, per-pixel lighting... This library employs the scanline algorithm [8] to rasterize the 2D triangles.

2.3 State of the art architectures for mobile devices

In this section we review the most popular CPU/GPU architectures for smartphones and tablets. Usually, these mobile devices are provided with a System on Chip (SoC) [12] including the CPU, the GPU and other specialized hardware for functions such as audio and video encoding/decoding. We start the review with the Qualcomm Snapdragon family of chipsets, which can be found in many HTC and other smartphones and in the Xperia PLAY. Next we describe the PowerVR family of chipsets; some of the devices using this SoC are, for instance, the Apple iPhone 4, iPad and iPad 2. Finally, we review the NVIDIA Tegra 2 SoC, which is included in several smartphones and in the Samsung Galaxy Tab.

2.3.1 Qualcomm Snapdragon

Snapdragon is a family of mobile Systems on Chip by Qualcomm [13]; it is a platform for use in smartphones, tablets and smartbook devices. The CPU of the Snapdragon chipset, Scorpion, is Qualcomm's own design. It is very similar to the ARM Cortex-A8 core and is based on the ARMv7 instruction set. However, it has much higher performance for multimedia-related SIMD operations due to its advanced media processing engine.

Figure 2.4: Qualcomm Snapdragon System on Chip.


On the other hand, all Snapdragon processors contain the circuitry to encode and decode high-definition video. Regarding the GPU, all the Snapdragon chipsets include the Adreno GPU, the company's proprietary GPU design. This low power GPU provides support for different graphics APIs, such as OpenGL ES and OpenVG. Furthermore, the Adreno GPU is able to accelerate 3D user interfaces for Android and other mobile operating systems, and it provides full support for websites based on the Flash and WebGL frameworks. The Qualcomm Snapdragon chipsets also include circuitry for audio encoding/decoding, wireless communication (3G modem) and GPS, as shown in figure 2.4.

Although Adreno is a very powerful and interesting GPU, no technical document describes the specifications of this piece of hardware. Information such as the number of processors, the size of the caches or the instruction set is not available at all.

2.3.2 PowerVR chipsets

PowerVR is a division of Imagination Technologies that develops hardware for 2D and 3D rendering [14]. PowerVR accelerators are not manufactured by PowerVR; instead, their integrated circuit designs and patents are licensed to other companies such as Samsung, Apple and many others. The PowerVR graphics accelerators are included in the Systems on Chip of many popular devices, such as the Apple iPhone 4 and iPad.

The PowerVR chipsets use a method of rendering known as tile-based deferred rendering [15] (often abbreviated as TBDR). As the application feeds triangles to the PowerVR GPU, it stores them in memory in a triangle strip or an indexed format. Unlike in other architectures, polygon rendering is not performed until all polygon information has been collated for the current frame. Furthermore, the expensive operations of texturing and shading pixels are delayed, whenever possible, until the visible surface at a pixel is determined.

In order to perform the rendering, the display is split into rectangular sections in a grid pattern, each section is known as a tile. Associated with each tile is a list of triangles that visibly overlap that tile. Each tile is rendered in turn to produce the final image. Tiles are rendered using a process similar to ray-casting. Rays are cast onto the triangles associated with the tile and a pixel is rendered from the triangle closest to the camera.

The architecture implementing the tile-based rendering algorithm is shown in figure 2.5. As the application feeds triangles to the GPU, the Tile Accelerator (TA) assigns these triangles to the corresponding overlapping tiles. So the TA creates a list of visible triangles for each tile; the list includes the triangle coordinates and all the necessary information: active textures, render states...

Once all the polygons in the scene have been dispatched to the GPU and classified into the corresponding tiles, the rendering process starts. The Image Synthesis Processor (ISP) is responsible for determining which pixels in a tile are visible. Hidden Surface Removal (HSR) is performed on a tile-by-tile basis, with each tile's HSR results sent to the Texture and Shading Processor (TSP) for rasterization of the visible pixels. The ISP processes all triangles affecting a tile one by one. Calculating the triangle equation and projecting a ray at each position in the triangle returns accurate depth information for all pixels. This depth information is then compared with the values in the tile's depth buffer to determine whether these pixels are visible or not. The Texture and Shading Processor (TSP) in the PowerVR pipeline behaves much like a traditional shading and texturing engine.

Figure 2.5: PowerVR GPU architecture.

Tile based rendering architectures have several advantages over traditional rasterization architectures. Since the scene is rasterized on a tile-by-tile basis and a tile is much smaller than the whole display, all the necessary information to process a tile (color buffer and depth buffer information, for instance) can be stored on-chip. Therefore, accesses to external memory are avoided to a large extent. Other advantages such as great cache efficiency and parallel processing of localized data are also important factors.

Regarding the drawbacks of tile based rendering, although off-chip memory accesses can be avoided in many cases, the memory bandwidth increases in other places in the pipeline. For example, the triangle/tile sorting needs to be done, and creating the triangle lists increases bandwidth usage. The memory requirements for this triangle/tile sorting are significant, since the GPU has to capture the information of the whole 3D scene.

There is an ongoing debate about which architecture is best suited for rasterization. As explained in [29], the performance of the different rendering architectures is clearly scene-dependent. This means that there will be three-dimensional scenes where a tiling architecture performs much better than a standard architecture, but the opposite is also true. Unfortunately, there is no academic study analyzing the advantages and disadvantages in terms of hardware implementation and memory bandwidth usage.

A more detailed review of the PowerVR GPU architecture is provided in [16].

2.3.3 NVIDIA Tegra 2

The NVIDIA Tegra 2 is a multi-core System on Chip for mobile devices such as smartphones and tablets. The Tegra 2 integrates two ARM Cortex-A9 processors, an ultra-low power GeForce GPU and specialized hardware for audio and video encoding/decoding (figure 2.6).

Figure 2.6: NVIDIA Tegra 2 architecture.

NVIDIA's ultra-low power GeForce GPU in the Tegra processor is derived from the desktop GeForce GPU architecture, but is specifically tailored to meet the growing demands of mobile applications. The ultra-low power GeForce GPU is highly customized and modified to deliver high-end graphics while consuming very little power. The GeForce architecture is a fixed-function pipeline architecture that includes fully programmable pixel and vertex shaders, along with an advanced texture unit that supports high quality anisotropic filtering. Figure 2.7 shows the graphics processing pipeline of the GeForce GPU in the Tegra mobile processor.

The GeForce GPU includes four programmable vertex processors and four programmable pixel processors for high speed vertex and pixel processing. Although the GeForce GPU architecture is a pipelined architecture similar to traditional desktop graphics architectures, it includes several special features and customizations to significantly reduce power consumption and deliver increased performance and graphics quality.

One of these special features is the introduction of the Early-Z stage in the GPU pipeline, placed before the pixel stage. Modern GPUs use a Z-buffer (depth buffer) to track which pixels in a scene are visible to the eye and which do not need to be displayed because they are occluded by other pixels. The depth test for individual pixels, as defined in the OpenGL logical pipeline, happens after the pixels are processed by the pixel processor. The problem with evaluating individual pixels after the pixel shading process is that pixels must traverse nearly the entire pipeline only to ultimately discover that some are occluded and will be discarded. Processing these non-visible pixels involves a significant number of transactions between the GPU and system memory, which, in the case of mobile devices, consumes significant amounts of power. By performing the depth test before the pixel processing stage, the GeForce architecture fetches depth, color and texture data only for the visible pixels that pass the Z-test. Therefore, the main benefit of Early-Z processing is that it reduces power consumption by reducing memory traffic between the GPU and off-chip system memory.

Figure 2.7: Ultra-low power GeForce architecture.

Another feature is the use of pixel and texture caches to reduce memory transactions. The traditional OpenGL GPU pipeline specifies that pixel information such as texture, depth or color is stored in system memory (or frame buffer memory). The pixel information is moved to and from memory during the pixel processing stage. This requires a significant number of off-chip system memory transactions, and thus consumes large amounts of power. The GeForce architecture implements on-chip pixel and texture caches to reduce the system memory transactions. The pixel cache stores on-chip the depth and color values of pixels, which can be reused for all pixels that are accessed repeatedly. The texture cache is employed to store texture elements (texels) on-chip.

Finally, the GeForce GPU implements several advanced techniques to reduce power consumption including, for instance, multiple levels of clock gating and dynamic voltage and frequency scaling.

A more detailed description of the NVIDIA Tegra 2 architecture is provided in [4].

2.4 Data Cache Prefetching

While trends in both underlying semiconductor technology and computer architecture have significantly increased processor clock rates, the major trend in main memory technology has been in the direction of higher densities, with memory access times decreasing much less than processor cycle times. These trends have increased main memory latencies when measured in processor clock cycles. To avoid performance losses due to this disparity of speed between the CPU and main memory, modern processors rely on a hierarchy of cache memories. However, cache memories are not always effective due to limited cache capacity and limited associativity. In order to overcome these limitations of cache memories, data can be prefetched into the cache.

In this section we review different prefetching schemes for CPUs. We start with the simplest prefetching scheme, the stride prefetcher, and then review the Markov prefetcher and the distance prefetcher. First we review the implementation of these prefetchers by using a table to record the necessary information, and later we show how these prefetching schemes can be implemented more effectively by using a Global History Buffer [37].

Several prefetching schemes have also been proposed for GPUs; this type of prefetcher is aware of the special characteristics of the GPU architecture. In this section we review two prefetching schemes targeting GPUs. First, we describe the many-thread aware prefetcher proposed in [36]. Next, we review a prefetching scheme specifically designed for texture caches [33].

The "aggressiveness" of a prefetcher can be characterized by its prefetch degree. The degree of prefetching determines how many requests can be initiated by one prefetch trigger. Increasing the degree can be beneficial if the prefetched lines are used by the application, or harmful if the prefetched lines are evicted before being accessed by the application.

2.4.1 CPU prefetchers

Stride prefetcher

Conventional stride prefetching [31] uses a table to store stride-related local history information (figure 2.8). The program counter (PC) of a load instruction is employed to index the table. Each table entry stores the load's most recent stride (the difference between the two most recent load addresses), the last address (to allow computation of the next local stride), and state information describing the stability of the load's recent stride behavior. When a prefetch is triggered, addresses a+s, a+2s, ..., a+ds are prefetched (a is the load's current target address, s is the detected stride and d is the prefetch degree, an implementation-dependent prefetch look-ahead distance).
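A minimal sketch of the table update and prefetch generation just described; the table organization, replacement policy and the stability check are simplified relative to the original proposal [31].

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

struct StrideEntry {
    uint64_t lastAddr = 0;    // last address seen for this load PC
    int64_t  stride   = 0;    // most recent stride
    bool     stable   = false; // has the same stride been seen twice in a row?
};

// On a load at `pc` accessing `addr`, update the table entry and, if the
// stride is stable, return the prefetch addresses a+s, a+2s, ..., a+ds
// (d is the prefetch degree).
std::vector<uint64_t> onLoad(std::unordered_map<uint64_t, StrideEntry>& table,
                             uint64_t pc, uint64_t addr, int d) {
    std::vector<uint64_t> prefetches;
    StrideEntry& e = table[pc];                        // indexed by the load's PC
    int64_t newStride = (int64_t)addr - (int64_t)e.lastAddr;
    e.stable = (newStride == e.stride);                // stride repeated -> stable
    e.stride = newStride;
    e.lastAddr = addr;
    if (e.stable && e.stride != 0)
        for (int k = 1; k <= d; ++k)
            prefetches.push_back(addr + k * e.stride);
    return prefetches;
}
```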

When originally proposed, this method was applied to a single L1 cache, and all load PCs were applied to the stride prefetching table.

Figure 2.8: Stride prefetching table.

However, using all load PCs results in relatively high demand on L1 and L2 cache ports. Later, Nesbit et al. [37] proposed to implement the stride prefetcher by using only the PCs and addresses of the loads that miss in the cache.

Markov prefetcher

Markov prefetching [34] is an example of a correlation prefetching method. Correlation prefetching uses a history table to record consecutive address pairs. When a cache miss occurs, the miss address indexes the correlation table (figure 2.9). Each entry in the Markov correlation table holds a list of addresses that have immediately followed the current miss address in the past. When a table entry is accessed, the members of its address list are prefetched, most recent miss address first. To update the table, the previous miss address is used to index the table and the current miss address is inserted in the address list. To insert the address, the current list of addresses is shifted to the right and the new address is inserted in the "most recent" position (the column labeled "1st" in figure 2.9).

Markov prefetching models the miss address stream as a Markov graph, a probabilistic state machine. Each node in the Markov graph is an address and the arcs between nodes are labeled with the probability that the arc’s source node address will be immediately followed by the target node address. Each entry in the correlation table represents a node in an associated Markov graph, and its list of memory addresses represents arcs with the highest probabilities. Thus, the table maintains only a very raw approximation to the actual Markov probabilities.
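The correlation table update can be sketched compactly in code. The list length (two successors, as in figure 2.9) and the lack of an eviction policy are simplifications of this example, not details from [34].

```cpp
#include <cstdint>
#include <deque>
#include <unordered_map>
#include <vector>

class MarkovPrefetcher {
    std::unordered_map<uint64_t, std::deque<uint64_t>> table_; // miss addr -> successors
    uint64_t prevMiss_ = 0;
    bool havePrev_ = false;
    static constexpr size_t kListLen = 2;  // successors kept per entry

public:
    // Called on every cache miss; returns the addresses to prefetch.
    std::vector<uint64_t> onMiss(uint64_t miss) {
        std::vector<uint64_t> prefetches;
        auto it = table_.find(miss);
        if (it != table_.end())  // prefetch the known successors, most recent first
            prefetches.assign(it->second.begin(), it->second.end());
        if (havePrev_) {         // update: `miss` followed `prevMiss_`
            auto& list = table_[prevMiss_];
            list.push_front(miss);                     // insert in "most recent" slot
            if (list.size() > kListLen) list.pop_back(); // shift older entries out
        }
        prevMiss_ = miss;
        havePrev_ = true;
        return prefetches;
    }
};
```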

Distance prefetcher

Distance prefetching [35] is a generalization of Markov prefetching. Originally, distance prefetching was proposed for prefetching TLB entries, but the method is easily adapted to prefetching cache lines.

Figure 2.9: Markov prefetching. The left side of the figure shows the state of the correlation table after processing the miss address stream shown at the top of the figure. The right side illustrates the Markov transition graph that corresponds to the example miss address stream.

This prefetching scheme uses the distance between two consecutive global miss addresses, an address delta, to index the correlation table. Each correlation table entry holds a list of deltas that have followed the entry's delta in the past. Figure 2.10 shows an example address delta stream and the state of the correlation table after processing the delta stream. When a cache miss occurs, the new delta is computed by subtracting the previous miss address from the current miss address. This delta is employed to access the table, and the list of deltas in the corresponding entry is used to generate the prefetch requests. The table is updated using the same mechanism explained for the Markov prefetcher.

Figure 2.10: Distance prefetching. The address delta stream corresponds to the sequence of ad- dresses used in the example of figure 2.9.

Distance prefetching is considered a generalization of Markov prefetching because one delta correlation can represent many miss address correlations. On the other hand, unlike Markov prefetching, distance prefetching's predictions are not prefetch addresses themselves. To calculate the prefetch addresses, the predicted deltas are added to the current miss address.

Global History Buffer

Prefetch tables store prefetch history inefficiently. First, table data can become stale and consequently reduce prefetch accuracy. Second, tables suffer from conflicts that occur when multiple access keys map to the same table entry. The main solution for reducing conflicts is to increase the number of table entries; however, this approach increases the table's memory requirements. Third, tables hold a fixed amount of history per entry. Adding more prefetch history per entry creates new opportunities for effective prefetching, but the additional history also increases the table's memory requirements.

A new prefetching structure, the Global History Buffer, is proposed in [37]. This prefetching structure decouples table key matching from the storage of prefetch-related history information. The overall prefetching structure has two levels (figure 2.11):

• An Index Table (IT) that is accessed with a key, as in conventional prefetch tables. The key may be a load instruction's PC, a cache miss address, or some combination. The entries in the Index Table contain pointers into the Global History Buffer.

• The Global History Buffer (GHB), an n-entry FIFO table (implemented as a circular buffer) that holds the n most recent miss addresses. Each GHB entry stores a global miss address and a link pointer. Each pointer points to the previous miss address with the same Index Table key. The link pointers are used to chain the GHB entries into address lists. Hence, each address list is a time-ordered sequence of addresses that have the same Index Table key.

All the prefetchers reviewed in the previous section can be implemented by using a GHB instead of a table. Depending on the key used to index the Index Table, the stride, Markov and distance prefetchers can be implemented more effectively with a GHB. In this section we review the implementation of the distance prefetcher using the GHB approach.

Figure 2.12 illustrates how the GHB can prefetch using a distance prefetching scheme. The "Deltas" box shown in the figure does not exist in GHB hardware, but is extracted by finding the difference between miss addresses in the GHB. As shown in the figure, prefetch addresses are generated by taking the miss address and cumulatively adding deltas; a valid prefetch address is created from each addition.
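A simplified sketch of distance prefetching on top of a GHB-like structure follows. For clarity the circular FIFO is replaced by a growing vector and indices stand in for hardware link pointers, so this is an illustration of the mechanism rather than the hardware design of [37].

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

// GHB entry: a global miss address plus a link to the previous entry
// with the same Index Table key (here the key is an address delta).
struct GhbEntry {
    uint64_t addr;
    int      link;  // index of previous entry with the same delta, or -1
};

class DistanceGhb {
    std::vector<GhbEntry> ghb_;                    // simplified (non-circular) GHB
    std::unordered_map<int64_t, int> indexTable_;  // delta -> most recent GHB index
    uint64_t prevMiss_ = 0;
    bool havePrev_ = false;

public:
    std::vector<uint64_t> onMiss(uint64_t miss, int degree) {
        std::vector<uint64_t> prefetches;
        if (havePrev_) {
            int64_t delta = (int64_t)miss - (int64_t)prevMiss_;
            auto it = indexTable_.find(delta);
            int head = (it != indexTable_.end()) ? it->second : -1;
            // Walk the chain of past occurrences of this delta; the delta
            // that followed each occurrence is a prediction, added
            // cumulatively to the current miss address.
            uint64_t base = miss;
            for (int i = head; i >= 0 && (int)prefetches.size() < degree;
                 i = ghb_[i].link) {
                if (i + 1 < (int)ghb_.size()) {
                    int64_t predicted = (int64_t)ghb_[i + 1].addr
                                      - (int64_t)ghb_[i].addr;
                    base += predicted;
                    prefetches.push_back(base);
                }
            }
            // Insert the new miss at the head of the chain for this delta.
            ghb_.push_back({miss, head});
            indexTable_[delta] = (int)ghb_.size() - 1;
        } else {
            ghb_.push_back({miss, -1});  // first miss: no delta defined yet
        }
        prevMiss_ = miss;
        havePrev_ = true;
        return prefetches;
    }
};
```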

With the GHB approach, one can often get a better estimate of the actual Markov graph transition probabilities than with conventional correlation methods. In fact, the GHB allows a weighting of transition probabilities based on how recently they have occurred.

2.4.2 GPU prefetchers

Many-Thread Aware Prefetching Mechanisms

All the previous prefetching schemes were designed targeting CPU architectures, so they are not aware of the special characteristics of Graphics Processing Units. Lee et al. propose in [36] a new prefetching scheme specifically designed for CUDA applications in a GPGPU environment.

Figure 2.11: Global History Buffer.

Figure 2.12: Distance prefetcher implemented by using a Global History Buffer. The Head Pointer points to the last inserted address in the GHB.

The baseline GPGPU architecture is shown in figure 2.13; the architecture follows NVIDIA's CUDA programming model [17].

In the CUDA model, each core is assigned a certain number of thread blocks, groups of threads that should be executed concurrently. Each thread block consists of several warps, which are much smaller groups of threads.

Figure 2.13: An overview of the baseline GPGPU architecture.

A warp is the smallest unit of hardware execution. A core executes instructions from a warp in an SIMT (Single-Instruction Multiple-Thread) fashion. In SIMT execution, a single instruction is fetched for each warp, and all the threads in the warp execute the same instruction in lockstep, except when there is control divergence. Threads and blocks are part of the CUDA programming model, whereas a warp is an aspect of the microarchitectural design.

The GPGPU architecture illustrated in figure 2.13 is similar to the state-of-the-art architecture of current NVIDIA GPUs. The basic design consists of several cores and an off-chip DRAM, with memory controllers located inside the chip. Each core has SIMD execution units, a software-managed cache (shared memory), a memory request queue (MRQ) and other units. The processor has an in-order scheduler; it executes instructions from one warp, switching to another warp if source operands are not ready. The MRQ is employed to store both demand requests (from the application) and prefetch requests (from the prefetching engine). Each new request is compared to existing requests and, in case of a match, the requests are merged.

The prefetching scheme proposed in [36], the many-thread aware hardware prefetcher, has special features that make it more effective in a GPGPU environment. First, this prefetcher provides improved scalability. Current GPGPU applications exhibit largely regular memory access patterns, so traditional CPU prefetchers should work well. However, because the number of threads is often in the hundreds, traditional training mechanisms do not scale.

In the many-thread aware prefetcher the pattern detectors are trained on a per-warp basis, similar to those in simultaneous multithreading architectures. This aspect is critical, since requests from different warps can easily confuse the pattern detectors. An example is shown in figure 2.14. In this example a strong stride behavior exists within each warp, but due to warp interleaving, a hardware prefetcher only sees a random pattern. To prevent this problem, in the many-thread aware prefetcher stride information trained per warp is stored in a per-warp stride (PWS) table. So the many-thread aware prefetcher is based on the stride prefetcher described in figure 2.8; the PC of the miss address is employed to index the different tables used by this prefetcher, and each entry of these tables contains stride information.

(a) Accesses by warps:

PC    Warp ID   Addr   Delta
0x08  1         0      -
0x08  1         100    100
0x08  1         200    100
0x08  2         10     -
0x08  2         110    100
0x08  2         210    100
0x08  3         20     -
0x08  3         120    100
0x08  3         220    100

(b) Accesses seen by a hardware prefetcher:

PC    Warp ID   Addr   Delta
0x08  1         0      -
0x08  2         10     10
0x08  1         100    90
0x08  3         20     -80
0x08  2         110    90
0x08  3         120    10
0x08  3         220    100
0x08  1         200    -20
0x08  2         210    10

Figure 2.14: An example of memory addresses with/without warp interleaving.

On the other hand, the many-thread aware prefetcher employs stride promotion. Since memory access patterns are fairly regular in GPGPU applications, when a few warps have the same access stride for a given PC, all warps will often have the same stride for that PC. Based on this observation, when at least three PWS entries for the same PC have the same stride, the prefetcher promotes the PC/stride combination to the global stride (GS) table. By promoting strides, yet-to-be-trained warps can use the entry in the GS table to issue prefetch requests immediately without accessing the PWS table.

Another feature of the many-thread aware prefetcher is inter-thread prefetching (IP): each thread can issue prefetch requests for threads in other warps, instead of prefetching for itself. The key idea behind IP is that when an application exhibits a strided access pattern across threads at the same PC, one thread generates prefetch requests for another thread. This information is stored in a separate table called the IP table. The IP table is trained until three accesses from the same PC and different warps have the same stride. Thereafter, the prefetcher issues prefetch requests from the table entry.

Figure 2.15 shows the overall design of the many-thread aware prefetcher, which consists of the three tables discussed earlier: the PWS, GS and IP tables. The IP and GS tables are indexed in parallel with a PC address. When there are hits in both tables, the prefetcher gives a higher priority to the GS table because strides within a warp are much more common than strides across warps. Furthermore, the GS table contains only promoted strides, which means an entry in the GS table has been trained for a longer period than strides in the IP table. If there are no hits in any table, then the PWS table is indexed in the next cycle. However, if any of the tables has a hit, the prefetcher generates a request.

Figure 2.15: Many-thread aware hardware prefetcher.
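The lookup priority just described maps onto a few lines of code. The table types below are minimal stand-ins (real entries would also hold training state such as last addresses and confidence counters), and the key packing is our own.

```cpp
#include <cstdint>
#include <optional>
#include <unordered_map>

using Table = std::unordered_map<uint64_t, int64_t>;      // PC -> stride
using WarpTable = std::unordered_map<uint64_t, int64_t>;  // (PC, warp) -> stride

// Illustrative key packing for the per-warp table.
static uint64_t key(uint64_t pc, int warp) { return (pc << 8) | (unsigned)warp; }

std::optional<int64_t> lookup(const Table& gs, const Table& ip,
                              const WarpTable& pws, uint64_t pc, int warp) {
    // GS and IP are probed in parallel with the PC; on a double hit the
    // GS entry wins, since promoted strides have been trained longer.
    if (auto it = gs.find(pc); it != gs.end()) return it->second;
    if (auto it = ip.find(pc); it != ip.end()) return it->second;
    // Only if both miss is the PWS table probed (next cycle in hardware).
    if (auto it = pws.find(key(pc, warp)); it != pws.end()) return it->second;
    return std::nullopt;
}
```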

On the other hand, the many-thread aware prefetcher includes an adaptive prefetch throttling mechanism to control the aggressiveness of prefetching (the prefetch degree). Large prefetch degrees can reduce performance if the prefetched lines are useless (evicted before being used), so the prefetcher should be able to eliminate the instances of prefetching that yield negative effects while retaining the beneficial cases. Two metrics are employed to control the prefetch degree. The early eviction rate is the number of cache blocks evicted from the prefetch cache before their first use divided by the number of useful prefetches:

Metric(EarlyEviction) = #EarlyEvictions / #UsefulPrefetches

The second metric is the merge ratio. Memory requests can be merged at various levels in the hardware. As shown in figure 2.13, each core maintains its own Memory Request Queue (MRQ). New requests that match existing MRQ requests are merged with the matching request. The merge ratio is the number of intra-core merges that occur divided by the total number of requests:

Metric(Merge) = #IntraCoreMerges / #TotalRequests

The adaptive throttling mechanism maintains the early eviction rate and the merge ratio in each of the cores, periodically updating them and using them to adjust the degree of throttling. The throttling degree varies from 0 (0%: keep all prefetches) to 5 (100%: no prefetch). The prefetcher adjusts this degree using the current values of the two metrics according to the heuristics in figure 2.16. The early eviction rate is considered high if it is greater than 0.02, low if it is less than 0.01, and medium otherwise. The merge ratio is considered high if it is greater than 15% and low otherwise.

Early Eviction Rate   Merge Ratio   Action
High                  -             No prefetch
Medium                -             Increase throttle (fewer prefetches)
Low                   High          Decrease throttle (more prefetches)
Low                   Low           No prefetch

Figure 2.16: Throttling heuristics.
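The heuristics of figure 2.16, combined with the thresholds quoted above, can be expressed directly in code; the degree encoding (0 keeps all prefetches, 5 drops all of them) follows the text.

```cpp
// Throttling degree: 0 keeps all prefetches, 5 issues no prefetches.
// Thresholds follow the text: the early eviction rate is high above 0.02
// and low below 0.01; the merge ratio is high above 15%.
int adjustThrottle(int degree, double earlyEvictionRate, double mergeRatio) {
    const int kNoPrefetch = 5;
    if (earlyEvictionRate > 0.02)                       // high: no prefetch
        return kNoPrefetch;
    if (earlyEvictionRate >= 0.01)                      // medium: fewer prefetches
        return degree < kNoPrefetch ? degree + 1 : degree;
    if (mergeRatio > 0.15)                              // low + high merge:
        return degree > 0 ? degree - 1 : degree;        //   more prefetches
    return kNoPrefetch;                                 // low + low merge: no prefetch
}
```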

Prefetching Architecture for Texture Caches

The prefetchers described in the previous sections were designed for general purpose computing, and their performance increase is especially significant for applications with regular memory access patterns. Even the GPGPU prefetcher, the many-thread aware prefetcher, was specifically designed for scientific applications on a CUDA-like architecture.

Igehy et al. [33] proposed a prefetching architecture for texture caches. This prefetcher was designed targeting graphics workloads on a traditional GPU architecture, similar to the GeForce architecture described in section 2.3.3. The main objective of this prefetcher is to accelerate the process of applying textures to triangles (texture mapping).

Texture mapping has become ubiquitous in real-time graphics hardware. In its most basic form, texture mapping is a process by which a 2D image is mapped onto a projected screen-space triangle under perspective. This operation amounts to a linear transformation in 2D homogeneous coordinates. The transformation is typically done as a backward mapping: for each pixel on the screen, the corresponding coordinate in the texture map is calculated. The backward mapped coordinate typically does not fall exactly onto a sample in the texture map, and the texture may be minified or magnified on the screen. Filtering is applied to minimize the effects of aliasing, and ideally, the filtering should be efficient and amenable to hardware acceleration. Mip mapping [40] is the filtering technique most commonly implemented in graphics hardware.
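To make the backward mapping concrete, the sketch below converts an interpolated (u, v) coordinate into a texel address with nearest-neighbour sampling. Perspective correction, mip-map level selection and bilinear filtering, which real hardware performs, are deliberately omitted; the struct layout is an assumption of this example.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

struct Texture {
    const uint32_t* texels;  // RGBA8 texels, row-major
    int width, height;
};

// Backward mapping with nearest-neighbour filtering: the interpolated
// texture coordinate (u, v) in [0, 1) is converted to the index of the
// closest texel in the texture map.
uint32_t sampleNearest(const Texture& t, float u, float v) {
    int x = (int)std::floor(u * t.width);
    int y = (int)std::floor(v * t.height);
    x = std::min(std::max(x, 0), t.width - 1);   // clamp to the texture edge
    y = std::min(std::max(y, 0), t.height - 1);
    return t.texels[y * t.width + x];
}
```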

Figure 2.17 shows the part of the graphics pipeline where texture mapping is performed. The rasterizer circuitry converts 2D triangles to pixels on the screen, and each of these pixels is processed in a fragment processor. In order to apply texture mapping, the fragment processor has to fetch the corresponding texels (texture elements) from texture memory. Since textures are located in off-chip main memory, the fragment processor is provided with a texture cache to reduce the latency of memory accesses and the number of off-chip system memory transactions.

The prefetching architecture for texture caches proposed in [33] is illustrated in figure 2.18. The architecture processes fragments as follows. As each fragment is generated, each of its texel addresses is looked up in the cache tags. If a tag check reveals a miss, the cache tags are updated with the fragment's texel address immediately and the address is forwarded to the memory request FIFO. The cache addresses associated with the fragment are forwarded to the fragment FIFO and are stored along with all the other data needed to process the fragment: color, depth, filtering information... As the request FIFO sends requests for missing cache blocks to the texture memory system, space is reserved in the reorder buffer to hold the returning memory blocks. This guarantee of space makes the architecture robust and deadlock-free in the presence of an out-of-order memory system.

Figure 2.17: Baseline architecture for texture mapping.

When a fragment reaches the head of the fragment FIFO, it can proceed only if all of its texels are present in the cache. Fragments that generated no misses can proceed immediately, but fragments that generated one or more misses must first wait for their corresponding cache blocks to return from memory into the reorder buffer. In order to guarantee that new cache blocks do not prematurely overwrite older cache blocks, new cache blocks are committed to the cache only when their corresponding fragment reaches the head of the fragment FIFO. Fragments that are removed from the head of the FIFO have their corresponding texels read from the cache and proceed onward to the rest of the texture pipeline.
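The front half of this flow, the tag check at fragment generation and the two FIFO insertions, can be modeled in a few lines. This is a simplified sketch with our own structure names; the reorder buffer and the commit-at-FIFO-head logic are omitted:

    #include <deque>
    #include <unordered_set>
    #include <utility>
    #include <vector>

    // One fragment and the texel cache lines it needs (tags only).
    struct Fragment {
        std::vector<unsigned> texelLines;
        int pendingMisses = 0;  // misses still in flight for this fragment
    };

    struct TextureCacheFrontEnd {
        std::unordered_set<unsigned> tags;  // lines present or already requested
        std::deque<unsigned> requestFifo;   // misses forwarded to texture memory
        std::deque<Fragment> fragmentFifo;  // fragments waiting for their texels

        void onFragment(Fragment f) {
            for (unsigned line : f.texelLines) {
                if (!tags.count(line)) {          // tag check misses:
                    tags.insert(line);            // update tags immediately so later
                    requestFifo.push_back(line);  // fragments do not re-request it
                    ++f.pendingMisses;
                }
            }
            fragmentFifo.push_back(std::move(f)); // wait for head-of-FIFO processing
        }
    };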

Figure 2.18: Texture cache prefetcher architecture.


One of the key parameters of this prefetching architecture is the size of the fragment FIFO. This FIFO primarily masks the latency of the memory system. If the system is not to stall on a cache miss, it must be able to continually service new fragments while previous fragments are waiting for texture cache misses to be filled. Thus, the fragment FIFO depth should at least match the latency of the memory system.
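As a back-of-the-envelope restatement of this sizing rule (our own formulation, not a formula from [33]):

    \text{fragment FIFO depth} \;\geq\; \text{memory latency (cycles)} \times \text{fragment rate (fragments/cycle)}

For instance, a memory system with a 100-cycle latency and a rasterizer producing one fragment per cycle would call for a FIFO of at least 100 entries.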

3 Problem statement: Memory Wall for Low Power GPUs

3.1 Hiding memory latency on a modern low power mobile GPU

The clock rate of mobile CPUs and GPUs has grown rapidly in recent years. Nowadays, it is usual to find smartphones and tablets with a CPU clock rate of 1 GHz, and this trend seems set to continue. For example, the new Qualcomm Snapdragon S3 has a clock rate of 1.5 GHz [18] and the Qualcomm roadmap includes a mobile chipset with a clock rate between 2.0 GHz and 2.5 GHz by 2012 [13]. Hence, these mobile devices are going to hit the memory wall. Due to the disparity of speed between the CPU/GPU and memory, the performance of these mobile devices will be significantly affected by the latency to access main memory. Thus, the use of techniques to hide this latency is going to be necessary. Furthermore, these techniques must improve the behavior of the memory system without breaking the limited power budget of smartphones.

Basically, the three main techniques for hiding memory latency are caches, multithreading and prefetching. Caches are a very effective technique for CPUs; however, in GPUs caches focus on conserving bandwidth rather than reducing latency [32]. Graphics workloads usually exhibit irregular memory access patterns, so the typical hit rates of the caches in the GPU are not as high as in a CPU. Although the hit rates are far from perfect, caches can filter a significant percentage of the accesses to system memory, so they are a good mechanism to save memory bandwidth in a GPU, and mobile GPUs include different types of caches (section 2.3.3). However, caches are not the ideal solution for hiding memory latency on GPUs due to the special characteristics of graphics workloads.

On the other hand, multithreading is a very effective technique to keep all the GPU processors utilized, and state of the art desktop GPUs support thousands of simultaneous threads [19]. Figure 3.1 shows the effectiveness of multithreading for hiding the memory latency in different Android games. As we increase the number of threads in each processor we obtain better performance. With 16 warps per processor the performance is very close to the performance of a system with perfect caches, so multithreading is able to hide all the memory latency if the number of threads available is large enough (which is the case for graphics workloads).

Figure 3.1: Effectiveness of multithreading for hiding memory latency. As we increase the number of warps on each processor we obtain better performance.

Figure 3.2: Power consumed by the GPU main register file for different configurations.

Although multithreading is very effective for hiding the memory latency, it is also a power hungry technique. Due to the need for fast context switching, the GPU has to keep the architectural state of all the threads in execution in the register file. Since the number of threads is large (thousands), the size of the main register file becomes huge when applying aggressive multithreading. As we can see in figure 3.2, the power consumed by the main register file increases significantly as we increase the number of simultaneous threads. For 32 warps the power is close to 250 mW, so the GPU exceeds its power budget (the power budget of a mobile System on Chip is between 1 and 2 Watts, including CPU, GPU, specialized circuitry for video encoding/decoding...). Therefore, although multithreading is an effective technique for hiding the latency of the main memory, it is not well suited for a low power environment.

The last technique is prefetching. Prefetching has been studied in depth and several prefetching schemes have been proposed for both CPUs and GPUs, as we have seen in section 2.4. However, there are several issues with the previous proposals. First, the CPU prefetchers are effective for applications with regular memory access patterns and none of them directly apply to GPUs [36]. Second, the GPU prefetchers described in section 2.4.2 are effective for scientific applications written in CUDA. These prefetchers are effective in heavily multithreaded systems but they also require applications with regular memory access patterns, so they are not well suited for graphics workloads. The GPU prefetcher for texture caches described in section 2.4.2 is very effective for graphics applications. However, it was designed for a GPU with just one pixel processor and it cannot be directly applied to a multicore GPU. Applying it to a multicore GPU introduces several challenges, as we will describe in chapter 4.

In conclusion, we have observed the lack of a mechanism for hiding the main memory latency in low power systems when executing graphics workloads.


4 Proposal: Decoupled Access Execute Prefetching

4.1 Ultra-low power decoupled prefetcher

In this chapter we present our ultra-low power decoupled prefetcher for Graphics Processing Units. This prefetching scheme has been designed for graphics workloads in low power environments. This section is organized as follows. First, we describe the baseline GPU architecture, which is similar to the ultra low-power GeForce GPU in the NVIDIA Tegra 2 chipset (section 2.3.3). Next, we present the first version of the decoupled prefetcher, which is based on the prefetching architecture presented in [33]. Finally, we present additional optimizations to improve performance and reduce power consumption.

4.1.1 Baseline GPU Architecture

The baseline GPU architecture is illustrated in figure 4.1; it is based on the ultra-low power GeForce in the NVIDIA Tegra 2 chipset. In this architecture pixels are generated and processed as follows. First, the rasterizer circuitry performs the scan conversion or raster conversion described in section 2.1. The rasterizer takes 2D triangles as input and generates the pixels to fill these triangles (figure 2.1d shows an example of raster conversion). All the generated pixels, or fragments in OpenGL terminology, are inserted into the fragment queue.

After raster conversion, non-visible pixels are discarded by using the Z-buffer algorithm [9]. The hardware in the Early Depth Test stage performs the visibility determination. The depth value of each fragment read from the fragment queue is compared with the current value in the Z-buffer. If the fragment's depth value is smaller than the current value, the Z-buffer is updated and the fragment proceeds through the pipeline. Otherwise, the fragment is discarded. Hence, in order to perform the visibility determination the hardware in the Depth Test stage has to access memory at most twice for each fragment: once to read the current depth value and, if the fragment passes the depth test, a second time to write the new depth value. The Depth Test stage employs a cache to optimize this process, so part of the Z-buffer is stored within this pixel cache.

Figure 4.1: Baseline GPU architecture (based on the ultra-low power GeForce GPU in the NVIDIA Tegra 2 chipset).
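The per-fragment logic of the Early Depth Test stage can be summarized in a short sketch, in which a toy map stands in for the pixel cache and the Z-buffer; the accessor names are ours:

    #include <unordered_map>

    static std::unordered_map<unsigned, float> zbuffer;  // address -> depth value

    float readDepth(unsigned addr) {                     // first memory access
        auto it = zbuffer.find(addr);
        return it == zbuffer.end() ? 1.0f : it->second;  // 1.0f = far plane
    }

    // Returns true if the fragment is visible and should proceed.
    bool earlyDepthTest(unsigned addr, float fragDepth) {
        if (fragDepth < readDepth(addr)) {  // closer than what is stored
            zbuffer[addr] = fragDepth;      // second access: update the Z-buffer
            return true;
        }
        return false;                       // occluded: discard the fragment
    }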

After the Depth Test stage the visible fragments are packed in groups of n fragments, or tiles; we have chosen 4 as the number of fragments in each tile. These tiles are inserted in the tile queue to be processed by the fragment processors. The four fragments within a tile will be processed in the same streaming processor. The scheduler in the Fragment stage reads tiles from the queue and decides in which processor each one of the tiles will be processed. There are 4 streaming processors in the Fragment stage, and the scheduler employs a round-robin policy to dispatch tiles to the processors.

The streaming processors perform several operations on each fragment, such as texture mapping, blending, per-pixel lighting... These processors are programmable and the user can specify the sequence of instructions to be applied to each fragment. The streaming processors are in-order processors and multithreading is employed to try to hide the memory latency. Each processor has 8 thread hardware contexts grouped in 2 warps. A warp is a group of threads that are scheduled together and executed in lockstep mode. Each tile is processed by one warp, and each one of the 4 threads in a warp processes one of the 4 fragments in the tile. There are also 4 SIMD execution units, or vector units, in each streaming processor, so at a given time just one of the warps is in execution. A streaming processor fetches and executes instructions from one warp until a cache miss is encountered; then the processor fetches instructions from the other warp to try to hide the latency of the memory access.

Each streaming processor is provided with a pixel cache and a texture cache. The pixel cache is employed to store color values of pixels (cache lines from the color buffer) and the texture cache is used to store texture elements (cache lines from texture memory). Hence, in the whole architecture there are 10 caches: the L2 cache, the pixel cache employed in the Depth Test stage and one pixel cache and one texture cache in each one of the 4 streaming processors. Prefetching can be applied to each one of these caches in order to improve performance.

4.1.2 Decoupled prefetcher

Traditional prefetchers are triggered on cache misses. Whenever a cache miss occurs, the prefetching engine triggers one or more (depending on the degree of prefetching) cache line requests to the next level of the memory hierarchy, following a prediction scheme based on history information. However, in the GPU architecture previously described a more efficient approach can be employed. The information stored in the fragment queue allows us to compute which cache lines from the Z-buffer will be accessed in the Depth Test stage. In the same manner, the information stored in the tile queue allows us to compute which cache lines from the color buffer and the texture memory will be accessed in the fragment processing stage. Therefore, this information can be employed to preemptively prefetch the cache lines that will be accessed during the processing of each one of the fragments.

In the decoupled prefetching scheme, a prefetch request is sent to the corresponding Texture/Pixel cache for each cache line that will be requested in the future during the processing of the fragments. The cache controller handles prefetch requests as follows. First, the tags are checked to see if the target line of the prefetch request is already in the cache. In case of a hit, the prefetch request is disregarded. In case of a miss, the prefetch request is redirected to the next level of the memory hierarchy. When the data is served by the next level, the cache is updated.

The architecture of our proposed prefetching scheme is illustrated in figure 4.2. As we can observe, the prefetching engine is decoupled from the caches and the streaming processors. The decoupled prefetcher works as follows. For each new fragment generated in the rasterizer the corresponding address in the Z-buffer is computed; the fragment is inserted in the fragment queue and the memory address is inserted in the prefetch queue. While fragments are waiting in the fragment queue to be processed in the Depth Test stage, the prefetch requests for the corresponding cache lines are sent to the Pixel cache. The prefetch queue is traversed each cycle to try to send pending prefetch requests. By preemptively prefetching cache lines we expect that all the necessary depth values will be available in the pixel cache when the fragments are read from the fragment queue and processed in the Depth Test stage.

We can apply the same decoupled prefetching scheme to the pixel and texture caches in the streaming processors. Once the visible fragments are packed in tiles, we can compute which cache lines from the color buffer and from texture memory will be accessed during the processing of each tile, so we can prefetch all the necessary cache lines while the tiles are waiting in the tile queue. However, in this case the prefetching is more challenging because there are 4 streaming processors and 8 caches, so we have to decide in which cache we are going to prefetch the lines. Furthermore, we have to guarantee that each tile is processed in the streaming processor in which we have prefetched its data. To solve this issue we move the scheduling from the entry of the Fragment stage to the output of the Depth Test stage. When a new tile is created it is scheduled to a streaming processor by using a round-robin policy, and the ID of the processor is stored in the tile queue together with the rest of the tile information. All the necessary cache lines for the tile will be prefetched in the pixel and texture caches of the corresponding processor; the prefetch queue includes an additional field to identify the target cache of the prefetch request. When the tiles are read from the tile queue they are dispatched to the corresponding streaming processor.

Merging is employed to reduce the number of prefetch requests. For example, let's assume a cache line size of 64 bytes. If the four fragments within a tile will access the memory addresses 4, 8, 12 and 16 respectively, then just one prefetch request, to cache line 0, is issued to the prefetch queue. The prefetch queue is clocked each cycle to try to send pending prefetch requests. This queue has two fields for each entry: the tag of the cache line to be prefetched and the ID of the target cache (the cache where the data will be prefetched).
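The merging step amounts to de-duplicating the line tags touched by the fragments of a tile. A minimal sketch, assuming the 64-byte lines of the example above:

    #include <set>
    #include <vector>

    // Returns the distinct cache-line tags touched by a tile's fragments.
    // For addresses 4, 8, 12 and 16 this yields the single tag 0, so only
    // one prefetch request is inserted in the prefetch queue.
    std::vector<unsigned> mergeTilePrefetches(const std::vector<unsigned>& addrs) {
        const unsigned kLineSize = 64;
        std::set<unsigned> lineTags;
        for (unsigned a : addrs)
            lineTags.insert(a / kLineSize);  // address -> line tag
        return {lineTags.begin(), lineTags.end()};
    }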

Our decoupled prefetcher is based on the prefetcher described in section 2.4.2, but our work differs in several ways. First, since we have moved the Depth Test stage before the Fragment processing stage, color values and texture elements are prefetched just for the visible pixels, which significantly reduces the number of prefetch requests. Second, our prefetcher works in a multiprocessor environment with multiple caches. On the contrary, the texture cache prefetcher described in section 2.4.2 assumes just one cache and one streaming processor.

The size of the queues (two prefetch queues, fragment queue and tile queue) is a key parameter in this prefetching scheme. If the queues are small, the prefetcher cannot prefetch early enough, so the prefetch requests may still be in flight when the fragments are read from the queue and the data will not be in the cache. Thus, as we reduce the size of these queues we increase the number of compulsory misses. On the other hand, if the queues are too big we increase the number of conflicts due to the limited associativity. For example, assuming 2-way associative caches, if three different cache lines from three different tiles are prefetched to the same cache and they are mapped to the same set, a conflict miss is produced. In this case, a cache line that will be accessed by a tile is evicted because a younger prefetch request maps to the same set.

Figure 4.2: Decoupled prefetcher architecture.

4.1.3 Decoupled prefetcher improvements

We can further improve the decoupled prefetcher by better utilizing the bandwidth to the L2 cache. When we implemented the decoupled prefetcher in our simulation infrastructure (section 5.1.2), we realized that we were often prefetching the same cache line to different caches. An example of this case is illustrated in figure 4.3. In the example there is a prefetch request to cache line A targeting the texture cache of processor 2 and another prefetch request to the same cache line but targeting the texture cache of processor 3. A prefetch request for line A will be sent to the texture cache of processor 2; if the line is not in the cache, the request will be redirected to the L2 cache and line A will be read from the L2 cache and stored in texture cache 2. Furthermore, another prefetch request for the same line will be sent to texture cache 3 and, in case of a miss, the prefetch will also be resent to the L2 cache. Therefore, 2 prefetch requests to the L2 cache for the same line are generated and the line is read from the L2 cache twice.

Figure 4.3: Improved decoupled prefetcher.


We can employ a more efficient approach to handle the case described in the previous example. Since cache line A is prefetched into texture cache 2, texture cache 3 can obtain this line from texture cache 2 instead of from the L2 cache. In this way, we save bandwidth to the L2 cache and we reduce power, since texture and pixel caches are much smaller than the L2 cache.

The improvement proposed in this section is implemented as follows. First, the prefetch queue includes an additional field, Source, with the ID of the cache to which the prefetch request will be redirected in case of a miss in the target cache. When a prefetch request is inserted in the prefetch queue, its tag is compared with the current tags in the prefetch queue. In case of a match, the Source field of the prefetch request is set to the Cache ID of the matching request. If there is no match, the Source field is set to the ID of the L2 cache. In case of multiple matches, we select the youngest matching request. When the prefetch request is issued to the corresponding cache, the information in the Source field is packed within the request. This information is employed by the cache controller in case of a cache miss to redirect the prefetch request to the corresponding Texture/Pixel cache instead of to the L2 cache.
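The insertion logic for the Source field might look as follows; the cache IDs and the queue layout are assumptions made for the sketch:

    #include <deque>

    constexpr int kL2CacheId = -1;  // sentinel ID for the L2 cache

    struct PrefetchEntry {
        unsigned tag;     // cache line to prefetch
        int targetCache;  // texture/pixel cache that should receive the line
        int sourceCache;  // where to fetch from on a miss in the target cache
    };

    void insertPrefetch(std::deque<PrefetchEntry>& queue, unsigned tag, int target) {
        int source = kL2CacheId;  // default: fetch from the L2 cache
        // Scan from the back so that, with multiple matches, the youngest
        // matching request wins (as described above).
        for (auto it = queue.rbegin(); it != queue.rend(); ++it) {
            if (it->tag == tag) { source = it->targetCache; break; }
        }
        queue.push_back({tag, target, source});
    }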

By introducing this improvement we try to save bandwidth to the L2 cache, since a significant percentage of the prefetch requests that in the previous scheme were served by the L2 cache will now be served by Pixel/Texture caches. Furthermore, accessing a Pixel/Texture cache requires less energy because these caches are much smaller than the L2 cache, so we also expect to save power. The experimental results presented in section 6.3 prove these claims.

Regarding the management of the requests in the cache controller, there is no prioritization of demand requests (requests from the application) over prefetch requests from the prefetch queue or remote prefetch requests from other caches; all the requests have the same priority. It might be beneficial to serve the demand requests first; however, since we have obtained important speedups without considering priorities, we have not explored this option.

The architecture of the improved decoupled prefetcher is shown in figure 4.3. The connection between texture caches 2 and 3 is highlighted to illustrate that the cache line A is obtained from texture cache 2, not from the L2 cache.


5 Evaluation methodology

5.1 Simulation infrastructure

We have developed a simulation infrastructure in order to evaluate the performance of several prefetching techniques on a mobile GPU. Our infrastructure is divided into two main components: the trace generation system and the cycle accurate GPU simulator. The trace generation system is able to intercept all the rendering commands (OpenGL ES commands) in Android and save all the necessary information for each command: number of vertices processed, number of triangles, number of fragments generated for each triangle... This information is stored in a GPU trace, which is the input to the cycle accurate GPU simulator. The simulator computes different GPU statistics such as number of cycles, IPC or miss rates for the different caches.

We have employed several existing tools to develop our infrastructure. For example, we have used Android and QEMU for the trace generation tool. Furthermore, our GPU simulator is based on a previous GPU simulator, Qsilver [39].

5.1.1 GPU trace generation

The GPU trace generation system is illustrated in figure 5.1. We employ QEMU [20] to boot and run the Android [21] operating system. On top of Android we run some smartphone applications like the web browser, the audio player or games.

When Android is executed on top of an emulator, such as QEMU, the OpenGL ES commands are processed by the Android Software Renderer, as described in section 2.2.1. We have instrumented this library to collect all the necessary information for the cycle accurate GPU simulator.

The instrumentation code has been inserted in the three rendering functions of the OpenGL ES API: glDrawArrays, glDrawElements and glDrawTex.

Figure 5.1: GPU trace generation system.

Whenever an application calls a rendering function from the OpenGL ES API, our instrumentation code starts to collect information about the rendering process. At the beginning of the rendering function some state information is collected:

• Lighting information: lighting enabled/disabled, number of lights...

• Texturing information: texturing enabled/disabled, number of active texture units...

• Array information: for each one of the OpenGL client arrays (vertex, color, normal and texture coordinates array) the following information is saved:

– Enabled/Disabled.
– Base address of the array.
– Size of each element of the array in bytes.
– Stride between elements.

• Rendering mode: points, lines, triangles, triangle strip, triangle fan...

On the other hand, as the rendering command is processed we save the following information for each triangle and for each pixel:

• Triangle information: visibility (is this triangle discarded in the clipping stage?) and the list of all the pixels generated to fill the triangle.

• Pixel information:

– Visibility (is this pixel discarded in the Depth Test?).
– Address of the pixel in the Z-buffer.
– Address of the pixel in the color buffer.
– Addresses of all the texture elements accessed to process this pixel.

At the end of the rendering function, all the collected information about the rendering command is ready to be dumped to the trace file. All the instrumentation code is executed inside the guest operating system (Android), so if we open a file and save the information about the rendering command, the file will be created in a virtual file system, since Android is executed inside a virtual machine. We would then have to transfer the file from the virtual file system to the file system of the host operating system in order to feed this trace file to the GPU simulator. The host system is the system in which QEMU is executed, in our case Linux.

On the other hand, we can create the trace file directly in the host system by using a different approach. We can signal the end of the rendering command in the Android Software Renderer in some manner, detect this signal in QEMU and save all the collected information to the trace file in the host file system. To signal the end of a rendering command we employ an interrupt with a special code, 0x99, and we have modified the code translation in QEMU accordingly. When an interrupt instruction with code 0x99 is found, this instruction is replaced by a call to a function that saves all the collected information to the trace file.
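On the guest side, raising this marker can be as simple as the sketch below; the helper name is ours, but the 0x99 interrupt code is the one described above (GCC inline assembly, x86 guest):

    // Hypothetical helper inserted at the end of each instrumented rendering
    // function. The modified QEMU code translator recognizes the interrupt
    // with code 0x99 and replaces it with a call that dumps the collected
    // information to the trace file on the host.
    static inline void signalEndOfRenderingCommand() {
        asm volatile("int $0x99");
    }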

5.1.2 Cycle accurate GPU simulator

The cycle accurate GPU simulator is able to read the information in the trace files, created by the GPU trace generation system previously described, and simulate the execution of the rendering commands in a state-of-the-art mobile GPU similar to the one inside the NVIDIA Tegra 2 chipset.

The architecture modelled by the GPU simulator is illustrated in figure 5.2. In this section we briefly describe each one of its components. Furthermore, we describe the power model employed to obtain the energy required to process the rendering commands.

The first stage in the graphics pipeline is the Primitive processing. This stage fetches all the necessary information for each vertex: position, color, normal, texture coordinates... The information stored in the GPU trace about the OpenGL client arrays (see section 5.1.1) is employed to issue the corresponding memory requests to the VBO (Vertex Buffer Object) Cache. As the vertex data is fetched from memory the vertices are inserted in the first Vertex queue.

Vertices are transformed and shaded in the Vertex processing stage. This stage contains several Streaming Processors, each of which is able to process one vertex. A sequence of instructions, or vertex shader, is applied to each one of the vertices read from the first Vertex queue. The vertex shader is obtained from the GPU trace. The instruction set employed is the OpenGL Architecture Review Board ISA for vertex programs [22].

Figure 5.2: GPU architecture modelled by the cycle accurate simulator.

The number of Streaming Processors is a parameter of the simulator and can be modified in the configuration file; the default value is 4. Once a vertex is processed in a Streaming Processor it is inserted in the second Vertex queue.

The next stage in the graphics pipeline is the Primitive Assembly. Vertices are read from the second Vertex queue and grouped into the corresponding triangles. Once the 3 vertices of a 2D triangle have been found, the triangle is clipped against the screen (as described in section 2.1). Finally, the clipped 2D triangles are inserted into the Triangle queue.

Raster conversion is the next step in the rasterization process. The Rasterizer takes 2D triangles from the Triangle queue and generates the pixels, or fragments in OpenGL terminology, to fill the triangles. The Rasterizer module in the simulator does not perform this raster conversion itself; it employs the information stored in the GPU trace. As we have described in section 5.1.1, the GPU trace stores the list of pixels generated by the Android Software Renderer's rasterizer for each triangle, so the simulator does not have to repeat a task that was already performed in the Android OpenGL driver. The generated pixels are inserted into the Fragment queue.

The Early Depth Test stage performs the visibility determination by applying the Z-buffer algorithm [9]. For each fragment read from the Fragment queue, a memory request is issued to the Pixel cache in order to obtain the depth value in the corresponding position of the Z-buffer. If the fragment is visible, another memory request is issued to update the depth value in the Z-buffer. Finally, visible fragments are packed in groups of 4 fragments, or tiles, and they are inserted in the Tile queue. The information stored in the GPU trace for each fragment is also employed in this stage. This information includes the address in the depth buffer, which is necessary to issue the memory requests to the Pixel cache, and the fragment visibility: true if the pixel is visible or false if it must be discarded.

Finally, the tiles are processed in the Fragment processing stage. In this stage, fragments within tiles are textured and shaded. There are several Streaming Processors in the Fragment stage, each of which is able to process multiple tiles. The Streaming Processors apply a sequence of instructions, or fragment shader, to each one of the fragments. The fragment shader is obtained from the GPU trace, as well as the memory addresses that have to be requested to process each fragment. The instruction set employed is the OpenGL Architecture Review Board ISA for fragment programs [23]. A more detailed description of these Streaming Processors is provided in section 4.1.1.

The GPU simulator computes several statistics. For example, it computes the total number of cycles to process all the rendering commands in the trace file, the number of instructions, the IPC and the miss rates of each one of the caches.

In order to compute the cycles, the GPU simulator models a streaming processor as a very simple in-order processor. The pipeline has 5 stages: Instruction Fetch, Instruction Decode, Operands Fetch, Execution and Writeback. Only one instruction can be fetched, decoded and issued per cycle. However, the same instruction is issued to all the SIMD execution units, so the same instruction is executed in parallel n times (where n is the number of execution units) but with different data. There is no forwarding mechanism: each instruction waits until its source operands are available, and in case of a data dependency the pipeline is stalled. Regarding the latencies, each instruction spends a different number of cycles in the execution stage. We have obtained the latencies of each one of the instructions in the ISA from the Qsilver GPU simulator [39].
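These timing rules (one instruction issued per cycle, stalls until source operands are ready, per-opcode execution latencies) can be captured with a small scoreboard. The sketch below is a simplification of our model with illustrative names:

    #include <algorithm>
    #include <vector>

    struct Inst { int dst, src1, src2, latency; };  // register IDs + exec latency

    // Returns the cycle at which the last writeback completes on a toy
    // in-order pipeline with no forwarding.
    long simulateWarp(const std::vector<Inst>& prog, int numRegs) {
        std::vector<long> ready(numRegs, 0);  // cycle each register becomes ready
        long cycle = 0;
        for (const Inst& in : prog) {
            ++cycle;  // at most one instruction fetched/decoded/issued per cycle
            cycle = std::max({cycle, ready[in.src1], ready[in.src2]});  // stall
            ready[in.dst] = cycle + in.latency;  // writeback time of the result
        }
        long end = cycle;
        for (long r : ready) end = std::max(end, r);  // drain in-flight results
        return end;
    }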

Regarding the power model, we employ CACTI [24] to compute the energy consumed by the caches, the queues between stages and the register files. Furthermore, we employ the power model of Qsilver [39] to obtain the energy consumed by the ALUs in the Streaming Processors. The dynamic energy consumed by the GPU is thus the sum of the dynamic energy consumed by the following components (a sketch of this accounting follows the list):


• The caches: the simulator provides the number of accesses to each cache, and by using CACTI we compute the dynamic energy required to access it. We obtain the total energy consumed in each one of the caches by multiplying the total number of accesses by the energy per access.

• The queues: as in the previous case, we multiply the total number of accesses to the queue (provided by the simulator) by the energy per access (computed with CACTI).

• The Streaming Processors: we account for the energy consumed by the Main Register File and the SIMD execution units:

– Main Register File: we obtain the energy per access by using CACTI and we multiply this value by the total number of accesses to the main register file (obtained from the simulator statistics).
– ALUs: we employ the power model from Qsilver [39], a simple power model in which each one of the instructions in the ISA has a fixed amount of energy assigned: the energy required to execute the instruction. So if there are N different instructions in the ISA, the energy consumed by the ALUs is defined by the following equation:

$$E_{\mathrm{ALUs}} = \sum_{i=1}^{N} \mathit{Num\_instructions}_i \times E_i$$

The number of executed instructions of each type is computed by the GPU simulator; the energy required to execute each instruction is obtained from Qsilver.

• The prefetchers: All the prefetching schemes employ one or several structures like tables or queues (see section 2.4) that are accessed multiple times. As in the previous cases, the number of accesses is computed by using the GPU simulator and the energy per access is provided by CACTI.
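Putting the pieces together, the dynamic-energy accounting reduces to multiplying event counts by per-event energies. The sketch below illustrates this bookkeeping; the maps are placeholders for the numbers produced by the simulator, CACTI and Qsilver:

    #include <string>
    #include <unordered_map>

    // accesses/energyPerAccess: per structure (caches, queues, register files).
    // instCounts/energyPerInst: per ISA opcode (the E_ALUs sum above).
    double dynamicEnergy(
        const std::unordered_map<std::string, long>& accesses,
        const std::unordered_map<std::string, double>& energyPerAccess,
        const std::unordered_map<int, long>& instCounts,
        const std::unordered_map<int, double>& energyPerInst) {
        double total = 0.0;
        for (const auto& [name, n] : accesses)
            total += n * energyPerAccess.at(name);  // simulator count x CACTI energy
        for (const auto& [op, n] : instCounts)
            total += n * energyPerInst.at(op);      // simulator count x Qsilver energy
        return total;
    }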

Regarding the static energy, the simulator is able to compute the number of idle cycles for each one of the hardware structures (caches, register files, queues...). By combining this information with the energy estimations provided by CACTI we can compute the total leakage power.

When we present power numbers in chapter 6, these numbers include the power consumed by all the caches, all the queues and all the Streaming Processors. Furthermore, if some prefetching scheme is employed, the power results also include the power consumed by all the hardware structures used by the prefetcher.

6 Experimental results

In this section we present the experimental results obtained with the simulation infrastructure described in section 5.1. First, we analyze several Android applications from the Android Store and establish that the most demanding applications are, as expected, games. Furthermore, we show the potential benefits of improving the memory system by analyzing the behavior of the texture and pixel caches. Second, we present the performance and power results for the different state-of-the-art prefetchers. Finally, we compare these results with the performance and power consumption of our ultra-low power decoupled prefetcher.

6.1 Workload characterization

We have analyzed several Android applications from the Android Store in order to evaluate the behavior of the CPU and the GPU. We have included several common applications, such as the web browser and the audio player, and several 2D and 3D games. To obtain statistics about the CPU we have employed a full-system simulator, MARSSx86 [25]. MARSSx86 consists of QEMU, an emulator which is able to boot and run an OS, and PTLSim [26], a cycle accurate simulator for the x86 instruction set. We have introduced several modifications to MARSSx86. First, we have modified PTLSim to compute CPI stacks [30]. Second, we have integrated our GPU trace generator and cycle accurate simulator (section 5.1) in MARSSx86, so we can obtain information from both the CPU and the GPU. Since PTLSim, the cycle accurate CPU simulator, only supports the x86 ISA, we have employed the x86 version of Android [27].

The CPU configuration employed for the experiments is described in figure 6.1. We have configured the simulator to model a very simple out-of-order processor with small caches in order to keep power consumption within the limited power budget of smartphones.

CPU configuration
Core:                 2-issue out-of-order core, 4 functional units: load unit, store unit, integer ALU and FPU. Two-level cache hierarchy.
L1 Instruction cache: 64-byte lines, 4-way associative, 16 KBytes, 2-cycle latency.
L1 Data cache:        64-byte lines, 4-way associative, 16 KBytes, 2-cycle latency.
L2 cache:             64-byte lines, 8-way associative, 256 KBytes, 12-cycle latency.

Figure 6.1: CPU configuration for the experiments.

Figure 6.2: CPI stacks for several Android applications. iCommando, Shooting Range 3D and PolyBreaker 3D are commercial games from the Android market.

The results of the CPU/GPU analysis are summarized in figure 6.2. This figure shows the CPI stacks for several Android applications: the Android app store, the audio player, the web browser and 3 commercial games. We have included in the CPI stacks the cycles that the CPU spends waiting for the GPU. As we can observe, the behavior of the games is different from the rest of the applications. For applications that are not games, the CPI is relatively small (between 1.5 and 2.5 cycles per instruction) and the main sources of pipeline stalls are branch mispredictions and L2 cache misses. On the other hand, for games the CPI is large (between 5.5 and 16) and the main source of stalls is the GPU. Hence, games are the most demanding applications and, furthermore, they are the only applications that stress the GPU. These characteristics make games the ideal applications for studying the memory behavior of a GPU.

Figure 6.3: Misses per 1000 instructions for the different caches in the GPU.

We have analyzed the memory behavior of several commercial games by using our GPU simulation infrastructure. The GPU configuration employed for the experiments is described in figure 6.7. First, we have evaluated the miss rates for the different caches in the GPU; the results are shown in figure 6.3. The L2 cache presents the biggest miss rates in all the games except iBowl, in which the texture cache turns out to be the most problematic cache. Although these miss rates seem small, a significant performance speedup can still be achieved by improving the behavior of the caches.

Figure 6.4: Texture and pixel cache analysis.


We have performed several experiments to evaluate the potential benefits of improving the pixel and texture caches; the results are shown in figure 6.4. As we can observe, the use of perfect texture caches provides a speedup of 48% on average. Furthermore, making the pixel caches perfect yields an average speedup of 65%. Hence, a significant speedup can be achieved by improving the behavior of the different caches in the GPU.

Prefetching is one of the techniques that can be employed to improve the behavior of the memory system. However, as we have seen in section 2.4, conventional prefetchers for CPUs and GPUs work well for applications with regular memory access patterns. Games, and graphics workloads in general, usually exhibit an unpredictable memory access pattern [32]. In order to understand the memory behavior of our applications we have analyzed the strides of all the cache misses by using Sequitur [38]. Sequitur is able to construct the grammar that generates the sequence of strides we have recorded during the execution of an application. By analyzing these grammars we can identify patterns in the strides of the cache misses, for example: a cache miss with stride 1 is usually followed by a cache miss with stride 2. Hence, by analyzing the Sequitur grammars we can evaluate how easily the strides can be predicted.
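The preprocessing for this analysis is simple: the recorded sequence of miss addresses is turned into a sequence of strides, which is then fed to Sequitur [38]. A minimal sketch (we assume the addresses are already expressed at cache-line granularity, which is how we read the tables below):

    #include <cstddef>
    #include <vector>

    // Converts a stream of cache-miss line addresses into the stride sequence
    // used as input to Sequitur. The grammar inference itself is external.
    std::vector<long> missStrides(const std::vector<long>& missLineAddrs) {
        std::vector<long> strides;
        for (std::size_t i = 1; i < missLineAddrs.size(); ++i)
            strides.push_back(missLineAddrs[i] - missLineAddrs[i - 1]);
        return strides;
    }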

iCommando (2D game)

Pixel cache                               Texture cache
Rules                        Histogram    Rules                        Histogram
1   -> [1] [1]   (44238)     1  - 91.59%  146 -> [1] [1]   (35184)     1   - 70.82%
4   -> 1 [1]     (36190)     3  - 1.03%   14  -> 146 [1]   (24802)     -1  - 5.75%
23  -> 4 [1]     (23880)     24 - 0.72%   183 -> 14 [1]    (21983)     64  - 3.92%
45  -> 23 [1]    (20327)     25 - 0.61%   405 -> 183 [1]   (17428)     65  - 3.37%
516 -> 45 [1]    (18393)     23 - 0.55%   108 -> 405 [1]   (16203)     -65 - 2.52%

Figure 6.5: Analysis of the strides of the cache misses in the Pixel and Texture cache of one Streaming processor when running the 2D game iCommando. In the Sequitur grammars non-terminal symbols (rules) are represented by numbers and terminal symbols (strides) are represented by numbers in square brackets. After each rule we show the number of times the rule is applied to form the input sequence of strides. We only show the 5 most frequent rules of the grammar.

PolyBreaker 3D (3D game)

Pixel cache                               Texture cache
Rules                        Histogram    Rules                          Histogram
111 -> [1] [1]   (43231)     1  - 63.77%  145   -> [1] [1]    (20848)    1   - 48.39%
51  -> 111 [1]   (26247)     25 - 5.44%   99    -> 145 [1]    (16144)    8   - 6.08%
43  -> 51 [1]    (18251)     24 - 3.84%   1556  -> 99 [1]     (11941)    4   - 3.61%
60  -> 43 [1]    (13584)     -1 - 3.59%   7341  -> 1556 [1]   (11811)    -8  - 2.82%
104 -> 60 [1]    (11964)     2  - 3.53%   10316 -> 7341 [1]   (11718)    -16 - 2.18%

Figure 6.6: Analysis of the strides of the cache misses in the Pixel and Texture cache of one Streaming processor when running the 3D game PolyBreaker 3D. For each cache the figure shows the 5 most frequent rules of the grammar and the 5 most frequent strides.

Figure 6.5 shows the result of the stride analysis for iCommando, one of the 2D games. As we can see, stride 1 is the most common stride: 91.59% of the misses in the pixel cache and 70.82% of the misses in the texture cache have stride 1. The most frequently applied rules of the grammar also include stride 1. This means that most of the time, when there is a cache miss, the next cache miss will be in the next line. We have observed a similar behavior for all the 2D games we have evaluated. So for 2D games the memory access patterns are regular and conventional prefetchers should work relatively well. This makes sense because 2D games basically consist of a sequence of blitting operations [28], in which a matrix of pixels is copied into another matrix (the color buffer).

On the other hand, 3D games exhibit the memory behavior described in figure 6.6. In this case, the frequency of stride 1 is only 63% in the pixel cache and 48% in the texture cache, and other strides such as 4 or 8 are relatively common. Hence, for 3D games the strides of the cache misses are not as predictable as in the previous case, which makes the work of the prefetchers harder.

6.2 State of the art prefetchers performance

In this section we evaluate the performance and power consumption of different state-of-the-art CPU and GPU prefetchers. These are the configurations we have analyzed:

• Baseline - No prefetching: This is the baseline GPU architecture shown in figure 5.2 without any kind of prefetching. The parameters for the architecture are described in figure 6.7.

• Stride prefetcher (Table): In this configuration we have included the stride prefetcher implemented with a table shown in figure 2.8 in each one of the caches of the GPU. The stride table has a size of 16 entries and the prefetch degree is set to 2.

• Distance prefetcher (GHB): This configuration employs a distance prefetcher implemented with a GHB (figure 2.12) in each one of the caches of the GPU. The Index Table has a size of 16 entries, the GHB has a size of 64 entries and the prefetch degree is set to 2.

• Many-Thread Aware Prefetcher with Throttling: This configuration employs the GPU prefetcher described in figure 2.15 in each one of the caches. Each one of the tables employed in this prefetcher (PWS, GS and IP table) has a size of 16 entries and the prefetch degree is dynamically adapted from 0 to 5.

• Perfect caches: All the caches are ideal, and have a hit rate of 100%.

Figure 6.8 shows the speedup for the different prefetching techniques. The stride prefetcher provides an average speedup of 1.31. The distance prefetcher with GHB and the many-thread aware prefetcher provide better performance than the stride prefetcher. The GHB prefetcher achieves a speedup of 2.27, which is slightly better than the speedup obtained with the many-thread aware prefetcher (2.19). Although the many-thread aware prefetcher has been designed specifically for GPUs, it does not provide better performance than the state-of-the-art CPU prefetcher. There are several reasons for this. First, the many-thread aware prefetcher was designed targeting a GPU architecture similar to the NVIDIA Fermi [19], in which there are thousands of simultaneous threads in execution at the same time. As we can see in figure 6.7, in our mobile GPU architecture there are only 8 thread hardware contexts per processor due to power constraints, whereas in the NVIDIA Fermi architecture there are 1024 simultaneous threads per streaming processor. So the effectiveness of some mechanisms like inter-thread prefetching or stride promotion (see section 2.4.2) is significantly limited by the small number of in-flight threads. Furthermore, graphics workloads do not exhibit regular memory access patterns, whereas the many-thread aware prefetcher was designed for scientific applications developed in CUDA with very regular access patterns. Nevertheless, the performance of this prefetcher is very close to the GHB prefetcher on average and it outperforms the distance prefetcher with GHB in some of the games (ibowl, pocketracing, quake2 and shooting). On the other hand, the speedups achieved by these prefetchers are far from the speedup obtained by a system with perfect caches.

GPU configuration
Fragment processing stage: 4 Streaming processors
Vertex processing stage:   4 Streaming processors
Streaming processor:       4 SIMD execution units, 1 Pixel cache, 1 Texture cache, 8 thread hardware contexts (2 warps, 4 threads in each warp)
Pixel cache:               64-byte lines, 2-way associative, 8 KBytes, 2-cycle latency
Texture cache:             64-byte lines, 2-way associative, 8 KBytes, 2-cycle latency
L2 cache:                  64-byte lines, 8-way associative, 256 KBytes, 12-cycle latency

Figure 6.7: GPU configuration for the experiments. The baseline GPU architecture is the one illustrated in figure 5.2.

Figure 6.8: Speedups for different state-of-the-art prefetchers.

Regarding the power consumption, figure 6.9 shows the power for each one of the prefetchers normalized by the power of the baseline architecture without prefetching. As we can observe, the stride prefetcher is the prefetching scheme with the smallest power consumption (due to its simplicity), and it only consumes 1.1% more than the baseline architecture on average. Once again, the behaviors of the GHB prefetcher and the many-thread aware prefetcher are very close. The distance prefetcher with GHB consumes 4.5% more than the baseline GPU architecture on average, whereas the many-thread aware prefetcher consumes 4.9% more than the baseline. In three of the benchmarks the GHB prefetcher consumes more power (angryfrogs, ibowl and tankrecon), but in the other 5 games the many-thread aware prefetcher requires more power.

Figure 6.9: Normalized power consumption for different state-of-the-art prefetchers.

In conclusion, the three state-of-the-art prefetchers provide significant speedups over the baseline GPU without prefetching, especially the GHB prefetcher and the many-thread aware prefetcher. However, all the prefetchers also require more power than the baseline GPU.

6.3 Ultra-low power decoupled prefetcher performance

In this section we evaluate the performance and power consumption of our ultra-low power decoupled prefetcher. We have included these three additional configurations:

• Original decoupled prefetcher: This is the prefetching architecture for texture caches proposed by Igehy et al. in [33] and described in section 2.4.2. The original idea only works for systems with one processor, so the experiments for this prefetching scheme have been performed by using just one streaming processor instead of four. The size of the prefetch queues is 32 entries.

• Decoupled prefetcher: This configuration implements our decoupled prefetcher illustrated in figure 4.2. The size of the prefetch queues is 32 entries.

• Decoupled prefetcher with optimizations: This configuration implements our decoupled prefetcher with the optimizations described in section 4.1.3 to reduce the number of requests to the L2 cache. The size of the prefetch queues is also 32 entries.

Figure 6.10: Ultra-low power decoupled prefetcher compared with state-of-the-art prefetchers.

Figure 6.10 shows the performance improvement provided by our decoupled prefetcher. The original decoupled prefetcher causes a performance penalty with respect to the baseline GPU on average. However, this is not a fair comparison because the original decoupled prefetching scheme can only be implemented with one processor, whereas the baseline GPU includes 4 streaming processors. Nevertheless, it offers 86% of the performance of a system with 4 processors by using just one streaming processor. Furthermore, it even outperforms the baseline GPU in some of the games (angryfrogs and pocketracing).

Regarding our decoupled prefetcher, it offers better performance than the state-of-the-art CPU and GPU prefetchers in all of the games and it achieves a speedup of 2.63. By reducing the number of requests to the L2 cache (decoupled prefetcher with optimizations) the speedup is even better, 2.94 on average, and it is close to the speedup of a system with perfect caches (3.37).

Figure 6.11 shows the speedups of our decoupled prefetcher compared to one of the state-of-the-art prefetchers, the distance prefetcher with GHB (as we have seen in the previous section, this is the state-of-the-art prefetcher that provides the best performance for our mobile GPU architecture). As we can see, our decoupled prefetcher achieves 15% improvements on average over the GHB prefetcher. Furthermore, if we apply the optimizations it provides 29% improvements over the GHB.

Regarding the power consumption, the power results are presented in figure 6.12. This figure shows the power consumed by each one of the prefetching schemes normalized by the power consumed by the baseline GPU architecture without prefetching. As we can see, the decoupled prefetcher consumes less power than the distance prefetcher with GHB and the many-thread aware prefetcher in all of the games. Furthermore, the optimizations to reduce the number of accesses to the L2 cache turn out to be effective at reducing power. The decoupled prefetcher with these optimizations consumes less power than the baseline GPU on average and in all of the games except ibowl. It consumes about 6% less power than the state-of-the-art CPU and GPU prefetchers and 1.1% less power than the baseline GPU architecture on average. Although the power savings seem small, the optimized decoupled prefetcher delivers them while providing significant performance improvements.

Figure 6.11: Ultra-low power decoupled prefetcher compared with the distance prefetcher implemented with GHB.

Figure 6.12: Decoupled prefetcher power consumption.

In the previous graphs we have reported power savings and speedups separately, but it is also interesting to consider both parameters, power and performance, at the same time. Thus, we have computed the energy-delay product for the different prefetching schemes and normalized the results by the energy-delay product of the baseline GPU without prefetching (figure 6.13). As we can see, the improvement introduced by the decoupled prefetcher is even larger when the speedup and the energy savings are considered together.

Figure 6.13: Normalized energy-delay product.

Figure 6.14: Prefetch queue size evaluation. The graph shows the speedup achieved by the decoupled prefetcher over the baseline GPU without prefetching for different sizes of the prefetch queue, for the game shooting.


Finally, we have analyzed the impact of the prefetch queue size of the decoupled prefetcher on the performance improvements. Figure 6.14 shows the evolution of the speedup obtained over the baseline GPU without prefetching in the 3D game shooting as we increase the size of the prefetch queue. If the size of the prefetch queue is small, the prefetcher cannot prefetch the necessary lines early enough, so the number of compulsory misses increases. If the size of the prefetch queue is big, the likelihood of a cache line prefetched for a pixel being replaced by another cache line prefetched for a younger pixel increases. Therefore, the number of conflict misses increases as we increase the size of the prefetch queue. As we can observe in figure 6.14, we get the best results for intermediate values of the prefetch queue size (from 64 to 512 entries).

In conclusion, the ultra-low power decoupled prefetcher outperforms the state-of-the-art CPU and GPU prefetchers. It is 2.94 times faster than the baseline GPU architecture and 1.29 times faster than the best state-of-the-art prefetcher on average. Furthermore, it provides these performance improvements without increasing the power consumption. In fact, it consumes 1.1% less power than the baseline GPU architecture without prefetching on average.


7 Conclusions

Games are the most demanding applications for smartphones. Graphics workloads make intensive use of the GPU while the CPU is idle most of the time. Due to the growing disparity of speed between the GPU cores and memory, one of the most performance limiting factors of the GPU is the latency to access main memory. Multithreading is a commonly used technique to tolerate memory latency. However, we found that it does so by significantly increasing power consumption. Prefetching is also a very effective technique for hiding memory latency on a mobile GPU: we have shown that by using prefetchers we can achieve a speedup of 2.94 on average over a GPU without prefetching in a commercial set of games. Furthermore, this speedup is achieved without increasing energy consumption, which is of primary importance in a mobile GPU.

Despite the special characteristics of graphics workloads, state of the art CPU and GPGPU prefetchers are an effective mechanism to improve the memory behavior of a mobile GPU. Just by using a simple stride prefetcher implemented with a table we get a speedup of 1.31 on average over a GPU without prefetching. The distance prefetcher implemented with a GHB achieves a speedup of 2.27, whereas the state of the art GPGPU prefetcher (the many-thread aware prefetcher) provides a speedup of 2.19. However, all these prefetchers produce a small increase in energy consumption. Moreover, the performance enhancements are far from the speedup achieved by a system with perfect caches (3.37), so there is a significant margin for improvement.

A decoupled access/execute prefetching architecture can be very effective at hiding the memory latency. Our decoupled prefetcher achieves a speedup of 2.63 over a GPU without prefetching and 1.15 over the distance prefetcher with GHB, while using just 1.4% more power than the baseline GPU. Furthermore, we also show that performance can be improved and power can be reduced by carefully moving data around and by orchestrating the accesses to the L2 cache (section 4.1.3). By using these optimizations the speedup achieved is 2.94 over a GPU without prefetching and 1.29 over the GHB prefetcher. Moreover, the power is reduced by 1.1% with respect to the baseline GPU.


Traditional CPU and GPGPU prefetchers make predictions by using history information. These prefetchers are triggered on cache misses and the only information they have available is the sequence of miss addresses. Using the miss address stream, they try to guess which cache lines will be requested next. On the contrary, the decoupled prefetcher employs the information about the pixels to compute which lines are going to be requested, so the prefetch requests are not based on predictions. Moreover, the decoupled prefetcher has better knowledge of the whole system (number of processors, number of texture caches, number of pixel caches) and it can employ this information to prefetch more effectively. For instance, if the prefetcher knows that a pixel is going to be processed in streaming processor 0, then all the necessary data to process the pixel will be prefetched in the texture and pixel caches of processor 0. Therefore, knowing exactly which cache lines are going to be requested in each one of the processors gives the decoupled prefetcher a big advantage over the other prefetchers.

The prefetch request queue must be sized large enough to achieve timeliness of prefetching, which mostly depends on memory latency. But it must also avoid excessive length that could lead to late requests evicting yet-to-be-used previously prefetched data due to cache conflicts. We have found that lengths between 32 and 512 entries are appropriate for our workloads.

Bibliography

[1] http://assets.en.oreilly.com/1/event/39/Internet%20Trends%20Presentation.pdf.

[2] http://www.migsmobile.net/2010/01/12/evolution-of-mobile-device-uses-and-battery-life/.

[3] http://en.wikipedia.org/wiki/Android_%28operating_system%29.

[4] http://www.nvidia.com/content/PDF/tegra_white_papers/Bringing_High-End_Graphics_to_Handheld_Devices.pdf.

[5] http://en.wikipedia.org/wiki/Rasterisation.

[6] http://en.wikipedia.org/wiki/Ray_tracing_%28graphics%29.

[7] http://en.wikipedia.org/wiki/Sutherland-Hodgeman.

[8] http://en.wikipedia.org/wiki/Scanline_algorithm.

[9] http://en.wikipedia.org/wiki/Z_buffer.

[10] http://en.wikipedia.org/wiki/Dalvik_virtual_machine.

[11] http://www.khronos.org/opengles/.

[12] http://en.wikipedia.org/wiki/System-on-a-chip.

[13] http://en.wikipedia.org/wiki/Snapdragon_%28system_on_chip%29.

[14] http://en.wikipedia.org/wiki/PowerVR.

[15] http://en.wikipedia.org/wiki/Tiled_rendering.

[16] http://www.imgtec.com/factsheets/SDK/PowerVR%20Technology%20Overview.1.0.2e. External.pdf.

[17] NVIDIA Corporation. CUDA Programming Guide, V3.0.

[18] http://www.qualcomm.com/snapdragon/specs.

[19] NVIDIA Fermi. http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_ Fermi_Compute_Architecture_Whitepaper.pdf.

[20] http://wiki.qemu.org/. 69 BIBLIOGRAPHY

[21] http://developer.android.com/guide/basics/what-is-android.html.

[22] http://oss.sgi.com/projects/ogl-sample/registry/ARB/vertex_program.txt.

[23] http://oss.sgi.com/projects/ogl-sample/registry/ARB/fragment_program.txt.

[24] http://www.hpl.hp.com/research/cacti/.

[25] http://www.marss86.org/.

[26] http://www.ptlsim.org/.

[27] http://www.android-x86.org/.

[28] http://en.wikipedia.org/wiki/Bit_blit.

[29] Tomas Akenine-Moller and Jacob Strom. Graphics processing units for handhelds. Proceedings of the IEEE, 96:779–789, 2008.

[30] Stijn Eyerman, Lieven Eeckhout, Tejas Karkhanis, and James E. Smith. A performance counter architecture for computing accurate cpi components. In Proceedings of the 12th inter- national conference on Architectural support for programming languages and operating systems, ASPLOS-XII, pages 175–184, New York, NY, USA, 2006. ACM.

[31] John W. C. Fu, Janak H. Patel, and Bob L. Janssens. Stride directed prefetching in scalar processors. SIGMICRO Newsl., 23:102–110, December 1992.

[32] M. Gebhart, D. R. Johnson, D. Tarjan, S. W. Keckler, W. J. Dally, E. Lindholm, and K. Skadron. Energy-efficient mechanisms for managing thread context in throughput pro- cessors. Proceedings of the ACM/IEEE International Symposium on (ISCA), June 2011.

[33] Homan Igehy, Matthew Eldridge, and Kekoa Proudfoot. Prefetching in a texture cache ar- chitecture. In SIGGRAPH / Eurographics Workshop on Graphics Hardware, pages 133–142, 1998.

[34] Doug Joseph and Dirk Grunwald. Prefetching using markov predictors. In In Proceedings of the 24th Annual International Symposium on Computer Architecture, pages 252–263, 1997.

[35] Gokul B. Kandiraju and Anand Sivasubramaniam. Going the distance for tlb prefetching: an application-driven study. In Proceedings of the 29th annual international symposium on Com- puter architecture, ISCA ’02, pages 195–206, Washington, DC, USA, 2002. IEEE Computer Society.

[36] Jaekyu Lee, Nagesh B. Lakshminarayana, Hyesoon Kim, and Richard Vuduc. Many-thread aware prefetching mechanisms for gpgpu applications. IEEE/ACM International Symposium on Microarchitecture, 0:213–224, 2010.

[37] Kyle J. Nesbit and James E. Smith. Data cache prefetching using a global history buffer. IEEE Micro, 25(1):90–97, 2005. 70 BIBLIOGRAPHY

[38] Craig G. Nevill-Manning and Ian H. Witten. Identifying hierarchical structure in sequences: a linear-time algorithm. J. Artif. Int. Res., 7:67–82, September 1997.

[39] J. W. Sheaffer, D. Luebke, and K. Skadron. A flexible simulation framework for graphics archi- tectures. In Proceedings of the ACM SIGGRAPH/EUROGRAPHICS conference on Graphics hardware, HWWS ’04, pages 85–94, New York, NY, USA, 2004. ACM.

[40] Lance Williams. Pyramidal parametrics. In Proceedings of the 10th annual conference on Computer graphics and interactive techniques, SIGGRAPH ’83, pages 1–11, New York, NY, USA, 1983. ACM.

71