
TECHNICAL UNIVERSITY OF CATALONIA School of Informatics

Master on Architecture, Networks and Systems

Master Thesis

High Performance, Ultra-Low Power Streaming Systems

Student: José María Arnau

Advisor: Joan-Manuel Parcerisa

Advisor: Polychronis Xekalakis

Advisor: Antonio González

Date: September 20, 2011

Abstract

Smartphones are emerging as one of the fastest growing markets, with new devices and improvements in their operating systems appearing every few months. The design of a CPU/GPU for such mobile devices is challenging due to users' demands for a truly mobile experience, including highly responsive user interfaces, uncompromised web browsing performance and visually compelling gaming experiences, and due to the power constraints imposed by the limited capacity of the battery. In recent years, the power demand of these mobile devices has increased much faster than battery technology has improved.

Our key ambition is to design a CPU/GPU for such a system, minimizing the power consumed while achieving the highest performance possible. We first analyze commercial Android workloads and establish that the most demanding applications in terms of performance are, as expected, games. We show that because these systems are based on OpenGL ES, the CPU is idle the vast majority of the time. In fact, we find that the GPU is much more active than the CPU and that the major performance limitation for these systems is the use of memory by the GPU.

We thus focus on the GPU and more specifically on its memory behavior. We show that for most of the caches employed in these systems, traditional prefetchers provide significant benefits. The exception is the texture cache, for which the access patterns are irregular, especially for 3D games. We then demonstrate how we can alleviate this issue by using a decoupled access/execute-like architecture. We also show that an important part of the power consumed can be reduced by carefully moving data around and by orchestrating the accesses to the L2 cache. The end design achieves performance similar to a more traditional many-warp system while consuming only a fraction of its power. Our experimental results, using the latest version of Android and a commercial set of games, prove this claim. More specifically, our proposed system achieves 29% improvements over state-of-the-art prefetchers while consuming 6% less power.

Keywords

Prefetching, GPU, Android, rasterization.


Contents

1 Introduction
1.1 Motivation
1.2 Objectives and contributions
1.3 Organization

2 Related work
2.1 Rasterization
2.2 Android
2.2.1 Android Software Renderer
2.3 State of the art architectures for mobile devices
2.3.1 Qualcomm Snapdragon
2.3.2 PowerVR chipsets
2.3.3 NVIDIA Tegra 2
2.4 Data Cache Prefetching
2.4.1 CPU prefetchers
2.4.2 GPU prefetchers

3 Problem statement: Memory Wall for Low Power GPUs
3.1 Hiding memory latency on a modern low power mobile GPU

4 Proposal: Decoupled Access Execute Prefetching
4.1 Ultra-low power decoupled prefetcher
4.1.1 Baseline GPU Architecture
4.1.2 Decoupled prefetcher
4.1.3 Decoupled prefetcher improvements

5 Evaluation methodology
5.1 Simulation infrastructure
5.1.1 GPU trace generation
5.1.2 Cycle accurate GPU simulator

6 Experimental results
6.1 Workload characterization
6.2 State of the art prefetchers performance
6.3 Ultra-low power decoupled prefetcher performance

7 Conclusions

List of Figures

1.1 Smartphone sales vs desktop and notebook sales. Data obtained from [1].
1.2 Energy need vs energy available in a standard size battery. Two days of battery life cannot be achieved with current batteries and the gap is getting bigger. Data obtained from [2].
2.1 Initial scene, intermediate results produced by the different stages of the rasterization process and the final result.
  (a) 3D triangles plot
  (b) 2D triangles plot
  (c) Clipped 2D triangles plot
  (d) Pixels after rasterization
  (e) Visible pixels after Z-test
  (f) Shaded and textured pixels after the pixel stage
2.2 Rasterization pipeline.
2.3 Android architecture.
2.4 Qualcomm Snapdragon System on Chip.
2.5 PowerVR GPU architecture.
2.6 NVIDIA Tegra 2 architecture.
2.7 Ultra-low power GeForce architecture.
2.8 Stride prefetching table.
2.9 Markov prefetching. The left side of the figure shows the state of the correlation table after processing the miss address stream shown at the top of the figure. The right side illustrates the Markov transition graph that corresponds to the example miss address stream.
2.10 Distance prefetching. The address delta stream corresponds to the sequence of addresses used in the example of figure 2.9.
2.11 Global History Buffer.
2.12 Distance prefetcher implemented by using a Global History Buffer. The Head Pointer points to the last inserted address in the GHB.
2.13 An overview of the baseline GPGPU architecture.
2.14 An example of memory addresses with/without warp interleaving.
  (a) Accesses by warps
  (b) Accesses seen by a hardware prefetcher
2.15 Many-thread aware hardware prefetcher.
2.16 Throttling heuristics.
2.17 Baseline architecture for texture mapping.
2.18 Texture cache prefetcher architecture.
3.1 Effectiveness of multithreading for hiding memory latency. As we increase the number of warps on each processor we obtain better performance.
3.2 Power consumed by the GPU main register file for different configurations.
4.1 Baseline GPU architecture (based on the ultra-low power GeForce GPU in the NVIDIA Tegra 2 chipset).
4.2 Decoupled prefetcher architecture.
4.3 Improved decoupled prefetcher.
5.1 GPU trace generation system.
5.2 GPU architecture modelled by the cycle accurate simulator.
6.1 CPU configuration for the experiments.
6.2 CPI stacks for several Android applications. iCommando, Shooting Range 3D and PolyBreaker 3D are commercial games from the Android market.
6.3 Misses per 1000 instructions for the different caches in the GPU.
6.4 Texture and pixel cache analysis.
6.5 Analysis of the strides of the cache misses in the Pixel and Texture cache of one Streaming Processor when running the 2D game iCommando. In the Sequitur grammars non-terminal symbols (rules) are represented by numbers and terminal symbols (strides) are represented by numbers in square brackets. After each rule we show the number of times the rule is applied to form the input sequence of strides. We only show the 5 most frequent rules of the grammar.
6.6 Analysis of the strides of the cache misses in the Pixel and Texture cache of one Streaming Processor when running the 3D game PolyBreaker 3D. For each cache the figure shows the 5 most frequent rules of the grammar and the 5 most frequent strides.
6.7 GPU configuration for the experiments. The baseline GPU architecture is the one illustrated in figure 5.2.
6.8 Speedups for different state-of-the-art prefetchers.
6.9 Normalized power consumption for different state-of-the-art prefetchers.
6.10 Ultra-low power decoupled prefetcher compared with state-of-the-art prefetchers.
6.11 Ultra-low power decoupled prefetcher compared with the distance prefetcher implemented with GHB.
6.12 Decoupled prefetcher power consumption.
6.13 Normalized energy-delay product.
6.14 Prefetch queue size evaluation. The graph shows the speedup achieved by the decoupled prefetcher over the baseline GPU without prefetching for different sizes of the prefetch queue, for the game Shooting Range 3D.


1 Introduction

1.1 Motivation

Mobile devices such as smartphones and tablets have become ubiquitous in the last few years. This kind of general purpose but battery limited device has experienced huge growth in both computing capabilities and market share. Regarding the user experience, making calls is just one of the many features these phones offer, since the user can also browse the web, play high-definition videos or play complex 3D games. In regard to the market share, the total number of smartphones sold in 2008 exceeded the total number of desktop PCs, and the gap is increasing each year [1]. Furthermore, the forecast for the coming years predicts that the smartphone market will exceed the notebook and desktop markets by 2012, as shown in figure 1.1.

The design of a CPU/GPU system for smartphones is very challenging due to the user expectations for what these devices should do and the important power limitations. On the one hand, users are demanding a truly mobile computing experience: highly responsive user interfaces, uncompromised web browsing performance, visually compelling online and offline gaming experiences... On the other hand, the power demand is increasing faster than battery improvements, as shown in figure 1.2. The combination of these two factors, the demand for complex applications and the power constraints, is putting considerable pressure on the CPU/GPU, which must provide high performance without breaking the small power budget of mobile devices.

The clock rate of mobile CPUs and GPUs has increased significantly in recent years. Nowadays, smartphones achieve clock rates between 1 GHz and 1.5 GHz. Furthermore, companies such as Qualcomm and NVIDIA have announced CPU/GPU chipsets with clock rates between 2 GHz and 2.5 GHz for 2012. Hence smartphones are going to hit the memory wall, and the latency to access main memory is going to be one of the main performance limiting factors. Thus, the use of techniques to hide the memory latency will be necessary to provide high performance.

Figure 1.1: Smartphone sales vs desktop and notebook sales. Data obtained from [1].

Figure 1.2: Energy need vs energy available in a standard size battery. Two days of battery life cannot be achieved with current batteries and the gap is getting bigger. Data obtained from [2].


Prefetching is one of the main techniques for hiding memory latency. Although prefetching has been extensively studied in CPUs, as far as we know its use has not been evaluated in low-power mobile GPUs running graphics workloads.

1.2 Objectives and contributions

The first objective is to gain a better understanding of the applications available for smartphones. We want to evaluate the behavior of the CPU/GPU when running these applications and to identify the most demanding workloads. We focus on Android [3], since it is one of the most popular platforms for mobile devices and it is open source.

Another objective is to propose a technique that increases the performance of smartphone Graphics Processing Units (GPUs) while keeping the power consumption within the limits of the small power budget. Since games are the most demanding applications for smartphones and memory is one of the main limiting factors in the GPU, as we describe in section 6.1, it is necessary to find a mechanism to hide the latency of main memory. We want to explore the use of prefetching in a low-power mobile GPU, evaluate the performance and power consumption of current state-of-the-art prefetchers and, if necessary, propose a new low-power prefetching technique specifically designed for mobile devices.

In this report we make the following contributions:

1. We perform a characterization of smartphone applications. More specifically, we characterize the behavior of multiple 2D and 3D games on the Android platform.

2. We develop a methodology to evaluate the performance and power consumption of mobile Graphics Processing Units. We propose a technique to identify the code executed by the GPU in the Android software stack. Furthermore, we develop a cycle-accurate GPU simulator which models a mobile GPU similar to the NVIDIA Tegra 2 [4]; the simulator includes performance and power models.

3. We evaluate the effectiveness of state-of-the-art CPU and GPU prefetchers in reducing the memory latency of smartphone GPUs.

4. We propose our ultra-low power decoupled prefetcher, which outperforms previous proposals when running graphics workloads on a low-power mobile GPU. Furthermore, the ultra-low power decoupled prefetcher provides performance improvements while reducing energy consumption.

1.3 Organization

The remainder of this report is organized as follows. In chapter 2 we provide basic background information on the rasterization process. Furthermore, we review the Android platform, some of the state-of-the-art architectures for mobile devices and the most efficient prefetching techniques for CPUs and GPUs. In chapter 3 we describe the problem to solve, and in chapter 4 we explain our solution: the ultra-low power decoupled prefetcher. In chapter 5 we describe the evaluation methodology and present the GPU trace generation system and the cycle-accurate GPU simulator. In chapter 6 we show the experimental results; this chapter includes a workload characterization of several smartphone applications, a comparison of different state-of-the-art CPU and GPU prefetchers and the performance and power results of the ultra-low power decoupled prefetcher. Finally, in chapter 7 we present the main conclusions of the report.

2 Related work

2.1 Rasterization

Rasterization is the process of taking an image described in a vector graphics format (polygons) and converting it into a raster image (pixels or fragments) for output on the screen [5]. Nowadays rasterization is the most popular technique for producing real-time 3D graphics. In comparison to other rendering techniques such as ray tracing [6], rasterization is exceptionally fast. Computers usually include specialized graphics hardware to carry out the task of rasterizing 3D models onto a 2D plane for display on the screen.

In its most basic form, rasterization takes as input a set of 3D polygons and renders them onto a 2D surface, usually a frame buffer. Polygons are described as a collection of 3D triangles, and these 3D triangles are represented by three vertices in 3D space. Basically, rasterizers take a stream of 3D vertices, transform them into corresponding 2-dimensional points on the viewer's screen and fill in the transformed 2-dimensional triangles as appropriate by processing the corresponding pixels.

The rasterization algorithm consists of several stages, each of which produces a partial result, as shown in figure 2.1. The rendering process starts with a vectorial description of a 3D scene (figure 2.1a). All the objects in the scene are described as a collection of triangles. In turn, triangles are defined by 3 vertices in 3D space. Different attributes are specified for each triangle: position, normal (for lighting computations), color, one or several texture coordinates (for texture mapping)... Therefore, all the 3D vertices with all the per-vertex information describe the scene and form the input for the first stage of the rasterization process.

The vertex stage is the first step in the rasterization algorithm. The input for this phase is the set of 3D vertices with all the per-vertex information (figure 2.1a).

Figure 2.1: Initial scene, intermediate results produced by the different stages of the rasterization process and the final result: (a) 3D triangles, (b) 2D triangles, (c) clipped 2D triangles, (d) pixels, (e) visible pixels, (f) shaded and textured pixels.

Several operations are applied to each vertex in the vertex stage. First, vertices are transformed; the main transformations are translation, scaling and rotation. All the transformations are described by a transformation matrix, so transforming a vertex consists of multiplying its 3D coordinates by this transformation matrix. Second, vertices are lit according to the defined locations of light sources, reflectance and other surface properties. Finally, vertices are projected from 3D space onto a 2D plane; this projection is done by multiplying each transformed vertex by a projection matrix. The result of the vertex stage is a set of 2D triangles, as shown in figure 2.1b.
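To make the vertex stage concrete, the following C++ sketch applies a 4x4 transformation matrix to a vertex in homogeneous coordinates. The matrix layout and the function names are illustrative only; they are not taken from any particular driver or GPU.

```cpp
#include <array>
#include <cstdio>

using Vec4 = std::array<float, 4>;                    // homogeneous (x, y, z, w)
using Mat4 = std::array<std::array<float, 4>, 4>;     // row-major 4x4 matrix

// Multiply a 4x4 matrix by a homogeneous vertex: v' = M * v.
Vec4 transform(const Mat4& m, const Vec4& v) {
    Vec4 r{0.f, 0.f, 0.f, 0.f};
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j)
            r[i] += m[i][j] * v[j];
    return r;
}

int main() {
    // Model-view matrix: a translation by (1, 2, 0).
    Mat4 modelview = {{{1, 0, 0, 1},
                       {0, 1, 0, 2},
                       {0, 0, 1, 0},
                       {0, 0, 0, 1}}};
    Vec4 vertex{3.f, 4.f, 5.f, 1.f};

    Vec4 eye = transform(modelview, vertex);  // vertex in eye space
    // Applying a projection matrix and dividing by w would then yield
    // the 2D screen-space position consumed by the rasterizer.
    std::printf("(%f, %f, %f, %f)\n", eye[0], eye[1], eye[2], eye[3]);
    return 0;
}
```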

Once 3D vertices have been transformed to their corresponding 2D locations, some of these locations may be outside the viewing window, the area on the screen to which pixels will actually be written. For instance, in figure 2.1b the vertices V3 and V5 are outside the screen. So the next stage of the rasterization process is clipping, the task of truncating triangles to fit them inside the viewing area. The most common technique is the Sutherland-Hodgman clipping algorithm [7]. After clipping, triangles are truncated so that all the vertices fit on the screen (figure 2.1c).

The next step of the rasterization process is to fill the 2D triangles that now lie on the screen; this stage is also known as raster conversion or scan conversion. Raster conversion consists of converting the vectorial 2D clipped triangles (figure 2.1c) into pixels (figure 2.1d). There are a number of algorithms to fill the pixels inside a triangle, the most popular of which is the scanline algorithm [8]. During raster conversion all the attributes of the 2D vertices (color, texture coordinates...) are interpolated across the triangle.

After raster conversion, the rasterization algorithm must ensure that pixels close to the viewer are not overwritten by pixels farther away; this issue is known as the visibility problem. A Z-buffer [9] is the most popular solution. The Z-buffer is a 2D array, corresponding to the image plane, which stores a depth value for each pixel. Each time a pixel is drawn, the Z-buffer is updated with the pixel's depth value. Any new pixel must check its depth value against the Z-buffer value before it is drawn: closer pixels are drawn and farther pixels are disregarded. This process of checking the depth value of each pixel against the value stored in the Z-buffer is called the depth test. Figure 2.1d shows an example input to the depth test, and the corresponding output is shown in figure 2.1e.
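As a minimal illustration of the depth test, the sketch below (our own simplification, assuming a "smaller depth is closer" convention and a flat row-major buffer layout) only writes a pixel when it is closer than the value already stored in the Z-buffer.

```cpp
#include <cstdint>
#include <limits>
#include <vector>

struct FrameBuffers {
    int width, height;
    std::vector<float>    depth;  // Z-buffer: one depth value per pixel
    std::vector<uint32_t> color;  // color buffer: one RGBA8 value per pixel

    FrameBuffers(int w, int h)
        : width(w), height(h),
          depth(w * h, std::numeric_limits<float>::max()),
          color(w * h, 0) {}

    // Depth test: draw the pixel only if it is closer than the one
    // already stored at (x, y); otherwise the pixel is disregarded.
    void drawPixel(int x, int y, float z, uint32_t rgba) {
        float& stored = depth[y * width + x];
        if (z < stored) {                  // closer to the viewer: passes the Z-test
            stored = z;                    // update the Z-buffer
            color[y * width + x] = rgba;   // write the pixel's color
        }
    }
};
```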

Finally, all the pixels that pass the depth test (visible pixels) are processed in the last stage of the rasterization algorithm: the pixel stage. To compute a pixel's color, pixels are textured and shaded in the pixel stage. Let's briefly review how textures are applied. A texture map is a bitmap that is applied to a triangle to define its look. Each triangle vertex is associated with a texture and a texture coordinate (u, v) (for normal 2D textures) in addition to its position coordinate. Whenever a pixel on a triangle is rendered, the corresponding texel (or texture element) in the texture must be found. This is accomplished by interpolating between the texture coordinates associated with the triangle's vertices, weighted by the pixel's on-screen distance from the vertices. Moreover, lighting computations are also performed in the pixel stage. The result of this stage is the final image with textures and per-pixel lighting (figure 2.1f). A diagram of the whole rasterization process is shown in figure 2.2.

2.2 Android

Android [3] is a software stack for mobile devices such as mobile telephones and tablet computers, developed by Google and the Open Handset Alliance. Android consists of a mobile operating system based on the Linux kernel, with middleware, libraries and APIs written in C, and application software running on an application framework which includes Java-compatible libraries. Android uses the Dalvik virtual machine [10] with just-in-time compilation to run compiled Java code. Android has a large community of developers writing applications that extend the functionality of the devices; these developers write primarily in Java.

Figure 2.2: Rasterization pipeline.

The architecture of Android is shown in figure 2.3. Regarding the applications, Android provides a set of core applications including an email client, SMS program, calendar, maps, browser, contacts and others. All applications are written using the Java programming language.

Regarding the application framework, by providing an open development platform Android offers developers the ability to build rich applications. Developers are free to take advantage of the device hardware, access location information, run background services, set alarms... Developers have full access to the same framework APIs used by the core applications. The application architecture is designed to simplify the reuse of components: any application can publish its capabilities, and any other application may then make use of those capabilities. This same mechanism allows components to be replaced by the user.

On the other hand, Android includes a set of C/C++ libraries used by various components of the Android system. These capabilities are exposed to developers through the Android application framework. These libraries include an implementation of the standard C system library (libc), media libraries to support playback and recording of many popular audio and video formats, a relational database engine (SQLite) and much other functionality. An implementation of the OpenGL ES API [11] is also provided as one of these libraries. This 3D library uses either hardware 3D acceleration (where available) or the included, highly optimized 3D software rasterizer, as described in section 2.2.1.

Regarding the Android Runtime, Android includes a set of core libraries that provides most of the functionality available in the core libraries of the Java programming language.

Figure 2.3: Android architecture.

Every Android application runs in its own process, with its own instance of the Dalvik virtual machine. Dalvik has been written so that a device can run multiple VMs efficiently. The VM is register-based and runs classes compiled by a Java language compiler. The Dalvik VM relies on the Linux kernel for underlying functionality such as threading and low-level memory management.

Finally, Android relies on Linux version 2.6 for core system services such as security, memory management, process management, network stack, and driver model. The kernel also acts as an abstraction layer between the hardware and the rest of the software stack.

2.2.1 Android Software Renderer

Android supports the rendering of 3D graphics by providing an implementation of the OpenGL ES API [11]. The rasterization process, described in section 2.1, can be done in hardware by using a specialized graphics processor or in software on the CPU. When Android runs on a device provided with a GPU (for instance, the NVIDIA Tegra 2 described in section 2.3.3), the GPU driver is employed and the rasterization is hardware accelerated. On the contrary, when Android is executed on a device without specialized graphics hardware, the Android software renderer performs the rasterization on the CPU. The software renderer is also employed when executing Android on top of an emulator such as QEMU, as we will see in the section describing our simulation infrastructure.

The Android software renderer is a library that provides support for 3D graphics; it is an implementation of the OpenGL ES 1.0 API. The software renderer is of special interest for several reasons. First, it performs the rasterization when executing Android on top of a simulator. Second, the Android software renderer, unlike GPU drivers, is open source. Since the source code of the software renderer is available, we can review and modify it. This means that we can, for example, instrument the software renderer to collect interesting information about the rasterization process. For instance, we can count the number of vertices processed, the number of triangles or the number of pixels generated for each triangle, or even record the memory addresses of the pixels that are accessed in the color buffer or in the textures. Therefore, by instrumenting the Android software renderer we can generate traces of the OpenGL ES rendering commands and feed these traces to a cycle-accurate GPU simulator, as described in section 5.1.

In this section we briefly describe the structure of the Android software renderer, and in section 5.1.1 we describe the instrumentation of this library. The Android software renderer consists of two static libraries:

• libagl.a: This is the Android OpenGL library. This library provides all the functions in the OpenGL ES 1.0 API. It implements the vertex processing and clipping stages of the rasterization pipeline (figure 2.2).

• libpixelflinger.a: This library implements the raster conversion, depth test and pixel processing stages of the rasterization pipeline (figure 2.2).

The libagl.a library source code is located in the directory /frameworks/base/opengl/libagl of the Android distribution. This library implements all the functions in the OpenGL ES 1.0 API; these functions are called from the applications. Regarding the rasterization process, this library implements the vertex processing and clipping stages of the rasterization pipeline shown in figure 2.2. The remaining stages are implemented in the libpixelflinger.a library. So the libagl.a library has classes to handle vertices, triangles, lights, transformation matrices and everything else necessary for vertex processing.

In the OpenGL ES API we can identify two types of functions: functions to configure the rendering pipeline (set the number of lights, set transformation matrices...) and functions to render polygons. The rasterization process explained in section 2.1 is triggered when the application calls a function of the second type (a rendering function). There are only a few rendering functions in the OpenGL ES API. First, glDrawArrays and glDrawElements are employed to render 3D triangles. Second, the glDrawTex function is used to render textured 2D rectangles, usually in 2D games. Section 5.1.1 describes how these rendering functions are instrumented to generate GPU traces.
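For reference, here is a minimal OpenGL ES 1.x fragment that drives this rendering path through glDrawArrays; it assumes a valid EGL context has already been made current and omits all error handling.

```cpp
// Assumes an EGL context is already current; error handling omitted.
#include <GLES/gl.h>

static const GLfloat triangle[] = {
    -0.5f, -0.5f, 0.0f,
     0.5f, -0.5f, 0.0f,
     0.0f,  0.5f, 0.0f,
};

void drawFrame(void) {
    glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);
    glEnableClientState(GL_VERTEX_ARRAY);
    glVertexPointer(3, GL_FLOAT, 0, triangle);
    // Rendering function: this call triggers the rasterization process
    // (vertex stage, clipping, raster conversion, depth test, pixel stage).
    glDrawArrays(GL_TRIANGLES, 0, 3);
    glDisableClientState(GL_VERTEX_ARRAY);
}
```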

The libpixelflinger.a library source code is located in the directory /system/core/libpixelflinger of the Android distribution. This library implements the raster conversion, depth test and pixel processing stages of the rasterization pipeline shown in figure 2.2. The functions of this library are called from the libagl.a library to render 2D clipped triangles (figure 2.1). Although this library can also be used directly from the applications, developers usually employ the libagl.a library to do the rendering. The libpixelflinger.a library performs the pixel generation and processing: conversion from vectorial triangles to pixels, visibility determination by using a depth buffer [9], texture mapping, per-pixel lighting... This library employs the scanline algorithm [8] to rasterize the 2D triangles.

2.3 State of the art architectures for mobile devices

In this section we review the most popular CPU/GPU architectures for smartphones and tablets. Usually, these mobile devices are provided with a System on Chip (SoC) [12] including the CPU, the GPU and other specialized hardware for functions such as audio and video encoding/decoding. We start the review with the Qualcomm Snapdragon family of chipsets, which can be found in many HTC and other smartphones and in the Xperia PLAY. Next we describe the PowerVR family of chipsets; some of the devices using this SoC are, for instance, the Apple iPhone 4, iPad and iPad 2. Finally, we review the NVIDIA Tegra 2 SoC, which is included in several smartphones and in the Samsung Galaxy Tab.

2.3.1 Qualcomm Snapdragon

Snapdragon is a family of mobile Systems on Chip by Qualcomm [13]; it is a platform for use in smartphones, tablets and smartbook devices. The CPU of the Snapdragon chipset, Scorpion, is Qualcomm's own design. It is very similar to the ARM Cortex-A8 core and is based on the ARMv7 instruction set. However, it has much higher performance for multimedia-related SIMD operations due to its advanced media processing engine.

Figure 2.4: Qualcomm Snapdragon System on Chip.


On the other hand, all Snapdragon processors contain the circuitry to encode and decode high-definition video. Regarding the GPU, all the Snapdragon chipsets include the Adreno GPU, the company's proprietary GPU design. This low power GPU provides support for different graphics APIs, such as OpenGL ES and OpenVG. Furthermore, the Adreno GPU is able to accelerate 3D user interfaces for Android and other mobile operating systems, and it provides full support for websites based on the Flash and WebGL frameworks. The Qualcomm Snapdragon chipsets also include circuitry for audio encoding/decoding, wireless communication (3G modem) and GPS, as shown in figure 2.4.

Although Adreno is a very powerful and interesting GPU, no technical document describes the specifications of this piece of hardware. Information such as the number of processors, the size of the caches or the instruction set is not available at all.

2.3.2 PowerVR chipsets

PowerVR is a division of Imagination Technologies that develops hardware for 2D and 3D rendering [14]. PowerVR accelerators are not manufactured by PowerVR; instead, their integrated circuit designs and patents are licensed to other companies such as Samsung, Apple and many others. The PowerVR graphics accelerators are included in the Systems on Chip of many popular devices, such as the Apple iPhone 4 and iPad.

The PowerVR chipsets use a method of rendering known as tile-based deferred rendering [15] (often abbreviated as TBDR). As the application feeds triangles to the PowerVR GPU, it stores them in memory in a triangle strip or an indexed format. Unlike in other architectures, polygon rendering is not performed until all polygon information has been collated for the current frame. Furthermore, the expensive operations of texturing and shading pixels are delayed, whenever possible, until the visible surface at a pixel is determined.

In order to perform the rendering, the display is split into rectangular sections in a grid pattern, each section is known as a tile. Associated with each tile is a list of triangles that visibly overlap that tile. Each tile is rendered in turn to produce the final image. Tiles are rendered using a process similar to ray-casting. Rays are cast onto the triangles associated with the tile and a pixel is rendered from the triangle closest to the camera.

The architecture implementing the tile-based rendering algorithm is shown in figure 2.5. As the application feeds triangles to the GPU, the Tile Accelerator (TA) assigns these triangles to the corresponding overlapping tiles. So the TA creates a list of visible triangles for each tile; the list includes the triangle coordinates and all the necessary information: active textures, render states...

Once all the polygons in the scene have been dispatched to the GPU and classified into the corresponding tiles, the rendering process starts. The Image Synthesis Processor (ISP) is responsible for determining which pixels in a tile are visible. Hidden Surface Removal (HSR) is performed on a tile-by-tile basis, with each tile's HSR results sent to the Texture and Shading Processor (TSP) for rasterization of the visible pixels. The ISP processes all triangles affecting a tile one by one. Calculating the triangle equation and projecting a ray at each position in the triangle returns accurate depth information for all pixels. This depth information is then compared with the values in the tile's depth buffer to determine whether these pixels are visible or not. The Texture and Shading Processor (TSP) in the PowerVR pipeline behaves much like a traditional shading and texturing engine.

Figure 2.5: PowerVR GPU architecture.

Tile based rendering architectures have several advantages over traditional rasterization architectures. Since the scene is rasterized on a tile-by-tile basis and a tile is much smaller than the whole display, all the necessary information to process a tile (color buffer and depth buffer information, for instance) can be stored on-chip. Therefore, accesses to external memory are avoided to a large extent. Other advantages such as great cache efficiency and parallel processing of localized data are also important factors.

Regarding the drawbacks of tile based rendering, although off-chip memory accesses can be avoided in many cases, the memory bandwidth increases in other places in the pipeline. For example, the triangle/tile sorting needs to be done, and creating the triangle lists increases bandwidth usage. The memory requirements for this triangle/tile sorting are significant, since the GPU has to capture the information of the whole 3D scene.

There is an ongoing debate about which architecture is best suited for rasterization. As explained in [29], the performance of the different rendering architectures is clearly scene-dependent. This means that there will be three-dimensional scenes where a tiling architecture performs much better than a standard architecture, but the opposite is also true. Unfortunately, there is no academic study analyzing the advantages and disadvantages in terms of hardware implementation and memory bandwidth usage.

A more detailed review of the PowerVR GPU architecture is provided in [16].

2.3.3 NVIDIA Tegra 2

The NVIDIA Tegra 2 is a multi-core System on Chip for mobile devices such as smartphones and tablets. The Tegra 2 integrates two ARM Cortex-A9 processors, an ultra-low power GeForce GPU and specialized hardware for audio and video encoding/decoding (figure 2.6).

Figure 2.6: NVIDIA Tegra 2 architecture.

NVIDIA's ultra-low power GeForce GPU in the Tegra processor is derived from the desktop GeForce GPU architecture, but is specifically tailored to meet the growing demands of mobile applications. The ultra-low power GeForce GPU is highly customized and modified to deliver high-end graphics while consuming very little power. The GeForce architecture is a fixed-function pipeline architecture that includes fully programmable pixel and vertex shaders, along with an advanced texture unit that supports high quality anisotropic filtering. Figure 2.7 shows the graphics processing pipeline of the GeForce GPU in the Tegra mobile processor.

The GeForce GPU includes four programmable vertex processors and four programmable pixel processors for high speed vertex and pixel processing. Although the GeForce GPU architecture is a pipelined architecture similar to traditional desktop graphics architectures, it includes several special features and customizations to significantly reduce power consumption and deliver increased performance and graphics quality.

One of these special features is the introduction of the Early-Z stage in the GPU pipeline, placed before the pixel stage. Modern GPUs use a Z-buffer (depth buffer) to track which pixels in a scene are visible to the eye and which do not need to be displayed because they are occluded by other pixels. The depth test for individual pixels, as defined in the OpenGL logical pipeline, happens after the pixels are processed by the pixel processor. The problem with evaluating individual pixels after the pixel shading process is that pixels must traverse nearly the entire pipeline only to ultimately discover that some are occluded and will be discarded. Processing these non-visible pixels involves a significant number of transactions between the GPU and system memory, which, in the case of mobile devices, consumes significant amounts of power. By performing the depth test before the pixel processing stage, the GeForce architecture fetches depth, color and texture data only for the visible pixels that pass the Z-test. Therefore, the main benefit of Early-Z processing is that it reduces power consumption by reducing memory traffic between the GPU and off-chip system memory.

Figure 2.7: Ultra-low power GeForce architecture.

Another feature is the use of pixel and texture caches to reduce memory transactions. The traditional OpenGL GPU pipeline specifies that pixel information such as texture, depth or color is stored in system memory (or frame buffer memory). The pixel information is moved to and from memory during the pixel processing stage. This requires a significant number of off-chip system memory transactions, and thus consumes large amounts of power. The GeForce architecture implements on-chip pixel and texture caches to reduce the system memory transactions. The pixel cache stores on-chip the depth and color values of pixels, which can be reused for all pixels that are accessed repeatedly. The texture cache is employed to store texture elements (texels) on-chip.

Finally, the GeForce GPU implements several advanced techniques to reduce power consumption including, for instance, multiple levels of clock gating and dynamic voltage and frequency scaling.

A more detailed description of the NVIDIA Tegra 2 architecture is provided in [4].

2.4 Data Cache Prefetching

While trends in both underlying semiconductor technology and computer architecture have significantly increased processor clock rates, the major trend in main memory technology has been in the direction of higher densities, with memory access times decreasing much less than processor cycle times. These trends have increased main memory latencies when measured in processor clock cycles. To avoid performance losses due to this disparity of speed between the CPU and main memory, modern processors rely on a hierarchy of cache memories. However, cache memories are not always effective due to limited cache capacity and limited associativity. In order to overcome these limitations of cache memories, data can be prefetched into the cache.

In this section we review different prefetching schemes for CPUs. We start with the simplest prefetching scheme, the stride prefetcher, and then review the Markov prefetcher and the distance prefetcher. First we review the implementation of these prefetchers by using a table to record the necessary information, and later we show how these prefetching schemes can be implemented more effectively by using a Global History Buffer [37].

Several prefetching schemes have also been proposed for GPUs; this type of prefetcher is aware of the special characteristics of the GPU architecture. In this section we review two prefetching schemes targeting GPUs. First, we describe the many-thread aware prefetcher proposed in [36]. Next, we review a prefetching scheme specifically designed for texture caches [33].

The "aggressiveness" of a prefetcher can be characterized by its prefetch degree. The degree of prefetching determines how many requests can be initiated by one prefetch trigger. Increasing the degree can be beneficial if the prefetched lines are used by the application, or harmful if the prefetched lines are evicted before being accessed by the application.

2.4.1 CPU prefetchers

Stride prefetcher

Conventional stride prefetching [31] uses a table to store stride-related local history information (figure 2.8). The program counter (PC) of a load instruction is employed to index the table. Each table entry stores the load's most recent stride (the difference between the two most recent load addresses), the last address (to allow computation of the next local stride), and state information describing the stability of the load's recent stride behavior. When a prefetch is triggered, addresses a+s, a+2s, ..., a+ds are prefetched (a is the load's current target address, s is the detected stride and d is the prefetch degree, an implementation-dependent prefetch look-ahead distance).
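A minimal sketch of the table update and prefetch generation just described; the table organization, replacement policy and the stability check are simplified relative to the original proposal [31].

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

struct StrideEntry {
    uint64_t lastAddr = 0;    // last address seen for this load PC
    int64_t  stride   = 0;    // most recent stride
    bool     stable   = false; // has the same stride been seen twice in a row?
};

// On a load at `pc` accessing `addr`, update the table entry and, if the
// stride is stable, return the prefetch addresses a+s, a+2s, ..., a+ds
// (d is the prefetch degree).
std::vector<uint64_t> onLoad(std::unordered_map<uint64_t, StrideEntry>& table,
                             uint64_t pc, uint64_t addr, int d) {
    std::vector<uint64_t> prefetches;
    StrideEntry& e = table[pc];                        // indexed by the load's PC
    int64_t newStride = (int64_t)addr - (int64_t)e.lastAddr;
    e.stable = (newStride == e.stride);                // stride repeated -> stable
    e.stride = newStride;
    e.lastAddr = addr;
    if (e.stable && e.stride != 0)
        for (int k = 1; k <= d; ++k)
            prefetches.push_back(addr + k * e.stride);
    return prefetches;
}
```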

When originally proposed, this method was applied to a single L1 cache, and all load PCs were applied to the stride prefetching table.

Figure 2.8: Stride prefetching table.

However, using all load PCs results in relatively high demand on L1 and L2 cache ports. Later, Nesbit et al. [37] proposed to implement the stride prefetcher by using only the PCs and addresses of the loads that miss in the cache.

Markov prefetcher

Markov prefetching [34] is an example of a correlation prefetching method. Correlation prefetching uses a history table to record consecutive address pairs. When a cache miss occurs, the miss address indexes the correlation table (figure 2.9). Each entry in the Markov correlation table holds a list of addresses that have immediately followed the current miss address in the past. When a table entry is accessed, the members of its address list are prefetched, most recent miss address first. To update the table, the previous miss address is used to index the table and the current miss address is inserted in the address list. To insert the address, the current list of addresses is shifted to the right and the new address is inserted in the "most recent" position (the column labeled "1st" in figure 2.9).

Markov prefetching models the miss address stream as a Markov graph, a probabilistic state machine. Each node in the Markov graph is an address and the arcs between nodes are labeled with the probability that the arc’s source node address will be immediately followed by the target node address. Each entry in the correlation table represents a node in an associated Markov graph, and its list of memory addresses represents arcs with the highest probabilities. Thus, the table maintains only a very raw approximation to the actual Markov probabilities.
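The correlation table update can be sketched compactly in code. The list length (two successors, as in figure 2.9) and the lack of an eviction policy are simplifications of this example, not details from [34].

```cpp
#include <cstdint>
#include <deque>
#include <unordered_map>
#include <vector>

class MarkovPrefetcher {
    std::unordered_map<uint64_t, std::deque<uint64_t>> table_; // miss addr -> successors
    uint64_t prevMiss_ = 0;
    bool havePrev_ = false;
    static constexpr size_t kListLen = 2;  // successors kept per entry

public:
    // Called on every cache miss; returns the addresses to prefetch.
    std::vector<uint64_t> onMiss(uint64_t miss) {
        std::vector<uint64_t> prefetches;
        auto it = table_.find(miss);
        if (it != table_.end())  // prefetch the known successors, most recent first
            prefetches.assign(it->second.begin(), it->second.end());
        if (havePrev_) {         // update: `miss` followed `prevMiss_`
            auto& list = table_[prevMiss_];
            list.push_front(miss);                     // insert in "most recent" slot
            if (list.size() > kListLen) list.pop_back(); // shift older entries out
        }
        prevMiss_ = miss;
        havePrev_ = true;
        return prefetches;
    }
};
```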

Distance prefetcher

Distance prefetching [35] is a generalization of Markov prefetching. Originally, distance prefetching was proposed for prefetching TLB entries, but the method is easily adapted to prefetching cache lines.

Figure 2.9: Markov prefetching. The left side of the figure shows the state of the correlation table after processing the miss address stream shown at the top of the figure. The right side illustrates the Markov transition graph that corresponds to the example miss address stream.

This prefetching scheme uses the distance between two consecutive global miss addresses, an address delta, to index the correlation table. Each correlation table entry holds a list of deltas that have followed the entry's delta in the past. Figure 2.10 shows an example address delta stream and the state of the correlation table after processing the delta stream. When a cache miss occurs, the new delta is computed by subtracting the previous miss address from the current miss address. This delta is employed to access the table, and the list of deltas in the corresponding entry is used to generate the prefetch requests. The table is updated using the same mechanism explained for the Markov prefetcher.

Figure 2.10: Distance prefetching. The address delta stream corresponds to the sequence of ad- dresses used in the example of figure 2.9.

Distance prefetching is considered a generalization of Markov prefetching because one delta correlation can represent many miss address correlations. On the other hand, unlike Markov prefetching, distance prefetching's predictions are not prefetch addresses themselves. To calculate the prefetch addresses, the predicted deltas are added to the current miss address.

Global History Buffer

Prefetch tables store prefetch history inefficiently. First, table data can become stale and consequently reduce prefetch accuracy. Second, tables suffer from conflicts that occur when multiple access keys map to the same table entry. The main solution for reducing conflicts is to increase the number of table entries; however, this approach increases the table's memory requirements. Third, tables hold a fixed amount of history per entry. Adding more prefetch history per entry creates new opportunities for effective prefetching, but the additional history also increases the table's memory requirements.

A new prefetching structure, the Global History Buffer, is proposed in [37]. This prefetching structure decouples table key matching from the storage of prefetch-related history information. The overall prefetching structure has two levels (figure 2.11):

• An Index Table (IT) that is accessed with a key, as in conventional prefetch tables. The key may be a load instruction's PC, a cache miss address, or some combination. The entries in the Index Table contain pointers into the Global History Buffer.

• The Global History Buffer (GHB), an n-entry FIFO table (implemented as a circular buffer) that holds the n most recent miss addresses. Each GHB entry stores a global miss address and a link pointer. Each pointer points to the previous miss address with the same Index Table key. The link pointers are used to chain the GHB entries into address lists. Hence, each address list is a time-ordered sequence of addresses that have the same Index Table key.

All the prefetchers reviewed in the previous section can be implemented by using a GHB instead of a table. Depending on the key used to index the Index Table, the stride, Markov and distance prefetchers can be implemented more effectively with a GHB. In this section we review the implementation of the distance prefetcher using the GHB approach.

Figure 2.12 illustrates how the GHB can prefetch using a distance prefetching scheme. The "Deltas" box shown in the figure does not exist in GHB hardware, but is extracted by finding the difference between miss addresses in the GHB. As shown in the figure, prefetch addresses are generated by taking the miss address and cumulatively adding deltas; a valid prefetch address is created from each addition.
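A simplified sketch of distance prefetching on top of a GHB-like structure follows. For clarity the circular FIFO is replaced by a growing vector and indices stand in for hardware link pointers, so this is an illustration of the mechanism rather than the hardware design of [37].

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

// GHB entry: a global miss address plus a link to the previous entry
// with the same Index Table key (here the key is an address delta).
struct GhbEntry {
    uint64_t addr;
    int      link;  // index of previous entry with the same delta, or -1
};

class DistanceGhb {
    std::vector<GhbEntry> ghb_;                    // simplified (non-circular) GHB
    std::unordered_map<int64_t, int> indexTable_;  // delta -> most recent GHB index
    uint64_t prevMiss_ = 0;
    bool havePrev_ = false;

public:
    std::vector<uint64_t> onMiss(uint64_t miss, int degree) {
        std::vector<uint64_t> prefetches;
        if (havePrev_) {
            int64_t delta = (int64_t)miss - (int64_t)prevMiss_;
            auto it = indexTable_.find(delta);
            int head = (it != indexTable_.end()) ? it->second : -1;
            // Walk the chain of past occurrences of this delta; the delta
            // that followed each occurrence is a prediction, added
            // cumulatively to the current miss address.
            uint64_t base = miss;
            for (int i = head; i >= 0 && (int)prefetches.size() < degree;
                 i = ghb_[i].link) {
                if (i + 1 < (int)ghb_.size()) {
                    int64_t predicted = (int64_t)ghb_[i + 1].addr
                                      - (int64_t)ghb_[i].addr;
                    base += predicted;
                    prefetches.push_back(base);
                }
            }
            // Insert the new miss at the head of the chain for this delta.
            ghb_.push_back({miss, head});
            indexTable_[delta] = (int)ghb_.size() - 1;
        } else {
            ghb_.push_back({miss, -1});  // first miss: no delta defined yet
        }
        prevMiss_ = miss;
        havePrev_ = true;
        return prefetches;
    }
};
```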

With the GHB approach, one can often get a better estimate of the actual Markov graph transition probabilities than with conventional correlation methods. In fact, the GHB allows a weighting of transition probabilities based on how recently they have occurred.

2.4.2 GPU prefetchers

Many-Thread Aware Prefetching Mechanisms

All the previous prefetching schemes were designed targeting CPU architectures, so they are not aware of the special characteristics of Graphics Processing Units. Lee et al. propose in [36] a new prefetching scheme specifically designed for CUDA applications in a GPGPU environment.

Figure 2.11: Global History Buffer.

Figure 2.12: Distance prefetcher implemented by using a Global History Buffer. The Head Pointer points to the last inserted address in the GHB.

The baseline GPGPU architecture is shown in figure 2.13; the architecture follows NVIDIA's CUDA programming model [17].

In the CUDA model, each core is assigned a certain number of thread blocks, groups of threads that should be executed concurrently. Each thread block consists of several warps, which are much smaller groups of threads.

Figure 2.13: An overview of the baseline GPGPU architecture.

A warp is the smallest unit of hardware execution. A core executes instructions from a warp in an SIMT (Single-Instruction Multiple-Thread) fashion. In SIMT execution, a single instruction is fetched for each warp, and all the threads in the warp execute the same instruction in lockstep, except when there is control divergence. Threads and blocks are part of the CUDA programming model, whereas a warp is an aspect of the microarchitectural design.

The GPGPU architecture illustrated in figure 2.13 is similar to the state-of-the-art architecture of current NVIDIA GPUs. The basic design consists of several cores and an off-chip DRAM, with memory controllers located inside the chip. Each core has SIMD execution units, a software-managed cache (shared memory), a memory request queue (MRQ) and other units. The processor has an in-order scheduler; it executes instructions from one warp, switching to another warp if source operands are not ready. The MRQ is employed to store both demand requests (from the application) and prefetch requests (from the prefetching engine). Each new request is compared to existing requests and, in case of a match, the requests are merged.

The prefetching scheme proposed in [36], the many-thread aware hardware prefetcher, has special features that make it more effective in a GPGPU environment. First, this prefetcher provides improved scalability. Current GPGPU applications exhibit largely regular memory access patterns, so traditional CPU prefetchers should work well. However, because the number of threads is often in the hundreds, traditional training mechanisms do not scale.

In the many-thread aware prefetcher the pattern detectors are trained on a per-warp basis, similar to those in simultaneous multithreading architectures. This aspect is critical, since requests from different warps can easily confuse the pattern detectors. An example is shown in figure 2.14. In this example a strong stride behavior exists within each warp, but due to warp interleaving, a hardware prefetcher only sees a random pattern. To prevent this problem, in the many-thread aware prefetcher stride information trained per warp is stored in a per-warp stride (PWS) table. So the many-thread aware prefetcher is based on the stride prefetcher described in figure 2.8; the PC of the miss address is employed to index the different tables used by this prefetcher, and each entry of these tables contains stride information.

(a) Accesses by warps:

PC    Warp ID   Addr   Delta
0x08  1         0      -
0x08  1         100    100
0x08  1         200    100
0x08  2         10     -
0x08  2         110    100
0x08  2         210    100
0x08  3         20     -
0x08  3         120    100
0x08  3         220    100

(b) Accesses seen by a hardware prefetcher:

PC    Warp ID   Addr   Delta
0x08  1         0      -
0x08  2         10     10
0x08  1         100    90
0x08  3         20     -80
0x08  2         110    90
0x08  3         120    10
0x08  3         220    100
0x08  1         200    -20
0x08  2         210    10

Figure 2.14: An example of memory addresses with/without warp interleaving.

On the other hand, the many-thread aware prefetcher employs stride promotion. Since memory access patterns are fairly regular in GPGPU applications, when a few warps have the same access stride for a given PC, all warps will often have the same stride for that PC. Based on this observation, when at least three PWS entries for the same PC have the same stride, the prefetcher promotes the PC/stride combination to the global stride (GS) table. By promoting strides, yet-to-be-trained warps can use the entry in the GS table to issue prefetch requests immediately without accessing the PWS table.

Another feature of the many-thread aware prefetcher is inter-thread prefetching (IP): each thread can issue prefetch requests for threads in other warps, instead of prefetching for itself. The key idea behind IP is that when an application exhibits a strided access pattern across threads at the same PC, one thread generates prefetch requests for another thread. This information is stored in a separate table called the IP table. The IP table is trained until three accesses from the same PC and different warps have the same stride. Thereafter, the prefetcher issues prefetch requests from the table entry.

Figure 2.15 shows the overall design of the many-thread aware prefetcher, which consists of the three tables discussed earlier: the PWS, GS and IP tables. The IP and GS tables are indexed in parallel with a PC address. When there are hits in both tables, the prefetcher gives a higher priority to the GS table because strides within a warp are much more common than strides across warps. Furthermore, the GS table contains only promoted strides, which means an entry in the GS table has been trained for a longer period than strides in the IP table. If there are no hits in any table, then the PWS table is indexed in the next cycle. However, if any of the tables has a hit, the prefetcher generates a request.

Figure 2.15: Many-thread aware hardware prefetcher.
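The lookup priority just described maps onto a few lines of code. The table types below are minimal stand-ins (real entries would also hold training state such as last addresses and confidence counters), and the key packing is our own.

```cpp
#include <cstdint>
#include <optional>
#include <unordered_map>

using Table = std::unordered_map<uint64_t, int64_t>;      // PC -> stride
using WarpTable = std::unordered_map<uint64_t, int64_t>;  // (PC, warp) -> stride

// Illustrative key packing for the per-warp table.
static uint64_t key(uint64_t pc, int warp) { return (pc << 8) | (unsigned)warp; }

std::optional<int64_t> lookup(const Table& gs, const Table& ip,
                              const WarpTable& pws, uint64_t pc, int warp) {
    // GS and IP are probed in parallel with the PC; on a double hit the
    // GS entry wins, since promoted strides have been trained longer.
    if (auto it = gs.find(pc); it != gs.end()) return it->second;
    if (auto it = ip.find(pc); it != ip.end()) return it->second;
    // Only if both miss is the PWS table probed (next cycle in hardware).
    if (auto it = pws.find(key(pc, warp)); it != pws.end()) return it->second;
    return std::nullopt;
}
```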

On the other hand, the many-thread aware prefetcher includes an adaptive prefetch throttling mechanism to control the aggressiveness of prefetching (the prefetch degree). Large prefetch degrees can reduce performance if the prefetched lines are useless (evicted before being used), so the prefetcher should be able to eliminate the instances of prefetching that yield negative effects while retaining the beneficial cases. Two metrics are employed to control the prefetch degree. The early eviction rate is the number of cache blocks evicted from the prefetch cache before their first use divided by the number of useful prefetches:

Metric(EarlyEviction) = #EarlyEvictions / #UsefulPrefetches

The second metric is the merge ratio. Memory requests can be merged at various levels in the hardware. As shown in figure 2.13, each core maintains its own Memory Request Queue (MRQ). New requests that match existing MRQ requests are merged with the matching request. The merge ratio is the number of intra-core merges that occur divided by the total number of requests:

Metric(Merge) = #IntraCoreMerges / #TotalRequests

The adaptive throttling mechanism maintains the early eviction rate and the merge ratio in each of the cores, periodically updating them and using them to adjust the degree of throttling. The throttling degree varies from 0 (0%: keep all prefetches) to 5 (100%: no prefetch). The prefetcher adjusts this degree using the current values of the two metrics according to the heuristics in figure 2.16. The early eviction rate is considered high if it is greater than 0.02, low if it is less than 0.01, and medium otherwise. The merge ratio is considered high if it is greater than 15% and low otherwise.

Early Eviction Rate   Merge Ratio   Action
High                  -             No prefetch
Medium                -             Increase throttle (fewer prefetches)
Low                   High          Decrease throttle (more prefetches)
Low                   Low           No prefetch

Figure 2.16: Throttling heuristics.
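The heuristics of figure 2.16, combined with the thresholds quoted above, can be expressed directly in code; the degree encoding (0 keeps all prefetches, 5 drops all of them) follows the text.

```cpp
// Throttling degree: 0 keeps all prefetches, 5 issues no prefetches.
// Thresholds follow the text: the early eviction rate is high above 0.02
// and low below 0.01; the merge ratio is high above 15%.
int adjustThrottle(int degree, double earlyEvictionRate, double mergeRatio) {
    const int kNoPrefetch = 5;
    if (earlyEvictionRate > 0.02)                       // high: no prefetch
        return kNoPrefetch;
    if (earlyEvictionRate >= 0.01)                      // medium: fewer prefetches
        return degree < kNoPrefetch ? degree + 1 : degree;
    if (mergeRatio > 0.15)                              // low + high merge:
        return degree > 0 ? degree - 1 : degree;        //   more prefetches
    return kNoPrefetch;                                 // low + low merge: no prefetch
}
```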

Prefetching Architecture for Texture Caches

The prefetchers described in the previous sections were designed for general purpose computing, and their performance increase is especially significant for applications with regular memory access patterns. Even the GPGPU prefetcher, the many-thread aware prefetcher, was specifically designed for scientific applications on a CUDA-like architecture.

Igehy et al. [33] proposed a prefetching architecture for texture caches. This prefetcher was designed targeting graphics workloads on a traditional GPU architecture, similar to the GeForce architecture described in section 2.3.3. The main objective of this prefetcher is to accelerate the process of applying textures to triangles (texture mapping).

Texture mapping has become ubiquitous in real-time graphics hardware. In its most basic form, texture mapping is a process by which a 2D image is mapped onto a projected screen-space triangle under perspective. This operation amounts to a linear transformation in 2D homogeneous coordinates. The transformation is typically done as a backward mapping: for each pixel on the screen, the corresponding coordinate in the texture map is calculated. The backward mapped coordinate typically does not fall exactly onto a sample in the texture map, and the texture may be minified or magnified on the screen. Filtering is applied to minimize the effects of aliasing, and ideally, the filtering should be efficient and amenable to hardware acceleration. Mip mapping [40] is the filtering technique most commonly implemented in graphics hardware.
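To make the backward mapping concrete, the sketch below converts an interpolated (u, v) coordinate into a texel address with nearest-neighbour sampling. Perspective correction, mip-map level selection and bilinear filtering, which real hardware performs, are deliberately omitted; the struct layout is an assumption of this example.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

struct Texture {
    const uint32_t* texels;  // RGBA8 texels, row-major
    int width, height;
};

// Backward mapping with nearest-neighbour filtering: the interpolated
// texture coordinate (u, v) in [0, 1) is converted to the index of the
// closest texel in the texture map.
uint32_t sampleNearest(const Texture& t, float u, float v) {
    int x = (int)std::floor(u * t.width);
    int y = (int)std::floor(v * t.height);
    x = std::min(std::max(x, 0), t.width - 1);   // clamp to the texture edge
    y = std::min(std::max(y, 0), t.height - 1);
    return t.texels[y * t.width + x];
}
```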

Figure 2.17 shows the part of the graphics pipeline where texture mapping is performed. The rasterizer circuitry converts 2D triangles to pixels on the screen, and each of these pixels is processed in a fragment processor. In order to apply texture mapping, the fragment processor has to fetch the corresponding texels (texture elements) from texture memory. Since textures are located in off-chip main memory, the fragment processor is provided with a texture cache to reduce the latency of memory accesses and the number of off-chip system memory transactions.

The prefetching architecture for texture caches proposed in [33] is illustrated in figure 2.18. The architecture processes fragments as follows. As each fragment is generated, each of its texel addresses is looked up in the cache tags. If a tag check reveals a miss, the cache tags are updated with the fragment's texel address immediately and the address is forwarded to the memory request FIFO. The cache addresses associated with the fragment are forwarded to the fragment FIFO and are stored along with all the other data needed to process the fragment: color, depth, filtering information... As the request FIFO sends requests for missing cache blocks to the texture memory system, space is reserved in the reorder buffer to hold the returning memory blocks. This guarantee of space makes the architecture robust and deadlock-free in the presence of an out-of-order memory system.

Figure 2.17: Baseline architecture for texture mapping.

When a fragment reaches the head of the fragment FIFO, it can proceed only if all of its texels are present in the cache. Fragments that generated no misses can proceed immediately, but fragments that generated one or more misses must first wait for their corresponding cache blocks to return from memory into the reorder buffer. In order to guarantee that new cache blocks do not prematurely overwrite older cache blocks, new cache blocks are committed to the cache only when their corresponding fragment reaches the head of the fragment FIFO. Fragments that are removed from the head of the FIFO have their corresponding texels read from the cache and proceed onward to the rest of the texture pipeline.
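The front half of this flow, the tag check at fragment generation and the two FIFO insertions, can be modeled in a few lines. This is a simplified sketch with our own structure names; the reorder buffer and the commit-at-FIFO-head logic are omitted:

    #include <deque>
    #include <unordered_set>
    #include <utility>
    #include <vector>

    // One fragment and the texel cache lines it needs (tags only).
    struct Fragment {
        std::vector<unsigned> texelLines;
        int pendingMisses = 0;  // misses still in flight for this fragment
    };

    struct TextureCacheFrontEnd {
        std::unordered_set<unsigned> tags;  // lines present or already requested
        std::deque<unsigned> requestFifo;   // misses forwarded to texture memory
        std::deque<Fragment> fragmentFifo;  // fragments waiting for their texels

        void onFragment(Fragment f) {
            for (unsigned line : f.texelLines) {
                if (!tags.count(line)) {          // tag check misses:
                    tags.insert(line);            // update tags immediately so later
                    requestFifo.push_back(line);  // fragments do not re-request it
                    ++f.pendingMisses;
                }
            }
            fragmentFifo.push_back(std::move(f)); // wait for head-of-FIFO processing
        }
    };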

Figure 2.18: Texture cache prefetcher architecture.


One of the key parameters of this prefetching architecture is the size of the fragment FIFO. This FIFO primarily masks the latency of the memory system. If the system is not to stall on a cache miss, it must be able to continually service new fragments while previous fragments are waiting for texture cache misses to be filled. Thus, the fragment FIFO depth should at least match the latency of the memory system.
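As a back-of-the-envelope restatement of this sizing rule (our own formulation, not a formula from [33]):

    \text{fragment FIFO depth} \;\geq\; \text{memory latency (cycles)} \times \text{fragment rate (fragments/cycle)}

For instance, a memory system with a 100-cycle latency and a rasterizer producing one fragment per cycle would call for a FIFO of at least 100 entries.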

3 Problem statement: Memory Wall for Low Power GPUs

3.1 Hiding memory latency on a modern low power mobile GPU

The clock rate of mobile CPUs and GPUs has grown rapidly in recent years. Nowadays, it is usual to find smartphones and tablets with a CPU clock rate of 1 GHz, and this trend seems set to continue. For example, the new Qualcomm Snapdragon S3 has a clock rate of 1.5 GHz [18] and the Qualcomm roadmap includes a mobile chipset with a clock rate between 2.0 GHz and 2.5 GHz by 2012 [13]. Hence, these mobile devices are going to hit the memory wall. Due to the disparity of speed between the CPU/GPU and memory, the performance of these mobile devices will be significantly affected by the latency to access main memory. Thus, the use of techniques to hide this latency is going to be necessary. Furthermore, these techniques must improve the behavior of the memory system without breaking the limited power budget of smartphones.

Basically, the three main techniques for hiding memory latency are caches, multithreading and prefetching. Caches are a very effective technique for CPUs; however, in GPUs caches focus on conserving bandwidth rather than reducing latency [32]. Graphics workloads usually exhibit irregular memory access patterns, so the typical hit rates of the caches in the GPU are not as high as in a CPU. Although the hit rates are far from perfect, caches can filter a significant percentage of the accesses to system memory, so they are a good mechanism to save memory bandwidth in a GPU, and mobile GPUs include different types of caches (section 2.3.3). However, caches are not the ideal solution for hiding memory latency on GPUs due to the special characteristics of graphics workloads.

On the other hand, multithreading is a very effective technique to keep all the GPU processors utilized, and state of the art desktop GPUs support thousands of simultaneous threads [19]. Figure 3.1 shows the effectiveness of multithreading for hiding the memory latency in different Android games. As we increase the number of threads in each processor we obtain better performance. With 16 warps per processor the performance is very close to the performance of a system with perfect caches, so multithreading is able to hide all the memory latency if the number of threads available is large enough (which is the case for graphics workloads).

Figure 3.1: Effectiveness of multithreading for hiding memory latency. As we increase the number of warps on each processor we obtain better performance.

Figure 3.2: Power consumed by the GPU main register file for different configurations.

Although multithreading is very effective for hiding the memory latency, it is also a power hungry technique. Due to the need for fast context switching, the GPU has to keep the architectural state of all the threads in execution in the register file. Since the number of threads is large (thousands), the size of the main register file becomes huge when applying aggressive multithreading. As we can see in figure 3.2, the power consumed by the main register file increases significantly as we increase the number of simultaneous threads. For 32 warps the power is close to 250 mW, so the GPU exceeds its power budget (the power budget of a mobile System on Chip is between 1 and 2 Watts, including CPU, GPU, specialized circuitry for video encoding/decoding...). Therefore, although multithreading is an effective technique for hiding the latency of the main memory, it is not well suited for a low power environment.

The last technique is prefetching. Prefetching has been studied in depth and several prefetching schemes have been proposed for both CPUs and GPUs, as we have seen in section 2.4. However, there are several issues with the previous proposals. First, the CPU prefetchers are effective for applications with regular memory access patterns and none of them directly apply to GPUs [36]. Second, the GPU prefetchers described in section 2.4.2 are effective for scientific applications written in CUDA. These prefetchers are effective in heavily multithreaded systems but they also require applications with regular memory access patterns, so they are not well suited for graphics workloads. The GPU prefetcher for texture caches described in section 2.4.2 is very effective for graphics applications. However, it was designed for a GPU with just one pixel processor and it cannot be directly applied to a multicore GPU. Applying it to a multicore GPU introduces several challenges, as we will describe in chapter 4.

In conclusion, we have observed the lack of a mechanism for hiding the main memory latency in low power systems when executing graphics workloads.


4 Proposal: Decoupled Access Execute Prefetching

4.1 Ultra-low power decoupled prefetcher

In this chapter we present our ultra-low power decoupled prefetcher for Graphics Processing Units. This prefetching scheme has been designed for graphics workloads in low power environments. This section is organized as follows. First, we describe the baseline GPU architecture, which is similar to the ultra low-power GeForce GPU in the NVIDIA Tegra 2 chipset (section 2.3.3). Next, we present the first version of the decoupled prefetcher, which is based on the prefetching architecture presented in [33]. Finally, we present additional optimizations to improve performance and reduce power consumption.

4.1.1 Baseline GPU Architecture

The baseline GPU architecture is illustrated in figure 4.1; it is based on the ultra-low power GeForce in the NVIDIA Tegra 2 chipset. In this architecture pixels are generated and processed as follows. First, the rasterizer circuitry performs the scan conversion or raster conversion described in section 2.1. The rasterizer takes 2D triangles as input and generates the pixels to fill these triangles (figure 2.1d shows an example of raster conversion). All the generated pixels, or fragments in OpenGL terminology, are inserted into the fragment queue.

After raster conversion, non-visible pixels are discarded by using the Z-buffer algorithm [9]. The hardware in the Early Depth Test stage performs the visibility determination. The depth value of each fragment read from the fragment queue is compared with the current value in the Z-buffer. If the fragment's depth value is smaller than the current value, the Z-buffer is updated and the fragment proceeds through the pipeline. Otherwise, the fragment is discarded. Hence, in order to perform the visibility determination the hardware in the Depth Test stage has to access memory at most twice for each fragment: once to read the current depth value and, if the fragment passes the depth test, a second time to write the new depth value. The Depth Test stage employs a cache to optimize this process, so part of the Z-buffer is stored within this pixel cache.

Figure 4.1: Baseline GPU architecture (based on the ultra-low power GeForce GPU in the NVIDIA Tegra 2 chipset).
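The per-fragment logic of the Early Depth Test stage can be summarized in a short sketch, in which a toy map stands in for the pixel cache and the Z-buffer; the accessor names are ours:

    #include <unordered_map>

    static std::unordered_map<unsigned, float> zbuffer;  // address -> depth value

    float readDepth(unsigned addr) {                     // first memory access
        auto it = zbuffer.find(addr);
        return it == zbuffer.end() ? 1.0f : it->second;  // 1.0f = far plane
    }

    // Returns true if the fragment is visible and should proceed.
    bool earlyDepthTest(unsigned addr, float fragDepth) {
        if (fragDepth < readDepth(addr)) {  // closer than what is stored
            zbuffer[addr] = fragDepth;      // second access: update the Z-buffer
            return true;
        }
        return false;                       // occluded: discard the fragment
    }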

After the Depth Test stage the visible fragments are packed in groups of n fragments, or tiles; we have chosen 4 as the number of fragments in each tile. These tiles are inserted in the tile queue to be processed by the fragment processors. The four fragments within a tile will be processed in the same streaming processor. The scheduler in the Fragment stage reads tiles from the queue and decides in which processor each one of the tiles will be processed. There are 4 streaming processors in the Fragment stage, and the scheduler employs a round-robin policy to dispatch tiles to the processors.

The streaming processors perform several operations on each fragment, such as texture mapping, blending, per-pixel lighting... These processors are programmable and the user can specify the sequence of instructions to be applied to each fragment. The streaming processors are in-order processors and multithreading is employed to try to hide the memory latency. Each processor has 8 thread hardware contexts grouped in 2 warps. A warp is a group of threads that are scheduled together and executed in lockstep mode. Each tile is processed by one warp, and each one of the 4 threads in a warp processes one of the 4 fragments in the tile. There are also 4 SIMD execution units, or vector units, in each streaming processor, so at a given time just one of the warps is in execution. A streaming processor fetches and executes instructions from one warp until a cache miss is encountered; then the processor fetches instructions from the other warp to try to hide the latency of the memory access.

Each streaming processor is provided with a pixel cache and a texture cache. The pixel cache is employed to store color values of pixels (cache lines from the color buffer) and the texture cache is used to store texture elements (cache lines from texture memory). Hence, in the whole architecture there are 10 caches: the L2 cache, the pixel cache employed in the Depth Test stage and one pixel cache and one texture cache in each one of the 4 streaming processors. Prefetching can be applied to each one of these caches in order to improve performance.

4.1.2 Decoupled prefetcher

Traditional prefetchers are triggered on cache misses. Whenever a cache miss occurs, the prefetching engine triggers one or more (depending on the degree of prefetching) cache line requests to the next level of the memory hierarchy, following a prediction scheme based on history information. However, in the GPU architecture previously described a more efficient approach can be employed. The information stored in the fragment queue allows us to compute which cache lines from the Z-buffer will be accessed in the Depth Test stage. In the same manner, the information stored in the tile queue allows us to compute which cache lines from the color buffer and the texture memory will be accessed in the fragment processing stage. Therefore, this information can be employed to preemptively prefetch the cache lines that will be accessed during the processing of each one of the fragments.

In the decoupled prefetching scheme, a prefetch request is sent to the corresponding Texture/Pixel cache for each cache line that will be requested in the future during the processing of the fragments. The cache controller handles prefetch requests as follows. First, the tags are checked to see if the target line of the prefetch request is already in the cache. In case of a hit, the prefetch request is disregarded. In case of a miss, the prefetch request is redirected to the next level of the memory hierarchy. When the data is served by the next level, the cache is updated.

The architecture of our proposed prefetching scheme is illustrated in figure 4.2. As we can observe, the prefetching engine is decoupled from the caches and the streaming processors. The decoupled prefetcher works as follows. For each new fragment generated in the rasterizer the corresponding address in the Z-buffer is computed; the fragment is inserted in the fragment queue and the memory address is inserted in the prefetch queue. While fragments are waiting in the fragment queue to be processed in the Depth Test stage, the prefetch requests for the corresponding cache lines are sent to the Pixel cache. The prefetch queue is traversed each cycle to try to send pending prefetch requests. By preemptively prefetching cache lines we expect that all the necessary depth values will be available in the pixel cache when the fragments are read from the fragment queue and processed in the Depth Test stage.

We can apply the same decoupled prefetching scheme to the pixel and texture caches in the streaming processors. Once the visible fragments are packed in tiles, we can compute which cache lines from the color buffer and from texture memory will be accessed during the processing of each tile, so we can prefetch all the necessary cache lines while the tiles are waiting in the tile queue. However, in this case the prefetching is more challenging because there are 4 streaming processors and 8 caches, so we have to decide in which cache we are going to prefetch the lines. Furthermore, we have to guarantee that each tile is processed in the streaming processor in which we have prefetched its data. To solve this issue we move the scheduling from the entry of the Fragment stage to the output of the Depth Test stage. When a new tile is created it is scheduled to a streaming processor by using a round-robin policy, and the ID of the processor is stored in the tile queue together with the rest of the tile information. All the necessary cache lines for the tile will be prefetched in the pixel and texture caches of the corresponding processor; the prefetch queue includes an additional field to identify the target cache of the prefetch request. When the tiles are read from the tile queue they are dispatched to the corresponding streaming processor.

Merging is employed to reduce the number of prefetch requests. For example, let's assume a cache line size of 64 bytes. If the four fragments within a tile will access the memory addresses 4, 8, 12 and 16 respectively, then just one prefetch request, to cache line 0, is issued to the prefetch queue. The prefetch queue is clocked each cycle to try to send pending prefetch requests. This queue has two fields for each entry: the tag of the cache line to be prefetched and the ID of the target cache (the cache where the data will be prefetched).
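The merging step amounts to de-duplicating the line tags touched by the fragments of a tile. A minimal sketch, assuming the 64-byte lines of the example above:

    #include <set>
    #include <vector>

    // Returns the distinct cache-line tags touched by a tile's fragments.
    // For addresses 4, 8, 12 and 16 this yields the single tag 0, so only
    // one prefetch request is inserted in the prefetch queue.
    std::vector<unsigned> mergeTilePrefetches(const std::vector<unsigned>& addrs) {
        const unsigned kLineSize = 64;
        std::set<unsigned> lineTags;
        for (unsigned a : addrs)
            lineTags.insert(a / kLineSize);  // address -> line tag
        return {lineTags.begin(), lineTags.end()};
    }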

Our decoupled prefetcher is based on the prefetcher described in section 2.4.2, but our work differs in several ways. First, since we have moved the Depth Test stage before the Fragment processing stage, color values and texture elements are prefetched just for the visible pixels, which significantly reduces the number of prefetch requests. Second, our prefetcher works in a multiprocessor environment with multiple caches. On the contrary, the texture cache prefetcher described in section 2.4.2 assumes just one cache and one streaming processor.

The size of the queues (two prefetch queues, fragment queue and tile queue) is a key parameter in this prefetching scheme. If the queues are small, the prefetcher cannot prefetch early enough, so the prefetch requests may still be in flight when the fragments are read from the queue and the data will not be in the cache. Thus, as we reduce the size of these queues we increase the number of compulsory misses. On the other hand, if the queues are too big we increase the number of conflicts due to the limited associativity. For example, assuming 2-way associative caches, if three different cache lines from three different tiles are prefetched to the same cache and they are mapped to the same set, a conflict miss is produced. In this case, a cache line that will be accessed by a tile is evicted because a younger prefetch request maps to the same set.

Figure 4.2: Decoupled prefetcher architecture.

4.1.3 Decoupled prefetcher improvements

We can further improve the decoupled prefetcher by better utilizing the bandwidth to the L2 cache. When we implemented the decoupled prefetcher in our simulation infrastructure (section 5.1.2), we realized that we were often prefetching the same cache line to different caches. An example of this case is illustrated in figure 4.3. In the example there is a prefetch request to cache line A targeting the texture cache of processor 2 and another prefetch request to the same cache line but targeting the texture cache of processor 3. A prefetch request for line A will be sent to the texture cache of processor 2; if the line is not in the cache, the request will be redirected to the L2 cache and line A will be read from the L2 cache and stored in texture cache 2. Furthermore, another prefetch request for the same line will be sent to texture cache 3 and, in case of a miss, the prefetch will also be resent to the L2 cache. Therefore, 2 prefetch requests to the L2 cache for the same line are generated and the line is read from the L2 cache twice.

Figure 4.3: Improved decoupled prefetcher.


We can employ a more efficient approach to handle the case described in the previous example. Since cache line A is prefetched into texture cache 2, texture cache 3 can obtain this line from texture cache 2 instead of from the L2 cache. In this way, we save bandwidth to the L2 cache and we reduce power, since texture and pixel caches are much smaller than the L2 cache.

The improvement proposed in this section is implemented as follows. First, the prefetch queue includes an additional field, Source, with the ID of the cache to which the prefetch request will be redirected in case of a miss in the target cache. When a prefetch request is inserted in the prefetch queue, its tag is compared with the current tags in the prefetch queue. In case of a match, the Source field of the prefetch request is set to the Cache ID of the matching request. If there is no match, the Source field is set to the ID of the L2 cache. In case of multiple matches, we select the youngest matching request. When the prefetch request is issued to the corresponding cache, the information in the Source field is packed within the request. This information is employed by the cache controller in case of a cache miss to redirect the prefetch request to the corresponding Texture/Pixel cache instead of to the L2 cache.
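The insertion logic for the Source field might look as follows; the cache IDs and the queue layout are assumptions made for the sketch:

    #include <deque>

    constexpr int kL2CacheId = -1;  // sentinel ID for the L2 cache

    struct PrefetchEntry {
        unsigned tag;     // cache line to prefetch
        int targetCache;  // texture/pixel cache that should receive the line
        int sourceCache;  // where to fetch from on a miss in the target cache
    };

    void insertPrefetch(std::deque<PrefetchEntry>& queue, unsigned tag, int target) {
        int source = kL2CacheId;  // default: fetch from the L2 cache
        // Scan from the back so that, with multiple matches, the youngest
        // matching request wins (as described above).
        for (auto it = queue.rbegin(); it != queue.rend(); ++it) {
            if (it->tag == tag) { source = it->targetCache; break; }
        }
        queue.push_back({tag, target, source});
    }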

By introducing this improvement we try to save bandwidth to the L2 cache, since a significant percentage of the prefetch requests that in the previous scheme were served by the L2 cache will now be served by Pixel/Texture caches. Furthermore, accessing a Pixel/Texture cache requires less energy because these caches are much smaller than the L2 cache, so we also expect to save power. The experimental results presented in section 6.3 prove these claims.

Regarding the management of the requests in the cache controller, there is no prioritization of demand requests (requests from the application) over prefetch requests from the prefetch queue or remote prefetch requests from other caches; all the requests have the same priority. It might be beneficial to serve the demand requests first; however, since we have obtained important speedups without considering priorities, we have not explored this option.

The architecture of the improved decoupled prefetcher is shown in figure 4.3. The connection between texture caches 2 and 3 is highlighted to illustrate that the cache line A is obtained from texture cache 2, not from the L2 cache.


5 Evaluation methodology

5.1 Simulation infrastructure

We have developed a simulation infrastructure in order to evaluate the performance of several prefetching techniques on a mobile GPU. Our infrastructure is divided into two main components: the trace generation system and the cycle accurate GPU simulator. The trace generation system is able to intercept all the rendering commands (OpenGL ES commands) in Android and save all the necessary information for each command: number of vertices processed, number of triangles, number of fragments generated for each triangle... This information is stored in a GPU trace, which is the input to the cycle accurate GPU simulator. The simulator computes different GPU statistics such as number of cycles, IPC or miss rates for the different caches.

We have employed several existing tools to develop our infrastructure. For example, we have used Android and QEMU for the trace generation tool. Furthermore, our GPU simulator is based on a previous GPU simulator, Qsilver [39].

5.1.1 GPU trace generation

The GPU trace generation system is illustrated in figure 5.1. We employ QEMU [20] to boot and run the Android [21] operating system. On top of Android we run some smartphone applications like the web browser, the audio player or games.

When Android is executed on top of an emulator, such as QEMU, the OpenGL ES commands are processed by the Android Software Renderer, as described in section 2.2.1. We have instrumented this library to collect all the necessary information for the cycle accurate GPU simulator.

The instrumentation code has been inserted in the three rendering functions of the OpenGL ES API: glDrawArrays, glDrawElements and glDrawTex.

Figure 5.1: GPU trace generation system.

Whenever an application calls a rendering function from the OpenGL ES API, our instrumentation code starts to collect information about the rendering process. At the beginning of the rendering function some state information is collected:

• Lighting information: lighting enabled/disabled, number of lights...

• Texturing information: texturing enabled/disabled, number of active texture units...

• Array information: for each one of the OpenGL client arrays (vertex, color, normal and texture coordinates array) the following information is saved:

– Enabled/Disabled.
– Base address of the array.
– Size of each element of the array in bytes.
– Stride between elements.

• Rendering mode: points, lines, triangles, triangle strip, triangle fan...

On the other hand, as the rendering command is processed we save the following information for each triangle and for each pixel:

• Triangle information: visibility (is this triangle discarded in the clipping stage?) and the list of all the pixels generated to fill the triangle.

• Pixel information:

– Visibility (is this pixel discarded in the Depth Test?).
– Address of the pixel in the Z-buffer.
– Address of the pixel in the color buffer.
– Addresses of all the texture elements accessed to process this pixel.

At the end of the rendering function, all the collected information about the rendering command is ready to be dumped to the trace file. All the instrumentation code is executed inside the guest operating system (Android), so if we open a file and save the information about the rendering command, the file will be created in a virtual file system, since Android is executed inside a virtual machine. We would then have to transfer the file from the virtual file system to the file system of the host operating system in order to feed this trace file to the GPU simulator. The host system is the system in which QEMU is executed, in our case Linux.

On the other hand, we can create the trace file directly in the host system by using a different approach. We can signal the end of the rendering command in the Android Software Renderer in some manner, detect this signal in QEMU and save all the collected information to the trace file in the host file system. To signal the end of a rendering command we employ an interrupt with a special code, 0x99, and we have modified the code translation in QEMU accordingly. When an interrupt instruction with code 0x99 is found, this instruction is replaced by a call to a function that saves all the collected information to the trace file.
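On the guest side, raising this marker can be as simple as the sketch below; the helper name is ours, but the 0x99 interrupt code is the one described above (GCC inline assembly, x86 guest):

    // Hypothetical helper inserted at the end of each instrumented rendering
    // function. The modified QEMU code translator recognizes the interrupt
    // with code 0x99 and replaces it with a call that dumps the collected
    // information to the trace file on the host.
    static inline void signalEndOfRenderingCommand() {
        asm volatile("int $0x99");
    }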

5.1.2 Cycle accurate GPU simulator

The cycle accurate GPU simulator is able to read the information in the trace files, created by the GPU trace generation system previously described, and simulate the execution of the rendering commands in a state-of-the-art mobile GPU similar to the one inside the NVIDIA Tegra 2 chipset.

The architecture modelled by the GPU simulator is illustrated in figure 5.2. In this section we briefly describe each one of its components. Furthermore, we describe the power model employed to obtain the energy required to process the rendering commands.

The first stage in the graphics pipeline is the Primitive processing. This stage fetches all the necessary information for each vertex: position, color, normal, texture coordinates... The information stored in the GPU trace about the OpenGL client arrays (see section 5.1.1) is employed to issue the corresponding memory requests to the VBO (Vertex Buffer Object) Cache. As the vertex data is fetched from memory the vertices are inserted in the first Vertex queue.

Vertices are transformed and shaded in the Vertex processing stage. This stage contains several Streaming Processors, each of which is able to process one vertex. A sequence of instructions, or vertex shader, is applied to each one of the vertices read from the first Vertex queue. The vertex shader is obtained from the GPU trace. The instruction set employed is the OpenGL Architecture Review Board ISA for vertex programs [22].

Figure 5.2: GPU architecture modelled by the cycle accurate simulator.

The number of Streaming Processors is a parameter of the simulator and can be modified in the configuration file; the default value is 4. Once a vertex is processed in a Streaming Processor it is inserted in the second Vertex queue.

The next stage in the graphics pipeline is the Primitive Assembly. Vertices are read from the second Vertex queue and grouped into the corresponding triangles. Once the 3 vertices of a 2D triangle have been found, the triangle is clipped against the screen (as described in section 2.1). Finally, the clipped 2D triangles are inserted into the Triangle queue.

Raster conversion is the next step in the rasterization process. The Rasterizer takes 2D triangles from the Triangle queue and generates the pixels, or fragments in OpenGL terminology, to fill the triangles. The Rasterizer module in the simulator does not perform this raster conversion itself; it employs the information stored in the GPU trace. As we have described in section 5.1.1, the GPU trace stores the list of pixels generated by the Android Software Renderer's rasterizer for each triangle, so the simulator does not have to repeat a task that was already performed in the Android OpenGL driver. The generated pixels are inserted into the Fragment queue.

The Early Depth Test stage performs the visibility determination by applying the Z-buffer algorithm [9]. For each fragment read from the Fragment queue, a memory request is issued to the Pixel cache in order to obtain the depth value in the corresponding position of the Z-buffer. If the fragment is visible, another memory request is issued to update the depth value in the Z-buffer. Finally, visible fragments are packed in groups of 4 fragments, or tiles, and they are inserted in the Tile queue. The information stored in the GPU trace for each fragment is also employed in this stage. This information includes the address in the depth buffer, which is necessary to issue the memory requests to the Pixel cache, and the fragment visibility: true if the pixel is visible or false if it must be discarded.

Finally, the tiles are processed in the Fragment processing stage. In this stage, fragments within tiles are textured and shaded. There are several Streaming Processors in the Fragment stage, each of which is able to process multiple tiles. The Streaming Processors apply a sequence of instructions, or fragment shader, to each one of the fragments. The fragment shader is obtained from the GPU trace, as well as the memory addresses that have to be requested to process each fragment. The instruction set employed is the OpenGL Architecture Review Board ISA for fragment programs [23]. A more detailed description of these Streaming Processors is provided in section 4.1.1.

The GPU simulator computes several statistics. For example, it computes the total number of cycles to process all the rendering commands in the trace file, the number of instructions, the IPC and the miss rates of each one of the caches.

In order to compute the cycles, the GPU simulator models a streaming processor as a very simple in-order processor. The pipeline has 5 stages: Instruction Fetch, Instruction Decode, Operands Fetch, Execution and Writeback. Only one instruction can be fetched, decoded and issued per cycle. However, the same instruction is issued to all the SIMD execution units, so the same instruction is executed in parallel n times (where n is the number of execution units) but with different data. There is no forwarding mechanism: each instruction waits until its source operands are available, and in case of a data dependency the pipeline is stalled. Regarding the latencies, each instruction spends a different number of cycles in the execution stage. We have obtained the latencies of each one of the instructions in the ISA from the Qsilver GPU simulator [39].
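These timing rules (one instruction issued per cycle, stalls until source operands are ready, per-opcode execution latencies) can be captured with a small scoreboard. The sketch below is a simplification of our model with illustrative names:

    #include <algorithm>
    #include <vector>

    struct Inst { int dst, src1, src2, latency; };  // register IDs + exec latency

    // Returns the cycle at which the last writeback completes on a toy
    // in-order pipeline with no forwarding.
    long simulateWarp(const std::vector<Inst>& prog, int numRegs) {
        std::vector<long> ready(numRegs, 0);  // cycle each register becomes ready
        long cycle = 0;
        for (const Inst& in : prog) {
            ++cycle;  // at most one instruction fetched/decoded/issued per cycle
            cycle = std::max({cycle, ready[in.src1], ready[in.src2]});  // stall
            ready[in.dst] = cycle + in.latency;  // writeback time of the result
        }
        long end = cycle;
        for (long r : ready) end = std::max(end, r);  // drain in-flight results
        return end;
    }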

Regarding the power model, we employ CACTI [24] to compute the energy consumed by the caches, the queues between stages and the register files. Furthermore, we employ the power model of Qsilver [39] to obtain the energy consumed by the ALUs in the Streaming Processors. The dynamic energy consumed by the GPU is thus the sum of the dynamic energy consumed by the following components (a sketch of this accounting follows the list):


• The caches: the simulator provides the number of accesses to each cache, and by using CACTI we compute the dynamic energy required to access it. We obtain the total energy consumed in each one of the caches by multiplying the total number of accesses by the energy per access.

• The queues: as in the previous case, we multiply the total number of accesses to the queue (provided by the simulator) by the energy per access (computed with CACTI).

• The Streaming Processors: we account for the energy consumed by the Main Register File and the SIMD execution units:

– Main Register File: we obtain the energy per access by using CACTI and we multiply this value by the total number of accesses to the main register file (obtained from the simulator statistics).
– ALUs: we employ the power model from Qsilver [39], a simple power model in which each one of the instructions in the ISA has a fixed amount of energy assigned: the energy required to execute the instruction. So if there are N different instructions in the ISA, the energy consumed by the ALUs is defined by the following equation:

$$E_{\mathrm{ALUs}} = \sum_{i=1}^{N} \mathit{Num\_instructions}_i \times E_i$$

The number of executed instructions of each type is computed by the GPU simulator; the energy required to execute each instruction is obtained from Qsilver.

• The prefetchers: All the prefetching schemes employ one or several structures like tables or queues (see section 2.4) that are accessed multiple times. As in the previous cases, the number of accesses is computed by using the GPU simulator and the energy per access is provided by CACTI.
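Putting the pieces together, the dynamic-energy accounting reduces to multiplying event counts by per-event energies. The sketch below illustrates this bookkeeping; the maps are placeholders for the numbers produced by the simulator, CACTI and Qsilver:

    #include <string>
    #include <unordered_map>

    // accesses/energyPerAccess: per structure (caches, queues, register files).
    // instCounts/energyPerInst: per ISA opcode (the E_ALUs sum above).
    double dynamicEnergy(
        const std::unordered_map<std::string, long>& accesses,
        const std::unordered_map<std::string, double>& energyPerAccess,
        const std::unordered_map<int, long>& instCounts,
        const std::unordered_map<int, double>& energyPerInst) {
        double total = 0.0;
        for (const auto& [name, n] : accesses)
            total += n * energyPerAccess.at(name);  // simulator count x CACTI energy
        for (const auto& [op, n] : instCounts)
            total += n * energyPerInst.at(op);      // simulator count x Qsilver energy
        return total;
    }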

Regarding the static energy, the simulator is able to compute the number of idle cycles for each one of the hardware structures (caches, register files, queues...). By combining this information with the energy estimations provided by CACTI we can compute the total leakage power.

When we present power numbers in chapter 6, these numbers include the power consumed by all the caches, all the queues and all the Streaming Processors. Furthermore, if some prefetching scheme is employed, the power results also include the power consumed by all the hardware structures used by the prefetcher.

6 Experimental results

In this section we present the experimental results obtained with the simulation infrastructure described in section 5.1. First, we analyze several Android applications from the Android Store and establish that the most demanding applications are, as expected, games. Furthermore, we show the potential benefits of improving the memory system by analyzing the behavior of the texture and pixel caches. Second, we present the performance and power results for the different state-of-the-art prefetchers. Finally, we compare these results with the performance and power consumption of our ultra-low power decoupled prefetcher.

6.1 Workload characterization

We have analyzed several Android applications from the Android Store in order to evaluate the behavior of the CPU and the GPU. We have included several common applications, such as the web browser and the audio player, and several 2D and 3D games. To obtain statistics about the CPU we have employed a full-system simulator, MARSSx86 [25]. MARSSx86 consists of QEMU, an emulator which is able to boot and run an OS, and PTLSim [26], a cycle accurate simulator for the x86 instruction set. We have introduced several modifications to MARSSx86. First, we have modified PTLSim to compute CPI stacks [30]. Second, we have integrated our GPU trace generator and cycle accurate simulator (section 5.1) in MARSSx86, so we can obtain information from both the CPU and the GPU. Since PTLSim, the cycle accurate CPU simulator, only supports the x86 ISA, we have employed the x86 version of Android [27].

The CPU configuration employed for the experiments is described in figure 6.1. We have configured the simulator to model a very simple out-of-order processor with small caches in order to keep power consumption within the limited power budget of smartphones.

CPU configuration
Core:                 2-issue out-of-order core, 4 functional units: load unit, store unit, integer ALU and FPU. Two-level cache hierarchy.
L1 Instruction cache: 64-byte lines, 4-way associative, 16 KBytes, 2-cycle latency.
L1 Data cache:        64-byte lines, 4-way associative, 16 KBytes, 2-cycle latency.
L2 cache:             64-byte lines, 8-way associative, 256 KBytes, 12-cycle latency.

Figure 6.1: CPU configuration for the experiments.

Figure 6.2: CPI stacks for several Android applications. iCommando, Shooting Range 3D and PolyBreaker 3D are commercial games from the Android market.

The results of the CPU/GPU analysis are summarized in figure 6.2. This figure shows the CPI stacks for several Android applications: the Android app store, the audio player, the web browser and 3 commercial games. We have included in the CPI stacks the cycles that the CPU spends waiting for the GPU. As we can observe, the behavior of the games is different from the rest of the applications. For applications that are not games, the CPI is relatively small (between 1.5 and 2.5 cycles per instruction) and the main sources of pipeline stalls are branch mispredictions and L2 cache misses. On the other hand, for games the CPI is large (between 5.5 and 16) and the main source of stalls is the GPU. Hence, games are the most demanding applications and, furthermore, they are the only applications that stress the GPU. These characteristics make games the ideal applications for studying the memory behavior of a GPU.

Figure 6.3: Misses per 1000 instructions for the different caches in the GPU.

We have analyzed the memory behavior of several commercial games by using our GPU simulation infrastructure. The GPU configuration employed for the experiments is described in figure 6.7. First, we have evaluated the miss rates for the different caches in the GPU; the results are shown in figure 6.3. The L2 cache presents the biggest miss rates in all the games except iBowl, in which the texture cache turns out to be the most problematic cache. Although these miss rates seem small, a significant performance speedup can still be achieved by improving the behavior of the caches.

Figure 6.4: Texture and pixel cache analysis.


We have performed several experiments to evaluate the potential benefits of improving the pixel and texture caches; the results are shown in figure 6.4. As we can observe, the use of perfect texture caches provides a speedup of 48% on average. Furthermore, making the pixel caches perfect yields an average speedup of 65%. Hence, a significant speedup can be achieved by improving the behavior of the different caches in the GPU.

Prefetching is one of the techniques that can be employed to improve the behavior of the memory system. However, as we have seen in section 2.4, conventional prefetchers for CPUs and GPUs work well for applications with regular memory access patterns. Games, and graphics workloads in general, usually exhibit an unpredictable memory access pattern [32]. In order to understand the memory behavior of our applications we have analyzed the strides of all the cache misses by using Sequitur [38]. Sequitur is able to construct the grammar that generates the sequence of strides we have recorded during the execution of an application. By analyzing these grammars we can identify patterns in the strides of the cache misses, for example: a cache miss with stride 1 is usually followed by a cache miss with stride 2. Hence, by analyzing the Sequitur grammars we can evaluate how easily the strides can be predicted.
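The preprocessing for this analysis is simple: the recorded sequence of miss addresses is turned into a sequence of strides, which is then fed to Sequitur [38]. A minimal sketch (we assume the addresses are already expressed at cache-line granularity, which is how we read the tables below):

    #include <cstddef>
    #include <vector>

    // Converts a stream of cache-miss line addresses into the stride sequence
    // used as input to Sequitur. The grammar inference itself is external.
    std::vector<long> missStrides(const std::vector<long>& missLineAddrs) {
        std::vector<long> strides;
        for (std::size_t i = 1; i < missLineAddrs.size(); ++i)
            strides.push_back(missLineAddrs[i] - missLineAddrs[i - 1]);
        return strides;
    }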

iCommando (2D game)

Pixel cache                               Texture cache
Rules                        Histogram    Rules                        Histogram
1   -> [1] [1]   (44238)     1  - 91.59%  146 -> [1] [1]   (35184)     1   - 70.82%
4   -> 1 [1]     (36190)     3  - 1.03%   14  -> 146 [1]   (24802)     -1  - 5.75%
23  -> 4 [1]     (23880)     24 - 0.72%   183 -> 14 [1]    (21983)     64  - 3.92%
45  -> 23 [1]    (20327)     25 - 0.61%   405 -> 183 [1]   (17428)     65  - 3.37%
516 -> 45 [1]    (18393)     23 - 0.55%   108 -> 405 [1]   (16203)     -65 - 2.52%

Figure 6.5: Analysis of the strides of the cache misses in the Pixel and Texture cache of one Streaming processor when running the 2D game iCommando. In the Sequitur grammars non-terminal symbols (rules) are represented by numbers and terminal symbols (strides) are represented by numbers in square brackets. After each rule we show the number of times the rule is applied to form the input sequence of strides. We only show the 5 most frequent rules of the grammar.

PolyBreaker 3D (3D game)

Pixel cache                               Texture cache
Rules                        Histogram    Rules                          Histogram
111 -> [1] [1]   (43231)     1  - 63.77%  145   -> [1] [1]    (20848)    1   - 48.39%
51  -> 111 [1]   (26247)     25 - 5.44%   99    -> 145 [1]    (16144)    8   - 6.08%
43  -> 51 [1]    (18251)     24 - 3.84%   1556  -> 99 [1]     (11941)    4   - 3.61%
60  -> 43 [1]    (13584)     -1 - 3.59%   7341  -> 1556 [1]   (11811)    -8  - 2.82%
104 -> 60 [1]    (11964)     2  - 3.53%   10316 -> 7341 [1]   (11718)    -16 - 2.18%

Figure 6.6: Analysis of the strides of the cache misses in the Pixel and Texture cache of one Streaming processor when running the 3D game PolyBreaker 3D. For each cache the figure shows the 5 most frequent rules of the grammar and the 5 most frequent strides.

Figure 6.5 shows the result of the stride analysis for iCommando, one of the 2D games. As we can see, stride 1 is the most common stride: 91.59% of the misses in the pixel cache and 70.82% of the misses in the texture cache have stride 1. The most frequently applied rules of the grammar also include stride 1. This means that most of the time, when there is a cache miss, the next cache miss will be in the next line. We have observed a similar behavior for all the 2D games we have evaluated. So for 2D games the memory access patterns are regular and conventional prefetchers should work relatively well. This makes sense because 2D games basically consist of a sequence of blitting operations [28], in which a matrix of pixels is copied into another matrix (the color buffer).

On the other hand, 3D games exhibit the memory behavior described in figure 6.6. In this case, the frequency of stride 1 is only 63% in the pixel cache and 48% in the texture cache, and other strides such as 4 or 8 are relatively common. Hence, for 3D games the strides of the cache misses are not as predictable as in the previous case, which makes the work of the prefetchers harder.

6.2 State of the art prefetchers performance

In this section we evaluate the performance and power consumption of different state-of-the-art CPU and GPU prefetchers. These are the configurations we have analyzed:

• Baseline - No prefetching: This is the baseline GPU architecture shown in figure 5.2 without any kind of prefetching. The parameters for the architecture are described in figure 6.7.

• Stride prefetcher (Table): In this configuration we have included the stride prefetcher implemented with a table shown in figure 2.8 in each one of the caches of the GPU. The stride table has a size of 16 entries and the prefetch degree is set to 2.

• Distance prefetcher (GHB): This configuration employs a distance prefetcher implemented with a GHB (figure 2.12) in each one of the caches of the GPU. The Index Table has a size of 16 entries, the GHB has a size of 64 entries and the prefetch degree is set to 2.

• Many-Thread Aware Prefetcher with Throttling: This configuration employs the GPU prefetcher described in figure 2.15 in each one of the caches. Each one of the tables employed in this prefetcher (PWS, GS and IP table) has a size of 16 entries and the prefetch degree is dynamically adapted from 0 to 5.

• Perfect caches: All the caches are ideal, and have a hit rate of 100%.

Figure 6.8 shows the speedup for the different prefetching techniques. The stride prefetcher provides an average speedup of 1.31. The distance prefetcher with GHB and the many-thread aware prefetcher provide better performance than the stride prefetcher. The GHB prefetcher achieves a speedup of 2.27, which is slightly better than the speedup obtained with the many-thread aware prefetcher (2.19). Although the many-thread aware prefetcher has been designed specifically for GPUs, it does not provide better performance than the state-of-the-art CPU prefetcher. There are several reasons for this. First, the many-thread aware prefetcher was designed targeting a GPU architecture similar to the NVIDIA Fermi [19], in which there are thousands of simultaneous threads in execution at the same time. As we can see in figure 6.7, in our mobile GPU architecture there are only 8 thread hardware contexts per processor due to power constraints, whereas in the NVIDIA Fermi architecture there are 1024 simultaneous threads per streaming processor. So the effectiveness of some mechanisms like inter-thread prefetching or stride promotion (see section 2.4.2) is significantly limited by the small number of in-flight threads. Furthermore, graphics workloads do not exhibit regular memory access patterns, whereas the many-thread aware prefetcher was designed for scientific applications developed in CUDA with very regular access patterns. Nevertheless, the performance of this prefetcher is very close to the GHB prefetcher on average and it outperforms the distance prefetcher with GHB in some of the games (ibowl, pocketracing, quake2 and shooting). On the other hand, the speedups achieved by these prefetchers are far from the speedup obtained by a system with perfect caches.

GPU configuration
Fragment processing stage: 4 Streaming processors
Vertex processing stage:   4 Streaming processors
Streaming processor:       4 SIMD execution units, 1 Pixel cache, 1 Texture cache, 8 thread hardware contexts (2 warps, 4 threads in each warp)
Pixel cache:               64-byte lines, 2-way associative, 8 KBytes, 2-cycle latency
Texture cache:             64-byte lines, 2-way associative, 8 KBytes, 2-cycle latency
L2 cache:                  64-byte lines, 8-way associative, 256 KBytes, 12-cycle latency

Figure 6.7: GPU configuration for the experiments. The baseline GPU architecture is the one illustrated in figure 5.2.

Figure 6.8: Speedups for different state-of-the-art prefetchers.

Regarding the power consumption, figure 6.9 shows the power for each one of the prefetchers normalized by the power of the baseline architecture without prefetching. As we can observe, the stride prefetcher is the prefetching scheme with the smallest power consumption (due to its simplicity), and it only consumes 1.1% more than the baseline architecture on average. Once again, the behaviors of the GHB prefetcher and the many-thread aware prefetcher are very close. The distance prefetcher with GHB consumes 4.5% more than the baseline GPU architecture on average, whereas the many-thread aware prefetcher consumes 4.9% more than the baseline. In three of the benchmarks the GHB prefetcher consumes more power (angryfrogs, ibowl and tankrecon), but in the other 5 games the many-thread aware prefetcher requires more power.

Figure 6.9: Normalized power consumption for different state-of-the-art prefetchers.

In conclusion, the three state-of-the-art prefetchers provide significant speedups over the baseline GPU without prefetching, especially the GHB prefetcher and the many-thread aware prefetcher. However, all the prefetchers also require more power than the baseline GPU.

6.3 Ultra-low power decoupled prefetcher performance

In this section we evaluate the performance and power consumption of our ultra-low power decoupled prefetcher. We have included these three additional configurations:

• Original decoupled prefetcher: This is the prefetching architecture for texture caches proposed by Igehy et al. in [33] and described in section 2.4.2. The original idea only works for systems with one processor, so the experiments for this prefetching scheme have been performed by using just one streaming processor instead of four. The size of the prefetch queues is 32 entries.

• Decoupled prefetcher: This configuration implements our decoupled prefetcher illustrated in figure 4.2. The size of the prefetch queues is 32 entries.

• Decoupled prefetcher with optimizations: This configuration implements our decoupled prefetcher with the optimizations described in section 4.1.3 to reduce the number of requests to the L2 cache. The size of the prefetch queues is also 32 entries.

Figure 6.10: Ultra-low power decoupled prefetcher compared with state-of-the-art prefetchers.

Figure 6.10 shows the performance improvement provided by our decoupled prefetcher. The original decoupled prefetcher causes a performance penalty with respect to the baseline GPU on average. However, this is not a fair comparison because the original decoupled prefetching scheme can only be implemented with one processor, whereas the baseline GPU includes 4 streaming processors. Nevertheless, it offers 86% of the performance of a system with 4 processors by using just one streaming processor. Furthermore, it even outperforms the baseline GPU in some of the games (angryfrogs and pocketracing).

Regarding our decoupled prefetcher, it offers better performance than the state-of-the-art CPU and GPU prefetchers in all of the games and it achieves a speedup of 2.63. By reducing the number of requests to the L2 cache (decoupled prefetcher with optimizations) the speedup is even better, 2.94 on average, and it is close to the speedup of a system with perfect caches (3.37).

Figure 6.11 shows the speedups of our decoupled prefetcher compared to one of the state-of-the-art prefetchers, the distance prefetcher with GHB (as we have seen in the previous section, this is the state-of-the-art prefetcher that provides the best performance for our mobile GPU architecture). As we can see, our decoupled prefetcher achieves 15% improvements on average over the GHB prefetcher. Furthermore, if we apply the optimizations it provides 29% improvements over the GHB.

Regarding the power consumption, the power results are presented in figure 6.12. This figure shows the power consumed by each one of the prefetching schemes normalized by the power consumed by the baseline GPU architecture without prefetching. As we can see, the decoupled prefetcher consumes less power than the distance prefetcher with GHB and the many-thread aware prefetcher in all of the games. Furthermore, the optimizations to reduce the number of accesses to the L2 cache turn out to be effective at reducing power. The decoupled prefetcher with these optimizations consumes less power than the baseline GPU on average and in all of the games except ibowl. It consumes about 6% less power than the state-of-the-art CPU and GPU prefetchers and 1.1% less power than the baseline GPU architecture on average. Although the power savings seem small, the optimized decoupled prefetcher delivers them while providing significant performance improvements.

Figure 6.11: Ultra-low power decoupled prefetcher compared with the distance prefetcher implemented with GHB.

Figure 6.12: Decoupled prefetcher power consumption.

In the previous graphs we have reported power savings and speedups separately, but it is also interesting to consider both parameters, power and performance, at the same time. Thus, we have computed the energy-delay product for the different prefetching schemes and normalized the results by the energy-delay product of the baseline GPU without prefetching (figure 6.13). As we can see, the improvement introduced by the decoupled prefetcher is even larger when the speedup and the energy savings are considered together.

Figure 6.13: Normalized energy-delay product.

Figure 6.14: Prefetch queue size evaluation. The graph shows the speedup achieved by the decoupled prefetcher over the baseline GPU without prefetching for different sizes of the prefetch queue, for the game shooting.


Finally, we have analyzed the impact of the prefetch queue size of the decoupled prefetcher on the performance improvements. Figure 6.14 shows the evolution of the speedup obtained over the baseline GPU without prefetching in the 3D game shooting as we increase the size of the prefetch queue. If the size of the prefetch queue is small, the prefetcher cannot prefetch the necessary lines early enough, so the number of compulsory misses increases. If the size of the prefetch queue is big, the likelihood of a cache line prefetched for a pixel being replaced by another cache line prefetched for a younger pixel increases. Therefore, the number of conflict misses increases as we increase the size of the prefetch queue. As we can observe in figure 6.14, we get the best results for intermediate values of the prefetch queue size (from 64 to 512 entries).

In conclusion, the ultra-low power decoupled prefetcher outperforms the state-of-the-art CPU and GPU prefetchers. It is 2.94 times faster than the baseline GPU architecture and 1.29 times faster than the best state-of-the-art prefetcher on average. Furthermore, it provides these performance improvements without increasing the power consumption. In fact, it consumes 1.1% less power than the baseline GPU architecture without prefetching on average.


7 Conclusions

Games are the most demanding applications for smartphones. Graphics workloads make intensive use of the GPU while the CPU is idle most of the time. Due to the growing disparity of speed between the GPU cores and memory, one of the most performance limiting factors of the GPU is the latency to access main memory. Multithreading is a commonly used technique to tolerate memory latency. However, we found that it does so by significantly increasing power consumption. Prefetching is also a very effective technique for hiding memory latency on a mobile GPU: we have shown that by using prefetchers we can achieve a speedup of 2.94 on average over a GPU without prefetching in a commercial set of games. Furthermore, this speedup is achieved without increasing energy consumption, which is of primary importance in a mobile GPU.

Despite the special characteristics of graphics workloads, state of the art CPU and GPGPU prefetchers are an effective mechanism to improve the memory behavior of a mobile GPU. Just by using a simple stride prefetcher implemented with a table we get a speedup of 1.31 on average over a GPU without prefetching. The distance prefetcher implemented with a GHB achieves a speedup of 2.27, whereas the state of the art GPGPU prefetcher (the many-thread aware prefetcher) provides a speedup of 2.19. However, all these prefetchers produce a small increase in energy consumption. Moreover, the performance enhancements are far from the speedup achieved by a system with perfect caches (3.37), so there is a significant margin for improvement.

A decoupled access/execute prefetching architecture can be very effective at hiding the memory latency. Our decoupled prefetcher achieves a speedup of 2.63 over a GPU without prefetching and 1.15 over the distance prefetcher with GHB, while using just 1.4% more power than the baseline GPU. Furthermore, we also show that performance can be improved and power can be reduced by carefully moving data around and by orchestrating the accesses to the L2 cache (section 4.1.3). By using these optimizations the speedup achieved is 2.94 over a GPU without prefetching and 1.29 over the GHB prefetcher. Moreover, the power is reduced by 1.1% with respect to the baseline GPU.


Traditional CPU and GPGPU prefetchers make predictions by using history information. These prefetchers are triggered on cache misses and the only information they have available is the sequence of miss addresses. Using the miss address stream, they try to guess which cache lines will be requested next. On the contrary, the decoupled prefetcher employs the information about the pixels to compute which lines are going to be requested, so the prefetch requests are not based on predictions. Moreover, the decoupled prefetcher has better knowledge of the whole system (number of processors, number of texture caches, number of pixel caches) and it can employ this information to prefetch more effectively. For instance, if the prefetcher knows that a pixel is going to be processed in streaming processor 0, then all the necessary data to process the pixel will be prefetched in the texture and pixel caches of processor 0. Therefore, knowing exactly which cache lines are going to be requested in each one of the processors gives the decoupled prefetcher a big advantage over the other prefetchers.

The prefetch request queue must be sized large enough to achieve timeliness of prefetching, which mostly depends on memory latency. But it must also avoid excessive length that could lead to late requests evicting yet-to-be-used previously prefetched data due to cache conflicts. We have found that lengths between 32 and 512 entries are appropriate for our workloads.

Bibliography

[1] http://assets.en.oreilly.com/1/event/39/Internet%20Trends%20Presentation.pdf.

[2] http://www.migsmobile.net/2010/01/12/evolution-of-mobile-device-uses-and-battery-life/.

[3] http://en.wikipedia.org/wiki/Android_%28operating_system%29.

[4] http://www.nvidia.com/content/PDF/tegra_white_papers/Bringing_High-End_Graphics_to_Handheld_Devices.pdf.

[5] http://en.wikipedia.org/wiki/Rasterisation.

[6] http://en.wikipedia.org/wiki/Ray_tracing_%28graphics%29.

[7] http://en.wikipedia.org/wiki/Sutherland-Hodgeman.

[8] http://en.wikipedia.org/wiki/Scanline_algorithm.

[9] http://en.wikipedia.org/wiki/Z_buffer.

[10] http://en.wikipedia.org/wiki/Dalvik_virtual_machine.

[11] http://www.khronos.org/opengles/.

[12] http://en.wikipedia.org/wiki/System-on-a-chip.

[13] http://en.wikipedia.org/wiki/Snapdragon_%28system_on_chip%29.

[14] http://en.wikipedia.org/wiki/PowerVR.

[15] http://en.wikipedia.org/wiki/Tiled_rendering.

[16] http://www.imgtec.com/factsheets/SDK/PowerVR%20Technology%20Overview.1.0.2e. External.pdf.

[17] NVIDIA Corporation. CUDA Programming Guide, V3.0.

[18] http://www.qualcomm.com/snapdragon/specs.

[19] NVIDIA Fermi. http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_ Fermi_Compute_Architecture_Whitepaper.pdf.

[20] http://wiki.qemu.org/. 69 BIBLIOGRAPHY

[21] http://developer.android.com/guide/basics/what-is-android.html.

[22] http://oss.sgi.com/projects/ogl-sample/registry/ARB/vertex_program.txt.

[23] http://oss.sgi.com/projects/ogl-sample/registry/ARB/fragment_program.txt.

[24] http://www.hpl.hp.com/research/cacti/.

[25] http://www.marss86.org/.

[26] http://www.ptlsim.org/.

[27] http://www.android-x86.org/.

[28] http://en.wikipedia.org/wiki/Bit_blit.

[29] Tomas Akenine-Moller and Jacob Strom. Graphics processing units for handhelds. Proceedings of the IEEE, 96:779–789, 2008.

[30] Stijn Eyerman, Lieven Eeckhout, Tejas Karkhanis, and James E. Smith. A performance counter architecture for computing accurate cpi components. In Proceedings of the 12th inter- national conference on Architectural support for programming languages and operating systems, ASPLOS-XII, pages 175–184, New York, NY, USA, 2006. ACM.

[31] John W. C. Fu, Janak H. Patel, and Bob L. Janssens. Stride directed prefetching in scalar processors. SIGMICRO Newsl., 23:102–110, December 1992.

[32] M. Gebhart, D. R. Johnson, D. Tarjan, S. W. Keckler, W. J. Dally, E. Lindholm, and K. Skadron. Energy-efficient mechanisms for managing thread context in throughput pro- cessors. Proceedings of the ACM/IEEE International Symposium on (ISCA), June 2011.

[33] Homan Igehy, Matthew Eldridge, and Kekoa Proudfoot. Prefetching in a texture cache ar- chitecture. In SIGGRAPH / Eurographics Workshop on Graphics Hardware, pages 133–142, 1998.

[34] Doug Joseph and Dirk Grunwald. Prefetching using markov predictors. In In Proceedings of the 24th Annual International Symposium on Computer Architecture, pages 252–263, 1997.

[35] Gokul B. Kandiraju and Anand Sivasubramaniam. Going the distance for tlb prefetching: an application-driven study. In Proceedings of the 29th annual international symposium on Com- puter architecture, ISCA ’02, pages 195–206, Washington, DC, USA, 2002. IEEE Computer Society.

[36] Jaekyu Lee, Nagesh B. Lakshminarayana, Hyesoon Kim, and Richard Vuduc. Many-thread aware prefetching mechanisms for gpgpu applications. IEEE/ACM International Symposium on Microarchitecture, 0:213–224, 2010.

[37] Kyle J. Nesbit and James E. Smith. Data cache prefetching using a global history buffer. IEEE Micro, 25(1):90–97, 2005. 70 BIBLIOGRAPHY

[38] Craig G. Nevill-Manning and Ian H. Witten. Identifying hierarchical structure in sequences: a linear-time algorithm. J. Artif. Int. Res., 7:67–82, September 1997.

[39] J. W. Sheaffer, D. Luebke, and K. Skadron. A flexible simulation framework for graphics archi- tectures. In Proceedings of the ACM SIGGRAPH/EUROGRAPHICS conference on Graphics hardware, HWWS ’04, pages 85–94, New York, NY, USA, 2004. ACM.

[40] Lance Williams. Pyramidal parametrics. In Proceedings of the 10th annual conference on Computer graphics and interactive techniques, SIGGRAPH ’83, pages 1–11, New York, NY, USA, 1983. ACM.

71