Models and Techniques for Designing Mobile System-on-Chip Devices

by

Ayub Ahmed Gubran

A DISSERTATION SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

Doctor of Philosophy

in

THE FACULTY OF GRADUATE AND POSTDOCTORAL STUDIES (Electrical and Computer Engineering)

The University of British Columbia (Vancouver)

August 2020

© Ayub Ahmed Gubran, 2020

The following individuals certify that they have read, and recommend to the Faculty of Graduate and Postdoctoral Studies for acceptance, the dissertation entitled:

Models and Techniques for Designing Mobile System-on-Chip Devices submitted by Ayub Ahmed Gubran in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Electrical and Computer Engineering

Examining Committee:

Tor M. Aamodt, Electrical and Computer Engineering Supervisor

Steve Wilton, Electrical and Computer Engineering Supervisory Committee Member

Alan Hu, Computer Science University Examiner

Andre Ivanov, Electrical and Computer Engineering University Examiner

John Owens, University of California, Davis, Electrical and Computer Engineering

External Examiner

Additional Supervisory Committee Members:

Sidney Fels, Electrical and Computer Engineering Supervisory Committee Member

Abstract

Mobile SoCs have become ubiquitous computing platforms, and, in recent years, they have become increasingly heterogeneous and complex. A typical SoC today includes CPUs, GPUs, image processors, video encoders/decoders, and AI engines. This dissertation addresses some of the challenges associated with SoCs in three pieces of work.

The first piece of work develops a cycle-accurate model, Emerald, which provides a platform for studying system-level SoC interactions while including the impact of graphics. Our cycle-accurate infrastructure builds upon well-established tools, GPGPU-Sim and gem5, with support for graphics and GPGPU workloads, and full-system simulation with Android. We present two case studies using Emerald. First, we use Emerald's full-system mode to highlight the importance of system-wide interactions by studying and analyzing memory organization and scheduling in SoCs. Second, we use Emerald's standalone mode to evaluate a dynamic mechanism for balancing the work assigned to GPU cores. Our dynamic mechanism speeds up frame rendering by 7.3–19% compared to static load-balancing.

The second work highlights the time-variant traffic asymmetry in heterogeneous SoCs. We analyze the impact of this asymmetry on network performance and propose interleaved source injection (ISI), an interconnect topology and associated flow control mechanism to manage time-varying asymmetric network traffic. We evaluate ISI using stochastic traffic patterns and a set of traces that emulate mobile use cases with traffic from various IP blocks. We show that ISI increases saturation throughput by 80–184% for a 12% increase in NoC area.

In the last piece of work, we study the compression properties of surfaces and highlight the characteristics of surfaces generated by different applications. We use our analysis to propose Dynamic Color Palettes (DCP), a hardware scheme that dynamically constructs color palettes and employs them to efficiently compress framebuffer surfaces. We evaluated DCP against a set of 124 workloads and found that DCP improves compression rates by 91% for UI and 20% for 2D applications compared to previous proposals. We also propose a hybrid scheme (HDCP) that combines DCP with a generic compression scheme. HDCP outperforms previous proposals by 161%, 124%, and 83% for UI, 2D, and 3D applications, respectively.

Lay Summary

Mobile devices, like phones and tablets, have become interwoven into the fabric of our daily lives. These devices are powered by integrated mobile chips, or systems-on-chips (SoCs). Modern SoCs consist of a dozen or more specialized modules, where each module specializes in a specific task, like processing audio or video, rendering graphics, computing AI decisions, or encrypting data. As SoCs have become larger and more complicated, challenges have followed. These challenges include figuring out how to design hardware and software for such systems to deliver the best possible user experience. This dissertation contributes a set of tools that allow us to improve our understanding of how SoCs work and how their modules interact. The work in this dissertation also studies and proposes techniques to improve how SoC modules communicate, and techniques to reduce the energy consumption of SoCs' memory systems.

Preface

The work presented in this dissertation was carried out by the author, Ayub A. Gubran, under the supervision of Prof. Tor M. Aamodt at the University of British Columbia, Point Grey campus. The research material presented in this dissertation, namely, Chapter 2, Chapter 3, and Chapter 4, is based on work that has either been published or is currently under submission, as detailed below. A version of Chapter 2 was presented at the 2019 International Symposium on Computer Architecture (ISCA '19) and was included in the proceedings as:

• Ayub A. Gubran and Tor M. Aamodt. 2019. Emerald: graphics modeling for SoC systems. In Proceedings of the 46th International Symposium on Computer Architecture (ISCA ’19). ACM, New York, NY, USA, 169-182. DOI: https://doi.org/10.1145/3307650.3322221.

Ayub A. Gubran was the lead investigator, responsible for all major areas of concept formation, implementation, data collection and analysis, as well as manuscript composition. Prof. Tor M. Aamodt provided early concept formation, technical guidance and feedback, and contributed to editing the final version of the manuscript.

A version of Chapter 3 is currently under submission as "Ayub A. Gubran and Tor M. Aamodt. The Case for Interleaved Source Injection in Networks with Asymmetric Loads". Ayub A. Gubran was the lead investigator, responsible for all major areas of concept formation, implementation, data collection and analysis, as well as manuscript composition. Francois Demoullin implemented the scripts that were used to carry out the experiments in Section 3.6.3.1. He also carried out and collected the data for a number of experiments in the early stages of the work under the supervision of A. Gubran. Prof. Tor M. Aamodt provided technical guidance and feedback throughout the process and aided in writing the manuscript.

A version of Chapter 4 was presented as a peer-reviewed poster at the 7th Conference on High-Performance Graphics (HPG '15) as "Framebuffer Compression Using Dynamic Color Palettes". A pre-print version of this work is also published as:

• Ayub A. Gubran, Felix Huang, and Tor M. Aamodt. “Surface Compression Using Dynamic Color Palettes.” arXiv preprint arXiv:1903.06658 (2019).

Ayub A. Gubran was the lead investigator, responsible for all major areas of concept formation, implementation, data collection and analysis. Felix Huang worked under the supervision of Ayub Gubran; he collected the workloads listed in Table 4.2, evaluated replacement policies (Section 4.7.3.2), and worked on collecting the results of the hybrid compression scheme (Figure 4.18). Prof. Tor M. Aamodt provided technical guidance and feedback throughout the process.

Table of Contents

Abstract ...... iii

Lay Summary ...... v

Preface ...... vi

Table of Contents ...... viii

List of Tables ...... xiv

List of Figures ...... xvi

List of Abbreviations ...... xx

Acknowledgments ...... xxiii

Dedication ...... xxiv

1 Introduction ...... 1
1.1 Motivation ...... 2
1.2 Research Challenges ...... 5
1.3 Research Objectives and Contributions ...... 6
1.4 Dissertation Organization ...... 7

2 Emerald: Graphics Modeling for SoC Systems ...... 9
2.1 Introduction ...... 9
2.2 Background ...... 12

2.2.1 Computer Architecture Simulators ...... 12
2.2.2 Graphics APIs ...... 14
2.2.2.1 The OpenGL Pipeline ...... 14
2.2.2.2 Recent Trends in Graphics APIs ...... 17
2.2.2.3 Example: Life of a Triangle ...... 17
2.2.3 Hardware Graphics Pipeline Optimizations ...... 19
2.2.3.1 Early Elimination of Invisible Primitives and Fragments ...... 19
2.2.3.1.a Early Depth Testing ...... 19
2.2.3.1.b Hierarchical Depth Testing ...... 20
2.2.3.1.c Deferred Shading ...... 22
2.2.3.1.d Primitive Culling and Clipping ...... 23
2.2.3.1.e Face Culling ...... 23
2.2.3.1.f View Frustum Culling ...... 23
2.2.3.1.g Clipping ...... 25
2.2.3.2 Hierarchical Position Rasterization ...... 26
2.2.3.3 Summary ...... 27
2.2.4 Graphics Hardware Architectures ...... 27
2.2.4.1 Tile-Based Rendering Architectures ...... 28
2.2.4.1.a TBR Performance Bottlenecks ...... 30
2.2.4.1.b Contemporary TBR Architectures ...... 31
2.2.4.2 Immediate Architectures ...... 31
2.2.5 GPU Compute Architecture ...... 33
2.3 Emerald SoC Architecture ...... 34
2.4 Emerald Graphics Architecture ...... 37
2.4.1 Emerald Graphics Pipeline ...... 38
2.4.1.1 Pipeline Overview ...... 38
2.4.1.2 Description of Pipeline Stages ...... 39
2.4.2 Emerald GPU Architecture ...... 42
2.4.3 Vertex Shading ...... 44
2.4.4 Primitive Processing ...... 46
2.4.5 Hierarchical-Z (Hi-Z) Operations ...... 49
2.4.6 The TC Stage ...... 52

2.4.6.1 Tile Coalescing Example ...... 55
2.4.6.2 Out-of-order Primitive Rendering ...... 56
2.4.7 Model Validation ...... 56
2.4.8 Model Limitations and Future Work ...... 58
2.5 Emerald Software Design ...... 60
2.5.1 Emerald Standalone Mode ...... 60
2.5.2 Emerald Full-system Mode ...... 61
2.6 Case Study I: Memory Organization and Scheduling on Mobile SoCs ...... 61
2.6.1 Implementation ...... 62
2.6.1.1 DASH Scheduler ...... 62
2.6.1.1.a Clustering Bandwidth ...... 62
2.6.1.2 HMC Controller ...... 63
2.6.2 Evaluation ...... 65
2.6.2.1 Regular-load Scenario ...... 65
2.6.2.2 High-load Scenario ...... 67
2.6.2.3 Summary and Discussion ...... 69
2.7 Case Study II: Dynamic Fragment Shading Load-Balancing (DFSL) ...... 71
2.7.1 Experimental Setup ...... 71
2.7.2 Load-Balance vs. Locality ...... 72
2.7.3 Dynamic Fragment Shading Load-Balancing (DFSL) ...... 73
2.7.3.1 Implementation ...... 73
2.8 Related Work ...... 89
2.8.1 System Simulation ...... 89
2.8.2 GPU Simulators ...... 89
2.8.3 SoC Memory Scheduling ...... 90
2.8.4 Load-Balancing on GPUs ...... 90
2.8.5 Temporal Coherence in Graphics ...... 90
2.9 Conclusion and Future Work ...... 91

3 Balanced Injection of Asymmetric Network Loads Using Interleaved Source Injection ...... 93
3.1 Introduction ...... 93
3.2 Related Work ...... 98
3.2.1 Intra-IP and CPU-GPU NoCs ...... 98
3.2.2 Router Architecture ...... 99
3.3 Background ...... 100
3.4 Analyzing the Behavior of Asymmetric Traffic ...... 100
3.5 Interleaved Source Injection (ISI) ...... 104
3.5.1 ISI Architecture ...... 106
3.5.2 ISI Interleaving Degree ...... 107
3.5.3 Allocation and Arbitration ...... 108
3.5.4 Order-preserving Packet Routing ...... 111
3.5.5 Placement of Injection Ports ...... 112
3.5.6 Quality-of-Service (QoS) ...... 112
3.5.7 ISI and NoC Topologies ...... 112
3.5.8 Implementation Cost ...... 114
3.6 Methodology and Results ...... 115
3.6.1 Synthetic Traffic ...... 116
3.6.2 Mobile SoC Use Cases Results ...... 118
3.6.2.1 NoC Simulation ...... 120
3.6.2.2 Source Injection Placement ...... 120
3.6.2.3 Trace Injection Rate (TIR) ...... 120
3.6.3 Simulation Results ...... 123
3.6.3.1 Trace Permutations ...... 123
3.6.3.2 Quality-of-Service ...... 124
3.6.4 Area and Energy Efficiency ...... 125
3.6.4.1 Area Efficiency ...... 125
3.6.4.2 Energy Efficiency ...... 125
3.7 Conclusion ...... 125

4 Surface Compression Using Dynamic Color Palettes ...... 127
4.1 Introduction ...... 127

4.2 Background and Related Work ...... 128
4.2.1 The Life Cycle of a Framebuffer Surface ...... 128
4.2.2 Surface Compression Techniques ...... 130
4.2.3 Mobile Use Patterns ...... 132
4.3 Temporal Coherence in Mobile Graphics ...... 132
4.4 Dynamic Color Palettes (DCP) Compression ...... 134
4.4.1 DCP Workflow ...... 136
4.4.2 The Frequent Values Collector (FVC) ...... 138
4.4.2.1 FVC Hardware Cost ...... 138
4.4.3 The Common Colors Dictionary (CCD) ...... 138
4.4.3.1 CCD Hardware Cost ...... 139
4.4.4 The Compression Status Buffer (CSB) ...... 139
4.4.5 Reading a Compressed Framebuffer Surface ...... 140
4.4.6 Multi-Surface Support ...... 140
4.4.7 Coupling DCP with Other Compression Algorithms ...... 140
4.4.8 Dynamically Enabling DCP ...... 141
4.5 DCP Schemes ...... 142
4.5.1 Baseline DCP ...... 142
4.5.1.1 Memory Layout and Effective Compression Rates ...... 142
4.5.2 Adaptive DCP (ADCP) ...... 145
4.5.3 Variable DCP (VDCP) ...... 145
4.5.4 Hybrid DCP (HDCP) ...... 147
4.5.5 DCP Implementation ...... 147
4.6 Methodology ...... 147
4.7 Results and Discussion ...... 151
4.7.1 DCP Schemes ...... 151
4.7.2 Comparing VDCP, RAS and RED ...... 155
4.7.3 Factors Affecting FVC Fidelity ...... 155
4.7.3.1 FVC Size ...... 156
4.7.3.2 Replacement Policy and Associativity ...... 156
4.7.3.3 Sampling ...... 158
4.7.3.4 Frame Sampling ...... 158
4.7.4 Implementation Cost and Energy Savings ...... 159

4.7.4.1 Energy Savings ...... 160
4.7.5 Hybrid Schemes ...... 161
4.8 Compression and Future Memory Trends ...... 162
4.9 Conclusion ...... 162

5 Conclusion and Future Work ...... 164
5.1 Summary of Chapter 2 ...... 164
5.2 Summary of Chapter 3 ...... 166
5.3 Summary of Chapter 4 ...... 167
5.4 Conclusion ...... 169

Bibliography ...... 170

A Emerald Software Design and Implementation ...... 200
A.1 Graphics Standalone Mode ...... 200
A.2 Fast-forwarding and ROI Simulation ...... 202
A.3 Graphics Memory Management ...... 203
A.3.1 Memory Initialization ...... 203
A.3.2 Graphics Memory Allocation ...... 205
A.4 Graphics Memory Virtualization, TLBs and Page Fault Handling in the Full-system Mode ...... 206
A.5 Graphics Simulation on Android ...... 207
A.5.1 Command Execution and Synchronization ...... 210
A.5.2 The Gem5Pipe ...... 210
A.5.3 Checkpointing ...... 212
A.5.3.1 Checkpoint Recording ...... 214
A.5.3.2 Checkpoint Restoring ...... 216
A.5.4 Host-Render Mode ...... 219
A.5.4.1 Host-Render Mode Checkpoint Compatibility ...... 219
A.6 Deterministic Performance Modeling with Multi-Threaded Simulation ...... 220
A.7 TGSI to PTX Translation ...... 221

List of Tables

Table 1.1 Area occupied by CPU and GPU cores in recent mobile chips . 4

Table 2.1 Simulation platforms ...... 11
Table 2.2 OpenGL ES versions and the OpenGL versions they subset ...... 15
Table 2.3 SIMT core components ...... 36
Table 2.4 Summary of graphics models in Emerald ...... 43
Table 2.5 Example vertex batch sizes used with different primitive types ...... 46
Table 2.6 Per fragment Hi-Z depth entry ...... 50
Table 2.7 Benchmarks used to compare Emerald's model with the K1 GPU ...... 57
Table 2.8 DASH configurations ...... 63
Table 2.9 Baseline and HMC DRAM configurations ...... 64
Table 2.10 Case Study I system configurations ...... 66
Table 2.11 Case Study I workload models and configurations ...... 67
Table 2.12 Case Study II GPU configuration ...... 72
Table 2.13 Case Study II workloads ...... 72
Table 2.14 List of computer architecture simulators ...... 88

Table 3.1 Example mobile SoC use cases ...... 94
Table 3.2 ISI versus related work ...... 98
Table 3.3 Experimental setup ...... 116
Table 3.4 Evaluation use cases ...... 119
Table 3.5 Baseline network simulation configurations ...... 120
Table 3.6 DSENT configuration ...... 125

Table 4.1 Baseline DCP configurations ...... 148
Table 4.2 List of Android workloads ...... 150
Table 4.3 System configurations and workloads summary ...... 152
Table 4.4 DCP structures hardware cost and performance estimations ...... 160
Table 4.5 VDCP compression across DRAM generations ...... 163

Table A.1 Example graphics state data queried by gem5-graphics ...... 203
Table A.2 List of graphics state data used by gem5-graphics ...... 204
Table A.3 Checkpoint command data ...... 216

List of Figures

Figure 1.1 SoC chip main components ...... 3

Figure 2.1 A taxonomy of computer architecture simulators ...... 13
Figure 2.2 The OpenGL Pipeline ...... 15
Figure 2.3 An example of rendering a triangle over the graphics pipeline ...... 18
Figure 2.4 The Early-Z stage in the graphics pipeline ...... 20
Figure 2.5 Hi-Z buffering ...... 21
Figure 2.6 The Hi-Z stage ...... 22
Figure 2.7 Clipping and culling stages ...... 23
Figure 2.8 Vertex order and primitive winding ...... 24
Figure 2.9 View frustum culling ...... 24
Figure 2.10 Guard-band clipping ...... 25
Figure 2.11 Hierarchical rasterization ...... 26
Figure 2.12 The stages of hierarchical rasterization ...... 27
Figure 2.13 Hardware graphics pipeline ...... 28
Figure 2.14 TBR binning ...... 29
Figure 2.15 ITR binning ...... 32
Figure 2.16 Compute GPU architecture ...... 34
Figure 2.17 GPU Compute SIMT cluster ...... 35
Figure 2.18 Overview of gem5-emerald ...... 35
Figure 2.19 Emerald graphics pipeline ...... 37
Figure 2.20 Emerald SIMT cluster ...... 44
Figure 2.21 Example vertex batch distribution ...... 45
Figure 2.22 VPO Unit ...... 47

Figure 2.23 The mapping of screen-space to GPU cores ...... 48
Figure 2.24 TC Unit ...... 52
Figure 2.25 TC unit handling of incoming raster tiles ...... 54
Figure 2.26 Examples of tile coalescing ...... 76
Figure 2.27 Emerald correlation with Tegra K1 GPU ...... 77
Figure 2.28 Emerald software architecture ...... 78
Figure 2.29 GPU execution time under a regular load ...... 79
Figure 2.30 M3-HMC DRAM bandwidth ...... 80
Figure 2.31 Page hit rate and bytes accessed per row activation normalized to baseline ...... 81
Figure 2.32 Performance under the high-load scenario ...... 82
Figure 2.33 Number of display requests serviced relative to baseline ...... 83
Figure 2.34 M1 Rendering by baseline and DASH DTB ...... 84
Figure 2.35 Comparing fine and coarse screen-space division for fragment shading ...... 85
Figure 2.36 Case study II workloads ...... 85
Figure 2.37 Frame execution time for WT sizes of 1–10 normalized to WT of 1 ...... 86
Figure 2.38 Normalized (to WT1) execution times and the total L1 cache misses of various caches for W1 ...... 86
Figure 2.39 The average of MLB, MLC, SOPT, and DFSL per-frame speedup normalized to MLB ...... 87

Figure 3.1 ArChip16 SoC NoC ...... 94
Figure 3.2 3D Rendering on Android: memory traffic by source ...... 96
Figure 3.3 Mobile workloads NoC traffic asymmetry ...... 96
Figure 3.4 Comparing ISI to baseline and buffer sharing schemes ...... 97
Figure 3.5 Baseline router design ...... 101
Figure 3.6 The effect of traffic asymmetry on network saturation ...... 102
Figure 3.7 ISI traffic with two asymmetric injection sources ...... 104
Figure 3.8 Dynamic and leakage power breakdown for a 5×5 router using 22nm nodes ...... 106
Figure 3.9 ISI Architecture ...... 107

Figure 3.10 ISI VC and switch allocation ...... 109
Figure 3.11 Virtual channel state (VCS) and VIP Packet Order Tracking (VPOT) tables ...... 110
Figure 3.12 Normalized area and power consumption of generic 5×5 router compared to 3-way ISI routers ...... 115
Figure 3.13 ISI vs. static speedup: Crossbar area and power vs. crossbar radix (x-axis) normalized to baseline (no speedup) ...... 116
Figure 3.14 Evaluating ISI under synthetic traffic using symmetric and asymmetric traffic ...... 117
Figure 3.15 ISI crossbar NoC packet request-to-response latency vs. trace injection rate ...... 121
Figure 3.16 ISI mesh NoC packet request-to-response latency vs. trace injection rate ...... 122
Figure 3.17 ISI IP latency vs. trace injection rate (Crossbar) ...... 124

Figure 4.1 The life cycle of a surface from rendering to display ...... 129
Figure 4.2 Pixel change and Color change in Google Chrome ...... 134
Figure 4.3 The cumulative distribution function (CDF) of unique color values in UI (Twitter) and 3D (Temple Run 2) Android applications ...... 135
Figure 4.4 Using DCP across frames ...... 136
Figure 4.5 DCP stages ...... 137
Figure 4.6 Reading a DCP compressed surface ...... 139
Figure 4.7 Compression rates vs. FVC coverage for Android workloads ...... 141
Figure 4.8 DCP memory layout ...... 143
Figure 4.9 DCP compression vs. CCD size ...... 144
Figure 4.10 VDCP example ...... 146
Figure 4.11 Compression rates for Android workloads ...... 154
Figure 4.12 Harmonic mean of DCP schemes compression rates per application category ...... 155
Figure 4.13 Comparing RAS, RED and VDCP effective compression rates ...... 156
Figure 4.14 Comparing FVC size with relative coverage and its effect on compression rates ...... 157

Figure 4.15 The compression rate of a 64-entry FVC at different associativity values (normalized to the compression rate of a fully associative FVC) ...... 157
Figure 4.16 Normalized compression rates vs. FVC pixel sampling rate for UI applications ...... 158
Figure 4.17 VDCP normalized (to sampling period of 1) compression rates vs. FVC frame sampling period ...... 159
Figure 4.18 Ratio of DCP vs. RAS compressed blocks in the hybrid scheme ...... 161
Figure 4.19 Power efficiency across mobile Low-Power DRAM generations ...... 162

Figure A.1 An overview of the graphics standalone simulation mode ...... 201
Figure A.2 Fast-forwarding and ROI simulation ...... 202
Figure A.3 Graphics memory layout ...... 203
Figure A.4 GPU TLB miss handling ...... 207
Figure A.5 Guest-side (Android) graphics flow ...... 208
Figure A.6 Host-side (gem5) graphics flow ...... 209
Figure A.7 Gem5Pipe states ...... 211
Figure A.8 Guest-side and host-side graphics state changes ...... 213
Figure A.9 Graphics checkpoint recording flow ...... 215
Figure A.10 Graphics checkpoint restore flow ...... 216
Figure A.11 Gem5Pipe in Host-Render mode ...... 218
Figure A.12 Translating GLSL to PTX ...... 221

List of Abbreviations

ADCP Adaptive Dynamic Color Palettes
ALU Arithmetic Logical Unit
AOU Atomic Operations Unit
API Application Programming Interface
ATE Atomic Tile Execution
BW BandWidth
CAM Content-Addressable Memory
CCD Common Colors Dictionary
CPU Central Processing Unit
CSB Compression Status Buffer
CUDA Compute Unified Device Architecture
DCP Dynamic Color Palettes
DDR Double Data Rate Synchronous Dynamic Random Access Memory
DRAM Dynamic Random-Access Memory
FLOPS Floating-point Operations Per Second
FVC Frequent Values Collector
GLES OpenGL ES
GLX OpenGL Extension to the X Window System
GPGPU General-Purpose computing on GPUs
GPU Graphics Processing Unit
GUI Graphics User Interface
HD High Definition
HDCP Hybrid Dynamic Color Palettes

Hi-Z Hierarchical-Z
ILP Instruction Level Parallelism
IP Intellectual Property Core
IPC Instructions Per Cycle
ISA Instruction Set Architecture
ISI Interleaved Source Injection
ISP Image Signal Processor
L1 Level 1
L2 Level 2
LPDDR Low-Power DDR
LRU Least Recently Used
MAD Multiply-Add
MMU Memory Management Unit
MRT Multiple-Render Targets
MVP Model-View-Projection
NoC Network-on-Chip
OGL OpenGL
OGLES OpenGL ES
OS Operating System
OpenGL Open Graphics Library
OpenGL ES OpenGL for Embedded Systems
PMRB Primitive Mask Reorder Buffer
PTX Parallel Thread Execution
ROI Region Of Interest
ROP Raster Operations
RTL Register Transfer Level
SIMT Single-Instruction, Multiple-Thread
SRAM Static Random-Access Memory
SoC System-on-Chip
TBDR Tile-Based Deferred Rendering
TBR Tile-Based Rendering
TC Tile Coalescing
TCE Tile Coalescing Engine

TCS Tessellation Control Shader
TCTD TC Tile Distributor
TES Tessellation Evaluation Shader
TGSI Tungsten Graphics Shader Infrastructure
TLB Translation Lookaside Buffer
UI User Interface
VDCP Variable Dynamic Color Palettes
VPO Vertices Processing and Operations
rCCD Reversed Common Colors Dictionary

Acknowledgments

I would like to begin by expressing my sincere gratitude to my advisor, Professor Tor M. Aamodt, for his guidance, support, and encouragement throughout my Ph.D. years. His insightful advice, invaluable feedback, and wise judgment were instrumental during my academic journey. I would like to thank him as well for providing me with ample autonomy to learn, explore, and define my own research direction.

I had the pleasure of interacting and working with many great friends and colleagues who made my time at UBC such an unforgettable experience. Thank you to Ahmed ElTantawy, Wilson Fung, Tayler Hetherington, Francois Demoullin, Aamir Raihan, Ali Bakhoda, Amruth Sandhupatla, Andrew Boktor, Dave Evans, Deval Shah, Dongdong Li, Hadi Jooybar, Inderpreet Singh, Jimmy Kwa, Maria Lubeznov, Negar Goli, Shadi Assadi, Tim Rogers, and Tommy Chou.

I also wish to thank the people I worked with during my internships, where I learned a lot of things that helped me down the road. In particular, I would like to thank Venyu Narasiman from Research, for his mentorship and support, and Weiping Liao, Allan Knies, and Stéphane Marchesin from Google, for their valuable insights and feedback.

Needless to say, none of this work would have been possible without the generous support of our funding sources: the Natural Sciences and Engineering Research Council of Canada (NSERC), Technologies, and Google.

Finally, my most profound gratitude and appreciation goes to my parents, sisters, and many uncles, aunts, and cousins for their love and encouragement throughout the years.

To my parents, Ahmed and Khadija.

Chapter 1

Introduction

The limits on scalar processor performance and the demand for energy efficiency are the two trends driving innovation in computer architecture today. Moore's law [159, 206] and Dennard scaling [206] allowed chip manufacturers to produce smaller, faster, and more power-efficient transistors each generation. With more transistors and higher operating frequencies, computer architects were able to extract additional instruction-level parallelism (ILP) using increasingly sophisticated designs. Computer architecture innovations led to superscalar processors [112] that can harness ILP close to its limits [199, 245], and existing applications were able to reap the benefits of single-thread performance improvements.

Eventually, however, with sub-micron feature sizes, it became challenging to scale down the operating voltage further to maintain power density, i.e., watts per unit area [72], as Dennard scaling approached its limits. With Moore's law still in action — at least for the time being — newer chips became increasingly power dense. This trend limited the percentage of a chip that can switch at full frequency as larger numbers of transistors were packed into smaller areas. To manage higher power densities, frequency scaling had to grind to a halt; this was the beginning of the power wall [68].

The end of frequency scaling shifted the innovation frontier away from one-size-fits-all superscalar processors. Demand for higher performance meant the beginning of a new era dominated by specialized power-efficient accelerators. This has led to the increasing use of specialized accelerator cores in heterogeneous systems [96], where heterogeneous systems-on-chips (SoCs) are now the most ubiquitous computing platforms.

On the other hand, the demand for energy-efficient computing came with the shift from personal to mobile and cloud computing [53, 54, 218, 219]. For mobile systems, energy is limited; thus, energy efficiency is crucial for usability. On the other end of the spectrum, cloud servers operate large, complex, and power-hungry clusters of computing nodes. For the operators of cloud servers, power consumption is a significant expense, and energy efficiency is crucial for the bottom line [30, 40, 71]. Consequently, and with the rapid expansion of mobile and cloud computing, energy efficiency became a benchmark for computer architects [251], accelerating the adoption of specialized accelerators.

This work aims to solve some of the challenges that arose with the adoption of specialized accelerators in the mobile domain, and it proposes a set of tools and solutions for designing mobile systems-on-chips (SoCs).

1.1 Motivation

Mobile SoCs today are complex chips with several power- and energy-efficient specialized accelerators [24, 44, 96, 175] that interact and cooperate to achieve a common goal: to provide the best possible user experience. Figure 1.1 shows a contemporary SoC chip with its three main components. The compute component of the chip consists of a set of specialized accelerator IPs1 ( 1 ). The network-on-chip component ( 2 ) connects IPs to each other and to a memory component ( 3 ). To design an efficient mobile system, a comprehensive approach should be followed to optimize all three components. This means incorporating dependencies and interactions between the components. In addition, it is also important to consider how users interact with the system; ideally, SoC designs should optimize performance and energy for common use cases.

SoC designers still have a plethora of challenges to tackle, as the number and complexity of commercial SoC chipsets have grown tremendously in the last decade, with vendors battling to set their products apart from the competition [98].

1IPs stand for Intellectual Property cores, which refer to the various compute cores used in SoC chips, like CPUs, GPUs, signal processors, modems, display controllers, and AI engines.

Figure 1.1: An SoC chip consists of three major components. The computing component ( 1 ) consists of a number of power-efficient accelerators that serve a variety of mobile use cases. These accelerators are connected through a network-on-chip fabric ( 2 ) that enables accelerators to communicate with each other and with the memory system ( 3 ). The latter serves as a shared storage space and communication medium for all accelerators.

This dissertation tackles major challenges in SoC architecture across the three major SoC components. For IPs, it introduces tools that enable studying critical SoC IPs, namely, graphics accelerators (Chapter 2). For SoC NoCs, it proposes NoC designs that take into account SoC-specific behavior (Chapter 3). Finally, for memory, it proposes compression techniques that reduce memory consumption in many common mobile SoC use cases (Chapter 4).

In the first contribution of this dissertation, we study specialized IP behavior, focusing on modeling and studying graphics accelerators. This work was motivated by the critical role of graphics in most SoC use cases. As explained by the interactive nature of mobile systems, and confirmed by research on user behavior [70, 172], the IPs that are active most of the time on a modern SoC are the CPU, the graphics processing unit (GPU), and the display system. Estimates from recent chips (Table 1.1) show that the area allocated for GPU processors on SoCs is equal to, or larger than, the chip area allocated for CPU units; this shows how critical GPUs have become in today's systems.

Vendor      Chip            Year     CPU Area   GPU Area   Source
Apple       A12             2018     13.46%     13.64%     [232]
HiSilicon   Kirin 970       2017      7.91%     17.75%     [231]
Qualcomm    Snapdragon 845  2017/18  11.26%     10.58%     [233]
Samsung     Exynos 9810     2018     17.22%     19.16%     [233]

Table 1.1: Estimated area occupied by CPU and GPU cores in recent mobile chips (estimated from the corresponding annotated die photos). These estimates show that GPU cores now occupy a similar or larger chip area compared to CPU cores.

But due to the lack of architecture simulation tools for mobile SoCs, and the difficulty of building such tools, academic researchers have been limited to using methods that approximate the behavior of SoCs for many common use-case scenarios that include graphics [60], which has hindered their efforts to carry out impactful research.

We cannot understand SoC systems, however, by only looking into the individual behavior of SoC IPs while ignoring how these IPs communicate. For this reason, the second contribution of this dissertation, in Chapter 3, studies SoC NoCs and proposes techniques for handling their traffic.

From a niche field, SoC NoC architecture has developed into a significant competitive advantage in recent years, and industry leaders have shown an increasing interest in acquiring state-of-the-art SoC NoC technology [170]. Academia, however, lacks a body of research that studies this important class of heterogeneous NoCs. The work in Chapter 3 sheds light on some of the unique challenges of designing heterogeneous SoC NoCs and proposes a first-order solution to the use-case-dependent asymmetric traffic patterns of SoC NoCs.

Finally, in the line of work in Chapter 4, we cover the third major component of SoCs: the memory system. In this work, we study reducing the energy cost of accessing off-chip memory under common SoC use cases; in particular, Chapter 4 proposes a surface compression mechanism to reduce framebuffer bandwidth traffic to SoC off-chip memory.

The work in Chapter 4 was motivated by the fact that off-chip memory traffic is a significant source of power and energy consumption in SoCs [184] and that most of a mobile device's use time is allocated to GUI-based applications [70, 172] that employ off-chip memory to store framebuffer surfaces. Following the spirit of Amdahl's law [10], we were motivated to design a surface compression scheme that is explicitly optimized for common mobile use cases. In this work, we develop Dynamic Color Palettes (DCP), a technique that exploits the temporal coherence present in graphics applications in general, in addition to intrinsic characteristics specific to GUI and 2D applications.
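For context, Amdahl's law gives the bound being appealed to here: if a fraction $f$ of a cost (execution time, or, in our setting, memory energy) is subject to an optimization that improves it by a factor $s$, the overall gain, in the law's standard form, is

\[
\text{Speedup}_{\text{overall}} = \frac{1}{(1 - f) + \dfrac{f}{s}}
\]

Because GUI-based applications account for most of a device's use time, the framebuffer traffic they generate corresponds to a large $f$, which is why a compression scheme targeting these common cases can have an outsized system-level effect.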

1.2 Research Challenges

Mobile SoCs include several architecturally diverse IPs that cover a broad spectrum of architectural knowledge domains. This complexity contributes to the scarcity of research on this important area of computer architecture. Some researchers, for example, propose using mathematical models to reason about high-level aspects of SoC architecture, as developing a detailed performance model can be challenging [98], especially at early design stages. Others employ tools that use simplified models that combine software-model traces with detailed memory models [49, 167, 168, 253]; however, as demonstrated in the use cases in Chapter 2, these models can sometimes miss crucial aspects of a system's behavior.

In the work on graphics modeling (Chapter 2), the main challenge was the lack of publicly available resources for graphics architectures. Graphics APIs, like OpenGL [123] and Direct3D [153], define high-level interfaces that GPU vendors use to drive and define their hardware. Due to the high-level nature of these APIs, GPU vendors have the flexibility to define a variety of compatible architectures under the hood (discussed in Chapter 2, Section 2.2). This means that significant parts of the architecture are inaccessible to users. Earlier work [62, 216] on graphics modeling studied older architectures that are less relevant to today's GPUs. While some researchers have tried to approximate the graphics part of mobile SoCs [87], such approaches can result in an inaccurate representation of system behavior [60].

The challenges for our second work, on SoC NoCs (Chapter 3), are related to the above-mentioned issues of SoC complexity. To properly evaluate NoC designs for SoC systems, a detailed full-system model is preferable. However, as highlighted in Section 1.1, there is a lack of tools that provide such comprehensive models for SoCs. While our work on Emerald fulfills the need for a graphics model, models for other SoC IPs are not readily available. As a result, we used a methodology that utilizes both traces and cycle-accurate models to evaluate our work on SoC NoCs. This methodology allowed us to carry out an extensive set of experiments. We hope, however, that our work on SoC NoCs motivates further investment by the research community in SoC research tools.

Finally, for our work on surface compression (Chapter 4), the first challenge was to ensure that the assumptions we relied on to design our compression technique are applicable to a large number of workloads. To do that, we carried out an extensive study using over 100 workloads, which demonstrated that our scheme is effective under a wide range of use cases. The second challenge for this work was to design a hardware compression algorithm that would work with any hardware architecture, as GPU and memory designs can vary widely from one vendor to another.

1.3 Research Objectives and Contributions

The objectives of this dissertation are to shed light on interesting research problems in mobile SoCs, provide methods to study them, and propose solutions to some of the current issues. Throughout this dissertation, we studied and made contributions that cover each of the main SoC components. In the first part, our objective was to fill the gap in academic architecture research by providing a platform that can model SoC systems with sufficient detail, including support for full-system simulation and essential SoC components. Our work in Chapter 2 contributes the following:

• a cycle-accurate simulation infrastructure that extends GPGPU-Sim [37] to provide a unified simulator for graphics and GPGPU workloads;

• an infrastructure to simulate SoC systems using Android and gem5;

• a study of the behavior of memory in SoC systems highlighting the importance of detailed modeling incorporating dependencies between components, feedback from the system, and the timing of events; and

6 • a dynamic load-balancing technique exploiting graphics temporal coherence to improve GPU performance.

In our second work, our objective was to highlight the properties of NoC traffic under heterogeneous mobile SoCs and propose solutions that improve throughput and energy efficiency. The work in Chapter 3 makes the following contributions:

• Highlighting the issue of time-variant traffic asymmetry in heterogeneous SoCs.

• Analyzing the consequences of traffic asymmetry on NoCs.

• Proposing Interleaved Source Injection (ISI), a technique that alleviates inefficiencies caused by asymmetric traffic.

Finally, our objective for the work in Chapter 4 was to study how we can reduce the cost of accessing memory in mobile SoC systems. The work in Chapter 4 makes the following contributions:

• It characterizes compression properties of framebuffer surfaces of user-interface (UI) as well as non-UI 2D and 3D applications.

• It uses characterization results to propose and evaluate dynamic color palettes (DCP), a compression technique that offers higher compression rates for common UI and non-UI 2D applications.

• It proposes two DCP variations that dynamically choose an optimal palette size based on the frequencies of the values in color palettes.

• It evaluates DCP compression schemes using an extensive set of workloads consisting of 124 Android applications.

1.4 Dissertation Organization

The main contributions of this dissertation are organized into three main chapters:

• Chapter 2: Emerald: Graphics Modeling for SoC Systems.

7 • Chapter 3: Balanced Injection of Asymmetric Network Loads Using Inter- leaved Source Injection.

• Chapter 4: Surface Compression Using Dynamic Color Palettes.

Each of these chapters discusses relevant background material and related work. The last chapter (Chapter 5) concludes the dissertation and provides directions for potential future work. Finally, Appendix A presents some of the implementation details of our work on Emerald in Chapter 2.

8 Chapter 2

Emerald: Graphics Modeling for SoC Systems

This chapter presents the work on the Emerald modeling infrastructure. The first part of this chapter, Sections 2.1 and 2.2, presents background material on graphics APIs and hardware. The following three sections provide details on Emerald: Section 2.3 gives an overview of the Emerald SoC model, Section 2.4 details the Emerald graphics model, and Section 2.5 provides an overview of Emerald's software design. Sections 2.6 and 2.7 present two case studies using Emerald: Section 2.6 evaluates proposals for SoC memory architectures, and Section 2.7 studies balancing fragment shading on GPU hardware. Section 2.8 discusses related work, and Section 2.9 concludes the chapter.

2.1 Introduction

With the increasing prevalence of heterogeneous SoCs, there is a commensurate need for architectural performance models that capture contemporary SoCs. Although existing performance models such as gem5 [41] and MARSSx86 [187] enable running a full operating system, they lack the ability to capture all hardware due to the absence of models for key elements such as graphics rendering [60, 88]. The exclusion of such key components limits researchers' ability to evaluate common use cases (e.g., GUI and graphics-intensive workloads), and may lead researchers to develop solutions that ignore significant system-wide behaviors. A recent study [60] showed that using software rendering instead of a hardware model "can severely misrepresent" the behavior of a system.

To provide a useful model for studying system-wide behavior, we believe an architecture model for a contemporary heterogeneous SoC must:

1. Support a full operating system software stack (e.g., Linux and Android).

2. Model the main hardware components (e.g., CPU, GPUs and the memory hierarchy).

3. Provide a flexible platform to model additional special components (e.g., specialized accelerators).

A tool that meets these requirements would allow architecture researchers to evaluate workloads capturing the intricate behavior of present-day use cases, from computer games to artificial intelligence and augmented reality.

The gem5 simulator [41] can evaluate multicore systems, including executing operating system code, and has been extended to model additional system components. Table 2.1 compares various frameworks for modeling GPU-enabled SoCs, including GemDroid [49] and gem5-gpu [190]. These two simulators extend gem5 to capture memory interactions and to incorporate existing GPGPU models, respectively. GemDroid uses multiple single-threaded traces of a modified Android emulator that captures events (e.g., an IP core invoked via an API call) and adds markers in the instruction trace. When an event is encountered while replaying an instruction trace, the memory traffic generated by that event is injected while instruction execution is paused. Multiprogrammed parallelism is modeled by replaying multiple independent single-threaded traces. Using static traces limits the ability to capture intricate system interactions or to evaluate ideas that exploit low-level system details. While gem5-gpu adds a GPGPU model to gem5, it does not support graphics workloads.

In this chapter, we introduce Emerald, a GPU simulator. Emerald is integrated with gem5 to provide both GPU-only and full-system performance modeling, meeting the three requirements above. Gem5-emerald builds on GPGPU-Sim [37] and extends gem5 full-system simulation to support graphics. Emerald's graphics model is able to execute graphics shaders using the same model used by GPGPU-Sim, thus enabling the same model to be used for both graphics and GPGPU workloads. Additional models were added to support graphics-specific functions. In addition, to use Emerald under a full-system SoC model, we modified Android to add a driver-like layer to control our GPU model.

We employ Emerald in two case studies. In the first, we use gem5-emerald's full-system mode to re-evaluate work [49, 240] that used trace-based simulation to evaluate memory scheduling and organization schemes for heterogeneous SoCs. This case study demonstrates issues that are difficult to detect under trace-based simulation. The second case study uses the standalone GPU model to evaluate dynamic fragment shading load-balancing (DFSL), a novel technique that dynamically balances fragment shading across GPU cores. DFSL employs "temporal coherence" [207] – the similarity of successively rendered frames – and can reduce execution time by 7.3–19% over static work distribution.

Simulator        Model             GPGPU   Graphics   FS Simulation
gem5 [41]        Execution driven  No      No         Yes
GemDroid [49]    Trace driven      No      Yes        No
gem5-gpu [190]   Execution driven  Yes     No         Yes
Emerald          Execution driven  Yes     Yes        Yes

Table 2.1: Comparing Emerald to other open-source simulation platforms. Emerald provides an execution driven GPU model that supports running both graphics and GPGPU applications under standalone and full-system (Android) configurations.

The work in this chapter makes the following contributions:

1. Emerald1, a simulation infrastructure that extends GPGPU-Sim [37] to provide a unified simulator for graphics and GPGPU workloads;

2. An infrastructure to simulate SoC systems using Android and gem5;

1The source code for Emerald can be found at https://github.com/gem5-graphics/gem5-graphics

3. A study of the behavior of memory in SoC systems highlighting the importance of detailed modeling incorporating dependencies between components, feedback from the system, and the timing of events;

4. DFSL, a dynamic load-balancing technique exploiting graphics temporal coherence to improve GPU performance.

2.2 Background

This section provides some background on architectural modeling and graphics. Section 2.2.1 gives a quick overview of computer architecture simulators. The following sections present graphics background: Section 2.2.2 discusses graphics APIs, Section 2.2.3 goes over a number of common graphics hardware techniques, Section 2.2.4 reviews contemporary graphics hardware architectures, and Section 2.2.5 provides an overview of GPU compute architectures.

2.2.1 Computer Architecture Simulators

Designing and producing computer chips is a long, complicated, and costly multi-stage process; it includes developing a high-level architecture, producing detailed architecture specifications, elaborating and validating a low-level digital design, manufacturing, and post-silicon debugging. Because of the costs involved in producing a chip, manufacturers rely on architectural simulators in the early stages to weigh the prospects of different architectural designs, to test and evaluate new ideas, and to explore the design space. At the same time, architectural simulation is an accessible method that academics use in their research.

Figure 2.1 shows a taxonomy of computer architecture simulators. Functional simulators ( 1 ) emulate the functional part of a design. For example, a CPU functional simulator is capable of executing a CPU program (i.e., ISA binaries) and can produce a set of expected output values, including parts of the architectural state (e.g., values held in CPU registers, caches, or memory). Unlike performance simulators ( 2 ), functional simulators do not provide timing information, i.e., how long it takes to execute a workload or parts of a workload. Hence, they are used to verify the functionality of a design and to assist other stakeholders (e.g., using a GPU functional simulator to verify a GPU driver).

Figure 2.1: A taxonomy of computer architecture simulators. Functional simulators emulate only the functional part and do not provide timing information. On the other hand, performance simulators provide performance timing data and use either a cycle-based implementation or an event-driven one.

Performance simulators ( 2 ) model the timings of chip components. They are used to predict the performance of a new chip design or to evaluate the effect of changes on an existing design. For example, a CPU performance simulator is capable of producing timings that correspond to how long it takes to execute a sequence of instructions, the utilization of different stages, and the delays encountered by components like ALUs, caches, and memory. Different performance simulators provide various levels of detail based on their purpose and design stage.

For modeling, performance simulators can be further divided into cycle-based simulators ( 3 ) and event-driven ones ( 4 ). In cycle-based simulation, a simulator inspects the state of every component each cycle to find if any event should be performed at that cycle. On the other hand, event-driven simulators rely on an event queue that contains events that will occur in future cycles, ordered by the timing of events. The event queue is filled starting from a main event; from there, each event can trigger a number of new events that occur in future cycles. Event-driven simulators are faster because they only execute the code corresponding to each component when there is an event at that particular component. However, they are usually more complicated to build and use.

While Figure 2.1 provides an overview of architectural simulation techniques, in reality, and depending on the specifics of the issue under investigation, simulators used by academics and industry practitioners may utilize a mixture of the above-mentioned techniques to model different parts of an architecture. Section 2.8 later gives an overview of the computer architecture simulation tools used in academic research.
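To make the cycle-based/event-driven distinction concrete, the following minimal C++ sketch shows the core of an event-driven simulator: a time-ordered event queue whose handlers may schedule further events. It is an illustrative toy, not code from Emerald or gem5, and all names are hypothetical.

```cpp
#include <cstdint>
#include <functional>
#include <iostream>
#include <queue>
#include <utility>
#include <vector>

// One pending event: a timestamp and the action to perform at that time.
struct Event {
    uint64_t tick;                 // simulated cycle at which the event fires
    std::function<void()> action;  // component-specific handler
};

// Order the queue so the event with the smallest tick is served first.
struct LaterTick {
    bool operator()(const Event& a, const Event& b) const { return a.tick > b.tick; }
};

class EventSimulator {
public:
    void schedule(uint64_t tick, std::function<void()> action) {
        queue_.push({tick, std::move(action)});
    }
    uint64_t now() const { return now_; }
    // Main loop: unlike a cycle-based simulator, we jump directly from one
    // event time to the next instead of inspecting every component each cycle.
    void run() {
        while (!queue_.empty()) {
            Event e = queue_.top();
            queue_.pop();
            now_ = e.tick;
            e.action();  // a handler may call schedule() to enqueue future events
        }
    }
private:
    uint64_t now_ = 0;
    std::priority_queue<Event, std::vector<Event>, LaterTick> queue_;
};

int main() {
    EventSimulator sim;
    // A toy "memory access" that completes 100 cycles after it is issued.
    sim.schedule(5, [&] {
        std::cout << "tick " << sim.now() << ": issue memory request\n";
        sim.schedule(sim.now() + 100, [&] {
            std::cout << "tick " << sim.now() << ": response arrives\n";
        });
    });
    sim.run();
    return 0;
}
```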

2.2.2 Graphics APIs

A graphics API defines a set of commands that track the state and control the process of 3D rendering. Microsoft's Direct3D [153] and Khronos Group's OpenGL [123] (along with their variations [120, 122, 125]) are the prevalent real-time graphics APIs in use today. OpenGL and Direct3D define similar graphics pipelines. This work describes the graphics pipeline using the specifications and the terminology of OpenGL, as it dominates a large segment of mobile platforms [15].

2.2.2.1 The OpenGL Pipeline

OpenGL defines a family of similar APIs that are designed to serve various purposes. This work focuses on the OpenGL and OpenGL ES specifications. Standard OpenGL [123], or simply OpenGL, defines a general-purpose profile. On the other hand, OpenGL ES defines a profile oriented towards embedded and mobile systems. For instance, OpenGL ES introduces precision qualifiers that developers can use to reduce the power consumption of ALU operations. Table 2.2 lists OpenGL ES versions and the corresponding OpenGL versions [123]. Other application-specific graphics standards include OpenGL SC for safety-critical applications [122], WebGL for web-based applications [125], and OpenVG for vector graphics [119].

OpenGL draws primitives using a number of rendering modes. A primitive can be a point, a line segment, or a triangle. OpenGL commands, i.e., function calls, set modes, specify primitives, and initiate other OpenGL operations. A primitive is a group of one or more vertices. A vertex is a point, an endpoint of an edge, or a corner of a triangle where two edges meet. Vertex data include, among others, positional coordinates, colors, normals, and texture coordinates.

OpenGL ES version   Corresponding OpenGL version
OpenGL ES 1.1       OpenGL 1.5
OpenGL ES 2.0       OpenGL 2.0
OpenGL ES 3.0       OpenGL 3.3
OpenGL ES 3.1       OpenGL 4.3

Table 2.2: OpenGL ES versions and the OpenGL versions they subset [123]
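As a concrete illustration of the API concepts above (commands, primitives, and per-vertex attributes), the sketch below specifies a single triangle primitive using OpenGL ES 2.0. This is an illustrative fragment, not code from this dissertation; it assumes a GLES context is already current, and `program` is assumed to be a linked shader program with attributes `a_position` and `a_color` (a companion sketch showing how such a program is built appears later in this section).

```cpp
#include <GLES2/gl2.h>

void drawTriangle(GLuint program) {
    // Three vertices; each carries a 2D position and an RGB color attribute.
    const GLfloat vertices[] = {
        //  x,     y,     r,    g,    b
        -0.5f, -0.5f,  1.0f, 0.0f, 0.0f,
         0.5f, -0.5f,  0.0f, 1.0f, 0.0f,
         0.0f,  0.5f,  0.0f, 0.0f, 1.0f,
    };

    // Upload the vertex data into a buffer object (the "vertex array" input).
    GLuint vbo;
    glGenBuffers(1, &vbo);
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    glBufferData(GL_ARRAY_BUFFER, sizeof(vertices), vertices, GL_STATIC_DRAW);

    // Describe how the interleaved per-vertex attributes are laid out.
    glUseProgram(program);
    const GLuint pos = static_cast<GLuint>(glGetAttribLocation(program, "a_position"));
    const GLuint col = static_cast<GLuint>(glGetAttribLocation(program, "a_color"));
    const GLsizei stride = 5 * sizeof(GLfloat);
    glVertexAttribPointer(pos, 2, GL_FLOAT, GL_FALSE, stride, (void*)0);
    glVertexAttribPointer(col, 3, GL_FLOAT, GL_FALSE, stride,
                          (void*)(2 * sizeof(GLfloat)));
    glEnableVertexAttribArray(pos);
    glEnableVertexAttribArray(col);

    // One draw command: vertices 0..2 are assembled into a single triangle
    // primitive and pushed down the pipeline of Figure 2.2.
    glDrawArrays(GL_TRIANGLES, 0, 3);
}
```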

[Figure 2.2 diagram: the pipeline flows from (1) vertex array input through (2) vertex shader, (3) tessellation control shader, (4) tessellation primitive generation, (5) tessellation evaluation shader, (6) geometry shader, (7) rasterization, (8) fragment shader, and (9) raster operations to (10) the framebuffer; stages are marked as programmable, fixed-function, or optional.]

Figure 2.2: The OpenGL Graphics Pipeline [120, 123]. The pipeline includes fixed-function and programmable stages that operate on vertex or fragment data. Stages 3 to 6 process vertex data, while stages 7 to 9 process fragment data.

Figure 2.2 shows the rendering stages defined by the OpenGL graphics pipeline [120, 123]. An application provides the pipeline with a buffer that contains vertex data ( 1 ). The pipeline processes input vertex data in stages 2 to 9 to produce a framebuffer ( 10 ) that contains a rendered image.

The first step in the pipeline is the vertex shader ( 2 ). Vertex shaders perform a set of operations that occur on vertex values and their associated attributes. Vertex attributes are a set of per-vertex values available to the vertex shaders [120, 123], such as positions, colors, normals, and texture coordinates. Depending on the configurations defined by the programmer, vertex shaders feed their output to a tessellation control shader ( 3 ), a geometry shader ( 6 ), or directly to rasterization ( 7 ).

Tessellation and geometry shaders are relatively new additions to the OpenGL pipeline; they were introduced in OpenGL 4.0 and 3.2, respectively. The new stages define methods to add geometry on-the-fly and allow the graphics pipeline to render scenes with higher detail without fetching an additional amount of vertex data from system memory, which would otherwise increase rendering time.

Tessellation is an optional process used to dynamically add details to the 3D objects supplied by the programmer. Tessellation consists of a three-stage pipeline ( 3 , 4 , and 5 ) that subdivides vertex data, grouped as patch primitives, into smaller primitives. The tessellation control shader, or TCS, is the first stage in the pipeline ( 3 ); it is a programmable stage that determines the degree of tessellation to be performed on input vertex patches [120, 123]. The TCS feeds patch data and tessellation levels to a fixed-function stage, the tessellation primitive generation stage ( 4 ). The primitive generation stage creates a set of new abstract patches from input patches. The tessellation evaluation shader, or TES ( 5 ), then consumes each vertex of the abstract patches produced by the primitive generation stage and generates a particular vertex with a position and attributes.

The geometry shader stage ( 6 ) is another optional stage; it creates new geometry from existing geometry on-the-fly. Geometry shaders consume vertices from the vertex shader stage or, when tessellation is enabled, from the TES stage. Geometry shaders operate on a single input primitive at a time and emit one or more output primitives; input primitives are discarded after geometry shader execution.

The output of the vertex processing stages ( 2 - 6 ) is fed into the rasterization stage ( 7 ). This stage converts each individual primitive to a two-dimensional set of discrete elements (fragments), where each fragment contains information such as color and depth.

The fragment shader stage ( 8 ) consumes fragments produced by rasterization and performs a set of per-fragment operations specified by the fragment shader program. Resulting fragments are then processed by the raster operations stage ( 9 ), where special per-fragment operations are performed, including blending, stencil testing, and depth testing. Once fragments are processed by the raster operations stage, the resulting fragment data (e.g., color and depth) are committed to the framebuffer ( 10 ). Each screen pixel corresponds to a single fragment; however, when using multisampling anti-aliasing techniques, multiple fragments are used to generate each displayable pixel [123].
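To tie the programmable stages above back to the API, the following hedged sketch builds a minimal program object containing a trivial vertex shader (stage 2) and fragment shader (stage 8) using OpenGL ES 2.0 calls. The shader sources and function names defined here are illustrative only; this is the `program` assumed by the earlier draw sketch.

```cpp
#include <GLES2/gl2.h>
#include <cstdio>

static const char* kVertexSrc =
    "attribute vec4 a_position;\n"
    "attribute vec3 a_color;\n"
    "varying vec3 v_color;\n"
    "void main() {\n"
    "  v_color = a_color;        // hand per-vertex attribute to the rasterizer\n"
    "  gl_Position = a_position; // stage 2: emit a clip-space position\n"
    "}\n";

static const char* kFragmentSrc =
    "precision mediump float;\n"  // a GLES precision qualifier (see Section 2.2.2.1)
    "varying vec3 v_color;\n"
    "void main() {\n"
    "  gl_FragColor = vec4(v_color, 1.0);\n"  // stage 8: shade each fragment
    "}\n";

static GLuint compile(GLenum type, const char* src) {
    GLuint shader = glCreateShader(type);
    glShaderSource(shader, 1, &src, nullptr);
    glCompileShader(shader);
    GLint ok = 0;
    glGetShaderiv(shader, GL_COMPILE_STATUS, &ok);
    if (!ok) std::fprintf(stderr, "shader compilation failed\n");
    return shader;
}

GLuint buildProgram() {
    GLuint prog = glCreateProgram();
    glAttachShader(prog, compile(GL_VERTEX_SHADER, kVertexSrc));
    glAttachShader(prog, compile(GL_FRAGMENT_SHADER, kFragmentSrc));
    glLinkProgram(prog);  // the fixed-function stages require no user code
    return prog;
}
```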

2.2.2.2 Recent Trends in Graphics APIs

OpenGL and Direct3D APIs provide an abstract interface that defines a client-server execution model. The client, a graphics program, issues commands that are interpreted and processed by a server (a graphics driver). Under this model, drivers, or API server-side implementations in general, maintain a number of graphics contexts. Each context encapsulates graphics state and objects. In addition to state tracking, drivers carry out a set of vendor-dependent optimizations that aim to improve performance and/or energy consumption. However, due to the abstract nature of graphics APIs, drivers have become large, complex, and error-prone.

Recently, GPU vendors started to offer low-level graphics APIs that hand over a higher degree of control to graphics programmers, where, for example, they are given control over memory management and resource bindings. This includes APIs like Khronos Group's Vulkan [124], AMD's Mantle [8], and, to a certain extent, Apple's Metal [23]. These APIs aim to reduce driver overhead by offloading low-level functionalities to the programmer. However, this low-level control increases program complexity, and the benefits of using such low-level APIs remain dependent on the nature of each graphics application and its development cycle [51]. In addition, programmers may have to hand-tune their code for different hardware architectures.

2.2.2.3 Example: Life of a Triangle

Figure 2.3 shows an example of the operations of the graphics pipeline in Figure 2.2. In Figure 2.3 ( 1 ), the inputs are three vertices, v0, v1, and v2, that represent a 3D triangle. The first set of stages operate on vertices (i.e., geometry).


Figure 2.3: An example of rendering a triangle using the graphics pipeline in Figure 2.2. The input triangle is specified using world-space coordinates ( 1 ), which are converted to screen-space coordinates by the vertex shader ( 2 ). Then, the tessellation stage uses existing vertices to produce new vertices according to the programmer's tessellation configurations ( 3 ). The geometry shader eliminates and produces new primitives from existing ones ( 4 ), where geometry output is used by the rasterization stage to produce screen-space fragments ( 5 ), which are modified by the fragment shader ( 6 ). Finally, the raster operations stage modifies the framebuffer accordingly to incorporate the new fragments ( 7 ).

In the first stage, vertex shading, the vertices ( 1 ) are transformed from world-space to screen-space. This converts the triangle's 3D coordinates to their corresponding 2D coordinates using a model-view-projection (MVP) matrix specified by the programmer. Once screen-space vertices are generated ( 2 ), tessellation generates new vertices as specified by the programmer's tessellation code ( 3 ). The output of the tessellation stage feeds into the geometry shading stage, which can eliminate primitives or generate new primitives based on existing ones. The output of the geometry stage ( 4 ) represents the final geometry before rasterization takes place. The second set of stages operate on screen-space fragments. Rasterization generates fragments that cover the primitives produced by the geometry stage.

Once fragments are generated ( 5 ), fragment shading performs lighting and texturing operations to produce shaded fragments ( 6 ). Shaded fragments then pass through the raster operations stage, which performs operations like blending and depth testing. The raster operations stage then updates the framebuffer, modifying the framebuffer pixels that correspond to the new fragments ( 7 ).
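To make the first step concrete, the following C++ sketch illustrates the world-space to screen-space conversion described above. The matrix layout and viewport mapping are simplified assumptions for illustration, not a description of any particular GPU or API implementation.

#include <array>

using Vec4 = std::array<float, 4>;
using Mat4 = std::array<std::array<float, 4>, 4>; // row-major 4x4 matrix

// Multiply a vertex position by the model-view-projection (MVP)
// matrix, producing clip-space coordinates.
Vec4 mvpTransform(const Mat4& mvp, const Vec4& v) {
    Vec4 out{0.0f, 0.0f, 0.0f, 0.0f};
    for (int r = 0; r < 4; ++r)
        for (int c = 0; c < 4; ++c)
            out[r] += mvp[r][c] * v[c];
    return out;
}

// Perspective divide and viewport mapping: clip space -> normalized
// device coordinates (NDC) in [-1, 1] -> 2D screen coordinates.
void toScreen(const Vec4& clip, int width, int height,
              float& screenX, float& screenY) {
    float ndcX = clip[0] / clip[3];
    float ndcY = clip[1] / clip[3];
    screenX = (ndcX * 0.5f + 0.5f) * static_cast<float>(width);
    screenY = (ndcY * 0.5f + 0.5f) * static_cast<float>(height);
}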

2.2.3 Hardware Graphics Pipeline Optimizations

Graphics accelerators (GPUs) contain a number of hardware blocks that accelerate the various stages of the graphics pipeline. Accelerators, working along with drivers, also implement a series of optimizations that reduce the amount of work that goes into rendering a frame. Such optimizations eliminate redundant or ineffectual work while maintaining, from a programmer's perspective, the same apparent behavior of the pipeline. This section describes some of the most common optimizations as well as some of the lesser-known ones adopted by contemporary graphics accelerators.

2.2.3.1 Early Elimination of Invisible Primitives and Fragments

This section discusses how position attributes can be used to eliminate invisible fragments in the early stages of the graphics pipeline.

2.2.3.1.a Early Depth Testing Graphics applications can optionally use depth to render 3D scenes. When used, fragment depth values determine their visibility; most 3D applications use depth to render nontrivial scenes. To use depth, a depth buffer, parallel to the framebuffer in Figure 2.2, is maintained to store the depth values of visible (foremost) pixels. Early depth testing, or Early-Z, uses the depth values in the depth buffer to eliminate fragments before the fragment shader stage [11, 14, 73, 85]; this eliminates the need to apply fragment shading to what would be invisible fragments. Figure 2.4 shows that Early-Z is placed after the rasterization stage and before the fragment shading stage ( 1 ). Each fragment produced by the rasterization stage has a depth value that is used for depth testing.


Figure 2.4: Early-Z stage 1 eliminates what would be invisible fragments before the fragment shading stage. In some cases, however, a Late-Z stage 2 is used, instead of Early-Z, to handle cases where fragment depths are modified at the fragment shading stage [85].

There are cases, however, where Early-Z cannot be applied. These cases include when the fragment shading program contains a path where depth is modified, or, as of OpenGL 4.2 [42], when the programmer explicitly disables Early-Z. For such cases, a Late-Z stage must be used to eliminate invisible fragments, as shown in Figure 2.4 ( 2 ). In addition, depth-based fragment elimination is not used when blending is enabled. Unlike depth testing, blending is accumulative and order-dependent; the programmer must submit primitives in a particular drawing order to achieve the desired results.
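The conditions above can be summarized in a small decision routine. The following C++ sketch is illustrative only; the DrawState fields are hypothetical names, not a real driver interface.

// Hypothetical per-draw state; these field names are assumptions.
struct DrawState {
    bool depthTestEnabled;
    bool blendingEnabled;
    bool shaderWritesDepth;   // shader has a path that modifies depth
    bool earlyZDisabledByApp; // programmer disabled Early-Z (OpenGL 4.2+)
};

enum class DepthTestPoint { None, EarlyZ, LateZ };

DepthTestPoint pickDepthTestPoint(const DrawState& s) {
    // Per the discussion above, depth-based fragment elimination is
    // skipped entirely without depth testing or when blending is on.
    if (!s.depthTestEnabled || s.blendingEnabled)
        return DepthTestPoint::None;
    // Early-Z is only safe when the shader cannot change the outcome
    // of the depth test and the programmer has not disabled it.
    if (s.shaderWritesDepth || s.earlyZDisabledByApp)
        return DepthTestPoint::LateZ;
    return DepthTestPoint::EarlyZ;
}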

2.2.3.1.b Hierarchical Depth Testing Hierarchical depth testing, or Hi-Z, is another method used for early elimination of what would be invisible fragments [11, 82, 91, 160]. Hi-Z uses an on-chip buffer [210, 211], the Hi-Z buffer, which is a low-resolution depth buffer that tracks the front and back depth values of the full-resolution depth buffer, which remains stored off-chip. Figure 2.5 explains how the Hi-Z buffer works. (A) shows a framebuffer with three triangles (R, G and B). The background and each triangle have a depth value as shown in the figure. The framebuffer is split into four regions, where each region is tracked by a single Hi-Z buffer entry in (B). The Hi-Z buffer tracks the value of each framebuffer region using two values, the min and max depths. The min depth value is the depth value of the backmost pixel in the region; the max depth value is the depth value of the foremost pixel in the region2.

2Note that the programmer can specify the function used for depth buffer comparisons using glDepthFunc [123]. For example, with GL_LESS, a fragment with a depth value that is less than the stored depth value passes the depth test, while GL_GREATER produces the opposite outcome.

Figure 2.5: (A) shows a framebuffer that has four regions with three triangles and their corresponding depth values (B at depth 1, R at depth 2, G at depth 3, over a background at depth 0). (B) shows a Hi-Z buffer with four entries that track the foremost/backmost depth values of each region in (A).

Min and max depth values of a framebuffer region can be used to determine, early in the pipeline, whether a fragment will be visible, invisible, or possibly either. Any fragment that has a depth value less than the min value is invisible; any fragment that has a depth value larger than the max value is visible; any fragment with a depth value between the min and the max depth values can be either. The advantage of Hi-Z over Early/Late-Z is that it can be performed using a fast on-chip memory/buffer. Full-resolution depth buffers are large and cannot be fully loaded on-chip; thus, Early/Late-Z have to access off-chip memory to determine fragment visibility. Hi-Z, on the other hand, tracks depth at a lower resolution; therefore, the Hi-Z buffer can be fully contained in on-chip memory. Figure 2.6 shows where Hi-Z is performed. Similar to Early/Late-Z, Hi-Z is performed before the fragment shader stage when possible ( 1 ); otherwise, it is delayed until after the fragment shader stage ( 2 ).


Figure 2.6: Hi-Z ( 1 , 2 ) is used to eliminate invisible fragments earlier in the pipeline using a fast on-chip Hi-Z buffer that contains a low-resolution version of the depth buffer.
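The min/max test described above amounts to a three-way classification of each fragment. The following C++ sketch assumes the convention of Figure 2.5, where larger depth values are closer to the viewer; real designs differ in precision, update policy and the configured depth function.

// One Hi-Z buffer entry for a framebuffer region (see Figure 2.5).
struct HiZEntry {
    float minDepth; // depth of the backmost pixel in the region
    float maxDepth; // depth of the foremost pixel in the region
};

enum class HiZResult { Invisible, Visible, Ambiguous };

HiZResult hiZTest(const HiZEntry& e, float fragDepth) {
    if (fragDepth < e.minDepth)
        return HiZResult::Invisible; // behind every pixel in the region
    if (fragDepth > e.maxDepth)
        return HiZResult::Visible;   // in front of every pixel
    return HiZResult::Ambiguous;     // defer to per-pixel Early/Late-Z
}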

2.2.3.1.c Deferred Shading Deferred shading is a hardware-based technique where visibility is fully evaluated in a depth/Z pre-pass before proceeding to fragment shading [64, 100] (note that deferred shading is also used to refer to a separate software technique for adding lighting effects [238]).

Deferred shading aims to reduce overdrawing. Early-Z/Hi-Z eliminates only fragments that are invisible compared to previously rendered fragments; it is possible for fragments that pass Early-Z/Hi-Z to be occluded later by another fragment. To render a scene with deferred shading, the graphics pipeline is run twice. In the first run, the depth pass, geometry and rasterization are used to generate a depth buffer that contains only the depth values of visible fragments. In the second run (the rendering pass), all pipeline stages are executed/re-executed; this time, the Early/Late-Z and Hi-Z stages use the depth buffer generated in the depth pass to allow only visible fragments into the fragment shader stage. Note that with this technique any transparent surfaces must be taken into consideration, and the depth buffer should be set to allow all visible layers (e.g., the depth of the frontmost opaque surface is used to filter fragments). Deferred shading adds some redundancy by re-executing the front end of the pipeline; however, it can drastically reduce total rendering cost, especially when complex/elaborate fragment shaders are used.
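The following C++ sketch illustrates the two-pass idea over a simplified list of fragments. The single-buffer model and the GL_LESS-style convention (smaller depth = closer) are assumptions for illustration only; a real pipeline re-runs geometry and rasterization rather than iterating a stored fragment list.

#include <vector>

// A fragment destined for a given pixel; smaller depth = closer
// (assumed convention for this sketch).
struct Frag { int pixel; float depth; bool opaque; };

void renderDeferred(const std::vector<Frag>& frags, int numPixels) {
    std::vector<float> z(numPixels, 1.0f); // 1.0 = far plane
    // Depth pass: record the foremost opaque depth per pixel;
    // transparent fragments must not occlude surfaces behind them.
    for (const Frag& f : frags)
        if (f.opaque && f.depth < z[f.pixel])
            z[f.pixel] = f.depth;
    // Rendering pass: only fragments that survive the stored depths
    // reach (possibly expensive) fragment shading.
    for (const Frag& f : frags)
        if (f.depth <= z[f.pixel]) {
            // run fragment shading and raster operations for f
        }
}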


Figure 2.7: Clipping and culling are performed on primitives after vertex pro- cessing stages and before the rasterization stage.

2.2.3.1.d Primitive Culling and Clipping Face culling, view frustum culling, and clipping remove, fully or partially, invisible primitives early in the pipeline after vertex processing and before rasterization [80, 113, 145, 156, 182, 230, 250]. Figure 2.7 shows the placement of clipping & culling in the graphics pipeline.

2.2.3.1.e Face Culling Face culling removes primitives with a particular facing. The programmer can choose to enable face culling by enabling GL_CULL_FACE, and can determine which face is culled using glCullFace [123]. By default, and in most cases, back-facing primitives are culled when face culling is used. A primitive's facing is determined by its winding order, which is determined by the order in which its vertices are received. Figure 2.8 shows an example primitive, a triangle, with two possible windings. In 1 , the triangle's vertices are ordered such that the triangle is front-facing the page (counter-clockwise winding). In 2 , the winding is clockwise, and the triangle is back-facing the page.

2.2.3.1.f View Frustum Culling View frustum culling removes primitives that fall fully outside the view frustum defined by the programmer. The view frustum defines cull and clipping volumes [123] that are used to cull/clip each primitive. Figure 2.9 shows a view frustum ( 1 ) with five triangle primitives. The viewer's camera is positioned at 2 . The view frustum is defined using the near plane (i.e., Z-near) in 3 and the far plane (i.e., Z-far) in 4 . Visible primitives must fall totally or partially inside the frustum defined by the two planes. Vertex coordinates are used to determine whether a primitive falls outside or inside the view frustum.

Figure 2.9 shows five triangle primitives, P0 to P4. P0-P2 are fully or partially located within the view frustum and they are not subject to culling. P3 and P4, however, fall outside the view frustum and are subject to culling.


Figure 2.8: Vertex order determines primitive winding. The vertex order in 1 produces a counter-clockwise winding and the triangle is front-facing the page. The vertex order in 2 produces a clockwise winding and the triangle is back-facing the page.
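Winding can be computed from the signed area of the screen-space triangle. The C++ sketch below assumes a y-up coordinate system and the default OpenGL convention (counter-clockwise = front-facing, back faces culled); it is illustrative, not a hardware description.

struct Vec2 { float x, y; };

// Twice the signed area of triangle (v0, v1, v2); positive means the
// vertices wind counter-clockwise in a y-up coordinate system.
float signedArea2(const Vec2& v0, const Vec2& v1, const Vec2& v2) {
    return (v1.x - v0.x) * (v2.y - v0.y) -
           (v2.x - v0.x) * (v1.y - v0.y);
}

// With the default convention, a triangle is discarded when:
bool cullBackFace(const Vec2& v0, const Vec2& v1, const Vec2& v2) {
    return signedArea2(v0, v1, v2) <= 0.0f; // clockwise or degenerate
}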

Figure 2.9: A view frustum (a truncated pyramid bounded by the near and far planes, with the camera at its apex) containing five triangle primitives. Primitive vertex coordinates are used to cull primitives that completely fall outside the view frustum (P3 and P4).


Figure 2.10: Guard-band clipping handles primitives that cover parts of the visible screen area and, at the same time, extend outside the guard- band area (P1 in this example). Such primitives are clipped to smaller sub-primitives. Only sub-primitives that cover the visible screen area ( 2 ) are forwarded to rasterization.

2.2.3.1.g Clipping Graphics accelerators clip primitives that partially fall inside the visible screen area (e.g., P1 and P2 in Figure 2.9). Figure 2.10 shows a 2D projection of the example in Figure 2.9 and defines two areas: the visible screen area ( 3 ) and the guard-band area ( 4 ). Guard-band clipping [113, 182, 250] clips only primitives that fall inside the visible area and outside the guard-band area (i.e., P1 in Figure 2.10). Other primitives that fall, fully or partially, outside the visible screen area are either culled (P3 and P4) or passed to the rasterization stage unchanged (P0 and P2).

Unlike P1, P2 is not clipped even though it partially falls outside the visible screen area ( 3 ). This is because, eventually, the rasterization stage will use coverage testing to generate only fragments that fall inside the visible screen area. Rasterization, however, cannot be relied on to filter primitives like P1 that also fall outside the guard-band area, because rasterization is usually performed using fixed-function integer arithmetic; primitives that partially fall outside the guard-band area can introduce integer overflows, which can lead to incorrect coverage testing. Thus, guard-band clipping guarantees correctness using simple rasterization while reducing how often clipping operations, which introduce extra latency in the pipeline, must be performed.
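The resulting classification can be sketched as follows in C++; the axis-aligned bounding-box test is a simplification of the exact per-edge tests hardware performs.

struct Rect { float x0, y0, x1, y1; }; // axis-aligned, x0 < x1, y0 < y1

bool overlaps(const Rect& a, const Rect& b) {
    return a.x0 < b.x1 && a.x1 > b.x0 && a.y0 < b.y1 && a.y1 > b.y0;
}

bool contains(const Rect& outer, const Rect& inner) {
    return inner.x0 >= outer.x0 && inner.x1 <= outer.x1 &&
           inner.y0 >= outer.y0 && inner.y1 <= outer.y1;
}

enum class PrimAction { Cull, PassUnchanged, Clip };

// bbox is the primitive's screen-space bounding box.
PrimAction classify(const Rect& bbox, const Rect& screen,
                    const Rect& guardBand) {
    if (!overlaps(bbox, screen))
        return PrimAction::Cull;          // P3 and P4 in Figure 2.10
    if (contains(guardBand, bbox))
        return PrimAction::PassUnchanged; // P0 and P2: coverage testing
                                          // discards off-screen samples
    return PrimAction::Clip;              // P1: exceeds the guard band
}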

Figure 2.11: Hierarchical rasterization splits the screen into coarse tiles ( 1 ). A coarse rasterizer detects which tiles are covered by a primitive ( 1 , 2 ). A fine rasterizer iterates through fine tiles ( 2 ) and performs coverage testing on sampling locations within each fine tile in parallel ( 3 ) [188, 222].

2.2.3.2 Hierarchical Position Rasterization

Position rasterization locates sampling locations, within the visible screen-space, that are covered by primitives. Half-plane equations or barycentric coordinates are used for coverage testing. Although rasterization is parallelizable [188], testing every possible screen-space location for every primitive is inefficient, especially when handling a large number of small primitives. Thus, hierarchical position rasterization is used to reduce the amount of sampling performed on each pixel [67, 134]. In hierarchical rasterization [2, 81], visible screen-space is split into coarse tiles and fine tiles as shown in Figure 2.11. Primitive data are forwarded to coarse rasterization 1 , which locates nonempty coarse tiles. Once located 2 , a fine rasterizer is used to test all sample locations within each nonempty fine tile in parallel ( 3 ) [188, 222].


Figure 2.12: In A, both hierarchical rasterization stages are performed before Hi-Z. In B, fine rasterization is delayed until after Hi-Z by implementing a coarse rasterizer that also generates coarse depth samples [143].

In some designs, coarse rasterization is performed before Hi-Z; approximate depth data are generated during coarse rasterization and used to eliminate coarse tiles in Hi-Z. Figure 2.12 shows two possible placements of the hierarchical rasterization stages in the graphics pipeline.

2.2.3.3 Summary

Figure 2.13 shows the graphics pipeline with the optimization stages discussed in this section. Today's state-of-the-art graphics architectures implement some version of this pipeline; however, the pipeline stages in Figure 2.13 can be used to define entirely different hardware designs, as discussed in the next section.

2.2.4 Graphics Hardware Architectures

Contemporary graphics accelerators use one of two major rendering architectures: regular immediate-mode rendering (IMR) or tile-based rendering (TBR). Based on where the transition from object to image parallelism occurs, and according to Eldridge's taxonomy [65], IMR GPUs are often classified as sort-last fragment architectures [209], while TBR GPUs can be classified as tile-order sort-middle architectures. In TBR, sorting happens in the middle, just after geometry processing; IMR sort-last fragment architectures, on the other hand, sort by screen-space position after fragments are produced and just before carrying out raster operations [209]. This section discusses the differences between TBR and IMR and the advantages and limitations of tiling.


Figure 2.13: The graphics pipeline with stages used in today’s state-of-the-art graphics hardware.

2.2.4.1 Tile-Based Rendering Architectures

Immediate-mode rendering (IMR) architectures render primitives in their corresponding draw call order, i.e., the order provided by the programmer. IMR, however, can suffer from overdrawing, as every pixel location can be drawn to multiple times [6, 18]. Because the framebuffer is stored off-chip, overdrawing in IMR architectures leads to higher off-chip memory consumption. Early-Z and deferred shading can reduce, but not eliminate, overdrawing (e.g., we still overdraw when blending); Early-Z and deferred shading also need to access off-chip memory themselves. The energy cost of accessing off-chip memory is at least an order of magnitude higher than that of on-chip memory [115, 243], and it is currently infeasible, in most cases, to hold the full framebuffer on-chip.

TBR architectures [20, 25, 43, 45, 59, 100, 194, 203] reduce the cost of pixel overdrawing using tiling, a process which divides the screen-space into screen-space tiles, i.e., bins.


Figure 2.14: In this binning example, the visible screen-space is divided into four tiles. The binning process creates a bin for each tile that contains the vertex data of primitives that cover all or part of the corresponding tile. Primitives can fall into a single bin (P1 and P2) or multiple bins (P0).

A binning stage executes the vertex processing stages (Figure 2.2 2 - 6 ) to obtain primitive screen-space coordinates. The binning stage uses the screen-space coordinates to detect which tiles are covered by each primitive and adds primitives to their corresponding bins. Once binning completes, the bins contain vertex screen-space coordinates in addition to any attributes generated by the vertex processing stages. Figure 2.14 shows a binning example of a screen-space divided into four tiles/bins.

A second stage uses the vertex data in the bins to serially render each screen-space tile using an on-chip buffer, i.e., the tile buffer. For each tile, primitive data are read from the bin and the fragment processing stages are executed (Figure 2.2 7 - 9 ). Once all the primitives of a tile are rendered to the tile buffer, the tile buffer is resolved by submitting it off-chip to the framebuffer.

This way, TBR is able to contain overdrawing to on-chip memory accesses.
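A minimal C++ sketch of the binning pass follows, assuming each primitive's screen-space bounding box is available from vertex processing; the Prim and Binner types are illustrative, and real binners store compact vertex data rather than bare IDs.

#include <algorithm>
#include <cstddef>
#include <vector>

// A primitive's ID plus its screen-space bounding box.
struct Prim { int id; float minX, minY, maxX, maxY; };

struct Binner {
    int tileSz, tilesX, tilesY;
    std::vector<std::vector<int>> bins; // one primitive list per tile

    Binner(int w, int h, int tile)
        : tileSz(tile),
          tilesX((w + tile - 1) / tile),
          tilesY((h + tile - 1) / tile),
          bins(static_cast<std::size_t>(tilesX) * tilesY) {}

    // Append the primitive to every bin whose tile its bounding box
    // overlaps; e.g., P0 lands in bins 0 and 1 in Figure 2.14.
    void bin(const Prim& p) {
        int tx0 = std::max(0, static_cast<int>(p.minX) / tileSz);
        int ty0 = std::max(0, static_cast<int>(p.minY) / tileSz);
        int tx1 = std::min(tilesX - 1, static_cast<int>(p.maxX) / tileSz);
        int ty1 = std::min(tilesY - 1, static_cast<int>(p.maxY) / tileSz);
        for (int ty = ty0; ty <= ty1; ++ty)
            for (int tx = tx0; tx <= tx1; ++tx)
                bins[static_cast<std::size_t>(ty) * tilesX + tx]
                    .push_back(p.id);
    }
};

After binning, each tile's bin is rendered serially into the on-chip tile buffer and resolved off-chip exactly once.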

2.2.4.1.a TBR Performance Bottlenecks Most mobile chips today use TBR. There are, however, pathological cases that challenge TBR architectures. For example, binning is costly with geometry-rich workloads that contain a very large number of vertices and a low fragment-to-vertex ratio. TBR architectures need to fetch a large amount of vertex data for the binning stage, store binning data to off-chip memory, and then fetch vertex data again for the second rendering stage. With a large amount of vertex data, binning overhead may outweigh the benefits of eliminating overdrawing. To mitigate the risk of such pathological cases, TBR implementations utilize a set of schemes to reduce the bandwidth consumed by vertex data (e.g., using compression and compact representations). Other implementations are able to operate in either IMR or TBR modes [193, 202].

Other TBR pathological cases include binding calls and reading framebuffer pixels in the middle of a frame [26, 195]. For example, glReadPixels can be used between draw calls to read a block of pixels from the framebuffer (e.g., in screen-space ray tracing). TBR tiles have to be resolved before issuing glReadPixels. In other words, using glReadPixels, glCopyTexImage, or glTexSubImage forces a TBR accelerator to sync the framebuffer mid-frame, diminishing the savings of on-chip rendering.

TBR architectures also incur additional overhead when using multiple render targets3 (MRT) [195]. There are two possible options to handle MRT in TBR architectures. The first option is to use smaller tiles; this allows sharing the tile buffer so that multiple targets are rendered concurrently. Smaller tiles, however, increase binning overhead and require additional hardware support. The second option is to serially process render targets. This option, however, requires fetching the vertex data multiple times, once for each render target. In conclusion, MRT diminishes TBR performance regardless of the chosen technique.

3A render target is a buffer that fragment shaders may render to. Typically there is only a single render target for color values; it is possible, however, to define other render targets that contain depth or normal values for example. The additional render targets might be used to create lighting effects later on.

2.2.4.1.b Contemporary TBR Architectures Contemporary TBR architectures implement a set of optimizations that improve performance over baseline TBR. In this section, we briefly discuss two examples of state-of-the-art TBR architectures.

Qualcomm Tiling Control Qualcomm mobile GPUs support both IMR and TBR modes. Their FlexRender technology dynamically switches between IMR and TBR modes [193, 208]. Qualcomm also provides an OpenGL ES extension that allows the programmer to send binning “hints” to the driver [202]. Another extension allows the programmer to use “application tiling” [215], which enables the programmer to render to a selected portion of the framebuffer. Application tiling specifies a rectangular tile rendering area where resolves are fully controlled by the programmer.

Imagination TBDR architecture Imagination implements what is called Tile-Based Deferred Rendering, or TBDR, which combines TBR and deferred shading [100]. TBDR implements two passes in the second stage of TBR. The first pass processes the depth values of all primitives and the second pass executes the full graphics pipeline. Early depth testing in the second pass is performed against the depth values from the first pass. Similar to deferred shading in IMR, TBDR must fetch vertex data for each tile twice, once for each pass.

2.2.4.2 Immediate Tiled Rendering Architectures

Recent NVIDIA and AMD chips adopt a rendering architecture that combines aspects of IMR and TBR [9, 90, 126, 132, 146, 147]. We shall refer to this hybrid approach as immediate tiled rendering (ITR4). Similar to TBR, ITR divides the screen into a grid of tiles; however, instead of implementing a separate binning pass where all geometry is processed first, ITR splits primitives into batches, where batch sizes can be adjusted to suit the rendered content [9]. ITR then bins and caches each batch's vertex shading results on-chip before using them immediately for fragment shading [126]. ITR relies on locality between consecutive primitives [9], which will likely render to the same group of screen-space tiles, to avoid the overhead of storing and reading the entire screen-space geometry to/from off-chip memory, as is the case with TBR; thus, ITR is more efficient than TBR when processing geometry-rich scenes.

4ITR can be classified as a primitive-order sort-middle architecture [65].


Figure 2.15: An example of ITR binning using a batch of six primitives (P0-P5) and a single tile buffer. Following the primitives' order, ITR loads Tile 1 to render P0, part of P1, P2 and P5; it then loads Tile 0 and Tile 3 to continue rendering P1. Finally, Tile 2 is loaded to render the remaining primitives (P3 and P4). Note that the current tile is resolved to the framebuffer before the next tile is loaded.

ITR is a compromise [9] that accommodates energy-aware mobile graphics and geometry-rich PC graphics5. Thus, ITR allows vendors to use the same hardware architecture for mobile as well as PC platforms. As shown in the following example, ITR takes advantage of two important observations:

1. Most primitives are small and only cover a single tile.

2. There is locality between consecutive primitives.

Figure 2.15 shows an ITR example with a batch of six primitives (P0 to P5). ITR processes the first primitive in the batch, P0, which happens to cover Tile 1. ITR then loads Tile 1 to the tile buffer and renders P0. Once P0 is processed, ITR starts processing P1. However, P1 covers Tile 0 and Tile 3 in addition to Tile 1.

5For example, Unity 3D, the most popular graphics engine today, suggests that programmers should limit geometry in mobile rendering to 100K vertices, compared to a few million vertices for PC graphics [237].

Since the tile buffer currently contains Tile 1, ITR first processes the portion of P1 that covers Tile 1, adds Tile 0 and Tile 3 to a tile queue, then continues to process other primitives in the batch. For the following primitives, P2 and P5, ITR simply continues to use the tile buffer, which still contains Tile 1. Since the rest of the primitives in the batch, P3 and P4, only cover Tile 2, ITR adds Tile 2 to the tile queue and resolves Tile 1 to the framebuffer. Once Tile 1 is resolved, ITR loads the next tile in the tile queue (Tile 0) to the tile buffer and processes the primitives that cover any part of Tile 0, i.e., P1. Once P1 is rendered to Tile 0, ITR resolves Tile 0; the same process is then applied to Tile 3 and Tile 2. Finally, ITR starts working on the next batch of primitives once all the primitives in the current batch are processed. In this example, ITR operates on a single tile. Contemporary ITR GPUs, however, implement scalable architectures that can operate on multiple tiles concurrently.
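The example can be expressed as a simple tile-queue loop. The C++ sketch below is a sequential approximation of the flow in Figure 2.15; the Prim type is illustrative, per-primitive tile coverage is assumed to be known from on-chip binning, and rendering/resolve are stubbed out as prints.

#include <algorithm>
#include <cstdio>
#include <deque>
#include <vector>

struct Prim { int id; std::vector<int> tiles; }; // covered tile IDs

void renderBatchITR(const std::vector<Prim>& batch, int numTiles) {
    if (batch.empty()) return;
    std::deque<int> tileQueue;
    std::vector<bool> queued(numTiles, false);
    tileQueue.push_back(batch.front().tiles.front());
    queued[tileQueue.front()] = true;

    while (!tileQueue.empty()) {
        int cur = tileQueue.front();
        tileQueue.pop_front();
        std::printf("load tile %d into the tile buffer\n", cur);
        for (const Prim& p : batch) {
            if (std::find(p.tiles.begin(), p.tiles.end(), cur) ==
                p.tiles.end())
                continue; // p does not touch the current tile
            std::printf("  shade prim %d's coverage of tile %d\n",
                        p.id, cur);
            for (int t : p.tiles) // remember the prim's other tiles
                if (!queued[t]) { queued[t] = true; tileQueue.push_back(t); }
        }
        std::printf("resolve tile %d to the framebuffer\n", cur);
    }
}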

2.2.5 GPU Compute Architecture

This section gives an overview of contemporary GPU compute hardware architecture. The exact architecture differs across vendors; here we describe an architecture similar to NVIDIA's [1, 37], which we use as our baseline model in Emerald. As shown in Figure 2.16, GPU compute cores are organized in a set of single-instruction multiple-thread (SIMT [140, 174]) core clusters ( 1 ) and an L2 cache with an atomic operations unit (AOU) ( 2 ). An interconnection network ( 3 ) connects the GPU clusters, the AOU and the L2 cache to each other and to the rest of the system (i.e., to DRAM or to an SoC NoC). Figure 2.17 shows the organization of each SIMT cluster. Each SIMT cluster has multiple SIMT cores. SIMT cores execute shader programs for vertices and fragments by grouping them in warps (sets of 32 threads), which execute on SIMT lanes in lock-step. Branch divergence is handled by executing the threads of taken and non-taken paths sequentially; a SIMT stack is used to track which path each thread has taken. Memory accesses are handled by a memory coalescing unit, which coalesces spatially local accesses into cache-line-sized chunks. Each SIMT core has a set of caches that handle different types of accesses. Table 2.3 provides a summary of SIMT core components.
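As an illustration of coalescing, the following C++ sketch reduces a warp's per-thread addresses to the set of distinct cache lines that must be fetched; the 128-byte line size is an assumption for illustration.

#include <cstdint>
#include <set>
#include <vector>

// Reduce a warp's per-thread addresses to the distinct
// cache-line-sized chunks that must be fetched.
std::set<std::uint64_t> coalesce(const std::vector<std::uint64_t>& addrs,
                                 std::uint64_t lineSize = 128) {
    std::set<std::uint64_t> lines;
    for (std::uint64_t a : addrs)
        lines.insert(a - a % lineSize); // line-aligned base address
    return lines;
}
// 32 threads reading consecutive 4-byte words touch a single 128-byte
// line; a 128-byte stride instead touches 32 separate lines.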


Figure 2.16: The compute GPU architecture consists of a set of SIMT clusters 1 , where each cluster contains a number of SIMT cores. All clusters share an L2 cache and atomics unit 2 . The clusters and the L2 cache are connected to each other and to the rest of the system through an on-chip network 3 .

2.3 Emerald SoC Architecture

This section describes the SoC-level architecture that Emerald models. Figure 2.18 shows an overview of the gem5-Emerald SoC architecture. First, it has a CPU cluster ( 1 ) with either in-order or out-of-order cores and multiple cache levels. The GPU cluster ( 2 ) consists of multiple GPU shader cores (SCs). Each SC consists of multiple execution lanes executing multiple GPU threads in lock-step. GPU threads are organized in groups of 32 threads, i.e., warps [181]. In our baseline design, the GPU L1 caches ( 3 ) are connected to a non-coherent interconnection network, while the GPU's L2 cache ( 4 ) is coherent with the CPU caches (following a design similar to Power et al. [190]). The system network ( 5 ) is a coherent network that connects the CPU cluster, the GPU cluster, DMA devices 6 (e.g., display controllers) and the main memory 7 . It is relatively easy to change the connections and network types (gem5 provides several interchangeable network models). We use gem5's classic network models, as they provide efficient implementations for faster full-system simulation.


Figure 2.17: The GPU SIMT cluster. It consists of a number of SIMT cores, where each core has a number of SIMT compute lanes.


Figure 2.18: An overview of the gem5-Emerald hardware architecture. The SoC consists of a cluster of CPU cores 1 , another cluster for the GPU 2 , and DMA devices 6 , all connected through a network 5 to off-chip memory 7 .

Execution units (SIMT lanes): A single-instruction multiple-data (SIMD) pipeline that executes scalar threads in groups of warps (each warp contains 32 threads).
SIMT Stacks: Stacks used to manage SIMT thread branch divergence by keeping track of the active threads of each branch on the top of the stack.
Register File: A banked register file inspired by an NVIDIA design. Operand collectors are used for communication between the execution units and the register file.
Shared Memory: A per-SC banked scratchpad buffer used for intra-thread-block communication.
Caches [181]6: L1I: instruction cache. L1D: for global (GPGPU) and pixel data. L1T: read-only texture cache. L1Z: depth cache. L1C: constant & vertex cache. L2: a second-level cache coherent with the CPU L2 caches (in our baseline model, the L1 caches are non-coherent with each other or the L2 cache).
Coalescing Logic: Coalesces memory requests from SIMT lanes to the caches.

Table 2.3: SIMT core components. The exact number of lanes per core and the number and organization of caches differ between GPU vendors and architectures.

Network simulation can consume a considerable amount of time with detailed, slower network models, which also lack support for fast-forwarding (i.e., gem5 Ruby). The number of CPUs, the number of GPU SCs, the type and number of specialized accelerators, and the on-chip network connecting them are widely configurable. Additionally, cache hierarchies and coherence are also configurable.

Figure 2.19: The Emerald graphics pipeline. Stages A - O : input vertices, vertex distribution, vertex shading, primitive assembly, clipping & culling, primitive distribution, primitive setup, coarse rasterization, fine rasterization, Hierarchical-Z, tile coalescing, Early-Z, fragment shading & blending, Late-Z, and framebuffer commit.

2.4 Emerald Graphics Architecture

This section describes Emerald's graphics architecture and is organized as follows: Section 2.4.1 presents an overview of the stages of the Emerald graphics pipeline; Sections 2.4.2, 2.4.3, 2.4.4, 2.4.5 and 2.4.6 detail the architecture that implements the pipeline; Section 2.4.7 presents model validation and Section 2.4.8 discusses model limitations and future work.

2.4.1 Emerald Graphics Pipeline

Emerald builds on the NVIDIA-like GPU model in GPGPU-Sim [37] version 3.2.2. Like all contemporary GPUs, Emerald implements a unified shader model rather than employing different processor cores for different shader types. Although many mobile GPUs employ tile-based rendering [59, 194, 203], Emerald follows a hybrid design similar to that of NVIDIA's architecture [50]. This design is employed in NVIDIA's discrete and mobile GPUs [177, 178], which combine aspects of immediate-mode rendering (IMR) and tile-based rendering (TBR); AMD also uses an architecture that follows a similar approach, as discussed in Section 2.2.4.2. The remainder of this section gives an overview of the stages of the Emerald graphics pipeline (Section 2.4.1.1) and the role of each stage (Section 2.4.1.2). The details of the architecture that implements the pipeline are provided in the next section (Section 2.4.2).

2.4.1.1 Pipeline Overview

Section 2.2.2.1 describes the programmer's view of the graphics pipeline. This section, in contrast, describes Emerald's hardware pipeline stages, which adopt some of the optimization techniques described in Section 2.2.3. Figure 2.19 shows the stages of the hardware pipeline Emerald models, which employs stages found in contemporary GPU architectures [2, 13, 14, 67, 90, 91, 143, 158, 210]. A draw call initiates rendering by providing a set of input vertices ( A ). Vertices are distributed across SIMT cores in batches ( B , C ). Once vertices complete vertex shading, they proceed through primitive assembly ( D ) and clipping & culling ( E ). Primitives that pass clipping and culling are distributed to GPU clusters based on screen-space position ( F ). Each cluster performs primitive setup ( G ). Then, coarse rasterization ( H ) identifies the screen tiles each primitive covers. In fine rasterization ( I ), input attributes for individual pixels are generated (positions, colors, texture coordinates, etc.). After primitives are rasterized, and if depth testing is enabled, fragment tiles are passed to a Hierarchical-Z/stencil stage ( J ), where a low-resolution on-chip depth/stencil buffer is used to eliminate invisible fragments. Surviving fragments are assembled in tiles ( K ) and shaded by the SIMT cores.

Emerald employs programmable, or in-shader, raster operations, where depth testing and blending (stages L , M and N ) are performed as part of the shader program as opposed to using atomic units coupled to the memory access schedulers. Hardware prevents races between fragment tiles targeting the same screen-space position [90] (Section 2.4.6). Depth testing is either performed early in the shader, to eliminate dead fragments before executing the fragment shader ( L ), or at the end of the shader ( N ). End-of-shader depth testing is used with fragment shader programs that include instructions that discard fragments or modify their depth values. Finally, fragment data are written to the corresponding pixel positions in the framebuffer ( O ).

2.4.1.2 Description of Pipeline Stages

Here we describe the stages in Figure 2.19.

• Input vertices ( A ): OpenGL API state commands set the rendering state (e.g., framebuffer size and format, depth buffer type, MVP matrix, etc.) before draw call commands are issued, so each draw call command is associated with a list of vertices that will be rendered by the graphics pipeline according to the defined state.

• Vertex Distribution ( B ): The list of draw call vertices from ( A ) is split into batches and assigned to different GPU cores for parallel processing (details in Section 2.4.3).

• Vertex Shading ( C ): Each SIMT core processes its assigned batches of vertices according to the active vertex shading program, where each vertex is assigned to a core thread. Each thread fetches the corresponding vertex attributes from memory through the system caches to capture any spatial and temporal locality of vertex attributes. Once a batch of vertices is processed, all resulting attributes of screen-space vertices are written to the L2 cache, and position attributes are also forwarded to the primitive processing stages ( D , E , and F ).

• Primitive Assembly, Clipping & Culling ( D & E ): The assembly stage constructs primitives from a stream of shaded vertices according to the primitive type associated with the state of the current draw call (e.g., GL_LINE_STRIP, GL_TRIANGLES, GL_QUADS, etc.). Assembled primitives are processed by a clipping and culling stage to clip primitives to the rendering volume and cull any primitives that fall outside the rendering volume, as described in Section 2.2.3.1.

• Bounding Box Calculations & Primitive Distribution ( F ): Screen-space is divided among GPU cores, where each core is responsible for rendering fragments that fall into a number of predefined areas of the screen-space (more details are provided in Section 2.4.4). The bounding box calculation stage produces coverage data; the coverage data for each primitive is a bit mask that indicates which screen-space areas the primitive covers. According to the predefined fixed mapping of screen-space tiles to SIMT cores in SIMT clusters, coverage data are used to map each primitive to a single SIMT cluster, or to multiple SIMT clusters if the primitive touches screen-space areas assigned to more than one cluster (details in Section 2.4.4).

• Primitive Setup & Coarse Rasterization ( G & H ): Once a primitive is assigned to one or more SIMT clusters, the clusters independently carry out the remaining operations in the pipeline (i.e., stages G to N ). At each cluster, a primitive goes through a primitive setup stage before proceeding to a coarse rasterization stage. Primitive setup is a fixed-function hardware stage that calculates primitive data used for traversal and interpolation, such as edge equations and differentials. The coarse rasterization stage generates tiles of coarse position fragments; its main goal is to find which coarse tile positions the primitive covers. Each coarse tile covers a larger screen-space area than a regular (fine) fragment (i.e., each coarse tile corresponds to ≥1 screen-space pixels). Note that each SIMT cluster rasterizes only the coarse tile positions assigned to that particular cluster (i.e., for a primitive that maps to multiple SIMT clusters, each cluster only generates the coarse tiles that belong to that cluster); the rest are ignored. Once coarse tiles are generated, fine fragments are generated at the fine rasterization stage ( I ).

• Fine Rasterization ( I ): At this stage, fine fragments are generated for each coarse tile covered by a primitive. Each fragment at this stage corresponds to a single screen-space pixel (unless supersampling or multisampling [6, 234] is utilized, in which case each fragment corresponds to a screen sample). Fine fragment tiles are then passed to the Hi-Z stage along with depth info for the first stage of depth testing.

• Hierarchical-Z ( J ): Hi-Z operations use a buffer that contains a low-resolution depth buffer (i.e., the Hi-Z buffer). This buffer conservatively tracks the depth values of the full depth buffer stored in off-chip memory. The Hi-Z buffer is distributed amongst SIMT cores according to their screen-space assignment. Using the Hi-Z buffer, the Hi-Z stage kills many invisible (occluded) fragments according to their depth; only fragments that pass the Hi-Z test continue to the next stage. Fragment tiles that pass the Hi-Z stage have one or more fragments that may pass the per-fragment depth test later in the Early/Late-Z stages.

• Tile Coalescing ( K ): This stage coalesces partially covered fragment tiles to produce denser tiles for running fragment shading on the SIMT cores (i.e., grouping fragments to execute in the same thread block). Tile coalescing (TC) units create new, larger tiles (TC tiles) from multiple smaller raster tiles; additionally, TC units can split raster tiles amongst multiple TC tiles. The main goal of tile coalescing is to create denser tiles that allow higher utilization of SIMT core lanes when carrying out fragment shading for small primitives. In each SIMT cluster, a TC unit coalesces non-overlapping fragments from multiple raster tiles of different primitives that cover the same screen-space location. Once coalesced TC tiles are created, they are passed to the corresponding SIMT core for fragment shading. TC tiles are assigned to SIMT cores based on a predetermined fixed interleaved mapping (details in Section 2.4.6). The tile coalescer also ensures that tiles that cover the same screen-space location are prevented from executing concurrently.

This atomic tile execution model is described by Hakura et al. [90]. Atomic tile execution (ATE) withholds executing new TC tiles that overlap with a TC tile currently in the fragment shading stage. ATE allows fragment shaders to support in-shader depth testing and programmable blending operations [90, 176].

• Early-Z ( L ): In the Early-Z stage, depth testing is carried out in software as part of the fragment shader during the initial part of fragment processing; Early-Z kills fragments (by disabling the corresponding threads) that fail the depth test before proceeding to carry out lighting computations. As a result, work that would have been performed to shade such invisible fragments is avoided (see the sketch after this list).

• Fragment shading and blending ( M ): The fragment shading stage carries out lighting operations and executes the active fragment shader program associated with the current draw call.

• Late-Z ( N ): Late-Z is an alternative to Early-Z. If the fragment shader program modifies fragment depth values, or if the program includes instructions for killing fragments, then it is not possible to run the Early-Z stage before executing the fragment shader. In such cases, Hi-Z is disabled and Late-Z is used instead of Early-Z.

• Framebuffer Commit ( O ): Once fragments are transformed by the corresponding shader program, SIMT cores commit the resulting fragments to the framebuffer.
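To illustrate the in-shader raster operations of stages L - O , the following C++ sketch shows the per-thread flow for one fragment under a GL_LESS-style depth test; it is an illustration of the execution model, not Emerald source code, and the types are assumptions.

struct Fragment { int x, y; float depth; float color[4]; };
struct Buffers  { float* depth; float* color; int width; };

// Stand-in for the draw call's fragment shader program (stage M).
void runFragmentShader(Fragment& f) { f.color[3] = 1.0f; }

// Per-thread flow with in-shader Early-Z, assuming smaller depth =
// closer and no blending.
void shadeFragment(Fragment f, Buffers& fb) {
    int idx = f.y * fb.width + f.x;
    // Early-Z (stage L): kill the thread before any shading work.
    if (f.depth >= fb.depth[idx]) return;
    fb.depth[idx] = f.depth;
    runFragmentShader(f);       // fragment shading (stage M)
    for (int c = 0; c < 4; ++c) // framebuffer commit (stage O)
        fb.color[idx * 4 + c] = f.color[c];
}

Atomic tile execution (Section 2.4.6) is what makes the unsynchronized read-modify-write of the depth and color buffers safe in this model.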

2.4.2 Emerald GPU Architecture

Emerald’s hardware architecture is based on the architecture described in Sec- tion 2.2.5. Figure 2.20 shows the organization of each SIMT cluster in Emerald with graphics stages, which are described in more details in the following sections. Each SIMT cluster has multiple SIMT cores ( 1 ) and fixed pipeline stages ( 2 to 8 ) to implement stages G to K in Figure 2.19. In Emerald, screen-space is divided into tiles (TC tiles), where each TC tile position is assigned to a single SIMT core (details in Section 2.4.6).

Vertex distribution: A configurable number of vertex batches is sent to SIMT cores per cycle. Vertex threads are grouped in thread blocks (configurable size), which are assigned to SIMT cores using a modular hash function.
Primitive assembly, clipping & culling, bounding box calculations, primitive distribution: These stages are modeled as part of a detailed model, the VPO unit, which is described in Section 2.4.4.
Setup: A delay model with a configurable number of cycles per primitive (primitives are pipelined).
Coarse rasterization, fine rasterization, Hierarchical-Z: Throughput models with a configurable number of rasterization tiles (≥1) per cycle.
Tile coalescing: A version of the implementation described in Hakura et al. [90], which is detailed in Section 2.4.6. In this implementation, the screen-space is split between cores for fragment processing, and tile coalescing units, i.e., TC units, manage the execution of fragment tiles across cores.
Depth testing & blending: Depth traffic is passed through the Z-cache, which is connected to the GPU L2 cache and the rest of the memory hierarchy through an interconnection network. Blending works in a similar way but uses the pixel (i.e., data) cache. Note that the cache hierarchy is flexible; the above setup represents our baseline.
Vertex shading, fragment shading: An extended version of the GPGPU-Sim [37] pipeline with an extended ISA for graphics support.

Table 2.4: Summary of graphics models in Emerald.


Figure 2.20: Emerald SIMT cluster consists of a number of SIMT cores 1 , which execute vertex and fragment shader programs. The cores in a cluster share non-programmable graphics pipeline stages as shown in 2 - 8 .

Emerald provides varying levels of modeling detail for different components and stages. Table 2.4 presents a summary of Emerald models. The following sections elaborate on the details of non-trivial components and stages and how they operate.

2.4.3 Vertex Shading

This section describes how Emerald implements the vertex shading stage, from the vertex buffers provided by the programmer to the way vertices are transformed and distributed among compute clusters for fragment shading. Vertex shaders fetch vertex data from memory, transform vertices, and write the resulting vertices, i.e., their position data along with other optional output attributes, to the L2 cache (Figure 2.16 2 ). In addition, SIMT cores (Figure 2.20 1 ) also send the vertex position data generated by vertex shaders to the Vertex Processing and Operations (VPO) unit (Figure 2.20 2 ), which distributes primitives among SIMT cores as detailed in Section 2.4.4. When a draw call is invoked, vertices are assigned for vertex processing, in batches, to SIMT cores in a round-robin fashion.


Figure 2.21: An example of vertex distribution with overlapping primitives.

While batch size is configurable, a batch size that is a multiple of the warp size (e.g., 1-4× the warp size) should be used to utilize all SIMT core lanes. Batches of, sometimes overlapping, warps are used when assigning vertices to SIMT cores. The type of primitive being rendered determines the degree of overlap and, effectively, how vertices are assigned to warps [4]. Without vertex overlapping, the primitive processing stages (i.e., VPO 2 and setup 3 ) would need to consult vertices from multiple warps when primitive formats with vertex sharing are in use (e.g., GL_LINE_STRIP and GL_TRIANGLE_STRIP); thus, overlapped vertex warps allow parallel screen-space primitive processing even when using primitive formats with vertex sharing. Figure 2.21 shows an example of vertex distribution with overlapping primitives. The example shows a line strip, i.e., GL_LINE_STRIP, where every two vertices determine a line. Every vertex in this case, except the first and the last vertex in the strip, is common between two lines.

In order to process this line strip in parallel among multiple shader clusters/cores, the vertices in the line strip are divided into multiple batches. In this example, we assume that each batch has four vertices (i.e., the first batch contains v0-v3, the second batch v3-v6, etc.). As shown in Figure 2.21, vertex batches are executed in warps (for simplicity, in this example we assume each warp has four vertices) that can run on different clusters; as a result, common vertices, e.g., v3 and v6, are redundantly included and executed in different warps. The main goal here is to create warps that contain all the vertices required to process the primitives included in the warp. This setup allows warps to be processed independently by the primitive distribution stage, as discussed in Section 2.4.4 below. Emerald can be configured to use different batch sizes. In current hardware, vertex batch sizes are configured based on the primitive type in use. We studied the batching used in NVIDIA hardware (a GTX 1080 Ti) and summarize our findings in Table 2.5. The table shows batch sizes in terms of the unique vertices in each batch; more vertex sharing between primitives, e.g., in triangle strips, means fewer unique vertices are included in each warp; hence, batch sizes are not all multiples of 32 (the warp size).

Primitive type    Per-cluster batch size
Point             32
Line              32
Line strip        62
Triangle          30
Triangle strip    90
Fan               90

Table 2.5: Example vertex batch sizes used with different primitive types (on an NVIDIA GTX 1080 Ti).

2.4.4 Primitive Processing

The VPO unit (Figure 2.22), which is inspired by NVIDIA's work distribution crossbar interface [158, 201], assigns primitives to clusters for screen-space processing. Figure 2.23a shows an example of how primitives and their fragments are assigned to processing units based on screen-space position (obtained by profiling an NVIDIA GTX 1080 Ti). Each shaded square in the triangle shown in Figure 2.23a represents a tile of pixels (16×16 pixels); each individual shade in turn represents the GPU cluster that the tile is assigned to. Emerald models a configurable tile size, and a modular hash function is used to map tiles to clusters, as shown in Figure 2.23b.


Figure 2.22: The VPO unit distributes vertex coverage data to the corresponding clusters according to the screen-space assignment. SIMT cores transfer position data from processed vertex warps into one of the vertex warp buffers ( 1 ), which are used by the bounding-box calculations unit ( 2 ) to generate warp-sized primitive masks for each SIMT cluster. The primitive mask distribution unit ( 3 ) sends the masks to the corresponding clusters, which can be the same cluster ( 4 ) or other clusters ( 5 ).

To carry out primitive distribution, SIMT cores transfer position data from vertex warps into one of the vertex warp buffers (Figure 2.22 1 ). A bounding-box calculation unit ( 2 ) consumes position data from each warp in ( 1 ) and calculates the bounding box for each primitive covered by the warp. Since any vertices that are shared between primitives are duplicated between warps, there is no need to consult vertex position data from other warps that may be rendered on other SIMT clusters.


(a) An example triangle showing NVIDIA's screen-space to cluster mapping using color coding. Each shade of red represents a set of pixels that are assigned to the same GPU cluster for fragment shading.


(b) Emerald screen-space to GPU cores mapping using a modular hash function that uses cluster and core IDs.

Figure 2.23: The mapping of screen-space pixels to GPU cores

Bounding-box calculations ( 2 ) generate a warp-sized primitive mask for each SIMT cluster. For each vertex warp, a cluster's primitive mask conveys whether that particular cluster is covered (i.e., the mask bit is set to 1) or not covered (i.e., the mask bit is set to 0) by each of the primitives in the warp. The primitive mask distribution stage ( 3 ) sends each created mask to its corresponding cluster (i.e., all clusters exchange primitive coverage data). If the current cluster is the destination of a mask, the mask is committed locally to the primitive masks reorder buffer ( 4 ), i.e., the PMRB unit; otherwise, the interconnection network ( 5 ) is used to communicate primitive masks to other clusters. On the receiving end, the PMRB unit collects primitive masks from all clusters. Each primitive mask contains an ID for the first primitive covered by the mask, which allows the PMRB unit to store masks according to their draw call order. To avoid deadlocks, the vertex shader launcher limits the number of active vertex warps to the available space in the PMRB units. The PMRB unit processes masks according to their draw call order. For each mask, the PMRB unit checks each mask bit to determine if the corresponding primitive covers part of the screen-space area assigned to the current cluster; if not, the primitive is simply ignored. On the other hand, if a primitive covers part of the screen-space area assigned to the current cluster (the mask bit is set to 1), the primitive should be processed by the current cluster and the corresponding primitive ID is communicated to the setup stage (Figure 2.20 3 ). The setup stage uses primitive IDs to fetch the corresponding vertex data from the L2 cache. Following that, primitives go through stages 4 to 8 , which resemble the pipeline described in Section 2.4.1.
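The following C++ sketch illustrates the mask generation performed by the bounding-box calculation stage; the modular tile-to-cluster hash and the data layout are simplified assumptions based on the description above and Figure 2.23b.

#include <cstddef>
#include <cstdint>
#include <vector>

struct BBox { int minX, minY, maxX, maxY; }; // screen-space pixels

// Assumed modular hash, in the spirit of Figure 2.23b.
int clusterOfTile(int tx, int ty, int tilesX, int numClusters) {
    return (ty * tilesX + tx) % numClusters;
}

// One 32-bit mask per cluster for a warp of up to 32 primitives:
// bit i is set when primitive i covers a tile owned by that cluster.
std::vector<std::uint32_t> primitiveMasks(
        const std::vector<BBox>& warpPrims, int tileSz, int tilesX,
        int numClusters) {
    std::vector<std::uint32_t> masks(numClusters, 0);
    for (std::size_t i = 0; i < warpPrims.size(); ++i) {
        const BBox& b = warpPrims[i];
        for (int ty = b.minY / tileSz; ty <= b.maxY / tileSz; ++ty)
            for (int tx = b.minX / tileSz; tx <= b.maxX / tileSz; ++tx)
                masks[clusterOfTile(tx, ty, tilesX, numClusters)] |=
                    1u << i;
    }
    return masks;
}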

2.4.5 Hierarchical-Z (Hi-Z) Operations

Hierarchical-Z operations (Figure 2.20 6 and 7 ) are performed before a tile of fragments is issued to a SIMT core for fragment shading. Hi-Z is used [150, 210, 246] to filter out what would be invisible fragments before any shading calculations are performed on them, thus avoiding wasting compute resources on such fragments.

Front Depth: The depth of the foremost fragment in a tile.
Back Depth: The depth of the backmost fragment in a tile.

Table 2.6: Per-tile Hi-Z depth entry.

Like other depth operations, Hi-Z is only performed if depth testing is enabled in the current draw call. Another condition to activate this stage is that the fragment shader program does not include a path where fragments might be killed (i.e., using GLSL discard [118]) or where depth is modified (i.e., by writing to gl_FragDepth [118]). The Hi-Z stage utilizes a Hi-Z buffer (Figure 2.20 7 ), which stores a low-resolution depth buffer. Each Hi-Z depth entry represents the depth info of multiple screen-space pixel locations; Table 2.6 shows the contents of a Hi-Z buffer entry. Front and back depth values are tracked and used as shown in Algorithm 1, which describes our implementation of Hi-Z testing in Emerald. Note that since Hi-Z does not track the depth values of individual pixels, the front and back depth values may not reflect the true front and back depth values for the corresponding pixels but rather a conservative approximation (i.e., some invisible fragments may still pass to the next stage). We assume that Hi-Z keeps one depth data entry per raster tile; thus, a Hi-Z entry tracks the front and the back depth values for each raster-tile area of the screen. The front depth value is used to determine if a new fragment shall pass the depth test by being in front of any previous fragment in the tile. If this is the case, the fragment is flagged as passed (line 9). Later, in the Early-Z (or Late-Z) stage, the depth test will not be performed and the new fragment's depth will be written directly to the depth buffer, because we know this fragment will pass the test by being in front of any previous fragments. Note that the front depth value of the Hi-Z entry is updated when the new fragment depth is in front of the current Hi-Z depth (lines 10–12 & 25). If a fragment is behind the front depth but in front of the back depth of the sampled tile, the new fragment may pass the depth test. In such a case, the fragment is flagged as live but not passed (line 14). Later, in the Early-Z (or Late-Z) stage, the depth of the new fragment will be tested against the existing fragment's depth and will be committed to the framebuffer only if it passes the depth test (which will also update the depth buffer accordingly).

Algorithm 1 Hi-Z Depth Testing
// assuming Hi-Z uses a single entry of depth data per raster tile
1: [Hi-Z Front Depth] ← [Furthermost screen depth]
2: [Hi-Z Back Depth] ← [Furthermost screen depth]
3: for Raster tile t in primitive p do
4:     [CurFrontDepth] ← [Hi-Z Front Depth]
5:     [CurBackDepth] ← [Closest to screen depth]
6:     [BackDepthFlag] ← [false]
7:     for Fragment f in raster tile t do
8:         if f.depth front of [Hi-Z Front Depth] then
9:             mark f as passed fragment
10:            if f.depth front of [CurFrontDepth] then
11:                [CurFrontDepth] ← f.depth
12:            end if
13:        else if f.depth front of [Hi-Z Back Depth] then
14:            mark f as live fragment
15:        else
16:            mark f as dead fragment
17:        end if
18:        if f is live or passed then
19:            if [CurBackDepth] front of f.depth then
20:                [CurBackDepth] ← f.depth
21:                [BackDepthFlag] ← [true]
22:            end if
23:        end if
24:    end for
25:    [Hi-Z Front Depth] ← [CurFrontDepth]
26:    if t is fully covered and [BackDepthFlag] then
27:        [Hi-Z Back Depth] ← [CurBackDepth]
28:    end if
29: end for

Finally, if a new fragment fails the back depth test, this means that this fragment will be invisible; it is flagged as a dead fragment so that it will not be shaded by the fragment shader, nor will it be committed to the framebuffer. Note that we also track cases where a new raster tile's pixels fully cover the tile and the depth of the fragment in the back of the tile is in front of the existing Hi-Z entry's back depth. In this case, we update the Hi-Z entry back depth to reflect the new tile's back depth value (as shown in lines 18–23 & 26–28). Since Hi-Z stores one depth entry per raster tile, it has a much smaller size: if the depth buffer size is DB_SIZE and the tile size is N×N pixels, Hi-Z has DB_SIZE/(N×N) entries. For example, a 1024×768 depth buffer with 4×4 raster tiles requires only 1024×768/16 = 49,152 Hi-Z entries. A smaller buffer size allows the Hi-Z buffer to be stored on-chip for faster access. As raster tiles are distributed among SIMT cores according to their screen-space position, the Hi-Z buffer is likewise distributed among SIMT cores according to the mapping of raster tiles to SIMT cores.
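To make Algorithm 1 concrete, the following C++ sketch shows one possible software rendition of the Hi-Z test for a single raster tile. It is a minimal illustration rather than Emerald's actual source: the FragmentStatus and HiZEntry types are invented for this example, and we assume a depth convention where smaller values are closer to the screen (so "front of" becomes operator<).

#include <cstddef>
#include <vector>

// Assumed convention: smaller depth values are closer to the screen,
// so "front of" translates to operator<.
enum class FragmentStatus { Passed, Live, Dead };

struct HiZEntry {        // one entry per raster tile (Table 2.6)
  float front = 1.0f;    // initialized to the furthermost screen depth
  float back  = 1.0f;    // conservative back depth, also furthermost
};

// Classify the fragments of one raster tile and update its Hi-Z entry,
// following the structure of Algorithm 1 (status.size() == depth.size()).
void hizTestTile(HiZEntry& e, const std::vector<float>& depth,
                 std::vector<FragmentStatus>& status, bool fullyCovered) {
  float curFront = e.front;  // tentative new front depth
  float curBack  = 0.0f;     // starts closest to screen, moves backward
  bool  backSet  = false;
  for (std::size_t i = 0; i < depth.size(); ++i) {
    float d = depth[i];
    if (d < e.front) {             // in front of all previous fragments
      status[i] = FragmentStatus::Passed;
      if (d < curFront) curFront = d;
    } else if (d < e.back) {       // may still pass Early-Z / Late-Z
      status[i] = FragmentStatus::Live;
    } else {                       // provably hidden: never shaded
      status[i] = FragmentStatus::Dead;
      continue;
    }
    if (d > curBack) { curBack = d; backSet = true; }  // track tile back depth
  }
  e.front = curFront;
  // Tightening the back depth is only safe when the tile is fully covered.
  if (fullyCovered && backSet) e.back = curBack;
}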

2.4.6 The TC Stage

Figure 2.24: The TC unit consumes raster tiles from the pipeline ( 1 ), where a tile distributor ( 2 ) is used to assign tiles to TCEs, where they are staged ( 3 ). Each TCE works on coalescing staged tiles ( 4 ) before sending the coalesced tiles ( 5 ) to fragment shading ( 6 ).

The tile coalescing (TC) stage (Figure 2.20 8 ) assembles quads (2×2 sets) of fragments from multiple primitives before assigning them to the corresponding SIMT core for fragment shading (an example of tile coalescing is presented later in Section 2.4.6.1). By coalescing fragments from multiple primitives, TC coalescing improves SIMT core utilization when running fragment shading, especially when handling small primitives. Emerald's tile coalescing is inspired by NVIDIA's design [90]. Tile coalescing is also used to enable atomicity in pixel shading on SIMT cores. Atomic pixel execution means that, at any point in time and for a given screen-space location, only a single pixel is being shaded at a SIMT core. This allows operations like depth testing and blending to be performed by the SIMT cores, which enables additional features like programmable blending operations [176]. Each SIMT cluster contains a tile coalescing unit, where each tile coalescing unit may consist of a number of Tile Coalescing Engines (TCEs). TC tiles can be configured to be larger than (or equal in size to) raster tiles. TC tile dimensions are an integer multiple of those of raster tiles, where a raster tile is a tile used by the pipeline in Figure 2.20 in 4 to 6 . This guarantees that each raster tile will always be mapped to a single TC tile. Typically, a TC tile will cover multiple raster tiles (e.g., 4×4 raster tiles). Figure 2.24 shows the TC unit. In 1 , the TC unit receives raster tiles from the fine-rasterization stage or the Hi-Z stage (if enabled). In 2 , a TC tile distributor (TCTD) stages incoming primitive fragment tiles to one of the TCEs. The number of TCEs determines how many tiles can concurrently be staged and assembled. The TCTD processes incoming tiles according to the flow in Figure 2.25 when assigning an incoming raster tile to a TC engine, as described below:

• In 1 , TCTD uses the incoming tile's coordinates to determine which TC tile it maps to. The TCTD checks if a previous TC tile, for the same screen-space location, has not finished shading (using the TC tile execution state, Figure 2.24 7 ). If a previous tile for the same location exists, TCTD tries to stage the incoming raster tile to the corresponding TCE ( 2 ). If the corresponding TCE staging area is full ( 3 ), a TCE flush is initiated to empty the corresponding TC tile and the TCTD will retry inserting the incoming raster tile in the next cycle.


Figure 2.25: A flowchart showing TC unit handling of incoming raster tiles.

• If none of the TCEs is currently processing the corresponding TC tile, TCTD checks if there is an available (idle) TC engine ( 4 ). If this is the case, TCTD stages the incoming tile in the next available TCE. If all TCEs are busy processing other TC tiles, a TCE flush is issued to the TCE processing the oldest TC tile and the TCTD retries inserting the incoming raster tile in the next cycle ( 5 ).

This flow means that, at any given point, each TCE stages and coalesces ( 3 and 4 ) tiles that correspond to a single screen-space TC tile ( 5 ), and fragments from a particular TC tile are executed on the same SIMT core. The work performed by a TC engine is presented in Algorithm 2. Following the design in Hakura et al. [90], coalescing raster tiles into TC tiles is carried out at quad granularity. TCEs flush TC tiles to the corresponding SIMT cores ( 6 ) when the staging area is full and no new quads can be coalesced. In addition, TCEs set a maximum number of cycles without new raster tiles before flushing the current TC tile. Once fragment shading for a TC tile is performed, the resulting fragments are committed to the framebuffer. In Case Study II in Section 2.7, we study how the granularity of TC tile mapping to SIMT cores can affect performance, and we evaluate a technique (DFSL) that dynamically adjusts TC mapping granularity.

Algorithm 2 TC Engine Coalescing
// [ProcessingCycles] is initialized to 0
1: [NewQuadFlag] ← false
2: if (Staging area is not empty or TC tile is not empty) then
       // loop over staged tiles in insertion order
3:     for Raster tile t in staged tiles do
4:         for Quad q in t do
5:             if (q has pixels) and (corresponding TC quad is empty) then
6:                 Insert q pixels into the TC tile
7:                 [NewQuadFlag] ← true
8:                 Flag q done
9:             end if
               // once a tile is empty it can be used for a new raster tile next cycle
10:            if (all quads in t are done) then
11:                Flag t empty
12:            end if
13:        end for
14:    end for
15:    [ProcessingCycles] ← [ProcessingCycles] + 1
16: end if
17: if [ProcessingCycles] == [ThresholdCycles] or ((Staging area is full) and [NewQuadFlag] == false) then
18:    Flag TC tile for flushing
19:    [ProcessingCycles] ← 0
20: end if
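For illustration, the sketch below shows one way a TC engine following Algorithm 2 could merge staged raster tiles into a TC tile at quad granularity. The RasterTile/TCTile layout, the tile and quad sizes, and the function names are assumptions made for this sketch, not Emerald's actual data structures; the draw-order conflict handling falls out of visiting staged tiles in insertion order.

#include <array>
#include <cstdint>
#include <deque>
#include <optional>

// Illustrative sizes: a TC tile holding 16 quads, 4 quads per raster tile.
constexpr int QUADS_PER_TC_TILE = 16;

struct Quad { std::uint32_t pixelMask = 0; /* per-fragment payload omitted */ };

struct RasterTile {
  int  baseQuad = 0;    // first TC quad slot this raster tile maps to
  bool done = false;    // all quads merged; slot reusable next cycle
  std::array<std::optional<Quad>, 4> quads;   // empty = no covered pixels
};

struct TCTile {
  std::array<std::optional<Quad>, QUADS_PER_TC_TILE> quads;
};

// One coalescing step per Algorithm 2. Staged tiles are visited in
// insertion (draw) order, so on a quad conflict the earlier primitive
// wins and the later tile's quads wait for the next TC tile. The caller
// must ensure baseQuad + 4 <= QUADS_PER_TC_TILE.
bool coalesceStep(std::deque<RasterTile>& staged, TCTile& tc) {
  bool newQuad = false;
  for (auto& t : staged) {
    if (t.done) continue;
    bool allDone = true;
    for (int q = 0; q < 4; ++q) {
      auto& src = t.quads[q];
      if (!src) continue;                      // quad empty or already merged
      auto& dst = tc.quads[t.baseQuad + q];
      if (!dst) {                              // destination TC quad is empty
        dst = *src;
        src.reset();                           // flag q done
        newQuad = true;
      } else {
        allDone = false;                       // conflict: retry next TC tile
      }
    }
    t.done = allDone;
  }
  return newQuad;   // caller increments ProcessingCycles and decides flushes
}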

2.4.6.1 Tile Coalescing Example

Figure 2.26 shows two examples of tile coalescing operations at the TCE level. To keep the examples easy to follow, we assume that the TC tile size equals 2×1 raster tiles. In the first example, in Figure 2.26a, two neighboring raster tiles from a single primitive (P0) are staged for coalescing (raster tiles P0(0) and P0(1)). Since both raster tiles originate from the same primitive (P0), no coalescing conflict is possible (since all primitives must be convex [120, 123]). In this case, coalescing both raster

tiles is trivial and the resulting coalesced TC tile is a concatenation of both staged tiles. Figure 2.26b shows another example with the same staged tiles shown in the previous example, in addition to another raster tile (P1(0)) that originates from another primitive (P1), where P0 precedes P1 in the drawing order. Both P1(0) and P0(0) will occupy the same destination TC sub-tile (TC0). The coalescing logic will try to fit both P1(0) and P0(0) into TC0. However, since the coalescing logic works on quads of pixels, the P1(0) and P0(0) pixel quads (as shown in Figure 2.26b 1 ) are in conflict, where both tiles have pixels that occupy the upper left quad of the raster tile area. The coalescing logic resolves such a conflict by choosing quads from primitives according to their drawing order. In this example, P0 should be drawn before P1. As a result, quads from P0 are chosen over quads from P1 when a conflict arises. The resulting coalesced TC tile is shown in Figure 2.26b. Conflicting quads from primitives that are later in draw order (P1(0)) are buffered by the TCE unit so they can be coalesced into the next TC tile.

2.4.6.2 Out-of-order Primitive Rendering

In some cases, the VPO unit and the TC stage can process primitives out-of-order. For example, when depth testing is enabled and blending is disabled, primitives can be safely processed in an out-of-order fashion. However, this optimization is not exploited in the initial version of Emerald; we plan to add it in future work.

2.4.7 Model Validation

Emerald does not fully implement NVIDIA's GPU architecture; we have, however, compared our microarchitecture model against NVIDIA's Pascal architecture (found in Tegra X2). A set of microbenchmarks was used to better understand some of the hardware design details. For example, for fragment shading, we used NVIDIA's extensions (NV_shader_thread_group) to confirm that screen space is divided into 16×16 screen tiles that are statically assigned to shader cores using a complex hashing function. In Emerald, we used similar screen tiles that are pre-assigned using a modular hash that takes into account the compute cluster and shader core. For vertex

shaders, we examined how vertices are assigned to SMs on NVIDIA's hardware (using transform feedback [123]) and found they are assigned to SMs in batches (batch size varies with primitive type). We modified our model accordingly to capture the same behavior.
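As an illustration of what such a pre-assignment can look like, the sketch below maps a pixel position to a (cluster, core) pair with a simple modular hash. The tile size, screen width, and cluster/core counts are example parameters; the hash NVIDIA uses is, as noted, considerably more complex.

#include <cstdint>

// Example parameters only; they do not reproduce NVIDIA's hash or
// Emerald's validated configuration.
constexpr std::uint32_t TILE_W = 16, TILE_H = 16;  // screen tile size (pixels)
constexpr std::uint32_t SCREEN_W = 1024;           // assumed framebuffer width
constexpr std::uint32_t NUM_CLUSTERS = 4;
constexpr std::uint32_t CORES_PER_CLUSTER = 1;

struct CoreId { std::uint32_t cluster, core; };

// A simple modular hash: interleave consecutive screen tiles across
// clusters first, then across the shader cores within a cluster, so
// neighboring tiles land on different cores.
CoreId tileToCore(std::uint32_t px, std::uint32_t py) {
  std::uint32_t tx = px / TILE_W, ty = py / TILE_H;
  std::uint32_t tileId = ty * (SCREEN_W / TILE_W) + tx;
  return { tileId % NUM_CLUSTERS,
           (tileId / NUM_CLUSTERS) % CORES_PER_CLUSTER };
}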

Benchmark | Description
Colored cube | A colored cube model.
Colored square | A colored square covering the whole screen.
Textured cube | A textured cube used twice (with depth on & off).
Textured square | A textured square used twice (with depth on & off).
Sibenik | The Sibenik model from the Computer Graphics Archive [149].
Spot | The Spot model from Keenan's Repository [55].
Suzanne 1 | Suzanne model 1 [185] used in three versions (with depth on & off and with a translucent surface).
Suzanne 2 | A different version of the Suzanne model (from the Computer Graphics Archive [149]) used with a translucent surface.
Triangle | A simple triangle.

Table 2.7: Benchmarks used to compare Emerald’s model with the Tegra K1 GPU.

Finally, we profiled Emerald's GPU model against the GPU of the Tegra K1 SoC [177] with a set of 14 benchmarks7, which are described in Table 2.7. Our results (Figure 2.27a) show that pixel fill-rate (pixels per cycle) has a correlation of 76.5% with a 33% average absolute relative error (9.6% to 77%), where absolute relative error is computed as |Hardware − Simulator| / Hardware. For draw execution time (Figure 2.27b), while the results show a 32.2% average absolute relative error (5.2% to 312%), they also show a 98% correlation between Emerald and NVIDIA's hardware. High correlation indicates our model's ability to predict performance change. Ideally, we would like to reduce absolute errors as well; focusing on reducing absolute errors, however, means bringing Emerald's models closer to the exact NVIDIA architecture. Such calibration is a time-consuming process and might be especially unattainable for GPUs, as many architectural details are concealed behind high-level APIs, and, at the same time, there is little information publicly available for

7 APITrace traces for the benchmarks are available at https://github.com/gem5-graphics/corr_workloads.

the GPU microarchitecture. On the other hand, many, if not most, architecture simulation infrastructures (e.g., Table 2.14) do not compare against specific hardware and focus instead on modeling architectures with enough detail to predict the effect of architectural design changes, as predicting performance trends, not exact performance numbers, is the ultimate goal of most architectural research studies.
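For readers reproducing these comparisons, the two metrics are straightforward to compute from per-benchmark measurements; the sketch below assumes two equal-length arrays of hardware and simulator numbers and implements the error definition above together with the standard Pearson correlation.

#include <cmath>
#include <cstddef>
#include <vector>

// Mean absolute relative error: average of |hardware - simulator| / hardware.
double meanAbsRelError(const std::vector<double>& hw,
                       const std::vector<double>& sim) {
  double sum = 0.0;
  for (std::size_t i = 0; i < hw.size(); ++i)
    sum += std::fabs(hw[i] - sim[i]) / hw[i];
  return sum / hw.size();
}

// Pearson correlation between hardware and simulator measurements.
double correlation(const std::vector<double>& x, const std::vector<double>& y) {
  const std::size_t n = x.size();
  double mx = 0.0, my = 0.0;
  for (std::size_t i = 0; i < n; ++i) { mx += x[i]; my += y[i]; }
  mx /= n; my /= n;
  double sxy = 0.0, sxx = 0.0, syy = 0.0;
  for (std::size_t i = 0; i < n; ++i) {
    sxy += (x[i] - mx) * (y[i] - my);
    sxx += (x[i] - mx) * (x[i] - mx);
    syy += (y[i] - my) * (y[i] - my);
  }
  return sxy / std::sqrt(sxx * syy);
}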

2.4.8 Model Limitations and Future Work

Modern APIs specify stages for vertex shading, tessellation, geometry shading, and fragment shading [123]. Currently, Emerald supports the most commonly used stages of vertex and fragment shading. Emerald covers part of the design spectrum for graphics architectures with submodels that vary in their level of detail as follows:

• Vertex fetching: We implement vertex fetching as part of the vertex shading process. For indexed vertices, our software front end processes index buffers before starting the vertex shaders, so that vertices are explicitly stored (without indexing) in memory before vertex shading starts, effectively executing DrawElements calls as DrawArrays calls [123]. An alternative design would use an input assembler unit that is dedicated to resolving and fetching indexed vertices with the use of some relevant caching mechanism.

• Tessellation and Geometry: Currently not implemented. Most mobile applications do not use either tessellation or geometry shaders, and they remain almost exclusively used in high-end games. Supporting both shader types requires adding functional support to the library in addition to the corresponding performance modeling.

• Setup, clipping, culling and rasterization: We use configurable throughput models (as detailed in Table 2.4). These models treat all primitives the same way. Future work may include more detailed models that take into account per-primitive properties that determine the behavior of these units. Current models, however, provide sufficient detail for a large number of use-cases.

• Hi-Z: The Hi-Z buffer is modeled as an on-chip buffer with a fixed latency.

Future work may add details like modeling memory banks and a pixel-to-memory mapping that reduces access conflicts.

• Raster operations: GPU vendors use a diverse set of techniques to handle raster operations, as detailed in Section 2.2.4. Emerald users whose studies rely heavily on the raster operations model may want to consider alternative raster-op hardware architectures that can deliver an accurate evaluation for a particular optimization or technique.

• Vertex distribution: We adopted a vertex distribution scheme that is influenced by NVIDIA's design, as detailed in Section 2.4.3. The main aim of this scheme is to balance vertex shading across compute cores. We believe that this vertex distribution model is sufficient for most, if not all, use cases.

• Tile distribution: Emerald uses a modular hash function to assign tiles across clusters and cores, mapping screen-space pixel tiles to compute cores for fragment shading. Many alternative schemes are possible, which appears to be the case for some hardware vendors, as detailed in Section 2.4.7. Tile distribution optimization should be considered by the model user when studying workloads that stress tile distribution (e.g., cause significant work imbalance across compute cores).

• Compute cores: Emerald's compute core models are based on the GPGPU-Sim simulator [37], which is a widely used GPGPU research tool that provides detailed core models. GPGPU-Sim is still an active project, and Emerald can take advantage of future project updates.

Future work also includes elaborating some of the less detailed submodels and adding configuration flexibility to facilitate experimenting with a broader range of architectures. This includes adding detailed texture filtering and caching models, support for compression, adding the option to globally perform depth and blending operations near the L2 cache, and adding support for managed tile buffers.

2.5 Emerald Software Design

Figure 2.28 highlights Emerald's software architecture for the two supported modes: standalone and full-system. In the standalone mode, only the GPU model is used. In the full-system mode, the GPU operates under Android with other SoC components. In this work, and as detailed below, we chose to utilize a set of existing open-source simulators and tools like APITrace, gem5, Mesa3D, GPGPU-Sim, and Android emulation tools; this makes it easier for Emerald to stay up-to-date with the most recent software tools, as systems like Android and graphics standards are constantly changing. We use Mesa as our API interface and state handler so that future OpenGL updates can be easily supported by Emerald. Similarly, using Android emulator tools facilitates supporting future Android systems. Emerald's execution modes are described below. More details on Emerald's software architecture and implementation are provided in Appendix A.

2.5.1 Emerald Standalone Mode

Figure 2.28a shows Emerald's standalone mode. In this mode, we use APITrace [21] to record graphics traces. A trace file 1 is then played by a modified version of APITrace through gem5-emerald 2 and then to Mesa 3 . Emerald can be configured to execute frames of interest (a specific frame, a set of frames, or a set of draw calls within a frame). For frames within the specified region-of-interest, Mesa sends all necessary state data to Emerald before execution begins. We also utilized some components from gem5-gpu [190], which connects GPGPU-Sim memory ports to the gem5 interface. Finally, in 4 , our tool, TGSItoPTX, is used to generate PTX shaders that are compatible with GPGPU-Sim (we extended GPGPU-Sim's ISA to include several graphics-specific instructions). TGSItoPTX consumes Mesa3D TGSI shaders (which are compiled from higher-level GLSL) and converts them to their equivalent PTX version.

2.5.2 Emerald Full-system Mode

In the full-system mode (Figure 2.28b), Emerald runs under the Android OS. Once an Android system has booted, it uses the goldfish library [16] as a graphics driver to Emerald. We connect the goldfish library to the Android-side gem5-pipe 1 , which captures Android OpenGL calls and sends them to gem5 through pseudo-instructions. In gem5, the gem5-side graphics-pipe captures draw call packets and forwards them to the Android emulator host-side libraries 2 , which process graphics packets and convert them to OpenGL ES draw calls for Mesa 3 . Android host-side emulator libraries are also used to track the OpenGL context of each process running on Android. From Mesa 3 , Emerald follows the same steps as in Section 2.5.1. For the full-system mode, we also added support for graphics checkpointing 4 . We found this to be crucial to support full-system simulation. Booting Android on gem5 takes many hours, and checkpointing the system state, including graphics, is necessary to be able to checkpoint and resume at any point during simulation. Graphics checkpointing works by recording all draw calls sent by the system and storing them along with other gem5 checkpointing data. When a checkpoint is loaded, Emerald checkpointing restores the graphics state of all threads running on Android through Mesa's functional model.

2.6 Case Study I: Memory Organization and Scheduling on Mobile SoCs

This case study evaluates two proposals for memory scheduling [240] and organization [49] aimed at heterogeneous SoCs. The details of the memory controllers shipped by the various chip vendors are proprietary and not known to us; thus, we used these proposals as they represent some of the most relevant work in the area of SoC design. Both proposals relied on trace-based simulation to evaluate the system-wide behavior of heterogeneous SoCs. In this case study, we evaluate the heterogeneous memory controller (HMC) proposed by Chidambaram Nachiappan et al. [49] and the DASH memory

scheduler by Usui et al. [240] under execution-driven simulation using Emerald's GPU model, running under the Android OS. We note that Usui et al. [240] recognized the shortcomings of approximating memory behavior of a real system using traces.

2.6.1 Implementation

2.6.1.1 DASH Scheduler

The DASH scheduler [240] builds on the TCM scheduler [131] and the scheduler proposed by Jeong et al. [108] to define a deadline-aware scheduler. DASH aims to balance access to DRAM by classifying CPU and IP traffic into a set of priority levels that includes: (i) urgent IPs; (ii) memory non-intensive CPU applications; (iii) non-urgent IPs; and (iv) memory-intensive CPU applications. DASH further categorizes IP deadlines into short and long deadlines. IPs with a unit of work (e.g., a frame) that is extremely short (≤ 10 µs) are defined as short-deadline IPs. In our system, the target frame rate (i.e., 60 FPS, or 16 ms per frame) determines both of our IP deadlines. As a result, both the GPU and the display controller are classified as long-deadline IPs. In addition to the priority levels listed above, DASH also implements probabilistic scheduling that tries to balance servicing non-urgent IPs and memory-intensive CPU applications. With a probability P, memory-intensive applications are prioritized over non-urgent IPs, where P is updated every SwitchingUnit to balance the number of requests serviced for each traffic source.
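A compact way to encode this policy is sketched below; it reflects our reading of DASH rather than the authors' implementation, and the SourceInfo bookkeeping and function names are hypothetical.

#include <random>

// DASH priority levels, highest priority first (our reading of Usui et
// al. [240]; the types below are invented for this sketch).
enum class Priority {
  UrgentIP = 0,
  NonIntensiveCPU = 1,
  NonUrgentIP = 2,
  IntensiveCPU = 3
};

struct SourceInfo {
  bool isIP;           // IP block vs. CPU thread
  bool urgent;         // IP is behind its expected frame progress
  bool memIntensive;   // TCM-style CPU thread classification
};

Priority classify(const SourceInfo& s) {
  if (s.isIP) return s.urgent ? Priority::UrgentIP : Priority::NonUrgentIP;
  return s.memIntensive ? Priority::IntensiveCPU : Priority::NonIntensiveCPU;
}

// Probabilistic arbitration between non-urgent IPs and memory-intensive
// CPU threads: with probability p (re-tuned every SwitchingUnit), the
// intensive CPU request is serviced first.
bool intensiveCpuFirst(double p, std::mt19937& rng) {
  return std::bernoulli_distribution(p)(rng);
}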

2.6.1.1.a Clustering Bandwidth

One of the issues we faced in our implementation is the definition of the clustering bandwidth as specified by TCM [131] and used by DASH. The dilemma is that TCM-based clustering assumes a homogeneous set of cores, where their total bandwidth TotalBWusage is used to classify CPU threads into memory-intensive or memory non-intensive threads based on some clustering threshold ClusterThresh. The issue in an SoC system is how to calculate TotalBWusage. In our system, which models a contemporary mobile SoC, the number of CPU threads (i.e., CPU cores) is limited. In addition, there is significant bandwidth demand from non-CPU IPs (e.g., GPUs). Whether TotalBWusage

should include non-CPU bandwidth usage is unclear [131, 240]. We found that there are significant consequences for both choices. If non-CPU bandwidth is used to calculate TotalBWusage, it is more likely for CPU threads to be classified as memory non-intensive even if some of them are relatively memory intensive. On the other hand, using only CPU threads will likely result in CPU threads being classified as memory intensive (relative to total CPU traffic) even if they produce very little bandwidth relative to the rest of the system. This issue extends to heterogeneous CPU cores, where memory bandwidth demand vs. latency sensitivity becomes non-trivial as many SoCs deploy a heterogeneous set of CPU cores [27, 129, 178]. We think an improved clustering algorithm may extend TCM to incorporate other IPs and include other IP-specific properties besides bandwidth demand in the clustering process, such as latency sensitivity and expected QoS level; however, that is beyond the scope of this work. In this study, we evaluated DASH using both methods to calculate the clustering TotalBWusage, i.e., including or excluding non-CPU bandwidth. For other configurations, we used the settings specified in Usui et al. [240] and Kim et al. [131], which are listed in Table 2.8.

Cycle unit | CPU cycle
Scheduling Unit | 1000 cycles
Switching Unit | 500 cycles
Shuffling Interval | 800 cycles
Quantum Length | 1M cycles
Clustering Factor | 0.15
Emergent Threshold | 0.8 (0.9 for the GPU)
Display Frame Period | 16ms (60 FPS)
GPU Frame Period | 33ms (30 FPS)

Table 2.8: DASH configurations
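To make the two variants we evaluate explicit, the sketch below classifies CPU threads with a TCM-style rule while letting the caller decide whether IP bandwidth is counted in TotalBWusage (our DTB configuration) or not (DCB). The accounting shown is a simplified stand-in for TCM's per-quantum bookkeeping, the names are ours, and the default clustering factor of 0.15 comes from Table 2.8.

#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

struct Thread { double bwUsage; };   // per-quantum bandwidth of a CPU thread

// includeIPBw selects whether non-CPU (IP) bandwidth is counted in
// TotalBWusage, i.e., DTB (true) vs. DCB (false).
std::vector<bool> classifyIntensive(const std::vector<Thread>& threads,
                                    double ipBw, bool includeIPBw,
                                    double clusterFactor = 0.15) {
  double totalBw = includeIPBw ? ipBw : 0.0;
  for (const auto& t : threads) totalBw += t.bwUsage;
  const double threshold = clusterFactor * totalBw;   // ClusterThresh

  // TCM forms the non-intensive cluster from the lowest-bandwidth
  // threads until their combined bandwidth exceeds the threshold.
  std::vector<std::size_t> order(threads.size());
  std::iota(order.begin(), order.end(), 0);
  std::sort(order.begin(), order.end(), [&](std::size_t a, std::size_t b) {
    return threads[a].bwUsage < threads[b].bwUsage;
  });

  std::vector<bool> intensive(threads.size(), true);
  double acc = 0.0;
  for (std::size_t i : order) {
    acc += threads[i].bwUsage;
    if (acc > threshold) break;
    intensive[i] = false;   // low-bandwidth thread: memory non-intensive
  }
  return intensive;
}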

2.6.1.2 HMC Controller

HMC [49] proposes implementing a heterogeneous memory controller that aims to support locality/parallelism based on the traffic source. HMC defines separate memory

regions by using different DRAM channels for handling CPU and IP accesses. CPU-assigned channels use an address mapping that improves locality by assigning consecutive addresses to the same row buffer (page-striped addressing). On the other hand, IP-assigned channels use an address mapping that improves parallelism by assigning consecutive addresses across DRAM banks (cache-line-striped addressing). HMC targets improving IP memory bandwidth by exploiting sequential accesses to large buffers, which benefit from accessing multiple banks in parallel. In our study, we look at the following:

1. The performance of the system under normal and high loads.

2. Access locality behavior of CPU cores vs. IP cores.

3. DRAM access balance (between CPU and IP-assigned channels).

Baseline
Channels | 2
Cache line size | 128B
DRAM address mapping | Row[32:16]:Rank*:Bank[15:13]:Column[12:8]:Channel[7]
Scheduler | FRFCFS
Scheduler queue entries | Read entries: 32, Write entries: 64
Page policy | Open adaptive
Maximum accesses per row activation | 16

HMC
Channels | 2 (1 for each source type)
Cache line size | 128B
CPU channel address mapping | Row[32:16]:Rank*:Bank[15:13]:Column[12:8]:Channel[7]
IP channel address mapping | Row[32:16]:Column[15:11]:Rank*:Bank[10:8]:Channel[7]
Scheduler | FRFCFS
Scheduler queue entries | Read entries: 32, Write entries: 64
Page policy | Open adaptive
Maximum accesses per row activation | 16

*A single rank per channel

Table 2.9: Baseline and HMC DRAM configurations. DRAM model details are described in Hansson et al. [93].
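To make the two mappings in Table 2.9 concrete, the sketch below decodes a physical address under each one. The bit positions follow the table directly; the helper and struct are ours, and rank is omitted since the table assumes a single rank per channel.

#include <cstdint>

struct DramAddr { std::uint32_t row, bank, column, channel; };

// Extract bits [hi:lo] of a physical address.
static std::uint32_t bits(std::uint64_t a, int hi, int lo) {
  return static_cast<std::uint32_t>((a >> lo) & ((1ull << (hi - lo + 1)) - 1));
}

// Baseline / CPU-channel mapping (row-buffer locality):
// Row[32:16]:Bank[15:13]:Column[12:8]:Channel[7]
DramAddr mapCpuChannel(std::uint64_t a) {
  return { bits(a, 32, 16), bits(a, 15, 13), bits(a, 12, 8), bits(a, 7, 7) };
}

// HMC IP-channel mapping (bank-level parallelism):
// Row[32:16]:Column[15:11]:Bank[10:8]:Channel[7]
DramAddr mapIpChannel(std::uint64_t a) {
  return { bits(a, 32, 16), bits(a, 10, 8), bits(a, 15, 11), bits(a, 7, 7) };
}

With the IP-channel mapping, the bank index sits just above the channel bit, so a long sequential buffer advances through the banks of a channel every few cache lines, which is the bank-level parallelism HMC is counting on.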

2.6.2 Evaluation

We used Emerald to run Android in the full-system mode using Emerald's GPU model and gem5's CPU and display controller models. For this experiment, an earlier version of Emerald, which features a simpler pixel tile launcher and a centralized output vertex buffer and primitive distribution, was used. System configurations are listed in Table 2.10. To evaluate SoC performance, we profile an Android application that loads and displays a set of 3D models (listed in Table 2.11). We run our system under normal- and high-load configurations. We use the baseline configuration in Table 2.10 to model a regular-load scenario and use a low-frequency DRAM configuration (133 Mb/s/pin) to evaluate the system under a high-load scenario. We opted for stressing the DRAM this way as an alternative to setting up different workloads with varying degrees of complexity (which can take several weeks under full-system simulation). We compare DASH and HMC to a baseline DRAM configuration (Table 2.10) with FRFCFS scheduling and using the baseline address mapping shown in Table 2.9.

2.6.2.1 Regular-load Scenario

All configurations produce a similar frame rate under this scenario (less than 1% difference). Both the GPU and the display controller were able to generate frames at 60 FPS (the application's target frame rate). However, looking further at each stage of rendering, we notice that GPU rendering time differs. Figure 2.29 shows the normalized execution time (lower is better) of the GPU portion of the frame. Compared to the baseline, the GPU takes 19–20% longer to render a frame with DASH, and with HMC it takes almost twice as long. First, we investigate why DASH prolongs GPU execution time. We found the reason for the longer GPU execution time is that DASH prioritizes CPU requests over the GPU's while frames are being rendered. As long as the IP, i.e., the GPU, is consistently meeting the deadline, DASH either fully prioritizes CPU threads (if classified as memory non-intensive) or probabilistically prioritizes CPU threads (if classified as memory intensive) over the GPU. The final outcome from the user's perspective has not changed in this case, as the application still meets the target frame rate.

CPU
Cores | 4 ARM O3 cores [41]
Freq. | 2.0GHz
L1 | 32 kB
L2 (per core) | 1 MB

GPU
# SIMT Cores | 4 (128 CUDA Cores)
SIMT Core Freq. | 950MHz
Lanes per SIMT Core | 32 (warp size)
L1D | 16KB, 128B line, 4-way LRU
L1T | 64KB, 128B line, 4-way LRU
L1Z | 32KB, 128B line, 4-way LRU
Shared L2 | 128KB, 128B line, 8-way LRU
Output Vertex Buffer | 36KB (Max. 9K vertices)
Pixel Tile Size | 32×32
Framebuffer | 1024×768 32bit RGBA

System
OS | Android JB 4.2.2.1
DRAM [93] | 2-channel 32-bit wide LPDDR3, Data Rate: 1333Mb/s

Table 2.10: Case Study I system configurations

However, GPU execution taking longer may increase energy consumption, which would be detrimental for mobile SoCs. In addition, this scheduling policy might unintentionally hurt performance as well (Section 2.6.2.2). For HMC, we found two reasons for the slower GPU performance. The first reason is that traffic from CPU threads and the GPU was not balanced throughout the frame. When GPU rendering is taking place, CPU-side traffic reduces significantly and the CPU-assigned channels are left underutilized. Figure 2.30 presents M3-HMC memory bandwidth from each source over time. In 1 , the CPU traffic increases before starting a new frame. Once a frame has started, CPU traffic is reduced in 2 after finishing pre-drawing computations and issuing a draw call to the GPU. This continues until the end of the GPU frame at 3 . As a result, the split-DRAM channel configuration becomes problematic in such cases due to the lack of continuous traffic balance. The second factor in reduced GPU performance under HMC was the lower row-buffer locality at IP-assigned channels.

3D Workloads
# of frames | 5 (1 warm-up frame + 4 profiled frames)

Models
M1 | Chair
M2 | Cube
M3 | Mask
M4 | Triangles

Configs
BAS | Baseline configuration
DCB | DASH using CPU bandwidth clustering
DTB | DASH using system bandwidth clustering
HMC | Heterogeneous Memory Controllers
AVG | Average

Table 2.11: Case Study I workload models and configurations

The issue originates from HMC's assumption that all IP traffic will use sequential accesses when accessing DRAM. Although this is true for our display controller, we found that it is not the case for our GPU traffic (we were unable to obtain the traces used in the original HMC study [49] to compare against). As a consequence, IP-assigned channels suffer from lower row-buffer hit rates and a reduced number of bytes fetched per row-buffer activation, where the latter indicates a higher per-byte energy cost. Figure 2.31 shows the HMC DRAM page hit rate and bytes accessed per row activation compared to the baseline. On average, the row buffer hit rate decreases by 15% and the number of bytes accessed per row activation drops by around 60%.

2.6.2.2 High-load Scenario

In this scenario, we evaluate DASH and HMC under lower DRAM bandwidth to analyze system behavior under high memory loads. Figure 2.32 shows normalized execution times (lower is better). First, HMC shows similar behavior to that described earlier in Section 2.6.2.1. With a lower page hit rate and reduced locality, it takes on average 45% longer than the baseline to produce a frame.

We found DASH reduces frame rates compared to the baseline by an average of 8.9% and 9.7% for the DCB and DTB configurations, respectively. In larger models (M1 & M3), the drop in frame rates was around 12%. The time it takes to render a GPU frame increased by an average of 15.1% for M1-DCB and 15.5% for M1-DTB. Figure 2.32 also shows that simpler models (M2 & M4) experience lower slowdowns, as the DRAM still manages to provide a frame rate close to the target even with slower GPU performance. In addition to GPU performance, we also looked into the display controller performance. The application rendering process is independent of screen refreshing by the display controller, where the display controller will simply re-use the last complete frame if no new frame is provided. As a result of the high load, and unlike the low-load scenario, the display controller suffered from a below-target frame rate. Figure 2.33 shows the normalized display traffic serviced in each configuration. First, we notice that HMC outperforms BAS, DCB and DTB with the smaller models (M2 & M4). What we found is that since the GPU load is smaller in these two cases (as shown in Figure 2.32), the IP-assigned channel was available for longer periods of time to service display requests without interference from CPU threads. However, as we can see in Figure 2.32, the overall performance of the application does not improve over the BAS and DASH configurations. For larger models (M1 & M3), DCB and DTB both deliver lower display traffic and a lower overall application frame rate. We can see the reason for DASH's lower overall performance using the M1 traffic examples in Figure 2.34 and by discerning the behavior of BAS and DCB traffic. Comparing BAS CPU and GPU traffic to that of DASH, we see that DASH provides higher priority to CPU threads ( 4 ) compared to the FRFCFS baseline ( 1 ). This is because GPU timing is still meeting the set deadline and the GPU is, as a consequence, classified as non-urgent. CPU threads either have absolute priority (if memory non-intensive) or probabilistic priority (if memory intensive). By looking at the periods t1 in Figure 2.34a and t2 in Figure 2.34b, we found that GPU read request latencies are, on average, 16.2% higher in t2 than in t1, which leads to lower GPU bandwidth in 5 compared to 2 .

The prioritization of CPU requests during t2, however, does not improve overall application performance. Looking at 7 , we can see that CPU threads are almost

idle at the end of the frame, waiting for the GPU frame to finish before moving to the next job. This dependency is not captured by DASH's memory access scheduling algorithm and leads to over-prioritization of CPU threads during t2; consequently, DASH can reduce DRAM performance compared to FRFCFS. Examining display controller performance, we find DASH DTB services 85% less display bandwidth than BAS (as shown in Figure 2.33 and Figure 2.34, 3 vs. 6 ). Looking at Figure 2.34b 6 , we can see the reason for DASH's lower display performance. The display controller starts a new frame in 6 ; the frame is considered non-urgent because it just started and has not missed the expected progress yet. In addition, the CPU is consuming additional bandwidth, as discussed earlier. Both factors lead to fewer display requests being serviced early on. Eventually, because of the below-expected bandwidth, the display controller aborts the frame and retries a new frame later, where the same sequence of events reoccurs (as highlighted in 6 ).

2.6.2.3 Summary and Discussion

In this case study, we evaluated two proposals for memory organization and scheduling for heterogeneous SoCs. We aimed to test these proposals under execution-driven simulation, in contrast to the trace-based simulations originally used to evaluate them. We found some issues that trace-based simulation could not fully capture, including incorporating inter-IP dependencies, responding to system feedback, and capturing the details of IP access patterns. Traces can be created to include inter-IP dependencies. However, adding dependency information is non-trivial for complex systems that feature several IPs and multi-threaded applications. This complexity requires an adequate understanding of workloads and reliable sources of traces. GemDroid collects traces using a single-threaded software emulator and incorporates rudimentary inter-IP dependencies, which are used to define mutually exclusive execution intervals, e.g., CPU vs. other IPs. The evaluation of DASH performed by Usui et al. [240] used a mix of traces from unrelated processes "that execute independently" and "does not model feedback from missed deadlines". In contrast, we find traffic in real systems is more complex, as it is generated by inter-dependent IP blocks, sometimes working

concurrently, as shown in the examples in Figure 2.30 and Figure 2.34. We observed the display controller dropping frames when failing to meet a deadline. Thus, modeling system feedback between IP blocks appears important, at least when similar scenarios are relevant to design decisions. Another issue highlighted by our study is the impact of simulating SoCs without a detailed GPU model. HMC was designed assuming IP traffic that predominantly contains sequential accesses, which is not true for graphics workloads. A related key hurdle for academic researchers is the difficulty of obtaining real traces from existing (proprietary) IP blocks; our results suggest the value of creating detailed IP models that produce representative behavior. Detailed IP models provide the ability to evaluate intra-IP architectural changes and their system-level impact. For memory scheduling, the results demonstrated in this section motivate us to look for possible ways to improve memory scheduling in SoC systems. Here we summarize a couple of possible directions that could be pursued in future work:

• Universal QoS signaling for SoC IPs: given that SoC IPs can vary widely in their architecture, it may be unwise to shift the whole responsibility of prioritizing IP accesses to memory to the memory controller. A better way to deal with this issue is to incorporate some form of universal QoS signaling from IPs to the memory controller that conveys the expected latency or bandwidth requirements of an IP. The controllers can use such signaling as a critical metric that exposes the status of different IPs and improves scheduling decisions.

• Hierarchical scheduling: many contemporary SoC IPs are connected using a hierarchical network topology (more on that in Section 3.5.7). A hierarchical network design means a memory request from an IP shall traverse multiple network nodes before it reaches the memory controller. This multi-stage traversal provides an opportunity to add prioritization schemes upstream in network nodes, which work as early memory scheduling stages. Each network node can use QoS and IP traffic data to regulate, early on, which requests have priority to proceed to memory. Previous research has shown the potential of upstream scheduling techniques in the context of GPUs [256].

2.7 Case Study II: Dynamic Fragment Shading Load-Balancing (DFSL)

In this section, we propose and evaluate a method for dynamically load-balancing the fragment shading stage on the GPU. This is achieved by controlling the granularity of the work assigned to each GPU core. The example in Figure 2.35 shows two granularities for distributing work amongst GPU cores. The screen-space is divided into tiles, where each tile is assigned to a GPU cluster/core as described in Section 2.4.2. In Figure 2.35a, smaller tiles are used to distribute the load amongst GPU cores; assigning these smaller tiles across GPU cores improves load-balance across the cores. On the other hand, using larger tiles, as in Figure 2.35b, reduces load-balance but improves locality. Locality might boost performance by sharing data across fragments, as in texture filtering, where the same texels can be re-used by multiple fragments. Also, as we will see later, load-balance can change from one scene to another based on what is being rendered.

2.7.1 Experimental Setup

For this case study, we evaluate GPU performance using Emerald's standalone mode (Section 2.5.1). Since this case study focuses on the fragment shading stage, we only show performance results for fragment shading. We used a GPU configuration that resembles a high-end mobile GPU [177, 178] (Table 2.12). We selected a set of relatively simple 3D models frequently used in graphics research, which are listed in Table 2.13 and shown in Figure 2.36 (the figures shown were rendered with Emerald); using more complex workloads, e.g., game frames, is possible but requires much longer simulation times with limited additional benefit for most architecture tradeoff studies. For work granularity, we use a work tile (WT) to define the granularity at which work is assigned to GPU cores (as explained in Figure 2.35). A WT of size N is N×N TC tiles, where N≥1. As discussed in Section 2.4.2, the minimum work unit that can be assigned to a core is a TC tile.

# SIMT Clusters | 6 (192 CUDA Cores)
SIMT Core Freq. | 1GHz
Max Threads per core | 2048
Registers per core | 65536
Lanes per SIMT Core | 32 (warp size)
L1D | 32KB, 128B line, 8-way LRU
L1T | 48KB, 128B line, 24-way LRU
L1Z | 32KB, 128B line, 8-way LRU
Shared L2 | 2MB, 128B line, 32-way LRU
Raster tile | 4×4 pixels
TC tile size | 2×2
TC engines per cluster | 2
TC bins per engine | 4
Coarse & fine raster throughput | 1 raster tile/cycle
Hi-Z throughput | 1 raster tile/cycle
Memory | 4 channel LPDDR-3 1600Mb/s

Table 2.12: Case Study II GPU configuration

Abbrv. | Model Name | Textured? | Translucent?
W1 | Sibenik [149] | Yes | No
W2 | Spot [55] | Yes | No
W3 | Cube [149] | Yes | No
W4 | Blender's Suzanne | Yes | No
W5 | Suzanne transparent | Yes | Yes
W6 | Utah's Teapot [149] | Yes | No

Table 2.13: Case Study II workloads

2.7.2 Load-Balance vs. Locality

Figure 2.37 shows the variation in frame execution time for WT sizes 1 to 10. For WT sizes larger than 10, the GPU is more prone to load-imbalance. As we can see in Figure 2.37, frame execution time can vary from 25% in W6 to as much as 88% in W5. The WT size that achieves optimal performance varies from one workload to another; for W5, the best-performing WT size is 1, and for W2 and W4 the best-performing WT size is 5.

We looked at different factors that may contribute to the variation of execution time with WT size. First, we noticed that L2 misses/DRAM traffic are very similar across WT sizes. However, we found that L1 cache miss rates change significantly with WT size. Figure 2.38 shows execution times and L1 misses for a W1 frame vs. WT sizes. The figure shows that L1 cache locality is a significant factor in performance. Measuring correlation, we found that execution time correlates by 78% with L1 color misses, 79% with L1 depth misses and 82% with texture misses.

2.7.3 Dynamic Fragment Shading Load-Balancing (DFSL)

In this section, we evaluate our proposal for dynamic fragment shading load-balancing (DFSL). DFSL builds on temporal coherence in graphics [207], where applications exhibit minor changes between frames. DFSL exploits graphics temporal coherence in a novel way – by utilizing it to dynamically adjust work distribution across GPU cores so as to reduce rendering time. The goal of DFSL is not to achieve consistently higher frame rates but rather to lower GPU energy consumption by reducing the average rendering time per frame, assuming the GPU can be put into a low-power state between frames when it is meeting its frame rate target. DFSL works by running two phases: an evaluation phase and a run phase. Algorithm 3 shows DFSL's evaluation and run phases. For a number of frames equal to the number of possible WT sizes, the evaluation phase (lines 13–25) renders frames under each possible WT value. At the end of the evaluation phase, the WT with the best performance (WTBest at line 23) is then used in the run phase (line 27) for a set of frames, i.e., RunFrames. At the end of the run phase, DFSL starts another evaluation phase, and so on. By changing RunFrames, we can control how often WTBest is updated.

2.7.3.1 Implementation

DFSL can be implemented as part of the graphics driver, where DFSL state can be added to the other context information tracked by the driver. The DFSL procedure in Algorithm 3 only executes a few times per second (i.e., at the FPS rate). For each application, the GPU tracks the execution time for each frame and WTBest.

Algorithm 3 Find and render with best WT sizes
1: Parameters
2:     RunFrames: number of run (non-evaluation) frames
3:     MinWT: minimum WT size
4:     MaxWT: maximum WT size
5:     MAX_TIME: maximum execution time
6:
7: Initialization
8:     CurrFrame ← 0
9:     EvalFrames ← MaxWT − MinWT
10:
11: procedure DFSLRUN
12:     while there is a new frame:
13:         if CurrFrame % (EvalFrames + RunFrames) == 0 then
14:             MinExecTime ← MAX_TIME
15:             WTSize ← MinWT
16:             WTBest ← MinWT
17:         end if
18:
19:         if CurrFrame % (EvalFrames + RunFrames) < EvalFrames then
20:             ExecTime ← execution time with WTSize
21:             if ExecTime < MinExecTime then
22:                 MinExecTime ← ExecTime
23:                 WTBest ← WTSize
24:             end if
25:             WTSize ← WTSize + 1
26:         else
27:             Render frame using WTBest
28:         end if
29:         CurrFrame ← CurrFrame + 1
30: end procedure
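For concreteness, Algorithm 3 maps naturally onto a small driver-side object; the sketch below is a hypothetical rendition (class name and hooks are ours), with the driver calling nextWT() before a frame and frameDone() with the measured render time afterward.

#include <limits>

// A driver-side rendition of Algorithm 3. A real driver would keep one
// instance per context (or per draw call) and feed it frame times.
class DfslController {
public:
  DfslController(int minWT, int maxWT, int runFrames)
      : minWT_(minWT), runFrames_(runFrames), evalFrames_(maxWT - minWT) {}

  // WT size to use for the upcoming frame.
  int nextWT() {
    const int phase = frame_ % (evalFrames_ + runFrames_);
    if (phase == 0) {                    // start a new evaluation phase
      minExecTime_ = std::numeric_limits<double>::max();
      wtSize_ = wtBest_ = minWT_;
    }
    return (phase < evalFrames_) ? wtSize_ : wtBest_;
  }

  // Called with the measured render time once the frame completes.
  void frameDone(double execTime) {
    const int phase = frame_ % (evalFrames_ + runFrames_);
    if (phase < evalFrames_) {           // evaluation-phase bookkeeping
      if (execTime < minExecTime_) {
        minExecTime_ = execTime;
        wtBest_ = wtSize_;
      }
      ++wtSize_;
    }
    ++frame_;
  }

private:
  int minWT_, runFrames_, evalFrames_;
  int frame_ = 0, wtSize_ = 1, wtBest_ = 1;
  double minExecTime_ = 0.0;
};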

In our experiment, each workload used 1–2 draw calls and we tracked WTBest on a per-frame basis. DFSL can be extended to also track WTBest at the draw call level, or for a set of draw calls in more complex workloads. In Figure 2.39, DFSL performance is compared against three static configurations: MLB for maximum load-balance, using a WT size of 1; MLC for maximum locality, using a WT size of 10; and SOPT. To find SOPT, we ran all the frames across all configurations and found the best WT, on average, across all workloads. For DFSL, we used an evaluation period of 10 frames and a run period of 100 frames. The results in Figure 2.39 show that DFSL is able to speed up frame rendering by an average of 19% compared to MLB and by 7.3% compared to SOPT.

Figure 2.26: Tile coalescing examples. Pα(β) indicates a raster tile from primitive α at position β. (a) Example 1, with no conflicts: both raster tiles are merged into the same TC tile. (b) Example 2, with conflicts: due to the conflict between raster tiles P0(0) and P1(0), the tiles are merged partially in the current TC tile and the remaining P1(0) quads are merged into the following TC tile.

Figure 2.27: (a) Emerald pixel fill-rate vs. the NVIDIA Tegra K1 GPU; (b) Emerald draw execution time vs. the NVIDIA Tegra K1 GPU. In (a), Emerald has a 76.5% pixel-rate correlation with the Tegra K1 GPU with a 33% average absolute relative error. In (b), Emerald has a 98% execution time correlation with the Tegra K1 GPU and an average absolute relative error of 32.2%.

Figure 2.28: Emerald software architecture: (a) standalone mode; (b) full-system mode. Both Emerald-specific components and other tools modified for Emerald are shown. (a) shows the software blocks relevant to the standalone mode, where only the GPU model is used. (b) shows the software components that are used in the full-system mode to simulate an SoC system running under Android.

Figure 2.29: GPU execution time under a regular load normalized to the baseline (BAS). DASH, under both the DCB and DTB configurations, takes on average 19–20% longer to render a frame than the baseline. HMC takes over twice as long to render a frame compared to the baseline.

Figure 2.30: M3-HMC memory bandwidth from each source over time. IP activity varies widely over time; for example, in 1 , the CPU traffic increases before starting a new frame. Once a frame has started, CPU traffic decreases in 2 after finishing pre-drawing computations and issuing a draw call to the GPU. This continues until the end of the GPU frame at 3 .

Figure 2.31: Page hit rate and bytes accessed per row activation normalized to the baseline (BAS). DASH (DCB & DTB) has similar performance to BAS. On the other hand, HMC, on average, reduces the row buffer hit rate by 15% and the number of bytes accessed per row activation by around 60%.

Figure 2.32: Performance under the high-load scenario normalized to the baseline (BAS). HMC takes, on average, 45% longer than the baseline to render a frame. DASH takes on average 8.9% (DCB) and 9.7% (DTB) longer than the baseline to render a frame.

Figure 2.33: Number of display requests serviced relative to the baseline (BAS). Compared to the baseline, HMC services more requests in M2 & M4, while the number of requests is reduced in M1 and M3. For DASH, DCB underperforms the baseline except in M4, while DTB underperforms the baseline under all the workloads.

Figure 2.34: M1 rendering memory traffic over time: (a) M1 rendering by BAS; (b) M1 rendering by DASH DTB. DASH provides higher priority to CPU threads ( 4 ) compared to the FRFCFS baseline ( 1 ), which leads to lower GPU bandwidth in 5 compared to 2 ; consequently, in 7 , DTB CPU threads are almost idle at the end of the frame. For the display controller, the baseline ( 3 ) provides a better level of service than DTB ( 6 ).

Figure 2.35: Comparing fine (a) and coarse (b) screen-space division for fragment shading.

Figure 2.36: Case Study II workloads: (a) Cube, (b) Teapot, (c) Suzanne, (d) Spot, (e) Sibenik.

Figure 2.37: Frame execution time for WT sizes of 1–10, normalized to a WT size of 1.

Figure 2.38: Execution times and the total L1 cache misses (color, texture, and depth) for W1 across WT sizes, normalized to WT1.

Figure 2.39: The average per-frame speedup of MLB, MLC, SOPT, and DFSL, normalized to MLB.

# | Simulator | Methodology | CPU Model: Architecture | CPU Model: Multi-Cores | GPU Model: Graphics | GPU Model: GPGPU | FS Simulation
1 | Attila [62] | ED, CA | No | No | OpenGL 2.x, Direct3D9 | No | No
2 | FusionSim [257] | ED, CA | x86 | No | No | Yes | No
3 | gem5 [41] | ED, CA | Alpha, ARM, MIPS, POWER, SPARC, x86 | Yes | No | No | Linux, Android
4 | gem5-Aladdin1 [214] | ED, AP | x86 | Yes | No | No | No
5 | gem5-GCN3 [7, 89] | ED, CA | No | No | No | Yes | No
6 | gem5-gpu [190] | ED, CA | ARM, x86 | No | No | Yes2 | Linux
7 | GemDroid [49] | TD, AP | ARM | No | Yes3 | No | Android4
8 | GLTraceSim5 [212] | TD, AP | Yes | Yes | OpenGL | No | No
9 | GPGPU-Sim [37] | ED, CA | No | No | No | CUDA, OpenCL | No
10 | Macsim [127] | TD, CA | ARM, x86 | Yes | No | CUDA, OpenCL | No
11 | MARSSx86 [187, 255] | ED, CA | x86 | Yes | No | No | Linux
12 | MCMG6 [144] | ED, CA | x86 | Yes | Yes | No | Linux
13 | Multi2Sim [236] | ED, CA | x86 | Yes | No | OpenCL | No
14 | PARADE7 [52] | ED, CA | x86 | Yes | No | No | Linux
15 | Qsilver [216] | TD, CA, AP | No | No | OpenGL | No | No
16 | SimFlex [94] | ED, CA | SPARC, x86 | Yes | No | No | Linux, Solaris
17 | SimpleScalar [36] | ED, CA | Alpha, ARM, x86 | No | No | No | No
18 | Sniper [46] | TD, AP | RISC-V, x86 | Yes | No | No | No
19 | Teapot [32] | ED, CA | No | No | OpenGL ES | No | No
20 | Zesto [141] | ED, CA | x86 | Yes | No | No | No
21 | Zsim [204] | ED, CA, AP | x86 | Yes | No | No | No

AP: Approximate Model; CA: Cycle-Accurate; ED: Execution-Driven; TD: Trace-Driven
1,7 Adds accelerator models to gem5. 2 Integrates GPGPU-Sim with gem5. 3 Uses memory traces from Attila. 4 Android supported by combining software (single-thread) memory traces from the Android emulator [75] and the memory traces from software or hardware implementations. 5 Combines traces from Mesa [151] + an approximate GPU model with gem5 CPU simulation. 6 Combines traces from Attila with gem5 Linux simulation.

Table 2.14: An overview of computer architecture simulation tools used in academic research. Few tools support GPU and full-system simulation, and only gem5 is capable of simulating a full-stack mobile OS (Android). So far, no simulator is able to run a graphics workload using a combined CPU+GPU model. Finally, there have been no GPU simulators capable of executing both GPGPU and graphics workloads.

2.8 Related Work

2.8.1 System Simulation

Researchers have created several tools to simulate multi-core and heterogeneous systems. These tools, however, have focused on simulating heterogeneous CPU cores or lack support for specialized cores like GPUs and DSPs. Other work focused on CPU-GPGPU simulation [190, 236], while GemDroid provided an SoC simulation tool that combines software-model traces with the gem5 DRAM model [49, 167, 168, 253] or simply uses mathematical models [98]. Another gem5-based simulator, gem5-aladdin [214], provides a convenient way to model specialized accelerators using dynamic traces. Aladdin and Emerald can be integrated to provide a more comprehensive simulation infrastructure. Table 2.14 shows an overview of publicly available architectural research simulators used in academia. The majority of simulators model the x86 instruction set architecture (ISA) that is popular on desktop computers. Fewer simulators support ARM, the most popular ISA on mobile platforms. Also, only one simulator, gem5, is capable of running the full stack of a mobile system (Android). gem5, however, lacks a graphics model; workloads that use graphics on gem5 either ignore the graphics portion of the workload or use software rendering that severely distorts the simulation outcome [60].

2.8.2 GPU Simulators

In addition to CPU simulators, Table 2.14 shows a number of GPU simulators. Attila [62] is a popular graphics GPU simulator that models an IMR architecture with unified shaders. The Attila project has been inactive for a few years, and its custom graphics driver is limited to OpenGL 2.0 and D3D 9. Another tool, the Teapot simulator [32], has been used for graphics research using a TBR architecture and OpenGL ES [19, 33]. Teapot, however, is not publicly available. In addition, Teapot models a pipeline with an older architecture with non-unified shader cores. Qsilver [216] is an older simulation infrastructure that also uses non-unified shaders. Finally, a more recent work, GLTraceSim [212], looks at the behavior of graphics-enabled systems. GLTraceSim does not model a particular GPU architecture,

but it approximates GPU behavior using the memory traces generated by the functional model provided by Mesa3D [151]. Except for GLTraceSim, which uses Mesa [151], none of these graphics simulators support newer OpenGL versions (> 2.0) or the OpenGL ES [120] standard used in mobile systems. Besides graphics GPU simulators, other GPU simulators focus on GPGPU applications; this includes simulators like Multi2Sim [236], GCN3 [7, 89], Macsim [127] and GPGPU-Sim [37], where Emerald builds on the latter.

2.8.3 SoC Memory Scheduling

Several techniques were developed for memory scheduling in multi-core CPU systems [35, 130, 131, 142, 227]. For GPUs, the work by Jog et al. [111] focused on GPU inter-core memory scheduling for GPGPU workloads. Another body of work proposed techniques for heterogeneous systems [108, 196, 240]. Previous work on heterogeneous systems, however, relied on using a mix of independent CPU and GPU workloads, rather than using workloads that utilize both processors. In our work, we introduce an extensive infrastructure to enable the simulation of realistic heterogeneous SoC workloads.

2.8.4 Load-Balancing on GPUs

Several proposals have tackled load-balancing for GPGPU workloads. Son et al. [220] and Wang et al. [247] worked on scheduling GPGPU kernels. Other techniques use software methods to exploit GPGPU thread organization to improve locality [138, 242, 247]. DFSL, on the other hand, tries to find a solution for workload execution granularity, where the balance between locality and performance is reached by exploiting graphics temporal coherence. Other GPGPU scheduling techniques, e.g., the work by Lee et al. [137], can be adopted on top of DFSL to control the flow of execution on the GPU.

2.8.5 Temporal Coherence in Graphics

Temporal coherence has been exploited in many graphics software techniques [207]. In hardware, temporal coherence has been used to reduce rendering cost by batch rendering every two frames [31], by using visibility data to reduce redundant work [61], and by eliminating redundant tiles in tile-based rendering [95]. In this work, we present a new technique that exploits temporal coherence to balance fragment shading across GPU cores.

2.9 Conclusion and Future Work

This work introduces Emerald, a simulation infrastructure capable of simulating graphics and GPGPU applications. Emerald is integrated with gem5 and Android to provide the ability to simulate mobile SoC systems. We present two case studies that use Emerald. The first case study evaluates previous proposals for SoC memory scheduling/organization and shows that different results can be obtained when using detailed simulation. The case study highlights the importance of incorporating dependencies between components, feedback from the system, and the timing of events when evaluating SoC behavior. In the second case study, we propose a technique (DFSL) to dynamically balance the fragment shading stage across GPU cores. DFSL exploits graphics temporal coherence to dynamically update work distribution granularity and improves execution times by 7–19% over static work distribution.

For future work, in addition to improving graphics modeling (as highlighted in Section 2.4.8) and developing Emerald-compatible GPUWattch configurations for mobile GPUs, we plan to create a set of mobile SoC benchmarks for Emerald that represent essential mobile use cases running commonly used Android applications. The case studies presented in this work were useful in revealing some aspects of SoC and GPU behavior. However, in Case Study I, and to a lesser degree in Case Study II, we used simpler 3D models for our experimental evaluations. The main reason for this choice is the limitation on simulation speed, especially under full-system mode. Future work may include adding support for hardware-accelerated fast-forwarding to speed up checkpoint capturing, enabling the evaluation of 3D-rich workloads. Such workloads can uncover additional GPU and system-wide characteristics, as they feature a larger memory footprint and typically utilize multiple CPU threads and a wide variety of shading techniques.

Finally, while this chapter studies the specific behavior of the GPU IP, the next chapter discusses how IPs, like the GPU, interact with the system through NoCs. Specifically, Chapter 3 discusses some unique challenges in designing NoCs for heterogeneous systems and proposes a scheme that allows NoCs to handle time-variant imbalances between sources of network traffic.

Chapter 3

Balanced Injection of Asymmetric Network Loads Using Interleaved Source Injection

3.1 Introduction

An extensive body of research has explored NoCs for homogeneous multi-core processors [83, 84, 155, 162, 248] and NoCs for heterogeneous processors geared to a particular workload or use case [48, 165, 166, 223, 258]. This body of work focuses on fixed traffic asymmetry resulting from the communication pattern of a particular algorithm, which is typically addressed by developing specialized NoCs tailored to that algorithm. Less attention, however, has been given to NoCs connecting nodes with widely varying bandwidth and latency requirements, such as those in mobile SoCs, where an ever-increasing number of specialized accelerators (i.e., intellectual property, or IP, blocks) [96, 192] is used to achieve higher performance and energy efficiency under a wide range of workloads and use cases. These specialized IPs are connected through NoCs to memory controllers, system caches, or each other [12, 27].

[Figure: IP clusters (DSP, big/LITTLE CPU clusters, GPU, video encoder/decoder, modem, AI engine, image processor, camera, display) attach via links L0–L13 to the coherent, non-coherent, and memory networks, with a system cache and two DRAM channels.]

Figure 3.1: ArChip16 SoC NoC [102]. The NoC consists of three networks: Coherent, Non-Coherent and the main (Memory) network.

Use case             Relevant IPs
Web browsing         CPU, GPU, Modem, Display, Memory
Video streaming      CPU, GPU, Modem, Video decoder, Display, Memory
Augmented reality    CPU, GPU, Camera, Image proc., AI Engine, Memory

Table 3.1: Example mobile SoC use cases

We find that such heterogeneous SoC NoCs incur time-variant asymmetric traffic patterns, in which traffic between source-destination pairs varies over time for a given use case as well as from one application use case to another. This category of time-variant asymmetric traffic occurs in many use-case scenarios of interest. This work investigates the patterns of time-varying bandwidth demand exhibited in such modern SoCs and finds that it is possible, with small hardware changes, to efficiently support these patterns in a generic network architecture. This work proposes interleaved source injection (ISI), a technique for improving performance and energy efficiency in mobile SoCs with IPs that have time-varying bandwidth demand. ISI is a buffer-sharing and load-balancing technique that re-purposes underutilized virtual channels (VCs) by adding a group of small crossbar connections between traffic sources and VC buffers.

An example of a heterogeneous SoC is shown in Figure 3.1. The ArChip16 NoC [102] is composed of three sub-networks: one connects coherent IPs, a second connects non-coherent IPs, and a third connects DMA devices. Table 3.1 lists example use cases and the IPs involved. For web browsing, traffic is concentrated on links L1, L2, L3, L6, and L10 through L13. For video streaming, link L5 is also used. For AR, links L1, L2, and L7 through L13 are the most used. Note that the placement of IPs can vary between SoCs, depending on the latency and bandwidth requirements of each IP across use cases. Figure 3.2 shows an example of asymmetric traffic in heterogeneous SoCs. The figure plots the memory bandwidth consumed per IP block using Emerald [86] running a 3D rendering workload under Android. The traffic injected by the display, CPU, and GPU varies over time, where the CPU (A), GPU (B), and display controller (C) are active at different times. Finally, Figure 3.3 quantifies traffic asymmetry, α, under a number of use cases (α is the ratio of the peak over the average injection rate across IPs; details in Section 3.4). Perfectly symmetric traffic has α = 1; higher α values indicate more asymmetry. All the measured use cases exhibit significant asymmetry.

The NoC architecture ramifications of such traffic patterns appear to have received little attention to date. Due to traffic asymmetry, for a given use case only a portion of the NoC is used at a time, leading to significant underutilization of network resources. The proposed ISI technique improves utilization of the crossbar and VC buffers under such time-varying asymmetric traffic. As shown in Figure 3.4c, ISI adds paths allowing packets from a single IP block to be sent to multiple injection ports, thus increasing injection bandwidth. ISI exploits underutilized buffering and routing resources of VCs at the expense of adding a small crossbar. This work makes the following contributions:

• It highlights the issue of time-variant traffic asymmetry in heterogeneous SoCs;

• It analyzes the consequences of traffic asymmetry on NoCs;

• It proposes ISI, a technique that alleviates inefficiencies caused by asymmet- ric traffic.

Figure 3.2: 3D rendering on Android: memory traffic by source. Traffic injected by the display, CPU, and GPU varies over time; the CPU (A), GPU (B), and display controller (C) are active at different points in time.


Figure 3.3: NoC traffic asymmetry, α = λh/λ, for mobile workloads (details in Section 3.4). Higher values indicate higher asymmetry; α = 1 indicates symmetric traffic.
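As a concrete illustration of this metric, the short Python sketch below computes α as the peak-over-average injection rate across IPs; the function name and the sample rates are illustrative assumptions, not values taken from the dissertation's traces.

```python
# Traffic-asymmetry factor: peak injection rate over the average across IPs.
def asymmetry(rates):
    return max(rates) / (sum(rates) / len(rates))

# One hot IP among four mostly idle ones yields a high alpha:
print(asymmetry([0.9, 0.1, 0.1, 0.1]))  # 3.0
print(asymmetry([0.5, 0.5, 0.5, 0.5]))  # 1.0 (perfectly symmetric)
```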

(a) An example of asymmetric traffic with two sources (P0 and P3) injecting to ports O0 through O3, leading to crossbar underutilization.

(b) Buffer sharing schemes [171, 229] leave the crossbar underutilized under asymmetric loads.

(c) ISI tackles traffic asymmetry by interleaving injection across input ports, leading to better utilization.

Figure 3.4: Comparing ISI to baseline and buffer sharing schemes.

Criteria: scalable radix; supports multiple VCs; shared VC buffers; shared injection ports; uses minimal buffering.
Schemes compared: RoCo [128], MoDe-X [186], DSB [197], RoShaQ [235], DAMQ [229], ViChar [171], ISI (this work).

Table 3.2: ISI versus related work

The rest of this chapter is organized as follows: Section 3.2 summarizes related work, Section 3.3 provides some background, Section 3.4 presents insights into NoC behavior under asymmetric traffic, Section 3.5 describes our proposed ISI scheme, Section 3.6 presents experimental results, and we conclude in Section 3.7.

3.2 Related Work

3.2.1 Intra-IP and CPU-GPU NoCs

Earlier work studied how to connect nodes within what might be considered individual IP blocks in the context of a mobile SoC. For example, NoCs have been studied for multicore CPUs [58, 205], GPUs [38, 258], and accelerators [48, 99, 166]. More recently, NoC designs for heterogeneous systems combining CPU and GPU cores have been explored. In BiNoCHS [154], the network operates in one of two modes to optimize for latency or throughput. Lee et al. studied the behavior of NoCs when CPU and GPGPU applications run simultaneously [136]. Jang et al. [103] studied GPGPU traffic and proposed an asymmetric router design suited to the placement of GPU cores relative to memory controllers. In contrast to the above, this work addresses concerns arising when connecting a set of widely diverse IP blocks. While several works studying heterogeneous architectures assumed diverse nodes interleaved in a fine-grained manner [114, 136, 154, 254], ISI applies to contemporary SoC architectures employing heterogeneous clusters of homogeneous IPs [28, 29, 102], which may be preferable when considering chip floor planning, integrating externally developed IP blocks, and managing large design teams organized by IP block.

3.2.2 Router Architecture

A summary of relevant related work on router architecture is provided in Table 3.2 and summarized below. Kim et al. [128] proposed RoCo, which partitions X and Y traffic in 5-port routers into different VCs, allowing RoCo to reduce contention during arbitration. Park et al. [186] proposed MoDe-X, an improvement over RoCo that reduces energy and area requirements.

Ramanujam [197] proposed using distributed shared buffers (DSB) as a cost-effective way to emulate output-buffered routers, inspired by Internet packet routing [101, 191]. DSB adds middle-memory buffers and a fully-connected crossbar to distribute packets from injection ports to the middle memories. DSB tries to mitigate destination conflicts using fully connected middle buffers, which still require significant overheads (a total of 200% area and 40% power overhead for a 5×5 4-VC router with a minimum of 5 single-flit middle memories).

Tran and Baas [235] proposed RoShaQ, which eliminates VCs and replaces them with a single shared queue. This limits RoShaQ to routing protocols that do not require multiple VCs. In contrast, ISI is able to support any number of VCs and therefore satisfies protocol-dependent VC requirements [79, 92, 163, 221]. Finally, DAMQ [229] and ViChar [171] implement schemes that dynamically allocate the VC space available within an input port to incoming packets; thus, DAMQ and ViChar are orthogonal to ISI and can be implemented on top of ISI to improve the utilization of the combined VC buffer space. As shown in Section 3.6.1, when employed without ISI, DAMQ and ViChar have limited performance impact under asymmetric traffic because they fail to take advantage of idle routing resources.

Figure 3.4 gives an example of how asymmetric traffic behaves under different router designs. In Figure 3.4a, P0 is injecting traffic to O0 and O1 (1, 2, 5) and P3 is injecting traffic to O2 and O3 (3, 4, 6), while P1 and P2 sit idle. Given this generic router design, only two flits can be serviced by the crossbar each cycle (5 and 6); the VC buffers of input ports P1 and P2, and their corresponding crossbar ports, sit idle. Similarly, VC sharing schemes like ViChar and DAMQ fail to fully utilize the switching capacity, as demonstrated in Figure 3.4b.

3.3 Background

Figure 3.5 illustrates the architecture of a generic NoC router employing virtual channel flow control and wormhole switching [56]. The router has P input and P output ports. Figure 3.5 also shows the switch allocator (SA), the virtual channel allocator (VA), and the routing computation unit (RC). The RC unit is responsible for directing the header flit of an incoming packet to an appropriate output port and also dictates a valid virtual channel at the selected output port. The VA unit arbitrates amongst all packets requesting access to output virtual channels. Finally, the SA unit arbitrates amongst virtual channels requesting access to the crossbar. Once winning channels are decided, they are able to send their flits across the crossbar to the corresponding output links. Conventional routers use a classic five-stage pipeline design [56]: buffer write + route computation, VC allocation, switch allocation, switch traversal, and link traversal; simple router implementations require a single clock cycle for each stage. Low-latency routers use speculative allocation to parallelize the RC, VA, and SA stages. Further, look-ahead routing can be used to calculate the routing of the next node at the current node, enabling two-stage and single-stage routers [164]. For the remainder of this chapter, we assume the single-cycle version of the generic router in Figure 3.5 unless stated otherwise.
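As a compact summary of the pipeline variants just described, the sketch below encodes the stage count under each option; the stage names and the function are illustrative assumptions based on the text, not part of the simulator.

```python
# Stage counts for the router pipeline variants described above; the
# values are the idealized ones implied by the text, not measurements.
FIVE_STAGE = ["BW+RC",  # buffer write + route computation
              "VA",     # virtual channel allocation
              "SA",     # switch allocation
              "ST",     # switch traversal
              "LT"]     # link traversal

def pipeline_depth(speculative=False, lookahead=False):
    if speculative and lookahead:
        return 2  # look-ahead routing enables two-stage (or single-stage) routers
    if speculative:
        return 3  # speculative allocation overlaps RC, VA, and SA
    return len(FIVE_STAGE)  # classic five-stage pipeline
```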

3.4 Analyzing the Behavior of Asymmetric Traffic

This section provides a theoretical analysis that helps in understanding the impact of asymmetric traffic on network performance; a more detailed simulation-based analysis follows in Section 3.6.1. For the purpose of this analysis, we model a crossbar network with two types of injection sources, as shown in Figure 3.6a: high-intensity sources (HIS) and low-intensity sources (LIS).

Figure 3.5: Baseline router design. A P×P router is shown where each port has v virtual channels (VCs).

An injection ratio (α) represents the asymmetry in traffic between HIS and LIS nodes as follows:

$$\alpha = \frac{\lambda_h}{\lambda_l} \tag{3.1}$$

where λh and λl are the injection rates of the high- and low-intensity sources, respectively. If we use λ to represent the average injection rate across all ports, then, for a given λ, α, number of HIS nodes (Nodes_HIS), and number of LIS nodes (Nodes_LIS), we have:

$$\lambda_l = \frac{\lambda \times (Nodes_{LIS} + Nodes_{HIS})}{Nodes_{LIS} + \alpha \times Nodes_{HIS}} \tag{3.2}$$
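For later reference, Equation 3.2 is a one-line computation; the sketch below, with illustrative names, derives the per-class rates used when sweeping λ in Section 3.6.1.

```python
# Equation 3.2 as code: the low-intensity rate lambda_l from the average
# rate, the asymmetry ratio, and the node counts; lambda_h = alpha * lambda_l.
def per_class_rates(lam, alpha, n_lis, n_his):
    lam_l = lam * (n_lis + n_his) / (n_lis + alpha * n_his)
    return lam_l, alpha * lam_l  # (lambda_l, lambda_h)
```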
(a) A crossbar with two packet injection rates λl and λh and a service rate of µ.

(b) The effect of the injection ratio (α) of asymmetric traffic on the expected waiting time for a packet, E(Tw), in (a) (note: µ = 1).

Figure 3.6: The effect of traffic asymmetry on network saturation
We use Equation 3.2 later to sweep λ against latency in Section 3.6.1. To continue our theoretical analysis here, we assume an M/D/1 queue with deterministic service time and a system with Nodes_LIS = Nodes_HIS = 1, so that λ = (λh + λl)/2. For simplicity, we assume uniform destination traffic and ignore crossbar conflicts. If α = 1 and λh = λl = λ, then for a packet service rate of µ, the expected waiting time (Tw) for an arriving packet in the system of Figure 3.6a is¹:

$$E_{symm}(T_w) = \frac{\lambda}{2\mu(\mu - \lambda)}, \quad \lambda \le \mu \tag{3.3}$$

We can use Equation 3.3 to derive the weighted average expected waiting time of both queues when α ≥ 1:

$$E_{asym}(T_w) = \left(\frac{\alpha}{\alpha+1}\right)\frac{\lambda_h}{2\mu(\mu-\lambda_h)} + \left(\frac{1}{\alpha+1}\right)\frac{\lambda_l}{2\mu(\mu-\lambda_l)} \tag{3.4}$$

$$= \frac{\lambda}{(\alpha+1)\mu}\left[\frac{1}{(\alpha+1)\mu - 2\lambda} + \frac{\alpha^2}{(\alpha+1)\mu - 2\alpha\lambda}\right], \quad \lambda \le \frac{\alpha+1}{2\alpha}\mu \tag{3.5}$$

Figure 3.6b shows the latency curves for the example in Figure 3.6a using Equation 3.5 for λ = [0, µ] at multiple injection ratios (α). The value of λ required for network saturation decreases significantly, from 1 for α = 1 to 0.76 at α = 3. From Equation 3.5 we can measure the decrease in saturation throughput as injection asymmetry (α) increases:
$$\lambda_{sat(\alpha=\infty)} = \lim_{\alpha \to \infty} \frac{\alpha+1}{2\alpha}\,\lambda_{sat(\alpha=1)} = \frac{1}{2}\,\lambda_{sat(\alpha=1)} \tag{3.6}$$
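Before interpreting this limit, Equations 3.2–3.5 can be checked numerically. The short sketch below (with µ = 1, as in Figure 3.6b, and illustrative names) evaluates the weighted M/D/1 waiting time for one HIS and one LIS source; it is a sanity check, not the dissertation's evaluation code.

```python
# Numerical check of Equations 3.2-3.5 for NodesLIS = NodesHIS = 1, mu = 1.
def e_wait(lam, mu=1.0):
    # Equation 3.3: expected M/D/1 waiting time (valid for lam < mu)
    return lam / (2 * mu * (mu - lam))

def e_wait_asym(lam_avg, alpha, mu=1.0):
    # Equations 3.2 and 3.4: per-class rates, then the weighted average
    lam_l = 2 * lam_avg / (1 + alpha)
    lam_h = alpha * lam_l
    if lam_h >= mu:
        return float("inf")  # the high-intensity queue has saturated
    return (alpha * e_wait(lam_h, mu) + e_wait(lam_l, mu)) / (alpha + 1)

# Waiting times grow sharply with alpha at the same average load,
# mirroring the earlier saturation of the curves in Figure 3.6b.
for alpha in (1.0, 1.5, 2.0, 3.0):
    print(alpha, [round(e_wait_asym(lam, alpha), 2) for lam in (0.3, 0.5, 0.6)])
```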
Equation 3.6 shows that the saturation throughput is reduced by half when α = ∞. Equation 3.6 can be derived for systems with a number of ports with multiple asymmetric injection rates. In a system with M = Nodes_HIS and K = Nodes_LIS, we find that the saturation throughput is reduced to:

$$\lambda_{sat(\alpha=\infty)} = \lim_{\alpha \to \infty} \frac{\alpha M + K}{\alpha(M + K)}\,\lambda_{sat(\alpha=1)} = \frac{M}{M + K}\,\lambda_{sat(\alpha=1)} \tag{3.7}$$

Note that the change in λ_sat with α is independent of the underlying network topology and only reflects the relative values of the injection rate at the source (λ) and the corresponding port service rate (µ).

Now we analyze the saturation throughput of an interleaved injection router like the one shown in Figure 3.4c. By modifying the example in Figure 3.6, we obtain the system shown in Figure 3.7. Using the same methodology of Equation 3.6 and

¹Using the model in Dally and Towles [56] (Chapter 23).

Figure 3.7: ISI traffic in a crossbar with two packet injection rates λl and λh and a service rate of µ.

Equation 3.7, we find that for the network in Figure 3.7:

For α ≥ 1:

$$\lambda_{sat(\alpha)} = \lambda_{sat(\alpha=1)} \tag{3.8}$$

This means that, regardless of the level of traffic asymmetry in the network, the saturation throughput remains constant.

3.5 Interleaved Source Injection (ISI)
The main intuition behind ISI is demonstrated in Figure 3.4c. A basic router with asymmetric traffic, representing a likely pattern in heterogeneous SoCs, is shown in Figure 3.4a, where P0 and P3 are injecting traffic while P1 and P2 are idle. Because of this asymmetry, some VC buffers and the corresponding routing resources are underutilized: VC buffers allocated to idle ports remain empty, and so do the corresponding switching resources.

One possible solution to handle such a traffic pattern is to introduce internal router speedup, e.g., directly connecting each VC port to the switch. Such a solution, however, is expensive, since it significantly increases the area and power consumed by the switch fabric. It also reduces switching speed, as latency increases with additional ports. Moreover, it fails to utilize the unused VC buffering capacity of idle injection ports. Another body of work proposed techniques to handle intermittent variation of injection rate; however, dynamic buffering techniques like DAMQ [229] and ViChaR [171] (Figure 3.4b) are limited to handling intermittent variation of injection rate from the same source across different VCs, and they fail to utilize both the idle switching and the idle buffering capacity.

Our ISI proposal (Figure 3.4c), on the other hand, uses a simple yet effective idea: a number of carefully added small crossbars that interleave injections from multiple sources to multiple VC buffers. As shown in Figure 3.4c, ISI is thereby able to improve the utilization of VC buffers and the switching fabric by using the VC buffers available across input ports. While adding these small crossbars incurs a small additional overhead, keeping such connections local reduces the cost and improves NoC performance per area as well as power efficiency; an example of the latter is operating the router at a lower frequency while still meeting latency targets.

Previous work, like RoShaQ [235] and DSB [197], proposed some forms of buffer sharing. RoShaQ, however, does not support multiple VCs, which are necessary for many QoS schemes, routing algorithms, and coherence protocols [79, 148, 163]. DSB, on the other hand, creates a set of middle-memory buffers to improve utilization under output port conflicts by imitating output-buffered routers (OBRs). DSB aims to avoid the extensive speedups required in true OBRs by adding a second fully-connected crossbar and middle memories that work together to emulate an OBR using limited buffers and no speedup. However, such a solution is expensive to implement given that most NoC area and power is consumed by buffering (Figure 3.8). ISI, in contrast, opts to use a number of small crossbars to interleave packets across the existing VC buffers. This approach has four advantages over DSB. First, the cost of adding small crossbars is far lower than adding a fully-connected one (43% less area for a 5×5 4-VC router). Second, there is no need for additional middle memories; moreover, as we will see later, ISI can improve performance even when using a reduced number of VC buffers, which is not possible under DSB. Third, ISI requires less total buffering than DSB to maintain flit flow given buffer turnaround restrictions [56, 109]. Finally, our results show that DSB does not improve performance when asymmetric traffic is present.

The following sections describe ISI's architecture, allocation and arbitration,
order-preserving routing, placement of injection ports, and hardware cost.

Process node:       45nm            32nm            22nm
Dynamic / leakage:  66.0% / 34.0%   55.6% / 44.0%   40.6% / 59.4%

Figure 3.8: Area and power breakdown for a 5×5 router using 22nm nodes (assuming a 100% injection rate for dynamic power). VC buffers dominate, accounting for 78% of router area and 78–90% of router power.
3.5.1 ISI Architecture
ISI implements additional routing to allow buffer and port sharing; the main idea behind ISI is agnostic to the underlying topology and offers path diversity. ISI applies three primary changes to the baseline router shown in Figure 3.5:
• VC buffer sharing: ISI adds a set of multiplexers (Figure 3.9, 1) to facilitate access to a set of VCs by multiple input ports (2). As a consequence, ISI distinguishes what we call Virtual Input Ports (VIPs, 2) from Real Input Ports (RIPs, 5). Each RIP may correspond to a set of VIPs and vice versa.

• Source arbitration (4): To accommodate shared VCs, ISI modifies the VA and SA stages to allow the same VC to be shared among multiple VIPs.

• Packet ordering logic: Additional logic to maintain order-preserving routing.
Each input source (VIP) simultaneously utilizes the VCs of multiple injection ports (RIPs); as a result, it is possible for multiple flits from the same source to traverse the crossbar simultaneously. This additional flexibility provides cost-efficient performance improvements; for example, it enables a network to operate at a lower frequency under asymmetric loads while maintaining throughput.
Figure 3.9: ISI architecture. ISI adds a set of multiplexers (e.g., 1 and 6) to facilitate access to the VCs (3) of real input ports (5) by multiple virtual input ports (VIPs, 2). ISI also modifies the VA and SA stages (4) to allow the same VC to be shared among multiple VIPs.
ISI combines the effects of both speedup and concentration of traffic from multiple sources. Because the VCs at each RIP are shared by multiple VIPs, ISI can utilize a reduced number of VCs per RIP without incurring a performance penalty. For networks that implement special-purpose virtual channels, the special channels can be implemented in all or only some RIPs; the ISI architecture can adapt successfully as long as each VIP has access to each class of virtual channels through one of its attached RIPs.
3.5.2 ISI Interleaving Degree
Figure 3.9 shows a schematic for a 3-way ISI. In an N-way ISI, each RIP is shared among N IPs, except for edge RIPs (e.g., 6 is shared between two IPs instead of three). Each VIP_i injects to RIPs P = [i − 1, i + 1], where 0 ≤ i, P < N. There are two factors to consider when deciding the ISI interleaving degree: the cost of interleaving in terms of wiring, and the fact that limiting the interleaving degree simplifies source arbitration, as discussed later. It is possible to use ISI with any network agent, but since memory directories do not suffer traffic asymmetry the way other IPs do, applying ISI to them is optional; the response network may thus remain unchanged. In that case, responses from memory should use the VIP ID to route response packets to the corresponding IP.
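To make the mapping concrete, the sketch below enumerates which RIPs a given VIP can inject to; the clipping at the edges follows the description above, while the function and parameter names are illustrative assumptions.

```python
# Illustrative sketch of the VIP-to-RIP mapping described above: in an
# N-way ISI, VIP i injects to RIPs [i-1, i+1], clipped at the edges.
def rips_for_vip(i, num_ports, degree=3):
    half = degree // 2
    return [p for p in range(i - half, i + half + 1) if 0 <= p < num_ports]

# Example for the 3-way ISI of Figure 3.9: edge VIPs reach two RIPs,
# interior VIPs reach three.
print(rips_for_vip(0, 3))  # [0, 1]
print(rips_for_vip(1, 3))  # [0, 1, 2]
```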

3.5.3 Allocation and Arbitration
ISI allows the sharing of VCs across ports; consequently, the virtual channel allocators and switch allocators are modified accordingly. Since ISI can return multiple candidate virtual channels at different ports, a two-stage virtual channel allocator is needed (Figure 3.10a).

ISI switch allocation, however, differs from the baseline. Assuming deterministic routing, the original routing function returns a number of virtual channels that correspond to the same output port. For each input port, the first arbitration stage in Figure 3.10a arbitrates between V possible virtual channels, where V corresponds to the number of virtual channels at each output port. For ISI, however, V corresponds to all VCs that the corresponding VIP can inject to; the VCs being bid on change from round to round, interleaving between the VCs of different RIPs.

Figure 3.10b shows a standard 2-stage non-speculative switch allocator. The first stage arbitrates among input virtual channels, where only a single winner emerges per input port. The second stage selects only one winning input port (and its corresponding virtual channel) per output port. ISI, however, needs an additional allocation stage for the ISI multiplexers: because multiple input ports can be allocated VCs on the same RIP, a third allocation stage is needed to determine which VIPs can access the available RIPs each cycle. Effectively, ISI's first two switch arbitration stages work the same way as the baseline arbiter in Figure 3.10b. The third stage uses Ri I:1 arbiters, where Ri is the number of interleaving multiplexers and I is the interleaving degree (edge RIPs only require (I−1):1 arbiters, but for simplicity we assume that all arbiters require the same resources). The new stage allows only a single winning request for each RIP amongst multiple possible bidding VIPs.
(a) Baseline virtual channel allocator. (b) Baseline switch allocator. (c) ISI switch allocator.

Figure 3.10: ISI VC and switch allocation for a P×P router.

However, we realized that performing RIP arbitration at the end is suboptimal. An improved scheme is shown in Figure 3.10c, which places RIP arbitration at the beginning instead of the end. This re-ordering reduces the chances of unsuccessful bids. For example, imagine an input port with two pending requests at two different
Figure 3.11: Virtual channel state (VCS) and VIP Packet Order Tracking (VPOT) tables. The VCS table (1) has fields for GS (global state), VIP (the input port, which for ISI is the VIP ID), and CR (credit); ISI adds an extra packet number (PN) field, which tracks packet order. The VPOT table (2) tracks the number of packets heading from each VIP to each output port, with an OPC (output port count) entry per output port for each VIP.
virtual channels, say requests R1→1 and R1→2.² With the first proposed arbitration order, only one winner can arise from the two requests after the first two arbitration stages shown in Figure 3.10b. Even if, say, R1→1 wins the two arbitration stages, it may still lose to another request arriving at the same RIP through another VIP (say R2→1). This scenario produces only one successful bid. Using the allocation order in Figure 3.10c, however, the eventual winner of the RIP is determined first. This time R1→2 may proceed instead of R1→1, and if R1→2 is injecting to a different RIP than the one requested by R2→1, then two winning bids emerge instead of a single one. Once a VIP has acquired a VC, the VIP uses the assigned RIP to push packet flits, where flow credits are assigned to RIPs. The same process occurs for every new packet, with the VIP starting a new round of RIP bidding.
3.5.4 Order-preserving Packet Routing
Interleaved injection can transmit packets out of order; many possible NoC clients, however, rely on in-order packet processing. ISI can be extended to support in-order transactions by adding a small packet-order tracking table (VPOT) and some packet tagging. Figure 3.11 (1) shows a virtual channel state (VCS) table [56], with fields for GS (global state), IP (input port, which for ISI is the VIP ID), and CR (credit). ISI adds an extra field (PN) for the packet order number, which defines the order of a packet injected from a VIP to a destination output port (OP). Another table, the VPOT (VIP Packet Order Tracking) table (2), tracks the number of packets heading from each VIP (VIPi) to each OP (OPj), with an OPC (output port count) entry per output port for each VIP. The VCS table holds the order of each packet relative to other packets heading from the same VIP to the same OP.

Order-preserving routing works as follows. First, the OPC entries in the VPOT table are initialized to 0. When a new packet arrives from port VIPi through virtual channel VCk heading to output port OPj, then VCS[VCk][PN] and VPOT[VIPi][OPCj] are set to VPOT[VIPi][OPCj] + 1; this value is the order of this packet relative to all packets arriving from the same VIP and heading to the same output port. When subsequent packets arrive, and to enforce packet order, the SA allocator prioritizes VCs with lower PN values. This ensures that the flow from each VIP to any OP is deterministic.
When the tail flit of a packet exits VCk, the VPOT table is updated by decreasing VPOT[VIPi][OPCj] by 1, and all PN values in the VCS table with VIPi are decreased by 1. Note that if the VA assigns a new packet from a VIP and, at the same time, a tail flit of an older packet from the same VIP exits, the value of VPOT[VIPi][OPCj] and the PN values in the VCS remain unchanged. This scheme can guarantee in-order transactions on all packets or on a subset of packets that require in-order handling. Note that ISI order-preserving routing does not degrade performance, because it only enforces ordering on packets heading from the same source to the same destination in the same cycle, where only one packet can move forward regardless of the order.

²Using the notation R_{inport→outport}.
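To make this bookkeeping concrete, the sketch below implements the VPOT/VCS updates described above in Python. The class and method names are illustrative assumptions, and the simultaneous assign/exit corner case noted above is omitted for brevity; this is a behavioral sketch, not the hardware design.

```python
# Behavioral sketch of ISI's order-preserving tables (Section 3.5.4).
class OrderTracker:
    def __init__(self, num_vips, num_outports, num_vcs):
        # VPOT[vip][op]: packets in flight from VIP vip to output port op
        self.vpot = [[0] * num_outports for _ in range(num_vips)]
        # Per-VC state: owning VIP, destination output port, packet number
        self.vcs = [{"vip": None, "op": None, "pn": 0} for _ in range(num_vcs)]

    def packet_arrived(self, vip, vc, op):
        # A new packet's order is one past the packets already in flight
        # from the same VIP to the same output port.
        self.vpot[vip][op] += 1
        self.vcs[vc] = {"vip": vip, "op": op, "pn": self.vpot[vip][op]}

    def tail_flit_departed(self, vc):
        # The oldest packet left: decrement the in-flight count and shift
        # the packet numbers of this VIP's remaining packets down by one.
        done = self.vcs[vc]
        self.vpot[done["vip"]][done["op"]] -= 1
        for entry in self.vcs:
            if entry["vip"] == done["vip"] and entry["pn"] > 0:
                entry["pn"] -= 1
        self.vcs[vc] = {"vip": None, "op": None, "pn": 0}

    def next_vc(self, candidate_vcs):
        # The switch allocator prioritizes the occupied VC holding the
        # lowest (oldest) packet number, enforcing per source-destination order.
        return min(candidate_vcs, key=lambda v: self.vcs[v]["pn"])
```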
3.5.5 Placement of Injection Ports
An aspect of ISI is determining how IP injection ports are interleaved when connecting them to virtual input ports. Traffic sources vary in their demand across use cases and during a particular use case. A careful mapping of IP injection ports to virtual ports ensures IPs are less likely to place high demand on shared RIPs. To achieve this, use-case traffic can be analyzed and the results used to place VIPs. We show a greedy algorithm (Algorithm 4) that starts by placing the IP with the least bandwidth demand on the edge of the crossbar; from there, the IP to occupy the next VIP is picked by calculating the traffic overlap between the last placed IP and the IPs pending placement. The traffic overlap between two IPs, TO(IP1, IP2), can be estimated by calculating the number of time slots in which both IPs are injecting packets. It is worth noting that there are sometimes physical constraints on the placement of IPs and ports in real SoCs; as a result, we may only be able to apply Algorithm 4 in some, but not all, cases.

Algorithm 4 Mapping VIPs to RIPs
 1: Parameters:
 2:   IPlist ← list of SoC IPs to be placed on the network
 3: procedure PLACEIPSTOINJECTIONPORTS
 4:   InjectionPort_0 ← IP_minBW
 5:   IP_prev ← IP_minBW
 6:   Remove IP_minBW from IPlist
 7:   CurrPort ← 1
 8:   while there is an unplaced IP in IPlist do
 9:     TO_min ← value_max
10:     for all IP in IPlist do
11:       if TO(IP_prev, IP) ≤ TO_min then
12:         TO_min ← TO(IP_prev, IP)
13:         IP_done ← IP
14:       end if
15:     end for
16:     InjectionPort_CurrPort ← IP_done
17:     Remove IP_done from IPlist
18:     IP_prev ← IP_done
19:     CurrPort ← CurrPort + 1
20:   end while
21: end procedure
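A compact Python rendering of this greedy placement may make the procedure easier to follow; the function and parameter names are illustrative assumptions, and traffic_overlap stands in for the TO(IP1, IP2) estimate described above.

```python
# Sketch of Algorithm 4 (greedy mapping of IPs to injection ports).
# bandwidth(ip) returns an IP's bandwidth demand; traffic_overlap(a, b)
# returns the number of time slots in which both IPs inject packets.
def place_ips(ip_list, bandwidth, traffic_overlap):
    ips = list(ip_list)
    # Start with the IP with the least bandwidth demand at the edge port.
    prev = min(ips, key=bandwidth)
    ips.remove(prev)
    placement = [prev]
    while ips:
        # Greedily pick the pending IP whose traffic overlaps least with
        # the previously placed IP, so that neighboring VIPs are unlikely
        # to place high demand on their shared RIPs at the same time.
        nxt = min(ips, key=lambda ip: traffic_overlap(prev, ip))
        ips.remove(nxt)
        placement.append(nxt)
        prev = nxt
    return placement  # placement[i] is the IP mapped to injection port i
```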

3.5.6 Quality-of-Service (QoS)
Sharing VCs among multiple sources may lead to a degradation in QoS for some latency-sensitive IPs. However, as we will see in Section 3.6.3, the QoS experienced by IPs changes only slightly. Critical IPs can avoid this degradation if they are given higher priority when being selected by the source arbiter (4), following the same mechanism used for order-preserving routing. In addition, ISI is compatible with SoC NoCs that use VCs to define different QoS levels.

3.5.7 ISI and NoC Topologies
SoC NoCs today use a hierarchy of crossbars [28, 102]. ISI can be applied to local crossbars as well as to the main system crossbar (i.e., the Memory Network in Figure 3.1), which combines traffic from all system IPs. It is also possible to use ISI on 2D topologies like meshes [29]. There are two approaches to applying ISI to such topologies:
• Inter-port ISI: This approach uses the baseline design of ISI as shown in Figure 3.9.

• Inter-node ISI: This approach implements interleaving across the ports of neighboring routers.
The challenge in implementing inter-node ISI, however, is the potential cost of extra wiring across the chip. SoC NoC designers, given the practical considerations of irregularly sized IP blocks [173], may choose to pipeline flits coming from IPs so that NoC routers can be placed close to each other [102], reducing across-chip wiring. In addition, the compact layout of routers facilitates fast communication of flow-control signals between adjacent routers, which would otherwise require additional cycles to travel larger distances across the chip.
3.5.8 Implementation Cost
We used DSENT [228] (configuration in Table 3.6) to estimate the area and power cost of ISI designs. Dynamic power is measured using the full injection rate (100%) to represent the worst-case scenario for ISI overhead. Arbiters consume minimal area and power (1.2–1.7%); nevertheless, we conservatively assume that ISI arbiters consume twice the area and power of the baseline router's arbiters. Figure 3.12 shows area³ and power breakdowns for 3-way ISI routers compared to a generic 4-VC router. In total, a 4-VC ISI router adds 13.7% area and 15% power to the baseline 4-VC router. In the following sections, we compare the performance of ISI with fewer channels to the 4-VC baseline router. A 2-VC ISI router consumes 23% less area and 35% less leakage power than the baseline router; however, it consumes a similar amount of dynamic power due to the additional crossbar area. A 1-VC ISI router is 41% smaller than the baseline router and also consumes less dynamic and leakage power (15% and 56% less, respectively). We assumed minimum buffering when calculating ISI cost relative to the baseline; for networks with deeper buffers, ISI overhead becomes even smaller relative to total network area/power.

We also calculated the cost of increasing the number of injection ports per VC (static speedup) compared to ISI. We found that adding two extra ports per source, to match the ports per source available under 3-way ISI, costs 36% more area and 42% more power, compared to 13% and 15% for ISI, which uses smaller multiplexer units instead. Larger routers show even more favorable results for ISI: Figure 3.13 compares the cost of ISI vs. static speedup at different router radices, showing that ISI costs less and scales better than static speedup.

For the order-preserving logic, the extra buffering required is minimal. Both PN and OPC entries have a bit width of log2(I × Vs), the maximum number of VCs that can be assigned to a VIP, where I is the interleaving degree and Vs is the number of virtual channels per RIP. For example, in a 5×5 router with 3-way ISI, the number of bits needed per OPC/PN entry is 4, resulting in a VPOT size of 100 bits in addition to 20 bits of storage added to the VCS.

³Networks in modern chips are constructed using a number of layers [104], and area cost can vary across chip configurations and technology generations.
(a) Router area. (b) Router power consumption.

Figure 3.12: Normalized area and power consumption of a generic 4-VC 5×5 router compared to 3-way ISI routers with 4, 2, and 1 VCs per input port.

3.6 Methodology and Results

We evaluated ISI using both synthetic traffic (Section 3.6.1) and traces that imitate a number of SoC use cases (Section 3.6.2). The synthetic-traffic analysis in Section 3.6.1 aims to validate the theoretical analysis provided in Section 3.4 and, at the same time, to serve as a baseline in terms of asymmetric traffic behavior for the trace-based analysis in Section 3.6.2.
Figure 3.13: ISI vs. static speedup: (a) normalized crossbar area and (b) normalized crossbar power vs. crossbar radix (x-axis), normalized to the baseline (no speedup).

Network simulator       Garnet2.0 [3]
# of directories        4 (9 or 25 injection nodes), 8 (64 injection nodes)
Traffic injection       Uniform Random, Tornado, Shuffle & Transpose
# of virtual channels   4
Routing                 DOR (XY & YX)
Router model            Single-cycle router

Table 3.3: Experimental setup

3.6.1 Synthetic Traffic
To confirm our analytical results from Section 3.4, we simulated network performance under asymmetric loads by varying λ and α (using the configurations in Table 3.3). The results show that network latency is higher at larger α values, meaning that latency increases when traffic is generated by fewer nodes, as predicted by the theoretical analysis. For a 9-node crossbar configuration, saturation throughput under random injection decreases by around 45% when α increases from 1 (all nodes injecting equally) to 10, while the saturation throughput of a 3×3 mesh is reduced by around 30%. Larger crossbar and mesh networks (with 25 and 64 injection nodes) and different traffic patterns show a similar reduction in saturation throughput at higher α values. In summary, our results confirm the theoretical analysis that asymmetric traffic severely reduces total NoC throughput compared to the generally presumed case of symmetric traffic (from homogeneous nodes).
Figure 3.14: Evaluating ISI under synthetic traffic using symmetric (α = 1) and asymmetric (α = 5) traffic: packet latency vs. injection rate (packets/cycle/node) for 25-node crossbar (XBAR-25) and mesh (MESH-25) networks.

Figure 3.14 shows injection rate vs. latency when α = 1 (symmetric traffic) and α = 5. We used baseline networks with a single (B1VC), four (B4VC), and sixteen (B16VC) virtual channels, and similarly for ISI (ISI-1VC, ISI-4VC, ISI-16VC). For the crossbar network, we used 3-way ISI as shown in Figure 3.9, while for the mesh network we used 2-way inter-router ISI. In Figure 3.14a, ISI performs close to the baseline. We can also see that B4VC, ISI4VC, and ISI16VC outperform the single-VC implementations. This is due to head-of-line (HOL) blocking caused by injecting packets for the same destination from multiple nodes; using virtual channels alleviates these conflicts. On mesh networks, packets are routed through intermediate nodes, reducing the number of HOL conflicts.

When adding asymmetry to the traffic in Figure 3.14b, we can see the divergence of ISI and baseline performance. In the crossbar network, we still see the same HOL blocking effect with 1 VC. For the 4-VC and 16-VC implementations, ISI increases the saturation throughput by around 23% in both cases. The figure shows that increasing the number of VCs from 4 to 16 does not improve the saturation throughput, meaning that once the HOL effect is eliminated with 4 VCs, the extra queuing capacity provided by techniques like DAMQ and ViChaR does not improve performance, as injecting into the crossbar becomes the bottleneck. ISI, however, is able to further increase performance by directing traffic through multiple input ports, as explained in Figure 3.4. Similarly, ISI outperforms both baseline implementations in the mesh network, increasing the saturation throughput by around 15% over the baseline 4-VC and 16-VC implementations.

We also evaluated DSB (using a modified version of Garnet2.0) and compared its performance under symmetric (α = 1) and asymmetric (α = 5) traffic, using the same configurations in Table 3.3 and assuming an infinite number of middle memories is available. We found that even with such relaxed assumptions, DSB fails to improve performance under traffic asymmetry compared to the baseline, as it exhibits the same limitations as other techniques (as demonstrated in Figure 3.4).
3.6.2 Mobile SoC Use Cases Results
We evaluate ISI using traffic traces of a set of typical use-case scenarios for mobile SoCs. Our use cases are listed in Table 3.4. For each use-case scenario, we list the active IPs that perform a task at some point during application execution. A major obstacle, however, was the lack of research simulation infrastructures with a representative NoC hierarchy or detailed IP models. For the NoC, we injected traffic directly into a single fabric in both crossbar and mesh configurations; using hierarchical crossbars would yield similar results, as traffic coming through a periphery
Use case             Active IPs                                    Description
3D Render (3D)       CPU, GPU, Display Controller                  An application rendering a 3D object
Android GUI (AG)     CPU, GPU, Display Controller                  Rendering the Android Home-Screen
Game (GM)            CPU, GPU, Display Controller,                 3D rendering with Audio
                     Audio Decoder & Audio Output
Video I-Frame (VI)   CPU, Video Decoder & Display Controller       Video playback decoding I-Frames
Video P-Frame (VP)   CPU, Video Decoder & Display Controller       Video playback decoding P-Frames

Table 3.4: Evaluation use cases
NoC fabric acts as a single IP agent. For IP models, while Emerald provides detailed CPU and GPU models [86], other tools mostly rely on combining detailed and synthetic traces [49, 168, 253]. We follow the latter approach; traces and synthetic traffic are collected as follows:

• CPU: For each application, we collected CPU traces by running Android on the gem5 simulator [41]. We modeled a 2.0GHz Quad-core ARM system.
• GPU: We modeled a 326 GFLOPS GPU using Emerald.

• Video Decoder: We used traces collected from an RTL model [139] decoding 1080p video frames.

• Audio: For audio decoder and audio output we generated traces using the configurations used in Chidambaram Nachiappan et al. [49].

• Display Controller: We used the ARM HD LCD controller embedded in gem5. We used 1080p display resolution with a refresh rate of 60 FPS.
Traces from the display, CPU, and GPU are collected simultaneously while running Android 4.x. We separate the traffic from each source in order to combine it with other traces and construct trace permutations, as described later. The baseline configurations are summarized in Table 3.5; for latency, we measure packet request-to-response latency.

3.6.2.1 NoC Simulation
NoC simulator      Garnet 2.0
Router model       Single-cycle router
Topology           Crossbar
Channel width      128 bits
Packet size        128 bytes
Protocol           MOESI (using 3 networks)
# of VCs           1
NoC cycle          1 ns
ISI queue size     1 per source-port pair
# of directories   4
DRAM model         Unlimited bandwidth, 1 ns latency per request
# of IPs           8

Table 3.5: Baseline simulation configurations

3.6.2.2 Source Injection Placement
For each scenario, we assumed the worst-case placement of sources to ports for ISI, meaning that the n active IPs are randomly connected to ports [0, n] in a crossbar or to rows [0, n/R] in an R×R mesh.

3.6.2.3 Trace Injection Rate (TIR)
To calculate network saturation curves using fixed traces, we vary the network frequency while keeping the traces fixed, obtaining a variable trace injection rate (TIR). TIR represents the trace injection rate relative to the network operating frequency; in effect, TIR plays the role of the varying injection rates used in the synthetic traffic experiments.
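As an illustration of the mechanism (the function and parameter names below are assumptions, not the dissertation's harness), replaying the same trace against a slower NoC clock raises the effective TIR:

```python
# Replaying fixed trace timestamps against a variable NoC clock: lowering
# the clock frequency raises the effective trace injection rate (TIR).
def to_router_cycle(timestamp_ns, noc_freq_ghz):
    return int(timestamp_ns * noc_freq_ghz)

# Two packets 10 ns apart arrive 10 router cycles apart at 1 GHz but only
# 5 cycles apart at 0.5 GHz, i.e., twice the injection rate per cycle.
print(to_router_cycle(10, 1.0), to_router_cycle(10, 0.5))
```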
Figure 3.15: Comparing ISI and baseline crossbar NoC packet request-to-response latency vs. trace injection rate for each use case: (a) 3D, (b) GM, (c) AG, (d) VI, (e) VP.
Figure 3.16: Comparing ISI and baseline mesh NoC packet request-to-response latency vs. trace injection rate for each use case: (a) 3D, (b) GM, (c) AG, (d) VI, (e) VP.
3.6.3 Simulation Results
Figure 3.15 and Figure 3.16 show packet latency vs. trace injection rate for each use case. For the crossbar topology results in Figure 3.15, B1VC and B4VC show very similar performance. ISI-4VC, on the other hand, has larger saturation thresholds than ISI-1VC. These results show that once ISI alleviates traffic asymmetry, adding virtual channels further reduces the probability of conflicts on the crossbar. On average, ISI-1VC increases crossbar saturation throughput over B1VC by around 113%, ranging from 33% for 3D and GM to 284% for AG. ISI-4VC increases saturation throughput by an average of 184%, from 142% for 3D, GM, and VI to 361% for AG. For the mesh network in Figure 3.16, ISI-1VC and ISI-4VC show closer performance than in the crossbar results of Figure 3.15, consistent with the mesh results discussed in Section 3.6.1. On average, ISI improves mesh saturation throughput by around 91%, from 80% for 3D, GM, and VP to 134% for AG. In addition to the baseline configurations, we also evaluated a 16-channel configuration to assess the upper limits of channel-sharing schemes; similar to the Section 3.6.1 results, we found that the 16-VC configuration performs almost identically to the 4-VC one.

3.6.3.1 Trace Permutations
Figure 3.15 and Figure 3.16 show the results for one possible traffic scenario. However, since the timing of the traffic from each IP can vary from one application to another, and from one frame to another, we generated a number of permutations for each trace that cover multiple possible execution scenarios. For each use case, we further evaluated ISI using 20 different permutations. We found that ISI consistently improves the saturation threshold in each use case. Using 100 cycles as the cutoff latency, ISI increases the saturation threshold for the crossbar by around 200% for 1 VC (from 46% for 3D to 320% for VI) and by around 270% for 4 VCs (from 96% for VI to 470% for AG). Similarly, for mesh topologies, ISI increases saturation by around 310% for 1 VC (from 81% for 3D to 700% for AG) and by 334% for 4 VCs (from 81% for 3D to around 700% for AG).
Figure 3.17: ISI IP latency vs. trace injection rate (crossbar): (a) CPU, (b) Display, (c) GPU, (d) Video Decoder.

3.6.3.2 Quality-of-Service
Figure 3.17 shows latency vs. injection rate for a number of IPs, averaged across the five use cases and 20 permutations for each use case. The results show that ISI improves QoS for each IP compared to the baseline, except for the display. We found that even though ISI enhances the saturation throughput for the display by more than 300%, the display shows a minimal increase (<1%) in latency at lower injection rates. Prioritizing display traffic at the ISI stage did not change this behavior; the reason is the increased traffic from other sources at different ports competing for network access, which is the same behavior the network would exhibit under a higher load. To maintain low latency, critical IPs should be prioritized both at the ISI stage and at the switch level.
Version                        0.91
Process                        22nm
# of bits per flit             128
# of virtual networks          1, 4
Total buffers per input port   4

Table 3.6: DSENT configuration

3.6.4 Area and Energy Efficiency

In this section we calculate ISI crossbar network throughput per area/energy unit compared to the baseline using the same set of use cases.

3.6.4.1 Area Efficiency
ISI-1VC and ISI-4VC increase area by 12% and 7% compared to the 1-VC and 4-VC baselines, respectively. ISI-1VC achieves, on average, 55.4% higher throughput per unit area compared to B1VC and 161% compared to B4VC, while ISI-4VC increases throughput per unit area by an average of 82.5% compared to the baseline B4VC.

3.6.4.2 Energy Efficiency
We fed our simulation data into our power model to calculate throughput per joule. On average, ISI-1VC improves throughput per joule by 37% compared to B1VC (3D: 20%, GM: 30%, AG: 14%, VI: 14%, and VP: 55%). Compared to B4VC, ISI-1VC increases throughput per joule by an average of 176%. ISI-4VC shows a 63% average improvement in throughput per joule compared to B4VC (3D: 36%, GM: 66%, AG: 86%, VI: 39%, and VP: 87%).

3.7 Conclusion
This work casts light on time-variant traffic asymmetry in contemporary SoC networks. We analyzed the impact of this asymmetry on network performance and showed that it significantly undermines network performance. We advocate using interleaved source injection (ISI) as a reasonable method to improve network performance under alternating asymmetric traffic. ISI allows NoC router ports to accept traffic from multiple sources based on network traffic conditions. We show that ISI can smooth traffic asymmetry and increase throughput while maintaining quality-of-service. We evaluated ISI under stochastic traffic patterns and traces that mimic mobile workloads, and found that ISI increases throughput per unit area in crossbar networks by between 55.4% and 82.5% and throughput per joule by between 37% and 63%.

In summary, this work highlights the problem of traffic asymmetry in heterogeneous systems and proposes a first-order solution for re-balancing it; we conclude that further research in this area is essential with the advent of more diverse heterogeneous systems. While we used traces to complement our studies with stochastic traffic, this approach is still unable to capture some system- and implementation-specific aspects, such as traffic with multiple QoS levels (whether from different IPs or from the same IP), fine-grain inter-dependency between IPs (e.g., multi-stage data-chaining), and multiple IPs with hard deadlines (e.g., displays and camera sensors). Evaluating schemes like ISI with a more elaborate setup would not only provide a better assessment but also open the door for further optimizations, like QoS-aware and deadline-sensitive arbitration schemes.

Chapter 2 and this chapter discussed IP (GPU) design and communication between SoC components through NoCs, respectively. The next chapter (Chapter 4) focuses on the memory module in SoCs (Figure 1.1, 3), where off-chip memory, i.e., DRAM, is a major source of power and energy consumption. Chapter 4 discusses methods and techniques for reducing the SoC off-chip bandwidth, and consequently the power and energy, consumed by graphics operations; specifically, Chapter 4 proposes a surface compression scheme to reduce the bandwidth consumed by framebuffer operations.

Chapter 4

Surface Compression Using Dynamic Color Palettes

4.1 Introduction

Off-chip memory traffic, including that of framebuffer surfaces, is one of the major contributors to power consumption in SoCs. In some cases, the energy consumed accessing data stored in off-chip memory can dominate that of computations [184]. In this work, we study the properties of framebuffer surfaces and propose a set of unique compression techniques to reduce the bandwidth consumed by framebuffer operations.

In the context of graphics rendering, a framebuffer surface is an off-chip memory space that contains pixels generated by the GPU and later read by the display controller for output to the screen. In some cases, the display controller operates on multiple framebuffer surfaces, which are composited into a single surface for screen display. GPUs can also use framebuffer surfaces as inputs to additional rendering stages, e.g., rendering to textures and deferred shading; as a result, any given application may utilize one or more framebuffer surfaces.

In this work, we study a large set of Android workloads to infer the compression properties of framebuffer surfaces generated by mobile UI, 2D and 3D applications. Our study highlights how framebuffers from different classes of workloads have different compression properties; we exploit these properties to propose

an effective palette-based framebuffer compression scheme that focuses on common UI and 2D applications. In addition, we exploit graphics temporal coherence, where applications exhibit minor changes between frames, to construct adaptive compression palettes.

Using temporal coherence, and by focusing on common use cases, we propose and evaluate our Dynamic Color Palettes (DCP) technique. DCP uses palette-based compression and focuses on reducing the traffic caused by framebuffer operations in UI and 2D applications. To evaluate our compression scheme, we created an extensive set of workloads from 124 Android applications. We show that by combining DCP with other compression techniques [198], DCP is able to improve the compression rates of such techniques by between 83% and 161% across UI, 2D, and 3D applications. This work makes the following contributions:

1. Characterizes compression properties of framebuffer surfaces from user-interface (UI) as well as non-UI 2D and 3D applications;

2. Uses characterization results to propose and evaluate dynamic color palettes (DCP), a compression technique that offers higher compression rates for common UI and non-UI 2D applications;

3. Proposes two DCP variations that dynamically choose an optimal palette size based on the frequencies of the values in color palettes;

4. Evaluates our compression schemes using an extensive set of workloads consisting of 124 Android applications.

4.2 Background and Related Work

4.2.1 The Life Cycle of a Framebuffer Surface

Figure 4.1 summarizes the life cycle of a frame surface in contemporary mobile systems (Android Ice Cream Sandwich 4.0 and later [17]). A typical scenario of drawing multiple surfaces simultaneously from multiple processes is shown: the status bar, Facebook, and the navigation bar.


Figure 4.1: The life cycle of a surface from rendering to display. In 1 , applications render to their corresponding framebuffers. In 2 , a compositor combines the surfaces generated by different applications. In 3 , the composited surface is used and displayed on the screen by the display controller.

Each process independently renders to its own surface ( 1 ); for example, Facebook renders a new surface when the user scrolls or clicks, while the navigation bar updates the corresponding surface when the user clicks on one of its buttons. For display, a system compositor, such as SurfaceFlinger [17] in Android, combines surfaces from multiple applications before sending them to the screen ( 2 ). The compositor actively monitors the surfaces of all applications, and when a process updates a surface, the compositor subsequently updates the composited surface. Simultaneously, the display controller hardware continuously reads the composited surface out to the screen at 60 frames per second (FPS) or higher ( 3 ). Note that because using the same surface for updates and screen refresh operations can cause artifacts, such as flickering and tearing, double (or triple) buffering is used [15]. The example in Figure 4.1 shows how a surface can be used and re-used multiple times, and this is why it is important to reduce the overhead of framebuffer

manipulation through compression.

4.2.2 Surface Compression Techniques

Surface compression is used to reduce off-chip memory traffic, which can improve performance and/or reduce energy consumption. Graphics pipeline implementations utilize compression for textures [224], surfaces [5, 20, 198], depth [97] and vertex data [117]. Existing techniques use both lossy and lossless compression. For framebuffer surfaces, however, lossless compression is used to avoid error accumulation when surfaces are read and written multiple times; an example is when a surface is used as a texture, as is the case with screen composition. Surface compression differs from texture compression in that both encoding and decoding are performed in real time. In addition, and in contrast to surface compression schemes, most texture compression algorithms are lossy [183, 224, 225].

Another crucial aspect of surface compression is random accessibility. Techniques like Run-Length Encoding (RLE) are unable to provide such accessibility. However, it is important to be able to randomly access surfaces that are used for sampling (e.g., used as a texture), resizing, or composition. Compression algorithms have used block-based schemes to enable random access because of their simplicity and practicality. Block-based compression mechanisms define a number of preconfigured compression sizes that allow random access to compressed surfaces. Block-based mechanisms have been used for compressing integer (e.g., RGBA) surfaces [198], floating-point surfaces [189, 226] and depth buffers [97, 226].

The work by Rasmusson et al. [198] (which we refer to as RAS) evaluated several surface compression proposals [157, 161, 241] and compared them against their technique. RAS is a lossless block-based compression technique for integer buffers that encodes the difference between adjacent pixel values. RGB pixels are converted to the YCoCg (luminance-chrominance) format to increase compression efficiency. We compare against RAS in this work since it reports better compression results than prior work.

Other existing techniques include one described by NVIDIA [178]. In this scheme, for each block going to memory, the algorithm checks whether the pixels in each 4×2 sub-block within the block are identical. If so, the block is compressed 1:8. When that is not possible, the algorithm then checks whether 2×2 regions have identical colors; if so, the block is compressed 1:4; otherwise the block remains uncompressed. This algorithm fits well when there are regions of identical color values, as is the case with UI surfaces.

Other compression work includes the work by Danielson [57], which proposes using dictionary-based compression in which the operating system and/or program specify the colors to configure a dictionary. In contrast to Danielson's work, our work exploits temporal coherence to dynamically construct dictionaries (palettes), avoiding the need for software changes.

Another work, by Shim et al. [217], uses a dictionary-based compression mechanism targeted at display buffer compression. Shim et al.'s approach compresses surfaces using Huffman coding after rendering is completed to reduce the bandwidth of display refresh operations. Rendered surfaces are read to construct critical color differences, which are used in a second stage to construct a Huffman dictionary. The third stage re-reads the surface buffer and writes out a compressed buffer that is then used for screen refresh operations. In contrast to previous work, we propose employing temporal coherence to predict the values for the dictionary, avoiding submitting uncompressed surfaces to memory or requiring additional surface read/write operations. We also propose an adaptive compression scheme that avoids Huffman coding inefficiencies with probability distributions that are not exact powers of two.

Finally, there is a body of work that has exploited temporal coherence in real-time rendering through inter-frame data reuse. These techniques, in addition to offline rendering techniques like ray-tracing, are summarized in the survey by Scherzer et al. [207]. Here we propose a different application for temporal coherence by exploiting it for compression.
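As a concrete illustration of the NVIDIA-style block check described at the start of this passage, the following minimal Python sketch (our own illustration; the block dimensions, region orientation, and return encoding are assumptions, not the vendor's implementation) classifies a block by testing for uniform 4×2 and 2×2 regions:

    def classify_block(block):
        """Classify a block of pixels (2D list of color values) following the
        scheme described by NVIDIA [178]: 1:8 if every 4x2 region holds a
        single color, else 1:4 if every 2x2 region does, else uncompressed."""
        h, w = len(block), len(block[0])

        def uniform_regions(rh, rw):
            # True if every rh x rw region of the block holds one color.
            for y in range(0, h, rh):
                for x in range(0, w, rw):
                    region = {block[y + dy][x + dx]
                              for dy in range(rh) for dx in range(rw)}
                    if len(region) > 1:
                        return False
            return True

        if uniform_regions(2, 4):   # 4x2: 4 pixels wide, 2 tall (assumed)
            return "1:8"
        if uniform_regions(2, 2):
            return "1:4"
        return "uncompressed"

    # Example: an 8x8 block filled with a single color compresses 1:8.
    solid = [[0xFFFFFFFF] * 8 for _ in range(8)]
    assert classify_block(solid) == "1:8"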

4.2.3 Mobile Use Patterns

In this work, we focus on developing an effective scheme for compressing UI and 2D framebuffer surfaces. The reason for this choice is that studies have found users spend 70% of their time running UI applications [70, 172], with over half of that time spent on web browsing, messaging and social media alone, whereas games of all types, 2D and 3D, account for only 30% of the usage time.

SoC designers target a range of goals for each SoC use case. For demanding applications, such as 3D games and HD image processing, the focus is on meeting performance requirements within a feasible power and thermal budget. On the other hand, for less demanding applications, such as UI and many 2D applications, the focus is on reducing the total energy consumption under typically long runtimes. This work proposes an effective scheme that targets such long-running common use cases. Additionally, in Section 4.7.5 we show how the DCP scheme can be combined with other generic compression algorithms to provide a comprehensive framebuffer compression solution that works with all categories of mobile use cases.

4.3 Temporal Coherence in Mobile Graphics

This section demonstrates two characteristics of many mobile applications that DCP takes advantage of; specifically, this section shows the compression potential resulting from high temporal color coherence with low pixel change rates, and the color characteristics of UI and 2D surfaces. Section 4.4 later shows how DCP exploits both characteristics to improve the compression rates of UI and 2D surfaces.

Temporal coherence is the property of inter-frame similarity [207]; this means that in a sequence of frames, content only gradually changes from one frame to the next. To quantify temporal coherence, we use two measurements: Color change and Pixel change. Color change is the total difference in pixel color frequencies between two frames, regardless of the locations of the pixels. Pixel change, on the other hand, is the total number of pixels that change color between frames, measured by counting the number of pixel locations that differ in value between two frames.

Color change estimates how similar two frames are with regard to color frequency only, while Pixel change captures the movement of content on a surface. To illustrate color and pixel change, we use an example from the Google Chrome browser in Figure 4.2. The example shows pixel and color change for different events. Notice that in some cases, when new content is displayed, both pixel and color change values are high (e.g., a new web search). In most cases, however, pixel change is higher than color change; this means that in many cases the content is moving but not changing, as is the case with scrolling (after BBC news loading) in Figure 4.2.

Looking at a range of mobile workloads, we found that temporal color coherence is reflected by low Color change values, especially in UI applications. We analyzed a set of nine Android UI applications and games (UI: Twitter, Facebook, Chrome, and Android Home Screen; 3D: Fruit Ninja, Need 4 Speed, Gunship 2, and Temple Run 2). In 3D applications, color and pixel change rates are 15.7% and 65%, respectively. UI applications, on the other hand, have rates of 3.3% and 14.5% for color and pixel change, respectively. These numbers show that 2D and UI applications exhibit higher temporal color coherence relative to 3D applications.

In addition to higher temporal coherence, we found that UI applications tend to use fewer colors. Figure 4.3 demonstrates how a small number of frequent pixel color values dominates a typical UI application compared to a 3D one. Figure 4.3 shows the cumulative distribution function of colors used in Twitter's UI (a), compared to a 3D game, Temple Run 2 (b). In (a), the top 100 most common color values cover over 80% of the frame's surface, while coverage is 10% for Temple Run 2. Measuring compressibility with Shannon entropy, we found that Twitter has an entropy of 4.5 bits per pixel, while it is 14 bits per pixel for Temple Run, indicating higher compressibility for Twitter. The next section shows how to take advantage of temporal coherence and the color characteristics of UI applications to design a dynamic color palette scheme for compressing UI and 2D surfaces.
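To make the two metrics precise, the following sketch (illustrative Python; the normalization choices are our assumptions) computes pixel change, color change, and the Shannon entropy used above, given two frames as flat lists of color values:

    from collections import Counter
    import math

    def pixel_change(prev, curr):
        # Fraction of pixel locations whose color differs between frames.
        changed = sum(1 for a, b in zip(prev, curr) if a != b)
        return changed / len(curr)

    def color_change(prev, curr):
        # Total difference in color frequencies, irrespective of location;
        # each recolored pixel contributes twice to the frequency diff.
        fp, fc = Counter(prev), Counter(curr)
        diff = sum(abs(fc[c] - fp[c]) for c in fp.keys() | fc.keys())
        return diff / (2 * len(curr))

    def entropy_bits_per_pixel(frame):
        # Shannon entropy of the frame's color distribution (bits/pixel).
        counts = Counter(frame)
        n = len(frame)
        return -sum((c / n) * math.log2(c / n) for c in counts.values())

    # Scrolling moves content: pixel change is high, color change stays low.
    prev = [1, 2, 3, 4, 4, 4]
    curr = [4, 1, 2, 3, 4, 4]        # same colors, shifted one position
    print(pixel_change(prev, curr))  # ~0.67
    print(color_change(prev, curr))  # 0.0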

Figure 4.2: Pixel change and Color change in Google Chrome (events shown include scrolling, a new web search, loading BBC news, and loading Amazon). In most cases, pixel change is higher than color change; this means that content is moving but not changing most of the time. Our compression scheme takes advantage of the lower color change between frames to predict compression palettes.

4.4 Dynamic Color Palettes (DCP) Compression

DCP is a technique that exploits graphics temporal color coherence for framebuffer compression. For each frame, DCP carries out two operations in parallel: color frequency collection and framebuffer compression. For color frequency collection, DCP tracks the most frequently used colors as the rendering of a frame progresses; meanwhile, DCP compresses the pixels in the frame with a palette constructed using the frequency information of the previous frame.

DCP has two main advantages over previous dictionary-based techniques [57, 217]. First, it employs sampling to exploit temporal coherence and predict future dictionary values on-the-fly, alleviating the need for software hints or a multi-stage dictionary update process. This allows DCP to compress intermediate surfaces (i.e., application surfaces) as well as the framebuffer surface used by the display unit. Second, as we will show later, DCP maximizes compression using adaptive dictionary sizing, which puts to use the color frequency data collected for each frame.

Figure 4.3: The cumulative distribution function (CDF) of unique color values in (a) UI (Twitter) and (b) 3D (Temple Run 2) Android applications.

Figure 4.4: Using DCP across frames. In Frame 0, DCP starts constructing a palette using the FVC. DCP uses the resulting CCD to compress Frame 1; additionally, in Frame 1, the FVC constructs a new palette to be used in Frame 2's CCD.

DCP relies on two structures (shown in Figure 4.4), the Frequent Values Collector (FVC) for color frequency collection, and the Common Colors Dictionary (CCD) for compressing new pixels. The FVC identifies the most commonly occurring colors, while the CCD encodes the most frequent colors as identified by the FVC from the previous frame. As shown in Figure 4.4, in each frame the FVC collects color frequency information that is then used to construct the CCD of the next frame.

4.4.1 DCP Workflow

Figure 4.5 shows the DCP workflow. In 1 , the GPU commits tiles of pixels to the off-chip memory in multiple batches, i.e., blocks of spatially adjacent pixels [105–107]. For the example in Figure 4.5, we use a block size of 4×4 pixels and a sub-block size of 2×2. Pixels in each block are sent to the FVC a1 and the CCD b1 . In a1 , the FVC uses the pixel values in each block to update the common color frequencies of the current frame (more details on that in the next section).


Figure 4.5: DCP stages: In 1 , the GPU commits blocks of pixels, which are sampled by the FVC ( a1 and a2 ). In b1 , the CCD compresses pixel blocks in batches of sub-blocks. In b2 , if all pixel values in a sub-block have an entry in the CCD, the sub-block is compressed. If one of the pixel values in a sub-block does not have a CCD entry, then the whole sub-block remains uncompressed b4 . Compressed and non-compressed sub-blocks are buffered, and once a full block is processed, it is submitted to memory b3 .

In b1 , the CCD compresses pixel blocks in batches of sub-blocks. In b2 , if all pixel values in a sub-block have an entry in the CCD, the sub-block is determined to be compressible. Each color value in a compressible sub-block is represented using log2(CCD size) bits, e.g., 6 bits per pixel for a CCD with 64 entries. If one of the pixel values in a sub-block does not have a CCD entry, then the whole sub-block remains uncompressed. Compressed and non-compressed sub-blocks are buffered, and once a full block is processed, it is written to the off-chip

memory b3 . Like other block-based compression schemes [133, 198], DCP uses a metadata compression status buffer (CSB) that contains a compression status bit for each sub-block. Upon compressing a sub-block, the corresponding entry in the CSB is set b4 , and upon reading a compressed surface, the CSB is consulted to determine how many bytes should be fetched from memory. The next two sections describe each of DCP's components in Figure 4.5 in detail.
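The write path just described can be summarized in a few lines. The sketch below (illustrative Python; the function name and the payload representation are our own, not hardware RTL) shows the per-sub-block CCD check and the CSB bit it produces:

    def compress_block(sub_blocks, ccd):
        """Illustrative DCP write path for one pixel block. `ccd` maps a
        color to its log2(CCD size)-bit code (e.g., 6 bits for 64 entries),
        and `sub_blocks` is the block split into lists of pixel values."""
        payload, csb = [], []
        for sb in sub_blocks:
            if all(p in ccd for p in sb):         # every pixel has a CCD entry
                payload.append([ccd[p] for p in sb])  # compressed: short codes
                csb.append(1)                     # mark sub-block compressed
            else:
                payload.append(list(sb))          # raw, uncompressed pixels
                csb.append(0)                     # mark sub-block uncompressed
        return payload, csb                       # buffered, then written out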

4.4.2 The Frequent Values Collector (FVC)

The FVC is a relatively small — e.g., 16 to 128 entries — associative memory structure. The FVC stores a set of pixel values and their corresponding frequencies as value-frequency pairs. The FVC is initialized with all entries invalidated; then, for each pixel access, the FVC determines whether the pixel already has an entry. If so, the FVC increases the corresponding frequency counter by one; otherwise, the new pixel value is added to the FVC. However, because FVC size is limited, the FVC uses an eviction policy to determine which pixel frequencies to keep track of. Similar to a fully associative cache, the FVC uses a least-frequent color (LFC) policy, where it evicts the pixel value with the smallest frequency when an entry is needed to track the frequency of a new pixel value.
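A behavioral sketch of the FVC follows (Python for illustration; the dictionary-based structure stands in for the associative hardware, and the class name is ours):

    class FVC:
        """Sketch of the Frequent Values Collector: a small associative
        table of (color -> frequency) pairs with LFC eviction."""
        def __init__(self, size=64):
            self.size = size
            self.freq = {}            # color value -> occurrence count

        def access(self, color):
            if color in self.freq:
                self.freq[color] += 1
            elif len(self.freq) < self.size:
                self.freq[color] = 1
            else:
                # Evict the entry with the smallest frequency (LFC policy).
                victim = min(self.freq, key=self.freq.get)
                del self.freq[victim]
                self.freq[color] = 1

        def top_colors(self):
            # Most frequent first; used to build the next frame's CCD.
            return sorted(self.freq, key=self.freq.get, reverse=True)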

4.4.2.1 FVC Hardware Cost

Each FVC entry contains a color value (32 bits for RGBA), a validity flag (1 bit), and a counter with log2(number of screen pixels) bits. For example, a 64-entry FVC sized for a 4k×4k display requires only 456 bytes of storage.
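As a check on this figure: a 4k×4k display has 4096 × 4096 = 2^24 pixels, so each counter needs 24 bits, and the total storage works out to

    64 entries × (32 + 1 + 24) bits = 64 × 57 bits = 3648 bits = 456 bytes.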

4.4.3 The Common Colors Dictionary (CCD)

The CCD is used to encode compressed pixels. At the end of each frame, the FVC holds the frequencies of the estimated most common colors. The FVC is then used to construct the CCD for the next frame. Each CCD entry maps a pixel value to a dictionary (encoding) value.


Figure 4.6: Reading a DCP-compressed surface: the process starts with loading the corresponding rCCD and CSB. To read a pixel, CSB entries are decoded to determine the size of the compressed data and how many bytes should be fetched for each block. Once a block is fetched, the rCCD then decompresses the values in each sub-block.

The CCD is implemented using a fully associative structure. When reading a surface, the mapping of the CCD is reversed to decompress encoded pixels. We call the direct-mapped structure that holds this reversed mapping the rCCD. Upon compressing a frame, or a set of frames, the rCCD mapping is attached to the frame and stored in main memory. Later, when the frame is read, the rCCD is used to decompress the frame as described in Section 4.4.5 below.

4.4.3.1 CCD Hardware Cost

A CCD/rCCD with 64 entries requires only 264 bytes of storage.

4.4.4 The Compression Status Buffer (CSB)

Similar to other block-based compression algorithms [97, 189, 198, 226], a metadata buffer is used to hold the status of each compression block. For DCP, the CSB indicates whether a given sub-block is compressed. In our baseline, this translates to a cost of 1 bit per 128 bits of surface data. Both the CSB and the rCCD are needed to read a compressed frame, as explained in the next section.

4.4.5 Reading a Compressed Framebuffer Surface

Figure 4.6 shows the process of reading a compressed surface. It starts with loading the corresponding rCCD and CSB. To read a pixel, CSB entries are decoded to determine the size of the compressed data and how many bytes should be fetched for each block. To avoid doubling the latency, and since CSB sizes are relatively small, the CSB can be prefetched into a small on-chip buffer/cache. Once the CSB is used to determine the size of a compressed block, the block is fetched accordingly and the rCCD is used to decompress the values in each sub-block, as shown in Figure 4.6.
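The read path can be sketched as follows (illustrative Python; `fetch`, the word counts, and the per-block CSB granularity are simplifying assumptions; in DCP proper, the CSB tracks individual sub-blocks):

    def read_block(block_id, csb, fetch, rccd, raw_words, comp_words):
        """Illustrative read path for one block. `csb[block_id]` says
        whether the block was stored compressed, `fetch(id, n)` returns n
        words from memory, and `rccd` maps an encoding back to its color."""
        if csb[block_id]:                        # compressed: fewer bursts
            codes = fetch(block_id, comp_words)
            return [rccd[c] for c in codes]      # decode via the reverse CCD
        return fetch(block_id, raw_words)        # uncompressed: full read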

4.4.6 Multi-Surface Support

Multiple Render Targets (MRT): Some graphics applications may render to multiple target surfaces. Techniques that use MRT, like deferred shading, are popular in 3D applications and are used to render scenes with complex lighting [238]. To support multiple render targets, DCP would require replicating some of the structures in Figure 4.5 to match the maximum possible number of target surfaces. DCP would need a single FVC and a single CCD unit per render target; however, no additional FVC and CCD units are needed if multiple passes are used to process the MRT. Since most UI and 2D workloads render to a single target, a typical hardware implementation may only need to support a single render target, and the rare cases of multiple targets can be handled by applying DCP to a single surface. Moreover, as discussed in Section 4.7.4, adding additional structures is relatively cheap and costs little chip area.

Multi-Surface Composition: Contemporary compositor engines can composite up to 16 surfaces in one pass [244]. To support multi-surface composition, the number of rCCD structures in Figure 4.6 should match the number of surfaces that can be composited in parallel.

4.4.7 Coupling DCP with Other Compression Algorithms

DCP is designed to target common UI and 2D applications; other compression algorithms are better suited to 3D and some 2D applications. Industry practitioners have already proposed supporting multiple compression algorithms [133, 178]. This means that in a hybrid scheme, each block can be compressed either using DCP or an alternative algorithm. In Section 4.7.5 we evaluate the results of combining DCP with RAS.

Figure 4.7: Compression rates vs. FVC coverage of the workloads in Table 4.3, sorted by FVC coverage rate (applications with the lowest FVC coverage are on the left).

4.4.8 Dynamically Enabling DCP

In this section, we explain how DCP can be enabled/disabled based on the expected compression performance. DCP performance can be predicted using the frequencies collected by the FVC at the end of a frame. By adding up the frequency values in the FVC and comparing the sum to the total number of pixels (the sample size), we can calculate what we call FVC coverage, which can be used to predict DCP performance:

FVC coverage = (sum of FVC frequencies) / (number of samples)

By defining a coverage threshold (CT) and comparing it to FVC coverage, DCP can be enabled only if FVC coverage ≥ CT. By periodically enabling the FVC, e.g., once every n frames, FVC coverage can be updated and used to determine whether DCP should be enabled. For an N-entry FVC, calculating coverage takes N−1 integer additions and one division operation per frame.
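The check itself is trivial; a sketch follows (illustrative Python; the 0.7 threshold is taken from the observation around Figure 4.7):

    def fvc_coverage(fvc_frequencies, num_samples):
        # FVC coverage = (sum of FVC frequencies) / (number of samples)
        return sum(fvc_frequencies) / num_samples

    def dcp_enabled(fvc_frequencies, num_samples, ct=0.7):
        # Enable DCP for the next period only if coverage clears CT.
        return fvc_coverage(fvc_frequencies, num_samples) >= ct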

Figure 4.7 shows FVC coverage vs. compression rates across the workloads in Table 4.3. It is clear that higher compression rates are achieved with higher FVC coverage. In our set of workloads, using DCP with FVC coverage ≥ 0.7 achieves good compression rates (> 2). Figure 4.7 also shows some cases where larger FVC coverage yields lower compression. These cases represent workloads that exhibit sudden changes between frames; as a result, their temporal coherence is lower than that of other benchmarks with similar FVC coverage. Two examples from Figure 4.7 (the two large dips at the right end) are Unwind, which exhibits a UI with changing color brightness, and Super Hexagon, which exhibits an interface that continuously switches theme colors.

4.5 DCP Schemes

In the following sections we describe three variations of DCP: Baseline DCP, Adaptive DCP (ADCP), and Variable DCP (VDCP).

4.5.1 Baseline DCP

In baseline DCP, the CCD is constructed using all FVC entries; thus, the number of entries in the CCD will always match the FVC, and compressed blocks will have a fixed size of log2(FVC size) × (pixels per block) bits.

4.5.1.1 Memory Layout and Effective Compression Rates

Figure 4.8 shows the memory layout of a DCP-compressed surface. Space allocated to each DCP block (i.e., Block0, Block1, and Block2) is fixed with a certain size (S0, i.e., the size of an uncompressed block). On the other hand, the actual utilized space is determined by the size of the compressed data (S2). But because DRAM reads/writes data blocks using a number of bandwidth cycles that are burst-size multiples, a block that should be compressed by a rate of S0/S2 will have an effective compression rate of S0/S1, where S1 is the size of the DRAM bursts needed to read/write the compressed data.
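The burst rounding is easy to miss, so the following short function (illustrative Python; the 128-bit burst size comes from Table 4.1) makes the S0/S1 computation explicit:

    import math

    def effective_rate(s0_bits, s2_bits, burst_bits=128):
        """Effective compression rate S0/S1, where S1 rounds the compressed
        size S2 up to a whole number of DRAM bursts."""
        s1_bits = math.ceil(s2_bits / burst_bits) * burst_bits
        return s0_bits / s1_bits

    # A 2048-bit block compressed to 300 bits still occupies three 128-bit
    # bursts (384 bits): the effective rate is 2048/384 ~= 5.3, not 2048/300.
    print(effective_rate(2048, 300))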


Figure 4.8: DCP memory layout: space allocated to DCP blocks (e.g., Block0 and Block1) has a fixed size (S0), which is the size of an uncompressed block. However, the actual utilized space is determined by the size of the compressed data (S2). But, because DRAM reads/writes data blocks using a number of bandwidth cycles that are burst-size multiples, a block that should be compressed by a rate of S0/S2 will have an effective compression rate of S0/S1, where S1 is the size of the DRAM bursts needed to read/write the compressed data.

In this work, we use the effective compression rate, which reflects the reduction in memory bandwidth. In the remainder of this section, two variants of DCP (ADCP and VDCP) are introduced, in addition to a hybrid scheme combining DCP and RAS (HDCP).


Figure 4.9: DCP compression vs. CCD size for the Kindle and Facebook apps. Kindle benefits from a smaller CCD size, thus increasing the number of CCD entries decreases compression rates. On the other hand, Facebook benefits from a larger CCD size and achieves the highest compression rate using a CCD with 256 entries.

Algorithm 5 Predicting the optimal CCD size.
INPUTS: FrameSizePixels, PixelSizeBits, FVC_Val, Max_FVC_Size
  expected_frame_size = FrameSizePixels × PixelSizeBits   ▷ predicted compressed frame size in bits
  opt_CCD = 0                                             ▷ optimal # of CCD entries = 2^opt_CCD
  for i = 0 to log2(Max_FVC_Size) do
    sum = SumFrequencies(FVC_Val(0) to FVC_Val(2^i − 1))
    frame_size = sum × i + (FrameSizePixels − sum) × PixelSizeBits
    if frame_size < expected_frame_size then
      expected_frame_size = frame_size
      opt_CCD = i
    end if
  end for
  return 2^opt_CCD

4.5.2 Adaptive DCP (ADCP)

Adaptive DCP is a variation of DCP that uses the distribution of frequent color values in the FVC to adjust the number of CCD entries, i.e., the number of entries used in compression. ADCP looks for the best trade-off between the number of compressible blocks and the size of their encoding.

Frames of different applications, or different frames within the same application, may perform better or worse under larger or smaller palette sizes. Figure 4.9 shows DCP compression rates for Facebook and Kindle using 16- to 512-entry CCDs. Kindle, with its simple text, achieves higher compression rates using smaller CCDs. On the other hand, Facebook achieves the best compression rate using a 256-entry CCD. A larger CCD covers a wider range of values and is able to compress more blocks, while a smaller CCD uses smaller encoding sizes. For example, if a frame that uses 32-bit pixels with blocks that are 80% white, 18% blue, 1% black and 1% red uses a 2-entry CCD, 98% of the blocks can be compressed using 1 bit per pixel for a total compression rate of 19.75:1 (ignoring metadata overhead). Another option is to use a 4-entry CCD to compress all of the frame using 2 bits per pixel, producing a compression rate of 16:1.

ADCP optimizes CCD size for each case by actively predicting the optimal number of CCD entries. CCD size determines encoding sizes and, subsequently, the size of compressed blocks. ADCP uses the FVC to predict the optimal CCD size as shown in Algorithm 5. In Algorithm 5, FVC frequencies, sorted from most to least frequent in FVC_Val, are used as inputs. Note that to simplify calculations, DRAM burst size and pixel layout are ignored. ADCP has a negligible overhead; the number of iterations in Algorithm 5 depends on the number of FVC entries. For example, for a 64-entry FVC, the loop will only execute six times (i.e., log2(FVC size)).
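A direct Python rendering of Algorithm 5 is shown below (a sketch under the same simplifications the text notes: DRAM bursts and pixel layout are ignored; the 80/18/1/1 example from above is used as a test):

    import math

    def optimal_ccd_size(frame_pixels, pixel_bits, fvc_freqs, max_fvc_size):
        """Predict the optimal CCD size (Algorithm 5). `fvc_freqs` holds
        FVC frequencies sorted from most to least frequent."""
        expected = frame_pixels * pixel_bits  # uncompressed frame size, bits
        opt = 0
        for i in range(int(math.log2(max_fvc_size)) + 1):
            covered = sum(fvc_freqs[:2 ** i])  # pixels encodable with i bits
            size = covered * i + (frame_pixels - covered) * pixel_bits
            if size < expected:
                expected = size
                opt = i
        return 2 ** opt

    # 80% white, 18% blue, 1% black, 1% red: a 2-entry CCD wins (19.75:1).
    print(optimal_ccd_size(1000, 32, [800, 180, 10, 10], 4))   # -> 2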

4.5.3 Variable DCP (VDCP)

VDCP is another DCP variation that goes further than ADCP by adapting palette sizes to optimize compression at the sub-block level. VDCP uses variable-length coding by changing the number of rCCD entries used to encode/decode each sub-block.

Figure 4.10: VDCP example encoding. (a) shows the CCD entries, while (b) shows the encoding of 2-pixel blocks using the CCD in (a). (a) CCD: C0→000, C1→001, C2→010, C3→011, C4→100, C5→101, C6→110, C7→111. (b) Block encodings: {C2,C3}→{10,11}, CSB 010; {C0,C1}→{0,1}, CSB 001; {C0,C0}→{φ}, CSB 000; {C1,C5}→{001,101}, CSB 011; {C6,C7}→{110,111}, CSB 011; {C3,Cy}→uncompressed, CSB 111; {Cx,C4}→uncompressed, CSB 111.

VDCP reduces the number of encoding bits per pixel to

i = ceil(log2(max(pixel color index))) (4.1)

which means that i is determined by the pixel within the sub-block that has the highest index (i.e., lowest frequency) in the CCD. With VDCP, the metadata, i.e., the CSB, is used to determine the number of rCCD entries used for each sub-block. The number of rCCD entries equals 2^(CSB value) (i.e., encoded colors fall in the first 2^(CSB value) CCD entries). A special CSB value is used for uncompressed sub-blocks. Figure 4.10 shows a VDCP example. In Figure 4.10.a, an example CCD is

shown where the most frequent color value, C0, is encoded as 000 and the least frequent value, C7, is encoded as 111. Figure 4.10.b shows the VDCP encoding for seven 2-pixel sub-blocks. As shown in the figure, the CSB tracks each sub-block's encoding. 000₂ in the CSB indicates that only the most frequent color in the CCD, C0, is present in the sub-block, while 001₂ in the CSB indicates that the top 2 CCD colors, {C0 and C1}, are present, and so on. A CSB value of 111₂ is used to indicate uncompressed sub-blocks.

In Figure 4.10.b, the first row shows a sub-block with values C2 and C3; this means that only the top 2² entries in the CCD are used for encoding the sub-block. Subsequently, the corresponding CSB entry is set to 010₂, and each pixel color is encoded using 2 bits. The second sub-block in Figure 4.10.b contains {C0, C1}, encoded using the top 2¹ CCD entries (1 bit per color), and the corresponding CSB entry is set to 001₂. The third sub-block contains only C0, and the corresponding CSB value in this case is 000₂, which by itself indicates the content of the entire sub-block (since only the top color in the CCD is used). Note that CSB values 100₂ to 110₂ are not in use, as the CCD is shown with eight entries (instead of 64) to make this example easier to follow.
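The per-sub-block rule above reduces to a bit-length computation on CCD indices. The sketch below (illustrative Python; the function name and the returned tuple are ours) reproduces the Figure 4.10 example:

    def encode_sub_block(pixels, ccd, csb_raw=0b111):
        """VDCP encoding for one sub-block. `ccd` maps a color to its index,
        index 0 being the most frequent color. Returns (csb_value, codes);
        a CSB value v means the top 2^v CCD entries are used, i.e., v bits
        per pixel. `csb_raw` marks uncompressed sub-blocks."""
        if not all(p in ccd for p in pixels):
            return csb_raw, list(pixels)              # stored uncompressed
        bits = max(ccd[p] for p in pixels).bit_length()   # eq. (4.1)
        return bits, [ccd[p] for p in pixels]         # bits-wide codes

    ccd = {"C%d" % i: i for i in range(8)}            # the CCD of Fig. 4.10a
    print(encode_sub_block(["C2", "C3"], ccd))        # (2, [2, 3]) -> CSB 010
    print(encode_sub_block(["C0", "C0"], ccd))        # (0, [0, 0]) -> CSB 000
    print(encode_sub_block(["C3", "Cy"], ccd))        # (7, ['C3', 'Cy']) raw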

4.5.4 Hybrid DCP (HDCP)

DCP is only effective on a subset of applications, so ideally it should be used alongside other compression algorithms. We evaluate a Hybrid DCP that combines DCP with RAS. HDCP compresses each block using both DCP and RAS and uses the result with the higher compression rate. To support the additional compression modes, the number of metadata bits is increased. Results in Section 4.6 show that this technique produces higher compression rates than RAS and DCP.

4.5.5 DCP Implementation

In addition to the hardware structures (FVC, CCD and rCCD), DCP requires some support from the software layer. To implement DCP, the graphics driver attaches DCP data as part of the state associated with a surface (along with other state data like surface size and format). For VDCP, Algorithm 5 can be added to the driver as well, where it can calculate the next CCD size at the end of each frame.

4.6 Methodology

For our experiments, we used the configurations listed in Table 4.1. We calculated compression rates using a model that uses final surface values (i.e., the output from the tile buffer in tile-based GPU architectures). Our model works as follows (a sketch appears after the list):

• First, we feed the frames of each workload to our model, which then splits each frame into 8×8 blocks.

Baseline DCP Configurations
FVC size: 64 entries
FVC replacement policy: least-frequent value
CCD size: 64 entries
Pixel block size: 8×8
Pixel sub-block size: 2×2
CSB bits per sub-block: 1 (DCP, ADCP & HuffDCP), 3 (VDCP), 5 (HDCP)
Memory burst size: 128 bits
Pixel sampling rate: 1:1

Table 4.1: Baseline DCP configurations

• For each block, the model calculates the compressed size of each sub-block. The compressed and uncompressed sub-block sizes are summed to calculate the compressed size of the block.

• The compressed block size is then used to calculate the number of DRAM bursts required. The model then calculates the total bandwidth consumed by a compressed frame by summing the number of DRAM bursts of all the blocks in the frame.
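Putting the steps together, a sketch of the whole model (illustrative Python; 2048 bits corresponds to an 8×8 block of 32-bit pixels, and the burst size follows Table 4.1):

    import math

    def frame_compression_rate(block_sizes_bits, raw_block_bits=2048,
                               burst_bits=128):
        """Total compression rate of one frame: compare the burst-rounded
        bandwidth of all compressed blocks against the uncompressed frame."""
        used = sum(math.ceil(b / burst_bits) * burst_bits
                   for b in block_sizes_bits)
        raw = len(block_sizes_bits) * raw_block_bits
        return raw / used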

Note that the model computes compression rates starting from the second frame, using the first frame to populate the first FVC and CCD. We evaluated surface compression using our set of randomly chosen popular Android applications (Table 4.2). We split applications into three groups: UI applications, 2D applications, and 3D games. All of our benchmarks use OpenGL ES and render to a single target buffer (up to OpenGL ES 2.0, MRT is only supported through vendor extensions [121]). We manually interacted with each application to execute a simple task.

In total, we used 34468 frames that represent 124 applications (as summarized in Table 4.3). We only consider regions of interest in each workload that represent the typical use case of the workload (i.e., loading/initialization frames are not considered). The rest of the configurations are listed in Table 4.1. The effective compression rate and metadata overhead are taken into account when calculating the total compression rate. We use a block size of 8×8 pixels (256 bytes), which matches the

block sizes used by RAS. In addition to DCP, we evaluate two lossless methods described in Section 4.2: RED, which uses NVIDIA's compression [178], and RAS, which is based on the work of Rasmusson et al. [198].

UI (1–92): 1 Android Settings, 2 Morecast, 3 Poweramp, 4 Speedest, 5 Twitter, 6 Facebook, 7 Twitch, 8 Wish, 9 Imgur, 10 Soundcloud, 11 Automate, 12 Musixmatch, 13 Airbnb, 14 CBS Sports, 15 Etsy, 16 Android Home, 17 Pinterest, 18 Aldiko, 19 Letgo, 20 Yelp, 21 Android Messaging, 22 BBC iPlayer, 23 Tachiyomi, 24 gReader, 25 Google Maps, 26 Pocket, 27 ES File Explorer, 28 Chrome, 29 Applock, 30 Accuweather, 31 Flipboard, 32 Booking.com, 33 Shazam, 34 Zedge, 35 Indeed, 36 Runkeeper, 37 Steam, 38 Khan Academy, 39 The Weather Channel, 40 Yahoo Finance, 41 Tapatalk, 42 Kickstarter, 43 Amazon Store, 44 Zomato, 45 Spotify, 46 Runtastic, 47 theScore, 48 Food Network, 49 MX Player, 50 VLC, 51 Yellowpages, 52 Eye in the Sky, 53 OfficeSuite, 54 Dictionary.com, 55 Walgreens, 56 Walmart, 57 CNN, 58 File Commander, 59 Terminal Emulator, 60 Adobe Acrobat, 61 Android Call, 62 Gallery, 63 Feedly, 64 Baconreader, 65 aCalendar, 66 Bakareader, 67 Kindle, 68 eBay, 69 Venmo, 70 Mcdonalds, 71 Colornote, 72 Reddit, 73 Checkout 51, 74 Tasker, 75 IFTTT, 76 Textra, 77 WPS Office, 78 People Contacts, 79 Unit Conv. Ult., 80 Skyscanner, 81 Calendar, 82 Merriam Webster, 83 ESPN, 84 Tumblr, 85 Quickpic, 86 Duolingo, 87 Clock, 88 Google Messenger, 89 Calculator, 90 Soundhound, 91 Translate, 92 Any.do

2D (93–112): 93 Candy Crush Saga, 94 Trainyard, 95 Mines, 96 Cut the Rope 2, 97 Angry Birds, 98 Strata, 99 Brain it On, 100 Super Hexagon, 101 Unwind, 102 Color Switch, 103 Impossible Game, 104 Flow, 105 2048, 106 Gyro, 107 99 Problems, 108 Dumb Ways to Die, 109 Piano Tiles, 110 loop, 111 Ultraflow, 112 Okay

3D (113–124): 113 Traffic Rider, 114 Extreme Car Driving, 115 3D Bowling, 116 Dr. Driving, 117 Paper Toss, 118 Rolling Sky, 119 Stack, 120 Zigzag, 121 Stargather, 122 Commute H. Traffic, 123 Crossy Road, 124 Smashy Road

Table 4.2: List of Android workloads.

RAS is a prediction-based algorithm that predicts the value of a pixel using neighboring pixel values. The difference between the prediction and the actual value is then encoded using Golomb-Rice coding. We used the parameters suggested by Rasmusson et al. [198], namely 8×8 blocks; as described in the paper, we set the value of the Golomb-Rice parameter k by exploring values between 0 and 6, use k = 7 for the "special mode", and use the suggested "3 sizes mode" for higher compression rates. We organize color values by their color channel as described in Ström et al. [226]. We experimented with RAS using the RGBA and YCoCg formats and found that for many applications, particularly UI and 2D, RAS shows favorable results using RGBA channels, so we used RAS with RGBA channels in our comparison.

For the CSB, DCP and ADCP use 1 bit per sub-block. VDCP uses 3 bits per sub-block; with an FVC size of 64, seven combinations are used (1, 2, 4, 8, 16, 32 and 64 entries), plus a combination for non-compressed sub-blocks. To compare against techniques that use Huffman coding [217], a Huffman-coded DCP (HuffDCP) is implemented, where FVC frequencies are used to construct a CCD with variable-length Huffman coding.

4.7 Results and Discussion

4.7.1 DCP Schemes

To compare DCP schemes, we isolate the effect of memory burst size and only take into account the CSB overhead. Later, bursts are taken into account when comparing DCP, RAS and RED. We compare baseline DCP against ADCP, VDCP and HuffDCP.

Figure 4.11a shows compression rates in each category sorted by baseline DCP compression rate (the same order used in Table 4.2). The figure shows that the baseline DCP is the least effective scheme. UI applications 1 , such as Android Settings, Morecast, and Poweramp, show low compression rates of less than 2. After examining these applications, we found that they feature gradient backgrounds and graphical elements that DCP cannot compress using small palettes. In 2 (Zedge) and 3 (Spotify), ADCP achieves higher compression rates than

System Configurations
Operating system: Android 4.2.2 (API 17)
Display size: 720×1280

Android workloads
UI applications: 92 (24031 frames)
2D applications: 20 (7888 frames)
3D applications: 12 (2549 frames)
Total: 124 applications (34468 frames)

Table 4.3: System configurations and workloads summary.

HuffDCP and VDCP. Looking at these applications, we found that they contain a mix of solid backgrounds and frames that contain images, which DCP will mostly be unable to compress. ADCP can compress the frames with solid backgrounds at a lower cost than VDCP since it has a smaller CSB overhead. In frames containing images, neither ADCP nor VDCP performs well, but ADCP incurs a lower CSB overhead.

For applications with simple color schemes, such as OfficeSuite 4 and Any.do 5 , baseline DCP achieves high compression rates; nevertheless, VDCP, ADCP, and HuffDCP were all able to achieve even higher compression rates.

Looking at 2D applications, performance varies significantly. In 6 , applications with sophisticated graphics, like Candy Crush, Trainyard, Mines, Cut the Rope and Angry Birds, have low compression rates (< 1.7). On the other hand, applications using simpler graphics (e.g., loop, Ultraflow, and Okay) achieve high compression rates, especially with VDCP 7 . A similar trend is exhibited in 8 , where graphically rich 3D games (Traffic Rider, Extreme Car Driving, and 3D Bowling) show low compression rates. On the other hand, games like Smashy Road show good compression rates (the highest with VDCP, at 10.63). Also, Stargather 9 , with characteristics similar to the UI applications in 2 and 3 , shows higher rates with HuffDCP and VDCP.

Figure 4.12 summarizes the results in Figure 4.11a. VDCP shows the best compression rates for UI and 2D applications, with 5.56 and 3.02, respectively. For 3D

games, HuffDCP shows the highest rate (1.80). HuffDCP does better with 3D workloads since its compression rates are similar to VDCP's but come with a lower CSB overhead. Interestingly, using Huffman encoding in HuffDCP achieves lower compression rates than VDCP in UI and 2D workloads. This is due to Huffman inefficiencies with probability distributions that are not exact powers of two. For example, if we have 32-bit values with frequencies of A (49.5%), B (49.5%), C (0.5%) and D (0.5%), then Huffman encoding will assign codes of 1 bit to A, 2 bits to B and 3 bits to C and D, for a total compression rate of 21.12. ADCP and VDCP encode A and B using 1 bit while keeping C and D uncompressed, resulting in a compression rate of 24.4.

Figure 4.11: Compression rates for Android workloads, ordered from left to right following their order in Table 4.2 (regions 1–92: UI; 93–112: 2D; 113–124: 3D). (a) Comparing DCP schemes' compression rates across application categories. (b) Comparing RAS, RED, and VDCP compression rates across application categories.

Figure 4.12: Harmonic mean of DCP schemes' compression rates per application category.

4.7.2 Comparing VDCP, RAS and RED

Figure 4.11b compares VDCP with RAS and RED, and Figure 4.13 summarizes the results in Figure 4.11b. Both memory bursts and CSB overheads are taken into account. For UI applications, VDCP achieves a mean effective compression rate of 5.26, compared to 2.73 for RAS and 2.77 for RED. VDCP performs well with UI and 2D applications. On the other hand, RAS, a more generic compression algorithm, has consistent performance across all workloads. RAS outperforms VDCP in 3D games (2.54, compared to 1.75 for VDCP). Similar to VDCP, RED performs well with UI and 2D applications, but with lower rates than VDCP. VDCP's performance with 3D workloads is the reason we suggest a hybrid approach consisting of DCP and another general-purpose compression algorithm, similar to what is described in some implementations [133, 178]. A hybrid VDCP+RAS scheme is discussed in Section 4.7.5.

4.7.3 Factors Affecting FVC Fidelity

In this section, we discuss and quantitatively evaluate four factors that affect the FVC and should be considered when designing DCP.

Figure 4.13: Comparing RAS, RED and VDCP effective compression rates across application categories.

4.7.3.1 FVC Size

Larger FVC sizes can capture frequent colors more accurately, as they are less likely to evict a frequent value from the FVC because of capacity. To evaluate how FVC size affects accuracy, we use relative coverage. For an N-entry FVC, we calculate relative coverage by dividing the number of pixels represented by the N top colors collected by the FVC by the number of pixels represented by the actual N most frequent colors. Figure 4.14a shows the effect of FVC size for UI applications. A 16-entry FVC has a relative coverage of 94%, compared to 98.3% for a 512-entry FVC. This means a 16-entry FVC is able to capture colors that cover 94% of the area covered by the actual 16 most frequent colors, while the 512-entry FVC captures 98% of the coverage of the actual 512 most common colors. Figure 4.14b shows how this accuracy affects compression rates, as frequencies collected using larger FVCs are a better representation of the actual most common colors.

4.7.3.2 Replacement Policy and Associativity

We evaluated a number of replacement policies: the baseline least-frequent color (LFC), second least-frequent color (2LFC), least-recently-used (LRU), and random replacement. The idea behind including 2LFC is to see the effect of avoiding thrashing newly discovered colors that are prone to eviction. Using UI workloads with a 64-entry FVC, the mean compression rate with LFC is 5.26, while it is 5.25 for 2LFC.

Figure 4.14: Comparing FVC size with relative coverage (a) and its effect on compression rates (b).

Figure 4.15: The compression rate of a 64-entry FVC at different associativity values (normalized to the compression rate of a fully associative FVC).

Figure 4.16: Normalized compression rates vs. FVC pixel sampling rate for UI applications.

On the other hand, LRU and random replacement achieve lower rates of 3.04 and 2.92, respectively. We also evaluated changing the FVC associativity from fully associative to direct-mapped, using color channel values to determine the set. As expected, FVC performance degrades as the number of sets increases (as shown in Figure 4.15).

4.7.3.3 Pixel Sampling

We noticed that FVC can be constructed using a subset of frame pixels, i.e., by sampling them using only one in every nth pixel to collect frequent colors statistics. Figure 4.16 illustrates the effect of pixel sampling on VDCP. We evaluate sam- pling rates from 1:1 (every pixel accesses the FVC) to 1:16384. 1:16 sampling achieves 98.7% (UI) and 102% (2D) of the compression achieved by 1:1 sampling. We expect that the slightly higher compression rate for 2D workloads is caused by sampling working as a noise filter.

4.7.3.4 Frame Sampling

In frame sampling, the same CCD is used to compress a number of frames (N) instead of just one frame. We vary the sampling period (N) for VDCP between 1 (every frame) and 60 frames. Figure 4.17 shows compression rates relative to N=1.

Figure 4.17: VDCP compression rates vs. FVC frame sampling period, normalized to a sampling period of 1.

VDCP maintains good compression rates with N=2, with a relative compression rate of 97%. Compression rates, however, decrease significantly at higher N values, falling to 44.6% and 43.3% for N values of 50 and 60, respectively.

4.7.4 Implementation Cost and Energy Savings

Section 4.5 mentions the storage requirements associated with DCP. Specifically, for a 64-entry FVC/CCD, 456 bytes are needed for the FVC and 264 bytes for the CCD. The cost of the rCCDs is (264 bytes) × (maximum number of surfaces that can be read in parallel); current systems support up to 16 surfaces [244].

For energy, we used DRAMPower v4.0 [47] to estimate the energy cost of accessing a MICRON 1600 x32 LPDDR3 DRAM. We found the cost of DRAM accesses to be around 451.2 pJ/byte (this number excludes DRAM idle energy and other system energy costs like the interconnection network). For DCP, we used CACTI v7.0 [39] with the 22nm process to estimate the area/energy/latency of the DCP structures, as shown in Table 4.4. Using the numbers in Table 4.4, the total DCP area cost with support for 16 surfaces equals 0.009527672 mm². To put this area in the context of current hardware, it is less than 0.003% of NVIDIA's Xavier die area [63]. For the dynamic energy cost of compressing/decompressing a byte using DCP, we found it to be around 1.3 pJ/byte, i.e., less than 0.29% of the DRAM access cost.

Structure: FVC | CCD | rCCD
Type: CAM | CAM | Cache
Area cost (mm²): 0.00304232 | 0.00197988 | 0.000281592
Leakage power (mW): 0.881899 | 0.39 | 0.281112
Access cost (pJ): 0.766572 | 0.402 | 0.104106
Access latency (ns): 0.131695 | 0.128338 | 0.0722227
Max bandwidth (MPixels/s): 7241 | 7431 | 13203

Table 4.4: DCP structures hardware cost and performance estimations.

4.7.4.1 Energy Savings

DRAM consumes around 199.6 mW (629.4 mW including static power) when operating on a single framebuffer at a typical rate of 60 FPS using HD frames (GPU writing/display controller reading, or 949.21 MB/s). This number is 1.5× higher in newer devices with a refresh rate of 90 FPS. We calculated DCP's total compression/decompression static and dynamic energy consumption (4.83 pJ/byte) and compared it to DRAM dynamic energy consumption alone (451.2 pJ/byte). We found that VDCP reduces the energy consumed by framebuffer operations by 80.9% for UI apps, 65.8% for 2D apps, and 42.7% for 3D apps.

Finally, looking at the big picture, we want to consider the overall effect of DCP on battery lifetime and user experience. Obtaining SoC power breakdowns is difficult, as such power and energy data are vendor- and device-specific and are not published. In addition, DCP will typically be deployed alongside other memory optimization schemes like compaction or compression algorithms specific to other domains, such as video streaming. Numbers from some devices, however, show that memory consumes a significant amount of power, 0.5×–1.0× the power consumed by the CPU cores [12, 200], which translates to 20–30% of total SoC power [200]. Under the conservative estimate that GPU operations consume only 25% of DRAM bandwidth, DCP's total device power savings fall between 2% and 6%.

Figure 4.18: Ratio of DCP vs. RAS compressed blocks in the hybrid scheme across all workloads (workloads with the lowest ratio of VDCP-compressed blocks are on the left; average VDCP ratio: 0.49).

4.7.5 Hybrid Schemes

Our hybrid compression scheme uses RAS and VDCP. We compress using both algorithms and use the better of the two. This exploits VDCP's high compression rates for simpler surfaces while falling back on RAS for other cases. RAS+VDCP outperforms RAS and VDCP alone (with rates of 7.2, 5.206 and 3.23 for UI, 2D and 3D applications, respectively). The ratio of VDCP- vs. RAS-compressed blocks varies by application, as shown in Figure 4.18; however, we found that, on average, VDCP and RAS compress an equal number of blocks.

A RAS+VDCP scheme can be implemented in hardware without the need to fully compress each block using both algorithms. This can be achieved by adding a stage that checks each pixel value in a block against the CCD. If all the values in a block are present in the CCD, the block is compressible by VDCP, and VDCP can be used in this case; otherwise, the block cannot be compressed with VDCP, and RAS is used instead.
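A sketch of this staged check (illustrative Python; the compressor callbacks are placeholders for the two pipelines):

    def hybrid_compress(block_pixels, ccd, vdcp_compress, ras_compress):
        # If every pixel hits the CCD, the block is VDCP-compressible;
        # otherwise fall back to the generic RAS pipeline.
        if all(p in ccd for p in block_pixels):
            return "VDCP", vdcp_compress(block_pixels)
        return "RAS", ras_compress(block_pixels)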

Figure 4.19: Power efficiency in terms of mW per byte/s across mobile Low-Power DRAM (LPDDR) generations, normalized to LPDDR2 (data source: Kim [110]).

4.8 Compression and Future Memory Trends

Newer mobile DRAM chips provide significant improvements in power consumption compared to previous generations [110]. Figure 4.19 highlights the progress achieved in improving DRAM power efficiency. For example, the most recent LPDDR generation, LPDDR5, consumes 74% less power than LPDDR2. Despite these improvements, however, SRAMs remain orders of magnitude more power efficient. Table 4.5 shows that VDCP compression achieves a similar reduction in energy consumption across the different DRAM generations.¹

4.9 Conclusion

This work presents surface compression techniques that reduce the off-chip bandwidth of framebuffer operations in energy-constrained mobile devices. In this work, we analyze and characterize the framebuffer surfaces of UI, 2D and 3D applications and highlight the unique characteristics of each. To evaluate our compression schemes, we created and used a set of workloads that represents 124 popular mobile applications.

¹ Because our tools lack support for smaller feature sizes [39], we used the same VDCP energy numbers estimated using a 22nm process, as detailed in Section 4.7.4. Looking at newer processes, however, even just going from 16nm to 7nm SRAMs achieves a power reduction of over 65% [252].

DRAM type — VDCP reduction in energy (UI | 2D | 3D)
LPDDR 2: 81.2% | 66.1% | 43.0%
LPDDR 3: 80.9% | 65.8% | 42.7%
LPDDR 4: 80.5% | 65.4% | 42.3%
LPDDR 4X: 80.2% | 61.1% | 42.0%
LPDDR 5: 79.9% | 64.8% | 41.7%

Table 4.5: Estimating the reduction in energy consumption of VDCP (as detailed in Section 4.7.4) compared to recent mobile DRAM technologies. The cost of accessing DRAM is an order of magnitude higher than that of on-chip buffers; as a result, compression achieves a comparable reduction in energy consumption even when implemented alongside more power-efficient DRAMs.

Our results show that VDCP improves compression by an average of 93% relative to RAS for UI applications, while improving UI and 2D applications over RED by 89% and 50%, respectively. DCP focuses on 2D and UI applications and can complement other generic compression algorithms. We evaluated a hybrid VDCP+RAS (HDCP) scheme; the scheme was able to increase compression rates by 163%, 79% and 27% over RAS, and by 159%, 169% and 139% over RED for UI, 2D and 3D applications, respectively.

Chapter 5

Conclusion and Future Work

Mobile SoCs have emerged as the most ubiquitous computing platforms [116]; yet, there has been a surprising lack of research in this area — only 1% of the research published between 2007 and 2017 in top computer architecture venues focused on mobile architectures [200]. This is partly because of the lack of understanding of how such complex devices operate and the practical challenges they face [200], and partly because studying such complex systems requires developing equally complex research tools.

The work in this dissertation addresses several research questions regarding mobile SoC architectures, such as how to study mobile architectures, why detailed modeling matters, what characteristics distinguish NoC traffic in mobile SoCs, and how to incorporate information on user behavior when designing hardware for these user-centric interactive devices. The remainder of this chapter highlights the significance and contributions of the work presented in this dissertation and its potential impact on the field, comments on strengths and limitations, and discusses future research directions.

5.1 Summary of Chapter 2

Chapter 2 presents Emerald, a research infrastructure that facilitates studying and evaluating architectural proposals using a detailed cycle-accurate model, and that provides the ability to study graphics in the context of SoCs.

The significance of this work comes from the fact that mobile SoCs are interactive devices, where graphics plays a crucial and integral role in how these devices are used.

This work goes beyond providing the research infrastructure to make the case for detailed modeling of such systems. Some previous proposals target the early inception phases of architecture research [98], while others use simplified or brief models [49, 60]. This work demonstrates, through the presented use cases, that some architectural proposals must be evaluated using elaborate models capable of capturing the synergies between IPs, especially when working on system-level optimizations.

Another contribution of Emerald is bringing modern graphics hardware architecture into the discussion. GPGPU applications and research have grown significantly in the last decade; graphics, however, unsurprisingly remains the most important class of workloads that runs on GPUs. As a byproduct of Emerald, researchers can use the standalone mode to conduct experiments that focus on graphics architecture, or to evaluate their proposals on a mixed set of GPGPU and graphics workloads. The DFSL use case in Section 2.7 offers an example of such a research direction.

Notwithstanding the presented contributions, there are several challenges we would like to overcome in follow-up work, including improving the detail of some GPU components, such as texturing units and compression. A major challenge in this respect is that most details of contemporary graphics hardware architectures are unpublished, and exact implementations vary widely between vendors. On another front, further work is required to support a wider range of graphics workloads that use geometry and tessellation shaders. The fact remains that realizing a fully functioning OpenGL implementation is a challenging endeavor. For example, we use Mesa 3D for parts of the functional model and as the API state tracker; however, even Mesa, which has been around for over 20 years and has over 640 contributors [151, 152], features an incomplete OpenGL software implementation that is yet to support major extensions, such as tessellation [69].

In addition to improving the existing graphics model, future plans for Emerald include integration with other tools to provide a comprehensive SoC model, where Emerald is part of a larger framework for SoC research. Some researchers have already expanded the gem5 framework with additional models, such as Amber [78] for SSDs and Aladdin [213] for small-scale accelerators.

However, a comprehensive SoC model requires other key IPs, such as modems, display engines, DSPs, image processors, and AI engines. Finally, it is worth mentioning that this work received direct support from leading industry partners (Qualcomm and Google), who saw the value of this line of work and recognize SoCs as one of the most interesting, challenging, and under-researched areas of computer architecture.

5.2 Summary of Chapter 3

Chapter 3 examines an overlooked area of NoC research: the architecture of heterogeneous SoC NoCs. Earlier research focused primarily on designing NoCs for "heterogeneous" systems that serve a specific task or use case (e.g., image and video processing) [48, 165, 166, 223, 258]. Modern heterogeneous SoCs, on the other hand, represent a new class of heterogeneous systems, which encompass a set of complex IPs that, collectively, support a diverse set of use cases. Designing NoCs for heterogeneous SoCs, however, has remained mostly a problem confined to the realm of industry.

Our work on Emerald motivated this project; noticing the interesting traffic patterns exhibited by IPs prompted us to examine this problem closely. We used the insight obtained from our Emerald model and carried out a number of studies to design an architecture that handles the intermittent asymmetry of traffic from IPs across use cases.

The first main contribution of this work is ISI, a solution for NoCs that handles the time-variant asymmetric traffic patterns exhibited by SoC IPs. ISI provides a simple yet effective solution that can be adapted to different baseline NoC topologies and protocols. The second main contribution is shedding light on a neglected area of NoC research: industry has considered the challenges of SoC NoCs for some time, while academia has largely ignored this topic among other SoC research areas. Finally, the third main contribution is presenting an example that demonstrates the need for comprehensive SoC research tools, as highlighted in the previous section. To carry out our work on ISI, we had to rely on traces generated from Emerald and other standalone models rather than a detailed SoC model, which limited the solution space we could explore.

While ISI addresses a significant aspect of SoC NoC traffic, detailed models would be capable of uncovering a number of other new problems by revealing the subtle synergies between IPs under various use cases. Effectively, this work offers the research community a window onto a set of possible research problems and directions in this field.

We see future work including use-case-driven NoC designs, which depart from typical symmetric and regular NoC designs and pursue design tools that are unhesitant to use irregular or asymmetrical architectures, designed with a holistic view of how SoCs are used as well as user experience. Industry seems to be moving in this direction too, with some companies specializing in NoC generation tools that synthesize NoCs according to the specific needs of each SoC [34, 135].
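As a toy illustration of the time-variant asymmetry discussed above, the Python sketch below generates per-IP injection rates in which a GPU-like IP bursts during part of each frame while a display-like IP injects steadily; the IP mix, rates, and burst shape are hypothetical stand-ins, not the trace methodology of Chapter 3:

    import random

    # Toy generator for time-variant asymmetric NoC traffic. The IP
    # behaviors and rates below are hypothetical illustrations, not
    # the traces used in Chapter 3.
    FRAME_LEN = 2000

    def injection_rates(num_cycles=10000, frame_len=FRAME_LEN):
        rates = {"cpu": [], "gpu": [], "display": []}
        for t in range(num_cycles):
            rendering = (t % frame_len) < frame_len // 3  # GPU bursts early in the frame
            rates["gpu"].append(0.60 if rendering else 0.05)
            rates["display"].append(0.15)                 # steady scan-out traffic
            rates["cpu"].append(random.uniform(0.02, 0.10))
        return rates

    r = injection_rates()
    burst_end = FRAME_LEN // 3
    burst = r["gpu"][:burst_end]
    idle = r["gpu"][burst_end:FRAME_LEN]
    print(f"gpu burst rate: {sum(burst) / len(burst):.2f}")
    print(f"gpu idle rate:  {sum(idle) / len(idle):.2f}")

A symmetric NoC provisioned for the average injection rate would saturate during the GPU's burst window; this is the kind of time-varying imbalance that ISI's interleaved source injection is designed to absorb.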

5.3 Summary of Chapter 4

Chapter 4 proposes a compression scheme (DCP) for graphics surfaces that targets common use cases on mobile SoCs. DCP exploits inter-frame coherence, or inter-frame similarity, in graphics to generate and optimize custom compression palettes for each frame, or set of frames.

This work was motivated by the cost incurred when accessing off-chip memory in mobile SoCs [169, 184]. Mobile SoCs are predominantly interactive devices, and graphics workloads constitute the primary use cases; thus, a major avenue of improvement is reducing the cost of off-chip memory accesses for graphics-based workloads. While graphics workloads vary in their use of optional off-chip buffers, such as textures and depth buffers, they all produce framebuffer surfaces that are reused multiple times throughout their lifetime, as detailed in Section 4.2.1.

DCP takes an approach (inter-frame coherence) that is widely used in video compression and applies it in the context of framebuffer compression. Video frame encoding/decoding algorithms, such as H.264 [249], are elaborate and consume extensive compute and energy resources, and applying such techniques to framebuffers is impractical.

DCP, on the other hand, exploits the main property of framebuffer coherence with a simpler and more practical scheme based on color frequencies.

DCP's design follows the principles of this dissertation, which consider the effect of a solution on the system as a whole. When designing DCP, we looked at how framebuffer surfaces are used beyond the compressed surface itself; we considered how memory accesses are affected, how composition engines use surfaces, and how to handle the assumptions of the software stack (i.e., the graphics API).

Another contribution of this work is using user behavior to guide the solution. DCP proposes an additional compression path that targets UI and 2D mobile applications, which account for over 70% of mobile devices' use time [70, 172]. Dark silicon in modern chips [68] justifies adding such specialized hardware blocks for specific tasks; other examples in GPUs include tensor cores [179] and ray-tracing units [180].

For follow-up work, there are two possible extensions: palette localization and lossy compression. With palette localization, a palette is created for each region of a frame rather than for the whole frame. This allows creating smaller palettes for frame regions, yielding higher compression rates. Furthermore, region sizes can be determined dynamically according to frame content.

Lossy compression would allow even higher compression rates. A significant challenge, however, is controlling error accumulation, as surface data can be used multiple times, for example in texturing and composition, as discussed in Chapter 4. A system-wide approach should be adopted to control surface quality over its lifetime. One possible way to track surfaces is to assign each surface a "quality score" derived from the scores of the surfaces used to composite it. Lossy compression would then use surface scores to decide how aggressively it can compress, and whether lossy compression is possible at all. Adopting such a solution requires system-wide changes spanning the hardware, the graphics APIs, and the OS.
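To make the frequency-based idea concrete, here is a minimal Python sketch of palette compression in the spirit of DCP: the color frequencies of one frame build the palette used to encode the next frame, with an escape path for colors that miss the palette. The palette size, index width, and escape encoding are simplified placeholders, not the hardware format of Chapter 4:

    from collections import Counter

    # Minimal sketch of frequency-based palette compression in the
    # spirit of DCP. The palette size and escape encoding are
    # simplified placeholders, not the actual hardware format.
    PALETTE_SIZE = 16  # 4-bit indices in this toy version

    def build_palette(frame):
        """Pick the most frequent colors of a frame as the next frame's palette."""
        freq = Counter(frame)
        return [color for color, _ in freq.most_common(PALETTE_SIZE)]

    def encode(frame, palette):
        """Encode pixels as palette indices; fall back to raw color on a miss."""
        index = {c: i for i, c in enumerate(palette)}
        out = []
        for pixel in frame:
            if pixel in index:
                out.append(("idx", index[pixel]))  # 4-bit palette index
            else:
                out.append(("raw", pixel))         # escape: 32-bit raw color
        return out

    # Inter-frame coherence: frame N builds the palette, frame N+1 uses it.
    frame_n  = [0xFF000000] * 900 + [0xFFFFFFFF] * 100  # mostly-black UI frame
    frame_n1 = [0xFF000000] * 880 + [0xFFFFFFFF] * 110 + [0xFF112233] * 10
    palette = build_palette(frame_n)
    encoded = encode(frame_n1, palette)
    bits = sum(4 if kind == "idx" else 32 for kind, _ in encoded)
    print(f"compression ratio: {len(frame_n1) * 32 / bits:.1f}x")

On a mostly flat UI frame like this one, nearly every pixel hits the small palette, which is why a frequency-based palette is most effective for UI and 2D content.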

5.4 Conclusion

This dissertation highlights a set of research problems in mobile SoCs, provides methods to study them, and proposes solutions that address several challenges faced by mobile SoCs today. The presented work studies and proposes solutions across SoC components: IPs, the interconnection network, and the memory system.

In this dissertation, we aspired to depart from siloed solutions and adopted a system-wide approach, recognizing SoCs as integrated, interdependent sets of components. In Chapter 2, we looked at the role of graphics in mobile SoCs and made the case for the continuing need for detailed modeling and evaluation, even for such complex systems. In Chapter 3, we shed light on the traffic asymmetry exhibited in SoC NoCs and opened the door for further work in this area. In Chapter 4, we showed how user behavior can be taken into consideration when designing architectural solutions, and how we considered the system-wide consequences of a problem that had typically been studied in isolation from the rest of the system.

Bibliography

[1] T. M. Aamodt, W. W. Fung, I. Singh, A. El-Shafiey, J. Kwa, T. Hetherington, A. Gubran, A. Boktor, T. Rogers, A. Bakhoda, and J. H. GPGPU-Sim 3.x manual, 2015. URL http://gpgpu-sim.org/manual/index.php/Main_Page. [Online; accessed 1-Mar-2019]. → pages 33, 204

[2] M. Abrash. Rasterization on Larrabee. Dr. Dobb's Journal, 2009. URL http://www.cs.cmu.edu/afs/cs/academic/class/15869-f11/www/readings/abrash09_lrbrast.pdf. [Online; accessed 25-Apr-2019]. → pages 26, 38

[3] N. Agarwal, T. Krishna, L.-S. Peh, and N. K. Jha. GARNET: A detailed on-chip network model inside a full-system simulator. In Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pages 33–42. IEEE, 2009. doi:10.1109/ISPASS.2009.4919636. → page 116

[4] N. Agrawal, A. Jain, D. Kirkland, K. Abdalla, Z. Hakura, and H. Kethareswaran. Distributed index fetch, primitive assembly, and primitive batching, June 22 2017. US Patent App. 14/979,342. → page 45

[5] T. Akenine-Möller and J. Ström. Graphics for the masses: a hardware rasterization architecture for mobile phones. In ACM Transactions on Graphics (TOG), volume 22, pages 801–808. ACM, 2003. doi:10.1145/1201775.882348. → page 130

[6] T. Akenine-Möller, E. Haines, and N. Hoffman. Real-time rendering. AK Peters/CRC Press, 2018. URL https://www.realtimerendering.com. → pages 28, 41

[7] AMD. GCN3 GPU model. URL https://www.gem5.org/documentation/general_docs/gpu_models/GCN3. [Online; accessed 1-Aug-2020]. → pages 88, 90

[8] AMD. AMD's revolutionary Mantle graphics API adopted by industry leading game developers Cloud Imperium, Eidos-Montréal and Oxide, 2013. URL https://www.amd.com/en/press-releases/amds-revolutionary-mantle-2013nov4. [Online; accessed 1-Aug-2019]. → page 17

[9] AMD Technologies Group. Radeon’s next-generation Vega architecture, 2017. URL https://en.wikichip.org/w/images/a/a1/vega-whitepaper.pdf. [Online; accessed April 25, 2019]. → pages 31, 32

[10] G. M. Amdahl. Validity of the single processor approach to achieving large scale computing capabilities. In Proceedings of the April 18-20, 1967, Spring Joint Computer Conference, pages 483–485. ACM, 1967. doi:10.1145/1465482.1465560. → page 5

[11] C. Amsinck, E. B. Lum, B. Rodgers, T. Louca, C. Rouet, and J. Dunaisky. Updating depth related graphics data, Dec. 4 2014. US Patent App. 13/907,711. → pages 19, 20

[12] AnandTech. The Samsung Exynos 7420 deep dive - inside a modern 14nm SoC, 2015. URL https://www.anandtech.com/show/9330/exynos-7420-deep-dive/2. [Online; accessed 1-Aug-2019]. → pages 93, 160

[13] M. Anderson, A. Irvine, N. Kamath, C. Yu, D. Chuang, Y. Tian, and Y. Qi. Graphics pipeline and method having early depth detection, 2004. US Patent 8184118. → page 38

[14] M. Anderson, A. Irvine, N. Kamath, C. Yu, D. Chuang, Y. Tian, and Y. Qi. Graphics pipeline and method having early depth detection, Sept. 8 2005. US Patent App. 10/949,012. → pages 19, 38

[15] Android. Android: Graphics architecture. URL https://source.android.com/devices/graphics/architecture.html. [Online; accessed 1-Aug-2019]. → pages 14, 129

[16] Android. Android Goldfish OpenGL. URL https://android.googlesource.com/device/generic/goldfish-opengl. [Online; accessed 1-Aug-2019]. → page 61

[17] Android. SurfaceFlinger and hardware composer. URL https://source.android.com/devices/graphics/arch-sf-hwc.html. [Online; accessed 1-Aug-2019]. → pages 128, 129

[18] Android Developers Guide. Reduce overdraw. URL https://developer.android.com/topic/performance/rendering/overdraw. [Online; accessed 1-Aug-2019]. → page 28

[19] M. Anglada, E. de Lucas, J.-M. Parcerisa, J. L. Aragón, and A. González. Early visibility resolution for removing ineffectual computations in the graphics pipeline. In Proceedings of the IEEE International Symposium on High-Performance Computer Architecture (HPCA), pages 635–646. IEEE, 2019. doi:10.1109/HPCA.2019.00015. → page 89

[20] I. Antochi, B. Juurlink, S. Vassiliadis, and P. Liuha. Memory bandwidth requirements of tile-based rendering. In Computer Systems: Architectures, Modeling, and Simulation, pages 323–332. Springer, 2004. doi:10.1007/978-3-540-27776-7_34. → pages 28, 130

[21] APITrace. APITrace. URL http://apitrace.github.io. [Online; accessed April 25, 2019]. → pages 60, 200

[22] Apple. OpenGL ES Programming Guide: Concurrency and OpenGL ES. URL https://developer.apple.com/library/archive/documentation/3DDrawing/Conceptual/OpenGLES_ProgrammingGuide/ConcurrencyandOpenGLES/ConcurrencyandOpenGLES.html. [Online; accessed 1-Jun-2018]. → page 210

[23] Apple. Apple Metal framework. URL https://developer.apple.com/documentation/metal. [Online; accessed 1-Aug-2019]. → page 17

[24] Apple. The future is here: iPhone X, 2017. URL https://www.apple.com/ca/newsroom/2017/09/the-future-is-here-iphone-x/. [Online; accessed 1-Mar-2018]. → page 2

[25] Apple. About GPU family 4, 2019. URL https://developer.apple.com/documentation/metal/mtldevice/ios_and_tvos_devices/about_gpu_family_4. [Online; accessed April 25, 2019]. → page 28

[26] ARM. ARM Mali GPU OpenGL ES application optimization guide, 2013. URL http://infocenter.arm.com/help/topic/com.arm.doc.dui0555c/DUI0555C_optimization_guide.pdf. [Online; accessed 1-Aug-2019]. → page 30

[27] ARM. big.LITTLE technology: The future of mobile, 2013. URL https://www.arm.com/files/pdf/big_LITTLE_Technology_the_Futue_of_Mobile.pdf. [Online; accessed 1-Aug-2019]. → pages 63, 94

[28] ARM. ARM CoreLink CCI-550 cache coherent interconnect, 2016. URL https://static.docs.arm.com/100282/0001/corelink_cci550_cache_coherent_interconnect_technical_reference_manual_100282_0001_01_en.pdf. [Online; accessed 1-Aug-2019]. → pages 99, 112

[29] ARM. Arm CoreLink CMN-600 coherent mesh network technical reference manual, 2016. URL https://static.docs.arm.com/100282/0001/corelink_cci550_cache_coherent_interconnect_technical_reference_manual_100282_0001_01_en.pdf. [Online; accessed 1-Aug-2019]. → pages 99, 112

[30] M. Armbrust, A. Fox, R. Griffith, A. D. Joseph, R. Katz, A. Konwinski, G. Lee, D. Patterson, A. Rabkin, I. Stoica, and M. Zaharia. A view of cloud computing. Communications of the ACM, 53(4):50–58, 2010. doi:10.1145/1721654.1721672. → page 2

[31] J. M. Arnau, J.-M. Parcerisa, and P. Xekalakis. Parallel frame rendering: Trading responsiveness for energy on a mobile GPU. In Proceedings of the ACM/IEEE International Conference on Parallel Architecture and Compilation Techniques (PACT), pages 83–92. IEEE, 2013. doi:10.1109/PACT.2013.6618806. → page 91

[32] J. M. Arnau, J.-M. Parcerisa, and P. Xekalakis. TEAPOT: A toolset for evaluating performance, power and image quality on mobile graphics systems. In Proceedings of the ACM International Conference on Supercomputing (ICS), pages 37–46, 2013. doi:10.1145/2464996.2464999. → pages 88, 89

[33] J. M. Arnau, J.-M. Parcerisa, and P. Xekalakis. Eliminating redundant fragment shader executions on a mobile GPU via hardware memoization. In Proceedings of the ACM/IEEE International Symposium on Computer Architecture (ISCA), pages 529–540, 2014. doi:10.1145/2678373.2665748. → page 89

[34] Arteris. FlexNoc. URL http://www.arteris.com/flexnoc. [Online; accessed April 25, 2019]. → page 167

[35] R. Ausavarungnirun, K. K.-W. Chang, L. Subramanian, G. H. Loh, and O. Mutlu. Staged memory scheduling: Achieving high performance and scalability in heterogeneous systems. In Proceedings of the ACM/IEEE International Symposium on Computer Architecture (ISCA), pages 416–427, 2012. doi:10.1109/ISCA.2012.6237036. → page 90

173 [36] T. Austin, E. Larson, and D. Ernst. SimpleScalar: An infrastructure for computer system modeling. IEEE Computer, 35(2):59–67, 2002. doi:10.1109/2.982917. → page 88

[37] A. Bakhoda, G. L. Yuan, W. W. Fung, H. Wong, and T. M. Aamodt. Analyzing CUDA workloads using a detailed GPU simulator. In Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pages 163–174. IEEE, 2009. doi:10.1109/ISPASS.2009.4919648. → pages 6, 10, 11, 33, 38, 43, 59, 88, 90

[38] A. Bakhoda, J. Kim, and T. M. Aamodt. Throughput-effective on-chip networks for manycore accelerators. In Proceedings of the ACM/IEEE International Symposium on Microarchitecture (MICRO), pages 421–432. IEEE Computer Society, 2010. doi:10.1109/MICRO.2010.50. → page 98

[39] R. Balasubramonian, A. B. Kahng, N. Muralimanohar, A. Shafiee, and V. Srinivas. CACTI 7: New tools for interconnect exploration in innovative off-chip memories. ACM Transactions on Architecture and Code Optimization (TACO), 14(2):14:1–14:25, June 2017. doi:10.1145/3085572. → pages 159, 162

[40] L. A. Barroso and U. Hölzle. The case for energy-proportional computing. IEEE Computer, 40(12):33–37, 2007. doi:10.1109/MC.2007.443. → page 2

[41] N. Binkert, B. Beckmann, G. Black, S. K. Reinhardt, A. Saidi, A. Basu, J. Hestness, D. R. Hower, T. Krishna, S. Sardashti, R. Sen, K. Sewell, M. Shoaib, N. Vaish, M. D. Hill, and D. A. Wood. The gem5 simulator. ACM SIGARCH Computer Architecture News, 39(2):1–7, Aug. 2011. doi:10.1145/2024716.2024718. → pages 9, 10, 11, 66, 88, 119

[42] J. Bolz and P. Brown. ARB_shader_image_load_store, 2014. URL https://www.khronos.org/registry/OpenGL/extensions/ARB/ARB_shader_image_load_store.txt. [Online; accessed 1-Aug-2019]. → page 20

[43] I. Bratt. The ARM Mali-T880 mobile GPU. In IEEE Hot Chips Symposium (HCS), pages 1–27. IEEE, 2015. doi:10.1109/HOTCHIPS.2015.7477462. → page 28

[44] S. Brightfield. High-res image processing, low power consumption: Qualcomm Hexagon Vector eXtensions (VX), 2014. URL https://developer.qualcomm.com/blog/high-res-image-processing-low-power-consumption-qualcomm-hexagon-vector-extensions-vx. [Online; accessed 1-Aug-2019]. → page 2

[45] Broadcom. VideoCore IV 3D architecture reference guide, 2013. URL https://docs.broadcom.com/docs/12358545. [Online; accessed April 25, 2019]. → page 28

[46] T. E. Carlson, W. Heirman, and L. Eeckhout. Sniper: Exploring the level of abstraction for scalable and accurate parallel multi-core simulation. In Proceedings of the ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (SC), page 52. ACM, 2011. doi:10.1145/2063384.2063454. → page 88

[47] K. Chandrasekar, C. Weis, Y. Li, B. Akesson, N. Wehn, and K. Goossens. DRAMPower: Open-source DRAM power & energy estimation tool, 2012. URL http://www.drampower.info. [Online; accessed 1-Aug-2019]. → page 159

[48] C.-H. O. Chen, S. Park, T. Krishna, S. Subramanian, A. P. Chandrakasan, and L.-S. Peh. SMART: A single-cycle reconfigurable NoC for SoC applications. In Proceedings of the IEEE Conference on Design Automation and Test in Europe (DATE), pages 338–343. EDA Consortium, 2013. doi:10.7873/DATE.2013.080. → pages 93, 98, 166

[49] N. Chidambaram Nachiappan, P. Yedlapalli, N. Soundararajan, M. T. Kandemir, A. Sivasubramaniam, and C. R. Das. GemDroid: A framework to evaluate mobile platforms. In Proceedings of the ACM International Conference on Measurement and Modeling of Computer Science (SIGMETRICS), pages 355–366, 2014. doi:10.1145/2591971.2591973. → pages 5, 10, 11, 61, 63, 67, 88, 89, 119, 165

[50] Christoph Kubisch. Life of a triangle - NVIDIA’s logical pipeline, 2015. URL https://developer.nvidia.com/content/life-triangle-nvidias-logical-pipeline. [Online; accessed April 25, 2019]. → page 38

[51] Christoph Kubisch. Transitioning from OpenGL to Vulkan, 2016. URL https://developer.nvidia.com/transitioning-opengl-vulkan. [Online; accessed 1-Aug-2019]. → page 17

[52] J. Cong, Z. Fang, M. Gill, and G. Reinman. PARADE: A cycle-accurate full-system simulation platform for accelerator-rich architectural design and exploration. In IEEE/ACM International Conference on Computer-Aided Design (ICCAD), pages 380–387. IEEE, 2015. doi:10.1109/ICCAD.2015.7372595. → page 88

[53] Costello, Katie and Hippold, Sarah. Gartner forecasts worldwide public cloud revenue to grow 17.3 percent in 2019, 2018. URL https://www.gartner.com/en/newsroom/press-releases/2018-09-12-gartner-forecasts-worldwide-public-cloud-revenue-to-grow-17-percent-in-2019. [Online; accessed 1-Aug-2019]. → page 2

[54] Costello, Katie and Pettey, Christy. Gartner says worldwide PC shipments grew for the first time in six years during the second quarter of 2018, 2018. URL https://www.gartner.com/en/newsroom/press-releases/2018-07-12--gartner-says-worldwide-pc-shipments-grew-for-the-first-time-in-six-years-during-the-second-quarter-of-2018. [Online; accessed 1-Aug-2019]. → page 2

[55] K. Crane. Keenan's 3D model repository. URL https://www.cs.cmu.edu/~kmcrane/Projects/ModelRepository/. [Online; accessed April 25, 2019]. → pages 57, 72

[56] W. J. Dally and B. P. Towles. Principles and practices of interconnection networks. Elsevier, 2004. → pages 100, 103, 105, 111

[57] B. H. Danielson, J. J. Watters, and T. J. McDonald. Method and apparatus for displaying computer graphics data stored in a compressed format with an efficient color indexing system, Apr. 14 1998. US Patent 5,740,345. → pages 131, 134

[58] R. Das, S. Eachempati, A. K. Mishra, V. Narayanan, and C. R. Das. Design and evaluation of a hierarchical on-chip interconnect for next-generation CMPs. In Proceedings of the IEEE International Symposium on High-Performance Computer Architecture (HPCA), pages 175–186. IEEE, 2009. doi:10.1109/HPCA.2009.4798252. → page 98

[59] J. Davies. The Bifrost GPU architecture and the ARM Mali-G71 GPU. In IEEE Hot Chips Symposium (HCS), pages 1–31, Aug 2016. doi:10.1109/HOTCHIPS.2016.7936201. → pages 28, 38

[60] R. De Jong and A. Sandberg. NoMali: Simulating a realistic graphics driver stack using a stub GPU. In Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pages 255–262. IEEE, 2016. doi:10.1109/ISPASS.2016.7482100. → pages 4, 5, 9, 10, 89, 165, 219

[61] E. De Lucas, P. Marcuello, J.-M. Parcerisa, and A. González. Visibility rendering order: Improving energy efficiency on mobile GPUs through frame coherence. IEEE Transactions on Parallel and Distributed Systems, 30(2):473–485, Feb 2019. doi:10.1109/TPDS.2018.2866246. → page 91

[62] V. M. Del Barrio, C. González, J. Roca, A. Fernández, and E. Espasa. ATTILA: A cycle-level execution-driven simulator for modern GPU architectures. In Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pages 231–241. IEEE, 2006. doi:10.1109/ISPASS.2006.1620807. → pages 5, 88, 89

[63] M. Ditty, A. Karandikar, and D. Reed. NVIDIA's Xavier SoC, 2018. URL https://www.hotchips.org/hc30/1conf/1.12_Nvidia_XavierHotchips2018Final_814.pdf. [Online; accessed 1-Aug-2019]. → page 159

[64] J. F. Duluk Jr, R. E. Hessel, V. T. Arnold, J. Benkual, J. P. Bratt, G. Cuan, S. L. Dodgen, E. S. Fang, Z. Gong, T. Y. Ho, H. Hsu, S. Li, S. Ng, M. N. Papakipson, J. R. Redgrave, S. S. Trivedi, T. N. D, S. W. Go, L. Fung, T. D. Nguyen, J. P. Grass, B. Hong, A. Mammen, A. Rashid, and A. S.-W. Tsay. Deferred shading graphics pipeline processor having advanced features, Jan. 23 2007. US Patent 7,167,181. → page 22

[65] M. Eldridge. Designing graphics architectures around scalability and communication. Doctoral dissertation, Stanford University, 2001. URL https://graphics.stanford.edu/papers/eldridge_thesis. → pages 27, 31

[66] Epic Games. Unreal engine: Threaded rendering. URL https://docs.unrealengine.com/en-us/Programming/Rendering/ThreadedRendering. [Online; accessed 1-Jun-2018]. → page 210

[67] L. Eric, W. R. Steiner, and J. Cobb. Early sample evaluation during coarse rasterization, Nov. 15 2016. US Patent 9,495,781. → pages 26, 38

[68] H. Esmaeilzadeh, E. Blem, R. S. Amant, K. Sankaralingam, and D. Burger. Dark silicon and the end of multicore scaling. In Proceedings of the ACM/IEEE International Symposium on Computer Architecture (ISCA), pages 365–376. IEEE, 2011. doi:10.1145/2000064.2000108. → pages 1, 168

[69] R. Failliot, T. Droste, and R. McCorkell. Mesamatrix. URL https://mesamatrix.net/. [Online; accessed 25-Apr-2019]. → page 165

[70] Flurry Analytics. Flurry Five-Year Report: It's an App World. The Web Just Lives in It, 2013. URL https://flurrymobile.tumblr.com/post/115188952445/flurry-five-year-report-its-an-app-world-the. [Online; accessed 1-Aug-2019]. → pages 3, 4, 132, 168

[71] A. Fox, R. Griffith, A. Joseph, R. Katz, A. Konwinski, G. Lee, D. Patterson, A. Rabkin, and I. Stoica. Above the clouds: A berkeley view of cloud computing. Technical Report. Dept. Electrical Eng. and Comput. Sciences, University of California, Berkeley, Rep. UCB/EECS, 28(13), 2009. URL https://www2.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-28.pdf. → page 2

[72] D. J. Frank, R. H. Dennard, E. Nowak, P. M. Solomon, Y. Taur, and Hon-Sum Philip Wong. Device scaling limits of Si and their application dependencies. Proceedings of the IEEE, 89(3):259–288, 2001. doi:10.1109/5.915374. → page 1

[73] M. J. French, E. M. Kilgariff, S. E. Molnar, W. R. Steiner, D. A. Voorhies, and A. C. Weitkemper. Optimizing a graphics rendering pipeline using early Z-mode, Jan. 13 2015. US Patent 8,933,933. → page 19

[74] Google. Configure for the Android Emulator. URL https://developer.android.com/studio/run/emulator-acceleration. [Online; accessed 1-Aug-2019]. → page 207

[75] Google. Run apps on the Android emulator. URL https://developer.android.com/studio/run/emulator. [Online; accessed 1-Aug-2019]. → page 88

[76] Google. Android emulator OpenGL acceleration guest-side implementation. URL https://android.googlesource.com/device/generic/goldfish-opengl/. [Online; accessed 1-Aug-2019]. → page 207

[77] Google. Android emulator OpenGL acceleration host-side implementation. URL https://android.googlesource.com/platform/external/qemu/. [Online; accessed 1-Aug-2019]. → page 207

[78] D. Gouk, M. Kwon, J. Zhang, S. Koh, W. Choi, N. S. Kim, M. Kandemir, and M. Jung. Amber: Enabling precise full-system simulation with detailed modeling of all SSD resources. In Proceedings of the ACM/IEEE

International Symposium on Microarchitecture (MICRO), pages 469–481. IEEE, 2018. doi:10.1109/MICRO.2018.00045. → page 166

[79] P. Gratz, C. Kim, R. McDonald, S. W. Keckler, and D. Burger. Implementation and evaluation of on-chip network architectures. In Proceedings of the IEEE International Conference on Computer Design (ICCD), pages 477–484. IEEE, 2006. doi:10.1109/ICCD.2006.4380859. → pages 99, 105

[80] E. C. Greene. Hardware-assisted z-pyramid creation for host-based occlusion culling, Oct. 21 2003. US Patent 6,636,215. → page 23

[81] N. Greene. Hierarchical polygon tiling with coverage masks. In Proceedings of the ACM International Conference on Computer Graphics and Interactive Techniques (SIGGRAPH), pages 65–74. ACM, 1996. doi:10.1145/237170.237207. → page 26

[82] N. Greene, M. Kass, and G. Miller. Hierarchical Z-buffer visibility. In Proceedings of the ACM International Conference on Computer Graphics and Interactive Techniques (SIGGRAPH), pages 231–238. ACM, 1993. doi:10.1145/166117.166147. → page 20

[83] B. Grot, J. Hestness, S. W. Keckler, and O. Mutlu. Express cube topologies for on-chip interconnects. In Proceedings of the IEEE International Symposium on High-Performance Computer Architecture (HPCA), pages 163–174. IEEE, 2009. doi:10.1109/HPCA.2009.4798251. → page 93

[84] B. Grot, J. Hestness, S. W. Keckler, and O. Mutlu. Kilo-NOC: a heterogeneous network-on-chip architecture for scalability and service guarantees. In ACM SIGARCH Computer Architecture News, volume 39, pages 401–412. ACM, 2011. doi:10.1145/2024723.2000112. → page 93

[85] A. E. Gruber and C. J. Brennan. Method and apparatus for processing pixel depth information, Oct. 14 2014. US Patent 8,860,721. → pages 19, 20

[86] A. A. Gubran and T. M. Aamodt. Emerald: Graphics modeling for SoC systems. In Proceedings of the ACM/IEEE International Symposium on Computer Architecture (ISCA), pages 169–182, 2019. doi:10.1145/3307650.3322221. → pages 95, 119

[87] A. Gutierrez, R. G. Dreslinski, T. F. Wenisch, T. Mudge, A. Saidi, C. Emmons, and N. Paver. Full-system analysis and characterization of interactive smartphone applications. In Proceedings of the IEEE

International Symposium on Workload Characterization (IISWC), pages 81–90. IEEE, 2011. doi:10.1109/IISWC.2011.6114205. → page 5

[88] A. Gutierrez, J. Pusdesris, R. G. Dreslinski, T. Mudge, C. Sudanthi, C. D. Emmons, M. Hayenga, and N. Paver. Sources of error in full-system simulation. In Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pages 13–22. IEEE, 2014. doi:10.1109/ISPASS.2014.6844457. → page 9

[89] A. Gutierrez, B. M. Beckmann, A. Dutu, J. Gross, M. LeBeane, J. Kalamatianos, O. Kayiran, M. Poremba, B. Potter, S. Puthoor, M. D. Sinclair, M. Wyse, J. Yin, X. Zhang, A. Jain, and T. Rogers. Lost in abstraction: Pitfalls of analyzing GPUs at the intermediate language level. In Proceedings of the IEEE International Symposium on High-Performance Computer Architecture (HPCA), pages 608–619. IEEE, 2018. doi:10.1109/HPCA.2018.00058. → pages 88, 90

[90] Z. Hakura, L. Eric, D. Kirkland, J. Choquette, P. R. Brown, Y. Y. Uralsky, and J. Bolz. Techniques for maintaining atomicity and ordering for pixel shader operations, July 10 2018. US Patent App. 10/019,776. → pages 31, 38, 39, 42, 43, 53, 54

[91] Z. S. Hakura, M. B. Cox, B. K. Langendorf, and B. W. Simeral. Apparatus, system, and method for Z-culling, July 13 2010. US Patent 7,755,624. → pages 20, 38

[92] A. Hansson, K. Goossens, and A. Rădulescu. Avoiding message-dependent deadlock in network-based systems on chip. VLSI Design, 2007. → page 99

[93] A. Hansson, N. Agarwal, A. Kolli, T. Wenisch, and A. N. Udipi. Simulating DRAM controllers for future system architecture exploration. In Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pages 201–210. IEEE, 2014. doi:10.1109/ISPASS.2014.6844484. → pages 64, 66

[94] N. Hardavellas, S. Somogyi, T. F. Wenisch, R. E. Wunderlich, S. Chen, J. Kim, B. Falsafi, J. C. Hoe, and A. G. Nowatzyk. Simflex: A fast, accurate, flexible full-system simulation framework for performance evaluation of server architecture. ACM SIGMETRICS Performance Evaluation Review, 31(4):31–34, 2004. doi:10.1145/1054907.1054914. → page 88

[95] P. Harris. The Mali GPU: An abstract machine, part 2 - tile-based rendering, 2014. URL https://community.arm.com/graphics/b/blog/posts/the-mali-gpu-an-abstract-machine-part-2---tile-based-rendering. [Online; accessed April 29, 2019]. → page 91

[96] Harvard Architecture, Circuits, and Compilers Group. Die photo analysis. URL http://vlsiarch.eecs.harvard.edu/accelerators/die-photo-analysis. [Online; accessed April 25, 2019]. → pages 2, 93

[97] J. Hasselgren and T. Akenine-Möller. Efficient depth buffer compression. In Proceedings of the ACM SIGGRAPH/EUROGRAPHICS Conference on Graphics Hardware (HWWS), volume 3, pages 103–110, 2006. doi:10.1145/1283900.1283917. → pages 130, 139

[98] M. Hill and V. J. Reddi. Gables: A roofline model for mobile SoCs. In Proceedings of the IEEE International Symposium on High-Performance Computer Architecture (HPCA), pages 317–330. IEEE, 2019. doi:10.1109/HPCA.2019.00047. → pages 3, 5, 89, 165

[99] W. H. Ho and T. M. Pinkston. A design methodology for efficient application-specific on-chip interconnects. IEEE Transactions on Parallel and Distributed Systems, 17(2):174–190, 2006. doi:10.1109/TPDS.2006.15. → page 98

[100] Imagination Technologies. PowerVR hardware architecture overview for developers, 2018. URL http://cdn.imgtec.com/sdk-documentation/PowerVR+Hardware.Architecture+Overview+for+Developers.pdf. [Online; accessed 1-Aug-2019]. → pages 22, 28, 31

[101] S. Iyer, R. Zhang, and N. McKeown. Routers with a single stage of buffering. In ACM SIGCOMM Computer Communication Review, volume 32, pages 251–264. ACM, 2002. doi:10.1145/964725.633050. → page 99

[102] K. C. Janac. PIANO: Physical interconnect aware network optimizer. URL http://www.ispd.cc/slides/2018/s7_2.pdf. [Online; accessed 1-Jun-2019]. → pages 94, 95, 99, 112, 113

[103] H. Jang, J. Kim, P. Gratz, K. H. Yum, and E. J. Kim. Bandwidth-efficient on-chip interconnect designs for GPGPUs. In Proceedings of the Annual Design Automation Conference (DAC), page 9. ACM, 2015. doi:10.1145/2744769.2744803. → page 98

[104] D. Jayasimha, B. Zafar, and Y. Hoskote. On-chip interconnection networks: Why they are different and how to compare them. Technical Report, Platform Architecture Research, Intel Corporation, 2006. URL https://blogs.intel.com/wp-content/mt-content/com/research/terascale/ODI_why-different.pdf. → page 114

[105] JEDEC. JEDEC LPDDR2 standard (JESD209-2F), 2013. URL https://www.jedec.org/standards-documents/results/JESD209-2F. [Online; accessed 1-Aug-2019]. → page 136

[106] JEDEC. JEDEC LPDDR3 standard (JESD209-3C), 2015. URL http://www.jedec.org/standards-documents/results/jesd209-3c. [Online; accessed 1-Aug-2019].

[107] JEDEC. JEDEC LPDDR4 standard (JESD209-4A ), 2015. URL http://www.jedec.org/standards-documents/results/jesd209-4a. [Online; accessed 1-Aug-2019]. → page 136

[108] M. K. Jeong, M. Erez, C. Sudanthi, and N. Paver. A QoS-aware memory controller for dynamically balancing GPU and CPU bandwidth use in an MPSoC. In Proceedings of the Annual Design Automation Conference (DAC), pages 850–855. ACM, 2012. doi:10.1145/2228360.2228513. → pages 62, 90

[109] N. E. Jerger and L.-S. Peh. On-chip networks, Second Edition. Synthesis Lectures on Computer Architecture, 2017. doi:10.2200/S00209ED1V01Y200907CAC008. → page 105

[110] Jin Kim. The future of graphic and mobile memory for new applications. In IEEE Hot Chips Symposium (HCS), pages 1–25, 2016. doi:10.1109/HOTCHIPS.2016.7936170. → page 162

[111] A. Jog, O. Kayiran, A. Pattnaik, M. T. Kandemir, O. Mutlu, R. Iyer, and C. R. Das. Exploiting core criticality for enhanced GPU performance. In Proceedings of the ACM International Conference on Measurement and Modeling of Computer Science (SIGMETRICS), pages 351–363, 2016. doi:10.1145/2964791.2901468. → page 90

[112] W. M. Johnson. Super-scalar processor design. Technical Report CSL-TR-89-383, Stanford, CA, USA, 1989. URL https://vlsiweb.stanford.edu/people/alum/pdf/8906_MikeJohnson_SuperScalar_Processor_Design.pdf. [Online; accessed 1-Aug-2019]. → page 1

[113] S. Jung and J. Park. Graphics data processing method and apparatus, July 28 2016. US Patent App. 14/844,701. → pages 23, 25

[114] O. Kayiran, N. C. Nachiappan, A. Jog, R. Ausavarungnirun, M. T. Kandemir, G. H. Loh, O. Mutlu, and C. R. Das. Managing GPU concurrency in heterogeneous architectures. In Proceedings of the ACM/IEEE International Symposium on Microarchitecture (MICRO), pages 114–126. IEEE, 2014. doi:10.1109/MICRO.2014.62. → page 98

[115] S. W. Keckler, W. J. Dally, B. Khailany, M. Garland, and D. Glasco. GPUs and the future of parallel computing. IEEE Micro, 31(5):7–17, 2011. doi:10.1109/MM.2011.89. → page 28

[116] G. Keizer. Windows comes up third in OS clash two years early, 2016. URL https://www.computerworld.com/article/3050931/windows-comes- up-third-in-os-clash-two-years-early.html. [Online; accessed 1-Aug-2019]. → page 164

[117] A. Khodakovsky, P. Schröder, and W. Sweldens. Progressive geometry compression. In Proceedings of the ACM International Conference on Computer Graphics and Interactive Techniques (SIGGRAPH), pages 271–278. ACM, 2000. doi:10.1145/344779.344922. → page 130

[118] Khronos Group. The OpenGL shading language, 2006. URL https://khronos.org/registry/OpenGL/specs/gl/GLSLangSpec.1.20.pdf. [Online; accessed April 25, 2019]. → pages 50, 204

[119] Khronos Group. OpenVG specification (version 1.0.1), 2007. URL https://www.khronos.org/registry/OpenVG/specs/openvg_1_0_1.pdf. [Online; accessed 1-Aug-2019]. → page 14

[120] Khronos Group. OpenGL ES specification (version 3.1), 2016. URL https://www.khronos.org/registry/OpenGL/specs/es/3.1/es_spec_3.1.pdf. [Online; accessed 1-Aug-2019]. → pages 14, 15, 16, 55, 90, 210

[121] Khronos Group. OpenGL ES extension 91, 2016. URL https://www.khronos.org/registry/gles/extensions/NV/NV_draw_buffers.txt. → page 148

[122] Khronos Group. OpenGL SC specification (version 2.0.0), 2016. URL https://www.khronos.org/registry/OpenGL/specs/sc/sc_spec_2.0.pdf. [Online; accessed 1-Aug-2019]. → page 14

[123] Khronos Group. The OpenGL graphics system: A specification (version 4.5 core profile), 2017. URL https://www.khronos.org/registry/OpenGL/specs/gl/glspec45.core.pdf. [Online; accessed April 29, 2019]. → pages 5, 14, 15, 16, 17, 21, 23, 55, 57, 58, 210

[124] Khronos Group. Vulkan 1.1.96 - A specification, 2018. URL https://www.khronos.org/registry/vulkan/specs/1.1-extensions/pdf/vkspec.pdf. → page 17

[125] Khronos Group. WebGL 2.0 specification, 2018. URL https://www.khronos.org/registry/webgl/specs/latest/2.0. → page 14

[126] E. M. Kilgariff, S. E. Molnar, S. J. Treichler, J. S. Rhoades, G. Schaufler, D. L. Kirkland, C. A. E. Allison, K. M. Wurstner, and T. J. Purcell. Hardware-managed virtual buffers using a shared memory for load distribution, 2014. US Patent 8,760,460. → page 31

[127] H. Kim, J. Lee, N. B. Lakshminarayana, J. Sim, J. Lim, and T. Pho. MacSim: A CPU-GPU heterogeneous simulation framework user guide. Technical Report, Georgia Institute of Technology, 2012. URL http://comparch.gatech.edu/hparch/macsim/macsim.pdf. [Online; accessed April 29, 2019]. → pages 88, 90

[128] J. Kim, C. Nicopoulos, D. Park, V. Narayanan, M. S. Yousif, and C. R. Das. A gracefully degrading and energy-efficient modular router architecture for on-chip networks. ACM SIGARCH Computer Architecture News, 34(2): 4–15, 2006. doi:10.1109/ISCA.2006.6. → pages 98, 99

[129] M. Kim, H. Kim, H. Chung, and K. Lim. Samsung Exynos 5410 processor: Experience the ultimate performance and versatility. Samsung whitepaper, 2013. → page 63

[130] Y. Kim, D. Han, O. Mutlu, and M. Harchol-Balter. ATLAS: A scalable and high-performance scheduling algorithm for multiple memory controllers. In Proceedings of the IEEE International Symposium on High-Performance Computer Architecture (HPCA), pages 1–12. IEEE, 2010. doi:10.1109/HPCA.2010.5416658. → page 90

[131] Y. Kim, M. Papamichael, O. Mutlu, and M. Harchol-Balter. Thread cluster memory scheduling: Exploiting differences in memory access behavior. In Proceedings of the ACM/IEEE International Symposium on

Microarchitecture (MICRO), pages 65–76, 2010. doi:10.1109/MICRO.2010.51. → pages 62, 63, 90

[132] C. Kubisch. Life of a triangle - NVIDIA’s logical pipeline. 2015. URL https://developer.nvidia.com/content/life-triangle-nvidias-logical-pipeline. [Online; accessed 1-Aug-2019]. → page 31

[133] N. Kulshrestha, D. K. McAllister, and S. E. Molnar. Selecting and representing multiple compression methods, Oct. 7 2010. US Patent App. 12/900,362. → pages 138, 141, 155

[134] M. M. Leather. Method and apparatus for rasterizer interpolation, June 13 2006. US Patent 7,061,495. → page 26

[135] J.-J. Lecler and G. Baillieu. Application driven network-on-chip architecture exploration & refinement for a complex SoC. Design Automation for Embedded Systems, 15(2):133–158, 2011. doi:10.1007/s10617-011-9075-5. → page 167

[136] J. Lee, S. Li, H. Kim, and S. Yalamanchili. Design space exploration of on-chip ring interconnection for a CPU–GPU heterogeneous architecture. Journal of Parallel and Distributed Computing, 73(12):1525–1538, 2013. doi:10.1016/j.jpdc.2013.07.014. → page 98

[137] M. Lee, S. Song, J. Moon, J. Kim, W. Seo, Y. Cho, and S. Ryu. Improving GPGPU resource utilization through alternative thread block scheduling. In Proceedings of the IEEE International Symposium on High-Performance Computer Architecture (HPCA), pages 260–271. IEEE, 2014. doi:10.1109/HPCA.2014.6835937. → page 90

[138] A. Li, S. L. Song, W. Liu, X. Liu, A. Kumar, and H. Corporaal. Locality-aware CTA clustering for modern GPUs. In Proceedings of the ACM Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 297–311, 2017. doi:10.1145/3093336.3037709. → page 90

[139] C. C. Lin. Implementation of H.264 decoder in Bluespec SystemVerilog. PhD thesis, Massachusetts Institute of Technology, 2007. → page 119

[140] E. Lindholm, J. Nickolls, S. Oberman, and J. Montrym. NVIDIA Tesla: A unified graphics and computing architecture. IEEE Micro, 28(2):39–55, 2008. doi:10.1109/MM.2008.31. → page 33

[141] G. H. Loh, S. Subramaniam, and Y. Xie. Zesto: A cycle-level simulator for highly detailed microarchitecture exploration. In Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pages 53–64. IEEE, 2009. doi:10.1109/ISPASS.2009.4919638. → page 88

[142] Z. Lu and Y. Yao. Aggregate flow-based performance fairness in CMPs. ACM Transactions on Architecture and Code Optimization (TACO), 13(4): 53:1–53:27, Dec. 2016. doi:10.1145/3014429. → page 90

[143] E. B. Lum, J. Cobb, and B. N. Rodgers. Pixel serialization to improve conservative depth estimation, June 20 2017. US Patent 9,684,998. → pages 27, 38

[144] J. Ma, L. Yu, M. Y. John, and T. Chen. MCMG simulator: A unified simulation framework for CPU and graphic GPU. Journal of Computer and System Sciences, 81(1):57–71, 2015. doi:10.1016/j.jcss.2014.06.017. → page 88

[145] M. A. Mang, R. C. Taylor, M. J. Manter, and T. B. Pringle. Method and apparatus for clipping an object element in accordance with a clip volume, Jan. 14 2003. US Patent 6,507,348. → page 23

[146] M. Mantor, L. Lefebvre, M. Alho, M. Tuomi, and K. Kallio. Hybrid render with preferred primitive batch binning and sorting, 2016. US Patent App. 15/250,357. → page 31

[147] M. Mantor, L. Lefebvre, M. Fowler, T. Kelley, M. Alho, M. Tuomi, K. Kallio, P. K. R. Buss, J. A. Komppa, and K. Tuomi. Hybrid render with deferred primitive batch binning, Jan. 1 2019. US Patent 10,169,906. → page 31

[148] M. M. Martin, D. J. Sorin, B. M. Beckmann, M. R. Marty, M. Xu, A. R. Alameldeen, K. E. Moore, M. D. Hill, and D. A. Wood. Multifacet’s general execution-driven multiprocessor simulator (GEMS) toolset. ACM SIGARCH Computer Architecture News, 33(4):92–99, 2005. doi:10.1145/1105734.1105747. → page 105

[149] M. McGuire. Computer graphics archive, July 2017. URL https://casual-effects.com/data. [Online; accessed April 25, 2019]. → pages 57, 72

[150] M. McGuire, D. P. Luebke, and M. T. Mara. System, method, and computer program product for sampling a hierarchical depth map, Aug. 18 2015. US Patent 9,111,393. → page 49

[151] Mesa 3D. The Mesa 3D graphics library. URL http://www.mesa3d.org/. [Online; accessed 25-Apr-2019]. → pages 88, 90, 165

[152] Mesa 3D. Mesa 3D graphics library source code. URL https://github.com/mesa3d/mesa. [Online; accessed 25-Apr-2019]. → page 165

[153] Microsoft. Direct3D 11 graphics, 2018. URL https://docs.microsoft.com/en-us/windows/desktop/direct3d11/atoc-dx-graphics-direct3d-11. [Online; accessed 1-Aug-2019]. → pages 5, 14

[154] A. Mirhosseini, M. Sadrosadati, B. Soltani, H. Sarbazi-Azad, and T. F. Wenisch. BiNoCHS: Bimodal network-on-chip for CPU-GPU heterogeneous systems. In ACM/IEEE International Symposium on Networks-on-Chip (NoCs), page 7. ACM, 2017. doi:10.1145/3130218.3130222. → page 98

[155] A. K. Mishra, O. Mutlu, and C. R. Das. A heterogeneous multiple network-on-chip design: An application-aware approach. In Proceedings of the Annual Design Automation Conference (DAC), pages 1–10. IEEE, 2013. doi:10.1145/2463209.2488779. → page 93

[156] J. Mitchell, S. Morein, R. Taylor, and J. Carey. Method and apparatus for object based visibility culling, Sept. 8 2005. US Patent App. 10/790,904. → page 23

[157] S. E. Molnar, B.-O. Schneider, J. Montrym, J. M. Van Dyke, and S. D. Lew. System and method for real-time compression of pixel colors, Nov. 30 2004. US Patent 6,825,847. → page 130

[158] S. E. Molnar, E. M. Kilgariff, J. S. Rhoades, T. J. Purcell, S. J. Treichler, Z. S. Hakura, F. C. Crow, and J. C. Bowman. Order-preserving distributed rasterizer, Nov. 19 2013. US Patent 8,587,581. → pages 38, 46

[159] G. E. Moore. Cramming more components onto integrated circuits. Proceedings of the IEEE, 86(1):82–85, 1998. doi:10.1109/N-SSC.2006.4785860. → page 1

[160] S. L. Morein. System, method, and apparatus for multi-level hierarchical Z buffering, Aug. 15 2006. US Patent 7,091,971. → page 20

[161] S. L. Morein and M. A. Natale. System, method, and apparatus for compression of video data using offset values, July 13 2004. US Patent 6,762,758. → page 130

[162] T. Moscibroda and O. Mutlu. A case for bufferless routing in on-chip networks. ACM SIGARCH Computer Architecture News, 37(3):196–207, 2009. doi:10.1145/1555815.1555781. → page 93

[163] S. S. Mukherjee, P. Bannon, S. Lang, A. Spink, and D. Webb. The Alpha 21364 network architecture. IEEE Micro, 22(1):26–35, 2002. doi:10.1109/HIS.2001.946702. → pages 99, 105

[164] R. Mullins, A. West, and S. Moore. Low-latency virtual-channel routers for on-chip networks. In Proceedings of the ACM/IEEE International Symposium on Computer Architecture (ISCA), pages 188–197. IEEE, 2004. doi:10.1109/ISCA.2004.1310774. → page 100

[165] S. Murali and G. De Micheli. Bandwidth-constrained mapping of cores onto NoC architectures. In Proceedings of the IEEE Conference on Design Automation and Test in Europe (DATE), volume 2, pages 896–901. IEEE, 2004. doi:10.1109/DATE.2004.1269002. → pages 93, 166

[166] S. Murali, P. Meloni, F. Angiolini, D. Atienza, S. Carta, L. Benini, G. De Micheli, and L. Raffo. Designing application-specific networks on chips with floorplan information. In IEEE/ACM International Conference on Computer-Aided Design (ICCAD), pages 355–362. ACM, 2006. doi:10.1109/ICCAD.2006.320058. → pages 93, 98, 166

[167] N. C. Nachiappan, P. Yedlapalli, N. Soundararajan, A. Sivasubramaniam, M. T. Kandemir, R. Iyer, and C. R. Das. Domain knowledge based energy management in handhelds. In Proceedings of the IEEE International Symposium on High-Performance Computer Architecture (HPCA), pages 150–160. IEEE, 2015. doi:10.1109/HPCA.2015.7056029. → pages 5, 89

[168] N. C. Nachiappan, H. Zhang, J. Ryoo, N. Soundararajan, A. Sivasubramaniam, M. T. Kandemir, R. Iyer, and C. R. Das. VIP: Virtualizing IP chains on handheld platforms. In Proceedings of the ACM/IEEE International Symposium on Computer Architecture (ISCA), pages 655–667, 2015. doi:10.1145/2749469.2750382. → pages 5, 89, 119

[169] O. Nagashima. Low power DRAM evolution, 2016. URL https://www.jedec.org/sites/default/files/Osamu_Nagashima_Mobile_August_2016.pdf. → page 167

[170] D. Nenni. Qualcomm Intel Facebook and semiconductor IP, 2019. URL https://semiwiki.com/x-subscriber/netspeed-systems/8068-qualcomm-intel-facebook-and-semiconductor-ip/. [Online; accessed July 1, 2019]. → page 4

[171] C. A. Nicopoulos, D. Park, J. Kim, N. Vijaykrishnan, M. S. Yousif, and C. R. Das. ViChaR: A dynamic virtual channel regulator for network-on-chip routers. In Proceedings of the ACM/IEEE International Symposium on Microarchitecture (MICRO), pages 333–346. IEEE, 2006. doi:10.1109/MICRO.2006.50. → pages 97, 98, 99, 105

[172] Nielsen. All about Android, 2011. URL http://www.nielsen.com/us/en/insights/webinars/2011/all-about-android- insights-from-nielsens-smartphone-meters.html. [Online; accessed 1-Aug-2019]. → pages 3, 4, 132, 168

[173] E. Norige and S. Kumar. Heterogeneous SoC IP core placement in an interconnect to optimize latency and interconnect performance, Nov. 10 2015. US Patent 9,185,023. → page 113

[174] NVIDIA. NVIDIA's next generation CUDA compute architecture: Fermi, 2009. URL https://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf. [Online; accessed 25-Apr-2019]. → page 33

[175] NVIDIA. Chimera: The NVIDIA computational photography architecture. NVIDIA whitepaper, 2013. URL https://www.nvidia.com/docs/IO/116757/Chimera_whitepaper_FINAL.pdf. [Online; accessed 1-Aug-2019]. → page 2

[176] NVIDIA. NVIDIA Tegra 4 family GPU architecture. NVIDIA whitepaper, 2013. URL https://www.nvidia.com/docs/IO/116757/Tegra_4_GPU_Whitepaper_FINALv2.pdf. [Online; accessed 1-Aug-2019]. → pages 42, 53

[177] NVIDIA. Tegra K1: A new era in mobile computing. NVIDIA, Corp., White Paper, 2014. URL https://www.nvidia.com/content/PDF/tegra_white_papers/Tegra_K1_whitepaper_v1.0.pdf. [Online; accessed April 25, 2019]. → pages 38, 57, 71

[178] NVIDIA. NVIDIA Tegra X1. NVIDIA whitepaper, 2015. URL https://international.download.nvidia.com/pdf/tegra/Tegra-X1-whitepaper-v1.0.pdf. [Online; accessed April 25, 2019]. → pages 38, 63, 71, 131, 141, 149, 155

[179] NVIDIA. NVIDIA Tesla V100 GPU architecture, 2017. URL https://images.nvidia.com/content/volta-architecture/pdf/volta-architecture- whitepaper.pdf. [Online; accessed 1-Aug-2019]. → page 168

[180] NVIDIA. NVIDIA Turing GPU architecture, 2018. URL https://www.nvidia.com/content/dam/en-zz/Solutions/design-visualization/technologies/turing-architecture/NVIDIA-Turing-Architecture-Whitepaper.pdf. [Online; accessed 1-Aug-2019]. → page 168

[181] NVIDIA. Parallel thread execution ISA, 2019. URL https://docs.nvidia.com/cuda/parallel-thread-execution/index.html. [Online; accessed April 25, 2019]. → pages 34, 36

[182] J. Nystad and E. Faye-Lund. Methods of and apparatus for processing computer graphics, Feb. 14 2012. US Patent 8,115,783. → pages 23, 25

[183] J. Nystad, A. Lassen, A. Pomianowski, S. Ellis, and T. Olson. Adaptive scalable texture compression. In Proceedings of the ACM Conference on High Performance Graphics (HPG), pages 105–114. Eurographics Association, 2012. doi:10.2312/EGGH/HPG12/105-114. → page 130

[184] T. J. Olson. Saving the planet, one handset at a time: Designing low-power, low-bandwidth GPUs. In ACM SIGGRAPH Mobile, pages 1:1–1:1, New York, NY, USA, 2012. doi:10.1145/2341910.2341912. → pages 4, 127, 167

[185] opengl-tutorial.org. Suzanne model. URL https://github.com/opengl-tutorials/ogl/blob/master/tutorial10_transparency/suzanne.obj. [Online; accessed Aug 20, 2020]. → page 57

[186] D. Park, A. Vaidya, A. Kumar, and M. Azimi. MoDe-X: Microarchitecture of a layout-aware modular decoupled crossbar for on-chip interconnects. IEEE Transactions on Computers, (1):1, 2012. doi:10.1109/TC.2012.203. → pages 98, 99

[187] A. Patel, F. Afram, S. Chen, and K. Ghose. MARSS: A full system simulator for multicore x86 CPUs. In Proceedings of the Design Automation Conference (DAC), pages 1050–1055, New York, NY, USA, 2011. ACM. doi:10.1145/2024724.2024954. → pages 9, 88

[188] J. Pineda. A parallel algorithm for polygon rasterization. In Proceedings of the ACM International Conference on Computer Graphics and Interactive Techniques (SIGGRAPH), volume 22, pages 17–20. ACM, 1988. doi:10.1145/378456.378457. → page 26

[189] J. Pool, A. Lastra, and M. Singh. Lossless compression of variable-precision floating-point buffers on GPUs. In Proceedings of the ACM SIGGRAPH Symposium on Interactive 3D Graphics and Games (I3D), pages 47–54. ACM, 2012. doi:10.1145/2159616.2159624. → pages 130, 139

[190] J. Power, J. Hestness, M. S. Orr, M. D. Hill, and D. A. Wood. gem5-gpu: A heterogeneous CPU-GPU simulator. IEEE Computer Architecture Letter, 14(1):34–36, Jan 2015. doi:10.1109/LCA.2014.2299539. → pages 10, 11, 34, 60, 88, 89, 206

[191] A. Prakash, A. Aziz, and V. Ramachandran. Randomized parallel schedulers for switch-memory-switch routers: Analysis and numerical studies. In IEEE International Conference on Computer Communications (INFOCOM), volume 3, pages 2026–2037. IEEE, 2004. doi:10.1109/INFCOM.2004.1354611. → page 99

[192] Qualcomm. Qualcomm introduces new Snapdragon 700 mobile platform series. URL https://www.qualcomm.com/news/releases/2018/02/27/qualcomm- introduces-new-snapdragon-700-mobile-platform-series. [Online; accessed 1-Mar-2018]. → page 93

[193] Qualcomm. FlexRender, 2013. URL https://www.qualcomm.com/videos/flexrender. [Online; accessed April 25, 2019]. → pages 30, 31

[194] Qualcomm Technologies. The rise of mobile gaming on Android: Qualcomm Snapdragon technology leadership, 2014. URL https://developer.qualcomm.com/qfile/27978/rise-of-mobile-gaming.pdf. [Online; accessed April 25, 2019]. → pages 28, 38

[195] Qualcomm Technologies. OpenGL ES developer guide, 2015. URL https://developer.qualcomm.com/download/adrenosdk/adreno-opengl-es-developer-guide.pdf. [Online; accessed 1-Aug-2019]. → page 30

[196] S. Rai and M. Chaudhuri. Improving CPU performance through dynamic GPU access throttling in CPU-GPU heterogeneous processors. In Proceedings of the IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pages 18–29. IEEE, 2017. doi:10.1109/IPDPSW.2017.37. → page 90

[197] R. S. Ramanujam, V. Soteriou, B. Lin, and L.-S. Peh. Design of a high-throughput distributed shared-buffer NoC router. In ACM/IEEE International Symposium on Networks-on-Chip (NoCs), pages 69–78. IEEE, 2010. doi:10.1109/NOCS.2010.17. → pages 98, 99, 105

[198] J. Rasmusson, J. Hasselgren, and T. Akenine-Möller. Exact and error-bounded approximate color buffer compression and decompression. In Proceedings of the ACM SIGGRAPH/EUROGRAPHICS Conference on Graphics Hardware (HWWS), volume 4, pages 41–48, 2007. doi:10.2312/EGGH/EGGH07/041-048. → pages 128, 130, 138, 139, 149, 151

[199] B. R. Rau and J. A. Fisher. Instruction-level parallel processing: History, overview, and perspective. The Journal of Supercomputing, 7(1):9–50, May 1993. doi:10.1007/BF01205181. → page 1

[200] V. J. Reddi, H. Yoon, and A. Knies. Two billion devices and counting: An industry perspective on the state of mobile computer architecture. IEEE Micro, 38:6–21, 2018. doi:10.1109/MM.2018.011441560. → pages 160, 164

[201] J. S. Rhoades, S. E. Molnar, E. M. Kilgariff, M. C. Shebanow, Z. S. Hakura, D. L. Kirkland, and J. D. Kelly. Distributing primitives to multiple rasterizers, Apr. 22 2014. US Patent 8,704,836. → page 46

[202] M. Ribble. QCOM_binning_control, 2012. URL https://www.khronos.org/registry/OpenGL/extensions/QCOM/QCOM_binning_control.txt. [Online; accessed 1-Aug-2019]. → pages 30, 31

[203] Rys Sommefeldt. A look at the PowerVR graphics architecture: Tile-based rendering, 2015. URL https://www.imgtec.com/blog/a-look-at-the-powervr-graphics-architecture-tile-based-rendering. [Online; accessed 1-Aug-2019]. → pages 28, 38

[204] D. Sanchez and C. Kozyrakis. ZSim: Fast and accurate microarchitectural simulation of thousand-core systems. In ACM SIGARCH Computer Architecture News, volume 41, pages 475–486. ACM, 2013. doi:10.1145/2508148.2485963. → page 88

[205] D. Sanchez, G. Michelogiannakis, and C. Kozyrakis. An analysis of on-chip interconnection networks for large-scale chip multiprocessors. ACM Transactions on Architecture and Code Optimization (TACO), 7(1):4, 2010. doi:10.1145/1736065.1736069. → page 98

[206] R. R. Schaller. Moore's law: past, present and future. IEEE Spectrum, 34(6):52–59, 1997. doi:10.1109/6.591665. → page 1

[207] D. Scherzer, L. Yang, O. Mattausch, D. Nehab, P. V. Sander, M. Wimmer, and E. Eisemann. Temporal coherence methods in real-time rendering. Computer Graphics Forum, 31(8):2378–2408, 2012. doi:10.1111/j.1467-8659.2012.03075.x. → pages 11, 73, 90, 131, 132

[208] A. Seetharamaiah and C. P. Frascati. Switching between direct rendering and binning in graphics processing using an overdraw tracker, Aug. 25 2015. US Patent 9,117,302. → page 31

[209] L. Seiler, D. Carmean, E. Sprangle, T. Forsyth, M. Abrash, P. Dubey, S. Junkins, A. Lake, J. Sugerman, R. Cavin, R. Espasa, E. Grochowski, T. Juan, and P. Hanrahan. Larrabee: a many-core x86 architecture for visual computing. In ACM Transactions on Graphics (TOG), volume 27, page 18. ACM, 2008. doi:10.1145/1360612.1360617. → page 27

[210] L. D. Seiler and S. L. Morein. Method and apparatus for hierarchical Z buffering and stenciling, July 12 2011. US Patent 7,978,194. → pages 20, 38, 49

[211] J. V. Sell. Computer graphics processing system, computer memory, and method of use with computer graphics processing system utilizing hierarchical image depth buffer, Apr. 18 2006. US Patent 7,030,877. → page 20

[212] A. Sembrant, T. E. Carlson, E. Hagersten, and D. Black-Schaffer. A graphics tracing framework for exploring CPU+GPU memory systems. In Proceedings of the IEEE International Symposium on Workload Characterization (IISWC), pages 54–65. IEEE, 2017. doi:10.1109/IISWC.2017.8167756. → pages 88, 89

[213] Y. S. Shao, B. Reagen, G.-Y. Wei, and D. Brooks. Aladdin: A pre-RTL, power-performance accelerator simulator enabling large design space exploration of customized architectures. In Proceedings of the ACM/IEEE International Symposium on Computer Architecture (ISCA), pages 97–108. IEEE, 2014. doi:10.1109/ISCA.2014.6853196. → page 166

[214] Y. S. Shao, S. L. Xi, V. Srinivasan, G.-Y. Wei, and D. Brooks. Co-designing accelerators and SoC interfaces using gem5-Aladdin. In Proceedings of the ACM/IEEE International Symposium on Microarchitecture (MICRO), pages 1–12. IEEE, 2016. doi:10.1109/MICRO.2016.7783751. → pages 88, 89

[215] C. Sharp and J. Leger. GL_QCOM_tiled_rendering, 2009. URL https://www.khronos.org/registry/OpenGL/extensions/QCOM/QCOM_tiled_rendering.txt. [Online; accessed 1-Aug-2019]. → page 31

[216] J. W. Sheaffer, D. Luebke, and K. Skadron. A flexible simulation framework for graphics architectures. In Proceedings of the ACM SIGGRAPH/EUROGRAPHICS Conference on Graphics Hardware (HWWS), pages 85–94. ACM, 2004. doi:10.1145/1058129.1058142. → pages 5, 88, 89

[217] H. Shim, Y. Cho, and N. Chang. Frame buffer compression using a limited-size code book for low-power display systems. In Proceedings of the IEEE Workshop on Embedded Systems for Real-Time Multimedia, pages 7–12. IEEE, 2005. doi:10.1109/ESTMED.2005.1518059. → pages 131, 134, 151

[218] M. Shirer and J. Chou. Despite steady commercial uptake, personal computing device market expected to decline at a -1.8% CAGR through 2022, according to IDC, 2018. URL https://www.idc.com/getdoc.jsp?containerId=prUS43596418. [Online; accessed 1-Aug-2019]. → page 2

[219] M. Shirer, K. Stolarski, and L. Fernandez. Worldwide cloud IT infrastructure revenues continue to grow by double digits in the first quarter of 2018 as public cloud expands, according to IDC, 2018. URL https://www.idc.com/getdoc.jsp?containerId=prUS44025018. [Online; accessed 1-Aug-2019]. → page 2

[220] D. O. Son, C. T. Do, H. J. Choi, J. Nam, and C. H. Kim. A dynamic CTA scheduling scheme for massive parallel computing. Cluster Computing, 20(1):781–787, Mar. 2017. doi:10.1007/s10586-017-0768-9. → page 90

[221] Y. H. Song and T. M. Pinkston. A progressive approach to handling message-dependent deadlock in parallel computer systems. IEEE Transactions on Parallel and Distributed Systems, 14(3):259–275, 2003. doi:10.1109/TPDS.2003.1189584. → page 99

[222] W. R. Steiner, F. C. Crow, C. M. Wittenbrink, R. L. Allen, and D. A. Voorhies. Method for parallel fine rasterization in a raster stage of a graphics pipeline, Jan. 6 2015. US Patent 8,928,676. → page 26

[223] M. B. Stensgaard and J. Sparsø. ReNoC: A network-on-chip architecture with reconfigurable topology. In ACM/IEEE International Symposium on Networks-on-Chip (NoCs), pages 55–64. IEEE, 2008. doi:10.1109/NOCS.2008.4492725. → pages 93, 166

[224] J. Ström and T. Akenine-Möller. iPACKMAN: High-quality, low-complexity texture compression for mobile phones. In Proceedings of the ACM SIGGRAPH/EUROGRAPHICS Conference on Graphics Hardware (HWWS), pages 63–70. The Eurographics Association, 2005. doi:10.2312/EGGH/EGGH05/063-070. → page 130

[225] J. Ström and M. Pettersson. ETC2: Texture compression using invalid combinations. In SIGGRAPH/Eurographics Workshop on Graphics Hardware, pages 49–54, 2007. doi:10.2312/EGGH/EGGH07/049-054. → page 130

[226] J. Ström, P. Wennersten, J. Rasmusson, J. Hasselgren, J. Munkberg, P. Clarberg, and T. Akenine-Möller. Floating-point buffer compression in a unified codec architecture. In Proceedings of the ACM SIGGRAPH/EUROGRAPHICS Conference on Graphics Hardware (HWWS), pages 75–84. Eurographics Association, 2008. doi:10.2312/EGGH/EGGH08/075-084. → pages 130, 139, 151

[227] L. Subramanian, D. Lee, V. Seshadri, H. Rastogi, and O. Mutlu. The blacklisting memory scheduler: Achieving high performance and fairness at low cost. In Proceedings of the IEEE International Conference on Computer Design (ICCD), pages 8–15. IEEE, 2014. doi:10.1109/ICCD.2014.6974655. → page 90

[228] C. Sun, C.-H. O. Chen, G. Kurian, L. Wei, J. Miller, A. Agarwal, L.-S. Peh, and V. Stojanovic. DSENT - a tool connecting emerging photonics with electronics for opto-electronic networks-on-chip modeling. In ACM/IEEE International Symposium on Networks-on-Chip (NoCs), pages 201–210. IEEE, 2012. doi:10.1109/NOCS.2012.31. → page 114

[229] Y. Tamir and G. L. Frazier. High-performance multi-queue buffers for VLSI communications switches. In Proceedings of the ACM/IEEE International Symposium on Computer Architecture (ISCA), pages 343–354. IEEE, 1988. doi:10.1109/ISCA.1988.5245. → pages 97, 98, 99, 104

[230] R. C. Taylor, M. Mantor, and M. A. Mang. Method and apparatus for primitive processing in a graphics system, Nov. 22 2005. US Patent 6,967,664. → page 23

[231] TechInsights. Huawei Mate 10 teardown, 2017. URL https://www.techinsights.com/about-techinsights/overview/blog/huawei-mate-10-teardown. [Online; accessed 1-Aug-2019]. → page 4

[232] TechInsights. Apple iPhone Xs Max teardown, 2018. URL https://www.techinsights.com/about-techinsights/overview/blog/apple-iphone-xs-teardown. [Online; accessed 1-Aug-2019]. → page 4

[233] TechInsights. Samsung Galaxy S9 teardown, 2018. URL https://www.techinsights.com/about-techinsights/overview/blog/samsung-galaxy-s9-teardown. [Online; accessed April 25, 2019]. → page 4

[234] T. Cugowski. Anti-aliasing techniques comparison. URL https://www.sapphirenation.net/anti-aliasing-comparison-performance-quality. [Online; accessed April 25, 2020]. → page 41

[235] A. T. Tran and B. M. Baas. Achieving high-performance on-chip networks with shared-buffer routers. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 22(6):1391–1403, 2014. doi:10.1109/TVLSI.2013.2268548. → pages 98, 99, 105

[236] R. Ubal, B. Jang, P. Mistry, D. Schaa, and D. Kaeli. Multi2Sim: A simulation framework for CPU-GPU computing. In Proceedings of the ACM/IEEE International Conference on Parallel Architecture and Compilation Techniques (PACT), pages 335–344. IEEE, 2012. doi:10.1145/2370816.2370865. → pages 88, 89, 90

[237] Unity Technologies. Optimizing graphics performance. URL https://docs.unity3d.com/Manual/OptimizingGraphicsPerformance.html. [Online; accessed 1-Aug-2019]. → page 32

[238] Unity Technologies. Deferred shading rendering path. URL https://docs.unity3d.com/Manual/RenderTech-DeferredShading.html. [Online; accessed 1-Aug-2019]. → pages 22, 140

[239] Unity Technologies. Unity3D engine: Optimizing graphics rendering in Unity games. URL https://docs.unrealengine.com/en-us/Programming/Rendering/ThreadedRendering. [Online; accessed 1-Aug-2019]. → page 210

[240] H. Usui, L. Subramanian, K. K.-W. Chang, and O. Mutlu. DASH: Deadline-aware high-performance memory scheduler for heterogeneous systems with hardware accelerators. ACM Transactions on Architecture and Code Optimization (TACO), 12(4):65:1–65:28, Jan. 2016. doi:10.1145/2847255. → pages 11, 61, 62, 63, 69, 90

[241] T. J. Van Hook. Method and apparatus for compression and decompression of color data, May 2 2006. US Patent 7,039,241. → page 130

[242] N. Vijaykumar, E. Ebrahimi, K. Hsieh, P. B. Gibbons, and O. Mutlu. The locality descriptor: A holistic cross-layer abstraction to express data locality in GPUs. In Proceedings of the ACM/IEEE International Symposium on Computer Architecture (ISCA), pages 829–842. IEEE, 2018. doi:10.1109/ISCA.2018.00074. → page 90

[243] O. Villa, D. R. Johnson, M. O’Connor, E. Bolotin, D. Nellans, J. Luitjens, N. Sakharnykh, P. Wang, P. Micikevicius, A. Scudiero, S. W. Keckler, and W. J. Dally. Scaling the power wall: a path to exascale. In Proceedings of the ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (SC), pages 830–841. IEEE, 2014. doi:10.1109/SC.2014.73. → page 28

[244] Vivante. Composition processing cores (CPC), 2016. URL http://www.vivantecorp.com/index.php/en/technology/composition.html. [Online; accessed 1-Aug-2019]. → pages 140, 159

[245] D. W. Wall. Limits of instruction-level parallelism. In Proceedings of the ACM Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 176–188, 1991. ISBN 0-89791-380-9. doi:10.1145/106975.106991. → page 1

[246] T. Wang, A. E. Gruber, and S. Khandelwal. Selectively merging partially-covered tiles to perform hierarchical z-culling, Apr. 12 2016. US Patent 9,311,743. → page 49

[247] Z. Wang, J. Yang, R. Melhem, B. Childers, Y. Zhang, and M. Guo. Simultaneous multikernel GPU: Multi-tasking throughput processors via fine-grained sharing. In Proceedings of the IEEE International Symposium on High-Performance Computer Architecture (HPCA), pages 358–369. IEEE, 2016. doi:10.1109/HPCA.2016.7446078. → page 90

[248] D. Wentzlaff, P. Griffin, H. Hoffmann, L. Bao, B. Edwards, C. Ramey, M. Mattina, C.-C. Miao, J. F. Brown III, and A. Agarwal. On-chip interconnection architecture of the tile processor. IEEE Micro, 27(5):15–31, 2007. doi:10.1109/MM.2007.4378780. → page 93

[249] T. Wiegand, G. Sullivan, G. Bjontegaard, and A. Luthra. Overview of the H.264/AVC video coding standard. IEEE Transactions on Circuits and Systems for Video Technology, 13(7):560–576, 2003. doi:10.1109/TCSVT.2003.815165. → page 167

[250] C. M. Wittenbrink, H. P. Moreton, D. A. Voorhies, J. S. Montrym, and V. S. Parikh. Clipping with addition of vertices to existing primitives, Nov. 6 2007. US Patent 7,292,242. → pages 23, 25

[251] D. H. Woo and H.-H. S. Lee. Extending Amdahl’s law for energy-efficient computing in the many-core era. IEEE Computer, 41(12):24–31, 2008. doi:10.1109/MC.2008.494. → page 2

[252] S. Wu, C. Y. Lin, M. C. Chiang, J. J. Liaw, J. Y. Cheng, S. H. Yang, C. H. Tsai, P. N. Chen, T. Miyashita, C. H. Chang, V. S. Chang, K. H. Pan, J. H. Chen, Y. S. Mor, K. T. Lai, C. S. Liang, H. F. Chen, S. Y. Chang, C. J. Lin, C. H. Hsieh, R. F. Tsui, C. H. Yao, C. C. Chen, R. Chen, C. H. Lee, H. J. Lin, C. W. Chang, K. W. Chen, M. H. Tsai, K. S. Chen, Y. Ku, and S. M. Jang. A 7nm CMOS platform technology featuring 4th generation FinFET transistors with a 0.027 µm² high density 6-T SRAM cell for mobile SoC applications. In IEEE International Electron Devices Meeting (IEDM), pages 2.6.1–2.6.4, 2016. doi:10.1109/IEDM.2016.7838333. → page 162

[253] P. Yedlapalli, N. C. Nachiappan, N. Soundararajan, A. Sivasubramaniam, M. T. Kandemir, and C. R. Das. Short-circuiting memory traffic in handheld platforms. In Proceedings of the ACM/IEEE International Symposium on Microarchitecture (MICRO), pages 166–177. IEEE, 2014. doi:10.1109/MICRO.2014.60. → pages 5, 89, 119

[254] J. Yin, O. Kayiran, M. Poremba, N. E. Jerger, and G. H. Loh. Efficient synthetic traffic models for large complex SoCs. In Proceedings of the IEEE International Symposium on High-Performance Computer Architecture (HPCA), pages 297–308. IEEE, 2016. doi:10.1109/HPCA.2016.7446073. → page 98

[255] M. T. Yourst. PTLsim: A cycle accurate full system x86-64 microarchitectural simulator. In Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pages 23–34. IEEE, 2007. doi:10.1109/ISPASS.2007.363733. → page 88

[256] G. L. Yuan, A. Bakhoda, and T. M. Aamodt. Complexity effective memory access scheduling for many-core accelerator architectures. In Proceedings of the ACM/IEEE International Symposium on Microarchitecture (MICRO), pages 34–44. ACM, 2009. doi:10.1145/1669112.1669119. → page 70

[257] V. Zakharenko, T. Aamodt, and A. Moshovos. Characterizing the performance benefits of fused CPU/GPU systems using FusionSim. In Proceedings of the IEEE Conference on Design Automation and Test in Europe (DATE), pages 685–688, 2013. doi:10.7873/DATE.2013.148. → page 88

[258] A. K. Ziabari, J. L. Abellán, Y. Ma, A. Joshi, and D. Kaeli. Asymmetric NoC architectures for GPU systems. In ACM/IEEE International Symposium on Networks-on-Chip (NoCs), page 25. ACM, 2015. doi:10.1145/2786572.2786596. → pages 93, 98, 166

Appendix A

Emerald Software Architecture and Implementation

Section 2.5 describes Emerald's software design at a high level. In this appendix, we discuss some of the detailed aspects of Emerald's software design and implementation. Figure 2.28 shows the overall architecture of Emerald's standalone and full-system modes. This appendix provides more details on the Emerald components shown in Figure 2.28, such as the TGSItoPTX translation, fast-forwarding, checkpointing, memory management, and the communication and state handling of full-system simulation.

A.1 Graphics Standalone Mode

This mode uses only the GPU and main memory models. OpenGL or OpenGL ES calls are played from a pre-recorded trace; a modified version of the open-source tool apitrace [21] is used to record and play graphics calls. apitrace was modified to be usable as a shared library, and stubs were added so that it can be invoked from gem5.

Figure A.1 shows an overview of the standalone mode. A pre-recorded OpenGL or OpenGL ES trace 1 is parsed and played by a modified version of apitrace 2 . The OGL/OGLES calls generated by apitrace in 2 are captured by a modified version of Mesa 3D in 3 . Mesa communicates with gem5 through a set of calls and structures (i.e., the Mesa-GPGPUSim interface 4 ). Mesa-GPGPUSim initializes and invokes graphics simulation on the gem5+GPGPU-Sim integration, which we refer to as gem5-graphics 5 .
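To make the stub-based invocation concrete, the following C++ sketch shows how replayed calls could be handed from the tracing thread to the simulation side through a registered callback. The names (GLCall, TracePlayer, and the handler type) are hypothetical illustrations and do not appear in the Emerald sources.

    #include <functional>
    #include <string>
    #include <vector>

    // One decoded OpenGL/OpenGL ES call replayed from the trace.
    struct GLCall {
        std::string name;          // e.g., "glDrawArrays"
        std::vector<long> args;    // flattened arguments
    };

    // gem5 registers a handler; Mesa and the Mesa-GPGPUSim interface
    // sit behind it and decide whether the GPU model is invoked.
    class TracePlayer {
        std::function<void(const GLCall&)> handler_;
    public:
        void setHandler(std::function<void(const GLCall&)> h) {
            handler_ = std::move(h);
        }
        void play(const std::vector<GLCall>& trace) {
            for (const auto& call : trace)
                handler_(call);    // each replayed call is captured by Mesa
        }
    };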


Figure A.1: An overview of the graphics standalone simulation mode.

As shown in Figure 2.28, Emerald also supports simulating graphics in full-system mode running under Android (details in Section A.5). The standalone mode and the full-system mode share some components, namely Mesa 3D 3 , the Mesa-GPGPUSim interface 4 , and gem5-graphics 5 . In the following sections, we describe some of the features common to the two modes before discussing the full-system mode in Section A.5.


Figure A.2: Fast-forwarding and ROI simulation

A.2 Fast-forwarding and ROI Simulation

Both the standalone and the full-system modes support fast-forwarding the simulation to a Region of Interest (ROI). An ROI is defined at draw-call granularity; an ROI defined by [SF:SD, EF:ED] has a start frame (SF) and start draw call (SD), and an end frame (EF) and end draw call (ED).

In the fast-forwarding mode, graphics calls, except draw calls, continuously update the state. Draw calls, however, can be either functionally simulated or ignored. Figure A.2 shows how OpenGL calls within and outside the simulation ROI are handled. Fast-forwarding with functional rendering (Figure A.2 1 ) can be time consuming when a large number of frames precede the ROI, so the user can choose to disable functional simulation ( 2 ) to speed up trace playing in the standalone mode or upon checkpoint loading in the full-system mode (details in Section A.5.3).

In some cases, however, disabling functional rendering when fast-forwarding may lead to incorrect results when rendering frames in the ROI. For example, incorrect results may appear when textures are created from render targets using draw calls that precede the ROI. While incorrect functionality does not affect performance simulation in most cases, where textures are simply loaded and applied, there are some cases where errors can propagate to the performance model.

An example of that is a fragment shader program that uses texel values to determine which lighting effects should be applied. In conclusion, a user should be aware of possible issues when choosing to speed up fast-forwarding by disabling functional rendering.

    State          Description
    Depth          Depth test state (enabled or disabled) and the depth function type.
    Blending       Blending state (enabled or disabled) and the blending function type.
    Textures       Texture dimensions and format.
    Color buffer   Color buffer dimensions and format.
    Depth buffer   Depth buffer dimensions and format.

Table A.1: Example graphics state data queried by gem5-graphics.
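The decision flow of Figure A.2 can be summarized in a few lines of C++. The sketch below is a paraphrase of the figure using hypothetical names (DrawId, ROI, renderTimed, renderFunctional); it is not the actual Emerald code.

    #include <cstdint>

    struct DrawId { uint64_t frame, draw; };      // position of a draw call
    struct ROI    { DrawId start, end; };         // [SF:SD, EF:ED]

    bool roiContains(const ROI& r, const DrawId& d); // assumed helper
    void renderTimed(const DrawId& d);       // performance-model rendering
    void renderFunctional(const DrawId& d);  // Mesa functional rendering

    // Called once per draw call; non-draw calls always update Mesa state.
    void handleDrawCall(const ROI& roi, const DrawId& d,
                        bool functionalFastForward) {
        if (roiContains(roi, d)) {
            renderTimed(d);            // inside the ROI: full simulation
        } else if (functionalFastForward) {
            renderFunctional(d);       // Figure A.2, path 1
        }
        // else: ignore the draw call (Figure A.2, path 2); faster, but
        // ROI frames may render incorrectly if they depend on this draw.
    }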

A.3 Graphics Memory Management

On real systems, GPU hardware drivers are responsible for managing graphics memory. In Emerald, Mesa manages the graphics state and performs the corresponding memory management operations. gem5-graphics (the integrated performance model), however, cannot use Mesa's memory directly for performance modeling; instead, gem5-graphics queries the graphics state from Mesa and allocates and initializes simulation memory before each draw call. This section discusses how Emerald manages and uses graphics memory.

A.3.1 Memory Initialization

Graphics state is initialized for the performance model before each draw call, as state-modifying calls can occur between draw calls. All relevant graphics state data are copied from Mesa to the performance model's memory before launching the timing simulation. The performance model can be configured to simulate rendering a range of frames, part of a frame, or even a single draw call. State data needs to be copied from Mesa to the performance model to represent the state of the model up to the chosen start draw call; this includes partially rendered color and depth buffers for partially rendered frames. Some state data are queried (as shown in Table A.1) to determine how the performance model's memory should be initialized. Finally, at the end of each draw call, the performance model copies any updated state data from the performance model's memory back to Mesa.

State initialization operations are functional and do not involve performance simulation since, on a real system, such data would already have been established by the driver before a draw call is issued. In effect, before invoking the performance model for graphics rendering, the memory is set up as if the performance model had performed all previous rendering operations up to the current draw call.

Table A.2 summarizes the state data that gem5-graphics initializes in memory for each draw call. Each type of state data is copied to the appropriate GPU memory type; Table A.2 shows the memory type according to the GPGPU memory taxonomy used by NVIDIA and GPGPU-Sim [1]. The memory type determines which path a memory request takes and which cache is used to hold it.

    State                          Memory Type   Description
    Textures                       Texture       Textures and any associated mipmaps for all textures that can be accessed from vertex or fragment shaders.
    Color buffer                   Global        Currently active render target color buffers.
    Depth buffer                   Global        Currently active render target depth buffer (the Z-buffer).
    Shader programs                Instruction   PTX representation of vertex and fragment shaders (translated from TGSI as discussed in Section A.7).
    Vertex and fragment shader     Constant      Uniforms are global variables that are fixed across a draw call [118]; these include transformation matrices (model, view, and projection), lights, etc.
    uniforms and constants

Table A.2: List of graphics state data used by gem5-graphics.
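As an illustration of per-draw-call initialization, the sketch below copies each class of state from Table A.2 into the corresponding simulated memory space. The MemSpace enum and the copyToSim call are hypothetical stand-ins for the Mesa-GPGPUSim interface, not its actual API.

    #include <cstddef>
    #include <vector>

    // GPGPU-Sim memory spaces used in Table A.2.
    enum class MemSpace { Texture, Global, Instruction, Constant };

    // Assumed helper: copies a host (Mesa) buffer into the performance
    // model's memory, tagged with the given memory space.
    void copyToSim(MemSpace space, const void* src, std::size_t bytes);

    struct MesaDrawState {   // state gathered from Mesa before a draw call
        std::vector<char> textures, colorBuf, depthBuf, ptxShaders, uniforms;
    };

    void initDrawCallState(const MesaDrawState& s) {
        copyToSim(MemSpace::Texture,     s.textures.data(),   s.textures.size());
        copyToSim(MemSpace::Global,      s.colorBuf.data(),   s.colorBuf.size());
        copyToSim(MemSpace::Global,      s.depthBuf.data(),   s.depthBuf.size());
        copyToSim(MemSpace::Instruction, s.ptxShaders.data(), s.ptxShaders.size());
        copyToSim(MemSpace::Constant,    s.uniforms.data(),   s.uniforms.size());
        // After the draw call, updated buffers are copied back to Mesa.
    }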

A.3.2 Graphics Memory Allocation


Figure A.3: Graphics memory layout

A graphics-specific memory allocator manages the memory space of the performance model. On real systems, GPU drivers manage memory allocations; as an alternative, a graphics memory allocator was created to manage graphics memory for both the standalone and the full-system modes.

Figure A.3 shows the memory layout for graphics simulation. In the full-system mode, the performance model's main memory 1 , i.e., the off-chip DRAM, consists of two parts: an OS-managed memory 2 and a graphics memory 3 . The graphics memory allocator manages the graphics memory using fixed-size block allocation (memory pools), as shown in 4 . It uses a best-fit matching algorithm with three pools (block sizes) 4 : Large (L), Medium (M), and Small (S), where the block sizes are configurable. Block pools are sized to serve the following types of allocations (a minimal sketch of the allocation scheme follows the list):

• Small blocks are used to hold transformation matrices, light states, uniform variables, and other small state data (~512–1024 bytes).

• Medium blocks are used to hold vertex buffers and shader program code.

• Large blocks are used to hold large vertex buffers, textures, and color and depth buffers.
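To make the pool scheme concrete, here is a minimal best-fit sketch. The pool sizes and names are illustrative only; in Emerald the block sizes are configuration options, and the allocator hands out simulated, not host, addresses.

    #include <cstdint>
    #include <map>
    #include <vector>

    enum class Pool { Small, Medium, Large };  // ordered smallest to largest

    struct BlockPool {
        uint64_t blockSize;                 // fixed size of each block
        std::vector<uint64_t> freeBlocks;   // simulated base addresses
    };

    class GraphicsAllocator {
        std::map<Pool, BlockPool> pools_;   // iterates Small -> Large
    public:
        // Best fit across fixed-size pools: take a block from the smallest
        // pool whose block size can hold the request.
        uint64_t alloc(uint64_t bytes) {
            for (auto& [kind, pool] : pools_) {
                if (bytes <= pool.blockSize && !pool.freeBlocks.empty()) {
                    uint64_t addr = pool.freeBlocks.back();
                    pool.freeBlocks.pop_back();
                    return addr;
                }
            }
            return 0;                       // graphics memory exhausted
        }
        void release(Pool kind, uint64_t addr) {
            pools_[kind].freeBlocks.push_back(addr);
        }
    };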

Finally, since the graphics state can change between draw calls, GPU caches (including the read-only texture, instruction, and constant caches) are flushed after each draw call to avoid using stale state data. We would like state updates and cache flushes to happen at a finer granularity, but we leave such optimizations for future work.

A.4 Graphics Memory Virtualization, TLBs and Page Fault Handling in the Full-system Mode

As detailed in Section A.3.2, graphics memory is managed by a graphics memory allocator. As a result, graphics memory accesses are handled differently from regular memory accesses. When a TLB miss occurs for a request from a GPU application with OS-allocated memory, the miss is handled by the performance model by walking the OS page tables [190], modeling the generated memory requests. For graphics applications, on the other hand, the performance model estimates the delay of an OS page table walk on a TLB miss using a configurable delay value. Graphics memory allocations use the same value for physical and virtual addresses, so these TLB accesses are carried out only to account for the timing of memory address translation.

Memory requests from GPU graphics applications never trigger a page fault, as graphics memory is always pinned to main memory. For GPGPU applications, however, page faults are only handled under the x86 architecture [190]; for a system running on the ARM ISA, page faults from GPGPU applications are avoided by pinning all CUDA memory allocations. Figure A.4 summarizes GPU TLB miss handling.


Figure A.4: GPU TLB miss handling
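The flow in Figure A.4 can be expressed compactly as follows. kGfxBase, kGfxSize, and the helper functions are hypothetical names used for illustration; they are not Emerald's actual parameters.

    #include <cstdint>

    // Assumed model parameters and helpers.
    const uint64_t kGfxBase = 0x100000000ULL;   // graphics memory range
    const uint64_t kGfxSize = 1ULL << 30;
    uint64_t walkOsPageTables(uint64_t vaddr);  // timed page-table walk
    void stall(uint64_t cycles);                // charge a fixed latency

    uint64_t handleGpuTlbMiss(uint64_t vaddr, uint64_t tlbMissDelay) {
        bool inGfxRange = vaddr >= kGfxBase && vaddr < kGfxBase + kGfxSize;
        if (inGfxRange) {
            // Graphics allocations are pinned and identity-mapped: charge
            // the configurable delay, then virtual == physical.
            stall(tlbMissDelay);
            return vaddr;
        }
        // OS-allocated memory: walk the OS page tables, modeling the
        // generated memory requests; faults are forwarded to the OS.
        return walkOsPageTables(vaddr);
    }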

A.5 Graphics Simulation on Android

To simulate graphics under the Android OS, a system-side OpenGL implementation has to be provided in place of a driver-level OpenGL. We used simulation pseudo-instructions¹ and the Android emulator guest- and host-side OpenGL acceleration libraries [74] to implement the OpenGL API on Android. To use the Android Emulator libraries to support graphics simulation, changes had to be made to the guest-side [76] (Android) and the host-side [77] (gem5-graphics) libraries.

Figure A.5 shows the Android-side changes to support graphics simulation. In 1 , Android applications issue OpenGL ES calls, which are captured by the OGLES API wrappers in 2 . These wrappers invoke the guest-side emulator libraries, which encode the data in each OGLES call into a byte stream, i.e., the graphics stream, that feeds the Gem5Pipe in 4 . Finally, Gem5Pipe feeds this byte stream, which carries the OGLES rendering commands, to the host-side (gem5-graphics) 5 .

¹ Special instructions that are added to the Instruction Set Architecture (ISA) to facilitate simulation. For example, pseudo-instructions are used to mark simulation ROIs, trigger checkpointing, or print statistics at prespecified simulation points.

Figure A.5: Guest-side (Android) graphics flow

The Gem5Pipe copies graphics stream data to the gem5-graphics side using gem5 pseudo-instructions; the performance model assumes that pseudo-instruction operations execute in zero time (i.e., with no performance penalty). Once OGLES calls are handled on the host-side, any responses (return values) are carried back to the calling application, as shown in Figure A.5. Note that each application thread context on Android has its own copy of the stack shown in Figure A.5 ( 1 – 4 ). This allows tracking the state of multiple applications (processes), where each application can concurrently have multiple render threads.
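As an illustration of how the guest-side hands the encoded stream to the simulator, consider the following sketch. The gem5_gfx_rw wrapper is a hypothetical name standing in for the actual pseudo-instruction mechanism; the page-size chunking and page touching mirror the protocol described in Section A.5.2.

    #include <algorithm>
    #include <cstddef>
    #include <cstdint>

    // Hypothetical wrapper around a gem5 pseudo-instruction; pseudo-
    // instruction operations execute in zero simulated time.
    extern "C" void gem5_gfx_rw(uint64_t op, const void* buf, uint64_t len);

    enum GfxOp : uint64_t { BLK = 0, RD = 1, RW = 2, DBG = 3 };

    // Send one encoded OGLES command buffer to the host side.
    void gem5PipeWrite(const uint8_t* stream, std::size_t len) {
        const std::size_t kPage = 4096;
        for (std::size_t off = 0; off < len; off += kPage) {
            std::size_t chunk = std::min(kPage, len - off);
            volatile uint8_t touch = stream[off];  // fault the page in now,
            (void)touch;                           // not during the copy
            gem5_gfx_rw(RW, stream + off, chunk);
        }
    }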

Figure A.6: Host-side (gem5) graphics flow

Figure A.6 shows how the gem5-graphics side handles a graphics stream. Multiple threads can stream to the simulator side at any given point in time 1 . All graphics streams are handled by a single GraphicsStream object 2 on the gem5 side (more on the Gem5Pipe communication protocol in Section A.5.2). GraphicsStream forks the stream for checkpointing 3 (more on that later) and for the modified Android Emulator libraries 4 . The Android Emulator libraries 4 decode the graphics stream and invoke the encoded OGLES calls. Each graphics thread context in 1 on the Android side is mapped to a corresponding graphics thread context on the host-side, as shown in 5 . Graphics contexts in 5 communicate with the Mesa 3D library 6 , which in turn communicates with the Mesa-GPGPUSim interface 7 that manages and invokes graphics simulation.

A.5.1 Command Execution and Synchronization

The performance model supports running a single context at a time. The OpenGL specifications [120, 123] allow implementations to buffer multiple commands in a command queue before sending them for execution; glFlush and glFinish are used to force all previously issued commands to complete. As a result, non-drawing API calls (state updates) that are executed on the host-side return immediately from the guest-side point of view, while draw calls (e.g., glDrawArrays, glDrawElements) are executed immediately once issued. Subsequent draw calls, however, are blocked from returning until previous draw calls have completed; effectively, this limits command buffering, but it is within what the OpenGL specifications allow.

It is worth mentioning that, in practice, and because the OpenGL specifications do not guarantee non-blocking command issuing, rendering is executed by a dedicated thread separate from the main application thread [22, 66, 239].

A.5.2 The Gem5Pipe

Gem5Pipe (Figure A.5 4 ) communicates graphics packets between Android and the host-side (gem5-graphics). Gem5Pipe has four types of messages: block (BLK), read (RD), write (RW), and debug (DBG).

Figure A.7: Gem5Pipe states. (a) Gem5Pipe sender states; (b) Gem5Pipe receiver states.

RD and RW commands are issued when reading/writing the blocks of data that represent OpenGL draw calls coming from the OpenGL ES wrapper (Figure A.5 2 ) and their corresponding return values. RW commands are initiated first and are followed by RD commands when return values are expected (depending on the API call). Larger RD/RW commands are transmitted in page-size chunks, and each page is touched before sending to avoid page faults on the simulator when copying messages from the guest-side (Android) to the host-side (gem5-graphics).

Before issuing RD/RW commands, a BLK command checks whether further commands can be processed (i.e., whether the GPU is available and not busy rendering another draw call). If the GPU is available, RD/RW commands are issued and processed as long as no draw calls are initiated (i.e., only state update calls).

To facilitate communication between the host and guest sides, Gem5Pipe tracks two states, one for each side. Figure A.7 shows the Gem5Pipe sender and receiver states. In Figure A.7a, a sender starts by waiting ( 1 ) for high-level read/write commands from an OpenGL ES wrapper (Figure A.5 2 ). When a high-level command is received 2 , the sender side issues BLK commands to poll the GPU state; the rendering thread is blocked at this point, and further commands are allowed once the GPU becomes available 3 .

Figure A.7b shows the Gem5Pipe receiver states. The receiver has only two states. In state 1 , the receiver waits for RD/RW commands. If a received RD/RW command contains only state update calls (no drawing), the receiver remains in the receiving state 1 . However, when the received RD/RW command is a drawing command, the receiver blocks further commands from all senders until the GPU completes processing the received drawing command ( 2 ). Once the GPU completes rendering, the receiver returns to the receiving state 1 . Note that each render thread has a separate sender state, while there is only one receiver state on the simulator side. Finally, DBG commands provide debug data (e.g., message sizes, guest-side failures) to the simulator and do not affect the sender/receiver states.
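The sender side of Figure A.7a reduces to a poll-then-issue loop. This sketch uses hypothetical names (pollGpu, issueCommand) and is not the Emerald implementation.

    #include <cstdint>

    enum class GpuStatus { Available, Busy };
    GpuStatus pollGpu();                 // issues a BLK command (state 2)
    void issueCommand(uint64_t op, const void* buf, uint64_t len);

    // One sender exists per render thread. The thread blocks until the
    // GPU can accept commands, then issues the RD/RW command (state 3).
    void sendBlocking(uint64_t op, const void* buf, uint64_t len) {
        while (pollGpu() == GpuStatus::Busy) {
            // waiting: the rendering thread is blocked here
        }
        issueCommand(op, buf, len);
    }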

A.5.3 Checkpointing

Graphics checkpointing allows Emerald to create and restore gem5 checkpoints that include graphics; thus, the graphics state on both the Android side and the gem5 side needs to be restored when resuming simulation from a checkpoint. The graphics state should be captured such that, when restored, the state on the host-side (i.e., gem5 and Mesa) matches the state on the guest-side (i.e., Android). The state on the Android side is captured as part of the gem5 state (CPU and memory system state); this leaves the host-side state (gem5 and Mesa) to be captured and restored.


Figure A.8: Guest-side and host-side graphics state changes

In deciding how to capture and restore the graphics state on the host-side, the following options were considered:

• The first method, similar to the approach used in gem5, is to capture the simulation state by storing the values related to graphics contexts in Mesa, in GraphicsStream (Figure A.6 2 ), and in the GPU (i.e., the GPGPU-Sim state).

• The second method is to record the OpenGL commands that, when replayed upon loading a checkpoint, re-establish a host-side graphics state corresponding to the graphics state on the guest-side.

The main advantage of the first approach is that the size of a checkpoint is independent of the number of commands used to reach the captured graphics state. In the second approach, on the other hand, the size of a checkpoint increases linearly with the number of commands. Although the first approach is more space-efficient, we used the second approach instead. The main reason for this choice is that the first approach requires a significant amount of work and source-code refactoring in both Mesa and GPGPU-Sim. In addition, such an approach would have multiple points of failure and would require a significant amount of testing and verification.

To understand how graphics commands relate to the graphics state, we use the example in Figure A.8, which shows how the graphics state is updated on the guest and host sides. At 1 , the guest-side (i.e., Figure A.6 1 ) and the host-side (i.e., Figure A.6 2 – 7 ) are at some initial state (G0 and H0). When the guest-side issues an OpenGL call, its state progresses (to state G1 2 ); now the guest-side is expecting a response for the issued call. Once the host-side receives the call, its state is updated from H0 to H1 in 3 . Once a response is issued from the host, the host-side enters a new state (H2 4 ). Finally, when the guest-side receives the response, it advances to state G2 in 5 . Note that this example assumes that the OpenGL call causes a state change; not all calls do (e.g., query calls).

Checkpointing in Emerald works by recording only the graphics calls (Figure A.8 A ) and then replaying the OpenGL commands on the host-side. As noted earlier, the corresponding state on the guest-side is captured as part of the system state (i.e., Android); thus, no special handling is required, and all responses generated when re-establishing the graphics state upon checkpoint loading (i.e., Figure A.8) are ignored for that purpose. The next section provides more details on how graphics calls are recorded as a stream of Gem5Pipe commands.

Checkpoint timing (specifying when a checkpoint should be taken) is set using gem5's global cycle value, or using frame and draw call numbers, i.e., checkpointing at a certain frame and a certain draw call within that frame.

A.5.3.1 Checkpoint Recording

The checkpointing module in Figure A.6 3 snoops the Gem5Pipe commands and responses between the GraphicsStream module 2 and the emulator libraries 4 . Figure A.9 shows how the checkpointing module handles each Gem5Pipe command type.


Figure A.9: Graphics checkpoint recording flow

If a command is an RW command 1 , then the command info and the data it carries are added to the checkpoint data. On the other hand, if the command is an RD command or a recordable DBG command 2 (a selected number of debug commands that facilitate debugging), only the command information, without any associated data, is stored. We ignore non-RW command data as they are not required to restore the host-side state.

Graphics checkpoint commands are stored in a temporary file while simulation or fast-forwarding (Section A.2) is in progress. When a checkpoint is captured, the graphics checkpoint commands are fetched from this temporary file to create the graphics part of the newly created checkpoint.

Table A.3 lists the data stored with each command type. PIDs and TIDs are used to distinguish different contexts, where each unique PID/TID combination is allocated a separate context on the host-side (as shown in Figure A.6). For RW commands, the buffer length and buffer data are used when reloading a checkpoint, as explained in the next section. For RD commands, the buffer length is used to determine how much data is expected to be received when reloading a checkpoint.

    Command Data    Description
    Command Type    RD or RW
    PID             Process ID
    TID             Thread ID
    Buffer Length   RD or RW command data buffer length
    Buffer Data     RW command data buffer content

Table A.3: Checkpoint command data
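The record format of Table A.3 and the decision in Figure A.9 can be sketched as follows. The field layout and names are illustrative, not Emerald's exact on-disk format.

    #include <cstdint>
    #include <fstream>
    #include <vector>

    enum class CmdType : uint64_t { BLK, RD, RW, DBG };

    struct CkptCommand {                 // one entry per Table A.3
        CmdType  type;                   // RD or RW (recordable DBG kept too)
        uint64_t pid, tid;               // identify the graphics context
        uint64_t bufLen;                 // payload length
        std::vector<uint8_t> bufData;    // payload (RW commands only)
    };

    // Figure A.9: RW commands are stored with their data; RD and
    // recordable DBG commands are stored without it.
    void record(std::ofstream& out, const CkptCommand& c) {
        out.write(reinterpret_cast<const char*>(&c.type),   sizeof c.type);
        out.write(reinterpret_cast<const char*>(&c.pid),    sizeof c.pid);
        out.write(reinterpret_cast<const char*>(&c.tid),    sizeof c.tid);
        out.write(reinterpret_cast<const char*>(&c.bufLen), sizeof c.bufLen);
        if (c.type == CmdType::RW)
            out.write(reinterpret_cast<const char*>(c.bufData.data()),
                      static_cast<std::streamsize>(c.bufData.size()));
    }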

A.5.3.2 Checkpoint Restoring


Figure A.10: Graphics checkpoint restore flow

Figure A.10 shows the flow of checkpoint restoring. Upon loading from a checkpoint, graphics checkpoint commands are fetched from the checkpoint file 1 . In 2 , the command PID and TID are checked against the PIDs and TIDs of the existing graphics contexts on the host-side. If a new PID/TID combination is observed, a new graphics context is created with its own dedicated GraphicsStream instance ( 3 ); otherwise, if a context already exists for the PID/TID combination, we go directly to step 4 . In 4 , we check whether the graphics command is an RW command; in that case, the command is sent along with the corresponding buffer data using the corresponding GraphicsStream object 5 . If the command is an RD command 6 , the command is sent and its response is fetched 7 . Since the guest-side state is already established within the restored system state, the response data are not used, and RD commands merely advance the guest-side state; only the response size is checked against the size recorded in the checkpoint to confirm that it is the expected response. Once the response is fetched and its size checked, the next command in line is fetched 1 .
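Putting Figure A.10 into code, the restore loop looks roughly like the sketch below. It reuses the hypothetical CkptCommand record from the previous sketch; fetchNext, createContext, writeTo, and readFrom are assumed helpers, not Emerald's API.

    #include <cassert>
    #include <cstdint>
    #include <map>
    #include <utility>

    struct HostContext;                           // host-side GLES context
    HostContext* createContext(uint64_t pid, uint64_t tid);
    void     writeTo(HostContext*, const CkptCommand&);  // replay RW + data
    uint64_t readFrom(HostContext*, const CkptCommand&); // replay RD, returns size
    bool     fetchNext(CkptCommand&);             // next checkpointed command

    void restoreGraphicsState() {
        std::map<std::pair<uint64_t, uint64_t>, HostContext*> contexts;
        CkptCommand c;
        while (fetchNext(c)) {
            auto key = std::make_pair(c.pid, c.tid);
            if (contexts.find(key) == contexts.end())   // new PID/TID pair
                contexts[key] = createContext(c.pid, c.tid);
            if (c.type == CmdType::RW) {
                writeTo(contexts[key], c);  // re-issue the call with its data
            } else {
                // RD: the response only advances the guest-side state;
                // just check that its size matches the recorded one.
                uint64_t got = readFrom(contexts[key], c);
                assert(got == c.bufLen);
            }
        }
    }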



Figure A.11: Gem5Pipe in Host-Render mode

A.5.4 Host-Render Mode

Gem5Pipe can also be used to render graphics only functionally on the host machine, using either GLES or standard OpenGL. Rendering on host hardware can be useful when graphics performance data are of no interest or when developing and debugging non-graphics simulation features. Previous work by ARM [60] aimed to provide an infrastructure for running full-system simulations of Android without guest-side software rendering; their work shows that using software rendering (on the CPU) as a replacement for hardware GPU modeling can "severely misrepresent the behavior of the system". Their NoMali tool provides a stub GPU that captures and voids all graphics calls, thereby eliminating the noise introduced by software rendering. The host-render mode provides the same functionality as NoMali; in addition, and unlike NoMali, this mode allows the user to monitor progress by rendering on the host-side, providing extra insight into the state of the system under simulation.

In the host-render mode, Mesa does not invoke the GPU model and renders draw calls using Mesa's softpipe implementation. Alternatively, the host-side native renderer (i.e., a hardware GPU) can be used to render the guest-side graphics instead of Mesa's softpipe. Using the host's native renderer has the added benefit of rendering workloads with OpenGL features that are not supported by Mesa. Figure A.11 shows the flow when the host-side native renderer is used. GLES calls are translated by the emulator libraries to standard OpenGL calls (including shader programs) 1 . In 2 , a standard OpenGL context is created for each guest-side GLES context 3 . Finally, all rendering is performed by the host-side OpenGL implementation (including any hardware support, if available) ( 4 ).

A.5.4.1 Host-Render Mode Checkpoint Compatibility

Host-render mode checkpoints can be used in simulation after loading when Mesa is used for host rendering. However, when rendering on the host hardware (Figure A.11), checkpoints taken in this way are not necessarily compatible with the simulation mode. The reason is that the OpenGL calls may differ between Mesa and the host-side graphics stack, as each may run a different version of OpenGL; checkpointing requires a compatible OpenGL version to successfully re-establish the graphics state on both the guest-side and the host-side.

A.6 Deterministic Performance Modeling with Multi-Threaded Simulation

Most architecture performance simulators use a single thread to guarantee deterministic execution that provides reproducible results. Emerald, however, utilizes several code bases that are designed with performance, rather than determinism, in mind, including Mesa, apitrace, and the Android emulation libraries. This section discusses the design choices we made to enable deterministic performance modeling with Emerald while utilizing these open-source tools.

To guarantee deterministic execution, the first rule when simulating the performance of graphics calls in Emerald is to process draw calls on a first-come-first-serve basis. Draw calls from any single source are not buffered and are served as soon as they are received. When multiple contexts try to send draw calls, as in the Android full-system mode, draw calls from the first context are served while the others are blocked until the current draw call completes execution on the performance model; subsequent draw calls are blocked at the GraphicsStream level (Figure A.6 2 ). Since both simulators, gem5 and GPGPU-Sim, run on the same thread, the timing of each draw call relative to each subsequent draw call is deterministic. Regardless of the behavior of the Mesa thread, draw calls are serviced in the same order, at the same clock cycle, every time the same performance model and configurations are used.

In the standalone mode, the simulation runs in two threads, as shown in Figure A.1. The first thread is the apitrace thread, which invokes the trace API calls and also holds the Mesa graphics state, while the second thread runs the performance model. Unlike the full-system mode, the standalone mode has multiple entry points (i.e., different OpenGL draw calls) through which API calls are sent from apitrace to Mesa and the performance model. To prevent non-deterministic behavior, all OpenGL calls are blocked at Mesa while the performance model is running.


A wrapper is added around each OpenGL, GLX, and EGL call to block calls while the performance model is running another draw call. The only limitation of this setup is that the performance model may report a different number of idle cycles between draw calls; however, this statistic is irrelevant, as we are only interested in performance data during draw calls (similar to kernel execution time for GPGPU applications).
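A minimal version of such a wrapper, assuming a single global lock guards the performance model (the actual Emerald mechanism may differ), could look like this:

    #include <mutex>

    std::mutex gModelBusy;      // held while a draw call is being simulated
    void forwardToMesa(/* decoded GL/GLX/EGL call */);  // assumed dispatch

    // Wrapper installed around every OpenGL, GLX, and EGL entry point:
    // calls arriving while the performance model simulates a draw call
    // block here, so they are serviced one at a time, in arrival order.
    void wrappedGlCall(/* decoded call */) {
        std::lock_guard<std::mutex> hold(gModelBusy);
        forwardToMesa();
    }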

A.7 TGSI to PTX Translation

In this section, we briefly discuss how PTX shaders are generated from GLSL shaders using the TGSItoPTX tool shown in Figure 2.28. OpenGL shaders, i.e., GLSL shaders, are processed by Mesa's compiler and translated into Tungsten Graphics Shader Infrastructure (TGSI) shaders. Since TGSI is an assembly-like representation, Emerald takes advantage of that and uses the TGSI shaders to obtain the corresponding NVIDIA PTX code, which is the format supported by the GPU performance model. Figure A.12 shows the pipeline that a GLSL shader goes through.

Figure A.12: Translating GLSL shaders to PTX

In 1 , the Mesa compiler processes the C-like GLSL shaders and translates them to the TGSI format. In 2 , the Mesa-GPGPUSim interface processes the TGSI shaders for translation into PTX and, at the same time, extracts various data needed for properly setting up and using the PTX shaders. These data include, for example, the shader register usage, which determines how many threads can be assigned concurrently to the same GPU core. The interface also sends a number of arguments to the TGSItoPTX tool; these arguments determine what PTX code should be generated based on the current graphics state and include, for example, texture names and filtering modes, depth and blending status, and the format of constant inputs.

Finally, we would like to note the difference between TGSI and PTX. TGSI uses vector instructions, with four elements per instruction, whereas PTX is a scalar instruction set. As a result, the mapping between the two instruction sets is not 1:1, and special attention has to be paid when translating some instructions, such as dot-product operations. In addition, we make sure to map fragments to GPU threads according to their 2D position, such that each quad of adjacent fragments is executed on the same warp, since such a quad would have been processed by the same set of TGSI vector instructions.
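As a concrete example of the vector-to-scalar expansion, the sketch below emits scalar PTX for a three-component TGSI dot product. The register naming is illustrative, and the real TGSItoPTX translator also handles swizzles, modifiers, and the other cases described above.

    #include <cstdio>
    #include <string>

    // Expand one TGSI DP3 (vec4 ISA) into three scalar PTX instructions.
    std::string emitDot3(const std::string& dst,
                         const std::string& a, const std::string& b) {
        char buf[256];
        std::snprintf(buf, sizeof buf,
            "mul.f32    %s, %s_x, %s_x;\n"       // dst  = a.x * b.x
            "fma.rn.f32 %s, %s_y, %s_y, %s;\n"   // dst += a.y * b.y
            "fma.rn.f32 %s, %s_z, %s_z, %s;\n",  // dst += a.z * b.z
            dst.c_str(), a.c_str(), b.c_str(),
            dst.c_str(), a.c_str(), b.c_str(), dst.c_str(),
            dst.c_str(), a.c_str(), b.c_str(), dst.c_str());
        return std::string(buf);
    }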
