Models and Techniques for Designing Mobile System-on-Chip Devices

by

Ayub Ahmed Gubran

A DISSERTATION SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

Doctor of Philosophy

in

THE FACULTY OF GRADUATE AND POSTDOCTORAL STUDIES
(Electrical and Computer Engineering)

The University of British Columbia
(Vancouver)

August 2020

© Ayub Ahmed Gubran, 2020

The following individuals certify that they have read, and recommend to the Faculty of Graduate and Postdoctoral Studies for acceptance, the dissertation entitled:

Models and Techniques for Designing Mobile System-on-Chip Devices

submitted by Ayub Ahmed Gubran in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Electrical and Computer Engineering

Examining Committee:

Tor M. Aamodt, Electrical and Computer Engineering
  Supervisor
Steve Wilton, Electrical and Computer Engineering
  Supervisory Committee Member
Alan Hu, Computer Science
  University Examiner
Andre Ivanov, Electrical and Computer Engineering
  University Examiner
John Owens, The University of California-Davis, Electrical and Computer Engineering
  External Examiner

Additional Supervisory Committee Members:

Sidney Fels, Electrical and Computer Engineering
  Supervisory Committee Member

Abstract

Mobile SoCs have become ubiquitous computing platforms, and, in recent years, they have become increasingly heterogeneous and complex. A typical SoC today includes CPUs, GPUs, image processors, video encoders/decoders, and AI engines. This dissertation addresses some of the challenges associated with SoCs in three pieces of work.

The first piece of work develops a cycle-accurate model, Emerald, which provides a platform for studying system-level SoC interactions while including the impact of graphics. Our cycle-accurate infrastructure builds upon well-established tools, GPGPU-Sim and gem5, with support for graphics and GPGPU workloads, and full system simulation with Android. We present two case studies using Emerald.
First, we use Emerald’s full-system mode to highlight the importance of system-wide interactions by studying and analyzing memory organization and scheduling in SoCs. Second, we use Emerald’s standalone mode to evaluate a dynamic mechanism for balancing the shading work assigned to GPU cores. Our dynamic mechanism speeds up frame rendering by 7.3–19% compared to static load-balancing.

The second work highlights the time-variant traffic asymmetry in heterogeneous SoCs. We analyze the impact of this asymmetry on network performance and propose interleaved source injection (ISI), an interconnect topology and associated flow control mechanism to manage time-varying asymmetric network traffic. We evaluate ISI using stochastic traffic patterns and a set of traces that emulate mobile use cases with traffic from various IP blocks. We show that ISI increases saturation throughput by 80–184% for a 12% increase in NoC area.

In the last piece of work, we study the compression properties of framebuffer surfaces and highlight the characteristics of surfaces generated by different applications. We use our analysis to propose Dynamic Color Palettes (DCP), a hardware scheme that dynamically constructs color palettes and employs them to efficiently compress framebuffer surfaces. We evaluated DCP against a set of 124 workloads and found that DCP improves compression rates by 91% for UI and 20% for 2D applications compared to previous proposals. We also propose a hybrid scheme (HDCP) that combines DCP with a generic compression scheme. HDCP outperforms previous proposals by 161%, 124% and 83% for UI, 2D, and 3D applications, respectively.

Lay Summary

Mobile devices, like phones and tablets, have become interwoven into the fabric of our daily lives. These devices are powered by integrated mobile chips, or system-on-chips (SoCs).
Modern SoCs consist of a dozen or more specialized modules, where each module specializes in a specific task, like processing audio or video, rendering graphics, computing AI decisions, encrypting data, etc. As SoCs have become larger and more complicated, challenges have followed. These challenges include figuring out how to design hardware and software for such systems to deliver the best possible user experience. This dissertation contributes a set of tools that allow us to improve our understanding of how SoCs work and how their modules interact. The work in this dissertation also studies and proposes techniques to improve how SoC modules communicate, and techniques to reduce the energy consumption of SoCs’ memory system.

Preface

The work presented in this dissertation was carried out by the author, Ayub A. Gubran, under the supervision of Prof. Tor M. Aamodt at the University of British Columbia, Point Grey campus.

The research material presented in this dissertation, namely, Chapter 2, Chapter 3, and Chapter 4, is based on work that has either been published or is currently under submission, as detailed below.

A version of Chapter 2 was presented at the 2019 International Symposium on Computer Architecture (ISCA’19) and was included in the proceedings as:

• Ayub A. Gubran and Tor M. Aamodt. 2019. Emerald: graphics modeling for SoC systems. In Proceedings of the 46th International Symposium on Computer Architecture (ISCA ’19). ACM, New York, NY, USA, 169-182. DOI: https://doi.org/10.1145/3307650.3322221.

Ayub A. Gubran was the lead investigator, responsible for all major areas of concept formation, implementation, data collection and analysis, as well as manuscript composition. Prof. Tor M. Aamodt provided early concept formation, technical guidance and feedback, and contributed to editing the final version of the manuscript.

A version of Chapter 3 is currently under submission as “Ayub A. Gubran and Tor M. Aamodt.
The Case for Interleaved Source Injection in Networks with Asymmetric Loads”. Ayub A. Gubran was the lead investigator, responsible for all major areas of concept formation, implementation, data collection and analysis, as well as manuscript composition. Francois Demoullin implemented the scripts that were used to carry out the experiments in Section 3.6.3.1. He also carried out, and collected the data for, a number of experiments in the early stages of the work under the supervision of A. Gubran. Prof. Tor M. Aamodt provided technical guidance and feedback throughout the process and aided in writing the manuscript.

A version of Chapter 4 was presented as a peer-reviewed poster at the 7th Conference on High-Performance Graphics (HPG ’15) as “Framebuffer Compression Using Dynamic Color Palettes”. A pre-print version of this work is also published as:

• Ayub A. Gubran, Felix Huang, and Tor M. Aamodt. “Surface Compression Using Dynamic Color Palettes.” arXiv preprint arXiv:1903.06658 (2019).

Ayub A. Gubran was the lead investigator, responsible for all major areas of concept formation, implementation, data collection and analysis. Felix Huang worked under the supervision of Ayub Gubran; he collected the workloads listed in Table 4.2, evaluated replacement policies (Section 4.7.3.2), and worked on collecting the results of the hybrid compression scheme (Figure 4.18). Prof. Tor M. Aamodt provided technical guidance and feedback throughout the process.

Table of Contents

Abstract
Lay Summary
Preface
Table of Contents
List of Tables
List of Figures
List of Abbreviations
Acknowledgments
Dedication
1 Introduction
  1.1 Motivation
  1.2 Research Challenges
  1.3 Research Objectives and Contributions
  1.4 Dissertation Organization
2 Emerald: Graphics Modeling for SoC Systems
  2.1 Introduction
  2.2 Background
    2.2.1 Computer Architecture Simulators
    2.2.2 Graphics APIs
      2.2.2.1 The OpenGL Pipeline
      2.2.2.2 Recent Trends in Graphics APIs
      2.2.2.3 Example: Life of a Triangle
    2.2.3 Hardware Graphics Pipeline Optimizations
      2.2.3.1 Early Elimination of Invisible Primitives and Fragments
        2.2.3.1.a Early Depth Testing
        2.2.3.1.b Hierarchical Depth Testing
        2.2.3.1.c Deferred Shading
        2.2.3.1.d Primitive Culling and Clipping
        2.2.3.1.e Face Culling
        2.2.3.1.f View Frustum Culling
        2.2.3.1.g Clipping
      2.2.3.2 Hierarchical Position Rasterization
      2.2.3.3 Summary
    2.2.4 Graphics Hardware Architectures
      2.2.4.1 Tile-Based Rendering Architectures
        2.2.4.1.a TBR Performance Bottlenecks
        2.2.4.1.b Contemporary TBR Architectures
      2.2.4.2 Immediate Tiled Rendering Architectures
    2.2.5 GPU Compute Architecture
  2.3 Emerald SoC Architecture
  2.4 Emerald Graphics Architecture
    2.4.1 Emerald Graphics Pipeline
      2.4.1.1 Pipeline Overview
      2.4.1.2 Description of Pipeline Stages
    2.4.2 Emerald GPU Architecture
    2.4.3 Vertex Shading
    2.4.4 Primitive Processing
    2.4.5 Hierarchical-Z (Hi-Z) Operations
    2.4.6 The TC Stage
      2.4.6.1 Tile Coalescing Example
      2.4.6.2 Out-of-order Primitive Rendering
    2.4.7 Model Validation
    2.4.8 Model Limitations and Future Work
  2.5 Emerald Software Design
    2.5.1 Emerald Standalone Mode
    2.5.2 Emerald Full-system Mode
  2.6 Case Study I: Memory Organization and Scheduling on Mobile SoCs
    2.6.1 Implementation
      2.6.1.1 DASH Scheduler
        2.6.1.1.a Clustering Bandwidth
      2.6.1.2 HMC Controller
    2.6.2 Evaluation
      2.6.2.1 Regular-load Scenario
      2.6.2.2 High-load Scenario
      2.6.2.3 Summary and Discussion
  2.7 Case Study II: Dynamic Fragment Shading Load-Balancing (DFSL)
    2.7.1 Experimental Setup
    2.7.2 Load-Balance vs.