Optimizing SIMD Execution in HW/SW Co-designed Processors

Rakesh Kumar
Department of Computer Architecture
Universitat Politècnica de Catalunya

Advisors:

Alejandro Martínez, Intel Barcelona Research Center
Antonio González, Intel Barcelona Research Center and Universitat Politècnica de Catalunya

A thesis submitted in fulfillment of the requirements for the degree of

Doctor of Philosophy / Doctor per la UPC

ABSTRACT

SIMD accelerators are ubiquitous in microprocessors from different computing domains. Their high compute power and hardware simplicity improve overall performance in an energy-efficient manner. Moreover, their replicated functional units and simple control mechanism make them amenable to scaling to higher vector lengths. However, code generation for these accelerators has been a challenge since their inception. Compilers generate vector code conservatively to ensure correctness. As a result, they lose significant vectorization opportunities and fail to extract the maximum benefit out of SIMD accelerators. This thesis proposes to vectorize the program binary at runtime in a speculative manner, in addition to compile-time static vectorization. Different environments provide the runtime profiling and optimization support required for dynamic vectorization, the two most prominent being: 1) Dynamic Binary Translators and Optimizers (DBTOs) and 2) Hardware/Software (HW/SW) co-designed processors. A HW/SW co-designed environment provides several advantages over DBTOs, such as transparent incorporation of new hardware features, binary compatibility, etc. Therefore, we use a HW/SW co-designed environment to assess the potential of speculative dynamic vectorization.

Furthermore, we analyze vector code generation for wider vector units and find that, even though SIMD accelerators are amenable to scaling from the hardware point of view, vector code generation at higher vector lengths is even more challenging. The two major factors impeding vectorization for wider SIMD units are: 1) reduced dynamic instruction stream coverage for vectorization and 2) a large number of permutation instructions. To solve the first problem, we propose Variable Length Vectorization, which iteratively vectorizes for multiple vector lengths to improve dynamic instruction stream coverage. Secondly, to reduce the number of permutation instructions, we propose Selective Writing, which selectively writes to different parts of a vector register and avoids permutations.

Finally, we tackle the problem of leakage energy in SIMD accelerators. Since SIMD accelerators consume a significant amount of real estate on the chip, they become the principal source of leakage if not utilized judiciously. Power gating is one of the most widely used techniques to reduce the leakage energy of functional units. However, power gating has its own energy and performance overhead. We propose to selectively devectorize the vector code when the higher SIMD lanes are used intermittently. This selective devectorization keeps the higher SIMD lanes idle and power gated for longer durations, resulting in overall leakage energy reduction.


Table of Contents

Abstract i

List of Figures vii

List of Tables xi

List of Algorithms xiii

1. Introduction ...... 1

1.1 SIMD Execution Model ...... 2

1.2 Challenges in SIMD Execution ...... 3
1.2.1 Static Vectorization Limitations ...... 3
1.2.2 Wider Vector Units ...... 4
1.2.3 Leakage in SIMD Accelerators ...... 5

1.3 Contributions ...... 5
1.3.1 Speculative Dynamic Vectorization ...... 5
1.3.2 Vectorizing for Wider Vector Units ...... 6
1.3.3 Dynamic Selective Devectorization ...... 6

1.4 HW/SW Co-designed Processors ...... 7

1.5 Why HW/SW Co-designed Environment ...... 7

1.6 Thesis Organization ...... 8

2. Background ...... 11

2.1 Microarchitectural Innovations to Exploit Parallelism ...... 12
2.1.1 Extracting Instruction Level Parallelism ...... 12
2.1.2 Extracting Thread Level Parallelism ...... 13
2.1.3 Extracting Data Level Parallelism ...... 15

2.2 SIMD ISA Extensions ...... 16
2.2.1 Intel's SIMD Extensions ...... 17
2.2.2 PowerPC Altivec ...... 20


2.2.3 ARM Neon ...... 20

2.3 Code Generation for SIMD Accelerators ...... 21
2.3.1 Traditional Compiler Vectorization ...... 21
2.3.2 Superword Level Parallelism ...... 22
2.3.3 SLP in Presence of Control Flow ...... 23
2.3.4 Speculative Dynamic Vectorization ...... 24
2.3.5 Liquid SIMD ...... 24
2.3.6 Vapor SIMD ...... 25

2.4 Code Optimizations ...... 25
2.4.1 Static Compiler Optimizations ...... 26
2.4.2 Dynamic Binary Optimizations ...... 26

2.5 Hardware/Software Co-designed Processors ...... 28
2.5.1 Memory System in HW/SW Co-designed Processors ...... 29
2.5.2 TOL translator/optimizer ...... 30
2.5.3 Hardware Support in HW/SW Co-designed Processors ...... 32
2.5.4 Salient features of HW/SW Co-designed Processors ...... 33

3. Experimental Framework ...... 35

3.1 Main Components ...... 36

3.2 Execution Flow ...... 37

3.3 Translation Optimization Layer ...... 39
3.3.1 Interpretation ...... 39
3.3.2 Basic Block Translation ...... 40
3.3.3 Superblocks and Optimizations ...... 42

3.4 Speculation and Recovery...... 44

3.5 The Timing Model ...... 46

3.6 TOL Configuration ...... 48
3.6.1 Optimal Promotion Threshold ...... 49
3.6.2 Optimized Code Distribution ...... 51
3.6.3 Emulation Cost ...... 52
3.6.4 Dynamic Instruction and Overhead Distribution ...... 52
3.6.5 Floating Point and Integer Instruction Distribution ...... 54

3.7 Host ISA Extension...... 55

3.8 McPAT ...... 56


3.8.1 Power Modeling ...... 57

4. Speculative Dynamic Vectorization ...... 59

4.1 Introduction ...... 59

4.2 Motivation ...... 61

4.3 Dynamic Vectorization Algorithm ...... 64
4.3.1 The Vectorizer ...... 65
4.3.2 Avoiding Cyclic Dependences ...... 68
4.3.3 Static vs Dynamic Vectorization ...... 68
4.3.4 Working through an Example ...... 70

4.4 Performance Evaluation ...... 72
4.4.1 FP Dynamic Instruction Elimination ...... 72
4.4.2 Dynamic FP Instruction Stream Distribution ...... 74
4.4.3 Vectorization Overhead ...... 76
4.4.4 Effectiveness of Memory Speculation ...... 77
4.4.5 Performance ...... 77

4.5 Related Work ...... 79

4.6 Conclusion ...... 80

5. Vectorizing for Wider Vector Units ...... 81

5.1 Introduction ...... 81

5.2 Motivation ...... 82
5.2.1 Reduced Dynamic Instruction Stream Coverage ...... 82
5.2.2 Number of Permutation Instructions ...... 83

5.3 Variable Length Vectorization ...... 84
5.3.1 Code Generation ...... 85
5.3.2 Hardware Requirements ...... 86

5.4 Selective Writing ...... 87
5.4.1 Eliminating Permutation using Selective Writing ...... 87
5.4.2 Reducing Permutation Instructions to Pack N Values ...... 90

5.5 Performance Evaluation ...... 93
5.5.1 Dynamic Instruction Stream Coverage ...... 93
5.5.2 Permutation Reduction ...... 94
5.5.3 Putting Everything Together ...... 95
5.5.4 Performance ...... 96


5.6 Related Work ...... 97

5.7 Conclusion ...... 98

6. Dynamic Selective Devectorization ...... 99

6.1 Introduction ...... 99

6.2 Background and Related Work ...... 101

6.3 Motivation ...... 103

6.4 Profiling and Devectorization ...... 104
6.4.1 Profiling ...... 106
6.4.2 Devectorization ...... 107
6.4.3 Reducing Devectorization Slowdown ...... 108

6.5 Performance Evaluation ...... 109
6.5.1 Baseline ...... 109
6.5.2 Models and Parameters ...... 110
6.5.3 Higher SIMD Lane Usage Profile ...... 111
6.5.4 SIMD Accelerator Energy Savings ...... 111
6.5.5 Overall Energy Savings ...... 115
6.5.6 Performance ...... 116
6.5.7 Sensitivity Analysis ...... 118

6.6 Conclusion ...... 121

7. Conclusions ...... 123

7.1 Conclusions ...... 123

7.2 Future Work ...... 124

8. References ...... 127


List of Figures

2.1 SIMD Execution Model ...... 17
2.2 HW/SW interface in processors ...... 28
2.3 Memory system in a HW/SW co-designed processor ...... 30
2.4 Typical two stage TOL control flow ...... 31
3.1 DARCO main components ...... 36
3.2 Data page request from the PPC component, enforcing the synchronization phase ...... 38
3.3 Translation Optimization Layer execution flow ...... 40
3.4 Abstract translation of a basic block to host ISA ...... 41
3.5 Optimization flow in Superblocks ...... 44
3.6 Speculation Failure Detection Example ...... 46
3.7 Host Processor Pipeline ...... 47
3.8 Effect of threshold variation on the number of host instructions ...... 50
3.9 Dynamic x86 instruction distribution in IM, BBM and SBM ...... 51
3.10 Host instructions per x86 instruction in SBM ...... 52
3.11 Overall host dynamic instruction distribution ...... 53
3.12 Dynamic TOL overhead distribution ...... 53
3.13 Floating point and integer instruction distribution in the host dynamic instruction stream ...... 54
4.1 An example loop with pointer arithmetic ...... 62
4.2 Optimization sequence in superblocks for vectorization support ...... 64
4.3 Additional dependence after vectorization ...... 69
4.4 Speculative Dynamic Vectorization example ...... 71
4.5 Percentage of dynamic instructions eliminated by GCC, TOL and GCC+TOL vectorizations ...... 73
4.6 Dynamic FP instruction stream distribution for SPECFP2006 ...... 75
4.7 Dynamic FP instruction stream distribution for Physicsbench ...... 75
4.8 Execution speed for GCC, TOL and GCC+TOL vectorized code relative to unvectorized code ...... 77


5.1 Dynamic FP instruction stream coverage for vectorization at 128, 256 and 512-bit vector lengths normalized to the 128-bit case ...... 83
5.2 Number of permutation instructions generated per vector instruction at 128, 256 and 512-bit vector lengths normalized to the 128-bit case ...... 84
5.3 Instruction format for masked vector instructions ...... 85
5.4 Variable Length Vectorization example ...... 86
5.5 Masked vector instruction execution ...... 86
5.6 Packing scalar instruction results for feeding a vector instruction ...... 88
5.7 Proposed instruction format for scalar instructions ...... 88
5.8 Functionality of the proposed arithmetic scalar instructions ...... 89
5.9 Percentage of dynamic instructions with one, two and more consumers ...... 89
5.10 Operand forwarding before shuffle ...... 90
5.11 Instruction sequence for packing 4 values from different registers into a single register ...... 91
5.12 Instruction format for the proposed permutation instruction ...... 91
5.13 Functionality of the proposed Pack instruction ...... 91
5.14 Dynamic instruction stream coverage at three vector lengths, baseline and with VLV ...... 93
5.15 Number of permutation instructions per vector instruction, baseline and with SWR ...... 94
5.16 Dynamic instruction percentage after baseline and VLV-SWR vectorizations ...... 95
5.17 Execution time for baseline and VLV-SWR vectorizations normalized to unvectorized code execution time ...... 96
6.1 Percentage of vector instructions (excluding memory instructions) in the dynamic instruction stream over time ...... 104
6.2 Optimization sequence in superblocks for devectorization support ...... 105
6.3 Percentage of vector instructions in the dynamic instruction stream after dynamic selective devectorization for 434.zeusmp ...... 111
6.4 SIMD accelerator energy savings for CPG, SPG and DSD normalized to no power gating, without including DSD energy overhead ...... 113
6.5 SIMD accelerator energy savings for CPG, SPG and DSD normalized to no power gating, including DSD energy overhead ...... 113
6.6 Percentage of vector instructions (excluding memory instructions) in the dynamic instruction stream for 470.lbm ...... 114


6.7 Core overall energy distribution at different technologies ...... 115
6.8 Core energy consumption for CPG, SPG and DSD normalized to no power gating ...... 116
6.9 Overall performance after DSD normalized to SPG ...... 117
6.10 Percentage of vector instructions (excluding memory instructions) in the dynamic instruction stream for 410.bwaves before and after DSD ...... 118
6.11 Effect of breakeven threshold variation on DSD overall (dynamic + leakage) energy savings over SPG with a fixed wakeup latency of 10 cycles ...... 119
6.12 Effect of breakeven threshold variation on DSD overall (dynamic + leakage) energy savings over SPG normalized to a breakeven threshold of 20 cycles, with a fixed wakeup latency of 10 cycles (no success monitors, no dynamic idle detect interval) ...... 119
6.13 Effect of wakeup delay variation on DSD overall (dynamic + leakage) energy savings over SPG with a fixed breakeven threshold of 150 cycles ...... 120


List of Tables

3.1 Host Processor Microarchitectural Parameters ...... 47
3.2 TOL configuration parameters ...... 48
4.1 Percentage of dynamic instructions eliminated by GCC, TOL and GCC+TOL vectorizations ...... 74
4.2 Execution speed for GCC, TOL and GCC+TOL vectorized code relative to unvectorized code ...... 78
5.1 Percentage of permutations requiring N, N-1 and N-2 input registers to pack N values, for 256 and 512-bit vectors ...... 92
6.1 McPAT Parameters ...... 110


List of Algorithms

4.1 Speculative Dynamic Vectorization Algorithm………………………………….67

6.1 Dynamic Selective Devectorization Algorithm………………...……………….108


Chapter 1

Introduction

Microprocessor design has traditionally been driven by higher performance requirements. Several microarchitectural techniques have been used to extract different kinds of parallelism from applications. Starting with the introduction of pipelining, processor microarchitecture has gone through numerous improvements like superscalar and VLIW execution, out-of-order execution, multithreaded architectures, application specific accelerators, etc. Different microarchitectural extensions target different kinds of parallelism available in the applications in order to boost performance. Broadly, the parallelism available in an application falls under one or more of the following categories.

1) Instruction Level Parallelism (ILP)
2) Data Level Parallelism (DLP)
3) Thread Level Parallelism (TLP)

Microarchitectural techniques like pipelining, superscalar and VLIW execution, out-of-order execution, register renaming, branch prediction, etc. are all used to exploit ILP. Single Instruction Multiple Data (SIMD) accelerators and vector processors are specifically designed to extract DLP from data parallel applications. Finally, techniques like multithreading and architectures like chip multiprocessors (CMP) target TLP. This thesis focuses on extracting data level parallelism through SIMD accelerators.

SIMD accelerators are one of the most widely used microarchitectural extensions because they are particularly effective in exploiting DLP. Applications from the multimedia, scientific, and throughput computing domains are the main targets of these accelerators since they provide a significant amount of DLP. Data parallel applications perform the same operation on multiple pieces of data. As a result, SIMD accelerators just need to have duplicated functional units with a very simple control mechanism. The performance boosting ability of SIMD accelerators and their relatively low complexity have led to their incorporation in processors from all the computing domains: general purpose processors, digital signal processors, gaming consoles, as well as embedded architectures. Intel's MMX, SSE and AVX extensions, AMD's 3DNow!, PowerPC's Altivec, and ARM Neon are prominent examples of SIMD extensions.

Apart from microarchitectural innovations, code optimizations have played an important role in boosting performance. A significant amount of work has been done in compiler optimizations to generate optimized binaries. Traditional compiler optimizations like constant propagation, copy propagation, common sub-expression elimination, loop-invariant code motion, redundant load elimination, store forwarding, dead code elimination, software pipelining, etc. have been successfully used to achieve significant performance improvements. The last decade has also seen the emergence of dynamic optimizations that are performed at runtime. These optimizations benefit from the availability of runtime information that is not available at compile time.

Performance has traditionally been the main focus of computer architects. However, the power consumption of microprocessors and the battery life requirements of portable devices have made power/energy consumption an equally important factor. Therefore, computer architects now have to achieve a balance between performance and energy consumption. To address the problem, computer architects have turned their focus to designing simple cores by eliminating the power hungry components of a complex core. Since simple cores provide lower performance than complex cores, several proposals have been made to improve overall throughput. For instance, Hardware/Software (HW/SW) co-designed processors employ dynamic optimizations and speculative execution to improve overall performance. Chip multiprocessors (CMP) have multiple simple cores on the same die and, by simultaneously executing several threads on different cores, increase the overall throughput.

Since HW/SW co-designed processors provide an opportunity to optimize applications dynamically using information about their runtime behavior, this thesis focuses on efficient code generation for SIMD accelerators in a HW/SW co-designed environment.

1.1 SIMD Execution Model

SIMD execution, as the name suggests, consists of operating on multiple data elements in parallel using a single instruction. Therefore, the first step to enable SIMD execution is to encode multiple data elements that can be operated on in parallel in a single instruction. This step is called vectorization. In general, vectorization is done at compile time by a vectorizing compiler. Compilers perform program analysis to find independent scalar instructions performing the same operation. Several of these instructions, depending on their data type and the SIMD accelerator width, are packed together into a single instruction, called a vector/SIMD instruction. Vectorized instructions are then executed on the SIMD accelerator. SIMD execution has several advantages over scalar execution. Some of them are described below:

1) Since SIMD accelerators perform the multiple scalar operations encoded in a single vector instruction in parallel, they improve overall performance.
2) Since a single vector instruction encodes multiple operations, the total number of instructions needed to execute an application is reduced. This lowers instruction cache size requirements or, alternatively, results in an improved instruction cache hit rate and better performance.
3) Having fewer instructions also reduces the amount of work the processor front-end needs to do: it has to fetch, decode, and schedule fewer instructions. Likewise, the back-end needs to retire fewer instructions. This translates to better energy efficiency.
4) Having multiple operations encoded in an instruction allows a higher number of effective operations in the instruction window. This leads to better instruction scheduling.
5) Vector memory instructions lead to fewer memory accesses and better utilization of the available memory bandwidth.
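To make the packing step concrete, the following sketch (our illustration, not taken from the thesis) shows a scalar loop and its 128-bit vectorized counterpart, written with standard SSE intrinsics as a stand-in for compiler-generated vector code:

    #include <xmmintrin.h>

    /* Scalar loop: one addition per instruction. */
    void add_scalar(const float *a, const float *b, float *c, int n) {
        for (int i = 0; i < n; i++)
            c[i] = a[i] + b[i];
    }

    /* The same loop after packing four scalar additions into one SSE
       instruction. For brevity, the sketch assumes n is a multiple of 4
       and the arrays are 16-byte aligned. */
    void add_vector(const float *a, const float *b, float *c, int n) {
        for (int i = 0; i < n; i += 4) {
            __m128 va = _mm_load_ps(&a[i]);
            __m128 vb = _mm_load_ps(&b[i]);
            _mm_store_ps(&c[i], _mm_add_ps(va, vb));   /* four adds in parallel */
        }
    }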

1.2 Challenges in SIMD Execution

Even though SIMD accelerators are very simple from the hardware perspective, code generation for them has always been a challenge. Static compile time vectorization loses significant vectorization opportunities due to conservative memory disambiguation analysis. The problem further deepens at higher vector lengths because it becomes difficult to find enough independent instructions, performing the same operation, to fill the wider vector/SIMD paths. Furthermore, SIMD accelerators without any leakage control mechanism might become a major source of leakage energy if not utilized judiciously.

1.2.1 Static Vectorization Limitations

The Instruction Set Architecture (ISA) of microprocessors contains special instructions that are executed over SIMD accelerators. However, in the early days, compilers were not smart enough to generate these instructions automatically. Therefore, programmers used to target these extensions mainly using in-line assembly or specialized library calls. Later, automatic generation of SIMD instructions (auto-vectorization) was introduced in compilers, which borrowed their methodology from vector compilers.

A recent evaluation of vectorizing compilers by S. Maleki et al. [74] shows that modern compilers, including GNU GCC, IBM XLC, and Intel ICC, are limited in extracting the available vectorization opportunities from vectorizable applications. One of the main problems in static compilation that they discovered is the compilers' inability to do accurate interprocedural pointer disambiguation and interprocedural array dependence analysis. Furthermore, J. Holewinski et al. [45] showed that static vectorization fails to extract significant vectorization opportunities, especially in pointer-based applications. They vectorize array- and pointer-based versions of Digital Signal Processing (DSP) kernels from the UTDSP benchmark suite [13]. Their results show that the compiler is able to extract significant parallelism from the array-based versions of the kernels, whereas for the pointer-based versions it fails to extract any vectorization opportunities.
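The following sketch (our illustration, not a kernel from the cited studies) shows the kind of pointer-based loop that conservative static vectorization leaves scalar:

    /* The compiler cannot prove that dst and src never overlap, so it must
       assume they might alias and keeps the loop scalar. */
    void scale(float *dst, const float *src, float k, int n) {
        for (int i = 0; i < n; i++)
            dst[i] = src[i] * k;        /* possible dst/src overlap */
    }

    /* With C99 'restrict' the programmer asserts that no aliasing occurs and
       the compiler may vectorize. The approach proposed in this thesis instead
       recovers such cases at runtime through speculation, without source changes. */
    void scale_noalias(float *restrict dst, const float *restrict src,
                       float k, int n) {
        for (int i = 0; i < n; i++)
            dst[i] = src[i] * k;
    }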

1.2.2 Wider Vector Units

Due to their hardware simplicity, SIMD accelerators are relatively easy to scale to higher vector lengths. As a result, SIMD accelerators grow in size with each new generation. For example, Intel's MMX [4] had a vector length of 64 bits, which was increased to 128 bits in the SSE [4] extensions. Intel's recent SIMD extensions AVX [4] and AVX2 [4] support 256-bit wide vectors. Furthermore, Intel Xeon Phi [12] and its visual computing architecture Larrabee [93] perform 512-bit wide vector operations.

Although SIMD accelerators are amenable to scaling from the hardware point of view, generating efficient code for higher vector lengths is not straightforward. The problem lies in the fact that different applications have different natural vector length. The applications with low natural vector length cannot benefit from wider vector units. There are applications for which compilers just need to unroll loops with a higher unroll factor to fill the wider vector paths. However, there is another category of applications that does not have enough parallelism for vectorization at higher vector lengths. Generating code for these applications for wider vector units becomes a challenge.

We discover that there are two key factors that thwart the performance at higher vector lengths: 1) Reduced dynamic instruction stream coverage for vectorization and 2) Huge number of permutation instructions.


1.2.3 Leakage in SIMD accelerators

Leakage energy is the static energy consumed by a circuit when it is idle. The functional units of microprocessors are responsible for a major fraction of leakage energy. Even though SIMD accelerators are an energy efficient way of improving performance, for applications lacking DLP they become the main source of leakage energy, due to their wide datapaths, in the absence of a leakage control mechanism. Therefore, it is of prime importance to shrink the leakage energy of SIMD accelerators when they cannot be utilized efficiently due to a lack of DLP.

Many leakage control techniques have been studied [46][96][116], power gating [46] being one of the most prominent ones. Power gating cuts the supply voltage to idle functional units, resulting in leakage energy savings. However, power gating has an energy and performance overhead associated with it. This penalty has to be paid every time a power gated functional unit is awakened to perform some operation. The overhead is unjustifiable especially if a functional unit such as a SIMD accelerator needs to be awakened for only a few cycles.
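As a rough illustration (the symbols below are ours, not the thesis's notation), power gating an idle unit pays off only when the idle interval exceeds a break-even point:

    E_{saved} \approx P_{leak} \cdot T_{idle} - E_{overhead}, \qquad T_{BE} = E_{overhead} / P_{leak}

Gating saves energy only when T_idle exceeds T_BE; waking the SIMD accelerator after just a few cycles leaves T_idle well below T_BE, so the wakeup overhead dominates. Chapter 6 evaluates break-even thresholds and wakeup latencies of this kind.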

1.3 Contributions

This thesis focuses on optimizing SIMD execution in HW/SW co-designed processors including efficient code generation and reducing leakage energy of SIMD accelerators.

1.3.1 Speculative Dynamic Vectorization

We propose to complement static vectorization with a speculative dynamic vectorizer. Static vectorization applies several complex and time-consuming loop transformations to make a loop vectorizable. However, due to conservative memory disambiguation analysis it loses significant vectorization opportunities, especially in pointer-rich applications. We propose a speculative dynamic vectorizer to handle these cases. The proposed dynamic vectorizer speculatively assumes that a pair of ambiguous memory accesses will never alias. This speculative assumption gives more freedom in instruction reordering and hence the dynamic vectorizer is able to discover more vectorization opportunities. During execution, the hardware checks for any memory dependence violations caused by the speculative vectorization. If any violation is detected, the hardware rolls back to a previously saved checkpoint and executes a non-speculative version of the code.
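The sketch below conveys the idea in software form; it is only an analogue, since the thesis relies on hardware alias detection and checkpoint rollback in the co-designed processor rather than on an explicit software guard:

    #include <xmmintrin.h>

    void saxpy(float *y, const float *x, float a, int n) {
        /* Speculative assumption: x and y do not overlap. */
        if (x + n <= y || y + n <= x) {
            __m128 va = _mm_set1_ps(a);
            int i;
            for (i = 0; i + 4 <= n; i += 4) {
                __m128 vx = _mm_loadu_ps(&x[i]);
                __m128 vy = _mm_loadu_ps(&y[i]);
                _mm_storeu_ps(&y[i], _mm_add_ps(_mm_mul_ps(va, vx), vy));
            }
            for (; i < n; i++)                 /* remainder iterations */
                y[i] = a * x[i] + y[i];
        } else {
            /* "Rollback" path: non-speculative scalar version. */
            for (int i = 0; i < n; i++)
                y[i] = a * x[i] + y[i];
        }
    }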


This work has been published in the Proceedings of 20th International Conference on High Performance Computing (HiPC 2013) [61] as a full length technical paper and in the Proceedings of the 21st International Conference on Parallel Architectures and Compilation Techniques (PACT 2012) [65] as a short paper. Furthermore, we also presented this work at Hipeac Compiler, Architecture and Tools Conference at Haifa, Israel in November 2013 [62].

1.3.2 Vectorizing for Wider Vector Units

We discovered two major problems in generating code for wider vector units: 1) Reduced dynamic instruction stream coverage for vectorization and 2) Huge number of permutation instructions.

We propose Variable Length Vectorization (VLV) to increase the dynamic instruction stream coverage. Compilers generate vectorized code only when it is possible to fill the entire vector path, or in other words, when there are enough independent scalar operations to occupy all the vector lanes. If this condition is not met, all the instructions are left in scalar form. VLV, on the other hand, packs the maximum number of scalar instructions together, even if that number is less than the number of vector lanes available. Therefore, the dynamic instruction stream coverage for vectorization increases and the dynamic instruction count decreases.
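As an illustration of the idea (using AVX-512 masking only as a convenient stand-in for the masked host instructions proposed in the thesis), a group of just six independent additions can still be packed by enabling six of the sixteen 32-bit lanes:

    #include <immintrin.h>

    /* Requires AVX-512F; the mask 0x003F enables the six lowest lanes. */
    void vlv_partial_add(float *c, const float *a, const float *b) {
        __mmask16 k = 0x003F;
        __m512 va = _mm512_maskz_loadu_ps(k, a);
        __m512 vb = _mm512_maskz_loadu_ps(k, b);
        _mm512_mask_storeu_ps(c, k, _mm512_maskz_add_ps(k, va, vb));
    }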

To tackle the problem of permutation instructions, we propose Selective Writing. Permutation instructions are needed when the input operands of a vector instruction are not available in a single vector register or are not in the correct order. Selective Writing consists of two techniques. The first technique eliminates permutation instructions altogether if the result of a scalar instruction is read only by one scalar instruction. The other technique reduces the number of instructions required to pack N values from N-1 to N/2.
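For reference, the baseline cost that Selective Writing attacks can be seen in a conventional packing sequence (our illustration with standard SSE intrinsics): gathering four scalar results, each sitting in the low lane of a different register, into one vector register takes three permutation instructions, i.e. N-1 for N = 4.

    #include <xmmintrin.h>

    /* Conventional pack of four scalar results; the instructions proposed in
       the thesis instead let scalar operations write directly into selected
       lanes of a vector register, cutting this permutation cost. */
    static inline __m128 pack4(__m128 a, __m128 b, __m128 c, __m128 d) {
        __m128 ab = _mm_unpacklo_ps(a, b);   /* a0 b0 a1 b1 */
        __m128 cd = _mm_unpacklo_ps(c, d);   /* c0 d0 c1 d1 */
        return _mm_movelh_ps(ab, cd);        /* a0 b0 c0 d0 */
    }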

This work resulted in a publication in the Proceedings of 15th International Conference on High Performance Computing and Communications (HPCC 2013) [63].

1.3.3 Dynamic Selective Devectorization

Dynamic Selective Devectorization (DSD) is a technique to efficiently power gate the higher SIMD lanes. It helps reduce the leakage energy of the higher vector lanes and, as a result, of the whole SIMD accelerator and the core.


Power gating [46] is a widely used technique to reduce the leakage energy consumption of functional units. Power gating cuts the supply voltage to idle functional units, sending them to a sleep state and resulting in leakage energy savings. DSD dynamically profiles the applications to find the usage pattern of the higher lanes. If a period of low activity, in which the higher lanes are used only scarcely, is detected, DSD devectorizes the corresponding piece of code. As a result, the higher lanes can be power gated for longer time intervals without intermittent awakening. Therefore, DSD enables power gating to save more leakage energy.

This work has been published in the Proceedings of the 25th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD 2013) [64].

1.4 HW/SW Co-designed Processors

HW/SW co-designed processors [36][39][91] have enticed researchers for more than a decade. Moreover, there is a renewed interest in them in both industry and academia [5][28][29][73][79][87][113]. These processors employ a software layer that resides between the hardware and the operating system. This software layer allows the host and guest ISAs to be completely different, by translating the guest ISA instructions to the host ISA dynamically. The host ISA is the ISA implemented in the hardware, whereas the guest ISA is the one for which applications are compiled. The basic idea behind these processors is to have a simple host ISA to reduce power consumption and complexity.

The software layer translates the guest ISA instructions to the host ISA in multiple phases. Generally, in the first phase, guest ISA instructions are interpreted. In the remaining phases, guest code is translated and stored in a code cache, after applying several dynamic optimizations, for faster execution. The number of translation phases and the optimizations in each phase are implementation dependent.
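A highly simplified sketch of such a multi-phase flow is shown below; all helper names and the threshold value are illustrative and are not the interfaces of any particular co-designed processor:

    #include <stdint.h>

    void *code_cache_lookup(uint64_t guest_pc);
    void  code_cache_insert(uint64_t guest_pc, void *host_code);
    uint64_t execute_translated(void *host_code);   /* returns next guest PC */
    uint64_t interpret_one(uint64_t guest_pc);      /* returns next guest PC */
    void *translate_and_optimize(uint64_t guest_pc);
    int   bump_exec_count(uint64_t guest_pc);

    enum { PROMOTION_THRESHOLD = 50 };              /* illustrative value */

    /* Interpret cold guest code; translate, optimize and cache hot code. */
    void run_guest(uint64_t pc) {
        for (;;) {
            void *host = code_cache_lookup(pc);
            if (host)
                pc = execute_translated(host);
            else if (bump_exec_count(pc) >= PROMOTION_THRESHOLD)
                code_cache_insert(pc, translate_and_optimize(pc));
            else
                pc = interpret_one(pc);
        }
    }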

1.5 Why HW/SW Co-designed Environment

HW/SW co-designed processors provide some unique features that enable optimized vector code generation. Some of these features include:

Support for Speculation and Recovery: HW/SW co-designed processors provide efficient speculation and recovery support. This enables them to apply aggressive and speculative runtime optimizations. We speculatively vectorize the code when the dependence between two memory references is unknown. The hardware support is then leveraged to recover from any speculation failure caused by the speculative vectorization.


Bigger Optimization Regions: Compilers, in general, vectorize at the basic block level in the absence of loops. On the other hand, runtime optimizations in HW/SW co-designed processors are applied to bigger optimization regions called superblocks. Superblocks include multiple basic blocks following the biased direction of branches. Applying vectorization to superblocks increases the scope of vectorization and, hence, the available opportunities.

Decoupled ISA and Hardware: HW/SW co-designed processors decouple the ISA from the hardware implementation by means of a software layer. This decoupling allows the hardware to be modified transparently to the software stack. As a result, different SIMD accelerators can be targeted without recompiling the code.

Dynamic Optimizations: HW/SW co-designed processors dynamically optimize the program binary at runtime. Therefore, vectorizing the program binary, instead of source code, allows legacy code vectorization.

Online Profiling: HW/SW co-designed processors employ online profiling to discover runtime application behavior. This information about the runtime behavior of applications allows more specialized optimizations. For example, runtime behavior can be used to decide which portions of an application can be devectorized for efficient power gating of the higher vector lanes without affecting performance.

1.6 Thesis Organization

The rest of the thesis is organized as follows:

Chapter 2 provides the necessary background for the work presented. First, we present a historical evolution of microprocessors and SIMD accelerators, followed by a description of different vectorization techniques. Then, we introduce dynamic optimizations and the different environments where they are applied. Finally, we present an overview of HW/SW co-designed processors and the features they offer.

Chapter 3 provides the details of our simulation environment. First, we present our simulation infrastructure, DARCO, which is an in-house development in collaboration with other members of the research group. Then, we describe the benchmarks used for the evaluation of our proposals. Afterwards, we present the vector instruction set of the host ISA.


The next three chapters illustrate our proposals in detail. Each chapter starts with an introduction and the motivation for the work presented. This is followed by a description of the proposed algorithms or techniques. Thereafter, we present the experimental evaluation results. Finally, we compare our proposals with the related work in the corresponding area.

Chapter 4 explains our proposal of Speculative Dynamic Vectorization to vectorize applications at runtime and extract vectorization opportunities missed by compilers due to conservative memory disambiguation analysis. Chapter 5 discusses the problems in vectorization at higher vector lengths and provides details of our proposals: Variable Length Vectorization and Selective Writing. Chapter 6 presents the proposed Dynamic Selective Devectorization technique and shows how it improves the efficiency of power gating in saving leakage energy in SIMD accelerators. Finally, Chapter 7 concludes the thesis and outlines future work.


Chapter 2

Background

High computational power requirements have always driven microprocessor design. The high performance demands have been met through technology scaling and microarchitectural innovations that extract different kinds of parallelism from applications. Parallelism can broadly be divided into the following categories:

Instruction Level Parallelism:

Instruction level parallelism (ILP) takes advantage of sequences of instructions that require different functional units (such as the load unit, ALU, FP multiplier, etc.) for execution. Different architectures approach this in different ways, but the idea is to have these non-dependent instructions executing simultaneously to keep the functional units busy as often as possible.

Thread Level Parallelism:

Thread level parallelism (TLP) is extracted by running multiple flows of execution of a single application simultaneously. TLP is most often found in applications that need to run independent, unrelated tasks (such as computing, memory accesses, and IO) simultaneously. These types of applications are often found on machines that have a high workload, such as web servers. TLP is gaining importance due to the rising popularity of multi-core and multi-processor systems, which allow for different threads to truly execute in parallel.

Data Level Parallelism:

Data level parallelism (DLP) is a more specialized form of parallelism than instruction level parallelism. DLP arises from the need to perform the same operation on multiple data items simultaneously. A classic example of DLP is an operation on an image in which processing each pixel is independent of the ones around it (such as brightening). Other types of operations that allow the exploitation of DLP are matrix, array, and vector processing.
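The brightening example can be written as a simple data-parallel kernel (our illustration):

    /* Every pixel is updated independently, so consecutive pixels can be
       processed together by a single SIMD instruction. */
    void brighten(unsigned char *pixels, int n, int delta) {
        for (int i = 0; i < n; i++) {
            int v = pixels[i] + delta;
            pixels[i] = (unsigned char)(v > 255 ? 255 : v);  /* saturate */
        }
    }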


2.1 Microarchitectural Innovations to Exploit Parallelism

A whole body of microarchitectural innovations exists to exploit the different kinds of parallelism. The most prominent of these techniques are described below:

2.1.1 Extracting Instruction Level Parallelism

Arguably, most microarchitectural innovations have focused on extracting ILP, especially in the general purpose computing domain. These include:

Instruction Pipelining: Instruction pipelining is a technique to overlap the execution of different instructions at the same time [26][31]. An instruction goes through different processor stages during its execution: fetch, decode, register read, execute, memory access, and write back are the most basic stages. The insight behind pipelining is that any given instruction occupies only one of these processor stages at a time, while the rest of the stages are idle. Instead of waiting for one instruction to finish its execution before initiating the next one, pipelining initiates the next instruction as soon as the first instruction leaves the first stage. In other words, pipelining thrives by keeping all the processor stages busy. Hence, pipelining extracts parallelism among the different phases of instruction execution by overlapping them, boosting overall throughput.

Dynamic Instruction Scheduling and Out-of-Order Execution: Dynamic instruction scheduling decides the order of instruction execution at runtime. It enables instructions to bypass each other during execution, i.e. out-of-order execution, instead of following the strict order in which they are stored in memory. In the absence of dynamic instruction scheduling, the whole pipeline is stalled if an instruction cannot be executed for any reason, such as a dependence, a cache miss, etc. Dynamic instruction scheduling, however, allows later instructions to bypass the stalled instruction and continue their execution. The scoreboard in the CDC 6600 [105] and Tomasulo's algorithm [108] are the two earliest examples of dynamic instruction scheduling. Executing instructions in an out-of-order fashion increases resource utilization and hence improves performance. However, complex hardware structures have to be maintained to allow instructions to execute out of the original program order and to track dependences. This increases hardware complexity and power consumption.

Superscalar Processors: These processors consist of multiple pipelines all operating in parallel [16][20]. Having multiple pipelines allows superscalar processors to issue and commit (complete) multiple independent instructions per cycle. These processors incorporate sophisticated hardware to dynamically check dependences among instructions at runtime. This hardware sophistication results in chip complexity and power dissipation, which have become a major problem in modern superscalar processors. Even though superscalar processors incorporate multiple pipelines, the effective use of these pipelines depends on the ILP available in the applications. Data dependences among consecutive instructions limit the amount of ILP that can be extracted by superscalar processors. Nonetheless, superscalar processors have been used successfully to extract ILP by executing multiple instructions in parallel.

VLIW Processors: Very Long Instruction Word (VLIW) processors [33][40][76][94] are similar to superscalar processors in the sense that they also incorporate multiple pipelines. However, VLIW processors do not have dynamic instruction scheduling as superscalars do. The compiler is responsible for scheduling the instructions to be executed in the different pipelines. This reduces the complexity and high power consumption problems of superscalar processors. However, the compiler needs to be aware of microarchitectural details in order to do efficient scheduling and extract ILP. The dependence of VLIW processors on compiler support makes them a less attractive option. This is one of the reasons behind superscalars being more popular than VLIWs.

Register renaming, branch prediction [38][49][51][84][97][118][119], and speculative execution [48][50][100][101][102] are some of the other well-known microarchitectural techniques to extract more ILP. Register renaming is a technique used to avoid false (anti and output) dependences among instructions. Removing false dependences exposes more ILP. A branch predictor, as the name suggests, predicts the outcome of a branch before it is executed so that the control dependent instructions can start their execution without waiting for the outcome of the branch. If the prediction is incorrect, however, there is a penalty to be paid. Speculative execution refers to performing a task before knowing whether it needs to be performed or not. Branch prediction is an example of speculative execution. All these techniques focus on boosting performance by exploiting ILP.

2.1.2 Extracting Thread Level Parallelism

Exploiting ILP has driven microprocessor design for several decades. However, diminishing returns on ILP have led architects to consider exploiting other kinds of parallelism to meet performance requirements. TLP is targeted in applications that perform several independent tasks. Microarchitectural support for extracting TLP can be summarized as follows:


Multithreading: The basic idea behind multithreading is to execute multiple threads (workloads) on a processor simultaneously. In a multithreaded processor, the hardware resources are shared among a set of threads. However, this requires duplicating some of the resources, like the registers and the program counter, to maintain each thread's state. The three main approaches to multithreading are:

Fine-grained Multithreading: This technique issues instructions from a different thread on each clock cycle, skipping the stalled threads [44][53][67]. Since threads are switched every clock cycle, the throughput loss from even short stalls in one thread can be efficiently hidden by executing instructions from the other threads. The flip side of fine-grained multithreading is that it slows down the execution of an individual thread even if it does not suffer any stalls. Nonetheless, the technique is effective for improving system throughput.

Coarse-grained Multithreading: Whereas fine-grained multithreading switches between threads on every clock cycle, coarse-grained multithreading switches them only on costly stalls, like L2 or L3 cache misses [14]. Therefore, the slowdown of individual threads is smaller if they do not suffer costly stalls. However, switching threads only on costly stalls prevents coarse-grained multithreading from compensating for the throughput loss due to short stalls.

Simultaneous Multithreading: Simultaneous multithreading [75][111] is a special case of fine-grained multithreading implemented in multiple issue, dynamically scheduled superscalar processors. Fine- and coarse-grained multithreading, even though they switch threads, issue instructions from only a single thread in any given cycle. As a result, some of the issue slots may remain empty due to a lack of ILP. Simultaneous multithreading solves the problem by issuing instructions from multiple threads in the same cycle. Therefore, the number of empty slots reduces and the system throughput increases.

Multiprocessor: Multiprocessors [42] consist of a set of processors sharing memory and peripherals. TLP is extracted by executing independent threads on different processors. Compared to a single multithreaded processor, multiprocessors duplicate the entire processor, whereas multithreading requires duplication of only the private state such as the registers and program counter. Moreover, each processor within the multiprocessor can also support multithreading, thereby increasing the total number of threads that can be executed on the multiprocessor. If all the processors in a multiprocessor are on the same chip, it is called a chip multiprocessor or CMP. Depending upon the memory organization, multiprocessors can be categorized as either shared memory multiprocessors or distributed memory multiprocessors.

Centralized Shared-Memory Multiprocessors typically consist of a small number of processors sharing a centralized memory. Since the memory is centralized, all the processors have uniform memory access latency; therefore, this configuration is also called a Uniform Memory Access (UMA) multiprocessor. UMA is suitable only for multiprocessors with few cores because it is not practical for a centralized memory to support the bandwidth requirements of a large number of cores.

Distributed Shared-Memory Multiprocessors can support larger processor counts, since the memory is physically distributed among the processors. The memory bandwidth increases as a result of the distributed memory; however, this also means that the memory access latency is not uniform but rather depends on the location of the processor that holds the requested data. Therefore, these multiprocessors are also called Non Uniform Memory Access (NUMA) multiprocessors.

2.1.3 Extracting Data Level Parallelism

A set of architectural extensions exists to specifically extract data level parallelism from applications in different computing domains like multimedia, scientific computing, etc. These extensions include vector processors, graphics processing units (GPUs), and SIMD accelerators. The basic idea behind all of these extensions is to perform the same operation on multiple data elements with a single instruction. In addition to the performance benefits, this also results in energy efficiency because the front-end of the microprocessor has to handle fewer instructions.

Vector Processors: Vector processors [43][114] are especially designed to efficiently execute data parallel applications. These processors work with long vectors; a typical example is a 64-element vector. The elements in a vector are processed in a pipelined manner, including memory loads/stores. These pipelined loads/stores help vector processors hide memory latency. However, these architectures are expensive, mainly because of their high memory bandwidth requirements.

Graphics Processing Unit (GPU): The last decade saw the appearance of GPUs [2][6][7] as high performance graphics accelerators. However, due to their high throughput they are gaining popularity in the general purpose computing domain as well. GPUs consist of highly parallel hardware that supports the single instruction multiple data execution model. They also support a huge number of threads executing in parallel. These architectures are sometimes referred to as heterogeneous architectures because they have a system processor and system memory in addition to the GPU itself. The main challenge in GPUs is optimizing the communication between the system memory and the GPU.

SIMD Accelerators: In the mid-90s, SIMD accelerators [4][23][37][68] emerged mainly to accelerate multimedia applications on general purpose processors. Like vector processors, SIMD accelerators perform the same operation on multiple pieces of data using one instruction. However, the vector length of SIMD accelerators is typically much smaller than the maximum vector length of vector processors. Furthermore, some of the fundamental features of vector processors like variable vector length, strided memory access, gather-scatter, and mask registers are generally omitted in SIMD accelerators to simplify the design. This, however, makes it difficult for compilers to generate code for them. Therefore, some of the latest SIMD accelerators have started to incorporate these features in their ISAs gradually with each new generation.

Due to their simplicity, performance boosting ability, and energy efficiency, SIMD accelerators form an integral part of processors from different computing domains. Even though GPUs have greater compute potential, they work as coprocessors, whereas SIMD accelerators form part of the processor's pipeline. This allows them to speed up the execution of more general purpose applications. GPUs and vector processors, on the other hand, target a specific kind of application that has a significant amount of explicit DLP. Moreover, GPUs have specific programming language requirements and require programmers to explicitly expose DLP through programming language notations, whereas SIMD accelerators do not have any such requirements. Therefore, this thesis focuses on extracting DLP through SIMD accelerators instead of GPUs or vector processors.

2.2 SIMD ISA Extensions

SIMD accelerators [4][35][23][37][52][68][103][109] are ubiquitous in microprocessors from different domains ranging from supercomputers to smart phones. These accelerators are specially designed to target computation intensive scientific applications and multimedia applications like image processing, signal processing, graphics, and audio and video encoding. These applications consist of compute intensive data parallel kernels where each element can be processed in parallel. For example, in image processing, the properties of individual pixels can be modified in parallel. To efficiently execute these kernels, SIMD accelerators provide multiple functional units all performing the same operation, as shown in Figure 2.1.



Figure 2.1: SIMD execution model.

Microprocessors have special extensions to their Instruction Set Architecture (ISA) and hardware state (e.g., new registers) to support SIMD execution. The following are some representative SIMD extensions in current microprocessors.

2.2.1 Intel's SIMD Extensions

Intel's SIMD extensions [4] have gone through major transformations since they were first introduced in 1996. Intel introduced MMX as a 64-bit wide multimedia extension for its Pentium processors. MMX supports 8-bit, 16-bit, and 32-bit vector elements. This means that the 64-bit wide vector path can perform eight 8-bit, four 16-bit, or two 32-bit operations in parallel. Following the IA-32 semantics, MMX implements a two-operand destructive instruction encoding, where one of the source registers is also the destination register. However, MMX does not include floating-point vector instructions.

MMX defines eight 64-bit vector registers named MM0-MM7. However, these registers are mapped to x87 floating-point registers to avoid saving extra state on context switches. Therefore, this register aliasing prohibits MMX and x87 code intermixing. The register values need to be saved and retrieved before switching between MMX and x87 code.

Streaming SIMD Extensions (SSE): Intel introduced SSE [4] in 1999 in the Pentium III to overcome the limitations of MMX. The vector length was increased from the 64 bits of MMX to 128 bits to boost performance. SSE introduced 70 new instructions to perform single-precision floating-point arithmetic operations, memory loads/stores, comparisons, data shuffling, and data type conversion between integer and floating-point. Moreover, SSE introduced special instructions for cacheability control, prefetch, and memory ordering. The vector single-precision floating-point instructions of SSE perform four single-precision floating-point operations in parallel, whereas the scalar floating-point instructions perform only one operation. Therefore, the scalar instructions in SSE provide an option to avoid the complex x87 Floating Point Unit (FPU) for executing floating-point code even without vectorization. The cacheability control instructions provide fine control over when and what data to bring into the caches.

SSE also introduced a separate vector register file with eight 128-bit vector registers (XMM0 – XMM7) and a 32-bit status and control register MXCSR. Having a separate register file resolves the register aliasing problem of MMX. The ability to intermix SSE and MMX code provides greater flexibility and throughput for applications operating on large arrays of floating-point and integer data.
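The packed versus scalar distinction can be illustrated with the standard SSE intrinsics (our example, not taken from the thesis):

    #include <xmmintrin.h>

    /* _mm_add_ps adds all four single-precision lanes of the XMM operands,
       while _mm_add_ss adds only the lowest lane, which is how scalar
       floating-point code can bypass the x87 stack entirely. */
    __m128 packed_add(__m128 a, __m128 b) { return _mm_add_ps(a, b); }  /* addps */
    __m128 scalar_add(__m128 a, __m128 b) { return _mm_add_ss(a, b); }  /* addss */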

SSE2: SSE2 [4] was introduced in 2001 with the Intel Pentium 4 processors. SSE2 improved on the existing MMX and SSE extensions by adding 144 new instructions. The SSE2 extensions featured two major enhancements: 1) double-precision floating-point instructions and 2) 128-bit SIMD integer instructions. MMX and SSE both lacked double-precision floating-point support. Its inclusion in SSE2 enabled high precision computations in the SIMD unit and, as a result, better performance for scientific and engineering applications. MMX had 64-bit vector integer instructions and, even though SSE supports wider 128-bit vectors, it does so only for floating-point instructions. SSE2 extended the 128-bit vector support to integer instructions as well. Furthermore, SSE2 improved upon the cacheability control and prefetching support introduced in the earlier SSE extensions.

SSE3/SSSE3/SSE4: SSE3 [4] was introduced in 2004 with the Pentium 4 processor with Hyper-Threading (HT) technology. SSE3 introduced 13 new instructions in different categories. It included floating-point instructions that support packed addition/subtraction and horizontal addition/subtraction. The packed addition/subtraction instructions perform addition on one element of the vector register and subtraction on the other element. The horizontal addition/subtraction instructions perform the operation on different elements of the same source register instead of performing it on the corresponding elements of different source registers. Furthermore, two thread synchronization instructions and one unaligned integer load instruction were also introduced in SSE3.

Supplemental Streaming SIMD Extensions 3 (SSSE3) was introduced in 2006 with the Intel Core 2 processor family. It introduced 32 new instructions, mainly to accelerate multimedia and signal processing applications operating on integer data. It extended the horizontal addition/subtraction support of SSE3 to integers as well. Further, SSSE3 introduced new instructions for absolute value evaluation, multiply-and-add for dot products, negating packed integers, etc.

SSE4 consists of two extensions: SSE4.1 and SSE4.2. The 47 new instructions of SSE4.1 are targeted at improving the performance of media, imaging, and 3D applications whereas SSE4.2 focuses on string and text processing.

Advanced Vector Extensions (AVX): AVX [4] is Intel's 256-bit SIMD extension, supported by the Sandy Bridge architecture. A major difference between AVX and the earlier SIMD extensions, apart from the vector length, is the instruction encoding. AVX uses a three-operand non-destructive instruction encoding, which helps reduce the number of register-to-register copy operations.

Further, AVX extends the 128-bit floating-point instructions to operate on 256 bits. However, the vector integer instructions do not have a 256-bit version. Apart from widening existing instructions to 256 bits, AVX offers several new instructions for non-unit-stride fetching of data, intra-register manipulation of data, etc. AVX also introduces a new register file consisting of eight 256-bit registers YMM0-YMM7; the lower 128 bits of the YMM registers are aliased onto the XMM registers.

FMA and AVX2 Extensions: The Fused Multiply-Add (FMA) [4] extensions include various combinations of fused instructions to further improve performance. The instructions in this extension include fused multiply-add, fused multiply-subtract, fused multiply add/subtract interleave, and sign-reversed variants of fused multiply-add and multiply-subtract. All these instructions can operate on single-precision and double-precision floating-point data. Moreover, scalar and vector (packed) versions of all these fused operations are available, except for fused multiply add/subtract interleave, which has only the vector version.

Advanced Vector Extensions 2 (AVX2) [4] promotes the majority of integer instructions to support 256-bit vector operations, whereas AVX supports 256-bit vector operations only on floating-point data. Moreover, AVX2 provides gather support to fetch data from non-contiguous memory locations. It also includes vector shift instructions with per element shift count. Furthermore, it improves the broadcast and permutation functionality of AVX instructions.

Intel's 512-bit Vector Extensions: Intel's Many Integrated Core (MIC) architecture and the products based on it, like Xeon Phi [12], support 512-bit wide vector instructions. Apart from wider vectors, this architecture supports masked vector execution, which enables vectorizing short conditional branches and improves the overall efficiency of the vector processing unit. Furthermore, MIC supports gather and scatter instructions to handle non-unit-stride memory accesses. Vectorizing irregular memory access patterns helps extract additional vectorization opportunities and improves overall performance.

2.2.2 PowerPC Altivec

The Altivec [37] SIMD extensions were the result of a joint effort by Apple, IBM, and Motorola in the late 90s. The name Altivec, however, comes from Motorola. IBM called these extensions VMX, whereas Apple referred to them as the Velocity Engine. Like SSE, Altivec supports a vector length of 128 bits, though it was launched before SSE. It includes a rich set of both integer and floating-point vector instructions; however, support for double-precision floating-point is absent. Furthermore, Altivec implements a non-destructive instruction encoding.

Altivec has a vector register file with thirty-two 128-bit registers, compared to the eight 128-bit registers of SSE (though x86-64 has sixteen 128-bit registers). A notable difference between Altivec and SSE is that Altivec does not have scalar floating-point instructions as SSE does. Scalar code has to be executed on a separate floating-point unit (FPU). Furthermore, there is no instruction for moving data between the vector and floating-point register files.

Altivec provides some very flexible data permutation instructions like vperm, which allows each byte of the resulting vector to be taken from any byte of either of two other vectors, parameterized by yet another vector. vsel is another permutation instruction that operates at the bit level. In addition, Altivec offers vector compare instructions and a select mechanism to allow control flow in vectorized code. Like SSE, Altivec also provides instructions for cacheability control to avoid polluting the cache with non-temporal data.
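
A small sketch of these permutation primitives using the Altivec C interface from altivec.h (hypothetical code; the byte-selection pattern is of our choosing):

    #include <altivec.h>

    /* vec_perm (vperm): each byte of the result is selected from the 32 bytes of
       a and b according to the indices in pattern (0-15 pick from a, 16-31 from b);
       this pattern interleaves the low halves of a and b. */
    vector unsigned char interleave_low(vector unsigned char a, vector unsigned char b)
    {
        const vector unsigned char pattern =
            { 0, 16, 1, 17, 2, 18, 3, 19, 4, 20, 5, 21, 6, 22, 7, 23 };
        return vec_perm(a, b, pattern);
    }

    /* vec_sel (vsel): bit-level select, taking bits from b where mask is 1
       and from a where mask is 0. */
    vector unsigned int blend(vector unsigned int a, vector unsigned int b,
                              vector unsigned int mask)
    {
        return vec_sel(a, b, mask);
    }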

2.2.3 ARM Neon

Neon is a 128-bit (or, alternatively, 64-bit) wide SIMD extension for ARM processors. Neon provides a rich set of integer and floating-point vector instructions. It supports scalar single-precision and double-precision computations; however, only single-precision floating-point instructions have vector equivalents. Neon also provides scalar and vector half-precision conversion instructions. Like Altivec, it uses a non-destructive instruction encoding for most of the instructions.


Neon has a vector register file that can be viewed as: 1) thirty-two 64-bit registers D0-D31 or 2) sixteen 128-bit registers Q0-Q15. In other words, each D register is one half, either lower or higher, of a Q register. Alternatively, a Q register is composed of two D registers; for example, Q0 is the same register as D0 and D1 combined. This dual view of the registers allows elements to be promoted or demoted within a single operation. A "promotion" doubles the precision of the elements; for example, instructions like VMULL, VADDL, etc. promote elements from 16 bits to 32 bits. These instructions read the source operands from D registers and place the results in a destination Q register. Similarly, a "demotion" reduces the precision by reading the input operands from Q registers and placing the result in a D register, e.g. VADDHN. Furthermore, there are instructions that promote only the second operand, like VADDW, VSUBW, etc.

In addition, the Neon instruction set allows vector operations with scalar operands. In this case, the scalar is used for all the operations instead of having to operate on corresponding elements of two vector registers; it has the same effect as first duplicating the scalar across all the elements of a vector register and then performing a vector-vector operation. Neon also provides a variety of memory load/store instructions for broadcasting, interleaved load/store, inserting/extracting to/from a particular element of a vector register, etc.
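
The following C sketch using Neon intrinsics from arm_neon.h (hypothetical, for illustration only) shows a widening multiply that promotes 16-bit elements to 32 bits and a vector-by-scalar multiply:

    #include <arm_neon.h>

    /* Promotion: VMULL multiplies four 16-bit elements from two D registers and
       produces four 32-bit results in a Q register. */
    int32x4_t widening_mul(int16x4_t a, int16x4_t b)
    {
        return vmull_s16(a, b);
    }

    /* Vector-by-scalar operation: s is applied to every element of v, with the
       same effect as duplicating s across a vector register first. */
    float32x4_t scale(float32x4_t v, float s)
    {
        return vmulq_n_f32(v, s);
    }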

2.3 Code Generation for SIMD Accelerators

Despite their hardware simplicity, code generation for SIMD accelerators is a challenging task. Initially, SIMD extensions were targeted using inline assembly or specialized library routines. Even though these solutions are capable of utilizing the SIMD accelerators, they have their own limitations: they are time-consuming, error prone, and require the programmer to have in-depth knowledge of the underlying SIMD architecture. Moreover, the code is not portable between different SIMD accelerators. These limitations led compiler technology to target these extensions automatically. Most compiler vectorization techniques operate at the source code level. However, S. Larsen et al. [66] introduced a technique that operates on low-level intermediate code, and since then there have been other proposals that operate at a lower level. This section reviews the existing compiler approaches to vectorization.

2.3.1 Traditional Compiler Vectorization

Compiler vectorization traditionally targets loops for vector code generation. The vectorizer first strip-mines the loop iteration space by the vector length vl. For example, on a 128-bit wide SIMD accelerator, for a loop operating on 32-bit variables, the iteration space is strip-mined by 128/32 = 4. Then a vectorized version of the loop is generated along with some pre- and post-vectorization steps. The following scalar loop

    for (i = 0; i < n; i++) {
        //scalar statements
    }

will be vectorized by the compiler as shown below:

    i = 0;
    prelude:
        if (runtime-test) goto cleanup;
        m = n - (n % vl)
        .......
    vector loop:
        for ( ; i < m; i += vl) {
            //vector statements
        }
    postlude:
        ......
    cleanup:
        for ( ; i < n; i++) {
            //scalar statements
        }
    exit:

As shown above, the vector version of the loop has several parts: prelude, vectorized code, postlude, cleanup, and exit. The prelude performs some pre-processing steps for vectorization. For example, it might include a runtime test on the number of iterations if the loop trip count is not known statically, or a runtime test that checks multiple arrays for aliasing; in both cases, the vectorized loop is not executed if the test fails. The "vectorized code" section contains the vectorized version of the loop, which is executed if the runtime tests in the prelude succeed. The postlude is used for post-processing after executing the vectorized code; one example is generating the final result of a reduction. The cleanup section consists of the scalar version of the loop. Cleanup is necessary when the loop trip count is not evenly divisible by the vector length; it executes the iterations that remain after the vector loop. Finally, the loop exits and the rest of the application is executed. It is important to note that not all of these sections are needed for every vectorized loop. For example, if the number of iterations is evenly divisible by the vector length, the cleanup is not needed.
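
As an illustration of one such prelude runtime test (a hypothetical sketch in C, not taken from any particular compiler), the check below conservatively falls back to the scalar cleanup loop when the destination of c[i] = a[i] + b[i] may overlap either source array:

    #include <stddef.h>
    #include <stdint.h>

    /* Returns non-zero when the byte ranges starting at p and q may overlap. */
    static int may_alias(const void *p, const void *q, size_t bytes)
    {
        uintptr_t pb = (uintptr_t)p, qb = (uintptr_t)q;
        return pb < qb + bytes && qb < pb + bytes;
    }

    /* Prelude usage (sketch):
           if (may_alias(c, a, n * sizeof *a) || may_alias(c, b, n * sizeof *b))
               goto cleanup;      // execute the scalar version of the loop
    */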

2.3.2 Superword Level Parallelism

Superword Level Parallelism (SLP) is defined as short SIMD parallelism in which the source and result operands of a SIMD operation are packed in a storage location [66]. The technique works on low-level intermediate code rather than at the source code level like conventional vectorization approaches. Traditional vectorization schemes target loops for generating vectorized code, whereas SLP targets parallelism within a basic block, possibly after loop unrolling. Therefore, SLP can vectorize parts of a loop even if the loop is not completely vectorizable. Moreover, SLP avoids complex loop transformations like loop fission and scalar expansion, while still extracting equal or even more parallelism.

The technique first unrolls a loop to convert loop-level vector parallelism into superword level parallelism. SLP then starts vectorization by locating adjacent memory references and packing them together in groups of two instructions. It then follows the def-use and use-def chains of these initial packs and of any newly created packs. These groups are merged together to form bigger groups depending on the SIMD accelerator width. The vectorized code is then scheduled, and groups involved in cyclic dependences, if any, are eliminated. The technique also includes a cost model to estimate the profitability of vectorization based on the number of packs created and the cost of the permutation instructions generated.
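
For example (a hypothetical fragment), the four isomorphic statements below access adjacent memory locations and can be packed by SLP into single 128-bit operations:

    /* Scalar statements inside one basic block, possibly obtained by unrolling
       a loop four times. */
    void add4(float *a, const float *b, const float *c)
    {
        a[0] = b[0] + c[0];
        a[1] = b[1] + c[1];
        a[2] = b[2] + c[2];
        a[3] = b[3] + c[3];
        /* SLP packs the adjacent loads of b and of c, the four additions, and the
           adjacent stores to a into one 128-bit vector instruction each. */
    }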

2.3.3 SLP in Presence of Control Flow

Since the technique presented by S. Larsen works at the basic block level, it fails to vectorize loops with control flow. To overcome this limitation, J. Shin et al. [95] extended SLP to handle control flow. The basic idea behind their technique is to execute both the if and else parts of an "if statement" in vector form and then to choose the correct result based on the outcome of the control instruction.

Their mechanism, like SLP, starts by unrolling a loop. Then, if-conversion is applied to convert control dependences into data dependences. Each instruction that was previously control dependent now has a predicate associated with it; the predicate is true or false depending on the outcome of the control instruction. Therefore, if-conversion removes control dependences and creates a bigger basic block with predicated instructions.

The modified SLP algorithm takes this basic block as input for vectorization and produces an output that is a mix of predicated scalar and superword (packed) instructions. If the underlying architecture supports predicated instructions, nothing more needs to be done. However, if it does not, the predicates need to be removed, either by restoring the control flow or by mapping them to the features provided by the underlying architecture. For example, predicated superword instructions can be mapped to the vector select instruction vsel of Altivec.
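
A sketch of the idea (hypothetical C, not the code from [95]): after if-conversion both sides of the branch are computed, and a select, written here with the scalar conditional operator as a stand-in for a vector select such as Altivec's vsel, picks the correct result per element.

    /* Original loop body:
           if (a[i] > 0) x[i] = a[i] + b[i]; else x[i] = a[i] - b[i];           */
    void if_converted(float *x, const float *a, const float *b, int n)
    {
        for (int i = 0; i < n; i++) {
            float t_then = a[i] + b[i];               /* executed under predicate p  */
            float t_else = a[i] - b[i];               /* executed under predicate !p */
            x[i] = (a[i] > 0.0f) ? t_then : t_else;   /* vector select on real SIMD  */
        }
    }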


2.3.4 Speculative Dynamic Vectorization

Both of the last two vectorization schemes work on a low-level intermediate format. A. Pajuelo et al. [83] proposed yet another vectorization scheme that vectorizes applications at the binary level. Instead of compiler vectorization, they proposed a microarchitectural extension for superscalar processors that speculatively vectorizes the program binary. The basic idea is to predict when certain operations are vectorizable and to generate vector code that speculatively fetches and precomputes data using the SIMD accelerator.

Vectorization begins when a scalar strided load instruction is detected. A vectorized version of this instruction is then generated and executed, which speculatively loads into a vector register assuming that future instances of the instruction will execute with the same stride and without any intermediate store to the corresponding locations. The next occurrences of the given load instruction are used to validate these speculative assumptions. Arithmetic instructions are vectorized when any of their source operands is already vectorized. As with load instructions, the remaining instances of the scalar arithmetic instructions are used to check that the corresponding source operands are still valid for vector operation. A speculation failure triggers a recovery mechanism and execution in non-speculative scalar mode.
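
The following C sketch (our own simplification, not the hardware proposed in [83]) illustrates the kind of per-load stride tracking on which such speculation can be based: a load whose stride repeats becomes a candidate for speculative vectorization, and later instances validate the assumption.

    #include <stdint.h>

    /* Hypothetical stride-predictor entry, conceptually indexed by the load's PC. */
    struct stride_entry {
        uint64_t last_addr;
        int64_t  stride;
        int      confident;   /* non-zero: stride has repeated, speculation allowed */
    };

    static int observe_load(struct stride_entry *e, uint64_t addr)
    {
        int64_t new_stride = (int64_t)(addr - e->last_addr);
        e->confident = (new_stride == e->stride);
        e->stride    = new_stride;
        e->last_addr = addr;
        return e->confident;
    }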

Speculative dynamic vectorization avoids the complexities encountered by the compiler. Moreover, legacy code can also be vectorized with it. However, the cost of speculation failure can be high: A. Pajuelo et al. [83] reported that more than half of the speculative work was useless due to misspeculations. This is also important from an energy efficiency point of view, which is already a big issue in superscalar processors.

2.3.5 Liquid SIMD

Liquid SIMD [34] targets the problems of binary compatibility, software migration cost, and SIMD ISA redesign that arise from the increased functionality and larger vector width of each new generation of SIMD accelerators. Liquid SIMD decouples the SIMD ISA from the SIMD accelerator hardware by means of delayed binding, which is achieved by a combination of compiler support and a translation system.

The compiler support is used to translate vectorized SIMD instructions into a virtualized representation using the processor's baseline instruction set. Moreover, the compiler adds vectorization hints to the translated code so that the translation system can easily discover the vectorizable code. The conversion can be done at compile time or as a post-compilation step. Moreover, the conversion applies both to compiler-vectorized code and to hand-coded inline assembly.

The principal task of the translation system is to convert the virtualized code to an equivalent SIMD representation targeted at the specific implementation of the SIMD accelerator. The translation system could be a binary translator, a just-in-time compiler, or a hardware extension, each with its own pros and cons.

2.3.6 Vapor SIMD

Vapor SIMD targets portable vectorization across disparate SIMD accelerators with different vector lengths [82]. It opts for a split-compilation approach where the final machine code is generated by a combination of two separate yet synergistic compilation phases. The first phase is an aggressive and generic offline compilation stage that generates optimized, target-independent intermediate code. The second phase is an online just-in-time compilation stage that reads this intermediate code and generates target-specific vectorized SIMD code.

2.4 Code Optimizations

Section 2.1 presented a summary of the most commonly used microarchitectural techniques to improve performance. Apart from these microarchitectural innovations, a whole body of optimizations exists that plays an important role in meeting high performance requirements.

Optimizations transform a piece of code to make it more efficient. The efficiency may come in the form of better performance, smaller memory/cache size requirements, energy efficiency, etc. Moreover, optimizations must not change the output of a program; in other words, the output remains the same with and without optimizations. However, this may not hold for incorrectly written programs, e.g. those with uninitialized variables. The only effect of optimizations is that the optimized program runs faster and/or consumes less energy and/or consumes less memory, etc. Code optimizations can be broadly categorized as:

1) Static Compiler Optimizations.
2) Dynamic Binary Optimizations.


2.4.1 Static Compiler Optimizations

The primary task of compilers is to generate machine code from higher-level source code. While generating the machine code, the compiler's prime responsibility is to ensure correctness; any extra benefit from optimizations comes second. Compiler optimizations can be divided into two categories: 1) Local Optimizations and 2) Global Optimizations.

Local Optimizations: Local optimizations are performed within the scope of a basic block. They are relatively easy to perform since there is no control flow to worry about. Some of the most common local optimizations, illustrated with a small example after the list, are:

1) Constant Folding.
2) Constant Propagation.
3) Copy Propagation.
4) Common Sub-expression Elimination.
5) Algebraic Simplification.
6) Operator Strength Reduction.
7) Dead Code Elimination.
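
The following before/after sketch (hypothetical code) shows several of these local optimizations applied to one basic block:

    /* Before: */
    int before(int x)
    {
        int a = 2 * 3;        /* constant folding:      a = 6            */
        int b = a + x;
        int c = a + x;        /* common subexpression:  c = b            */
        int d = x * 8;        /* strength reduction:    d = x << 3       */
        int unused = b - c;   /* dead code elimination: removed          */
        return b + c + d;
    }

    /* After: */
    int after(int x)
    {
        int b = 6 + x;
        return b + b + (x << 3);
    }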

Global Optimizations: The scope of local optimizations is limited to a single basic block. By widening this scope, more optimization opportunities can be discovered, and this is exactly what global optimizations do. By performing additional program analysis, they expand the scope and apply local optimizations across multiple basic blocks; global optimizations typically optimize a whole function at a time. In addition to increasing the scope of local optimizations, the following global optimizations further improve the quality of the generated code (a short example of loop-invariant code motion follows the list):

1) Loop Invariant Code Motion.
2) Register Allocation.
3) Instruction Scheduling.
4) Peephole Optimizations.
5) Machine Code Optimizations.
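
As a small illustration of the first of these (a hypothetical sketch), loop-invariant code motion hoists a computation that does not change across iterations out of the loop:

    /* Before: k * k is recomputed in every iteration although it never changes. */
    void scale_before(int *a, int n, int k)
    {
        for (int i = 0; i < n; i++)
            a[i] = a[i] * (k * k);
    }

    /* After: the invariant expression is hoisted out of the loop. */
    void scale_after(int *a, int n, int k)
    {
        int kk = k * k;
        for (int i = 0; i < n; i++)
            a[i] = a[i] * kk;
    }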

2.4.2 Dynamic Binary Optimizations

Since the prime responsibility of compilers is to ensure correctness, they optimize the code conservatively. Furthermore, the lack of knowledge of runtime program behavior also restricts the compiler's ability to apply aggressive optimizations. Dynamic binary optimizers (DBO), on the other hand, optimize the program binary at runtime. This allows them to profile the code to understand the dynamic program behavior and, accordingly, apply aggressive, even speculative, optimizations. However, certain support is required, either in hardware or in software, to recover from speculation failures, if any.

Dynamic binary optimizers profile the program binary on the fly to determine what and when to optimize. This dynamic profiling leads to optimizations that are not coupled to one or more particular program inputs, as would be the case with offline profiling. The profiling provides, among others, two important characteristics of the execution: 1) the most frequently executed portions of the code and 2) the most frequently followed paths in the execution, i.e. the sequence in which basic blocks execute. To keep the optimization overhead to a minimum, since it is incurred on the fly, only the most frequently executed basic blocks are optimized. Moreover, the knowledge of the most frequently followed paths helps in applying the local and global optimizations described above only across basic blocks that follow each other during execution.

Dynamic binary optimizers can be divided into two categories depending on whether they are implemented in software or hardware:

1) Software Dynamic Binary Optimizers.
2) Hardware Dynamic Binary Optimizers.

Software DBOs are a software layer that intercepts the program binary, before it starts execution, to profile and later optimize it. However, they might need some hardware support in order to apply certain aggressive optimizations like speculative instruction reordering. Software DBOs introduce a certain performance overhead, since they share execution time with the applications; they compensate for this overhead by performing aggressive code optimizations that improve overall performance. Examples of software DBOs include Dynamo [21], IA-32 EL [22], Strata [92], DynamoRIO [30], etc.

Hardware DBOs are completely implemented in hardware, though their presence is not reflected in the ISA. Due to their hardware implementation, they do not introduce performance overhead like software DBOs. However, the downside of hardware DBOs is increased hardware complexity and higher energy consumption. rePLay [85] and PARROT [90] are the most representative hardware dynamic binary optimizers.

Apart from Software and Hardware DBOs, dynamic optimizations are also employed by Hardware/Software Co-designed Processors as explained in the next section.


2.5 Hardware/Software Co-designed Processors

HW/SW co-designed processors [36][39][91] have enticed researchers for more than a decade, and there is renewed interest in them in both industry and academia [5][28][29][73][79][87][113]. These processors are specially designed to achieve energy efficiency, design simplicity, and performance improvement. In order to achieve design simplicity, they keep the hardware simple and implement a relatively simple ISA. The simple hardware design also helps in achieving energy efficiency. Transmeta reports a significant reduction in power dissipation for their HW/SW co-designed processor Crusoe compared to an Intel Pentium III running a software DVD player [57]: the Pentium III heats up to a temperature of 105°C, whereas Crusoe's maximum temperature reaches only 48°C running the same software DVD player. Furthermore, to achieve their performance goal, HW/SW co-designed processors employ dynamic binary optimizations.

In general, HW/SW co-designed processors implement a proprietary ISA in order to achieve design simplicity and power efficiency. Therefore, they need to apply binary translation to map the guest ISA onto the host ISA. We define the host ISA as the ISA that is implemented in the hardware, whereas the guest ISA is the one for which applications are compiled. Binary translation can be implemented in either hardware or software. Modern processors implementing a CISC ISA, like x86, implement binary translation in hardware [99]: a hardware binary translator dynamically translates CISC instructions into RISC-like instructions to simplify the execution pipeline. However, the hardware implementation leads to significant complexity and power consumption. HW/SW co-designed processors, on the other hand, implement binary translation in software, which leads to power efficiency.

Figure 2.2: HW/SW interface in processors. a) Conventional RISC processor. b) Conventional CISC processor. c) HW/SW co-designed processor.

Figure 2.2a shows the hardware/software interface in a conventional RISC processor, where the software stack directly interacts with the hardware. Conventional CISC processors implement a RISC-like ISA in hardware; as shown in Figure 2.2b, they employ a hardware dynamic binary translator to translate CISC instructions into internal ISA instructions. The binary translation in HW/SW co-designed processors is performed by a software layer, as shown in Figure 2.2c. We call this software layer the Translation Optimization Layer (TOL) in this thesis. Doing the binary translation/optimization in a software layer provides several benefits over the hardware implementation. For example, the software implementation significantly reduces hardware complexity and power consumption. Furthermore, it allows a processor to be upgraded in the field by introducing new optimizations in the software layer; on the contrary, if TOL is implemented in hardware, adding new optimizations to an existing processor is not feasible. Additionally, a software implementation of TOL significantly reduces hardware validation and verification cost and time.

In HW/SW co-designed processors, TOL resides in a ROM and is the first program to start execution when the system boots up. Since TOL acts as an insulation layer between the conventional software stack and the hardware, the host ISA can be changed arbitrarily without having to make changes in the conventional software stack. The only modification needed in this case would be a new version of TOL that translates guest ISA code to the new hardware. Since the execution of TOL itself requires some processor time, it might affect the overall performance. However, in addition to binary translation, TOL is also responsible for optimizing the translated binary to boost performance and compensate for its own execution overhead.

2.5.1 Memory System in HW/SW Co-designed Processors

As shown in Figure 2.3, the system memory in HW/SW co-designed processors is divided into two parts:

1) Conventional memory.
2) Concealed memory.

The conventional memory part works as the normal memory system of a hardware-only processor. All the guest ISA application code, operating system, and data reside in this part. Moreover, this is the only part of the memory that is visible to the conventional software stack, including the operating system.

The concealed memory contains the TOL binary, TOL data, and the code cache. The code cache is the memory where TOL stores translated and optimized user/system code. Since TOL takes control as soon as the system boots up, it hides the concealed memory from the conventional software stack. Any attempt to access the concealed memory by the conventional software stack results in an invalid memory access.

Figure 2.3: Memory system in a HW/SW co-designed processor.

It is also important to note that only the TOL code and the translated/optimized code from the code cache enter the instruction cache hierarchy. The source ISA code goes through the data cache hierarchy, since it acts as data for the TOL translator. TOL translates this code and stores it in the code cache, from where it is read through the instruction cache hierarchy for execution.

2.5.2 TOL Translator/Optimizer

As stated before, translating guest ISA code to the host ISA is the prime responsibility of TOL. The translation is done dynamically and, generally, in multiple phases. Usually, in the first phase, an interpreter decodes and executes guest ISA instructions sequentially. In the remaining phases, guest code is translated to host ISA code and stored in the code cache, after applying several dynamic optimizations, for faster execution. The number of translation phases and the optimizations in each phase are implementation dependent.

Figure 2.4 shows a typical two-stage translation/optimization flow in a TOL. It starts by interpreting the guest ISA instruction stream sequentially. While interpreting, TOL also profiles the guest code to collect information about the most frequently executed code and biased branch directions. The execution frequency guides TOL in deciding which guest code basic blocks to translate. When a basic block has been executed more than a predetermined number of times, TOL invokes the translator. The translator takes the guest ISA basic blocks as input, translates them to host ISA code, and saves the translated code into the code cache for fast native execution. Instead of translating and optimizing each basic block in isolation, the translator uses the biased branch direction information, collected during interpretation, to create bigger optimization regions, called superblocks. A superblock generally consists of multiple basic blocks following the biased direction of branches. Therefore, superblocks increase the scope of optimizations to multiple basic blocks and allow more aggressive optimizations. Superblocks have a single entry point, which is the first instruction of the first basic block included in the superblock. However, depending on the implementation, they might have multiple exit points or a single exit point, making them a single-entry multiple-exit or single-entry single-exit structure.

Figure 2.4: Typical two-stage TOL control flow.

Initially, control is transferred back to TOL after executing a superblock from the code cache. TOL then searches for the next instruction to be executed. If the next instruction is not already translated, it has to be interpreted. However, if it is already translated, TOL patches the last branch of the first superblock (the one that transferred control back to TOL) to point to the beginning of the second superblock. This process is called chaining [36] or linking [21]. Chaining enables control to be transferred directly from one superblock to the other without having to come back to TOL, which reduces the TOL overhead of looking up a translation in the code cache.
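
The following C sketch (our own illustration, not DARCO's or any product's code) captures the essence of chaining: the placeholder slot in the exit stub of the first superblock is patched with the entry address of the second, so control no longer returns to TOL on that path.

    /* Hypothetical representation of a translated superblock in the code cache. */
    struct superblock {
        void  *entry;              /* address of the first host instruction       */
        void **exit_branch_slot;   /* placeholder branch target in the exit stub  */
    };

    /* Patch the exit branch of 'from' to jump directly to 'to'. */
    static void chain(struct superblock *from, struct superblock *to)
    {
        *from->exit_branch_slot = to->entry;
    }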


The binary translation/optimization mechanism described above has two stages: interpretation and one translation phase. However, there are systems with more translation stages, each applying progressively more complex optimizations. Moreover, some systems skip the interpretation stage and go directly to translation, like IA-32 EL [22] and DynamoRIO [30]. The different interpretation/translation stages provide a tradeoff between startup and steady-state performance. Applying aggressive optimizations is costly in terms of overhead; however, it generates highly optimized code that runs faster than un-optimized code. Hence, a system that starts with aggressive optimizations, skipping interpretation and simple translation, would have unacceptably poor startup performance but excellent steady-state performance. The overall performance of such a system depends on how much of the startup delay, or translation/optimization overhead, can be offset by the optimized code execution; it might end up with poor overall performance if the translation/optimization overhead is not compensated by the optimized code execution. Therefore, most systems start with interpretation or lightweight translation to improve startup performance, whereas aggressive optimizations are applied only to the hot code that dictates steady-state performance.

2.5.3 Hardware Support in HW/SW Co-designed Processors

Hardware support is needed for efficient and correct execution of the guest ISA instructions on the host architecture. Memory speculation is key to several optimizations performed by HW/SW co-designed processors. For example, Transmeta Crusoe [36] reports that, on average, suppressing memory reordering causes 10% and 33% performance loss in operating system boots and user applications, respectively. To ensure the correctness of memory speculation, hardware support is provided to detect speculation failures and recover from them. Transmeta shadows all the registers holding x86 state, i.e. they keep a working and a shadow copy of each register [57]. During the execution, only the working copy is updated. If the execution reaches the end of a translation without speculation failures or exceptions, a special commit operation copies the working registers into the corresponding shadow registers. To ensure the correctness of the memory state, Transmeta employs a "gated store buffer": memory store instructions write their results to this buffer, and the results are forwarded to memory only if a translation finishes without exceptions or speculation failures. BOA [91] incorporates similar techniques by having two sets of registers; however, it schedules stores in the original program order.


Furthermore, hardware support is necessary for providing precise exceptions and for detecting self-modifying code. Moreover, the overhead of indirect branches and function returns can be reduced with some hardware support [54][55].

2.5.4 Salient Features of HW/SW Co-designed Processors

HW/SW co-designed processors provide certain features that set them apart from traditional hardware only processors. These features include:

Hardware Simplicity: These processors employ simple hardware to cut down complexity. To simplify the hardware, they implement a simple RISC ISA. Furthermore, TOL is implemented as a software layer, whereas the hardware implementation of the equivalent translation layer in conventional CISC processors contributes significantly to the hardware complexity.

Power Consumption: Having simple hardware allows HW/SW co-designed processors to keep power consumption within limits. The simple RISC ISA allows a simple front-end and avoids power hungry components.

Flexibility: The software implementation of TOL makes it relatively simple to upgrade a processor in the field by introducing new features in the software layer. On the other hand, due to the hardware implementation, conventional CISC processors cannot introduce new features in the translation layer or fix a bug once the processor is rolled out.

Multiple Guest ISA Support: HW/SW co-designed processors also provide an excellent opportunity to run multiple guest ISAs on a single host ISA. In this case, TOL needs to support multiple front-ends, where each front-end corresponds to a different guest ISA. Once a front-end has translated the guest ISA into the intermediate representation, the common back-end can be used to generate the host ISA code. This feature allows code compiled for different architectures to be executed on the same hardware. This is going to be especially important in future architectures, as one would like to be able to run any application on any computing device.

Binary Compatibility: TOL also allows HW/SW co-designed processors to maintain forward and backward binary compatibility without any additional hardware complexity. TOL can translate binaries targeted at old architectures to run them on the latest one, and vice-versa.


Chapter 3

Experimental Framework

This chapter presents DARCO [86], the tool we used to evaluate our proposals. DARCO is an infrastructure for research in the field of HW/SW co-designed virtual machines. Due to the lack of appropriate tools in this research area, we developed DARCO in collaboration with other members of our research group.

An infrastructure for modeling HW/SW co-designed processors needs to provide functional emulators for the host and guest ISAs, cycle-accurate timing simulation for the host processor, and a software layer that is able to interpret, translate, and dynamically optimize the guest binaries. Developing and debugging all of these components from scratch costs multiple man-years, so using existing components to build such an infrastructure is a better alternative. Therefore, we used QEMU [9] to build some of the components of DARCO. However, it still required a significant effort, with multiple people contributing to it for approximately two years. Currently, DARCO is being used by several members of our research group and has led to a number of research publications in various conferences [27][28][29][32][61][62][63][64][65][86].

DARCO models a HW/SW co-designed processor with a guest x86 ISA [4] and a PowerPC (PPC) [8] like RISC host ISA. QEMU is used for emulating both the guest and the host ISA. The software layer of DARCO, called the Translation Optimization Layer (TOL), translates and optimizes guest binaries to run on the host architecture. TOL provides staged compilation through an interpreter, a translator, and an optimizer. In addition to the functional emulators for the guest and host ISAs, DARCO provides a cycle-accurate timing simulator and a debugging tool chain. DARCO has a clean interface for including new optimizations in TOL and allows easy implementation of new hardware features.

Except for the x86 and PowerPC functional emulators, for which we used modified versions of QEMU, all other components are in-house developments. The current version of DARCO supports only user-level x86 instructions.


Figure 3.1: DARCO Main components.

3.1 Main Components

DARCO is composed of four main components as shown in Figure 3.1.

1) x86 component.
2) PPC component.
3) Timing Simulator.
4) Controller.

The x86 component provides a full-system functional emulator for the guest x86 ISA. It runs an unmodified operating system and is the only component that interacts with the operating system. The operating system is completely unaware of the existence of the software layer, which follows the concept of HW/SW co-designed processors like Crusoe [57] and Efficeon [59]. The authoritative register and memory state is also kept by this component, because it emulates the x86 code and not the translated/optimized code. The main role of this component is to permit co-simulation [58], a technique that is very useful for debugging: it checks whether the co-designed execution state (kept by the PPC component) after the translations/optimizations done by TOL is consistent with the authoritative x86 state.

The PPC component provides the co-designed processor functional model for DARCO. This component consists of a modified version of the PowerPC emulator provided by QEMU. The name "PPC component" comes from the fact that PowerPC is the baseline RISC host architecture; however, we modified the baseline PowerPC ISA to meet our needs, as will be explained in Section 3.3.2 and Section 3.5. The PPC functional emulator executes TOL, which translates and optimizes the x86 instruction stream into host instructions. This component keeps the emulated x86 register and memory state, which is updated as the application execution proceeds. Consistency between the authoritative and emulated x86 states validates the correctness of the translations/optimizations done by TOL.

The timing simulator models a parameterized in-order core. It receives the dynamic instruction stream from the PPC component and provides detailed execution statistics. Moreover, it is able to distinguish the instructions corresponding to the emulation of the x86 application from those corresponding to TOL and its various modules. The use of the timing simulator is optional and does not affect the functionality of the rest of the infrastructure.

The controller is the main interface of DARCO with the user. It provides full control over the execution of the application as well as debugging utilities. The main tasks of the controller are to synchronize the different components and to resolve the various requests from the PPC component, as will be explained in Section 3.2. The controller is also responsible for comparing the authoritative and emulated x86 states to ensure the correctness of the translation/optimization process of TOL.

3.2 Execution Flow

The execution flow of an application passes through three distinct phases: 1) Initialization, 2) Execution, and 3) Synchronization. During the Initialization phase, the controller first starts the PPC component, which, in turn, initiates the execution of TOL. The PPC component then remains idle until the controller sends it the initial x86 register state of the application to be executed.

As for the x86 component, when launched by the controller, it initiates the execution of the application defined by the user. When it reaches the system call EXECVE (which always takes place at the beginning of an application), the execution pauses. A process tracker is initialized with the application's Control Register 3 (CR3) value, which can be used to distinguish the specific process from the rest of the applications running on top of the operating system. The process tracker is used throughout the execution of the application in the x86 component in order to ease synchronization and tracking of the changes made to the x86 state (registers and memory) of the application. After the process tracker is initialized, the x86 component sends the initial x86 register state of the application to the controller. The Initialization phase is completed when the controller sends this state to the PPC component. At this point, the x86 register state is the same in both components.


Figure 3.2: Data page request from the PPC component, enforcing the synchronization phase.

During the Execution phase, TOL begins by executing code from the initial Program Counter value it received during the Initialization phase. All changes made to the x86 register state from the emulation of the x86 instructions are stored in the “Emulated x86 register state”, which resides in the memory space of TOL. Changes made to the memory space of the x86 application are stored in the “Emulated application x86 memory space”, which also resides in the memory space of TOL. While the x86 application is making forward progress in PPC component, the x86 component remains idle.

The Synchronization phase is initiated by the PPC component when any of the following three events occurs during the Execution phase: 1) a data request, 2) a system call, or 3) the end of the application. The data request event is raised when the PPC component encounters a load or store instruction that accesses an x86 memory page for the first time. The subsequent actions of the different components are depicted in Figure 3.2. The PPC component sends a request to the controller for the particular data page, along with the total number of dynamic x86 basic blocks that have been executed up to this point. Then, it remains idle until the request is satisfied. The controller forwards the request to the x86 component, which in turn continues the execution of the application until it reaches the same execution point as the PPC component (remember that the x86 component remained idle after the initial launch of the application). When the correct execution point is reached, the data page is sent to the controller and forwarded to the PPC component. This process guarantees that after every Synchronization phase the x86 application state, registers and memory, is identical between the x86 component and the PPC component; otherwise, the system complains and execution is aborted. This is also a useful technique to debug DARCO. The exact same process is followed for the other two events: system calls and end of application.

System calls raise the synchronization event because TOL models only user-level code; therefore, the system calls are executed only in the x86 component. Any changes made to the x86 state during the execution of a system call are passed to the PPC component after the system call completes. As for the end of the application, the synchronization phase is necessary in order to ensure that the execution of the application on the PPC component was correct.

3.3 Translation Optimization Layer

The Translation Optimization Layer (TOL) is the software layer that executes on top of the host RISC processor. It is responsible for translating the guest x86 code to the host ISA. It does that in three different execution modes: 1) interpretation mode (IM), 2) basic block translation mode (BBM), and 3) superblock translation and optimization mode (SBM).

TOL starts by interpreting the guest x86 instruction stream in IM. When a basic block is executed more than a predetermined number of times, TOL switches to BBM. In this mode, the whole basic block is translated and stored in the code cache, and the remaining executions of this basic block are done from the code cache. Moreover, branch profiling information for the direction and target of branches is also collected. Once the execution count of a basic block exceeds another predetermined threshold, TOL creates a bigger optimization region, called a superblock, using the branch profiling information collected during BBM. The superblock goes through several optimizations and is stored in the code cache. A high-level view of the execution flow of TOL is shown in Figure 3.3.
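
A minimal sketch of this two-threshold promotion in C (hypothetical; the name BB_translation_threshold appears later in this chapter, while the second threshold name and both values are ours):

    /* Hypothetical promotion logic driven by per-basic-block software counters. */
    enum mode { IM, BBM, SBM };

    #define BB_TRANSLATION_THRESHOLD  50    /* promote from IM to BBM (illustrative) */
    #define SB_PROMOTION_THRESHOLD    500   /* promote from BBM to SBM (illustrative) */

    static enum mode next_mode(enum mode current, unsigned exec_count)
    {
        if (current == IM  && exec_count > BB_TRANSLATION_THRESHOLD) return BBM;
        if (current == BBM && exec_count > SB_PROMOTION_THRESHOLD)   return SBM;
        return current;
    }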

3.3.1 Interpretation

TOL begins the execution of the application in IM. While in IM, x86 instructions are interpreted one by one and the x86 state is updated accordingly. IM guarantees forward progress of the application and also serves as a safety net in case instructions cannot be included in basic block translations and superblocks. Moreover, interpretation is necessary to make forward progress in case of speculation failures in a superblock due to aggressive optimizations.

There is one caveat concerning the interpretation method employed in DARCO. Due to the complex and time consuming nature of building an interpreter, we decided to use the translator provided by QEMU, but instead of translating one basic block at a time, it was modified to translate one instruction at a time. Since QEMU's translator was designed for portability (it supports translation from various guest to host ISAs), using it to translate just one instruction introduces high overhead. To accommodate the high cost of this interpretation method, an interpretation cache is used to store the interpretations. The interpretation cache is a typical code cache as used in HW/SW co-designed processors; the only difference is that instead of storing whole basic block translations or superblocks, it stores translations of individual instructions. Once the translation of an x86 instruction has been stored in the interpretation cache, its subsequent executions are done from this cache. This modification significantly reduces the cost of interpreting an x86 instruction. Also note that no chaining is done between interpretations.

Figure 3.3: Translation Optimization Layer (TOL) execution flow. The left path is followed in IM, the middle in BBM, and the right in SBM.

3.3.2 Basic Block Translation

During IM, profiling information about the execution frequency of basic blocks is collected using software repetition counters. When the repetition counter of a basic block reaches the BB_translation_threshold, TOL switches to BBM in order to translate the corresponding basic block.

Figure 3.4: Abstract translation of an x86 basic block to the host ISA. The eob_x86 instruction is used by DARCO for execution synchronization, and special ld_x86 instructions mark accesses to the x86 memory space.

Note that since we use a modified version of the QEMU translator and code generator, we also inherit some of its nomenclature. The intermediate representation of the instructions in DARCO is called qOps.

Figure 3.4 shows an abstract version of a typical translation of an x86 basic block. The original code is translated into an equivalent set of qOps. TOL translates all x86 memory operations in a special way: we introduced new qOps and host instructions for all load and store instructions in order to be able to distinguish during execution whether a memory access corresponds to the application itself or to TOL. There are two reasons for doing this. The first concerns functionality: the PPC component needs to know when there is an access to the x86 memory space so that, in the uncommon case that the data page was not communicated before, it can request the page from the controller, as explained earlier. The second concerns evaluation, since we would like to be aware of the performance characteristics of each translation.

At the end of the translation, a branch instruction and two exit stubs are inserted. The outcome of the branch instruction decides which of the two exit stubs to execute. Each exit stub consists of an empty position where the chaining will be patched later during execution, an update of the program counter, and a branch to TOL, where the basic block starting at the new program counter will be interpreted or translated. Once the chain position is patched, the execution does not return to TOL; instead, the next basic block is executed directly from the code cache. Finally, a new PPC instruction, eob_x86, is introduced. The purpose of this instruction is strictly synchronization; in terms of timing, it has no effect.


The qOps are forwarded to the code generator. There, they undergo some basic optimizations like dead code elimination and constant propagation, which contribute towards reducing the number of generated instructions. Finally, the qOps are translated to PPC instructions and stored in the code cache from where they are dispatched for execution.

3.3.3 Superblocks and Optimizations

During basic block translation mode (BBM), profiling information is gathered for all the basic blocks in BBM using software counters. This information consists of execution and edge counters. The execution counter provides the execution frequency of a basic block, while the edge counters monitor the biased branch directions. Once the execution count of a basic block exceeds another predetermined threshold, TOL creates a bigger optimization region, called a superblock, using the branch profiling information collected during BBM.

In superblock translation and optimization mode (SBM), TOL generates a new superblock starting from the triggering basic block. A superblock generally includes multiple basic blocks following the biased direction of branches. A superblock ends when one of the following conditions is met:

1) The last basic block included in the superblock ends with an indirect branch, call, or return instruction.
2) The last basic block included in the superblock ends with an unbiased branch, or the probability of reaching the last basic block from the beginning of the superblock falls below a predetermined threshold.
3) The number of instructions in the superblock exceeds a predetermined threshold.
4) The number of basic blocks included in the superblock exceeds a predetermined threshold.

Moreover, the branches inside a superblock are converted to "asserts" so that the superblock can be treated as a single-entry, single-exit sequence of instructions. This gives the freedom to reorder and optimize instructions across multiple basic blocks. "Asserts" are similar to branches in the sense that both check a condition. Branches determine the next instruction to be executed based on the condition; asserts have no such effect. If the condition is true, the assert does nothing; however, if the condition evaluates to false, the assert "fails" and the execution is restarted from a previously saved checkpoint in IM. Furthermore, if the number of assert failures in a superblock exceeds a predetermined limit, the superblock is recreated without converting branches to "asserts".


As a result, this time the superblock has to be treated as a single-entry multiple-exit sequence of instructions. Having multiple exits in a superblock also reduces available optimization opportunities because the instructions across different exit paths cannot be reordered as freely as before.
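
A sketch of the assert semantics described above (hypothetical C; restore_checkpoint_and_interpret stands in for the recovery path and is not a real DARCO routine):

    /* Declared elsewhere (hypothetical): discards speculative state and resumes
       the guest code from the last checkpoint in interpretation mode. */
    extern void restore_checkpoint_and_interpret(void);

    /* The condition encodes the biased branch direction assumed when the
       superblock was built. */
    static void assert_on_path(int condition_holds)
    {
        if (!condition_holds)
            restore_checkpoint_and_interpret();   /* "assert failure": roll back */
        /* otherwise fall through: no control-flow effect at all                 */
    }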

Furthermore, if a loop is detected while creating a superblock, it is unrolled. Currently, we unroll only loops consisting of a single basic block, as they are the ones that provide the maximum benefit [77]. To detect and unroll loops without control flow, the following steps are taken:

1) The target address of the first branch instruction in the superblock is compared against the address of the first instruction of the superblock. In case of a loop, the addresses match.
2) The execution and edge counters are used to determine the loop trip count.
3) The loop unroll factor is determined based on the data types in the loop, the SIMD accelerator width, and the loop trip count determined in the previous step. For example, if a loop contains only single-precision floating-point data types, then for a 128-bit wide SIMD accelerator the loop is unrolled 4 times, provided the loop trip count is greater than or equal to 4.

Moreover, the unrolled version of the loop is followed by the original loop (without unrolling). During execution, a runtime check determines whether to execute the unrolled version or the original loop: if the number of iterations left to execute is less than the loop unroll factor, the original loop is executed instead of the unrolled loop.
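
The arrangement can be pictured with the following C sketch (hypothetical, assuming an unroll factor of 4, e.g. single-precision data on a 128-bit SIMD unit):

    void scale_by_k(float *a, float k, int n)
    {
        int i = 0;
        /* Unrolled version: executed while at least 4 iterations remain. */
        for (; i + 4 <= n; i += 4) {
            a[i]     *= k;
            a[i + 1] *= k;
            a[i + 2] *= k;
            a[i + 3] *= k;
        }
        /* Original loop: runs the iterations left over when fewer than the
           unroll factor remain. */
        for (; i < n; i++)
            a[i] *= k;
    }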

The optimizer applies several transformations on the superblock. Figure 3.5 shows the different optimizations performed by the optimizer. First, the qOps are transformed into Static Single Assignment (SSA) form. This transformation removes anti and output dependences and significantly reduces the complexity of subsequent optimizations. Second, a forward pass applies a set of conventional single-pass optimizations: constant folding, constant propagation, copy propagation, and common subexpression elimination. Third, a backward pass applies dead code elimination.

After the basic optimizations, the Data Dependence Graph (DDG) is prepared. To create the DDG, the input and output registers of the instructions are inspected and the corresponding dependences are added. During DDG creation, we perform memory disambiguation analysis. If the analysis cannot prove that a pair of memory operations will never/always alias, it is marked as "may alias". In case of reordering, the original memory instructions are converted to speculative memory operations. Apart from this, redundant load elimination and store forwarding are also applied during the DDG phase, so that redundant memory operations are removed. The DDG is then fed to the instruction scheduler, which uses a conventional list scheduling algorithm. Afterwards, the determined schedule is used by the register allocator, which implements the linear scan register allocation algorithm. Finally, the qOps are translated to PPC instructions and the code is stored in the code cache. The previous entry in the code cache that corresponds to the first basic block of the current superblock is invalidated and freed for use by subsequent translations.

Figure 3.5: Optimization flow in superblocks.

3.4 Speculation and Recovery

Memory speculation is a key optimization to achieve performance in HW/SW co-designed systems. Treating two ambiguous memory references as independent of each other provides more freedom in instruction scheduling and boosts performance. For example, Transmeta Crusoe [36] reports that, on average, suppressing memory reordering causes 10% and 33% performance loss in operating system boots and user applications, respectively. This section briefly explains how the speculation and recovery mechanism works in DARCO.


A combination of software and hardware mechanisms is used to detect speculation failures and to recover from them. As described in the last section, if a pair of memory references cannot be proved to never/always alias, it is marked as "may alias". TOL labels each load/store instruction with a sequence number in the original program order. If a pair of load-store or store-store instructions that may alias is reordered, the original load/store instructions are converted to "speculative load/store" instructions.

The hardware has two sets of architectural registers: a working set and a shadow copy. Before starting the execution of speculative code, a copy of the working set is saved into the shadow registers (saving a checkpoint). During the execution, only the working copy of the registers is updated. In the case of speculation failure, the register state is restored by copying the contents of shadow registers to the working copy. Restoring the memory state is a little more complicated since it is not practical to have two copies of the whole memory state. To track the changes in the memory state, a store buffer is used. During the normal execution, store instructions write to the store buffer instead of directly writing to the memory. In the case of speculation failure, the contents of the store buffer are discarded, whereas they are forwarded to the memory if the speculated code executes successfully.

To detect a speculation failure, the hardware maintains a table to record address and size of all the memory locations accessed by “speculative load/store” instructions in the current superblock. Moreover, the sequence number of “speculative load/store” instructions is also recorded in the table. During the execution, if the hardware detects:

1) that a speculative memory instruction with a higher sequence number has executed before another speculative memory instruction with a lower sequence number, and
2) that they access overlapping memory locations,

an exception is raised. In this case, the contents of the store buffer are flushed and the register values from the shadow registers are copied to the working set (this has the effect of restoring the earlier saved checkpoint), and the execution is restarted in Interpretation Mode. On the other hand, in case of successful execution of the speculated code, the values in the store buffer are forwarded to memory and the contents of the shadow registers are discarded.
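
The check can be summarized with the following C sketch (our own simplification of the hardware table described above, not DARCO's implementation):

    #include <stdint.h>

    /* One table entry per executed speculative memory access in the superblock. */
    struct spec_entry { unsigned seq; uint64_t addr; unsigned size; };

    static int overlaps(uint64_t a, unsigned asz, uint64_t b, unsigned bsz)
    {
        return a < b + bsz && b < a + asz;
    }

    /* Called when a speculative access (seq, addr, size) executes: a failure is
       signalled if an already executed access with a higher sequence number
       touches an overlapping address range. */
    static int speculation_failed(const struct spec_entry *table, int entries,
                                  unsigned seq, uint64_t addr, unsigned size)
    {
        for (int i = 0; i < entries; i++)
            if (table[i].seq > seq &&
                overlaps(addr, size, table[i].addr, table[i].size))
                return 1;   /* raise exception: flush store buffer, restore checkpoint */
        return 0;
    }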

Figure 3.6 shows an example of the speculation failure detection mechanism. Figure 3.6a shows the original code sequence with two memory references where the relation between the memory addresses is unknown. The two instructions are labeled in program order. Figure 3.6b shows the reordered code sequence: the instructions keep their sequence numbers but are converted to speculative instructions to inform the hardware to check them for speculation failures. Figure 3.6c shows the hardware table state just before executing the speculative load instruction. The program counter points to the current instruction and the table has an entry for the already executed speculative store instruction. At this point, since the instruction with the higher sequence number (2) has been executed before the instruction with the lower sequence number (1), if the address of the current speculative load instruction overlaps with the address of the speculative store instruction, the hardware generates an exception and enters recovery mode. If the rate of speculation failures exceeds a predetermined limit in a particular superblock, the superblock is recreated without reordering ambiguous memory references.

Figure 3.6: Speculation failure detection example. a) Original code sequence: (1) ld_64 v1, M[x]; (2) st_64 v2, M[y]. b) Reordered code sequence: (2) st_64_s v2, M[y]; (1) ld_64_s v1, M[x]. c) Hardware table state just before the speculative load executes: one entry with sequence number 2, address y, size 8.

3.5 The Timing Model

The timing simulator is attached to the PPC component and models a simple in-order processor as depicted in Figure 3.7. The simple in-order processor is chosen in line with the simple-hardware design philosophy of co-designed processors. The closest product to the modeled baseline processor is the PPC 440 [3].

The modeled pipeline decouples the Front-End from the Back-End using an Instruction Queue. The Front-End fetches instructions (Address Calculation, AC, and Instruction Fetch, IF, stages), decodes them (DEC stage), and stores them in the Instruction Queue. It is also equipped with a Gshare branch predictor and a Branch Target Buffer.


Figure 3.7: Host Processor Pipeline (Front-End: AC, IF, DEC; Instruction Queue; Back-End: ISSUE, RR, EXE, WB; memory hierarchy: L1 I$ and D$, unified L2$, two-level data TLB, stride prefetcher, main memory).

The Back-End issues and executes instructions from the Instruction Queue. Issuing is done using a scoreboard that keeps track of the availability of the source registers. During the Register Read (RR) stage, the instructions read their operands from the bypass or the register file. The register file is logically divided between TOL and the application (32 registers are accessible only by TOL and 128 only by the translated application code). The EXE stage performs the operations required for instruction execution. Instructions spend a different number of cycles in this stage depending on their latencies. Branch mispredictions are also detected and handled in this stage. The last stage, Write Back (WB), writes the results computed during the EXE stage to the register file.
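As a rough illustration of scoreboard-based in-order issue (hypothetical structures, not the simulator's actual code), an instruction issues only when all of its source registers are available, and its destination is reserved until write back:

# Sketch of in-order issue from the Instruction Queue using a scoreboard.
from collections import namedtuple

Inst = namedtuple("Inst", "srcs dst latency")

def try_issue(inst, ready_cycle, current_cycle):
    # ready_cycle[r] holds the cycle at which register r becomes available.
    if any(ready_cycle.get(src, 0) > current_cycle for src in inst.srcs):
        return False                      # source not ready: stall, retry next cycle
    if inst.dst is not None:
        # Reserve the destination until the result is written back.
        ready_cycle[inst.dst] = current_cycle + inst.latency
    return True

ready = {}
print(try_issue(Inst(srcs=("r1", "r2"), dst="r3", latency=3), ready, 0))  # True
print(try_issue(Inst(srcs=("r3",), dst="r4", latency=1), ready, 1))       # False (r3 ready at cycle 3)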

L1 I-cache:                          64KB, 4-way set associative, 64-byte line, 1 cycle hit, LRU
L1 D-cache:                          64KB, 4-way set associative, 64-byte line, 1 cycle hit, LRU
Unified L2 cache:                    512KB, 8-way set associative, 64-byte line, 6 cycle hit, LRU
Scalar functional units (latency):   2 simple int (1), 2 int mul/div (3/10), 2 simple FP (2), 2 FP mul/div (4/20)
Vector functional units (latency):   1 simple int (1), 1 int mul/div (3/10), 1 simple FP (2), 1 FP mul/div (4/20)
Registers:                           128 integer, 128 vector, 32 FP
Main memory latency:                 128 cycles

Table 3.1: Host Processor Microarchitectural Parameters.

The modeled processor has a two-level cache hierarchy. The first level is split between instructions (L1 I$) and data (L1 D$), whereas the rest of the memory hierarchy (L2 and main memory) is shared. The TLB exists only for data and also has a two-level organization. Furthermore, notice that the Back-End is equipped with a stride prefetcher.


Microarchitectural parameters for the modeled processor are given in Table 3.1. The issue width of the processor is 2, with both pipelines being symmetric.

We model a 128-bit wide SIMD accelerator with two 64-bit wide lanes. Even though the modeled processor has an issue width of two with symmetric pipelines, the single SIMD accelerator is shared between both issue pipelines. Furthermore, all the operations except division and reciprocal are assumed to be pipelined.

3.6 TOL Configuration

In this section, we provide a quantitative insight into TOL configuration and the characteristics of the generated code. First, we study the effect of the threshold variation for promoting basic blocks from BBM to SBM. It is followed by a discussion on dynamic x86 instruction distribution in different execution modes. Then a study of emulation cost of x86 instructions in SBM is presented. Finally, we study TOL overhead and its various components.

To configure TOL, we use applications from the SPEC2006 [11] and Physicsbench [117] benchmark suites. For SPEC2006, we instrument the benchmarks using PIN [71] to find the most frequently executed routines. Then, we simulate four billion x86 instructions starting from these routines. The benchmarks in Physicsbench are executed until completion. The benchmarks are compiled with GNU GCC version 4.5.3 and the optimization flags "-O3 -ffast-math -fomit-frame-pointer".

Basic block promotion threshold (IM to BBM):  5 repetitions

Superblock creation parameters:
  Maximum basic blocks in a superblock:  16
  Maximum instructions (intermediate representation) in a superblock:  2K
  Minimum probability to reach the last basic block in the superblock from the entry:  0.9
  Minimum probability to reach a branch instruction in the superblock from the entry to convert it to "assert":  0.9

Speculation failure parameters:
  Control speculation failure threshold:  500 in the last 5,000 executions (10%)
  Memory speculation failure threshold:   100 in the last 10,000 executions (1%)

Table 3.2: TOL configuration parameters.


Table 3.2 shows the TOL configuration parameters. As the table shows, a basic block is promoted to BBM after it has been executed five times in IM. The maximum number of basic blocks and intermediate representation instructions allowed in a superblock is 16 and 2K, respectively. Also, while creating a superblock, if the probability of reaching a basic block from the entry of the superblock falls below 0.9, no more basic blocks are included in the superblock. Similarly, for a branch instruction to be converted into an "assert", the probability of reaching the branch from the entry of the superblock must be higher than 0.9; only such branches in a superblock are converted to asserts. Moreover, if the number of assert failures in a superblock exceeds 500 in the last 5,000 executions (10%), the superblock is recreated without converting branches into asserts. Furthermore, if the memory speculation failures in a superblock surpass a threshold of 100 in the last 10,000 executions (1%), the superblock is recreated without reordering ambiguous memory references.
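The failure thresholds of Table 3.2 amount to a simple windowed policy; a minimal sketch (illustrative only, not TOL code) of the recreation decision:

# Sketch of the superblock recreation policy driven by the thresholds in
# Table 3.2 (500 assert failures in the last 5,000 executions, 100 memory
# speculation failures in the last 10,000 executions).
def should_recreate(failures, executions, max_failures, window):
    """Recreate the superblock once the failure count within the
    observation window exceeds the allowed maximum."""
    if executions < window:
        return False                       # not enough history yet
    return failures > max_failures

# Assert (control speculation) threshold: 500 in the last 5,000 executions.
print(should_recreate(failures=620, executions=5000, max_failures=500, window=5000))   # True
# Memory speculation threshold: 100 in the last 10,000 executions.
print(should_recreate(failures=40, executions=10000, max_failures=100, window=10000))  # False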

For speculation and recovery, as discussed in Section 3.4, the hardware maintains a table where it stores the sequence number, address, and size of speculative load/store instructions. We implement this table with 1K entries. The optimal duration/position for taking a checkpoint is a separate research problem and is out of the scope of this work. For simplicity, we take a checkpoint at the beginning of every superblock at execution time. We implement the store buffer with 1K entries. Moreover, to avoid overflow of the store buffer, we restrict the number of load/store instructions in a superblock to 1K. Since we take a checkpoint at the beginning of every superblock and a superblock cannot have more than 1K loads/stores, the store buffer can never overflow.

3.6.1 Optimal Promotion Threshold

The promotion threshold from BBM to SBM (BB/SBth), i.e., the number of times a basic block needs to be executed in BBM before it is further optimized and included in a superblock, poses a trade-off between the quality of the generated code, in terms of host instructions per x86 instruction, and the optimization overhead introduced by TOL to generate this code. The lower the threshold, the more code is optimized, thus reducing the emulation cost, but at the same time the overhead introduced by the optimizer is higher. Throughout this section, we define as overhead all the instructions that are not devoted directly to the emulation of the application. As a metric to find the optimal promotion threshold, we use the total number of host instructions (TOL and application instructions) needed to execute a given number of x86 instructions.
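The metric can be stated directly: for each candidate threshold, the total host-instruction count is the sum of TOL overhead and application instructions, and the optimal threshold is the one that minimizes this sum. A tiny sketch with placeholder numbers (not measured data):

# Sketch of the promotion-threshold selection metric: total dynamic host
# instructions = TOL overhead + application instructions.
candidates = {
    # threshold: (tol_overhead, application_instructions) -- placeholder values
    1_000:   (9.0e9, 31.0e9),
    60_000:  (4.0e9, 33.5e9),
    180_000: (2.5e9, 36.5e9),
}

def total_host_instructions(costs):
    overhead, application = costs
    return overhead + application

best = min(candidates, key=lambda th: total_host_instructions(candidates[th]))
print(best)   # threshold minimizing the total host instruction count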


Figure 3.8: Effect of Threshold Variation on the Number of Host Instructions (host instructions, normalized to a threshold of 500 and split into TOL overhead and application instructions, for SPECINT2006, SPECFP2006, and Physicsbench).

For this study, we considered different thresholds ranging from 500 up to 180K. The experimental results are depicted in Figure 3.8. Each bar corresponds to a different value of BB/SBth and represents the number of dynamic host instructions normalized to the execution with a threshold of 500. Each bar is split into two parts: the lower part represents the overhead introduced by TOL, and the upper part represents the host instructions corresponding to the x86 code.

Two dominant trends can be observed in Figure 3.8. As BB/SBth increases, TOL overhead decreases, since fewer static basic blocks are optimized. At the same time, the number of application instructions increases, since a smaller portion of the dynamic instruction stream has been optimized by TOL, leading to a higher ratio of host to guest instructions.

The experimental results show that the optimal BB/SBth is 60K for Physicsbench and SPECINT2006, whereas SPECFP2006 is relatively insensitive to the threshold variation. It is also important to note that TOL overhead is higher in Physicsbench than in SPEC2006. The reason lies in the fact that the SPEC2006 benchmarks have a higher dynamic-to-static instruction ratio than Physicsbench. Therefore, in SPEC2006, TOL overhead is amortized by repeated executions of the translated/optimized code. Furthermore, the effect of threshold variation is stronger in SPECINT2006 and Physicsbench than in SPECFP2006. The high dynamic-to-static instruction ratio and bigger basic blocks also make SPECFP2006 less sensitive to BB/SBth variation.


Figure 3.9: Dynamic x86 instruction distribution in IM, BBM and SBM (percentage of dynamic x86 instructions executed in each mode, per benchmark, for SPECINT2006, SPECFP2006, and Physicsbench).

It is worth mentioning that the optimal threshold varies between different systems. For example, a lightweight translator like DynamoRIO [30] can use a lower threshold (50 repetitions for DynamoRIO) since the overheads introduced by translation are significantly lower, mainly because the guest and host ISA are the same. On the other hand, the overhead introduced by a translator between different guest and host ISAs, like TOL, is higher and, as such, it requires a higher threshold in order to amortize the overhead introduced by the optimizer.

3.6.2 Optimized Code Distribution

Given the optimal threshold of 60K repetitions, we studied the dynamic x86 instruction distribution among the three execution modes of TOL: the Interpreter Mode (IM), the Basic Block Mode (BBM), and the Superblock Mode (SBM).

The dynamic x86 code distribution is depicted in Figure 3.9. The experimental data shows that even with a threshold as high as 60K repetitions, 88%, 96%, and 75% of the dynamic instruction stream comes from the highest level of optimization, superblocks, in SPECINT2006, SPECFP2006, and Physicsbench respectively. For three benchmarks in Physicsbench, namely continuous, periodic, and ragdoll, a significant number of instructions is executed in BBM. The dynamic instruction count and the dynamic-to-static instruction ratio in these benchmarks are quite small; therefore, only a small portion of the code is promoted to SBM.


Figure 3.10: Host instructions per x86 instruction in SBM (per benchmark, for SPECINT2006, SPECFP2006, and Physicsbench).

3.6.3 Emulation Cost

Emulation cost is the number of host instructions generated per x86 instruction. Since the dynamic instruction stream is dominated by the execution of code in SBM, we present the emulation cost only in SBM. As Figure 3.10 shows, on average TOL generates 4, 2.6, and 3.1 host instructions per x86 instruction for SPECINT2006, SPECFP2006, and Physicsbench respectively. The emulation cost of SPECINT2006 is high because of the high emulation cost of branch instructions: since the basic blocks in SPECINT2006 are smaller, the emulation cost of branches dominates the overall emulation cost. Physicsbench is costlier in terms of emulation cost because it uses a significant number of trigonometric functions like sin and cos. These x86 instructions do not map directly to host instructions; instead, they are emulated in software, which increases the overall emulation cost.

3.6.4 Dynamic Instruction and Overhead Distribution

The overall host dynamic instruction stream is composed of two components: 1) application instructions and 2) TOL overhead instructions. Application instructions are the portion of the dynamic instruction stream corresponding to the emulation of the x86 application. On the other hand, TOL overhead is the portion needed to translate the x86 code to host code, plus the other housekeeping tasks performed by TOL.

Figure 3.11 shows the percentage of application instructions vs. TOL overhead in the host dynamic instruction stream. For SPECINT2006 and SPECFP2006, 16% and 13% of the overall instruction stream corresponds to TOL overhead, respectively, whereas for Physicsbench this number rises to 41%. As mentioned earlier, the high dynamic-to-static instruction ratio causes TOL overhead to be amortized in SPEC2006. On the other hand, in Physicsbench the overhead is not amortized due to fewer executions of the translated code.


Figure 3.11: Overall Host Dynamic Instruction Distribution (TOL overhead vs. application instructions, per benchmark).

Figure 3.12: Dynamic TOL Overhead Distribution (breakdown into Interpreter, BB Translator, SB Translator, Prologue, Chaining, Code Cache Lookup, and Others, per benchmark).

Figure 3.12 shows the various components of TOL overhead, divided into seven major categories (bottom-up in the figure):

1) Interpretation Overhead: TOL overhead for interpreting the code before it is promoted to BBM.
2) BB Translator Overhead: TOL overhead for translating the basic blocks promoted to BBM.
3) SB Translator Overhead: TOL overhead for creating, translating, and optimizing the superblocks.
4) Prologue: Every time control is transferred between TOL and the translated code, a specific piece of code is executed to perform housekeeping tasks such as stack management; the Prologue overhead corresponds to executing this code.
5) Chaining: Different translated basic blocks and superblocks can be connected to each other in the code cache. Checking whether chaining is possible and chaining the possible pairs constitutes the Chaining overhead.
6) Code Cache Lookup: Every time control is transferred to TOL, it checks whether the translation for the next x86 basic block is already present in the code cache. This lookup is the code cache lookup overhead.
7) Others: All other overheads, such as managing the control flow in the main loop of TOL, collecting statistics, TOL initialization, etc., are counted in this category.

Figure 3.13: Floating Point and Integer instruction distribution in the host dynamic instruction stream (per benchmark, for SPECINT2006, SPECFP2006, and Physicsbench).

It is interesting to note that, in Physicsbench, the Interpretation and BB Translator overheads dominate the overall overhead, whereas in SPECFP2006 these overheads are relatively smaller. The reason for this behavior is, once again, the dynamic-to-static instruction ratio. The SB Translator overhead, where the most aggressive and speculative optimizations are applied, is relatively small for both benchmark suites.

3.6.5 Floating Point and Integer Instruction Distribution

Figure 3.13 presents the host dynamic instruction distribution between floating-point and integer code. As the figure shows, 39% and 22% of the overall host dynamic instruction stream corresponds to floating-point code for SPECFP2006 and Physicsbench respectively, whereas the amount of floating-point code in SPECINT2006 is negligible. Since most of the SIMD optimizations target floating-point code, we consider only SPECFP2006 and Physicsbench to evaluate our proposals in the next chapters.


3.7 Host ISA Extension

The baseline PowerPC ISA has been extended mainly to support features missing for vector execution. For example, Altivec (the SIMD extension of PowerPC) supports only packed single-precision floating-point operations; we added support for packed double-precision floating-point operations as well. Moreover, Altivec does not support scalar floating-point operations: these are executed on a separate floating-point unit (FPU) with its own register file, and communication between the vector register file and the scalar floating-point register file has to go through memory. We therefore added support for scalar floating-point operations too, which allows both scalar and vector floating-point operations to execute on the same unit without the overhead of communication between the two register files. The SIMD instruction set of the host ISA consists of the following instruction groups:

Data Transfer Instructions: The data transfer instruction set provides instructions for moving data among vector registers and between vector registers and memory. The memory access instructions include 32-bit, 64-bit, and 128-bit loads/stores from/to memory. Moreover, the speculative versions of these basic memory instructions are also provided. The ISA also provides instructions to move data between vector and integer register files.

Arithmetic and Logical Instructions: The arithmetic instruction set offers instructions for performing addition, subtraction, multiplication, division, maximum, minimum, and square root operations on single and double-precision floating-point data. The reciprocal operation is available only for single-precision floating-point numbers. The logical instructions provide support for AND, AND NOT, OR, and XOR operations. The ISA provides scalar and vector versions of all these instructions: the scalar version operates on only one element, whereas the vector version performs multiple operations at the same time.

Conversion Instructions: The base ISA provides instructions to support conversion between:

1) Single and double-precision floating-point formats.
2) Single-precision floating-point and doubleword integer formats.
3) Double-precision floating-point and doubleword integer formats.

Scalar and vector versions of all the conversion instructions are provided.


Comparison Instructions: The baseline ISA comparison instructions compare single-precision floating-point, double-precision floating-point, or integer values and return the result either to a destination register or to the flag register. For single and double-precision floating-point values the comparison instructions can check for "equal", "not equal", "less than", "not less than", "less than or equal", "not less than or equal", "ordered" (true if none of the source operands is NaN), and "unordered" conditions. For integers, the only checks provided are "greater than" and "equal". Moreover, integer comparisons are provided at byte, word, doubleword, and quadword granularity.

Permutation Instructions: Permutation support in the baseline ISA is provided through several shuffle instructions. These instructions interleave the contents of two vector source registers and store the result in the destination register. Broadcast instructions are provided to copy a value to all the elements of a vector register; the copied value may come from memory or from another register. Moreover, unpack instructions are provided to distribute the result of a packed operation to several registers.

3.8 McPAT

To measure the energy savings of power gating, we integrated McPAT [70] into DARCO. McPAT is an integrated power, area, and timing modeling tool for multithreaded and multi-core architectures. In terms of power, McPAT models not only dynamic power but also leakage power. Instead of being hardwired to a particular simulator, McPAT uses an XML-based interface to communicate with performance simulators. In addition to communicating with the performance simulator, the XML interface is also used to pass static microarchitectural, circuit, and technology parameters. Running McPAT separately from the performance simulator makes it flexible and easily portable to different performance simulators.

The configuration parameters in the XML file can be defined at three levels: architectural, circuit, and technology. The parameters defined at the architectural level are similar to the input parameters of a performance simulator. These include the number of cores, core issue width, homogeneous or heterogeneous cores, in-order or out-of-order cores, number of hardware threads per core, instruction scheduling scheme, cache sizes and levels, shared or private caches, directories, network on chip, number of routers, etc. The circuit level parameters are used to specify implementation details, such as whether to use flip-flops or SRAM to implement arrays, the crossbar types for on-chip routers, etc. The technology level parameters include device and interconnect types. McPAT supports three device types: 1) high performance, 2) low standby power, and 3) low operating power. The interconnects can use either aggressive or conservative wire technology. Moreover, the user can specify whether to use long channel devices where possible. McPAT supports technology nodes from 22nm to 90nm. Instead of modeling power using a linear scaling assumption across different technologies, McPAT uses technology projections from the ITRS [10].

McPAT is composed of three main components: 1) the power, area, and timing models, 2) the optimizer for circuit level optimizations, and 3) the internal chip representation. The power, area, and timing models read the configuration parameters passed by the user in the XML file and model power, area, and timing. In general, McPAT focuses on power and area modeling, whereas the target clock rate is taken as a design constraint. The optimizer takes the target clock frequency, power and area deviations, optimization functions, and other parameters, and explores the design space to optimize each processor component. Among the configurations satisfying the power and area deviations, McPAT applies the optimization function to calculate the power and area results. The optimizer then generates an internal chip representation. This chip representation, along with the power, area, and timing models, is used to generate the final chip area, timing, and peak power. The final runtime power dissipation is calculated based upon the activity factor and peak power of the individual units. Most of the parameters in the internal chip representation are directly set by input parameters in the XML file, such as cache levels and capacity, issue width, etc.

3.8.1 Power Modeling

The main sources of power dissipation in CMOS technology include: 1) dynamic power, 2) short-circuit power, and 3) leakage power. The power consumed by a circuit while it is performing some operation (i.e., when it is active) is called dynamic or active power. The main source of dynamic power consumption is the charging and discharging of capacitive loads due to changes in the circuit state. Therefore, the dynamic power consumption of a circuit is directly proportional to the capacitive load it drives. Other factors that affect the dynamic power include the supply voltage, the voltage swing during charging/discharging, the clock frequency, and the activity factor. Most of these parameters, like the clock frequency and the supply voltage (which depends upon the technology used), are passed to McPAT in the XML configuration file. The activity factor is provided by, or calculated from, the statistics of the performance simulator. McPAT uses circuit decomposition techniques and analytic models to find the load capacitance of a module.
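For reference, a standard first-order CMOS model consistent with the factors listed above (not a formula quoted from McPAT's documentation) expresses the dynamic power of a module driving a load capacitance C as:

P_{dyn} \approx \alpha \cdot C \cdot V_{dd} \cdot \Delta V \cdot f

where \alpha is the activity factor, V_{dd} the supply voltage, \Delta V the voltage swing during charging/discharging (often close to V_{dd}, which gives the familiar \alpha C V_{dd}^2 f form), and f the clock frequency.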

Short-circuit power in CMOS circuits occurs because both transistors (PMOS and NMOS) are momentarily in the "on" state while the circuit switches its state.


During this interval, there is a path between the power supply and ground, which results in short-circuit power consumption. McPAT uses a model developed by Nose et al. [81] to compute the short-circuit power consumption.

In nanometer-scale technologies, the contribution of leakage power to the overall power consumption increases significantly. The power consumed by a circuit when it is idle is called leakage power. There are mainly two sources of leakage power: 1) subthreshold leakage and 2) gate leakage. Subthreshold leakage occurs due to current flowing between the drain and source terminals of a transistor even when it is in the off state, whereas gate leakage corresponds to the tunneling of electrons from the gate terminal to the channel between drain and source. McPAT uses MASTAR [10] and data from Intel [19] to estimate the leakage power consumption.


Chapter 4

Speculative Dynamic Vectorization

Compiler-based static vectorization is widely used to extract data level parallelism from computation intensive applications. Static vectorization is very effective for vectorizing traditional array-based applications. However, the compiler's inability to reorder ambiguous memory references severely limits vectorization opportunities, especially in pointer-rich applications. HW/SW co-designed processors provide an excellent opportunity to optimize applications at runtime. The availability of dynamic application behavior at runtime helps in capturing vectorization opportunities generally missed by compilers.

This chapter presents our proposal of complementing static compile time vectorization with a speculative dynamic vectorizer in a HW/SW co-designed processor. It also presents a speculative dynamic vectorization algorithm that speculatively reorders ambiguous memory references to uncover vectorization opportunities. The hardware checks for any memory dependence violation due to speculative vectorization and takes corrective action in case of a violation.

4.1 Introduction

As we have seen in the previous chapters, Single Instruction Multiple Data (SIMD) accelerators are ubiquitous in processors from all computing domains. They are very attractive from the hardware point of view due to their simple control mechanism and replicated functional units. However, code generation for SIMD extensions has always been challenging. In the early days, programmers used to target these extensions mainly using in-line assembly or specialized library calls. Then, automatic generation of SIMD instructions (auto-vectorization) was introduced in compilers [24][78][104], which borrowed their methodology from vector compilers. These compilers target loops for generating code for SIMD accelerators. Later, S. Larsen et al. [66] introduced Superword Level Parallelism (SLP), which targets basic blocks instead of whole loops for vectorization. These static approaches to vectorization are effective for traditional applications where memory is referenced through explicit array accesses, whereas modern applications make extensive use of pointers. Due to this, disambiguation of a pair of memory accesses becomes difficult at compile time. Since memory operations form the foundation of vectorization, current static approaches are limited in extracting SIMD parallelism.

In this chapter, we provide details of our proposal of having dynamic vectorization as a complementary optimization to compiler based static vectorization. It is important to note that we do not propose to eliminate static vectorization altogether, because several complex and time consuming transformations, such as loop distribution, loop interchange, loop peeling, memory layout change, and algorithm substitution, are not straightforward or are too costly to apply at runtime. However, static vectorization alone fails to capture significant vectorization opportunities due to conservative pointer disambiguation analysis. To handle these cases we propose a speculative dynamic vectorizer that can speculatively reorder ambiguous memory references to uncover vectorization opportunities. Moreover, in the absence of loops, the scope of static vectorization is a single basic block. We propose to vectorize bigger code regions, superblocks, which include multiple basic blocks and can be created at runtime following the biased direction of branches.

Furthermore, we propose a speculative dynamic vectorization algorithm that can be implemented in the software layer of HW/SW co-designed processors. The proposed algorithm speculatively reorders and vectorizes memory operations. During execution, the hardware checks for any memory dependence violations caused by speculative vectorization. If a violation is detected, the hardware rolls back to a previously saved checkpoint and executes a non-speculative version of the code. The hardware support required for speculative execution is already provided by co-designed processors, as discussed in Chapter 2. Therefore, no additional hardware support is needed for speculative vectorization.

Moreover, in the absence of static compiler vectorization, our algorithm can also work as a standalone vectorizer. Therefore, legacy code that was not compiled for any SIMD accelerator can be vectorized using the proposed algorithm. The co-designed nature of the processor makes the vectorization portable: the algorithm can be modified to transparently target a different SIMD accelerator. It is important to note that the proposed algorithm does not require any compiler or operating system support/modification.

In the rest of the chapter, we start by briefly presenting the motivation for this work in Section 4.2, followed by the details of the proposed speculative dynamic vectorization algorithm in Section 4.3. The evaluation of the algorithm using SPECFP2006, Physicsbench, and UTDSP [13] applications is presented in Section 4.4. Section 4.5 presents the related work and Section 4.6 concludes.

4.2 Motivation

Traditional compile time loop vectorization is effective for applications involving explicit array accesses, since memory dependence analysis is relatively easy in that case. Significant performance gains have been reported using compiler vectorization in the past [24][66]. However, one of the major obstacles to vectorization at compile time is memory disambiguation and dependence testing. J. Holewinski et al. [45] showed that static vectorization fails to extract significant vectorization opportunities, especially in pointer-based applications. Furthermore, S. Maleki et al. [74] showed that modern compilers, including Intel ICC, IBM XLC, and GNU GCC, are limited in vectorizing modern applications. Extensive use of pointers and pointer arithmetic in these applications complicates memory disambiguation and dependence testing. Even though research shows that a pair of memory accesses rarely aliases unless the aliasing is obvious [41], compilers generate conservative code to ensure correctness, which limits vectorization opportunities [88]. For example, Figure 4.1a shows a function with a loop that performs pointer arithmetic. The function takes two pointers as parameters. Since the function can be called from a number of places in the entire program, the compiler's inter-procedural analysis needs to check whether the two pointers can alias or not. If the compiler cannot prove that the two pointers always reference different memory locations, it conservatively assumes a dependence between them to ensure correctness. As a result, the loop is not vectorized.

As stated before, another approach to vectorization, SLP [66], performs vectorization at the low-level intermediate representation. SLP vectorizes at the basic block level instead of the loop level. Therefore, SLP may vectorize portions of a loop even if the whole loop is not vectorizable, whereas traditional loop vectorizers vectorize either the whole loop or nothing. SLP starts by identifying adjacent memory accesses and then follows their def-use and use-def chains. A def-use chain consists of a definition of a variable and all its uses reachable from that definition without any other intervening definition. Similarly, a use-def chain consists of a use and all the definitions of a variable that can reach that use without any other intervening definition. Figure 4.1b shows the low level intermediate representation for the loop of Figure 4.1a after unrolling it once. In this case, even though I0 and I6 are adjacent memory references, they cannot be packed by SLP since I4 and I6 may alias. Similarly, I4 and I10 access consecutive memory locations; however, they also cannot be vectorized because I6 might alias with them. Thus, memory dependences affect both traditional loop vectorizers and modern SLP.


void example(double *a, double *b)
{
    int i;
    for (i = 0; i < NUM_ITR; i++)
        a[i] += b[i] * CONST;
}

a) An example loop with pointers.

loop:
  I0  ld_64 v2, M[r2 + r1 * 8]
  I1  mulsd v3, v2, v1
  I2  ld_64 v4, M[r3 + r1 * 8]
  I3  addsd v5, v4, v3
  I4  st_64 v5, M[r3 + r1 * 8]
  I5  add   r4, r1, 1
  I6  ld_64 v6, M[r2 + r4 * 8]
  I7  mulsd v7, v6, v1
  I8  ld_64 v8, M[r3 + r4 * 8]
  I9  addsd xmm0, v8, v7
  I10 st_64 xmm0, M[r3 + r4 * 8]
  I11 add   r1, r4, 1
  I12 cmp   r1, r0
  I13 jne   loop

b) Unrolled lower level representation.

  V0 Pack v1, v1, v1
loop:
  V1 ld_128_spec v2, M[r2 + r1 * 8]
  V2 mulpd v3, v2, v1
  V3 ld_128 v4, M[r3 + r1 * 8]
  V4 addpd v5, v4, v3
  V5 st_128_spec v5, M[r3 + r1 * 8]
  V6 add r1, r1, 2
  V7 cmp r1, r0
  V8 jne loop
  V9 Unpack xmm0, v5

c) Speculatively vectorized version.

Figure 4.1: An example loop with pointer arithmetic.

1 Pack/Unpack instructions are explained in Section 4.3.1.

62

One possible solution that compilers may provide is to generate two versions of the loop: one without vectorization and another vectorized with a runtime test to check for aliasing. However, this solution is not optimal because:

1) The runtime test has to be executed every time before executing the loop, resulting in performance loss. Moreover, as the number of arrays to be checked for aliasing increases, the number of checks to be performed also increases (a sketch of such a runtime check is shown after this list).
2) Having multiple versions of the loop increases the static code footprint of the application, which results in higher instruction cache size requirements.
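To make the first point concrete, the following Python sketch (the element size, function names, and addresses are assumptions for illustration, not generated compiler code) shows the guard that such a multi-versioned loop has to evaluate on every invocation:

# Sketch of the runtime disambiguation test guarding a compiler-generated
# vectorized loop version (multi-versioning).
ELEM_SIZE = 8   # bytes per double

def run_vectorized_version(a, b, n):
    print("vectorized loop body")

def run_scalar_version(a, b, n):
    print("scalar loop body")

def regions_disjoint(base_a, base_b, num_itr):
    # True if a[0..num_itr) and b[0..num_itr) never overlap in memory.
    end_a = base_a + num_itr * ELEM_SIZE
    end_b = base_b + num_itr * ELEM_SIZE
    return end_a <= base_b or end_b <= base_a

def example(a_addr, b_addr, num_itr):
    if regions_disjoint(a_addr, b_addr, num_itr):
        run_vectorized_version(a_addr, b_addr, num_itr)   # safe path
    else:
        run_scalar_version(a_addr, b_addr, num_itr)       # conservative path

example(0x1000, 0x2000, 100)   # disjoint regions -> vectorized version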

Another way of vectorizing the example loop is through the "__restrict" keyword, which indicates that a symbol is not aliased in the current scope. If the programmer knows that the two pointers passed to the function never alias, this information can be conveyed to the compiler using the "__restrict" keyword. Once the compiler is sure that the two pointers always access non-overlapping memory locations, it can vectorize the loop. However, this requires source code modification, which is not always possible, e.g., when the source code is unavailable. In contrast, the proposed mechanism does not require any source code modification.

HW/SW Co-designed processors provide an excellent opportunity to handle these cases: instead of generating multiple versions, a single speculatively vectorized version can be generated by the software layer and the hardware can be tailored to execute the vectorized code efficiently and safely. The proposed algorithm speculatively reorders memory operations to expose vectorization opportunities. For example, in the code of Figure 4.1b, our algorithm speculatively assumes that I4 and I6 will never alias and reorders them to pack I0 and I6 together, as shown in Figure 4.1c. Moreover, due to the speculative reordering, V1 is converted to a speculative load and V5 to a speculative store. If during the execution it turns out that V1 and V5 access overlapping memory locations, the hardware will detect this condition and will take corrective measures. In this example, by vectorizing speculatively, we are able to vectorize the whole loop, whereas loop vectorization and SLP could not find vectorization opportunities.

Therefore, having two complementary vectorization schemes helps to get the best of both worlds. First, static vectorization applies the more complex and time consuming loop transformations, even though it vectorizes conservatively. Later, at runtime, a dynamic vectorization phase catches the opportunities missed by static vectorization and speculatively vectorizes ambiguous memory references and their dependent operations.


Figure 4.2: Optimization Sequence in Superblocks for Vectorization Support (translation to intermediate representation, loop unrolling, control speculation, SSA; forward pass: constant folding, constant propagation, copy propagation, common subexpression elimination; backward pass: dead code elimination; DDG: redundant load removal, store forwarding, memory alias analysis, consecutiveness analysis; vectorization, dead code elimination; instruction scheduling, data speculation; register allocation, code generation).

4.3 Dynamic Vectorization Algorithm

This section provides the details of the proposed speculative dynamic vectorization scheme. Vectorization is applied only to superblocks in SBM, as they represent the most frequently executed portion of the code. Moreover, vectorization is applied after all other dynamic optimizations. This ensures that all the dead code is eliminated before vectorization and only the code contributing to the final result is vectorized.

Figure 4.2 shows the modified optimization sequence in superblocks. The changes with respect to the baseline are shown in italics in the original figure. First, we perform an additional analysis during DDG construction to discover consecutive memory accesses; we call it consecutiveness analysis. This analysis discovers consecutive memory accesses so that they can be packed together by the vectorizer if possible. After the consecutiveness analysis, vectorization itself is performed. A backward pass follows the vectorization phase and converts the scalar versions of the vectorized instructions to "nops".
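A possible, simplified form of the consecutiveness analysis (the representation and helper below are assumptions of this sketch, not TOL code): two memory accesses of the same width are consecutive when they use the same base and index registers and scale, and their displacements differ by exactly the access size. Note that for the example of Figure 4.1b the analysis must additionally recognize that r4 = r1 + 1 to relate the two unrolled iterations, which this sketch does not attempt.

# Sketch of consecutiveness analysis over addresses of the form
# M[base + index * scale + disp]; purely illustrative.
from collections import namedtuple

MemRef = namedtuple("MemRef", "base index scale disp size")

def consecutive(first, second):
    # True if `second` accesses the memory location immediately after `first`.
    return (first.base == second.base and
            first.index == second.index and
            first.scale == second.scale and
            first.size == second.size and
            second.disp - first.disp == first.size)

# Two hypothetical 64-bit loads: M[r3 + r1*8] and M[r3 + r1*8 + 8].
a = MemRef(base="r3", index="r1", scale=8, disp=0, size=8)
b = MemRef(base="r3", index="r1", scale=8, disp=8, size=8)
print(consecutive(a, b))   # True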


4.3.1 The Vectorizer

This section explains the vectorization algorithm with pseudo-code and a practical example. The pseudo-code for the vectorizer is listed in Algorithm 4.1. The vectorizer packs together a number of independent scalar instructions that perform the same operation and replaces them with one vector instruction. The number of scalar instructions packed depends on two factors:

• the data types of the scalar instructions,
• the host vector length.

For example, for a host vector length of 128-bit, four 32-bit single-precision floating-point instructions can be packed together in a single vector instruction. Therefore, vectorization reduces dynamic instruction count and improves performance. Before describing the algorithm itself, we define a set of conditions that a pair of instructions must satisfy to be included in the same pack:

• The instructions must perform the same operation.
• The instructions must be independent.
• The instructions must not already be in another pack.
• If the instructions are loads/stores, they must access consecutive memory locations.

A sketch of a predicate checking these conditions is shown below.
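The inst_can_pack helper used throughout Algorithm 4.1 is not spelled out in the listing; the following is a minimal sketch of how such a predicate could evaluate the four conditions, assuming simple instruction objects and dependence/consecutiveness helpers that are not part of the thesis code:

# Sketch of the packing-conditions check (inst_can_pack of Algorithm 4.1).
from collections import namedtuple

Inst = namedtuple("Inst", "opcode")

def inst_can_pack(pack, inst, available_for_pack, depends, consecutive):
    # pack: instructions already in the candidate pack.
    # depends(a, b): True if b depends (directly or transitively) on a.
    # consecutive(a, b): True if b accesses the location right after a.
    last = pack[-1]
    if inst not in available_for_pack:          # already in another pack
        return False
    if inst.opcode != last.opcode:              # must perform the same operation
        return False
    if any(depends(p, inst) or depends(inst, p) for p in pack):
        return False                            # must be independent
    if inst.opcode.startswith(("ld", "st")) and not consecutive(last, inst):
        return False                            # loads/stores must be consecutive
    return True

add1, add2 = Inst("addsd"), Inst("addsd")
print(inst_can_pack([add1], add2, {add1, add2},
                    depends=lambda a, b: False,
                    consecutive=lambda a, b: False))   # True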

Vectorization starts by marking all the instructions that are candidates for vectorization. Moreover, we mark First Load and First Store instructions. First Load/Store instructions are those for which there are no other loads/stores from/to the immediately preceding memory locations. For example, if there is a 64-bit load instruction IL that loads from a memory location [M] and there is no 64-bit load instruction that loads from address [M – 8], we call IL a First Load.

Vectorization begins by packing consecutive stores, starting from a First Store. The decision of starting with stores instead of loads is based on the observation that a given kind of operation always has the same number of predecessors, e.g. all the additions always have two predecessors, whereas the number of successors may vary depending on how many instructions consume the result. Consequently, following a bottom-up approach results in a more structured tree traversal than a top-down approach.

Once a pack of stores is created, their predecessors are packed (Pack_pred_succ routine), before packing other stores, if they satisfy the packing conditions. Moreover, if the last store in the pack has a next adjacent store, that store is marked as a First Store so that a new pack can start from it.


Algorithm 4.1a. Top Level Vectorization Function

Vectorize(SB):
    Set_packable(SB, Available_for_pack, First_St, First_Ld)
    Pack_ldst(SB, Available_for_pack, First_St, packs)
    Pack_ldst(SB, Available_for_pack, First_Ld, packs)
    Set_Arith(SB, Available_for_pack, Arith)
    Pack_Arith(SB, Available_for_pack, Arith, packs)

Algorithm 4.1b. Load-Store Vectorization

Pack_ldst(SB, Available_for_pack, First_LdSt, packs):
    for inst in First_LdSt:
        vec_length = get_vector_length(inst)
        P = [inst]
        for i in range(1, vec_length):
            if inst has next_ldst:
                if inst_can_pack(P, next_ldst, Available_for_pack):
                    P.extend(next_ldst)
                    inst = inst.next_ldst
            else:
                break

        if len(P) == vec_length:
            packs.extend(P)
            Make_unavilable(P, Available_for_pack)
            First_LdSt.extend(inst.next_ldst)
            Traverse_pred_succ(SB, Available_for_pack, packs)

Algorithm 4.1c. Traverse Predecessors/Successors

Traverse_pred_succ(SB, Available_for_pack, packs):
    need_Pack = Pack_pred_succ(SB, Available_for_pack, packs[latest].preds, packs)
    if need_Pack:
        generate_Pack_inst
    need_Unpack = Pack_pred_succ(SB, Available_for_pack, packs[latest].succs, packs)
    if need_Unpack:
        generate_Unpack_inst

Algorithm 4.1d. Vectorize Predecessors/Successors

Pack_pred_succ(SB, Available_for_pack, pred_succ, packs):
    for inst in pred_succ:
        if inst in Available_for_pack:
            vec_length = get_vector_length(inst)
            P = [inst]
            for i in range(1, vec_length):
                for inst1 in pred_succ[i]:
                    if inst_can_pack(P, inst1, Available_for_pack):
                        P.extend(inst1)
                        break
            if len(P) == vec_length:
                packs.extend(P)
                Make_unavilable(P, Available_for_pack)
                Traverse_pred_succ(SB, Available_for_pack, packs)
    if All_pred_succ_packed(pred_succ):
        return NO
    else:
        return YES

Algorithm 4.1e. Vectorize Remaining Arithmetic Operations

Pack_Arith(SB, Available_for_pack, Arith, packs):
    for inst in Arith:
        if inst in Available_for_pack:
            vec_length = get_vector_length(inst)
            P = [inst]
            for inst1 in Arith[pos(inst):len(Arith)]:
                if inst_can_pack(P, inst1, Available_for_pack):
                    P.extend(inst1)
                    if len(P) == vec_length:
                        packs.extend(P)
                        Make_unavilable(P, Available_for_pack)
                        Traverse_pred_succ(SB, Available_for_pack, packs)
                        break

Algorithm 4.1: Pseudo-code for the vectorization algorithm. Vectorize is the top level function that starts vectorization. First of all, vectorizable instructions are marked and the starting points, First Stores/Loads, are identified. The Pack_ldst function creates packs from consecutive memory operations and then calls the Traverse_pred_succ function to pack predecessors/successors. After packing all packable predecessors/successors with the Pack_pred_succ function, Pack and Unpack instructions are generated if necessary. Finally, all the remaining arithmetic operations are packed by the Pack_Arith function. "packs" contains the packs created during vectorization. The Set_Arith function marks the remaining arithmetic operations for vectorization. inst_can_pack checks if instructions can be packed together. Make_unavilable marks instructions as unavailable for vectorization, since they are already included in another pack. All_pred_succ_packed returns true if all the predecessors/successors of the current pack have been vectorized; otherwise it returns false.

Once all the stores are packed and their predecessor/successor chains have been followed, we check for remaining load instructions that satisfy the packing conditions and pack them in the same way as stores. The Pack_ldst routine provides the functionality for packing both loads and stores.

Vectorization starting from adjacent loads/stores has an obvious limitation: if a superblock does not have any consecutive loads/stores, nothing can be vectorized. To tackle this problem, after packing all loads/stores and their predecessors/successors, we check whether there are still arithmetic instructions that can be packed together. If so, we vectorize them and follow their predecessor/successor trees (Pack_Arith). This allows us to partially vectorize loops with interleaved memory accesses.

While traversing the predecessor/successor chains, if we find that the predecessors of a pack cannot be vectorized, a Pack instruction is generated. This Pack instruction collects the results of all the predecessors into a single vector register and feeds the current pack. Similarly, if all the successors of a pack cannot be vectorized, an Unpack instruction is generated. This Unpack instruction distributes the result of the pack to the scalar successor instructions. The Traverse_pred_succ routine provides this functionality. For example, in the case of loops with interleaved memory accesses, when we reach several load instructions while traversing the tree, we find that they cannot be packed since they are not consecutive. Therefore, we leave them in scalar form and assemble their results using a Pack instruction.

Moreover, Pack instructions are needed if a pack contains an instruction whose input is a live-in of the superblock. Similarly, Unpack instructions are needed to put the results of a pack into the architectural registers that are live-outs of the superblock.

4.3.2 Avoiding Cyclic Dependences

One important point that must be taken care of during vectorization is that, after creating a pack, two instructions that were previously independent may become dependent. If we packed these instructions into a new pack, there would be a cyclic dependence in the DDG. Figure 4.3 shows an example of this scenario. Figure 4.3a shows the unvectorized code. We start vectorization by packing two consecutive and independent store instructions (I4 and I8). Then, following the predecessor chains, we pack I3 and I7 as well. After this step, I9 becomes dependent on I1, as shown in Figure 4.3b; however, these two instructions were independent in the original scalar code of Figure 4.3a. Therefore, we cannot select them to be packed together, because it would produce a cyclic dependence.

One way to solve the problem of inadvertently packing dependent instructions together is to address it during instruction scheduling and undo one of the packs involved in the cyclic dependence. However, this is not an optimal solution, since the dependence violation may have propagated while traversing the predecessor/successor chains. Therefore, we decided to update the DDG every time we create a new pack. As a result, cyclic dependences never appear in the DDG. This also allows us to check for alternative packing possibilities, whereas if we removed cyclic dependences during instruction scheduling, we could not pack the instructions of dissolved packs with other instructions.
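One way to keep the DDG acyclic is sketched below (with a hypothetical graph representation, not TOL's actual data structures): before creating a new pack, verify that neither candidate instruction can reach the other through the already-updated dependence graph.

# Sketch of the acyclicity check performed before creating a new pack: merging
# two nodes of the DDG is only safe if neither is reachable from the other.
def reachable(ddg, src, dst, visited=None):
    """True if dst can be reached from src following dependence edges."""
    if visited is None:
        visited = set()
    if src == dst:
        return True
    visited.add(src)
    return any(reachable(ddg, nxt, dst, visited)
               for nxt in ddg.get(src, []) if nxt not in visited)

def can_pack_without_cycle(ddg, inst_a, inst_b):
    # Packing a and b merges them into one node; a path in either direction
    # would then become a cycle.
    return not reachable(ddg, inst_a, inst_b) and not reachable(ddg, inst_b, inst_a)

# After packing I4-I8 and I3-I7 in Figure 4.3, I9 transitively depends on I1,
# so I1 and I9 can no longer be packed together.
ddg = {"I1": ["I3/I7"], "I3/I7": ["I4/I8", "I9"], "I9": []}
print(can_pack_without_cycle(ddg, "I1", "I9"))   # False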

4.3.3 Static vs Dynamic Vectorization

Loops are the basic program structures that vectorizers target for extracting parallelism through vectorization. Several loop transformations are sometimes needed to make a loop vectorizable. Transformations like loop distribution, loop interchange, loop peeling, node splitting, memory layout change, algorithm substitution, etc., are generally applied to make a loop vectorizable. These time consuming transformations are better suited to compile time than to runtime. However, compile time vectorization suffers from several limitations: 1) limited vectorization opportunities due to conservative memory disambiguation analysis, 2) the scope of vectorization is limited to basic blocks if the loops cannot be unrolled, e.g. due to complex control flow, and 3) legacy code cannot be vectorized.


a) DDG for the unvectorized code.
b) DDG after vectorizing I4-I8 and I3-I7.

Figure 4.3: Additional dependence after vectorization.

The proposed speculative dynamic vectorization gets rid of all these limitations: 1) the proposed algorithm relaxes the restrictions on memory disambiguation by speculatively reordering/vectorizing ambiguous memory references; 2) since the scope of vectorization for the proposed algorithm is a superblock, it crosses basic block boundaries to vectorize instructions from multiple basic blocks; and 3) since dynamic vectorization is applied at runtime on the program binary and not at the source code level, legacy code can also be vectorized.

Moreover, dynamic vectorization provides some additional benefits. For example, for loops where the number of iterations is not known statically, it is difficult to decide the unroll factor at compile time. The availability of dynamic application behavior at runtime allows the loop unroll factor to be detected dynamically. Unrolling the loops accordingly helps the dynamic vectorizer to extract significant vectorization opportunities. Furthermore, since dynamic vectorization is done at runtime by TOL, the vectorization algorithm can be modified to transparently target a different SIMD accelerator.
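As an illustration of the unroll-factor decision (a simplification; TOL's actual heuristics are not reproduced here), the element width and the host vector length directly give the number of iterations that must be fused so that a full vector register of work is exposed, capped by the trip count observed at runtime (the cap is an assumption of this sketch):

# Sketch of choosing a loop unroll factor at runtime so that one vector
# register's worth of work is exposed per unrolled iteration.
HOST_VECTOR_BITS = 128   # vector width assumed throughout the evaluation

def unroll_factor(element_bits, observed_trip_count):
    """element_bits: width of the data type handled per original iteration."""
    factor = HOST_VECTOR_BITS // element_bits
    # Do not unroll past the number of iterations actually observed at runtime.
    return max(1, min(factor, observed_trip_count))

print(unroll_factor(64, observed_trip_count=1000))   # 2 (doubles, 128-bit SIMD)
print(unroll_factor(32, observed_trip_count=3))      # 3 (capped by trip count)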

4.3.4 Working through an Example

Figure 4.4a shows the DDG for the example code of Figure 4.1b. Since the loop is unrolled once and there are no loop carried dependences (assumed speculatively), the two trees are completely separate. For the sake of simplicity, we do not show the loop control code in this figure. Also, pairs of ambiguous memory reference instructions, like I4 and I6, are speculatively considered independent. As our algorithm begins with consecutive stores, the stores I4 and I10 are packed together, as shown in Figure 4.4b. Moreover, the new store instruction is a speculative one, and I6 is also converted to a speculative load. Following the predecessor tree, we see that I3 and I9 satisfy the packing conditions and vectorize them. Notice that I9 writes to a live-out architectural register. As a result, we have to generate an Unpack instruction to write the result to the live-out register, as shown in Figure 4.4c.

Traversing up the tree, we vectorize the multiplication instructions I1 and I7. One of the inputs of the multiplication instructions is a live-in of the superblock. Hence, we generate a Pack instruction to put the live-in values in a vector register, as shown in Figure 4.4d. As explained earlier, before packing the other predecessors of the additions (I3 and I9), we traverse the tree up through the predecessors of I1 and I7. We discover that the loads I0 and I6 are independent and consecutive, and thus they are packed next. Also, the new vector load instruction is speculative, since I6 was speculative (Figure 4.4e). Finally, Figure 4.4f shows that the second inputs of the additions (I3 and I9), i.e., the two load instructions I2 and I8, are also vectorized. The Pack and Unpack instructions generated to read and write architectural registers in this example can be moved outside the loop as loop invariant code during instruction scheduling, as shown in Figure 4.1c. This way, we are able to vectorize the whole loop.



Figure 4.4: Example of the vectorization of the code of Figure 4.1b. a) shows the DDG for the loop, which is unrolled once; we do not show the loop control code for the sake of simplicity. Since the two iterations are completely independent, we have two completely separate trees. The two arrows coming into I1 and I7 represent live-ins and the arrow going out of I9 represents a live-out of the superblock. Also, we speculatively assume that there is no dependence between the memory instructions unless it is obvious. b) shows the state of the DDG after vectorizing the consecutive stores; the new store instruction is a speculative one. c) Then, we follow the predecessor chains and pack the addsd instructions; since I9 writes to an architectural register, we need to unpack the result and write it to the architectural register. d) packs the two mulsd instructions; since one of the inputs to both of these instructions is a live-in, a Pack instruction is also generated to pack the inputs into a single vector register. e) and f) pack the remaining load instructions, and f) shows the final state.


4.4 Performance Evaluation

As explained in Chapter 3, we use DARCO [86], an infrastructure for evaluating HW/SW co-designed virtual machines, to evaluate our proposals. In our experiments, we assume that the host architecture supports a vector width of 128 bits. Moreover, we consider only floating point operations for vectorization (because most SIMD optimizations tend to focus on them) and no integer operation is vectorized. For this reason, we show only floating-point instructions in the results presented in this section. However, for the performance study we include both floating-point and integer code.

In addition to SPECFP2006 and Physicsbench, we use the UTDSP benchmark suite [13] as well. This benchmark suite consists of array and pointer based versions of several signal processing kernels. Both versions provide identical functionality, the only difference being the use of arrays or pointers to traverse the data structures. Vectorization results for the UTDSP kernels show TOL's effectiveness in vectorizing array and pointer based code. All the benchmarks are executed until completion. Moreover, the SPECFP2006 benchmarks are executed using the "train" input. Furthermore, we choose only the benchmarks that have fewer than 150 billion dynamic instructions, to keep the execution time manageable.

4.4.1 FP Dynamic Instruction Elimination

This section presents the percentage of dynamic instructions eliminated by 1) only the GCC vectorizer, 2) only TOL, and 3) GCC+TOL vectorization, first for the SPECFP2006 and Physicsbench benchmark suites and then for the UTDSP kernels. GCC and TOL represent static and dynamic vectorization respectively. For TOL vectorization, the input binary is compiled by GCC but not vectorized. Hence, the TOL vectorization results also show its effectiveness in vectorizing legacy code, since the input binary is not vectorized for any SIMD accelerator. For the GCC+TOL case, the input binary to TOL is already vectorized by GCC. The results of this case show the vectorization opportunities missed by GCC but captured by TOL.

Benchmarks: For SPECFP2006, on average, the combined GCC+TOL approach eliminates approximately twice as many instructions as static GCC vectorization alone, as shown in Figure 4.5. GCC+TOL vectorization outperforms GCC for all the SPECFP2006 benchmarks except 436.cactusADM and 459.GemsFDTD. GCC completely vectorizes these benchmarks and hence TOL does not get any further vectorization opportunities. Therefore, instruction elimination is the same for GCC and



Figure 4.5: Percentage of Dynamic FP Instructions eliminated by GCC, TOL and GCC+TOL vectorizations.

GCC+TOL. It is also important to note that, on average, dynamic TOL vectorization by itself outperforms static GCC vectorization. Moreover, the only benchmarks where GCC outperforms TOL are again 436.cactusADM and 459.GemsFDTD. The effectiveness of TOL vectorization, to some extent, depends on the quality of the input binary. For example, for 436.cactusADM the input binary to TOL contains a GCC-unrolled version of the hottest loop. This unrolled loop does not fit in a single superblock due to TOL's restriction on the maximum number of instructions in a superblock. Therefore, the TOL vectorizer could not vectorize it as well as GCC. For 459.GemsFDTD, GCC generates significant spill-fill code (to store/retrieve temporary values to/from memory) in the frequently executed loops. This spill-fill code hurts TOL's ability to vectorize this benchmark.

GCC could not vectorize Physicsbench mainly due to the presence of complex control flow in the most frequently executed loops. TOL also is unable to unroll these loops; however, it extracts significant vectorization opportunities through superblock vectorization. Since GCC fails to vectorize anything, GCC+TOL and TOL vectorizations both eliminate 20% of dynamic instruction stream.

Kernels: Table 4.1 shows the vectorization results for the UTDSP kernels. As the table shows, GCC vectorizes the array based versions of FFT, LATNRM, and Matrix Multiplication (MULT), but for the pointer based versions it is able to vectorize only LATNRM. In contrast, TOL performs the same for the array and pointer based versions of all the kernels except IIR. The pointer based version of IIR contains control flow inside the innermost loop and hence TOL fails to vectorize it. Furthermore, once again the combination of static and dynamic vectorization, GCC+TOL, provides the best solution.


For the array based versions, the TOL vectorizer outperforms GCC in vectorizing IIR. GCC is unable to resolve loop carried dependences, whereas speculative vectorization helps TOL provide an instruction reduction of 32%. On the other hand, GCC surpasses TOL vectorization for LATNRM and Matrix Multiplication (MULT). In the current version of the TOL vectorizer, reductions are not implemented. Both LATNRM and MULT have reductions, which TOL fails to vectorize. Moreover, MULT has non-unit stride memory accesses, since only one dimension of the matrix (either row or column) can be accessed in a unit-stride manner. Compilers apply optimizations like “memory layout change”, “data copying”, etc. to convert non-unit stride accesses to unit-stride ones. However, these optimizations are not directly applicable at runtime. This adds to the loss of vectorization opportunities for the TOL vectorizer.
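To make the two patterns concrete, the snippet below gives purely illustrative C++ examples of a reduction and of a non-unit-stride column access; they are assumptions for illustration only and are not taken from the UTDSP sources.

// Illustrative only: a reduction and a non-unit-stride access of the kind
// discussed above; neither is taken from the UTDSP kernels themselves.
float dot(const float* a, const float* b, int n) {
    float sum = 0.0f;
    for (int i = 0; i < n; ++i)
        sum += a[i] * b[i];        // reduction: every iteration accumulates into sum
    return sum;
}

void scale_column(float* m, int rows, int cols, int c, float k) {
    for (int r = 0; r < rows; ++r)
        m[r * cols + c] *= k;      // non-unit stride: consecutive accesses are 'cols' elements apart
}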

Benchmark   Type      GCC       TOL       GCC + TOL
FFT         Array     43.28%    52.70%    43.28%
            Pointer    0.00%    49.87%    49.87%
FIR         Array      0.00%     0.00%     0.00%
            Pointer   -0.08%     0.00%    -0.08%
IIR         Array      0.00%    32.52%    32.52%
            Pointer    0.00%     0.00%     0.00%
LATNRM      Array     23.48%     7.38%    20.44%
            Pointer   19.43%    17.85%    27.76%
LMSFIR      Array      0.00%     0.00%     0.00%
            Pointer    0.00%     0.00%     0.00%
MULT        Array     64.72%    17.62%    64.72%
            Pointer    0.00%    17.62%    17.62%
Avg         Array     21.91%    18.37%    26.83%
            Pointer    3.23%    14.22%    15.86%

Table 4.1: Percentage of dynamic instructions eliminated by GCC, TOL and GCC+TOL vectorizations.

None of the vectorization schemes is able to extract benefit for FIR and LMSFIR, mainly because of the presence of control flow inside the innermost loop. Moreover, in these benchmarks, the number of independent instructions in the basic blocks (and even in superblocks) is not enough to enable vectorization. It is also interesting to note that TOL eliminates 53% of the instructions from the array version of FFT, whereas GCC+TOL eliminates only 43% (as does GCC alone). This is because the input to TOL is already completely vectorized by GCC and TOL does not find any further vectorization opportunities; therefore, the instruction reduction stays at 43% in the GCC+TOL case.

4.4.2 Dynamic FP Instruction Stream Distribution

Figures 4.6 and 4.7 present the dynamic FP instruction stream distribution for SPECFP2006 and Physicsbench, respectively, for the no vectorization, GCC vectorization, TOL vectorization, and GCC + TOL vectorization cases. The results shown are normalized to the no vectorization case. The dynamic FP instruction stream includes: scalar and vector instructions, Pack/Unpack instructions (as described in Section 4.3.1), unvectorizable instructions (e.g., we do not vectorize conversion instructions), and Merge instructions (the instructions needed to merge correct values into live-out vector architectural registers even without vectorization).

Figure 4.6: Dynamic FP instruction stream distribution for SPECFP2006: no vectorization, GCC, TOL and GCC + TOL vectorization normalized to no vectorization.

Figure 4.7: Dynamic FP instruction stream distribution for Physicsbench: no vectorization, GCC, TOL and GCC + TOL vectorization normalized to no vectorization.

For GCC vectorization, the majority of the dynamic instruction stream is composed of scalar instructions. However, for TOL and GCC + TOL vectorizations the percentage of scalar instructions falls to 30% and 28% for SPECFP2006 and to 48% for Physicsbench, respectively. Furthermore, even though scalar instructions form a much smaller part of the vectorized dynamic instruction stream in SPECFP2006 (30% and 28%) than in Physicsbench (48%), the overall dynamic instruction stream for both benchmark suites is reduced by the same amount, almost 20%, by TOL and GCC + TOL vectorizations. The reason lies in the fact that SPECFP2006 benchmarks operate on 64-bit double-precision floating-point variables, whereas Physicsbench benchmarks are composed of 32-bit single-precision floating-point variables. As a result, for a vector length of 128 bits, a single vector instruction in Physicsbench replaces four scalar instructions, whereas in SPECFP2006 a vector instruction replaces only two scalar instructions. Therefore, SPECFP2006 needs more vector instructions than Physicsbench to replace the same number of scalar instructions. This fact is also evident in Figures 4.6 and 4.7, where vector instructions form 32% and 34% of the vectorized instruction stream in SPECFP2006 but only 13% in Physicsbench for TOL and GCC + TOL vectorizations.

In addition, Pack and Unpack instructions also form a moderate fraction of the vectorized dynamic instruction stream. For TOL and GCC + TOL vectorizations, they constitute 15% and 12% of the vectorized dynamic instruction stream for SPECFP2006 and 8% for Physicsbench. Unvectorizable instructions, on the other hand, are negligible in both benchmark suites.

4.4.3 Vectorization Overhead

Vectorization overhead is the fraction of dynamic instruction stream that corresponds to the vectorization of superblocks by TOL. A high vectorization overhead might offset the benefits of the vectorization. We calculate the vectorization overhead as:

\[
\text{Vectorization overhead} = \frac{\text{Total overhead}_{\text{with vectorization}} - \text{Total overhead}_{\text{without vectorization}}}{\text{Total number of dynamic instructions}_{\text{without vectorization}}}
\]
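As a purely hypothetical illustration of this metric (the numbers below are invented, not measured): if a benchmark executes \(10^{11}\) dynamic instructions without vectorization and the total TOL overhead grows from \(1.0\times10^{9}\) to \(1.4\times10^{9}\) instructions once the vectorization pass is enabled, then

\[
\text{Vectorization overhead} = \frac{(1.4 - 1.0)\times10^{9}}{10^{11}} = 0.4\%.
\]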

Our experimental results show that, on average, the vectorization overhead is less than 0.5% of the dynamic instruction stream for all the benchmark suites. Hence, the dynamic vectorization overhead is negligible compared to its benefits. There are two main factors that make the vectorization overhead negligible:

1) Since vectorization is performed at superblock level, the “superblock overhead” is the only overhead component that would increase. Moreover, as shown in Figure 3.12 in Chapter 3, the “superblock creation” overhead accounts for only 14% and 9% of the overall overhead for SPECFP2006 and Physicsbench respectively. Therefore, an increase in this component has minimal effect on the overall overhead.


2) Since vectorization reduces the total number of instructions in a superblock, the optimizations following the vectorization pass, namely instruction scheduling, register allocation, and host code generation, now have fewer instructions to process. Therefore, the overhead of these optimization steps also reduces. As a result, the total increase in the overall overhead is insignificant.

Figure 4.8: Execution speed for GCC, TOL and GCC + TOL vectorized code relative to unvectorized code. Higher is better.

4.4.4 Effectiveness of Memory Speculation

One of the main factors in the success of the proposed vectorization scheme is memory speculation. However, it might backfire if there are many speculation failures. A speculation failure results in executing the unoptimized (and non-TOL-vectorized) version of the code; moreover, if the rate of speculation failures exceeds a predetermined threshold, the superblock is recreated without speculation. Our results show that, on average, we execute more than 99% of the dynamic superblocks in speculation mode. This reflects the fact that the number of speculation failures, and hence the overhead associated with them, is negligible.
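A minimal sketch of how such a failure-rate policy could be tracked per superblock is shown below; it illustrates the idea rather than TOL's actual implementation, and the counters, threshold values, and names are assumptions.

// Hedged sketch: per-superblock bookkeeping for the speculation-failure policy.
// All names and threshold values are illustrative assumptions.
#include <cstdint>

struct SuperblockStats {
    uint64_t executions = 0;            // incremented elsewhere, each time the speculative version runs
    uint64_t speculation_failures = 0;  // detected memory-dependence violations
};

constexpr double   kFailureThreshold = 0.05;  // assumed: 5% failure rate triggers recreation
constexpr uint64_t kMinExecutions    = 100;   // assumed: avoid deciding on too few samples

enum class Action { KeepSpeculative, RecreateWithoutSpeculation };

// Called when the hardware reports a memory-dependence violation for a superblock;
// the failing execution itself falls back to the unoptimized code path.
Action on_speculation_failure(SuperblockStats& s) {
    ++s.speculation_failures;
    if (s.executions >= kMinExecutions &&
        static_cast<double>(s.speculation_failures) / s.executions > kFailureThreshold)
        return Action::RecreateWithoutSpeculation;
    return Action::KeepSpeculative;
}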

4.4.5 Performance

For the performance analysis, both the floating point and integer instructions are considered, even though TOL vectorizes only the floating point code. Figure 4.8 shows the performance of the vectorized code using the different vectorization schemes relative to the unvectorized code, for SPECFP2006 and Physicsbench. The performance results in the figure conform to the results of Figure 4.5 for dynamic instruction elimination. For


SPECFP2006, GCC+TOL vectorization provides twice the performance benefit of GCC alone (10% compared to 5%). Also, TOL vectorization alone provides better performance than GCC alone. It is interesting to note that for 410.bwaves and 433.milc the GCC vectorized code suffers a slowdown even though Figure 4.5 shows dynamic FP instruction elimination. The slowdown comes from the integer code: GCC adds more integer code than the code it vectorizes, and hence suffers a slowdown. Moreover, for these benchmarks GCC+TOL provides worse performance than TOL alone, because GCC+TOL vectorizes the GCC vectorized input containing the extra integer code, whereas TOL vectorizes the unvectorized code.

As GCC fails to vectorize anything in Physicsbench, it does not show any performance improvement. Consistent with the instruction elimination results of Figure 4.5, GCC+TOL and TOL vectorizations provide similar performance benefits for Physicsbench.

An interesting thing to note is that GCC+TOL vectorization, on average, eliminates 20% of the dynamic instruction stream for both SPECFP2006 and Physicsbench (Figure 4.5). However, SPECFP2006 gets more speed up than Physicsbench, as shown in Figure 4.8. This is because the percentage of floating point code is higher in SPECFP2006 than in Physicsbench, as shown in Figure 3.13 in Chapter 3.

Table 4.2 shows the speedup for the UTDSP kernels. These results also conform to the results of Table 4.1. For the pointer based versions of the kernels, GCC loses significant performance compared to the array based versions, whereas the performance of the TOL vectorizer is largely unaffected. Furthermore, the combination of static and dynamic vectorization, GCC+TOL, is able to extract the maximum performance out of the kernels.

Benchmark   Type      GCC    TOL    GCC + TOL
FFT         Array     1.26   1.50   1.26
            Pointer   1.00   1.50   1.50
FIR         Array     1.00   1.00   1.00
            Pointer   1.05   1.00   1.05
IIR         Array     1.00   1.29   1.29
            Pointer   1.00   1.00   1.00
LATNRM      Array     1.39   1.03   1.33
            Pointer   1.31   1.13   1.39
LMSFIR      Array     1.00   1.00   1.00
            Pointer   1.03   1.00   1.03
MULT        Array     2.33   1.07   2.33
            Pointer   1.17   1.23   1.16
Avg         Array     1.33   1.15   1.37
            Pointer   1.09   1.14   1.19

Table 4.2: Execution speed for GCC, TOL, and GCC + TOL vectorized code relative to unvectorized code. Higher is better.


4.5 Related Work

Speculative dynamic vectorization has not been widely explored in the literature. There have only been a few proposals, like Speculative Dynamic Vectorization [83] and Dynamic Vectorization in Trace Processors [112]. None of them is in the context of HW/SW co-designed processors.

A. Pajuelo et al. [83] proposed to speculatively vectorize the dynamic instruction stream in hardware for superscalar architectures. Their scheme prefetches data into the vector registers and speculatively manipulates it through arithmetic instructions. Moreover, the scalar instructions that are converted into vectors are not eliminated but are converted into ‘check’ operations to validate whether the operands used by the corresponding vector instruction were correct or not. Several hardware structures are added to support speculative dynamic vectorization, which is not a power efficient solution, especially in out-of-order superscalar processors where power consumption is already a big issue. They report that more than half of the speculative work is useless due to mispredictions, whereas the rate of speculation failure is negligible in our case. S. Vajapeyam et al. [112] build a large logical instruction window and convert repetitive dynamic instructions from different iterations of a loop into vector form. The whole loop is vectorized if all iterations of the loop have the same control flow.

HW/SW Co-designed processors like Transmeta Crusoe [36], BOA [91], etc. apply several dynamic optimizations at runtime and evaluate their contribution in improving overall performance. Also, software dynamic binary optimizers like Dynamo [21], IA-32 [22], and hardware dynamic binary optimizers like rePLay [85] and PARROT [90] report performance improvements by applying on the fly optimizations. However, none of these systems have proposed vectorization at runtime. Y. Almog et al. [17] briefly point out that one of the optimizations applied in their system is SIMDification. Unfortunately, details of their vectorization scheme are not provided in the paper.

Liquid SIMD [34] decouples the SIMD accelerator implementation from the instruction set of the processor through compiler support and a hardware based dynamic translator. Similarly, Vapor SIMD [82] provides a just-in-time compilation solution for targeting different SIMD architectures. Thus, both solutions eliminate the problem of binary compatibility and software migration. However, both need compiler changes and recompilation. J. Li et al. [69] propose a runtime algorithm for mapping guest vector registers to host vector registers when the guest ISA vector registers support more data types than the host ISA vector registers.


4.6 Conclusion

This chapter proposed to assist the static compiler vectorization with a complementary dynamic vectorization. Static vectorization applies complex and time consuming loop transformations at compile time to vectorize a loop. Subsequently at runtime, dynamic vectorization extracts vectorization opportunities missed by static vectorizer due to conservative memory disambiguation analysis and limited vectorization scope. Furthermore, the chapter proposed a vectorization algorithm that speculatively reorders ambiguous memory references to facilitate vectorization. The hardware, using the existing speculation and recovery support, checks for any memory dependence violation and takes corrective action in that case.

Our experimental results show that the combined static and dynamic vectorization provides twice the performance improvement of static vectorization alone for SPECFP2006. Furthermore, we show that the proposed dynamic vectorization performs as well for pointer based applications as for array based ones, whereas GCC vectorization loses significant opportunities when the source code uses pointers. Moreover, the overhead of runtime vectorization is only 0.5%.


Chapter 5

Vectorizing for Wider Vector Units

Due to their hardware simplicity, SIMD accelerators have evolved in terms of width from 64-bit vectors in Intel's MMX to 512-bit wide vector units in Intel's Xeon Phi. This chapter explores the scalability of SIMD accelerators from the code generation point of view. We explore the potential problems in vectorization at higher vector lengths. Furthermore, we propose Variable Length Vectorization and Selective Writing in a HW/SW co-designed environment to get around these problems.

5.1 Introduction

SIMD accelerators form an integral part of modern microprocessors. Since these accelerators perform the same operation on multiple pieces of data, they just require duplicated functional units and a very simple control mechanism. Due to this hardware simplicity they are relatively easy to scale to higher vector lengths. As a result, SIMD accelerators grow in size with each new generation. For example, Intel's MMX [4] had a vector length of 64 bits, which was increased to 128 bits in the SSE [4] extensions. Intel's recent SIMD extensions AVX [4] and AVX2 [4] support 256-bit wide vectors. Furthermore, Intel Xeon Phi [12] and its visual computing architecture Larrabee [93] perform 512-bit wide vector operations.

Although SIMD accelerators are amenable to scaling from the hardware point of view, generating efficient code for higher vector lengths is not straightforward. There are applications for which compilers just need to unroll loops with a higher unroll factor to fill the wider vector paths. However, there is another category of applications that does not have enough parallelism for vectorization at higher vector lengths. Generating code for these applications for wider vector units becomes a challenge.

In this chapter, we explore the scalability of SIMD accelerators from the code generation point of view. We discover that there are two key factors that thwart the performance of applications at higher vector lengths: reduced dynamic instruction stream coverage for vectorization and huge number of permutation instructions. We propose Variable Length Vectorization and Selective Writing to tackle these problems. Our


experimental results show average dynamic instruction elimination of 25% and speed up of 13% for SPECFP2006, for 512-bit vector length.

The rest of the chapter begins by briefly providing the motivation for the work presented and identifying the key issues in efficient vector code generation for higher vector lengths in Section 5.2. Then, Sections 5.3 and 5.4 provide details of the proposed Variable Length Vectorization and Selective Writing techniques, respectively. Evaluation of the proposals using a set of SPECFP2006 and Physicsbench applications is presented in Section 5.5. Section 5.6 presents related work and Section 5.7 concludes.

5.2 Motivation

Recent trends have shown that vector lengths are going to increase in future microprocessors, since wider vectors provide a simple way of achieving higher FLOPS in an energy efficient manner. Intel's 256-bit AVX and the 512-bit vector length of Xeon Phi and Larrabee are a few examples of these trends. However, it is a challenge to generate efficient code to utilize these wider vector units. To demonstrate this fact, we vectorized the floating point instructions in SPECFP2006 for three different vector lengths of 128, 256, and 512 bits using the speculative dynamic vectorization algorithm described in Chapter 4. Moreover, at a given vector length, all the vector instructions operate only on the maximum vector length and not on a subset of it. For example, in the 512-bit case, all the vector instructions operate on the whole 512 bits and there is no vector instruction that operates only on 256 or 128 bits.

Our results show that there are mainly two problems in vector code generation at higher vector lengths: reduced dynamic instruction stream coverage for vectorization and huge number of permutation instructions.

5.2.1 Reduced Dynamic Instruction Stream Coverage

We define dynamic instruction stream coverage as the number of dynamic scalar instructions vectorized. Figure 5.1 shows the dynamic instruction stream coverage for vectorization at different vector lengths normalized to the 128-bit case. Best, worst and average cases are shown. We divide the applications into two categories: applications in the first category, like 454.calculix, have maximum dynamic instruction stream coverage at all vector lengths. On the contrary, there are applications like 444.namd where the dynamic instruction stream coverage falls by 70% at a vector length of 512 bits.


Figure 5.1: Dynamic FP instruction stream coverage for vectorization at 128, 256 and 512-bit vector lengths normalized to the 128-bit case (best case: 454.calculix; worst case: 444.namd; and average).

The dynamic instruction stream coverage at different vector lengths depends upon the degree of data level parallelism available in the application, or in other words the natural vector length of the application. If an application spends most of its time in loops with high trip counts, it will benefit from higher vector lengths, since the wider vector paths can be filled by unrolling the loops more times depending on the vector length. However, as shown by the average case of Figure 5.1, this is not true for most of the applications. We see an average reduction of 25% and 48% in dynamic instruction stream coverage at 256-bit and 512-bit, respectively. If this trend continues, the coverage is going to be even lower at higher vector lengths.

5.2.2 Number of Permutation Instructions

When the input operands of a vector instruction are not available in a single vector register or are not in the same order as required by the vector instruction, permutation instructions are needed to arrange them in the correct order. Our results show that the number of permutation instructions grows significantly with increasing vector lengths.

Figure 5.2 shows the number of permutation instructions generated per vector instruction in SPECFP2006, normalized to the 128-bit case. As the figure shows, if we generate one permutation instruction for each vector instruction at 128-bit vector length, this number goes as high as 10 for 512-bit vectors in the case of 444.namd. Also, there are applications for which this number does not grow that rapidly. However, the average behavior suggests that the number of permutation instructions is going to be a problem at higher vector lengths.

Both of these factors become a limitation as vector paths become wider: instead of improving, performance starts degrading compared to the lower vector lengths. This chapter investigates both problems and proposes Variable Length Vectorization and


Selective Writing to solve the problems of reduced coverage and permutation instructions, respectively.

Figure 5.2: Number of permutation instructions generated per vector instruction at 128, 256, and 512-bit vector lengths normalized to the 128-bit case (best case: 434.zeusmp; worst case: 444.namd; and average).

5.3 Variable Length Vectorization

Vector instructions in current architectures generally operate on all the elements of the source vector registers and write the whole destination register. For this reason, compilers generate a vector instruction only when there is a sufficient number of independent operations to fill the vector path. When there are not enough instructions to fill up the vector path, all the instructions are left in scalar form. This is going to be an important issue in future microprocessors with wider vector paths, since a lot of otherwise vectorizable code will be left unvectorized. We propose Variable Length Vectorization (VLV), using masked vector instructions, to vectorize the scalar code when it is not possible to fill the vector path entirely.

An important factor to consider here is the need for masking. Masking is used to disable unused vector lanes when a vector instruction does not use all the lanes. In general, not masking the unused lanes might work well for arithmetic instructions from the functionality point of view. However, the register file would contain invalid data, because the whole destination register would be written. Therefore, we would need a way to distinguish between invalid and valid data in the register file, and mixing architectural state and temporary values is typically not a good idea. Moreover, performing unnecessary operations in the unused lanes might also generate exceptions, like divide by zero, so we would need a way to distinguish real and false exceptions. Furthermore, for memory access instructions this might result in crossing array boundaries, leading to page/segmentation


faults. Also, for store instructions it would result in writing incorrect data to the memory. On the other hand, masking the unused lanes helps us get rid of all these problems.

5.3.1 Code Generation

We modify our baseline speculative dynamic vectorization algorithm of Chapter 4 to generate vector code with variable vector length. The modified algorithm starts by vectorizing for the given maximum vector length, which we call the physical vector length. Once all the possible packs for the physical vector length have been created, the vectorizer reduces the logical vector length iteratively. At lower logical vector lengths, packs are created with a smaller number of scalar instructions than required to fill the vector path. The left-out positions in a pack are treated as no-operations (NOPs).

Since the number of operations in a vector instruction varies depending on the logical vector length, we need a way to notify the hardware which vector lanes to enable and which not. We make use of mask registers for this purpose. A mask register has one bit per vector lane: a bit set to one signifies that the corresponding vector lane is to be enabled; zero means otherwise. We include the mask register in the instruction encoding in addition to the regular source and destination registers. The new instruction encoding for vector instructions is shown in Figure 5.3.

opcode dest src1 src2 mask reg unused

Figure 5.3: Instruction format for masked vector instructions.

Figure 5.4 shows a simple vectorization example using the proposed VLV algorithm. Figure 5.4a shows unvectorized code having six independent single-precision floating-point addition (32-bit) instructions. For a vector length of 128-bits, we can pack a maximum of four single-precision floating-point additions in a single vector addition instruction. The algorithm first packs four of the six instructions in a vector instruction and assigns a mask register with all ones to this instruction, as shown in Figure 5.4b. A mask register with all ones signifies that all the vector lanes are to be enabled.

A fixed vector length vectorization algorithm would stop at this point, since there are just two ADDSS instructions left and at least four are required to generate a vector instruction. However, the VLV algorithm continues and packs the remaining two addition instructions as shown in Figure 5.4c. Moreover, a mask register with ones only at the lowest two positions is assigned to this instruction. It makes sure that only the two lower vector lanes are enabled during the execution of this vector instruction.
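The self-contained sketch below illustrates this packing policy for the example above; it is a simplified model of the idea (the names and the single leftover masked pack are illustrative assumptions, not the actual TOL implementation).

// A minimal sketch of Variable Length Vectorization for one group of
// independent, isomorphic scalar operations: full-width packs are built
// first, then a single masked, partially filled pack covers the leftovers
// instead of leaving them scalar. Illustrative only.
#include <cstdio>
#include <vector>

struct Pack {
    unsigned ops;   // scalar operations packed into this vector instruction
    unsigned mask;  // one bit per enabled vector lane
};

std::vector<Pack> vlv_pack(unsigned independent_ops, unsigned physical_lanes) {
    std::vector<Pack> packs;
    while (independent_ops >= physical_lanes) {              // full-width packs first
        packs.push_back({physical_lanes, (1u << physical_lanes) - 1u});
        independent_ops -= physical_lanes;
    }
    if (independent_ops >= 2) {                              // leftovers: one masked pack
        packs.push_back({independent_ops, (1u << independent_ops) - 1u});
        independent_ops = 0;
    }
    return packs;                                            // a single leftover op stays scalar
}

int main() {
    // The example of Figure 5.4: six single-precision adds, four 32-bit lanes.
    for (const Pack& p : vlv_pack(6, 4))
        std::printf("pack of %u ops, mask = 0x%X\n", p.ops, p.mask);
    // Prints: "pack of 4 ops, mask = 0xF" and "pack of 2 ops, mask = 0x3".
}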


a) Unvectorized code: six independent addss instructions.

b) Vectorized for a fixed vector length of 128 bits: four addss packed into one vector instruction with mask 1111; two addss remain scalar.

c) Variable Length Vectorization: the remaining two addss are also packed, with mask 0011.

Figure 5.4: Variable Length Vectorization example.


Figure 5.5: Masked Vector Instruction Execution.

5.3.2 Hardware Requirements

From the hardware perspective, we do not really need real mask registers in the hardware. Since we only need to enable consecutive lower-order vector lanes, the number of lanes to be activated can be encoded directly in the instruction encoding. This also saves the extra instructions otherwise needed to write the masks into registers. It is important to note that traditional vector processors support variable vector length through a vector length register, which needs to be set to the desired vector length before executing vector instructions. However, this is not the optimal solution for processors targeting general purpose applications, where the vector length needs to change frequently. In this scenario, the overhead of writing the vector length register would affect the


performance severely. Therefore, instead of having a variable vector length register we propose to have Variable Length Vectorization using masked vector instructions.

For the execution of a vector instruction, the hardware now reads not only the source registers but also a mask to enable only the required vector lanes. Figure 5.5 shows the execution of the second vector instruction generated in the example vectorization sequence of Figure 5.4. As shown in the figure, only two of the four vector lanes are activated. Not activating all the vector lanes for every vector instruction is also important from the power consumption point of view.
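A behavioural sketch of this masked execution is given below; it is illustrative only (real hardware would gate the disabled lanes rather than branch over them), and the lane count and function name are assumptions.

// Behavioural model of a masked vector add as in Figure 5.5: lanes whose mask
// bit is zero neither compute nor update the destination, so no invalid data
// or spurious exceptions can come from them. Illustrative sketch only.
#include <array>
#include <cstdio>

constexpr int kLanes = 4;                  // assumed: 4 x 32-bit lanes in a 128-bit register
using VecReg = std::array<float, kLanes>;

void masked_addps(VecReg& dest, const VecReg& src1, const VecReg& src2, unsigned mask) {
    for (int lane = 0; lane < kLanes; ++lane)
        if (mask & (1u << lane))           // lane enabled by the mask
            dest[lane] = src1[lane] + src2[lane];
        // disabled lanes keep their previous (architectural) value
}

int main() {
    VecReg a{1, 2, 3, 4}, b{10, 20, 30, 40}, d{0, 0, 0, 0};
    masked_addps(d, a, b, 0x3);            // mask 0011: only the two low lanes execute
    std::printf("%g %g %g %g\n", d[0], d[1], d[2], d[3]);   // prints: 11 22 0 0
}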

Variable Length Vectorization helps in vectorizing applications whose loops have lower iteration counts than required by the vector length, as well as straight-line code with fewer independent scalar operations.

5.4 Selective Writing

This section presents the proposed Selective Writing (SWR) technique to reduce the number of permutation instructions at higher vector lengths. First, we present a technique to eliminate permutation instructions completely if the result of an instruction is read only by one instruction. Then, we present another technique to reduce the number of instructions required to pack N values from N-1 to N/2, if the values to be packed are in N different registers.

5.4.1 Eliminating Permutation using Selective Writing

If the producer instructions of a vector instruction cannot be vectorized, the results of these instructions have to be packed together before feeding the vector instruction. This is due to the fact that the scalar producer instructions write their results to the lowest element of different vector registers, whereas the vector instruction needs them to be in a single vector register and in a particular order.

Figure 5.6a shows a situation where producers of I7 (I0-I3) are not vectorized and their results are packed using a permutation instruction sequence (I4-I6). As shown in the figure, I0 to I3 write their results to the lowest elements of different vector registers. Then a sequence of three instructions, I4 to I6, is used to pack these results in a single vector register xmm3, before feeding it to the vector instruction I7.


I0 addss xmm0, xmm6
I1 addss xmm1, xmm6
I2 mulss xmm2, xmm7
I3 mulss xmm3, xmm7
I4 shufps xmm1, xmm0, imm
I5 shufps xmm3, xmm2, imm
I6 blendps xmm3, xmm1, imm
I7 addps xmm3, [M]

a) Traditional code sequence.

I0 addss vr4, vr0, vr6, imm
I1 addss vr4, vr1, vr6, imm
I2 mulss vr4, vr2, vr7, imm
I3 mulss vr4, vr3, vr7, imm
I4 addps vr5, vr4, [M]

b) Proposed instruction sequence.

Figure 5.6: Packing scalar instruction results for feeding a vector instruction.

If the scalar instructions can write their results to any element of a vector register, instead of always writing to the lowest element, we can get rid of the permutation instructions. This can be done by making the scalar instructions selectively write to the different elements of a vector register in the order they are needed by the vector instruction. This way, we can avoid permutation instructions altogether. This kind of selective writing capability is already available in the memory access instruction set of current architectures. For example, INSERTPS in Intel's SSE can be used to write a 32-bit value loaded from memory to any part of the destination register. We extend this capability to the arithmetic instruction set as well. The proposed instruction format for the scalar arithmetic instructions is shown in Figure 5.7 and the functionality in Figure 5.8.

opcode dest src1 src2 immd unused

Figure 5.7: Proposed instruction format for scalar instructions.

As shown in Figure 5.7, in addition to carrying source and destination register numbers, all scalar arithmetic instructions also carry an immediate that specifies to which element of the destination vector register the scalar result is to be written.
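A simple behavioural model of this selective-writing scalar instruction is sketched below; it is an illustration of the semantics described in Figures 5.7 and 5.8, not an ISA specification, and it assumes the scalar operands are read from the lowest elements of the sources.

// Illustrative model of a selective-writing scalar add (Figures 5.7/5.8):
// the scalar result is placed directly into the destination element chosen
// by the immediate, leaving the other elements untouched. Sketch only.
#include <array>

using VecReg = std::array<float, 4>;   // assumed: 128-bit register, 4 x 32-bit elements

void addss_sel(VecReg& dest, const VecReg& src1, const VecReg& src2,
               unsigned immd /* destination element index */) {
    dest[immd] = src1[0] + src2[0];    // scalar operation on the lowest source elements
}

With such instructions, the four scalar operations of Figure 5.6b deposit their results into the four elements of vr4 directly, so the shufps/blendps sequence of Figure 5.6a is no longer needed.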



Figure 5.8: Functionality of the proposed arithmetic scalar instructions.


Figure 5.9: Percentage of dynamic instructions with one, two, and more consumers.

If scalar instructions have written their results to a single vector register in the order in which they are needed by the vector instruction, the instruction sequence for packing these results is not needed anymore as shown in Figure 5.6b.

The limitation of the SWR scheme is that it works as long as the result of a scalar instruction is consumed by only one instruction. In the case of more than one consumer, we would not get the maximum benefit out of SWR. However, our analysis of SPECFP2006 shows that more than 70% of dynamic instructions have only one consumer, as shown in Figure 5.9.

The proposed scalar instructions can be viewed as an arithmetic operation followed by a shuffle. However, this does not affect the latency of these instructions, since the results can be forwarded as soon as the arithmetic operation is finished. As Figure 5.10 shows, it requires only an additional input to the multiplexers, selecting input operands of the ALUs


from the output of the first vector lane (which performs scalar operations). Consequently, forwarding the results of the first vector lane to any other vector lane provides the functionality of a shuffle operation.

Figure 5.10: Operand forwarding before shuffle (PR denotes a pipeline register).

5.4.2 Reducing Permutation Instruction to Pack N Values

Current architectures provide vector instruction sets where N-1 instructions are required to bring N values into a register. A typical instruction sequence to bring 4 values from different vector registers into a single vector register in the x86 architecture is shown in Figure 5.11a. The first two shuffle instructions bring the values selected by the immediate into registers xmm1 and xmm3, respectively. Then a BLENDPS instruction is used to combine the results from xmm1 and xmm3 into xmm3.

One of the main factors that forces this instruction count to be N-1 is that these instructions write to all the elements of the destination register. If it is possible to write only selected elements of the destination register, then this number can be brought down. In this case, the number of instructions required will depend upon the total number of different registers to be read and the number of registers that can be read by a single permutation instruction. In a case where we need to read N registers and the permutation instruction can read only two registers, we would need N/2 instructions to collect N values into a single register. If more input registers are supported, the number of instructions required can be brought down further. Moreover, we need a mechanism to specify which elements of the source registers are to be read and which elements of the destination register are to be written.


I0 shufps xmm1, xmm0, imm
I1 shufps xmm3, xmm2, imm
I2 blendps xmm3, xmm1, imm

a) x86 instruction sequence.

I0 packps vr6, vr0, vr1, imm
I1 packps vr6, vr2, vr3, imm

b) Proposed instruction sequence.

Figure 5.11: Instruction sequence for packing 4 values from different registers into a single register.

We propose to have a permutation instruction with the instruction format shown in Figure 5.12 and the functionality in Figure 5.13.

The proposed instruction (PACKPS) has two input registers and a 16-bit immediate that tells which elements of the source and destination registers are to be accessed. The first four bits of the immediate [0:3] tell which element of the first source register is to be read, and the next four bits [4:7] tell where it is to be written in the destination. Similarly, bits [8:11] tell which element of the second source register is to be written to the destination element selected by bits [12:15]. Note that PACKPS is very similar to SHUFPS but with a bit more freedom in choosing the source element for each destination element. Therefore, their latencies will be similar.
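The following behavioural sketch models the PACKPS semantics just described (Figures 5.12 and 5.13); it is illustrative only, with the element width and register size assumed to be 32 bits and four elements, and it assumes a well-formed immediate.

// Behavioural model of the proposed PACKPS instruction: the 16-bit immediate
// selects one element from each source and the destination slot for each,
// leaving all other destination elements untouched. Illustrative sketch only.
#include <array>
#include <cstdint>

using VecReg = std::array<float, 4>;   // shown for 4 x 32-bit elements

void packps(VecReg& dest, const VecReg& src1, const VecReg& src2, uint16_t imm) {
    unsigned s1 = (imm >> 0)  & 0xF;   // element of src1 to read
    unsigned d1 = (imm >> 4)  & 0xF;   // destination element for it
    unsigned s2 = (imm >> 8)  & 0xF;   // element of src2 to read
    unsigned d2 = (imm >> 12) & 0xF;   // destination element for it
    dest[d1] = src1[s1];
    dest[d2] = src2[s2];
}

Two such instructions gather four values spread over four different registers (Figure 5.11b), and in general N/2 instructions gather N values held in N different registers.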

opcode dest src1 src2 16-bit imm

Figure 5.12: Instruction format for the proposed permutation instruction (PACKPS).


Figure 5.13: Functionality of the proposed Pack instruction.


Number of registers to be read    256-bit    512-bit
N                                 54%        32%
N-1                               30%        18%
N-2                               12%        20%

Table 5.1: Percentage of permutations requiring N, N-1 and N-2 input registers to pack N values, for 256 and 512-bit vectors.

The instruction sequence that replaces the x86 sequence of Figure 5.11a is shown in Figure 5.11b. In this case, we are able to reduce the number of instructions required to two. For higher vector lengths, where we need to get 8 and 16 values into a register, we need just 4 and 8 instructions, respectively, instead of the 7 and 15 instructions required by the original sequence. The downside of this scheme is that it requires N/2 instructions even if the values to be collected reside in fewer than N registers. However, our experiments show that in SPECFP2006, on average, about 84% and 50% of permutations, for 256-bit and 512-bit vectors respectively, need to read N or N-1 registers to pack N values, as shown in Table 5.1.

Discussion: VLV and SWR are better suited to HW/SW co-designed processors than to traditional microprocessors, mainly for two reasons. First, they require significant ISA changes, since all the vector instructions now carry a mask register and the scalar instructions carry an immediate. This can be achieved in HW/SW co-designed processors transparently to the user/compilers, but not in traditional microprocessors. Second, the VLV algorithm is fairly simple to extend to compilers for loops with static trip counts; however, for loops whose trip count is unknown at compile time it becomes tricky. For a fixed vector length, the compiler can vectorize such loops by unrolling them enough times to fill the vector path and putting a runtime check before the vectorized version to decide whether to execute it or not. However, for variable length vectorization, choosing a single unroll factor becomes difficult at compile time. The runtime information on program behavior in HW/SW co-designed processors makes it straightforward to choose the correct unroll factor.


Figure 5.14: Dynamic instruction stream coverage at the three vector lengths, baseline and with VLV.

5.5 Performance Evaluation

For our experiments, we extended the host architecture to support vector sizes of 128, 256, and 512 bits. As in Chapter 4, we vectorize only the floating-point code and report only the number of floating-point instructions in the results shown in this section. Performance results include both integer and floating-point code.

5.5.1 Dynamic Instruction Stream Coverage

Figure 5.14 shows the dynamic instruction stream coverage for the three vector lengths, first without and then with Variable Length Vectorization (VLV). We have maximum coverage when the number of instructions required to create a pack is minimal, i.e. two instructions. At 128-bit vector length the maximum number of 64-bit double precision operations that can be packed together is two. Therefore, the 128-bit vector length provides maximum coverage, even without VLV, for double precision operations. Since all the SPECFP2006 benchmarks primarily operate on double precision floating point variables, they have maximum coverage at 128 bits, as shown in Figure 5.14. For single precision floating point variables, Variable Length Vectorization helps increase coverage even at 128-bit vector length, as is evident from the figure for the Physicsbench benchmark suite.

For the vector lengths of 256-bit and 512-bits, the benchmarks can be divided into two categories. First, the benchmarks like 454.calculix have maximum, or close to maximum, dynamic instruction stream coverage at higher vector lengths also. The hottest loops of these benchmarks have enough iterations to fill the wider vector paths. Second, the benchmarks like 436.cactusADM, 444.namd, and Physicsbench show drastic reduction


in coverage as the vector length increases, due to the lack of independent instructions to fill the wider paths. These benchmarks either have loops with few iterations or loops with complex control flow. For example, the hottest loops in 410.bwaves iterate four times; therefore, it has maximum coverage at 256-bit vector length, but at 512-bit the coverage drops to zero. Benchmarks in Physicsbench have loops with complex control flow that cannot be unrolled. Moreover, the number of independent instructions in individual superblocks is not enough to fill the vector path. Thus, the dynamic instruction stream coverage reduces severely. Using VLV, we bring the coverage for these benchmarks also to the maximum, as shown in Figure 5.14.

Figure 5.15: Number of permutation instructions per vector instruction, baseline and with SWR.

5.5.2 Permutation Reduction

Figure 5.15 shows the number of permutation instructions per vector instruction required at the three vector lengths, without and with Selective Writing (SWR). Again, we have the same two categories of benchmarks as for the dynamic instruction stream coverage. Benchmarks like 434.zeusmp, 459.GemsFDTD, and Physicsbench have essentially the same number of permutation instructions across all vector lengths. Packing the instructions from the different iterations of unrolled loops avoids the generation of permutation instructions in the case of 434.zeusmp and 459.GemsFDTD. Physicsbench, however, has very few permutations simply because we fail to vectorize anything. On the contrary, 433.milc, 436.cactusADM and 444.namd show an increase in permutation instructions at higher vector lengths. Complex control flow and the lack of loop iterations force us to vectorize straight-line code, which requires a higher number of permutation instructions.


SWR helps in eliminating a significant number of permutation instructions for these benchmarks.

Figure 5.16: Dynamic instruction percentage after baseline and VLV-SWR vectorizations.

Another point to notice in Figure 5.15 is that at 128-bit vector length there is a negligible reduction in permutation instructions. This is because we need to pack two double precision values into a 128-bit register, and for N=2, N/2 and N-1 are the same. Therefore, we do not get much benefit there. However, on average we reduce the number of permutation instructions required by half.

5.5.3 Putting Everything Together

Figure 5.16 shows the percentage of dynamic instructions after vectorization without and with VLV-SWR. As shown in this figure, after applying both optimizations all the applications perform better as the vector length is increased. Applications like 433.milc, 436.cactusADM, 470.lbm, and Physicsbench, which earlier got worse as the vector length increased beyond 128 bits, now perform better. On average, VLV-SWR helps eliminate 9% and 16% more dynamic instructions compared to the baseline vectorization, at 256-bit and 512-bit vector lengths respectively, for SPECFP2006. Overall, vectorization with VLV-SWR reduces the unvectorized dynamic instruction stream by 13%, 22%, and 25% for 128-bit, 256-bit, and 512-bit vector lengths, respectively. For Physicsbench, we eliminate 40% more instructions compared to the baseline vectorization and the unvectorized code at 256-bit and 512-bit vector lengths with VLV-SWR. Baseline vectorization does not find any vectorization opportunity at higher vector lengths for Physicsbench.


As Figure 5.16 shows, the percentage of eliminated instructions is the same for the 256-bit and 512-bit vector lengths in the case of Physicsbench and 410.bwaves. The lack of independent instructions at 512-bit vector length forces VLV to vectorize the code the same way as for 256-bit. However, an important point to notice is that we still get more instruction reduction than in the 128-bit case, which would not have been possible without VLV.

Figure 5.17: Execution time for baseline and VLV-SWR vectorizations normalized to unvectorized code execution time.

5.5.4 Performance

This section presents the performance results. Since we vectorize only the floating point instructions, the overall performance gain depends upon the fraction of floating point instructions in the dynamic instruction stream.

Figure 5.17 shows the percentage of execution time, at the three vector lengths, after vectorization without and with VLV-SWR. On average, VLV-SWR provides 5% and 7% speed up over the baseline vectorization and 11% and 13% over the unvectorized code, for vector lengths of 256-bit and 512-bit respectively, for SPECFP2006. Similarly, for Physicsbench, we get a speed up of 10% with VLV-SWR over both the unvectorized code and the baseline vectorization.

There are several interesting points to note in Figure 5.17. First, even though we have a high dynamic instruction elimination, e.g. 25% for SPECFP2006 at 512-bit vector length, the speed up we get is smaller, 13% for the same configuration. This is because only 40% of the dynamic instructions in SPECFP2006 are floating point, which limits the overall performance gain. Second, although the dynamic instruction reduction is larger for Physicsbench (40% compared to 25% for SPECFP2006 at 512-bit vector length), SPECFP2006 shows more speed up (13% compared to 10% for Physicsbench at 512-bit vector length). This is due to the fact that Physicsbench has a higher percentage of integer instructions than SPECFP2006, as shown in Figure 3.13 in Chapter 3.

5.6 Related Work

The proposal by M. Woh et al. [115] for supporting multiple SIMD widths is the closest to our proposal of Variable Length Vectorization. They proposed a configurable SIMD datapath that can be set up to process wide vectors or multiple narrow vectors. Unfortunately, details of their vectorization algorithm for multiple vector lengths are not provided.

Masked operations have been used in the past for the vectorization of code with control flow. However, we use them in the absence of control flow to increase dynamic instruction stream coverage. J. Smith et al. [98] proposed masked operations as a means of adding support for conditional operations in a vector instruction set. J. Shin et al. [95] incorporated masked operations to vectorize loops with conditional flow in the Superword Level Parallelism approach. Larrabee [93] also uses masked instructions to map scalar if-then-else control structures to the vector processing unit. All of these proposals execute both the if and else clauses and select the correct results based on the values in the mask registers. Our proposal, on the other hand, uses masked operations to increase the dynamic instruction stream coverage when there are not enough instructions to fill the wider vector paths.

A significant amount of work has been done on the optimal generation of permutation instructions due to their obvious effect on performance. However, previous work does not show the effect of permutations at increasing vector lengths. A. Kudriavtsev et al. [60] show the relationship between operation grouping and permutation generation; the ordering of individual operations in SIMD instructions affects the number of permutation instructions required. G. Ren et al. [89] presented an algorithm that converts all the permutations to a generic form. Then, permutations are propagated across statements and redundant permutations are eliminated. These solutions focus on reducing the number of permutations required, whereas our solution reduces the number of instructions for each permutation. L. Huang et al. [47] proposed a method to reduce the number of instructions for one permutation. Their system has a Permutation Vector Register File which provides implicit permutation capabilities. However, the permutation pattern has to be saved beforehand in a permutation register. Moreover, only the values from two consecutive registers can be permuted.


5.7 Conclusion

In this chapter, we showed that different applications have different natural vector lengths. Therefore, widening the SIMD accelerators does not improve the performance of all applications. We identified two main problems hurting the performance of applications with low natural vector lengths on wider SIMD units: reduced dynamic instruction stream coverage and a large number of permutation instructions.

We propose Variable Length Vectorization to increase the instruction stream coverage. This technique creates packs with fewer instructions when there are not enough operations to fill the wider vector path. To reduce the number of permutation instructions, we propose Selective Writing, which enables us to write to only a particular element of a vector register and thereby helps reduce the number of permutation instructions.


Chapter 6

Dynamic Selective Devectorization

Leakage energy is a growing concern in current and future microprocessors. Functional units of microprocessors are responsible for a major fraction of this energy. Therefore, reducing functional unit leakage has received much attention in the recent years. Power gating is one of the most widely used techniques to minimize the leakage energy. Power gating turns off the functional units during the idle periods to reduce the leakage. Therefore, the amount of leakage energy savings is directly proportional to the idle time duration.

This chapter focuses on increasing the idle intervals of the higher SIMD lanes by selectively devectorizing the compiler vectorized code at runtime. The applications are profiled dynamically, in a HW/SW co-designed environment, to find the usage pattern of the higher SIMD lanes. If the higher lanes would need to be turned on only for short periods, the corresponding portion of the code is devectorized to keep the higher lanes off. The devectorized code is executed on the lowest SIMD lane.

6.1 Introduction

Modern microprocessors need to meet the high performance/throughput requirements of the increasingly complex applications. In addition, they have to provide such high performance under a very stringent power envelope. Moreover, the increase in leakage power at sub-nanometer technologies has put further constraints on the power budget. Therefore, it is of prime importance for computer architects to achieve a balance between the energy consumption and performance.

Single Instruction Multiple Data (SIMD) accelerators are incorporated in processors from different computing domains to improve performance, especially for compute intensive data parallel applications [4][35][23][37][52][68][103]. However, due to their wide datapaths, they become a main source of leakage energy for applications lacking data level parallelism. Therefore, it is crucial to control the leakage of these accelerators when they are not being utilized.


Many leakage control techniques have been studied [46][56][110][116], power gating being one of the most prominent. Power gating cuts the supply voltage to idle functional units, resulting in leakage energy savings. The amount of leakage energy saved is directly proportional to the length of the time interval for which the circuit remains idle: the longer the idle interval, the greater the leakage energy saving. Therefore, it is desirable to have long idle intervals to save the maximum leakage energy. However, power gating has an energy and performance overhead associated with it. A certain amount of energy is required to turn a functional unit off and then on again, resulting in an energy overhead. Moreover, a certain number of cycles are required before the functional unit can be used after starting the turn-on procedure, resulting in a performance penalty. It is important to consider two special cases in the power gating context:

1) Small idle intervals during periods of high utilization.
2) Small busy intervals during otherwise idle periods.

In the first case, a functional unit is awakened too early after being turned off. Here, the power gating energy overhead might not be offset by the leakage energy savings, and power gating results in a net energy loss. Due to their obvious adverse effect on the net energy savings, several mechanisms have been studied to avoid such cases [72][120]. In the second case, the functional unit is awakened only for a small period of time before it is turned off again. The power gating benefits can be increased if the functional units can somehow be kept off during these intervals. The gain here is twofold:

1) Since the functional unit is not turned on and then off again, there is no energy overhead.
2) Avoiding turning on the functional unit also saves the performance overhead of power gating.

However, an alternative functional unit is required in order to avoid turning on the power gated (turned off) unit. The work presented in this chapter focuses on reducing these cases to improve the net energy savings.

SIMD accelerators have duplicated functional units/lanes to perform several independent operations in parallel. The lowest SIMD lane executes scalar/unvectorized code, whereas the higher SIMD lanes come into action when the application code is vectorized. In the cases where the higher SIMD lanes are power gated and would need to be turned on only for short periods of time, the corresponding portion of the code can be devectorized and executed on the lowest lane. Thus, the energy and performance overhead of waking the power gated higher SIMD lanes can be saved, resulting in increased net energy savings. However, the portions of the application to be devectorized should be chosen cautiously, as

100

aggressive devectorization might also result in significant slowdown. Moreover, the slowdown might result in a net energy loss due to extra leakage energy incurred in the entire processor.

One way of choosing devectorizable portions of the application is to profile the application offline and then guide the compile time vectorizer to vectorize only specific portions of the application. This method, however, has two major drawbacks. First, the execution profile of an application might change with the input. Thus, when an application is executed with an input other than the one with which it was profiled, the profile guided optimizations will not help. It might even result in slowdown if the frequently executed portions with the current input are not vectorized. Second, the existing code has to be recompiled to get the benefits of the new techniques. HW/SW co-designed processors provide an excellent opportunity to profile and translate/optimize the applications at runtime. Since the profiling is done at runtime, it is not coupled to any particular input.

We propose to extract maximum vectorization opportunities at compile time and then, at runtime, profile the application dynamically to find the candidates for devectorization. Therefore, dynamic selective devectorization discovers and devectorizes only the portions of code that help improve power gating efficiency without having a significant effect on the performance.

The rest of the chapter is organized as follows: Section 6.2 provides background and related work on power gating. Section 6.3 provides the motivation for the work presented in this chapter. Section 6.4 describes the proposals for dynamic profiling and devectorization. Evaluation of the proposals using SPECFP2006 and Physicsbench applications is presented in Section 6.5. Section 6.6 concludes the chapter.

6.2 Background and Related Work

As leakage is becoming a growing concern in current microprocessor designs, several leakage control mechanisms have been studied [46][56][110][116]. All these mechanisms try to reduce leakage when the circuit is in the idle state. Power gating [46] consists of shutting down parts of the circuit by cutting their power supply by means of high threshold header or footer transistors, called sleep transistors. SSGC [56] is similar to power gating, as this technique also cuts the power supply to the circuit; however, it is more effective than power gating in reducing leakage in data-retention circuits. Input vector activation [110] changes the input of the circuit to keep the maximum number of transistors in the off state. As the number of off transistors between the power supply and ground increases, the leakage reduces. Adaptive body biasing techniques [18][116] increase the transistor threshold voltage by applying a reverse bias at the transistor body. The increased threshold voltage reduces the sub-threshold and gate leakages.

Power gating is one of the most commonly used leakage control techniques. There have been several proposals to increase the efficiency of power gating. Hu et al. [46] identified several key intervals in power gating, three of the most important being the idle detect interval, the breakeven threshold, and the wakeup delay. The idle detect interval is the amount of time needed to decide when to shut down a unit; at the end of the idle detect interval, a sleep signal is generated to shut down the functional unit. The breakeven threshold is the amount of time a unit must remain shut down to offset the power gating energy overhead; waking up a unit before this threshold results in a net energy loss. Finally, the wakeup delay is the amount of time needed before the unit can be used after turning it on. Therefore, a higher wakeup delay translates to a higher performance penalty.

Hu et al. also proposed a branch prediction based and a counter based technique to generate the sleep signal. In the branch prediction based technique, the unit is shut down after a branch misprediction is detected, whereas the counter based technique generates the sleep signal after the unit has been idle for a fixed number of cycles. As noted before, if the power gated unit needs to be awakened before crossing the breakeven threshold, power gating suffers a net energy loss. Several techniques have been proposed to minimize this energy loss [15][72][120]. A. Youssef et al. [120] proposed to change the idle detect interval dynamically. Their proposal increases the idle detect interval during periods of high utilization, when the functional units are being used frequently. Since the probability of a unit being awakened before crossing the breakeven threshold is high during these periods, increasing the idle detect interval reduces the number of power gating instances and hence the likelihood of energy loss. On the contrary, they reduce the idle detect interval during phases of low activity to increase the number of powered off cycles and hence the energy savings. A. Lungu et al. [72] proposed to use success monitors to measure the success of power gating during a certain time interval. If power gating saves energy, it is applied in the next interval as well, if possible. Otherwise, power gating is deactivated in the next time interval even if an opportunity exists. K. Agarwal et al. [15] proposed to have multiple sleep modes in power gating. Each mode has a different wakeup delay and energy savings. By trading off these two parameters during periods of different activity, they achieve higher energy savings.
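As a rough illustration of how a counter-based sleep signal can be combined with an interval-based success monitor, the sketch below models both decisions in a few lines. The structure and the constant are our own simplifications for illustration, not the designs published in [46] or [72].

    IDLE_DETECT = 5          # idle cycles before the sleep signal is generated

    class GatingMonitor:
        def __init__(self):
            self.idle_cycles = 0
            self.gating_allowed = True      # re-evaluated once per monitoring interval

        def cycle(self, unit_busy):
            """Return True when the sleep signal should be asserted this cycle."""
            self.idle_cycles = 0 if unit_busy else self.idle_cycles + 1
            return self.gating_allowed and self.idle_cycles >= IDLE_DETECT

        def end_of_interval(self, leakage_saved, gating_overhead):
            # Success monitor: allow gating in the next interval only if it paid off.
            self.gating_allowed = leakage_saved > gating_overhead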

All of these techniques focus on improving power gating efficiency by improving the decision of when to shut down a unit. On the other hand, our work focuses on how to keep a unit shut down for longer time intervals once it is already power gated. Even though we target SIMD accelerators to show the potential of the proposal, it can be applied to any functional unit with multiple instances. To increase the length of the idle periods, the usage of the higher SIMD lanes is profiled dynamically. Then, the portions of the code corresponding to the low utilization periods of the higher lanes are located. This code is then devectorized and executed on the lowest SIMD lane.

6.3 Motivation

Power gating has been used efficiently in the recent past to reduce leakage energy when a functional unit is idle. Power gating cuts off the power supply to the functional unit to shut it down and hence reduce the leakage energy. However, every time a functional unit is shut down and subsequently awakened, there is some energy overhead associated with it. The energy overhead comes from the fact that the sleep signal needs to be generated and distributed to the appropriate functional units. Moreover, turning the sleep transistor on and off also requires energy. Therefore, the net energy savings of power gating can be computed as:

Net Energy Savings = E_L * Σ_{k=0}^{n} off_cycles[k] − n * E_overhead

where E_L is the leakage energy per cycle, E_overhead is the power gating energy overhead per power gating instance, and n is the number of power gating instances. Thus, having long off_cycles with a minimum number of power gating instances (n) results in maximum energy savings. Furthermore, a functional unit cannot be used immediately after putting the power supply back on, resulting in performance loss. Therefore, to get maximum leakage savings at minimum performance penalty, a functional unit needs to be kept shut down for longer time intervals, with a minimum number of power gating instances.
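To make the trade-off concrete, the sketch below evaluates this model for a list of idle-interval lengths; the function name and the parameter values are ours and purely illustrative (the 150-cycle overhead mirrors the breakeven threshold used later in this chapter).

    # Minimal sketch of the net-energy model above; all numbers are illustrative.
    def net_energy_savings(off_cycles, e_leak_per_cycle, e_overhead):
        """off_cycles holds the length (in cycles) of each power-gated interval."""
        n = len(off_cycles)                          # number of power gating instances
        saved = e_leak_per_cycle * sum(off_cycles)   # leakage avoided while gated
        return saved - n * e_overhead                # minus the per-instance overhead

    # Many short intervals can lose energy; fewer, longer intervals save more.
    print(net_energy_savings([40, 30, 50], e_leak_per_cycle=1.0, e_overhead=150.0))  # -330.0
    print(net_energy_savings([600, 900], e_leak_per_cycle=1.0, e_overhead=150.0))    # 1200.0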

The functional unit usage profile of an application changes during its execution. During low utilization periods the functional unit is used only rarely. Therefore, power gating targets these periods for leakage savings. However, every time the functional unit is needed, it has to be awakened from the power gated state and shut down again afterwards. The wakeup and shutdown energy overhead reduces the overall leakage energy savings. If the functional unit is kept turned off and the corresponding code is executed on some other functional unit (which is already on), the effectiveness of power gating in saving leakage energy can be increased. Specifically for SIMD accelerators, the higher SIMD lanes can be switched off during periods of sporadic usage and the corresponding code can be executed on the lowest lane after devectorization.


Figure 6.1: Percentage of vector instructions (excluding memory instructions) in the dynamic instruction stream over time (4 billion instructions) for 434.zeusmp.

We profiled SPECFP2006 to discover the higher SIMD lanes usage pattern. Figure 6.1 shows the percentage of vector instructions (higher lanes usage profile) in the dynamic instruction stream over the execution time for 434.zeusmp. The higher lanes usage profile shown in the figure covers 4 billion instructions executed starting from the most frequently executed function/routine. Moreover, the shown vector instruction profile does not include memory instructions, since they do not use the SIMD functional units. As can be seen in the figure, the higher lane usage profile changes during the execution. During the time intervals A-B, C-D, and E-F, around 20% of the dynamic instructions are vector instructions and utilize the higher SIMD lanes. Therefore, the higher SIMD lanes need to be activated during these intervals. On the other hand, during the time intervals 0-A, B-C, and D-E, less than 3% of the dynamic instructions are vector instructions. During these intervals, power gating will activate the SIMD lanes only for short durations of time to execute these vector instructions.

We propose to devectorize the portions of code corresponding to the time intervals 0-A, B-C, and D-E, provided this does not affect the percentage of vectorized code in the other time intervals. Devectorizing this code results in a smaller number (in some cases none) of vector instructions during these time intervals. Therefore, the number of power gating instances also reduces during these intervals. As a result, the power gating energy overhead diminishes and the net leakage savings increase. However, the dynamic energy consumption of the lowest lane increases, as it now has to execute more instructions. Nevertheless, as will be shown in the performance evaluation section, this increase is relatively small compared to the reduction in the leakage energy.

6.4 Profiling and Devectorization

This section provides the details of the dynamic profiling and devectorization schemes. Profiling is necessary to discover the code segments that can be devectorized to keep the higher vector lanes power gated without affecting the performance. These code segments must not be performance critical, as devectorizing performance critical code would result in excessive slowdown. Moreover, the slowdown caused by devectorization would increase the overall energy consumption. It is important to note that the performance critical code segments of an application might change with the input. Therefore, profiling the applications offline with a particular set of inputs might not help in deciding which code segments to devectorize. For that reason, we choose to profile the applications dynamically at runtime. Dynamic profiling discovers non-performance-critical devectorization candidates and passes this information to the runtime devectorizer. The selected code segments are then devectorized, resulting in effective power gating of the SIMD units.

Figure 6.2: Optimization sequence in superblocks for devectorization support: translation to intermediate representation; loop unrolling; control speculation; SSA; forward pass (constant folding, constant propagation, copy propagation, common subexpression elimination); backward pass (dead code elimination); DDG (redundant load removal, store forwarding, memory alias analysis); devectorization; instruction scheduling; data speculation; register allocation; and code generation.

As stated in Chapter 3, TOL (the software layer of our HW/SW co-designed processor) operates in three translation modes for generating host code from guest x86 code: Interpretation Mode (IM), Basic Block Translation Mode (BBM), and Superblock Translation Mode (SBM). We collect the profiling information for the basic blocks in BBM. This information is then used in SBM, during the superblock optimization phase, to decide whether or not to devectorize a given superblock. Figure 6.2 shows the modified optimization flow in superblocks.


6.4.1 Profiling

In BBM, the application is profiled to obtain the following information:

1) Execution and Branch profiling information:

Software counters are used to count the number of times a basic block has been executed in BBM. In addition, software counters are also employed to capture the biased direction of branches. This information is used to create bigger optimization regions (superblocks) in SBM. Furthermore, we also profile the higher SIMD lanes usage pattern, which helps us in deciding which superblocks to devectorize.

2) Higher SIMD lanes usage pattern:

To decide whether or not to devectorize a superblock, we anticipate whether the higher SIMD lanes will be power gated when the execution reaches that particular superblock. If we expect them to be power gated and the current superblock also has few vector instructions, it is desirable to devectorize the superblock. To anticipate the status (power gated or not) of the higher SIMD lanes, we monitor their usage by means of an N-bit shift register. Before executing an instruction, the contents of this register are shifted by one and the new position is set to 1 if the current instruction is a vector instruction; otherwise it is reset to zero. Therefore, the number of ones in the shift register gives the number of vector instructions executed in the last N instructions.
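A minimal sketch of how such an N-bit usage window could be maintained in the software layer is shown below; the window length and the deque-based representation are our assumptions, made only to illustrate the idea.

    from collections import deque

    class HigherLaneUsageMonitor:
        """Counts vector instructions among the last N retired instructions."""
        def __init__(self, n=128):
            self.window = deque([0] * n, maxlen=n)   # models the N-bit shift register

        def retire(self, is_vector_insn):
            # Shift by one position and set/reset the newest bit.
            self.window.append(1 if is_vector_insn else 0)

        def vector_count(self):
            # Number of ones = vector instructions in the last N instructions.
            return sum(self.window)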

Each basic block in BBM has a software “devec” counter associated with it. Every time a basic block having at least one vector instruction is executed in BBM, the contents of the shift register are read. If the number read is less than a threshold (DVth), it would be desirable to devectorize the basic block if it is later included in a superblock. Devectorization is desirable in this case because a small number of vector instructions indicates low usage of the higher SIMD lanes. Therefore, devectorizing this code helps improve power gating efficiency without a significant impact on the overall performance. To increase the devectorization likelihood of this basic block, the devec counter is incremented. However, if during the next execution of the same basic block the number of ones in the shift register is more than DVth, the devec counter is decremented. This indicates that devectorization is not favored due to higher utilization of the higher SIMD lanes. Therefore, the final decision of whether or not to devectorize the basic block depends on the shift register values just before all the executions of the basic block in BBM. This helps in devectorizing only the basic blocks that execute during low usage phases of the higher SIMD lanes, like B-C in Figure 6.1.


While creating a superblock, the devec counters of all the basic blocks included in the superblock are examined. If all the counters are greater than a predetermined threshold, the superblock is devectorized; otherwise, the superblock is kept in the vectorized form. This selective devectorization of superblocks improves the leakage energy savings through power gating while maintaining the performance.
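The sketch below summarizes the profiling policy described in the last two paragraphs; DVth, the superblock threshold, and the per-block counter field are illustrative names and values chosen for this sketch, not the exact implementation.

    DVTH = 16           # assumed vector-instruction threshold for the usage window
    SB_THRESHOLD = 4    # assumed minimum devec counter value for devectorization

    def on_basic_block_executed(bb, usage_monitor):
        """Called in BBM for each executed basic block containing a vector instruction."""
        if usage_monitor.vector_count() < DVTH:
            bb.devec_counter += 1    # low higher-lane usage: favor devectorization
        else:
            bb.devec_counter -= 1    # high usage: devectorization would hurt

    def should_devectorize(superblock):
        # A superblock is devectorized only if every member basic block agrees.
        return all(bb.devec_counter > SB_THRESHOLD for bb in superblock.basic_blocks)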

6.4.2 Devectorization

Once a superblock has been identified for devectorization through profiling, it goes through a devectorization phase. The devectorization pass simply replaces vector instructions by their corresponding scalar instructions and generates permutation instructions if required. Moreover, vector memory instructions are not devectorized since they do not use SIMD functional units.

Algorithm 6.1 presents the devectorization algorithm. “devect” is the top level routine that receives the superblock “SB” to be devectorized. The routine goes over all the instructions in the superblock in program order. All the vector instructions (excluding memory access instructions) are candidates for devectorization. The first step in devectorization is to find the devectorization length (get_devec_len), i.e., the number of scalar instructions to be generated for the vector instruction. Then the opcode for the scalar instructions to be generated is obtained (get_scalar_opcode). Next, the “get_scalar_in_reg” routine checks whether the input vector registers of the current instruction have already been mapped to scalar registers. If the producers of the current instruction have already been devectorized, the corresponding input registers are already mapped to the output scalar registers of the scalar producers. However, if the producers cannot be devectorized (the producers being vector memory loads or live-ins of the superblock), an Unpack instruction is generated (generate_Unpack_insn). This Unpack instruction distributes the contents of the input vector register to a set of scalar registers, depending on the devectorization length. Once all the input vector registers have been mapped to scalar registers, new output scalar registers are allocated (allocate_reg) for the new scalar instructions to be generated. In the next step, the scalar instructions are generated (generate_insn) using the scalar input and output registers collected during the earlier steps. The vector output register of the current instruction is mapped to the newly allocated scalar output registers (add_to_mapped_reg). Finally, if the output register is an architectural register or the consumers of the current instruction cannot be devectorized (vector memory stores), a Pack instruction is generated (generate_Pack_insn). The Pack instruction collects the values from the scalar output registers and packs them into a new vector register so that it can be used by the vectorized consumers.


Algorithm 6.1a. Top Level Dynamic Devectorization Routine

devect(SB):
    for each instruction s in SB:
        if s is devectorizable:
            devec_len      ← get_devec_len(s)
            scalar_op      ← get_scalar_opcode(s)
            scalar_in_regs ← get_scalar_in_reg(s)

            scalar_out_reg ← ø
            for i ← 0 to devec_len do:
                scalar_out_reg ← scalar_out_reg ⋃ allocate_reg()

            for i ← 0 to devec_len do:
                generate_insn(scalar_op, scalar_in_regs, scalar_out_reg)

            add_to_mapped_reg(org_out_reg)

            if org_out_reg is architectural_reg or vectorized_consumer:
                generate_Pack_insn(scalar_out_reg)

Algorithm 6.1b. Vector to Scalar Register Mapping

get_scalar_in_reg(s):
    scalar_in_regs ← ø
    for each input_register ireg of s:
        if ireg in mapped_regs:
            scalar_in_regs ← scalar_in_regs ⋃ get_mapped_reg(ireg)
        else:
            generate_Unpack_insn(ireg)
            scalar_in_regs ← scalar_in_regs ⋃ get_mapped_reg(ireg)
    return scalar_in_regs

Algorithm 6.1: Dynamic Selective Devectorization algorithm. The devect routine devectorizes the code in a top-down manner, starting with the first instruction in the superblock. get_scalar_in_reg checks if a vector register is already mapped to a set of scalar registers. If it is not, a new Unpack instruction is generated to map it to scalar registers. As devectorization proceeds, the producer-consumer relations keep changing. Thus, it is important to update the predecessor/successor chains; however, this is not shown in the algorithm for the sake of simplicity.
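To make the pass more concrete, the toy trace below rewrites a single 2-wide vector add using a dictionary as the vector-to-scalar register map; the register names and instruction tuples are invented for this example and do not correspond to the actual host ISA encoding.

    # Toy illustration of devectorizing one 2-wide vector add: vadd v2, v1, v0.
    mapped_regs = {}          # vector register -> list of scalar registers
    code = []                 # generated host instructions, as simple tuples

    def unpack(vreg, devec_len):
        scalars = [f"{vreg}_s{i}" for i in range(devec_len)]
        mapped_regs[vreg] = scalars
        code.append(("UNPACK", vreg, scalars))

    unpack("v1", 2)           # v1 comes from a vector load, which is not devectorized
    unpack("v0", 2)           # v0 is a live-in of the superblock

    outs = ["v2_s0", "v2_s1"]
    for i in range(2):        # one scalar add per vector element, on the lowest lane
        code.append(("ADD", outs[i], mapped_regs["v1"][i], mapped_regs["v0"][i]))
    mapped_regs["v2"] = outs

    code.append(("PACK", "v2", outs))   # v2 is architectural or feeds a vector consumer

    for insn in code:
        print(insn)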

6.4.3 Reducing Devectorization Slowdown

Dynamic selective devectorization serializes the parallel portions of code to save energy at a small performance cost. To reduce the effect of this serialization on the performance, we do partial devectorization whenever possible. To better understand partial devectorization, consider a SIMD accelerator with two 64-bit wide lanes. Each lane can execute either one 64-bit double-precision floating-point operation or two 32-bit single-precision floating-point operations. Devectorized code is executed on the lower lane, so that the higher lane can be switched off.

In general, a single-precision floating-point vector instruction would be devectorized into four single-precision scalar instructions. However, partial devectorization generates only two single-precision “half-vector” instructions. A “half-vector” instruction combines two scalar instructions that can be executed in parallel. The rationale behind partial devectorization is to utilize the whole 64-bit wide vector lane. Since one vector lane can execute two single-precision operations, it is better to partially devectorize the code instead of devectorizing it fully. As a result, the effect of devectorization on performance is reduced while still saving energy by power gating the higher lane. We propose to have “half-vector” instructions in the host processor ISA. However, these instructions are transparent to the compiler/user and are generated dynamically by the runtime devectorizer. The co-designed nature of the host processor allows including new instructions without any compiler changes or recompilation.
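The sketch below contrasts full and partial devectorization of a 4-element single-precision vector operation on the assumed two-lane, 64-bit-per-lane accelerator; the mnemonic strings are purely illustrative and not real host instructions.

    def lower_sp_vector_op(op, elems=4, lane_bits=64, elem_bits=32):
        """Return (full, partial) devectorizations of a single-precision vector op."""
        per_half_vector = lane_bits // elem_bits                   # 2 SP elements per lane
        full = [f"{op}.scalar e{i}" for i in range(elems)]         # 4 scalar instructions
        partial = [f"{op}.halfvec e{i}:e{i + per_half_vector - 1}"
                   for i in range(0, elems, per_half_vector)]      # 2 half-vector instructions
        return full, partial

    full, partial = lower_sp_vector_op("fadd")
    print(full)      # ['fadd.scalar e0', 'fadd.scalar e1', 'fadd.scalar e2', 'fadd.scalar e3']
    print(partial)   # ['fadd.halfvec e0:e1', 'fadd.halfvec e2:e3']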

6.5 Performance Evaluation

To evaluate the proposals, we implemented the proposed profiling and devectorization algorithm in the software layer (TOL) of DARCO. Furthermore, for energy consumption analysis McPAT [70] is integrated with DARCO. The key McPAT parameters are shown in Table 6.1. Moreover, we consider only the floating point instructions for devectorization because they are the main target of SIMD accelerators. In our experiments, we assume that the host architecture consists of a 128-bit wide SIMD accelerator. Moreover, we consider that the SIMD accelerator is composed of two 64-bit wide lanes.

6.5.1 Baseline

From a power gating point of view, the SIMD accelerator can be viewed as a single unit or as two separate lanes. In other words, both lanes of the SIMD accelerator can be powered together or separately. If both lanes are power gated together, we call it Combined Power Gating (CPG). CPG, however, is not efficient, since the higher lane is generally used less than the lower lane. Therefore, power gating the higher lane, even while the lower lane is functional, results in more power savings. We call this configuration Split Power Gating (SPG). Our proposal of Dynamic Selective Devectorization also assumes that the SIMD lanes can be power gated individually. We compare our results with both configurations (CPG and SPG). Furthermore, the results presented are for the modeled host processors and include profiling and translation overheads. Only the floating point benchmarks in SPEC2006 are considered for evaluation, since floating point code is the main target of our proposals.

Parameter       Value
Technology      65nm
Clock Rate      1.5 GHz
Temperature     350 K
Device Type     High Performance

Table 6.1: McPAT Parameters.

6.5.2 Models and Parameters

To measure the success of the proposals, we refer to the power gating energy model proposed by Hu et al. [46]. However, we changed some of the model input values. Their breakeven threshold value is between 9 and 24 cycles. However, as A. Youssef et al. [120] explained, the breakeven threshold value in real implementations can be more than 100 cycles. We use a breakeven threshold of 150 cycles. The wakeup latency of the functional units is considered to be 10 cycles. Moreover, we later show a sensitivity study for breakeven threshold and wakeup delay variations.

A. Lungu et al. [72] proposed a success monitor based improvement to the time-based power gating mechanism of [46]. They use success counters to monitor whether power gating has been successful (saved energy) or harmful (wasted energy) during a monitoring interval. Power gating in the next monitoring interval is disabled if it has been harmful in the current interval; otherwise, it is enabled. This power gating scheme with success monitors serves as the baseline for our proposals. A. Lungu et al. [72] use a fixed idle detect interval of 5 cycles (or 15 cycles as an alternative) in their proposal. However, this interval is varied dynamically in our baseline, depending on the utilization of the functional units (SIMD lanes), as proposed by A. Youssef et al. [120]. We assume a minimum idle detect interval of 5 cycles; however, there is no maximum limit. Moreover, the idle detect interval is varied in steps of 5 cycles. Furthermore, we consider power gating of only the SIMD accelerator in our experiments.
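A minimal sketch of the dynamically varied idle detect interval used in our baseline is given below (minimum of 5 cycles, adjusted in steps of 5, no upper bound); the utilization test and its watermark are assumptions made only for illustration.

    MIN_IDLE_DETECT = 5   # cycles
    STEP = 5              # adjustment granularity, in cycles

    def adjust_idle_detect(idle_detect, lane_utilization, high_watermark=0.5):
        # Busy phases: lengthen the interval so the lane is gated less eagerly and
        # premature shutdowns are avoided. Quiet phases: shorten it (down to the
        # minimum) to gate sooner and accumulate more off cycles.
        if lane_utilization > high_watermark:
            return idle_detect + STEP
        return max(MIN_IDLE_DETECT, idle_detect - STEP)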


Figure 6.3: Percentage of vector instructions in the dynamic instruction stream before and after dynamic selective devectorization for 434.zeusmp: a) before DSD; b) after DSD.

6.5.3 Higher SIMD Lane Usage Profile

The dynamic selective devectorization (DSD) technique tries to minimize the usage of the higher SIMD lane during low utilization periods. As shown in Figure 6.1 in Section 6.3 (reproduced here as Figure 6.3a), 434.zeusmp has several time intervals during which the higher SIMD lane usage could be minimized. Minimizing the higher lane usage during these intervals minimizes the number of power gating instances and hence the energy overhead of power gating.

Figure 6.3b shows the vector instruction profile for the same benchmark after dynamic selective devectorization. As the figure shows, dynamic selective devectorization has been able to reduce the higher SIMD lane usage significantly during the time interval B-C. Therefore, the energy savings by power gating during this interval will improve. However, the vector code corresponding to the low usage periods 0-A and D-E is not devectorized. This piece of code is also executed during the high usage periods, and its devectorization would result in significant performance loss. Therefore, this code is always executed in its vectorized version. Moreover, it is also important to note that the number of vector instructions during the high usage periods A-B, C-D, and E-F is the same as before devectorization. Therefore, the effect of devectorization on the performance is going to be negligible.

6.5.4 SIMD Accelerator Energy Savings

The proposed mechanism reduces the number of higher SIMD lane power gating instances to reduce the power gating energy overhead and, in turn, the overall leakage of the SIMD accelerator. However, dynamic selective devectorization has an energy and performance overhead associated with it. The energy overhead of DSD includes the following components:


1) Lower SIMD Lane Dynamic energy: The dynamic energy consumption of the lower SIMD lane increases, since it has to execute more instructions.

2) Rest of the core Energy: The rest of the core includes all the components of the core except for the SIMD accelerator. The dynamic and leakage energy of the rest of the core may increase due to:

a. Dynamic energy consumption increases due to profiling and devectorization of selected superblocks.
b. Leakage energy of the rest of the core might increase due to the possible slowdown because of devectorization.

Figures 6.4 and 6.5 show the SIMD accelerator energy savings for Combined Power Gating (CPG), Split Power Gating (SPG), and DSD normalized to no power gating, without and with DSD overheads respectively. There are several important points to note in these two figures. First of all, the energy overhead of DSD is minimal, as most of the benchmarks show similar energy savings with and without considering the DSD energy overhead. The only exception is 410.bwaves, and the reason behind it is explained in Section 6.5.6 while discussing the performance results. Since the energy savings are similar with and without considering the energy overhead of DSD, the rest of this section focuses on the results with overhead (Figure 6.5). As this figure shows, DSD outperforms both CPG and SPG significantly. The overall energy savings of the proposed technique are 49% and 35% greater than CPG and 15% and 12% greater than SPG for SPECFP2006 and Physicsbench respectively. In absolute terms, DSD saves 63% and 72% overall energy, SPG saves 54% and 64%, whereas CPG saves 42% and 53% for SPECFP2006 and Physicsbench respectively. CPG performs worse than SPG because it treats the whole SIMD accelerator as a single unit: either both lanes are powered or neither of them is. On the other hand, SPG can turn the higher lane off even if the lower lane is in use. Therefore, SPG saves more energy than CPG. DSD goes one step further and keeps the higher lane powered off (because of devectorized code) for longer periods, and thus outperforms SPG as well.

The benchmarks in Figure 6.5 can be divided into three categories depending on their energy saving pattern:

1) Moderately Vectorizable Benchmarks

The benchmarks in this category include 410.bwaves, 434.zeusmp, 435.gromacs, 454.calculix, 482.sphinx3 and most of the Physicsbench benchmarks. These are the benchmarks for which compilers are able to extract enough vector parallelism; however, they are not completely vectorized. Therefore, during periods of high lower lane usage with an idle higher lane, SPG achieves energy savings over CPG by power gating only the higher lane. Moreover, these benchmarks have periods of low higher lane activity, as shown in Figure 6.1 for 434.zeusmp. The proposed mechanism devectorizes the code corresponding to these intervals and achieves even more energy savings.

Figure 6.4: SIMD accelerator overall (dynamic + leakage) energy savings for CPG, SPG and DSD without including DSD energy overhead.

Figure 6.5: SIMD accelerator overall (dynamic + leakage) energy savings for CPG, SPG and DSD including DSD energy overhead.


Figure 6.6: Percentage of vector instructions (excluding memory instructions) in the dynamic instruction stream for 470.lbm.

2) Highly Vectorizable Benchmarks

The benchmarks in this category are 436.cactusADM and 470.lbm. These benchmarks are completely vectorizable. In other words, the vectorized code uses either both vector lanes or neither of them. As a result, SPG does not provide any additional benefit over CPG. Moreover, the higher lane utilization in these benchmarks is uniform over the execution time, as shown in Figure 6.6 for 470.lbm. Any attempt at devectorization would result in significant performance loss. Therefore, these benchmarks are executed in the vectorized form and no additional leakage energy savings are achieved by DSD either. Furthermore, the energy savings for 436.cactusADM are much higher than for 470.lbm for all three techniques. The energy savings depend on how long the SIMD accelerator is used during the execution time of the application. Even though both benchmarks use both SIMD lanes together, the overall usage of the SIMD accelerator is lower in 436.cactusADM; hence, power gating provides more energy savings in this benchmark.

3) Unvectorizable Benchmarks

The benchmarks in this category include 444.namd, 450.soplex, etc. Since compilers do not find enough vectorization opportunities in these benchmarks, the higher SIMD lane is idle for most of the time. As a result, SPG is able to attain significant energy savings over CPG by power gating the higher SIMD lane alone. However, the proposed mechanism does not have enough opportunities to devectorize because compilers do not vectorize the code. Therefore, DSD does not provide much energy savings over SPG.

It is important to note that most of the benchmarks fall in the first category, “Moderately Vectorizable Benchmarks”, which is the one targeted by DSD to achieve additional power savings over power gating. Another interesting point to note in Figure 6.5 is that in the cases where DSD is not able to reduce leakage, e.g. 470.lbm, the energy overhead of DSD is negligible. Hence, DSD has an insignificant energy penalty when it fails to provide leakage benefits.

Figure 6.7: Core overall energy distribution at different technologies.

6.5.5 Overall Energy Savings

This section first presents the ratio of the SIMD accelerator leakage energy to the rest of the energy (SIMD accelerator dynamic energy + rest of the core overall energy) and then presents the core level overall energy savings of CPG, SPG and DSD. Figure 6.7 shows the energy distribution for five different technologies: 90nm, 65nm, 45nm, 32nm and 22nm. As the figure shows, the SIMD accelerator leakage energy accounts for 20% to 30% of the overall core energy at the various technologies. It is also interesting to note that the SIMD leakage energy increases as we move from 90nm to 45nm. However, it reduces as the technology is further scaled down to 22nm. The leakage reduction comes from enhancements in the fabrication process below 45nm. Nonetheless, SIMD leakage energy still forms a significant portion of the overall core energy.

C. Bira et al. [25] showed that, according to the Zedboard documentation, a dual-core ARM CPU consumes a maximum of 1.25 Watts. They also reported that, according to Xilinx power estimation tools, the SIMD accelerator consumes 600 mW. This translates to the SIMD accelerator being responsible for approximately half of the CPU power. Assuming that leakage is responsible for 40-50% of the total power, the SIMD accelerator leakage is responsible for 20%-25% of the total CPU power. This estimate is consistent with the results of Figure 6.7.
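As a quick back-of-the-envelope check of these figures, using only the numbers cited above:

    600 mW / 1250 mW ≈ 0.48,   and   0.48 × (0.40 to 0.50) ≈ 0.19 to 0.24,

i.e., the SIMD accelerator leakage amounts to roughly 20%-25% of the total CPU power.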

Figure 6.8 shows the overall energy savings of the whole core for CPG, SPG and DSD. Since we consider power gating only the SIMD accelerator and no other functional unit, the absolute overall energy savings are not as high as for the SIMD accelerator alone. However, DSD still outperforms both SPG and CPG. DSD energy savings are 48% and 35% greater than CPG and 15% and 12% greater than SPG for SPECFP2006 and Physicsbench respectively. In absolute terms, DSD saves approximately 19% and 22% overall energy, SPG saves 16% and 19%, while CPG saves 13% and 16% for SPECFP2006 and Physicsbench. As the results show, DSD saves comparatively more overall core energy than CPG and SPG even when the SIMD accelerator is the only power gated functional unit.

Figure 6.8: Core overall (dynamic + leakage) energy savings for CPG, SPG and DSD.

6.5.6 Performance

As mentioned earlier, power gating has both energy and performance overhead associated with it. The performance overhead arises because the functional unit cannot be used immediately after sending the wakeup signal. Moreover, the performance penalty has to be paid every time the functional unit is awakened from the power gated state.

Reducing the number of power gating instances, using DSD, reduces both the energy and performance overhead of power gating. However, DSD also has its own performance overhead. This overhead arises because the lower SIMD lane has to execute more scalar instructions. Furthermore, profiling and devectorization of the selected superblocks also diminish performance.

In summary, DSD on one hand reduces the power gating performance overhead; on the other hand, it adds its own overhead. Therefore, the overall performance depends on the following factors:

1) Speedup, due to a smaller number of power gating instances.
2) Slowdown, due to a larger number of scalar instructions.
3) Slowdown, due to profiling and devectorization overhead.

Figure 6.9: Overall performance after DSD normalized to SPG (higher is better).

Figure 6.9 shows the performance results after considering all these factors. The results are normalized to SPG performance. As the figure shows, on average DSD experiences a slowdown of less than 1% for Physicsbench. Moreover, for SPECFP2006 the performance is very similar to SPG performance. It is also interesting to note that there are benchmarks like 433.milc, 450.soplex, 453.povray, 482.sphinx3, etc. that experience a small speedup. The speedup comes from having fewer power gating instances and hence a smaller performance overhead of power gating. The performance increase also translates to a reduction in the leakage energy of the core, because it is now on for less time. On the other hand, 410.bwaves suffers a slowdown of 6% due to excessive devectorization, as shown in Figures 6.10a and 6.10b. The two figures show the vector instruction profiles before and after devectorization respectively. The excessive devectorization affects not only the performance but also the energy savings. Due to the slowdown, the leakage energy in the rest of the core increases, and hence the net energy savings reduce. The energy savings for 410.bwaves are approximately 50% without the energy overheads of DSD, as shown in Figure 6.4; however, after considering the energy overheads they fall to 38%, as shown in Figure 6.5. Therefore, DSD provides a trade-off between performance and energy.


Figure 6.10: Percentage of vector instructions (excluding memory instructions) in the dynamic instruction stream for 410.bwaves before and after DSD: a) before DSD; b) after DSD.

6.5.7 Sensitivity Analysis

As mentioned in Section 6.5.2, we assumed a breakeven threshold of 150 cycles and a wakeup delay of 10 cycles in our experiments. These two parameters are technology dependent, and to discover the effect of variations in their values we perform a sensitivity study. For this study, we first vary the breakeven threshold from 20 cycles to 300 cycles while keeping the wakeup delay at 10 cycles. Next, we vary the wakeup delay from 5 to 35 cycles while keeping the breakeven threshold at 150 cycles.

1) Breakeven threshold sensitivity study

Figure 6.11 shows the effect of breakeven threshold variations on the overall energy savings of DSD over SPG. As the figure shows, the overall energy savings of DSD increase with the breakeven threshold for SPECFP2006. However, for Physicsbench the increase is less significant. As mentioned in Section 6.3, one of the components of DSD energy savings is directly proportional to the breakeven threshold. Therefore, one would expect more energy savings as the breakeven threshold is increased. The reason for the minimal improvement in the overall energy savings is the use of success monitors and the dynamic idle detect interval. If we disable these two improvements, we get more energy savings as the breakeven threshold is increased, as shown in Figure 6.12. The figure shows the energy benefits of DSD over SPG normalized to the savings corresponding to a breakeven threshold of 20 cycles. As the figure shows, the energy savings of DSD over SPG increase as the breakeven threshold increases from 20 cycles to higher values.

Figure 6.11: Effect of breakeven threshold variation on DSD overall (dynamic + leakage) energy savings over SPG with a fixed wakeup latency of 10 cycles.

Figure 6.12: Effect of breakeven threshold variation on DSD overall (dynamic + leakage) energy savings over SPG normalized to a breakeven threshold of 20 cycles, with a fixed wakeup latency of 10 cycles (no success monitors, no dynamic idle detect interval).

2) Wakeup Delay sensitivity study

Figure 6.13 shows the effect of wakeup delay variation on the overall energy savings of DSD over SPG. As with the breakeven threshold variation results, these results are consistent over the range of wakeup delay values. Furthermore, these results are with success monitors and dynamic idle detect interval enabled. Disabling these two features shows an improvement in DSD overall energy savings as the wakeup delay increases.

Figure 6.13: Effect of wakeup delay variation on DSD overall (dynamic + leakage) energy savings over SPG with a fixed breakeven threshold of 150 cycles.

Discussion

As leakage energy is becoming a growing concern in present day microprocessors, several leakage control mechanisms have been studied. All these techniques work by reducing the leakage energy during the periods when the functional unit is not being used. Therefore, the leakage energy savings of these techniques depend on the duration of the idle intervals of the functional units. This chapter presented a technique called Dynamic Selective Devectorization to increase the idle durations of the higher SIMD lanes. The increase in idle intervals translates into an increase in the leakage energy savings.

In our experiments, we consider power gating as the leakage control mechanism implemented in the hardware. However, our proposal of dynamic selective devectorization does not restrict the choice of leakage control mechanism to power gating. DSD will work with any other leakage control mechanism equally well. The basic idea of DSD is to increase the idle intervals of the functional units independent of the leakage control mechanism.

We presented a mechanism to increase the idle periods of the higher SIMD lanes to save more leakage energy. DSD devectorizes certain portions of the code to reduce the higher SIMD lanes utilization during low usage periods. Even though the work in this chapter focuses on the higher SIMD lanes, the basic concept can be extended to any functional unit. The only requirement is to have more than one instance of the functional unit. For example, if we have two integer units, the idle interval of the second one could be increased by executing more code on the first one. This, however, is helpful only during the low utilization periods of the second unit, to limit the performance penalty of serialization. In the case of the SIMD accelerator, a dynamic profiler guides the devectorizer to decide which segments of code to serialize. However, in the case of integer units, the dynamic profiler would need to guide the instruction scheduler to make the serialization decisions.

6.6 Conclusion

This chapter proposed to increase the leakage energy savings by increasing the idle intervals of the higher SIMD lanes. To increase the idle intervals, we proposed a dynamic profiling based dynamic selective devectorization scheme. The dynamic profiler monitors the higher SIMD lanes usage and discovers the code corresponding to the low utilization periods. A dynamic devectorizer then selectively devectorizes the code based upon the inputs from the profiler. The dynamic selective devectorization increases the idle interval during the low utilization periods of the higher lanes. The increase in the idle period helps the leakage control mechanism to save more energy. The proposed mechanism can work with any leakage control mechanism, like power gating, SSGC, etc. Moreover, the idea of increasing the idle period is general enough to be extended to other functional units as well.

Our experimental results show average SIMD accelerator energy savings of 15% and 12% relative to power gating, for SPECFP2006 and Physicsbench respectively. Moreover, the slowdown caused by devectorization is less than 1%.



Chapter 7

Conclusions

This chapter concludes the thesis by first summarizing the challenges in SIMD execution and our proposed solutions. Then it presents future directions that can be followed to make SIMD execution even more efficient.

7.1 Conclusions

SIMD accelerators are one of the most efficient ways of improving compute power in an energy efficient manner. Even though they are simple from a hardware design perspective, code generation for them has always been challenging. In this thesis, we made several proposals for optimizing code generation for SIMD accelerators. Furthermore, we also proposed a way of reducing leakage energy when SIMD accelerators are not being utilized to their full potential.

Speculative Dynamic Vectorization. Chapter 4 showed that compile time vectorization loses significant vectorization opportunities due to conservative memory disambiguation analysis. We proposed a dynamic vectorizer to assist the static compile time vectorization. In the proposed mechanism, the compiler first vectorizes the code after applying complex loop transformations which are too costly at runtime. Later, during program execution, a dynamic vectorizer catches the vectorization opportunities missed by the compiler. The dynamic vectorizer speculatively reorders and vectorizes ambiguous memory references. The hardware checks for any memory order violation possibly caused by the speculative vectorization and takes corrective action.

The experimental results show that the combination of static and dynamic vectorization discovers twice as many vectorization opportunities as static vectorization alone. Moreover, dynamic vectorization alone is able to outperform static vectorization. Furthermore, dynamic vectorization vectorizes array and pointer based applications equally well, whereas static vectorization loses significant vectorization opportunities for pointer based applications.


Vectorizing for Wider Vector Units. Even though SIMD accelerators are very amenable to scaling due to their duplicated functional unit structure, code generation for wider SIMD units is not straightforward. We discovered that the two major problems in vector code generation at higher vector lengths are: 1) reduced dynamic instruction stream coverage for vectorization and 2) a large number of permutation instructions. We proposed Variable Length Vectorization and Selective Writing to get around these two problems.

Variable Length Vectorization starts by vectorizing for the maximum vector length and then iteratively reduces the logical vector length to vectorize as much as possible. Code vectorized for a lower logical vector length is executed on the SIMD accelerator by masking the unused vector lanes. Selective Writing enables scalar instructions to write to any element of the vector register instead of always writing to the lowermost element. Therefore, the scalar instructions write their results into the vector register in the order required by the subsequent vector instructions. Since the results in the vector register are already ordered, the permutation instructions are no longer required.

Dynamic Selective Devectorization. As leakage is becoming a growing concern in current microprocessors, several leakage control mechanisms have been proposed. Power gating is one of the most common leakage control techniques. However, power gating has an energy and performance penalty associated with it. The penalty has to be paid every time a functional unit is sent to sleep (leakage saving) mode and later awakened. This penalty is particularly unjustified if large functional units like SIMD accelerators are to be awakened only for a few cycles.

We proposed a mechanism to reduce the number of power gating instances and hence the penalty associated with them. We first profile the code dynamically to find the SIMD accelerator usage pattern. Then, the code corresponding to the low utilization periods is discovered. Afterwards, we devectorize this code so that the SIMD accelerator can be power gated for large time intervals. This helps in reducing the power gating energy penalty and maximizing the leakage energy savings. Moreover, selectively devectorizing only the code corresponding to low utilization periods has minimal impact on the performance.

7.2 Future Work

The work presented in this thesis opens up the following directions in SIMD accelerator research.

Conditional Code Vectorization. In this thesis we considered unrolling only loops with a single basic block and discovered significant vectorization opportunities. As a follow-up, loops with control flow can be unrolled to further increase the vectorization opportunities. Conditional code vectorization will benefit from the availability of runtime program behavior, in particular biased branch directions. Instead of including control flow in the unrolled loop, biased branch directions can be followed to include only the most frequently executed path. The unrolled loop without any control flow will provide additional vectorization opportunities without the complexity of handling the branches. If the branches inside the loop are not biased, then the conditional code can be vectorized using masked operations.

Performance and Energy Efficient Vectorization. Our Speculative Dynamic Vectorization proposal of Chapter 4 showed that memory speculation uncovers significant vectorization opportunities and improves performance. Then, in Chapter 6, we showed that selective devectorization can provide significant energy savings. Since these two proposals are orthogonal, their basic ideas can be combined to craft a new vectorization scheme that provides not only performance but also energy efficiency. Such a vectorization scheme needs to profile the code to discover both the expected performance benefits and the energy efficiency before vectorizing it.



References

[1] Auto-vectorization in GCC. URL http://gcc.gnu.org/projects/tree-ssa/vectorization.html

[2] AMD Radeon. http://www.amd.com/us/products/desktop/graphics/pages/radeon.aspx

[3] IBM Microelectronics Division Research Triangle Park NC. The PowerPC 440 Core. White Paper, 1999.

[4] Intel Corporation, Intel® 64 and IA-32 Architectures Software Developer´s Manual, Volume 1-3.

[5] Intel’s HW/SW co-designed processor project. http://www.eetimes.com/document.asp?doc_id=1266396

[6] Nvidia GeForce. http://www.nvidia.com/object/geforce_family.html

[7] Nvidia Tesla. http://www.nvidia.com/object/tesla-supercomputing-solutions.html

[8] PowerPC ISA. http://www.power.org/documentation/power-isa-version-2-06-revision-b/

[9] Quick EMUlation tool. http://wiki.qemu.org/Main_Page

[10] Semiconductor Industries Association, "Model for Assessment of CMOS Technologies and Roadmaps (MASTAR)," 2007, http://www.itrs.net/models.html.

[11] Standard Performance Evaluation Corporation. SPEC CPU2006 Benchmarks. URL http://www.spec.org/cpu2006/.

[12] The Intel® Xeon Phi™ Coprocessor: http://www.intel.com/content/www/us/en/high-performance-computing/high-performance-xeon-phi-coprocessor-brief.html

[13] UTDSP Benchmarks: www.eecg.toronto.edu/~corinna/


[14] Anant Agarwal, John Kubiatowicz, David Kranz, Beng-Hong Lim, Donald Yeung, Godfrey D'Souza, and Mike Parkin. 1993. Sparcle: An Evolutionary Processor Design for Large-Scale Multiprocessors. IEEE Micro 13, 3 (May 1993), 48-61.

[15] Kanak Agarwal, Kevin Nowka, Harmander Deogun, and Dennis Sylvester. 2006. Power Gating with Multiple Sleep Modes. In Proceedings of the 7th International Symposium on Quality Electronic Design (ISQED '06). IEEE Computer Society, Washington, DC, USA, 633-637.

[16] Tilak Agerwala and John Cocke [1987]. High Performance Reduced Instruction Set Processors, Tech. Rep. RC12434, IBM Thomas Watson Research Center, Yorktown Heights, N.Y.

[17] Yoav Almog, Roni Rosner, Naftali Schwartz, and Ari Schmorak. 2004. Specialized Dynamic Optimizations for High-Performance Energy-Efficient Microarchitecture. In Proceedings of the international symposium on Code generation and optimization: feedback-directed and runtime optimization (CGO '04). IEEE Computer Society, Washington, DC, USA, 137-.

[18] Hari Ananthan, Chris H. Kim, and Kaushik Roy. 2004. Larger-than-vdd forward body bias in sub-0.5V nanoscale CMOS. In Proceedings of the 2004 international symposium on Low power electronics and design (ISLPED '04). ACM, New York, NY, USA, 8-13.

[19] C. Auth, et al., "45nm High-k+Metal Gate Strain-Enhanced Transistors," Intel Technology Journal, vol. 12, 2008.

[20] H. B. Bakoglu, G. F. Grohoski, L. E. Thatcher, J. A. Kahle, C. R. Moore, D. P. Tuttle, W. E. Maule, W. R. Hardell Jr., D. A. Hicks, M. Nguyenphu, R. K. Montoye, W. T. Glover, and S. Dhawan. "IBM second-generation RISC machine organization," In Proceedings of the 1989 IEEE International Conference on Computer Design: VLSI in Computers and Processors (ICCD '89), pp. 138-142, 2-4 Oct 1989.

[21] Vasanth Bala, Evelyn Duesterwald, and Sanjeev Banerjia. 2000. Dynamo: a transparent dynamic optimization system. In Proceedings of the ACM SIGPLAN 2000 conference on Programming language design and implementation (PLDI '00). ACM, New York, NY, USA, 1-12.


[22] Leonid Baraz, Tevi Devor, Orna Etzion, Shalom Goldenberg, Alex Skaletsky, Yun Wang, and Yigel Zemach. 2003. IA-32 Execution Layer: a two-phase dynamic translator designed to support IA-32 applications on Itanium®-based systems. In Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture (MICRO 36). IEEE Computer Society, Washington, DC, USA, 191-.

[23] M. Baron. Cortex-A8: High speed, low power. Microprocessor Report,11(14):1– 6, 2005.

[24] Aart J. C. Bik, Milind Girkar, Paul M. Grey, and Xinmin Tian. 2002. Automatic intra-register vectorization for the Intel architecture. International Journal of Parallel Programming 30, 2 (April 2002), 65-98.

[25] Calin Bira, Liviu Gugu, Radu Hobincu, Valeriu Codreanu, Lucian Petrica and Sorin Cotofana. 2013. An Energy Effective SIMD Accelerator for Visual Pattern Matching. In Proceedings of the Fourth International Symposium on Highly Efficient Accelerators and Reconfigurable Technologies. Edinburgh, Scotland, 13-14 June 2013.

[26] Erich Bloch. 1959. The engineering design of the stretch computer. In Papers presented at the December 1-3, 1959, eastern joint IRE-AIEE-ACM computer conference (IRE-AIEE-ACM '59 (Eastern)).

[27] Aleksandar Branković, Kyriakos Stavrou, Enric Gibert, and Antonio González. 2014. Accurate Off-Line Phase Classification for HW/SW Co-Designed Processors. In Proceedings of the ACM International Conference on Computing Frontiers (CF '14). Cagliari, Italy, May 2014.

[28] Aleksandar Branković, Kyriakos Stavrou, Enric Gibert, and Antonio González. 2014. Warm-Up Simulation Methodology for HW/SW Co-Designed Processors. In Proceedings of Annual IEEE/ACM International Symposium on Code Generation and Optimization (CGO '14). ACM, New York, NY, USA, Pages 284, 11 pages.

[29] Aleksandar Branković, Kyriakos Stavrou, Enric Gibert, and Antonio González. 2013. Performance analysis and predictability of the software layer in dynamic binary translators/optimizers. In Proceedings of the ACM International Conference on Computing Frontiers (CF '13). ACM, New York, NY, USA, Article 15, 10 pages.


[30] Derek Bruening, Timothy Garnett, and Saman Amarasinghe. 2003. An infrastructure for adaptive dynamic optimization. In Proceedings of the international symposium on Code generation and optimization: feedback-directed and runtime optimization (CGO '03). IEEE Computer Society, Washington, DC, USA, 265-275.

[31] Werner Buchholz. 1962. Planning a Computer System: Project Stretch. McGraw-Hill, Inc., Hightstown, NJ, USA.

[32] José Cano, Aleksandar Branković, Rakesh Kumar, Darko Zivanovic, Demos Pavlou, Kyriakos Stavrou, Enric Gibert, Alejandro Martínez, Gem Dot, Fernando Latorre, Alex Barceló, and Antonio González. Modelling HW/SW Co-Designed Processors. In Eighth International Summer School on Advanced Computer Architecture and Compilation for High-Performance and Embedded Systems (ACACES 2012), Fiuggi, Italy, July 2012.

[33] A. E. Charlesworth. 1981. An Approach to Scientific Array Processing: The Architectural Design of the AP-120B/FPS-164 Family. Computer 14, 9 (September 1981), 18-27.

[34] Nathan Clark, Amir Hormati, Sami Yehia, Scott Mahlke, and Krisztian Flautner. 2007. Liquid SIMD: Abstracting SIMD Hardware using Lightweight Dynamic Mapping. In Proceedings of the 2007 IEEE 13th International Symposium on High Performance Computer Architecture (HPCA '07). IEEE Computer Society, Washington, DC, USA, 216-227.

[35] Paul D'Arcy and Scott Beach. StarCore SC140: A New DSP Architecture for Portable Devices. In Wireless Symposium. Motorola, September 1999.

[36] James C. Dehnert, Brian K. Grant, John P. Banning, Richard Johnson, Thomas Kistler, Alexander Klaiber, and Jim Mattson. 2003. The Transmeta Code Morphing™ Software: using speculation, recovery, and adaptive retranslation to address real-life challenges. In Proceedings of the international symposium on Code generation and optimization: feedback-directed and runtime optimization (CGO '03). IEEE Computer Society, Washington, DC, USA, 15-24.

[37] Keith Diefendorff, Pradeep K. Dubey, Ron Hochsprung, and Hunter Scales. 2000. AltiVec Extension to PowerPC Accelerates Media Processing. IEEE Micro 20, 2 (March 2000), 85-95.

[38] D. R. Ditzel and H. R. McLellan. 1987. Branch folding in the CRISP microprocessor: reducing branch delay to zero. In Proceedings of the 14th annual international symposium on Computer architecture (ISCA '87), D. St. Clair (Ed.). ACM, New York, NY, USA, 2-8.

[39] Kemal Ebcioğlu and Erik R. Altman. 1997. DAISY: dynamic compilation for 100% architectural compatibility. In Proceedings of the 24th annual international symposium on Computer architecture (ISCA '97). ACM, New York, NY, USA, 26-37.

[40] Joseph A. Fisher. 1983. Very Long Instruction Word architectures and the ELI-512. In Proceedings of the 10th annual international symposium on Computer architecture (ISCA '83). ACM, New York, NY, USA, 140-150.

[41] Bolei Guo, Youfeng Wu, Cheng Wang, Matthew J. Bridges, Guilherme Ottoni, Neil Vachharajani, Jonathan Chang, and David I. August. 2006. Selective runtime memory disambiguation in a dynamic binary translator. In Proceedings of the 15th international conference on Compiler Construction (CC'06), Alan Mycroft and Andreas Zeller (Eds.). Springer-Verlag, Berlin, Heidelberg, 65-79

[42] John L. Hennessy and David A. Patterson. Computer Architecture: A Quantitative Approach. Morgan Kaufmann Publishers Inc., 5th edition, 2011.

[43] R.G. Hintz and D.P. Tate, "Control Data STAR-100 processor design," In Proceedings of IEEE Compcon, 1972, pp. 1–4.

[44] Hiroaki Hirata, Kozo Kimura, Satoshi Nagamine, Yoshiyuki Mochizuki, Akio Nishimura, Yoshimori Nakase, and Teiji Nishizawa. 1992. An elementary processor architecture with simultaneous instruction issuing from multiple threads. In Proceedings of the 19th annual international symposium on Computer architecture (ISCA '92). ACM, New York, NY, USA, 136-145.

[45] Justin Holewinski, Ragavendar Ramamurthi, Mahesh Ravishankar, Naznin Fauzia, Louis-Noël Pouchet, Atanas Rountev, and P. Sadayappan. 2012. Dynamic trace-based analysis of vectorization potential of applications. In Proceedings of the 33rd ACM SIGPLAN conference on Programming Language Design and Implementation (PLDI '12). ACM, New York, NY, USA, 371-382.

[46] Zhigang Hu, Alper Buyuktosunoglu, Viji Srinivasan, Victor Zyuban, Hans Jacobson, and Pradip Bose. 2004. Microarchitectural techniques for power gating of execution units. In Proceedings of the 2004 international symposium on Low power electronics and design (ISLPED '04). ACM, New York, NY, USA, 32-37.

[47] Libo Huang, Li Shen, Zhiying Wang, Wei Shi, Nong Xiao, and Sheng Ma. 2010. SIF: Overcoming the Limitations of SIMD Devices via Implicit Permutation. In Proceedings of the IEEE 16th International Symposium on High Performance Computer Architecture (HPCA 2010), 1-12, January 2010.

[48] W. Hwu and Y. N. Patt. 1986. HPSm, a high performance restricted data flow architecture having minimal functionality. In Proceedings of the 13th annual international symposium on Computer architecture (ISCA '86). IEEE Computer Society Press, Los Alamitos, CA, USA, 297-306.

[49] Daniel A. Jimenez and Calvin Lin. 2002. Neural methods for dynamic branch prediction. ACM Trans. Comput. Syst. 20, 4 (November 2002), 369-397.

[50] M. Johnson. 1990. Superscalar Microprocessor Design. Prentice Hall, Englewood Cliffs, N.J.

[51] David R. Kaeli and Philip G. Emma. 1991. Branch history table prediction of moving target branches due to subroutine returns. In Proceedings of the 18th annual international symposium on Computer architecture (ISCA '91). ACM, New York, NY, USA, 34-42.

[52] J. A. Kahle, M. N. Day, H. P. Hofstee, C. R. Johns, T. R. Maeurer, and D. Shippy. Introduction to the Cell Multiprocessor. IBM Journal of Research and Development, 49(4), pages 589-604, July 2005.

[53] Stephen W. Keckler and William J. Dally. 1992. Processor coupling: integrating compile time and runtime scheduling for parallelism. In Proceedings of the 19th annual international symposium on Computer architecture (ISCA '92). ACM, New York, NY, USA, 202-213.

[54] Ho-Seop Kim and James E. Smith. 2003. Hardware Support for Control Transfers in Code Caches. In Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture (MICRO 36). IEEE Computer Society, Washington, DC, USA, 253-.

[55] Hyesoon Kim, José A. Joao, Onur Mutlu, Chang Joo Lee, Yale N. Patt, and Robert Cohn. 2007. VPC prediction: reducing the cost of indirect branches via hardware-based dynamic devirtualization. In Proceedings of the 34th annual international symposium on Computer architecture (ISCA '07). ACM, New York, NY, USA, 424-435.

[56] Hyung-Ock Kim, Bong Hyun Lee, Jong-Tae Kim, Jung Yun Choi, Kyu-Myung Choi, and Youngsoo Shin. 2010. Supply Switching With Ground Collapse for Low-Leakage Register Files in 65-nm CMOS. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 18, 3 (March 2010), 505-509.

[57] A. Klaiber. The Technology Behind the Crusoe Processors. White paper, January 2000.

[58] A. Klaiber and S. Chau. 2003. Automatic detection of logic bugs in hardware designs. In Proceedings of the 4th International Workshop on Microprocessor Test and Verification: Common Challenges and Solutions (MTV 2003), 47-53, May 2003.

[59] K. Krewell. Transmeta Gets More Efficeon. Microprocessor Report, 17(10), 2003.

[60] Alexei Kudriavtsev and Peter Kogge. 2005. Generation of permutations for SIMD processors. In Proceedings of the 2005 ACM SIGPLAN/SIGBED conference on Languages, compilers, and tools for embedded systems (LCTES '05). ACM, New York, NY, USA, 147-156.

[61] Rakesh Kumar, Alejandro Martínez, and Antonio González. 2013. Speculative Dynamic Vectorization to Assist Static Vectorization in a HW/SW Co-designed Environment. In Proceedings of the 20th International Conference on High Performance Computing (HiPC 2013), Bangalore, India, December 18-21, 2013.

[62] Rakesh Kumar, Alejandro Martínez, and Antonio González. 2013. Speculative Dynamic Vectorization to Assist Static Vectorization in a HW/SW Co-designed Environment. In HiPEAC Compiler, Architecture and Tools Conference, Haifa, Israel, November 18-19, 2013.

[63] Rakesh Kumar, Alejandro Martínez, and Antonio González. 2013. Vectorizing for Wider Vector Units in a HW/SW Co-designed Environment. In Proceedings of the 15th International Conference on High Performance Computing and Communications (HPCC 2013), Zhangjiajie, China, November 13-15, 2013.

[64] Rakesh Kumar, Alejandro Martínez, and Antonio González. 2013. Dynamic Selective Devectorization for Efficient Power Gating of SIMD Units in a HW/SW Co-designed Environment. In Proceedings of the 25th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD 2013), Porto de Galinhas, Pernambuco, Brazil, October 23-26, 2013.

[65] Rakesh Kumar, Alejandro Martínez, and Antonio González. 2012. Speculative Dynamic Vectorization for HW/SW Co-designed Processors. In Proceedings of the 21st international conference on Parallel architectures and compilation techniques (PACT '12), Minneapolis, MN, USA, September 2012.

[66] Samuel Larsen and Saman Amarasinghe. 2000. Exploiting superword level parallelism with multimedia instruction sets. In Proceedings of the ACM SIGPLAN 2000 conference on Programming language design and implementation (PLDI '00). ACM, New York, NY, USA, 145-156.

[67] James Laudon, Anoop Gupta, and Mark Horowitz. 1994. Interleaving: a multithreading technique targeting multiprocessors and workstations. SIGPLAN Not. 29, 11 (November 1994), 308-318.

[68] Ruby Lee. Subword Parallelism with MAX-2. IEEE Micro, 16(4):51-59, Aug 1996.

[69] Jianhui Li, Qi Zhang, Shu Xu, and Bo Huang. 2006. Optimizing Dynamic Binary Translation for SIMD Instructions. In Proceedings of the International Symposium on Code Generation and Optimization (CGO '06). IEEE Computer Society, Washington, DC, USA, 269-280.

[70] Sheng Li, Jung Ho Ahn, Richard D. Strong, Jay B. Brockman, Dean M. Tullsen, and Norman P. Jouppi. 2009. McPAT: an integrated power, area, and timing modeling framework for multicore and manycore architectures. In Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 42). ACM, New York, NY, USA, 469-480.

[71] Chi-Keung Luk, Robert Cohn, Robert Muth, Harish Patil, Artur Klauser, Geoff Lowney, Steven Wallace, Vijay Janapa Reddi, and Kim Hazelwood. 2005. Pin: building customized program analysis tools with dynamic instrumentation. In Proceedings of the 2005 ACM SIGPLAN conference on Programming language design and implementation (PLDI '05). ACM, New York, NY, USA, 190-200

[72] Anita Lungu, Pradip Bose, Alper Buyuktosunoglu, and Daniel J. Sorin. 2009. Dynamic power gating with quality guarantees. In Proceedings of the 14th ACM/IEEE international symposium on Low power electronics and design (ISLPED '09). ACM, New York, NY, USA, 377-382.

[73] Marc Lupon, Enric Gibert, Grigorios Magklis, Sridhar Samudrala, Raúl Martínez, Kyriakos Stavrou, and David R. Ditzel. 2014. Speculative hardware/software co-designed floating-point multiply-add fusion. In Proceedings of the 19th international conference on Architectural support for programming languages and operating systems (ASPLOS '14). ACM, New York, NY, USA, 623-638.

[74] Saeed Maleki, Yaoqing Gao, Maria J. Garzarán, Tommy Wong, and David A. Padua. 2011. An Evaluation of Vectorizing Compilers. In Proceedings of the 2011 International Conference on Parallel Architectures and Compilation Techniques (PACT '11). IEEE Computer Society, Washington, DC, USA, 372-382.

[75] H. M. Mathis, A. E. Mericas, J. D. McCalpin, R. J. Eickemeyer, and S. R. Kunkel. 2005. Characterization of simultaneous multithreading (SMT) efficiency in POWER5. IBM J. Research and Development 49, 4/5 (July 2005), 555-564.

[76] Cameron McNairy and Don Soltis. 2003. Itanium 2 Processor Microarchitecture. IEEE Micro 23, 2 (March 2003), 44-55.

[77] Steven S. Muchnick. Advanced Compiler Design and Implementation. Morgan Kaufmann, 1997.

[78] Dorit Naishlos. Autovectorization in GCC. In The 2004 GCC Developers' Summit, pages 105-118, 2004.

[79] Naveen Neelakantam, David R. Ditzel, and Craig Zilles. 2010. A real system evaluation of hardware atomicity for software speculation. In Proceedings of the fifteenth edition of ASPLOS on Architectural support for programming languages and operating systems (ASPLOS XV). ACM, New York, NY, USA, 29-38.

[80] A. Nicolau and J. A. Fisher. 1984. Measuring the Parallelism Available for Very Long Instruction Word Architectures. IEEE Trans. Comput. 33, 11 (November 1984), 968-976.

[81] K. Nose and T. Sakurai. 2000. Analysis and future trend of short-circuit power. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 19, 9 (September 2000), 1023-1030.

[82] Dorit Nuzman, Sergei Dyshel, Erven Rohou, Ira Rosen, Kevin Williams, David Yuste, Albert Cohen, and Ayal Zaks. 2011. Vapor SIMD: Auto-vectorize once, run everywhere. In Proceedings of the 9th Annual IEEE/ACM International Symposium on Code Generation and Optimization (CGO '11). IEEE Computer Society, Washington, DC, USA, 151-160.

[83] Alex Pajuelo, Antonio González, and Mateo Valero. 2002. Speculative dynamic vectorization. In Proceedings of the 29th annual international symposium on Computer architecture (ISCA '02). IEEE Computer Society, Washington, DC, USA, 271-280.

[84] Shien-Tai Pan, Kimming So, and Joseph T. Rahmeh. 1992. Improving the accuracy of dynamic branch prediction using branch correlation. In Proceedings of the fifth international conference on Architectural support for programming languages and operating systems (ASPLOS V), Richard L. Wexelblat (Ed.). ACM, New York, NY, USA, 76-84.

[85] Sanjay J. Patel and Steven S. Lumetta. 2001. rePLay: A Hardware Framework for Dynamic Optimization. IEEE Transactions on Computers 50, 6 (June 2001), 590-608.

[86] Demos Pavlou, Aleksandar Brankovic, Rakesh Kumar, Maria Gregori, Kyriakos Stavrou, Enric Gibert, and Antonio Gonzalez. DARCO: Infrastructure for Research on HW/SW Co-designed Virtual Machines. In Proceedings of the 4th Workshop on Architectural and Microarchitectural Support for Binary Translation (AMAS-BT'11), held in conjunction with ISCA-38, June 2011.

[87] Demos Pavlou, Enric Gibert, Fernando Latorre, and Antonio Gonzalez. 2012. DDGacc: boosting dynamic DDG-based binary optimizations through specialized hardware support. In Proceedings of the 8th ACM SIGPLAN/SIGOPS conference on Virtual Execution Environments (VEE '12). ACM, New York, NY, USA, 159-168.

[88] Gang Ren, Peng Wu, and David Padua. 2005. An Empirical Study On the Vectorization of Multimedia Applications for Multimedia Extensions. In Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Papers - Volume 01 (IPDPS '05), Vol. 1. IEEE Computer Society, Washington, DC, USA, 89.2-.

[89] Gang Ren, Peng Wu, and David Padua. 2006. Optimizing data permutations for SIMD devices. In Proceedings of the 2006 ACM SIGPLAN conference on Programming language design and implementation (PLDI '06). ACM, New York, NY, USA, 118-131.

[90] Roni Rosner, Yoav Almog, Micha Moffie, Naftali Schwartz, and Avi Mendelson. 2004. Power Awareness through Selective Dynamically Optimized Traces. In Proceedings of the 31st annual international symposium on Computer architecture (ISCA '04). IEEE Computer Society, Washington, DC, USA, 162-.

[91] Sumedh Sathaye, Paul Ledak, Jay Leblanc, Stephen Kosonocky, Michael Gschwind, Jason Fritts, Arthur Bright, Erik Altman, and Craig Agricola. 1999. BOA: Targeting multi-gigahertz with binary translation. In Proceedings of the 1999 Workshop on Binary Translation, IEEE Computer Society Technical Committee on Computer Architecture Newsletter, pages 2-11, 1999.

[92] K. Scott, N. Kumar, S. Velusamy, B. Childers, J. W. Davidson, and M. L. Soffa. 2003. Retargetable and reconfigurable software dynamic translation. In Proceedings of the international symposium on Code generation and optimization: feedback-directed and runtime optimization (CGO '03). IEEE Computer Society, Washington, DC, USA, 36-47.

[93] Larry Seiler, Doug Carmean, Eric Sprangle, Tom Forsyth, Michael Abrash, Pradeep Dubey, Stephen Junkins, Adam Lake, Jeremy Sugerman, Robert Cavin, Roger Espasa, Ed Grochowski, Toni Juan, and Pat Hanrahan. 2008. Larrabee: a many-core x86 architecture for visual computing. ACM Trans. Graph. 27, 3, Article 18 (August 2008), 15 pages.

[94] Harsh Sharangpani and Ken Arora. 2000. Itanium Processor Microarchitecture. IEEE Micro 20, 5 (September 2000), 24-43.

[95] Jaewook Shin, Mary Hall, and Jacqueline Chame. 2005. Superword-Level Parallelism in the Presence of Control Flow. In Proceedings of the international symposium on Code generation and optimization (CGO '05). IEEE Computer Society, Washington, DC, USA, 165-175.

[96] Youngsoo Shin, Sewan Heo, Hyung-Ock Kim, and Jung Yun Choi. 2007. Supply switching with ground collapse: simultaneous control of subthreshold and gate leakage current in nanometer-scale CMOS circuits. IEEE Trans. Very Large Scale Integr. Syst. 15, 7 (July 2007), 758-766.

[97] James E. Smith. 1981. A study of branch prediction strategies. In Proceedings of the 8th annual symposium on Computer Architecture (ISCA '81). IEEE Computer Society Press, Los Alamitos, CA, USA, 135-148.

[98] J. E. Smith, Greg Faanes, and Rabin Sugumar. 2000. Vector instruction set support for conditional operations. In Proceedings of the 27th annual international symposium on Computer architecture (ISCA '00). ACM, New York, NY, USA, 260-269.

[99] J.E. Smith and R. Nair. Virtual Machines: A Versatile Platform for Systems and Processes. (The Morgan Kaufmann Series in Computer Architecture and Design). Elsevier 2005.

[100] James E. Smith and Andrew R. Pleszkun. 1985. Implementation of precise interrupts in pipelined processors. In Proceedings of the 12th annual international symposium on Computer architecture (ISCA '85). IEEE Computer Society Press, Los Alamitos, CA, USA, 36-44.

[101] M. D. Smith, M. Johnson, and M. A. Horowitz. 1989. Limits on multiple instruction issue. In Proceedings of the third international conference on Architectural support for programming languages and operating systems (ASPLOS III). ACM, New York, NY, USA, 290-302.

[102] Gurindar S. Sohi. 1990. Instruction Issue Logic for High-Performance, Interruptible, Multiple Functional Unit, Pipelined Computers. IEEE Trans. Comput. 39, 3 (March 1990), 349-359.

[103] Manu Sporny, Gray Carper, and Jonathan Turner. The PlayStation 2 Linux Kit Handbook, 2002.

[104] N. Sreraman and R. Govindarajan. A vectorizing compiler for multimedia extensions. International Journal of Parallel Programming, 28, 4 (August 2000), 363-400.

[105] James E. Thornton. 1964. Parallel operation in the Control Data 6600. In Proceedings of the October 27-29, 1964, fall joint computer conference, part II: very high speed computer systems (AFIPS '64 (Fall, part II)).

[106] James E. Thornton. Design of a Computer: The Control Data 6600. Scott, Foresman and Company, 1970.

[107] G. S. Tjaden and M. J. Flynn. 1970. Detection and Parallel Execution of Independent Instructions. IEEE Trans. Comput. 19, 10 (October 1970), 889-895.

[108] R. M. Tomasulo. 1967. An efficient algorithm for exploiting multiple arithmetic units. IBM J. Research and Development 11, 1 (January 1967).

[109] Marc Tremblay, Michael O'Connor, Venkatesh Narayanan, and Liang He. VIS Speeds New Media Processing. IEEE Micro, 16(4):10-20, Aug 1996.

[110] J. W. Tschanz, S. G. Narendra, Yibin Ye, B. A. Bloechel, S. Borkar, and V. De. 2003. Dynamic sleep transistor and body bias for active leakage power control of microprocessors. IEEE Journal of Solid-State Circuits 38, 11 (November 2003), 1838-1845.

[111] Dean M. Tullsen, Susan J. Eggers, and Henry M. Levy. 1995. Simultaneous multithreading: maximizing on-chip parallelism. In Proceedings of the 22nd Annual International Symposium on Computer Architecture (ISCA '95), June 22-24, 1995, Santa Margherita, Italy, 392-403.

[112] Sriram Vajapeyam, P. J. Joseph, and Tulika Mitra. 1999. Dynamic vectorization: a mechanism for exploiting far-flung ILP in ordinary programs. In Proceedings of the 26th annual international symposium on Computer architecture (ISCA '99). IEEE Computer Society, Washington, DC, USA, 16-27

[113] Cheng Wang, Marcelo Cintra, and Youfeng Wu. 2013. Acceldroid: Co-designed acceleration of Android bytecode. In Proceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization (CGO) (CGO '13). IEEE Computer Society, Washington, DC, USA, 1-10.

[114] W. J. Watson. 1972. The TI ASC: a highly modular and flexible super computer architecture. In Proceedings of the December 5-7, 1972, fall joint computer conference, part I (AFIPS '72 (Fall, part I)). ACM, New York, NY, USA, 221-228.

[115] Mark Woh, Sangwon Seo, Scott Mahlke, Trevor Mudge, Chaitali Chakrabarti, and Krisztian Flautner. 2009. AnySP: anytime anywhere anyway signal processing. In Proceedings of the 36th annual international symposium on Computer architecture (ISCA '09). ACM, New York, NY, USA, 128-139

[116] Y. Ye, S. Borkar, and V. De. 1998. A new technique for standby leakage reduction in high-performance circuits. In 1998 Symposium on VLSI Circuits, Digest of Technical Papers, 40-41, June 1998.

[117] Thomas Y. Yeh, Petros Faloutsos, Sanjay J. Patel, and Glenn Reinman. 2007. ParallAX: an architecture for real-time physics. In Proceedings of the 34th annual international symposium on Computer architecture (ISCA '07). ACM, New York, NY, USA, 232-243.

[118] Tse-Yu Yeh and Yale N. Patt. 1992. Alternative implementations of two-level adaptive branch prediction. In Proceedings of the 19th annual international symposium on Computer architecture (ISCA '92). ACM, New York, NY, USA, 124-134.

[119] Tse-Yu Yeh and Yale N. Patt. 1993. A comparison of dynamic branch predictors that use two levels of branch history. In Proceedings of the 20th annual international symposium on computer architecture (ISCA '93). ACM, New York, NY, USA, 257-266.

[120] Ahmed Youssef, Mohab Anis, and Mohamed Elmasry. 2006. Dynamic Standby Prediction for Leakage Tolerant Microprocessor Functional Units. In Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 39). IEEE Computer Society, Washington, DC, USA, 371-384
