GPU Architecture Part 2: SIMT Control Flow Management

Caroline Collange
Inria Rennes – Bretagne Atlantique
[email protected]
https://team.inria.fr/pacap/members/collange/
Master 2 SIF ADA - 2019

Outline
- Running SPMD software on SIMD hardware
  - Context: software and hardware
  - The control flow divergence problem
- Stack-based control flow tracking
  - Stacks
  - Counters
- Path-based control flow tracking
  - The idea: use PCs
  - Implementation: path list
  - Applications
- Software approaches
  - Use cases and principle
  - Scalarization
- Research directions

Analogy: out-of-order microarchitecture
On a CPU, software is written against a sequential programming model. For example:

    void scale(float a, float *X, int n)
    {
        for (int i = 0; i != n; ++i)
            X[i] = a * X[i];
    }

compiles to the following x86 assembly:

    scale:
            test    esi, esi
            je      .L4
            sub     esi, 1
            xor     eax, eax
            lea     rdx, [4+rsi*4]
    .L3:
            movss   xmm1, DWORD PTR [rdi+rax]
            mulss   xmm1, xmm0
            movss   DWORD PTR [rdi+rax], xmm1
            add     rax, 4
            cmp     rax, rdx
            jne     .L3
    .L4:
            rep ret

The hardware datapaths, however, implement a dataflow execution model. The "dark magic" that bridges the sequential architecture and the dataflow hardware is the out-of-order superscalar microarchitecture.

GPU microarchitecture: where it fits
On a GPU, software is written against a multi-thread programming model:

    __global__ void scale(float a, float *X)
    {
        unsigned int tid;
        tid = blockIdx.x * blockDim.x + threadIdx.x;
        X[tid] = a * X[tid];
    }

The hardware datapaths are SIMD execution units, and the "dark magic" in between is the SIMT microarchitecture.

Programmer's view
The programming model is SPMD: single program, multiple data. One kernel code runs as many threads, and the execution order is unspecified between explicit synchronization barriers. Languages: graphics shaders (HLSL, Cg, GLSL) and GPGPU languages (C for CUDA, OpenCL). For n threads, the kernel performs X[tid] ← a * X[tid].

Types of control flow
Structured control flow is single-entry, single-exit and properly nested: conditionals (if-then-else), single-entry, single-exit loops (while, do-while, for...), function call and return...
Unstructured control flow: break, continue, && and || short-circuit evaluation, exceptions, coroutines, goto, comefrom. In short, code that is hard to indent!

Warps
Threads are grouped into warps of fixed size. With a warp size of 4: warp 0 = {T0, T1, T2, T3} and warp 1 = {T4, T5, T6, T7}.
(Figure: an early SIMT architecture. Musée Gallo-Romain de St-Romain-en-Gal, Vienne.)
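To make the mapping from the SPMD model to warps concrete, here is a minimal CUDA sketch written for this summary rather than taken from the slides: the length parameter n, the bounds check, the launch configuration and the host code are illustrative assumptions (the slides' kernel takes no length argument). On current NVIDIA GPUs a warp holds 32 threads; the 4-thread warps in the figures are only for illustration.

    #include <cstdio>

    __global__ void scale(float a, float *X, int n)
    {
        unsigned int tid  = blockIdx.x * blockDim.x + threadIdx.x;
        unsigned int warp = tid / warpSize;   // warpSize is the built-in warp width (32)
        unsigned int lane = tid % warpSize;   // position of this thread inside its warp
        (void)warp; (void)lane;               // computed only to illustrate the grouping
        if (tid < n)                          // threads past the end of X do nothing
            X[tid] = a * X[tid];
    }

    int main()
    {
        const int n = 1000;
        float *X;
        cudaMallocManaged(&X, n * sizeof(float));   // unified memory, for brevity
        for (int i = 0; i < n; ++i) X[i] = 1.0f;

        int threads = 256;                          // 8 warps of 32 threads per block
        int blocks  = (n + threads - 1) / threads;  // enough blocks to cover n elements
        scale<<<blocks, threads>>>(2.0f, X, n);
        cudaDeviceSynchronize();

        std::printf("X[0] = %.1f, X[%d] = %.1f\n", X[0], n - 1, X[n - 1]);
        cudaFree(X);
        return 0;
    }

The bounds check is already a first taste of divergence: in the last warp, the threads with tid >= n are masked off for the guarded store while the other lanes execute it.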
Control flow: uniform or divergent
Control is uniform when all threads in the warp follow the same path, and divergent when different threads follow different paths. For a warp of threads T0-T3, the per-thread outcome of each condition is:

    x = 0;
    // Uniform condition: a[x] > 17 ?   outcome 1 1 1 1
    if (a[x] > 17) {
        x = 1;
    }
    // Divergent condition: tid < 2 ?   outcome 1 1 0 0
    if (tid < 2) {
        x = 2;
    }

Computer architect view
Execution inside a warp is SIMD: one SIMD lane per thread, and all SIMD lanes see the same instruction. Control flow is differentiated using an execution mask: every instruction is controlled by 1 bit per lane, where 1 means perform the instruction and 0 means do nothing.

Running independent threads in SIMD
How do we keep threads synchronized when control flow diverges? The rules of the game: one thread per SIMD lane, the same instruction on all lanes, and lanes can be individually disabled with the execution mask. Two questions remain: which instruction do we fetch, and how do we compute the execution mask? The running example, with nested divergent conditions:

    x = 0;
    // Uniform condition
    if (tid > 17) {
        x = 1;
    }
    // Divergent conditions
    if (tid < 2) {
        if (tid == 0) {
            x = 2;
        } else {
            x = 3;
        }
    }

Example: if
The execution mask inside an if statement is simply the if condition. For the warp T0-T3, the uniform condition a[x] > 17 gives the execution mask 1 1 1 1, and the divergent condition tid < 2 gives the execution mask 1 1 0 0.

State of the art in 1804
The Jacquard loom is a GPU: a framebuffer, shaders, multiple threads with per-thread conditional execution, an execution mask given by punched cards, and support for 600 to 800 parallel threads.

Conditionals that no thread executes
Do not waste energy fetching instructions that are not executed: skip instructions when the execution mask is all zeroes. In the example, the else branch of the uniform condition gets the mask 0 0 0 0, so the warp jumps directly to the end of the if. Uniform branches are just usual scalar branches.

What about loops?
Keep looping until all threads exit, and mask out the threads that have already exited the loop. Execution trace for the warp T0-T3 running:

    i = 0;
    while (i < tid) {
        i++;
    }
    print(i);

    i = 0;       i = 0 0 0 0    i < tid? 0 1 1 1
    i++;         i = 0 1 1 1    i < tid? 0 0 1 1
    i++;         i = 0 1 2 2    i < tid? 0 0 0 1
    i++;         i = 0 1 2 3    i < tid? 0 0 0 0
    print(i);    i = 0 1 2 3

When no active thread is left, restore the mask and exit the loop.

What about nested control flow?
The nested divergent conditions in the running example above show that handling a single if or loop at a time is not enough: we need a generic solution!

Outline (recap)
Next: stack-based control flow tracking, with stacks and counters.

Quiz
What do Star Wars and GPUs have in common?

Answer: Pixar!
In the early 1980's, the Computer Division of Lucasfilm was designing custom hardware for computer graphics. Acquired by Steve Jobs in 1986, it became Pixar. Their core product, the Pixar Image Computer, is an early GPU that already handles nested divergent control flow!
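Before looking at how the Pixar Image Computer implements this in hardware, the loop trace above can be reproduced with a small host-side C++ sketch. This is an illustration written for this summary, not code from the slides or a model of any particular GPU: all lanes step through the loop together, a lane is masked off once its own exit condition is reached, and the warp leaves the loop only when no lane remains active.

    #include <cstdio>

    int main()
    {
        const int WARP = 4;
        int  i[WARP]    = {0, 0, 0, 0};                 // per-lane value of i
        bool mask[WARP] = {true, true, true, true};     // execution mask: 1 = lane active

        // Each lane's tid is its index (0..3), as in the warp T0-T3 of the trace.
        while (true) {
            // Re-evaluate the loop condition on every lane that is still active;
            // a lane whose condition fails is masked off for the rest of the loop.
            bool any_active = false;
            for (int lane = 0; lane < WARP; ++lane) {
                if (mask[lane] && !(i[lane] < lane))    // loop condition: i < tid
                    mask[lane] = false;
                any_active = any_active || mask[lane];
            }
            if (!any_active)
                break;                                  // no active lane left: exit the loop

            // The loop body "i++" executes only on lanes whose mask bit is set.
            for (int lane = 0; lane < WARP; ++lane)
                if (mask[lane])
                    i[lane]++;
        }

        // After the loop, the full mask is restored and every lane runs print(i).
        for (int lane = 0; lane < WARP; ++lane)
            std::printf("lane %d: i = %d\n", lane, i[lane]);   // prints 0, 1, 2, 3
        return 0;
    }

Running it prints i = 0, 1, 2, 3 for lanes 0 to 3, matching the trace: the warp iterates as many times as the slowest thread needs, and the masked-off lanes simply idle in the meantime.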
Pixar Image Computer: architecture overview
A control unit broadcasts each instruction to the SIMD lanes (ALU 0 to ALU 3). The current execution mask comes from a hardware mask stack that the control unit manipulates with push and pop operations.

The mask stack of the Pixar Image Computer
The stack holds 1 activity bit per thread. For threads tid = 0..3 (bits listed in thread order):

    Code                          Mask stack
    x = 0;                        1111
    // Uniform condition
    if (tid > 17) {               1111  (false for all threads: the body is skipped)
        x = 1;
    }                             1111
    // Divergent conditions
    if (tid < 2) {       push     1111 1100
        if (tid == 0) {  push     1111 1100 1000
            x = 2;
        }                pop      1111 1100
        else {           push     1111 1100 0100
            x = 3;
        }                pop      1111 1100
    }                    pop      1111

A. Levinthal and T. Porter. Chap - a SIMD graphics processor. SIGGRAPH '84, 1984.

Observation: stack content is a histogram
On structured control flow, the stack contents form columns of 1s: a thread that is active at nesting level n is active at all levels i < n, and conversely no "zombie" thread gets revived at a level i > n if it is inactive at level n. The height of each column of 1s is therefore enough information. Alternative implementation: maintain an activity counter for each thread.

R. Keryell and N. Paris. Activity counter: New optimization for the dynamic scheduling of SIMD control flow. ICPP '93, 1993.

With activity counters
Each thread keeps one (in)activity counter and is active when its counter is 0. Entering a divergent branch increments the counters of the threads that do not take it (and of the threads that were already inactive); leaving the branch decrements every non-zero counter. For threads tid = 0..3:

    Code                          Counters
    x = 0;                        0 0 0 0
    // Uniform condition
    if (tid > 17) {               (false for all threads: the body is skipped)
        x = 1;
    }                             0 0 0 0
    // Divergent conditions
    if (tid < 2) {       inc      0 0 1 1
        if (tid == 0) {  inc      0 1 2 2
            x = 2;
        }                dec      0 0 1 1
        else {           inc      1 0 2 2
            x = 3;
        }                dec      0 0 1 1
    }                    dec      0 0 0 0

Activity counters in use
In an SIMD prototype from École des Mines de Paris in 1993, and in Intel integrated graphics since 2004. (Credits: Ronan Keryell.)

Outline (recap)
Stack-based control flow tracking: stacks, counters, and a compiler perspective.

Back to ancient history
The 2000-2010 GPU timeline:
- Microsoft DirectX: 7.x, 8.0, 8.1, 9.0a, 9.0b, 9.0c, 10.0, 10.1, 11
- NVIDIA: NV10, NV20, NV30, NV40, G70, G80-G90, GT200, GF100
- ATI/AMD: R100, R200, R300, R400, R500, R600, R700, Evergreen
Milestones along the way: programmable shaders with FP16, FP24 and then FP32 arithmetic, dynamic control flow, unified shaders, SIMT execution and CUDA on the NVIDIA side, CTM, CAL and FP64 on the ATI/AMD side, and growing GPGPU traction.

Early days of programmable shaders
It is the 21st century! Graphics cards now look and sound like hair dryers (the infamous GeForce FX 5800 Ultra). Graphics shaders are programmed in assembly-like languages: Direct3D shader assembly, OpenGL ARB Vertex/Fragment Program...
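As a recap of the two divergence-tracking mechanisms above, here is a host-side C++ sketch, written for this summary rather than modeling the actual Pixar or Keryell/Paris hardware, that replays the nested-if example with both a mask stack and per-thread activity counters and checks that they stay in agreement. The names MaskStack, ActivityCounters, push, pop and active are my own.

    #include <array>
    #include <cstdio>
    #include <vector>

    constexpr int WARP = 4;                  // 4-lane warp, as in the examples above
    using Mask = std::array<bool, WARP>;

    // Pixar-style mask stack: the top of the stack is the current execution mask.
    struct MaskStack {
        std::vector<Mask> stack{Mask{true, true, true, true}};
        void push(const Mask &cond) {        // entering a branch: AND the condition in
            Mask next;
            for (int l = 0; l < WARP; ++l) next[l] = stack.back()[l] && cond[l];
            stack.push_back(next);
        }
        void pop() { stack.pop_back(); }     // leaving the branch: restore the parent mask
        bool active(int lane) const { return stack.back()[lane]; }
    };

    // Keryell/Paris-style counters: a lane is active when its counter is zero.
    struct ActivityCounters {
        std::array<int, WARP> c{};           // how many enclosing branches exclude the lane
        void push(const Mask &cond) {
            for (int l = 0; l < WARP; ++l)
                if (c[l] > 0 || !cond[l]) c[l]++;   // already inactive, or fails the condition
        }
        void pop() {
            for (int l = 0; l < WARP; ++l)
                if (c[l] > 0) c[l]--;
        }
        bool active(int lane) const { return c[lane] == 0; }
    };

    int main()
    {
        // The nested divergent example, for threads tid = 0..3:
        //   if (tid < 2) { if (tid == 0) { x = 2; } else { x = 3; } }
        Mask outer, inner, inner_not;
        for (int tid = 0; tid < WARP; ++tid) {
            outer[tid]     = (tid < 2);
            inner[tid]     = (tid == 0);
            inner_not[tid] = !inner[tid];
        }

        MaskStack ms;
        ActivityCounters ac;

        auto show = [&](const char *where) {
            std::printf("%-10s mask:", where);
            bool same = true;
            for (int l = 0; l < WARP; ++l) {
                std::printf(" %d", ms.active(l) ? 1 : 0);
                same = same && (ms.active(l) == ac.active(l));
            }
            std::printf("   counters agree: %s\n", same ? "yes" : "no");
        };

        ms.push(outer);     ac.push(outer);     show("if tid<2");    // 1 1 0 0
        ms.push(inner);     ac.push(inner);     show("if tid==0");   // 1 0 0 0
        ms.pop();           ac.pop();           show("end then");    // 1 1 0 0
        ms.push(inner_not); ac.push(inner_not); show("else");        // 0 1 0 0
        ms.pop();           ac.pop();           show("end else");    // 1 1 0 0
        ms.pop();           ac.pop();           show("end if");      // 1 1 1 1
        return 0;
    }

For structured control flow the two encodings carry the same information: each counter equals the number of stack entries in which that lane's bit is 0, which is exactly why the "histogram" observation makes a single counter per thread sufficient.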