Design of Digital Circuits Lecture 20: SIMD Processors

Prof. Onur Mutlu
ETH Zurich
Spring 2018
11 May 2018

New Course: Bachelor’s Seminar in Comp Arch
- Fall 2018
- 2 credit units
- Rigorous seminar on fundamental and cutting-edge topics in computer architecture
- Critical presentation, review, and discussion of seminal works in computer architecture
  - We will cover many ideas & issues, analyze their tradeoffs, perform critical thinking and brainstorming
- Participation, presentation, report and review writing
- Stay tuned for more information

For the Curious: New Rowhammer Attack
- Another Rowhammer-based attack disclosed yesterday

Last Week’s Attack
- Using an integrated GPU in a mobile system to remotely escalate privilege via the WebGL interface

More to Come …
- Onur Mutlu, "The RowHammer Problem and Other Issues We May Face as Memory Becomes Denser," Invited Paper in Proceedings of the Design, Automation, and Test in Europe Conference (DATE), Lausanne, Switzerland, March 2017.
  https://people.inf.ethz.ch/omutlu/pub/rowhammer-and-other-memory-issues_date17.pdf

Agenda for Today & Next Few Lectures
- Single-cycle Microarchitectures
- Multi-cycle and Microprogrammed Microarchitectures
- Pipelining
- Issues in Pipelining: Control & Data Dependence Handling, State Maintenance and Recovery, …
- Out-of-Order Execution
- Other Execution Paradigms

Readings for Today
- Peleg and Weiser, “MMX Technology Extension to the Intel Architecture,” IEEE Micro 1996.
- Lindholm et al., "NVIDIA Tesla: A Unified Graphics and Computing Architecture," IEEE Micro 2008.
Other Approaches to Concurrency (or Instruction Level Parallelism)

Approaches to (Instruction-Level) Concurrency
- Pipelining
- Out-of-order execution
- Dataflow (at the ISA level)
- Superscalar Execution
- VLIW
- Fine-Grained Multithreading
- SIMD Processing (Vector and array processors, GPUs)
- Decoupled Access Execute
- Systolic Arrays

SIMD Processing: Exploiting Regular (Data) Parallelism

Flynn’s Taxonomy of Computers
- Mike Flynn, “Very High-Speed Computing Systems,” Proc. of IEEE, 1966
- SISD: Single instruction operates on single data element
- SIMD: Single instruction operates on multiple data elements
  - Array processor
  - Vector processor
- MISD: Multiple instructions operate on single data element
  - Closest form: systolic array processor, streaming processor
- MIMD: Multiple instructions operate on multiple data elements (multiple instruction streams)
  - Multiprocessor
  - Multithreaded processor

Data Parallelism
- Concurrency arises from performing the same operation on different pieces of data
  - Single instruction multiple data (SIMD)
  - E.g., dot product of two vectors
- Contrast with data flow
  - Concurrency arises from executing different operations in parallel (in a data driven manner)
- Contrast with thread (“control”) parallelism
  - Concurrency arises from executing different threads of control in parallel
- SIMD exploits operation-level parallelism on different data
  - Same operation concurrently applied to different pieces of data
  - A form of ILP where instruction happens to be the same across data

SIMD Processing
- Single instruction operates on multiple data elements
  - In time or in space
- Multiple processing elements
- Time-space duality
  - Array processor: Instruction operates on multiple data elements at the same time using different spaces
  - Vector processor: Instruction operates on multiple data elements in consecutive time steps using the same space

Array vs. Vector Processors

Instruction stream:
    LD  VR ← A[3:0]
    ADD VR ← VR, 1
    MUL VR ← VR, 2
    ST  A[3:0] ← VR

ARRAY PROCESSOR (time runs down, space runs across)
    Time 1:  LD0  LD1  LD2  LD3
    Time 2:  AD0  AD1  AD2  AD3
    Time 3:  MU0  MU1  MU2  MU3
    Time 4:  ST0  ST1  ST2  ST3
    Same op @ same time; different ops @ same space

VECTOR PROCESSOR (time runs down, space runs across)
    Time 1:  LD0
    Time 2:  LD1  AD0
    Time 3:  LD2  AD1  MU0
    Time 4:  LD3  AD2  MU1  ST0
    Time 5:       AD3  MU2  ST1
    Time 6:            MU3  ST2
    Time 7:                 ST3
    Different ops @ time; same op @ space

SIMD Array Processing vs. VLIW
- VLIW: Multiple independent operations packed together by the compiler

SIMD Array Processing vs. VLIW
- Array processor: Single operation on multiple (different) data elements

Vector Processors (I)
- A vector is a one-dimensional array of numbers
- Many scientific/commercial programs use vectors
    for (i = 0; i<=49; i++) C[i] = (A[i] + B[i]) / 2
- A vector processor is one whose instructions operate on vectors rather than scalar (single data) values
- Basic requirements
  - Need to load/store vectors → vector registers (contain vectors)
  - Need to operate on vectors of different lengths → vector length register (VLEN)
  - Elements of a vector might be stored apart from each other in memory → vector stride register (VSTR)
    - Stride: distance in memory between two elements of a vector

Vector Processors (II)
- A vector instruction performs an operation on each element in consecutive cycles
  - Vector functional units are pipelined
  - Each pipeline stage operates on a different data element
- Vector instructions allow deeper pipelines
  - No intra-vector dependencies → no hardware interlocking needed within a vector
  - No control flow within a vector
  - Known stride allows easy address calculation for all vector elements
    - Enables prefetching of vectors into registers/cache/memory

Vector Processor Advantages
+ No dependencies within a vector
  - Pipelining & parallelization work really well
  - Can have very deep pipelines, no dependencies!
+ Each instruction generates a lot of work
  - Reduces instruction fetch bandwidth requirements
+ Highly regular memory access pattern
+ No need to explicitly code loops
  - Fewer branches in the instruction sequence

Vector Processor Disadvantages
-- Works (only) if parallelism is regular (data/SIMD parallelism)
   ++ Vector operations
   -- Very inefficient if parallelism is irregular
      -- How about searching for a key in a linked list?
Fisher, “Very Long Instruction Word architectures and the ELI-512,” ISCA 1983.

Vector Processor Limitations
-- Memory (bandwidth) can easily become a bottleneck, especially if
   1. compute/memory operation balance is not maintained
   2. data is not mapped appropriately to memory banks

Vector Processing in More Depth

Vector Registers
- Each vector data register holds N M-bit values
- Vector control registers: VLEN, VSTR, VMASK
- Maximum VLEN can be N
  - Maximum number of elements stored in a vector register
- Vector Mask Register (VMASK)
  - Indicates which elements of vector to operate on
  - Set by vector test instructions
    - e.g., VMASK[i] = (Vk[i] == 0)
[Figure: two vector registers, V0 and V1, each M bits wide, with elements V0,0 … V0,N-1 and V1,0 … V1,N-1]

Vector Functional Units
- Use a deep pipeline to execute element operations → fast clock cycle
- Control of deep pipeline is simple because elements in vector are independent
[Figure: six-stage multiply pipeline computing V1 * V2 → V3]
Slide credit: Krste Asanovic

Vector Machine Organization (CRAY-1)
- CRAY-1
- Russell, “The CRAY-1 computer system,” CACM 1978.
- Scalar and vector modes
- 8 64-element vector registers
- 64 bits per element
- 16 memory banks
- 8 64-bit scalar registers
- 8 24-bit address registers

CRAY X-MP-28 @ ETH (CAB, E Floor)

CRAY X-MP System Organization
[Figure: CRAY X-MP system organization]
Cray Research Inc., “The CRAY X-MP Series of Computer Systems,” 1985

CRAY X-MP Design Detail

Mainframe
CRAY X-MP single- and multiprocessor systems are designed to offer users outstanding performance on large-scale, compute-intensive and I/O-bound jobs.

CRAY X-MP mainframes consist of six (X-MP/1), eight (X-MP/2) or twelve (X-MP/4) vertical columns arranged in an arc. Power supplies and cooling are clustered around the base and extend outward.

[Table: Model, Number of CPUs, Memory size (millions of 64-bit words), Number of banks; rows for CRAY X-MP/416, X-MP/48, X-MP/216, X-MP/28, X-MP/24, X-MP/18, X-MP/14, X-MP/12, X-MP/11]

A description of the major system components and their functions follows.

CPU computation section
Within the computation section of each CPU are operating registers, functional units and an instruction control network: hardware elements that cooperate in executing sequences of instructions. The instruction control network makes all decisions related to instruction issue as well as coordinating the three types of processing within each CPU: vector, scalar and address. Each of the processing modes has its associated registers and functional units.

The block diagram of a CRAY X-MP/4 (opposite page) illustrates the relationship of the registers to the functional units, instruction buffers, I/O channel control registers, interprocessor communications section and memory. For multiple-processor CRAY X-MP models, the interprocessor communications section coordinates processing between CPUs, and central memory is shared.

Registers
The basic set of programmable registers is composed of:
- Eight 24-bit address (A) registers
- Sixty-four 24-bit intermediate address (B) registers
- Eight 64-bit scalar (S) registers
- Sixty-four 64-bit scalar-save (T) registers
- Eight 64-element (4096-bit) vector (V) registers with 64 bits per element

The 24-bit A registers are generally used for addressing and counting operations. Associated with them are 64 B registers, also 24 bits wide. Since the transfer between an A and a B register takes only one clock period, the B registers assume the role of data cache, storing information for fast access without tying up the A registers for relatively long periods.

Cray Research Inc., “The CRAY X-MP Series of Computer Systems,” 1985

CRAY X-MP CPU Functional Units

... shared registers for interprocessor communication and synchronization. Each cluster of shared registers consists of eight 24-bit shared address (SB) registers, eight 64-bit shared scalar (ST) registers and ...

... cause it to switch from user to monitor mode. Additionally, each processor in a cluster can asynchronously perform scalar or vector operations dictated by user ...

... 128 16-bit instruction parcels, twice the capacity of the CRAY-1 instruction buffer. The instruction buffers of each CPU are loaded from memory at the burst rate of eight words per clock period.

Cray Research Inc., “The CRAY X-MP Series of Computer Systems,” 1985
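
The vector-register model from the Vector Processors (I) and Vector Registers slides can be illustrated with a small software sketch. The Python below is a hypothetical toy model, not the CRAY-1 or any real ISA: the class name `VectorUnit`, its method names, and the flat `mem` list are assumptions made for illustration. It runs the slide's loop `C[i] = (A[i] + B[i]) / 2` for 50 elements as a handful of vector operations, with VLEN set to 50, VSTR giving the element spacing in memory, and VMASK selecting which elements are written. Integer division stands in for the `/ 2`.

```python
MAX_VLEN = 64  # N: elements per vector register (64, as on the CRAY-1 slide)


class VectorUnit:
    """Toy model of the VLEN, VSTR and VMASK control registers (hypothetical API)."""

    def __init__(self):
        self.vlen = MAX_VLEN             # how many elements each instruction touches
        self.vstr = 1                    # stride: distance in memory between elements
        self.vmask = [True] * MAX_VLEN   # which elements an operation applies to

    def set_vlen(self, n):
        if not 0 <= n <= MAX_VLEN:
            raise ValueError("VLEN cannot exceed the register size N")
        self.vlen = n

    def load(self, mem, base):
        # A known stride makes every element address simply base + i * VSTR.
        return [mem[base + i * self.vstr] for i in range(self.vlen)]

    def store(self, mem, base, vreg):
        for i in range(self.vlen):
            if self.vmask[i]:
                mem[base + i * self.vstr] = vreg[i]

    def apply(self, op, va, vb, vdest):
        # Elements are independent, so hardware could pipeline or parallelize this.
        # Masked-off elements keep the old destination value.
        return [op(va[i], vb[i]) if self.vmask[i] else vdest[i]
                for i in range(self.vlen)]

    def test_zero(self, vreg):
        # Vector test instruction from the slide: VMASK[i] = (Vk[i] == 0)
        for i in range(self.vlen):
            self.vmask[i] = (vreg[i] == 0)


def vector_average(a, b):
    """Compute C[i] = (A[i] + B[i]) // 2 with vector operations, no scalar loop."""
    n = len(a)
    mem = list(a) + list(b) + [0] * n    # assumed layout: A at 0, B at n, C at 2n
    vu = VectorUnit()
    vu.set_vlen(n)                       # n must fit in one register (n <= 64)
    v0 = vu.load(mem, 0)
    v1 = vu.load(mem, n)
    zeros = [0] * n
    v2 = vu.apply(lambda x, y: x + y, v0, v1, zeros)
    v3 = vu.apply(lambda x, y: x // y, v2, [2] * n, zeros)
    vu.store(mem, 2 * n, v3)
    return mem[2 * n:]


if __name__ == "__main__":
    a = [2 * i for i in range(50)]
    b = [4 * i for i in range(50)]
    print(vector_average(a, b)[:5])  # [0, 3, 6, 9, 12]
```

Setting `vstr = 2` before a `load` picks up every other memory word, which is the case the VSTR register exists for. Vectors longer than `MAX_VLEN` would need to be processed in several pieces; this sketch does not attempt that.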

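
The time-space duality on the Array vs. Vector Processors slide can also be generated programmatically. The sketch below is an illustration under simplifying assumptions: every operation takes one time step, the vector processor has one functional unit per operation, and each unit starts an element one step after the previous unit finished it. The function names `array_schedule` and `vector_schedule` are hypothetical. Each returned row is one time step, listing the operation/element pairs active in that step.

```python
def array_schedule(ops, n):
    """Array processor: in each time step, one operation runs on all n elements."""
    return [[f"{op}{e}" for e in range(n)] for op in ops]


def vector_schedule(ops, n):
    """Vector processor: each functional unit handles one element per time step.

    Unit s (the s-th operation) works on element t - s at time step t, so the
    units form a pipeline that fills, runs full, and then drains.
    """
    steps = []
    for t in range(n + len(ops) - 1):
        row = []
        for s, op in enumerate(ops):
            e = t - s
            if 0 <= e < n:
                row.append(f"{op}{e}")
        steps.append(row)
    return steps


if __name__ == "__main__":
    ops = ["LD", "AD", "MU", "ST"]   # LD, ADD, MUL, ST from the slide
    for name, sched in (("array", array_schedule(ops, 4)),
                        ("vector", vector_schedule(ops, 4))):
        print(name)
        for t, row in enumerate(sched, start=1):
            print(f"  time {t}: {' '.join(row)}")
```

For four elements and four operations, the array schedule takes 4 time steps using 4 processing elements, while the vector schedule takes 4 + 4 - 1 = 7 time steps using one unit per operation type.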