Computer Hardware Engineering
IS1200, spring 2016
Lecture 13: SIMD, MIMD, and Parallel Programming
David Broman
Associate Professor, KTH Royal Institute of Technology
Assistant Research Engineer, University of California, Berkeley
Slides version 1.0

Slide 2: Course Structure
Module 1: C and Assembly Programming; Module 2: I/O Systems; Module 3: Logic Design; Module 4: Processor Design; Module 5: Memory Hierarchy; Module 6: Parallel Processors and Programs (this lecture). [Figure: course map showing the lectures (LE), exercises (EX), labs (LAB), seminars (S), and the project (PROJ) belonging to each module.]

Slide 3: Abstractions in Computer Systems
A computer system is built from the following layers, from networked systems down to physics:
Networked systems and systems of systems
Software: application software, operating system
Hardware/software interface: instruction set architecture
Digital hardware design: microarchitecture, logic and building blocks, digital circuits
Analog design and physics: analog circuits, devices and physics

Slide 4: Agenda
Part I: SIMD, Multithreading, and GPUs (data-level parallelism, DLP)
Part II: MIMD, Multicore, and Clusters (task-level parallelism, TLP)
Part III: Parallelization in Practice (DLP + TLP)

Slide 5: Part I: SIMD, Multithreading, and GPUs (DLP)
Acknowledgement: the structure and several of the good examples are derived from the book "Computer Organization and Design" (2014) by David A. Patterson and John L. Hennessy.

Slide 6: SISD, SIMD, and MIMD (Revisited)
Architectures are classified by instruction stream (single or multiple) and data stream (single or multiple):
SISD (single instruction, single data), e.g., Intel Pentium 4.
SIMD (single instruction, multiple data), e.g., SSE in x86. Data-level parallelism; examples are multimedia extensions (e.g., SSE, Streaming SIMD Extensions) and vector processors.
MISD (multiple instruction, single data): no examples today.
MIMD (multiple instruction, multiple data), e.g., Intel Core i7. Task-level parallelism; examples are multicore and cluster computers.
Graphical Processing Units (GPUs) are both SIMD and MIMD.

Slide 7: Subword Parallelism and Multimedia Extensions
Subword parallelism is when a wide data word is operated on in parallel; this is the same as SIMD or data-level parallelism. One instruction operates on multiple data items, e.g., four 32-bit values packed into one 128-bit word.
MMX (MultiMedia eXtension): the first SIMD extension by Intel, in Pentium processors (introduced 1997). Integers only.
3DNow!: AMD's extension, which included single-precision floating point (1998).
SSE (Streaming SIMD Extension): introduced by Intel in the Pentium III (1999). Included single-precision FP.
AVX (Advanced Vector Extension): supported by both Intel and AMD (processors available in 2011). Added support for 256-bit registers and double-precision FP.
NEON: multimedia extension for ARMv7 and ARMv8 (32 registers 8 bytes wide, or 16 registers 16 bytes wide).
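As an illustration (not from the slides), here is a minimal C sketch of subword parallelism using AVX compiler intrinsics. It assumes GCC or Clang on an AVX-capable x86 machine, compiled with -mavx; _mm256_add_pd compiles to the vaddpd instruction discussed on the next slide.

```c
#include <immintrin.h>  /* AVX compiler intrinsics */
#include <stdio.h>

int main(void) {
    /* Pack four 64-bit doubles into one 256-bit AVX register each
       (_mm256_set_pd takes its arguments from highest lane to lowest). */
    __m256d a = _mm256_set_pd(4.0, 3.0, 2.0, 1.0);
    __m256d b = _mm256_set_pd(40.0, 30.0, 20.0, 10.0);

    /* One instruction adds all four pairs in parallel (subword parallelism). */
    __m256d sum = _mm256_add_pd(a, b);

    double out[4];
    _mm256_storeu_pd(out, sum);  /* store the packed result to memory */
    printf("%.1f %.1f %.1f %.1f\n", out[0], out[1], out[2], out[3]);
    return 0;
}
```

The program prints 11.0 22.0 33.0 44.0: four double-precision additions performed by a single instruction.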
Slide 8: Streaming SIMD Extension (SSE) and Advanced Vector Extension (AVX)
In SSE (and the later version SSE2), assembly instructions use a two-operand format:
addpd %xmm0, %xmm4
meaning %xmm4 = %xmm4 + %xmm0. Note the reversed operand order (Intel assembly in general). Registers (e.g., %xmm4) are 128 bits in SSE/SSE2. "pd" means packed double-precision FP: the instruction operates on as many FP values as fit in the register.
AVX introduced a three-operand format, added the "v" for vector to distinguish AVX from SSE, and renamed the registers to %ymm, which are now 256 bits:
vaddpd %ymm0, %ymm1, %ymm4
meaning %ymm4 = %ymm0 + %ymm1.
vmovapd %ymm4, (%r11)
moves the result to the memory address stored in %r11 (a 64-bit register), storing the four 64-bit FP values in consecutive order in memory.

Slide 9: Vector Processors
An older but elegant form of SIMD, dating back to the vector supercomputers of the 1970s (Seymour Cray). Vector processors have large vector registers, e.g., 32 vector registers, each holding 64 64-bit elements.
[Figure: sequential code in a loop (MUL element 1, stall, ADD element 1, loop branch, MUL element 2, ...) stalls between dependent instructions for every element. In a vector processor, operations are pipelined (MUL elements 1, 2, 3, ... followed by ADD elements 1, 2, 3, ...), so there is a stall only once per vector operation.]
Efficient use of memory and low instruction bandwidth give good energy characteristics.

Slide 10: Recall the Idea of a Multi-Issue Uniprocessor
A uniprocessor executes only one hardware thread (context switching must be done in software). Typically, all functional units cannot be fully utilized in a single-threaded program. [Figure: issue slots 1-3 over time for threads A, B, and C; white space marks unused slots/functional units.]

Slide 11: Hardware Multithreading
In a multithreaded processor, several hardware threads share the same functional units. The purpose of multithreading is to hide latencies and avoid stalls due to cache misses etc.
Coarse-grained multithreading switches threads only at costly stalls, e.g., last-level cache misses. It cannot overcome the throughput losses of short stalls.
Fine-grained multithreading switches between hardware threads every cycle, giving better utilization.

Slide 12: Simultaneous Multithreading (SMT)
Simultaneous multithreading (SMT) combines multithreading with a multiple-issue, dynamically scheduled pipeline. It can fill the issue slots that multiple issue cannot utilize with cycles from other hardware threads, and thus achieves better utilization.
Example: Hyper-Threading is Intel's name for, and implementation of, SMT. That is why a processor can have 2 real cores while the OS shows 4 cores (4 hardware threads).
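As a small illustration (not from the slides), a C sketch that asks the OS how many hardware threads are online; _SC_NPROCESSORS_ONLN is a widely supported POSIX extension (Linux, macOS). On a 2-core Intel processor with Hyper-Threading it typically reports 4.

```c
#include <stdio.h>
#include <unistd.h>

int main(void) {
    /* Number of logical processors (hardware threads) visible to the OS.
       With SMT, this is larger than the number of physical cores. */
    long hw_threads = sysconf(_SC_NPROCESSORS_ONLN);
    printf("online hardware threads: %ld\n", hw_threads);
    return 0;
}
```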
Slide 13: Graphical Processing Units (GPUs)
A Graphical Processing Unit (GPU) utilizes multithreading, MIMD, SIMD, and ILP. The main form of parallelism that can be exploited is data-level parallelism.
CUDA (Compute Unified Device Architecture) is a parallel computing platform and programming model from NVIDIA. All parallelism is expressed as CUDA threads; the model is therefore also called Single Instruction Multiple Thread (SIMT).
A GPU consists of a set of multithreaded SIMD processors (called streaming multiprocessors in NVIDIA terminology), for instance 16 such processors. The main idea is to execute a massive number of threads and to use multithreading to hide latency. However, the latest GPUs also include caches.

Slide 14: Part II: MIMD, Multicore, and Clusters (TLP)
Acknowledgement: the structure and several of the good examples are derived from the book "Computer Organization and Design" (2014) by David A. Patterson and John L. Hennessy.

Slide 15: Shared Memory Multiprocessor (SMP)
A Shared Memory Multiprocessor (SMP) has a single physical address space across all processors. An SMP is almost always the same as a multicore processor. Processors (cores) in an SMP communicate via shared memory.
In a uniform memory access (UMA) multiprocessor, the latency of accessing memory does not depend on the processor. In a nonuniform memory access (NUMA) multiprocessor, memory is divided among the processors, resulting in different latencies.
[Figure: several processor cores, each with a private L1 cache, sharing an L2 cache and main memory.]

Slide 16: Cache Coherence
Different cores' local caches can result in different cores seeing different values for the same memory address. This is called the cache coherence problem.
[Figure: three time steps with two cores, their caches, and memory.] Time step 1: Core 1 reads memory position X; the value (0) is stored in Core 1's cache. Time step 2: Core 2 reads memory position X; the value is stored in Core 2's cache. Time step 3: Core 1 writes 1 to memory; Core 2 still sees the incorrect value 0.

Slide 17: Snooping Protocol
Cache coherence can be enforced using a cache coherence protocol, for instance a write-invalidate protocol such as the snooping protocol.
[Figure: three time steps with two cores, their caches, and memory.] Core 2 reads memory position X. Core 2 now tries to read the memory position X.
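As an illustration (not from the slides; the thread and variable names are hypothetical), a minimal pthreads sketch of two threads communicating through shared memory, as the cores of an SMP do on slide 15. The atomic flag makes the hand-off well defined in C; in hardware, the coherence protocol is what keeps the two cores' cached copies consistent.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

/* Single physical address space: both threads see these variables. */
static int data;              /* plays the role of "memory position X" */
static atomic_int ready = 0;  /* synchronization flag                  */

static void *producer(void *arg) {
    (void)arg;
    data = 42;                /* like Core 1 writing to X...            */
    atomic_store(&ready, 1);  /* ...then publishing the write           */
    return NULL;
}

int main(void) {
    pthread_t t;
    pthread_create(&t, NULL, producer, NULL);

    /* Like Core 2: wait until the write is visible. The atomic flag
       (sequentially consistent by default) guarantees we read 42 here,
       not a stale value. */
    while (atomic_load(&ready) == 0)
        ;  /* spin */

    printf("data = %d\n", data);
    pthread_join(&t, NULL);
    return 0;
}
```

Compile with cc -pthread. With plain non-atomic variables and no synchronization, the stale-value scenario of slide 16 is exactly the kind of outcome that could be observed.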