Computer Hardware Engineering
IS1200, Spring 2016. Lecture 13: SIMD, MIMD, and Parallel Programming
David Broman, Associate Professor, KTH Royal Institute of Technology; Assistant Research Engineer, University of California, Berkeley
Slides version 1.0
Course Structure
Module 1 (C and Assembly Programming): LE1, LE2, LE3, EX1, LAB1, LE4, S1, LAB2
Module 2 (I/O Systems): LE5, LE6, EX2, LAB3
Module 3 (Logic Design): LE7, LE8, EX3, LD-LAB
Module 4 (Processor Design): LE9, LE10, EX4, S2, LAB4
Module 5 (Memory Hierarchy): LE11, EX5, S3
Module 6 (Parallel Processors and Programs): LE12, LE13, EX6, S4
Project: PROJ START, Proj. Expo, LE14
Abstractions in Computer Systems
Computer System: Networked Systems and Systems of Systems
Software: Application Software and Operating System
Hardware/Software Interface: Instruction Set Architecture
Digital Hardware Design: Microarchitecture, Logic and Building Blocks, and Digital Circuits
Analog Design and Physics: Analog Circuits, and Devices and Physics
Agenda
Part I: SIMD, Multithreading, and GPUs (DLP)
Part II: MIMD, Multicore, and Clusters (TLP)
Part III: Parallelization in Practice (DLP + TLP)
Part I
SIMD, Multithreading, and GPUs
DLP
Acknowledgement: The structure and several of the good examples are derived from the book “Computer Organization and Design” (2014) by David A. Patterson and John L. Hennessy
SISD, SIMD, and MIMD (Revisited)
Flynn's taxonomy classifies architectures by instruction stream and data stream:

Single instruction, single data (SISD): e.g., Intel Pentium 4.
Single instruction, multiple data (SIMD): e.g., SSE in x86. Data-level parallelism; examples are multimedia extensions (e.g., SSE, Streaming SIMD Extension) and vector processors.
Multiple instruction, single data (MISD): no examples today.
Multiple instruction, multiple data (MIMD): e.g., Intel Core i7. Task-level parallelism; examples are multicore and cluster computers.

Graphical Processing Units (GPUs) are both SIMD and MIMD.
Subword Parallelism and Multimedia Extensions
Subword parallelism is when a wide data word is operated on in parallel. This is the same as SIMD or data-level parallelism: one instruction operates on multiple data items, e.g., one instruction operating on four 32-bit data items at once.
MMX (MultiMedia eXtension): the first SIMD extension by Intel, in Pentium processors (introduced 1997). Integers only.
3DNow!: by AMD, included single-precision floating point (1998).
SSE (Streaming SIMD Extension): introduced by Intel in the Pentium III (1999). Included single-precision FP.
AVX (Advanced Vector Extension): supported by both Intel and AMD (processors available in 2011). Added support for 256 bits and double-precision FP.
NEON: multimedia extension for ARMv7 and ARMv8 (32 registers 8 bytes wide, or 16 registers 16 bytes wide).
Streaming SIMD Extension (SSE) and Advanced Vector Extension (AVX)
In SSE (and the later version SSE2), assembly instructions use a two-operand format:

    addpd %xmm0, %xmm4    # meaning: %xmm4 = %xmm4 + %xmm0

Note the reversed operand order, with the destination last (GNU/AT&T syntax, the reverse of Intel's manuals). Registers (e.g., %xmm4) are 128 bits in SSE/SSE2. The suffix "pd" means packed double-precision FP: one instruction operates on as many FP values as fit in the register.

AVX added the "v" for vector to distinguish AVX instructions from SSE, and renamed the registers to %ymm, which are now 256 bits. AVX also introduced a three-operand format:

    vaddpd %ymm0, %ymm1, %ymm4    # meaning: %ymm4 = %ymm0 + %ymm1
    vmovapd %ymm4, (%r11)         # move the result to the memory address stored
                                  # in %r11 (a 64-bit register), storing the four
                                  # 64-bit FP values in consecutive order in memory
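From C, these AVX instructions are typically reached through compiler intrinsics rather than hand-written assembly. Below is a minimal sketch (not from the lecture) that adds four doubles with a single 256-bit operation; it assumes a compiler with AVX support (e.g., gcc -mavx):

    #include <immintrin.h>   /* AVX intrinsics */

    /* Add two arrays of four doubles using one 256-bit AVX addition. */
    void add4(const double *a, const double *b, double *c) {
        __m256d va = _mm256_loadu_pd(a);     /* load 4 doubles into a ymm register */
        __m256d vb = _mm256_loadu_pd(b);
        __m256d vc = _mm256_add_pd(va, vb);  /* compiles to a vaddpd instruction */
        _mm256_storeu_pd(c, vc);             /* store 4 doubles back to memory */
    }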
Vector Processors
An older, but elegant, version of SIMD. Dates back to the vector supercomputers of the 1970s (Seymour Cray). Vector processors have large vector registers, e.g., 32 vector registers, each holding 64 64-bit elements.

In sequential loop code (MUL, ADD, branch for element 1, then element 2, etc.), the pipeline stalls between dependent instructions for every element. In a vector processor, the operations (MUL on element 0, 1, 2, ..., then ADD on element 0, 1, 2, ...) are pipelined, so it stalls only once per vector operation. Efficient use of memory and low instruction bandwidth give good energy characteristics.
Recall the Idea of a Multi-Issue Uniprocessor
Consider three threads, A, B, and C, and a processor with three issue slots. A multi-issue uniprocessor executes only one hardware thread (context switching must be done in software). Typically, all functional units cannot be fully utilized in a single-threaded program (in the issue-slot diagram, white space is an unused slot/functional unit).
Hardware Multithreading
In a multithreaded processor, several hardware threads share the same functional units. The purpose of multithreading is to hide latencies and avoid stalls due to cache misses etc.

Coarse-grained multithreading switches threads only at costly stalls, e.g., last-level cache misses. It cannot overcome throughput losses from short stalls.

Fine-grained multithreading switches between hardware threads every cycle, giving better utilization.
Simultaneous multithreading (SMT)
Simultaneous multithreading (SMT) combines multithreading with a multiple-issue, dynamically scheduled pipeline. It can fill the holes that multiple issue alone cannot utilize with cycles from other hardware threads, and thus achieves better utilization.

Example: Hyper-Threading is Intel's name for and implementation of SMT. That is why a processor can have 2 real cores while the OS shows 4 cores (4 hardware threads).
Graphical Processing Units (GPUs)
A Graphical Processing Unit (GPU) utilizes multithreading, MIMD, SIMD, and ILP. The main form of parallelism that can be used is data-level parallelism.
CUDA (Compute Unified Device Architecture) is a parallel computing platform and programming model from NVIDIA.
All parallelism is expressed as CUDA threads. Therefore, the model is also called Single Instruction Multiple Thread (SIMT).

A GPU consists of a set of multithreaded SIMD processors (called streaming multiprocessors in NVIDIA terms), for instance 16 such processors.

The main idea is to execute a massive number of threads and to use multithreading to hide latency. However, the latest GPUs also include caches.
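To give a flavor of the SIMT model, here is a minimal CUDA sketch (an illustrative example, not from the lecture): every CUDA thread runs the same scalar kernel on its own data element, and the hardware groups the threads for SIMD execution.

    /* CUDA kernel: one thread per element. */
    __global__ void add(int n, const float *a, const float *b, float *c) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  /* this thread's element */
        if (i < n)               /* guard: n may not be a multiple of the block size */
            c[i] = a[i] + b[i];
    }

    /* Launch with, e.g., 256 threads per block:
       add<<<(n + 255) / 256, 256>>>(n, a, b, c); */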
Part II
MIMD, Multicore, and Clusters
TLP
Acknowledgement: The structure and several of the good examples are derived from the book “Computer Organization and Design” (2014) by David A. Patterson and John L. Hennessy
Shared Memory Multiprocessor (SMP)
A Shared Memory Multiprocessor (SMP) has a single physical address space across all processors.
An SMP is almost always the same as a multicore processor.
In a uniform memory access (UMA) multiprocessor, the latency of accessing memory does not depend on the processor. In a nonuniform memory access (NUMA) multiprocessor, memory can be divided between processors, resulting in different latencies.

In a typical multicore SMP, each processor core has its own L1 cache, while the L2 cache and the memory are shared. The processors (cores) in an SMP communicate via the shared memory.
Cache Coherence
Different cores’ local caches can result in different cores seeing different values for the same memory address. This is called the cache coherency problem.
Time step 1: Core 1 reads memory position X. The value (0) is stored in Core 1’s cache.
Time step 2: Core 2 reads memory position X. The value (0) is stored in Core 2’s cache.
Time step 3: Core 1 writes 1 to memory position X. Core 2 still sees the incorrect value 0 in its cache.
Snooping Protocol
Cache coherence can be enforced using a cache coherence protocol, for instance a write-invalidate protocol such as a snooping protocol.
Time step 1: Core 2 reads memory position X. The value (0) is stored in Core 2’s cache.
Time step 2: Core 1 writes (1) to memory, which invalidates the corresponding cache line of the other processors.
Time step 3: When Core 2 now tries to read the variable, it gets a cache miss and loads the new value (1) from memory. (Heavily simplified example.)
False Sharing
Assume that Core 1 and Core 2 share a cache line Z.
Core 1 reads and writes to X, and Core 2 reads and writes to Y, where both X and Y reside in cache line Z. Every write will make the cache coherence protocol invalidate the other core’s copy of the cache line, even though neither core is interested in the other one’s variable! This is called false sharing.
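A common remedy is to pad or align per-thread data so that each variable gets its own cache line. A minimal C11 sketch (an assumed example, not from the slides; the 64-byte line size is an assumption that holds for most current x86 processors):

    #include <stdalign.h>

    /* Each counter is aligned to, and therefore fills, its own 64-byte cache line. */
    typedef struct {
        alignas(64) long value;
    } padded_counter;

    padded_counter counters[2];  /* counters[0] and counters[1] never share a line,
                                    so updates from two cores cause no false sharing */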
Processes, Threads, and Cores
A modern operating system (OS) can execute several processes in parallel on different cores, and concurrent threads within a process are scheduled by the OS onto the cores.

Each process can have N concurrent threads. Note: all threads share the process context, including virtual memory etc.

A process context includes its own virtual memory space, I/O files, read-only code, heap, shared libraries, process ID (PID), etc. The thread context includes thread ID, stack, stack pointer, program counter, etc.

Hands-on: Activity Monitor.
Programming with Threads and Shared Variables
POSIX threads (pthreads) is a common way of programming concurrency and utilizing multicores for parallel computation.
The pthreads API is included with #include <pthread.h>.
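A minimal sketch of creating and joining a thread (an illustrative example, not the hands-on code shown in the lecture); compile with, e.g., gcc prog.c -pthread:

    #include <pthread.h>
    #include <stdio.h>

    /* Function executed by the new thread. */
    void *hello(void *arg) {
        printf("hello from thread %ld\n", (long)arg);
        return NULL;
    }

    int main(void) {
        pthread_t tid;
        pthread_create(&tid, NULL, hello, (void *)1L);  /* start a new thread */
        pthread_join(tid, NULL);                        /* wait for it to finish */
        return 0;
    }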
Semaphores
A semaphore is a global variable that can hold a nonnegative integer value. It can only be changed by the following two operations:

P(s): if s > 0, then decrement s and return. If s = 0, then wait until s > 0, then decrement s and return.
V(s): increment s.

Note that the test and decrement of P(s) and the increment of V(s) must be atomic, meaning that each appears to happen “instantaneously”.

Semaphores were invented by Edsger Dijkstra, who was originally from the Netherlands. P and V are supposed to come from the Dutch words proberen (to test) and verhogen (to increment).
Mutex
A semaphore can be used for mutual exclusion, meaning that only one thread at a time can access a particular resource. Such a binary semaphore is called a mutex.

A global binary semaphore is initialized to 1:

    semaphore s = 1

One or more threads then execute code that needs to be protected:

    P(s);
    /* code to be protected (the critical section) */
    V(s);

P(s), also called wait(s), checks whether the semaphore is nonzero. If so, it locks the mutex; otherwise it waits. In the critical section, it is ensured that no more than one thread can execute the code at the same time. V(s), also called post, unlocks the mutex and increments the semaphore.
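With POSIX semaphores, P and V correspond to sem_wait and sem_post. A minimal sketch (an assumed example, not the lecture’s hands-on code) where two threads safely increment a shared counter:

    #include <pthread.h>
    #include <semaphore.h>
    #include <stdio.h>

    sem_t s;            /* binary semaphore used as a mutex */
    long counter = 0;   /* shared variable */

    void *worker(void *arg) {
        for (int i = 0; i < 1000000; i++) {
            sem_wait(&s);   /* P(s): enter the critical section */
            counter++;      /* protected update */
            sem_post(&s);   /* V(s): leave the critical section */
        }
        return NULL;
    }

    int main(void) {
        sem_init(&s, 0, 1);  /* initialize the binary semaphore to 1 */
        pthread_t t1, t2;
        pthread_create(&t1, NULL, worker, NULL);
        pthread_create(&t2, NULL, worker, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("counter = %ld\n", counter);  /* always 2000000 */
        return 0;
    }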
Programming with Threads and Shared Variables with Semaphores
Hands-on: Show example.
Correct solution:

Hands-on: Show example.

Clusters and Warehouse-Scale Computers
A cluster is a set of computers connected over a local area network (LAN). It may be viewed as one large multiprocessor. Warehouse-scale computers are very large clusters that can include 100 000 servers acting as one giant computer (e.g., at Facebook, Google, Apple).

Clusters (computers 1 to N) do not communicate over shared memory (as in an SMP), but using message passing.

The MapReduce and Hadoop frameworks are popular for batch processing:
1. Map applies a programmer-defined function to all data items.
2. Reduce collects the output and collapses the data using another programmer-defined function.
Both tasks are highly parallel.
Part III
Parallelization in Practice
DLP + TLP
General Matrix Multiplication (GEMM)
Simple matrix multiplication in C. The function uses the matrix size n as a parameter and stores each matrix in a single-dimensional array (element (i, j) at index i + j*n, i.e., column-major order) for performance.
    void dgemm(int n, double* A, double* B, double* C) {
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < n; ++j) {
                double cij = C[i+j*n];            /* cij = C[i][j] */
                for (int k = 0; k < n; k++)
                    cij += A[i+k*n] * B[k+j*n];   /* cij += A[i][k] * B[k][j] */
                C[i+j*n] = cij;
            }
    }
Hands-on: Show example.
Parallelizing GEMM
Unoptimized: the unoptimized C version (previous page), using one core. 1.7 GigaFLOPS (32x32); 0.8 GigaFLOPS (960x960).

SIMD: use the AVX instructions vaddpd and vmulpd to do 4 double-precision floating-point operations in parallel. 6.4 GigaFLOPS (32x32); 2.5 GigaFLOPS (960x960).

ILP: AVX + unrolling parts of the loop, so that the multiple-issue, out-of-order processor has more instructions to schedule. 14.6 GigaFLOPS (32x32); 5.1 GigaFLOPS (960x960).

Cache: AVX + unrolling + blocking (dividing the problem into submatrices), which avoids cache misses. 13.6 GigaFLOPS (32x32); 12.0 GigaFLOPS (960x960).

Multicore: AVX + unrolling + blocking + multiple cores (multithreading using OpenMP). 23 GigaFLOPS (960x960, 2 cores); 44 GigaFLOPS (960x960, 4 cores); 174 GigaFLOPS (960x960, 16 cores).

Experiment by P&H on a 2.6 GHz Intel Core i7 with Turbo mode turned off. For details, see P&H, 5th edition, sections 3.8, 4.12, 5.14, and 6.12.
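To give a flavor of the multicore step, the outermost loop of the simple dgemm can be parallelized across cores with a single OpenMP pragma (a sketch, not P&H's actual optimized code); compile with, e.g., gcc -fopenmp:

    /* Parallel dgemm sketch: each core computes a share of the i-loop.
       Safe, because different i values write disjoint elements of C. */
    void dgemm_parallel(int n, double* A, double* B, double* C) {
        #pragma omp parallel for
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < n; ++j) {
                double cij = C[i+j*n];
                for (int k = 0; k < n; k++)
                    cij += A[i+k*n] * B[k+j*n];
                C[i+j*n] = cij;
            }
    }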
Future Perspective: MIMD, SIMD, ILP, and Caches

“For x86 computers, we expect to see two additional cores per chip every two years and the SIMD width to double every four years.” Hennessy & Patterson, Computer Architecture: A Quantitative Approach, 5th edition, 2013 (page 263)
We must understand and utilize both MIMD and SIMD to gain maximal speedups in the future, although MIMD (multicore) has gained much more attention lately.
The previous example showed that how we program for ILP and caches also matters significantly.
It’s about time to… relax
Reading Guidelines
Module 6: Parallel Processors and Programs
Lecture 12: Parallelism, Concurrency, Speedup, and ILP
• H&H Chapter 1.8, 3.6, 7.8.3-7.8.5
• P&H5 Chapters 1.7-1.8, 1.10, 4.10, 6.1-6.2, or P&H4 Chapters 1.5-1.6, 1.8, 4.10, 7.1-7.2

Lecture 13: SIMD, MIMD, and Parallel Programming
• H&H Chapter 7.8.6-7.8.9
• P&H5 Chapters 2.11, 3.6, 5.10, 6.3-6.7, or P&H4 Chapters 2.11, 3.6, 5.8, 7.3-7.7

See the course webpage for more information.
Summary
Some key take away points:
• SIMD and GPUs can efficiently parallelize problems that have data-level parallelism.
• MIMD, Multicores, and Clusters can be used to parallelize problems that have task-level parallelism.
• In the future, we should try to combine and use both SIMD and MIMD!
Thanks for listening!