

IS1200, spring 2016 Lecture 13: SIMD, MIMD, and Parallel Programming

David Broman, Associate Professor, KTH Royal Institute of Technology; Research Engineer, University of California, Berkeley

Slides version 1.0


Course Structure

Module 1: C and Assembly Programming (LE1, LE2, LE3, LE4, EX1, LAB1, S1, LAB2)
Module 2: I/O Systems (LE5, LE6, EX2, LAB3)
Module 3: Logic Design (LE7, LE8, EX3, LD-LAB)
Module 4: Processor Design (LE9, LE10, EX4, S2, LAB4)
Module 5: Memory Hierarchy (LE11, EX5, S3)
Module 6: Parallel Processors and Programs (LE12, LE13, EX6, S4)
Project (PROJ START, Proj. Expo, LE14)


Abstractions in Computer Systems

Networked Systems and Systems of Systems
Application Software
Instruction Set Architecture (Hardware/Software)
Microarchitecture
Logic and Building Blocks (Digital Hardware Design)
Digital Circuits
Analog Circuits (Analog Design and Physics)
Devices and Physics


Agenda

Part I: SIMD, Multithreading, and GPUs (DLP)
Part II: MIMD, Multicore, and Clusters (TLP)
Part III: Parallelization in Practice (DLP + TLP)


Part I

SIMD, Multithreading, and GPUs

DLP

Acknowledgement: The structure and several of the good examples are derived from the book “Computer Organization and Design” (2014) by David A. Patterson and John L. Hennessy


SISD, SIMD, and MIMD (Revisited)

Classification by instruction stream (single/multiple) and data stream (single/multiple):

SISD: single instruction, single data. E.g., Intel Pentium 4.
SIMD: single instruction, multiple data. Data-level parallelism. Examples are SIMD extensions (e.g., SSE, Streaming SIMD Extension) and vector processors. Graphics Processing Units (GPUs) are both SIMD and MIMD.
MISD: multiple instruction, single data. No examples today.
MIMD: multiple instruction, multiple data. Task-level parallelism. Examples are multicore (e.g., Intel Core i7) and clusters.

Subword Parallelism and Multimedia Extensions

Subword parallelism is when a wide data word is operated on in parallel. This is the same as SIMD or data-level parallelism: one instruction operates on multiple data items.

For example, one instruction can operate on four 32-bit data items in parallel.

MMX (MultiMedia eXtension), the first SIMD extension by Intel, in Pentium processors (introduced 1997). Integers only.
3DNow! by AMD, included single-precision floating point (1998).
SSE (Streaming SIMD Extension), introduced by Intel in Pentium III (1999). Included single-precision FP.
AVX (Advanced Vector Extension), supported by both Intel and AMD (processors available in 2011). Added support for 256-bit registers and double-precision FP.
NEON, multimedia extension for ARMv7 and ARMv8 (32 registers 8 bytes wide, or 16 registers 16 bytes wide).

Streaming SIMD Extension (SSE) and Advanced Vector Extension (AVX)

In SSE (and the later version SSE2), assembly instructions use a two-operand format.

addpd %xmm0, %xmm4    meaning: %xmm4 = %xmm4 + %xmm0. Note the reversed operand order (Intel assembly in general).

Registers (e.g., %xmm4) are 128 bits in SSE/SSE2. "pd" means packed double-precision FP: the instruction operates on as many FP values as fit in the register. AVX added the "v" prefix (for vector) to distinguish AVX from SSE and renamed the registers to %ymm, which are 256 bits wide.

vaddpd %ymm0, %ymm1, %ymm4    AVX introduced a three-operand format. Meaning: %ymm4 = %ymm0 + %ymm1.
vmovapd %ymm4, (%r11)    Moves the result to the address stored in %r11 (a 64-bit register), storing the four 64-bit FP values in consecutive order in memory.
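For illustration, the same kind of packed double-precision operation can be written in C with AVX intrinsics instead of assembly. This is a minimal sketch (not from the lecture); it assumes a compiler with AVX support, e.g., gcc -mavx:

#include <immintrin.h>
#include <stdio.h>

int main(void){
    double a[4] = {1.0, 2.0, 3.0, 4.0};
    double b[4] = {10.0, 20.0, 30.0, 40.0};
    double r[4];

    __m256d va = _mm256_loadu_pd(a);     /* load four doubles into a 256-bit register */
    __m256d vb = _mm256_loadu_pd(b);
    __m256d vr = _mm256_add_pd(va, vb);  /* one vaddpd: four additions in parallel */
    _mm256_storeu_pd(r, vr);             /* store the four results back to memory */

    printf("%f %f %f %f\n", r[0], r[1], r[2], r[3]);
    return 0;
}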

Vector Processors

An older, but elegant, form of SIMD. Dates back to the vector supercomputers of the 70s (Seymour Cray). Vector processors have large vector registers, e.g., 32 vector registers, each holding 64 64-bit elements.

With ordinary sequential code in a loop (or SIMD extensions), the instructions MUL element 1, MUL element 2, ..., ADD element 1, ADD element 2, ..., branch, etc. are issued with stalls between dependent instructions in every iteration.

In a vector processor, the operations are pipelined: MUL element 1, element 2, element 3, ..., followed by ADD element 1, element 2, element 3, ..., with a stall only once per vector operation. Efficient use of memory bandwidth and low instruction bandwidth give good energy characteristics.

Recall the idea of a multi-issue uniprocessor

Thread A    Thread B    Thread C

Executes only one hardware thread (context switching must be done in software). Typically, all functional units (issue slots 1-3 over time in the figure) cannot be fully utilized in a single-threaded program (white space is an unused slot/functional unit).


Hardware Multithreading

In a multithreaded processor, several hardware threads (Thread A, Thread B, Thread C) share the same functional units. The purpose of multithreading is to hide latencies and avoid stalls due to cache misses etc.

Coarse-grained multithreading switches threads only at costly stalls, e.g., last-level cache misses. It cannot overcome throughput losses from short stalls.

Fine-grained multithreading switches between hardware threads every cycle. Better utilization.


Simultaneous multithreading (SMT)

Simultaneous multithreading (SMT) combines multithreading with a multiple-issue, dynamically scheduled pipeline.

It can fill in the holes that multiple-issue alone cannot utilize with cycles from other hardware threads; thus, better utilization.

Example: Hyper-Threading is Intel's name for its implementation of SMT. That is why a processor can have 2 real cores, while the OS shows 4 cores (4 hardware threads).


Graphics Processing Units (GPUs)

A Graphics Processing Unit (GPU) utilizes multithreading, MIMD, SIMD, and ILP. The main form of parallelism that can be exploited is data-level parallelism.

CUDA (Compute Unified Device Architecture) is a platform and programming model from NVIDIA.

All parallelism is expressed as CUDA threads. Therefore, the model is also called Single Instruction, Multiple Thread (SIMT).

A GPU consists of a set of multithreaded SIMD processors (called streaming multiprocessors in NVIDIA terms), for instance 16 such processors.

The main idea is to execute a massive number of threads and to use multithreading to hide latency. However, the latest GPUs also include caches.


Part II

MIMD, Multicore, and Clusters

TLP

Acknowledgement: The structure and several of the good examples are derived from the book “Computer Organization and Design” (2014) by David A. Patterson and John L. Hennessy


Shared Memory Multiprocessor (SMP)

A Shared Memory Multiprocessor (SMP) has a single physical address space across all processors.

An SMP is almost always the same as a multicore processor.

In a uniform memory access (UMA) multiprocessor, the latency of accessing memory does not depend on which processor makes the access.

In a nonuniform memory access (NUMA) multiprocessor, memory can be divided between processors, resulting in different latencies.

Processors (cores) in an SMP communicate via shared memory. (The figure shows three processor cores, each with its own L1 cache, sharing an L2 cache and main memory.)


Cache Coherence

Different cores' local caches could result in different cores seeing different values for the same memory address. This is called the cache coherence problem.

Time step 1: Core 1 reads memory position X; the value is stored in Core 1's cache.
Time step 2: Core 2 reads memory position X; the value is stored in Core 2's cache.
Time step 3: Core 1 writes to memory; Core 2 now sees the incorrect (stale) value in its cache.


Snooping Protocol

Cache coherence can be enforced using a cache coherence protocol, for instance a write-invalidate protocol such as a snooping protocol.

Time step 1: Core 2 reads memory position X; the value is stored in Core 2's cache.
Time step 2: Core 1 writes to memory, and the write invalidates the caches of the other processors.
Time step 3: When Core 2 now tries to read the variable, it gets a cache miss and loads the new value from memory (heavily simplified example).


False Sharing

Assume that Core 1 and Core 2 share a cache line Z.

Core 1 reads and writes X, and Core 2 reads and writes Y, where both X and Y lie in cache line Z, which is present in both cores' caches.

This will result in the cache coherence protocol invalidating the other core's copy of the cache line, even though neither core is interested in the other one's variable!

This is called false sharing.
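A minimal sketch of how false sharing is often avoided in practice: pad each variable so that it occupies its own cache line. The 64-byte line size, the struct, and the function names below are illustrative assumptions, not from the lecture:

#include <pthread.h>
#include <stdio.h>

#define CACHE_LINE 64                        /* assumed cache-line size in bytes */

struct padded_counter {
    volatile long value;
    char pad[CACHE_LINE - sizeof(long)];     /* keep each counter on its own cache line */
};

static struct padded_counter x, y;           /* without the padding, x and y would likely
                                                share one cache line and cause false sharing */

static void *bump_x(void *arg){ (void)arg; for(long i = 0; i < 10000000; i++) x.value++; return NULL; }
static void *bump_y(void *arg){ (void)arg; for(long i = 0; i < 10000000; i++) y.value++; return NULL; }

int main(void){
    pthread_t t1, t2;
    pthread_create(&t1, NULL, bump_x, NULL);
    pthread_create(&t2, NULL, bump_y, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("x = %ld, y = %ld\n", x.value, y.value);
    return 0;
}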


Processes, Threads, and Cores

A modern operating system (OS) can execute several processes in parallel on different cores. Concurrent threads within a process are scheduled by the OS to execute concurrently.

Note: all threads share the process context.

Each process can have N concurrent threads. A process context includes its own virtual memory space, I/O files, read-only code, heap, shared libraries, process ID (PID), etc. The thread context includes thread ID, stack, stack pointer, program counter, etc.

Hands-on: Activity Monitor.


Programming with Threads and Shared Variables

POSIX threads (pthreads) is a common way of programming concurrency and utilizing multicores for parallel computation.

#include <pthread.h>
#include <stdio.h>

int max;
volatile int counter = 0;

void *count(void *data){
    int i;
    int max = *((int*)data);
    for(i = 0; i < max; i++)
        counter++;              /* unsynchronized update of the shared counter: a data race */
    return NULL;
}

int main(){
    pthread_t tid1, tid2;
    max = 40000;
    pthread_create(&tid1, NULL, count, &max);
    max = 60000;
    pthread_create(&tid2, NULL, count, &max);
    /* the end of the example was cut off on the slide; presumably the
       threads are joined and the counter is printed, e.g.: */
    pthread_join(tid1, NULL);
    pthread_join(tid2, NULL);
    printf("counter = %d\n", counter);
    return 0;
}

Semaphores

A semaphore is a global variable that can hold a nonnegative integer value. It can only be changed by the following two operations:

P(s): If s > 0, then decrement s and return. If s = 0, then wait until s > 0, then decrement s and return.
V(s): Increment s.

Note that the check-and-decrement of P(s) and the increment of V(s) must be atomic, meaning that each appears to happen "instantaneously".

Semaphores were invented by Edsger Dijkstra, who was originally from the Netherlands. P and V are supposed to come from the Dutch words Proberen (to test) and Verhogen (to increment).
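In C, POSIX semaphores (semaphore.h) provide exactly these operations: sem_wait corresponds to P(s) and sem_post to V(s). A minimal single-threaded sketch just to show the API (link with -pthread; error checking omitted):

#include <semaphore.h>
#include <stdio.h>

int main(void){
    sem_t s;
    sem_init(&s, 0, 1);   /* create a semaphore with initial value 1 (0 = shared between threads) */
    sem_wait(&s);         /* P(s): wait until s > 0, then decrement atomically */
    printf("s was nonzero, so we got past P(s)\n");
    sem_post(&s);         /* V(s): increment s atomically */
    sem_destroy(&s);
    return 0;
}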


Mutex

A semaphore can be used for mutual exclusion, meaning that only one thread can access a particular resource at the same time. Such a binary semaphore is called a mutex.

A global binary semaphore is initialized to 1:  semaphore s = 1

One or more threads execute code that needs to be protected:

    P(s);
    ... code to be protected (the critical section) ...
    V(s);

P(s), also called wait(s), checks if the semaphore is nonzero. If so, it locks the mutex; otherwise it waits. Within the critical section, it is ensured that no more than one thread can execute the code at the same time.

V(s), also called post, unlocks the mutex and increments the semaphore.
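pthreads also provides a dedicated mutex type that plays the role of the binary semaphore above. A minimal sketch using pthread_mutex_t (the variable and function names are illustrative; link with -pthread):

#include <pthread.h>
#include <stdio.h>

pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;  /* the mutex starts out unlocked */
int shared = 0;

void *worker(void *arg){
    (void)arg;
    pthread_mutex_lock(&m);     /* like P(s)/wait: blocks if another thread holds the lock */
    shared++;                   /* critical section: at most one thread at a time */
    pthread_mutex_unlock(&m);   /* like V(s)/post: releases the lock */
    return NULL;
}

int main(void){
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("shared = %d\n", shared);
    return 0;
}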

Programming with Threads and Shared Variables with Semaphores

Hands-on: Show example

Programming with Threads and Shared Variables with Semaphores

Correct solution…
Hands-on: Show example
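A minimal sketch of what such a correct solution could look like: the shared counter from the earlier pthread example, with each increment protected by a binary semaphore. This is a reconstruction under those assumptions, not necessarily the exact code shown in the lecture:

#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>

volatile int counter = 0;
sem_t mutex;                         /* binary semaphore protecting counter */

void *count(void *data){
    int max = *((int*)data);
    for(int i = 0; i < max; i++){
        sem_wait(&mutex);            /* P: enter the critical section */
        counter++;
        sem_post(&mutex);            /* V: leave the critical section */
    }
    return NULL;
}

int main(void){
    pthread_t tid1, tid2;
    int max1 = 40000, max2 = 60000;  /* separate arguments, so the threads do not race on them */
    sem_init(&mutex, 0, 1);          /* initial value 1 = binary semaphore (mutex) */
    pthread_create(&tid1, NULL, count, &max1);
    pthread_create(&tid2, NULL, count, &max2);
    pthread_join(tid1, NULL);
    pthread_join(tid2, NULL);
    printf("counter = %d\n", counter);   /* now always 100000 */
    return 0;
}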

Clusters and Warehouse Scale Computers

A cluster is a set of computers that are connected over a local area network (LAN). It may be viewed as one large multiprocessor. Warehouse-scale computers are very large clusters that can include 100 000 servers acting as one giant computer (e.g., at Apple). Clusters do not communicate over shared memory (as in an SMP) but by using message passing. (Photo by Robert Harker.)

The MapReduce and Hadoop frameworks are popular for batch processing on a cluster of computers (Computer 1, 2, ..., N):
1. Map applies a programmer-defined function to all data items.
2. Reduce collects the output and collapses the data using another programmer-defined function.
Both tasks are highly parallel.


Part III

Parallelization in Practice

DLP + TLP


General Matrix Multiplication (GEMM)

Simple matrix multiplication. It uses the matrix size n as a parameter and one-dimensional arrays for performance.

void dgemm(int n, double* A, double* B, double* C){
    for(int i = 0; i < n; ++i)
        for(int j = 0; j < n; ++j){
            double cij = C[i+j*n];
            for(int k = 0; k < n; k++)
                cij += A[i+k*n] * B[k+j*n];
            C[i+j*n] = cij;
        }
}

Hands-on: Show example
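A minimal, hypothetical driver for the dgemm above, just to show how the one-dimensional storage is used (element (i, j) is at index i + j*n); the matrix size and values are arbitrary, and it must be compiled together with the dgemm function:

#include <stdio.h>
#include <stdlib.h>

void dgemm(int n, double* A, double* B, double* C);   /* defined above */

int main(void){
    int n = 960;
    double *A = calloc((size_t)n * n, sizeof(double));
    double *B = calloc((size_t)n * n, sizeof(double));
    double *C = calloc((size_t)n * n, sizeof(double));
    for(int i = 0; i < n * n; i++){ A[i] = 1.0; B[i] = 2.0; }

    dgemm(n, A, B, C);                  /* C = C + A * B */

    printf("C[0][0] = %f\n", C[0]);     /* expect n * 1.0 * 2.0 = 1920.0 */
    free(A); free(B); free(C);
    return 0;
}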


Parallelizing GEMM

Unoptimized: the unoptimized C version (previous page), using one core. 1.7 GigaFLOPS (32x32); 0.8 GigaFLOPS (960x960).

SIMD: use AVX instructions vaddpd and vmulpd to do 4 double-precision floating-point operations in parallel. 6.4 GigaFLOPS (32x32); 2.5 GigaFLOPS (960x960).

ILP: AVX + unroll parts of the loop, so that the multiple-issue, out-of-order processor has more instructions to schedule. 14.6 GigaFLOPS (32x32); 5.1 GigaFLOPS (960x960).

Cache: AVX + unroll + blocking (dividing the problem into submatrices), which avoids cache misses. 13.6 GigaFLOPS (32x32); 12.0 GigaFLOPS (960x960).

Multicore: AVX + unroll + blocking + multicore (multithreading using OpenMP). 23 GigaFLOPS (960x960, 2 cores); 44 GigaFLOPS (960x960, 4 cores); 174 GigaFLOPS (960x960, 16 cores).

Experiment by P&H on a 2.6 GHz Intel Core i7 with Turbo mode turned off. For details, see P&H, 5th edition, sections 3.8, 4.12, 5.14, and 6.12.
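The multicore row above uses OpenMP. As a rough illustration only (P&H's actual optimized version also uses AVX intrinsics, unrolling, and blocking), the plain dgemm could be parallelized over its outer loop like this; compile with, e.g., gcc -fopenmp:

#include <omp.h>

void dgemm_omp(int n, double* A, double* B, double* C){
    /* Each thread gets a chunk of the i-loop; the threads write to
       disjoint elements of C, so no locking is needed. */
    #pragma omp parallel for
    for(int i = 0; i < n; ++i)
        for(int j = 0; j < n; ++j){
            double cij = C[i+j*n];
            for(int k = 0; k < n; k++)
                cij += A[i+k*n] * B[k+j*n];
            C[i+j*n] = cij;
        }
}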

Future perspective: MIMD, SIMD, ILP, and Caches

“For x86 computers, we expect to see two additional cores per chip every two years and the SIMD width to double every four years.”
Hennessy & Patterson, Computer Architecture – A Quantitative Approach, 5th edition, 2013 (page 263)

We must understand and utilize both MIMD and SIMD to gain maximal speedups in the future, although MIMD (multicore) has gained much more attention lately.

The previous example showed that how we program for ILP and caches also matters significantly.


It’s about time to… …relax

Reading Guidelines

Module 6: Parallel Processors and Programs

Lecture 12: Parallelism, Concurrency, Speedup, and ILP
• H&H Chapter 1.8, 3.6, 7.8.3-7.8.5
• P&H5 Chapters 1.7-1.8, 1.10, 4.10, 6.1-6.2, or P&H4 Chapters 1.5-1.6, 1.8, 4.10, 7.1-7.2

Lecture 13: SIMD, MIMD, and Parallel Programming
• H&H Chapter 7.8.6-7.8.9
• P&H5 Chapters 2.11, 3.6, 5.10, 6.3-6.7, or P&H4 Chapters 2.11, 3.6, 5.8, 7.3-7.7

See the course webpage for more information.


Summary

Some key take away points:

• SIMD and GPUs can efficiently parallelize problems that have data-level parallelism.

• MIMD, Multicores, and Clusters can be used to parallelize problems that have task-level parallelism.

• In the future, we should try to combine and use both SIMD and MIMD!

Thanks for listening!
