Lecture 2: A Modern Multi-Core (Forms of parallelism + understanding latency and bandwidth)

Parallel Computing Stanford CS149, Winter 2019 Quick review

1. Why has single-instruction-stream performance only improved very slowly in recent years? *

2. What prevented us from obtaining maximum speedup from the parallel programs we wrote last time?

* Self check 1: What do I mean by “single-instruction stream”? Self check 2: When we talked about the optimization of superscalar execution, were we talking about optimizing the performance of executing a single-instruction stream or multiple instruction streams?

Stanford CS149, Winter 2019

Quick review

What does it mean for a processor to “respect program order”?

Program (sequence of instructions) Instruction dependency graph

PC  Instruction
00  a = 2
01  b = 4
02  tmp2 = a + b        // 6
03  tmp3 = tmp2 + a     // 8
04  tmp4 = b + b        // 8
05  tmp5 = b * b        // 16
06  tmp6 = tmp2 + tmp4  // 14
07  tmp7 = tmp5 + tmp6  // 30
08  if (tmp3 > 7)
09    print tmp3
    else
10    print tmp7

[Alongside the listing, the slide shows the instruction dependency graph for this program.]

Stanford CS149, Winter 2019

Today

▪ Today we will talk about processor architecture

▪ Four key concepts about how modern processors work
  - Two concern parallel execution
  - Two concern challenges of accessing memory

▪ Understanding these architecture basics will help you
  - Understand and optimize the performance of your parallel programs
  - Gain intuition about what workloads might benefit from fast parallel machines

Stanford CS149, Winter 2019 Part 1: parallel execution

Stanford CS149, Winter 2019

Example program

Compute sin(x) using a Taylor expansion: sin(x) = x - x^3/3! + x^5/5! - x^7/7! + ...
for each element of an array of N floating-point numbers

void sinx(int N, int terms, float* x, float* result)
{
  for (int i=0; i<N; i++)
  {
    float value = x[i];
    float numer = x[i] * x[i] * x[i];
    int denom = 6; // 3!
    int sign = -1;

    for (int j=1; j<=terms; j++)
    {
      value += sign * numer / denom;
      numer *= x[i] * x[i];
      denom *= (2*j+2) * (2*j+3);
      sign *= -1;
    }

    result[i] = value;
  }
}

Stanford CS149, Winter 2019

Compile program

The sinx source (previous slide) compiles to an instruction stream. The body of one
loop iteration becomes:

  x[i]
  ld  r0, addr[r1]
  mul r1, r0, r0
  mul r1, r1, r0
  ...
  ...
  ...
  st  addr[r2], r0
  result[i]

Stanford CS149, Winter 2019

Execute program

  x[i]
  ld  r0, addr[r1]
  mul r1, r0, r0
  mul r1, r1, r0
  ...
  ...
  ...
  st  addr[r2], r0
  result[i]

[Diagram: Fetch/Decode, Execution Unit (ALU), Execution Context]

Stanford CS149, Winter 2019

Execute program

My very simple processor: executes one instruction per clock.
(Three animation frames on the original slides show the PC advancing through the
stream: first at the ld, then at each mul in turn.)

  x[i]
  PC ->  ld  r0, addr[r1]
         mul r1, r0, r0
         mul r1, r1, r0
         ...
         ...
         ...
         st  addr[r2], r0
  result[i]

[Diagram: Fetch/Decode, Execution Unit (ALU), Execution Context]

Stanford CS149, Winter 2019

Superscalar processor

Recall from last class: instruction level parallelism (ILP).
Decode and execute two instructions per clock (if possible).

[Diagram: Fetch/Decode 1 and 2, Exec 1 and 2, Execution Context]

  x[i]
  ld  r0, addr[r1]
  mul r1, r0, r0
  mul r1, r1, r0
  ...
  ...
  ...
  st  addr[r2], r0
  result[i]

Note: No ILP exists in this region of the program

Stanford CS149, Winter 2019

Aside: Pentium 4

Image credit: http://ixbtlabs.com/articles/pentium4/index.html Stanford CS149, Winter 2019 Processor: pre multi-core era Majority of chip transistors used to perform operations that help a single instruction stream run fast

[Diagram: Fetch/Decode, Exec Unit (ALU), Execution Context, Data cache (a big one),
 Out-of-order control logic, Fancy branch predictor, Memory pre-fetcher]

More transistors = larger cache, smarter out-of-order logic, smarter branch predictor, etc. (Also: more transistors → smaller transistors → higher clock frequencies)

Stanford CS149, Winter 2019 Processor: multi-core era

Idea #1:
Use increasing transistor counts to add more cores to the processor,
rather than using the transistors to increase the sophistication of processor logic
that accelerates a single instruction stream (e.g., out-of-order and speculative
operations).

[Diagram: Fetch/Decode, Exec Unit (ALU), Execution Context]

Stanford CS149, Winter 2019 Two cores: compute two elements in parallel

[Diagram: two cores, each with Fetch/Decode, Exec (ALU), and Execution Context.
 Core 1 runs the instruction stream on x[i], producing result[i]; core 2 runs the
 same stream on x[j], producing result[j]:

   ld  r0, addr[r1]
   mul r1, r0, r0
   mul r1, r1, r0
   ...
   st  addr[r2], r0 ]

Simpler cores: each core is slower at running a single instruction stream than our original “fancy” core (e.g., 25% slower)

But there are now two cores: 2 × 0.75 = 1.5 (potential for speedup!)

Stanford CS149, Winter 2019

But our program expresses no parallelism

void sinx(int N, int terms, float* x, float* result)
{
  for (int i=0; i<N; i++)
  {
    float value = x[i];
    float numer = x[i] * x[i] * x[i];
    int denom = 6; // 3!
    int sign = -1;

    for (int j=1; j<=terms; j++)
    {
      value += sign * numer / denom;
      numer *= x[i] * x[i];
      denom *= (2*j+2) * (2*j+3);
      sign *= -1;
    }

    result[i] = value;
  }
}

This program, compiled with gcc, will run as one thread on one of the processor cores.

Stanford CS149, Winter 2019

Expressing parallelism using pthreads

(The original sinx from the previous slide does the per-chunk work.)

typedef struct {
  int N;
  int terms;
  float* x;
  float* result;
} my_args;

void parallel_sinx(int N, int terms, float* x, float* result)
{
  pthread_t thread_id;
  my_args args;

  args.N = N/2;
  args.terms = terms;
  args.x = x;
  args.result = result;

  pthread_create(&thread_id, NULL, my_thread_start, &args);  // launch thread
  sinx(N - args.N, terms, x + args.N, result + args.N);      // do work
  pthread_join(thread_id, NULL);
}

void* my_thread_start(void* thread_arg)
{
  my_args* thread_args = (my_args*)thread_arg;
  sinx(thread_args->N, thread_args->terms, thread_args->x, thread_args->result); // do work
  return NULL;
}

Stanford CS149, Winter 2019

Data-parallel expression (in Kayvon’s fictitious data-parallel language)

Loop iterations are declared by the programmer to be independent.
With this information, you could imagine how a compiler might automatically
generate parallel threaded code.

void sinx(int N, int terms, float* x, float* result)
{
  // declare independent loop iterations
  forall (int i from 0 to N-1)
  {
    float value = x[i];
    float numer = x[i] * x[i] * x[i];
    int denom = 6; // 3!
    int sign = -1;

for (int j=1; j<=terms; j++) { value += sign * numer / denom; numer *= x[i] * x[i]; denom *= (2*j+2) * (2*j+3); sign *= -1; }

result[i] = value; } }
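A minimal sketch (hypothetical, not the course's actual system) of how a compiler or runtime might lower such a forall onto pthreads by chunking the iteration range; forall_chunked, NUM_THREADS, and the body/env calling convention are all made up for illustration:

#include <pthread.h>

// Hypothetical illustration: split the index range [0, N) into chunks and run
// each chunk on its own thread. Legal only because iterations are independent.
#define NUM_THREADS 4

typedef struct {
    int start, end;            // half-open iteration range [start, end)
    void (*body)(int, void*);  // loop body: called once per iteration
    void* env;                 // captured variables (x, result, terms, ...)
} chunk_args;

static void* run_chunk(void* arg) {
    chunk_args* c = (chunk_args*)arg;
    for (int i = c->start; i < c->end; i++)
        c->body(i, c->env);    // iterations are independent, so order is irrelevant
    return NULL;
}

// Execute body(i, env) for all i in [0, N) using NUM_THREADS threads.
void forall_chunked(int N, void (*body)(int, void*), void* env) {
    pthread_t tid[NUM_THREADS];
    chunk_args args[NUM_THREADS];
    int chunk = (N + NUM_THREADS - 1) / NUM_THREADS;
    for (int t = 0; t < NUM_THREADS; t++) {
        args[t].start = t * chunk;
        args[t].end   = (t + 1) * chunk < N ? (t + 1) * chunk : N;
        args[t].body  = body;
        args[t].env   = env;
        pthread_create(&tid[t], NULL, run_chunk, &args[t]);
    }
    for (int t = 0; t < NUM_THREADS; t++)
        pthread_join(tid[t], NULL);
}

The loop body of sinx would become the function passed as body, with x, result, and terms carried through env.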

Stanford CS149, Winter 2019 Four cores: compute four elements in parallel

[Diagram: four cores, each with Fetch/Decode, Exec (ALU), and Execution Context]

Stanford CS149, Winter 2019 Sixteen cores: compute sixteen elements in parallel

Sixteen cores, sixteen simultaneous instruction streams

Stanford CS149, Winter 2019 Multi-core examples

[Die photos]

Intel “Skylake” Core i7 quad-core CPU (2015): four cores (Core 1-4) with a shared L3 cache
NVIDIA GP104 (GTX 1080) GPU (2016): 20 replicated (“SM”) cores

Stanford CS149, Winter 2019 More multi-core examples

“Knights Corner“ 72-core CPU (2016)
Apple A9 dual-core CPU (2015): Core 1, Core 2

A9 image credit: Chipworks (obtained via Anandtech)
http://www.anandtech.com/show/9686/the-apple-iphone-6s-and-iphone-6s-plus-review/3

Stanford CS149, Winter 2019

Data-parallel expression (in Kayvon’s fictitious data-parallel language)

Another interesting property of this code: parallelism is across iterations of the
loop. All the iterations of the loop do the same thing: evaluate the sine of a
single input number.

void sinx(int N, int terms, float* x, float* result)
{
  // declare independent loop iterations
  forall (int i from 0 to N-1)
  {
    float value = x[i];
    float numer = x[i] * x[i] * x[i];
    int denom = 6; // 3!
    int sign = -1;

for (int j=1; j<=terms; j++) { value += sign * numer / denom; numer *= x[i] * x[i]; denom *= (2*j+2) * (2*j+3); sign *= -1; }

result[i] = value; } }

Stanford CS149, Winter 2019 Add ALUs to increase compute capability

Idea #2:
Amortize the cost/complexity of managing an instruction stream across many ALUs.

SIMD processing: single instruction, multiple data.
Same instruction broadcast to all ALUs, executed in parallel on all ALUs.

[Diagram: Fetch/Decode, ALU 0-7, Execution Context]

Stanford CS149, Winter 2019 Add ALUs to increase compute capability

Recall the original compiled program:

  ld  r0, addr[r1]
  mul r1, r0, r0
  mul r1, r1, r0
  ...
  st  addr[r2], r0

The instruction stream processes one array element at a time using scalar
instructions on scalar registers (e.g., 32-bit floats).

[Diagram: Fetch/Decode, ALU 0-7, Execution Context]

Stanford CS149, Winter 2019

Scalar program

Original compiled program: the sinx source shown earlier compiles to scalar
instructions that process one array element at a time on scalar registers.

Stanford CS149, Winter 2019 Vector program (using AVX intrinsics)

#include <immintrin.h>   // intrinsics available to C programmers

void sinx(int N, int terms, float* x, float* result)
{
  float three_fact = 6; // 3!
  for (int i=0; i<N; i+=8)
  {
    __m256 origx = _mm256_load_ps(&x[i]);
    __m256 value = origx;
    __m256 numer = _mm256_mul_ps(origx, _mm256_mul_ps(origx, origx));
    __m256 denom = _mm256_broadcast_ss(&three_fact);
    int sign = -1;

    for (int j=1; j<=terms; j++)
    {
      // value += sign * numer / denom
      __m256 tmp = _mm256_div_ps(_mm256_mul_ps(_mm256_set1_ps(sign), numer), denom);
      value = _mm256_add_ps(value, tmp);

      numer = _mm256_mul_ps(numer, _mm256_mul_ps(origx, origx));
      denom = _mm256_mul_ps(denom, _mm256_set1_ps((2*j+2) * (2*j+3)));
      sign *= -1;
    }
    _mm256_store_ps(&result[i], value);
  }
}

Stanford CS149, Winter 2019

Vector program (using AVX intrinsics)

The same vector program compiles to a vector instruction stream; each instruction
operates on eight array elements at a time:

  vloadps  xmm0, addr[r1]
  vmulps   xmm1, xmm0, xmm0
  vmulps   xmm1, xmm1, xmm0
  ...
  vstoreps addr[r2], xmm0

Stanford CS149, Winter 2019 16 SIMD cores: 128 elements in parallel

16 cores, 128 ALUs, 16 simultaneous instruction streams

Stanford CS149, Winter 2019

Data-parallel expression (in Kayvon’s fictitious data-parallel language)

Compiler understands that loop iterations are independent, and that the same loop
body will be executed on a large number of data elements.

The abstraction facilitates automatic generation of both multi-core parallel code
and vector instructions that make use of SIMD processing capabilities within a core.

void sinx(int N, int terms, float* x, float* result)
{
  // declare independent loop iterations
  forall (int i from 0 to N-1)
  {
    float value = x[i];
    float numer = x[i] * x[i] * x[i];
    int denom = 6; // 3!
    int sign = -1;

    for (int j=1; j<=terms; j++)
    {
      value += sign * numer / denom;
      numer *= x[i] * x[i];
      denom *= (2*j+2) * (2*j+3);
      sign *= -1;
    }

result[i] = value; } }

Stanford CS149, Winter 2019 What about conditional execution?

(Assume the logic below is to be executed for each element in input array ‘A’,
producing output into the array ‘result’. The diagram shows ALUs 1-8 executing
over time.)

float x = A[i];

if (x > 0) {
  float tmp = exp(x, 5.f);
  tmp *= kMyConst1;
  x = tmp + kMyConst2;
} else {
  float tmp = kMyConst1;
  x = 2.f * tmp;
}

result[i] = x;

Stanford CS149, Winter 2019 What about conditional execution?

Same code as the previous slide; the per-lane outcome of the branch (x > 0) across
ALUs 1-8 is:  T T F T F F F F

Stanford CS149, Winter 2019 Mask (discard) output of ALU

Same code; with the branch outcomes T T F T F F F F, the SIMD unit executes both
sides of the branch and masks (discards) the output of the ALUs whose lanes did not
take that side.

Not all ALUs do useful work!
Worst case: 1/8 peak performance
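To see the same masking idea in explicit SIMD code, here is a hedged sketch using AVX compare/blend intrinsics (not code from the slides): both sides of the branch are computed for all eight lanes, and a per-lane mask selects which result is kept. kMyConst1/kMyConst2 are placeholder constants and a simple x*x stands in for exp(x, 5.f):

#include <immintrin.h>

// Process 8 elements of A per iteration; both branch sides are evaluated on all
// lanes, then a per-lane mask selects which result is written out.
void conditional_example(int N, const float* A, float* result) {
    const __m256 kMyConst1 = _mm256_set1_ps(2.0f);   // placeholder constants
    const __m256 kMyConst2 = _mm256_set1_ps(1.0f);
    const __m256 zero      = _mm256_setzero_ps();
    for (int i = 0; i + 8 <= N; i += 8) {
        __m256 x    = _mm256_loadu_ps(&A[i]);
        __m256 mask = _mm256_cmp_ps(x, zero, _CMP_GT_OQ);   // lane-wise x > 0

        // "then" side: tmp = x*x (stand-in for exp); tmp *= kMyConst1; x = tmp + kMyConst2
        __m256 then_v = _mm256_add_ps(
            _mm256_mul_ps(_mm256_mul_ps(x, x), kMyConst1), kMyConst2);

        // "else" side: x = 2 * kMyConst1
        __m256 else_v = _mm256_mul_ps(_mm256_set1_ps(2.0f), kMyConst1);

        // keep then_v where the mask is true, else_v elsewhere (masking the outputs)
        __m256 out = _mm256_blendv_ps(else_v, then_v, mask);
        _mm256_storeu_ps(&result[i], out);
    }
    // (a scalar tail loop for N not divisible by 8 is omitted)
}

Both arithmetic paths run on every lane; the blend simply throws away the lanes that took the other branch, which is exactly the wasted work the slide describes.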

Stanford CS149, Winter 2019 After branch: continue at full performance

Same code; after the branch, all lanes execute result[i] = x together and the core
continues at full performance.

Stanford CS149, Winter 2019

Terminology

▪ Instruction stream coherence (“coherent execution”)
  - Same instruction sequence applies to all elements operated upon simultaneously
  - Coherent execution is necessary for efficient use of SIMD processing resources
  - Coherent execution IS NOT necessary for efficient parallelization across cores,
    since each core has the capability to fetch/decode a different instruction stream
▪ “Divergent” execution
  - A lack of instruction stream coherence
▪ Note: don’t confuse instruction stream coherence with “cache coherence”
  (a major topic later in the course)

Stanford CS149, Winter 2019 SIMD execution on modern CPUs

▪ SSE instructions: 128-bit operations: 4x32 bits or 2x64 bits (4-wide float vectors)
▪ AVX2 instructions: 256-bit operations: 8x32 bits or 4x64 bits (8-wide float vectors)
▪ AVX512 instructions: 512-bit operations: 16x32 bits…

▪ Instructions are generated by the compiler
  - Parallelism explicitly requested by programmer using intrinsics
  - Parallelism conveyed using parallel language semantics (e.g., forall example)
  - Parallelism inferred by dependency analysis of loops (a hard problem; even the
    best compilers are not great on arbitrary C/C++ code)

▪ Terminology: “explicit SIMD”: SIMD parallelization is performed at compile time
  - Can inspect the program binary and see SIMD instructions (vstoreps, vmulps, etc.)

Stanford CS149, Winter 2019 SIMD execution on many modern GPUs

▪ “Implicit SIMD” - Compiler generates a scalar binary (scalar instructions) - But N instances of the program are *always run* together on the processor execute(my_function, N) // execute my_function N times - In other words, the interface to the hardware itself is data parallel - Hardware (not compiler) is responsible for simultaneously executing the same instruction from multiple instances on different data on SIMD ALUs

▪ SIMD width of most modern GPUs ranges from 8 to 32 - Divergence can be a big issue (poorly written code might execute at 1/32 the peak capability of the machine!)
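A rough way to picture this interface (a CPU-side sketch, not real GPU code): the program supplies a scalar per-instance function, and an execute-style launcher runs N instances of it; on the GPU, the hardware would run those instances together in 32-wide SIMD groups. The function and type names here are illustrative only:

// Scalar "program instance": written as if it processes a single element.
typedef void (*elem_fn)(int idx, void* env);

// Sequential stand-in for execute(my_function, N). On a GPU, the hardware
// (not the compiler) would run these instances in 32-wide SIMD groups.
void execute(elem_fn my_function, int N, void* env) {
    for (int idx = 0; idx < N; idx++)
        my_function(idx, env);
}

// Example instance body: one element of an element-wise multiply.
typedef struct { const float* A; const float* B; float* C; } mul_env;

static void mul_one(int idx, void* env) {
    mul_env* e = (mul_env*)env;
    e->C[idx] = e->A[idx] * e->B[idx];
}

The point of the sketch is that the binary contains only scalar logic per instance; the data-parallel grouping happens in the launch interface and the hardware.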

Stanford CS149, Winter 2019 Example: eight-core Intel Xeon E5-1660 v4

8 cores 8 SIMD ALUs per core (AVX2 instructions)

490 GFLOPs (@3.2 GHz) (140 Watts)

* Showing only AVX math units, and fetch/decode unit for AVX (additional capability for integer math) Stanford CS149, Winter 2019 Example: NVIDIA GTX 1080

20 cores (“SMs”) 128 SIMD ALUs per core (@1.6 GHz) = 8.1 TFLOPs (180 Watts) Stanford CS149, Winter 2019 Summary: parallel execution ▪ Several forms of parallel execution in modern processors - Multi-core: use multiple processing cores - Provides thread-level parallelism: simultaneously execute a completely different instruction stream on each core - Software decides when to create threads (e.g., via pthreads API)

- SIMD: use multiple ALUs controlled by same instruction stream (within a core) - Efficient design for data-parallel workloads: control amortized over many ALUs - Vectorization can be done by compiler (explicit SIMD) or at runtime by hardware - [Lack of] dependencies is known prior to execution (usually declared by programmer, but can be inferred by loop analysis by advanced compiler)

- Superscalar: exploit ILP within an instruction stream. Process different
  instructions from the same instruction stream in parallel (within a core)
  - Parallelism automatically and dynamically discovered by the hardware during
    execution (not programmer visible)

Not addressed further in this class. That’s for a proper design course.

Stanford CS149, Winter 2019

Part 2: accessing memory

Memory

Stanford CS149, Winter 2019 Terminology ▪ Memory latency - The amount of time for a memory request (e.g., load, store) from a processor to be serviced by the memory system - Example: 100 cycles, 100 nsec

▪ Memory bandwidth - The rate at which the memory system can provide data to a processor - Example: 20 GB/s
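These two quantities interact: to sustain a given bandwidth in the face of a fixed latency, enough independent requests must be in flight (latency × bandwidth = bytes in flight). A quick back-of-the-envelope sketch using the example numbers above, with a 64-byte cache line assumed:

#include <stdio.h>

int main(void) {
    double latency_s   = 100e-9;   // 100 ns memory latency (example above)
    double bandwidth_B = 20e9;     // 20 GB/s memory bandwidth (example above)
    double line_B      = 64;       // typical cache line size (assumption)

    double bytes_in_flight = latency_s * bandwidth_B;   // Little's law
    printf("bytes in flight needed: %.0f (~%.0f cache lines)\n",
           bytes_in_flight, bytes_in_flight / line_B);
    // => ~2000 bytes, i.e. roughly 31 outstanding cache-line requests
    return 0;
}

This is one way to see why the latency-hiding techniques in the rest of the lecture (prefetching, multi-threading) matter even when raw bandwidth is plentiful.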

Stanford CS149, Winter 2019 Stalls ▪ A processor “stalls” when it cannot run the next instruction in an instruction stream because of a dependency on a previous instruction.

▪ Accessing memory is a major source of stalls

    ld  r0 mem[r2]
    ld  r1 mem[r3]
    add r0, r0, r1

  Dependency: cannot execute ‘add’ instruction until data at mem[r2] and mem[r3]
  have been loaded from memory

▪ Memory access times ~ 100’s of cycles - Memory “access time” is a measure of latency

Stanford CS149, Winter 2019 Review: why do modern processors have caches?

[Diagram: Cores 1..N, each with its own L1 cache (32 KB) and L2 cache (256 KB),
 all sharing an L3 cache (8 MB), connected at 38 GB/sec to DDR4 DRAM memory (gigabytes)]

Stanford CS149, Winter 2019 Caches reduce length of stalls (reduce latency)

Processors run efficiently when data is resident in caches Caches reduce memory access latency *

[Same cache-hierarchy diagram as the previous slide]

* Caches also provide high bandwidth data transfer to CPU Stanford CS149, Winter 2019 Prefetching reduces stalls (hides latency) ▪ All modern CPUs have logic for prefetching data into caches - Dynamically analyze program’s access patterns, predict what it will access soon

▪ Reduces stalls since data is resident in cache when accessed

  predict value of r2, initiate load
  predict value of r3, initiate load
  ...
  ...
  ... (data arrives in cache)
  ...
  ... (data arrives in cache)
  ld  r0 mem[r2]   <- these loads are cache hits
  ld  r1 mem[r3]
  add r0, r0, r1

Note: Prefetching can also reduce performance if the guess is wrong
(hogs bandwidth, pollutes caches) (more detail later in course)
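The prefetching described above is done by hardware; as a side note, compilers and programmers can also issue software prefetches. A hedged sketch using the GCC/Clang __builtin_prefetch builtin (the 16-element prefetch distance is an arbitrary illustrative choice):

// Sum an array while prefetching data several iterations ahead, so the
// load latency overlaps with the arithmetic on earlier elements.
float sum_with_prefetch(const float* a, int n) {
    float sum = 0.f;
    for (int i = 0; i < n; i++) {
        if (i + 16 < n)
            __builtin_prefetch(&a[i + 16], 0 /*read*/, 1 /*low temporal locality*/);
        sum += a[i];
    }
    return sum;
}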

Stanford CS149, Winter 2019 Multi-threading reduces stalls ▪ Idea: interleave processing of multiple threads on the same core to hide stalls

▪ Like prefetching, multi-threading is a latency hiding, not a latency reducing technique

Stanford CS149, Winter 2019 Hiding stalls with multi-threading

[Diagram: one core (1 thread) — Fetch/Decode, ALUs 0-7, one execution context —
 running Thread 1 on elements 0 … 7 over time]

Stanford CS149, Winter 2019 Hiding stalls with multi-threading

Thread 1 (elements 0 … 7), Thread 2 (elements 8 … 15),
Thread 3 (elements 16 … 23), Thread 4 (elements 24 … 31)

[Diagram: one core (4 hardware threads) — Fetch/Decode, ALUs 0-7, and four execution
 contexts — interleaving the four threads over time]

Stanford CS149, Winter 2019 Hiding stalls with multi-threading

[Timeline: Thread 1 (elements 0 … 7) runs until it stalls on a memory access; the
 core then switches to another runnable thread (elements 8 … 15, and so on) while
 Thread 1’s request is serviced]

Stanford CS149, Winter 2019 Hiding stalls with multi-threading

[Timeline: each of the four threads runs until it stalls; by the time the last
 thread stalls, the first thread’s data has arrived and it is runnable again.
 Threads finish (“Done!”) as their eight elements complete.]

Stanford CS149, Winter 2019 Throughput computing trade-off

Thread 1 (elements 0 … 7), Thread 2 (elements 8 … 15),
Thread 3 (elements 16 … 23), Thread 4 (elements 24 … 31)

Key idea of throughput-oriented systems:
Potentially increase the time to complete work by any one thread, in order to
increase overall system throughput when running multiple threads.

Legend:
  Stall
  Runnable — during this time, this thread is runnable, but it is not being executed
  by the processor. (The core is running some other thread.)
  Done!

Stanford CS149, Winter 2019

Storing execution contexts

Consider on-chip storage of execution contexts a finite resource.

[Diagram: Fetch/Decode, ALUs 0-7, Context storage (or L1 cache)]

Stanford CS149, Winter 2019 Many small contexts (high latency hiding ability)

[Diagram: 1 core (16 hardware threads, storage for a small working set per thread) —
 Fetch/Decode, ALUs 0-7, and 16 small execution contexts]

Stanford CS149, Winter 2019 Four large contexts (low latency hiding ability)

[Diagram: 1 core (4 hardware threads, storage for a larger working set per thread) —
 Fetch/Decode, ALUs 0-7, and 4 large execution contexts]

Stanford CS149, Winter 2019 Hardware-supported multi-threading ▪ Core manages execution contexts for multiple threads - Runs instructions from runnable threads (processor makes decision about which thread to run each clock, not the operating system) - Core still has the same number of ALU resources: multi-threading only helps use them more efficiently in the face of high-latency operations like memory access

▪ Interleaved multi-threading (a.k.a. temporal multi-threading) - What I described on the previous slides: each clock, the core chooses a thread, and runs an instruction from the thread on the ALUs ▪ Simultaneous multi-threading (SMT) - Each clock, core chooses instructions from multiple threads to run on ALUs - Extension of superscalar CPU design - Example: Intel Hyper-threading (2 threads per core)

Stanford CS149, Winter 2019

Multi-threading summary

▪ Benefit: use a core’s execution resources (ALUs) more efficiently
  - Hide memory latency
  - Fill multiple functional units of superscalar architecture
    (when one thread has insufficient ILP)

▪ Costs - Requires additional storage for thread contexts - Increases run time of any single thread (often not a problem, we usually care about throughput in parallel apps) - Requires additional independent work in a program (more independent work than ALUs!) - Relies heavily on memory bandwidth - More threads → larger working set → less cache space per thread - May go to memory more often, but can hide the latency

Stanford CS149, Winter 2019

Kayvon’s fictitious multi-core chip

16 cores

8 SIMD ALUs per core (128 total)

4 threads per core

16 simultaneous instruction streams

64 total concurrent instruction streams

512 independent pieces of work are needed to run the chip with maximal latency hiding ability

Stanford CS149, Winter 2019

GPUs: extreme throughput-oriented processors

NVIDIA GTX 1080 core (“SM”)

[Diagram: eight Fetch/Decode units; SIMD function units with control shared across
 32 units (1 MUL-ADD per clock); execution contexts (registers, 256 KB);
 “shared” memory (96 KB)]

▪ Instructions operate on 32 pieces of data at a time (instruction streams called “warps”).
▪ Think: warp = thread issuing 32-wide vector instructions
▪ Different instructions from up to four warps can be executed simultaneously
  (simultaneous multi-threading)
▪ Up to 64 warps are interleaved on the SM (interleaved multi-threading)
▪ Over 2,048 elements can be processed concurrently by a core

Source: NVIDIA Pascal Tuning Guide Stanford CS149, Winter 2019 NVIDIA GTX 1080

There are 20 SM cores on the GTX 1080: That’s 40,960 pieces of data being processed concurrently to get maximal latency hiding!

Stanford CS149, Winter 2019 CPU vs. GPU memory hierarchies

CPU:
[Cores 1..8, each with L1 cache (32 KB) and L2 cache (256 KB); shared L3 cache (20 MB);
 76 GB/sec to DDR4 DRAM memory (many GB)]
Big caches, few threads per core, modest memory BW.
Rely mainly on caches and prefetching.

GPU:
[Cores 1..20, each with an L1 cache, execution contexts (256 KB), and scratchpad
 (64 KB); shared L2 cache (2 MB); 320 GB/sec to GDDR5X DRAM memory (a few GB)]
Small caches, many threads, huge memory BW.
Rely heavily on multi-threading for performance.

Stanford CS149, Winter 2019

Thought experiment

Task: element-wise multiplication of two vectors A and B, producing C (A × B = C).
Assume the vectors contain millions of elements.
  - Load input A[i]
  - Load input B[i]
  - Compute A[i] × B[i]
  - Store result into C[i]

Three memory operations (12 bytes) for every MUL.
NVIDIA GTX 1080 GPU can do 2560 MULs per clock (@ 1.6 GHz).
Need ~45 TB/sec of bandwidth to keep the functional units busy (only have 320 GB/sec).
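A sketch of that arithmetic, using the figures on the slide (2560 MULs per clock at 1.6 GHz, and 12 bytes of memory traffic per MUL):

#include <stdio.h>

int main(void) {
    double muls_per_clock = 2560;     // GTX 1080 (from the slide)
    double clock_hz       = 1.6e9;    // 1.6 GHz
    double bytes_per_mul  = 12;       // two 4-byte loads + one 4-byte store

    double needed = muls_per_clock * clock_hz * bytes_per_mul;   // bytes/sec
    double have   = 320e9;                                       // 320 GB/sec

    printf("required: %.1f TB/s (%.0f TiB/s)\n", needed / 1e12, needed / (1ULL << 40));
    printf("achievable fraction of peak: %.2f%%\n", 100.0 * have / needed);
    return 0;
}

The result is roughly 49 × 10^12 bytes/sec, i.e. about 45 TiB/s (the slide's "~45 TB/sec"), of which 320 GB/sec is well under 1% — matching the efficiency figure below.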

<1% GPU efficiency… but 4.2x faster than eight-core CPU in lab! (3.2 GHz Xeon E5v4 eight-core CPU connected to 76 GB/sec memory will exhibit ~3% efficiency on this computation) Stanford CS149, Winter 2019 Bandwidth limited!

If processors request data at too high a rate, the memory system cannot keep up. No amount of latency hiding helps this.

Overcoming bandwidth limits is a common challenge for application developers on
throughput-optimized systems.

Stanford CS149, Winter 2019 Bandwidth is a critical resource

Performant parallel programs will:

▪ Organize computation to fetch data from memory less often - Reuse data previously loaded by the same thread (traditional intra-thread temporal locality optimizations) - Share data across threads (inter-thread cooperation)

▪ Request data less often (instead, do more arithmetic: it’s “free”) - Useful term: “arithmetic intensity” — ratio of math operations to data access operations in an instruction stream - Main point: programs must have high arithmetic intensity to utilize modern processors efficiently
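For a concrete feel for arithmetic intensity, a rough sketch comparing the two kernels from this lecture (assumptions: one floating-point op counted per add/mul/div, 4-byte floats, and no cache reuse):

#include <stdio.h>

int main(void) {
    // Element-wise multiply: 1 mul per element, 12 bytes moved (2 loads + 1 store).
    double ai_mul = 1.0 / 12.0;

    // Taylor-series sinx with 'terms' inner iterations: roughly 5 float ops per
    // iteration, but only 8 bytes moved per element (load x[i], store result[i]).
    int terms = 10;
    double ai_sinx = (5.0 * terms) / 8.0;

    printf("A*B element-wise multiply: %.3f flops/byte\n", ai_mul);           // ~0.08
    printf("sinx (%d terms):           %.2f flops/byte\n", terms, ai_sinx);   // ~6.25
    return 0;
}

The multiply kernel is hopelessly bandwidth bound, while the sinx kernel does enough math per byte fetched to keep the ALUs busy — the kind of ratio the slide is asking you to aim for.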

Stanford CS149, Winter 2019

Summary

▪ Three major ideas that all modern processors employ to varying degrees
  - Employ multiple processing cores
    - Simpler cores (embrace thread-level parallelism over instruction-level parallelism)
  - Amortize instruction stream processing over many ALUs (SIMD)
    - Increase compute capability with little extra cost
  - Use multi-threading to make more efficient use of processing resources
    (hide latencies, fill all available resources)

▪ Due to high arithmetic capability on modern chips, many parallel applications (on both CPUs and GPUs) are bandwidth bound

▪ GPU architectures use the same throughput computing ideas as CPUs: but GPUs push these concepts to extreme scales

Stanford CS149, Winter 2019 For the rest of this class, know these terms

▪ Multi-core processor
▪ SIMD execution
▪ Coherent control flow
▪ Hardware multi-threading
  - Interleaved multi-threading
  - Simultaneous multi-threading
▪ Memory latency
▪ Memory bandwidth
▪ Bandwidth bound application
▪ Arithmetic intensity

Stanford CS149, Winter 2019 Review slides (additional examples for review and to check our understanding)

Stanford CS149, Winter 2019 Putting together the concepts from this lecture: (if you understand the following sequence you understand this lecture)

Stanford CS149, Winter 2019

Running code on a simple processor

My very simple program: compute sin(x) using a Taylor expansion.

[Diagram: Fetch/Decode, ALU (Execute), Execution Context]

void sinx(int N, int terms, float* x, float* result)
{
  for (int i=0; i<N; i++)
  {
    float value = x[i];
    float numer = x[i] * x[i] * x[i];
    int denom = 6; // 3!
    int sign = -1;

    for (int j=1; j<=terms; j++)
    {
      value += sign * numer / denom;
      numer *= x[i] * x[i];
      denom *= (2*j+2) * (2*j+3);
      sign *= -1;
    }

    result[i] = value;
  }
}

Stanford CS149, Winter 2019 Review: superscalar execution

Unmodified program (the same sinx code).

My single-core, superscalar processor: executes up to two instructions per clock.
[Diagram: Fetch/Decode 1 and 2, Exec 1 and 2, Execution Context]

Independent operations in the instruction stream are detected by the processor at
run-time and may be executed in parallel on execution units 1 and 2.

Stanford CS149, Winter 2019

Review: multi-core execution (two cores)

Modify program to create two threads of control (two instruction streams).

My dual-core processor: executes one instruction per clock from an instruction
stream on each core.
[Diagram: two cores, each with Fetch/Decode, ALU (Execute), and Execution Context]

typedef struct {
  int N;
  int terms;
  float* x;
  float* result;
} my_args;

void parallel_sinx(int N, int terms, float* x, float* result)
{
  pthread_t thread_id;
  my_args args;

  args.N = N/2;
  args.terms = terms;
  args.x = x;
  args.result = result;

  pthread_create(&thread_id, NULL, my_thread_start, &args);  // launch thread
  sinx(N - args.N, terms, x + args.N, result + args.N);      // do work
  pthread_join(thread_id, NULL);
}

void* my_thread_start(void* thread_arg)
{
  my_args* thread_args = (my_args*)thread_arg;
  sinx(thread_args->N, thread_args->terms, thread_args->x, thread_args->result); // do work
  return NULL;
}

Stanford CS149, Winter 2019

Review: multi-core + superscalar execution

Modify program to create two threads of control (two instruction streams) — same
pthreads code as above.

My superscalar dual-core processor: executes up to two instructions per clock from
an instruction stream on each core.
[Diagram: two cores, each with Fetch/Decode 1 and 2, Exec 1 and 2, and Execution Context]

Stanford CS149, Winter 2019

Review: multi-core (four cores)

Modify program to create many threads of control: recall Kayvon’s fictitious language

My quad-core processor: executes one instruction per clock from an instruction
stream on each core.
[Diagram: four cores, each with Fetch/Decode, ALU (Execute), and Execution Context]

void sinx(int N, int terms, float* x, float* result)
{
  // declare independent loop iterations
  forall (int i from 0 to N-1)
  {
    float value = x[i];
    float numer = x[i] * x[i] * x[i];
    int denom = 6; // 3!
    int sign = -1;

    for (int j=1; j<=terms; j++)
    {
      value += sign * numer / denom;
      numer *= x[i] * x[i];
      denom *= (2*j+2) * (2*j+3);
      sign *= -1;
    }

    result[i] = value;
  }
}

Stanford CS149, Winter 2019

Review: four, 8-wide SIMD cores

Observation: the program must execute many iterations of the same loop body.
Optimization: share an instruction stream across execution of multiple iterations
(single instruction, multiple data = SIMD).

My SIMD quad-core processor: executes one 8-wide SIMD instruction per clock from an
instruction stream on each core.
[Diagram: four cores, each with Fetch/Decode, 8-wide SIMD execution, and Execution Context]

(Same forall sinx code as on the previous slide.)

Stanford CS149, Winter 2019

Review: four SIMD, multi-threaded cores

Observation: memory operations have very long latency.
Solution: hide the latency of loading data for one iteration by executing arithmetic
instructions from other iterations.

My multi-threaded, SIMD quad-core processor: executes one SIMD instruction per clock
from one instruction stream on each core, but can switch to processing the other
instruction stream when faced with a stall.
[Diagram: four cores, each with Fetch/Decode, 8-wide SIMD execution, and multiple
 execution contexts. In the forall sinx code, reading x[i] is a memory load and
 writing result[i] is a memory store.]

Stanford CS149, Winter 2019

Summary: four superscalar, SIMD, multi-threaded cores

My multi-threaded, superscalar, SIMD quad-core processor: executes up to two
instructions per clock from one instruction stream on each core (in this example:
one SIMD instruction + one scalar instruction). Processor can switch to execute the
other instruction stream when faced with a stall.

[Diagram: four cores, each with two Fetch/Decode units, two execution units
 (one scalar, one 8-wide SIMD), and two execution contexts]

Stanford CS149, Winter 2019 Connecting it all together Kayvon’s simple quad-core processor: Four cores, two-way multi-threading per core (max eight threads active on chip at once), up to two instructions per clock per core (one of those instructions is 8-wide SIMD)

[Diagram: four cores, each with two Fetch/Decode units, two execution units
 (one scalar, one 8-wide SIMD), two execution contexts, and its own L1 and L2 caches;
 an on-chip interconnect connects the cores to a shared L3 cache and a memory
 controller, which drives the memory bus (to DRAM)]

Stanford CS149, Winter 2019

Thought experiment

▪ You write a C application that spawns two pthreads
▪ The application runs on the processor shown below
  - Two cores, two execution contexts per core, up to two instructions per clock,
    one instruction is an 8-wide SIMD instruction
▪ Question: “who” is responsible for mapping the application’s pthreads to the
  processor’s thread execution contexts?
  Answer: the operating system
▪ Question: If you were implementing the OS, how would you assign the two threads
  to the four execution contexts?
▪ Another question: How would you assign threads to execution contexts if your C
  program spawned five pthreads?

[Diagram: two cores, each with two Fetch/Decode units, two execution units
 (one scalar, one SIMD), and two execution contexts]

▪ Question: If you were implementing the OS, how would to assign the two threads to the four execution contexts? Fetch/ Fetch/ Fetch/ Fetch/ Decode Decode Decode Decode ▪ Another question: How would you SIMD Exec 2 SIMD Exec 2 assign threads to execution contexts Exec 1 Exec 1 Execution Execution Execution Execution if your C program spawned fve Context Context Context Context pthreads? Stanford CS149, Winter 2019 Visualizing interleaved and simultaneous multi-threading (and combinations thereof)

Stanford CS149, Winter 2019 Interleaved multi-threading Consider a processor with: ▪ Two execution contexts ▪ One fetch and decode unit (one instruction per clock) ▪ One ALU (to execute the instruction)

time (clocks) Thread 0

Thread 1

= ALU executing T0 at this time = ALU executing T1 at this time

In an interleaved multi-threading scenario, the threads share the processor. (This is a visualization of when threads are having their instructions executed by the ALU.)

Stanford CS149, Winter 2019 Interleaved multi-threading Consider a processor with: ▪ Two execution contexts ▪ One fetch and decode unit (one instruction per clock) ▪ One ALU (to execute the instruction)

time (clocks) Thread 0

Thread 1

= ALU executing T0 at this time = ALU executing T1 at this time

Same as the previous slide, but now with a different scheduling order of the threads
(fine-grained interleaving)

Stanford CS149, Winter 2019 Simultaneous multi-threading Consider a processor with: ▪ Two execution contexts ▪ Two fetch and decode units (two instructions per clock) ▪ Two ALUs (to execute the two instructions)

time (clocks) Thread 0

Thread 1

= ALU executing T0 at this time = ALU executing T1 at this time

In a simultaneous multi-threading scenario, the threads execute simultaneously on the
two ALUs. (Note: no ILP is exploited within a thread, since each thread runs
sequentially on one ALU.)

Stanford CS149, Winter 2019 Combining simultaneous and interleaved multi-threading Consider a processor with: ▪ Four execution contexts ▪ Two fetch and decode units (two instructions per clock, choose two of four threads) ▪ Two ALUs (to execute the two instructions) time (clocks) Thread 0

Thread 1

Thread 2

Thread 3

= some ALU executing T0 at this time = some ALU executing T2 at this time = some ALU executing T1 at this time = some ALU executing T3 at this time Stanford CS149, Winter 2019 Another way to visualize execution (ALU-centric view) Consider a processor with: ▪ Four execution contexts ▪ Two fetch and decode units (two instructions per clock, choose two of four threads) ▪ Two ALUs (to execute the two instructions)

Now the graph is visualizing what each ALU is doing each clock:

time (clocks) ALU 0

ALU 1

= executing T0 at this time = executing T2 at this time = executing T1 at this time = executing T3 at this time

Stanford CS149, Winter 2019 Instructions can be drawn from same thread (ILP) Consider a processor with: ▪ Four execution contexts ▪ Two fetch and decode units (two instructions per clock, choose any two independent instructions from the four threads) ▪ Two ALUs (to execute the two instructions)

time (clocks)

ALU 0

ALU 1

Two instructions from same thread executing simultaneously.

= executing T0 at this time = executing T2 at this time = executing T1 at this time = executing T3 at this time Stanford CS149, Winter 2019