Lecture 2: A Modern Multi-Core (Forms of parallelism + understanding latency and bandwidth)

Parallel Architecture and Programming CMU 15-418/15-618, Fall 2016 Quick review

1. Why has single-instruction-stream performance improved only very slowly in recent years? *

2. What prevented us from obtaining maximum speedup from the parallel programs we wrote last time?

* Self check 1: What do I mean by “single-instruction stream”? Self check 2: When we talked about the optimization of superscalar execution, were we talking about optimizing the performance of executing a single-instruction stream? CMU 15-418/618, Fall 2016 Today

▪ Four key concepts about how modern computers work - Two concern parallel execution - Two concern challenges of accessing memory

▪ Understanding these architecture basics will help you - Understand and optimize the performance of your parallel programs - Gain intuition about what workloads might benefit from fast parallel machines

CMU 15-418/618, Fall 2016 Part 1: parallel execution

CMU 15-418/618, Fall 2016 Example program
Compute sin(x) using Taylor expansion: sin(x) = x - x^3/3! + x^5/5! - x^7/7! + ... for each element of an array of N floating-point numbers

void sinx(int N, int terms, float* x, float* result)
{
  for (int i=0; i<N; i++)
  {
    float value = x[i];
    float numer = x[i] * x[i] * x[i];
    int denom = 6; // 3!
    int sign = -1;

    for (int j=1; j<=terms; j++)
    {
      value += sign * numer / denom;
      numer *= x[i] * x[i];
      denom *= (2*j+2) * (2*j+3);
      sign *= -1;
    }

    result[i] = value;
  }
}
CMU 15-418/618, Fall 2016 Compile program

[Slide: the sinx source is shown next to the instruction stream the compiler produces for one loop iteration: ld r0, addr[r1]; mul r1, r0, r0; mul r1, r1, r0; ...; st addr[r2], r0. The stream loads x[i], evaluates the series, and stores result[i].]

CMU 15-418/618, Fall 2016 Execute program
My very simple processor: executes one instruction per clock

x[i]

[Diagram: Fetch/Decode, ALU (Execute), Execution Context. The instruction stream for one iteration is: ld r0, addr[r1]; mul r1, r0, r0; mul r1, r1, r0; ...; st addr[r2], r0. Successive slides show the PC stepping through these instructions, one per clock.]

result[i]

CMU 15-418/618, Fall 2016 Recall from last class: instruction level parallelism (ILP) Decode and execute two instructions per clock (if possible) x[i]

[Diagram: two Fetch/Decode units and two Exec units (1 and 2) sharing one Execution Context; the same instruction stream as before (ld r0, addr[r1]; mul r1, r0, r0; mul r1, r1, r0; ...; st addr[r2], r0), with up to two instructions issued per clock when they are independent]

result[i] Note: No ILP exists in this region of the program

CMU 15-418/618, Fall 2016 Aside: the Pentium 4

Image credit: http://ixbtlabs.com/articles/pentium4/index.html CMU 15-418/618, Fall 2016 Processor: pre multi-core era Majority of chip transistors used to perform operations that help a single instruction stream run fast

[Diagram of a pre-multi-core processor: Fetch/Decode, ALU (Execute), Execution Context, a big data cache, out-of-order control logic, a fancy branch predictor, and a memory pre-fetcher]

More transistors = larger cache, smarter out-of-order logic, smarter branch predictor, etc. (Also: more transistors → smaller transistors → higher clock frequencies)

CMU 15-418/618, Fall 2016 Processor: multi-core era

Idea #1: Use increasing transistor counts to add more cores to the processor, rather than using those transistors to increase the sophistication of processor logic that accelerates a single instruction stream (e.g., out-of-order and speculative operations)

[Diagram: one simpler core: Fetch/Decode, ALU (Execute), Execution Context]

CMU 15-418/618, Fall 2016 Two cores: compute two elements in parallel

x[i] x[j] Fetch/ Fetch/ Decode Decode

ld r0, addr[r1] ld r0, addr[r1] mul r1, r0, r0 ALU ALU mul r1, r0, r0 mul r1, r1, r0 mul r1, r1, r0 ... (Execute) (Execute) ...... Execution Execution ... st addr[r2], r0 Context Context st addr[r2], r0

result[i] result[j]

Simpler cores: each core is slower at running a single instruction stream than our original “fancy” core (e.g., 25% slower)

But there are now two cores: 2 × 0.75 = 1.5 (potential for speedup!)

CMU 15-418/618, Fall 2016 But our program expresses no parallelism
This program, compiled with gcc, still runs as one instruction stream on one of the cores (so with the simpler, roughly 25%-slower cores, it now runs slower than on the original processor).

void sinx(int N, int terms, float* x, float* result)
{
  for (int i=0; i<N; i++)
  {
    // ... same loop body as before ...
    result[i] = value;
  }
}

CMU 15-418/618, Fall 2016 Expressing parallelism using pthreads
The slide shows the original sinx alongside a parallel_sinx that packs half the work into a my_args struct, launches one pthread (my_thread_start) to process that half, computes the other half itself, then joins the thread. A reassembled listing follows below.
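For reference, here is a compilable reconstruction of that pthreads version, assembled from this slide and the fuller copy that reappears in the review section at the end of the lecture. Two small adjustments are mine so it compiles cleanly: the start routine returns void* (pthreads requires it), and it reads its arguments through thread_args rather than the slide's args:

#include <pthread.h>

typedef struct {
  int N;
  int terms;
  float* x;
  float* result;
} my_args;

void sinx(int N, int terms, float* x, float* result);  // the scalar version from earlier

void* my_thread_start(void* thread_arg)
{
  my_args* thread_args = (my_args*)thread_arg;
  sinx(thread_args->N, thread_args->terms, thread_args->x, thread_args->result);  // do work
  return NULL;
}

void parallel_sinx(int N, int terms, float* x, float* result)
{
  pthread_t thread_id;
  my_args args;

  args.N = N/2;          // first half of the array goes to the new thread
  args.terms = terms;
  args.x = x;
  args.result = result;

  pthread_create(&thread_id, NULL, my_thread_start, &args);   // launch thread
  sinx(N - args.N, terms, x + args.N, result + args.N);       // do work (second half)
  pthread_join(thread_id, NULL);
}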

CMU 15-418/618, Fall 2016 Data-parallel expression (in our fictitious data-parallel language)
Loop iterations are declared by the programmer to be independent. With this information, you could imagine how a compiler might automatically generate parallel threaded code.

void sinx(int N, int terms, float* x, float* result)
{
  // declare independent loop iterations
  forall (int i from 0 to N-1)
  {
    float value = x[i];
    float numer = x[i] * x[i] * x[i];
    int denom = 6; // 3!
    int sign = -1;

    for (int j=1; j<=terms; j++)
    {
      value += sign * numer / denom;
      numer *= x[i] * x[i];
      denom *= (2*j+2) * (2*j+3);
      sign *= -1;
    }

    result[i] = value;
  }
}
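As a concrete point of reference (my addition, not from the slide): OpenMP's parallel-for pragma plays the same role as the fictitious forall. The programmer asserts that iterations are independent, and a real compiler/runtime maps them across cores:

#include <omp.h>   // compile with -fopenmp (gcc/clang)

void sinx_omp(int N, int terms, float* x, float* result)
{
  // iterations are independent; the runtime splits them across threads/cores
  #pragma omp parallel for
  for (int i = 0; i < N; i++)
  {
    float value = x[i];
    float numer = x[i] * x[i] * x[i];
    int denom = 6; // 3!
    int sign = -1;
    for (int j = 1; j <= terms; j++)
    {
      value += sign * numer / denom;
      numer *= x[i] * x[i];
      denom *= (2*j+2) * (2*j+3);
      sign *= -1;
    }
    result[i] = value;
  }
}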

CMU 15-418/618, Fall 2016 Four cores: compute four elements in parallel

Fetch/ Fetch/ Decode Decode

ALU ALU (Execute) (Execute)

Execution Execution Context Context

Fetch/ Fetch/ Decode Decode

ALU ALU (Execute) (Execute)

Execution Execution Context Context

CMU 15-418/618, Fall 2016 Sixteen cores: compute sixteen elements in parallel

Sixteen cores, sixteen simultaneous instruction streams

CMU 15-418/618, Fall 2016 Multi-core examples

Core 1 Core 2

Shared L3 cache

Core 3 Core 4

Intel “Skylake” Core i7 quad-core CPU (2015); NVIDIA GTX 980 GPU with 16 replicated processing cores (“SM”) (2014)

CMU 15-418/618, Fall 2016 More multi-core examples

Core 1

Core 2

Intel Xeon Phi “Knights Landing” 76-core CPU (2015); Apple A9 dual-core CPU (2015)

A9 image credit: Chipworks (obtained via Anandtech) http://www.anandtech.com/show/9686/the-apple-iphone-6s-and-iphone-6s-plus-review/3
CMU 15-418/618, Fall 2016 Data-parallel expression (in our fictitious data-parallel language)
Another interesting property of this code: parallelism is across iterations of the loop. All the iterations of the loop do the same thing: evaluate the sine of a single input number.

void sinx(int N, int terms, float* x, float* result)
{
  // declare independent loop iterations
  forall (int i from 0 to N-1)
  {
    float value = x[i];
    float numer = x[i] * x[i] * x[i];
    int denom = 6; // 3!
    int sign = -1;

    for (int j=1; j<=terms; j++)
    {
      value += sign * numer / denom;
      numer *= x[i] * x[i];
      denom *= (2*j+2) * (2*j+3);
      sign *= -1;
    }

    result[i] = value;
  }
}

CMU 15-418/618, Fall 2016 Add ALUs to increase compute capability

Idea #2: Amortize the cost/complexity of managing an instruction stream across many ALUs

SIMD processing: single instruction, multiple data. The same instruction is broadcast to all ALUs and executed in parallel on all ALUs.

[Diagram: one Fetch/Decode unit feeding ALU 0 through ALU 7, all sharing one Execution Context]

CMU 15-418/618, Fall 2016 Add ALUs to increase compute capability

[Diagram: Fetch/Decode, ALU 0 through ALU 7, Execution Context, with the scalar instruction stream ld r0, addr[r1]; mul r1, r0, r0; mul r1, r1, r0; ...; st addr[r2], r0]

Recall the original compiled program: the instruction stream processes one array element at a time using scalar instructions on scalar registers (e.g., 32-bit floats)

CMU 15-418/618, Fall 2016 Scalar program
Original compiled program: the sinx source shown earlier, processing one array element at a time with scalar instructions.

CMU 15-418/618, Fall 2016 Vector program (using AVX intrinsics)

#include <immintrin.h> // intrinsics available to C programmers

void sinx(int N, int terms, float* x, float* result)
{
  float three_fact = 6; // 3!
  for (int i=0; i<N; i+=8)
  {
    __m256 origx = _mm256_load_ps(&x[i]);
    __m256 value = origx;
    __m256 numer = _mm256_mul_ps(origx, _mm256_mul_ps(origx, origx));
    __m256 denom = _mm256_broadcast_ss(&three_fact);
    int sign = -1;

    for (int j=1; j<=terms; j++)
    {
      // value += sign * numer / denom
      __m256 tmp = _mm256_div_ps(_mm256_mul_ps(_mm256_set1_ps(sign), numer), denom);
      value = _mm256_add_ps(value, tmp);

      numer = _mm256_mul_ps(numer, _mm256_mul_ps(origx, origx));
      denom = _mm256_mul_ps(denom, _mm256_set1_ps((2*j+2) * (2*j+3)));
      sign *= -1;
    }
    _mm256_store_ps(&result[i], value);
  }
}
CMU 15-418/618, Fall 2016 Vector program (using AVX intrinsics)

[Slide: the same vector sinx source, now shown next to its compiled vector instruction stream, e.g., vloadps xmm0, addr[r1]; vmulps xmm1, xmm0, xmm0; ...; vstoreps addr[r2], xmm1. The compiled stream processes eight array elements at a time using vector instructions on 256-bit vector registers.]

CMU 15-418/618, Fall 2016 16 SIMD cores: 128 elements in parallel

16 cores, 128 ALUs, 16 simultaneous instruction streams

CMU 15-418/618, Fall 2016 Data-parallel expression (in our fictitious data-parallel language)
The compiler understands that loop iterations are independent, and that the same loop body will be executed on a large number of data elements. This abstraction facilitates automatic generation of both multi-core parallel code and vector instructions that make use of SIMD processing capabilities within a core.

void sinx(int N, int terms, float* x, float* result)
{
  // declare independent loop iterations
  forall (int i from 0 to N-1)
  {
    float value = x[i];
    float numer = x[i] * x[i] * x[i];
    int denom = 6; // 3!
    int sign = -1;

    for (int j=1; j<=terms; j++)
    {
      value += sign * numer / denom;
      numer *= x[i] * x[i];
      denom *= (2*j+2) * (2*j+3);
      sign *= -1;
    }

    result[i] = value;
  }
}

CMU 15-418/618, Fall 2016 What about conditional execution?

(Assume the logic below is to be executed for each element of input array ‘A’, producing output into array ‘result’. The diagram plots time in clocks vertically against ALU 1 through ALU 8 horizontally.)

float x = A[i];

if (x > 0) { float tmp = exp(x,5.f);

tmp *= kMyConst1;

x = tmp + kMyConst2; } else { float tmp = kMyConst1;

x = 2.f * tmp; }

result[i] = x;

CMU 15-418/618, Fall 2016 What about conditional execution?

(Assume the logic below is to be executed for each element of input array ‘A’, producing output into array ‘result’. The diagram plots time in clocks vertically against ALU 1 through ALU 8 horizontally.)

float x = A[i];

T T F T F F F F if (x > 0) { float tmp = exp(x,5.f);

tmp *= kMyConst1;

x = tmp + kMyConst2; } else { float tmp = kMyConst1;

x = 2.f * tmp; }

result[i] = x;

CMU 15-418/618, Fall 2016 Mask (discard) output of ALU

(Assume the logic below is to be executed for each element of input array ‘A’, producing output into array ‘result’. The diagram plots time in clocks vertically against ALU 1 through ALU 8 horizontally.)

float x = A[i];

T T F T F F F F if (x > 0) { float tmp = exp(x,5.f);

tmp *= kMyConst1;

x = tmp + kMyConst2; } else { float tmp = kMyConst1;

x = 2.f * tmp; }

result[i] = x;

Not all ALUs do useful work! Worst case: 1/8 peak performance
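To make the masking idea concrete, here is a small illustrative sketch in AVX intrinsics (my example, not the slide's code, and not how the hardware literally implements masking internally). Both branch bodies are computed for all eight lanes, then a per-lane mask selects which result each lane keeps; kMyConst1, kMyConst2, and the stand-in for exp(x, 5.f) are assumed values used only for illustration:

#include <immintrin.h>

static const float kMyConst1 = 3.0f;   // assumed values, for illustration only
static const float kMyConst2 = 1.0f;

// Process 8 elements of A per iteration (N assumed to be a multiple of 8).
// Both branch bodies are evaluated for all 8 lanes; a per-lane mask then
// selects which result is kept, and the other lane's work is discarded.
void branchy(int N, const float* A, float* result)
{
  for (int i = 0; i < N; i += 8)
  {
    __m256 x    = _mm256_loadu_ps(&A[i]);
    __m256 mask = _mm256_cmp_ps(x, _mm256_setzero_ps(), _CMP_GT_OQ);  // lane-wise: x > 0 ?

    // "then" path (simple polynomial stands in for the slide's exp(x, 5.f))
    __m256 tmp_then = _mm256_mul_ps(x, x);
    tmp_then = _mm256_mul_ps(tmp_then, _mm256_set1_ps(kMyConst1));
    __m256 then_val = _mm256_add_ps(tmp_then, _mm256_set1_ps(kMyConst2));

    // "else" path (x = 2.f * kMyConst1)
    __m256 else_val = _mm256_mul_ps(_mm256_set1_ps(2.0f), _mm256_set1_ps(kMyConst1));

    // keep then_val where the lane's condition was true, else_val elsewhere
    __m256 out = _mm256_blendv_ps(else_val, then_val, mask);
    _mm256_storeu_ps(&result[i], out);
  }
}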

CMU 15-418/618, Fall 2016 After branch: continue at full performance

(Assume the logic below is to be executed for each element of input array ‘A’, producing output into array ‘result’. The diagram plots time in clocks vertically against ALU 1 through ALU 8 horizontally.)

float x = A[i];

T T F T F F F F if (x > 0) { float tmp = exp(x,5.f);

tmp *= kMyConst1;

x = tmp + kMyConst2; } else { float tmp = kMyConst1;

x = 2.f * tmp; }

result[i] = x;

CMU 15-418/618, Fall 2016 Terminology ▪ Instruction stream coherence (“coherent execution”) - Same instruction sequence applies to all elements operated upon simultaneously - Coherent execution is necessary for efficient use of SIMD processing resources - Coherent execution IS NOT necessary for efficient parallelization across cores, since each core has the capability to fetch/decode a different instruction stream ▪ “Divergent” execution - A lack of instruction stream coherence ▪ Note: don’t confuse instruction stream coherence with “cache coherence” (a major topic later in the course)

CMU 15-418/618, Fall 2016 SIMD execution on modern CPUs

▪ SSE instructions: 128-bit operations: 4x32 bits or 2x64 bits (4-wide float vectors)

▪ AVX instructions: 256-bit operations: 8x32 bits or 4x64 bits (8-wide float vectors)

▪ Instructions are generated by the compiler - Parallelism explicitly requested by programmer using intrinsics - Parallelism conveyed using parallel language semantics (e.g., forall example) - Parallelism inferred by dependency analysis of loops (hard problem, even the best compilers are not great on arbitrary C/C++ code)

▪ Terminology: “explicit SIMD”: SIMD parallelization is performed at compile time - Can inspect program binary and see instructions (vstoreps, vmulps, etc.)
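For example (my illustration, not from the slide), a loop as simple as the one below is usually auto-vectorized by gcc or clang at -O3 with AVX enabled (e.g., gcc -O3 -mavx -S mul.c), and the generated assembly then contains packed instructions such as vmulps; that is exactly what inspecting the binary reveals under explicit SIMD:

// mul.c: independent iterations, so a vectorizing compiler can emit
// 8-wide AVX instructions (e.g., vmovups/vmulps) for the loop body
void mul(int N, const float* a, const float* b, float* c)
{
  for (int i = 0; i < N; i++)
    c[i] = a[i] * b[i];
}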

CMU 15-418/618, Fall 2016 SIMD execution on many modern GPUs

▪ “Implicit SIMD” - Compiler generates a scalar binary (scalar instructions) - But N instances of the program are *always run* together on the processor execute(my_function, N) // execute my_function N times - In other words, the interface to the hardware itself is data-parallel - Hardware (not compiler) is responsible for simultaneously executing the same instruction from multiple instances on different data on SIMD ALUs

▪ SIMD width of most modern GPUs ranges from 8 to 32 - Divergence can be a big issue (poorly written code might execute at 1/32 the peak capability of the machine!)
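A rough, plain-C sketch of the implicit-SIMD contract (my illustration; the names my_function and execute mirror the slide's pseudocode and are not a real GPU API). The programmer supplies a scalar, per-instance function; the hardware, not the compiler, runs instances together in SIMD groups. A CPU-side stand-in can only model the interface with a loop:

// Programmer writes scalar code for a single instance (one data element).
void my_function(int instance_id, const float* x, float* result)
{
  result[instance_id] = x[instance_id] * x[instance_id];   // scalar work for one element
}

// execute(my_function, N): run N instances of the scalar function.
// On a GPU, hardware groups instances into SIMD batches (e.g., 32-wide warps)
// and issues one instruction for the whole group each clock.
// This loop is only a sequential stand-in for that interface.
void execute(void (*f)(int, const float*, float*), int N,
             const float* x, float* result)
{
  for (int id = 0; id < N; id++)
    f(id, x, result);
}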

CMU 15-418/618, Fall 2016 Example: Intel Core i7

Fetch/ Fetch/ 4 cores Decode Decode 8 SIMD ALUs per core ALU 0 ALU 1 ALU 2 ALU 3 ALU 0 ALU 1 ALU 2 ALU 3

ALU 4 ALU 5 ALU 6 ALU 7 ALU 4 ALU 5 ALU 6 ALU 7 (AVX instructions)

On campus:
New GHC machines: 4 cores, 8 SIMD ALUs per core
Machines in GHC 5207 (old GHC 3000 machines): 6 cores, 4 SIMD ALUs per core
CPUs in the “latedays” cluster: 6 cores, 8 SIMD ALUs per core

CMU 15-418/618, Fall 2016

Example: NVIDIA GTX 480 (in the Gates 5 lab)

15 cores 32 SIMD ALUs per core 1.3 TFLOPS

CMU 15-418/618, Fall 2016 Summary: parallel execution ▪ Several forms of parallel execution in modern processors - Multi-core: use multiple processing cores - Provides thread-level parallelism: simultaneously execute a completely different instruction stream on each core - Software decides when to create threads (e.g., via pthreads API)

- SIMD: use multiple ALUs controlled by same instruction stream (within a core) - Efficient design for data-parallel workloads: control amortized over many ALUs - Vectorization can be done by compiler (explicit SIMD) or at runtime by hardware - [Lack of] dependencies is known prior to execution (usually declared by programmer, but can be inferred by loop analysis by advanced compiler)

- Superscalar: exploit ILP within an instruction stream. Process different instructions from the same instruction stream in parallel (within a core) - Parallelism automatically and dynamically discovered by the hardware during execution (not programmer visible) Not addressed further in this class. That’s for a proper computer architecture design course like 18-447. CMU 15-418/618, Fall 2016 Part 2: accessing memory

CMU 15-418/618, Fall 2016 Terminology ▪ Memory latency - The amount of time for a memory request (e.g., load, store) from a processor to be serviced by the memory system - Example: 100 cycles, 100 nsec

▪ Memory bandwidth - The rate at which the memory system can provide data to a processor - Example: 20 GB/s
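A quick worked contrast using those example numbers (mine, for intuition): at 20 GB/s, streaming a 64-byte cache line takes 64 / 20e9 s, about 3 ns of transfer time, but with 100 ns of access latency the data does not start arriving until roughly 100 ns after the request. Latency describes how long one request takes to be serviced; bandwidth describes how much data the memory system can deliver per second once requests are flowing.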

CMU 15-418/618, Fall 2016 Stalls ▪ A processor “stalls” when it cannot run the next instruction in an instruction stream because of a dependency on a previous instruction.

▪ Accessing memory is a major source of stalls
ld r0 mem[r2]
ld r1 mem[r3]
add r0, r0, r1
Dependency: the ‘add’ instruction cannot execute until the data at mem[r2] and mem[r3] have been loaded from memory

▪ Memory access times ~ 100’s of cycles - Memory “access time” is a measure of latency

CMU 15-418/618, Fall 2016 Review: why do modern processors have caches?

L1 cache (32 KB)

Core 1 L2 cache (256 KB) 25 GB/sec Memory . DDR3 DRAM . L3 cache . (8 MB) (Gigabytes) L1 cache (32 KB)

Core N L2 cache (256 KB)

CMU 15-418/618, Fall 2016 Caches reduce length of stalls (reduce latency)

Processors run efficiently when data is resident in caches Caches reduce memory access latency *

L1 cache (32 KB)

Core 1 L2 cache (256 KB) 25 GB/sec Memory . DDR3 DRAM . L3 cache . (8 MB) (Gigabytes) L1 cache (32 KB)

Core N L2 cache (256 KB)

* Caches also provide high bandwidth data transfer to CPU CMU 15-418/618, Fall 2016 Prefetching reduces stalls (hides latency) ▪ All modern CPUs have logic for prefetching data into caches - Dynamically analyze program’s access patterns, predict what it will access soon

▪ Reduces stalls since data is resident in cache when accessed

predict value of r2, initiate load
predict value of r3, initiate load
...
data arrives in cache
...
data arrives in cache
...
ld r0 mem[r2]   These loads are cache hits
ld r1 mem[r3]
add r0, r0, r1

Note: Prefetching can also reduce performance if the guess is wrong (hogs bandwidth, pollutes caches) (more detail later in course)

CMU 15-418/618, Fall 2016 Multi-threading reduces stalls ▪ Idea: interleave processing of multiple threads on the same core to hide stalls

▪ Like prefetching, multi-threading is a latency hiding, not a latency reducing technique

CMU 15-418/618, Fall 2016 Hiding stalls with multi-threading

Thread 1 Elements 0 … 7 Time

1 Core (1 thread)

Fetch/ Decode

ALU 0 ALU 1 ALU 2 ALU 3

ALU 4 ALU 5 ALU 6 ALU 7

Exec Ctx

CMU 15-418/618, Fall 2016 Hiding stalls with multi-threading

Thread 1 Thread 2 Thread 3 Thread 4 Elements 0 … 7 Elements 8 … 15 Elements 16 … 23 Elements 24 … 31 Time

1 2 3 4

1 Core (4 hardware threads)

Fetch/ Decode

ALU 0 ALU 1 ALU 2 ALU 3

ALU 4 ALU 5 ALU 6 ALU 7

1 2

3 4

CMU 15-418/618, Fall 2016 Hiding stalls with multi-threading

Thread 1 Thread 2 Thread 3 Thread 4 Elements 0 … 7 Elements 8 … 15 Elements 16 … 23 Elements 24 … 31 Time

1 2 3 4

1 Core (4 hardware threads) Stall Fetch/ Decode

ALU 0 ALU 1 ALU 2 ALU 3

ALU 4 ALU 5 ALU 6 ALU 7 Runnable 1 2

3 4

CMU 15-418/618, Fall 2016 Hiding stalls with multi-threading

Thread 1 Thread 2 Thread 3 Thread 4 Elements 0 … 7 Elements 8 … 15 Elements 16 … 23 Elements 24 … 31 Time

1 2 3 4

1 Core (4 hardware threads) Stall Fetch/ Decode Stall ALU 0 ALU 1 ALU 2 ALU 3

ALU 4 ALU 5 ALU 6 ALU 7 Runnable Stall 1 2 Stall Runnable 3 4 Runnable Done! Runnable Done!

CMU 15-418/618, Fall 2016 Throughput computing trade-off

Thread 1 Thread 2 Thread 3 Thread 4 Elements 0 … 7 Elements 8 … 15 Elements 16 … 23 Elements 24 … 31 Time

Stall

Key idea of throughput-oriented systems: potentially increase the time to complete work by any one thread, in order to increase overall system throughput when running multiple threads.

Runnable During this time, this thread is runnable, but it is not being executed by the processor. (The core is running some other thread.)

Done!

CMU 15-418/618, Fall 2016 Storing execution contexts Consider the on-chip storage for execution contexts a finite resource.

Fetch/ Decode

ALU 0 ALU 1 ALU 2 ALU 3

ALU 4 ALU 5 ALU 6 ALU 7

Context storage (or L1 cache)

CMU 15-418/618, Fall 2016 Many small contexts (high latency hiding ability)
1 core (16 hardware threads, storage for a small working set per thread)

[Diagram: Fetch/Decode, ALU 0 through ALU 7, and 16 small execution contexts numbered 1 through 16]

CMU 15-418/618, Fall 2016 Four large contexts (low latency hiding ability)

1 core (4 hardware threads, storage for a larger working set per thread)

[Diagram: Fetch/Decode, ALU 0 through ALU 7, and 4 large execution contexts numbered 1 through 4]

CMU 15-418/618, Fall 2016 Hardware-supported multi-threading ▪ Core manages execution contexts for multiple threads - Runs instructions from runnable threads (processor makes decision about which thread to run each clock, not the operating system) - Core still has the same number of ALU resources: multi-threading only helps use them more efficiently in the face of high-latency operations like memory access

▪ Interleaved multi-threading (a.k.a. temporal multi-threading) - What I described on the previous slides: each clock, the core chooses a thread, and runs an instruction from the thread on the ALUs ▪ Simultaneous multi-threading (SMT) - Each clock, core chooses instructions from multiple threads to run on ALUs - Extension of superscalar CPU design - Example: Intel Hyper-threading (2 threads per core)

CMU 15-418/618, Fall 2016 Multi-threading summary ▪ Benefit: use a core’s ALU resources more efficiently - Hide memory latency - Fill multiple functional units of a superscalar architecture (when one thread has insufficient ILP)

▪ Costs - Requires additional storage for thread contexts - Increases run time of any single thread (often not a problem, we usually care about throughput in parallel apps) - Requires additional independent work in a program (more independent work than ALUs!) - Relies heavily on memory bandwidth - More threads → larger working set → less cache space per thread - May go to memory more often, but can hide the latency

CMU 15-418/618, Fall 2016 Our fictitious multi-core chip

16 cores

8 SIMD ALUs per core (128 total)

4 threads per core

16 simultaneous instruction streams (one per core each clock)

64 total concurrent instruction streams (16 cores × 4 threads per core)

512 independent pieces of work are needed to run the chip with maximal latency hiding ability (64 streams × 8 SIMD lanes)

CMU 15-418/618, Fall 2016 GPUs: Extreme throughput-oriented processors

NVIDIA GTX 480 core

[Diagram: Fetch/Decode; 16 SIMD function units, control shared across the 16 units (1 MUL-ADD per clock each); execution contexts (128 KB); “shared” memory (16+48 KB)]

• Instructions operate on 32 pieces of data at a time (called “warps”)
• Think: warp = thread issuing 32-wide vector instructions
• Up to 48 warps are simultaneously interleaved
• Over 1500 elements can be processed concurrently by a core

Source: Fermi Compute Architecture Whitepaper; CUDA Programming Guide 3.1, Appendix G

CMU 15-418/618, Fall 2016 NVIDIA GTX 480: more detail (just for the curious)

NVIDIA GTX 480 core

Fetch/ = SIMD function unit, Decode control shared across 16 units (1 MUL-ADD per clock)

• Why is a warp 32 elements when there are only 16 SIMD ALUs?
• It’s a bit complicated: the ALUs run at twice the clock rate of the rest of the chip, so each decoded instruction runs on 32 pieces of data on the 16 ALUs over two ALU clocks (but to the programmer, it behaves like a 32-wide SIMD operation)

Source: Fermi Compute Architecture Whitepaper; CUDA Programming Guide 3.1, Appendix G

CMU 15-418/618, Fall 2016 NVIDIA GTX 480: more detail (just for the curious)

NVIDIA GTX 480 core

Fetch/ = SIMD function unit, Decode control shared across 16 units (1 MUL-ADD per clock)

• This process occurs on another set of 16 ALUs as well
• So there are 32 ALUs per core
• 15 cores × 32 = 480 ALUs per chip

Source: Fermi Compute Architecture Whitepaper CUDA Programming Guide 3.1, Appendix G

CMU 15-418/618, Fall 2016 NVIDIA GTX 480

Recall, there are 15 cores on the GTX 480: That’s 23,000 pieces of data being processed concurrently!

CMU 15-418/618, Fall 2016 CPU vs. GPU memory hierarchies

[CPU diagram: per-core L1 cache (32 KB) and L2 cache (256 KB), shared L3 cache (8 MB), 25 GB/sec to DDR3 DRAM (gigabytes)]
CPU: big caches, few threads, modest memory BW. Rely mainly on caches and prefetching.

[GPU diagram: per-core execution contexts (128 KB), scratchpad/L1 cache (64 KB), GFX texture cache (12 KB), shared L2 cache (768 KB), 177 GB/sec to DDR5 DRAM (~1 GB)]
GPU: small caches, many threads, huge memory BW. Rely mainly on multi-threading.

CMU 15-418/618, Fall 2016 Thought experiment
Task: element-wise multiplication of two vectors A and B. Assume the vectors contain millions of elements.
- Load input A[i]
- Load input B[i]
- Compute A[i] × B[i]
- Store result into C[i]

Three memory operations (12 bytes) for every MUL NVIDIA GTX 480 GPU can do 480 MULs per clock (@ 1.2 GHz) Need ~6.4 TB/sec of bandwidth to keep functional units busy (only have 177 GB/sec)
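Where the ~6.4 TB/sec figure comes from (worked out here; the arithmetic is not spelled out on the slide): 480 MULs per clock at 1.2 GHz is about 576 billion MULs per second, and at 12 bytes of traffic per MUL that is roughly 576e9 × 12 ≈ 6.9 × 10^12 bytes/sec, or about 6.3 TB/sec counting 1 TB as 2^40 bytes. That is roughly 40 times the 177 GB/sec the card actually provides, which is where the few-percent efficiency below comes from.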

~ 3% efficiency… but 7x faster than quad-core CPU! (2.6 GHz Core i7 Gen 4 quad-core CPU connected to 25 GB/sec memory will exhibit similar efficiency on this computation) CMU 15-418/618, Fall 2016 Bandwidth limited!

If processors request data at too high a rate, the memory system cannot keep up. No amount of latency hiding helps this.

Overcoming bandwidth limits is a common challenge for application developers on throughput-optimized systems.

CMU 15-418/618, Fall 2016 Bandwidth is a critical resource

Performant parallel programs will:

▪ Organize computation to fetch data from memory less often - Reuse data previously loaded by the same thread (traditional intra-thread temporal locality optimizations) - Share data across threads (inter-thread cooperation)

▪ Request data less often (instead, do more arithmetic: it’s “free”) - Useful term: “arithmetic intensity” — ratio of math operations to data access operations in an instruction stream - Main point: programs must have high arithmetic intensity to utilize modern processors efficiently
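A worked example of arithmetic intensity (my numbers, building on the earlier thought experiment): the element-wise vector multiply performs one math operation per 12 bytes accessed, an intensity of roughly 0.08 ops/byte. Keeping the GTX 480's 480 MULs per clock at 1.2 GHz (about 576 GMUL/sec) busy from 177 GB/sec of bandwidth requires roughly 576/177 ≈ 3.3 math operations per byte, i.e. on the order of 40 operations for every 12-byte element touched. Kernels far below that ratio are bandwidth bound no matter how many ALUs or threads the chip has.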

CMU 15-418/618, Fall 2016 Summary ▪ Three major ideas that all modern processors employ to varying degrees - Employ multiple processing cores - Simpler cores (embrace thread-level parallelism over instruction-level parallelism) - Amortize instruction stream processing over many ALUs (SIMD) - Increase compute capability with little extra cost - Use multi-threading to make more efficient use of processing resources (hide latencies, fill all available resources)

▪ Due to high arithmetic capability on modern chips, many parallel applications (on both CPUs and GPUs) are bandwidth bound

▪ GPU architectures use the same throughput computing ideas as CPUs: but GPUs push these concepts to extreme scales

CMU 15-418/618, Fall 2016 For the rest of this class, know these terms

▪ Multi-core processor ▪ SIMD execution ▪ Coherent control flow ▪ Hardware multi-threading - Interleaved multi-threading - Simultaneous multi-threading ▪ Memory latency ▪ Memory bandwidth ▪ Bandwidth bound application ▪ Arithmetic intensity

CMU 15-418/618, Fall 2016 Another example: for review and to check your understanding (if you understand the following sequence you understand this lecture)

CMU 15-418/618, Fall 2016 Running code on a simple processor
My very simple program: compute sin(x) using Taylor expansion
[Diagram: my very simple processor: Fetch/Decode, ALU (Execute), Execution Context]

void sinx(int N, int terms, float* x, float* result)
{
  for (int i=0; i<N; i++)
  {
    float value = x[i];
    float numer = x[i] * x[i] * x[i];
    int denom = 6; // 3!
    int sign = -1;

    for (int j=1; j<=terms; j++)
    {
      value += sign * numer / denom;
      numer *= x[i] * x[i];
      denom *= (2*j+2) * (2*j+3);
      sign *= -1;
    }

    result[i] = value;
  }
}

CMU 15-418/618, Fall 2016 Review: superscalar execution

Unmodified program void sinx(int N, int terms, float* x, float* result) My single core, superscalar processor: { executes up to two instructions per clock for (int i=0; i<N; i++)

Exec Exec for (int j=1; j<=terms; j++) 1 2 { value += sign * numer / denom; Execution numer *= x[i] * x[i]; Context denom *= (2*j+2) * (2*j+3); sign *= -1; } Independent operations in instruction stream result[i] = value; (They are detected by the processor } at run-time and may be executed in } parallel on execution units 1 and 2) CMU 15-418/618, Fall 2016 Review: multi-core execution (two cores) Modify program to create two threads of control (two instruction streams) typedef struct { int N; My dual-core processor: int terms; executes one instruction per clock float* x; float* result; from an instruction stream on each core. } my_args; void parallel_sinx(int N, int terms, float* x, float* result) Fetch/ Fetch/ { Decode Decode pthread_t thread_id; my_args args; ALU ALU (Execute) (Execute)

args.N = N/2; Execution Execution args.terms = terms; Context Context args.x = x; args.result = result;

pthread_create(&thread_id, NULL, my_thread_start, &args); // launch thread sinx(N - args.N, terms, x + args.N, result + args.N); // do work pthread_join(thread_id, NULL); } void my_thread_start(void* thread_arg) { my_args* thread_args = (my_args*)thread_arg; sinx(args->N, args->terms, args->x, args->result); // do work } CMU 15-418/618, Fall 2016 Review: multi-core + superscalar execution Modify program to create two threads of control (two instruction streams) typedef struct { int N; My superscalar dual-core processor: int terms; executes up to two instructions per clock float* x; float* result; from an instruction stream on each core. } my_args; void parallel_sinx(int N, int terms, float* x, float* result) Fetch/ Fetch/ Fetch/ Fetch/ Decode Decode Decode Decode { pthread_t thread_id; Exec Exec Exec Exec my_args args; 1 2 1 2

args.N = N/2; Execution Execution args.terms = terms; Context Context args.x = x; args.result = result;

pthread_create(&thread_id, NULL, my_thread_start, &args); // launch thread sinx(N - args.N, terms, x + args.N, result + args.N); // do work pthread_join(thread_id, NULL); } void my_thread_start(void* thread_arg) { my_args* thread_args = (my_args*)thread_arg; sinx(args->N, args->terms, args->x, args->result); // do work } CMU 15-418/618, Fall 2016 Review: multi-core (four cores) Modify program to create many threads of control: recall our fctitious language My quad-core processor: void sinx(int N, int terms, float* x, float* result) { executes one instruction per clock // declare independent loop iterations from an instruction stream on each core. forall (int i from 0 to N-1) { float value = x[i]; Fetch/ Fetch/ Decode Decode float numer = x[i] * x[i] * x[i]; ALU ALU int denom = 6; // 3! (Execute) (Execute) int sign = -1; Execution Execution Context Context for (int j=1; j<=terms; j++) { value += sign * numer / denom numer *= x[i] * x[i]; Fetch/ Fetch/ Decode Decode denom *= (2*j+2) * (2*j+3); sign *= -1; ALU ALU (Execute) (Execute) } Execution Execution Context Context result[i] = value; } }

CMU 15-418/618, Fall 2016 Review: four, 8-wide SIMD cores Observation: program must execute many iterations of the same loop body. Optimization: share instruction stream across execution of multiple iterations (single instruction multiple data = SIMD) My SIMD quad-core processor: void sinx(int N, int terms, float* x, float* result) executes one 8-wide SIMD instruction per clock { from an instruction stream on each core. // declare independent loop iterations forall (int i from 0 to N-1) { Fetch/ Fetch/ Decode Decode float value = x[i]; float numer = x[i] * x[i] * x[i]; int denom = 6; // 3! Execution Execution int sign = -1; Context Context

for (int j=1; j<=terms; j++) { value += sign * numer / denom Fetch/ Fetch/ numer *= x[i] * x[i]; Decode Decode denom *= (2*j+2) * (2*j+3); sign *= -1; } Execution Execution Context Context

result[i] = value; } } CMU 15-418/618, Fall 2016 Review: four SIMD, multi-threaded cores Observation: memory operations have very long latency Solution: hide latency of loading data for one iteration by My multi-threaded, SIMD quad-core processor: executing arithmetic instructions from other iterations executes one SIMD instruction per clock

void sinx(int N, int terms, float* x, float* result) from one instruction stream on each core. But { it can switch to processing the other instruction // declare independent loop iterations stream when faced with a stall. forall (int i from 0 to N-1) { Fetch/ Fetch/ float value = x[i]; Memory load Decode Decode float numer = x[i] * x[i] * x[i]; int denom = 6; // 3! int sign = -1; Execution Execution Execution Execution Context Context Context Context

for (int j=1; j<=terms; j++) {

value += sign * numer / denom Fetch/ Fetch/ numer *= x[i] * x[i]; Decode Decode denom *= (2*j+2) * (2*j+3); sign *= -1; Memory store Execution Execution Execution Execution } Context Context Context Context

result[i] = value; } } CMU 15-418/618, Fall 2016 Summary: four superscalar, SIMD, multi-threaded cores My multi-threaded, superscalar, SIMD quad-core processor: executes up to two instructions per clock from one instruction stream on each core (in this example: one SIMD instruction + one scalar instruction). Processor can switch to execute the other instruction stream when faced with stall.

Fetch/ Fetch/ Fetch/ Fetch/ Decode Decode Decode Decode

SIMD Exec 2 SIMD Exec 2 Exec 1 Exec 1

Execution Execution Execution Execution Context Context Context Context

Fetch/ Fetch/ Fetch/ Fetch/ Decode Decode Decode Decode

SIMD Exec 2 SIMD Exec 2 Exec 1 Exec 1

Execution Execution Execution Execution Context Context Context Context

CMU 15-418/618, Fall 2016 Connecting it all together Our simple quad-core processor: Four cores, two-way multi-threading per core (max eight threads active on chip at once), up to two instructions per clock per core (one of those instructions is 8-wide SIMD)

Fetch/ Fetch/ Fetch/ Fetch/ Fetch/ Fetch/ Fetch/ Fetch/ Decode Decode Decode Decode Decode Decode Decode Decode

SIMD Exec 2 SIMD Exec 2 SIMD Exec 2 SIMD Exec 2 Exec 1 Exec 1 Exec 1 Exec 1

Execution Execution Execution Execution Execution Execution Execution Execution Context Context Context Context Context Context Context Context

L1 Cache L1 Cache L1 Cache L1 Cache

L2 Cache L2 Cache L2 Cache L2 Cache

On-chip interconnect

Memory L3 Cache Controller

Memory Bus (to DRAM)

CMU 15-418/618, Fall 2016 Thought experiment
▪ You write a C application that spawns two pthreads
▪ The application runs on the processor shown below - Two cores, two execution contexts per core, up to two instructions per clock, one of which can be an 8-wide SIMD instruction
▪ Question: “who” is responsible for mapping your pthreads to the processor’s thread execution contexts? Answer: the operating system
▪ Question: If you were the OS, how would you assign the two threads to the four available execution contexts?
▪ Another question: How would you assign threads to execution contexts if your C program spawned five pthreads?

[Diagram: two cores, each with two Fetch/Decode units, SIMD Exec 1 and Exec 2, and two Execution Contexts]

CMU 15-418/618, Fall 2016