
Lecture 2: Modern Multi-Core Architecture (multiple cores + SIMD + threads)

(and their performance characteristics)

CMU 15-418: Parallel Computer Architecture and Programming (Spring 2012)

Announcements ▪ WWW - http://www.cs.cmu.edu/afs/cs/academic/class/15418-s12/www/

▪ We have another TA! - Mike Mu (Yes, our TAs are Michael and Mike)

Review

1. Why has single threaded performance topped out in recent years?

2. What prevented us from obtaining maximum speedup from the parallel programs we wrote last time?

Today ▪ Today we will talk about architecture

▪ Four key concepts about how modern computers work - Two concern parallel execution - Two concern access to memory

▪ Knowing some architecture will help you - Understand and optimize the performance of your programs - Gain intuition about what workloads might benefit from fast parallel machines

Part 1: parallel execution

Example program

Compute sin(x) using its Taylor expansion: sin(x) = x - x^3/3! + x^5/5! - x^7/7! + ...

void sinx(int N, int terms, float* x, float* result)
{
  for (int i=0; i<N; i++)
  {
    float value = x[i];
    float numer = x[i] * x[i] * x[i];
    int denom = 6;  // 3!
    int sign = -1;

    for (int j=1; j<=terms; j++)
    {
      value += sign * numer / denom;
      numer *= x[i] * x[i];
      denom *= (2*j+2) * (2*j+3);
      sign *= -1;
    }

    result[i] = value;
  }
}
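A minimal driver, not part of the lecture slides, showing how the function above might be exercised and sanity-checked against libm's sinf (the array values and term count are arbitrary; link with -lm):

#include <math.h>
#include <stdio.h>

void sinx(int N, int terms, float* x, float* result);  // the function above

int main(void)
{
    float x[4] = {0.1f, 0.5f, 1.0f, 2.0f};
    float result[4];

    sinx(4, 10, x, result);   // 10 Taylor terms per element
    for (int i = 0; i < 4; i++)
        printf("sinx(%.1f) = %f   (libm sinf: %f)\n", x[i], result[i], sinf(x[i]));
    return 0;
}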

Compile program

void sinx(int N, int terms, float* x, float* result), when compiled, becomes a scalar instruction stream executed once per loop iteration: load x[i], a sequence of multiplies, adds, and divides, then store result[i]:

  ld  r0, addr[r1]
  mul r1, r0, r0
  mul r1, r1, r0
  ...
  st  addr[r2], r0

Execute program

[Diagram: a very simple processor (Fetch/Decode, ALU (Execute), Execution Context) runs the compiled instruction stream for one element, from loading x[i] to storing result[i]:
  ld  r0, addr[r1]
  mul r1, r0, r0
  mul r1, r1, r0
  ...
  st  addr[r2], r0 ]

Execute program: Very simple processor, executing one instruction per clock

[Diagram, repeated across three slides: the program counter (PC) advances one instruction per clock through the stream (ld r0, addr[r1]; mul r1, r0, r0; mul r1, r1, r0; ...; st addr[r2], r0) on a processor with one Fetch/Decode unit, one ALU (Execute), and one Execution Context.]

Recall from last class: instruction level parallelism (ILP). Decode and execute two instructions per clock (if possible).

[Diagram: a core with two Fetch/Decode units and two execution units (Exec 1, Exec 2) working on the same instruction stream from x[i] to result[i].]

Note: No ILP exists in this region of the program (each mul depends on the result of the instruction before it).

Aside: Pentium 4

CPU: pre multi-core era

Majority of chip transistors perform operations that help a single instruction stream run fast.

[Diagram: one core with Fetch/Decode, ALU (Execute), and Execution Context, surrounded by a data cache (a big one), out-of-order control logic, a fancy branch predictor, and a memory pre-fetcher.]

More transistors = more cache, smarter out-of-order logic, smarter branch predictor, etc.
(Also: more transistors --> smaller transistors --> higher clock rates)

CPU: multi-core era

Idea #1: Decrease the sophistication of out-of-order and speculative operations in a single core, and use the increasing transistor count to provide more processing cores instead.

[Diagram: a slimmed-down core: Fetch/Decode, ALU (Execute), Execution Context.]

Two cores: compute two elements in parallel

[Diagram: two cores, each with its own Fetch/Decode, ALU (Execute), and Execution Context, each running the same instruction stream (ld, mul, mul, ..., st): one on element x[i] producing result[i], the other on x[j] producing result[j].]

Simpler cores: each core is slower than the original “fat” core (e.g., 25% slower). But there are now two of them: 2 × 0.75 = 1.5 (speedup!)

But our program expresses no parallelism

The sinx program above, compiled with gcc, runs as one thread on one processor core. Adding a second core does not make it any faster.

Expressing parallelism using pthreads

typedef struct {
  int N;
  int terms;
  float* x;
  float* result;
} my_args;

void parallel_sinx(int N, int terms, float* x, float* result)
{
  pthread_t thread_id;
  my_args args;

  args.N = N/2;
  args.terms = terms;
  args.x = x;
  args.result = result;

  pthread_create(&thread_id, NULL, my_thread_start, &args);  // launch thread
  sinx(N - args.N, terms, x + args.N, result + args.N);      // do work on the other half
  pthread_join(thread_id, NULL);
}

void* my_thread_start(void* thread_arg)
{
  my_args* thread_args = (my_args*)thread_arg;
  sinx(thread_args->N, thread_args->terms, thread_args->x, thread_args->result);  // do work
  return NULL;
}

Data-parallel expression (in Kayvon’s fictitious data-parallel language)

void sinx(int N, int terms, float* x, float* result)
{
  // declare independent loop iterations
  forall (int i from 0 to N-1)
  {
    float value = x[i];
    float numer = x[i] * x[i] * x[i];
    int denom = 6;  // 3!
    int sign = -1;

    for (int j=1; j<=terms; j++)
    {
      value += sign * numer / denom;
      numer *= x[i] * x[i];
      denom *= (2*j+2) * (2*j+3);
      sign *= -1;
    }

    result[i] = value;
  }
}

Parallelism is visible to the compiler, facilitating automatic generation of parallel code.
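Real C has no forall, but the same "iterations are independent" declaration can be made with OpenMP. The sketch below is mine, not part of the lecture's fictitious language; sinx_omp is an assumed name, and it must be compiled with an OpenMP-enabled compiler (e.g., gcc -fopenmp):

#include <omp.h>

// Same computation as above; the pragma tells the compiler the iterations
// are independent, so it may distribute them across threads.
void sinx_omp(int N, int terms, float* x, float* result)
{
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
    {
        float value = x[i];
        float numer = x[i] * x[i] * x[i];
        int denom = 6;   // 3!
        int sign = -1;

        for (int j = 1; j <= terms; j++)
        {
            value += sign * numer / denom;
            numer *= x[i] * x[i];
            denom *= (2*j+2) * (2*j+3);
            sign *= -1;
        }
        result[i] = value;
    }
}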

Four cores: compute four elements in parallel

[Diagram: four cores, each with its own Fetch/Decode, ALU (Execute), and Execution Context.]

Sixteen cores: compute sixteen elements in parallel

Sixteen cores, sixteen simultaneous instruction streams

Multi-core examples

Intel Core i7 quad-core CPU (2010); NVIDIA Tesla GPU (2009): 16 cores

Add ALUs to increase compute capability

Idea #2: Amortize the cost/complexity of managing an instruction stream across many ALUs.

SIMD processing: Single Instruction, Multiple Data. The same instruction is broadcast to all ALUs and executed in parallel on all of them.

[Diagram: one core with a single Fetch/Decode unit driving ALU 1 through ALU 8, eight execution contexts (Ctx), and shared context data.]

Add ALUs to increase compute capability

[Diagram: the same SIMD core (Fetch/Decode, ALU 1-8, eight Ctx, shared context data) running the original compiled program:
  ld  r0, addr[r1]
  mul r1, r0, r0
  mul r1, r1, r0
  ...
  st  addr[r2], r0 ]

Original compiled program: processes one fragment using scalar instructions on scalar registers (e.g., 32-bit floats).

Scalar program

The original compiled sinx (shown above) processes a single array element at a time using scalar instructions on scalar registers.

Vector program (using AVX intrinsics)

#include <immintrin.h>

void sinx(int N, int terms, float* x, float* result)
{
  float three_fact = 6;  // 3!
  for (int i=0; i<N; i+=8)
  {
    __m256 origx = _mm256_load_ps(&x[i]);
    __m256 value = origx;
    __m256 numer = _mm256_mul_ps(origx, _mm256_mul_ps(origx, origx));
    __m256 denom = _mm256_broadcast_ss(&three_fact);
    int sign = -1;

    for (int j=1; j<=terms; j++)
    {
      // value += sign * numer / denom
      __m256 tmp = _mm256_div_ps(_mm256_mul_ps(_mm256_set1_ps((float)sign), numer), denom);
      value = _mm256_add_ps(value, tmp);

      numer = _mm256_mul_ps(numer, _mm256_mul_ps(origx, origx));
      denom = _mm256_mul_ps(denom, _mm256_set1_ps((float)((2*j+2) * (2*j+3))));
      sign *= -1;
    }
    _mm256_store_ps(&result[i], value);
  }
}

Intrinsics available to C programmers.

Vector program (using AVX intrinsics)

The compiled vector program processes eight array elements simultaneously; the resulting instruction stream uses vector loads, vector multiplies, and vector stores:

  vloadps  xmm0, addr[r1]
  vmulps   xmm1, xmm0, xmm0
  vmulps   xmm1, xmm1, xmm0
  ...
  vstoreps addr[r2], xmm0

16 SIMD cores: 128 elements in parallel

16 cores, 128 ALUs, 16 simultaneous instruction streams

Data-parallel expression (in Kayvon’s fictitious data-parallel language)

void sinx(int N, int terms, float* x, float* result)
{
  // declare independent loop iterations
  forall (int i from 0 to N-1)
  {
    ... same body as before ...
  }
}

Parallelism is visible to the compiler, which facilitates automatic generation of vector instructions.

What about conditional execution?

[Diagram: time (clocks) runs downward; ALU 1 through ALU 8 each process one element of the following code:]

if (x > 0) {
  y = exp(x, 5.f);
  y *= Ks;
  refl = y + Ka;
} else {
  x = 0;
  refl = Ka;
}

What about conditional execution?

[Same diagram: each of the eight lanes evaluates the condition x > 0; the per-lane outcomes are T T F T F F F F, so different lanes need different sides of the if/else.]

Mask (discard) output of ALU

[Diagram: the core executes both the "if" and the "else" clause on all eight ALUs; in each clause, the output of the ALUs whose lanes did not take that clause is masked (discarded).]

Not all ALUs do useful work! Worst case: 1/8 peak performance.
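To make the masking concrete, here is a minimal sketch (mine, not from the lecture) of how the if/else above can be expressed with AVX intrinsics: both sides are computed for every lane, and _mm256_blendv_ps selects per-lane results using the comparison mask. The constants Ks and Ka, and using a squaring in place of exp(x, 5.f), are assumptions for illustration.

#include <immintrin.h>

// Illustrative per-lane select instead of a branch:
// refl[i] = (x[i] > 0) ? f(x[i])*Ks + Ka : Ka, for 8 floats at a time.
void shade8(const float* x, float* refl, float Ks, float Ka)
{
    __m256 vx   = _mm256_loadu_ps(x);
    __m256 zero = _mm256_setzero_ps();

    // Per-lane mask: all-ones where x > 0, all-zeros elsewhere
    __m256 mask = _mm256_cmp_ps(vx, zero, _CMP_GT_OQ);

    // "Then" clause, computed for every lane (squaring stands in for exp(x, 5.f))
    __m256 y      = _mm256_mul_ps(vx, vx);
    __m256 then_v = _mm256_add_ps(_mm256_mul_ps(y, _mm256_set1_ps(Ks)),
                                  _mm256_set1_ps(Ka));

    // "Else" clause, computed for every lane
    __m256 else_v = _mm256_set1_ps(Ka);

    // Keep then_v where the mask is set, else_v elsewhere (masking the discarded work)
    __m256 result = _mm256_blendv_ps(else_v, then_v, mask);
    _mm256_storeu_ps(refl, result);
}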

After branch: continue at full performance

[Diagram: once all lanes are past the divergent region, every ALU again does useful work on every clock.]

Terminology ▪ “Coherent” execution - All elements executed upon simultaneously use the same instruction sequence - Coherent execution is required for efficient use of SIMD processing resources

▪ “Divergent” execution - A lack of coherence

▪ Not to be confused with “cache coherence” (a major topic later in the course)

SIMD execution on modern CPUs

▪ SSE instructions: 128-bit operations: 4x32 bits or 2x64 bits (4-wide float vectors)

▪ AVX instructions: 256-bit operations: 8x32 bits or 4x64 bits (8-wide float vectors) - New in 2011

▪ Instructions generated by compiler - Parallelism explicitly given by using intrinsics - Parallelism conveyed by parallel language semantics (e.g., forall example) - Parallelism inferred by dependency analysis of loops (not that great)

▪ “Explicit SIMD”: SIMD parallelization performed at compile time - Can inspect program binary and see instructions (vstoreps, vmulps, etc.)
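As a small illustration of the "parallelism inferred by dependency analysis of loops" and "inspect the program binary" points above (my example, not from the slides; scale_add is a hypothetical name): a loop with no cross-iteration dependences that compilers such as gcc or clang will typically auto-vectorize at -O3 (e.g., gcc -O3 -mavx). Disassembling the result (objdump -d) then shows packed instructions such as vmulps and vaddps.

// A loop simple enough for a compiler's dependency analysis to vectorize:
// no iteration reads a value written by another iteration.
void scale_add(int N, float a, const float* x, const float* y, float* out)
{
    for (int i = 0; i < N; i++)
        out[i] = a * x[i] + y[i];   // each iteration is independent
}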

SIMD execution on modern GPUs

▪ “Implicit SIMD” - Compiler generates a scalar binary - Program is *always run* with N instances: run(my_function, N) - N much greater than number of SIMD ALUs in a core - Hardware maps instances to ALUs

▪ SIMD width of modern GPUs is 32 to 64 floats ** - Divergence can be a big issue

** It’s a little more complicated than that, but if you think about it this way it’s fine

Example: Intel Core i7

4 cores 8 SIMD ALUs per core

Machines in GHC 5205: 4 cores 4 SIMD ALUs per core

Example: NVIDIA GTX 480

15 cores, 32 SIMD ALUs per core, 1.3 TFLOPS (in GHC 5205)

Summary: parallel execution ▪ Several types of parallel execution in modern processors - Multi-core: use multiple cores - Provides thread-level parallelism: simultaneously run completely different instruction streams - Software decides when to create threads

- SIMD: use multiple ALUs controlled by the same instruction stream (within a core) - Efficient: control amortized over many ALUs - Vectorization can be done by compiler or by HW - [Lack of] dependencies known prior to execution (declared or inferred)

- Superscalar: exploit ILP. Process different instructions from the same instruction stream in parallel (within a core) - Done automatically by the hardware, dynamically during execution (not programmer visible) - Not addressed further in this class

Part 2: memory

Terminology ▪ Memory latency - The amount of time for a memory request (e.g., load, store) from a processor to be serviced by the memory system - Example: 100 cycles, 100 nsec

▪ Memory bandwidth - The rate at which the memory system can provide data to a processor - Example: 20 GB/s

Stalls ▪ A processor “stalls” when it cannot run the next instruction in an instruction stream because of a dependency on a previous instruction.

▪ Accessing memory is a major source of stalls
  ld  r0, mem[r2]
  ld  r1, mem[r3]
  add r0, r0, r1   (dependency: the add cannot issue until both loads complete)

▪ Memory access times ~ 100’s of cycles - LATENCY: the time it takes to complete an operation

Caches: reduce stalls (reduce latency) Processors run efficiently when data is resident in caches (caches reduce memory access latency, also provide high bandwidth)

[Diagram: each core (Core 1 ... Core N) has its own L1 cache (32 KB) and L2 cache (256 KB); all cores share an L3 cache (8 MB), which connects to DDR3 DRAM memory (gigabytes) at 21 GB/sec.]

Prefetching: reduces stalls (hides latency) ▪ All modern CPUs have logic for prefetching data into caches - Analyze program’s access patterns, predict what it will access soon

▪ Reduces stalls since data is resident in cache when accessed

predict value of r2, initiate load
predict value of r3, initiate load
...
... data arrives in cache
... data arrives in cache
...
ld  r0, mem[r2]   (these loads now hit in cache)
ld  r1, mem[r3]
add r0, r0, r1

Note: Prefetching can also reduce performance if the guess is wrong (hogs bandwidth, pollutes caches) (more detail later in course)
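Prefetching is usually done by hardware, but software can also issue prefetch hints. A minimal sketch (not from the lecture) using GCC/Clang's __builtin_prefetch, where the look-ahead distance AHEAD is an assumed tuning parameter:

// Illustrative software prefetch: request a[i + AHEAD] while working on a[i].
#define AHEAD 16   // assumed look-ahead distance, not a value from the lecture

float sum_with_prefetch(const float* a, int n)
{
    float sum = 0.f;
    for (int i = 0; i < n; i++) {
        if (i + AHEAD < n)
            __builtin_prefetch(&a[i + AHEAD]);  // hint: bring this element into cache
        sum += a[i];
    }
    return sum;
}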

Multi-threading: reduces stalls ▪ Idea: Interleave processing of multiple threads on the same core to hide stalls

▪ Like prefetching, a latency hiding technique

Hiding stalls with multi-threading

[Diagram: 1 core (1 thread): Fetch/Decode, ALU 1-8, eight context slots (Ctx), and shared context data; Thread 1 processes elements 1 ... 8, with time (clocks) running downward.]

Hiding stalls with multi-threading

[Diagram: 1 core (4 threads): the core's context storage is partitioned into four contexts. Thread 1 processes elements 1 ... 8, Thread 2 elements 9 ... 16, Thread 3 elements 17 ... 24, Thread 4 elements 25 ... 32.]

Hiding stalls with multi-threading

[Diagram: when Thread 1 stalls (e.g., on a memory access), the core switches to Thread 2; Threads 3 and 4 are runnable, waiting their turn.]

Hiding stalls with multi-threading

[Diagram: each thread runs until it stalls, and the core moves on to the next runnable thread; by the time execution returns to Thread 1, its memory request has been serviced, so the stall is hidden. Eventually all four threads reach "Done!".]

Throughput computing

[Same four-thread diagram.] Key idea: increase the time to completion of any one thread in order to increase overall system throughput when running multiple threads.

Legend: Stall; Runnable (runnable, but not being executed by the core); Done!

Storing contexts

[Diagram: a core with Fetch/Decode, ALU 1-8, and a pool of on-chip context storage (or L1 cache) that can be partitioned into execution contexts.]

Many small contexts (high latency hiding ability)

[Diagram: 1 core (18 threads): the context storage is divided into eighteen small contexts.]

Four large contexts (low latency hiding ability)

[Diagram: 1 core (4 threads): the context storage is divided into four large contexts.]

Hardware-supported multi-threading ▪ Core manages execution contexts for multiple threads - Runs instructions from runnable threads (chip makes the decision each clock, not the OS) - Remember: the core still has the same number of ALU resources

▪ Interleaved multi-threading (a.k.a. temporal multi-threading) - What I’ve described here - Each clock, core chooses a thread to run on the ALUs - This is GPU-style multi-threading (GPUs: many threads per core)

▪ Simultaneous multi-threading (SMT) - Each clock, core chooses instructions from multiple threads to run on ALUs - Extension of superscalar CPU design - Intel Hyper-threading (2 threads per core)

Multi-threading summary ▪ Benefit: more efficiently use core’s ALU resources - Fill multiple functional units of superscalar architecture (when one thread has insufficient ILP) - Hide memory latency ▪ Costs - Requires additional storage for thread contexts - Increases run time of any single thread (often not a problem, we usually care about throughput in parallel apps) - Requires additional parallelism in a program (more parallelism than ALUs) - Relies heavily on memory bandwidth - More threads → larger working set → less cache space per thread - Go to memory more often, but can hide the latency

Kayvon’s fictitious multi-core chip

16 cores

8 SIMD ALUs per core (128 total)

4 threads per core

16 simultaneous instruction streams

64 total concurrent instruction streams

512 independent pieces of work to run chip at maximal latency hiding ability

GPUs: Extreme throughput-oriented processors

NVIDIA GTX 480 core

[Diagram: one GTX 480 core: a Fetch/Decode unit, 16 SIMD function units (control shared across the 16 units; 1 MUL-ADD per clock each), execution contexts (128 KB), and “shared” memory (16+48 KB).]

• Instructions operate on 32 pieces of data at a time (groups called “warps”).
• Think: warp = thread issuing 32-wide vector instructions
• Up to 48 warps are simultaneously interleaved
• Over 1500 elements can be processed concurrently by a core

Source: Fermi Compute Architecture Whitepaper; CUDA Programming Guide 3.1, Appendix G

NVIDIA GTX 480: more detail (just for the curious)

NVIDIA GTX 480 core

[Same diagram: Fetch/Decode, 16 SIMD function units, execution contexts (128 KB), “shared” memory (16+48 KB).]

• Why is a warp 32 elements when there are only 16 SIMD ALUs?
• It’s a bit complicated: the ALUs run at twice the clock rate of the rest of the chip, so each decoded instruction runs on 32 pieces of data on the 16 ALUs over two ALU clocks. (But to the programmer, it behaves like a 32-wide SIMD operation.)

Source: Fermi Compute Architecture Whitepaper; CUDA Programming Guide 3.1, Appendix G

NVIDIA GTX 480: more detail (just for the curious)

NVIDIA GTX 480 core

[Same diagram, now showing a second Fetch/Decode unit and a second group of 16 SIMD function units.]

• This process occurs on another set of 16 ALUs as well
• So there are 32 ALUs per core
• 15 cores * 32 = 480 ALUs per chip

Source: Fermi Compute Architecture Whitepaper; CUDA Programming Guide 3.1, Appendix G

NVIDIA GTX 480

Recall, there are 15 cores on the GTX 480: That’s 23,000 pieces of data being processed concurrently!

CPU vs. GPU memory hierarchies

[Diagram, CPU memory hierarchy: each core (Core 1 ... Core N) has an L1 cache (32 KB) and an L2 cache (256 KB); all cores share an L3 cache (8 MB) connected to DDR3 DRAM (gigabytes) at 21 GB/sec.]
CPU: big caches, few threads, modest memory BW. Relies mainly on caches and prefetching.

[Diagram, GPU memory hierarchy: each core (Core 1 ... Core N) has execution contexts (128 KB), a scratchpad / L1 cache (64 KB), and a GFX texture cache (12 KB); cores share an L2 cache (768 KB) connected to DDR5 DRAM (~1 GB) at 177 GB/sec.]
GPU: small caches, many threads, huge memory BW. Relies mainly on multi-threading.

Thought experiment: element-wise multiplication of two vectors A and B (assume the vectors contain millions of elements)
1. Load input A[i]
2. Load input B[i]
3. Compute A[i] × B[i]
4. Store result into C[i]

Three memory operations (12 bytes) for every MUL. The NVIDIA GTX 480 GPU can do 480 MULs per clock (1.4 GHz), so it needs ~8.4 TB/sec of bandwidth to keep its functional units busy.

~ 2% efficiency… but 8x faster than CPU! (3GHz Core i7 quad-core CPU: similar efficiency on this computation)
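A quick back-of-the-envelope check of those numbers (my arithmetic, not from the slides; the ALU count, clock rate, and 177 GB/sec memory bandwidth are taken from the figures above):

#include <stdio.h>

// Bandwidth needed to keep 480 MUL units busy when every MUL
// consumes/produces 12 bytes (two 4-byte loads + one 4-byte store).
int main(void)
{
    double muls_per_clock = 480;      // ALUs on the GTX 480
    double clock_hz       = 1.4e9;    // ~1.4 GHz
    double bytes_per_mul  = 12;       // load A[i], load B[i], store C[i]

    double needed_bw = muls_per_clock * clock_hz * bytes_per_mul;  // bytes/sec
    double actual_bw = 177e9;                                      // GTX 480 memory bandwidth

    // Prints roughly 8 TB/s required (same ballpark as the slide's ~8.4 TB/sec)
    printf("required bandwidth: %.1f TB/s\n", needed_bw / 1e12);
    // Prints roughly 2%, matching the "~2% efficiency" figure
    printf("fraction achievable: %.1f%%\n", 100.0 * actual_bw / needed_bw);
    return 0;
}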

Bandwidth limited!

If processors request data at too high a rate, the memory system cannot keep up. No amount of latency hiding helps this.

Overcoming bandwidth limits is a common challenge for application developers on throughput-optimized systems.

Bandwidth is a critical resource

Performant parallel programs will:

▪ Fetch data from memory less often - Reuse data in a thread (traditional locality optimizations) - Share data across threads (cooperation)

▪ Request data less often (instead, do more math: it is “free”) - Term: “arithmetic intensity” - ratio of math to data access
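One concrete way to raise arithmetic intensity, sketched here as my own illustration (the function names two_pass and fused are hypothetical): fuse two passes over the same data so the same math is done with roughly half the memory traffic.

// Two separate passes: a and b are read and tmp/c written in each loop,
// so memory traffic is paid twice for the same amount of math.
void two_pass(int N, const float* a, const float* b, float* tmp, float* c)
{
    for (int i = 0; i < N; i++) tmp[i] = a[i] * b[i];
    for (int i = 0; i < N; i++) c[i]   = tmp[i] + a[i];
}

// Fused pass: same math per element (one mul + one add), but a[i] and b[i]
// are each read once and only c[i] is written -- higher arithmetic intensity.
void fused(int N, const float* a, const float* b, float* c)
{
    for (int i = 0; i < N; i++) c[i] = a[i] * b[i] + a[i];
}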

Summary ▪ Three major ideas that all modern processors employ to varying degrees - Employ multiple processing cores - Simpler cores (embrace thread-level parallelism over ILP) - Amortize instruction stream processing over many ALUs (SIMD) - Increase compute capability with little extra cost - Use multi-threading to make more efficient use of processing resources (hide latencies, fill all available resources)

▪ Due to high arithmetic capability on modern chips, many parallel applications (on both CPUs and GPUs) are bandwidth bound

▪ GPUs push throughput computing concepts to extreme scales - Notable differences in memory system design

Terminology

▪ Multi-core processor ▪ SIMD execution ▪ Hardware multi-threading - Interleaved multi-threading - Simultaneous multi-threading ▪ Memory latency ▪ Memory bandwidth ▪ Bandwidth bound application ▪ Arithmetic intensity
