CS 61C: Great Ideas in Computer Architecture
Lecture 18: Parallel Processing – SIMD
Krste Asanović & Randy Katz
10/25/17
http://inst.eecs.berkeley.edu/~cs61c

61C Survey: "It would be nice to have a review lecture every once in a while, actually showing us how things fit in the bigger picture."


Agenda
• 61C – the big picture
• Parallel processing
• Single instruction, multiple data
• SIMD matrix multiplication
• Amdahl's law
• Loop unrolling
• Memory access strategy – blocking
• And in Conclusion, …

61C Topics so far … what we learned:
1. Binary numbers
2. C
3. Pointers
4. Assembly language
5. Datapath architecture
6. Pipelining
7. Caches
8. Performance evaluation
9. Floating point
• What does this buy us?
− Promise: execution speed
− Let's check!


Reference Problem
• Matrix multiplication
− Basic operation in many engineering, data, and image processing tasks
− Image filtering, noise reduction, …
− Many closely related operations
§ E.g. stereo vision (project 4)
• dgemm
− double-precision floating-point matrix multiplication

Application Example: Deep Learning
• Image classification (cats …)
• Pick "best" vacation photos
• Machine translation
• Clean up accent
• Fingerprint verification
• Automatic game playing



Matrices
• Square matrix of dimension N×N, indices 0 … N−1

Matrix Multiplication
• C = A × B
• C_ij = Σ_k (A_ik × B_kj)
[Figure: N×N matrices A, B, C; row i of A combines with column j of B to produce C_ij]


Reference: Python
• Matrix multiplication in Python: c = a * b
• a, b, c are N × N matrices

N      Python [MFLOPS]
32     5.4
160    5.5
480    5.4
960    5.3

• 1 MFLOPS = 1 million floating-point operations per second (fadd, fmul)
• dgemm(N, …) takes 2×N³ flops
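A quick sanity check of that flop count: each of the N² entries of C requires N multiplies and N adds, so

F = N² × (N + N) = 2 × N³ floating-point operations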


Timing Program Execution

C versus Python

N      C [GFLOPS]   Python [GFLOPS]
32     1.30         0.0054
160    1.30         0.0055
480    1.32         0.0054
960    0.91         0.0053

• C is roughly 240x faster than Python!
• Which class gives you this kind of power?
• We could stop here … but why? Let's do better!
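For reference, here is the scalar C baseline we will now improve on — a minimal sketch along the lines of the unoptimized dgemm in P&H (column-major layout; the function name and layout are illustrative, not the exact lecture code):

    // Scalar dgemm sketch: C = C + A*B for n x n column-major matrices.
    void dgemm_scalar(int n, double *A, double *B, double *C) {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) {
                double cij = C[i + j * n];              // cij = C[i][j]
                for (int k = 0; k < n; k++)
                    cij += A[i + k * n] * B[k + j * n]; // cij += A[i][k] * B[k][j]
                C[i + j * n] = cij;
            }
    }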



Agenda
• 61C – the big picture
• Parallel processing
• Single instruction, multiple data
• SIMD matrix multiplication
• Amdahl's law
• Loop unrolling
• Memory access strategy – blocking
• And in Conclusion, …

Why Parallel Processing?
• CPU clock rates are no longer increasing
− Technical & economic challenges
§ Advanced cooling technology too expensive or impractical for most applications
§ Energy costs are prohibitive
• Parallel processing is the only path to higher speed
− Compare airlines:
§ Maximum speed limited by speed of sound and economics
§ Use more and larger airplanes to increase throughput
§ And smaller seats …


New-School Machine Structures (It's a bit more complicated!)
• Parallel Requests: assigned to computer, e.g., search "Katz" (warehouse-scale computer)
• Parallel Threads: assigned to core, e.g., lookup, ads
• Parallel Instructions: >1 instruction @ one time, e.g., 5 pipelined instructions
• Parallel Data: >1 data item @ one time, e.g., add of 4 pairs of words (A0+B0, A1+B1, A2+B2, A3+B3) ← today's topic
• Hardware descriptions: all gates @ one time
• Programming languages
[Figure: harness parallelism & achieve high performance across the software/hardware stack – warehouse-scale computer, smart phone, computer, core, memory (cache), input/output, instruction unit(s), functional unit(s), cache memory, logic gates]

Using Parallelism for Performance
• Two basic ways:
− Multiprogramming
§ run multiple independent programs in parallel
§ "Easy"
− Parallel computing
§ run one program faster
§ "Hard"
• We'll focus on parallel computing in the next few lectures

Single-Instruction/Single-Data Stream (SISD)
• Sequential computer that exploits no parallelism in either the instruction or data streams. Examples of SISD architecture are traditional uniprocessor machines
− E.g. our trusted RISC-V processor
• This is what we did up to now in 61C.

Single-Instruction/Multiple-Data Stream (SIMD, or "sim-dee")
• SIMD computer applies a single instruction stream to multiple data streams, for operations that may be naturally parallelized, e.g., Intel SIMD instruction extensions or the NVIDIA Graphics Processing Unit (GPU)
• Today's topic.


Multiple-Instruction/Multiple-Data Streams (MIMD, or "mim-dee")
• Multiple autonomous processors simultaneously executing different instructions on different data.
• MIMD architectures include multicore and warehouse-scale computers
• Topic of Lecture 19 and beyond.

Multiple-Instruction/Single-Data Stream (MISD)
• Multiple-Instruction, Single-Data stream computer that exploits multiple instruction streams against a single data stream.
[Figure: instruction pool feeding several processing units (PU) that share one data pool]
• Historical significance; has few applications. Not covered in 61C.

Flynn* Taxonomy, 1966
• SIMD and MIMD are currently the most common parallelism in architectures – usually both in the same system!
• Most common parallel processing programming style: Single Program Multiple Data ("SPMD")
− Single program that runs on all processors of a MIMD
− Cross-processor execution coordination using synchronization primitives

Agenda
• 61C – the big picture
• Parallel processing
• Single instruction, multiple data
• SIMD matrix multiplication
• Amdahl's law
• Loop unrolling
• Memory access strategy – blocking
• And in Conclusion, …

SIMD – "Single Instruction Multiple Data"

SIMD Applications & Implementations
• Applications
− Scientific computing
§ Matlab, NumPy
− Graphics and video processing
§ Photoshop, …
− Big data
§ Deep learning
− Gaming
− …
• Implementations
− x86
− ARM
− RISC-V vector extensions



First SIMD Extensions: MIT Lincoln Labs TX-2, 1957

x86 SIMD Evolution
• New instructions
• New, wider, more registers
• More parallelism

http://svmoore.pbworks.com/w/file/fetch/70583970/VectorOps.pdf


Laptop CPU Specs

$ sysctl -a | grep cpu
hw.physicalcpu: 2
hw.logicalcpu: 4
machdep.cpu.brand_string: Intel(R) Core(TM) i7-5557U CPU @ 3.10GHz
machdep.cpu.features: FPU VME DE PSE TSC MSR PAE MCE CX8 APIC SEP MTRR PGE MCA CMOV PAT PSE36 CLFSH DS ACPI MMX FXSR SSE SSE2 SS HTT TM PBE SSE3 PCLMULQDQ DTES64 MON DSCPL VMX EST TM2 SSSE3 FMA CX16 TPR PDCM SSE4.1 SSE4.2 x2APIC MOVBE POPCNT AES PCID XSAVE OSXSAVE SEGLIM64 TSCTMR AVX1.0 RDRAND F16C
machdep.cpu.leaf7_features: SMEP ERMS RDWRFSGS TSC_THREAD_OFFSET BMI1 AVX2 BMI2 INVPCID SMAP RDSEED ADX IPT FPU_CSDS

SIMD Registers
[Figure: x86 SIMD register file]

SIMD Data Types
• (Now also AVX-512 available)

SIMD Vector Mode



Agenda
• 61C – the big picture
• Parallel processing
• Single instruction, multiple data
• SIMD matrix multiplication
• Amdahl's law
• Loop unrolling
• Memory access strategy – blocking
• And in Conclusion, …

Problem
• Today's compilers (largely) do not generate SIMD code
• Back to assembly …
• x86 (not using RISC-V, as there is no vector hardware yet)
− Over 1000 instructions to learn …
− Green Book
• Can we use the compiler to generate all non-SIMD instructions?


x86 Intrinsics: AVX Data Types
• Intrinsics: direct access to registers & assembly from C

Intrinsics: AVX Code Nomenclature
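To make the nomenclature concrete, here is a minimal sketch using AVX intrinsics from <immintrin.h> (the function and variable names are illustrative):

    #include <immintrin.h>

    // __m256d is a 256-bit YMM register holding 4 doubles.
    // _mm256_loadu_pd / _mm256_add_pd / _mm256_storeu_pd each compile
    // to a single AVX instruction (unaligned load, packed add, store).
    void add4(const double *a, const double *b, double *c) {
        __m256d va = _mm256_loadu_pd(a);     // load a[0..3]
        __m256d vb = _mm256_loadu_pd(b);     // load b[0..3]
        __m256d vc = _mm256_add_pd(va, vb);  // 4 additions in one instruction
        _mm256_storeu_pd(c, vc);             // store c[0..3]
    }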


x86 SIMD "Intrinsics"
• Each intrinsic maps to an assembly instruction
• mul_pd: 4 parallel multiplies
https://software.intel.com/sites/landingpage/IntrinsicsGuide/

Raw Double-Precision Throughput

Characteristic                        Value
CPU                                   i7-5557U
Clock rate (sustained)                3.1 GHz
Instructions per clock (mul_pd)       2
Parallel multiplies per instruction   4
Peak double FLOPS                     24.8 GFLOPS

• 2 instructions per clock cycle (CPI = 0.5)
• Actual performance is lower because of overhead
https://www.karlrupp.net/2013/06/cpu-gpu-and-mic-hardware-characteristics-over-time/
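The peak number follows directly from the table:

3.1 GHz × 2 instructions/clock × 4 multiplies/instruction = 24.8 GFLOPS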


Vectorized Matrix Multiplication

"Vectorized" dgemm
• Outer loop: for i …; i += 4
• Middle loop: for j …
• Inner loop computes 4 consecutive elements of C at a time
[Figure: A, B, C traversed with i advancing in steps of 4]
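A sketch of what those loops look like with AVX intrinsics, following the pattern in P&H (assumes column-major arrays, 32-byte alignment, and n a multiple of 4; names are illustrative):

    #include <immintrin.h>

    // "Vectorized" dgemm sketch: each inner-loop iteration updates
    // C[i..i+3][j] with one packed multiply and one packed add.
    void dgemm_avx(int n, double *A, double *B, double *C) {
        for (int i = 0; i < n; i += 4)
            for (int j = 0; j < n; j++) {
                __m256d c0 = _mm256_load_pd(C + i + j * n);  // C[i..i+3][j]
                for (int k = 0; k < n; k++)
                    c0 = _mm256_add_pd(c0,                   // c0 += A[i..i+3][k] * B[k][j]
                           _mm256_mul_pd(_mm256_load_pd(A + i + k * n),
                                         _mm256_broadcast_sd(B + k + j * n)));
                _mm256_store_pd(C + i + j * n, c0);
            }
    }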


Performance

       Gflops
N      scalar   avx
32     1.30     4.56
160    1.30     5.47
480    1.32     5.27
960    0.91     3.64

• 4x faster
• But still << theoretical 25 GFLOPS!

Agenda
• 61C – the big picture
• Parallel processing
• Single instruction, multiple data
• SIMD matrix multiplication
• Amdahl's law
• Loop unrolling
• Memory access strategy – blocking
• And in Conclusion, …


A Trip to LA

Commercial airline:
Get to SFO & check-in: 3 hours | SFO → LAX: 1 hour | Get to destination: 3 hours
Total time: 7 hours

Supersonic aircraft:
Get to SFO & check-in: 3 hours | SFO → LAX: 6 min | Get to destination: 3 hours
Total time: 6.1 hours

Speedup:
Flying time: S_flight = 60 / 6 = 10x
Trip time: S_trip = 7 / 6.1 = 1.15x

Amdahl's Law
• Get enhancement E for your new PC
− E.g. floating-point rocket booster
• E
− Speeds up some task (e.g. arithmetic) by factor S_E
− F is fraction of program that uses this "task"

Execution time:
T_O (no E): (1−F) + F
T_E (with E): (1−F) + F/S_E

Speedup = T_O / T_E = 1 / ((1−F) + F/S_E)
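The trip numbers are just Amdahl's Law with F = 1/7 (the flying hour out of 7 total) and S_E = 10:

Speedup = 1 / ((1 − 1/7) + (1/7)/10) = 1 / (6/7 + 1/70) = 70/61 ≈ 1.15x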


Big Idea: Amdahl's Law

Speedup = 1 / ((1−F) + F/S_E)
          (1−F): part not sped up; F/S_E: part sped up

Example: The execution time of half of a program can be accelerated by a factor of 2. What is the program speed-up overall?

T_O / T_E = 1 / ((1−F) + F/S_E) = 1 / ((1−0.5) + 0.5/2) = 1.33 << 2

Maximum "Achievable" Speed-Up

Speedup = T_O / T_E = 1 / ((1−F) + F/S_E); as S_E → ∞, Speedup → 1 / (1−F)

Question: What is a reasonable # of parallel processors to speed up an algorithm with F = 95%? (i.e. 19/20th can be sped up)

a) Maximum speedup: S_max = 1 / (1−0.95) = 20, but it needs S_E = ∞!

b) Reasonable "engineering" compromise – equal time in sequential and parallel code: (1−F) = F/S_E
S_E = F / (1−F) = 0.95 / 0.05 = 19
S = 1 / (0.05 + 0.05) = 10

Strong and Weak Scaling
• Getting good speedup on a parallel processor while keeping the problem size fixed is harder than getting good speedup by increasing the size of the problem.
− Strong scaling: when speedup can be achieved on a parallel processor without increasing the size of the problem
− Weak scaling: when speedup is achieved on a parallel processor by increasing the size of the problem proportionally to the increase in the number of processors
• Load balancing is another important factor: every processor should do the same amount of work
− Just one unit with twice the load of the others cuts speedup almost in half
[Figure: speedup vs. number of processors – if the portion of the program that can be parallelized is small, the speedup is limited; in this region the sequential portion limits performance (20 processors for 10x, 500 processors for 19x)]
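The figure's annotations follow from Amdahl's Law with F = 0.95:

20 processors:  Speedup = 1 / (0.05 + 0.95/20)  ≈ 10x
500 processors: Speedup = 1 / (0.05 + 0.95/500) ≈ 19x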


Peer Instruction
Suppose a program spends 80% of its time in a square-root routine. How much must you speed up square root to make the program run 5 times faster?

Speedup = 1 / ((1−F) + F/S_E)

RED: 5
GREEN: 20
ORANGE: 500
None of the above

Administrivia
• Midterm 2 is Tuesday during class time!
− Review session: Saturday 10–12pm @ Hearst Annex A1
− One double-sided crib sheet; we'll provide the RISC-V card
− Bring your ID! We'll check it
− Room assignments posted on Piazza
• Project 2 pre-lab due Friday
• HW4 due 11/3, but advisable to finish by the midterm
• Project 2-2 due 11/6, but also good to finish before MT2



Agenda
• 61C – the big picture
• Parallel processing
• Single instruction, multiple data
• SIMD matrix multiplication
• Amdahl's law
• Loop unrolling
• Memory access strategy – blocking
• And in Conclusion, …

Amdahl's Law Applied to dgemm
• Measured dgemm performance
− Peak 5.5 GFLOPS
− Large matrices 3.6 GFLOPS
− Processor 24.8 GFLOPS
• Why are we not getting (close to) 25 GFLOPS?
− Something else (not the floating-point ALU) is limiting performance!
− But what? Possible culprits:
§ Cache
§ Hazards
§ Let's look at both!


Pipeline Hazards – dgemm

Loop Unrolling
• 4 registers
• Compiler does the unrolling
• How do you verify that the generated code is actually unrolled? Inspect the generated assembly (e.g. compile with gcc -S, or disassemble with objdump -d).
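A sketch of the unrolled version in the style of P&H, with 4 independent vector accumulators (the "4 registers") so back-to-back floating-point operations don't stall on each other; the UNROLL factor and names are illustrative, and n is assumed to be a multiple of 16:

    #include <immintrin.h>
    #define UNROLL 4

    // Unrolled AVX dgemm sketch: 4 independent accumulators c[0..3]
    // hide FP latency by giving the pipeline independent instructions.
    void dgemm_unroll(int n, double *A, double *B, double *C) {
        for (int i = 0; i < n; i += UNROLL * 4)   // 16 elements of C per pass
            for (int j = 0; j < n; j++) {
                __m256d c[UNROLL];
                for (int x = 0; x < UNROLL; x++)
                    c[x] = _mm256_load_pd(C + i + x * 4 + j * n);
                for (int k = 0; k < n; k++) {
                    __m256d b = _mm256_broadcast_sd(B + k + j * n);
                    for (int x = 0; x < UNROLL; x++)
                        c[x] = _mm256_add_pd(c[x],
                                 _mm256_mul_pd(_mm256_load_pd(A + i + x * 4 + k * n), b));
                }
                for (int x = 0; x < UNROLL; x++)
                    _mm256_store_pd(C + i + x * 4 + j * n, c[x]);
            }
    }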

Performance

       Gflops
N      scalar   avx    unroll
32     1.30     4.56   12.95
160    1.30     5.47   19.70
480    1.32     5.27   14.50
960    0.91     3.64   6.91

Agenda
• 61C – the big picture
• Parallel processing
• Single instruction, multiple data
• SIMD matrix multiplication
• Amdahl's law
• Loop unrolling
• Memory access strategy – blocking
• And in Conclusion, …



FPU versus Memory Access
• How many floating-point operations does matrix multiply take?
− F = 2 × N³ (N³ multiplies, N³ adds)
• How many memory load/stores?
− M = 3 × N² (for A, B, C)
• Many more floating-point operations than memory accesses
− q = F/M = (2/3) × N
− Good, since arithmetic is faster than memory access
− Let's check the code …

But Memory Is Accessed Repeatedly
• Inner loop:
• q = F/M = 1! (2 loads and 2 floating-point operations)
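The inner loop in question is the k-loop from the scalar sketch earlier; each iteration does 2 loads and 2 flops, hence q = 1:

    // 2 memory loads (A[i][k], B[k][j]) and 2 flops (multiply, add)
    // per iteration; cij stays in a register.
    for (int k = 0; k < n; k++)
        cij += A[i + k * n] * B[k + j * n];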


Typical Memory Hierarchy

Level                               Speed (cycles)   Size (bytes)
RegFile (on-chip)                   ½'s              100's
Instr/Data Cache (SRAM, on-chip)    1's              10K's
Second-/Third-Level Cache (SRAM)    10's             M's
Main Memory (DRAM)                  100's–1000       G's
Secondary Memory (Disk or Flash)    1,000,000's      T's
Cost/bit: highest at the top, lowest at the bottom

Sub-Matrix Multiplication, or: Beating Amdahl's Law
• Where are the operands (A, B, C) stored?
• What happens as N increases?
• Idea: arrange that most accesses are to the fast cache!


Memory Access Blocking

Blocking
• Idea:
− Rearrange code to use values loaded in cache many times
− Only "few" accesses to slow main memory (DRAM) per floating-point operation
− → throughput limited by FP hardware and cache, not slow DRAM
− P&H, RISC-V edition p. 465
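In the spirit of the blocked dgemm the lecture cites (P&H, RISC-V edition p. 465), a scalar sketch; BLOCKSIZE is chosen so that three sub-matrices fit in cache, and n is assumed to be a multiple of BLOCKSIZE:

    #define BLOCKSIZE 32

    // Multiply one BLOCKSIZE x BLOCKSIZE sub-matrix: while it runs, the
    // A, B, and C blocks stay resident in cache and are reused many times.
    static void do_block(int n, int si, int sj, int sk,
                         double *A, double *B, double *C) {
        for (int i = si; i < si + BLOCKSIZE; i++)
            for (int j = sj; j < sj + BLOCKSIZE; j++) {
                double cij = C[i + j * n];
                for (int k = sk; k < sk + BLOCKSIZE; k++)
                    cij += A[i + k * n] * B[k + j * n];
                C[i + j * n] = cij;
            }
    }

    // Blocked dgemm: iterate over sub-matrices instead of single elements.
    void dgemm_blocked(int n, double *A, double *B, double *C) {
        for (int sj = 0; sj < n; sj += BLOCKSIZE)
            for (int si = 0; si < n; si += BLOCKSIZE)
                for (int sk = 0; sk < n; sk += BLOCKSIZE)
                    do_block(n, si, sj, sk, A, B, C);
    }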



Performance

       Gflops
N      scalar   avx    unroll   blocking
32     1.30     4.56   12.95    13.80
160    1.30     5.47   19.70    21.79
480    1.32     5.27   14.50    20.17
960    0.91     3.64   6.91     15.82

Agenda
• 61C – the big picture
• Parallel processing
• Single instruction, multiple data
• SIMD matrix multiplication
• Amdahl's law
• Loop unrolling
• Memory access strategy – blocking
• And in Conclusion, …


And in Conclusion, …
• Approaches to parallelism
− SISD, SIMD, MIMD (next lecture)
• SIMD
− One instruction operates on multiple operands simultaneously
• Example: matrix multiplication
− Floating-point heavy → exploit Moore's law to make it fast
• Amdahl's Law:
− Serial sections limit speedup
− Cache → blocking
− Hazards → loop unrolling

