CS 61C: Great Ideas in Computer Architecture
Lecture 18: Parallel Processing – SIMD
Krste Asanović & Randy Katz
10/25/17
http://inst.eecs.berkeley.edu/~cs61c

61C Survey: "It would be nice to have a review lecture every once in a while, actually showing us how things fit in the bigger picture."


Agenda
• 61C – the big picture
• Parallel processing
• Single instruction, multiple data
• SIMD matrix multiplication
• Amdahl's law
• Loop unrolling
• Memory access strategy – blocking
• And in Conclusion, …

61C Topics so far … what we learned:
1. Binary numbers
2. C
3. Pointers
4. Assembly language
5. Datapath architecture
6. Pipelining
7. Caches
8. Performance evaluation
9. Floating point
• What does this buy us?
− Promise: execution speed
− Let's check!


Reference Problem
• Matrix multiplication
− Basic operation in many engineering, data, and image processing tasks
− Image filtering, noise reduction, …
− Many closely related operations
§ E.g. stereo vision (project 4)
• dgemm
− double-precision floating-point matrix multiplication

Application Example: Deep Learning
• Image classification (cats …)
• Pick "best" vacation photos
• Machine translation
• Clean up accent
• Fingerprint verification
• Automatic game playing



Matrices
• Square matrix of dimension N×N, indices 0 … N−1

Matrix Multiplication
• C = A × B
• C_ij = Σ_k (A_ik × B_kj)
[Figure: N×N matrices A, B, C; row i of A combines with column j of B to produce C_ij]


Reference: Python
• Matrix multiplication in Python: c = a * b
• a, b, c are N × N matrices

N      Python [MFLOPS]
32     5.4
160    5.5
480    5.4
960    5.3

• 1 MFLOPS = 1 million floating-point operations per second (fadd, fmul)
• dgemm(N, …) takes 2×N³ flops
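A quick sanity check of that flop count: each of the N² entries of C requires N multiplies and N adds, so

F = N² × (N + N) = 2 × N³ floating-point operations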


Timing Program Execution

C versus Python

N      C [GFLOPS]   Python [GFLOPS]
32     1.30         0.0054
160    1.30         0.0055
480    1.32         0.0054
960    0.91         0.0053

• C is roughly 240x faster than Python!
• Which class gives you this kind of power?
• We could stop here … but why? Let's do better!
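For reference, here is the scalar C baseline we will now improve on — a minimal sketch along the lines of the unoptimized dgemm in P&H (column-major layout; the function name and layout are illustrative, not the exact lecture code):

    // Scalar dgemm sketch: C = C + A*B for n x n column-major matrices.
    void dgemm_scalar(int n, double *A, double *B, double *C) {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) {
                double cij = C[i + j * n];              // cij = C[i][j]
                for (int k = 0; k < n; k++)
                    cij += A[i + k * n] * B[k + j * n]; // cij += A[i][k] * B[k][j]
                C[i + j * n] = cij;
            }
    }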



Agenda
• 61C – the big picture
• Parallel processing
• Single instruction, multiple data
• SIMD matrix multiplication
• Amdahl's law
• Loop unrolling
• Memory access strategy – blocking
• And in Conclusion, …

Why Parallel Processing?
• CPU clock rates are no longer increasing
− Technical & economic challenges
§ Advanced cooling technology too expensive or impractical for most applications
§ Energy costs are prohibitive
• Parallel processing is the only path to higher speed
− Compare airlines:
§ Maximum speed limited by speed of sound and economics
§ Use more and larger airplanes to increase throughput
§ And smaller seats …


New-School Machine Structures (It's a bit more complicated!)
• Parallel Requests: assigned to computer, e.g., search "Katz" (warehouse-scale computer)
• Parallel Threads: assigned to core, e.g., lookup, ads
• Parallel Instructions: >1 instruction @ one time, e.g., 5 pipelined instructions
• Parallel Data: >1 data item @ one time, e.g., add of 4 pairs of words (A0+B0, A1+B1, A2+B2, A3+B3) ← today's topic
• Hardware descriptions: all gates @ one time
• Programming languages
[Figure: harness parallelism & achieve high performance across the software/hardware stack – warehouse-scale computer, smart phone, computer, core, memory (cache), input/output, instruction unit(s), functional unit(s), cache memory, logic gates]

Using Parallelism for Performance
• Two basic ways:
− Multiprogramming
§ run multiple independent programs in parallel
§ "Easy"
− Parallel computing
§ run one program faster
§ "Hard"
• We'll focus on parallel computing in the next few lectures

Single-Instruction/Single-Data Stream (SISD)
• Sequential computer that exploits no parallelism in either the instruction or data streams. Examples of SISD architecture are traditional uniprocessor machines
− E.g. our trusted RISC-V processor
• This is what we did up to now in 61C.

Single-Instruction/Multiple-Data Stream (SIMD, or "sim-dee")
• SIMD computer applies a single instruction stream to multiple data streams, for operations that may be naturally parallelized, e.g., Intel SIMD instruction extensions or the NVIDIA Graphics Processing Unit (GPU)
• Today's topic.


Multiple-Instruction/Multiple-Data Streams (MIMD, or "mim-dee")
• Multiple autonomous processors simultaneously executing different instructions on different data.
• MIMD architectures include multicore and warehouse-scale computers
• Topic of Lecture 19 and beyond.

Multiple-Instruction/Single-Data Stream (MISD)
• Multiple-Instruction, Single-Data stream computer that exploits multiple instruction streams against a single data stream.
[Figure: instruction pool feeding several processing units (PU) that share one data pool]
• Historical significance; has few applications. Not covered in 61C.

Flynn* Taxonomy, 1966
• SIMD and MIMD are currently the most common parallelism in architectures – usually both in the same system!
• Most common parallel processing programming style: Single Program Multiple Data ("SPMD")
− Single program that runs on all processors of a MIMD
− Cross-processor execution coordination using synchronization primitives

Agenda
• 61C – the big picture
• Parallel processing
• Single instruction, multiple data
• SIMD matrix multiplication
• Amdahl's law
• Loop unrolling
• Memory access strategy – blocking
• And in Conclusion, …

SIMD – "Single Instruction Multiple Data"

SIMD Applications & Implementations
• Applications
− Scientific computing
§ Matlab, NumPy
− Graphics and video processing
§ Photoshop, …
− Big data
§ Deep learning
− Gaming
− …
• Implementations
− x86
− ARM
− RISC-V vector extensions



First SIMD Extensions: MIT Lincoln Labs TX-2, 1957

x86 SIMD Evolution
• New instructions
• New, wider, more registers
• More parallelism

http://svmoore.pbworks.com/w/file/fetch/70583970/VectorOps.pdf


Laptop CPU Specs

$ sysctl -a | grep cpu
hw.physicalcpu: 2
hw.logicalcpu: 4
machdep.cpu.brand_string: Intel(R) Core(TM) i7-5557U CPU @ 3.10GHz
machdep.cpu.features: FPU VME DE PSE TSC MSR PAE MCE CX8 APIC SEP MTRR PGE MCA CMOV PAT PSE36 CLFSH DS ACPI MMX FXSR SSE SSE2 SS HTT TM PBE SSE3 PCLMULQDQ DTES64 MON DSCPL VMX EST TM2 SSSE3 FMA CX16 TPR PDCM SSE4.1 SSE4.2 x2APIC MOVBE POPCNT AES PCID XSAVE OSXSAVE SEGLIM64 TSCTMR AVX1.0 RDRAND F16C
machdep.cpu.leaf7_features: SMEP ERMS RDWRFSGS TSC_THREAD_OFFSET BMI1 AVX2 BMI2 INVPCID SMAP RDSEED ADX IPT FPU_CSDS

SIMD Registers
[Figure: x86 SIMD register file]

SIMD Data Types
• (Now also AVX-512 available)

SIMD Vector Mode



Agenda
• 61C – the big picture
• Parallel processing
• Single instruction, multiple data
• SIMD matrix multiplication
• Amdahl's law
• Loop unrolling
• Memory access strategy – blocking
• And in Conclusion, …

Problem
• Today's compilers (largely) do not generate SIMD code
• Back to assembly …
• x86 (not using RISC-V, as there is no vector hardware yet)
− Over 1000 instructions to learn …
− Green Book
• Can we use the compiler to generate all non-SIMD instructions?


x86 Intrinsics: AVX Data Types
• Intrinsics: direct access to registers & assembly from C

Intrinsics: AVX Code Nomenclature
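To make the nomenclature concrete, here is a minimal sketch using AVX intrinsics from <immintrin.h> (the function and variable names are illustrative):

    #include <immintrin.h>

    // __m256d is a 256-bit YMM register holding 4 doubles.
    // _mm256_loadu_pd / _mm256_add_pd / _mm256_storeu_pd each compile
    // to a single AVX instruction (unaligned load, packed add, store).
    void add4(const double *a, const double *b, double *c) {
        __m256d va = _mm256_loadu_pd(a);     // load a[0..3]
        __m256d vb = _mm256_loadu_pd(b);     // load b[0..3]
        __m256d vc = _mm256_add_pd(va, vb);  // 4 additions in one instruction
        _mm256_storeu_pd(c, vc);             // store c[0..3]
    }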


x86 SIMD "Intrinsics"
• Each intrinsic maps to an assembly instruction
• mul_pd: 4 parallel multiplies
https://software.intel.com/sites/landingpage/IntrinsicsGuide/

Raw Double-Precision Throughput

Characteristic                        Value
CPU                                   i7-5557U
Clock rate (sustained)                3.1 GHz
Instructions per clock (mul_pd)       2
Parallel multiplies per instruction   4
Peak double FLOPS                     24.8 GFLOPS

• 2 instructions per clock cycle (CPI = 0.5)
• Actual performance is lower because of overhead
https://www.karlrupp.net/2013/06/cpu-gpu-and-mic-hardware-characteristics-over-time/
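The peak number follows directly from the table:

3.1 GHz × 2 instructions/clock × 4 multiplies/instruction = 24.8 GFLOPS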


Vectorized Matrix Multiplication

"Vectorized" dgemm
• Outer loop: for i …; i += 4
• Middle loop: for j …
• Inner loop computes 4 consecutive elements of C at a time
[Figure: A, B, C traversed with i advancing in steps of 4]
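A sketch of what those loops look like with AVX intrinsics, following the pattern in P&H (assumes column-major arrays, 32-byte alignment, and n a multiple of 4; names are illustrative):

    #include <immintrin.h>

    // "Vectorized" dgemm sketch: each inner-loop iteration updates
    // C[i..i+3][j] with one packed multiply and one packed add.
    void dgemm_avx(int n, double *A, double *B, double *C) {
        for (int i = 0; i < n; i += 4)
            for (int j = 0; j < n; j++) {
                __m256d c0 = _mm256_load_pd(C + i + j * n);  // C[i..i+3][j]
                for (int k = 0; k < n; k++)
                    c0 = _mm256_add_pd(c0,                   // c0 += A[i..i+3][k] * B[k][j]
                           _mm256_mul_pd(_mm256_load_pd(A + i + k * n),
                                         _mm256_broadcast_sd(B + k + j * n)));
                _mm256_store_pd(C + i + j * n, c0);
            }
    }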


Performance

       Gflops
N      scalar   avx
32     1.30     4.56
160    1.30     5.47
480    1.32     5.27
960    0.91     3.64

• 4x faster
• But still << theoretical 25 GFLOPS!

Agenda
• 61C – the big picture
• Parallel processing
• Single instruction, multiple data
• SIMD matrix multiplication
• Amdahl's law
• Loop unrolling
• Memory access strategy – blocking
• And in Conclusion, …


A Trip to LA

Commercial airline:
Get to SFO & check-in: 3 hours | SFO → LAX: 1 hour | Get to destination: 3 hours
Total time: 7 hours

Supersonic aircraft:
Get to SFO & check-in: 3 hours | SFO → LAX: 6 min | Get to destination: 3 hours
Total time: 6.1 hours

Speedup:
Flying time: S_flight = 60 / 6 = 10x
Trip time: S_trip = 7 / 6.1 = 1.15x

Amdahl's Law
• Get enhancement E for your new PC
− E.g. floating-point rocket booster
• E
− Speeds up some task (e.g. arithmetic) by factor S_E
− F is fraction of program that uses this "task"

Execution time:
T_O (no E): (1−F) + F
T_E (with E): (1−F) + F/S_E

Speedup = T_O / T_E = 1 / ((1−F) + F/S_E)
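The trip numbers are just Amdahl's Law with F = 1/7 (the flying hour out of 7 total) and S_E = 10:

Speedup = 1 / ((1 − 1/7) + (1/7)/10) = 1 / (6/7 + 1/70) = 70/61 ≈ 1.15x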


Big Idea: Amdahl's Law

Speedup = 1 / ((1−F) + F/S_E)
          (1−F): part not sped up; F/S_E: part sped up

Example: The execution time of half of a program can be accelerated by a factor of 2. What is the program speed-up overall?

T_O / T_E = 1 / ((1−F) + F/S_E) = 1 / ((1−0.5) + 0.5/2) = 1.33 << 2

Maximum "Achievable" Speed-Up

Speedup = T_O / T_E = 1 / ((1−F) + F/S_E); as S_E → ∞, Speedup → 1 / (1−F)

Question: What is a reasonable # of parallel processors to speed up an algorithm with F = 95%? (i.e. 19/20th can be sped up)

a) Maximum speedup: S_max = 1 / (1−0.95) = 20, but it needs S_E = ∞!

b) Reasonable "engineering" compromise – equal time in sequential and parallel code: (1−F) = F/S_E
S_E = F / (1−F) = 0.95 / 0.05 = 19
S = 1 / (0.05 + 0.05) = 10

Strong and Weak Scaling
• Getting good speedup on a parallel processor while keeping the problem size fixed is harder than getting good speedup by increasing the size of the problem.
− Strong scaling: when speedup can be achieved on a parallel processor without increasing the size of the problem
− Weak scaling: when speedup is achieved on a parallel processor by increasing the size of the problem proportionally to the increase in the number of processors
• Load balancing is another important factor: every processor should do the same amount of work
− Just one unit with twice the load of the others cuts speedup almost in half
[Figure: speedup vs. number of processors – if the portion of the program that can be parallelized is small, the speedup is limited; in this region the sequential portion limits performance (20 processors for 10x, 500 processors for 19x)]
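The figure's annotations follow from Amdahl's Law with F = 0.95:

20 processors:  Speedup = 1 / (0.05 + 0.95/20)  ≈ 10x
500 processors: Speedup = 1 / (0.05 + 0.95/500) ≈ 19x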


Peer Instruction
Suppose a program spends 80% of its time in a square-root routine. How much must you speed up square root to make the program run 5 times faster?

Speedup = 1 / ((1−F) + F/S_E)

RED: 5
GREEN: 20
ORANGE: 500
None of the above

Administrivia
• Midterm 2 is Tuesday during class time!
− Review session: Saturday 10–12pm @ Hearst Annex A1
− One double-sided crib sheet; we'll provide the RISC-V card
− Bring your ID! We'll check it
− Room assignments posted on Piazza
• Project 2 pre-lab due Friday
• HW4 due 11/3, but advisable to finish by the midterm
• Project 2-2 due 11/6, but also good to finish before MT2



Agenda
• 61C – the big picture
• Parallel processing
• Single instruction, multiple data
• SIMD matrix multiplication
• Amdahl's law
• Loop unrolling
• Memory access strategy – blocking
• And in Conclusion, …

Amdahl's Law Applied to dgemm
• Measured dgemm performance
− Peak 5.5 GFLOPS
− Large matrices 3.6 GFLOPS
− Processor 24.8 GFLOPS
• Why are we not getting (close to) 25 GFLOPS?
− Something else (not the floating-point ALU) is limiting performance!
− But what? Possible culprits:
§ Cache
§ Hazards
§ Let's look at both!


Pipeline Hazards – dgemm

Loop Unrolling
• 4 registers
• Compiler does the unrolling
• How do you verify that the generated code is actually unrolled? Inspect the generated assembly (e.g. compile with gcc -S, or disassemble with objdump -d).
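A sketch of the unrolled version in the style of P&H, with 4 independent vector accumulators (the "4 registers") so back-to-back floating-point operations don't stall on each other; the UNROLL factor and names are illustrative, and n is assumed to be a multiple of 16:

    #include <immintrin.h>
    #define UNROLL 4

    // Unrolled AVX dgemm sketch: 4 independent accumulators c[0..3]
    // hide FP latency by giving the pipeline independent instructions.
    void dgemm_unroll(int n, double *A, double *B, double *C) {
        for (int i = 0; i < n; i += UNROLL * 4)   // 16 elements of C per pass
            for (int j = 0; j < n; j++) {
                __m256d c[UNROLL];
                for (int x = 0; x < UNROLL; x++)
                    c[x] = _mm256_load_pd(C + i + x * 4 + j * n);
                for (int k = 0; k < n; k++) {
                    __m256d b = _mm256_broadcast_sd(B + k + j * n);
                    for (int x = 0; x < UNROLL; x++)
                        c[x] = _mm256_add_pd(c[x],
                                 _mm256_mul_pd(_mm256_load_pd(A + i + x * 4 + k * n), b));
                }
                for (int x = 0; x < UNROLL; x++)
                    _mm256_store_pd(C + i + x * 4 + j * n, c[x]);
            }
    }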

Performance

       Gflops
N      scalar   avx    unroll
32     1.30     4.56   12.95
160    1.30     5.47   19.70
480    1.32     5.27   14.50
960    0.91     3.64   6.91

Agenda
• 61C – the big picture
• Parallel processing
• Single instruction, multiple data
• SIMD matrix multiplication
• Amdahl's law
• Loop unrolling
• Memory access strategy – blocking
• And in Conclusion, …



FPU versus Memory Access
• How many floating-point operations does matrix multiply take?
− F = 2 × N³ (N³ multiplies, N³ adds)
• How many memory load/stores?
− M = 3 × N² (for A, B, C)
• Many more floating-point operations than memory accesses
− q = F/M = (2/3) × N
− Good, since arithmetic is faster than memory access
− Let's check the code …

But Memory Is Accessed Repeatedly
• Inner loop:
• q = F/M = 1! (2 loads and 2 floating-point operations)
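The inner loop in question is the k-loop from the scalar sketch earlier; each iteration does 2 loads and 2 flops, hence q = 1:

    // 2 memory loads (A[i][k], B[k][j]) and 2 flops (multiply, add)
    // per iteration; cij stays in a register.
    for (int k = 0; k < n; k++)
        cij += A[i + k * n] * B[k + j * n];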


Typical Memory Hierarchy

Level                               Speed (cycles)   Size (bytes)
RegFile (on-chip)                   ½'s              100's
Instr/Data Cache (SRAM, on-chip)    1's              10K's
Second-/Third-Level Cache (SRAM)    10's             M's
Main Memory (DRAM)                  100's–1000       G's
Secondary Memory (Disk or Flash)    1,000,000's      T's
Cost/bit: highest at the top, lowest at the bottom

Sub-Matrix Multiplication, or: Beating Amdahl's Law
• Where are the operands (A, B, C) stored?
• What happens as N increases?
• Idea: arrange that most accesses are to the fast cache!


Memory Access Blocking

Blocking
• Idea:
− Rearrange code to use values loaded in cache many times
− Only "few" accesses to slow main memory (DRAM) per floating-point operation
− → throughput limited by FP hardware and cache, not slow DRAM
− P&H, RISC-V edition p. 465
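In the spirit of the blocked dgemm the lecture cites (P&H, RISC-V edition p. 465), a scalar sketch; BLOCKSIZE is chosen so that three sub-matrices fit in cache, and n is assumed to be a multiple of BLOCKSIZE:

    #define BLOCKSIZE 32

    // Multiply one BLOCKSIZE x BLOCKSIZE sub-matrix: while it runs, the
    // A, B, and C blocks stay resident in cache and are reused many times.
    static void do_block(int n, int si, int sj, int sk,
                         double *A, double *B, double *C) {
        for (int i = si; i < si + BLOCKSIZE; i++)
            for (int j = sj; j < sj + BLOCKSIZE; j++) {
                double cij = C[i + j * n];
                for (int k = sk; k < sk + BLOCKSIZE; k++)
                    cij += A[i + k * n] * B[k + j * n];
                C[i + j * n] = cij;
            }
    }

    // Blocked dgemm: iterate over sub-matrices instead of single elements.
    void dgemm_blocked(int n, double *A, double *B, double *C) {
        for (int sj = 0; sj < n; sj += BLOCKSIZE)
            for (int si = 0; si < n; si += BLOCKSIZE)
                for (int sk = 0; sk < n; sk += BLOCKSIZE)
                    do_block(n, si, sj, sk, A, B, C);
    }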



Performance

       Gflops
N      scalar   avx    unroll   blocking
32     1.30     4.56   12.95    13.80
160    1.30     5.47   19.70    21.79
480    1.32     5.27   14.50    20.17
960    0.91     3.64   6.91     15.82

Agenda
• 61C – the big picture
• Parallel processing
• Single instruction, multiple data
• SIMD matrix multiplication
• Amdahl's law
• Loop unrolling
• Memory access strategy – blocking
• And in Conclusion, …


And in Conclusion, …
• Approaches to parallelism
− SISD, SIMD, MIMD (next lecture)
• SIMD
− One instruction operates on multiple operands simultaneously
• Example: matrix multiplication
− Floating-point heavy → exploit Moore's law to make it fast
• Amdahl's Law:
− Serial sections limit speedup
− Cache → blocking
− Hazards → loop unrolling

