Parallel Computing Introduction
Alessio Turchi
Slides thanks to: Tim Mattson (Intel), Sverre Jarp (CERN), Vincenzo Innocente (CERN)

Outline
• High Performance Computing: a hardware system view
• Parallel Computing: basic concepts
• The fundamental patterns of parallel computing

The birth of supercomputing
• The CRAY-1A:
  – 12.5-nanosecond clock,
  – 64 vector registers,
  – 1 million 64-bit words of high-speed memory,
  – Speed: 160 MFLOPS vector peak, 110 MFLOPS Linpack 1000 (best effort).
• Cray software … by 1978 the Cray Operating System (COS), the first automatically vectorizing Fortran compiler (CFT), and the Cray Assembler Language (CAL) had been introduced.
• On July 11, 1977, the CRAY-1A, serial number 3, was delivered to NCAR. The system cost was $8.86 million ($7.9 million plus $1 million for the disks).
  http://www.cisl.ucar.edu/computers/gallery/cray/cray1.jsp

The era of the vector supercomputer: the original supercomputers
• Large mainframes that operated on vectors of data
• Custom built, highly specialized hardware and software
• Multiple processors in a shared memory configuration
• Required modest changes to software (vectorization)
[Chart: peak GFLOPS of vector machines, 0–60 GFLOPS: Cray 2 (4 processors, 1985), Cray YMP (8, 1989), Cray C916 (16, 1991), Cray T932 (32, 1996). Photo: the Cray C916/512 at the Pittsburgh Supercomputing Center.]

The attack of the killer micros
• The Caltech Cosmic Cube, developed by Charles Seitz and Geoffrey Fox in 1981
• 64 Intel 8086/8087 processors
• 128 kB of memory per processor
• 6-dimensional hypercube network
• Launched the “attack of the killer micros” (Eugene Brooks, SC’90)
“The cosmic cube”, Charles Seitz, Communications of the ACM, Vol. 28, No. 1, January 1985, p. 22
http://calteches.library.caltech.edu/3419/1/Cubism.pdf

MPPs
• Parallel computers with large numbers of commercial off-the-shelf microprocessors
• High speed, low latency, scalable interconnection networks
• Lots of custom hardware to support scalability
• Required massive changes to software (parallelization)
• Improving CPU performance and weak scaling helped MPPs dominate supercomputing
[Chart: peak GFLOPS, 0–200 GFLOPS, vector machines vs. MPPs: iPSC/860 (128, 1990), TMC CM-5 (1024, 1992), Paragon XPS (1993). Photo: the Paragon XPS-140 at Sandia National Labs in Albuquerque, NM.]

SIMD computers … the other MPP supercomputer
Thinking Machines CM-2: the classic symmetric SIMD supercomputer (mid-80’s)
• Description: up to 64K bit-serial processing elements.
• Strength: supports deterministic programming models … a single thread of control for ease of understanding.
• Weakness: poor floating-point performance. The programming model was not general enough. TMC struggled throughout the 90’s and filed for bankruptcy in 1994.
“… we want to build a computer that will be proud of us”, Danny Hillis
Third-party names are the property of their owners.

The MPP future looked bright … but then clusters took over
• A cluster is a collection of connected, independent computers that work in unison to solve a problem.
• Nothing is custom … motivated users could build a cluster on their own.
  – First clusters appeared in the late 80’s (stacks of “SPARC pizza boxes”).
  – The Intel Pentium Pro in 1995, coupled with Linux, made them competitive.
  – NASA Goddard’s Beowulf cluster demonstrated publicly that high-visibility science could be done on clusters.
Clusters made it easier to bring the benefits of Moore’s law into working supercomputers.

Top 500 list: system architecture
Source: http://s.top500.org/static/lists/2013/06/TOP500_201306_Poster.pdf
* Constellation: a cluster for which the number of processors on a node is greater than the number of nodes in the cluster. I’ve never seen anyone use this term outside of the Top 500 list.

Execution model: distributed memory MIMD
• Cluster or MPP … the future is clear. Distributed memory scales and is more energy efficient (cache coherence moves lots of electrons around, and that consumes lots of power).
• Each node has its own processors, memory and caches, but cannot directly access another node’s memory.
• Each “node” has a network interface component (NIC) for all communication and synchronization.
• Fundamentally more scalable than shared memory machines … especially cache-coherent shared memory.
[Diagram: nodes P0, P1, …, Pn, each with its own memory and NIC, connected by an interconnect.]

Computer Architecture and Performance Tuning

Cache/memory hierarchy
§ From the CPU to main memory on a recent Haswell processor (c = cycle):
  – Processor core (registers)
  – L1I (32 KB) and L1D (32 KB): (R: 64 B + W: 32 B)/1c, 4c latency
  – L2 (256 KB): R: 64 B/1c, 11c latency
  – Shared L3 (~20 MB): 32 B/1c for all cores, > 21c latency
  – Local/remote memory (large, but typically non-uniform): ~24 B/c for all cores, > 200c latency
§ With multicore, memory bandwidth is shared between the cores in the same processor (socket).

Cache lines (1)
§ When a data element or an instruction is requested by the processor, a cache line is ALWAYS moved (as the minimum quantity), usually to Level 1.
§ A cache line is a contiguous section of memory, typically 64 B in size (8 * double) and 64 B aligned.
§ A 32 KB Level-1 cache can hold 512 lines.
§ When cache lines have to come from memory:
  § Latency is long (> 200 cycles).
  § It is even longer if the memory is remote.
  § The memory controller stays busy (~8 cycles).

Cache lines (2)
§ Good utilisation is vital: when only one element (4 B or 8 B) is used inside the cache line, a lot of bandwidth is wasted!
§ Multidimensional C arrays should be accessed with the last index changing fastest:
    for (i = 0; i < rows; ++i)
      for (j = 0; j < columns; ++j)
        mymatrix[i][j] += increment;
§ Pointer chasing (e.g. in linked lists) can easily lead to “cache thrashing” (too much memory traffic).

Cache lines (3)
§ Prefetching: fetch a cache line before it is requested, hiding the latency.
  § Normally done by the hardware, especially if the processor executes out of order.
  § Also done by software (prefetch instructions), especially on in-order processors (IA-64, Xeon Phi, etc.).
§ Locality is vital:
  § Spatial locality – use all the elements in the line.
  § Temporal locality – complete the work while the elements are certain to be in the cache.
Programming the memory hierarchy is an art in itself.
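As a small illustration of the points above (not from the original slides), the sketch below sums the same matrix twice: once with the last index changing fastest and once with the first index changing fastest. The matrix size and the timing harness are arbitrary choices for the example. On hardware with 64 B cache lines, the row-major loop uses every element of each line it loads, while the column-major loop touches a different line on almost every access.

    #include <chrono>
    #include <cstdio>
    #include <vector>

    int main() {
        const int rows = 4096, columns = 4096;        // illustrative size only
        std::vector<double> mymatrix(static_cast<std::size_t>(rows) * columns, 1.0);

        auto sum_row_major = [&]() {                  // last index fastest: walks each
            double s = 0.0;                           // 64 B cache line front to back
            for (int i = 0; i < rows; ++i)
                for (int j = 0; j < columns; ++j)
                    s += mymatrix[static_cast<std::size_t>(i) * columns + j];
            return s;
        };
        auto sum_col_major = [&]() {                  // first index fastest: consecutive
            double s = 0.0;                           // accesses land on different lines
            for (int j = 0; j < columns; ++j)
                for (int i = 0; i < rows; ++i)
                    s += mymatrix[static_cast<std::size_t>(i) * columns + j];
            return s;
        };

        auto time_it = [](auto f) {                   // crude wall-clock timer in ms
            auto t0 = std::chrono::steady_clock::now();
            volatile double sink = f();               // keep the result alive
            (void)sink;
            auto t1 = std::chrono::steady_clock::now();
            return std::chrono::duration<double, std::milli>(t1 - t0).count();
        };

        std::printf("row-major:    %.1f ms\n", time_it(sum_row_major));
        std::printf("column-major: %.1f ms\n", time_it(sum_col_major));
    }

On a typical machine the column-major version can be several times slower, even though both loops perform exactly the same arithmetic; the difference is purely cache-line utilisation.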
Latency measurements (example)
§ Memory latency on a Sandy Bridge-EP 2690 (dual socket):
§ 90 ns (local) versus 150 ns (remote).
[Diagram: two sockets, each with cores C0–C5 (two hardware threads per core), a shared cache and a memory controller; the sockets are connected to each other by an interconnect and to the I/O bus.]

First topic: vector registers
§ Until recently, Streaming SIMD Extensions (SSE): 16 “XMM” registers with 128 bits each (in 64-bit mode).
§ New (as of 2011), Advanced Vector eXtensions (AVX): 16 “YMM” registers with 256 bits each.
[Diagram: a 256-bit YMM register (bit 255 … bit 0) viewed as 32 byte elements, 16 words (E15 … E0), 8 dwords/single-precision elements (E7 … E0), or 4 qwords/double-precision elements (E3 … E0); the lower 128 bits correspond to SSE. Register widths: 128 bits (SSE), 256 bits (AVX1/AVX2), NOW 512 bits (AVX-512).]

Four floating-point data flavours
§ Single precision:
  § Scalar single (SS) – only one element (E0) of the register is used.
  § Packed single (PS) – all eight single-precision elements (E7 … E0) are used.
§ Double precision:
  § Scalar double (SD) – only one element (E0) is used.
  § Packed double (PD) – all four double-precision elements (E3 … E0) are used.
§ Note: scalar mode (with AVX) means using only 1/8 of the width (single precision) or 1/4 of the width (double precision).
§ Even longer vectors have been announced: definitely 512 bits (already used in the Xeon Phi processors).

Single Instruction Multiple Data [figure]

§ Intel’s Nehalem micro-architecture can execute four instructions in parallel (across six ports) in each cycle.

Latest superscalar architecture
§ Intel’s Haswell micro-architecture can execute four instructions in parallel (across eight ports) in each cycle.
[Diagram: Haswell’s execution ports 0–7 and their units: integer ALU/shift/LEA units, branch units, load and store-address units (ports 2, 3, 7), a store-data unit (port 4), vector integer/logical/shuffle/string-compare units, two vector FMA/FMul units, a vector FAdd unit, x87 multiply and add, integer MUL, and DIV/SQRT.]

Matrix multiply example
§ For a given algorithm, we can understand exactly which functional execution units are needed.
§ For instance, in the innermost loop of matrix multiplication:
    for ( int i = 0; i < N; ++i ) {
      for ( int j = 0; j < N; ++j ) {
        for ( int k = 0; k < N; ++k ) {
          c[ i * N + j ] += a[ i * N + k ] * b[ k * N + j ];
        }
      }
    }
§ Until Haswell (2012), each iteration needs: Load, Load, Mult, Add, Store.
§ As of Haswell (2013): Load, Load, FMA, Store.
(A vectorized sketch of this loop appears at the end of this section.)

Recent architectures
[Figure, source: AnandTech]

Cost of operations (in CPU cycles):

  op                instruction    SSE s   SSE d   AVX s   AVX d
  +, -              ADD, SUB       3       3       3       3
  ==, <, >          COMISS, CMP..  2,3     2,3     2,3     2,3
  f=d, d=f          CVT..          3       3       4       4
  |, &, ^           AND, OR        1       1       1       1
  *                 MUL            5       5       5       5
  /, sqrt           DIV, SQRT      10-14   10-22   21-29   21-45
  1.f/ , 1.f/sqrt   RCP, RSQRT     5       -       7       -
  =                 MOV            1,3,…   1,3,…   1,4,…   1,4,…

  (350 cycles if the data has to come from main memory)

Outline
• High Performance Computing: a hardware system view
• Parallel Computing: basic concepts
• The fundamental patterns of parallel computing

Concurrency vs.
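To connect the matrix-multiply example with the vector registers and FMA units discussed above, here is a minimal AVX2/FMA sketch of the same computation. This is an illustration, not code from the slides: the loop order is changed to i-k-j so the innermost loop can be vectorized over j, and N is assumed to be a multiple of 4 so remainder handling can be left out. Each inner iteration then issues the Load + Load + FMA + Store pattern noted above, operating on four doubles per 256-bit YMM register.

    #include <immintrin.h>   // AVX2 + FMA intrinsics (compile with e.g. -mavx2 -mfma)

    // Sketch: c += a * b for square N x N row-major matrices, N assumed to be
    // a multiple of 4. The broadcast a[i][k] is reused across the whole row of b.
    void matmul_fma(const double* a, const double* b, double* c, int N) {
        for (int i = 0; i < N; ++i) {
            for (int k = 0; k < N; ++k) {
                __m256d aik = _mm256_set1_pd(a[i * N + k]);        // broadcast a[i][k]
                for (int j = 0; j < N; j += 4) {
                    __m256d cij = _mm256_loadu_pd(&c[i * N + j]);  // Load
                    __m256d bkj = _mm256_loadu_pd(&b[k * N + j]);  // Load
                    cij = _mm256_fmadd_pd(aik, bkj, cij);          // FMA: c += a * b
                    _mm256_storeu_pd(&c[i * N + j], cij);          // Store
                }
            }
        }
    }

With a modern compiler at a high optimisation level and the right architecture flags, the plain scalar loop shown in the slide is often auto-vectorized into essentially this instruction sequence, so the intrinsics are shown here only to make the use of the execution units explicit.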