Parallel Computing Introduction
Alessio Turchi

Slides thanks to: Tim Mattson (Intel), Sverre Jarp (CERN), Vincenzo Innocente (CERN)

Outline

• High Performance computing: A hardware system view

• Parallel Computing: Basic Concepts
• The Fundamental patterns of parallel Computing

The birth of Supercomputing

• The CRAY-1A:
  – 12.5-nanosecond clock
  – 64 vector registers
  – 1 million 64-bit words of high-speed memory
  – Speed:
    – 160 MFLOPS vector peak speed
    – 110 MFLOPS Linpack 1000 (best effort)
• Cray software … by 1978 the Cray Operating System (COS), the first automatically vectorizing Fortran compiler (CFT), and the Cray Assembler Language (CAL) were introduced.
• On July 11, 1977, the CRAY-1A, serial number 3, was delivered to NCAR. The system cost was $8.86 million ($7.9 million plus $1 million for the disks).
http://www.cisl.ucar.edu/computers/gallery/cray/cray1.jsp

The Era of the Vector Supercomputer
• Large mainframes that operated on vectors of data
• Custom built, highly specialized hardware and software
• Multiple processors in a shared memory configuration
• Required modest changes to software (vectorization)

[Chart: peak performance (GFLOPS) of Cray vector supercomputers (Cray 2, Cray Y-MP, Cray C90, Cray T90) from the mid-1980s to the mid-1990s. Photo: the Cray C916/512 at the Pittsburgh Supercomputing Center.]

The attack of the killer micros

• The Cosmic Cube, developed by Charles Seitz and Geoffrey Fox in 1981
• 64 Intel 8086/8087 processors
• 128 kB of memory per processor
• 6-dimensional hypercube network

“The Cosmic Cube”, Charles Seitz, Communications of the ACM, Vol. 28, No. 1, January 1985, p. 22.
Eugene Brooks launched the “attack of the killer micros” at SC’90. http://calteches.library.caltech.edu/3419/1/Cubism.pdf

Improving CPU performance and weak scaling helped MPPs dominate supercomputing
• Parallel computers with large numbers of commercial off-the-shelf microprocessors
• High-speed, low-latency, scalable interconnection networks
• Lots of custom hardware to support scalability
• Required massive changes to software (parallelization)

[Chart: performance (GFLOPS) of vector supercomputers versus MPPs in the early 1990s. Photo: the Paragon XPS-140 at Sandia National Labs in Albuquerque, NM.]

SIMD computers … the other MPP supercomputer
• Thinking Machines CM-2: the classic SIMD supercomputer (mid-80’s)
• Description: up to 64K bit-serial processing elements
• Strength: supports deterministic programming models … a single thread of control for ease of understanding
• Weakness: poor floating-point performance. The programming model was not general enough. TMC struggled throughout the 90’s and filed for bankruptcy in 1994.
• “… we want to build a machine that will be proud of us”, Danny Hillis

The MPP future looked bright … but then clusters took over
• A cluster is a collection of connected, independent computers that work in unison to solve a problem.
• Nothing is custom … motivated users could build a cluster on their own.
  . First clusters appeared in the late 80’s (stacks of “SPARC pizza boxes”).
  . The Intel Pentium Pro in 1995, coupled with Linux, made them competitive.
  . NASA Goddard’s Beowulf cluster demonstrated publicly that high-visibility science could be done on clusters.
  . Clusters made it easier to bring the benefits of Moore’s law into working supercomputers.

Top 500 list: System Architecture

Source: http://s.top500.org/static/lists/2013/06/TOP500_201306_Poster.pdf
*Constellation: a cluster for which the number of processors on a node is greater than the number of nodes in the cluster. I’ve never seen anyone use this term outside of the Top500 list.

Execution model: MIMD

• Cluster or MPP … the future is clear. Distributed memory scales and is more energy efficient (moving lots of electrons around consumes lots of power).
• Each node has its own processors, memory and caches, but cannot directly access another node’s memory.
• Each “node” has a Network Interface Component (NIC) for all communication and synchronization.
• Fundamentally more scalable than shared memory machines … especially cache-coherent shared memory.
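To make the message-passing execution model concrete, here is a minimal sketch of an MPI program (an illustration added here, not part of the original slides). It assumes an MPI implementation such as Open MPI or MPICH is installed; each rank would typically run on a different node and can exchange data only through explicit messages over the NIC.

    // Minimal MPI sketch: each rank owns its data; rank 0 gathers partial
    // results sent explicitly over the network (no shared memory between ranks).
    #include <mpi.h>
    #include <cstdio>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);

        int rank = 0, size = 1;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        // Each rank computes a local (per-node) partial result.
        double local = 1.0 * rank;

        if (rank != 0) {
            // Send the partial result to rank 0 (one double, tag 0).
            MPI_Send(&local, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
        } else {
            double total = local;
            for (int src = 1; src < size; ++src) {
                double incoming = 0.0;
                MPI_Recv(&incoming, 1, MPI_DOUBLE, src, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                total += incoming;
            }
            std::printf("sum over %d ranks = %f\n", size, total);
        }

        MPI_Finalize();
        return 0;
    }

Compiled with mpicxx and launched with mpirun, each rank is a separate process. The same reduction could be done with MPI_Reduce; the explicit send/receive is used here only to make the NIC-mediated communication visible.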

[Diagram: distributed-memory execution model: processors P0, P1, …, Pn, each with its own memory and NIC, connected through an interconnect.]

Computer Architecture and Performance Tuning

Cache/Memory Hierarchy

§ From CPU to main memory on a recent Haswell processor (c = cycle):
  – Processor core (registers)
  – L1I (32 KB) and L1D (32 KB): (R: 64 B + W: 32 B)/1c, 4c latency
  – L2 (256 KB): R: 64 B/1c, 11c latency
  – Shared L3 (~20 MB): 32 B/1c for all cores, > 21c latency
  – Local/remote memory (large, but typically non-uniform): ~24 B/c for all cores, > 200c latency
§ With multicore, memory bandwidth is shared between cores in the same processor (socket)

Cache lines (1)

§ When a data element or an instruction is requested by the processor, a cache line is ALWAYS moved (as the minimum quantity), usually to Level-1


§ A cache line is a contiguous section of memory, typically 64B in size (8 * double) and 64B aligned § A 32KB Level-1 cache can hold 512 lines
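As a quick check of these numbers on a given machine, the cache geometry can be queried at run time. This is a small sketch for Linux with glibc (the _SC_LEVEL1_* sysconf names are a glibc extension, so treat it as illustrative rather than portable):

    // Query L1 data-cache size and line size on Linux/glibc and derive
    // the number of lines, e.g. 32 KB / 64 B = 512 lines.
    #include <unistd.h>
    #include <cstdio>

    int main() {
        long l1_size = sysconf(_SC_LEVEL1_DCACHE_SIZE);     // bytes, e.g. 32768
        long line    = sysconf(_SC_LEVEL1_DCACHE_LINESIZE); // bytes, e.g. 64

        if (l1_size > 0 && line > 0)
            std::printf("L1D: %ld KB, line: %ld B, lines: %ld\n",
                        l1_size / 1024, line, l1_size / line);
        else
            std::printf("cache geometry not reported by sysconf here\n");
        return 0;
    }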

§ When cache lines have to come from memory:
  § Latency is long (> 200 cycles)
  § It is even longer if the memory is remote
  § Memory controller stays busy (~8 cycles)

Cache lines (2)

§ Good utilisation is vital
§ When only one element (4 B or 8 B) is used inside the cache line:
  § A lot of bandwidth is wasted!


§ Multidimensional C arrays should be accessed with the last index changing fastest:

for (i = 0; i < rows; ++i)
    for (j = 0; j < columns; ++j)
        mymatrix[i][j] += increment;
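To see the cost of getting this wrong, here is a small timing sketch (an illustration, not from the slides) comparing the two traversal orders on the same matrix; on a typical machine the column-first walk is several times slower, because every access touches a different cache line:

    // Compare cache-friendly (row-wise) vs cache-hostile (column-wise) traversal.
    #include <chrono>
    #include <cstddef>
    #include <cstdio>
    #include <vector>

    int main() {
        const int rows = 4096, columns = 4096;          // ~128 MB of doubles
        std::vector<double> m(static_cast<std::size_t>(rows) * columns, 1.0);

        auto time_it = [&](bool last_index_fastest) {
            auto t0 = std::chrono::steady_clock::now();
            if (last_index_fastest) {
                for (int i = 0; i < rows; ++i)          // last index fastest: good
                    for (int j = 0; j < columns; ++j)
                        m[static_cast<std::size_t>(i) * columns + j] += 1.0;
            } else {
                for (int j = 0; j < columns; ++j)       // first index fastest: a new line per access
                    for (int i = 0; i < rows; ++i)
                        m[static_cast<std::size_t>(i) * columns + j] += 1.0;
            }
            return std::chrono::duration<double>(
                       std::chrono::steady_clock::now() - t0).count();
        };

        std::printf("row-wise:    %.3f s\n", time_it(true));
        std::printf("column-wise: %.3f s\n", time_it(false));
        return 0;
    }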

§ Pointer chasing (in linked lists) can easily lead to “cache thrashing” (too much memory traffic)

Cache lines (3)
§ Prefetching:
  § Fetch a cache line before it is requested
  § Hiding latency
  § Normally done by the hardware
    § Especially if the processor executes out-of-order
  § Can also be done by software (prefetch instructions)
    § Especially when in-order (IA-64, Xeon Phi, etc.)
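As an illustration of a software prefetch (not from the slides), GCC and Clang expose the __builtin_prefetch intrinsic. A sketch that requests data a few iterations ahead while scanning an array might look like this; the distance of 16 elements is an arbitrary example and would need tuning:

    // Software prefetch sketch: ask for data ~16 elements ahead so the
    // cache line is (hopefully) resident by the time it is used.
    #include <cstddef>

    double sum_with_prefetch(const double* data, std::size_t n) {
        double sum = 0.0;
        for (std::size_t i = 0; i < n; ++i) {
            if (i + 16 < n)
                __builtin_prefetch(&data[i + 16], /*rw=*/0, /*locality=*/3);
            sum += data[i];
        }
        return sum;
    }

For a simple sequential scan like this the hardware prefetchers usually do the job already; explicit prefetching tends to pay off mainly for irregular access patterns such as pointer chasing.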

§ Locality is vital: § Spatial locality – Use all elements in the line § Temporal locality – Complete the execution whilst the elements are certain to be in the cache

Programming the memory hierarchy is an art in itself.

Latency Measurements (example)
§ Memory latency on Sandy Bridge-EP 2690 (dual socket)
§ 90 ns (local) versus 150 ns (remote)
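A common way to obtain this kind of number is a pointer chase over a random cyclic permutation, so that each load depends on the previous one and the hardware prefetchers cannot hide the misses. The sketch below (an illustration, not the tool used for the 90/150 ns figures above) reports an approximate average load-to-use latency for a large working set; to separate local from remote memory you would additionally pin the thread and the allocation to specific sockets, for example with numactl. It runs for a few seconds.

    // Pointer-chase latency sketch: walk a single-cycle random permutation.
    #include <chrono>
    #include <cstddef>
    #include <cstdio>
    #include <numeric>
    #include <random>
    #include <utility>
    #include <vector>

    int main() {
        const std::size_t n = (256u << 20) / sizeof(std::size_t); // ~256 MB working set
        std::vector<std::size_t> next(n);
        std::iota(next.begin(), next.end(), 0);

        // Sattolo's algorithm: builds a permutation that is one big cycle,
        // so the chase visits every element before repeating.
        std::mt19937_64 rng{42};
        for (std::size_t i = n - 1; i > 0; --i) {
            std::uniform_int_distribution<std::size_t> d(0, i - 1);
            std::swap(next[i], next[d(rng)]);
        }

        const std::size_t steps = 20000000;
        std::size_t idx = 0;
        auto t0 = std::chrono::steady_clock::now();
        for (std::size_t s = 0; s < steps; ++s)
            idx = next[idx];                 // serialized, cache-missing loads
        auto t1 = std::chrono::steady_clock::now();

        double ns = std::chrono::duration<double, std::nano>(t1 - t0).count() / steps;
        std::printf("~%.1f ns per dependent load (idx=%zu)\n", ns, idx); // print idx so the loop is not optimized away
        return 0;
    }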

[Diagram: dual-socket NUMA system: each socket with cores C0 to C5 (two hardware threads each), a shared cache and a memory controller, local memory attached to each socket, and an interconnect plus I/O bus between the sockets.]

First topic: Vector registers
§ Until recently, Streaming SIMD Extensions (SSE):
  § 16 “XMM” registers with 128 bits each (in 64-bit mode)

§ New (as of 2011): Advanced Vector eXtensions (AVX): § 16 “YMM” registers with 256 bits each

[Diagram: a 256-bit YMM register (bit 255 … bit 0) viewed as 32 byte elements, 16 words (E15 … E0), 8 dwords/singles (E7 … E0), or 4 qwords/doubles (E3 … E0); the lower 128 bits correspond to the SSE XMM register.]
NOW: 256 bits (AVX1/AVX2), 512 bits (AVX-512)

Four floating-point data flavours

§ Single precision
  § Scalar single (SS): – – – – – – – E0 (only E0 used)
  § Packed single (PS): E7 E6 E5 E4 E3 E2 E1 E0
§ Double precision
  § Scalar double (SD): – – – E0 (only E0 used)
  § Packed double (PD): E3 E2 E1 E0

§ Note:
  § Scalar mode (with AVX) means using only:
    § 1/8 of the width (single precision)
    § 1/4 of the width (double precision)
§ Even longer vectors have been announced!
  § Definitely 512 bits (already used in the Xeon Phi processors)

Single Instruction Multiple Data
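To make the packed-versus-scalar distinction concrete, here is a small sketch using AVX intrinsics (an illustration, not from the original slides). It assumes an AVX-capable x86 CPU and a compiler flag such as -mavx, and it processes four doubles per instruction instead of one:

    // Packed-double AVX sketch: one packed add handles four doubles at a time.
    #include <immintrin.h>
    #include <cstdio>

    int main() {
        alignas(32) double a[4] = {1.0, 2.0, 3.0, 4.0};
        alignas(32) double b[4] = {10.0, 20.0, 30.0, 40.0};
        alignas(32) double c[4];

        __m256d va = _mm256_load_pd(a);     // load 4 doubles into a YMM register
        __m256d vb = _mm256_load_pd(b);
        __m256d vc = _mm256_add_pd(va, vb); // packed add: 4 additions in one instruction
        _mm256_store_pd(c, vc);

        std::printf("%g %g %g %g\n", c[0], c[1], c[2], c[3]);
        return 0;
    }

A scalar add would touch only the lowest element of the register; in practice the compiler's auto-vectorizer often generates such packed instructions from plain loops.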


Intel’s Nehalem micro-architecture can execute four instructions in parallel (across six ports) in each cycle.

Latest superscalar architecture
[Diagram: Haswell execution ports 0 to 7 and their functional units: integer ALUs and shifts, load and store-address units, store-data unit, vector integer ALUs, vector logical/shuffle/shift units, vector FMA/FMul/FAdd, x87 FP multiply and add, divide/sqrt, integer multiply, and branch units.]
§ Intel’s Haswell micro-architecture can execute four instructions in parallel (across eight ports) in each cycle.

Matrix multiply example

§ For a given algorithm, we can understand exactly which functional execution units are needed § For instance, in the innermost loop of matrix multiplication

for ( int i = 0; i < N; ++i ) {
  for ( int j = 0; j < N; ++j ) {
    for ( int k = 0; k < N; ++k ) {
      c[ i * N + j ] += a[ i * N + k ] * b[ k * N + j ];
    }
  }
}
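As a side illustration (not from the slides), the multiply-add in the loop body can be written explicitly with std::fma, which corresponds to the single fused multiply-add operation discussed next; with optimization and a Haswell-or-newer target (e.g. -O2 -march=haswell) compilers typically contract the plain expression into an FMA anyway.

    // Inner-loop body written with an explicit fused multiply-add.
    #include <cmath>

    inline void update(double* c, const double* a, const double* b,
                       int i, int j, int k, int N) {
        // c[i*N+j] = a[i*N+k] * b[k*N+j] + c[i*N+j], fused into one operation
        c[i * N + j] = std::fma(a[i * N + k], b[k * N + j], c[i * N + j]);
    }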

Until Haswell (2012): Load, Load, Mult, Add, Store
As of Haswell (2013): Load, Load, FMA, Store

Recent architectures

Source: AnandTech

Cost of operations (in CPU cycles)

op               instruction     SSE single   SSE double   AVX single   AVX double
+, -             ADD, SUB        3            3            3            3
==, <, >         COMISS, CMP..   2-3          2-3          2-3          2-3
f=d, d=f         CVT..           3            3            4            4
|, &, ^          AND, OR         1            1            1            1
*                MUL             5            5            5            5
/, sqrt          DIV, SQRT       10-14        10-22        21-29        21-45
1.f/, 1.f/sqrt   RCP, RSQRT      5            –            7            –
=                MOV             1, 3, …      1, 3, …      1, 4, …      1, 4, …
(~350 cycles when the data comes from main memory)

Outline

• High Performance computing: A hardware system view

• Parallel Computing: Basic Concepts
• The Fundamental patterns of parallel Computing

Concurrency vs. Parallelism

. Two important definitions:
  . Concurrency: A condition of a system in which multiple tasks are logically active at one time.
  . Parallelism: A condition of a system in which multiple tasks are actually active at one time.

Concurrent, non-parallel Execution

Concurrent, parallel Execution

Concurrency in Action: a web server

. A Web Server is a Concurrent Application (the problem is fundamentally defined in terms of concurrent tasks):
  . An arbitrarily large number of clients make requests which reference per-client persistent state.
  . Consider an Image Server, which relieves load on primary web servers by storing, processing, and serving only images.

[Diagram: clients on the Internet sending requests to a web server, which offloads image storage and serving to an image server.]

Concurrency in action: Mandelbrot Set

. The Mandelbrot set: an iterative map in the complex plane:
    z_{n+1} = z_n^2 + c,   with z_0 = 0 and c constant
. Color each point c in the complex plane based on convergence or divergence of the iterative map.
[Plot: the Mandelbrot set in the complex plane, with axes C_real and C_imaginary.]
. The computation for each point is independent of all the other points … a so-called embarrassingly parallel problem.
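To make the per-point work concrete, here is a small sketch (an illustration added here, not part of the slides) of the iteration for a single value of c; the escape threshold |z| > 2 and the iteration cap are the usual conventions:

    // Iterate z_{n+1} = z_n^2 + c for one point c and return how many steps
    // it takes to escape (|z| > 2), or max_iter if it appears to converge.
    #include <complex>

    int mandel_point(std::complex<double> c, int max_iter = 1000) {
        std::complex<double> z{0.0, 0.0};
        for (int n = 0; n < max_iter; ++n) {
            if (std::abs(z) > 2.0) return n;   // diverged: color the pixel by n
            z = z * z + c;
        }
        return max_iter;                       // treated as "inside" the set
    }

A full image is produced by calling this for every pixel (i, j) after mapping the pixel to a value of c, which is exactly the independent per-pixel task discussed next.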

Decomposition in parallel programs
. Every parallel program is based on concurrency … i.e. tasks defined by an application that can run at the same time.
. EVERY parallel program requires a task decomposition and a data decomposition:
  . Task decomposition: break the application down into a set of tasks that can execute concurrently.
  . Data decomposition: how must the data be broken down into chunks and associated with threads/processes to make the parallel program run efficiently?

What’s a task decomposition for this problem?
  Task: the computation required for each pixel … the body of the loop for a pair (i, j).
Suggest a data decomposition for this problem … assume a quad-core shared-memory PC.
  Map the pixels into row blocks and deal them out to the cores. This will give each core a memory-efficient block to work on.
But given this data decomposition, is it effective to think of a task as the update to a pixel? Should we update our task definition given the data decomposition?
  Yes. You go back and forth between task and data decomposition until you have a pair that work well together. In this case, let’s define a task as the update to a row block.
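A minimal sketch of this decomposition on a shared-memory machine (an illustration, assuming OpenMP, compilation with -fopenmp, and the mandel_point helper sketched earlier; the mapping of pixels to the region of the complex plane is just an example): the parallel loop hands whole rows to the threads, so each core gets a contiguous, cache-friendly block of pixels as its task.

    // Row-wise parallel Mandelbrot sketch: blocks of rows become the tasks
    // handed to the threads of a shared-memory machine (e.g. a quad-core PC).
    #include <complex>
    #include <cstddef>
    #include <vector>

    int mandel_point(std::complex<double> c, int max_iter);  // from the earlier sketch

    std::vector<int> mandel_image(int width, int height, int max_iter) {
        std::vector<int> img(static_cast<std::size_t>(width) * height);
        #pragma omp parallel for schedule(static)
        for (int i = 0; i < height; ++i) {                    // one row block per task
            for (int j = 0; j < width; ++j) {
                std::complex<double> c(-2.0 + 3.0 * j / width,    // map pixel -> c (example region)
                                       -1.5 + 3.0 * i / height);
                img[static_cast<std::size_t>(i) * width + j] = mandel_point(c, max_iter);
            }
        }
        return img;
    }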

Outline

• High Performance computing: A hardware system view
• The processors in HPC systems
• Parallel Computing: Basic Concepts
• The Fundamental patterns of parallel Computing

Data Parallelism Pattern

• Use when:
  – Your problem is defined in terms of independent collections of data elements operated on by a similar (if not identical) sequence of instructions; i.e. the concurrency is in the data.
  – Hint: when the data decomposition dominates your design, this is probably the pattern to use!
• Solution:
  – Define collections of data elements that can be updated in parallel.
  – Define computation as a sequence of collective operations applied together to each data element.
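A compact sketch of this pattern (an illustration, not from the slides) using the C++17 parallel algorithms; availability of std::execution::par depends on the standard library (libstdc++ needs TBB at link time), and the same collective update could equally be written as an OpenMP parallel loop:

    // Data-parallel pattern sketch: the same operation is applied independently
    // to every element of the collection; the runtime decides how to split it.
    #include <algorithm>
    #include <execution>
    #include <vector>

    void scale_all(std::vector<double>& data, double factor) {
        std::transform(std::execution::par, data.begin(), data.end(), data.begin(),
                       [factor](double x) { return x * factor; });
    }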

[Diagram: tasks applied in parallel, one per data chunk (Data 1, Data 2, Data 3, …, Data n).]

Task Parallelism Pattern

. Use when:
  . The problem naturally decomposes into a distinct collection of tasks.
  . Hint: when the task decomposition dominates your design, this is probably the pattern to use.

• Solution
  – Define the set of tasks and a way to detect when the computation is done.
  – Manage (or “remove”) dependencies so the correct answer is produced regardless of the details of how the tasks execute.
  – Schedule the tasks for execution in a way that keeps the work balanced between the processing elements of the parallel computer.
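As an illustration of the task-parallel pattern (added here, not from the slides), a small divide-and-conquer sum using std::async: each recursive half becomes an independent task, and completion is detected simply by waiting on the futures.

    // Divide-and-conquer task parallelism sketch: split the range into tasks,
    // run one half as an independent task, and join on the future.
    #include <cstddef>
    #include <future>
    #include <numeric>

    double parallel_sum(const double* first, const double* last) {
        const std::ptrdiff_t n = last - first;
        if (n < 100000)                                   // small enough: do it serially
            return std::accumulate(first, last, 0.0);

        const double* mid = first + n / 2;
        auto left = std::async(std::launch::async, parallel_sum, first, mid);
        double right = parallel_sum(mid, last);           // reuse the current thread
        return left.get() + right;                        // completion detection: join
    }

A production version would schedule these tasks through a task queue or thread pool (or OpenMP tasks) to keep the work balanced and avoid spawning one thread per subdivision.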

Fundamental Design Patterns:
• Data Parallelism:
  – Kernel Parallelism
  – Geometric Decomposition
  – Loop parallel
• Task Parallelism:
  – Task queue
  – Divide and Conquer
  – Loop parallel
• Implementation Patterns (used to support the above):
  – SPMD (Any MIMD machine, but typically distributed memory)
  – Fork-Join (Multithreading, shared address space MIMD)
  – Kernel Parallelism (GPGPU)

Summary
Processors with lots of cores/vector-units/SIMT connected into clusters are here to stay. You have no choice … embrace parallel computing!

Protect your software investment … refuse to use any programming model that locks you to a vendor’s platform. Open Standards are the ONLY rational approach in the long run.
Parallel programming can be intimidating to learn, but there are only 6 fundamental design patterns used in most programs.