Lecture on Scientific Computing

Dr. Kersten Schmidt

Lecture 18

Technische Universität Berlin, Institut für Mathematik, Winter semester 2014/2015

Syllabus

I Linear Regression, Fast Fourier transform
I Modelling by partial differential equations (PDEs)
I Maxwell, Helmholtz, Poisson, Linear elasticity, Navier-Stokes equation
I boundary value problem, eigenvalue problem
I boundary conditions (Dirichlet, Neumann, Robin)
I handling of infinite domains (wave-guide, homogeneous exterior: DtN, PML)
I boundary integral equations
I Computer-aided design (CAD)
I Mesh generators
I Space discretisation of PDEs
I Finite difference method
I Finite element method
I Discontinuous Galerkin finite element method
I Solvers
I Linear Solvers (direct, iterative), preconditioner
I Nonlinear Solvers (Newton-Raphson iteration)
I Eigenvalue Solvers
I Parallelisation
I (SIMD, MIMD: shared/distributed memory)
I Programming in parallel: OpenMP, MPI

Computer hardware

Central Processing Unit (CPU) – the processor – consisting of

I the arithmetic logic unit (ALU), which performs arithmetic and logic operations,

I hardware registers, which supply operands to the ALU and store results,

I the control unit, which fetches instructions from main memory,

I a hierarchy of CPU caches (levels L1, L2, L3) for temporarily storing data which will be needed for the next instructions, and

I possibly an integrated graphics processor.

A processor may consist of several copies of these subunits, the cores, to obtain parallelisation.

Computer hardware

Clock signal (dt. Taktsignal)

I for synchronisation of the operations in the CPU and of fetching from and writing to memory

I the clock rate is the number of cycles per unit of time, e.g. 3.6 GHz for the Intel Core i7-4790

I Arithmetic and logic operations each need their own number of cycles

I Receiving data from memory needs several cycles (latency)

SIMD (single instruction multiple data) extension

I realises the same operation on many similar data items at the same time, e.g. matrix operations.

I Streaming SIMD Extensions (SSE) in modern PC CPUs (since Intel Pentium III, AMD Athlon XP) with additional SIMD registers.

I Increase of number of .

Computer hardware

Example: use of SIMD
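A minimal sketch of such a use of SIMD (not the original example from the slide; it assumes a CPU with SSE, a C99 compiler, and arrays whose length is a multiple of 4 and which are 16-byte aligned):

    #include <xmmintrin.h>

    /* Element-wise product c[i] = a[i] * b[i]: each _mm_mul_ps performs
       four single-precision multiplications with a single instruction. */
    void simd_multiply(const float *a, const float *b, float *c, int n)
    {
        for (int i = 0; i < n; i += 4) {
            __m128 va = _mm_load_ps(&a[i]);            /* load 4 floats of a */
            __m128 vb = _mm_load_ps(&b[i]);            /* load 4 floats of b */
            _mm_store_ps(&c[i], _mm_mul_ps(va, vb));   /* 4 products at once */
        }
    }

A scalar loop would issue n multiply instructions one after another; here the loop body runs only n/4 times.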

Computer hardware

A variant of SIMD: a pipeline

I Complicated operations often take more than one cycle to complete, e.g. the multiplication of two integers takes 4 clock ticks

Example: the element-by-element product c_i = a_i b_i of two integer vectors of length n (Hadamard product) takes 4n clock ticks (Petersen and Arbenz, “Introduction to Parallel Computing”, Oxford University Press, 2004)

Pipeline: split the operation into several stages that each take one cycle; then a pipeline can (after a startup phase) produce a result in each clock cycle (see the cycle-count sketch below)

Let the numbers a_i, b_i, c_i be split into four fragments (bytes, little-endian):

a_i = [a_{i,3}, a_{i,2}, a_{i,1}, a_{i,0}]
b_i = [b_{i,3}, b_{i,2}, b_{i,1}, b_{i,0}]
c_i = [c_{i,3}, c_{i,2}, c_{i,1}, c_{i,0}]

then c_{i,j} = a_{i,j} b_{i,j} + carry from a_{i,j−1} b_{i,j−1}
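A minimal C sketch of the resulting cycle counts (a model only, with an illustrative vector length; it assumes one pipeline stage per clock tick and the 4 + n estimate for the pipelined version):

    #include <stdio.h>

    /* Clock ticks for n integer multiplications on a 4-stage multiplier:
       without a pipeline each product takes 4 ticks; with a pipeline the
       stages overlap, so after the startup phase one product finishes
       per tick. */
    int main(void)
    {
        int n = 1000;                       /* vector length (example value) */
        int stages = 4;                     /* multiplication split into 4 stages */
        long without = (long)stages * n;    /* 4n ticks */
        long with    = stages + n;          /* startup phase + one result per tick */
        printf("no pipeline: %ld ticks, pipeline: %ld ticks, speed-up %.2f\n",
               without, with, (double)without / (double)with);
        return 0;
    }

For n = 1000 this prints 4000 versus 1004 ticks, i.e. a speed-up close to the pipeline depth of 4.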

Computer hardware

A variant of SIMD: a pipeline

Example: the element-by-element product c_i = a_i b_i of two integer vectors of length n (Hadamard product) takes 4n clock ticks (Petersen and Arbenz, “Introduction to Parallel Computing”, Oxford University Press, 2004)

Speed-up: S = 4n / (4 + n) ≈ 4 if n ≫ 4

Computer hardware

Computer memory

B primary (RAM, dt. Arbeitsspeicher)
I storing the program (instructions) to run and data to work on (concept by von Neumann, Princeton)

I loses data if the device is powered down

B secondary
I does not lose data if the device is powered down

I examples: flash memory (e.g. solid state drives, SSD), magnetic discs (hard and floppy disks), optical discs (e.g. CD-ROM)

B cache (as part of the CPU)
I temporarily storing data which will be needed for the next instructions

Computer hardware

Prefetch data into the cache which will be needed for further instructions

No prefetch: the processor stalls periodically while waiting for data to be retrieved from main memory into the cache or into the processor's registers

Prefetching data before the processor completes the previous task eliminates the stall times.

For fast computations some stall time remains even when prefetching data, but it is shorter.

Computer hardware

Prefetch data into the cache which will be needed for further instructions

Example: application of a function to each component a_i of a vector, b_i = f(a_i) (Petersen and Arbenz, “Introduction to Parallel Computing”, Oxford University Press, 2004)

Simple loop:

    for (i = 0; i < n; ++i)
        b[i] = f(a[i]);

With prefetching (hiding the next load of a[i] under the loop overhead):

    t = a[0];                 /* prefetch a[0] */
    for (i = 0; i < n-1; ) {
        b[i] = f(t);
        t = a[++i];           /* prefetch a[i+1] */
    }
    b[n-1] = f(t);

Computer hardware

Moore’s law: observation of an exponential growth of the speed of integrated circuits / processors

I the number of transistors per square centimetre doubles every 18 to 24 months
I but access time to memory has not improved accordingly; memory performance doubles only every 6 years (→ hierarchy of caches; see the comparison below)
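To put the two growth rates side by side (a back-of-the-envelope calculation using the doubling times quoted above, not a number from the slide): over 12 years, a doubling time of 24 months yields a factor of 2^6 = 64 in transistor count, whereas a doubling time of 6 years yields only a factor of 2^2 = 4 in memory performance, hence the need for a cache hierarchy to bridge the gap.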

“Transistor Count and Moore’s Law – 2011” by Wgsimon, own work (Wikimedia Commons)

Computer hardware

Overview of Parallel Computing Hardware

Parallel computing: distribute the instructions on many processors to decrease the overall computation time.

Parallel systems can be classified according to the number of instruction streams and data streams (M. Flynn, Proc. IEEE 54 (1966), 1901–1909).

Computer hardware

SISD: Single instruction stream – Single data stream

B the classical von Neumann machine

SIMD: Single instruction stream – Multiple data streams

I During each clock cycle the central control unit broadcasts an instruction to the processors, and each of them either executes the instruction or is idle.

I At any given time, a processor is either active, executing exactly the same instruction as all other active processors in a completely synchronous way, or it is idle.

Computer hardware

MIMD: Multiple instruction streams – Multiple data streams

I each processor can execute its own instruction stream on its own data independently from other processors

I each processor is a full-fledged (dt. vollwertig) CPU with control unit and ALU

I MIMD systems are asynchronous: each processor can execute its own program

Generic distributed memory computer, e.g., a cluster

Generic shared memory computer, e.g., compute server

Computer hardware

Clusters with partly distributed memory and shared memory

Clusters/compute servers at the Institut für Mathematik, TU Berlin: http://www.math.tu-berlin.de/iuk/computeserver

I AMD Clusters, Intel Clusters, GPU Clusters, IBM-Cell processor cluster

I Batch system (dt. System der Stapelverarbeitung) to submit (parallel) jobs

Parallelisation

Processes

I A process is an instance of a program that is executing more or less autonomously on a physical processor.

I A program is parallel if, at any time during its execution, it comprises more than one process.

Shared-memory and distributed-memory programs differ in how processes communicate with each other (see the sketches after this list):

I In shared-memory programs communication is through variables that are shared by all the involved processes.

I In distributed-memory programs processes communicate by sending and receiving messages.
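Two minimal sketches in C (illustrative only; the variable names and values are made up). In a shared-memory program the threads communicate through the shared array s, here with OpenMP:

    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        double s[4] = {0.0};                /* variable shared by all threads */
        #pragma omp parallel num_threads(4)
        {
            int id = omp_get_thread_num();
            s[id] = 1.0 + id;               /* each thread writes its own entry */
        }
        printf("s[3] = %g\n", s[3]);        /* master thread reads the shared data */
        return 0;
    }

In a distributed-memory program the same exchange must be expressed as explicit messages, here with MPI:

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 1) {                    /* process 1 sends a value ...  */
            double x = 4.0;
            MPI_Send(&x, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
        } else if (rank == 0) {             /* ... which process 0 receives */
            double x;
            MPI_Recv(&x, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("P0 received %g from P1\n", x);
        }
        MPI_Finalize();
        return 0;
    }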

Parallelisation

Sequential algorithms

I to estimate the run-time of a sequential algorithm we count the number of floating point operations (flops) – sometimes it may be better to count the number of memory accesses, but that does not change the picture in comparison with parallel algorithms

T_seq = (number of flops) × t_flop

Example: dot product of two n-vectors

x · y = Σ_{i=1}^{n} x_i y_i  ⇒  T_seq = (2n − 1) t_flop  (n multiplications and n − 1 additions)

Parallelisation

Parallel algorithm – what does execution time mean on a parallel processor?

I on a distributed arrangement of processors there is no common or synchronized clock

I one may choose the maximal execution time of the program on the processors involved in the computation

I if we measure time in parallel we do this on one processor (P0) – the master processor (see the timing sketch after this list)

I processor P0 usually reads input data and outputs the computed results

I for theoretical considerations we simply assume that all processors (say p) start in the same moment

I the execution time T(p) then is the period of time from this moment until the last of the p processors finishes its computation

I T(1) is the execution time of the best sequential algorithm
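A minimal sketch of this timing convention with MPI (do_work() is just a placeholder for the actual parallel computation; the barriers model the assumption that all p processors start at the same moment and that T(p) ends when the last one finishes):

    #include <stdio.h>
    #include <mpi.h>

    void do_work(void) { /* placeholder for the parallel computation */ }

    int main(int argc, char **argv)
    {
        int rank;
        double t0 = 0.0;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_Barrier(MPI_COMM_WORLD);        /* all processors start together  */
        if (rank == 0) t0 = MPI_Wtime();    /* wall-clock time measured on P0 */

        do_work();

        MPI_Barrier(MPI_COMM_WORLD);        /* wait for the last processor    */
        if (rank == 0)
            printf("T(p) = %g s\n", MPI_Wtime() - t0);

        MPI_Finalize();
        return 0;
    }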

A paper by David Bailey in Supercomputing Review (1991) shows how to manipulate timings in order to claim successful parallelisation of an algorithm.

Speedup – measures the gain in (wall-clock) time that is obtained by parallel execution of a program
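With the notation from the previous slide this is commonly quantified as S(p) = T(1)/T(p), the ratio of the best sequential execution time to the execution time on p processors (a standard definition, not spelled out on this slide).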
