Lecture on Scientific Computing
Dr. Kersten Schmidt
Lecture 18
Technische Universität Berlin, Institut für Mathematik, Wintersemester 2014/2015
Syllabus
- Linear regression, fast Fourier transform
- Modelling by partial differential equations (PDEs)
  - Maxwell, Helmholtz, Poisson, linear elasticity, Navier-Stokes equations
  - boundary value problem, eigenvalue problem
  - boundary conditions (Dirichlet, Neumann, Robin)
  - handling of infinite domains (wave-guide, homogeneous exterior: DtN, PML)
  - boundary integral equations
- Computer-aided design (CAD)
- Mesh generators
- Space discretisation of PDEs
  - Finite difference method
  - Finite element method
  - Discontinuous Galerkin finite element method
- Solvers
  - Linear solvers (direct, iterative), preconditioners
  - Nonlinear solvers (Newton-Raphson iteration)
  - Eigenvalue solvers
- Parallelisation
  - Computer hardware (SIMD, MIMD: shared/distributed memory)
  - Programming in parallel: OpenMP, MPI
VL Scientific Computing WS 2014/2015, Dr. K. Schmidt, 01/20/2015
Computer hardware
Central processing unit (CPU), the processor, consists of
- the arithmetic logic unit (ALU), which performs arithmetic and logic operations,
- hardware registers, which supply operands to the ALU and store results,
- the control unit, which fetches instructions from main memory,
- a hierarchy of CPU caches (levels L1, L2, L3) for temporarily storing data that will be needed by the next instructions, and
- possibly an integrated graphics processor.
A processor may consist of several copies of these subunits, the cores, to enable parallel execution.
Clock signal (German: Taktsignal)
- synchronises the operations in the CPU and the fetching from and writing to memory
- The clock rate is the number of cycles per unit time, e.g. 3.6 GHz for the i7-4790
- Arithmetic and logic operations each take a certain number of cycles
- Receiving data from memory takes several cycles (latency)
Vector processor or SIMD (single instruction, multiple data) extension
- performs the same operation on many similar data items at the same time, e.g. in matrix operations.
- Streaming SIMD Extensions (SSE) in modern PC CPUs (since Intel Pentium III, AMD Athlon XP) with additional SIMD registers.
- Increases the number of operations carried out per cycle.
Example: use of SIMD
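As one possible illustration (not taken from the lecture), the sketch below adds two float vectors four elements at a time with SSE intrinsics. The function name `vec_add` is hypothetical. The SSE path is only compiled when the compiler defines `__SSE__`; otherwise the plain scalar loop does all the work.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>   /* SSE intrinsics: __m128, _mm_add_ps, ... */
#endif

/* c[i] = a[i] + b[i] for i = 0..n-1 (hypothetical helper name). */
void vec_add(const float *a, const float *b, float *c, size_t n)
{
    assert(a != NULL && b != NULL && c != NULL);
    size_t i = 0;
#if defined(__SSE__)
    /* One SIMD instruction adds four floats held in a 128-bit register. */
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);   /* unaligned load of 4 floats */
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(c + i, _mm_add_ps(va, vb));
    }
#endif
    /* Scalar loop: handles the remainder, or everything without SSE. */
    for (; i < n; ++i)
        c[i] = a[i] + b[i];
}
```

Calling `vec_add(a, b, c, 6)` on six-element arrays processes the first four elements with one SIMD addition (where SSE is available) and the last two in the scalar loop.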
A variant of SIMD: a pipeline
- Complicated operations often take more than one cycle to complete, e.g. the multiplication of two integers takes 4 clock ticks
Example: The element-by-element product $c_i = a_i b_i$ of two integer vectors of length $n$ (Hadamard product) takes $4n$ clock ticks (Petersen and Arbenz, “Introduction to Parallel Computing”, Oxford University Press, 2004).
Pipeline: Split the operation into several stages that each take one cycle; after a start-up phase, the pipeline can then produce one result in each clock cycle.
Let the numbers $a_i$, $b_i$, $c_i$ be split into four fragments (bytes, little-endian):
$a_i = [a_{i,3}, a_{i,2}, a_{i,1}, a_{i,0}]$
$b_i = [b_{i,3}, b_{i,2}, b_{i,1}, b_{i,0}]$
$c_i = [c_{i,3}, c_{i,2}, c_{i,1}, c_{i,0}]$
then
$c_{i,j} = a_{i,j} b_{i,j} + \text{carry from } a_{i,j-1} b_{i,j-1}$
Speed-up: with $4 + n$ clock ticks for the pipelined product instead of $4n$, $S = \frac{4n}{4 + n} \approx 4$ for $n \gg 4$.
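The tick counts behind this speed-up can be checked with a few lines of C. This is a minimal sketch that takes the slide's counts at face value (`stages * n` ticks without a pipeline, `stages + n` with one); the function names are hypothetical.

```c
#include <assert.h>

/* Clock ticks without pipelining: every product occupies all stages. */
unsigned long ticks_sequential(unsigned long n, unsigned long stages)
{
    return stages * n;
}

/* Clock ticks with pipelining, using the slide's count:
   a start-up of `stages` ticks, then one result per tick. */
unsigned long ticks_pipelined(unsigned long n, unsigned long stages)
{
    return stages + n;
}

/* Speed-up S = (stages * n) / (stages + n). */
double pipeline_speedup(unsigned long n, unsigned long stages)
{
    assert(n > 0 && stages > 0);
    return (double)ticks_sequential(n, stages)
         / (double)ticks_pipelined(n, stages);
}
```

For a 4-stage pipeline, `pipeline_speedup(4, 4)` gives 2, while `pipeline_speedup(1000, 4)` is already close to the limit of 4: the speed-up is bounded by the number of stages, not by the vector length.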
Computer memory
- primary (RAM, German: Arbeitsspeicher)
  - stores the program (instructions) to run and the data to work on (concept by von Neumann, Princeton)
  - loses its data when the device is powered down
- secondary
  - does not lose its data when the device is powered down
  - examples: flash memory (e.g. solid-state drives, SSD), magnetic disks (hard and floppy disks), optical discs (e.g. CD-ROM)
- cache (as part of the CPU)
  - temporarily stores data that will be needed by the next instructions
Prefetch data into the cache that will be needed by later instructions
No prefetch: The processor stalls periodically while waiting for data to be retrieved from main memory into the cache or into the processor's registers.
Prefetching data before the processor completes the previous task eliminates stall times.
For fast computations, some stall time remains even with prefetching, but it is shorter.
Example: Application of a function $f$ to each component $a_i$ of a vector, giving $b_i = f(a_i)$ (Petersen and Arbenz, “Introduction to Parallel Computing”, Oxford University Press, 2004)
Simple loop
    for (i = 0; i < n; ++i)
        b[i] = f(a[i]);
with prefetching (hiding the next load of a[i] under the loop overhead)
    t = a[0];               /* prefetch a[0] */
    for (i = 0; i < n-1; ) {
        b[i] = f(t);
        t = a[++i];         /* prefetch a[i+1] */
    }
    b[n-1] = f(t);
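A different way to express the same idea is an explicit prefetch hint. The sketch below is not from the lecture: it uses the GCC/Clang builtin `__builtin_prefetch` (replaced by a no-op on other compilers), a hypothetical stand-in for `f`, and an assumed look-ahead distance of 8 elements. For a simple sequential sweep like this one, the hardware prefetcher may already do the job, so any benefit has to be measured.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__GNUC__) || defined(__clang__)
#define PREFETCH(addr) __builtin_prefetch(addr)   /* compiler builtin */
#else
#define PREFETCH(addr) ((void)0)                  /* no-op elsewhere */
#endif

/* Hypothetical stand-in for the function f of the slide. */
static double f(double x) { return 2.0 * x + 1.0; }

/* Number of elements to look ahead; a tuning assumption. */
enum { PREFETCH_DISTANCE = 8 };

/* b[i] = f(a[i]) with a prefetch hint for a later element. */
void apply_f(const double *a, double *b, size_t n)
{
    assert(a != NULL && b != NULL);
    for (size_t i = 0; i < n; ++i) {
        if (i + PREFETCH_DISTANCE < n)
            PREFETCH(&a[i + PREFETCH_DISTANCE]);  /* hint only, results unchanged */
        b[i] = f(a[i]);
    }
}
```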
Moore's law: observation of an exponential growth of the speed of integrated circuits / processors
- the number of transistors per square centimetre doubles every 18 to 24 months
- but access time to memory has not improved accordingly; memory performance doubles only every 6 years (→ hierarchy of caches)
Figure: “Transistor Count and Moore's Law - 2011” by Wgsimon, own work (Wikimedia Commons)
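To get a feeling for the gap, the doubling periods quoted above can be compared over one six-year span (72 months). This is a rough worked example that takes the quoted rates at face value:

```latex
\[
  \text{transistor density: } 2^{72/24} = 8 \ \text{ to } \ 2^{72/18} = 16,
  \qquad
  \text{memory performance: } 2^{72/72} = 2 .
\]
```

Under these assumptions the gap between processor and memory widens by a factor of 4 to 8 every six years, which is what the cache hierarchy has to bridge.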
Overview of Parallel Computing Hardware
Parallel computing: distribute the instructions over many processors to decrease the overall computation time.
Parallel systems can be classified according to the number of instruction streams and data streams (M. Flynn, Proc. IEEE 54 (1966), 1901–1909).
SISD: Single instruction stream, single data stream
- the classical von Neumann machine
SIMD: Single instruction stream, multiple data streams
- During each instruction cycle the central control unit broadcasts an instruction to the processors, and each of them either executes the instruction or is idle.
- At any given time a processor is either active, executing exactly the same instruction as all other active processors in a completely synchronous way, or idle.
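The "execute or idle" behaviour can be emulated in plain C. This is a sequential sketch, not real SIMD hardware: the loop stands for what the processing elements do simultaneously, and the names `NUM_PE` and `simd_broadcast_add` as well as the use of an activity mask are illustrative assumptions.

```c
#include <assert.h>
#include <stddef.h>

enum { NUM_PE = 8 };   /* number of processing elements, chosen for illustration */

/* One broadcast step: every active element adds `operand` to its own data,
   masked-out elements stay idle. Returns the number of active elements. */
size_t simd_broadcast_add(int data[NUM_PE], const int active[NUM_PE], int operand)
{
    assert(data != NULL && active != NULL);
    size_t busy = 0;
    for (size_t pe = 0; pe < NUM_PE; ++pe) {   /* conceptually simultaneous */
        if (active[pe]) {
            data[pe] += operand;               /* same instruction, own data */
            ++busy;
        }
    }
    return busy;
}
```

If only the elements holding negative values are marked active, a single broadcast of "add 10" changes those entries and leaves the others untouched; the idle elements do no useful work during that step.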
MIMD: Multiple instruction streams, multiple data streams
- each processor can execute its own instruction stream on its own data, independently of the other processors
- each processor is a full-fledged (German: vollwertig) CPU with control unit and ALU
- MIMD machines are asynchronous: each processor can execute its own program
Generic distributed-memory computer, e.g. a cluster
Generic shared-memory computer, e.g. a compute server
Clusters with partly distributed and partly shared memory
Clusters and compute servers at the Institute of Mathematics, TU Berlin: http://www.math.tu-berlin.de/iuk/computeserver
- AMD clusters, Intel clusters, GPU clusters, IBM Cell processor cluster
- Batch system (German: System der Stapelverarbeitung) for submitting (parallel) jobs
Parallelisation
Processes
- A process is an instance of a program that is executing more or less autonomously on a physical processor.
- A program is parallel if, at any time during its execution, it comprises more than one process.
Shared-memory and distributed-memory programs differ in how processes communicate with each other:
- In shared-memory programs, communication is through variables that are shared by all the involved processes.
- In distributed-memory programs, processes communicate by sending and receiving messages.
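The message-passing style can be imitated on a single machine with two POSIX processes and a pipe. This is only a stand-in for a real message-passing library such as MPI: the function name `two_process_sum` is hypothetical, and after `fork` each process simply works on its own copy of the input. The child cannot write into the parent's variables, so its partial sum has to be sent as a message.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Sum x[0..n-1] with two processes (hypothetical helper).
   Returns 0 on success and stores the sum in *result, -1 on failure. */
int two_process_sum(const double *x, size_t n, double *result)
{
    assert(x != NULL && result != NULL);
    int fd[2];
    if (pipe(fd) != 0)
        return -1;

    size_t half = n / 2;
    pid_t pid = fork();
    if (pid < 0) {
        close(fd[0]);
        close(fd[1]);
        return -1;
    }

    if (pid == 0) {                          /* child: upper half */
        close(fd[0]);
        double partial = 0.0;
        for (size_t i = half; i < n; ++i)
            partial += x[i];
        /* "send": the only way the result can reach the parent */
        ssize_t w = write(fd[1], &partial, sizeof partial);
        close(fd[1]);
        _exit(w == (ssize_t)sizeof partial ? 0 : 1);
    }

    close(fd[1]);                            /* parent: lower half */
    double mine = 0.0;
    for (size_t i = 0; i < half; ++i)
        mine += x[i];

    double theirs = 0.0;
    ssize_t r = read(fd[0], &theirs, sizeof theirs);   /* "receive" */
    close(fd[0]);
    int status = 0;
    waitpid(pid, &status, 0);                /* reap the child process */
    if (r != (ssize_t)sizeof theirs)
        return -1;

    *result = mine + theirs;
    return 0;
}
```

In a shared-memory program the two workers would instead add their partial sums into a variable visible to both, which then needs protection against simultaneous updates.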
Sequential algorithms
- To estimate the run time of a sequential algorithm we count the number of floating-point operations (flops). Sometimes it may be better to count the number of memory accesses, but that does not change the picture in comparison with parallel algorithms.
$T_{\mathrm{seq}} = (\text{number of flops}) \times t_{\mathrm{flop}}$
Example: dot product of two $n$-vectors ($n$ multiplications and $n - 1$ additions)
$\vec{x} \cdot \vec{y} = \sum_{i=1}^{n} x_i y_i \;\Rightarrow\; T_{\mathrm{seq}} = (2n - 1)\, t_{\mathrm{flop}}$
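The flop count of the dot product can be verified by counting operations while computing. A minimal sketch with hypothetical function names; `t_flop` is whatever time per operation one assumes for the machine at hand.

```c
#include <assert.h>
#include <stddef.h>

/* Dot product of two n-vectors (n >= 1); *flops receives the operation count. */
double dot_counted(const double *x, const double *y, size_t n, size_t *flops)
{
    assert(x != NULL && y != NULL && flops != NULL && n >= 1);
    double sum = x[0] * y[0];      /* 1 multiplication */
    *flops = 1;
    for (size_t i = 1; i < n; ++i) {
        sum += x[i] * y[i];        /* 1 multiplication + 1 addition */
        *flops += 2;
    }
    return sum;                    /* total: 2n - 1 flops */
}

/* T_seq = (number of flops) * t_flop */
double t_seq(size_t flops, double t_flop)
{
    return (double)flops * t_flop;
}
```

For two 3-vectors the counter ends at 5, in agreement with 2n - 1.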
Parallel algorithm: what does execution time mean on a parallel processor?
- On a distributed arrangement of processors there is no common or synchronised clock.
- One may choose the maximal execution time of the program over the processors involved in the computation.
- If we measure time in parallel, we do this on one processor ($P_0$), the master processor.
- Processor $P_0$ usually reads the input data and outputs the computed results.
- For theoretical considerations we simply assume that all processors (say $p$) start at the same moment.
- The execution time $T(p)$ is then the period of time from this moment until the last of the $p$ processors finishes its computation.
- $T(1)$ is the execution time of the best sequential algorithm.
See the paper by David Bailey in Supercomputing Review (1991), which shows how timings can be manipulated in order to claim successful parallelisation of an algorithm.
Speedup – measures the gain in (wall-clock) time that is obtained by parallel execution of a program
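The two quantities can be sketched in a few lines of C. Here $T(p)$ is taken as the largest finishing time over all processors, as described above, and the speedup is assumed to be the usual ratio $S(p) = T(1)/T(p)$, which this excerpt does not spell out. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* T(p): time until the last of the p processors finishes, given each
   processor's finishing time measured from the common start. */
double parallel_time(const double *finish, size_t p)
{
    assert(finish != NULL && p >= 1);
    double t = finish[0];
    for (size_t i = 1; i < p; ++i)
        if (finish[i] > t)
            t = finish[i];
    return t;
}

/* S(p) = T(1) / T(p), the assumed standard definition of speedup. */
double speedup(double t1, double tp)
{
    assert(t1 > 0.0 && tp > 0.0);
    return t1 / tp;
}
```

With finishing times 2.0, 2.5, 1.5 and 2.0 on four processors, $T(4) = 2.5$; if the best sequential algorithm needs $T(1) = 10$, the speedup is 4, and it is the slowest processor that determines it.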