
Parallel computing: a brief discussion

Marzia Rivi

Astrophysics Group, Department of Physics & Astronomy

Research Programming Social – Nov 10, 2015

What is parallel computing?

Traditionally, software has been written for serial computation:
– To be run on a single computer having a single core.
– A problem is broken into a discrete series of instructions.
– Instructions are executed one after another.
– Only one instruction may execute at any moment in time.

Parallel computing is the simultaneous use of multiple compute resources to solve a computational problem:
– A problem is broken into discrete parts that can be solved concurrently.
– Instructions from each part are executed simultaneously on different cores.

Why parallel computing?

• Save time and/or money: in theory, the more resources we use, the shorter the time to finish, with potential cost savings.

• Solve larger problems:
– when problems are so large and complex that it is impossible to solve them on a single computer, e.g. "Grand Challenge" problems requiring PetaFLOPS and PetaBytes of computing resources (en.wikipedia.org/wiki/Grand_Challenge)
– many scientific problems can be tackled only by increasing performance
– highly complex or memory-greedy problems can be solved only with greater computing capabilities.

• Limits to serial computing: physical and practical reasons.

Who needs parallel computing?

Researcher 1:
• uses serial applications
• has a large number of independent jobs (e.g. processing video files, genome sequencing, parametric studies)

Researcher 2:
• developed serial code and validated it on small problems
• to publish, needs some parallel "big problem" results

Researcher 3:
• needs to run large parallel simulations fast (e.g. computational fluid dynamics, cosmology)

High Throughput Computing (HTC) (many computers):
• dynamic environment
• multiple independent small-to-medium jobs
• large amounts of processing over long time
• loosely connected resources (e.g. grid)

High Performance Computing (HPC) (single parallel computer):
• static environment
• single large-scale problems
• tightly coupled parallelism

How to parallelise an application?

Automatic parallelisation tools:
• support for vectorisation of operations (SSE and AVX) and thread parallelisation (OpenMP)
• specific tools exist, but of limited practical use
• all successful applications require intervention and steering

Parallel code development requires:
• programming languages (with support for parallel libraries)
• parallel programming standards (such as MPI and OpenMP)
• performance libraries/tools (both serial and parallel)

But, more than anything, it requires understanding:

• the algorithm (program, application, solver, etc.)
• the factors that influence parallel performance

How to parallelise an application?

• First, make it work!
– Analyse the key features of your parallel algorithms:
  • parallelism: the type of parallel work that can be carried out by parallel agents
  • granularity: the amount of computation carried out by the parallel agents
  • dependencies: algorithmic restrictions on how the parallel work can be scheduled
– Re-program the application to run in parallel and validate it.

• Then, make it work well!
– Pay attention to the key aspects of an optimal parallel execution:
  • data locality (computation vs. communication)
  • scalability (linear scaling is the holy grail: execution time is inversely proportional to the number of processors)
– Use profilers and performance tools to identify problems.

Task parallelism

Thread (or task) parallelism is based on parting the operations of the algorithm. If an algorithm is implemented with series of independent operations, these can be spread throughout the processors, thus realising program parallelisation.

Features:
• Different independent sets of instructions applied to a single set (or multiple sets) of data.
• May lead to work imbalance and may not scale well; performance is limited by the slowest task.
(Diagram: tasks 1–4 of a program assigned to CPUs 1–4.)

Data parallelism

Data parallelism means spreading the data to be computed across the processors. The processors execute merely the same operations, but on diverse data sets; this often means distribution of array elements across the computing units.

Features:
• The same sets of instructions applied to different parts of the data.
• Processors work only on data assigned to them and communicate when necessary.
• Easy to program, scales well.
• Inherent in program loops.
(Diagram: iterations of a loop over array[4] distributed across CPUs 1–4.)
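As a concrete illustration (a sketch added here, not from the original slides; the placeholder functions and array sizes are arbitrary), the two forms of parallelism expressed with OpenMP in C:

    #include <math.h>

    #define N 1000000
    double a[N], b[N];

    /* hypothetical independent parts of an algorithm */
    void task_fft(void)   { /* ... */ }
    void task_stats(void) { /* ... */ }

    void task_parallel_example(void)
    {
        /* Task parallelism: independent parts of the algorithm
           run concurrently on different threads. */
        #pragma omp parallel sections
        {
            #pragma omp section
            task_fft();
            #pragma omp section
            task_stats();
        }
    }

    void data_parallel_example(void)
    {
        int i;
        /* Data parallelism: the same operation applied to different
           array elements, iterations split across threads. */
        #pragma omp parallel for
        for (i = 0; i < N; i++)
            b[i] = sqrt(a[i]);
    }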

Granularity

• Coarse:
– parallelise large amounts of the total workload
– in general, the coarser the better
– minimal inter-processor communication
– can lead to imbalance

• Fine:
– parallelise small amounts of the total workload (e.g. inner loops)
– can lead to unacceptable parallel overheads (e.g. communication)
(A sketch contrasting the two follows below.)
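The following sketch (an illustration added here, not from the slides; work() is a hypothetical per-element computation) shows the same loop nest parallelised at coarse grain (outer loop) and at fine grain (inner loop) with OpenMP:

    #define NI 1000
    #define NJ 1000
    double c[NI][NJ];

    static double work(int i, int j)   /* hypothetical per-element computation */
    {
        return (double)i * j;
    }

    void coarse_grained(void)
    {
        int i, j;
        /* Coarse grain: each thread gets whole rows (outer iterations);
           little overhead, but possible imbalance if rows cost unequally. */
        #pragma omp parallel for private(j)
        for (i = 0; i < NI; i++)
            for (j = 0; j < NJ; j++)
                c[i][j] = work(i, j);
    }

    void fine_grained(void)
    {
        int i, j;
        /* Fine grain: only the short inner loop is parallelised; the parallel
           region is entered NI times, so overheads can dominate. */
        for (i = 0; i < NI; i++) {
            #pragma omp parallel for
            for (j = 0; j < NJ; j++)
                c[i][j] = work(i, j);
        }
    }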

Dependencies

Data dependence dictates the order of operations, imposes limits on parallelism and requires synchronisation. In the following example the outer i index loop can be parallelised, while the inner loop carries a dependence through the offset k (see the sketch below).
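A minimal sketch of such a loop nest, in C; the array names, bounds and 2-D layout are assumptions consistent with the surrounding discussion:

    #include <assert.h>

    #define N 100
    #define M 60
    double a[N + 1][N + 1], b[N + 1][N + 1];

    void dependence_example(int k)
    {
        int i, j;
        assert(k >= 0 && k <= M);   /* keeps j-k within array bounds in this sketch */

        /* The i loop is parallelisable: each iteration works on its own row. */
        #pragma omp parallel for private(j)
        for (i = 1; i <= N; i++) {
            /* The j loop carries a dependence of distance k: a[i][j] reads
               a[i][j-k], which may have been written k iterations earlier
               in the same loop. If k > N-M (here 40) or k < M-N, every
               a[i][j-k] read lies outside the range written by this loop,
               so the j loop could be parallelised as well. */
            for (j = M; j <= N; j++)
                a[i][j] = a[i][j - k] + b[i][j];
        }
    }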

If k > N-M or k < M-N, parallelisation is straightforward: the offset reference always falls outside the loop's index range, so no iteration depends on another.

Concepts and Terminology

• Shared Memory = a computer architecture where all processors have direct access to common physical memory. Also, it describes a model where parallel tasks can directly address and access the same logical memory locations.
• Distributed Memory = network-based memory access for physical memory that is not common. As a programming model, tasks can only logically "see" local machine memory and must use communications to access memory on other machines where other tasks are executing.

Parallel computing models

Shared memory:
• Each processor has direct access to common physical memory (e.g. multi-processors, cluster computer nodes).
• Agent of parallelism: the thread (program = collection of threads).
• Threads exchange information implicitly by reading/writing shared variables.
• Programming standard: OpenMP.

Distributed memory:
• Local processor memory is invisible to all other processors; network-based memory access (e.g. clusters).
• Agent of parallelism: the process (program = collection of processes).
• Exchanging information between processes requires communications.
• Programming standard: MPI.

OpenMP
http://www.openmp.org

API instructing the compiler what can be done in parallel (high-level programming). Consisting of:
– compiler directives
– runtime library functions
– environment variables
(Diagram: application and user → compiler directives and environment variables → OpenMP runtime library → threads in the OS.)

• Supported by most compilers for Fortran and C/C++.
• Usable as serial code (threading ignored by serial compilation).
• By design, suited for data parallelism.

OpenMP execution model

Threads are generated automatically at runtime and scheduled by the OS.
• Thread creation/destruction has an overhead.
• Minimise the number of times parallel regions are entered/exited (see the sketch below).
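For instance (an illustration added here, with arbitrary array names), a single parallel region can enclose several work-sharing loops, so the region is entered only once:

    #define N 100000
    double a[N], b[N];

    void two_regions(void)
    {
        int i;
        /* The parallel region is entered and exited twice. */
        #pragma omp parallel for
        for (i = 0; i < N; i++) a[i] = i;
        #pragma omp parallel for
        for (i = 0; i < N; i++) b[i] = 2.0 * a[i];
    }

    void one_region(void)
    {
        int i;
        /* The parallel region is entered once; both work-sharing loops
           reuse the same thread team (implicit barrier between them). */
        #pragma omp parallel private(i)
        {
            #pragma omp for
            for (i = 0; i < N; i++) a[i] = i;
            #pragma omp for
            for (i = 0; i < N; i++) b[i] = 2.0 * a[i];
        }
    }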

Execution alternates between serial parts, executed by the master thread only, and parallel regions, executed by all threads.

Fortran syntax:
!$OMP PARALLEL
  ! parallel region (all threads)
!$OMP END PARALLEL

C syntax:
#pragma omp parallel
{
  /* parallel region (all threads) */
}

OpenMP - example

Objective: vectorise a loop, applying the sin operation to the vector x in parallel.
Idea: instruct the compiler on what to parallelise (the loop) and how (private and shared data), and let it do the hard work.

In C:

#pragma omp parallel for shared(x, y, J) private(j)
for (j = 0; j < J; j++) {
    y[j] = sin(x[j]);
}

In Fortran:

!$omp parallel do shared(x, y, J) private(j)
do j = 1, J
    y(j) = sin(x(j))
end do
!$omp end parallel do

The number of threads is set by the environment variable OMP_NUM_THREADS or programmed using the runtime library function omp_set_num_threads().
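Put together as a complete program (a minimal sketch; the vector length, initialisation and printout are illustrative additions):

    #include <math.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        const int J = 1000000;                 /* illustrative vector length */
        double *x = malloc(J * sizeof(double));
        double *y = malloc(J * sizeof(double));
        int j;

        for (j = 0; j < J; j++)                /* initialise the input vector */
            x[j] = (double)j / J;

        #pragma omp parallel for shared(x, y) private(j)
        for (j = 0; j < J; j++)                /* each thread handles a chunk */
            y[j] = sin(x[j]);

        printf("y[%d] = %f\n", J / 2, y[J / 2]);
        free(x);
        free(y);
        return 0;
    }

Compile, for instance, with gcc -fopenmp (the flag varies between compilers) and link the math library with -lm.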

MPI - Message Passing Interface
http://www.mpi-forum.org/

MPI is a specification for a Distributed-Memory API designed by a committee for Fortran, C and C++ languages.

• Two versions:
– MPI 1.0, quickly and universally adopted (most used and useful)
– MPI 2.0, a superset of MPI 1.0 (adding parallel I/O, dynamic process management and direct remote memory operations), but not so popular.

• Many implementations:
– open source (MPICH, MVAPICH, OpenMPI)
– vendor (HP/Platform, SGI MPT).

MPI implementation components

• Libraries covering the functionality specified by the standard.
• Header files, specifying interfaces, constants etc.:
– C/C++: mpi.h
– Fortran: mpif.h

• Tools to compile and link MPI applications (wrappers around serial compilers):
– Fortran: mpif77, mpif90
– C: mpicc
– C++: mpicxx, mpiCC

• An MPI application launcher (mapping processes to CPUs):
mpirun -np <number of processes> <executable>
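For example (hypothetical file names), the hello-world program shown later can be compiled and launched on 4 processes as:

    mpicc -o hello hello_mpi.c
    mpirun -np 4 ./hello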

MPI - overview

• Processes (MPI tasks) are mapped to processors (CPU cores).
• Start/stop mechanisms:
– MPI_Init() to initialise processes
– MPI_Finalize() to finalise and clean up processes
• Communicators:
– a communicator is a collection (network) of processes
– the default is MPI_COMM_WORLD, which is always present and includes all processes requested by mpirun
– only processes included in a communicator can communicate
• Identification mechanism:
– process id: MPI_Comm_rank()
– communicator size (number of processes): MPI_Comm_size()

MPI - communication

Inter-process communication (the cornerstone of MPI programming):
• one-to-one communication (send, receive)
• one-to-many communication (broadcast, scatter)
• many-to-one communication (gather)
• many-to-many communication (allgather)
• reduction, e.g. global sums, global max/min (a special many-to-one)
• process synchronisation (barriers)
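As an illustration of the collective operations (a sketch added here, not from the slides; variable names are arbitrary), each process contributes a partial value and a reduction combines them on rank 0:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int rank, size;
        double partial, total;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        partial = (double)rank;      /* each process' local contribution */

        /* many-to-one reduction: sum the partial values onto rank 0 */
        MPI_Reduce(&partial, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        /* one-to-many: rank 0 broadcasts the result back to everyone */
        MPI_Bcast(&total, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("sum over %d processes = %f\n", size, total);

        MPI_Finalize();
        return 0;
    }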

(Diagrams: domain decomposition; master-slave.)

MPI - example

#include "mpi.h" #include #include int main( int argc, char *argv[]) { int my_rank, numprocs; char message[100]; int dest, tag, source; MPI_Status status; Hello world! (output) MPI_Init(&argc,&argv); MPI_Comm_rank(MPI_COMM_WORLD,&my_rank); MPI_Comm_size(MPI_COMM_WORLD,&numprocs); If the program is executed with two processes the output is: if (my_rank != 0) { sprintf(message,"Greetings from process %d !\0",my_rank); dest = 0; tag = 0; Greetings from process 1! MPI_Send(message, sizeof(message), MPI_CHAR, dest, tag, MPI_COMM_WORLD); } else { for (source = 1; source <= (numprocs-1); source++) { If the program is executed with four processes the output is: MPI_Recv(message, 100, MPI_CHAR, source, tag, MPI_COMM_WORLD, &status) ; printf("%s\n",message); } Greetings from process 1! } MPI_Finalize(); Greetings from process 2! } Greetings from process 3!

Distributed vs shared memory

Application feature: shared memory / OpenMP vs distributed memory / MPI

Parallelisation:
• shared memory / OpenMP: easy, incremental (parallelising small parts of the code at a time); mostly parallelise loops
• distributed memory / MPI: relatively difficult (tends to require an all-or-nothing approach); can be used in a wider range of contexts

Scaling (hardware view):
• shared memory / OpenMP: both expensive (few vendors provide scalable solutions) and cheap (multi-core workstations)
• distributed memory / MPI: relatively cheap (most vendors provide systems with 1000's of cores); runs on both shared and distributed memory systems

Scaling (programming view):
• shared memory / OpenMP: small/simple programs are easy and fast to implement
• distributed memory / MPI: even small/simple programs involve large programming complexity

Maintainability:
• shared memory / OpenMP: code is relatively easy to understand and maintain
• distributed memory / MPI: code is relatively difficult to understand

Readability:
• shared memory / OpenMP: small increase in code size, readable code
• distributed memory / MPI: tends to add a lot of extra coding for message handling; code readable with difficulty

Compiler and debugger support:
• shared memory / OpenMP: requires special compiler support; debuggers are extensions of serial ones
• distributed memory / MPI: no special compiler support (just libraries); specialised debuggers

Distributed vs shared memory paradigm

Which problems are suited to Distributed Memory Processing?
• Embarrassingly parallel problems (independent tasks), e.g. Monte Carlo methods.
• Computation-bound problems (heavy local computation with little data exchange between processes):
– models with localised data, e.g. PDEs solved using finite elements/volumes (CFD, CHMD, etc.)
– other models with distributed data: molecular dynamics, etc.

Which problems are suited to Shared Memory Processing?
• Communication-bound problems (much data shared between threads):
– models with non-local data, e.g. Newtonian particle dynamics
– Fourier transforms, convolutions.

Accelerators - motivation

Moore's Law (1965):
• the number of transistors in CPU designs doubles roughly every 2 years
• backed by clock speed increases, this has correlated with exponentially increasing CPU performance for at least 40 years.

This meant the same old (single-threaded) code just runs faster on newer hardware. No more! While the "law" still holds, the clock frequency of general-purpose CPUs was "frozen" around 2004 at 2.5-3.0 GHz and design has gone multicore.

Performance improvements are now coming from the increase in the number of cores on a processor.

Accelerators – different philosophies

The design of accelerators is optimized for numerically intensive computation through massive fine-grained parallelism:
• many cores (several hundreds)
• lightweight threads and high execution throughput
• a large number of threads to overcome long-latency memory accesses.

The design of CPUs is optimized for sequential code and coarse-grained parallelism:
• multi-core
• sophisticated control logic unit
• large cache memories to reduce access latencies.

Accelerators - examples

Intel Xeon Phi 5110 (MIC):
• 60 cores
• 8 GB GDDR5 memory
• 320 GB/s memory bandwidth
• 240 HW threads (4 per core)
• 512-bit wide SIMD capability

Tesla K20X (GPU):
• 2688 cores
• 6 GB GDDR5 memory
• 250 GB/s memory bandwidth
• 3.95 Tflops/s of peak SP

Applications should use both the CPUs and the accelerator, where the latter is exploited as a coprocessor:
• Serial sections of the code are performed by the CPU (host).
• The parallel ones (that exhibit a rich amount of data parallelism) are performed by the accelerator (device).
• Host and device have separate memory spaces: data need to be transferred in a manner similar to "one-sided" communication.

Several languages/APIs:
• GPU: CUDA, pyCUDA, OpenCL, OpenACC
• Xeon Phi: OpenMP, Intel TBB

GPU example – CUDA

DEVICE

One thread per iteration!
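A kernel consistent with the host call below (a sketch; the element-wise increment body is an assumption, with one thread per array element):

    __global__ void increment_gpu(float *a, float b, int N)
    {
        /* global thread index: one thread per iteration */
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < N)                /* guard against excess threads */
            a[idx] = a[idx] + b;
    }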

HOST

void main()
{
    ....
    cudaMalloc((void **)&da, N * sizeof(float));                  /* allocate device array */
    cudaMemcpy(da, a, N * sizeof(float), cudaMemcpyHostToDevice);
    increment_gpu<<<4, 4>>>(da, b, 16);                           /* 4 blocks x 4 threads */
    cudaMemcpy(a, da, N * sizeof(float), cudaMemcpyDeviceToHost);
    ....
}

OpenACC
http://www.openacc-standard.org/

A directive-based API for GPUs (the counterpart of OpenMP for CPU parallel programming).

• Supported by Cray and PGI (slightly different implementations, but converging) and soon GCC.
• "Easier" code development – supports incremental development.
• Possible performance loss – about 20% compared to CUDA.
• Can be "combined" with CUDA code.
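For example (an illustrative loop, not from the slides), offloading a vector operation needs only a directive; data clauses manage the host-device transfers:

    #include <math.h>

    void acc_sin(int n, const double *restrict x, double *restrict y)
    {
        int i;
        /* copyin: transfer x to the device; copyout: bring y back to the host.
           The loop iterations are distributed over the accelerator's threads. */
        #pragma acc parallel loop copyin(x[0:n]) copyout(y[0:n])
        for (i = 0; i < n; i++)
            y[i] = sin(x[i]);
    }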

Accelerators programming

• Accelerators are suitable only for certain (highly data-parallel) algorithms and require low-level programming (architecture bound) to achieve good performance.
• They can effectively help in reducing the time to solution; however, the effectiveness is strongly dependent on the algorithm and the amount of computation.

• The effort to get codes running efficiently on accelerators is, in general, big, irrespective of the programming model adopted. However, portability and maintainability of the code push toward directive-based approaches (at the expense of some performance).

• All the (suitable) computationally demanding parts of the code should be ported. Data transfer should be minimised or hidden. Host-device overlap is hard to achieve.

Hybrid MPI+OpenMP programming

Automatic multi-threading is available in almost all main processor families. However, care must be taken in using it: it can, in some cases, slow down applications.


Hybrid parallel programming

Hybrid programming (MPI+OpenMP, MPI+CUDA) is a growing trend: several SMP (symmetric multiprocessor) nodes connected by an interconnection network, with (at least) one MPI process and several OpenMP threads mapped onto each node.

• Takes the positives of all models.
• Suits the memory hierarchy on "fat nodes" (nodes with large memory and many cores).
• Scope for better scaling than pure MPI (less inter-node communication) on modern clusters.
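A minimal sketch of the hybrid pattern (illustrative only): one MPI process per node, each opening an OpenMP parallel loop over its local portion of the data:

    #include <mpi.h>
    #include <stdio.h>

    #define NLOCAL 1000000          /* illustrative per-process array size */
    static double x[NLOCAL];

    int main(int argc, char *argv[])
    {
        int rank, provided, i;
        double local = 0.0, global = 0.0;

        /* Request thread support: FUNNELED = only the master thread calls MPI. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        for (i = 0; i < NLOCAL; i++)
            x[i] = rank + 1.0;

        /* OpenMP threads work on this process' local data. */
        #pragma omp parallel for reduction(+:local)
        for (i = 0; i < NLOCAL; i++)
            local += x[i];

        /* MPI combines the per-process results across nodes. */
        MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("global sum = %f\n", global);

        MPI_Finalize();
        return 0;
    }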

Questions?