Concepts in parallel programming


17/12/2018 Warwick RSE

• Solving multiple problems at once

• Goes back before computers

• Rooms full of people working on problems

• Cryptanalysis

• Calculating tide tables

• We’re interested in getting computers to do it

• How?

Parallelism that we’re not talking about

• Bit level parallelism

• Processors work on chunks of data at once rather than bit by bit

• Instruction level parallelism

• Processors can operate on more than one variable at a time

• NOT multicore

• A large part of code optimisation is trying to improve instruction level parallelism

Parallelism that we are talking about

• Task level parallelism

• Split up a “task” into separate bits that the computer can work on separately

Embarrassing parallelism

• Embarrassing parallelism

• Tasks that are unrelated to each other can easily just be handed off to different processors

• Exploring parameters

• “Task farming”

Embarrassing parallelism

• Just run multiple copies of your program(s)

• Don’t run more copies than you have physical processors

• Simultaneous Multithreading (Hyper-threading) doesn't generally work well with research computing loads

• Can use scheduler systems to queue jobs to run when there is a free processor

Tightly coupled parallelism

• Tightly coupled parallelism

• Your problem is split up into separate chunks, but each chunk needs information from other chunks

• Can be some other chunks

• Can be all other chunks

• You have to make the data available so that every chunk that needs it has access to it

Problems in parallel computing

• To allow multiple processors to work on a chunk, you have to do two things

• Make sure that the data is somewhere so that it can be used by a processor (communication)

• Make sure that data transfer is synchronised so that you know you can use it when you need to

• Different models solve these problems in different ways

Shared Memory (SMP)

[Diagram: several CPUs all connected directly to a single shared block of memory]

• Several processors all have direct access to the same memory

• Each processor has a work chunk, but the memory that it uses to hold all of the information is in the shared memory

• Communication is automatic

• Synchronisation is still a problem

Shared Memory (SMP)

• Surprisingly nasty problem

• Imagine the code i = i + 1

[Diagram: a single CPU reads i = 0 from memory, adds one, and writes i = 1 back]

Shared Memory (SMP)

• Now imagine doing it on two processors, each running i = i + 1

• The final result should be 2

[Diagram: two CPUs both read i = 0 from shared memory and both write back i = 1, so one increment is lost and the result is 1]
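A minimal sketch of this lost-update problem in C, using OpenMP threads to stand in for the two CPUs (the thread count and iteration count are only illustrative; compile with something like gcc -fopenmp):

```c
/* Two threads both run i = i + 1 on the same shared variable with no
   synchronisation, so updates can be lost. */
#include <stdio.h>
#include <omp.h>

int main(void)
{
    int i = 0;

    #pragma omp parallel num_threads(2)
    {
        for (int n = 0; n < 100000; n++) {
            i = i + 1;   /* unprotected read-modify-write: a data race */
        }
    }

    /* With 2 threads each adding 100000, the answer "should" be 200000,
       but lost updates usually make it smaller, and it varies run to run. */
    printf("i = %d\n", i);
    return 0;
}
```

Run it a few times and the printed total is usually well below 200000, and different each time, exactly because of the lost updates in the diagram.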

Atomic operations

• Ancient Greek via Latin via French

• Indivisible

• An atomic operation cannot be interrupted

• Ultimately shared memory only works because there exist atomic operations

• Usually they are hidden away under a library

• A whole field of non-blocking algorithms uses them directly, though

• The previous example can be fixed with an atomic read-modify-write, sketched below
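A sketch of the same shared counter updated with C11’s atomic_fetch_add, which performs the read-modify-write indivisibly (the OpenMP pragma is only a convenient way to start two threads; the counts are illustrative):

```c
/* The same i = i + 1 update, but done as an atomic read-modify-write,
   so no increments are lost. */
#include <stdio.h>
#include <stdatomic.h>

int main(void)
{
    atomic_int i = 0;

    #pragma omp parallel num_threads(2)
    {
        for (int n = 0; n < 100000; n++) {
            atomic_fetch_add(&i, 1);   /* indivisible read-modify-write */
        }
    }

    printf("i = %d\n", (int)i);   /* reliably 200000 */
    return 0;
}
```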

Shared Memory (SMP)

• The solution is to have each processor only work on things that are safe to do independently

• When something that isn’t safe to do independently has to happen, you enter a critical section, where things happen one after the other

• That’s the simplest case

• It can get a lot harder

• But, for most problems there are only a few bits like this

• Most of the time is spent with only one processor working on each bit of memory, so it’s quite easy
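A sketch of that pattern in C with OpenMP: each thread spends almost all of its time on private data, and the critical section is only entered once per thread to merge results (the data and the max-finding task are illustrative):

```c
/* Most work is done on private data; the critical section is rare and is
   only used to combine each thread's result into the shared answer. */
#include <stdio.h>
#include <omp.h>

#define N 1000000

int main(void)
{
    static double data[N];
    for (int n = 0; n < N; n++)
        data[n] = (n * 37) % N;     /* illustrative data */

    double best = -1.0;

    #pragma omp parallel
    {
        double my_best = -1.0;      /* private to each thread: no races */

        #pragma omp for
        for (int n = 0; n < N; n++)
            if (data[n] > my_best) my_best = data[n];

        #pragma omp critical        /* one thread at a time in here */
        {
            if (my_best > best) best = my_best;
        }
    }

    printf("largest value = %.0f\n", best);
    return 0;
}
```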

OpenMP

• Most common in research codes

• Directives that tell the compiler how to parallelise loops in code (see the sketch after this list)

• Automates some synchronisation

• Still have critical sections

• Can’t do everything

• Newer releases are much more powerful
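For example, a loop parallelised with a single directive might look like the sketch below (the arrays and their contents are illustrative); the reduction clause is one of the pieces of synchronisation OpenMP automates:

```c
/* The directive shares the loop iterations out across threads; the
   reduction clause gives each thread a private partial sum and combines
   them safely at the end. */
#include <stdio.h>
#include <omp.h>

#define N 1000000

int main(void)
{
    static double a[N], b[N];
    for (int n = 0; n < N; n++) {
        a[n] = n * 0.5;
        b[n] = n * 2.0;
    }

    double sum = 0.0;

    #pragma omp parallel for reduction(+:sum)
    for (int n = 0; n < N; n++)
        sum += a[n] * b[n];

    printf("dot product = %g\n", sum);
    return 0;
}
```

Remove the directive and the code is an ordinary serial loop, which is much of OpenMP’s appeal for existing research codes.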

Threads

• Explicitly tell the code to run a given function in another thread

• Operating system will try to schedule threads on free processors so that all processors have the same load

• Common in non-academic codes, less so in academia

• Synchronisation is done through a Mutual Exclusion (Mutex) system (a specific form of critical section)

• Explicit function for a thread to get the Mutex and another to release it

• Only one thread can hold the Mutex at a time; others wait until the first has released it
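A minimal sketch of that model using POSIX threads (the thread and iteration counts are illustrative; compile with -pthread):

```c
/* Each thread runs the same function, and the shared counter is only
   touched while holding the mutex. */
#include <stdio.h>
#include <pthread.h>

#define NTHREADS 4
#define NITER    100000

static long counter = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    (void)arg;
    for (int n = 0; n < NITER; n++) {
        pthread_mutex_lock(&lock);     /* only one thread past this point */
        counter++;
        pthread_mutex_unlock(&lock);   /* let the next waiting thread in */
    }
    return NULL;
}

int main(void)
{
    pthread_t threads[NTHREADS];

    for (int t = 0; t < NTHREADS; t++)
        pthread_create(&threads[t], NULL, worker, NULL);
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(threads[t], NULL);

    printf("counter = %ld (expected %d)\n", counter, NTHREADS * NITER);
    return 0;
}
```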

Distributed Memory

[Diagram: several nodes, each with its own CPUs and RAM, connected to each other only through a fabric]

• Processors all have their own memory

• Data can only flow between processors through a fabric

• Typically send and receive data explicitly

• Manual communication

• Synchronisation is tied directly to communication

• When a receive operation completes, the data is synchronised

Distributed Memory

• Have to manually work out the transfer of data between processors

• Send explicit messages between processors containing data

• This can be difficult in general but there are strategies

• Fabric is in general quite slow compared with memory access

• Minimise the amount of data transferred and the number of transfers requested

MPI

• The Message Passing Interface (MPI) is the most popular distributed memory library

• Just a library, no compiler involvement

• Includes routines for

• Sending

• Receiving

• Collective operations (summing over processors etc.)

• Parallel file I/O

• Others
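A minimal sketch of the first three of those in C (the values sent, each rank’s own number, are purely illustrative; build with mpicc and launch with mpirun or mpiexec):

```c
/* Point-to-point send and receive, plus a collective sum over all ranks. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* sending and receiving: every rank except 0 sends its number to rank 0 */
    if (rank != 0) {
        MPI_Send(&rank, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    } else {
        for (int src = 1; src < size; src++) {
            int value;
            MPI_Recv(&value, 1, MPI_INT, src, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("rank 0 received %d from rank %d\n", value, src);
        }
    }

    /* collective operation: sum a value over all processors onto rank 0 */
    double local = rank * 1.0, total = 0.0;
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("sum of ranks = %g\n", total);

    MPI_Finalize();
    return 0;
}
```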

Why use MPI?

• Performs very nearly as well as shared memory on a single computer

• Harder to program than OpenMP

• For cases where OpenMP is simple

• Otherwise comparable in difficulty to writing threaded code

• The library itself works on the largest computers in the world

• Your algorithm might not, but it’ll still go as far as it can with MPI

MPI on shared memory hardware

• MPI works fine on shared memory hardware

• At the user level it treats the machine as if it were distributed memory

• Some shared memory features, covered in advanced materials

• For algorithms that work well on distributed memory, performance is comparable to OpenMP or pthreads

• Some algorithms map better to shared memory

• Can use hybrid MPI/OpenMP/pthreads code if you want the best of both worlds (sketched below)
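A sketch of what the hybrid approach can look like: OpenMP threads do the work inside each MPI process, and MPI combines results between processes (the work in the loop is illustrative; build with something like mpicc -fopenmp):

```c
/* MPI between processes, OpenMP threads within each process.
   MPI_THREAD_FUNNELED declares that only the main thread makes MPI calls. */
#include <stdio.h>
#include <mpi.h>
#include <omp.h>

int main(int argc, char **argv)
{
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local = 0.0;

    /* shared-memory parallelism inside this MPI process */
    #pragma omp parallel for reduction(+:local)
    for (int n = 0; n < 1000000; n++)
        local += 1.0e-6;

    /* distributed-memory parallelism between processes */
    double total = 0.0;
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("total over all ranks and threads = %g\n", total);

    MPI_Finalize();
    return 0;
}
```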

Alternatives?

• OpenSHMEM

• Unified Parallel C

• Chapel

• X10

• None of them have obvious advantages over MPI at the moment and many of them are poorly or patchily supported