Concepts in parallel programming
Warwick RSE Parallel Computing, 17/12/2018
• Solving multiple problems at once
• Goes back before computers
• Rooms full of people working on problems
• Cryptanalysis
• Calculating tide tables
• We’re interested in getting computers to do it
• How?
Parallelism that we’re not talking about
• bit level parallelism
• Processors work on chunks of data at once rather than bit by bit
• Instruction level parallelism
• Processors can operate on more than one variable at a time
• NOT multicore
• A large chunk of code optimisation is about trying to improve instruction level parallelism
Parallelism that we are talking about
• Task level parallelism
• Split up a “task” into separate bits that the computer can work on separately
Embarrassing parallelism
• Embarrassing parallelism
• Tasks that are unrelated to each other can easily just be handed off to different processors
• Exploring parameters
• “Task farming”
Embarrassing parallelism
• Just run multiple copies of your program(s)
• Don’t run more copies than you have physical processors
• Simultaneous Multithreading (Hyper-threading) doesn't generally work well with research computing loads
• Can use scheduler systems to queue jobs to run when there is a free processor
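As a concrete picture of this “just run multiple copies” approach, here is a minimal sketch in C; the run_model function and the parameter table are invented for illustration. Each copy of the program is given a different task index on the command line, so a scheduler (or a simple shell loop) can queue one copy per free processor with no communication between them.

```c
/* Minimal sketch of an embarrassingly parallel "task farm":
 * each copy of the program gets its own task index, so many
 * copies can run independently at the same time.
 * run_model() and the parameter table are placeholders. */
#include <stdio.h>
#include <stdlib.h>

/* Placeholder for the real calculation */
static double run_model(double parameter)
{
    return parameter * parameter;
}

int main(int argc, char **argv)
{
    const double parameters[] = {0.1, 0.2, 0.5, 1.0, 2.0};
    const int ntasks = (int)(sizeof(parameters) / sizeof(parameters[0]));

    if (argc != 2) {
        fprintf(stderr, "Usage: %s <task index>\n", argv[0]);
        return 1;
    }

    int task = atoi(argv[1]);
    if (task < 0 || task >= ntasks) {
        fprintf(stderr, "Task index out of range\n");
        return 1;
    }

    double result = run_model(parameters[task]);

    /* Each task writes its own output file, so no communication
     * or synchronisation with the other copies is needed */
    char filename[64];
    snprintf(filename, sizeof(filename), "result_%d.txt", task);
    FILE *out = fopen(filename, "w");
    if (!out) return 1;
    fprintf(out, "%g\n", result);
    fclose(out);
    return 0;
}
```

A scheduler job array (or just running ./sweep 0, ./sweep 1, ... by hand) then covers the whole parameter set one free processor at a time.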
Tightly coupled parallelism
• Tightly coupled parallelism
• Your problem is split up into separate chunks, but each chunk needs information from other chunks
• Can be some other chunks
• Can be all other chunks
• You have to make the data available so that every chunk that needs it has access to it
Problems in parallel computing
• To allow multiple processors to work on a chunk, you have to do two things
• Make sure that the data is somewhere so that it can be used by a processor (communication)
• Make sure that data transfer is synchronised, so that you know the data is ready when you need it (synchronisation)
• Different models solve these problems in different ways
Shared Memory
[Diagram: several CPUs all connected to a single shared memory]
Shared Memory (SMP)
• Several processors all have direct access to the same memory
• Each processor has a work chunk, but the memory that it uses to hold all of the information is in the shared memory
• Communication is automatic
• Synchronisation is still a problem
Shared Memory (SMP)
• Surprisingly nasty problem
• Imagine the code i = i + 1
[Diagram: a single CPU reads i=0 from memory, adds one, and writes i=1 back]
Shared Memory (SMP)
• Now imagine doing it on two processors, each running i = i + 1
• The final result should be 2
[Diagram: both CPUs read i=0, both write i=1, so the shared value ends up as 1 rather than the expected 2]
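To make this failure concrete, here is a minimal sketch of the same race using two POSIX threads; the thread function and iteration count are chosen purely for illustration. Each thread runs i = i + 1 repeatedly, and because the read-modify-write can be interrupted between the read and the write, the final total is typically well below the expected value (the exact behaviour depends on the compiler and machine).

```c
/* Minimal sketch of the i = i + 1 race using POSIX threads.
 * Two threads each increment the shared counter many times; because
 * the read-modify-write is not atomic, increments are regularly lost. */
#include <pthread.h>
#include <stdio.h>

static long i = 0;

static void *increment(void *arg)
{
    (void)arg;
    for (long n = 0; n < 1000000; n++)
        i = i + 1;               /* read, add, write: can be interrupted */
    return NULL;
}

int main(void)
{
    pthread_t a, b;
    pthread_create(&a, NULL, increment, NULL);
    pthread_create(&b, NULL, increment, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    printf("Expected 2000000, got %ld\n", i);   /* usually less */
    return 0;
}
```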
Atomic operations
• Ancient Greek via Latin via French
• Indivisible
• An atomic operation cannot be interrupted
• Ultimately shared memory only works because there exist atomic operations
• Usually they are hidden away under a library
• A whole field of non-blocking algorithms does use them directly, though
• The previous example is a read-modify-write that needs to be atomic
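Continuing the sketch above, here is one way the race could be fixed with an atomic read-modify-write from C11’s <stdatomic.h>; this is illustrative only, and in real codes these operations are normally hidden inside OpenMP, MPI or a threading library rather than called directly.

```c
/* Minimal sketch of fixing the race with an atomic read-modify-write.
 * Each atomic_fetch_add is indivisible, so the two threads can no
 * longer lose each other's updates. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_long i = 0;

static void *increment(void *arg)
{
    (void)arg;
    for (long n = 0; n < 1000000; n++)
        atomic_fetch_add(&i, 1);   /* indivisible read-modify-write */
    return NULL;
}

int main(void)
{
    pthread_t a, b;
    pthread_create(&a, NULL, increment, NULL);
    pthread_create(&b, NULL, increment, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    printf("Expected 2000000, got %ld\n", atomic_load(&i));
    return 0;
}
```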
Shared Memory (SMP)
• Solution is to have each processor only work on things that are safe independently
• When something that is not safe has to happen, you enter a critical section, where things happen one after the other
• That’s the simplest case
• It can get a lot harder
• But, for most problems there are only a few bits like this
• Most of the time is spent with only one processor working on each bit of memory, so it’s quite easy
OpenMP
• Most common in research codes
• Directives that tell the compiler how to parallelise loops in code
• Automates some synchronisation
• Still have critical sections
• Can’t do everything
• Newer releases are much more powerful
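As an illustration of the directive style (a minimal sketch, not taken from the slides), the loop below is parallelised once with a reduction clause and once with an explicit critical section protecting the shared update; a flag such as gcc -fopenmp is needed for the pragmas to take effect.

```c
/* Minimal sketch of OpenMP: a directive tells the compiler to
 * parallelise the loop, and the reduction clause handles the
 * synchronisation of the shared sum automatically.  The second loop
 * shows the same thing done "by hand" with a critical section. */
#include <stdio.h>

int main(void)
{
    const int n = 1000000;
    double sum = 0.0;

    /* Work is split between threads; each keeps a private partial sum
     * which OpenMP combines safely at the end */
    #pragma omp parallel for reduction(+:sum)
    for (int j = 0; j < n; j++)
        sum += 1.0 / (j + 1.0);

    double sum2 = 0.0;
    #pragma omp parallel for
    for (int j = 0; j < n; j++) {
        double term = 1.0 / (j + 1.0);
        /* Only one thread at a time may update the shared variable */
        #pragma omp critical
        sum2 += term;
    }

    printf("%f %f\n", sum, sum2);
    return 0;
}
```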
Threads
• Explicitly tell code to run a given function in another thread
• Operating system will try to schedule threads on free processors so that all processors have the same load
• Common in non-academic codes, less so in academia
• Synchronisation is done through a Mutual Exclusion (Mutex) system (a specific form of critical section)
• Explicit function for a thread to get the Mutex and another to release it
• Only one thread can hold the Mutex at a time; others wait until the first has released it
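A minimal sketch of this pattern with POSIX threads (the worker function and its workload are invented for illustration): each thread computes a private partial result and only takes the Mutex for the short shared update.

```c
/* Minimal sketch of explicit threading with a mutex: each thread
 * locks the mutex before touching the shared total, so only one
 * thread can be inside the update at a time. */
#include <pthread.h>
#include <stdio.h>

static double total = 0.0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    int id = *(int *)arg;
    double partial = 0.0;

    /* Do this thread's share of the work privately ... */
    for (int j = 0; j < 1000000; j++)
        partial += id + 1.0;

    /* ... then take the mutex for the brief shared update */
    pthread_mutex_lock(&lock);
    total += partial;
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void)
{
    pthread_t threads[4];
    int ids[4] = {0, 1, 2, 3};

    for (int t = 0; t < 4; t++)
        pthread_create(&threads[t], NULL, worker, &ids[t]);
    for (int t = 0; t < 4; t++)
        pthread_join(threads[t], NULL);

    printf("total = %f\n", total);
    return 0;
}
```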
Distributed Memory
[Diagram: several nodes, each with its own CPUs and RAM, connected by a fabric]
• Processors all have their own memory
• Data can only flow between processors through a fabric
• Typically send and receive data explicitly
• Manual communication
• Synchronisation is tied directly to communication
• When a receive operation completes, the data is synchronised
Distributed Memory
• Have to manually work out the transfer of data between processors
• Send explicit messages between processors containing data
• This can be difficult in general but there are strategies
• The fabric is in general quite slow compared with memory access
• Minimise the amount of data transferred and the number of transfers requested
MPI
• The Message Passing Interface (MPI) is the most popular distributed memory library
• Just a library, no compiler involvement
• Includes routines for
• Sending
• Receiving
• Collective operations (summing over processors etc.)
• Parallel file I/O
• Others
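A minimal sketch of these routines in use (illustrative only, not from the slides): each rank holds its own data, rank 1 sends a value explicitly to rank 0, and a collective reduction sums a value over all ranks. Built and run with something like mpicc example.c && mpirun -np 4 ./a.out.

```c
/* Minimal sketch of MPI: each rank computes a partial value, rank 1
 * sends its value explicitly to rank 0, and a collective reduction
 * computes the total across all ranks. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Each rank works on its own chunk of the problem */
    double partial = rank + 1.0;

    /* Explicit point-to-point message: rank 1 sends to rank 0 */
    if (nprocs > 1) {
        if (rank == 1) {
            MPI_Send(&partial, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
        } else if (rank == 0) {
            double from_one;
            MPI_Recv(&from_one, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("Rank 0 received %f from rank 1\n", from_one);
        }
    }

    /* Collective operation: sum the partial values over all ranks */
    double total = 0.0;
    MPI_Reduce(&partial, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("Total over %d ranks = %f\n", nprocs, total);

    MPI_Finalize();
    return 0;
}
```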
Why use MPI?
• Performs very nearly as well as shared memory on a single computer
• Harder to program than OpenMP for cases where OpenMP is simple
• Comparable difficulty to writing threaded code
• Library itself works on the largest supercomputers in the world
• Your algorithm might not, but it’ll still go as far as it can with MPI
MPI on shared memory hardware
• MPI works fine on shared memory hardware
• At the user level it treats the hardware as if it were distributed memory
• Some shared memory features, covered in advanced materials
• For algorithms that work well on distributed memory, performance is comparable to OpenMP or pthreads
• Some algorithms map better to shared memory
• Can use hybrid MPI/OpenMP/pthreads code if you want the best of both worlds
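A minimal sketch of the hybrid idea (illustrative only): MPI handles communication between ranks while an OpenMP directive uses the cores within each rank; MPI_Init_thread is used instead of MPI_Init so the library knows threads are present.

```c
/* Minimal sketch of a hybrid approach: MPI between nodes, OpenMP
 * threads within each node.  MPI_THREAD_FUNNELED means only the
 * main thread makes MPI calls. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local = 0.0;

    /* Shared memory parallelism within this MPI rank */
    #pragma omp parallel for reduction(+:local)
    for (int j = 0; j < 1000000; j++)
        local += 1.0;

    /* Distributed memory parallelism between ranks */
    double total = 0.0;
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("Hybrid total = %f\n", total);

    MPI_Finalize();
    return 0;
}
```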
Alternatives?
• OpenSHMEM
• Unified Parallel C
• Chapel
• X10
• None of them have obvious advantages over MPI at the moment and many of them are poorly or patchily supported