Concepts in parallel programming


17/12/2018 Warwick RSE

• Solving multiple problems at once

• Goes back before computers

• Rooms full of people working on problems

• Cryptanalysis

• Calculating tide tables

• We’re interested in getting computers to do it

• How?

Parallelism that we’re not talking about

• Bit level parallelism

• Processors work on chunks of data at once rather than bit by bit

• Instruction level parallelism

• Processors can operate on more than one variable at a time

• NOT multicore

• A large part of code optimisation is trying to improve instruction level parallelism

Parallelism that we are talking about

• Task level parallelism

• Split up a “task” into separate bits that the computer can work on separately

Embarrassing parallelism

• Embarrassing parallelism

• Tasks that are unrelated to each other can easily just be handed off to different processors

• Exploring parameters

• “Task farming”

Embarrassing parallelism

• Just run multiple copies of your program(s)

• Don’t run more copies than you have physical processors

• Simultaneous Multithreading (Hyper-threading) doesn't generally work well with research computing loads

• Can use scheduler systems to queue jobs to run when there is a free processor

Tightly coupled parallelism

• Tightly coupled parallelism

• Your problem is split up into separate chunks, but each chunk needs information from other chunks

• Can be some other chunks

• Can be all other chunks

• You have to make the data available so that every chunk that needs it has access to it

Problems in parallel computing

• To allow multiple processors to work on a chunk, you have to do two things

• Make sure that the data is somewhere so that it can be used by a processor (communication)

• Make sure that data transfer is synchronised so that you know you can use it when you need to

• Different models solve these problems in different ways

Shared Memory (SMP)

[Diagram: several CPUs all connected directly to a single shared block of memory]

• Several processors all have direct access to the same memory

• Each processor has a work chunk, but the memory that it uses to hold all of the information is in the shared memory

• Communication is automatic

• Synchronisation is still a problem

Shared Memory (SMP)

• Surprisingly nasty problem

• Imagine the code i = i + 1

[Diagram: a single CPU reads i = 0 from memory, adds one, and writes i = 1 back]

Shared Memory (SMP)

• Now imagine doing it on two processors, each running i = i + 1

• The final result should be 2

[Diagram: two CPUs both read i = 0 from shared memory and both write back i = 1, so one increment is lost and the result is 1]
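A minimal sketch of this lost-update problem in C, using OpenMP threads to stand in for the two CPUs (the thread count and iteration count are only illustrative; compile with something like gcc -fopenmp):

```c
/* Two threads both run i = i + 1 on the same shared variable with no
   synchronisation, so updates can be lost. */
#include <stdio.h>
#include <omp.h>

int main(void)
{
    int i = 0;

    #pragma omp parallel num_threads(2)
    {
        for (int n = 0; n < 100000; n++) {
            i = i + 1;   /* unprotected read-modify-write: a data race */
        }
    }

    /* With 2 threads each adding 100000, the answer "should" be 200000,
       but lost updates usually make it smaller, and it varies run to run. */
    printf("i = %d\n", i);
    return 0;
}
```

Run it a few times and the printed total is usually well below 200000, and different each time, exactly because of the lost updates in the diagram.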

Atomic operations

• Ancient Greek via Latin via French

• Indivisible

• An atomic operation cannot be interrupted

• Ultimately shared memory only works because there exist atomic operations

• Usually they are hidden away under a library

• A whole field of non-blocking algorithms uses them directly, though

• The previous example can be fixed with an atomic read-modify-write, sketched below
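A sketch of the same shared counter updated with C11’s atomic_fetch_add, which performs the read-modify-write indivisibly (the OpenMP pragma is only a convenient way to start two threads; the counts are illustrative):

```c
/* The same i = i + 1 update, but done as an atomic read-modify-write,
   so no increments are lost. */
#include <stdio.h>
#include <stdatomic.h>

int main(void)
{
    atomic_int i = 0;

    #pragma omp parallel num_threads(2)
    {
        for (int n = 0; n < 100000; n++) {
            atomic_fetch_add(&i, 1);   /* indivisible read-modify-write */
        }
    }

    printf("i = %d\n", (int)i);   /* reliably 200000 */
    return 0;
}
```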

Shared Memory (SMP)

• The solution is to have each processor only work on things that are safe to do independently

• When something that isn’t safe to do independently has to happen, you enter a critical section, where things happen one after the other

• That’s the simplest case

• It can get a lot harder

• But, for most problems there are only a few bits like this

• Most of the time is spent with only one processor working on each bit of memory, so it’s quite easy
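A sketch of that pattern in C with OpenMP: each thread spends almost all of its time on private data, and the critical section is only entered once per thread to merge results (the data and the max-finding task are illustrative):

```c
/* Most work is done on private data; the critical section is rare and is
   only used to combine each thread's result into the shared answer. */
#include <stdio.h>
#include <omp.h>

#define N 1000000

int main(void)
{
    static double data[N];
    for (int n = 0; n < N; n++)
        data[n] = (n * 37) % N;     /* illustrative data */

    double best = -1.0;

    #pragma omp parallel
    {
        double my_best = -1.0;      /* private to each thread: no races */

        #pragma omp for
        for (int n = 0; n < N; n++)
            if (data[n] > my_best) my_best = data[n];

        #pragma omp critical        /* one thread at a time in here */
        {
            if (my_best > best) best = my_best;
        }
    }

    printf("largest value = %.0f\n", best);
    return 0;
}
```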

OpenMP

• Most common in research codes

• Directives that tell the compiler how to parallelise loops in code (see the sketch after this list)

• Automates some synchronisation

• Still have critical sections

• Can’t do everything

• Newer releases are much more powerful
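For example, a loop parallelised with a single directive might look like the sketch below (the arrays and their contents are illustrative); the reduction clause is one of the pieces of synchronisation OpenMP automates:

```c
/* The directive shares the loop iterations out across threads; the
   reduction clause gives each thread a private partial sum and combines
   them safely at the end. */
#include <stdio.h>
#include <omp.h>

#define N 1000000

int main(void)
{
    static double a[N], b[N];
    for (int n = 0; n < N; n++) {
        a[n] = n * 0.5;
        b[n] = n * 2.0;
    }

    double sum = 0.0;

    #pragma omp parallel for reduction(+:sum)
    for (int n = 0; n < N; n++)
        sum += a[n] * b[n];

    printf("dot product = %g\n", sum);
    return 0;
}
```

Remove the directive and the code is an ordinary serial loop, which is much of OpenMP’s appeal for existing research codes.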

Threads

• Explicitly tell the code to run a given function in another thread

• Operating system will try to schedule threads on free processors so that all processors have the same load

• Common in non-academic codes, less so in academia

• Synchronisation is done through a Mutual Exclusion (Mutex) system (a specific form of critical section)

• Explicit function for a thread to get the Mutex and another to release it

• Only one thread can hold the Mutex at a time; others wait until the first has released it
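A minimal sketch of that model using POSIX threads (the thread and iteration counts are illustrative; compile with -pthread):

```c
/* Each thread runs the same function, and the shared counter is only
   touched while holding the mutex. */
#include <stdio.h>
#include <pthread.h>

#define NTHREADS 4
#define NITER    100000

static long counter = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    (void)arg;
    for (int n = 0; n < NITER; n++) {
        pthread_mutex_lock(&lock);     /* only one thread past this point */
        counter++;
        pthread_mutex_unlock(&lock);   /* let the next waiting thread in */
    }
    return NULL;
}

int main(void)
{
    pthread_t threads[NTHREADS];

    for (int t = 0; t < NTHREADS; t++)
        pthread_create(&threads[t], NULL, worker, NULL);
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(threads[t], NULL);

    printf("counter = %ld (expected %d)\n", counter, NTHREADS * NITER);
    return 0;
}
```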

Distributed Memory

[Diagram: several nodes, each with its own CPUs and RAM, connected to each other only through a fabric]

• Processors all have their own memory

• Data can only flow between processors through a fabric

• Typically send and receive data explicitly

• Manual communication

• Synchronisation is tied directly to communication

• When a receive operation completes, the data is synchronised

Distributed Memory

• Have to manually work out the transfer of data between processors

• Send explicit messages between processors containing data

• This can be difficult in general but there are strategies

• Fabric is in general quite slow compared with memory access

• Minimise the amount of data transferred and the number of transfers requested

MPI

• The Message Passing Interface (MPI) is the most popular distributed memory library

• Just a library, no compiler involvement

• Includes routines for

• Sending

• Receiving

• Collective operations (summing over processors etc.)

• Parallel file I/O

• Others
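A minimal sketch of the first three of those in C (the values sent, each rank’s own number, are purely illustrative; build with mpicc and launch with mpirun or mpiexec):

```c
/* Point-to-point send and receive, plus a collective sum over all ranks. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* sending and receiving: every rank except 0 sends its number to rank 0 */
    if (rank != 0) {
        MPI_Send(&rank, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    } else {
        for (int src = 1; src < size; src++) {
            int value;
            MPI_Recv(&value, 1, MPI_INT, src, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("rank 0 received %d from rank %d\n", value, src);
        }
    }

    /* collective operation: sum a value over all processors onto rank 0 */
    double local = rank * 1.0, total = 0.0;
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("sum of ranks = %g\n", total);

    MPI_Finalize();
    return 0;
}
```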

Why use MPI?

• Performs very nearly as well as shared memory on a single computer

• Harder to program than OpenMP

• For cases where OpenMP is simple

• Otherwise comparable in difficulty to writing threaded code

• The library itself works on the largest computers in the world

• Your algorithm might not, but it’ll still go as far as it can with MPI

MPI on shared memory hardware

• MPI works fine on shared memory hardware

• At the user level it treats the machine as if it were distributed memory

• Some shared memory features, covered in advanced materials

• For algorithms that work well on distributed memory, performance is comparable to OpenMP or pthreads

• Some algorithms map better to shared memory

• Can use hybrid MPI/OpenMP/pthreads code if you want the best of both worlds (sketched below)
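A sketch of what the hybrid approach can look like: OpenMP threads do the work inside each MPI process, and MPI combines results between processes (the work in the loop is illustrative; build with something like mpicc -fopenmp):

```c
/* MPI between processes, OpenMP threads within each process.
   MPI_THREAD_FUNNELED declares that only the main thread makes MPI calls. */
#include <stdio.h>
#include <mpi.h>
#include <omp.h>

int main(int argc, char **argv)
{
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local = 0.0;

    /* shared-memory parallelism inside this MPI process */
    #pragma omp parallel for reduction(+:local)
    for (int n = 0; n < 1000000; n++)
        local += 1.0e-6;

    /* distributed-memory parallelism between processes */
    double total = 0.0;
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("total over all ranks and threads = %g\n", total);

    MPI_Finalize();
    return 0;
}
```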

Alternatives?

• OpenSHMEM

• Unified Parallel C

• Chapel

• X10

• None of them have obvious advantages over MPI at the moment and many of them are poorly or patchily supported