Task Level Parallelism

The topic of this chapter is thread-level parallelism. While thread-level parallelism falls within the textbook's classification alongside ILP and data parallelism, it also belongs to the broader topic of parallel and distributed computing. In the next set of slides, I will attempt to place you in the context of this broader computation space. Of course, a proper treatment of parallel or distributed computing is worthy of an entire semester (or two) of study; I can only give you a brief exposure to the topic. The text highlighted in green in these slides contains external hyperlinks.

Task Level Parallelism

• Task Level Parallelism: organizing a program or computing solution into a set of processes/tasks/threads for simultaneous execution.
• Conventionally one might think of task level parallelism (and the MIMD processing paradigm) as being used for a single program or operation; however, request level parallelism (e.g., serving HTTP requests for a website) is also generally addressed and studied by hardware solutions in this same space.
• Request level processing and related problems with independent transactions (e.g., bitcoin mining, web page requests) fall into a class of problems called embarrassingly parallel: "embarrassingly" because they have virtually no synchronization requirements and thus show nearly linear speedup (measured by number of compute nodes). A small code sketch of this case appears below.

Parallel, Distributed, & Concurrent Computing

• Parallel Computing: a form of computation in which many calculations are carried out simultaneously, operating on the principle that large problems can often be divided into discrete parts that can be solved concurrently ("in parallel"). There are several different forms of parallel computing: bit-level, instruction-level, data, and task parallelism.
• Distributed Computing/Distributed Systems: a system in which components located on networked computers communicate and coordinate their actions by passing messages. The components interact with each other in order to achieve a common goal.
• Concurrent Computing: a form of computing in which programs are designed as collections of interacting computational processes that may be executed in parallel. Concurrent programs (processes or threads) can be executed (i) on a single processor by time-slicing, or (ii) in parallel by assigning each computational process to one of a set of processors. The main challenges in designing concurrent programs are ensuring the correct sequencing of the interactions or communications between different computational executions, and coordinating access to resources that are shared among executions.

Parallel, Distributed, & Concurrent Computing

The terms concurrent computing, parallel computing, and distributed computing have a lot of overlap, and no clear distinction exists between them. The same system may be characterized both as "parallel" and "distributed"; the processors in a typical distributed system run concurrently in parallel. Parallel computing may be seen as a particularly tightly coupled form of distributed computing, and distributed computing may be seen as a loosely coupled form of parallel computing. Nevertheless, it is possible to roughly classify concurrent systems as "parallel" or "distributed" using the following criteria:

• In parallel computing, all processors may have access to a shared memory to exchange information between processors.
• In distributed computing, each processor has its own private memory (distributed memory). Information is exchanged by passing messages between the processors.
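As a concrete (if simplified) illustration of task-level parallelism on a shared-memory multiprocessor, the sketch below spawns a handful of POSIX threads that each service their own independent "requests" — the embarrassingly parallel case described above, since the tasks share no data and need no synchronization beyond the final join. This is only a minimal sketch: the worker count, the handle_request function, and the per-worker request count are invented for the example and are not from the slides.

    #include <pthread.h>
    #include <stdio.h>

    #define NUM_WORKERS 4          /* hypothetical number of cores/compute nodes */
    #define REQUESTS_PER_WORKER 3  /* hypothetical per-worker request count */

    /* Stand-in for an independent transaction (an HTTP request, a hash attempt, ...). */
    static long handle_request(int id) {
        long sum = 0;
        for (int i = 0; i < 1000000; i++)   /* placeholder work */
            sum += (id + i) % 7;
        return sum;
    }

    /* Each worker processes its own batch of requests; no shared state, no locks. */
    static void *worker(void *arg) {
        int worker_id = *(int *)arg;
        for (int r = 0; r < REQUESTS_PER_WORKER; r++) {
            long result = handle_request(worker_id * REQUESTS_PER_WORKER + r);
            printf("worker %d finished request %d (result %ld)\n", worker_id, r, result);
        }
        return NULL;
    }

    int main(void) {
        pthread_t threads[NUM_WORKERS];
        int ids[NUM_WORKERS];

        for (int i = 0; i < NUM_WORKERS; i++) {
            ids[i] = i;
            pthread_create(&threads[i], NULL, worker, &ids[i]);
        }
        for (int i = 0; i < NUM_WORKERS; i++)   /* the only synchronization point */
            pthread_join(threads[i], NULL);
        return 0;
    }

Compiled with something like cc -pthread, the workers run truly in parallel on a multicore (shared-memory) machine; because the requests share nothing, adding workers should scale throughput almost linearly, which is exactly the embarrassingly parallel behavior described above.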
Pgming Ex: Producer/Consumer w/ Semaphores

    semaphore mutex = 1;
    semaphore fillCount = 0;
    semaphore emptyCount = BUFFER_SIZE;

    procedure producer() {
        while (true) {
            item = produceItem();
            wait(emptyCount);
            wait(mutex);
            putItemIntoBuffer(item);
            signal(mutex);
            signal(fillCount);
        }
    }

    procedure consumer() {
        while (true) {
            wait(fillCount);
            wait(mutex);
            item = removeItemFromBuffer();
            signal(mutex);
            signal(emptyCount);
            consumeItem(item);
        }
    }

    wait(s):   repeat { if (s > 0) { s = s - 1; break; } }
    signal(s): s = s + 1;
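For comparison with the pseudocode above, here is one possible rendering in C using POSIX counting semaphores, whose sem_wait and sem_post calls correspond to wait and signal. This is a sketch, not the slide's own code: the buffer size, the integer item type, and the number of items passed through are arbitrary choices for the example.

    #include <pthread.h>
    #include <semaphore.h>
    #include <stdio.h>

    #define BUFFER_SIZE 8
    #define NUM_ITEMS   32          /* arbitrary number of items to pass through */

    static int buffer[BUFFER_SIZE];
    static int in = 0, out = 0;     /* ring-buffer indices */

    static sem_t mutex;             /* binary semaphore guarding the buffer */
    static sem_t fill_count;        /* counts filled slots (starts at 0) */
    static sem_t empty_count;       /* counts empty slots  (starts at BUFFER_SIZE) */

    static void *producer(void *arg) {
        (void)arg;
        for (int item = 0; item < NUM_ITEMS; item++) {
            sem_wait(&empty_count);             /* wait(emptyCount) */
            sem_wait(&mutex);                   /* wait(mutex)      */
            buffer[in] = item;                  /* putItemIntoBuffer(item) */
            in = (in + 1) % BUFFER_SIZE;
            sem_post(&mutex);                   /* signal(mutex)     */
            sem_post(&fill_count);              /* signal(fillCount) */
        }
        return NULL;
    }

    static void *consumer(void *arg) {
        (void)arg;
        for (int i = 0; i < NUM_ITEMS; i++) {
            sem_wait(&fill_count);              /* wait(fillCount) */
            sem_wait(&mutex);                   /* wait(mutex)     */
            int item = buffer[out];             /* removeItemFromBuffer() */
            out = (out + 1) % BUFFER_SIZE;
            sem_post(&mutex);                   /* signal(mutex)      */
            sem_post(&empty_count);             /* signal(emptyCount) */
            printf("consumed %d\n", item);      /* consumeItem(item)  */
        }
        return NULL;
    }

    int main(void) {
        sem_init(&mutex, 0, 1);
        sem_init(&fill_count, 0, 0);
        sem_init(&empty_count, 0, BUFFER_SIZE);

        pthread_t p, c;
        pthread_create(&p, NULL, producer, NULL);
        pthread_create(&c, NULL, consumer, NULL);
        pthread_join(p, NULL);
        pthread_join(c, NULL);

        sem_destroy(&mutex);
        sem_destroy(&fill_count);
        sem_destroy(&empty_count);
        return 0;
    }

The semaphores play exactly the roles shown in the pseudocode: emptyCount throttles the producer when the buffer is full, fillCount blocks the consumer when it is empty, and mutex serializes access to the shared buffer indices.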
MIMD

[Figure: MIMD organization — control units CU_1 … CU_n each issue an instruction stream (IS_1 … IS_n) to a processing unit (PU_1 … PU_n), and each processing unit exchanges a data stream (DS_1 … DS_n) with the memory modules MM_1 … MM_m.]

Decomposing MIMD: Multiprocessors

• Multiprocessors: computers consisting of tightly coupled processors that typically present a shared memory space. The principal topic of this chapter.
• Symmetric (shared-memory) Multiprocessors (SMP): small-scale multiprocessors with a shared memory space providing mostly uniform memory access (UMA). Example: single-processor multicore x86 machines.
• Distributed Shared Memory (DSM) multiprocessors: generally larger systems with a distributed memory organization that provides nonuniform memory access (NUMA). Much trickier to program effectively, as non-local memory references (which present to the programmer as just another memory location in their address space) can be surprisingly costly to access. In fact, all parallel programming is much harder to exploit for speedup than it appears: balancing the computation and managing synchronization costs between the parallel tasks is quite difficult.

Decomposing MIMD: Multicomputers

• Multicomputers: computers consisting of loosely coupled processors that typically present a distributed memory space. Not to be confused with a distributed-memory multiprocessor.
• Beowulf Clusters: a collection of (general-purpose) compute nodes that are networked together for parallel computing. Often presented as a rack of blade computers, but also existing as a collection of independent computers networked together for the purpose of parallel computing.
• Networking support is generally provided by standard networking hardware such as Ethernet or InfiniBand.

Decomposing MIMD: Massively Parallel Processing (MPP)

Large-scale (generally above 1K nodes) parallel computing, in all forms: SIMD, MIMD, GPGPUs, clusters, and tightly coupled multicomputers (e.g., the IBM Blue Gene family). A very interesting problem space; fault tolerance and fault recovery become far more important.

Decomposing MIMD: Warehouse-Scale

Clusters of tens of thousands (and beyond) of independent compute nodes, generally providing a compute platform for supporting large-scale request level parallelism. The topic of the next chapter.

Parallelism is Hard / Amdahl's Law

[Figure: Amdahl's Law, idealized parallelism — speedup vs. number of processors (1 to 65536) for workloads that are 95%, 90%, 75%, and 50% parallel.]

The speedup just isn't there by conventional approaches; gaining speedup is very difficult. In fact, this graph shows an idealized speedup without any consideration for synchronization costs.

    Speedup = 1 / ( (1 − Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced )

Restating:

    Speedup = 1 / ( (1 − %affected) + %left_after_optimization )

Amdahl's Law for Parallelism with Overhead, Naive

[Figure: Amdahl's Law with overhead, assuming 99% parallel — speedup vs. number of processors for the original curve, a fixed 0.5% overhead, and 2x the runtime of the parallel portion.]

So we look at both fixed and variable overheads. The first line adds a fixed overhead to the parallel portion equal to 0.5% of the original computation. The second line shows a variable overhead equal to doubling the runtime costs of the parallel components (not an unreasonable possibility). Neither the fixed nor the variable overhead costs are accurate, but they can give us some curves to consider. In reality, the costs of synchronization will be very difficult to establish; I cannot really give you much direction here, sorry.

    Speedup = 1 / ( (1 − %parallel) + (%parallel / processors + 0.005) )

    Speedup = 1 / ( (1 − %parallel) + (%parallel / processors) × 2 )

Gustafson's Law

[Figure: Gustafson's Law, scaled speedup — speedup vs. number of processors (up to 120) for workloads that are 95%, 90%, 75%, and 50% parallel.]

    S(P) = P − α · (P − 1)

where S is the speedup, P is the number of processors, and α is the non-parallelizable fraction of any parallel process.

From Wikipedia: Gustafson called his metric scaled speedup, because in the above expression S(P) is the ratio of the total, single-process execution time to the per-process parallel execution time; the former scales with P, while the latter is assumed fixed or nearly so. This is in contrast to Amdahl's Law, which takes the single-process execution time to be the fixed quantity and compares it to a shrinking per-process parallel execution time. Thus, Amdahl's Law is based on the assumption of a fixed problem size: it assumes the overall workload of a program does not change with respect to machine size (i.e., the number of processors). Both laws assume the parallelizable part is evenly distributed over P processors.
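To make the formulas above concrete, the small program below tabulates idealized Amdahl speedup, Amdahl speedup under the two overhead models from the previous slide, and Gustafson's scaled speedup. The parallel fraction (99%), the 0.5% fixed overhead, the 2x scaling factor, and the processor counts are illustrative values drawn from or close to the slides' plots; treat the output as an evaluation of the formulas, not a measurement.

    #include <stdio.h>

    /* Idealized Amdahl's Law: Speedup = 1 / ((1 - f) + f/p) */
    static double amdahl(double f, double p) {
        return 1.0 / ((1.0 - f) + f / p);
    }

    /* Amdahl with a fixed overhead added to the parallel portion (e.g., 0.5% = 0.005). */
    static double amdahl_fixed_overhead(double f, double p, double overhead) {
        return 1.0 / ((1.0 - f) + (f / p + overhead));
    }

    /* Amdahl with a variable overhead that scales the parallel portion (e.g., 2x). */
    static double amdahl_scaled_overhead(double f, double p, double scale) {
        return 1.0 / ((1.0 - f) + (f / p) * scale);
    }

    /* Gustafson's Law: S(P) = P - alpha * (P - 1), alpha = non-parallelizable fraction. */
    static double gustafson(double alpha, double p) {
        return p - alpha * (p - 1.0);
    }

    int main(void) {
        const double f = 0.99;        /* 99% parallel */
        const double alpha = 0.01;    /* 1% serial    */

        printf("%8s %10s %12s %12s %12s\n",
               "procs", "Amdahl", "fixed+0.005", "parallel*2", "Gustafson");
        for (double p = 1; p <= 65536; p *= 4) {
            printf("%8.0f %10.2f %12.2f %12.2f %12.2f\n",
                   p,
                   amdahl(f, p),
                   amdahl_fixed_overhead(f, p, 0.005),
                   amdahl_scaled_overhead(f, p, 2.0),
                   gustafson(alpha, p));
        }
        return 0;
    }

For a 99%-parallel workload the idealized Amdahl curve levels off at 100, the fixed 0.5% overhead caps it near 67, the 2x variable overhead roughly halves how quickly the limit is approached, and Gustafson's scaled speedup keeps growing almost linearly with P — the qualitative contrast the last three slides are making.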