Task Level Parallelism

The topic of this chapter is thread-level parallelism. While thread-level parallelism falls within the textbook’s classification of parallelism, alongside ILP and data parallelism, it also falls into the broader topic of parallel and distributed computing. In the next set of slides, I will attempt to place you in the context of this broader computation space, which is called task level parallelism. Of course, a proper treatment of parallel computing or distributed computing is worthy of an entire semester (or two) of study; I can only give you a brief exposure to the topic. The text highlighted in green in these slides contains external hyperlinks.

Classification of Parallelism

Software is either sequential or concurrent; hardware is either serial or parallel. The four combinations:

- Serial hardware, sequential software: some problem written as a sequential program (the MATLAB example from the textbook). Execution on a serial platform.
- Serial hardware, concurrent software: some problem written as a concurrent program (the O/S example from the textbook). Execution on a serial platform.
- Parallel hardware, sequential software: some problem written as a sequential program (the MATLAB example from the textbook). Execution on a parallel platform.
- Parallel hardware, concurrent software: some problem written as a concurrent program (the O/S example from the textbook). Execution on a parallel platform.

Flynn’s Classification of Parallelism

[Figure: block diagrams of the four classes: (a) SISD computer, (b) SIMD computer, (c) MISD computer, (d) MIMD computer. Legend: CU: control unit; PU: processor unit; MM: memory unit; SM: shared memory; IS: instruction stream; DS: data stream.]

Task Level Parallelism

- Task Level Parallelism: organizing a program or computing solution into a set of processes/tasks/threads for simultaneous execution. Thread level parallelism is a form of task level parallelism.
- Task level parallelism generally breaks down into one of two forms:
  - Running various steps of the algorithm as different (communicating) tasks/threads
  - SPMD style: Single Program Multiple Data
- Conventionally, one might think of task level parallelism (and the MIMD processing paradigm) as being used for a single program or operation. However, request level parallelism (e.g., serving HTTP requests for a website) is also generally addressed and studied with hardware solutions in this same space.
- Request level processing and related problems with independent transactions (e.g., bitcoin mining, web page requests) fall into a class of problems called embarrassingly parallel: "embarrassingly" because they have virtually no synchronization requirements and thus show nearly linear speedup (measured against the number of compute nodes).

Parallel, Distributed, & Concurrent Computing

- Parallel Computing: a form of computation in which many calculations are carried out simultaneously, operating on the principle that large problems can often be divided into discrete parts that can be solved concurrently (“in parallel”). There are several different forms of parallel computing: bit-level, instruction level, data, and task parallelism.
- Distributed Computing/Distributed Systems: a system in which components located on networked computers communicate and coordinate their actions by passing messages. The components interact with each other in order to achieve a common goal.
- Concurrent Computing: a form of computing in which programs are designed as collections of interacting computational processes that may be executed in parallel. Concurrent programs (processes or threads) can be executed (i) on a single processor by time-slicing, or (ii) in parallel by assigning each computational process to one of a set of processors.
The main challenges in designing concurrent programs are ensuring the correct sequencing of the interactions or communications between different computational executions, and coordinating access to resources that are shared among executions.

The terms concurrent computing, parallel computing, and distributed computing have a lot of overlap, and no clear distinction exists between them. The same system may be characterized both as “parallel” and “distributed”; the processors in a typical distributed system run concurrently in parallel. Parallel computing may be seen as a particular tightly coupled form of distributed computing, and distributed computing may be seen as a loosely coupled form of parallel computing. Nevertheless, it is possible to roughly classify concurrent systems as “parallel” or “distributed” using the following criteria:

- In parallel computing, all processors may have access to a shared memory to exchange information between processors.
- In distributed computing, each processor has its own private memory (distributed memory). Information is exchanged by passing messages between the processors.

Decomposing MIMD: Multiprocessors

- Multiprocessors: computers consisting of tightly coupled processors that typically present a shared memory space. The principal topic of this chapter.
- Symmetric (shared-memory) Multiprocessors (SMP): small scale multiprocessors with a shared memory space providing mostly uniform memory access (UMA). Example: single processor multicore x86 machines.
- Distributed Shared Memory (DSM) multiprocessors: generally larger systems with a distributed memory that provides nonuniform memory access (NUMA). Much trickier to program effectively, because non-local memory references (which present to the programmer as just another memory location in their address space) can be surprisingly costly to access.
In fact, all parallelism is much harder to exploit for speedup than it appears. Balancing the computation and managing synchronization costs between the parallel tasks is quite difficult.

Decomposing MIMD: Multicomputers

- Multicomputers: computers consisting of loosely coupled processors that typically present a distributed memory space. Not to be confused with a distributed memory multiprocessor.
- Beowulf Clusters: a collection of (general purpose) compute nodes that are networked together for parallel computing. Often presented as a rack of blade computers, but also existing as a collection of independent computers networked together for the purpose of parallel computing.
- Networking support is generally provided by standard networking hardware such as Ethernet or InfiniBand.

Decomposing MIMD: Massively Parallel Processing (MPP)

Large scale (generally above 1K nodes) parallel computing. All forms: SIMD, MIMD, GPGPUs, clusters, tightly coupled multicomputers (e.g., the IBM Blue Gene family).

A very interesting problem space. Fault tolerance and fault recovery become far more important than in other spaces.

As the size and scale of processing hardware increases, it’s not clear that MPP as a designation is, or will continue to be, relevant.

Decomposing MIMD: Warehouse-Scale

Clusters of tens of thousands (and beyond) of independent compute nodes, generally providing a compute platform for supporting large scale request level parallelism. The topic of the next chapter.

Speedup and Scaling

Strong Scaling: for a fixed problem size, increasing the parallelism in hardware achieves increased speedup.

Weak Scaling: increased speedup is achieved only when the problem size increases along with the hardware parallelism.
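
The two scaling regimes are often compared through an efficiency number. The sketch below is one common convention, not something these slides define: the function names and the sample timings are hypothetical, and the timings are made up purely for illustration.

```python
def strong_scaling_efficiency(t_one, t_n, processors):
    """Fixed problem size: achieved speedup divided by the ideal speedup.

    t_one: runtime on one processor (hypothetical measurement)
    t_n:   runtime on `processors` processors for the SAME problem
    """
    speedup = t_one / t_n
    return speedup / processors


def weak_scaling_efficiency(t_one, t_n):
    """Problem size grows with the processor count.

    t_one: runtime of one unit of work on one processor
    t_n:   runtime of N units of work on N processors
    Ideally the runtime stays flat, giving an efficiency of 1.0.
    """
    return t_one / t_n


if __name__ == "__main__":
    # Made-up timings in seconds, for illustration only.
    print("strong:", strong_scaling_efficiency(100.0, 20.0, 8))
    print("weak:  ", weak_scaling_efficiency(10.0, 12.5))
```

An efficiency near 1.0 in the first function suggests the fixed-size problem scales strongly; an efficiency near 1.0 in the second suggests the problem scales weakly.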
Parallelism is Hard/Amdahl’s Law

[Figure: “Amdahl’s Law: Idealized parallelism”. Speedup versus number of processors (1 to 65536) for programs that are 95%, 90%, 75%, and 50% parallel.]

The speedup just isn’t there by conventional approaches. Gaining speedup is very difficult. In fact, this graph shows an idealized speedup without any consideration for synchronization costs.

\[ \text{Speedup} = \frac{1}{(1 - \text{Fraction}_{\text{enhanced}}) + \frac{\text{Fraction}_{\text{enhanced}}}{\text{Speedup}_{\text{enhanced}}}} \]

Restating:

\[ \text{Speedup} = \frac{1}{1 - \%\,\text{affected} + \%\,\text{left after optimization}} \]

Amdahl’s Law for Parallelism with Overhead, Naive

[Figure: “Amdahl’s Law: Introducing Overhead”. Speedup versus number of processors (1 to 65536), assuming 99% parallel, for three cases: the original, adding a fixed 0.5% overhead, and 2x the runtime of the parallel portion.]

So we look at both fixed and variable overheads. The first line adds a fixed overhead to the parallel portion equal to 0.5% of the original computation. The second line shows a variable overhead equal to doubling the runtime costs of the parallel components (not an unreasonable possibility).

\[ \text{Speedup} = \frac{1}{(1 - \%\text{parallel}) + \left(\frac{\%\text{parallel}}{\text{processors}} + 0.005\right)} \]

\[ \text{Speedup} = \frac{1}{(1 - \%\text{parallel}) + \left(\frac{\%\text{parallel}}{\text{processors}} \cdot 2\right)} \]

Neither the fixed nor the variable overhead costs are accurate, but they can give us some curves to consider. In reality, the costs of synchronization will be very difficult to establish. I cannot really give you much direction here, sorry.

Gustafson’s Law

[Figure: “Gustafson’s Law: Scaled speedup”. Speedup versus number of processors (up to 120) for programs that are 95%, 90%, 75%, and 50% parallel.]

\[ S(P) = P - \alpha \cdot (P - 1) \]

- S: speedup
- P: number of processors
- α: the non-parallelizable fraction of any parallel process

From Wikipedia: Gustafson called his metric scaled speedup, because in the above expression S(P) is the ratio of the total, single-process execution time to the per-process parallel execution time; the former scales with P, while the latter is assumed fixed or nearly so. This is in contrast to Amdahl’s Law, which takes the single-process execution time to be the fixed quantity, and compares it to a shrinking per-process parallel execution time. Thus, Amdahl’s law is based on the assumption of a fixed problem size: it assumes the overall workload of a program does not change with respect to machine size (i.e., the number of processors). Both laws assume the parallelizable part is evenly distributed over P processors.
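
The speedup formulas on the last three slides are easy to tabulate. The following is a minimal sketch, not the code used to produce the plots: the function names are hypothetical, and the default overhead of 0.005 is an assumption taken from the "0.5% of the original computation" wording.

```python
def amdahl_speedup(parallel_fraction, processors):
    """Idealized Amdahl's Law: no synchronization cost at all."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / processors)


def amdahl_fixed_overhead(parallel_fraction, processors, overhead=0.005):
    """Amdahl's Law plus a fixed overhead added to the parallel portion.

    overhead=0.005 is an assumption (0.5% of the original computation).
    """
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / processors + overhead)


def amdahl_scaled_overhead(parallel_fraction, processors, factor=2.0):
    """Amdahl's Law with the parallel portion's runtime multiplied by `factor`."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + factor * parallel_fraction / processors)


def gustafson_speedup(serial_fraction, processors):
    """Gustafson's scaled speedup: S(P) = P - alpha * (P - 1)."""
    return processors - serial_fraction * (processors - 1)


if __name__ == "__main__":
    # Compare the three Amdahl variants for a 99% parallel program.
    for n in (1, 16, 256, 4096, 65536):
        print(
            n,
            round(amdahl_speedup(0.99, n), 2),
            round(amdahl_fixed_overhead(0.99, n), 2),
            round(amdahl_scaled_overhead(0.99, n), 2),
        )
    # Gustafson's Law for a 95% parallel program (alpha = 0.05).
    for n in (20, 60, 120):
        print(n, gustafson_speedup(0.05, n))
```

Under these assumptions the idealized Amdahl curve for a 99% parallel program can never exceed 100, and the fixed-overhead variant levels off well below that, while the Gustafson curve keeps growing with the processor count because the problem size is assumed to grow too.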
