
Understanding Parallelism and the Limitations of Parallel Computing

Understanding Parallelism: Sequential work

After 16 time steps: 4 cars

Understanding Parallelism: Parallel work

After 4 time steps: 4 cars (“perfect speedup”)

Understanding Parallelism: Shared resources, imbalance

[Figure: workers idle while waiting for a shared resource; unused resources due to resource bottleneck and imbalance!]

Limitations of Parallel Computing: Amdahl's Law

Ideal world: All work is perfectly parallelizable

Closer to reality: purely serial parts limit the maximum speedup

Reality is even worse: Communication and synchronization impede scalability even further

Limitations of Parallel Computing: Calculating Speedup in a Simple Model (“strong scaling”)

$T(1) = s + p$ = serial compute time (normalized to 1)

Parallelizable part: $p = 1 - s$; purely serial part: $s$

Parallel execution time: $T(N) = s + p/N$

General formula for speedup (Amdahl's Law, 1967; “strong scaling”):

$$S_p(N) = \frac{T(1)}{T(N)} = \frac{1}{s + \frac{1-s}{N}}$$
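A minimal Python sketch of this model (the function name and the sample values of $s$ and $N$ are ours, purely for illustration):

```python
def amdahl_speedup(s: float, N: int) -> float:
    """Amdahl's Law: S(N) = 1 / (s + (1 - s) / N) for serial fraction s."""
    return 1.0 / (s + (1.0 - s) / N)

# Even a small serial fraction caps the achievable speedup:
# with s = 0.1, S(N) can never exceed 1/s = 10.
for N in (1, 2, 4, 16, 64, 1024):
    print(f"N = {N:4d}: S = {amdahl_speedup(0.1, N):.2f}")
```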

Limitations of Parallel Computing: Amdahl's Law (“strong scaling”)

- Reality: no task is perfectly parallelizable
- Shared resources have to be used serially
- Task interdependencies must be accounted for
- Communication overhead (but that can be modeled separately)

- Benefit of parallelization may be strongly limited
- “Side effect”: limited scalability leads to inefficient use of resources
- Metric: parallel efficiency (what fraction of the workers/processors is efficiently used):

$$\varepsilon_p(N) = \frac{S_p(N)}{N}$$

- Amdahl case: $\varepsilon_p(N) = \dfrac{1}{s\,(N - 1) + 1}$
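A small self-contained check, with hypothetical values, that $S_p(N)/N$ indeed reduces to the closed Amdahl form above:

```python
def parallel_efficiency(s: float, N: int) -> float:
    """eps(N) = S(N) / N with S(N) from Amdahl's Law."""
    speedup = 1.0 / (s + (1.0 - s) / N)
    return speedup / N

def amdahl_efficiency(s: float, N: int) -> float:
    """Closed form for the Amdahl case: 1 / (s*(N - 1) + 1)."""
    return 1.0 / (s * (N - 1) + 1)

for N in (1, 4, 16, 64, 256):
    assert abs(parallel_efficiency(0.1, N) - amdahl_efficiency(0.1, N)) < 1e-12
    print(f"N = {N:3d}: efficiency = {amdahl_efficiency(0.1, N):.1%}")
```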

Limitations of Parallel Computing: Adding a simple communication model for strong scaling

$T(1) = s + p$ = serial compute time

Parallelizable part: $p = 1 - s$; purely serial part: $s$

Model assumption: non-overlapping communication, with a communication fraction $k$ for messages per worker:

$$T(N) = s + \frac{p}{N} + Nk$$

General formula for speedup:

$$S_p^k(N) = \frac{T(1)}{T(N)} = \frac{1}{s + \frac{1-s}{N} + Nk}$$
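A sketch extending the earlier Amdahl function with the communication term (parameter names are assumed, not from the slides):

```python
def speedup_with_comm(s: float, k: float, N: int) -> float:
    """S(N) = 1 / (s + (1 - s)/N + N*k): Amdahl's Law plus a linear
    communication cost of k per worker."""
    return 1.0 / (s + (1.0 - s) / N + N * k)

# With k > 0, adding workers eventually adds more communication
# time than it removes compute time.
for N in (1, 4, 16, 64, 256):
    print(f"N = {N:3d}: S = {speedup_with_comm(0.01, 0.005, N):.2f}")
```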

Limitations of Parallel Computing: Amdahl's Law (“strong scaling”)

- Large-$N$ limits:
  - At $k = 0$, Amdahl's Law predicts $\lim_{N \to \infty} S_p(N) = \frac{1}{s}$, independent of $N$!

  - At $k \neq 0$, our simple model of communication overhead yields a behaviour of $S_p^k(N) \xrightarrow{\,N \gg 1\,} \frac{1}{Nk}$; speedup eventually falls again (see the numeric sketch below)
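A quick, self-contained numeric illustration of this turnover (the values of s and k are made up):

```python
s, k = 0.01, 0.005                  # illustrative serial/communication fractions
S = lambda N: 1.0 / (s + (1.0 - s) / N + N * k)

N_opt = max(range(1, 1001), key=S)  # brute-force search for the peak
print(f"peak S = {S(N_opt):.2f} at N = {N_opt}; S(1000) = {S(1000):.2f}")
```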

- Problems in real-world programming:
  - Load imbalance
  - Shared resources have to be used serially (e.g. I/O)
  - Task interdependencies must be accounted for
  - Communication overhead

Limitations of Parallel Computing: Amdahl's Law (“strong scaling”) + comm. model

[Figure: speedup S(N) vs. # CPUs (1 to 10) for s = 0.01, s = 0.1, and s = 0.1 with k = 0.05]

Limitations of Parallel Computing: Amdahl's Law (“strong scaling”)

[Figure: speedup S(N) vs. # CPUs (1 to 1000) for s = 0.01, s = 0.1, and s = 0.1 with k = 0.05; markers show where parallel efficiency drops to ~50% and below 10%]

Limitations of Parallel Computing: How to mitigate overheads

- Communication is not necessarily purely serial
  - Non-blocking crossbar networks can transfer many messages concurrently: the factor $Nk$ in the denominator becomes $k$ (a technical measure)
  - Sometimes, communication can be overlapped with useful work (implementation, …)

- Communication overhead may show a more fortunate behavior than $Nk$
- “Superlinear speedups”: data size per CPU decreases with increasing CPU count and may fit into cache at large CPU counts

Limits of Scalability: Serial & Parallel Fraction

Serial fraction $s$ may depend on:
- Program / algorithm
  - Non-parallelizable part, e.g. recursive data setup
  - Non-parallelizable I/O, e.g. reading input data
  - Communication structure
  - Load balancing (assumed so far: perfectly balanced)
  - …
- Hardware & system
  - Processor: cache effects & memory bandwidth effects
  - Parallel library; network capabilities; parallel I/O
  - …

Determine s "experimentally": Measure speedup and fit data to Amdahl’s law – but that could be more complicated than it seems…
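One possible way to do such a fit, sketched with SciPy's curve_fit; the measured speedups below are made-up numbers, purely for illustration:

```python
import numpy as np
from scipy.optimize import curve_fit

def amdahl(N, s):
    """Amdahl speedup model with the serial fraction s as the only free parameter."""
    return 1.0 / (s + (1.0 - s) / N)

N = np.array([1, 2, 4, 8, 16], dtype=float)
S_measured = np.array([1.0, 1.9, 3.4, 5.7, 8.6])   # hypothetical measurements

(s_fit,), _ = curve_fit(amdahl, N, S_measured, p0=[0.05], bounds=(0.0, 1.0))
print(f"fitted serial fraction: s ≈ {s_fit:.3f}")
```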

Scalability data on modern multi-core systems: An example

Scaling across nodes

[Figure: node diagram with multiple sockets per node and multiple cores (P) with caches (C) per socket; each socket connects through a chipset to its memory]

Scalability data on modern multi-core systems: The scaling baseline

- Scalability presentations should be grouped according to the largest unit that the scaling is based on (the “scaling baseline”)

[Figure: intranode speedup vs. # CPUs (1, 2, 4) for a memory-bound code; good scaling across sockets]

- Amdahl model with communication: fit
  $$S(N) = \frac{1}{s + \frac{1-s}{N} + kN}$$
  to inter-node scalability numbers ($N$ = # nodes, $> 1$)
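The same fitting approach works for the model with the communication term, now with two free parameters $s$ and $k$ (again a sketch on made-up inter-node data):

```python
import numpy as np
from scipy.optimize import curve_fit

def amdahl_comm(N, s, k):
    """Speedup model with serial fraction s and communication fraction k."""
    return 1.0 / (s + (1.0 - s) / N + k * N)

N_nodes = np.array([1, 2, 4, 8, 16, 32], dtype=float)
S_measured = np.array([1.0, 1.9, 3.5, 5.8, 8.1, 8.9])   # hypothetical

(s_fit, k_fit), _ = curve_fit(amdahl_comm, N_nodes, S_measured,
                              p0=[0.05, 0.001], bounds=(0.0, [1.0, 1.0]))
print(f"s ≈ {s_fit:.3f}, k ≈ {k_fit:.4f}")
```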

Application to “accelerated computing”

- SIMD, GPUs, Cell SPEs, FPGAs, just any optimization…
- Assume the overall (serial, un-accelerated) runtime to be $T_s = s + p = 1$
- Assume $p$ can be accelerated to run $\alpha$ times faster; we neglect any additional cost (communication…)
- To get a speedup of $r\alpha$, how small must $s$ be? Solve for $s$:

1 r 1 1 r   s  1 s  1 s  

- At $\alpha = 10$ and $r = 0.9$ (for an overall speedup of 9), we get $s \approx 0.012$, i.e. you must accelerate 98.8% of the serial runtime!
- Limited memory on accelerators may limit the achievable speedup
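The solved formula is easy to sanity-check in a few lines (a sketch; the function name is ours):

```python
def required_serial_fraction(alpha: float, r: float) -> float:
    """Largest serial fraction s that still allows a speedup of r*alpha
    when the parallelizable part runs alpha times faster:
    s = (1 - r) / (r * (alpha - 1))."""
    return (1.0 - r) / (r * (alpha - 1.0))

s = required_serial_fraction(alpha=10, r=0.9)   # target speedup: 9
print(f"s ≈ {s:.4f} -> must accelerate {1 - s:.1%} of the serial runtime")
```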
