
Parallel shared-memory programming on multicore nodes

Georg Hager and Jan Treibig

HPC Services, Erlangen Regional Computing Center (RRZE)

Specialist Workshop, University of Ghent, April 19-20, 2012

Schedule Day 1

9:00   Introduction to parallel programming on multicore systems
10:00  OpenMP Basic topics I
10:30  Coffee break
10:45  OpenMP Basic topics II
11:45  Hands-on session 1
12:30  Lunch break
13:30  OpenMP Advanced topics
15:00  Coffee break
15:15  Hands-on session 2
16:15  Tooling
17:00  End of day 1

Schedule Day 2

9:00   Performance-oriented programming I
10:30  Coffee break
10:45  Hands-on session 3
11:45  Performance-oriented programming II
12:30  Lunch break
13:30  Performance-oriented programming III
14:15  Hands-on session 4
15:15  Coffee break
15:30  Architecture-specific programming
17:00  End of workshop

Moodle e-learning platform:

http://moodle.rrze.uni-erlangen.de/moodle/course/view.?id=236&lang=en

Workshop outline

. Part 1: Introduction
  . Programming models
  . Aspects of parallelism
  . Strong scaling (“Amdahl’s law”)
  . Weak scaling (“Gustafson’s law”)
  . Program design aspects
  . Multicore node architecture
. Part 2: OpenMP Basic Topics
  . OpenMP programming model
  . OpenMP syntax
  . parallel directive
  . Data scoping
  . Worksharing directives
  . OpenMP API
  . Compiling and running
. Part 3: OpenMP Advanced topics
  . Synchronization
  . Advanced worksharing clauses
  . Tasking
. Part 4: Multicore Tools
  . LIKWID tool suite
  . Alternatives
. Part 5: Multicore performance
  . Performance properties of multicore systems
  . Simple performance modeling
  . Code optimization
  . ccNUMA issues
. Part 6: Architecture-specific programming
  . Hierarchies in depth
  . Understanding x86 assembly

Introduction: Parallel Computing

. Parallelism will substantially increase through the use of multi/many-core chips in the future!

. Parallel computing is entering everyday life . Dual-core/quad-core systems everywhere (Workstation, Laptop, etc…)

. Basic design concepts for parallel computers:
  . Shared memory multiprocessor systems: Multiple processors run in parallel but use the same (a single) address space (“shared memory”). Examples: Multicore workstations, multicore multisocket cluster nodes.
  . Distributed memory systems: Multiple processors/compute nodes are connected via a network. Each processor has its own address space/memory. Examples: Clusters with x86-based servers and high-speed interconnect networks (or just GBit Ethernet).

Distributed Memory vs. Shared Memory: Programming Models

Distributed Memory:
. Same program on each processor/machine (SPMD), or multiple programs with a consistent communication structure (MPMD)
. Data
  . all variables are process-local
  . no implicit knowledge of data on other processors
. Data exchange between processes:
  . send/receive messages via an appropriate library or parallel language
  . most tedious, but also the most flexible way of parallelization
. Parallel library: Message Passing Interface, MPI

Shared Memory:
. Single program on a single machine
  . a UNIX process splits off threads, mapped to CPUs for work distribution
. Program written in a sequential language
  . data may be process-global or -local
  . exchange of data not needed, or via suitable synchronization mechanisms
. Programming models:
  . explicit threading (hard)
  . directive-based threading via OpenMP (easier)
  . automatic parallelization (very easy, but mostly not efficient)

Parallel programming models: Standards-based Parallelism

MPI standard:
. MPI Forum released version 2.2 in September 2009
  . unified document („MPI 1+2“)
. Base languages:
  . Fortran (77, 95)
  . C
  . C++ bindings obsolescent → use the C bindings
. Resources: http://www.mpi-forum.org

OpenMP standard:
. Architecture Review Board released version 3.1 in July 2011
  . feature update (tasking etc.)
. Base languages:
  . Fortran (77, 95)
  . C, C++
  . (Java is not a base language)
. Resources: http://www.openmp.org and http://www.compunity.org

Portability:
. of semantics (well-defined, „safe“ interfaces)
. of performance (a difficult target)

Alternatives for shared-memory programming

Almost exclusively C(++)

. POSIX Threads . Standardized threading interface (C) . “Bare metal” parallelization, very basic interface

. Intel Threading Building Blocks (TBB) . C++ library . Task-based, work stealing

. Intel Cilk Plus . …

Analysis: Finding exploitable concurrency

Problem analysis:

1. Decompose the problem into subproblems
   . perhaps even into a hierarchy of subproblems
   . that can execute simultaneously
   . that still execute efficiently
   . Data decomposition and task decomposition (non-orthogonal concepts)

2. Identify dependencies between subproblems
   . dependencies imply the need for data exchange
   . identification of synchronization points
   . determine explicit communication between processors, or the ordering of tasks working on shared data

Example: Climate simulation

. Prototypical for separate, but coupled systems
. Coarse-grained parallelism
  . functional decomposition: simulate the various subsystems and/or observables (ice, atmosphere, sea, land)
  . domain decomposition into equally-sized blocks; each block contains the observables on a grid; numerical methods for the solution are available (?)
. Medium-grained parallelism
  . further subdivide the work if multiple CPUs are available in a compute node
. Fine-grained parallelism
  . instructions and data on a CPU

[Figure: abstraction layers from Physics through Math/Algorithm and Software down to Silicon]

Basic parallelism (1): A diligent worker

after 16 time steps: 4 cars

Basic parallelism (2): Use four workers

after 4 time steps: 4 cars

Understanding parallelism: Limits of scalability – shared resources and imbalance

Unused resources due to a shared-resource bottleneck and load imbalance!

[Figure annotations: “Waiting for shared resource”, “Waiting for synchronization”]

Limitations of Parallel Computing: Amdahl's Law

Ideal world: All work is perfectly parallelizable

Closer to reality: Purely serial parts limit the maximum achievable speedup

Reality is even worse: Communication and synchronization impede scalability even further

Limitations of Parallel Computing: Calculating Speedup in a Simple Model (“strong scaling”)

T(1) = s+p = serial compute time (=1)

purely serial part: s; parallelizable part: p = 1-s

Parallel execution time: T(N) = s+p/N

General formula for speedup (Amdahl's Law for "strong scaling"):

  S_p = T(1)/T(N) = 1/(s + (1-s)/N)

Limitations of Parallel Computing: Amdahl's Law (“strong scaling”)

. Reality: No task is perfectly parallelizable
  . Shared resources have to be used serially
  . Task interdependencies must be accounted for
  . Communication overhead (but that can be modeled separately)

. Benefit of parallelization may be strongly limited
  . "Side effect": limited scalability leads to inefficient use of resources
. Metric: parallel efficiency (what percentage of the workers/processors is used efficiently):

  ε(N) = S_p(N)/N

. Amdahl case:  ε = 1/(s·(N-1) + 1)
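To make these formulas concrete, here is a minimal C sketch (added for illustration, not part of the original slides) that tabulates the Amdahl speedup and the corresponding parallel efficiency for an assumed serial fraction s:

#include <stdio.h>

/* Amdahl speedup for serial fraction s on N workers: S_p(N) = 1/(s + (1-s)/N) */
static double amdahl_speedup(double s, int N) {
  return 1.0 / (s + (1.0 - s) / N);
}

int main(void) {
  const double s = 0.1;                /* assumed serial fraction */
  for (int N = 1; N <= 16; N *= 2) {
    double Sp  = amdahl_speedup(s, N);
    double eff = Sp / N;               /* parallel efficiency eps(N) = S_p(N)/N */
    printf("N=%2d  S_p=%5.2f  eff=%4.2f\n", N, Sp, eff);
  }
  return 0;
}

For s = 0.1 the efficiency is already below 50% at N = 16, which is exactly the "inefficient use of resources" mentioned above.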

Limits of Scalability: Strong Scaling with a Simple Communication Model

T(1) = s+p = serial compute time

purely serial part: s; parallelizable part: p = 1-s

Parallel execution time: T(N) = s + p/N + N·k, with a fraction k for “communication” between each pair of workers (only present in the parallel case)

General formula for speedup (worst case):

  S_p = T(1)/T(N) = 1/(s + (1-s)/N + N·k)

  (k = 0 → Amdahl's Law)
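A C sketch of the extended model (again only an illustration; s and k are assumed model parameters, here the same values as in the plot a few slides below):

#include <stdio.h>

/* Speedup with the simple communication model: S_p(N) = 1/(s + (1-s)/N + k*N).
   Setting k = 0 recovers Amdahl's Law. */
static double speedup_comm(double s, double k, int N) {
  return 1.0 / (s + (1.0 - s) / N + k * N);
}

int main(void) {
  const double s = 0.1, k = 0.05;      /* example values */
  for (int N = 1; N <= 10; ++N)
    printf("N=%2d  S_p=%5.2f\n", N, speedup_comm(s, k, N));
  return 0;
}

With these values the modeled speedup peaks at a little below 2 around N ≈ 4 and then decreases again, because the N·k term eventually dominates.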

Limits of Scalability: Amdahl’s Law

. Large-N limits:
  . At k = 0, Amdahl's Law predicts

      lim (N→∞) S_p(N) = 1/s

    (independent of N)

  . At k ≠ 0, our simplified model of communication overhead yields, for large N, a behaviour of

      S_p(N) ≈ 1/(N·k)

. In reality the situation gets even worse . Load imbalance . startups . …

Understanding Parallelism: Amdahl’s Law and the impact of communication

[Figure: modeled speedup S(N) vs. number of CPUs (1-10) for s=0.01, s=0.1, and s=0.1 with k=0.05]

Limitations of Parallel Computing: Increasing Parallel Efficiency

. Increasing problem size often helps to enlarge the parallel fraction p . Often p scales with problem size while s stays constant . Fraction of s relative to overall runtime decreases

Scalability in terms of parallel speedup and parallel efficiency improves when scaling up the problem size!

[Figure: serial part s and parallel part p (p/N after parallelization) for a small and a large problem]

Limitations of Parallel Computing: Increasing Parallel Efficiency („weak scaling“)

. When scaling a problem to more workers, the amount of work will often be scaled as well
. Let s and p be the serial and parallel fractions so that s + p = 1
. Perfect situation: runtime stays constant while N increases
. “Performance speedup” =
  (work/time for problem size N with N workers) / (work/time for problem size 1 with 1 worker)

  P(N) = (s + p·N)/(s + p) = s + p·N = N + (1-N)·s = s + (1-s)·N

  Gustafson's Law ("weak scaling")

. Linear in N – but closely observe the meaning of the word "work"!

Limits of Scalability: Serial & Parallel fraction

The serial fraction s may depend on:
. Program / algorithm
  . non-parallelizable part, e.g. recursive data setup
  . non-parallelizable I/O, e.g. reading input data
  . communication structure
  . load balancing (assumed so far: perfectly balanced)
  . …
. Processor: cache effects & memory bandwidth effects
. Parallel library; network capabilities; parallel I/O
. …

Determine s "experimentally": Measure speedup and fit data to Amdahl’s law – but that could be more complicated than it seems…
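As a toy illustration of such a fit (not from the slides), Amdahl's Law can be inverted for a single measured speedup S on N workers; with several data points a proper least-squares fit would be used instead:

#include <stdio.h>

/* From S = 1/(s + (1-s)/N) it follows that s = (N/S - 1)/(N - 1). */
static double serial_fraction(double S, int N) {
  return (N / S - 1.0) / (N - 1.0);
}

int main(void) {
  double S = 6.4;     /* hypothetical measurement: speedup 6.4 ... */
  int    N = 16;      /* ... on 16 workers                         */
  printf("estimated serial fraction s = %.3f\n", serial_fraction(S, N));
  return 0;
}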

Scalability data on modern multi-core systems: An example

[Figure: scaling baselines in a modern cluster – scaling across nodes, across the sockets of a node, and across the cores of a socket]

Scalability data on modern multi-core systems: The scaling baseline

. Scalability presentations should be grouped according to the largest unit that the scaling is based on (the “scaling baseline”)

[Figure: intranode speedup of a memory-bound code across 1, 2, and 4 sockets – good scaling across sockets]

. Amdahl model with communication: Fit

    S(N) = 1/(s + (1-s)/N + k·N)

  to the inter-node scalability numbers (N = # nodes, > 1)
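A brute-force way to perform this fit is sketched below in C (added for illustration; the measured speedups are made-up numbers): it scans a grid of (s, k) values and keeps the pair with the smallest squared error against the data.

#include <stdio.h>

/* Amdahl model with communication term */
static double model(double s, double k, double N) {
  return 1.0 / (s + (1.0 - s) / N + k * N);
}

int main(void) {
  /* hypothetical inter-node measurements: (N nodes, measured speedup) */
  const double N[] = {2, 4, 8, 16};
  const double S[] = {1.87, 3.25, 4.79, 5.28};
  const int    m   = 4;

  double best_s = 0.0, best_k = 0.0, best_err = 1e30;
  for (double s = 0.0; s <= 0.2; s += 0.001) {
    for (double k = 0.0; k <= 0.02; k += 0.0005) {
      double err = 0.0;
      for (int i = 0; i < m; ++i) {
        double d = model(s, k, N[i]) - S[i];
        err += d * d;
      }
      if (err < best_err) { best_err = err; best_s = s; best_k = k; }
    }
  }
  printf("best fit: s = %.3f, k = %.4f\n", best_s, best_k);
  return 0;
}

For this made-up data the search lands at roughly s ≈ 0.05 and k ≈ 0.005.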

Load imbalance: Extreme cases

. Sometimes scalability cannot be described by a simple “refined” Amdahl model
. Load imbalance may impede scalability much more strongly than communication or serial execution
. Two extreme cases:

A few “laggers” hurt a lot; a few “speeders” do not matter.

Possible remedies for load imbalance:
. Dynamic load balancing (scalable? overhead?)
. Manual work distribution (general? inflexible?)
. Choosing a different algorithm (suitability?)

Multicore processor and system architecture

Basics: The x86 multicore evolution so far
Intel single-/dual-/quad-/hexa-cores (one-socket view)

. 2005: “Fake” dual-core
. 2006: True dual-core – Woodcrest “Core2 Duo” (65 nm), later Harpertown “Core2 Quad” (45 nm)
. 2008: Nehalem EP “Core i7” (45 nm) – Simultaneous Multi-Threading (SMT)
. 2010: Westmere EP “Core i7” (32 nm) – 6-core chip
. 2012: Sandy Bridge EP “Core i7” (32 nm) – wider SIMD units, AVX: 256 bit

[Figure: one-socket block diagrams for each generation, showing cores (P, with SMT threads T0/T1), caches (C), chipset or on-chip memory interface (MI), and memory]

There is no longer a single driving force for chip performance!

Floating Point (FP) Performance:

P = ncore * F * S * ν

Intel Xeon “Sandy Bridge EP” socket (4-, 6-, 8-core variants available):

ncore  number of cores: 8
F      FP instructions per cycle: 2 (1 MULT and 1 ADD)
S      FP ops per instruction: 4 (dp) / 8 (sp)  (256-bit SIMD registers – “AVX”)
ν      clock speed: 2.5 GHz

P = 160 GF/s (dp) / 320 GF/s (sp)  (≈ TOP500 rank 1 in 1996)

But: P=5 GF/s (dp) for serial, non-SIMD code
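Plugging the numbers into the formula is a one-liner; the sketch below (added for illustration) reproduces both the 160 GF/s socket peak and the 5 GF/s serial, non-SIMD figure.

#include <stdio.h>

/* Peak FP performance: P = ncore * F * S * clock  (GF/s if the clock is given in GHz) */
static double peak_gflops(int ncore, int F, int S, double clock_ghz) {
  return ncore * F * S * clock_ghz;
}

int main(void) {
  /* Sandy Bridge EP socket, double precision: 8 cores, 2 FP instr./cycle, 4 ops/instr., 2.5 GHz */
  printf("full socket, AVX:      %.1f GF/s\n", peak_gflops(8, 2, 4, 2.5));
  /* one core running scalar (non-SIMD) code: S drops to 1 */
  printf("serial, non-SIMD code: %.1f GF/s\n", peak_gflops(1, 2, 1, 2.5));
  return 0;
}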

From UMA to ccNUMA: Basic architecture of commodity compute cluster nodes

Yesterday (2006): Dual-socket Intel “Core2” node:

Uniform Memory Access (UMA): flat memory; symmetric multiprocessing. But: system “anisotropy”

Today: Dual-socket Intel (Westmere) node:
Cache-coherent Non-Uniform Memory Architecture (ccNUMA)
HT/QPI provide scalable bandwidth – at the price of ccNUMA architectures: Where does my data finally end up?

On AMD it is even more complicated → ccNUMA within a socket!

Back to the 2-chip-per-case age: 12-core AMD Magny-Cours – a 2x6-core ccNUMA socket

. AMD: single-socket ccNUMA since Magny-Cours

. 1 socket: the 12-core Magny-Cours is built from two 6-core chips → 2 NUMA domains

. 2-socket server → 4 NUMA domains

. 4-socket server → 8 NUMA domains

. WHY? → Shared resources are hard to scale: 2 x 2 memory channels vs. 1 x 4 memory channels per socket

Another flavor of “SMT”: AMD Interlagos / Bulldozer

. Up to 16 cores (8 Bulldozer modules) in a single socket

. Max. 2.6 GHz (+ Turbo Core)
. Pmax = (2.6 x 8 x 8) GF/s = 166.4 GF/s

[Figure: per module 16 kB dedicated L1D cache and 2048 kB shared L2 cache; 8 (6) MB shared L3 cache]

Each Bulldozer module:
. 2 “lightweight” cores
. 1 FPU: 4 MULT & 4 ADD (double precision) per cycle
. supports AVX
. supports FMA4

2 DDR3 memory channels (shared), > 15 GB/s; 2 NUMA domains per socket

Cray XE6 (Hermit): “Interlagos” 16-core dual-socket node

. Two 8- (integer-) core chips per socket @ 2.3 GHz (3.3 @ turbo) . Separate DDR3 memory interface per chip . ccNUMA on the socket!

. Shared FP unit per pair of integer cores (“module”) . “256-bit” FP unit . SSE4.2, AVX, FMA4

. 16 kB L1 data cache per core . 2 MB L2 cache per module . 8 MB L3 cache per chip (6 MB usable)

Parallel programming models on multicore multisocket nodes

. Shared-memory (intra-node)
  . Good old MPI (current standard: 2.2)
  . OpenMP (current standard: 3.0)
  . POSIX threads
  . Intel Threading Building Blocks
  . Cilk++, OpenCL, StarSs,… you name it
. Distributed-memory (inter-node)
  . MPI (current standard: 2.2)
  . PVM (gone)
. Hybrid
  . Pure MPI
  . MPI+OpenMP
  . MPI + any shared-memory model

All models require awareness of topology and affinity issues to get the best performance out of the machine!

Parallel programming models: Pure MPI

. Machine structure is invisible to the user:
  . + Very simple programming model
  . + MPI “knows what to do”!?
. Performance issues
  . intranode vs. internode MPI
  . node/system topology

Parallel programming models: Pure threading on the node

. Machine structure is invisible to the user
  . + Very simple programming model
  . Threading software (OpenMP, pthreads, TBB,…) should know about the details
. Performance issues
  . synchronization overhead
  . memory access
  . node topology

Threading Models

. POSIX threads
. Threading Building Blocks
. OpenMP

Why threads for parallel programs?

. “Thread” == “Lightweight process” . “Independent instruction stream” . In simulation we usually run one thread per (virtual or physical) core, but more is possible

. New processes are expensive to generate (via fork()) . Threads share all the data of a process, so they are cheap

. Inter-process communication is slow and cumbersome . Shared memory between threads provides an easy way to communicate and synchronize

. A threading model puts threads to use by making them accessible to the programmer . Either explicitly or “wrapped” in some parallel paradigm

POSIX threads example: Matrix-vector multiply with 100 threads

#include <pthread.h>

static double a[100][100], b[100], c[100];

static void *mult(void *cp);

int main(int argc, char* argv[]) {
  pthread_t tids[100];
  ...
  for (int i = 0; i < 100; i++)
    pthread_create(tids + i, NULL, mult, (void *)(c + i));
  for (int i = 0; i < 100; i++)
    pthread_join(tids[i], NULL);
  ...
}

/* each thread computes one element of c */
static void *mult(void *cp) {
  int i = (double *)cp - c;   /* recover the row index from the pointer argument */
  double sum = 0;
  for (int j = 0; j < 100; j++)
    sum += a[i][j] * b[j];
  c[i] = sum;
  return NULL;
}

There are no shared resources here! (not really)

Adapted from material by J. Kleinöder
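To try the sketch, something along the lines of `gcc -O2 -pthread mvm.c` (the file name is only an example) is typically all that is needed on Linux; `-pthread` pulls in the POSIX threads library, and the elided parts of main() would have to fill a and b with meaningful values.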

POSIX threads pros and cons

. Pros
  . Most basic threading interface
  . Straightforward, manageable API
  . Dynamic generation and destruction of threads
  . Reasonable synchronization primitives
  . Full execution control

. Cons
  . Most basic threading interface
  . Higher-level functions (reductions, synchronization, work distribution, task queueing) must be done “by hand”
  . Only available as a C API
  . Only available on (near-) POSIX-compliant OSs
  . Compiler has no clue about threads

Intel Threading Building Blocks (TBB)

. Introduced by Intel in 2006

. C++ threading library
  . uses POSIX threads “under the hood”
. Programmer works with tasks rather than threads
  . “task stealing” model
. Parallel C++ containers

. Commercial and open source variants exist

A simple parallel loop in TBB: Apply Foo() to every element of an array

#include "tbb/tbb.h" using namespace tbb;

class ApplyFoo {
  float *const my_a;
public:
  void operator()( const blocked_range<size_t>& r ) const {
    float *a = my_a;
    for( size_t i=r.begin(); i!=r.end(); ++i )
      Foo(a[i]);
  }
  ApplyFoo( float a[] ) : my_a(a) {}
};

void ParallelApplyFoo( float a[], size_t n ) {
  parallel_for(blocked_range<size_t>(0,n), ApplyFoo(a));
}

Adapted from the Intel TBB tutorial
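Assuming Foo() is defined elsewhere, such a sketch is typically built with the TBB library linked in, e.g. `g++ -O2 parallel_foo.cpp -ltbb` (command and file name are illustrative only).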

TBB pros and cons

. Pros
  . High-level programming model
  . Task concept is often more natural for real-world problems than the thread concept
  . Built-in parallel (thread-safe) containers
  . Built-in work distribution (configurable, but not too finely)
  . Available for Linux, Windows, MacOS

. Cons
  . C++ only
  . Mapping of threads to resources (cores) is not part of the model
  . “Number of threads” concept only vaguely implemented
  . Dynamic work sharing and task stealing introduce variability, difficult to optimize under ccNUMA constraints
  . Compiler has no clue about threads

Parallel Programming with OpenMP

. “Easy” and portable parallel programming of shared memory computers: OpenMP

. Standardized set of compiler directives & library functions: http://www.openmp.org/
  . Fortran, C and C++ interfaces
. Supported by most/all commercial compilers, GNU starting with 4.2
. Few free tools are available

. An OpenMP program can be written such that it compiles and executes on a single-processor machine simply by ignoring the directives

Shared Memory Model used by OpenMP

. Central concept of OpenMP programming: Threads

. Threads access globally shared memory
. Data can be shared or private
  . shared data is available to all threads (in principle)
  . private data is available only to the thread that owns it

[Figure: four threads (T) around a shared-memory region, each with its own private data]
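A minimal OpenMP sketch (added for illustration, not from the slides) of what shared and private mean in practice; the variable names are arbitrary:

#include <stdio.h>
#include <omp.h>

int main(void) {
  int counter = 0;                        /* shared: one copy, visible to all threads */

  #pragma omp parallel default(none) shared(counter)
  {
    int tid = omp_get_thread_num();       /* private: declared inside the region, one copy per thread */

    #pragma omp atomic
    counter++;                            /* concurrent updates to shared data need synchronization */

    printf("thread %d has its own tid\n", tid);
  }
  printf("threads that incremented the shared counter: %d\n", counter);
  return 0;
}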

OpenMP Program Execution: Fork and Join

. Program start: only the master thread runs
. Parallel region: a team of worker threads is generated (“fork”)
  . they synchronize when leaving the parallel region (“join”)
. Only the master executes the sequential part
  . worker threads usually sleep

. Task and data distribution via directives

. Usually optimal: one thread per core

[Figure: fork-join diagram with thread numbers 0-5]
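The fork-join pattern in its smallest form (illustrative sketch only): the master runs alone, a team is forked at the parallel region, and all threads join at its end.

#include <stdio.h>
#include <omp.h>

int main(void) {
  printf("sequential part: master thread only\n");

  #pragma omp parallel                    /* fork: a team of threads is created */
  {
    printf("parallel region: hello from thread %d of %d\n",
           omp_get_thread_num(), omp_get_num_threads());
  }                                       /* join: implicit barrier at the end of the region */

  printf("sequential part again: workers sleep, master continues\n");
  return 0;
}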

Example: Numerical integration in OpenMP

Approximate the integral by a discrete sum:

  ∫_0^1 f(t) dt ≈ (1/n) Σ_{i=1..n} f(x_i),   with  x_i = (i - 0.5)/n,  i = 1, ..., n

We want  ∫_0^1 4/(1+x²) dx = π  →  solve this in OpenMP

// function to integrate
double f(double x) {
  return 4.0/(1.0+x*x);
}

w=1.0/n;
sum=0.0;
for(i=1; i<=n; ++i) {
  x = w*(i-0.5);
  sum += f(x);
}
pi=w*sum;

... (printout omitted)

Example: Numerical integration in OpenMP

...
pi=0.0;
w=1.0/n;

#pragma omp parallel private(x,sum)   // concurrent execution by a "team of threads"
{
  sum=0.0;

#pragma omp for                       // worksharing among threads
  for(i=1; i<=n; ++i) {
    x = w*(i-0.5);
    sum += f(x);
  }

#pragma omp critical                  // "sequential" execution: one thread at a time
  pi=pi+w*sum;
}
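To build and run a complete version of this example, OpenMP support is enabled in the compiler (e.g. `gcc -fopenmp`, or the corresponding switch of other compilers) and the team size is chosen via the OMP_NUM_THREADS environment variable; note that the loop variable i is made private automatically by the omp for directive.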

OpenMP pros and cons

. Pros
  . High-level programming model
  . Available for Fortran, C, C++
  . Ideal for loop parallelism, some support for task parallelism
  . Built-in work distribution
  . Directive concept is part of the language
  . Good support for incremental parallelization

. Cons
  . Mapping of threads to resources (cores) is not part of the model
  . OpenMP parallelization may interfere with compiler optimization
  . Parallel data structures are not part of the model
  . Only limited synchronization facilities
  . Model revolves around the “parallel region” concept
