
Parallel shared-memory programming on multicore nodes
Georg Hager and Jan Treibig
HPC Services, Erlangen Regional Computing Center (RRZE)
Specialist Workshop on Parallel Computing, University of Ghent, April 19-20, 2012

Schedule Day 1
  9:00   Introduction to parallel programming on multicore systems
  10:00  OpenMP Basic topics I
  10:30  Coffee break
  10:45  OpenMP Basic topics II
  11:45  Hands-on session 1
  12:30  Lunch break
  13:30  OpenMP Advanced topics
  15:00  Coffee break
  15:15  Hands-on session 2
  16:15  Tooling
  17:00  End of day 1

Schedule Day 2
  9:00   Performance-oriented programming I
  10:30  Coffee break
  10:45  Hands-on session 3
  11:45  Performance-oriented programming II
  12:30  Lunch break
  13:30  Performance-oriented programming III
  14:15  Hands-on session 4
  15:15  Coffee break
  15:30  Architecture-specific programming
  17:00  End of workshop

Moodle e-learning platform:
http://moodle.rrze.uni-erlangen.de/moodle/course/view.php?id=236&lang=en

Workshop outline
- Part 1: Introduction
  - Programming models
  - Aspects of parallelism
  - Strong scaling ("Amdahl's law")
  - Weak scaling ("Gustafson's law")
  - Program design aspects
  - Multicore node architecture
- Part 2: OpenMP Basic Topics
  - OpenMP programming model
  - OpenMP syntax
  - parallel directive
  - Data scoping
  - Worksharing directives
  - OpenMP API
  - Compiling and running
- Part 3: OpenMP Advanced Topics
  - Synchronization
  - Advanced worksharing clauses
  - Tasking
- Part 4: Multicore Tools
  - LIKWID tool suite
  - Alternatives
- Part 5: Multicore Performance
  - Performance properties of multicore systems
  - Simple performance modeling
  - Code optimization
  - ccNUMA issues
- Part 6: Architecture-specific Programming
  - Cache hierarchies in depth
  - Understanding x86 assembly

Introduction: Parallel Computing
- Parallelism will substantially increase through the use of multi-/many-core chips in the future!
- Parallel computing is entering everyday life: dual-core/quad-core systems are everywhere (workstations, laptops, etc.)
- Basic design concepts for parallel computers:
  - Shared-memory multiprocessor systems: multiple processors run in parallel but use the same (a single) address space ("shared memory"). Examples: multicore workstations, multicore multisocket cluster nodes.
  - Distributed-memory systems: multiple processors/compute nodes are connected via a network. Each processor has its own address space/memory. Examples: clusters of x86-based servers with a high-speed interconnect network (or just GBit Ethernet).

Message Passing vs. Shared Memory: Programming Models

Distributed memory
- Same program on each processor/machine (SPMD), or multiple programs with a consistent communication structure (MPMD)
- Program written in a sequential language
  - all variables are process-local
  - no implicit knowledge of data on other processors
- Data exchange between processes:
  - send/receive messages via an appropriate library
  - most tedious, but also the most flexible way of parallelization
- Parallel library: Message Passing Interface (MPI)

Shared memory
- Single program on a single machine: a UNIX process splits off threads, which are mapped to CPUs for work distribution
- Data may be process-global or thread-local; exchange of data is not needed, or happens via suitable synchronization mechanisms
- Programming models:
  - explicit threading (hard)
  - directive-based threading via OpenMP (easier; see the sketch below)
  - library or parallel language
  - automatic parallelization (very easy, but mostly not efficient)
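To make the directive-based shared-memory model concrete, here is a minimal OpenMP sketch in C (not taken from the course material; the array size and loop body are arbitrary example choices). It illustrates the parallel directive, worksharing, and data scoping ideas listed in the outline:

/* compile e.g. with: gcc -fopenmp openmp_sketch.c */
#include <stdio.h>
#include <omp.h>

#define N 1000000

static double a[N], b[N];   /* file-scope arrays: shared among all threads */

int main(void) {
    double sum = 0.0;

    /* sequential initialization */
    for (int i = 0; i < N; ++i)
        b[i] = 0.5 * i;

    /* parallel region + worksharing: the loop iterations are distributed over
       the team of threads; i is private to each thread, a and b are shared,
       and sum is combined across threads via the reduction clause */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; ++i) {
        a[i] = 2.0 * b[i];
        sum += a[i];
    }

    printf("max. threads: %d, sum = %.1f\n", omp_get_max_threads(), sum);
    return 0;
}

The number of threads in the team is typically controlled via the OMP_NUM_THREADS environment variable at run time.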
Parallel programming models: Standards-based Parallelism

MPI standard
- MPI Forum released version 2.2 in September 2009
- unified document ("MPI 1+2")
- Base languages: Fortran (77, 95), C; the C++ binding is obsolescent (Java is not a base language)
- Resources: http://www.mpi-forum.org

OpenMP standard
- Architecture Review Board released version 3.1 in July 2011
- feature update (tasking etc.)
- Base languages: Fortran (77, 95); C, C++ (C++ uses the C bindings)
- Resources: http://www.openmp.org, http://www.compunity.org

Portability:
- of semantics (well-defined, "safe" interfaces)
- of performance (a difficult target)

Alternatives for shared-memory programming (almost exclusively C/C++)
- POSIX Threads
  - standardized threading interface (C)
  - "bare metal" parallelization, very basic interface
- Intel Threading Building Blocks (TBB)
  - C++ library
  - task-based, work stealing
- Intel Cilk Plus
- ...

Analysis: Finding exploitable concurrency
Problem analysis:
1. Decompose into subproblems
   - data decomposition and task decomposition (non-orthogonal concepts)
   - perhaps even a hierarchy of subproblems
   - subproblems that can execute simultaneously
   - and that still execute efficiently
2. Identify dependencies between subproblems
   - they imply the need for data exchange
   - identification of synchronization points
   - determine explicit communication between processors, or the ordering of tasks working on shared data

Example: Climate simulation
- Prototypical for separate, but coupled systems (ice, atmosphere, sea, land)
- Coarse-grained parallelism
  - functional decomposition: simulate the various subsystems and/or observables
  - domain decomposition into equally-sized blocks; each block contains observables on a grid; numerical methods for the solution are available (?)
- Medium-grained parallelism
  - further subdivide the work if multiple CPUs are available in a compute node
- Fine-grained parallelism
  - instructions and data on a CPU
[Figure: abstraction layers from Physics via Math/Algorithm and Software down to Silicon]

Basic parallelism (1): A diligent worker
[Figure: one worker assembles cars sequentially; after 16 time steps: 4 cars]

Basic parallelism (2): Use four workers
[Figure: four workers assemble cars concurrently; after 4 time steps: 4 cars]

Understanding parallelism: Limits of scalability - shared resources + imbalance
- Unused resources due to a shared-resource bottleneck and imbalance!
[Figure: execution timeline with idle phases "waiting for shared resource" and "waiting for synchronization"]

Limitations of Parallel Computing: Amdahl's Law
- Ideal world: all work is perfectly parallelizable
- Closer to reality: purely serial parts limit the maximum speedup
- Reality is even worse: communication and synchronization impede scalability even further

Limitations of Parallel Computing: Calculating Speedup in a Simple Model ("strong scaling")
- T(1) = s + p = serial compute time (normalized to 1), with purely serial part s and parallelizable part p = 1 - s
- Parallel execution time: T(N) = s + p/N
- General formula for speedup (see the sketch below):
  S(N) = T(1)/T(N) = 1 / (s + (1-s)/N)     (Amdahl's Law for "strong scaling")
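As a quick numerical illustration of Amdahl's Law (a sketch that is not part of the original slides; the value s = 0.1 is just an example), the following C snippet evaluates S(N) = 1/(s + (1-s)/N) for increasing worker counts:

#include <stdio.h>

/* Amdahl speedup for serial fraction s and N workers: S(N) = 1 / (s + (1-s)/N) */
static double amdahl_speedup(double s, int N) {
    return 1.0 / (s + (1.0 - s) / N);
}

int main(void) {
    const double s = 0.1;            /* example: 10% purely serial work */
    for (int N = 1; N <= 1024; N *= 2)
        printf("N = %4d   S = %5.2f\n", N, amdahl_speedup(s, N));
    /* the speedup saturates at 1/s = 10, no matter how many workers are added */
    return 0;
}

With s = 0.1, already N = 8 yields only S ≈ 4.7, and no number of workers can push the speedup beyond 1/s = 10.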
Limitations of Parallel Computing: Amdahl's Law ("strong scaling")
- Reality: no task is perfectly parallelizable
  - shared resources have to be used serially
  - task interdependencies must be accounted for
  - communication overhead (but that can be modeled separately)
- The benefit of parallelization may be strongly limited
  - "side effect": limited scalability leads to inefficient use of resources
- Metric: parallel efficiency (what percentage of the workers/processors is used efficiently):
  ε(N) = S(N)/N
  Amdahl case: ε(N) = 1 / (s(N-1) + 1)

Limits of Scalability: Strong Scaling with a Simple Communication Model
- T(1) = s + p = serial compute time, with purely serial part s and parallelizable part p = 1 - s
- Parallel: T(N) = s + p/N + Nk, with a fraction k for "communication" between each pair of workers (only present in the parallel case)
- General formula for speedup (worst case):
  S_k(N) = T(1)/T(N) = 1 / (s + (1-s)/N + Nk)     (k = 0 recovers Amdahl's Law)

Limits of Scalability: Amdahl's Law
- Large-N limits:
  - at k = 0, Amdahl's Law predicts S(N) → 1/s for N → ∞ (independent of N)
  - at k ≠ 0, our simplified model of communication overhead yields the behaviour S_k(N) ≈ 1/(Nk) for large N
- In reality the situation gets even worse:
  - load imbalance
  - pipeline startups
  - ...

Understanding Parallelism: Amdahl's Law and the impact of communication
[Figure: speedup S(N) versus the number of CPUs (1-10) for s=0.01, s=0.1, and s=0.1 with k=0.05]

Limitations of Parallel Computing: Increasing Parallel Efficiency
- Increasing the problem size often helps to enlarge the parallel fraction p
  - often p scales with the problem size while s stays constant
  - the fraction of s relative to the overall runtime decreases
- Scalability in terms of parallel speedup and parallel efficiency improves when scaling up the problem size!
[Figure: runtime bars s + p and s + p/N for a small and a larger problem]

Limitations of Parallel Computing: Increasing Parallel Efficiency ("weak scaling")
- When scaling a problem to more workers, the amount of work is often scaled as well
- Let s and p be the serial and parallel fractions, so that s + p = 1
- Perfect situation: the runtime stays constant while N increases
- "Performance speedup" = (work/time for problem size N with N workers) / (work/time for problem size 1 with 1 worker):
  P(N) = (s + pN)/(s + p) = s + pN = s + (1-s)N = N + (1-N)s     (Gustafson's Law, "weak scaling")
- Linear in N - but closely observe the meaning of the word "work"! (See the sketch below.)
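To tie the three models together, here is a small C sketch (not from the course material) that tabulates, for the example values s = 0.1 and k = 0.05 used in the plot above, the plain Amdahl speedup, the strong-scaling speedup with the communication term Nk, and the Gustafson weak-scaling "performance speedup":

#include <stdio.h>

/* Strong scaling with the simple communication model from the slides:
   S_k(N) = 1 / (s + (1-s)/N + N*k); k = 0 recovers plain Amdahl's Law. */
static double strong_speedup(double s, double k, int N) {
    return 1.0 / (s + (1.0 - s) / N + N * k);
}

/* Weak scaling (Gustafson): the work grows with N, P(N) = s + (1-s)*N. */
static double weak_speedup(double s, int N) {
    return s + (1.0 - s) * N;
}

int main(void) {
    const double s = 0.1, k = 0.05;  /* example values, as in the plot above */
    printf("  N   Amdahl   with k   Gustafson\n");
    for (int N = 1; N <= 10; ++N)
        printf("%3d   %6.2f   %6.2f   %9.2f\n",
               N,
               strong_speedup(s, 0.0, N),
               strong_speedup(s, k, N),
               weak_speedup(s, N));
    return 0;
}

The table makes the qualitative difference visible: the Nk term eventually makes the strong-scaling speedup decrease again, while the weak-scaling "performance speedup" keeps growing linearly because the amount of work grows with N.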