
Parallel shared-memory programming on multicore nodes

Georg Hager and Jan Treibig

HPC Services, Erlangen Regional Computing Center (RRZE)

Specialist Workshop, University of Ghent, April 19-20, 2012

Schedule Day 1

9:00   Introduction to parallel programming on multicore systems
10:00  OpenMP Basic topics I
10:30  Coffee break
10:45  OpenMP Basic topics II
11:45  Hands-on session 1
12:30  Lunch break
13:30  OpenMP Advanced topics
15:00  Coffee break
15:15  Hands-on session 2
16:15  Tooling
17:00  End of day 1

Schedule Day 2

9:00   Performance-oriented programming I
10:30  Coffee break
10:45  Hands-on session 3
11:45  Performance-oriented programming II
12:30  Lunch break
13:30  Performance-oriented programming III
14:15  Hands-on session 4
15:15  Coffee break
15:30  Architecture-specific programming
17:00  End of workshop

Moodle e-learning platform:

http://moodle.rrze.uni-erlangen.de/moodle/course/view.?id=236&lang=en

Workshop outline

. Part 1: Introduction
  . Programming models
  . Aspects of parallelism
  . Strong scaling (“Amdahl’s law”)
  . Weak scaling (“Gustafson’s law”)
  . Program design aspects
  . Multicore node architecture
. Part 2: OpenMP Basic Topics
  . OpenMP programming model
  . OpenMP syntax
  . parallel directive
  . Data scoping
  . Worksharing directives
  . OpenMP API
  . Compiling and running
. Part 3: OpenMP Advanced topics
  . Synchronization
  . Advanced worksharing clauses
  . Tasking
. Part 4: Multicore Tools
  . LIKWID tool suite
  . Alternatives
. Part 5: Multicore performance
  . Performance properties of multicore systems
  . Simple performance modeling
  . Code optimization
  . ccNUMA issues
. Part 6: Architecture-specific programming
  . Hierarchies in depth
  . Understanding x86 assembly

Introduction: Parallel Computing

. Parallelism will substantially increase through the use of multi/many-core chips in the future!

. Parallel computing is entering everyday life . Dual-core/quad-core systems everywhere (Workstation, Laptop, etc…)

. Basic design concepts for parallel computers:
  . Shared memory multiprocessor systems: Multiple processors run in parallel but use the same (a single) address space (“shared memory”). Examples: Multicore workstations, multicore multisocket cluster nodes.
  . Distributed memory systems: Multiple processors/compute nodes are connected via a network. Each processor has its own address space/memory. Examples: Clusters with x86-based servers and high-speed interconnect networks (or just GBit Ethernet).

Distributed Memory vs. Shared Memory: Programming Models

Distributed Memory:
. Same program on each processor/machine (SPMD), or multiple programs with a consistent communication structure (MPMD)
. Data
  . all variables are process-local
  . no implicit knowledge of data on other processors
. Data exchange between processes:
  . send/receive messages via an appropriate library or parallel language
  . most tedious, but also the most flexible way of parallelization
. Parallel library: Message Passing Interface, MPI

Shared Memory:
. Single program on a single machine
  . a UNIX process splits off threads, mapped to CPUs for work distribution
. Program written in a sequential language
  . data may be process-global or -local
  . exchange of data not needed, or via suitable synchronization mechanisms
. Programming models:
  . explicit threading (hard)
  . directive-based threading via OpenMP (easier)
  . automatic parallelization (very easy, but mostly not efficient)

Parallel programming models: Standards-based Parallelism

MPI standard:
. MPI Forum released version 2.2 in September 2009
  . unified document („MPI 1+2“)
. Base languages:
  . Fortran (77, 95)
  . C
  . C++ bindings obsolescent → use the C bindings
. Resources: http://www.mpi-forum.org

OpenMP standard:
. Architecture Review Board released version 3.1 in July 2011
  . feature update (tasking etc.)
. Base languages:
  . Fortran (77, 95)
  . C, C++
  . (Java is not a base language)
. Resources: http://www.openmp.org and http://www.compunity.org

Portability:
. of semantics (well-defined, „safe“ interfaces)
. of performance (a difficult target)

Alternatives for shared-memory programming

Almost exclusively C(++)

. POSIX Threads . Standardized threading interface (C) . “Bare metal” parallelization, very basic interface

. Intel Threading Building Blocks (TBB) . C++ library . Task-based, work stealing

. Intel Cilk Plus . …

Analysis: Finding exploitable concurrency

Problem analysis:

1. Decompose the problem into subproblems
   . perhaps even into a hierarchy of subproblems
   . that can execute simultaneously
   . that still execute efficiently
   . Data decomposition and task decomposition (non-orthogonal concepts)

2. Identify dependencies between subproblems
   . dependencies imply the need for data exchange
   . identification of synchronization points
   . determine explicit communication between processors, or the ordering of tasks working on shared data

Example: Climate simulation

. Prototypical for separate, but coupled systems
. Coarse-grained parallelism
  . functional decomposition: simulate the various subsystems and/or observables (ice, atmosphere, sea, land)
  . domain decomposition into equally-sized blocks; each block contains the observables on a grid; numerical methods for the solution are available (?)
. Medium-grained parallelism
  . further subdivide the work if multiple CPUs are available in a compute node
. Fine-grained parallelism
  . instructions and data on a CPU

[Figure: abstraction layers from Physics through Math/Algorithm and Software down to Silicon]

Basic parallelism (1): A diligent worker

after 16 time steps: 4 cars

Basic parallelism (2): Use four workers

after 4 time steps: 4 cars

Understanding parallelism: Limits of scalability – shared resources and imbalance

Unused resources due to a shared-resource bottleneck and load imbalance!

[Figure annotations: “Waiting for shared resource”, “Waiting for synchronization”]

Limitations of Parallel Computing: Amdahl's Law

Ideal world: All work is perfectly parallelizable

Closer to reality: Purely serial parts limit the maximum achievable speedup

Reality is even worse: Communication and synchronization impede scalability even further

Limitations of Parallel Computing: Calculating Speedup in a Simple Model (“strong scaling”)

T(1) = s+p = serial compute time (=1)

purely serial part: s; parallelizable part: p = 1-s

Parallel execution time: T(N) = s+p/N

General formula for speedup (Amdahl's Law for "strong scaling"):

  S_p = T(1)/T(N) = 1/(s + (1-s)/N)

Limitations of Parallel Computing: Amdahl's Law (“strong scaling”)

. Reality: No task is perfectly parallelizable
  . Shared resources have to be used serially
  . Task interdependencies must be accounted for
  . Communication overhead (but that can be modeled separately)

. Benefit of parallelization may be strongly limited
  . "Side effect": limited scalability leads to inefficient use of resources
. Metric: parallel efficiency (what percentage of the workers/processors is used efficiently):

  ε(N) = S_p(N)/N

. Amdahl case:  ε = 1/(s·(N-1) + 1)
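To make these formulas concrete, here is a minimal C sketch (added for illustration, not part of the original slides) that tabulates the Amdahl speedup and the corresponding parallel efficiency for an assumed serial fraction s:

#include <stdio.h>

/* Amdahl speedup for serial fraction s on N workers: S_p(N) = 1/(s + (1-s)/N) */
static double amdahl_speedup(double s, int N) {
  return 1.0 / (s + (1.0 - s) / N);
}

int main(void) {
  const double s = 0.1;                /* assumed serial fraction */
  for (int N = 1; N <= 16; N *= 2) {
    double Sp  = amdahl_speedup(s, N);
    double eff = Sp / N;               /* parallel efficiency eps(N) = S_p(N)/N */
    printf("N=%2d  S_p=%5.2f  eff=%4.2f\n", N, Sp, eff);
  }
  return 0;
}

For s = 0.1 the efficiency is already below 50% at N = 16, which is exactly the "inefficient use of resources" mentioned above.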

Limits of Scalability: Strong Scaling with a Simple Communication Model

T(1) = s+p = serial compute time

purely serial part: s; parallelizable part: p = 1-s

Parallel execution time: T(N) = s + p/N + N·k, with a fraction k for “communication” between each pair of workers (only present in the parallel case)

General formula for speedup (worst case):

  S_p = T(1)/T(N) = 1/(s + (1-s)/N + N·k)

  (k = 0 → Amdahl's Law)
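A C sketch of the extended model (again only an illustration; s and k are assumed model parameters, here the same values as in the plot a few slides below):

#include <stdio.h>

/* Speedup with the simple communication model: S_p(N) = 1/(s + (1-s)/N + k*N).
   Setting k = 0 recovers Amdahl's Law. */
static double speedup_comm(double s, double k, int N) {
  return 1.0 / (s + (1.0 - s) / N + k * N);
}

int main(void) {
  const double s = 0.1, k = 0.05;      /* example values */
  for (int N = 1; N <= 10; ++N)
    printf("N=%2d  S_p=%5.2f\n", N, speedup_comm(s, k, N));
  return 0;
}

With these values the modeled speedup peaks at a little below 2 around N ≈ 4 and then decreases again, because the N·k term eventually dominates.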

Limits of Scalability: Amdahl’s Law

. Large-N limits:
  . At k = 0, Amdahl's Law predicts

      lim (N→∞) S_p(N) = 1/s

    (independent of N)

  . At k ≠ 0, our simplified model of communication overhead yields, for large N, a behaviour of

      S_p(N) ≈ 1/(N·k)

. In reality the situation gets even worse . Load imbalance . startups . …

Understanding Parallelism: Amdahl’s Law and the impact of communication

[Figure: modeled speedup S(N) vs. number of CPUs (1-10) for s=0.01, s=0.1, and s=0.1 with k=0.05]

Limitations of Parallel Computing: Increasing Parallel Efficiency

. Increasing problem size often helps to enlarge the parallel fraction p . Often p scales with problem size while s stays constant . Fraction of s relative to overall runtime decreases

Scalability in terms of parallel speedup and parallel efficiency improves when scaling up the problem size!

[Figure: serial part s and parallel part p (p/N after parallelization) for a small and a large problem]

Limitations of Parallel Computing: Increasing Parallel Efficiency („weak scaling“)

. When scaling a problem to more workers, the amount of work will often be scaled as well
. Let s and p be the serial and parallel fractions so that s + p = 1
. Perfect situation: runtime stays constant while N increases
. “Performance speedup” =
  (work/time for problem size N with N workers) / (work/time for problem size 1 with 1 worker)

  P(N) = (s + p·N)/(s + p) = s + p·N = N + (1-N)·s = s + (1-s)·N

  Gustafson's Law ("weak scaling")

. Linear in N – but closely observe the meaning of the word "work"!

Limits of Scalability: Serial & Parallel fraction

The serial fraction s may depend on:
. Program / algorithm
  . non-parallelizable part, e.g. recursive data setup
  . non-parallelizable I/O, e.g. reading input data
  . communication structure
  . load balancing (assumed so far: perfectly balanced)
  . …
. Processor: cache effects & memory bandwidth effects
. Parallel library; network capabilities; parallel I/O
. …

Determine s "experimentally": Measure speedup and fit data to Amdahl’s law – but that could be more complicated than it seems…
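As a toy illustration of such a fit (not from the slides), Amdahl's Law can be inverted for a single measured speedup S on N workers; with several data points a proper least-squares fit would be used instead:

#include <stdio.h>

/* From S = 1/(s + (1-s)/N) it follows that s = (N/S - 1)/(N - 1). */
static double serial_fraction(double S, int N) {
  return (N / S - 1.0) / (N - 1.0);
}

int main(void) {
  double S = 6.4;     /* hypothetical measurement: speedup 6.4 ... */
  int    N = 16;      /* ... on 16 workers                         */
  printf("estimated serial fraction s = %.3f\n", serial_fraction(S, N));
  return 0;
}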

Scalability data on modern multi-core systems: An example

[Figure: scaling baselines in a modern cluster – scaling across nodes, across the sockets of a node, and across the cores of a socket]

Scalability data on modern multi-core systems: The scaling baseline

. Scalability presentations should be grouped according to the largest unit that the scaling is based on (the “scaling baseline”)

[Figure: intranode speedup of a memory-bound code across 1, 2, and 4 sockets – good scaling across sockets]

. Amdahl model with communication: Fit

    S(N) = 1/(s + (1-s)/N + k·N)

  to the inter-node scalability numbers (N = # nodes, > 1)
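A brute-force way to perform this fit is sketched below in C (added for illustration; the measured speedups are made-up numbers): it scans a grid of (s, k) values and keeps the pair with the smallest squared error against the data.

#include <stdio.h>

/* Amdahl model with communication term */
static double model(double s, double k, double N) {
  return 1.0 / (s + (1.0 - s) / N + k * N);
}

int main(void) {
  /* hypothetical inter-node measurements: (N nodes, measured speedup) */
  const double N[] = {2, 4, 8, 16};
  const double S[] = {1.87, 3.25, 4.79, 5.28};
  const int    m   = 4;

  double best_s = 0.0, best_k = 0.0, best_err = 1e30;
  for (double s = 0.0; s <= 0.2; s += 0.001) {
    for (double k = 0.0; k <= 0.02; k += 0.0005) {
      double err = 0.0;
      for (int i = 0; i < m; ++i) {
        double d = model(s, k, N[i]) - S[i];
        err += d * d;
      }
      if (err < best_err) { best_err = err; best_s = s; best_k = k; }
    }
  }
  printf("best fit: s = %.3f, k = %.4f\n", best_s, best_k);
  return 0;
}

For this made-up data the search lands at roughly s ≈ 0.05 and k ≈ 0.005.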

Load imbalance: Extreme cases

. Sometimes scalability cannot be described by a simple “refined” Amdahl model
. Load imbalance may impede scalability much more strongly than communication or serial execution
. Two extreme cases:

A few “laggers” hurt a lot; a few “speeders” do not matter.

Possible remedies for load imbalance:
. Dynamic load balancing (scalable? overhead?)
. Manual work distribution (general? inflexible?)
. Choosing a different algorithm (suitability?)

Multicore processor and system architecture

Basics: The x86 multicore evolution so far
Intel single-/dual-/quad-/hexa-cores (one-socket view)

. 2005: “Fake” dual-core
. 2006: True dual-core – Woodcrest “Core2 Duo” (65 nm), later Harpertown “Core2 Quad” (45 nm)
. 2008: Nehalem EP “Core i7” (45 nm) – Simultaneous Multi-Threading (SMT)
. 2010: Westmere EP “Core i7” (32 nm) – 6-core chip
. 2012: Sandy Bridge EP “Core i7” (32 nm) – wider SIMD units, AVX: 256 bit

[Figure: one-socket block diagrams for each generation, showing cores (P, with SMT threads T0/T1), caches (C), chipset or on-chip memory interface (MI), and memory]

There is no longer a single driving force for chip performance!

Floating Point (FP) Performance:

P = ncore * F * S * ν

Intel Xeon “Sandy Bridge EP” socket (4-, 6-, 8-core variants available):

ncore  number of cores: 8
F      FP instructions per cycle: 2 (1 MULT and 1 ADD)
S      FP ops per instruction: 4 (dp) / 8 (sp)  (256-bit SIMD registers – “AVX”)
ν      clock speed: 2.5 GHz

P = 160 GF/s (dp) / 320 GF/s (sp)  (≈ TOP500 rank 1 in 1996)

But: P=5 GF/s (dp) for serial, non-SIMD code
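Plugging the numbers into the formula is a one-liner; the sketch below (added for illustration) reproduces both the 160 GF/s socket peak and the 5 GF/s serial, non-SIMD figure.

#include <stdio.h>

/* Peak FP performance: P = ncore * F * S * clock  (GF/s if the clock is given in GHz) */
static double peak_gflops(int ncore, int F, int S, double clock_ghz) {
  return ncore * F * S * clock_ghz;
}

int main(void) {
  /* Sandy Bridge EP socket, double precision: 8 cores, 2 FP instr./cycle, 4 ops/instr., 2.5 GHz */
  printf("full socket, AVX:      %.1f GF/s\n", peak_gflops(8, 2, 4, 2.5));
  /* one core running scalar (non-SIMD) code: S drops to 1 */
  printf("serial, non-SIMD code: %.1f GF/s\n", peak_gflops(1, 2, 1, 2.5));
  return 0;
}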

From UMA to ccNUMA: Basic architecture of commodity compute cluster nodes

Yesterday (2006): Dual-socket Intel “Core2” node:

Uniform Memory Access (UMA): flat memory; symmetric multiprocessing. But: system “anisotropy”

Today: Dual-socket Intel (Westmere) node:
Cache-coherent Non-Uniform Memory Architecture (ccNUMA)
HT/QPI provide scalable bandwidth – at the price of ccNUMA architectures: Where does my data finally end up?

On AMD it is even more complicated → ccNUMA within a socket!

Back to the 2-chip-per-case age: 12-core AMD Magny-Cours – a 2x6-core ccNUMA socket

. AMD: single-socket ccNUMA since Magny-Cours

. 1 socket: the 12-core Magny-Cours is built from two 6-core chips → 2 NUMA domains

. 2-socket server → 4 NUMA domains

. 4-socket server → 8 NUMA domains

. WHY? → Shared resources are hard to scale: 2 x 2 memory channels vs. 1 x 4 memory channels per socket

Another flavor of “SMT”: AMD Interlagos / Bulldozer

. Up to 16 cores (8 Bulldozer modules) in a single socket

. Max. 2.6 GHz (+ Turbo Core)
. Pmax = (2.6 x 8 x 8) GF/s = 166.4 GF/s

[Figure: per module 16 kB dedicated L1D cache and 2048 kB shared L2 cache; 8 (6) MB shared L3 cache]

Each Bulldozer module:
. 2 “lightweight” cores
. 1 FPU: 4 MULT & 4 ADD (double precision) per cycle
. supports AVX
. supports FMA4

2 DDR3 memory channels (shared), > 15 GB/s; 2 NUMA domains per socket

Cray XE6 (Hermit): “Interlagos” 16-core dual-socket node

. Two 8- (integer-) core chips per socket @ 2.3 GHz (3.3 @ turbo) . Separate DDR3 memory interface per chip . ccNUMA on the socket!

. Shared FP unit per pair of integer cores (“module”) . “256-bit” FP unit . SSE4.2, AVX, FMA4

. 16 kB L1 data cache per core . 2 MB L2 cache per module . 8 MB L3 cache per chip (6 MB usable)

Parallel programming models on multicore multisocket nodes

. Shared-memory (intra-node)
  . Good old MPI (current standard: 2.2)
  . OpenMP (current standard: 3.0)
  . POSIX threads
  . Intel Threading Building Blocks
  . Cilk++, OpenCL, StarSs,… you name it
. Distributed-memory (inter-node)
  . MPI (current standard: 2.2)
  . PVM (gone)
. Hybrid
  . Pure MPI
  . MPI+OpenMP
  . MPI + any shared-memory model

All models require awareness of topology and affinity issues to get the best performance out of the machine!

Parallel programming models: Pure MPI

. Machine structure is invisible to the user:
  . + Very simple programming model
  . + MPI “knows what to do”!?
. Performance issues
  . intranode vs. internode MPI
  . node/system topology

Parallel programming models: Pure threading on the node

. Machine structure is invisible to the user
  . + Very simple programming model
  . Threading software (OpenMP, pthreads, TBB,…) should know about the details
. Performance issues
  . synchronization overhead
  . memory access
  . node topology

Threading Models

. POSIX threads
. Threading Building Blocks
. OpenMP

Why threads for parallel programs?

. “Thread” == “Lightweight process” . “Independent instruction stream” . In simulation we usually run one thread per (virtual or physical) core, but more is possible

. New processes are expensive to generate (via fork()) . Threads share all the data of a process, so they are cheap

. Inter-process communication is slow and cumbersome . Shared memory between threads provides an easy way to communicate and synchronize

. A threading model puts threads to use by making them accessible to the programmer . Either explicitly or “wrapped” in some parallel paradigm

POSIX threads example: Matrix-vector multiply with 100 threads

#include <pthread.h>

static double a[100][100], b[100], c[100];

static void *mult(void *cp);

int main(int argc, char* argv[]) {
  pthread_t tids[100];
  ...
  for (int i = 0; i < 100; i++)
    pthread_create(tids + i, NULL, mult, (void *)(c + i));
  for (int i = 0; i < 100; i++)
    pthread_join(tids[i], NULL);
  ...
}

/* each thread computes one element of c */
static void *mult(void *cp) {
  int i = (double *)cp - c;   /* recover the row index from the pointer argument */
  double sum = 0;
  for (int j = 0; j < 100; j++)
    sum += a[i][j] * b[j];
  c[i] = sum;
  return NULL;
}

There are no shared resources here! (not really)

Adapted from material by J. Kleinöder
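To try the sketch, something along the lines of `gcc -O2 -pthread mvm.c` (the file name is only an example) is typically all that is needed on Linux; `-pthread` pulls in the POSIX threads library, and the elided parts of main() would have to fill a and b with meaningful values.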

POSIX threads pros and cons

. Pros
  . Most basic threading interface
  . Straightforward, manageable API
  . Dynamic generation and destruction of threads
  . Reasonable synchronization primitives
  . Full execution control

. Cons
  . Most basic threading interface
  . Higher-level functions (reductions, synchronization, work distribution, task queueing) must be done “by hand”
  . Only available as a C API
  . Only available on (near-) POSIX-compliant OSs
  . Compiler has no clue about threads

Intel Threading Building Blocks (TBB)

. Introduced by Intel in 2006

. C++ threading library
  . uses POSIX threads “under the hood”
. Programmer works with tasks rather than threads
  . “task stealing” model
. Parallel C++ containers

. Commercial and open source variants exist

A simple parallel loop in TBB: Apply Foo() to every element of an array

#include "tbb/tbb.h" using namespace tbb;

class ApplyFoo {
  float *const my_a;
public:
  void operator()( const blocked_range<size_t>& r ) const {
    float *a = my_a;
    for( size_t i=r.begin(); i!=r.end(); ++i )
      Foo(a[i]);
  }
  ApplyFoo( float a[] ) : my_a(a) {}
};

void ParallelApplyFoo( float a[], size_t n ) {
  parallel_for(blocked_range<size_t>(0,n), ApplyFoo(a));
}

Adapted from the Intel TBB tutorial
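Assuming Foo() is defined elsewhere, such a sketch is typically built with the TBB library linked in, e.g. `g++ -O2 parallel_foo.cpp -ltbb` (command and file name are illustrative only).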

TBB pros and cons

. Pros
  . High-level programming model
  . Task concept is often more natural for real-world problems than the thread concept
  . Built-in parallel (thread-safe) containers
  . Built-in work distribution (configurable, but not too finely)
  . Available for Linux, Windows, MacOS

. Cons
  . C++ only
  . Mapping of threads to resources (cores) is not part of the model
  . “Number of threads” concept only vaguely implemented
  . Dynamic work sharing and task stealing introduce variability, difficult to optimize under ccNUMA constraints
  . Compiler has no clue about threads

Parallel Programming with OpenMP

. “Easy” and portable parallel programming of shared memory computers: OpenMP

. Standardized set of compiler directives & library functions: http://www.openmp.org/
  . Fortran, C and C++ interfaces
. Supported by most/all commercial compilers, GNU starting with 4.2
. Few free tools are available

. An OpenMP program can be written such that it compiles and executes on a single-processor machine simply by ignoring the directives

Shared Memory Model used by OpenMP

. Central concept of OpenMP programming: Threads

. Threads access globally shared memory
. Data can be shared or private
  . shared data is available to all threads (in principle)
  . private data is available only to the thread that owns it

[Figure: four threads (T) around a shared-memory region, each with its own private data]
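A minimal OpenMP sketch (added for illustration, not from the slides) of what shared and private mean in practice; the variable names are arbitrary:

#include <stdio.h>
#include <omp.h>

int main(void) {
  int counter = 0;                        /* shared: one copy, visible to all threads */

  #pragma omp parallel default(none) shared(counter)
  {
    int tid = omp_get_thread_num();       /* private: declared inside the region, one copy per thread */

    #pragma omp atomic
    counter++;                            /* concurrent updates to shared data need synchronization */

    printf("thread %d has its own tid\n", tid);
  }
  printf("threads that incremented the shared counter: %d\n", counter);
  return 0;
}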

OpenMP Program Execution: Fork and Join

. Program start: only the master thread runs
. Parallel region: a team of worker threads is generated (“fork”)
  . they synchronize when leaving the parallel region (“join”)
. Only the master executes the sequential part
  . worker threads usually sleep

. Task and data distribution via directives

. Usually optimal: one thread per core

[Figure: fork-join diagram with thread numbers 0-5]
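The fork-join pattern in its smallest form (illustrative sketch only): the master runs alone, a team is forked at the parallel region, and all threads join at its end.

#include <stdio.h>
#include <omp.h>

int main(void) {
  printf("sequential part: master thread only\n");

  #pragma omp parallel                    /* fork: a team of threads is created */
  {
    printf("parallel region: hello from thread %d of %d\n",
           omp_get_thread_num(), omp_get_num_threads());
  }                                       /* join: implicit barrier at the end of the region */

  printf("sequential part again: workers sleep, master continues\n");
  return 0;
}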

Example: Numerical integration in OpenMP

Approximate the integral by a discrete sum:

  ∫_0^1 f(t) dt ≈ (1/n) Σ_{i=1..n} f(x_i),   with  x_i = (i - 0.5)/n,  i = 1, ..., n

We want  ∫_0^1 4/(1+x²) dx = π  →  solve this in OpenMP

// function to integrate
double f(double x) {
  return 4.0/(1.0+x*x);
}

w=1.0/n;
sum=0.0;
for(i=1; i<=n; ++i) {
  x = w*(i-0.5);
  sum += f(x);
}
pi=w*sum;

... (printout omitted)

Example: Numerical integration in OpenMP

...
pi=0.0;
w=1.0/n;

#pragma omp parallel private(x,sum)   // concurrent execution by a "team of threads"
{
  sum=0.0;

#pragma omp for                       // worksharing among threads
  for(i=1; i<=n; ++i) {
    x = w*(i-0.5);
    sum += f(x);
  }

#pragma omp critical                  // "sequential" execution: one thread at a time
  pi=pi+w*sum;
}
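To build and run a complete version of this example, OpenMP support is enabled in the compiler (e.g. `gcc -fopenmp`, or the corresponding switch of other compilers) and the team size is chosen via the OMP_NUM_THREADS environment variable; note that the loop variable i is made private automatically by the omp for directive.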

OpenMP pros and cons

. Pros
  . High-level programming model
  . Available for Fortran, C, C++
  . Ideal for loop parallelism, some support for task parallelism
  . Built-in work distribution
  . Directive concept is part of the language
  . Good support for incremental parallelization

. Cons
  . Mapping of threads to resources (cores) is not part of the model
  . OpenMP parallelization may interfere with compiler optimization
  . Parallel data structures are not part of the model
  . Only limited synchronization facilities
  . Model revolves around the “parallel region” concept
