Engineering Tool 4 (D-MAVT, AS19)

Day 3

Malte Schwerhoff

http://lec.inf.ethz.ch/mavt/etIV/2019/

Yesterday’s ECTS Project

Function void select_random_students(...):

Master solution:

// N is number of students
idx1 = random_uint(0, N);
idx2 = random_uint(1, N);
idx2 = (idx1 + idx2) % N;

Student solution:

// N is number of students
idx1 = random_uint(0, N);
do {
  idx2 = random_uint(0, N);
} while (idx2 == idx1);

Yesterday’s ECTS Project

Fine-grained vs. coarse-grained locking

Fine-grained:

class Student { ... std::mutex mutex; ... };

class Lecturer {
  ...
  void transfer(Student* s1, Student* s2, ...) {
    // let s1 < s2
    s1->mutex.lock(); s2->mutex.lock();
    // critical section: transfer credits
    s1->mutex.unlock(); s2->mutex.unlock();
  }
  ...
};

class Admin {
  ...
  void work() {
    for (auto student : students)
      student->mutex.lock();
    ...
  }
};

Coarse-grained:

std::mutex global_mx = ...;

class Lecturer {
  ...
  void transfer(Student* s1, Student* s2, ...) {
    global_mx.lock();
    // critical section: transfer credits
    global_mx.unlock();
  }
  ...
};

class Admin {
  ...
  void work() {
    global_mx.lock();
    ...
  }
};

Yesterday’s ECTS Project

Mutual exclusion always (and intentionally) reduces the degree of parallelism in a program.

Guidelines to keep in mind:
• Be careful not to unnecessarily reduce the degree of parallelism
• Make the critical section as small as possible (see the sketch below)

But remember: always benchmark your program to check effective performance gain.
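To illustrate the second guideline, a minimal sketch of shrinking a critical section; the function expensive_computation and the shared results vector are made up for illustration:

#include <mutex>
#include <vector>

std::mutex results_mx;
std::vector<double> results;

double expensive_computation(double input) {
  return input * input; // placeholder for a long-running computation
}

void record(double input) {
  double r = expensive_computation(input); // runs without holding the lock
  results_mx.lock();
  results.push_back(r); // only the shared update is inside the critical section
  results_mx.unlock();
}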

Today’s Agenda

The purpose of today’s lecture is to briefly go over a few other concepts and techniques from the area of parallel programming:
• Locks – the better mutexes
• Beyond deadlocks: livelocks, starvation, fairness
• Inter-thread communication: condition variables, barriers
• Beyond threads: tasks, futures, promises
• Declarative parallelism: the OpenMP paradigm
• Lock-free concurrency
• Compilers and hardware, and the C++ memory model

Locks

Beyond Deadlocks: Dangers of Mutexes

What could go wrong with this program?

void foo(...) {
  some_mutex.lock();
  if (...) {
    ...
    return;
  }
  some_mutex.unlock();
}

What could go wrong with this program?

void foo(...) {
  some_mutex.lock();
  some_object.some_function();
  some_mutex.unlock();
}

• some_function might not terminate …
• … or raise an exception (a “controlled” error)
↪ some_mutex won’t be unlocked and the whole system could deadlock
• Non-termination cannot be prevented systematically (in practice)
↪ Apply the following guidelines (if possible):
• Don’t call unknown functions when holding a mutex
• In particular, do not call virtual member functions!

Beyond Deadlocks: Mutexes and RAII

void foo(...) {
  some_mutex.lock();
  if (...) return;
  some_mutex.unlock();
}

void foo(...) {
  some_mutex.lock();
  fun(); // might throw an exception
  some_mutex.unlock();
}

These situations can be handled systematically, by using the RAII idiom (recall ET2). Core idea:
• Recall that stack-allocated objects (those not instantiated with new) are deallocated when they go out of scope, e.g. at the end of a function

int& foo(...) {
  int x = ...;
  return x; // x goes out of scope; returns reference to dead object
}


• Destructor is automatically called when objects are deallocated
↪ Wrap the mutex in a stack-allocated guard object:
• Guard’s constructor locks the mutex

• Guard’s destructor unlocks it

Beyond Deadlocks: Guarding Mutexes


void foo(...) {
  std::lock_guard guard(some_mutex);
  if (...) return;
}

void foo(...) {
  std::lock_guard guard(some_mutex);
  fun(); // might throw an exception
}

Guard automatically locks mutex – and more importantly, also unlocks it

Beyond Deadlocks: Guarding Mutexes

• Guideline: use locks, i.e. guarded mutexes, whenever possible
  • Not done in this course for simplicity
  • Nevertheless very important: if you remember one thing from today, then this

• Different locks exist (see cppreference.com for details):
  • std::lock_guard: basic lock for a single mutex
  • std::scoped_lock: multiple mutexes, prevents deadlocks (if used exclusively)
  • std::unique_lock: single mutex, more control (e.g. when the mutex is locked)
  • std::shared_lock: for reader-writer situations
    • Many threads only read the shared data in their critical section
    • Few threads write the shared data
    • Reading in parallel is fine, but writers need exclusive access
    • Example: shared phone book; much more often read than updated
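A minimal sketch of the reader-writer case with std::shared_mutex and std::shared_lock (C++17); the phone-book names are made up for illustration:

#include <map>
#include <mutex>
#include <shared_mutex>
#include <string>

std::shared_mutex phonebook_mx;
std::map<std::string, int> phonebook;

int lookup(const std::string& name) {
  std::shared_lock lock(phonebook_mx); // many readers may hold this in parallel
  return phonebook.at(name);
}

void update(const std::string& name, int number) {
  std::unique_lock lock(phonebook_mx); // a writer gets exclusive access
  phonebook[name] = number;
}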

Beyond Deadlocks

Beyond Deadlocks: Livelocks

Solving the Dining Philosophers problem?
• Try to grab left fork
• If successful, try to grab right fork
• If both successful → eat
• Otherwise, put down left fork (if grabbed), wait for a bit and try again
⤷ Any problems with this algorithm?

// simplified C++ code
while (true) {
  left_fork.try_lock(1ms);
  if (left_fork.is_locked()) {
    right_fork.try_lock(1ms);
    if (right_fork.is_locked())
      break; // exit loop to eat
    left_fork.unlock();
  }
  this_thread::sleep_for(1ms);
}
// let’s eat ...

⤷ Possible danger of a livelock:
• All philosophers grab left fork, don’t get right fork, drop forks, wait, repeat ...
• Overall system doesn’t halt (as in a deadlock), but no real progress is made either
• Earlier-shown resource ordering approach also prevents livelocks
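A minimal sketch of the resource-ordering fix, assuming each fork has a unique index: every philosopher locks the lower-indexed fork first, so no cyclic wait (and no drop-and-retry livelock) can occur.

#include <mutex>

// fork indices are assumed to be unique across the table
void dine(std::mutex& left, int left_id,
          std::mutex& right, int right_id) {
  std::mutex& first  = (left_id < right_id) ? left : right;
  std::mutex& second = (left_id < right_id) ? right : left;
  first.lock();  // everyone agrees on this global order
  second.lock();
  // eat, then release in reverse order
  second.unlock();
  first.unlock();
}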

Beyond Deadlocks: Starvation and Fairness


⤷ Starvation is a related problem:
• It can happen that some philosophers never get their forks (shared resources), thus they starve
• Earlier-shown resource ordering approach does not prevent starvation
• A fair scheduler (or fairness-enforcing locking approaches) can help
• Mathematically defining what fairness means is an interesting exercise

Inter-Thread Communication

• In order to synchronise their work, threads often need to exchange information, i.e. communicate

• Already seen:
  • Atomic data types
  • Mutexes

• But there are, of course, many more communication ideas & techniques
  • Most (all?) can be implemented using atomics and mutexes
  • Different techniques have (dis)advantages in different situations
  • As usual: understand the problem, then choose the best-fitting tools

Inter-Thread Communication: Condition Variables

• C++: <condition_variable> library

• Think of a condition variable as a broadcast

• Sender-receiver/producer-consumer is a typical use case:
  • One thread performs some work
  • Then informs another thread (or many others) that the work is done


• Atomic boolean vs. condition variable

Atomic boolean:

T1: // T1 does some work
    atomic_bool = true;

T2: while (!atomic_bool.load());
    // T2 can use T1’s results

+ simple, no internal overhead
– busy waiting: loop consumes CPU time

Condition variable:

T1: // T1 does some work
    cond_var.notify_all();

T2: cond_var.wait(...);
    // T2 can use T1’s results

+ waiting threads sleep, no busy waiting
+ can also notify (wake up) only one thread
– more internal overhead, in particular thread sleep/wake
– further complications (lost wake-up, spurious wake-up)
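A minimal runnable sketch of the condition-variable variant; the names are illustrative, and the predicate passed to wait guards against spurious and lost wake-ups:

#include <condition_variable>
#include <iostream>
#include <mutex>
#include <thread>

std::mutex mx;
std::condition_variable cond_var;
bool done = false;
int result = 0;

void producer() {
  int r = 42; // stands for some actual work
  {
    std::lock_guard<std::mutex> guard(mx);
    result = r;
    done = true;
  }
  cond_var.notify_all();
}

void consumer() {
  std::unique_lock<std::mutex> lock(mx);
  cond_var.wait(lock, []{ return done; }); // predicate is re-checked on each wake-up
  std::cout << result;
}

int main() {
  std::thread t1(producer), t2(consumer);
  t1.join(); t2.join();
}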

Inter-Thread Communication: Barriers

• Situation:
  • Threads solve a problem in phases
  • Phase i+1 can only be started once all (or some) threads completed phase i

• Can already be handled using combinations of what we’ve seen so far (atomics, mutexes, …)
• … but barriers (C++20, Java, …) exist specifically for this use case

Pseudo-code shows an example:
• N worker threads, one supervisor thread
• Workers must all finish 1st phase before starting 2nd, etc.
• Supervisor simply prints the finished phase
• A thread that arrive_and_waits decrements the barrier counter, then sleeps
• Once the barrier counter is down to zero, all waiting threads are woken up and the barrier counter is reset to N+1

barrier end_of_phase(N + 1);

void worker(...) {
  // do work from phase 1
  end_of_phase.arrive_and_wait();
  // do work from phase 2
  ...
}

void supervisor(...) {
  int p = 1;
  while (...) {
    end_of_phase.arrive_and_wait();
    cout << "End of phase " << p++;
  }
}
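A hedged, compilable C++20 version of the pseudo-code; the number of workers and phases are made up for illustration:

#include <barrier>
#include <iostream>
#include <thread>
#include <vector>

constexpr int N = 4; // number of worker threads (assumed)

std::barrier end_of_phase(N + 1); // resets automatically after each phase

void worker(int id) {
  // do work from phase 1
  end_of_phase.arrive_and_wait();
  // do work from phase 2
  end_of_phase.arrive_and_wait();
}

void supervisor() {
  for (int p = 1; p <= 2; ++p) {
    end_of_phase.arrive_and_wait();
    std::cout << "End of phase " << p << "\n";
  }
}

int main() {
  std::vector<std::thread> threads;
  for (int i = 0; i < N; ++i) threads.emplace_back(worker, i);
  std::thread sup(supervisor);
  for (auto& t : threads) t.join();
  sup.join();
}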

Beyond Threads

Beyond Threads: Tasks

• Threads
  • Are ultimately a concept of operating systems (and CPUs)
  • Have (therefore) been “lifted” to programming languages
  • Offer flexibility, but are not necessarily an ideal solution for often-occurring tasks

• Short digression:
  • Task: Print each element in a list
  • Some solutions more directly correspond to the task description than others
  • See the Scala code below for an example

val xs = List(1,2,3,4)

// Solution 1: index-based iteration
for (i <- 0 until xs.length) print(xs(i))

// Solution 2: element-based iteration
for (x <- xs) print(x)

// Solution 3: lambda-function-based iteration
xs foreach print // Closely matches task desc.

Beyond Threads: Tasks

• Typical task in a parallel context (from the main thread’s perspective):
  1. Go off and compute foo(x) for me, while I do something else
  2. I need the result at latest at this point

• Solution with explicit threads works, but:
  • Purpose of the lambda function not obvious (from the task description)
  • Shared variable y → risk
  • Result of foo(x) is needed on first use of y; worker.join() serves as “artificial marker”
  • Imagine complicated control flow (if-elses) where y is first used at different program points
    ⤷ where to best put worker.join()?

int x = ...;
int y;
std::thread worker([&]{ y = foo(x); });
// ... main thread does its thing ...
worker.join(); // here because ...
std::cout << y; // ... this line uses y

→ Observation: complications mainly due to communicating the result

Beyond Threads: Tasks

• More elegant solution: use tasks (std::async) from the <future> library

task-based solution:

int x = ...;
auto y = std::async(foo, x);
// ...
std::cout << y.get();

thread-based solution:

int x = ...;
int y;
std::thread worker([&]{ y = foo(x); });
// ...
worker.join(); // here because ...
std::cout << y; // ... this line uses y

Beyond Threads: Tasks and Futures

More elegant solution: use tasks (std::async) from the <future> library

int x = ...;
auto y = std::async(foo, x);
// ...
std::cout << y.get();

• std::async decides whether or not foo(x) should run in a separate thread (depending on the system’s hardware, system status, complexity of foo, …)
• y is a std::future:
  • A handle to eventually available data
  • y.get() either immediately returns the task’s result, or waits until the result is available

Tasks are a more convenient abstraction over the underlying threads

Beyond Threads: Tasks, Futures, Promises

int x = ...;
auto y = std::async(foo, x);
// ...
std::cout << y.get();

• Futures (std::future):
  • Are the receiving side of a communication channel or pipe between parent (main function/thread) and child (async’ed task)
  • Parent can wait for a future to deliver a value (y.get())
  • Futures are internally protected against data races
• Promises (std::promise):
  • Are the sending side of the channel
  • Not explicitly used in the above example (since not necessary)
  • Also internally protected against races

• Promises and futures provide many functions, offer great flexibility
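Where std::async does not fit, a promise/future pair can be used directly; a minimal sketch, where the value 42 stands for some computed result:

#include <future>
#include <iostream>
#include <thread>

int main() {
  std::promise<int> p;
  std::future<int> f = p.get_future(); // receiving side

  std::thread child([&p]{
    p.set_value(42); // sending side: deliver the result
  });

  std::cout << f.get(); // waits until the value is available
  child.join();
}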



• Take-away message: prefer tasks over threads (if possible)!

• Tasks, futures and promises (or similar concepts) have been added to lots of languages in recent years: C++, Java, Python, Javascript, Scala, …

• No silver bullet: fundamental concurrency problems (race conditions, deadlocks, …) can still occur

Declarative Parallelism

Declarative Parallelism: Declarative vs. Imperative

Assume we have a sequence S of numbers; then S has a maximum.

Mathematics:
Let x ∈ S be maximal, then it holds that ∀y ∈ S · y ≤ x.

Programming:
int x = -∞;
for (int e : S)
  if (x < e) x = e;

Mathematical formulas/texts are declarative: x is declared to be the largest number in S, there’s no need to explicitly find it. Programs are imperative: the computer is instructed to find (i.e. compute) the largest number.

Declarative Parallelism: Declarative vs. Imperative

The declarative style of mathematics is very powerful

Mathematics Programming Let int x = -∞; be maximal, then it holds that for (int e : S) 𝑥𝑥 ∈ 𝑆𝑆 . if (e < x) x = e;

∀𝑦𝑦 ∈ 𝑆𝑆 ⋅ 𝑦𝑦 ≤Mathematics𝑥𝑥 Programming

1 for (k = 0; k < infinity; ++k) Let == ∞ (= ) 2 x += ... ∞ 1 𝜋𝜋 2 𝑥𝑥𝑥𝑥 ∑�𝑘𝑘=0 𝑘𝑘2 6 ??? 𝑘𝑘=0 𝑘𝑘

Declarative Parallelism: Declarative vs. Imperative

• Recall the imperative parallel-sum implementation:
  1. Compute chunk size (#data/#threads)
  2. Fork one sum(array, chunk_begin, chunk_end) worker thread per chunk
  3. Join threads and add up partial sums

• Using a declarative approach, several implementation steps can be left implicit:

#pragma omp parallel for reduction(+:sum)
for (int i = 0; i < vec.size(); i++)
  sum += vec[i];

• The OpenMP #pragma directive (a compiler extension) declares that
  1. the loop can be parallelised (parallel)
  2. the partial sums are reduced to a single value by summation (reduction(+:sum))

• Declarative programming
  • Aims to abstract over machine-level details
  • Helps focusing on the essential steps
  • May not result in the best possible performance (because of its abstractions)
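A complete version of the snippet, as a sketch; it assumes a compiler with OpenMP support (e.g. g++ -fopenmp), and the vector contents are made up:

#include <iostream>
#include <vector>

int main() {
  std::vector<int> vec(1000, 1); // 1000 elements, all 1
  int sum = 0;

  // the loop is parallelised; each thread keeps a private partial
  // sum, and the partial sums are added up at the end
  #pragma omp parallel for reduction(+:sum)
  for (int i = 0; i < (int)vec.size(); i++)
    sum += vec[i];

  std::cout << sum; // prints 1000
}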

Lock-free Concurrency

• Threads waiting for mutexes sleep, but thread sleep/wake-up itself costs time
  • Lots of contention (many competing threads) → sleep/wake-up can be inefficient
  • Also: mutex-holding and sleeping threads slow down the whole system

// std::mutex resource_mx
resource_mx.lock();

• Lock-free concurrency
  • uses atomic operations and “try-repeat-loops”, i.e. busy waiting
  • is used by expert programmers in high-performance code, such as the Linux kernel

// std::atomic_bool resource_atm
while (!resource_atm.load());
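A small sketch of such a try-repeat loop with a standard atomic, here std::atomic_flag used as a simple spinlock (illustrative only):

#include <atomic>

std::atomic_flag spin = ATOMIC_FLAG_INIT;

void lock()   { while (spin.test_and_set()); } // busy-wait until the flag was clear
void unlock() { spin.clear(); }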

Lock-free Concurrency

Example: Linked-list-based stack
Task: Push element onto stack

void push(stack& s, llnode* n) {
  n->nxt = s.top;
  s.top = n;
}

[Figure: the new node (5) sets its nxt pointer to the current top node (3) of the list 3 → 7, then becomes the new top.]

Problem: Race conditions

Suppose two threads push concurrently (e.g. nodes 5 and 0) and both execute:

void push(stack& s, llnode* n) {
  n->nxt = s.top;
  s.top = n;
}

Both threads may read the same old top (node 3) into n->nxt before either updates s.top:

void push(stack& s, llnode* n) {
  n->nxt = /* node 3 */;
  s.top = n;
}

Whichever write to s.top happens last wins, and the other node is lost.

Other erroneous interleavings possible; many more when adding further operations, e.g. removing elements

Lock-free Concurrency


Straightforward solution: prevent data races by guarding the critical section with a mutex

void push(stack& s, llnode* n) {
  s.mutex->lock();
  n->nxt = s.top;
  s.top = n;
  s.mutex->unlock();
}

Lock-free Concurrency


Potentially better-performing solution: use an atomic compare-and-swap (CAS) operation
• Hardware atomically executes CAS
• CAS(p, v1, v2) sets p to v2 if p has value v1
• CAS returns true if the swap was made, otherwise false
• Other atomic operations exist

void push(stack& s, llnode* n) {
  bool b;
  do {
    llnode* curr_top = s.top;
    n->nxt = curr_top;
    b = CAS(s.top, curr_top, n);
  } while (!b);
}

Lock-free code is much harder to reason about, i.e. to argue that it is correct.
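In standard C++ the CAS pseudo-code corresponds to compare_exchange on a std::atomic. A hedged sketch, with the node and stack layout assumed from the slides:

#include <atomic>

struct llnode { int value; llnode* nxt; };
struct stack  { std::atomic<llnode*> top{nullptr}; };

void push(stack& s, llnode* n) {
  llnode* curr_top = s.top.load();
  do {
    n->nxt = curr_top;
    // atomically: if s.top still equals curr_top, set it to n and succeed;
    // otherwise curr_top is updated to the current s.top and we retry
  } while (!s.top.compare_exchange_weak(curr_top, n));
}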

Compilers, Hardware, C++ Memory Model

Compiler Optimisations

x = 1;
y = 2;
std::cout << "Hi";
z = x + 1;
y = x;

• Program can be changed without affecting observable behaviour:
  • x = 1 and y = 2 can be swapped
  • std::cout can be moved anywhere
  • y = 2 could even be removed
  • ...
• Reordering statements is one of many optimisations compilers apply
• Compiler optimisations are very important for performance

↪ Question: When is it OK to reorder reads and writes in concurrent programs?

CPUs and Memory Hierarchies

[Figure: memory hierarchy — CPU, L1 cache (32KB), L2 cache (32MB), main memory (32GB). Memory sizes and speeds are approximated but realistic numbers.]

• CPU reads/writes values from/to main memory, to compute with them …
• … with a hierarchy of memory caches in between
• Faster memory is more expensive, hence smaller: L1 is 5x faster than L2, which is 30x faster than main memory, which is 350x faster than disk

CPUs and Memory Hierarchies

[Figure: multi-core hierarchy — four cores, each with private L1 and L2 caches, sharing an L3 cache and main memory.]

• Multi-core CPUs have caches per core → more complicated hierarchies

CPUs and Memory Hierarchies

• Caches and main memory all need to be synchronized (eventually):
  • synchronising (too) often is inefficient
  • but out-of-sync memory may cause inconsistencies
• In particular: memory writes by one core should be made visible to other cores
• The fun doesn’t stop here: CPUs themselves also reorder operations to improve performance

↪ Question: Which guarantees do developers of parallel programs actually get?

C++ Memory Model

• The C++ memory model acts as the formal contract between programmers and compiler developers
• It defines the semantics (effects) a program’s memory reads and writes have
• It defines three levels of semantics; lower levels weaken programmers’ guarantees but allow more optimisations:
  1. sequential consistency
  2. acquire-release semantics
  3. relaxed semantics
• Sequential consistency is the default mode, and all that we have used in this course

(Visualisation of the three levels from Rainer Grimm’s book “Concurrency with Modern C++”)
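A small sketch of explicitly choosing a weaker level via std::memory_order — a plain event counter, illustrative only:

#include <atomic>

std::atomic<int> counter{0};

void count_event() {
  // counter.fetch_add(1); would use the default: sequential consistency.
  // Relaxed ordering suffices for a counter whose value is only read
  // after all threads have been joined:
  counter.fetch_add(1, std::memory_order_relaxed);
}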

Your Exercise


• Finish the other two exercises, get your OKs from us and submit on Code Expert
• Evaluate this Engineering Tool (especially since it is brand-new)
• Come back for Engineering Tool 5 (not by me, though)

Thank you for your participation!
