CS-XXX: Graduate Programming Languages
Lecture 20 — Shared-Memory Parallelism and Concurrency
Dan Grossman
2012

Concurrency and Parallelism

PL support for concurrency/parallelism is a huge topic
- Increasingly important (not traditionally in PL courses)
- Lots of active research as well as decades-old work

We'll just do explicit threads plus:
- Shared memory (locks and transactions)
- Futures
- Synchronous message passing (Concurrent ML)

We'll skip:
- Process calculi (foundational message-passing)
- Asynchronous methods, join calculus, ...
- Data-parallel languages (e.g., NESL or ZPL)
- ...

Mostly in ML syntax (inference rules where convenient)
- Even though the current OCaml implementation has threads but not parallelism


Concurrency vs. Parallelism

(Terminology not universal, but distinction paramount)

Concurrency is about correctly and efficiently managing access to shared resources
- Examples: operating system, shared hashtable, version control
- Key challenge is responsiveness to external events that may arrive asynchronously and/or simultaneously
- Often provide responsiveness via threads
- Often focus on synchronization

Parallelism is about using extra computational resources to do more useful work per unit time
- Examples: scientific computing, most graphics, a lot of servers
- Key challenge is Amdahl's Law (no sequential bottlenecks)
- Often provide parallelism via threads on different processors and/or to mask I/O latency

Threads

High-level: "Communicating sequential processes"
Low-level: "Multiple stacks plus communication"

From OCaml's thread.mli:

type t (* thread handle; remember we're in module Thread *)
val create : ('a -> 'b) -> 'a -> t  (* run new thread *)
val self : unit -> t                (* what thread is executing this? *)

The code for a thread is in a closure (with hidden fields) and Thread.create actually spawns the thread

Most languages make the same distinction, e.g., Java:
- Create a Thread object (data in fields) with a run method
- Call its start method to actually spawn the thread
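For concreteness, a minimal sketch using the OCaml Thread interface above (Thread.join and Thread.id are from the same module; the build command is just one possibility):

(* Sketch: spawn a thread, then wait for it to finish.
   Compile with the threads library, e.g.:
     ocamlfind ocamlopt -package threads.posix -linkpkg example.ml *)
let () =
  let t = Thread.create (fun msg -> print_endline msg) "hello from child" in
  Thread.join t;  (* suspend the caller until the child terminates *)
  Printf.printf "parent is thread %d\n" (Thread.id (Thread.self ()))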

Why use threads?

One OR more of:
1. Performance (multiprocessor or mask I/O latency)
2. Isolation (separate errors or responsiveness)
3. Natural code structure (1 stack awkward)

It's not just performance

On the other hand, it seems fundamentally harder (for programmers, language implementors, language designers, semanticists) to have multiple threads of execution

One possible formalism (omitting thread-ids)

- Program state is one heap and multiple expressions
- Any ei might "take the next step" and potentially spawn a thread
- A value in the "thread-pool" is removable
- Nondeterministic with interleaving granularity determined by rules

Some example rules for H; e → H'; e'; o (where o ::= · | e):

H; !l → H; H(l); ·

H; e1 → H'; e1'; o
------------------------------
H; e1 e2 → H'; e1' e2; o

H; spawn(v1, v2) → H; 0; (v1 v2)

Formalism continued

The H; e → H'; e'; o judgment is just a helper-judgment for H; T → H'; T', where T ::= · | e; T

H; e → H'; e'; ·
--------------------------------------------------
H; e1; ...; e; ...; en → H'; e1; ...; e'; ...; en

H; e → H'; e'; e''
--------------------------------------------------------
H; e1; ...; e; ...; en → H'; e1; ...; e'; ...; en; e''

H; e1; ...; e_{i-1}; v; e_{i+1}; ...; en → H; e1; ...; e_{i-1}; e_{i+1}; ...; en

Program termination: H; ·

Equivalence just changed

Expressions equivalent in a single-threaded world are not necessarily equivalent in a multithreaded context!

Example in OCaml:

let x, y = ref 0, ref 0 in
create (fun () -> if (!y)=1 then x:=(!x)+1) ();
create (fun () -> if (!x)=1 then y:=(!y)+1) ()   (* 1 *)

Can we replace line (1) with:

create (fun () -> y:=(!y)+1; if (!x)<>1 then y:=(!y)-1) ()

For more compiler gotchas, see "Threads cannot be implemented as a library" by Hans-J. Boehm, PLDI 2005
- Example: C bit-fields or other adjacent fields


Communication

If threads do nothing other threads need to "see," we are done
- Best to do as little communication as possible
- E.g., do not mutate shared data unnecessarily, or hide mutation behind easier-to-use interfaces

One way to communicate: Shared memory
- One thread writes to a ref, another reads it
- Sounds nasty with pre-emptive scheduling
- Hence synchronization mechanisms
  - Taught in O/S for historical reasons!
  - Fundamentally about restricting interleavings

Join

"Fork-join" parallelism: a simple approach, good for "farm out subcomputations then merge results"

(* suspend caller until/unless arg terminates *)
val join : t -> unit

Common pattern:

val fork_join : ('a -> 'b array) -> (* divider *)
                ('b -> 'c) ->       (* conqueror *)
                ('c array -> 'd) -> (* merger *)
                'a ->               (* data *)
                'd

Apply the second argument to each element of the 'b array in parallel, then use the third argument after they are done

See lec20code.ml for implementation and related patterns (untested)
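A minimal sketch of one way to implement fork_join over the Thread interface (lec20code.ml may differ; the option array is just a way to collect results, and recall OCaml threads here give concurrency, not true parallelism):

(* Sketch: divide, conquer each piece in its own thread, then merge *)
let fork_join (divider   : 'a -> 'b array)
              (conqueror : 'b -> 'c)
              (merger    : 'c array -> 'd)
              (data      : 'a) : 'd =
  let pieces  = divider data in
  let results = Array.make (Array.length pieces) None in
  let threads =
    Array.mapi
      (fun i piece ->
        Thread.create (fun () -> results.(i) <- Some (conqueror piece)) ())
      pieces
  in
  Array.iter Thread.join threads;  (* wait for every conqueror to finish *)
  merger (Array.map (function Some c -> c | None -> assert false) results)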

Futures

A different model, without explicit shared memory or message sends
- Easy to implement on top of either, but most models are easily inter-implementable
- See ML file for implementation over shared memory (a rough sketch also follows the locks interface below)

type 'a promise
val future : (unit -> 'a) -> 'a promise  (* do in parallel *)
val force : 'a promise -> 'a             (* may block *)

Essentially fork/join with a value returned?
- Returning a value more functional
- Less structured than "cobegin s1; s2; ... sn" form of fork/join

Locks (a.k.a. mutexes)

(* mutex.mli *)
type t (* a mutex *)
val create : unit -> t
val lock : t -> unit   (* may block *)
val unlock : t -> unit

OCaml locks do not have two common features:
- Reentrancy (changes semantics of lock and unlock)
- Banning nonholder release (changes semantics of unlock)

Also want condition variables (condition.mli), not discussed here
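As noted above under Futures, a minimal sketch of the promise interface over threads and a mutable cell (the lecture's ML file may differ; Thread.join provides the blocking in force):

(* Sketch: a promise is a result cell plus the thread filling it in *)
type 'a promise = 'a option ref * Thread.t

let future (f : unit -> 'a) : 'a promise =
  let cell = ref None in
  let t = Thread.create (fun () -> cell := Some (f ())) () in
  (cell, t)

let force ((cell, t) : 'a promise) : 'a =
  Thread.join t;  (* block until the computation has finished *)
  match !cell with Some v -> v | None -> assert false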

Using locks

Among infinite correct idioms using locks (and more incorrect ones), the most common:
- Determine what data must be "kept in sync"
- Always acquire a lock before accessing that data and release it afterwards
- Have a partial order on all locks, and if a thread holds m1 it can acquire m2 only if m1 < m2 (a sketch follows the next slide)
- Over-synchronizing the hallmark of concurrency inefficiency

The Evolution Problem

Write a new function that needs to update o1 and o2 together.
- What locks should you acquire? In what order?
- There may be no answer that avoids races and deadlocks without breaking old code. (Need a stricter partial order.)
  - Race conditions: See definitions later in lecture
  - Deadlock: Cycle of threads blocked forever

Example (Java's StringBuffer, circa 1.4, slightly abridged):

synchronized void append(StringBuffer sb) {
  int len = sb.length();                        // synchronized
  if (this.count + len > this.value.length) this.expand(...);
  sb.getChars(0, len, this.value, this.count);  // synchronized
  ...
}

Undocumented in 1.4; in 1.5 caller synchronizes on sb if necessary
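As a sketch of the lock-ordering idiom flagged above (the account type, its id field, and xfer are illustrative, not from the lecture code):

(* Sketch: avoid deadlock by always acquiring locks in a fixed global
   order, here the order of the (assumed unique) account ids *)
type account = { id : int; lk : Mutex.t; mutable balance : int }

let xfer amt a b =
  let first, second = if a.id < b.id then a, b else b, a in
  Mutex.lock first.lk;
  Mutex.lock second.lk;
  a.balance <- a.balance - amt;
  b.balance <- b.balance + amt;
  Mutex.unlock second.lk;
  Mutex.unlock first.lk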

Atomic Blocks (Software Transactions)

Java-like: atomic{s}
OCaml-like: atomic : (unit -> 'a) -> 'a

Execute the body/thunk as though no interleaving from other threads
- Allow parallelism unless there are actual run-time memory conflicts (detect and abort/retry)
- Convenience of coarse-grained locking with parallelism of fine-grained locking (or better)
- But language implementation has to do more to detect conflicts (much like garbage collection is convenient but has costs)

Most research on implementation (preserve parallelism unless there are conflicts), but this is not an implementation course

Transactions make things easier

Problems like append and xfer become trivial

So does mixing coarse-grained and fine-grained operations (e.g., hashtable lookup and hashtable resize)

Transactions are great, but not a panacea:
- Application-level races can remain
- Application-level deadlock can remain
- Implementations generally try-and-abort, which is hard for "launch missiles" (e.g., I/O)
- Many software implementations provide a weaker and under-specified semantics if there are data races with non-transactions
- Memory-consistency-model questions remain and may be worse than with locks...
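For instance, with the OCaml-like interface above, xfer needs no lock ordering at all. A hedged sketch: atomic is not an OCaml primitive, so it is stubbed here with one global lock just so the code type-checks, whereas a real implementation would detect conflicts and abort/retry:

(* Stub for the hypothetical atomic : (unit -> 'a) -> 'a *)
let global = Mutex.create ()
let atomic (f : unit -> 'a) : 'a =
  Mutex.lock global;
  let result = try f () with e -> Mutex.unlock global; raise e in
  Mutex.unlock global;
  result

type account = { mutable balance : int }

(* No ordering concerns: the thunk runs as if without interleaving *)
let xfer amt a b =
  atomic (fun () ->
    a.balance <- a.balance - amt;
    b.balance <- b.balance + amt)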


Data races, informally

[More formal definition to follow]

"race condition" means two different things

• Data race: Two threads read/write, write/read, or write/write the same location without intervening synchronization
  – So two conflicting accesses could happen "at the same time"
  – Better name not used: simultaneous access error

• Bad interleaving: Application error due to thread scheduling
  – Different order would not produce error
  – A data-race-free program can have bad interleavings

Bad interleaving example

class Stack<E> {
  … // state used by isEmpty, push, pop
  synchronized boolean isEmpty() { … }
  synchronized void push(E val) { … }
  synchronized E pop() { … }
  E peek() { // this is wrong
    E ans = pop();
    push(ans);
    return ans;
  }
}

A bad interleaving (time flows down):

  Thread 1 (peek)          Thread 2
  E ans = pop();
                           push(x)
                           boolean b = isEmpty()
  push(ans);
  return ans;
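A rough OCaml analogue of the same pitfall (the list-based stack is illustrative): every operation holds the same mutex, so there are no data races, yet peek is still wrong because another thread can run between its pop and its push:

type 'a stack = { m : Mutex.t; mutable items : 'a list }

let push s v = Mutex.lock s.m; s.items <- v :: s.items; Mutex.unlock s.m

let pop s =
  Mutex.lock s.m;
  match s.items with
  | [] -> Mutex.unlock s.m; failwith "empty"
  | x :: rest -> s.items <- rest; Mutex.unlock s.m; x

(* Race-free but still a bad interleaving waiting to happen *)
let peek s =
  let ans = pop s in
  push s ans;
  ans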

Consistent locking

If all mutable, thread-shared memory is consistently guarded by some lock, then data races are impossible

But:
– Bad interleavings can remain: programmer must make critical sections large enough
– Consistent locking is sufficient but not necessary (see the sketch after the next slide)
  • A tool detecting consistent-locking violations might report "problems" even if no data races are possible

Data races, more formally

Let threads T1, …, Tn perform actions:
– Read shared location x
– Write shared location x
– [Successfully] Acquire lock m
– Release lock m
– Thread-local actions (local variables, control flow, arithmetic)
  • Will ignore these

Order in one thread is program order
– Legal orders given by language's single-threaded semantics + reads

[Figure: example trace with threads T1 and T2 performing actions wr(x), rd(y), wr(y), rel(m), acq(m), rd(z), wr(x), rd(x); the same trace is used on the next few slides]
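As flagged above under "Consistent locking", a small sketch of "sufficient but not necessary": the ref is written without holding any lock, but only before the second thread exists (thread creation orders the write before the child's read), so no data race is possible even though a lock-consistency checker could complain:

let r = ref 0

let () =
  r := 42;  (* unlocked write, but no other thread exists yet *)
  let t = Thread.create (fun () -> Printf.printf "%d\n" !r) () in
  Thread.join t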


Data races, more formally

• Two actions conflict if they read/write, write/read, or write/write the same location
  – Different locations not a conflict
  – Read/read not a conflict

Data races, more formally

• Execution [trace] is a partial order over actions a1 < a2
  – Program order: If Ti performs a1 before a2, then a1 < a2
  – Sync order: If a2 = (Ti acquires m) occurs after a1 = (Tj releases m), then a1 < a2
  – Transitivity: If a1 < a2 and a2 < a3, then a1 < a3

• Called the happens-before relation


Data races, more formally

• Finally, a data race is two conflicting actions a1 and a2 unordered by the happens-before relation
  – Neither a1 < a2 nor a2 < a1

Beyond locks

Notion of data race extends to synchronization other than locks
– Just define happens-before appropriately

Why care about data races?

Recall not all race conditions are data races…

So why focus on data races?
• One answer: Find some bugs without application-specific knowledge
• More interesting: Semantics for modern languages very relaxed for programs with data races
  – Else optimizing compilers and hardware too difficult in practice
  – Increases importance of writing data-race-free programs

An example

Can the assertion fail?

// shared memory
a = 0; b = 0;

// Thread 1        // Thread 2
x = a + b;         b = 1;
y = a;             a = 1;
z = a + b;
assert(z>=y);


An example

Can the assertion fail?

// shared memory
a = 0; b = 0;

// Thread 1        // Thread 2
x = a + b;         b = 1;
y = a;             a = 1;
z = a + b;
assert(z>=y);

– Argue assertion cannot fail: a never decreases and b is never negative, so z>=y
– But argument makes implicit assumptions you cannot make in Java, C#, C++, etc. (!)

Common-subexpression elimination

Compilers simplify/optimize code in many ways, e.g.:

// shared memory
a = 0; b = 0;

// Thread 1        // Thread 2
x = a + b;         b = 1;
y = a;             a = 1;
z = x;             // CSE: was z = a + b;
assert(z>=y);

– Now assertion can fail
  – As though third read of a precedes second read of a
– Many compiler optimizations have the effect of reordering/removing/adding memory operations (exceptions: constant-folding, function inlining, …)

A decision…

// shared memory
a = 0; b = 0;

// Thread 1        // Thread 2
x = a + b;         b = 1;
y = a;             a = 1;
z = x;             // was z = a + b;
assert(z>=y);

Language semantics must resolve this tension:
– If assertion can fail, the program is wrong
– If assertion cannot fail, the compiler is wrong

Memory-consistency model

• A memory-consistency model (or memory model) for a shared-memory language specifies which write a read can see
  – Essential part of language definition
  – Widely under-appreciated until last several years

• Natural, strong model is sequential consistency (SC) [Lamport]
  – Intuitive "interleaving semantics" with a global memory

"the results of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program"

Considered too strong

• Under SC, compiler is wrong in our example
  – Must disable any optimization that has effect of reordering memory operations [on mutable, thread-shared memory]

• So modern languages do not guarantee SC
  – Another reason: Disabling optimization insufficient because the hardware also reorders memory operations unless you use very expensive (10x-100x) instructions

• But still need some language semantics to reason about programs…

The "grand compromise"

• Basic idea:
  – Guarantee SC only for "correctly synchronized" programs [Adve]
  – Rely on programmer to synchronize correctly
  – Correctly synchronized == data-race free (DRF)!

• More precisely:
  If every SC execution of a program P has no data races,
  then every execution of P is equivalent to an SC execution
  – Notice we use SC to decide if P has data races

• Known as "DRF implies SC"


Roles under the compromise

• Programmer: write a DRF program
• Language implementor: provide SC assuming program is DRF

But what if there is a data race:
– C++: anything can happen
  • "catch-fire semantics"
  • Just like array-bounds errors, uninitialized data, etc.
– Java/C#: very complicated story
  • Preserve safety/security despite reorderings
  • "DRF implies SC" a theorem about the very-complicated definition

Back to the example

Code has a data race, so program is wrong and compiler is justified

// shared memory
a = 0; b = 0;

// Thread 1        // Thread 2
x = a + b;         b = 1;
y = a;             a = 1;
z = x;             // was z = a + b;
assert(z>=y);


Back to the example

This version is DRF, so the "optimization" is illegal
– Compiler would be wrong: assertion must not fail

// shared memory
a = 0; b = 0;
m a lock

// Thread 1              // Thread 2
sync(m){x = a + b;}      sync(m){b = 1;}
sync(m){y = a;}          sync(m){a = 1;}
sync(m){z = a + b;}
assert(z>=y);

Back to the example

This version is DRF, but the optimization is legal because it does not affect observable behavior: the assertion will not fail

// shared memory
a = 0; b = 0;
m a lock

// Thread 1              // Thread 2
sync(m){                 sync(m){
  x = a + b;               b = 1;
  y = a;                   a = 1;
  z = x; // was a + b    }
}
assert(z>=y);


This version is also DRF and the optimization is illegal How can language implementors know if an optimization obeys “DRF – Volatile fields (cf. C++ atomics) exist precisely for writing implies SC”? Must be aware of threads! [Boehm] “clever” code like this (e.g., lock-free data structures) // shared memory Basically 2.5 rules suffice, without needing inter-thread analysis: volatile int a, b; a = 0; 0. Optimization must be legal for single-threaded programs b = 0; 1. Do not move shared-memory accesses across lock acquires/releases // Thread 1 // Thread 2 – Careful: A callee might do synchronization x = a + b; b = 1; – Can relax this slightly [Effinger-Dean et al 2012] y = a; a = 1; z = a + b; 2. Never add a memory operation not in the program [Boehm] assert(z>=y); – Seems like it would be strange to do this, but there are some non- strange examples (cf. homework problem)


Thread-local or immutable memory

• All these issues go away for memory the compiler knows is thread-local or immutable
  – Could be via static analysis, programmer annotations, or using a mostly functional language

• If the default were thread-local or immutable, maybe we could live with less/no optimization on thread-shared-and-mutable
  – But harder to write reusable libraries

• Most memory is thread-local, just too hard for the compiler to be sure
