Shared Memory Concurrency in the Real World: Working with Relaxed Memory Consistency

Susmit Sarkar

University of Cambridge
University of St Andrews

Jan 2013

The Golden Age: 1945-1959

[Diagram: a Thread connected to Memory]

Memory is an array of values

Susmit Sarkar (St Andrews, Cambridge), Shared Memory Concurrency, Jan 2013

Until: 1962(?)

Burroughs D825 (first multiprocessing computer)

“Outstanding features include truly modular hardware with parallel processing throughout.”

“FUTURE PLANS: The complement of compiling languages is to be expanded.”

Shared Memory Concurrency

Multiple threads with a single shared memory

Question: How do we reason about it?

Answer [1979]: Sequential Consistency

“. . . the result of any execution is the same as if the operations of all the processors were executed in some sequential order, respecting the order specified by the program.” [Lamport, 1979]


Sequential Consistency

Thread 0 Thread 1 Thread 2 Thread 3

(Shared) Memory

Traditional assumption (concurrent algorithms, semantics, verification): Sequential Consistency (SC)

Implies: can use interleaving semantics


False on modern (since 1972) multiprocessors, or with optimizing compilers

Our world is not SC

Not since IBM System 370/158MP (1972)


Nor in x86, ARM, POWER, SPARC, or Itanium

Nor in C, C++, Java


First Example of Things Going Wrong

At the heart of such algorithms (Dekker's, Peterson's) there is usually code like:

Initially: x = 0; y = 0;

Thread 0              Thread 1
x = 1;                y = 1;
if (0 == y)           if (0 == x)
  { CRITICAL1 };        { CRITICAL2 };



Distilling that example

Initially: x = 0; y = 0;

Thread 0    Thread 1
x = 1;      y = 1;
r0 = y;     r1 = x;

Finally: r0 = 0 ∧ r1 = 0??

Forbidden on SC (no interleaving allows that result)

Observed on x86 (630/10000 on a dual-core Intel Core 2)
Observed on Tegra 3 (8.2M/174M)

Aside: Litmus Tests

Essential features of that example (SB, aka Dekker’s)

Litmus Test: Execution with Writes and Reads

Thread 0         Thread 1
a: W[x]=1        c: W[y]=1
   | po             | po
b: R[y]=0        d: R[x]=0

(rf edges: both reads read from the initial state)

Test SB : Allowed

Concentrate on one execution: Allowed or Forbidden?
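For reference, the same test in the machine-readable litmus format consumed by tools such as litmus7 and herd7 (the concrete x86 encoding is an assumption; the slide shows only the abstract execution):

```
X86 SB
{ x=0; y=0; }
 P0          | P1          ;
 MOV [x],$1  | MOV [y],$1  ;
 MOV EAX,[y] | MOV EBX,[x] ;
exists (0:EAX=0 /\ 1:EBX=0)
```

The `exists` clause names the final state of interest; a tool then reports whether that execution is allowed by the model (or observed on hardware).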

What’s going on here?

Multiprocessors (and compilers) incorporate many performance optimisations (local store buffers, cache hierarchies, speculative execution, common subexpression elimination, hoisting code above loops, . . . )

These are:
unobservable by single-threaded code;
sometimes observable by concurrent (multi-threaded) code

Upshot:
No longer a sequentially consistent memory model
Instead, only a relaxed (or weakly consistent) memory model

Relaxed Memory Consistency Models

Real memory consistency models are subtle
Real memory consistency models differ between architectures
Real memory consistency models differ between languages

Real memory consistency models are poorly understood (verification and theory research, until recently, ignored it)
Real memory consistency models for real processors/languages are poorly understood (older work considered idealised models)

Research opportunity for semanticists!

Surely, there are specifications?

Detailed specifications for hardware architectures:

- Intel 64 and IA-32 Software Developer’s Manual
- AMD64 Architecture Programmer’s Manual
- POWER ISA v2.06
- ARM Architecture Reference Manual

Detailed specifications for programming languages:

- ISO/IEC 9899:2011 Programming Languages – C
- ISO/IEC 14882:2011 Programming Languages – C++
- Java Language Specification, Java SE 7 Edition

We’ve looked at the specifications of x86, Power, ARM, Java, C++, C

They are all flawed:
Always confusing
Usually incomplete
Sometimes, Just Wrong

Current Specifications: Unsuitable

“all that horrible, horribly incomprehensible and confusing [...] text that no-one can parse or reason with — not even the people who wrote it”
Anonymous Processor Architect, 2011


Why?

Have to be loose specifications
Cover a wide range of past and future implementations
Should not reveal extraneous (or proprietary) detail
In practice, in informal prose

Fundamental problem: Prose specifications cannot be tested against (or reasoned about)

Inevitably, problems (A Cautionary Tale)

[Slide shows Kernel Traffic #47 (20 Dec 1999), item 1: “spin_unlock() Optimization On Intel”, 20 Nov 1999 - 7 Dec 1999 (143 posts), with three highlighted quotes:]

Manfred Spraul: We can shave spin_unlock() down from about 22 ticks for the "lock; btrl $0,%0" asm code, to 1 tick for a simple "movl $0,%0" instruction, a huge gain.

Ingo Molnar: . . . 4% speedup in a benchmark test, making the optimization very valuable. The same optimization cropped up in the FreeBSD mailing list.

Linus Torvalds: It does NOT WORK! Let the FreeBSD people use it, and let them get faster timings. They will crash, eventually.
Faster CPU's, different effects), and I suspect you can make it happen even with reported that Ingo Molnar noticed a 4% speed-up in a benchmark Manfredcompilers, Spraul: whatever. a simple example like the above. test, making the optimization very valuable. Ingo also added that the same optimization cropped up in the FreeBSD mailing list a few days I might be proven wrong, but I don't think I am. The reason it can return 1 quite legally is that your new previously. But Linus Torvalds poured cold water on the whole thing, "spin_unlock()" isnot serializing any more, so there is Note that another thing is that yes, "btcl" may be the saying: very little effective ordering between the two actions worst possible thing to use for this, and you might test test,It does NOT making WORK! the optimizationwhether a simpler "xor+xchgl" might be better - it's still b = a; serializing because it is locked, but it should be the spin_unlock(); Let the FreBSD people use it, and let them get faster normal 12 cycles that Intel always seems to waste on timings. They will crash, eventually. serializing instructions rather than 22 cycles. as they access completely different data (ie no data dependencies in sight). So what you could end up doing The window may be small, but if you do this, then Elsewhere, he gave a potential (though unlikely) exploit: is equivalent to suddenly spinlocks aren't reliable any more. very valuable. TheAs a completely same made-up example (which op- will probably It doesCPU#1 NOTCPU#2 WORK! The issue is not writes being issued in-orderWe (although cannever shave show the problem in real life,spin but is instructive as unlock()b = a; /* cache miss, we'll delay all the Intel CPU books warn you NOT to assume that an example), imaging running the following test in a this.. */ in-order write behaviour - I bet it won't be the case in loop on multiple CPU's: spin_unlock(); the long run). 
int test_locking(void) spin_lock(); { The issue is that you _have_ to have a serializing a = 1; timizationinstruction in order to make sure that croppedthe processor upstatic int a; /*in protected by spinlock the */ /* cache miss satisfied, the "a" line doesn't re-order things around the unlock. down fromint b; about 22 ticks is bouncing back and forth */ For example, with a simple write, the CPU can legally spin_lock() b gets the value 1 FreeBSD mailingfor the list. “lock;a = 1; Let btrl the $0, FreeBSD %0” people use it, and let them get faster timings. a =asm 0; code,aberrant behavior to was that 1cache inconsistencies tick would for aturnaround simple by Linus: occur randomly. PPro uses lock to signal that the and it returns "1", which is wrong for any working piplines are no longer invalid and the buffers should be Everybody has convinced me that yes, the Intel ordering spinlock. blown out. rules _are_ strong enough that all of this really is legal, and that's what I wanted. I've gotten sane explanations Unlikely? Yes, definitely. Something we are willing to I have seen the behavior Linus describesThey on a hardware willfor why crash, serialization (as opposed to just eventually. the simple live with as a potential bug in any real kernel? Definitely analyzer, BUT ONLY ON SYSTEMS THAT WERE PPRO locked access) is required for the lock() side but not the not. “movl $0,AND ABOVE. I guess the %0 BSD people” must stillinstruction, be on unlock() side, and that lack a of symmetry was what older Pentium hardware and that's why they don't know bothered me the most. Manfred objected that according to the Pentium Processor Family this can bite in some cases. 
Developers Manual, Vol3, Chapter 19.2 Memory Access Ordering, "to Oliver made a strong case that the lack of symmetry can optimize performance, the Pentium processor allows memory reads to Erich Boleyn, an Architect in an IA32 development group at Intel, also be adequately explained by just simply the lack of be reordered ahead of buffered writes in most situations. Internally, replied to Linus, pointing out a possible misconceptionAs in his a completelysymmetry wrt speculation of readsmade vs writes. I feel up exam- CPU reads (cache hits) can be reordered around buffered writes. proposed exploit. Regarding the code Linus posted, Erich replied: comfortable again. Memory reordering does not occur at the pins, reads (cachehuge miss) and gain. writes appear in-order." He concluded from this that the second CPU It will always return 0. You don't need "spin_unlock()" to Thanks, guys, we'll be that much faster due to this.. would never see the spin_unlock() before the "b=a" line. Linus agreed be serializing. Erich then argued that serialization was not required for the lock() that on a Pentium, Manfred was right. However, he quoted in turn The only thing you need is to make sure there is a store side either, but after a long and interesting discussion he apparently from the Pentium Pro manual, "The only enhancement in the in "spin_unlock()", and that is kind ofple, true by the fact . . . was unable to win people over. PentiumPro processor is the added support for speculative reads and that you're changing something to be observable on store-buffer forwarding." He explained: other processors. ( A Pentium is a in-order machine, without any of the The reason for this is that stores can only possibly be In fact, as Peter Samuelson pointed out to me after KT publication interesting speculation wrt reads etc. So on a Pentium observed when all prior instructions have retired (i.e. (and many thanks to him for it): you'll never see the problem. 
the store is not sent outside of the processor until it is "You report that Linus was convinced to do the spinlock optimization But a Pentium is also very uninteresting from a SMP committed state, and the earlier instructions are already committed by that time), so the any loads, on Intel, but apparently someone has since changed his mind back. standpoint these days. It's just too weak with too little See from 2.3.30pre5 and above: per-CPU cache etc.. stores, etc absolutely have to have completed first, cache-miss or not. /* This is why the PPro has the MTRR's - exactly to let the * Sadly, some early PPro chips require the locked access, core do speculation (a Pentium doesn't need MTRR's, as He went on: * otherwise we could just always simply do * it won't re-order anything external to the CPU anyway, Since the instructions for the store in the spin_unlock * #define spin_unlock_string \ and in fact won't even re-order things internally). have to have been externally observed for spin_lock to * "movb $0,%0" * Jeff V. Merkey added: be aquired (presuming a correctly functioning spinlock, * Which is noticeably faster. of course), then the earlier instructions to set "b" to the */ What Linus says here is correct for PPro and above. value of "a" have to have completed first. #define spin_unlock_string \ "lock ; btrl $0,%0"" Using a mov instruction to unlock does work fine on a In general, IA32 is Processor Ordered for cacheable 486 or Pentium SMP system, but as of the PPro, this was -- Ed: [23 Dec 1999 00:00:00 -0800] no longer the case, though the window is so accesses. Speculation doesn't affect this. Also, stores infintesimally small, most kernels don't hit it (Netware are not observed speculatively on other processors. 4/5 uses this method but it's spinlocks understand this There was a long clarification discussion, resulting in a complete and the code is writtne to handle it. The most obvious

Susmit Sarkar (St Andrews, Cambridge) Shared Memory Concurrency Jan 2013 14 / 44 Inevitably, problems (A Cautionary Tale)



[The slide shows the Kernel Traffic #47 digest (20 Dec 1999) of the linux-kernel thread "spin_unlock optimization(i386)" (20 Nov 1999 - 7 Dec 1999, 143 posts; people: Linus Torvalds, Jeff V. Merkey, Erich Boleyn, Manfred Spraul, Peter Samuelson, Ingo Molnar), with the key exchanges highlighted:]

Manfred Spraul thought he'd found a way to shave spin_unlock() down from about 22 ticks for the "lock; btrl $0,%0" asm code, to 1 tick for a simple "movl $0,%0" instruction, a huge gain. Later, he reported that Ingo Molnar noticed a 4% speed-up in a benchmark test, making the optimization very valuable.

Linus Torvalds: "It does NOT WORK! Let the FreeBSD people use it, and let them get faster timings. They will crash, eventually." His worry: the new spin_unlock() is not serializing any more, so the processor could delay a read that happened inside the critical region (maybe it missed a cache line) and return a stale value for any of the reads that _should_ have been serialized by the spinlock. "Note that I actually thought this was a legal optimization, and for a while I had this in the kernel. It crashed. In random ways."

Manfred objected that according to the Pentium Processor Family Developers Manual, Vol 3, Chapter 19.2 (Memory Access Ordering): "to optimize performance, the Pentium processor allows memory reads to be reordered ahead of buffered writes in most situations. Internally, CPU reads (cache hits) can be reordered around buffered writes. Memory reordering does not occur at the pins, reads (cache miss) and writes appear in-order." Linus agreed that on a Pentium, Manfred was right, but quoted the Pentium Pro manual in turn: "The only enhancement in the PentiumPro processor is the added support for speculative reads and store-buffer forwarding." A Pentium is an in-order machine, without any of the interesting speculation wrt reads, so on a Pentium you'll never see the problem; but PPro hardware has speculative reads and store-buffer forwarding, and this can bite in some cases.

Erich Boleyn, an Architect in an IA32 development group at Intel, replied to Linus: "It will always return 0. You don't need 'spin_unlock()' to be serializing." Stores can only be observed by other processors when all prior instructions have retired, so the earlier instructions to set "b" to the value of "a" have to have completed before the unlocking store becomes externally visible.

Linus, finally: "Everybody has convinced me that yes, the Intel ordering rules _are_ strong enough that all of this really is legal . . . I feel comfortable again. Thanks, guys, we'll be that much faster due to this.."

(Postscript, via Peter Samuelson: Linus was apparently later convinced back the other way — from 2.3.30pre5 the kernel source comments "Sadly, some early PPro chips require the locked access, otherwise we could just always simply do movb $0,%0, which is noticeably faster", and defines spin_unlock_string as "lock ; btrl $0,%0".)

Susmit Sarkar (St Andrews, Cambridge) Shared Memory Concurrency Jan 2013 14 / 44 This Tutorial

Understanding Real-world Relaxed Consistency Models

On x86 multiprocessors
On ARM and IBM Power multiprocessors
On ISO C/C++11

Susmit Sarkar (St Andrews, Cambridge) Shared Memory Concurrency Jan 2013 15 / 44 Microarchitecture Interlude: Store Buffering

Storing to memory is expensive
Thread has to gain exclusive ownership of location

In practice, the thread buffers writes . . . letting the thread go ahead if it can
But the thread should still be able to read the new store

x86 has FIFO store buffers

Susmit Sarkar (St Andrews, Cambridge) Shared Memory Concurrency Jan 2013 16 / 44 An Abstract Machine

Thread 0         Thread 1
x = 1            y = 1
r0 = y           r1 = x

[Diagram: Thread 0 and Thread 1 each feed their own FIFO store buffer (T0's Store Buffer, T1's Store Buffer), which drains into the (Shared) Memory.]

Susmit Sarkar (St Andrews, Cambridge) Shared Memory Concurrency Jan 2013 17 / 44 Regaining order when needed

Suppose you wanted to program, e.g., mutual exclusion
Need to regain strong ordering

MFENCE flushes the local write buffer

Locked instructions, e.g. Compare-and-swap (CAS)
CMPXCHG dest, src: compare EAX and dest, then atomically
  if equal, set ZF=1 and load src into dest
  otherwise, clear ZF (ZF=0) and load dest into EAX
A locked instruction flushes the local write buffer and globally locks memory

Susmit Sarkar (St Andrews, Cambridge) Shared Memory Concurrency Jan 2013 18 / 44 x86-TSO: An abstract machine

Thread ••• Thread

Store Buffer ••• Store Buffer    Lock

Shared Memory

Susmit Sarkar (St Andrews, Cambridge) Shared Memory Concurrency Jan 2013 19 / 44 x86-TSO: An abstract machine

Threads in order
FIFO Store Buffer per thread
Global lock for exclusive memory access

Thread ••• Thread

Store Buffer ••• Store Buffer    Lock

Shared Memory

Susmit Sarkar (St Andrews, Cambridge) Shared Memory Concurrency Jan 2013 19 / 44 x86-TSO: An abstract machine

Thread ••• Thread

Store Buffer ••• Store Buffer    Lock
Shared Memory

beh(abstract machine) ⊇ beh(hw), beh(abstract machine) ≠ beh(hw)

Force: Of the internal optimizations of processors, only per-thread FIFO write buffers are visible to programmers.

Still quite a loose spec: unbounded buffers, nondeterministic unbuffering, arbitrary interleaving

Susmit Sarkar (St Andrews, Cambridge) Shared Memory Concurrency Jan 2013 20 / 44 Hardware Models II: ARM and POWER

Different architectures, different models

Very different from x86

Similar to each other

Susmit Sarkar (St Andrews, Cambridge) Shared Memory Concurrency Jan 2013 21 / 44 Example: Message Passing (MP)

Initially: d = 0; f = 0;

Thread 0:        Thread 1:
d = 1;           while (f == 0) {};
f = 1;           r = d;

Finally: r = 0 ??

[Execution diagram — Thread 0: a: W[x]=1 →po b: W[y]=1; Thread 1: c: R[y]=1 →po d: R[x]=0; with b →rf c. Test MP : Allowed]

Forbidden on SC; forbidden on x86-TSO; observed on Power7 (1.7G/167G) . . . and on ARM Tegra3 (138k/16M)

Explaining MP via Microarchitecture

[Execution diagram — Thread 0: a: W[x]=1 →po b: W[y]=1; Thread 1: c: R[y]=1 →po d: R[x]=0; with b →rf c]

Test MP : Allowed

Three possible explanations (at least):
- the thread core executes stores out of order
- stores propagate between threads out of order
- the thread core executes loads out of order

Power and ARM do all three

The Model Structure

Overall structure: [Diagram: Threads connected to the Storage Subsystem, exchanging write requests, write announcements, barrier requests, and barrier acks.]

Some aspects are thread-only, some storage-only, some both
Threads and Storage Subsystem: abstract state machines
Speculative execution in Threads; topology-independent Storage Subsystem
Formally: transitions, guarded by preconditions, change state and synchronize with each other

Propagation of Stores

[Execution diagram — Thread 0: a: W[x]=1; Thread 1: b: R[x]=1 (→rf from a) →po c: W[y]=1; Thread 2: d: R[y]=1 (→rf from c) →po e: R[x]=0]

Test WRC : Allowed

Stores propagate out of order. Furthermore, they propagate to different threads in different orders.

Propagation of Stores: Modeling

One possible microarchitectural explanation

[Diagram: four threads sharing write buffers in pairs (one possible topology), above a shared memory.]

But we do not want to rely on topology

Our model:

[Figure: Threads 1–5, each paired with its own view of memory (Memory 1–5); writes (W) propagate between these per-thread memories, and reads (R) are satisfied locally — there is no single shared memory.]

Coherence

Coherence is provided by all common architectures . . . and most concurrent languages

Definition (Coherence) A linear order of stores to each concrete location (global to the system) that all threads agree on (i.e. they never observe any violation).

Microarchitecturally: Coherence protocols; Writes Propagating; Restrictions on Out-of-Order Core Execution

Note: in the abstract machine, the coherence order is built up dynamically

Coherence Figures

Force of coherence: the following four tests are forbidden.

Test CoRR1 : Forbidden
Thread 0: a: W[x]=2;  Thread 1: b: R[x]=2 (→rf from a) →po c: R[x]=1

Test CoRW : Forbidden
Thread 0: a: R[x]=2 (→rf from c) →po b: W[x]=1;  Thread 1: c: W[x]=2

Test CoWW : Forbidden
Thread 0: a: W[x]=1 →po b: W[x]=2 (with b →co a)

Test CoWR : Forbidden
Thread 0: a: W[x]=1 →po b: R[x]=2 (→rf from c);  Thread 1: c: W[x]=2 (with c →co a)

Regaining order when needed

ARM/POWER have different ways of regaining order for different use cases: dependencies, ISYNC, LWSYNC, SYNC

Dependencies

Address Dependency: the value read by one load is used to calculate the address of a subsequent load/store

Data Dependency: the value read by one load is used to calculate the value of a subsequent store

Control Dependency: the value read by one load is used to decide whether to perform a subsequent store. These sometimes occur naturally in algorithms, and "fake" dependencies are respected as well. Note: loads can still be done early!

Control-Isync Dependency: the value read by one load is used to decide whether to perform a subsequent load, with an intervening isync

Doing MP on POWER: MP+lwsync+ctrlisync

Initially: d = 0; f = 0;

Thread 0:        Thread 1:
st d 1;          loop: ld f rtmp;
lwsync;          cmp rtmp 0;
st f 1;          beq loop;
                 isync;
                 ld d r;

Finally: r = 0 ??

Forbidden (and not observed) on POWER7 and ARM

lwsync prevents write reordering; the control-isync dependency prevents read speculation

[Execution diagram — Thread 0: a: W[x]=1 →lwsync b: W[y]=1; Thread 1: c: R[y]=1 (→rf from b) →ctrlisync d: R[x]=0. Test MP+lwsync+ctrlisync : Forbidden]

Barriers and Cumulativity: Programming on many threads

[Execution diagram — Thread 0: a: W[x]=1; Thread 1: b: R[x]=1 (→rf from a) →lwsync c: W[y]=1; Thread 2: d: R[y]=1 (→rf from c) →ctrlisync e: R[x]=0]

Test WRC+lwsync+ctrlisync : Forbidden

The lwsync is cumulative: it keeps the stores in order for all threads. Flipping the dependency and barrier does not recover SC.

Regaining Order: SB

[Execution diagram — Thread 0: a: W[x]=1 →sync b: R[y]=0; Thread 1: c: W[y]=1 →sync d: R[x]=0]

Test SB+syncs : Forbidden

LWSYNC keeps writes in order (no good here); SYNC additionally waits until preceding writes have propagated everywhere

Programming Language Models

Programming in higher-level languages... are you safe?

1. The program has to be compiled to run on hardware
2. Furthermore, the compiler can optimise as well


Message Passing, Again

Initially: data = 0; flag = 0;

Thread 0:        Thread 1:
data = 1;        r0 = data;
flag = 1;        while (flag == 0) {};
                 r = data;

Finally: r = 0


Initially: data = 0; flag = 0;

Thread 0:        Thread 1:
data = 1;        r0 = data;
flag = 1;        while (flag == 0) {};
                 r = r0;

Finally: r = 0

A compiler doing common subexpression elimination can produce unexpected results even on SC hardware

Data Race Free Models

Should you even be writing programs like that? Idea: no! It is a programmer mistake to write data races.

Basis of C11 Concurrency

C11 Concurrency: An Axiomatic Model

Complete executions are considered (threadwise operational, reading arbitrary values)

Relations defined over them (e.g. happens-before)

Predicate says whether execution is consistent

Further, no consistent execution should have races

Aside: Axiomatic Models

Sometimes easier to prove properties

Theorem (DRF theorem): If a program is data race free, then it has only sequentially consistent behaviour.

x86-TSO has a proved equivalent axiomatic model

So does the Power model (but it is more complicated)

What about low-level algorithms?

C11 introduces marked atomic operations. Races on these are ignored (they do not count as data races).

They can be given memory-order parameters:
- sequentially consistent
- release/acquire
- consume
- relaxed

MP in C11: mark atomics

Mark atomic variables (accesses have memory order parameter)

Initially: d = 0; f = 0;

Thread 0:            Thread 1:
d.store(1,rlx);      while (f.load(rlx) == 0) {};
f.store(1,rlx);      r = d.load(rlx);

Finally: r = 0 ??

(Forbidden on SC, and also forbidden on TSO.) Defined, and possible, in C/C++11. This allows for hardware (and compiler) optimisations.

MP in C11 (contd.): release-acquire synchronization

Mark release stores and acquire loads

Initially: d = 0; f = 0;

Thread 0:            Thread 1:
d.store(1,rlx);      while (f.load(acq) == 0) {};
f.store(1,rel);      r = d.load(rlx);

Finally: r = 0 ??

Forbidden in C/C++11 due to release-acquire synchronization. The implementation must ensure this result is never observed.

Working with Formal Memory Models

Can we implement the C11 model on hardware?

We can now prove correctness of implementation schemes

Compilers require this (they earlier sometimes got it wrong)

Shows correspondence of notions:
- Release-acquire synchronisation ↦ lwsync and ctrlisync
- Transitive part of happens-before ↦ cumulativity
- ...

Working with Formal Memory Models: The Future

What is the memory model used in the Linux kernel?
- An abstraction layer over hardware and compiler
- Relied on by kernel developers to provide "portable kernel code"
- "Documented" at http://www.kernel.org/doc/Documentation/memory-barriers.txt

How do we develop reasoning principles for relaxed memory consistency? Linearisability? Program logics?

Much work to be done!

More details at: http://www.cl.cam.ac.uk/~pes20/weakmemory

- x86-TSO: A Rigorous and Usable Programmer's Model for x86 Multiprocessors [CACM'10]
- Understanding POWER Multiprocessors [PLDI'11]
- Mathematizing C++ Concurrency [POPL'11]

The ppcmem tool at: http://www.cl.cam.ac.uk/~pes20/ppcmem

The cppmem tool at: http://svr-pes20-cppmem.cl.cam.ac.uk/cppmem/