Shared Memory Concurrency in the Real World: Working with Relaxed Memory Consistency

Susmit Sarkar

University of Cambridge
University of St Andrews

Jan 2013

The Golden Age: 1945-1959

[Diagram: a Thread connected to Memory]

Memory is an array of values

Susmit Sarkar (St Andrews, Cambridge), Shared Memory Concurrency, Jan 2013

Until: 1962(?)

Burroughs D825 (first multiprocessing computer)

“Outstanding features include truly modular hardware with parallel processing throughout.”

“FUTURE PLANS: The complement of compiling languages is to be expanded.”

Shared Memory Concurrency

Multiple threads with a single shared memory

Question: How do we reason about it?

Answer [1979]: Sequential Consistency

“. . . the result of any execution is the same as if the operations of all the processors were executed in some sequential order, respecting the order specified by the program.” [Lamport, 1979]


Sequential Consistency

Thread 0 Thread 1 Thread 2 Thread 3

(Shared) Memory

Traditional assumption (concurrent algorithms, semantics, verification): Sequential Consistency (SC)

Implies: can use interleaving semantics


False on modern (since 1972) multiprocessors, or with optimizing compilers

Our world is not SC

Not since IBM System 370/158MP (1972)


Nor in x86, ARM, POWER, SPARC, or Itanium

Nor in C, C++, Java


First Example of Things Going Wrong

At the heart of such algorithms (Dekker's, Peterson's) there is usually code like:

Initially: x = 0; y = 0;

Thread 0              Thread 1
x = 1;                y = 1;
if (0 == y)           if (0 == x)
  { CRITICAL1 };        { CRITICAL2 };



Distilling that example

Initially: x = 0; y = 0;

Thread 0    Thread 1
x = 1;      y = 1;
r0 = y;     r1 = x;

Finally: r0 = 0 ∧ r1 = 0??

Forbidden on SC (no interleaving allows that result)

Observed on x86 (630/10000 on a dual-core Intel Core 2)
Observed on Tegra 3 (8.2M/174M)

Aside: Litmus Tests

Essential features of that example (SB, aka Dekker’s)

Litmus Test: Execution with Writes and Reads

Thread 0         Thread 1
a: W[x]=1        c: W[y]=1
   | po             | po
b: R[y]=0        d: R[x]=0

(rf edges: both reads read from the initial state)

Test SB : Allowed

Concentrate on one execution: Allowed or Forbidden?
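For reference, the same test in the machine-readable litmus format consumed by tools such as litmus7 and herd7 (the concrete x86 encoding is an assumption; the slide shows only the abstract execution):

```
X86 SB
{ x=0; y=0; }
 P0          | P1          ;
 MOV [x],$1  | MOV [y],$1  ;
 MOV EAX,[y] | MOV EBX,[x] ;
exists (0:EAX=0 /\ 1:EBX=0)
```

The `exists` clause names the final state of interest; a tool then reports whether that execution is allowed by the model (or observed on hardware).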

What’s going on here?

Multiprocessors (and compilers) incorporate many performance optimisations (local store buffers, cache hierarchies, speculative execution, common subexpression elimination, hoisting code above loops, . . . )

These are:
unobservable by single-threaded code;
sometimes observable by concurrent (multi-threaded) code

Upshot:
No longer a sequentially consistent memory model
Instead, only a relaxed (or weakly consistent) memory model

Relaxed Memory Consistency Models

Real memory consistency models are subtle
Real memory consistency models differ between architectures
Real memory consistency models differ between languages

Real memory consistency models are poorly understood (verification and theory research, until recently, ignored it)
Real memory consistency models for real processors/languages are poorly understood (older work considered idealised models)

Research opportunity for semanticists!

Surely, there are specifications?

Detailed specifications for hardware architectures:

- Intel 64 and IA-32 Software Developer’s Manual
- AMD64 Architecture Programmer’s Manual
- POWER ISA v2.06
- ARM Architecture Reference Manual

Detailed specifications for programming languages:

- ISO/IEC 9899:2011 Programming Languages – C
- ISO/IEC 14882:2011 Programming Languages – C++
- Java Language Specification, Java SE 7 Edition

We’ve looked at the specifications of x86, Power, ARM, Java, C++, C

They are all flawed:
Always confusing
Usually incomplete
Sometimes, Just Wrong

Current Specifications: Unsuitable

“all that horrible, horribly incomprehensible and confusing [...] text that no-one can parse or reason with — not even the people who wrote it”
Anonymous Processor Architect, 2011


Why?

Have to be loose specifications
Cover a wide range of past and future implementations
Should not reveal extraneous (or proprietary) detail
In practice, in informal prose

Fundamental problem: Prose specifications cannot be tested against (or reasoned about)

Inevitably, problems (A Cautionary Tale)

[Slide shows Kernel Traffic #47 (20 Dec 1999), item 1: “spin_unlock() Optimization On Intel”, 20 Nov 1999 - 7 Dec 1999 (143 posts), with three highlighted quotes:]

Manfred Spraul: We can shave spin_unlock() down from about 22 ticks for the "lock; btrl $0,%0" asm code, to 1 tick for a simple "movl $0,%0" instruction, a huge gain.

Ingo Molnar: . . . 4% speedup in a benchmark test, making the optimization very valuable. The same optimization cropped up in the FreeBSD mailing list.

Linus Torvalds: It does NOT WORK! Let the FreeBSD people use it, and let them get faster timings. They will crash, eventually.
Faster CPU's, different effects), and I suspect you can make it happen even with reported that Ingo Molnar noticed a 4% speed-up in a benchmark Manfredcompilers, Spraul: whatever. a simple example like the above. test, making the optimization very valuable. Ingo also added that the same optimization cropped up in the FreeBSD mailing list a few days I might be proven wrong, but I don't think I am. The reason it can return 1 quite legally is that your new previously. But Linus Torvalds poured cold water on the whole thing, "spin_unlock()" isnot serializing any more, so there is Note that another thing is that yes, "btcl" may be the saying: very little effective ordering between the two actions worst possible thing to use for this, and you might test test,It does NOT making WORK! the optimizationwhether a simpler "xor+xchgl" might be better - it's still b = a; serializing because it is locked, but it should be the spin_unlock(); Let the FreBSD people use it, and let them get faster normal 12 cycles that Intel always seems to waste on timings. They will crash, eventually. serializing instructions rather than 22 cycles. as they access completely different data (ie no data dependencies in sight). So what you could end up doing The window may be small, but if you do this, then Elsewhere, he gave a potential (though unlikely) exploit: is equivalent to suddenly spinlocks aren't reliable any more. very valuable. TheAs a completely same made-up example (which op- will probably It doesCPU#1 NOTCPU#2 WORK! The issue is not writes being issued in-orderWe (although cannever shave show the problem in real life,spin but is instructive as unlock()b = a; /* cache miss, we'll delay all the Intel CPU books warn you NOT to assume that an example), imaging running the following test in a this.. */ in-order write behaviour - I bet it won't be the case in loop on multiple CPU's: spin_unlock(); the long run). 
int test_locking(void) spin_lock(); { The issue is that you _have_ to have a serializing a = 1; timizationinstruction in order to make sure that croppedthe processor upstatic int a; /*in protected by spinlock the */ /* cache miss satisfied, the "a" line doesn't re-order things around the unlock. down fromint b; about 22 ticks is bouncing back and forth */ For example, with a simple write, the CPU can legally spin_lock() b gets the value 1 FreeBSD mailingfor the list. “lock;a = 1; Let btrl the $0, FreeBSD %0” people use it, and let them get faster timings. a =asm 0; code,aberrant behavior to was that 1cache inconsistencies tick would for aturnaround simple by Linus: occur randomly. PPro uses lock to signal that the and it returns "1", which is wrong for any working piplines are no longer invalid and the buffers should be Everybody has convinced me that yes, the Intel ordering spinlock. blown out. rules _are_ strong enough that all of this really is legal, and that's what I wanted. I've gotten sane explanations Unlikely? Yes, definitely. Something we are willing to I have seen the behavior Linus describesThey on a hardware willfor why crash, serialization (as opposed to just eventually. the simple live with as a potential bug in any real kernel? Definitely analyzer, BUT ONLY ON SYSTEMS THAT WERE PPRO locked access) is required for the lock() side but not the not. “movl $0,AND ABOVE. I guess the %0 BSD people” must stillinstruction, be on unlock() side, and that lack a of symmetry was what older Pentium hardware and that's why they don't know bothered me the most. Manfred objected that according to the Pentium Processor Family this can bite in some cases. 
Developers Manual, Vol3, Chapter 19.2 Memory Access Ordering, "to Oliver made a strong case that the lack of symmetry can optimize performance, the Pentium processor allows memory reads to Erich Boleyn, an Architect in an IA32 development group at Intel, also be adequately explained by just simply the lack of be reordered ahead of buffered writes in most situations. Internally, replied to Linus, pointing out a possible misconceptionAs in his a completelysymmetry wrt speculation of readsmade vs writes. I feel up exam- CPU reads (cache hits) can be reordered around buffered writes. proposed exploit. Regarding the code Linus posted, Erich replied: comfortable again. Memory reordering does not occur at the pins, reads (cachehuge miss) and gain. writes appear in-order." He concluded from this that the second CPU It will always return 0. You don't need "spin_unlock()" to Thanks, guys, we'll be that much faster due to this.. would never see the spin_unlock() before the "b=a" line. Linus agreed be serializing. Erich then argued that serialization was not required for the lock() that on a Pentium, Manfred was right. However, he quoted in turn The only thing you need is to make sure there is a store side either, but after a long and interesting discussion he apparently from the Pentium Pro manual, "The only enhancement in the in "spin_unlock()", and that is kind ofple, true by the fact . . . was unable to win people over. PentiumPro processor is the added support for speculative reads and that you're changing something to be observable on store-buffer forwarding." He explained: other processors. ( A Pentium is a in-order machine, without any of the The reason for this is that stores can only possibly be In fact, as Peter Samuelson pointed out to me after KT publication interesting speculation wrt reads etc. So on a Pentium observed when all prior instructions have retired (i.e. (and many thanks to him for it): you'll never see the problem. 
the store is not sent outside of the processor until it is "You report that Linus was convinced to do the spinlock optimization But a Pentium is also very uninteresting from a SMP committed state, and the earlier instructions are already committed by that time), so the any loads, on Intel, but apparently someone has since changed his mind back. standpoint these days. It's just too weak with too little See from 2.3.30pre5 and above: per-CPU cache etc.. stores, etc absolutely have to have completed first, cache-miss or not. /* This is why the PPro has the MTRR's - exactly to let the * Sadly, some early PPro chips require the locked access, core do speculation (a Pentium doesn't need MTRR's, as He went on: * otherwise we could just always simply do * it won't re-order anything external to the CPU anyway, Since the instructions for the store in the spin_unlock * #define spin_unlock_string \ and in fact won't even re-order things internally). have to have been externally observed for spin_lock to * "movb $0,%0" * Jeff V. Merkey added: be aquired (presuming a correctly functioning spinlock, * Which is noticeably faster. of course), then the earlier instructions to set "b" to the */ What Linus says here is correct for PPro and above. value of "a" have to have completed first. #define spin_unlock_string \ "lock ; btrl $0,%0"" Using a mov instruction to unlock does work fine on a In general, IA32 is Processor Ordered for cacheable 486 or Pentium SMP system, but as of the PPro, this was -- Ed: [23 Dec 1999 00:00:00 -0800] no longer the case, though the window is so accesses. Speculation doesn't affect this. Also, stores infintesimally small, most kernels don't hit it (Netware are not observed speculatively on other processors. 4/5 uses this method but it's spinlocks understand this There was a long clarification discussion, resulting in a complete and the code is writtne to handle it. The most obvious

Susmit Sarkar (St Andrews, Cambridge) Shared Memory Concurrency Jan 2013 14 / 44 Inevitably, problems (A Cautionary Tale)



[The slide shows the Kernel Traffic #47 digest (20 Dec 1999) of the linux-kernel thread "spin_unlock optimization(i386)" (20 Nov 1999 - 7 Dec 1999, 143 posts; people: Linus Torvalds, Jeff V. Merkey, Erich Boleyn, Manfred Spraul, Peter Samuelson, Ingo Molnar), with the key exchanges highlighted:]

Manfred Spraul thought he'd found a way to shave spin_unlock() down from about 22 ticks for the "lock; btrl $0,%0" asm code, to 1 tick for a simple "movl $0,%0" instruction, a huge gain. Later, he reported that Ingo Molnar noticed a 4% speed-up in a benchmark test, making the optimization very valuable.

Linus Torvalds: "It does NOT WORK! Let the FreeBSD people use it, and let them get faster timings. They will crash, eventually." His worry: the new spin_unlock() is not serializing any more, so the processor could delay a read that happened inside the critical region (maybe it missed a cache line) and return a stale value for any of the reads that _should_ have been serialized by the spinlock. "Note that I actually thought this was a legal optimization, and for a while I had this in the kernel. It crashed. In random ways."

Manfred objected that according to the Pentium Processor Family Developers Manual, Vol 3, Chapter 19.2 (Memory Access Ordering): "to optimize performance, the Pentium processor allows memory reads to be reordered ahead of buffered writes in most situations. Internally, CPU reads (cache hits) can be reordered around buffered writes. Memory reordering does not occur at the pins, reads (cache miss) and writes appear in-order." Linus agreed that on a Pentium, Manfred was right, but quoted the Pentium Pro manual in turn: "The only enhancement in the PentiumPro processor is the added support for speculative reads and store-buffer forwarding." A Pentium is an in-order machine, without any of the interesting speculation wrt reads, so on a Pentium you'll never see the problem; but PPro hardware has speculative reads and store-buffer forwarding, and this can bite in some cases.

Erich Boleyn, an Architect in an IA32 development group at Intel, replied to Linus: "It will always return 0. You don't need 'spin_unlock()' to be serializing." Stores can only be observed by other processors when all prior instructions have retired, so the earlier instructions to set "b" to the value of "a" have to have completed before the unlocking store becomes externally visible.

Linus, finally: "Everybody has convinced me that yes, the Intel ordering rules _are_ strong enough that all of this really is legal . . . I feel comfortable again. Thanks, guys, we'll be that much faster due to this.."

(Postscript, via Peter Samuelson: Linus was apparently later convinced back the other way — from 2.3.30pre5 the kernel source comments "Sadly, some early PPro chips require the locked access, otherwise we could just always simply do movb $0,%0, which is noticeably faster", and defines spin_unlock_string as "lock ; btrl $0,%0".)

Susmit Sarkar (St Andrews, Cambridge) Shared Memory Concurrency Jan 2013 14 / 44 This Tutorial

Understanding Real-world Relaxed Consistency Models

On x86 multiprocessors
On ARM and IBM Power multiprocessors
On ISO C/C++11

Susmit Sarkar (St Andrews, Cambridge) Shared Memory Concurrency Jan 2013 15 / 44 Microarchitecture Interlude: Store Buffering

Storing to memory is expensive
Thread has to gain exclusive ownership of location

In practice, the thread buffers writes . . . letting the thread go ahead if it can
But the thread should still be able to read the new store

x86 has FIFO store buffers

Susmit Sarkar (St Andrews, Cambridge) Shared Memory Concurrency Jan 2013 16 / 44 An Abstract Machine

Thread 0         Thread 1
x = 1            y = 1
r0 = y           r1 = x

[Diagram: Thread 0 and Thread 1 each feed their own FIFO store buffer (T0's Store Buffer, T1's Store Buffer), which drains into the (Shared) Memory.]

Susmit Sarkar (St Andrews, Cambridge) Shared Memory Concurrency Jan 2013 17 / 44 Regaining order when needed

Suppose you wanted to program, e.g., mutual exclusion
Need to regain strong ordering

MFENCE flushes the local write buffer

Locked instructions, e.g. Compare-and-swap (CAS)
CMPXCHG dest, src: compare EAX and dest, then atomically
  if equal, set ZF=1 and load src into dest
  otherwise, clear ZF (ZF=0) and load dest into EAX
A locked instruction flushes the local write buffer and globally locks memory

Susmit Sarkar (St Andrews, Cambridge) Shared Memory Concurrency Jan 2013 18 / 44 x86-TSO: An abstract machine

Thread ••• Thread

Store Buffer ••• Store Buffer    Lock

Shared Memory

Susmit Sarkar (St Andrews, Cambridge) Shared Memory Concurrency Jan 2013 19 / 44 x86-TSO: An abstract machine

Threads in order
FIFO Store Buffer per thread
Global lock for exclusive memory access

Thread ••• Thread

Store Buffer ••• Store Buffer    Lock

Shared Memory

Susmit Sarkar (St Andrews, Cambridge) Shared Memory Concurrency Jan 2013 19 / 44 x86-TSO: An abstract machine

Thread ••• Thread

Store Buffer ••• Store Buffer    Lock
Shared Memory

beh(abstract machine) ⊇ beh(hw), beh(abstract machine) ≠ beh(hw)

Force: Of the internal optimizations of processors, only per-thread FIFO write buffers are visible to programmers.

Still quite a loose spec: unbounded buffers, nondeterministic unbuffering, arbitrary interleaving

Susmit Sarkar (St Andrews, Cambridge) Shared Memory Concurrency Jan 2013 20 / 44 Hardware Models II: ARM and POWER

Different architectures, different models

Very different from x86

Similar to each other

Susmit Sarkar (St Andrews, Cambridge) Shared Memory Concurrency Jan 2013 21 / 44 Example: Message Passing (MP)

Initially: d = 0; f = 0;

Thread 0:        Thread 1:
d = 1;           while (f == 0) {};
f = 1;           r = d;

Finally: r = 0 ??

[Execution diagram — Thread 0: a: W[x]=1 →po b: W[y]=1; Thread 1: c: R[y]=1 →po d: R[x]=0; with b →rf c. Test MP : Allowed]

Forbidden on SC; forbidden on x86-TSO; observed on Power7 (1.7G/167G) . . . and on ARM Tegra3 (138k/16M)

Explaining MP via Microarchitecture

[Execution diagram — Thread 0: a: W[x]=1 →po b: W[y]=1; Thread 1: c: R[y]=1 →po d: R[x]=0; with b →rf c]

Test MP : Allowed

Three possible explanations (at least):
- the thread core executes stores out of order
- stores propagate between threads out of order
- the thread core executes loads out of order

Power and ARM do all three

The Model Structure

Overall structure: [Diagram: Threads connected to the Storage Subsystem, exchanging write requests, write announcements, barrier requests, and barrier acks.]

Some aspects are thread-only, some storage-only, some both
Threads and Storage Subsystem: abstract state machines
Speculative execution in Threads; topology-independent Storage Subsystem
Formally: transitions, guarded by preconditions, change state and synchronize with each other

Propagation of Stores

[Execution diagram — Thread 0: a: W[x]=1; Thread 1: b: R[x]=1 (→rf from a) →po c: W[y]=1; Thread 2: d: R[y]=1 (→rf from c) →po e: R[x]=0]

Test WRC : Allowed

Stores propagate out of order. Furthermore, they propagate to different threads in different orders.

Propagation of Stores: Modeling

One possible microarchitectural explanation

[Diagram: four threads sharing write buffers in pairs (one possible topology), above a shared memory.]

But we do not want to rely on topology

Our model:

[Figure: Threads 1–5, each paired with its own view of memory (Memory 1–5); writes (W) propagate between these per-thread memories, and reads (R) are satisfied locally — there is no single shared memory.]

Coherence

Coherence is provided by all common architectures . . . and most concurrent languages

Definition (Coherence) A linear order of stores to each concrete location (global to the system) that all threads agree on (i.e. they never observe any violation).

Microarchitecturally: Coherence protocols; Writes Propagating; Restrictions on Out-of-Order Core Execution

Note: in the abstract machine, the coherence order is built up dynamically

Coherence Figures

Force of coherence: the following four tests are forbidden.

Test CoRR1 : Forbidden
Thread 0: a: W[x]=2;  Thread 1: b: R[x]=2 (→rf from a) →po c: R[x]=1

Test CoRW : Forbidden
Thread 0: a: R[x]=2 (→rf from c) →po b: W[x]=1;  Thread 1: c: W[x]=2

Test CoWW : Forbidden
Thread 0: a: W[x]=1 →po b: W[x]=2 (with b →co a)

Test CoWR : Forbidden
Thread 0: a: W[x]=1 →po b: R[x]=2 (→rf from c);  Thread 1: c: W[x]=2 (with c →co a)

Regaining order when needed

ARM/POWER have different ways of regaining order for different use cases: dependencies, ISYNC, LWSYNC, SYNC

Dependencies

Address Dependency: the value read by one load is used to calculate the address of a subsequent load/store

Data Dependency: the value read by one load is used to calculate the value of a subsequent store

Control Dependency: the value read by one load is used to decide whether to perform a subsequent store. These sometimes occur naturally in algorithms, and "fake" dependencies are respected as well. Note: loads can still be done early!

Control-Isync Dependency: the value read by one load is used to decide whether to perform a subsequent load, with an intervening isync

Doing MP on POWER: MP+lwsync+ctrlisync

Initially: d = 0; f = 0;

Thread 0:        Thread 1:
st d 1;          loop: ld f rtmp;
lwsync;          cmp rtmp 0;
st f 1;          beq loop;
                 isync;
                 ld d r;

Finally: r = 0 ??

Forbidden (and not observed) on POWER7 and ARM

lwsync prevents write reordering; the control-isync dependency prevents read speculation

[Execution diagram — Thread 0: a: W[x]=1 →lwsync b: W[y]=1; Thread 1: c: R[y]=1 (→rf from b) →ctrlisync d: R[x]=0. Test MP+lwsync+ctrlisync : Forbidden]

Barriers and Cumulativity: Programming on many threads

[Execution diagram — Thread 0: a: W[x]=1; Thread 1: b: R[x]=1 (→rf from a) →lwsync c: W[y]=1; Thread 2: d: R[y]=1 (→rf from c) →ctrlisync e: R[x]=0]

Test WRC+lwsync+ctrlisync : Forbidden

The lwsync is cumulative: it keeps the stores in order for all threads. Flipping the dependency and barrier does not recover SC.

Regaining Order: SB

[Execution diagram — Thread 0: a: W[x]=1 →sync b: R[y]=0; Thread 1: c: W[y]=1 →sync d: R[x]=0]

Test SB+syncs : Forbidden

LWSYNC keeps writes in order (no good here); SYNC additionally waits until preceding writes have propagated everywhere

Programming Language Models

Programming in higher-level languages... are you safe?

1. The program has to be compiled to run on hardware
2. Furthermore, the compiler can optimise as well


Message Passing, Again

Initially: data = 0; flag = 0;

Thread 0:        Thread 1:
data = 1;        r0 = data;
flag = 1;        while (flag == 0) {};
                 r = data;

Finally: r = 0


Initially: data = 0; flag = 0;

Thread 0:        Thread 1:
data = 1;        r0 = data;
flag = 1;        while (flag == 0) {};
                 r = r0;

Finally: r = 0

A compiler doing common subexpression elimination can produce unexpected results even on SC hardware

Data Race Free Models

Should you even be writing programs like that? Idea: no! It is a programmer mistake to write data races.

Basis of C11 Concurrency

C11 Concurrency: An Axiomatic Model

Complete executions are considered (threadwise operational, reading arbitrary values)

Relations defined over them (e.g. happens-before)

Predicate says whether execution is consistent

Further, no consistent execution should have races

Aside: Axiomatic Models

Sometimes easier to prove properties

Theorem (DRF theorem): If a program is data race free, then it has only sequentially consistent behaviour.

x86-TSO has a proved equivalent axiomatic model

So does the Power model (but it is more complicated)

What about low-level algorithms?

C11 introduces marked atomic operations. Races on these are ignored (they do not count as data races).

They can be given memory-order parameters:
- sequentially consistent
- release/acquire
- consume
- relaxed

MP in C11: mark atomics

Mark atomic variables (accesses have memory order parameter)

Initially: d = 0; f = 0;

Thread 0:            Thread 1:
d.store(1,rlx);      while (f.load(rlx) == 0) {};
f.store(1,rlx);      r = d.load(rlx);

Finally: r = 0 ??

(Forbidden on SC, and also forbidden on TSO.) Defined, and possible, in C/C++11. This allows for hardware (and compiler) optimisations.

MP in C11 (contd.): release-acquire synchronization

Mark release stores and acquire loads

Initially: d = 0; f = 0;

Thread 0:            Thread 1:
d.store(1,rlx);      while (f.load(acq) == 0) {};
f.store(1,rel);      r = d.load(rlx);

Finally: r = 0 ??

Forbidden in C/C++11 due to release-acquire synchronization. The implementation must ensure this result is never observed.

Working with Formal Memory Models

Can we implement the C11 model on hardware?

We can now prove correctness of implementation schemes

Compilers require this (they earlier sometimes got it wrong)

Shows correspondence of notions:
- Release-acquire synchronisation ↦ lwsync and ctrlisync
- Transitive part of happens-before ↦ cumulativity
- ...

Working with Formal Memory Models: The Future

What is the memory model used in the Linux kernel?
- An abstraction layer over hardware and compiler
- Relied on by kernel developers to provide "portable kernel code"
- "Documented" at http://www.kernel.org/doc/Documentation/memory-barriers.txt

How do we develop reasoning principles for relaxed memory consistency? Linearisability? Program logics?

Much work to be done!

More details at: http://www.cl.cam.ac.uk/~pes20/weakmemory

- x86-TSO: A Rigorous and Usable Programmer's Model for x86 Multiprocessors [CACM'10]
- Understanding POWER Multiprocessors [PLDI'11]
- Mathematizing C++ Concurrency [POPL'11]

The ppcmem tool at: http://www.cl.cam.ac.uk/~pes20/ppcmem

The cppmem tool at: http://svr-pes20-cppmem.cl.cam.ac.uk/cppmem/