Lock-free Shared Data Structures
10th Estonian Summer School on Computer and Systems Science
Introduction and Motivation

Eric Ruppert
DisCoVeri Group, York University, Toronto, Canada

August, 2011


Moore’s Law

Gordon Moore [1965]: The number of components in integrated circuits doubled each year between 1958 and 1965.
Continued doubling every 1 to 2 years since.
This yielded exponential increases in performance:
  clock speeds increased
  memory capacity increased
  processing power increased

Moore’s Law for Automobiles

“If the automobile industry advanced as rapidly as the semiconductor industry, a Rolls Royce would get a million miles per gallon, and it would be cheaper to throw it away than to park it.” – Gordon Moore

Limits of Moore’s Law

Exponential improvements cannot go on forever.
Components will soon be getting down to the scale of atoms.
Even now, increases in computing power come from having more processors, not bigger or faster processors.
Big performance gains in the future will come from parallelism.

Multicore Machines

Multicore machines have multiple processor chips.
Each chip has multiple cores.
Each core can run multiple programmes (or processes).
Today: Machines can run 100 processes concurrently.
Future: Many more processes.


The Big Challenge for Computer Science

Most algorithms are designed for one processor.
We must design efficient parallel solutions to problems.
Better parallelization is needed to make use of extra cores.

Amdahl’s Law

Suppose we only know how to do 75% of some task in parallel.

  # Cores   Time required   Speedup
  1         100             1.0
  2         63              1.6
  3         50              2.0
  4         44              2.3
  10        32              3.1
  100       26              3.9
  ∞         25              4.0
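The table above follows Amdahl's law: if a fraction p of the work is parallelizable, the time on c cores is (1 − p) + p/c of the sequential time, so the speedup is 1/((1 − p) + p/c), approaching 1/(1 − p) = 4 as c → ∞. A quick sketch reproducing the speedup column (the function name is mine):

```python
def amdahl_speedup(p, cores):
    """Speedup on `cores` cores when a fraction p of the work is parallelizable."""
    return 1.0 / ((1.0 - p) + p / cores)

# Reproduce the speedup column of the table for p = 0.75.
for c in [1, 2, 3, 4, 10, 100]:
    print(c, round(amdahl_speedup(0.75, c), 1))
```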

One Challenge of Parallelization

Many processes must cooperate efficiently to solve a problem.
Cooperation often requires communication between processes and sharing data:
  load-balancing requires transferring tasks between processes
  distributing input to processes (inputs may arrive online at different processes)
  aggregating results of different processes’ computations
  communicating partial results from one process to another

What These Lectures Are About

Designing implementations of shared data structures.
Desirable properties:
  (reasonably) easy to use
  can be accessed concurrently by several processes
  efficient (in terms of time and space)
  scalable
  (fault-tolerant)


Models of Distributed Systems

Two main models of communication in distributed computing:
  Shared Memory: processes apply operations to shared objects and receive responses.
  Message Passing: processes send messages to one another.

[Figure: processes issue an operation (op) to a shared memory and receive a response.]

Formalizing the Problem

Shared-Memory Model

We will focus on the shared-memory model.
(It closely resembles multicore machines.)
Shared memory contains objects of various types.
Each process can perform operations on the objects.
The operation can change the state of the object and return a response.

Types of Shared Objects

Read-write register
Each register stores a value and provides two operations:
  READ(): returns the current value stored and does not change the value
  WRITE(v): changes the stored value to v and returns ACK
The most basic type of shared object.
Provided in hardware.


Another Type

FIFO Queue
Stores a list of items and provides two operations:
  ENQUEUE(x): adds x to the end of the list and returns ACK
  DEQUEUE(): removes and returns one item from the front of the list

[Figure: a queue holding 17, 24, 97, 12; ENQUEUE(23) adds at the back and DEQUEUE returns 12 from the front.]

Useful for dividing up work among a collection of processes.
Not usually provided in hardware; must be built in software.

Formal Definition of an Object Type

A sequential specification describes how an object behaves when accessed by a single process at a time.
  Q is the set of states
  OP is the set of operations that can be applied to the object
  RES is the set of responses the object can return
  δ ⊆ Q × OP × RES × Q is the transition relation
If (q, op, res, q′) ∈ δ, it means that when op is applied to an object in state q, the operation can return response res and the object can change to state q′.

Example: Read/Write Register

The object stores a natural number and allows processes to write values and read the current value.
  Q = ℕ
  OP = {WRITE(v): v ∈ ℕ} ∪ {READ}
  RES = ℕ ∪ {ACK}
  δ = {(v, WRITE(v′), ACK, v′): v, v′ ∈ ℕ} ∪
      {(v, READ, v, v): v ∈ ℕ}

Example: Queue

A FIFO queue of natural numbers.
  Q = set of all finite sequences of natural numbers
  OP = {ENQUEUE(v): v ∈ ℕ} ∪ {DEQUEUE()}
  RES = ℕ ∪ {NIL, ACK}
  δ = {(σ, ENQUEUE(v), ACK, ⟨v⟩ · σ): σ ∈ Q, v ∈ ℕ} ∪
      {(σ · ⟨v⟩, DEQUEUE(), v, σ): σ ∈ Q, v ∈ ℕ} ∪
      {(⟨⟩, DEQUEUE(), NIL, ⟨⟩)}
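A sequential specification like the queue's can be transcribed directly as a state machine: states are finite sequences, and a step maps (state, operation) to (new state, response). A sketch (function and constant names are mine; following the specification above, ENQUEUE prepends ⟨v⟩ and DEQUEUE removes from the opposite end, so the oldest item leaves first):

```python
NIL, ACK = "NIL", "ACK"

def step(state, op, arg=None):
    """One transition of the queue's sequential specification.
    Returns (new_state, response)."""
    if op == "ENQUEUE":
        return (arg,) + state, ACK          # new state is <arg> . state
    if op == "DEQUEUE":
        if state == ():
            return (), NIL                  # empty queue returns NIL
        return state[:-1], state[-1]        # remove the oldest item
    raise ValueError("unknown operation")

s = ()
s, r = step(s, "ENQUEUE", 23)   # r == "ACK"
s, r = step(s, "ENQUEUE", 17)
s, r = step(s, "DEQUEUE")       # r == 23, the oldest item
```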


How to Share an Object

In reality, performing an operation is not instantaneous; it takes some interval of time to complete.
What happens when processes access an object concurrently?
How should the object behave?

An Example: Read-Write Register

Suppose we have a read-write register initialized to value 0.

[Figure: P performs WRITE(1); Q performs WRITE(2); R performs READ→?, with the operations overlapping in time.]

The READ can output 1 or 2.

An Example: Read-Write Register

Suppose we have a read-write register initialized to value 0.

[Figure: P performs WRITE(1); Q performs WRITE(2) and then READ→?; R performs READ→?.]

Both READs can output 1 or 2. For linearizability, they should both output the same value.

An Example: Read-Write Register

Suppose we have a read-write register initialized to value 0.

[Figure: P performs WRITE(1) and then WRITE(3); Q performs WRITE(2) and then READ→?; R performs READ→?.]

Both READs can output 1, 2 or 3. For linearizability, they should not output 1 and 2.


An Example: Read-Write Register

Suppose we have a read-write register initialized to value 0.

[Figure: P performs WRITE(1); Q performs WRITE(2) and then READ→1; R performs READ→2. Linearization points (marked ?) can be chosen inside each operation’s interval to explain these outputs.]

An Example: Read-Write Register

Suppose we have a read-write register initialized to value 0.

[Figure: the same operations arranged in time so that no choice of points inside the intervals explains READ→1 and READ→2. Not permitted.]

Linearizability

An object (built in hardware or software) is linearizable if its operations seem to occur instantaneously.
More formally: For every execution that uses the objects, we can choose a point (?) inside each operation’s time interval so that all operations would return the same results if they were performed sequentially in the order of their ?s.
The ? is called the linearization point of the operation.

A More Complicated Execution

[Figure: P performs READ→0, READ→1, WRITE(2); Q performs WRITE(1), READ→1; R performs READ→0, READ→2; linearization points are chosen inside each interval.]


Linearizability Example with a Queue

Consider an initially empty queue. What are the possible values for A and B?

[Figure: P performs ENQUEUE(1) and then DEQUEUE→B; Q performs DEQUEUE→A; R performs ENQUEUE(2), with the operations overlapping.]

A = 1, B = 2  OR  A = 2, B = 1  OR  A = NIL, B = 1  OR  A = 1, B = NIL

Alternative Definition of Linearizability

For every execution that uses the objects, there exists a sequential order of all the operations such that
  1. all operations return the same results in the sequential order, and
  2. if op1 ends before op2 begins, then op1 precedes op2 in the sequential order.
This is the original (and equivalent) definition of Herlihy and Wing [1990].
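The alternative definition suggests a brute-force check for tiny executions: try every ordering of the operations that respects the given precedence constraints, replay it against a sequential FIFO queue, and accept if some ordering reproduces all the responses. A sketch (all names are mine; the precedence set below encodes only P's program order, a loose stand-in for the exact intervals in the figure, so it accepts some outcomes the figure's tighter overlaps rule out):

```python
from itertools import permutations
from collections import deque

def linearizable(ops, precedes):
    """ops: list of (name, arg, result); precedes: index pairs (a, b)
    meaning operation a finished before operation b began."""
    for order in permutations(range(len(ops))):
        if any(order.index(a) > order.index(b) for a, b in precedes):
            continue                      # violates real-time order
        q, ok = deque(), True
        for i in order:                   # replay against a FIFO queue
            name, arg, res = ops[i]
            if name == "ENQ":
                q.append(arg)
            else:
                got = q.popleft() if q else "NIL"
                ok = ok and got == res
        if ok:
            return True
    return False

# P: ENQUEUE(1) then DEQUEUE->B;  Q: DEQUEUE->A;  R: ENQUEUE(2).
def outcome(a, b):
    ops = [("ENQ", 1, "ACK"), ("DEQ", None, b),
           ("DEQ", None, a), ("ENQ", 2, "ACK")]
    return linearizable(ops, precedes=[(0, 1)])   # program order at P
```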

Consistency Conditions

Linearizability is a consistency condition. There are many other, weaker ones:
  sequential consistency
  processor consistency
  causal consistency
  etc.
Weaker conditions are easier to implement but harder to use.
We will focus on linearizable objects.

Specifying Correctness: Summary

How to specify correctness of a shared data structure?
  Safety = sequential specification (Q, OP, RES, δ) + linearizability
  Progress: operations terminate (discussed later)
⇒ When using linearizable objects, we can assume operations happen atomically.


What are These Talks About?

Goal: Design shared data structure implementations using basic objects provided by the system (like registers).
  Assume the basic objects are linearizable.
  Prove the implemented objects are also linearizable.
  Study the complexity of implementations.
  Prove some implementations are impossible.

Challenges of Implementing Shared Objects

  Asynchrony (processes run at different, variable speeds).
  Concurrent updates should not corrupt data.
  Reading data while others update it should not produce inconsistent results.
  Other unpredictable events (e.g., failures).

Traditional Approach: Locks

Each object has a lock associated with it.
Only one process can hold the lock at any time.
  1. obtain lock
  2. access object
  3. release lock
Pros: simple; avoids problems of concurrent accesses.
Cons: slow; no real parallelism; no fault-tolerance.

Fine-Grained Locking

Each part of an object has a lock.
  1. obtain locks for the parts you want to access
  2. access those parts of the object
  3. release the locks
Pros: some parallelism; reduces problems of concurrent accesses.
Cons: limited parallelism; no fault-tolerance; danger of deadlock, livelock.


Alternative Approach: Lock-Free Implementations

Avoid the use of locks altogether.
The programmer is responsible for coordination.

Pros:
  permits high parallelism
  fault-tolerance
  slow processes don’t block fast ones
  unimportant processes don’t block important ones
Cons:
  difficult to design

Lock-free Implementations

“The shorter the critical sections, the better. One can think of lock-free synchronization as a limiting case of this trend, reducing critical sections to individual machine instructions.” – Maurice Herlihy

Lock-Free Progress Guarantees

Non-blocking Progress Property
  If processes are doing operations, one eventually finishes.
  No deadlock.
  No fairness guarantee: individual operations may starve.

Wait-free Progress Property
  Each operation eventually finishes.
  Stronger guarantee.
  Hard to implement efficiently.

The Goal

Design lock-free, linearizable shared data structures.


Snapshot Objects

Definition
A (single-writer) snapshot object for n processes:
  stores a vector of n numbers (one per process)
  process i can UPDATE(i, v) to store v in component i
  any process can SCAN to return the entire vector

[Figure: an array with components 1, 2, 3, ..., n, all initially 0.]

Original definition: Afek, Attiya, Dolev, Gafni, Merritt, Shavit [1993]; Anderson [1993]; Aspnes, Herlihy [1990].

Why Build Snapshots?

  Abstraction of the problem of reading a consistent view of several variables
  Making a backup copy of a distributed database
  Saving checkpoints for debugging distributed algorithms
  Used to solve other problems (timestamps, randomized consensus, ...)

Snapshot Algorithm: Attempt #1

Use an array A[1..n] of registers.

  UPDATE(i, v)
    A[i] ← v
    return ACK

  SCAN
    for i ← 1..n
      r_i ← A[i]
    end for
    return (r_1, r_2, ..., r_n)

Convention for pseudocode:
  Shared objects start with capital letters (e.g., A).
  Local objects start with lower-case letters (e.g., i, r_i, v).

The algorithm is simple, but wrong.


A Bad Execution

  A[1] A[2] A[3]   P1                 P2              P3
  0    0    0      SCAN:
                   READ A[1] → 0
                                      UPDATE(1,7):
                                      A[1] ← 7
  7    0    0
                                                      UPDATE(2,9):
                                                      A[2] ← 9
  7    9    0
                   READ A[2] → 9
                   READ A[3] → 0
                   return (0, 9, 0)

Not linearizable: the SCAN returns (0, 9, 0), but UPDATE(1,7) finished before UPDATE(2,9) began, so the object never contained that vector.

Snapshot Algorithm: Attempt #2

Problem: SCAN can read inconsistent values if concurrent UPDATEs occur.
Solution: If a SCAN detects a concurrent UPDATE, it tries again.

A better snapshot implementation:

  UPDATE(i, v)
    A[i] ← v
    return ACK

  SCAN
    READ A[1..n] repeatedly until the same vector is read twice
    return that vector

Does this work?

Another Bad Execution

  A[1] A[2] A[3]   P1                 other processes
  0    0    0      READ A[1] → 0
                                      UPDATE(1,7)
  7    0    0
                                      UPDATE(2,9)
  7    9    0      READ A[2] → 9
                                      UPDATE(2,0)
  7    0    0      READ A[3] → 0
                                      UPDATE(1,0)
  0    0    0      READ A[1] → 0
                                      UPDATE(1,7)
  7    0    0
                                      UPDATE(2,9)
  7    9    0      READ A[2] → 9
                   READ A[3] → 0
                   return (0, 9, 0)

Not linearizable ⇒ the algorithm is still wrong.

Snapshot Algorithm: Attempt #3

ABA Problem
  Reading the same value twice does not mean the register has not changed in between.
Solution
  Attach a counter to each register so that the scanner can really tell when a register has changed.

  UPDATE(i, v)
    A[i] ← (v, counter_i ++)
    return ACK

  SCAN
    READ A[1..n] repeatedly until the same vector is read twice
    return that vector (without counters)

Does this work?
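Attempt #3 can be sketched directly: each register holds a (value, counter) pair, and SCAN retries until it sees two identical vectors of pairs, so the counters rule out the ABA problem. A sequential Python sketch (names are mine; plain lists stand in for the atomic registers, so this only demonstrates the logic, not real concurrency):

```python
N = 3
A = [(0, 0)] * N          # each component holds (value, counter)
counter = [0] * N         # per-process write counters

def update(i, v):
    counter[i] += 1
    A[i] = (v, counter[i])
    return "ACK"

def collect():
    return [A[i] for i in range(N)]

def scan():
    old = collect()
    while True:
        new = collect()
        if new == old:                     # same vector read twice
            return [v for (v, c) in new]   # strip the counters
        old = new

update(0, 7)
update(1, 9)
print(scan())   # -> [7, 9, 0]
```

With no concurrent updates the scan terminates after two collects; the retry loop only matters when updates interleave.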


An Example

  A[1]   A[2]   A[3]    P1                   other processes
  (0,0)  (0,0)  (0,0)   READ A[1] → (0,0)
                                             UPDATE(1,7)
  (7,1)  (0,0)  (0,0)
                                             UPDATE(2,9)
  (7,1)  (9,1)  (0,0)   READ A[2] → (9,1)
                                             UPDATE(2,0)
  (7,1)  (0,2)  (0,0)   READ A[3] → (0,0)
                                             UPDATE(1,0)
  (0,2)  (0,2)  (0,0)   READ A[1] → (0,2)
                                             UPDATE(1,7)
  (7,3)  (0,2)  (0,0)
                                             UPDATE(2,9)
  (7,3)  (9,3)  (0,0)   READ A[2] → (9,3)
                        READ A[3] → (0,0)
                        keep trying...

With counters, the scanner notices the changes. Problem fixed. Is the algorithm linearizable now?

How to Prove the Snapshot Algorithm is Linearizable?

Choose a linearization point (?) for each operation.
Show that operations would return the same results if done in linearization order.
  1. An UPDATE operation is linearized when it does its WRITE.
     ⇒ Invariant: A[1..n] contains the true contents of the snapshot object (ignoring counters).
  2. A SCAN operation is linearized between the last two sets of READs.

[Figure: two identical passes of READ A[1], ..., READ A[n], each returning (a_1, c_1), ..., (a_n, c_n); the linearization point ? is placed between the two passes, where A[i] = (a_i, c_i) for all i.]

What About Progress?

Non-blocking Progress Property
  If processes are doing operations, one eventually finishes.

Clearly, UPDATEs always finish.
SCANs could run forever, but only if UPDATEs keep happening.
⇒ The algorithm is non-blocking.

Measuring Complexity of a Lock-Free Algorithm

Space: n registers (of reasonable size).
Worst-case running time:
  O(1) per UPDATE
  ∞ per SCAN
Worst-case amortized running time:
  O(n²) per UPDATE
  O(n) per SCAN
This means the total number of steps taken in an execution with u UPDATEs and s SCANs is at most O(un² + sn).


Wait-Free Snapshots

Non-blocking Progress Property
  If processes are doing operations, one eventually finishes.

Wait-free Progress Property
  Each operation will eventually finish.

Can we make the snapshot implementation wait-free?

How to Make a Non-blocking Algorithm Wait-free

The Problem
  Fast operations (UPDATEs) prevent slow operations (SCANs) from completing.

Solution
  Fast operations help slow operations finish.

Some kind of helping mechanism is usually needed to achieve fair progress conditions.

A Wait-Free Snapshot Algorithm

Main Idea
  An UPDATE must perform an embedded SCAN and write the result in memory.
  If a SCAN sees many UPDATEs happening, it can return the vector that was found by one of the embedded SCANs.

  UPDATE(i, v)
    s⃗ ← SCAN
    A[i] ← (v, counter_i ++, s⃗)
    return ACK

  SCAN
    READ A[1..n] repeatedly
    if the same vector is read twice then
      return that vector (values only)
    if some A[i] has changed twice then
      return the vector saved in A[i]

Note: uses BIG registers.

Why Is This Algorithm Wait-Free?

If a SCAN reads A[1..n] n + 2 times then either
  two sets of reads will be identical, or
  one component will have changed twice.
Thus, each operation will terminate.

Example (n = 4): counter vectors 0000, 0100, 0110, 1110, 1111, 1121.

Space: n BIG registers.
Worst-case time: O(n²) shared-memory accesses per operation.
Can build BIG registers from small registers, but the number of registers and the time will increase.
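The wait-free algorithm can be sketched by extending the counter-based scan: every UPDATE embeds a SCAN and stores the scanned vector alongside its value, and a SCAN that sees some component change twice returns that component's embedded vector. A sequential Python sketch (names are mine; plain lists stand in for the atomic BIG registers, so this demonstrates the logic only):

```python
N = 3
A = [(0, 0, None)] * N    # (value, counter, embedded scan) per component
counter = [0] * N

def collect():
    return list(A)

def scan():
    old = collect()
    changed = [0] * N
    while True:
        new = collect()
        if new == old:
            return [v for (v, c, s) in new]   # same vector read twice
        for i in range(N):
            if new[i][1] != old[i][1]:
                changed[i] += 1
                if changed[i] == 2:
                    return new[i][2]  # component changed twice: return
                                      # its embedded scan
        old = new

def update(i, v):
    s = scan()                        # embedded scan
    counter[i] += 1
    A[i] = (v, counter[i], s)
    return "ACK"

update(0, 5)
print(scan())   # -> [5, 0, 0]
```

Run sequentially the scan always returns via the double-collect case; the embedded-scan case only fires when updates interleave with a scan.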


Why Is This Algorithm Linearizable?

An UPDATE is linearized at the time it writes (as before).
A SCAN that sees the same values twice is linearized as before.
What about a SCAN that returns the result of an embedded SCAN?
  Linearize it at the same time as the embedded SCAN.
Why is this okay?

[Figure: a SCAN's interval containing the WRITEs of two UPDATEs to A[i]; the second UPDATE's embedded SCAN lies entirely inside the SCAN's interval.]

If a SCAN returns the result of an embedded SCAN, then the linearization point of the embedded SCAN is inside the SCAN.
(This is why we wait for a component to change twice.)

What Have We Learned from Snapshots?

  Accessing a data structure while others are updating it can be difficult.
  Use repeated reads for consistency.
  How to avoid the ABA problem.
  Helping is useful for making algorithms wait-free.
  How to talk about the complexity of implementations.


Further Work on Snapshots

  Multi-writer snapshots: any process can write to any component.
  Better complexity: implementations with better time and space requirements.
  Adaptive snapshots: running time depends on the number of active processes.
  Lower bounds: understand how much time is really required to perform a snapshot.
  Immediate snapshots: UPDATE and SCAN combined into a single operation.
  Partial snapshots: implement (efficiently) SCANs of a few components of a large object.

Shared Counters


Another Example: Unit Counters

Definition
A unit counter stores an integer and provides two operations:
  INC adds one to the stored value (and returns ACK)
  READ returns the stored value (and does not change it)

How can we implement a wait-free unit counter from registers?

Exercise: Linearizability

Consider a counter that is initially 0.

[Figure: P performs INC; Q performs INC, INC; R performs READ→0; S performs READ→3. Choose linearization points (?) to show this execution is linearizable.]

Unit Counter: Attempt #1

Try the obvious thing first.
Idea: Use a single shared register C to store the value.

  INC
    C ← C + 1
    return ACK

  READ
    return C

Is this correct?
Note: C ← C + 1 really means
  temp ← READ C
  WRITE(temp + 1) into C

A Bad Execution

Consider two interleaved INC operations:

  C    P1          P2
  0    INC:        INC:
       READ → 0
                   READ → 0
       WRITE(1)
  1
                   WRITE(1)

Two INCs have happened, but a READ will now output 1.
So the simple algorithm is wrong.


Unit Counters from Snapshots

Idea: Use a snapshot object.
  Let num_i be the number of INCs process i has done.
  Each process stores num_i in its component.
  Use a SCAN to compute Σ_{i=1..n} num_i.

  INC
    UPDATE(i, num_i ++)
    return ACK

  READ
    SCAN and return the sum of the components

Trivial to check that this is linearizable.
Same progress property as the snapshot object.
But SCANs are expensive. Can we do it more efficiently?

Back to Square One

Try using the simplest idea we tried for snapshots.
Idea: Use an array A[1..n]. Process i stores the number of increments it has done in A[i].

  INC
    A[i] ← A[i] + 1
    return ACK

  READ
    sum ← 0
    for i ← 1..n
      sum ← sum + A[i]
    return sum

Clearly wait-free (and more efficient).
Is it linearizable?

Linearizability Argument

Linearize an INC when it WRITEs.
Prove there exists a linearization point for the READs.
Let A[i]_start and A[i]_end be the values of A[i] at the start and end of some READ of the counter.
Let A[i]_read be the value of A[i] when it is actually read.

  A[i]_start ≤ A[i]_read ≤ A[i]_end

  Σ_{i=1..n} A[i]_start ≤ Σ_{i=1..n} A[i]_read ≤ Σ_{i=1..n} A[i]_end

  counter value at start ≤ sum READ returns ≤ counter value at end

Since the counter increases by one at a time, at some time during the READ the counter must have had the value the READ returns.

General Counters

Definition
A general counter stores an integer and provides two operations:
  INC(v) adds v to the stored value (and returns ACK)
  READ returns the stored value (and does not change it)

We can use the same wait-free implementation from snapshots as for unit counters.
Does the more efficient implementation from registers work?


A Bad Execution

Execution of the simpler register-based implementation:

  A[1] A[2] A[3]   P1              P2        P3
  0    0    0      READ A[1] → 0
                                   INC(1)
  1    0    0
                                             INC(2)
  1    2    0      READ A[2] → 2
                   READ A[3] → 0
                   return 2

Not linearizable: the counter has values 0, 1 and 3 but never 2.

Fetch&Increment Counter

Definition
A fetch&increment counter stores an integer and provides one operation:
  FETCH&INC adds 1 to the value and returns the old value.

Fetch&increment counters are useful for implementing timestamps, queues, etc.
How can we implement one from registers?

Ideas for Implementing a Fetch&Increment Object

Idea 1
  Use a single register C. A FETCH&INC performs C ← C + 1.
  Two concurrent FETCH&INCs might only increase the value by 1 (and return the same value). ✗

Idea 2
  Use an array of registers or a snapshot object.
  No obvious way to combine reading the total and incrementing it into one atomic action. ✗

People tried some more ideas...

A Problem!

A lock-free implementation of FETCH&INC from registers is impossible!
Proved by Loui and Abu-Amara and by Herlihy in 1987.
How can we prove a result like this?


Consensus Objects

Definition
A consensus object stores a value (initially ⊥) and provides one operation, PROPOSE(v).
  If the value is ⊥, PROPOSE(v) changes it to v and returns v.
  Else, PROPOSE(v) returns the current value without changing it.
Essentially, the object returns the first value proposed to it.
Very useful for process coordination tasks.
Consensus is very widely studied.

Cannot Implement FETCH&INC From Registers

Proof idea:
  1. There is a lock-free implementation of consensus using FETCH&INC and registers for two processes.
     → Easy.
  2. There is no lock-free implementation of consensus using only registers for two processes.
     → Hard, but we have lots of tools for doing this.
  3. There is no lock-free implementation of FETCH&INC from registers for two (or more) processes.
     → Follows easily from the previous two steps.

Lock-Free Consensus Using FETCH&INC and Registers

Idea:
  The first process to access the FETCH&INC object F wins.
  Write the proposed value in shared memory so the loser can return the winner’s value.

  PROPOSE(v)                 % code for process i ∈ {1, 2}
    A[i] ← v                 % announce my value
    if F.FETCH&INC = 0 then  % I won
      return v
    else                     % I lost
      return A[3 − i]        % return the winner’s value

Exercise: Prove this works.

Cannot Implement Consensus from Registers for Two Processes

Proof Sketch:
  There must eventually be a step by some process when the final decision is actually made.
  Both processes must be able to determine what the decision was.
  No step that accesses a register can do this:
    A process will not notice a READ by the other process.
    A WRITE can be overwritten by the other process.
  Either way, the second process cannot see evidence of the decision made by a step of the first process.
Based on the classic result of Fischer, Lynch and Paterson.
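The two-process protocol can be sketched directly: announce your value, then race on a fetch&increment object, and whoever gets back 0 wins. Python has no hardware fetch&inc, so a lock-based simulation stands in for the primitive (all names are mine):

```python
import threading

announce = [None, None]
_f, _lock = [0], threading.Lock()

def fetch_and_inc():
    with _lock:                 # simulates an atomic FETCH&INC
        old = _f[0]
        _f[0] += 1
        return old

def propose(i, v):              # i is 0 or 1
    announce[i] = v             # announce my value first
    if fetch_and_inc() == 0:
        return v                # I won
    return announce[1 - i]      # I lost: return the winner's value

results = [None, None]
ts = [threading.Thread(
          target=lambda i=i: results.__setitem__(i, propose(i, 10 + i)))
      for i in range(2)]
for t in ts: t.start()
for t in ts: t.join()
# both processes decide the same proposed value
assert results[0] == results[1] and results[0] in (10, 11)
```

The announcement happens before the winner's FETCH&INC, which happens before the loser's, so the loser is guaranteed to see the winner's value.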


Chaos and Confusion

Computers equipped with different primitive objects may or may not be able to implement certain data structures.

Example
  Registers can implement: snapshot objects, counters, ...
  Registers cannot implement: consensus, fetch&inc objects, queues, stacks, ...

Very different from classical models of computing.

Hardware Implications

Tell hardware designers:
  Registers are not good enough.
  Build machines with stronger primitive objects.
But which primitives should we ask for?
To implement a data structure, which primitives do I need?

Universal Objects to the Rescue

Herlihy proved that some object types are universal: they can be used to implement every other type.
Basically, if you can solve consensus, then you can implement every type of object.
So, tell hardware designers to include a universal object. (They do now.)

Herlihy also classified objects:
  registers and snapshot objects are weakest
  fetch&inc objects and stacks are a little bit stronger
  ...
  compare&swap objects and LL/SC are most powerful

Universal Constructions


Compare&Swap Objects

Compare-And-Swap (CAS) Object
Stores a value and provides two operations:
  READ(): returns the value stored in the object without changing it
  CAS(old, new): if the value currently stored is old, change it to new and return old; otherwise return the current value without changing it.

CAS objects are universal.
Multicore systems have hardware CAS operations.

Example: Non-blocking Fetch&Inc from CAS

Idea
  Use a single CAS object to store the value of the counter.
  To increment, try CAS(v, v + 1) until you succeed.

  INC
    loop
      READ the CAS object and store the value in v
      if CAS(v, v + 1) = v then return v
    end loop

Note this algorithm is non-blocking, but not wait-free.
There are much more efficient ways to do this, but at least this shows it is possible.

Wait-free Consensus from CAS

Idea
  Use a CAS object with initial value ⊥.
  The first process to access the object wins and stores its value.

  PROPOSE(v)
    r ← CAS(⊥, v)
    if r = ⊥ then return v
    else return r

(The CAS returns ⊥ exactly when it succeeds, so the winner returns its own value and every loser returns the winner’s.)

The Universal Construction

Goal
  Given the sequential specification of any object, show there is a non-blocking (or wait-free) implementation of it using CAS objects (and registers).

Idea
  The implementation must be linearizable.
  So, processes agree on the linearization of all operations.
  To perform an operation, add it to the end of a list.


A Non-blocking Universal Construction

Example: Implementing a stack.

[Figure sequence: the shared list begins with a blank node. Each process appends a node describing its operation using CAS on the next pointers: P appends PUSH(3) (state ⟨3⟩), Q appends PUSH(7) (state ⟨3, 7⟩), then P appends POP (state back to ⟨3⟩). Each process replays the list from the beginning to compute the object’s state and its own operation’s result.]

The Details

head initially contains the first (blank) node on the list.
state initially contains the initial state of the implemented object.

  DO-OPERATION(op)
    myOp ← new node containing op and a null next pointer
    loop
      h ← CAS(head.next, ⊥, myOp)
      if h = ⊥ then head ← myOp else head ← h
      (state, result) ← δ(state, head.op)   % apply transition function
      if head = myOp then return result
    end loop

Correctness

Linearization ordering: the order of operations on the list.
Progress: Some process eventually reaches the end of the list and adds its operation.
⇒ Non-blocking.
Efficiency: terrible (but it shows any object can be built).
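The construction above can be sketched for a stack: operations are threaded onto a shared list by CAS on the next pointers, and every process replays the list from the beginning to compute the state. A Python sketch (all names are mine; a lock-based CAS on each node's next pointer stands in for the primitive):

```python
import threading

class Node:
    def __init__(self, op):
        self.op = op                # e.g. ("PUSH", 3) or ("POP",)
        self.next = None
        self._lock = threading.Lock()
    def cas_next(self, old, new):
        with self._lock:            # simulated CAS on the next pointer
            cur = self.next
            if cur is old:
                self.next = new
            return cur

root = Node(("NOP",))               # blank first node

def delta(state, op):               # sequential specification of a stack
    if op[0] == "PUSH":
        return state + [op[1]], "ACK"
    if op[0] == "POP":
        return (state[:-1], state[-1]) if state else ([], "empty")
    return state, "ACK"             # the blank NOP node

def do_operation(op):
    my = Node(op)
    head, state, result = root, [], None
    while True:
        nxt = head.cas_next(None, my)         # try to append my node
        head = my if nxt is None else nxt     # advance (to mine or theirs)
        state, result = delta(state, head.op) # replay the operation
        if head is my:
            return result

do_operation(("PUSH", 3))
do_operation(("PUSH", 7))
print(do_operation(("POP",)))   # -> 7
```

Replaying the whole list on every step is what makes the construction so inefficient, exactly as the slide says.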

A Wait-Free Universal Construction

Can we make the algorithm wait-free?
Idea: Add helping.
  Fast processes help slow ones add their operations to the list.
  Processes announce their operations in shared memory.
  Each position on the list is reserved for one process.
  When you reach a position that belongs to process P:
    If P has a pending operation, help P add it.
    Otherwise, try adding your own operation.
  Be careful not to add the same operation to the list twice.

Details

  DO-OPERATION(op)
    Announce[id] ← new node containing op, a null next pointer and added flag = false
    loop
      if Announce[position mod n].added = false then
        myOp ← Announce[position mod n]
      else
        myOp ← Announce[id]
      h ← CAS(head.next, ⊥, myOp)
      if h = ⊥ then head ← myOp else head ← h
      head.added ← true
      position ← position + 1
      (state, result) ← δ(state, head.op)   % apply transition function
      if head = Announce[id] then return result
    end loop


Correctness

Linearization ordering: the order of operations on the list.
Progress: After you announce your operation, at most one other process’s operation will be put in one of your reserved positions.
⇒ Your operation will be added within the next O(n) nodes.
⇒ Wait-free.

Efficiency

Still terribly inefficient (but it shows CAS is universal).
In fact, the time to complete an operation is unbounded (but finite).
Can make it polynomial by sharing information about head:
⇒ If a process falls behind, it can read an up-to-date head value.
Lots of later work on more efficient universal constructions.
(There is also the approach.)
However, handcrafted data structures are still needed for high-efficiency shared data structures (e.g., for libraries).

Stacks

Stack

Definition
A stack stores a sequence of values and provides two operations:
  PUSH(x) adds x to the beginning of the sequence
  POP removes and returns the first value from the sequence


Treiber’s Stack

Idea
  Store the sequence as a linked list (pointers go top to bottom).
  Use CAS to update pointers.

[Figure: Top points to 7 → 3 → 9 → ⊥; a POP swings Top from the node 7 to the node 3.]

Treiber’s Stack: Details

[Figure: PUSH(5) creates a new node 5 whose next pointer is the current top, then swings Top to it with CAS.]

Each operation tries its CAS until it succeeds.

  PUSH(x)
    create a new node v containing x and next pointer ⊥
    loop until a CAS succeeds
      v.next ← Top
      CAS(Top, v.next, v)
    end loop

  POP
    loop
      t ← Top
      if t = ⊥ then return “empty”
      if CAS(Top, t, t.next) = t then
        return t.value
    end loop
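Treiber's stack is short enough to sketch in full. A lock-based CAS object holding the Top pointer (names mine) stands in for hardware CAS; each operation retries its CAS until it succeeds, exactly as in the pseudocode above:

```python
import threading

class CASObject:
    def __init__(self, value=None):
        self._value, self._lock = value, threading.Lock()
    def read(self):
        return self._value
    def cas(self, old, new):
        """Install `new` if the current value is `old`; return the value
        held before the operation (identical to old iff it succeeded)."""
        with self._lock:
            cur = self._value
            if cur is old:
                self._value = new
            return cur

class Node:
    def __init__(self, value, nxt):
        self.value, self.next = value, nxt

Top = CASObject(None)               # None plays the role of ⊥

def push(x):
    v = Node(x, None)
    while True:
        v.next = Top.read()
        if Top.cas(v.next, v) is v.next:   # swing Top to the new node
            return "ACK"

def pop():
    while True:
        t = Top.read()
        if t is None:
            return "empty"
        if Top.cas(t, t.next) is t:        # swing Top past the old top
            return t.value

push(3); push(7)
print(pop())   # -> 7
```

Under contention a failed CAS just means another operation succeeded, so the retry loops make the stack non-blocking.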


Correctness

Linearizability
Linearize each operation at its successful CAS. (Exception: a POP that returns “empty” is linearized at its READ of Top.)
Invariant: the contents of the linked list are the true contents of the stack.

Non-blocking property
Eventually some operation successfully changes Top (or returns “empty” if only POPs are happening on an empty stack).

Efficiency

Problem
Every operation must wait its turn to change Top ⇒ no concurrency, no scalability.

Solution
Use a back-off scheme. While waiting, try to find a matching operation. If a PUSH(x) finds a POP, they can cancel out: PUSH(x) returns ACK, and POP returns x. Linearize the PUSH(x) and then the POP at the moment they meet.

See Hendler, Shavit, Yerushalmi [2010]

ABA Problem Again

Question
Suppose we POP a node v off the stack. Is it safe to free the memory used by v?
It could cause the ABA problem if another process still has a pointer to v:
  Process P wants to POP.
  P reads Top = v and v.next = u.
  Process Q POPs v and u and frees them.
  Later, Top again points to v, and u is in the middle of the stack.
  P performs CAS(Top, v, u). ✗

Solving the ABA Problem

Solution 1
Do garbage collection more carefully: do not free a node while a process holds a pointer to it.
Makes garbage collection expensive.

Solution 2
Store a pointer together with a counter in the same CAS object, e.g. Java’s AtomicStampedReference.
Makes pointer accesses slower. Technically, requires unbounded counters.
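Solution 2 can be sketched with Java's `AtomicStampedReference`, which packs a reference and an `int` stamp into one CAS-able object. This integer-stack variant (names are mine) bumps the stamp on every successful CAS, so a pointer that has gone A → B → A is still distinguishable by its stamp.

```java
import java.util.concurrent.atomic.AtomicStampedReference;

// Treiber-style stack whose Top carries a counter to defeat ABA.
class StampedStack {
    private static class Node {
        final int value;
        Node next;
        Node(int value) { this.value = value; }
    }

    private final AtomicStampedReference<Node> top =
            new AtomicStampedReference<>(null, 0);

    public void push(int x) {
        Node v = new Node(x);
        int[] stamp = new int[1];
        while (true) {
            Node t = top.get(stamp);          // reads pointer and stamp together
            v.next = t;
            // CAS succeeds only if BOTH the pointer and the stamp are unchanged.
            if (top.compareAndSet(t, v, stamp[0], stamp[0] + 1)) return;
        }
    }

    public Integer pop() {
        int[] stamp = new int[1];
        while (true) {
            Node t = top.get(stamp);
            if (t == null) return null;       // "empty"
            if (top.compareAndSet(t, t.next, stamp[0], stamp[0] + 1)) return t.value;
        }
    }
}
```

A CAS by a slow process P now fails even if Top points to the same node again, because the intervening operations incremented the stamp.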


Sets (Using Binary Search Trees)

Set Objects

Definition
A set object stores a set of items and provides three operations:
FIND(x): returns true if x is in the set, or false otherwise.
INSERT(x): adds x to the set and returns true (or false if x was already in the set).
DELETE(x): removes x from the set and returns true (or false if x was not in the set).

One of the most commonly used types of objects.
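To pin down the return values of the three operations, here is a trivial sequential reference implementation in Java (the `SetObject` interface and `SequentialSet` class are illustrative names of mine; the lock-free structures below must provide the same behaviour without locks).

```java
import java.util.HashSet;

// The set-object interface from the definition above.
interface SetObject<T> {
    boolean find(T x);
    boolean insert(T x);   // false if x was already in the set
    boolean delete(T x);   // false if x was not in the set
}

// Sequential reference implementation: defines the expected semantics only.
class SequentialSet<T> implements SetObject<T> {
    private final HashSet<T> items = new HashSet<>();
    public boolean find(T x)   { return items.contains(x); }
    public boolean insert(T x) { return items.add(x); }
    public boolean delete(T x) { return items.remove(x); }
}
```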

Ways of Implementing a Set Object

Sorted linked list: well-studied, but slow.
Skip list: provided in the Java standard library; hard to guarantee good running time.
Hash table: they exist; harder to guarantee good running time; can be implemented efficiently in Java.
Search tree: hard to balance (but maybe soon...).

Binary Search Trees

Definition
One square node (leaf) for each item in the set.
Round nodes (internal nodes) are used to route FIND operations to the right leaf.
Each internal node has a left child and a right child.
BST property: at an internal node with key K, items < K are in the left subtree and items ≥ K are in the right subtree.

Example: [Figure: a leaf-oriented BST storing {A, B, C, F}, with internal nodes keyed B, C, E routing to the leaves A, B, C, F.]


Non-blocking BST

A non-blocking implementation of BSTs from single-word CAS.
Some properties:
  Conceptually simple
  Fast searches
  Concurrent updates to different parts of the tree do not conflict
  Technique seems generalizable
  Experiments show good performance
  But unbalanced
Ellen, Fatourou, Ruppert, van Breugel [2010].

Insertion (non-concurrent version)

INSERT(D):
1 Search for D
2 Remember the leaf and its parent
3 Create a new leaf, a replacement leaf, and one internal node
4 Swing the pointer

[Figure (before): a leaf-oriented BST with internal nodes B, C, E and leaves A, B, C, F.]

[Figure (after): the leaf C has been replaced by a new internal node D whose children are the old leaf C and the new leaf D.]

Deletion (non-concurrent version)

DELETE(C):
1 Search for C
2 Remember the leaf, its parent, and grandparent
3 Swing the pointer

[Figure (before): the same tree storing {A, B, C, F}, with the leaf C below internal nodes C and E.]


[Figure (after): the grandparent’s child pointer has been swung past C’s parent, directly to C’s sibling; the leaf C and its parent are unlinked.]

Challenges of Concurrent Operations (1)

Concurrent DELETE(C) and INSERT(D) ⇒ D is not reachable!

[Figure: the INSERT links the new leaf D below an internal node that the DELETE has just unlinked from the tree.]
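The two sequential updates can be sketched together in Java on a leaf-oriented BST (a single-threaded sketch with names of my own; the actual algorithm performs the pointer swings with CAS, guarded by the flags and marks the lecture introduces next).

```java
// Sequential sketch of leaf-oriented ("external") BST updates:
// items live in leaves; internal nodes only route searches.
class ExternalBST {
    static class Node {
        final int key;
        Node left, right;            // both null  =>  leaf
        Node(int k) { key = k; }
    }
    private Node root;               // null  =>  empty set

    public boolean find(int x) {
        Node n = root;
        if (n == null) return false;
        while (n.left != null) n = (x < n.key) ? n.left : n.right;  // < K left, >= K right
        return n.key == x;
    }

    public boolean insert(int x) {
        if (root == null) { root = new Node(x); return true; }
        Node parent = null, leaf = root;                 // 1-2: search, remember leaf and parent
        while (leaf.left != null) { parent = leaf; leaf = (x < leaf.key) ? leaf.left : leaf.right; }
        if (leaf.key == x) return false;
        Node newLeaf = new Node(x);                      // 3: new leaf, replacement leaf position,
        Node internal = new Node(Math.max(x, leaf.key)); //    and one internal node
        internal.left  = (x < leaf.key) ? newLeaf : leaf;
        internal.right = (x < leaf.key) ? leaf : newLeaf;
        if (parent == null) root = internal;             // 4: swing the pointer
        else if (parent.left == leaf) parent.left = internal;
        else parent.right = internal;
        return true;
    }

    public boolean delete(int x) {
        if (root == null) return false;
        Node gp = null, parent = null, leaf = root;      // 1-2: remember leaf, parent, grandparent
        while (leaf.left != null) { gp = parent; parent = leaf; leaf = (x < leaf.key) ? leaf.left : leaf.right; }
        if (leaf.key != x) return false;
        if (parent == null) { root = null; return true; }  // deleting the only item
        Node sibling = (parent.left == leaf) ? parent.right : parent.left;
        if (gp == null) root = sibling;                  // 3: swing grandparent's pointer
        else if (gp.left == parent) gp.left = sibling;   //    past the parent to the sibling
        else gp.right = sibling;
        return true;
    }
}
```

Run sequentially this is correct; as the slides show, the naive concurrent version loses updates, which is exactly what the flag/mark coordination fixes.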

Challenges of Concurrent Operations (2)

Concurrent DELETE(B) and DELETE(C) ⇒ C is still reachable!

Coordination Required

Crucial problem: a node’s child pointer is changed while the node is being removed from the tree.
Solution: updates to the same part of the tree must coordinate.

Desirable Properties of a Coordination Scheme
  No locks
  Maintain the invariant that the tree is always a BST
  Allow searches to pass unhindered
  Make updates as local as possible
  Algorithmic simplicity


Flags and Marks

An internal node can be either flagged or marked (but not both). Status is changed using CAS.

Flag
  Indicates that an update is changing a child pointer.
  Before changing an internal node’s child pointer, flag the node.
  Unflag the node after its child pointer has been changed.

Mark
  Indicates an internal node has been (or soon will be) removed from the tree.
  Before removing an internal node, mark it.
  The node remains marked forever.

Insertion Algorithm

INSERT(D):
1 Search for D
2 Remember the leaf and its parent
3 Create three new nodes
4 Flag the parent (if this fails, retry from scratch)
5 Swing the pointer (using CAS)
6 Unflag the parent

Deletion Algorithm

DELETE(C):
1 Search for C
2 Remember the leaf, its parent, and grandparent
3 Flag the grandparent (if this fails, retry from scratch)
4 Mark the parent (if this fails, unflag the grandparent and retry from scratch)
5 Swing the pointer (using CAS)
6 Unflag the grandparent

Recall: Problem with Concurrent DELETEs

Concurrent DELETE(B) and DELETE(C) ⇒ C is still reachable! ✗


Conflicting Deletions Now Work

Concurrent DELETE(B) and DELETE(C):

Case I: DELETE(C)’s flag succeeds.
⇒ Even if DELETE(B)’s flag succeeds, its mark will fail.
⇒ DELETE(C) will complete; DELETE(B) will retry.

Case II: DELETE(B)’s flag and mark succeed.
⇒ DELETE(C)’s flag fails.
⇒ DELETE(B) will complete; DELETE(C) will retry.

Locks

Can think of a flag or mark as a lock on the child pointers of a node.
A flag corresponds to temporary ownership of the lock.
A mark corresponds to permanent ownership of the lock.
If you try to acquire a lock when it is already held, the CAS will fail.

Wait a second . . .

Problem
The implementation is supposed to be lock-free!

Solution
Whenever “locking” a node, leave a key under the doormat: a flag or mark is actually a pointer to a small record that tells a process how to help the original operation.

If an operation fails to acquire a lock, it helps complete the update that holds the lock before retrying.

Thus, locks are owned by operations, not processes.


Searching

SEARCHes just traverse edges of the BST until reaching a leaf. They can ignore flags and marks. This makes them very fast.

But this means SEARCHes can
  go into marked nodes,
  travel through whole marked sections of the tree, and
  possibly return later into old, unmarked sections of the tree.
How do we linearize such searches?

Linearizing Searches

Lemma
Each node visited by SEARCH(K) was on the search path for K
  after the SEARCH started, and
  before the SEARCH entered the node.

Proof sketch: suppose SEARCH(K) visits nodes v1, v2, ..., vm.
Base: the dummy root v1 never changes ⇒ it is always on the search path.
Inductive step: assume vi is on the search path at time Ti. The SEARCH read vi+1 as a child of vi after entering vi.
Case 1: vi+1 was already vi’s child at Ti ⇒ vi+1 was on the search path at Ti.
Case 2: vi+1 became vi’s child after Ti. When vi+1 was added, vi was still on the search path (because nodes never get new ancestors), so vi+1 was on the search path then. ∎

Progress

Goal: show the data structure is non-blocking.
If an INSERT successfully flags, it finishes.
If a DELETE successfully flags and marks, it finishes.
If updates stop happening, SEARCHes must finish.
One CAS fails only if another succeeds.
⇒ A successful CAS guarantees progress, except for a DELETE’s flag.

Progress: The Hard Case

A DELETE may flag, then fail to mark, then unflag to retry.
⇒ The DELETE’s changes may cause other CAS steps to fail.
However, the lowest DELETE will make progress.


Warning

Some details have been omitted. The proof that this actually works is about 20 pages long.

Experimental Results

2 UltraSPARC-III CPUs, each with 8 cores running at 1.2 GHz; 8 threads per core. Each thread performs operations on random keys.

More Work Needed on Lock-free Trees

Balancing the tree
Proving worst-case complexity bounds
Can the same approach yield (efficient) wait-free BSTs? (Or at least wait-free FINDs?)
Other tree data structures

Conclusions


Lock-Free Data Structures: A Summary

Studied for 20+ years. Research is important for new multicore architectures.
Universal constructions [1988–present]: disadvantage: inefficient
Array-based structures [1990–2005]: snapshots, stacks, queues
List-based structures [1995–2005]: singly-linked lists, stacks, queues, skip lists
Tree-based structures [very recent]: BSTs, some work based on B-trees, ...
A few others [1995–present]: union-find, ...

Techniques

Use repeated reads for consistency.
Methods to avoid the ABA problem.
Helping is useful for making algorithms wait-free.
Pointer-swinging using CAS.
“Locks” owned by operations can be combined with helping.
Elimination

Other Directions

Other data structures
Memory management issues
Weaker progress guarantees (e.g., obstruction freedom)
Using randomization to get better implementations
Software transactional memory

Final Words

Lots of work on lock-free data structures will be needed for the multicore machines of the future.


Photo credits

Gordon Moore: Steve Jurvetson
Gene Amdahl: Pkivolowitz at en.wikipedia
