University of Groningen

Verification of a Lock-Free Implementation of Multiword LL/SC Object
Gao, Hui; Fu, Yan; Hesselink, Wim H.

Publication date: 2009

Citation for published version (APA): Gao, H., Fu, Y., & Hesselink, W. H. (2009). Verification of a Lock-Free Implementation of Multiword LL/SC Object. In EPRINTS-BOOK-TITLE University of Groningen, Johann Bernoulli Institute for Mathematics and Computer Science.

2009 Eighth IEEE International Conference on Dependable, Autonomic and Secure Computing

Verification of a Lock-Free Implementation of Multiword LL/SC Object

Hui Gao, Yan Fu
School of Computer Science and Engineering
University of Electronic Science and Technology of China
Chengdu, China
Email: [email protected], [email protected]

Wim H. Hesselink
Dept. of Mathematics and Computing Science
University of Groningen
Groningen, The Netherlands
Email: [email protected]

Abstract: On shared-memory multiprocessors, synchronization often turns out to be a performance bottleneck and the source of poor fault-tolerance. The significant benefit of lock-freedom (or wait-freedom) for real-time systems is that, by avoiding locks, the potential for deadlock and priority inversion is avoided. Lock-free algorithms often require the use of special atomic processor primitives such as CAS (Compare And Swap) or LL/SC (Load Linked/Store Conditional). However, many machine architectures support either CAS or LL/SC, but not both. In this paper, we present a lock-free implementation of the ideal semantics of LL/SC using only pointer-size CAS, and show how to use refinement mapping to prove the correctness of the algorithm.

I. INTRODUCTION

We are interested in designing efficient data structures and algorithms on shared-memory multiprocessors. On such machines, processes often need to coordinate with each other via shared data structures. In order to prevent the corruption of these concurrent objects, processes need a mechanism for synchronizing their access. The traditional approach is to explicitly synchronize access to shared data by different processes to ensure correct behavior of the overall system, using synchronization primitives such as semaphores, monitors, guarded statements, mutex locks, etc. Due to blocking, the classical synchronization paradigms using locks can incur many problems such as long delays, convoying, priority inversion and deadlock. Using locks also involves a trade-off between coarse-grained locking, which can significantly reduce opportunities for parallelism, and fine-grained locking, which requires more careful design and is more prone to bugs.

Over the past two decades the research community has developed a body of knowledge concerning "Lock-Free" and "Wait-Free" algorithms and data structures. In contrast to algorithms that protect access to shared data with locks, lock-free and wait-free algorithms are specially designed to allow multiple threads to read and write shared data concurrently without corrupting it. The significant benefit of lock-freedom (or wait-freedom) for real-time systems is that, by avoiding locks, the potential for deadlock and priority inversion is avoided.

It was shown in the 1980s that algorithms can be implemented wait-free. However, the resulting performance does not in general match even naive blocking designs. It has also been shown [13] that the widely available atomic conditional primitives, CAS and LL/SC, cannot provide starvation-free implementations of many common data structures without memory costs growing linearly in the number of threads. Wait-free algorithms are therefore rare, both in research and in practice, and we are most interested in designing lock-free implementations.

A number of researchers [3], [5], [7], [9], [15] have proposed techniques for designing lock-free implementations. Lock-free algorithms often require the use of special atomic processor instructions such as CAS (compare and swap) or LL/SC (load linked/store conditional). However, current mainstream architectures support either CAS or LL/SC with restricted semantics (but not both), and these are susceptible to the ABA problem [14].

The ideal semantics of the atomic primitives LL/SC are inherently immune to the ABA problem. However, for practical architectural reasons, no processor architecture supports the ideal semantics of LL/SC. Designing efficient algorithms to bridge the gap has been the subject of many researchers' interest. However, most of the research is focused on implementing only small LL/SC objects, whose value fits in a single machine word [4], [8], [9], [11].

In this paper, using only pointer-size CAS, we present a practical lock-free implementation of the ideal semantics of LL/SC Multiword objects (whose value does not have to fit in a single machine word) without causing the ABA problem.

A true problem of lock-free algorithms is that they are hard to design correctly, even when apparently straightforward. To ensure our implementation is not flawed, we used the higher-order interactive theorem prover PVS [6] for mechanical support. All invariants as well as the simulation relation have been completely verified with PVS.

Overview. In Section 2 we present preliminary material which we require throughout this paper. In Section 3, we give a lock-free implementation of the ideal semantics of LL/SC Multiword objects. In Section 4, we provide an overview of the proof and a description of the role of the proof assistant PVS in it. In Section 5, we draw some conclusions.

978-0-7695-3929-4/09 $26.00 © 2009 IEEE. DOI 10.1109/DASC.2009.25

II. PRELIMINARY

The machine architecture that we have in mind is based on modern shared-memory multiprocessors that can access a common shared address space in a heap. There can be several processes running on a single processor. Variables in shared context are visible to all processes running in parallel. Variables in private context are hidden from other processes.

We assume a universal set V of typed variables, which is called the vocabulary. A state s is a type-consistent interpretation of V, mapping variables v ∈ V to values s⟦v⟧. We denote by Σ the set of all states. If C is a command, we denote by C_p the transition C executed by process p, and s⟦C_p⟧t indicates that in state s process p can do a step C that establishes state t. When discussing the effect of a transition C_p from state s to state t on a variable v, we abbreviate s⟦v⟧ to v and t⟦v⟧ to v′. We use the abbreviation Pres(V) for ⋀_{v∈V} (v′ = v) to denote that all variables in the set V are preserved by the transition.

A. The Semantics of Synchronization Primitives

Traditional multiprocessor architectures have included hardware support only for low-level synchronization primitives such as CAS and LL/SC, while high-level synchronization primitives such as locks, barriers, and condition variables have to be implemented in software.

CAS atomically compares the contents of a location with a value and, if they match, stores a new value at the location. The semantics of CAS is given by the equivalent atomic statement below. We use angular brackets ⟨ ... ⟩ to indicate atomic execution of the enclosed specification command (see footnote 1).

    proc CAS(ref X: Val; in old, new: Val): Bool =
      ⟨ if X = old then X := new; return true
        else return false; fi ⟩

Footnote 1: Note that this is allowed only in the specification of the algorithm.

LL and SC are a pair of instructions, closely related to CAS, that together implement an atomic Read/Write cycle. Instruction LL first reads the content of a memory location, say X, and marks it as "reserved" (not "locked"). If no other processor changes the content of X in between, the subsequent SC operation of the same process succeeds and modifies the value stored; otherwise it fails. The semantics of LL and SC are given by the equivalent atomic statements below, where me is the process identifier of the acting process.

    proc LL(in X: Val): Val =
      ⟨ S.X := S.X ∪ {me}; return X; ⟩

    proc SC(ref X: Val; in Y: Val): Bool =
      ⟨ if me ∈ S.X then
          S.X := ∅; X := Y; return true;
        else return false fi ⟩

B. Refinement mappings

In practice, the specification of systems is concerned with externally visible behavior rather than computational feasibility. We assume that all levels of specifications under consideration have the same observable state space Σ0, and are interpreted by their observation functions Π: Σ → Σ0. Every specification can be modeled as a four-tuple (Σ, Π, Θ, N) where (Σ, Θ, N) is the transition system [2].

A refinement mapping from a lower-level specification Sc = (Σc, Πc, Θc, Nc) to a higher-level specification Sa = (Σa, Πa, Θa, Na), written φ: Sc ⊑ Sa, is a mapping φ: Σc → Σa that satisfies:

1) φ preserves the externally visible state component: Πa ∘ φ = Πc.
2) φ is a simulation, denoted φ: Sc ≼ Sa:
   ① φ takes initial states into initial states: Θc ⇒ Θa ∘ φ.
   ② Nc is mapped by φ into a transition (possibly stuttering) allowed by Na: Q ∧ Nc ⇒ Na ∘ φ, where Q is an invariant of Sc.

Below we need to exploit the fact that the simulation only quantifies over all reachable states of the lower-level system, not all states. We therefore explicitly allow an invariant Q in condition 2 ②. The following theorem is stated in [1].

Theorem 1 If there exists a refinement mapping from Sc to Sa, then Sc implements Sa.

Refinement mappings give us the ability to reduce an implementation by reducing its components in relative isolation, and then gluing the reductions together with the same structure as the implementation.

III. THE LOCK-FREE IMPLEMENTATION OF LL/SC

Let us assume there are P (≥ 1) concurrently executing sequential processes. To distinguish private persistent variables of different processes, every persistent private variable name can be extended with the suffix "." + "process identifier". In particular, pc.q is the program location of process q; it ranges over all defined integer labels.

The specification Sa of LL/SC can then be given as shown in Fig. 1. In the specification, we model Node as an array of the N shared variables in the heap under consideration, which can be of any type (e.g. Val). The indices of Node are the addresses (or the pointers) to shared variables. We can thus simply regard the shared variable X (under consideration) as a synonym of an index x of Node, and its value is stored in Node[x]. As before, the action enclosed by angular brackets ⟨ ... ⟩ is defined as an atomic statement.

We now turn our attention to the lock-free implementation using only pointer-size CAS, which is given by the algorithm shown in Fig. 2.

    Constant
      P = number of processes;
      N = number of shared variables;
    Shared variable
      Node: array [1...N] of Val;
      S: array [1...N] of Set;
    Private variable
      pc: {a1; a2};
      me: ProcID;

    proc LL(in x: 1...N): Val =
    a1:  ⟨ S[x] := S[x] ∪ {me}; return Node[x]; ⟩

    proc SC(in x: 1...N; Y: Val): Bool =
    a2:  ⟨ if me ∈ S[x] then
             S[x] := ∅; Node[x] := Y; return true
           else return false; fi ⟩

    Initial conditions
      Θa: ∀p: 1...P: pc.p = a1 ∨ pc.p = a2

Figure 1. The Specification Sa of LL/SC

The lock-free implementation of Fig. 2 is inspired by our previous work [12].

In the lock-free implementation, the shared variable indir[x] acts as a pointer to the shared node x under consideration (i.e., the shared variable), while Node[mp.p] is taken as a "private" node of process p though it is declared publicly: other processes can read it but cannot modify it.

    Constant
      P = number of processes;
      N = number of shared variables;
      K = N + 2P;
    Shared variable
      Node: array [1...K] of Val;
      indir: array [1...N] of 1...K;
      prot: array [1...K] of 0...K;
    Private persistent variable
      pc: [c10 ... c34];
      mp: 1...K;

    proc LL(in x: 1...N): Val =
      loop
    c10:  m := indir[x];
    c12:  mybuf := Node[m];
    c14:  prot[m]++;
    c16:  if m = indir[x] then return mybuf;
          else
    c18:    prot[m]--; fi;
      end;

    proc SC(in x: 1...N; Y: Val): Bool =
    c20:  Node[mp] := Y;
      loop
    c22:  m := indir[x];
    c24:  if CAS(indir[x], m, mp) then
    c26:    prot[m]--;
    c28:    if prot[m] = 1 then
              mp := m;
            else
    c30:      prot[m]--;
              repeat
                choose mp from 1...K
    c32:      until CAS(prot[mp], 0, 1);
            fi
            return true;
          else
    c34:    prot[m]--;
            return false;
          fi;
      end.

    Initial conditions
      Θc: (∀p: 1...P: (pc.p = c10 ∨ pc.p = c20) ∧ mybuf = N+p)
          ∧ (∀i: 1...N: indir[i] = i)
          ∧ (∀i: 1...K: prot[i] = (i ≤ N+P ? 1 : 0))

Figure 2. The Lock-free implementation Sc of LL/SC

IV. CORRECTNESS

In this section we will prove that the concrete system Sc implements the abstract system Sa. Formally, as we did in [10], [14], we define

    Σa ≜ (Node[1...N], S) × (pc, me, x, Y)^P
    Σc ≜ (Node[1...K], indir[1...N], prot[1...K]) × (pc, x, Y, mp, m, mybuf)^P
    Πa(Σa) ≜ Node[1...N]
    Πc(Σc) ≜ Node[indir[1...N]]
    Na ≜ Na0 ∨ Na1 ∨ Na2
    Nc ≜ ⋁_{10≤i≤34} Nci

The transitions of the abstract system can be described as follows:

    ∀s, t: Σa, p: 1...P:
    s⟦(Na0)_p⟧t ≜ s = t   (to allow stuttering)
    s⟦(Na1)_p⟧t ≜ pc.p = a1 ∧ pc′.p = a2
                  ∧ S′[x.p] = (S[x.p] ∪ {me})
                  ∧ Pres(V − {pc.p, S[x.p]})
    s⟦(Na2)_p⟧t ≜ pc.p = a2 ∧ pc′.p = a1
                  ∧ ((me ∈ S[x.p] ∧ S′[x.p] = ∅ ∧ Node′[x.p] = Y
                       ∧ Pres(V − {pc.p, Node[x.p], S[x.p]}))
                     ∨ (me ∉ S[x.p] ∧ Pres(V − {pc.p})))

The transitions of the concrete system can be described in the same way. Here we only provide the description of the concrete transition c16:

    ∀s, t: Σc, p: 1...P:
    s⟦(Nc16)_p⟧t ≜ pc.p = c16
                   ∧ ((m.p = indir[x.p] ∧ pc′.p = c20)
                      ∨ (m.p ≠ indir[x.p] ∧ pc′.p = c18))
                   ∧ Pres(V − {pc.p})

To prove that Sc implements Sa, we define the state mapping φ: Σc → Σa by showing how each component of Σa is generated from components in Σc:

    ∀i: 1...N: Node_a[i] = Node_c[indir_c[i]]
    ∀i: 1...N: S_a[i] = {p: 1...P | pc_c.p ∉ {c10; c20; c22}
                          ∧ x_c.p = i ∧ m_c.p = indir_c[x_c.p]}
    ∀p: 1...P: pc_a.p = (pc_c.p ∈ [c10...c18] ? a1 : a2)

where the subscript indicates the concrete or abstract system a variable belongs to, and the remaining variables in Σa are identical to the variables occurring in Σc.

A. Proving the invariants with PVS

When we started to investigate the algorithm, it soon became apparent that we could use PVS as a proof assistant. In PVS, we defined the state space in terms of the shared and private variables, like the following:

    N, P: posnat
    K: posnat = N+P*2
    Process: TYPE = range[P]
    Index: TYPE = range[N]
    Val: TYPE

    state : TYPE = [#
      % shared variables
      Node : [ range(K) -> Val ],
      indir : [ Index -> range(K) ],
      prot : [ range(K) -> nat ],
      ...
      % private variables:
      pc : [ Process -> nat ],
      mp : [ Process -> range(K) ],
      ...
      % local variables of procedures:
      m : [ Process -> range(K) ],
      x : [ Process -> Index ],
      mybuf : [ Process -> Val ],
    #]

The code of Section 3 can be easily transformed into a transition system. For example, using s and t of type state and p of type Process, line c16 is represented by the definition:

    step16(p, s, t): bool =
      pc(s)(p) = 16 AND
      if m(s)(p) = indir(s)(x(s)(p)) then
        t = s WITH [ (pc)(p) := 20 ]
      else
        t = s WITH [ (pc)(p) := 18 ]
      endif

Since our algorithm is concurrent, the step is defined as the disjunction of all atomic actions.

    % transition steps
    step(p,s,t) : bool =
      step10(p,s,t) or step12(p,s,t) or ...
      step20(p,s,t) or step22(p,s,t) or ...
      ...

We then started to guess and prove several invariants as described in the next section. This improved our understanding and our confidence in the correctness of the algorithm. Finding invariants in an algorithm one does not really understand requires good intuition, but is mainly a lot of work. The stability of each proposed invariant is proved by a corresponding theorem or lemma in PVS, like:

    % Theorem about the stability of invariant I1
    IV-I1: THEOREM
      forall (s,t : state, p : Process ) :
        step(p,s,t) AND I1(s) AND I4(s) AND I5(s) => I1(t)

To ensure that all proposed invariants are proved stable, we construct a global invariant INV by conjoining all proposed invariants, and discharge a particular proof for its stability.

    % global invariant
    INV(s:state) : bool =
      I1(s) and I2(s) and I3(s) and ...
      ...

    % Theorem about the stability of the global invariant
    IV-INV: THEOREM
      forall (s,t : state, p : Process ) :
        step(p,s,t) AND INV(s) => INV(t)

After the stability of each proposed invariant has been checked separately, we define Init as the initial condition, which must imply all proposed invariants.

    % initial state
    Init: { s : state | ...
      (forall (p: Process):
        pc(s)(p)=10 or pc(s)(p)=20) and
      (forall (i: Index):
        indir(s)(i)=i) and
      ...
    }

    % The initial condition can be satisfied
    IV-Init: THEOREM
      INV(Init)

The role of PVS was plain verification. We ourselves invented the invariants. In the more difficult proofs of preservation of some invariants, we also had to guide the choices of case distinctions.
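To make the shape of these proof obligations concrete, the following Python sketch mirrors the PVS definition of step16 and the pattern "step(p,s,t) AND I(s) => I(t)" on a handful of explicit states. It is a hedged illustration, not a substitute for the PVS proofs: the `State` record keeps only the fields that step16 reads or writes, indices are 0-based, and the helper names `init_state`, `inv_i3` and `stable_under` are hypothetical. The invariant used as an example is I3 (distinct shared variables point to distinct nodes), and the check only covers the sample of states it is given.

```python
from dataclasses import dataclass, replace
from typing import Callable, Iterable, Optional, Tuple


@dataclass(frozen=True)
class State:
    indir: Tuple[int, ...]  # shared: indir[x] for each shared variable x
    pc: Tuple[int, ...]     # private: program location of each process
    m: Tuple[int, ...]      # local variable m of each process
    x: Tuple[int, ...]      # local variable x of each process


def init_state(n_vars: int, n_procs: int) -> State:
    """One state satisfying the shown part of Init: indir[i] = i, pc = 10."""
    zeros = (0,) * n_procs
    return State(indir=tuple(range(n_vars)), pc=(10,) * n_procs, m=zeros, x=zeros)


def step16(p: int, s: State) -> Optional[State]:
    """Successor of s when process p executes line c16, or None if disabled."""
    if s.pc[p] != 16:
        return None
    target = 20 if s.m[p] == s.indir[s.x[p]] else 18
    # Only pc.p changes; every other variable is preserved (Pres).
    return replace(s, pc=s.pc[:p] + (target,) + s.pc[p + 1:])


def inv_i3(s: State) -> bool:
    """I3: different shared variables never share a node."""
    return len(set(s.indir)) == len(s.indir)


def stable_under(step: Callable[[int, State], Optional[State]],
                 inv: Callable[[State], bool],
                 states: Iterable[State],
                 n_procs: int) -> bool:
    """Check inv(s) and step(p, s, t) imply inv(t) for the sampled states only."""
    for s in states:
        if not inv(s):
            continue
        for p in range(n_procs):
            t = step(p, s)
            if t is not None and not inv(t):
                return False
    return True
```

A call such as `stable_under(step16, inv_i3, [init_state(2, 2)], 2)` only tests the listed states, whereas the PVS theorems quantify over all states s and t; the sketch is meant to show what is being proved, not to prove it.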

B. Invariants

We establish some invariants for the concrete system Sc that will aid us in proving the refinement.

    I1: p ≠ q ∧ pc.p ∉ [c26...c32] ∧ pc.q ∉ [c26...c32] ⇒ mp.p ≠ mp.q
    I2: pc.p ∉ [c26...c32] ⇒ indir[x] ≠ mp.p
    I3: x ≠ y ⇒ indir[x] ≠ indir[y]

In the expression of invariants, free variables p and q range over 1...P, and x and y range over 1...N. Invariants I1 and I2 indicate that, for any process p, Node[mp.p] can be treated as a "private" node of process p since only process p can modify it. Invariant I3 implies that all shared nodes are different. To prove the invariance of I1 to I3, we postulate

    I4: ∀i: 1...K: prot[i] = #({x: 1...N | indir[x] = i})
          + #({p | (pc.p ∉ [c26...c32] ∧ mp.p = i) ∨ (pc.p = c26 ∧ m.p = i)})
          + #({p | pc.p ∈ [c16...c34] ∧ pc.p ≠ c32 ∧ m.p = i})
    I5: pc.p ∈ [c20...c34] ∧ pc.p ≠ c32 ∧ mp.q = m.p ⇒ pc.q ∈ [c26...c32]

Invariant I4 precisely describes the counter prot[i] for each i ∈ 1...K. Invariant I5 implies that process p cannot read the "private" node of another process q.

Consequently, we have the main reduction theorem for the lock-free implementation using CAS:

Theorem 2 The abstract system Sa defined in Fig. 1 is implemented by the concrete system Sc defined in Fig. 2, that is, ∃φ: Sc ⊑ Sa.

V. CONCLUSION

We are interested in designing efficient data structures and algorithms on shared-memory multiprocessors. On such machines, lock-free algorithms offer significant reliability and performance advantages over conventional lock-based implementations. Lock-free algorithms often require the use of special atomic processor primitives such as CAS or LL/SC. However, many machine architectures support either CAS or LL/SC only with restricted semantics.

In this paper, we first present a lock-free implementation of the ideal semantics of a Multiword LL/SC object using only pointer-size CAS, without causing the ABA problem. Then, to ensure our algorithm is not flawed, we use refinement mapping to prove the correctness of the algorithm, and the higher-order interactive theorem prover PVS for mechanical support.

ACKNOWLEDGMENT

This project is sponsored by the Scientific Research Foundation for the Returned Overseas Chinese Scholars, State Education Ministry.

REFERENCES

[1] M. Abadi and L. Lamport. The existence of refinement mappings. Theoretical Computer Science, 82(2), 1991.

[2] Z. Manna and A. Pnueli. The Temporal Logic of Reactive and Concurrent Systems: Specification. Springer Verlag, 1992.

[3] M. Herlihy. A Methodology for Implementing Highly Concurrent Data Objects. ACM Transactions on Programming Languages and Systems, 15(5):745-770, November 1993.

[4] A. Israeli and L. Rappoport. Disjoint-Access-Parallel implementations of strong shared-memory primitives. In Proceedings of the 13th Annual ACM Symposium on Principles of Distributed Computing, pages 151-160, August 1994.

[5] J. D. Valois. Implementing lock-free queues. In Proceedings of the Seventh International Conference on Parallel and Distributed Computing Systems, pages 64-69, Las Vegas, NV, 1994.

[6] F. Cassez, C. Jard, B. Rozoy, M. Dermot (Eds.). Modeling and Verification of Parallel Processes. 4th Summer School, MOVEP 2000, Nantes, France, June 19-23, 2000.

[7] M.P. Herlihy, V. Luchangco and M. Moir. The Repeat Offender Problem: A Mechanism for Supporting Dynamic-Sized, Lock-Free Data Structure. In Proceedings of the 16th International Symposium on Distributed Computing, pages 339-353. Springer-Verlag, October 2002.

[8] P. Jayanti and S. Petrovic. Efficient and practical constructions of LL/SC variables. In Proceedings of the 22nd ACM Symposium on Principles of Distributed Computing, pages 285-294, July 2003.

[9] V. Luchangco, M. Moir and N. Shavit. Nonblocking k-compare-single-swap. In Proceedings of the Fifteenth Annual ACM Symposium on Parallel Algorithms and Architectures, pages 314-323. ACM Press, 2003.

[10] H. Gao and W.H. Hesselink. A formal reduction for lock-free parallel algorithms. In Proceedings of the 16th Conference on Computer Aided Verification (CAV), July 2004.

[11] S. Doherty, M. Herlihy, V. Luchangco and M. Moir. Bringing practical lock-free synchronization to 64-bit applications. In Proceedings of the 23rd Annual ACM Symposium on Principles of Distributed Computing, pages 31-39, July 2004.

[12] H. Gao, J.F. Groote and W.H. Hesselink. Lock-free Dynamic Hash Tables with Open Addressing. Distributed Computing 17 (2005) 21-42.

[13] R. Bencina. Survey "Some Notes on Lock-Free and Wait-Free Algorithms" at www.audiomulch.com/~rossb/code/lockfree/.

[14] H. Gao and W.H. Hesselink. A general lock-free algorithm using compare-and-swap. Information and Computation 205 (2007) 225-241.

[15] H. Gao, J.F. Groote and W.H. Hesselink. Lock-free parallel and concurrent garbage collection by mark&sweep. Science of Computer Programming 64 (2007) 341-374.
