Room Synchronizations
Guy E. Blelloch    Perry Cheng
Computer Science Department
degrades linearly with the number of processors (because accesses are sequentialized), our version scales almost perfectly, beyond an initial cost for going from one to two processors. The room synchronizations introduced in this paper have been incorporated into our real-time garbage collector code [5], where they are used extensively.

One disadvantage of room synchronizations is that if a user fails or stops while inside a room, the user will block other users from entering another room. This property is clearly not desirable under all conditions, but in many conditions, including the garbage collector, it is not a problem. We discuss this issue in Section 5.

1.1 A Motivating Example

To motivate our problem, we consider implementing a parallel stack using a fetch-and-add. We assume the stack is stored in an array A and the index top points to the next free location in the stack (the stack grows from 0 up). The fetchAdd(ptr,cnt) operation adds cnt to the contents of location ptr and returns the old contents of the location (before the addition). We assume this is executed atomically. Consider the stack code shown in Figure 1(a). Assuming a constant-time fetchAdd, the push and pop operations will take constant time. The problem is that they can work incorrectly if a push and pop are interleaved in time. For example, in the following interleaving of instructions

j = fetchAdd(&top,1);  // from push
k = fetchAdd(&top,-1); // from pop
x = A[k-1];            // from pop
A[j] = y;              // from push

the pop will return garbage. Without an atomic operation that changes the counter at the same time as modifying the stack element, we see no simple way of fixing this problem. One should note, however, that any interleaving of two or more pushes or two or more pops is safe. Consider pushes. The fetchAdd reserves a location to put the value y, and the write inserts the value. Since the counter is only increasing, it does not matter what order relative to the increments the values are inserted.

Therefore, if we can separate the pushes from the pops, we would have a safe implementation of a stack. Room synchronizations allow us to do this as shown in Figure 1(b). The room synchronization guarantees that no two users will ever simultaneously be in the PUSHROOM and POPROOM, so the push and pop instructions will never be interleaved. However, it will allow any number of users to be in either the push or pop room simultaneously.

In the full paper [4] we show that using our room synchronization algorithm (a) asynchronous accesses (pushes or pops) can arrive at any time, (b) every access is serviced in time proportional to the time of a fetch-and-add,² and (c) that this implements a linearizable stack. The experiments described in Section 4 are based on a variant of this stack in which multiple elements are pushed or popped within each room.

² This is under certain assumptions about all processors making progress.

(a)

void push(val y) {
  int j;
  j = fetchAdd(&top,1);
  A[j] = y;
}

val pop() {
  int k;
  val x;
  k = fetchAdd(&top,-1);
  if (k <= 0) {
    fetchAdd(&top,1);
    x = EMPTY;
  }
  else x = A[k-1];
  return x;
}

(b)

void push(val y) {
  int j;
  enterRoom(PUSHROOM);
  j = fetchAdd(&top,1);
  A[j] = y;
  exitRoom();
}

val pop() {
  val x;
  int k;
  enterRoom(POPROOM);
  k = fetchAdd(&top,-1);
  if (k <= 0) {
    fetchAdd(&top,1);
    x = EMPTY;
  }
  else x = A[k-1];
  exitRoom();
  return x;
}

Figure 1: The code for a parallel stack. (a) Does not work if the push and pop are interleaved in time. (b) Avoids this problem using a room synchronization.

Outline. The paper is organized as follows. Section 2 formally defines room synchronizations, and presents an algorithm that provably implements them. Section 3 shows how room synchronizations can be effectively used to implement shared queues and dynamic shared stacks. Section 4 presents our experimental results. Section 5 discusses further issues and related work, and Section 6 concludes.

2. ROOM SYNCHRONIZATIONS

The room synchronization problem involves a set of m rooms, and a set of p independent users who wish to access the rooms. A user wishing access to a room calls an Enter Room primitive, which returns once the user is granted permission to enter the room. The user is then said to be inside the room. When done with the room, the user exits the room by calling an Exit Room primitive. The rooms synchronization construct must ensure that at any point in time, there is at most one "open" room with users inside it. However, any number of users can be inside the open room.

In this section, we provide details on the room synchronization problem, discuss necessary and desirable properties for protocols implementing room synchronization, present a protocol that implements room synchronization, and show that this protocol satisfies all these properties.

2.1 Primitives

The basic primitives of room synchronization are:

Create Rooms. Given a positive integer m, create a rooms object R for a set of m rooms, and return a pointer (a reference) to R. There can be multiple rooms objects at the same time.

Enter Room. Given a pointer to a rooms object R and a room number i, try to enter room i of R. Return when the user has succeeded in entering the room. When the primitive returns, the user is said to be inside the room. A room with a user inside is said to be open.

Exit Room. Given a pointer to a rooms object R, exit the room in R that the user is currently inside. Because the user can be inside at most one room in R, there is no need to specify the room number. When a user requests to exit a room, it is no longer considered to be inside the room. If
there are no users remaining inside the room, the room is said to be closed.

Destroy Rooms. Given a pointer to a rooms object, deallocate the rooms object.

Other primitives. In addition to these four basic primitives, there is a Change Room primitive, which is equivalent to an Exit Room followed by an Enter Room, but with the guarantee of immediate entry into the requested room if it is the next room to be opened. There is also an Assign Exit Code primitive, discussed at the end of Section 2.3.

The Enter Room and Exit Room primitives can be viewed as the counterparts to the "trying" (e.g., lock) and "exit" (e.g., unlock) primitives for the mutual exclusion problem. As with the mutual exclusion problem, what the users do while inside the room (or critical section) is part of the application and not part of the synchronization construct. This enables the generality of the primitive, as the same construct can be used for a variety of applications. The drawback is that, as with mutual exclusion, the construct relies on the application to alternate entering and exiting requests by a given user.

The Create Rooms and Destroy Rooms primitives are executed once for a given rooms object, in order to allocate and initialize the object prior to its use and to deallocate the object once it is no longer needed. To simplify the discussions that follow, we will focus on a single rooms object, for which Create Rooms has already been executed, and Destroy Rooms will be executed once the object is no longer needed. Extending the formalizations and discussions to multiple rooms objects and to issues of creating and destroying objects is relatively straightforward.

2.2 Formalization

We can formalize the room synchronization problem using the I/O Automaton model [16]. Our terminology and formal model are an adaptation of those used in [16] for formalizing mutual exclusion.

Each user j is modeled as a state machine that communicates with an agent process p_j by invoking room synchronization primitives and receiving replies. The agent process p_j, also a state machine, works on behalf of user j to perform the steps of the synchronization protocol. Each agent process has some local private memory, and there is a global shared memory accessible by all agent processes. The set of agent processes, together with their memory, is called the protocol automaton. An action (an instruction step) is a transition in a state machine. We say an action is enabled when it is ready to execute. Actions are low-level atomic steps such as reading a shared memory location or incrementing a local counter. An execution is a sequence of alternating states and actions, beginning with a start state, such that each action is enabled in the state immediately preceding it in the sequence and updates that state to be the state immediately succeeding it. Thus actions are viewed as occurring in some linear order.³

³ Although an execution is modeled as a linearized sequence of low level actions, this is a completely general model for specifying parallel algorithms and studying their correctness properties: The state changes resulting from actions occurring in parallel are equivalent to those resulting from some linear order (or interleaving) of these actions, given that the actions are defined to be sufficiently low level.

Asynchrony is modeled by the fact that actions from different agent processes can be interleaved in an arbitrary manner; thus one agent may have many actions between actions by another agent. We will consider various fairness metrics for executions and for room accesses. A weak form of fairness among the agent actions is the following: An execution is weakly-fair if it is either (a) finite and no agent action is enabled in the final state, or (b) infinite and each agent has infinitely many opportunities to perform an action (either there is an action by the agent or no action is enabled).

Certain actions are specially designated as external actions; these are the (only) actions in which a user communicates with its agent. For room synchronization, the external actions for a user j (and its agent) are:

EnterRoomReq_j(i): the action of user j signalling to its agent p_j a desire to enter room i.

EnterRoomGrant_j(i): the action of agent p_j signalling to user j that its Enter Room request has been granted.

ExitRoomReq_j: the action of user j signalling to its agent p_j a desire to exit its current room.

ExitRoomGrant_j: the action of agent p_j signalling to user j that its Exit Room request has been granted.

The trace (trace at j) of an execution is the subsequence of the execution consisting of its external actions (for a user j).

The terminology above focuses on modeling the agents that act on behalf of user requests, as needed to formalize both the room synchronization problem and protocols that provide room synchronizations. Section 3, on the other hand, focuses on modeling user applications, such as stacks and queues, that make use of room synchronizations. All of the terminology stated above for agents can be similarly defined in order to model users. For example, each user process has some local private memory, and there is a global shared memory accessible by all user processes. From the perspective of the present section, however, all users do is make requests to enter or exit rooms.

Necessary properties. We first state formally a condition on users of room synchronization and their agents. A trace at j of an execution for a rooms object with m rooms is said to be behaved if it is a prefix of the cyclically ordered sequence:

EnterRoomReq_j(i_1), EnterRoomGrant_j(i_1), ExitRoomReq_j, ExitRoomGrant_j, EnterRoomReq_j(i_2), ...

where i_1, i_2, ... ∈ [1..m]. In other words, (i) the Enter Room and Exit Room primitives by a given user alternate, starting with an Enter Room, (ii) the user waits for a request to be granted prior to making another request, (iii) conversely, the agent waits for a request before granting a request and only grants what has been requested, and (iv) the requested room numbers are valid. We say a user j's requests are behaved if no request is the first misbehaved action in the trace at j (formally, there is no EnterRoomReq_j or ExitRoomReq_j in the trace at j such that the prefix of the trace up to but not including this action is behaved, but the prefix including the action is not behaved). In a behaved trace at j, EnterRoomReq_j(i) transitions user j from outside all rooms to
preparing to enter room i, EnterRoomGrant_j(i) transitions user j from preparing to enter room i to inside room i, ExitRoomReq_j transitions user j from inside to preparing to exit, and ExitRoomGrant_j transitions user j from preparing to exit to outside.

We can now state formally what it means for a protocol to implement room synchronization. A protocol automaton A solves the room synchronization problem for a given collection of users with behaved requests if the following properties hold:

P1. Trace behaved: In any execution, for any j, the trace at j is behaved. One implication is that only users requesting to enter a room are given access to the room, and only after its EnterRoomReq and before any subsequent ExitRoomReq.

P2. Mutual exclusion among rooms: There is no reachable state of A in which more than one room is open.⁴ Equivalently, in any execution, between any EnterRoomGrant_j(i) in the trace and the next ExitRoomReq_j (or the end of the trace if there is no such action) there are no EnterRoomGrant(i′) actions for i′ ≠ i.

P3. Weakly-concurrent access to rooms: There are reachable states of A in which more than one user is inside a room.

P4. Global progress: At any point in a weakly-fair execution: (1) If there is a user j preparing to enter room i (i.e., its most recent external action is an EnterRoomReq_j(i)), and there are no rooms with users inside, then at some later point some user is inside some room (i.e., there is an EnterRoomGrant action). (2) If there is a user j preparing to exit a room (i.e., its most recent external action is an ExitRoomReq_j), then at some later point user j is outside the room (i.e., there is an ExitRoomGrant_j action).

Note that properties P1–P4 do not guarantee any fairness among rooms or among users, the delays in entering or exiting a room, etc. These are considered next.

Desirable properties. Properties P5–P11 define additional desirable properties for a protocol automaton A solving the room synchronization problem. It is useful to analyze protocol performance (such as in P6, P7, P9, and P10 below) assuming an upper bound δ on the (wall clock) time between successive actions by an agent.⁵

P5. No room starvation: In any weakly-fair execution: If all users inside a room eventually prepare to exit the room, then any closed room requested by at least one user is eventually opened.

P6. Bounded delay for rooms: In any execution with m rooms and p users: If each user is inside a room for at most time ∆, and the time between successive actions of each agent preparing to enter or preparing to exit is at most δ, then any closed room requested by at least one user is open within time T_R = T_R(∆, δ, m, p).

P7. Constant delay for rooms: This is a stronger version of P6 in which T_R is independent of p.

P8. Lockout-freedom (i.e., no user starvation): In any weakly-fair execution: (1) If all users inside a room eventually prepare to exit the room, then any user preparing to enter a room eventually gets inside the room. (2) Any user preparing to exit a room eventually gets outside the room. Note that this is a stronger property than P5.

P9. Bounded time to enter and exit: In any execution, any user preparing to enter a room is inside the room within time T_1 = T_1(∆, δ, m, p), and any user preparing to exit a room is outside the room within time T_2 = T_2(δ, m, p), where ∆, δ, m and p are as defined in P6.

P10. Constant time access: This is a stronger version of P9 in which T_1 and T_2 are independent of p. Note that this implies unbounded concurrent access to rooms (otherwise the time bound would necessarily depend on p).

P11. Demand driven: When a user is inside a room or outside all rooms, there are no actions by its agent. Thus, an agent performs work only in response to a request by its user.

2.3 A room synchronization protocol

We now present a protocol (an algorithm) for solving the room synchronization problem that satisfies all the properties P1–P11. The protocol assumes a linearizable shared memory [14] supporting atomic reads, writes, fetch-and-adds, and compare-and-swaps on single words of memory.⁶ In our actual implementation of this protocol, we do not assume a linearizable shared memory, and instead explicitly insert Memory Barriers into the code (details in the full paper [4]).

Consider a rooms object with m rooms. It includes three arrays of size m: wait, grant, and done. The arrays hold three counters for each room, all initially zero. It includes a numRooms field, set to m. It also includes an activeRoom field, which holds the room number of the (only) room that may be open. The special value -1 is used to indicate when there is no active room, e.g., initially and whenever there are no users either inside a room or waiting to enter a room.

The protocol is depicted in Figure 2.⁷ For conciseness and readability, we present C code instead of a full I/O Automaton specification. Agents enter a room by incrementing the wait counter to get a "ticket" for the room (Step 2), and then waiting until that ticket is granted (Steps 3–10).

⁴ A stronger property is to require that at most one room can be open even if some user requests are not behaved.

⁵ The time bound only applies if the second action is enabled no later than when the first action occurs: an agent blocked waiting for, say, a request from its user can be arbitrarily delayed. Also, note that in the absence of a positive lower bound on the time between successive actions, we have not restricted the relative speeds of the agents. Finally, note that the time bound applies only to the analysis of the time performance, and not to any correctness (safety) properties.

⁶ We have explicitly avoided atomic operations on two or more words of memory. We use only the weaker fetch-and-increment form of fetch-and-add. Also note that we could in principle implement rooms using only reads and writes, following ideas used, e.g., in the Bakery Algorithm [15] for mutual exclusion, at the cost of far greater delays.

⁷ We show only the code for enterRoom and exitRoom. The complete version of the code is available at: www.cs.cmu.edu/afs/cs/project/pscico/www/rooms.html
 1 void enterRoom(Rooms_t *r, int i) {
 2   int myTicket = fetchAdd(&r->wait[i],1) + 1;        // get ticket for the room
 3   while (myTicket - r->grant[i] > 0) {               // wait until your ticket is granted
 4     if (r->activeRoom == -1) {                       // while waiting, if no active room
 5       if (compareSwap(&r->activeRoom,-1,i) == -1) {  // then make i the active room
 6         r->grant[i] = r->wait[i];                    // enable all with tickets to enter room i
 7         break;
 8       }
 9     }
10   }
11 }

12 int exitRoom(Rooms_t *r) {
13   int ar = r->activeRoom;                            // preparing to exit room ar
14   int myDone = fetchAdd(&r->done[ar],1) + 1;         // increment the done counter
15   if (myDone == r->grant[ar]) {                      // if last to be done
16     for (int k = 0, newAr = ar; k < r->numRooms; k++) {
17       newAr = (newAr + 1) % r->numRooms;             // go round robin thru the rooms
18       if (r->wait[newAr] - r->grant[newAr] > 0) {    // if ticketed waiters
19         r->activeRoom = newAr;                       // make it the active room
20         r->grant[newAr] = r->wait[newAr];            // enable all with tickets to enter room newAr
21         return 1;
22       }
23     }
24     r->activeRoom = -1;                              // no waiters found, so no active room
25     return 1;                                        // 1 indicates was last to be done
26   }
27   return 0;                                          // 0 indicates was not the last
28 }

Figure 2: C code for Enter Room and Exit Room.
Agents exit a room by incrementing the done counter (Step 14). Once the done counter matches the grant counter (Step 15), then all agents granted access to the room have exited the room. The unique agent to increment the done counter up to the grant counter (the "last done") does the work of selecting the next active room (Steps 16–23). The grant counter of that active room is set to be equal to its current wait counter (Step 20), thereby granting all waiting tickets for that room. If the last done agent fails to discover a room with waiting tickets, it resets activeRoom to -1 (Step 24). Whenever activeRoom = -1 (Step 4), a ticketed agent can self-select its requested room as the next active room (Step 5), and grant all waiting tickets for that room (Step 6).

Theorem 1. The room synchronization protocol in Figure 2 satisfies properties P1–P11.

Proof. Due to page constraints, we only present the proofs of property P1 (trace behaved) and property P2 (mutual exclusion among rooms). Other properties are discussed briefly, with the details left to the full paper [4]. We also restrict our attention to the case where wait, grant, and done are unbounded counters, and hence they are monotonically nondecreasing. For ease of description, we view each step of the C code as an atomic action. This is done without loss of generality for all steps that access at most one shared memory location. For the other steps (Steps 6, 18, 20), the proof is readily extended to have separate atomic actions for each shared memory access. Finally, to simplify the notation, we omit explicit reference to the rooms object pointer r, e.g., we use enterRoom(i) instead of enterRoom(r,i).

Property P1. In the protocol of Figure 2, an EnterRoomReq_j(i) (ExitRoomReq_j) action corresponds to the user j initiating a procedure call to enterRoom(i) (exitRoom, respectively). An EnterRoomGrant_j(i) (ExitRoomGrant_j, respectively) action corresponds to the completion and return of this procedure. Consider any execution and any user j with behaved requests. An EnterRoomGrant_j(i) (ExitRoomGrant_j) action cannot be the first misbehaving action in the trace at j, because it can occur in the trace only immediately after the matching EnterRoomReq_j(i) (ExitRoomReq_j, respectively) that initiated the procedure call. Thus the trace at j is behaved.

Property P2. To prove mutual exclusion, we will need the following definitions and lemmas. For an execution α, let α|j be the subsequence of α consisting of its actions for a user j or its agent p_j. A user j has a ticket for a room i after an execution α for each Step 2 of enterRoom(i) in α|j with no subsequent Step 14 of exitRoom in α|j. A user j with a ticket for a room i is blocked after an execution α if myTicket at j is greater than grant[i]. A user j is in the advance room region after an execution α if Step 6, 16, 17, 18, 19, 20, or 24 is enabled, or Step 15 is enabled with a successful conditional test.

Lemma 1. Each user (with a behaved trace) has at most one ticket.

Proof. Suppose there exists an execution α such that a user j has multiple tickets after α. By the definition of having a ticket, for each such ticket, there is a Step 2 in α|j with no subsequent Step 14 in α|j. For each such Step 2, there is a preceding EnterRoomReq_j, but no subsequent ExitRoomGrant_j because Step 14 must precede any ExitRoomGrant. Thus the trace at j is not behaved, and hence property P1 fails to hold, a contradiction.

Lemma 2. In any execution α (with behaved traces), if a user j is inside room i after α, then j is an unblocked user with a ticket for room i.

Proof. If user j is inside room i, then because the trace at j is behaved, the last external action at j is EnterRoomGrant_j(i). The last occurrence of Step 2 of enterRoom(i) in
α|j precedes the EnterRoomGrant_j(i) in α, and there can be no subsequent Step 14 because there is no subsequent ExitRoomReq_j in α|j. Thus j has a ticket for room i. Moreover, suppose j were blocked. Then myTicket at j is greater than grant[i] after α. Let α = α_1 β α_2, where β is the above Step 2, α_1 is the prefix of α prior to β, and α_2 is the suffix of α after β. By Lemma 1 and examination of the code, we see that there is no possible step by user j that modifies myTicket at j: its value is the same in all states in α_2. Because grant[i] is a nondecreasing counter, user j's myTicket > grant[i] in all states in α_2. Thus j would never exit the enterRoom while loop, and hence there would be no EnterRoomGrant_j(i) in α_2|j, a contradiction. Thus user j is not blocked.

The heart of the mutual exclusion proof is the following lemma, which presents three invariants that also provide insight into the protocol.

Lemma 3. In any execution (with behaved traces):

1. For all rooms i, wait[i] − done[i] is the number of users with tickets for room i, and wait[i] − grant[i] is the number of blocked users with tickets for room i. (Thus grant[i] − done[i] is the number of unblocked users with tickets for room i.) All users with tickets for room i have myTicket ≤ wait[i].

2. At most one user is in the advance room region. If a user is in the advance room region, then activeRoom ≠ -1 and for all rooms i, grant[i] = done[i].

3. If there exists an unblocked user with a ticket for room i then activeRoom = i. If Step 6 (14, 20) is enabled at a user j, then activeRoom = i (ar, newAr, respectively) at j and activeRoom ≠ -1.

Proof. The proof is by induction on the number of actions in the execution. Initially, wait[i] = grant[i] = done[i] = 0, and all three invariants hold for the start state. Assume that all three invariants hold for all executions of t ≥ 0 actions. Consider an arbitrary execution α with t actions and consider all possible next actions β. W.l.o.g., assume that β is an action by user j or its agent. Let s_1 be the last state in α and s_2 be the updated state after β occurs.

To show invariant 1 holds in s_2, we must consider all the cases where β updates either myTicket, one of the counters, the number of ticketed users, or the number of blocked users, namely Steps 2, 6, 14, and 20. Step 2 of enterRoom(i) increments both wait[i] and the number of (blocked) users with tickets for room i (by Lemma 1). In addition, each myTicket is at most wait[i] in s_2. Thus invariant 1 is maintained. Step 6 of enterRoom(i) sets wait[i] − grant[i] to zero. It follows inductively by invariant 1 that any user with a ticket for room i in s_1 must have myTicket ≤ wait[i] = grant[i] in s_2, and hence cannot be blocked. Likewise, Step 20 maintains the invariant for room newAr. If Step 14 is enabled in s_1, then user j has a ticket (by definition) for some room i. Moreover, by an argument similar to that in the proof of Lemma 2, j is not blocked. Thus inductively by invariant 3, activeRoom = i = ar in s_1. Step 14 increments done[ar], and by Lemma 1, it decrements the number of users with tickets for room ar, so the invariant is maintained. Hence, in all cases, invariant 1 holds in s_2.

To show invariant 2 holds in s_2, we again consider each relevant case for β, namely, Steps 5–6, 14–20, and 24. If β is a Step 5 that succeeds in enabling Step 6 in s_2, then activeRoom = -1 in s_1 (otherwise, the compareSwap would not return -1). Inductively by invariants 1 and 3, grant[i] = done[i] for all rooms i in s_1, and hence in s_2. Inductively by invariant 2, there are no users in the advance room region in s_1. Thus the invariant is maintained. If β is a Step 6, 15 (with a successful conditional), 16–20, or 24, then user j is in the advance room region in s_1. Inductively by invariant 2, j is the only such user in s_1. If β is a Step 6, 20, or 24, then there are no users in the advance room region in s_2, and the invariant is maintained. On the other hand, if β is a Step 15–19, then inductively by invariant 2 and the fact that none of these steps add a user to the advance room region, set activeRoom = -1, update grant, or update done, the invariant is maintained. If β is a Step 14, then, as argued above, Step 14 increments done[ar], where j is an unblocked user with a ticket for a room ar in s_1. Inductively by invariants 1 and 3, grant[i] = done[i] in s_1 and hence s_2 for all rooms i ≠ ar, and grant[ar] > done[ar] in s_1. Thus inductively by invariant 2, there is no user in the advance room region in s_1. In order for j to be in the advance room region in s_2, myDone at j must equal grant[ar] in s_2 (so that Step 15 is enabled with a successful conditional). This occurs only if grant[ar] = done[ar] in s_2, because myDone = done[ar] after β. Hence, in all cases, invariant 2 holds in s_2.

Finally, to show invariant 3 holds in s_2, we consider each relevant case for β, namely, Steps 2, 5, 6, 13, 19, 20, and 24. If β is a Step 2, then by invariant 1 applied to both s_1 and s_2, the number of unblocked users with tickets for room i is unchanged. Moreover, activeRoom is unchanged, so the invariant is maintained. If β is a Step 5 that succeeds in changing activeRoom, then activeRoom = -1 in s_1 and i ≠ -1 in s_2. Thus, inductively by invariant 3 and the fact that Step 5 does not create an unblocked ticketed user, the invariant is maintained. If β is a Step 6, then inductively by invariant 3, activeRoom = i in s_1 and hence in s_2. Step 6 can only unblock users with tickets for room i, so the invariant is maintained. The case for Step 20 is symmetric. If β is a Step 13, then activeRoom is the same in s_1 and s_2. For user j, β sets ar = activeRoom and enables Step 14. As argued above, j is an unblocked user with a ticket for a room ar in s_1; thus ar ≠ -1. Step 13 neither creates a new unblocked ticketed user nor changes activeRoom, so inductively by invariant 3, the invariant is maintained. If β is a Step 19, then user j is in the advance room region in s_1, and hence inductively by invariant 2, there is no other user in the advance room region in s_1. Thus inductively by invariants 1 and 2, there are no unblocked ticketed users in s_1, and hence in s_2. As argued above, Step 14 is enabled at some user j′ only if j′ is an unblocked ticketed user. Thus Step 14 is not enabled at any user in s_2. Moreover, Step 6 and Step 20 are enabled at some user j′ only if j′ is in the advance room region. Thus neither step is enabled in s_1, and hence in s_2, with the exception of Step 20 being enabled at user j in s_2. But β sets activeRoom = newAr ≠ -1, as is required. Hence, the invariant is maintained. The case for Step 24 is the same as for Step 19, except that Step 20 will not be enabled at user j. Hence, in all cases, invariant 3 holds in s_2.

This concludes the proof of Lemma 3.
To complete the proof of property P2, suppose there were an execution resulting in two distinct rooms i and i′ with users j and j′ inside the respective rooms. Then by Lemma 2, j (j′) is an unblocked user with a ticket for room i (i′, respectively). Thus by invariant 3 of Lemma 3, activeRoom equals both i and i′, a contradiction.

Other properties. Let m be the number of rooms and p be the number of users. Intuitively, the remaining properties hold due to the following observations. (1) A user desiring a ticket will get a ticket within a constant number of its actions. (2) The last done when exiting a room i starts at room i + 1 and cycles through all m rooms (including back to i), granting access to the first room it encounters with ticketed users. (3) A room with ticketed users gets its turn within m turns or less. (4) Each ticketed user in the enterRoom while loop will get inside the room within a constant number of its actions after access is granted to the room. (5) If we place an upper bound Δ on the time a user is inside a room, and an upper bound δ on the time between successive actions of an agent preparing to enter or exit a room, then we can compute the time to enter or exit a room. Each turn for a room takes at most Δ + O(δ) time; thus in any execution, any user preparing to enter a room is inside the room in time T_1 ≤ (Δ + O(δ)) · m. Moreover, in any execution, any user preparing to exit a room is outside the room in time T_2 = O(δm) for the last done and O(δ) for all others.

In the above time analysis, we are implicitly assuming that δ is not a function of p, e.g., the time for a fetch-and-add is independent of p. If instead we let t_f = t_f(p) be an upper bound on the time for a fetch-and-add with p processors (e.g., t_f = log p), then T_1 ≤ (Δ + 2t_f + O(δ)) · m and T_2 ≤ t_f + O(δm).

Note that all these time bounds and other properties hold even in the presence of arbitrarily fast agents who do their best to starve other agents. The reader is referred to the full paper [4] for the proofs.

Remarks. The technique of using a wait counter and a grant counter is used in mutual exclusion protocols such as TicketME [6, 16]. Mutual exclusion protocols are simpler because only one agent is granted access at a time. Thus the grant counter can double as the done counter. Also, the test for granting access is simply whether your ticket equals the grant counter, so modulo-p counters are used for p users. With room synchronization, we need an inequality test between the ticket and the grant counter in order to admit multiple waiting users at once, so modulo-p counting does not suffice. Instead, and to avoid unbounded counters, we rely on the fact that when integers are represented in twos-complement (and p is less than the maximum integer), then letting the counters wrap around to a negative number (ignoring the overflow) gives the desired result for any inequality test in the protocol. This is the reason for using myTicket - r->grant[i] > 0 instead of myTicket > r->grant[i] in the enterRoom code.

Note that activeRoom may be set to -1 even when there are users waiting with their tickets. However, such users must have grabbed their tickets after the last agent done checked to see if there were ticketed waiters. Moreover, although all agents granted access to a room are spinning waiting for that room to open, and hence will tend to proceed together into the room, in the worst case the agents may proceed to enter the room and exit the room at very different rates. Thus a room may open and close multiple times before all the granted agents are done with the room. Furthermore, an agent that is slow to grab a ticket may be bypassed by faster agents an unbounded number of times. This does not contradict properties P9 or P10, because within time t_f + O(δ) the slow agent will grab a ticket, and from there will proceed in constant time (a function of δ, t_f, and m) to enter inside the room. On the other hand, if all agents run at roughly the same speed, an agent can be bypassed for its desired room at most once.

Our protocol has the property that any user requesting to enter an already open room cannot enter the room until all users inside the room have exited.

Variants. To satisfy the stronger mutual exclusion among rooms property discussed in the footnote to property P2, it suffices to ignore all misbehaving requests, as follows. Associate with each room a vector V, one two-bit entry per user, indicating the effective status of the user, as outside (0), preparing to enter (1), inside (2), or preparing to exit (3). At the beginning of enterRoom (prior to Step 2), determine the id of the user, and if V[id] ≠ 0 or the requested room number is invalid, the user is misbehaving and a failure code is returned. Otherwise, set V[id] = 1 and permit the user's agent to proceed with Step 2. At the end of enterRoom (just prior to Step 11), set V[id] = 2. Similarly, at the beginning of exitRoom (prior to Step 13), determine the id of the user, and if V[id] ≠ 2, the user is misbehaving and a failure code is returned. Otherwise, set V[id] = 3 and permit the user's agent to proceed with Step 13. Finally, set V[id] = 0 just prior to Steps 21, 25, and 27.

We implemented a version of the Exit Room primitive that includes special exit code. Exit code is assigned to a room using an Assign Exit Code primitive that takes a pointer to a function and a pointer to the arguments to the function. The exit code is executed by the last user to be done, prior to searching for the next active room. Thus we are guaranteed that the exit code is executed once, and only after all users granted access to the room are no longer inside the room, but before any users can gain subsequent access to any room. We have found the exit code to be quite useful in our applications of room synchronization (an example is given later in Figure 4 and is also used in our experiments of Section 4). Intuitively, the exit code can be viewed as enabling functionality such as "the last one to leave the room turns out the lights". The need for this functionality does not arise with mutual exclusion, because there is only a single user inside the critical section at a time.

3. APPLICATIONS

In this section we describe two additional applications of room synchronizations: a shared queue and a dynamic shared stack.

FIFO queue. The implementation of a linearizable FIFO queue is given in Figure 3. The newQueue routine creates a new queue object, including allocating an array of a fixed size mysize and calling createRooms to create two rooms associated with the queue (one for enqueuing and one for dequeuing). The queue object contains top, which points to the top of the queue (i.e., the next location to insert an element), and bot, which points to the bottom of the queue (i.e., the first element to remove). The implementation properly checks for overflow and underflow (emptiness).
1  struct queue {
2    val *A;
3    unsigned size;
4    int top, bot;
5    Rooms_t *r;
6  } Queue_t;

7  Queue_t *newQueue(int mysize) {
8    Queue_t *q = (Queue_t *)
9      malloc(sizeof(Queue_t));
10   q->A = (val *) malloc(
11     mysize * sizeof(val));
12   q->size = mysize;
13   q->top = q->bot = 0;
14   q->r = createRooms(2);
15   return q;
16 }

17 int ENQUEUEROOM = 0;
18 int DEQUEUEROOM = 1;

19 int enqueue(val y, Queue_t *q) {
20   int status = SUCCESS;
21   enterRoom(q->r, ENQUEUEROOM);
22   int j = fetchAdd(&q->top,1);
23   if (j - q->bot >= q->size) {
24     fetchAdd(&q->top,-1);
25     status = OVERFLOW;
26   }
27   else q->A[j % q->size] = y;
28   exitRoom(q->r);
29   return status;
30 }

31 val dequeue(Queue_t *q) {
32   val x;
33   enterRoom(q->r, DEQUEUEROOM);
34   int j = fetchAdd(&q->bot,1);
35   if (q->top - j <= 0) {
36     fetchAdd(&q->bot,-1);
37     x = EMPTY;
38   }
39   else x = q->A[j % q->size];
40   exitRoom(q->r);
41   return x;
42 }

Figure 3: The code for a parallel queue using room synchronizations.
In the case of overflow during an enqueue, the element is not inserted and the top pointer is not incremented (the implementation first increments it, but then goes back and decrements it). Similarly, in the case of an empty queue, the bot pointer is not incremented. Assuming the int type is of fixed precision, then bot and top can both overflow. In the proof below we assume there is no overflow. The code, however, works correctly even with overflow as long as the range of int is greater than 2 * (q->size + P) for P processes, and as long as the range of int is a multiple of q->size.^8 Alternatively, we could use exit code (see the end of the previous section) to reset the counter when it is close to overflowing.

^8 As with the rooms code, this takes advantage of twos-complement arithmetic and is sensitive to the particular way the comparisons are made in Steps 22 and 34.

Theorem 2. The algorithm of Figure 3 implements a linearizable FIFO queue, such that both the enqueue and the dequeue operations take O(t_f + δ) time regardless of the number of concurrent users, where t_f is an upper bound on the time for a fetch-and-add, and δ is an upper bound on the time for any other instruction.

Proof: We first show the queue is linearizable. Consider a collection of p users executing enqueue and dequeue operations on a queue, and p agents executing enterRoom and exitRoom operations in response to user requests, as indicated in the code. Let α be the execution comprising the actions by both users and agents and the complete state after each action. Consider the subsequence of the actions in α comprised of the fetch-and-add actions generated by Step 22 of the enqueue operation and Step 34 of the dequeue operation. We call these the commit actions. We argue that the ordering of these commit actions specifies a proper linearized order of the corresponding queue operations.

Consider the enqueue. We call any region between a commit action of an enqueue and the next EnterRoomGrant(DEQUEUEROOM) action a safe enqueue region. Because of property P2 of the rooms, this region will contain no actions from the body of the dequeue. In particular q->bot will not change, and no dequeue will commit. Consider an enqueue operation. In the case that the queue is not full (the else branch is taken) the committing fetch-and-add effectively reserves a location to put the item to enqueue. Any commits from other enqueues coming after but within the safe enqueue region will reserve higher, and hence later in "time", locations in the queue (because of the properties of fetch-and-add). Also, all writes to the reserved locations in Step 27 will complete while still in the safe enqueue region (because of property P2 of the rooms). Hence these enqueues will have the proper linear order. If an enqueue operation does return OVERFLOW (the if branch is taken), then any other enqueues coming after but within the safe enqueue region will also return OVERFLOW. This is because only an enqueue that increments q->top such that q->top - q->bot > q->size will take the if branch. By decrementing q->top by one in Step 24 it cannot make it the case that q->top - q->bot < q->size. Since no dequeue can commit in a safe enqueue region, having all future enqueues within the region return OVERFLOW is the expected result of the specified linear order. The argument for the proper linearized order of the dequeues is similar to that for enqueues.

In regards to time, we note that the time that a process is in a room is bounded by the time for two fetch-and-adds (on Steps 22 and 24 of enqueue, for example), and a constant number of other standard instructions (i.e., reads, writes, arithmetic operations and conditional jumps). Each user is therefore in a room for at most Δ = 2t_f + O(δ) time. Based on Theorem 1, the maximum time a processor will wait to enter or exit a room is proportional to t_f, Δ, and m (e.g., the time to enter a room is bounded by m(Δ + 2t_f + O(δ))). Therefore the total time to enter, process, and exit is O(t_f + δ).

Dynamic stack. We now consider the implementation of a linearizable dynamic stack. In a dynamic stack we assume the size of the stack is not known ahead of time and hence the space allocated for the stack must be capable of growing dynamically. Each time it grows, we double the allocated space, and the old stack is copied to the new larger one. In practice such dynamic stacks are very important. If an application uses a collection of stacks that share the same pool of memory, it is crucial to minimize the space needed by each stack (i.e., allocating the maximum that each might possibly need is impractical). Our implementation is shown in Figure 4. The copying from a smaller to a larger stack is implemented incrementally. We assume that INITSIZE is greater than the maximum number of concurrent users.
int PUSHROOM = 0;
int POPROOM = 1;

struct stack {
  val *A;      /* current stack */
  val *B;      /* prev stack */
  int copy;    /* being copied? */
  int start;
  long size, top;
  Rooms_t *r;
} Stack_t;

Stack_t *newStack() {
  Stack_t *s = (Stack_t *)
    malloc(sizeof(Stack_t));
  s->A = (val *) malloc(
    INITSIZE * sizeof(val));
  s->B = NULL;
  s->top = s->copy = s->start = 0;
  s->size = INITSIZE;
  s->r = createRooms(2);
  assignExitCode(s->r, PUSHROOM,
    pushRoomExit, s);
  return s;
}

void pushRoomExit(void *arg) {
  Stack_t *s = (Stack_t *) arg;
  if (s->start) {
    s->start = 0;
    s->copy = 1;
    s->size = 2*s->size;
    free(s->B);
    s->B = s->A;
    s->A = (val *) malloc(
      s->size * sizeof(val));
  }
}

void push(val y, Stack_t *s) {
  enterRoom(s->r, PUSHROOM);
  int j = fetchAdd(&s->top,1);
  if (j >= s->size) {
    s->start = 1;
    fetchAdd(&s->top,-1);
    exitRoom(s->r);
    push(y, s);
    return;
  }
  if (s->copy) {
    if (j >= s->size/2) {
      s->A[j - s->size/2] =
        s->B[j - s->size/2];
      s->A[j] = y;
    }
    else s->B[j] = y;
  }
  else s->A[j] = y;
  exitRoom(s->r);
}

val pop(Stack_t *s) {
  val x;
  enterRoom(s->r, POPROOM);
  int j = fetchAdd(&s->top,-1);
  if (j <= 0) {
    fetchAdd(&s->top,1);
    x = EMPTY;
  }
  else if (s->copy && j <= s->size/2)
    x = s->B[j-1];
  else x = s->A[j-1];
  exitRoom(s->r);
  return x;
}

Figure 4: The code for a parallel dynamic stack.
The pushRoomExit routine is assigned as the exit code to the PUSHROOM. For simplicity, we have left out some details, such as handling possible failures of malloc.

Theorem 3. The algorithm of Figure 4 implements a linearizable stack of dynamic size, such that both the push and the pop operations take O(t_f + t_a + δ) time regardless of the number of concurrent users, where t_f is an upper bound on the time for a fetch-and-add, t_a is the upper bound on the time for a call to malloc or free, and δ is an upper bound on the time for any other instruction.

We prove this in the full paper [4]. We note that since the blocks of memory are powers of two, allocating from a shared pool should be quite cheap using a buddy system (i.e., t_a should be small).

4. EXPERIMENTAL RESULTS

In this section we describe the results of a benchmark that uses a shared stack to distribute work across processors. The experiments were performed on a Sun UltraEnterprise 10000 with 64 250MHz UltraSparc-II processors. This is a shared-memory machine with a compare-and-swap instruction, but no fetch-and-add. The fetch-and-add is therefore simulated using the compare-and-swap. We demonstrate the effectiveness of room synchronizations by comparing the performance of the shared stack using rooms and using locks. We also implemented the Treiber non-blocking stack [21]. However, the performance of the non-blocking stack was so poor, due to a large overhead and general unscalability, that we did not include the results in this paper.

The benchmark is loosely structured after the parallel graph traversal of a garbage collector [5]. When the benchmark first executes, the shared stack is initialized with 16,000 nodes with a count of 11. Each processor repeatedly pops 500 nodes from the shared stack onto its local stack, operates on the nodes, and then pushes the entire contents of its local stack back to the shared stack. The pop of 500 nodes and push of the local stack are each done within a single room (for the room synchronization version) or a single lock (for the locked version). A node with a count of k (> 0) turns into two nodes of count k - 1, and a node with a zero count simply disappears. In other words, each of the original 16,000 nodes is the root node of a fully balanced binary tree of depth 11. In addition to processing the nodes on the local stack, each processor waits for a random amount of time between popping and pushing. The random time is selected uniformly between 0 and 2t_k, where t_k is a parameter of the experiment, and is meant to represent the work associated with processing a stack element. In the case of garbage collection, such additional work might include decoding objects, copying objects, and installing forwarding pointers.

An interesting problem posed by the shared stack benchmark (and also present in the garbage collector) is detecting termination, which should occur only when there are no nodes left for any processor to work on. One can correctly test whether the shared stack is empty by using suitably designed exit code in the push room. However, detecting that the shared stack is empty ignores the work on the local stacks of each processor in its work section (i.e., each processor not trying to enter the pop room and not having just left the push room). To solve this, we centralize the emptiness condition by including in the shared stack a counter indicating how many local stacks have borrowed items from the shared stack and are (possibly) non-empty. This counter is updated when a processor performs a push or pop. Termination is triggered when the shared stack is empty and the counter is simultaneously zero.

For each setting of the parameters (locks vs. rooms, number of processors (1 to 30), and an amount of additional work), the experiments were run 5 times and the averages
are reported. Figure 5 shows three graphs with differing amounts of additional wait time (the parameter t_k). The graphs correspond, from left to right, to applications where the time to process an item is 40%, 100%, or 600% of the time it takes to transfer the items from the shared stack to the local stack. In each graph, the bottom line (widely-spaced dots) represents the work for the uniprocessor case when no synchronization is performed and no wait time is introduced. The next line (dotted) adds a varying amount of work reflected in the distance from the bottom line. Finally, the solid and dashed curves represent the rooms and locks versions of the benchmark with synchronization and additional work. The vertical axis represents the total work performed, which is calculated as the product of wall-clock time and number of processors. In all cases perfect speedup corresponds to the flat thinly-spaced dotted line.

[Figure 5 appears here: three plots titled "Elapsed Time (Rooms vs. Locks)" with 40%, 100%, and 600% additional wait. Each plots Work (Elapsed Time in Seconds * Processors) against Number of Processors (0 to 30), with curves labeled Linear Speedup - No Wait, Linear Speedup - Wait, Rooms - Wait, and Locks - Wait.]

Figure 5: Total work (elapsed time in seconds * number of processors) vs. number of processors, for 40% additional wait time, 100% additional wait time, and 600% additional wait time.

The rooms synchronization has good performance, introducing an overhead approximately equal to basic stack transfer time (without synchronization or additional work). The overhead is mostly independent of the number of processors, indicating that the speedup achieved is (after including the overhead) linear. Additionally, the magnitude of the overhead is also independent of the amount of additional work introduced.

In contrast, at few processors, the locks version has almost no overhead. However, as the number of processors increases, the contention in acquiring locks increases, causing a rapid performance degradation. The point at which this transition occurs varies from immediately to 10 processors and 20 processors for the three wait times, respectively. This trend is expected as the introduction of additional work between popping and pushing means that the processors spend less time locking the push and pop code.

Given that we simulate the fetch-and-add with a compare-and-swap, which sequentializes the fetch-and-add, one might wonder why the room version scales so well. Recall, however, that we need only execute the fetch-and-add a small constant number of times per room access: once when entering, once when leaving, and twice in each of the push and pop codes. This compares with the 500 elements that have to be copied off the stack during a pop, and from 0 to 1000 that need to be copied back onto it during a push. The proportion of instructions spent in code that might be sequentialized is therefore small. Asymptotically our code would eventually cause the work to increase linearly with the number of processors, but the slope would be much less than that of the locked version. This is a subtle and crucial point about room synchronization and a reason why our heavy use of fetch-and-adds does not preclude good performance even on machines that do not support a parallel fetch-and-add.

5. RELATED WORK AND DISCUSSION

Synchronization. There is a long history of synchronization models and synchronization constructs for parallel and distributed computation. At the one end of the spectrum, there are synchronous models such as the PRAM, in which the processors execute in lock-step and there is no charge for synchronization. Shared data structure design is simplified by not having to deal with issues of asynchrony. Bulk-synchronous models such as the BSP [23] or the QSM [7] seek to retain the simplicity of synchronous models, while permitting the processors to run asynchronously between barrier synchronizations (typically) among all the processors. Algorithms designed for these models are necessarily blocking (due to the barrier synchronizations). For the loosely synchronous applications considered
in this paper, there are significant overheads in implementing shared data structures using barrier synchronizations, because all the processors must coordinate/wait even if they are not currently accessing the data structure. To reduce the number of barriers, coordination primitives such as parallel prefix or fetch-and-add are still needed to enable processors to make parallel updates to the data structure. In loosely synchronous applications having multiple classes of operations that must be separated (e.g., the push class and the pop class) and unpredictable arrival times for operation requests, it is desirable to have flexibility in selecting which class is permitted during the next phase. The application would be forced to implement this flexibility, and reason about fairness and other issues, whereas this flexibility is already encapsulated in room synchronizations.

At the other end of the synchronization models spectrum are the fully asynchronous models, in which processors can be arbitrarily delayed or even fail, and shared data structures are designed to tolerate such delays and failures. Wait-free data structures [10] have the property that any user's request (e.g., a push or pop request) will complete in a bounded number of steps, regardless of the delays or failures at other processors. Because of the large overheads in wait-free data structures, there has been considerable work on non-blocking (or lock-free) data structures [10], which only require that some user's request will complete in a bounded number of steps (although any particular user can be delayed indefinitely). Examples of work on non-blocking data structures include [1, 2, 9, 10, 11, 17, 18, 24, 25]. Most of these implementations still fully sequentialize access to the data structure. Moreover, they often require unbounded memory^9, or the use of atomic operations on two or more words of memory (such as a double compare-and-swap or transactional memory [12, 20]). Such operations are significantly more difficult to implement in hardware than single-word atomic operations. Thus, wait-free and non-blocking data structures are essential in contexts where the primary goal is making progress in highly-asynchronous environments, but there is a significant cost to providing their guarantees.

^9 In such algorithms, memory can never be reused, because a delayed processor may still have a pointer to the old memory location. A discussion of this problem for a lock-free queue algorithm is in [24]. A compare-and-swap is used to insert or delete from the queue, so a delayed processor should fail in its compare-and-swap because the queue has changed in the meantime. However, if the memory is reused, then the compare-and-swap may succeed when it should fail.

Room synchronizations are designed for asynchronous settings more concerned with fast parallel access (and bounded memory) than with providing non-blocking properties. In other words, settings somewhat in between those suitable for bulk-synchronous models and those suitable for fully asynchronous models. In the experimental results in this paper, as well as experiments with our parallel garbage collector, we have obtained good performance with room synchronizations on the Sun UltraEnterprise 10000, a 64-processor shared-memory machine, indicating that room synchronizations are suitable for that machine. We expect similar performance on other shared-memory machines such as the SGI Power Challenge and the Compaq servers.

We note that our experiments are run in an environment in which each process is mapped to one processor. This means that it is unlikely that a process will be swapped out (context switched) by the operating system while inside a room. There are a couple of potential mechanisms to deal with the case where the operating system could swap out a process while inside a room. First, the interrupt for a context switch might be delayed until the exitRoom. This can be achieved on most processors by temporarily disabling certain kinds of interrupts while inside a room. A second potential solution is to have special interrupt handler code that restores the state of the process to a point in which it is safe to exit the room, and then exit the room before submitting to the context switch.

Locks and other constructs. Traditional implementations of shared data structures use locks. Often locks are distinguished as either shared or exclusive. A shared lock permits other users to also be granted shared locks to the data structure, but forbids the granting of an exclusive lock. An exclusive lock forbids the granting of any other locks. Room synchronization can be viewed as providing a lock that is shared among those requesting the same room, but exclusive to the shared locks for other rooms. This greater sharing implies greater concurrency.

The closest algorithm to any of ours is an algorithm by Gottlieb, Lubachevsky and Rudolph for parallel queues [8]. Like ours, the algorithm works with unpredictable arrival times of requests, is based on the fetch-and-add operation, and can fully parallelize access. Also like ours, it is not non-blocking. It, however, has some important disadvantages compared to ours. Firstly, it is not linearizable: the following can occur on two processors:

    P1                    P2
    enqueue(v1)           enqueue(v2)
                          v1 <- dequeue()
    EMPTY <- dequeue()

Secondly, the algorithm requires a lock (or counter) for every element of the queue. This both requires extra memory, and requires manipulating this lock for every insert and delete. In our solution it is easy to batch the inserts or deletes, as was done in our experiments. Thirdly, the technique does not appear to generalize to other data structures such as stacks. The technique does have one advantage, which is that the blocking is at a finer grain: at each location rather than across the data structure.

Gottlieb, Lubachevsky and Rudolph [8] also provide justification for why a parallel fetch-and-add operation can be implemented efficiently in hardware, and indeed, the Ultracomputer they subsequently built provided a parallel fetch-and-add. In addition to queues they showed how the fetch-and-add could be used for various other operations, including a solution to the readers-writers problem, which allows a shared lock for the readers of a data structure, but an exclusive lock for the writers. Room synchronization can be viewed as extracting from these algorithms (and other previous algorithms using fetch-and-add) a synchronization construct useful for many problems.

In our earlier work on garbage collection we briefly described another, weaker version of the room synchronizations [3]. That version had a couple of important disadvantages over the version described in this paper. Firstly, it did not guarantee properties P5-P10, unless you assumed certain properties of the exact time taken by each instruction. In particular, one process could be starved for an arbitrary amount of time if other processes kept going through the
room very quickly (fast enough that the process missed doing a check of a flag for the brief time the room is open). This could even be true if all processors are running at the same rate. Secondly, the room synchronizations required that each processor, once entering the first room, had to also go through the second room. Thirdly, it only considered synchronization with two rooms. Finally, the conditions on what constitutes a room synchronization were not formalized. In our new implementation of the garbage collector [5], we use the implementation discussed in this paper.

As an abstraction, room synchronization is distinct from all previous abstractions known to us. It differs from mutual exclusion by introducing rooms that are mutually exclusive to each other, but allowing concurrent access to a given room. In more general abstractions, such as the resource allocation abstraction [16] of which the dining philosophers problem is a well-known specific instance, each user conflicts with a subset of the other users, and only nonconflicting sets of users can proceed in parallel. It is not clear how one would emulate rooms using this formalization.

Finally, there have been a number of papers describing techniques for reducing the contention in accessing shared data structures (e.g., [8, 19, 21]). These techniques are complementary to room synchronization, and perhaps can be exploited in the implementation of rooms by increasing the concurrency.

6. CONCLUSIONS

We presented a class of synchronizations called room synchronizations that are likely to be useful in a context that lies between highly synchronous models such as the PRAM or BSP model, and highly asynchronous models where it is assumed processors can stall, fail, or become disconnected. In particular, room synchronizations can handle requests that come in at arbitrary times, and from arbitrary subsets of the processors. They, however, are blocking, and hence if a processor fails in certain critical regions of the code, the other processors can become blocked.

Based on room synchronizations we presented simple and efficient implementations of shared stacks and queues. To the best of our knowledge these are the first implementations of stacks and queues that are linearizable, handle asynchronous requests, and allow for constant-time access (assuming a constant-time fetch-and-add).

Acknowledgements

Thanks to Toshio Endo and the Yonezawa Laboratory for use of their 64-way Sun UltraEnterprise 10000.

7. REFERENCES

[1] O. Agesen, D. L. Detlefs, C. H. Flood, A. T. Garthwaite, P. A. Martin, N. N. Shavit, and G. L. Steele Jr.
[5] P. Cheng and G. E. Blelloch. A parallel, real-time garbage collector. In Proc. ACM SIGPLAN'01 Conf. on Programming Languages Design and Implementation, June 2001.
[6] M. J. Fischer, N. A. Lynch, J. E. Burns, and A. Borodin. Distributed FIFO allocation of identical resources using small shared space. ACM Trans. on Programming Languages and Systems, 11(1):90-114, Jan. 1989.
[7] P. B. Gibbons, Y. Matias, and V. Ramachandran. Can a shared-memory model serve as a bridging model for parallel computation? Theory of Computing Systems, 32(3):327-359, 1999. Preliminary version appeared in SPAA'97.
[8] A. Gottlieb, B. D. Lubachevsky, and L. Rudolph. Basic techniques for the efficient coordination of very large numbers of cooperating sequential processors. ACM Trans. on Programming Languages and Systems, 5(2):164-189, Apr. 1983.
[9] M. Greenwald. Non-Blocking Synchronization and System Design. PhD thesis, Stanford University, Palo Alto, CA, 1999. Tech. report STAN-CS-TR-99-1624.
[10] M. P. Herlihy. Wait-free synchronization. ACM Trans. on Programming Languages and Systems, 13(1):123-149, Jan. 1991.
[11] M. P. Herlihy and J. E. B. Moss. Lock-free garbage collection for multiprocessors. IEEE Trans. on Parallel and Distributed Systems, 3(3):304-311, May 1992. Preliminary version in SPAA'91.
[12] M. P. Herlihy and J. E. B. Moss. Transactional memory: Architectural support for lock-free data structures. In Proc. 20th International Symp. on Computer Architecture, pages 289-300, May 1993.
[13] M. P. Herlihy and J. M. Wing. Axioms for concurrent objects. In Proc. 14th ACM Symp. on Principles of Programming Languages, pages 13-26, Jan. 1987.
[14] M. P. Herlihy and J. M. Wing. Linearizability: A correctness condition for concurrent objects. ACM Trans. on Programming Languages and Systems, 12(3):463-492, July 1990.
[15] L. Lamport. A new solution of Dijkstra's concurrent programming problem. Communications of the ACM, 17(8):435-455, Aug. 1974.
[16] N. A. Lynch. Distributed Algorithms. Morgan Kaufmann, San Francisco, CA, 1996.
[17] M. M. Michael and M. L. Scott. Simple, fast, and practical non-blocking and blocking concurrent queue algorithms. In Proc. 15th ACM Symp. on Principles of Distributed Computing, pages 267-275, May 1996.
[18] M. C. Rinard. Effective fine-grain synchronization for automatically parallelized programs using optimistic synchronization primitives. ACM Trans. on Computer Systems, 17(4):337-371, Nov. 1999.
[19] N. Shavit and D. Touitou. Elimination trees and the construction of pools and stacks. In Proc. 7th ACM Symp. on Parallel Algorithms and Architectures, pages 54-63, June 1995.
[20] N. Shavit and D. Touitou. Software transactional memory. Distributed Computing, 10(2):99-116, Feb. 1997.
[21] N. Shavit and A. Zemach. Combining funnels. In
DCAS-based concurrent deques. In Proc. 12th ACM
Proc. 17th ACM Symp. on Principles of Distributed
Symp. on Paral lel Algorithms and Architectures, pages
Computing, pages 61{70, June-July 1998.
137{146, July 2000.
[22] R. K. Treib er. Systems programming: Coping with
[2] G. Barnes. A metho d for implementing lo ck-free shared
parallelism.Technical Rep ort RJ 5118, IBM Almaden
data structures. In Proc. 5th ACM Symp. on Paral lel
ResearchCenter, Apr. 1986.
Algorithms and Architectures, pages 261{270, June 1993.
[23] L. Valiant. A bridging mo del for parallel computation.
[3] G. E. Blello ch and P. Cheng. On b ounding time and space
Communications of the ACM, 33(9):103{111, Sept. 1990.
for multipro cessor garbage collection. In Proc. ACM
[24] J. D. Valois. Implementing lo ck-free queues. In Proc. 7th
SIGPLAN'99 Conf. on Programming Languages Design
International Conf. on Paral lel and Distributed Computing
and Implementation, pages 104{117, May 1999.
Systems, pages 64{69, Oct. 1994.
[4] G. E. Blello ch, P. Cheng, and P. B. Gibb ons. Ro om
[25] J. D. Valois. Lock-Free Data Structures. PhD thesis,
synchronizations. Technical rep ort, Carnegie Mellon
Rensselaer Polytechnic Institute, Troy, NY, 1995.
University, Pittsburgh, PA, Apr. 2001.