
degrades linearly with the number of processors (because accesses are sequentialized), our version scales almost perfectly, beyond an initial cost for going from one to two processors. The room synchronizations introduced in this paper have been incorporated into our real-time garbage collector code [5], where they are used extensively.

One disadvantage of room synchronizations is that if a user fails or stops while inside a room, the user will block other users from entering another room. This property is clearly not desirable under all conditions, but in many conditions, including the garbage collector, it is not a problem. We discuss this issue in Section 5.

1.1 A Motivating Example

To motivate our problem, we consider implementing a parallel stack using a fetch-and-add. We assume the stack is stored in an array A and the index top points to the next free location in the stack (the stack grows from 0 up). The fetchAdd(ptr,cnt) operation adds cnt to the contents of location ptr and returns the old contents of the location (before the addition). We assume this is executed atomically. Consider the stack code shown in Figure 1(a). Assuming a constant-time fetchAdd, the push and pop operations will take constant time. The problem is that they can work incorrectly if a push and a pop are interleaved in time. For example, in the following interleaving of instructions

  j = fetchAdd(&top,1);    // from push
  k = fetchAdd(&top,-1);   // from pop
  x = A[k-1];              // from pop
  A[j] = y;                // from push

the pop will return garbage. Without an atomic operation that changes the counter at the same time as modifying the stack element, we see no simple way of fixing this problem. One should note, however, that any interleaving of two or more pushes or two or more pops is safe. Consider pushes. The fetchAdd reserves a location to put the value y, and the write inserts the value. Since the counter is only increasing, it does not matter what order relative to the increments the values are inserted.

Therefore, if we can separate the pushes from the pops, we would have a safe implementation of a stack. Room synchronizations allow us to do this as shown in Figure 1(b). The room synchronization guarantees that no two users will ever simultaneously be in the PUSHROOM and POPROOM, so the push and pop instructions will never be interleaved. However, it will allow any number of users to be in either the push or pop room simultaneously. In the full paper [4] we show that using our room synchronization (a) asynchronous accesses (pushes or pops) can arrive at any time, (b) every access is serviced in time proportional to the time of a fetch-and-add (under certain assumptions about all processors making progress), and (c) that this implements a linearizable stack. The experiments described in Section 4 are based on a variant of this stack in which multiple elements are pushed or popped within each room.

  /* (a) no synchronization */
  void push(val y) {
    int j;
    j = fetchAdd(&top,1);
    A[j] = y;
  }

  val pop() {
    val x;
    int k;
    k = fetchAdd(&top,-1);
    if (k <= 0) {
      fetchAdd(&top,1);
      x = EMPTY;
    }
    else x = A[k-1];
    return x;
  }

  /* (b) with room synchronization */
  void push(val y) {
    int j;
    enterRoom(PUSHROOM);
    j = fetchAdd(&top,1);
    A[j] = y;
    exitRoom();
  }

  val pop() {
    val x;
    int k;
    enterRoom(POPROOM);
    k = fetchAdd(&top,-1);
    if (k <= 0) {
      fetchAdd(&top,1);
      x = EMPTY;
    }
    else x = A[k-1];
    exitRoom();
    return x;
  }

Figure 1: The code for a parallel stack. (a) Does not work if the push and pop are interleaved in time. (b) Avoids this problem using a room synchronization.

Outline. The paper is organized as follows. Section 2 formally defines room synchronizations, and presents an algorithm that provably implements them. Section 3 shows how room synchronizations can be effectively used to implement shared queues and dynamic shared stacks. Section 4 presents our experimental results. Section 5 discusses further issues and related work, and Section 6 concludes.

2. ROOM SYNCHRONIZATIONS

The room synchronization problem involves a set of m rooms, and a set of p independent users who wish to access the rooms. A user wishing access to a room calls an Enter Room primitive, which returns once the user is granted permission to enter the room. The user is then said to be inside the room. When done with the room, the user exits the room by calling an Exit Room primitive. The rooms synchronization construct must ensure that at any point in time, there is at most one "open" room with users inside it. However, any number of users can be inside the open room.

In this section, we provide details on the room synchronization problem, discuss necessary and desirable properties for protocols implementing room synchronization, present a protocol that implements room synchronization, and show that this protocol satisfies all these properties.

2.1 Primitives

The basic primitives of room synchronization are:

Create Rooms. Given a positive integer m, create a rooms object R for a set of m rooms, and return a pointer (a reference) to R. There can be multiple rooms objects at the same time.

Enter Room. Given a pointer to a rooms object R and a room number i, try to enter room i of R. Return when the user has succeeded in entering the room. When the primitive returns, the user is said to be inside the room. A room with a user inside is said to be open.

Exit Room. Given a pointer to a rooms object R, exit the room in R that the user is currently inside. Because the user can be inside at most one room in R, there is no need to specify the room number. When a user requests to exit a room, it is no longer considered to be inside the room. If

there are no users remaining inside the room, the room is said to be closed.

Destroy Rooms. Given a pointer to a rooms object, deallocate the rooms object.

Other primitives. In addition to these four basic primitives, there is a Change Room primitive, which is equivalent to an Exit Room followed by an Enter Room, but with the guarantee of immediate entry into the requested room if it is the next room to be opened. There is also an Assign Exit Code primitive, discussed at the end of Section 2.3.

The Enter Room and Exit Room primitives can be viewed as the counterparts to the "trying" (e.g., lock) and "exit" (e.g., unlock) primitives for the mutual exclusion problem. As with the mutual exclusion problem, what the users do while inside the room (or critical section) is part of the application and not part of the synchronization construct. This enables the generality of the primitive, as the same construct can be used for a variety of applications. The drawback is that, as with mutual exclusion, the construct relies on the application to alternate entering and exiting requests by a given user.

The Create Rooms and Destroy Rooms primitives are executed once for a given rooms object, in order to allocate and initialize the object prior to its use and to deallocate the object once it is no longer needed. To simplify the discussions that follow, we will focus on a single rooms object, for which Create Rooms has already been executed, and Destroy Rooms will be executed once the object is no longer needed. Extending the formalizations and discussions to multiple rooms objects and to issues of creating and destroying objects is relatively straightforward.

2.2 Formalization

We can formalize the room synchronization problem using the I/O Automaton model [16]. Our terminology and formal model are an adaptation of those used in [16] for formalizing mutual exclusion.

Each user j is modeled as a state machine that communicates with an agent process p_j by invoking room synchronization primitives and receiving replies. The agent process p_j, also a state machine, works on behalf of user j to perform the steps of the synchronization protocol. Each agent process has some local private memory, and there is a global shared memory accessible by all agent processes. The set of agent processes, together with their memory, is called the protocol automaton. An action (an instruction step) is a transition in a state machine. We say an action is enabled when it is ready to execute. Actions are low-level atomic steps such as reading a shared memory location or incrementing a local counter. An execution is a sequence of alternating states and actions, beginning with a start state, such that each action is enabled in the state immediately preceding it in the sequence and updates that state to be the state immediately succeeding it. Thus actions are viewed as occurring in some linear order. (Although an execution is modeled as a linearized sequence of low-level actions, this is a completely general model for specifying parallel computations and studying their correctness properties: the state changes resulting from actions occurring in parallel are equivalent to those resulting from some linear order, or interleaving, of these actions, given that the actions are defined to be sufficiently low level.)

Asynchrony is modeled by the fact that actions from different agent processes can be interleaved in an arbitrary manner; thus one agent may have many actions between actions by another agent. We will consider various fairness metrics for executions and for room accesses. A weak form of fairness among the agent actions is the following: An execution is weakly-fair if it is either (a) finite and no agent action is enabled in the final state, or (b) infinite and each agent has infinitely many opportunities to perform an action (either there is an action by the agent or no action is enabled).

Certain actions are specially designated as external actions; these are the (only) actions in which a user communicates with its agent. For room synchronization, the external actions for a user j (and its agent) are:

- EnterRoomReq_j(i): the action of user j signalling to its agent p_j a desire to enter room i.
- EnterRoomGrant_j(i): the action of agent p_j signalling to user j that its Enter Room request has been granted.
- ExitRoomReq_j: the action of user j signalling to its agent p_j a desire to exit its current room.
- ExitRoomGrant_j: the action of agent p_j signalling to user j that its Exit Room request has been granted.

The trace (trace at j) of an execution is the subsequence of the execution consisting of its external actions (for a user j).

The terminology above focuses on modeling the agents that act on behalf of user requests, as needed to formalize both the room synchronization problem and protocols that provide room synchronizations. Section 3, on the other hand, focuses on modeling user applications, such as stacks and queues, that make use of room synchronizations. All of the terminology stated above for agents can be similarly defined in order to model users. For example, each user process has some local private memory, and there is a global shared memory accessible by all user processes. From the perspective of the present section, however, all users do is make requests to enter or exit rooms.

Necessary properties. We first state formally a condition on users of room synchronization and their agents. A trace at j of an execution for a rooms object with m rooms is said to be behaved if it is a prefix of the cyclically ordered sequence:

  EnterRoomReq_j(i_1), EnterRoomGrant_j(i_1), ExitRoomReq_j, ExitRoomGrant_j, EnterRoomReq_j(i_2), ...

where i_1, i_2, ... are in [1..m]. In other words, (i) the Enter Room and Exit Room primitives by a given user alternate, starting with an Enter Room, (ii) the user waits for a request to be granted prior to making another request, (iii) conversely, the agent waits for a request before granting a request and only grants what has been requested, and (iv) the requested room numbers are valid. We say a user j's requests are behaved if no request is the first misbehaved action in the trace at j (formally, there is no EnterRoomReq_j or ExitRoomReq_j in the trace at j such that the prefix of the trace up to but not including this action is behaved, but the prefix including the action is not behaved). In a behaved trace at j, EnterRoomReq_j(i) transitions user j from outside all rooms to

preparing to enter room i, EnterRoomGrant_j(i) transitions user j from preparing to enter room i to inside room i, ExitRoomReq_j transitions user j from inside to preparing to exit, and ExitRoomGrant_j transitions user j from preparing to exit to outside.

We can now state formally what it means for a protocol to implement room synchronization. A protocol automaton A solves the room synchronization problem for a given collection of users with behaved requests if the following properties hold:

P1. Trace behaved: In any execution, for any j, the trace at j is behaved. One implication is that only users requesting to enter a room are given access to the room, and only after its EnterRoomReq and before any subsequent ExitRoomReq.

P2. Mutual exclusion among rooms: There is no reachable state of A in which more than one room is open. (A stronger property is to require that at most one room can be open even if some user requests are not behaved.) Equivalently, in any execution, between any EnterRoomGrant_j(i) in the trace and the next ExitRoomReq_j (or the end of the trace if there is no such action) there are no EnterRoomGrant(i') actions for i' != i.

P3. Weakly-concurrent access to rooms: There are reachable states of A in which more than one user is inside a room.

P4. Global progress: At any point in a weakly-fair execution: (1) If there is a user j preparing to enter room i (i.e., its most recent external action is an EnterRoomReq_j(i)), and there are no rooms with users inside, then at some later point some user is inside some room (i.e., there is an EnterRoomGrant action). (2) If there is a user j preparing to exit a room (i.e., its most recent external action is an ExitRoomReq_j), then at some later point user j is outside the room (i.e., there is an ExitRoomGrant_j action).

Note that properties P1-P4 do not guarantee any fairness among rooms or among users, the delays in entering or exiting a room, etc. These are considered next.

Desirable properties. Properties P5-P11 define additional desirable properties for a protocol automaton A solving the room synchronization problem. It is useful to analyze protocol performance (such as in P6, P7, P9, and P10 below) assuming an upper bound δ on the (wall clock) time between successive actions by an agent. (The time bound only applies if the second action is enabled no later than when the first action occurs: an agent blocked waiting for, say, a request from its user can be arbitrarily delayed. Also, note that in the absence of a positive lower bound on the time between successive actions, we have not restricted the relative speeds of the agents. Finally, note that the time bound applies only to the analysis of the time performance, and not to any correctness (safety) properties.)

P5. No room starvation: In any weakly-fair execution: If all users inside a room eventually prepare to exit the room, then any closed room requested by at least one user is eventually opened.

P6. Bounded delay for rooms: In any execution with m rooms and p users: If each user is inside a room for at most time Δ, and the time between successive actions of each agent preparing to enter or preparing to exit is at most δ, then any closed room requested by at least one user is open within time T_R = T_R(Δ, δ, m, p).

P7. Constant delay for rooms: This is a stronger version of P6 in which T_R is independent of p.

P8. Lockout-freedom (i.e., no user starvation): In any weakly-fair execution: (1) If all users inside a room eventually prepare to exit the room, then any user preparing to enter a room eventually gets inside the room. (2) Any user preparing to exit a room eventually gets outside the room. Note that this is a stronger property than P5.

P9. Bounded time to enter and exit: In any execution, any user preparing to enter a room is inside the room within time T_1 = T_1(Δ, δ, m, p), and any user preparing to exit a room is outside the room within time T_2 = T_2(δ, m, p), where Δ, δ, m and p are as defined in P6.

P10. Constant time access: This is a stronger version of P9 in which T_1 and T_2 are independent of p. Note that this implies unbounded concurrent access to rooms (otherwise the time bound would necessarily depend on p).

P11. Demand driven: When a user is inside a room or outside all rooms, there are no actions by its agent. Thus, an agent performs work only in response to a request by its user.

2.3 A room synchronization protocol

We now present a protocol (an algorithm) for solving the room synchronization problem that satisfies all the properties P1-P11. The protocol assumes a linearizable shared memory [14] supporting atomic reads, writes, fetch-and-adds, and compare-and-swaps on single words of memory. (We have explicitly avoided atomic operations on two or more words of memory. We use only the weaker fetch-and-increment form of fetch-and-add. Also note that we could in principle implement rooms using only reads and writes, following ideas used, e.g., in the Bakery Algorithm [15] for mutual exclusion, at the cost of far greater delays.) In our actual implementation of this protocol, we do not assume a linearizable shared memory, and instead explicitly insert Memory Barriers into the code (details in the full paper [4]).

Consider a rooms object with m rooms. It includes three arrays of size m: wait, grant, and done. The arrays hold three counters for each room, all initially zero. It includes a numRooms field, set to m. It also includes an activeRoom field, which holds the room number of the (only) room that may be open. The special value -1 is used to indicate when there is no active room, e.g., initially and whenever there are no users either inside a room or waiting to enter a room.

The protocol is depicted in Figure 2. (We show only the code for enterRoom and exitRoom. The complete version of the code is available at: www.cs.cmu.edu/afs/cs/project/pscico/www/rooms.html) For conciseness and readability, we present C code instead of a full I/O Automaton specification. Agents enter a room by incrementing the wait counter to get a "ticket" for the room (Step 2), and then waiting until that ticket is granted (Steps 3-10).

 1  void enterRoom(Rooms_t *r, int i) {
 2    int myTicket = fetchAdd(&r->wait[i],1) + 1;        // get ticket for the room
 3    while (myTicket - r->grant[i] > 0) {               // wait until your ticket is granted
 4      if (r->activeRoom == -1) {                       // while waiting, if no active room
 5        if (compareSwap(&r->activeRoom,-1,i) == -1) {  // then make i the active room
 6          r->grant[i] = r->wait[i];                    // enable all with tickets to enter room i
 7          break;
 8        }
 9      }
10    }
11  }

12  int exitRoom(Rooms_t *r) {
13    int ar = r->activeRoom;                            // preparing to exit room ar
14    int myDone = fetchAdd(&r->done[ar],1) + 1;         // increment the done counter
15    if (myDone == r->grant[ar]) {                      // if last to be done
16      for (int k = 0, newAr = ar; k < r->numRooms; k++) {
17        newAr = (newAr + 1) % r->numRooms;             // go round robin thru the rooms
18        if (r->wait[newAr] - r->grant[newAr] > 0) {    // if ticketed waiters
19          r->activeRoom = newAr;                       // make it the active room
20          r->grant[newAr] = r->wait[newAr];            // enable all with tickets to enter room newAr
21          return 1;
22        }
23      }
24      r->activeRoom = -1;                              // no waiters found, so no active room
25      return 1;                                        // 1 indicates was last to be done
26    }
27    return 0;                                          // 0 indicates was not the last
28  }

Figure 2: C code for Enter Room and Exit Room.
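The rooms object r itself is left implicit in Figure 2. The following is a minimal sketch (ours, not the authors' code, whose full version is at the URL given above) of a Rooms_t and of Create Rooms/Destroy Rooms consistent with the fields described in the text: the wait, grant, and done counter arrays (all initially zero), the numRooms field, and the activeRoom field initialized to -1.

  #include <stdlib.h>

  typedef struct rooms {
    int *wait;       /* wait[i]: tickets issued for room i */
    int *grant;      /* grant[i]: tickets granted for room i */
    int *done;       /* done[i]: granted users that have exited room i */
    int numRooms;    /* m */
    int activeRoom;  /* the room that may be open, or -1 if none */
  } Rooms_t;

  Rooms_t *createRooms(int m) {
    Rooms_t *r = (Rooms_t *) malloc(sizeof(Rooms_t));
    r->wait  = (int *) calloc(m, sizeof(int));   /* all counters start at zero */
    r->grant = (int *) calloc(m, sizeof(int));
    r->done  = (int *) calloc(m, sizeof(int));
    r->numRooms = m;
    r->activeRoom = -1;                          /* initially no active room */
    return r;
  }

  void destroyRooms(Rooms_t *r) {
    free(r->wait); free(r->grant); free(r->done);
    free(r);
  }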

Agents exit a room by incrementing the done counter (Step 14). Once the done counter matches the grant counter (Step 15), then all agents granted access to the room have exited the room. The unique agent to increment the done counter up to the grant counter (the "last done") does the work of selecting the next active room (Steps 16-23). The grant counter of that active room is set to be equal to its current wait counter (Step 20), thereby granting all waiting tickets for that room. If the last done agent fails to discover a room with waiting tickets, it resets activeRoom to -1 (Step 24). Whenever activeRoom = -1 (Step 4), a ticketed agent can self-select its requested room as the next active room (Step 5), and grant all waiting tickets for that room (Step 6).

Theorem 1. The room synchronization protocol in Figure 2 satisfies properties P1-P11.

Proof. Due to page constraints, we only present the proofs of property P1 (trace behaved) and property P2 (mutual exclusion among rooms). Other properties are discussed briefly, with the details left to the full paper [4]. We also restrict our attention to the case where wait, grant, and done are unbounded counters, and hence they are monotonically nondecreasing. For ease of description, we view each step of the C code as an atomic action. This is done without loss of generality for all steps that access at most one shared memory location. For the other steps (Steps 6, 18, 20), the proof is readily extended to have separate atomic actions for each shared memory access. Finally, to simplify the notation, we omit explicit reference to the rooms object pointer r, e.g., we use enterRoom(i) instead of enterRoom(r, i).

Property P1. In the protocol of Figure 2, an EnterRoomReq_j(i) (ExitRoomReq_j) action corresponds to the user j initiating a procedure call to enterRoom(i) (exitRoom, respectively). An EnterRoomGrant_j(i) (ExitRoomGrant_j, respectively) action corresponds to the completion and return of this procedure. Consider any execution and any user j with behaved requests. An EnterRoomGrant_j(i) (ExitRoomGrant_j) action cannot be the first misbehaving action in the trace at j, because it can occur in the trace only immediately after the matching EnterRoomReq_j(i) (ExitRoomReq_j, respectively) that initiated the procedure call. Thus the trace at j is behaved.

Property P2. To prove mutual exclusion, we will need the following definitions and lemmas. For an execution α, let α|j be the subsequence of α consisting of its actions for a user j or its agent p_j. A user j has a ticket for a room i after an execution α for each Step 2 of enterRoom(i) in α|j with no subsequent Step 14 of exitRoom in α|j. A user j with a ticket for a room i is blocked after an execution α if myTicket at j is greater than grant[i]. A user j is in the advance room region after an execution α if Step 6, 16, 17, 18, 19, 20, or 24 is enabled, or Step 15 is enabled with a successful conditional test.

Lemma 1. Each user (with a behaved trace) has at most one ticket.

Proof. Suppose there exists an execution α such that a user j has multiple tickets after α. By the definition of having a ticket, for each such ticket, there is a Step 2 in α|j with no subsequent Step 14 in α|j. For each such Step 2, there is a preceding EnterRoomReq_j, but no subsequent ExitRoomGrant_j because Step 14 must precede any ExitRoomGrant_j. Thus the trace at j is not behaved, and hence property P1 fails to hold, a contradiction.

Lemma 2. In any execution α (with behaved traces), if a user j is inside room i after α, then j is an unblocked user with a ticket for room i.

Proof. If user j is inside room i, then because the trace at j is behaved, the last external action at j is EnterRoomGrant_j(i). The last occurrence of Step 2 of enterRoom(i) in

α|j precedes the EnterRoomGrant_j(i) in α, and there can be no subsequent Step 14 because there is no subsequent ExitRoomReq_j in α|j. Thus j has a ticket for room i. Moreover, suppose j were blocked. Then myTicket at j is greater than grant[i] after α. Let α = α_1 β α_2, where β is the above Step 2, α_1 is the prefix of α prior to β, and α_2 is the suffix of α after β. By Lemma 1 and examination of the code, we see that there is no possible step by user j that modifies myTicket at j: its value is the same in all states in α_2. Because grant[i] is a nondecreasing counter, user j's myTicket > grant[i] in all states in α_2. Thus j would never exit the enterRoom while loop, and hence there would be no EnterRoomGrant_j(i) in α_2|j, a contradiction. Thus user j is not blocked.

The heart of the mutual exclusion proof is the following lemma, which presents three invariants that also provide insight into the protocol.

Lemma 3. In any execution (with behaved traces):

1. For all rooms i, wait[i] - done[i] is the number of users with tickets for room i, and wait[i] - grant[i] is the number of blocked users with tickets for room i. (Thus grant[i] - done[i] is the number of unblocked users with tickets for room i.) All users with tickets for room i have myTicket <= wait[i].

2. At most one user is in the advance room region. If a user is in the advance room region, then activeRoom != -1 and for all rooms i, grant[i] = done[i].

3. If there exists an unblocked user with a ticket for room i then activeRoom = i. If Step 6 (14, 20) is enabled at a user j, then activeRoom = i (ar, newAr, respectively) at j and activeRoom != -1.

Proof. The proof is by induction on the number of actions in the execution. Initially, wait[i] = grant[i] = done[i] = 0, and all three invariants hold for the start state. Assume that all three invariants hold for all executions of t >= 0 actions. Consider an arbitrary execution α with t actions and consider all possible next actions β. W.l.o.g., assume that β is an action by user j or its agent. Let s_1 be the last state in α and s_2 be the updated state after β occurs.

To show invariant 1 holds in s_2, we must consider all the cases where β updates either myTicket, one of the counters, the number of ticketed users, or the number of blocked users, namely Steps 2, 6, 14, and 20. Step 2 of enterRoom(i) increments both wait[i] and the number of (blocked) users with tickets for room i (by Lemma 1). In addition, each myTicket is at most wait[i] in s_2. Thus invariant 1 is maintained. Step 6 of enterRoom(i) sets wait[i] - grant[i] to zero. It follows inductively by invariant 1 that any user with a ticket for room i in s_1 must have myTicket <= wait[i] = grant[i] in s_2, and hence cannot be blocked. Likewise, Step 20 maintains the invariant for room newAr. If Step 14 is enabled in s_1, then user j has a ticket (by definition) for some room i. Moreover, by an argument similar to that in the proof of Lemma 2, j is not blocked. Thus inductively by invariant 3, activeRoom = i = ar in s_1. Step 14 increments done[ar], and by Lemma 1, it decrements the number of users with tickets for room ar, so the invariant is maintained. Hence, in all cases, invariant 1 holds in s_2.

To show invariant 2 holds in s_2, we again consider each relevant case for β, namely, Steps 5-6, 14-20, and 24. If β is a Step 5 that succeeds in enabling Step 6 in s_2, then activeRoom = -1 in s_1 (otherwise, the compareSwap would not return -1). Inductively by invariants 1 and 3, grant[i] = done[i] for all rooms i in s_1, and hence in s_2. Inductively by invariant 2, there are no users in the advance room region in s_1. Thus the invariant is maintained. If β is a Step 6, 15 (with a successful conditional), 16-20, or 24, then user j is in the advance room region in s_1. Inductively by invariant 2, j is the only such user in s_1. If β is a Step 6, 20, or 24, then there are no users in the advance room region in s_2, and the invariant is maintained. On the other hand, if β is a Step 15-19, then inductively by invariant 2 and the fact that none of these steps add a user to the advance room region, set activeRoom = -1, update grant, or update done, the invariant is maintained. If β is a Step 14, then, as argued above, Step 14 increments done[ar], where j is an unblocked user with a ticket for a room ar in s_1. Inductively by invariants 1 and 3, grant[i] = done[i] in s_1 and hence s_2 for all rooms i != ar, and grant[ar] > done[ar] in s_1. Thus inductively by invariant 2, there is no user in the advance room region in s_1. In order for j to be in the advance room region in s_2, myDone at j must equal grant[ar] in s_2 (so that Step 15 is enabled with a successful conditional). This occurs only if grant[ar] = done[ar] in s_2, because myDone = done[ar] after β. Hence, in all cases, invariant 2 holds in s_2.

Finally, to show invariant 3 holds in s_2, we consider each relevant case for β, namely, Steps 2, 5, 6, 13, 19, 20, and 24. If β is a Step 2, then by invariant 1 applied to both s_1 and s_2, the number of unblocked users with tickets for room i is unchanged. Moreover, activeRoom is unchanged, so the invariant is maintained. If β is a Step 5 that succeeds in changing activeRoom, then activeRoom = -1 in s_1 and activeRoom = i != -1 in s_2. Thus, inductively by invariant 3 and the fact that Step 5 does not create an unblocked ticketed user, the invariant is maintained. If β is a Step 6, then inductively by invariant 3, activeRoom = i in s_1 and hence in s_2. Step 6 can only unblock users with tickets for room i, so the invariant is maintained. The case for Step 20 is symmetric. If β is a Step 13, then activeRoom is the same in s_1 and s_2. For user j, β sets ar = activeRoom and enables Step 14. As argued above, j is an unblocked user with a ticket for a room ar in s_1; thus ar != -1. Step 13 neither creates a new unblocked ticketed user nor changes activeRoom, so inductively by invariant 3, the invariant is maintained. If β is a Step 19, then user j is in the advance room region in s_1, and hence inductively by invariant 2, there is no other user in the advance room region in s_1. Thus inductively by invariants 1 and 2, there are no unblocked ticketed users in s_1, and hence in s_2. As argued above, Step 14 is enabled at some user j' only if j' is an unblocked ticketed user. Thus Step 14 is not enabled at any user in s_2. Moreover, Step 6 and Step 20 are enabled at some user j' only if j' is in the advance room region. Thus neither step is enabled in s_1, and hence in s_2, with the exception of Step 20 being enabled at user j in s_2. But β sets activeRoom = newAr != -1, as is required. Hence, the invariant is maintained. The case for Step 24 is the same as for Step 19, except that Step 20 will not be enabled at user j. Hence, in all cases, invariant 3 holds in s_2.

This concludes the proof of Lemma 3.

To complete the proof of property P2, suppose there were an execution resulting in two distinct rooms i and i' with users j and j' inside the respective rooms. Then by Lemma 2, j (j') is an unblocked user with a ticket for room i (i', respectively). Thus by invariant 3 of Lemma 3, activeRoom equals both i and i', a contradiction.

Other properties. Let m be the number of rooms and p be the number of users. Intuitively, the remaining properties hold due to the following observations. (1) A user desiring a ticket will get a ticket within a constant number of its actions. (2) The last done when exiting a room i starts at room i + 1 and cycles through all m rooms (including back to i), granting access to the first room it encounters with ticketed users. (3) A room with ticketed users gets its turn within m turns or less. (4) Each ticketed user in the enterRoom while loop will get inside the room within a constant number of its actions after access is granted to the room. (5) If we place an upper bound Δ on the time a user is inside a room, and an upper bound δ on the time between successive actions of an agent preparing to enter or exit a room, then we can compute the time to enter or exit a room. Each turn for a room takes at most Δ + O(δ) time, thus in any execution, any user preparing to enter a room is inside the room in time T_1 <= (Δ + O(δ)) * m. Moreover, in any execution, any user preparing to exit a room is outside the room in time T_2 = O(δ * m) for the last done and O(δ) for all others.

In the above time analysis, we are implicitly assuming that δ is not a function of p, e.g., the time for a fetch-and-add is independent of p. If instead, we let t_f = t_f(p) be an upper bound on the time for a fetch-and-add with p processors (e.g., t_f = log p), then T_1 <= (Δ + 2t_f + O(δ)) * m and T_2 <= t_f + O(δ * m).

Note that all these time bounds and other properties hold even in the presence of arbitrarily fast agents who do their best to starve other agents. The reader is referred to the full paper [4] for the proofs.

Remarks. The technique of using a wait counter and a grant counter is used in mutual exclusion protocols such as TicketME [6, 16]. Mutual exclusion protocols are simpler because only one agent is granted access at a time. Thus the grant counter can double as the done counter. Also, the test for granting access is simply whether your ticket equals the grant counter, so modulo p counters are used for p users. With room synchronization, we need an inequality test between the ticket and the grant counter in order to admit multiple waiting users at once, so modulo p counting does not suffice. Instead, and to avoid unbounded counters, we rely on the fact that when integers are represented in twos-complement (and p is less than the maximum integer), then letting the counters wrap around to a negative number (ignoring the overflow) gives the desired result for any inequality test in the protocol. This is the reason for using myTicket - r->grant[i] > 0 instead of myTicket > r->grant[i] in the enterRoom code.
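The wrap-around behavior can be seen in a small standalone test (ours, not from the paper). It relies on the same two's-complement wrap-around the protocol assumes; strictly conforming C would perform the subtraction in unsigned arithmetic or compile with -fwrapv.

  #include <stdio.h>
  #include <limits.h>

  int main(void) {
    int grant = INT_MAX;                          /* last granted ticket before the counter wraps */
    int myTicket = (int)((unsigned)grant + 1u);   /* the next ticket, wrapped to a negative value */

    /* direct comparison: wrongly claims the ticket has already been granted */
    printf("myTicket > grant     : %d\n", myTicket > grant);        /* prints 0 */
    /* difference test used in Figure 2: correctly says the ticket is still waiting */
    printf("myTicket - grant > 0 : %d\n", myTicket - grant > 0);    /* prints 1 */
    return 0;
  }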

Note that activeRoom may be set to -1 even when there are users waiting with their tickets. However, such users must have grabbed their tickets after the last agent done checked to see if there were ticketed waiters. Moreover, although all agents granted access to a room are spinning waiting for that room to open, and hence will tend to proceed together into the room, in the worst case the agents may proceed to enter the room and exit the room at very different rates. Thus a room may open and close multiple times before all the granted agents are done with the room. Furthermore, an agent that is slow to grab a ticket may be bypassed by faster agents an unbounded number of times. This does not contradict properties P9 or P10, because within time t_f + O(δ) the slow agent will grab a ticket, and from there will proceed in constant time (a function of Δ, δ, and m) to enter inside the room. On the other hand, if all agents run at roughly the same speed, an agent can be bypassed for its desired room at most once.

Our protocol has the property that any user requesting to enter an already open room cannot enter the room until all users inside the room have exited.

Variants. To satisfy the stronger mutual exclusion among rooms property discussed in the note to property P2, it suffices to ignore all misbehaving requests, as follows. Associate with each room a vector V, one two-bit entry per user, indicating the effective status of the user, as outside (0), preparing to enter (1), inside (2), or preparing to exit (3). At the beginning of enterRoom (prior to Step 2), determine the id of the user, and if V[id] != 0 or the requested room number is invalid, the user is misbehaving and a failure code is returned. Otherwise, set V[id] = 1 and permit the user's agent to proceed with Step 2. At the end of enterRoom (just prior to Step 11), set V[id] = 2. Similarly, at the beginning of exitRoom (prior to Step 13), determine the id of the user, and if V[id] != 2, the user is misbehaving and a failure code is returned. Otherwise, set V[id] = 3 and permit the user's agent to proceed with Step 13. Finally, set V[id] = 0 just prior to Steps 21, 25, and 27.

We implemented a version of the Exit Room primitive that includes special exit code. Exit code is assigned to a room using an Assign Exit Code primitive that takes a pointer to a function and a pointer to the arguments to the function. The exit code is executed by the last user to be done, prior to searching for the next active room. Thus we are guaranteed that the exit code is executed once, and only after all users granted access to the room are no longer inside the room, but before any users can gain subsequent access to any room. We have found the exit code to be quite useful in our applications of room synchronization (an example is given later in Figure 4 and is also used in our experiments of Section 4). Intuitively, the exit code can be viewed as enabling functionality such as "the last one to leave the room turns out the lights". The need for this functionality does not arise with mutual exclusion, because there is only a single user inside the critical section at a time.
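As a toy illustration of this functionality (ours, not the authors' code; the names are hypothetical), exit code registered with Assign Exit Code can maintain a counter of completed room openings with no further synchronization, because only the single "last done" agent ever runs it:

  static int pushPhases = 0;            /* how many times the push room has closed */

  void lightsOut(void *arg) {
    int *phases = (int *) arg;
    (*phases)++;                        /* runs once per closing, after all users have left */
  }

  void installExitCode(Rooms_t *r, int pushRoom) {
    /* argument order follows the assignExitCode call in Figure 4:
       rooms object, room number, function pointer, argument pointer */
    assignExitCode(r, pushRoom, lightsOut, &pushPhases);
  }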

3. APPLICATIONS

In this section we describe two additional applications of room synchronizations: a shared queue and a dynamic shared stack.

FIFO queue. The implementation of a linearizable FIFO queue is given in Figure 3. The newQueue routine creates a new queue object, including allocating an array of a fixed size mysize and calling createRooms to create two rooms associated with the queue (one for enqueuing and one for dequeuing). The queue object contains top, which points to the top of the queue (i.e., the next location to insert an element), and bot, which points to the bottom of the queue (i.e., the first element to remove). The implementation properly checks for overflow and underflow (emptiness).

 1  typedef struct queue {
 2    val *A;
 3    unsigned size;
 4    int top, bot;
 5    Rooms_t *r;
 6  } Queue_t;

 7  Queue_t *newQueue(int mysize) {
 8    Queue_t *q = (Queue_t *)
 9      malloc(sizeof(Queue_t));
10    q->A = (val *) malloc(
11      mysize * sizeof(val));
12    q->size = mysize;
13    q->top = q->bot = 0;
14    q->r = createRooms(2);
15    return q;
16  }

17  int ENQUEUEROOM = 0;
18  int DEQUEUEROOM = 1;

19  int enqueue(val y, Queue_t *q) {
20    int status = SUCCESS;
21    enterRoom(q->r, ENQUEUEROOM);
22    int j = fetchAdd(&q->top,1);
23    if (j - q->bot >= q->size) {
24      fetchAdd(&q->top,-1);
25      status = OVERFLOW;
26    }
27    else q->A[j % q->size] = y;
28    exitRoom(q->r);
29    return status;
30  }

31  val dequeue(Queue_t *q) {
32    val x;
33    enterRoom(q->r, DEQUEUEROOM);
34    int j = fetchAdd(&q->bot,1);
35    if (q->top - j <= 0) {
36      fetchAdd(&q->bot,-1);
37      x = EMPTY;
38    }
39    else x = q->A[j % q->size];
40    exitRoom(q->r);
41    return x;
42  }

Figure 3: The code for a parallel queue using room synchronizations.
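As a usage sketch (ours, not from the paper), the queue could be exercised as follows; the val type and the SUCCESS, OVERFLOW, and EMPTY codes are left unspecified by Figure 3, so placeholder definitions are assumed here, and in a real program they would precede the Figure 3 code.

  #include <stdio.h>

  /* assumed placeholders for identifiers Figure 3 leaves unspecified */
  typedef long val;
  #define SUCCESS  0
  #define OVERFLOW 1
  #define EMPTY    ((val) -1)

  int main(void) {
    Queue_t *q = newQueue(1024);      /* newQueue creates the two rooms internally */
    if (enqueue(42, q) == SUCCESS)    /* runs inside the enqueue room */
      printf("enqueued 42\n");
    val x = dequeue(q);               /* runs inside the dequeue room */
    if (x != EMPTY)
      printf("dequeued %ld\n", (long) x);
    return 0;
  }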

In the case of overflow during an enqueue, the element is not inserted and the top pointer is not incremented (the implementation first increments it, but then goes back and decrements it). Similarly in the case of an empty queue, the bot pointer is not incremented. Assuming the int type is of fixed precision, then bot and top can both overflow. In the proof below we assume there is no overflow. The code, however, works correctly even with overflow as long as the range of int is greater than 2 * (q->size + P) for P processes, and as long as the range of int is a multiple of q->size. (As with the rooms code, this takes advantage of twos-complement arithmetic and is sensitive to the particular way the comparisons are made in Steps 22 and 34.) Alternatively we could use exit code (see the end of the previous section) to reset the counter when it is close to overflowing.

Theorem 2. The algorithm of Figure 3 implements a linearizable FIFO queue, such that both the enqueue and the dequeue operations take O(t_f + δ) time regardless of the number of concurrent users, where t_f is an upper bound on the time for a fetch-and-add, and δ is an upper bound on the time for any other instruction.

Proof: We first show the queue is linearizable. Consider a collection of p users executing enqueue and dequeue operations on a queue, and p agents executing enterRoom and exitRoom operations in response to user requests, as indicated in the code. Let α be the execution comprising the actions by both users and agents and the complete state after each action. Consider the subsequence of the actions in α comprised of the fetch-and-add actions generated by Step 22 of the enqueue operation and Step 34 of the dequeue operation. We call these the commit actions. We argue that the ordering of these commit actions specifies a proper linearized order of the corresponding queue operations.

Consider the enqueue. We call any region between a commit action of an enqueue and the next EnterRoomGrant(DEQUEUEROOM) action a safe enqueue region. Because of property P2 of the rooms, this region will contain no actions from the body of the dequeue. In particular q->bot will not change, and no dequeue will commit. Consider an enqueue operation. In the case that the queue is not full (the else branch is taken) the committing fetch-and-add effectively reserves a location to put the item to enqueue. Any commits from other enqueues coming after but within the safe enqueue region will reserve higher, and hence later in "time", locations in the queue (because of the properties of fetch-and-add). Also, all writes to the reserved locations in Step 27 will complete while still in the safe enqueue region (because of property P2 of the rooms). Hence these enqueues will have the proper linear order. If an enqueue operation does return OVERFLOW (the if branch is taken), then any other enqueues coming after but within the safe enqueue region will also return OVERFLOW. This is because only an enqueue that increments q->top such that q->top - q->bot > q->size will take the if branch. Decrementing q->top by one in Step 24 cannot make it the case that q->top - q->bot < q->size. Since no dequeue can commit in a safe enqueue region, having all future enqueues within the region return OVERFLOW is the expected result of the specified linear order. The argument for the proper linearized order of the dequeues is similar to that for enqueues.

In regards to time, we note that the time that a process is in a room is bounded by the time for two fetch-and-adds (on Steps 22 and 24 of enqueue, for example), and a constant number of other standard instructions (i.e., reads, writes, arithmetic operations and conditional jumps). Each user is therefore in a room for at most Δ = 2t_f + O(δ) time. Based on Theorem 1, the maximum time a processor will wait to enter or exit a room is proportional to t_f, Δ, and δ (e.g., the time to enter a room is bounded by m * (Δ + 2t_f + O(δ))). Therefore the total time to enter, process, and exit is O(t_f + δ).

Dynamic stack. We now consider the implementation of a linearizable dynamic stack. In a dynamic stack we assume the size of the stack is not known ahead of time and hence the space allocated for the stack must be capable of growing dynamically. Each time it grows, we double the allocated space, and the old stack is copied to the new larger one. In practice such dynamic stacks are very important. If an application uses a collection of stacks that share the same pool of memory, it is crucial to minimize the space needed by each stack (i.e., allocating the maximum that each might possibly need is impractical). Our implementation is shown in Figure 4. The copying from a smaller to a larger stack is implemented incrementally. We assume that INITSIZE is greater than the maximum number of concurrent users. The pushRoomExit routine is assigned as the exit code to the PUSHROOM.

  int PUSHROOM = 0;
  int POPROOM = 1;

  typedef struct stack {
    val *A;          /* current stack */
    val *B;          /* prev stack */
    int copy;        /* being copied? */
    int start;
    long size, top;
    Rooms_t *r;
  } Stack_t;

  void pushRoomExit(void *arg) {
    Stack_t *s = (Stack_t *) arg;
    if (s->start) {
      s->start = 0;
      s->copy = 1;
      s->size = 2*s->size;
      free(s->B);
      s->B = s->A;
      s->A = (val *) malloc(
        s->size * sizeof(val));
    }
  }

  Stack_t *newStack() {
    Stack_t *s = (Stack_t *)
      malloc(sizeof(Stack_t));
    s->A = (val *) malloc(
      INITSIZE * sizeof(val));
    s->B = NULL;
    s->top = s->copy = s->start = 0;
    s->size = INITSIZE;
    s->r = createRooms(2);
    assignExitCode(s->r, PUSHROOM,
      pushRoomExit, s);
    return s;
  }

  void push(val y, Stack_t *s) {
    enterRoom(s->r, PUSHROOM);
    int j = fetchAdd(&s->top,1);
    if (j >= s->size) {
      s->start = 1;
      fetchAdd(&s->top,-1);
      exitRoom(s->r);
      push(y, s);
      return;
    }
    if (s->copy) {
      if (j >= s->size/2) {
        s->A[j - s->size/2] =
          s->B[j - s->size/2];
        s->A[j] = y;
      }
      else s->B[j] = y;
    }
    else s->A[j] = y;
    exitRoom(s->r);
  }

  val pop(Stack_t *s) {
    val x;
    enterRoom(s->r, POPROOM);
    int j = fetchAdd(&s->top,-1);
    if (j <= 0) {
      fetchAdd(&s->top,1);
      x = EMPTY;
    }
    else if (s->copy && j <= s->size/2)
      x = s->B[j-1];
    else x = s->A[j-1];
    exitRoom(s->r);
    return x;
  }

Figure 4: The code for a parallel dynamic stack.
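A usage sketch for the dynamic stack (again ours, not from the paper): pushing more than INITSIZE elements exercises the doubling and the incremental copy. The val, EMPTY, and INITSIZE definitions are assumptions; in a real program they would precede the Figure 4 code.

  #include <stdio.h>

  /* assumed placeholders for identifiers Figure 4 leaves unspecified */
  typedef long val;
  #define EMPTY    ((val) -1)
  #define INITSIZE 64      /* must exceed the maximum number of concurrent users */

  int main(void) {
    Stack_t *s = newStack();            /* registers pushRoomExit on the push room */
    for (long i = 1; i <= 100; i++)
      push((val) i, s);                 /* the 65th push triggers the doubling */
    long sum = 0;
    for (val x = pop(s); x != EMPTY; x = pop(s))
      sum += (long) x;
    printf("sum = %ld\n", sum);         /* 1 + 2 + ... + 100 = 5050 */
    return 0;
  }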

For simplicity, we have left out some details, such as handling possible failures of malloc.

Theorem 3. The algorithm of Figure 4 implements a linearizable stack of dynamic size, such that both the push and the pop operations take O(t_f + t_a + δ) time regardless of the number of concurrent users, where t_f is an upper bound on the time for a fetch-and-add, t_a is the upper bound on the time for a call to malloc or free, and δ is an upper bound on the time for any other instruction.

We prove this in the full paper [4]. We note that since the blocks of memory are powers of two, allocating from a shared pool should be quite cheap using a buddy system (i.e., t_a should be small).

4. EXPERIMENTAL RESULTS

In this section we describe the results of a benchmark that uses a shared stack to distribute work across processors. The experiments were performed on a Sun UltraEnterprise 10000 with 64 250MHz UltraSparc-II processors. This is a shared-memory machine with a compare-and-swap instruction, but no fetch-and-add. The fetch-and-add is therefore simulated using the compare-and-swap. We demonstrate the effectiveness of room synchronizations by comparing the performance of the shared stack using rooms and using locks. We also implemented the Treiber non-blocking stack [21]. However, the performance of the non-blocking stack was so poor due to a large overhead and general unscalability that we did not include the results in this paper.

The benchmark is loosely structured after the parallel graph traversal of a garbage collector [5]. When the benchmark first executes, the shared stack is initialized with 16,000 nodes with a count of 11. Each processor repeatedly pops 500 nodes from the shared stack onto its local stack, operates on the nodes, and then pushes the entire contents of its local stack back to the shared stack. The pop of 500 nodes and push of the local stack are each done within a single room (for the room synchronization version) or a single lock (for the locked version). A node with a count of k (> 0) turns into two nodes of count k - 1, and a node with a zero count simply disappears. In other words, each of the original 16,000 nodes is the root node of a fully balanced binary tree of depth 11. In addition to processing the nodes on the local stack, each processor waits for a random amount of time between popping and pushing. The random time is selected uniformly between 0 and 2t_k, where t_k is a parameter of the experiment, and is meant to represent the work associated with processing a stack element. In the case of garbage collection, such additional work might include decoding objects, copying objects, and installing forwarding pointers.

An interesting problem posed by the shared stack benchmark (and also present in the garbage collector) is detecting termination, which should occur only when there are no nodes left for any processor to work on. One can correctly test whether the shared stack is empty by using suitably designed exit code in the push room. However, detecting that the shared stack is empty ignores the work on the local stacks of each processor in its work section (i.e., each processor not trying to enter the pop room and not having just left the push room). To solve this, we centralize the emptiness condition by including in the shared stack a counter indicating how many local stacks have borrowed items from the shared stack and are (possibly) non-empty. This counter is updated when a processor performs a push or pop. Termination is triggered when the shared stack is empty and the counter is simultaneously zero.

For each setting of the parameters (locks vs. rooms, number of processors (1 to 30), and an amount of additional work), the experiments were run 5 times and the averages

are reported. Figure 5 shows three graphs with differing amounts of additional wait time (the parameter t_k). The graphs correspond, from left to right, to applications where the time to process an item is 40%, 100%, or 600% of the time it takes to transfer the items from the shared stack to the local stack. In each graph, the bottom line (widely-spaced dots) represents the work for the uniprocessor case when no synchronization is performed and no wait time is introduced. The next line (dotted) adds a varying amount of work reflected in the distance from the bottom line. Finally, the solid and dashed curves represent the rooms and locks versions of the benchmark with synchronization and additional work. The vertical axis represents the total work performed, which is calculated as the product of wall-clock time and number of processors. In all cases perfect speedup corresponds to the flat thinly-spaced dotted line.

[Figure 5: Total work (elapsed time in seconds x number of processors) vs. number of processors, for 40% additional wait time, 100% additional wait time, and 600% additional wait time. Each panel, titled "Elapsed Time (Rooms vs. Locks)", plots curves for Linear Speedup - No Wait, Linear Speedup - Wait, Rooms - Wait, and Locks - Wait against 1 to 30 processors.]

The rooms synchronization has good performance, introducing an overhead approximately equal to the basic stack transfer time (without synchronization or additional work). The overhead is mostly independent of the number of processors, indicating that the speedup achieved is (after including the overhead) linear. Additionally the magnitude of the overhead is also independent of the amount of additional work introduced.

In contrast, at few processors, the locks version has almost no overhead. However, as the number of processors increases, the contention in acquiring locks increases, causing a rapid performance degradation. The point at which this transition occurs varies from immediately to 10 processors and 20 processors for the three wait times, respectively. This trend is expected as the introduction of additional work between popping and pushing means that the processors spend less time locking the push and pop code.

Given that we simulate the fetch-and-add with a compare-and-swap, which sequentializes the fetch-and-add, one might wonder why the room version scales so well. Recall, however, that we need only execute the fetch-and-add a small constant number of times per room access: once when entering, once when leaving, and twice in each of the push and pop codes. This compares with the 500 elements that have to be copied off the stack during a pop, and from 0 to 1000 that need to be copied back onto it during a push. The proportion of instructions spent in code that might be sequentialized is therefore small. Asymptotically our code would eventually cause the work to increase linearly with the number of processors, but the slope would be much less than that of the locked version. This is a subtle and crucial point about room synchronization and a reason why our heavy use of fetch-and-adds does not preclude good performance even on machines that do not support a parallel fetch-and-add.

5. RELATED WORK AND DISCUSSION

Synchronization. There is a long history of synchronization models and synchronization constructs for parallel and distributed computation. At the one end of the spectrum, there are synchronous models such as the PRAM, in which the processors execute in lock-step and there is no charge for synchronization. Shared data structure design is simplified by not having to deal with issues of asynchrony. Bulk-synchronous models such as the BSP [23] or the QSM [7] seek to retain the simplicity of synchronous models, while permitting the processors to run asynchronously between barrier synchronizations (typically) among all the processors. Algorithms designed for these models are necessarily blocking (due to the barrier synchronizations). For the loosely synchronous applications considered

means that it is unlikely that a pro cess will b e swapp ed in this pap er, there are signi cantoverheads in implement-

out (context switched) by the op erating system while inside ing shared data structures using barrier synchronization s,

a ro om. There are a couple p otential mechanisms to deal b ecause all the pro cessors must co ordinate/wait even if they

with the case where the op erating system could swap out are not currently accessing the data structure. To reduce the

a pro cess while inside a ro om. First, the for a numb er of barriers, co ordination primitives such as parallel

might b e delayed until the exitRo om. This pre x or fetch-and-add are still needed to enable pro cessors

can b e achieved on most pro cessors by temp orarily disabling to make parallel up dates to the data structure. In lo osely

certain kinds of while inside a ro om. A second synchronous applicatio ns having multiple classes of op era-

p otential solution is to have sp ecial co de tions that must b e separated (e.g., the push class and the

that restores the state of the pro cess to a p oint in which p op class) and unpredictable arrival times for op eration re-

it is safe to exit the ro om, and then exit the ro om b efore quests, it is desirable to have exibility in selecting which

submitting to the context switch. class is p ermitted during the next phase. The application

would b e forced to implement this exibility, and reason

Lo cks and other constructs. Traditional implementa-

ab out fairness and other issues, whereas this exibili tyis

tions of shared data structures use lo cks. Often lo cks are

already encapsulated in ro om synchronizatio ns.

At the other end of the synchronization models spectrum are the fully asynchronous models, in which processors can be arbitrarily delayed or even fail, and shared data structures are designed to tolerate such delays and failures. Wait-free data structures [10] have the property that any user's request (e.g., a push or pop request) will complete in a bounded number of steps, regardless of the delays or failures at other processors. Because of the large overheads in wait-free data structures, there has been considerable work on non-blocking (or lock-free) data structures [10], which only require that some user's request will complete in a bounded number of steps (although any particular user can be delayed indefinitely). Examples of work on non-blocking data structures include [1, 2, 9, 10, 11, 17, 18, 24, 25]. Most of these implementations still fully sequentialize access to the data structure. Moreover, they often require unbounded memory^9, or the use of atomic operations on two or more words of memory (such as a double compare-and-swap or transactional memory [12, 20]). Such operations are significantly more difficult to implement in hardware than single-word atomic operations. Thus, wait-free and non-blocking data structures are essential in contexts where the primary goal is making progress in highly asynchronous environments, but there is a significant cost to providing their guarantees.

^9 In such algorithms, memory can never be reused, because a delayed processor may still have a pointer to the old memory location. A discussion of this problem for a lock-free queue algorithm is in [24]. A compare-and-swap is used to insert or delete from the queue, so a delayed processor should fail in its compare-and-swap because the queue has changed in the meantime. However, if the memory is reused, then the compare-and-swap may succeed when it should fail.
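The failure mode described in footnote 9 is the classic reuse hazard for compare-and-swap based structures. The sketch below is a generic CAS-based pop on a linked list, not code from this paper; cas(addr, expected, desired) is assumed to atomically store desired into *addr if and only if *addr still equals expected, returning nonzero exactly when it does, and val and EMPTY are used as in the stack code earlier in the paper.

    typedef struct node { val v; struct node *next; } node;
    node *head;                                           /* head of a linked stack */
    int cas(node **addr, node *expected, node *desired);  /* assumed single-word CAS */

    val pop_cas(void) {
      node *old, *nxt;
      do {
        old = head;
        if (old == NULL) return EMPTY;
        nxt = old->next;
        /* Suppose this process is delayed here.  Other processes may pop old
           and nxt, free both, and later reuse old's memory for a new node
           that again becomes the head.  The compare-and-swap below then sees
           head == old and succeeds, installing nxt, a pointer into freed
           memory.  If nodes were never reused, it would correctly fail. */
      } while (!cas(&head, old, nxt));
      return old->v;
    }

Avoiding this either means never reusing nodes (the unbounded memory mentioned above) or using stronger primitives such as a double compare-and-swap.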

Room synchronizations are designed for asynchronous settings more concerned with fast parallel access (and bounded memory) than with providing non-blocking properties; in other words, settings somewhat in between those suitable for bulk-synchronous models and those suitable for fully asynchronous models. In the experimental results in this paper, as well as in experiments with our parallel garbage collector, we have obtained good performance with room synchronizations on the Sun UltraEnterprise 10000, a 64-processor shared-memory machine, indicating that room synchronizations are suitable for that machine. We expect similar performance on other shared-memory machines such as the SGI Power Challenge and the Compaq servers.

We note that our experiments are run in an environment in which each process is mapped to one processor. This means that it is unlikely that a process will be swapped out (context switched) by the operating system while inside a room. There are a couple of potential mechanisms to deal with the case where the operating system could swap out a process while inside a room. First, the context switch for a process might be delayed until the exitRoom. This can be achieved on most processors by temporarily disabling certain kinds of interrupts while inside a room. A second potential solution is to have special code that restores the state of the process to a point in which it is safe to exit the room, and then exit the room before submitting to the context switch.

Locks and other constructs. Traditional implementations of shared data structures use locks. Often locks are distinguished as either shared or exclusive. A shared lock permits other users to also be granted shared locks to the data structure, but forbids the granting of an exclusive lock. An exclusive lock forbids the granting of any other locks. Room synchronization can be viewed as providing a lock that is shared among those requesting the same room, but exclusive to the shared locks for other rooms. This greater sharing implies greater concurrency.
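To make this lock-centric view concrete, here is a minimal reference implementation of the Enter Room/Exit Room interface on top of POSIX threads. It only illustrates the semantics (shared within a room, exclusive across rooms): unlike the paper's fetch-and-add based implementation, this sketch funnels every entry and exit through one mutex and makes no fairness or progress guarantees.

    #include <pthread.h>

    static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  c = PTHREAD_COND_INITIALIZER;
    static int open_room = -1;          /* -1 means no room is open */
    static int inside    = 0;           /* users inside the open room */

    void enterRoom(int room) {
      pthread_mutex_lock(&m);
      /* wait until no room is open, or the requested room is already open */
      while (open_room != -1 && open_room != room)
        pthread_cond_wait(&c, &m);
      open_room = room;                 /* open it (or join it if already open) */
      inside++;
      pthread_mutex_unlock(&m);
    }

    void exitRoom(void) {
      pthread_mutex_lock(&m);
      if (--inside == 0) {              /* the last user out closes the room */
        open_room = -1;
        pthread_cond_broadcast(&c);
      }
      pthread_mutex_unlock(&m);
    }

Any number of users of the same room run concurrently, while users of different rooms exclude one another, which is exactly the shared-within, exclusive-across behaviour described above.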

The closest algorithm to any of ours is an algorithm by Gottlieb, Lubachevsky and Rudolph for parallel queues [8]. Like ours, the algorithm works with unpredictable arrival times of requests, is based on the fetch-and-add operation, and can fully parallelize access. Also like ours, it is not non-blocking. It, however, has some important disadvantages compared to ours. Firstly, it is not linearizable: the following can occur on two processors:

    P1                    P2
    enqueue(v1)           enqueue(v2)
                          v1 <- dequeue()
    EMPTY <- dequeue()

(Since P2's dequeue returns v1, both enqueues must take effect before it in any linearization, so P1's later dequeue would have to return v2 rather than EMPTY.) Secondly, the algorithm requires a lock (or counter) for every element of the queue. This both requires extra memory, and requires manipulating this lock for every insert and delete. In our solution it is easy to batch the inserts or deletes, as was done in our experiments. Thirdly, the technique does not appear to generalize to other data structures such as stacks. The technique does have one advantage, which is that the blocking is at a finer grain: at each location rather than across the data structure.

Gottlieb, Lubachevsky and Rudolph [8] also provide justification for why a parallel fetch-and-add operation can be implemented efficiently in hardware, and indeed, the Ultracomputer they subsequently built provided a parallel fetch-and-add. In addition to queues, they showed how the fetch-and-add could be used for various other operations, including a solution to the readers-writers problem, which allows a shared lock for the readers of a data structure, but an exclusive lock for the writers. Room synchronization can be viewed as extracting from these algorithms (and other previous algorithms using fetch-and-add) a synchronization construct useful for many problems.
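As a small illustration of that extraction, the readers-writers pattern itself can be phrased with rooms. READROOM and WRITEROOM are illustrative room names (not identifiers from the paper), and the sketch reuses the enterRoom/exitRoom and fetchAdd primitives together with the array A and counter top of the stack code earlier in the paper. Note that a room admits any number of users, so writers sharing the write room must still make their updates safe with respect to one another, here by claiming distinct slots with fetch-and-add.

    val readEntry(int i) {
      val x;
      enterRoom(READROOM);        /* any number of readers may be inside together */
      x = A[i];
      exitRoom();
      return x;
    }

    void appendEntry(val y) {
      int j;
      enterRoom(WRITEROOM);       /* excludes readers, but not other writers */
      j = fetchAdd(&top, 1);      /* writers coordinate among themselves */
      A[j] = y;
      exitRoom();
    }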

In our earlier work on garbage collection we briefly described another, weaker version of the room synchronizations [3]. That version had a couple of important disadvantages over the version described in this paper. Firstly, it did not guarantee properties P5–P10 unless one assumed certain properties of the exact time taken by each instruction. In particular, one process could be starved for an arbitrary amount of time if other processes kept going through the room very quickly (fast enough that the process missed doing a check of a flag for the brief time the room is open). This could even be true if all processors are running at the same rate. Secondly, the room synchronizations required that each processor, once it entered the first room, also go through the second room. Thirdly, it only considered synchronization with two rooms. Finally, the conditions on what constitutes a room synchronization were not formalized. In our new implementation of the garbage collector [5], we use the implementation discussed in this paper.

As an abstraction, room synchronization is distinct from all previous abstractions known to us. It differs from mutual exclusion by introducing rooms that are mutually exclusive to each other, but allowing concurrent access to a given room. In more general abstractions, such as the resource allocation abstraction [16], of which the dining philosophers problem is a well-known specific instance, each user conflicts with a subset of the other users, and only nonconflicting sets of users can proceed in parallel. It is not clear how one would emulate rooms using this formalization.

Finally, there have been a number of papers describing techniques for reducing the contention in accessing shared data structures (e.g., [8, 19, 21]). These techniques are complementary to room synchronization, and perhaps can be exploited in the implementation of rooms by increasing the concurrency.

6. CONCLUSIONS

We presented a class of synchronizations called room synchronizations that are likely to be useful in a context that lies between highly synchronous models, such as the PRAM or BSP model, and highly asynchronous models, where it is assumed processors can stall, fail, or become disconnected. In particular, room synchronizations can handle requests that come in at arbitrary times and from arbitrary subsets of the processors. They are, however, blocking, and hence if a processor fails in certain critical regions of the code, the other processors can become blocked.

Based on room synchronizations we presented simple and efficient implementations of shared stacks and queues. To the best of our knowledge these are the first implementations of stacks and queues that are linearizable, handle asynchronous requests, and allow for constant-time access (assuming a constant-time fetch-and-add).

Acknowledgements

Thanks to Toshio Endo and the Yonezawa Laboratory for use of their 64-way Sun UltraEnterprise 10000.

7. REFERENCES

[1] O. Agesen, D. L. Detlefs, C. H. Flood, A. T. Garthwaite, P. A. Martin, N. N. Shavit, and G. L. Steele Jr. DCAS-based concurrent deques. In Proc. 12th ACM Symp. on Parallel Algorithms and Architectures, pages 137–146, July 2000.
[2] G. Barnes. A method for implementing lock-free shared data structures. In Proc. 5th ACM Symp. on Parallel Algorithms and Architectures, pages 261–270, June 1993.
[3] G. E. Blelloch and P. Cheng. On bounding time and space for multiprocessor garbage collection. In Proc. ACM SIGPLAN'99 Conf. on Programming Languages Design and Implementation, pages 104–117, May 1999.
[4] G. E. Blelloch, P. Cheng, and P. B. Gibbons. Room synchronizations. Technical report, Carnegie Mellon University, Pittsburgh, PA, Apr. 2001.
[5] P. Cheng and G. E. Blelloch. A parallel, real-time garbage collector. In Proc. ACM SIGPLAN'01 Conf. on Programming Languages Design and Implementation, June 2001.
[6] M. J. Fischer, N. A. Lynch, J. E. Burns, and A. Borodin. Distributed FIFO allocation of identical resources using small shared space. ACM Trans. on Programming Languages and Systems, 11(1):90–114, Jan. 1989.
[7] P. B. Gibbons, Y. Matias, and V. Ramachandran. Can a shared-memory model serve as a bridging model for parallel computation? Theory of Computing Systems, 32(3):327–359, 1999. Preliminary version appeared in SPAA'97.
[8] A. Gottlieb, B. D. Lubachevsky, and L. Rudolph. Basic techniques for the efficient coordination of very large numbers of cooperating sequential processors. ACM Trans. on Programming Languages and Systems, 5(2):164–189, Apr. 1983.
[9] M. Greenwald. Non-Blocking Synchronization and System Design. PhD thesis, Stanford University, Palo Alto, CA, 1999. Tech. report STAN-CS-TR-99-1624.
[10] M. P. Herlihy. Wait-free synchronization. ACM Trans. on Programming Languages and Systems, 13(1):123–149, Jan. 1991.
[11] M. P. Herlihy and J. E. B. Moss. Lock-free garbage collection for multiprocessors. IEEE Trans. on Parallel and Distributed Systems, 3(3):304–311, May 1992. Preliminary version in SPAA'91.
[12] M. P. Herlihy and J. E. B. Moss. Transactional memory: Architectural support for lock-free data structures. In Proc. 20th International Symp. on Computer Architecture, pages 289–300, May 1993.
[13] M. P. Herlihy and J. M. Wing. Axioms for concurrent objects. In Proc. 14th ACM Symp. on Principles of Programming Languages, pages 13–26, Jan. 1987.
[14] M. P. Herlihy and J. M. Wing. Linearizability: A correctness condition for concurrent objects. ACM Trans. on Programming Languages and Systems, 12(3):463–492, July 1990.
[15] L. Lamport. A new solution of Dijkstra's concurrent programming problem. Communications of the ACM, 17(8):435–455, Aug. 1974.
[16] N. A. Lynch. Distributed Algorithms. Morgan Kaufmann, San Francisco, CA, 1996.
[17] M. M. Michael and M. L. Scott. Simple, fast, and practical non-blocking and blocking concurrent queue algorithms. In Proc. 15th ACM Symp. on Principles of Distributed Computing, pages 267–275, May 1996.
[18] M. C. Rinard. Effective fine-grain synchronization for automatically parallelized programs using optimistic synchronization primitives. ACM Trans. on Computer Systems, 17(4):337–371, Nov. 1999.
[19] N. Shavit and D. Touitou. Elimination trees and the construction of pools and stacks. In Proc. 7th ACM Symp. on Parallel Algorithms and Architectures, pages 54–63, June 1995.
[20] N. Shavit and D. Touitou. Software transactional memory. Distributed Computing, 10(2):99–116, Feb. 1997.
[21] N. Shavit and A. Zemach. Combining funnels. In Proc. 17th ACM Symp. on Principles of Distributed Computing, pages 61–70, June-July 1998.
[22] R. K. Treiber. Systems programming: Coping with parallelism. Technical Report RJ 5118, IBM Almaden Research Center, Apr. 1986.
[23] L. Valiant. A bridging model for parallel computation. Communications of the ACM, 33(9):103–111, Sept. 1990.
[24] J. D. Valois. Implementing lock-free queues. In Proc. 7th International Conf. on Parallel and Distributed Computing Systems, pages 64–69, Oct. 1994.
[25] J. D. Valois. Lock-Free Data Structures. PhD thesis, Rensselaer Polytechnic Institute, Troy, NY, 1995.