Process Synchronization in the UTS Kernel

Lawrence M. Ruane
AT&T Bell Laboratories

ABSTRACT: Any kernel has some form of process synchronization, allowing a process to wait for a particular condition. The traditional choice for UNIX systems, the event-wait mechanism, leads to race conditions on multiprocessors. This problem was initially solved in Amdahl's UTS multiprocessing kernel by replacing the event-wait mechanism with Dijkstra semaphores. The kernel, however, became noticeably more complicated and less reliable when based on semaphores. This has led us to develop a race-free multiprocessor event-wait mechanism with some novel properties. A few common synchronization techniques have emerged, the most complex of which was verified correct with the supertrace protocol validation system spin. A new approach with per-CPU run queues reduces the number of unnecessary context switches due to awakening all waiting processes. The overall approach is claimed to be simple, efficient, and reliable.

1. Introduction

Broadly, process synchronization refers to processes blocking themselves until the system changes state in some way. In the original UNIX kernel, processes wait for events [Thompson 1978]. An event is represented by an arbitrary integer (the channel), chosen by convention to be the address of the kernel data structure whose state change the process is interested in.

For example, suppose a resource, represented by a data structure pointed to by r, must be used by only one process at a time. Here is the standard protocol for acquiring the resource:¹

    while (r->lock)
        sleep(r);
    r->lock = 1;

If the resource is locked (lock is nonzero), the process suspends itself until the resource becomes free (lock is zero). The process then locks the resource and proceeds to use it. The lock flag is often called a sleep lock.

Since the kernel is non-preemptable (the kernel will not switch to another process unless explicitly requested via sleep()), there is no race in the test and set of the sleep lock: it is impossible for two processes to see the lock cleared and then both set the lock and use the resource simultaneously. Non-preemptability also simplifies the kernel tremendously, since it implies that sleep locks are only needed when a process might sleep at a "lower level," during which no other process should access the resource. For example, while a process waits for I/O completion for a particular inode (file), the process holds the inode's sleep lock.

After changing the state of the system in a way that might be relevant to sleeping processes, a process must wake them up using the agreed-upon channel:

    r->lock = 0;
    wakeup(r);

1. For simplicity, we ignore process scheduling priority upon waking up, the second argument to sleep().

An awakened process must treat the wakeup as advisory; it must re-evaluate the wait condition - thus the while instead of an if - for two reasons. First, any number of processes might have run since the wakeup was issued, any of which might have changed the wait condition. Second, more than one event can map to the same channel, so the wait condition might not have changed at all.

Our experience has been that while event-wait is not as efficient as some other synchronization methods (those that wake only one waiting process, for example), it puts the least burden on kernel algorithms because it closely approximates the busy-wait model. Under pure busy-wait, there are no process blocking or resumption concerns at all - every process has its own processor. Algorithms can use "polling" loops that are much more complex than the simple while statement above; then the polling is slowed down by slipping in calls to sleep() and wakeup() at the right places, without changing the algorithms. Appendix A compares event-wait to other forms of monitors.
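To make the polling style concrete, here is a minimal, hypothetical loop (the buffer flag names are ours, not quoted from the kernel): a process waits until a buffer is neither busy nor incomplete, re-evaluating everything after each sleep():

    /* hypothetical polling loop: each sleep() merely slows the
     * polling down, so the structure is the same as under pure
     * busy-wait */
    for (;;) {
        if (bp->b_flags & B_BUSY) {
            sleep(bp);
            continue;
        }
        if ((bp->b_flags & B_DONE) == 0) {
            sleep(bp);
            continue;
        }
        break;
    }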

2. Race Conditions

If the resource is released by an interrupt handler, a race condition can cause the wakeup to be lost. A process wishing to use the resource tests the lock, finds it set, and commits to go to sleep. But before actually calling sleep(), an interrupt occurs. The interrupt handler frees the lock and does the wakeup() (which has no effect), and returns to the interrupted process, which goes to sleep, possibly forever.

Disabling interrupts between the test of r->lock and the transition into the wait state eliminates the race condition:

    spl6();             /* disable interrupts */
    while (r->lock)
        sleep(r);
    r->lock = 1;
    spl0();             /* enable interrupts */

Interrupts are automatically enabled after the process is switched out, and are disabled again when the process switches back in.

On multiprocessing systems, disabling interrupts on the current CPU is not sufficient - the interrupt handler can run on another CPU. Also, the race condition prevented by non-preemptability on a uniprocessor can now occur: two processes on different CPUs attempt to lock the same resource at about the same time; the processes both see lock set to zero, set lock, and simultaneously use the resource - practically a guaranteed disaster [Bach 1986].

3. Semaphores

The problems with the event-wait method on multiprocessors have led to several implementations that substitute semaphores (the P() and V() routines) [Dijkstra 1968]. Multiprocessing kernels based on semaphores exist for the IBM/370 architecture (UTS and the UNIX System for System/370 [Felton et al. 1984]), the AT&T 3B20A computer, and the Sequent systems [Beck et al. 1987], among others.

In supporting and developing new kernel features for UTS over the years, we have found semaphores troublesome. There are the obvious portability problems of converting algorithms based on event-wait to a different synchronization model. But more importantly, we have encountered many difficulties due to the nature of the semaphore mechanism itself. Sometimes semaphores simplify kernel algorithms by providing just the right abstraction, but often their use leads to significantly more complicated algorithms.

What is the root of the problems with semaphores? Semantically, if not in actual implementation, the semaphore mechanism encapsulates the common

    while (resource is locked)
        sleep();
    get resource lock;

synchronization paradigm we have already seen (or a slight generalization if the semaphore is initialized to a value greater than one). There are really three separate parts to this: testing for resource availability; blocking; and resource allocation.

But when the algorithms are complex, we would often like to have direct control over the individual parts. For example, the buffer cache getblk() routine requires sophisticated "polling" loops [Bach 1986]. In other situations, we would like to (indivisibly) test for the availability of several different resources before allocating any of them, to simplify the handling of a non-available resource (a sketch follows at the end of this section). In cases like these, which are all too common in the real world, event-wait works better.

By analogy, consider the printf() routine. It really does two separate things - formatting a string and writing it to standard output - that often happen to be needed together. But if only printf() were available, the programmer's life would be difficult indeed. This is why sprintf() and puts(), the two components of printf(), are made available. Providing high-level primitives such as printf() and P() is fine, as long as one allows access to the lower level ones as well.

A case study showing some of the problems with semaphores in actual use is given in Appendix B. Due to these problems, we decided to develop a modified event-wait mechanism for UTS.
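The multi-resource test mentioned above is straightforward with event-wait on a uniprocessor. A minimal sketch (the structures a and b and the shared channel are ours, purely for illustration; both release paths must issue the matching wakeup()):

    /* illustrative only: wait until both resources are free,
     * then lock both; kernel non-preemptability makes the
     * combined test and set indivisible */
    while (a->lock || b->lock)
        sleep(a);       /* a's address doubles as the channel */
    a->lock = 1;
    b->lock = 1;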

4. Spinlocks

Before getting into the multiprocessor event-wait implementation, we must review spinlocks. All multiprocessing kernels use spinlocks to provide low-level processor (as opposed to process) synchronization. A spinlock is a variable that takes on values 0 and 1. The spinlock() routine indivisibly changes a given spinlock variable from 0 to 1 (retrying if necessary, without giving up the CPU). The UTS implementation of spinlocks is typical - a busy-wait loop around an indivisible test-and-set instruction. The freelock() routine simply sets the given spinlock to 0. (Spinlocks are known by other names, such as primitive semaphores in Bach [1986].) Spinlocks provide short-term exclusion; they must never be held across sleeps (otherwise every processor may try to get the spinlock at about the same time, causing system-wide deadlock, since the process holding the lock cannot run to free it).
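As a sketch (not the UTS source), the pair might look like this, assuming a hypothetical tas() that indivisibly sets the lock word to 1 and returns its previous value:

    /* minimal sketch of the spinlock primitives; tas() is an
     * assumed indivisible test-and-set returning the old value */
    spinlock(lp)
    long *lp;
    {
        while (tas(lp))
            ;           /* busy-wait until the old value was 0 */
    }

    freelock(lp)
    long *lp;
    {
        *lp = 0;
    }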

Similarly, interrupts must be disabled while holding any spinlocks (at least any that an interrupt handler might try to get), lest the interrupt handler deadlock on a held spinlock. This requirement is met in UTS in the simplest possible way: the entire kernel is non-interruptible. (Another simple way is to increment a per-CPU counter in spinlock(), decrement it in freelock(), and only enable interrupts if the counter is zero.)

The sleep queue is a typical example. In UTS, the sleep queue is hashed by channel number, with one spinlock per hash chain. While manipulating or referencing a hash chain, one must hold its spinlock:

    struct hsque {
        long lk;
        struct proc *proc;
    } hsque[64];

    #define sqhash(c)   (&hsque[(c>>3)&63])

    sleep(chan)
    {
        struct hsque *hp = sqhash(chan);

        spinlock(&hp->lk);
        link current proc onto hp->proc;
        freelock(&hp->lk);
        swtch();
    }

    wakeup(chan)
    {
        struct hsque *hp = sqhash(chan);

        spinlock(&hp->lk);
        move all procs waiting on chan from hp->proc to run queue;
        freelock(&hp->lk);
    }

5. A Multiprocessor Event-Wait Mechanism

The modified event-wait mechanism extends the use of spinlocks to prevent the race conditions described earlier. A new routine, sleepl() (for "sleep while holding a (spin)lock"), takes the address of a spinlock as an additional argument. wakeup() is unchanged. sleepl() has the following form:

    sleepl(chan, lp)
    long *lp;
    {
        struct hsque *hp = sqhash(chan);

        spinlock(&hp->lk);
        freelock(lp);
        link current proc onto hp->proc;
        freelock(&hp->lk);
        swtch();
        spinlock(lp);
    }

This is just sleep() except that after locking the sleep queue, the argument lock lp is freed. We will see below that this is the earliest possible time lp can be freed, minimizing contention. As a matter of convenience, it is relocked as late as possible before returning. Here is the multiprocessor protocol for acquiring a resource:

    spinlock(&r->lk);
    while (r->lock)
        sleepl(r, &r->lk);
    r->lock = 1;
    freelock(&r->lk);

(The resource structure now includes a spinlock lk.) It is easy to see how the race condition in which two processes use the resource at the same time is prevented: the test and set of the sleep lock (lock) is made indivisible by the use of the spinlock (lk).

A new primitive, waitlock(), solves the lost wakeup race condition in a particularly elegant and (to our knowledge) unique way. As the name implies, it waits until the given spinlock is free, but does not lock it:

    waitlock(lp)
    long *lp;
    {
        while (*lp)
            ;
    }

This primitive is used when releasing the resource:

    r->lock = 0;
    waitlock(&r->lk);
    wakeup(r);

To see how the wakeup cannot be lost, suppose process A, trying to lock the resource, has seen lock set, but has not yet called sleepl(). Now process B on another CPU, releasing the resource, sets lock to zero. At this point B spins in waitlock() until A gets the sleep queue hash spinlock and releases lk, at which point B can proceed to wakeup(). The first thing wakeup() does is get the (same) sleep queue hash spinlock, which it cannot do until A is queued to the sleep hash chain. Therefore, B is sure to find A and move it to the run queue.

It is not necessary, as we first thought, to hold the spinlock throughout the entire sequence to release the resource. Using waitlock() minimizes contention on lk. The rule for preventing a lost wakeup is that the spinlock must be held, even if momentarily, between the time the awaited condition is made true and the wakeup. waitlock() fulfills this requirement since it is equivalent to a spinlock() followed by a freelock().

Interestingly, no race results from lock being inspected at the same time it is being set (to zero) on another CPU. Also, the fact that lock is set to zero without holding any spinlock provides a simple counter-example to our early impression that a spinlock protects data structures - that if a spinlock needs to be held for any change of a variable, it needs to be held for every change. Here we see that a spinlock actually protects transitions - the spinlock must be held while changing lock from zero to one, but not the reverse. This sort of thing occurs in other places.²

6. Combining the Two Uses of a Spinlock

Often, the above rule is satisfied without using waitlock(). This happens when the same spinlock is used both to protect a data structure and to prevent a lost wakeup.

For example, suppose we have a pool of resources (data structures), with the free ones on a singly linked list:

    struct {
        long lk;
        struct resource *head;
    } freelist;

When a process wants one of the resources and there is none available, it sleeps:

    spinlock(&freelist.lk);
    while (freelist.head == NULL)
        sleepl(&freelist, &freelist.lk);
    r = freelist.head;
    freelist.head = r->next;
    freelock(&freelist.lk);
    return(r);

Putting r back on the freelist is done with:

    spinlock(&freelist.lk);
    r->next = freelist.head;
    freelist.head = r;
    freelock(&freelist.lk);
    wakeup(&freelist);

2. For example, in UTS it is necessary to hold the per-process spinlock for some process state (sleep, run, zombie, etc.) transitions, but not for others.

Here, freelist.lk is used both to protect the freelist and to prevent a lost wakeup. When returning a resource to the freelist, there is naturally a moment at which both the wait condition is true (freelist.head != NULL) and the spinlock is held, so no waitlock() is needed.

7. "Wanted" Flags

To improve performance, a "wanted" flag is sometimes added to the protocol to avoid unnecessary wakeups. The uniprocessor method of allocating the resource becomes:

    while (r->lock) {
        r->wanted = 1;
        sleep(r);
    }
    r->lock = 1;

The resource is released with:

    r->lock = 0;
    if (r->wanted) {
        r->wanted = 0;
        wakeup(r);
    }

Using multiprocessor event-wait, the resource is allocated with:

    spinlock(&r->lk);
    while (r->lock) {
        r->wanted = 1;
        sleepl(r, &r->lk);
    }
    r->lock = 1;
    freelock(&r->lk);

and released with:

    r->lock = 0;
    waitlock(&r->lk);

    if (r->wanted) {
        r->wanted = 0;
        waitlock(&r->lk);
        wakeup(r);
    }

It is easy to see that since the wakeup is conditioned on the test of wanted, a waitlock() must be done before the test. It is harder to see the need for the second waitlock(), but one can imagine that without it process A is about to clear wanted; process B sets wanted and is about to go to sleep; process A clears wanted and does the wakeup(), which B misses. A very complex sequence with this ending, requiring three processors, was indeed found by the spin supertrace protocol verifier [Holzmann 1990]. With the second waitlock() in place, spin has proven this protocol safe for at least 5 processors. The spin model is given in Appendix C.

Surprisingly, wanted can be getting turned on and off by different processors at the same time (with an indeterminate result), yet there is no race condition. This is unusual in a multiprocessing kernel.

This protocol provides very low contention on spinlocks, and especially fast resource releases. Normally, lk is not held and wanted is not set, in which case no spinlocks at all are needed to release the resource.

8. Conditionally Acquiring a Sleep Lock

Another interesting protocol used in uniprocessing kernels involves waiting for a sleep lock to become available, but not locking the sleep lock unless sleep is required. Here is an outline of free(), which puts a given disk block on the free list of a given filesystem. The in-memory part of the free list and the sleep lock are kept in the filesystem's superblock, pointed to by fp. If free() discovers that the in-memory free list is full, it synchronously writes the free list to disk and resets the list to empty. Then free() adds the given disk block to the free list.

    while (fp->lock)
        sleep(fp);
    if (fp->free is full) {
        fp->lock = 1;
        synchronously write fp->free to disk;
        set fp->free to empty;
        fp->lock = 0;
        wakeup(fp);
    }
    add block being freed to fp->free;

This routine doesn't bother to change the sleep lock or do a wakeup() in the normal case that the free list is not full. We can get the same advantage in the multiprocessing version of this routine:

    spinlock(&fp->lk);
    while (fp->lock)
        sleepl(fp, &fp->lk);
    if (fp->free is full) {
        fp->lock = 1;
        freelock(&fp->lk);
        synchronously write fp->free to disk;
        set fp->free to empty;
        spinlock(&fp->lk);
        fp->lock = 0;
        wakeup(fp);
    }
    add block being freed to fp->free;
    freelock(&fp->lk);

The spinlock is held across the wakeup(), which could be prevented in most cases by adding a wanted flag. Notice that the wakeup() happens before we are done using the resource, which is fine since we hold the spinlock in the interval.

9. What Provides Processor Mutual Exclusion?

The previous example provides a good lead-in to the general question of what type of lock provides processor mutual exclusion. At one extreme, such as with the sleep queue hash chains, spinlocks provide all of it - there are no sleep locks. This applies to many other parts of UTS as well: the run queue, the free list of process structures, the hash lists of the buffer cache, etc.

In other cases, such as the most recent example, the responsibility is split between a spinlock and a sleep lock. The spinlock provides the processor mutual exclusion for the statement

    add block being freed to fp->free;

(we hold the spinlock but not the sleep lock) while the sleep lock protects

    set fp->free to empty;

(we hold the sleep lock but not the spinlock).

At the other extreme, spinlocks only provide synchronization for manipulating the sleep lock (including preventing missed wakeups); the data structure is manipulated with just the sleep lock held. This is also very common, applying to individual buffer cache headers, inodes, etc.

Of course sleep locks provide all process mutual exclusion.
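A minimal sketch of this last regime, using hypothetical inode fields i_lk (spinlock) and i_lock (sleep lock) - not necessarily the UTS field names:

    /* lock an inode: the spinlock guards only the sleep-lock
     * manipulation; the inode fields themselves are then used
     * with just the sleep lock held */
    ilock(ip)
    struct inode *ip;
    {
        spinlock(&ip->i_lk);
        while (ip->i_lock)
            sleepl(ip, &ip->i_lk);
        ip->i_lock = 1;
        freelock(&ip->i_lk);
    }

    iunlock(ip)
    struct inode *ip;
    {
        ip->i_lock = 0;
        waitlock(&ip->i_lk);
        wakeup(ip);
    }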

10. Scope of Sleep Locks

When a sleep lock provides processor mutual exclusion, it must be held, not just across sleep, but across all manipulations of the resource. We have discovered parts of the standard kernel where the scope of a sleep lock is too small for a multiprocessor. In a uniprocessing kernel, an inode can be unlocked and then still manipulated; since the kernel is non-preemptable, no other process can use the inode until the current process gives up the CPU. On a multiprocessor, this doesn't work. From the instant the sleep lock is released, a process on another processor can get the lock and start to use the inode, even though the first process is still running. So, in a few cases, the unlock of the resource had to be moved (delayed) to the point at which the resource was really finished being used.
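In terms of the hypothetical ilock()/iunlock() sketch in the previous section, the fix is purely a reordering:

    /* uniprocessor order - safe only because the kernel is
     * non-preemptable: */
    iunlock(ip);
    manipulate the inode;

    /* multiprocessor order - finish with the inode first: */
    manipulate the inode;
    iunlock(ip);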

11. Indivisibly Freeing a Sleep Lock

There are many parts of the kernel where a process gets the sleep lock of a resource, but then discovers that the resource is "not ready" in some sense, requiring the process to wait. Before blocking, however, it must free the sleep lock to allow another process to change the state of the resource to "ready."

One place this occurs is in the IPC message facility. Each message queue has a sleep lock. Suppose process R wishes to receive a message from a certain queue, but after locking the queue, discovers that it is empty. It then frees the sleep lock and sleeps on the address of a particular member (qp->qnum) of that queue structure (not the address of the overall structure, since that is used for the sleep lock). When another process S sends (enqueues) a message to the queue, it does a wakeup on the address of qp->qnum.

On a multiprocessor, without any precautions, R can miss the wakeup, since as soon as it frees the sleep lock in preparing to sleep, S can get the sleep lock, enqueue the message, and do the wakeup - before R is asleep. The solution for S is:

    /* get sleep lock: */
    spinlock(&msg_lk);
    while (qp->lock)
        sleepl(qp, &msg_lk);
    qp->lock = 1;
    freelock(&msg_lk);

    /* deposit the message: */

    wakeup(&qp->qnum);

    /* free the sleep lock: */
    qp->lock = 0;
    waitlock(&msg_lk);
    wakeup(qp);

and for R:

    loop:
        /* get sleep lock: */
        spinlock(&msg_lk);
        while (qp->lock)
            sleepl(qp, &msg_lk);
        qp->lock = 1;
        freelock(&msg_lk);

        /*
         * no messages to read;
         * free sleep lock and
         * wait for a message:
         */
        spinlock(&msg_lk);
        qp->lock = 0;
        wakeup(qp);
        sleepl(&qp->qnum, &msg_lk);
        freelock(&msg_lk);
        goto loop;      /* re-evaluate */

(There is one global spinlock for the entire IPC message facility, rather than one spinlock per resource, although it could have been done the other way.) The important point is that R gets the spinlock before it frees the sleep lock, so there is no way S can lock the sleep lock until R is asleep, since S must get the spinlock before getting the sleep lock. This means the wakeup cannot be lost. (The spinlock is held across a wakeup(); as before, this can be prevented in most cases by use of a wanted flag.)

Notice that after S gets the sleep lock and queues a message, it does not have to do a waitlock() before the wakeup(); R cannot be about to go to sleep, or S would not have been able to get the spinlock, which is needed to get the sleep lock.

12. Semaphores on Top of Event-Wait

A semaphore is a counter initialized to the number of some class of resources (or one for mutual exclusion). When nonnegative, the semaphore gives the number of available resources; when negative, the absolute value of the semaphore gives the number of processes waiting for resources. P() is called when allocating a unit of the resource; it decrements the counter and blocks if negative. A call to V() releases a unit by incrementing the counter and, if processes are waiting, unblocking one of them. Here is the old UTS implementation:

    struct semaphore {
        long lk;
        int count;
        struct proc *p;
    };

    P(s)
    struct semaphore *s;
    {
        spinlock(&s->lk);
        if (--s->count < 0) {
            queue curproc onto s;
            freelock(&s->lk);
            swtch();
        } else
            freelock(&s->lk);
    }

    V(s)
    struct semaphore *s;
    {
        struct proc *p;

        spinlock(&s->lk);
        if (s->count++ < 0) {
            p = dequeue one process from s;
            freelock(&s->lk);
            setrq(p);
        } else
            freelock(&s->lk);
    }

Since semaphores can easily be implemented using event-wait, this is what we did in UTS, replacing the "primitive" semaphore implementation above. Here is the event-wait version of semaphores:³

    struct semaphore {
        long lk;
        int count;
    };

    P(s)
    struct semaphore *s;
    {
        spinlock(&s->lk);
        while (--s->count < 0)
            sleepl(s, &s->lk);
        freelock(&s->lk);
    }

    V(s)
    struct semaphore *s;
    {
        spinlock(&s->lk);
        if (s->count < 0) {
            s->count = 1;
            freelock(&s->lk);
            wakeup(s);
        } else {
            ++s->count;
            freelock(&s->lk);
        }
    }

3. For simplicity, we ignore the case of interruptible semaphores, although they are easy to implement.

(Since a negative count indicates one or more processes are sleeping, it is functionally a "wanted" flag.) This reimplementation of semaphores changed their semantics slightly, and unfortunately broke some things. (It is unclear if the original programmers were aware they were depending on such detailed semantics of semaphores.)

The old semaphores were "strict": if process A blocks on a P(), and process B executes V() on the same semaphore, A is guaranteed to get the semaphore next, even if some other process, C, tries to do a P() before A runs. C blocks; A switches in and gets the semaphore. The new semaphores, shown above, are "lazy": in the previous example C gets the semaphore; A would switch in only to find the semaphore still locked, and go back to sleep.

One part of the kernel that broke is the named pipe open protocol. A per-inode (per-pipe) semaphore, i_fwcnt, maintains the number of writers. When a process tries to open a named pipe for read, it does a P() on that semaphore, which blocks if there are no writers. What can happen is that the writer comes along; increments the writer count (does a V(), which awakens the reader); quickly writes a small amount of data; and closes the pipe, which decrements the writer count (with P()) to zero. Then, the reader gets the CPU, but finds the semaphore count is still zero, and goes back to sleep! The reader remains stuck in the open protocol. This was easily fixed by porting the standard event-wait based code, which is simpler anyway. Other problems of this sort were fixed the same way.

13. Convoys

In going from semaphores to event-wait, two performance factors work in opposite directions. Semaphores have the advantage that processes never wake up unnecessarily. That is, once a process is on the run queue, it is guaranteed to obtain the resource. Arising from this very fact, however, are semaphore convoys, which hurt performance [Lee et al. 1987].

Convoys can occur when processes lock and release a semaphore repeatedly without sleeping. Such high-contention semaphores are typical of database environments, for example. If process q holds a semaphore, and process p blocks on that semaphore, a convoy begins. When q releases the semaphore, p is put on the run queue. Then q loops back to re-obtain the semaphore, but blocks because the semaphore is owned by p. Process p runs, uses the resource, and releases the semaphore, giving it back to q, and so on. The two processes alternate on every use of the resource, dramatically increasing the number of context switches.

Convoys tend to be self-perpetuating, because once started, the resource-locked duration per use increases dramatically. It now includes the time for the new owner process to be scheduled to run, in addition to the time the process actually uses the resource. This in turn increases the probability that another process will block on the semaphore. Once identified, convoys can usually be solved by redesigning the kernel algorithms, but at the expense of simplicity.

With event-wait, convoys are not possible, because even though a process has been awakened and is on the run queue, expecting to use the resource, it does not yet own the resource, so the current process can lock it again without blocking.

14. The "Thundering Herd" Problem

This delightful expression refers to a problem that derives from the fact that wakeup() makes all processes waiting on the event runnable, not just one. They race to lock the resource, one gets it, and the others go back to sleep. Semaphores have the advantage that processes never wake up unnecessarily: once a process is on the run queue, it is guaranteed to obtain the resource.

However, Beck et al. [1987] point out that there usually isn't a problem with event-wait on uniprocessors, since by the time each process runs, the resource is generally available. For example, suppose that while a buffer is busy being read into from disk, five processes block trying to read it. When the I/O is done, all five go on the run queue. Each one in turn finds the data in the buffer valid, copies it to user space, frees the buffer, and gives up the CPU for an unrelated reason (e.g., sleeping on another resource). As long as each process does not sleep while holding the resource, there are no unnecessary context switches.

On a multiprocessor, this condition is not sufficient. While the five processes in the example are waiting on the run queue, they are available to be run by other CPUs. This problem becomes worse as CPUs are added to the system.

We addressed this problem by giving a private run queue to each CPU. wakeup() (actually, setrq()) puts processes on the current CPU's run queue. The global run queue remains, but only holds processes with user priority (that is, processes ready to return to the user program that have deferred to a better priority process); thus user priorities are honored. When a CPU wants a process to run (in swtch()), it first looks in its own run queue, then in the user priority queue, and finally, since it doesn't make sense for a CPU to go idle if there is really work to do somewhere, it "raids" other CPUs' run queues.

The result is that the number of unnecessary context switches is reduced to about the level of a uniprocessor, as long as there is sufficient work in the system to prevent a lot of raiding (and if there isn't that much work, we probably don't care about unnecessary context switches anyway). Another advantage is that the contention on the run queue spinlock, usually the worst in the system, is far lower.

To demonstrate the full effect of this approach, we started four identical processes, each reading the first 64 blocks of the same file forever:

    fd = open("bigfile", 0);
    for (;;) {
        lseek(fd, 0, 0);
        read(fd, buf, 64*4096);
    }

Then we started another four reading a different file. The system had two processors. Using the new run queue scheme and scheduler, the system did 90% fewer context switches per second, and got 50% more "real work" done (bytes read) per second. The effect on a typical workload is not yet known.

What happens is that each resource becomes informally associated, for perhaps a few seconds at a time, with a particular CPU; each CPU becomes a de facto server for a set of resources. If process a, running on CPU 1, blocks while trying to access a resource that is being used by process b, running on CPU 2, then when b releases the resource, a goes on CPU 2's run queue. Thus, the users of a certain resource tend to migrate to the same CPU as other users of the same resource. This may also improve the processor cache efficiency. The global user-priority run queue provides load balancing among CPUs.

15. Conclusions

Our experience with event-wait on UTS has clearly been positive. The code, except for per-CPU run queues, has been incorporated into the UTS product beginning with release 2.1. It is conceptually easy to understand and leads to simpler and more reliable kernel algorithms. There are just two new primitives, sleepl() and waitlock(), with trivial implementations.

The importance of easing porting from the standard uniprocessing kernel is hard to overstate, especially as the kernel (unfortunately) becomes more complicated. Porting the uniprocessor buffer cache code, for example, was almost as easy as adding spinlock() to the top of each function, freelock() to the bottom, and replacing each sleep() with sleepl() (a single spinlock protects the entire buffer cache). None of the algorithms needed modification. (Of course, it's not always quite that easy!)

Our performance measurements so far have been encouraging. Without the per-CPU run queues (which are still being evaluated), the throughput of the event-wait kernel was consistently measured to be 0.8% to 1.2% better than the semaphore based kernel on a typical mix (compiles, greps, troffs, etc.), depending on the load. The only difference was that in the event-wait kernel, semaphores were implemented using event-wait rather than as primitives; none of the other algorithms were changed (i.e., they used semaphores).

A greater relative improvement can be expected as more of the kernel is converted to use event-wait directly, and as a result of per-CPU run queues. This approach might provide a good basis for future multiprocessing kernels, since it is in no way specific to the S/370 architecture or to UTS.

Acknowledgements

Thanks to Joan Kalmanek and Craig Harmer for their contributions to this project. Gerard Holzmann provided lots of help with using his protocol verifier.

Appendix A: Event-Wait and Monitors

The event-wait mechanism is actually a variant of the traditional monitor synchronization model as described in Ben-Ari [1990]. A particular implementation of monitors is defined by several independent characteristics, many of which are described in the literature as if they were dependent. What all forms of monitors have in common is the existence of two routines, wait() and signal() (sometimes with different names). The wait() primitive always suspends the current process (unlike the semaphore P() operation); signal() unblocks one or more processes. Each of the following sections describes a separate monitor characteristic.

1. implicit vs. explicit mutual exclusion

Traditionally, a monitor is described as consisting of a set of procedures and data, encapsulated in a monitor construct provided by the language. In some cases, a monitor is a class, from which many instances of the encapsulated data can be created. The language provides implicit per-instance mutual exclusion among the procedures within the monitor, ensuring single-threaded execution.

The alternative is explicit mutual exclusion of code segments using calls to primitives such as spinlock() and freelock() on a multiprocessor or spl6() and spl0() on a uniprocessor. This is how most "real systems" are written, not only because C does not support monitors, but because of the added flexibility of being able to apply the mutual exclusion exactly where needed, without having to pull code segments into separate monitor functions. (Of course, the encapsulation approach can be followed by convention.)

2. blocking vs. non-blocking mutual exclusion

If a process finds the mutual exclusion lock unavailable, the system can either suspend the process, or spin until the lock is free. In the practical world, spinning is almost always best, because the lock duration should be small (otherwise a design problem is indicated).

Also, blocking mutual exclusion presents difficulties for interrupt handlers, which cannot sleep. The only use of a blocking mutual exclusion lock in UTS is in the memory manager (which should probably be redesigned to use spinlocks). When a disk paging request completes, but the disk interrupt handler cannot get the memory manager lock, it queues a data structure to a special list (protected by a spinlock) of "paging operations completed but not yet processed through the memory manager." Then, whenever the memory manager lock is freed, any items on this list are processed. This obviously complicates the memory manager.
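A rough sketch of that deferred list, with hypothetical names throughout (mm_lock, donelist, done_lk; CP() is the conditional lock attempt discussed in Appendix B); the real memory manager code is more involved:

    /* disk paging interrupt: handle the completion now if the
     * memory manager lock is free, otherwise defer it */
    pagedone(pp)
    struct pageop *pp;
    {
        if (CP(&mm_lock)) {
            process the completed operation;
            V(&mm_lock);        /* releasing the lock also drains donelist */
        } else {
            spinlock(&done_lk);
            pp->next = donelist;
            donelist = pp;
            freelock(&done_lk);
        }
    }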

3. condition variables vs. wait channels

In the usual description of monitors, wait() is passed a pointer to a condition variable. The process is queued on the condition variable, to be later dequeued by signal(). There is no state associated with a condition variable except the list of processes. In the UNIX kernel, the argument to wait()'s equivalent (sleep()) is an arbitrary integer, the channel, which usually is hashed to one of many sleep queues for speed.

The difference is almost purely cosmetic. The arbitrary integer approach simplifies matters by not requiring variables to be allocated (and named!). Also, condition variables might use a significant amount of space if they are members of highly replicated structures.

4. immediate resumption vs. normal scheduling

In the usual description of monitors, if signal() finds a process waiting, that process is made runnable; but more than that, it is sure to be the next process to run in the monitor. The advantage is that the condition which was made true by the process doing the signal() need not be reevaluated by the awakened process - it is guaranteed to be true. Usually signal() is required to be the last statement of the monitor routine so that the signaling process cannot change the state of the monitor before the awakened process runs.

The alternative is that signal() simply puts the process on the run queue, and an arbitrary amount of time might pass until it runs. Since the state of the monitor might change between the signal and when the awakened process runs, it must reevaluate the condition that caused it to block. This is the approach taken in Mesa [Lampson & Redell 1980].

How is immediate resumption implemented? If the mutual exclusion lock is the blocking type, then signal() can simply not free the lock (and not have wait() try to get the lock before returning). But this approach won't work if the mutual exclusion lock is a spinlock, unless the run queue is bypassed entirely, with the signaling process directly resuming the process being signaled. Immediate resumption can lead to convoys (discussed elsewhere in this paper), since a process cannot reenter the monitor after doing a signal() without giving up the CPU.

5. wake one vs. wake all

This alternative is actually a subset of the "normal scheduling" case above; it does not apply to immediate resumption. If signal() finds more than one process waiting, should it make just one process runnable, or all waiting processes?

Waking just one process is sometimes more efficient (however, see the "Thundering Herd" section of this paper), but adds to the complexity of the monitor. When the process resumes inside the monitor, it bears a heavy burden of responsibility: if the process finds the condition true (i.e., this was not an unnecessary wake up), but decides not to make the sleep condition false for any reason, it must signal the next waiting process. For example, after sleeping on a busy buffer in the getblk() buffer cache routine (see Appendix B), the process cannot blithely goto loop as if it has just come into the routine for the first time; it might choose a different buffer and thus fail to make the original buffer busy. That could cause processes waiting for the original buffer to wait forever. Also, the mapping from conditions to events must be unique across the entire system.

Waking everyone, also called a broadcast signal, allows each awakened process to worry only about itself with regard to the event that has been signaled. It minimizes complexity by allowing algorithms to be written in the busy-wait model, adding process synchronization at the last minute.

Monitors Summary

Characteristics 1 through 3 are mostly superficial, and do not affect the structure of kernel algorithms much. Characteristics 4 and 5 can have a significant impact on algorithms. The standard uniprocessing UNIX kernel uses explicit mutual exclusion, wait channels, normal scheduling, and wake all. UTS uses, in addition, non-blocking mutual exclusion (spinlocks).

Appendix B: Problems with Semaphores

One of the areas in which semaphores have given us serious problems is the section of the kernel that implements the buffer cache. By looking at how the design evolved as problems were discovered, we can gain insight into the question of when semaphores are appropriate and when they are not. (While the story of this evolution is not historically accurate, it's not far off.) Although the following discussion applies to both uniprocessing and multiprocessing kernels, we will ignore the multiprocessing issues; the structures of the algorithms are the same.

The getblk() routine finds a buffer in the disk buffer cache [Bach 1986]. Here is an outline of getblk() using event-wait (for simplicity, we assume there are no delayed-write (dirty) buffers):

    loop:
        if (find buffer in cache) {
            if (buffer BUSY) {
                sleep(&buffer);
                goto loop;
            }
            set buffer BUSY;
            freecnt--;
            take buffer off freelist;
            return(buffer);
        }
        if (freecnt == 0) {
            sleep(&freelist);
            goto loop;
        }
        freecnt--;
        set first buffer on freelist BUSY;
        take first buffer off freelist;
        reassign buffer;
        return(buffer);

There are two "polling" loops, corresponding to the two goto statements. In the first loop, the cache hit case, the current process p polls the buffer it wants until it is not busy. In the second loop, for a cache miss, p polls for the freelist to be non-empty. The loops must enclose the search as well as the test for BUSY because anything can happen while p sleeps.

For the cache hit case, when the buffer is no longer busy, p sets it busy, unlinks it from somewhere in the freelist (the buffer must be there since it wasn't busy) and returns. In the case of a cache miss, when the freelist becomes non-empty, p takes an old buffer from the freelist and reassigns it (i.e., changes the disk block this buffer will henceforth be associated with).

Converting this routine to use semaphores seems straightforward. Since p might wait for two kinds of resources (a particular buffer, and any buffer on the freelist), we replace each buffer's BUSY bit with a semaphore, and replace freecnt with a semaphore, freesema, initialized to the number of buffers. Here is a first try using semaphores:

    if (find buffer in cache) {
        P(buffer);
        P(freesema);
        take buffer off freelist;
        return(buffer);
    }
    P(freesema);
    P(first buffer on freelist);
    take first buffer off freelist;
    reassign buffer;
    return(buffer);

It's pretty easy to see why this version doesn't work. If p finds the buffer, and blocks while locking it (P(buffer)), the buffer might be reassigned, causing p to return the wrong buffer. Similarly, if the process doesn't find the buffer, it might block on the freelist semaphore, allowing the buffer it is looking for to appear in the cache. This results in two buffers corresponding to the same disk block - bad news.

Here is a fix for the first problem:

    loop:
        if (find buffer in cache) {
            P(buffer);
            if (wrong buffer) {
                V(buffer);
                goto loop;
            }
            P(freesema);
            take buffer off freelist;
            return(buffer);
        }

After finding the buffer, p locks it. Since in doing so p might have blocked, it then verifies that this buffer is still the one it wants. If not, p releases the buffer and re-evaluates.

Since in practice the process usually doesn't block in P(), verifying the match every time seems wasteful. Even worse, the analogous solution to the second problem (the case of a cache miss) requires that p make a second search for the buffer every time, since it might have blocked on the freelist semaphore. The problem seems to be that p doesn't know when it blocks in P(); that's hidden from it. We can make blocking explicit with a new primitive, CP(), the conditional version of P() (CP() is available in UTS). CP() attempts to lock the semaphore but will not block, returning zero instead. Here is a version using CP():

    loop:
        if (find buffer in cache) {
            if (!CP(buffer)) {
                P(buffer);
                V(buffer);
                goto loop;
            }
            P(freesema);
            take buffer off freelist;
            return(buffer);
        }
        if (!CP(freesema)) {
            P(freesema);
            V(freesema);
            goto loop;
        }
        P(first buffer on freelist);
        take first buffer off freelist;
        reassign buffer;
        return(buffer);

Suppose p finds the buffer. If the buffer is available, CP() succeeds, and since p hasn't blocked, verifying the match is unnecessary - the normal case is fast. If CP() fails, p sleeps until the buffer is free, and re-evaluates. Similarly, p knows it must re-evaluate if it blocks on the freelist.

This version does not work (with the usual "strict" semaphores). Suppose two processes, p and q, block while trying to access the same buffer. When the buffer is released, p runs, does a V() (which puts q on the run queue), loops back, and finds the same buffer. But the attempt to lock the buffer fails (the semaphore count is zero) because the buffer is owned by q (who is waiting on the run queue). So p blocks, q runs, and the cycle repeats forever. The solution is to re-evaluate only if the buffer has been reassigned:

    loop:
        if (find buffer in cache) {
            if (!CP(buffer)) {
                P(buffer);
                if (wrong buffer) {
                    V(buffer);
                    goto loop;
                }
            }
            P(freesema);
            take buffer off freelist;
            return(buffer);
        }

In the cache miss case, the freelist semaphore can cause the same kind of livelock, but no analogous solution exists. Although keeping track of the number of items on the freelist seems such a natural use of semaphores, we know of no versions that actually do. Instead, a semaphore that never goes positive is used to block processes, in an event-wait style.

There is yet another problem with all the above semaphore-based versions (assuming strict semaphores). Just because a buffer is on the freelist, a P() operation on it might still block. The buffer may be owned by a process on the run queue, who is about to take the buffer off the freelist. So in the cache miss case, the process must scan the freelist until it finds a buffer that can be locked. More generally, it is harder to enforce invariants (such as "a buffer is on the freelist if and only if it is not locked") because of the existence of an interval between when a resource (a buffer) becomes owned by someone and when that process actually runs and changes the state of the system accordingly (takes the buffer off the freelist).

Appendix C: Spin Model for Sleep/Wakeup with "Wanted" Flag

Using BITSTATE, this model needed 3 minutes of CPU time and 32MB on an IBM 3090 running UTS to get a coverage factor over 100 (see Holzmann [1990]). No errors were found.

    /*
     * get resource:
     *
     *    spinlock(&r->lk);
     *    while (r->lock) {
     *        r->wanted = 1;
     *        sleepl(r, &r->lk);
     *    }
     *    r->lock = 1;
     *    freelock(&r->lk);
     *
     * use resource...
     *
     * free resource:
     *
     *    r->lock = 0;
     *    waitlock(&r->lk);
     *    if (r->wanted) {
     *        r->wanted = 0;
     *        waitlock(&r->lk);
     *        wakeup(r);
     *    }
     */

    /* number of cpus (processes) */
    #define N 5
    #define RUN 0
    #define SLEEP 1

    /* resource wanted flag and sleep lock */
    byte r_wanted, r_lock;

    /* resource spinlock, 0 or 1 */
    byte r_lk;

    /* sleep queue spinlock, 0 or 1 */
    byte sq_lk;

    /* process state, SLEEP or RUN */
    byte state[N];

    proctype cpu(int i)
    {
        int j;

        /* get resource: */
        atomic { (r_lk == 0) -> r_lk = 1 };
        do
        :: (r_lock == 1) ->
            r_wanted = 1;
            /* inside sleepl() */
            atomic { (sq_lk == 0) -> sq_lk = 1 };
            r_lk = 0;
            state[i] = SLEEP;
            sq_lk = 0;
            /* wait to be awakened */
            (state[i] == RUN);
            atomic { (r_lk == 0) -> r_lk = 1 }
        :: (r_lock == 0) ->
            break
        od;
        assert(r_lock == 0);
        r_lock = 1;
        r_lk = 0;

        /* use resource ... free resource */
        assert(r_lock == 1);
        r_lock = 0;
        (r_lk == 0);
        if
        :: (r_wanted == 1) ->
            r_wanted = 0;
            (r_lk == 0);
            /* inside wakeup() */
            atomic { (sq_lk == 0) -> sq_lk = 1 };
            /* wakeup everyone */
            atomic {
                do
                :: (j < N) -> state[j] = RUN; j = j+1
                :: (j >= N) -> break
                od
            };
            sq_lk = 0
        :: (r_wanted == 0) ->
            skip
        fi
    }

    init {
        int i;

        atomic {
            do
            :: (i < N) -> run cpu(i); i = i+1
            :: (i >= N) -> break
            od
        }
    }

References

M. J. Bach, The Design of the UNIX Operating System, Englewood Cliffs, NJ: Prentice-Hall, 1986.

B. Beck, B. Kasten, and S. Thakkar, VLSI Assist for a Multiprocessor, Proceedings of the Second International Conference on Architectural Support for Programming Languages and Operating Systems, pages 10-20, October 1987.

M. Ben-Ari, Principles of Concurrent and Distributed Programming, Englewood Cliffs, NJ: Prentice Hall, 1990.

E. W. Dijkstra, Cooperating Sequential Processes, in Programming Languages, ed. F. Genuys, pages 43-112, New York: Academic Press, 1968.

W. A. Felton, G. L. Miller, and J. M. Milner, A UNIX System Implementation for System/370, AT&T Bell Laboratories Technical Journal, 63(8), Part 2, pages 1751-1768, October 1984.

Gerard Holzmann, Algorithms for Automated Protocol Validation, AT&T Technical Journal, 69(1):32-59, January/February 1990.

B. W. Lampson and D. D. Redell, Experience with Processes and Monitors in Mesa, Communications of the ACM, 23(2):105-117, February 1980.

T. P. Lee, M. W. Luppi, and R. E. Menninger, Solving Performance Problems on a Multiprocessor UNIX System, Proceedings of the USENIX Conference, pages 399-405, Summer 1987.

K. Thompson, UNIX Implementation, The Bell System Technical Journal, 57(6), Part 2, pages 1931-1946, July-August 1978.

[submitted April 9, 1990; revised June 14, 1990; accepted June 20, 1990]
