USENIX Association

Proceedings of the FREENIX Track: 2001 USENIX Annual Technical Conference

Boston, Massachusetts, USA June 25-30, 2001

THE ADVANCED COMPUTING SYSTEMS ASSOCIATION

© 2001 by The USENIX Association. All Rights Reserved. For more information about the USENIX Association: Phone: 1 510 528 8649 FAX: 1 510 548 5738 Email: [email protected] WWW: http://www.usenix.org

Rights to individual papers remain with the author or the author's employer. Permission is granted for noncommercial reproduction of the work for educational or research purposes. This copyright notice must be included in the reproduced paper. USENIX acknowledges all trademarks herein.

Improving the FreeBSD SMP implementation

Greg Lehey
IBM LTC Ozlabs
[email protected] [email protected]

ABSTRACT

UNIX-derived operating systems have traditionally had a simplistic approach to process synchronization which is unsuited to multiprocessor application. Initial FreeBSD SMP support kept this approach by allowing only one process to run in kernel mode at any time, and also blocked interrupts across multiple processors, causing seriously suboptimal performance of I/O bound systems. This paper describes work done to remove this bottleneck, replacing it with fine-grained locking. It derives from work done on BSD/OS and has many similarities with the approach taken in SunOS 5. Synchronization is performed primarily by a locking construct intermediate between a spin lock and a binary semaphore, termed a mutex. In general, mutexes attempt to block rather than to spin in cases where the likely wait time is long enough to warrant a process switch. The issue of blocking interrupt handlers is addressed by attaching a process context to the interrupt handlers. Despite this process context, an interrupt handler normally runs in the context of the interrupted process and is scheduled only when blocking is required.

Introduction

A crucial issue in the design of an operating system is the manner in which it shares resources such as memory, data structures and processor time. In the UNIX model, the main clients for resources are processes and interrupt handlers. Interrupt handlers operate completely in kernel space, primarily on behalf of the system. Processes normally run in one of two different modes, user mode and kernel mode. User mode code is the code of the program from which the process is derived, and kernel mode code is part of the kernel. This structure gives rise to multiple potential conflicts.

Use of processor time

The most obvious demand a process or interrupt routine places on the system is that it wants to run: it must execute instructions. In traditional UNIX, the rules governing this sharing are:

• There is only one processor. All code runs on it.
• If both an interrupt handler and a process are available to run, the interrupt handler runs.
• Interrupt handlers have different priorities. If one interrupt handler is running and one with a higher priority becomes runnable, the higher priority interrupt immediately preempts the lower priority interrupt.
• The scheduler runs when a process voluntarily relinquishes the processor, its time slice expires, or a higher-priority process becomes runnable. The scheduler chooses the highest priority process which is ready to run.
• If the process is in kernel mode when its time slice expires or a higher priority process becomes runnable, the system waits until it returns to user mode or sleeps before running the scheduler.

This method works acceptably for the single processor machines for which it was designed. In the following section, we'll see the reasoning behind the last decision.

Kernel data objects

The most obvious problem is access to memory. Modern UNIX systems run with memory protection, which prevents processes in user mode from accessing the address space of other processes. This protection no longer applies in kernel mode: all processes share the kernel address space, and they need to access data shared between all processes. For example, the fork() system call needs to allocate a proc structure for the new process. The file sys/kern_fork.c contains the following code:

    int
    fork1(p1, flags, procp)
        struct proc *p1;
        int flags;
        struct proc **procp;
    {
        struct proc *p2, *pptr;
        ...
        /* Allocate new proc. */
        newproc = zalloc(proc_zone);

The function zalloc takes a struct proc entry off a freelist and returns its address:

    item = z->zitems;
    z->zitems = ((void **) item)[0];
    ...
    return item;

What happens if the currently executing process is interrupted exactly between the first two lines of the code above, maybe because a higher priority process wants to run? item contains the pointer to the process structure, but z->zitems still points to it. If the interrupting code also allocates a process structure, it will go through the same code and return a pointer to the same memory area, creating the process equivalent of Siamese twins.

UNIX solves this issue with the rule "The UNIX kernel is non-preemptive". This means that when a process is running in kernel mode, no other process can execute kernel code until the first process relinquishes the kernel voluntarily, either by returning to user mode, or by sleeping.

Synchronizing processes and interrupts

The non-preemption rule only applies to processes. Interrupts happen independently of process context, so a different method is needed. In device drivers, the process context ("top half") and the interrupt context ("bottom half") must share data. Two separate issues arise here: each half must ensure that any changes to shared data structures occur in a consistent manner, and they must find a way to synchronize with each other.

Protection

Each half must protect its data against change by the other half. For example, the buffer header structure contains a flags word with 32 flags, some set and reset by both halves. Setting and resetting bits requires multiple instructions on most architectures, so the potential for data corruption exists. UNIX solves this problem by locking out interrupts during critical sections. Top half code must explicitly lock out interrupts with the spl functions (see footnote 1). One of the most significant sources of bugs in drivers is inadequate synchronization with the bottom half.

Interrupt code does not need to perform any special synchronization: by definition, processes don't run when interrupt code is active.

Blocking interrupts has a potential danger that an interrupt will not be serviced in a timely fashion. On PC hardware, this is particularly evident with serial I/O, which frequently generates an interrupt for every character. At 115200 bps, this equates to an interrupt roughly every 85 µs. In the past, this has given rise to the dreaded silo overflows; even on fast modern hardware it can be a problem. It's also not easy to decide interrupt priorities: in the early days, disk I/O was given a high priority in order to avoid overruns, while serial I/O had a low priority. Nowadays disk controllers can handle transfers by themselves, but overruns are still a problem with serial I/O.

1. The naming goes back to the early days of UNIX on the PDP-11. The PDP-11 had a relatively simplistic level-based interrupt structure. When running at a specific level, only higher priority interrupts were allowed. UNIX named functions for setting the interrupt priority level after the PDP-11 SPL instruction, so initially the functions had names like spl4 and spl7. Later machines came out with interrupt masks, and BSD changed the names to more descriptive names such as splbio (for block I/O) and splhigh (block out all interrupts).

Waiting for the other half

In other cases, a process will need to wait for some event to complete. The most obvious example is I/O: a process issues an I/O request, and the driver initiates the transfer. It can be a long time before the transfer completes: if it's reading keyboard input, for example, it could be weeks before the I/O completes. When the transfer completes, it causes an interrupt, so it's the interrupt handler which finally determines that the transfer is complete and notifies the process. Traditional UNIX performs this synchronization with the functions sleep and wakeup, though current BSD no longer uses sleep: it has been replaced with tsleep, which offers additional functionality.

The top half of a driver calls sleep or tsleep when it wants to wait for an event, and the bottom half calls wakeup when the event occurs. In more detail,

• The process issues a system call read, which brings it into kernel mode.
• read locates the driver for the device and calls it to initiate a transfer.
• read next calls tsleep, passing it the address of some unique object related to the request. tsleep stores the address in the proc structure, marks the process as sleeping and relinquishes the processor. At this point, the process is sleeping.
• At some later point, when the request is complete, the interrupt handler calls wakeup with the address which was passed to tsleep. wakeup runs through a list of sleeping processes and wakes all processes waiting on this particular address.

This method has problems even on single processors: the time to wake processes depends on the number of sleeping processes, which is usually only slightly less than the number of processes in the system. FreeBSD addresses this problem with 128 hashed sleep queues, effectively diminishing the search time by a factor of 128. A large system might have 10,000 processes running at the same time, so this is only a partial solution.

In addition, it is permissible for more than one process to wait on a specific address. In extreme cases dozens of processes wait on a specific address, but only one will be able to run when the resource becomes available; the rest call tsleep again. The term thundering horde has been devised to describe this situation. FreeBSD has partially solved this issue with the wakeup_one function, which only wakes the first process it finds. This still involves a linear search through a possibly large number of process structures, and it has the potential to deadlock if two unrelated events map to the same address.

Adapting the UNIX model to SMP

A number of the basic assumptions of this model no longer apply to SMP, and others become more of a problem:

• More than one processor is available. Code can run in parallel.
• Interrupt handlers and user processes can run on different processors at the same time.
• The "non-preemption" rule is no longer sufficient to ensure that two processes can't execute at the same time, so it would theoretically be possible for two processes to allocate the same memory.
• Locking out interrupts must happen in every processor. This can adversely affect performance.

The initial FreeBSD model

The original version of FreeBSD SMP support solved these problems in a manner designed for reliability rather than performance: effectively it found a method to simulate the single-processor paradigm on multiple processors. Specifically, only one process could run in the kernel at any one time. The system ensured this with a spinlock, the so-called Big Kernel Lock (BKL), which ensured that only one processor could be in the kernel at a time. On entry to the kernel, each processor attempted to get the BKL. If another processor was executing in kernel mode, the processor trying to enter performed a busy wait until the lock became free:

    MPgetlock_edx:
    1:
        movl    (%edx), %eax
        movl    %eax, %ecx
        andl    $CPU_FIELD,%ecx
        cmpl    _cpu_lockid, %ecx
        jne     2f
        incl    %eax
        movl    %eax, (%edx)
        ret
    2:
        movl    $FREE_LOCK, %eax
        movl    _cpu_lockid, %ecx
        incl    %ecx
        lock
        cmpxchg %ecx, (%edx)
        jne     1b
        GRAB_HWI
        ret

In an extreme case, this waiting could degrade SMP performance to below that of a single processor machine.

How to solve the dilemma

Multiple processor machines have been around for a long time, since before UNIX was written. During this time, a number of solutions to this kind of problem have been devised. The problem was less to find a solution than to find a solution which would fit in the UNIX environment. At least the following synchronization primitives have been used in the past:

• Counting semaphores were originally designed to share a certain number of resources amongst potentially more consumers. To get access, a consumer decrements the semaphore counter, and when it is finished it increments it again. If the semaphore counter goes negative, the process is placed on a sleep queue. If an increment finds the counter negative, the first process on the sleep queue is activated. This approach is a possible alternative to tsleep and wakeup synchronization. In particular, it avoids a lengthy sequential search of sleeping processes.
• SunOS 5 uses turnstiles to address the sequential search problem in tsleep and wakeup synchronization. A turnstile is a separate queue associated with a specific wait address, so the need for a sequential search disappears.
• Spin locks have already been mentioned. FreeBSD used to spin indefinitely on the BKL, which doesn't make any sense, but spin locks are useful in cases where the wait is short. With a blocking lock, a wait results in the process being suspended and subsequently rescheduled. If the average wait for a resource is less than the time this takes, then it makes sense to spin instead.
• Blocking locks are the alternative to spin locks when the wait is likely to be longer than it would take to reschedule. A typical implementation is similar to a counting semaphore with a count of 1.
• Condition variables are a kind of blocking lock where the lock is based on a condition, for example the absence of entries in a queue.
• Read/write locks address a different issue: frequently multiple processes may read specific data in parallel, but only one may write it.

There is some confusion in terminology with these locking primitives. In particular, the term mutex has been applied to nearly all of them at different times. We'll look at how FreeBSD uses the term in the next section.

One big problem with all locking primitives with the exception of spin locks is that they can block. This requires a process context: a UNIX interrupt handler can't block. This is one of the reasons that the old BKL was a spinlock, even though it could potentially use up most of the processor time spinning.

The new FreeBSD implementation

The new implementation of SMP on FreeBSD is based heavily on the implementation in BSD/OS 5.0, which has not yet been released. Even the name SMPng ("new generation") was taken from BSD/OS. Due to the open source nature of FreeBSD, SMPng is available on FreeBSD before it is available on BSD/OS.

The most radical differences in SMPng are:

• Interrupt code ("bottom half") now runs in a process context, enabling it to block if necessary. This process context is termed an interrupt thread.
• Interrupt lockout primitives (splfoo) have been removed. The low-level interrupt code still needs to block interrupts briefly, but the interrupt service routines themselves run with interrupts enabled. Instead of locking out interrupts, the system uses mutexes, which may be either spin locks or blocking locks.

Interrupt threads

The single most important aspect of the implementation is the introduction of a process or "thread" context for interrupt handlers. This change involves a number of tradeoffs:

• The process context allows a uniform approach to synchronization: it is no longer necessary to provide separate primitives to synchronize the top half and the bottom half. In particular, the spl primitives are no longer needed. For compatibility reasons, the calls have been retained, but they translate to no-ops.
• The action of scheduling another process takes significantly longer than interrupt overhead, which also remains.
• The UNIX approach to scheduling does not allow preemption if the process is running in kernel mode.

SMPng solves the latency and scheduling issues with a technique known as lazy scheduling: on receiving an interrupt, the interrupt stubs note the PID of the interrupt thread, but they do not schedule the thread. Instead, the handler continues execution in the context of the interrupted process. The thread will be scheduled only in the following circumstances:

• If the thread has to block.
• If the interrupt nesting level gets too deep.

We expect this method to offer negligible overhead for the majority of interrupts.

From a scheduling viewpoint, the threads differ from normal processes in the following ways:

• They never enter user mode, so they do not have user text and data segments.
• They all share the address space of process 0, the swapper.
• They run at a higher priority than all user processes.
• Their priority is not adjusted based on load: it remains fixed.
• An additional process state SWAIT has been introduced for interrupt processes which are currently idle: the normal "idle" state is SSLEEP, which implies that the process is sleeping.

Experience with the BSD/OS implementation showed that the initial implementation of interrupt threads was a particularly error-prone process, and that the debugging tools were inadequate. Due to the nature of the FreeBSD project, we considered it imperative to have the system relatively functional at all times during the transition, so we decided to implement interrupt threads in two stages. The initial implementation was very similar to that of normal processes. This offered the benefits of relatively easy debugging and of stability, and the disadvantage of a significant drop in performance: each interrupt could potentially cause two context switches, and the interrupt would not be handled while another process, even a user process, was in the kernel.

Experience with the initial implementation met expectations: we have seen no stability problems with the implementation, and the performance, though significantly worse, was not as bad as we had expected.

At the time of writing, we have improved the implementation somewhat by allowing limited kernel preemption, allowing interrupt threads to be scheduled immediately rather than having to wait for the current process to leave kernel mode. The potential exists for complete kernel preemption, where any higher priority process can preempt a lower priority process running in the kernel, but we are not sure that the benefits will outweigh the potential bug sources.

The final lazy scheduling implementation has been tested, but it is not currently in the -CURRENT kernel. Due to the current kernel lock implementation, it would not show any significant performance increase, and problems can be expected as additional kernel components are migrated from under Giant.

Not all interrupts have been changed to threaded interrupts. In particular, the old fast interrupts remain relatively unchanged, with the restriction that they may not use any blocking mutexes. Fast interrupts have typically been used for the serial drivers, and are specific to FreeBSD: BSD/OS has no corresponding functionality.

Locking constructs

The initial BSD/OS implementation defined two basic types of lock, called mutex:

• The default locking construct is the spin/sleep mutex. This is similar in concept to a semaphore with a count of 1, but the implementation allows spinning for a certain period of time if this appears to be of benefit (in other words, if it is likely that the mutex will become free in less time than it would take to schedule another process), though this feature is not currently in use. It also allows the user to specify that the mutex should not spin. If the process cannot obtain the mutex, it is placed on a sleep queue and woken when the resource becomes available.
• An alternate construct is a spin mutex. This corresponds to the spin lock which was already present in the system. Spin mutexes are used only in exceptional cases.

The implementation of these locks was derived almost directly from BSD/OS, but has since been modified significantly.

In addition to these locks, the FreeBSD project has included two further locking constructs: condition variables and shared/exclusive (sx) locks. The operations on sx locks are:

• Create an sx lock with sx_init().
• Attain a read (shared) lock with sx_slock() and release it with sx_sunlock().
• Attain a write (exclusive) lock with sx_xlock() and release it with sx_xunlock().
• Destroy an sx lock with sx_destroy().

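The spin/sleep mutex described under "Locking constructs" can be pictured as a bounded spin followed by a fall-back to sleeping. The C11 sketch below shows only that decision; it is an assumption-laden illustration, not the BSD/OS or FreeBSD code. The names adaptive_mtx, amtx_acquire and amtx_release, the integer owner ids and the spin_limit parameter are all hypothetical, and where a kernel would place the process on a sleep queue the sketch simply returns WOULD_SLEEP.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

enum acquire_result { ACQUIRED, WOULD_SLEEP };

/* Hypothetical mutex: owner is 0 when free, else the holder's id. */
struct adaptive_mtx {
    atomic_int owner;
};

void amtx_init(struct adaptive_mtx *m)
{
    atomic_init(&m->owner, 0);
}

/*
 * Try to take the mutex for `self` (non-zero id). One attempt is always
 * made; up to spin_limit further attempts follow. spin_limit == 0 models
 * the "do not spin" option.
 */
enum acquire_result amtx_acquire(struct adaptive_mtx *m, int self,
                                 int spin_limit)
{
    for (int i = 0; i <= spin_limit; i++) {
        int expected = 0;
        if (atomic_compare_exchange_strong(&m->owner, &expected, self))
            return ACQUIRED;
    }
    /* A kernel would enqueue the process here and switch away. */
    return WOULD_SLEEP;
}

/* Release the mutex; only the current holder may do so. */
void amtx_release(struct adaptive_mtx *m, int self)
{
    int expected = self;
    bool ok = atomic_compare_exchange_strong(&m->owner, &expected, 0);
    assert(ok);
    (void)ok;   /* keep the build quiet when assertions are disabled */
}
```

The design point is the one made in the text: spinning pays off only if the holder is likely to release the mutex sooner than a process switch would take, so the spin is bounded and the caller sleeps once the bound is exceeded.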
Condition variables are built on top of mutexes. They consist of a mutex and a wait queue. The following operations are supported:

• Acquire a condition variable with cv_wait(), cv_wait_sig(), cv_timedwait() or cv_timedwait_sig().
• Before acquiring the condition variable, the associated mutex must be held. The mutex will be released before sleeping and reacquired on wakeup.
• Unblock one waiter with cv_signal().
• Unblock all waiters with cv_broadcast().
• Test whether the wait queue is empty with cv_waitq_empty().
• The same functionality is also available from the msleep function.

Shared/exclusive locks, or sx locks, are effectively read-write locks. The difference in terminology came from an intention to add additional functionality to these locks. This functionality has not been implemented, so currently sx locks are the same thing as read-write locks: they allow access by multiple readers or a single writer.

The implementation of sx locks is relatively expensive:

    struct sx {
        struct lock_object sx_object;
        struct mtx sx_lock;
        int sx_cnt;
        struct cv sx_shrd_cv;
        int sx_shrd_wcnt;
        struct cv sx_excl_cv;
        int sx_excl_wcnt;
        struct proc *sx_xholder;
    };

They should be used only where the vast majority of accesses is shared.

Removing the Big Kernel Lock

These modifications made it possible to remove the Big Kernel Lock. The initial implementation replaced it with two mutexes:

• Giant is used in a similar manner to the BKL, but it is a blocking mutex. Currently it protects all entry to the kernel, including interrupt handlers. In order to be able to block, it must allow scheduling to continue.
• sched_lock is a spin lock which protects the scheduler queues.

This combination of locks supplied the bare minimum of locks necessary to build the new framework. In itself, it does not improve the performance of the system, since processes still block on Giant.

Idle processes

The planned light-weight interrupt threads need a process context in order to work. In the traditional UNIX kernel, there is not always a process context: the pointer curproc can be NULL. SMPng solves this problem by having an idle process which runs when no other process is active.

Recursive locking

Normally, if a lock is locked, it cannot be locked again. On occasions, however, it is possible that a process tries to acquire a lock which it already holds. Without special checks, this would cause a deadlock. Many implementations allow this so-called recursive locking. The locking code checks for the owner of the lock. If the owner is the current process, it increments a recursion counter. Releasing the lock decrements the recursion counter and only releases the lock when the count goes to zero.

There is much discussion both in the literature and in the FreeBSD SMP project as to whether recursive locking should be allowed at all. In general, we have the feeling that recursive locks are evidence of untidy programming. Unfortunately, the code base was never designed for this kind of locking, and in particular library functions may attempt to reacquire locks already held. We have come to a compromise: in general, recursive locks are discouraged, and recursion must be specifically enabled for each mutex, thus avoiding recursion where it was not intended.

Migrating to fine-grained locking

Implementing the interrupt threads and replacing the Big Kernel Lock with Giant and sched_lock did not result in any performance improvements, but it provided a framework in which the transition to fine-grained locking could be performed. The next step was to choose a locking strategy and migrate individual portions of the kernel from under the protection of Giant.

One of the dangers of this approach is that locking conflicts might not be recognized until very late. In particular, the FreeBSD project has different people working on different kernel components, and it does not have a strong centralized architectural committee to determine locking strategy. As a result, we developed the following guidelines for locking:

• Use sleep mutexes. Spin mutexes should only be used in very special cases and only with the approval of the SMP project team. The only current exception to this rule is the scheduler lock, which by nature must be a spin lock.
• Do not tsleep() while holding a mutex other than Giant. The implementation of tsleep() and cv_wait() automatically releases Giant and gains it again on wakeup, but no other mutexes will be released.
• Do not msleep() or cv_wait() while holding a mutex other than Giant or the mutex passed as a parameter to msleep(). msleep() is a new function which combines the functionality of tsleep() with atomic release and regain of a specified mutex.
• Do not call a function that can grab Giant and then sleep unless no mutexes (other than possibly Giant) are held. This is a consequence of the previous rules.
• If calling msleep() or cv_wait() while holding Giant and another mutex, Giant must be acquired first and released last. This avoids lock order reversals.
• Except for the Giant mutex used during the transition phase, mutexes protect data, not code.
• Do not msleep() or cv_wait() with a recursed mutex. Giant is a special case and is handled automagically behind the scenes, so don't pass Giant to these functions.
• Try to hold mutexes for as little time as possible.
• Try to avoid recursing on mutexes if at all possible. In general, if a mutex is recursively entered, the mutex is being held for too long, and a redesign is in order.

One of the weaknesses of the project structure is that there is no overall strategy for locking. In many cases, the choice of locking construct and granularity is left to the individual developer. In almost every case, locks are leaf node locks: very little code locks more than one lock at a time, and when it does, it is in a very tight context. This results in relatively reliable code, but it may not result in optimum performance.

There are a number of reasons why we persist with this approach:

• FreeBSD is a volunteer project. Developers do what they think is best. They are unlikely to agree to an alternative implementation.
• We do not currently have enough architectural direction, nor enough experience with other SMP systems, to come up with an ideal locking strategy. This derives from the volunteer nature of the project, but note also that large UNIX vendors have found the choice of locking strategy to be a big problem.
• Unlike in large companies, there is much less concern about throwaway implementations. If we find that the performance of a system component is suboptimal, we will discard it and start with a different implementation.

Migrating interrupt handlers

This new basic structure is now in place, and implementation of finer grained locking is proceeding. Giant will remain as a legacy locking mechanism for code which has not been converted to the new locking mechanism. For example, the main loop of the function ithread_loop, which runs an interrupt handler, contains the following code:

    if ((ih->ih_flags & IH_MPSAFE) == 0)
        mtx_lock(&Giant);
    ....
    ih->ih_handler(ih->ih_argument);
    ...
    if ((ih->ih_flags & IH_MPSAFE) == 0)
        mtx_unlock(&Giant);

The flag INTR_MPSAFE indicates that the interrupt handler has its own synchronization primitives.

A typical strategy planned for migrating device drivers involves the following steps:

• Add a mutex to the driver softc.
• Set the INTR_MPSAFE flag when registering the interrupt.
• Obtain the mutex in the same kind of situation where previously an spl was used. Unlike spls, however, the interrupt handlers must also obtain the mutex before accessing shared data structures.

Probably the most difficult part of the process will involve larger components of the system, such as the file system and the networking stack. We have the example of the BSD/OS code, but it's currently not clear that this is the best path to follow.

Kernel trace facility

The ktr package provides a method of tracing kernel events for debugging purposes. It is not intended for use during normal operation, and should not be confused with the kernel call trace facility ktrace.

For example, the function sched_ithd, which schedules the interrupt threads, contains the following code:

    CTR3(KTR_INTR, "sched_ithd pid %d(%s) need=%d",
        ir->it_proc->p_pid, ir->it_proc->p_comm, ir->it_need);
    ...
    if (ir->it_proc->p_stat == SWAIT) {
        CTR1(KTR_INTR, "sched_ithd: setrunqueue %d",
            ir->it_proc->p_pid);

The function ithd_loop, which runs the interrupt in process context, contains the following code at the beginning and end of the main loop:

    for (;;) {
        CTR3(KTR_INTR, "ithd_loop pid %d(%s) need=%d",
            me->it_proc->p_pid, me->it_proc->p_comm, me->it_need);
        ...
        CTR1(KTR_INTR, "ithd_loop pid %d: done",
            me->it_proc->p_pid);
        mi_switch();
        CTR1(KTR_INTR, "ithd_loop pid %d: resumed",
            me->it_proc->p_pid);

The calls CTR1 and CTR3 are two macros which only compile any kind of code when the kernel is built with the KTR kernel option. If the kernel contains this option and the bit KTR_INTR is set in the variable ktr_mask, then these events will be logged to a circular buffer in the kernel. The ddb debugger has a command show ktr which dumps the buffer one page at a time, and gdb macros are also available. This gives a relatively useful means of tracing the interaction between processes:

    2791 968643993:219224100 cpu1 ../../i386/isa/ithread.c:214
        ithd_loop pid 21 ih=0xc235f200: 0xc0324d98(0) flg=100
    2790 968643993:219214043 cpu1 ../../i386/isa/ithread.c:197
        ithd_loop pid 21(irq0: clk) need=1
    2789 968643993:219205383 cpu1 ../../i386/isa/ithread.c:243
        ithd_loop pid 21: resumed
    2788 968643993:219190856 cpu1 ../../i386/isa/ithread.c:158
        sched_ithd: setrunqueue 21
    2787 968643993:219179402 cpu1 ../../i386/isa/ithread.c:120
        sched_ithd pid 21(irq0: clk) need=0

The lines here are too wide for the paper, so they are shown wrapped as several lines. This example traces the arrival and processing of a clock interrupt on the i386 platform, in reverse chronological order. The number at the beginning of the line is the trace entry number.

• Entry 2787 shows the arrival of an interrupt at the beginning of sched_ithd. The second value on the trace line is the time since the epoch, followed by the CPU number and the file name and line number. The remaining values are supplied by the program to the CTR3 function.
• Entry 2788 shows the second trace call in sched_ithd, where the interrupt handler is placed on the run queue.
• Entry 2789 shows the entry into the main loop of ithd_loop.
• Entries 2790 and 2791 show the exit from the main loop of ithd_loop.

Witness facility

The witness code was designed specifically to debug mutex code. It keeps track of the locks acquired and released by each thread. It also keeps track of the order in which locks are acquired with respect to each other. Each time a lock is acquired, witness uses these two lists to verify that a lock is not being acquired in the wrong order. If a lock order violation is detected, then a message is output to the kernel console detailing the locks involved and the locations in question. Witness can also be configured to drop into the kernel debugger when an order violation occurs.

The witness code also checks various other conditions such as verifying that one does not recurse on a non-recursive lock. For sleep locks, witness verifies that a new process would not be switched to when a lock is released or a lock is blocked on during an acquire while any spin locks are held. If any of these checks fail, the kernel will panic.

Project status

The project started in June 2000. The major milestones in the development are:

• June 2000: Ported the BSD/OS mutex code and replaced the Big Kernel Lock with Giant and sched_lock.
• September 2000: Replaced interrupt handlers with heavyweight interrupt threads. Initial commit to the FreeBSD source tree.
• November 2000: Made softclock MP-safe and migrated it from under Giant.
• January 2001: Implemented condition variables.
• March 2001: Implemented read/write locks (called "shared/exclusive" or sx locks).
• March 2001: Completed locking of enough of the proc structure to allow signal handlers to be moved from under Giant.

The main issue in the immediate future is to migrate more and more code out from under Giant. In more detail, we have identified the following major tasks, some of which are in an advanced state of implementation:

• Split NFS into client and server.
• Add locking to NFS.
• Make the IP stack thread-safe.
• Create mechanism in cdevsw structure to protect thread-unsafe drivers.
• Complete locking struct proc.
• Clean up the various mp_machdep.c's, unify various SMP APIs such as IPI delivery, etc.
• Make printf() safe to call in almost any situation to avoid deadlocks.
• Make mbuf system use condition variables instead of msleep() and wakeup().
• Remove the MP safe syscall flag from the system call table and add explicit mtx_lock of Giant to all system calls which need it.
• Use per-CPU buffers for ktr to reduce synchronization.
• Remove the priority argument from msleep() and cv_wait().
• Implement lazy interrupt thread switching (context stealing).
• Lock structs filedesc, pgrp, sigio, session and ifnet.
• Make the virtual memory subsystem thread-safe.
• Convert select() to use condition variables.
• Reimplement kqueue using condition variables.
• Conditionalize atomic operations used for debugging statistics.
• Lock the virtual file system code.

Performance

The implementation has not progressed far enough to make any firm statements about performance, but we are expecting reasonable scalability to beyond 32 processor systems.

Acknowledgements

The FreeBSD SMPng project was made possible by BSDi's generous donation of code from the development version 5.0 of BSD/OS. The main contributors were:

• John Baldwin rewrote the low level interrupt code for i386 SMP, made much code machine independent, worked on the WITNESS code, converted allproc and proctree locks from lockmgr locks to sx locks, created a mechanism in cdevsw structure to protect thread-unsafe drivers, locked struct proc and unified various SMP APIs such as IPI delivery.
• Jake Burkholder ported the BSD/OS locking primitives for i386, implemented msleep(), condition variables and kernel preemption.
• Matt Dillon converted the Big Kernel spinlock to the blocking Giant lock and added the scheduler lock and per-CPU idle processes.
• Jason Evans made malloc and friends thread-safe, converted simplelocks to mutexes and implemented sx (shared/exclusive) locks.
• Greg Lehey implemented the heavy-weight interrupt threads, rewrote the low level interrupt code for i386 UP, removed spls and ported the BSD/OS ktr code.
• Bosko Milekic made sf_bufs thread-safe, cleaned up the mutex API and made the mbuf system use condition variables instead of msleep().
• Doug Rabson ported the BSD/OS locking primitives, implemented the heavy-weight interrupt threads and rewrote the low level interrupt code for the Alpha architecture.

Further contributors were Tor Egge, Seth Kingsley, Jonathan Lemon, Mark Murray, Chuck Paterson, Bill Paul, Alfred Perlstein, Dag-Erling Smørgrav and Peter Wemm.

Further reference

See the FreeBSD SMP home page at http://www.FreeBSD.org/smg/.
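The lock order check described in the section on the witness facility can be illustrated with a small C sketch. This is a simplified model under stated assumptions, not the witness code: locks are identified by small integers, the names order_checker, oc_acquire and oc_release are hypothetical, only directly observed pairs are recorded (no transitive ordering), and a violation is reported through the return value rather than on the kernel console.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_LOCKS 8

/* Hypothetical checker state for one thread of execution. */
struct order_checker {
    /* before[a][b] is true once a was seen held while b was acquired. */
    bool before[MAX_LOCKS][MAX_LOCKS];
    bool held[MAX_LOCKS];
};

void oc_init(struct order_checker *oc)
{
    memset(oc, 0, sizeof(*oc));
}

/*
 * Record the acquisition of `lock`. Returns false if some currently held
 * lock h was previously acquired after `lock`, i.e. the order is reversed.
 */
bool oc_acquire(struct order_checker *oc, int lock)
{
    bool ok = true;

    assert(lock >= 0 && lock < MAX_LOCKS);
    for (int h = 0; h < MAX_LOCKS; h++) {
        if (!oc->held[h] || h == lock)
            continue;
        if (oc->before[lock][h])
            ok = false;                 /* seen earlier: lock, then h */
        else
            oc->before[h][lock] = true; /* remember: h, then lock */
    }
    oc->held[lock] = true;
    return ok;
}

void oc_release(struct order_checker *oc, int lock)
{
    assert(lock >= 0 && lock < MAX_LOCKS);
    oc->held[lock] = false;
}
```

Acquiring lock 0 and then lock 1 records the order "0 before 1"; a later attempt to take lock 0 while holding lock 1 makes oc_acquire() return false. The point of recording orders rather than waiting for an actual deadlock is that a reversal is reported even when the two code paths never run at the same time.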