<<

BeyondMultiprocessing: Multithreadingthe SUnOSKernel

. 4. Fyt4olg S, R. Kleimaq S. Børton,R. Faulkner,A. Shivalingiah, M. Smith,D. Stein,J. Voll, M. Weeks,D. Williams- SunSoft,[nc. ABSTRACT

Preparingthe SUnOSiSVR4kernel for today'schallenges: symmetric multiprocessing, multi-th¡eadedapplications, real-time, and multimedia,led to the incorporationof several innovativetechniques. In particular,the kernelwas re-structuredaround threads. Threads are usedfor most asynchronousprocessing, including interrupts. The resultingkernel is fully preemptibleand capableof real-timeresponse. The combinationprovides a robustbase for highly concurrent,responsive operation.

Introduction having only a small data structure and a stack. Switchingbetween When we started to investigateenhancements kernelthreads does not requirea changeof virtual memory to the SunOSkernel to supportmultiprocessors, we addressspace information, so it is relatively realized that we wanted to go further than merely inexpensive.Kernel threadsare fully preemptible addinglocks to the kernel and keepingthe userpro- and may be scheduledby any of the scheduling cessmodel unchanged.It was importantfor the ker- classesin the system,including the real-time(fixed priority) nel to be capableof a high degreeof concurrencyon class. Sinceall other exe- cution entities tightly coupled symmetricmultiprocessors, but it are built using kernel th¡eads,they representa fully preemptible, was also a goal to supportmore thanone threadof real-time"nucleus" within the kernel. controlwithin a userprocess. These threads must be capableof executingsystem calls and handlingpage Kernel th¡eads use synchronizationprimitives faults independently. On multiprocessorsystems, that supportprotocols for preventingpriority inver- theseth¡eads of control must be capableof running sion, so a 'spriority is determinedby which concurently on different processors.[Powell 1991] activitiesit is impedingby holdinglocks as well as describedthe user-visiblethread architecture. by the serviceit is performing[Khanna 1992]. We also wanted the kernel to be capableof SUnOSuses kernel tlueads to provideasynchro- bounded dispatch latency for real-time threads nous kernel activity, such as asynchronouswrites to [Khanna1992]. Real-time response requires absolute disk, servicingSTREAMS queues, and callouts. This control over scheduling, requiring preemption at removesvarious diversions in the idle loop and trap almostany point in the kernel,and eliminationof code and replaces them with independently unboundedpriority inversionswherever possible. scheduledthreads. Not only does this increase potential (these The kernel itself is a very complex multi- concurrency activities can be han- dledby otherCPUs), it gives th¡eadedprogram. Th¡eads can be used by user but also eachasynchro- nous activity a priority applicationsas a structuring techniqueto manage so that it can be appropri- atelyscheduled. multiple asynchronousactivities; the kernel benefrts from a th¡eadfacility thatis essentiallythe same. Even interruptsare handledby kernel threads. The kernel synchronizeswith The resultingSunOS 5.0 kernel, the central interrupt handlersvia normal thread operatingsystem component of Solaris 2.0, is fully synch¡onizationprimitives. If an interrupt preemptible,has real-timescheduling, symmetrically threadencounters a locked synchronization variable,it supports multiprocessors,and supports userJevel blocks and allows the critical section to clear. multithreading.Several of the locking strategies used in this kernel were describedin [Kleiman A major featureof the new kernelis its support 1992), In this paper we'll describesome of the of multiple kernel-supportedth¡eads of control, implementation features that make this kernel called lightleight processes(t-wts), in any userpro- unique. cess,sharing the addressspace of the processand other resources,such as open files. The kernel sup- Overview of the Kernel Architecture ports the executionof user LWPs by associatinga A kernel threadis the fundamentalentitv that is kernel threadwith eachLWp, as shown in Figure 1. scheduledand dispatchedonto one of the ôpUs of While all LWPshave a kernel thread,not all kernel the system. A kernel thread is very lightweight, th¡eadshave an LWP.

'92 Summer USENIX- June8-June 12,lgg2 - SanAntonio, TX 11 BeyondMultiprocessing: Multithreading the SUnOSKernel J. R. Eykholt,...

= VM addressspace [ Thread C=t*t Q="tu tl LWP data _L_YP_{ala_ r------l thread

Hardware Figure 1: Multi+hreadarchitecture examoles A user-levellibrary uses LWps to implement nl*p user-levelthreads lstein 1992]. These threadsare scheduledat user-leveland switchedbv the librarv E to any of the LrJi,rPsbelonging to the piocess. User L--_____J threads can also be bound to a particular LWp. Figure 2: MT Data Structuresfor a Process Separatinguser-level threads from the LWp allows the user thread library to quickly switch between The kernel threadstructure contains the kernel userthreads without entering the kernel. In addition, registers,scheduling class, dispatch queue links, and it allows a userprocess to havethousands of threads, pointersto the stack and the associatedLwp, pro- without overwhelmingkernel resources. cess,and CPUstructures. The threadstructure is not Data Structures swapped,so it also containssome data associated with the LWP that is neededeven when the LWp In the traditionalkemel, the user and proc structure is swapped out. Thread structures are structurescontained all kernel data for the process. linked on a list of threadsfor the process,and also Processordata was held in global variablesand data on a list of all existingthreads in the system. structures. The per-process data was divided betweennon-swappable data in the proc structure, Per-processordata is kept in the cpu structure, pointers andswappable data in the user structure.The ker- which has to the cunently executingthread, nel stack of the process,which is also swappable, the idle threadfor that CPU,and cunent dispatching was allocatedwith the user structurein the user and interrupt handlinginformation. There is a sub- area,usually one or two pageslong. structureof the cpu structurethat can be architec- ture dependent,but the main body is intendedto be The restructuredkemel must separatethis data applicableto mostmultiprocessing architectures. into data associatedwith each Lwp and its kernel thread,the data associatedwith each process.and To speedaccess to the thread,LWp, process, the data associatedwith each pto"".sot. Figure 2 and CPUstructures, the SPARCimplementation uses global showsthe relationshipof theseðata structuresln the a register,Vog7, to point to the currentthread restructuredkemel. structure. A C-preprocessormacro, curthread, allows accessto fields in the currentthread struc¡ure The per-processdata is contained in the proc with a singleinstruction. The currentLWp, process, structure. It containsa list of kernel threadsassoci- and CPU structuresare quickly accessiblethrough ated with the process, a pointer to the process pointers in the thread structure. In the future we addressspace, user credentials, and the list of signal may dedicateadditional global registersfor other handlers.The proc structurealso containsthe ves- frequentlyaccessed structures. tigial user structure,which is now much smaller thana page,and is no longerpractical to swap. Kernel Thread Scheduling SunOS5.0 provides The LWP structurecontains the per-LWp data severalscheduling classes. such as the process-control-block(pcU¡ for A schedulingclass determines the relativepriority of storing processeswithin user-levelprocessor registers, system call arguments, the class,and convertsthat priority to a global priority. With signal handling masks, resourceusage information, the addition of mul- tithreading,the scheduling and profiling pointers. It also containspointers to classesand dispatcher operate on threads the associatedkemel threadand Drocessstructures. instead of processes. The schedulingclasses The kernel stack of the threadis ãilocatedwith the cunently supportedare system, timesharing, LWPstructure inside a swappablearea. andreal-time (fixed-priority).

'92 t2 Summer USENIX - June 8-June 12, L992- San Antonio, TX J. R Eykholt,... BeyondMultiprocessing: Multithreading the SunOSKernel

The dispatcherchooses the thread with the semaghores,and multiple readers, single writer greatestglobal priority to run on the CpU. If more (readen/writer)locks. The interfacesare shown in than one thread has the same priority, they are Figure4r. dispatchedin round-robinorder. /* Mutual excluglon tocke */ The kernel hasbeen made preemptible to better vold mutex_enter(kmutex_t *Ip) t support the real-time class and interrupt threads. void nutex_exit(kmutex_t *lp); Preemptionis disabledonly in a small number of *Ip, boundedsections of code. This meansthat a runn- void nutex_init(k¡nutex_t cha! *name, knrutex_type_t type, void *arg); able thread runs as soon as is practical after its vold . mutex_destroy ( k¡rutex_t * Ip ) i priority becomeshigh enough. For example,when int mutex_tryênter(kmutex_t *lp); thread A releasesa lock on which higher priority threadB is sleeping,the running threadA immedi- /* conditfon varfablea */ . vold cv_waltlkcondvar_t *cp, kmutex_t ately puts itself back on the run queue and allows int cv_wait_aig(kcondvar_t *cp, "lp); the CPU to run thread B. On a multiprocessor,if kmutex_t *lp) i threadA hasbetter priority than threadB, but thread int cv_timedwaít(kcondvar_t icvp, B has better priority than the current th¡ead on k¡nutex t *Ip, long tirn); anotherCPU, that CPU is directed to preempt vold cv-signallEcondüar_t i.p), its voíd cv_broadcagt(kcondvar_t *cp) currentthread and choosethe best threadto run. In ; addition, user code run by an underlying kernel /* multipte reader, alngle wrlter locks */ thread of sufficient priority (e.g., real-time threads) void rw_init(krwlock_t *Ip, char *narìe., will executeeven thoughother lower priority kernel . krvr_type_t typg, void *arg¡; void rn_destroy(krsrlock_t *Ip) threadswait for executionresources. Further details ; void r$r_enter(knrlock_t *1p, krw_t rvr) i can be found in [Khanna1992]. int rw_tryenter(krwlock_t *lp, Frw_t rw); SystemThreads voÍd rw_exÍt(krwlock_t *Ip) i void rw_downgrade(krqrlock_t *tp) i System threads can be created for short or int rw_tryupgrade(krvrlock:t *lp); long-term activities. They are scheduledlike any other thread, but usually belong to the system /* counting aemaphores */ scheduling void sema_init(kaena_t *sp, class. These threadshave no need for *na¡re, LWP uneigned int val, char structures,so the threadstructure and stack for kEena_type_t type, void *arg¡; these threadscan be allocated together in a non- vold se¡Ìa_destroy ( ksema_t *sp ) t swappablearea, as shownin Figure3. void semaJr(ksema_t *ep); Ínt sena3_aig (k8ena_t *sp) ; I l-. int sena_tryp(ksemå_t *sp) i I void sema_v(ksemat *Bp)i t;; I stack t-l f-^l Figure 4: Kernel ThreadSynchronization lnterfaces ttl ttl These are all implemented such that the thread thread behavior of the synchronizationobject is specified alltlueads when it is initialized. Synchronizationoperations, such as acquiringa mutex lock, take a pointerto the object as an argumentand may behavesomewhat differently dependingon the type and optional type- Figure 3: SystemThreads specific argumentspecified when the objectwas ini- tialized. A new segmentdriver, seg_kp, handlesstack Most of the synchronizationobjects have types allocations. It handlesvirtual memorv allocations that enable collecting statistics such as blocking for the kernel that can be pagedor swäppedout; it counts or times. A patchablekernel variable can also provides "red zones" to protect againststack also set the default types to enablestatistics gather- overflow. Systemthreads use seg_kp for the stack ing. This allows the selectionof statisticsgathering and the threadstructure, in a non-swappableregion. on particularsynchronization objects or on the kernel LWPsuse it to allocatethe LWp structureand kernel as a whole. stackin a swappableregion. SynchronizationArchitecture The kernel implementsthe same synchroniza- tion objects for lNote internal use as are providedby the that kernel synchronizationprimitives must use a user-levellibraries for use in multith¡eadedapplica- different type name than use¡ synchronizationprimitives tion programs[Powell 1991]. These are mutual so that the typesare not confusedin applicationsthat read exclusion locks (mutexes), condition variables, internalkemel datastructures.

'92 Summer USENIX - June 8-June L2, Lgg2- San Antonio, TX 13 BeyondMultiprocessing: Multithreading the SunOSKernel J. R. Eykholt,...

The semanticsof most of the synchronization andlow overheadfor simplecontention. primitives causethe calling thread to be prevented Spin mutexes are available as type from progressingpast the primitive until somecondi- MUTEx_sprN,which takes as its type-specificargu- tion is satisfied. The way in which further progress ment the interrupt level to be disabled while the is impeded(e.g., sleep, spin, or other)is a function mutex is held. It is rarely used,as adaptivemutexes of the initialization. By default, the kemel thread are more efficient,in general. synchronízationprimitivês that can logically block, canpotentially sleep. Device drivers are restricted to using type MUTEx_DRIvER,which takes a Sun-DDl-defined Some of the synchronizationprimitives are opaque value as an argument. This argument is strictly bracketing (e.g., the thread that loclcs a basically an intemrpt priority in the cunent imple- mutex must be the threadthat unloclæit) and a sin- mentation,and determineswhether the blockingpol- gle owner can be determined(i.e., mutexesand icy is adaptiveor spin, basedon whetherthe inter- writer locks). In these cases,the synchronization rupt priority is above the "thread level" (see primitives support the priority inheritanceprotocol, below). as describedin [Khanna1992]. A simple trick speedsup mutex_enter( ) Some synchronizationprimitives are intended for adaptivemutexes. Non-adaptivemutexes use a for situations where they may block for long or separateprimitive lock field in the mutex data struc- indeterminateperiods. Variants of some of the ture, with the lock field used by the adaptivetype primitives provided are (e.g.,cv_wait_sig( ) and always in the locked state. This is so that semaS_sig( that allow blocking to be inrer- )) mutex_enter ( ) can always attempt to apply an ruptedby a receptionof a signal. There is no non- adaptivelock first, and only if that fails, jump considerthe local to the head of the systemcall, as there possibilitythat the mutexmight be anothertype. was in the traditionalsleep routine. When a sig- nal is pending,the primitive returns with a value Turnstiles vs Queuesin SynchronizationObjects indicating the blocking was intemrptedby a signal Each synchronizationobject requiresa way of and the caller must releaseany resourcesand return. finding threadsthat are suspendedwaiting for that Mutual Exclusion Lock Implementation. object. It is important to keep the storagecost of synchronizationobjects small, becausemany system Mutual exclusionlocks (mutexes)prevent more structurescontain synchronizationobjects, so the than one th¡ead from proceedingwhen the lock is queueheader is not directly in the object. Instead, acquired. prevent They races on accessto shared two bytes in the synchronizationobject are used to dataand are far by the most heavily usedprimitive. frnd a urnstiJe structurecontaining the sleep queue Mutexes are usually held for short intervals. headerand priority inheritanceinformation [Khanna For example,it would not be good to hold a critical 19921.Tumstiles are preallocatedsuch that thereare system mutex while waiting for disk I/O to com- always more turnstilesthan the number of th¡eads plete. Mutexesare not recursive;the ownerof the active. lock cannot again call mutex_enter} for the same One altemativemethod would be to select the lock. If a thread holds a mutex, the same thread sleepqueue from an array using a hashfunction on must be the one to release the mutex. Theserules the addressof the synchronizationobject. This is are enforcedto promotegood programming practice essentiallythe approachused by sleepl in the and to avoid deadlocks. ¡ traditionalkernel. The tumstile approachis favored If mutu_enter cannotset the lock (becauseit is for more predictablereal-time behavior, since they alreadyset), The blocking action taken dependson are never shared by other locks, as hashedsleep the mutex type that was passedto mutex_init, and queuessometimes are. storedin the mutex. The defaultblocking policy for mutexes, called adaptive (type MUTEx_DEFÀuLT), Interrupts as Threads spins while the owner of the lock (recordedwhen Many implementations [Hamilton 1988] the lock is acquired)remains running on a processor. [Peacock 19921have a variety of synchronization This is doneby pollingthe owner'sstatus in the spin primitivesthat have similar semantics(e.g., mutual wait loopz. If the owner ceasesto run, the caller exclusion)yet explicitly sleep or spin for blocking. stopsspinning and sleeps3.This gives fast response For mutexes,the spin primitives must hold intemrpt priority high enough while the lock is held to @while inspectingthe owner's preventany interrupthandlers that may also use the statusduring the spin,the stateis determinedindirectly. synchronizationobject from interrupting while the The algorithmspins while the currenttbread pointer of object is locked, causing deadlock. The intemrpt any CPUpoints to the owningtluead, indicating it is level must be raisedbefore the lock is acquiredand running. then loweredafter the lock is released. rOn uniprocessors,this tu¡nsiuto alwayssleeping, since theowner cannot be running.

L4 Summer'92 USENIX- June8-June t2,1992 - SanAntonio, TX J. R. Eykholt,... BeyondMultiprocessing: Multithreading the SunOSKernel

This has several drawbacks.First, the raising on that processor.In thesesystems the kernelsyn- and lowering of interrupt priority can be an expen- chronizeswith interrupt handlersby blocking out sive operation, especially on architectures that interruptswhile in critical sections. require external interrupt controllers(remember that In SUnOS5.0, interruptsbehave like asynchro- mutexesare heavilyused). Secondly,in a modular nously createdthreads. Interruptsmust be effrcient, kernel,such as SunOS, many subsystemsare inter- so a full threadcreation for eachinterrupt is imprac- dependent.In severalcases (e.g., mappingin kernel tical. Instead, we preallocateinterrupt threads, memory or memory allocation) these requestscan alreadypartly initialized. TVhenan interruptoccurs, come from interrupt handlers and can involve many we do the minimum amount of work to move onto kernel subsystems.This in turn, means that the the stack of an interrupt thread, and set it as the mutexesused in many kernel subsystemsmust pro- currentthread. At this point, the interruptthread and tect themselvesat priority a relatively high from the the interruptedthread are not completelyseparated. possibility that they may be requiredby an interrupt The interrupt threadis not yet a full-fledgedthread handler. This tends to keep priority interrupr high (it cannotbe descheduled)and the interruptedthread for relativelylong periods andthe cost of raisingand is pinned until the interruptthread returns or blocks, lowering interrupt priority must paid be for every and cannot proceed on another CpU. When the mutex acquisition and release. Lastly, interrupt interrupt returns,we restore the state of the inter- handlersmust live in a constrainedenvironment that ruptedthread and return. avoids any use of kernel functions that can poten- tially sleep,even for shortperiods. Interruptsmay nest. An interrupt thread may itself be interruptedand be pinned by anotherinter- To avoid thesedrawbacks, the SunOS5.0 ker- rupt thread. nel treats most interruptsas asynchronouslycreated and dispatchedhigh-priority threads. This enables If an interruptthread blocks on a synchroniza- these interrupt handlersto sleep,if required,and to tion variable(e.g., mutex or conditionvariable), it usethe standardsynchronization primitives. saves state Qtassivates)to make it a full-fledged thread,capable of being run by any CpU,and then On most architecturesputting threadsto sleep returnsto the pinnedthread. Thusmost of the over- must be done in software. This must protected be headof creatinga full threadis only donewhen the from interruptsif interruptsare to sleep themselves interruptmust block, due to contentiona. or wakeup other threads. The restructuredkernel usesa primitive spin lock protectedby raisedprior- While an interruptthread is in progress,the ity to implementthis. This is one of a few bounded interruptlevel it is handling,and all lower-priority sectionsof codewhere interrupts are lockedout. interrupts,must be blocked. This is handledby the normal interrupt priority mechanismunless the Traditional UMx kernel implementations threadblocks. If it blocks, these interruptsmust 1989] 1986] also protect flæffler [Bach rhe remaindisabled in casethe interrupthandler is not dispatcherby locking out interrupts,usually all inter- reenterableat the point that it blocked or it is still rupts. The restructuredkernel has a modifiable level doing high-priorityprocessing (i.e., should not be (the "thread level") abovewhich interrupts are no interruptedby lower-prioritywork). lVhile it is longerhandled as threadsand are treatedmore like blocked,the interruptthread is boundto the proces- non-portable (e.g., "firmware" simulatingDMA via sor it startedon as an implementationconvenience programmedI/O). Theseinterrupt handlers can only and to guaranteethat therewill alwaysbe an inter- synchronizeusing the spin variants of mutex locks rupt thread available when an interrupt occurs and softwareinterrupts. If the "thread level" is set (thoughthis may changein the future). A flag is set to the maximum priority, then all interrupts are in the cpu structureindicating that an interruptat locked out during dispatching. For implementations that level has blocked,and the minimum interrupt where the cannottolerate "firmtvare" eventhe rela- level is noted. \Vhenever the interrupt level tively small dispatcherlockout time, the "thread changes,the CPU'sbase interrupt level is checked, level" can be lowered. Typicallythis is loweredto and the actual interrupt priority level is never the interrupt level at which the schedulingclock allowedto be belowthat. runs. There is also an interfacewhich allows an ImplementingInterrupts as Threads interrupt thread to continue as a normal, high- Previousversions of SunOShave treated inter- prioritythread. When release_interruptO is rupts in the traditionalUNrx way. When an inter- called,it savesthe stateof the the ninnedthread and rupt occurs the interruptedprocess is held captive clears the indicationthat the interruntthread has Qinned) until the interruptreturns. Typically, inter- blocked, allowing the CPU to lower the inrerrupr rupts are handledon the kernel stack of the inter- ruptedprocess or on a separateinterrupt stack. The ffi involvesflushing the interrupthandler must completeexecution and get enti¡e registerfile. This is only doneif the interrupthandle¡ off the stack before anything else is allowed to run sleeps,not duringinterrupt handling without contention.

'92 Summer USENIX - June 8-June 12, LggZ- San Antonio, TX 15 BeyondMultiprocessing: Multithreading the SUnOSKemel J. R. Eykholt, ... priority level. Kemel Locking Strateg/ An alternative approach to this is to use The locking approachused almost exclusively boundedfirst-level interrupthandlers to capturedev- in the kernelto ensuredata consistency is data-based ice stateand then wake up an interruptthread that is locking. That is, the mutex and readers/writerlocks waiting to do the remainderof the servicing[Barnett eachprotect a set of shareddata, as opposedto pro- L9921. This approach has the disadvantagesof tecting routines (monitors). Every piece of shared requiring device drivers to be restructuredand of datais protectedby a synchronizationobject. alwaysrequiring a full contextswitch to the second Someaspects of locking in the virtual memory, level thread. The. approachused in SunOS 5.0 file system, STREAMS,and device drivers allows full thread behavior without restructured have already been discussedin t9921. Here drivers and with very little additionalcost in the no- [Kleiman we'll elaboratea bit on devicedriver contentioncase. issues,as they areclosely related to interruptthreads. Interrupt ThreadCost Non-MT Driver Support The additional overheadin taking an interrupt Somedrivers haven't been modified to protect is about 40 SPARCinstructions. The savingsin the themselvesagainst concurrencyin a multithreaded mutexenter/exit path is about12 instructions.How- environment. These drivers are c¿lled MT-unsafe, ever, mutex operationsare much more frequentthan becausethey don't providetheir own locking. interrupts,so thereis a net gain in time cost, as long as interruptsdon't block too frequently.The work In order to provide some interim support for to convert an interrupt into a "real" thread is per- MT-unsafe drivers, we provided wrappers that formedonly when thereis lock contention. acquirea single global mutex, unsafe_driver. Thesewrappers insure that only one suchdriver will There is a cost in termsof memoryusage also. be active at any one time. This wrapperis illus- Currentlyan interruptthread is preallocatedfor each tratedby Figure5. potentially active interrupt level below the thread level for each CPUr. An additionalinterruot thread calls is preallocatedfor the clock (one per system). Since each thread requires a stack and a data structure, perhaps8K bytes or so, the memory cost can be locking wrapper high. signal However,it is unlikely that all interrupt levels longjmp are activeat any one time, so it is possibleto havea smaller pool of interrupt threadson each CPU and block all subsequentinterrupts below the thread level when the pool is empty, essentiallylimiting how many interruptsmay be simultaneouslyactive. ClockInterrupt The clock interruptó is handled specialty. There is only one clock interrupt threadin the sys- tem (not one per CPU), and the clock interrupt Figure 5: UnsafeDriver Wrapper handler invokes the clock thread only if it is not alreadyactive. There are several ways a driver may be entered,from the explicit points, The clock threadcould possiblybe delayedfor driver entry inter- rupts, more than one clock tick by blocking on a mutex or and call-backs. Each of these entries must by higher-levelinterrupts; When a clock tick occurs acquirethe unsafe_driver mutex if the driver and the clock threadis alreadyactive, the interrupt isn't safe. For example,if an MT-unsafedriver uses tÍmeout( to request is clearedand a counteris incremented.If the clock ) a function call at a later time, the threadfinds the counternon-zero before it returns,it callout structureis markedso that the will decrementthe counterand repeatthe clock pro- unsafe driver mutex will be held during the cessing.This occursvery rarelyìn practice. When functionãall. it occurs,it is usuallydue to heavyactivity at higher Ml-unsafe drivers can also use the old interruptlevels. It can also occurwhile debugging. sleep/wakeup mechanism. S1eep( ) safely releasesthe unsafe driver mutex after the threadis asleep,and reaiquiresit beforereturning. - The J.ongjmp( featureof sleep( is main- JThereare nine intemrptlevels on the Sun SPARC ) ) tained as well. When a thread is imllementationthat can potentially use threads. signalled in Ûfhisoccurs 100 times a secondon currentSun SpARC sleep( ), if it specifieda dispatchvalue greater implementations. than PZERO,a longjmp( ) takes the threadto a

'92 t6 Summer USENIX - June E.JuneL2, L992- San Antonio, TX J. R. Eykholt,... BeyondMultiprocessing: Multithreading the SunOSKernel

setjmp( ) that was performedin the unsafedriver similartool is describedin lKorty 1989]. entry wrapper,which returnsEINTR to the caller of DeadlockDetection the driver. A side-benefit of the priority inheritance Sleep( checksto make it is ) sure calledby mechanism[Khanna 1992], is thãt deadlockscaused an MÏ-unsafedriver, and panicsif it isn't. It isn't by hierarchyviolations are usually detectedat run safeto use sleep( ) from a driver which doesits time as well. It does a good job on mutexesand own locking. readers/writerlocks held for write, but since there It is fairly easyto provide at least simple lock- isn't a completelist of threadsholding a readlock, it ing for a driver, so almost all drivers in the system can't always find deadlocksinvolving readers/writer have some of their own locking. These drivers are locks. There are other deadlockspossible with con- calledWI-søfe, regardlessof how fine-grainedtheir ditionvariables; these aren't detected. locking is. Some developershave used the term MT-hot to indicate that a driver does fine-erained Summary locking. SunOS5.0 is a multithreadedand symmetric SYR4MPDKI Locking Primitives multiprocessorversion of the SVR4kernel. The pri- As we implementedour driver interfaces, maryfeatures are: Internationaland USL were defining the SVR4Mul- o Fully preemptible,real-time kernel tiprocessor Device Driver Interface and Driver- o High degree of concunency on symmetric Kernel Interface(DDI/DKI), with a different set of multiprocessors locking primitives,based around the traditionaluNx . Supportfor userthreads interrupt-blockingmodel. o Interruptshandled as independentthreads r Adaptivemutual-exclusion locks SUnOS5.0 implementsthose interfaces to the extent defined so far, using our locking primitives The threadmodels inside the kemeland at user and ignoring any spin semantics. This allows level are almost identical. The schedulingof kernel drivers using those interfacesto be more easily threads onto CPUs is analogousto the way the ported. SUnOSdrivers typically use the the SunOS threads library schedulesuser-level threads onto synchronizationprimitives. LWPs. The use of threadsfor structuringthe kernel has mostly good effects though they can be over- ImplementationTechnologr used. Th¡eadsdo havea cost. The stacksare large, and must be allocatedon separatepages Some interestingtechniques made it if protection easierto for potential get this all working. stack overrunis needed. Also, context switching is still expensive. Some things are still Kernel Time Slicing better implementedby callouts and other "zeÍo- Since the kernel is fully preemptiblewe were weight" processes,but th¡eadsprovide a nice struc- able to make kernel threadstime-slice. We simply turing paradigmfor the kernel. . addedcode to the clock intemrpt handlerto preempt whateverthread was interrupted.This allowseven a References uniprocessorto havealmosf arbitrary codeinterleav- [Bach1986] Maurice J. Bach, The Design of the ings. Increasingthe clock intemrpt rate made this UNIX , Prentice-Hall, Inc., even more valuablein finding windows where data EnglewoodCliffs, New Jersey,1.986. was unprotected. By causing kernel threads to [Barnett1992] David Barnett, Kernel Threads and preempt each other as often as possible we were their Perþrmance Benefits,Real Time, Vol. 4, able to find locking problemsusing uniprocessor No. 1., Lynx Real-Time Systems,Inc., Los hardware before multiprocessorhardware was avail- Gatos,CA-, 1992. able. Even when working multiprocessorhardware 1988]Graham Hamilton and Daniel S. anived, there were far more uniprocessorsavailable [Hamilton Conde,An ExperimentalSymmetric Multipro- than multiprocessors.We intend this only as a cessor Kerner, USENIX, Winter debuggingfeature, since it does have 1988, some adverse Dallas,Texas. performanceimpact, however slight. [Kleiman1992] S. Kleiman,J. Voll, J. Eykholt,A. Lock Hierarchy Violation Detection Shivalingiah,D. Williams, M. Smith, S. Bar- Insteadof establishinga systemlock hierarchy ton, and G. Skinner,Symmetric Multiprocessing a priori, we developeda static analysistool that in Solaris 2.0, COMPCONSpring 1992, ptïL, would checkfor lock orderingviolations in the sys- SanFrancisco, California. tem. This lint-like tool, called locknesf, reads C [Khanna1992] Sandeep Khanna, Michael Sebrée, sourcecode, constructscall graphsand reportson JohnZolnowsky, Realtime Scheduling in SunOS locking cycles. We feel it helped during early 5.0,USENIX, Winter 1992,San Francisco, Cali- implementationdebugging and probablyreduced the fornia. This describesthe real-time features amount of time spent debuggingdeadlocks. A andconsiderations in this kemel.

'92 Summer USENIX - June 8-June12, LggZ- SanAntonio, TX L7 BeyondMultiprocesslng: Multithreading the SunOSKernel J. R. Eykholt,...

[Korty 1989]Joe Korty, Sema:a Lint-Like Tool for He's been with Sun for the last four years. Reach AnalyzingSemaphore Usage in a Multithreaded him electronicallyat [email protected]. UNIX Kernel, USENIX Winter 1989, San Roger Faulkner is a Senior Staff Engineer in Diego, California. This describesa tool for the OS-Multithreading group at SunSoft. He doing static lock hierarchyanalysis. receiveda B.S. in Physicsfrom N. C. StateUniver- [Leffler 1989]Samuel J. Leffler, Marshall Kirk sity in 1963and a Ph.D.in Physicsfrom Princeton McKusick, Michael J. Karels, John S. Quarter- University in L967, then joined Bell Laboratories, man, The Design and Implementationof the where he was seducedby . He has been 4.3BSD UMx Operating System, Addison- actively involved in the inner workings of the wx Wesley,Reading, MA, L989. kernel since L976 and has done compiler and [PeacockL992]J. Kent Peacock,Sunil Saxena,Dean debuggerdevelopment along the way. He is one of Thomas,Fred Yang and Wilfred Yu, Experi- the principals involved in the developmentof the encesfrom MultithreadingSystem V Release4, /proc file system for SVR4. His E-mail addressis Symposiumon Experienceswith Distributed& [email protected] phone:(415) 336-1115. MultiprocessorSystems (seous) ilI, March Anil Shivalingiah is a Staff Engineer in the 1992,Newport Beach, California. OS.Virtual Memory group at SunSoft. He received 1991]M. Powell, [Powell L. S. R. Kleiman,S. Bar- an M.S. in Sciencefrom University of ton, D. Shah, D. Stein, M. Weeks, SunOS Texas,Arlington in L983 and a B.S. in Electronics Multi-thread Architecture, USEMX Winter Engineeringfrom UVCE,India in 1981. He's been 1991,Dallas, Texas. This describesthe archi- with Sun for the last threeyears. His E-mail address tecturefor user-levelmulti-threading. is [email protected]. 1992]D, Stein, D. Shah,Implementing Light- lstein Mark Smithgraduated with a B.S. in weight Threads,USENIX Summer 1992, San Computer Science from the University Antonio,Texas, pp. 1-10. This describesthe of California, Santa Barbara in 1,986. He implementationof the user-levelthreads pack- is currently a Member of age. TechnicalStaff at SunSoftin the OS-Multithreading group. Prior to coming to Sun he worked in the Author Information DesignAutomation department of Amdahl Corp. He canbe reachedby E-mailat [email protected]. JosephEykholt is a Senior Staff Engineerand technical leader in the OS-Multithreadinggroup at Dan Stein is curently a Member of Technical SunSoft.He receivedan MSEEfrom PurdueUniver- Staff at Sunsoftwhere he is one of the developersof sity in 1978,and a BSEEfrom Pu¡dueinL977. Prior the SunOSMulti+hread A¡chitecture. He graduated from to coming to Sun, he was one of the leading the University of Wisconsinin 1981with a BS developers of multiprocessing features for the in ComputerScience. Amdahl UTS system,and a logic designerfor the Jim Voll works in the OS-Multithreadinggroup Amdahl580 CPU. His addressis SunSoft,Inc., M/S at SunSoft. He receivedhis B.S. from the Univer- MIV5-40, 2550 Garcia Avenue, Mountain View, sity of California,Santa Barbara in 1981. Prior to CA, 94043. His E-mail address is working at Sun he has worked at Amdahl and [email protected] phone:(415) 336-1849. Cygnet Systems.He routinely destroys his home [email protected]. Steve Kleiman is a DistinguishedEngineer in directory.His E-mailaddress is the OperatingSystems Technology Departmentof Mary Weeks has been a memberof technical SunSoft. He is currentlyarchitect of Multi+hreading staff at Sun Microsystemsince 1986. Prior to Sun, in SunOS. He received an M.S. in Electrical sheworked at Xerox. She receivedher B.A. in com- Engineeringfrom StanfordUniversity in 1978 and a puter sciencefrom the University of California at B.S. in ElectricalEngineering and ComputerScience Berkeleyin 1984. from M.LT in 1977. He hasbeen involved with the Dock Williams is a Staff Engineerin the OS- design and developmentof uN¡x and Multithreadinggroup at SunSoft.He receiveda S.B. architecturesince 1.977; first at Bell Telephone in Electrical Engineering and Computer Science Laboratoriesand then at Sun. He was one of the from M.LT. in 1980. He hasbeen with Sun for over developersof NFS,Vnodes, and of the originalport six years. Prior to joining Sun, he worked at Ameri- of SUnOS to SPARC. His E-mail address is can InformationSystems, ONYX Systems,Tri-Comp [email protected] phone: (41.5)336-7295. Systems,and HughesAircraft Radar Systems. His SteveBarton graduatedfrom the University of E-mailaddress is [email protected],phone: (4L5) California,Santa Cruz in 1982 with a BA in Com- 336-1246. puter and Information Sciences. Since then he has worked at Zilog Inc., Parallel Computers,Counter- Point Computers, and Telestream Corp. He is currently a Member of Technical Staff at Sunsoft.

'92 18 Summer USENIX - June 8.June12,1992 - SanAntonio, TX