File System Multithreading in System V Release 4 MP

J. Kent Peacock - Intel Multiprocessor Consortium

ABSTRACT

An Intel-sponsored Consortium of computer companies has developed a multiprocessor version of System V Release 4 (SVR4 MP) which has been released by UNIX System Laboratories. The Consortium's goal was to add fine-grained locking to SVR4 with minimal change to the kernel, and with complete backward compatibility for user programs. To do this, a locking strategy was developed which complemented, rather than replaced, existing UNIX synchronization mechanisms. To multithread the file systems, some general locking strategies were developed and applied to the generic Virtual File System (VFS) and vnode interfaces. Of particular interest were the disk-based S5 and UFS file system types, especially with respect to their scalability. Contention points were found and successively eliminated to the point where the file systems were found to be disk-bound. In particular, several file system caches were restructured using a low-contention, highly-scalable approach called a Software Set-Associative Cache. This technique reduced the measured locking contention of each of these caches from the 10-15% range to less than 0.1%. A number of experimental changes to disk queue sorting algorithms were attempted to reduce the disk bottleneck, with limited success. However, these experiments provided the following insight into the need for balance between I/O and CPU utilization in the system: that attempting to increase CPU utilization to show higher parallelism could actually lower system throughput. Using the GAEDE benchmark with a sufficient number of disks configured, the kernel was found to obtain throughput scalability of 88% of the theoretical maximum on 5 processors.

Introduction

The goal of the Consortium assembled by Intel was to produce a multiprocessor version of System V Release 4 in as short a time as possible. As such, there was no inclination to perform a radical restructuring of the kernel, nor to support user-level threads while the evolution of a consensus threads standard was incomplete.

There were two main performance goals for this effort: binary performance running the GAEDE benchmark [7] with respect to the uniprocessor system should degrade by no more than 5%; and the proportional increase of throughput should be at least 85% of the first processor for each processor brought online, up to 6 processors. The performance numbers were arrived at through negotiations with UNIX System Laboratories as their acceptance criteria. The 6 processor limit arose from considerations of how well the implementation was expected to scale, namely between 8 and 16 processors, and, more importantly, of how large a system was likely to be available for testing.

Previous Work

Many attempts have been made to adapt UNIX to run on multiprocessor machines. Companies such as AT&T, Encore, Sequent, NCR, DEC, Silicon Graphics, Solbourne and Corollary have all offered multiprocessor UNIX systems, though not all have described the changes made to the operating system in the literature. Bach [2] describes a multiprocessor version of System V released by AT&T. Encore has described several generations of their multithreading effort, first on MACH and then on OSF/1, in a series of papers [4, 11, 12, 15]. This work has the greatest similarity to that reported here, so an attempt has been made to make relevant comparisons to it throughout the paper. DEC has also published papers on their approach to multithreading their BSD-derived ULTRIX system [10, 21]. Ruane's paper on Amdahl's UTS multiprocessing kernel is the best precedent to the philosophy of the approach used by the Consortium [20]. Lately, NCR has discussed their parallelization efforts on System V Release 4 [5, 6]. As NCR has participated as a member of the Intel Consortium, this work influenced the Consortium's approach.

Locking Model

Multiprocessor locking implementations can be divided roughly into two camps: those who replaced the traditional sleep-wakeup UNIX synchronization with semaphores or other locking primitives, and those who opted to retain the sleep-wakeup synchronization model and add mutual exclusion around the appropriate critical sections.

Bach's [2] implementation is of the first type, with existing synchronization mechanisms replaced by Dijkstra semaphores. Ruane's paper [20] represents a sort of canonical description of the type of locking described most often by previous multithreaders using the mutual exclusion approach. The paper gives a very good discussion of some subtle synchronization issues relative to a pre-SVR4 environment, as well as some sound arguments against using Dijkstra semaphores.

Although a thorough description of the Consortium locking protocols is beyond the scope of this paper, there are a number of key features of the locking primitives which should be noted. Firstly, mutual exclusion locks are implemented so that any mutex locks held by a process are automatically released and reacquired across a context switch. This is an imitation of the implicit uniprocessor locking achieved by holding the processor in a non-preemptive kernel. It is necessary for proper sleep-wakeup operation for a lock protecting a sleep condition to be released after the sleeping process has established itself on the sleep queue. Most previous efforts have enhanced the sleep function to release a single mutex lock, usually specified as an extra argument to the sleep call. With automatic releasing of locks across context switches, sleep calls need not be changed. This means that much of the multithreading effort merely involves adding mutex locks around sections of code, even though they may sleep. Comparisons of many multithreaded files with the originals show this to be literally the only change required. This is an important feature when it is necessary to occasionally upgrade to ongoing releases of the underlying uniprocessor system.

The ability to release all of the locks acquired in the call stack relates to another feature of the locks, namely that recursive locking of each individual lock is allowed. SVR4 is designed in such a way that there are a number of object-oriented interfaces between sub-systems of the kernel, derived from SunOS [8]. These subsystems call back and forth to one another and create surprisingly deep recursive call stacks. (This happens particularly between virtual memory and file system modules.) Though providing considerable modularity, this feature makes it very difficult to establish assertions about which mutex locks might be held coming into any given kernel function. Allowing lock recursion and the automatic release of mutex locks allows a given function to deal only with its own locking requirements, without having to worry about which locks its callers hold, aside from deadlock considerations.

The primitives were also designed to be able to configure whether the caller spins or gives up the processor when the lock is not available. Locks acquired by interrupt routines must spin, whereas locks of the same class which might deadlock one another can avoid deadlock by sleeping. The deadlock avoidance comes from the fact that held locks are released when a lock requester sleeps. In addition, mutex locks may be configured as shared/exclusive locks, allowing multiple reader locks or a single writer lock to be held.

For purposes of the following discussion, it is useful to carefully define the difference between a resource lock and a mutex lock. Logically, a resource lock is a lock which protects a resource even when the locker is not running on a CPU. A mutex lock, on the other hand, only protects a resource when the locker is running on a CPU. Stated another way, a resource lock may be held across context switches, while a mutex lock would not be. A resource lock can be an actual locking primitive, such as a semaphore [2], but need not be. Implementing a resource lock as locking data protected by a mutex has been shown by Ruane to provide greater flexibility in locking than the semaphore approach [20]. It should be noted that there are resource locks already present in the uniprocessor code, for example, the B_BUSY flag bit in a disk buffer cache header or the ILOCKED flag in an inode structure. Protecting manipulations of these resource locks with mutexes is usually sufficient to generalize them to work on a multiprocessor. In fact, all of the instances of a given type of resource lock can be protected using a single mutex lock rather than a lock per instance, typically with very low contention on the mutex.

More detailed descriptions of the locking primitives and more detailed arguments in favor of using them over semaphores can be found elsewhere [5, 17].

File Systems Locking

The SVR4 file system implementation centers around two separate object-oriented interfaces which originated in SunOS [9]. One is the Virtual File System (VFS) interface, which consists of functions which perform file system operations, for example, mounting and unmounting. The other is the virtual node (vnode) interface, which allows operations on instances of files which reside in a virtual file system.

Most of the file system multithreading effort was spent developing a general multithreading model for these two interfaces. Although there are around 10 file system types in SVR4, only the two disk-based file systems are discussed: the UNIX File System (UFS) type, which is a derivative of the Berkeley Fast File System [13] via SunOS, and the S5 type, which is derived from the System V Release 3 file system. The strategies for multithreading these two file systems were basically the same.
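Both strategies lean on the resource-lock-over-mutex pattern described in the previous section. Although the paper does not show the Consortium primitives themselves, that pattern can be sketched in a few lines of C. All names below (mutex_enter, sleep, wakeup, B_BUSY and the structure layouts) are stand-ins for illustration, not the actual SVR4 MP interfaces; the point is only that the flag bit is the resource lock that survives a context switch, while the mutex merely makes testing and setting the flag atomic on a multiprocessor.

    /* A minimal sketch of a resource lock built from a flag bit and a
       mutex.  One mutex can guard every instance of this lock type. */
    struct mutex { int m_lock; };

    extern struct mutex bufcache_mutex;     /* hypothetical global mutex */
    extern void mutex_enter(struct mutex *);
    extern void mutex_exit(struct mutex *);
    extern void sleep(void *chan);          /* held mutexes are released
                                               and reacquired around this */
    extern void wakeup(void *chan);

    #define B_BUSY 0x1

    struct buf_hdr { int b_flags; };        /* B_BUSY is the resource lock */

    void
    buf_lock(struct buf_hdr *bp)
    {
        mutex_enter(&bufcache_mutex);
        while (bp->b_flags & B_BUSY)
            sleep(bp);                      /* mutex dropped while asleep */
        bp->b_flags |= B_BUSY;              /* held across blocking I/O */
        mutex_exit(&bufcache_mutex);
    }

    void
    buf_unlock(struct buf_hdr *bp)
    {
        mutex_enter(&bufcache_mutex);
        bp->b_flags &= ~B_BUSY;
        mutex_exit(&bufcache_mutex);
        wakeup(bp);
    }

Because the sleep call itself releases any held mutexes, the uniprocessor sleep interface needs no extra lock argument, which is the property the Consortium design exploits.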


VFS Locking

The VFS layer of the file system provides the support for an object-oriented interface to the various file system types available. This support includes maintaining the list of mounted file systems and resolving races between file system unmounts and pathname searches into a file system being unmounted. The multithreading of the VFS layer centers around a shared/exclusive mutual exclusion lock which serializes access to mount points during pathname lookup, mount and unmount operations. This lock protects the existing uniprocessor resource lock on each vfs. This approach seems to be essentially the same as that used by Encore in their multithreading of the Mach vnode-based file system [11].

Vnode Locking

The vnode layer of the kernel implements another object-oriented interface for operations on each file within a virtual file system of a given type. The focal point of these operations is the vnode structure, which is usually contained within a VFS-dependent node structure for each active file element in the system. The object-oriented interface consists of a number of macros prefixed with "VOP_" which call through a VFS-dependent function code table. With only a few exceptions, these macro calls represent the only path into the file system code. The interface was designed to support a set of operations which were atomic and mostly stateless, in order to support a stateless network file system, namely NFS [9]. In actual implementation, local file systems do retain some state necessary to implement normal UNIX file system semantics. For example, in a UFS or S5 filesystem, a VOP_LOOKUP operation, which does one stage of a pathname lookup, leaves the found file or directory in a locked state on return.

The granularity of locking desired at the vnode level is the individual vnode, which suggests a locking strategy whereby a lock on a vnode is obtained and released around almost every VOP call. Using this strategy provides a blanket level of essentially automatic vnode locking for most of the VOP functions.

Inode Locking

There are two main aspects to file system locking inside UFS and S5: locking of a given file inode while some operation is performed on it, and the locking of the file system inode cache during the lookup or freeing of an inode. Inode cache locking is discussed in a later section devoted to cache contention.

The per-inode lock is a resource lock which can remain held across blocking operations, such as disk I/O. The original uniprocessor implementation of the inode locking uses two separate lock flag bits in the inode, ILOCKED and IRWLOCKED. (Pre-SVR4 systems had only the ILOCKED flag.) The ILOCKED flag locks the inode during most operations which would access or change the contents of the inode itself, whereas the IRWLOCKED flag makes read or write operations atomic. This allows a long read or write operation to proceed without blocking a stat operation, for example. These locks are set and cleared by the ILOCK - IUNLOCK and IRWLOCK - IRWUNLOCK macro pairs. Both the ILOCKED and IRWLOCKED bits can be held at the same time by two different processes. In a multiprocessor, the processes must be prevented from running at the same time for correct emulation of the uniprocessor semantics. In order to accomplish this, the vnode lock must be held concurrently with each of the two resource lock flags. This is done by acquiring the vnode lock from within ILOCK and IRWLOCK and releasing it within IUNLOCK and IRWUNLOCK. The effect of this on return from a VOP function which does an ILOCK without an IUNLOCK, due to lock recursion, is to leave the vnode locked outside of the VOP function. For example, the previously mentioned VOP_LOOKUP function returns with both ILOCKED set and the vnode lock held on the found element or file. If the process does a context switch after the return, the vnode lock is released, but the resource lock, embodied in the flag bit, is not. Hence, the integrity of the inode is still protected, although a process holding the IRWLOCKED lock could run after acquiring the vnode lock. This behavior is the same as in the uniprocessor system. When the next VOP call is made which does the matching IUNLOCK on the inode, the extra vnode lock on the file is released, and the vnode is unlocked completely on return from the call.
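The coupling of the vnode lock with the ILOCKED flag can be sketched as follows. The macro bodies are a guess at the shape of the code, not the SVR4 MP source; VN_LOCK and VN_UNLOCK stand in for the recursive Consortium vnode mutex, which is automatically released and reacquired across a context switch.

    #define ILOCKED 0x1

    struct vnode;                       /* opaque here */
    struct inode {
        int           i_flag;          /* ILOCKED is the resource lock */
        struct vnode *i_vnode;
    };

    extern void VN_LOCK(struct vnode *);    /* recursive mutex, stand-in */
    extern void VN_UNLOCK(struct vnode *);
    extern void sleep(void *chan);
    extern void wakeup(void *chan);

    /* Acquire the vnode mutex together with the ILOCKED flag ... */
    #define ILOCK(ip) \
        do { \
            VN_LOCK((ip)->i_vnode); \
            while ((ip)->i_flag & ILOCKED) \
                sleep(ip); \
            (ip)->i_flag |= ILOCKED; \
        } while (0)

    /* ... and release both in the matching IUNLOCK. */
    #define IUNLOCK(ip) \
        do { \
            (ip)->i_flag &= ~ILOCKED; \
            wakeup(ip); \
            VN_UNLOCK((ip)->i_vnode); \
        } while (0)

An ILOCK without a matching IUNLOCK inside a VOP function thus leaves one extra recursive hold on the vnode mutex at the VOP boundary, which is exactly the behavior described above for VOP_LOOKUP.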

This locking approach represents a different philosophy from the OSF/1 file system [12], where, by design, no file system locks can be held on return from a VOP function. Since the uniprocessor model is preserved by our strategy, no additional races are introduced such that an error path could cause the lock not to be released. The only way for the vnode lock not to be released is for the inode lock not to be released, which would also qualify as a uniprocessor bug. Having the vnode lock visible at the VOP interface also allows the lock to be held around multiple VOP calls, which is made possible by the recursive property of the Consortium mutex locks. This is used in several cases to avoid a race with file locking on a vnode.

Because there is a lock on each vnode, the potential for deadlock exists when it is necessary to lock more than one vnode, as during pathname searches or some STREAMS operations. The solution to this problem is to have the lock sleep when it waits for a busy vnode lock. As previously discussed, this removes the deadlock possibility because all of the locks held by the locker are released. The one problem that this causes is that potential context-switches are introduced which were not present before the vnode locking was added. One solution to this is to widen the scope of the vnode locking so that the preemption happens in a safe place. Locking around the VOP calls has been found to be sufficiently wide for almost all problem situations, since callers should conservatively assume that any VOP call might sleep. Additional vnode locking which is bound with inode lock manipulations is also safe, because the inode locking itself may sleep. The only place where the vnode lock's possible sleeping causes a problem is in the iget function where the inode has been found by a cache lookup. A sleep to wait for the vnode lock could allow the identity of the inode to change. The solution is simply to recheck after the vnode is locked that the inode obtained still matches the identity of the one searched for.

When an inode is heavily shared by many processes, the processes tend to queue up sleeping to wait for the inode lock. This represents an example of what has been called the thundering herd problem [5]. When an inode lock is released, the normal wakeup function makes all of the processes sleeping on the inode runnable. In a multiprocessor, this results in all but one of the processes finding the lock busy and going back to sleep. This takes O(N^2) CPU time to process N sleepers. To alleviate this, a wake1proc version of wakeup has been used to awaken only one process when an inode lock is released. The process awakened using this approach has to assume more responsibility: if the process no longer wants the inode lock, it must do another wake1proc to pass the lock to another waiting process.
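A wake1proc of the sort described above might look like the following; the single sleep queue and the setrun call are invented for illustration (a real kernel hashes sleepers across many queues).

    struct proc {
        struct proc *p_link;      /* sleep queue link */
        void        *p_wchan;     /* channel being slept on */
    };

    extern struct proc *sleepq;   /* one sleep queue, for simplicity */
    extern void setrun(struct proc *);

    /* Like wakeup(), but make only the first matching sleeper runnable,
       so N lock waiters cost O(N) total wakeups instead of O(N^2). */
    void
    wake1proc(void *chan)
    {
        struct proc **pp, *p;

        for (pp = &sleepq; (p = *pp) != 0; pp = &p->p_link) {
            if (p->p_wchan == chan) {
                *pp = p->p_link;      /* unlink this sleeper only */
                setrun(p);
                return;
            }
        }
    }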
VN_HOLD/VN_RELE Locking

Vnodes represent shared resources in most cases, and can hence be referenced by a number of different processes in a completely asynchronous fashion. When a new pointer reference to a vnode is created, a reference count in the vnode is incremented by doing a VN_HOLD (which uses an atomic add operation in the multiprocessor case). When the caller is done with the vnode, it releases the reference by doing a VN_RELE operation. The VN_RELE does an atomic decrement of the reference count and, if it becomes zero, calls VOP_INACTIVE, which does a file system dependent cleanup operation on the vnode.

In some file systems the vnode continues to exist in a quiescent state after the VOP_INACTIVE, such that it can be reclaimed by a future lookup operation. The problem with this is that the zero count state is used to signify the quiescent state in these file systems. But the count goes to zero before the inactivation is performed, so there is a serious race between the lookup operation and the inactivation. It is possible for another process to find the vnode and do a VN_HOLD followed by a VN_RELE before the inactivation is performed, resulting in two inactivation attempts. To solve this problem it is necessary to change the file systems not to use the zero count as the inactive indicator. For example, in the S5 and UFS file systems, the count is decremented again to -1 under the inode cache lookup lock to truly indicate the inactive state. On a lookup, which is also done holding the cache lookup lock, when the count for an inode is -1, the inode is reactivated and the count set back to zero before doing a VN_HOLD to count the new reference. This solution makes the state transitions between active and inactive states safe, while allowing the use of atomic operations for VN_HOLD and VN_RELE.

Encore describes using a per-vnode spin lock around the VN_HOLD/VN_RELE increment and decrement operations [11], but they do not reveal their solution to the above race condition.
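The count-to-minus-one protocol can be sketched as below. The -1 inactive state and the use of the inode cache lookup lock are from the paper; the function and field names, and the exact shape of the recheck, are illustrative assumptions.

    struct mutex { int m_lock; };
    struct vnode { int v_count; };

    extern struct mutex icache_lock;        /* inode cache lookup lock */
    extern void mutex_enter(struct mutex *);
    extern void mutex_exit(struct mutex *);
    extern int  atomic_add(int *, int);     /* returns the new value */
    extern void VOP_INACTIVE(struct vnode *);

    void
    VN_RELE(struct vnode *vp)
    {
        if (atomic_add(&vp->v_count, -1) > 0)
            return;                          /* still referenced */

        mutex_enter(&icache_lock);
        if (vp->v_count == 0) {              /* nobody raced in a hold */
            vp->v_count = -1;                /* -1, not 0, means inactive */
            mutex_exit(&icache_lock);
            VOP_INACTIVE(vp);
        } else
            mutex_exit(&icache_lock);
    }

    /* Under icache_lock, a lookup that finds v_count == -1 reactivates
       the inode before counting the new reference. */
    void
    icache_reactivate(struct vnode *vp)
    {
        vp->v_count = 0;
        atomic_add(&vp->v_count, 1);         /* the VN_HOLD */
    }

With this arrangement a racing VN_HOLD/VN_RELE pair simply finds the count already at -1 and skips the second inactivation, while the common fast paths remain single atomic operations.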

Performance Tuning

A number of useful tools were available to the SVR4 MP developers [6, 17]. These tools, together with techniques very similar to those described by Paciorek, et al. [15], allowed reasonably accurate characterization of the locking contention and scalability of the system and the detection of deadlocks.

Cache Contention

A number of different caches are used within all UNIX variants to enhance system performance. Logically, such caches are collections of objects each tagged with an identifier. Typical operations perform some actions on an individual cache object. These actions can often be quite self-contained and very scalable, since their locality is constrained. However, once an operation enters the domain of the cache itself, it can collide with other operations on different objects in the cache. Hence, attention is naturally drawn to these caches as places where processor contention can occur and needs to be relieved.

Most of these caches have a similar structure, as illustrated in Figure 1. In Figure 1, the hash queues are an array of queue head structures, indexed by computing a hash function of an object identifier during a cache operation. Each square box represents an element of the cache, and those that are not marked "Busy" are chained in least-recently-used (LRU) order on a free list. To aid in discussing locking strategies for this organization it is useful to consider the three main operations: lookup, release and flush.

A cache lookup operation scans the appropriate hash queue for the given object identifier. If an element matching the identifier is found and is not busy, it is removed from the free list, marked busy and returned to the caller. If the matching element is busy, then the caller waits for it to be released by the current owner and then rescans the hash queue to make sure the identity of the matched element has not changed. If the identifier is not found, an element is removed from the front of the free list and its current hash list and replaced on the hash queue of the searched-for identifier. The element is marked busy, its contents are then changed to reflect the new identity and it is returned to the caller. The release operation of a busy cache element puts a busy element back onto the free list and clears the busy mark.

[Figure 1: General Cache Organization]

The flush operation runs periodically to clean elements on the free list so that they can be reused immediately when a cache miss occurs. The flush operation scans the list for items that need cleaning, removes them from the free list, cleans them and returns them to the free list when done. As an example, the buffer cache cleaning involves queueing modified disk blocks to be written to the disk. A very detailed description of this structure relative to the disk buffer cache can be found in Bach [1].

The "intuitively obvious" way to lock this structure involves the use of three types of locks: a lock on each cache element, a lock for each hash queue and a lock to protect the free list, as shown in Figure 2. The dotted and dashed lines enclose the data structures protected by each of the hash queue and free list locks (the cache element locks are not shown). This locking strategy was used by Encore in their Mach/4.3BSD buffer cache locking, and has a number of deadlock and performance difficulties [4]. In terms of overhead, this locking requires obtaining at least three locks per lookup or release operation, including the per-element lock. A cache lookup miss may require obtaining two arbitrary hash queue locks, thus requiring a deadlock avoidance strategy to be implemented. More importantly, the free list lock is obtained for every operation, so it represents the limiting factor for the scalability of the cache, as discovered and reported by Encore.

[Figure 2: Separate Hash Queue and Free List Locks]

A simplification of this approach lowers the locking overhead without sacrificing scalability (assuming short average hash queue size) by using only the single free list lock to protect all of the data structures, as shown in Figure 3. An early version of the Consortium locking for the buffer cache included a per-element lock along with the free list lock. It was noted that an element lock was almost always obtained while holding the free list lock, and was hence redundant. Removing the element lock reduced lock overhead by 50% and lowered contention also. Of course, the element lock can not be removed if it is a semaphore resource lock which replaces the busy mark on the cache element, as in the semaphore locking approach. In the Consortium case, the element lock was a mutex which only protected manipulations of fields within the data structure.

[Figure 3: Single Lock for All Cache Data Structures]

Unfortunately, the single lock still represents a contention point when the access rate to a cache is high. A solution to this problem is to fragment the free list into segments which are associated with each hash list and use a lock on each hash list to protect both the hash queue and the free list, as shown in Figure 4. Any time a free element is required, it is allocated from the free list associated with the hash queue being searched, so that each cache element is permanently bound to a fixed hash queue. Because of this, each hash queue is typically initialized to have a configurable, constant number of elements. This cache organization was dubbed a Software Set-Associative Cache (SSAC) due to its logical resemblance to a hardware set-associative cache, and is described in detail elsewhere [17].
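In outline, a lookup in such a cache touches exactly one lock, as the following sketch shows. The 4-way rows, identifier layout and locking primitives are illustrative assumptions rather than the SVR4 MP data structures, and the case where every entry in a row is busy is reduced to a failure return that the caller would handle by waiting.

    #define NROW  128                   /* number of hash rows */
    #define NWAY  4                     /* elements bound to each row */

    struct ssac_ent {
        long tag;                       /* object identifier */
        int  busy;
        long lastuse;                   /* for LRU victim selection */
    };

    struct ssac_row {
        int             lock;           /* one spin lock per row */
        struct ssac_ent way[NWAY];
    };

    extern void spin_lock(int *);
    extern void spin_unlock(int *);
    extern long lbolt;                  /* clock tick counter */

    static struct ssac_row cache[NROW];

    struct ssac_ent *
    ssac_lookup(long tag)
    {
        struct ssac_row *row = &cache[tag % NROW];
        struct ssac_ent *ep, *victim = 0;
        int i;

        spin_lock(&row->lock);
        for (i = 0; i < NWAY; i++) {
            ep = &row->way[i];
            if (!ep->busy && ep->tag == tag) {          /* hit */
                ep->busy = 1;
                ep->lastuse = lbolt;
                spin_unlock(&row->lock);
                return ep;
            }
            if (!ep->busy && (victim == 0 || ep->lastuse < victim->lastuse))
                victim = ep;            /* LRU free element in this row */
        }
        if (victim != 0) {              /* miss: reuse within the row */
            victim->tag = tag;
            victim->busy = 1;
            victim->lastuse = lbolt;
        }
        spin_unlock(&row->lock);
        return victim;                  /* 0 if every way was busy */
    }

Because an element can never migrate to another row, a miss never takes a second lock, which is what removes the global free list from every operation.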


The SSAC represents the key technique used in multithreading the file system caches to obtain practically unlimited scalability. The description is extended here to show the application of the technique to a number of different caches and illustrate the problems that arose.

[Figure 4: Software Set-Associative Cache Locking]

A follow-on paper by Encore describing parallelization of a vnode-based file system [11] does not refer to the free list contention problem. The buffer cache locking is described as consisting of a lock on each hash queue, plus a lock inside each buffer. From this description it is not possible to determine how or even if they have solved the free list contention problem. The problem is likely moot to them, however, as their stated intention was ultimately to replace the buffer cache entirely with the Mach "No Buffer Cache" code from CMU.

File System Caches

While tuning the file systems, three SVR4 file system caches showed up as having significant levels of cache contention. The first of these caches is the segmap cache, which is used to map file pages into windows where they can be copied during read and write operations. The segmap cache replaces the buffer cache in SVR4 for most data read and write operations. The second cache is the buffer cache (a shadow of its former self), which is used to hold file system structural information, such as blocks of inodes. The last cache is the directory name lookup cache (DNLC), which is a cache of pathname component translations. The segmap cache exhibited 10-15% contention with only two processors running cache-bound file system I/O, while the buffer and DNLC caches had similar contention when the GAEDE and AIM 3 benchmarks were run on 5 CPUs. The SSAC technique was applied to these caches, with slightly different structures in each case. The locking contention for all of the caches was reduced to less than .1%. The different structures lead to some small problems which highlight the properties of the approach.

Segmap Cache

The segmap cache was changed first, as it represented the largest bottleneck. In this case, the hash queue and free list pointers in the cache structure were kept. The hash and free queues for each hash queue were set up at initialization to contain 4 cache elements each. The hash queue manipulations were simplified, since elements are never moved from one hash chain to another. This code change was very straightforward and took less than one day to accomplish. No side effects or difficulties were apparent after the change, and the contention was reduced as described. Segmap cache elements are not locked for exclusive use on return from the lookup operation, but a reference count is incremented. Hence, more than one process may actually use a cache entry. To protect these processes from one another, the hash queue lock for the queue containing an element is locked while fields in the cache element are manipulated. This avoids the necessity for a separate mutex lock to protect the data contents within the cache. Contention for this per-element locking is included in the less than .1% contention measurement.

Disk Buffer Cache

The application of the SSAC structure to the disk buffer cache proved to be somewhat more interesting and challenging due to some features of the SVR4 buffer cache code. Memory for the cache headers and buffers is not statically allocated at initialization, as it was in SVR3. Rather, the free list of headers grows as needed and the buffers themselves are dynamically allocated and freed to accommodate different buffer sizes. This made it difficult to configure a set of fixed-length buffer chains on the hash queues, so another approach was needed. In addition, the cache hashing function was found to be very poor. Most kernel hash queue arrays are configured to have 2^n entries so that the cheaper bit-mask operation (tag & (2^n - 1)) can be used instead of the modulus operation to fold the object identifier tag into a hash index. If the object identifiers are separated by a power of two, then only a subset of the hash queues are actually ever used. In the case of the buffer cache, a UFS file system with 4-KByte blocks generates object identifiers which are usually multiples of 8 (when converted to 512-byte physical block numbers), thus using only every 8th hash queue. This problem can be alleviated by using only 2^n - 1 entries, which requires the more expensive modulus operation, but gives a better hashing distribution. (As another example of this, process structures were always allocated on 512-byte boundaries, so any sleep on the address of any proc structure always queued the sleeper on sleep hash queue 0.)
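The effect of the power-of-two stride on the two table sizes is easy to demonstrate outside the kernel. The following small program (an illustration, with arbitrary table sizes) counts how many queues each hashing scheme actually touches for identifiers spaced 8 apart:

    #include <stdio.h>

    #define POW2  64                /* 2^6 hash queues   */
    #define ODD   63                /* 2^6 - 1 hash queues */

    int
    main(void)
    {
        int used_pow2[POW2] = {0}, used_odd[ODD] = {0};
        int i, n2 = 0, nodd = 0;
        long tag;

        /* 4-KByte blocks as 512-byte block numbers: multiples of 8. */
        for (tag = 0; tag < 8192; tag += 8) {
            used_pow2[tag & (POW2 - 1)]++;   /* cheap bit mask */
            used_odd[tag % ODD]++;           /* costlier modulus */
        }
        for (i = 0; i < POW2; i++)
            n2 += (used_pow2[i] != 0);
        for (i = 0; i < ODD; i++)
            nodd += (used_odd[i] != 0);

        printf("2^n table:   %d of %d queues used\n", n2, POW2);
        printf("2^n-1 table: %d of %d queues used\n", nodd, ODD);
        return 0;
    }

With the mask, only 8 of the 64 queues are ever used; with the modulus, all 63 are, since 8 and 63 are relatively prime.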

One of the desirable properties of a cache is that the hash queues remain fairly short, to reduce the search time required for a lookup. A traditional target has been to keep the average size of each hash queue at around 4 buffers. When the free list for each hash queue was configured to be this size, it was discovered that the buffer cache code contains an inherent deadlock. Certain operations require the use of 2 or more buffers, which can deadlock when all of the buffers on a free list are used up by such operations. This deadlock could not be eliminated without radical restructuring of the code. So, to make its occurrence less likely, larger free lists were configured and shared among a number of hash queues, with a single mutex lock per free list. Each hash queue is bound at initialization to a particular free list, and buffers do move between hash queues, but only within those bound to the same free list. This approach provided the contention reduction desired, while increasing the probability of the deadlock only marginally.

Directory Name Lookup Cache

The DNLC cache changes were another variation on the SSAC theme. All of the free list and hash queue pointers were removed and the cache structured as a 2-dimensional array, with a vector of simple spin locks to lock each row of the cache. To maintain the LRU ordering of each cache line, a timestamp was added to each DNLC structure indicating the last time that the structure was accessed, with a 0 timestamp indicating a busy cache entry. When a cache search is attempted, the entry with the lowest non-zero timestamp is remembered and reused if the name is entered into the cache.

The semantics of this particular cache have some unusual features, most of which do not affect the SSAC structure. The lookup and enter operations are split into two distinct operations, where they are normally combined. A number of operations purge entries from the cache based on different characteristics, such as vnode identity, file system identity, just any old entry or the entire cache. The purge operation actually removes the entry, rather than cleaning it as in other flush-type operations. Because of this, a little more care needs to be taken in deciding which entries should be purged when merely reclaiming space. In most of the purge operations, the entire cache has to be scanned and all elements satisfying the search criterion must be removed. The division of free list locks in this case is an advantage, because each row of the cache is locked individually as the cache is searched, allowing other operations to proceed in parallel. On the other hand, the multiple locks do not allow the state of the entire cache to remain frozen during an operation. (Although this property does not present a problem in normal operation, a debugging function which counts all of the occurrences of a given vnode in the cache is no longer reliable.)

Since there is no single LRU free list, purging an element when any one will do has to be done carefully. The approach of traversing each cache row processing each element in turn works very well in the cleaning case (for example, the buffer cache), where the cleaned element is not reused immediately. In the purging case, it is a poor approximation to selecting a least-recently-used entry to purge. A better LRU approximation for the purging case is to select the least-recently-used entry from a cache row, purge it and then move onto the next row in the cache, cycling through the cache.

A potential problem was noticed in the hashing function for this cache. The same entry can be entered into the cache with different user permissions (credentials), so an often-used directory entry could be in the cache many times with different credentials for different users, all hashed onto the same queue. This could cause thrashing on that queue. The solution to the problem is to use the credential pointer in the hash calculation to distribute the occurrences of the name to a number of different hash queues. This illustrates a general property of the SSAC organization, which is that the goodness of the hashing function becomes more important when relatively short, fixed-sized cache rows are defined. In general, the shorter each row is, the faster the search can be, but the probability of thrashing within a row increases. Experience from hardware caches, where 4-way set associativity is considered adequate, suggests that 4 is a reasonable starting value for tuning the hash queue length.

The DNLC cache also has some other interesting behavior in its interactions with the UFS and S5 file systems. When a vnode is entered into the DNLC cache, it has a VN_HOLD placed on it. Quite often, the last reference to a UFS or S5 file is from the DNLC cache. Because of this, the iget function, which attempts to find a given file in the inode cache, calls the DNLC purge-any function when there are no more free inodes in the cache. This purging represents a hit-or-miss effort which is repeated until one of the file system's inodes whose last reference is from the DNLC cache is purged. If one is hit, the purge function calls VN_RELE to decrement the vnode reference count to zero, which then calls VOP_INACTIVE to release its inode. In the S5 file system, this is where the fun began: if the file being inactivated was unlinked, then the inactivate routine called the ifree function to release the inode number back to the inode free list. If the call to iget was for a file being allocated (from ialloc), then the ifree function would deadlock trying to obtain the non-recursive mutex lock on the file system structure, which was held by ialloc. The solution to the problem was to release the mutex lock in ialloc before the call to iget. Unfortunately, this allowed a race where attempts could be made to allocate the same inode more than once, which had to be dealt with by detecting the already-allocated inode after return from iget. This example is cited to illustrate the (often unexpected) recursive flavor of the SVR4 file system code.
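The row organization and timestamp-based LRU described above might be sketched as follows; the structure layout, the row hash and the primitives are assumptions for illustration. Note how the credential pointer participates in the row hash, per the fix described above.

    #define DROWS 256
    #define DWAYS 4

    struct vnode;
    struct cred;

    struct ncache {
        long          nc_time;    /* last access; 0 marks a busy entry */
        long          nc_key;     /* folded name, directory, credential */
        struct vnode *nc_vp;
    };

    struct ncrow {
        int           ncr_lock;   /* simple spin lock for the whole row */
        struct ncache ncr_ent[DWAYS];
    };

    extern void spin_lock(int *);
    extern void spin_unlock(int *);
    extern long lbolt;

    static struct ncrow dnlc[DROWS];

    /* Folding the credential pointer in spreads one name cached under
       many credentials across several rows instead of thrashing one. */
    #define DNLC_ROW(namehash, dvp, cr) \
        (((namehash) + (long)(dvp) + (long)(cr)) % DROWS)

    struct vnode *
    dnlc_lookup_row(int rowidx, long key)
    {
        struct ncrow *row = &dnlc[rowidx];
        struct ncache *ep, *lru = 0;
        struct vnode *vp = 0;
        int i;

        spin_lock(&row->ncr_lock);
        for (i = 0; i < DWAYS; i++) {
            ep = &row->ncr_ent[i];
            if (ep->nc_time != 0 && ep->nc_key == key) {
                ep->nc_time = lbolt;            /* refresh LRU stamp */
                vp = ep->nc_vp;
                break;
            }
            if (ep->nc_time != 0 &&
                (lru == 0 || ep->nc_time < lru->nc_time))
                lru = ep;       /* remembered; a dnlc_enter() would
                                   reuse this oldest entry on a miss */
        }
        spin_unlock(&row->ncr_lock);
        return vp;              /* 0 on a miss */
    }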


The Encore approach to locking the DNLC cache is the combination of hash and free list locks shown in Figure 2 [11]. As such, each operation acquires between 1 and 4 locks to accomplish an enter operation, whereas the SSAC approach always acquires only 1 row lock, with lower contention than in the Encore case due to the fragmented free lists.

Inode Caches

The inode caches in the UFS and S5 file systems are additional candidates for the SSAC multithreading approach. However, a single lock has been used for the entire collection of inode hash queues and the inode free list in each file system. This strategy was chosen due to an insufficient level of measured contention. It could be argued that the lack of contention arises from the relative infrequency of file open or create operations relative to others, such as reading or writing. Also, the DNLC cache removes some of the inode lookup load caused by pathname searches. The Encore locking approach is the same as that used for their other caches, namely the use of a lock for each hash queue as well as a free list lock.

Disk Queue Tuning

Once the contention on the three caches was removed, a 2-CPU system with a single disk became essentially I/O-bound running the GAEDE benchmark. To investigate the cause of this, some performance measuring was added to the kernel which reported utilizations of all processors and the disk controller, as well as the size of the disk queue. These measurements were collected at each clock tick and averaged over a 1-second interval. Watching these statistics in real time revealed that the system was alternating between periods of processor saturation with little or no disk activity, and disk saturation with little or no processor activity. The disk saturation occurred when the disk queue size rose to several hundred requests in length. Unfortunately, it is theoretically possible in SVR4 for the size of the disk queue to be bounded only by the number of allocatable memory pages in the system.

The blocking of the CPUs during periods when the disk queues were very large was due to the fact that processes were blocked on synchronous I/O which was generated to update inode contents and directories. As an experiment, all of the synchronous inode update operations were changed to delayed writes to decouple them from the disk driver. This change improved the CPU scalability substantially, at the cost of a less consistent file system. On a two-processor system, the GAEDE benchmark could be transformed from mostly disk-bound to mostly CPU-bound with this change. (A kernel variable which turned this feature on or off was cynically dubbed the "benchmark flag", with the suggestion that at least some vendors actually have such a flag.) McVoy [14] proposed doing something similar to this, but with better file system consistency, by forcing the correct ordering of asynchronous operations to the disk for directory and inode updates. The change described here was not intended primarily as a file system enhancement, but rather as a tool to make the benchmark more CPU-bound to uncover locking contention in the file systems.

Another approach, which has been used successfully to increase write throughput of the buffer cache in SVR3 [16], was to queue synchronous requests near the front of the disk queue. This change gave only minimal benefit. It was found that requests were often queued asynchronously, but then a process would later decide that it wanted the data block. This suggested an approach whereby requests which were being waited for were dynamically promoted to the front of the disk queue, to allow waiting processes to become active sooner. Surprisingly, this change caused overall throughput of the benchmark to decline dramatically, because the disk queue never actually became empty, but increased monotonically in size.

It appears from this that balance among utilizations of resources is more important than optimizing the utilization of one class of resource, namely the CPUs. The segments of idle CPU time seem to be necessary to restore balance by lowering the effective request rate to the disk, allowing it to clear the accumulated backlog.

The real problem here is that there is a "high wall" between the disk driver and the upper level file system layer. Once an asynchronous write request is passed to the disk driver, there is no current way to retrieve it until the operation is complete. This represents wasted I/O activity, since a process that waits for such a request is likely to modify the data just written. Similarly, the disk can be idle for long periods of time because the cache flushing code which could give the driver some work to do does not know that the disk is idle. When it does give it work to do, it tends to flood the disk queue with requests, which overloads the driver and slows or idles the CPUs.

Fixing these problems to even out the flow of cache copy-back traffic to the disk is one part of the solution to this problem. The other part of the solution is to reduce the write requirements of the file systems themselves, possibly by the use of log-structured file systems such as that implemented for the Sprite project [19].
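For concreteness, the promotion experiment might have looked like the following; the queue layout and names are invented, and a real driver would hold its own lock around the manipulation. A buffer that a process decides to wait on is pulled out of its sorted position and pushed to the head:

    struct buf {
        struct buf *av_forw;        /* disk queue link */
    };

    struct diskq { struct buf *dq_head; };

    /* Promote bp to the front of the (singly-linked) disk queue. */
    void
    disk_promote(struct diskq *dq, struct buf *bp)
    {
        struct buf **pp;

        for (pp = &dq->dq_head; *pp != 0; pp = &(*pp)->av_forw) {
            if (*pp == bp) {
                *pp = bp->av_forw;          /* unlink */
                bp->av_forw = dq->dq_head;  /* push on front */
                dq->dq_head = bp;
                return;
            }
        }
    }

As the text reports, this "obvious" optimization backfired: with waited-for requests always served first, the backlog of asynchronous requests never drained and the queue grew without bound.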

Related Work

The closest work to this effort was done by Encore to multithread their vnode-based Mach file system and the OSF/1 file system [11, 12]. Their approach was actually quite different in a number of ways.

The most significant difference is that they chose to lock the file systems below the vnode layer; that is, there is no generic locking outside the file system. The SVR4 MP approach was the exact opposite of this. The locking is provided in a generic sense across the vnode operation interface, with the vnode lock available as the mutual exclusion lock for all of the per-vnode data, including file system dependent data. The Encore approach replaced the inode ILOCKED flag with read/write resource locks, which extended the file semantics to allow true concurrency for file reading. The rationale for this extension was that some files are read frequently on a large system, and that performance could be enhanced by reducing lock contention on these files. During Consortium tuning, this type of contention was not encountered, probably due to our configurations being smaller in size than Encore's. Another advantage of locking below the vnode interface is that the file system writer has the flexibility to determine whether a read/write or simple mutex lock should be used, whereas the Consortium approach allows only mutex locking. On the other hand, the Consortium approach involves less semantic (i.e. code) change, particularly since it is permissible for a vnode lock to be held around more than one VOP call when that is required. This was important due to the time-to-market and robustness constraints that the project faced.

Encore avoids contention by releasing the resource locks across any blocking operations, which allows a race between two simultaneous lookups of the same inode. This race requires a recheck of the hash queue whenever a new inode is read from disk. In the OSF/1 file system, the recheck is avoided by timestamping data structures, such as a hash queue, when they change and only rechecking a queue if its timestamp changes across a blocking operation. Much of the paper on OSF/1 describes the races introduced into directory operations by more permissive locking and how they were solved using timestamps. In the Consortium approach, the original resource locking semantics are preserved, keeping the ILOCKED or IRWLOCKED flags locked across blocking operations, even though the vnode mutex locks are released.

Another clear difference can be seen in the general cache locking strategy, which has been commented on throughout the paper. Practically all of their described cache locking fits the hash queue lock + free list lock + element lock model, as shown in Figure 2. The Consortium locking uses the single lock model shown in Figure 3, modified to use SSAC locking (Figure 4) where contention is too high.

Summary

The SVR4 MP effort resulted in a system which largely met its goals. The degradation relative to a uniprocessor SVR4 kernel is very close to the 5% target, and the throughput scalability at 5 processors is 88% of the theoretical maximum. This is computed as 5-CPU Throughput / (5 x 1-CPU Throughput). Figure 5 is a graph which shows the actual increase in throughput as a function of the number of processors, as well as the ideal scalability. To achieve the goal given the disk-bound behavior described previously, it was necessary to configure 8 separate disk drives on 4 disk controllers in the benchmark system (with the "benchmark flag" turned off).

The file system multithreading effort resulted in a model which provides a lot of generic support for locking within each file system type. The vnode locking around VOP functions provides default locking protection for most file system functions, making individual file system multithreading easier.

Significant contention points in the file system caches were relieved by the application of the Software Set-Associative Cache structure. Once these significant locking bottlenecks were relieved, the file systems were found to be inherently disk bound by our benchmarks. Some disk queueing modifications were tried, with limited success and one dramatic failure. However, these modifications lead to insights about the importance of maintaining balance to achieve optimal throughput, rather than attempting to optimize the utilization of a single resource class.

[Figure 5: Scalability of the Gaede Benchmark: throughput (scripts/hour) versus number of processors, actual and ideal]

The Sprite Logging File System is reportedly an order of magnitude more efficient at using disks than existing file systems [19], running at CPU saturation while consuming only 17% of disk bandwidth in a test which is similar to GAEDE. This being the opposite of our situation, the marriage of multiprocessor technology with log-based file systems appears to be quite desirable.


SVR4 MP is available from UNIX System Laboratories as a source-code upgrade to System V Release 4. More information can be obtained by calling 1-800-828-UNIX.

Acknowledgments

This project has been a pleasure for us to be involved with, as most of the architecture team had at least two generations of multiprocessor design experience. A large number of people have contributed to the success of SVR4 MP, most having written code: Mike Abbott, Sunil Bopardikar, Fiorenzo Catteneo, Calvin Chou, Ho Chen, Ben Cuny, Jane Ha, Jim Hanko, Mohan Krishnan, John Litvin, Mani Mahalingam, Arun Maheshwari, Cliff Neighbors, Sandeep Nijhawan, Mark Nudelman, Lisa Repka, Sunil Saxena, K. M. Sherif, Moyee Siu, John Slice, Dean Thomas, Vijaya Verma, Fred Yang and Wilfred Yu. Special thanks are due to Wilfred Yu for the performance numbers quoted here. Thanks are also due to the reviewers for their helpful suggestions for improvements to this paper.

References

[1] M. J. Bach. The Design of the UNIX Operating System. Prentice-Hall, Englewood Cliffs, NJ, 1986, pp. 38-59.

[2] M. Bach and S. Buroff. Multiprocessor UNIX Operating Systems. AT&T Bell Laboratories Technical Journal, 64:1733-1749, October 1984.

[3] R. Barkley and T. P. Lee. A Dynamic File System Inode Allocation and Reclaim Strategy. Proceedings of the Winter 1990 USENIX Conference.

[4] J. Boykin and A. Langerman. The Parallelization of Mach/4.3BSD: Design Philosophy and Performance Analysis. Proceedings of the USENIX Workshop on Experiences with Distributed and Multiprocessor Systems, pp. 105-126, October 1989. Also appeared as: Mach/4.3BSD: A Conservative Approach to Parallelization. Computing Systems, Vol. 3, No. 1, USENIX, Winter 1990.

[5] M. Campbell, R. Barton, J. Browning, D. Cervenka, B. Curry, T. Davis, T. Edmonds, R. Holt, J. Slice, T. Smith and R. Wescott. The Parallelization of UNIX System V Release 4.0. Proceedings of the Winter 1991 USENIX Conference.

[6] M. Campbell, R. Holt and J. Slice. Lock Granularity Tuning Mechanisms in SVR4/MP. Proceedings of the Second Symposium on Experiences with Distributed and Multiprocessor Systems (SEDMS II). USENIX, March 1991.

[7] S. Gaede. A Scaling Technique for Comparing Interactive System Capacities. Proceedings of the Conference of CMG XIII, December 1982.

[8] R. Gingell, J. Moran and W. Shannon. Virtual Memory Architecture in SunOS. Proceedings of the Summer 1987 USENIX Conference.

[9] S. Kleiman. Vnodes: An Architecture for Multiple File Systems in Sun UNIX. Proceedings of the Summer 1986 USENIX Conference.

[10] G. Hamilton and D. Conde. An Experimental Symmetric Multiprocessor ULTRIX Kernel. Proceedings of the Winter 1988 USENIX Conference.

[11] A. Langerman, J. Boykin and S. LoVerso. A Highly-Parallelized Mach-based Vnode Filesystem. Proceedings of the Winter 1990 USENIX Conference.

[12] S. LoVerso, N. Paciorek, A. Langerman and G. Feinberg. The OSF/1 UNIX File System (UFS). Proceedings of the Winter 1991 USENIX Conference.

[13] M. K. McKusick, W. Joy, S. Leffler and R. Fabry. A Fast File System for UNIX. ACM Transactions on Computer Systems, Vol. 2, No. 3, pp. 181-197, 1984.

[14] L. W. McVoy and S. R. Kleiman. Extent-like Performance from a UNIX File System. Proceedings of the Winter 1991 USENIX Conference.

[15] N. Paciorek, S. LoVerso and A. Langerman. Debugging Multiprocessor Operating System Kernels. Proceedings of the Second Symposium on Experiences with Distributed and Multiprocessor Systems (SEDMS II). USENIX, March 1991.

[16] K. Peacock. The Counterpoint Fast File System. Proceedings of the Winter 1988 USENIX Conference.

[17] J. K. Peacock, S. Saxena, D. Thomas, F. Yang and W. Yu. Experiences from Multithreading System V Release 4. Proceedings of the Third Symposium on Experiences with Distributed and Multiprocessor Systems (SEDMS III). USENIX, March 1992.

[18] R. Rodriguez, M. Koehler, L. Palmer and R. Palmer. A Dynamic UNIX Operating System. Proceedings of the Summer 1988 USENIX Conference.

[19] M. Rosenblum and J. K. Ousterhout. The Design and Implementation of a Log-Structured File System. Proceedings of the Thirteenth ACM Symposium on Operating Systems Principles. ACM SIGOPS, Vol. 25, No. 5, October 1991.

[20] L. M. Ruane. Process Synchronization in the UTS Kernel. Computing Systems, Vol. 3, No. 3, USENIX, Summer 1990.


[21] V. Sinkewicz. A Strategy for SMP ULTRIX. Proceedings of the Summer 1988 USENIX Conference.

Author Information

Kent Peacock has worked at Intel as a Consultant for 2 years as a member of the Intel Multiprocessor Consortium. Prior to that, he worked at Acer/Counterpoint developing a multiprocessor implementation of System V and the Acer/Counterpoint Fast File System, which is now part of SCO UNIX [16]. He has worked on the design and implementation of 5 multiprocessor systems since 1978, and has dabbled in performance tuning, C compilers, multiprocessor debugging tools and graphics applications. He graduated from the University of Waterloo in Ontario, Canada with a Ph.D. in Computer Science in 1979, having previously completed a Master of Mathematics in Computer Science in 1975. In 1974, he graduated from the University of Manitoba, in Winnipeg, Canada, with a Bachelor of Science in Electrical Engineering. He can be reached via U.S. Mail at 1747 Fanwood Ct.; San Jose, CA 95133 and electronically at [email protected].
