File System Multithreading in System V Release 4 MP

J. Kent Peacock - Intel Multiprocessor Consortium

ABSTRACT

An Intel-sponsored Consortium of computer companies has developed a multiprocessor version of System V Release 4 (SVR4 MP) which has been released by UNIX System Laboratories. The Consortium's goal was to add fine-grained locking to SVR4 with minimal change to the kernel, and with complete backward compatibility for user programs. To do this, a locking strategy was developed which complemented, rather than replaced, existing UNIX synchronization mechanisms. To multithread the file systems, some general locking strategies were developed and applied to the generic Virtual File System (VFS) and vnode interfaces. Of particular interest were the disk-based S5 and UFS file system types, especially with respect to their scalability. Contention points were found and successively eliminated to the point where the file systems were found to be disk-bound. In particular, several file system caches were restructured using a low-contention, highly-scalable approach called a Software Set-Associative Cache. This technique reduced the measured locking contention of each of these caches from the 10-15% range to less than 0.1%. A number of experimental changes to disk queue sorting algorithms were attempted to reduce the disk bottleneck, with limited success. However, these experiments provided the following insight into the need for balance between I/O and CPU utilization in the system: that attempting to increase CPU utilization to show higher parallelism could actually lower system throughput. Using the GAEDE benchmark with a sufficient number of disks configured, the kernel was found to obtain throughput scalability of 88% of the theoretical maximum on 5 processors.

Introduction

The goal of the Consortium assembled by Intel was to produce a multiprocessor version of System V Release 4 in as short a time as possible. As such, there was no inclination to perform a radical restructuring of the kernel, nor to support user-level threads while the evolution of a consensus threads standard was incomplete.

There were two main performance goals for this effort: binary performance running the GAEDE benchmark [7] with respect to the uniprocessor system should degrade by no more than 5%; and the proportional increase of throughput should be at least 85% of the first processor for each processor brought online, up to 6 processors. The performance numbers were arrived at through negotiations with UNIX System Laboratories as their acceptance criteria. The 6 processor limit arose from considerations of how well the implementation was expected to scale, namely between 8 and 16 processors, and, more importantly, of how large a system was likely to be available for testing.

Previous Work

Many attempts have been made to adapt UNIX to run on multiprocessor machines. Companies such as AT&T, Encore, Sequent, NCR, DEC, Silicon Graphics, Solbourne and Corollary have all offered multiprocessor UNIX systems, though not all have described the changes made to the operating system in the literature. Bach [2] describes a multiprocessor version of System V released by AT&T. Encore has described several generations of their multithreading effort, first on MACH and then on OSF/1, in a series of papers [4, 11, 12, 15]. This work has the greatest similarity to that reported here, so an attempt has been made to make relevant comparisons to it throughout the paper. DEC has also published papers on their approach to multithreading their BSD-derived ULTRIX system [10, 21]. Ruane's paper on Amdahl's UTS multiprocessing kernel is the best precedent to the philosophy of the approach used by the Consortium [20]. Lately, NCR has discussed their parallelization efforts on System V Release 4 [5, 6]. As NCR has participated as a member of the Intel Consortium, this work influenced the Consortium's approach.

Locking Model

Multiprocessor locking implementations can be divided roughly into two camps: those who replaced the traditional sleep-wakeup UNIX synchronization with semaphores or other locking primitives, and those who opted to retain the sleep-wakeup synchronization model and add mutual exclusion around the appropriate critical sections.

Bach's [2] implementation is of the first type, with existing synchronization mechanisms replaced by Dijkstra semaphores. Ruane's paper [20] represents a sort of canonical description of the type of locking described most often by previous multithreaders using the mutual exclusion approach. The paper gives a very good discussion of some subtle synchronization issues relative to a pre-SVR4 environment, as well as some sound arguments against using Dijkstra semaphores.

Although a thorough description of the Consortium locking protocols is beyond the scope of this paper, there are a number of key features of the locking primitives which should be noted. Firstly, mutual exclusion locks are implemented so that any mutex locks held by a process are automatically released and reacquired across a context switch. This is an imitation of the implicit uniprocessor locking achieved by holding the processor in a non-preemptive kernel. It is necessary for proper sleep-wakeup operation for a lock protecting a sleep condition to be released after the sleeping process has established itself on the sleep queue. Most previous efforts have enhanced the sleep function to release a single mutex lock, usually specified as an extra argument to the sleep call. With automatic releasing of locks across context switches, sleep calls need not be changed. This means that much of the multithreading effort merely involves adding mutex locks around sections of code, even though they may sleep. Comparisons of many multithreaded files with the originals show this to be literally the only change required. This is an important feature when it is necessary to occasionally upgrade to ongoing releases of the underlying uniprocessor system.

The ability to release all of the locks acquired in the call stack relates to another feature of the locks, namely that recursive locking of each individual lock is allowed. SVR4 is designed in such a way that there are a number of object-oriented interfaces between sub-systems of the kernel, derived from SunOS [8]. These subsystems call back and forth to one another and create surprisingly deep recursive call stacks. (This happens particularly between virtual memory and file system modules.) Though providing considerable modularity, this feature makes it very difficult to establish assertions about which mutex locks might be held coming into any given kernel function. Allowing lock recursion and the automatic release of mutex locks allows a given function to deal only with its own locking requirements, without having to worry about which locks its callers hold, aside from deadlock considerations.

The primitives were also designed to be able to configure whether the caller spins or gives up the processor when the lock is not available. Locks acquired by interrupt routines must spin, whereas locks of the same class which might deadlock one another can avoid deadlock by sleeping. The deadlock avoidance comes from the fact that held locks are released when a lock requester sleeps. In addition, mutex locks may be configured as shared/exclusive locks, allowing multiple reader locks or a single writer lock to be held.

For purposes of the following discussion, it is useful to carefully define the difference between a resource lock and a mutex lock. Logically, a resource lock is a lock which protects a resource even when the locker is not running on a CPU. A mutex lock, on the other hand, only protects a resource when the locker is running on a CPU. Stated another way, a resource lock may be held across context switches, while a mutex lock would not be. A resource lock can be an actual locking primitive, such as a semaphore [2], but need not be. Implementing a resource lock as locking data protected by a mutex has been shown by Ruane to provide greater flexibility in locking than the semaphore approach [20]. It should be noted that there are resource locks already present in the uniprocessor code, for example, the B_BUSY flag bit in a disk buffer cache header or the ILOCKED flag in an inode structure. Protecting manipulations of these resource locks with mutexes is usually sufficient to generalize them to work on a multiprocessor. In fact, all of the instances of a given type of resource lock can be protected using a single mutex lock rather than a lock per instance, typically with very low contention on the mutex.

More detailed descriptions of the locking primitives and more detailed arguments in favor of using them over semaphores can be found elsewhere [5, 17].

File Systems Locking

The SVR4 file system implementation centers around two separate object-oriented interfaces which originated in SunOS [9]. One is the Virtual File System (VFS) interface, which consists of functions which perform file system operations, for example, mounting and unmounting. The other is the virtual node (vnode) interface, which allows operations on instances of files which reside in a virtual file system.

Most of the file system multithreading effort was spent developing a general multithreading model for these two interfaces. Although there are around 10 file system types in SVR4, only the two disk-based file systems are discussed: the UNIX File System (UFS) type, which is a derivative of the Berkeley Fast File System [13] via SunOS, and the S5 type, which is derived from the System V Release 3 file system. The strategies for multithreading these two file systems were basically the same.
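Both strategies lean on the resource-lock-over-mutex pattern described in the previous section. Although the paper does not show the Consortium primitives themselves, that pattern can be sketched in a few lines of C. All names below (mutex_enter, sleep, wakeup, B_BUSY and the structure layouts) are stand-ins for illustration, not the actual SVR4 MP interfaces; the point is only that the flag bit is the resource lock that survives a context switch, while the mutex merely makes testing and setting the flag atomic on a multiprocessor.

    /* A minimal sketch of a resource lock built from a flag bit and a
       mutex.  One mutex can guard every instance of this lock type. */
    struct mutex { int m_lock; };

    extern struct mutex bufcache_mutex;     /* hypothetical global mutex */
    extern void mutex_enter(struct mutex *);
    extern void mutex_exit(struct mutex *);
    extern void sleep(void *chan);          /* held mutexes are released
                                               and reacquired around this */
    extern void wakeup(void *chan);

    #define B_BUSY 0x1

    struct buf_hdr { int b_flags; };        /* B_BUSY is the resource lock */

    void
    buf_lock(struct buf_hdr *bp)
    {
        mutex_enter(&bufcache_mutex);
        while (bp->b_flags & B_BUSY)
            sleep(bp);                      /* mutex dropped while asleep */
        bp->b_flags |= B_BUSY;              /* held across blocking I/O */
        mutex_exit(&bufcache_mutex);
    }

    void
    buf_unlock(struct buf_hdr *bp)
    {
        mutex_enter(&bufcache_mutex);
        bp->b_flags &= ~B_BUSY;
        mutex_exit(&bufcache_mutex);
        wakeup(bp);
    }

Because the sleep call itself releases any held mutexes, the uniprocessor sleep interface needs no extra lock argument, which is the property the Consortium design exploits.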


VFS Locking

The VFS layer of the file system provides the support for an object-oriented interface to the various file system types available. This support includes maintaining the list of mounted file systems and resolving races between file system unmounts and pathname searches into a file system being unmounted. The multithreading of the VFS layer centers around a shared/exclusive mutual exclusion lock which serializes access to mount points during pathname lookup, mount and unmount operations. This lock protects the existing uniprocessor resource lock on each vfs. This approach seems to be essentially the same as that used by Encore in their multithreading of the Mach vnode-based file system [11].

Vnode Locking

The vnode layer of the kernel implements another object-oriented interface for operations on each file within a virtual file system of a given type. The focal point of these operations is the vnode structure, which is usually contained within a VFS-dependent node structure for each active file element in the system. The object-oriented interface consists of a number of macros prefixed with "VOP_" which call through a VFS-dependent function code table. With only a few exceptions, these macro calls represent the only path into the file system code. The interface was designed to support a set of operations which were atomic and mostly stateless, in order to support a stateless network file system, namely NFS [9]. In actual implementation, local file systems do retain some state necessary to implement normal UNIX file system semantics. For example, in a UFS or S5 filesystem, a VOP_LOOKUP operation, which does one stage of a pathname lookup, leaves the found file or directory in a locked state on return.

The granularity of locking desired at the vnode level is the individual vnode, which suggests a locking strategy whereby a lock on a vnode is obtained and released around almost every VOP call. Using this strategy provides a blanket level of essentially automatic vnode locking for most of the VOP functions.

Inode Locking

There are two main aspects to file system locking inside UFS and S5: locking of a given file inode while some operation is performed on it, and the locking of the file system inode cache during the lookup or freeing of an inode. Inode cache locking is discussed in a later section devoted to cache contention.

The per-inode lock is a resource lock which can remain held across blocking operations, such as disk I/O. The original uniprocessor implementation of the inode locking uses two separate lock flag bits in the inode, ILOCKED and IRWLOCKED. (Pre-SVR4 systems had only the ILOCKED flag.) The ILOCKED flag locks the inode during most operations which would access or change the contents of the inode itself, whereas the IRWLOCKED flag makes read or write operations atomic. This allows a long read or write operation to proceed without blocking a stat operation, for example. These locks are set and cleared by the ILOCK - IUNLOCK and IRWLOCK - IRWUNLOCK macro pairs. Both the ILOCKED and IRWLOCKED bits can be held at the same time by two different processes. In a multiprocessor, the processes must be prevented from running at the same time for correct emulation of the uniprocessor semantics. In order to accomplish this, the vnode lock must be held concurrently with each of the two resource lock flags. This is done by acquiring the vnode lock from within ILOCK and IRWLOCK and releasing it within IUNLOCK and IRWUNLOCK. The effect of this on return from a VOP function which does an ILOCK without an IUNLOCK, due to lock recursion, is to leave the vnode locked outside of the VOP function. For example, the previously mentioned VOP_LOOKUP function returns with both ILOCKED set and the vnode lock held on the found element or file. If the process does a context switch after the return, the vnode lock is released, but the resource lock, embodied in the flag bit, is not. Hence, the integrity of the inode is still protected, although a process holding the IRWLOCKED lock could run after acquiring the vnode lock. This behavior is the same as in the uniprocessor system. When the next VOP call is made which does the matching IUNLOCK on the inode, the extra vnode lock on the file is released, and the vnode is unlocked completely on return from the call.
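The coupling of the vnode lock with the ILOCKED flag can be sketched as follows. The macro bodies are a guess at the shape of the code, not the SVR4 MP source; VN_LOCK and VN_UNLOCK stand in for the recursive Consortium vnode mutex, which is automatically released and reacquired across a context switch.

    #define ILOCKED 0x1

    struct vnode;                       /* opaque here */
    struct inode {
        int           i_flag;          /* ILOCKED is the resource lock */
        struct vnode *i_vnode;
    };

    extern void VN_LOCK(struct vnode *);    /* recursive mutex, stand-in */
    extern void VN_UNLOCK(struct vnode *);
    extern void sleep(void *chan);
    extern void wakeup(void *chan);

    /* Acquire the vnode mutex together with the ILOCKED flag ... */
    #define ILOCK(ip) \
        do { \
            VN_LOCK((ip)->i_vnode); \
            while ((ip)->i_flag & ILOCKED) \
                sleep(ip); \
            (ip)->i_flag |= ILOCKED; \
        } while (0)

    /* ... and release both in the matching IUNLOCK. */
    #define IUNLOCK(ip) \
        do { \
            (ip)->i_flag &= ~ILOCKED; \
            wakeup(ip); \
            VN_UNLOCK((ip)->i_vnode); \
        } while (0)

An ILOCK without a matching IUNLOCK inside a VOP function thus leaves one extra recursive hold on the vnode mutex at the VOP boundary, which is exactly the behavior described above for VOP_LOOKUP.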

This locking approach represents a different philosophy from the OSF/1 file system [12], where, by design, no file system locks can be held on return from a VOP function. Since the uniprocessor model is preserved by our strategy, no additional races are introduced such that an error path could cause the lock not to be released. The only way for the vnode lock not to be released is for the inode lock not to be released, which would also qualify as a uniprocessor bug. Having the vnode lock visible at the VOP interface also allows the lock to be held around multiple VOP calls, which is made possible by the recursive property of the Consortium mutex locks. This is used in several cases to avoid a race with file locking on a vnode.

Because there is a lock on each vnode, the potential for deadlock exists when it is necessary to lock more than one vnode, as during pathname searches or some STREAMS operations. The solution to this problem is to have the lock sleep when it waits for a busy vnode lock. As previously discussed, this removes the deadlock possibility because all of the locks held by the locker are released. The one problem that this causes is that potential context-switches are introduced which were not present before the vnode locking was added. One solution to this is to widen the scope of the vnode locking so that the preemption happens in a safe place. Locking around the VOP calls has been found to be sufficiently wide for almost all problem situations, since callers should conservatively assume that any VOP call might sleep. Additional vnode locking which is bound with inode lock manipulations is also safe, because the inode locking itself may sleep. The only place where the vnode lock's possible sleeping causes a problem is in the iget function where the inode has been found by a cache lookup. A sleep to wait for the vnode lock could allow the identity of the inode to change. The solution is simply to recheck after the vnode is locked that the inode obtained still matches the identity of the one searched for.

When an inode is heavily shared by many processes, the processes tend to queue up sleeping to wait for the inode lock. This represents an example of what has been called the thundering herd problem [5]. When an inode lock is released, the normal wakeup function makes all of the processes sleeping on the inode runnable. In a multiprocessor, this results in all but one of the processes finding the lock busy and going back to sleep. This takes O(N^2) CPU time to process N sleepers. To alleviate this, a wake1proc version of wakeup has been used to awaken only one process when an inode lock is released. The process awakened using this approach has to assume more responsibility: if the process no longer wants the inode lock, it must do another wake1proc to pass the lock to another waiting process.
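A wake1proc of the sort described above might look like the following; the single sleep queue and the setrun call are invented for illustration (a real kernel hashes sleepers across many queues).

    struct proc {
        struct proc *p_link;      /* sleep queue link */
        void        *p_wchan;     /* channel being slept on */
    };

    extern struct proc *sleepq;   /* one sleep queue, for simplicity */
    extern void setrun(struct proc *);

    /* Like wakeup(), but make only the first matching sleeper runnable,
       so N lock waiters cost O(N) total wakeups instead of O(N^2). */
    void
    wake1proc(void *chan)
    {
        struct proc **pp, *p;

        for (pp = &sleepq; (p = *pp) != 0; pp = &p->p_link) {
            if (p->p_wchan == chan) {
                *pp = p->p_link;      /* unlink this sleeper only */
                setrun(p);
                return;
            }
        }
    }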
VN_HOLD/VN_RELE Locking

Vnodes represent shared resources in most cases, and can hence be referenced by a number of different processes in a completely asynchronous fashion. When a new pointer reference to a vnode is created, a reference count in the vnode is incremented by doing a VN_HOLD (which uses an atomic add operation in the multiprocessor case). When the caller is done with the vnode, it releases the reference by doing a VN_RELE operation. The VN_RELE does an atomic decrement of the reference count and, if it becomes zero, calls VOP_INACTIVE, which does a file system dependent cleanup operation on the vnode.

In some file systems the vnode continues to exist in a quiescent state after the VOP_INACTIVE, such that it can be reclaimed by a future lookup operation. The problem with this is that the zero count state is used to signify the quiescent state in these file systems. But the count goes to zero before the inactivation is performed, so there is a serious race between the lookup operation and the inactivation. It is possible for another process to find the vnode and do a VN_HOLD followed by a VN_RELE before the inactivation is performed, resulting in two inactivation attempts. To solve this problem it is necessary to change the file systems not to use the zero count as the inactive indicator. For example, in the S5 and UFS file systems, the count is decremented again to -1 under the inode cache lookup lock to truly indicate the inactive state. On a lookup, which is also done holding the cache lookup lock, when the count for an inode is -1, the inode is reactivated and the count set back to zero before doing a VN_HOLD to count the new reference. This solution makes the state transitions between active and inactive states safe, while allowing the use of atomic operations for VN_HOLD and VN_RELE.

Encore describes using a per-vnode spin lock around the VN_HOLD/VN_RELE increment and decrement operations [11], but they do not reveal their solution to the above race condition.
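The count-to-minus-one protocol can be sketched as below. The -1 inactive state and the use of the inode cache lookup lock are from the paper; the function and field names, and the exact shape of the recheck, are illustrative assumptions.

    struct mutex { int m_lock; };
    struct vnode { int v_count; };

    extern struct mutex icache_lock;        /* inode cache lookup lock */
    extern void mutex_enter(struct mutex *);
    extern void mutex_exit(struct mutex *);
    extern int  atomic_add(int *, int);     /* returns the new value */
    extern void VOP_INACTIVE(struct vnode *);

    void
    VN_RELE(struct vnode *vp)
    {
        if (atomic_add(&vp->v_count, -1) > 0)
            return;                          /* still referenced */

        mutex_enter(&icache_lock);
        if (vp->v_count == 0) {              /* nobody raced in a hold */
            vp->v_count = -1;                /* -1, not 0, means inactive */
            mutex_exit(&icache_lock);
            VOP_INACTIVE(vp);
        } else
            mutex_exit(&icache_lock);
    }

    /* Under icache_lock, a lookup that finds v_count == -1 reactivates
       the inode before counting the new reference. */
    void
    icache_reactivate(struct vnode *vp)
    {
        vp->v_count = 0;
        atomic_add(&vp->v_count, 1);         /* the VN_HOLD */
    }

With this arrangement a racing VN_HOLD/VN_RELE pair simply finds the count already at -1 and skips the second inactivation, while the common fast paths remain single atomic operations.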

Performance Tuning

A number of useful tools were available to the SVR4 MP developers [6, 17]. These tools, together with techniques very similar to those described by Paciorek, et al. [15], allowed reasonably accurate characterization of the locking contention and scalability of the system and the detection of deadlocks.

Cache Contention

A number of different caches are used within all UNIX variants to enhance system performance. Logically, such caches are collections of objects each tagged with an identifier. Typical operations perform some actions on an individual cache object. These actions can often be quite self-contained and very scalable, since their locality is constrained. However, once an operation enters the domain of the cache itself, it can collide with other operations on different objects in the cache. Hence, attention is naturally drawn to these caches as places where processor contention can occur and needs to be relieved.

Most of these caches have a similar structure, as illustrated in Figure 1. In Figure 1, the hash queues are an array of queue head structures, indexed by computing a hash function of an object identifier during a cache operation. Each square box represents an element of the cache, and those that are not marked "Busy" are chained in least-recently-used (LRU) order on a free list. To aid in discussing locking strategies for this organization it is useful to consider the three main operations: lookup, release and flush.

A cache lookup operation scans the appropriate hash queue for the given object identifier. If an element matching the identifier is found and is not busy, it is removed from the free list, marked busy and returned to the caller. If the matching element is busy, then the caller waits for it to be released by the current owner and then rescans the hash queue to make sure the identity of the matched element has not changed. If the identifier is not found, an element is removed from the front of the free list and its current hash list and replaced on the hash queue of the searched-for identifier. The element is marked busy, its contents are then changed to reflect the new identity and it is returned to the caller. The release operation of a busy cache element puts a busy element back onto the free list and clears the busy mark.

[Figure 1: General Cache Organization]

The flush operation runs periodically to clean elements on the free list so that they can be reused immediately when a cache miss occurs. The flush operation scans the list for items that need cleaning, removes them from the free list, cleans them and returns them to the free list when done. As an example, the buffer cache cleaning involves queueing modified disk blocks to be written to the disk. A very detailed description of this structure relative to the disk buffer cache can be found in Bach [1].

The "intuitively obvious" way to lock this structure involves the use of three types of locks: a lock on each cache element, a lock for each hash queue and a lock to protect the free list, as shown in Figure 2. The dotted and dashed lines enclose the data structures protected by each of the hash queue and free list locks (the cache element locks are not shown). This locking strategy was used by Encore in their Mach/4.3BSD buffer cache locking, and has a number of deadlock and performance difficulties [4]. In terms of overhead, this locking requires obtaining at least three locks per lookup or release operation, including the per-element lock. A cache lookup miss may require obtaining two arbitrary hash queue locks, thus requiring a deadlock avoidance strategy to be implemented. More importantly, the free list lock is obtained for every operation, so it represents the limiting factor for the scalability of the cache, as discovered and reported by Encore.

[Figure 2: Separate Hash Queue and Free List Locks]

A simplification of this approach lowers the locking overhead without sacrificing scalability (assuming short average hash queue size) by using only the single free list lock to protect all of the data structures, as shown in Figure 3. An early version of the Consortium locking for the buffer cache included a per-element lock along with the free list lock. It was noted that an element lock was almost always obtained while holding the free list lock, and was hence redundant. Removing the element lock reduced lock overhead by 50% and lowered contention also. Of course, the element lock can not be removed if it is a semaphore resource lock which replaces the busy mark on the cache element, as in the semaphore locking approach. In the Consortium case, the element lock was a mutex which only protected manipulations of fields within the data structure.

[Figure 3: Single Lock for All Cache Data Structures]

Unfortunately, the single lock still represents a contention point when the access rate to a cache is high. A solution to this problem is to fragment the free list into segments which are associated with each hash list and use a lock on each hash list to protect both the hash queue and the free list, as shown in Figure 4. Any time a free element is required, it is allocated from the free list associated with the hash queue being searched, so that each cache element is permanently bound to a fixed hash queue. Because of this, each hash queue is typically initialized to have a configurable, constant number of elements. This cache organization was dubbed a Software Set-Associative Cache (SSAC) due to its logical resemblance to a hardware set-associative cache, and is described in detail elsewhere [17].
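In outline, a lookup in such a cache touches exactly one lock, as the following sketch shows. The 4-way rows, identifier layout and locking primitives are illustrative assumptions rather than the SVR4 MP data structures, and the case where every entry in a row is busy is reduced to a failure return that the caller would handle by waiting.

    #define NROW  128                   /* number of hash rows */
    #define NWAY  4                     /* elements bound to each row */

    struct ssac_ent {
        long tag;                       /* object identifier */
        int  busy;
        long lastuse;                   /* for LRU victim selection */
    };

    struct ssac_row {
        int             lock;           /* one spin lock per row */
        struct ssac_ent way[NWAY];
    };

    extern void spin_lock(int *);
    extern void spin_unlock(int *);
    extern long lbolt;                  /* clock tick counter */

    static struct ssac_row cache[NROW];

    struct ssac_ent *
    ssac_lookup(long tag)
    {
        struct ssac_row *row = &cache[tag % NROW];
        struct ssac_ent *ep, *victim = 0;
        int i;

        spin_lock(&row->lock);
        for (i = 0; i < NWAY; i++) {
            ep = &row->way[i];
            if (!ep->busy && ep->tag == tag) {          /* hit */
                ep->busy = 1;
                ep->lastuse = lbolt;
                spin_unlock(&row->lock);
                return ep;
            }
            if (!ep->busy && (victim == 0 || ep->lastuse < victim->lastuse))
                victim = ep;            /* LRU free element in this row */
        }
        if (victim != 0) {              /* miss: reuse within the row */
            victim->tag = tag;
            victim->busy = 1;
            victim->lastuse = lbolt;
        }
        spin_unlock(&row->lock);
        return victim;                  /* 0 if every way was busy */
    }

Because an element can never migrate to another row, a miss never takes a second lock, which is what removes the global free list from every operation.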


The SSAC represents the key technique used in multithreading the file system caches to obtain practically unlimited scalability. The description is extended here to show the application of the technique to a number of different caches and illustrate the problems that arose.

[Figure 4: Software Set-Associative Cache Locking]

A follow-on paper by Encore describing parallelization of a vnode-based file system [11] does not refer to the free list contention problem. The buffer cache locking is described as consisting of a lock on each hash queue, plus a lock inside each buffer. From this description it is not possible to determine how or even if they have solved the free list contention problem. The problem is likely moot to them, however, as their stated intention was ultimately to replace the buffer cache entirely with the Mach "No Buffer Cache" code from CMU.

File System Caches

While tuning the file systems, three SVR4 file system caches showed up as having significant levels of cache contention. The first of these caches is the segmap cache, which is used to map file pages into windows where they can be copied during read and write operations. The segmap cache replaces the buffer cache in SVR4 for most data read and write operations. The second cache is the buffer cache (a shadow of its former self), which is used to hold file system structural information, such as blocks of inodes. The last cache is the directory name lookup cache (DNLC), which is a cache of pathname component translations. The segmap cache exhibited 10-15% contention with only two processors running cache-bound file system I/O, while the buffer and DNLC caches had similar contention when the GAEDE and AIM 3 benchmarks were run on 5 CPUs. The SSAC technique was applied to these caches, with slightly different structures in each case. The locking contention for all of the caches was reduced to less than .1%. The different structures lead to some small problems which highlight the properties of the approach.

Segmap Cache

The segmap cache was changed first, as it represented the largest bottleneck. In this case, the hash queue and free list pointers in the cache structure were kept. The hash and free queues for each hash queue were set up at initialization to contain 4 cache elements each. The hash queue manipulations were simplified, since elements are never moved from one hash chain to another. This code change was very straightforward and took less than one day to accomplish. No side effects or difficulties were apparent after the change, and the contention was reduced as described. Segmap cache elements are not locked for exclusive use on return from the lookup operation, but a reference count is incremented. Hence, more than one process may actually use a cache entry. To protect these processes from one another, the hash queue lock for the queue containing an element is locked while fields in the cache element are manipulated. This avoids the necessity for a separate mutex lock to protect the data contents within the cache. Contention for this per-element locking is included in the less than .1% contention measurement.

Disk Buffer Cache

The application of the SSAC structure to the disk buffer cache proved to be somewhat more interesting and challenging due to some features of the SVR4 buffer cache code. Memory for the cache headers and buffers is not statically allocated at initialization, as it was in SVR3. Rather, the free list of headers grows as needed and the buffers themselves are dynamically allocated and freed to accommodate different buffer sizes. This made it difficult to configure a set of fixed-length buffer chains on the hash queues, so another approach was needed. In addition, the cache hashing function was found to be very poor. Most kernel hash queue arrays are configured to have 2^n entries so that the cheaper bit-mask operation (tag & (2^n - 1)) can be used instead of the modulus operation to fold the object identifier tag into a hash index. If the object identifiers are separated by a power of two, then only a subset of the hash queues are actually ever used. In the case of the buffer cache, a UFS file system with 4-KByte blocks generates object identifiers which are usually multiples of 8 (when converted to 512-byte physical block numbers), thus using only every 8th hash queue. This problem can be alleviated by using only 2^n - 1 entries, which requires the more expensive modulus operation, but gives a better hashing distribution. (As another example of this, process structures were always allocated on 512-byte boundaries, so any sleep on the address of any proc structure always queued the sleeper on sleep hash queue 0.)
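The effect of the power-of-two stride on the two table sizes is easy to demonstrate outside the kernel. The following small program (an illustration, with arbitrary table sizes) counts how many queues each hashing scheme actually touches for identifiers spaced 8 apart:

    #include <stdio.h>

    #define POW2  64                /* 2^6 hash queues   */
    #define ODD   63                /* 2^6 - 1 hash queues */

    int
    main(void)
    {
        int used_pow2[POW2] = {0}, used_odd[ODD] = {0};
        int i, n2 = 0, nodd = 0;
        long tag;

        /* 4-KByte blocks as 512-byte block numbers: multiples of 8. */
        for (tag = 0; tag < 8192; tag += 8) {
            used_pow2[tag & (POW2 - 1)]++;   /* cheap bit mask */
            used_odd[tag % ODD]++;           /* costlier modulus */
        }
        for (i = 0; i < POW2; i++)
            n2 += (used_pow2[i] != 0);
        for (i = 0; i < ODD; i++)
            nodd += (used_odd[i] != 0);

        printf("2^n table:   %d of %d queues used\n", n2, POW2);
        printf("2^n-1 table: %d of %d queues used\n", nodd, ODD);
        return 0;
    }

With the mask, only 8 of the 64 queues are ever used; with the modulus, all 63 are, since 8 and 63 are relatively prime.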

One of the desirable properties of a cache is that the hash queues remain fairly short, to reduce the search time required for a lookup. A traditional target has been to keep the average size of each hash queue at around 4 buffers. When the free list for each hash queue was configured to be this size, it was discovered that the buffer cache code contains an inherent deadlock. Certain operations require the use of 2 or more buffers, which can deadlock when all of the buffers on a free list are used up by such operations. This deadlock could not be eliminated without radical restructuring of the code. So, to make its occurrence less likely, larger free lists were configured and shared among a number of hash queues, with a single mutex lock per free list. Each hash queue is bound at initialization to a particular free list, and buffers do move between hash queues, but only within those bound to the same free list. This approach provided the contention reduction desired, while increasing the probability of the deadlock only marginally.

Directory Name Lookup Cache

The DNLC cache changes were another variation on the SSAC theme. All of the free list and hash queue pointers were removed and the cache structured as a 2-dimensional array, with a vector of simple spin locks to lock each row of the cache. To maintain the LRU ordering of each cache line, a timestamp was added to each DNLC structure indicating the last time that the structure was accessed, with a 0 timestamp indicating a busy cache entry. When a cache search is attempted, the entry with the lowest non-zero timestamp is remembered and reused if the name is entered into the cache.

The semantics of this particular cache have some unusual features, most of which do not affect the SSAC structure. The lookup and enter operations are split into two distinct operations, where they are normally combined. A number of operations purge entries from the cache based on different characteristics, such as vnode identity, file system identity, just any old entry or the entire cache. The purge operation actually removes the entry, rather than cleaning it as in other flush-type operations. Because of this, a little more care needs to be taken in deciding which entries should be purged when merely reclaiming space. In most of the purge operations, the entire cache has to be scanned and all elements satisfying the search criterion must be removed. The division of free list locks in this case is an advantage, because each row of the cache is locked individually as the cache is searched, allowing other operations to proceed in parallel. On the other hand, the multiple locks do not allow the state of the entire cache to remain frozen during an operation. (Although this property does not present a problem in normal operation, a debugging function which counts all of the occurrences of a given vnode in the cache is no longer reliable.)

Since there is no single LRU free list, purging an element when any one will do has to be done carefully. The approach of traversing each cache row processing each element in turn works very well in the cleaning case (for example, the buffer cache), where the cleaned element is not reused immediately. In the purging case, it is a poor approximation to selecting a least-recently-used entry to purge. A better LRU approximation for the purging case is to select the least-recently-used entry from a cache row, purge it and then move onto the next row in the cache, cycling through the cache.

A potential problem was noticed in the hashing function for this cache. The same entry can be entered into the cache with different user permissions (credentials), so an often-used directory entry could be in the cache many times with different credentials for different users, all hashed onto the same queue. This could cause thrashing on that queue. The solution to the problem is to use the credential pointer in the hash calculation to distribute the occurrences of the name to a number of different hash queues. This illustrates a general property of the SSAC organization, which is that the goodness of the hashing function becomes more important when relatively short, fixed-sized cache rows are defined. In general, the shorter each row is, the faster the search can be, but the probability of thrashing within a row increases. Experience from hardware caches, where 4-way set associativity is considered adequate, suggests that 4 is a reasonable starting value for tuning the hash queue length.

The DNLC cache also has some other interesting behavior in its interactions with the UFS and S5 file systems. When a vnode is entered into the DNLC cache, it has a VN_HOLD placed on it. Quite often, the last reference to a UFS or S5 file is from the DNLC cache. Because of this, the iget function, which attempts to find a given file in the inode cache, calls the DNLC purge-any function when there are no more free inodes in the cache. This purging represents a hit-or-miss effort which is repeated until one of the file system's inodes whose last reference is from the DNLC cache is purged. If one is hit, the purge function calls VN_RELE to decrement the vnode reference count to zero, which then calls VOP_INACTIVE to release its inode. In the S5 file system, this is where the fun began: if the file being inactivated was unlinked, then the inactivate routine called the ifree function to release the inode number back to the inode free list. If the call to iget was for a file being allocated (from ialloc), then the ifree function would deadlock trying to obtain the non-recursive mutex lock on the file system structure, which was held by ialloc. The solution to the problem was to release the mutex lock in ialloc before the call to iget. Unfortunately, this allowed a race where attempts could be made to allocate the same inode more than once, which had to be dealt with by detecting the already-allocated inode after return from iget. This example is cited to illustrate the (often unexpected) recursive flavor of the SVR4 file system code.
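The row organization and timestamp-based LRU described above might be sketched as follows; the structure layout, the row hash and the primitives are assumptions for illustration. Note how the credential pointer participates in the row hash, per the fix described above.

    #define DROWS 256
    #define DWAYS 4

    struct vnode;
    struct cred;

    struct ncache {
        long          nc_time;    /* last access; 0 marks a busy entry */
        long          nc_key;     /* folded name, directory, credential */
        struct vnode *nc_vp;
    };

    struct ncrow {
        int           ncr_lock;   /* simple spin lock for the whole row */
        struct ncache ncr_ent[DWAYS];
    };

    extern void spin_lock(int *);
    extern void spin_unlock(int *);
    extern long lbolt;

    static struct ncrow dnlc[DROWS];

    /* Folding the credential pointer in spreads one name cached under
       many credentials across several rows instead of thrashing one. */
    #define DNLC_ROW(namehash, dvp, cr) \
        (((namehash) + (long)(dvp) + (long)(cr)) % DROWS)

    struct vnode *
    dnlc_lookup_row(int rowidx, long key)
    {
        struct ncrow *row = &dnlc[rowidx];
        struct ncache *ep, *lru = 0;
        struct vnode *vp = 0;
        int i;

        spin_lock(&row->ncr_lock);
        for (i = 0; i < DWAYS; i++) {
            ep = &row->ncr_ent[i];
            if (ep->nc_time != 0 && ep->nc_key == key) {
                ep->nc_time = lbolt;            /* refresh LRU stamp */
                vp = ep->nc_vp;
                break;
            }
            if (ep->nc_time != 0 &&
                (lru == 0 || ep->nc_time < lru->nc_time))
                lru = ep;       /* remembered; a dnlc_enter() would
                                   reuse this oldest entry on a miss */
        }
        spin_unlock(&row->ncr_lock);
        return vp;              /* 0 on a miss */
    }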


The Encore approach to locking the DNLC cache is the combination of hash and free list locks shown in Figure 2 [11]. As such, each operation acquires between 1 and 4 locks to accomplish an enter operation, whereas the SSAC approach always acquires only 1 row lock, with lower contention than in the Encore case due to the fragmented free lists.

Inode Caches

The inode caches in the UFS and S5 file systems are additional candidates for the SSAC multithreading approach. However, a single lock has been used for the entire collection of inode hash queues and the inode free list in each file system. This strategy was chosen due to an insufficient level of measured contention. It could be argued that the lack of contention arises from the relative infrequency of file open or create operations relative to others, such as reading or writing. Also, the DNLC cache removes some of the inode lookup load caused by pathname searches. The Encore locking approach is the same as that used for their other caches, namely the use of a lock for each hash queue as well as a free list lock.

Disk Queue Tuning

Once the contention on the three caches was removed, a 2-CPU system with a single disk became essentially I/O-bound running the GAEDE benchmark. To investigate the cause of this, some performance measuring was added to the kernel which reported utilizations of all processors and the disk controller, as well as the size of the disk queue. These measurements were collected at each clock tick and averaged over a 1-second interval. Watching these statistics in real time revealed that the system was alternating between periods of processor saturation with little or no disk activity, and disk saturation with little or no processor activity. The disk saturation occurred when the disk queue size rose to several hundred requests in length. Unfortunately, it is theoretically possible in SVR4 for the size of the disk queue to be bounded only by the number of allocatable memory pages in the system.

The blocking of the CPUs during periods when the disk queues were very large was due to the fact that processes were blocked on synchronous I/O which was generated to update inode contents and directories. As an experiment, all of the synchronous inode update operations were changed to delayed writes to decouple them from the disk driver. This change improved the CPU scalability substantially, at the cost of a less consistent file system. On a two-processor system, the GAEDE benchmark could be transformed from mostly disk-bound to mostly CPU-bound with this change. (A kernel variable which turned this feature on or off was cynically dubbed the "benchmark flag", with the suggestion that at least some vendors actually have such a flag.) McVoy [14] proposed doing something similar to this, but with better file system consistency, by forcing the correct ordering of asynchronous operations to the disk for directory and inode updates. The change described here was not intended primarily as a file system enhancement, but rather as a tool to make the benchmark more CPU-bound to uncover locking contention in the file systems.

Another approach, which has been used successfully to increase write throughput of the buffer cache in SVR3 [16], was to queue synchronous requests near the front of the disk queue. This change gave only minimal benefit. It was found that requests were often queued asynchronously, but then a process would later decide that it wanted the data block. This suggested an approach whereby requests which were being waited for were dynamically promoted to the front of the disk queue, to allow waiting processes to become active sooner. Surprisingly, this change caused overall throughput of the benchmark to decline dramatically, because the disk queue never actually became empty, but increased monotonically in size.

It appears from this that balance among utilizations of resources is more important than optimizing the utilization of one class of resource, namely the CPUs. The segments of idle CPU time seem to be necessary to restore balance by lowering the effective request rate to the disk, allowing it to clear the accumulated backlog.

The real problem here is that there is a "high wall" between the disk driver and the upper level file system layer. Once an asynchronous write request is passed to the disk driver, there is no current way to retrieve it until the operation is complete. This represents wasted I/O activity, since a process that waits for such a request is likely to modify the data just written. Similarly, the disk can be idle for long periods of time because the cache flushing code which could give the driver some work to do does not know that the disk is idle. When it does give it work to do, it tends to flood the disk queue with requests, which overloads the driver and slows or idles the CPUs.

Fixing these problems to even out the flow of cache copy-back traffic to the disk is one part of the solution to this problem. The other part of the solution is to reduce the write requirements of the file systems themselves, possibly by the use of log-structured file systems such as that implemented for the Sprite project [19].
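For concreteness, the promotion experiment might have looked like the following; the queue layout and names are invented, and a real driver would hold its own lock around the manipulation. A buffer that a process decides to wait on is pulled out of its sorted position and pushed to the head:

    struct buf {
        struct buf *av_forw;        /* disk queue link */
    };

    struct diskq { struct buf *dq_head; };

    /* Promote bp to the front of the (singly-linked) disk queue. */
    void
    disk_promote(struct diskq *dq, struct buf *bp)
    {
        struct buf **pp;

        for (pp = &dq->dq_head; *pp != 0; pp = &(*pp)->av_forw) {
            if (*pp == bp) {
                *pp = bp->av_forw;          /* unlink */
                bp->av_forw = dq->dq_head;  /* push on front */
                dq->dq_head = bp;
                return;
            }
        }
    }

As the text reports, this "obvious" optimization backfired: with waited-for requests always served first, the backlog of asynchronous requests never drained and the queue grew without bound.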

Related Work

The closest work to this effort was done by Encore to multithread their vnode-based Mach file system and the OSF/1 file system [11, 12]. Their approach was actually quite different in a number of ways.

The most significant difference is that they chose to lock the file systems below the vnode layer; that is, there is no generic locking outside the file system. The SVR4 MP approach was the exact opposite of this. The locking is provided in a generic sense across the vnode operation interface, with the vnode lock available as the mutual exclusion lock for all of the per-vnode data, including file system dependent data. The Encore approach replaced the inode ILOCKED flag with read/write resource locks, which extended the file semantics to allow true concurrency for file reading. The rationale for this extension was that some files are read frequently on a large system, and that performance could be enhanced by reducing lock contention on these files. During Consortium tuning, this type of contention was not encountered, probably due to our configurations being smaller in size than Encore's. Another advantage of locking below the vnode interface is that the file system writer has the flexibility to determine whether a read/write or simple mutex lock should be used, whereas the Consortium approach allows only mutex locking. On the other hand, the Consortium approach involves less semantic (i.e. code) change, particularly since it is permissible for a vnode lock to be held around more than one VOP call when that is required. This was important due to the time-to-market and robustness constraints that the project faced.

Encore avoids contention by releasing the resource locks across any blocking operations, which allows a race between two simultaneous lookups of the same inode. This race requires a recheck of the hash queue whenever a new inode is read from disk. In the OSF/1 file system, the recheck is avoided by timestamping data structures, such as a hash queue, when they change and only rechecking a queue if its timestamp changes across a blocking operation. Much of the paper on OSF/1 describes the races introduced into directory operations by more permissive locking and how they were solved using timestamps. In the Consortium approach, the original resource locking semantics are preserved, keeping the ILOCKED or IRWLOCKED flags locked across blocking operations, even though the vnode mutex locks are released.

Another clear difference can be seen in the general cache locking strategy, which has been commented on throughout the paper. Practically all of their described cache locking fits the hash queue lock + free list lock + element lock model, as shown in Figure 2. The Consortium locking uses the single lock model shown in Figure 3, modified to use SSAC locking (Figure 4) where contention is too high.

Summary

The SVR4 MP effort resulted in a system which largely met its goals. The degradation relative to a uniprocessor SVR4 kernel is very close to the 5% target, and the throughput scalability at 5 processors is 88% of the theoretical maximum. This is computed as 5-CPU Throughput / (5 x 1-CPU Throughput). Figure 5 is a graph which shows the actual increase in throughput as a function of the number of processors, as well as the ideal scalability. To achieve the goal given the disk-bound behavior described previously, it was necessary to configure 8 separate disk drives on 4 disk controllers in the benchmark system (with the "benchmark flag" turned off).

The file system multithreading effort resulted in a model which provides a lot of generic support for locking within each file system type. The vnode locking around VOP functions provides default locking protection for most file system functions, making individual file system multithreading easier.

Significant contention points in the file system caches were relieved by the application of the Software Set-Associative Cache structure. Once these significant locking bottlenecks were relieved, the file systems were found to be inherently disk bound by our benchmarks. Some disk queueing modifications were tried, with limited success and one dramatic failure. However, these modifications lead to insights about the importance of maintaining balance to achieve optimal throughput, rather than attempting to optimize the utilization of a single resource class.

[Figure 5: Scalability of the Gaede Benchmark: throughput (scripts/hour) versus number of processors, actual and ideal]

The Sprite Logging File System is reportedly an order of magnitude more efficient at using disks than existing file systems [19], running at CPU saturation while consuming only 17% of disk bandwidth in a test which is similar to GAEDE. This being the opposite of our situation, the marriage of multiprocessor technology with log-based file systems appears to be quite desirable.


SVR4 MP is available from UNIX System Laboratories as a source-code upgrade to System V Release 4. More information can be obtained by calling 1-800-828-UNIX.

Acknowledgments

This project has been a pleasure for us to be involved with, as most of the architecture team had at least two generations of multiprocessor design experience. A large number of people have contributed to the success of SVR4 MP, most having written code: Mike Abbott, Sunil Bopardikar, Fiorenzo Catteneo, Calvin Chou, Ho Chen, Ben Cuny, Jane Ha, Jim Hanko, Mohan Krishnan, John Litvin, Mani Mahalingam, Arun Maheshwari, Cliff Neighbors, Sandeep Nijhawan, Mark Nudelman, Lisa Repka, Sunil Saxena, K. M. Sherif, Moyee Siu, John Slice, Dean Thomas, Vijaya Verma, Fred Yang and Wilfred Yu. Special thanks are due to Wilfred Yu for the performance numbers quoted here. Thanks are also due to the reviewers for their helpful suggestions for improvements to this paper.

References

[1] M. J. Bach. The Design of the UNIX Operating System. Prentice-Hall, Englewood Cliffs, NJ, 1986, pp. 38-59.

[2] M. Bach and S. Buroff. Multiprocessor UNIX Operating Systems. AT&T Bell Laboratories Technical Journal, 64:1733-1749, October 1984.

[3] R. Barkley and T. P. Lee. A Dynamic File System Inode Allocation and Reclaim Strategy. Proceedings of the Winter 1990 USENIX Conference.

[4] J. Boykin and A. Langerman. The Parallelization of Mach/4.3BSD: Design Philosophy and Performance Analysis. Proceedings of the USENIX Workshop on Experiences with Distributed and Multiprocessor Systems, pp. 105-126, October 1989. Also appeared as: Mach/4.3BSD: A Conservative Approach to Parallelization. Computing Systems, Vol. 3, No. 1, USENIX, Winter 1990.

[5] M. Campbell, R. Barton, J. Browning, D. Cervenka, B. Curry, T. Davis, T. Edmonds, R. Holt, J. Slice, T. Smith and R. Wescott. The Parallelization of UNIX System V Release 4.0. Proceedings of the Winter 1991 USENIX Conference.

[6] M. Campbell, R. Holt and J. Slice. Lock Granularity Tuning Mechanisms in SVR4/MP. Proceedings of the Second Symposium on Experiences with Distributed and Multiprocessor Systems (SEDMS II). USENIX, March 1991.

[7] S. Gaede. A Scaling Technique for Comparing Interactive System Capacities. Proceedings of the Conference of CMG XIII, December 1982.

[8] R. Gingell, J. Moran and W. Shannon. Virtual Memory Architecture in SunOS. Proceedings of the Summer 1987 USENIX Conference.

[9] S. Kleiman. Vnodes: An Architecture for Multiple File Systems in Sun UNIX. Proceedings of the Summer 1986 USENIX Conference.

[10] G. Hamilton and D. Conde. An Experimental Symmetric Multiprocessor ULTRIX Kernel. Proceedings of the Winter 1988 USENIX Conference.

[11] A. Langerman, J. Boykin and S. LoVerso. A Highly-Parallelized Mach-based Vnode Filesystem. Proceedings of the Winter 1990 USENIX Conference.

[12] S. LoVerso, N. Paciorek, A. Langerman and G. Feinberg. The OSF/1 UNIX File System (UFS). Proceedings of the Winter 1991 USENIX Conference.

[13] M. K. McKusick, W. Joy, S. Leffler and R. Fabry. A Fast File System for UNIX. ACM Transactions on Computer Systems, Vol. 2, No. 3, pp. 181-197, 1984.

[14] L. W. McVoy and S. R. Kleiman. Extent-like Performance from a UNIX File System. Proceedings of the Winter 1991 USENIX Conference.

[15] N. Paciorek, S. LoVerso and A. Langerman. Debugging Multiprocessor Operating System Kernels. Proceedings of the Second Symposium on Experiences with Distributed and Multiprocessor Systems (SEDMS II). USENIX, March 1991.

[16] K. Peacock. The Counterpoint Fast File System. Proceedings of the Winter 1988 USENIX Conference.

[17] J. K. Peacock, S. Saxena, D. Thomas, F. Yang and W. Yu. Experiences from Multithreading System V Release 4. Proceedings of the Third Symposium on Experiences with Distributed and Multiprocessor Systems (SEDMS III). USENIX, March 1992.

[18] R. Rodriguez, M. Koehler, L. Palmer and R. Palmer. A Dynamic UNIX Operating System. Proceedings of the Summer 1988 USENIX Conference.

[19] M. Rosenblum and J. K. Ousterhout. The Design and Implementation of a Log-Structured File System. Proceedings of the Thirteenth ACM Symposium on Operating Systems Principles. ACM SIGOPS, Vol. 25, No. 5, October 1991.

[20] L. M. Ruane. Process Synchronization in the UTS Kernel. Computing Systems, Vol. 3, No. 3, USENIX, Summer 1990.


[21] V. Sinkewicz. A Strategy for SMP ULTRIX. Proceedings of the Summer 1988 USENIX Conference.

Author Information

Kent Peacock has worked at Intel as a Consultant for 2 years as a member of the Intel Multiprocessor Consortium. Prior to that, he worked at Acer/Counterpoint developing a multiprocessor implementation of System V and the Acer/Counterpoint Fast File System, which is now part of SCO UNIX [16]. He has worked on the design and implementation of 5 multiprocessor systems since 1978, and has dabbled in performance tuning, C compilers, multiprocessor debugging tools and graphics applications. He graduated from the University of Waterloo in Ontario, Canada with a Ph.D. in Computer Science in 1979, having previously completed a Master of Mathematics in Computer Science in 1975. In 1974, he graduated from the University of Manitoba, in Winnipeg, Canada, with a Bachelor of Science in Electrical Engineering. He can be reached via U.S. Mail at 1747 Fanwood Ct.; San Jose, CA 95133 and electronically at [email protected].
