CPU Inheritance Scheduling

Bryan Ford Sai Susarla

Department of Computer Science, University of Utah
Salt Lake City, UT 84112
[email protected]
http://www.cs.utah.edu/projects/flux/

Abstract

Traditional processor scheduling mechanisms in operating systems are fairly rigid, often supporting only one fixed scheduling policy or, at most, a few "scheduling classes" whose implementations are closely tied together in the OS kernel. This paper presents CPU inheritance scheduling, a novel processor scheduling framework in which arbitrary threads can act as schedulers for other threads. Widely different scheduling policies can be implemented under the framework, and many different policies can coexist in a single system, providing much greater scheduling flexibility. Modular, hierarchical control can be provided over the processor utilization of arbitrary administrative domains, such as processes, jobs, users, and groups, and the CPU resources consumed can be accounted for and attributed accurately. Applications, as well as the OS, can implement customized local scheduling policies; the framework ensures that all the different policies work together logically and predictably. As a side effect, the framework also cleanly addresses priority inversion by providing a generalized form of priority inheritance that automatically works within and among diverse scheduling policies. CPU inheritance scheduling extends naturally to multiprocessors, and supports processor management techniques such as processor affinity[29] and scheduler activations[3]. We show that this flexibility can be provided with acceptable overhead in typical environments, depending on factors such as context switch speed and frequency.

[Footnote: This research was supported in part by the Defense Advanced Research Projects Agency, monitored by the Department of the Army, under contract number DABT63-94-C-0058. The opinions and conclusions contained in this document are those of the authors and should not be interpreted as representing official views or policies of the U.S. Government.]

1 Introduction

Traditional operating systems control the sharing of the machine's CPU resources among threads using a fixed scheduling scheme, typically based on priorities. Sometimes a few variants on the basic policy are provided, such as support for fixed-priority threads[17], or several "scheduling classes" to which threads with different purposes can be assigned (e.g., real-time, interactive, or background)[25]. However, even these variants are generally hard-coded into the system implementation and cannot easily be adapted to the needs of individual applications.

In this paper we develop a novel processor scheduling framework based on a generalized notion of priority inheritance. In this framework, known as CPU inheritance scheduling, arbitrary threads can act as schedulers for other threads by temporarily donating their CPU time to selected threads while waiting on events of interest such as clock/timer interrupts. The receiving threads can further donate their CPU time to other threads, and so on, forming a logical hierarchy of schedulers as illustrated in Figure 1. A scheduler thread can be notified when the thread to which it donated its CPU time no longer needs it (e.g., because the target thread has blocked), so that the CPU can be reassigned to another target. The basic thread dispatching mechanism needed to implement this framework has no notion of thread priority, CPU usage, or clocks and timers; all of these functions, when needed, are implemented by threads acting as schedulers.

Under this framework, arbitrary scheduling policies can be implemented by ordinary threads cooperating with each other through well-defined interfaces that may cross protection boundaries. For example, a fixed-priority multiprocessor scheduling policy can be implemented by maintaining, among a group of scheduler threads (one for each available CPU), a prioritized queue of "client" threads to be scheduled; each scheduler thread successively picks a thread to run and donates its CPU time to the selected target thread while waiting for an interesting event such as quantum expiration (e.g., a clock interrupt). See Figure 2 for an illustration. If the selected thread blocks, its scheduler thread is notified and the CPU is reassigned. On the other hand, if a different event causes the scheduler thread to wake up, the running thread is preempted and the CPU is given back to the scheduler immediately. Other scheduling policies, such as timesharing[23], fixed-priority[12, 17], rate monotonic[22], fair share[9, 16, 19], and lottery/stride scheduling[30, 31, 32], can be implemented in the same way.

[Figure 1: Example scheduling hierarchy. The dark circles represent threads acting as schedulers, while the light circles represent "ordinary" threads. A fixed-priority root scheduler serves a real-time class (RR/FIFO scheduler), a timesharing class (lottery scheduler) containing per-user schedulers and a web browser running Java applets, and a background class (gang scheduler).]

[Figure 2: Example fixed-priority scheduler. Per-CPU scheduler threads draw client threads from prioritized ready queues and donate the CPU to the selected running thread; scheduling requests arrive on a message port.]

This scheduling framework has the following features:

• It supports multiple arbitrary scheduling policies on the same or different processors.

• Since scheduler threads may run either in the OS kernel or in user mode, applications can easily extend or replace the scheduling policies built into the OS.

• It provides hierarchical control over the processor resource usage of different logical or administrative domains in a system, such as users, groups, individual processes, and threads within a process.

• CPU usage accounting can be provided to various degrees of accuracy depending on the resources one is willing to invest.

• Priority inversion is addressed naturally in the presence of resource contention, without the need for explicit priority inheritance/ceiling protocols.

• CPU use is attributed properly even in the presence of priority inheritance.

• The framework naturally extends to multiprocessors.

• Processor affinity scheduling is supported.

• Scheduler activations can be implemented easily.

Based on a prototype implementation in a user-level threads package, we demonstrate that this framework behaves identically to traditional multi-class schedulers in equivalent configurations, and that it provides modular resource control, load insulation, and priority inversion protection automatically across multiple scheduling policies. The additional scheduling overhead caused by this framework is negligible in the test environment, and we show that it should be acceptable in many kernel environments as well, despite their greater context switch costs.

The rest of this paper is organized as follows. Section 2 describes motivation and related work, Section 3 presents the CPU inheritance scheduling algorithm in detail, and Section 4 shows how traditional scheduling algorithms can be implemented on this framework. Section 5 demonstrates the behavior and performance properties of our framework through experimental results, and Section 6 concludes with a short reflective summary.

2 Motivation

Traditional operating systems divide a machine's CPU resources among threads using a fixed scheduling scheme, typically based on priorities. However, the requirements imposed on an operating system's scheduler often vary from application to application. For interactive applications, response time is usually the most critical factor, i.e., how quickly the program responds to the user's commands. For batch jobs, throughput is of primary importance and latency is a minor issue. Hard real-time applications require meeting application-specific deadlines, while for soft real-time applications, missing a deadline is unfortunate but not catastrophic. No single scheduling scheme works well for all of these applications.

Over the years, the importance of providing a variety of scheduling policies on a single machine has waxed and waned, following hardware and application trends. In the early years of computing, use of the entire machine was limited to a single user thread; this evolved into multiprogrammed machines in which a single scheduling policy managed batch jobs effectively. The advent of timesharing on machines still used for batch jobs created a need for two scheduling policies. As timesharing gradually gave way to single-user workstations and PCs, a single scheduling policy was again adequate for a while.
Supporting multiple scheduling policies is again becoming important, not because of multiple users on a system, but because of the variety of concurrent uses to which modern systems are put. Multimedia drives the need for soft real-time scheduling policies on general purpose workstations. Untrusted executable content (e.g., Java applets) requires secure control of resource usage while also providing soft real-time guarantees. The hard real-time domain is also making inroads onto general purpose machines as processors and instruments supporting embedded applications become networked, and some customers (e.g., the military) need the ability to shift processing power dynamically to the problem of the moment. All of these additional policies must still work alongside traditional interactive and batch scheduling: though multimedia and video conferencing may be the current rage, this does not mean that users no longer care about getting good interactive response from the word processors, spreadsheets, and other applications that they still rely on daily. Therefore, as the diversity of applications increases, operating systems need to support multiple coexisting processor scheduling policies, both to meet individual applications' needs and to utilize the system's processor resources more efficiently.

2.1 Related Work

One simple approach to providing real-time support in systems with traditional timesharing schedulers, which has been adopted by many commonly-used systems such as Unix, Mach[1, 4], and Windows NT[25], and has even become part of the POSIX standard[17], is support for fixed-priority threads. Although these systems generally still use conventional priority-based timesharing schedulers, they allow real-time applications to disable the normal dynamic priority adjustments on threads specifically designated as "real-time threads," so that those threads always run at a programmer-defined priority. By carefully assigning priorities to the real-time threads in the system and ensuring that all non-real-time threads execute at lower priority, it is possible to obtain the correct behavior for some applications. However, this approach has serious, well-known limitations; in many cases, entirely different non-priority-based scheduling policies are needed[18].

Even in normal interactive and batch-mode computing, traditional priority-based scheduling algorithms are showing their age. For example, these algorithms do not provide a clean way to encapsulate a set of processes or threads as a single unit in order to isolate and control their processor usage relative to the rest of the system. This shortcoming opens the system to various denial-of-service attacks, the most well-known being the creation of a large number of threads that overwhelm processor resources and crowd out other activity. These vulnerabilities generally don't cause serious problems for machines used by only one person, or when the users of the system fall into one administrative domain and can "complain to the boss" if someone is abusing the system. However, as distributed and mobile computing become more prevalent and administrative boundaries become increasingly blurred, this form of system security is becoming more important. This is especially true when completely unknown, untrusted code is downloaded and run in a "secure" environment such as that provided by Java[13] or OmniWare[2]. Schedulers have been designed that address this problem by providing flexible hierarchical control over CPU usage at different administrative boundaries[5, 14, 15, 30, 31, 32]. However, it is not yet clear how these algorithms will address other needs, such as those of hard real-time applications; certainly it seems unlikely that a single "holy grail" of scheduling policies will be found.

With the growing diversity of application needs and scheduling policies, it is increasingly desirable for an operating system to support multiple independent policies. On multiprocessor systems, one simple but limited way of doing this is to run a different scheduling policy on each processor. A more general approach is to allow multiple "scheduling classes" to run on a single processor, with a specific scheduling policy associated with each class. The classes have a strictly ordered priority relationship to each other, so the highest-priority class gets all the CPU time it wants, the next class gets any CPU time left unused by the first class, and so on. Although this approach shows promise, one drawback is that since the schedulers for the different classes generally don't communicate or cooperate with each other closely, only the highest-priority scheduling class on a given processor can make any assumptions about how much CPU time it will have to dispense to the threads under its control.

Additionally, most existing multi-policy scheduling mechanisms still require every scheduling policy to be implemented in the kernel and to be closely tied to other kernel mechanisms such as threads, context switching, clocks, and timers. The only existing system we know of that allows different scheduling policies to be implemented in separate, unprivileged protection domains is the Aegis Exokernel[8]. However, the Aegis scheduling mechanism was not described at length and does not address important issues such as timing, CPU usage accounting, and multiprocessor scheduling. Both Aegis's scheduling mechanism and our framework are based on the use of a "directed yield" primitive which allows application-level threads to schedule each other; this core concept originally comes from coroutines[6], in which directed yield is the only way thread switching is done. We believe our scheduling framework could be implemented in an Exokernel environment through the use of suitable kernel primitives, application-level support code, and standardized inter-process scheduling protocols.

Finally, most existing systems still suffer from various priority inversion problems. Priority inversion occurs when a high-priority thread requesting a service must wait arbitrarily long for a low-priority thread to finish being serviced. This problem can be addressed in priority-based scheduling algorithms by supporting priority inheritance[7, 26], wherein the thread holding up the service inherits the priority of the highest-priority thread waiting for service. In some cases this approach can be adapted to other scheduling policies, such as with ticket transfer in lottery scheduling[31]. However, the problem of resolving priority inversion between threads of different scheduling classes, using policies with different and incomparable notions of "priority," has not been addressed so far.

3 CPU Inheritance Scheduling

In our scheduling model, as in traditional systems, a thread is a virtual CPU whose purpose is to execute instructions. A thread may or may not have a real CPU assigned to it at any given instant; a running thread may be preempted and its CPU reassigned to another thread at any time. (For the purposes of this framework, it is not important whether these threads are kernel-managed or user-managed threads, or whether they run in supervisor or user mode.)

In traditional systems, threads are generally scheduled by some lower-level entity, such as a scheduler in the OS kernel or a user-level threads package. In contrast, the basic idea of CPU inheritance scheduling is that threads are scheduled by other threads. Any thread that has a real CPU available to it at a given instant can temporarily donate its CPU to another specific thread instead of using the CPU itself. This operation is similar to priority inheritance in conventional systems, except that it is done explicitly by the donating thread, and no notion of "priority" is involved, only a direct transfer of the CPU from one thread to another; hence the name "CPU inheritance."

A scheduler thread is a thread that spends most of its time donating its CPU resources to client threads; the client threads thus inherit some portion of the scheduler thread's CPU resources and treat that portion as their virtual CPU. Client threads can in turn act as scheduler threads, distributing their CPU time among their own clients, and so on, forming a scheduling hierarchy.

The only threads in the system that inherently have actual CPU time available to them are the set of root scheduler threads; other threads can only run if CPU time is donated to them. There is one root scheduler thread for each real CPU in the system; each CPU is permanently dedicated to supplying CPU time to its associated root scheduler thread. The actions of the root scheduler thread on a given CPU determine the base-level scheduling policy for that CPU.

3.1 The Dispatcher

Even though all high-level scheduling decisions are performed by threads, a small low-level mechanism is still needed to implement primitive thread management functions. We call this low-level mechanism the dispatcher to distinguish it clearly from high-level schedulers.

The role of the dispatcher is to implement thread blocking, unblocking, and CPU donation. The dispatcher fields events and directs them to threads waiting on those events, without actually making any real scheduling decisions. Events can be synchronous, such as an explicit wake-up of a sleeping thread by a running thread, or asynchronous, such as an external interrupt (e.g., I/O or timer). The dispatcher itself is not a thread; it merely runs in the context of whatever thread owns the CPU at the time an event of interest occurs.

The dispatcher inherently contains no notion of thread priorities, CPU usage, or even clocks and timers. In a kernel supporting CPU inheritance scheduling, the dispatcher is the only scheduling component that must be in the kernel; all other scheduling code could in theory run in user-mode threads outside of the kernel (although this "purist" approach may be impractical for performance reasons).
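To make the dispatcher's role concrete, the sketch below shows one plausible shape for its core state and event path. This is our illustration under stated assumptions, not the paper's actual code; all names here (struct thread, chain_tail, notify_scheduler) are hypothetical.

    #include <stddef.h>

    /* Minimal sketch of dispatcher state (hypothetical).  The dispatcher
     * tracks only who donates to whom and who schedules whom; it holds no
     * priorities, quanta, or clocks -- those belong to scheduler threads. */
    struct thread {
        struct thread *donating_to; /* non-NULL while donating the CPU away  */
        struct thread *scheduler;   /* scheduler responsible for this thread */
        int blocked;                /* nonzero while waiting on an event     */
    };

    extern void notify_scheduler(struct thread *sched, struct thread *client);

    /* Follow a donation chain from a root scheduler thread down to the
     * thread that actually executes instructions. */
    struct thread *chain_tail(struct thread *t)
    {
        while (t->donating_to != NULL)
            t = t->donating_to;
        return t;
    }

    /* Wakeup event: mark the thread runnable and notify its scheduler.
     * The notification may recursively wake schedulers farther up the
     * hierarchy, as described in Section 3.2. */
    void wakeup(struct thread *t)
    {
        t->blocked = 0;
        notify_scheduler(t->scheduler, t);
    }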
3.2 Requesting CPU Time

Because no thread (except a root scheduler thread) can ever run unless some other thread donates CPU time to it, a newly-created or newly-woken thread must request CPU time from some scheduler before it can run. Each thread has an associated scheduler which has primary responsibility for providing CPU time to the thread. When the thread becomes ready, the dispatcher notifies the thread's scheduler that the thread needs CPU time. The exact form such a notification takes is not important; in our implementation, notifications are simply IPC messages sent by the dispatcher to Mach-like message ports.

When a thread wakes up, the notification it produces may in turn wake up a scheduler thread waiting to receive such messages on its port. Waking up that scheduler thread will cause another notification to be sent to its own scheduler, which may wake up still another thread, and so on. Thus, waking up an arbitrary thread can cause a chain of wakeups to propagate back through the scheduler hierarchy. Eventually, this propagation may wake up a scheduler thread that is currently being supplied with CPU time but is donating it to some other thread. In that case, the thread currently running on that CPU is preempted and control is given back to the woken scheduler thread immediately; the scheduler can then decide to re-run the preempted client, switch to the newly-woken client, or even run some other thread. Alternatively, the propagation of wake-up events may terminate at some point, for example because a notified scheduler is already awake (not waiting for messages) but has been preempted. In that case, the dispatcher knows that the wake-up event is irrelevant for scheduling purposes at the moment, so the currently running thread is resumed immediately.

3.3 Relinquishing the CPU

At any time, a running thread may block to wait for one or more events to occur, such as I/O completion or timer expiration. When a thread blocks, the dispatcher returns control of the CPU to the scheduler thread that provided it to the running thread. That scheduler may then choose another thread to run, or it may relinquish the CPU to its own scheduler, and so on up the line until some scheduler finds work to do.

3.4 Voluntary Donation

Instead of simply blocking, a running thread can voluntarily donate its CPU to another thread while waiting on an event of interest; this is done in situations where priority inheritance would traditionally be used. For example, when a thread attempts to obtain a lock that is already held, it may donate its CPU time to the thread holding the lock; similarly, when a thread makes an RPC to a server thread, the client thread may donate its CPU time to the server for the duration of the request. When the event of interest occurs, the donation ends and the CPU is given back to the original thread. In our implementation of this framework, the basic synchronization and IPC primitives automatically invoke the dispatcher to perform voluntary donation appropriately when the thread blocks; however, voluntary donation could also be done optionally or through explicit dispatcher calls.
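As a concrete rendering of the lock case, a mutex built on this framework might look roughly as follows. This is a sketch under our own assumptions; donate_and_wait, try_acquire, and current_thread are hypothetical primitives, not the paper's API.

    #include <stddef.h>

    struct thread;                           /* opaque dispatcher-managed handle */
    struct mutex { struct thread *holder; }; /* NULL when the lock is free       */

    /* Hypothetical dispatcher call: donate the caller's CPU to `target`
     * until `event` is signaled, then resume the caller (Section 3.4). */
    extern void donate_and_wait(struct thread *target, void *event);
    extern int try_acquire(struct mutex *m);      /* hypothetical atomic acquire */
    extern struct thread *current_thread(void);

    void mutex_lock(struct mutex *m)
    {
        while (!try_acquire(m)) {
            /* The lock is held: rather than merely blocking, donate our
             * CPU time to the holder so it can finish its critical
             * section.  Generalized priority inheritance falls out of
             * this donation automatically. */
            donate_and_wait(m->holder, m);
        }
        m->holder = current_thread();
    }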

It is possible for a single thread to inherit CPU time from more than one source at a given time: for example, a thread holding a lock may inherit CPU time from several threads waiting on that lock, in addition to its own scheduler. In this case, the effect is that the thread has the opportunity to run at any time any of its donor threads would have been able to run. A thread only "uses" one CPU source at a time; however, if its current CPU source runs out (e.g., due to quantum expiration), the dispatcher will automatically send scheduling request notifications on behalf of all the threads depending on (donating to) the preempted thread, effectively switching the thread automatically to another available CPU source. One potential worry is that a thread consuming CPU time from many sources will cause an "avalanche effect" every time it is preempted or woken, as the dispatcher fires off several scheduling requests, each of which may cause more scheduling requests as intermediate-level schedulers are woken up. We believe that in practice it should be uncommon for a thread to inherit from more than one or two other threads at once, so this should not be a major problem; however, we have not yet examined this issue in detail.

3.5 The schedule operation

The call a scheduler thread makes to donate CPU time to a client thread is simply a special form of voluntary CPU donation, in which the thread to donate to and the event to wait for can be specified explicitly. In our implementation, the schedule operation takes as parameters a thread to donate to, a port on which to wait for messages from other client threads, and a wakeup sensitivity parameter indicating in what situations the scheduler should be woken. The operation donates the CPU to the specified target thread and puts the scheduler thread to sleep on the specified port; if a message arrives on that port, such as a notification that another client thread has been woken or a message indicating that a timer has expired, then the schedule operation terminates and control is returned to the scheduler thread.

In addition, the schedule operation may be interrupted before a message arrives, depending on the behavior of the thread to which the CPU is being donated and the value of the wakeup sensitivity parameter. The wakeup sensitivity level acts as a hint to the dispatcher, allowing it to avoid waking up the scheduler thread except when necessary; it is only an optimization and is not in theory required for the system to work. Our system supports the following three sensitivity levels:

[Figure 3: CPU donation chain. A CPU supplies time to root scheduler S0; S0 donates to thread T0 or to scheduler S1, and S1 in turn donates to its clients T1 and T2.]

• WAKEUP_ON_BLOCK: If the target of the schedule operation blocks without further donating the CPU, then the schedule operation terminates and control is returned to the scheduler immediately. For example, in Figure 3, if scheduler thread S1 has donated the CPU to thread T2 using this wakeup sensitivity setting, but T2 blocks and can no longer use the CPU, then S1 will receive control again. WAKEUP_ON_BLOCK is the "most sensitive" setting, and is typically used when the scheduler has other client threads waiting to run.

• WAKEUP_ON_SWITCH: If the client thread using the CPU (T2) blocks, control is not immediately returned to its scheduler (S1): the dispatcher behaves instead as if S1 itself had blocked, and passes control on back to its scheduler, S0. If T2 is subsequently woken up, then when S0 again provides the CPU to S1, the dispatcher passes control directly back to T2 without actually running S1. However, if a different client of S1, such as T1, wakes up and sends a notification to S1's message port, then S1's schedule operation will be interrupted. This sensitivity level is typically used when a scheduler has only one thread to run at the moment and doesn't care when that thread blocks or unblocks, but still wants to switch between different client threads manually: for example, the scheduler may need to start and stop timers when switching between client threads.

• WAKEUP_ON_CONFLICT: The scheduler is only awakened if a second client thread wakes up while the scheduler is already donating CPU to a client (e.g., if T1 wakes up while T2 is running). If T2 blocks, the scheduler blocks too; then, if any single client of scheduler S1 is subsequently woken, such as T1, the dispatcher passes control directly to the woken client thread without waking up the scheduler. At this weakest sensitivity level, the dispatcher is allowed to switch among client threads freely; the scheduler only acts as a "conflict resolver," making decisions when two client threads are runnable at the same time.
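Pulling these pieces together, the scheduler-side interface suggested by the text and by the call sites in Figure 4 below might be declared as follows. The concrete types are our guesses; only the parameter roles are taken from the paper.

    /* Hypothetical handle types for the prototype's IPC layer. */
    typedef struct port *port_t;      /* message port (or port set) */
    typedef struct msg msg_t;         /* received message buffer    */
    typedef struct thread *thread_t;  /* thread handle              */

    /* Wakeup sensitivity levels (Section 3.5). */
    enum wakeup_sensitivity {
        WAKEUP_ON_BLOCK,    /* return as soon as the target blocks         */
        WAKEUP_ON_SWITCH,   /* dispatcher may re-run a woken target itself,
                               but wakes us when a different client stirs  */
        WAKEUP_ON_CONFLICT  /* wake us only when two clients are runnable  */
    };

    /* Receive a message without donating the CPU. */
    extern int msg_rcv(port_t port, msg_t *msg);

    /* Donate the CPU to `target` and sleep on `port` until a message
     * arrives or the sensitivity condition fires; returns nonzero while
     * further messages are pending (matching its use in Figure 4). */
    extern int schedule(port_t port, msg_t *msg, thread_t target,
                        enum wakeup_sensitivity cond);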

    void fifo_scheduling_loop()
    {
        cur_thread = NULL;
        more_msgs = 1;
        for (;;) {
            if (more_msgs) {
                more_msgs = msg_rcv(fifo_pset, &msg);
            } else {
                /* Select the thread to run next. */
                if ((cur_thread == NULL) && !q_is_empty(&fifo_runq))
                    cur_thread = q_remove(&fifo_runq);

                /* Select wakeup sensitivity level. */
                cond = q_is_empty(&fifo_runq) ?
                       WAKEUP_ON_CONFLICT : WAKEUP_ON_BLOCK;

                /* Schedule and wait for messages. */
                if (cur_thread != NULL)
                    more_msgs = schedule(fifo_pset, &msg,
                                         cur_thread, cond);
                else
                    more_msgs = msg_rcv(fifo_pset, &msg);
            }

            /* Process the received message. */
            switch (msg.request_code) {
            case MSG_SCHED_REQUEST:
                /* A client thread wants to run. */
                q_enter(&fifo_runq, msg.thread_id);
                break;
            case MSG_SCHED_BLOCKED:
                /* Last thread gave up the CPU. */
                cur_thread = NULL;
                break;
            }
        }
    }

Figure 4: Example single-processor FIFO scheduler.

4 Implementing High-level Schedulers

This section describes how the basic CPU inheritance mechanism can be used to implement high-level scheduling policies and related features such as CPU usage accounting, processor affinity, and scheduler activations.

4.1 Single-CPU Schedulers

Figure 4 shows a simplified code fragment from a basic non-prioritized FIFO scheduler in our system. The scheduler keeps a queue of client threads waiting for CPU time, and successively runs each one using the schedule operation while waiting for messages to arrive on its port (e.g., notifications from newly-woken client threads). When there are no client threads waiting to be run, the scheduler uses the ordinary non-donating msg_rcv operation instead of the schedule operation in order to relinquish the CPU while waiting for messages. If there is only one client thread in the scheduler's queue, the scheduler uses the weaker WAKEUP_ON_CONFLICT sensitivity level when running it, indicating that the dispatcher may switch among client threads arbitrarily as long as only one client thread attempts to use the CPU at a time.

4.2 Timekeeping and Preemption

The simple FIFO scheduler above can be converted to a round-robin scheduler by introducing some form of clock or timer. For example, if the scheduler is the root scheduler on a CPU, then it might be directly responsible for servicing clock interrupts. Alternatively, the scheduler may rely on a separate "timer thread" to notify it when a periodic timer expires. In either case, a timer expiration or clock interrupt is indicated to the scheduler by a message sent to the scheduler's port. This message causes the scheduler to break out of its schedule operation and preempt the CPU from whatever client thread was using it. The scheduler can then move that client to the tail of the ready queue for its priority and give control to the next client thread at the same priority.
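For instance, a round-robin variant of Figure 4 might handle such a timer message as sketched below. MSG_TIMER_EXPIRED, timer_arm, and QUANTUM_TICKS are hypothetical names of ours, while cur_thread, fifo_runq, and q_enter come from Figure 4.

    /* Sketch: quantum-expiration handling that turns the FIFO scheduler
     * of Figure 4 into a round-robin scheduler.  Called from its
     * message-processing switch when a timer message arrives. */
    void handle_timer_message(void)
    {
        if (cur_thread != NULL) {
            q_enter(&fifo_runq, cur_thread); /* preempted client to the tail */
            cur_thread = NULL;               /* next iteration picks a new head */
        }
        timer_arm(QUANTUM_TICKS);            /* request the next periodic wakeup */
    }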
4.3 Multiprocessor Support

Since the example scheduler above contains only a single scheduler thread, it can schedule only a single client thread at once. Therefore, although it can be run on a multiprocessor system, it cannot take advantage of multiple processors simultaneously. For example, a separate instance of the FIFO scheduler could be run as the root scheduler on each processor; a client thread assigned to a given scheduler would then effectively be bound to its scheduler's CPU. Although this arrangement can be useful in some situations, e.g., when each processor is to be dedicated to a particular purpose, in most cases it is not what is needed.

In order for a scheduler to provide "real" multiprocessor scheduling to its clients, where different client threads can be dynamically assigned to different processors on demand, the scheduler itself must be multi-threaded. Assume for now that the scheduler knows how many processors are available and can bind threads to processors. (This is trivial if the scheduler is run as the root scheduler on some or all processors; we will show later how this requirement can be met for non-root schedulers.) The scheduler creates a separate thread bound to each processor; each of these scheduler threads then selects and runs client threads on that processor. The scheduler threads cooperate with each other using shared variables, e.g., shared run queues in the case of a multiprocessor FIFO scheduler.

Since a scheduler's client threads are supposed to be unaware that they are being scheduled on multiple processors, the scheduler exports only a single port representing the scheduler as a whole to all of its clients. When a client thread wakes up and sends a notification to the scheduler port, the dispatcher arbitrarily wakes up one of the scheduler threads waiting on that port. A good policy is for the dispatcher to wake up the scheduler thread associated with the CPU on which the wakeup is being done; this allows the scheduler to be invoked on the local processor without interfering with other processors unnecessarily.

If the woken scheduler thread discovers that the newly woken client should be run on a different processor (e.g., because it is already running a high-priority client but another scheduler thread is running a low-priority client), it can interrupt the other scheduler thread's schedule operation by sending it a message or "signal"; this corresponds to sending inter-processor interrupts in traditional systems.
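A minimal sketch of this arrangement, assuming hypothetical spinlock and queue primitives and reusing the schedule/msg_rcv declarations from the earlier sketch:

    /* Sketch of a multiprocessor FIFO scheduler (Section 4.3): one
     * scheduler thread per CPU, all draining one shared run queue.
     * spinlock_t, struct queue, and the q_* calls are hypothetical. */
    struct mp_sched {
        spinlock_t lock;        /* protects runq                        */
        struct queue runq;      /* shared ready queue                   */
        port_t port;            /* single port exported to all clients  */
    };

    void mp_scheduler_loop(struct mp_sched *s)   /* one instance bound per CPU */
    {
        msg_t msg;
        for (;;) {
            thread_t next = NULL;
            spin_lock(&s->lock);
            if (!q_is_empty(&s->runq))
                next = q_remove(&s->runq);
            spin_unlock(&s->lock);

            if (next != NULL)
                schedule(s->port, &msg, next, WAKEUP_ON_BLOCK);
            else
                msg_rcv(s->port, &msg);   /* idle: just wait for work */

            /* Message processing mirrors Figure 4, except that queue
             * updates must take the lock, since the run queue is shared
             * with the scheduler threads on the other CPUs. */
        }
    }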
4.3.1 Processor Affinity

Scheduling policies that take processor affinity into consideration[27, 28, 29] can be implemented by treating each scheduler thread as a processor and attempting to schedule a client thread from the same scheduler thread that previously donated CPU time to that client thread. Of course, this will only work if the scheduler threads themselves are consistently run on the same processor. Any processor affinity support in one scheduling layer will only work well if all the layers between it and the root scheduler also consider processor affinity; a mechanism to ensure this is described in the next section.

4.3.2 Scheduler Activations

In the common case, client threads "communicate" with their schedulers implicitly through notifications sent by the dispatcher on their behalf. However, there is nothing to prevent client threads from explicitly communicating with their schedulers through some additional interface. One particularly useful explicit interface is scheduler activations[3], which allows clients to determine initially, and later track, the number of actual processors available to them. Clients can then create or destroy threads as appropriate in order to make use of all available processors without creating superfluous threads that compete with each other uselessly on a single processor.

Since scheduler threads are notified by the dispatcher when a client thread blocks and temporarily cannot use the CPU (e.g., because the thread is waiting for an I/O request or a page fault to be serviced), the scheduler can notify the client in such a situation and give the client an opportunity to create a new thread to make use of the CPU while the original thread is blocked. For example, a client can create a pool of "dormant" threads, or "activations," which the scheduler knows about but normally never runs. If a CPU becomes available, e.g., because of another client thread blocking, the scheduler "activates" one of these dormant threads on the CPU vacated by the blocked client thread. Later, when the blocked thread eventually unblocks and requests CPU time again, the scheduler preempts one of the currently running client threads and notifies the client that it should make one of the active threads dormant again.

Scheduler activations were originally devised to provide better support for application-specific thread packages running in a single user-mode process. In an OS kernel that implements CPU inheritance scheduling, extending a scheduler to provide this support should be quite straightforward. However, in a multiprocessor system based on CPU inheritance scheduling, scheduler activations are also useful in the stacking of first-class schedulers. As mentioned previously, multiprocessor schedulers need to know the number of processors available in order to use the processors efficiently. As long as a base-level scheduler (e.g., the root scheduler on a set of CPUs) supports scheduler activations, a higher-level multiprocessor scheduler running as a client of the base-level scheduler can use the scheduler activations interface to track the number of processors available and schedule its clients effectively. (Simple single-threaded schedulers that only make use of one CPU at a time don't need scheduler activations and can be stacked on top of any scheduler.)
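In the multiprocessor scheduler sketched above, the activation protocol might reduce to two extra message cases, roughly as follows. The paper specifies only the behavior, not this interface; dormant_pop, notify_client, and the message codes are hypothetical.

    /* Sketch of activation handling in a scheduler's message loop. */
    void handle_activation_msgs(struct mp_sched *s, msg_t *msg)
    {
        switch (msg->request_code) {
        case MSG_CLIENT_BLOCKED: {
            /* A client's thread blocked: offer the vacated CPU to one of
             * the dormant "activation" threads the client registered. */
            thread_t act = dormant_pop(msg->client_id);
            if (act != NULL)
                q_enter(&s->runq, act);
            break;
        }
        case MSG_CLIENT_UNBLOCKED:
            /* The blocked thread is runnable again: queue it, and ask the
             * client to make one of its active threads dormant again. */
            q_enter(&s->runq, msg->thread_id);
            notify_client(msg->client_id, MAKE_THREAD_DORMANT);
            break;
        }
    }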

4.4 Timing

Most scheduling algorithms require a measurable notion of time in order to implement preemptive scheduling. For most schedulers a periodic interrupt is sufficient, although some real-time schedulers may need finer-grained timers whose periods can be changed at each quantum. With CPU inheritance scheduling, the precise nature of the timing mechanism is not important to the general framework; all that is needed is some way for a scheduler thread to be woken up after some amount of time has elapsed. In our implementation, schedulers can register timeouts with a central clock interrupt handler; when a timeout occurs, a message is sent to the appropriate scheduler's port, waking up the scheduler. The dispatcher automatically preempts the running thread if necessary and passes control to the scheduler so that it can account for the elapsed time and potentially switch to a different client thread.

4.4.1 CPU Usage Accounting

Besides simply deciding which thread to run next, schedulers often must account for the CPU resources consumed. CPU accounting information is used for a variety of purposes, such as reporting usage statistics to the user on demand, modifying scheduling policy based on CPU usage (e.g., dynamically adjusting thread priority), or billing a customer for the CPU time consumed by a particular job. As with scheduling policies, there are many possible CPU accounting mechanisms, each with different cost/benefit tradeoffs. The CPU inheritance scheduling framework allows a variety of accounting policies to be implemented by scheduler threads.

There are two well-known approaches to CPU usage accounting: statistical and time stamp-based[4]. With statistical accounting, the scheduler wakes up on every clock tick, checks the currently running thread, and charges the entire time quantum to that thread. This method is quite efficient, since the scheduler generally wakes up on every clock tick anyway; however, it provides limited accuracy. A variation that provides better accuracy at slightly higher cost is to sample the current thread at random points between clock ticks[21]. Alternatively, with time stamp-based accounting, the scheduler reads the current time at every context switch and charges the appropriate thread for the time since the last context switch. This method provides extremely high accuracy, but also imposes a higher cost due to lengthened context switch times, especially on systems on which reading the current time is expensive.

In the root scheduler on a processor, these methods can be applied directly. To implement statistical accounting, the scheduler simply checks which thread it ran last upon being woken up by the arrival of a timeout message. To implement time stamp-based accounting, the scheduler reads the current time each time it schedules a different client thread. The scheduler must use the WAKEUP_ON_BLOCK sensitivity level in order to ensure that it can check the time at each thread switch and that idle time is not charged to any thread.

For schedulers stacked on top of other schedulers, CPU accounting becomes a little more complicated, because the CPU time supplied to such a scheduler is already "virtual" and cannot be measured accurately by a wall-clock timer. For example, in Figure 3, if scheduler S1 measures T2's CPU usage using a wall-clock timer, then it may mistakenly charge against T2 time actually used by the high-priority thread T0, which has no knowledge of S1 because it is scheduled by the root scheduler S0. In many cases, this inaccuracy caused by stacked schedulers may be ignored in practice, on the assumption that high-priority threads and schedulers will consume relatively little CPU time; this assumption is similar to the one made in many existing kernels that hardware interrupt handlers consume little enough CPU time that they may be ignored for accounting purposes. In situations in which this assumption is not valid and accurate CPU accounting is needed for stacked schedulers, virtual CPU time information provided by base-level schedulers can be used instead of wall-clock time, at the cost of additional communication between schedulers. For example, in Figure 3, at each clock tick (for statistical accounting) or each context switch (for time stamp-based accounting), scheduler S1 could request its own virtual CPU time usage from S0 instead of checking the current wall-clock time. It then uses this virtual time information to maintain usage statistics for its clients, T1 and T2.

4.4.2 Effects of CPU Donation on Timing

As mentioned earlier, CPU donation can occur implicitly as well as explicitly, e.g., to avoid priority inversion when a high-priority thread attempts to lock a resource already held by a low-priority thread. For example, in Figure 5, scheduler S0 has donated the CPU to high-priority thread T0 in preference over low-priority thread T1. However, it turns out that T1 is holding a resource needed by T0, so T0 implicitly donates its CPU time to T1. Since this donation merely extends the scheduling chain, S0 is unaware that the switch occurred, and it continues to charge CPU time used to T0 instead of T1, the thread that is actually using the CPU.

[Figure 5: Implicit CPU donation from high-priority thread T0 to low-priority thread T1 to avoid priority inversion during a resource conflict.]

While it may seem non-intuitive at first, in practice this is often precisely the desired behavior; it stems from the basic rule that with privilege comes responsibility. While T0 is donating CPU to T1, T1 is effectively doing work on behalf of T0. Since T0's share of the CPU is being used to perform this work, the CPU time consumed must also be charged to T0, even if the work is actually being done by another thread. Put another way, charging T1 rather than T0 would allow the accounting system to be subverted. For example, if high-priority CPU time is "expensive" and low-priority CPU time is "cheap," then T0 could collude with T1 to use high-priority CPU time while being charged the low-priority "rate," simply by arranging for T1 to do all the actual work while T0 blocks on a lock perpetually held by T1. This ability to charge the "proper" thread for CPU usage even in the presence of priority inheritance is unnatural and difficult to implement in traditional systems, and therefore is generally not implemented by them; on the other hand, this feature falls out of the CPU inheritance framework automatically.

4.5 Threads with Multiple Scheduling Policies

Sometimes it is desirable for a single thread to be associated with two or more scheduling policies at once. For example, a thread may normally run in a real-time rate-monotonic scheduling class; however, if the thread's quantum expires before its work is done, it may be desirable for the thread to drop down to the normal timesharing class instead of simply stopping dead in its tracks.

Support for multiple scheduling policies per thread can be achieved in our framework even under a dispatcher that supports only a single scheduler association per thread, by creating one or more additional "dummy" threads which perpetually donate their CPU time to the "primary" thread. Each of these threads can have a different scheduler association, and the dispatcher automatically ensures that the primary thread always uses the highest-priority scheduler available, as described in Section 3.4. In situations in which this solution is not acceptable due to performance or memory overhead, the dispatcher could easily be extended to allow multiple schedulers to be associated with a single thread, so that when such a thread becomes runnable the dispatcher automatically notifies all of the appropriate schedulers.

Although it may at first seem inefficient to notify two or more schedulers when a single thread awakes, in practice many of these notifications never actually need to be delivered. For example, if a real-time/timesharing thread wakes up, finishes all of its work, and goes back to sleep again before its real-time scheduling quantum has expired (presumably the common case), then the notification posted to the low-priority timesharing scheduler at wakeup time will be canceled (removed from the queue) when the thread goes to sleep again, so the timesharing scheduler effectively never sees it.

5 Analysis and Experimental Results

We have created a prototype implementation of this scheduling framework and devised a number of tests to evaluate its flexibility, performance, and practicality.

5.1 Test Environment

In order to provide a clean, easily controllable environment, we implemented an initial prototype of this framework in a user-level threads package. The threads package supports common abstractions such as mutexes, condition variables, and message ports for inter-thread communication and synchronization. The package implements separate thread stacks with setjmp/longjmp, and the virtual CPU timer alarm signal (SIGVTALRM) is used to provide preemption and simulate clock interrupts. We used the virtual CPU timer instead of the wall-clock timer in order to minimize distortion of the results due to other activity in the host Unix system; in a "real" user-level threads package based on this scheduling framework, the normal wall-clock timer would probably be used instead. Our prototype allows threads to wait on only one event at a time; however, there is nothing about the framework that makes it incompatible with thread models in which threads can wait on multiple events at once[25].

Although implemented in user space, our prototype is designed to reflect the structure and execution environment of an actual OS kernel running in privileged mode. For example, the dispatcher itself is passive, nonpreemptible code executed in the context of the currently running thread, an environment similar to that of BSD and other traditional kernels. The dispatcher is cleanly isolated from the rest of the system, and supports scheduling hierarchies of unlimited depth and complexity. Our prototype schedulers are also isolated from each other and from their clients; the various components communicate with each other through message-based protocols that could easily be adapted to operate across protection domains using IPC.

Except where otherwise noted, all measurements were taken on a 100MHz Pentium PC with 32 megabytes of RAM, running FreeBSD 2.1.5.

5.2 Scheduler Configuration

The following experiments are based on the scheduling hierarchy shown in Figure 6, which is designed to reflect the activity present in a general-purpose environment. In this environment, the root scheduler is a nonpreemptive fixed-priority scheduler with a first-come-first-served policy among threads of the same priority. This scheduler arbitrates between three scheduling classes: a real-time rate-monotonic scheduler at the highest priority, a lottery scheduler providing a timesharing class, and a simple round-robin scheduler for background jobs. On top of the lottery scheduler managing the timesharing class, an application-specific third-level scheduler manages two threads using lottery scheduling (e.g., a Web browser managing Java applet threads). Finally, a second application-specific scheduler in the timesharing class schedules two cooperating threads using nonpreemptive FIFO scheduling.

[Figure 6: Multilevel scheduling hierarchy used for tests. The fixed-priority root scheduler serves a real-time class (rate-monotonic threads RM1 and RM2), a timesharing class (lottery thread LS1, a Web browser lottery-scheduling Java applet threads JAVA1 and JAVA2, and a non-preemptive FIFO scheduler running cooperating threads FIFO1 and FIFO2), and a background class (round-robin threads RR1 and RR2).]

5.3 Scheduling Behavior

The purpose of our first experiment is simply to verify that our framework works as expected, producing the same scheduling behavior as traditional single- and multi-class schedulers do in equivalent configurations. Figure 7 shows the scheduling behavior of the threads simulated in our scheduling hierarchy over a time period covering the following sequence of events:

1. A rate-monotonic thread, RM1, becomes runnable and starts consuming CPU time periodically using a fixed reservation of 50%.

2. A round-robin background thread, RR1, begins a long computation in which it consumes all the CPU time it can get. The rate-monotonic thread, RM1, is unaffected because the fixed-priority root scheduler always schedules the real-time scheduler in preference to the background scheduler.

3. A second round-robin background thread, RR2, becomes runnable at the same priority as RR1.

4. An application thread in the timesharing class becomes runnable, stealing all available CPU time from the background jobs for a short burst of time (e.g., a spreadsheet recalculation).

5. A second rate-monotonic thread, RM2, becomes runnable, consuming 25% of the total CPU time. The time available to the timesharing class is reduced accordingly, but RM1 remains unaffected.

6. The timesharing thread, LS1, finishes its burst of activity, allowing the background jobs to continue once again.

[Figure 7: Behavior of a multilevel scheduling hierarchy. Accumulated CPU usage (sec) of threads RM1, RM2, and LS1 over 120 clock ticks; events are numbered according to the description in Section 5.3.]

5.4 Modular Control

In order to demonstrate the modular control provided by the framework, we now consider the two third-level schedulers in our example configuration, which represent application-specific schedulers. The first, implementing lottery scheduling, simulates a web browser arbitrating between two downloaded Java applets. The browser can vary the ticket ratio allocated to each applet according to some policy, e.g., giving more CPU time to applets whose output windows are currently on-screen. The second application-level scheduler, a simple non-preemptive FIFO scheduler, represents an application that uses threads merely for programmatic purposes and does not need to maintain any notion of priority or timeliness among its threads. Using a non-preemptive scheduler in such an application eliminates unnecessary contention and context switches between application threads without giving up the benefits of preemptive scheduling in other parts of the system.

In this example, the two Java applet threads perpetually consume as much CPU time as possible, and each of the cooperating FIFO threads alternately runs for some time and then yields control to the other. Initially, the ticket ratio between the web browser and the cooperative application is 4:1, the ratio between the applet threads is 1:4, and each of the FIFO threads computes for approximately the same time before yielding to the other. Figure 8 shows the behavior of the system across four events:

1. At time 2000, the web browser changes the relative ticket ratio between the Java applet threads to 1:1, e.g., because the first has come on-screen. The amount of CPU time allocated to the cooperative application remains unchanged, however, demonstrating load insulation. Since the timesharing-class scheduler and the web browser's scheduler are both lottery schedulers, this example is equivalent to the use of two currencies in a single lottery scheduler[31].

2. The cooperative thread FIFO1 changes its computation so that it now consumes four times the amount of CPU time before yielding to FIFO2. The effective distribution of CPU time changes to 4:1, reflecting the fact that the FIFO scheduling policy makes no attempt at fairness. However, since the timesharing-class scheduler uses a proportional-share policy, the applications are insulated from each other and the web browser is unaffected. This example demonstrates load insulation between different scheduling policies.

3. The user changes the ticket allocation between the two applications to 1:1. Both sub-schedulers automatically adjust to the new allocation according to their individual scheduling policies.

4. Finally, the priority of RR1 is raised above that of RR2, causing it to receive all of the CPU time allocated to the round-robin scheduler while leaving the other scheduling domains unaffected.

[Figure 8: Demonstration of modular control of CPU allocation. Relative CPU time allocation (percent) for threads JAVA1, JAVA2, FIFO1, FIFO2, RR1, and RR2; space between the lines represents the percentage of CPU time used by a particular thread. Events occur at 2000-tick intervals and correspond to the description in Section 5.4.]

5.5 Avoidance of Priority Inversion

In the next experiment we demonstrate how priority inversion can be avoided in our framework even among threads running under different scheduling policies. For this experiment, we use rate-monotonic thread RM1, lottery-scheduled thread LS1, and background thread RR1. The rate-monotonic thread and the background thread together simulate a data acquisition application in which a high-priority real-time thread periodically receives data and stores it in a memory buffer, and a low-priority computational thread inspects that data in the background using spare CPU cycles. In our simulated environment, the high-priority thread wakes up every five clock ticks, acquires a shared mutex lock, and then releases it one clock tick later. The low-priority thread alternately acquires the lock and computes for up to three clock ticks, and then releases it for a varying amount of time. The lottery-scheduled thread, LS1, is a medium-priority thread representing other priority inversion-producing activity in the system; it alternately runs and sleeps for random amounts of time up to 20 ticks, but otherwise does not interact with the other threads.

Figure 9 shows a plot of the distribution of latencies experienced by the high-priority thread in attempting to acquire the shared mutex, with and without the voluntary donation protocol described in Section 3.4. When the inheritance protocol is in effect, maximum latency is bounded by the maximum amount of time that the low-priority thread runs with the lock held (three clock ticks), and does not reflect the much larger delays that would be imposed if LS1 could prevent RR1 from running while RM1 is waiting for it to release the shared lock. Thus, the framework provides correct real-time behavior even though all three threads run under different scheduling policies: it is not priority inheritance, but the more general CPU inheritance protocol, that makes this work.

[Figure 9: Priority inheritance between schedulers. Histogram of the number of occurrences of each mutex lock latency (clock ticks) for the real-time thread, with and without CPU donation on mutex contention.]

5.6 Multiprocessor Scheduling

To test the ability of our framework to support multiprocessor scheduling, we extended the test environment to simulate a multiprocessor. Since a real multiprocessor machine was not available for this test, we emulated one by time-slicing the BSD process at a fine granularity among several "virtual processors." Each processor has its own dispatcher and root scheduler thread, but the root scheduler threads cooperate as described in Section 4.3 to implement a single fixed-priority multiprocessor scheduler with a shared ready queue for each priority level. We tested the multiprocessor scheduler by implementing a simple parallel database lookup application, in which multiple threads repeatedly search for a randomly chosen item in a shared binary tree, locking and unlocking the nodes of the tree as they descend. We ran this application on 1, 2, 4, and 8-processor configurations with varying numbers of worker threads; in each case, the behavior and performance was exactly as would be expected from a comparable traditional scheduler. Although this experiment demonstrates that the framework extends to multiprocessors, we have not yet explored the effects of stacking multiprocessor scheduling policies, nor have we implemented scheduler activations, which we suspect will be critical to making CPU inheritance scheduling work well in many practical multiprocessor situations.

5.7 Scheduling Overhead

While the preceding experiments demonstrate the basic properties and potential usefulness of our scheduling framework, it remains to be seen whether it is efficient enough in practice.
5.7.2 Context Switch Costs

The second type of overhead in our framework is the cost of additional context switches to and from scheduler threads. Unfortunately, the cost of a context switch varies widely between different environments, from user-level threads packages in which context switches are almost free, to monolithic kernels in which context switches are several orders of magnitude more expensive. An obvious inference is that our framework is likely to be practical in some environments but not in others. Therefore, in order to gain a meaningful idea of how expensive our framework is likely to be in a given environment, we first measure the number of additional context switches produced, which varies with different applications and system loads but is not dependent on thread and protection boundary-crossing costs.

Table 2 shows the context switch statistics observed for each of several example applications: "Client/Server" is an application in which a number of client threads repeatedly invoke services on a smaller set of server threads; "Parallel Database" is the multiprocessor binary search application described in Section 5.6; "Real-time" is the data acquisition application from Section 5.5; finally, "General" is the test in Section 5.3 representing general-purpose computing activity. The table shows the number of times the dispatcher switched to each thread over the course of the test; user threads are shown separately from scheduler threads. The last row in the table shows the percentage of total context switches that were invocations of scheduler threads.

                             Client/  Parallel  Real-
                             Server   Database  time   General
    RM1                         57               322     101
    RM2                         19                        26
    RM3                         19
    LS1                         25               622      17
    JAVA1                       46
    FIFO1                        9
    RR1                        114      238      249       7
    RR2                          3      242               14
    RR3                                 234
    RR4                                 243
    User invocations           492      957     1193     165
    Root scheduler             262      956     1237     142
    Rate monotonic              43                 1      65
    Lottery scheduler           30                57       3
    Java thread scheduler        2
    FIFO scheduler               1
    Round-robin scheduler        8                 8       8
    Scheduler invocations      346      956     1303     218
    Total context switches     838     1913     2496     383
    Scheduler invoc. rate      41%      50%      52%     56%

    Table 2: Switch Costs for Simple Applications.

This percentage is consistently near 50% regardless of the application; this indicates that, on average, approximately one additional scheduler thread invocation can be expected in our framework for each ordinary context switch. However, note that this table is overly pessimistic because the "scheduler invocations" figures include all context switches to scheduler threads, not only those directly related to scheduling: in particular, they include context switches caused by explicit RPCs, which greatly inflate the counts for the root scheduler in our system because our root scheduler thread also provides message-based timers and other facilities to the rest of the system.

5.7.3 Overhead in Kernel Environments

Although we have not yet implemented our framework in a kernel environment, we can get some idea of how it would perform in such an environment based on the results shown above and a knowledge of how real-world applications exercise the scheduler. Table 3 shows system-wide statistics we gathered for several familiar applications running on FreeBSD 2.1.5, measured using BSD's vmstat command. gzip is a compute-intensive application compressing an 8MB file; gcc is a build of a 20,000-line program using the GNU C compiler; tar represents an I/O-intensive application copying 8MB of source files; and configure is an extremely I/O- and fork-intensive 3000-line Unix shell script. All tests were performed on a networked machine running in multi-user mode with a full complement of daemons but no other user activity, representative of a typical single-user workstation.

                             gzip    gcc    tar   configure
    Run time (sec)           26.4   35.3    9.6      26.0
    Context switches/sec       11     32     81       202
    Traps/sec                  10    562     22      3470
    System calls/sec           23    651    517      1807
    Device interrupts/sec     427    509   3337      1055

    Table 3: Scheduling-related statistics for test applications
    measured under FreeBSD 2.1.5 on a 100MHz Pentium PC.

For three of the applications in Table 3, gzip (best case), gcc (average), and configure (worst case), Figure 10 shows the overall application slowdown that would result if each normal process-to-process context switch were n microseconds more expensive, where n varies from 1 to 1000 (1 ms). This graph effectively shows the "tolerance" of a system to scheduling overhead in different situations: for example, supposing that typical users would tolerate no more than about a 2% slowdown for typical applications such as gcc, our framework (or any other scheduling mechanism) would have to add no more than about 300μs to the process-to-process context switch time if implemented in FreeBSD.

[Figure 10: Overall application slowdown (percent, 0-10) as a function of additional overhead per process-to-process context switch (1-1000 μs). Curves shown: Microkernel:configure (13000 csw/s), Microkernel:gcc (3500 csw/s), Microkernel:gzip (930 csw/s), FreeBSD:configure (202 csw/s), FreeBSD:gcc (32 csw/s), and FreeBSD:gzip (11 csw/s). In BSD, only switches between Unix processes count; in Mach, traps and system calls count as RPCs (two context switches each); in L4, device interrupts also count as RPCs. Arrows A and B mark the operating points discussed in the text.]

Suppose FreeBSD was changed so that all scheduling was done in user mode, adding approximately one additional context switch, due to a scheduler invocation, for each existing process-to-process context switch. Based on context switch times we measured using the lmbench benchmark suite[24], which are approximately 39μs on this machine, plus an additional 11μs to reflect the dispatcher's cost (Section 5.7.1), the overall overhead should still be negligible simply because FreeBSD does not context switch all that often (see arrow A on the graph). Furthermore, given BSD's monolithic design, we would in practice expect at least the root scheduler to be implemented in the kernel and only application-specific schedulers to be in user mode; this would reduce the cost even further.

Of course, microkernels have much less tolerance for scheduling overhead simply because they perform more context switches. For example, in Mach, traps and system calls can be expected to produce about two additional context switches each; in an extremely "purist" microkernel such as L4, in which even device drivers run in user mode, device interrupts would also add an additional two context switches each. Figure 10 therefore also shows a plot of scheduling overhead tolerance for a hypothetical L4-like microkernel, based on the numbers in Table 3 except with traps, system calls, and interrupts each counting as two additional context switches. For example, to keep the overhead under 2% for gcc in this system, the additional per-context-switch cost must be no more than about 6μs (see arrow B), which would be difficult even with L4's phenomenal 3.2μs round-trip RPC time[20].
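The curves in Figure 10 follow from simple arithmetic: the slowdown is the effective context-switch rate multiplied by the added cost per switch, with the effective rate computed by the counting rules given in the Figure 10 caption. The sketch below reconstructs the two arrows from the Table 3 numbers for gcc; the structure and names are our own illustration, not the paper's analysis code.

    /* Back-of-the-envelope model behind Figure 10.  Rates are events
     * per second from Table 3; counting rules are from the figure
     * caption: in BSD only process-to-process switches count, while in
     * an L4-like microkernel traps, system calls, and device interrupts
     * each add two context switches. */
    #include <stdio.h>

    struct workload {
        const char *name;
        double csw, traps, syscalls, intrs;   /* from Table 3 */
    };

    static double bsd_rate(const struct workload *w)
    {
        return w->csw;
    }

    static double ukernel_rate(const struct workload *w)
    {
        return w->csw + 2.0 * (w->traps + w->syscalls + w->intrs);
    }

    /* Slowdown in percent if every counted switch costs n_us extra. */
    static double slowdown_pct(double rate, double n_us)
    {
        return rate * n_us * 1e-6 * 100.0;
    }

    int main(void)
    {
        struct workload gcc_w = { "gcc", 32, 562, 651, 509 };

        /* Arrow A: FreeBSD with user-mode scheduling, 39us (lmbench)
         * plus 11us dispatch cost per switch: about 0.16% for gcc. */
        printf("FreeBSD gcc @ 50us: %.2f%%\n",
               slowdown_pct(bsd_rate(&gcc_w), 50.0));

        /* Arrow B: the L4-like microkernel reaches the 2% threshold
         * for gcc at only about 6us per switch; its effective rate,
         * 32 + 2*(562+651+509) = ~3500/sec, matches the legend. */
        printf("Microkernel gcc @ 6us: %.2f%%\n",
               slowdown_pct(ukernel_rate(&gcc_w), 6.0));
        return 0;
    }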

On the other hand, these "back of the envelope" calculations are pessimistic in several ways: first, an additional scheduler invocation should not be needed for every normal context switch, as explained in Section 5.7.2; second, in a microkernel that dispatches hardware interrupts to threads, we would expect at least a minimal fixed-priority root scheduler to be implemented in the kernel so that device driver threads can be scheduled without the help of user-level schedulers; and third, our prototype dispatcher is extremely unoptimized in terms of both its algorithm (it still causes many avoidable context switches) and its implementation (it takes much longer than necessary to compute the next thread to run). There should certainly be many points in the design space at which CPU inheritance scheduling is practical even for microkernels; we are currently working on a second prototype of the framework in our Fluke microkernel[10, 11] in order to evaluate the framework further in this light.

5.8 Code Complexity

As a final useful metric of the practicality of our framework, we measured our prototype's code size and complexity in terms of both raw line count and lines containing semicolons. The entire dispatcher is contained in a single well-commented file of 550 raw lines, 158 of which contain semicolons. Each of our example schedulers is around 200/100 lines long; of course, production-quality schedulers that handle processor affinity, scheduler activations, dynamic priority adjustments, and so on, would probably be significantly bigger.

6 Conclusion

In this paper we have presented a novel processor scheduling framework in which threads are scheduled by other threads. Widely different scheduling policies can coexist in a single system, providing modular, hierarchical control over CPU usage. Applications as well as the OS can implement customized local scheduling policies, and the CPU resources consumed are accounted for and attributed accurately. The framework also cleanly addresses priority inversion by providing a generalized form of priority inheritance that automatically works within and among multiple diverse scheduling policies. CPU inheritance scheduling extends naturally to multiprocessors, and supports processor management techniques such as processor affinity and scheduler activations. We have shown that this flexibility can be provided with negligible overhead in environments in which context switches are fast, and that the framework should be practical even in some kernel environments in which context switches are more expensive.

Acknowledgements

For their many thoughtful and detailed comments on earlier drafts we thank the anonymous reviewers and Kevin Jeffay, our shepherd, as well as the members of the Flux project. We are especially grateful to Jay Lepreau for his support and advice, to Kevin Van Maren for considerable help on the results and bibliography, and to Mike Hibler for spending an entire day reviving Bryan's machine after a complete disk failure on Friday the 13th just before the final paper deadline.

References

[1] M. Accetta, R. Baron, W. Bolosky, D. Golub, R. Rashid, A. Tevanian, and M. Young. Mach: A New Kernel Foundation for UNIX Development. In Proc. of the Summer 1986 USENIX Conf., pages 93–112, June 1986.

[2] A. Adl-Tabatabai, G. Langdale, S. Lucco, and R. Wahbe. Efficient and Language-Independent Mobile Programs. In Proc. of the ACM SIGPLAN Symp. on Programming Language Design and Implementation, pages 127–136, May 1996.

[3] T. E. Anderson, B. N. Bershad, E. D. Lazowska, and H. M. Levy. Scheduler Activations: Effective Kernel Support for the User-Level Management of Parallelism. ACM Trans. Comput. Syst., 10(1):53–79, Feb. 1992.

[4] D. L. Black. Scheduling and Resource Management Techniques for Multiprocessors. PhD thesis, Carnegie Mellon University, July 1990.

[5] A. C. Bomberger and N. Hardy. The KeyKOS Nanokernel Architecture. In Proc. of the USENIX Workshop on Micro-kernels and Other Kernel Architectures, pages 95–112, Seattle, WA, Apr. 1992.

[6] O.-J. Dahl. Hierarchical Program Structures. Structured Programming, pages 175–220, 1972.
[7] S. Davari and L. Sha. Sources of Unbounded Priority Inversions in Real-time Systems and a Comparative Study of Possible Solutions. ACM Operating Systems Review, 26(2):110–120, Apr. 1992.

[8] D. R. Engler, M. F. Kaashoek, and J. O'Toole Jr. Exokernel: An Operating System Architecture for Application-level Resource Management. In Proc. of the 15th ACM Symp. on Operating Systems Principles, pages 251–266, Copper Mountain, CO, Dec. 1995.

[9] R. B. Essick. An Event-based Fair Share Scheduler. In Proc. of the Winter 1990 USENIX Conf., pages 147–161, Washington, D.C., Jan. 1990.

[10] B. Ford, M. Hibler, and Flux Project Members. Fluke: Flexible μ-kernel Environment (draft documents). University of Utah. Postscript and HTML available under http://www.cs.utah.edu/projects/flux/fluke/html/, 1996.

[11] B. Ford, M. Hibler, J. Lepreau, P. Tullmann, G. Back, and S. Clawson. Microkernels Meet Recursive Virtual Machines. In Proc. of the Second Symp. on Operating Systems Design and Implementation, Seattle, WA, Oct. 1996. USENIX Assoc.

[12] D. B. Golub. Adding Real-Time Scheduling to the Mach Kernel. Master's thesis, University of Pittsburgh, 1993.

[13] J. Gosling and H. McGilton. The Java Language Environment: A White Paper. Technical report, Sun Microsystems Computer Company, 1996. Available as http://java.sun.com/doc/language_environment/.

[14] P. Goyal, X. Guo, and H. M. Vin. A Hierarchical CPU Scheduler for Multimedia Operating Systems. In Proc. of the Second Symp. on Operating Systems Design and Implementation, Seattle, WA, Oct. 1996. USENIX Assoc.

[15] N. Hardy. The KeyKOS Architecture. Operating Systems Review, Sept. 1985.

[16] G. J. Henry. The Fair Share Scheduler. AT&T Bell Laboratories Technical Journal, 63(8), Oct. 1984.

[17] Institute of Electrical and Electronics Engineers, Inc. IEEE Standard for Information Technology – Portable Operating System Interface (POSIX) – Part 1: System Application Programming Interface (API) – Amendment 1: Realtime Extension [C Language], 1994. Std 1003.1b-1993.

[18] E. D. Jensen. A Benefit Accrual Model of Real-Time. In Proc. of the 10th IFAC Workshop on Distributed Computer Control Systems, Sept. 1991.

[19] J. Kay and P. Lauder. A Fair Share Scheduler. Communications of the ACM, 31(1), Jan. 1988.

[20] J. Liedtke. On Micro-Kernel Construction. In Proc. of the 15th ACM Symp. on Operating Systems Principles, pages 237–250, Copper Mountain, CO, Dec. 1995.

[21] J. Liedtke. A Short Note on Cheap Fine-grained Time Measurement. ACM Operating Systems Review, 30(2):92–94, Apr. 1996.

[22] C. L. Liu and J. W. Layland. Scheduling Algorithms for Multiprogramming in a Hard-Real-Time Environment. Journal of the ACM, 20(1):46–61, Jan. 1973.

[23] J. M. McKinney. A Survey of Analytical Time-sharing Models. Computing Surveys, pages 105–116, June 1969.

[24] L. McVoy and C. Staelin. lmbench: Portable Tools for Performance Analysis. In Proc. of the 1996 USENIX Conf., Jan. 1996.

[25] Microsoft Corporation. Win32 Programmer's Reference, 1993. 999 pp.

[26] L. Sha, R. Rajkumar, and J. P. Lehoczky. Priority Inheritance Protocols: An Approach to Real-time Synchronization. IEEE Transactions on Computers, 39(9):1175–1185, 1990.

[27] M. S. Squillante and E. D. Lazowska. Using Processor-Cache Affinity Information in Shared-Memory Multiprocessor Scheduling. IEEE Transactions on Parallel and Distributed Systems, 4(2):131–143, 1993.

[28] J. Torrellas, A. Tucker, and A. Gupta. Evaluating the Performance of Cache-Affinity Scheduling in Shared-Memory Multiprocessors. Journal of Parallel and Distributed Computing, 24:139–151, 1995.

[29] R. Vaswani and J. Zahorjan. The Implications of Cache Affinity on Processor Scheduling for Multiprogrammed, Shared Memory Multiprocessors. In Proc. of the 13th ACM Symp. on Operating Systems Principles, pages 26–40, Oct. 1991.

[30] C. A. Waldspurger. Lottery and Stride Scheduling: Flexible Proportional-Share Resource Management. PhD thesis, Massachusetts Institute of Technology, Sept. 1995.

[31] C. A. Waldspurger and W. E. Weihl. Lottery Scheduling: Flexible Proportional-Share Resource Management. In Proc. of the First Symp. on Operating Systems Design and Implementation, pages 1–11, Monterey, CA, Nov. 1994. USENIX Assoc.

[32] C. A. Waldspurger and W. E. Weihl. Stride Scheduling: Deterministic Proportional-Share Resource Management. Technical Report MIT/LCS/TM-528, MIT Laboratory for Computer Science, June 1995.