An Efficient Semaphore Implementation Scheme for Small-Memory Embedded Systems*

Khawar M. Zuberi and Kang G. Shin

Real-Time Computing Laboratory, Department of Electrical Engineering and Computer Science, The University of Michigan, Ann Arbor, MI 48109-2122
{zuberi, kgshin}@eecs.umich.edu

Abstract

In object-oriented programming, updates to the state variables of objects (by the methods of the object) have to be protected through semaphores to ensure mutual exclusion. Semaphore operations are invoked each time an object is accessed, and this represents significant run-time overhead. This is of special concern in cost-conscious, small-size embedded systems - such as those used in automotive applications - where costs must be kept to an absolute minimum. Object-oriented programming can be feasible in such applications only if the OS provides efficient, low-overhead semaphores. We present a new semaphore implementation scheme which saves one context switch per semaphore operation in most circumstances and gives performance improvements of 18-25% over traditional semaphore implementation schemes.

1 Introduction

Real-time computing [1] today is no longer limited to large and expensive systems such as planetary exploration robots or the space shuttle. The sharp drop in microprocessor prices over recent years and the introduction of the microcontroller, which incorporates a microprocessor with peripherals like timers, memory, and I/O in a single package, have led to digital control now being used in much smaller and simpler embedded systems such as automotive control, cellular phones, and home electronics (camcorders, TVs, and VCRs).

These embedded systems are mass-produced, making low production costs one of the primary concerns in their design. Automotive applications alone account for millions of embedded systems produced annually. At these volumes, extra costs of even a few dollars per unit translate into a loss of millions of dollars overall, so the microcontrollers used in these cost-conscious applications are those which have been in production for several years and whose prices have dropped to a few dollars per unit. These microcontrollers have relatively slow processing cores (typically running at 10-30 MHz), small on-chip RAMs (about 32-64 kbytes, hence the name "small-memory" embedded systems), and all applications are in-memory (there are no disks/file systems in our target applications). This necessitates that any real-time operating system (RTOS) [2] used in these applications be both time-efficient and memory-efficient.

In this paper, we focus on OS support for object-oriented (OO) programming in embedded systems. OO design gives benefits such as reduced software design time and software re-use [3]. But with these benefits comes the extra cost of ensuring mutual exclusion when an object's internal state is updated. Semaphores¹ [4,5] are typically used to provide this mutual exclusion. Because semaphore system calls are invoked every time an execution thread enters or exits an object, it becomes essential that the RTOS provide efficient, low-overhead semaphores; otherwise, OO design will not be feasible for embedded applications because of high costs.

*The work reported in this paper was supported in part by the Advanced Research Projects Agency, monitored by the US Air Force Rome Laboratory under Grant F30602-95-1-0044, by the NSF under Grant MIP-9203895, and by the ONR under Grant N00014-94-1-0229. Any opinions, findings, and conclusions or recommendations are those of the authors and do not necessarily reflect the views of the funding agencies.

¹The optimization scheme presented in this paper applies equally well to both semaphores and mutexes. However, for simplicity, we concern ourselves only with semaphores in this paper.

Most research in the area of reducing synchronization overhead has focused on multiprocessors [6,7]. But our target architectures are either uniprocessor (as in home appliances) or very loosely-coupled distributed systems (as in automotive applications). Even with the latter, threads typically do not need to access remote objects, so our concern is only with improving task synchronization performance for a single processor. Previous work in this area has focused on either relaxing the semaphore semantics to get better performance [8] or coming up with new semantics and new synchronization policies [9]. The problem with this approach is that these new/modified semantics may be suitable for some particular applications, but usually they do not have wide applicability.

We took the approach of providing full semaphore semantics (with priority inheritance [10]), but optimizing the implementation of these semaphores by exploiting certain features of embedded applications. As a result, our semaphore scheme has wide applicability within the domain of embedded applications, while significantly improving performance over standard implementation methods for semaphores. We have implemented this new semaphore scheme in the EMERALDS (Extensible Microkernel for Embedded, ReAL-time, Distributed Systems) RTOS [11], which is being developed in the Real-Time Computing Laboratory at the University of Michigan to satisfy the specific memory and performance requirements of small-size embedded systems.

In the next section, we give a brief overview of OO programming as it pertains to embedded real-time systems, focusing on the OS support needed for OO programming. In Section 3, we describe our new implementation scheme. Section 4 discusses some limitations of the scheme and ways to overcome these limitations so that our scheme can be used in almost all embedded applications. Section 5 evaluates the performance of our new scheme, and we conclude with Section 6.

2 Objects and Semaphores in Embedded Real-Time Systems

An object is a collection of private state information (or data) and a set of methods which manipulate the data. Objects are ideal for representing real-world entities: the object's internal data represents the physical state of the entity (such as temperature, pressure, position, RPM, etc.) and the methods allow the state to be read or modified. These notions of encapsulation and modularity greatly help software design because various system components such as sensors, actuators, and controllers can be modeled by objects. Then, under the OO paradigm, real-time software is just a collection of threads of execution, each invoking various methods of various objects [12].

Conceptually, this OO paradigm is very appealing and gives benefits such as reduced software design time and software re-use. But practically speaking, these benefits come at a cost. The methods of an object must synchronize their access to the object's data to ensure mutual exclusion. Because object invocations occur very frequently, it is essential that any scheme used to achieve this synchronization be both memory-efficient and time-efficient; otherwise, OO design will be infeasible for small-memory embedded systems due to high costs.

2.1 Active and Passive Object Models

There are two fundamentally different ways for objects and execution threads to interact with each other, and this has some bearing on the type of synchronization scheme used to ensure mutual exclusion.

Under the active object model [13], one or more server threads are permanently bound to an object. When a client thread invokes a method, a server thread executes the method on behalf of the client. With the passive object model [13], objects do not have threads of their own. To invoke a method, a thread will enter the object, execute the method, and then exit the object.

From the point of view of synchronization, the active object model has an advantage if only one thread is assigned per object. Since only one thread is in the object at any time, there is no need to worry about mutual exclusion. But the active object model has several disadvantages. First of all, having a thread per object means that there will be a large number of threads in the system (anywhere from several tens to more than a hundred, depending on the application). Each thread needs its own stack, thread control block, etc., which makes the active object model very memory-inefficient. Moreover, each object invocation requires a context switch from the client thread to the server thread, so this model is time-inefficient as well.

With the passive object model, multiple threads can be inside the same object at one time, so they must synchronize their activities. Semaphores [4,5] are commonly used for this purpose (e.g., to provide the monitor construct [14]). Even though locking based on semaphores incurs time overhead, it is decidedly much more memory-efficient than the active object model.
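As an illustration of the passive model, the following sketch (ours, not from the paper) shows a passive object whose method is bracketed by semaphore operations. The call names acquire_sem() and release_sem() follow the EMERALDS primitives discussed later, but the struct layout and call signatures shown here are assumptions made only for illustration:

    /* A passive object: private state plus methods, with every
       method bracketed by semaphore lock/unlock calls.          */
    struct engine_obj {
        int   rpm;             /* private state                  */
        sem_t obj_sem;         /* one semaphore per object       */
    };

    void engine_set_rpm(struct engine_obj *o, int rpm)
    {
        acquire_sem(&o->obj_sem);   /* enter the object           */
        o->rpm = rpm;               /* update the internal state  */
        release_sem(&o->obj_sem);   /* exit the object            */
    }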

2.2 OO Design Under EMERALDS

For the above-stated reasons, we advocate the passive object model for embedded software design. Because a semaphore system call is made every time an object's method is invoked, semaphore operations (acquire_sem() and release_sem() calls under EMERALDS, used to lock and unlock semaphores, respectively) become some of the most heavily used OS primitives when OO design is used. This motivated us to investigate new and efficient schemes for implementing semaphore locking in EMERALDS, as described next.

3 An Efficient Semaphore Implementation Scheme

The first step in designing efficient semaphores is to look at the way semaphores are typically implemented in various systems, identify the distinct steps involved in locking/unlocking semaphores, and try to eliminate or optimize those steps which incur the greatest overhead. To do these optimizations, we will use characteristics peculiar to small-memory embedded applications.

3.1 Standard Semaphore Implementation

The standard procedure to lock a semaphore can be summarized as follows:

    if (sem locked) {
        do priority inheritance;
        add caller thread to wait queue;
        block;
    }
    lock sem;

If the semaphore happens to be already locked by some other thread, the thread making the semaphore lock system call is put on a wait queue and is blocked. It is unblocked as part of the semaphore release operation and it then proceeds to reserve the semaphore for itself.

If the caller is to block, priority inheritance [9,10] also takes place, under which the current lock holder thread's priority is increased to that of the caller thread (if the former is less than the latter). This is needed to avoid unbounded priority inversion [10]. If a high-priority thread Th calls acquire_sem() on a semaphore already locked by a low-priority thread Tl, the latter's priority is temporarily increased to that of the former. Without priority inheritance, a medium-priority thread Tm could get control of the CPU by preempting Tl while Th remains blocked on the semaphore, thus causing unbounded priority inversion. With priority inheritance, Tl will keep on running until it unlocks the semaphore. At that point, its priority goes back to its original value, but now Th is unblocked and it can continue execution.
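To make the steps above concrete, here is a rough C sketch of a standard acquire/release pair with priority inheritance (our illustration; the kernel data structures, the current pointer, and helpers such as wait_enqueue() and block_current() are assumptions, not the actual EMERALDS code):

    /* Standard semaphore lock with priority inheritance (sketch). */
    void acquire_sem(sem_t *s)
    {
        disable_interrupts();                    /* assume a uniprocessor kernel  */
        while (s->locked) {
            if (s->holder->prio < current->prio)
                s->holder->prio = current->prio; /* priority inheritance          */
            wait_enqueue(&s->wait_q, current);
            block_current();                     /* context switch away (C2)      */
        }
        s->locked = 1;
        s->holder = current;
        enable_interrupts();
    }

    void release_sem(sem_t *s)
    {
        disable_interrupts();
        s->locked = 0;
        current->prio = current->base_prio;      /* drop any inherited priority   */
        if (!wait_empty(&s->wait_q))
            unblock(wait_dequeue(&s->wait_q));   /* may switch back to waiter (C3)*/
        enable_interrupts();
    }
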
Figure 1: A typical scenario showing thread T2 attempting to lock a semaphore already held by thread T1. Tx is an unrelated thread which was executing while T2 was blocked. Conceptually, Tx can be T1.

First of all, notice that if the semaphore is free when acquire_sem() is called, then the semaphore lock operation has very little overhead². In fact, for this case, only one counter has to be incremented and some other variables updated.

    ²This is especially true in EMERALDS where system call overhead is comparable to subroutine call overhead even with full memory protection between processes [11].

The situation is very different when the semaphore is already locked by thread T1 when some thread T2 invokes the acquire_sem() call. Figure 1 shows a typical scenario for this situation. Thread T2 wakes up (after completing some unrelated blocking system call) and then calls acquire_sem(). This results in priority inheritance and a context switch to T1, the current lock holder. After T1 releases the semaphore, its priority returns to its original value and a context switch occurs to T2.

We observe that it is these context switches which are responsible for much of the overhead (as much as 40-50%) associated with locking and unlocking semaphores (see Section 5 for timing measurements).

Schedulability Analysis: In all critical real-time systems, an off-line guarantee is needed that the task workload is feasible and all execution deadlines will be met at run-time. Schedulability tests [15-17] are used for this purpose. The worst-case execution time of each task is first calculated and then the appropriate schedulability test is used to determine feasibility. The worst-case execution time for acquire_sem() occurs when the semaphore is already locked when the system call is made. This means that the context switches C2 and C3 shown in Figure 1 must be included when calculating worst-case task execution times.

Any scheme to make semaphores more efficient must target this worst-case scenario. The other scenario (the semaphore happens to be free when acquire_sem() is called) is quite efficient as is and is of no concern when calculating worst-case execution times, so, from now on, we focus on optimizing the worst-case scenario in which the semaphore is already locked by some thread when acquire_sem() is called.

3.2 Semaphore Implementation in EMERALDS

Going back to Figure 1, we want to eliminate context switch C2. Recall that the progression of events was as follows: T2 blocks (say, waiting for an event such as a message arrival; call this event E); some other threads execute; then event E occurs and T2 is unblocked. Now, the next blocking call T2 is to make is to acquire semaphore S. Under our scheme - as part of the blocking call just preceding acquire_sem() - we instrument the code (using a code parser described later) to indicate which semaphore T2 intends to lock (semaphore S in this case). When event E occurs and T2 is to be unblocked (Figure 2), the OS checks if S is available or not. If S is unavailable, then priority inheritance from T2 to the current lock holder T1 occurs right here. T2 is added to the waiting queue for S and it remains blocked, waiting for S. As a result, the scheduler picks T1 to execute - which eventually releases S - and T2 is unblocked as part of this release_sem() call by T1. Comparing Figure 2 to Figure 1, we see that context switch C2 is eliminated. The semaphore lock/unlock pair of operations now incurs only one context switch instead of two, resulting in considerable savings in execution time overhead (see Section 5 for performance results).

Figure 2: The new semaphore implementation scheme. Context switch C2 is eliminated.

Code Parser: In EMERALDS, all blocking calls take an extra parameter which is the identifier of the semaphore to be locked by the upcoming acquire_sem() call. This parameter is set to -1 if the next blocking call is not acquire_sem(). Semaphore identifiers are statically defined (at compile time) in EMERALDS, as is commonly the case in OSs for small-memory applications, so it is possible to write a parser which examines the application code and automatically inserts the correct semaphore identifier into the argument list of blocking calls just preceding acquire_sem() calls. Parser design issues are discussed further in Section 4.
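A minimal sketch of how the unblock path could perform this ahead-of-time check (our illustration; the next_sem field holding the hint, and helpers such as wait_enqueue() and make_runnable(), are assumptions about one possible structure, not the actual EMERALDS implementation):

    /* When event E unblocks thread t, consult the semaphore hint passed
       with t's last blocking call before making t runnable.              */
    void unblock_thread(thread_t *t)
    {
        sem_t *s = t->next_sem;           /* hint from the preceding blocking call */
        if (s != NULL && s->locked) {
            if (s->holder->prio < t->prio)
                s->holder->prio = t->prio; /* do priority inheritance right here   */
            wait_enqueue(&s->wait_q, t);   /* t stays blocked, waiting for s       */
            return;                        /* scheduler keeps running the holder   */
        }
        make_runnable(t);                  /* semaphore free, or no hint given     */
    }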

Schedulability Analysis for the New Scheme: From the viewpoint of schedulability analysis, there can be two concerns regarding the new semaphore scheme (refer back to Figure 2):

1. What if thread T2 does not block on the call preceding acquire_sem()? This can happen if event E has already occurred when the call is made.

2. Is it safe to delay execution of T2 even though it may have higher priority than T1 (by doing priority inheritance earlier than would occur otherwise)?

Regarding the first concern, if T2 does not block on the call preceding acquire_sem(), then a context switch has already been saved. In such a situation, T2 will continue to execute till it reaches acquire_sem(), and a context switch will occur there. What our scheme really provides is that a context switch is saved either on the acquire_sem() call or on the preceding blocking call. Where the savings actually occur at run-time does not matter for the calculation of worst-case execution times for schedulability analysis.

For the second concern, the answer is that yes, it is safe to let T1 execute earlier than it would otherwise. The concern here is that T2 may miss its deadline. But this cannot happen because, under all circumstances, T2 must wait for T1 to release the semaphore before T2 can complete. So from the schedulability analysis point of view, all that really happens is that chunks of execution time are swapped between T1 and T2 without affecting the completion time of T2. Another similar concern is that after event E, T2 may have to produce an output or send a message/signal to another thread (call it T3). Delaying T2 may cause T3 to miss its deadline. The answer to all such scenarios is that, as just discussed, T2 completes by its deadline (even though it may be delayed). As long as T2 completes by its deadline, no other thread that depends on T2 will miss its deadline, so schedulability of the task workload is not adversely affected.
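In worst-case execution time terms, the saving can be summarized as follows (a rough accounting of our own; t_PI, t_q, and t_cs denote the priority-inheritance, queue-manipulation, and context-switch costs and are not notation from the paper):

    C_lock/unlock(standard) ≈ t_PI + t_q + 2 * t_cs    (switches C2 and C3)
    C_lock/unlock(new)      ≈ t_PI + t_q + 1 * t_cs    (only C3 remains)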

4 Applicability of the New Scheme

There can be three circumstances under which our proposed semaphore scheme may not work:

1. The code parser is unable to identify which semaphore is to be locked next due to conditional constructs such as loops with a variable number of iterations or if-then-else statements.

2. The blocking call preceding an acquire_sem() is another acquire_sem(), so that only one context switch is saved between these two calls.

3. The lock holder T1 (Figure 2) blocks after event E but before releasing the semaphore. Then with standard semaphores, T2 will be able to execute, but under our scheme it cannot, which may lead to T2 missing its deadline.

In the rest of this section, we discuss how often (if at all) these scenarios can occur in embedded real-time systems, which specific forms they can take, and how these problems can be resolved.

4.1 Code Parser Issues

Most threads in embedded systems execute sensor-controller-actuator loops as shown in Figure 3. Each device (sensor or actuator) is represented by an object protected by its own semaphore. Each device may be a real sensor/actuator or a logical one representing several devices being controlled as one group. Note that the same devices are accessed each time the loop executes. The order in which semaphores are locked is fixed, so there is no ambiguity for the code parser.

    for (;;) {
        read sensor 1;
        read sensor 2;
        ...
        read sensor x;
        update actuator 1;
        update actuator 2;
        ...
        update actuator y;
        block till timer expiry
            or event occurrence;
    }

Figure 3: A typical sensor-controller-actuator loop commonly found in embedded control applications.

At run-time, the method which gets invoked on an object may depend on the input data:

    if (sensorReading > A)
        valve.open;
    else
        valve.close;

but this does not change the order in which semaphores are locked, because all methods of an object are protected by the same semaphore. In other words, most embedded applications are structured as in Figure 3, and for such a structure, the parser can easily determine which semaphore is to be locked after a given blocking call.

In case a blocking call occurs inside a loop followed by an acquire_sem() outside the loop, the argument to be passed for the semaphore identifier is calculated conditionally as follows:

    while (cond) {
        ...
        if (cond)
            sem = -1;
        else
            sem = S;
        some_blocking_call(..., sem);
        ...
    }
    ...
    acquire_sem(S);

This way, -1 is passed as the parameter for all but the last iteration of the loop. Again, this code can be automatically inserted by the code parser without the application programmer having to make any manual modifications to the code. Note that this scheme works as long as the condition cond does not depend on the blocking call or on code after the call. This is true for loops which execute for a fixed number of iterations, which is the most common case in embedded control systems. One example is code which steps a stepper motor x number of times. The value of x may depend on sensor readings, but it stays fixed while the loop executes.

Regarding loops with a variable number of iterations, our experience shows that such loops typically do not contain blocking calls in embedded real-time systems. A variable-iteration loop is used to wait for a condition to come true (such as a spin lock), but that is what blocking calls do as well (wait for a condition). The two may be combined if the result of the blocking call is uncertain (such as for condition variables with Mesa semantics used in general-purpose computing), but such a situation rarely occurs in embedded real-time systems.

4.2 Consecutive acquire_sem() Calls

Going back to Figure 3, the bodies of the methods invoked by the thread may contain blocking calls, especially condition variable and message-passing calls. In these calls, the parser will insert the identifier of the upcoming acquire_sem(). But if such calls are not present, then two or more acquire_sem() calls can occur with no other blocking call in between them. Then, only one context switch will be saved per pair of acquire_sem() calls. This leads to an interesting avenue for future research. Our scheme can be generalized so that the blocking call at the end of the control loop will not unblock until all the semaphores needed by the thread for execution become available. In other words:

    for (;;) {
        obj_1.method    // protected by sem S1
        obj_2.method    // protected by sem S2
        ...
        obj_n.method    // protected by sem Sn
        block(..., S1, S2, ..., Sn);
    }

This is somewhat similar to the Spring kernel's notion of reserving all resources a task needs before letting the task execute [18], but with an important difference: the Spring kernel executes tasks non-preemptively, while under our proposal, threads execute preemptively. This allows higher-priority threads to preempt a given thread (giving good schedulable utilization) while reducing the number of context switches seen by the thread waiting for resources (giving shorter execution times). However, advance reservation of all semaphores will increase scheduler complexity and may also adversely affect task schedulability. The impact of these issues on performance must be studied to determine the viability of this extension.

4.3 Blocking by the Lock Holder Thread

Going back to Figure 2, suppose the lock holder T1 blocks after event E but before releasing the semaphore. With standard semaphores, T2 will then be able to execute (at least, till it reaches acquire_sem()), but under our scheme, T2 stays blocked. This gives rise to the concern that with this new semaphore scheme, T2 may miss its deadline.

In Figure 2, T1 had priority less than that of T2 (call this case A). A different problem arises if T1 has higher priority than T2 (call it case B). Suppose semaphore S is free when event E occurs. Then T2 will become unblocked and it will start executing (Figure 4). But before T2 can call acquire_sem(), T1 wakes up, preempts T2, locks S, then blocks for some event. T2 resumes, calls acquire_sem(), and blocks because S is unavailable. The context switch is not saved and no benefit comes out of our semaphore scheme.

Figure 4: If a higher-priority thread T1 preempts T2, locks the semaphore, and blocks, then T2 incurs the full overhead of acquire_sem() and a context switch is not saved.

All these problems occur when a thread blocks while holding a semaphore. To resolve these problems, we first make a small modification to our semaphore scheme to change the problem in case B into the same problem as in case A. This leaves us with only one problem to address. Then, by looking at the larger picture and considering threads other than just T1 and T2, we can show that this problem is easily circumvented and our semaphore scheme works for all blocking situations that occur in practice, as discussed next.

Modification to the Semaphore Scheme: For the situation shown in Figure 4, we want to somehow block T2 when the higher-priority thread T1 locks S, and unblock T2 when T1 releases S. This will prevent T2 from executing while S is locked, which makes this the same as the situation in case A.

Recall that when event E occurs (Figure 4), the OS first checks if S is available or not before unblocking T2. Now, let us extend the scheme so that the OS adds T2 to a special queue associated with S. This queue holds the threads which have completed their blocking call just preceding acquire_sem() but have not called acquire_sem() yet. Thread T1 will also get added to this queue as part of its blocking call just preceding acquire_sem(). When T1 calls acquire_sem(), the OS first removes T1 from this queue, then puts all threads remaining in the queue into a blocked state. Then, when T1 calls release_sem(), the OS unblocks all threads in the queue. This way, T2 is prevented from executing while S is locked, which results in the same behavior as in case A. Also, if done properly, addition and removal of threads from this queue incurs very little overhead (about 5-7 μs on a 25 MHz MC 68040 without caches and just 1-2 μs with caches).
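The queue manipulation just described could look roughly like this (our sketch; the pending_q field and the helper names are assumptions used only to illustrate the modification, not the EMERALDS code):

    /* Each semaphore keeps a "pending" queue of threads that have finished
       the blocking call preceding acquire_sem() but have not locked yet.    */
    void acquire_sem(sem_t *s)
    {
        pending_remove(&s->pending_q, current);  /* caller is about to lock     */
        block_all(&s->pending_q);                /* other pending threads must
                                                    not run while s is locked   */
        /* ... normal lock path: wait queue, priority inheritance, etc. ...     */
    }

    void release_sem(sem_t *s)
    {
        /* ... normal unlock path: restore priority, wake one waiter, etc. ...  */
        unblock_all(&s->pending_q);              /* pending threads may run now  */
    }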

With this modification, the only remaining concern (for both cases A and B) is: if execution of T2 is delayed like this while other threads (of possibly lower priority) execute, then T2 may miss its deadline. This concern is addressed next.

Applicability under Various Blocking Situations: There can be two types of blocking:

• Wait for an internal event, i.e., wait for a signal from another thread after it reaches a certain point.

• Wait for an external event from the environment. This event can be periodic or aperiodic.

The first type of blocking is used by threads to synchronize with each other, and the second type is used to interact with the environment.

Blocking for Internal Events: The typical scenario for this type of blocking is for thread T1 to enter an object (and lock semaphore S), then block waiting for a signal from another thread Ts. Meanwhile, T2 stays blocked (Figure 5). The question is: is it safe to delay T2 like this even if Ts is lower in priority than T2? The answer is yes, because T2 cannot lock S till T1 releases it, and T1 will not release it till it receives the signal from Ts; so even though Ts may be lower in priority than T2, it is safe to let Ts execute earlier. This leads to T1 releasing S earlier than it would otherwise, which leaves enough time for T2 to complete by its deadline.

Figure 5: Situation when the lock holder T1 blocks for a signal from another thread Ts.

Blocking for External Events: External events can be either periodic or aperiodic. For periodic events, polling is usually used to interact with the environment and blocking does not occur. A common example is a periodic sensor-controller-actuator loop where sensors are read and actuator commands are updated periodically and no blocking calls are involved. One common exception is to block on a timer (usually, to wait for the current period to end), but this blocking call occurs at the end of the main loop of execution of the thread; it is not inside any object, and no semaphores are held by the thread when this call is made.

Blocking calls are used to wait for aperiodic events, but it does not make sense to have such calls inside an object. There is always a possibility that an aperiodic event may not occur for a long time. If a thread blocks waiting for such an event while inside an object, it may keep that object locked forever, preventing other threads from making progress. So the usual practice is to not have any semaphores locked when blocking for an aperiodic event.

In short, dealing with external events (whether periodic or aperiodic) does not affect the applicability of our semaphore scheme under the commonly-established ways of handling external events. But in case some application does require blocking for external events while inside an object, our semaphore scheme can be turned off by specifying -1 as the semaphore identifier in the blocking call just preceding acquire_sem(). This will cause EMERALDS' semaphores to behave just like standard-implementation semaphores, but we do not believe this will be needed very often, if at all.

5 Performance Evaluation

To measure the improvement in performance resulting from our new semaphore scheme, we implemented it under EMERALDS and measured performance on a 25 MHz Motorola 68040 processor [19].

When a thread enters an object, it first acquires the semaphore protecting the object, and when it exits the object, it releases the semaphore. The cumulative time spent in these two operations represents the overhead associated with synchronizing thread access to objects. To determine by how much this overhead is reduced when our scheme is used, we measured the time for the acquire/release pair of operations for both standard semaphores and our new scheme and then compared the two results. In the following, we first describe our evaluation procedure, then present the results.

5.1 The Test Procedure

We want to measure the worst-case overhead for acquire/release because this is what is used in schedulability analysis. The worst case occurs if

• the semaphore is already locked when acquire_sem() is called, and

• priority inheritance occurs.

To get this behavior, we use two threads in our tests, T1 and T2, with T2 having higher priority. For the standard semaphore implementation, the test proceeds as shown in Figure 6. T2 executes first and blocks waiting for a signal from T1. T1 executes, locks semaphore S, and signals T2, which is unblocked, goes on to execute acquire_sem(), and priority inheritance occurs. Thread T1 then releases S, its priority goes back to its original value, and a context switch occurs back to T2. We measure interval t1, which is the time for an acquire plus a release and includes the relevant context switches.

Figure 6: Test procedure for standard semaphores. Interval t1 is the overhead for acquire/release operations.

We repeated this test with the new semaphore scheme. Figure 7 shows the new sequence of events. In this case, priority inheritance is done by the OS when T1 signals T2, so T1 continues after the signal and unlocks S. T1's priority goes back to its original value, T2 is unblocked, and it goes on to lock S without needing any more context switches. Then the difference t2 - t3 (Figures 6 and 7) represents the improvement due to the new scheme, and t1 - (t2 - t3) is the overhead for acquire/release under the new scheme. Note that we cannot directly measure the acquire/release overhead for the new scheme because priority inheritance occurs well before the rest of the acquire operation.

Figure 7: Test procedure for the new semaphore scheme.
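The measurement harness can be pictured roughly as follows, following the standard-scheme sequence of Figure 6 (our sketch; signal_thread(), wait_for_signal(), read_timer(), and T2_ID are hypothetical primitives standing in for whatever signalling and timing facilities the RTOS provides):

    static sem_t S;                        /* semaphore under test (assumed API)  */
    static volatile unsigned long t_start, t_end;

    /* T1: lower-priority thread */
    void T1(void)
    {
        acquire_sem(&S);                   /* lock S before waking T2             */
        signal_thread(T2_ID);              /* T2 has higher priority              */
        /* T2 runs, calls acquire_sem(&S), blocks; control returns here           */
        release_sem(&S);                   /* inherited priority drops; back to T2 */
    }

    /* T2: higher-priority thread */
    void T2(void)
    {
        wait_for_signal();                 /* block until T1 holds S              */
        t_start = read_timer();
        acquire_sem(&S);                   /* spans inheritance, switch to T1,
                                              T1's release_sem(), and switch back */
        t_end = read_timer();              /* t_end - t_start corresponds to t1   */
        release_sem(&S);
    }
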
5.2 Experimental Results

EMERALDS uses dynamic thread scheduling³, so the context switch overhead depends on the number of threads in the system. Because our semaphore scheme eliminates one context switch, the improvement in performance depends on the number of threads in the scheduler queue. Experience shows that typical embedded applications have about 10-20 threads (anything more will consume too much memory for stacks, thread control blocks, etc.). For evaluation purposes, we chose a slightly wider range of thread counts, from 3 to 30. For each case, two of the threads are the T1 and T2 mentioned in Section 5.1, while the remaining threads just execute infinite loops and serve only to fill the scheduler queue.

    ³With priority inheritance, thread priorities change so often that it makes no sense to have fixed-priority scheduling.

First, we ran our tests on the MC 68040 with caches disabled (to simulate processors which do not have caches). Figure 8 shows the results for both the standard and the new semaphore implementation schemes. Since the context switch overhead is a linear function of the number of threads, the acquire/release times also increase linearly with the thread count. But the standard implementation's overhead involves two context switches while our new scheme incurs only one, which is why the measurements for the standard scheme have a slope twice that of our new scheme. For a typical thread count of 15 threads, our new scheme gives savings of about 35 μs over the standard implementation, and these savings grow even larger as the thread count increases.

We repeated our tests on the MC 68040 with both instruction and data caches enabled. The results are shown in Figure 9. Again, the results for our new scheme have a slope roughly half that of the standard scheme. But notice that the percent improvement in performance is greater with caches enabled than with caches disabled, as shown in Figure 10. The reason is that the context switch overhead is greater (relatively speaking) when caches are used because of the cache misses incurred when a new thread begins to execute. The old context is flushed out to main memory, the new context is fetched, and this increases the context switch overhead, which is why our scheme gives greater improvement over the standard implementation with caches enabled than with caches disabled.
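The measurements are consistent with a simple linear model (our own back-of-the-envelope reading of Figures 8 and 9, not an analysis from the paper): if t_cs(n) is the context-switch cost with n threads in the scheduler queue and c is the fixed cost of the semaphore bookkeeping, then roughly

    t_standard(n) ≈ c + 2 * t_cs(n)
    t_new(n)      ≈ c + 1 * t_cs(n)
    savings(n)    = t_standard(n) - t_new(n) ≈ t_cs(n)    (≈ 35 μs at n = 15, caches off)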

Figure 8: Performance measurements with caches disabled. The overhead for the standard implementation increases twice as rapidly as for the new scheme.

Figure 9: Performance measurements with caches enabled.

Figure 10: Percent improvement in performance due to our new semaphore scheme.

These results show that our new scheme improves performance by 10-40%, depending on the number of threads in the application and whether caches are used or not. Since most embedded applications have about 10-20 threads, they can expect improvements of about 18-25% (without caches) or 25-30% (with caches).

6 Conclusion

Embedded application programmers generally tend to avoid object-oriented programming, one reason being the high overhead associated with synchronizing thread access to objects. Semaphores must be used to ensure mutual exclusion when updating the state variables of objects, and this usually means a large enough overhead to make object-oriented programming infeasible for cost-conscious embedded applications.

In this paper, we presented a new semaphore implementation scheme which saves one context switch per semaphore acquire/release pair of operations (for most scenarios found in embedded applications) and improves performance by 18-25%. We used the fact that in small-size embedded applications, the identifiers of semaphores are fixed at compile time. Then, during run-time, we use these known identifiers to do ahead-of-time checks on the status of semaphores (whether they are available or not). If a semaphore is unavailable, we delay the execution of threads until the semaphore is released. This way, the semaphores are always available when threads actually make the acquire_sem() system call and the call does not block, saving one context switch.

Future work includes studying the advantages and disadvantages of extending our scheme so that instead of looking ahead only to the next acquire_sem() call, the scheduler will consider all the semaphores a thread may need to execute, so that all resource-conflict-related context switches are eliminated. Also, in this paper we focused only on improving the semaphore lock operation. In the future, we plan to investigate optimizations related to the release operation to get further improvements in synchronization overheads.

References

[1] K. G. Shin and P. Ramanathan, "Real-time computing: a new discipline of computer science and engineering," Proceedings of the IEEE, vol. 82, no. 1, pp. 6-24, January 1994.

[2] K. Ramamritham and J. A. Stankovic, "Scheduling algorithms and operating systems support for real-time systems," Proceedings of the IEEE, vol. 82, no. 1, pp. 55-67, January 1994.

[3] B. Meyer, Object-Oriented Software Construction, Prentice-Hall, 1988.

[4] E. W. Dijkstra, "Cooperating sequential processes," Technical Report EWD-123, Technical University, Eindhoven, the Netherlands, 1965.

[5] A. N. Habermann, "Synchronization of communicating processes," Communications of the ACM, vol. 15, no. 3, pp. 171-176, March 1972.

[6] J. Mellor-Crummey and M. Scott, "Algorithms for scalable synchronization on shared-memory multiprocessors," ACM Transactions on Computer Systems, vol. 9, no. 1, pp. 21-65, February 1991.

[7] C.-D. Wang, H. Takada, and K. Sakamura, "Priority inheritance spin locks for multiprocessor real-time systems," in 2nd International Symposium on Parallel Architectures, Algorithms, and Networks, pp. 70-76, 1996.

[8] H. Takada and K. Sakamura, "Experimental implementations of priority inheritance semaphore on ITRON-specification kernel," in 11th TRON Project International Symposium, pp. 106-113, 1994.

[9] H. Tokuda and T. Nakajima, "Evaluation of real-time synchronization in Real-Time Mach," in Second Mach Symposium, pp. 213-221, Usenix, 1991.

[10] L. Sha, R. Rajkumar, and J. Lehoczky, "Priority inheritance protocols: an approach to real-time synchronization," IEEE Trans. on Computers, vol. 39, no. 9, pp. 1175-1185, September 1990.

[11] K. M. Zuberi and K. G. Shin, "EMERALDS: A microkernel for embedded real-time systems," in Proc. Real-Time Technology and Applications Symposium, pp. 241-249, June 1996.

[12] Y. Ishikawa, H. Tokuda, and C. W. Mercer, "An object-oriented real-time programming language," IEEE Computer, vol. 25, no. 10, pp. 66-73, October 1992.

[13] R. S. Chin and S. T. Chanson, "Distributed object-based programming systems," ACM Computing Surveys, vol. 23, no. 1, pp. 91-124, March 1991.

[14] C. A. R. Hoare, "Monitors: An operating system structuring concept," Communications of the ACM, vol. 17, no. 10, pp. 549-557, October 1974.

[15] C. L. Liu and J. W. Layland, "Scheduling algorithms for multiprogramming in a hard real-time environment," Journal of the ACM, vol. 20, no. 1, pp. 46-61, January 1973.

[16] N. C. Audsley, A. Burns, and A. J. Wellings, "Deadline monotonic scheduling theory and application," Control Engineering Practice, vol. 1, no. 1, pp. 71-78, 1993.

[17] Q. Zheng and K. G. Shin, "On the ability of establishing real-time channels in point-to-point packet-switched networks," IEEE Trans. Communications, pp. 1096-1105, February/March/April 1994.

[18] J. Stankovic and K. Ramamritham, "The Spring Kernel: a new paradigm for real-time operating systems," ACM Operating Systems Review, vol. 23, no. 3, pp. 54-71, July 1989.

[19] M68040 User's Manual, Motorola Inc., 1992.
