Semaphores Aid Multiprocessor Designs

Semaphores afford significant flexibility for providing interprocessor synchronization and mutual exclusion.

By Ted Raineault

Multiprocessing designs that share memory among DSPs and other processors beg for some form of mutual exclusion or interprocessor synchronization. Such designs are becoming pervasive: DSPs are widely used in dense multiprocessor arrangements at the network edge, and systems-on-chip often include DSP cores to accelerate math-intensive computation. Although the DSP/BIOS kernel provides a standard, efficient, robust API for uniprocessor applications, designers sometimes encounter situations in which interprocessor synchronization mechanisms would be very useful. One method implements interprocessor semaphores by using DSP/BIOS.

Shared-memory semaphores are basic tools for interprocessor synchronization. Although self-imposed design constraints can often reduce synchronization requirements, semaphores offer multiprocessor system designers significant flexibility. In addition, a multitasking OS, like DSP/BIOS, is virtually invaluable for shared-memory multiprocessing systems.

Assume that two or more processors share a physical pool of memory, in the sense that each processor sees the memory as directly addressable. Indeed, many multiprocessor DSP systems are designed in this manner.

SHARED-MEMORY ARCHITECTURES

One common architecture uses a large region of single-port RAM shared by all devices, often including a host. Although arbitration issues complicate the hardware design for this architecture, software engineers appreciate a large shared-memory pool visible to multiple processors. When you can reduce bus contention and arbitration inefficiencies with appropriate data-transport strategies, software-related conveniences make this architecture an attractive option.

A second architecture uses dual-port RAM between processors. The downside here is the relatively high cost and small storage capacity of these devices; large banks of expensive dual-port RAM are seldom practical. However, in applications that use segmented data transport or small data sets for which small amounts of dual-port RAM are sufficient, this type of memory is very useful. Dual-port RAM is relatively fast; it's easy to design into a system; and, unlike FIFOs, it can store shared data structures used for interprocessor communication.

A word of caution: When processors have on-chip cache or systems use write posting, you must pay attention to shared-variable coherence. To prevent loss of coherence, you can, depending on the processor, disable the cache, use cache bypass, or flush the cache to ensure that a shared location is in a proper state. The cache control API in Texas Instruments' comprehensive Chip Support Library, for example, provides an excellent tool for managing cache subsystems. Solutions to write-post delay problems are system-specific.


Assume that two processors use a common shared-memory buffer to pass data or to operate cooperatively on a data set. In either case, one or more tasks on the processors might need to know the state of the buffer before accessing it, and possibly to block while the buffer is in use. As in the case of single-processor multitasking, a mutual exclusion mechanism to prevent inappropriate concurrent operations on the shared resource is needed. A quick review of mutual exclusion will help clarify multiprocessor issues.

Shared-resource management is a fundamental challenge of multitasking. A task (or process, or thread) needs the ability to execute sequences of instructions without interference so that it can manipulate shared data atomically. These sequences, known as critical sections, are bracketed by entry and exit protocols that satisfy four properties: mutual exclusion, absence of deadlock, absence of unnecessary delay, and eventual entry (no starvation). The focus here is mutual exclusion; the remaining properties are detailed in any number of textbooks and will be satisfied by the multiprocessor semaphore discussed below.

Relative to a shared resource, mutual exclusion requires that only one task at a time execute in a critical section. The entry and exit protocols use such mechanisms as polled flags (often called simple locks or spin locks) or more abstract entities, like blocking semaphores. Simple locks can be used to build protection mechanisms of greater complexity.

Introduced by Edsger Dijkstra in the mid-1960s, the semaphore is a system-level abstraction used for interprocess synchronization. The semaphore provides two atomic operations, wait (P) and signal (V), which are invoked to manipulate a nonnegative integer within the semaphore. The wait operation checks the value of the integer and either decrements it if it's positive or blocks the calling task. The signal operation, in turn, checks for tasks blocked on the semaphore and either unblocks a task waiting for the semaphore or increments the semaphore if no tasks are waiting. A binary semaphore, which has counter values limited to 0 and 1, can be used effectively by an application to guard critical sections.

You can implement a multiprocessor semaphore by placing its data structure in shared memory and using RTOS services on each processor to handle blocking. Before outlining an implementation, let's look at two aspects of semaphores that cause complications in a multiprocessor environment. One is low-level mutual exclusion to protect shared data within a semaphore, and the other is wake-up notification when a semaphore is released.

LOW-LEVEL MUTUAL EXCLUSION

At its core, a semaphore has a count variable and possibly other data elements that must be manipulated atomically. System calls use simple mutual exclusion mechanisms to guard very short critical sections where the semaphore structure is accessed. This arrangement prevents incorrect results caused by concurrent modification of shared data within the semaphore.

In a uniprocessor environment, interrupt masking is a popular technique used to ensure that sequential operations occur without interference. With this technique, interrupts are disabled on entrance to a critical section and re-enabled on exit. In a multiprocessor situation, however, interrupt masking isn't an option. Even if one processor could disable the interrupts of another (rarely the case), the second processor would still execute an active thread and might inadvertently violate mutual exclusion requirements.
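
To make the uniprocessor technique concrete, here is a minimal sketch assuming the DSP/BIOS HWI module's disable/restore pair; the shared variable and function names are illustrative only.

#include <std.h>
#include <hwi.h>

/* A counter shared by task code and ISRs on one processor. */
static volatile int sharedCount;

void incrementShared(void)
{
    Uns key;

    key = HWI_disable();   /* mask interrupts: enter the critical section        */
    sharedCount++;         /* read-modify-write completes without interference   */
    HWI_restore(key);      /* restore the previous interrupt state               */
}

Nothing in this sequence stops a second processor from touching sharedCount, which is exactly the limitation described above.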


A second technique uses an atomic test-and-set (or similar) instruction to manipulate a variable. This variable might be the semaphore count itself or a simple lock used to guard critical sections where semaphore data is accessed. Either way, a specialized instruction guarantees atomic read-modify-write in a multitasking environment.

Although this solution looks straightforward, test-and-set instructions have disadvantages in both uniprocessor and multiprocessor scenarios. One drawback is dependence on machine instructions, which vary across processors, provide only a small number of atomic operations, and are sometimes unavailable.

A second problem is bus locking: If multiple processors share a common bus that doesn't support locking during test-and-set, these processors might interleave accesses to a shared variable at the bus level while executing seemingly atomic test-and-set instructions.

A third problem concerns test-and-set behavior in multiport RAM systems: Even if all buses can be locked, simultaneous test-and-set sequences at different ports might produce overlapped accesses.

Now consider two approaches that are very useful in shared-memory scenarios. One relies on simple atomic hardware locks; the other is a general-purpose software solution known as Peterson's algorithm.

In shared-memory systems, hardware-assisted mutual exclusion can be implemented with special hardware flags found in multiport RAMs. Dual-port RAM logic prevents overlap of concurrent operations on the hardware flags, forcing them to maintain the correct state during simultaneous accesses. Also, because processors use standard read and write instructions to manipulate the flags, specialized atomic test instructions aren't required. However, this solution is still limited, as shared-memory systems often lack dedicated hardware flags.

Let's take one more step to arrive at a general-purpose, hardware-independent method.

PETERSON'S ALGORITHM

Peterson's algorithm, published in 1981, provides an elegant software solution to the n-process critical section problem and has two key advantages over test-and-set spin locks. One is that atomic test-and-set isn't required: the algorithm eliminates the need for special instructions and bus locking. The other is eventual entry: A task waiting for entry to a critical section won't starve in a typical scheduling environment. Although Peterson's algorithm looks deceptively simple, it's the culmination of many attempts by researchers to solve the critical section problem.

The pseudocode in Listing 1 shows the entry and exit protocols used to enforce two-process mutual exclusion. Note that Peterson adds a secondary turn variable. This variable prevents incorrect results caused by race conditions and also ensures that each waiting task will eventually enter the critical section.

We can easily imagine situations in which more than two processes try to enter their critical sections concurrently. Peterson's algorithm can, as noted, be generalized to n processes and used to enforce mutual exclusion for more than two tasks, and other n-process solutions, such as the bakery algorithm, are readily available in computer science textbooks. For reasons of clarity and brevity, the discussion here is limited to the two-process case. Pseudocode for the n-process Peterson's algorithm is available at www.electricsand.com.
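
Listing 1 itself isn't reproduced here, but a minimal C sketch of the classic two-process protocol follows. It assumes that lock[] and turn live in shared memory that both processors access uncached (or with coherence managed explicitly, as noted earlier); the function names are illustrative.

volatile int lock[2] = { 0, 0 };  /* lock[i] != 0: processor i wants the critical section */
volatile int turn = 0;            /* which processor must defer when both want in         */

void peterson_enter(int self)     /* self is 0 on one processor, 1 on the other */
{
    int other = 1 - self;

    lock[self] = 1;               /* announce intent to enter         */
    turn = other;                 /* give the other side priority     */
    while (lock[other] && turn == other) {
        ;                         /* spin until it's safe to proceed  */
    }
}

void peterson_exit(int self)
{
    lock[self] = 0;               /* leave the critical section       */
}

Each processor calls peterson_enter with its own ID before touching shared semaphore data and peterson_exit immediately afterward.
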
Now that we have a low-level mutual exclusion tool with which to safely manipulate shared data within a semaphore, consider the other key ingredient of semaphores: blocking. Assuming that each processor runs DSP/BIOS or another multitasking OS, we'll develop our wait operation using services that are already available on each individual processor. DSP/BIOS provides a flexible semaphore module (SEM) that we'll use in the implementation.

When the owner of a uniprocessor semaphore releases it with a signal system call, the local scheduler has immediate knowledge of the signal event and can unblock a task waiting for the semaphore. In contrast, a multiprocessor semaphore implies that the owner and the requestor can reside on different processors. Because a remote kernel has no implicit knowledge of signal calls to a local kernel, the remote kernel needs to be notified of local signal events in a timely manner. Our solution uses interprocessor interrupts to notify other processors of local activity involving a shared semaphore.


This implementation of a multiprocessor binary semaphore (MBS) assumes that the hardware supports interprocessor interrupts and that a task won't try to acquire a semaphore while a task on the same processor owns it. The latter restriction simplifies the example and can be easily removed with some additional design work.
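
The shared structure itself isn't listed in the article; one plausible C layout, placed at a shared-memory address known to both processors, might look like this (the type and field names are assumptions, not taken from Listing 2).

typedef struct MBS_Obj {
    volatile int count;       /* binary semaphore value: 1 = free, 0 = owned       */
    volatile int notify[2];   /* per-processor release-notification request flags  */
} MBS_Obj;

The Peterson lock and turn variables also live in shared memory but, as noted later, are kept distinct from the semaphore data they protect.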

USING SEMAPHORES

MBS_wait is invoked to acquire a shared-memory semaphore. If the semaphore is available, MBS_wait decrements it and continues. If the semaphore is already owned, the requesting task blocks within MBS_wait until a release notification interrupt makes the task ready to run. Once the interrupt occurs and higher-priority tasks have relinquished the CPU, the task waiting for the semaphore wakes up within MBS_wait and loops to retest it. Note that the task doesn't assume ownership immediately when unblocked. Because a remote task might reacquire the semaphore by the time the requestor wakes up, MBS_wait loops to compete for the semaphore again.

When MBS_wait determines that a semaphore is unavailable, it sets a notification request flag in the shared-semaphore data structure to indicate that the processor should be interrupted when the semaphore is released elsewhere in the system. To avoid a condition known as the lost wake-up problem, MBS_wait tests the semaphore atomically and sets the notification request flag if the semaphore is unavailable.

Code for the wait operation is divided into two distinct parts: MBS_wait, which contains the blocking code and is called by an application, and MBS_interrupt, which runs in response to the notification interrupt and posts a local signal to the task waiting on the semaphore. This arrangement is very similar to that of a device driver: The upper part of a driver suspends a task pending I/O service, and the interrupt-driven lower part wakes up the task.
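
Listing 2 isn't reproduced here, but the sketch below shows one way the three routines might look in C on top of DSP/BIOS, using the MBS_Obj layout and peterson_enter/peterson_exit sketched earlier plus a hypothetical notify_remote routine that raises the interprocessor interrupt. Treat it as an illustration of the scheme described in the text, not the article's actual listing.

#include <std.h>
#include <sys.h>
#include <sem.h>
#include <tsk.h>

#include "mbs_shared.h"         /* hypothetical header with the MBS_Obj layout sketched above */

/* Assumed helpers and handles (all illustrative): */
extern void peterson_enter(int self);  /* low-level mutual exclusion (see earlier sketch) */
extern void peterson_exit(int self);
extern void notify_remote(void);       /* raise the interprocessor interrupt              */
extern MBS_Obj *mbs;                   /* semaphore object in shared memory               */
extern SEM_Handle mbsSem;              /* local DSP/BIOS semaphore, created with count 0  */
#define MY_ID 0                        /* would be 1 when built for the other processor   */

void MBS_wait(void)
{
    Bool acquired = FALSE;

    while (!acquired) {
        TSK_disable();                  /* no task switch while inside the Peterson lock */
        peterson_enter(MY_ID);
        if (mbs->count > 0) {
            mbs->count--;               /* semaphore is free: take ownership             */
            acquired = TRUE;
        } else {
            mbs->notify[MY_ID] = 1;     /* request an interrupt when it is released      */
        }
        peterson_exit(MY_ID);
        TSK_enable();

        if (!acquired) {
            SEM_pend(mbsSem, SYS_FOREVER);  /* block until MBS_interrupt posts           */
        }                                   /* then loop and compete for the semaphore   */
    }
}

void MBS_signal(void)
{
    Bool wake = FALSE;

    TSK_disable();
    peterson_enter(MY_ID);
    mbs->count++;                       /* release the semaphore                         */
    if (mbs->notify[1 - MY_ID]) {       /* remote processor asked to be notified         */
        mbs->notify[1 - MY_ID] = 0;
        wake = TRUE;
    }
    peterson_exit(MY_ID);
    TSK_enable();

    if (wake) {
        notify_remote();                /* causes MBS_interrupt to run on the other side */
    }
}

void MBS_interrupt(void)                /* plugged into the notification interrupt       */
{
    SEM_post(mbsSem);                   /* ready the local task blocked in MBS_wait      */
}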


MBS_signal releases a semaphore by incrementing its value and posting an interrupt to the processor that requested release notification. These actions cause MBS_interrupt to execute on the remote processor where a task is blocked waiting for the semaphore. Note that this sequence of events varies slightly from that of the uniprocessor signal operation described earlier, in which the semaphore is incremented only if no tasks are blocked.

PSEUDOCODE

Now that we have a notion of the shared-semaphore architecture, let's look at the pseudocode describing the wait and signal operations, shown in Listing 2. General solutions for multiple tasks per processor and for a greater number of processors can be implemented with modified MBS operations using the n-process Peterson's algorithm.

Note that the critical sections enforced by Peterson's algorithm (Peterson entry and Peterson exit) are very short instruction sequences used to manipulate the semaphore data structure. The details of Peterson's algorithm aren't shown; they're implicit in the Peterson entry and exit operations. The lock and turn variables used in Peterson's algorithm are distinct from the semaphore data elements accessed in the critical sections.

The critical sections are preceded by DSP/BIOS TSK_disable calls to prevent task switching. A task switch during a critical section could cause another processor trying to enter the same critical section to spin indefinitely in Peterson entry if it tried to acquire the same semaphore. The critical sections should be executed as quickly as possible.

Also note that the example omits error checking, return values, and timeouts. The pseudocode is meant to highlight discussion topics rather than provide a detailed implementation.

Ted Raineault is the cofounder and technical director of Electric Sand Inc. in Poway, Calif. He previously worked as a sales executive for a DSP board company and has 12 years' experience as an embedded software engineer.
