USENIX Association

Proceedings of the FREENIX Track: 2003 USENIX Annual Technical Conference

San Antonio, Texas, USA June 9-14, 2003

THE ADVANCED COMPUTING SYSTEMS ASSOCIATION

© 2003 by The USENIX Association All Rights Reserved For more information about the USENIX Association: Phone: 1 510 528 8649 FAX: 1 510 548 5738 Email: [email protected] WWW: http://www.usenix.org Rights to individual papers remain with the author or the author's employer. Permission is granted for noncommercial reproduction of the work for educational or research purposes. This copyright notice must be included in the reproduced paper. USENIX acknowledges all trademarks herein. An Implementation of User-level Restartable Atomic Sequences on the NetBSD Gregory McGarry [email protected]

Abstract system[1]. While the kernel component of the model controls the switching of execution contexts, the user- This paper outlines an implementation of restartable level component is responsible for scheduling threads atomic sequences on the NetBSD operating system onto the available execution contexts. The user-level as a mechanism for implementing atomic operations component contains a complete scheduler implementa- in a mutual-exclusion facility on uniprocessor sys- tion with shared scheduling data which must be pro- tems. Kernel-level and user-level interfaces are dis- tected by the mutual-exclusion facility. Consequently, cussed along with implementation details. Issues as- the mutual-exclusion facility is a critical component of sociated with protecting restartable atomic sequences the scheduler activations threading model. from violation are considered. The performance of The basic building block for any mutual-exclusion fa- restartable atomic sequences is demonstrated to out- cility, whether it be a blocking or busy-wait construct, is perform syscall-based and emulation-based atomic op- the fetch-modify-write operation[5]. The fetch-modify- erations. Performance comparisons with memory- write operation reads a boolean flag that indicates own- interlocked instructions demonstrate that restartable ership of shared data. The operation modifies the flag atomic sequences continue to provide performance ad- from false to true, thereby acquiring ownership. The vantages on modern hardware. Restartable atomic se- fetch-modify-write operation must execute atomically quences are now the preferred mechanism for atomic (without interruption) to ensure the flag state is main- operations in the NetBSD threads library on all unipro- tained consistent across all contending threads. cessor systems. Modern systems generally provide sophisticated pro- 1 Introduction cessor primitives within the hardware in the form of memory-interlocked instructions and bus support to en- The NetBSD Project is currently adopting a new threads sure that a given memory location can be read, modi- system based on scheduler activations[6]. Part of this fied and written atomically. The specific primitive varies project is the implementation of a POSIX-compliant between processors, however common instructions in- threads library that utilises the scheduler activations in- clude test-and-set, fetch-and-store (swap), fetch-and- terface. The motivation for the threads project is to add, load-locked/store-conditional and compare-and- support the multi-threaded programming model which swap[5, 2]. Unfortunately, there are two problems asso- is becoming increasingly popular for application de- ciated with the use of memory-interlocked instructions velopment. Multi-threaded applications use multiple within a user-level threads library: threads to aid portability to multiprocessor systems, and as a way to manage server concurrency even when no • Memory-interlocked instructions incur overhead true system parallelism is available. To support multi- since the cycle time for an interlocked memory ac- threaded applications, the POSIX standard specifies a cess is several times greater than that for a non- mutual-exclusion facility to serialise access and guar- interlocked access. The overhead associated with antee consistency of shared data. Even on uniproces- memory-interlocked instructions is due to memory sor systems, mutual-exclusion facilities are necessary to and interconnection contention. protect shared data against an interleaved thread sched- • Not all processors support memory-interlocked in- ule. Interleaving can occur when a thread blocks on a structions and therefore cannot provide atomic op- resource or when a thread is preempted, causing another erations for a mutual-exclusion facility. Example thread to assume control of the processor. processors include the MIPS , VR4100 and The scheduler activations threading model also places ARM2 processors. Interestingly, processor manu- additional demand on the mutual exclusion facility. The facturers are choosing to introduce new processors primary advantage of scheduler activations is that it without memory-interlocked instructions. These combines the simplicity of a kernel-based threading processors generally provide a subset of the com- system and the performance of a user-level threading plete processor specification and primarily target

USENIX Association FREENIX Track: 2003 USENIX Annual Technical Conference 311 embedded or low-power applications. well on both uniprocessor and multiprocessor systems. Both of these problems are important to the NetBSD However, even for the case when no contention and Project due to the large number of hardware systems that no collisions occur, reservation-based algorithms require are supported. several memory accesses per atomic operation. Addi- On multiprocessor systems, the hardware is expected tionally, these algorithms have storage requirements that to provide the necessary atomic operations. Multi- increase linearly with the number of potential contend- processor atomic operations for synchronisation have ing threads. For a user-level threads library, it is not al- been extensively investigated by Mellor-Crummey and ways possible to know the maximum number of threads Scott[5]. However, the majority of systems supported likely to be used in an application. For these reasons, by the NetBSD operating system are uniprocessor sys- the software reservation mechanism was not considered tems. Therefore, it is desirable to find an alternate mech- further. anism for atomic operations which works efficiently on The mechanism of restartable atomic sequences has all uniprocessor systems. been proposed to address the problems outlined above The simplest mechanism for implementing an atomic for implementing atomic operations on uniprocessor operation is to request the kernel to perform the opera- systems[2]. The basic concept of restartable atomic se- tion of behalf of the user-level thread. A call to a spe- quences is that a user-level thread which is preempted cific system call can be used to enter the kernel. While within a restartable atomic sequence is resumed by the inside the kernel, the scheduler can be paused so that kernel at the beginning of the atomic sequence rather the thread is guaranteed not to be preempted while a than at the point of preemption. This guarantees that non-atomic fetch-modify-write operation is performed. the thread eventually executes the instruction sequence Alternatively, an invalid instruction can be executed by atomically. the thread to cause an exception which can be inter- This paper outlines the implementation of restartable cepted by the kernel to emulate an atomic fetch-modify- atomic sequences to provide the atomic operations re- write operation. An advantage of instruction emulation quired by the POSIX-compliant threads library on the is that the invalid instruction can be forward compati- NetBSD operating system. Section 2 discusses the con- ble with newer versions of the processor which do sup- cept of restartable atomic sequences. Section 3 presents port explicit atomic operations. However, both syscall- details of the implementation on the NetBSD operat- based and emulation-based solutions incur significant ing system. Section 4 compares the performance of overhead by entering the kernel. restartable atomic sequences with other mechanisms Atomic operations can also be implemented using for implementing atomic operations including memory- a software reservation mechanism. With the software interlocked instructions, instruction emulation and a reservation mechanism, a thread must register its intent syscall-based mechanism. Section 4 also examines the to perform an atomic operation and then wait until no cost associated with restartable atomic sequences. Sec- other thread has registered a similar intent before pro- tion 5 discusses the application of restartable atomic ceeding. The most widely recognised algorithm based sequences in the POSIX-compliant threads library and on software reservation is Lamport’s mutual-exclusion presents some benchmarks of the mutual-exclusion fa- algorithm[3]. cility in the threads library. Finally, Section 6 presents In Lamport’s algorithm, the atomic operation is pro- conclusions. tected by two variables; one to indicate ownership of the atomic operation and another to place reservations. Each 2 Restartable Atomic Sequences thread registers in a global array its intent to perform an Restartable atomic sequences is a mechanism to imple- atomic operation. The thread then places a reservation ment atomic operations on uniprocessor systems. Re- into the reservation variable before testing the ownership sponsibility for executing the instruction sequence atom- variable. If a thread determines that ownership is held by ically is shared between the user-level thread and the another thread, there is contention, and the thread must kernel. When a thread is preempted within a restartable wait until ownership is relinquished. When the thread atomic sequence, it is resumed by the kernel at the be- determines that ownership has been relinquished, it as- ginning of the atomic sequence rather than at the point signs its ownership to the atomic operation then checks of preemption. that it still holds the reservation. If another thread holds Most mechanisms used to implement atomic opera- the reservation then a collision has occurred. The thread tions on uniprocessor systems can be called pessimistic. then waits for all contending threads to acquire owner- Their design assumes that atomicity may be violated ship and remove their intent from the global array. The at any moment and guards against this potential viola- thread then restarts the procedure from the beginning. tion for every atomic operation. Restartable atomic se- The software reservation mechanism works equally quences use an optimistic mechanism that assumes that

312 FREENIX Track: 2003 USENIX Annual Technical Conference USENIX Association atomic sequences are rarely preempted and are inexpen- sive for this common case. Only when an atomic se- int test_and_set(int *addr) quence is not executed atomically is it necessary to per- { form a recovery action to ensure atomicity. int old; Restartable atomic sequences require kernel support to ensure that a preempted thread is resumed at the be- __asm __volatile("__ras_start:"); ginning of the sequence. An application registers the ad- old = *addr; dress range of the atomic sequence with the kernel. If a *addr = 1; user-level thread is preempted within a registered atomic __asm __volatile("__ras_end:"); sequence, then its execution is rolled-back to the start of the sequence by resetting the program counter saved in return (old); its process control block. } Consider the test-and-set atomic operation in Figure 1 which reads the memory address, writes to the memory address and returns the read value. The memory address Figure 1: Implementation of a test-and-set atomic oper- could be a flag in a mutual-exclusion operation. If the ation protected as a restartable atomic sequence. test-and-set operation executes atomically, two condi- tions can occur. If *addr is unset, then it is set and *addr is unset, old is unset. The thread is then pre- the function returns the unset value. If *addr is set, empted. At this point the new thread will read *addr then it remains unchanged and the function returns the as unset, and set *addr. When the original thread set value. resumes, its execution is rolled-back to ras start Now consider what will happen if the test-and-set which corresponds to the instruction to read *addr. operation is preempted between the memory read and The value of old now corresponds to the correct value write operations and a collision occurs when accessing of *addr set by the other thread. The value of *addr *addr. Again, two conditions occur. If *addr is un- remains unchanged and the correct value is returned by set, old is unset. The thread is then preempted. At this the test-and-set operation. The other collision condition point the new thread will read *addr as unset, and set occurs analogously. *addr. When the original thread resumes, its recorded Restartable atomic sequences should adhere to the fol- value of old no longer reflects the true value of *addr lowing requirements: and will also set *addr. Both threads will have the test- and-set operation return success to setting the memory 1. have a single entry point; address. If this condition occurred while the test-and-set 2. have a single exit point; operation was being used in a mutual-exclusion oper- 3. restrict modifications to global/shared data; ation, then both threads would assume ownership of a 4. not execute emulated instructions or invoke system shared resource. calls; and The other collision condition occurs if *addr is read 5. not invoke any functions. as set before the thread is preempted. The new thread The first requirement is to ensure that the roll-back of clears *addr. When the original thread resumes, its the program counter is valid. Nevertheless, the ker- recorded value of old no longer reflects the true value of nel cannot guarantee that the sequence is successfully *addr, it will set *addr but return the unset value. In restartable; it assumes that the application knows what it this condition, the original thread believes that the mem- is doing. The second and third requirements are linked. ory address was already set, while the second thread Access to global/shared data should be restricted to en- has cleared the memory address. If this condition oc- sure that the restartable atomic sequence is idempotent. curred while the test-and-set operation was being used in An operation is idempotent if it achieves the same result a mutual-exclusion operation, then neither thread would irrespective of the number of times it executes. Restrict- assume ownership of the shared resource. Neither thread ing the restartable atomic sequence to a single global would be able to reacquire ownership of the resource and write in the last instruction of the restartable atomic se- deadlock would occur. quence ensures that the operation is idempotent. Ac- Atomicity of the test-and-set operation is assured cordingly, a single exit point is desirable. However, if by making the adjacent memory accesses between the atomic operation chooses not to modify any global ras start and ras end a restartable atomic se- data, the restartable atomic sequences may be exited at quence. Now consider what will happen if the test-and- any point. set operation is preempted between the memory read The fourth requirement is to ensure that the kernel is and write operations. Again, two conditions occur. If not entered and thus providing an opportunity for the

USENIX Association FREENIX Track: 2003 USENIX Annual Technical Conference 313 thread to block on a resource and be preempted. In this be identified outside the context of the executing thread. case, the kernel will always roll-back execution to the Additionally, the handling of asynchronous events is a beginning of the restartable atomic sequence whenever machine-dependent operation and would require inva- the thread is unblocked, and the thread will never exit sive changes to the machine-dependent kernel. For ex- the restartable atomic sequence. The fifth requirement ample, on architectures such as and m68k, it is diffi- is to ensure that the program counter remains within the cult to provide a central location to check if the program range of the registered atomic sequence. counter is within a restartable atomic sequence since in- There are two run-time costs associated with terrupts are dispatched via an interrupt table. restartable atomic sequences. Because the kernel iden- Another important consideration is how registered tifies restartable atomic sequences by an address range, restartable atomic sequences are handled by the fork and restartable atomic sequences cannot be inlined. The in- exec system calls. Restartable atomic sequences are in- ability to inline atomic sequences slightly increases the herited from the parent by the child during the fork sys- overhead of atomic operations due to the cost of subrou- tem call. This allows restartable atomic sequences to tine calls. The second run-time cost comes from check- continue to work on children of the parent process that ing the program counter at each context switch. Al- registered the sequences. Restartable atomic sequences though this test can add several cycles to the kernel’s are removed during the exec system call. This property is context switch path, many applications use many more intuitive given that the program text has changed and the atomic operations than the number of context switches, instruction sequence for a registered atomic sequence is making the additional scheduling overhead negligible. different. The implementation supports the registration of mul- 3 Implementation tiple restartable atomic sequence for a process. A list The NetBSD operating system is well-known for its of address ranges is maintained for each process. A portability and support for many architectures. The per-process limit of the maximum number of registered portability of the operating systems stems from a restartable atomic sequences is imposed to limit resource clear separation of machine-dependent and machine- exhaustion. independent subsystems. An implementation of 3.1 Kernel interface restartable atomic sequences clearly requires access to machine-dependent information such as the thread pro- All atomic sequences for a process are manipulated by gram counter. However, much of the implementa- the rasctl system call. Its prototype can be found in tion can be shared between all architectures and is and is implemented within the standard machine-independent. Additionally, the user-level inter- C library. It has a prototype given by face should be uniform across all architectures. Con- int sequently, the primary objective of this implementation rasctl(void *addr, size_t len, int op) was to provide a generic interface to be utilised by all supported architectures. To that end, the implementation The prototype is intended to be similar to the mmap consists of a simple system-call interface and a largely system call, since both system calls affect the process machine-independent kernel implementation. address space. If a restartable atomic sequence is reg- Initially, the implementation was intended to support istered and the process is preempted within the range atomic operations only for use by the threads library. addr and addr+len, then the process is resumed at However, it was recognised that a generic user-level in- addr. The operations that can be applied to a restartable terface would allow applications to find new and innova- atomic sequence are specified by the op argument. Pos- tive uses of restartable atomic sequences. For example, sible operations are: benchmark utilities, performance counters, and profil- • ing tools may make use of restartable atomic sequences RAS INSTALL: register a new atomic sequence; • to ensure that an analysed instruction sequence is exe- RAS PURGE: remove a registered atomic sequence cuted without interruption. Supporting new and innova- for this process; and • tive uses seemed like a worthwhile goal. RAS PURGE ALL: remove all registered atomic se- The initial design decision was that only thread- quences for this process. synchronous events will cause an atomic sequence to The operation affects the restartable atomic sequence be restarted and asynchronous events, such as inter- immediately. rupts, do not cause a restart. Therefore, a restartable The purge operation should be considered to have un- atomic sequence will only be restarted if the thread ex- defined behaviour if there are any other runnable threads ecution context is switched from the processor. Asyn- in the process which might be executing within the reg- chronous events are difficult to protect, since they must istered atomic sequence at the time of the purge. The

314 FREENIX Track: 2003 USENIX Annual Technical Conference USENIX Association restartable atomic sequences. The ras lookup() #include function is machine-independent and has the function prototype extern void __ras_start(void); extern void __ras_end(void); caddr_t ras_lookup(struct proc *p, ... caddr_t addr) if (rasctl((void *)__ras_start, (size_t)(__ras_end - __ras_start), It searches the registered restartable atomic sequences RAS_INSTALL)) for process p which contains the user address addr.If errx(1, "rasctl failed"); the address addr is found within a restartable atomic sequence, then the restart address of the restartable atomic sequence is returned, otherwise -1 is returned. Figure 2: The registration process for the restartable In the case of a match, the machine-dependent code in atomic sequence presented in Figure 1. cpu switch() uses the start address to reset the pro- gram counter in the process control block. The RAS INSTALL and RAS PURGE operations of application must be responsible for ensuring that there the rasctl system call invoke the ras install() and is some form of coordination between threads. ras purge() functions. They have the prototypes The rasctl system call will fail with EINVAL if an invalid operation is specified, if addr or addr+len int are invalid user addresses, or if the maximum number ras_install(struct proc *p, of restartable atomic sequences per process is exceeded. caddr_t addr, size_t len) It will return ESRCH if the restartable atomic sequence int cannot be found during a purge operation. ras_purge(struct proc *p, Figure 2 shows an example of registering the caddr_t addr, size_t len) restartable atomic sequence for the test-and-set atomic operation presented in Figure 1. The ras install() function will return EINVAL 3.2 Kernel implementation if addr or addr+len are invalid user addresses, or if the maximum number of restartable atomic sequences All registered restartable atomic sequences for a pro- per process is exceeded. The ras purge() function cess are recorded in a linked list, p raslist, located will return ESRCH if the specified restartable atomic se- in struct proc of the process. Each element in the quence has not been registered. linked list records the start address and end address of The ras fork() function is used to copy all regis- a single registered restartable atomic sequence. Ad- tered restartable atomic sequences for a process to an- ditionally, a counter is available to record the number other. It is primarily during the fork system call when of restarts actioned for the restartable atomic sequence. the sequences are inherited from the parent by the child. This counter can provide some interesting information It has the prototype but is rarely useful for most applications. Currently there is not a user-level interface to access the counter. int A counter, p nras in struct proc records the ras_fork(struct proc *p1, number of registered atomic sequences and is used struct proc *p2) to simplify the program-counter check. The program counter is checked within the cpu switch() func- The ras purgeall() function is used to remove tion. The cpu switch() function is a machine- all registered restartable atomic sequences for a process. dependent function which is responsible for switching It is primarily used to remove all registered restartable the context of the active thread on the processor. The atomic sequences for a process during the exec system cpu switch() function has a pointer to the proc call and to perform the RAS PURGE ALL operation for structure passed as the first argument, and a check the rasctl system call. It has the prototype if p nras is non-zero is an inexpensive test. The machine-dependent implementation code on the i386 int adds merely three instructions to the main execution ras_purgeall(struct proc *p) path for the case of no registered atomic sequences. If p nras is non-zero, then ras lookup() is invoked The ras fork() and ras purgeall() functions to compare the program counter with all registered are guaranteed to complete successfully.

USENIX Association FREENIX Track: 2003 USENIX Annual Technical Conference 315 3.3 Additionalkernelissues 4Performance Evaluation Restartable atomic sequences are user-level instruction In this section, the performance of restartable atomic se- sequences that receive special consideration by the ker- quences is compared with the competing mechanisms. nel. In addition to checking if the program counter is The test-and-set atomic operation based on restartable within a restartable atomic sequence when a thread con- atomic sequences is compared with a syscall-based text is restored, it is also important for the kernel not to mechanism, instruction emulation on the MIPS R3000 violate the prerequisites for their correct operation. One processor and memory-interlocked instructions. The potential problem for the kernel is the ptrace facility. overhead associated with checking the program counter The ptrace facility provides tracing and debugging fa- during a context switch is also investigated. cilities. It allows a process (the tracing process) to con- All microbenchmarks presented in this section are trol another (the traced process). Most of the time, the based on mutual exclusion mechanisms with a test that traced process runs normally, however the ptrace facil- enters a critical section using a test-and-set lock and ity provides some kernel capabilities to modify the be- leaves the critical section by clearing the test-and-set haviour of the traced process. If the traced process con- lock. The benchmark uses a single thread, so that no tains registered restartable atomic sequences then the contention occurs. The benchmark is measuring the per- kernel must ensure that they are protected from modi- formance of the basic processor architecture, memory fication by the tracing process, otherwise one or both of system and mutual exclusion mechanism. The measures the processes will fail to perform as expected. are determined by executing the benchmark in a loop one There are two specific cases which must be consid- million times. The loop overhead and overhead for clear- ered: ing the lock is eliminated in the published measures. As already mentioned, restartable atomic sequences • The tracing process attempts to write to a use an optimistic mechanism that assumes that atomic restartable atomic sequence (PT WRITE I). An sequences are rarely preempted and are inexpensive for example is when the tracing process attempts to set this common case. By way of example, a system with a breakpoint. a 100MHz i486DX processor executes one million iter- • The traced process attempts to single-step into a ations of the benchmark outlined above in 0.38 seconds restartable atomic sequence (PT STEP). with only 4 restarts actioned. The first case is handled within the machine- independent ptrace facility. Each write to the code seg- 4.1 Syscall-based atomic operations ment is first checked to ensure that the write is not to The syscall-based mechanism uses the kernel to per- a restartable atomic sequence. Attempting to write to a form all the necessary actions to ensure atomicity. restartable atomic sequence fails. The mechanism only works on uniprocessor systems The second case is more difficult to handle, since dif- and is successful because the NetBSD kernel is not ferent architectures handle single-stepping mode differ- preemptable[4]. Support is provided by two system calls ently. For example, the MIPS R3000 processor does not to acquire and release the lock on behalf of the thread. have a single-stepping or tracing mode and the facility These system calls were added to a NetBSD kernel for is generally handled through software emulation. Soft- the explicit purpose of comparing its performance with ware emulation works by replacing the next instruction restartable atomic sequences. with an invalid opcode which generates an exception that The system calls take the address in the process ad- the kernel identifies and handles specifically. Protect- dress of the lock. The acquire system call will block ing the restartable atomic sequences during emulation is the current thread until the lock is released. The re- the same as for the first case discussed above. However, lease system call will release the lock to any blocked care must be taken, since the same emulation technique threads. The system calls are implemented using the is used to single-step the kernel inside the kernel debug- tsleep()/wakeup() kernel facility. ger, and restartable atomic sequences are not supported The elapsed times to execute the test-and-set within the kernel. atomic operation based on the syscall mechanism and Other architectures such as i386 and m68k provide restartable atomic sequences for various processors are tracing support in hardware. On these architectures the shown in Table 1. From this benchmark it can be seen trace trap must check if the program counter is within a that on the MIPS R3000 processor, restartable atomic restartable atomic sequence before dispatching the event sequences provide almost ninety-fold performance im- to the tracing process via the ptrace facility. The usual provement over syscall-based atomic operations. The procedure is to continue stepping through the restartable run-time cost for syscall-based atomic operations is atomic sequence and only dispatch the event on the first high. The kernel must be invoked on every atomic oper- instruction after the atomic sequence. ation, requiring that a trap be fielded, dispatched and ar-

316 FREENIX Track: 2003 USENIX Annual Technical Conference USENIX Association system RAS syscall 4.3 Memory-interlocked instructions The performance of atomic operations based on DECstation 5000/25 0.40 35.6 restartable atomic sequences is compared with memory- 100MHz i486DX 0.32 21.5 interlocked instructions on processors that provide such DECstation 5000/260 0.17 9.7 functionality. Two memory-interlocked implementa- Alchemy Pb1000 EVB 0.04 2.4 tions are considered. An inlined version uses compiler Broadcom BCM91250A EVB 0.04 1.4 and assembler optimisations to schedule the memory- interlocked instructions in the instruction stream at the Table 1: Elapsed times (in microseconds) to execute the point of invocation. A non-inlined version wraps the test-and-set atomic operation based on the syscall mech- memory-interlocked instructions in a function call. Due anism and restartable atomic sequences (RAS). to the overhead associated with dispatching a function call, the inlined version is almost always expected to provide a performance gain. As mentioned in Section 2, restartable atomic sequences cannot be inlined. guments checked. Modern processors in the MIPS fam- To compare the performance of the two implementa- ily appear to be less sensitive to system-call overhead, tions of memory-interlocked instructions and restartable however restartable atomic sequences continue to pro- atomic sequences, a performance index is introduced. vide a significant performance improvement. The metric is calculated by   4.2 Instruction emulation t − t M = 100 NI , The original MIPS instruction set architecture (ISA) im- tNI plemented in the R3000 processor did not provide any where t is the execution time of the atomic-operation memory-interlocked instructions. The MIPS II ISA im- mechanism and tNI is the execution time for a non- plemented in the and most subsequent proces- inlined memory-interlocked instruction. Therefore, the sors introduced a load-locked/store-conditional instruc- performance index provides an indication of improve- tion pair. The MIPS R3000 processor will generate ment over a non-inlined memory-interlocked instruc- a trap if it attempts to execute a load-locked or store- tion. A performance index of zero indicates no per- conditional instruction. This trap can be intercepted by formance improvement. A performance index of one- the kernel and the necessary actions performed to ade- hundred indicates zero cost associated with an operation, quately emulate the instruction pair. or effectively infinite performance improvement. Instruction emulation has the advantage that software The performance indices comparing atomic opera- can be written for any processor in the family and the tions based on restartable atomic sequences and inlined unsupported instructions are supported transparently. memory-interlocked instructions are shown in Table 2. The NetBSD kernel currently emulates the load- The table is ordered approximately corresponding to the locked and store-conditional instructions for the MIPS processing power (or age) of the processor. R3000 processor. The current implementation only op- Table 2 shows that for older processors, restartable erates on uniprocessor systems and is successful because atomic sequences exhibit an additional cost over the kernel is not preemptable[4]. memory-interlocked instructions. On these machines, Although instruction emulation requires no special restartable atomic sequences are paying the penalty of hardware, its run-time cost is high. Similar to the the function call. The modern processors tend to indi- syscall-based mechanism, the kernel must be invoked cate a potential performance improvement of restartable on every emulated instruction, requiring that a trap be atomic sequences over memory-interlocked instructions. fielded, dispatched and arguments checked. A sin- This improvement is mainly attributed to the introduc- gle test-and-set atomic operation on the DECstation tion of on-chip caches and memory controllers. 5000/25 based on restartable atomic sequences can ex- Table 2 clearly shows that the memory subsystem is ecute in 0.4 microseconds. The test-and-set atomic op- crucial for efficient performance of lock operations. The eration using instruction emulation executes in 56 mi- Chalice CATS, Digital DNARD and Netwinder systems croseconds. The cost of instruction emulation is higher which have an ARM processor show significant perfor- than the syscall-based test-and-set atomic operation, mance improvement with restartable atomic sequences. since each test-and-set operation uses a load-locked and This processor appears to be very sensitive to memory store-conditional instruction, which generates two traps performance. However, newer processors in the ARM to the kernel to emulate the instructions. Therefore, on family such as the Xscale show the memory subsystem the MIPS R3000 processor, the syscall-based mecha- has been significantly improved so that inlined memory- nism is faster than instruction emulation. interlocked instructions outperform restartable atomic

USENIX Association FREENIX Track: 2003 USENIX Annual Technical Conference 317 sequences. For the Xscale case, restartable atomic se- context-switch time quences are paying the penalty of being non-inlined. On 180 this particular Xscale system, the memory controller is DECstation 5000/25 160 integrated into the CPU, the memory the that memory- 100MHz i486DX 140 interlocked instruction is manipulating is cacheable, and so the memory-interlocked instruction is inexpensive. 120 The systems based on the Alpha processor also show 100 significant performance improvement with restartable 80 atomic sequences. These systems generally have small 60 performance indices for the inlined memory-interlocked 40 instructions; an indication that the cost associated with 20 function invocation is negligible. The AlphaServer 1200 0 50 100 150 200 is a multiprocessor system, so the hardware causes sig- number of registered atomic sequences nificant extra bus activity. However, the benchmark was Figure 3: Context-switch time (milliseconds) for DEC- run with a uniprocessor kernel. station 5000/25 and 100MHz i486DX with increasing The microbenchmark results presented in Table 2 number of registered atomic sequences. demonstrate that an architecture and memory system cannot necessarily provide efficient functionality com- pared with a combination of kernel and compiler opti- search when it finds a sequence with a start address misation. On modern processors, restartable atomic se- higher than the program counter. Similarly, the first quences can provide improved performance in atomic and last addresses of all restartable atomic sequences operations. could be recorded in a header which would allow the 4.4 Run-time overhead of restartable ras lookup() function to quickly check if it is nec- atomic sequences essary to traverse the list. Realistically, applications will not make use of so many restartable atomic sequences. As mentioned in Section 2, there is a run-time over- Indeed, the current implementation places an arbitrary head associated with checking if the program counter restriction of sixteen sequences per process. Since the is within a restartable atomic sequence on every context threads library is currently the only user of restartable switch. The context-switch time was measured to ob- atomic sequences, only one restartable atomic sequence tain an indication of the cost of checking the program is expected to be registered for an application. counter. The context-switch time was determined by measuring the time it takes a token to be passed through 5 Discussion a pipe between two processes. The token is passed be- tween the two processes twenty times for a single mea- Based on the performance results presented in Section 4, surement. Of 16000 measurements, the shortest time restartable atomic sequences is the default default mech- was chosen as the true context-switch measurement, anism for implementing atomic operations in the threads since it most-likely represents the uninterrupted time to library on the NetBSD operating system. The mech- perform the context switch. Context-switch times are anism provides the foundation for the implementation measured for increasing number of registered atomic se- of a POSIX-compliant mutual-exclusion facility and the quences and are shown for the DECstation 5000/25 and primitives for mutual exclusion in the library scheduler. 100MHz i486DX in Figure 3. The threads library uses a blocking construct Since the restartable atomic sequences are recorded in the mutual-exclusion facility. This facility in a linked list for each process the context-switch times provides the pthread mutex lock() and increase linearly with increasing number of registered pthread mutex unlock() functions. The im- atomic sequences. By one-hundred registered atomic plementation uses a test-and-set atomic operation to test sequences the context-switch time for the DECstation the ownership of a mutual-exclusion flag (mutex). If a 5000/25 has doubled. About 130 registered atomic se- mutex is owned by another thread, the scheduler blocks quences are required to double the context-switch time the current thread and switches another thread onto on the 100MHz i486DX. the available execution context. Eventually a thread For a small number of sequences in the list, the per- will release ownership of the mutex and any blocked formance is not likely to be a significant issue. Never- threads waiting on the mutex will be given the execution theless, one way to improve the performance might be context. to order the list according to the start address. Then The blocking construct within the threads library uses the ras lookup() function could quickly abort the a single test-and-set atomic operation and therefore uses

318 FREENIX Track: 2003 USENIX Annual Technical Conference USENIX Association system performance index 33 HP9000/340 61 41 100MHz i486DX 65 8 SPARCstation 20 62 41 DECstation 5000/260 44 0 Sun Ultra5 11 92 Chalice CATS 9 34 Sun Ultra60 11 90 Digital DNARD 13 52 IBM Walnut EVB 20 91 Netwinder 12 76 Digital Multia 500AU 18 86 100MHz Pentium 25 51 Broadcom BCM91250A EVB 0 66 Alchemy Pb1000 EVB 23 46 400MHz Xscale i80321 68 81 200MHz Intel Pentium Pro 47 71 Digital AlphaStation 200 11 80 266MHz Intel Pentium II 42 70 Digital PC164 0 79 Digital AlphaServer 1200 2 62 Apple Power Macintosh G4 12 72 Apple iBook G3 14 91 Samsung UP1500 0 68 1000MHz AMD Athlon 25 74 AMD Athlon XP2000+ 26 80 1000MHz Intel Pentium 4 43 71 AMD Athlon MP2000+ 25

Table 2: Performance index of a test-and-set atomic operation based on restartable atomic sequences (black) and inlined memory-interlocked instructions (white). Larger values of the performance index indicate improved perfor- mance over non-inlined memory-interlocked instructions.

USENIX Association FREENIX Track: 2003 USENIX Annual Technical Conference 319 a single restartable atomic sequence. When the threads statistical difference in either the request rate or data- library is loaded, an initialisation function invokes the transfer rate. Therefore, it may be concluded, from a per- rasctl system call to register the restartable atomic se- formance perspective, that most applications will not be quence of the test-and-set atomic operation. Since the adversely affected by the implementation of restartable library must support atomic operations transparently on atomic sequences. both uniprocessor and multiprocessor systems, the test- For processors which do not provide memory- and-set atomic operation invokes the underlying test- interlocked instructions, restartable atomic sequences do and-set mechanism through function pointers. The ini- provide a significant benefit. In the future this benefit tialisation function checks for a multiprocessor system may become more significant if restartable atomic se- with the hw.ncpu sysctl variable. On a multiprocessor quences are extended for use within the NetBSD ker- system the test-and-set atomic operation uses memory- nel. To realise improved performance on multipro- interlocked instructions. cessor systems and attain targets for real-time laten- The use of function pointers to choose the underly- cies, a preemptable NetBSD kernel would place in- ing atomic mechanism introduces additional execution creased demand on atomic operations within the ker- cost in the mutual-exclusion facility. It also means that nel. While much of the design decisions discussed in the atomic locks are no longer inlined, and the perfor- Section 3 are readily extended to a kernel-level imple- mance of memory-interlocked instructions is relegated mentation, actioning restarts only on context switches to the non-inlined case considered in Section 4. is likely to be insufficient. Interrupts must also action The thread library introduces additional overhead restarts. Consequently, a kernel-level implementation of to the mutual-exclusion facility and the full per- restartable atomic sequences will require many more in- formance demonstrated in Table 2 is not attain- vasive changes to machine-dependent subsystems. The able. For comparison, a microbenchmark sim- lessons learned from this implementation of user-level ilar to the one used in Section 4 was devel- restartable atomic sequences serves as a good starting oped which invoked pthread mutex lock() and point. pthread mutex unlock() in a loop one million times. The benchmark uses a single thread so that no 6 Conclusion contention occurs. On a 1000MHz AMD Athlon proces- Restartable atomic sequences have been implemented sor, the benchmark of a test-and-set restartable atomic on the NetBSD operating system to provide a generic sequence completes in 75 milliseconds. On the same framework for atomic operations for use by the POSIX processor, the mutex operation (using restartable atomic threads library. Restartable atomic sequences are ap- sequences) completes in 125 milliseconds. Therefore, propriate for uniprocessor systems that do not support the mutual-exclusion facility of the threads library intro- memory-interlocked instructions. Moreover, on mod- duces 66% overhead. Nevertheless, restartable atomic ern processors that do have hardware support for syn- sequences improve the performance of the mutex func- chronisation, better performance may be possible with tions, where the mutex benchmark using non-inlined restartable atomic sequences. memory-interlocked instructions takes 135 milliseconds This paper has presented an overview of an imple- to complete. From this result, it can be seen that the per- mentation of user-level restartable atomic sequences on formance improvement of restartable atomic sequences the NetBSD operating system and discussed design de- propagates directly into the mutual-exclusion facility of cisions encountered during its implementation. Per- the threads library. formance comparisons between restartable atomic se- However, these microbenchmarks do not represent quences, a syscall-based mechanism and instruction em- typical usage in multi-threaded applications. A mi- ulation for mutual exclusion demonstrated the advan- crobenchmark based on a row-parallel LU matrix de- tages of restartable atomic sequences. composition provides a complement of computation and mutex contention. An LU decomposition on a Availability 1000 × 1000 matrix uses around 14,000 test-and-set The kernel and user implementation discussed in this atomic operations per thread. Running the benchmark paper has been adopted by the NetBSD Project and on a 1000MHz AMD Athlon processor using from is currently available under a BSD license from the one to one-hundred threads, there was no statistical NetBSD Project’s source servers. A complete set of re- difference between restartable atomic sequences and gression tools for memory-interlocked instructions and memory-interlocked instructions (p-value 0.7). Simi- restartable atomic sequences is available within the larly, a benchmark of the multi-threaded apache web source tree. The next formal release which will use server (version 2.0.44) using the httpperf HTTP perfor- the implementation will be NetBSD 2.0. Further infor- mance measurement tool (version 0.8) also showed no mation about the NetBSD Project can be found on the

320 FREENIX Track: 2003 USENIX Annual Technical Conference USENIX Association Project’s web server at www..org. Acknowledgements Thanks to Artem Belevich, Allen Briggs, Simon Burge, Martin Husemann, Jason Thorpe, Valeriy Ushakov, Mar- tin Weber, Nathan Williams, and Berndt Wulf for the performance data presented in Tables 1 and 2. References [1] T. E. Anderson, B. N. Bershad, E. D. Lazowska, and H. M. Levy. Scheduler activations: Effective ker- nel support for the user-level management of par- allelism. ACM Transactions on Computer Systems, 10(1):53–79, February 1992. [2] B. N. Bershad, D. R. Redell, and J. R. Ellis. Fast mutual exclusion for uniprocessors. In Fifth Sym- posium on Architectural Support for Programming Languages and Operating Systems, 1992. [3] L. Lamport. A fast mutual exclusion algorithm. ACM Transactions on Computer Systems, 5(1):1– 11, February 1987. [4] M. K. McKusick, K. Bostic, M. J. Karels, and J. S. Quarterman. The design and implementation of the 4.4BSD operating system. Addison-Wesley, 1996. [5] J. M. Mellor-Crummey and M. L. Scott. Al- gorithms for scalable synchronization on shared- memory multiprocessors. ACM Transactions on Computer Systems, 9(1):21–65, February 1991. [6] N. J. Williams. An implementation of scheduler ac- tivations on the NetBSD operating system. In Pro- ceedings of the 2002 Usenix Annual Technical Con- ference, 2002.

USENIX Association FREENIX Track: 2003 USENIX Annual Technical Conference 321