RTLinux with Address Spaces

Frank Mehnert, Michael Hohmuth, Sebastian Schönberg, Hermann Härtig
Dresden University of Technology, Department of Computer Science, D-01062 Dresden, Germany
[email protected]

Abstract

The combination of a real-time executive and an off-the-shelf time-sharing operating system has the potential of providing both predictability and the comfort of a large application base. To isolate the real-time section from a significant class of faults in the (ever-growing) time-sharing operating system, address spaces can be used to encapsulate the time-sharing subsystem. In practice, however, designers seldom use address spaces for this purpose, fearing that the extra cost induced thereby limits the system's predictability. To analyze this cost, we compared in detail two systems with almost identical interfaces: both are a combination of the Linux operating system and a small real-time executive. Our analysis revealed that for interrupt-response times, the delay and jitter caused by address spaces are similar to or even smaller than those caused by caches and blocked interrupts. As a side effect of our analysis, we observed that published figures on predictability must be carefully checked as to whether such hardware features are included in the analysis.

1 Introduction

In this paper, we determine the cost of introducing address-space protection to RTLinux [8].

RTLinux is a real-time extension for Linux. It introduces a small real-time executive layer into the time-sharing kernel. On top of that layer, the time-sharing kernel and the real-time applications all run in kernel mode. User mode is reserved for time-sharing applications. Real-time applications are protected from errors in time-sharing applications.

For our evaluation, we used L4RTL, a reimplementation of the RTLinux API based on a real-time microkernel and a user-level Linux server. That design has the property that the real-time subsystem is protected from a large class of errors: crashes in the time-sharing subsystem (either in the operating system or in user code) will not bring down the real-time subsystem, except if a device driven by the time-sharing subsystem locks up the machine or corrupts main memory.

This increased level of fault tolerance is desirable for many real-time applications. In many ways, it is more important to protect the real-time subsystem from the time-sharing subsystem (both kernel and applications) than the other way round: for example, a real-time application may be safety-critical, but time-sharing operating systems have become so big and bloated that it is difficult to trust in their stability. However, the undoubted benefits of separate address spaces do not come for free.

We observed RTLinux's worst-case response time to be much higher than the "about 15 µs on a generic x86 PC" claimed by the system's authors [7]. In our experiments, RTLinux's worst-case interrupt latency was 68 µs; the worst case for L4RTL was 85 µs. (We carried out our measurements on an 800-MHz Pentium III PC; more details follow in Section 3. We believe that this system qualifies as a "generic x86 PC" as presumed by the RTLinux statement.)

We found that the cost that address-space switches impose on real-time applications does not significantly degrade the predictability of the system. In general, most of the worst-case overhead we observed must be attributed to implementation artifacts of the kernels we used, not to the use of address spaces.

2 L4RTL

For an accurate comparison of the cost of address spaces in real-time systems, we have reimplemented the RTLinux API in the context of the Drops system. The resulting system, called L4RTL, can run unmodified RTLinux programs (source-level compatibility).

2.1 The DROPS system

Drops is an operating system that supports applications with real-time and quality-of-service requirements as well as non-real-time (time-sharing) applications [2]. It uses the Fiasco microkernel as its base. The Fiasco microkernel is a fully preemptible real-time kernel supporting hard priorities. It uses non-blocking synchronization for its kernel objects, which ensures that runnable high-priority threads never block waiting for lower-priority threads or the kernel [5].

For time-sharing applications, Drops comes with L4Linux, a Linux server that runs as an application program on top of the Fiasco microkernel [3]. L4Linux supports standard, unmodified Linux programs (binary compatibility). The Linux server never disables interrupts for synchronization purposes. Instead, we modified the implementations of cli() and sti() to work without disabling interrupts while still ensuring their semantics (to protect a critical section) [4].
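The general technique, emulating the interrupt flag in software, can be sketched as follows. This is a minimal illustration with names of our own, not the actual L4Linux code, which is described in [4]: a virtual flag replaces the hardware flag, and interrupts that arrive while the flag is cleared are recorded and replayed by the emulated sti().

    /* Minimal sketch of a software-emulated interrupt flag. The
     * hardware interrupt-enable flag is never touched, so the layers
     * below the time-sharing kernel are never blocked. Illustrative
     * names; a production version must make the flag and bitmask
     * updates atomic, which this sketch omits for brevity. */

    #include <stdbool.h>

    static volatile bool irqs_enabled = true; /* virtual interrupt flag   */
    static volatile unsigned pending = 0;     /* bitmask of deferred IRQs */

    void do_irq(unsigned irq);                /* the regular handler      */

    /* Entry point of the interrupt path: defer delivery while the
     * virtual flag is cleared; hardware interrupts stay enabled. */
    void deliver_irq(unsigned irq)
    {
        if (!irqs_enabled) {
            pending |= 1u << irq;
            return;
        }
        do_irq(irq);
    }

    void emulated_cli(void) { irqs_enabled = false; }

    void emulated_sti(void)
    {
        irqs_enabled = true;
        while (pending) {                     /* replay deferred events */
            unsigned irq = __builtin_ctz(pending);  /* GCC builtin */
            pending &= ~(1u << irq);
            do_irq(irq);
        }
    }

The crucial property is that the critical-section semantics of cli()/sti() are preserved without ever disabling hardware interrupts, so the microkernel and the real-time subsystem retain full control over interrupt latency.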
2.2 L4RTL implementation

We implemented L4RTL as a library for RTLinux application programs and as a dynamic load module for L4Linux. Figure 1 gives an overview of L4RTL's structure.

[FIGURE 1: L4RTL structure. Linux applications, the L4Linux server (hosting the L4RTL dynamic load module), and one or more L4RTL tasks (each containing the L4RTL library) all run in user mode; only the Fiasco microkernel runs in kernel mode.]

The L4RTL library implements the RTLinux API for real-time RTLinux applications. RTLinux applications run as user-mode threads in address spaces separate from L4Linux's address space. We call these threads and address spaces L4RTL threads and L4RTL tasks, respectively. Each L4RTL task can contain several L4RTL threads, and these threads can cooperate using the RTLinux API.

There can be more than one L4RTL task. All of these tasks can use the same single L4Linux server. However, L4RTL currently does not allow real-time threads in different tasks to communicate with each other using the RTLinux API.

The L4RTL dynamic load module for L4Linux implements the API that Linux programs use to communicate with L4RTL threads. (From L4Linux's point of view, this is a "Linux kernel module," but of course L4Linux runs as a user program and not in kernel mode.) The load module also creates a service thread in the L4Linux task. L4RTL threads use this thread as their only connection to L4Linux, and only for communication with time-sharing Linux processes. Otherwise, L4RTL tasks can work completely independently of L4Linux.

FIFOs and Mbuffs are implemented using shared-memory regions between L4RTL threads and the L4RTL load module. L4RTL threads allocate these memory regions and then transfer a mapping of them to the load module's service thread inside L4Linux using microkernel IPC. The shared-memory regions contain all necessary control data. Once an L4RTL thread and the load module have established the mapping, they communicate only through their shared-memory region. L4RTL has been designed so that bugs in L4Linux that corrupt the shared-memory data cannot crash L4RTL threads (see the sketch at the end of this section).

Further Fiasco-microkernel IPC is necessary only for FIFO signalling. For this purpose, the L4RTL library creates one thread in each L4RTL task (Figure 1). This thread handles signal messages from the load module and forwards them to L4RTL threads.

In contrast to RTLinux, L4RTL does not contain a scheduler. Instead, it relies on the Fiasco microkernel for scheduling.

L4RTL is the subject of ongoing work and research. Currently, it does not implement all of the very rich RTLinux API. However, everything relevant to the discussion in this paper (and more) has been implemented and works well, and we believe that the missing features have no influence on our measurements.
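The robustness property claimed above, that corrupted shared-memory data cannot crash L4RTL threads, essentially requires the real-time side to validate every control value it reads from the shared region. The following minimal sketch uses an illustrative layout of our own rather than the actual L4RTL data structures, and shows the idea for a single-reader FIFO:

    /* Hedged sketch of a shared-memory FIFO control structure. The
     * real-time consumer masks every index read from shared memory to
     * the buffer size, so corrupted control data can at worst yield
     * lost or garbled messages, never an out-of-bounds access. */

    #include <stdint.h>

    #define FIFO_SIZE 1024u             /* power of two: masking is cheap */

    struct shm_fifo {
        volatile uint32_t head;         /* advanced by the producer */
        volatile uint32_t tail;         /* advanced by the consumer */
        uint8_t buf[FIFO_SIZE];
    };

    /* Consumer side, running in the L4RTL task: never trust the shared
     * indices blindly. Returns 0 on success, -1 if the FIFO is empty. */
    int rt_fifo_get(struct shm_fifo *f, uint8_t *out)
    {
        uint32_t head = f->head & (FIFO_SIZE - 1); /* clamp corrupt values */
        uint32_t tail = f->tail & (FIFO_SIZE - 1);

        if (head == tail)
            return -1;                  /* empty (or corrupted to "empty") */
        *out = f->buf[tail];
        f->tail = (tail + 1) & (FIFO_SIZE - 1);
        return 0;
    }

Because only indices into a fixed-size buffer cross the protection boundary, real-time code never dereferences a pointer supplied by L4Linux.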

3 Cost of address spaces

Using original RTLinux and L4RTL, we ran a minimal interrupt-latency benchmark under worst-case conditions. This experiment was meant to establish an upper bound for the overhead that can be expected from providing address spaces.

3.1 Experimental setup

To induce worst-case system behavior, we used two strategies.

First, prior to triggering the hardware event, we configure the system such that the cache and TLB working sets the kernel and the real-time thread need to react to the event are completely swapped out to main memory (and the corresponding 1st-level and 2nd-level cache lines are dirty). For this purpose we have written a Linux program that invalidates the caches and TLB entries (the cache flooder; a sketch of the idea closes this subsection).

Second, we exercise various code paths in RTLinux, L4Linux, and the Fiasco microkernel. These coverage tests are a probabilistic way to reveal code paths with maximal execution time while interrupts are disabled. Additionally, they increase confidence that the Drops system is indeed completely preemptible, as we claimed in Section 2.1. To avoid missing critical code paths because of pathological timer synchronization, we varied the time between triggering two interrupts.

For code coverage we use hbench, a benchmarking suite for Unix [1]. This benchmark provides excellent coverage for Linux and, in our experience, also for L4Linux and the Fiasco microkernel.

As a reliable and measurable interrupt source, we used the x86 CPU's built-in interrupt controller (the Local APIC, where APIC stands for advanced programmable interrupt controller). This unit offers a timer interrupt that can be used to obtain the time between the hardware event and the reaction in kernel or user code. When the Local APIC is used in periodic mode, its overhead is close to zero: first, it does not require repeated reinitialization, and second, the elapsed time since the hardware trigger can be read directly from the chip (sketched at the end of Section 3.2).

We used RTLinux version 3.0 with Linux 2.2.18 in non-SMP mode. In this mode, the Local APIC is not used for system purposes. Linux had the bigphysarea patch applied to allow allocation of the contiguous physical memory pages we need for our cache flooder. The version of L4Linux we used is 2.2.18.
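As referenced above, the following is a minimal user-level sketch of the cache-flooder idea. It conveys only the access pattern: the actual tool works on physically contiguous memory obtained through the bigphysarea patch, since the Pentium III caches are physically indexed, and it also walks enough distinct pages to displace TLB entries. Sizes are illustrative for the machine used here.

    /* Hedged sketch of a cache flooder: writing one word per cache line
     * across a buffer larger than the 2nd-level cache evicts the
     * previous contents and leaves the new lines dirty, so the measured
     * code must later fetch its working set from main memory. */

    #include <stdlib.h>

    #define CACHE_LINE 32u                  /* Pentium III line size    */
    #define FLOOD_SIZE (1024u * 1024u)      /* larger than the L2 cache */

    static void flood_caches(volatile char *buf)
    {
        for (size_t i = 0; i < FLOOD_SIZE; i += CACHE_LINE)
            buf[i] = (char)i;               /* write: leave lines dirty */
    }

    int main(void)
    {
        volatile char *buf = malloc(FLOOD_SIZE);
        if (!buf)
            return 1;
        for (;;)                            /* run as background load   */
            flood_caches(buf);
    }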
3.2 Measurements

For both RTLinux and L4RTL, we measured the exact time between the occurrence of the hardware event and the first instruction in the real-time thread. We measured this time under two conditions: in the average case (no additional system load) and under a combination of the hbench and cache-flooding loads.

Additionally, we measured the time between the occurrence of the hardware event and the first instruction of the kernel-level interrupt handler ("kernel entry"). This measurement was intended to quantify the effect of critical code sections that disable interrupts within the kernels. (L4Linux is not an issue here because it never disables interrupts for synchronization.) By measuring both kernel-level and real-time-thread latencies, we were able to filter out overhead induced not by address spaces but by artifacts of the kernels' implementations.

The diagrams in Figures 2 and 3 show the densities of the interrupt response times under the two load conditions. Table 1 summarizes the worst-case measurement results.

              kernel entry   kernel path   user entry
    L4RTL     53 µs          39 µs         85 µs
    RTLinux   53 µs          56 µs         68 µs

TABLE 1: Worst-case interrupt response times on L4RTL and RTLinux

The maximal time to invoke the Fiasco microkernel's handler was 53 µs, and the maximal time to then start the corresponding L4RTL thread was 39 µs. As the sum of these separate maxima is close to the total maximum (85 µs), it seems that the code path that disables interrupts and the code path that starts the L4RTL thread use different code and data and are accounted for separately. We believe that the kernel-entry cost can be attributed to deficiencies in our microkernel's preemptability.

Perhaps the most interesting result is that the worst-case latency for RTLinux is 68 µs. The Fiasco microkernel achieves a higher level of kernel interruptibility than RTLinux, which leads to better worst-case kernel-path times in our code-coverage tests (39 µs versus 56 µs). However, RTLinux seems to have long periods of disabled interrupts, too: it takes up to 53 µs to invoke a kernel handler. The fact that this value is close to the total maximum of 68 µs suggests that the code that keeps interrupts disabled is close or identical to the code that is executed when an interrupt occurs (Figure 2, bottom row). In other words, it is probably RTLinux itself that impairs the system's interruptibility.

All in all, our measurements show that an unrelated limitation of our microkernel, namely its disabling of interrupts for up to 53 µs, was more problematic for L4RTL's real-time performance than the introduction of address spaces for real-time tasks into the system.

We deduce that providing address spaces for real-time tasks does not lead to unacceptable worst-case scheduling delays. Moreover, the introduction of address spaces introduces less uncertainty than blocked interrupts and caches do.
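As an aside on the measurement mechanics: reading the elapsed time "directly from the chip" (Section 3.1) amounts to reading the Local APIC's current-count register. A minimal sketch, assuming kernel-mode code, the default APIC base address of 0xFEE00000 mapped one-to-one, and register offsets as given in the IA-32 manuals; the calibration of timer ticks to microseconds is omitted:

    /* Hedged sketch of the latency readout: in periodic mode the Local
     * APIC timer reloads from its initial-count register and raises an
     * interrupt each time the current count reaches zero, so the time
     * since the hardware event can be read directly from the chip. */

    #include <stdint.h>

    #define APIC_BASE  0xFEE00000u   /* default local-APIC base address */
    #define APIC_TMICT 0x380u        /* timer initial-count register    */
    #define APIC_TMCCT 0x390u        /* timer current-count register    */

    static inline uint32_t apic_read(uint32_t reg)
    {
        return *(volatile uint32_t *)(APIC_BASE + reg);
    }

    /* Ticks elapsed since the timer last wrapped, i.e. since the
     * interrupt was raised; dividing by the calibrated bus frequency
     * converts ticks to microseconds. */
    static uint32_t ticks_since_trigger(void)
    {
        return apic_read(APIC_TMICT) - apic_read(APIC_TMCCT);
    }

Because the timer reloads automatically in periodic mode, no reprogramming is needed between events, which keeps the probe's own overhead close to zero, as noted in Section 3.1.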

[FIGURE 2: RTLinux interrupt-latency densities. First row: no load (idle); second row: combined hbench + cache-flooder load (lat_syscall_sbrk). Left: time to enter kernel mode (worst case 7 µs idle, 53 µs under load). Center: time to activate the handler in the RTLinux thread (worst case 6 µs idle, 56 µs under load). Right: accumulated total time to the real-time task (worst case 13 µs idle, 68 µs under load). The x axes show interrupt latency in µs; the y axes show the frequency with which particular latencies occurred, on a logarithmic scale from 10^0 down to 10^-5.]

[FIGURE 3: L4RTL interrupt-latency densities; axes as in Figure 2. First row: no load (idle); second row: combined hbench + cache-flooder load (lat_syscall_sigaction). Left: time to enter kernel mode (worst case 37 µs idle, 53 µs under load). Center: time to activate the handler in the real-time thread (worst case 17 µs idle, 39 µs under load). Right: accumulated total time to the real-time task (worst case 43 µs idle, 85 µs under load).]

4 Conclusion and future work

In this paper, we have compared two Linux-based real-time operating systems: RTLinux, a shared-space system, and L4RTL, a separate-space system. We learned that address spaces, when provided by a small real-time executive and used to protect critical real-time tasks from a shaky time-sharing subsystem, do not come for free: they increase worst-case response times, adding delays and jitter.

However, the good news is that the additional overhead in worst-case situations is comparable to the costs introduced by blocked interrupts and by common hardware features of modern CPUs, such as caches. These costs seem to be well accepted by designers.

So far, we have only claimed that separate address spaces are an effective means of isolating the real-time section from faulty time-sharing subsystems. As future work, we plan to test this claim by injecting faults, for example arbitrary memory accesses, into the Linux section in kernel and user space, and to add a rebooting facility for the time-sharing subsystem.

To improve worst-case execution times, we plan to extend the Fiasco microkernel with a facility for emulating tagged TLBs ("small address spaces" [6]) and to apply a user-level memory-management server that provides cache partitioning.

References

[1] A. B. Brown and M. I. Seltzer. Operating system benchmarking in the wake of lmbench: A case study of the performance of NetBSD on the Intel x86 architecture. In ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems, pages 214-224, Seattle, WA, June 1997.

[2] H. Härtig, R. Baumgartl, M. Borriss, Cl.-J. Hamann, M. Hohmuth, F. Mehnert, L. Reuther, S. Schönberg, and J. Wolter. DROPS: OS support for distributed multimedia applications. In Proceedings of the Eighth ACM SIGOPS European Workshop, Sintra, Portugal, September 1998.

[3] H. Härtig, M. Hohmuth, J. Liedtke, S. Schönberg, and J. Wolter. The performance of µ-kernel-based systems. In 16th ACM Symposium on Operating System Principles (SOSP), pages 66-77, Saint-Malo, France, October 1997.

[4] H. Härtig, M. Hohmuth, and J. Wolter. Taming Linux. In Proceedings of the 5th Annual Australasian Conference on Parallel and Real-Time Systems (PART '98), Adelaide, Australia, September 1998.

[5] M. Hohmuth and H. Härtig. Pragmatic nonblocking synchronization for real-time systems. In USENIX Annual Technical Conference, Boston, MA, June 2001.

[6] J. Liedtke. Improved address-space switching on Pentium processors by transparently multiplexing user address spaces. Arbeitspapier der GMD No. 933, GMD, German National Research Center for Information Technology, Sankt Augustin, September 1995.

[7] RTLinux FAQ. URL: http://www.rtlinux.org/documents/faq.html.

[8] V. Yodaiken and M. Barabanov. A Real-Time Linux. In Proceedings of the Linux Applications Development and Deployment Conference (USELINUX), Anaheim, CA, January 1997. The USENIX Association.