<<

Hrtimers and Beyond: Transforming the Linux Subsystems

Thomas Gleixner Douglas Niehaus linutronix University of Kansas [email protected] [email protected]

Abstract will complement and complete the hrtimers and timeofday components to create a solid foun- dation for architecture independent support of Several projects have tried to rework Linux high-resolution timers and dynamic ticks. time and timers code to add functions such as high-precision timers and dynamic ticks. Pre- vious efforts have not been generally accepted, in part, because they considered only a sub- 1 Introduction set of the related problems requiring an inte- grated solution. These efforts also suffered sig- nificant architecture dependence creating com- Time keeping and use of is a fundamen- plexity and maintenance costs. This paper tal aspect of operating system implementation, presents a design which we believe provides a and thus of Linux. related services in op- generally acceptable solution to the complete erating systems fall into a number of different set of problems with minimum architecture de- categories: pendence. • time keeping The three major components of the design are hrtimers, John Stulz’s new timeofday approach, • clock synchronization and a new component called clock events. Clock events manages and distributes clock • time-of- representation events and coordinates the use of clock event • next event interrupt scheduling handling functions. The hrtimers subsystem has been merged into Linux 2.6.16. Although • process and in-kernel timers the name implies “high resolution” there is no change to the tick based timer resolution at • process accounting this stage. John Stultz’s timeofday rework ad- • process profiling dresses jiffy and architecture independent time keeping and has been identified as a fundamen- tal preliminary for high resolution timers and These service categories exhibit strong interac- tickless/dynamic tick solutions. This paper pro- tions among their semantics at the design level vides details on the hrtimers implementation and tight coupling among their components at and describes how the clock events component the implementation level. 334 • Hrtimers and Beyond: Transforming the Linux Time Subsystems

Hardware devices capable of providing clock 2 Previous Efforts sources vary widely in their capabilities, accu- racy, and suitability for use in providing the de- sired clock services. The ability to use a given A number of efforts over many have ad- hardware device to provide a particular clock dressed various clock related services and func- service also varies with its context in a unipro- tions in Linux including various approaches to cessor or multi-processor system. high resolution time keeping and event schedul- ing. However, all of these efforts have encoun- tered significant difficulty in gaining broad ac- Arch 1 ceptance because of the breadth of their impact Timekeeping TOD Clock source HW on the rest of the kernel, and because they gen-

Tick ISR Clock event source HW erally addressed only a subset of the clock re- lated services in Linux. Arch 2 Process acc. TOD Clock source HW Profiling Interestingly, all those efforts have a common ISR Clock event source HW design pattern, namely the attempt to inte- Jiffies grate new features and services into the existing Arch 3 clock and timer infrastructure without changing Timer wheel TOD Clock source HW the overall design.

ISR Clock event source HW There are no projects to our knowledge which attempt to solve the complete problem as we understand and have described it. All existing Figure 1: Linux Time System efforts, in our view, address only a part of the whole problem as we see it, which is why, in Figure 1 shows the current software architec- our opinion, the solutions to their target prob- ture of the clock related services in a vanilla 2.6 lems are more complex than under our pro- Linux system. The current implementation of posed architecure, and are thus less likely to be clock related services in Linux is strongly asso- accepted into the main line kernel. ciated with individual hardware devices, which results in parallel implementations for each ar- chitecture containing considerable amounts of essentially similar code. This code duplication 3 Required Abstractions across a large number of architectures makes it difficult to change the semantics of the clock related services or to add new features such The attempt to integrate high resolution timers as high resolution timers or dynamic ticks be- into Ingo Molnar’s real-time preemption patch cause even a simple change must be made in led to a thorough analysis of the Linux timer so many places and adjusted for so many im- and clock services infrastructure. While plementations. Two major factors make im- the comprehensive solution for addressing the plementing changes to Linux clock related ser- overall problem is a large-scale task it can be vices difficult: (1) the lack of a generic ab- separated into different problem areas. straction layer for clock services and (2) the assumption that time is tracked using periodic timer ticks (jiffies) that is strongly integrated • clock sources management for time keep- into much of the clock and timer related code. ing 2006 Linux Symposium, Volume One • 335

• clock synchronization 3.2 Clock Synchronization

• time-of-day representation Crystal driven clock sources tend to be impre- • clock event management for scheduling cise due to variation in component tolerances next event interrupts and environmental factors, such as tempera- ture, resulting in slightly different clock tick • eliminating the assumption that timers are rates and thus, over time, different clock val- supported by periodic interrupts and ex- ues in different computers. The Network Time pressed in units of jiffies Protocol (NTP) and more recently GPS/GSM based synchronization mechanisms allow the correction of values and of clock These areas of concern are largely independent source drift with respect to a selected standard. and can thus be addressed more or less inde- Value correction is applied to the monotoni- pendently during implementation. However, cally increasing value of the hardware clock the important points of interaction among them source. This is an optional functionality as must be considered and supported in the overall it can only be applied when a suitable refer- design. ence time source is available. Support for clock synchronization is a separate component from those discussed here. There is work in progress 3.1 Clock Source Management to rework the current mechanism by John Stultz and Roman Zippel. An abstraction layer and associated API are re- quired to establish a common code framework 3.3 Time-of-day Representation for managing various clock sources. Within this framework, each clock source is required to maintain a representation of time as a monoton- The monotonically increasing time value pro- ically increasing value. At this time, nanosec- vided by many hardware clock sources cannot onds are the favorite choice for the time value be set. The generic interface for time-of-day units of a clock source. This abstraction layer representation must thus compensate for drift allows the user to select among a range of avail- as an offset to the clock source value, and rep- able hardware devices supporting clock func- resent the time-of-day ( or wall clock tions when configuring the system and pro- time) as a function of the clock source value. vides necessary infrastructure. This infras- The drift offset and parameters to the func- tructure includes, for example, mathematical tion converting the clock source value to a wall helper functions to convert time values specific clock value can set by manual interaction or to each clock source, which depend on prop- under control of software for synchronization erties of each hardware device, into the com- with external time sources (e.g. NTP). mon human-oriented time units used by the framework, i.e. . The centraliza- It is important to note that the current Linux im- tion of this functionality allows the system to plementation of the time keeping component is share significantly more code across architec- the reverse of the proposed solution. The inter- tures. This abstraction is already addressed by nal time representation tracks the time-of-day John Stultz’s work on a Generic Time-of-day time fairly directly and derives the monotoni- subsystem [5]. cally increasing time value from it. 336 • Hrtimers and Beyond: Transforming the Linux Time Subsystems

This is a relic of software development history and representation was one of the main prob- and the GTOD/NTP work is already addressing lems faced when implementing high resolution this issue. timers and variable interval event scheduling. All attempts to reuse the cascading timer wheel turned out to be incomplete and inefficient for 3.4 Clock Event Management various reasons. This led to the implementation of the hrtimers (former ktimers) subsystem. It While clock sources provide read access to provides the base for precise timer scheduling the monotonically increasing time value, clock and is designed to be easily extensible for high event sources are used to schedule the next resolution timers. event interrupt(s). The next event is currently defined to be periodic, with its period defined at compile time. The setup and selection of the event sources for various event driven func- 4 hrtimers tionalities is hardwired into the architecture de- pendent code. This results in duplicated code The current approach to timer management in across all architectures and makes it extremely Linux does a good job of satisfying an ex- difficult to change the configuration of the sys- tremely wide range of requirements, but it can- tem to use event interrupt sources other than not provide the quality of service required in those already built into the architecture. An- some cases precisely because it must satisfy other implication of the current design is that such a wide range of requirements. This is why it is necessary to touch all the architecture- the essential first step in the approach described specific implementations in order to provide here is to implement a new timer subsystem new functionality like high resolution timers or to complement the existing one by assuming a dynamic ticks. subset of its existing responsibilities. The clock events subsystem tries to address this problem by providing a generic solution to 4.1 Why a New Timer Subsystem? manage clock event sources and their usage for the various clock event driven kernel function- alities. The goal of the clock event subsystem The Cascading Timer Wheel (CTW), which is to minimize the clock event related architec- was implemented in 1997, replaced the original ture dependent code to the pure hardware re- time ordered double linked list to resolve the lated handling and to allow easy addition and scalability problem of the linked list’s O(N) in- utilization of new clock event sources. It also sertion time. It is based on the assumption that minimizes the duplicated code across the ar- the timers are supported by a periodic interrupt chitectures as it provides generic functionality (jiffies) and that the expiration time is also rep- down to the interrupt service handler, which is resented in jiffies. The difference in time value almost inherently hardware dependent. (delta) between now (the current system time) and the timer’s expiration value is used as an in- dex into the CTW’s logarithmic array of arrays. 3.5 Removing Tick Dependencies Each array within the CTW represents the set of timers placed within a region of the system The strong dependency of Linux timers on us- time line, where the size of the array’s regions ing the the periodic tick as the time source grow exponentially. Thus, the further into the 2006 Linux Symposium, Volume One • 337

array start end granularity be removed and reinserted into the lower cat- 1 1 256 1 egory’s finer-grained entries. Note that in the 2 257 16384 256 first CTW category the timers are time-sorted 3 16385 1048576 16384 with jiffy resolution. 4 1048577 67108864 1048576 5 67108865 4294967295 67108864 While the CTW’s O(1) insertion and removal is very efficient, timers with an expiration time Table 1: Cascading Timer Wheel Array Ranges larger than the capacity of the first category have to be cascaded into a lower category at least once. A single step of cascading moves future a timer’s expiration value lies, the larger many timers and it has to be done with inter- the region of the time line represented by the rupts disabled. The cascading operation can array in which it is stored. The CTW groups thus significantly increase maximum latencies timers into 5 categories. Note that each CTW since it occasionally moves very large sets of array represents a range of jiffy values and that timers. The CTW thus has excellent average more than one timer can be associated with a performance but unacceptable worst case per- given jiffy value. formance. Unfortunately the worst case perfor- mance determines its suitability for supporting Table 1 shows the properties of the different high resolution timers. timer categories. The first CTW category con- sists of n1 entries, where each entry represents However, it is important to note that the CTW a single jiffy. The category consists of is an excellent solution (1) for timers hav- n2 entries, where each entry represents n1*n2 ing an expiration time lower than the capac- jiffies. The third category consists of n3 en- ity of the primary category and (2) for timers tries, where each entry represents n1*n2*n3 which are removed before they expire or have jiffies. And so forth. The current kernel uses to be cascaded. This is a common scenario n1=256 and n2..n5 = 64. This keeps the num- for many long- protocol-timeout related ber of hash table entries in a reasonable range timers which are created by the networking and and covers the future time line range from 1 to I/O subsystems. 4294967295 jiffies. The KURT-Linux project at the University of The capacity of each category depends on the Kansas was the first to address implementing size of a jiffy, and thus on the periodic in- high resolution timers in Linux [4]. Its con- terrupt interval. While the 10 ms tick period centration was on investigating various issues in 2.4 kernels implied 2560ms for the CTW related to using Linux for real-time computing. first category, this was reduced to 256ms in the The UTIME component of KURT-Linux exper- early 2.6 kernels (1 ms tick) and readjusted to imented with a number of data structures to 1024ms when the HZ value was set to 250. support high resolution timers, including both Each CTW category maintains an time index separate implementations and those integrated counter which is incremented by the “wrap- with general purpose timers. The HRT project ping” of the lower category index which occurs began as a fork of UTIME code [1]. both when its counter increases to the point where projects added a sub-jiffy time tracking compo- its range overlaps that of the higher category. nent to increase resolution, and when integrat- This triggers a “cascade” where timers from the ing support with the CTW, sorted timers within matching entry in the higher category have to a given jiffy on the basis of the subjiffy value. 338 • Hrtimers and Beyond: Transforming the Linux Time Subsystems

This increased overhead involved with cascad- sorted, per-CPU list, implemented as a red- ing due to the O(N) sorting time. The experi- black tree. Red-black trees provide O(log(N)) ence of both projects demonstrated that timer insertion and removal and are considered to management overhead was a significant factor, be efficient enough as they are already used and that the necessary changes in the timer code in other performance critical parts of the ker- were quite scattered and intrusive. In sum- nel e.g. memory management. The timers are mary, the experience of both projects demon- kept in relation to time bases, currently CLOCK_ strated that separating support for high resolu- MONOTONIC and CLOCK_REALTIME, ordered tion and longer-term generic (CTW) timers was by the absolute expiration time. This separa- necessary and that a comprehensive restructur- tion allowed to remove large chunks of code ing of the timer-related code would be required from the POSIX timer implementation, which to make future improvements and additions to was necessary to recalculate the expiration time timer-related functions possible. The hrtimers when the clock was set either by settimeofday design and other aspects of the architecture de- or NTP adjustments. scribed in this paper was strongly motivated by the lessons derived from both previous projects. hrtimers went through a couple of revision cy- cles and were finally merged into Linux 2.6.16. The timer queues run from the normal timer 4.2 Solution softirq so the resulting resolution is not better than the previous timer API. All of the struc- ture is there to do better once the other parts of As a first step we categorized the timers into the overall timer code rework are in place. two categories: After adding hrtimers the Linux time(r) system looks like this: Timeouts: Timeouts are used primarily by Arch 1 networking and device drivers to detect when Timekeeping TOD Clock source HW an event (I/O completion, for example) does hrtimers not occur as expected. They have low resolu- Tick ISR Clock event source HW Arch 2 tion requirements, and they are almost always Process acc. TOD Clock source HW removed before they actually expire. Profiling ISR Clock event source HW Jiffies

Arch 3 Timers: Timers are used to schedule ongoing Timer wheel TOD Clock source HW events. They can have high resolution require- ISR Clock event source HW ments, and usually expire. Timers are mostly related to applications (user space interfaces) Figure 2: Linux time system + htimers The timeout related timers are kept in the ex- isting timer wheel and a new subsystem opti- mized for (high resolution) timer requirements 4.3 Further Work hrtimers was implemented. hrtimers are entirely based on human time The primary purpose of the separate imple- units: nanoseconds. They are kept in a time mentation for the high resolution timers, dis- 2006 Linux Symposium, Volume One • 339

cussed in Section 7, is to improve their sup- Shared HW Clock source port by eliminating the overhead and variable latency associated with the CTW. However, it is Clock synchr. TOD also important to note that this separation also Timekeeping Arch 1 HW hrtimers creates an opportunity to improve the CTW Tick ISR Clock event source HW behavior in supporting the remaining timers. Process acc. For example, using a coarser CTW granular- Arch 2 HW Profiling ity may lower overhead by reducing the num- ISR Clock event source HW ber of timers which are cascaded, given that an Jiffies even larger percentage of CTW timers would be Timer wheel Arch 3 HW canceled under an architecture supporting high ISR Clock event source HW resolution timers separately. However, while this is an interesting possibility, it is currently a speculation that must be tested. Figure 3: Linux time system + htimers + GTOD

6 Clock Event Source Abstraction 5 Generic Time-of-day

Just as it was necessary to provide a general The Generic Time-of-day subsystem (GTOD) abstraction for clock sources in order to move is a project led by John Stultz and was pre- a significant amount of code into the architec- sented at OLS 2005. Detailed information is ture independent area, a general framework for available from the OLS 2005 proceedings [5]. managing clock event sources is also required It contains the following components: in the architecture independent section of the source under the architecture described here. Clock event sources provide either periodic or • Clock source management individual programmable events. The manage- • Clock synchronization ment layer provides the infrastructure for regis- tering event sources and manages the distribu- • Time-of-day representation tion of the events for the various clock related services. Again, this reduces the amount of es- sentially duplicate code across the architectures GTOD moves a large portion of code out of and allows cross-architecture sharing of hard- the architecture-specific areas into a generic ware support code and the easy addition of non- management framework, as illustrated in Fig- standard clock sources. ure 3. The remaining architecture-dependent code is mostly limited to the direct hardware in- The management layer provides interfaces for terface and setup procedures. It allows simple hrtimers to implement high resolution timers sharing of clock sources across architectures and also builds the base for a generic dy- and allows the utilization of non-standard clock namic tick implementation. The management source hardware. GTOD is work in progress layer supports these more advanced functions and intends to produce set of changes which only when appropriate clock event sources have can be adopted step by step into the main-line been registered, otherwise the traditional peri- kernel. odic tick based behavior is retained. 340 • Hrtimers and Beyond: Transforming the Linux Time Subsystems

6.1 Clock Event Source Registration • resume: function which has to be called be- fore resume

Clock event sources are registered either by the • event_handler: function pointer which is architecture dependent boot code or at mod- filled in by the clock event management code. ule insertion time. Each clock event source This function is called from the event source fills a data structure with clock-specific prop- interrupt erty parameters and callback functions. The • start_event: function called before the clock event management decides, by using the event_handler function in case that the specified property parameters, the set of system clock event layer provides the interrupt han- functions a clock event source will be used to dler support. This includes the distinction of per- CPU and per-system global event sources. • end_event: function called after the event_handler function in case that the System-level global event sources are used for clock event layer provides the interrupt han- the Linux periodic tick. Per-CPU event source dler are used to provide local CPU functionality • irq: interrupt number in case the clock event such as process accounting, profiling, and high layer requests the interrupt and provides the in- resolution timers. The clock_event data terrupt handler structure contains the following elements: • priv: pointer to clock source private data structures • name: clear text identifier

• capabilities: a bit-field which describes The clock event source can delegate the inter- the capabilities of the clock event source and rupt setup completely to the management layer. hints about the preferred usage It depends on the type of interrupt which is as- sociated with the event source. This is possible • max_delta_ns: the maximum event delta for the PIT on the i386 architecture, for exam- (offset into future) which can be scheduled ple, because the interrupt in question is handled • min_delta_ns: the minimum event delta by the generic interrupt code and can be ini- which can be scheduled tialized via setup_irq. This allows us to com- pletely remove the timer interrupt handler from • mult: multiplier for scaled math conversion the i386 architecture-specific area and move the from nanoseconds to clock event source units modest amount of hardware-specific code into • shift: shift factor for scaled math conver- appropriate source files. The hardware-specific sion from nanoseconds to clock event source routines are called before and after the event units handling code has been executed.

• set_next_event: function to schedule the In case of the Local APIC on i386 and the next event Decrementer on PPC architectures, the inter- rupt handler must remain in the architecture- • set_mode: function to toggle the clock event specific code as it can not be setup through source operating mode (periodic / one shot) the standard interrupt handling functions. The • suspend: function which has to be called be- clock management layer provides the function fore suspend which has to be called from the hardware level 2006 Linux Symposium, Volume One • 341 handler in a function pointer in the clock source 6.4 Existing Implementations description structure. Even in this case the shared code of the timer interrupt is removed from the architecture-specific implementation At the time of this writing the base framework and the event distribution is managed by the code and the conversion of i386 to the clock generic clock event code. The Clock Events event layer is available and functional. subsystem also has support code for clock event The clock event layer has been successfully sources which do not provide a periodic mode; ported to ARM and PPC, but support has not e.g. the Decrementer on PPC or match regis- been continued due to lack of human resources. ter based event sources found in various ARM SoCs. 6.5 Code Size Impact

6.2 Clock Event Distribution The framework adds about 700 lines of code which results in a 2KB increase of the kernel binary size. The clock event layer provides a set of prede- fined functions, which allow the association of The conversion of i386 removes about 100 lines various clock event related services to a clock of code. The binary size decrease is in the range event source. of 400 bytes.

The current implementation distributes events We believe that the increase of flexibility and for the following services: the avoidance of duplicated code across archi- tectures justifies the slight increase of the bi- nary size. • periodic tick The first goal of the clock event implementation was to prove the feasibility of the approach. • process accounting There is certainly room for optimizing the size impact of the framework code, but this is an is- • profiling sue for further development. • next event interrupt (e.g. high resolution timers, dynamic tick) 6.6 Further Development

The following work items are planned: 6.3 Interfaces

• Streamlining of the code The clock event layer API is rather small. • Revalidation of the clock distribution de- Aside from the clock event source registration cisions interface it provides functions to schedule the next event interrupt, clock event source notifi- • Support for more architectures cation service, and support for suspend and re- sume. • Dynamic tick support 342 • Hrtimers and Beyond: Transforming the Linux Time Subsystems

6.7 State of Transformation clock event framework is the inclusion of a sin- gle line in the architecture specific Kconfig file.

The clock event layer adds another level of The next event modifications remove the im- abstraction to the Linux subsystem related to plicit but strong binding of hrtimers to jiffy time keeping and time-related activities, as il- tick boundaries. When the high resolution ex- lustrated in Figure 4. The benefit of adding the tension is disabled the clock event distribu- abstraction layer is the substantial reduction in tion code works in the original periodic mode architecture-specific code, which can be seen and hrtimers are bound to jiffy tick boundaries most clearly by comparing Figures 3 and 4. again.

Shared HW Clock source

Clock synchr. TOD 8 Implementation

Timekeeping Arch 1 HW hrtimers Shared HW Clock events HW While the base functionality of hrtimers re- ISR mains unchanged, additional functionality had Arch 2 HW Tick Event distribution to be added. HW Process acc.

Profiling • Management function to switch to high Arch 3 HW Jiffies resolution mode late in the boot process. ISR HW • Next event scheduling Timer wheel • Next event interrupt handler Figure 4: Linux time system + htimers + GTOD + clock events • Separation of the hrtimers queue from the timer wheel softirq

During system boot it is not possible to use the 7 High Resolution Timers high resolution timer functionality, while mak- ing it possible would be difficult and would serve no useful function. The initialization of The inclusion of the clock source and clock the clock event framework, the clock source event source management and abstraction lay- framework and hrtimers itself has to be done ers provides now the base for high resolution and appropriate clock sources and clock event support for hrtimers. sources have to be registered before the high resolution functionality can work. Up to the While previous attempts of high resolution point where hrtimers are initialized, the sys- timer implementations needed modification all tem works in the usual low resolution peri- over the kernel source tree, the hrtimers based odic mode. The clock source and the clock implementation only changes the hrtimers code event source layers provide notification func- itself. The required change to enable high tions which inform hrtimers about availability resolution timers for an architecture which is of new hardware. hrtimers validates the usabil- supported by the Generic Time-of-day and the ity of the registered clock sources and clock 2006 Linux Symposium, Volume One • 343 event sources before switching to high reso- The softirq for running the hrtimer queues lution mode. This ensures also that a kernel and executing the callbacks has been separated which is configured for high resolution timers from the tick bound timer softirq to allow ac- can run on a system which lacks the necessary curate delivery of high resolution timer signals hardware support. which are used by itimer and POSIX interval timers. The execution of this softirq can still be The time ordered insertion of hrtimers pro- delayed by other softirqs, but the overall laten- vides all the infrastructure to decide whether cies have been significantly improved by this the event source has to be reprogrammed when separation. a timer is added. The decision is made per timer base and synchronized across timer bases in a support function. The design allows the system Shared HW Clock source to utilize separate per-CPU clock event sources Clock synchr. TOD for the per-CPU timer bases, but mostly only one reprogrammable clock event source per- Timekeeping Arch 1 HW hrtimers CPU is available. The high resolution timer Shared HW Clock events HW does not support SMP machines which have Next event ISR only global clock event sources. Arch 2 HW Event distribution The next event interrupt handler is called from HW Process acc. the clock event distribution code and moves Profiling expired timers from the red-black tree to a Arch 3 HW Jiffies separate double linked list and invokes the ISR HW softirq handler. An additional mode field in the Timer wheel hrtimer structure allows the system to execute callback functions directly from the next event Figure 5: Linux time system + htimers + interrupt handler. This is restricted to code GTOD + clock events + high resolution timers which can safely be executed in the hard inter- rupt context and does not add the timer back to the red-black tree. This applies, for exam- 8.1 Accuracy ple, to the common case of a wakeup function as used by nanosleep. The advantage of exe- cuting the handler in the interrupt context is the avoidance of up to two context switches—from All tests have been run on a Pentium III the interrupted context to the softirq and to the 400MHz based PC. The tables show compar- task which is woken up by the expired timer. isons of vanilla Linux 2.6.16, Linux-2.6.16- The next event interrupt handler also provides hrt5 and Linux-2.6.16-rt12. The tests for inter- functionality which notifies the clock event dis- vals less than the jiffy resolution have not been tribution code that a requested periodic interval run on vanilla Linux 2.6.16. The test thread has elapsed. This allows to use a single clock runs in all cases with SCHED_FIFO and pri- event source to schedule high resolution timer ority 80. and periodic events e.g. jiffies tick, profiling, process accounting. This has been proved to Test case: clock_nanosleep(TIME_ work with the PIT on i386 and the Incrementer ABSTIME), Interval 10000 , on PPC. 10000 loops, no load. 344 • Hrtimers and Beyond: Transforming the Linux Time Subsystems

Kernel min max avg Kernel min max avg 2.6.16 24 4043 1989 2.6.16-hrt5 8 119 22 2.6.16-hrt5 12 94 20 2.6.16-rt12 12 78 16 2.6.16-rt12 6 40 10 Test case: POSIX interval timer, Interval 500 Test case: clock_nanosleep(TIME_ micro , 100000 loops, 100% load. ABSTIME), Interval 10000 micro seconds, 10000 loops, 100% load. Kernel min max avg 2.6.16-hrt5 16 489 58 Kernel min max avg 2.6.16-rt12 12 95 29 2.6.16 55 4280 2198 The real-time preemption kernel results are sig- 2.6.16-hrt5 11 458 55 nificantly better under high load due to the gen- 2.6.16-rt12 16 eral low latencies for high priority real-time Test case: POSIX interval timer, Interval 10000 tasks. Aside from the general latency opti- micro seconds, 10000 loops, no load. mizations, further improvements were imple- mented specifically to optimize the high reso- Kernel min max avg lution timer behavior. 2.6.16 21 4073 2098 2.6.16-hrt5 22 120 35 2.6.16-rt12 20 60 31 Separate threads for each softirq. Long lasting softirq callback functions e.g. in the Test case: POSIX interval timer, Interval 10000 networking code do not delay the delivery of micro seconds, 10000 loops, 100% load. hrtimer softirqs. Kernel min max avg 2.6.16 82 4271 2089 Dynamic priority adjustment for high reso- 2.6.16-hrt5 31 458 53 lution timer softirqs. Timers store the prior- 2.6.16-rt12 21 70 35 ity of the task which inserts the timer and the next event interrupt code raises the priority of Test case: clock_nanosleep(TIME_ the hrtimer softirq when a callback function ABSTIME), Interval 500 micro seconds, for a high priority thread has to be executed. 100000 loops, no load. The softirq lowers its priority automatically af- Kernel min max avg ter the execution of the callback function. 2.6.16-hrt5 5 108 24 2.6.16-rt12 5 48 7 9 Dynamic Ticks Test case: clock_nanosleep(TIME_ ABSTIME), Interval 500 micro seconds, 100000 loops, 100% load. We have not yet done a dynamic tick imple- mentation on top of the existing framework, but Kernel min max avg we considered the requirements for such an im- 2.6.16-hrt5 9 684 56 plementation in every design step. 2.6.16-rt12 10 60 22 The framework does not solve the general prob- Test case: POSIX interval timer, Interval 500 lem of dynamic ticks: how to find the next ex- micro seconds, 100000 loops, no load. piring timer in the timer wheel. In the worst 2006 Linux Symposium, Volume One • 345 case the code has to walk through a large num- 10 Conclusion ber of hash buckets. This can not be changed without changing the basic semantics and im- The existing parts and pieces of the overall so- plementation details of the timer wheel code. lution have proved that a generic solution for high resolution timers and dynamic tick is fea- The next expiring hrtimer is simply retrieved sible and provides a valuable benefit for the by checking the first timer in the time ordered Linux kernel. red-black tree. Although most of the components have been tested extensively in the high resolution timer On the other hand, the framework will deliver patch and the real-time preemption patch there all the necessary clock event source mecha- is still a way to go until a final inclusion into nisms to reprogram the next event interrupt and the mainline kernel can be considered. enable a clean, non-intrusive, out of the box, so- lution once an architecture has been converted In general this can only be achieved by a step by to use the framework components. step conversion of functional units and archi- tectures. The framework code itself is almost self contained so a not converted architecture The clock event functionalities necessary for should not have any impacts. dynamic tick implementations are available whether the high resolution timer functionality We believe that we provided a clear vision of is enabled or not. The framework code takes the overall solution and we hope that more de- care of those use cases already. velopers get interested and help to bring this further in the near future.

With the integration of dynamic ticks the trans- formation of the Linux time related subsystems 10.1 Acknowledgments will become complete, as illustrated in Fig- ure 6. We sincerely thank all those people who helped us to develop this solution. Help has been pro- vided in form of code contributions, code re- Shared HW Clock source views, testing, discussion, and suggestions. Es- pecially we want to thank Ingo Molnar, John Clock synchr. TOD Stultz, George Anzinger, Roman Zippel, An- Timekeeping Arch 1 HW hrtimers drew Morton, Steven Rostedt and Benedikt Shared HW Clock events HW Spranger. A special thank you goes to Jonathan Next event ISR Corbet who wrote some excellent articles about Dynamic tick Arch 2 HW Event distribution hrtimers (and the previous ktimers) implemen- HW tation [2, 3]. Process acc.

Profiling Arch 3 HW Jiffies ISR HW References Timer wheel [1] George Anzinger and Monta Vista. High Figure 6: Transformed Linux Time Subsystem resolution timers home page. 346 • Hrtimers and Beyond: Transforming the Linux Time Subsystems

http://high-res-timers. sourceforge.net.

[2] J. Corbet. Lwn article: A new approach to kernel timers. http: //lwn.net/Articles/152436.

[3] J. Corbet. Lwn article: The high resolution timer api. http: //lwn.net/Articles/167897.

[4] B. Srinivasan, S. Pather, R. Hill, F. Ansari, and D. Niehaus. A firm real-time system implementation using commercial off-the shelf hardware and free software. In 4th Real-Time Technology and Applications Symposium, Denver, June 1998.

[5] J. Stulz. We are not getting any younger: A new approach to timekeeping and timers. In Ottawa Lnux Symposium, Ottawa, Ontario, Canada, July 2005. Proceedings of the Linux Symposium

Volume One

July 19th–22nd, 2006 Ottawa, Ontario Canada Conference Organizers

Andrew J. Hutton, Steamballoon, Inc. C. Craig Ross, Linux Symposium

Review Committee

Jeff Garzik, Red Hat Software Gerrit Huizenga, IBM Dave Jones, Red Hat Software Ben LaHaise, Intel Corporation Matt Mackall, Selenic Consulting Patrick Mochel, Intel Corporation C. Craig Ross, Linux Symposium Andrew Hutton, Steamballoon, Inc.

Proceedings Formatting Team

John W. Lockhart, Red Hat, Inc. David M. Fellows, Fellows and Carr, Inc. Kyle McMartin

Authors retain copyright to all submitted papers, but have granted unlimited redistribution rights to all as a condition of submission.