Challenges of Hard Real-Time OS
These slides will be uploaded to http://www.ertl.jp/~hiro/tmp/100905.pdf

Hiroaki Takada 1 Challenges of Hard Real-Time OS

Self Introduction
Hard Real-Time Systems: Main Focus of this Lecture
Necessity and Problems of Multiprocessor Systems
▶ Shared resource contentions
Classification of Multiprocessors
▶ Classifications from SW and HW points of view
▶ SMP, FDMP, LCMP
RTOS for Multiprocessors
RTOS for FDMP
▶ Necessity, Functionality, Implementation Issues
RTOS for SMP
▶ Functionality, Our Approach
Energy Consumption Optimization
Concluding Remarks


Current Positions
▶ Professor, Nagoya University
▶ Executive Director, Center for Embedded Computing Systems (NCES), Nagoya University
▶ Chairman, TOPPERS Project and several others
Major Research Topics
▶ Real-time operating systems for embedded systems
▶ Real-time scheduling and analysis
▶ Electronic system-level design of embedded systems
▶ Automotive embedded systems
! Several joint projects with Toyota Motor Corp. and other Japanese automotive companies


Nagoya
▶ Center city of the third largest metropolitan area in Japan
  ▶ Tokyo (incl. Yokohama), Osaka, Nagoya, …
▶ Located around the center of the Japanese main island (between Tokyo and Osaka)
▶ Manufacturing industry center of Japan
  ▶ Automotive industries are especially concentrated here.
  ▶ The headquarters of Toyota Motor Corp. (located in Toyota City) is near Nagoya.
Nagoya University
▶ National university located in Nagoya City
▶ Within top 10 (I hope top 5!) universities of Japan


NCES = Nagoya Univ. Center for Embedded Computing Systems
Objectives
▶ To establish a research and educational hub for embedded systems that satisfies strong industrial demand for technologies and human resources.
Scope of NCES Activities
▶ In collaboration with academia, industry, and/or government, NCES is involved in the following activities related to embedded systems:
  ▶ applied research aimed at practical use, based on the fundamental research at the university
  ▶ development of prototype software
  ▶ education/training of embedded system engineers


Projects funded by Industries
▶ OS for in-vehicle multimedia systems (TMC)
▶ Next-generation automotive network (AutoNetworks Technologies, Ltd.)
▶ Analysis and design of real-time task scheduling for automotive integrated control systems (TMC)
▶ Fault-tolerant design support through architecture description of automotive systems (TCRL)
Projects funded by Government
▶ Energy consumption optimization of embedded systems (JST CREST)
▶ Automotive software platform conforming to the functional safety standard (METI)


TOPPERS = Toyohashi Open Platform for Embedded and Real-Time Systems
Objectives of the Project
▶ To develop various open-source software for embedded systems, including RTOS, and to promote their use.
Building a widely used open-source OS as in the area of embedded systems!
Main Activities of the Project
▶ Building a definitive µITRON-conformant RTOS
▶ Developing a next-generation RTOS technology
▶ Developing software development technology and tools for embedded systems
▶ Fostering embedded system engineers


! All SW listed below can be downloaded from the TOPPERS website at http://www.toppers.jp/.
TOPPERS/JSP Kernel (JSP = Just Standard Profile)
▶ RTOS conformant to the standard profile of the µITRON4.0 specification
TOPPERS/ATK1 (ATK = Automotive Kernel)
▶ RTOS conformant to the OSEK/VDX OS specification
TOPPERS/ASP Kernel (ASP = Advanced Standard Profile)
▶ Improvement of the JSP kernel
▶ Basis of the TOPPERS new-generation kernels
TOPPERS/FMP Kernel (FMP = Flexible Multiprocessing)
▶ Extension of the ASP kernel to various types of multiprocessor systems


TECS (TOPPERS Embedded Component System)
▶ Specification and tools for component-based development of embedded software.
TINET
▶ Compact TCP/IP protocol stack conformant to the ITRON TCP/IP API specification.
▶ Both IPv4 and IPv6 are supported.
TLV (TraceLogVisualizer)
▶ Customizable tool to visualize various trace logs, including the trace log of an RTOS.
Several Open Educational Materials
▶ Educational materials including presentation slides, software, and so on.

Consumer Applications

IPSiO GX e3300 (Ricoh) PM-A970 (EPSON)

DO!KARAOKE (PANASONIC)

UA-101 (Roland) GT-541 (Brother)

Industrial and Other Applications

Kizashi (SUZUKI) AP-X (Kyowa MEDIX)

OSP-P200 (Okuma)

DP-350 (Daihen) ASTRO-H (JAXA) ... under development


Definition of Hard Real-Time Systems
▶ If a timing constraint of the system is violated, some catastrophic event (such as loss of life) occurs.
Examples of Hard Real-Time Systems
▶ Automotive control systems
  ▶ Engine management, Brake control, Steering control, Airbag control, …
▶ Railway control systems
▶ Aerospace applications
▶ Process control systems
▶ Robotics (in Future?)
▶ …


System Components
▶ control computer (ECU)
▶ many sensors
  ▶ crank position sensor
  ▶ air flow meter
  ▶ intake temperature sensor
  ▶ throttle sensor
▶ some actuators
Basic Functions of the Control System
▶ Calculates the fuel injection volume and the ignition timing, and controls the actuators, in every rotation cycle.

Courtesy: Toyota Motor Corp.

Timing Behavior of Engine Management System
▶ When the rotation speed is 6000 rpm, one cycle is 20 ms.
▶ The timing precision of the ignition is on the order of 10 µs.
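
The two figures above can be checked with a little arithmetic. The sketch below assumes that "one cycle" means one four-stroke engine cycle, that is, two crankshaft revolutions (this reading is an assumption, chosen because it reproduces the 20 ms on the slide). The function names are hypothetical.

```c
#include <assert.h>

/* Duration of one four-stroke engine cycle (two crankshaft revolutions),
 * in microseconds, for a rotation speed given in rpm. */
static long engine_cycle_us(long rpm)
{
    assert(rpm > 0);
    /* One revolution takes 60 s / rpm; one cycle is two revolutions. */
    return 2L * 60L * 1000000L / rpm;
}

/* Crank angle swept during dt_us microseconds, in thousandths of a degree. */
static long crank_angle_mdeg(long rpm, long dt_us)
{
    assert(rpm > 0 && dt_us >= 0);
    /* rpm * 360 deg / 60 s = rpm * 6 deg/s = rpm * 6 / 1000 mdeg/us. */
    return rpm * 6L * dt_us / 1000L;
}
```

Under this assumption, 6000 rpm gives a 20000 µs cycle, and the crankshaft turns about 0.36 degrees in 10 µs, which gives a feel for why the ignition timing precision is on that order.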

Courtesy: Toyota Motor Corp.

Required Real-Time Property (Example)
▶ The calculation of the fuel injection volume must be finished before the injection timing.
▶ The calculation of the ignition timing must be finished before the ignition timing.
▶ There is no additional value even if these calculations finish earlier.
Safety Requirement (Example)
▶ An ignition must never be missed, because unburned flammable gas is then emitted from the engine and can lead to a fire (the catalyst burns).
▶ If the ignition plug of a cylinder is broken, fuel must not be injected into that cylinder.


! Use of more than one processor in an embedded system is NOT new.
Conventional use of multiprocessors in embedded systems
▶ Large-scale & high-performance embedded systems
▶ Subsystems with different requirements (e.g. mechanical control & GUI)
▶ Integration of independently developed subsystems
▶ Processors embedded in components (e.g. sound chip)
! However, use of multiple processors in a system usually increases the product cost.
On-chip multiprocessors changed this situation
▶ On-chip multiprocessors can often reduce cost.


To achieve both high performance and low power (or energy consumption) simultaneously
Limitation of performance improvement of a (single) processor
Higher-performance processors are less energy efficient.
▶ Clock frequency improvement has relied on transistor scaling so far, but we cannot expect much in the future.
  ▶ signal integrity, process variation, leakage power, etc.
▶ Exploiting more ILP (instruction-level parallelism) results in increasing power consumption.
  ▶ deep pipelines, wide issue, out-of-order execution, branch prediction, register renaming, speculation, etc.
▶ A larger number of slower processors can achieve higher energy efficiency.


Energy Efficiency is Getting Worse

Source: Chris Rowen, “The Reinvention of the Microprocessor for MPSOC,” MPSoC 2006.


Software development becomes difficult
▶ The programming model is different.
  ▶ Parallel programming is much more difficult than sequential programming.
▶ Reuse of existing software is not easy.
  ▶ Realization of mutual exclusion is different.
! Software engineers do not like to use multiprocessors.
Inherent Unpredictability: a serious difficulty for hard real-time systems
▶ Shared resource contentions
▶ Cache misses
  ▶ Cache memory is a MUST for SMP (Symmetric Multiprocessor).


Example 1: Contention for shared bus (or memory)

Figure: processor 1, processor 2, …, processor N connected through a shared bus to a shared memory.

Assumptions
▶ 10 ns for each memory access
▶ round-robin arbitration
▶ The task accesses the shared memory 10000 times during an execution and finishes execution in 1 ms without any shared bus contention.
▶ WCET (worst-case execution time) considering shared bus contentions is 1 ms + 10 ns × (N−1) × 10000.
  ▶ 1.1 ms if N=2
  ▶ 1.3 ms if N=4
… acceptable or too pessimistic?


Example 2: Contention for spinlock (or shared variable)

Figure: processor 1, processor 2, …, processor N sharing a variable guarded by a spinlock.

Assumptions
▶ The task locks the spinlock 10 times during an execution (for accessing the shared variable) and finishes execution in 1 ms without any spinlock contention.
▶ The spinlock is locked for 10 µs.
▶ FCFS-ordered spinlock
▶ WCET considering spinlock contentions is 1 ms + 10 µs × (N−1) × 10.
  ▶ 1.1 ms if N=2
  ▶ 1.3 ms if N=4
… same as Example 1
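
Both examples use the same bound: every access may have to wait once for each of the other N−1 processors. A minimal sketch of that formula follows; the function name and the choice of nanoseconds as the unit are assumptions made for illustration.

```c
#include <assert.h>

/* Analyzed WCET in nanoseconds when each of `accesses` shared-resource
 * accesses may wait for all other processors once (round-robin bus
 * arbitration or an FCFS-ordered spinlock).
 *   base_ns : execution time without any contention
 *   hold_ns : time one access (or one lock holding) occupies the resource */
static long long wcet_with_contention_ns(long long base_ns, long long hold_ns,
                                         long long accesses, int n_procs)
{
    assert(n_procs >= 1);
    return base_ns + hold_ns * (n_procs - 1) * accesses;
}
```

With the numbers of Example 1 (1 ms, 10 ns, 10000 accesses) and Example 2 (1 ms, 10 µs, 10 lock acquisitions), the function returns 1.1 ms for N=2 and 1.3 ms for N=4 in both cases. The analyzed bound is identical even though the two situations behave very differently in practice.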


Distribution of Execution Times

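Figure: probability density P (the probability that the exec. time of the program is t) plotted against exec. time t. Marked on the t axis: min exec. time, average exec. time, true max exec. time, and analyzed max exec. time. Annotations: "cannot be known", "can be unbounded". If the difference between the analyzed max exec. time and the true max exec. time is large, the analysis is said to be pessimistic.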


p-reliable WCET (Probabilistic Real-Time Guarantee)

Figure: (1 − P) on a logarithmic axis (1, 0.1, 10^-2, 10^-3, …) plotted against exec. time t, where P is the probability that the exec. time of the program is less than t. Marked on the t axis: min exec. time, average exec. time, p-reliable max exec. time (determined by the required reliability p), true max exec. time, and analyzed max exec. time.
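
To make the idea of a p-reliable execution time concrete, the sketch below picks the smallest observed execution time that covers a fraction p of a set of measurements. This is only an empirical illustration under stated assumptions: the function names are hypothetical, p is passed as the fraction `p_num / p_den`, and a finite set of measurements does not by itself give a probabilistic guarantee.

```c
#include <assert.h>
#include <stdlib.h>

/* Comparison function for qsort: ascending order of long values. */
static int cmp_long(const void *a, const void *b)
{
    long x = *(const long *)a, y = *(const long *)b;
    return (x > y) - (x < y);
}

/* Smallest observed time t such that at least p_num/p_den of the samples
 * are less than or equal to t. Sorts `samples` in place. */
static long p_reliable_time(long *samples, size_t n, size_t p_num, size_t p_den)
{
    size_t k;
    assert(n > 0 && p_den > 0 && p_num > 0 && p_num <= p_den);
    qsort(samples, n, sizeof samples[0], cmp_long);
    k = (p_num * n + p_den - 1) / p_den;   /* ceil(p * n) */
    return samples[k - 1];
}
```

For ten measurements 1 to 10 and p = 9/10, the result is 9: one measurement in ten exceeds it. Raising p toward 1 moves the result toward the largest observed time.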


Exec. Time Distributions of the Example 1 & 2

Figure: probability density P plotted against exec. time t, with one curve for spinlock contentions (Example 2) and one for shared bus contentions (Example 1). Marked on the t axis: min exec. time, average exec. time, true max exec. time, and analyzed max exec. time.


Performance Metrics for Hard Real-Time Systems
▶ The appropriate performance metric depends on the application.
When Analyzed WCET is Used
▶ Accept the pessimism or decrease shared resource accesses.
When p-reliable WCET is Used
▶ Shared bus contentions can be handled with a probabilistic real-time guarantee.
▶ Spinlock contentions again depend on the application. → discussed in detail later


Kind of Processors
▶ Homogeneous
▶ Heterogeneous
  ▶ Mild … common basic instruction set + extensions
  ▶ Strong … different instruction sets
▶ Multi-threading
Interconnections among Processors
▶ Shared memory (tightly-coupled)
  ▶ Symmetric or UMA (uniform memory access time)
  ▶ Distributed or NUMA (non-uniform memory access time)
▶ Message passing (loosely-coupled, distributed system)
SMP and AMP
▶ SMP (symmetric multiprocessor): homogeneous and symmetric shared memory
▶ AMP (asymmetric multiprocessor) or ASMP: others


Shared memory (tightly-coupled)
▶ Symmetric (SMP)
  ▶ Each processor has the same role.
  ▶ Each task can be executed on any processor.
  ▶ Tasks are (statically or dynamically) distributed to processors to obtain maximum performance (load balancing).
▶ Function-distributed (FDMP)
  ▶ Processors have different roles.
  ▶ Each task is fixed to a processor and cannot migrate to others.
Message passing (loosely-coupled, distributed system)
▶ Symmetric
▶ Function-distributed


Definition (again) and Characteristics
▶ In HW, each processor can symmetrically access (almost) all resources.
▶ In SW, each task can be executed on any processor.
▶ HW SMP is necessary to realize SW SMP.
▶ SW FDMP (function distribution) can be implemented on HW SMP.
▶ A multi-threading processor is basically the same as SMP from the SW point of view.
  ▶ Except that simultaneous multi-threading (SMT) differs from SMP when considering the affinity among tasks.


SMP Example: ARM MPCORE

Figure: block diagram of ARM MPCore, annotated with:
▶ interrupt controller for distributing interrupts among processors
▶ peripherals (timer, etc.) for each processor
▶ up to 4 ARM11 processors + L1 cache
▶ cache coherence controller

Source: Web site of ARM


Application Area and Advantages
▶ Widely used for PCs and servers.
▶ Effective when the workload changes dynamically.
▶ Performance can be raised to some extent without tuning of HW and SW. As a result, reusability of HW and SW is high.
▶ The chip can be general-purpose (advantage of HW SMP).
Disadvantages
▶ Pessimistic analyzed WCET, because most resources are shared among processors.
▶ High cost and energy consumption, because a coherent cache and a high-speed interconnect are necessary to obtain high performance.


Definition (again)
▶ Processors have different roles.
▶ Each task is fixed to a processor.
Characteristics
▶ The HW architecture can be optimized considering the role of each processor. The optimization result is often an AMP architecture.
  ▶ Memory and peripherals are connected to the local bus of a processor.
  ▶ An acceleration HW unit or coprocessor is added to a processor.
  ▶ A type of processor suitable for its role is used.


FDMP Example 1: Toshiba MPEG2 Codec LSI ▶ 6 MeP (Media embedded processor) cores with different acceleration HW units are used.

Figure: block diagram of the LSI. MeP modules for system control, video CODEC, audio CODEC, video filter, motion estimation, and bit-stream processing, each built around an MeP core, are connected by a global bus together with the SDRAMC, host IF, and JTAG debug IF. Extensions labeled in the diagram: DCT/Q/MC HWE, SIMD, VLC, audio VLIW, UCI, DSP coprocessor, video filter HWE, block match HWE, bit-stream HWE.

Source: Web site of Toshiba


FDMP Example 2: TI OMAP 1710 ▶ 1 general-purpose processor (ARM926) and 1 DSP (TMS320C55x) are used.

Source: Web site of TI


Advantages
▶ Analyzed WCET is tight (less pessimistic), because shared resources are limited.
▶ Low energy consumption, because the coherent cache and high-speed interconnect can be omitted.
Disadvantages
▶ Tuning of both HW and SW is necessary to obtain the above advantages. This results in lower reusability.
▶ The software designer must assign tasks to processors.
▶ Dynamic load balancing is not easy (but possible).
▶ The chip becomes special-purpose (disadvantage of HW AMP).


Definition (again) and Characteristics
▶ In HW, there is no shared memory among processors.
▶ In SW, only message communication is used among processors.
  ▶ When a task accesses a resource attached to another processor, an access request is sent to that processor.
▶ Even if the HW has a shared memory, when the shared memory is used only for implementing message passing, the system is LCMP from the SW point of view.
▶ Latency of inter-processor communication is longer than in a tightly-coupled multiprocessor. Bandwidth is not necessarily narrow.


OS Support for LCMP
▶ An OS supporting LCMP can be classified as a distributed OS.
▶ Distributed OS was intensively studied, but was not successful. Why?
  ▶ Under long communication latency, a request of larger granularity can achieve higher performance.
  ▶ An OS request is too small.
▶ Implementing remote requests in an upper layer is successful.
  ▶ RPC (remote procedure call)
  ▶ Distributed object framework (CORBA, SOAP)
! Excluded from the scope of the following discussions.


Ideal RTOS for Multiprocessor
▶ From the application SW point of view, it is ideal that existing application software running on a uniprocessor RTOS can run on a multiprocessor RTOS without modification.
▶ In other words, a multiprocessor RTOS should be compatible with a uniprocessor RTOS, except that more than one task is executed in parallel.
Difficulty
▶ Parallel execution of tasks is a great difference!
▶ Much existing application SW does not work under parallel execution.


General-Purpose OS
▶ Most general-purpose OSes, such as Windows and Linux, support SMP.
▶ These OSes take care of load balancing.
▶ Though there are real-time extensions of these OSes, they are not suitable for hard real-time systems.
RTEMS (Real-Time Executive for Multiprocessor Systems)
▶ RTEMS supports multiprocessor systems, as its recent name shows.
▶ Only FDMP is supported (no task migration).
▶ Implemented with the remote invocation method (explained in the next section).


TOPPERS/FMP Kernel
▶ The TOPPERS/FMP kernel supports both FDMP and SMP.
▶ Task migration is supported (explained later).
▶ Implemented with the direct access method.
AUTOSAR OS Specification
▶ A multiprocessor extension is included in the latest release (Release 4.0).
▶ Only FDMP is supported (no task migration).
▶ Can be implemented both with the remote invocation method and with the direct access method.
Others
▶ Many commercial OS, including VxWorks, OSE, and -X, now support multiprocessor systems.


Using uniprocessor RTOS for FDMP design
▶ An FDMP system can be designed using a uniprocessor RTOS on each processor.
▶ In this case, communication with another processor is implemented at the application level.
▶ A problem of this approach:
  ▶ When some task is moved to another processor at design time, the inter-processor communication must be re-designed and re-implemented.
  ▶ SW reusability is low.
With FDMP RTOS …
▶ Tasks on different processors can communicate with the normal RTOS API.
▶ Existing SW for a uniprocessor is easier to port to a multiprocessor.


Object Management Concept
▶ Each OS object (task, semaphore, mailbox, …) belongs to one of the processors.
▶ A task is executed only by the processor to which it belongs.
▶ All tasks can access all OS objects with the same API.

Figure: processor 1 and processor 2, each with its own tasks, I/O, local memory, and synchronization/communication objects.


Task Scheduling
▶ Tasks are scheduled independently on each processor, with the same scheduling algorithm as a uniprocessor RTOS.
▶ No dynamic task migration.
→ The technique of realizing mutual exclusion among tasks by raising priority (typically, the priority ceiling protocol) cannot be used across processor boundaries.
Interrupt Processing
▶ Each interrupt is assigned to one of the processors.
▶ The interrupt handler is executed by the processor to which the interrupt is assigned.
▶ The function to disable interrupts is valid only within the processor.
→ Mutual exclusion among tasks on different processors cannot be realized by disabling interrupts.


Mutual Exclusion by Disabling Interrupts
▶ Realizing mutual exclusion (among tasks and interrupt handlers) by disabling interrupts is widely used in existing application SW for uniprocessors.
▶ This technique cannot be used across processor boundaries.
▶ A function to disable interrupts on all processors is not a solution, because …
  ▶ When a processor disables interrupts on all processors, another processor is possibly executing an interrupt handler!
  ▶ In order to realize mutual exclusion among processors, a processor must wait until the other processors complete the execution of their interrupt handlers, in addition to disabling interrupts on all processors.


How Does Application SW Realize Mutual Exclusion?
▶ Mutual exclusion among tasks
  ▶ With the semaphore or mutex function of the RTOS, mutual exclusion among any tasks can be realized.
  ▶ With the functions to disable interrupts and to temporarily raise priority, only mutual exclusion among tasks on the same processor can be realized.
▶ Mutual exclusion between a task and an interrupt handler
  ▶ With the function to disable interrupts, mutual exclusion between a task and an interrupt handler on the same processor can be realized.
  ▶ In order to support mutual exclusion between a task and an interrupt handler on different processors, a new function (such as an application-level spinlock) is necessary.


Implementation Policy
▶ To exploit the advantage of FDMP, as many tasks as possible should be executed within a processor (without accessing remote resources).
→ OS-internal data structures (such as object management blocks and ready queues) should be prepared for each processor.

Figure: processor 1 and processor 2, each with its own tasks, I/O, local memory, and synchronization/communication objects.


Two Implementation Methods
▶ Operations on a remote object can be implemented with two methods.
  ▶ Direct access method
  ▶ Remote invocation method
Direct Access Method
▶ Operations on a remote object are realized by directly accessing the object management block located on a local shared memory.
▶ Mutual exclusion among processors is necessary to avoid parallel access to the object management block.
▶ When task switching is necessary on the target processor, an interrupt request is sent to the processor (called IPI: inter-processor interrupt).


Remote Invocation Method
▶ Operations on a remote object are realized by requesting the operations from the processor to which the object belongs.
▶ Whenever an operation is requested from a processor, an interrupt request is sent to that processor.
▶ The requesting processor must wait for the completion of the operation to obtain the result.
Direct Access vs. Remote Invocation
▶ When shared memory access latency is short, the direct access method is advantageous.
▶ Remote invocation can be realized as middleware.
! We focus on the direct access method from here on.
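
The direct access method can be sketched as follows for a semaphore signal operation. This is an illustrative sketch, not the TOPPERS/FMP implementation: the structure `sem_cb_t`, the function `sig_sem_direct`, and the stub `send_ipi` are hypothetical names, the waiting tasks are assumed to belong to the processor that owns the semaphore, and intra-processor synchronization (disabling interrupts) is left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical object management block, placed in shared memory. */
typedef struct {
    atomic_flag lock;    /* guards the fields below across processors */
    int owner_prc;       /* processor to which the semaphore belongs */
    int count;           /* available resources */
    int waiting_tasks;   /* number of tasks blocked on the semaphore */
} sem_cb_t;

/* Stub standing in for the hardware: records the last IPI target. */
static int ipi_sent_to = -1;
static void send_ipi(int prc) { ipi_sent_to = prc; }

/* Signal a semaphore from processor my_prc with the direct access method. */
static void sig_sem_direct(sem_cb_t *s, int my_prc)
{
    bool need_dispatch = false;
    assert(my_prc >= 0);

    /* Inter-processor mutual exclusion on the object management block. */
    while (atomic_flag_test_and_set_explicit(&s->lock, memory_order_acquire)) {
        /* spin */
    }
    if (s->waiting_tasks > 0) {
        s->waiting_tasks--;          /* release one waiting task */
        need_dispatch = true;
    } else {
        s->count++;
    }
    atomic_flag_clear_explicit(&s->lock, memory_order_release);

    /* Task switching on a remote processor is requested with an IPI. */
    if (need_dispatch && s->owner_prc != my_prc)
        send_ipi(s->owner_prc);
}
```

The requesting processor finishes the operation itself and only interrupts the target processor when a task switch is needed there. With the remote invocation method, every operation would interrupt the target and the requester would wait for the reply.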


Necessity of Spinlocks
▶ When a processor accesses OS-internal data structures within the OS, spinlocks are usually used for realizing mutual exclusion.
▶ Spinlock: busy-waiting lock for mutual exclusion (a task repeatedly checks the lock in a loop and waits until the lock becomes available)
▶ Because blocking locks (semaphore, mutex, …) are realized by the OS, they cannot be used for implementing the OS itself!


Test&Set Lock: A Typical Spinlock Algorithm
▶ Simplest version using an atomic test&set instruction:
    while (test&set(L) == Locked) { /* spin */ }
    /* critical section */
    L = Released;
▶ Simple and widely used.
▶ Not appropriate for hard real-time systems, because the time until a processor can acquire a lock is unbounded.
Controlling the Lock Acquisition Order
▶ In hard real-time systems, the order in which processors acquire the lock must be controlled.
  ▶ Ticket-based spinlock algorithms
  ▶ Queue-based spinlock algorithms
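
As one example of controlling the acquisition order, a ticket-based spinlock can be sketched with C11 atomics. The type and function names are hypothetical, and the sketch ignores interrupt handling, which is discussed on the following slides.

```c
#include <assert.h>
#include <stdatomic.h>

typedef struct {
    atomic_uint next_ticket;   /* ticket handed to the next arriving processor */
    atomic_uint now_serving;   /* ticket currently allowed to hold the lock */
} ticket_lock_t;

static void ticket_lock_init(ticket_lock_t *l)
{
    atomic_init(&l->next_ticket, 0);
    atomic_init(&l->now_serving, 0);
}

/* Take a ticket and spin until it is served; returns the ticket number. */
static unsigned ticket_lock_acquire(ticket_lock_t *l)
{
    unsigned my = atomic_fetch_add_explicit(&l->next_ticket, 1,
                                            memory_order_relaxed);
    while (atomic_load_explicit(&l->now_serving, memory_order_acquire) != my) {
        /* spin */
    }
    return my;
}

/* Pass the lock to the holder of the next ticket. */
static void ticket_lock_release(ticket_lock_t *l)
{
    /* Sanity check: the lock must currently be held. */
    assert(atomic_load(&l->now_serving) != atomic_load(&l->next_ticket));
    atomic_fetch_add_explicit(&l->now_serving, 1, memory_order_release);
}
```

Because tickets are served in the order they were taken, a processor waits for at most N−1 critical sections when N processors contend. This FCFS property is what makes the waiting time boundable, unlike the plain test&set lock.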


Queue-based Spinlock Algorithms
▶ Processors waiting for a lock form a queue.
▶ Many variations.
  ▶ MCS lock, …
▶ The queue can be in FCFS order or priority order, depending on the algorithm.
▶ Some algorithms support dequeueing (called timeout or preemption).
Spinlock Waiting Time
▶ A processor is just spinning while waiting for a spinlock.
▶ The spinlock waiting time is completely wasted, and thus should be shortened. More harmful than priority inversion!


Giant Lock
▶ All data structures are guarded with a single lock.
▶ Only one processor can execute the OS at a time.
▶ The chance of parallel execution is limited.
▶ Can be effective when the number of processors is small.
Fine Grained Lock
▶ Data structures are divided into some classes and each class is guarded with one lock.
▶ The average spinlock waiting time is decreased.
▶ However, the analyzed maximum spinlock waiting time cannot be decreased (and is often increased).
▶ Depending on the OS functionality and the lock unit design, two or more spinlocks need to be obtained in a nested manner (nested spinlock).
  ▶ This increases the worst-case execution time.
  ▶ Deadlock avoidance may be necessary.


Inter- and Intra-Processor Synchronizations
▶ OS-internal data structures must be guarded both from other tasks on the same processor (intra-processor sync.) and from tasks on other processors (inter-processor sync.).
▶ Inter-processor sync. is realized by acquiring a spinlock.
▶ Intra-processor sync. is realized by disabling interrupts.
Spinlock Acquisition vs. Disabling Interrupts
▶ Spinlock acquisition and disabling interrupts should be atomic, because …
  ▶ If a spinlock is acquired first, other processors wastefully wait for the spinlock during an interrupt service.
  ▶ If interrupts are disabled first, the worst-case interrupt latency becomes long.


Spinlock with Preemption
▶ The test&set lock can be easily modified to handle this situation.
    disable_interrupts;
    while (test&set(L) == Locked) {
        if (interrupt_requested) {
            enable_interrupts;
            /* interrupts are serviced here. */
            disable_interrupts;
        }
    }
    /* critical section */
    L = Released;
    enable_interrupts;
▶ Queue-based spinlocks can also be modified (queue-based spinlock with preemption/timeout).


Nested Spinlock with Preemption
▶ When two spinlocks must be acquired in a nested manner, the solution becomes more complex.
    retry:
    disable_interrupts;
    while (test&set(L1) == Locked) { /* same as the previous slide */ }
    while (test&set(L2) == Locked) {
        if (interrupt_requested) {
            L1 = Released; /* L1 should be released */
            enable_interrupts;
            /* interrupts are serviced here. */
            disable_interrupts;
            goto retry;
        }
    }
    /* critical section */
    L2 = Released;
    L1 = Released;
    enable_interrupts;


Is this Retry OK for Hard Real-Time Systems?
▶ An unbounded number of retries is usually harmful for hard real-time systems.
▶ In this case, however, a retry happens only when an interrupt is serviced, so the schedulability test remains possible by adding the overhead of the retry to the WCET of the interrupt handler.


Queue-Based Nested Spinlock with Preemption
▶ Queue-based nested spinlock with preemption raises several new problems.
  ▶ With a simple application of FCFS-order spinlocks, the worst-case waiting time becomes O(n^m), where n is the number of processors and m is the maximum nest level. A totally FCFS approach and a priority-inheritance spinlock are necessary to solve this.
  ▶ When an interrupt is serviced during spinning, should the position within the queue be canceled or preserved?
  ▶ When a spinlock must be acquired within the interrupt handler, how should it be handled?
▶ A solution exists, but is very complex. The overhead is also large.
▶ We are trying to implement this in HW.


▶ When an FDMP RTOS is implemented with the remote invocation method, spinlocks are not necessary because there is no shared data structure.
▶ The intra- vs. inter-processor problem also exists, in a different form, as follows.
  ▶ With the remote invocation method, requests from other processors are notified as interrupts (IPI).
  ▶ Which should be assigned higher priority: IPI or local interrupts from I/O devices?
  ▶ If IPI is given a higher priority, the worst-case interrupt latency of local interrupts becomes long.
  ▶ If local interrupts are given higher priorities, requesting processors wastefully wait for local interrupt handling.
▶ This problem is more difficult to solve.
  ▶ An application-level solution will be necessary.


Object Management Concept
▶ All OS objects (task, semaphore, mailbox, …) are global (do not belong to any processor).
▶ The OS dynamically determines on which processor a task is executed.

Figure: processor 1, processor 2, …, processor n sharing global tasks, I/O, memory, and synchronization/communication objects.

Hiroaki Takada 65 Challenges of Hard Real-Time OS

Task Scheduling
▶ A straightforward extension of priority-based scheduling is to always execute the top n high-priority tasks (n: the number of processors).
▶ The function to bind a task to a processor (processor affinity) is very useful.
▶ This extension is not good, because …
  ▶ Poor schedulability.
  ▶ Too many task migrations can happen.
Interrupt Processing: Two Approaches
▶ Each interrupt is statically assigned to a processor.
▶ The processor serving an interrupt request is dynamically determined.
  ▶ The processor that accepts the interrupt first.
  ▶ The processor that executes the lowest-priority task.

Hiroaki Takada 66 Challenges of Hard Real-Time OS

An Example of Too Many Task Migrations
▶ Suppose the following task set consisting of 4 tasks.
  ▶ Task A: high priority, bound to processor 1, activated frequently and finishes in a short time
  ▶ Task B: high priority, bound to processor 2, activated frequently and finishes in a short time
  ▶ Task C: middle priority, unbound
  ▶ Task D: low priority, unbound
▶ When task A is activated, task C is migrated to processor 2. When task B is activated, task C goes back to processor 1.
▶ Static task allocation can be better.
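
The migration pattern in this example can be reproduced with a small simulation. The sketch below is a simplified model, not an actual RTOS scheduler: the types and the `schedule` function are hypothetical, processors are indexed from 0 (so processor 1 on the slide is index 0), and the task array is assumed to be sorted from highest to lowest priority.

```c
#include <assert.h>

#define NPROC   2
#define NTASK   4
#define UNBOUND (-1)
#define IDLE    (-1)

typedef struct {
    int prio;       /* smaller value = higher priority (array is pre-sorted) */
    int affinity;   /* processor index, or UNBOUND */
    int ready;      /* nonzero if the task wants to run */
    int last_prc;   /* processor it ran on last, or IDLE if it never ran */
} task_t;

/* Assign ready tasks to processors in priority order and return the number
 * of tasks that start running on a different processor than last time. */
static int schedule(task_t tasks[NTASK], int running[NPROC])
{
    int migrations = 0;
    assert(tasks != 0 && running != 0);
    for (int p = 0; p < NPROC; p++)
        running[p] = IDLE;
    for (int t = 0; t < NTASK; t++) {
        if (!tasks[t].ready)
            continue;
        int p = tasks[t].affinity;
        if (p == UNBOUND) {
            /* Prefer the previous processor, otherwise take any free one. */
            p = tasks[t].last_prc;
            if (p == IDLE || running[p] != IDLE) {
                p = IDLE;
                for (int q = 0; q < NPROC; q++)
                    if (running[q] == IDLE) { p = q; break; }
            }
        } else if (running[p] != IDLE) {
            p = IDLE;   /* bound processor is taken by a higher-priority task */
        }
        if (p == IDLE)
            continue;   /* the task does not run in this round */
        if (tasks[t].last_prc != IDLE && tasks[t].last_prc != p)
            migrations++;
        running[p] = t;
        tasks[t].last_prc = p;
    }
    return migrations;
}
```

With tasks A and B inactive, C and D each occupy a processor. Activating A pushes C to the other processor, and activating B afterwards pushes it back, so every short activation of A or B costs one migration of C.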

Hiroaki Takada 67 Challenges of Hard Real-Time OS

Basic Concept
▶ Task migration is introduced to the FDMP RTOS.
▶ Similar approach to Linux 2.6.
Object Management Concept (same as FDMP RTOS)
▶ Each OS object belongs to one of the processors.
▶ A task is executed by the processor to which it belongs.
Task Migration Support
▶ New service calls to move a task to another processor are introduced.
▶ Policy-mechanism separation
  ▶ Migration mechanism: supported by the OS
  ▶ Migration policy: up to the application

Hiroaki Takada 68 Challenges of Hard Real-Time OS

Introduced Service Calls
▶ mig_tsk(ID tskid, ID prcid)
  ▶ Migrate the task specified with tskid to the processor specified with prcid.
  ▶ Our implementation has a restriction that only tasks assigned to the same processor as the calling task can be migrated.
▶ mact_tsk(ID tskid, ID prcid)
  ▶ Activate the task specified with tskid on the processor specified with prcid.
  ▶ A useful service call when a task is periodically activated and the execution time of each activation is short.

Hiroaki Takada 69 Challenges of Hard Real-Time OS

Dynamic Migration
▶ The workload of each processor is measured at runtime.
▶ A task is selected from a high-workload processor and is migrated to a low-workload processor.
▶ Because guaranteeing real-time constraints is difficult, this is suitable for soft real-time tasks (hard real-time systems often include soft real-time tasks).
▶ How to determine the workload of a processor is the issue.
Planned Migration
▶ The best allocation of tasks to processors for each operation mode of the system is determined at design time.
▶ At runtime, tasks are migrated by the application SW when the operation mode is changed.
▶ Easier to guarantee real-time constraints.
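
Planned migration can be sketched as a table lookup on each mode change. In the sketch below, `mig_tsk` is a stub that only records the assignment so that the example is self-contained; the `ID`/`ER` typedefs, the `task_prc` bookkeeping array, the `plan` table, and `change_mode` are all hypothetical. The stub also ignores the restriction mentioned earlier that only tasks on the caller's processor can be migrated.

```c
#include <assert.h>

typedef int ID;   /* stand-ins for the kernel's own type definitions */
typedef int ER;
#define E_OK 0

#define NUM_TASKS 3
#define NUM_MODES 2

/* Bookkeeping for the sketch: current processor of each task (index = tskid). */
static ID task_prc[NUM_TASKS + 1] = {0, 1, 1, 2};

/* Stub standing in for the kernel service call: records the new processor. */
static ER mig_tsk(ID tskid, ID prcid)
{
    assert(tskid >= 1 && tskid <= NUM_TASKS);
    task_prc[tskid] = prcid;
    return E_OK;
}

/* Allocation decided at design time: plan[mode][tskid - 1] = processor ID. */
static const ID plan[NUM_MODES][NUM_TASKS] = {
    {1, 1, 2},   /* mode 0 */
    {1, 2, 2},   /* mode 1 */
};

/* Called by the application on a mode change; returns the number of
 * tasks that were migrated. */
static int change_mode(int mode)
{
    int moved = 0;
    assert(mode >= 0 && mode < NUM_MODES);
    for (ID tskid = 1; tskid <= NUM_TASKS; tskid++) {
        ID target = plan[mode][tskid - 1];
        if (task_prc[tskid] != target && mig_tsk(tskid, target) == E_OK)
            moved++;
    }
    return moved;
}
```

Because the table is fixed at design time, each mode's allocation can be analyzed offline, which is what makes real-time constraints easier to guarantee than with dynamic migration.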

Approach in Linux 2.6 ▶ The load average (average number of active tasks) of a processor is taken as the workload of the processor. Our Approach under Investigation ▶ Each target task reports its progress to the OS. ▶ In a multimedia application, the number of frames generated but not yet consumed can be defined as its progress. ▶ The average progress of the target tasks assigned to a processor is taken as the workload. ▶ Better than the Linux approach when the progress of a task can be defined appropriately.
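
The two workload indicators can be compared in a short sketch. Everything here is an assumption for illustration: the load average is simplified to a plain mean of sampled active-task counts, the task names and progress values are invented, and how an average progress value is mapped to "high" or "low" workload is left open because the slide does not specify it.

```python
from statistics import mean


def load_average(active_task_samples):
    """Load-average style indicator: mean number of active tasks per sample."""
    return mean(active_task_samples)


def average_progress(progress_by_task, owner, prcid):
    """Mean reported progress of the target tasks assigned to processor prcid."""
    values = [p for task, p in progress_by_task.items() if owner[task] == prcid]
    return mean(values) if values else 0.0


# Hypothetical multimedia tasks; progress = frames generated but not yet consumed.
owner = {"decoder": 1, "mixer": 1, "encoder": 2}
progress = {"decoder": 4, "mixer": 2, "encoder": 9}

print(load_average([2, 3, 1, 2]))
print(average_progress(progress, owner, 1))
print(average_progress(progress, owner, 2))
```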

Project Title ▶ HW/SW co-optimization for low energy embedded systems Project Members ▶ Nagoya Univ. (H. Takada) ▶ Kyushu Univ. (T. Ishihara) ▶ Toshiba (T. Fukaya) ▶ Ritsumeikan Univ. (H. Tomiyama) Funding and Period ▶ The project is funded by the CREST program of JST (Japan Science and Technology Agency) ▶ From Oct. 2005 to Mar. 2011

Basic Directions and Approaches ▶ The target is embedded systems. ▶ We assume that the application is known. ▶ The characteristics of the application are fully exploited. ▶ Co-optimization across design layer boundaries. ▶ The room for optimization grows through co-optimization from the SW design layer down to the HW circuit layer. ▶ Minimize energy consumption while guaranteeing the required QoS (performance, reliability, …). ▶ More precisely, minimize energy consumption while guaranteeing real-time constraints. Most Important Concept ▶ Dynamic Energy/Performance Scaling (DEPS)

What is DEPS? ▶ Control the system to operate at the optimal tradeoff point between energy and performance. ▶ In our project, use the slowest processor or slowest processor configuration with which real-time constraints can be met. ▶ A generalization of DVFS (Dynamic Voltage/Frequency Scaling) Motivation ▶ The operating voltage of recent processors is too low to take advantage of DVFS. ▶ Higher-performance processors become more complicated, resulting in worse energy efficiency.

Limitation of DVFS (Dynamic Voltage/Frequency Scaling) ▶ DVFS dynamically changes the operating voltage and frequency of a processor depending on how busy the processor is. ▶ DVFS was an effective approach to reducing energy consumption. ▶ The operating voltage of recent processors is already very low, so there is little room to lower the voltage dynamically. Concept of DEPS ▶ A generalization of DVFS. ▶ DEPS uses other control parameters in addition to voltage/frequency.

Rationale behind DEPS ▶ A higher-performance processor is less energy efficient because of ▶ large caches, deep pipelines, wide issue, out-of-order execution, branch prediction, … ▶ By turning off these features when the processor is not busy, energy consumption can be reduced. How Does DEPS Work? ▶ DEPS controls the system to operate at the optimal tradeoff point between energy and performance. ▶ In our project, DEPS switches to the slowest processor configuration (or slowest processor) with which real-time constraints can be met.

Theoretical Difference between DVFS and DEPS ▶ In DVFS, the execution time of a task can be estimated from the frequency. ▶ In DEPS, the relationship between execution time and processor configuration (cache size, pipeline depth, … in addition to frequency) depends on the characteristics of the task.

[Figure: energy versus execution time (1/performance). Each point Cij (Tij, Eij) is a DEPS configuration: C11 to C14 for task 1 (period = P1) and C21 to C23 for task 2 (period = P2).]

• Cij: DEPS configuration j of task i
• Tij: worst-case execution time under Cij
• Eij: energy consumption during Tij
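
The difference can be made concrete with a short sketch. Under DVFS alone, a single formula gives the execution time; under DEPS, each (task, configuration) pair needs its own characterized point. All numbers, units, and the names `DEPS_POINTS`, `dvfs_execution_time`, and `deps_execution_time` are hypothetical.

```python
def dvfs_execution_time(cycles, frequency_hz):
    """Under DVFS alone, execution time follows from cycle count and clock."""
    return cycles / frequency_hz


# Under DEPS, (Tij, Eij) must be characterized per task i and configuration j.
# Hypothetical values: (worst-case execution time in ms, energy in mJ).
DEPS_POINTS = {
    ("task1", "C11"): (2.0, 9.0),
    ("task1", "C12"): (3.0, 6.0),
    ("task2", "C21"): (4.0, 8.0),
    ("task2", "C22"): (9.0, 5.0),  # e.g. a task that suffers from a smaller cache
}


def deps_execution_time(task, config):
    """Look up the characterized worst-case execution time Tij."""
    return DEPS_POINTS[(task, config)][0]


print(dvfs_execution_time(600_000, 60e6))   # halving the clock doubles this
print(deps_execution_time("task2", "C22"))  # must come from the task's profile
```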

What is a DEPS-Ready IP Core? ▶ An IP core (incl. a processor) that has several configurations with different energy/performance tradeoffs and can dynamically switch among them. ▶ Or a set of IP cores that have the same instruction set but different energy/performance tradeoffs, e.g. a pair of ARM7 and ARM11 processors. Multi-Performance Processor (MPP) ▶ Our DEPS-ready processor core. ▶ MPP has two processing elements with different voltages and frequencies and switches between them dynamically and quickly. ▶ Cache size can be changed dynamically.

Multi MPP Core Chip under Development ▶ An SMP consisting of 3 MPP Cores ▶ Frequency is low because an old 0.18µm process is used.

[Figure: three MPP cores on one chip. Each core contains PE-H, a processing element with high voltage and high frequency (60MHz @1.8V); PE-M, a processing element with low voltage and low frequency (30MHz @1.0V); memory blocks labeled 8KB, 4KB, 2KB, and Tag; a DMAC; and a BUSIF. The cores are connected by an AMBA AHB running at 30MHz.]

The number of cache ways can be changed dynamically.

Assumption ▶ Existence of a DEPS-ready IP core Target System ▶ Hard real-time embedded systems ▶ A set of independent periodic or sporadic tasks (sporadic tasks are aperiodic tasks with a minimum inter-arrival separation) ▶ Priority-based scheduling with static priority assignment Optimization Policy ▶ Constraint: guarantee that all tasks complete within their deadlines ▶ Goal: minimize energy consumption
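
The constraint above can be checked with standard response-time analysis for static-priority scheduling. The sketch below is a textbook technique, not necessarily the analysis used in this project. It assumes a single processor, independent tasks, deadlines no longer than periods, and integer time units; the task set is invented.

```python
from math import ceil


def response_time(tasks, i):
    """Worst-case response time of tasks[i], or None if its deadline is missed.

    tasks is sorted from highest to lowest priority; each entry is
    (wcet, period, deadline).
    """
    wcet, _, deadline = tasks[i]
    r = wcet
    while True:
        # Interference from all higher-priority tasks released within r.
        interference = sum(ceil(r / p) * c for c, p, _ in tasks[:i])
        nxt = wcet + interference
        if nxt > deadline:
            return None
        if nxt == r:      # fixed point reached
            return r
        r = nxt


def schedulable(tasks):
    """True if every task meets its deadline under this analysis."""
    return all(response_time(tasks, i) is not None for i in range(len(tasks)))


tasks = [(1, 4, 4), (2, 6, 6), (3, 12, 12)]
print([response_time(tasks, i) for i in range(len(tasks))])
print(schedulable(tasks))
```

In a DEPS setting, the `wcet` of each task would be the execution time budget under the chosen configuration, so a slower configuration is acceptable only while this test still passes.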

Three Stage Optimizations ▶ Intra-task optimization ▶ The characteristics of each task are extracted by analyzing the execution trace obtained by instruction-set-level simulation. ▶ Inter-task optimization ▶ Each task is allocated to a processor and assigned an execution time budget. ▶ Runtime optimization ▶ The RTOS dynamically calculates the slack time and determines a suitable (not necessarily the best) processor configuration.

Input ▶ Source program of a task ▶ Input data for the task with weights (frequently occurring input data must be given a large weight) ▶ Information on processor configurations What is done in this Stage ▶ Check points are inserted into the task code for intra-task DEPS. ▶ The processor configuration can be changed at each check point. ▶ Check points are determined by analyzing the execution trace of the task (called execution trace mining). ▶ The DEPS profile of the task is generated.

Where to Insert Check Points? ▶ Points at which the characteristics of the task change greatly. ▶ For example, boundaries between processor-bound and memory-bound sections of the task. ▶ Points at which the remaining worst-case execution time (RWCET) can be predicted more precisely. ▶ Typically, after a conditional branch for which the RWCET when the branch is taken differs greatly from the RWCET when it is not taken. ▶ If a task can execute for a long time without passing a check point, a check point should be inserted to avoid this case.

DEPS Profile ▶ list of all effective combinations of configurations ▶ remaining worst-case execution time (RWCET) under each combination ▶ remaining average energy consumption (RAEC) under each combination

[Figure: Control Flow and Check Points of an Example Task. START, CP0 (100%), CP1, then a branch to CP2 (90%) or CP3 (10%), then EXIT.]

CP0      CP1      CP2      CP3      RWCET  RAEC
config2  config2  config1  config4  23.7   433.2
config3  config3  config1  config4  28.3   345.1
config3  config4  config2  config6  32.1   301.5
config4  config4  config2  config6  35.2   273.8
config6  config6  config4  config6  45.1   205.2
DEPS Profile of a Task (Image)
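
One plausible use of such a profile is to pick, for a given execution time budget, the combination with the lowest RAEC whose RWCET still fits. The sketch below copies the rows of the example profile; the function name `pick_combination` and the budget values are assumptions, and the time and energy units are unspecified, as on the slide.

```python
# Rows of the example DEPS profile: (CP0, CP1, CP2, CP3, RWCET, RAEC).
PROFILE = [
    ("config2", "config2", "config1", "config4", 23.7, 433.2),
    ("config3", "config3", "config1", "config4", 28.3, 345.1),
    ("config3", "config4", "config2", "config6", 32.1, 301.5),
    ("config4", "config4", "config2", "config6", 35.2, 273.8),
    ("config6", "config6", "config4", "config6", 45.1, 205.2),
]


def pick_combination(profile, budget):
    """Lowest-energy combination whose RWCET fits the budget, else None."""
    feasible = [row for row in profile if row[4] <= budget]
    if not feasible:
        return None
    return min(feasible, key=lambda row: row[5])


choice = pick_combination(PROFILE, budget=30.0)
print(choice)
```

With a budget of 30.0, only the first two rows are feasible, and the second one is chosen because its RAEC is lower.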

Input ▶ Task set information (cycle, deadline, priority, …) ▶ DEPS profile of each task ▶ Information on processor configurations What is done in this Stage ▶ Tasks are allocated to processors. ▶ Basically, the workload should be balanced to reduce energy consumption. ▶ Execution time budgets are assigned to tasks. ▶ In other words, a default combination of configurations is determined for each task. ▶ Budgets are assigned so as to minimize energy consumption. ▶ A DEPS management table for each check point is generated.
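
Budget assignment on a single processor can be sketched as a small search. This is an illustration under stated assumptions, not the project's algorithm: it enumerates every choice by brute force, ignores task-to-processor allocation, and uses the Liu and Layland utilization bound as a stand-in schedulability test (sufficient for rate-monotonic priorities with deadlines equal to periods). The task data are invented.

```python
from itertools import product


def assign_budgets(tasks, utilization_bound):
    """Choose one (rwcet, raec) option per task to minimize average power.

    tasks is a list of (period, options); options is a list of (rwcet, raec).
    Returns (power, choice), or None if no choice satisfies the bound.
    """
    best = None
    for choice in product(*(options for _, options in tasks)):
        utilization = sum(t / p for (p, _), (t, _) in zip(tasks, choice))
        if utilization > utilization_bound:
            continue  # this set of budgets may miss deadlines
        power = sum(e / p for (p, _), (_, e) in zip(tasks, choice))
        if best is None or power < best[0]:
            best = (power, choice)
    return best


# Hypothetical task set: (period, [(RWCET, RAEC), ...]).
tasks = [
    (100.0, [(20.0, 400.0), (40.0, 200.0)]),
    (200.0, [(50.0, 600.0), (100.0, 300.0)]),
]
bound = len(tasks) * (2 ** (1 / len(tasks)) - 1)  # about 0.828 for two tasks
best = assign_budgets(tasks, bound)
print(best)
```

Choosing the slowest option for both tasks would give a utilization of 0.9 and is rejected, so the search settles on slowing down only the first task.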

DEPS Management Table ▶ all effective configurations for each check point ▶ remaining worst-case execution time (RWCET) under each configuration

[Figure: control flow of the example task. START, CP0 (100%), CP1, then a branch to CP2 (90%) or CP3 (10%), then EXIT.]

CP0      RWCET
config2  23.7
config3  32.1
config4  35.2
config6  45.1

CP1      RWCET
config2  18.7
config3  22.7
config4  27.8
config6  36.5

CP2      RWCET
config1  14.6
config2  17.4
config4  20.8

CP3      RWCET
config4  13.6
config6  19.4

DEPS Management Table for each Check Point

RTOS Supporting DEPS ▶ Multiprocessor RTOS with the following extensions ▶ Slack time calculation ▶ Slack time (the maximum time that can be wasted without missing any deadline) is calculated at runtime (an approximation is used). ▶ Configuration determination and switch ▶ At each check point, the RTOS determines the configuration from the calculated slack time and the DEPS management table. ▶ Slack time is consumed greedily. ▶ The RTOS switches the configuration at each check point and at each task switch.
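
The configuration decision at a check point can be sketched as follows, using the RWCET values of the example DEPS management table. The selection rule is an assumption that interprets "consumed greedily" as taking the configuration with the largest RWCET that still fits into the remaining budget plus the slack; the fallback to the fastest configuration and the function name `choose_config` are also hypothetical.

```python
# RWCET per configuration at each check point (example management table).
MANAGEMENT_TABLE = {
    "CP0": [("config2", 23.7), ("config3", 32.1), ("config4", 35.2), ("config6", 45.1)],
    "CP1": [("config2", 18.7), ("config3", 22.7), ("config4", 27.8), ("config6", 36.5)],
    "CP2": [("config1", 14.6), ("config2", 17.4), ("config4", 20.8)],
    "CP3": [("config4", 13.6), ("config6", 19.4)],
}


def choose_config(check_point, remaining_budget, slack):
    """Greedy choice: the slowest configuration that fits budget plus slack."""
    entries = MANAGEMENT_TABLE[check_point]
    available = remaining_budget + slack
    fitting = [(cfg, rwcet) for cfg, rwcet in entries if rwcet <= available]
    if not fitting:
        # Assumption: if nothing fits, fall back to the fastest configuration.
        return min(entries, key=lambda e: e[1])[0]
    return max(fitting, key=lambda e: e[1])[0]


print(choose_config("CP1", remaining_budget=20.0, slack=3.0))
print(choose_config("CP3", remaining_budget=10.0, slack=0.0))
```

At CP1 with 23.0 time units available, config2 and config3 fit and the slower config3 is selected; at CP3 with only 10.0 available, nothing fits and the sketch falls back to config4.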

FDMP has an advantage, currently ▶ With current technologies, real-time constraints are hard to guarantee with SMP. ▶ With FDMP, it is relatively easy. Then, why SMP? ▶ Considering the mask cost of future large-scale LSIs, the advantage of HW SMP, namely that the chip can be general-purpose, is significant. Future Challenge ▶ How to realize a predictable real-time system with a general-purpose chip. ▶ Cooperation of SW and HW is indispensable.

▶ How to realize a high-performance embedded system with limited energy consumption and low cost? My Future View ▶ A general-purpose chip including heterogeneous many-core processors and reconfigurable logic. ▶ The interconnection network and memories can be implemented on another chip and integrated as a 3D LSI.

[Figure: processor cores, reconfigurable logic, and memory.]

▶ Challenge: HW and SW design environment and tools for this chip.
