Corporate Technology

The Many Approaches to Real-Time and Safety-Critical Linux
Open Source Summit Japan 2017

Prof. Dr. Wolfgang Mauerer, Ralf Ramsauer, Andreas Kölbl
Siemens AG, Corporate Research and Technologies
Smart Embedded Systems
Corporate Competence Centre Embedded Linux

Copyright © 2017, Siemens AG. All rights reserved.

31 May 2017, W. Mauerer, Siemens Corporate Technology

Overview

1 Real-Time and Safety

2 Approaches to Real-Time
    Architectural Possibilities
    Practical Approaches

3 Approaches to Linux-Safety

4 Guidelines and Outlook

Introduction & Overview

About

Siemens Corporate Technology: Corporate Competence Centre Embedded Linux
Technical University of Applied Science Regensburg: Theoretical Computer Science, Head of Digitalisation Laboratory

Target Audience Assumptions

System Builders & Architects, Software Architects
Linux experience available
Not necessarily RT-Linux and Safety-Critical Linux experts

A journey through the worlds of real-time and safety

1 Real-Time and Safety

Real-Time: What and Why? I

Real Time

Deterministic responses to stimuli
Bounded latencies (not too late, not too early)
Repeatable results
Optimise/quantify worst case

Real Fast

Caches, TLB, Lookahead, Pipelines
Optimise average case

Real-Time: What and Why? II

Type             Characteristics                    Use Cases
Soft Real-Time   Subjective deadlines               Media rendering, I/O
95% Real-Time    Deadlines met most of the time,    Data acquisition, finance,
                 misses can be compensated          navigation, ...
100% Real-Time   Miss deadline: defects occur       Industrial automation & control,
                                                    robotics, airplanes, ...

Ensuring Real-Time

Statistical testing
WCET calculation + schedulability testing
Formal verification
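To make "WCET calculation + schedulability testing" concrete, here is a minimal sketch of two textbook checks for periodic tasks under rate-monotonic priorities: the Liu & Layland utilisation bound and an iterative response-time analysis. The task set, the names `rm_utilisation_bound` and `response_times`, and the assumption that deadlines equal periods are illustrative choices, not taken from the slides. The WCET values are simply assumed to be known.

```python
import math


def rm_utilisation_bound(n):
    """Liu & Layland utilisation bound for n tasks under rate-monotonic priorities."""
    return n * (2 ** (1 / n) - 1)


def response_times(tasks):
    """Worst-case response times under fixed-priority preemptive scheduling.

    tasks: list of (wcet, period) pairs, highest priority first, with
    deadline equal to period. An entry is None if that task can miss
    its deadline.
    """
    results = []
    for i, (wcet, period) in enumerate(tasks):
        response = wcet
        while True:
            # Interference from all higher-priority tasks released so far
            interference = sum(
                math.ceil(response / p) * c for c, p in tasks[:i]
            )
            candidate = wcet + interference
            if candidate > period:
                results.append(None)  # deadline (= period) exceeded
                break
            if candidate == response:
                results.append(response)  # fixed point reached
                break
            response = candidate
    return results


# Hypothetical task set: (WCET, period) in microseconds, shortest period first
TASKS = [(100, 500), (200, 1000), (400, 2000)]
UTILISATION = sum(c / p for c, p in TASKS)

print(UTILISATION <= rm_utilisation_bound(len(TASKS)))
print(response_times(TASKS))
```

The utilisation bound is sufficient but not necessary: a task set that exceeds it may still pass the response-time analysis.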

Safety: What and Why?

Some undesirables

Brake: Segfault!
Engines full speed ahead: Segfault!
and so on...

Safety-Critical Systems

Malfunctions of the system (may) result in
    death/injury to people
    damage to equipment/property
    environmental harm

Safety ≠ Real-Time, but often coupled!
100% RT + fatal consequences ⇒ Safety

Safety: Standards

IEC61508: “umbrella” standard

Robotic Devices: ISO10218
Electrical Power Drive: IEC61800
Industrial Process: IEC61511
Railways: IEC62278
Machinery: IEC62061
Nuclear Power Plants: IEC61513
Medical Device Software: IEC62304
Automotive: ISO26262

Routes to Safety

Standard compliant development
Proven in use
Compliant non-compliant development

Challenge:

2 Approaches to Real-Time
    Architectural Possibilities
    Practical Approaches

Approaches to Real-Time Linux

[Figure: comparison diagram. Labels: App, Control Application, Specialised Languages, Standard Languages, Control Framework, +RT-Bridge, +RT-Net, Specialised OS + Middleware, Linux +RT, Proprietary Hardware, COTS Hardware +FPGA. Annotations: €, Δt, RT Latency, +/- Engineering, Dynamic, -Overhead, Static.]

Why Real-Time Linux?

Commodity features
Subtractive vs. additive Engineering
Multi-Core utilisation
...

Architectural possibilities I

Pros and Cons

1 Traditional RTOS in side-device
    ✓ Countless variants available
    ✓ Pre-certified versions
    ✓ Extreme simplicity
    ✗ Hard to extend with state-of-the-art IT
    ✗ Vendor lock-in
    ✗ Unusual etc.

2 RT-Enhanced Kernel
    ✓ Leverage existing Linux know-how
    ✓ Integration of high-level technologies with little effort
    ✗ Certification complicated
    ✗ Complex system
    ✗ Only statistical RT assurance

3 Separation Kernel
    ✓ Clean split between RT and non-RT
    ✓ Substantial certification experience
    ✗ Typically strong HW coupling
    ✗ Vendor lock-in

4 Co-Kernel
    ✓ Clean split between RT and non-RT
    ✓ Resource efficient
    ✗ Non-standard maintenance efforts
    ✗ Implicit couplings

5 Asymmetric Multiprocessing
    ✓ Combine advantages of split systems with single HW basis
    ✓ Near bare metal performance
    ✗ Implicit couplings
    ✗ Relatively new development
    ✗ Maintenance overhead

Architectural possibilities II

Commonality

System partitioning!
Logical instead of physical
Workloads of different criticality handled by different system portions
⇒ Mixed Criticality

Practical Approaches

Preempt-RT
Xenomai/ipipe
ARM/PRU
GPUs/FPGA assisted RT
Traditional RTOSes

Preempt-RT I

Enhance Linux with RT capabilities

Preemption (incl. preemption at kernel level)
Deterministic (and fine-grained) timing behaviour
Avoidance of priority inversion (prio inheritance/ceiling)

RT Howto

Don't do anything stupid
Lock memory (no paging)
No inappropriate syscalls (networking etc.)
No block device access
...

Linux Foundation: Official project (goal: upstreaming code)
Typical Jitter: 50 µs (), 150 µs (rpi)
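Jitter figures such as the ones quoted here come from statistical testing: a periodic task records when it actually woke up, and the timestamps are compared with when it should have. A minimal sketch of that evaluation follows. The function names, the sample timestamps, and the definition of jitter as maximum minus minimum latency are assumptions made for illustration; the slides do not define the term.

```python
def latency_stats(expected_ns, actual_ns):
    """Summarise wake-up latencies of a periodic task (all values in ns)."""
    latencies = [actual - expected for expected, actual in zip(expected_ns, actual_ns)]
    return {
        "min": min(latencies),
        "max": max(latencies),
        "avg": sum(latencies) / len(latencies),
        # Assumed definition: spread between best and worst observed latency
        "jitter": max(latencies) - min(latencies),
    }


def meets_bound(stats, bound_ns):
    """True if the worst observed latency stays within the given bound."""
    return stats["max"] <= bound_ns


# Hypothetical samples of a task with a 1 ms period
PERIOD_NS = 1_000_000
EXPECTED = [i * PERIOD_NS for i in range(4)]
ACTUAL = [12_000, 1_030_000, 2_008_000, 3_050_000]

STATS = latency_stats(EXPECTED, ACTUAL)
print(STATS)
print(meets_bound(STATS, 150_000))
```

Such a test only bounds what was observed. It says nothing about latencies that did not occur during the measurement, which is why the earlier slide lists statistical testing separately from WCET calculation and formal verification.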

Preempt-RT II

[Figure: number of commits (y-axis, 0 to 400) per stack version (x-axis), by type of patch: backport, forwardport, invariant. Stack versions shown: 3.0.101-rt130, 3.2.78-rt113, 3.4.111-rt141, 3.6.11-rt31, 3.8.13-rt16, 3.10.101-rt111, 3.12.57-rt77, 3.14.65-rt68, 3.18.29-rt30, 4.0.8-rt6, 4.1.20-rt23, 4.4.9-rt17, 4.6-rc7-rt1.]

Preempt-RT III: Pros and Cons

Advantages

✓ Patch availability and community support
✓ Re-use of engineering knowledge
✓ Excellent multi-core scalability
✓ RT in userspace easily possible

Disadvantages

✗ Functional certifiability limited
✗ Achieving smallest latencies requires substantial system knowledge
✗ Inadvertently mixing RT and non-RT is easy
✗ Fixing problems requires substantial system knowledge

Xenomai 3.0 I

Xenomai: RTOS-to-Linux

Provides skins for traditional RTOSes
Two modes of operation
    Run on top of Linux (w. or w/o RT capabilities)
    Run over co-kernel extension (patched Linux required)

[Figure: block diagram. Labels: Userspace, Task, Process, Kernel, Scheduler A, Scheduler B, Preemption, Dispatching and Collaboration Services, IRQ, Hardware.]

ipipe patch: 450-600 KiB (depending on arch), (mostly) stable over time
Typical Jitter: 10 µs (x86), 50 µs (rpi)

Image source: Siemens AG, CC BY-SA 3.0

Xenomai 3.0 II: Architecture sketch

[Figure: architecture sketch]

Image source: Xenomai.org, CC BY-SA 3.0

Xenomai 3.0 III: Pros and Cons

Cobalt (Co-Kernel)

✓ Clean split between RT/non-RT (transition is signalled)
✓ Light-weight in low-end platforms (lock contention, cache usage etc.)
✗ Very limited number of developers/small community
✗ Porting effort required; availability lag
✗ Regressions on upstream changes

Mercury (Preempt-RT)

✓ Architectural basis maintained by substantial community
✓ Very solid skin framework w/o invasive core changes
✗ Legacy not always 100% reproducible
✗ Inadvertently mixing RT and non-RT easier

ARM + PRU I

[Figure: block diagram. ARM Subsystem: Cortex-A, L1 Instruction Cache, L1 Data Cache, L2 Data Cache. Programmable Real-Time Unit (PRU) Subsystem: PRU0 (200 MHz) and PRU1 (200 MHz), each with I/O, Inst. RAM and Data RAM; Shared RAM; Interconnect; INTC; Peripherals. L3 Interconnect: Shared Memory, Peripherals. L4 Interconnect: Peripherals, GP I/O.]

ARM + PRU II

Programmable Real-Time Unit

Dedicated (two) execution units based on 32-Bit RISC architecture
No pipelines, no caches, separate instruction/data memory (shared RAM)
Linux support via remoteproc and rpmsg framework
200 MHz ⇒ 5 ns cycle time

Pros and Cons

✓ High determinism/small jitter
✓ Simpler than adding µC components to system
✓ Clean split between RT and non-RT
✗ Tied to (very) specific hardware
✗ Increased maintenance efforts (additional compilers etc.)
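The step from 200 MHz to a 5 ns cycle time is just the reciprocal of the clock frequency. The short sketch below spells out that arithmetic and, as an illustration, counts how many cycles fit into a deadline. The helper names and the 1 µs deadline are hypothetical.

```python
def cycle_time_ns(clock_hz):
    """Duration of one clock cycle in nanoseconds."""
    return 1e9 / clock_hz


def cycles_within(deadline_ns, clock_hz):
    """Number of whole clock cycles that fit into a deadline given in ns."""
    return deadline_ns * clock_hz // 1_000_000_000


PRU_CLOCK_HZ = 200_000_000  # 200 MHz, as stated on the slide

print(cycle_time_ns(PRU_CLOCK_HZ))        # duration of one cycle in ns
print(cycles_within(1_000, PRU_CLOCK_HZ))  # cycles available in 1 µs
```

Because the unit has no pipelines and no caches, a cycle budget like this translates fairly directly into an instruction budget, which is where the high determinism comes from.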

Real-Time GPU/FPGA assisted Computing

Automotive/Image Processing

[Figure: automotive features placed along a spectrum with four regions: General Purpose, Real Fast, Soft Real-Time, Hard Real-Time. Features: Speculative Evasion Pre-Planning, Eye Tracking, Autonomous Control, Movie Player, Backup Camera, Warning System, Driver's Display, Automatic Lane Following, Data Encryption (Disk, Network, etc.), Intelligent Cruise Control, Emergency Collision Avoidance, Route Planning, Voice Control, Traffic Sign Recognition, Autonomous Local Navigation.]

Figure 2: Spectrum of possible temporal requirements for a number of automotive applications that may utilize a GPU. Each feature may cross domains, as indicated by the line beneath each feature name.

G. Elliot and J. H. Anderson, IEEE RTCSA, 2011, 48–54

GPU Architecture

CPU/GPU Architecture

[Figure] Fig. 1.1: Difference between a CPU and a GPU architecture (source: [CGM14], p. 9)

Program Flow

[Figure] Fig. 1.4: Example flow of a CUDA program (source: [CGM14], p. 25)

GPU Computation Cycle

1.) Copy CPU → GPU; 2.) Execute Kernels; 3.) Copy GPU → CPU

J. Cheng et al.: Professional CUDA C Programming, John Wiley & Sons, 2014
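The three-step GPU computation cycle (copy to the GPU, execute kernels, copy back) suggests a simple first-order timing model: two transfers plus the kernel time. The sketch below assumes a constant bus bandwidth and a known kernel duration. Both numbers and the function names are hypothetical, and real transfers add fixed setup overheads that this model ignores.

```python
def transfer_time_us(num_bytes, bandwidth_bytes_per_s):
    """Time to move num_bytes over a bus with the given bandwidth, in µs."""
    return num_bytes * 1e6 / bandwidth_bytes_per_s


def gpu_cycle_time_us(in_bytes, out_bytes, kernel_us, bandwidth_bytes_per_s):
    """Copy CPU -> GPU, execute kernels, copy GPU -> CPU, executed back to back."""
    copy_in = transfer_time_us(in_bytes, bandwidth_bytes_per_s)
    copy_out = transfer_time_us(out_bytes, bandwidth_bytes_per_s)
    return copy_in + kernel_us + copy_out


# Hypothetical workload: 4 MB in, 1 MB out, 500 µs of kernel time, 1 GB/s bus
TOTAL_US = gpu_cycle_time_us(4_000_000, 1_000_000, 500, 1_000_000_000)
print(TOTAL_US)
```

In this example the copies dominate the kernel time, so data should not be moved between CPU and GPU more often than necessary.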

GPU/FPGA assisted RT: Problems

Problems

GPU execution model fundamentally non-preemptive
    Execution and memory copy
Hyper-Q/MPS: Optimise utilisation, not determinism
Issues (binary-only)
⇒ I/O-device scheduling not straightforward

Resource Arbitration

Resource Sharing

[Figure 4: Concrete configurations. Matrix of CPU Scheduling (Partitioned, Clustered, Global) against GPU Organization (Partitioned, Clustered, Global). Also visible from the cited paper: Table I: Important notation; Figure 5: High-level design of GPUSync.]

Heterogeneous Computing

2 compute resources (assumption: multiple GPUs on embedded systems)
(at least) 9 possible combinations!
Schedulability POV: Clustered CPUs + partitioned GPUs

G. Elliot and J. H. Anderson, IEEE RTCSA, 2011, 48–54
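The count of (at least) 9 combinations follows from pairing three CPU scheduling choices with three GPU organisation choices. A minimal sketch that enumerates them is shown below. The string labels and the function name are illustrative only.

```python
from itertools import product

# The three choices named for both CPU scheduling and GPU organisation
CHOICES = ("partitioned", "clustered", "global")


def arbitration_configs():
    """All (CPU scheduling, GPU organisation) pairs."""
    return list(product(CHOICES, repeat=2))


for cpu, gpu in arbitration_configs():
    print(f"CPUs: {cpu:<12} GPUs: {gpu}")
```

Each further degree of freedom, such as the cluster sizes, multiplies this number again, which is why the slide says "at least".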

GPU: Scheduling Practicalities

Central Server

Dispatch time-bounded kernels
Longest running kernel determines latency
Additional synchronisation complexity

Clustering

Less wasteful than partitioning
Worst case: Locally maximal execution time
HW Queue support

Preempt. Kernels

Implement context save/restore
Highly experimental; advanced features (streams, shared memory, ...) not supported
Further silicon support required
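With a non-preemptive execution model, "longest running kernel determines latency" can be read as a simple bound: a new request may have to wait for the longest kernel that could already be running. The sketch below illustrates why a central server that dispatches time-bounded kernels helps. The kernel durations, the time bound, and the function names are hypothetical, and per-launch overhead is ignored.

```python
import math


def worst_case_wait_us(running_kernels_us):
    """Longest time a new request may wait behind a non-preemptible kernel."""
    return max(running_kernels_us, default=0)


def split_kernel(duration_us, bound_us):
    """Split one kernel into equally sized launches of at most bound_us each."""
    launches = math.ceil(duration_us / bound_us)
    return [duration_us / launches] * launches


KERNELS_US = [200, 1500, 800]  # hypothetical kernel execution times
BOUND_US = 250                 # hypothetical per-launch time bound

unbounded_wait = worst_case_wait_us(KERNELS_US)
bounded_wait = worst_case_wait_us(
    [part for k in KERNELS_US for part in split_kernel(k, BOUND_US)]
)
print(unbounded_wait, bounded_wait)
```

The price for the shorter wait is the extra synchronisation the slide mentions: every additional launch has to pass through the dispatcher.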

Small RTOS I

Traditional Real-Time Operating System

Optimisation: Size, Determinism
Strong focus: scheduling and schedulability
Frugal feature set

System     POSIX  Maturity   VM  Archs  Drivers    Resources  Docs
FreeRTOS   ✗      high       ✓   high   high       low        good
RTEMS      ✓      very high  ✗   high   very high  avg        very good
µclinux    ✓      avg        ✗   avg    high       avg        poor
mbed       ✗      high       ✗   low    low        avg        very good
Zephyr     ✗      high       ✗   low    avg        low        avg

Small RTOS II

Prerequisite: Execution Env

Static partitioning
Real-Time capable virtualisation (e.g., KVM over Preempt-RT)
Defeats the point, somewhat...

Pros and Cons

✓ Base systems certifiable
✓ Rich scheduling/schedulability options
✓ Clear split between RT and non-RT
✗ Often non-POSIX programming model
✗ Maintenance effort doubles
✗ Implicit coupling via shared resources (busses etc.)

3 Approaches to Linux-Safety

Safety Strategies with Linux

SIL2LinuxMP

“Distributed System” on a chip
Containers + minimal tools (compliant development) + partitioning allocator
Minimise interference

System Partitioning

Jailhouse
    Requires HW virtualisation
    Up to N OSes on N-Core
SafeG
    Requires ARM TrustZone
    Two OSes (trusted and untrusted)
    Temporal isolation: FIQ vs. IRQ
Quest and Quest-V
    Research systems, interesting niche features (e.g., RT-USB)

Image source: N. McGuire, GNU/Linux for safety-related systems – SIL2LinuxMP, FOSDEM 2016

Safety Strategies for Linux

Safety and Linux

Partition system in various ways
Mixed Criticality: Combine critical & uncritical workloads

Jailhouse I

Jailhouse: Motivation

SMP is everywhere
Enables consolidation of formerly separate devices
Linux is almost everywhere, but
    Legacy software stacks require bare-metal
    Safety-critical software stacks
    DSP-like real-time workloads

Jailhouse II

Jailhouse Architecture

Build static partitions on SMP systems
Use hardware-assisted virtualisation
Do not schedule
No CPU core sharing, 1:1 device assignment
Split up running Linux system
Simplicity over Features
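The rule "no CPU core sharing, 1:1 device assignment" can be checked mechanically before a static partitioning is deployed. The sketch below validates a partition plan held in a plain dictionary. This layout, the cell and device names, and the function name are hypothetical and deliberately simplified; it is not Jailhouse's actual cell configuration format.

```python
def find_partition_conflicts(cells):
    """Return the resources that are claimed by more than one cell.

    cells: dict mapping a cell name to {"cpus": set, "devices": set}.
    Each conflict is reported as (kind, resource, first_owner, second_owner).
    """
    conflicts = []
    owners = {}  # (kind, resource) -> name of the first cell that claimed it
    for name, cell in cells.items():
        for kind in ("cpus", "devices"):
            for resource in sorted(cell[kind]):
                key = (kind, resource)
                if key in owners:
                    conflicts.append((kind, resource, owners[key], name))
                else:
                    owners[key] = name
    return conflicts


# Hypothetical plans for a four-core system
VALID = {
    "linux": {"cpus": {0, 1}, "devices": {"eth0", "sda"}},
    "rtos": {"cpus": {2, 3}, "devices": {"can0"}},
}
OVERLAPPING = {
    "linux": {"cpus": {0, 1}, "devices": {"eth0"}},
    "rtos": {"cpus": {1, 2}, "devices": {"eth0"}},
}

print(find_partition_conflicts(VALID))
print(find_partition_conflicts(OVERLAPPING))
```

Because nothing is scheduled or shared at run time, a check like this can be done entirely offline, which fits the "simplicity over features" goal.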

Jailhouse: Challenges

Issues

Memory-Mapped I/O
Indivisible hardware resources
Erroneous hardware behaviour

Jailhouse: Impact?

Measurement

RTEMS (Jailhouse/bare metal) task switching overhead
Linux partition with and w/o load

[Figure: histograms of task switch duration [µs] (x-axis, 5 to 25) against number of occurrences [k] (y-axis, 100 to 500), in three panels: RTEMS on Bare Metal; RTEMS on Jailhouse/High Linux load; RTEMS on Jailhouse/No Linux load.]

4 Guidelines and Outlook

Guidelines and Outlook

Guidelines

Combinatorial explosion of alternatives...
Data capture/signal processing
    Jailhouse, PRU and multi-OS
Audio, media and non-fatal control
    Preempt-RT
Real-Time combined with throughput requirements
    Xenomai with Cobalt kernel
Involved temporal interrelations
    RTOS on system partition

Outlook

Appliances with certified and non-certified mode
Increased HW support for partitioning

Thanks for your interest!