Selection and evaluation of an embedded hypervisor: application to an automotive platform

Etienne Hamelin∗, Moha Ait Hmid∗, Amine Naji†, Yves Mouafo-Tchinda‡
∗CEA, LIST (email: [email protected])
†Jaguar Land Rover (email: [email protected])
‡Continental (email: [email protected])

Abstract—This paper presents a methodology for selecting a hypervisor for embedded applications consolidation, based on qualitative and quantitative criteria, then describes how this methodology is applied to the construction of an automotive hardware and software platform.

1. Introduction

The advent of multi/many-core SoCs in embedded systems enables the execution of multiple software applications on the same integrated circuit, possibly with very different requirements in terms of performance, security, safety criticality, and lifecycle management. Realizing the full potential of such platforms involves running multiple software computing domains, often managed by different operating systems, on the same physical device. To host a diversity of software computation payloads, either a hardware or a software solution can be considered. The hardware solution consists in dedicating a distinct core or cluster of the platform to each software computing domain. While providing good isolation of domains, it does not enable efficient sharing of hardware resources. The software solution is virtualization [1]. It provides a software environment in which programs or complete operating systems can run simultaneously, as if they were running natively on the hardware. Such an environment is called a virtual machine (VM). Multiple VMs can thus run almost transparently on a single hardware device, managed by a hypervisor which provides an abstraction layer between the physical hardware and the VMs. However, due to shared resources, some interference may occur between virtual machines.

Several forms of virtualization may be applicable to embedded consolidation. Virtualization solutions are usually classified as type-I (bare-metal hypervisor) or type-II (hypervisor hosted inside a bare-metal, full-featured operating system), with microkernel-based hypervisors being a hybrid between both, where the VMM (virtual machine monitor) runs over a small microkernel.

Deciding which of the available hypervisors are more suitable is not an easy task, as it requires eliciting criteria and then evaluating the candidates, especially when non-technical and subjective criteria are in balance with technical ones.

This paper proposes a methodological selection process, applicable to many domains, and presents its successful application to the selection of a hypervisor for a new automotive platform.

(This work was done while all authors were research engineers at CEA, LIST.)

2. Related work

Several works address hypervisor evaluation and comparison. We can cite Patel et al. [2], who present the open-source embedded hypervisor Xvisor and compare it against two commonly used hypervisors, KVM and Xen, according to factors that affect whole-system performance. Experimental results on the ARM architecture show Xvisor's lower CPU overhead, higher memory bandwidth, lower lock synchronization latency and lower virtual timer interrupt overhead, and thus overall enhanced virtualized embedded system performance. In [3], Shimada et al. present an evaluation of the hypervisor they designed according to four main criteria: boot time, communication overhead, CPU time overhead for intensive tasks, and interrupt latency.

We did not find microbenchmarks specifically suited to embedded hypervisor evaluation. However, some OS benchmarks such as LMbench [4], UnixBench [5] and HBench-OS [6], [7] provide metrics that can be used for hypervisor evaluation.

A common limitation of several of these previous works is that they advocate a specific hypervisor solution, and therefore focus on its specific performance advantages. Moreover, they usually follow a very technical evaluation approach with many details (microbenchmarks), which can be far from the realm of a real-world, business-driven decision. Our methodology balances technical criteria (although less detailed than in several related works) with non-technical criteria, in a methodological approach.

3. Approach

We propose a rational and practical approach to selecting an embedded hypervisor for the needs of a specific domain. First, the user needs have to be turned into selection criteria.
Based on user requirements, and on knowledge of embedded hypervisors and of the applications of the domain, we propose a set of criteria, both qualitative and quantitative, which detail specific characteristics, features, or metrics that can be evaluated for each particular solution.

It would however be prohibitive to evaluate all metrics on all hypervisors, and to approach selection as a maximization process. Instead, we propose a multi-step filtering process where, at each step, the most blocking or most easily evaluated criteria are used to reject a number of solutions, so that the subsequent steps can devote more effort to evaluating a reduced number of solutions in depth. A minimal sketch of this filtering process is given at the end of this section.

We propose a list of evaluation criteria, grouped into the following classes:

• Main features
  – hypervisor type (type I, type II, microkernel-based)
  – supported hardware architectures (e.g. ARMv8A, x86/64, etc.)
  – supported guest OSs (e.g. full-virtualized Linux, paravirtualized RTOSs)
  – communication services (e.g. inter-partition communication, virtual network)
  – exposed APIs (e.g. POSIX PSE-53, OSEK, ARINC 653, etc.)
  – real-time support (time-driven or priority-driven scheduling, preemption model, resource monitoring model, etc.)
  – safety & security services (memory partitioning, health monitoring, fault tolerance, etc.)
• Industrial maturity & business model
  – application domains and success stories (how many prototypes or industrial products rely on a version of that hypervisor)
  – available safety & security qualification packages (e.g. a generic qualification package like automotive SEooC or Common Criteria EAL certification, or successful use in a safety- or security-qualified product; applicable restrictions for safety- or security-qualified usage)
  – toolset (availability and ease of use of development environment, configuration, debug and analysis tools)
  – licensing model (e.g. open-source, flat-rate or per-product license fee)
  – business model (available support, applicable regulation, partnership model)

These criteria are characterized along the following aspects:

• Qualitative vs. quantitative: many qualitative criteria are easily evaluated from public documentation, and may therefore be used in early selection steps, whereas most quantitative criteria must be measured on target, which is a more costly process.
• Technical vs. non-technical: besides technical compatibility and performance aspects, the possibility of a long-term relationship between the integrator and the hypervisor editor (or open-source community) needs to be assessed, as well as legal/contractual constraints.
• Required vs. nice-to-have features: some features are strictly necessary to certain users (e.g. "support the ARMv8A instruction set"), while others may be added later (e.g. support for a specific driver) and only represent a limited additional cost.
• Objective vs. subjective evaluation: several criteria are hard to assess objectively, e.g. ease of use. Moreover, many technical criteria (e.g. context-switch performance, inter-VM interference), while objectively defined, can only be evaluated on specific hardware and in a specific configuration; the chosen test cases and hypervisor configurations may not accurately model the actual field usage. For these reasons, all quantitative criteria are regarded as estimations only and are subject to interpretation.
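To make the filtering idea concrete, the following minimal C sketch (illustrative only; the criterion names, scores and thresholds are hypothetical, not those used in the actual study) shows how cheap blocking criteria reject candidates before more expensive evaluations are run:

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical candidate record: cheap qualitative criteria first,
 * costlier interview/on-target criteria only for survivors. */
typedef struct {
    const char *name;
    bool real_time_support;      /* step 1: blocking criteria      */
    bool qualification_package;
    bool industrial_maturity;
    int  tooling_score;          /* step 2: interview-based, 0-10  */
} candidate_t;

static bool pass_step1(const candidate_t *c) {
    return c->real_time_support && c->qualification_package
        && c->industrial_maturity;
}

int main(void) {
    candidate_t pool[] = {
        { "hypervisor A", true,  true,  true,  8 },
        { "hypervisor B", true,  false, true,  9 },  /* rejected at step 1 */
        { "hypervisor C", true,  true,  true,  4 },  /* rejected at step 2 */
    };
    for (size_t i = 0; i < sizeof pool / sizeof pool[0]; i++) {
        if (!pass_step1(&pool[i]))     continue;  /* cheap filter    */
        if (pool[i].tooling_score < 6) continue;  /* costlier filter */
        printf("shortlisted for on-target evaluation: %s\n", pool[i].name);
    }
    return 0;
}
```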
4. Application in the automotive domain

In collaboration with the Renault-Nissan Alliance within the project FACE "Future architecture for Automotive Computer Environment"¹, CEA has designed a centralized computing hardware and software platform. This platform shall serve as a generic computing unit supporting various vehicle product lines, designed to host several heterogeneous software appliances ranging from e.g. hard real-time control to soft real-time CPU-intensive ADAS processing and best-effort multimedia applications. These appliances may be developed, deployed and updated independently by various software providers. In this context, we applied the methodological selection process presented above to select a suitable candidate. Choosing a software platform for a wide range of products is a long-term strategic decision; business and strategic stakes are therefore as high as technical or performance-related ones.

4.1. Selection process overview

Our selection process was performed in 3 steps. At each step, several criteria are used to select a limited number of candidates for further analysis, as illustrated in figure 1.

Figure 1. Overview of the selection process

At first, a list of 23 embedded hypervisor solutions was identified, including both open-source and commercial ones. These were then scrutinized against the main discriminating requirements of:

• real-time support,
• available safety/security qualification package,
• estimated industrial maturity (assessed based on known industrial uses of the hypervisor) and available support.

From this first filter, the following solutions stood out:

• PikeOS by Sysgo²,
• Integrity Multivisor by Green Hills Software³,
• QNX by Blackberry⁴,
• Mentor Embedded Hypervisor by Mentor⁵,
• RedBend by Harman⁶.

Several open-source solutions (e.g. Xen, L4Re/Fiasco), though of high technical interest, were discounted due to the lack of a safety/security qualification or of available support companies.

Then each hypervisor editor was interviewed, and it became clear that:

• PikeOS, Integrity Multivisor and QNX provided better user tooling, and a longer trail of successful usage in safety-critical or security-relevant industrial environments, than Mentor Embedded Hypervisor and Harman RedBend;
• embedded hypervisors from North-American editors might be subject to restrictive regulation when deployed in some countries, which might become problematic for a world-scale OEM.

For these reasons, Sysgo's PikeOS was chosen as the primary platform for quantitative evaluation.

1. http://www-list.cea.fr/en/media/news/2019/409-february-19-2019-face-powers-automotive-innovations-by-revolutionizing-electrical-and-electronics-architectures

4.2. Overview of the quantitative analysis

The impacts and benefits of virtualization were evaluated by comparing several metrics measured in various configurations. Our main focus was on:

• assessing orders of magnitude of the performance impact (overhead) of virtualization itself over bare-metal applications;
• assessing the level of inter-partition interference, i.e. how virtualization may impact the predictability of an application's performance.

The quantitative analysis was performed on a Renesas RCar-H3 platform⁷. This SoC is based on a cluster of 4 ARM Cortex-A57 cores, a cluster of 4 ARM Cortex-A53 cores, and a dual-core ARM Cortex-R7 that was not exploited here. Since we make only limited use of Renesas-specific hardware IPs, we assume that our results can be extrapolated to other multicore ARMv8A SoCs.
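Both questions reduce to normalizing a measurement against a baseline; the small helper below (our own illustration, with made-up figures, not measurements from the study) shows the two ratios we report throughout the rest of this section:

```c
#include <stdio.h>

/* Relative overhead of a measurement against a baseline: 0.10 means
 * "10% worse". Both values must share a unit (e.g. µs, or 1/MBps). */
static double rel_delta(double baseline, double measured) {
    return (measured - baseline) / baseline;
}

int main(void) {
    /* Illustrative figures only, not measurements from this study. */
    printf("virtualization overhead: %+.1f%%\n",
           100.0 * rel_delta(100.0, 102.0));  /* bare-metal vs virtualized */
    printf("interference slowdown:   %+.1f%%\n",
           100.0 * rel_delta(102.0, 210.0));  /* quiet vs disturbed run    */
    return 0;
}
```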

4.3. Boot time overhead

System boot time is crucial to several automotive applications, like camera-based rear-view. The additional delay caused by the hypervisor (measured from the first hypervisor instruction to the first VM instruction) shall therefore remain small with respect to the other boot delays, e.g. bootloader, OS and application start times. This metric depends mainly on the module-level configuration (e.g. enabled hypervisor features/drivers, number of VMs, configured communication and scheduling schemes, module-level Health Monitoring, etc.) and on the VM configuration, especially the VM size. Our measurements (shown in figure 2) show that the hypervisor boot time depends mostly linearly on the VM size, and remains within the tens-of-milliseconds order of magnitude for the considered VM footprints.
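One way to take such a measurement, sketched below under the assumption that the guest may read the ARM generic timer and that the boot code logs its own entry timestamp (the shared symbol t_hyp_entry is hypothetical), is to sample the counter in the very first guest code executed:

```c
#include <stdint.h>

/* Counter value recorded at hypervisor entry and exported to the
 * guest, e.g. through a shared page -- hypothetical mechanism. */
extern volatile uint64_t t_hyp_entry;

/* Read the ARMv8 generic timer virtual counter and its frequency
 * (readable from the guest unless the hypervisor traps them). */
static inline uint64_t read_cntvct(void) {
    uint64_t v;
    __asm__ volatile("mrs %0, cntvct_el0" : "=r"(v));
    return v;
}

static inline uint64_t read_cntfrq(void) {
    uint64_t f;
    __asm__ volatile("mrs %0, cntfrq_el0" : "=r"(f));
    return f;
}

/* Called as early as possible in guest boot: returns the
 * hypervisor-induced boot delay in microseconds. */
uint64_t boot_overhead_us(void) {
    uint64_t now = read_cntvct();
    return (now - t_hyp_entry) * 1000000u / read_cntfrq();
}
```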

Figure 2. Hypervisor boot time (boot time in ms, from 0 to 50, vs. VM footprint in MB, from 0 to 1000)

2. https://www.sysgo.com/
3. https://www.ghs.com/products/rtos/integrity virtualization.html
4. http://blackberry.qnx.com/en/software-solutions/embedded-software/industrial/-hypervisor
5. https://www.mentor.com/embedded-software/hypervisor/
6. https://www.harman.com/
7. https://www.renesas.com/us/en/solutions/automotive/soc/r-car-h3.html

4.4. Memory footprint

RAM is a scarce resource in embedded systems, and especially so in the automotive domain, where an ECU's bill of materials is reduced to the minimum before deploying on millions of vehicles. Besides the RAM assigned to each VM, the hypervisor requires RAM for its own execution and for managing each VM. We thus define two metrics. First, the hypervisor memory footprint is the minimum amount of memory required by the hypervisor for its own execution; it may depend on the size of the hypervisor and VM binaries, and on the module-level configuration (e.g. enabled hypervisor features/drivers, number and types of VMs (native, para/full-virtualization, etc.), communication channels, scheduling schemes, module-level Health Monitoring, etc.). Second, for a given VM, the VM memory footprint is the minimum amount of memory required by the hypervisor to manage that VM; it may depend on e.g. the VM RAM size, the number of tasks (e.g. MMU tables) and the number of threads (e.g. execution stacks).

Two configurations have been considered, both with minimalist hypervisor services. The first configuration consists in a single native partition (or VM) running a single "Hello World" thread. The second configuration uses a full-virtualized Linux kernel, using network services.

Our measurements show that the VM memory footprint increases linearly with the size of the VM RAM, with an overhead of one 4KB page per 2MB mapped (page table), plus one 4KB page per 1GB mapped (page directory).

For the simple "Hello World" configuration, the VM memory footprint was measured at 8MB (for a maximum VM RAM size of 3910MB), and the hypervisor memory footprint at less than 2MB. For the heavier "Linux" configuration, the VM memory footprint was up to 9MB (for a maximum VM RAM size of 3883MB), and the hypervisor memory footprint about 28MB (including more than 16MB for the Linux VM kernel binaries).
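The page-table contribution to the VM footprint can be estimated directly from those two ratios; the small helper below (our own illustration, not tooling from the evaluation) reproduces the measured order of magnitude for a 3910MB VM:

```c
#include <stdint.h>
#include <stdio.h>

#define PAGE    (4ULL << 10)   /* 4KB page                    */
#define PT_SPAN (2ULL << 20)   /* one page table maps 2MB     */
#define PD_SPAN (1ULL << 30)   /* one page directory maps 1GB */

/* Estimated MMU-table overhead for a VM of vm_ram bytes:
 * one 4KB page per 2MB mapped, plus one 4KB page per 1GB mapped. */
static uint64_t mmu_overhead(uint64_t vm_ram) {
    uint64_t tables = (vm_ram + PT_SPAN - 1) / PT_SPAN;
    uint64_t dirs   = (vm_ram + PD_SPAN - 1) / PD_SPAN;
    return (tables + dirs) * PAGE;
}

int main(void) {
    uint64_t ram = 3910ULL << 20;   /* 3910MB VM RAM */
    printf("MMU tables: ~%llu KB\n",
           (unsigned long long)(mmu_overhead(ram) >> 10));
    /* ~7.8MB of page tables: consistent with the ~8MB measured
     * "Hello World" VM footprint. */
    return 0;
}
```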
4.5. Context switch overhead

The context switch time is a key performance indicator for hypervisors, especially in real-time applications. A context switch consists in saving the CPU state of a running thread so that its execution can later be resumed, and restoring the CPU state of a new ready thread. This metric depends on many parameters: the threads' privilege modes, FPU usage, cache and TLB contents, etc. In our evaluation, we estimate the context switch time by setting up an interrupt-driven client/server communication between two partitions running on the same core. The server VM has high priority and is blocked waiting for a request from the client VM, to which it replies immediately; the client measures the round-trip delay between request and reply. The round-trip delay is therefore an overestimation of twice the context switch time. Measurements have been made on PikeOS using the queuing-port communication mechanism, with the FPU unused, both with flushing of the TLB and L1 caches enabled and disabled, and with variable request sizes. Our measurements show a maximum round-trip delay of 17µs during the first ping-pong (cold caches); subsequent executions nearly halved that delay (8µs), probably due to warm caches. The context switch time is therefore estimated in the order of a few µs. Our use-cases do not involve software tasks with timing constraints below 500µs, hence a context-switch overhead of a few µs is deemed acceptable.
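The measurement loop is essentially the following ping-pong, sketched here with POSIX message queues standing in for PikeOS queuing ports (the queue names and message size are hypothetical):

```c
#include <fcntl.h>
#include <mqueue.h>   /* POSIX message queues, as a stand-in for
                         ARINC-653-style queuing ports */
#include <stdio.h>
#include <time.h>

static double now_us(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e6 + ts.tv_nsec / 1e3;
}

/* Client side: send a request, block on the reply, and report the
 * round-trip delay -- an upper bound on 2 context switches. */
int main(void) {
    char buf[64] = "ping";
    mqd_t req = mq_open("/req", O_WRONLY);   /* hypothetical ports */
    mqd_t rep = mq_open("/rep", O_RDONLY);

    for (int i = 0; i < 100; i++) {
        double t0 = now_us();
        mq_send(req, buf, sizeof buf, 0);
        mq_receive(rep, buf, sizeof buf, NULL);
        printf("round-trip %d: %.1f us\n", i, now_us() - t0);
    }
    return 0;
}
```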

4.6. Partitioning schemes

In order to estimate the impact of application partitioning, various guest VM configurations were evaluated (see figure 3):

(a) bare-metal: without hypervisor, a Yocto Linux distribution running various tests from the HBench-OS and MiBench suites on the first 2 ARM Cortex-A57 cores, C0 and C1, of the RCar-H3 SoC;
(b) core-partitioned: with the hypervisor enabled, several partitions based on core separation:
  – a Yocto Linux virtual machine running code from the HBench-OS and MiBench suites, mapped on cores C0 and C1,
  – an ELinOS partition, mapped on cores C2 and C3, in which a memory and network disturbance may be activated (this perturbation is generated by some benchmarks),
  – a service partition, mapped on C3, which runs a set of inter-VM services, especially a virtual Ethernet bridge;
(c) time-partitioned: the same user partitions, but now separated by a time-driven schedule (following the ARINC-653 avionics separation principle; a simplified sketch of this schedule is given at the end of this subsection):
  – on cores C0, C1 and C2, the following schedule repeats every 30ms:
    ∗ Yocto runs during 20ms,
    ∗ ELinOS runs during 10ms;
  – the service partition remains mapped on C3.

Figure 3. Hypervisor configurations

In all configurations, the Linux partition running the benchmark has access to 2 cores on average, so that multi-threaded compute-bound tasks are expected to yield similar average performance.

The impact of virtualization itself is estimated by comparing the performance of (b) against (a); the effect of interference due to scheduling (mostly intra-core) and resource sharing (mostly inter-core) is estimated by comparing (b) and (c), with perturbations enabled or disabled.

We measured that compute-bound tasks (the Basicmath and Bitcount tests from the automotive benchmark [7]) showed very similar performance in the bare-metal and core-partitioned configurations. In the time-partitioned configuration, a 1/3 slowdown is due to the single-threaded benchmark execution not exploiting the additional core.

For memory-bound tasks, however, the impact of virtualization is much more visible. Figure 4 shows the results of a memory-copy bandwidth test in the selected configurations, with various block sizes being copied by the benchmark VM. By itself, virtualization does not significantly affect the memory performance of the core-partitioned application when undisturbed (orange vs. black line in figure 4). Inter-partition interference, however, can reduce memory access performance by more than half in the core-partitioned configuration (solid orange vs. dashed orange line). These interferences are caused by conflicting use of shared resources like L2 cache lines and MMU TLB entries. The time-partitioning scheme protects the victim partition much better against perturbations (solid green vs. dashed green line), at the cost of an available bandwidth reduced by 1/3 (as expected from the time-sharing ratio). Perturbations still reduce the bandwidth available to the victim partition by 15%. In our test campaign, flushing the L1 cache and TLB at time-partition boundaries did not provide significantly better resistance to inter-core perturbations, probably due to L2 cache-induced interference.
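A time-driven schedule like configuration (c) is typically declared in the hypervisor's integration configuration; the C structure below is a hypothetical, simplified rendition of that 30ms major frame (PikeOS uses its own configuration format, not this one):

```c
/* Hypothetical time-partition table for cores C0-C2: a 30ms major
 * frame split into two windows, mirroring configuration (c). */
typedef struct {
    const char *partition;   /* partition receiving the window  */
    unsigned    offset_ms;   /* start offset within major frame */
    unsigned    length_ms;   /* window duration                 */
} tp_window_t;

static const unsigned MAJOR_FRAME_MS = 30;

static const tp_window_t schedule[] = {
    { "yocto",   0, 20 },    /* Yocto Linux runs for 20ms */
    { "elinos", 20, 10 },    /* ELinOS runs for 10ms      */
};
/* The service partition stays on C3 and is scheduled
 * priority-driven, outside this table. */
```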

4.7. Shared services

An additional series of tests on network bandwidth showed a drastic reduction due to perturbations going through shared services. In this test, two pairs of ELinOS-type virtual machines run the same network bandwidth code; each pair consists of a server and a client VM (see figure 5). All client/server communications run through the virtual Ethernet bridge implemented in the service partition.

The TCP client/server communication bandwidth drops from 60 MB/s (undisturbed) down to 3 MB/s in the presence of the network disturbance, i.e. a 90% bandwidth loss. This is due to the virtual Ethernet bridge being saturated by the perturbing traffic, and also preempting one of the ELinOS VMs, because the service partition runs at a higher priority and can be mapped onto any of the cores.
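The bandwidth measurement itself amounts to timing a bulk TCP transfer; a minimal client-side sketch (the address, port and transfer size are placeholders, not values from the test campaign) could look like:

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <time.h>
#include <unistd.h>

/* Stream TOTAL bytes to the server VM through the virtual Ethernet
 * bridge and report the achieved bandwidth. */
int main(void) {
    enum { CHUNK = 64 * 1024 };
    const long long TOTAL = 256LL << 20;               /* 256MB      */
    char buf[CHUNK];
    memset(buf, 0xA5, sizeof buf);

    int s = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in srv = { .sin_family = AF_INET,
                               .sin_port   = htons(5001) };
    inet_pton(AF_INET, "192.168.1.2", &srv.sin_addr);  /* placeholder */
    if (connect(s, (struct sockaddr *)&srv, sizeof srv) < 0) return 1;

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long long sent = 0; sent < TOTAL; sent += CHUNK)
        if (write(s, buf, CHUNK) != CHUNK) return 1;
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("bandwidth: %.1f MB/s\n", TOTAL / secs / (1 << 20));
    close(s);
    return 0;
}
```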
4.8. Conclusions for the automotive domain

The quantitative evaluation on PikeOS did not reveal any significant technical drawback; therefore, within our limited time and effort budget, a similar in-depth analysis was not reproduced on the other hypervisors.

In conclusion, to the automotive application programmer, the presence of the PikeOS hypervisor by itself causes only slightly visible performance overheads (on the order of tens of milliseconds of additional boot time, 10µs per partition context switch, a few % of memory overhead). Note that these figures are given as orders of magnitude, and shall not be used as worst-case execution times in a timing analysis: a full WCET characterization remains to be performed for a proper timing analysis of safety-critical applications in their specific configuration.

Moreover, interference between partitions or VMs competing for e.g. memory or shared services like the virtual network can become very problematic in real-time or safety-critical applications, as our tests revealed up to 60% memory bandwidth reduction and 90% network bandwidth reduction in certain configurations.

As a mitigation for memory-induced interference, we suggest relying on L2 cache separation when available, e.g. on the RCar-H3 SoC between the A57 cluster and the A53 cluster, to protect hard real-time applications or tasks (running in a paravirtualized RTOS or over the hypervisor native API) from full-virtualized GPOSs like the Linux-based Android Automotive or GENIVI. We expect full-virtualized GPOSs to heavily use memory and I/Os (especially for image-based ADAS or graphical interfaces), filling up the 2MB L2 cache of the A57 cluster and generating heavy traffic on the SoC system bus to DRAM and peripherals. On the other hand, we expect hard real-time tasks to be less memory-intensive, and therefore to keep the better part of their code and data within the 512kB L2 cache of the A53 cluster, mostly free from interference from the memory-hungry OSs running on the A57s. Recall that many real-time control tasks today run on micro-controllers with only tens of kB of ROM and RAM. An application-specific timing and interference analysis remains necessary for validating any safety-related real-time application.

We furthermore recommend that a software monitor be instantiated within each partition, to enforce priorities and/or quotas on the usage of services shared with time-critical VMs, like the virtual Ethernet bridge; a sketch of such a quota mechanism is given below. As the ETH-AVB/TSN protocol suite is becoming progressively adopted in the automotive domain for real-time and mixed-criticality traffic scheduling, a virtual bridge following the TSN scheduling schemes would be particularly useful.
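Such a monitor can be as simple as a token bucket charged per request to the shared service; the sketch below (our illustration; the rate and burst parameters are arbitrary) would throttle a partition's use of the virtual Ethernet bridge:

```c
#include <stdbool.h>
#include <stdint.h>

/* Token bucket limiting what a partition may push through a shared
 * service (e.g. the virtual Ethernet bridge). Refilled at `rate`
 * bytes/s up to `burst` bytes; requests beyond that are deferred. */
typedef struct {
    uint64_t rate;      /* sustained budget, bytes per second */
    uint64_t burst;     /* bucket capacity, bytes             */
    uint64_t tokens;    /* current budget, bytes              */
    uint64_t last_us;   /* last refill timestamp              */
} quota_t;

static bool quota_charge(quota_t *q, uint64_t now_us, uint64_t bytes) {
    q->tokens += (now_us - q->last_us) * q->rate / 1000000u;
    if (q->tokens > q->burst) q->tokens = q->burst;
    q->last_us = now_us;
    if (bytes > q->tokens) return false;  /* defer: over budget */
    q->tokens -= bytes;
    return true;                          /* forward to bridge  */
}
```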

Figure 4. Memory copy bandwidth comparison (memcpy bandwidth, in MB/s from 0 to 12 000, vs. block size from 2k to 8m; series: bare-metal 2-core, core partitioning, core partitioning + perturbation, time-partitioning, time-partitioning + perturbation, time-partitioning with cache flushing, and time-partitioning with cache flushing + perturbation)

Figure 5. Network bandwidth benchmark configuration

Figure 6. Recommended hypervisor configuration

5. Conclusions

We have defined a methodological approach for selecting an embedded hypervisor as a generic platform for hosting heterogeneous software applications on a single multi-core SoC. Our methodology takes into account various aspects: technical ones, e.g. performance, as well as non-technical ones, e.g. related to the business model. This methodology was applied to an automotive use-case.

Moreover, by analyzing the performance and interference metrics of the target hypervisor in various configurations, we obtained a useful indication of the configurations most suitable for specific applications.

Throughout the application of this methodological process to an automotive industrial use-case, we discovered how heavily non-technical criteria can weigh with regard to technical ones. It became clear that criteria related to strategy or the business model (e.g. applicable regulation and licensing model) can be real show-stoppers, whereas most performance-related issues could be dealt with when designing or adapting specific guest applications.

References

[1] M. Mounika, C. N. Chinnaswamy, "A Comprehensive Review on Embedded Hypervisors", International Journal of Advanced Research in Computer Engineering & Technology (IJARCET), Volume 5, Issue 5, May 2016.
[2] A. Patel, M. Daftedar, M. Shalan, M. W. El-Kharashi, "Embedded Hypervisor Xvisor: A Comparative Analysis", 23rd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing, 2015.
[3] T. Shimada, T. Yashiro, N. Koshizuka, K. Sakamura, "A Real-Time Hypervisor for Embedded Systems with Support", TRON Symposium (TRONSHOW), 2014.
[4] L. McVoy, C. Staelin, "lmbench: Portable Tools for Performance Analysis", Proceedings of the USENIX 1996 Annual Technical Conference, San Diego, California, January 1996, pp. 279-294.
[5] N. Hatt, A. Sivitz, B. A. Kuperman, "Benchmarking Operating Systems", Conference for Undergraduate Research in Computer Science and Mathematics, pp. 63-68, 2007.
[6] P. Koopman, J. Sung, C. Dingman, D. Siewiorek, T. Marz, "Comparing Operating Systems Using Robustness Benchmarks", Proceedings of SRDS'97: 16th IEEE Symposium on Reliable Distributed Systems, Durham, NC, USA, 1997, pp. 72-79.
[7] HBench-OS, https://www.eecs.harvard.edu/margo/papers/sigmetrics97-os/hbench/, visited on 2019-06-07.