
Department of Computer Science

Real-time Code Generation in Virtualizing Runtime Environments

Dissertation

submitted in fulfillment of the requirements for the academic degree

doctor of engineering (Dr.-Ing.)

to the Department of Computer Science of the Chemnitz University of Technology

from: Dipl.-Inf. (Univ.) Martin Däumler
born on: 03.01.1985
born in: Plauen

Assessors: Prof. Dr.-Ing. habil. Matthias Werner, Prof. Dr. rer. nat. habil. Wolfram Hardt

Chemnitz, March 3, 2015

Acknowledgements

The work on this research topic was a professional and personal development, rather than just writing this thesis. I would like to thank my doctoral adviser Prof. Dr. Matthias Werner for the excellent supervision and the time spent on numerous professional discussions. He taught me the art of academia and research. I would like to thank Alexej Schepeljanski for the time spent on discussions, his advice on technical problems and the very good working climate. I also thank Prof. Dr. Wolfram Hardt for the feedback and the cooperation. I am deeply grateful to my family and my girlfriend for the trust, patience and support they gave to me.

Abstract

Modern general purpose programming languages like Java or C# provide a rich feature set and a higher degree of abstraction than conventional real-time programming languages like C/C++ or Ada. Applications developed with these modern languages are typically deployed via platform independent intermediate code. The intermediate code is typically executed by a virtualizing runtime environment. This allows for a high portability. Prominent examples are the Dalvik VM of the Android operating system, the Java Virtual Machine as well as .NET’s Common Language Runtime. The virtualizing runtime environment executes the instructions of the intermediate code. This introduces additional challenges to real-time software development. One issue is the transformation of the intermediate code instructions to native code instructions. If this transformation interferes with the execution of the real-time application, this might introduce jitter to its execution times. This can degrade the quality of soft real-time systems like augmented reality applications on mobile devices, but can lead to severe problems in hard real-time applications that have strict timing requirements. This thesis examines the possibility to overcome timing issues with intermediate code execution in virtualizing runtime environments. It addresses real-time suitable generation of native code from intermediate code in particular. In order to preserve the advantages of modern programming languages over conventional ones, the solution has to adhere to the following main requirements:

• Intermediate code transformation does not interfere with application execution
• Portability is not reduced and code transformation is still transparent to a programmer
• Comparable performance

Existing approaches are evaluated. A concept for real-time suitable code generation is developed. The concept is based on a pre-allocation of the native code and the elimination of indirect references, while considering and optimizing the startup time of an application.
This concept is implemented by the extension of an existing virtualizing runtime environment, which does not target real-time systems per se. It is evaluated qualitatively and quantitatively. A comparison of the new concept to existing approaches reveals high execution time determinism and good performance, while preserving the portability and the deployment of applications via intermediate code.

Contents

1 Introduction 1
1.1 Motivation ...... 1
1.2 Problem Statement ...... 2
1.3 Objectives of this Work ...... 4
1.4 Structure ...... 5

2 Fundamentals 7
2.1 Introduction ...... 7
2.2 Real-Time ...... 7
2.3 Conventional Applications ...... 8
2.3.1 Introduction ...... 8
2.3.2 Compilation ...... 10
2.3.3 Linking ...... 13
2.3.4 Loading ...... 20
2.4 Virtualizing Runtime Environments ...... 22
2.4.1 Introduction ...... 22
2.4.2 Architecture and Services of a VM ...... 24
2.4.3 Workflow ...... 26
2.4.4 Examples of High-Level Language Virtual Machines ...... 30
2.5 Summary ...... 31

3 State of the Art 33
3.1 Introduction ...... 33
3.2 Real-Time Specifications ...... 33
3.2.1 Java Platform ...... 33
3.2.2 CLI Platform ...... 35
3.3 Hardware-supported Execution ...... 35
3.3.1 Java Platform ...... 35
3.3.2 CLI Platform ...... 36
3.4 Software-supported Execution ...... 37
3.4.1 Introduction ...... 37
3.4.2 Java Platform ...... 38
3.4.3 CLI Platform ...... 41
3.4.4 Low-Level Virtual Machine ...... 43
3.5 Summary ...... 43

4 Allocation of Native Code 45
4.1 Evaluation of Existing Approaches ...... 45
4.1.1 Hardware-supported Execution ...... 45
4.1.2 Software-supported Execution ...... 45


4.2 Concept ...... 48
4.3 JIT-based Pre-Compilation ...... 50
4.3.1 Principle ...... 50
4.3.2 Implementation ...... 52
4.4 Testing Environment ...... 53
4.5 Experimental Results and Evaluation ...... 61

5 Patching of Native Code 69
5.1 Lazy Compilation and Reference Handling ...... 69
5.2 Concept ...... 71
5.3 Implementation ...... 71
5.4 Experimental Results and Evaluation ...... 75

6 Optimization of Startup Time 79
6.1 Interim Analysis ...... 79
6.2 Reduction of Allocated Code ...... 79
6.2.1 Concept ...... 79
6.2.2 Implementation ...... 81
6.2.3 Experimental Results and Evaluation ...... 82
6.3 Checkpoint and Restore ...... 86
6.3.1 Concept ...... 86
6.3.2 Implementation ...... 86
6.3.3 Experimental Results and Evaluation ...... 87

7 Evaluation 89
7.1 Internal Experiments ...... 89
7.1.1 Introduction ...... 89
7.1.2 Standard Execution Mode ...... 89
7.1.3 Real-Time Code Generation Mode ...... 90
7.1.4 Summary ...... 94
7.2 Comparative Experiments ...... 95
7.2.1 Introduction ...... 95
7.2.2 Instance Methods benchmark ...... 96
7.2.3 Interface Methods benchmark ...... 101
7.2.4 Class Methods benchmark ...... 104
7.2.5 Type Initializer benchmark ...... 108
7.2.6 Static Methods benchmark ...... 112
7.2.7 Static Class Methods benchmark ...... 115
7.2.8 Summary ...... 118

8 Summary and Outlook 121
8.1 Discussion ...... 121
8.2 Conclusion ...... 122
8.3 Outlook ...... 124


List of Abbreviations 125

Bibliography 136


List of Figures

1.1 Normalized execution times of 1000 interface methods that are written in C# and run by the virtualizing runtime environment “Mono” version 2.6.1 in JIT mode ...... 3

2.1 Process’ point of view on a machine ...... 8
2.2 Workflow of a native program from source code to a running process ...... 10
2.3 Calling a dynamically linked function in an ELF executable ...... 21
2.4 Process Virtual Machine ...... 23
2.5 Compiler and Loader in conventional and HLL VM environment (derived from Figure 5.1 in [109, Ch. 5]) ...... 24
2.6 Abstract Architecture of a High-level Language Virtual Machine ...... 25
2.7 Workflow of a program from source code to execution by a HLL VM ...... 27

4.1 Pre-compilation based on Mono’s JIT compiler ...... 50
4.2 Triggering Mono’s JIT compiler via managed code ...... 53
4.3 Standard Deviation of the simple and complex benchmark variant written in C++, IA-32 ...... 60
4.4 Frequency Distribution of the observed execution times of the simple and complex benchmark variant written in C++, IA-32 ...... 60
4.5 Frequency Distribution of the observed execution times of the third and fourth measurement of Mono in JIT mode, IA-32 ...... 62
4.6 Standard Deviation of execution times of 1000 methods (note the logarithmic scale), IA-32 ...... 63
4.7 Startup time of 1000 methods ...... 67

5.1 Call of a lazily compiled method in Mono ...... 70
5.2 Interplay of Pre-Compilation and Pre-Patch ...... 72
5.3 Pre-Patch of a method call through the Virtual Table ...... 73
5.4 Pre-Patch of a method call through the Interface Method Table ...... 74
5.5 Frequency Distribution of the observed execution times of all four measurements of MonoRT in Pre-Patch mode, IA-32 ...... 76
5.6 Standard deviation of execution times of 1000 methods, IA-32 ...... 76
5.7 Startup time of 1000 methods, IA-32 ...... 77

6.1 Interplay of AOT compiler based Pre-Compilation and Pre-Patch ...... 80
6.2 Frequency Distribution of the observed execution times of all four measurements of MonoRT in AOT-based Pre-Patch mode, IA-32 ...... 83
6.3 Standard deviation of execution times of 1000 methods, IA-32 ...... 84
6.4 Startup time of 1000 methods, IA-32 ...... 85
6.5 Startup time of 1000 methods, ARM ...... 86


6.6 Execution times of 1000 methods using the JIT-based pre-compilation mode of MonoRT, ARM ...... 87
6.7 Startup time of 1000 methods, ARM ...... 88

7.1 Frequency Distribution of the observed execution times of all four measurements of IBM WebSphere Real-Time for RT in AOT mode, IA-32 ...... 98
7.2 Frequency Distribution of the observed execution times of all four measurements of .NET in PRC mode cut at 276 µs, IA-32 ...... 99
7.3 Standard Deviation of execution times of benchmark with 1000 instance methods ...... 99
7.4 Startup time of benchmark with 1000 instance methods ...... 100
7.5 Average observed execution times of benchmark with 1000 interface methods run by MonoRT, IA-32 ...... 102
7.6 Observed execution times of benchmark with 1000 interface methods ...... 103
7.7 Standard Deviation of execution times of benchmark with 1000 interface methods ...... 103
7.8 Startup time of benchmark with 1000 interface methods ...... 104
7.9 Average observed execution times of benchmark with 1000 class methods run by MonoRT, IA-32 ...... 106
7.10 Observed execution times of benchmark with 1000 class methods ...... 107
7.11 Standard Deviation of execution times of benchmark with 1000 class methods ...... 107
7.12 Startup time of benchmark with 1000 class methods ...... 108
7.13 Average observed execution times of benchmark with 1000 class methods and type initializer run by MonoRT, IA-32 ...... 110
7.14 Observed execution times of benchmark with 1000 class methods and type initializer ...... 110
7.15 Standard Deviation of execution times of benchmark with 1000 class methods and type initializer ...... 111
7.16 Startup time of benchmark with 1000 class methods and type initializer ...... 112
7.17 Average observed execution times of benchmark with 1000 static methods run by MonoRT, IA-32 ...... 113
7.18 Observed execution times of benchmark with 1000 static methods ...... 114
7.19 Standard Deviation of execution times of benchmark with 1000 static methods ...... 114
7.20 Startup time of benchmark with 1000 static methods ...... 115
7.21 Average observed execution times of benchmark with 1000 static class methods run by MonoRT, IA-32 ...... 116
7.22 Observed execution times of benchmark with 1000 static class methods ...... 117
7.23 Startup time of benchmark with 1000 static class methods ...... 118

List of Tables

4.1 Evaluation of code allocation and execution approaches, respectively (X/◦/− = achieved/restricted/missed) ...... 48
4.2 Results of the simple and complex benchmark variant written in C++, IA-32 ...... 59
4.3 Observed execution times of 1000 methods of the C++ variant and the C# variant using different execution modes of Mono and MonoRT, IA-32 ...... 62
4.4 Observed execution times of 1000 methods of the C++ variant and the C# variant using different execution modes of Mono and MonoRT, ARM ...... 64

4.5 RET and RSD of 1000 methods of the C++ variant and the C# variant using different execution modes of Mono 2.6.1 and MonoRT, IA-32 . . . . . 66

4.6 RET and RSD of 1000 methods of the C++ variant and the C# variant using different execution modes of Mono 2.6.1 and MonoRT, ARM . . . . . 67

5.1 Observed execution times of 1000 methods of the C++ variant and of the C# variant using the pre-compilation and the Pre-Patch execution mode of MonoRT, IA-32 ...... 75

5.2 RET and RSD of 1000 methods of the C++ variant and the C# variant using pre-compilation and Pre-Patch mode of MonoRT, IA-32 ...... 77

6.1 Observed execution times of 1000 methods of the C++ variant and the C# variant executed in different pre-compilation and Pre-Patch modes of MonoRT, IA-32 ...... 83

6.2 RET and RSD of 1000 methods of the C++ variant and the C# variant run by MonoRT, IA-32 ...... 84
6.3 Observed execution times of the minimal set of 1000 methods (C#) benchmark executed by MonoRT in AOT-based Pre-Patch mode, IA-32 ...... 84

7.1 Observed execution times of benchmark with 1000 instance methods . . . . 97

7.2 RET and RSD of benchmark with 1000 instance methods ...... 100
7.3 RET and RSD of benchmark with 1000 interface methods ...... 104
7.4 RET and RSD of benchmark with 1000 class methods ...... 106
7.5 RET and RSD of benchmark with 1000 class methods and type initializer ...... 111
7.6 RET and RSD of benchmark with 1000 static methods ...... 113
7.7 RET and RSD of benchmark with 1000 static class methods ...... 117


Listings

2.1 Source file main.c ...... 10
2.2 Header file func.h ...... 10
2.3 Source file func.c ...... 10
2.4 Assembly code of C source file func.c (Listing 2.3) ...... 11
2.5 Section headers of ELF object file main.o ...... 12
2.6 entries of section .text in ELF object file main.o ...... 13
2.7 Disassembly of ELF object file main.o ...... 13
2.8 Disassembly of ELF executable file main-static.out (excerpt) ...... 15
2.9 Building the ELF executable file main-dyn.out and listing its dependencies ...... 18
2.10 Relocation entries of ELF executable main-dyn.out ...... 18
2.11 Disassembly of ELF executable main-dyn.out (excerpt) ...... 19
2.12 Sections of ELF executable main-dyn.out (excerpt) ...... 19
2.13 Source file main.cs ...... 27
2.14 Source file func.cs ...... 27
2.15 Metadata and Intermediate Code of file main.exe (Listing 2.13) ...... 28
2.16 Assembly code before execution (excerpt) ...... 29
2.17 Assembly code after execution (excerpt) ...... 29
2.18 Disassembly of AOT compiled code in ELF shared object main.exe.so (excerpt) ...... 30

4.1 Pseudo code of micro-benchmark ...... 54
4.2 Structure of simple benchmark methods ...... 55
4.3 Structure of complex benchmark methods ...... 56
4.4 Example assignment of variables to methods in complex benchmark ...... 57
4.5 Pseudo code pattern of instance method benchmark ...... 58

7.1 Pseudo code pattern of interface method benchmark ...... 101
7.2 Pseudo code pattern of class methods benchmark ...... 105
7.3 Pseudo code pattern of type initializer benchmark ...... 109
7.4 Pseudo code pattern of static methods benchmark ...... 112
7.5 Pseudo code pattern of static class methods benchmark ...... 116


1 Introduction

1.1 Motivation

Today’s prevailing programming languages for embedded systems are languages like C, C++ and Ada [18]. These high-level languages provide a higher degree of abstraction than the assembly language, but they also provide “bare-metal” access, which gives the developer full control over the application. In the conventional building process, their source code is translated into native code that can be executed directly by the processor. These properties predestine those languages for the use in real-time systems. In real-time systems, which are often also embedded systems, not only the result of a computation has to be correct. The time at which the result is available has to be predictable, too. An example for real-time systems is the domain of automation engineering. Typical applications are control circuits, where an input from a sensor has to be processed within a given time budget to write an output that controls a certain actuator. The system might “get out of step” if the processing of the input value takes too long. The software often runs on specialized computation systems, called Programmable Logic Controllers (PLCs). It is typically written in the procedural programming languages that are defined in the standard IEC 61131-3 [54]. Increasing performance of hardware for embedded systems and complex tasks that have to be modeled, even in real-time systems, make it desirable to use programming languages with a higher level of abstraction than C, C++, Ada or the IEC 61131-3 languages. That could also facilitate the development of applications with individual requirements in automation [22]. This is a motivating example for the use of modern general-purpose languages like Java [46] and C# [3] for the programming of real-time systems. They provide features like object-orientation, inheritance, interfaces, interrupt and callback functions as well as events. So, application development and code reusability can be enhanced.
A strong typing system, exception handling and programming without direct memory references (pointers) can make programs very stable. In contrast to the conventional building process of software for embedded and real-time systems, the source code of the aforementioned languages is typically translated into an intermediate code. The instructions of the intermediate code cannot be executed directly by general-purpose processors1. When it comes to execution, the intermediate code is input for another program, a so-called virtualizing runtime environment2, which is often realized as a virtual machine (VM). There are several types of virtual machines in the area of computer science, but this thesis is about virtual machines in the area of programming languages and user programs. Section 2.4 introduces details. The VM represents a computation system that is able to execute the instructions of the intermediate code. One of the advantages of this principle is an increased portability

1Special solutions provide hardware-supported execution of intermediate code, e.g., for Java’s intermediate code. 2This term is often used synonymously with “managed runtime environment”.

of applications. Only the VM has to be adapted to a certain hardware architecture and operating system (OS). A prominent example is the “Android” system. It is an operating system for mobile devices like smart phones and tablet computers, which is based on the Linux kernel. Applications for Android can be written in the Java language. The source code is translated to a special intermediate format that can be executed by the so-called Dalvik-VM [129, Ch. 2] [87]. The Dalvik-VM has to be adapted to the specific (mobile) device. Once written, an application can run on plenty of different devices that run Android and the Dalvik-VM, respectively. Java is suitable to build large-scale applications. For example, the distribution framework for ambient networks – SimANet [126] – is written in Java. There are also use cases [39] that apply a JVM to reconfigurable embedded systems [77]. While this approach is advantageous for the portability of an application, it might represent a problem for real-time systems. This is the case when the transformation of the intermediate code into an executable form is done on demand at application run time, as is typical by means of a Just-in-Time (JIT) compiler. Then, the transformation can interfere with the execution of the application and the timing behavior of an application might become unpredictable. There are more aspects of a VM execution that are critical for real-time systems. This work addresses the real-time suitable generation of native code from intermediate code in particular.

1.2 Problem Statement

From an abstract point of view, three basic resources are necessary to execute a computer program: code, memory and the Central Processing Unit (CPU). Finally, a program has to be available in a form of code that can be processed directly by the real CPU. This code is called machine code or native code. The program’s native code has to be directly accessible by the CPU, e.g., by loading it into main memory3. While code and memory are necessary conditions for program execution, the CPU is a sufficient one. When the CPU’s program counter points to the code, the program runs. When a VM executes intermediate code, it has to allocate the native code, allocate the memory for it and set the control flow to it. There are several possibilities to provide the resource code. If that happens during the execution of an application, dynamic compilation is applied [112, p. 739]. An interpreter executes the intermediate code immediately instead of generating native code [17]. It “reads instructions [...] one at a time, performing each operation in turn on a software-maintained version [...]” [108, p. 1]. For performance reasons, today’s prevailing technique is to provide native code, so that it can be executed directly by the CPU. To be memory and time efficient, JIT compilers – which are transparent dynamic compilers [112, p. 741] – are used. A JIT compiler generates the native code from a piece of intermediate code when the intermediate code, e.g., a method, is executed the very first time [112, p. 741]. This approach is called lazy

3In some embedded systems, the application’s native code can be executed in place, so that no explicit allocation of memory and loading of code is necessary.

compilation. During the generation of the native code, memory has to be allocated to store it. That is, the dynamic generation of native code can also affect the allocation of memory. The complex translation process delays the first execution of a piece of intermediate code. That can lead to significant jitter in the execution time between the first and the following executions. Figure 1.1 shows an excerpt from a micro-benchmark that examines execution time determinism.

[Bar chart: normalized execution time in percent; first execution = 100, second execution ≈ 0.075]
Figure 1.1: Normalized execution times of 1000 interface methods that are written in C# and run by the virtualizing runtime environment “Mono” version 2.6.1 in JIT mode

Details about the benchmark and the testing environment are not necessary to demonstrate how lazy compilation affects execution time determinism. They are explained in detail in Section 4.4. In the benchmark, 1000 simple methods are called through an interface two times (through different method calls) and their execution time is measured. The application is written in C#, compiled to intermediate code and run by the Common Language Infrastructure (CLI) [4] implementation Mono [128] in version 2.6.1 and JIT compilation mode. Figure 1.1 illustrates that the first-execution overhead would introduce high jitter. Note that – for demonstration purposes – the execution time of the methods, which are called, is intentionally short compared to the execution time of the JIT compiler, so that the result looks “dramatic”. The lazy compilation approach is not suitable for real-time systems. A variant of the lazy compilation principle is to generate native code in advance, e.g., by an Ahead-of-Time (AOT) compiler [43, 115]. A VM might load the AOT-compiled code lazily into main memory on demand, without the need for invoking a JIT compiler. While the lazy loading of code might be less time consuming than code generation, it also causes one-time overhead, i.e., jitter. [17, p. 1] even claims that “JIT compilation systems [...] are completely unnecessary. They are only a means to improve the time and space efficiency of programs. After

all, the central problem JIT systems address is a solved one: translating programming languages into a form that is executable on a target platform”. Dynamic compilers face the problem that the allocated code might refer to not yet allocated application code. These references typically point to VM-internal functions that load the referenced code. Further, a VM consists of several parts that provide services to ensure correct program execution [44, p. 10]. These services are executed transparently (e.g. automatic memory management) or they are used in an implicit manner (e.g. the threading mechanism). Hence, the allocated native code can contain references not only to other application code but also to functions and data structures that are internal to the VM. If they are resolved at execution time, this process can cause one-time overhead when a certain piece of code is executed the very first time. Thus, there might be more indirect references in AOT-compiled code than in JIT-compiled code, because the addresses of run-time structures might be known to a JIT compiler at compilation time [16].
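The first-execution overhead discussed above can be made visible with a few lines of code. The following Java sketch is only an illustrative analogue of the C#/Mono micro-benchmark described in Section 4.4; the class and interface names are invented for this example, and the numbers it produces are not comparable to the measurements in this thesis. It times a call through an interface once before and once after the runtime’s lazy machinery for that call site has been exercised.

```java
public class FirstCallJitter {
    // Hypothetical interface standing in for the benchmark's interface methods.
    interface Work { int step(int x); }

    // The method body is intentionally trivial, so any difference between the
    // two calls is dominated by one-time runtime overhead (class loading,
    // lazy compilation, call-site resolution) rather than by the work itself.
    static final class Impl implements Work {
        public int step(int x) { return x + 1; }
    }

    public static void main(String[] args) {
        Work w = new Impl();

        long t0 = System.nanoTime();
        int r = w.step(1);               // first call through the interface
        long first = System.nanoTime() - t0;

        t0 = System.nanoTime();
        r += w.step(r);                  // second call: code path already set up
        long second = System.nanoTime() - t0;

        System.out.println("first  call: " + first + " ns (result " + r + ")");
        System.out.println("second call: " + second + " ns");
    }
}
```

On a runtime with lazy loading and JIT compilation, the first call typically takes considerably longer than the second; a real measurement would repeat this over many methods, as the benchmark in Section 4.4 does.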

1.3 Objectives of this Work

In this thesis, a new concept for realizing deterministic execution times of applications, which are deployed in intermediate code and which are executed by a virtualizing runtime environment, is developed. The development is driven by the following objectives:

O.1 Reach execution times that are not affected by non-deterministic allocation of the resource code or related issues

O.2 Modifications of applications are not necessary

O.3 Minor impact of real-time suitable code generation on the total execution times compared to conventional execution

O.4 Low startup time and scalability of the startup time

The ultimate objective O.1 avoids that the VM interferes with an application’s execution non-deterministically by generating/loading – more generally, “allocating” – the application’s native code. O.2 is self-explanatory. It means that the deployment of applications via intermediate code – as far as possible without extra effort – should be retained. It implies a side-by-side use of real-time and non-real-time code. That is, dynamic language features can be used when real-time behavior does not have to be guaranteed. Further, legacy code or external libraries can be augmented with real-time capabilities, at least for the non-dynamic part of the code. The development objectives O.3 and O.4 target the practical relevance of the solution. The startup time denotes the time which elapses whenever the application, which is deployed successfully, is started. That is, it means the time span from the point in time when the application is invoked until the point in time when the first instruction of the application code is executed. The startup time does not include any time that is related to application deployment or any compilation which is not directly related to application execution. The solution to be developed considers all – partially conflicting – goals together, which are ordered in descending priority. Optimizations that concern a

goal at a higher level are preferred. The temporal determinism of code execution has the highest priority. In order to achieve all goals, the work discusses three main issues.

Allocation of native code As a basic step, raw native code has to be allocated from the intermediate code of an application. It has to be evaluated which code generation technique best suits the given requirements.

Optimization of native code The concept includes the elimination of indirect references from the raw native code, so that it reaches real-time suitable execution time determinism. It also allows a mixed execution mode of real-time and non-real-time code.

Optimization of startup time In order to make the solution applicable in real-world scenarios, an application’s startup time should be low and it should scale with the size of the application.

The concept is evaluated by augmenting a virtualizing runtime environment with real-time suitable code allocation. The effectiveness of the concept is examined by experiments. The development of these tests that reveal temporal non-determinism is an essential part of this work. The new concept has to compete with existing solutions, which are used to conduct the same experiments.

This thesis addresses the real-time suitable, i.e., deterministic, allocation of the resource native code from intermediate code. Real-time suitable is used in the sense that the code is available when it is needed, without interfering with the application’s execution. The generation of the native code itself (e.g., by means of compilation), its quality, e.g., performance or size, or its worst case execution time (WCET) analysis are not addressed here. The allocation of the other two resources – memory and CPU – is not covered directly, except if there are things in common. According to the classification of components of a VM given in [44, p. 10], this work only considers the “execution engine”. It does not address other components that are also issues of real-time system programming. These include automatic management of memory that is allocated by the application itself by means of a “memory manager” or Garbage Collector (GC), threading including synchronization, scheduling and prioritization (“threading support”), language and programming features as well as compiler optimization in general. If there are things in common, they are pointed out.

1.4 Structure

Chapter 2 introduces fundamental terms and information regarding real-time systems and the workflow from source code to a running application, which are intended to facilitate the understanding of this thesis. Chapter 3 presents the state of the art. It gives an overview of ongoing and completed work in the field of real-time virtual machines. It extracts major approaches that face the problem of native code generation.


Chapter 4 evaluates existing approaches and solutions of native code generation in virtualizing runtime environments. A new approach is developed based on those results. It incorporates pre-allocation of native code in the startup phase of an application. A newly developed test framework is introduced and applied to the new approach in order to examine its real-time capability. Chapter 5 handles indirect references in the pre-allocated native code, a problem which is not solved by the approach introduced in Chapter 4. The test framework is applied to the solution, which provides real-time suitable deterministic execution times. However, the startup time of an application is increased. Two approaches, which reduce the startup time, are discussed in Chapter 6. Chapter 7 examines the correctness of the solution and provides comparative experiments, which show how the solution competes with up-to-date virtualizing runtime environments, including dedicated real-time solutions. The corner cases of the solution are described and solutions are suggested. Chapter 8 concludes this thesis. It summarizes the achievements and gives an outlook.

2 Fundamentals

2.1 Introduction

This chapter introduces fundamental terms and concepts from the field of real-time systems, build systems, virtualizing runtime environments and operating systems. As mentioned in Section 1.2, a program needs the resources code, memory and the CPU to be executed. Especially the management of code and memory are considered here from different points of view. The typical workflow from a program’s high-level source code right up to the start of a program is considered, both for a conventional application and for an application that is run by a VM. So, the conceptual characteristics of VM-driven execution are pointed out, which eases the understanding of the design decisions made in this work.

2.2 Real-Time

In [66, p. 2], a real-time system is defined as follows:

“A real-time computer system is a computer system where the correctness of the system behavior depends not only on the logical results of the computations, but also on the physical time when these results are produced.”

[114] gives a similar definition. That is, a real-time system guarantees or tries to guarantee that a correct result is available at a predictable point in time. Real-time systems can be classified into soft real-time systems and hard real-time systems. A system is called hard when missing the timing guarantees would cause harm that is at least one order of magnitude higher than the system’s benefit [127, p. 15 – 16]. In contrast, when a soft real-time system misses its timing guarantees, it causes harm that is only about the magnitude of its benefit. [127] also discusses that this is a quite arbitrary classification. There are a lot more terms that characterize real-time properties and requirements, which are not needed here. The quintessence of the given definitions is that real-time does not mean “fast” or “as fast as possible”. It means that a result is available at a predictable point in time. Temporal determinism or timing guarantees, respectively, are more important than performance. For a program which has to fulfill real-time requirements, the availability of the resources code, memory and CPU necessarily has to be deterministic. In this work, a concept for the real-time suitable availability of the resource code during execution of an application in a virtualizing runtime environment is developed.

2 Fundamentals

2.3 Conventional Applications

2.3.1 Introduction

This section describes the workflow from the high-level source code of an application, which is compiled conventionally and which is to be executed on a real machine, up to its imminent start. First, the meaning of the term "machine" in this work is defined. In this context, a "process" is simply considered a program in execution [113, Ch. 2]. The following definitions are inspired by [109, Ch. 1]. From a process' point of view, a machine is "a combination of the operating system and the underlying user-level hardware". The hardware, especially the CPU, provides the Instruction Set Architecture (ISA) as the interface to the software running on the machine. The ISA includes the CPU's set of instructions, its operands, e.g., registers, and the CPU's behavior when it executes an instruction. That is, the ISA describes "what" is provided by the hardware and "how" it is used. Typically, it is divided into two parts. The user-level part is available to user applications, while the system-level part includes instructions to manage the hardware1. The system-level part is available only to managing or monitoring software like an operating system. The operating system provides an interface by means of "system calls", which makes these instructions (indirectly) available to a process. An interrupt or a special instruction transfers control to the operating system, and a certain operation is performed based on the arguments passed on the stack or in registers. So, the interface between a process and the machine is the Application Binary Interface (ABI). It is comprised of the user-level part of the ISA and the operating system's system call interface [109, Ch. 1], see Figure 2.1.

[Figure 2.1: Process' point of view on a machine – native applications sit on the ABI, which comprises the user-level ISA and the system call interface of the operating system; the operating system itself uses the full user & system ISA provided by the hardware.]

The ABI is a convention how software interacts with the underlying hardware and other software components – even if they are programmed in another high-level language – at machine code level. For example, the ABI describes how arguments are passed to a called function. This convention does not need to be used for the internal interaction of components of a program. A program that is built for a certain ABI can be deployed and run as expected on another machine that provides the same ABI, without extra effort. This typically means that the machine has the same (or a compatible) ISA and runs the same (or a compatible) operating system. The program has to be re-built in order to run it on a machine that provides another ABI. That is, building applications for a specific ABI harms portability. As a side note, a program that is compiled for a certain ABI runs as expected on a machine as long as the elements of the machine's ABI that are actually used match the program's ones. That is, a program can be deployed and run on a machine that provides a (slightly) different ABI than the one the program is built for, as long as the varying parts of the ABI are not used. As a further side note, the interface for software components at source code level is called the Application Programming Interface (API). It provides standard functionality like console output or file handling to a programmer for a specific programming language.

1Also called "privileged instructions".

The remainder of this section describes the workflow of building and starting a program that is built for the ABI of a real machine. The final executable program contains directly executable machine code, which is sometimes referred to as "native code" because it runs on the bare machine2, and on its ABI, respectively. Such a program is denoted as a "native program". Extra tools like operating system services might be necessary to launch the program, but it does not run in the context of another program like a VM3. The process of building native programs is sometimes referred to as "static compilation". In this context, the term "static" is somewhat misleading, as a statically compiled program might have quite dynamic behavior. So, the term "static compilation" is not used here. Emphasis is put on the actions performed when the machine code of the program is going to be run, especially with regard to the allocation of the resources code and memory. Details concerning the generation of native code from source code are not the focus of this work, see section 1.3. An exemplary workflow of building and running a native program is given in Figure 2.2. An example program written in the language C [65] is used for demonstration purposes. The target and build machine is a personal computer with the IA-324 ISA, which runs the operating system Linux. The GNU Compiler Collection (GCC) [41] is used for the build process. GCC is a compilation driver that can perform several actions to generate an executable from source code. Therefore, it uses tools like a pre-processor, a source code compiler, an assembler and a linker. Details of the build process might vary from platform to platform, but the basic concepts apply to most modern systems. In the example, the Executable and Linkable Format (ELF) [29] is used for the object files and executable files. There are other file formats like the Common Object File Format (COFF) and a.out [70, Ch. 3].

2The underlying hardware itself might be a virtualized computing system or a binary [108].
3A discussion about commonalities and differences of VMs and operating systems is given later.
4Also known as x86.


[Figure 2.2: Workflow of a native program from source code to a running executable – (a) source code is translated by the compiler to (b) assembler code, which the assembler turns into (c) relocatable object files; the linker combines them, possibly with library files, into (d) an executable file, which the OS loader (together with shared libraries and the OS dynamic linker/loader) brings into main memory as (e) an executing process image.]

2.3.2 Compilation

This section considers the generation of object files from source code via assembler code, i.e., the way from Figure 2.2a to Figure 2.2c. Listing 2.1 shows the C source file (suffix .c) main.c that contains the entry point of the example application. It calls the function func that is implemented in the C source file func.c, see Listing 2.3. The C header file (suffix .h) func.h (Listing 2.2) contains the prototype of the function func. The included header stdio.h (see Listing 2.1, line 1) contains the prototype of the standard library function printf, which is called in line 8.

1 #include <stdio.h>
2 #include "func.h"
3
4 int main() {
5     int var0 = 1;
6     int var1 = 3;
7     var0 = func(var0, var1);
8     printf("var0: %i\n", var0);
9     return var0;
10 }
Listing 2.1: Source file main.c

1 extern int func(int, int);
Listing 2.2: Header file func.h

1 #include "func.h"
2
3 int func(int var0, int var1)
4 {
5     return (var0 - var1);
6 }
Listing 2.3: Source file func.c

Details about the relation between C source files and header files are omitted. Simply put, headers are used to share information between source files; in this case, it is the prototype of function func. The very first step of the build process, i.e., the processing of include directives and macros in the source code by means of a pre-processor, is also omitted, because it does not actually perform compilation operations. [110, Ch. 4] gives an overview. As a side note, the standard library function printf is defined in Linux' C API. That is, if the C API is provided for another ABI, the example application

can be built easily on that other platform. The following commands, which utilize the compilation driver gcc of the GNU tools, generate assembly code from the source code of both C source files:

$ gcc -S -o main.s main.c
$ gcc -S -o func.s func.c

The option "-S" instructs gcc to run the pre-processor and the C language compiler. The compiler takes the pre-processed source files (suffix .i) as input and generates assembly language code files (suffix .s, see Figure 2.2b) as output. Assembly code is the representation of native code in human-readable mnemonics, i.e., the sequence of CPU instructions that will be executed when the program runs. For example, the assembly code of function func:

1 $ cat func.s
2     .file   "func.c"
3     .text
4     .globl  func
5     .type   func, @function
6 func:
7     pushl   %ebp
8     movl    %esp, %ebp
9     movl    12(%ebp), %eax
10    movl    8(%ebp), %edx
11    movl    %edx, %ecx
12    subl    %eax, %ecx
13    movl    %ecx, %eax
14    popl    %ebp
15    ret
Listing 2.4: Assembly code of C source file func.c (Listing 2.3)

The code in Listing 2.4 loads the function's arguments from the stack (lines 9 and 10) and performs the subtraction operation (line 12). This example shows assembly code for the IA-32 ISA. The code looks different depending on the ISA and ABI, respectively, it is generated for. As a side note, a compiler does not need to generate code that is suitable for the hardware it runs on. If it produces native code for a different architecture, it is called a cross-compiler. Cross-compilers are commonly used in embedded systems, where an application is built on a host or development system rather than on the target itself, which might be quite resource-constrained [18, Ch. 4]. The assembly code is input for another tool that translates it to object code – the assembler. The following commands generate object files (files with suffix .o) from the assembly code:

$ gcc -c -o main.o main.s
$ gcc -c -o func.o func.s

The option "-c" instructs gcc to stop after the assembler has run. In the example, it takes the assembly language files as input and generates object files. The step illustrated by Figure 2.2c is reached. [70, Ch. 3] describes the contents of an object file and discusses several formats. Details about object files in general, and ELF object files in particular, are not given here. For ease of understanding, the following summary of object file content is taken from [70, Ch. 3]:

• Header information: overall information about the file, such as the size of the code, name of the source file it was translated from, and creation date.


• Object code: Binary instructions and data generated by a compiler or assembler.
• Relocation: A list of the places in the object code that have to be fixed up when the linker changes the addresses of the object code.
• Symbols: Global symbols defined in this module, symbols to be imported from other modules or defined by the linker.
• Debugging information: Other information about the object code not needed for linking but of use to a debugger. This includes source file and line number information, local symbols, and descriptions of data structures used by the object code such as C structure definitions.

The command file shows specific information about the object file main.o of the example:

$ file main.o
main.o: ELF 32-bit LSB relocatable, Intel 80386, version 1 (SYSV), not stripped

It is a "relocatable" ELF file for the IA-32 architecture. Relocatable means that the file contains information to eliminate undefined symbols, e.g., of functions or variables that are defined in another object file, and references5, e.g., to data within another segment of the same object file. There are two further important variants of ELF files: executable and shareable. They are described later. In simplified terms, an ELF file contains several sections, which are defined by section headers. In the case of the file main.o, the section header table looks like this (excerpt):

1 $ readelf -S main.o
2 There are 12 section headers, starting at offset 0x46c:
3
4 Section Headers:
5   [Nr] Name       Type      Addr     Off    Size   ES Flg Lk Inf Al
6   [ 0]            NULL      00000000 000000 000000 00      0   0  0
7   [ 1] .text      PROGBITS  00000000 000034 00004c 00  AX  0   0  4
8   [ 2] .rel.text  REL       00000000 0009b0 000018 08     23   1  4
9   [ 3] .data      PROGBITS  00000000 000080 000000 00  WA  0   0  4
10  [ 4] .bss       NOBITS    00000000 000080 000000 00  WA  0   0  4
11 [...]
Listing 2.5: Section headers of ELF object file main.o

Section .text (line 7 in Listing 2.5) stores the native (binary) code. It has an offset of 52 (hex 0x34) bytes from the beginning of the object file and a size of 76 bytes. Its address is not yet determined. Section .data (line 9 in Listing 2.5) stores initialized global data and .bss (line 10 in Listing 2.5) stores uninitialized global data. main.o has undefined symbols (of the functions) func and printf, because they are used but not defined in the object file:

$ nm -u main.o
         U func
         U printf

Section .rel.text is of special interest, because it contains information on positions that need relocation, see Listing 2.6. An entry in this section marks an offset in the .text section where a symbol or reference – to code or data – has to be assigned a real address at link or load time. Details about ELF relocation entries can be found in [29].

5A reference could be treated as a pseudo-symbol with the address of the segment that contains the reference's target as base address [70, Ch. 7], so that there is only one type to be relocated.


1 $ readelf --relocs main.o
2
3 Relocation section '.rel.text' at offset 0x9b0 contains 3 entries:
4  Offset     Info    Type          Sym.Value  Sym. Name
5 00000029  00001202 R_386_PC32    00000000   func
6 00000032  00000801 R_386_32      00000000   .rodata
7 00000042  00001302 R_386_PC32    00000000   printf
Listing 2.6: Relocation entries of section .text in ELF object file main.o

The following excerpt from the disassembly of object file main.o shows the offsets the relocation entries refer to:

1 $ objdump -d -M intel main.o
2 [...]
3 00000000 <main>:
4    0: push   ebp
5 [...]
6   28: call   29 <main+0x29>
7   2d: mov    DWORD PTR [esp+0x18], eax
8   31: mov    eax, 0x0
9 [...]
10  41: call   42 <main+0x42>
11 [...]
Listing 2.7: Disassembly of ELF object file main.o

The relocation entry for offset 29 (line 5 of Listing 2.6) refers to line 6 of Listing 2.7. It handles the symbol func, which corresponds to the function func (see Listing 2.1, line 7). The entry for offset 32 (line 6 of Listing 2.6) with symbol name .rodata refers to line 8 of Listing 2.7. This refers to the ELF section of that name rather than a distinct symbol. The section contains the string that is the argument of the call to printf (see Listing 2.1, line 8), whose relocation entry is at offset 42 (line 7 of Listing 2.6 and line 10 of Listing 2.7). The placeholders for the addresses of the not yet relocated symbols have special values, depending on the type of the relocation. In the example, the call targets refer to the position after the call instruction, while the placeholder for the "symbol" .rodata actually means the offset within the corresponding ELF section. In the next build steps, these undefined references are resolved.

2.3.3 Linking

Definition

This section describes the operations that are performed to generate an executable file (Figure 2.2d) from object code (Figure 2.2c). Due to the sometimes confusing use of the terms "statically linked" and "dynamically linked" in the literature, their use in this work is explained first. In this work, a "statically linked" executable means an executable that physically contains, i.e., does not just reference, all the machine code that it requires for execution, except for code of operating system services6. At imminent start of execution, the executable is comprised of one or more components. The decision, which components

6In a quite static system like an embedded system, even these might be "statically linked" as they never change.

are used, is made at link time. Any modification of a component after build changes neither the executable itself nor its behavior. The executable is "locked". In this work, a "dynamically linked" executable means an executable that does not physically contain all the code it requires for execution. The executable is comprised of one or more components. The decision, which components are used, is delayed until load time or run time. The executable rather references the components it is built from. Its behavior might change after build if these components are modified, while the executable itself is not necessarily modified. The definitions given here refer to the decisions made at link time of an executable, i.e., to the build process point of view. They do not refer to the programmed behavior, i.e., its semantics. That is, a statically linked executable can have highly dynamic behavior. For example, it might bring additional code to execution or it might modify its own code to change its semantics.

A second term to be discussed is "library" and its classification into "static", "dynamic" and "shared", which can be found in the literature. In this work, a software library is an aggregation of object code or object code files. A detailed definition of the notion of libraries is omitted. For the understanding of this work, the kind of object code and how it is used is more important than a classification of the libraries that store it. So, a classification of libraries is not given here. Instead, the kind of object code and how it is used are described. If there is a common term for a certain combination, it is mentioned.

Static Linking

A so-called linker takes one or more object files as input and "merges" them into an executable file. In the example, an executable file named main-static.out is built by the following command, including the object files main.o and func.o:

$ gcc -static -o main-static.out main.o func.o

The following outputs show that the ELF executable does not depend on other components and that it contains no relocation information:

$ file main-static.out
main-static.out: ELF 32-bit LSB executable, Intel 80386, version 1
(SYSV), statically linked, for GNU/Linux 2.6.4, not stripped

$ ldd main-static.out
        not a dynamic executable

$ readelf -r main-static.out

There are no relocations in this file.

The linker performs numerous actions. In order to reduce the description to the essential, only the actions that are crucial for understanding are described. The linker combines similar sections of the relocatable object files into a corresponding section that includes them all. For example, main-static.out's section .text contains (amongst others) the machine code of the functions main, func and printf. The code is taken from the .text sections of

the appropriate object files, whereby only main.o and func.o are given explicitly to the linker. The same applies to the other sections. The linker has to look up unresolved symbols (see Listing 2.6) and insert the (object) code of the symbols, e.g., that of function printf. The lookup process is not discussed here. The machine code of function printf is stored in a container of object code files, which typically contains relocatable code. Such a container is often referred to as a "static library" in the literature and in common usage. The object code of that type of library is copied into a statically linked executable, as in the example. The following excerpt from the disassembly of section .text shows that main-static.out includes the machine code of the three functions:

1 $ objdump -d -M intel main-static.out
2 [...]
3 Disassembly of section .text:
4
5 08048180 <_start>:
6  8048180: xor    ebp, ebp
7  8048182: pop    esi
8 [...]
9 08048278 <main>:
10 [...]
11  80482a0: call   80482c4 <func>
12  80482a5: mov    DWORD PTR [esp+0x18], eax
13  80482a9: mov    eax, 0x80a8bc8
14 [...]
15  80482b9: call   8048d10 <_IO_printf>
16 [...]
17 080482c4 <func>:
18 [...]
19 08048d10 <_IO_printf>:
20 [...]
Listing 2.8: Disassembly of ELF executable file main-static.out (excerpt)

In the example, the linker also adds startup code to the executable (line 5 of Listing 2.8). The startup code prepares the program for execution, e.g., by initializing the stack [18, Ch. 4], and it starts the program's execution. After merging the sections, their sizes and relative offsets are known to the linker. This is essential for the next step: relocation [70, Ch. 7]. Relocation means that the machine code is modified by assigning absolute or relative addresses to undefined symbols and sections. In this example, the function main in the .text section of object file main.o is relocated from address 0 (line 3 of Listing 2.7) to address 0x8048278 (Listing 2.8, line 9). The code of function func is relocated to address 0x80482c4 (Listing 2.8, line 17), the string to address 0x80a8bc8 (Listing 2.8, line 13) and printf to 0x8048d10 (Listing 2.8, line 19). That is, the relocations of object file main.o (Listing 2.6) are processed. The object file func.o could also be regarded as a static library, which is used to look up the symbol func. If the GNU tool ar were used to create an archive (file suffix .a), the result would be denoted as a "static library". After linking, the entry point is not the address of function main, it is the startup code (compare to Listing 2.8, line 5):

$ readelf -h main-static.out
[...]
  Entry point address: 0x8048180
[...]

The statically linked executable main-static.out has the same view on the machine as the "user process" in Figure 2.1 – the ABI. The executable in the example is built for

a specific ABI, which is given by the Linux operating system, the ELF for object and executable files and the IA-32 ISA. In the example, the linker relocates the executable to an absolute memory address at which the program is loaded into main memory when it is going to be executed. This is not a problem, because the executable is linked for a system that provides virtual addresses and paging. So, each process has its own "clean" address space and does not need to share it with other processes. Even in systems with virtual memory addresses, relocation at load time can be beneficial. While a predefined load address (as in the example) results in a short startup time, load time relocation increases startup time but is more flexible. Virtual addresses and paging might not be available in an embedded system. There, the relocatable code is often statically assigned to physical addresses by means of a locator, while considering the address space of other applications on the system. This is possible due to the static nature of an embedded system, where a defined set of applications is deployed once. In case of the GNU compilation driver gcc, which includes the linker program ld, it is possible to control the location process by special linker scripts that contain appropriate instructions for the linker [18, Ch. 4]. In a more dynamic system without virtual memory, the linker might link and relocate the executable to a nominal load address like zero and store relocation information. At load time, the executable is relocated referring to an actual load address, which depends on the memory layout and the available addresses. The operating system MS-DOS works this way [70, Ch. 7]. Relocation does not violate the definition of "statically linked" given here. It changes the position of the code but not its components. Also the concept of overlays [70, Ch. 8] is compatible with the definition of static linking used here.
The overlay concept defers, until run time, the decision which parts of the code are loaded into memory, but the code itself is not modified after build.

In the example here, the executable main-static.out is statically linked from relocatable object code. It is also possible to statically link relocated code, as long as there are no conflicts with other relocated code in the program or the system. This saves link time, because the relocation of that portion of code has to be done only once. However, if the object code is intended to be used in a number of statically linked executables, it is likely to come into conflict with relocated addresses, especially when several relocated object code files are involved, even in a system with virtual memory addresses.

Dynamic Linking

In the example from Listing 2.8, the executable is statically linked. That is, all machine code is included in the executable. This approach has the disadvantage that each executable which uses, for example, the function printf includes a copy of printf's code. In order to save (persistent) memory, it is possible to store only a reference to the object code of printf. That is what dynamic linking actually does. Dynamic linking defers the "merging" of the program until load time or even run time. Before showing an example,

several combinations of dynamic linking and kinds of linked object code are considered that are commonly not used. Theoretically, it is possible that an executable dynamically links to relocated object code. References in the executable could be adapted to the object code's load address. At load time, the object code has to be loaded at the specified address. The executable has to store the information that the object code has to be loaded, too. The benefit over a statically linked executable is that the object code might be updated, if this does not break the references between them. It is also possible to use relocated object code with a number of executables, if there are no address conflicts. Otherwise, different relocated versions would have to be available. Another possibility is to dynamically link an executable with relocatable object code. This provides more flexibility compared to the use of relocated object code, because the object code can be loaded at an arbitrary address. Only one copy has to be stored persistently. This approach increases startup time, because relocations have to be performed at load time. If a number of executables use this object code, it is likely that it is relocated to different load addresses. During relocation, it is likely that absolute addresses are assigned to references in the machine code. So, the executables cannot use just one copy of the object code's executable code section in main memory – even if there are virtual memory addresses – if the object code is relocated to different addresses.

To optimize flexibility and memory efficiency, it is common to use libraries that can be "shared" and to link them dynamically. In this context, "shared" means that there is only one copy of the object code's code section in main memory, even if a number of processes use it. In the example, the following commands build a sharable variant of the object code file that contains the code of function func:

$ gcc -shared -fPIC -o func-pic.so func.c
$ file func-pic.so
func-pic.so: ELF 32-bit LSB shared object, Intel 80386, version 1
(SYSV), dynamically linked, not stripped

The options -fPIC and -shared instruct the compilation driver gcc to build the object code file func-pic.so with "position-independent code" (PIC) [70, Ch. 8]. Now, it is commonly called a "shared library", the third flavor of ELF files considered here. The code is separated from references that need relocation. The separation is realized by an additional indirection by means of tables. ELF files that contain PIC have a so-called Global Offset Table (GOT) [70, Ch. 8] for global data references. There is one GOT entry per global data reference. The GOT resides in the .data section, which is writable and not shared by processes. At link time, the sizes of all sections – and so the relative address of the .data section, and with it the offset of the GOT – are known. References do not depend on the actual load address. That is, the .text section does not have to be modified, regardless of the load address. Listing 2.9 shows how an executable that dynamically links a shared library is built, and its dependencies on shared libraries. Line 10 shows the dependency on the shared library func-pic.so. libc.so.6 (line 11) is the standard C library that contains the code of function printf, amongst others.


1 $ gcc -o main-dyn.out main.o func-pic.so
2
3 $ file main-dyn.out
4 main-dyn.out: ELF 32-bit LSB executable, Intel 80386, version 1
5 (SYSV), dynamically linked (uses shared libs), for GNU/Linux 2.6.4,
6 not stripped
7
8 $ ldd main-dyn.out
9     linux-gate.so.1 => (0xffffe000)
10    func-pic.so => ./func-pic.so (0xb77aa000)
11    libc.so.6 => /lib/libc.so.6 (0xb7630000)
12    /lib/ld-linux.so.2 (0xb774c000)
Listing 2.9: Building the ELF executable file main-dyn.out and listing its dependencies

main-dyn.out's other dependencies refer to the system call gateway of Linux (linux-gate.so.1, line 9) and to the Linux dynamic loader (/lib/ld-linux.so.2, line 12). The standard C library is loaded into the address space of the program at the address 0xb7630000. The predefinition of the shared library's address is an optimization to save startup time. The command file confirms that it is an ELF shared library:

$ ls -l /lib/libc.so.6
lrwxrwxrwx 1 root root 14 29. Apr 2010 /lib/libc.so.6 -> libc-2.10.1.so

$ file /lib/libc-2.10.1.so
/lib/libc-2.10.1.so: ELF 32-bit LSB shared object, Intel 80386, version 1
(SYSV), dynamically linked (uses shared libs), for GNU/Linux 2.6.4, stripped

Function calls are also handled via an indirection: "To support dynamic linking, each ELF shared library and each executable that uses shared libraries has a Procedure Linkage Table (PLT). The PLT adds a level of indirection for function calls analogous to that provided by the GOT for data" [70, Ch. 10]. The PLT resides in the .text section and each entry refers to a GOT entry. So, only the GOT entry is modified when the symbol is resolved. For this purpose, relocation information is stored:

1 $ readelf -r main-dyn.out
2 [...]
3 Relocation section '.rel.plt' at offset 0x3dc contains 4 entries:
4  Offset     Info    Type              Sym.Value  Sym. Name
5 [...]
6 0804a008  00000407 R_386_JUMP_SLOT   00000000   func
7 0804a00c  00000507 R_386_JUMP_SLOT   00000000   printf
Listing 2.10: Relocation entries of ELF executable main-dyn.out

Listing 2.11 shows an excerpt of the disassembly of the ELF executable main-dyn.out. The call to function func (Listing 2.11, line 17) goes through a PLT entry at address 0x804845c (Listing 2.11, line 6). This PLT entry contains a jump to the address stored at address 0x804a008. This address lies inside main-dyn.out's special section .got.plt of the GOT, which ranges from 0x8049ff4 to 0x804a00f (Listing 2.12, line 8) and which handles PLT calls. The entry at address 0x804a008 is subject to relocation, see line 6 of Listing 2.10. Similarly, the call to function printf (Listing 2.11, line 21) goes through the PLT entry at address 0x804846c (Listing 2.11, line 9), which jumps to a GOT entry that has relocation information (Listing 2.10, line 7). In the example, it is also possible to statically link the object files main.o and func.o and dynamically link only the standard C library. When referring that scenario to the

example in Figure 2.2c, main.o and func.o are input to the linker, while "Library File 1" is not used in the example. The file main-dyn.out and the standard C library are input to the program loader, see Figure 2.2d.

1 $ objdump -D -M intel main-dyn.out
2 [...]
3 Disassembly of section .plt:
4 [...]
5 0804845c <func@plt>:
6  804845c: jmp    DWORD PTR ds:0x804a008
7 [...]
8 0804846c <printf@plt>:
9  804846c: jmp    DWORD PTR ds:0x804a00c
10 [...]
11 Disassembly of section .text:
12
13 08048480 <_start>:
14 [...]
15 08048534 <main>:
16 [...]
17  804855c: call   804845c <func@plt>
18  8048561: mov    DWORD PTR [esp+0x18], eax
19  8048565: mov    eax, 0x8048650
20 [...]
21  8048575: call   804846c <printf@plt>
22 [...]
Listing 2.11: Disassembly of ELF executable main-dyn.out (excerpt)

1 $ readelf -S main-dyn.out
2 There are 41 section headers, starting at offset 0x1b94:
3
4 Section Headers:
5   [Nr] Name      Type      Addr     Off    Size   ES Flg Lk Inf Al
6 [...]
7   [24] .got      PROGBITS  08049ff0 000ff0 000004 04  WA  0   0  4
8   [25] .got.plt  PROGBITS  08049ff4 000ff4 00001c 04  WA  0   0  4
9   [26] .data     PROGBITS  0804a010 001010 000008 00  WA  0   0  4
10 [...]
Listing 2.12: Sections of ELF executable main-dyn.out (excerpt)

Shared libraries allow code reuse and easy updating. That is, only one copy of the code, e.g., of the standard function printf, has to be stored persistently on the system. Code of shared libraries is loaded at program start or even later [70, Ch. 10]. Applications that use a library do not have to be re-built after the library is updated7. If the shared library contains PIC, its code can even be shared during execution. This avoids machine code redundancy. As a side note, it is also possible to generate ELF executables with PIC, so that the .text section of an ELF executable is shareable. This allows loading an executable at an arbitrary address and enables "Address Space Layout Randomization" (ASLR) [20], a technique that loads programs and libraries at randomized base addresses and can thereby make buffer overflow attacks more difficult. The PIC approach has advantages regarding security, flexibility and memory efficiency. However, PIC has a performance drawback that also applies to PIC shared libraries: an external reference always goes through an indirection, the GOT or PLT. On IA-32, this is a severe problem. [70, Ch. 8] states: "Since the x86 does have direct addressing, a reference to external data that would be a simple MOV or ADD instruction in non-PIC code turns into a load of the address followed by the MOV or ADD, which both adds an extra memory reference and uses yet another precious register for the temporary pointer".

7This can also turn into a disadvantage if a program does not behave as expected after the update.


2.3.4 Loading

This section describes the operations that are performed to bring an executable file (Figure 2.2d) to execution (Figure 2.2e). [70, Ch. 8] describes the basic steps that are taken when an executable without relocation information, e.g., a "statically linked" executable, is loaded:

• Read enough header information from the object file to find out how much address space is needed.
• Allocate that address space, in separate segments if the object format has separate segments.
• Read the program into the segments in the address space.
• Zero out any bss space at the end of the program if the virtual memory system doesn't do so automatically.
• Create a stack segment if the architecture needs one.
• Set up any runtime information such as program arguments or environment variables.
• Start the program.

The allocation of memory and the relocation of symbols and references are crucial from a real-time point of view. When loading a statically linked executable, it has to be ensured that the whole process image, i.e., the code, data, stack and the process control block [113, p. 166], is loaded into physical main memory. This can be a problem if the operating system maps memory pages to physical main memory lazily. The native code itself was already allocated during compilation and linking. As a statically linked executable does not contain unresolved symbols and references, it is preferable from a real-time point of view, although it forgoes the benefits of dynamic linking like code sharing, memory efficiency and easy updating.

This section also describes the details of lazy binding during the execution of a dynamically linked ELF executable. In general, a "lazy" approach can be critical for real-time systems, see section 1.2. However, understanding the lazy binding mechanism in ELF executables is essential for this work. The loading process is described in detail in [70, Ch. 10]. Figure 2.3 illustrates the sequence of events when the dynamically linked function func is called from within the executable main-dyn.out from section 2.3.3. Recall that calls to functions that reside in dynamically linked libraries go through the PLT and GOT, see Listings 2.10, 2.11 and 2.12. Figure 2.3 shows excerpts from the execution of the dynamically linked executable under observation by a debugger. The call instruction to func does not point to the callee's code, because the callee has not been loaded yet, according to the lazy binding principle. Instead, the code of a PLT entry at address 0x804845c is called, see step 1 in Figure 2.3 and compare line 17 of Listing 2.11. Its first instruction is a jump to an address that is stored in the

GOT entry that is related to the PLT entry (step 2 in Figure 2.3). That is, each PLT entry uses its own GOT entry; the PLT entry of the function printf (at address 0x804846c), for instance, uses a different one. Before the very first call to func, its GOT entry points to the second instruction of its PLT entry, at address 0x8048462, see step 3 in Figure 2.3. This instruction pushes a special value, which depends on the PLT entry, onto the stack. Afterwards, the program flow reaches the very first entry of the PLT (step 4 in Figure 2.3). It branches to the dynamic linker (step 5 in Figure 2.3), an operating system service, which looks up the address of the callee; in the example, this is address 0xb7fdc42c. After the lookup, the dynamic linker branches to the determined address and the callee is executed (step 7). The lookup process is not discussed here, because only its result is important. However, before branching to func, the dynamic linker modifies the entry in the GOT, see step 6 in Figure 2.3. The value at address 0x804a008, which is marked as to be relocated (see Listing 2.10), now stores the actual address of func. After execution, the callee returns to the caller (step 8). The crucial point is the modification of the GOT: subsequent calls to func branch directly to the callee's code in step 3 of Figure 2.3. That is, the dynamic lookup introduces a one-time overhead, which is not suitable for real-time systems. The same mechanism applies to the call of function printf; its first call has the overhead of symbol lookup.

[Figure 2.3 shows annotated disassembly excerpts: the call sites in main-dyn.out's .text section, the PLT entries at 0x804842c, 0x804845c and 0x804846c, the .got.plt entry at 0x804a008 before (0x8048462) and after (0xb7fdc42c) binding, the dynamic linker, and the code of func in the shared object func-pic.so at 0xb7fdc42c; the steps 1 to 8 referenced in the text mark the control flow.]

Figure 2.3: Calling a dynamically linked function in an ELF executable

Linux and other operating systems provide mechanisms to control the dynamic linker's behavior. In the case of Linux, if the environment variable "LD_BIND_NOW" is set, symbol resolution is done at startup. Similarly, there are linker options that let the dynamic linker initialize the PLT and GOT at startup. While this increases startup time, the application still benefits from dynamically linked object code. Another possibility is to use the operating system's programming interface, if it provides one, to load and link libraries dynamically. This even allows loading libraries that did not exist at program build time. It is obvious that loading and relocation operations at runtime can cause temporal non-determinism and have to be avoided if real-time suitable execution times are needed. However, a programmer who uses these features explicitly is at least aware of performing time-critical operations.

2.4 Virtualizing Runtime Environments

2.4.1 Introduction

This section describes the fundamentals of the concept of virtualizing runtime environments, i.e., VMs. In doing so, the services that are provided by a VM and the points that are critical for real-time systems are identified. First, the term virtualization is defined. [109, Ch. 1] states:

"Formally, virtualization involves the construction of an isomorphism that maps a virtual guest system to a real host. This isomorphism [...] maps the guest state to the host state [...], and for a sequence of operations, e, that modifies the state in the guest, there is a corresponding sequence of operations e′ in the host that performs an equivalent modification to the host's state."

That is, virtualization means mapping the state of a virtual system (guest in the following) to the state of a real system (host in the following) and vice versa. Second, any (sequence of) operation(s) on the guest that would change the guest's state has to be converted to a (sequence of) operation(s) on the host that changes the host's state so that the host's new state corresponds to the guest's new state. The host performs operations on behalf of the guest. This might be realized on a different level of abstraction. Section 2.3.1 and Figure 2.1 describe a machine from a single process's point of view. From the view of an operating system, the machine is represented by the ISA. A machine is virtualized by "adding a layer of software to a real machine to support the desired virtual machine's architecture" [109, Ch. 1]. From a process's point of view, the ABI of the guest has to be virtualized. Such virtual machines are called "process virtual machines" in [109]. From the (operating) system's point of view, the ISA has to be virtualized, and the corresponding virtual machines are referred to as "system virtual machines". The components of a process virtual machine are colored grey in Figure 2.4.

[Figure 2.4 shows the layered structure: guest applications run on the guest-ISA provided by the virtualizing runtime environment, which runs, like native applications, on the ABI (system calls and user ISA) provided by the operating system and the hardware; the components of the process virtual machine are colored grey.]

Figure 2.4: Process Virtual Machine

[109, Ch. 3] describes process virtual machines in general. A fictitious example of a process virtual machine is one that allows running a guest program, built for Windows and IA-32, on a Linux/ARM host. The VM not only has to map the IA-32 ISA to the ARM ISA, it also has to emulate the guest operating system's state and behavior. This thesis considers a more specialized type of process virtual machine, the so-called "high-level language virtual machine" [109, Ch. 5] (HLL VM), which is specialized to run user programs. HLL VMs provide a synthetic guest-ISA rather than emulating a real ISA/OS combination. An advantage over general process VMs is that the guest-ISA can be designed to be easy to virtualize. One possibility is to keep the guest-ISA on an abstract level that does not rely on hardware-specific features of the host, like address width, memory size, number of registers, etc. Popular implementations of HLL VMs are VMs according to the Java Virtual Machine Specification [73] and to the ECMA-335 standard [4]. There exist further HLL VMs for programming languages like Pascal, Python, Smalltalk, Lua, Ruby, etc. In the conventional application workflow, an executable program is tied to a specific ABI. It is deployed as native code and runs only on systems that provide the same ABI, see Figure 2.5a. Thereby, the compiler might translate the source code to an intermediate representation before generating native code. For example, gcc's front-end produces a "Register Transfer Language" (RTL), from which the assembly code is generated. This modular structure enables easy integration of new source languages and targets. An application that targets a HLL VM is deployed in code of the HLL VM's guest-ISA, which is referred to as intermediate code in the following, see Figure 2.5b. The intermediate code is loaded by the VM and transformed into an executable form that

matches the host-ISA or host-ABI, e.g., by means of JIT compilation. That is, the intermediate code runs (in theory) on every system the HLL VM runs on. Only the HLL VM has to be adapted to a specific ISA/OS combination. As the intermediate code is host-independent, it typically contains additional information, so-called metadata, about the data used in the program. The intermediate code is "aware" of the virtualization, and the metadata can facilitate it. In a general process VM, the application is deployed in the same format and at the same level as shown in Figure 2.5a. Such application code might contain less detailed metadata, because it is not aware of the virtualization and all information is encoded in the native code. In the remainder of this thesis, the term VM replaces HLL VM. If another kind of virtual machine is meant, it is noted explicitly.

[Figure 2.5 contrasts both workflows: (a) a conventional compiler translates high-level language code, possibly via an intermediate representation, into native code (host-ISA), which the OS program loader brings into a memory image; (b) in a HLL VM, a compiler front-end produces intermediate code (guest-ISA) for deployment, which the VM code loader and VM code generator turn into native code (host-ISA) in a virtual memory image.]

Figure 2.5: Compiler and Loader in conventional and HLL VM environment (derived from Figure 5.1 in [109, Ch. 5])

2.4.2 Architecture and Services of a VM

Drawing a parallel between a VM and a conventional execution system from an application's point of view, as in Figure 2.1, a VM's guest-ISA combines the OS and the machine-specific ISA, i.e., the ABI. The guest-ISA represents the virtual "machine" and its hardware-specific features like the instruction set, registers (if there are any), the memory model, etc. A more detailed description is given later with examples. The guest-ISA's instructions can be more abstract than those of a conventional ISA. They can provide functionality on a higher level of abstraction, e.g., by supporting direct handling of objects as in object-oriented high-level programming languages. Although a VM and its guest-ISA underlie a specific model that describes the "machine", the memory and the execution mechanism, the VM's implementation does not need to reflect this model. This can be the case when the guest-ISA is mapped directly to the host-ISA/OS without emulating the virtual "machine". Although there is no general VM architecture, because it highly depends on the implementation, Figure 2.6 shows the abstract components a VM typically comprises. [44, Ch. 2] describes a similar basic architecture.

[Figure 2.6 shows the application and the standard library on top of the HLL VM, which comprises the code/class loader and the execution engine with native code generator, native code cache, native call interface and exception manager, alongside the memory manager and the threading manager, all running on the host's ABI.]

Figure 2.6: Abstract Architecture of a High-level Language Virtual Machine

Code Loader At application startup and whenever a method's or class's code is needed, the code loader loads, and might also verify, intermediate code. Its tasks are similar to those of a dynamic linker in a conventional execution system, as described in Section 2.3.

Execution Engine The execution engine is the heart of the VM. It is responsible for the execution of the intermediate code and has to map the guest-ISA to the host-ABI. Interpretation, JIT compilation and the use of AOT compiled code are the prevailing techniques. In this thesis, the native code generation component is referred to as the code generator. The execution engine might also include a native code cache that buffers native code generated for the host, e.g., by means of JIT compilation. The execution engine interacts with the code loader to load intermediate code on demand, especially when lazy translation is applied. It usually provides a native call interface to call native code in general or host OS services in particular. Traps and interrupts can occur during the execution of the guest application; an exception manager handles these events and redirects the program flow to the appropriate exception handler.


Memory Manager The memory manager provides automatic management of the memory that is used from within and by the application during execution, e.g., by allocating memory for new objects; this memory is called application memory in the following. The memory manager handles the allocation of new application memory and the reclaiming of application memory that is no longer in use. The reclaiming part is usually referred to as the Garbage Collector (GC). This can increase the robustness of an application, because it can avoid memory leaks. The GC does not manage the memory that is used by the VM itself (VM memory), e.g., the memory of virtual tables for dispatching method calls in an inheritance hierarchy or of VM internal data structures describing loaded intermediate code.

Threading Manager Many HLLs provide the concept of threads. A VM has to handle independent program flows, e.g., by mapping them to OS threads or by implementing its own model. This also includes the primitives for synchronization and communication as well as scheduling functionality.

Standard Library VMs are often shipped with a standard library8. It provides the majority of the features of a language that abstract from a certain OS/ISA, for example, opening and writing a file. The standard library is not a component of the VM as such, because if it is available as intermediate code, it is just input to the VM. If the intermediate code and the class library are standardized, it is (theoretically) possible to use a library of a third-party vendor. However, library routines of some implementations might include native code that is called through a native call interface, or some calls are implemented to go into the VM. That is, the library might be tightly coupled to a specific VM implementation.

All these components might be interwoven in a specific implementation; a strict assignment of the provided services to components cannot be given. The security mechanisms of a VM are neglected in this section, because they are not important for the understanding of this work. It is essential, however, that the code loader, the code generator and the native code cache are involved in allocating the resource code and the memory that stores this code in the guest application's (virtual) memory image.

2.4.3 Workflow

Introduction

This section describes the typical workflow from the HLL source code of an application to its execution by means of a VM, see Figure 2.7. The VM used in the example is the CLI implementation Mono, and the example application is a re-implementation of the program from section 2.3.2, now written in the language C#. Listing 2.13 shows the C# source file (suffix .cs) main.cs, which contains the entry point of the example application. It calls the function func, which is implemented in the C# source file func.cs, see Listing 2.14.

8Sometimes referred to as core library.


[Figure 2.7 shows the workflow in four stages: (a) source files are translated by compilers into (b) intermediate code files, i.e., the executable, user libraries and the standard library; (c) the HLL VM's code/class loader and execution engine load this intermediate code and (d) turn it into native code in main memory at execution time.]

Figure 2.7: Workflow of a program from source code to execution by a HLL VM

 1 using System;
 2 class Hello {
 3     static int Main() {
 4         int var0 = 1;
 5         int var1 = 3;
 6         var0 = function.func(var0, var1);
 7         System.Console.WriteLine("var0: " + var0.ToString());
 8         return var0;
 9     }
10 }
Listing 2.13: Source file main.cs

1 public class function {
2     public static int func(int var0, int var1) {
3         return (var0 - var1);
4     }
5 }
Listing 2.14: Source file func.cs

Compilation

The HLL's source code is compiled into intermediate code by a source language compiler at the beginning of the workflow, see Figure 2.7a. Source code that is compiled into intermediate code is denoted as managed code in this work. In contrast, source code that is compiled into native code, like the example in section 2.3, is denoted as unmanaged code, because its execution is not controlled by a VM. The following commands utilize the C# compiler mcs, which is itself shipped with Mono as intermediate code, to generate intermediate code from both C# source files:


$ mcs -target:library func.cs -out:func.dll
$ mcs main.cs -r:func.dll -out:main.exe

The resulting files are called assemblies in CLI terminology. An assembly is created from one or more source files. The intermediate code of function func is stored in the library func.dll, which corresponds to the "User Library" in Figure 2.7b. The purpose of such a library assembly is comparable to that of a dynamically linked library as described in section 2.3.3. Listing 2.15 shows an excerpt of the metadata and intermediate code of the executable assembly main.exe (compare "Executable" in Figure 2.7b) that was generated from main.cs. As noted, the intermediate code targets a CLI implementation. CLI specifies a "Virtual Execution System" (VES) that "is responsible for loading and running programs written for the CLI" [4]. It represents a stack machine. A detailed description of all listed opcodes and the (virtual) machine model is beyond the scope of this work; [4, Part. III] and [72] give an overview.

 1 $ monodis main.exe
 2 .assembly extern mscorlib
 3 [...]
 4 .assembly extern func
 5 [...]
 6 .class private auto ansi beforefieldinit Hello
 7     extends [mscorlib]System.Object {
 8 [...]
 9 .method private static hidebysig default int32 Main () cil managed {
10 [...]
11     .locals init (
12         int32 V_0,
13         int32 V_1)
14 [...]
15     IL_0004: ldloc.0
16     IL_0005: ldloc.1
17     IL_0006: call int32 class [func]function::func(int32, int32)
18 [...]
19     IL_001d: call void class [mscorlib]System.Console::WriteLine(string)
20     IL_0022: ldloc.0
21     IL_0023: ret
22 }
23 }
Listing 2.15: Metadata and Intermediate Code of file main.exe (Listing 2.13)

The listing shows only the information that is necessary for understanding. Lines 2 and 4 of Listing 2.15 are references to external library assemblies. These are the standard library mscorlib.dll, which is the core of the standard library shipped with Mono (compare Figure 2.7b), and the user library func.dll. As in the C# code of Listing 2.13, the intermediate code contains the definition of class Hello (line 6). The class's (or type's) standard constructor is omitted in the listing, because it is the standard case. The local variables var0 and var1 are declared in lines 11 to 13. Before the call to func (line 17), they are pushed onto the VES's stack. The call to WriteLine (line 19), which writes to the standard output, likewise references the assembly where the call target is defined.

Execution

The application's intermediate code is input to a VM, as shown in Figure 2.6 and Figure 2.7c. The VM itself is a program, dynamically or statically linked like any other application, which runs on top of the host's ISA/OS, see Figure 2.4. For this thesis, the crucial point is the translation of the intermediate code into native code by a code generator, which is part of the execution engine. The concept of interpretation, which is described in section 1.2, is not considered further due to its poor performance compared to solutions that generate native code. Code generation by means of JIT and AOT compilation is regarded in the following with the help of the example application and the CLI VM Mono. The application's execution is started by the following command:

$ mono main.exe

Mono's class and code loader loads the intermediate code of the entry-point function (Main). The execution engine with its code generator translates the intermediate code into native code and stores it in main memory (Figure 2.7d). Main memory has to be allocated for data, too. This includes constants and strings that are used by the application (application memory) as well as VM internal data structures like virtual tables or data structures representing the loaded intermediate code (VM memory). In contrast to the memory that is allocated by the managed code, the VM memory has to be managed like in any other unmanaged application, thereby using OS services. Mono utilizes a JIT compiler, which generates native code for the host lazily. The host used in this example has an IA-32 ISA and runs the OS Linux. Listing 2.16 shows an excerpt of the native code that is generated from function Main's intermediate code (shown in excerpts in Listing 2.15), right before the execution of Main.

 1 0x010591e8: push %ebp
 2 [...]
 3 0x01059202: push $0x3
 4 0x01059204: push %eax
 5 0x01059205: call 0x1059268
 6 [...]
 7 0x01059233: push %eax
 8 0x01059234: call 0x1059244
 9 [...]
10 0x01059240: ret
Listing 2.16: Assembly code before execution (excerpt)

 1 0x010591e8: push %ebp
 2 [...]
 3 0x01059202: push $0x3
 4 0x01059204: push %eax
 5 0x01059205: call 0x1059328
 6 [...]
 7 0x01059233: push %eax
 8 0x01059234: call 0x105bdc0
 9 [...]
10 0x01059240: ret
Listing 2.17: Assembly code after execution (excerpt)

The translation process as such, and in particular the mapping from the stack-machine instructions of CLI's VES to the register-machine instructions of IA-32, is not detailed here. The important point is that the code generator allocates two resources: the executable native code and the memory where it can be executed. This is not application memory (see Section 2.4.2). It is rather VM memory, which has to be allocated like an OS program loader allocates memory for a program that is going to be executed (see Section 2.3.4). The native call instructions in lines 5 and 8 correspond to the intermediate code's calls in lines 17 and 19 of Listing 2.15. Listing 2.17 shows the same native code right after the execution of Main. The call instructions' targets have changed. Before the execution of Main started, the lazy JIT compiler had not generated code for func and WriteLine. So, the calls point to stubs before execution, and they point to the functions' native code after execution. The stubs refer to callback functions of the VM's code loader and code generator. Figure 2.7d shows a situation where all references are resolved. The one-time overhead of reference resolution and code generation upon the first call is not suitable for real-time systems.


In order to reduce the impact of JIT compilation overhead, some VMs generate the native code in advance and load it lazily, as described in Section 1.2. Mono has a built-in feature that uses the code generator as an AOT compiler. The intermediate code is pre-compiled ("ahead of time") and stored persistently. On the example's target machine, Mono uses a file format that is very similar to ELF. This facilitates examination of the pre-compiled code. Other VM implementations that have an AOT compiler might use another format. Listing 2.18 shows the native code of function Main that is generated by Mono's AOT compiler.

 1 $ objdump -D -M intel main.exe.so
 2 [...]
 3 Disassembly of section .text:
 4 [...]
 5 00001020 :
 6 [...]
 7  1047: push 0x3
 8  1049: push eax
 9  104a: call 2010
10 [...]
11  1082: push eax
12  1083: call 202b
13 [...]
14  1093: ret
Listing 2.18: Disassembly of AOT compiled code in ELF shared object main.exe.so (excerpt)

The calls to func (line 9) and to WriteLine (line 12) point to entries in the PLT. This preserves the possibility to load code lazily. The one-time overhead of resolving the references (the technique is described in Section 2.3.4) and of loading AOT compiled code via the PLT might introduce jitter, i.e., non-determinism, into execution times. This is not suitable for real-time systems. Furthermore, the loading of pre-compiled code might rely on host OS services that involve further "lazy" mechanisms, which can introduce additional jitter.

In summary, a HLL VM runs user programs that are deployed in a format highly specialized for execution by a VM. The focus of the VM principle is on portability and security. A VM provides an ISA to the guest programs, and it shares functionality with an operating system. In particular, the allocation of the resources that are necessary for execution rests with the VM, which in turn relies on the underlying host's ISA/OS.

2.4.4 Examples of High-Level Language Virtual Machines

This section gives a short introduction to two prominent examples of widespread general-purpose HLL VM platforms: Java and the Common Language Infrastructure (CLI). In this context, Java denotes a programming language and a runtime environment. A Java Runtime Environment (JRE) essentially consists of the Java Virtual Machine (JVM) and the standard library. The Java platform was first introduced in the 1990s. Its primary focus is platform independence according to the claim "Write once, run anywhere". A Java program is compiled to platform-independent intermediate code, which is called "bytecode". This bytecode is executed by the JVM. Bytecode is stored in so-called "class" files, which might also be combined into "JAR" files, which are similar to archives, for deployment. Early versions of JVMs typically used intermediate code interpretation for execution, while newer versions rely on a JIT compiler for native code generation. Originally, the Java programming language was the only programming language supported by the platform. Now, there are several programming languages that can be compiled to bytecode, e.g., Groovy, Scala, or JRuby [121, Ch. 1]. The main focus of the Java platform is on portability. There exist implementations for several operating systems like Windows, Linux and Mac OS, and there are JVMs that also run on bare metal. Java's reference implementation was first released by Sun Microsystems. Today, there exist JVM implementations from Oracle, IBM, Excelsior JET [80], JamaicaVM [107] and PERC [103], as well as open-source and academic projects like VMKit J3 [44].

The CLI is standardized by ECMA and ISO as ECMA-335 [4]. It was developed with support for various programming languages in mind, including C#, C++/CLI, J# and Visual Basic .NET. Languages that target the CLI have to adhere to certain rules, called the "Common Language Specification", in order to ensure interoperability between the languages. The source code of these languages is compiled to intermediate code, which is called the "Common Intermediate Language" (CIL). The CIL is stored in files called "assemblies". The CIL is executed by a "Virtual Execution System", which is the CLI equivalent of the JVM. A standard library is provided by means of a "Base Class Library". The most important CLI implementation is the rich-featured and mature Microsoft .NET Framework [79], which is available for Microsoft's Windows operating system. The CLI reference implementation Shared Source CLI [117], also known as "Rotor", was also published by Microsoft. The open-source project Mono [128] provides CLI support on Linux, Windows and Mac OS X; it also supports various hardware architectures. VMKit N3 [44] is an academic project.

The virtual machines of Java and the CLI have in common that they are stack-based, provide automatic memory management, and that their intermediate code reflects the object-oriented programming model. The intermediate code instructions differ in some cases. A technical comparison is omitted at this point, because native code generation in virtualizing runtime environments is a general problem that is independent of the specific intermediate code. Where there are dependencies on the intermediate code instructions, this is stated. [47, 105] compare the JVM to the CLI VES.

2.5 Summary

The chapter Fundamentals outlined the crucial points of real-time execution. It described the workflows from source code to a running program, both for conventionally built programs and for programs that are executed by a virtualizing runtime environment. Points that are crucial for real-time systems were identified, especially for the VM approach. The differences and commonalities of both variants regarding code generation, loading and execution were described as well.


3 State of the Art

3.1 Introduction

Based on the fundamental techniques introduced in chapter 2, this chapter presents research and development in the field of real-time suitable execution of intermediate code and of programming languages that are typically compiled to intermediate code. The focus is on appropriate code-generation techniques rather than on special application development techniques or language features, because the solution presented here should support legacy code, see development goal G.2 in section 1.3. Real-time VMs have been developed especially for, but are not limited to, Java environments. Hence, this chapter contains solutions from the Java area to a great extent. The evaluation of the approaches is done in section 4.1, because it is closely coupled to the development of a new concept in section 4.2.

3.2 Real-Time Specifications

3.2.1 Java Platform

The Java platform comes in several editions according to a specific application field [121, Ch. 1.2 & 1.3]. The main editions are the Java Enterprise Edition for server environments, the Java Standard Edition for desktop environments and the Java Micro Edition (JavaME) for resource-constrained devices. Each edition includes the JVM and a standard library with a collection of APIs suited to the needs of the target application field. Other variants are JavaFX for internet applications and Java Card for low-end embedded systems like Subscriber Identity Module (SIM) cards in mobile phones or chips on credit cards. Although JavaME targets embedded and resource-constrained devices rather than general-purpose application machines, it lacks real-time capabilities. JavaME defines so-called “configurations” that specify the minimum functionality of the Java platform targeting a class of devices with comparable hardware features [100]. Currently, there are two configurations: the Connected Limited Device Configuration (CLDC) [1] and the Connected Device Configuration (CDC) [2]. While the CDC requires a target with a 32-bit CPU and at least 2 MB memory, the CLDC requires a 16- or 32-bit CPU and 192 KB memory.

In 1999, the “National Institute of Standards and Technology” worked out the requirements for a real-time Java platform [27]. Several groups worked on it and the main

results are the Real-Time Core Extensions (RTCE) [111] and the Real-Time Specification for Java (RTSJ) [45]. [24] compares both efforts with Ada.

RTCE introduces new models for memory, tasks and asynchrony, exceptions and synchronization [101, p. 41 ff.]. RTCE never left draft status and there is no reference implementation. The virtual machine is split: an additional, special real-time capable “Core” runs side-by-side with a “Baseline” standard JVM [24, p. 3]. A new object hierarchy introduces Core objects. Core objects have a special API that introduces real-time programming features. Core objects are not handled by a GC. The new real-time task classes use preemptive priority-based scheduling. There is a new exception class hierarchy. RTCE defines new synchronization objects like, e.g., semaphores and mutexes. A priority ceiling protocol avoids priority inversion. A new timer class is introduced, too.

The RTSJ defines extensions of the Java standard. The JVM has to be augmented to support a new API for real-time programming. The RTSJ introduces a new memory model. The “Heap Memory” corresponds to the traditional GC-managed memory of Java. The “Immortal Memory” and the “Immortal Physical Memory” have the lifetime of the application. They are not subject to the GC and are visible to all threads. Immortal Physical Memory “allows objects to be allocated from a range of physical memory with particular attributes, determined by their memory type” [45]. “Scoped Memory” has a semantic similar to stack allocation. It is valid only within a certain scope, e.g., a thread. When the scope is entered, the memory is allocated. When the scope is left, the memory is freed. Scopes can be nested. Scoped memory comes in several variants with bounded or unbounded allocation time, and in special variants for accessing physical memory. The RTSJ introduces new real-time thread classes. A “NoHeapRealtimeThread” has a higher priority than other Java threads, so it cannot be interrupted by the GC, and it must not use GC-managed memory. This is restrictive, because Java is highly object-oriented and objects are normally placed in the GC-controlled heap. The “RealtimeThread” can be interrupted by the GC. The RTSJ allows specifying allocation rates and memory budgets for schedulable objects, e.g., some variants of a thread. A priority-based preemptive base scheduler is standard in each implementation, but other schedulers may be implemented. Implementations of the RTSJ must provide a variant of the synchronized keyword that avoids priority inversion. In order to synchronize real-time and normal threads, wait-free queues are specified. The RTSJ also specifies asynchronous event handling. The functions handling such events can be scheduled similar to a thread. In order to realize interruptible functions and threads, primitives for “Asynchronous Transfer of Control” are specified.
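The scoped-memory lifecycle described above (enter a scope, allocate into it, free everything in bulk on leave, with nesting) can be illustrated with a plain-Java sketch. The class below is purely hypothetical and only models the semantics; it is not the RTSJ `ScopedMemory` API, and real implementations manage raw memory regions rather than Java object lists.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical, simplified model of RTSJ-style scoped memory:
// objects allocated while a scope is active are released in bulk
// when the scope is left, without any garbage-collector involvement.
public class ScopedMemoryModel {
    private final Deque<List<Object>> scopes = new ArrayDeque<>();

    // Entering a scope reserves a fresh region for allocations.
    public void enter() { scopes.push(new ArrayList<>()); }

    // An allocation is tied to the innermost active scope.
    public Object allocate(Object obj) {
        scopes.peek().add(obj);
        return obj;
    }

    // Leaving a scope frees all objects allocated in it at once;
    // returns how many objects were released (for illustration).
    public int leave() {
        List<Object> freed = scopes.pop();
        int n = freed.size();
        freed.clear();
        return n;
    }

    public int depth() { return scopes.size(); }
}
```

The bulk release on `leave()` is what gives scoped memory its bounded, GC-free deallocation cost; nested scopes simply stack on top of each other.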
The RTSJ introduces the “HighResolutionTime” class for precise time representation. Classes for absolute and relative time are derived from it. A clock as well as one-shot and periodic timers are specified, too. There exist several implementations of an RTSJ-compliant JVM. The most common are TimeSys’ reference implementation [119], IBM WebSphere Real Time [16], Sun RTS [118] and the JamaicaVM [107]. The RTSJ is complex and not suitable for low-end devices, e.g., those targeted by JavaME/CLDC. [101, p. 43 ff.] discusses the RTSJ further.


Briefly, the RTSJ is closer to Java than to real-time. For RTCE, the reverse is true. Both specifications have in common that they do not specify code generation that is suitable for real-time systems. Real-time code generation and real-time programming features are orthogonal. Details of both specifications are omitted because this work is about real-time suitable code generation and not real-time programming features.

3.2.2 CLI Platform

Like the Java platform, the CLI comes in several flavors. The standard .NET framework does not target real-time systems. Microsoft’s .NET Compact Framework is comparable to JavaME, because it targets resource-constrained devices like personal digital assistants. Its intermediate code is JIT compiled, so real-time applications should use an interface to conventionally built applications [116]. Microsoft’s .NET Micro Framework [37] targets embedded devices and can be used without an operating system. Its standard library is reduced. Applications are deployed via CIL, which is interpreted. However, no variant targets real-time systems in particular. A specification like the RTSJ does not exist for CLI environments. [131] discusses the RTSJ with regard to .NET, but standardization and implementation are missing.

3.3 Hardware-supported Execution

The natural way to avoid the translation of intermediate code into native code is direct execution by hardware. That is, “building” the virtual machine. There exist several solutions – primarily in the Java environment – that execute the intermediate code in hardware. This is not a trivial task because the JVM and CLI’s CLR are quite abstract machines.

3.3.1 Java Platform

[50, Ch. 7] and [101] survey Java hardware solutions, from which the following explanations are derived. There are more solutions realized in a field-programmable gate array (FPGA) than solutions committed to silicon. Most solutions cover only JavaME, configured with the CLDC, or only a subset of that. The quite complex RTSJ is not fully supported by Java hardware. In general, complex instructions are not executed directly; they are emulated in microcode instead.

The Java hardware can be classified into co-processors and processors. A co-processor supports a main CPU by decoding (simple) Java bytecode instructions to the main CPU’s ISA or even executing them. Complex instructions are emulated by the main CPU. JIFFY [11] is a Java JIT compiler realized in an FPGA to support a target CPU, e.g., in an embedded system where JIT compilation is quite costly. It was a research project at Technische Universität München (TUM). ARM’s Jazelle [14] is an extension of ARM


CPUs, which executes simple bytecode instructions directly. The ARM CPU is set to a special state by a certain instruction, so that it executes Java bytecode. [50, Ch. 7] also lists Nazomi’s JA 108 [82], which fetches Java bytecode instructions from memory and translates them into native code. Native instructions are passed through unchanged.

A Java processor executes Java bytecode only. That is, all parts of a system have to be compiled to bytecode if the Java processor is standalone. While simple instructions are executed directly, complex ones are emulated in most solutions. picoJava [86] is a comparatively complex Java processor introduced in 1997 in association with Sun Microelectronics. It implements instruction folding. That is, certain instruction sequences are executed within one cycle. picoJava supports more than 300 instructions. Performance-critical instructions are implemented in microcode, whereas it traps on the remaining instructions for emulation, e.g., on object creation. It also emulates exceptions and interrupts. The trap overhead ranges from 16 to over 900 cycles. This is a crucial point for real-time systems. aJile’s JEMCore [48] and its successors are on the market today. It executes most of the bytecode instructions directly. The build tools take standard bytecode and “link” it to shipped implementations, which implement real-time threads for example. Komodo and jamuth [120] are results of an academic project at the University of Augsburg. They target real-time scheduling on multi-threaded systems. Simple instructions are executed directly, more complex instructions are emulated in microcode and very complex instructions like object creation lead to a trap. Dynamic class loading is allowed as long as the classes are available in the memory images that are loaded onto the system. Extensive academic research in the field of Java hardware is done at the Vienna University of Technology. The Java Optimized Processor (JOP) [101] is a Java processor that is developed with WCET analysis in mind. There are no dependencies between the instructions, so that execution always takes the same time for a specific instruction.
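The instruction folding mentioned above can be sketched in software. The class below is a hypothetical illustration, not picoJava’s actual hardware logic: it merely recognizes the classic foldable pattern “load, load, ALU op, store” in a stream of simplified bytecode mnemonics, which a folding unit could then issue as a single operation.

```java
// Illustrative sketch of instruction folding detection (hypothetical;
// real folding in picoJava is done in hardware and covers more patterns).
public class FoldingSketch {
    static boolean isLoad(String op)  { return op.startsWith("iload"); }
    static boolean isAlu(String op)   {
        return op.equals("iadd") || op.equals("isub") || op.equals("imul");
    }
    static boolean isStore(String op) { return op.startsWith("istore"); }

    // Returns the number of instructions that can be folded into one
    // operation starting at position i: 4 for load-load-op-store, else 1.
    static int foldableLength(String[] code, int i) {
        if (i + 3 < code.length
                && isLoad(code[i]) && isLoad(code[i + 1])
                && isAlu(code[i + 2]) && isStore(code[i + 3])) {
            return 4;
        }
        return 1;
    }
}
```

For example, the sequence `iload_1; iload_2; iadd; istore_3` (the bytecode of `c = a + b;`) is a fold candidate, since on a register machine it corresponds to a single three-operand ALU instruction.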

[50, Ch. 7], [101] and [130] list more projects, some of which are discontinued. In summary, while Java processors execute almost all simple instructions directly, complex ones are emulated. As the JVM is a stack machine, Java processors usually have caches for stack elements. The solutions vary in the caches for instructions and heap, the number of natively implemented bytecodes and the realization in FPGA or silicon. Some solutions have their own low-level I/O and real-time thread model. There are Java processors with an object cache, which is accessed with an object reference. Complex variants like picoJava implement folding in order to optimize the execution of common instruction sequences.

3.3.2 CLI Platform

There are fewer projects implementing CLI’s VES in hardware than in the Java environment. Research was done at the University of New Brunswick. [71] is an attempt to realize parts of a CLR in an FPGA. Similar to the Java solutions, only a subset of the intermediate code’s instructions is supported. Especially the object model is not implemented in

hardware. Implementing all instructions would make the hardware very complex, which reduces the maximum clock frequency. The CLR processor also fetches the CIL instructions and decodes them to processor-internal microcode instructions. The CLR is a stack architecture. If the hardware implementation applies pipelined execution, this is a problem when the result of an instruction is needed by the following instruction. This might stall the pipeline, which is a performance loss. Benchmarks in [71] also show that a VM can reach performance comparable to the hardware solution if the overhead of the JIT compiler is not considered. However, the FPGA implementation consumes far less power. [132] is an academic project to develop an embedded CLI processor in an FPGA. The CIL to be executed is simplified in a previous step. The resulting language is called “Simplified Common Intermediate Language” (SCIL). SCIL does not contain relative references; it uses absolute memory addresses instead. As the CIL metadata is part of the SCIL instructions, the processor design is less complex than that of [71]. The SCIL processor supports only a subset of CIL. For example, instructions concerning the object model and dynamic loading are not included. SCIL instructions are decoded to micro-instructions, as is common in other solutions. [28] discusses an implementation of a CIL processor with a Digital Signal Processor (DSP). Again, the complexity of implementing the object model is depicted.

3.4 Software-supported Execution

3.4.1 Introduction

The following sections describe software solutions of code generation techniques for languages that typically target a virtualizing runtime environment. Existing code generation approaches do not target real-time systems only, but also try to optimize, e.g., code size, performance and startup time. Due to the generality of the problem, real-time and non-real-time solutions are considered. Because of the prevalence and the real-time development of the Java platform, various solutions are located in this field, but they are not limited to it. Section 1.2 describes that lazy compilation, e.g., by means of a JIT compiler, is not suitable for native code generation in real-time systems. That is, the solutions presented here aim to generate or allocate the native code some time before the start of execution. The term “virtual machine” has to be refined in this context. So far, a VM is an independent “vital” computer program that is placed on top of a host’s ISA/OS and takes the guest application’s intermediate code as input, see Figure 2.4 and Figure 2.6. From a more abstract point of view, a VM can be seen as a runtime environment that provides services – comparable to an operating system – which are described in section 2.4.2. This runtime environment can be realized in ways other than the “independent program” approach. As an example, the framework can consist of a code generator and a set of libraries that provide the framework services. The application might be available in native code which is statically or dynamically linked to the set of libraries (see section


2.3.3) to create a standalone executable. That is, the code generator and its invocation, respectively, can be spatially and temporally separated from the other services, breaking the traditional workflow. It is (just) a technical difference whether the code that provides services like memory management is loaded into memory implicitly at startup of the VM by means of the operating system’s program loader, or whether it is loaded by the dynamic linker when it resolves a reference to a shared library. For a clearer classification of the approaches, two terms are used in the following. The term “virtual machine” (VM) is used for architectures like the one shown in Figure 2.4. The term “runtime environment” is used for other approaches, which are described case-by-case. There are also in-between approaches. Like section 3.2 and section 3.3, this section discusses Java approaches first and other platforms later.

3.4.2 Java Platform

VM-based Execution

A convenient way to eliminate the impact of JIT compilation at execution time is to compile the intermediate code to native code in an offline step, i.e., AOT compilation, and store it persistently. The VM then loads the native code instead of invoking a dynamic compiler. TimeSys’ reference implementation (RI) [119] and IBM WebSphere Real Time [16], a commercial product, are RTSJ-compliant implementations of a JVM. Both apply the VM approach. IBM WebSphere provides an AOT compiler. It takes Java bytecode as input and generates native code. The JVM loads the pre-compiled native code at execution time. IBM’s WebSphere stores the AOT-compiled code in a so-called “Shared Class Cache”, which is read at execution time. The initialization of this cache is an extra deployment step. The TimeSys RI relies on an interpreter approach that does not generate native code [76]. Sun Real-Time System (Sun RTS) [21,25] is an RTSJ-compliant JVM implementation. It ships with “Initialization Time Compilation” (ITC) [76]. ITC has three phases: initialization-time compilation, pre-initialization of classes and pre-loading of classes. Each phase requires a list of the methods and classes which are compiled, initialized or loaded, respectively, at VM startup.

The Aonix PERC product family provides industrial Java real-time solutions. The Aonix PERC Ultra JVM is a commercial product that supports Java SE. It provides deterministic garbage collection as well as an AOT and a JIT compiler. The latter can be used at execution time if not all intermediate code was compiled to native code ahead of time [84, p. 6]. It does not implement the RTSJ and targets soft real-time applications [83]. Aonix PERC Pico [103] is an RTSJ-compliant hard real-time solution that follows the runtime environment approach. Java source code is compiled to bytecode, which is used to generate C source code. This is compiled and linked to an

executable.

[52] from Seoul National University proposes a client Ahead-of-Time compiler. A JVM executes bytecode by means of an interpreter and only frequently used methods are JIT compiled. Instead of discarding the JIT-compiled native code at termination, it is stored persistently together with additional relocation information. The stored code is re-used in later runs. The native code has to be relocated in order to be usable by the currently running process. However, this relocation takes some time. The authors propose to defer the relocation in some cases until the JVM would JIT compile the bytecode, so that the pre-compiled code can be used without relocation. They assume that a relocation is not necessary in most cases. This is not suitable for real-time systems, but this solution aims at improving performance, not at real-time systems.
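The relocation step discussed above can be sketched as follows. This is a deliberately simplified model, not the actual implementation of any of the cited systems: persisted native code is represented as an `int[]`, and the relocation information is a list of slot indices holding absolute addresses that must be rebased from the original load address to the new one.

```java
// Hypothetical sketch of rebasing persisted native code: slots listed in
// relocOffsets contain absolute addresses computed against oldBase and
// are patched to be valid when the code is loaded at newBase.
public class RelocationSketch {
    static int[] relocate(int[] code, int[] relocOffsets, int oldBase, int newBase) {
        int[] patched = code.clone();      // leave the stored image untouched
        int delta = newBase - oldBase;
        for (int off : relocOffsets) {
            patched[off] += delta;         // rebase each absolute reference
        }
        return patched;
    }
}
```

The cost of this loop grows with the number of relocation entries, which is why deferring or avoiding relocation (as proposed in [52]) pays off, and why position-independent or indirection-table-based code (as in Quicksilver’s “stitching”) trades run-time indirection for cheaper loading.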

Quicksilver [104] from IBM Research is called a “quasi-static compiler for Java” which relies on a JVM. Quicksilver utilizes the JVM’s code generator to generate “quasi static images” (QSIs). Each QSI is associated with a class file and contains native code. The JVM loads the QSIs at execution time to avoid JIT compilation. The normal code generator is used in case of dynamic class loading and if native code cannot be loaded from a QSI, e.g., because the associated class file changed. Before execution, a QSI has to be relocated, i.e., “stitched”. Indirect references are resolved via an indirection table [57] at execution time.

Apogee provides a commercial real-time JVM that has a so-called “Just-Ahead-of-Time Compiler” [13]. It applies an IBM JIT compiler at application startup to generate the native code of the application. There are two modes to determine the methods to pre-compile. First, all code that is reachable up to a certain call depth from the main method is pre-compiled. Second, an analyzer determines “hot” code that is going to be pre-compiled.

TurboJ [58] from the Open Group Research Institute is a bytecode-to-C compiler. It takes a classfile as input, generates C code and a new classfile which is called “interlude”. The new classfile calls the generated C code via JNI (Java Native Interface), which is used to call native code from within a Java program. For execution of a Java application, a standard JVM can be used. The JVM is responsible for memory management like GC, threading, class loading and execution of the interludes. The generated C code handles class initialization and exceptions.

Runtime Environment Approaches

An alternative code generation approach is to take the application’s source or intermediate code as input, translate it into another form of source code, e.g., C code, and then compile that to native code. The result is an executable binary. The services of a VM, like garbage

collection or even a class/code loader and code generator, are often shipped as libraries and are linked to the binary. An implementation of this approach is JamaicaVM [107] from Aicas. It is a commercial RTSJ-compliant JVM for hard real-time systems. JamaicaVM can translate classfiles, i.e., Java bytecode, into C language code and compiles and links them to a standalone executable. The target domain is embedded systems. Building the executables is an extra deployment step. An interpreter can handle dynamically loaded bytecode. The AERO-JVM [35] is a JVM developed to be used in aerospace. It takes JamaicaVM as its basis.

The OVM is a framework from Purdue University for the research and development of virtual machines with different features. For example, it is used to realize real-time applications in avionics [15]. To this end, an RTSJ-compliant version was implemented. It uses an AOT compiler that generates C++ code from the Java application and the OVM, which is implemented in Java, too. The C++ code is intended to be processed further by GCC. Other implementations on top of OVM create C code as intermediate language [90].

The Fiji VM [90, 91] is a JVM that (partially) supports the RTSJ and the Safety Critical Java (SCJ [92]) specification. It has an AOT compiler that takes Java bytecode as input and generates C code. The C code is compiled to native code, which is linked to a set of libraries that provides services like GC, threading and synchronization. The libraries are written in C and Java. Fiji VM is on the market but has its origin in academia.

Excelsior JET is a commercial product. It provides a “Just-Enough-Time” compiler that translates Java bytecode to native code [80]. It creates a standalone executable that references the JET runtime, which is shipped as library. It contains code to preserve the dynamic loading facilities of Java by using a JIT compiler.

GCJ [40] is part of the free GCC and a compiler for Java. It can compile Java source code to bytecode, Java source code to native code or Java bytecode to native code. The native code is linked to a runtime library which provides VM services, so that a standalone executable can be created. jRate [31] is an extension for GCJ which provides partial support for the RTSJ.

Other Projects

In the following, smaller projects and research vehicles are listed. Marmot [38] is a static Java compiler that translates Java classfiles into a register-based intermediate representation. This representation is optimized and linked to a set of libraries that represents the runtime. The result is a standalone executable. The implementation lacks features like Reflection, dynamic class loading and the JNI. Toba [93] has a “Way Ahead-of-Time” compiler that generates C code from bytecode. It provides libraries for GC, threading and

further Java APIs. The C code is compiled to object files, which are linked to the libraries. The project does not support the dynamic loading features of Java. Harissa [81] translates bytecode to C and generates a standalone executable. A runtime library contains a bytecode interpreter to support dynamic loading. Java2C [85] takes Java source code as input and generates C code. The solution targets very resource-constrained embedded devices. The same group worked on the Infinitesimal Virtual Machine [55], an interpreter-based research real-time JVM. Manta [124] compiles Java source code to native code and creates an executable. The executable contains the bytecode in order to support remote method invocation by sending the bytecode. [93] references other projects that use C language code as intermediate language, e.g., for Forth, Smalltalk and Pascal.

There are several other projects in the Java field which are no longer updated or do not contribute new information to the related work presented here, and for which only little information is available. These include towerJ, BulletTrain, Jove, JTime, simpleRTJ and fastJ.

3.4.3 CLI Platform

VM-based Execution

The open source CLI VM Mono [128] has an AOT compiler, which takes a CLI assembly as input and generates relocatable native code for the methods in the assembly. This requires an additional deployment step. The native code is stored in an object file format, see section 2.4.3. Mono loads the AOT code lazily at execution time. References might go via an indirection and might cause the invocation of the class/code loader at execution time. Code that could not be pre-compiled is generated by the JIT compiler at execution time.

Microsoft’s .NET framework [79] has an AOT compiler called “Native Image Generator” (NGen) [78]. NGen generates native code images from CLI assemblies in an additional deployment step. Such an image is assigned to a certain address, so its internal references are absolute. If the image cannot be loaded at its pre-defined address, the internal references have to be adjusted to the new load address at load time. If another image binds the image hard, i.e., it uses absolute addresses, these references have to be updated as well. NGen can also emit stubs that allow indirect referencing. Neither CLI VM targets real-time applications. They lack suitable code generation and programming features.

The ProConOS embedded CLR [68] is a pruned CLI-VM for embedded automation systems for industrial use. It is integrated in a PLC in order to execute CLI code. An AOT compiler generates native code in advance.


Runtime Environment Approaches

The academic Real-Time.Net project of the Hasso-Plattner-Institut provides a GCC front-end which translates CLI assemblies into the intermediate representation of the GCC (see section 2.4.1), so that native executables can be generated [122]. The project focuses on real-time systems. Some CIL opcodes – especially those for memory management and exception handling – are target specific and typically transfer control to the VES, i.e., they are implemented in the VES. So, a special runtime library is provided, which is linked to the application code in order to get a standalone executable. Components that are not suitable for real-time systems are replaced by manually implemented ones. For example, Real-Time.Net introduces a real-time variant of the threading library. Some features, e.g., Reflection, are not supported. [96] suggests real-time extensions for ECMA 335 within the scope of the project.

Microsoft’s .NET Native [9] is a high-level language compiler for C# that generates native code rather than intermediate code. Support for other languages is expected to follow. The runtime environment is also available in native code and it is linked to the application. Microsoft aims at improvement in performance compared to the execution of intermediate code. Some features concerning Reflection, marshalling and serialization are not supported [10].

The GCC4NET project [30] evaluates the use of CIL as a deployment format for applications of embedded systems. An application’s C code is taken as input and a platform independent CIL assembly is generated. This assembly is used for deployment. A CIL-to-native compiler on the target generates native code. The authors suggest AOT compilation at deployment time or a load-time initialization, which is comparable to Sun RTS’s ITC. However, as the source language to be supported is C, only a subset of CIL and – due to the absence of a runtime library – no VM services are supported. Support for type parameters, automatic memory management, class loading, Reflection and exceptions is missing.

Remotesoft’s Salamander .NET [95] creates standalone executables for IA-32 from CLI assemblies. The tool targets deployment and code obfuscation rather than real-time systems. CrossNet [5] generates C++ code from CIL. The C++ code can be compiled to native code by a standard compiler. However, the project does not provide a standard library and it is in a premature state.

The operating system toolkit C# Open-Source Managed Operating System (Cosmos) [7] is written entirely in C# or any other CIL-compatible language. After compiling the code to CIL, the AOT compiler “IL2CPU” generates native code from the CIL. Singularity [53] is a research operating system from Microsoft Research written in managed code. Programs are compiled from CIL to native code at installation time.


3.4.4 Low-Level Virtual Machine

The Low-Level Virtual Machine (LLVM) is a free compiler framework used in many projects. It “aims to make lifelong program analysis and transformation available for arbitrary software, and in a manner that is transparent to programmers” [69]. Similar to GCC’s RTL, LLVM has a source language independent intermediate representation (LLVM IR) and a native code generator that takes the LLVM IR as input. LLVM IR was developed to reflect features of modern languages. It includes an appropriate type system, instructions for type conversions and exception handling. However, it is independent of the object models used in Java bytecode or CIL, because it is not aware of constructs like classes. The term “virtual machine” in LLVM is somewhat misleading, because LLVM does not provide the services that are described in section 2.4.2. The LLVM workflow is as follows. Compiler front-ends generate LLVM IR, which can be optimized by the LLVM linker. If referenced libraries are not available in LLVM IR, they are not considered for optimization. Then, the LLVM IR is translated into native code, which is stored together with the LLVM IR, or it can be JIT compiled by LLVM’s back-end. The LLVM back-end can add profiling code to the native code, so that it is possible to re-optimize the native code during or after execution time. [6] is an LLVM front-end for the C language family. Mono can use LLVM as compiler back-end since version 2.8. Mono’s JIT intermediate representation is transformed to LLVM IR, which is handled by the LLVM back-end. VMKit [44] from the Université Pierre et Marie Curie is a framework for the development of HLL VMs. It utilizes LLVM as code generator. A JVM and a CLI VM are implemented for demonstration purposes. There are various projects [8] in the context of LLVM, e.g., a standard C library and a debugger.

3.5 Summary

The chapter State of the Art presents a broad overview of ongoing and completed work in the field of real-time virtual machines. Prevailing efforts are established in the Java area, but are not limited to it. The approaches developed can be transferred to other platforms. Formal specifications rather target memory management, threading and other real-time programming features like timers and synchronization. The generation of native code is not covered in detail. There are three main approaches for native code generation and intermediate code execution, respectively: interpretation (software-based intermediate code execution instead of native code generation), hardware-based intermediate code execution and software-based native code generation. This thesis focuses on real-time suitable code allocation in particular. The following sections introduce a new concept for real-time suitable code generation that fulfills the development requirements described in section 1.3.


4 Allocation of Native Code

4.1 Evaluation of Existing Approaches

Section 3.3 and section 3.4 introduced various solutions for hardware-supported execution and software-supported execution and code generation, respectively. The hardware support is considered as one approach, whereas the software-supported solutions can be classified into five approaches: interpretation, JIT compilation, initialization time compilation (ITC), AOT compilation with VM support and AOT compilation in a runtime environment. All approaches are evaluated according to the development objectives in section 1.3. Table 4.1 summarizes the evaluation results.

4.1.1 Hardware-supported Execution

Section 3.3 introduced hardware-supported execution of intermediate code. That is, the intermediate code is “native”, because it can be executed directly by the underlying hardware. This execution model is not widespread today. Intermediate code, e.g., Java bytecode and CIL, is designed to be run by a VM and relies on a quite high-level hardware model. This makes it difficult to implement a “bare-metal” variant of the targeted VM. Most solutions do not provide the full instruction set and – except for dedicated projects like the “Java Optimized Processor” [101] – might suffer from hard-to-predict execution times due to the complexity. That is, only particular solutions achieve objective O.1 (execution time determinism). The object-oriented nature requires associative caches for objects. Various Java processors handle complex bytecode instructions by trapping into software routines. These solutions can meet objectives O.3 (good performance) [71] and O.4 (low startup time). They hardly fulfill O.2 (portability) of section 1.3. First, due to the limited set of supported intermediate code instructions, the use of legacy code and dynamic loading is limited. Second, the use of dedicated hardware contradicts portability per se. For these reasons, solutions that rely on hardware-supported execution are not considered further.

4.1.2 Software-supported Execution

Section 1.2 describes interpretation as software-based execution. The intermediate code is “native” in the sense that the VM executes it directly. That allows for a low startup time of an application, because native code does not have to be allocated or generated. Each intermediate code instruction is implemented as a set of instructions of the executing system. Portability is preserved, as interpreters in full-featured VMs typically support all intermediate code instructions. Besides O.4 and O.2, interpreters are also supposed to fulfill O.1, according to the definition of “real-time suitable” code generation

given in section 1.3. Real-time suitable code generation does not interfere with the execution of an application, i.e., the native representation of the application is available before it starts. Lazy loading of referenced intermediate code at execution time could be critical, though. However, interpreting solutions have a performance penalty compared to solutions that use a native representation of the intermediate code, i.e., they do not fulfill O.3. Therefore, interpreter solutions are not considered in the following.
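The per-instruction dispatch overhead behind this penalty can be illustrated with a minimal stack-machine interpreter in C. The instruction set and all names are invented for illustration and are unrelated to CIL or Java bytecode:

```c
#include <assert.h>
#include <stddef.h>

/* A hypothetical three-instruction stack machine. Every intermediate
 * instruction costs one dispatch step (the switch) on top of the work
 * the instruction itself performs. */
enum { OP_PUSH, OP_ADD, OP_RET };

int interpret(const int *code, size_t len) {
    int stack[16];
    int sp = 0;
    for (size_t pc = 0; pc < len; pc++) {
        switch (code[pc]) {                     /* dispatch per instruction */
        case OP_PUSH: stack[sp++] = code[++pc]; break;
        case OP_ADD:  sp--; stack[sp - 1] += stack[sp]; break;
        case OP_RET:  return stack[--sp];
        }
    }
    return 0;
}
```

A native representation collapses the dispatch loop into direct machine instructions, which is the performance gap referred to above.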

Section 3.4 introduced software-supported execution and allocation of native code, respectively. Solutions that use allocation rates – like some real-time garbage collection approaches do – and that allow native code generation at execution time are not considered here, because they are not in accordance with the definition of real-time suitable code generation in section 1.3. This would require knowledge about the application and contradicts O.2. Solutions that discard the native code after generation and first execution are not considered either. Even if the code generation is deterministic, such a solution would not fulfill O.3 compared to approaches that generate native code and save it (at least during the lifetime of the application) for later execution. When using a JIT compiler as code generator, like Mono does, an application is deployed as intermediate code. Updated application code or legacy code is handled without extra effort. Thus, objective O.2 is fulfilled. A JIT compiler generates native code that runs directly on the hardware. So, the code is assumed to perform well1, i.e., O.3 is achieved. JIT compilers typically apply the lazy compilation principle, so that only the entry point method is compiled at application startup. If this method consists of a large amount of intermediate code or if the majority of the application’s code is executed at startup, e.g., for graphical user interface applications, the startup time can be quite long. Hence, JIT compilation has a medium startup time compared to the other approaches. The lazy compilation principle can cause high execution time jitter, see section 1.2, so that O.1 is not achieved. ITC is a variant of JIT compilation where the application’s intermediate code is pre-compiled at startup. As implemented by SunRTS and Apogee, ITC has the disadvantage that profiles of the intermediate code to be pre-compiled have to be created. The lists can be created manually or based on an execution trace.
This is critical because it requires very deep knowledge of the application and especially of the rarely executed time-critical paths. Execution traces might not cover all cases, e.g., for legacy code. This analysis has to be repeated whenever parts of the application change. So, ITC achieves medium portability (O.2) compared to the other approaches. ITC can reach high execution time determinism and high performance if all potentially executed code is pre-compiled, so O.1 and O.3 can be achieved with this approach. Pre-compilation of methods and pre-initialization of classes increase the startup time of an application, so that other solutions outperform ITC and O.4 is not fulfilled. A VM-based execution of AOT compiled code is provided by the Microsoft .NET framework in interaction with Ngen and by IBM WebSphere Real-Time. Mono also ships with an AOT

1 at least compared to interpretation

compiler. Loading AOT compiled code typically reduces startup time, because the initial JIT compilation overhead does not occur. Objective O.4 can be fulfilled this way. If the pre-compiled code has to be loaded lazily or if references have to be resolved at execution time, jitter is introduced. So, execution time determinism (O.1) is only medium compared to other solutions. An additional deployment step is necessary: the AOT compilation of assemblies and class files, respectively. The execution of legacy code has to fall back to the JIT compiler or interpreter if the AOT compiler does not support all language features. For instance, some versions of Mono provide a “Full-AOT” mode, which pre-compiles even Mono-internal helper functions. So, JIT compilation can be avoided in most cases. However, it does not generate all helper functions that are necessary. For example, helper functions for marshalling user-defined data structures when invoking native code – also known as “Platform Invoke” – are not generated. Falling back to JIT compilation in these cases introduces execution time jitter. However, it is an asset that no knowledge of application internals is necessary to apply the AOT compiler. So, the portability (O.2) is rated medium because of the extra deployment step. The execution performance is assumed to be high, because native code is executed. That is, O.3 is achieved. However, it is case-specific whether there is a performance gain compared to JIT-based execution. On the one hand, AOT compiled code might contain more indirect references than JIT compiled code [16]. For example, addresses of application code or VM-internal services are known at JIT compilation time but not at AOT compilation time. On the other hand, an AOT compiler can perform aggressive optimizations that are more time consuming.

JamaicaVM and .NET Native provide AOT compilation in a runtime environment. JamaicaVM translates intermediate code, i.e., Java bytecode, into C code, which is compiled to a standalone executable. In contrast, .NET Native translates the high-level language to native code. The native application code is typically linked to a set of libraries, which provides VM service functionality. These approaches might achieve high execution time determinism, because native code does not have to be generated or allocated at run time. If the program loader can be instructed to resolve references at startup time, objective O.1 can be achieved. O.3 (performance) can be achieved too, because the application is available as native code and offline code generation can use optimizations that are more aggressive but more time consuming than those of, e.g., JIT compilers. If intermediate code is compiled to native code, an extra deployment step is required. If the transformation is done at the high-level language level, the source language compiler has to be adapted each time a new target architecture or a new source language has to be supported. Also the libraries, which provide the VM services, have to be ported to a new system. These issues affect the portability of an application negatively. Various solutions do not provide all services of a VM or only support a subset of the source language or intermediate code. O.2 is rated worse than that of the VM-based execution of AOT compiled code. The standalone executable has to be re-built, even if only a fraction of the application is updated. On the other hand, the startup time is presumably the best of all approaches, i.e., O.4 is achieved. There is no extra processing except the program loader work, which has to be done to load a VM anyway.


Method               O.1   O.2   O.3   O.4   Example
Hardware              X     −     X     X    JOP
Interpretation        X     X     −     X    TimeSys RI
JIT                   −     X     X     ◦    Mono, .NET CLR
ITC                   X     ◦     X     −    SunRTS, Apogee
AOT (VM)              ◦     ◦     X     X    WebSphere
AOT (runtime env.)    X     −     X     X    JamaicaVM

Table 4.1: Evaluation of code allocation and execution approaches, respectively (X/◦/− = achieved/restricted/missed)

4.2 Concept

A concept for real-time suitable code generation in virtualizing runtime environments is proposed, which is based on the evaluation results of section 4.1 and the development objectives O.1 to O.4 of section 1.3. The approach is generic in the sense that it is intended to be applied to virtualizing runtime environments in general and does not require a certain platform, e.g., Java or ECMA 335/CLI. The new principle can be applied to other virtualizing runtime environment platforms. The following design decisions are made:

D.1 Using a full-featured VM as basis
D.2 Pre-compilation of intermediate code
D.3 Stitching of native code

D.1 is chosen to reach development objective O.2 (portability). Importance is attached to easy deployment and the use of legacy code. Making a VM, which was not designed with determinism in mind, deterministic – at least for a particular aspect – seems like doing things back to front. However, it is a valid approach considering the requirements. Solutions that generate standalone native executables have restrictions concerning language feature support and manageability of application and/or standard library updates. For example, dynamic loading and application updates are more the exception than the rule for AOT compilation approaches that rely on runtime environments like JamaicaVM and RealTime.NET, which claim to provide high execution time determinism. The design decision to use a full-featured VM has the advantage over other approaches that an error-prone and complex re-implementation of (parts of) a VM can be omitted. VM services, even if they are not real-time suitable, are preserved and can be used by legacy code. Proven functionality like the native code generator and class loader can be re-used. The VM keeps its support of dynamic features like dynamic class loading and code generation at runtime, if these features are needed. The standard programming library is intended to be used as is.
This design decision allows using legacy application code, and the source language is not restricted. The real-time extension does not support some features, but these language features can be used nevertheless. The benefits of general-purpose programming languages over conventional real-time system programming languages, e.g., a higher degree of abstraction, a rich feature set and maybe a broader community of programmers,

are preserved due to this design decision. As a side note, the infrastructure of the VM to support automatic garbage collection can be used to apply a real-time suitable GC. D.2 is chosen to achieve objective O.3 (performance). Therefore, a native representation of the application’s intermediate code is generated in a first step. The direct impact of the lazy compilation principle on the execution times of intermediate code can be eliminated. This should provide better performance than interpretation of intermediate code. O.2 also contributes to this design decision. The allocation of native code starts at the intermediate code level in order to preserve the portability and the full feature set. Existing high-level language compilers can be used as is, and the interoperability of various high-level languages is preserved due to the application deployment via intermediate code. An alternative is to rely on the high-level language, as realized by .NET Native. This implies that a source-to-native compiler has to be developed for each combination of source language and target architecture. A further alternative approach is to use C code as intermediate code. If the C code is generated from intermediate code, as with JamaicaVM, there is an additional deployment step, which builds the executable from the C code. For these reasons, the native code generation is performed at the intermediate code level, and not, for example, at the high-level language level. Design decision D.3 is a second step that allows for achieving development objective O.1 (determinism). It is an optimization of the native code, which is generated in the first step. After this optimization, no further native code is generated and no already generated native code is modified at run time. That is, the execution of an application can be free of any interactions of VM services that deal with native code generation and modification. The impact of the lazy compilation principle is eliminated.
This is realized by resolving all indirect references in the native code. The code “stitching” is typically closely related to native code generation, so both steps are considered together when necessary. In summary, the design given here has conceptual advantages over existing solutions, especially in use cases where portability, easy deployment and performance as well as support for interoperability of high-level programming languages are important. This is realized by relying on application deployment via intermediate code and its translation into native code. The concept is designed for an easy-to-realize side-by-side use of real-time and non-real-time application code and the use of legacy code, which comes through the use of a full-featured HLL VM. It is a holistic solution, which does not require providing two runtime environments, e.g., one for the real-time application and a second one for a graphical user interface. Other solutions might lack this flexibility but perform better in quite resource-constrained environments, e.g., when it is not possible to deploy a full set of the standard library. Development objective O.4 (startup time) is also considered. The overhead due to reference resolving depends on the number of unresolved references, which partly depends on whether the native code is generated dynamically or ahead of time, see section 4.1.2. Taking this into account, section 4.3 describes pre-compilation that is inspired by the ITC approach. Section 4.4 describes the testing environment that is used for the evaluation of the pre-compilation, which is done in section 4.5. Chapter 5 describes the native code optimization and chapter 6 examines an approach that bases on AOT compilation to

examine the startup time behavior. In summary, a design is chosen that combines the advantages of JIT compilation (O.2 and O.3, see Table 4.1) with those of ITC (O.1) while optimizing objective O.4.
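The effect of design decision D.3 can be sketched with function pointers in C. This is an illustrative model, not Mono’s actual trampoline mechanism: a lazily resolved slot is patched at run time on first use, whereas a “stitched” slot is resolved before execution, so no code modification happens while the application runs:

```c
#include <assert.h>

typedef int (*target_fn)(int);

static int square(int x) { return x * x; }

static target_fn slot;       /* indirect reference used by "native code" */
static int runtime_patches;  /* patches performed during execution       */

/* Lazy scheme: the slot starts out pointing at a trampoline that
 * resolves the real target on first use and patches the slot. */
static int trampoline(int x) {
    runtime_patches++;       /* run-time code modification = jitter */
    slot = square;
    return square(x);
}

void init_lazy(void) { slot = trampoline; runtime_patches = 0; }

/* "Stitching": resolve the reference before execution starts, so the
 * application never triggers the trampoline at run time. */
void stitch(void) { slot = square; runtime_patches = 0; }

int call_through_slot(int x) { return slot(x); }
```

In the lazy scheme, the first call goes through the trampoline; after stitching, every call is a plain indirect jump to an already resolved target.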

4.3 JIT Compiler-based Pre-Compilation

4.3.1 Principle

The concept of pre-compilation based on JIT compilation was first introduced in [99]. The CLI implementation Mono in version 2.6.1, which was released in December 2009, is used for the implementation and the evaluation of the concept. Mono was chosen because it is a free and open source project. Mono provides two execution modes: JIT compilation and AOT compilation. This reduces the implementation effort of the evaluation of startup time behavior depending on the execution mode. Figure 4.1 illustrates how the pre-compilation works. Mono’s code loader loads the application assembly, see Figure 4.1,

[Figure 4.1 depicts the pre-compilation extension (metadata analysis and the list of assemblies to pre-compile: main.exe, mscorlib.dll, func.dll) alongside Mono’s class/code loader, JIT compiler, native code cache, persistent memory and execution engine, connected by the steps 1–6 referenced in the text.]

Figure 4.1: Pre-compilation based on Mono’s JIT compiler.

step 1. The metadata is checked for referenced assemblies, e.g., assemblies of the standard library. All assemblies found are loaded and analyzed until no additional assemblies are identified. In the example of section 2.4.3, the application assembly main.exe has references to mscorlib.dll (the core library) and func.dll, see lines 2 and 4 of Listing 2.15. The referenced assemblies do not refer to others, so that the pre-compilation list includes the three named assemblies (Fig. 4.1, step 2). Mono’s JIT compiler processes each assembly (Fig. 4.1, step 3) in order to pre-compile its contents. All methods, which are found in the assemblies via metadata analysis, are pre-compiled. For assembly main.exe in the example, these are the function Main and the constructor of class Hello. The memory for

the methods’ native code is allocated during the JIT compilation process and the native code is stored there. An existing Mono-internal cache manages the native code (Fig. 4.1, step 4), so that it can be retrieved later. Besides the methods in the assemblies, all potentially needed Mono-internal helper methods – so-called “wrappers” – are generated and pre-compiled too. Wrappers can also be considered as Mono-internal structures because they are transparent to a user. In general, a wrapper is responsible for marshalling method arguments and/or invoking the method. In this context, marshalling means the conversion of data elements and structures in order to pass them from unmanaged code to managed code or vice versa. The wrapper generation also considers custom classes and structures. Mono uses the following types of wrappers:

• Invocation of managed code from the runtime, e.g., the entry point method
• Invocation of special methods of the core library, e.g., for finalization and exceptions
• Invocation of Delegates
• Invocation of native code, e.g., Platform Invokes
• Remoting
• Marshalling, e.g., for the StructureToPtr method
• Synchronization
• Boxing and Unboxing

The pre-compilation fills the native code cache with the addresses of all potentially executed code. In the example (which was executed with conventional JIT compilation), function Main’s native code starts at address 0x010591e8 and func at 0x1059328, see line 1 in Listing 2.16 and line 5 in Listing 2.17, respectively. The execution of the application assembly starts when the pre-compilation finishes (Fig. 4.1, step 5). The execution engine looks up the native code cache when a call to a method is executed for the first time. In case of a hit, the native code is loaded (Fig. 4.1, step 6). In case of a miss, the JIT compiler would generate native code.
In the example shown in Figure 4.1, the native code cache is already filled, so that no JIT compilation is performed at run time. The pre-compilation is transparent to Mono’s execution engine. The class loader and JIT compiler are not modified. If the pre-compilation is deactivated, the execution of the assembly proceeds as usual.
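The cache lookup of steps 5 and 6 might be sketched as follows. This C model is a deliberate simplification and does not reflect Mono’s internal data structures; the stand-in “compiled” function and all names are invented for illustration:

```c
#include <assert.h>
#include <stddef.h>

#define N_METHODS 1000

typedef int (*native_fn)(int, int);

/* Hypothetical cache: method index -> native code entry point.
 * NULL means "not compiled yet" and triggers the JIT at run time. */
static native_fn code_cache[N_METHODS];
static int jit_invocations;            /* run-time compilations performed */

static int add_impl(int a, int b) { return a + b; }

static native_fn jit_compile(int method_id) {
    (void)method_id;
    jit_invocations++;                 /* run-time code generation = jitter */
    return add_impl;                   /* stand-in for emitted native code  */
}

/* Execution engine lookup (cf. Fig. 4.1, steps 5 and 6). */
int invoke(int method_id, int a, int b) {
    if (code_cache[method_id] == NULL)          /* miss: lazy compilation */
        code_cache[method_id] = jit_compile(method_id);
    return code_cache[method_id](a, b);         /* hit: direct native call */
}

/* Pre-compilation fills the cache before the application starts. */
void precompile_all(void) {
    for (int i = 0; i < N_METHODS; i++)
        code_cache[i] = add_impl;      /* models the startup compilation */
}
```

With `precompile_all()` run first, every `invoke()` is a cache hit and `jit_invocations` stays zero, which is the property the pre-compilation aims for.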

The type initializers of classes, i.e., static constructors, are run during pre-compilation. So, more VM-internal data structures are set up this way, e.g., virtual tables (VTables). This allows for resolving a lot of references at code generation time, because the addresses are known. However, this is usually done when the type in question is referenced for the very first time. This is obviously not the case during pre-compilation of intermediate code, because the code is actually not executed. However, [4, §I.8.9.5] states that “If marked BeforeFieldInit then the type’s initializer method is executed at, or sometime before, first access to any static field defined for that type”. This flag is set for class Hello in the example, see line 6 in Listing 2.15. Running the type initializers of types that do not have the flag set would violate the standard. On the other side, running the type initializer at

run time can introduce jitter. The solution taken here makes a compromise in favor of real-time behavior.
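The trade-off can be modeled in C as a lazy one-time initialization versus an eager one at startup. All names are illustrative; C has no static constructors, so a guard flag stands in for the VM’s “type initialized” state:

```c
#include <assert.h>

/* C analog of a type initializer: a one-time setup whose cost shows
 * up in the first access, unless it is run eagerly at startup. */
static int vtable_ready;
static int init_runs_at_call_time;

static void type_init(void) { vtable_ready = 1; }

/* Lazy scheme: the initializer runs on first access (cf. the
 * BeforeFieldInit semantics quoted above). */
int access_lazy(void) {
    if (!vtable_ready) {
        init_runs_at_call_time++;   /* first-access cost = jitter */
        type_init();
    }
    return vtable_ready;
}

/* Pre-initialization: run the initializer during startup/pre-compilation,
 * so no initialization work remains at access time. */
void pre_initialize(void) { type_init(); }
```

After `pre_initialize()`, the guarded path in `access_lazy()` is never taken, which mirrors moving the initializer cost from run time to startup.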

4.3.2 Implementation

There were two possibilities to implement the pre-compilation in Mono: an invasive and a non-invasive variant. The non-invasive variant is implemented in a high-level language like C#. It uses CLI’s Reflection capabilities to load an assembly and to examine its metadata, in order to search for referenced assemblies and the methods in the assembly. The C# namespace “System.Runtime.CompilerServices” provides the class “RuntimeHelpers”. Its methods “PrepareMethod” and “RunClassConstructor” can control the JIT compiler and type initializers. That allows for a pre-compilation as described. It even allows control over the execution time of the type initializers. The workflow would be as follows. The pre-compilation code forms a separate pre-compilation assembly, e.g., PreComp.exe. The name of the application assembly, e.g., App.exe, and its parameters are input parameters. The following listing shows the invocation of the application the conventional way:

$ mono App.exe arg1 arg2 ... argn

The next listing shows the invocation of the application with the pre-compilation assembly:

$ mono PreComp.exe App.exe arg1 arg2 ... argn

This does not require changes to the application assembly, so that portability (objective O.2) is not harmed. It keeps the implementation effort low and is portable, because the pre-compilation code is deployed as intermediate code. However, the non-invasive approach could not be used. The methods to control the JIT compiler and the type initializers were introduced by the .NET 2.0 profile in 2005. This version also introduced the VM-internal support for type parameters (“Generics”), i.e., parametric polymorphism. The solution presented here does not support intermediate code that relies on Generics. Details about this restriction are explained in section 7.1.3. It was not possible to load the version of the standard library that is referenced by the application assembly and that does not support Generics, if the pre-compilation assembly references another version. This was tested with Mono and the .NET framework. The main reason for an invasive implementation is to get control over the generation of the Mono-specific wrappers. Hence, this decision is driven by Mono implementation specifics. The invasive variant requires modification and extension, respectively, of the VM’s source code. The extensions to the Mono VM are marked by dashed lines in Figure 4.1. They are implemented in the C language, like Mono, and use existing functionality to keep the implementation effort low. This required an extensive analysis of Mono’s (often undocumented) source code. For example, the metadata analysis that determines referenced assemblies uses existing functions of Mono. The pre-compilation routines for normal methods and wrappers are based on Mono’s AOT compilation code.


It is possible that exceptions are thrown during pre-compilation, e.g., due to malicious type initializers. In order to be able to continue the pre-compilation and to keep the implementation effort low, exceptions are caught in managed code and not in C code. Figure 4.2 shows step 3 of Figure 4.1 in more detail. The implementation uses the Mono-specific

[Figure 4.2 depicts PreComp.exe (intermediate code) interacting with the Mono VM via the Mono embedded API and the internal call interface; glue code of the pre-compilation extension triggers the JIT compiler for each assembly on the pre-compilation list (main.exe, mscorlib.dll, func.dll), steps 1–3.]
Figure 4.2: Triggering Mono’s JIT compiler via managed code.

“Mono embedded API”. It allows for executing CIL from C code by calling routines that are provided by Mono libraries. This interface is applied to Mono itself in order to call a pre-compilation assembly (“PreComp.exe” in Figure 4.2), which is used as intermediary. One of its managed functions is called for each assembly on the list of pre-compiled assemblies, see step 1 of Figure 4.2. The managed function calls C code of Mono’s pre-compilation extension (step 2 of Figure 4.2) via an “Internal Call”. An internal call is a CIL method with the attribute internalcall set. [4, §II.15.4.3.3] states that this attribute “[...] specifies that the method body is provided by this CLI (and is typically used by low-level methods in a system library)”. The internal call triggers some glue code, which invokes Mono’s JIT compiler (Figure 4.2, step 3). If an exception occurs, it is caught in the pre-compilation assembly and the pre-compilation can be continued.

4.4 Testing Environment

Section 4.3 describes the pre-compilation concept and its implementation. This section describes a testing environment to evaluate the execution modes of the modified Mono VM, which is called “Mono Real-Time” (MonoRT) in the following. The evaluation of the interim development state, which is done in section 4.5, provides a better understanding of the initial problem, of the gain achieved through pre-compilation and of the following

steps.

The number of available standard industry benchmarks for CLI implementations is limited [44, p. 81]. Existing C#/CLI benchmarks ([133], [67] and [106]) are often ports of Java benchmarks and are quite computation intensive. They examine code quality rather than execution time determinism. Real-time benchmarks that are dedicated to Java [32,42,61] use RTSJ-specific functionality. The Mono VM used here lacks such features, like high precision timers or a real-time suitable threading model. So, a naive port of existing Java benchmarks to C# is not possible. However, RTSJ features are not necessary to examine the properties that are mentioned above. For these reasons, the test series are realized as custom micro-benchmarks. The properties to examine are:

• Temporal determinism of method execution
• Execution time
• Application startup time

These properties reflect the development objectives O.1, O.3 and O.4, respectively. The benchmarks are designed to examine the native code allocation. They do not cover GC performance, WCET analysis or other real-time programming aspects. The initial version was developed by Alexej Schepeljanski [99]. He provided an adaptable code generator to create the benchmarks and he made significant contributions to the development of the structure of the benchmarks. Listing 4.1 shows the general structure of the micro-benchmarks.

    main() {
        // initialization
        method0();
        ...                 // measurement 1
        method999();

        for (i = 0; i < 3; i++) {
            method0();
            ...             // measurement 2, 3, 4
            method999();
        }
        return;
    }

    Listing 4.1: Pseudo code of micro-benchmark

There are several similar methods, which are explained in detail later. The methods are designed to have an execution time that is low compared to the overhead of VM-internal code handling mechanisms. For that reason, there are 1000 methods and not only one method that is 1000 times more extensive than a single one. This allows for emphasizing onetime overhead that is related to code allocation. The value 1000 for the number of methods proved to be the most practicable with regard to test duration, expressiveness and manageability. It also provides a sufficient “randomness” that is necessary for evaluation. The benchmarks reveal different kinds of overhead that are introduced by the VM. The total invocation and execution time of all 1000 methods is captured four times per benchmark run, i.e., per measurement, see Listing 4.1.

First Measurement: The first measurement considers the very first invocation of the methods. It is intended to capture the code generation/loading and reference handling overhead as well as the methods’ execution.

Second Measurement: The same methods are called as in the first measurement, but through different call sites. The methods’ native code should already be generated or loaded at this point. The first and the second measurement use different call sites in order to reveal whether there is overhead that is not related directly to JIT compilation. A simple repetition of the first run would possibly hide additional overhead that is related to the method calls.

Third & Fourth Measurement: They are repetitions of the second one, see Listing 4.1. That is, they use the second run’s call sites. The third and fourth run determine whether there is an additional kind of overhead and they determine the “final” execution time level. At the final level, the methods’ execution times are not affected by overhead introduced by the VM. These results are used to evaluate the results of the previous measurements.

If the third (or fourth, or ...) run used not yet used call sites, the overhead revealed by the second run might be observed again. The benchmarks are reduced to the essential. That is, an additional run on the final execution time level is omitted, as well as one that might reveal overhead which is already figured out. The benchmarks look quite synthetic as they make intensive use of method calls.
Object-oriented languages like C# and Java foster a programming style that enforces this, because semantics (methods) and data are linked by objects. Programming patterns like getters and setters, which enable data encapsulation, are an example. Thus, the benchmark presented here reflects a realistic scenario.

The methods, which are called in the benchmark, need a special design. On the one hand, the benchmark itself has to be deterministic. It should also minimize hardware effects due to caching, pipelining and branch prediction. On the other hand, simple compiler optimizations like method inlining, which cannot be disabled in every case, have to be avoided. The benchmark has to be easily portable to at least C# and Java. So, language specifics like C#’s method implementation options are not used. Two variants of the benchmark are developed and compared in order to determine a convenient method design: a simple and a complex variant. Listing 4.2 shows the structure of the simple methods and Listing 4.3 shows the complex methods.

    int method0(int var1, int var2) {
        return (var1 + var2);
    }

    Listing 4.2: Structure of simple benchmark methods


    int method0(int var1, int var2) {
        if (var1 > var2)
            return (var1 + var2);
        else
            return (var2 + var1 + 1);
    }

    Listing 4.3: Structure of complex benchmark methods

Two integers are the arguments of each method, which in turn returns an integer value (line 1 of Listing 4.2 and Listing 4.3). The simple variant has only one path and one exit point (line 2 of Listing 4.2). Each method performs an addition. The complex variant comes in three flavors that differ in the mathematical operation: addition, subtraction or multiplication. The value 1 is always added in the else-clause (line 5 of Listing 4.3). The else-clause is the “long” path, whereas the if-clause is the “short” path. The two possible paths and exit points are intended to prevent inlining.

There is a series of benchmarks. The elements of this series differ in the test case examined and the target language. Several elements might examine the same test case but differ in the programming language. The test cases basically differ in the way the methods are called. For example, one test case examines the case where the methods are called through an instance of a class. Another test case considers calling static methods. The adaptable code generator creates each element of the series, i.e., the source code of the benchmarks. It operates on a given set of n integer variables, var0 to varn−1, and a set of n methods, method0 to methodn−1. n is set to 1000 in the series used here. Each variable vari is initialized with the value of i at benchmark startup. The method calls of the first measurement (see Listing 4.1) are generated sequentially as follows. Three integer variables are chosen randomly out of the set of integer variables. One variable takes the method’s return value and the other two variables are the arguments. The method arguments must be different. The simple variant of the benchmark calls the simply structured methods, which are shown in Listing 4.2. The simple variant’s method calls for the second, third and fourth measurement are identical to those of the first measurement, including the assigned variables. That means, each measurement of a simple benchmark always takes the identical path, because there is only one. The complex benchmark uses the complex methods (see Listing 4.3). The mathematical operation is chosen randomly for each complex method when the benchmark is created. Due to the number of 1000 methods, it is assumed that the number of additions, subtractions and multiplications is balanced. Additionally, the random assignment of variables to method calls is repeated for the second set of method calls, which is used for the second, third and fourth measurement (see Listing 4.1).
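The call-site generation described above might be sketched as follows. This is an illustrative C rendition of one emission step, not the actual generator (which is a .NET program); the buffer-based interface and all names are invented:

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Emit one generated call site: pick one result variable and two
 * distinct argument variables at random, then format a C#-style
 * assignment line such as "var144 = method301(var894, var208);". */
#define N_VARS 1000

int emit_call(char *buf, size_t len, int method_id) {
    int res  = rand() % N_VARS;
    int arg1 = rand() % N_VARS;
    int arg2;
    do {
        arg2 = rand() % N_VARS;        /* the two arguments must differ */
    } while (arg2 == arg1);
    return snprintf(buf, len, "var%d = method%d(var%d, var%d);",
                    res, method_id, arg1, arg2);
}
```

Running `emit_call` once per method, and again for the second set of call sites, yields the two blocks of Listing 4.1.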
Listing 4.4 gives an example of the generated source code.

56 4.4 Testing Environment

main () {
    // initialization
    ...
    var144 = method301(var894, var208);    // measurement 1
    var722 = method302(var931, var434);
    ...
    for (i = 0; i < 3; i++) {
        ...
        var909 = method301(var606, var467);    // measurements 2, 3, 4
        var509 = method302(var613, var720);
        ...
    }
    return;
}
...
int method301(int var894, int var208) {
    if (var894 > var208)
        return var894 * var208;
    else
        return var208 * var894 + 1;
}

int method302(int var931, int var434) {
    if (var931 > var434)
        return var931 - var434;
    else
        return var434 - var931 + 1;
}
...
Listing 4.4: Example assignment of variables to methods in complex benchmark

It is likely that a complex method operates on different variables and that the benchmark takes different paths during each measurement. It is assumed that the number of executions of short and long paths is balanced. It is further assumed that the "randomness" is sufficient, so that there is no partial impact on a measurement's or a benchmark element's results. The randomness is based on the random number generator of the .NET framework, which is assumed to provide a uniform distribution. The .NET framework runs the adaptable code generator. In order to reduce caching effects on the benchmark results, all CPU caches and the file system caches are invalidated or polluted before each measurement. Nevertheless, an "ambient noise" of execution times is assumed, which is caused by, e.g., non-maskable system management interrupts. The benchmark results represent a sample of the population of the execution times.

57 4 Allocation of Native Code

The test hardware, which is used for the tests in this thesis, has the following specification:
• CPU: Intel Atom [email protected] GHz (single core, no Hyper-Threading)
• 1.0 GiByte main memory
This test system is referred to as "IA-32" in the following. It runs a Linux operating system with "Real-Time Preempt" patches [74] applied (kernel version 2.6.33.5-rt23-v1). The tests are run at high operating system priority, i.e., real-time priority 79 on Linux. That avoids interference by operating system services and other applications. Standard interrupt execution is threaded. These interrupt threads have priority 50 in the patched Linux; the corresponding interrupt service routines merely enqueue them. The IA-32 platform is used to perform the timing benchmark tests of MonoRT and to perform comparative tests with other virtualizing runtime environments. A second test system, which is used for tests in this thesis, has the following specification:
• CPU: Broadcom BCM2835 System-on-Chip with ARM1176JZF-S core@700 MHz (single core)
• 0.5 GiByte main memory
This test system is referred to as "ARM" in the following. Hardware with this specification is better known as the "Raspberry Pi" (model B) [36]. It runs a Linux operating system without "Real-Time Preempt" patches applied. Its kernel version is 3.11.10. The ARM platform is used for timing benchmark tests of Mono's Full-AOT mode (see section 4.1.2) and to evaluate the concept of "checkpoint / restore" for startup time optimization, which is described in chapter 6.
The simple and complex benchmarks are compared to each other in order to select the variant that is used in the following. They are implemented in C++ because it does not suffer from overhead introduced by a VM as a matter of principle. This reference implementation represents the achievable determinism and execution time level. Listing 4.5 shows the access pattern of the methods.
There is one class that defines the 1000 methods. The methods are called through an instance of their defining class. The benchmark only considers the time related to method execution, not the instance creation.

class Test {
    static int Main() {
        // initialization
        Test myClass = new Test();
        // begin first measurement
        var1 = myClass.method0(var2, var3);
        var3 = myClass.method0(var5, var6);
        //...
    }
}
Listing 4.5: Pseudo code pattern of instance method benchmark

2 The higher the value, the higher the priority.


Each benchmark is run 2000 times to get a sufficient sample size of observed execution times. Table 4.2 lists the benchmark results. The sample mean is denoted by X̄ in Table 4.2 and is calculated by Equation 4.1, where n = 2000 and the Xi are the observed execution times. The sample mean is denoted as "average" in the following.

Variant    Meas.   X̄ [µs]      Min [µs]   Max [µs]
Simple     1st     142,1535    130        190
           2nd     141,9805    129        200
           3rd     140,5695    130        190
           4th     142,1795    130        199
           all     141,7208    129        200
Complex    1st     153,373     136        210
           2nd     153,374     137        211
           3rd     153,74      137        213
           4th     153,4225    128        215
           all     153,4774    128        215

Table 4.2: Results of the simple and complex benchmark variant written in C++, IA-32

\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i \qquad (4.1)

The average execution time of the simple variant is lower than that of the complex variant, as expected. The simple variant has a slightly smaller execution time span than the complex variant (71 µs vs. 79 µs). The ratio of execution time span to overall average execution time is 0,501 and 0,515, respectively. That is, the simple variant performs marginally better. Figure 4.3 shows the corrected sample standard deviation s of the execution times for each run. This is a measure of how the observed execution times Xi jitter around the sample mean X̄. The higher the sample standard deviation s, the higher the jitter. The execution times are taken from a sample, not the whole population, and the population mean µ is unknown. So, the corrected sample standard deviation is used, which is calculated by Equation 4.2.

s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} \left(X_i - \bar{X}\right)^2} \qquad (4.2)

The standard deviations of both variants are close together. The complex variant's standard deviation is a little higher than the simple variant's, but it is not materially different. The average values make no statement about the distribution of execution times and the "ambient noise". Figure 4.4 shows the frequency distribution of the observed execution times. Both variants have a similar distribution. They suffer from an accumulation of execution times that is circa 50 µs away from the main area. This affects the sample mean. It is assumed that this is caused by hardware interferences like non-maskable system management interrupts, because the interval is the same for both variants. The simple benchmark's execution times are a little more "dense" than those of the complex


[Figure: bar chart of the corrected sample standard deviation in µs for measurements 1–4 and overall, simple vs. complex benchmark; all values lie between circa 13,7 and 16,8 µs]

Figure 4.3: Standard Deviation of the simple and complex benchmark variant written in C++, IA-32

[Figure: histogram of the number of observed execution times (y-axis, 0–3000) over observed execution time in µs (x-axis, 129–209), simple vs. complex benchmark]

Figure 4.4: Frequency Distribution of the observed execution times of the simple and complex benchmark variant written in C++, IA-32

variant. A statistical comparison of the frequency distributions is difficult. First, no distribution function can be assumed, due to the accumulation of outliers. Statistical tests, like the Kolmogorov–Smirnov test or the Chi-square test, can check whether both distributions differ significantly or not. However, the distributions are shifted, so that they would have to be normalized in order to create unique test categories. This is difficult as the mathematical average is not known due to the unknown distribution function. Another solution might be the "isolation" of the main aggregation using the median. Then, a normal distribution could be assumed because the execution time distribution looks like a bell curve. This complex analysis is omitted; it is beyond the scope of this thesis. Both distributions look similar. It is assumed that the simple variant provides slightly more deterministic results. On the other hand, the simple methods have a structure that might easily be inlined by compilers, in contrast to the complex methods. It is assumed that the difference in jitter between the simple and complex benchmark variant is far less than the overhead introduced by a VM. It is further assumed – and validated by the evaluation in this section – that the "randomness" is sufficient, so that there is no partial impact on a measurement's or a benchmark element's results. Even if, contrary to expectations, there is a negative effect, it is definitely lower than the overhead introduced by the VM. This is accepted in order to avoid method inlining, which would hide the VM overhead entirely. For these reasons, the complex benchmark variant is used for evaluation in the following.

4.5 Experimental Results and Evaluation

This section presents only one benchmark out of the series of benchmarks. That is sufficient to reveal the problem, the interim solution and the further steps. Listing 4.5 shows the access pattern of the methods. This access pattern is also used for the evaluation of the simple and complex benchmark variant in section 4.4. The methods are called through an instance of their defining class. Table 4.3 lists the benchmark results of Mono, of MonoRT and of the C++ implementation on the IA-32 platform. Table 4.4 provides the results of the ARM platform. As in section 4.4, the sample mean is denoted by X̄ in Table 4.3 and is calculated by Equation 4.1, where n = 2000 and the Xi are the observed execution times. The sample mean is denoted as "average" in the following.

The execution times of the C++ variant are at the final level of around 153 µs (on IA-32) right from the start. There is no significant overhead, so the execution times do not suffer from high jitter. The benchmark exhibits a three-stage execution time behavior of Mono's JIT compilation execution mode. As expected, the very first call of the methods takes longest: the sample mean is 236435,63 µs. It includes the methods' JIT compilation, reference handling and execution. The second measurement (7125,20 µs on average) includes the methods' execution and reference handling. It takes into account methods that are already JIT compiled but are called through different call sites. The third and fourth run measure the methods' pure execution time. This is the expected execution time level of around 145 µs on average. It is even slightly better than the reference solution written in C++. Repeated executions do not result in a lower


Execution Mode           Meas.   X̄ [µs]       Min [µs]   Max [µs]
C++                      1st     153,373      136        210
                         2nd     153,374      137        211
                         3rd     153,74       137        213
                         4th     153,4225     128        215
Mono,                    1st     236435,63    230446     487545
JIT Compilation          2nd     7125,20      7071       7222
                         3rd     144,56       134        189
                         4th     144,77       135        191
Mono,                    1st     23591,42     18223      134954
AOT mode,                2nd     18429,51     12944      35205
IA-32                    3rd     151,61       140        197
                         4th     151,07       136        193
MonoRT,                  1st     8114,75      8019       8268
JIT-based                2nd     7901,51      7790       8047
pre-compilation          3rd     145,7        133        193
                         4th     145,95       134        192

Table 4.3: Observed execution times of 1000 methods of the C++ variant and the C# variant using different execution modes of Mono and MonoRT, IA-32

execution time. Figure 4.5 shows the frequency distribution of the observed execution times of the third and fourth measurement of Mono in JIT mode.

[Figure: histogram of the number of observed execution times (y-axis, 0–800) over observed execution time in µs (x-axis, 134–184)]

Figure 4.5: Frequency Distribution of the observed execution times of the third and fourth measurement of Mono in JIT mode, IA-32

The distribution is very similar to that of the reference implementation, see Figure 4.4. There is also a second accumulation of observed execution times. That is, the execution time behavior of applications that are run by Mono is comparable to that of the

reference implementation in C++. Development objective O.3 (performance) can be met by VM-based execution in principle. Mono's AOT compilation execution mode on IA-32, which loads pre-compiled native code lazily, exhibits a similar behavior. However, the difference between the first and second measurement is smaller than in JIT compilation mode. The final execution time level (circa 151 µs) is slightly higher than that of the JIT compilation mode (circa 145 µs). The overhead of the first and second measurement, compared to the third and fourth measurement, is caused by the lazy loading of the native code, by reference handling and by JIT compilation of Mono-specific internal helper methods. These helper methods are not covered by Mono's normal AOT mode. The new JIT-based pre-compilation mode on the IA-32 platform exhibits a two-stage execution time behavior. The first measurement is nearly at the same level as the second measurement (8114,75 µs and 7901,51 µs on average). This in turn is similar to the second measurement of the JIT compilation mode, where the native code is already allocated. The final execution time level is shown by the third and fourth measurement of this benchmark, see Table 4.3. It is at circa 146 µs, so it is similar to that of the JIT compilation mode and to the reference implementation in C++. The JIT-based pre-compilation obviously eliminates the direct impact of JIT compilation. It does not eliminate the impact of the lazy compilation principle, which is related to JIT compilation. The overhead of the first and second run is presumably caused by reference handling. Figure 4.6 shows the corrected sample standard deviation of the execution times for each run, which is calculated by Equation 4.2. Mono reaches standard deviations that are

[Figure: corrected sample standard deviation of observed execution times in µs per measurement 1–4, logarithmic scale; values as extracted — C++: 16,37 / 16,05 / 16,08 / 16,12; JIT: 8714,08 / 20,62 / 10,69 / 11,08; AOT: 3385,68 / 2149,18 / 12,17 / 11,58; Pre-Compilation: 39,29 / 38,81 / 11,12 / 11,56]

Figure 4.6: Standard Deviation of execution times of 1000 methods (note the logarithmic scale), IA-32

comparable to and even better than that of the reference implementation, which is written

in C++, at the final execution time level. MonoRT's pre-compilation mode's results are slightly worse than those of Mono's JIT compilation mode, but still very similar. That is, the code allocation overhead caused by the VM is eliminated at this point. It is possible to reach the performance and determinism of conventional programming languages. The benchmarks also reveal the problem that is described in section 1.2. The first two measurements of the benchmark run by Mono exhibit a worse standard deviation. This non-determinism is the crucial point for real-time systems. From a real-time point of view, the final execution time level has to be reached at the first execution of the methods. Table 4.4 shows the benchmark results of the ARM platform. They are presented here

Execution Mode           Meas.   X̄ [µs]       Min [µs]   Max [µs]
C++                      1st     657,213      629        1081
                         2nd     649,813      625        844
                         3rd     648,538      625        843
                         4th     650,323      625        848
Mono,                    1st     570869,536   552068     600322
JIT Compilation          2nd     28090,882    26500      29817
                         3rd     468,573      439        662
                         4th     468,86       442        651
Mono,                    1st     4241,62      3085       5936
Full-AOT mode            2nd     2612,581     1513       4268
                         3rd     473,133      438        813
                         4th     471,448      439        692
MonoRT,                  1st     31906,667    30319      33561
JIT-based                2nd     31909,545    30435      33683
pre-compilation          3rd     475,657      443        705
                         4th     477,812      437        704

Table 4.4: Observed execution times of 1000 methods of the C++ variant and the C# variant using different execution modes of Mono and MonoRT, ARM

in order to show that Mono's behavior is not platform-specific and in order to evaluate Mono's Full-AOT mode, which is available for ARM. The results confirm the timing behavior of Mono that was determined on the IA-32 platform. The reference implementation's execution times are at a level of circa 650 µs right from the start and do not suffer from high overhead. Mono's JIT compilation mode shows a three-stage behavior like on IA-32. The final execution time level of circa 468 µs is reached at the third measurement. Mono provides a Full-AOT mode on the ARM platform. Like the normal AOT mode on IA-32, it pre-compiles the intermediate code to native code in an extra deployment step. The VM loads this native code at run time of the application, so that the JIT compiler does not have to be invoked. In contrast to IA-32's AOT mode in this version of Mono, it covers nearly all Mono-specific internal helper methods. It is even possible to execute the benchmark code without invoking the JIT compiler, so this execution mode is of special interest here. This might not be possible for other use cases, see section 4.1.2. Nevertheless, the lazy loading of native code causes overhead too, so that the first measurement takes around 4241,62 µs. This is much less than the first execution in JIT

compilation mode, but the final execution time level is only about a ninth of it (circa 473 µs). The second measurement, which takes 2612,581 µs on average, also reveals overhead. That is, Mono's Full-AOT mode does not provide execution times that are suitable for real-time systems. The results of the JIT-based pre-compilation mode on the ARM platform confirm that the overhead of JIT compilation can be eliminated by pre-compilation. The first and the second measurement provide results that are on the same level and approximately identical to the second measurement in JIT compilation mode. Like on IA-32, the execution times at the final level are slightly higher than in JIT compilation mode.

An extensive evaluation of the results for ARM, like the one done for IA-32, is omitted at this point. It would not yield new findings. However, it raises the question of how to compare and rate the results of different solutions and systems with respect to determinism. In order to rate the determinism of a solution, i.e., development objective O.1, the benchmark has to be considered as a whole and independently of the distribution of the results. So, the results of the four measurements per benchmark have to be compared. The basic assumptions for the rating of the determinism are:
1. The conditions, e.g., influence by hardware and OS, are equal for each measurement.
2. The more uniform the measured values are compared to each other, the higher the determinism.
The first assumption is made due to the comparison of the measurements' frequency distributions of observed execution times. They are very similar on a particular system, e.g., on IA-32 running Linux, so that equal conditions are assumed. The second assumption is made due to the fact that the overhead related to code allocation is magnitudes higher than the outliers inside the benchmarks' results. Table 4.3, Figure 4.3 and Figure 4.4 confirm this. Given the first assumption, a solution cannot be more deterministic than made possible by external conditions like the operating system or hardware effects. Ambient noise is taken for granted. The results have to be comparable even if these conditions are equal per measurement but not per benchmark, for example, in cases where the observed execution times have different frequency distributions per benchmark. The measurements' average observed execution times and standard deviations are used to rate the determinism. The execution time span is not suitable for this purpose because a single outlier, which can be caused by external effects, affects the result heavily. The standard deviation is the better choice.
It gives information about the variations of the observed execution times while it "smoothes" occasional outliers. Two custom measures are introduced for this purpose: the Ratio of Ordered Execution Times R_ET (ROET) and the Ratio of Ordered Standard Deviations R_SD (ROSD). R_ET and R_SD are calculated by Equation 4.3 and Equation 4.4, respectively.

R_{ET} = \prod_{i=1}^{n-1} \frac{X_i}{X_n} \qquad (4.3)

R_{SD} = \prod_{i=1}^{n-1} \frac{s_i}{s_n} \qquad (4.4)


X_i and s_i are the i-th highest sample mean and corrected standard deviation, respectively, of the observed execution times of a benchmark. n is the number of comparative values, which corresponds to the number of measurements in the benchmark used here. R_ET and R_SD are the products of the ratios of how the measurement parameters change during the benchmark run. They make a statement about the variance of the measurement parameter. A value of 1 indicates that the measurement parameter does not change during the benchmark, regardless of the number of comparative parameters. This is the optimum. The higher the value, the more non-deterministic the underlying measurement parameter. So, R_ET and R_SD are independent of the unit of the measurement parameter and the number of comparative values. They are dimensionless numbers, which are used in this thesis for comparison only. The number of "steps" that have to be executed until the final level of execution time or standard deviation is reached also counts. This requires that the benchmark covers all cases where the measurement parameter can change, and it is assumed that the benchmark covers all cases. A three-stage execution time behavior is rated worse than a two-stage behavior. On the other hand, additional comparative values on the final level do not worsen the result. R_ET and R_SD allow a comparison of different solutions at a glance. The comparative values have to be ordered

first because each variance of the measurement parameters contributes to R_ET and R_SD. If the values were not ordered descending, a factor could be less than one. This would improve the result although the change is actually negative. For example, R_ET for the reference implementation running on the IA-32 platform is calculated as follows (compare to the values in Table 4.3):

R_{ET} = \frac{153{,}74\,\mu s}{153{,}373\,\mu s} \cdot \frac{153{,}4225\,\mu s}{153{,}373\,\mu s} \cdot \frac{153{,}374\,\mu s}{153{,}373\,\mu s} \cdot \frac{153{,}373\,\mu s}{153{,}373\,\mu s} = 1{,}002723 \qquad (4.5)

Table 4.5 shows the remaining results of R_ET and R_SD for IA-32. MonoRT's pre-

       C++        JIT mode      AOT mode      Pre-Compilation
R_ET   1,002723   80733,68621   19117,23126   3025,460284
R_SD   1,026332   1627,68076    57007,39142   12,819291

Table 4.5: R_ET and R_SD of 1000 methods of the C++ variant and the C# variant using different execution modes of Mono 2.6.1 and MonoRT, IA-32

compilation provides a gain of determinism compared to the execution modes JIT and AOT. It cannot yet keep up with the reference implementation. These results also confirm that a solution that would always discard JIT compiled or lazily loaded code is not suitable. The first and second measurement in pre-compilation mode also suffer from high non-determinism, and only the execution of native code performs well. Table 4.6 shows the R_ET and R_SD of the ARM platform, in order to confirm the results of IA-32 and in order to demonstrate the expressiveness of the custom measures. MonoRT's pre-compilation provides a gain of determinism compared to the JIT execution mode on ARM as well. However, the Full-AOT mode provides more deterministic results, because the pre-compilation mode suffers from high execution times of the first and second measurement.


       C++     JIT mode    Full-AOT mode   Pre-Compilation
R_ET   1,018   73082,574   50,036          4520,415
R_SD   1,238   3147,299    15,502          152,632

Table 4.6: R_ET and R_SD of 1000 methods of the C++ variant and the C# variant using different execution modes of Mono 2.6.1 and MonoRT, ARM

Figure 4.7 shows the startup times of Mono and MonoRT on IA-32 and ARM using different execution modes, in order to evaluate development objective O.4. The JIT

[Figure: startup times in ms — JIT: 918,0455 (IA-32), 1669,993933 (ARM); AOT / Full-AOT: 170,11 (IA-32), 235,361 (ARM); Pre-Compilation: 21258,174 (IA-32), 23360,476 (ARM)]

Figure 4.7: Startup time of 1000 methods

compilation mode has a startup time of around 1 second on IA-32 and 1.6 seconds on ARM. The JIT compilation of the entry point method is crucial. The AOT mode's startup time on IA-32 and the Full-AOT mode's startup time on ARM, respectively, cover VM initialization and the loading of AOT compiled code. As expected, MonoRT's pre-compilation mode has the longest startup time, which is around 20 times the JIT mode's startup time.


5 Patching of Native Code

5.1 Lazy Compilation and Reference Handling

Section 4.2 introduced a concept for real-time suitable code generation in virtualizing runtime environments. It relies on a full-featured VM that is augmented with real-time suitable code generation. Sections 4.3.1 and 4.3.2 described the principle and implementation of a JIT compiler based pre-compilation of intermediate code. This allows for a higher execution time determinism compared to the JIT compilation execution mode, because the direct impact of the lazy compilation principle on the execution times is eliminated. However, this does not reach the determinism of the reference implementation written in C++, see section 4.5. Table 4.3 shows that the third and fourth measurement of MonoRT in pre-compilation mode reveal execution times that are substantially shorter than those of the first and second measurement, although the intermediate code is already JIT compiled or loaded. This temporal non-determinism is due to the lazy compilation principle, which preserves the support for dynamic language features. The JIT compiler, which is used for pre-allocation of native code due to development objective O.3 (performance), introduces this temporal non-determinism. Step 6 of Figure 4.1 shows that the Execution Engine might look up code at run time. The pre-compilation avoids cache misses because it "fills" the internal Native Code Cache, but it does not avoid the look-ups. Although they do not cause a JIT compilation, they are responsible for the temporal non-determinism that is revealed in section 4.5. This section describes how this problem, which is introduced by the decision to use the JIT compiler as code allocator, is solved. Listing 2.16 in section 2.4.3 shows the native code of the example application (see Listing 2.13 and Listing 2.14) that was generated by Mono's JIT compiler. Lines 5 and 8 correspond to method calls. They show the situation before the methods func and WriteLine, respectively, are called.
Lines 5 and 8 of Listing 2.17 show the situation after the method calls and the cache look-ups, respectively. The arguments of the call instructions are modified during execution. This is a consequence of lazy compilation. The working principle of the lazy compilation technique is described on the basis of a call to a non-virtual method and illustrated by Figure 5.1. As illustrated by Figure 2.7a and Figure 2.7b, a program, which is available in a HLL like C#, is initially compiled to intermediate code (Figure 5.1, step 1). At startup, the native code of the entry method of the application is generated, e.g., JIT compiled (Figure 5.1, step 2). This is method Main in the example and its native code starts at address 0x010591e8. Thereby, only the currently executed intermediate code is translated. Intermediate code that is referenced is omitted, e.g., that of method func and that of method WriteLine. Their native code is generated on demand, i.e., when it comes to execution. This is the lazy compilation principle. So, the call instruction in a JIT compiled piece of code cannot point to not yet generated native code. In the example, the call instruction to func points to address


[Figure: C# code (program.cs) is compiled to CIL (program.exe, step 1); Main is JIT compiled (step 2); its call to func initially targets a Specific Trampoline (step 3), which jumps to the Generic Trampoline (step 4) and on to the Magic Trampoline (step 5); the Magic Trampoline compiles func (step 6) and patches the call site via mono_arch_patch_callsite (step 7); control returns through the Generic Trampoline (step 8) to the callee (step 9), which returns to the caller (step 10)]

Figure 5.1: Call of a lazily compiled method in Mono

0x1059268, see method Main (a) in Figure 5.1 and line 5 of Listing 2.16. This is the address of a tiny piece of native code that is generated during JIT compilation of Main. It is called a Specific Trampoline. When the program flow reaches the call to func in step 3 of Figure 5.1, the Specific Trampoline is executed. This type of Trampoline is architecture-dependent and method-specific. It pushes a pointer to a Mono-internal data structure that identifies the code to be called onto the stack ("funcDesc" in Figure 5.1) and jumps to another Trampoline, the Generic Trampoline (step 4 of Figure 5.1). It is also architecture-dependent and is generated during JIT compilation. The Generic Trampoline saves the caller's execution context and branches to a Mono-internal function, the Magic Trampoline (step 5 of Figure 5.1). While the other Trampolines are emitted as native code, the Magic Trampoline is programmed in a HLL and is part of the Execution Engine. It is responsible for providing the native code for the intermediate code that is called, e.g., by JIT compilation or by loading AOT compiled code, as described in section 2.4.2. The other Trampolines are not an inherent part of Mono because they are generated at code generation time. They are "callbacks" that direct the program flow back to the VM's Execution Engine. The Magic Trampoline is the top-level Trampoline in the hierarchy. It performs the Native Code Cache look-up (step 6 of Figure 4.1) and possibly initiates JIT compilation or loading of AOT compiled code. When the callee's address is known, the caller's native code is modified to refer to the final address, see step 7 of Figure 5.1. The argument of the caller's call instruction is "patched" to the callee's address, so that the Trampolines' code is not executed again for the same native call instruction. That is, the overhead of Trampoline execution occurs at most once per call instruction.
In the example, the argument of the call to func is changed from 0x1059268 to 0x1059328 in Main (b) of Figure 5.1 (see also Listing 2.17). After the modification of the caller's code, the Magic Trampoline returns to the Generic Trampoline (step 8 of Figure 5.1), which restores the caller's context and manipulates the stack, so that the


Generic Trampoline returns to the callee, see step 9. In step 10, the called method returns to the caller without being aware of the complex actions in between. As Figure 5.1 indicates, Mono's Trampolines form a hierarchy. The Specific Trampolines are specific to the callee, e.g., a certain method. The Generic Trampolines also come in several flavors. Figure 5.1 shows the flow of a "JIT Trampoline" that handles normal method calls. There are also variants for calls to virtual methods, interface methods, type initializers, jump instructions, AOT compiled methods, synchronization primitives, delegates, generic language features and error handling functionality. [60] describes Mono's Trampolines in more detail. Summarized, a Trampoline performs the cache look-up that is illustrated in step 6 of Figure 4.1 at run time and causes one-time overhead that is not suitable for real-time systems. Section 4.5 reveals this overhead impressively.

5.2 Concept

This section presents a concept to avoid the overhead of reference handling at run time in order to reach real-time suitable execution times. According to design decision D.3 of section 4.2, the pre-compiled code is transformed so that it does not contain "indirect" references, i.e., references that point to and invoke code and reference handling mechanisms of the VM. The goal is to allow an execution of an application without using services of the Execution Engine of the VM. Figure 5.2 illustrates the interplay of pre-compilation and the patching of native code so that it does not contain indirect references. The illustration is based on the example application that is also used in section 5.1. The patching of the native code was first introduced in [33]. The approach, which is called "Pre-Patch", pre-executes the Trampolines at startup after the allocation of native code, i.e., pre-compilation. So, all indirect references are resolved and the native code is patched completely. It is not necessary to execute Trampolines at run time. An alternative would be to omit the patching of the native code at step 7 of Figure 5.1. The Trampolines would then be executed every time. This is no appropriate solution as it does not meet development objective O.3 (performance). Additionally, section 4.5 reveals a higher standard deviation if the complex Trampoline mechanism is executed. That contradicts O.1 (determinism).

5.3 Implementation

Reference handling is not standardized, so the implementation presented here is specific to the Mono VM and hardware dependent. Pre-Patch uses VM-internal functionality that is not (intended to be) provided to application programmers, so it is implemented by modifying the VM's source code. Pre-Patch is realized by recording the indirect references that are emitted during pre-compilation. After that, each indirect reference, i.e., each call site that points to a Trampoline, is triggered manually. The triggering sequence depends on the type of the Trampoline the reference points to, because the Trampoline types expect different stack layouts. These sequences were determined through analysis of native code that is generated by the JIT compiler.

Figure 5.2: Interplay of Pre-Compilation and Pre-Patch (the C# compiler produces CIL in program.exe; Pre-Compilation, based on Mono's JIT compiler, emits native code whose call sites hold indirect references to Specific Trampolines; Pre-Patch pre-executes the Trampolines and replaces the indirect references with direct ones)

The Pre-Patch mechanism starts at step 3 of Figure 5.1. Trampolines work transparently to the caller and the callee, so executing a stock Trampoline would start the execution of the indirect reference's target. For that reason, a modified version of the Generic Trampoline of each type is generated at Pre-Patch initialization. That modified version does not return to the callee in step 9 of Figure 5.1; instead, it returns to the caller, i.e., the Pre-Patch code. This is realized by modifying the stack pointer. The indirect reference or Specific Trampoline has to be redirected to the modified Generic Trampoline. It is a characteristic of Mono that the Trampolines are woven into the code generation process. That is, more native code with unresolved references can be emitted during the execution of a Trampoline, and type initializers can be run. Hence, indirect references are also recorded during Pre-Patch, and Pre-Patch continues until there are no unprocessed indirect references. For the same reason, the Specific Trampolines themselves are not modified, to avoid side effects. Instead, a modified version of the Specific Trampoline, which points to the modified Generic Trampoline, is generated, and the indirect reference is modified so that it points to the modified Specific Trampoline. Then the pre-execution of the Trampoline mechanism is started. Steps 4 to 8 of Figure 5.1 are executed automatically, step 9 returns to the Pre-Patch code, and step 10 is omitted. When the Trampoline execution returns to the Pre-Patch code, the next indirect reference is processed, until all are processed.
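Because resolving one reference can emit new code containing further unresolved references, Pre-Patch is a fixed-point iteration over a worklist. The following sketch simulates this with plain arrays; the discovery rule inside `resolve()` (method r calls method r+1) is hypothetical:

```c
#include <assert.h>

#define MAX_REFS 64

static int worklist[MAX_REFS];   /* indices of unresolved references */
static int head, tail;           /* consume at head, record at tail  */
static int patched[MAX_REFS];    /* 1 once a reference is resolved   */

static void record_reference(int ref)
{
    worklist[tail++] = ref;
}

/* Resolving reference r may discover further references. Here a
 * hypothetical rule stands in for code emitted during Trampoline
 * execution: "method r calls method r+1, up to a limit of 5". */
static void resolve(int ref)
{
    patched[ref] = 1;
    if (ref + 1 < 5 && !patched[ref + 1])
        record_reference(ref + 1);
}

/* Process until no unprocessed indirect references remain.
 * Returns the number of references resolved. */
static int pre_patch(void)
{
    int n = 0;
    while (head < tail) {
        resolve(worklist[head++]);
        n++;
    }
    return n;
}
```

Starting from one recorded reference, the loop terminates only when the worklist is drained, mirroring the "continue until there are no unprocessed indirect references" behavior described above.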

Pre-Patch can cause the generation of native code, and type initializers can be run in the scope of this. Therefore, Pre-Patch is started from managed code in order to catch exceptions and to be able to continue Pre-Patch. This approach is also used for pre-compilation; it is described in section 4.3.2 and illustrated by Figure 4.2. The Pre-Patch implementation

uses the "Mono embedded API" to call intermediate code from C code. The intermediate code calls Mono-internal C functions that actually perform Pre-Patch via the "internal call" interface.

Figure 5.1 illustrates the flow for a JIT Trampoline. Further Trampoline types exist, including Trampolines that handle calls of methods via the Virtual Table (VTable) and via the Interface Method Table (IMT) of a class, see section 5.1. Both are special cases of a JIT Trampoline. Details about the IMT are described in [59]. Figure 5.3 illustrates the Pre-Patch of a method call via the VTable of a class.

Figure 5.3: Pre-Patch of a method call through the Virtual Table (call sequence of a virtual method, the VTable and IMT of a class, and the Magic Trampoline that patches the VTable slot)

Mono's JIT compiler initiates a call to a virtual method via a specific calling sequence, which is shown in the top left of Figure 5.3. The address of the VTable is expected to be held in a certain register. The call instruction's argument is an offset that determines a slot within the VTable. In the example, VTable slot number 2 is chosen, see step 1 of Figure 5.3. Each VTable slot initially contains a jump to the address of a VTable Trampoline (step 2). This is a special type of Specific Trampoline that identifies a VTable slot in general rather than, e.g., a certain method. The program flow reaches a Generic Trampoline, which calls the Magic Trampoline (steps 3 and 4). The Magic Trampoline determines the native code of the callee and modifies the VTable slot (step 5 of Figure 5.3), so that the slot points to the callee directly. The call target is determined by the stack content. The Magic Trampoline returns in step 6. Like the Pre-Patch of JIT Trampolines, the Generic Trampoline for VTable calls has to be modified in advance, so that it returns to the caller in step 7 instead of to the callee. This modification enables Pre-Patch. Special care has to be taken that the stack content is not modified.


The Pre-Patch of an interface method works similarly, as illustrated by Figure 5.4.

Figure 5.4: Pre-Patch of a method call through the Interface Method Table (calling sequence of an interface method and the Magic Trampoline that patches the IMT slot)

Mono combines IMT and VTable. So, an IMT call has a negative offset (step 1 of Figure 5.4), whereas a VTable call has a positive offset. Again, Specific and Generic Trampolines are used to call the Magic Trampoline, which patches the IMT slot (step 5) so that it contains the callee's address. The Generic Trampoline is modified so that it returns to the caller (step 7). In summary, to enable Pre-Patch of the VTable and IMT, it is necessary to modify the Generic Trampolines and to emit native code that triggers the execution of the Trampolines manually. VTable and IMT are Mono-internal data structures that are set up during pre-compilation, so Pre-Patch runs after pre-compilation.
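The combined IMT/VTable dispatch can be sketched as follows. This is a simplified model under the assumption that interface slots are stored directly before the virtual-method slots of one table and selected by the sign of the offset; the slot counts and methods are made up:

```c
#include <assert.h>

typedef int (*method_fn)(void);

static int virt_method(void)  { return 1; }
static int iface_method(void) { return 2; }

#define IMT_SLOTS 4
#define VT_SLOTS  4

/* IMT slots at indices 0..3, VTable slots at indices 4..7;
 * 'base' points at the first VTable slot, so IMT slots sit at
 * negative offsets and VTable slots at positive ones. */
static method_fn table[IMT_SLOTS + VT_SLOTS];
static method_fn *base = &table[IMT_SLOTS];

static void setup(void)
{
    for (int i = 0; i < IMT_SLOTS; i++)
        base[-1 - i] = iface_method;   /* negative offsets: IMT    */
    for (int i = 0; i < VT_SLOTS; i++)
        base[i] = virt_method;         /* positive offsets: VTable */
}

/* One call-site pattern, call *offset(base); the sign of the
 * offset decides which kind of slot is used. */
static int dispatch(int offset)
{
    return base[offset]();
}
```

In the real VM the slots would initially hold Trampoline addresses and be patched as described above; here they are filled directly to keep the layout idea visible.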

The Pre-Patch mechanism described so far works at initialization time. The CLI provides the programming feature "Delegate". A Delegate defines a method signature, and a method can be assigned to a Delegate. The Delegate then acts on behalf of the method. Delegates can be created at run time, so that the address is not known offline. The assignment of a method to a Delegate is also done lazily by means of Trampolines. This can lead to non-deterministic execution times of a method upon its first call. The Pre-Patch of Delegate Trampolines is therefore triggered at Delegate creation time, i.e., at the assignment of a method to a Delegate. That is, the Delegate creation has a non-deterministic execution time, whereas the call to the method does not suffer from one-time overhead. While pre-compilation and Pre-Patch are realized by modification of the VM, the Pre-Patch of Delegates requires modification of managed code. In particular, the standard library function that creates multi-cast Delegates was modified. A multi-cast Delegate is a Delegate which can have several elements in its invocation list. Its initialization is also done lazily, but Mono uses a slightly different mechanism than for normal Delegates. It was necessary to use an "internal call" to trigger the Delegate Pre-Patch manually. This contradicts development objective O.2. However, Delegate creation introduces non-determinism by itself, so it is questionable to use Delegates in time-critical code anyway.
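The shift of the one-time cost from first invocation to creation time can be sketched as follows. This is an illustrative model of a Delegate as a bound function pointer; all names are made up:

```c
#include <assert.h>
#include <stddef.h>

typedef int (*target_fn)(int);

struct delegate {
    target_fn invoke;   /* resolved target; NULL while unbound   */
    target_fn target;   /* method assigned to the delegate       */
};

static int resolutions; /* counts (expensive) binding operations */

static void bind(struct delegate *d)
{
    resolutions++;      /* stands in for Trampoline execution    */
    d->invoke = d->target;
}

/* Pre-Patch style: bind at creation time, not at first call, so
 * the one-time cost is paid when the delegate is created. */
static struct delegate delegate_create(target_fn m)
{
    struct delegate d = { NULL, m };
    bind(&d);
    return d;
}

static int delegate_invoke(struct delegate *d, int arg)
{
    if (d->invoke == NULL)  /* never taken after eager creation  */
        bind(d);
    return d->invoke(arg);
}

static int twice(int x) { return 2 * x; }

/* Creates a delegate and invokes it twice; returns the number of
 * bindings performed (1: only at creation), or -1 on error. */
static int demo(void)
{
    resolutions = 0;
    struct delegate d = delegate_create(twice);
    int r = delegate_invoke(&d, 21) + delegate_invoke(&d, 0);
    return (r == 42) ? resolutions : -1;
}
```

With lazy binding, the `if (d->invoke == NULL)` branch would be taken on the first invocation instead; eager creation makes every invocation take the same path.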

5.4 Experimental Results and Evaluation

This section presents the results of the benchmark that is also used in section 4.5, which provides a good understanding of the benefit of Pre-Patch. The methods are called via an instance of their (only) defining class; Listing 4.5 shows the access pattern of the methods. Table 5.1 lists the benchmark results of MonoRT in pre-compilation and Pre-Patch mode and of the C++ implementation on the IA-32 platform. Pre-Patch is implemented for IA-32 only. The IA-32 platform is supported by numerous real-time solutions that come into question for a comparative evaluation of MonoRT, so a re-implementation for the ARM platform would not yield new findings. As in section 4.5, the sample mean is denoted by X in Table 5.1 and is calculated by Equation 4.1, where n = 2000 and the Xi are the observed execution times.

Execution Mode     Meas.  X [µs]    Min [µs]  Max [µs]
C++                1st    153,373   136       210
                   2nd    153,374   137       211
                   3rd    153,74    137       213
                   4th    153,4225  128       215
MonoRT, JIT-based  1st    8114,75   8019      8268
pre-compilation    2nd    7901,51   7790      8047
                   3rd    145,7     133       193
                   4th    145,95    134       192
MonoRT, JIT-based  1st    146,529   135       195
Pre-Patch          2nd    146,7645  135       197
                   3rd    146,1825  134       190
                   4th    145,9705  133       181

Table 5.1: Observed execution times of 1000 methods of the C++ variant and of the C# variant using the pre-compilation and the Pre-Patch execution mode of MonoRT, IA-32

MonoRT in Pre-Patch mode reaches the final execution time level of around 146 µs right from the start. This is even slightly better than that of the reference implementation written in C++, which meets development objective O.3 (performance). Figure 5.5 shows the frequency distribution of the observed execution times of all four measurements of MonoRT in Pre-Patch mode. The distribution is very close to that of the reference implementation, see Figure 4.4, and almost identical to that of Mono's JIT mode (Figure 4.5). That is, Pre-Patch has no (negative) effect on the execution time behavior of the code, at least in the case presented here. As Pre-Patch is an optimization of native code that uses "built-in" mechanisms, this is not unexpected. The behavior of the execution times of applications,

which are run by MonoRT, is comparable to that of the reference implementation in C++.

Figure 5.5: Frequency distribution of the observed execution times of all four measurements of MonoRT in Pre-Patch mode, IA-32

Figure 5.6 and Table 5.2 evaluate development objective O.1 (determinism). The former shows the corrected sample standard deviation of the execution times for each run, which is calculated by Equation 4.2.

Figure 5.6: Standard deviation of observed execution times of 1000 methods, IA-32 (values for measurements 1 to 4, in µs: C++ 16,07790784 / 16,11598543 / 16,37300943 / 16,04879359; Pre-Compilation 39,28526971 / 38,80981747 / 11,11958724 / 11,56002219; Pre-Patch 11,16290573 / 11,00743483 / 10,75923205 / 10,52787005)

The corrected standard deviation of MonoRT's Pre-Patch mode is even lower than that of the reference implementation. Section 4.5 introduces the custom measures RET and RSD to obtain an expressive value for rating determinism. Table 5.2 shows the results of RET and RSD for C++ and for MonoRT in pre-compilation and Pre-Patch mode, respectively. Pre-Patch provides a gain of determinism compared to

       C++       Pre-Compilation  Pre-Patch
RET    1,002723  3025,460284      1,005439455
RSD    1,026332  12,819291        1,132982254

Table 5.2: RET and RSD of 1000 methods of the C++ variant and the C# variant using pre-compilation and Pre-Patch mode of MonoRT, IA-32

the execution in pre-compilation mode. Its RET is significantly better than that of the pre-compilation mode, but minimally worse than that of the reference implementation.

Pre-Patch's RSD is also less than one-tenth of that of pre-compilation. Although its corrected standard deviation is generally lower than that of the reference implementation, it jitters more in relation to its mean than C++ does, so its RSD is slightly higher. In summary, there is a great determinism benefit over pre-compilation, which allows MonoRT to meet O.1.

Figure 5.7 shows the startup times of Mono in JIT mode and of MonoRT in pre-compilation and Pre-Patch mode. As expected, the patching of native code in Pre-Patch mode increases the startup time compared to the pre-compilation mode, by around 30 % or about 6 seconds, respectively. The pre-compilation still contributes the major part of the startup time, so that it is subject to further investigation regarding objective O.4.

Figure 5.7: Startup time of 1000 methods, IA-32 (JIT: 918,0455 ms; Pre-Compilation: 21258,174 ms; Pre-Patch: 27371,1725 ms)


6 Optimization of Startup Time

6.1 Interim Analysis

Sections 4.2 and 5.2 introduced a concept for real-time suitable code generation in virtualizing runtime environments, which applies pre-allocation and patching of native code. Briefly speaking, the native code allocation and the handling of indirect references are shifted to the startup phase of an application. That is, the allocation of the resource "code" is deterministic, so that it does not affect code execution. This behavior is suitable for real-time systems. However, these operations increase the startup time of an application compared to the lazy compilation mode. Section 5.4 shows that the startup time increased to about 30 times that of the JIT mode. This is not satisfying with respect to development objective O.4 (startup time). The following sections describe and evaluate two possibilities that reduce the startup time while preserving the deterministic execution times.

6.2 Reduction of Allocated Code

6.2.1 Concept

Section 4.1.2 describes the relation between the code generation technique and the number of indirect references in the generated code. Native code that was generated lazily potentially contains fewer indirect references than ahead-of-time compiled or position-independent code. The evaluation of the approach that incorporates the lazy code generator reveals that, for the benchmark program, pre-compilation contributes more to the startup time than the handling of references. The comparison of Figure 4.7 and Figure 5.7 indicates that the pre-compilation of the intermediate code accounts for around three quarters of the startup time in Pre-Patch mode. As a consequence, it is more promising to reduce the amount of intermediate code to pre-compile than to reduce the number of Trampolines to "pre-patch". A solution that might reduce the amount of time spent on Pre-Patch is to perform the pre-execution of Trampolines per Trampoline and not per call site. A Trampoline would then be used as an indirection, comparable to a table. This approach would reduce the Pre-Patch overhead of the benchmark program by 50 %, because there are two call sites per method. The approach does not save time if there is only one call site per method. Further, the additional indirection can result in a higher execution time. The amount of intermediate code, which has to be considered for pre-compilation and Pre-Patch, can be reduced in two different ways: reducing the amount of intermediate

code at all, i.e., at intermediate code level, and reducing the amount of intermediate code that is actually compiled during pre-compilation, i.e., at native code level. Both possibilities affect the startup time and other development objectives. To reduce the startup time at native code level, the intermediate code is compiled to native code beforehand, i.e., ahead of time. An AOT compiler is used to generate native code in an additional deployment step. The AOT-compiled code is loaded at pre-allocation time of native code, which takes less time than pre-compilation. The native code is subject to Pre-Patch, so that it does not contain indirect references. The offline compilation also takes time and has to be repeated each time the application or any referenced library is updated, but it does not have to be done at each application startup. The AOT compilation time does not contribute to the startup time, as explained in section 1.3. However, the extra deployment step is a tradeoff, because it affects the development objective O.2 (portability) negatively. This approach is implemented and evaluated in the following. Figure 6.1 illustrates the interplay of pre-compilation that utilizes an AOT compiler and the pre-patching of native code in the implementation of MonoRT. AOT compilation usually produces PIC. Sections 2.3.3 and 2.4.3 describe

that PIC usually handles references by tables. So, the Pre-Patch mechanism has to be adapted to work with AOT-compiled code. [34] first introduced the approach of incorporating an AOT compiler and adapting Pre-Patch to handle the AOT-compiled code.

Figure 6.1: Interplay of AOT compiler based Pre-Compilation and Pre-Patch (the C# compiler produces program.exe; the AOT compiler produces program.exe.so; Pre-Compilation loads the AOT-compiled code, whose calls go through the PLT; Pre-Patch replaces the indirect PLT references with direct references)

A code reachability analysis at intermediate code level can reduce the startup time further. The amount of intermediate code that has to be considered for allocation of native code can be minimized. The reachable, i.e., callable, code is identified and a minimal set of

intermediate code is stored persistently in reduced assemblies. These reduced assemblies have to be pre-compiled and patched. This affects development objective O.2 negatively again, because the reachability analysis is an additional deployment step that has to be repeated every time any of the referenced code is updated. In contrast, the general AOT compilation based approach does not require re-compiling all application code; only the updated portion has to be handled again. The compromise between startup time and portability is chosen in favor of startup time. The startup time reductions at native code level and at intermediate code level are independent of each other, but can be combined. Section 6.2.3 provides experimental results and discusses the approaches further.

6.2.2 Implementation

The startup time optimization at native code level is described first. Mono provides an AOT compilation mechanism, see sections 2.4.3 and 4.1.2. The AOT-compiled native code is loaded lazily at run time, when native code is going to be generated, i.e., at step 6 of Figure 4.1. Section 4.1.2 describes that Mono provides two AOT modes. The normal one, which is available on the IA-32 test system, does not AOT-compile Mono-internal helper functions. In "Full-AOT" mode, which is available on the ARM test system, the majority of Mono-internal helper functions is pre-compiled, but not all of them. In cases where native code is not available, Mono has to fall back to the JIT compiler. Actually, the version of Mono that is used aborts the execution of an application if the JIT compiler would be needed. In order to avoid changes to Mono's AOT compilation system, and because the platform-dependent Pre-Patch is currently developed for the IA-32 platform only, the AOT-based Pre-Patch is implemented with MonoRT on the IA-32 platform. That is, the normal AOT mode is used. Section 6.2.3 presents benchmark results. They confirm that this combination is sufficient to prove the concept. Newer versions of Mono provide Full-AOT for IA-32, but also drop support for the .NET 1.0 profile. That is, profiles that support parametric polymorphism would have to be used. These are not supported by pre-compilation and Pre-Patch (see section 4.3), so Mono 2.6.1 and its normal AOT mode is used as the basis. Hence, Mono-internal helper functions still have to be generated by the JIT compiler, but this does not impair the proof of concept of this implementation.

The AOT-based pre-allocation of native code and Pre-Patch work as follows. First, the application assembly, the standard library's assemblies and additional custom assemblies are AOT-compiled in an extra (offline) deployment step. The AOT-based pre-compilation at application startup time does not differ fundamentally from the JIT-based one. Mono's AOT mechanism is on a lower implementation level than the pre-compilation, so loading AOT-compiled code is realized without additional implementation effort. The Pre-Patch is adapted due to the different handling of references. The native code is stored in a file format that is very similar to ELF. Method calls go through the PLT. Figure 2.3 illustrates a method call via the PLT in a conventional native application. Method calls

in Mono's AOT-compiled code work similarly; Figure 6.1 shows details. Call sites that use the PLT point to a specific PLT entry. The PLT entry is specific to the call target and initially contains the address of a Trampoline. Details about the Trampoline mechanism are omitted here, because they are very similar to those described in section 5.3. In the example, the PLT entry of function func points to an AOT Trampoline with the fictive address 0x123456, which was generated during pre-compilation. The Pre-Patch mechanism works similarly to that described in section 5.3. The execution of the call site of each PLT slot is triggered manually. The Trampolines are modified so that they return to the Pre-Patch code. This approach might reduce the number of references to be handled, because they are considered table-wise. So, besides the amount of intermediate code to pre-compile, the number of Trampolines to pre-execute is also reduced under certain circumstances. As Mono runs in normal AOT mode, the references in JIT-compiled code are handled during Pre-Patch as described in the previous chapter.
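The table-wise effect can be sketched as follows. This is an illustrative simulation, not Mono's actual PLT layout; the `entry_in_flight` bookkeeping merely stands in for how a real trampoline identifies its patch site. The point is that one patched entry serves every call site routed through it:

```c
#include <assert.h>

typedef int (*target_fn)(void);

static int func(void) { return 11; }

static int resolutions;

static int plt_trampoline(void);

/* Each PLT entry initially holds the trampoline address. */
#define PLT_ENTRIES 2
static target_fn plt[PLT_ENTRIES] = { plt_trampoline, plt_trampoline };
static int entry_in_flight;          /* entry being resolved (sketch) */

static int plt_trampoline(void)
{
    resolutions++;
    plt[entry_in_flight] = func;     /* patch the PLT entry once      */
    return func();
}

/* A call site routed through a PLT entry. */
static int call_site(int entry)
{
    entry_in_flight = entry;
    return plt[entry]();
}

/* Three call sites all using PLT entry 0; returns the number of
 * resolutions performed (1: one patch served all call sites).     */
static int demo(void)
{
    resolutions = 0;
    call_site(0);
    call_site(0);
    call_site(0);
    return resolutions;
}
```

Contrast this with the per-call-site patching of chapter 5, where each of the three call sites would carry its own indirect reference to pre-execute.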

The startup time optimization at intermediate code level, which is mentioned in section 6.2.1, is realized as follows. A code reachability analysis is performed on the intermediate code in order to reduce the amount of intermediate code to pre-compile and patch. This requires no additional implementation effort, because Mono provides the tool "monolinker". It is used to determine the minimal set of intermediate code (CIL) that is necessary to run an application. It also considers the referenced assemblies and generates reduced versions. The following command analyzes an application assembly and copies the reduced assemblies, including the base library assemblies, to the directory outDir:

$ monolinker -out outDir -c link -a application.exe

The AOT compilation can be applied to the reduced assemblies in the directory outDir. Pre-compilation and Pre-Patch work as described, because the code is reduced at intermediate code level. The reachability analysis time does not contribute to the application's startup time, as described in section 1.3.

6.2.3 Experimental Results and Evaluation

This section presents the results of the benchmark that is also used in sections 4.5 and 5.4. The methods in the benchmark are called via an instance of their (only) defining class, see Listing 4.5. The startup time reduction at native code level, i.e., by incorporating an AOT compiler as code generator, is evaluated first. It has to be evaluated whether the execution time determinism is affected by this. Table 6.1 lists the benchmark results of MonoRT in both Pre-Patch modes and of the C++ reference implementation. Again, the sample mean is denoted by X and is calculated by Equation 4.1, where n = 2000 and the Xi are the observed execution times. Like the JIT-based pre-compilation, the AOT-based pre-compilation of MonoRT reaches the final execution time level, which is about 148 µs, at the first execution. Although the execution times are slightly higher than those of the JIT-based solution, they are still better than those of the reference implementation. That is, development objective O.3 (performance) is still met. It is noticeable that the first

Execution Mode     Meas.  X [µs]    Min [µs]  Max [µs]
C++                1st    153,373   136       210
                   2nd    153,374   137       211
                   3rd    153,74    137       213
                   4th    153,4225  128       215
MonoRT, JIT-based  1st    146,529   135       195
Pre-Patch          2nd    146,7645  135       197
                   3rd    146,1825  134       190
                   4th    145,9705  133       181
MonoRT, AOT-based  1st    143,5275  135       183
Pre-Patch          2nd    148,7635  139       185
                   3rd    149,0615  139       194
                   4th    148,6865  140       189

Table 6.1: Observed execution times of 1000 methods of the C++ variant and the C# variant executed in different pre-compilation and Pre-Patch modes of MonoRT, IA-32

measurement has a lower average observed execution time than the following measurements, which is reflected by Figure 6.2. This even affects the former "anomaly" where

there is a second aggregation of execution times that is around 30 µs higher. In contrast to the behavior of the average observed execution times, the corrected standard deviation of the first measurement is the best of the four measurements (Figure 6.3).

Figure 6.2: Frequency distribution of the observed execution times of all four measurements of MonoRT in AOT-based Pre-Patch mode, IA-32

The custom measures RET and RSD reveal that the AOT-based Pre-Patch mode provides slightly worse determinism than the JIT-based variant, see Table 6.2. While there is an improvement in RSD compared to the JIT-based Pre-Patch, the rating of the execution

Figure 6.3: Standard deviation of observed execution times of 1000 methods, IA-32 (values for measurements 1 to 4, in µs: C++ 16,07790784 / 16,11598543 / 16,37300943 / 16,04879359; JIT Pre-Patch 11,16290573 / 11,00743483 / 10,75923205 / 10,52787005; AOT Pre-Patch 10,6786357 / 10,87205908 / 11,48832923 / 10,88965151)

       C++       JIT Pre-Patch  AOT Pre-Patch
RET    1,002723  1,005439455    1,1151366
RSD    1,026332  1,132982254    1,1169542

Table 6.2: RET and RSD of 1000 methods of the C++ variant and the C# variant run by MonoRT, IA-32

time behavior, RET, is minimally worse, but still much better than the JIT compilation's results. The generation of a minimal set of intermediate code reduces the startup time even further. However, this comes at the cost of limited portability, the reason being the additional deployment steps: the minimal set has to be generated (and AOT-compiled) again if referenced parts of the application are updated. In this case, not much is left of the advantage of deployment via intermediate code compared to deployment via, e.g., C code. Further, this approach can hardly reach development objective O.2, e.g., if dynamically loaded code requires standard library functions that are not included in the minimal set. Table 6.3 lists the observed execution times of the benchmark, which is also used in sections 4.5 and 5.4, where the minimal set of the benchmark code, which mostly reduces the core library, is executed by MonoRT in AOT-based Pre-Patch mode.

Execution Mode     Meas.  X [µs]   Min [µs]  Max [µs]
MonoRT,            1st    143,667  124       187
AOT-based          2nd    150,499  128       192
Pre-Patch,         3rd    149,922  130       195
min. set           4th    150,215  129       197

Table 6.3: Observed execution times of the minimal set of 1000 methods (C#) benchmark executed by MonoRT in AOT-based Pre-Patch mode, IA-32


The results are very similar to those provided by MonoRT in AOT-based Pre-Patch mode, which are listed in Table 6.1. It is assumed that the behavior of intermediate code execution by the MonoRT VM does not change, so an extensive analysis is omitted at this point. Figure 6.4 shows the startup times of the Mono and MonoRT VM, respectively, in JIT mode, in pre-compilation mode and in JIT-based and AOT-based Pre-Patch mode. For the latter, the startup time of the minimal set of the benchmark code is also shown. The startup time optimization

at native code level allows for a reduction of around 20 % compared to the JIT-based Pre-Patch mode. If the startup time reduction at intermediate code level is applied too, the reduction is around 45 % compared to the JIT-based Pre-Patch mode. The minimal set approach, as it is currently realized, does not meet O.2. A solution might be to perform the analysis of the minimal set at initialization time, i.e., just before the pre-allocation of native code. However, this contributes around 7,273 seconds on average to the startup time on the IA-32 system (including writing the reduced assemblies), so that no advantage is left.

Figure 6.4: Startup time of 1000 methods, IA-32 (JIT: 918,0455 ms; Pre-Compilation: 21258,174 ms; JIT Pre-Patch: 27371,1725 ms; AOT Pre-Patch: 21911,6125 ms; AOT Pre-Patch, min. set: 14695,853 ms)

The startup time reduction can also be reproduced on the second test system with the ARM platform. Figure 6.5 shows the startup times of the benchmark in JIT-based pre-compilation mode with and without code reduction at intermediate code level. The startup time is reduced by 25 %. In summary, the approach of AOT-based Pre-Patch improves the startup time, but comes with a minimal degradation of determinism in this test case. The determinism is still better than in JIT mode and comparable to the reference implementation, so O.1 is still met.

85 6 Optimization of Startup Time

[Bar chart; startup times in ms: JIT 235.361, Full-AOT 1669.994, JIT Pre-Compilation 23360.476, JIT Pre-Compilation (min. set) 17505.913]

Figure 6.5: Startup time of 1000 methods, ARM

6.3 Checkpoint and Restore

6.3.1 Concept

Section 6.2 describes two approaches to reduce the startup time of an application, which can be combined. They have in common that they require modifications to the VM and to the application deployment, respectively. They affect development objective O.2 negatively, as additional deployment steps are needed. From an abstract point of view, the approach operating at native code level takes a snapshot after native code generation and restores it at startup time. It is a logical consequence to take a snapshot of the whole application executed by the VM after real-time suitable code generation, including pre-allocation and patching of native code. This program, i.e., the VM executing the application, is restored when it is actually executed. The image of the process has to be stored persistently. According to the description in section 1.3, the only time that then contributes to the startup time is the time to restore the execution state of the program.

6.3.2 Implementation

The implementation of the concept involves the external tool “Checkpoint/Restore In Userspace” (CRIU [123]), which is developed by Andrey Vagin and others. CRIU can take a snapshot, or checkpoint, of a running process, generate image files that can be saved persistently, and restore the process later, so that the process can continue or even migrate to another machine. At the time of this writing, CRIU is available for AMD64 and ARM platforms only. It also does not work on AMD64 systems that execute IA-32 applications. So, the ARM test system is used for the implementation. Version


1.0 of CRIU is used. The implementation of the concept comprises the installation of CRIU on the test system. This requires Linux kernel version 3.11. The test system used has some specifics, so the provided target-specific kernel tree [123] is used. Unfortunately, the “Real-Time Preempt” patches are not available for this kernel version, so the test system lacks this feature (see section 4.4). The implementation serves as a proof of concept and there is no special integration of the checkpoint/restore tool with the MonoRT VM, so the handling is hardly automated. The checkpointing is triggered manually after the pre-compilation of the application.

6.3.3 Experimental Results and Evaluation

Figure 6.6 shows the results of the benchmark that is also used in sections 4.5 and 5.4. The methods in the benchmark are called through an instance of their (only) defining class, see Listing 4.5. The benchmark is run in MonoRT’s pre-compilation mode on the ARM system. The white bars show the results of the pre-compilation mode and the grey bars show the results of the pre-compilation mode restored via CRIU. The observed execution

[Bar chart; execution times in µs for measurements 1–4, pre-compilation mode vs. restored pre-compilation mode; observed values: 31906.67, 32414.90, 31909.54, 32376.10 and 475.66, 482.18, 477.81, 480.21]

Figure 6.6: Execution times of 1000 methods using the JIT-based pre-compilation mode of MonoRT, ARM

times of both results are quite similar, so that an extensive analysis can be omitted at this point. The checkpoint/restore does not affect development objectives O.3 (performance) and O.1 (determinism) negatively. The startup time is reduced to around 1.84 seconds on the ARM test system. Figure 6.7 compares the startup times of the benchmark in different execution modes on the ARM platform. The checkpoint/restore tool reduces the startup time to a level similar to that of the JIT compilation mode. The execution time determinism of the pre-compilation mode, which is better than that of the JIT or


[Bar chart; startup times in ms: JIT 235.361, Full-AOT 1669.994, JIT Pre-Compilation 23360.476, JIT Pre-Compilation (min. set) 17505.913, Restored JIT Pre-Compilation 1835.2565]

Figure 6.7: Startup time of 1000 methods, ARM

Full-AOT mode, is retained. So, the checkpoint/restore approach is a suitable way to reach development objective O.4 (startup time). Objective O.2 suffers from the additional deployment step of taking the snapshot.

88 7 Evaluation

7.1 Internal Experiments

7.1.1 Introduction

Chapters 4 to 6 introduce an approach to generate native code in virtualizing runtime environments that allows for deterministic execution times. They describe the concept and its implementation. They also introduce a benchmark framework for the comparison of the native code generation modes of MonoRT from a real-time point of view. This section analyses whether the modifications of the Mono VM affect its functional behavior. It also analyses whether the real-time suitable native code allocation is ensured in all cases, and if not, it points out the corner cases. In order to meet the latter test criterion, it has to be ensured that the native code does not contain indirect references when it is going to be executed and that no native code is generated at application run time. With reference to MonoRT, this means that the evaluation has to show that the pre-allocation allocates all necessary code and that Pre-Patch handles all patches, i.e., indirect references, which are emitted during pre-allocation of native code.

The evaluation of the real-time native code generation modes of MonoRT cannot test all possible use cases, because there is an infinite number of possible programs. The modifications of the Mono VM concern the translation from intermediate code into native code. That is, it is sufficient to verify that the VM works correctly beneath the intermediate code level. Mono 2.6.1’s internal runtime test suite is used for this purpose. Mono also ships with further test suites, e.g., for the C# standard library, but it is sufficient to apply the runtime test suite, because the modifications are beneath the intermediate code, i.e., at VM level. It has to be kept in mind that tests cannot prove the correctness of a solution. Tests are used to detect errors, but not to indicate overall correctness. In this context, it cannot be guaranteed that Mono’s internal runtime test suite covers all “critical” cases. This test suite is used for regression tests of Mono and it includes 380 test cases. Each of them comprises up to a couple of dozen individual tests. They include simple arithmetic operations, array operations, marshalling, exception handling etc. It is assumed that the suite covers a representative set of use cases. So, it is used to examine the real-time native code generation modes of MonoRT.

7.1.2 Standard Execution Mode

If pre-allocation of native code and Pre-Patch are not activated, all test cases pass. That is, the modifications have no negative effect on the standard, i.e., JIT-based, execution mode. This is important to meet development objective O.2 (portability).

7.1.3 Real-Time Code Generation Mode

The test of the real-time code generation mode includes testing the functional correctness and testing the real-time capability. The functional correctness is fulfilled if a test delivers the correct result. The real-time capability, as defined in this thesis, is fulfilled if neither native code is generated nor indirect references, i.e., Patches, are resolved during the run time of the test. The test cases of the test suite are examined individually in order to check this.

The pre-allocation of native code (see chapter 4) does not support parametric polymorphism [26], also known as type parameters. Type parameters allow for a software implementation with unknown types: a type does not need to be known at implementation time and can be represented as a parameter. This programming feature is commonly known as “Generics”. A JVM is not aware of Generics. Type parameters are “compiled away” by the source code compiler via the “type erasure” technique [121, Ch. 9]. That is, Java bytecode does not contain generic type information. This allows for backwards compatibility between early and late JVM specifications.
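The type-erasure behavior just described can be made visible with a minimal Java sketch (class and variable names are illustrative): after compilation, two differently parameterized lists share one and the same runtime class, because the bytecode no longer carries the type parameters.

```java
import java.util.ArrayList;
import java.util.List;

public class TypeErasureDemo {
    public static void main(String[] args) {
        List<String> strings = new ArrayList<>();
        List<Integer> numbers = new ArrayList<>();

        // The source compiler erased both type parameters, so at run time
        // there is exactly one class java.util.ArrayList for both variables:
        System.out.println(strings.getClass() == numbers.getClass());  // true
        System.out.println(strings.getClass().getName());              // java.util.ArrayList
    }
}
```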

In contrast, a CLI VM is aware of generic type information, which was introduced by the third edition of the ECMA 335 specification in 2005. That is, the CIL can contain generic type information. So, the pre-allocation of native code has to handle this, because it operates at intermediate code level. [62] describes an early design and implementation of Generics in the .NET runtime environment. The integration of polymorphism in the CLI VM “takes advantage of the dynamic loading and code generation capabilities” [62, p. 2]. Types can be instantiated at run time and native code can be generated lazily, i.e., on demand. This does not contradict the approach of pre-allocation of native code, because the pre-compilation itself effectively happens at run time. However, the current implementation lacks support for Generics for simplicity, which does not invalidate the approach. The CLI assemblies are searched only once for methods and classes to pre-compile. CIL code with type parameters is omitted, because the actual value of a type parameter is unknown at this time. A naive solution would be to search for references to the CIL code with type parameters in the set of assemblies that are considered for pre-allocation of native code. The types to be instantiated could then be resolved and native code could be generated for all types found. The Pre-Patch mechanism might be utilized for that. However, this approach is not proven by an implementation. The AOT compiler of Mono 2.6.1, which is used for the implementation of MonoRT, does not support Generics. Using the standard library without Generics, both for the AOT compiler based and the JIT compiler based variant of pre-allocation of native code, advances the comparability of both variants. There are approaches to support Generics for AOT compilation up to a certain point [64] and for code sharing [63]. This issue is not limited to languages that provide Generics, because the implementation of the standard library used might rely on Generics internally.


For these reasons, MonoRT is evaluated with test cases that do not use type parameters. The evaluation uses the test cases that can be compiled with Mono’s C# compiler mcs. According to Mono’s man page, mcs compiles against the .NET 1.x profile. It can handle C# 1.0 and the parts of C# 2.0 that do not use Generics. 262 test cases can be compiled by mcs. There is one test case – gc-altstack.exe – that sporadically fails even when it is executed by a stock Mono VM. This is presumably caused by a race condition that might occur when the GC is active and an exception is to be handled. For that reason, this test case was removed, so that 261 test cases remain.

First, the available tests are run in one go in order to get an overview of the functional correctness of MonoRT. Robert Andre provided a test script that allows running the tests of Mono’s test suite individually. If pre-allocation of native code and Pre-Patch are not activated, all available tests pass. If pre-allocation of native code is activated, 259 of the available tests pass. The failed test cases are traceable to side effects of real-time native code generation. The test case classinit.exe checks the value of a variable that was initialized by a type initializer. Its value depends on the value of a variable of another class. The initialization order might deviate from the expected one in pre-compilation or Pre-Patch mode, so that this test case might fail. The test case bug-82022.exe is based on side effects concerning the execution order of exception handling. For that reason, this test case might fail in pre-compilation and Pre-Patch mode as well. Both test cases remain in the test suite, because they address an issue that might be handled in future versions of MonoRT. The corner cases are described in the following.

Section 4.3.1 already describes that the solution does not comply with the ECMA 335 standard and its §I.8.9.5. This paragraph regulates the execution of type initializers. Type initializer methods marked with the BeforeFieldInit flag are allowed to be executed before the first referencing of the type. If the BeforeFieldInit flag is not set, a type initializer method must not be called before the first referencing of that type. The execution of type initializers is bound to the lazy compilation principle in Mono. For example, a type initializer is executed when a method of the type is going to be JIT-compiled. This is typically related to the execution of a Patch, i.e., an indirect reference. The current implementation violates this rule, because the type initializers are executed during pre-compilation and Pre-Patch, which does not count as real run time of an application. Test cases of Mono’s internal runtime test suite that check the order and the point in time of type initializer execution can – and do – fail. The execution order depends on the pre-compilation order of the methods of a CLI assembly, which a user can hardly influence. The current solution favors deterministic execution times, because the execution of a type initializer introduces one-time overhead at run time that is not suitable for real-time applications. Five test cases fail due to checks of the execution order of type initializers and due to the helper method sharing problem, so that 256 test cases pass in real-time code generation mode. That is, the functional behavior of the Mono VM is affected only marginally by the real-time native code generation. Dedicated real-time solutions like JamaicaVM provide pre-execution of initializers although it violates the


Java standard [73, §2.17.4]. Another problem with the compilation order of methods concerns the sharing of Mono-internal helper methods. The helper methods are shared per method signature. However, Mono 2.6.1 contained a bug that caused a wrapper to be shared between methods with different signatures. This problem surfaced when pre-allocation of native code was activated, because a lot of methods and internal helper methods are compiled, so that a certain wrapper was shared mistakenly. This issue was reported to Mono’s development community and has already been fixed in newer versions of Mono. The bug fix is included in MonoRT.
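The lazy execution of type initializers has a direct counterpart in Java's class-initialization semantics (JLS §12.4): a static initializer runs at the first active use of its class. The following minimal Java sketch (illustrative names) shows how the observed initialization order depends on which class is used first, which is exactly the ordering property that eager pre-execution of initializers disturbs.

```java
public class InitOrderDemo {
    public static final StringBuilder log = new StringBuilder();

    static class A {
        static { log.append("A"); }
        static int f() { return 1; }
    }

    static class B {
        static { log.append("B"); }
        static int f() { return 2; }
    }

    public static void main(String[] args) {
        // Lazy initialization: B's static initializer runs first, because B
        // is actively used first. An eager pre-compiler that runs the
        // initializers in compile order could observe "AB" instead.
        B.f();
        A.f();
        System.out.println(log); // BA
    }
}
```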

In the second part of the internal experiments, each of the 256 passing tests is examined individually in order to check whether native code allocation or resolving of indirect references occurs at application run time. The following passages describe the corner cases of real-time code generation. First, the pre-allocation of native code and Pre-Patch take effect only for one CLI Application Domain. An Application Domain is similar to an operating system process. It isolates managed applications, e.g., in order to provide security. There are no efforts to share pre-allocated code between Application Domains in MonoRT. If an application creates a new Application Domain and starts executing intermediate code of an assembly there, the native code has to be allocated or generated again, even if the assembly was subject to pre-allocation and Pre-Patch before. So, it is not possible to use this feature in real-time applications. [63] discusses the problems of code sharing between Application Domains in the CLI.
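The isolation property described here can be illustrated with the closest Java analogue to Application Domains, namely separate class loaders: the same bytecode loaded by two independent loaders yields two distinct runtime classes, so nothing is shared between them. The following sketch uses illustrative names and assumes the compiled class file is reachable as a system resource.

```java
import java.io.ByteArrayOutputStream;
import java.io.InputStream;

public class IsolationDemo {
    public static class Payload {}

    // A class loader that defines application classes itself instead of
    // delegating to its parent, so that each loader instance gets its own
    // copy of the class -- nothing is shared between the two "domains".
    public static class IsolatingLoader extends ClassLoader {
        @Override
        public Class<?> loadClass(String name) throws ClassNotFoundException {
            if (!name.startsWith("IsolationDemo")) {
                return super.loadClass(name); // delegate JDK classes
            }
            try (InputStream in = ClassLoader.getSystemResourceAsStream(
                    name.replace('.', '/') + ".class")) {
                ByteArrayOutputStream buf = new ByteArrayOutputStream();
                int b;
                while ((b = in.read()) != -1) {
                    buf.write(b);
                }
                byte[] bytes = buf.toByteArray();
                return defineClass(name, bytes, 0, bytes.length);
            } catch (Exception e) {
                throw new ClassNotFoundException(name, e);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Class<?> c1 = new IsolatingLoader().loadClass("IsolationDemo$Payload");
        Class<?> c2 = new IsolatingLoader().loadClass("IsolationDemo$Payload");
        // Same bytecode, same name, but two distinct runtime classes:
        System.out.println(c1 == c2);                          // false
        System.out.println(c1.getName().equals(c2.getName())); // true
    }
}
```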

The second limitation concerns a further characteristic of the CLI and C#, respectively: Delegates. A Delegate is an object-oriented variant of a function pointer; the ECMA 335 standard [4, §II.14.6] defines it. The arguments of its instance constructor include a pointer to the method to be called (at intermediate code level). If the target method’s native code is not available at delegate instantiation time and the lazy compilation principle is applied, the pointer refers to a stub, which triggers code generation on demand. For the real-time suitable code generation of MonoRT, this means that a delegate instance initially refers to an indirect reference, i.e., to a Patch. When an instantiated Delegate is used to call the target method for the first time, the one-time overhead of Patching, which is described in chapter 5, occurs. Delegates can be created implicitly, e.g., by means of thread creation. This is not suitable for real-time systems. The current implementation of MonoRT shifts the execution of the Patch to the Delegate instantiation. That is, the Delegate instantiation suffers from temporal non-determinism rather than its first invocation. The source code of the standard library is modified for multicast Delegates due to the Mono-specific implementation of this class. It is necessary to call the Delegate’s Patch explicitly via an “Internal call”, which is also used for the implementation of pre-allocation of native code (see section 4.3.2), in order to resolve the indirect references of the Delegates. While all modifications and extensions so far are below the intermediate code level, this breaks the portability of the solution. The Delegate

constructors of Mono would have to be reworked in order to fix this issue. It is helpful in this regard that the Delegate constructors can assume that the target method’s code is already available.
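As an aside, the closest Java counterpart to a CLI Delegate is a lambda or method reference, and it exhibits a comparable run-time code generation step: the VM materializes the implementation class via LambdaMetafactory when the capture site is executed for the first time. A minimal, illustrative sketch (not part of MonoRT):

```java
public class DelegateAnalogDemo {
    public interface IntOp { int apply(int x); }

    public static void main(String[] args) {
        // The lambda below is the closest Java analogue to a CLI Delegate:
        // its implementation class does not exist in the class file but is
        // generated by the VM at run time when this capture site is first
        // executed -- comparable to the stub behind a fresh Delegate.
        IntOp twice = x -> 2 * x;
        System.out.println(twice.apply(21)); // 42
        // The generated class carries a VM-internal name:
        System.out.println(twice.getClass().getName());
    }
}
```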

The third issue affects “marshalling”. Marshalling means the conversion of data (structures), so that the data can be exchanged between software components. These components might be implemented in different programming languages. In case of Mono, marshalling is done by internal helper methods, which are generated during pre-allocation of native code. The internal methods are generated for all built-in and custom types found in the intermediate code. A problem arises (from a real-time point of view) if the helper methods rely on object addresses. Such helper methods are generated lazily at run time, so that JIT compilation and reference resolving might occur. Two cases could be identified. The first is calling native code, e.g., a function written in C, from intermediate code, i.e., managed code. This is known as “Platform Invoke”, or PInvoke for short. The arguments of a PInvoke have to be converted to a format that matches the callee’s conventions. A Delegate can be passed as an argument when calling native code. It can be used as a function pointer in order to call managed code from native code. If a Delegate is subject to marshalling, an internal helper method is generated at run time. It is Mono-specific that the Delegate’s address is necessary for the helper method generation. In consequence, the JIT compiler is executed at run time in order to generate the helper (marshalling) method. An indirect reference, which is part of the generated helper method, is resolved when the managed method is called from native code for the first time. A proper solution needs a rework of the Delegate mechanism. A barely adequate solution is to generate the marshalling helper method at Delegate initialization. However, it would not be possible to resolve the indirect reference without reworking the JIT compilation or reference resolution code. “It just works” (IJW) wrappers [97, Ch. 3] are similar to PInvokes. They are used to call unmanaged native code from managed (C++) code. A test case utilizes the IJW wrapper to pass a method handle of a managed C# method to an unmanaged C method. The C method performs a Mono-specific lookup and calls the C# method via an IJW wrapper. The IJW wrapper is generated at run time, so that the JIT compiler is executed.

Marshalling is also necessary when using the Component Object Model [23] (COM). COM allows software components to be used together, even if they are implemented in different programming languages. Two COM components communicate through defined interfaces. Mono’s internal helper methods, which are responsible for marshalling in the context of COM, require the address of the object to be marshalled, and they are cached per object. The problem arises in both directions of COM’s client/server principle. So, the execution of the JIT compiler and reference resolving might be required at application run time. The current implementation lacks a solution for this.

The fourth limitation is related to reflective programming features. Reflective programming means that a program can obtain and even modify information about its own structure.


Reflection is able to load a handle of a method and call the method that is represented by the handle. It is possible that a Mono-internal helper method for invoking such a method is not generated during pre-allocation of native code, because no other method with the same signature is pre-compiled. For example, if the getter or setter method of an indexer operator, which is typically implemented by an intermediate code instruction, is loaded explicitly and called via Reflection, a Mono-internal helper method might be generated at run time. This behavior was observed on an array of strings. Reflection also allows modifying or inserting intermediate code at run time. It is obvious that this forces a compilation into native code at run time, which contradicts the real-time capability. The use of the Reflection programming features can have side effects, so it is recommended not to use them in the real-time portion of the code.
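The reflective call path described above can be sketched in Java (illustrative names); the lookup and invocation below is exactly the kind of operation for which a VM may generate an invocation helper lazily at run time:

```java
import java.lang.reflect.Method;

public class ReflectionDemo {
    public static String greet(String name) {
        return "Hello, " + name;
    }

    public static void main(String[] args) throws Exception {
        // Load a handle of the method ...
        Method handle = ReflectionDemo.class.getMethod("greet", String.class);
        // ... and call the method represented by the handle. The VM may
        // generate an internal invocation helper lazily at this point.
        Object result = handle.invoke(null, "world");
        System.out.println(result); // Hello, world
    }
}
```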

A fifth issue again concerns the execution of the type initializers. The pre-allocation of native code and the Pre-Patch catch exceptions that are thrown by type initializers. The execution of the type initializer is aborted in such a case. That is, the pre-compilation of the type’s methods is aborted as well. If such a type is referenced at application run time and its initializer does not throw an exception this time, the JIT compiler is executed at run time to compile the methods.

Mono 2.6.1, which serves as the basis for MonoRT, has an erroneous optimization of the jmp and ldftn intermediate code instructions. If a method is loaded by means of ldftn or jumped to by means of jmp, this is done via jump Trampolines. The optimization shares the Trampolines in a wrong way, so that it is possible to bypass additional helper methods, e.g., for synchronized method access. This issue was reported to the Mono developer community and is fixed in newer versions of Mono. The optimization is deactivated in the current implementation of MonoRT. That is, if a jmp or ldftn instruction is used, the jump Trampoline and also the JIT compiler might be executed at run time.

The pre-allocated and optimized native code is loaded into the main memory. This can result in a higher memory footprint than in JIT compilation mode, because pre-allocation of native code considers all methods and types within an assembly and all referenced assemblies. The memory footprint can be reduced by reducing the amount of intermediate code that has to be considered by pre-allocation. This approach is discussed in [94], and section 6.2.1 takes it up in order to reduce the startup time. The solution taken here favors deterministic execution times over memory efficiency.

7.1.4 Summary

It is ensured, except for some special cases, that the JIT compiler is not used at run time and that there are no indirect references in the native code after Pre-Patch. The special cases mainly concern the interoperability of managed applications with other software

and the use of type parameters. The current implementation is specific to Mono version 2.6.1 and its implementation of the lazy compilation principle. For example, if a solution handles indirect references via tables, it would have to be ensured that all entries in the tables are handled, in order to verify the validity of the approach. [49] compares reference handling via tables to the “backpatching” approach that is applied by Mono. The benchmarks in this section are therefore an excerpt of all possible programs, which nevertheless indicates the validity of the approach.

7.2 Comparative Experiments

7.2.1 Introduction

Chapters 4 to 6 introduce an approach for the generation of native code in virtualizing runtime environments that allows for deterministic execution times. They describe the concept and its implementation, and they introduce a benchmark framework for a comparison of the Mono VM – which serves as experimental vehicle – to itself, with and without modified code generation. This section applies the benchmarks to other runtime environments as well, in order to compare the solution presented here to state-of-the-art solutions with regard to the development objectives O.1, O.3 and O.4. The test candidates have to be comparable to the solution presented here, so they have to fulfill O.2. Design decision D.1 of section 4.2 (using a VM as basis) limits the choice of candidates to full-featured virtualizing runtime environments. When comparing these requirements to the related work presented in section 3.4, the following VMs are worth investigating:

• MonoRT with pre-compilation and Pre-Patch (C#)
• Mono 2.6.1 with Full-AOT mode (C#)
• Microsoft .NET CLR v2.0.50727 (C#)
• IBM WebSphere Real Time V2 for RT Linux build 2.4 (Java)
• Aicas JamaicaVM 6.0 Release 3 build 6928 (Java)

The test candidates are chosen because they represent full-featured state-of-the-art implementations of virtualizing runtime environments. They provide different code generation technologies, which is interesting from the view of deterministic execution times. Mono 2.6.1, which is the base of MonoRT, is chosen to evaluate Mono’s “Full-AOT” mode. The .NET framework and its runtime CLR is a mature commercial product and quasi represents the reference implementation of CLI environments. It provides an AOT compiler called “NGen” but does not target real-time systems in the first place. This does not pose a problem, because the benchmarks used here examine the code generation in particular and not real-time programming features in general. Microsoft’s CLI implementation also allows adapting the pre-compilation technique that is introduced in section 4.3.1. It provides the language features that are necessary to implement the non-invasive pre-compilation approach described in section 4.3.2. It is referred to as “Reflection Pre-Compilation”


(RPC) in the following. The results of the RPC mode have to be considered cum grano salis, because this variant provides only limited control over pre-compilation and it relies on – and uses – the .NET 2.0 profile, whereas the benchmarks regarding MonoRT and Mono 2.6.1 run on the .NET 1.0 profile, i.e., another standard library. While the first three candidates target CLI environments, IBM WebSphere Real Time and Aicas JamaicaVM are Java Virtual Machines. Both are RTSJ-compliant implementations and claim to provide high determinism that is suitable for real-time systems. They follow different code generation approaches. IBM WebSphere Real Time uses an AOT compiler to generate native code in an offline step. This native code is loaded by the VM when it is going to be executed. For the benchmarks presented here, a so-called “Shared Cache” is filled with native code of the benchmark by the tool “admincache”. This tool is shipped with IBM WebSphere Real Time for RT Linux and its execution is an additional deployment step. The JamaicaVM uses the programming language C as intermediate language in order to generate standalone executables. They are built by the tool “jamaicabuilder”, which takes the Java class files as input and generates native code in an additional deployment step. For the benchmarks presented here, it is called with the options -compile and -lazy=false in order to pre-compile the intermediate code to native code and to initialize the classes before execution. The comparison between CLI and C#, respectively, and Java environments is valid here, because the benchmarks are designed to examine execution time determinism and do not use language specifics, see section 4.4.

The test systems for the experiments presented in this section are the same as described in section 4.4. All candidates but Mono’s Full-AOT mode are tested on the IA-32 platform. The Full-AOT benchmarks are run on the ARM platform. For the tests with .NET, the Windows 7 operating system with Service Pack 1 is used. The benchmark program is set to high priority by passing the parameter “/realtime” to Windows’ “start” command. The test application is short-running and not memory-intensive, so influences of the automatic memory management are attempted to be avoided. The test application’s Java version uses so-called “No-Heap Real-Time Threads”, which do not use GC-controlled memory and run at a higher priority than the GC. The following sections present test cases that base on the micro benchmark framework introduced before. They evaluate the real-time suitable code generation of MonoRT and compare it to the other solutions. Starting from the JIT compilation mode, the effects of pre-compilation, Pre-Patch and startup time optimization, respectively, are pointed out. Further, both Pre-Patch modes of MonoRT are compared to the other test candidates by conducting the benchmarks.

7.2.2 Instance Methods Benchmark

This section presents the results of the benchmark that is also used in sections 4.5, 5.4, 6.2.3 and 6.3.3. Here, it is applied to MonoRT and the test candidates that are listed in section 7.2.1. The methods are called through an instance of their defining class,

see Listing 4.5. Table 7.1 lists the benchmark results of MonoRT in AOT-based pre-compilation and Pre-Patch mode and of the other candidates. The sample mean is denoted by X and calculated by Equation 4.1, where n = 2000 and the Xi are the observed execution times. The first and the second measurement of Mono’s Full-AOT mode reveal

Execution Mode     Meas.   X [µs]      Min [µs]   Max [µs]
MonoRT,            1st     143.5275    135        183
AOT-based          2nd     148.7635    139        185
Pre-Patch,         3rd     149.0615    139        194
IA-32              4th     148.6865    140        189
Mono,              1st     4241.62     3085       5936
Full-AOT,          2nd     2612.581    1513       4268
ARM                3rd     473.133     438        813
                   4th     471.448     439        692
.NET,              1st     163.084     152        276
RPC,               2nd     162.7895    152        672
IA-32              3rd     162.7205    152        683
                   4th     162.5155    151        253
.NET,              1st     226.3315    212        466
AOT,               2nd     182.5       172        446
IA-32              3rd     125.103     117        217
                   4th     125.054     119        236
IBM WebSphere      1st     449.702     386        517
RT, AOT,           2nd     449.747     391        513
IA-32              3rd     449.292     394        518
                   4th     449.67      392        514
JamaicaVM,         1st     29207.221   18796      132662
Compile Early,     2nd     1983.385    1522       3496
IA-32              3rd     1958.338    1479       2762
                   4th     1937.9825   1468       2819

Table 7.1: Observed execution times of the benchmark with 1000 instance methods

one-time overhead that is similar to that of Mono’s normal AOT mode, see section 4.5. The pure execution time level is not reached before measurement 3, i.e., at the first repetition of the same method calls. So, this execution mode is not suitable for real-time systems. The AOT mode of Microsoft .NET provides more deterministic behavior, but it also lacks high determinism. The pure execution time level of approx. 125 µs is not reached before measurement 3, while measurement 1 averages approx. 226 µs. The results of .NET’s RPC mode confirm that the concept of pre-compilation is well suited to provide deterministic execution times in virtualizing runtime environments. The level of pure execution times, which is circa 163 µs, is reached at the very first execution. This is slightly worse than the performance of MonoRT. IBM WebSphere Real-Time for RT Linux, whose AOT compilation mode is comparable to that of Mono, also provides deterministic results. No one-time overhead could be observed in this test case; the pure execution time level of around 449 µs is reached right from the start. JamaicaVM’s results indicate one-time overhead at the first execution. The average pure execution time is

97 7 Evaluation


between 1983,385 µs at measurement 2 and 1937,9825 µs at measurement 4.

Figure 7.1: Frequency Distribution of the observed execution times of all four measurements of IBM WebSphere Real-Time for RT Linux in AOT mode, IA-32

Figure 7.1 shows the frequency distribution of the observed execution times of all four measurements of IBM WebSphere. It is very similar to that of MonoRT, see Figure 6.2. This indicates that the conditions are equal for the benchmarks which run on Linux.

The frequency distributions of the observed execution times differ on Windows/.NET. For example, compare Figure 6.2 and Figure 7.1 to Figure 7.2. The latter illustrates the number of observed execution times of the benchmark run by .NET in PRC mode, where only execution times up to 276 µs are considered, which is the maximum observed at measurement 1 (see Table 7.1).

There are a few outliers of up to 683 µs, which are assumed to be caused by the non-real-time nature of the Windows 7 operating system. Considering them in Figure 7.2 would impair the illustration purpose, so they are omitted. This does not affect the validity of the statement. The frequency distribution of the Windows-based benchmark is smoother, i.e., there is only one “aggregation” of execution times. This indicates that the Linux-based results exhibit a higher standard deviation than the Windows-based ones, which is also confirmed by Figure 7.3 that shows the standard deviations of the benchmark.
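Such a frequency distribution is straightforward to reproduce from the raw samples by simple binning. A minimal sketch (the 10 µs bucket width matches the figures' axes; the class and method names are illustrative, not part of the benchmark framework):

```java
import java.util.Map;
import java.util.TreeMap;

public class FrequencyDistribution {
    // Count observed execution times per bucket of the given width (in µs).
    static Map<Long, Integer> histogram(long[] samplesMicros, long bucketWidth) {
        Map<Long, Integer> bins = new TreeMap<>();
        for (long t : samplesMicros) {
            long bucket = (t / bucketWidth) * bucketWidth; // lower bucket bound
            bins.merge(bucket, 1, Integer::sum);
        }
        return bins;
    }

    public static void main(String[] args) {
        long[] samples = {152, 153, 162, 163, 164, 276};
        System.out.println(histogram(samples, 10));
    }
}
```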

MonoRT and the .NET variants provide the best behavior. Measurement 2 and measurement 3 of .NET PRC have a relatively high standard deviation (approx. 14 µs vs. approx. 8 µs at measurements 1 and 4), which is assumed to be caused by the outliers among the observed execution times, which might in turn be caused by external effects. Apart from this, .NET PRC and MonoRT provide the lowest standard deviations. This indicates high determinism.



Figure 7.2: Frequency Distribution of the observed execution times of all four measurements of .NET in PRC mode, cut at 276 µs, IA-32

Standard deviation of the observed execution times in µs (measurements 1–4):

MonoRT (IA-32):          10,6786357    10,87205908   11,48832923   10,88965151
Mono Full-AOT (ARM):     196,4425363   150,8997482   45,41781272   44,28430077
.NET PRC (IA-32):        8,945107684   14,13336711   14,43223907   7,898794543
.NET AOT (IA-32):        11,83459491   11,4803266    7,415449106   6,323340104
IBM WebSphere (IA-32):   18,217034     17,1720027    17,56562016   17,9171318
JamaicaVM (IA-32):       5948,720873   119,6924179   138,2945641   165,3537696

Figure 7.3: Standard Deviation of execution times of benchmark with 1000 instance methods


In order to rate and compare the determinism of the quite different results, Table 7.2 lists the ratio of the observed execution times (RET) and the ratio of the observed standard deviations (RSD) to make a statement about the variance of the average execution times and standard deviations.

                                 RET        RSD
MonoRT, AOT Pre-Patch (IA-32)    1,115137   1,116954
Mono Full-AOT (ARM)              50,036     15,502
.NET PRC (IA-32)                 1,006458   3,702393
.NET AOT (IA-32)                 2,642336   3,98479
IBM WebSphere (IA-32)            1,002767   1,132262
JamaicaVM (IA-32)                15,586024  79,33099

Table 7.2: RET and RSD of benchmark with 1000 instance methods

IBM WebSphere provides the most “stable” values, because its RET and RSD are the lowest of all candidates. This is preferable from a real-time point of view. However, the execution times are around two times higher than those of MonoRT and .NET. Their RET and RSD are also close to 1. That is, MonoRT provides execution time determinism that is comparable to that of a dedicated real-time solution while providing very good performance.

Figure 7.4 shows the startup times of this benchmark (in ms):

MonoRT AOT-based Pre-Patch (IA-32):  21911,6125
Mono Full-AOT (ARM):                 4774,549
.NET PRC (IA-32):                    3166,991
.NET AOT (IA-32):                    2120,944
IBM WebSphere (IA-32):               235,361
JamaicaVM (IA-32):                   619,143

Figure 7.4: Startup time of benchmark with 1000 instance methods

The test candidates that provide the highest determinism have the highest startup times. The figure shows the startup time of MonoRT in AOT-based Pre-Patch mode, which is longer than that of the other test candidates. Chapter 6 discusses and demonstrates several possibilities to reduce the startup time. However, the code reduction at native code level is the only approach that works with Pre-Patch, so these results are presented here. The startup time of .NET's PRC mode also increases significantly in comparison to its AOT mode. Mono's Full-AOT mode shows the possible startup time of full-featured virtualizing runtime environments when all code is pre-compiled and loaded lazily.


7.2.3 Interface Methods benchmark

This section presents the results of a benchmark that calls the methods via an interface. This benchmark is intended to reveal overhead, if any, that is related to VM-internal interface method resolution. Depending on how the methods are dispatched, e.g., per call site or per method, the benchmark should reveal different execution time “patterns”. The implementation is as follows. There is an interface and two classes that implement it. One of the two classes is chosen randomly to avoid compiler optimizations. An instance of this class is cast to the interface type and the methods are called via this interface instance. Listing 7.1 shows the access pattern of the methods.

interface IFMethods {
    int method0(int, int);
    //...
}

class MethodsIFclass1 : IFMethods {
    main() {
        IFMethods im;
        rand = Random();
        if (rand < 0.5)
            im = new MethodsIFclass1();
        else
            im = new MethodsIFclass2();
        // begin first measurement
        im.method0();
        //...
    }
    method0(int var0, var1) { //.. }
}

class MethodsIFclass2 : IFMethods {
    method0(int var0, var1) { //.. }
}

Listing 7.1: Pseudo code pattern of interface method benchmark

Figure 7.5 contrasts MonoRT's Pre-Patch modes with the pre-compilation and the JIT compilation mode. The JIT compilation mode has the longest execution time of around 288100 µs at first execution. The following executions take around 216 µs. In contrast, the instance method benchmark also reveals overhead at the second execution in JIT compilation mode, see Table 4.3. Mono handles interface method calls by interface method tables (IMTs). An interface method call site refers to an entry in the IMT. Interface method calls have to be patched only once. In contrast, each call to an instance method has a separate call site, see section 5.1.

Figure 7.5: Average observed execution times of benchmark with 1000 interface methods run by MonoRT, IA-32

Figure 7.5 also illustrates the gain of runtime overhead reduction by means of pre-compilation. The average execution time of the first execution is reduced from around 288100 µs to around 73600 µs. This is not a satisfying result for real-time systems, because the final execution time level is at around 216 µs. MonoRT's Pre-Patch modes also eliminate the overhead of reference resolution. Then, the execution is affected by neither code generation nor reference resolution. The AOT-based Pre-Patch mode has an execution time level of around 228 µs, which is reached at first execution. This is circa 4% higher than that of MonoRT's execution modes that use the JIT compiler as code generator.

Figure 7.6 contrasts the benchmark results of MonoRT in AOT-based Pre-Patch mode with the other candidates. The AOT-based Pre-Patch mode is used for further evaluation, because it provides a shorter startup time than the JIT-based Pre-Patch mode while there is only a slight degradation of performance. MonoRT provides the most “stable” execution times of all test candidates. They do not suffer from one-time overhead. The first execution already reaches the level of pure execution times without additional overhead. The other test candidates, including .NET in PRC mode and IBM WebSphere, reveal overhead at measurement 1 and measurement 2. As an exception, JamaicaVM has significant overhead only at measurement 1. The corrected standard deviation s of the observed execution times is calculated by Equation 4.2 and illustrated by Figure 7.7. MonoRT, IBM WebSphere and the .NET variants provide the best results. Mono's Full-AOT mode suffers from high standard deviations. The behavior of the observed execution times and the standard deviations is summarized by the ratios RET and RSD. They make a statement about the determinism of the observed execution times and are listed in Table 7.3. MonoRT provides the lowest values, because the execution times do not suffer from one-time overhead. That is, MonoRT exhibits the best results for determinism and good results for performance in this benchmark.
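The corrected sample standard deviation of Equation 4.2 is the usual s = sqrt(sum((x_i - x̄)²)/(n - 1)); a small sketch to make the reported values reproducible (the example values are made up):

```java
public class CorrectedStdDev {
    // s = sqrt( (1/(n-1)) * sum((x_i - mean)^2) ),
    // the corrected sample standard deviation
    static double stdDev(double[] samples) {
        int n = samples.length;
        double mean = 0;
        for (double x : samples) mean += x;
        mean /= n;
        double sumSq = 0;
        for (double x : samples) sumSq += (x - mean) * (x - mean);
        return Math.sqrt(sumSq / (n - 1));
    }

    public static void main(String[] args) {
        double[] observed = {216.4, 216.1, 215.6, 218.9}; // example values in µs
        System.out.printf("%.3f%n", stdDev(observed));
    }
}
```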



Figure 7.6: Observed execution times of benchmark with 1000 interface methods

Standard deviation of the observed execution times in µs (measurements 1–4):

MonoRT (IA-32):          10,679     10,872    11,488    10,89
Mono Full-AOT (ARM):     19074,644  557,861   455,596   457,187
.NET PRC (IA-32):        8,945      14,133    14,432    7,899
.NET AOT (IA-32):        11,835     11,48     7,415     6,323
IBM WebSphere (IA-32):   18,217     17,172    17,566    17,917
JamaicaVM (IA-32):       5948,721   119,692   138,295   165,354

Figure 7.7: Standard Deviation of execution times of benchmark with 1000 interface methods


                                 RET         RSD
MonoRT, AOT Pre-Patch (IA-32)    1,0103      1,0691
Mono Full-AOT (ARM)              28,947      51,444
.NET PRC (IA-32)                 27806,6829  3,7409,9829
.NET AOT (IA-32)                 36121,503   530,3958
IBM WebSphere (IA-32)            256,8091    4,6302
JamaicaVM (IA-32)                10,1037     53,6699

Table 7.3: RET and RSD of benchmark with 1000 interface methods

After evaluating performance, i.e., development objective O.3, and determinism, i.e., development objective O.1, Figure 7.8 shows the results of the startup time evaluation (in ms):

MonoRT JIT Pre-Patch (IA-32):   31481,068
MonoRT AOT Pre-Patch (IA-32):   25402,434
Mono Full-AOT (ARM):            5224,0785
.NET PRC (IA-32):               3256,5745
.NET AOT (IA-32):               2447,0645
IBM WebSphere (IA-32):          241,471
JamaicaVM (IA-32):              612,4095

Figure 7.8: Startup time of benchmark with 1000 interface methods

The results are similar to those of section 7.2.2. The solutions that provide the highest determinism have the longest startup times. Figure 7.8 also shows the startup time of the JIT-based Pre-Patch mode of MonoRT. The code reduction at native code level reduced the startup time by around 20% compared to the JIT-based Pre-Patch mode.

7.2.4 Class Methods benchmark

This section presents the results of a benchmark that calls the methods via an instance of another class, which implements the method. This is a common case in programs written in an object-oriented language. It is intended to reveal overhead, if any, that is related to VM-internal method resolution and to the time of class loading and initialization, respectively. Listing 7.2 shows the access pattern of the methods. There are 1000 classes in the benchmark code. Each of them implements one public method. An instance of each class is created. The method of each of the 1000 class instances is called per measurement, so that the execution time of all 1000 methods is considered. The creation time of the class instances is not considered. Depending on the class and/or code loading strategy, the method code is loaded at first reference, e.g., the method call. This overhead due to lazy loading is not suitable for real-time systems, and this benchmark examines the behavior of the test candidates.

class Class0 {
    main() {
        cls0 = new Class0();
        cls1 = new Class1();
        //...
        cls999 = new Class999();
        // begin first measurement
        cls0.method0();
        cls1.method1();
        //...
    }
    method0(int var0, var1) { //.. }
}

class Class1 {
    method1(int var0, var1) { //.. }
}

Listing 7.2: Pseudo code pattern of class methods benchmark
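The listings omit the measurement harness around the calls. It presumably resembles the following sketch, which times one pass over all calls per measurement and repeats this four times; the names and the use of a nanosecond timer are assumptions, not taken from the thesis:

```java
public class Harness {
    static final int MEASUREMENTS = 4;

    // Times one "measurement", i.e., one pass over all method calls, in µs.
    static long[] measure(Runnable allCalls) {
        long[] timesMicros = new long[MEASUREMENTS];
        for (int m = 0; m < MEASUREMENTS; m++) {
            long start = System.nanoTime();
            allCalls.run();              // e.g., cls0.method0(); ... cls999.method999();
            long end = System.nanoTime();
            timesMicros[m] = (end - start) / 1000;
        }
        return timesMicros;
    }

    public static void main(String[] args) {
        long[] t = measure(() -> { /* the 1000 benchmark calls would go here */ });
        System.out.println(t.length); // one observed time per measurement
    }
}
```

Only measurement 1 includes any lazy compilation, loading or resolution overhead; the later measurements repeat the identical calls.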

Before comparing MonoRT to the other test candidates, the gain of the different development steps is examined, starting with the JIT compilation mode up to the AOT-based Pre-Patch mode. Figure 7.9 compares the average observed execution times of MonoRT's Pre-Patch modes, the pre-compilation and the JIT compilation mode. The JIT compilation mode shows a “three-stage” behavior like in the instance methods benchmark. Measurement 1 has the longest execution times of approx. 221000 µs on average. Measurement 2 does not include the JIT compilation time, so that it reveals an average observed execution time of 7500 µs. Measurement 3 and 4 are at the pure execution time level, which is at circa 178 µs. In pre-compilation mode, the JIT compilation overhead is eliminated. So, measurement 1 and measurement 2 provide similar results, which are approx. 8300 µs. They are comparable to that of measurement 2 of the JIT compilation mode. That is, the direct impact of JIT compilation could be eliminated by pre-compilation. The pure execution times of 184 µs are reached at measurement 3 and measurement 4. They are minimally higher than in JIT compilation mode. The overhead of reference resolution is eliminated by MonoRT's Pre-Patch mode. Both the JIT-based and the AOT-based Pre-Patch mode provide deterministic execution times. They reach the level of pure execution times at first execution. This execution time behavior is suitable for real-time systems. Like in the benchmarks previously presented, the execution times of the AOT-based approach are higher than those of the JIT-based one, which is 207 µs vs. 187 µs, i.e., circa 10%, in this case.

Figure 7.9: Average observed execution times of benchmark with 1000 class methods run by MonoRT, IA-32

                                 RET      RSD
MonoRT, AOT Pre-Patch (IA-32)    1,0104   1,0329
Mono Full-AOT (ARM)              12,9413  8,9191
.NET PRC (IA-32)                 1,0079   1,5349
.NET AOT (IA-32)                 1,7446   4,1203
IBM WebSphere (IA-32)            2,4426   1,0738
JamaicaVM (IA-32)                8,2132   57,6585

Table 7.4: RET and RSD of benchmark with 1000 class methods

Figure 7.10 summarizes the results of the other test candidates. Along with Figure 7.9, it indicates that MonoRT provides the most “stable” execution times. They do not suffer from one-time overhead. Of the other test candidates, .NET's PRC mode provides the most real-time suitable results. It is remarkable that the CLI implementations provide better performance than their Java counterparts. Whereas the observed execution times of MonoRT and .NET (PRC mode) are in the range of 207 µs to 227 µs, IBM WebSphere's results start at 1222 µs and JamaicaVM exhibits even higher execution times. The performance provided by MonoRT and by .NET in PRC mode is comparable to the non-real-time case, see Figure 7.9. Figure 7.11 shows the standard deviations of the observed execution times.

The results are similar to those of sections 7.2.2 and 7.2.3. MonoRT, .NET in PRC mode and IBM WebSphere provide the best results, because their standard deviations are low (compared to the other test candidates) and quite uniform. While Figure 7.10 rather examines development objective O.3 (performance), Table 7.4 considers development objective O.1 (determinism).



Figure 7.10: Observed execution times of benchmark with 1000 class methods

Standard deviation of the observed execution times in µs (measurements 1–4):

MonoRT (IA-32):          10,679    10,872   11,488   10,89
Mono Full-AOT (ARM):     122,351   221,13   48,253   43,921
.NET PRC (IA-32):        8,945     14,133   14,432   7,899
.NET AOT (IA-32):        11,835    11,48    7,415    6,323
IBM WebSphere (IA-32):   18,217    17,172   17,566   17,917
JamaicaVM (IA-32):       5948,721  119,692  138,295  165,354

Figure 7.11: Standard Deviation of execution times of benchmark with 1000 class methods


Startup times in ms:

MonoRT JIT Pre-Patch (IA-32):   31238,2455
MonoRT AOT Pre-Patch (IA-32):   25372,212
Mono Full-AOT (ARM):            5255,965
.NET PRC (IA-32):               3108,759
.NET AOT (IA-32):               2453,5905
IBM WebSphere (IA-32):          320,168
JamaicaVM (IA-32):              588,205

Figure 7.12: Startup time of benchmark with 1000 class methods

MonoRT and .NET provide the best results. That is, the observed execution times and standard deviations of benchmarks run with them have the least jitter of all tested VMs. MonoRT's level of determinism is comparable to that of the dedicated real-time solutions. So, development objective O.1 is met for this use case. Figure 7.12 compares the startup times of the test candidates. The results are similar to those of section 7.2.3. The startup time of MonoRT's Pre-Patch mode could be reduced by around 20% by using the AOT compiler instead of the JIT as code generator. Execution modes that provide the highest determinism, e.g., MonoRT and .NET in PRC mode, have the highest startup times.

7.2.5 Type Initializer benchmark

This section presents the results of a benchmark that calls the methods via an instance of another class, which implements the method. These classes have a static variable, and this variable is initialized by a type initializer. This benchmark extends that of section 7.2.4. It is intended to reveal overhead, if any, that is especially related to class or type initialization. Listing 7.3 shows the access pattern of the methods. There are 1000 classes in the benchmark code. An instance of each class is created in the benchmark. The method of each of the 1000 class instances is called per measurement, so that the execution time of all 1000 methods is considered. The creation time of the class instances is not considered. Each of the 1000 classes implements one public method and a type initializer. The type initializer assigns an arbitrary constant value to a static variable. The explicit type initializer introduces another potential source of temporal non-determinism, because it is called only once. When the type initializer is executed depends on the class and/or code loading strategy. This might affect the temporal determinism of an application negatively from a real-time point of view.

108 7.2 Comparative Experiments

class Class0 {
    main() {
        cls0 = new Class0();
        cls1 = new Class1();
        //...
        cls999 = new Class999();
        // begin first measurement
        cls0.method0();
        cls1.method1();
        //...
    }
    method0(int var0, var1) { //.. }
}

class Class1 {
    static int stvr;
    static mthdcls1() { stvr = 1; }
    method1(int var0, var1) { //.. }
}

Listing 7.3: Pseudo code pattern of type initializer benchmark
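The one-time nature of a type initializer is easy to observe in Java, where a static initializer block runs exactly once, at the first active use of the class. This demo is ours, not part of the benchmark suite:

```java
public class TypeInitDemo {
    static int initCount = 0;

    static class Lazy {
        static int stvr;
        static {               // type initializer: executed once, at first use
            initCount++;
            stvr = 1;
        }
        static int method1() { return stvr; }
    }

    public static void main(String[] args) {
        Lazy.method1();        // first use: triggers the initializer
        Lazy.method1();        // initializer does not run again
        System.out.println(initCount); // prints 1
    }
}
```

Whether this one-time work happens inside or before the first measurement is exactly what distinguishes the execution modes compared below.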

Before comparing MonoRT to the other test candidates, the gain of the different development steps is examined, starting with the JIT compilation mode up to the AOT-based Pre-Patch mode. Figure 7.13 compares the average observed execution times of MonoRT's Pre-Patch modes, the pre-compilation and the JIT compilation mode. The behavior of the observed execution times is similar to that of the class methods (section 7.2.4) and instance methods benchmark (section 7.2.2). The comparison of the different execution modes in this benchmark shows that the temporal non-determinism due to the lazy compilation principle is eliminated by pre-compilation and Pre-Patch. It is remarkable that the observed execution times are higher than those of the class methods benchmark, although the benchmark structure differs only in the static class variable and the explicit type initializer. The difference is striking when the results of the JIT- and AOT-based Pre-Patch modes are compared, see Figure 7.9 and Figure 7.13. The AOT-based results are circa 9% higher than the JIT-based ones (207 vs. 190 µs) in the class methods benchmark. Here, the AOT-based results are circa 79% higher (535 vs. 299 µs). The first execution in JIT compilation mode takes longer in this benchmark (circa 252000 µs) than in the class methods benchmark (circa 221000 µs). The type initializer affects the execution times, but the additional runtime overhead can be eliminated by means of pre-compilation and Pre-Patch.



Figure 7.13: Average observed execution times of benchmark with 1000 class methods and type initializer run by MonoRT, IA-32


Figure 7.14: Observed execution times of benchmark with 1000 class methods and type initializer


                                 RET        RSD
MonoRT, AOT Pre-Patch (IA-32)    1,009722   1,041025
Mono Full-AOT (ARM)              28,946793  51,444431
.NET PRC (IA-32)                 1,429687   1,502499
.NET AOT (IA-32)                 1,426868   1,351065
IBM WebSphere (IA-32)            6,722036   4,378658
JamaicaVM (IA-32)                8,546627   110,058933

Table 7.5: RET and RSD of benchmark with 1000 class methods and type initializer

Figure 7.14 summarizes the results of the other test candidates. Along with Figure 7.13, it indicates that MonoRT provides the most “stable” execution times. All other test candidates suffer from non-uniform execution times, which indicate one-time overhead during the execution of the benchmark. Figure 7.15 shows the standard deviations of the observed execution times. The results are similar to those of the previous sections. MonoRT and .NET provide the best results, because their standard deviations are low (compared to the other test candidates) and quite uniform.

Standard deviation of the observed execution times in µs (measurements 1–4):

MonoRT (IA-32):          17,393        17,205        17,068       16,992
Mono Full-AOT (ARM):     1185,501      162,87        68,232       55,029
.NET PRC (IA-32):        19,374        14,034        13,996       15,15
.NET AOT (IA-32):        18,795        17,301        19,627       18,966
IBM WebSphere (IA-32):   65,64914352   34,46202594   24,9005363   29,88114456
JamaicaVM (IA-32):       8153,461927   97,81728176   118,3541776  106,7447266

Figure 7.15: Standard Deviation of execution times of benchmark with 1000 class methods and type initializer

While Figure 7.14 rather examines development objective O.3 (performance), Table 7.5 considers development objective O.1 (determinism). MonoRT provides the best results. Its observed execution times and standard deviations suffer from the least jitter of all tested VMs. So, development objective O.1 is met for this use case. Figure 7.16 compares the startup times of the test candidates. These results are also similar to those of the previous sections. The startup time of the Pre-Patch mode could be reduced by circa 20% by the use of the AOT compiler as code generator.


Startup times in ms:

MonoRT JIT Pre-Patch (IA-32):   31778,473
MonoRT AOT Pre-Patch (IA-32):   25476,92
Mono Full-AOT (ARM):            3824,373
.NET PRC (IA-32):               3089,554
.NET AOT (IA-32):               2476,716
IBM WebSphere (IA-32):          326,936
JamaicaVM (IA-32):              606,438

Figure 7.16: Startup time of benchmark with 1000 class methods and type initializer

7.2.6 Static Methods benchmark

This section presents the results of a benchmark that calls static methods of a class. It is intended to reveal overhead, if any, that is especially related to method resolution within a class. Listing 7.4 shows the access pattern of the methods. There is one class in the benchmark code, which implements 1000 static methods. It is not necessary to create an instance of the class in order to call such a method. There is neither an explicit type initializer nor static variables. The difference between this benchmark and that of section 7.2.4 is that a method does not have to be referenced via an object, i.e., an instance of a class. Instead, it is resolved via the type. These 1000 methods are called per measurement, so that the execution time of all 1000 methods is considered.

class Class0 {
    main() {
        // begin first measurement
        method0();
        method1();
        //...
    }
    static method0(int var0, var1) { //.. }
    static method1(int var0, var1) { //.. }
}

Listing 7.4: Pseudo code pattern of static methods benchmark


Before comparing MonoRT to the other test candidates, the gain of the different development steps is examined again. Figure 7.17 compares the average observed execution times of MonoRT's Pre-Patch modes, the pre-compilation and the JIT compilation mode.


Figure 7.17: Average observed execution times of benchmark with 1000 static methods run by MonoRT, IA-32

For MonoRT, the results are similar to those of the other benchmarks. They are almost identical to those of the instance methods benchmark, see section 7.2.2. The pre-allocation of native code and the elimination of indirect references allow for an execution time behavior that is suitable for real-time systems. A further analysis is omitted at this point; the reader is referred to section 7.2.2. Figure 7.18 summarizes the results of the other test candidates. Unlike MonoRT, some of the other test candidates show results that differ from those of section 7.2.2. .NET in PRC mode provides results that are, from a real-time point of view, as good as MonoRT's. IBM WebSphere does not reach the determinism of the instance methods benchmark, because the first measurement reveals one-time overhead. Figure 7.19 shows the standard deviations of the observed execution times. Table 7.6 considers development objective O.1 (determinism) in compressed form.

                                 RET        RSD
MonoRT, AOT Pre-Patch (IA-32)    1,012549   1,1288723
Mono Full-AOT (ARM)              30,911325  20,960775
.NET PRC (IA-32)                 1,005502   1,576056
.NET AOT (IA-32)                 3,058988   4,821486
IBM WebSphere (IA-32)            80,615294  2,877008
JamaicaVM (IA-32)                12,368473  99,955581

Table 7.6: RET and RSD of benchmark with 1000 static methods

MonoRT provides very good results. The observed execution times and standard deviations of benchmarks suffer from little jitter. So, development objective O.1 is met for



Figure 7.18: Observed execution times of benchmark with 1000 static methods

Standard deviation of the observed execution times in µs (measurements 1–4):

MonoRT (IA-32):          11,959    11,484   11,241   11,6739
Mono Full-AOT (ARM):     136,748   125,41   34,468   30,438
.NET PRC (IA-32):        6,086     6,903    6,609    7,785
.NET AOT (IA-32):        12,584    14,735   6,855    8,377
IBM WebSphere (IA-32):   54,297    20,014   19,581   19,877
JamaicaVM (IA-32):       5302,794  101,906  129,201  154,396

Figure 7.19: Standard Deviation of execution times of benchmark with 1000 static methods

this use case. The results of .NET in PRC mode also confirm the approach chosen by MonoRT. Figure 7.20 compares the startup times of the test candidates. These results are also similar to those of the previous sections. The startup time of MonoRT's Pre-Patch mode could be reduced by 24% by the use of the AOT compiler as code generator, compared to using the JIT as code generator. This is still more than the startup time of the other test candidates. However, the solutions that provide the most deterministic results also have the longest startup times. The approach to reduce the startup time, which is described in section 6.3, is able to eliminate this problem.

Startup times in ms:

MonoRT JIT Pre-Patch (IA-32):   26737,5355
MonoRT AOT Pre-Patch (IA-32):   20329,8715
Mono Full-AOT (ARM):            4806,5395
.NET PRC (IA-32):               2974,4835
.NET AOT (IA-32):               2097,205
IBM WebSphere (IA-32):          224,3395
JamaicaVM (IA-32):              597,4885

Figure 7.20: Startup time of benchmark with 1000 static methods

7.2.7 Static Class Methods benchmark

Section 7.2.6 examines the behavior when static methods of one class are executed. Section 7.2.4 examines the execution of non-static methods of other classes. Some test candidates show different behavior on these two benchmarks, so that it is worth evaluating the combination of both. This section presents the results of a benchmark that calls static methods of different classes. Listing 7.5 shows the access pattern of the methods. Figure 7.21 compares the average observed execution times of MonoRT's Pre-Patch modes, the pre-compilation and the JIT compilation mode. The results are similar to those of the other benchmarks. The four execution modes reflect the development steps, which are presented in this thesis. The direct effect of lazy compilation on the execution times can be eliminated by the pre-compilation – more generically: the pre-allocation – of native code from intermediate code. Temporal non-determinism due to indirect references is eliminated by resolving the indirect references ahead of time, i.e., Pre-Patch. That is, MonoRT fulfills development objective O.1 (determinism) also for this use case.


class Class0 {
    main() {
        // begin first measurement
        Class0.method0();
        Class1.method1();
        //...
    }
    static method0(int var0, var1) { //.. }
}

class Class1 {
    static method1(int var0, var1) { //.. }
}

Listing 7.5: Pseudo code pattern of static class methods benchmark

[Bar chart, logarithmic scale, measurements 1–4 per mode: MonoRT JIT shows a first-measurement value of about 265,702.5 µs; three further early-measurement values lie between about 7,505.9 and 8,562.5 µs; all remaining values lie between about 139.5 and 192.5 µs.]

Figure 7.21: Average observed execution times of benchmark with 1000 static class meth- ods run by MonoRT, IA-32

Figure 7.22 summarizes the results of the other test candidates. MonoRT and its Pre-Patch mode provide results that are better than those of the dedicated real-time solutions, i.e., IBM WebSphere and JamaicaVM. Also .NET in RPC mode shows execution times that reveal neither one-time overhead nor other temporal non-determinism. This proves the concept taken by MonoRT. .NET's AOT mode does not provide the same execution time behavior: there, the first two executions suffer from overhead. The results of IBM WebSphere in this benchmark differ from those of the other benchmarks. It is notable that IBM WebSphere shows overhead at the first and the second measurement. That is similar to its behavior in the interface method benchmark, see section 7.2.3.


[Bar chart, logarithmic scale, measurements 1–4 per candidate: MonoRT AOT Pre-Patch stays between about 169.3 and 170.0 µs across all four measurements; IBM WebSphere shows a first-measurement value of about 4,529,776.9 µs; the values of the remaining candidates range from about 190 µs to about 23,765.5 µs.]

Figure 7.22: Observed execution times of benchmark with 1000 static class methods

Table 7.7 considers development objective O.1 (determinism) in compressed form and confirms the impression received from Figure 7.22. MonoRT and .NET provide very good results from a real-time point of view.

Test candidate                     RET            RSD
MonoRT, AOT Pre-Patch (IA-32)      1.004328       1.023851
Mono Full-AOT (ARM)                49.221577      173.514082
.NET RPC (IA-32)                   1.008926       1.638864
.NET AOT (IA-32)                   2.12257        3.137548
IBM WebSphere (IA-32)              87815.14073    22322.38569
JamaicaVM (IA-32)                  11.589673      87.248643

Table 7.7: RET and RSD of benchmark with 1000 static class methods
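The exact definitions of RET and RSD are given earlier in the thesis. As a sketch under our own assumption (not a definition taken from this chapter), RET can be read as the ratio of the largest to the smallest per-measurement average execution time and RSD as the analogous ratio of standard deviations, which is consistent with the near-1 values of the MonoRT row; the class and method names below are ours:

```java
// Hedged sketch of RET/RSD-style ratio metrics, assuming
// RET = max/min of the per-run mean execution times and
// RSD = max/min of the per-run standard deviations.
import java.util.Arrays;

public class Ratios {
    static double mean(double[] xs) {
        return Arrays.stream(xs).average().orElse(0.0);
    }

    static double stdDev(double[] xs) {
        double m = mean(xs);
        return Math.sqrt(Arrays.stream(xs)
                .map(x -> (x - m) * (x - m)).average().orElse(0.0));
    }

    /** Ratio of the largest to the smallest per-run mean execution time. */
    static double ret(double[][] runs) {
        double[] means = Arrays.stream(runs).mapToDouble(Ratios::mean).toArray();
        return Arrays.stream(means).max().getAsDouble()
                / Arrays.stream(means).min().getAsDouble();
    }

    /** Ratio of the largest to the smallest per-run standard deviation. */
    static double rsd(double[][] runs) {
        double[] sds = Arrays.stream(runs).mapToDouble(Ratios::stdDev).toArray();
        return Arrays.stream(sds).max().getAsDouble()
                / Arrays.stream(sds).min().getAsDouble();
    }

    public static void main(String[] args) {
        double[][] runs = {
            {170.1, 169.9, 170.0},   // measurement 1 (illustrative values)
            {169.4, 169.3, 169.5},   // measurement 2
        };
        System.out.printf("RET = %.4f%n", ret(runs));
    }
}
```

A deterministic candidate yields RET and RSD close to 1, while one-time compilation or patching overhead in an early measurement inflates both ratios.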

Figure 7.23 compares the startup times of the test candidates. These results confirm that the solutions which provide the most execution time determinism also have the highest startup times. The startup time of MonoRT's Pre-Patch mode could be reduced by 20% by using the AOT compiler instead of the JIT compiler as code generator.


[Bar chart: startup times in ms. MonoRT JIT Pre-Patch: 30,082.5; MonoRT AOT Pre-Patch: 24,083.8; the remaining candidates – Mono Full-AOT (ARM), .NET RPC, .NET AOT, IBM WebSphere and JamaicaVM (all IA-32) – lie between 239.6 and 5,004.4 ms.]

Figure 7.23: Startup time of benchmark with 1000 static class methods

7.2.8 Summary

Sections 7.2.2 to 7.2.7 show the results of comparative experiments with MonoRT and other virtualizing runtime environments that implement different native code generation approaches. A series of micro benchmarks is conducted. The properties examined are execution time determinism, performance and startup time. The benchmarks are designed to evaluate the effect of the native code allocation on the execution time determinism in particular. The comparison of MonoRT with mature real-time solutions shows that the approach developed in this thesis is suitable for real-time capable native code generation. .NET's RPC mode confirms this, too. Although the CLI VM Mono, which is not a real-time execution environment, is used as the basis to implement the approach, it provides deterministic execution times. That is, the approach can be combined with other real-time programming aspects, such as special programming features (timers, synchronization, etc.) or a real-time suitable automatic memory management. These aspects are excluded from the benchmarks presented here, because the benchmarks examine the effect of native code generation on the execution times, performance and startup time only. It has to be noted that the implementation of the .NET RPC mode is at a rather high level. It does not allow fine control over compilation optimizations like inlining, which would bias the benchmark results. The same holds for the AOT compilation of IBM WebSphere Real-Time for RT Linux. However, AOT compilation, as performed by IBM WebSphere and by Mono's AOT subsystem, is intended to do intensive optimizations, as the AOT compilation time is not a major criterion in a real-world use case. The approach taken by MonoRT results in a longer startup time than the other solutions and approaches, respectively. Section 6.3 describes an approach that eliminates this problem.
The previous sections do not include results with that solution, because it is not available on the IA-32 platform, the platform that provides full support for real-time suitable native code generation.

The source code of MonoRT as well as the source code and the results of the comparative benchmarks are stored on the enclosed CD or can be downloaded at: github.com/mdae/MonoRT. The CD and the repository also include a manual on how to build and install MonoRT as well as on how to build and run the comparative and internal benchmarks.


8 Summary and Outlook

8.1 Discussion

Section 4.2 introduces the concept of real-time suitable code generation and discusses the advantages of this approach over existing ones for certain use cases. Chapters 5 to 7 describe and evaluate the proof of concept by means of the modification of the CLI VM "Mono" towards "MonoRT". The lessons learned from this allow a discussion of the generality of the approach and of how it can be applied to other VMs. There is no standardized interface at VM code level at which to hook in the code generation routines. Figure 4.1 illustrates that the code loader and the code generator have to be utilized. VMkit [44] provides a substrate for building VMs; it builds the VM around the code generator and not vice versa. Some high-level languages even provide an interface for VM-internal code generation handling, see section 4.3.2. However, it does not make a big difference whether the interface is at VM source code level or whether it is even provided by a high-level language. It is important that it gives "sufficient" control over the code generation process. In this context, this means that all potentially executed native code can be generated. For example, if a VM does not require special wrapper methods in order to invoke application methods from within the VM, there is no need to generate wrapper methods. The Mono VM makes extensive use of wrappers, so they had to be identified and generated during pre-compilation. It is an individual task to identify and engineer the hooks, because most HLL VMs are not designed with real-time code execution in mind. The handling of indirect references depends on the VM-specific implementation, too. For example, Mono's JIT mode uses direct references and back-patching at run time, while Mono's AOT mode also handles indirect references via a table. The JVM Quicksilver also uses a table to resolve indirect references. This table is initialized lazily at run time. The use of indirect references is coupled to the native code generator.
It is also a valid solution to modify the code generator so that it does not emit indirect references, at least for a certain part of the native code. This blurs the border between lazy compilation and the conventional compile-and-link approach. To sum up, it is specific to the VM where to hook in to manipulate the code generation process and which strategy causes the lowest implementation effort. From an implementation point of view, the lowest common denominator for applying the real-time code generation approach to another VM is that all necessary code and meta information are available at code generation time. As the approach can rely on the original code generator, e.g., a JIT compiler, there is no need to modify other parts of the VM. Experiences from VMkit [44] and ErLLVM – an LLVM backend for Erlang's JIT compiler High Performance Erlang (HiPE) [98] – confirm that there is a tight coupling between the code generator and the VM services. A replacement of the code generator might also imply modifications to VM services, e.g., to the memory management.
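The difference between lazy back-patching and resolving all indirect references ahead of time can be illustrated with a small, self-contained model (our own sketch; the class and method names are not taken from MonoRT): each slot of a table stands in for an indirect reference and initially points to a resolver stub that back-patches the slot on first use, while prePatchAll() resolves every slot before run time.

```java
// Simplified model of lazy back-patching vs. Pre-Patch (illustrative only;
// these names are ours, not MonoRT's). A slot table stands in for the
// indirect references ("Patches") that the code generator emits.
import java.util.function.IntSupplier;

public class PatchTable {
    final IntSupplier[] slots;       // one indirect reference per call target
    final IntSupplier[] targets;     // what the references should point to
    int lazyResolutions = 0;         // number of run-time patching events

    PatchTable(IntSupplier[] realTargets) {
        targets = realTargets;
        slots = new IntSupplier[realTargets.length];
        for (int i = 0; i < realTargets.length; i++) {
            final int idx = i;
            // Initially each slot holds a resolver stub: on first use it
            // patches the slot to the real target (lazy back-patching).
            slots[idx] = () -> {
                lazyResolutions++;
                slots[idx] = targets[idx];
                return slots[idx].getAsInt();
            };
        }
    }

    /** Pre-Patch: resolve all indirect references before run time. */
    void prePatchAll() {
        for (int i = 0; i < slots.length; i++) slots[i] = targets[i];
    }

    int call(int i) { return slots[i].getAsInt(); }

    /** Convenience factory that builds constant call targets. */
    static PatchTable of(int... values) {
        IntSupplier[] t = new IntSupplier[values.length];
        for (int i = 0; i < values.length; i++) {
            final int v = values[i];
            t[i] = () -> v;
        }
        return new PatchTable(t);
    }

    public static void main(String[] args) {
        PatchTable table = PatchTable.of(1, 2);
        table.prePatchAll();             // initialization phase
        System.out.println(table.call(0)
                + ", run-time resolutions: " + table.lazyResolutions);
    }
}
```

With prePatchAll() executed during the initialization phase, lazyResolutions stays at zero for the whole measurement, which is exactly the property that Pre-Patch establishes for real native code.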


Some (high-level) language policies hamper the real-time code generation approach presented here. For example, in a CLI VM it can be valid to run a type initializer before the first reference to that type. This is not allowed in a JVM. While this does not contradict the pre-compilation approach per se, deferring the execution of the type initializer does not make sense from a real-time point of view. So, a JVM would have to be modified. Such policies might affect the initialization of internal data structures, e.g., the virtual tables of classes. It is an individual task to handle these issues as well. It might also make sense to drop the support for some language features for the sake of simplicity. Section 7.1.3 mentions that "Generics" are not supported in MonoRT. It would also be questionable to augment the "Dynamic Software Updating" (DSU) feature of Erlang with real-time code generation. DSU maintains lists of call sites in the system. If a piece of code, e.g., a method – called a "module" in Erlang – is updated, the references to that object are updated, too [56]. This introduces a source of temporal non-determinism.
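As a minimal JVM-side illustration of this policy (the class names are ours): a static initializer runs only at one of the triggers the Java specification defines, but it can at least be front-loaded into a startup phase via reflective initialization, so that the first real use pays no initialization cost.

```java
// Illustrates the JVM rule that a type initializer runs only at first
// active use, and how it can be front-loaded into a startup phase.
// Class names are illustrative.
public class InitTiming {
    static boolean workerInitialized = false;

    static class Worker {
        static { workerInitialized = true; }        // type initializer
        static int work() { return 42; }
    }

    public static void main(String[] args) {
        try {
            ClassLoader cl = InitTiming.class.getClassLoader();
            // Merely loading the class does not run its initializer ...
            Class.forName("InitTiming$Worker", false, cl);
            System.out.println("after load: " + workerInitialized);   // false
            // ... but reflective initialization in a startup phase does,
            // so the first "real" call below pays no initialization cost.
            Class.forName("InitTiming$Worker", true, cl);
            System.out.println("after init: " + workerInitialized);   // true
            System.out.println(Worker.work());
        } catch (ClassNotFoundException e) {
            throw new AssertionError(e);
        }
    }
}
```

A CLI VM may run the initializer even earlier; a JVM, by contrast, cannot run it before one of these triggers without deviating from its specification.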

8.2 Conclusion

Applications that are developed with modern general purpose languages like Java or C# are typically deployed via platform-independent intermediate code. The instructions of this intermediate code are usually executed by an additional computer program – a virtualizing runtime environment. In doing so, the virtualizing runtime environment has to provide a representation of the intermediate code that can be executed directly by the underlying real computation system. One of the benefits of this approach is the increased portability: an application does not have to be adapted to different target systems. The transformation from intermediate code into native code is typically done lazily on demand. This contradicts the requirements of a real-time system, where the temporal determinism of code execution is crucial by definition. This determinism cannot be guaranteed if an application's native code is allocated at run time. This thesis examines the native code generation in virtualizing runtime environments with regard to meeting real-time requirements. Section 1.3 points out the development objectives

• high execution time determinism
• transparency to the programmer and retained portability
• no negative performance impact
• low startup time

that a solution has to meet in order to gain acceptance. The topic is broken down into three parts.

Existing approaches and solutions are surveyed in chapter 3. These solutions are evaluated in section 4.1. Based on this evaluation, a concept for real-time capable native code generation in virtualizing runtime environments is developed, see section 4.2. The approach provides for a pre-allocation of native code in an initialization phase of an application. This incorporates the standard native code generator of a virtualizing runtime environment. This decision was made in order to achieve portability and good performance in particular. The open-source CLI implementation Mono serves for the implementation and proof of concept. Mono's JIT compiler is used to generate the native code.

Experiments revealed that a sole pre-allocation of native code is not sufficient to provide deterministic execution times. The experiments are realized by a benchmark framework, which is developed to examine execution time determinism, performance and application startup time. The standard code generator, which implements the principle of lazy compilation, inserts indirect references into the native code, so-called "Patches". The resolving of these indirect references at application run time introduces temporal non-determinism, which is not suitable for real-time systems. So, the second step is the resolving of indirect references in native code before run time. This is described in chapter 5. The Patches, which are emitted by the native code generator, are modified and resolved just before the application starts. This allows running applications without the need to generate or modify native code at run time, so that real-time suitable execution time determinism is reached.

While the solution provides high execution time determinism, it suffers from an increased startup time. The third step handles the reduction of the startup time, see chapter 6. Several approaches are examined. Two approaches turned out to be suitable with regard to the development objectives. One possibility is to reduce the amount of native code to generate. The most suitable solution applies the incorporation of an AOT compiler. An AOT compiler is used to translate the intermediate code to native code in an offline step. This time does not count as startup time, because it does not have to be spent at each application startup. The AOT compiled native code is loaded instead of being generated by a JIT compiler. The AOT compiled code is position independent – in contrast to the JIT compiled code. The resolving of the indirect references is adapted to handle AOT compiled code. The startup time could be reduced by circa 20% in the benchmark tests. A second – and more holistic – approach is to take a checkpoint of the application right after the allocation and optimization of the native code. This checkpoint is restored when the application is going to be run, which reduces the startup time to that of native applications. However, the external checkpoint and restore tool used does not yet support the platform that was used to implement the real-time suitable native code generation. So, this approach could be evaluated only partially.
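The checkpoint-based approach can be sketched with CRIU (Checkpoint/Restore In Userspace, see the list of abbreviations); the process name and the image directory below are placeholders, not values from the evaluation setup:

```shell
# Sketch of the checkpoint/restore idea with CRIU (paths are placeholders).
# 1. Start the VM, let it pre-allocate and pre-patch all native code,
#    then freeze the fully initialized process into an image directory:
criu dump --tree "$(pidof monort-app)" --images-dir /var/lib/monort-img --shell-job

# 2. On every subsequent "startup", restore the initialized process
#    instead of repeating code generation and patch resolution:
criu restore --images-dir /var/lib/monort-img --shell-job
```

The restore step replaces the entire initialization phase, which is why this approach reduces the startup time to roughly that of a native application.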

The modified virtualizing runtime environment is tested extensively for functional correctness and in order to verify that the real-time native code generation works as expected, see section 7.1. The corner cases are pointed out and solutions are suggested. Comparative experiments, which use the benchmark framework developed, show that the modified Mono VM – MonoRT – provides at least as deterministic results as dedicated real-time solutions, see section 7.2.


8.3 Outlook

This thesis considers the allocation of the resource "code" in virtualizing runtime environments, so that it is suitable for real-time systems. A real-time system needs determinism in a holistic sense. So, the approach developed here can be applied to build a real-time system, but it alone is not sufficient. Especially ECMA-335 needs a formal basis similar to the RTSJ, which defines real-time programming features like timers, scheduling mechanisms, memory handling, etc. For example, the current regulation of the type initializer execution might contradict real-time programming in some cases. A real-time specification might also involve the reduction of programming language features, as discussed in section 8.1. Related to the implementation of the approach by means of MonoRT, a future task is to switch to a newer version of Mono. This would allow incorporating the Full-AOT mode as the basis for the pre-allocation of native code. It is expected to reduce the startup time by more than 20% compared to using the JIT compiler as code generator. Languages that provide generic programming features have to be supported by applying appropriate techniques. The MonoRT VM has to be augmented with further real-time programming features like a real-time capable threading and scheduling system, including prioritization and synchronization, and also a real-time capable memory management. Its openness predestines it to act as a research VM for further work. As part of switching to a newer version of Mono, the code of the real-time code generation extension would have to be re-factored. My understanding of large software projects and my coding skills improved during the work on this topic. Early modifications from the development of MonoRT look somewhat odd in some cases, and the evolution of the coding skills is apparent, e.g., when comparing the implementation of pre-compilation and Pre-Patch.

List of Abbreviations

ABI .......... Application Binary Interface
ALSR ......... Address Space Layout Randomization
AOT .......... Ahead-of-Time
API .......... Application Programming Interface
CDC .......... Connected Device Configuration
CIL .......... Common Intermediate Language
CLDC ......... Connected Limited Device Configuration
CLI .......... Common Language Infrastructure
COFF ......... Common Object File Format
COM .......... Component Object Model
CPU .......... Central Processing Unit
CRIU ......... Checkpoint/Restore In Userspace
DSP .......... Digital Signal Processor
DSU .......... Dynamic Software Updating
ELF .......... Executable and Linkable Format
FPGA ......... Field-Programmable Gate Array
GC ........... Garbage Collector
GCC .......... GNU Compiler Collection
GOT .......... Global Offset Table
HLL .......... High-Level Language
IJW .......... It Just Works
IMT .......... Interface Method Table
ISA .......... Instruction Set Architecture
ITC .......... Initialization Time Compilation
JavaME ....... Java Micro Edition
JIT .......... Just-in-Time
JNI .......... Java Native Interface
JRE .......... Java Runtime Environment
LLVM ......... Low-Level Virtual Machine
LLVM IR ...... LLVM Intermediate Representation
MonoRT ....... Mono Real-Time
Ngen ......... Native Image Generator
OS ........... Operating System
PIC .......... Position-Independent Code
PLC .......... Programmable Logic Controller
PLT .......... Procedure Linkage Table
QSI .......... Quasi Static Image
ROET ......... Ratio of Ordered Execution Times


ROSD ......... Ratio of Ordered Standard Deviations
RPC .......... Reflection Pre-Compilation
RTCE ......... Real-Time Core Extensions
RTL .......... Register Transfer Language
RTSJ ......... Real-Time Specification for Java
SCIL ......... Simplified Common Intermediate Language
SIM .......... Subscriber Identity Module
VES .......... Virtual Execution System
VM ........... Virtual Machine
VTable ....... Virtual Table
WCET ......... Worst Case Execution Time

Bibliography

[1] JSR 139 - Connected Limited Device Configuration (CLDC), Version 1.1. www.jcp.org/en/jsr/detail?id=139 (retrieved June 27, 2014), March 2003.
[2] JSR 218 - Connected Device Configuration (CDC), Version 1.1.2. www.jcp.org/en/jsr/detail?id=218 (retrieved June 27, 2014), March 2003.
[3] Standard ECMA-334 - C# Language Specification. ECMA International, 2006.
[4] Standard ECMA-335 - Common Language Infrastructure (CLI). ECMA International, 4th edition, 2006.
[5] CrossNet. crossnet.codeplex.com (retrieved June 27, 2014), April 2009.
[6] clang: a C language family frontend for LLVM. clang.llvm.org/ (retrieved September 10, 2013), 2013.
[7] Cosmos – C# Open Source Managed Operating System. cosmos.codeplex.com (retrieved March 13, 2012), January 2013.
[8] The LLVM Compiler Infrastructure. llvm.org/ (retrieved September 10, 2013), 2013.
[9] Compiling Apps with .NET Native. msdn.microsoft.com/en-us/library/dn584397.aspx (retrieved April 28, 2014), 2014.
[10] Migrating Your Windows Store App to .NET Native. msdn.microsoft.com/en-us/library/dn600634(v=vs.110).aspx (retrieved April 28, 2014), 2014.
[11] Georg Acher. JIFFY - A FPGA-Based Java Just-in-Time Compiler for Embedded Applications. PhD thesis, Technical University Munich, 2003.
[12] aJile Systems, Inc. Real-Time Low-power Network Direct Execution Microprocessor for the Java™ Platform aJ-102. www.ajile.com/index.php?option=com_content&view=article&id=2&Itemid=6 (retrieved March 21, 2012), 2011.
[13] Apogee Software Inc. RTJRE. www.apogee.com/products/rtjre (retrieved September 5, 2013).
[14] ARM Ltd. Jazelle. www.arm.com/products/processors/technologies/jazelle.php (retrieved August 27, 2013), 2013.
[15] A. Armbruster, J. Baker, A. Cunei, C. Flack, D. Holmes, F. Pizlo, E. Pla, M. Prochazka, and J. Vitek. A Real-time Java Virtual Machine with Applications in Avionics. ACM Trans. Embed. Comput. Syst., 7(1):5:1–5:49, December 2007.
[16] J. Auerbach, D. Bacon, B. Blainey, P. Cheng, M. Dawson, M. Fulton, D. Grove, D. Hart, and M. Stoodley. Design and Implementation of a Comprehensive Real-Time Java Virtual Machine. In Proceedings of the 7th ACM & IEEE International Conference on Embedded Software, EMSOFT '07, pages 249–258, New York, NY, USA, 2007. ACM.


[17] John Aycock. A Brief History of Just-In-Time. ACM Comput. Surv., 35(2):97–113, June 2003.
[18] Michael Barr and Anthony Massa. Programming Embedded Systems. O'Reilly, 2nd edition, 2006.
[19] Eli Bendersky. Load-time relocation of shared libraries. eli.thegreenplace.net/2011/08/25/load-time-relocation-of-shared-libraries/ (retrieved June 27, 2014), August 2011.
[20] E. Bhatkar, Daniel C. Duvarney, and R. Sekar. Address Obfuscation: An Efficient Approach to Combat a Broad Range of Memory Error Exploits. In Proceedings of the 12th USENIX Security Symposium, pages 105–120, 2003.
[21] G. Bollella, B. Delsart, R. Guider, C. Lizzi, and F. Parain. Mackinac: Making HotSpot Real-Time. In ISORC '05, pages 45–54, 2005.
[22] S. Braun, M. Obermeier, J. Schmidt-Colinet, K. Eben, and M. Kissel. Notwendigkeit von Metriken für neue Programmiermethoden automatisierungstechnischer Anlagen. In Echtzeit, pages 11–20, 2010.
[23] Gregory Brill. Applying COM+. New Riders Publishing, Thousand Oaks, CA, USA, 1st edition, 2000.
[24] B. Brosgol and B. Dobbing. Real-time convergence of Ada and Java. Ada Lett., XXI(4):11–26, September 2001.
[25] Eric J. Bruno and Greg Bollella. Real-Time Java Programming: With Java RTS. Prentice Hall PTR, Upper Saddle River, NJ, USA, 1st edition, 2009.
[26] Luca Cardelli and Peter Wegner. On Understanding Types, Data Abstraction, and Polymorphism. ACM Comput. Surv., 17(4):471–523, December 1985.
[27] L. Carnahan and M. Ruark. Requirements for Real-Time Extensions for the Java™ Platform. www.itl.nist.gov/div897/ctg/real-time/rtj-final-draft. (retrieved June 27, 2014), September 1999.
[28] A. V. Chapyzhenka, D. V. Ragozin, and A. L. Umnov. Low-power Architecture for CLI-Code Hardware Processor. Problems in Programming, pages 20–38, 2005.
[29] T. I. S. Committee. Tool Interface Standard (TIS) Executable and Linking Format (ELF) Specification, Version 1.2, May 1995.
[30] Marco Cornero, Roberto Costa, Ricardo Fernández Pascual, Andrea C. Ornstein, and Erven Rohou. An Experimental Environment Validating the Suitability of CLI as an Effective Deployment Format for Embedded Systems. In Proceedings of the 3rd International Conference on High Performance Embedded Architectures and Compilers, HiPEAC '08, pages 130–144, Berlin, Heidelberg, 2008. Springer-Verlag.
[31] Angelo Corsaro and Douglas C. Schmidt. The Design and Performance of the jRate Real-Time Java Implementation. In On the Move to Meaningful Internet Systems, 2002 - DOA/CoopIS/ODBASE 2002 Confederated International Conferences DOA, CoopIS and ODBASE 2002, London, UK, 2002. Springer-Verlag.


[32] Angelo Corsaro and Douglas C. Schmidt. Evaluating Real-Time Java Features and Performance for Real-Time Embedded Systems. In Proceedings of the Eighth IEEE Real-Time and Embedded Technology and Applications Symposium, RTAS '02, pages 90–, Washington, DC, USA, 2002. IEEE Computer Society.
[33] Martin Däumler and Matthias Werner. Optimierung der Code-Generierung virtualisierender Ausführungsumgebungen zur Erzielung deterministischer Ausführungszeiten. In Wolfgang A. Halang and Peter Holleczek, editors, Kommunikation unter Echtzeitbedingungen, volume 1, pages 29–38. Springer Berlin Heidelberg, 2012.
[34] Martin Däumler and Matthias Werner. Reducing startup time of a deterministic virtualizing runtime environment. In Proceedings of the 16th International Workshop on Software and Compilers for Embedded Systems, M-SCOPES '13, pages 48–57, New York, NY, USA, 2013. ACM.
[35] F. de Bruin, F. Deladerriere, and F. Siebert. A standard Java virtual machine for real-time embedded systems. In Proceedings of Data Systems in Aerospace 2003, DASIA '03, 2003.
[36] Klaus Dembowski. Raspberry Pi - Das Handbuch: Konfiguration, Hardware, Applikationserstellung. Springer Fachmedien Wiesbaden, 2013.
[37] Donald Thompson and C. Miller. Introducing the .NET Micro Framework - Product Positioning and Technology White Paper, 2007.
[38] Robert Fitzgerald, Todd B. Knoblock, Erik Ruf, Bjarne Steensgaard, and David Tarditi. Marmot: an optimizing compiler for Java. Softw. Pract. Exper., 30(3):199–232, March 2000.
[39] J. Fleischmann, Klaus Buchenrieder, and R. Kress. Codesign of embedded systems based on Java and reconfigurable hardware components. In Design, Automation and Test in Europe Conference and Exhibition 1999, Proceedings, pages 768–769, March 1999.
[40] Free Software Foundation, Inc. The GNU Compiler for the Java™ Programming Language. www.gnu.org/software/gcc/java/ (retrieved March 21, 2012), 2011.
[41] Free Software Foundation, Inc. GCC, the GNU Compiler Collection. gcc.gnu.org (retrieved March 28, 2012), 2012.
[42] Mike Fulton and Mark Stoodley. Real time micro benchmark suite. sourceforge.net/projects/rtmicrobench/ (retrieved June 27, 2014), May 2013.
[43] Mike Fulton and Mark Stoodley. Compilation Techniques for Real-Time Java Programs. In Proceedings of the International Symposium on Code Generation and Optimization, CGO '07, pages 221–231, Washington, DC, USA, 2007. IEEE Computer Society.
[44] Nicolas Geoffray. Fostering Systems Research with Managed Runtimes. PhD thesis, Université Pierre et Marie Curie, 2009.


[45] James Gosling and Greg Bollella. The Real-Time Specification for Java™. Addison-Wesley Longman Publishing Co., Inc., 2000.
[46] James Gosling, Bill Joy, Guy Steele, and Gilad Bracha. The Java™ Language Specification. Addison-Wesley Professional, 2005.
[47] K. John Gough. Stacking Them Up: A Comparison of Virtual Machines. In Proceedings of the 6th Australasian Conference on Computer Systems Architecture, ACSAC '01, pages 55–61, Washington, DC, USA, 2001. IEEE Computer Society.
[48] David Hardin. Real-Time Objects on the Bare Metal: An Efficient Hardware Realization of the Java Virtual Machine. In Proceedings of the Fourth International Symposium on Object-Oriented Real-Time Distributed Computing, ISORC '01, Washington, DC, USA, 2001. IEEE Computer Society.
[49] Stefan Hepp and Martin Schoeberl. Worst-Case Execution Time Based Optimization of Real-Time Java Programs. In Proceedings of the 2012 IEEE 15th International Symposium on Object/Component/Service-Oriented Real-Time Distributed Computing, ISORC '12, pages 64–70, Washington, DC, USA, 2012. IEEE Computer Society.
[50] M. T. Higuera-Toledano and A. J. Wellings, editors. Distributed, Embedded and Real-time Java Systems. Springer US, 2012.
[51] M. Teresa Higuera-Toledano. About 15 years of real-time Java. In Proceedings of the 10th International Workshop on Java Technologies for Real-time and Embedded Systems, JTRES '12, pages 34–43, New York, NY, USA, 2012. ACM.
[52] SungHyun Hong, Jin-Chul Kim, Jin Woo Shin, Soo-Mook Moon, Hyeong-Seok Oh, Jaemok Lee, and Hyung-Kyu Choi. Java client ahead-of-time compiler for embedded systems. In Proceedings of the 2007 ACM SIGPLAN/SIGBED Conference on Languages, Compilers, and Tools for Embedded Systems, LCTES '07, pages 63–72, New York, NY, USA, 2007. ACM.
[53] Galen C. Hunt and James R. Larus. Singularity: rethinking the software stack. SIGOPS Oper. Syst. Rev., 41(2):37–49, April 2007.
[54] International Electrotechnical Commission. IEC 61131-3 Ed. 1.0 en:1993: Programmable controllers — Part 3: Programming languages. International Electrotechnical Commission, Geneva, Switzerland, 1993.
[55] Anders Ive. Towards an embedded real-time Java virtual machine. PhD thesis, Department of Computer Science, Lund Institute of Technology, 2003.
[56] Erik Johansson, Mikael Pettersson, Konstantinos Sagonas, and Thomas Lindgren. The development of the HiPE system: design and experience report. International Journal of Software Tools for Technology Transfer, 4, 2002.
[57] Pramod G. Joisha, Samuel P. Midkiff, Mauricio J. Serrano, and Manish Gupta. A Framework for Efficient Reuse of Binary Code in Java. In Proceedings of the 15th International Conference on Supercomputing, ICS '01, pages 440–453, New York, NY, USA, 2001. ACM.


[58] Vania Joloboff, Francois De Ferriere, Christian Fabre, Bertrand Delsart, Michael Weiss, Andrew Johnson, Fridtjof Siebert, Fred Roy, Xavier Spengler, and Frederick Hirsch. TurboJ, a Java Bytecode-to-Native Compiler. In ACM SIGPLAN Workshop on Languages, Compilers, and Tools for Embedded Systems, volume 1474/1998 of Lecture Notes in Computer Science, Montreal, Canada, 1998. Springer-Verlag.
[59] Jambunathan K. Interface method dispatch (IM table and thunks). monoruntime.wordpress.com/2009/04/22/interface-method-dispatch-im-table-and-thunks/ (retrieved June 27, 2014), April 2009.
[60] Jambunathan K. Magic (of) trampolines. monoruntime.wordpress.com/2009/04/13/magic-of-trampolines/ (retrieved June 27, 2014), April 2009.
[61] Tomas Kalibera, Jeff Hagelberg, Filip Pizlo, Ales Plsek, Ben Titzer, and Jan Vitek. CDx: a Family of Real-time Java Benchmarks. In Proceedings of the 7th International Workshop on Java Technologies for Real-Time and Embedded Systems, JTRES '09, pages 41–50, New York, NY, USA, 2009. ACM.
[62] Andrew Kennedy and Don Syme. Design and Implementation of Generics for the .NET Common Language Runtime. In Proceedings of the ACM SIGPLAN 2001 Conference on Programming Language Design and Implementation, PLDI '01, pages 1–12, New York, NY, USA, 2001. ACM.
[63] Andrew Kennedy and Don Syme. Combining Generics, Pre-compilation and Sharing between Software-Based Processes. research.microsoft.com/apps/pubs/default.aspx?id=69129 (retrieved June 27, 2014), January 2004.
[64] Andrew Kennedy and Don Syme. Pre-compilation for .NET Generics. citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.124.3911 (retrieved June 27, 2014), 2005.
[65] Brian W. Kernighan and Dennis M. Ritchie. The C Programming Language. Prentice Hall Professional Technical Reference, 2nd edition, 1988.
[66] Hermann Kopetz. Real-Time Systems: Design Principles for Distributed Embedded Applications. Springer Publishing Company, Incorporated, 2nd edition, 2011.
[67] Chandra Krintz. A Collection of Phoenix-Compatible C# Benchmarks. www.cs.ucsb.edu/~ckrintz/racelab/PhxCSBenchmarks/ (retrieved October 2, 2012).
[68] KW-Software. ProConOS eCLR. www.kw-software.com/en/iec-61131-control/runtime-systems/proconos-eclr (retrieved September 9, 2013), 2013.
[69] Chris Lattner and Vikram Adve. LLVM: A Compilation Framework for Lifelong Program Analysis & Transformation. In Proceedings of the International Symposium on Code Generation and Optimization: Feedback-directed and Runtime Optimization, CGO '04, Washington, DC, USA, 2004. IEEE Computer Society.
[70] John R. Levine. Linkers and Loaders. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1st edition, 1999.


[71] Joseph C. Libby and Kenneth B. Kent. An embedded implementation of the Common Language Infrastructure. J. Syst. Archit., 55(2):114–126, February 2009.
[72] Serge Lidin. Inside Microsoft .NET IL Assembler. Microsoft Press, Redmond, WA, USA, 2002.
[73] Tim Lindholm and Frank Yellin. The Java™ Virtual Machine Specification. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1999.
[74] Linux Kernel Organization, Inc. Linux Realtime Patches. www.kernel.org/pub/linux/kernel/projects/rt/ (retrieved March 21, 2012), 2012.
[75] Michael H. Lutz and Phillip A. Laplante. C# and the .NET Framework: Ready for Real Time? IEEE Softw., 20(1):74–80, January 2003.
[76] James Mc Enery, David Hickey, and Menouer Boubekeur. Empirical evaluation of two main-stream RTSJ implementations. In Proceedings of the 5th International Workshop on Java Technologies for Real-time and Embedded Systems, JTRES '07, pages 47–54, New York, NY, USA, 2007. ACM.
[77] Andre Meisel, Alexander Draeger, Sven Schneider, and Wolfram Hardt. Design flow for reconfiguration based on the overlaying concept. In Proceedings of the 19th IEEE International Symposium on Rapid System Prototyping, pages 89–95, Monterey, CA, USA, June 2008. IEEE Computer Society.
[78] Microsoft Corporation. Native Image Generator (Ngen.exe). msdn.microsoft.com/en-us/library/6t9t5wcf(v=vs.80).aspx (retrieved March 21, 2012), 2012.
[79] Microsoft Corporation. .NET Framework. www.microsoft.de/net (retrieved March 21, 2012), 2012.
[80] V. Mikheev, N. Lipsky, D. Gurchenkov, P. Pavlov, V. Sukharev, A. Markov, S. Kuksenko, S. Fedoseev, D. Leskov, and A. Yeryomin. Overview of Excelsior JET, a High Performance Alternative to Java Virtual Machines. In Proceedings of the 3rd International Workshop on Software and Performance, WOSP '02, pages 104–113, New York, NY, USA, 2002. ACM.
[81] Gilles Muller, Bárbara Moura, Fabrice Bellard, and Charles Consel. Harissa: a flexible and efficient Java environment mixing bytecode and compiled code. In Proceedings of the 3rd Conference on USENIX Conference on Object-Oriented Technologies (COOTS), Volume 3, COOTS '97, pages 1–1, Berkeley, CA, USA, 1997. USENIX Association.
[82] Nazomi Communications, Inc. JA108: Universal Java accelerator (product brief). java.epicentertech.com/Archive_Embedded/Nazomi/ja108_pb.pdf (retrieved August 27, 2013), 2001.
[83] Kevin Nilsen. Improving abstraction, encapsulation, and performance within mixed-mode real-time Java applications. In Proceedings of the 5th International Workshop on Java Technologies for Real-time and Embedded Systems, JTRES '07, pages 13–22, New York, NY, USA, 2007. ACM.


[84] Kevin Nilsen. Differentiating Features of the Aonix PERC™ Virtual Machine. www.aonix.com/pdf/PERCWhitePaper_e.pdf (retrieved March 21, 2012), 2009.
[85] A. Nilson and T. Enkman. Deterministic Java in Tiny Embedded Systems. In Proceedings of the Fourth International Symposium on Object-Oriented Real-Time Distributed Computing, ISORC ’01, pages 60–68, Washington, DC, USA, 2001. IEEE Computer Society.
[86] J. Michael O’Connor and Marc Tremblay. picoJava-I: The Java Virtual Machine in Hardware. IEEE Micro, 17(2):45–53, 1997.
[87] Hyeong-Seok Oh, Beom-Jun Kim, Hyung-Kyu Choi, and Soo-Mook Moon. Evaluation of Android Dalvik virtual machine. In Proceedings of the 10th International Workshop on Java Technologies for Real-time and Embedded Systems, JTRES ’12, pages 115–124, New York, NY, USA, 2012. ACM.
[88] The simple Real Time Java. www.rtjcom.com (retrieved March 21, 2012), 2010.
[89] Ricardo Fernández Pascual. GCC CIL Frontend. www.hipeac.net/system/files/Ricardo-Fernandez-061129.pdf (retrieved March 21, 2012), 2006. Final presentation, PhD internship.
[90] Filip Pizlo, Lukasz Ziarek, Ethan Blanton, Petr Maj, and Jan Vitek. High-level programming of embedded hard real-time devices. In Proceedings of the 5th European Conference on Computer Systems, EuroSys ’10, pages 69–82, New York, NY, USA, 2010. ACM.
[91] Filip Pizlo, Lukasz Ziarek, and Jan Vitek. Real Time Java on Resource-constrained Platforms with Fiji VM. In Proceedings of the 7th International Workshop on Java Technologies for Real-Time and Embedded Systems, JTRES ’09, pages 110–119, New York, NY, USA, 2009. ACM.
[92] Ales Plsek, Lei Zhao, Veysel H. Sahin, Daniel Tang, Tomas Kalibera, and Jan Vitek. Developing safety critical Java applications with oSCJ/L0. In Proceedings of the 8th International Workshop on Java Technologies for Real-Time and Embedded Systems, JTRES ’10, pages 95–101, New York, NY, USA, 2010. ACM.
[93] Todd A. Proebsting, Gregg Townsend, Patrick Bridges, John H. Hartman, Tim Newsham, and Scott A. Watterson. Toba: Java for Applications A Way Ahead of Time (WAT) Compiler. In Proceedings of the 3rd Conference on USENIX Conference on Object-Oriented Technologies (COOTS) - Volume 3, COOTS’97, Berkeley, CA, USA, 1997. USENIX Association.
[94] Bernhard Rabe. Self-contained CLI Assemblies. In Proceedings of .NET Technologies’2006, pages 67–74, 2006.
[95] Remotesoft Inc. Salamander .NET Linker, Native Compiler and Mini-Deployment Tool. www.remotesoft.com/linker (retrieved March 21, 2012), 2008.
[96] Stefan Richter, Andreas Rasche, and Andreas Polze. Hardware-Near Programming in the Common Language Infrastructure. In Proceedings of the 10th IEEE International Symposium on Object and Component-Oriented Real-Time Distributed Computing, ISORC ’07, pages 329–336, Washington, DC, USA, 2007. IEEE Computer Society.
[97] Simon Robinson. Expert .NET 1.1 Programming. Apress, 1st edition, 2004.
[98] Konstantinos Sagonas, Chris Stavrakakis, and Yiannis Tsiouris. ErLLVM: An LLVM Backend for Erlang. In Proceedings of the Eleventh ACM SIGPLAN Workshop on Erlang, Erlang ’12, pages 21–32, New York, NY, USA, 2012. ACM.
[99] Alexej Schepeljanski, Martin Däumler, and Matthias Werner. Entwicklung einer echtzeitfähigen CLI-Laufzeitumgebung für den Einsatz in der Automatisierungstechnik. In Wolfgang A. Halang and Peter Holleczek, editors, Eingebettete Systeme, volume 1 of Informatik aktuell, pages 21–30. Springer Berlin Heidelberg, 2011.
[100] Klaus-Dieter Schmatz. Java Micro Edition - Entwicklung mobiler JavaME-Anwendungen mit CLDC und MIDP. dpunkt.verlag, second edition, 2007.
[101] Martin Schoeberl. JOP: A Java Optimized Processor for Embedded Real-Time Systems. PhD thesis, Vienna University of Technology, 2005.
[102] Martin Schoeberl. A Java Processor Architecture for Embedded Real-time Systems. J. Syst. Archit., 54, January 2008.
[103] Tobias Schoofs, Eric Jenn, Stéphane Leriche, Kelvin Nilsen, Ludovic Gauthier, and Marc Richard-Foy. Use of PERC Pico in the AIDA avionics platform. In Proceedings of the 7th International Workshop on Java Technologies for Real-Time and Embedded Systems, JTRES ’09, pages 169–178, New York, NY, USA, 2009. ACM.
[104] Mauricio Serrano, Rajesh Bordawekar, Sam Midkiff, and Manish Gupta. Quicksilver: a quasi-static compiler for Java. In Proceedings of the 15th ACM SIGPLAN Conference on Object-oriented Programming, Systems, Languages, and Applications, OOPSLA ’00, pages 66–82, New York, NY, USA, 2000. ACM.
[105] Sam Shiel and Ian Bayley. A Translation-Facilitated Comparison Between the Common Language Runtime and the Java Virtual Machine. Electron. Notes Theor. Comput. Sci., 141(1):35–52, December 2005.
[106] Kazuyuki Shudo. Performance Comparison of Java/.NET Runtimes. www.shudo.net/jit/perf/ (retrieved October 2, 2012), 2005.
[107] Fridtjof Siebert and Andy Walter. Deterministic Execution of Java’s Primitive Bytecode Operations. In Proceedings of the 2001 Symposium on Java™ Virtual Machine Research and Technology Symposium - Volume 1, JVM’01, pages 141–152, Berkeley, CA, USA, 2001. USENIX Association.
[108] Richard L. Sites, Anton Chernoff, Matthew B. Kirk, Maurice P. Marks, and Scott G. Robinson. Binary Translation. Commun. ACM, 36(2):69–81, February 1993.
[109] Jim Smith and Ravi Nair. Virtual Machines: Versatile Platforms for Systems and Processes. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2005.
[110] P. Smith. Software Build Systems: Principles and Experience. Pearson Education, 2011.


[111] International J Consortium. Real-Time Core Extensions, Draft 1.0.14. http://www.soi.city.ac.uk/~kloukin/IN2P3/material/rtce.1.0.14.pdf (retrieved June 27, 2014), 2000.
[112] Y. N. Srikant and Priti Shankar, editors. The Compiler Design Handbook: Optimizations and Machine Code Generation. CRC Press, Inc., Boca Raton, FL, USA, 2003.
[113] William Stallings. Betriebssysteme - Prinzipien und Umsetzung (4. Aufl.). Pearson Studium, 2003.
[114] John A. Stankovic. Misconceptions About Real-Time Computing: A Serious Problem for Next-Generation Systems. Computer, 21(10):10–19, October 1988.
[115] Mark Stoodley, Kenneth Ma, and Marius Lut. Real-time Java, Part 2: Comparing compilation techniques. http://www.ibm.com/developerworks/java/library/j-rtj2/index. (retrieved March 20, 2012), April 2007.
[116] M. Struys and M. Verhagen, PTS Software. Real-Time Behavior of the .NET Compact Framework. msdn.microsoft.com/en-us/library/ms836789.aspx (retrieved August 26, 2013), 2003.
[117] David Stutz, Ted Neward, and Geoff Shilling. Shared Source CLI Essentials. O’Reilly & Associates, Inc., Sebastopol, CA, USA, 2002.
[118] Sun Microsystems, Inc. Sun Java Real-Time System 2.2. docs.oracle.com/javase/realtime/doc_2.2/release/JavaRTSQuickStart.html (retrieved April 18, 2014), 2009.
[119] TimeSys Corporation. RTSJ Reference Implementation (RI) and Technology Compatibility Kit (TCK). www.timesys.com/java/ (retrieved March 21, 2012), 2007.
[120] Sascha Uhrig and Jörg Wiese. Jamuth: An IP Processor Core for Real-time Systems. In Gregory Bollella, editor, JTRES, ACM International Conference Proceeding Series, pages 230–237. ACM, 2007.
[121] Christian Ullenboom. Java ist auch eine Insel. Galileo Computing, tenth edition, 2012.
[122] Martin v. Löwis and Andreas Rasche. Towards a Real-Time Implementation of the ECMA Common Language Infrastructure. In Proceedings of the Ninth IEEE International Symposium on Object and Component-Oriented Real-Time Distributed Computing, ISORC ’06, pages 125–132, Washington, DC, USA, 2006. IEEE Computer Society.
[123] Andrey Vagin. Checkpoint/Restore in Userspace. criu.org/ (retrieved November 25, 2013), 2013.
[124] Ronald Veldema, Thilo Kielmann, and Henri E. Bal. Optimizing Java-specific overheads: Java at the speed of C? In Proceedings of the 9th International Conference on High-Performance Computing and Networking, HPCN Europe 2001, pages 685–692, London, UK, 2001. Springer-Verlag.


[125] Matthias Vodel, Rene Bergelt, and Wolfram Hardt. A generic data processing framework for heterogeneous sensor-actor-networks. International Journal On Advances in Intelligent Systems, 5(3-4):483–492, December 2012.
[126] Matthias Vodel, Matthias Sauppe, Mirko Caspar, and Wolfram Hardt. SimANet - A Large Scalable, Distributed Simulation Framework for Ambient Networks. Recent Advances in Information Technology and Security - Journal of Communications, 3(7):11–19, December 2008.
[127] Matthias Werner. Responsivität – ein konsensbasierter Ansatz. WBI GmbH, 2000.
[128] Mono. www.mono-project.com (retrieved March 21, 2012), 2012.
[129] K. Yaghmour. Embedded Android: Porting, Extending, and Customizing. O’Reilly and Associates Series. O’Reilly Media, Incorporated, 2013.
[130] Tan Yiyu, Lo Wan Yiu, Yau Chi Hang, Richard Li, and Anthony S. Fong. A Java processor with hardware-support object-oriented instructions. Microprocessors and Microsystems, 30(8):469–479, 2006.
[131] A. Zerzelidis and A. J. Wellings. Requirements for a Real-Time .NET Framework. SIGPLAN Not., 40(2):41–50, 2005.
[132] T. Zhou, L. Charest, and E.M. Aboulhamid. SCIL processor - a CIL processor for embedded systems. In Proceedings of the IEEE Northeast Workshop on Circuits and Systems, NEWCAS 2007, 2007.
[133] Ben Zorn. Ben’s CLI Benchmarks. research.microsoft.com/en-us/um/people/zorn/benchmarks/ (retrieved April 21, 2014).
