<<

Faculty of Computer Science

Chair for Real Time Systems

Diploma Thesis

Porting DotGNU to Embedded Linux

Author: Alexander Stein
Supervisors: Jun.-Prof. Dr.-Ing. Robert Baumgartl, Dipl.-Ing. Ronald Sieber
Date of Submission: May 15, 2008

Alexander Stein: Porting DotGNU to Embedded Linux. Diploma Thesis, Chemnitz University of Technology, 2008

Abstract

Programming PLC systems is limited by the provided libraries. In contrast, hardware-near programming requires greater effort, e. g. for initializing the hardware. This work offers a foundation to combine the advantages of both development approaches. For this purpose, Portable.NET from the DotGNU project has been used, which is an implementation of the CLI, better known as .NET. The target system is the PLCcore-5484 microcontroller board, developed by SYS TEC electronic GmbH. Built upon the porting, two variants for using interrupt routines within the Portable.NET runtime environment have been analyzed. Finally, the reaction times to occurring interrupt events have been examined and compared.

Programming PLC systems is restricted by the libraries provided, whereas hardware-near programming involves greater effort, e. g. for initialization. This work provides a foundation for combining the advantages of both development approaches. For this purpose, Portable.NET from the DotGNU project, an implementation of the CLI better known under the name .NET, was used. The target system is the PLCcore-5484 microcontroller board by SYS TEC electronic GmbH. Building on the port, two variants for integrating interrupt routines into the Portable.NET runtime environment were examined. Finally, the reaction times to incoming interrupts were analyzed and compared.

Acknowledgements

I would like to thank some persons who influenced and supported me in my work. First, I want to thank my supervisors, Jun.-Prof. Dr.-Ing. Robert Baumgartl and Dipl.-Ing. Ronald Sieber, who passed many valuable comments to me. I am grateful to SYS TEC electronic GmbH for making this diploma thesis possible. Special thanks go to the developers of Portable.NET, who helped me to understand the internals of the program and to find solutions for problems that occurred in the course of my work. Last but not least, I want to thank my family, which supported me in its special manner. Without their support, my study would not have been possible this way.

Contents

List of Tables

List of Figures

Listings

1 Introduction
   1.1 Objective
   1.2 Structure

2 State of the Art
   2.1 .NET Framework
      2.1.1 Common Language Runtime
      2.1.2 Common Type System
      2.1.3 Common Language Specification
      2.1.4 Common Intermediate Language
      2.1.5 Metadata
      2.1.6 .NET Class Library
   2.2 DotGNU
      2.2.1 Hans-Boehm Garbage Collector
      2.2.2 Foreign Function Interface
      2.2.3 Interpreter
      2.2.4 Unrolled Assembler Code
      2.2.5 JIT Support
      2.2.6 Debugging
      2.2.7 X11 Support
      2.2.8 Embedded Engine
   2.3 Mono


   2.4 Rotor
   2.5 Coldfire vs. m68k
      2.5.1 Exclusive Instructions
      2.5.2 Floating Point Size
      2.5.3 Other Differences
      2.5.4 ABI Changes

3 Implementation
   3.1 Incompatible m68k Code
   3.2 Incompatible Alignments
   3.3 Broken Toolchain
   3.4 Unrolling Assembler Code
      3.4.1 Setup of the Unroller
      3.4.2 Unroller Implementation
      3.4.3 m68k Specifics
      3.4.4 Macro Counting
      3.4.5 Extended Testsuite
      3.4.6 Floating Point Remainder
      3.4.7 Big Endianess
      3.4.8 NOP is not just no Operation
      3.4.9 Caches
   3.5 C# Debugging
   3.6 Interrupt Access in C#
      3.6.1 Kernel Module
      3.6.2 C# Application
   3.7 Results

4 Performance
   4.1 Portable.NET Benchmark
   4.2 Benchmark Results
   4.3 Interrupt Response Time

5 Final Remarks and Further Work
   5.1 Conclusion
   5.2 Improvements


      5.2.1 Increasing Engine Performance
      5.2.2 Tweaking IRQ Handling
      5.2.3 Porting JIT to MCF5484
      5.2.4 Miscellaneous

6 Acronyms

Bibliography

A Contents of the enclosed CD

List of Tables

2.1 New Instructions introduced on Coldfire

3.1 List of mentioned assembler instructions in this chapter

4.1 Results from pnetmark for the different engines
4.2 Comparisons to the interpreter engines

List of Figures

2.1 Reduced compiler-ISA-combinations, adopted from [1]
2.2 Relationship between CLS and CTS, adopted from [1]
2.3 Development using pnet, adopted from [11]
2.4 Debug Communication in Portable.NET

3.1 IRQ access setup

4.1 IRQ timings at 100 Hz rate
4.2 IRQ timings at 10 Hz rate
4.3 IRQ timings at 1 Hz rate
4.4 IRQ timings at 0.1 Hz rate

Listings

3.1 Test and Set for m68k (adopted from pnet)
3.2 Testcode for missing load_memindex macros
3.3 Testcode for missing floating point macros
3.4 Testcode for missing unsigned div instruction

1 Introduction

SYS TEC electronic GmbH1 has developed a microcontroller board named PLCcore-5484, which is based on the 32-bit Freescale MCF5484 microcontroller. This Programmable Logic Controller (PLC) core module is used industrially for the automation of processes.

Programming PLCs is limited by the system libraries, which are provided by the manufacturer. Thus, it is not possible to access hardware components or to react directly to interrupt events, unlike in embedded programming using C or assembler. However, it is desirable to combine advantages like the comfort of PLC programming with the flexibility of hardware-near programming. Additionally, the disadvantages of either side should be eliminated, e. g. explicit hardware initialization in hardware-near programming or restricted type casts in PLC programs.

1.1 Objective

This work shall build the foundation to make the above-mentioned features possible. For this purpose, a .NET runtime is used, namely the open-source implementation DotGNU. DotGNU is suitable for the embedded Linux on the PLCcore-5484, because it allows the desired flexibility and hardware access, which is required for realizing real-time systems. Furthermore, this runtime also provides a class library for abstraction and encapsulation of details, as is commonly used in PLC programming.

Porting DotGNU to the MCF5484 microcontroller has to be done first. Afterwards, the possibilities of using interrupt routines have to be analyzed and implemented.

1http://www.systec-electronic.com


1.2 Structure

The diploma thesis is divided into five chapters. The first chapter, the introduction, explains the background and objectives of this work. State of the Art, the second chapter, outlines technologies and architectures related to the porting of DotGNU to embedded Linux. The next chapter, the implementation, describes the actual porting, as well as encountered problems and their solutions. This part also includes the realization of interrupt access inside DotGNU and the results of the entire implementation. The second to last chapter contains information about the performance of the runtime engine and the response times to external interrupts. The final chapter summarizes this paper and states ideas for further work.

2 State of the Art

2.1 .NET Framework

This section gives an overview about .NET and its features. For more information about this framework [8], [1], [19], [17] and [11] are recommended.

.NET is a programming platform developed by Microsoft. It is designed to be platform-independent and is standardized as ECMA international standard 335, which was later also accepted as an ISO standard. The official name is Common Language Infrastructure (CLI).

Generally, it is possible to write every kind of program using the .NET framework, but this does not hold true for drivers. The large programming library makes it easier to create an application, but in most cases it is just a wrapper for the Win32 API. There is also support for commonly used technologies and protocols, e. g. HTTP, XML and SOAP. This provides a unified set of features on every computer using a specific version of .NET. So, all applications get the same runtime features, such as code checking, threading, garbage collection and so on.

It is designed to be robust and secure. There are two different security models: Role-Based Security and Code Access Security. The first one is the typical setup: several roles are defined, which have different sets of rights to access resources on the running system. A program executed by a user automatically gets all the rights this user has. Using the second model, the rights granted that way may also be reduced. The security manager maps the security policy and the evidence (e. g. source, author) of the running assembly into a permission set assigned to the assembly. Access to a resource is restricted by a permission object, which checks against the permission set of the assembly whether access is allowed. This is done for all assemblies in the current call stack and for the actual access. The rights do not have to be restricted by any assembly. Combined with signed assemblies, a replaced assembly may be noticed, which can deny access to malicious code.

The installation is easier than before, as a program can be installed by copying the necessary files to the desired directory. Consequently, the deinstallation is just as easy: removing the application directory is enough. There is no need for entries in the Windows registry to run the programs. The configuration is done via XML files and included in the provided assemblies. It is possible to install several versions of the .NET framework side by side, so applications made for different .NET versions are runnable without modifications.

Interoperability is provided everywhere it could be useful. It is possible to use several .NET-compatible programming languages, as they all get compiled to an intermediate language (see section 2.1.4). Regarding the language features, it is possible to choose the most suitable ones.

2.1.1 Common Language Runtime

The Common Language Runtime (CLR) is the core of .NET. It is designed as a stack-based virtual machine, like the Java Virtual Machine (JVM) (see [1]). As a consequence, most of the instructions do not require any arguments, as the values they use are stored on top of the stack in the corresponding order. In contrast, register-based machine instructions have one or more arguments indicating which registers have to be used. Before an instruction is executed, its operands are popped from the stack and cannot be used by following instructions; to reuse those values, they first have to be duplicated on the stack. The result value of an instruction, if there is one, is again stored on top of the stack. Load and store instructions are used to read and write RAM data. In the end, each instruction of the virtual machine has to be transformed into native assembler code for the underlying register machine.

This additional layer creates platform independence, as the Common Intermediate Language (CIL) code (see section 2.1.4) is portable across several architectures. In the same way, it is independent of the programming language in which the source code is written. Instead of compiling a binary tied to a specific hardware Instruction Set Architecture (ISA), which would then have to be supported by the VM, the high-level virtual machine defines its own ISA, which specifies the interface to the VM. Using this method, compilers just have to target one ISA, and this is also the only ISA a host system has to support. Thus, even the necessity of cross compilers is eliminated. Eventually, the compiler-ISA combinations are reduced by introducing the Intermediate Language (IL) middle layer. Figure 2.1 a) shows the standard approach, one compiler for each language and ISA (3 · 4 = 12). By adding the IL layer, just one compiler per language and one CLR per ISA are necessary (3 + 4 = 7), as shown in figure 2.1 b).


Figure 2.1: Reduced compiler-ISA-combinations, adopted from [1]

See [20] for more information on virtual machines.

The CLR also manages the memory and uses a garbage collector for allocated memory. This eliminates the problems of invalid memory accesses and memory leaks in languages like C and C++, where memory management is done by the programmer.

The runtime handles exceptions and enforces the security policy. The commonly used role-based security policy is supported, as is the code-based security policy. In the latter, there are hints in the code which handle access rights for the user. Generally, it is possible to drop privileges independently of the role-based access rights. Running native code on an architecture is possible via a Platform Invoke. Such a method call sets up the arguments according to the calling convention and also handles the return value.


2.1.2 Common Type System

The Common Type System (CTS) is the ruleset of the CIL code. It defines how variable types have to be declared, used and managed. This ruleset describes which features may be used within any .NET program; what is not defined here cannot be used. The definitions include all possible data types, e. g. floating point types, that the CIL compiler has to use. A programming language does not have to support all features offered by the CTS, but just the features it needs (see figure 2.2). If some data type in the source code is not directly supported by the CTS, the compiler has to ensure that a proper type is used instead. This guarantees that the generated CIL code is valid for all programming languages used to program for .NET. An assembly, the smallest autonomous unit in .NET, contains the generated CIL code and metadata, which describe the code.

2.1.3 Common Language Specication

The Common Language Specification (CLS) defines the minimum requirements a programming language aimed at .NET has to fulfil. Every CIL compiler has to support this ruleset. By this means, it is guaranteed that the generated code can be used from other programming languages, e. g. by using cross-language inheritance or calls (see figure 2.2). But this only applies to public parts visible from outside of the assembly. Inside an assembly, everything specified in the CTS is possible and valid. Consequently, some of the predefined CLR types are not CLS conformant, e. g. typedref, int8, unsigned int16. There are several CLS rules (see [8] for the full set), which declare the requirements for CLS compliance. Boxed values (value types boxed inside a reference type), for example, are not CLS conformant. The same applies to unmanaged pointers.

To declare an assembly CLS compliant, it has to be marked explicitly with CLSCompliantAttribute. There is no implicit CLS compliance. However, it is possible to exclude some assembly elements from CLS compliance checking.



Figure 2.2: Relationship between CLS and CTS, adopted from [1]

2.1.4 Common Intermediate Language

The instruction set used in this virtual machine is called Common Intermediate Language (CIL), sometimes referred to as Microsoft Intermediate Language (MSIL) or just IL; normally, all these terms have the same meaning. CIL is an intermediate language because it is neither a high-level programming language like C#, nor is it natively suitable for a specific CPU. Every instruction is listed in the CLI specification [8] with a description, the thrown exceptions, the requirements for correctness and what is needed to verify the code. Most of the more than 200 instructions perform actions on the stack. Interestingly, most of them carry no information about the type of their arguments. It is a polymorphic instruction set, unlike e. g. Java bytecode: there is only one add instruction for adding integers and floating point values, so the CLR has to attend to the data types. The backend gets more complex this way, but on the other hand, it is easier to create a compiler generating CIL. Yet, CIL alone is not enough to be applicable in the CLR. Additionally, metadata are necessary.
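Since add itself carries no type information, a runtime has to select the concrete operation from the operand types it tracked during verification. The following C fragment is a generic sketch of that dispatch with invented type tags; it is not code from any CLI implementation.

/* Sketch only: dispatch an untyped "add" on the verified operand types. */
typedef enum { TYPE_I4, TYPE_R8 } operand_type;

typedef struct {
    operand_type type;
    union { int i4; double r8; } u;
} slot;

static void do_add(slot *a, const slot *b)
{
    if (a->type == TYPE_I4 && b->type == TYPE_I4)
        a->u.i4 += b->u.i4;   /* integer addition        */
    else
        a->u.r8 += b->u.r8;   /* floating point addition */
}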


2.1.5 Metadata

Additionally, metadata are used to describe the module itself. They are organised in tables, which describe the types defined inside the module, external but referenced types, and a manifest. The latter describes the module itself and states, e. g., which other assemblies are required. Other information given is the assembly name, a version, and language information. If there is no such information about the target language, the string used is called neutral.

The manifest is a special type of metadata. It has to exist only in one module inside an assembly. The starting function is named here and the references to other modules and assemblies are also listed.

Metadata are used to realize properties, which have just one name on the public side; internally, two function names are used for setting and getting the variable value. Properties are special class members that have a get and a set method. The property is accessed by its name and, depending on the context, the get method is used for reading the property value or the set method is used for updating it. In addition, attributes like CLSCompliantAttribute (see section 2.1.3) are declared in the metadata.

2.1.6 .NET Class Library

CLI [8] defines two profiles of library sets: the Kernel Profile and the Compact Profile. Depending on the applied profile, a different set of libraries is available. The first one forms the minimal set of features, including the Base Class Library and the Runtime Infrastructure Library. The latter extends the Kernel Profile by featuring an Extensible Markup Language (XML) library, a network library and a reflection library in addition to the mentioned libraries.

The .NET class library is a superset of the libraries defined by the CLI. Among others, it also includes:

- ADO.NET, used for accessing structured data from databases and XML files

- ASP.NET, used for dynamic web applications


- .NET Remoting, used for Remote Procedure Calls

- System.Windows.Forms, a library for creating graphical applications

2.2 DotGNU

DotGNU is an official GNU project attempting to provide an alternative to Microsoft's .NET. It is a meta-project consisting of three projects:

- DotGNU Execution Environment

- phpGroupWare

- DotGNU Portable.NET

The DotGNU Execution Environment (DGEE)1 is a component realizing a web service server. phpGroupWare2 is a web-based collaboration suite. These two subprojects were ignored, as the main aim was to run CIL applications on embedded Linux. As a consequence, only the term DotGNU Portable.NET will be used instead of DotGNU in this paper, because DotGNU alone would not be exact.

DotGNU Portable.NET

DotGNU Portable.NET (pnet for short) is the part which implements the ECMA standards 334 (C# [7]) and 335 (CLI [8]). This allows the creation and execution of CLI applications. pnet contains the CLR, a C# compiler, a CIL assembler and disassembler, an ldd pendant for CIL showing linkage dependencies, and several development tools. In contrast to .NET, pnet supports platforms like Linux, Windows, MacOS, Solaris and some more. It also runs on more processors, like x86, x86_64, ARM, PPC, IA64, etc.

DotGNU Portable.NET Library (pnetlib for short) is the C# class library compliant to the CLI; it tries to be as compatible to the .NET library as possible.

1http://www.gnu.org/software/dotgnu/dgee.html
2http://www.phpgroupware.com


Nevertheless, some .NET C# classes are missing or incomplete in the current implementation. This library also contains code for the System.Windows.Forms namespace used for creating window-based applications. To be independent from various graphic toolkits, like Qt or GTK+, pnetlib uses its own toolkit implementation for drawing objects.

2.2.1 Hans-Boehm Garbage Collector

The pnet engine uses boehm-gc (the complete name is Boehm-Demers-Weiser garbage collector) for garbage collection, including small changes to support Cygwin, so pnet maintains its own copy of that library. In general, this library allows the programmer to replace malloc in C and new in C++. Thus, explicit deallocations (free in C, delete in C++) are no longer necessary.

It is also possible to use this library for memory leak detection [3] by detecting inaccessible objects which were not freed manually. If an object is inaccessible, it cannot be deallocated later in the main program, so it has leaked.
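As a side note, the drop-in character of the collector can be shown with a few lines of plain C against the library's public gc.h interface; the sketch below is independent of pnet and only assumes a standard libgc installation.

#include <gc.h>      /* Boehm-Demers-Weiser collector */
#include <stdio.h>

int main(void)
{
    GC_INIT();                                   /* initialize the collector */
    for (int i = 0; i < 1000000; i++) {
        int *p = GC_MALLOC(16 * sizeof(int));    /* replacement for malloc() */
        p[0] = i;                                /* objects are never freed by hand */
    }
    printf("heap size: %lu bytes\n", (unsigned long)GC_get_heap_size());
    return 0;
}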

Algorithm: The boehm-gc uses a modified mark-and-sweep algorithm [2], which is divided into four phases:

1. Preparation

2. Mark Phase

3. Sweep Phase

4. Finalization Phase

Preparation: In this phase, all mark bits associated with the objects are cleared; a mark bit indicates whether an object is accessible or not.

Mark Phase: All objects that are accessible from pointers stored in registers and on the stack are marked as reachable. All pointers containing a pattern indicating a heap address are scanned. The stack range is specified by the current stack pointer and GC_stackbottom (the cool end of the stack; see section 3.1 for architecture-dependent settings on Coldfire).

Sweep Phase: After a heap scan, the inaccessible, hence unmarked, objects are inserted into a free list. Usually, an object is not swept immediately and the corresponding page is not even touched. This has the advantage that, commonly, there is no paging during sweeping. In most cases, the actual sweep is done during the allocation of new objects.

Finalization: It is possible to register a finalization routine for an object. This allows executing code just before the object is freed. Objects that are marked as inaccessible and registered for finalization are enqueued in a list, which is processed outside of the collector.

2.2.2 Foreign Function Interface

Another external library is libffi, which is used by programs that do not know the number or types of the arguments of a function at compile time. This knowledge is necessary to create correct machine code and to comply with the calling convention. So, especially interpreters can use this library to call a function only known at runtime [12].

For proper usage, the signature of a function has to be known. The signature includes a pointer to the function to be called, as well as the number and types of the arguments and the type of the return value. With this information provided, the library manually pushes the function arguments onto the stack using architecture-dependent assembler code and jumps into the subroutine. After returning from the called function, the assembler code cleans up the stack and, if necessary, moves the return value into a given buffer.
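To make this more tangible, the following self-contained C program performs a call through libffi's documented interface (ffi_prep_cif and ffi_call); the function add is merely a stand-in for a routine whose address and signature would only be known at runtime.

#include <ffi.h>
#include <stdio.h>

static int add(int a, int b) { return a + b; }

int main(void)
{
    ffi_cif   cif;
    ffi_type *arg_types[2]  = { &ffi_type_sint, &ffi_type_sint };
    int       a = 2, b = 40;
    void     *arg_values[2] = { &a, &b };
    ffi_arg   result;   /* small integer results are widened to ffi_arg */

    /* Describe the signature: default ABI, two int arguments, int return value. */
    if (ffi_prep_cif(&cif, FFI_DEFAULT_ABI, 2, &ffi_type_sint, arg_types) == FFI_OK) {
        ffi_call(&cif, FFI_FN(add), &result, arg_values);
        printf("add(2, 40) = %d\n", (int)result);
    }
    return 0;
}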

2.2.3 Interpreter

As described in [4], the engine is split into five parts, which are responsible for loading, verification, coding, the CVM interpreter and support for the class library. Loading means converting an IL application (like a *.exe file) into an internal form which the engine can manipulate. Verification is done when a method is executed: the IL bytecode is analysed to ensure that it is correct and that the security conditions are obeyed. During this analysis, the IL code is converted into a more efficient execution format named Converted Virtual Machine (CVM) code. Based on the operand types determined during verification, the best alternative is chosen to perform, e. g., an addition: for adding integers, IADD is chosen, and for floating point values FADD respectively. This increases performance, as the interpreter does not have to track the data types itself. Because the converted bytecode is cached, this conversion is only done at the first call of a method.

For platform invokes (pinvoke), which allow native function calls from C# (and other languages that can be compiled to CIL code), pnet uses libffi to handle the calling convention on the target platform. It is used, e. g., for the class library support functions in the engine.

ilrun is a program for convenient use of the engine. Generally speaking, it is used to execute an IL application, but there are also options to enable debugging, tracing, etc. It is little more than a wrapper written in C, because the actual engine resides in a library that any application, like ilrun, can link against.

A typical approach for an interpreter is to read the next instruction and interpret it in a loop with a huge switch-case statement. pnet, however, uses a mechanism that looks like a big switch-case statement, but the macros used at each case block actually create C code which uses goto (macro at the end of a block) to jump directly to C labels (macro at the beginning of a block), depending on the next instruction. Using this method, a lot of comparisons are avoided; the interpreter just has to look up the code block that handles the next instruction.
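The effect of this label-based dispatch can be seen in a small, self-contained C sketch that uses GCC's computed-goto extension. The four opcodes and all names are invented for the example and do not correspond to pnet's CVM instruction set.

#include <stdio.h>

enum { OP_PUSH1, OP_ADD, OP_PRINT, OP_HALT };

static void run(const unsigned char *pc)
{
    /* One C label per opcode; the table stores the label addresses. */
    static void *dispatch[] = { &&op_push1, &&op_add, &&op_print, &&op_halt };
    int stack[16], *sp = stack;

#define NEXT() goto *dispatch[*pc++]   /* jump straight to the next handler */
    NEXT();
op_push1: *sp++ = 1;              NEXT();
op_add:   sp--; sp[-1] += sp[0];  NEXT();
op_print: printf("%d\n", sp[-1]); NEXT();
op_halt:  return;
#undef NEXT
}

int main(void)
{
    unsigned char program[] = { OP_PUSH1, OP_PUSH1, OP_ADD, OP_PRINT, OP_HALT };
    run(program);   /* prints 2 without a single switch statement */
    return 0;
}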

2.2.4 Unrolled Assembler Code

Yet, this interpreter also has the possibility to use native assembler code for supported architectures. The so-called Unroller creates assembler code for CVM instructions where possible. This is done every time a method is called for the first time and is about to be executed. The Unroller parses every instruction and uses macros to create the assembler code. Those macros have to be defined for the architecture to be supported. Macros exist for tasks such as loading a constant integer, adding two registers, loading and storing memory data, etc.

The engine has an internal stack, frame and instruction pointer. A big performance improvement is the fixed assignment of those pointers to registers using the register assignment feature of the GNU Compiler Collection (GCC). This way, the engine can always use those registers for the internal values without searching for a new register for temporary use and moving the data from memory to this register and back.

The Unroller code explicitly lists the registers which are available for general purposes. With this list, it can manage the assignment of local variables to registers that can be used directly in assembler code across several CVM instructions. Where pure interpreting would read a variable from memory, operate on the data and store it back for every instruction, the unroller only needs to load the data from memory before the first operation on a variable (given that enough free registers are available), operate on that register for all following instructions and, finally, store the register value back to memory. The actual implementation of the Coldfire V4e Unroller is described in section 3.4.

2.2.5 JIT Support

Support for Just-in-Time (JIT) compilation is included in pnet by using an external library called libjit, which is also written by the pnet developers. Usually, the library is linked to pnet, but it is designed to be independent from any interpreter or virtual machine. Currently, only the x86 and x86_64 architectures are supported. However, it is even possible to use libjit on unsupported architectures, because the library then falls back to an internal interpreter.

The JIT usage in pnet can be configured at compile time. It is also possible to use different engines, which either interpret or JIT-compile. The possible development flow is shown in figure 2.3.

By using JIT, it is possible to increase the runtime performance even further than with the unroller. With a greater scope of IL code visible at once, register assignment can be done in a better way: often-used variables are held in registers as long as possible. Additionally, compared to the unroller, it is not necessary to store registers back into memory prior to branching. Method calls can be done directly in assembler code as well. Loops benefit in particular, as every loop has a branch at the end.


Figure 2.3: Development using pnet, adopted from [11]

2.2.6 Debugging

The engine in pnet supports debugging of CIL applications. It is enabled by default, as pnet is compiled with the tool support required for debugging.

Debugging in pnet is done by using a network socket for the communication between the engine and the debug client. The standard switch starts debugging on localhost, but there is also an option for debugging from a remote host. The latter is important for debugging CIL applications on embedded devices. The IP address of the debugging client host has to be known, as the engine does not support DNS.

The communication is done by sending commands from the client to the engine, which responds with XML text. A sample communication can be seen in figure 2.4. It shows the command show_position from the debugging client and its result from the debug engine. The answer says that the debugger currently stops inside the method Main() and contains the signature of this method. The current location is also sent, indicating to the debug client in which line of the C# code the debugger stops, which allows highlighting the corresponding source code line inside the client debug program. The path to the source code file is included in the metadata of the IL file, so the directory layout has to be the same on the debugging client as on the programming machine.


Figure 2.4: Debug Communication in Portable.NET

For a working debugging session, the binary has to be compiled with the pnet tools. Only the debug symbol format is defined by the CLI [8]; every runtime can implement its own usage of that format. Thus, for simply running applications, it is possible to use different CLI implementations for compiling and execution, but to be able to debug, these implementations have to be the same.


2.2.7 X11 Support

The class library pnetlib also supports part of the System.Windows.Forms namespace, which is used for dialog elements in a Graphical User Interface (GUI). The classes in that namespace use System.Drawing for creating GUI elements. System.Drawing.Xsharp is the link between System.Drawing and Xsharp, where the latter is the link between the C# classes and an X11 C library used to access an X11 server. As not every targeted platform has an X11 server or even a graphical display, it is possible to disable X11 support. Since the Coldfire board belongs to the latter category, pnetlib can be used in that non-graphical environment by simply disabling X11 support. In this case, the Xsharp C library is just a stub doing nothing. Using this method, everything compiles just like on a normal system, so it is possible to replace the stub later with a working library, enabling X11 support without any further modifications to the already compiled IL library.

2.2.8 Embedded Engine

The standard method to launch a CIL application is to use ilrun, where the actual program is given as a parameter. In some embedded environments, this might not be possible. Another approach for running CIL applications is to link a program against the static engine library. The necessary functions to launch the engine and to run CIL code are then called in the main application rather than in ilrun. This is useful where the additional features of ilrun, besides running a CIL application, are not needed. It may save storage space and decrease the number of running processes, which could increase the overall system performance.

CLI [8] defines two profiles with a different set of features: the Kernel Profile and the Compact Profile. In addition, pnet provides variants of those two profiles with floating point support. Two non-standardized profiles are also added by pnet: tiny, which is a feature-reduced version of the Kernel Profile, and full, which includes all features implemented in pnet. Depending on the constraints of an embedded system, it might be useful to switch to another profile, resulting in a smaller runtime engine.


2.3 Mono

Mono is an alternative open source implementation of the ECMA standards 334 and 335, allowing the creation of cross-platform applications. It is sponsored by Novell, which also developed some modules of Mono, as for example the runtime engine and the JIT compiler. Novell also offers commercial support in consulting, development and services [16]. Mono supports several operating systems and architectures. It contains a C# compiler, a runtime engine including a JIT compiler, a garbage collector and threading support. There is also a bytecode interpreter that can be used for porting to new platforms or for debugging a JIT compiler [15]. Mono's .NET class implementation contains additional C# bindings for GTK+, called GTK#. Mono also provides a console debugger written in C#, which is not feature-complete yet. A GUI is planned to be included in MonoDevelop, an Integrated Development Environment (IDE) that originated from SharpDevelop, as described in [14].

2.4 Rotor

Rotor, also called Shared Source CLI, is the reference implementation of the CLI standard. It can be run on Windows XP, Mac OS X and FreeBSD. According to [19], most of the code comes from Microsoft and is a reduced version of the .NET implementation compliant to the ECMA standard. The main purpose of Rotor is non-commercial usage, like education and academic research; any commercial usage is prohibited.

2.5 Coldfire vs. m68k

The Coldfire processor is considered a subtype of the Motorola 68000 (m68k) CPU family. Although they are identical in most aspects, several things make the two incompatible.


2.5.1 Exclusive Instructions

Comparing m68k and Coldfire, each of them has its exclusive instructions. The latter only implements a subset of the m68k instruction set. As described in [13], the changes were made to reduce the complexity of the Coldfire model. Less commonly used instructions, addressing modes and instruction sizes were removed. Instructions larger than 48 bits are not supported either, as this is the instruction size limit on Coldfire. The reference manual [9] lists the unsupported instructions and possible replacements.

Depending on the internal core, the Coldfire CPU also has several instructions unavailable in a standard m68k CPU. All of the instructions listed in table 2.1 were used during the unroller implementation (see section 3.4). The most important ones (among others) are:

Assembler Instruction   Description
MOV3Q                   Quick move instruction for 3-bit immediates (−1, 1−7)
MVS                     Move of 8- or 16-bit data with sign extension to 32-bit
MVZ                     Move of 8- or 16-bit data with zero extension to 32-bit

Table 2.1: New Instructions introduced on Coldfire

These different instruction sets make it difficult to replace one processor by the other without recompiling the source code. Especially m68k assembler instructions in architecture-dependent source code may need modifications to be usable on a Coldfire processor.

2.5.2 Floating Point Size

The m68k uses double-extended precision with 80 bits, as defined in [21]. The Coldfire, in contrast, only has an FPU (introduced with the V4e core) supporting double precision, using 64-bit double-precision floating point registers.


2.5.3 Other Differences

Besides the different set of supported instructions and the floating point size, there are differences in apparently identical instructions. Due to the reduced complexity of the core, some instructions support fewer addressing modes, e. g. the TAS instruction does not support data register addressing. Other instructions have completely different arguments, e. g. CPUSHL (for pushing and invalidating cache lines) takes cache lines and indexes on Coldfire, while the m68k version uses physical addresses. The instructions ASL, ASR, MULS and MULU always clear the overflow flag, which is set accordingly on m68k.

2.5.4 ABI Changes

Due to the different floating point size, a slightly different instruction set and some other differences, the compiled code is not interchangeable between m68k and Coldfire CPUs in every case. Furthermore, the Application Binary Interface (ABI) enforced by the toolchain for Coldfire uses a different mechanism to pass floating point return values back from a subroutine: code compiled for Coldfire uses two 32-bit data registers, while the m68k version uses a floating point register. Considering these differences, it is questionable whether m68k and Coldfire can be regarded as the same architecture at all and, hence, whether their binaries can be expected to be interchangeable.

3 Implementation

The porting of pnet had several aspects: on the one hand, architecture-dependent modifications were needed; on the other hand, some new features had to be implemented in the engine.

3.1 Incompatible m68k Code

libffi and libgc, which are used by pnet and already have an implementation for the m68k architecture, both use assembler code in some places. Unfortunately, they use some m68k-exclusive instructions that are unavailable on the Coldfire CPU. libgc implements a test-and-set function for each architecture; the m68k version uses the negb instruction (see table 3.1 for a list of the assembler instructions mentioned in this chapter, with short descriptions) to negate a byte value set by sne depending on the condition code set by tas (see listing 3.1 for the approach). This is necessary because the tas instruction only sets bit 7 of the destination, but the result has to be stored in bit 0.

int GC_test_and_set( unsigned int *addr ) {
    char oldval;

    __asm__ __volatile__(
        "tas %1@; "    /* set bit 7 of *addr, condition codes reflect the old value */
        "sne %0; "     /* oldval = 0xff if the location was already non-zero        */
        "negb %0"      /* normalize to 0 or 1 (m68k-only byte negate)               */
        : "=d" (oldval)
        : "a" (addr)
        : "memory");
    return oldval;
}

Listing 3.1: Test and Set for m68k (adopted from pnet)

But the Coldfire, on the other hand, only supports a neg.l instruction for negating a 32-bit value. Due to the difference in size, there are no side effects, as only bit 0 is used in the libgc code. This small code change makes it possible to use a Coldfire compiler. The typical m68k comes without an MMU; therefore, libgc on m68k is implemented using a fixed stack bottom. This does not work on a Coldfire (assuming at least a V4e core), which uses the MMU to support virtual memory. Fortunately, it is also possible to let libgc determine this value by reading it from the /proc/self/maps list.

For libffi, several changes had to be made. The configure script of libffi checks whether double and long double have the same size and sets some macros accordingly. It seems that the authors of libffi did not expect that an m68k CPU might use the same size for double and long double: the m68k part does not catch the case where both are equal, so two case statements use the same constant value, which results in a compiler error. By simply commenting out the needless code part, libffi works on Coldfire, too.

Similar to libgc, assembler instructions had to be replaced as well. The unsupported m68k instructions are used in the code that passes the return value from the native function call back to the program. Since the Coldfire, compared to the m68k, has a different floating point size, floating point values are passed in a different way. As not every Coldfire model has an FPU, the toolchain generates code that returns floating point values in data registers; for 64-bit floating point values, two data registers are enough. Instead of moving the returned value from a 64-bit floating point register back to the destined memory address, two sequential 32-bit moves are used.

3.2 Incompatible Alignments

For garbage collecting, pnet uses libgc, which calculates the specific pointer alignment at compile time, depending on the targeted architecture. As published in most m68k ABIs, 32-bit or greater values are aligned to 2 bytes, using a padding byte in structures if necessary. That is why libgc expects pointers that are aligned to 2 bytes.

Assembler instruction   Description
negb                    0 − operand (negation), byte-sized; m68k-only instruction
scc                     Set according to condition; sne is the "not equal" variant
tas                     Test and Set; sets condition codes depending on the destination
neg.l                   0 − operand (negation), longword-sized
rem                     Remainder; stores the result of a modulo operation
bcc                     Branch conditionally; branches if the condition is met
nop                     No Operation; does nothing, but synchronizes the pipeline
tpf                     Trap False; does nothing, faster than nop
cpushl                  Push and Possibly Invalidate Cache; privileged instruction
mov                     Move data from/to registers and/or memory

Table 3.1: List of mentioned assembler instructions in this chapter

In contrast to the fact mentioned beforehand, pnet assumes that all pointers managed by libgc are at least 4-byte aligned. Following that assumption, the lower 2 bits are always cleared; since these bits never change, pnet uses them for internal data management in some places. In a pointer to an ILType structure, these 2 bits indicate which kind of value is stored in it: native data, value types or reference types. In addition, pnet uses typed allocation within libgc, resulting in faster garbage collection, because libgc then knows both the layout of allocated objects and which of them can or cannot store pointers. This also requires a 4-byte alignment of pointers.
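The low-bit tagging itself is a standard trick; the fragment below only illustrates the idea with invented helper names and is not taken from pnet's ILType handling.

#include <assert.h>
#include <stdint.h>

#define TAG_MASK ((uintptr_t)0x3)   /* two low bits are free in 4-byte-aligned pointers */

static void *tag_pointer(void *p, unsigned tag)
{
    assert(((uintptr_t)p & TAG_MASK) == 0);   /* requires 4-byte alignment */
    assert(tag <= TAG_MASK);
    return (void *)((uintptr_t)p | tag);
}

static unsigned pointer_tag(const void *p)
{
    return (unsigned)((uintptr_t)p & TAG_MASK);
}

static void *pointer_value(const void *p)
{
    return (void *)((uintptr_t)p & ~TAG_MASK);
}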

But the ABI requires 2-byte alignment, which breaks these expectations, so the compiler must use a different alignment. This is possible with a GCC attribute on the affected variables; however, the developers did not expect an architecture with only 2-byte alignment, so those attributes were not used. Another approach is the -malign-int compiler switch. It forces the compiler to align all 32-bit variables on 32-bit boundaries. A positive effect is that the generated code is faster on processors with 32-bit busses, as no data is stored misaligned in memory. A big drawback, however, is that this change also affects 32-bit variables used in structures, as they may now be aligned differently. This contradicts the commonly used SYSV ABI, which aligns those elements on 2-byte boundaries.

As the current kernel still uses 2-byte alignment, system calls using a structure would return wrong results, for the simple reason that the padding bytes used by pnet and by the kernel are incompatible due to their different alignments. To work around this, a compatibility layer has been created, which transfers a structure of one alignment to the other and vice versa. The actual structure member offset is determined by the following code (example for int, adopted from [5]):

#define _IL_ALIGN_CHECK_TYPE(type,name) \
        struct _IL_align_##name { \
                char pad; \
                type field; \
        }

_IL_ALIGN_CHECK_TYPE(ILInt32, int);

#define _IL_ALIGN_FOR_TYPE(name) \
        ((ILNativeUInt)(&(((struct _IL_align_##name *)0)->field)))

offset = _IL_ALIGN_FOR_TYPE(int);

Using these C macros, an (imaginary) structure is expected at address 0x0. The first member is a dummy of 1 byte, which forces the compiler to insert padding bytes before the field member. The address of the second member therefore equals the offset, which indicates the default alignment. Using the Coldfire toolchain, this macro yields the expected alignment of 2 bytes; with the -malign-int compiler switch, the alignment is increased to 4 bytes. The compatibility layer was implemented for the structure struct stat only, because this structure is used in the testsuite included in pnetlib. In the end, this layer is more a proof of concept. For practical reasons, it was decided instead that the kernel would be recompiled using 4-byte alignment, which should also give a performance increase. The mentioned increase could not be measured, as other external changes to the kernel had been made concurrently. Using a 4-byte alignment for the kernel and, in effect, for the whole system is not a problem, as all running programs can be recompiled with the different alignment, because the source code is available. Going this way, the compatibility layer became obsolete, was ignored, and was completely removed in further development.
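The same measurement can be reproduced outside of pnet with standard C; the sketch below (with a hypothetical struct name) uses offsetof and yields the identical value.

#include <stdio.h>
#include <stddef.h>

struct align_int {
    char pad;     /* 1-byte dummy member */
    int  field;   /* padded to the default int alignment */
};

int main(void)
{
    /* Prints 2 with the plain Coldfire ABI and 4 when built with -malign-int. */
    printf("int is aligned to %u byte(s)\n",
           (unsigned)offsetof(struct align_int, field));
    return 0;
}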


3.3 Broken Toolchain

After implementing these changes and using 4-byte alignment, a simple "Hello World!" could be printed to the console. More complex programs, however, just caused exceptions while interpreting the C# code. Deeper analysis revealed a problem in handling numbers: when printing a number to the console, log10() is used to obtain the number of digits, and the handling of floating point numbers was not correct.

Work on that problem started with a GCC 3.4 targeted at m68k with Coldfire V4e support. Some test applications and disassemblies showed that the handling of long double variables was completely different from float or double variables. As the Coldfire only supports 64-bit floating point values in its registers, the GCC 3.4 used three 32-bit data registers to build floating point values greater than 64 bits. However, the conversion from/to long double seemed to have problems, resulting in incorrect values.

In fall 2007, CodeSourcery1 released an updated toolchain featuring GCC 4.2. This new version resolved the problems with long double variables, because they now have the same size as double, resulting in 64-bit floating point numbers. The glibc included in the flash storage on the board was still taken from the old toolchain. In order to use the new compiler and all libraries, like the essential glibc, in the pnet virtual machine, everything had to be linked statically into ilrun. This made it possible to use the new toolchain while the rest of the system was still using the old one. The switch to the new toolchain was made concurrently with the switch from 2-byte to 4-byte alignment, as other external changes were made to the Linux kernel sources at the same time. Since then, userspace applications and the Linux kernel are built using the new toolchain and 4-byte alignment.

1http://www.codesourcery.com

3.4 Unrolling Assembler Code

Interpreting the CVM code is always a working way to run IL applications. On the one hand, an interpreter is more portable than a JIT compiler; on the other hand, it is slower (see table 4.1 for benchmark results and table 4.2 for a comparison of interrupt response times) and is mostly suitable for testing functionality. Yet, pnet offers the possibility to unroll native assembler code (see section 2.2.4) within the interpreter. The CVM runtime engine still converts the CIL instructions into their internal counterparts, but instead of pure interpreting, the internal unroller creates native assembler code for a CVM instruction if it is supported. Otherwise, the unrolled assembler code jumps back to the CVM, where the unsupported instruction can still be interpreted.

To support several architectures, the unroller uses macros for the operations that should be unrolled into assembler code. Every architecture then has to implement these macros to create the corresponding assembler instructions. This way, the architecture-independent part of the unroller does not need to be changed. Occasionally, changes are still necessary, as described in section 3.4.3.

3.4.1 Setup of the Unroller

The unroller itself needs some information to be able to create correct assembler code. At the beginning, it needs three registers to store the internal stack, frame and instruction pointer. This is possible with GCC's manual register assignment. By this manual assignment, one can optimize the interpreter loop by hand; the compiler itself may misjudge which variables, often used in loops, should be kept in registers all the time for performance reasons. The unroller also needs a list of freely usable registers to operate on. The actual register name is not used, but a number that identifies the register.

The implementation of this work uses the address registers a3, a4 and a5 for the virtual stack, frame and instruction pointer. This is possible and advantageous, as address registers are required for the indirect memory accesses needed to access stack variables within an IL application: the data are read from memory addresses based on the stack or frame pointer. Sometimes those register values need to be adjusted, resulting in additions or subtractions within the unrolled assembler code. As address registers are capable of those adjustments, they are the best choice for holding the virtual stack, frame and instruction pointer. If data registers were used, every memory access through such a pointer would require an additional move from a data register to an address register. The registers a6 and a7 are reserved for the system frame and stack pointer according to the ABI. The register d7 is used for temporary values in some instructions; e. g. the remainder (rem) instruction needs a destination register different from the given input registers, because the macro generating this assembler code (md_rem_reg_reg_word_32(inst,reg1,reg2)) only provides two registers for the input, where the first one is also the output register. As unrolling is only a lightweight form of JIT, it does not need to handle argument passing for function calls. Normally, d0 and d1 are used for temporary values. It is not necessary to be ABI compliant within unrolled assembler code, as the register values are stored back to memory before any function call is made; since no function call is done directly in unrolled assembler code, the engine does not have to cope with that.
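The fixed register assignment relies on GCC's explicit (global) register variables. The fragment below is only a sketch of that mechanism, assuming a Coldfire/m68k target and the a3/a4/a5 mapping described above; the variable names are invented.

/* Pin the virtual stack, frame and instruction pointer to fixed Coldfire
   address registers (GCC global register variable extension). */
register void          **cvm_stack asm("a3");   /* virtual stack pointer       */
register void          **cvm_frame asm("a4");   /* virtual frame pointer       */
register unsigned char  *cvm_pc    asm("a5");   /* virtual instruction pointer */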

3.4.2 Unroller Implementation

The unroller uses a big switch-case statement to decide, where possible, which instruction is unrolled. For that, some action is needed to obtain the necessary register numbers or memory addresses. C macros are then used to create the assembler code: these macros expand to C code which writes the corresponding opcodes to a given memory address. The passed register numbers indicate which register has to be used in this instruction. The created assembler code must ensure that nothing other than the given registers is altered; apart from that, condition codes are always generated explicitly. If a macro needs a temporary register, it is useful to dedicate one general purpose register to that and to remove it from the usable register list in the setup. As stated above, some instructions, as for example function calls, cannot be unrolled. For this case, the fallback solution is to jump back to the CVM using generated assembler instructions; the following opcode is then interpreted in the ordinary way. This also allows unrolling only some CVM instructions, while leaving others aside without losing functionality. This is especially useful during the development process, where not all macros are implemented yet, but a correctness test is already possible. It is noticeable that this implementation uses instructions like MVS and MVZ for speed improvements. Consequently, the Coldfire core has to support ISA_A, the standard instruction set, and ISA_B in particular.

Here is an example for the macro md_neg_reg_word_32, which negates the value of one register (macro counting part and unnecessary parts stripped):

typedef enum
{
        CFV4E_NEG = 0x4480      /* Negate */
} CFV4E_OP_reg;

#define md_neg_reg_word_32(inst, reg) \
        cfv4e_reg(inst, CFV4E_NEG, reg)

#define cfv4e_reg(inst, op, reg) \
        do { \
                assert(!is_ax_reg(reg)); \
                *(inst)++ = (unsigned short) op | \
                            (unsigned char) regnum(reg); \
        } while (0)

md_neg_reg_word_32 is called somewhere inside the unroller code whenever a register value should be negated. inst is the pointer to the memory address where the assembler instructions have to be inserted; reg is the enumeration number indicating which register has to be changed. For the NEG instruction, the upper 12 bits are fixed to 0x448 and the lower 3 bits are used to indicate the register whose value is about to be negated (bit 3 is always cleared). By the subsequent macro substitution, one 16-bit word (which is one Coldfire instruction) is written to memory; using a logical OR, the register number is inserted at the appropriate position. Other macros take more than one parameter, and shifting these arguments to the correct bit position is then necessary.
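As a concrete example of the encoding just described: negating data register d2 emits the single 16-bit opcode 0x4480 | 2 = 0x4482, which the macro writes to the unroll buffer at *inst before advancing the pointer.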

3.4.3 m68k Specifics

The unroller implementation mentioned at the beginning of this paper did not support any m68k architecture. As a result, no attention had been paid to its peculiar specifics. This concerns the two different register types: data and address registers. While data registers are used for general calculations and data movement, the address registers are used to access memory directly or indirectly, with or without displacements. However, the list of usable registers in the setup (see section 3.4.1) only knows one register type. So the address registers cannot be used directly for instructions like additions, which are indeed possible on the m68k architecture. As every register in the unroller list must be usable for any instruction, only data registers were suitable. If the unroller uses a data register containing a memory address for an indirect memory access, the address has to be moved to an address register first. This has to be done inside the unroller macros, as the unroller itself is not aware of the different register types. With this setup, the unroller always uses data registers for the variable mapping, and only the architecture-dependent macros occasionally use address registers when necessary.

Another specialty is the setting of condition codes. CLI obviously supports conditional branching, and therefore the unroller generates an instruction that sets the condition code explicitly, followed by a conditional branch depending on that condition. Before the branch is actually taken, the register stack has to be flushed. This is essential because the register-variable assignment might change after the branch; especially if the following instruction is interpreted, the register values have to be stored back into memory.

This results in possibly storing register values to memory after setting the condition code but before branching. That works on almost any architecture other than m68k, because a data move does not change the condition code flags there. The m68k, however, behaves differently: it sets the condition flags with every MOVE instruction, thereby invalidating and overwriting the previously generated condition flags. So some branches did not work correctly, and additional saving and restoring of those flags is needed to correct this issue on the m68k architecture.

3.4.4 Macro Counting

After implementing all macros, there is no guarantee that every piece of generated assembler code is correct and actually used. To verify the code creation, a special macro has been developed which increments a counter in a global array; the index into this array corresponds to an enumeration of the macros. As not all generated assembler instructions are actually executed (think of conditional branches, for example), the unrolled macros have to be counted at execution time. The counting macro therefore also creates assembler instructions that increment the counter value stored in memory.

However, this approach does not work for conditional branches, as pnet stores the starting memory address of the unrolled branch instruction block. This is done because the branch target address as well as the size of the unrolled code are unknown at that point. Later, when the unroller knows the branch target, it patches the branch instruction by inserting the now known branch position. If the counting macro code were inserted before the branch, the patch would no longer work, as the branch instruction would be at a different position than the one the unroller knows. To verify the branch instructions, the testsuite has been extended (see section 3.4.5) to test every CIL branch instruction.

As the counting macro uses a MOVE instruction to load the memory address of the array into an address register for the indirect increment, the condition codes are destroyed (see section 3.4.3). To preserve them, they need to be saved in a temporary register. The Coldfire only supports moving the condition code register into a data register, but this may destroy a value stored in the temporary register used by the unroller. This is especially true for the macro md_clear_membase, as it can be deployed several times in a row; this is critical because these invocations depend on a temporary value, created by another macro, which has to stay the same for all following md_clear_membase usages. As no other data registers are free for unroller usage, the macro counting uses a second temporary register, thus decreasing the number of registers freely available to the unroller.

The counted values are printed to the console when the program terminates. With this list, it is possible to determine which macros were used heavily, and are therefore worth improving, and which ones were not used at all during program execution. This is the prerequisite for the extension of the testsuite described in the following section.
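The bookkeeping half of this idea can be sketched in a few lines of C with invented names; note that in pnet the increment itself is emitted into the unroll buffer as Coldfire code, so that executions, not unroll events, are counted.

#include <stdio.h>
#include <stdlib.h>

/* One slot per unroller macro (list shortened for the sketch). */
enum { MACRO_NEG_REG_WORD_32, MACRO_CLEAR_MEMBASE, MACRO_SLOT_COUNT };

static unsigned long macro_counts[MACRO_SLOT_COUNT];
static const char *macro_names[MACRO_SLOT_COUNT] = {
    "md_neg_reg_word_32",
    "md_clear_membase",
};

static void dump_macro_counts(void)
{
    for (int i = 0; i < MACRO_SLOT_COUNT; i++)
        printf("%-24s %lu\n", macro_names[i], macro_counts[i]);
}

int main(void)
{
    atexit(dump_macro_counts);               /* print the table on termination */
    macro_counts[MACRO_NEG_REG_WORD_32]++;   /* in pnet this increment is emitted code */
    return 0;
}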

3.4.5 Extended Testsuite

A testsuite for library checking is included in pnetlib, so a lot of the functionality can be tested for correctness. This includes testing for programming mistakes in the library itself, as well as for implementation errors in the runtime.

Analysing the testsuite with the macro counter (see section 3.4.4 for details) revealed that some unrolling macros were not used. As a result, the given testsuite cannot check completely for implementation errors in the runtime; in some cases the runtime behaves improperly and the tests do not notice it. To check the missing unrolling macros, the testsuite had to be extended.

As the macro counting shows the amount of unroller macro usage at program exit, it is trivial to find the unused ones. In general, the macro name itself does not show by which CIL instruction or C# code it is used; luckily, these macros have names indicating which CIL instruction or C# construct may be missing. The missing macros were:

- md_load_memindex_sbyte
- md_load_memindex_short
- md_neg_reg_float
- md_rem_reg_reg_float
- md_udiv_reg_reg_word_32

The shipped testsuite is based on an application called csunit, developed for creating testsuites easily. The test program only gets a parameter pointing to a testsuite in form of a DLL. During start-up, it tries to load the Suite() method using reflection. This method returns an instance of the testsuite that is used for the check. Using this technique, it is easy to extend a testsuite with an extra suite implementing its own test cases. Each test case is capable of doing a setup and cleanup independently from the others. This way, it is possible to allocate, manage and free all resources not managed by the garbage collector.

Listing 3.2 shows the test functions for loading memindex using the byte and short data types. These loops assert that the values in the arrays correspond to their index numbers. The test case uses the setup facility, which creates and initializes the array values according to their indexes and data types.


public void TestMemoryAccessLoadByte()
{
    for (int i = 0; i < size; i++)
    {
        AssertEquals("barray", i, barray[i]);
    }
}

public void TestMemoryAccessLoadShort()
{
    for (int i = 0; i < size; i++)
    {
        AssertEquals("sarray", i, sarray[i]);
    }
}

Listing 3.2: Testcode for missing load_memindex macros

The second test implements tests for floating point arithmetic. Both tests check positive and negative operands on the float and double data types. Listing 3.3 shows the test functions. A short description of the floating point remainder is given in section 3.4.6.

public void TestFloatArithNeg()
{
    float a = 10.5f;
    AssertEquals("10.5f", 10.5f, a);
    AssertEquals("-10.5f", -10.5f, -a);

    double b = 10.5;
    AssertEquals("10.5", 10.5, b);
    AssertEquals("-10.5", -10.5, -b);
}

public void TestFloatArithRemainder()
{
    float a = 10.5f;
    float b = 3.0f;
    float c = a % b;
    AssertEquals("10.5f % 3.0f", 1.5f, c);

    a = -10.5f;
    c = a % b;
    AssertEquals("-10.5f % 3.0f", -1.5f, c);

    double d = 10.5;
    double e = 3.0;
    double f = d % e;
    AssertEquals("10.5 % 3.0", 1.5, f);

    d = -10.5d;
    f = d % e;
    AssertEquals("-10.5d % 3.0d", -1.5, f);
}

Listing 3.3: Testcode for missing floating point macros

The third test checks the unsigned integer division. The operation is performed on positive and negative values. The unchecked block is necessary, as the value -10 lies outside the variable type's range; for the conversion to an unsigned integer, the two's complement of -10 is used. The cast to int at the assertion method calls assures that the calculated values are interpreted correctly.

public void TestIntArithUnsignedDiv()
{
    uint a = 10;
    uint b = 3;
    uint c = a / b;
    AssertEquals("10 / 3", 3, (int) c);

    unchecked {
        a = (uint) -10;
    }
    b = 3;
    c = a / b;

    AssertEquals("-10 / 3", 0x55555552, (int) c);
}

Listing 3.4: Testcode for missing unsigned div instruction

The last extension does not cover an explicitly missing macro, but it assures that all conditional branch instructions work correctly. The motivation behind that test is that pnet uses macros of its own for condition codes. The following list gives an overview.

- MD_CC_EQ
- MD_CC_NE
- MD_CC_LT
- MD_CC_LE
- MD_CC_GT
- MD_CC_GE
- MD_CC_LT_UN
- MD_CC_LE_UN
- MD_CC_GT_UN
- MD_CC_GE_UN

Normally, these macros are defined to an architecture-dependent counterpart like CFV4E_CC_NE for Not Equal, which in turn is defined to a number corresponding to the condition code encoding used by the Bcc instruction (see the description of Bcc in [9] for a list of all condition codes). These generic condition code macros are used in some unroller macros to indicate which condition code should be tested. Using the condition code encoding directly allows the macros to simply shift the value to the correct position to create the wanted assembler instruction, for example a conditional branch.

By implementing a test that uses all CIL branch instructions, it can be assured that the condition codes are implemented correctly, as a functional test with macro counting is not possible without further modification of pnet (see section 3.4.4 for further details).

A test for floating point branches is not necessary, as those branches are handled differently: a result value is set depending on the comparison, and subject to this integer value, the branch is taken or omitted.

3.4.6 Floating Point Remainder

In C, the floating point remainder can be calculated using fmod(). The result z is defined by x = i · y + z with an integral quotient i, so that |z| < |y| and z takes the sign of x. As the Coldfire lacks a native floating point remainder instruction, the unroller uses the following implementation for reg1 = reg1 % reg2 (in C-like pseudocode, where round_down() rounds the quotient towards zero):

double work, reg1, reg2;
work = reg1;
work = work / reg2;
round_down(work);
work = work * reg2;
reg1 = reg1 - work;
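As a cross-check of this scheme, the following small C program is a sketch (not taken from the pnet sources) that applies the same sequence with the quotient truncated towards zero and compares the outcome against the C library's fmod().

#include <math.h>
#include <stdio.h>

/* Emulate the unroller's remainder sequence: subtract the truncated
 * multiple of the divisor, as the engine does for reg1 = reg1 % reg2. */
static double unrolled_fmod(double reg1, double reg2)
{
        double work = reg1 / reg2;
        work = trunc(work);        /* quotient rounded towards zero */
        work = work * reg2;
        return reg1 - work;
}

int main(void)
{
        printf("%f %f\n", unrolled_fmod(10.5, 3.0), fmod(10.5, 3.0));    /*  1.5  1.5 */
        printf("%f %f\n", unrolled_fmod(-10.5, 3.0), fmod(-10.5, 3.0));  /* -1.5 -1.5 */
        return 0;
}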

3.4.7 Big Endianness

Although pnet is usually programmed to be endian-independent, there is a problem loading constant floating point values into registers within the unroller. The unroller loads them from a given memory address into a floating point register; integer values, in contrast, are loaded by moving an immediate into a data register, which avoids any endian issues. The floating point value stored in memory, however, has to be converted into the correct endianness. CLI [8, III, p. 6] states: 'All operand numbers are encoded least-significant-byte-at-smallest-address (a pattern commonly termed little-endian).' The code converting CIL instructions into CVM just copies those instruction arguments, ignoring the endianness. In principle, this is not a problem, as the interpreter code checks for endianness and switches the byte order as necessary. When the unroller is used instead of the interpreter for this instruction, the unroller code of a big endian architecture has to switch the byte order before loading the floating point value into a register. The implementation of this paper uses a temporary array for switching the byte order.
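A minimal sketch of such a byte-order switch is shown below; it is not the unroller code itself, merely an illustration of the temporary-array approach for an 8-byte IEEE 754 double taken from the little-endian instruction stream on a big-endian host.

#include <string.h>

/* Convert a little-endian 8-byte IEEE 754 double from the instruction
 * stream into the host's (big-endian) representation using a temporary
 * array, as described above. */
static double load_le_double(const unsigned char *src)
{
        unsigned char tmp[8];
        double value;
        int i;

        for (i = 0; i < 8; ++i)
                tmp[i] = src[7 - i];    /* reverse the byte order */

        memcpy(&value, tmp, sizeof(value));
        return value;
}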

3.4.8 NOP is not just no Operation

The documentation for the unroller [6] advises that, at unrolled branch instructions, the offset in the extension words is filled with NOP instructions, in case some instruction jumps directly into the offset. The actual offset is patched into the Bcc and extension words later, when the offset is known; if the offset is small enough, the extension words are not touched at all. The advised NOP instruction has two big drawbacks on the Coldfire: it forces the integer pipeline to be synchronized, and it takes 6 CPU cycles to execute. In case of 8 byte offsets, those two NOPs are not changed, and if the branch is not taken, it takes 12 CPU cycles and two pipeline synchronizations where actually nothing has to be done. The programmer's reference manual [9] advises to use the Trap False (TPF) instruction instead if no synchronisation is needed. This also has the advantage that a TPF executes in one CPU cycle, resulting in faster code execution than with NOP. It is possible to use TPF instructions with different sizes, which are designed for replacing unconditional branches with 2, 4 or 6 byte offsets. This is useful in if-else statements, where an unconditional branch is always placed at the end of the if block. If the else block is small enough, the corresponding assembler code can be used as an extension word of the TPF instruction. However, as stated in the device errata [10], a TPF instruction hiding a conditional branch may result in a branch prediction error that will not be fixed. Instead, there is a list of workarounds at hand:


- Deactivating the branch cache, resulting in lower branch performance
- No replacement of unconditional branches using TPF in word and longword size
- No usage of TPF, if the hidden instruction is a conditional branch

As the implementation of the unroller in this paper uses TPF in byte size only, which is just a real NOP without any instruction hiding, the stated bug does not apply here. That is why there is no problem in using this otherwise problematic instruction.

3.4.9 Caches

The Coldfire has a Harvard structured memory cache, including 32 kByte of cache for data and instructions. For the data cache, it is possible to use either the write-through mode or the copyback mode. The first one always passes data writes to the external bus; if there is a cache miss, the write does not allocate a new cache line, while a cache hit during a write also updates the cache. The copyback mode just updates the cache without writing data to the external bus. In fact, the cache data is only written to memory if the cache line is about to be replaced, caused by a cache miss or by a CPUSHL instruction. The kernel implementation that this paper is based on uses the write-through mode.

As the unroller writes to memory that is later used for instruction execution, this affects both the data and the instruction cache. As the two caches are not synchronised with each other, it may occur that the instruction cache still holds memory data from a previous instruction, whereas the unroller wrote to that memory address afterwards while unrolling a CIL method. In this case, when that method is executed, the instruction cache has wrong instruction data, resulting in unpredictable behavior. That is why, after unrolling a method, the cache has to be invalidated. A cache flush is not necessary, because all data is written to memory immediately; on the Coldfire, a cache flush is the same as a cache invalidation, so there is no difference in implementation. This invalidation forces the instruction cache to reload the instructions from memory, so that up-to-date data is fetched. The engine knows about such problems and therefore calls a function after unrolling a method which flushes caches if necessary, depending on the running architecture. The x86 architecture, for example, does not have to do anything, because its caches are synchronised automatically; for the Coldfire, however, some action has to be taken.

The necessary CPUSHL instruction is a privileged one and thus may only be used in kernel mode; otherwise, a privilege violation exception is raised. As a consequence, a system call has to be used or created. The already existing ARM implementation uses an existing system call, so it seemed reasonable to use a proper pendant on the Coldfire. After some research, it was found that there is already a system call for flushing caches. Yet, it turned out that this function was not implemented for the Coldfire, but only for some other m68k sub-architectures. Adding the needed functionality could easily be done by using the existing convenience functions, which actually flush the caches on the Coldfire.

However, there was no system call wrapper callable from userspace, although the system call was listed in the system call table. This list also indicated the system call number, which could then be used. The current implementation uses the syscall function, which enters kernel space and executes the actual system call.
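A sketch of such a raw system call invocation is shown below, assuming the m68k cacheflush interface; the system call number and the scope/cache constants are placeholders that would have to be taken from the kernel's system call table and headers, not verified values.

#include <unistd.h>
#include <sys/syscall.h>

/* Placeholders: look these values up in the kernel's system call table
 * and in <asm/cachectl.h>; they are assumptions, not verified numbers. */
#define NR_CACHEFLUSH      123
#define FLUSH_SCOPE_ALL    3    /* flush the whole cache      */
#define FLUSH_CACHE_BOTH   3    /* instruction and data cache */

/* Invalidate the caches after the unroller has written new code. */
static void flush_caches_after_unroll(void *addr, unsigned long len)
{
        syscall(NR_CACHEFLUSH, addr, FLUSH_SCOPE_ALL, FLUSH_CACHE_BOTH, len);
}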

3.5 C# Debugging

When doing a cross build, it is recommended to build pnet without tool support, because a C# compiler, assembler or disassembler is unlikely to be used on an embedded target system. As a consequence of compiling pnet without tools, there is no support for debugging or tracing, which may be unwanted during application development.

To be able to debug C# applications without building the pnet tools, minor modifications had to be made to the build system: all parts of pnet necessary for debugging have to be included in the build process, and a new configure option was introduced to enable debugging separately from the tools (building the tools still enables debugging as before). Some changes in definitions had to be made to take notice of the new configure option.

For tracing support, only some definitions in the source code need to be modified. This also allows an explicit trace option when configuring pnet without tool support.


Yet, during the first attempts to debug, it turned out that debugging was not well supported when using the CVM interpreter engine (independently of unrolling). Support for listing local variables and setting watch points only worked with a JIT engine backend. But after a conversation with the developers in the DotGNU IRC channel, they applied some changes to get interpreter-compatible debugging, comparable to the JIT backend.

A graphical frontend for debugging within pnet is libide. This had the slight disadvantage that it was only possible to debug on localhost, because the debugging socket was bound to the IP 127.0.0.1. By changing it to the current IP of a specific interface, or simply to 0.0.0.0 to allow connections on all interfaces, remote debugging became possible on the Coldfire as well.

However, it turned out that programs failing on the Coldfire executed correctly when run under the remote debugger. This leads to the assumption that the code is executed on the host system rather than on the target system during debugging. This may affect tests of applications tried out on the build system, which then show a different behavior on the target system.

Debugging C# applications and libraries also helped porting pnet to the Coldfire CPU. The tracing output of the interpreter and the unroller engine can be compared with each other; if the output differs at some point, further investigations have to be made to find out whether the difference is caused by an improper implementation. This also allows one to find the actual method causing an error inside a testbench, which can then be investigated further by debugging.

3.6 Interrupt Access in C#

Besides porting pnet to the Coldfire CPU, another task of this paper is to analyze the response time to external interrupts within C# using pnet. Knowledge about the response time is needed to react to external events, for example an emergency switch, in a C# application. Due to the fact that interrupts are handled by the Linux kernel, an additional kernel module had to be developed that allows informing userspace processes about an interrupt event.


3.6.1 Kernel Module

In order to get proper results, an interrupt was needed that could easily be triggered on the development board. Therefore, the idea is to use a button on that board which triggers an interrupt in the running Linux. The kernel module notices this interrupt and notifies an already waiting C# program. This application should react by setting an LED mounted on the board, which is also controlled by this kernel module.

With such a setup, it is possible to test the module by simply pressing the button. An attached oscilloscope can determine the time between pressing the button and the LED flashing, which is the interrupt response time. To record possible influences on the response time caused by caching and external effects, a frequency generator was used instead of pushing the button manually.

Unfortunately, this button raises interrupt number seven, which is linked to the highest interrupt level on the Coldfire processor. This level is not maskable, although masking is especially necessary in kernel space. So, by pressing the button it is possible that the kernel is interrupted in a critical section, which results in an unpredictable state, and the kernel has to be restarted. This happens especially if there is a lot of load on the system, because the kernel has to do a lot of context switching. This increases the chance that IRQ 7 interrupts critical sections in kernel space, leading the kernel into an undefined state.

The kernel module is quite simple. During module loading, it registers an interrupt handler, configures and enables the interrupt to be edge-triggered, and creates a device file (/dev/irq7), which supports reading and writing from userspace. It also configures the in-/output pins for the targeted behavior. The read operation blocks until a new interrupt has occurred, which avoids busy waiting; as it is unpredictable when an interrupt will occur, this is the better alternative. The write operation is not linked to interrupts at all, it just sets the LED to the given state. This is necessary to get the response time, and it is a simpler approach than creating an extra module for that task. Also, the userspace program only has to handle one device file. A stripped-down sketch of such a module is given below.
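The following is a minimal sketch only, not the module shipped on the CD. It assumes a recent 2.6-era kernel (older kernels use a slightly different interrupt handler signature and header layout), and the IRQ number, the GPIO handling and the names (IRQ_BUTTON, set_led(), the irq7_* symbols) are placeholders for the board-specific parts.

#include <linux/module.h>
#include <linux/fs.h>
#include <linux/interrupt.h>
#include <linux/miscdevice.h>
#include <linux/uaccess.h>
#include <linux/wait.h>

#define IRQ_BUTTON 7   /* placeholder for the board's button IRQ */

static DECLARE_WAIT_QUEUE_HEAD(irq7_wait);
static volatile int irq7_count;

static void set_led(int on)
{
        /* placeholder: drive the board-specific GPIO register here */
}

static irqreturn_t irq7_handler(int irq, void *dev_id)
{
        irq7_count++;                       /* remember that an event happened */
        wake_up_interruptible(&irq7_wait);  /* unblock readers of /dev/irq7    */
        return IRQ_HANDLED;
}

static ssize_t irq7_read(struct file *file, char __user *buf,
                         size_t count, loff_t *ppos)
{
        int seen = irq7_count;
        char state = 1;                     /* placeholder for the button state */

        /* Block until the next interrupt arrives (no busy waiting). */
        if (wait_event_interruptible(irq7_wait, irq7_count != seen))
                return -ERESTARTSYS;
        if (count < 1 || copy_to_user(buf, &state, 1))
                return -EFAULT;
        return 1;
}

static ssize_t irq7_write(struct file *file, const char __user *buf,
                          size_t count, loff_t *ppos)
{
        char value;

        if (count < 1 || copy_from_user(&value, buf, 1))
                return -EFAULT;
        set_led(value != 0);                /* the write operation only drives the LED */
        return count;
}

static const struct file_operations irq7_fops = {
        .owner = THIS_MODULE,
        .read  = irq7_read,
        .write = irq7_write,
};

static struct miscdevice irq7_dev = {
        .minor = MISC_DYNAMIC_MINOR,
        .name  = "irq7",                    /* appears as /dev/irq7 */
        .fops  = &irq7_fops,
};

static int __init irq7_init(void)
{
        int ret = request_irq(IRQ_BUTTON, irq7_handler,
                              IRQF_TRIGGER_RISING, "irq7-button", NULL);
        if (ret)
                return ret;
        return misc_register(&irq7_dev);
}

static void __exit irq7_exit(void)
{
        misc_deregister(&irq7_dev);
        free_irq(IRQ_BUTTON, NULL);
}

module_init(irq7_init);
module_exit(irq7_exit);
MODULE_LICENSE("GPL");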


3.6.2 C# Application

The C# application is the part of this setup that reads the current state of the button from the device file and writes the result back to the module to switch the LED on or off. As the read operation on the device file blocks, it would be useless to do that operation in the main thread; this would block the whole application, eliminating the possibility to do any other tasks in the C# program. To bypass this, a separate thread is created, which handles reading and writing to the kernel module, while the main thread can do other things.

As an example of a general purpose design, the C# application is divided into two parts: a main program (irqtest.exe) and a library (libirqaccess.dll) handling the device file. This allows a separate development of both the test application and the module. The latter can then simply be linked to the program used in production, while maintaining the interface given by the library. This design also allows the use of an alternative backend written in C#, which, instead of the C# class library, uses platform invokes calling C functions directly. On the one hand, this solution bypasses the buffer, parameter and result value checking inside the library, which results in faster execution; on the other hand, an additional C library is needed, which contains the functions for the platform invoke, and depending on the setup, bypassing the checks may cause errors.

However, the C# library itself cannot directly do the write operation to the device file, as this would imply that the module is adapted for exactly that usage; common usage would do more than just writing the previously returned value back to the device. For this scenario, an additional link between the main thread and the library thread is needed. When initializing the separate thread, the main thread also passes a delegate, which should be called when an interrupt occurs. This way, the main application can define what to do in such situations, independently of the library. The method call into the library for writing to the device file is just done for simplicity; normal usage does not require such a program flow.

Figure 3.1 illustrates the whole setup. The main thread registers a delegate inside the IRQ library. The IRQ thread reads from the device file (/dev/irq7) created by the kernel module. This call blocks until IRQ 7 occurs. The read call unblocks and returns to the C# library, which then calls the delegate registered before, supplying the current button state. The delegate itself calls a writing method inside the library, which accesses the kernel module again via its device file. The writing function in kernel space sets the LED state according to the supplied value. Back in the IRQ thread, it just reads from the device file again.

[Figure: the main application (irqtest.exe) registers a delegate with the IRQ library (libirqaccess.dll); the library's IRQ thread performs a blocking read on /dev/irq7, is unblocked by the kernel module on an interrupt, calls the delegate, and writes the LED state back through /dev/irq7.]

Figure 3.1: IRQ access setup


3.7 Results

Due to the internal architecture of pnet, porting the interpreter version to the Coldfire CPU succeeded without critical obstacles in pnet itself. Most of the problems led back to external things like the used libraries, the toolchain, or the applied kernel implementation. The main issue was the needed 4 byte alignment, where the standard ABI uses 2 bytes, so the whole underlying system had to be rebuilt using an ABI modification. An advantage of the applied libraries was the already existing implementation for the m68k; only incompatibilities between the m68k and the Coldfire CPU had to be found and corrected to get a proper implementation.

The interpreter is very portable and can be brought up on a new architecture very quickly. However, compared to an unroller version, it is slower (see chapter 4 for performance measurements). To get a proper implementation using native assembler code, an understanding of the targeted architecture is necessary. This includes the given register sets and types, special behaviors (see section 3.4.3 about m68k specialties) and also the instruction set, especially specific, faster variants of instructions for common cases, which allow improving the speed of the unroller.

This paper implemented a fully useable engine using an unroller for the MCF5484, which includes a Coldfire V4e core. Thus, this implementation is not suitable for the standard m68k CPU, because it uses instructions only available on the Coldfire. In particular, the Coldfire processor has to support ISA_B, as some of its instructions are used for speed improvements. The included FPU is supported and used by the unroller, which improves floating point calculations. In general, an FPU is required by this implementation, but floating point support can easily be deactivated by clearing the list of floating point registers available to the unroller.

A kernel module was also developed, registering an interrupt that is triggered by an external button on the development board. Using a device file created by this module, a userspace application can receive the module's notifications and set an LED on the board. This allows measuring the interrupt response time from a C# application. Timing and performance measurements are discussed in the separate chapter 4.

4 Performance

This chapter covers the performance aspects of the Coldfire port of pnet. Two different measurements have been taken in this paper. The first one is a runtime benchmark, which tests several functional parts of the engine. The other one examines the response time of a C# application to an interrupt on that engine.

4.1 Portable.NET Benchmark

Portable.NET Benchmark (pnetmark for short) is a benchmark tool for testing a CLR. It is written completely in C# and divided into several subtests, each testing a specific aspect of the used engine. A test is done by first calculating the number of runs that can be done in one second. This number is multiplied by three to run a test lasting about three seconds. From the measured duration, the value runs per millisecond can be calculated, which is multiplied by a magnification factor to normalize the results of all benchmark parts and obtain comparable results. According to the documentation [18], this program is loosely based on the CaffeineMark, which is used for benchmarking Java Virtual Machines.

Sieve: The first test calculates all prime numbers up to a given maximum. If run with the default maximum, one test run checks all integers up to 512 for primality. The calculation is done by two nested loops that test whether a given number (from the outer loop) can be divided by another number (from the inner loop) with a remainder of 0.

Loop: This test is about looping while doing a sort. A test run fills an array with increasing numbers, which are then sorted using the bubble-sort algorithm. Apart from that, two sums are also calculated inside the loop. There is no specific purpose for that, as the result is not used outside of the loops.

Logic: The next test performs logical calculations. Inside two loops, several logical or and logical and operations are used inside if-statements, as well as several logical not operations on variables to negate them.

String: This test uses the System.Text.StringBuilder class to do some string operations. At first, it concatenates several small strings into a big one. Finally, a string search is done on the created string.

Float: There is also a test for floating point calculations. This one builds a rotation matrix, which is applied to 20 vectors. The standard configuration does transformations with an increasing angle up to 185°.

Method: The last test is about method calling. It does recursion up to a given depth. A second recursion method tries to disable inlining, which might be done by a JIT-compiler.

The final score is calculated using the geometric mean: score = e^((1/n) · Σ_n log(x_n)), where the x_n are the n subtest results. As the subtests have different magnification factors, the final result can only compare tested CLRs up to a certain degree. This does not apply to the individual subtests, as they are independent of any magnification factor: if a CLR, compared to another engine in a single test, has a doubled result, the tested CLR is twice as fast as the other one.
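A minimal C sketch of this scoring step (not the pnetmark source) looks as follows:

#include <math.h>

/* Geometric mean of the n subtest scores: exp of the mean of the logs. */
static double final_score(const double *scores, int n)
{
        double sum = 0.0;
        int i;

        for (i = 0; i < n; ++i)
                sum += log(scores[i]);
        return exp(sum / n);
}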

4.2 Benchmark Results

The first running CLR was pnet using full interpretation. The next milestone during development was an engine using an unroller that lacked unrolled floating point instructions. The final milestone in this paper was an engine that also uses the integrated FPU.


The following benchmark results were created by running the benchmark using the corresponding ilrun command. The test was done without any competing userspace process, with the NFS share unmounted and the network disconnected, to reduce external influences as much as possible. Table 4.1 lists the results, averaged over three consecutive runs. The engines were compiled without any debug information and optimized with the gcc option -O2; only the pure CVM version was compiled with -O1, as it otherwise did not compile (assemble, to be more precise) without errors. All unrolling engines lacked the macro counting feature, as it is useless on a productive system and also reduces the number of available data registers (see section 3.4.4).

Test           CVM   non-FPU Unroller   Unroller
Sieve           45   606                604
Loop            40   668                678
Logic           45   329                327
String         219   600                611
Float            2    20                 22
Method          37   163                163
Total result    33   252                257

Table 4.1: Results from pnetmark for the different engines

As expected, the unroller engines show better results than the purely interpreting one. Notable for the CVM engine is the comparatively high result in the string test. This may be caused by the test itself, as most string operations are done using glibc string functions, so there are fewer IL instructions to interpret, which increases the result. The float test also differs significantly from the other results: it does a lot of calculations where all IL instructions have to be interpreted, and this seems to stress the engine too much, resulting in a very poor result. The two engines featuring an unroller are completely identical, apart from the fact that one unrolls FPU instructions while the other one interprets them. So, of course, the unroller using FPU instructions is faster in the float test, but not by as much as expected. This is also due to the way the benchmark test is programmed: the vector transformation is done using a loop, which results in an IL branch instruction, and due to the unroller architecture, the register stack has to be flushed before any branch is taken, resulting in storing all used register values back to memory.


In the end, each vector transformation done within the loop loads the values into floating point registers, operates on them, and finally stores them back to memory. As the unroller is not capable of doing global register allocation, the performance benefits are reduced by the program flow. The other differences between the unroller engines are interesting: there are no floating point operations inside those tests, so the engine creates the same assembler instructions for them, yet there are still different results. The different code size and location, which result in different cache behavior, are a possible reason; cache hits and misses influence the benchmark performance of both engines in different situations, which leads to different benchmark results. All engines were used unstripped, but when running the benchmark with stripped binaries, no significant performance improvement was measurable, and the code size reduction is not worth mentioning. Those results may be influenced by the same effect as for the unroller engines, due to the changed code layout.

Conclusion

This benchmark clearly shows the performance improvement caused by unrolling native assembler instructions. It also might be helpful for testing specific unroller improvements. On the other hand, it also shows the limitations of the unroller architecture, which can be bypassed by using a JIT compiler. Nevertheless, for repeatedly executed program code, the unroller mechanism creates a notable performance improvement compared to pure interpreting.

4.3 Interrupt Response Time

Another task was to get response times of a C# application reacting to external interrupt events. The implementation is described in chapter 3.6. A normal C program, which does nearly the same as the C# application, was also developed. This allows a comparison with non-.NET applications. The only difference between the two is that the C program does not use function pointers for calling functions of external self-written libraries. It is designed to be very minimalist; the loop runs in the main thread as well. Due to this simple implementation, this program shows a response time that should not be undercut by a C# application, because of its virtualization and interpretation penalty.

During the test setup, a manually triggered event was used to get an idea of the resulting times. It turned out that, the first time the interrupt was triggered, the response time of the C# program was significantly greater than the following ones. This is due to the unrolling overhead, which occurs when a method is called for the first time. In later production usage, this can be avoided by doing a self-test that runs all necessary methods once; this initial unroll speeds up all following calls. A downside is, however, that if the system is running low on memory, some cached methods may be deleted, and in that situation the overhead strikes again. Doing a forced unroll by using a special setup without the delegate, the first interrupt call was not as slow as before, but still slower than a fully unrolled method pass. It turned out that unrolling the delegate takes up most of the time. Removing the delegate would make the unrolling fast, but it would also mean that the library for getting interrupt notifications has to be adapted to the used main application every time.

The real time measurements were done using a frequency generator simulating the button trigger. A positive effect is that this is more exact, as pressing the button once manually can create more than one interrupt; another aspect is that pressing the button at 100 Hz is not possible manually. The test itself was done using four different frequencies to detect cache effects: 100 Hz, 10 Hz, 1 Hz and 0.1 Hz. All frequencies were tested consecutively in all three program variants (C, C# using a C library, and pure C#). The system was rebooted after each program variant to reduce any cache effects caused by the physical memory location of the program.

The used oscilloscope lacked integrated support for statistics. Consequently, the results had to be taken down by hand. This is important because, due to that fact, it is possible that the results did not catch the real minimum and maximum during a test run; especially in the tests with higher frequencies, a single extreme result might have been missed. Another aspect is that the detected maximum may occur with a very low probability, so the results contain a 90 % maximum, which shall indicate a maximum that is valid for about 90 % of all interrupt instances. It is not an exact result but an estimated one. Depending on the frequency and the program variant that has been applied, this gives a better and more exact range of the interrupt response time for the common cases.

Figure 4.1 shows the result for the setup using a 100 Hz rate. As expected, the C code is the fastest and the pure C# variant the slowest. Interestingly, the detected maximum in all three variants differs dramatically from the usual range. This may have been caused by the high frequency, resulting in a higher CPU usage by the testing application and thus sometimes in process preemption when the time slice is exhausted. The kernel is compiled with a 1000 Hz timer, which causes a preempted process to be suspended for 1 ms. It is also noticeable that, running at 100 Hz, sometimes an interrupt seems to be missed, resulting in an inverted button/LED state. This can be caused by the problem that the triggered interrupt is not maskable, which allows re-entering the critical section and possibly causes a lost update.

[Figure: bar chart of the response time in µs for C, C# with C library, and pure C#, showing the minimum, 90 % maximum and maximum per variant.]

Figure 4.1: IRQ timings at 100 Hz rate

Running the test with a 10 Hz frequency shows a different result. As figure 4.2 shows, the measured maximum is clearly lower than when running with 100 Hz. Due to the reduced frequency, preemption does not occur, and the positive cache effects seem to be reduced, as the typical timing range increases with increasing C# usage. Another interesting observation is that the detected worst case of one variant can be even better than the minimum of another variant that uses more C# code. This shows that there is still some code being interpreted, which causes these differences.

[Figure: bar chart of the response time in µs for C, C# with C library, and pure C#, showing the minimum, 90 % maximum and maximum per variant.]

Figure 4.2: IRQ timings at 10 Hz rate

This also causes the even higher results for 1 Hz (see figure 4.3) and 0.1 Hz (see figure 4.4). The lower frequency results in even fewer cache hits, as the probability of cache replacements between the interrupt events is higher. This is supported by a greater result range. The lower results in the 1 Hz test are nearly the same as the results at 10 Hz, indicating a similar cache usage; in contrast, the higher results indicate an increased number of cache misses.

The last test running 0.1 Hz shows a minimum and maximum only, because the low frequency complicated the setting of a 90 % limit, so it is omitted.


[Figure: bar chart of the response time in µs for C, C# with C library, and pure C#, showing the minimum, 90 % maximum and maximum per variant.]

Figure 4.3: IRQ timings at 1 Hz rate

As the results fluctuated, especially for the C# variants, another LED was set inside the interrupt handler, indicating the response time of the handler itself. The results revealed that this response time was always about 22-25 µs. The kernel itself does not cause the fluctuations, so the problem is limited to userspace.

Comparison to Interpreter Version

Table 4.2 compares the interrupt response times of the unroller and the interpreter engine, running both C# program variants. As expected, the interpreter is in general slower than the unroller version.

Frequency [Hz]   Unroller: C# + C   Unroller: C#   Interpreter: C# + C   Interpreter: C#
100              135/160/1100       284/314/1260   140/177/1110          256/410/1330
10               127/160/220        262/336/488    140/182/208           378/460/516
1                134/190/254        296/400/540    146/204/240           394/508/560
0.1              198/294            392/792        200/338               504/644

Interrupt response time [µs], given as min/90 % max/max (min/max only for 0.1 Hz).

Table 4.2: Comparisons to the interpreter engines


[Figure: bar chart of the response time in µs for C, C# with C library, and pure C#, showing the minimum and maximum per variant.]

Figure 4.4: IRQ timings at 0.1 Hz rate

It is striking that the time difference for the test application using the C library is not very large. This may be due to the fact that there are not as many IL instructions as in the program version completely written in C#, so there are not many opportunities left for the unroller to boost the performance. Running the test application without the C library, the response times increase as expected.

For larger and more complex interrupt handling routines, the interpreter overhead may grow. For the test application using the C library, only little code is written in C#:

1. Calling the read function inside the library

2. Calling the delegate

3. Calling the write function inside the library

Extending the C# code also means more to be interpreted inside the engine. In such setups, using an unroller is a better solution.


Conclusions

As expected, the C# variants are slower than a pure C solution. Moreover, when running such a program with a high frequency interrupt, the worst case can increase dramatically due to exhausted timeslices. The best results were measured using a 10 Hz frequency, which seems to use the CPU cache better than the other tests. This is important, as the response time to a single-shot interrupt will be larger than to a frequently occurring one; this has to be kept in mind when creating time critical interrupt handlers. It also shows that this method of interrupt handling is not suitable for real time systems, due to its unpredictable timing behavior. It might even get worse if, inside the C# interrupt handling routine, garbage collection or a repeated unrolling process becomes necessary. Most of the time, the response times stayed near the minimum, but when running an application in parallel that enforces cache invalidation explicitly, the results fluctuate much more within the result range, and the minimum and maximum are also higher than without cache flushing. Using a JIT backend instead of the combined unrolling and interpreting engines will improve the C# performance, especially for the pure C# variant, because there are a lot of instructions inside pnetlib that do buffer and size checking.

5 Final Remarks and Further Work

5.1 Conclusion

The porting of DotGNU Portable.NET to embedded Linux has been accomplished in this diploma thesis. It is now possible to use the Portable.NET runtime environment on the MCF5484 Coldfire microcontroller; IL applications are executable on the target as well as on the development platform. This is the foundation for further work on SYS TEC's PLCcore-5484 board, which shall combine the advantages of hardware-near programming and PLC programming.

Two possibilities for integrating interrupt routines into Portable.NET have been developed and implemented, and interrupt response times have been measured under different circumstances. Analyzing the results, the response times are larger when using Portable.NET's IL code than when using hardware-near programming languages like C or assembler. Therefore, SYS TEC has to balance out which method is most suitable if Portable.NET is to be used.

5.2 Improvements

Although the work in this paper is usable for developing the technology that shall replace the current PLC, there are several aspects that can be improved further.

5.2.1 Increasing Engine Performance

A location for simple improvements is the unroller macro for loading fixed floating point values. As stated in chapter 3.4.7, IL binaries store them in little-endian; upon loading, the byte order needs to be switched. The current implementation uses C code utilizing a temporary array. The Coldfire CPU offers an instruction called BYTEREV for reversing the bytes inside a register. Several load and store operations could be replaced by this single instruction, but it has to be kept in mind that this requires ISA_C, which might not be available on the applied processor.

The CVM interpreter holds a considerable set of instructions. It is possible to add new instructions that match some pattern of IL code. Such a new CVM instruction can be unrolled using processor specific instructions. This way, e. g. the Multiply-Accumulate (MAC) unit of the Coldfire could be used, which would otherwise be difficult to exploit. A MAC unit, usually integrated in Digital Signal Processors (DSPs), is used for more efficient calculations of the form A0 = A + B · C.

Due to the condition code update during move instructions (see chapter 3.4.3), every register stack flush includes two condition code movements. This also happens if no data movement is necessary. So, another optimization is to insert the condition code save and restore only if required.

5.2.2 Tweaking IRQ Handling

Besides improving the runtime itself, the IRQ handling can also be modified. Instead of using the C# class library or an external C library, it is possible to use the methods from the Platform.Filemethods namespace, which utilizes the file access facilities of the engine. This drops the necessity of an external library as well as the checks inside the class library. An interrupt response time measurement and comparison with the existing proposal is also needed.

The current C# library for accessing interrupts lacks the possibility of pausing and stopping interrupt handling. Also, support for replacing the delegate could be added for more general usage.

As described in chapter 4.3, upon the first interrupt handling, the response time is clearly greater than in normal operation. This is due to the assembler code unrolling. This execution time overhead can be hidden by triggering the interrupt beforehand, manually or by software, for example in a self-test.


5.2.3 Porting JIT to MCF5484

The most significant performance improvement may be achieved by using a JIT compiler. However, porting libjit to the MCF5484 microcontroller is the most complex option. It would drop any interpretation, because all IL instructions are transformed into native assembler code. Restrictions given by the unroller do not apply to a JIT, as this method does not transform instructions one by one; a greater scope of IL code gives the opportunity to do a transformation that performs much better. In particular, the two different register sets are not well exploited by the unroller, whereas a JIT compiler can use them much more effectively. In addition to a JIT, it is possible to use precompiled assembler code. This removes the necessity of compiling IL code during runtime. Including native assembler code in IL libraries limits their usage to the targeted processor; also, some linking information, like in ELF libraries, is necessary. With that, the runtime can use the precompiled code directly.

5.2.4 Miscellaneous

Apart from performance improvements, there are other parts that are worth giving some thought to. Sometimes, the testsuite fails in a couple of tests because the timeouts in these tests are too small for this processor; in order to get the correct result, the timeouts need to be adjusted every time the testsuite is run. Because the double and long double data types are of the same size on the Coldfire, libffi contains unused code for long double in 80-bit size; this redundant code can be removed. Although Portable.NET offers a GUI named libide for debugging, it might be useful to integrate all program activities like coding, compiling, testing, and debugging into one application, e. g. Eclipse (http://www.eclipse.org/). This would allow more efficient usage.


6 Acronyms

ABI   Application Binary Interface
API   Application Programming Interface
ARM   Advanced RISC Machine
CIL   Common Intermediate Language
CLI   Common Language Infrastructure
CLR   Common Language Runtime
CLS   Common Language Specification
CTS   Common Type System
CVM   Converted Virtual Machine
DGEE  DotGNU Execution Environment
DNS   Domain Name System
DSP   Digital Signal Processor
ELF   Executable and Linking Format
GCC   GNU Compiler Collection
GNU   GNU's Not Unix
GTK+  The GIMP ToolKit
GUI   Graphical User Interface
HTTP  Hypertext Transfer Protocol
IDE   Integrated Development Environment
ISA   Instruction Set Architecture
IL    Intermediate Language
IRQ   Interrupt Request
ISO   International Organization for Standardization
JIT   Just-in-Time
JVM   Java Virtual Machine
m68k  Motorola 68000
MAC   Multiply-Accumulate
MSIL  Microsoft Intermediate Language
PLC   Programmable Logic Controller
PPC   PowerPC
RAM   Random Access Memory
SYSV  Unix System V
VM    Virtual Machine
XML   Extensible Markup Language

Bibliography

[1] Beer, W.; Birngruber, D.; Mössenböck, H.; Wöß, A.: Die .NET-Technologie. Grundlagen und Anwendungsprogrammierung. 1. Auflage. Heidelberg: dpunkt.verlag, 2003

[2] Boehm, H.-J.: Conservative GC Algorithmic Overview, apr 2008
http://www.hpl.hp.com/personal/Hans_Boehm/gc/gcdescr.html

[3] Boehm, H.-J.: Using the Garbage Collector as Leak Detector, apr 2008
http://www.hpl.hp.com/personal/Hans_Boehm/gc/leak.html

[4] DotGNU Portable.NET: HACKING, [file in engine directory], Rev. 1.7, may 2002
http://cvs.savannah.gnu.org/viewvc/dotgnu-pnet/pnet/engine/HACKING?revision=1.7

[5] DotGNU Portable.NET: il_align.h, [file in include directory], Rev. 1.3, may 2007
http://cvs.savannah.gnu.org/viewvc/dotgnu-pnet/pnet/include/il_align.h?revision=1.3

[6] DotGNU Portable.NET: unrolling.txt, [file in doc directory], Rev. 1.7, dec 2004
http://cvs.savannah.gnu.org/viewvc/dotgnu-pnet/pnet/doc/unrolling.txt?revision=1.7

[7] ECMA international: Standard ECMA-334 - C# Language Specification, 4th edition, jun 2006
http://www.ecma-international.org/publications/standards/Ecma-334.htm


[8] ECMA international: Standard ECMA-335 - Common Language Infrastructure (CLI), 4th edition, jun 2006
http://www.ecma-international.org/publications/standards/Ecma-335.htm

[9] Freescale Semiconductor: ColdFire Family Programmer's Reference Manual, Rev. 3, mar 2005
http://www.freescale.com/files/dsp/doc/ref_manual/CFPRM.pdf

[10] Freescale Semiconductor: MCF5485 Device Errata, Rev. 4, dec 2007
http://www.freescale.com/files/32bit/doc/errata/MCF5485DE.pdf

[11] Kühnel, A.: Visual C#. Das umfassende Handbuch. 1. Auflage. Bonn: Galileo Press, 2005

[12] Green, A.: libffi - a portable foreign function interface library, apr 2008
http://sourceware.org/libffi

[13] MicroAPL Ltd: Differences between ColdFire & 68K, oct 2006
http://www.microapl.com/Porting/ColdFire/cf_68k_diffs.html

[14] Mono Project: Mono Debugger, dec 2007
http://www.mono-project.com/Debugger

[15] Mono Project: Mono Runtime, jan 2008
http://www.mono-project.com/Mono:Runtime

[16] Mono Project: Mono FAQ: General, feb 2008
http://www.mono-project.com/FAQ:_General

[17] Mössenböck, H.: Softwareentwicklung mit C# 2.0. Ein kompakter Lehrgang. 2., aktualisierte und erweiterte Auflage. Heidelberg: dpunkt.verlag, 2006

[18] Weatherley, R.: Portable.NET Benchmark, [README file in archive], feb 2003
http://download.savannah.gnu.org/releases/dotgnu-pnet/pnetmark-0.0.6.tar.gz

[19] Schwichtenberg, H.: Microsoft .NET 3.0 Crashkurs. Schnelleinstieg in neue Technologien und Tools. Unterschleißheim: Microsoft Press Deutschland, 2007


[20] Smith, J. E.; Nair, R.: The Architecture of Virtual Machines. In: Computer (2005), Vol. 38, pp. 32-38

[21] The Institute of Electrical and Electronics Engineers: ANSI/IEEE Standard 754-1985, Standard for Binary Floating Point Arithmetic, 1985

A Contents of the enclosed CD

This paper is published together with a CD containing the following directories:

interrupt  Source files for the interrupt kernel module and the userspace application.

patches    Patch files, which add the Coldfire porting, debug and trace support for cross compiling, the enhanced testsuite, and the macro counting.

pdf        This paper in PDF format, readable on a PC.

timings    An OpenDocument Spreadsheet containing the exact results for the measured interrupt response times.
