MASTER’S THESIS | LUND UNIVERSITY 2013

A thin MIPS hypervisor for embedded systems

Mikael Sahlström

Department of Computer Science
Faculty of Engineering LTH

ISSN 1650-2884 LU-CS-EX 2013-38


Mikael Sahlström [email protected]

October 15, 2013

Master’s thesis work carried out at SICS Swedish ICT.

Supervisor: Arash Vahidi, [email protected]
Examiner: Per Andersson, [email protected]

Abstract

Embedded systems are becoming more and more important as both the systems and their applications grow more advanced. This increases the demand for and emphasizes the need for securing these systems. Virtualization is one tool for solving this problem, providing isolation between trusted critical applications and untrusted feature-rich applications. In this thesis we investigate how a thin hypervisor can provide isolation on a MIPS-based embedded system with a minimal footprint and performance impact, and we implement and evaluate such a hypervisor.

Keywords: Hypervisor, MIPS, embedded systems, virtualization, security

Abstrakt

Embedded systems are becoming more and more important as both the systems and the programs that run on them grow more advanced. This raises the demands on, and the importance of, security in these systems. Virtualization is one solution to this problem and can isolate critical programs from untrusted programs with larger and richer feature sets. In this report we investigate how a thin hypervisor can provide a MIPS-based embedded system with isolation without sacrificing too much performance. We do this by implementing and evaluating a thin hypervisor.

Acknowledgements

I would like to thank Arash Vahidi for the guidance and support, Viktor Do for putting up with all my questions and everyone else at the SICS security lab in Lund. I would also like to thank Per Andersson and Jonathan Kämpe for the feedback.

Contents

1 Introduction
   1.1 Purpose and goals
   1.2 Thesis overview

2 Virtualization
   2.1 Isolation
   2.2 Types of virtualization
      2.2.1 ISA translation
      2.2.2 Para-virtualization
      2.2.3 Hardware support
   2.3 Virtual addressing
   2.4 Page table
   2.5 Putting it all together

3 The MIPS architecture
   3.1 Overview
      3.1.1 Registers
      3.1.2 Coprocessors
      3.1.3 Pipeline
      3.1.4 Memory
   3.2 Exceptions and interrupts
   3.3 Coprocessor 0
      3.3.1 Coprocessor 0 registers
      3.3.2 CP0 hazards
   3.4 Caches
   3.5 Memory management unit (MMU) and the TLB
      3.5.1 CP0 TLB registers
      3.5.2 TLB entries
      3.5.3 A TLB refill handler

4 Implementation of a thin hypervisor
   4.1 Structure
   4.2 Memory management
      4.2.1 Memory layout
      4.2.2 Translation lookaside buffer
      4.2.3 Page tables
   4.3 Isolation
   4.4 Hypercalls

5 Evaluation
   5.1 Methods
   5.2 TLB refills
   5.3 Hypercalls
   5.4 Interrupts
   5.5 System calls
   5.6 Multiple applications or multiple guests
   5.7 Isolation
   5.8 Hypervisor size

6 Conclusions
   6.1 Future work

Bibliography

List of Figures

2.1 Virtualization in a nutshell.
2.2 Virtual addressing of the pages A, B, C and D [7, p. B-41].
2.3 Mapping of a virtual address to physical memory with a 2-level page table [7, p. B-45].
3.1 MIPS32 memory map [6, p. 51].
3.2 Fields in the status register [6, p. 60].
3.3 A TLB entry [6, p. 132].
3.4 EntryHi and PageMask register fields [6, p. 134].
3.5 EntryLo0-1 register fields [6, p. 136].
4.1 Basic hypervisor structure.
4.2 Memory layout of the hypervisor and guests.
5.1 Memory accesses with both sequential and pseudorandom accesses.
5.2 Illustration of how isolation is maintained.


List of Tables

3.1 Registers and their use [6, p. 36].
3.2 Exceptions and their mnemonic [6, p. 66 - 67].
3.3 CP0 control registers for memory management [6, p. 133].
4.1 Hypercalls provided by the hypervisor.
5.1 Amount of instructions spent on each hypercall.
5.2 Amount of instructions spent on each interrupt.
5.3 Amount of instructions spent on each system call.
5.4 Running multiple applications in one guest compared to running multiple guests with one application.
5.5 Size of the hypervisor files in lines of code, including comments and empty lines.
5.6 The amount of memory used by the hypervisor, where n is the amount of guests and r is the amount of page table entries.
5.7 The amount of memory used by a guest without the hypervisor.


Chapter 1 Introduction

Virtualization is an important tool for securing applications, providing isolation with a relatively low computational overhead. A hypervisor enables virtualization and lies between the hardware and one or more virtual machines as an abstraction layer. By controlling hardware accesses, the hypervisor can protect sensitive applications from each other and from themselves [2].

As embedded systems become more advanced, with full-featured systems like Android running third-party applications, the need for securing them becomes more and more apparent. Developers are presented with an interesting challenge: how can critical software run securely side by side with untrusted software without sacrificing too much performance? Virtualization with a thin hypervisor, a minimal low-footprint hypervisor, is one solution to this problem. For example, a real-time operating system could run critical communication software while noncritical applications run in a more user-oriented system like Android, with the hypervisor making sure the two systems are separated from each other. This approach keeps the critical system small and manageable, minimizing the risk of bugs and making verification of the critical software easier, while still offering a modern user experience that, if compromised, will not affect the secure system [4].

Virtualization can also provide better hardware utilization, which is exploited on servers where virtualized applications can be moved between machines as load varies. In embedded systems, however, there is usually not much room for advanced hardware utilization tricks. Here, it is the security aspects of virtualization that are interesting.

A thin hypervisor can provide virtualization while having a minimal memory footprint and low overhead. By keeping the amount of hypervisor code to a minimum, we reduce the risk of bugs and simplify code audits and verification. These aspects, among others, make virtualization a very interesting technique for embedded systems, where many applications can benefit from it.


1.1 Purpose and goals

The purpose of this thesis is to explore how a thin hypervisor can be implemented on the MIPS architecture and how it can provide virtualization with a minimal footprint and performance impact. This thesis is divided into four tasks:

• A study of the MIPS architecture with most focus on parts relevant to hypervisor design.

• Design of a thin hypervisor for MIPS capable of running multiple guests.

• Implementation of a thin hypervisor running on a simulated embedded platform.

• Analysis of security and performance aspects.

The hypervisor implementation will be done entirely within the context of this thesis and by its author. To be able to run our hypervisor we will use the Open Virtual Platforms (OVP) simulation tool to simulate an embedded platform. This platform is constructed by writing C code describing the platform and compiling it together with the existing peripheral descriptions in OVP. The file containing the platform description consists of less than 100 lines of code.

We need guests to use when testing the implementation. These guests should be as close as possible to what would be used in real-world applications. We therefore choose to base our guests on FreeRTOS, a viable and widely used real-time operating system for embedded systems with support for over 33 architectures [18]. To be able to use FreeRTOS as a guest, we have to modify it so that it can run natively on our CPU and also so that it can run on top of our hypervisor. We can then compare guests running natively with guests running on top of our hypervisor.

1.2 Thesis overview

This thesis is organized as follows: the next chapter describes different methods and principles of virtualization, focusing on the aspects relevant to this thesis. Then we explain the MIPS architecture with focus on the parts that matter when creating a hypervisor. The fourth chapter describes how a thin hypervisor can be implemented on the MIPS architecture, and the chapters that follow evaluate and discuss this implementation with measurements from tests done on the prototype.

Chapter 2 Virtualization

By virtualization we mean abstraction of hardware, where some layer of logic provides services to a system running on top of it. The system running on top accesses restricted services through a virtual interface provided by that layer. The system or systems running on top of a virtualization layer are called guests. Figure 2.1 shows a typical virtualization layout where three independent guests, for example three different operating systems or three instances of the same operating system, are running on the same hardware.

Figure 2.1: Virtualization in a nutshell. (Diagram: guests on top of per-guest virtual interfaces, a virtualization layer beneath them, and the hardware at the bottom.)

Normally the hardware defines an interface that the system running on top of it uses to access the services provided by the hardware, for example for managing interrupts. With virtualization we can redefine that interface as we please, or simulate one like the interface defined by the hardware. From the point of view of the system running on top of the virtualization layer, there is still a well-defined interface to use [5, p. 1-6].

There are some requirements on the hardware for it to be able to support virtualization. There need to be at least two security modes, where at least one of them can not run the privileged instructions that manage hardware. In 1974, Gerald J. Popek and Robert P. Goldberg formulated this in a theorem: “For any conventional third generation computer, a virtual machine monitor may be constructed if the set of sensitive instructions for that computer is a subset of the set of privileged instructions.” [1, p. 417]. To understand this theorem we need to understand three concepts:

Third generation computer is a computer that has the properties mentioned above: two security modes, a relocation mechanism (in our case the TLB) and trap mechanisms.

Sensitive instructions are instructions that try to access or modify the configuration of resources in the system.

Privileged instructions are instructions that are only allowed to execute in the privileged security mode and will cause a trap when executed in an unprivileged security mode.

What this theorem says is that for hardware to be able to support virtualization, sensitive instructions must not be freely available to guests. The TLB is discussed further in section 2.3. One of the techniques used to implement the trap mechanisms that catch sensitive instructions is discussed further in section 2.2.2.

Instruction Set Architecture (ISA)

We mentioned sensitive instructions and privileged instructions earlier; they are all part of the instruction set architecture (ISA). This is the lowest-level instruction interface, communicating directly with the hardware. Code may be compiled to it from a higher level language like C or interpreted by some intermediary, for example the Java Virtual Machine. In the end, all software is executed through some ISA.

2.1 Isolation

The main point of virtualization is to provide isolation between guests. All virtualization approaches provide this, to prevent guests from affecting each other in undesirable ways. While this isolation is what we want, it can pose challenges in embedded systems. Guests usually contribute together to the overall function of the device and need some way of communicating securely with each other while still preserving their mutual isolation [3]. This is accomplished with remote procedure calls (RPC).


2.2 Types of virtualization

Isolation between guests can be accomplished in multiple ways using several techniques. Different types of virtualization mainly differ in which guests they can host and in how the guests communicate with the virtualization layer.

One type is process virtualization, where an operating system provides a virtualized environment for its processes to run in. Process virtualization will not be covered by this thesis; we will instead focus on system virtualization. System virtualization is when the virtualization layer can host an entire operating system, for example as shown in figure 2.1. A hypervisor is another name for the virtualization layer in this case. It virtualizes all the resources of a real machine, including CPU, memory and devices, creating a virtual machine [5, p. 10-13].

2.2.1 ISA translation

It is possible, in both process virtualization and system virtualization, to emulate both hardware and ISAs. The latter can be done with two techniques, interpretation and binary translation [5, p. 27-29]. Interpretation is the process of interpreting each instruction while the system is running, so that it executes correctly. It takes extra resources to interpret each instruction into supported instructions, but we gain the ability to run applications built for an ISA the hardware does not support [5, p. 29-32]. If both the guest and the host have the exact same ISA we do not have to translate every instruction, only sensitive ones, and can in this way save some execution time. We will not use ISA translation in this thesis, but instead para-virtualization.

2.2.2 Para-virtualization

Para-virtualization is a technique that gives the guests an interface to the hypervisor, which they use to perform privileged operations. The guest is modified to run on a hypervisor instead of directly on hardware. The term para-virtualization was coined by the Denali system, where it was utilized to create a lightweight multi-VM environment suited for networked application servers [2, p. 11]. In this thesis, we use para-virtualization when implementing our guests.

2.2.3 Hardware support

Hardware vendors have developed hardware support for virtualization to simplify implementation and reduce performance costs. Both AMD [15] and Intel [16] have released hardware supporting a new privileged security mode that all sensitive instructions trap to. ARM hardware virtualization support consists of virtualization extensions as well as a security extension called TrustZone, with two different states: one that executes trusted code and one that executes untrusted code.


Release 5 of the MIPS architecture [17] includes hardware support for virtualization.

Hardware supported virtualization can be more efficient than software assisted virtualization but makes the hardware more complex. By not depending on specific hardware support, the benefits of virtualization can also reach already existing hardware, as long as it supports the three fundamental requirements mentioned in the theorem at the beginning of this chapter. Those requirements are discussed further in the sections that follow.

2.3 Virtual addressing

Virtual addressing enables different applications to have their own address space and to use addresses independent of the hardware. One way to achieve this is to divide physical memory into blocks and assign these blocks virtual addresses, usually combined with an access policy. These blocks of physical memory are called pages [7, p. B-40 - B-44].

Figure 2.2: Virtual addressing of the pages A, B, C and D [7, p. B-41].

For the hypervisor to run transparently to the guests, they need to have their own virtual address spaces. This gives each guest a complete address space (not entirely true in the MIPS architecture, more on this in section 3.1.4), just as it would have when running alone on the hardware. The hypervisor can also make pages read-only or inaccessible to the guest, which enables interesting security applications.

Guests do not know anything about the physical addresses, and their pages can be remapped to any physical address at any point. But the hardware must know where to store or read information referenced by a certain virtual address. These translations are stored in a table called a page table. It contains mappings between virtual addresses and physical addresses as shown, simplified, in figure 2.2. For each virtual-to-physical translation the hardware does not know, it raises an interrupt so that the translation can be found. To speed up translation of addresses, a hardware cache called the Translation Lookaside Buffer (TLB) is used.


The TLB can not cache every mapping needed, so the interrupt used to find translations is used to refill the TLB. This interrupt is called a TLB refill interrupt. In MIPS the TLB refill is handled in software, not by a hardware-defined refill handler as in some architectures. It is, in MIPS, up to the TLB refill code to decide which entry in the TLB to replace. Most operating systems replace the least recently used entry [7, p. B-45], as a recently accessed address will likely be accessed again (the principle of locality); by keeping recently accessed mappings in the TLB we avoid a lot of unnecessary refills [7, p. 45].

2.4 Page table

There are different methods of designing page tables; the most obvious is the 1-level page table. Here, the virtual address is split into two parts, where one is used to find the correct page in a single page table and the other is the page offset. A page table will usually also contain restrictions regarding page access and management, for example read-only or no-execute. These are also stored in the TLB.

We can have multiple page tables, for example one per guest or one per application, and let some part of the virtual address dictate which page table to use. This is illustrated in figure 2.3.

Figure 2.3: Mapping of a virtual address to physical memory with a 2-level page table [7, p. B-45]. (Diagram: the virtual address is split into a page table number, a virtual page number and a page offset; the page table number selects one of several page tables, which maps the virtual page number to a physical address.)
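To make the lookup concrete, the following C sketch performs the translation of figure 2.3. The bit widths (a 4-bit page table number, 16-bit virtual page number and 12-bit page offset) are our assumptions for illustration, not values mandated by any particular architecture.

#include <stdint.h>

#define PAGE_SHIFT 12   /* 4 KB pages */
#define VPN_BITS   16   /* assumed width of the virtual page number */

/* One page table per "page table number"; each maps VPNs to PFNs. */
extern uint32_t *page_tables[16];

uint32_t translate(uint32_t vaddr)
{
    uint32_t table  = vaddr >> (PAGE_SHIFT + VPN_BITS);            /* which page table */
    uint32_t vpn    = (vaddr >> PAGE_SHIFT) & ((1u << VPN_BITS) - 1);
    uint32_t offset = vaddr & ((1u << PAGE_SHIFT) - 1);

    uint32_t pfn = page_tables[table][vpn];    /* the page table lookup */
    return (pfn << PAGE_SHIFT) | offset;       /* physical address */
}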

In some architectures, for example ARM [2, p. 59-61], the layout of the page table is predefined by hardware. This can speed up the page table lookup routine at the expense of more complex hardware. MIPS CPUs leave everything up to software and only provide some limited assistance when processing a TLB refill exception. With no restrictions on page table layout, it can be tailored to the specific system.

If each page table entry has a size of 16 bytes (4 integers) and maps a pair of pages, and we want to map 2 GB of memory, the page table would take up 4 MB of memory (assuming a 4 KB page size). Combined with the fact that most systems use only the top and bottom addresses, we would have a hole in the middle that is never used and would thus store unnecessary parts of the page table. One solution to this is to put the page table itself in virtual memory. The penalty is that we then have to take care of TLB misses (translations not found in the TLB) occurring when we access memory within the TLB refill handler, that is, TLB misses on addresses referencing the page table. This adds some complexity to our TLB refill handler but reduces the amount of memory used. The approach requires that we can place our page table wherever we want and that we handle the page table lookup in software. As we will see later this is fully possible, and even recommended, in MIPS [6, p. 144].
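For concreteness, the arithmetic behind the 4 MB figure, assuming each 16-byte entry maps a pair of 4 KB pages as the MIPS TLB does (see section 3.5.2):

\[
\frac{2\ \mathrm{GB}}{2 \times 4\ \mathrm{KB}} \times 16\ \mathrm{B} = 2^{18} \times 16\ \mathrm{B} = 4\ \mathrm{MB}.
\]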

2.5 Putting it all together

The hypervisor will run in a higher privilege mode than the guests, which can not access or manage hardware without going through the hypervisor. Instructions that manage hardware, so called sensitive instructions, have to be a subset of the instructions only allowed to execute in a higher security mode, so called privileged instructions.

All memory accesses will go through the TLB so that the hypervisor can maintain isolation by mapping them to the correct locations. This is done with the help of multiple page tables, usually one per guest. These page tables are pre-constructed to match where the guests are placed in physical memory and contain the correct access policy for each page.

Hardware accesses, such as setting up periodic timer interrupts or temporarily disabling interrupts, are replaced with hypercalls to the hypervisor and handled there. Sometimes the hypervisor will translate them to a real hardware access and sometimes it can simulate the access, as it does for periodic timer interrupts.
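As an illustration, a guest-side hypercall can be implemented as a small stub around the MIPS syscall instruction, which traps into the hypervisor. The calling convention sketched here (call number in a0, argument in a1, result in v0) is our assumption for illustration; the hypercalls actually provided by our hypervisor are described in section 4.4.

#include <stdint.h>

/* Minimal guest-side hypercall stub (GCC inline assembly). The
   syscall instruction traps to the hypervisor, which reads the call
   number and argument out of the saved registers. */
static inline uint32_t hypercall(uint32_t nr, uint32_t arg)
{
    register uint32_t a0 asm("$4") = nr;   /* a0 */
    register uint32_t a1 asm("$5") = arg;  /* a1 */
    register uint32_t v0 asm("$2");        /* v0 */

    asm volatile("syscall"
                 : "=r"(v0)
                 : "r"(a0), "r"(a1)
                 : "memory");
    return v0;
}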

Chapter 3 The MIPS architecture

The MIPS project was started in 1981 at Stanford University by John L. Hennessy and a number of graduate students. Hennessy would, in 1984, leave Stanford to co-found MIPS Computer Systems, which later turned into MIPS Technologies. Even though MIPS is best known for its use in embedded systems, there are cases where MIPS has been used for high performance computing, for example in Silicon Graphics workstations and servers and in SiCortex supercomputers.

3.1 Overview

MIPS is a reduced instruction set computer (RISC) architecture and made use of pipelining from the start. Pipelining was well known at the time but not yet used to its full potential. There are four different instruction set architectures (ISA) supported by MIPS:

• MIPS32.

• MIPS64.

• microMIPS32.

• microMIPS64.

MIPS32 is the standard and initial ISA that MIPS CPUs supported. MIPS32 has 32-bit registers and a 32-bit wide address space. MIPS32 is a subset of MIPS64, which is, together with some register and address translation tricks (see sections 3.1.1 and 3.1.4), why MIPS32 code can run on a MIPS64 CPU. microMIPS32 and microMIPS64 are code compression ISAs with 16- and 32-bit instructions. They have similar performance to the regular ISAs but with lower code size. We will focus on MIPS32 in this thesis.


Early MIPS CPUs had only two privilege levels, user mode and kernel mode, but a third, supervisor mode, has since been added. Changing mode does not add or remove any instructions or features; it just makes certain things illegal.

3.1.1 Registers

There are 32 general purpose 32-bit registers on a MIPS32 CPU as well as some special purpose registers [10, p. 31]. The names and purposes of the 32 general purpose registers can be found in table 3.1.

The special purpose registers in a MIPS CPU are two that hold the results of integer multiply, divide and multiply-accumulate operations [10, p. 31]. One register holds the high part and one the low part of the result, and they are retrieved with two instructions, mflo and mfhi. These two registers are interlocked, which means that if you try to retrieve the result before the calculation is done, the processor will stop and wait until it is. In case of a division, the low register will store the quotient and the high register the remainder [6, p. 38].
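A minimal illustration of the hi/lo mechanism (register names as in table 3.1):

        mult  t0, t1     # (hi, lo) = t0 * t1, a 64-bit product
        mflo  t2         # t2 = low 32 bits of the product
        mfhi  t3         # t3 = high 32 bits of the product

        div   t0, t1     # assembler shorthand: lo = t0 / t1, hi = t0 % t1
        mflo  t4         # t4 = quotient
        mfhi  t5         # t5 = remainder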

Table 3.1: Registers and their use [6, p. 36].

Register number  Name      Description
0                zero      Always returns 0.
1                at        Reserved for use by assembly.
2 - 3            v0, v1    Values returned by subroutines.
4 - 7            a0 - a3   First four arguments to a subroutine.
8 - 15, 24, 25   t0 - t9   Temporary registers subroutines can use without saving any values.
16 - 23          s0 - s7   For use in subroutines. Existing values must be saved away and restored on exit.
26, 27           k0, k1    Reserved for interrupt and trap handlers.
28               gp        Global pointer. Used for easy access to static or extern variables.
29               sp        Stack pointer.
30               s8 or fp  A ninth subroutine register or, for subroutines that need one, a frame pointer.
31               ra        Return address for subroutines.

3.1.2 Coprocessors

Coprocessors are traditionally optional parts of the processor dedicated to some extension to or subset of the functionality. In MIPS processors this is only partly true, as coprocessor 0 (CP0) is not optional. CP0 is embedded on the CPU chip and is the system coprocessor. It contains the memory management unit (MMU), the status register and a number of other registers used for exception handling and interrupts [10, p. 30]. It is because CP0 contains these important registers that it has to be a part of the processor. In all, there are four coprocessors defined in the MIPS architecture, coprocessor 0 being one. CP0 is discussed further in section 3.3.

CP1 is reserved for the floating point coprocessor (FPU), which uses two floating point formats recommended by IEEE 754: single precision (32 bits of storage) and double precision (64 bits of storage). CP1 has 32 floating point registers of its own that are used for floating point control and for setting and getting values [6, p. 156-159]. CP2 is available for specific implementations, for example custom ISA extensions, specific registers on a SoC, or just to get 32 additional easily accessible registers [6, p. 60]. CP3 is reserved for the floating point unit in MIPS64 release 1 and later architectures [10, p. 30]. It is not really compatible with CP1, because its decode space overlaps with the standard MIPS32/64 floating point instruction set [6, p. 60].

3.1.3 Pipeline

A simple RISC pipeline can consist of five stages: instruction fetch (IF), instruction decode (ID), execution (EX), memory access (MEM) and write back (WB). The first stage (IF) fetches the instruction pointed to by the program counter (PC) and updates the PC to point to the next instruction, the second stage (ID) reads the registers specified in the instruction, the third stage (EX) performs arithmetical or logical operations, the fourth stage (MEM) performs loads or stores to memory and the last stage (WB) writes back data to the register file [7, p. C-5].

The MIPS architecture started with a five stage pipeline like the one described above, and pipelines similar to it are still used by simpler MIPS CPUs [6, p. 6]. Even though the pipeline differs between MIPS CPUs, for example some have a five stage pipeline while the R4000 has an eight stage pipeline, there are a few things all MIPS CPUs have in common:

The instruction after a branch or jump will always be executed, independent of whether the branch is taken or not. It is up to the compiler to make use of this feature. A simple, but inefficient, solution is to put a nop after each branch, but it is much better to place useful work there, such as incrementing a counter or one of the instructions from before the branch. This instruction position is called the branch delay slot [10, p. 58].

When fields in coprocessor 0, such as fields in the status register (see section 3.3.1), are changed, they can potentially affect all pipeline stages and thus the instructions within the pipeline. The instruction ehb has been introduced to handle this and guarantees that later instructions still run correctly.
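A small example of the branch delay slot; the instruction in the slot executes regardless of whether the branch is taken:

        beq   a0, zero, done   # branch if a0 == 0 ...
        addiu t0, t0, 1        # ... delay slot: always executed
        # fall-through path continues here
done: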

3.1.4 Memory

MIPS load/store addresses are always formed by adding a 16-bit offset to a base register. Program addresses are also never the same as physical addresses, even though they are in some cases closely related [6, p. 47-50]. This makes the term virtual address interesting, as all addresses are virtual addresses even though they are not always translated through the translation lookaside buffer. In figure 3.1 we can see that some sections are mapped and some unmapped. Mapped sections are translated through the translation lookaside buffer and unmapped sections are not, but all addresses are still virtual.

All loads and stores on MIPS must be aligned. Half-words (two bytes) may be loaded only from a 2-byte boundary and whole words (four bytes) only from a 4-byte boundary. Unaligned loads will cause a trap. In user mode, any address with the highest bit set is illegal, and trying to access such addresses will cause a trap. There are also some instructions, for example CPU control instructions, that are illegal in user mode.

The address space is divided into four areas named kuseg, kseg0, kseg1 and kseg2. kuseg is located in the low 2 GB of the address space and is always accessible from user mode. These addresses are translated by the MMU and should not be used in systems without one [6, p. 50]. In section 2.3 it was mentioned that guests can not have access to the complete address space. The explanation is that kuseg is the only address space that is both mapped through the MMU and accessible in user mode. Virtual addresses usable by guests can therefore only span the range 0x0000 0000 to 0x7fff ffff.

Figure 3.1: MIPS32 memory map [6, p. 51]. (Diagram: kuseg at 0x0000 0000, mapped; kseg0 at 0x8000 0000, cached and unmapped; kseg1 at 0xA000 0000, uncached and unmapped; kseg2 at 0xC000 0000, supervisor-accessible and mapped, where the figure places a page table; 0xE000 0000 and up, kernel-accessible and mapped.)

The 512 MB above kuseg is called kseg0 and is not translated by the MMU; instead these addresses are translated by stripping off the top bit, mapping them to the low 512 MB of physical memory. This area is usually where the operating system is located, but in our case it is where the hypervisor is located, as it can only be accessed in kernel privilege mode. kseg0 is almost always accessed through the cache [6, p. 48]. The default interrupt entry points are also located here, at address 0x8000 0000.

After kseg0 there is another 512 MB called kseg1. These addresses are, like kseg0, not translated through the MMU; instead the leading three bits are stripped off. This gives kseg1 duplicate mappings with kseg0, and as a result, kseg1 is not cached. kseg1 is also the only area that is guaranteed to behave properly after a system reset, which is why the after-reset starting point (0xbfc0 0000) is located here. The corresponding physical starting address is 0x1fc0 0000 [6, p. 49]. This space can only be accessed in kernel mode.

The top 1 GB of the address space is called kseg2. This space is split into two if supervisor mode support is enabled, where the low half is accessible in supervisor mode and the top half in kernel mode. kseg2 is translated through the MMU and should therefore only be accessed after the MMU has been set up.

3.2 Exceptions and interrupts

Exceptions in MIPS are precise exceptions and are designed to make life as easy as possible for software. It is guaranteed that instructions before the exception victim are completely finished, and that instructions after (including the victim) are, even though they might be in the pipeline, treated as if they had never been run, with all their effects reversed. This ensures that execution can be continued at EPC (see section 3.3.1) without any instruction that might have entered the pipeline before the exception having changed the CPU state.

Exceptions also appear in instruction sequence even though they might be discovered at different times due to pipelining. Exceptions discovered early in the pipeline are noted and moved with the instruction through the pipeline until the end, where the exception is either raised or discarded depending on whether any preceding instruction discovers an exception late in the pipeline. A discarded exception is lost, though in reality some later instruction will usually discover the same or a similar exception. The different exceptions that can be issued are listed in table 3.2.

On an interrupt in MIPS CPUs the last instruction to be completed, since exceptions are precise, is the one that has just finished the MEM stage in the pipeline, and the victim will be the one that just finished its EX stage (see section 3.1.3).

All exception entry points lie in untranslated regions of the memory map: kseg1 for uncached and kseg0 for cached entry points. The uncached entry points, used when BEV in the status register (see section 3.3.1) is set, are fixed, but when BEV is not set, the EBase register can be used to move all entry points, together as one block, somewhere else. A MIPS CPU does the following when it takes an exception:

1. Set up EPC to point to the restart location.

2. Set EXL in the status register to force the CPU into kernel mode and disable interrupts.

3. Cause is set up so that software can see the reason for the exception. If the exception is an address exception, BadVAddr (see section 3.3.1) is set. Memory management exceptions set some MMU registers.

4. The CPU starts to run instructions at the exception entry point.

When returning to user mode from an exception, the software must make sure that no user program instructions run in kernel mode and that no kernel privilege instructions are executed in user mode (which would cause another exception). The instruction eret does this, as well as returning to the address stored in the EPC register [6, p. 114].
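A sketch of the tail of an exception handler under these rules (EPC is CP0 register 14; the context restore is omitted). Advancing EPC is only correct for traps like syscall, must not be done for, for example, TLB refills, and needs extra care if the victim sits in a branch delay slot:

        # ... restore the interrupted program's registers ...
        mfc0  k0, $14      # EPC: the restart address
        addiu k0, k0, 4    # e.g. step past a syscall instruction
        mtc0  k0, $14
        ehb                # clear the CP0 execution hazard
        eret               # clears EXL, jumps to EPC in the KSU mode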


Table 3.2: Exceptions and their mnemonic [6, p. 66 - 67].

Mnemonic  Description
Int       Interrupt.
Mod       Memory store, but the page is marked as read-only in the TLB.
TLBL      No valid entry in the TLB matches the virtual address used.
TLBS      A TLB refill exception when EXL in the status register is set, usually a TLB refill within the TLB refill handler.
AdEL      Address error on load. For example an attempt to access an address outside of kuseg when in user mode.
AdES      Address error on store. For example an attempt to access a misaligned address.
IBE       Bus error on an instruction fetch.
DBE       Bus error on a data access; external hardware has signaled an error of some kind.
Syscall   Executed a syscall instruction.
Bp        Executed a break instruction.
RI        Instruction code not recognized.
CpU       Tried to run a coprocessor instruction but that coprocessor is not enabled.
Ov        An integer arithmetic instruction with an overflow trap has overflowed.
TRAP      Condition met on one of the conditional trap instructions.
FPE       Floating-point exception.
C2E       Exception from CP2.
MDMX      Tried to execute an MDMX instruction but the MX field in the status register was not set.
Watch     Physical address of a load/store matched the enabled value in the WatchHi/WatchLo registers.
MCheck    Machine check. The CPU has discovered some disastrous error, for example two TLB entries matching one virtual address.
Thread    Thread-related exception.
DSP       Tried to run a DSP ASE instruction but the corresponding enable field in the status register is not set.
CacheErr  Parity/ECC error somewhere in the CPU.

3.3 Coprocessor 0

The main responsibilities of coprocessor 0 (CP0) are:

• CPU configuration.


• Cache control.

• Exception and interrupt control.

• Memory management unit (MMU) control.

CPU configuration is done through the various registers attached to CP0; these can be used to select features such as little- or big-endian operation, how the system interface works or which coprocessors to enable. CP0 can be used to manipulate cache entries through different flavors of cache instructions. Registers in CP0 also define what will happen on which interrupts or exceptions and how to handle them.

3.3.1 Coprocessor 0 registers

There are a number of important registers on CP0. For example, after power up, the status register (SR) has to be set up correctly to continue the bootstrap process. A few configuration registers will probably also have to be set, although this is dependent on the CPU implementation. The registers relevant to this thesis are explained in the sections that follow.

The status register (SR) The status register is a 32 bit wide register and contains all the mode fields on a MIPS CPU. This is the most important register in a CPU.

Figure 3.2: Fields in the status register [6, p. 60]. (Fields, from bit 31 down: CU3-0, RP, FR, RE, MX, PX, BEV, TS, SR, NMI, IM7-0, KX, SX, UX, KSU, ERL, EXL, IE.)

This register contains a field, KSU, that can be read to see which privilege mode the CPU currently is in, or written to change the privilege mode. The field is ignored when the exception level bit (EXL) is set, which happens when an interrupt is issued and forces the CPU into kernel privilege mode. Interrupts have to be enabled by setting the correct bits in the IM field. Except for TLB refills and other errors in the pipeline, interrupts not enabled here will not be issued.

When the system is booting up, the caches, the registers (with some exceptions) and the TLB are in an undefined state. To be able to initialize these parts of the CPU and run application boot code, we need to start in some defined state where the cache and TLB are not used. As we can see in figure 3.1, the only uncached address space is kseg1. The boot entry point is therefore there. We also want the interrupt entry points to be relocated from kseg0, which is cached, to kseg1 so that we can take care of interrupts. The BEV field in the status register does this, relocating the interrupt entry points to kseg1, and it is set on boot. When the initialization of the cache and TLB is done this field can be cleared and the system can continue as normal.


There are two instructions that disable and enable all interrupts set in the IM field, di and ei. These two instructions clear or set the interrupt enable (IE) field, which disables or enables the interrupts without having to touch the IM field.

Exception program counter (EPC) This register holds the address where execution should be continued after an in- terrupt. The address of the instruction that caused the interrupt is stored in this register unless the instruction was located in the branch delay slot, then the ad- dress of the instruction before the causing instruction is stored or we would miss the branch when returning execution. The eret instruction will continue execution at the address stored in this register as well as clearing the EXL field of the status register, which makes the CPU to go back to the security mode, for example kernel mode, set in the KSU field.

Cause The cause register holds information about which interrupts that have been issued. It can be read to know, for example, exactly which interrupt that triggered the general exception handler to be executed. It contains a field called interrupt pending (IP) matching the IM field in the status register and the corresponding bits will be set for each interrupt that has been issued and are waiting to be handled. It also contains fields which are set when interrupts that can not be disabled with the IM field is issued, such as timer interrupts (TI). When an interrupt is issued due to an instruction in the branch delay slot the field branch delay (BD) is set. By combining this information with the address stored in the EPC register we can always find the instruction that caused the interrupt. The cause register also contains a field that is set, by the hardware, to predefined values describing the interrupt in more detail, for example that the interrupt was caused by a syscall or that it was caused by a TLB miss occurring when inside the TLB refill handler.

Bad virtual address This register holds the address that causes any MMU related exception, bad align or if a user programs that try to access addresses outside of kuseg.

Count and compare The count register acts like a timer and increments at a constant rate independent of instructions in the pipeline. If a value is written to the compare register a timer interrupt request is made when this value is equal to the value of the count register. This timer interrupt can be identified by looking at the cause register where the TI field and a bit in the IP7-2 field will be set. When a timer interrupt is handled, the compare register has to be written to clear the interrupt.


Configuration registers

The MIPS architecture defines four config registers. These mostly contain read-only values that describe relevant information about the hardware. For example, config0 contains information on which type of MMU is used, and config1 the size of the TLB and the L1 cache. The sizes of the L2 and L3 caches can be read in config2.

3.3.2 CP0 hazards

As mentioned in section 3.1.3, there are several instances where changes in the CPU state affect an instruction in the pipeline; such hazards have to be cleared, or undesired side effects appear. All hazards are cleared by an exception or eret. Execution hazards can be cleared with ehb, which older CPUs will see as a nop, and instruction hazards are cleared with jr.hb or jalr.hb. Instruction hazards occur when the CP0 state is changed, for example changing a TLB entry followed by a fetch, load or store instruction in the affected page. There are also hazards between CP0 instructions, for example mfc0 being dependent on the value in a CP0 register.
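A minimal example of clearing an execution hazard:

        mtc0  t0, $12    # write the status register (CP0 register 12)
        ehb              # hazard barrier; older CPUs execute it as a nop
        mfc0  t1, $12    # guaranteed to observe the value just written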

3.4 Caches

Caching of memory data in small but fast memories for quick access by the CPU is a well known and widely used technique. It keeps the pipeline running without unreasonable waiting times for data to be written to or fetched from memory. Data in the cache is stored in blocks: groups of adjacent words that are very likely to be used together (see the principle of locality [7, p. 45]).

It is common practice for the L1 caches in MIPS CPUs to be indexed using the virtual address and tagged using the physical address. This improves performance, as the cache does not have to wait for the TLB to translate the address, but can lead to cache aliasing. Aliasing is when one physical address gets mapped to multiple virtual addresses; the data can then be placed at multiple indexes, which means that the cache now contains copies of the same data. Modifications made through one address will not be made at the other, and invalid data can be seen as valid. This is a recipe for errors! Aliasing can be avoided by making sure that two alternative addresses for a physical page are separated by a multiple of the largest likely L1 cache set size. In MIPS, L2 caches are always indexed and tagged based on the physical address, so no aliasing can occur there [6, p. 102].

When switching between guests in a hypervisor we have to be careful not to leave stale data in the cache. Having old data there can lead to strange errors or leak information about the previous guest.


3.5 Memory management unit (MMU) and the TLB

The memory management unit (MMU) provides, among other things, memory translation. This is done through the translation lookaside buffer (TLB). The TLB is a hardware unit that translates virtual addresses into physical addresses. Memory is divided, in MIPS as in most other systems [7, p. 106], into 4 KB chunks called pages. MIPS can handle larger pages than this, but 4 KB is the widely used page size.

3.5.1 CP0 TLB registers

MMU control is done through a few instructions that manage registers located on CP0 (see section 3.3). All TLB entries are written or read through the registers listed in table 3.3.

Table 3.3: CP0 control registers for memory management [6, p. 133].

Register name  Register number  Description
EntryHi        10               Holds the virtual page number and the address space identifier.
EntryLo0-1     2-3              Each VPN maps to two PFNs; these two registers each hold one PFN together with its permission flags.
PageMask       5                Can be used to create entries that map pages larger than 4 KB.
Index          0                Determines which TLB entry will be read or written by instructions.
Random         1                A pseudo-random number used by tlbwr to write a new TLB entry at a random location.
Context        4                Helps with looking up the virtual address in a memory-held page table. The fields of the register are laid out so that it can be used as a pointer to the page table record in memory, if the high bits are set correctly.

The Index register is used to read or write a certain entry in the TLB. All entries are numbered from zero to the number of entries minus one. Index is also set automatically when doing a software search of the TLB with tlbp [6, p. 137].
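A sketch of a software TLB search with tlbp. EntryHi is saved and restored around the probe, since the lookup overwrites it (and tlbr would too):

        mfc0  k1, $10        # save EntryHi (current ASID)
        mtc0  k0, $10        # EntryHi = VPN and ASID to look up
        ehb
        tlbp                 # probe: Index = matching entry,
        mfc0  k0, $0         # or bit 31 of Index set on a miss
        bltz  k0, no_match
        nop                  # branch delay slot
        # Index now selects the entry for tlbr or tlbwi
no_match:
        mtc0  k1, $10        # restore EntryHi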


3.5.2 TLB entries

When a program wants to access memory, it presents the TLB with a virtual address and the TLB searches its entries for a matching physical address.

Figure 3.3: A TLB entry [6, p. 132]. (Input: VPN, ASID and page mask, from EntryHi and PageMask; output: a pair of physical pages, each with a PFN and the flags C, D, V and G, from EntryLo0 and EntryLo1.)

The virtual address is stored in a field called the virtual page number (VPN) and the physical address in a field called the page frame number (PFN). In MIPS, each TLB entry maps one virtual address to two physical pages.

The EntryHi and PageMask registers

Figure 3.4: EntryHi and PageMask register fields [6, p. 134]. (EntryHi: VPN in bits 31-13, ASID in bits 7-0; PageMask: Mask in bits 28-13.)

The EntryHi register holds two fields used for finding a TLB entry. The VPN field in this register is, as mentioned, the field that holds the virtual address: the high bits of a virtual program address. Bits 31 to 13 of the program address are used to look up an entry and bit 12 to choose between the two physical addresses stored in each entry. When a refill exception occurs, this field is set up automatically to match the virtual address that could not be found in the TLB [6, p. 134]. The ASID field holds the address space identifier, which is usually used by the operating system to know which address space a virtual address belongs to. When using tlbr to inspect TLB entries this value has to be restored afterwards, as tlbr overwrites it.

The PageMask register is used to set up TLB entries that map larger, or smaller, pages. The mask causes the corresponding bits in the virtual address to be ignored when matching a TLB entry and instead carries them unchanged to the physical address. No MIPS CPU permits arbitrary patterns in this register; the allowed values depend on the CPU's capabilities [6, p. 135].


The EntryLo0-1 registers

The EntryLo0 and EntryLo1 registers hold one physical address each, together with four flag fields controlling access and cache behavior for the page. The fields are described below, followed by a sketch of how such a value can be composed.

Figure 3.5: EntryLo0-1 register fields [6, p. 136]. (PFN in bits 31-6, C in bits 5-3, D, V and G in bits 2, 1 and 0.)

PFN is the higher bits of the physical address that the corresponding VPN will be translated to.

C is a 3 bit field mainly used in cache-coherent multiprocessors when, for example, you know that some pages do not need to have changes tracked automatically. The only universally supported values are 2 (uncached) and 3 (cacheable noncoherent).

D (dirty) functions as a write enable bit. If set to 1, all writes are allowed; if set to 0, all writes are trapped.

V (valid) is used to deny access (set to 0) to addresses. Any use of an address with V set to 0 will cause an exception.

G (global) is used to match virtual addresses regardless of ASID. Set to 1 to ignore the ASID when matching [6, p. 137].
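Given this layout, the EntryLo half of a page table entry can be composed as in the C sketch below. The macro name and the choice to take the PFN from physical address bit 12 and up are ours, not the thesis code:

#include <stdint.h>

/* Field layout from figure 3.5: PFN in bits 31-6, C in 5-3, D, V, G in 2-0. */
#define ENTRYLO(paddr, c, d, v, g) \
    ((((uint32_t)(paddr) >> 12) << 6) | \
     ((c) << 3) | ((d) << 2) | ((v) << 1) | (g))

/* Example: cacheable noncoherent (C = 3), writable, valid, global. */
uint32_t lo0 = ENTRYLO(0x00400000, 3, 1, 1, 1);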

3.5.3 A TLB refill handler

TLB entries are set up by writing the required fields in the EntryHi and EntryLo registers and then using the tlbwr or tlbwi instruction to copy them into the TLB [6, p. 141].

The context register's purpose is to make the TLB refill procedure easier. As mentioned in table 3.3, the high nine bits of this register store what we write there; the rest is automatically set up when a TLB refill exception occurs, so that the register contains the address of the correct entry in the page table. This requires that the start of the page table is located at an address with the low 23 bits set to zero, as we can only write the highest nine bits of the context register. It also requires that each entry in the page table is exactly 16 bytes long, making room for four 32 bit integers. With this layout, and the start address of the page table written to the context register, the register will always contain the address of the correct page table entry when a TLB refill exception occurs. We can then implement a very simple TLB refill handler.


Listing 3.1: A simple TLB refill handler using the context register.

tlb_refill:
    mfc0  k0, cp0_context
    lw    k1, 0(k0)
    lw    k0, 4(k0)
    mtc0  k1, cp0_entrylo0
    mtc0  k0, cp0_entrylo1
    ehb
    tlbwr
    eret

The code in listing 3.1 shows a TLB refill handler that uses the context register. It starts by moving the content of the context register to k0, then loads the first four bytes at this address into k1. These four bytes are set up in advance to contain the PFN and flag fields exactly as EntryLo expects them. Then the next four bytes are loaded into k0. k1 and k0 are then moved to EntryLo0 and EntryLo1 respectively. Now we have to be sure that the writes to EntryLo0 and EntryLo1 have taken effect before we issue a TLB write instruction; the execution hazard barrier ehb ensures this. In our example we do not care into which TLB entry the new entry is written, so we do a tlbwr (TLB write random) and then return execution to the instruction that caused the TLB refill exception with eret.

It is a good idea to put the page table in kseg2, as it is kernel accessible and mapped through the MMU. By doing this we do not have to use as much physical memory for the page table, and if we have multiple page tables, we do not have to place them far apart as we must when using the context register [6, p. 142]. But by placing a page table in mapped memory we can get nested TLB refill exceptions, for example on the lw instructions in listing 3.1. A nested TLB refill exception happens when the CPU already is in exception mode (EXL is set to one). It is then redirected to the general exception handler, where it can be detected and handled specially. When the general exception handler returns, it does so to the original victim, not the victim in the TLB refill handler. This way, the original victim will cause a TLB refill exception again, but this time with the page table address present as an entry in the TLB [6, p. 143 - 145].


Chapter 4 Implementation of a thin hypervisor

We use the Open Virtual Platforms (OVP) simulation tool to build and run our simulated MIPS platform, consisting of a MIPS 4KEc CPU, a UART and memory. The MIPS 4KEc CPU and the peripherals are already implemented in the OVP simulation environment; the only thing we need to do is to create a platform which connects them together, which is done in less than 100 lines of C code. On this platform we then implement a thin hypervisor capable of running multiple guests, where each guest, in our test cases, is a slightly modified instance of the FreeRTOS kernel, though any MIPS capable operating system could be run.

The FreeRTOS port we used as a base for our guests is the port for the PIC32, a microcontroller family with a MIPS M4K core from Microchip [14]. While this port is made for a core similar to the one we are using, it makes use of external interrupt controllers and some other peripherals that we are not interested in. The port is also made to run in a higher CPU privilege mode with direct access to hardware, which our guests should not have. We have therefore modified FreeRTOS to use the virtual interface provided by the hypervisor and to be able to run in user privilege mode.

4.1 Structure

The hypervisor runs in an address space that is not mapped through the TLB and in CPU kernel privilege mode, which means that it is located in kseg0 (see section 3.1.4). The guests are located in the higher parts of kseg0 but mapped into the address space of kuseg. The guests have a defined low virtual address of 0x0, which is used as an exception entry point. They then use addresses upwards and can use as many pages as are defined in the page table, which is created when the hypervisor boots.


When the system boots up, it uses the entry point in kseg1, as that is the only address space that behaves well before the caches and the TLB have been initialized. Here, boot code that sets up the cache, TLB and CPU registers is executed. Not much initialization of the hypervisor is needed, as the guest states in the hypervisor are well defined from the start. The only thing we need to do before jumping to the first guest is to set up the page tables. When this is done we set bits in the status register enabling CPU user privilege mode, set initialization as the guest exception cause and jump to the first guest exception entry point at virtual address 0x0.
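The final jump into the first guest can be sketched in assembly as follows; the real hypervisor's status register bookkeeping may differ (KSU occupies status bits 4-3, EXL bit 1, EPC is CP0 register 14):

        mfc0  k0, $12        # status register
        ori   k0, k0, 0x12   # KSU = 0b10 (user mode), EXL = 1
        mtc0  k0, $12
        mtc0  zero, $14      # EPC = 0x0, the guest exception entry point
        ehb
        eret                 # clears EXL and enters the guest in user mode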

Figure 4.1: Basic hypervisor structure. (Diagram: guests on top, communicating with the hypervisor through interrupts and hypercalls; the hypervisor in turn receives interrupts from the MIPS core and peripherals simulated with OVP.)

Each guest is then run until a timer interrupt is issued, which is caught by the hypervisor. The guest context is saved in the hypervisor and the next runnable guest context is loaded; the whole TLB is also invalidated, causing all virtual addresses to trigger a refill so that the new guest's virtual addresses can be used.

The guests also need to receive interrupts from the hypervisor. These are set up by the guest issuing a hypercall to the hypervisor, telling it that the guest would like to receive a certain interrupt. The hypervisor will then execute the guest interrupt handler when an interrupt shall be triggered in the guest. Not all interrupts have to be backed by hardware interrupts; some can instead be simulated by the hypervisor. For example, a periodic tick interrupt can not be set up for each guest in a convenient way. Instead the hypervisor keeps track of the time elapsed and, when appropriate, executes the guest interrupt handler.

The hypervisor only uses three interrupts, not counting general interrupts occurring on processor exceptions:

Timer interrupts are issued when the compare register is equal to the count register and are caught by the hypervisor so that it can schedule guests appropriately. They are also used to keep track of when to run the guests' tick interrupt handlers so that the guests can keep track of time.

Software interrupts are used by the guests and issued through a hypercall. They are caught by the hypervisor and directed to the correct guest.

TLB refill interrupts are issued by the hardware when a virtual address needs to be translated to a physical address and the translation does not exist in the TLB. The hypervisor catches this interrupt and translates the address correctly.

The hypervisor has an array of structs that contains the current state of each guest and its page table, as shown in listing 4.1. The struct is used to store important values used when simulating the MIPS architecture, such as the count and compare registers and the interrupt enabled field, so that the hypervisor behaves, from the guest's point of view, as real hardware would.

Listing 4.1: Guest state saved in the hypervisor.

struct guest {
    uint32_t running;
    uint32_t sp;
    uint32_t interrupts_enabled;
    uint32_t count;
    uint32_t compare;
    uint32_t sw0_interrupt_waiting;
    void (*interrupt_handler)(void);
    /* 4 uint32_t per page table entry. */
    uint32_t page_table[PAGE_TABLE_SIZE * 4];
    unsigned char context[HV_GUEST_CONTEXT_SIZE]
        __attribute__((aligned (8)));
};

In MIPS, a software interrupt set while interrupts are disabled will be issued as soon as interrupts are enabled again. The variable sw0_interrupt_waiting is used to keep track of such interrupts; if it is set, a software interrupt is issued as soon as interrupts are enabled. By keeping a single variable containing the index of the currently running guest in this array of guest structs, we can easily switch between guests, as sketched in code after the steps below:

1. Some circumstance, for example a timer interrupt, tells the hypervisor that it is time to switch guest.

2. The hypervisor saves the current running guest state in the guest context array.

3. A scheduling algorithm determines the next guest to run and updates the current running guest index.

4. The hypervisor loads the current running guest context from the context array and returns execution to where the program counter in the context points.

The running variable is used by the scheduling algorithm to track the different states a guest can be in. For example, we do not want to schedule guests that are not running or guests waiting for some external information. We also want to execute the guest boot code if the guest has never been executed.
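The switch itself can be sketched in C as below; the helper names are hypothetical stand-ins for the context save/restore and scheduling code described above:

/* guests[] is the array of struct guest from listing 4.1 and
   current_guest indexes the currently running one. */
extern struct guest guests[];
extern uint32_t current_guest;

void switch_guest(void)
{
    save_context(&guests[current_guest]);   /* step 2: store CPU state       */
    current_guest = schedule_next();        /* step 3: pick the next guest   */
    invalidate_tlb();                       /* drop the old guest's mappings */
    load_context(&guests[current_guest]);   /* step 4: resume at the saved PC */
}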

4.2 Memory management

All guest memory is managed by the hypervisor through its control of TLB refills and page tables. The TLB refill code makes sure that only pages in the current page table are used, while the guest context scheduling in the hypervisor makes sure that the correct page table is used and that the old TLB entries are invalidated.


4.2.1 Memory layout

In section 3.1.4 we described how all addresses are virtual but not all are mapped through the TLB. Figure 4.2 illustrates this together with the address space layout and mappings of the hypervisor and guests. The hypervisor is located in the lower part of kseg0, which is mapped to physical memory by setting the top bit of the virtual address to 0. The hypervisor will therefore be located at the absolute bottom of the physical memory.

Figure 4.2: Memory layout of the hypervisor and guests. [Figure showing the virtual address space next to physical addresses: the hypervisor boot code in kseg1 at 0xbfc0 0000 (physical 0x1fc0 0000, hard-wired mapping), guest 1 through guest x in kseg0 from 0x9000 0000 (physical 0x1000 0000), the hypervisor at 0x8000 0000 (physical 0x0000 0000), and the currently running guest mapped into useg from 0x0000 0000.]

The guests are then placed one by one in the upper part of kseg0, but by mapping correctly through the TLB, the currently running guest will be located at virtual address 0x0 and upwards.

4.2.2 Translation lookaside buffer

Throughout the whole execution, the TLB only contains two things: valid mappings belonging to the current guest and invalid mappings. Invalid mappings are refilled as time passes and the currently executing guest needs more.

TLB entries are invalidated by writing an address in kseg1 to EntryHi and writing it to the TLB, starting from the first entry, then advancing the address by the amount of address space one TLB entry maps and writing that to the next entry, and so on. By doing this, all physical addresses in the TLB are mapped by virtual addresses that are never accessed through the TLB. Because we always place the same address at the same TLB position we will never get duplicate matches, something that can cause a TLB shutdown or worse [6, p. 141]. The reason for not just clearing the valid bit in each entry is that when the new guest accesses the same virtual addresses, a general exception would be issued instead of a TLB refill exception.

The TLB refill handler does not make use of the context register because of the requirements it places on page table placement when the tables are not accessed through kseg2. Accessing the page tables through kseg2 would allow us to place them wherever we want, since the page tables would then be translated through the TLB, but it would also mean that we would need to handle TLB refills within a TLB refill. Listing 4.2 shows the TLB refill handler used in our hypervisor with explanations.

Listing 4.2: The TLB refill handler in our hypervisor


/* Get the address of the current page table. */
la    k0, current_page_table
lw    k0, 0(k0)

/* Get the virtual address that shall be translated. */
mfc0  k1, _CP0_ENTRYHI

/* Divide by 4096 (page size) and by 2 (two physical addresses per
   page table entry). This equals the entry index. */
srl   k1, k1, 13

/* If the page table index is larger than the page table size,
   jump to an error handler. */
addiu k1, k1, -PAGE_TABLE_SIZE
bgtz  k1, tlb_error
addiu k1, k1, PAGE_TABLE_SIZE

/* Multiply the page table index with the size of an entry
   (2^4 = 16 bytes) and add the base address to get the address
   of our entry. */
sll   k1, k1, 4
add   k1, k0, k1

/* Load the entry and write it to the TLB control registers. */
lw    k0, 0(k1)
lw    k1, 4(k1)
mtc0  k0, _CP0_ENTRYLO0
mtc0  k1, _CP0_ENTRYLO1
ehb

/* Write to a random position in the TLB and return. */
tlbwr
eret
nop
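The TLB invalidation described in section 4.2.2 can be sketched in C in the same spirit. This is a sketch under assumptions: tlb_write_indexed() is a hypothetical wrapper around the mtc0/tlbwi instruction sequence, the TLB is assumed to have 16 entries, and the address is advanced by one page pair per entry since each MIPS TLB entry maps two pages.

#include <stdint.h>

#define TLB_ENTRIES 16        /* assumed TLB size */
#define PAGE_SIZE   4096u

/* Hypothetical wrapper: writes EntryHi/EntryLo0/EntryLo1 and
   executes tlbwi for the given index. */
extern void tlb_write_indexed(int index, uint32_t entry_hi,
                              uint32_t entry_lo0, uint32_t entry_lo1);

static void invalidate_tlb(void)
{
    uint32_t vaddr = 0xa0000000u;  /* bottom of kseg1, never via TLB */

    for (int i = 0; i < TLB_ENTRIES; i++) {
        /* Each entry gets a unique kseg1 address: it can never match
           a translated access, and no two entries are duplicates. */
        tlb_write_indexed(i, vaddr, 0, 0);   /* valid bits cleared */
        vaddr += 2 * PAGE_SIZE;              /* one page pair per entry */
    }
}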

4.2.3 Page tables

The page tables are preconstructed when the hypervisor boots and contain values in the format that the EntryLo0-1 registers expect. Each entry has two 4 byte values containing the physical page and the settings for this page; two, because the TLB expects two physical mappings per virtual mapping. As seen in listing 4.1, the page tables are located within the array of guest structs, which means that the current page table can be calculated from the current guest variable when handling a TLB refill. But to save a few instructions in the TLB refill handler we maintain a pointer to the current page table.

Walking the page table can be an expensive process as, usually, several memory loads are involved. There are many ways to walk the page table looking for a page to load into the TLB, and choosing one involves weighing the extra execution time it takes to walk the page table against lowering the number of TLB refills by choosing the pages used now and in the near future. We have chosen a technique where the page to be loaded is calculated from the address in a simple way:

\[
PhysicalAddress = PhysicalAddress_{Pagetable} + \frac{VirtualAddress}{PageSize \cdot 2} \cdot EntrySize
\]

This calculation can be seen in listing 4.2: it starts with a right shift, srl, and continues, after an error check, with a left shift, sll, before adding the offset of the entry in bytes to the base page table address. The calculation could be left to the hardware by using the coprocessor 0 context register, but due to the constraints this register puts on the physical base page table address, we would either have to use large amounts of address space when having multiple page tables, or put them in kseg2 and have them mapped by the TLB. The context register could, however, be used as an offset counter whose value is added to the page table base address, if the physical base address within the context register is set to zero.
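Expressed in C, the calculation could look like the sketch below; page_table_entry() is an illustrative name, and the constants mirror listing 4.2 (4 KB pages, 16-byte entries holding an EntryLo0/EntryLo1 pair).

#include <stdint.h>

/* Return a pointer to the page table entry covering vaddr. Each
   entry is 16 bytes (four uint32_t) and covers two 4 KB pages, so
   the entry index is vaddr / (4096 * 2), i.e. vaddr >> 13. */
static inline uint32_t *page_table_entry(uint32_t *page_table,
                                         uint32_t vaddr)
{
    uint32_t index = vaddr >> 13;   /* divide by PageSize * 2 */
    return page_table + index * 4;  /* 4 uint32_t per 16-byte entry */
}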

4.3 Isolation

When switching guests, all registers are saved and the next guest's registers are loaded, or default values if the new guest has never been executed before. This ensures isolation at the CPU register level. When accessing memory, the guest can only access addresses in the kuseg address space, addresses between 0x0 and 0x7fff ffff. Trying to access an address outside of this address space will cause an exception that the hypervisor catches, terminating the guest. The TLB refill handler uses a pointer to the current guest's page table and uses the virtual address to calculate which page table entry to load into the TLB. Here, we have to make sure that the entries loaded into the TLB are in the page table and correct, or else the guest could cause a TLB shutdown and halt the entire system.
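Conceptually, the hypervisor's reaction to an out-of-range access could be sketched as follows. The hardware itself raises the exception when user mode touches anything above kuseg; handle_address_error() and terminate_guest() are illustrative names, not the thesis' exact code.

#include <stdint.h>

extern void terminate_guest(struct guest *g);   /* hypothetical helper */

/* Classify a faulting address and terminate the guest on an
   out-of-range access; the guest state is kept for examination. */
void handle_address_error(struct guest *g, uint32_t bad_vaddr)
{
    if (bad_vaddr > 0x7fffffffu)   /* outside kuseg (0x0 - 0x7fff ffff) */
        terminate_guest(g);
}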

4.4 Hypercalls

The hypervisor provides each guest with nine hypercalls, listed in table 4.1. Each hypercall is a macro containing assembler instructions that set the hypercall id in one register and the value, if any, in another, and then cause an exception with the syscall instruction. Listing 4.3 shows a hypercall in assembly. As we can see, it uses the defined value 2 to represent a disable interrupt hypercall. After the hypercall returns to the instruction following the syscall instruction, we take care of the returned value, if any.

Listing 4.3: A disable interrupt hypercall as done by the guest.

li   v1, 2
syscall
move t0, v0
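As a sketch, the guest-side macro behind listing 4.3 could be written with GCC-style inline assembly as below. The register choices ($3 is v1 for the id, $2 is v0 for the return value) follow the listing; the function name and the use of inline assembly rather than a plain macro are our assumptions.

#include <stdint.h>

#define HV_CALL_DISABLE_INTERRUPTS 2   /* id used in listing 4.3 */

static inline uint32_t hv_disable_interrupts(void)
{
    register uint32_t id  __asm__("$3") = HV_CALL_DISABLE_INTERRUPTS; /* v1 */
    register uint32_t ret __asm__("$2");                              /* v0 */

    /* syscall traps into kernel mode, where the hypervisor dispatches
       on the id in v1 and places a return value in v0. */
    __asm__ volatile("syscall" : "=r"(ret) : "r"(id) : "memory");
    return ret;
}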


The syscall instruction will make the CPU enter kernel privilege mode and execute an interrupt handler that detects the hypercall and handles it by saving the current guest context, jumping to a hypercall function in the hypervisor, and then loading the current guest context and returning. Unlike branches, the syscall instruction and the instruction after it, in this case move, will not be run together, as there is no branch delay slot in this case. The syscall instruction triggers an exception, and in MIPS all exceptions are precise; executing the instruction after the exception victim would go against the definition of precise exceptions.

Table 4.1: Hypercalls provided by the hypervisor.

Name                      Description
HV_ENABLE_INTERRUPTS()    Enable interrupts for the current guest.
HV_DISABLE_INTERRUPTS()   Disable interrupts for the current guest.
HV_TRIGGER_SW()           Trigger a software 0 interrupt, either directly
                          or as soon as interrupts are enabled.
HV_GET_ID()               Get the current guest ID.
HV_YIELD()                Tell the hypervisor to schedule another guest.
HV_SET_COMPARE(value)     Update the guest compare value to trigger a
                          timer interrupt when count is equal to this value.
HV_WRITE_UART(value)      Write the string value points at on the UART.
HV_ERET(value)            Enable interrupts and jump to value. Waiting
                          interrupts will not be issued until execution
                          has started at value.
HV_SET_SCHEDULE(value)    Tell the hypervisor to schedule the guest with
                          id value.


Chapter 5

Evaluation

The simulator used, OVP, is instruction accurate but not 100% cycle accurate. Pipeline stalls, branch prediction, cache refills and misses, memory contention and blocking, resource contention and blocking, and other aspects that would affect a real system cannot be measured on our simulated model. We are instead interested in evaluating the latency that the hypervisor adds to system calls, interrupts and TLB refills. We are also interested in how the number of TLB refills is affected when running applications on our hypervisor. Two cases are of interest: a worst case test where memory is accessed randomly, and a best case test where memory is accessed sequentially and is thus not very refill heavy. As one of our goals is to implement a thin hypervisor, we also need to evaluate the size of the hypervisor, both in lines of code and in memory footprint. These evaluations are done in this chapter. A comparison between our hypervisor and other MIPS hypervisors is discussed in section 6.1.

5.1 Methods

All evaluations are done with FreeRTOS as the guest or native operating system. Section 5.2 contains ten tests, five with one application and five with another. For each application, the five tests differ in the number of guests running at the same time. One application (TLB refills sequential in figure 5.1) writes to a 100 KB segment of memory sequentially, byte by byte, starting from position zero, as seen in test case 1. The other (TLB refills random in figure 5.1) does the same thing but randomizes the write position for each byte, as seen in test case 2. Each guest has its own 100 KB segment of memory to process. This gives two use cases, one where the TLB and page table are stress tested by forcing more refills and one with fewer TLB refills.


These test cases produce a graph containing two plots, one showing TLB refills in the worst case scenario and one showing TLB refills in the best case scenario (not counting cases where the whole application fits in the TLB). A real-world application would then fall in the area between those two plots.

Test case 1 is run on each guest when accessing memory sequentially.

    for position = 1 → X do
        data[position] ← Y
    end for

Test case 2 is run on each guest when accessing memory randomly.

    for position = 1 → X do
        i ← Random_number(0, X)
        data[i] ← Y
    end for
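In C, the two test cases could be written roughly as below; rand() stands in for whatever pseudo-random generator the guests used, and the written value is arbitrary.

#include <stdlib.h>

#define SEGMENT_SIZE (100 * 1024)   /* each guest's 100 KB segment */

static unsigned char data[SEGMENT_SIZE];

/* Test case 1: sequential writes, the TLB-friendly best case. */
void test_sequential(void)
{
    for (unsigned int pos = 0; pos < SEGMENT_SIZE; pos++)
        data[pos] = 0xaa;
}

/* Test case 2: random writes, stressing the TLB and page table. */
void test_random(void)
{
    for (unsigned int n = 0; n < SEGMENT_SIZE; n++)
        data[rand() % SEGMENT_SIZE] = 0xaa;
}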

The latency of a TLB refill is, in our case, quite constant as we have a very deterministic refill handler and nothing but the current guest can cause TLB refills. The memory area where the hypervisor is located is hardware mapped as it is located in kseg0. This means that we will never get TLB refill exceptions on any of the memory accesses that are within the hypervisor.

5.2 TLB refills

Studies have shown that 5-10% of the runtime of a normal system can consist of TLB handling [8] and that a kernel can spend 80% of its operating time on TLB miss handling [9]. Being able to quickly handle TLB misses is key to having an efficient system.

A minimal TLB refill handler can be seen in listing 3.1 and consists of only eight instructions with no error checks, such as making sure the address in the context register is valid. The TLB refill handler used in our hypervisor (see listing 4.2) consists of 17 instructions, which includes checking for memory accesses not allowed with the current page table. We want to do this in as few instructions as possible to keep the refill handler fast.

In figure 5.1 we show TLB refills and context switches. Each guest executes one test application, either test case 1 or test case 2. In each experiment all guests run the same test case. In the figure, we can see test case 1 as the TLB refills (Sequential) curve and test case 2 as the TLB refills (Random) curve.

Figure 5.1: Memory accesses with both sequential and pseudorandom accesses.

When performing sequential memory accesses with only one guest we quickly reach 8 TLB refills. These refills include the guest boot procedure and the first part of the calculation. The TLB refills then slowly increase to 22 as the block of memory the guest is using does not fit in one page. When the number of guests is increased to two we see an increase in the number of TLB refills. We can also see, in figure 5.1, that the number of TLB refills does not increase exponentially when the number of guests doubles, which indicates that the system behaves in a stable manner and that the increase in overhead when adding guests is not unmanageable.

The random memory access case shows a similar behavior to the sequential memory access case. We still have the same hypervisor overhead but with an increased use of memory pages between each guest context switch.

The test case where only a single guest is running can be seen as equal to running a guest without a hypervisor, from the perspective of the number of TLB refills, as the hypervisor runs in kseg0 which is not mapped through the TLB and thus causes no TLB refills. Also, no entries are invalidated in the TLB by the hypervisor when only running one guest, because this would be unnecessary.

5.3 Hypercalls

It is important that hypercalls are fast while still executing in a secure manner, as they are run every time a guest wants to manage or access hardware in some way. Table 5.1 shows the number of instructions each hypercall spends, measured from when the hypercall is issued to when execution resumes in a guest. Not very surprisingly, we can see that HV_SET_SCHEDULE() and HV_YIELD() are the two most expensive hypercalls. These are the only hypercalls that switch guest context and invalidate the TLB, something that is expensive. Fortunately, these two are mostly used when communicating between guests. The empty hypercall in table 5.1 is a hypercall consisting only of code common to all hypercalls, without doing anything besides this. The common code consists only of saving and restoring the guest context.


Table 5.1: Amount of instructions spent on each hypercall.

Name                       Instructions
Empty hypercall            133
HV_ENABLE_INTERRUPTS()     145
HV_DISABLE_INTERRUPTS()    145
HV_TRIGGER_SW()            288
HV_GET_ID()                137
HV_YIELD()                 339
HV_SET_COMPARE(value)      149
HV_WRITE_UART(value)       199
HV_ERET(value)             149
HV_SET_SCHEDULE(value)     349

5.4 Interrupts

The number of instructions spent on each interrupt with a hypervisor, compared to the number without one, can be found in table 5.2. We are here interested in how many extra instructions, or how much latency, the hypervisor adds to each interrupt. The instruction counts shown in the virtualized system column are measured from the point where the interrupt is issued to when execution continues in the guest.

Table 5.2: Amount of instructions spent on each interrupt.

Name                 Virtualized system   Native system
Timer interrupt      791                  359
Software interrupt   1008                 718
TLB refill           17                   8

The instructions added by our hypervisor to each interrupt mostly serve to keep isolation between guests. Timer interrupts have the highest increase in overhead, which can be explained by the guest scheduling algorithm being executed there to see if a new guest should be scheduled. The increase in the TLB refill exception consists of the instructions that ensure the guest can only access memory that is valid for it.

5.5 System calls

The system calls listed in table 5.3 are commonly used system calls in FreeRTOS. The absolute number of instructions might not be relevant, as those are specific to this particular guest; the increase is more interesting, as it shows the latency our hypervisor adds to system calls.


Table 5.3: Amount of instructions spent on each system call.

Name                 Virtualized system   Native system
Enable interrupts    145                  1
Disable interrupts   145                  1
Yield                351                  9
Enter critical       163                  16
Exit critical        172                  24

Enable interrupts and disable interrupts are two worst case scenarios, as these are done with one instruction each, ei and di, when running the system natively. The hypervisor has to preserve the guest state and modify the guest struct shown in listing 4.1. Then it has to restore the guest state and return execution to the guest. The yield system call is used to switch between tasks and uses a software interrupt to do this. A native system and a system with a hypervisor do this in a similar way, as the instructions that set the correct software interrupt bit in the cause register just have to be replaced with a hypercall.
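For example, a FreeRTOS port's yield could be para-virtualized roughly as below; portYIELD is FreeRTOS' standard macro name, while the native routine name and the build flag are illustrative, and whether the thesis port used exactly this define is an assumption.

/* Sketch: route the port's yield through the hypervisor. Under the
   hypervisor the software interrupt is requested with a hypercall
   instead of by writing the SW0 bit of the CP0 cause register. */
#ifdef RUNNING_ON_HYPERVISOR            /* illustrative build flag */
#define portYIELD()  HV_TRIGGER_SW()    /* hypercall from table 4.1 */
#else
#define portYIELD()  vPortYieldNative() /* hypothetical native routine
                                           that sets the cause SW0 bit */
#endif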

5.6 Multiple applications or multiple guests

Table 5.4 shows results from tests where we run multiple applications, either all in one guest running FreeRTOS or each application in its own guest. Each application executes the Jenkins hash algorithm [13] over a small memory area. An application is considered finished when it has hashed the entire area. The Instructions column in table 5.4 shows the number of instructions spent until all running applications are finished.

Table 5.4: Running multiple applications in one guest compared to running multiple guests with one application.

Guests   Applications / Guest   TLB refills   Guest switches   Instructions
1        1                      8             0                296 459
1        2                      8             0                665 336
2        1                      116           43               1 307 584
1        4                      8             0                1 214 415
4        1                      277           85               2 570 182

Two applications in one guest do not increase the number of TLB refills, because both FreeRTOS and the applications can run without filling the whole TLB. When comparing two guests with four guests, we can see that the number of context switches more than doubles. The same happens with the number of TLB refills. One reason for this is that the old guest's TLB contents are invalidated when

switching guests, which means that the TLB has to be refilled again as the new guest needs to access memory. While the boot code and FreeRTOS start-up code are included in the instruction counts, this does not explain the more than doubled number of instructions when running twice as many guests. Here, we instead see the overhead of switching guests. This indicates room for improvement, for example not invalidating the whole TLB but instead invalidating each entry as the new guest needs it. This removes the need for an invalidation routine and at the same time might result in entries still being present the next time a guest is scheduled.

5.7 Isolation

Each guest can execute instructions that are available in user privilege mode. These are instructions that either use CPU registers or memory.

Figure 5.2: Illustration of how isolation is maintained. [Figure showing guest 1 and guest 2 reaching memory only through page tables controlled by the hypervisor.]

As shown in figure 4.2, all guests are located in physical memory from 0x1000 0000 and upwards, which kseg0 maps. This ensures that an executing guest cannot access other guests in any other way than by trying to get the TLB refill handler to map some of its virtual addresses to another guest. The guest state is saved in the hypervisor without caring about its contents; by doing this we make the process independent of the guest. If a guest changes the stack pointer, global pointer or program counter to some devious value, hoping the hypervisor will use these values, it is out of luck. All invalid exceptions or invalid memory accesses will make the hypervisor terminate the guest. The guest state will then be preserved for examination.


5.8 Hypervisor size

The hypervisor should be a thin hypervisor: a hypervisor that can run on embedded systems with small amounts of memory. The memory footprint as well as the amount of code has to be minimal, to leave as much space as possible to the guests. The number of lines of code for the hypervisor, everything included, is presented in table 5.5. This is a rather subjective measurement but gives a feeling for the actual size of the hypervisor. hypervisor.c is the main file, containing functions that handle hypercalls and interrupts as well as invalidating the TLB and creating the page tables.

The only memory needed is for the array of structs that holds the guest contexts, cache variables for the current guest context, and some memory for the hypervisor stack. The size of the guest context struct depends mostly on the size of the page table, as it is located within this struct. The rest of the struct is less than 200 bytes.

Table 5.5: Size of the hypervisor files in lines of code, including comments and empty lines.

File name          Lines of code   Description
hypervisor.c       346             Main hypervisor file.
scheduler.c        25              Guest scheduling algorithm.
hv_exceptions.S    75              Exception entry points. Assembly.
hypervisor_asm.S   260             Assembly routines used by the main
                                   hypervisor file.
print.c            135             Functions used to provide printf
                                   functionality.
boot.S             361             Assembly boot code.

Table 5.6: The amount of memory used by the hypervisor, where n is the number of guests and r is the number of page table entries.

Section                     Bytes used
text (interrupt handlers)   464
text (boot code)            2080
text (hypervisor)           4496
data (hypervisor)           n · 172 + n · r · 16

The amount of memory used by the hypervisor depends, as mentioned, on the number of guests; when running the tests shown in section 5.2 we used the amounts shown in table 5.6. The data section includes the page tables for all 16 guests as well as room for their contexts. Of the 39064 bytes used in the data section, 36096 bytes are page tables. Table 5.6 can be compared to table 5.7, where the memory usage of a single FreeRTOS guest without a hypervisor is presented. There we can see that even

though FreeRTOS is a small guest, it is a lot larger than the hypervisor. FreeRTOS is, in turn, at least in the order of hundreds of times smaller than a full-scale kernel. It is clearly the page tables that dominate the hypervisor size. No page tables are present in the guest shown in table 5.7, as it runs in kseg0.

Table 5.7: The amount of memory used by a guest without the hypervisor.

Section                     Bytes used
text (interrupt handlers)   448
text (boot code)            2176
text (FreeRTOS)             19848
data (FreeRTOS)             96

Chapter 6

Conclusions

This thesis has shown that a thin hypervisor providing isolation for multiple guests is a viable option for embedded systems, even if no explicit hardware support for hypervisors exists. Our implementation of a thin hypervisor demonstrates this by providing isolation for multiple guests running on top of it. Para-virtualization of FreeRTOS consisted only of changing a couple of C lines in three files and replacing some sensitive assembly instructions with hypercalls. With these minor changes in the guest code, our hypervisor can provide security features that the guests themselves would not be able to provide without major changes.

By keeping the hypervisor thin, maintenance of the code is simplified. This can minimize the number of bugs and lower the time it takes to fix them. When constructing a system where performance is critical, virtualization might introduce too much overhead. This overhead has to be weighed against the increased security virtualization can offer. Our hypervisor added an overhead that is significant for performance critical systems. For these systems, a more optimized hypervisor or some other approach to securing the system might be a better solution.

6.1 Future work

Even though the hypervisor created in this thesis can provide isolation for multiple guests running simultaneously, there is much that can be improved and added.

In terms of security, services such as kernel integrity protection, where the hypervisor makes sure that no application can modify or in other ways tamper with the kernel, would be a good addition. Isolation services, where the hypervisor enforces some kind of memory access policy within the guest, are only possible if the guest is internally split up in multiples of the page size, in our case 4 KB. We have to split the guest in this way because we can only enforce a certain access policy on a page, not on individual bytes within a page.


Support for secure remote procedure calls (RPC) would be a good addition, as communication between guests is an essential feature. With only two guests, RPC can be trivial to implement, but with more than two guests challenges appear, such as deciding who is allowed to communicate with whom.

The overall efficiency of the hypervisor can be improved. Some low hanging fruit would be to remove the need for software interrupts inside the hypervisor and to only save parts of the guest context when entering the hypervisor, instead of the whole guest context as is done today. A more advanced guest context switch that does not invalidate the old TLB contents but instead replaces them with the most recently used pages of the new guest could reduce the number of TLB refills significantly. One way to achieve this is through ASIDs, where each guest has its own ASID. Then there would be no need for invalidating the TLB, as the currently running guest would only be able to use entries tagged with its own ASID. Several of the hypercalls can be optimized into so called fast path hypercalls, where minimal effort is spent on the hypercall, for example when saving and restoring the guest context. More advanced scheduling could improve the overall performance feel of the system, especially if the hypervisor has RPC support.

There are existing projects that provide virtualization for MIPS platforms, even though they might be very different, for example L4/MIPS or KVM solutions. A comparison between these would be interesting. Because L4/MIPS is a microkernel and KVM a Linux kernel module, they have both advantages and disadvantages. It would also be good to see how the hypervisor manages to run a large real world application, to get another perspective on the hypervisor's performance than when running tests on the small FreeRTOS.

Bibliography

[1] Popek, Gerald J. and Goldberg, Robert P.; Formal requirements for virtualizable third generation architectures, Commun. ACM, vol. 17, num. 7, July 1974.

[2] Heradon Douglas; Thin Hypervisor-Based Security Architectures for Embedded Platforms, February 2010.

[3] Viktor Do; Security Services on an Optimized Thin Hypervisor for Embedded Systems, July 2011.

[4] Gernot Heiser; Virtualizing Embedded Systems – Why Bother?, DAC ’11, 2011.

[5] Morgan Kaufmann/Elsevier; Virtual Machines: Versatile Platforms for Systems and Processes, ISBN: 978-1558609105.

[6] Dominic Sweetman; See MIPS Run Linux, Second edition, ISBN: 978-0120884216.

[7] David A. Patterson, John L. Hennessy; Computer Architecture, A Quantitative Approach, Fifth edition, ISBN: 978-0123838728.

[8] Jacob, Bruce L. and Mudge, Trevor N.; A look at several memory management units, TLB-refill mechanisms, and page table organizations, SIGOPS Oper. Syst. Rev., vol. 32, num. 5, December 1998.

[9] Rosenblum, M. and Bugnion, E. and Herrod, S. A. and Witchel, E. and Gupta, A.; The impact of architectural trends on operating system performance, SIGOPS Oper. Syst. Rev., vol. 29, num. 5, December 1995.

[10] MIPS Technologies; MIPS Architecture For Programmers Volume I-A: Introduction to the MIPS32 Architecture, Revision 3.50.

[11] MIPS Technologies; MIPS Architecture For Programmers Volume III: The MIPS32 and microMIPS32 Privileged Resource Architecture, Revision 3.12


[12] MIPS Technologies; MIPS64 Architecture For Programmers Volume III: The MIPS64 Privileged Resource Architecture, Revision 5.01

[13] Bob Jenkins; A hash function for hash table lookup, Dr. Dobb's Journal, 1997.

[14] Microchip PIC32 FreeRTOS port; http://www.freertos.org/port_PIC32_MIPS_MK4.html.

[15] AMD virtualization technology; http://sites.amd.com/us/business/it-solutions/virtualization/pages/virtualization.aspx.

[16] Intel virtualization technology; http://www.intel.com/content/www/us/en/virtualization/virtualization-technology/hardware-assist-virtualization-technology.html.

[17] MIPS virtualization technology; http://www.imgtec.com/mips/mips-virtualization.asp.

[18] FreeRTOS; http://www.freertos.org.
