Porting to the Tilera TILEPro64 Architecture

ROBERT RADKIEWICZ and XIAOWEN WANG

KTH Information and Communication Technology

Master of Science Thesis Stockholm, Sweden 2013

TRITA-ICT-EX-2013:69

KTH Royal Institute of Technology

Master Project at KTH: Porting Barrelfish to the Tilera TILEPro64 Architecture

Authors: Robert Radkiewicz and Xiaowen Wang

Examiner: Prof. Mats Brorsson, KTH, Sweden

Abstract

Barrelfish is a research operating system focused on the scalability of manycore architectures and the increasing amount of heterogeneous hardware. Instead of relying heavily on the cache coherency protocol, which has proved to be an inherent bottleneck on manycore systems, Barrelfish borrows ideas from distributed systems and uses a message-passing approach for inter-core communication. The TilePro architecture is a manycore system with up to 64 cores and several mesh networks. Because of this special hardware design, TilePro is considered an ideal vehicle for Barrelfish, making full use of the advantages of its manycore and network structure. Porting Barrelfish to the TilePro architecture involves general set-up of image booting, the virtual memory system, context switches, interrupts and system calls, inter-dispatcher communication and so on. At the beginning, the whole system starts up fully on the first logical core; later the monitor on the first core is responsible for booting the others in order, according to the pre-configuration of memory space on the initial core. Barrelfish originally provides two kinds of communication under the user remote procedure call (URPC) protocol. Local message passing (LMP), used when two dispatchers communicate on the same core, is implemented by invoking system calls and passing all values through reserved registers. User-level message passing (UMP), designed for inter-core communication, depends on a shared-memory approach. Inter-core communication begins as the second core starts.
The project also investigates how to utilize the TilePro mesh network structure for inter-core communication, so that this characteristic of the architecture is exploited thoroughly. TilePro offers several mesh networks with different properties and behaviours. In this work we mainly use the user dynamic network (UDN) instead of UMP to carry out remote core-to-core communication, while still building on the existing URPC protocol. The result shows that Barrelfish boots up completely on at least two cores and that user applications can be executed properly on either the first or the second core, with core-to-core communication working over the TilePro UDN network.

Contents

1 Introduction
  1.1 Current Implementations
    1.1.1 Factored Operating System
    1.1.2 Tessellation
    1.1.3 Barrelfish
  1.2 Core-to-Core Communication
  1.3 TilePro64
  1.4 Contributions
  1.5 Structure of Thesis

2 Implementation
  2.1 Requirements
  2.2 Booting an OS on Tilera
    2.2.1 Shipping the Kernel
    2.2.2 Hypervisor
    2.2.3 Overview of Booting Process
  2.3 newlib
  2.4 Virtual Memory
  2.5 ASIDs
  2.6 Interrupt Handling
  2.7 System Calls
  2.8 Processes and Threads
  2.9 I/O
  2.10 Local Communication
  2.11 Core-to-Core Communication
    2.11.1 Existing Barrelfish Ports
    2.11.2 Implementation of Message Passing on TilePro
      2.11.2.1 Static Network
      2.11.2.2 User Dynamic Network
      2.11.2.3 Other Dynamic Networks
    2.11.3 Implementation of a New Backend

3 Reflection on the Porting Process

4 Results
  4.1 Porting Results
  4.2 Modifications on Barrelfish

5 Conclusion

List of Figures

2.1 Procedure to create a bootrom file
2.2 Virtual memory layout
2.3 Barrelfish internal virtual memory layout
2.4 Physical memory layout
2.5 Dispatcher structure in Barrelfish
2.6 UDN backend

4.1 Bootstrap on TilePro

List of Tables

4.1 Modification to Barrelfish

Chapter 1

Introduction

Computer hardware has changed rapidly over the past two decades. From single-core to dual-core, and then to multi-core architectures, researchers have constantly sought the best way to boost the performance of computers. Owing to some seemingly inevitable defects of the single-core processor, e.g. the lack of parallelism and increasing thermal issues, the single-core architecture seems to have hit a bottleneck and cannot be developed significantly any more [1]. Meanwhile, the multi-core architecture has emerged and is widely used in a variety of workplaces owing to its inherent advantages. Different programs gain differently in speed from the use of multiple processors, depending on how separable their most CPU-intensive tasks are. A naturally separated example is a server that answers several network connections and does some calculation per connection. Tasks that do not separate so naturally can be split into several sub-tasks, which are then distributed. Another example is image processing, where an image can be split into several sub-images, each of which can be processed by one task. In this example all tasks depend on the whole image, from which they select their data.

Given the fast advancement of hardware and the needs of real working environments, future OSes should be able to make full use of these new features. Traditional general-purpose OSes, e.g. the variants of Unix and Windows, use a shared-memory kernel with data structures protected by locks [2], a design that derives from the basic OS model for single-core architectures. They are capable of using multiple cores and distributing the workload between them, but this approach does not scale well as the number of cores rises significantly [3, 4]. Those OSes have been extended to fully support SMP and ccNUMA architectures in order to obtain relatively high computing performance. However, future computer architectures tend to increase both the number of cores and the diversity of hardware [2]. As a result, cache coherence protocols will become prohibitively costly when a large number of cores (a manycore system) is involved.

1.1 Current Implementations

How to exploit such so-called manycore systems efficiently has become a popular topic. Some researchers have proposed new approaches, re-conceiving the OS architecture to avoid the scalability bottlenecks of traditional OSes.

1.1.1 Factored Operating System

Factored Operating System (fos) [5] is a new operating system targeting manycore systems with scalability as the primary design constraint, where space sharing replaces time sharing to increase scalability. The main feature of fos is that it factors an OS into a set of services, where each service is built to resemble a distributed Internet server. The OS kernel services and the user application services are located on different sets of servers, so that they do not interfere with each other, thereby increasing the degree of distribution and parallelism of the system. Each server runs on a given core, and servers communicate with each other through message passing.

1.1.2 Tessellation

Tessellation OS [6] restructures the operating system to support a simultaneous mix of interactive, real-time, and high-throughput parallel applications. It utilizes two novel ideas, Space-Time Partitioning and Two-Level Scheduling, to reach the goals of resource distribution, performance isolation and QoS guarantees. Applications are divided into performance-isolated, gang-scheduled cells communicating through secure channels.

1.1.3 Barrelfish

Barrelfish [2] demonstrates that message passing outperforms the traditional shared-memory approach, which is the main bottleneck when scaling an OS to manycore architectures. It proposes a "share-no-memory" concept for the OS, exploiting explicit and asynchronous message passing for communication between cores in order to reduce the side effects of cache coherence protocols. Message passing is also used to replicate shared OS state to each core, which reduces the load on the system interconnect, memory contention and synchronization overhead. Another remarkable characteristic of Barrelfish is its compatibility with heterogeneous architectures. Driven by demands from real workplaces, there is a trend to combine several different types of architectures on one board, cooperating to deal with some

tasks together. Barrelfish was designed from the start to meet this demand, making it possible to boot different kernels for different hardware.

1.2 Core-to-Core Communication

In a shared-memory scenario, data may be accessed from different processors and is normally placed somewhere in RAM. Accesses to this data must therefore be controlled, to avoid situations where the asynchronous nature of this scenario corrupts the data or the calculations on it. Problems arise when one processor overwrites data from another processor, or when a processor works with outdated data. To maintain control, there are tools like locking and cache coherence protocols. Locking restricts accesses to a piece of memory, so that only the processor holding the lock can access it, while all other accesses are blocked until the lock is free again. Cache coherence keeps the processor-specific caches in sync: processors have local caches, which allow them to access cached data orders of magnitude faster, and for data shared among different processors these caches are updated when one of them changes. Both of these tools may cause severe overhead as the number of processors accessing the data increases [3, 5, 2].

In the research on the operating system Corey [7], the authors tried to make sharing more explicit, so that an application can choose whether a data structure is shared with all threads, instead of everything being shared automatically. In one experiment, they opened file handles in a large number of threads and closed them directly afterwards, without touching them. Since file handles are automatically shared with all threads, the time to open and close one handle on one core increased with the number of cores doing the same. One conclusion drawn here is that applications should decide whether data is shared, instead of it being shared automatically.

Inspired by distributed networking systems, communication between cores through message passing has come to be seen as a more efficient way to adapt to future manycore architectures than relying on cache coherence protocols.
That is because today's computer architectures may already have evolved into distributed systems [8]. With message passing, only the communicating cores establish an explicit message channel; unlike with cache coherence protocols, the remaining cores are not involved in implicitly updating the data. So even when many cores are working on the same problem, only the cores that are actively messaged are concerned with the new value. This makes the communication more explicit and puts it into the foreground. Moreover, message passing can in principle reduce communication latency if the messages are encoded compactly enough. Baumann's experiments [2] have also demonstrated the scalability issues of cache-coherent shared memory, showing that message passing is a promising alternative for manycore systems. Note, however, that message passing does not rule out cache coherence protocols; applications should have the choice of which construct to use.


1.3 TilePro64

The TilePro64 processor implements a manycore architecture with 64 cores, incorporating a matrix of processing elements interconnected through a scalable point-to-point mesh network. The processor cores have been designed to reach an optimal balance between the size of a single core and the number of cores; in general the design aims at high throughput, programmability and power efficiency. Since Barrelfish was designed for manycore architectures from the start, we hope TilePro can be a good vehicle to test its scalability, so that scalability effects become more visible than on previous benchmarks. The TilePro has multiple mesh networks, designed to provide either very low-latency communication or high-throughput communication, depending on the kind of application. The networks are designed to scale well even when many cores communicate with each other. This may support the Barrelfish approach of message passing and seems worth investigating.

1.4 Contributions

The project mainly investigates how to port Barrelfish onto the TilePro architecture and how to use the TilePro network structure to implement core-to-core communication, instead of the shared-memory mechanism provided originally in Barrelfish. The porting process involves general configuration of image booting, the virtual memory system, context switches, interrupts and system calls, inter-dispatcher communication and so on. So far Barrelfish boots up completely on at least two cores, with inter-core communication between them running over the TilePro user dynamic network. Moreover, some user applications execute correctly on these running cores.

1.5 Structure of Thesis

In the rest of this thesis, Chapter 2 describes in detail all the crucial parts we implemented or studied in this project; Chapter 3 reflects on our porting process; the results and conclusions are given in Chapter 4 and Chapter 5 respectively.

Chapter 2

Implementation

2.1 Requirements

The goal of the project is to port Barrelfish onto TilePro with the least effort needed until it is possible to use a message-passing mechanism between cores through a Barrelfish communication interface. The project also investigates how to use the TilePro-specific user dynamic network to implement this core-to-core communication, and runs some user applications on the main core. Other parts of the operating system not needed to fulfil this were ignored due to time constraints.1

2.2 Booting an OS on Tilera

In this section we describe in detail how to generically boot up an operating system on the TilePro architecture, since understanding and satisfying the hardware requirement is the first priority to begin our work.

2.2.1 Shipping the Kernel

Barrelfish intends to ship the kernel to the processor with the help of the Multiboot Specification [9], which is a part of GRUB. There is no Multiboot implementation for Tilera, so we use the Tilera toolset to boot the kernel instead. In this toolset the first booters (the level-0.5 and level-1 boot loaders), the hypervisor, the hypervisor configuration file and the kernel are bundled into a bootrom file, which can be sent to the hardware.

1As we found out, most parts need to be ported.

[Figure: boot.bin (the L0.5 and L1 booters), the hypervisor, the hypervisor configuration file and the client supervisor are bundled into a bootrom file.]

Figure 2.1: Procedure to create a bootrom file

2.2.2 Hypervisor

The operating system can work either directly on the hardware or with a hypervisor in between. The mode in which it runs directly on the hardware is called the "Bare Metal Environment". The hypervisor is a thin layer intended to help with the implementation of an OS on top of TilePro and to make porting to future Tilera architectures easier. It is designed to be very lightweight and only runs when explicitly called by the OS or when an interrupt occurs. This design makes the hypervisor an efficient tool that introduces no critical barriers during the porting process, so we decided to use it.2 The hypervisor is started when the bootrom is loaded onto the architecture. It loads the kernel ELF3 file, puts all segments at the specified physical addresses and jumps to the entry point address to hand control over to the supervisor. The supervisor then has full control over the machine and can carry out its boot-up. The hypervisor cannot handle any relocations defined in the ELF file, because this is not needed for traditional operating systems, where all cores use the same kernel image. In Barrelfish, however, all kernels must be loaded to separate

2The hypervisor keeps this promise in general: there is no documentation indicating behavioural differences between the bare metal environment and the hypervisor. 3Executable and Linkable Format – a standard Unix file format for executable programs, containing meta-information about how to load the program.

memory locations to be able to work on heterogeneous platforms, where different kernel images are needed. To be able to load the kernel image to different memory locations, we use a bootloader instead of loading the kernel directly. This bootloader is loaded by the hypervisor and maps the memory location for the data section depending on the core it is executed on. The real kernel is then loaded into this memory and executed. So the kernel is loaded at the same virtual address on each core, but at a different physical one. An alternative would be to let the bootloader perform some relocations on the kernel image.
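The per-core data-section mapping amounts to a simple address calculation. A minimal sketch in C: the constants follow the physical layout in Figures 2.3 and 2.4 (per-core rodata + data starting at 0x0300 0000, 1 MB per core), but the function name is ours.

```c
#include <stdint.h>

/* Physical base of the per-core (rodata + data) region and the size of
 * one core's slice; values follow Figures 2.3 and 2.4. */
#define DATA_SECTION_PHYS_BASE 0x03000000u
#define DATA_SECTION_SIZE      0x00100000u   /* 1 MB per core */

/* Physical address of the kernel data section for a given core. The
 * bootloader would install a mapping from the fixed virtual
 * data-section address to this per-core physical address. */
static inline uint32_t data_section_phys(uint32_t core_id)
{
    return DATA_SECTION_PHYS_BASE + core_id * DATA_SECTION_SIZE;
}
```

With this scheme the kernel's data section lives at the same virtual address on every core while the physical pages never overlap.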

2.2.3 Overview of Booting Process

Here we provide a broad overview of the booting process, inspired by [10]. The main steps of the booting process for the first core are listed below:

1. Ship the bundled hypervisor, bootloader, kernel & program files;
2. The hypervisor loads the bootloader ELF file;
3. The hypervisor sets the program counter to the entry point of the ELF file;
4. Install the initial page table & activate memory translation;
5. Set up the tile-specific kernel stack;
6. Copy the page table to a tile-specific location;
7. Add the tile-specific data section mapping;
8. Switch to the new page table;
9. Load the kernel module and jump to its entry point;
10. Slice the physical memory into sections, one for each core;
11. Set up init's address space;
12. Load the ELF modules defined in menu.lst;
13. Parse the given command line;
14. Load the init module;
15. A timer should be initialized at this moment;
16. The hypervisor is told to start up all other cores;
17. Context switch to init in user mode;
18. crt0 starts the runtime system, which starts bootstrap and initializes local message passing;
19. init starts monitor & mem_serv, and then exits;


20. monitor starts other modules;
21. spawnd initiates the second core;
22. The monitor on the second core starts up;
23. The two monitors on different cores start remote inter-core communication, establishing a binding between two dispatchers.

The booting process is somewhat different for all following cores:

1. The hypervisor waits until the first core tells it to start the other cores;
2. The hypervisor sets the program counter to the entry point of the ELF file;
3. Install the initial page table & activate memory translation;
4. Set up the tile-specific kernel stack;
5. Copy the page table to a tile-specific location;
6. Add the tile-specific data section mapping;
7. Switch to the new page table;
8. Load the kernel module and jump to its entry point;
9. Wait until the first core starts up this particular core;
10. Set up the monitor's address space;
11. Load the monitor module;
12. A timer should be initialized at this moment;
13. Context switch to monitor in user mode;
14. crt0 starts the runtime system, which starts the bootstrap thread and initializes local message passing;
15. monitor starts remote inter-core communication with the first core by establishing a binding with the corresponding dispatcher;
16. spawnd starts up.

One interesting point here is that the hypervisor only allows starting all cores at once, whereas Barrelfish needs to start one core at a time. To combine these different approaches, we tell the hypervisor to start all cores very early. Those cores then do some startup and wait for a special message. This message is sent via the UDN network, which is explained in Section 2.11.2.2. In contrast to other implementations of Barrelfish, on the TilePro platform some configurations are already defined at compile time:


• Most virtual address mappings are put into the data section of the kernel;
• The memory for the initial virtual mappings is allocated by the hypervisor while allocating the sections;
• For exception handling, the interrupt vector is put into a special section, which is loaded to a well-known address.

In this boot process no devices are initialized, because no devices are supported at the moment.

2.3 newlib

Barrelfish uses newlib [11] as its C standard library. A number of functions inside Barrelfish are implemented against this particular C standard library, including some crucial changes to it, which forces us to use newlib. Newlib contains machine-dependent files implemented for a large number of systems. Unfortunately the official newlib release used in Barrelfish does not support the TilePro platform. Tilera distributes their own version of newlib that supports TilePro, but it is based on a newlib released in 2004, so it cannot be used as a replacement for the existing newlib in Barrelfish. We therefore had to merge some files manually, following the newlib FAQ [11], to get a newlib working on both Barrelfish and the TilePro platform.4
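The 64-bit printf limitation of the merged newlib (see footnote 4) can be worked around by formatting the upper and lower 32 bits separately. A minimal sketch; the helper name is ours:

```c
#include <stdint.h>
#include <stdio.h>

/* Format a 64-bit value as two 32-bit halves, avoiding the unreliable
 * 64-bit conversion in the merged newlib. */
static int format_u64(char *buf, size_t n, uint64_t v)
{
    return snprintf(buf, n, "0x%08lx%08lx",
                    (unsigned long)(v >> 32),
                    (unsigned long)(v & 0xffffffffu));
}
```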

2.4 Virtual Memory

A virtual memory system allows different processes to run in a system without needing to know about each other [12]. Virtual addresses are translated to physical addresses with the assistance of a memory management unit (MMU), using a translation scheme that is valid per process. The OS can require some specially mapped virtual addresses pointing to special locations, such as kernel memory areas. Sometimes an OS also has requirements on physical addresses. The platform often requires certain special physical addresses (which may point to the interrupt vector or memory-mapped I/O, for example) and virtual addresses. The memory layout is therefore determined both by the platform and by the OS. Barrelfish requires a large section mapped 1:1 to physical memory, to be able to access physical addresses by a simple calculation (subtracting an offset). Furthermore it needs a part at the beginning, where the multiboot information

4Actually, it is known so far that printf cannot reliably output 64-bit values. The current workaround is to print the upper and lower 32 bits separately.

for the other cores is lying. This 1:1 section must be handled as a special case on TilePro, because its virtual address space is smaller than its physical address space (32-bit vs. 64-bit). Because of this property, we cannot maintain this mapping over the whole memory if a lot of physical memory is installed. So we use a mapping that differs per core, by slicing the memory core-wise. This scheme works because of the Barrelfish property of not sharing any memory between cores.5 An alternative approach would be to use temporary mappings as described in [13]; for this to work, it must be ensured that one mapping is no longer in use before another one is set up. TilePro only reserves some addresses at the end of the virtual memory range, which are used internally by the hypervisor. There are no reserved physical addresses, because an OS on top of the hypervisor only sees "Client Physical Addresses", which are translated rapidly to real physical addresses by the hypervisor. This abstraction works in practice, so we do not need to care about this translation and simply handle client physical addresses as if they were real physical addresses. The shipping process places some requirements on the memory layout, in the sense that the information about which sections (text, data, ...) compose the kernel must be told to the booter. This information is put into the ELF file via a linker script. The initial P=V mode (each virtual address is mapped to the same physical address), which is active until the first page table has been installed, ensures that all needed virtual addresses are statically translated to physical addresses. As mentioned before in Section 2.2.2, we need to map the data section of the kernel separately for each core. To do so, we need a different initial page table per core, differentiated by calculating the core ID.
The stack for the kernel must also differ per core, but we simply use adjacent memory regions; otherwise we would need to write the mappings without having a stack, i.e. in assembly. All these requirements led us to the virtual and physical memory layouts shown in Figure 2.2 and Figure 2.4 respectively.6 Figure 2.2 describes the virtual memory layout. Some addresses are pre-determined by TilePro: e.g. addresses from 0xFD000000 to 0xFE000000 are for the kernel code section and addresses above 0xFE000000 are for the hypervisor, so we cannot use this memory for other purposes. A special part in Figure 2.2, called Barrelfish internal, is pictured in detail in Figure 2.3. Some Barrelfish-related data is put into this area specifically to start up the first user-level process, init, following Barrelfish convention. Furthermore, because the TilePro page sizes (64 KB for small pages and 16 MB for large pages) are much larger than on other architectures (on x86 and ARM, for instance, the small page size is 4 KB), we decided to place more data in this area, such as the kernel data section, to save some memory space. For the physical memory, as shown in Figure 2.4, apart from the statically mapped parts, which mainly

5Actually there is a little shared memory for starting up another core, which is written once by the first core and read by the to-be-started core. For this, memory from the first core is mapped into all other cores. 6Drawn with the help of [14].


0xFE00 0000 – 0xFFFF FFFF: hypervisor
0xFD03 0000 – 0xFE00 0000: kernel text (code section, one large page)
0xFD02 0000 – 0xFD03 0000: bootloader text (code section, one large page)
0xFD01 0000 – 0xFD02 0000: hypervisor glue (code section, one large page)
0xFD00 0000 – 0xFD01 0000: PL 1 interrupt vectors (code section, one large page)
0xFC00 0000 – 0xFD00 0000: kernel stacks, 64 KB per tile (one large page)
0x8000 0000 – 0xFC00 0000: kernel space, ~1.9 GB
0x0100 0000 – 0x8000 0000: ~2 GB
0x0000 0000 – 0x0100 0000: Barrelfish internal (several small pages)

Figure 2.2: Virtual memory layout

0xF0 0000 – 0xFF 0000: rodata + data, mapped per core (1 MB)
0xD0 0000 – 0xF0 0000: – unmapped –
0xC0 0000 – 0xD0 0000: page tables, sliced per core (L1 + 1 × L2 for up to 256 cores)
0xB1 0000 – 0xC0 0000: – unmapped –
0xA0 0000 – 0xB1 0000: bootloader data (size should be stable)
0x28 0000 – 0xA0 0000: – unmapped –
0x27 0000 – 0x28 0000: multiboot data (tells app cores their properties)
0x07 0000 – 0x27 0000: DISPATCHER (mapped dynamically)
0x05 0000 – 0x07 0000: BOOT_ARGS (mapped dynamically)
0x01 0000 – 0x05 0000: BOOT_INFO (mapped dynamically)
0x00 0000 – 0x01 0000: – unmapped – (for NULL pointers)

Figure 2.3: Barrelfish internal virtual memory layout


0x1300 0000 and above: free space, sliced per core
0x0300 0000 – 0x1300 0000: 256 × (rodata + data), mapped per core
0x0200 0000 – 0x0300 0000: kernel stacks, 64 KB per tile
0x0100 0000 – 0x0200 0000: Barrelfish internal
0x0003 0000 – 0x0100 0000: kernel text (statically mapped)
0x0002 0000 – 0x0003 0000: bootloader text (statically mapped)
0x0001 0000 – 0x0002 0000: hypervisor glue (statically mapped)
0x0000 0000 – 0x0001 0000: PL 1 interrupt vectors (statically mapped)

Figure 2.4: Physical memory layout

contain kernel code and data, we have a separate data section for each core, and the remaining free space is sliced per core, so that one core cannot interfere with the others and heterogeneous hardware can be supported.
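The 1:1 mapping with per-core slicing amounts to simple arithmetic: subtract the window base, then add the core's slice base. The window and free-space bases below follow Figures 2.2 and 2.4, but the slice size and the function name are assumptions of this sketch.

```c
#include <stdint.h>

/* Illustrative constants: the 1:1-mapped kernel window starts at
 * 0x8000 0000 (Figure 2.2) and the per-core free space starts at
 * 0x1300 0000 (Figure 2.4). The slice size is an assumed value. */
#define KERNEL_WINDOW_BASE 0x80000000u
#define FREE_SPACE_BASE    0x13000000u
#define CORE_SLICE_SIZE    0x04000000u   /* 64 MB per core, assumed */

/* Translate a virtual address inside the 1:1 window to the physical
 * address within the calling core's slice. */
static inline uint32_t local_phys(uint32_t vaddr, uint32_t core_id)
{
    return (vaddr - KERNEL_WINDOW_BASE)
         + FREE_SPACE_BASE + core_id * CORE_SLICE_SIZE;
}
```

Because each core only ever translates into its own slice, the smaller 32-bit virtual space never has to cover the whole physical memory at once.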

2.5 ASIDs

TilePro supports the use of Address Space Identifiers (ASIDs), and we use them to avoid flushing the Translation Lookaside Buffer (TLB) during context switches. Every dispatcher on a core is assigned a unique ASID; during a context switch the TLB then need not be flushed, since the entries of each process are recognizable by their ASID. However, the TilePro hardware limits each core to 256 ASIDs in total, and when this number is reached it is not possible to spawn new dispatchers.
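A dispatcher-spawn path could hand out ASIDs from a simple per-core bitmap until the hardware limit is hit. This is our sketch, not the actual Barrelfish code, and reserving ASID 0 is an assumption of the sketch:

```c
#include <stdbool.h>

#define MAX_ASIDS 256           /* TilePro hardware limit per core */

static bool asid_used[MAX_ASIDS];

/* Return a free ASID for a new dispatcher, or -1 when all ASIDs are
 * taken, in which case no further dispatcher can be spawned on this
 * core. ASID 0 is kept reserved here (an assumption of this sketch). */
static int asid_alloc(void)
{
    for (int i = 1; i < MAX_ASIDS; i++) {
        if (!asid_used[i]) {
            asid_used[i] = true;
            return i;
        }
    }
    return -1;  /* exhausted: cannot spawn another dispatcher */
}
```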

The kernel memory has no ASID and is marked as global, so it is accessible by all running processes independently. Because the memory is split into user and kernel space, we can provide the same kernel memory mapping for all processes, by copying the mapping from the initial page table. Knowing that the kernel mappings are always the same, we never need to flush them.


2.6 Interrupt Handling

Tilera has a protection scheme to prevent malicious behaviour, such as executing illegal instructions or misusing the networks. The protected parts of the system, typically memory ranges (page-wise), instructions and special-purpose registers [15], each have a specific minimum protection level. A core is in exactly one protection level at any time and can only access resources whose minimum protection level is the same or lower. Protection level 0 is for user space, level 1 for the operating system and level 2 for the hypervisor. Interrupts are handed to the operating system or to the hypervisor. The hypervisor can pass interrupts conditionally or unconditionally to the operating system. Before doing so, it can perform some actions, e.g. try to resolve a TLB miss and hand it to the operating system only if a real page fault occurred. This technique is known as a "downcall" [15]. Barrelfish's requirements for normal interrupts differ from other operating systems in that the registers are not saved on the stack, but in a special area that depends on the state of the current dispatcher [16]. This is needed when an interrupt causes a context switch, because when switching back, execution always jumps into one of several special entry routines, which later restore the saved registers. In Barrelfish there are 3 entry points for handling interrupts; two of them are for page faults, depending on the status of the dispatcher at the moment a page fault takes place. Barrelfish is responsible for deciding how to deal with an interrupt once it receives one from the TilePro hypervisor.
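The two decisions in this path can be modelled compactly: the hypervisor swallows a TLB miss when a valid mapping exists (the "downcall"), and a real page fault is delivered to the entry point matching the dispatcher's mode. Enum and function names below are ours, not from the real code:

```c
#include <stdbool.h>

/* Barrelfish's page-fault entry points (Section 2.8 describes the
 * enabled/disabled dispatcher modes that select between them). */
enum disp_entry { ENTRY_RUN, ENTRY_PAGEFAULT, ENTRY_PAGEFAULT_DISABLED };

/* Hypervisor side: a TLB miss with a valid mapping is refilled and
 * never reaches the operating system. */
static bool hypervisor_resolves(bool mapping_valid)
{
    return mapping_valid;
}

/* Barrelfish side: pick the page-fault entry by dispatcher mode. */
static enum disp_entry pagefault_entry(bool dispatcher_disabled)
{
    return dispatcher_disabled ? ENTRY_PAGEFAULT_DISABLED
                               : ENTRY_PAGEFAULT;
}
```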

2.7 System Calls

In general a system call is a way to request a kernel service from user space. The user code puts some information at a well-known location and causes a trap; the kernel serves the request and may return to the user code [12]. Barrelfish provides a syscall interface allowing user code to invoke kernel services by passing a legal capability and some arguments. A capability is a token that allows a specific access to a resource [17]. Every time a syscall is invoked, the capability is examined before the kernel service begins; if the examination fails, an error is returned to the user. Another special aspect of syscalls in Barrelfish is that the local message-passing mechanism is also implemented through syscalls. That is to say, in some cases a local message is sent and the receiving process should quickly be woken up to receive it, so a context switch might be involved; we need to consider this situation. One solution is to save all caller-save registers, as well as other essential registers, to a specific save-area in the syscall handler, and return the pointer to the syscall function. This procedure also depends on the running status of the dispatcher (see Section 2.8).


[Figure: dispatcher structure in Barrelfish, consisting of a generic part (thread-management data, stacks and other values) and an architecture-related part (status indicators, the ASID, the critical section, the save-areas and the dispatcher entries).]

Figure 2.5: Dispatcher structure in Barrelfish

On the TilePro architecture, the instruction swint signals an interrupt to the corresponding handler. There are four swint interrupt levels, matching the four protection levels of the TilePro architecture: 0 for user level, 1 for the operating system, 2 for the hypervisor and 3 for the virtual machine. We therefore use swint1 to issue a syscall and hand control over to the kernel. According to the TilePro ABI [18], the general-purpose registers r0 to r9 pass the arguments, so we use r10 to hold the syscall number assigned by Barrelfish. In other words, we can pass 10 32-bit values through registers; any remaining arguments are put on the stack.
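The placement rules can be modelled in portable C. This is only an illustrative sketch (the real syscall entry path is TilePro assembly, and the structure below is hypothetical): the first 10 argument words travel in r0..r9, r10 carries the syscall number, and anything beyond that spills to the stack.

```c
#include <string.h>

#define NUM_ARG_REGS 10   /* r0..r9 carry arguments per the TilePro ABI */

/* Hypothetical model of the state handed to the kernel on swint1. */
struct trap_frame {
    unsigned regs[11];        /* r0..r9 arguments, r10 syscall number */
    unsigned stack_spill[8];  /* overflow arguments, stack-resident   */
    unsigned num_spilled;
};

static void marshal_syscall(struct trap_frame *tf, unsigned sysno,
                            const unsigned *args, unsigned nargs) {
    unsigned i;
    memset(tf, 0, sizeof *tf);
    tf->regs[10] = sysno;                    /* r10 holds the syscall number */
    for (i = 0; i < nargs && i < NUM_ARG_REGS; i++)
        tf->regs[i] = args[i];               /* first 10 words in r0..r9 */
    for (; i < nargs; i++)
        tf->stack_spill[tf->num_spilled++] = args[i];  /* rest on the stack */
}
```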

2.8 Processes and Threads

In Barrelfish, a dispatcher is the basic unit of kernel scheduling and manages its own threads; it is similar to the concept of a process in other operating systems. A dispatcher has two main parts: an architecture-related part and a generic part shared by all hardware. The architecture-related part mainly contains status indicators, entry points for various actions, and save-areas into which all registers are saved during a context switch; in our case it also contains an ASID for avoiding TLB flushes (see Section 2.5). The generic part mainly contains information on how to manage threads, and the dispatcher's own stacks. The dispatcher structure is shown in Figure 2.5. The kernel maintains a dispatcher control block (DCB) for each dispatcher. The


DCB contains entries that define the dispatcher's cspace (capability tables), vspace (page tables), some scheduling parameters, and a pointer to a user-space dispatcher structure; this struct manages the scheduling of the dispatcher's threads [19]. Switching between threads happens in user mode. The architecture-specific part of the implementation must be able to save and restore the state of a thread; everything else happens in the architecture-independent part. A dispatcher can be in one of two modes: enabled or disabled [19]. It is enabled when running user threads; for example, whenever a thread is resumed, its dispatcher should be in enabled mode. During a context switch in this mode, all registers are saved into the enabled save-area. A dispatcher running in disabled mode is running the kernel code, for example managing TCBs, and a newly created dispatcher starts out disabled. If it is pre-empted at such a time, all register state is saved into the disabled save-area. Unless it restores a dispatcher from a disabled context, the kernel always enters a dispatcher at one of the 5 entry points: run, page fault, page fault disabled, trap and LRPC [16]. A dispatcher is entered at the run entry point when it was not previously running and, the last time it ran, it was either enabled or had yielded the CPU [16]. We did not implement the other 4 entry points in this project, because they have not been necessary so far.
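The mode-dependent save logic can be sketched as a minimal model with field names of our own choosing (the real dispatcher layout is defined by Barrelfish, and the register count here is truncated for illustration): the register state goes into the enabled or the disabled save-area according to the dispatcher's mode at the moment it is interrupted, so the matching entry point can restore it later.

```c
#include <assert.h>
#include <string.h>

#define SAVE_REGS 8   /* truncated register file, for illustration only */

struct save_area { unsigned long regs[SAVE_REGS]; };

/* Hypothetical dispatcher fragment: just the mode flag and save-areas. */
struct dispatcher {
    int disabled;                    /* nonzero while in disabled mode */
    struct save_area enabled_area;   /* state saved while running user threads */
    struct save_area disabled_area;  /* state saved while in disabled mode     */
};

/* Select the save-area matching the dispatcher's current mode. */
static struct save_area *save_area_for(struct dispatcher *d) {
    return d->disabled ? &d->disabled_area : &d->enabled_area;
}

/* On an interrupt, store the register state into the selected area. */
static void save_context(struct dispatcher *d, const unsigned long *regs) {
    memcpy(save_area_for(d)->regs, regs, sizeof(struct save_area));
}
```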

2.9 I/O

This project does not involve developing I/O drivers for TilePro. The only implementation is the function printf, which is basically a wrapper around a hypervisor call.

2.10 Local Communication

The Barrelfish team has put considerable effort into inter-dispatcher communication (IDC), and has shown that this message-passing-based method can outperform shared-memory methods that depend heavily on the cache-coherency protocol. Local message passing (LMP) is designed for communication between dispatchers on the same core. Barrelfish implements this functionality without any use of the cache-coherency protocol: all data between dispatchers is passed through registers via syscalls, so no shared memory is allocated in the process. Barrelfish also provides an interface for implementing LMP on different architectures. In this project we allocate 7 of the 10 argument-passing registers for LMP and invoke the syscall to deliver the message. So far, LMP on the initial core works without errors.
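A minimal sketch of the packing step, assuming a hypothetical layout (the actual Barrelfish LMP register assignment differs in detail): up to 7 payload words are placed in argument registers, while the remaining registers carry the bookkeeping values the syscall itself needs, such as the endpoint to invoke.

```c
#include <assert.h>

#define NUM_ARG_REGS  10
#define LMP_MSG_WORDS 7    /* payload words carried per LMP message */

/* Model of the argument registers handed to the syscall. */
struct lmp_syscall_args {
    unsigned regs[NUM_ARG_REGS];
};

/* Pack an LMP message into the argument registers.
 * Returns 0 on success, -1 if the payload does not fit in registers. */
static int lmp_pack(struct lmp_syscall_args *a, unsigned endpoint,
                    const unsigned *payload, unsigned len) {
    unsigned i;
    if (len > LMP_MSG_WORDS)
        return -1;                /* larger payloads need a different path */
    a->regs[0] = endpoint;        /* which endpoint to invoke (bookkeeping) */
    a->regs[1] = len;             /* number of valid payload words          */
    for (i = 0; i < len; i++)
        a->regs[2 + i] = payload[i];
    return 0;
}
```

Because the whole message fits in registers, the kernel can hand it to the receiving dispatcher without touching shared memory.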


2.11 Core-to-Core Communication

2.11.1 Existing Barrelfish Ports

Barrelfish has already been ported to two mainstream architectures: x86 and ARM. For both, core-to-core communication is implemented via user-level message passing (UMP). The Barrelfish team implements UMP with a shared-memory method based on a clever use of the cache-coherence protocol [20]; the purpose is to reduce the use of the cache-coherence protocol as much as possible, so as to increase the efficiency of remote communication. However, this can be seen as a compromise for architectures without appropriate hardware support for core-to-core message passing. There is another port, to Intel's “Single-Chip Cloud Computer” [21], in which core-to-core communication is based on UMP and extended with message queues in shared memory and inter-processor interrupts. All supported architectures therefore use shared memory in some way to communicate between cores.

2.11.2 Implementation of Message Passing on TilePro

An overview of the networks can be found in [22], a document about the older Tile Processor. The available networks are:

2.11.2.1 Static Network

The Static Network (SN) is a user-mode accessible network with predefined routes. The word “static” describes the routing: routes are set up statically, in contrast to all other networks on the TilePro platform. This means that at every node (tile), each port (North, East, South, West, Processor) has a fixed route. A packet is switched only by the port (the direction) it comes from, and can either be consumed by the tile itself (sent to the processor) or be sent on to another port. This reduces the communication delay to a minimum, at the cost of fixed routes. Consequently a core can communicate bidirectionally with only one other core, because messages sent from the local processor can be switched in only one of the four directions. This is not sufficient for a general-purpose messaging platform.

2.11.2.2 User Dynamic Network

The User Dynamic Network (UDN) is a user-mode accessible network with dynamic routes. The routing is defined by a destination header on every message. A message consists of a tag and a number of words. With the help of the tag, the network

can do hardware multiplexing [22]. There are 4 input queues per core, used according to the tag of an incoming message. If no queue has a matching tag, the data is stored in a catch-all queue. Messages from one core to another arrive in order, but messages from different cores to the same core can arrive in any order. The words within a message are always in order.
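The hardware demultiplexing just described can be modelled in a few lines of C. The queue counts follow [22]; the code itself is only our illustration, not Tilera's interface: each of the 4 input queues has a programmable tag, and a message whose tag matches none of them lands in the catch-all queue.

```c
#include <assert.h>

#define UDN_NUM_TAGGED_QUEUES 4
#define UDN_CATCH_ALL_QUEUE   UDN_NUM_TAGGED_QUEUES

/* Return the index of the input queue a message with msg_tag lands in:
 * one of the 4 tagged queues on a match, otherwise the catch-all queue. */
static int udn_route(const unsigned queue_tags[UDN_NUM_TAGGED_QUEUES],
                     unsigned msg_tag) {
    int q;
    for (q = 0; q < UDN_NUM_TAGGED_QUEUES; q++)
        if (queue_tags[q] == msg_tag)
            return q;               /* demultiplexed in hardware */
    return UDN_CATCH_ALL_QUEUE;     /* unmatched tags fall through */
}
```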

2.11.2.3 Other Dynamic Networks

IDN: The I/O Dynamic Network is implemented similarly to the UDN, but is intended for device drivers communicating with off-chip devices. It is physically separated from the UDN in order to keep user messages apart from I/O messages and to give the latter higher priority (in terms of interrupt priority). It is not accessible to the supervisor; only device drivers registered inside the hypervisor have access to it. We therefore cannot use the IDN, but its properties are the same as the UDN's.

CDN: The Coherence Dynamic Network is used internally by the cache-coherence protocol. It could be used indirectly through shared-memory communication, but that is not the aim of this work.

MDN: The Memory Dynamic Network manages memory accesses between tiles and external memory; it is only accessible to the cache engine and therefore not used by us.

TDN: The Tile Dynamic Network manages memory accesses between tiles, and is likewise only accessible to the cache engine.

2.11.3 Implementation of a New Backend

In order to make full use of TilePro's mesh network structure and avoid the cache-coherence protocol, we decided to develop a new backend in Barrelfish based on TilePro's hardware. Two networks would in principle fulfil our needs: the UDN and the CDN. Using the CDN would amount to a simple port of the UMP protocol to the TilePro platform, so to exploit what is unique about this platform we used the UDN. For multiplexing, the UDN offers 4 input queues per core, which is not enough for a general-purpose platform. We therefore use only the catch-all input queue and do our own multiplexing based on the tag: a tag either causes a message to be placed in one of the associated software queues, if one exists, or the tag of a message lying in the catch-all queue can be looked up later. To implement inter-dispatcher communication between cores, a dispatcher needs to know whom it wants to talk to and where that party is. So the information needed to send a message to another dispatcher


is the target core ID, the channel ID (used as the UDN tag), and the target ASID, which identifies the destination dispatcher on the target core. The channel ID is unique per core, so a bidirectional channel has two IDs: one for incoming messages and one for outgoing messages. On each core, one Barrelfish channel internally consists of one output channel and one input channel. The output channel holds the target core ID, the channel ID and the ASID needed to send a message; the input channel holds the channel ID needed to retrieve messages from the UDN queue. When a message arrives, a dispatcher checks whether the message is addressed to it by comparing its own ASID with the message's target ASID; if they match, it receives the message, otherwise the operating system switches it out and another dispatcher checks in its place, until one of them accepts the message. A dispatcher always knows whom it talks to: before two dispatchers can communicate, the monitor establishes a binding between them. Figure 2.6 shows the structure of the UDN backend implemented in Barrelfish.

Figure 2.6: UDN backend (each core hosts a dispatcher with an out channel and an in channel; a backend message buffer on each core demultiplexes incoming messages by channel ID over the UDN network)
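The bookkeeping just described can be sketched with hypothetical structs (the layout is ours, not the actual backend code): a channel pairs an output half with an input half, and an arriving message is accepted only when both the channel ID and the target ASID match the current dispatcher.

```c
#include <assert.h>

/* Output half: everything needed to send (all fields hypothetical). */
struct udn_out_channel {
    unsigned target_core;   /* which tile to send to              */
    unsigned channel_id;    /* used as the UDN tag                */
    unsigned target_asid;   /* destination dispatcher on the core */
};

/* Input half: what we need to retrieve messages from the UDN queue. */
struct udn_in_channel {
    unsigned channel_id;    /* tag to look for in the catch-all queue */
};

struct udn_msg {
    unsigned channel_id;
    unsigned target_asid;
};

/* 1: this dispatcher consumes the message; 0: the OS lets another
 * dispatcher on the core try instead. */
static int udn_accept(unsigned my_asid, const struct udn_in_channel *in,
                      const struct udn_msg *m) {
    return m->channel_id == in->channel_id && m->target_asid == my_asid;
}
```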

The UMP in the x86 and ARM implementations uses a polling approach to receive messages. Although the UDN allows interrupts to be used for message delivery, we retain the polling method, because it stays in line with the way messages are implemented in Barrelfish. When using messages inside

the source code, the user only needs to specify whether the communication is local or remote. For remote communication the right backend is chosen through compile-time settings: the needed code is generated and written into build files, so that a different backend can be selected merely by changing a compilation option. To stay within this pattern we use the polling approach, so that using the UDN is just a compilation option.


Chapter 3: Reflection on the Porting Process

This chapter discusses the process we followed to realize the implementation. Our initial approach was platform-first: we first checked what the TilePro platform offers and requires, then what Barrelfish needs. To start the porting, we created a new, empty target architecture and many stub files, each of which causes the code to crash and print the current line number¹, so that we could easily trace where the code was running and locate the crashing point. Once the stubs were created, we had a list of techniques to implement. Our work cycle for these techniques can be described with the following phases:

Platform: Check which primitives TilePro offers and which configuration it needs.
Concept: Understand the basic concept of the technique in general (not TilePro-specific).
Minimum Working Example: Build a minimum working example, which does not necessarily interface with the rest of Barrelfish.
OS: Check the requirements of Barrelfish.
Implementation: Implement the code so that it interfaces with Barrelfish.
Testing: Test the code by starting up Barrelfish, checking conditions, generating outputs or debugging the system.

The process was not always as linear as described above. Understanding the concept is a phase we sometimes had to repeat, or one that lasted from checking TilePro's primitives until the implementation phase. For the first basic concepts, such as booting an operating system, virtual memory management and interrupts, this approach worked well: it let us get a minimal running example without caring too much about Barrelfish, which we could then broaden to fulfil the Barrelfish requirements. We chose this approach because it best matched our need to learn most of these techniques. Another reason is that, for most parts, there were not many

¹ Basically: assert(!"implement me");

documented requirements on Barrelfish. The requirements mostly took the form of Barrelfish interfaces or the Barrelfish source code for other platforms. So we needed to understand each technique and how it works on TilePro before we were able to understand the Barrelfish requirements. Between the general techniques we needed to implement in Barrelfish, there were many small steps where this approach was not feasible. Our work cycle there can be described with these phases:

Run: Run the Barrelfish code until the point where an error occurs.
OS: Find the reason for the error from outputs, stack traces or debugging, i.e. find out which Barrelfish requirement we have not fulfilled yet.
Concept: For requirements that involve a new concept, understand it in general.
Platform: Check which primitives TilePro offers and which configuration it needs.
Implementation: Fix the error or fulfil the requirement.
Testing: Test it.

Notably, the order of the OS and Platform phases is swapped. We also had no minimum-working-example phase here, because we already had the interfaces and requirements and knew how much implementation was needed to fulfil them. In this part, the requirements of the operating system determined the porting process: we needed to find the problem first, before looking into the platform. This chapter merely keeps track of the process of porting Barrelfish to the TilePro architecture. Since we spent most of our time doing engineering work on this large project, it is a reflection on what we did and which principles we stuck to, thereby obtaining some promising results midway through the porting process.

Chapter 4: Results

4.1 Porting Results

By the end of this project we obtained some positive outcomes. First of all, we managed to start Barrelfish completely and correctly on at least two cores of the TilePro architecture. The first process, init, gets started, and then it starts mem_serv and monitor. monitor afterwards boots the remaining processes on the first core, communicating with them via LMP. Finally, spawnd is responsible for initiating another core. Instead of booting the second core, we can also run some simple user applications on the first core by combining them into the bootrom file, such as helloworld or the console. Furthermore, the second core starts up completely. Specifically, after the monitor process on the second core has started, it performs some remote communication with the first core's monitor with the aim of establishing a binding between dispatchers. Afterwards the two dispatchers are able to send messages to each other. At this point all communication is based on our UDN network driver. Finally spawnd is invoked, so we believe the second core is ready according to the Barrelfish team's roadmap shown in Figure 4.1. As on the first core, we can also run some user applications on the second core.

4.2 Modifications to Barrelfish

Since it is normally not possible to modify any hardware setting, there are no modifications on the TilePro side. On the other hand, we added all the essential architecture-related parts to Barrelfish to make the operating system run correctly. The first step is to initialize the machine and boot the kernel, which includes setting up the page tables, allocating memory, handling interrupts and system calls, and preparing the first process to run. When the process is starting, the

Figure 4.1: Bootstrap on TilePro (the hypervisor loads menu.lst and the bootrom images; on core 0 the cpu kernel starts init, mem_serv, monitor, ramfsd, skb and spawnd, while on core 1 the cpu kernel starts monitor and spawnd)

context switch needs to be implemented according to TilePro's registers. In user mode, we needed to implement context switching between threads in a dispatcher, the system-call entry into the kernel code, high-level memory management (the architecture-independent part), the UDN backend and its supporting code, etc. Table 4.1 summarizes the essential modules we added to Barrelfish to make it work on TilePro.

Table 4.1: Modifications to Barrelfish

Mode: kernel code
Functionalities: boot loader, kernel start-up, memory allocation, page table, context switch between dispatchers, register structure for context switch, system call handler, interrupt handler

Mode: user code
Functionalities: context switch between threads, system call entry (including LMP), entry point for threads, high-level memory management (pmap), UDN backend code, UDN support code


Chapter 5: Conclusion

In this project we ported the Barrelfish operating system to the TilePro architecture and obtained some promising results. Barrelfish can start up completely on at least two cores. Local and remote communication between dispatchers works in principle, using LMP and the TilePro UDN network respectively. Moreover, we can run some simple user programs on both running cores.

Judging from the bootstrap of Barrelfish, we can claim that we succeeded in porting Barrelfish to the TilePro architecture, since the other cores will boot in the same way as the second core, provided there is enough memory for each core. We suggest exploiting the UDN network for core-to-core communication, because it not only makes full use of the TilePro hardware but also stays in line with the design principles of Barrelfish. Another issue worth discussing is the so-called bulk-transfer technique in the Barrelfish kernel. Bulk transfer is a mechanism designed to facilitate massive data exchange between processes and depends on shared memory. This mechanism cannot be opted out of, which means we have to use it anyway. In other words, even though we implemented the UDN network instead of UMP and thus require no shared memory, at some point Barrelfish needs to call the bulk-transfer functionality, so shared memory is still involved to some degree. TilePro is in fact a shared-memory-based architecture, even though it offers its own network structure for inter-core messages, so this shared-memory issue is not a problem for us. But if someone wants to port Barrelfish to a truly distributed system without any shared memory, an alternative to bulk data transfer will have to be invented.

One of the original objectives of this project was to evaluate the efficiency of Barrelfish on the TilePro architecture after the porting, including some benchmarking. However, owing to time limitations and the complexity of the engineering work, we could not meet all the goals set at the outset, even though we tried to give the project a positive start for future continuation. Future work may involve: improving the port on the first core by starting the process serial so that the keyboard works, implementing a timer for the system, considering how to implement Barrelfish on heterogeneous and distributed systems, investigating further the advantages and disadvantages of cache-coherence protocols versus message passing, and establishing benchmarks to measure Barrelfish on TilePro. The future work is still tough but meaningful. Hopefully our contribution is not the end, but just the beginning.

Bibliography

[1] H. Sutter, “The Free Lunch Is Over: A Fundamental Turn Toward Concurrency In Software,” Dr. Dobb's Journal, vol. 30, no. 3, Mar. 2005, http://www.gotw.ca/publications/concurrency-ddj.htm.

[2] A. Baumann, P. Barham, P.-E. Dagand, T. Harris, R. Isaacs, S. Peter, T. Roscoe, A. Schüpbach, and A. Singhania, “The Multikernel: A new OS architecture for scalable multicore systems,” in Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles, ser. SOSP '09. New York, NY, USA: ACM, 2009, pp. 29–44, http://doi.acm.org/10.1145/1629575.1629579.

[3] S. Boyd-Wickizer, A. T. Clements, Y. Mao, A. Pesterev, M. F. Kaashoek, R. Morris, and N. Zeldovich, “An Analysis of Linux Scalability to Many Cores,” in 9th USENIX Symposium on Operating Systems Design and Implementation, Oct. 2010, http://www.usenix.org/event/osdi10/tech/full_papers/Boyd-Wickizer.pdf.

[4] A. Kleen, “Linux multi-core scalability,” Linux Kongress, 2009, http://halobates.de/lk09-scalability.pdf.

[5] D. Wentzlaff and A. Agarwal, “Factored Operating Systems (fos): The Case for a Scalable Operating System for Multicores,” SIGOPS Oper. Syst. Rev., vol. 43, no. 2, pp. 76–85, Apr. 2009, http://doi.acm.org/10.1145/1531793.1531805.

[6] J. A. Colmenares, S. Bird, H. Cook, P. Pearce, D. Zhu, J. Shalf, S. Hofmeyr, K. Asanović, and J. Kubiatowicz, “Resource Management in the Tessellation Manycore OS,” in Proceedings of the Second USENIX Workshop on Hot Topics in Parallelism (HotPar'10), Berkeley, California, Jun. 2010, http://tessellation.cs.berkeley.edu/publications/pdf/TessellationHotPAR10.pdf.

[7] S. Boyd-Wickizer, H. Chen, R. Chen, Y. Mao, F. Kaashoek, R. Morris, A. Pesterev, L. Stein, M. Wu, Y. Dai, Y. Zhang, and Z. Zhang, “Corey: An Operating System for Many Cores,” in Proceedings of the 8th Symposium on Operating Systems Design and Implementation, Dec. 2008, http://pdos.csail.mit.edu/corey.

[8] A. Baumann, S. Peter, A. Schüpbach, A. Singhania, T. Roscoe, P. Barham, and R. Isaacs, “Your computer is already a distributed system. Why isn't your OS?” in Proceedings of the 12th Workshop on Hot Topics in Operating Systems, May 2009, http://www.barrelfish.org/barrelfish_hotos09.pdf.

[9] Multiboot Specification, Free Software Foundation, 2009, http://www.gnu.org/software/grub/manual/multiboot/multiboot.html.

[10] S. Hitz, “Multicore ARMv7-A support for Barrelfish,” bachelor's thesis, Aug. 2012, http://www.barrelfish.org/hitz-bachelor-multicore-arm.pdf.

[11] newlib FAQ, http://sourceware.org/newlib. Accessed: 2012-11-26.

[12] A. S. Tanenbaum, Modern Operating Systems, 3rd ed. Upper Saddle River, NJ, USA: Prentice Hall Press, 2007.

[13] F. Wang, “A Clarification on Linux Addressing,” public note, Nov. 2008, http://users.nccs.gov/~fwang2/linux/lk_addressing.txt. Accessed: 2013-02-13.

[14] M. Demling, “Creating memory maps in LaTeX using the {bytefield} package,” blog entry, Jun. 2011, http://www.martin-demling.de/2011/06/memory-maps-in-latex-using-the-bytefield-package/. Accessed: 2013-02-13.

[15] Tile Processor Architecture Overview for the TILEPro Series, 1st ed., Tilera Corporation, Mar. 2011, http://www.tilera.com/scm/docs/index.html.

[16] A. Baumann, S. Peter, T. Roscoe, A. Schüpbach, and A. Singhania, Barrelfish Specification, Barrelfish, May 2012, http://www.barrelfish.org/TN-010-Spec.pdf.

[17] J. Shapiro, “What is a Capability, Anyway?” essay, 1999, http://www.eros-os.org/essays/capintro.html. Accessed: 2013-02-24.

[18] Tile Processor Application Binary Interface, 4th ed., Tilera Corporation, Sep. 2011, http://www.tilera.com/scm/docs/index.html.

[19] I. Kuz and A. Phanishayee, Barrelfish Architecture Overview, 1st ed., Barrelfish, Jun. 2010, http://www.barrelfish.org/TN-000-Overview.pdf.

[20] A. Baumann, Inter-dispatcher communication in Barrelfish, Barrelfish, Dec. 2011, http://www.barrelfish.org/TN-011-IDC.pdf.

[21] S. Peter, A. Schüpbach, D. Menzi, and T. Roscoe, “Early experience with the Barrelfish OS and the Single-Chip Cloud Computer,” Ettlingen, Germany, Jul. 2011, http://www.barrelfish.org/barrelfish_marc11.pdf.

[22] D. Wentzlaff, P. Griffin, H. Hoffmann, L. Bao, B. Edwards, C. Ramey, M. Mattina, C.-C. Miao, J. F. Brown III, and A. Agarwal, “On-Chip Interconnection Architecture of the Tile Processor,” IEEE Micro, vol. 27, no. 5, pp. 15–31, Sep. 2007, http://dx.doi.org/10.1109/MM.2007.89.

