Porting Cilk to the Barrelfish OS

CHAU HO BAO LE

KTH Information and Communication Technology

Master of Science Thesis Stockholm, Sweden 2013

TRITA-ICT-EX-2013:66

KTH Royal Institute of Technology Dept. of Software and Computer Systems

Degree project for the degree of Master of Science in Information and Communications Technology Porting Cilk to the Barrelfish OS

Author: Chau Ho Bao Le

Supervisor: Georgios Varisteas, MSc
Examiner: Prof. Mats Brorsson, KTH, Sweden

Abstract

Barrelfish is an experimental instance of the multikernel OS structure which exhibits desirable features such as support for hardware heterogeneity, scalability and dynamicity. Barrelfish is a work in progress and still lacks applications, so there is a need to investigate how efficiently applications run on it; one class of candidates is shared-memory applications. For an empirical study, Cilk was chosen because its runtime library is designed for shared-memory architectures and it is known to deliver good performance. This thesis focuses on making Cilk run on top of Barrelfish in order to reach two goals: portability, which Barrelfish is claimed to support, and good speed afterwards. The porting involves compiling the Cilk runtime source code after replacing its pthread subroutines with the corresponding subroutines in Barrelfish, and then changing the way the Cilk scheduler spawns worker threads on multiple cores. The main point of the porting, however, is to let different cores access the same virtual address space. Fortunately, Barrelfish provides the notion of a domain, which specifies the cores belonging to an application so that these cores can share the same memory space. This thesis also carries out benchmarks on several Cilk programs and finds that Cilk does not perform as well as expected. In addition, measurements with parallel workers show that Cilk on Barrelfish takes more cycles to perform its computation. Although Cilk still maintains the work-first principle, it cannot achieve its time bound. The domain-spanning cost is proportional to the number of cores, and it matters for applications that take little time to complete.

Key words: Barrelfish, Cilk, porting, multikernel, shared-memory, work-stealing, message passing

Acknowledgment

I would like to show my gratitude to Professor Mats Brorsson who has helped me by giving such good advice at the very first steps of this thesis and has been patient with my silly issues.

This thesis has received significant supervision from PhD researcher Georgios Varisteas. I am deeply thankful to him for his suggestions, his instructions, and his experience in this field.

Finally, the special thanks go to my family and friends for their love and support through the duration of my studies.

Stockholm, March 20, 2013
Chau Ho Bao Le


Contents

1 Introduction
  1.1 Overview
  1.2 Problem statement
  1.3 Related work
  1.4 Report layout

2 Background
  2.1 OSes on multiple processors
    2.1.1 Factored Operating System (fos)
    2.1.2 Tessellation
    2.1.3 Barrelfish
  2.2 Parallel programming models
    2.2.1 Shared memory
      2.2.1.1 Task-centric or task-based model
      2.2.1.2 Explicit threading
    2.2.2 Message passing
  2.3 Porting

3 Barrelfish OS
  3.1 Introduction
    3.1.1 Overview
    3.1.2 Multikernel structure
  3.2 Conceptions and Notions
  3.3 Building, Compiling and Booting
    3.3.1 Building
    3.3.2 Compiling
    3.3.3 Booting
  3.4 Summary

4 Cilk
  4.1 Brief Overview
  4.2 Compiling
    4.2.1 Compilation process
    4.2.2 Compilation strategy
  4.3 Scheduling
    4.3.1 Work-stealing scheduler
    4.3.2 Implementation
  4.4 Summary

5 Porting Cilk to Barrelfish
  5.1 Challenges
  5.2 Multithreaded Model
    5.2.1 Cilk on the original platform
    5.2.2 Cilk on Barrelfish OS
  5.3 Modifications on Cilk
    5.3.1 Compile time
    5.3.2 Runtime
  5.4 Modifications on Barrelfish
    5.4.1 Hake
    5.4.2 Makefile
  5.5 Summary

6 Benchmarks
  6.1 Environment settings
  6.2 Measurements
    6.2.1 Measurements of serial applications
    6.2.2 Measurements of Cilk applications
  6.3 Experiments
  6.4 Evaluation

7 Conclusion
  7.1 Contribution
  7.2 Future Work

List of Figures

3.1  The multikernel structure
3.2  Barrelfish structure

4.1  A serial C and a Cilk program to compute the nth Fibonacci number
4.2  The Cilk dag computing the 3rd Fibonacci number
4.3  The compilation process of a Fibonacci program
4.4  Runtime data structures of a deque
4.5  Interactions between thief and worker in the three cases

5.1  Model of shared-memory scheduler in Cilk
5.2  Multithreaded model of Cilk scheduler in its original platform
5.3  The model of Cilk scheduler in Barrelfish
5.4  Model of Cilk scheduler in Barrelfish with a domain
5.5  Multithreading model of Cilk scheduler in Barrelfish
5.6  Compilation progress of a Cilk program in Barrelfish
5.7  wait/notify mechanism to replace POSIX create and join

6.1  Cilk application invokes the runtime library
6.2  Spanning domain overheads over cores
6.3  Comparison of TW of cilksort on Barrelfish and Linux
6.4  Speedup vs. serial versions of cilksort
6.5  Thread distribution over 8 cores of cilksort
6.6  Comparison of TW of FFT on Barrelfish and Linux
6.7  Speedup vs. serial versions of FFT
6.8  Thread distribution over 8 cores of FFT
6.9  Comparison of TW of fib on Barrelfish and Linux
6.10 Speedup vs. serial versions of fib
6.11 Thread distribution over 8 cores of fib
6.12 Comparison of TW of LU on Barrelfish and Linux
6.13 Speedup vs. serial versions of LU
6.14 Thread distribution over 8 cores of LU
6.15 Comparison of TW of matmul on Barrelfish and Linux
6.16 Speedup vs. serial versions of matmul
6.17 Thread distribution over 8 cores of matmul
6.18 Comparison of TW of strassen on Barrelfish and Linux
6.19 Speedup vs. serial versions of strassen
6.20 Thread distribution over 8 cores of strassen

List of Tables

6.1  Hardware configurations of the virtual machine
6.2  Spanning domain cost
6.3  Execution time of 6 serial Cilk programs
6.4  Measurements of cilksort on Barrelfish and Linux
6.5  Number of steals with 8 workers of cilksort
6.6  Number of threads spawned in 8 workers of cilksort
6.7  Measurements of FFT on Barrelfish and Linux
6.8  Number of steals with 8 workers of FFT
6.9  Number of threads spawned in 8 workers of FFT
6.10 Measurements of fib on Barrelfish and Linux
6.11 Number of steals with 8 workers of fib
6.12 Number of threads spawned in 8 workers of fib
6.13 Measurements of LU on Barrelfish and Linux
6.14 Number of steals with 8 workers of LU
6.15 Number of threads spawned in 8 workers of LU
6.16 Measurements of matmul on Barrelfish and Linux
6.17 Number of steals with 8 workers of matmul
6.18 Number of threads spawned in 8 workers of matmul
6.19 Measurements of strassen on Barrelfish and Linux
6.20 Number of steals with 8 workers of strassen
6.21 Number of threads spawned in 8 workers of strassen

Chapter 1: Introduction

1.1 Overview

Computer hardware has advanced for years: computers have shrunk from big machines to small ones, processors have moved from single-core to many-core, and the hardware landscape has grown more diverse. New processors and heterogeneous hardware have led to a demand for scalable operating systems (OSes) that adapt to such environments; the multikernel architecture [8] has therefore arisen as one OS concept for scalable parallel computers. In this architecture the OS is considered to be distributed, so it inherits benefits from distributed systems such as heterogeneity, large-scale operation and low communication latency. Barrelfish OS is an instance of the multikernel model.

As the von Neumann model of sequential programming is not appropriate for HPC, parallel programming models have emerged to exploit the underlying hardware and scalable OSes, letting programmers express parallelism at a high level in the programming language. One well-known model is the task-centric model, which uses shared memory for interaction between threads. Multithreaded, shared-memory programming models are known to perform well on multicore machines running the operating systems they were developed for. One representative of this model is Cilk [19]. In Cilk, the execution unit is a task, which is distributed across cores by a work-stealing scheduler designed for shared-memory machines. To combine the benefits of a scalable OS and a parallel programming model, the idea is to merge the two into one paradigm: running a multithreaded, shared-memory application on a multikernel. This thesis implements this idea by porting Cilk to Barrelfish OS.

1.2 Problem statement

Since the Cilk programming language was not originally developed for a multikernel, the porting in this thesis makes Cilk applications execute in Barrelfish OS. The implementation addresses the following concerns:

• Porting proposes a paradigm that allows Cilk to run on top of Barrelfish, and takes on the challenges of modifying both Cilk and Barrelfish, from compile time to runtime.

• Porting ensures that the shared-memory programming model works correctly in a message-based OS, that is, Cilk's scheduler performs properly and causes no issues for Barrelfish.

• Porting investigates Cilk's efficiency and what level of performance can be achieved.

1.3 Related work

WOOL [23], another multithreaded language, also belongs to the task-centric model. WOOL is a small, lightweight C library in which tasks are managed at user level. Its scheduler employs a work-stealing mechanism and exhibits good performance as well. Cilk and WOOL share some common features, hence porting WOOL to Barrelfish would be a good way to compare their performance.

1.4 Report layout

The remainder of this thesis is organized as follows. Chapter 2 introduces the background of the topic. Chapter 3 describes the Barrelfish architecture and outlines some of its concepts and notions. Chapter 4 discusses the Cilk language and demonstrates how it works. Chapter 5 documents the porting of Cilk to Barrelfish and lists the changes on both sides. Chapter 6 evaluates the performance of some Cilk applications after porting, and Chapter 7 concludes.

Chapter 2: Background

As problems get bigger, multicore architectures have become the trend for performing complex computations. The emergence of multicore architecture has promoted the development of processors, including heterogeneous ones, and more scalable operating systems have been created to meet the rapid change of hardware. However, multiple cores and OSes for scalable machines are not by themselves sufficient: the workload should also be distributed across the cores as much as possible, with the help of parallel programming models.

2.1 OSes on multiple processors

As computer hardware has developed over decades, OSes have needed to change in order to utilize the advanced technology. Traditional OSes like Unix and Windows offer high performance with a small number of homogeneous cores, but they do not work well on scale-up computers [13]. Hence, OSes designed for multicore architectures are necessary; they address problems that traditional OSes struggle with, such as hardware diversity, cache coherence and inter-core communication. The era of such OSes is around the corner, and several implementations are already at hand.

2.1.1 Factored Operating System (fos)

fos [11] is an operating system for scalable multicore, manycore, cluster and cloud systems; it uses space sharing instead of time sharing to increase scalability. As its name implies, fos factors each operating system service into a set of servers which are designed similarly to distributed services. These servers, which implement traditional kernel services, do not contend with end-user applications over implicit resources because they are bound to different processing cores. System servers use message passing to communicate.

2.1.2 Tessellation

Tessellation [12] is a manycore OS which supports resource management, including real-time and QoS guarantees. The OS is structured around Space-Time Partitioning (STP) and Two-Level Scheduling. In Tessellation, parallel software components are encapsulated in an abstraction called a Cell; STP ensures isolation and partitioning of Cells. Two-Level Scheduling consists of distributing resources to Cells as the first level and user-level scheduling within a Cell as the second level. An application is decomposed into components and services running in Cells, and these Cells communicate via secure channels.

2.1.3 Barrelfish

Barrelfish [8] is inspired by the multikernel structure and demonstrates good features such as hardware diversity, dynamicity and scalability. Each core hosts a kernel which manages the hardware components belonging to that core; each core is thus an independent node in a connected network, and the whole OS can be viewed as a distributed system. All components in Barrelfish communicate via messages, which sidesteps the problems of cache coherence and locks. Unlike traditional OSes, Barrelfish ties the notion of an application to dispatchers: one dispatcher of an application is created on each core it uses. Although Barrelfish is a message-based OS, it supports shared-memory applications through a logical notion called a domain.

2.2 Parallel programming models

In the past, parallelization involved threads and locks manipulated at a low level; programmers had to take care of mapping computational tasks onto processors. The trend toward multicore processors has brought parallel programming models that exploit the underlying parallel systems. There are many such paradigms, but they can be classified into two groups: shared memory and message passing.

2.2.1 Shared memory

In the shared-memory model, multiple programs can access the same memory simultaneously, hence shared memory can be seen as a means for programs to pass data to each other. Several models fall into this type:

2.2.1.1 Task-centric or task-based model

Task-centric [22] parallelism is an explicit programming model in which programmers are free to divide work into tasks and define synchronization points. A task is the unit of parallelism. A synchronization point is a statement that suspends execution until all previously spawned tasks have completed. The model helps programmers focus on exposing concurrency, because the program structure stays close to the sequential version. Task-centric models include Cilk, WOOL, OpenMP 3.0, TBB, etc.
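As a concrete illustration, here is a minimal sketch of the model in the OpenMP 3.0 style named above (illustrative, not from the thesis; Cilk's own syntax appears in Chapter 4). Each pragma creates a task, and taskwait is the synchronization point.

#include <omp.h>

/* Task-based Fibonacci: each recursive call becomes a task. fib() is
   meant to be called from inside a "#pragma omp parallel" /
   "#pragma omp single" region. */
int fib(int n)
{
    int x, y;
    if (n < 2) return n;
    #pragma omp task shared(x)
    x = fib(n - 1);
    #pragma omp task shared(y)
    y = fib(n - 2);
    #pragma omp taskwait   /* suspend until both child tasks complete */
    return x + y;
}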

2.2.1.2 Explicit threading

Explicit threading [20] requires the programmer to handle operations on threads: creating them, associating them with functions, and synchronizing and controlling the interactions among them. Thread management is done with explicit API invocations at the instruction level. Prominent libraries in use today include pthreads, Java threads and Windows threads.

2.2.2 Message passing

In this model, objects interact with each other by sending and receiving messages. The communication among objects can be either synchronous or asynchronous. Message passing can be a crucial part of a language (Erlang) or appear as library calls (MPI, multikernel OSes, etc.). There is no shared-memory concept in message passing, and messages cost less than shared memory, hence the trend toward message-based implementations is dominant, especially in distributed systems. Message passing aids scalability because it requires no locks and no cache coherence [8].

2.3 Porting

This thesis investigates the feasibility of applying a parallel programming model to a scalable operating system by porting Cilk to Barrelfish. The main point of the porting is to make a shared-memory work-stealing scheduler operate on a message-based system with heterogeneous hardware. Additionally, this thesis evaluates the behavior of Cilk on the new platform.


Chapter 3: Barrelfish OS

3.1 Introduction

3.1.1 Overview

Current OSes are considered to have some disadvantages because they might lag behind the development of cores and other hardware components in the future [4]. For that reason, the multikernel architecture [8] has emerged as one proposal for the problem of hardware heterogeneity. The main idea of this model is that the OS should operate separately from the underlying hardware, speed up the communication among cores and make use of distributed-system techniques. One implementation of the multikernel is Barrelfish OS [8], a cooperative product of ETH Zurich and Microsoft Research, inspired by the multikernel concept to meet the challenges of scalability and hardware diversity.

3.1.2 Multikernel structure

In an environment of different cores, the multikernel [8] design considers each core an independent node in an interconnected network; nodes communicate with their peers by messages and share no memory. The multikernel structure (Figure 3.1) follows these principles:

• Explicit inter-core communication: cores communicate with each other by messages; there is no shared-memory concept.

• Hardware neutrality: the OS structure is separated from the hardware.

• Replicated state: the OS state is replicated and kept consistent by messages.

Instead of having one OS running on top of all cores, each core runs a mini OS, or kernel, which manages its own hardware resources and communicates and exchanges data with the others. From this point of view a core can be considered a small OS, the whole machine appears to host a complete OS, and the OS as a whole is distributed [4]. This design allows tasks to be better distributed among the cores and utilizes the processors more efficiently.

Figure 3.1: The multikernel structure. Source [8]

3.2 Conceptions and Notions

Since Barrelfish employs the multikernel structure, the whole system can be viewed as a network of interconnected, independent OS nodes. Like other conventional OSes, Barrelfish divides its virtual memory into user space and kernel space:

• user space: manages local and distributed operations

• kernel space: manipulates operations relating to hardware components

Figure 3.2 depicts the overview of Barrelfish structure. In what follows, this thesis will discuss some concepts and notions of Barrelfish.

CPU driver

Each kernel space has one CPU driver, which handles operations that access the core and other hardware components. The CPU driver receives requests from user level, and interrupts from local devices or other cores, as traps, so that they can be treated as events. A CPU driver shares no OS state with the others and is event-driven, single-threaded and nonpreemptable [8]. There is no direct communication among CPU drivers [1].

Monitor

There is one monitor per CPU driver, but it runs at user-space level. Since a CPU driver is stateless, its monitor takes care of state coordination through communication.

Figure 3.2: Barrelfish structure. Source [6, 8]

The monitor not only controls the communication within its core and with other cores [1], but also maintains global consistency through a set of agreement protocols [8].

Dispatcher

Unlike a monolithic OS, Barrelfish ties the notion of dispatchers to applications [7]: a dispatcher runs and schedules the threads of an application. When an application spreads over multiple cores, the same number of dispatchers is created, one linked to each core; together these related dispatchers build up a domain. When a new dispatcher is added to an existing domain, all dispatchers there need to be rescheduled, and the CPU driver performs this task by means of upcall entry points. Though dispatchers in the same domain share a virtual address space, they use messages to communicate with each other [1]. A dispatcher has two modes, enabled and disabled: when it runs user thread code it is enabled; when it runs dispatcher code it is disabled [1].

Message passing

According to Baumann et al., message passing has advantages over shared memory in several respects, and Barrelfish exploits the message-passing style for communication between components. This model not only avoids the problem of using locks but also utilizes the interconnect and reduces latency. There are two types of messages:

• intra-core communication: messages stay on the same core and are handled by the CPU driver.

• inter-core communication: messages are transferred between cores and controlled by the monitor.

Both types of messages are accessible to user processes through the Barrelfish library.

Application

An application in Barrelfish is actually a set of related dispatchers, called a domain. There might be several dispatchers per core, but a domain has at most one dispatcher on a given core. Dispatchers control the threads spawned on one or more cores: threads on one core are scheduled by the local dispatcher, while threads on different cores interact through their dispatchers. Barrelfish provides a library of POSIX-like thread APIs for user applications [8]. Even though Barrelfish is based on the message-passing model, it does support shared-memory applications through a common virtual address space shared by all dispatchers in a domain. An application can request more cores during its execution through domain spanning, using the available APIs [7]. Chapter 5 goes into the details of this matter.

Memory management

In Barrelfish, the memory unit of a core exists locally but needs to be managed globally. Any request for memory from a process is served when the CPU driver invokes operations through the capability interface. Barrelfish uses capabilities to track the ownership of objects in memory. Capabilities are user-level references to kernel objects or physical memory and are manipulated by system calls with retype and revoke operations. Barrelfish does not manage virtual memory the way other OSes do; in particular, user-level code has to take care of memory allocation and deallocation through page tables [8].

System Knowledge Base (SKB)

Barrelfish stores information about the underlying hardware and its current state in a repository called the System Knowledge Base (SKB). The OS and user applications can access this service through messages [8]. The knowledge base is consulted to (1) manage core diversity, (2) determine devices on the interconnect and (3) improve cache sharing [5].

3.3 Building, Compiling and Booting

3.3.1 Building

There are two main directories for building Barrelfish: source and build. Hake [3] populates the build directory from the source by reading all the Hakefiles scattered through the source tree and generating their rules into one Makefile stored in the build directory. This Makefile contains a set of directives from which the necessary targets can be built. Since Hakefiles are specified as Haskell expressions, editing these expressions modifies the build targets (refer to [3] for more details). Here are example Hakefiles for a library and an application:

[ build library { target = "mylib",
                  cFiles = [ "debug.c", "decode.c", "util.c" ],
                  addCFlags = [ "-Wno-shadow" ] } ]

[ build application { target = "hello-cs",
                      cFiles = [ "hello.c" ],
                      addLibraries = [ "mylib" ] } ]

3.3.2 Compiling

Barrelfish is a research OS still under development, hence the OS and all of its applications must be compiled at the same time [10]. As mentioned above, the Makefile includes the directives for these applications, which are compiled by gcc together with the Barrelfish source. Up to the time of this thesis, Barrelfish supports only static executables: all images of the OS itself and of other applications must be static so that the binaries can be loaded into memory by the linker and loader.

3.3.3 Booting

When Barrelfish boots [9], the appropriate files need to be in static binary form. A file called menu.lst lists the locations of these binaries. One also needs to specify, along with each binary, the core involved in the boot process as an argument. Continuing the example from section 3.3.1, these lines [9] are required in the menu.lst file to boot the hello-cs application on two cores:

# General user domains
module /x86_64/sbin/hello-cs core=0 server
module /x86_64/sbin/hello-cs core=1 client

First of all, the boot process starts on core 0; the boot module spawnd on that core then boots the other cores. There are two ways to start an application on multiple cores: one can either specify the cores to be spawned on in the menu.lst file, or let the user program control the number of spanned cores through the domain notion.

3.4 Summary

ETH Zurich and Microsoft Research have introduced Barrelfish OS, a new platform for operating on multi-core environments. Barrelfish follows the multikernel structure [8]: mini OSes run on top of the cores and together form a distributed OS, so Barrelfish inherits advantages of this model such as scalability and diversity [4]. Although message passing is the mainstream way for Barrelfish components to communicate, the OS supports multithreaded, shared-memory applications. This is feasible thanks to dispatchers, because they can schedule their own threads as well as support memory synchronization and thread migration among them [7].

Chapter 4: Cilk

This chapter discusses Cilk [19], a multithreaded language designed at MIT for general-purpose programming on shared-memory architectures. The heart of Cilk is its scheduler, which implements a work-stealing algorithm [14, 19]; together with the Cilk compiler, the scheduler accounts for Cilk's efficiency and good performance [15]. There are two kinds of overhead in a Cilk implementation: work and critical-path length. Unlike previous versions, Cilk-5 [14] makes full use of the work-first principle by moving overheads from the work to the critical path.

4.1 Brief Overview

Cilk is a parallel programming language that follows ANSI C rules with some additional keywords, cilk, spawn and sync [19], plus two more advanced keywords, inlet and abort, for nondeterministic programs. Removing these keywords and running on one processor yields the serial elision, or C elision, of a Cilk program; every Cilk program has an elision, and the two have the same semantics.

• cilk: identifies a Cilk function; without the keyword, it becomes a standard C function.

• spawn: indicates asynchronous execution of a Cilk procedure: when a parent function spawns a child, the parent executes in parallel with the child instead of suspending itself. Whenever a thread performs a spawn, it also spawns a thread called the successor thread to handle the return value from the child. The keyword spawn applies only to cilk procedures.

• sync: ensures the return values of spawned threads can be used properly, by forcing the calling thread to wait until all its child threads have returned. The spawn and sync keywords together specify parallelism and synchronization, while the compiler and runtime system take charge of the scheduling.

• inlet: defines a function inside a Cilk procedure to guarantee atomicity, by preventing several procedures from changing the same variables concurrently.

• abort: nests inside an inlet to safely abort already-spawned procedures.

One typical example of a Cilk program is the program to compute the nth Fibonacci number. The Cilk program is shown in Figure 4.1 (b), and 4.1 (a) shows its serial elision.

(a) serial C version:

int fib (int n) {
    if (n<2) {
        return (n);
    } else {
        int x,y;
        x = fib(n-1);
        y = fib(n-2);
        return (x+y);
    }
}

(b) Cilk version:

cilk int fib (int n) {
    if (n<2) {
        return (n);
    } else {
        int x,y;
        x = spawn fib(n-1);
        y = spawn fib(n-2);
        sync;
        return (x+y);
    }
}

Figure 4.1: (a) A serial C program and (b) a parallel Cilk program to compute the nth Fibonacci number. Source [17, 19]

A Cilk program comprises procedures, and a procedure contains a series of ordered, nonblocking threads. The execution can be viewed as a directed acyclic graph, or dag, built from procedures as rounded rectangles, threads as vertices, and operations such as spawn, return or continuation as edges. Procedures have levels in the dag: the initial thread and the final thread are at level 0, and a thread spawned by a thread at level n is at level n + 1. By following all the dependencies in the dag, the Cilk scheduler can accomplish a correct execution order. For the Fibonacci example, Figure 4.2 shows the dag of fib(3); the colors of the Cilk program in Figure 4.1 (b) indicate the three types of threads in the algorithm.

Figure 4.2: The Cilk dag computing the 3rd Fibonacci number (procedures at levels 0–2; the edges are spawns, returns and continuations). Source [15, 17]

4.2 Compiling

4.2.1 Compilation process

Cilk provides the cilkc command to compile Cilk programs into executable files. This command drives the compilation process in two main steps: (1) .cilk files are transformed into .c files by the Cilk type-checking preprocessor cilk2c; (2) gcc compiles these .c files into object files and then links them with the Cilk runtime libraries. Figure 4.3 shows the process of compiling a Cilk program.

Figure 4.3: The compilation process of a Fibonacci program: the source-to-source translator cilk2c turns fib.cilk into fib.c, gcc compiles it to fib.o, and the linking loader ld links it with the Cilk runtime system into fib. Source [18, 19]

Cilk supports both static and dynamic executables. By default, cilkc generates a dynamic binary, but programmers can request a static one with the -static argument.
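For instance, a static binary of the Fibonacci program might be produced with a command along these lines (illustrative; see [19] for the full set of cilkc flags):

> cilkc -static -o fib fib.cilk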

4.2.2 Compilation strategy

In Cilk-5, both the compiler and the runtime system take part in scheduling, to alleviate overheads with respect to the work-first principle. The compiler generates two clones of each procedure:

• a fast clone: the serial version of the procedure, with little support for parallelism. When a procedure is spawned, the fast clone is invoked. In the fast clone, a spawn is translated into an activation frame, which can be considered a procedure instance, and a sync statement compiles to a no-op.

• a slow clone: the parallel version, with full support for parallelism. When a procedure is stolen from the victim, the thief resumes it as the slow clone. The slow clone translates a spawn the same way as the fast clone, except that the activation frame contains the local variables to restore on resumption. For a sync statement, the slow clone checks for outstanding children; if there are any, the procedure suspends and work-stealing starts.
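To make the strategy concrete, the sketch below shows roughly what cilk2c emits as the fast clone of fib; it is adapted from the pseudocode in the Cilk-5 paper [14], and the helper names (alloc, fib_sig, pop) are simplified stand-ins rather than the real runtime API.

/* Fast clone of fib, adapted from the Cilk-5 paper [14] (simplified). */
int fib(int n)
{
    fib_frame *f = alloc(sizeof(*f));  /* activation frame for this instance */
    f->sig = fib_sig;                  /* tells the slow clone how to resume */
    if (n < 2) {
        free(f);
        return n;
    } else {
        int x, y;
        f->entry = 1;                  /* save the continuation point */
        f->n = n;                      /* save the live variables */
        *T = f; T++;                   /* push the frame: now stealable */
        x = fib(n - 1);                /* the spawn becomes a plain C call */
        if (pop() == FAILURE)          /* THE-protocol pop, section 4.3.2 */
            return 0;                  /* frame stolen: slow clone finishes */
        /* ...second spawn analogous; in the fast clone, sync is a no-op */
        free(f);
        return (x + y);
    }
}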

Since slow clones are more expensive than fast clones, converting a procedure to its slow clone only when it is stolen helps minimize the overheads. The runtime system then links the two clones into the whole Cilk implementation through a set of protocols.

4.3 Scheduling

The Cilk scheduler employs a work-stealing mechanism to perform load balancing among multiple processors, which is efficient both empirically and analytically. The scheduling uses a protocol called THE, which exploits the work-first principle to minimize work overheads (the execution time of a Cilk program is measured through work and critical-path variables to achieve accuracy).

4.3.1 Work-stealing scheduler

Work stealing is an algorithm for distributing workload across processors in which idle processors become thieves and try to steal tasks from randomly chosen processors, called victims. The Cilk-5 scheduler also follows the work-first principle, moving overheads from the work to the critical path. The technique achieves the following time bounds:

• Provably: TP = T1/P + O(T∞)  (expected time)

• Empirically: TP ≤ T1/P + c∞ · T∞

where:

• TP: the execution time on P processors
• T1: the work (the execution time on one processor)
• T∞: the span (critical-path length, or computational depth)
• c∞: the critical-path overhead
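For a sense of scale (numbers invented for illustration): a computation with T1 = 10^9 cycles of work and T∞ = 10^6 cycles of span has parallelism T1/T∞ = 1000. On P = 8 processors the work term dominates, giving T8 ≈ 10^9/8 + O(10^6) ≈ 1.3 × 10^8 cycles, i.e., nearly linear speedup, because P is far smaller than the parallelism.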

4.3.2 Implementation

The scheduler is implemented on top of shared memory and is based on a Dijkstra-like mutual-exclusion protocol. In Cilk-5, each processor, or worker, holds a ready deque (double-ended queue). The ready deque contains a list of ready activation frames (an activation frame can be seen as a procedure instance and is ready when all its arguments are supplied, that is, when all its predecessors have executed). A deque has a head and a tail as its two ends. The tail operates like a stack, from which the worker pushes and pops tasks; the head is where a thief steals tasks. The worker pushes a frame onto the tail when spawning and pops it afterwards. Figure 4.4 illustrates the runtime data structures of a deque in Cilk-5.

Since Cilk-5 is developed for shared-memory machines, the work-stealing scheduler employs Dijkstra's protocol for mutual exclusion: thief and victim may access the same memory simultaneously. The protocol is called THE because it uses three shared variables, T (tail), H (head) and E (exception). Since all deques are visible to all workers, conflicts can arise: (1) one victim is chosen by several thieves, but only one of them may steal from it; (2) the worker and a thief try to remove the same frame.

Figure 4.4: Data structures of a deque in one processor at runtime. In Cilk 5.4.6 a deque comprises closures; each closure contains a pointer to an activation frame as well as other information such as a lock and a join counter (the number of outstanding children). H and T are the indices of the head, where the thief steals, and the tail, where the worker pushes and pops locally; frames lie at indices X with H ≤ X < T.

Locking

Cilk provides ReadyDeque locks and Closure locks for mutual exclusion, to obtain atomicity. A processor must grab the ReadyDeque lock first to work with a deque, and grab the Closure lock to steal a task.

THE protocol

The implementation of the protocol is based on two assumptions:

• reads and writes are atomic

• shared memory is sequentially consistent (sequential consistency guarantees that all memory operations of a processor appear in program order)

In the simplified THE protocol, shown in Figure 4.5, the head and the tail are indexed by H and T respectively, where T ≥ H. Only the thief increases the index H, because the thief steals tasks from the head, whereas the worker alters T, because it pushes and pops tasks at the tail. Intuitively, a push operation is safe because a worker works locally on its tail. In contrast, a pop operation falls into three cases. Let N denote the number of frames in a deque.

• Case (a): N > 1. It is safe for the thief and the victim to extract frames from the two ends.

• Case (b): N = 1. If the thief and the victim attempt to get the frame simultaneously, one of them will detect that H > T. If the detector is the thief, H is reset and the steal fails; if it is the victim, it grabs the ReadyDeque lock L and then pops the frame.

• Case (c): N = 0. Both the thief and the victim fail.
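Both sides of the deque protocol can be written down compactly. The sketch below follows the simplified protocol of the Cilk-5 paper [14]; the exception variable E and the frame manipulation itself are omitted, and lock()/unlock() operate on the ReadyDeque lock L.

/* Victim (worker): pop a frame from the tail */
T--;                          /* optimistically claim the tail frame */
if (H > T) {                  /* possible conflict with a thief */
    T++;                      /* back off and arbitrate under the lock */
    lock(L);
    T--;
    if (H > T) {              /* deque really is empty */
        T++;
        unlock(L);
        return FAILURE;
    }
    unlock(L);
}
return SUCCESS;               /* frame popped safely */

/* Thief: steal a frame from the head */
lock(L);
H++;                          /* optimistically claim the head frame */
if (H > T) {                  /* deque empty, or the victim won the race */
    H--;
    unlock(L);
    return FAILURE;
}
unlock(L);
return SUCCESS;               /* frame stolen */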

Figure 4.5: Interactions between thief and worker in the three cases (a), (b) and (c). Shaded squares illustrate frames; H and T are the head and the tail of the deque.

The full THE protocol is achieved by adding the exception variable E. The thief uses E as a prerequisite for managing the steal, while the worker compares T against E instead of H. In accordance with the work-first principle, work overhead is minimized in several ways: first, pushing incurs no overhead and popping only a few operations; second, the worker caches H and T exclusively; finally, the overhead for a victim to acquire L is charged to the critical path.

Frame

Stealing occurs whenever a worker runs out of work; however, the thief does not always get a frame. Cilk-5 implements a deque as a set of closures, and each closure holds a frame. A closure is in one of four states, and the state decides whether or not a steal is successful.

• CLOSURE_RUNNING: the worker is working on that closure.

• CLOSURE_SUSPENDED: the closure is suspended, waiting on its outstanding children, and cannot be stolen.

• CLOSURE_READY: the closure is ready for a thief to steal. This is the only valid state for stealing.

• CLOSURE_RETURNING: the closure holds return values.
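Rendered in C, the four states could look like the following enum (a hypothetical rendering; the definitions in the Cilk 5.4.6 sources may differ in names and detail):

/* Closure states; the comments restate the descriptions above. */
enum ClosureStatus {
    CLOSURE_RUNNING,    /* a worker is currently executing this closure  */
    CLOSURE_SUSPENDED,  /* waiting on outstanding children; not stealable */
    CLOSURE_READY,      /* the only state in which a steal succeeds      */
    CLOSURE_RETURNING   /* carrying return values back to its parent     */
};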

4.4 Summary

Cilk is a multithreaded programming language that follows ANSI C rules and is designed for shared-memory machines. Cilk is known to deliver high performance in parallel applications. Like its predecessors, Cilk-5 obeys the work-first principle, minimizing work overhead by moving it to the critical path; unlike them, the Cilk-5 compiler and runtime system both take part in scheduling to exploit this principle. In the compilation strategy, each procedure is translated into a slow clone (used when stolen) and a fast clone (used when spawned), because slow clones cost more than fast clones and would otherwise contribute considerably to the work overhead. The Cilk scheduler owes its good behavior to an efficient randomized work-stealing algorithm. The runtime system uses a Dijkstra-like mutual-exclusion protocol, the THE protocol, to implement the ready deques and perform work-stealing among processors. The work-first principle also serves as a guide for further optimization.


Chapter 5: Porting Cilk to Barrelfish

Barrelfish is a multikernel OS in which a message-based style is used for communication among components, while Cilk is a programming language designed for shared-memory architectures. Both operate on multiple cores and expose good performance in their original implementations. The idea of porting Cilk to Barrelfish is to see how Cilk behaves on the new platform and to evaluate what level of performance Cilk can accomplish there.

5.1 Challenges

The aim of this thesis is to adapt Cilk to Barrelfish, that is, to compile and execute a Cilk application on top of Barrelfish. The porting answers the question of how Barrelfish accommodates a shared-memory model like Cilk's. In Barrelfish user space, each domain has its own virtual address space. A virtual address space can be shared either by sharing hardware page tables or by synchronizing hardware page tables through messages among dispatchers. Additionally, the Barrelfish user environment includes a standard C library as well as a set of APIs for POSIX-like threads [8]. Hence, porting a C-based shared-memory framework to Barrelfish is straightforward. Even though the above discussion shows that the porting is possible, there are challenges, since Cilk was originally developed for Linux, Windows and MacOS, which widely support shared memory.

• Barrelfish is an experimental OS, therefore it must be compiled together with its applications [10].

• Barrelfish is designed as a multikernel architecture, completely different from the OSes for which Cilk was initially implemented.

• When running a Cilk application on Barrelfish, programmers must take care of thread creation and synchronization on multiple cores, as well as memory allocation and deallocation, since threads do not automatically migrate among cores and Barrelfish leaves memory deallocation to user-level code (it has no memory-management mechanism such as garbage collection).

The porting is made up of these steps to address the challenges:

1. Compile the Cilk runtime library. This requires major changes to data structures and functions, including additions, deletions and replacements.

2. Modify the Barrelfish build component Hake to plug in the Cilk compiler cilkc.

3. Convert Cilk's fork/join scheduling model into a new one that Barrelfish can accept.

4. Adjust the Cilk configuration to display runtime statistics.

The rest of this chapter introduces both the original and the compatible model, then describes the modifications on Cilk as well as on Barrelfish.

5.2 Multithreaded Model

The main point of the porting is to embed the Cilk runtime library in Barrelfish. The Cilk scheduler uses a shared-memory algorithm, while Barrelfish supports a message-passing style. Barrelfish provides the logical concept of a domain, which determines the boundary of an application; domain spanning allows an application to take in more cores during its execution, after which these cores can share one virtual address space. This section describes the original Cilk scheduler and then shows how to make the scheduler compatible with Barrelfish. All illustrations assume a 4-core machine.

5.2.1 Cilk on the original platform

In the Cilk-5 runtime, the scheduler maintains a global context consisting of a shared global state, which the workers update, and parameters, which the workers only read. At startup all the deques are loaded into memory (see Figure 5.1) and are manipulated through this context by the workers. The context is initialized when the application enters the library and is passed along throughout the run. When a thief performs work-stealing, it operates on the victim's deque, and the steal is realized by redirecting pointers to the closures obtained from the victim. The library uses pthread_create() to spawn workers and invokes pthread_join() to wait for the workers to terminate.

The runtime library includes one master thread that controls the scheduler, several threads acting as workers, and a bunch of threads performing the computation. The Cilk scheduler maps one worker to one thread and leaves the mapping of threads onto cores to the OS scheduler (Figure 5.2). Normally the number of workers equals the number of cores, but programmers can create as many workers as they wish, in which case an individual core or kernel-level thread executes several user-level threads. During an execution a worker thread sticks to one core, but computational threads can migrate among cores automatically to balance the workload.

Figure 5.1: Model of the shared-memory scheduler in Cilk on four cores. Each column indicates a deque; there is one deque per core, and all cores can access the four deques through a shared memory space.
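The worker life cycle on the original platform can be sketched as follows (CilkContext, scheduler_loop and MAX_WORKERS are simplified stand-ins for the real runtime types, not the actual Cilk-5 source):

#include <pthread.h>

#define MAX_WORKERS 64

typedef struct CilkContext CilkContext;   /* the shared global context  */
extern void *scheduler_loop(void *ctx);   /* worker body: work and steal */

/* Spawn one thread per worker, then wait for all of them; the OS
   scheduler decides which core each thread runs on. */
void run_workers(CilkContext *ctx, int nworkers)
{
    pthread_t tid[MAX_WORKERS];
    for (int i = 0; i < nworkers; i++)
        pthread_create(&tid[i], NULL, scheduler_loop, ctx);
    for (int i = 0; i < nworkers; i++)
        pthread_join(tid[i], NULL);
}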

Figure 5.2: Multithreaded model of the Cilk scheduler on its original platform: the OS scheduler maps the worker threads onto the cores.

5.2.2 Cilk on Barrelfish OS

In general, the Cilk-5 scheduler works on any SMP that supports cache coherence. Barrelfish is able to run in an SMP environment, but the situation is nevertheless different, because Barrelfish can be viewed as a distributed OS in which every core maintains its own memory and peripheral devices. A program starts on one core, and the dispatcher of that core handles the program's threads. The deques and other variables are loaded into memory, but only the first core can operate on them; that is, the global context is visible only to the first core. For a machine with four cores, the Cilk scheduler loaded into the system is illustrated in Figure 5.3. With the original runtime library, the dispatcher creates threads to act as workers, and these workers are scheduled onto one core and cannot automatically migrate to the other cores.

There are two ways to exploit the multikernel architecture: (1) dispatchers on the other cores use message passing to synchronize the virtual memory, or (2) the first core extends its domain to cover the other cores. With the first option, a program runs on multiple cores by having the dispatchers on different cores exchange messages to synchronize their virtual address spaces: for each update by one dispatcher, all dispatchers perform message passing to make these spaces look the same [7]. For the second option, Barrelfish provides the logical concept of a domain and supports primitives for domain spanning. Obviously the latter option takes less effort, thanks to the APIs readily at hand. As discussed under Dispatcher and Application in section 3.2, a domain defines the group of dispatchers (and hence cores) of an application; once the application requires more cores, the domain extends to them accordingly. Recall that a dispatcher, which is tied to one core, corresponds to one unified virtual address space and schedules the threads the application spawns on that core. When a new dispatcher is created and added to a domain, it points to the same vspace and cspace. Put differently, a domain looks like a shared-memory environment and makes all the shared variables available to all dispatchers, and hence to all workers (see Figure 5.4). All dispatchers of a domain get equal access to the virtual address space through messages with mem_server.

Figure 5.3: Model of the Cilk scheduler in Barrelfish on a four-core machine.

Although a shared-memory-like environment is created this way, the runtime still needs some modifications for the workers to operate normally. Since Barrelfish is distributed and all cores are independent, no upper layer sits on top of the cores to map threads onto them. Unlike other OSes, where multithreading is done with a few built-in functions, multithreading in Barrelfish is performed manually, either by (1) spawning one thread per core or by (2) spawning the threads on one core and then distributing them across cores. This thesis takes option (1), because Barrelfish allows creating a thread on a specific core with simple primitives, whereas thread migration involves either spinlocks or sending messages [7]. Figure 5.5 shows the multithreading model of the workers, in which one core manages one worker thread and no system scheduler is required. The model has the following features:

• There can be multiple workers per core, with one dispatcher handling multiple worker threads; a dispatcher can schedule its threads and hence maintain its workers.

Figure 5.4: Model of the Cilk scheduler in Barrelfish with a domain on a 4-core machine. A dispatcher knows which domain it belongs to, but it does not know its neighbors. All dispatchers of a domain get equal access to the virtual address space through messages with mem_server.

Figure 5.5: Multithreading model of the Cilk scheduler in Barrelfish.

• Threads of different workers execute and terminate independently; therefore the Cilk scheduler requires a method of thread synchronization among the workers.

• Worker threads are mapped directly to cores one by one at user-level code.

5.3 Modifications on Cilk

5.3.1 Compile time

Cilk library

Despite Barrelfish using message passing, concepts like mutexes and condition variables exist in both the Cilk and the Barrelfish library, though their structures differ. To make it safe for multiple threads to access the same resources simultaneously, Barrelfish also provides mutual-exclusion facilities and primitives, including spinlocks, semaphores, etc. Cilk utilizes the Pthread APIs while Barrelfish possesses a POSIX-like library, so the approach is to replace the POSIX subroutines with their POSIX-like counterparts. One solution for these adaptations is a Pthread wrapper covering both data types and functions. Features that are not supported in Barrelfish, such as pthread attributes, scheduling policies, other pthread functions and some architecture-specific subroutines, are simply eliminated: once ported to Barrelfish, each worker thread executes independently on one core, so thread priorities and lightweight processes are unnecessary. However, some built-in functions that obtain architecture-dependent system information at runtime cannot be replaced with similar ones in Barrelfish.
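A minimal sketch of such a wrapper is shown below, assuming the thread_mutex_*/thread_cond_* primitives of Barrelfish's <barrelfish/threads.h>; exact names and signatures may differ between Barrelfish releases.

#include <barrelfish/threads.h>

/* Map the POSIX types Cilk expects onto the Barrelfish equivalents. */
typedef struct thread_mutex pthread_mutex_t;
typedef struct thread_cond  pthread_cond_t;

static inline int pthread_mutex_lock(pthread_mutex_t *m)
{
    thread_mutex_lock(m);          /* Barrelfish primitive */
    return 0;
}

static inline int pthread_mutex_unlock(pthread_mutex_t *m)
{
    thread_mutex_unlock(m);
    return 0;
}

static inline int pthread_cond_wait(pthread_cond_t *c, pthread_mutex_t *m)
{
    thread_cond_wait(c, m);
    return 0;
}

/* Unsupported features (attributes, scheduling policy) are stubbed out,
   matching the "simply eliminate" approach described above. */
static inline int pthread_attr_init(void *attr) { (void)attr; return 0; }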

Cilk program

Barrelfish is not self-hosted, so the OS files and its applications must be compiled into static binaries and loaded into memory together. In accordance with this rule, the Cilk command cilkc first compiles a program to an object file, then links the object with the static Cilk library into one static executable. Figure 5.6 shows the compilation process.

Figure 5.6: Compilation process of a Cilk program in Barrelfish: cilkc compiles the Fibonacci program fib.c into fib.o and then links it with the Cilk runtime library, libcilk.a, into the static executable fib.

Furthermore, to display runtime system statistics, programmers need to append the flag -cilk-profile to the compile and link commands (refer to [19] for details of the Cilk flags). For example, enter this command to execute the program and obtain its statistics:

> fib --stats 2 30

5.3.2 Runtime

Threads

Cilk-5 implements thread management with pthread_create(), pthread_join() and some other pthread mutex subroutines; threads are automatically scheduled to execute on all cores, which bear the workload equally. In Barrelfish, by contrast, there is no support for POSIX thread management. To handle thread creation and synchronization, Barrelfish has a library of POSIX-like subroutines with which programmers can implement a wait/notify mechanism that mimics the behavior of create and join in POSIX (see Figure 5.7). The process comprises the following steps; a code sketch follows the list.

• Step 1: spanning the domain across cores.
  – The current core keeps its virtual memory space, and that space's boundary extends core by core; if the system has N cores, the spanning crosses N − 1 cores. To guarantee that every core has spanned, the master thread waits for a callback function after the spanning of each core completes; the suspension lasts until the master thread has received N − 1 signals.
  – The function domain_new_dispatcher(curr_core_nr, spanned_cb, ...) performs domain spanning in Barrelfish, where curr_core_nr is the id of the core to span to and spanned_cb is a pointer to the callback function that signals completion.

• Step 2: performing thread creation and synchronization.
  – The master thread invokes domain_thread_create_on() to create a thread on a specified core, and this thread in turn invokes a worker. The worker commences right after the invocation while the master thread continues to spawn the next one. This mechanism works fine with a small workload, but with a bigger workload it causes invalid memory accesses: workers other than the first may still keep references to variables that the first worker has already cleaned up. To synchronize, the port halts the worker threads until all of them have been spawned.
  – Since Barrelfish does not follow POSIX create and join, the master thread must suspend and wait for all the spawned threads to finish. There is no Barrelfish function like pthread_join(), so callback functions are needed: when a worker is about to terminate, it invokes a callback function that the master thread is waiting for.
  – In sum, Barrelfish supplies a function for thread creation, but programmers have to perform thread synchronization manually.

• Step 3: protecting the memory.
  – The OS allocates virtual memory in response to requests from applications, but it leaves memory deallocation to user-level code. Before a worker terminates, the worker thread must guarantee that all allocated memory is freed, using both free() and thread_exit(): the former empties memory obtained with alloc(), and the latter clears all memory the worker thread itself created during execution.
  – Another case to take care of arises when multiple workers compute a small workload. By the algorithm, the first worker commences its computation right after being spawned; if the workload is small, it finishes and terminates while the other workers have only just started and find nothing left for them. Pointers in the global context may still exist, however, and cause exceptions when those workers try to reach the now-invalid address space. Programmers must therefore keep track not only of memory allocation and deallocation but also of the validity of the global variables, to decide whether or not to access them.
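The sketch below outlines the three steps, assuming the Barrelfish APIs named above (domain_new_dispatcher, domain_thread_create_on); the counters, the callback and worker_main are hypothetical names, and a production version would use a mutex and condition variable rather than spinning on volatile counters.

#include <barrelfish/barrelfish.h>
#include <barrelfish/domain.h>

static volatile int spanned  = 0;   /* cores added to the domain so far */
static volatile int finished = 0;   /* workers that have terminated */

static void spanned_cb(void *arg, errval_t err)
{
    spanned++;                      /* step 1: one more core has joined */
}

static int worker_main(void *arg)
{
    run_scheduler(arg);             /* hypothetical: the Cilk worker body */
    /* step 3: free worker-local memory here, before terminating */
    finished++;                     /* step 2: notify instead of join */
    return 0;
}

void start_workers(int ncores, void *ctx)
{
    /* Step 1: span the domain over cores 1..N-1, wait for N-1 signals. */
    for (coreid_t c = 1; c < ncores; c++)
        domain_new_dispatcher(c, spanned_cb, NULL);
    while (spanned < ncores - 1)
        thread_yield();

    /* Step 2: one worker thread per spanned core; core 0 works too. */
    for (coreid_t c = 1; c < ncores; c++)
        domain_thread_create_on(c, worker_main, ctx);
    worker_main(ctx);
    while (finished < ncores)       /* wait/notify replaces pthread_join */
        thread_yield();
}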

Figure 5.7: The wait/notify mechanism replacing POSIX create and join. The master thread creates four worker threads, then suspends until it receives four completion callbacks.

Statistics

As discussed above, programs must be compiled with the flag -cilk-profile to display runtime system statistics. When the Cilk compiler receives the flag, it inserts profiling code during compilation so that the scheduler can collect system statistics at runtime. Nevertheless, this information could not be viewed right after porting: the host OS (Linux, for instance) compiled the program while the ported static library was used at runtime, so the profiling information could not be linked up in one pass. To switch these statistics on, some variables in the cilk-conf.h file have to be edited.

5.4 Modifications on Barrelfish

5.4.1 Hake

Barrelfish uses Hake to build the system: a single Makefile is the final result after Hake has collected the rules from all the Hakefiles in the source directory. The Makefile contains a list of commands by which the needed files are compiled with the default compiler, gcc. For example, for a Cilk Fibonacci program the generated command looks like:

gcc [gccFlags] [barrelfishFlags] -o fib.o -c fib.cilk

This command is invalid, because Cilk uses its own compiler, cilkc. In order to replace the default gcc with the desired compiler, we need to modify the Hake tool so that it accepts other compilers, and add the desired compiler in the Hakefile. As discussed in section 3.3.1, the structure of a Hakefile that builds an application is laid out as follows:

[ build application { target = "hello",
                      cFiles = [ "hello.cilk" ],
                      addCFlags = [ "-Wno-redundant-decls" ],
                      omitCFlags = [ "-fno-builtin" ],
                      addLibraries = [ "cilk" ],
                      addCompiler = "cilkc" } ]

The new option addCompiler compiles a specific module, such as Cilk in this context; programmers feed the name of the desired compiler into the option. Inside the Hake tool, the compiler name is applied in three main stages: (1) compiling a program to an object file (.o), (2) assembling an object to an assembly file (.S), and (3) linking to an executable. The Hake tool source, which lies under src/hake, contains the target files to edit; in the end we obtain the right command for the special program:

cilkc [cilkFlags] [barrelfishFlags] -o fib.o -c fib.cilk

5.4.2 Makefile

When Hake builds the Makefile of commands, it also adds some CFlags to these commands so that gcc compiles correct OS files for Barrelfish. For a specific module like Cilk, even though omitCFlags and addCFlags permit programmers to remove and add desired flags, the Cilk files still would not compile promptly, for two reasons. First, some flags cannot be modified through Hake. Second, editing Hake only applies to the .o and .S files while leaving the binaries unchanged, so the linker and loader would produce an unwanted file. This turns out to be another limitation of this thesis: such modifications had to be made manually to the Cilk commands.

5.5 Summary

Porting Cilk to Barrelfish means building up a shared-memory application on top of a message-passing OS. For Cilk to be compatible with the Barrelfish platform, the adaptation addresses both the compilation phase and the runtime model. Modifications were made to both Cilk and Barrelfish, but Cilk underwent more changes. The first step is to deal with the Cilk runtime library by replacing POSIX threads with Barrelfish subroutines through a wrapper, then to modify the arguments of the cilkc command so that it compiles a correct object that can be linked with the static Cilk library. The vital part of the port, however, is altering the thread-execution model in the scheduler. In the new scheme, all cores share one domain (and hence one virtual address space), and the Cilk scheduler spawns a worker directly on each core: one core hosts one worker thread, with no need to schedule threads across cores, so the scheme exploits the multikernel structure. Fundamentally, it pays to logically tie a shared-memory application to a group of related dispatchers, a domain, so that it runs on multiple cores and threads from dispatchers on different cores can access the shared memory.

Chapter 6

Benchmarks

The previous chapter proves the feasibility of porting Cilk to Barrelfish through a model which elaborates how Cilk can run on top of Barrelfish. To study the performance Cilk achieves, this chapter presents benchmarks of 6 Cilk programs executing on Barrelfish. The purposes of the benchmarks are, first, to show that the model works in terms of Cilk's scalability over cores and, second, to investigate the efficiency of Cilk by comparison with Cilk on Linux.

6.1 Environment settings

The benchmarks are set up on a machine running the Simics [21] simulator for both Linux and Barrelfish, because Barrelfish is not self-hosted. Simics is a full-system simulator which provides a complete virtual platform including CPUs, hardware and operating systems. By default, an atomic operation in the simulation process, such as an instruction, an exception or an interrupt, takes one cycle (see Table 6.1). Linux and Barrelfish differ in their RAM requirements. For booting, Linux uses the same kernel image on all cores, whereas Barrelfish needs different images for heterogeneous cores.

Processor         Intel Pentium 4 HT   Intel Pentium 4 HT
CPU Frequency     125 MHz              125 MHz
Number of cores   8                    8
RAM               512 MB               1 GB
OS                Linux                Barrelfish

Table 6.1: Hardware configurations of the virtual machine

To obtain better parallelism between the simulated cores, cpu-switch-time is set to 1, meaning that Simics switches from one core to the next every cycle. However, this makes a simulation take considerably longer.
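For reference, the setting is applied from the Simics command line; the form below follows the Simics CLI convention, though the exact syntax may differ between Simics versions:

    simics> cpu-switch-time 1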

6.2 Measurements

6.2.1 Measurements of serial applications

For the serial versions of the Cilk programs, obtained by removing all Cilk keywords, this thesis measures only the execution time TS, counted from the application's start to its end. Since no real hardware is available for the benchmarks, execution time throughout means the number of clock cycles the simulated machine takes.
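One common way to obtain such cycle counts on the simulated Pentium 4 is the x86 time-stamp counter; the helper below is a sketch of that approach, not necessarily the exact mechanism used by the Cilk runtime's timers:

    /* Read the x86 time-stamp counter; on Simics' default model each
     * atomic operation costs one cycle, so the difference of two
     * readings gives execution time in cycles. */
    #include <stdint.h>

    static inline uint64_t rdtsc(void)
    {
        uint32_t lo, hi;
        __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
        return ((uint64_t)hi << 32) | lo;
    }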

6.2.2 Measurements of Cilk applications

Several quantities are needed for the Cilk applications. Figure 6.1 shows how some of them are computed.

Figure 6.1: Cilk application invokes the runtime library. Domain spanning is an extra step to create a shared-memory environment. After a domain is formed, workers run in parallel to distribute the workload.

• TP: the execution time of the application, computed as the difference between its start and termination times in the runtime library.

• T1: the total execution time the scheduler takes to compute all the tasks.

• TW: the execution time for all workers to compute the tasks.

• The number of threads: each thread represents a Cilk function; the conversion from a function to a thread occurs at compile time.

• The number of steals: a steal is counted when a closure whose status is ready is taken. The Cilk library uses states to manage the operations of a closure, such as stealing, working and waiting.

• Domain spanning cost: the extra cost of creating a shared memory space for all the workers. Once the domain is formed, all the workers perform as designed. This cost does not depend on the size of an application but on the number of cores in the system.

Table 6.2 shows the spanning domain costs; the values are depicted in Figure 6.2.

Number of cores    Cost
2                  5.6 x 10^7
4                  1.1 x 10^8
8                  2.5 x 10^8

Table 6.2: Spanning domain cost. Measurements are made in cycles.


Figure 6.2: Spanning domain overheads over cores are counted in cycles.

6.3 Experiments

6 Cilk programs, extracted from the examples package of the Cilk distribution, are compiled with gcc 4.1 and 4.6 at optimization level 3 and run on both Linux and Barrelfish. The programs exercise different parallel strategies:

• cilksort: sorts an array of n integers (authors: Matteo Frigo and Andrew Stark).

• fft: Fast Fourier Transform of a vector whose size is a power of 2 (author: Matteo Frigo).

• fib: calculates the nth Fibonacci number (sketched after this list).

• lu: decomposes an n x n matrix, where n is at least 16 and a power of 2 (author: Robert Blumofe).

• matmul: rectangular multiplication of two dense n x n matrices, cache-friendly (author: Matteo Frigo).

• strassen: multiplies two dense n x n matrices with Strassen's algorithm (authors: Michael Bender, Stuart Schechter, and Bin Song).
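As a flavor of what these programs look like, below is the canonical Cilk fib, close to the version shipped in the examples package (argument handling simplified here):

    #include <cilk-lib.cilkh>
    #include <stdio.h>
    #include <stdlib.h>

    /* Spawn the two subproblems in parallel and sync before combining. */
    cilk int fib(int n)
    {
        if (n < 2) {
            return n;
        } else {
            int x, y;
            x = spawn fib(n - 1);
            y = spawn fib(n - 2);
            sync;
            return x + y;
        }
    }

    cilk int main(int argc, char *argv[])
    {
        int n, result;
        n = (argc > 1) ? atoi(argv[1]) : 40;
        result = spawn fib(n);
        sync;
        printf("fib(%d) = %d\n", n, result);
        return 0;
    }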

First of all, recall that T1 denotes the sum of the execution times of all procedures in a program running on N cores, TS denotes the execution time of the serial version, and TP, where 1 < P ≤ N, is the execution time of the parallel program on P cores. Table 6.3 shows the execution time, in cycles, of the 6 serial programs on Linux and Barrelfish.

Program     Size          Barrelfish     Linux
cilksort    10,000,000    2.4 x 10^9     2.12 x 10^9
fft         2^22          1.8 x 10^9     1.39 x 10^9
fib         40            4.1 x 10^10    3.72 x 10^10
lu          2048          2.4 x 10^10    3.19 x 10^10
matmul      1024          1.8 x 10^10    1.98 x 10^10
strassen    1024          2.5 x 10^9     2.94 x 10^9

Table 6.3: Execution time of 6 serial Cilk programs. Measurements are made in cycles.

Overall, the ratio T1/TS = c1 should be close to 1 to show that Cilk complies with the work-first principle. The speedup is computed as TS/TW; that is, the spanning domain cost is excluded. Instead, the spanning domain cost is presented as a percentage of the total execution cycles. The experiments on the 6 Cilk programs, performed with 1, 2, 4 and 8 cores, are laid out below.
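As a worked example using the cilksort numbers from Tables 6.3 and 6.4: on 8 Barrelfish cores, TS/TW = 2.4 x 10^9 / 1.25 x 10^9 ≈ 1.9, while T1/TS = 2.97 x 10^9 / 2.4 x 10^9 ≈ 1.24 (matching the reported 1.89 and 1.25 up to rounding); that is, the total work stays close to the serial work, so the work-first principle holds, even though the speedup is far below 8.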

Cilksort

Cores   Measurement     Barrelfish      Linux
1       TP              2.97 x 10^9     2.14 x 10^9
        TW              2.97 x 10^9     2.14 x 10^9
        T1              2.97 x 10^9     2.14 x 10^9
        T1/TS           1.25            1.01
        TS/TW           0.8             0.99
2       TP              2.04 x 10^9     1.2 x 10^9
        TW              1.98 x 10^9     1.2 x 10^9
        T1              3 x 10^9        2.14 x 10^9
        T1/TS           1.26            1.01
        TS/TW           1.2             1.77
        spanning cost   2.75 %
4       TP              1.64 x 10^9     7.4 x 10^8
        TW              1.53 x 10^9     7.4 x 10^8
        T1              2.96 x 10^9     2.16 x 10^9
        T1/TS           1.25            1.02
        TS/TW           1.55            2.86
        spanning cost   6.72 %
8       TP              1.51 x 10^9     5.02 x 10^8
        TW              1.25 x 10^9     5.02 x 10^8
        T1              2.97 x 10^9     2.18 x 10^9
        T1/TS           1.25            1.03
        TS/TW           1.89            4.22
        spanning cost   16.6 %

Table 6.4: Measurements (in cycles) of cilksort on Barrelfish and Linux.


Figure 6.3: Comparison of TW of cilksort on Barrelfish and Linux.


Figure 6.4: Speedup vs. serial versions of cilksort. Comparison is made with TS/TW

CoreId   Barrelfish   Linux
0        13           67
1        40           55
2        18           75
3        16           61
4        17           55
5        24           44
6        14           60
7        13           58
Sum      155          475

Table 6.5: Number of steals with 8 workers of cilksort.

CoreId   Barrelfish   Linux
0        142336       37847
1        73359        41117
2        73711        40520
3        4683         41856
4        5070         36589
5        4397         37290
6        5163         38801
7        4752         39536
Sum      313471       313556

Table 6.6: Number of threads spawned in 8 workers of cilksort.


Figure 6.5: Thread distribution over 8 cores of cilksort.

FFT

Cores   Measurement     Barrelfish      Linux
1       TP              1.8 x 10^9      1.44 x 10^9
        TW              1.8 x 10^9      1.44 x 10^9
        T1              1.8 x 10^9      1.44 x 10^9
        T1/TS           1.04            1.03
        TS/TW           0.96            0.97
2       TP              1.6 x 10^9      7.41 x 10^8
        TW              1.5 x 10^9      7.41 x 10^8
        T1              1.9 x 10^9      1.44 x 10^9
        T1/TS           1.09            1.03
        TS/TW           1.16            1.88
        spanning cost   3.56 %
4       TP              1.3 x 10^9      4.03 x 10^8
        TW              1.2 x 10^9      4.03 x 10^8
        T1              2.0 x 10^9      1.46 x 10^9
        T1/TS           1.12            1.05
        TS/TW           1.47            3.46
        spanning cost   8.42 %
8       TP              7.7 x 10^8      2.3 x 10^8
        TW              5.2 x 10^8      2.29 x 10^8
        T1              1.9 x 10^9      1.47 x 10^9
        T1/TS           1.1             1.06
        TS/TW           3.35            6.08
        spanning cost   32.26 %

Table 6.7: Measurements (in cycles) of FFT on Barrelfish and Linux.


Figure 6.6: Comparison of TW of FFT on Barrelfish and Linux.


Figure 6.7: Speedup vs. serial versions of FFT. Comparison is made with TS/TW

CoreId   Barrelfish   Linux
0        40           56
1        49           40
2        59           56
3        46           49
4        45           58
5        53           58
6        36           46
7        47           56
Sum      375          419

Table 6.8: Number of steals with 8 workers of FFT.

CoreId   Barrelfish   Linux
0        257625       147523
1        139811       157587
2        125510       160789
3        148518       156886
4        141931       159667
5        138981       152118
6        134389       140215
7        137654       149661
Sum      1224419      1224446

Table 6.9: Number of threads spawned in 8 workers of FFT.


Figure 6.8: Thread distribution over 8 cores of FFT.

Fib

Cores   Measurement     Barrelfish      Linux
1       TP              4.22 x 10^10    3.16 x 10^10
        TW              4.22 x 10^10    3.16 x 10^10
        T1              4.22 x 10^10    3.16 x 10^10
        T1/TS           1.02            0.85
        TS/TW           0.98            1.18
2       TP              2.27 x 10^10    1.56 x 10^10
        TW              2.25 x 10^10    1.56 x 10^10
        T1              4.23 x 10^10    3.12 x 10^10
        T1/TS           1.03            0.84
        TS/TW           1.83            2.38
        spanning cost   0.25 %
4       TP              1.35 x 10^10    7.81 x 10^9
        TW              1.33 x 10^10    7.81 x 10^9
        T1              4.33 x 10^10    3.12 x 10^10
        T1/TS           1.06            0.84
        TS/TW           3.08            4.76
        spanning cost   0.82 %
8       TP              8.41 x 10^9     3.91 x 10^9
        TW              8.23 x 10^9     3.91 x 10^9
        T1              4.38 x 10^10    3.13 x 10^10
        T1/TS           1.06            0.84
        TS/TW           4.98            9.5
        spanning cost   2.92 %

Table 6.10: Measurements (in cycles) of fib on Barrelfish and Linux.


Figure 6.9: Comparison of TW of fib on Barrelfish and Linux.


Figure 6.10: Speedup vs. serial versions of fib. Comparison is made with TS/TW

CoreId   Barrelfish   Linux
0        9            23
1        22           14
2        13           36
3        6            26
4        17           20
5        19           15
6        20           22
7        13           21
Sum      119          177

Table 6.11: Number of steals with 8 workers of fib.

CoreId   Barrelfish    Linux
0        104005003     103684616
1        103916448     103773168
2        103457696     103650790
3        103418929     103723370
4        103098557     103570200
5        103322399     102217914
6        103309483     103562224
7        103372191     103718424
Sum      827900706     827900706

Table 6.12: Number of threads spawned in 8 workers of fib.


Figure 6.11: Thread distribution over 8 cores of fib.

LU

Cores   Measurement     Barrelfish      Linux
1       TP              2.46 x 10^10    3.23 x 10^10
        TW              2.46 x 10^10    3.23 x 10^10
        T1              2.46 x 10^10    3.23 x 10^10
        T1/TS           1.01            1.01
        TS/TW           0.99            0.99
2       TP              1.45 x 10^10    1.65 x 10^10
        TW              1.44 x 10^10    1.65 x 10^10
        T1              2.56 x 10^10    3.23 x 10^10
        T1/TS           1.05            1.01
        TS/TW           1.68            1.94
        spanning cost   0.39 %
4       TP              8.39 x 10^9     8.68 x 10^9
        TW              8.28 x 10^9     8.68 x 10^9
        T1              2.56 x 10^10    3.26 x 10^10
        T1/TS           1.05            1.02
        TS/TW           2.94            3.67
        spanning cost   1.32 %
8       TP              5.23 x 10^9     4.8 x 10^9
        TW              4.98 x 10^9     4.8 x 10^9
        T1              2.53 x 10^10    3.28 x 10^10
        T1/TS           1.04            1.03
        TS/TW           4.89            6.64
        spanning cost   4.78 %

Table 6.13: Measurements (in cycles) of LU on Barrelfish and Linux.


Figure 6.12: Comparison of TW of LU on Barrelfish and Linux.


Figure 6.13: Speedup vs. serial versions of LU. Comparison is made with TS/TW

CoreId   Barrelfish   Linux
0        734          927
1        847          871
2        762          918
3        806          889
4        761          909
5        790          898
6        817          927
7        833          879
Sum      6350         7218

Table 6.14: Number of steals with 8 workers of LU.

CoreId   Barrelfish   Linux
0        247020       235323
1        227702       231733
2        233682       234301
3        251696       235116
4        229314       234199
5        245164       233409
6        251993       235133
7        186686       234043
Sum      1873257      1873257

Table 6.15: Number of threads spawned in 8 workers of LU.


Figure 6.14: Thread distribution over 8 cores of LU.

Matmul

Cores   Measurement     Barrelfish      Linux
1       TP              1.61 x 10^10    1.89 x 10^10
        TW              1.61 x 10^10    1.89 x 10^10
        T1              1.61 x 10^10    1.89 x 10^10
        T1/TS           0.87            0.95
        TS/TW           1.14            1.05
2       TP              1.2 x 10^10     1.39 x 10^10
        TW              1.19 x 10^10    1.39 x 10^10
        T1              1.61 x 10^10    1.89 x 10^10
        T1/TS           0.88            0.95
        TS/TW           1.54            1.42
        spanning cost   0.39 %
4       TP              1.1 x 10^10     1.15 x 10^10
        TW              1.09 x 10^10    1.15 x 10^10
        T1              1.7 x 10^10     1.89 x 10^10
        T1/TS           0.92            0.96
        TS/TW           1.69            1.72
        spanning cost   1.0 %
8       TP              8.31 x 10^9     1.03 x 10^10
        TW              8.05 x 10^9     1.03 x 10^10
        T1              1.67 x 10^10    1.89 x 10^10
        T1/TS           0.91            0.96
        TS/TW           2.28            1.93
        spanning cost   3.0 %

Table 6.16: Measurements (in cycles) of matmul on Barrelfish and Linux.


Figure 6.15: Comparison of TW of matmul on Barrelfish and Linux.


Figure 6.16: Speedup vs. serial versions of matmul. Comparison is made with TS/TW

CoreId   Barrelfish   Linux
0        67           102
1        57           79
2        95           36
3        73           92
4        71           82
5        87           59
6        55           76
7        75           73
Sum      580          599

Table 6.17: Number of steals with 8 workers of matmul.

CoreId   Barrelfish   Linux
0        107424       107736
1        107788       106421
2        107756       108384
3        107574       107456
4        107616       107765
5        107798       107990
6        107574       107810
7        107802       107770
Sum      861332       861332

Table 6.18: Number of threads spawned in 8 workers of matmul.


Figure 6.17: Thread distribution over 8 cores of matmul.

Strassen

Cores   Measurement     Barrelfish      Linux
1       TP              2.69 x 10^9     3.13 x 10^9
        TW              2.69 x 10^9     3.13 x 10^9
        T1              1.61 x 10^10    1.89 x 10^10
        T1/TS           1.05            1.06
        TS/TW           0.95            0.94
2       TP              1.37 x 10^9     1.7 x 10^9
        TW              1.36 x 10^9     1.7 x 10^9
        T1              2.66 x 10^9     3.15 x 10^9
        T1/TS           1.04            1.07
        TS/TW           1.86            1.73
        spanning cost   4.09 %
4       TP              9.37 x 10^8     9.82 x 10^8
        TW              9.36 x 10^8     9.82 x 10^8
        T1              2.68 x 10^9     3.19 x 10^9
        T1/TS           1.05            1.08
        TS/TW           2.72            3
        spanning cost   5.98 %
8       TP              8.78 x 10^8     6.26 x 10^8
        TW              6.76 x 10^8     6.26 x 10^8
        T1              2.66 x 10^9     3.25 x 10^9
        T1/TS           1.04            1.11
        TS/TW           3.77            4.7
        spanning cost   12.5 %

Table 6.19: Measurements (in cycles) of strassen on Barrelfish and Linux.


Figure 6.18: Comparison of TW of strassen on Barrelfish and Linux.


Figure 6.19: Speedup vs. serial versions of strassen. Comparison is made with TS/TW

CoreId   Barrelfish   Linux
0        5            10
1        5            5
2        5            9
3        5            3
4        5            5
5        6            7
6        5            6
7        6            4
Sum      42           49

Table 6.20: Number of steals with 8 workers of strassen.

CoreId   Barrelfish   Linux
0        3446         754
1        384          749
2        319          760
3        602          747
4        424          748
5        235          750
6        240          751
7        357          748
Sum      6007         6007

Table 6.21: Number of threads spawned in 8 workers of strassen.


Figure 6.20: Thread distribution over 8 cores of strassen.

6.4 Evaluation

From the results of the experiments, Cilk on Barrelfish does not perform better than on Linux in most cases, even though only the execution of the parallel workers is measured. Cilk on Barrelfish takes more cycles and shows poorer speedup. Stealing and thread distribution also differ between the two platforms: Cilk on Linux performs more stealing and achieves an even thread distribution, whereas on Barrelfish core 0 tends to compute far more threads than the others (for example in strassen), so the workload is not equally distributed. As discussed above, the spanning domain cost does not depend on the size of an application but is proportional to the number of cores; that is, it grows as the number of cores increases. This cost is small for long-running applications, for example matmul (around 3% with 8 cores), but becomes considerable for short-running ones, for example FFT (around 32% with 8 cores). In summary, with the Cilk programs executed on Barrelfish and Linux inside the Simics simulator, Cilk cannot maintain its good performance in most of the experiments. Although Cilk keeps the work-first principle on the new platform, the time bounds cannot be obtained.


Chapter 7

Conclusion

7.1 Contribution

This thesis applies a parallel programming model (the task-centric model) to a scalable OS by porting Cilk to Barrelfish, and it has accomplished the following targets:

• Proving that it is possible for a shared-memory application to run on top of the Barrelfish OS, which uses message passing for communication.

• Investigating to what extent Cilk's performance carries over to the new platform.

This porting reinforces the statement that Barrelfish supports shared-memory applications on account of its ANSI-C based library, virtual memory management and set of POSIX-like threads. Barrelfish treats an application running on multiple cores as a group of related dispatchers called a domain. Moreover, the porting has explored the capability of Barrelfish to use a compiler other than gcc to build program sources, in this context the Cilk compiler cilkc. As a rule of thumb, to be executable an application must be a static binary compiled together with all Barrelfish files. Cilk exposes good performance on traditional OSes, but unfortunately this attractive feature does not carry over to Barrelfish, and neither does the time bound property; the only thing that stays unchanged is the work-first principle. The statistics show that Cilk's performance on Barrelfish is worse than on Linux, from speedup ratios to execution times. The key enabler for a shared-memory application on Barrelfish is a common virtual address space. The model developed with what Barrelfish offers resembles a shared-memory environment in which global variables are accessible from all cores involved in the application. Once the common virtual address space is formed, threads from different dispatchers can operate on this space equally. Although the overhead of spanning the domain does not contribute much to the execution time of applications with large workloads, it increases as the system is scaled up. All the benchmarks were measured in Simics; nevertheless, the statistics are trustworthy because Simics is a full-system simulator.

7.2 Future Work

At this moment, Barrelfish operates more like an experimental platform than a complete OS, so applications for the system are limited. However, some applications may require a particular compiler. Therefore, the compiler-selection option introduced in the Hakefile needs further implementation so that programmers are free to specify or remove arguments in the compiling or even the linking phase. As discussed above, a shared-memory application relies on a domain spanned across cores. When the application exits, this domain still remains. On the next execution, when the application spans a domain again in the same way, and if system memory is small, the OS may allocate the old address space for the new domain and cause page-fault errors because the OS maps capabilities to the same page table. For that reason, the spanned domain must be killed when the application terminates. To clear up the memory of a domain, the system needs to track which dispatchers belong to which domain, as sketched below. Further work should also address the lack of time-sharing of workers on one core, so that applications could use more workers than cores. This would need a protocol for mapping workers onto cores so that the scheduler can exploit each core as efficiently as possible, and it might involve virtual memory management.
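A minimal sketch of the bookkeeping such cleanup would need is given below; the structure and its names are hypothetical, not an existing Barrelfish interface:

    /* Hypothetical registry mapping an application's domain to the
     * dispatchers it spans, so that all of them can be torn down (and
     * their page-table capabilities revoked) when the application exits. */
    #include <barrelfish/barrelfish.h>

    #define MAX_SPAN_CORES 64             /* illustrative limit */

    struct domain_record {
        domainid_t domain;                /* the spanned domain */
        coreid_t   cores[MAX_SPAN_CORES]; /* cores hosting its dispatchers */
        int        ncores;                /* number of entries in cores[] */
    };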

References

[1] The Barrelfish Operating System. Documentation: Overview [Online] Available from: http://www.barrelfish.org/TN-000-Overview.pdf. [Accessed 18/06/2012].

[2] The Barrelfish Operating System. Documentation: Glossary [Online] Available from: http://www.barrelfish.org/TN-001-Glossary.pdf. [Accessed 18/06/2012].

[3] The Barrelfish Operating System. Documentation: Hake [Online] Available from: http://www.barrelfish.org/TN-003-Hake.pdf. [Accessed 18/06/2012].

[4] A. Baumann, S. Peter, A. Schüpbach, A. Singhania, T. Roscoe, P. Barham, and R. Isaacs. Your computer is already a distributed system. Why isn't your OS? In Proceedings of the 12th Workshop on Hot Topics in Operating Systems, Monte Verità, Switzerland, 2009.

[5] Adrian Schuepbach, Simon Peter, Andrew Baumann, Timothy Roscoe, Paul Barham, Tim Harris, Rebecca Isaacs. Embracing diversity in the Barrelfish manycore operating system. Proceedings of the Workshop on Managed Many-Core Systems (MMCS), Boston, MA, USA, June 2008.

[6] M. Maas and R. McIlroy. A JVM for the Barrelfish Operating System. In 2nd Workshop on Systems for Future Multi-core Architectures (SFMA'12), 2012.

[7] Rik Farrow. The Barrelfish Multikernel: An Interview with Timothy Roscoe. ;login:, vol. 35, no. 2, April 2010. [Online] Available from: http://www.usenix.org/publications/login/2010-04/pdfs/roscoe.pdf. [Accessed 18/04/2012].

[8] A. Baumann, P. Barham, P. Dagand, T. Harris, R. Isaacs, S. Peter, T. Roscoe, A. Schüpbach, and A. Singhania. The multikernel: A new OS architecture for scalable multicore systems. In Proceedings of the 22nd ACM Symposium on Operating Systems Principles, pages 29–44, Big Sky, MT, USA, 2009.

[9] Ihor Kuz. Getting started. In Barcelona Barrelfish Workshop, 6-7 September, 2010. [Online] Available from: http://wiki.barrelfish.org/BarcelonaWorkshop2010. [Accessed 18/06/2012].

[10] Barrelfish Wiki (2009) Programming_for_Barrelfish. [Online]. Available from: http://wiki.barrelfish.org/Programming_for_Barrelfish. [Accessed 18/04/2012].

[11] D. Wentzlaff and A. Agarwal. Factored Operating Systems (fos): The Case for a Scalable Operating System for Multicores. SIGOPS Operating Systems Review, vol. 43, no. 2, pp. 76–85, 2009.

[12] J. A. Colmenares, S. Bird, H. Cook, P. Pearce, D. Zhu, J. Shalf, S. Hofmeyr, K. Asanovic, and J. Kubiatowicz, Resource Management in the Tessellation Manycore OS, Proceedings of the Second USENIX Workshop on Hot Topics in Parallelism (HotPar’10), Berkeley, California, USA, 2010.

[13] S. Boyd-Wickizer, A. T. Clements, Y. Mao, A. Pesterev, M. F. Kaashoek, R. Morris, and N. Zeldovich, An Analysis of Linux Scalability to Many Cores, 9th USENIX Symposium on Operating Systems Design and Implementation, 2010.

[14] M. Frigo, C. E. Leiserson, and K. H. Randall. The implementation of the Cilk-5 multithreaded language. In Proceedings of the ACM SIGPLAN 1998 Conference on Programming Language Design and Implementation (PLDI '98), pages 212–223, New York, NY, USA, 1998.

[15] R. Blumofe, C. F. Joerg, B. C. Kuszmaul, C. E. Leiserson, K. H. Randall and Y. Zhou. Cilk: an efficient multithreaded runtime system. In PPoPP '95: Proceedings of the Fifth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, R. L. Wexelblat, Ed., ACM, pages 207–216, New York, NY, USA, 1995.

[16] B. C. Kuszmaul. Cilk provides the "best overall productivity" for high performance computing (and won the HPC Challenge award to prove it). In Proceedings of the Nineteenth Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA '07), pages 299–300, New York, NY, USA, 2007.

[17] Charles E. Leiserson. Multithreaded Programming in Cilk - LECTURE 1. [On- line]. Available from: supertech.csail.mit.edu/cilk/lecture-1.ppt. [Accessed 18/09/2012].

[18] Charles E. Leiserson. Theory of parallel systems - Lecture 10: Cilk Implementation. [Online]. Available from: http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-895-theory-of-parallel-systems-sma-5509-fall-2003/lecture-notes/lecture10.pdf. [Accessed 18/09/2012].

[19] Cilk 5.4.6 Reference Manual (1998) Massachusetts Institute of Technology. [Online]. Available from: http://supertech.csail.mit.edu/cilk/manual-5.4.6.pdf. [Ac- cessed 18/04/2012].

[20] Clay Breshears. The Art of Concurrency: A Thread Monkey's Guide to Writing Parallel Applications. 1st ed. USA: O'Reilly Media; 2009.

[21] P. S. Magnusson, M. Christensson, J. Eskilson, D. Forsgren, G. Hålberg, J. Högberg, F. Larsson, A. Moestedt, and B. Werner. Simics: A full system simulation platform. IEEE Computer, 35(2):50–58, February 2002.

[22] Artur Podobas, Mats Brorsson, and Karl-Filip Faxén. A Quantitative Evaluation of popular Task-Centric Programming Models and Libraries. KTH, Sweden, December 2012.

[23] Karl-Filip Faxén. Wool - A work stealing library. ACM SIGARCH Computer Architecture News, 36(5):93–100, December 2008.

