Porting Cilk to the Barrelfish OS
Chau Ho Bao Le
KTH Information and Communication Technology
Master of Science Thesis
Stockholm, Sweden 2013
TRITA-ICT-EX-2013:66

KTH Royal Institute of Technology
Dept. of Software and Computer Systems
Degree project for the degree of Master of Science in Information and Communications Technology

Author: Chau Ho Bao Le
Supervisor: Georgios Varisteas, MSc
Examiner: Prof. Mats Brorsson, KTH, Sweden

Abstract

The Barrelfish operating system is an experimental instance of the multikernel structure, which offers desirable properties such as support for hardware heterogeneity, scalability, and dynamicity. Barrelfish is still under development and lacks applications. There is therefore a need to investigate the efficiency of applications running on Barrelfish, and one candidate is a shared-memory application. For an empirical study, Cilk is chosen because its runtime library is designed for shared-memory architectures and it is known to deliver good performance. This thesis focuses on making Cilk run on top of Barrelfish in order to reach two goals: portability, which Barrelfish is described as supporting, and good speed afterwards. The porting involves compiling the Cilk runtime source code after replacing its pthread subroutines with a set of Barrelfish APIs, and then changing the way the Cilk scheduler spawns worker threads on multiple cores. The main point of the porting, however, is to make different cores access the same virtual address space. Barrelfish provides the notion of a domain, which specifies the number of cores used by an application so that these cores can share the same memory space. This thesis also carries out benchmarks on several Cilk programs and finds that Cilk does not perform as well as expected. In addition, measurements on parallel workers show that Cilk on Barrelfish takes more cycles to perform computation.
Although Cilk still maintains the work-first principle, it cannot achieve the time bound. The cost of spanning a domain is proportional to the number of cores, and it becomes significant for applications that take only a short time to complete.

Key words: Barrelfish, Cilk, porting, multikernel, shared-memory, work-stealing, message passing

Acknowledgment

I would like to express my gratitude to Professor Mats Brorsson, who gave me good advice during the very first steps of this thesis and has been patient with my silly issues.

This thesis has benefited from significant supervision by PhD researcher Georgios Varisteas. I am deeply thankful to him for his suggestions, his instructions, and his experience in this field.

Finally, special thanks go to my family and friends for their love and support throughout my studies.

Stockholm, March 20, 2013
Chau Ho Bao Le

Contents

1 Introduction
1.1 Overview
1.2 Problem statement
1.3 Related work
1.4 Report layout
2 Background
2.1 OSes on multiple processors
2.1.1 Factored Operating System (fos)
2.1.2 Tessellation
2.1.3 Barrelfish
2.2 Parallel programming models
2.2.1 Shared memory
2.2.1.1 Task-centric or task-based model
2.2.1.2 Explicit threading
2.2.2 Message passing
2.3 Porting
3 Barrelfish OS
3.1 Introduction
3.1.1 Overview
3.1.2 Multikernel structure
3.2 Conceptions and Notions
3.3 Building, Compiling and Booting
3.3.1 Building
3.3.2 Compiling
3.3.3 Booting
3.4 Summary
4 Cilk
4.1 Brief Overview
4.2 Compiling
4.2.1 Compilation process
4.2.2 Compilation strategy
4.3 Scheduling
4.3.1 Work-stealing scheduler
4.3.2 Implementation
4.4 Summary
5 Porting Cilk to Barrelfish
5.1 Challenges
5.2 Multithreaded Model
5.2.1 Cilk on the original platform
5.2.2 Cilk on Barrelfish OS
5.3 Modifications on Cilk
5.3.1 Compile time
5.3.2 Runtime
5.4 Modifications on Barrelfish
5.4.1 Hake
5.4.2 Makefile
5.5 Summary
6 Benchmarks
6.1 Environment settings
6.2 Measurements
6.2.1 Measurements of serial applications
6.2.2 Measurements of Cilk applications
6.3 Experiments
6.4 Evaluation
7 Conclusion
7.1 Contribution
7.2 Future Work

List of Figures

3.1 The multikernel structure
3.2 Barrelfish structure
4.1 A serial C and a Cilk program to compute the nth Fibonacci number
4.2 The Cilk dag that computes the 3rd Fibonacci number
4.3 The compilation process of a Fibonacci program
4.4 Runtime data structures of a deque
4.5 Interactions between thief and worker in the three cases
5.1 Model of shared-memory scheduler in Cilk
5.2 Multithreaded model of Cilk scheduler in its original platform
5.3 The model of Cilk scheduler in Barrelfish
5.4 Model of Cilk scheduler in Barrelfish with a domain
5.5 Multithreading model of Cilk scheduler in Barrelfish
5.6 Compilation progress of a Cilk program in Barrelfish
5.7 wait/notify mechanism to replace POSIX create and join
6.1 Cilk application invokes the runtime library
6.2 Spanning domain overheads over cores
6.3 Comparison of TW of cilksort on Barrelfish and Linux
6.4 Speedup vs. serial versions of cilksort
6.5 Thread distribution over 8 cores of cilksort
6.6 Comparison of TW of FFT on Barrelfish and Linux
6.7 Speedup vs. serial versions of FFT
6.8 Thread distribution over 8 cores of FFT
6.9 Comparison of TW of fib on Barrelfish and Linux
6.10 Speedup vs. serial versions of fib
6.11 Thread distribution over 8 cores of fib
6.12 Comparison of TW of LU on Barrelfish and Linux
6.13 Speedup vs. serial versions of LU
6.14 Thread distribution over 8 cores of LU
6.15 Comparison of TW of matmul on Barrelfish and Linux
6.16 Speedup vs. serial versions of matmul
6.17 Thread distribution over 8 cores of matmul
6.18 Comparison of TW of strassen on Barrelfish and Linux
6.19 Speedup vs. serial versions of strassen
6.20 Thread distribution over 8 cores of strassen

List of Tables

6.1 Hardware configurations of the virtual machine
6.2 Spanning domain cost
6.3 Execution time of 6 serial Cilk programs
6.4 Measurements of cilksort on Barrelfish and Linux
6.5 Number of steals with 8 workers of cilksort
6.6 Number of threads spawned in 8 workers of cilksort
6.7 Measurements of FFT on Barrelfish and Linux
6.8 Number of steals with 8 workers of FFT
6.9 Number of threads spawned in 8 workers of FFT
6.10 Measurements of fib on Barrelfish and Linux
6.11 Number of steals with 8 workers of fib
6.12 Number of threads spawned in 8 workers of fib
6.13 Measurements of LU on Barrelfish and Linux
6.14 Number of steals with 8 workers of LU
6.15 Number of threads spawned in 8 workers of LU
6.16 Measurements of matmul on Barrelfish and Linux
6.17 Number of steals with 8 workers of matmul
6.18 Number of threads spawned in 8 workers of matmul
6.19 Measurements of strassen on Barrelfish and Linux
6.20 Number of steals with 8 workers of strassen
6.21 Number of threads spawned in 8 workers of strassen

Chapter 1
Introduction

1.1 Overview

Computer hardware has developed over the years: computers have shrunk from large machines to small ones, and processors have moved from single-core to many-core designs and become more diverse. New processors (cores) and heterogeneous hardware have created a demand for scalable operating systems (OSes) that can operate in such environments; the multikernel architecture [8] has therefore arisen as one OS design for scalable parallel computers. In this architecture, the OS is treated as a distributed system, so it inherits the benefits of a distributed system, such as support for heterogeneity, the ability to scale, and lower communication latency. The Barrelfish OS is an instance of the multikernel model.

Because the von Neumann model of sequential programming is not well suited to high-performance computing (HPC), parallel programming models have emerged to exploit the underlying hardware and scalable OSes, allowing programmers to express parallelism at a high level in the programming language. One well-known model is the task-centric model, in which threads interact through shared memory. Multithreaded, shared-memory programming models are known to perform well on multicore machines under the operating systems for which they were developed. One representative of this model is Cilk [19]. In Cilk, the unit of execution is a task, and tasks are distributed across cores by a work-stealing scheduler designed for shared-memory machines. To benefit from both a scalable OS and a parallel programming model, the idea is to combine the two into one paradigm: running a multithreaded shared-memory application on a multikernel.