Porting a Library for Fine Grained Independent Task Parallelism to Enea OSE RTOS

Master of Science Thesis

YIHUI WANG

Master's Thesis at Xdin AB
Supervisors: Barbro Claesson, Xdin AB; Detlef Scholle, Xdin AB
Examiner: Ingo Sander, KTH

TRITA-ICT-EX-2012:141

Acknowledgements

I would like to express my sincere gratitude to my supervisors, Detlef Scholle and Barbro Claesson from Xdin AB, for giving me the precious opportunity to carry out this master thesis and for their constant support. I would also like to thank Ingo Sander, my examiner from KTH, for his useful suggestions and important guidance in writing this report.

I would like to give my thanks to the other thesis workers at Xdin, especially Sebastian Ullström and Johan Sundman Norberg, who helped me a lot with both the project and the thesis. The amazing time that we spent together became one of my most precious memories in life.

My thanks also go to everyone at Xdin AB and Enea AB, who have been friendly and hospitable.

Finally, I would like to thank my beloved family for their support throughout my entire life. I could not have come this far academically without them. In particular, I must acknowledge my friend Junzhe Tian for his encouragement and constant assistance throughout my master studies.

Yihui Wang
Stockholm, Sweden, June 2012

Abstract

Multi-core has started an era of improving the performance of computations by executing instructions in parallel. However, the improvement in performance is not linear in the number of cores, because of the overhead caused by inter-core communication and unbalanced load over the cores. Wool provides a solution to improve the performance of multi-core systems. It is a C library for fine grained independent task parallelism developed by Karl-Filip Faxén at SICS, which helps to keep the load balanced over the cores by work stealing and leapfrogging.

In this master thesis project, Wool is ported to the Enea OSE real-time operating system, aiming to supply an approach to improve the performance of multi-core systems. To reach this goal, multi-core architecture, task parallelism algorithms, as well as POSIX threads are studied. In addition, hardware synchronization primitives, which are defined by the processor, are studied and implemented in Wool. The target hardware for this study is the Freescale P4080 board with eight e500mc cores. Wool is ported on this target with both the Linux and OSE operating systems. Finally, the porting is tested and verified.

Contents

List of Figures

List of Tables

List of Abbreviations

1 Introduction
  1.1 Background
  1.2 Problem statement
  1.3 Goals
  1.4 Method
  1.5 Contributions

I Theoretical Study

2 Multi-core
  2.1 Multi-core operating system
  2.2 Comparison
  2.3 Summary

3 Parallel Computing
  3.1 Parallel performance metrics
  3.2 Levels of parallelism
    3.2.1 Instruction-level parallelism (ILP)
    3.2.2 Thread-level parallelism (TLP)
  3.3 Parallel programming models
  3.4 Design in parallel
  3.5 Summary

4 Threads
  4.1 Thread overview
    4.1.1 Process & thread
    4.1.2 Threaded program models
  4.2 Thread management
  4.3 Mutual exclusion
  4.4 Conditional variables
  4.5 Synchronization
    4.5.1 Atomic operations
    4.5.2 Hardware primitives
  4.6 Threads on multi-core
  4.7 Summary

5 Wool
  5.1 Basic concepts
    5.1.1 Worker
    5.1.2 Task pool
    5.1.3 Granularity in Wool
    5.1.4 Load balance in Wool
    5.1.5 Structure
  5.2 Work stealing
  5.3 Leap frogging
  5.4 Direct task stack algorithm
    5.4.1 Spawn
    5.4.2 Sync
  5.5 Wool optimization
    5.5.1 Sampling victim selection
    5.5.2 Set based victim selection
  5.6 Other multi-threaded scheduler
  5.7 Summary

II Implementation

6 OSE
  6.1 OSE fundamentals
  6.2 OSE process & IPC
  6.3 OSE for multi-core
    6.3.1 OSE load balancing
    6.3.2 Message passing APIs
  6.4 OSE pthreads
  6.5 Summary

7 P4080 and TILE64
  7.1 Features
    7.1.1 Instruction set
    7.1.2 Memory consistency
    7.1.3 Memory barrier
  7.2 Freescale QorIQ P4080 platform
    7.2.1 P4080 architecture
    7.2.2 e500mc core
  7.3 Tilera's TILE64
    7.3.1 TILE64 architecture
    7.3.2 TILE

8 Porting Wool to P4080
  8.1 Design
    8.1.1 Storage access ordering
    8.1.2 Atomic update primitives
  8.2 Implementation
    8.2.1 Prefetch & CAS
    8.2.2 Gcc inline assembler code
    8.2.3 Enea Linux
  8.3 Experiment
    8.3.1 Experiment: Fibonacci code with and without Wool
    8.3.2 Analysis

9 Porting Wool to OSE
  9.1 Design
    9.1.1 Malloc library
    9.1.2 Pthread
  9.2 Implementation
    9.2.1 Load module configuration
    9.2.2 Makefile
    9.2.3 System information
    9.2.4 Application on OSE multi-core
  9.3 Experiments
    9.3.1 Experiment on OSE & Linux
    9.3.2 Experiment on the performance vs. the number of cores (1)
    9.3.3 Experiment on the performance vs. the number of cores (2)

10 Conclusions and Future Work
  10.1 Conclusions
    10.1.1 Theoretical study
    10.1.2 Implementation
    10.1.3 Result Discussion
  10.2 Future work

Appendices

A Demo Results

Bibliography

List of Figures

1.1 Implementation Steps
2.1 Multi-core Architecture
2.2 Architecture of SMP System
2.3 Architecture of AMP System
4.1 Mutex Structure [14]
4.2 Condition Variable [15]
5.1 Work Stealing Structure
5.2 Work Stealing
5.3 Leap Frogging
5.4 Spawn
6.1 Processes in Blocks with Pools in Domains [26]
6.2 Message Passing
6.3 MCE: Hybrid AMP/SMP Multiprocessing [31]
7.1 P4080
7.2 Tilera's TILE64 Architecture [38]
8.1 Enea Linux
8.2 Fib (5)
9.1 Fib(43) with Wool Library on the Linux & OSE on P4080
9.2 Fib(12) with Large Task Body on Multiple Cores

List of Tables

2.1 Comparison of Multiple Processors and Multiple Cores
2.2 Comparison between Multiple Threads and Multiple Cores
3.1 Comparison of Shared Memory Model and Message Passing Model
4.1 Processes and Threads
4.2 Advantages and Disadvantages of Threads on Multi-core
7.1 Instruction Set Comparison
7.2 Memory Consistency Models
7.3 Memory Barriers & Instruction
8.1 Time for Fib(43) with Wool Library on the Linux X86 Platform
9.1 Fib(46) with Memalign, Malloc and Calloc
9.2 Fib(43) with Wool Library on Multiple Cores
10.1 Target

List of Abbreviations

Pthread POSIX thread

OSE Operating System Embedded

P4080 QorIQ P4080 Eight-Core Communications Processors

CMP Chip Multi-Processor

SMP Symmetric Multi-Processing

OS Operating System

SMT Simultaneous Multi-Threading

ILP Instruction-Level Parallelism

TLP Thread-Level Parallelism

API Application Programming Interface

LIFO Last In First Out

FIFO First In First Out

IPC Inter Process Communication

MCE Multi-Core Embedded

ISA Instruction Set Architecture

CISC Complex Instruction Set Computer

RISC Reduced Instruction Set Computing

MPPA Massively Parallel Processor Array

Mutex Mutual Exclusion

Fib Fibonacci

Chapter 1

Introduction

1.1 Background

This thesis project is conducted at Xdin AB¹, a technology and IT consulting company developing and delivering competence for world leading companies. The thesis is part of the MANY² project hosted by ITEA2³. The objective of MANY is to provide scalable, reusable and rapidly developed software for embedded systems.

Multi-core became popular in the recent decade; it extends the single-core approach with parallel computing. However, the overhead caused by synchronization among processors becomes a problem as the number of cores keeps increasing. An efficient task parallelism scheduler is needed to decrease the communication overhead and improve the performance.

Wool⁴ is a C library for fine grained independent task parallelism on top of POSIX threads (pthreads) [1], developed at SICS⁵ by Karl-Filip Faxén. It can be regarded as a low overhead scheduler for concurrent programs. The demand for efficient parallel computing makes it important to port Wool to OSE⁶ multi-core platforms to gain better performance. This master thesis focuses on porting Wool to the Enea OSE real-time operating system.

1.2 Problem statement

Wool is a C library concentrating on reducing the communication overhead and keeping the load balanced across multiple cores. It is designed to give better performance on multi-core systems, like the P4080⁷, because it helps to distribute the workload to each core with a low overhead. Wool currently works on Linux on X86⁸. To port Wool to Enea OSE on the P4080, good knowledge of the Wool task parallelism, the PowerPC instruction set and the OSE operating system is required. Differences between the operating systems and the hardware targets should be considered before the porting. The main problems are listed below.

¹http://xdin.com/
²Many-core programming and resource management for high-performance Embedded Systems
³http://www.itea2.org
⁴Wool's homepage: http://www.sics.se/ kff/wool/
⁵Swedish Institute of Computer Science
⁶A real-time embedded operating system created by Enea AB

P1: Wool task parallelism

Wool implements task parallelism. How tasks are assigned to different cores and cooperate to achieve more efficient parallelism is an interesting topic. Modifying the Wool source code to fit the new platform is the main part of the project, so it is important to find out which parts of the source code should be modified to port Wool to the new platform without changing the task parallelism.

P2: Pthread

Wool is a C library based on POSIX threads (pthreads), a library provided by the operating system. The Linux operating system implements pthreads, and all the thread functions and data types are declared in the pthread header file. However, the Enea OSE operating system is based on OSE processes and message passing instead of pthreads. Although most pthread APIs are implemented and supported by OSE to make it easier to port pthread-based applications, changes and reconfigurations of OSE pthreads remain problems in this project.

P3: Hardware synchronization primitives

Hardware synchronization primitives are used in the implementation of Wool, where they are used for optimization. These hardware primitives are defined by the processor vendors. To port Wool to the new target P4080, one must add these primitives for the specified target. This hardware-dependent code relies on the memory consistency model, the memory barriers and the type of the instruction set. The low-level assembly code has to be embedded in the C program.

1.3 Goals

This project is conducted in two phases (see Figure 1.1).

Step 1: Port Wool to P4080.
1. Configure the P4080 with a Linux kernel.
2. Modify the hardware-dependent code of Wool using gcc inline assembler.
3. Verify Wool on the P4080.

⁷http://www.freescale.com
⁸X86: a series of computer microprocessor instruction set architectures based on the Intel 8086 CPU.

Figure 1.1. Implementation Steps (Wool on Linux with pthreads on X86, on Linux on the P4080, and on OSE on the P4080)


Step 2: Port Wool to OSE.
1. Configure OSE on P4080.
2. Modify OSE libraries, including the pthread library.
3. Verify Wool upon OSE pthreads.

1.4 Method

This thesis lasts for 20 weeks and includes two phases: theoretical study and implementation. The theoretical study is conducted during the first 10 weeks. Background knowledge of Wool, OSE and the P4080 is acquired by reading manuals and papers.

In the second phase, the design and implementation of Wool on the P4080 are carried out first, followed by testing and verification of the implementation. Finally, the performance of Wool is measured on the new platform and a demonstration is conducted. The platform is a Freescale board with the QorIQ P4080 Eight-Core Communications Processor, and the implementation is done according to Xdin AB standards.

1.5 Contributions

This master thesis involves the study of parallel programming on multi-core, POSIX threads and the Wool task parallelism strategies. Enea Linux has been set up on the Freescale P4080 platform. The source code for the hardware-dependent primitives has been changed to fit the new platform, and a set of tests has been performed to verify the results.

Wool is also ported to Enea OSE on the Freescale P4080 platform. The main problem is that some of the OSE libraries differ from the Linux libraries. To fix the library problem, additional configurations and functions are used. Wool is configured as a load module in the implementation. By compiling the Wool library together with the application code, the application gains better performance. The performance of Wool has been tested on the new platform and the results are compared with those on Enea Linux.

Part I

Theoretical Study


Chapter 2

Multi-core

According to Moore's law, chip performance doubles every 18 months [2]. To keep up with this trend, micro-processor vendors used to improve computer performance by increasing the clock frequency of single-core processors. However, this increase in performance has reached a bottleneck due to power consumption and heat dissipation, which grow exponentially with the clock frequency. Multi-core therefore comes as a solution, improving performance with multiple execution units executing in parallel [3]. However, the massive parallelism makes it more difficult to program a multi-core system, and inter-core dependencies can decrease its performance as well.

A multi-core processor integrates more than one processing unit on a single chip [4]. Each unit is called a "core" and is independent. Multi-core is also called Chip Multi-Processor (CMP) because the cores fit on a single processor. The architecture of a multi-core processor generally includes several cores, one system bus, private or shared caches (there is always at least one cache level that is shared, which helps to speed up data transfers, reduce cache-coherency complexity and reduce data-storage redundancy) and a shared memory, as shown in Figure 2.1. It differs from a single-core processor in both architecture and parallel computing.

2.1 Multi-core operating system

Operating systems on multi-core are mainly divided into Symmetric Multi-Processing (SMP) and Asymmetric Multi-Processing (AMP). SMP requires homogeneous hardware with the same operating system on each core, while AMP is heterogeneous with different operating systems running on different cores. There are also bare-metal systems with no operating system running.

Figure 2.1. Multi-core Architecture (N cores, each with its own CPU and private L1 data and instruction caches, sharing an L2 cache, the system bus and a shared memory)

SMP systems

SMP systems (see Figure 2.2) are the most commonly used form; they allow any core to work on any task regardless of where the data is located in memory. There is a single operating system (OS) image for an SMP system. Each processor is able to access the shared memory and complete a given task, in contrast to a master/slave arrangement [5]. This makes it easy to distribute tasks among the cores to balance the workload dynamically.

Figure 2.2. Architecture of SMP System (one operating system image running applications across Core1 ... CoreN)

AMP systems

AMP systems (see Figure 2.3) are used in some embedded devices where cores are assigned different roles. Independent OS images are used on the cores, and each OS kernel has dedicated local memory as well as shared memory. Tasks cannot easily be moved between processors in AMP, and resources cannot be treated uniformly across the system [5].

Figure 2.3. Architecture of AMP System (different OS images, e.g. OSE and Linux, each running on its own core with its own applications)

2.2 Comparison with multi-processor & multi-threading

Multi-core systems differ from multi-processor systems and from multi-threading. To characterize multi-core systems clearly, we make a comparison below.

Multi-processor system

Multi-processor systems require more power because signals between cores are routed off-chip. Compared with a multi-processor system, a multi-core system is faster thanks to the inter-core bus and cache snooping [4]. The differences between them are shown in Table 2.1.

Multiple Processors                          | Multiple Cores
Two different chips connected by a bus       | Connected within a chip
Parallelism needs external software support  | Multiple processes run in parallel automatically
Heat consumption                             | Less heat consumption
-                                            | Lower package cost

Table 2.1. Comparison of Multiple Processors and Multiple Cores.

Multi-threading

Simultaneous Multi-Threading (SMT) belongs to instruction-level parallelism and can run on both single-core and multi-core. Multi-threading is used to avoid pipeline stalls: if one thread is waiting for a hardware resource to respond, another thread can take over and execute without waiting. However, if both threads are waiting for the same hardware resource, they will stall. The differences between multi-core and SMT are shown in Table 2.2.

Multiple Threads               | Multiple Cores
Instruction level parallelism  | Thread level parallelism
Share one core and L1 cache    | More than one set of cores and L1 caches
One large super-scalar core    | Several cores

Table 2.2. Comparison between Multiple Threads and Multiple Cores

2.3 Summary

Multi-core allows microprocessors to achieve a boost in performance without increasing the power consumption and the complexity of the hardware components. However, the improvement in performance can only be achieved when the multi-core architecture is combined with parallel programming. Multi-core confronts designers with difficulties like cache coherency and inter-core communication overhead. If these problems are handled improperly, performance may even decrease. To summarize, software designers should come up with an efficient parallel strategy to take full advantage of the multi-core architecture.

Chapter 3

Parallel Computing

Parallel computing executes computations simultaneously, in contrast to serial computing, which executes one instruction at a time. The main idea of parallelism is to break a problem into small parts, assign them to different execution units, execute them simultaneously and thus complete the problem faster [6]. Highly efficient parallel computation can improve the performance of the system, but programming parallel hardware is difficult because of the massive parallelism involved. Parallel computing models and challenges are discussed in this chapter.

3.1 Parallel performance metrics

There are various ways to evaluate the improvement in performance of a parallel computation. The actual speedup of a parallel computation can be measured as

    S_p = \frac{T_{serial}}{T_{parallel}}

In the ideal case, the upper bound of the speedup equals the number of CPUs. However, the speedup is limited not only by the number of CPUs, but also by the proportion of the program that can be parallelized, which can be estimated with Amdahl's law. By splitting a program into a serial part and a parallel part, the maximum speedup is

    S_p \le \frac{1}{(1 - P) + P/N}

where P is the proportion of the program that can be made parallel and N is the number of cores.


The upper limit of the speedup is determined by the serial fraction of the code, so by parallelizing as much as possible we can get better performance [4]. The basic idea of parallelism is to break a program of instructions into small pieces and execute them simultaneously. However, as the number of cores increases, the performance does not increase linearly due to the overhead caused by cache effects, extra synchronization and bus contention [7]. Therefore tasks (pieces of a program) need to be large enough to run in parallel with little overhead, but not so large that there is not enough work to run in parallel to keep the load balanced.
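To make Amdahl's law concrete, the short C program below (an illustration added here, not part of the original thesis) evaluates the speedup bound for an assumed parallel fraction on up to eight cores, matching the core count of the P4080.

    #include <stdio.h>

    /* Upper bound on speedup according to Amdahl's law:
     * S(P, N) = 1 / ((1 - P) + P / N)                     */
    static double amdahl_speedup(double p, int n)
    {
        return 1.0 / ((1.0 - p) + p / (double)n);
    }

    int main(void)
    {
        const double p = 0.9;   /* assume 90% of the program is parallelizable */
        for (int n = 1; n <= 8; n *= 2)
            printf("N = %d cores -> speedup <= %.2f\n", n, amdahl_speedup(p, n));
        return 0;
    }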

3.2 Levels of parallelism

There are different ways to implement parallelism to improve performance. By using parallel programming, we can decrease the time needed for a single problem and increase the throughput at the same time.

3.2.1 Instruction-level parallelism (ILP)

Instructions can be overlapped and executed in parallel when they are independent. With ILP, processors reorder the instruction pipeline, decompose instructions into sub-instructions and execute multiple instructions simultaneously. ILP is implicit parallelism, which reduces the latency of memory accesses by applying pipelining techniques and super-scalar architectures¹. Furthermore, by reordering instructions, processors can perform useful work instead of stalling on data and instruction dependencies [8].

3.2.2 Thread-level parallelism (TLP)

TLP is the main architecture for high performance multi-core or multi-processor systems. TLP means that when a thread is idle waiting for a memory access, another thread is initialized and run immediately, so that the pipeline can stay busy all the time. In addition, the throughput of the system is increased.

3.3 Parallel programming models

Parallel programming models exist as an abstraction above hardware and memory architectures [9]. They can be applied to any type of hardware or memory architecture. For example, a shared memory model on a distributed memory machine appears as shared memory to the users but is physically distributed. Several parallel programming models are described below.

¹Splitting an instruction into several stages: instruction fetch, instruction decode, execution, memory access, write back.

Shared memory

In this model, programs share a common piece of memory, which they access asynchronously. A typical example of this model is a global variable, which can be accessed and modified by all the programs. Shared variables can be protected by various mechanisms like locks (see Section 4.5), which help to control access to them [9].

Message passing

A distributed memory model is used in this model. Each processor owns its data in a private memory, and different processors exchange data by sending and receiving messages. Processors need to cooperate with each other; for example, a send must match a receive operation [9]. A comparison of the shared memory model and the message passing model is shown in Table 3.1.

                 | Shared memory               | Message passing
communication    | load and store              | send and receive
memory           | shared or private           | private
synchronization  | required when sharing data  | implicit

Table 3.1. Comparison of Shared Memory Model and Message Passing Model

3.4 Design in parallel

Though parallel computing improves performance, challenges like deadlock and race conditions exist because of competition for shared resources (processors, memory or devices) and sequential dependencies (data dependencies, communication and synchronization). To cope with these challenges, synchronization primitives are used, see Section 4.5. When designing in parallel, attention should be paid to the following aspects.

Decomposition

To convert a serial program into a parallel one, the first step is to break the program into several pieces. There are two ways to implement decomposition: domain decomposition and functional decomposition. Domain decomposition means that each task works on a partition of the data, while functional decomposition means that each task performs a partition of the overall work [10].


Inter-communication

If the partitioned tasks are assigned to different processors and they share data, communication must be performed, which introduces both latency and overhead. The cost of communication needs to be considered before decomposition. A process that stalls and waits for an event or some data is called blocking. Otherwise, the process is called non-blocking and can make progress without suspending while waiting.

Synchronization & data dependency

Synchronization is used to synchronize different tasks and ensure data correctness. Most synchronization operations, like semaphores, are based on memory barriers. A memory barrier makes the memory operations before the barrier visible to all the tasks before execution continues. Applications with little data dependency benefit most from a multi-core system because fewer of the costly synchronizations are needed.

Load balancing

In a multi-core system, the overall performance is limited by the slowest CPU. So it is important for each task to have a similar workload, keeping the workload on each CPU balanced. Dynamic work assignment can be used to keep the load balanced.

Granularity

Granularity is a quantitative measure of the ratio of computation to communication. According to the amount of work done between communication events, granularity is divided into fine-grained and coarse-grained parallelism. Fine-grained parallelism consists of a larger number of small tasks and requires more communication. It helps to keep the load balanced, at the cost of higher overhead.

3.5 Summary

To achieve better performance on multi-core, one should decompose a serial program into several fragments and execute them concurrently. Parallel computing enables concurrent computation but increases the programming complexity. Different levels of parallelism can be combined to gain better performance.

Designers need to pay attention to the challenges of parallelism. Parallel programs are designed upon parallelism models. The shared memory model and the message passing model are commonly used to implement parallelism. When applying the shared memory model, one should synchronize between processes to keep the memory coherent and to avoid other problems like deadlock. Solutions to these problems are given in Chapter 4.


Chapter 4

Threads

Threads are commonly used to implement parallelism in a multi-core system with shared memory. They are used to increase the throughput and decrease the idle time of processing units. There are several versions of threads; the most commonly used are POSIX threads (also called pthreads), a standardized C language threads API specified by the IEEE POSIX 1003.1c standard. Linux implements native POSIX threads, and our discussion is limited to pthreads in this chapter [11].

The pthread APIs can be grouped into four categories: thread management, mutex variables, condition variables and synchronization. The features of the pthread APIs make it simpler to guarantee memory coherency and implement synchronization.

4.1 Thread overview

A thread is an independent stream of instructions that can be scheduled to run as such by the operating system [12]. Threads are defined in the program environment and initialized by the compiler; they run simultaneously and independently within a process and are managed by the operating system. Threads are used to improve performance with less overhead and fewer system resources compared with processes. The actual execution unit of a thread is a processor [11].

4.1.1 Process & thread

A process includes the program in execution and all the resources that the program involves. Processes are created by the operating system and carry a large amount of information, like the program, control words, data and status. A thread is started by a process and is the basic execution element within a process, running as a logical flow. Several threads may execute in parallel within one process, which is then called a multi-threaded process. Each thread is independent because it has its own thread context, thread ID, pointer, stack, registers and condition word, and threads are identified by their thread IDs. A comparison between threads and processes is listed in Table 4.1.

Category           | Process                                                                  | Thread
Address space      | A process has its own address space, protected by the operating system  | Threads in the same process share the same address space
Interaction        | Shared locations in the operating system                                 | Shared memory within the process
Context switching  | Heavy: the entire process state must be preserved                        | Light: only the current register state needs to be saved

Table 4.1. Processes and Threads

Managing a thread requires fewer system resources and less overhead than managing a process. Multiple threads can overlap CPU work with I/O operations (while one thread is waiting for I/O, another thread is executed by the CPU). On the other hand, because of the threads, shared resources within the process must be synchronized, which increases the difficulty of writing and debugging the program.

4.1.2 Threaded program models

Parallel programming is well suited to threaded programs. Methods for designing parallel programs (like decomposing a serial application into small independent tasks) are given in Section 3.4. Here we discuss threaded program models.

Manager/worker: A manager is a single thread, which assigns work to other threads (workers). The manager takes charge of task assignments. The size of the worker pool can be either static or dynamic.

Pipeline: By breaking a task into smaller pieces, each thread takes care of one piece of code and executes in series. Different threads work concurrently and the task is executed in parallel.

Peer: Unlike the manager/worker model, threads are equal in the peer model. Each thread, a worker, can assign tasks to the others.


4.2 Thread management

A thread has four states: ready, running, blocked and terminated, which can be changed by the thread management APIs and by the status of shared resources. With these routines, threads can be created and attributes (joinable, scheduling) can be assigned.

Creating and terminating threads

There is one default thread per process, and other threads are created and initialized by the default one. Each thread is named by a unique ID (identifier). The threads become peers after they are created, which means there is neither hierarchy nor dependency between them. Each thread can create a new thread. Once its tasks are done, a thread terminates itself. Threads can also be terminated by other threads or by the main function. Thread attributes can be set by the arguments of the routines [11].

Joining and detaching threads

Joining is a way to synchronize threads. Once worker threads complete their tasks, they are joined with their master thread. Some threads do not need to be joined, so we detach them to free some system resources [11].
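A minimal pthread sketch of the management routines discussed above (creation, joining and detaching); the worker function and its argument are purely illustrative.

    #include <pthread.h>
    #include <stdio.h>

    /* Illustrative worker: prints its argument and returns it as the result. */
    static void *worker(void *arg)
    {
        long id = (long)arg;
        printf("worker %ld running\n", id);
        return (void *)id;
    }

    int main(void)
    {
        pthread_t t1, t2;
        void *result;

        pthread_create(&t1, NULL, worker, (void *)1L);   /* joinable by default */
        pthread_create(&t2, NULL, worker, (void *)2L);

        pthread_detach(t2);               /* t2 frees its resources when it exits */
        pthread_join(t1, &result);        /* wait for t1 and collect its result   */
        printf("t1 returned %ld\n", (long)result);
        return 0;
    }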

4.3 Mutual exclusion

Mutex is short for "mutual exclusion", which is used to protect data when multiple write operations occur. A mutex is one way to implement thread synchronization. Once a thread enters a critical section, other threads must wait until it has finished. Even if multiple threads ask for the same mutex, only one of them will succeed. This is useful in parallel execution because data on the critical path can only be modified by one thread at a time, so that races are avoided [11].

A mutex is implemented as a lock (see Section 4.5). Once it is created and initialized, threads attempt to lock it. While the one thread that succeeds in locking it performs its work, the losers block on that call, or they can avoid blocking by using trylock instead of lock. After a thread completes its work in the critical section, the mutex is released and becomes available for other threads to lock [13]. The structure of a mutex is shown in Figure 4.1.


Figure 4.1. Mutex Structure [14]
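The sketch below illustrates the lock/trylock/unlock pattern described above around a shared counter; it is a minimal example added for illustration, not code taken from Wool or OSE.

    #include <pthread.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static long counter = 0;              /* shared data in the critical section */

    void increment_blocking(void)
    {
        pthread_mutex_lock(&lock);        /* block until the mutex is acquired */
        counter++;
        pthread_mutex_unlock(&lock);      /* release it for the other threads  */
    }

    int increment_if_free(void)
    {
        if (pthread_mutex_trylock(&lock) == 0) {   /* do not block if contended */
            counter++;
            pthread_mutex_unlock(&lock);
            return 1;                     /* counter was updated                */
        }
        return 0;                         /* mutex was busy, caller may retry   */
    }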

4.4 Conditional variables

Conditional synchronization means a thread stays blocked until the system satisfies a condition. In other words, it makes the thread wait until the condition is satisfied, like waiting for a notification. It is another way to implement synchronization. Condition variables differ from mutexes because conditional synchronization synchronizes threads based on variable values instead of controlling thread access to protected data. What is more, multiple threads may be permitted access to the condition at the same time. The conditions are specified by the programmer [11].

Figure 4.2. Condition Variable [15]

A condition is implemented by sharing the status of the condition on the shared data. When a specific condition is satisfied, the thread is woken up. There are three operations on condition variables: wait(L)¹, signal(L)² and broadcast(L)³, which are atomic operations. While a thread is waiting on a condition variable, it is blocked until the condition is signaled. Signaling a condition variable is used to wake up another thread which is waiting for it. Broadcasting wakes up all threads that are in a blocking wait state [11]. In Figure 4.2, one thread waits on the condition ready, then wakes up and proceeds, while the other thread signals the condition ready.
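A condensed sketch of the wait/signal pattern of Figure 4.2, using a predicate ready guarded by a mutex; all names here are illustrative.

    #include <pthread.h>

    static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  ready_cond = PTHREAD_COND_INITIALIZER;
    static int ready = 0;                      /* the shared condition */

    void *waiter(void *arg)
    {
        pthread_mutex_lock(&m);
        while (!ready)                         /* re-check: wakeups may be spurious */
            pthread_cond_wait(&ready_cond, &m);  /* releases m while blocked */
        /* ... proceed, the condition now holds ... */
        pthread_mutex_unlock(&m);
        return NULL;
    }

    void *signaler(void *arg)
    {
        pthread_mutex_lock(&m);
        ready = 1;
        pthread_cond_signal(&ready_cond);      /* wake one waiter (broadcast wakes all) */
        pthread_mutex_unlock(&m);
        return NULL;
    }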

4.5 Synchronization

In a multi-core system, threads share memory and other resources, which requires synchronization to coordinate the parallel tasks, including serializing memory instructions, protecting shared resources and waiting for multiple tasks to reach a specified point. Synchronization keeps threads coherent and protects shared memory by constraining relative instruction orderings. However, synchronization can be a major factor in decreasing parallel speedup, because tasks have to wait for each other's completion.

4.5.1 Atomic operations

An atomic operation is a simple way to achieve synchronization by working on data types [13]. Atomic operations are performed with no possibility of interruption: an atomic operation is visible as either completed or not started, with no intermediate state [16]. Atomic operations are used to implement other synchronization primitives, like semaphores. They do not block competing threads when they access shared data, which makes it possible to achieve better performance [13].
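As an example of an atomic operation usable from C, the GCC __sync builtins expose compare-and-swap directly; the lock-free counter below is an illustration added here, not part of the thesis.

    /* Lock-free increment built on compare-and-swap (GCC __sync builtins). */
    static long counter = 0;

    void atomic_increment(void)
    {
        long old_val, new_val;
        do {
            old_val = counter;
            new_val = old_val + 1;
            /* Succeeds only if counter still equals old_val; otherwise retry. */
        } while (!__sync_bool_compare_and_swap(&counter, old_val, new_val));
    }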

4.5.2 Hardware primitives

Memory barriers

To achieve better performance, both CPUs and compilers reorder instructions to keep the pipeline fully utilized. However, after such reordering optimizations, the instructions which access shared memory may be performed out of order, which can lead to incorrect results, especially when there are data dependencies.

A memory barrier is a common way to synchronize threads on multi-core.

¹Release its lock and wait. Once it completes, it indicates that the lock is required by others.
²Execute the waiting thread once and then go on executing. The lock is still held by the original thread.
³Allow all the waiting threads to work; the lock is still held by the original one.


It is a non-blocking mechanism, implemented by instructions that ensure memory accesses are performed in the expected order by forcing the processor to see a load or store positioned in front of the barrier before it sees the ones after the barrier [13]. In a word, it enforces ordering on the memory operations of one thread and guarantees that all other threads have a consistent view of memory in the system. Memory barriers are defined and specified by the processor architecture.
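On the PowerPC e500mc target used later in this thesis, barriers are issued with the sync and lwsync instructions. The gcc inline-assembler wrappers below are a minimal sketch of how such barriers are typically embedded in C; they are illustrative and assume a PowerPC core that supports lwsync.

    /* Full barrier: orders all loads and stores (heavyweight sync). */
    static inline void full_barrier(void)
    {
        __asm__ __volatile__("sync" ::: "memory");
    }

    /* Lightweight barrier: orders accesses to cacheable memory (lwsync). */
    static inline void lw_barrier(void)
    {
        __asm__ __volatile__("lwsync" ::: "memory");
    }

    /* Compiler-only barrier: stops gcc from reordering across this point. */
    static inline void compiler_barrier(void)
    {
        __asm__ __volatile__("" ::: "memory");
    }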

4.6 Threads on multi-core

Multi-threading on multi-core differs from that on a single-core system, see Table 4.2. Threads need not wait for resources (like CPUs) on multi-core because they run independently on their own cores and do not compete for resources such as the floating point unit. There are two main differences between multi-threading on single-core and on multi-core: caching and priority.

Advantages                                 | Disadvantages
Increased performance                      | Data races
Better resource utilization                | Deadlocks
Efficient data sharing                     | Code complexity
Fewer resources needed to switch contexts  | Portability issues
Simple communication between tasks         | Testing and debugging difficulty

Table 4.2. Advantages and Disadvantages of Threads on Multi-core

Cache synchronization becomes an issue for threads on multi-core, because shared memory modified by two threads may cause them to interfere with each other [13]. Besides, threads with higher priority cannot ignore lower-priority ones in a multi-core system, because they can execute in parallel, which may lead the system into an unstable state.

4.7 Summary

The use of threads enables multi-core systems to achieve performance boosts by parallelizing computations. Threads on multi-core need to be synchronized to keep private caches coherent and to protect shared resources. Most synchronization techniques, like locks and atomic operations, generally involve the use of memory barriers and kernel-level synchronization [13], which are hardware dependent. However, synchronization ensures correctness at the sacrifice of performance: the stricter a barrier is, the more costly it is.


Parallel overhead is the barrier to achieving the desired speedup. These overheads and latencies are caused by thread creation, synchronization and scheduling mechanisms as well as inter-thread communication. To get the desired speedup, one should reduce bus contention as well as communication overhead. Therefore, a scheduler for a multi-core system is required that keeps the load balanced over the threads (so that the threads work in full parallelism) without inducing much overhead. It is up to the designer to trade off sufficient parallelism against parallel overhead and make the final decision.


Chapter 5

Wool — A low overhead C scheduler

To achieve good performance on multi-core, problems are broken down into independent tasks that execute concurrently. It is of great importance to apply an efficient scheduler to multi-threaded computations to keep the load balanced with a low overhead. Work stealing and leap frogging are dedicated to scheduling multi-threaded computations, and help the system to achieve good performance by distributing work to underutilized processors with a minimized communication overhead between threads.

Wool is a C library applying work stealing and leapfrogging, developed by Karl-Filip Faxén at SICS, aiming at improving the performance of multi-core systems by distributing sequential programs over multiple cores. Wool is a low overhead user level task scheduler, providing lightweight tasks on top of pthreads [1] (see Section 3.3). According to the test results shown in his paper [17], the performance of Wool is comparable with that of Cilk, Intel TBB and OpenMP on an eight core system.

5.1 Basic concepts

5.1.1 Worker

Each thread acts as a worker and is initialized with the same attributes. In Wool, each physical processor is assigned one thread by default, which means that each processor owns one worker. It is possible for each processor to have multiple workers, but one worker per processor is recommended because context switches and cache collisions on the same processor would increase the overhead. On the other hand, the number of workers should be large enough to keep each processor busy [1]. Workers act as peers in Wool (see Section 4.1.2) and take care of task management with a pool of tasks that are ready to execute [17].


5.1.2 Task pool

A task pool is a pool of tasks managed by a worker. As the task pool grows and shrinks dynamically, Wool implements it as a stack (a double-ended queue) [17]. Newly spawned tasks are placed on top of the pool, and old tasks can be stolen by other workers from the bottom of the queue; more details are given in Section 5.1.5.

5.1.3 Granularity in Wool

As mentioned in Section 3.4, granularity measures the ratio of computation to communication, which reflects the performance of the task parallelism. The efficiency of parallelism can be measured in two ways: the task granularity

    G_t = \frac{T_s}{N_t}    (5.1)

and the load balancing granularity

    G_l = \frac{T_s}{N_m}    (5.2)

where
T_s: the sequential execution time (with no task overheads),
N_t: the number of tasks spawned,
N_m: the number of task migrations, in our case the number of steals.

Task granularity represents the average useful work per task. Lower task granularity means higher overhead. Load balancing granularity measures the average useful work per steal [1].

Wool uses a fine grained parallelism strategy, which is good for load balancing. There is a finest grain constant that defines the smallest sequential program to execute. It is mainly used in the loop functions, where more than one task may be executed on one worker as sequential tasks instead of parallel tasks if the cost of one iteration is less than the finest grain constant. By defining this constant, Wool increases the load balancing granularity with a lower task granularity.

5.1.4 Load balance in Wool

The concept of load balance was introduced in Section 3.4. Load balance is an important factor in the performance of the system. A good scheduler distributes work evenly to each processor. In Wool, work stealing takes tasks from the bottom of the task pool, which is located closest to the root. In this way, fewer steals are needed to distribute the work evenly, because the stolen regions of parallelism are relatively larger [18]. A deeper discussion is given in Section 5.2.

5.1.5 Structure

A multi-threaded computation consists of multiple threads, with one thread per processor. A thread (usually running one worker) takes care of a pool of tasks, which is organized as a double-ended queue. Two pointers point to the top and the bottom of the queue, respectively. Newly created tasks are added to the top of the queue and are ready to execute, so the queue is also called a ready queue. A ready queue is treated as a stack by its own processor, which pushes and pops tasks at the top of the queue, LIFO¹. But it is regarded as a queue by other processors (which attempt to steal work), and tasks are stolen from the bottom of the queue, FIFO² [19]. Figure 5.1 shows this work stealing structure, and a simplified sketch of such a queue is given below.
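The following is a simplified sketch (my own illustration, not Wool's actual data structure) of a per-worker ready queue that its owner uses as a LIFO stack and thieves use as a FIFO queue:

    #include <stddef.h>

    #define POOL_SIZE 1024

    typedef struct task {
        void (*run)(struct task *);     /* the work this task performs  */
    } task_t;

    typedef struct worker {
        task_t *pool[POOL_SIZE];        /* the ready queue              */
        int top;                        /* next free slot (owner end)   */
        int bottom;                     /* oldest task (thief end)      */
    } worker_t;

    /* The owner pushes newly spawned tasks on the top (LIFO). */
    void push(worker_t *w, task_t *t) { w->pool[w->top++] = t; }

    /* The owner pops the most recently spawned task from the top. */
    task_t *pop(worker_t *w)
    {
        return (w->top > w->bottom) ? w->pool[--w->top] : NULL;
    }

    /* A thief steals the oldest task from the bottom (FIFO). A real
     * implementation needs synchronization (e.g. CAS) between the owner
     * and concurrent thieves; it is omitted here for clarity. */
    task_t *steal(worker_t *w)
    {
        return (w->top > w->bottom) ? w->pool[w->bottom++] : NULL;
    }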

Tasks are initialized and ready to execute when they are spawned. Newly created tasks are always inserted at the top of the task pool and are called children tasks, while the tasks that create them are called parent tasks. Parent tasks are located closer to the root than children tasks in a given task pool. The model of Wool is shown in Figure 5.1.

5.2 Work stealing

Work stealing is an algorithm for task parallelism and load balancing across processors [20]. Whenever a worker has no tasks to do, it attempts to steal tasks from another worker. The victim³ is chosen randomly. If the victim has no tasks in its own task pool, the steal attempt fails, and the thief chooses another victim. In Wool, workers choose victims linearly from a random starting choice of victim [20]. Once a thief⁴ steals successfully, it executes the task and returns the result to the victim; then the steal process ends. In this way, underutilized processors keep working instead of idling while other processors are busy.

¹Last In First Out
²First In First Out
³Victim: the worker which has tasks in its task pool; it becomes a victim when tasks are stolen by other workers.
⁴Thief: the worker which steals tasks from other workers.

Figure 5.1. Work Stealing Structure (each worker, one per processor, pushes and pops tasks at the top of its task pool, while thieves steal from the bottom)

A worker can be in one of three states.

Working: A worker is working on a task.

Stealing: A worker has no task to do and starts to steal. It keeps on stealing until it gets a task to do.

Spin: A worker stalls and waits for the results of the stolen task which has not been finished yet.

The work stealing algorithm starts with only the main task in the ready queue. The main task spawns tasks, and the other workers start to work by stealing them. The work stealing process is shown in Figure 5.1.

We assume that at some point in time there are four tasks in the system: three tasks in Worker1 and one task in WorkerN. There is no task in Worker2. In Worker1, Task A is the parent task, and the tasks on top of it are children tasks. Figure 5.2 shows the next time stamp: Worker1 spawns a new task (Task E) and puts it on top of its task pool; Worker2 has no work to do, so it steals from other workers randomly. It finds that there is a task in Worker1 and steals the task (Task A), so Worker2 becomes a thief; WorkerN has its own task, so it executes it (Task D). In the next time stamp, suppose that Task A spawns new tasks Task F and Task G. They are children tasks of Task A, so they are put in the same worker as their parent task (Task A), i.e. in Worker2.

Figure 5.2. Work Stealing (Worker2, the thief, has stolen Task A from Worker1, the victim, which has spawned Task E, while WorkerN executes Task D)

Compared with work sharing

Work stealing is lazy scheduling, which means that tasks are not handed over to other processors until those processors are idle. It achieves lower overhead than eager scheduling, which keeps assigning tasks in order to keep the load balanced. A typical eager scheduling strategy is work sharing, which attempts to migrate some of the tasks to other processors as soon as new tasks are generated, in order to distribute the workload [21].

One advantage of a work stealing algorithm is that tasks are migrated less frequently than with work sharing. Another advantage is that, by stealing tasks from the bottom of the queue, parent tasks are stolen, and these will generate children tasks later. Because parent tasks are close to the root, it is more efficient to steal one parent task than several children tasks. This keeps the load balanced and reduces overhead by reducing the number of steals.

5.3 Leap frogging

Leap frogging is a way to optimize the work stealing algorithm. It helps to enhance the load balance and improves performance by reducing the blocking time. In the work stealing algorithm, if a task has been stolen and is in execution, the victim is blocked and must wait for the result to come out. Leap frogging is implemented by letting the victim steal tasks from the thief while waiting, which in turn helps to finish its own task. In this way, it avoids deadlock and solves the memory problem, since the task that the victim steals back is always a task that the victim would have executed if no steal had happened [23]. Stealing work back from a random worker instead would have a number of drawbacks; for instance, a task pool might grow beyond its size, since stealing adds a new stack on top of the blocked task.

The leap frogging process is shown in Figure 5.3, which is the next time stamp after Figure 5.2. After Worker2 steals Task A, Task A spawns two more tasks: Task F and Task G. At the same time, Worker1 completes Task E, Task C and Task B, and WorkerN still works on Task D.

As there is no task left in Worker1, it begins to synchronize with Task A, which has been stolen by Worker2 and is still in execution. Instead of waiting for Task A to complete, Worker1 tries to steal tasks back from the thief and helps to finish Task A (a parent task can only finish once all its children tasks are done). So Task F is stolen back because it is at the bottom of the ready queue.

In this way, the workers can work in parallel and the load balance is enhanced. However, if the task that the victim steals back is very small and no other tasks can be stolen back, the victim will stall and wait for the result to come out. Parallelism is then lost in this extreme situation.

Figure 5.3. Leap Frogging (while its Task A is still being executed by the thief Worker2, Worker1 steals Task F back from the bottom of Worker2's ready queue)


5.4 Direct task stack algorithm

The Wool scheduler uses macros and inline functions to implement independent task parallelism. The basic operations are spawn and sync, which behave like an asynchronous function call. With a spawn, a task is created and put into the worker's ready queue, but it executes only when control reaches the corresponding sync, rather than being executed immediately [17]. The task may be done either by the processor which spawned it, or by a thief which steals it.
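As an illustration of this programming style, the sketch below shows the classic parallel Fibonacci written with the task macros described in Faxén's Wool papers (TASK_1, SPAWN, CALL, SYNC). The header name and the exact macro signatures are assumptions here and may differ between Wool versions, so treat this as a sketch rather than verified Wool code.

    #include "wool.h"             /* Wool task macros (assumed header name) */

    /* fib(n) spawns fib(n-1) as a stealable task, computes fib(n-2)
     * itself, then syncs to collect the spawned result.              */
    TASK_1(int, fib, int, n)
    {
        if (n < 2)
            return n;

        SPAWN(fib, n - 1);        /* put a child task in the ready queue    */
        int b = CALL(fib, n - 2); /* execute the other half inline          */
        int a = SYNC(fib);        /* inline it, or fetch the thief's result */
        return a + b;
    }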

5.4.1 Spawn

By spawning a task, the task is initialized and space is allocated for it. The task is put into the pool of the worker that spawns it. The task is neither executed nor returns a value until control reaches a sync. Any task in the task pool can be stolen by other workers. If it has not been stolen when control reaches the corresponding sync, it is executed inline by the same worker that spawned it. Otherwise, the worker gets the result back from the thief [1].

Figure 5.4. Spawn (a newly spawned task, Task4, is placed in the first empty slot at the top of Worker1's task pool and the top pointer moves one step up)

A spawn operation performs the following steps. First, it allocates a task descriptor on top of the task pool and initializes the descriptor with the input arguments. Secondly, it initializes the status of the task to NOT STOLEN; this field changes after the task is stolen by another worker. Then it records the owner of the task: if the task is not stolen, the owner is the worker itself, otherwise the owner is a thief. This field is important when synchronizing the task or leapfrogging. A task with a wrapper function⁵ can either be stolen by others or inlined by its owner. At the end of the process, the top pointer is moved one step up and points to the first blank slot, where the next spawned task will be put [19].
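In code form, the steps above amount to something like the following simplified sketch (independent of the earlier ready-queue sketch); the type and field names are hypothetical and chosen for illustration, not taken from Wool's implementation.

    typedef enum { NOT_STOLEN, STOLEN, DONE } task_status_t;

    typedef struct task {
        void (*wrapper)(struct task *);   /* runs the task body, stores the result */
        void *args;                       /* input arguments                       */
        task_status_t status;             /* NOT_STOLEN until a thief takes it     */
        struct worker *owner;             /* worker (or thief) executing the task  */
    } task_t;

    typedef struct worker {
        task_t pool[1024];                /* this worker's task pool               */
        int    top;                       /* first free slot above the pool        */
    } worker_t;

    /* Simplified spawn: fill in the next descriptor on top of the pool,
     * then move the top pointer so the task becomes visible and stealable. */
    void spawn(worker_t *self, void (*wrapper)(task_t *), void *args)
    {
        task_t *t  = &self->pool[self->top];
        t->wrapper = wrapper;
        t->args    = args;
        t->status  = NOT_STOLEN;
        t->owner   = self;
        self->top += 1;
    }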

5.4.2 Sync

A pair of spawn & sync works like a LIFO (last in, first out) stack: a spawn pushes a task onto the task pool while a sync pops the top task off. When the code reaches a sync, it finds the most recently spawned and unsynchronized task and executes it. By synchronizing, a task is popped from the task pool and resolved in one of the following ways, depending on the situation.

Case 1: If the task is not stolen, the worker itself executes it with a call.

Case 2: If the task is stolen and finished, the result is returned to the victim.

Case 3: If the task is stolen but not finished, leap frogging is performed.
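Continuing the simplified types from the spawn sketch in Section 5.4.1, the three cases can be expressed roughly as follows; leapfrog() is a hypothetical helper standing in for the leap frogging of Section 5.3, and this is again a sketch rather than Wool's actual code.

    void leapfrog(worker_t *self, worker_t *thief);   /* hypothetical helper */

    /* Simplified sync: pop the most recently spawned task and resolve it. */
    void sync_task(worker_t *self)
    {
        task_t *t = &self->pool[--self->top];   /* LIFO: latest unsynced task */

        if (t->status == NOT_STOLEN) {
            t->wrapper(t);                      /* Case 1: execute it inline  */
        } else {
            while (t->status != DONE) {
                /* Case 3: stolen but unfinished -- leapfrog by stealing tasks
                 * back from the thief instead of spinning idle.              */
                leapfrog(self, t->owner);
            }
            /* Case 2: stolen and finished -- the result is now available in
             * the task descriptor left behind by the thief.                  */
        }
    }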

5.5 Wool optimization

Work stealing and leap frogging reduce the communication overhead and keep the load balanced in a multi-core system. However, as the number of cores increases to 50 or more, the random work stealing algorithm leads to significant overhead, which affects the work stealing efficiency and cannot be ignored. Non-random victim selection has been proposed to solve this problem. Two advanced algorithms are discussed in this section: sampling victim selection and set based victim selection.

5.5.1 Sampling victim selection

Instead of stealing work randomly from the first victim found, a thief samples several workers before each steal operation and chooses to steal the task which is closest to the root. In the task tree, the tasks closer to the root are typically larger, so stealing tasks close to the root contributes to load balance with smaller overhead (fewer steal operations take place because larger tasks are stolen). Sampling victims and selecting among them takes extra time, but not as much as it takes to complete one steal operation. When the number of cores increases to a larger number, like 128 cores, the performance is much improved [18].

⁵The wrapper executes tasks and returns the value of the task functions.


5.5.2 Set based victim selection

When the number of thieves is significant, the contention for the task pools is fierce, which means many more steal attempts compared with the number of successful steals. In this case, each thief only steals from a subset of workers, given by a private random permutation P of the indices of the other workers (these workers are its potential victims) [18]. When a steal starts, the thief picks a random starting point in the permutation and proceeds through it attempting to steal. If there is no work to steal, it starts over from the starting point [18].

5.6 Other multi-threaded scheduler

Cilk

Cilk⁶ was designed by the MIT Laboratory for Computer Science; it is a general-purpose programming language designed for multi-threaded parallel computing. Cilk provides simple linguistic extensions to ANSI C, for example the keywords spawn and sync, to implement a work stealing scheduler. Cilk++ is open source, but the compiler (the ICC compiler with Intel Cilk Plus) and tool suite are commercial.

Intel TBB

Intel TBB⁷ is short for "Threading Building Blocks". It is open source and provides task-based parallelism in C++. The TBB library manages threads dynamically and provides templates for common parallel patterns. It is implemented with a work stealing scheduler and can keep the load balanced automatically.

OpenMP

The OpenMP API⁸ is dedicated to shared memory parallel programming. It is built on top of native threads. A number of compilers implement the OpenMP API, like gcc and Visual Studio. Supported languages include C/C++ and Fortran. However, it is designed for shared memory systems and does not work for distributed memory systems.

⁶http://supertech.csail.mit.edu/cilk/
⁷http://threadingbuildingblocks.org/
⁸http://openmp.org/wp/


Wool Optimizations & Limitations

Compared with the other multi-threaded schedulers, Wool is much simpler and has good performance at the cost of some limitations; for instance, only independent tasks may be executed in parallel [22]. In addition, task descriptors have a fixed size in the current Wool⁹ [24, 25].

5.7 Summary

Wool is a macro-based parallelism library in C. It is built on two basic operations: spawn and sync. Wool focuses on independent fine grained parallelism, based on work stealing and leap frogging. Its performance is comparable to Cilk, TBB and OpenMP. However, it is limited to independent tasks, which makes it less flexible.

⁹To gain the desired speedup, Wool makes all task descriptors the same size, to simplify the management of the task pool and to control the total size of the task arguments.

Part II

Implementation


Chapter 6

OSE

Enea OSE is short for "Operating System Embedded". It has two types of kernels: the OSE Real Time Kernel for embedded systems and the OSE Soft Kernel for host computers (like UNIX workstations and PCs). Both kernels can work in single-core and multi-core systems [26]. OSE is a distributed operating system based on message passing. This chapter is based on Enea documents and white papers [26, 27, 28, 29, 30].

6.1 OSE fundamentals

Load module

Modules are applications. They can be included in the core module (the core module is a module that is linked with the kernel at compile time), or they can be built as separately linked load modules and loaded at runtime [27]. Each load module is assigned a piece of memory in the system. It consists of a block of processes and a block pool. Memory is shared by the processes within a block. Communication within or across load modules is done by message passing. Each load module is linked to one core dynamically at run time, and parallelism is realized at the load module level [26]. Figure 6.1 shows the structure of OSE.

Memory pool

A memory pool is the basic type of memory, in which buffers and stacks are allocated. There is one global memory pool for system processes and data, which is called the system pool and is allocated in kernel memory. There are also local pools which belong to their own processes. These pools can be created dynamically or statically. The maximum size and the fragment sizes can also be changed dynamically.


Figure 6.1. Processes in Blocks with Pools in Domains [26]

Block

A block is used to group OSE processes together, and acts like a process in that it can be started, stopped and killed. Each block may have its own memory pool. If a pool becomes corrupted, this only affects the block connected to it, without any influence on other blocks. Normally, a new process (child process) is part of the same block as its parent unless otherwise specified when it is created.

Domain A domain is a memory region containing programs, which is formed by grouping one or several pools into a separate "domain". This avoids the danger that arises when a signal contains a pointer into the memory pool of the sender, since the receiving process would then have the ability to destroy that pool. By forming a domain, the user can choose to copy the signal buffer from the sender when a signal is sent across segment boundaries.

6.2 OSE process & IPC

The OSE kernel is based on message passing between processes. This IPC (inter-process communication) is implemented as a simple API between processes in a distributed single-core or multi-core system [29]. OSE processes are equivalent to POSIX threads with some special features, including an individual stack, a set of specific variables and register values. Processes within a load module share the CPU time allocated by the kernel based on their priorities. A process is therefore not always running; instead it has three states: ready, waiting and running (almost the same as a thread, see Chapter 4).

A static process is created by the kernel at the start of a load module and is not allowed to be killed during the existence of the load module. In contrast, dynamic processes can be created, configured and killed during run-time. Each process is assigned a process identity (ID) for its lifetime.

Figure 6.2. Message Passing

Inter-process communication and synchronization are handled by signals. A signal is used either as an acknowledgement message (semaphores, barriers or monitors) or as a data-carrying message. Each process is assigned one unique signal queue. Each message carries information about the sender, the receiver and the owner, which makes it easier to trace and redirect. A message can only be received by the dedicated process it is sent to, and it is up to that process to choose which message to accept. This scheme is called Direct Asynchronous Message Passing (DAMP). Message passing between cores is supported by OSE using the same message passing API as in the single-core version of OSE [28]. The message passing process is shown in Figure 6.2.
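To make the signal model concrete, the sketch below shows roughly how a data-carrying signal could be sent and received with the classic OSE signal calls (alloc, send, receive, free_buf) described in the OSE API reference [28]. The signal number and the signal struct are hypothetical and only for illustration.

#include "ose.h"                     /* OSE system header (assumed available) */

#define DATA_SIG 1001                /* hypothetical signal number */

struct data_sig {
    SIGSELECT sig_no;                /* every OSE signal starts with its number */
    int payload;
};

union SIGNAL {                       /* application-defined signal union */
    SIGSELECT sig_no;
    struct data_sig data;
};

/* Sender: allocate a signal buffer, fill it and pass ownership to the receiver. */
static void send_data(PROCESS receiver, int value)
{
    union SIGNAL *sig = alloc(sizeof(struct data_sig), DATA_SIG);
    sig->data.payload = value;
    send(&sig, receiver);
}

/* Receiver: wait for the signal, read the payload and return the buffer. */
static int wait_for_data(void)
{
    static SIGSELECT sel[] = { 1, DATA_SIG };    /* first entry: number of signal numbers */
    union SIGNAL *sig = receive(sel);
    int value = sig->data.payload;
    free_buf(&sig);
    return value;
}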

6.3 OSE for multi-core

The OSE multi-core version is available for PowerPC from OSE 5.4 and later and is called MCE (Multi-core Embedded). The architecture of MCE is shown in Figure 6.3. MCE is a hybrid of SMP and AMP (see Section 2.1): it behaves as SMP on the application level, with a homogeneous single OS image and a shared memory programming model, and it has AMP characteristics on the kernel level, with multiple schedulers for multiple cores and support for message passing [30]. One OSE image controls all the cores and shared resources. Cores are marked with cpu_id; cpu_id 0 is the main execution unit from which the system is started. OSE supports SMP transparently, so applications can be moved from single-core to multi-core without modification. With the kernel thread running on the main execution unit, programs in load modules are allowed to run on any execution unit. Entire programs can be moved between execution units while running, apart from interrupt or timer interrupt processes. Programs can also be locked to a given execution unit by the designer [26].

6.3.1 OSE load balancing
OSE offers signal interfaces in the program manager to measure load statistics and to move programs between execution units. Applications regard OSE as an SMP system since it handles the distribution of the work load over the cores, but to achieve determinism requirements it is up to the applications to control load balancing. Parallelism is at the load module level [33]. Thread-level parallelism (see Section 3.2.2) and dynamic load balancing are therefore desirable in OSE, which is what Wool provides.

6.3.2 Message passing APIs
The OSE message passing APIs are supported in MCE for communication across cores. The message passing model fits multi-core better, because in the shared memory model the inter-communication overhead grows with the number of cores; with a large number of cores, the shared memory model has a higher overhead than message passing. Moreover, explicit synchronization is not needed because it is implicit in the message passing itself. However, message passing may also induce higher latency when parallel programs have many dependencies.

6.4 OSE pthreads

POSIX threads are supported by OSE. The OSE pthread is a pthread emulation layer that implements most of the thread standard in POSIX 9945-1 (1996), which is essentially 1003.1c [27]. OSE implements a subset of the POSIX thread API, dedicated to simplifying the porting process for POSIX-thread-based applications. It is nevertheless recommended to use native OSE processes, which are more efficient than OSE pthreads.

Figure 6.3. MCE: Hybrid AMP/SMP Multiprocessing [31]

An OSE pthread can be regarded as a prioritized OSE process, except that killing a thread is not allowed (it can be terminated instead). Fast semaphores are used to implement OSE pthread mutexes, so they cannot be used for other purposes. Pthreads in the core module need no special configuration, while pthreads in load modules require that the program is configured in shared mode [27].

6.5 Summary

First, OSE is based on the message passing model, which is a good solution for multi-core. Secondly, OSE supports POSIX threads by mapping OSE processes to pthreads. Most pthread functions are supported, which makes OSE a friendly programming environment for migrating POSIX-compliant applications. Thirdly, the GCC compiler is supported by OSE, which makes it easier to port Wool, since the same compiler is used when Wool is compiled on Linux. Last but not least, LINX helps to communicate across cores in MCE regardless of the operating system, which makes it possible to build a version of Wool (with either pthreads or signals) that could run on a platform with multiple operating systems.


Chapter 7

P4080 and TILE64

The multi-core platforms Freescale QorIQ P4080 (eight cores) and Tilera TILE64 (64 cores) are discussed in this chapter with regard to their features and architectures. The Freescale P4080 is the target board for the porting; TILE64 is discussed for comparison.

7.1 Features

7.1.1 Instruction set
The instruction set architecture (ISA) is one of the most important features of a computer architecture related to programming. An instruction set includes a specification of a set of instructions, data types, addressing modes, etc. The operation bits of an instruction set describe the number of bits an instruction operates on in a specific CPU; mainstream CPUs are either 32-bit or 64-bit. Instruction sets can also be classified into CISC and RISC. RISC includes the most frequently used instructions, with the same number of bits in every instruction, while CISC emphasizes hardware with complex instructions, so the code size can be smaller. Several ISAs are compared in Table 7.1.

Architecture   Bits    Instruction set
PowerPC        32/64   RISC
Tile           32      RISC
IA-64          64      EPIC1
SPARC          64      RISC
X86-64         64      CISC
X86            32      CISC

Table 7.1. Instruction Set Comparison


Programs built on different instruction sets differ a lot; one CISC-style instruction may need to be implemented by several RISC-style instructions.

7.1.2 Memory consistency
To enhance the performance of multi-core systems, large caches are used to reduce the overhead of memory accesses. This raises the problem that memory operations may be performed out of order and that the same memory location may present different values to different processors [32]. Different computer architectures define different models addressing this problem for their memory accesses, called memory consistency models.

The memory consistency model is a contract between the shared memory and the programs running on it [33]. Models are mainly divided into sequential consistency and relaxed consistency. Sequential consistency performs the accesses to memory in the same order as the program order of each individual processor. However, sequential consistency is too restrictive and causes a lot of overhead. Relaxed consistency means that instruction reorderings are permitted, but synchronization is required when there are data dependencies. In a relaxed consistency model, memory barriers are needed to constrain the memory access order. Most processors define a relaxed memory consistency model.

Out-of-order Out-of-order execution means that the execution order2 of a program differs from the program order3 due to both compiler and CPU implementation optimizations. The perceived order4, which is determined by caching, interconnect and memory-system optimizations, may in turn differ from the execution order. Out-of-order execution is an optimization that gives better performance, but it may lead to unexpected results. Here is an example of out-of-order execution.

Initial state: x = 0, s = 0

    P1                      P2
    x = 42;                 while s == 0
    // memory barrier           ;
    s = 1;                  print x;

2The order that the individual instructions are executed on a given CPU.
3The order specified in the code of a given CPU.
4The order that a given CPU perceives its and other CPUs' memory operations.


If this piece of code is executed in program order, the result will be 42, since x is printed after the execution of s = 1. If the program is executed or perceived out of order, the result may be 0, because P2 may see s = 1 before it sees x = 42. Memory barriers are needed for protection in this case.
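The same scenario can be expressed in C. As a minimal sketch (not code from the thesis), gcc's __sync_synchronize() builtin can serve as a full memory barrier on both sides:

#include <stdio.h>

volatile int x = 0, s = 0;

void p1(void)                        /* producer, runs on one core */
{
    x = 42;
    __sync_synchronize();            /* make the write to x visible before s */
    s = 1;
}

void p2(void)                        /* consumer, runs on another core */
{
    while (s == 0)
        ;                            /* spin until the flag is set */
    __sync_synchronize();            /* order the flag read before the read of x */
    printf("%d\n", x);               /* now guaranteed to print 42 */
}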

Memory consistency models for different microprocessors
Different processors supply different memory consistency models. For instance, X86 and Sparc have strict constraints and do not reorder write operations, while PowerPC and IA64 have weaker constraints and need memory barriers to constrain the execution order and the perceived order. Table 7.2 lists the memory consistency models defined by different processors [36, 32, 38].

Order                IA64   PowerPC   Sparc   X86   TILE
Read after read?     Y      Y         N       N     Y
Read after write?    Y      Y         N       N     Y
Write after write?   Y      Y         N       N     Y
Write after read?    Y      Y         Y       Y     Y

Table 7.2. Memory Consistency Models

7.1.3 Memory barrier
Memory barriers (memory fences) are mainly used with weak consistency models to constrain the memory access order and ensure data correctness. There are different kinds of barriers; for example, a store fence (SFENCE) constrains the execution order of write operations to be the same as the program order. Generally, the stricter the memory barrier, the more costly it is, since the pipeline is forced to drop the pre-fetched instructions. Developers should therefore only use memory barriers when necessary. Memory barriers are hardware dependent and are defined by the processor vendors. Table 7.3 lists some types of memory barriers and their instructions.

Architecture   SFENCE               MFENCE
sparc          membar(StoreStore)   membar(StoreLoad or StoreStore)
x86_64         sfence               mfence
PowerPC        lwsync               msync
TILE           MF                   MF

Table 7.3. Memory Barriers & Instruction


7.2 Freescale QorIQ P4080 platform

7.2.1 P4080 architecture
The Freescale P4080 QorIQ integrated multi-core communication processor5 is based on eight Power Architecture6 processor cores – e500mc7.

The P4080 is a high performance networking platform, which can be used in routers, switches, base station controllers, and general-purpose embedded computing systems. Compared with multiple discrete devices, it offers a better performance and simplifies the board design [35].

Figure 7.1. P4080.

7.2.2 e500mc core
The P4080 is based on eight e500mc cores and supports both symmetric and asymmetric modes. Each core is a 32-bit low-power processor. The e500mc core is based on the Power Architecture technology dedicated to embedded systems. The e500mc is a super-scalar dual-issue processor (two instructions per clock cycle), which supports both out-of-order execution and in-order completion. With a seven-stage pipeline, e500mc cores are able to perform more instructions per clock [35].

5Freescale Semiconductor
6Power Architecture is a broad term to describe similar RISC instruction sets for microprocessors developed and manufactured by such companies as IBM, Freescale, AMCC, Tundra and P.A. Semi.
7e500: A 32-bit microprocessor core based on Power Architecture [34].


Memory consistency model
The Power ISA provides a weakly consistent memory access model to create opportunities for the processor to reschedule memory transactions [36]:
1. Read after read may be reordered, unless caching-inhibited storage is in use.
2. Write after write can be reordered.
3. Read after write and write after read can only be protected by msync.

Memory barrier primitives
msync: ensures that all reads and writes preceding msync have completed before subsequent instructions execute.
lwarx and stwcx: these instructions are used to perform a read-modify-write operation on memory. They ensure that only one processor modifies the memory location between the execution of the lwarx instruction and the stwcx instruction. With the combination of lwarx and stwcx, operations like prefetch and compare-and-exchange can be implemented.
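On PowerPC targets, gcc's atomic builtins are compiled down to exactly this lwarx/stwcx. pattern, so a portable way to obtain a compare-and-exchange in C (a generic sketch, not code from the thesis) is:

/* Returns the previous value of *p; the swap succeeded if it equals 'expected'. */
static inline int compare_and_exchange(volatile int *p, int expected, int desired)
{
    return __sync_val_compare_and_swap(p, expected, desired);
}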

7.3 Tilera’s TILE64

7.3.1 TILE64 architecture
TILE64 is a multi-core processor with a mesh network of 64 cores (each core is called a tile). It can be classified as a Massively Parallel Processor Array (MPPA). TILE64 delivers scalable performance, power efficiency and low processing latency [38].

7.3.2 TILE
A TILE core has a 32-bit RISC instruction set with a three-way VLIW8 pipeline for instruction-level parallelism.

Memory consistency model
There are two properties of the TILE memory consistency model: instruction reordering rules and store atomicity [38]. The model is a relaxed memory model (see Section 7.1.2). For a given processor, memory accesses, as well as the order in which they become visible to other processors, may be reordered, except in the following cases [38]:
1. Data dependencies within a single processor are enforced, including read after write, write after write and write after read.

8VLIW: Very long instruction word, which refers to a processor architecture designed to take advantage of instruction level parallelism [38]


Figure 7.2. Tilera’s TILE64 Architecture [38]

2. Local visible order is determined by data dependencies through registers or memory.
3. The global visible order cannot be determined by the local visible order.

Memory barrier primitives
TILE processors define a relaxed consistency model, which needs memory barriers to make memory visible and to guarantee that inter-network communications occur in order [37]. The TILE64 processor has instructions for memory fences and global barrier syncs.
MF: the memory fence (MF) instruction is provided to ensure that memory operations prior to the fence are globally visible before the operations after the fence [38].

Chapter 8

Porting Wool to P4080

In the design of the port of Wool to the Freescale P4080 platform with the Linux operating system, the idea is to walk through the source code and modify the code involving hardware primitives. As Wool is a C library built on top of the operating system, we do not have to take care of any code other than the hardware-dependent parts. The operating system running on the P4080 is Enea Linux, which supports all the libraries used in Wool.

8.1 Design

The hardware synchronization primitives applied in Wool are mainly memory-related code, like memory fences, atomic operations and memory allocation functions, which have been described in Chapters 4 & 7. The objective is to add the hardware synchronization primitives defined by the e500mc processor to Wool's source code, written in the gcc inline assembly format.

Freescale introduced the e500mc processor, based on Power ISA 2.061, in the QorIQ family of chips. The e500mc processor is a 32-bit processor supporting shared storage between processors [39]. Its predominant model is weakly consistent, which allows the reordering of code and provides an opportunity to improve performance over a stronger consistency model. In this case, it is up to the programmer to take care of the ordering and synchronization instructions when shared storage is used among multiple cores.

1A new instruction set, combining late versions of the POWER and PowerPC instruction sets. Designed by IBM and Freescale.

8.1.1 Storage access ordering
The different memory fences used in Wool should be defined in the header file. SFENCE is short for store fence; it ensures that all write operations preceding the barrier are committed before the subsequent write operations are issued. MFENCE is short for memory fence; it controls all memory accesses (both read and write operations) and ensures that memory accesses before the barrier are committed before the subsequent memory accesses are initiated. A memory fence induces more overhead than a load fence or a store fence, because the CPU discards all pre-fetched instructions and empties the pipeline. The e500mc core defines the following operations, which are implemented in Wool:

sync: provides full sequential consistency, so it is used as MFENCE in Wool; it is strict with respect to any instruction ordering.

msync: ensures that all instructions preceding msync have completed before msync completes. It also orders data accesses across all storage classes, so it can be used as MFENCE as well, as an alternative to sync.

lwsync: is a low-overhead memory fence, which constrains the following accesses: read after read, write after write and write after read. It is used as SFENCE in Wool.

isync: causes the prefetched instructions to be discarded by the core, which ensures that the instructions preceding isync are fetched and executed before the subsequent instructions. However, isync does not affect data accesses.
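As an illustration, this mapping could be expressed with gcc inline assembly roughly as follows (a minimal sketch; Wool's actual header may name and structure these macros differently):

/* full barrier: sync (or msync) as discussed above */
#define MFENCE()  __asm__ __volatile__ ("sync"   : : : "memory")
/* store fence: lwsync */
#define SFENCE()  __asm__ __volatile__ ("lwsync" : : : "memory")
/* discard prefetched instructions */
#define ISYNC()   __asm__ __volatile__ ("isync"  : : : "memory")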

8.1.2 Atomic update primitives
The atomic update primitives used in Wool are the prefetch and CAS operations, which are based on lwarx and stwcx. lwarx creates a reservation, and stwcx stores a word on the condition that a reservation created by lwarx still exists on the same storage location. The reservation is lost if another processor modifies the same storage location before stwcx, in which case a new reservation must be set up to perform the atomic operation.

8.2 Implementation

8.2.1 Prefetch & CAS
The prefetch primitive loads and replaces a word in storage atomically [39]. The following assembly code stores the new value in r4 to the storage location addressed by r3 and returns the old value of that location in r5. The key property of this operation is that it updates the memory value atomically [16].

loop:
    lwarx  r5,0,r3    # load the old value into r5 from the address in r3 and reserve
    stwcx. r4,0,r3    # store the new value in r4 to the address in r3 if still reserved
    bne-   loop       # retry if the reservation was lost

The compare-and-swap (CAS) primitive compares a value in a register with a word in storage [39]. It loads the old value from the location addressed by r3 into r6 and compares it with the expected value in r4. If they are equal, the contents of the location addressed by r3 are replaced with the given value in r5. The old value is returned in any case. This is a typical atomic CPU operation used to achieve synchronization: it prevents two threads from updating the value at the same time, because the second one will fail and must re-compute [16].

loop:
    lwarx  r6,0,r3    # load the old value and reserve
    cmpw   r4,r6      # does r4 equal r6?
    bne-   exit       # skip the store if not
    stwcx. r5,0,r3    # store the new value if still reserved
    bne-   loop       # retry if the reservation was lost
exit:
    mr     r4,r6      # return the value from storage

8.2.2 Gcc inline assembler code
The hardware primitives (assembly code) described in the design section need to be embedded into Wool's source code, which is written in C. The assembly code is therefore written as gcc inline assembler, which is supported by the gcc compiler. Gcc inline assembler must be written according to its basic rules. The basic format is:

asm ("instructions" : outputs : inputs : registers-modified);

The first parameter contains the instructions, separated by ";". The second and third parameters can be either a storage location or a register, and they can be left empty when no input or output is used. The assembly instructions used in Wool operate on memory, so the last parameter should include "memory" to declare that memory has been changed.
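Putting the pieces together, the CAS sequence from Section 8.2.1 could be wrapped in a C function roughly as below. This is a sketch under stated assumptions: the function name and the operand constraints are chosen here for illustration and are not Wool's exact code.

static inline int cas32(volatile int *ptr, int old_val, int new_val)
{
    int prev;
    __asm__ __volatile__(
        "1: lwarx   %0,0,%2  \n"     /* load the current value and set a reservation */
        "   cmpw    %0,%3    \n"     /* compare it with the expected old value       */
        "   bne-    2f       \n"     /* mismatch: give up                            */
        "   stwcx.  %4,0,%2  \n"     /* try to store the new value                   */
        "   bne-    1b       \n"     /* reservation lost: retry                      */
        "2:                  \n"
        : "=&r"(prev), "+m"(*ptr)
        : "r"(ptr), "r"(old_val), "r"(new_val)
        : "cc", "memory");
    return prev;                     /* caller checks prev == old_val for success */
}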

8.2.3 Enea Linux
The operating system used is Enea Linux, which is powered by the Yocto Project2 open source configuration. It provides standard tools and ensures quick access to the latest Board Support Packages (BSPs) for the most common hardware architectures [40]. The architecture of Enea Linux is shown in Figure 8.1. The operating system can assign tasks to different cores automatically.

2http://www.yoctoproject.org/


Figure 8.1. Enea Linux.

8.3 Experiment

A Fibonacci program is used as the test code. There are two versions of the test code: with Wool (fib_Wool) and without Wool (fib_wo). The main difference is that without Wool the tasks are executed immediately, while with Wool the tasks are SPAWNed and placed in a task pool instead of being executed immediately, and are executed only when they reach a corresponding SYNC. The execution time is measured for comparison.

8.3.1 Experiment: Fibonacci code with and without Wool
fib_Wool — pseudo test code with Wool:

    for each iteration:
        SPAWN( Fibonacci, n-1 );
        k = CALL( Fibonacci, n-2 );
        m = SYNC( Fibonacci );
        return m+k;

fib_wo — pseudo test code without Wool:

    for each iteration:
        k = Fibonacci(n-1);
        m = Fibonacci(n-2);
        return m + k;

Fibonacci(43) is tested on different platforms with different gcc optimization levels. From Table 8.1 we can see that Wool needs gcc optimization to achieve good performance. From the first two lines of the table, fib with Wool gains a speedup of 4.32 and 4.70 respectively by using the gcc optimization option, while fib without Wool has a speedup of 2.08 on the X86 platform and 1.54 on the P4080 platform. Wool's code is written to be optimized well by the compiler and gives a much better performance with the gcc optimization option.
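For reference, a Wool version of fib in actual C might look roughly like the sketch below, assuming the TASK_n task-declaration macros from the Wool users guide [1] (exact macro names can differ between Wool versions) and a header name of wool.h:

#include <stdio.h>
#include <stdlib.h>
#include "wool.h"

TASK_1( int, fib, int, n )
{
    if( n < 2 ) return n;
    SPAWN( fib, n-1 );               /* place fib(n-1) in the task pool  */
    int k = CALL( fib, n-2 );        /* execute fib(n-2) directly        */
    int m = SYNC( fib );             /* join (or execute) fib(n-1)       */
    return m + k;
}

TASK_2( int, main, int, argc, char **, argv )
{
    printf( "%d\n", CALL( fib, atoi( argv[1] ) ) );
    return 0;
}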

By comparing the results of fib with Wool and without Wool (gcc -O3), we can see that fib with Wool gives better performance on the P4080 than fib without Wool (9.2 s with Wool compared with 11.13 s without Wool). Things are different on the X86 platform, where fib with Wool needs more time to finish the tasks (5.14 s with Wool versus 4.7 s without Wool). This is because the program runs on one core on X86, and the overhead is created by the extra SPAWN and SYNC operations. When multiple cores are used in a system, Wool distributes tasks to different cores with a low overhead to give better performance. There is more discussion of this in Chapter 9.

Code       Gcc optimization   X86        P4080
fib_Wool   gcc -O0            22.23 s    43.25 s
fib_Wool   gcc -O3            5.14 s     9.2 s
fib_wo     gcc -O0            9.79 s     17.13 s
fib_wo     gcc -O3            4.7 s      11.13 s

Table 8.1. Time for Fib(43) with and without Wool on the X86 and P4080 Linux Platforms

8.3.2 Analysis
Take fib(5) as an example, and suppose that there are multiple cores in the system.

fib(5)
 +- fib(4)
 |   +- fib(3)
 |   |   +- fib(2)
 |   |   |   +- fib(1)
 |   |   |   +- fib(0)
 |   |   +- fib(1)
 |   +- fib(2)
 |       +- fib(1)
 |       +- fib(0)
 +- fib(3)
     +- fib(2)
     |   +- fib(1)
     |   +- fib(0)
     +- fib(1)

Figure 8.2. Fib(5)

From Figure 8.2 we can see that there are 15 tasks in total in fib(5). If fib(5) is executed without Wool, all 15 tasks will be done by one core and the overall time is 15 task execution times. If the Wool library is used, one processor will execute fib(5), spawn fib(4) and execute fib(3), then spawn fib(2) and execute fib(1), and then sync fib(3) and fib(5) (fib(4) and fib(2), which fib(5) and fib(3) respectively sync with, may be stolen by other cores). The critical path for completing fib(5) consists of at least 2 spawns, 2 syncs and 3 task units (supposing there is an infinite number of cores and the other spawned tasks are stolen and executed by other processors).

Each SPAWN and SYNC takes little time when each task unit is large. However, when each task unit consists only of SPAWN and SYNC (like fib_Wool), the overhead created by them cannot be ignored. As shown in Table 8.1, fib(43) with Wool running on one core (X86) takes 5.14 s, while it takes 4.7 s without Wool. The overhead mainly consists of the cost of the extra SPAWN and SYNC operations.

Chapter 9

Porting Wool to OSE

When porting Wool from the Linux operating system to Enea OSE, the library differences should be considered first. OSE is designed to work with OSE processes and message passing, but it also provides most pthread APIs. As Wool is built on top of pthreads, OSE pthreads are tried first. OSE also supports the gcc compiler, which makes it easier to port Wool. Modifications have been made in Wool to make it fit on OSE, and the OSE load module configuration has also been edited to support Wool.

9.1 Design

OSE supplies standard C libraries, but there exist differences between the OSE libraries and the Linux libraries. Some functions are not supported by OSE and need to be added when porting Wool.

9.1.1 Malloc library
Data alignment refers to the relationship between a data address and the size of memory blocks. "A variable located at a memory address that is a multiple of its size is said to be naturally aligned [41]." The alignment rules derive from the hardware. Some computer architectures have stringent requirements on data alignment, where unaligned data accesses result in a processor trap. Wool is supposed to work on multiple platforms, so data should be naturally aligned to make it fit on different platforms.
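As a small illustration (generic code, not from the thesis), natural alignment of a variable can be checked like this:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    double d;
    /* naturally aligned: the address is a multiple of the variable's size */
    printf("naturally aligned: %s\n",
           ((uintptr_t)&d % sizeof d) == 0 ? "yes" : "no");
    return 0;
}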

OSE malloc library The malloc library includes the heap memory allocation functions [28]. Four storage allocation functions are defined in this file, including malloc, realloc and calloc. Malloc allocates dynamic memory for an object of the specified size, in either private mode or shared mode. Calloc allocates dynamic memory for an array of objects of a specified size. Realloc is used to change the size of dynamically allocated memory.

The pointers returned by the storage allocation functions are unique and non-overlapping. A pointer remains valid from the point of allocation until deallocation [28]. Pointers can be set as private or shared. A shared pointer can be used to resize or deallocate a memory block from any OSE process, whereas a private pointer can only be operated on by the process that allocated it. Private pointers can be collected when the owner process terminates, even if they have not been deallocated explicitly.

An allocated pointer is private by default in OSE. In Wool, pointers should be set to shared mode in order to share them between OSE processes and synchronize among workers. This can be done in the following three ways: (1) the block execution model is shared; (2) the block environment variable HEAP_SHARED_MODE is set; (3) the code is compiled with MALLOC_SHARED.

Setting MALLOC_SHARED is the most convenient way; it forcibly makes allocations shared at compile time. This macro forcibly enables shared heap mode pointers when the standard allocation functions like malloc are used. An OSE error is generated if there is insufficient memory to satisfy an allocation attempt [28].

The compiler and C library handle data alignment. Memory returned via the malloc, calloc and realloc functions is aligned along an 8-byte boundary on 32-bit systems and a 16-byte boundary on 64-bit systems [41]. When a larger boundary is required, such as a page, additional functions like memalign are needed.

Linux malloc library Besides these basic functions, advanced memory allocation functions are included in the library supplied by Linux. memalign, which is used in Wool, is included in the Linux malloc library but not in the OSE malloc library.

Memalign allocates dynamic memory whose size in bytes is given by the second parameter. The returned memory address is a multiple of alignment, which must be a power of 2 [41].


void * memalign (size_t alignment, size_t size);

memalign is used when a larger boundary or stricter alignment is required. Its advantage is improved performance; more information regarding how it improves performance is available in [16].

Comparison A simple test is performed to compare the performance of malloc, calloc and memalign on the Linux platform. Results are shown in Table 9.1.

gcc optimization   memalign   malloc     calloc
gcc -O0            94.36 s    100.30 s   94.32 s
gcc -O3            21.94 s    22.32 s    22.36 s

Table 9.1. Fib(46) with Memalign, Malloc and Calloc

Memalign has the best performance because of the size of the boundary. Malloc and calloc have comparable performance, and both of them guarantee the correctness of the code with the same size of aligned memory. As there is no memalign function supplied by OSE, we substitute it with malloc at the expense of performance.
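The substitution can be expressed as a small compile-time switch, sketched below; the wrapper macro wool_alloc_aligned and the guard WOOL_OSE_PORT are hypothetical names introduced here only for illustration.

#include <stdlib.h>
#ifndef WOOL_OSE_PORT
#include <malloc.h>                  /* memalign is declared here on Linux/glibc */
#endif

/* On OSE there is no memalign, so fall back to plain malloc and accept the
 * default 8/16-byte alignment described above, at the expense of performance. */
#ifdef WOOL_OSE_PORT
#define wool_alloc_aligned(align, size)  malloc(size)
#else
#define wool_alloc_aligned(align, size)  memalign((align), (size))
#endif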

9.1.2 Pthread
The OSE pthread emulates the POSIX thread and supports most of the thread standard in POSIX 9945-1 (1996), which is essentially 1003.1c [27]. In general, native OSE processes are used instead of OSE pthreads to gain higher efficiency, but OSE pthreads make it easier to port POSIX applications, like Wool, to OSE.

Properties An OSE pthread is implemented on top of a prioritized OSE process and therefore has properties similar to those of an OSE process. Each pthread ID is unique and identical to its corresponding OSE process ID. Unlike OSE processes, OSE pthreads cannot be killed; they should be terminated as pthreads, e.g. with pthread_exit [42].

Pthreads are widely used in systems with shared memory. The OSE block in which pthreads run should be configured with a shared heap, so that all resources are taken from the same heap and are freed when the block is killed. Users can specify which OSE block a thread should belong to [27].


For OSE 5.5 and later, OSE pthreads can be used in both core modules and load modules. Since Wool is a C library, we build it as a load module, which should be configured in shared mode. More information is available in [44].

Mutex OSE pthread mutexes are implemented with fast semaphores, so fast semaphores cannot be used for other purposes in OSE pthread applications [27].

Condition variable Condition variable functions are not well supported in OSE [43]. They can, however, be implemented with OSE processes using condition_signal and condition_wait signaling. It is recommended to have a signaler process, a waiter process and a condition server process, which can also be customized according to the requirements. Unlike the other pthread APIs, however, they cannot be used without change.

Pthread bind with cores Programs are created within one core by default, so all of the pthreads started by Wool would be assigned to one core. To distribute pthreads over multiple cores, the program should be set in supervisor mode, so that it can make use of a system call that moves processes to different execution units. As the P4080 has 8 cores, we start 8 pthreads and bind each pthread to a core using the ose_bind_process() function [26].

The system call ose_bind_process is used to bind an OSE process, block or domain to a specified core. It can also be used to rebind (move) processes or blocks during run time. If a process is specified, only that process will be bound; if a block is specified, all its processes will be bound to that core together with the block; if a domain is specified, all the processes and blocks inside the domain are bound to the same core. This system call is only valid in an SMP system [44].
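The worker start-up could then look roughly like the sketch below. This is only an illustration under assumptions: the exact prototype of ose_bind_process is given in the OSE SPI reference [44] (here it is assumed to take a process ID and a core number), and an OSE pthread ID is used as its OSE process ID as stated in Section 9.1.2.

#include <pthread.h>
#include <stdint.h>
#include "ose.h"                     /* assumed to provide PROCESS and ose_bind_process */

#define NUM_CORES 8                  /* the P4080 has eight e500mc cores */

static void *worker_main(void *arg)
{
    unsigned int cpu = (unsigned int)(uintptr_t)arg;
    /* assumption: the pthread ID doubles as the OSE process ID (Section 9.1.2) */
    ose_bind_process((PROCESS)pthread_self(), cpu);
    /* ... run the Wool worker loop on this core ... */
    return NULL;
}

static void start_workers(void)
{
    pthread_t tid[NUM_CORES];
    for (uintptr_t i = 0; i < NUM_CORES; i++)
        pthread_create(&tid[i], NULL, worker_main, (void *)i);
}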

9.2 Implementation

9.2.1 Load module configuration
To enable Wool to run on multiple cores, the load module should be configured in shared mode. To bind a pthread to a dedicated CPU, the load module should be set in supervisor mode instead of user mode so that it can issue the required system call. Besides, the stack size should be large enough to give each worker enough memory space.

9.2.2 Makefile
Include all the source code that should be compiled together in the Makefile. The compiler will be called and will link all the source files included in the Makefile. For example, to compile fib with the support of the Wool library, both wool.c and fib.c should be included in the Makefile. The Makefile should be in the same folder as the application code to be built.

9.2.3 System information
Some multi-core API calls are used to check information about pthreads, load modules and cores. This system information is useful for debugging.

9.2.4 Application on OSE multi-core
The OSE architecture makes it possible to migrate an OSE single-core program to multi-core without changes. All single-core C runtime APIs are supported by OSE MCE [26]. Processes can be spread out onto multiple cores by MCE automatically to achieve parallelism, and they communicate with each other via OSE signaling in the same way as single-core applications. However, on a multi-core system one can neither assume that only one entity at a time executes code, nor that a higher priority provides exclusiveness, because the multiple parallel execution units can execute processes with different priorities at the same time. It is therefore of great importance to ensure that all dependent code is locked to the same core in a multi-core environment. As Wool is a scheduler trying to keep the load balanced among cores, all of the pthreads (actually prioritized processes) should be given the same priority [45].

9.3 Experiments

The same test case as on Linux is performed on OSE. The Fibonacci test code with and without Wool is tested.

9.3.1 Experiment on OSE & Linux
The target used for this test is the Freescale P4080 board. The test code is run on Enea Linux and Enea OSE respectively. Fib(43) is compiled with the highest optimization option provided by gcc. From the column chart in Figure 9.1 we can see that fib(43) on Linux and on OSE have comparable performance, with fib_Wool performing relatively better. The reason for such a small improvement was explained in Section 8.3.2, and the following test further supports that analysis.

Figure 9.1. Fib(43) with Wool Library on Linux & OSE on P4080

9.3.2 Experiment on the performance vs. the number of cores (1)
As it is possible to bind pthreads to cores, we can set the number of working cores. Two experiments are performed: one is fib(43), the other is fib(12) with a slowdown loop, which is used to enlarge the size of each task.

Code       Number of cores   Operating system   Time for fib(43) (s)
fib_Wool   1                 OSE                19.154
fib_Wool   2                 OSE                9.949
fib_Wool   3                 OSE                9.188
fib_Wool   4                 OSE                8.939
fib_Wool   5                 OSE                8.896
fib_Wool   6                 OSE                8.885
fib_Wool   7                 OSE                9.459
fib_Wool   8                 OSE                9.424

Table 9.2. Fib(43) with Wool Library on Multiple Cores

From Table 9.2, we can see that the performance is almost doubled using two cores compared with using one core. However, as the number of cores keeps increasing, the performance does not improve much; on the contrary, it begins to decrease when more than six cores are used. This means that the overhead induced by synchronization affects the performance. The reason may be that the basic version of Wool is used here; if the more advanced version were ported, the results would be expected to be better. To further verify whether Wool works well on a multi-core system, the following test is conducted.

9.3.3 Experiment on the performance vs. the number of cores (2)
An extra slowdown loop is added to the test code, which makes the program 20,000,000 times slower than before; that is, the size of each task is increased. The results are shown in Figure 9.2. We can see that the performance of fib_Wool improves continuously with the number of cores.

Figure 9.2. Fib(12) with Large Task Body on Multiple Cores

The red columns indicate the ideal speedup, while the blue columns show the real speedup. The differences between them are the overhead induced by extra synchronization among threads.

In Figure 9.2, the performance of Wool improves almost linearly. However, with 8 cores the overhead increases greatly. That is because Core0 acts as a worker when 8 cores are used; considering that Core0 also takes care of other system tasks, it is reasonable that the speed of the test code is affected.


Chapter 10

Conclusions and Future Work

10.1 Conclusions

This master thesis discusses Wool and its performance on multi-core systems. In the theoretical study part, multi-core, parallel programming and pthreads are discussed. As multi-core starts an era of improving the performance of computations, Wool provides a solution to carry out computations concurrently and efficiently. In the implementation part, Wool is ported to a new platform and a set of tests is performed on it.

10.1.1 Theoretical study
Answers to the questions raised in Chapter 1 are stated below.

P1: Wool task parallelism Wool implements task parallelism with work stealing and leap frogging. Source code related to hardware synchronization primitives and OSE libraries needs to be modified to fit the new platform.

P2: Pthreads The OSE multi-core edition MCE supports most pthread functions, which are built upon OSE processes. However, the pthread condition variable functions are not fully supported by OSE; if these functions are used, they must be implemented with OSE processes. This part is discussed in Chapter 6 and Chapter 9.

P3: Hardware architecture Wool is able to work on the P4080, with hardware-specific synchronization primitives guaranteeing its performance. On multi-core systems, the number of cores influences Wool's performance; features like sampling victim selection are added as the number of cores keeps growing. This part is illustrated in Chapter 5 and Chapter 9.

10.1.2 Implementation
To port Wool, a C library, to another platform, one should consider hardware-related code, library support, the compiler, and application configurations.

Porting from X86 to P4080 When porting Wool from one platform to another, the hardware-dependent code should be modified to fit the new platform. All the hardware-dependent code in Wool consists of memory-related primitives, which are defined by the processor. These primitives are written in the gcc inline assembler format so that they can be compiled together with the rest of the code.

Porting from Linux to OSE The operating system supplies the libraries and compilers, and both may affect the porting. To port Wool to OSE, the gcc compiler is used (gcc is also the compiler used on the Linux platform), but some functions are missing in the libraries provided by OSE, so proper changes must be made to the code to compensate. Another point is that the file system and the architecture of OSE are different from Linux, so the configurations should be changed accordingly.

10.1.3 Result Discussion
Table 10.1 shows the results of fib(43) on different platforms. fib(43) has more than one billion tasks spawned, executed and synced. The extra cost induced by spawns and syncs is quite low considering the large number of spawn and sync operations (one billion spawns and syncs cost 0.43 s, which is the difference between running with and without Wool on Linux(X86)).

Platform       With Wool   Without Wool
OSE(P4080)     9.42 s      11.50 s
Linux(P4080)   9.2 s       11.13 s
Linux(X86)     5.14 s      4.71 s

Table 10.1. Time for Fib(43) with and without Wool on the Target Platforms

Wool gives a better performance on the P4080 multi-core platform. The fundamental operations SPAWN and SYNC are quite cheap, so a low scheduler overhead can be achieved. Besides, Wool works with plain ordinary C, unlike the C++ used by Intel TBB or the C# used by Microsoft TPL. In this respect, Wool is a good choice for Enea OSE. However, as Wool implements independent task parallelism, its usage is limited to independent tasks, and users must take care of the task dependencies themselves.

10.2 Future work

(1) Though Wool can be run on OSE, it is recommended to replace the pthread primitives with OSE message passing, which would make it more efficient. Wool currently works on top of pthreads, which need explicit synchronization when workers communicate across cores; with OSE message passing, communication among cores does not need to be synchronized explicitly.

(2) The porting was done with the basic version of Wool. It is worth porting the newest version to OSE, which may give better performance with its more sophisticated algorithms.

(3) The Enea Hypervisor supports both the OSE and Linux operating systems. It is worth working out a hypervisor version of Wool (with both pthreads and signals) that can work across the system regardless of OS type.

(4) A series of target boards may enhance their performance with Wool, for example multi-core processors such as the Ambric Am2045, Sun SPARC T3 (Rainbowfalls) and IBM POWER7. They may become further target platforms.


Appendix A

Demo Results

ose5> ./romfs/wooll.gz -p 4 12
Worker started
Worker started
Worker started
Worker started
Worker 558303160 stole a task from worker 558175144
Worker 558596184 stole a task from worker 558175144
Worker 558461016 stole a task from worker 558303160
Worker 558175144 is leapfrogging with worker 558596184
Worker 558175144 stole a task from worker 558596184
Worker 558596184 is leapfrogging with worker 558175144
Worker 558596184 stole a task from worker 558175144
Worker 558175144 is leapfrogging with worker 558596184
Worker 558175144 stole a task from worker 558596184
Worker 558596184 is leapfrogging with worker 558175144
Worker 558596184 stole a task from worker 558175144
Worker 558175144 is leapfrogging with worker 558596184
Worker 558175144 stole a task from worker 558596184
Worker 558596184 is leapfrogging with worker 558175144
Worker 558596184 stole a task from worker 558303160
Worker 558175144 is leapfrogging with worker 558303160
Worker 558175144 stole a task from worker 558303160
Worker 558303160 is leapfrogging with worker 558175144
Worker 558303160 stole a task from worker 558175144
Worker 558175144 is leapfrogging with worker 558303160
Worker 558175144 stole a task from worker 558303160
Worker 558303160 is leapfrogging with worker 558175144
Worker 558303160 is leapfrogging with worker 558596184
Worker 558303160 stole a task from worker 558596184
Worker 558175144 stole a task from worker 558303160


Worker 558596184 is leapfrogging with worker 558303160
Worker 558596184 stole a task from worker 558303160
Worker 558303160 is leapfrogging with worker 558596184
Worker 558303160 is leapfrogging with worker 558175144
Worker 558303160 stole a task from worker 558175144
Worker 558175144 is leapfrogging with worker 558303160
Worker 558596184 stole a task from worker 558461016
Worker 558303160 is leapfrogging with worker 558461016
Worker 558303160 stole a task from worker 558461016
Worker 558175144 stole a task from worker 558303160
Worker 558303160 is leapfrogging with worker 558175144
Worker 558303160 stole a task from worker 558175144
Worker 558461016 is leapfrogging with worker 558303160
Worker 558175144 is leapfrogging with worker 558303160
Worker 558461016 is leapfrogging with worker 558596184
Worker 558461016 stole a task from worker 558596184
Worker 558303160 stole a task from worker 558461016
Worker 558175144 stole a task from worker 558303160
Worker 558303160 is leapfrogging with worker 558175144
Worker 558303160 stole a task from worker 558175144
Worker 558175144 is leapfrogging with worker 558303160
Worker 558175144 stole a task from worker 558303160
Worker 558303160 is leapfrogging with worker 558175144
Worker 558303160 stole a task from worker 558175144
Worker 558175144 is leapfrogging with worker 558303160
Worker 558303160 stole a task from worker 558461016
Worker 558461016 is leapfrogging with worker 558303160
Worker 558461016 stole a task from worker 558596184
Worker 558596184 is leapfrogging with worker 558461016
Worker 558596184 stole a task from worker 558461016
Worker 558461016 is leapfrogging with worker 558596184
Worker 558461016 stole a task from worker 558596184
Worker 558596184 is leapfrogging with worker 558461016
result = 144
time is 28.519000.
workers stoped
Yay!
ose5>

Bibliography

[1] Karl-Filip Faxén, "Wool 0.1 users guide," http://www.sics.se/kff/wool, June 1,2009

[2] M. Hill and M. Marty, "Amdahl’s law in the multicore era," IEEE Comput., vol. 41, no. 7, pp. 33-38, Jul. 2008.

[3] Kumar, R.; Farkas, K.I.; Jouppi, N.P.; Ranganathan, P.; Tullsen, D.M.; , "Single-ISA heterogeneous multi-core architectures: the potential for processor power reduction," Microarchitecture, 2003. MICRO-36. Proceedings. 36th An- nual IEEE/ACM International Symposium on , vol., no., pp. 81- 92, 3-5, Dec. 2003

[4] S. Akhter and J. Roberts. "Multi-Core Programming: Increasing Performance through Software Multi-threading," Intel Press, 2006.

[5] Paul N. Leroux and Robert Craig, "Easing the Transition to Multi-Core Pro- cessors," Information Quarterly, Vol. 5, No. 4, pp. 34-37, 2006

[6] Timothy M.; Beverly S.; and Berna M., Patterns for Parallel Programming (First ed.). Addison-Wesley Professional, 2004.

[7] Duranton, M.; , "The Challenges for High Performance Embedded Systems," Digital System Design: Architectures, Methods and Tools, DSD 2006. 9th EU- ROMICRO Conference on , vol., no., pp.3-7, 2006.

[8] Chu, M.; Ravindran, R.; Mahlke, S.; , "Data Access Partitioning for Fine- grain Parallelism on Multicore Architectures," Microarchitecture, 2007. MI- CRO 2007. 40th Annual IEEE/ACM International Symposium on , vol., no., pp.369-380, 1-5 Dec. 2007.

[9] Gene Golub and James M. Ortega. Scientific Computing: An Introduction with Parallel Computing, Academic Press Prof., Inc., San Diego, CA, USA, 1993.

[10] Nicholas Carriero and David Gelernter. 1989. How to write parallel programs: a guide to the perplexed. ACM Comput. Surv. 21, 3 (September 1989), 323-357.

[11] Blaise Barney, "POSIX Threads Programming," Lawrence Livermore National Laboratory, https://computing.llnl.gov/tutorials/pthreads.


[12] "IEEE Standard for Information Technology- Portable Operating System In- terface (POSIX)- Part 1: System Application Program Interface (API)- Amendment J: Advanced Real-time Extensions [C Language]," IEEE Std 1003.1j-2000 , vol., no., pp.0_1-88, 2000.

[13] Apple Developer, Threading Programming Guide, 2010.

[14] Paul Bridger, Concepts and Synchronisation Primitives, http://www.paulbridger.com/deadlock/, 2006.

[15] Bil Lewis and Daniel J. Berg, "Pthreads Primer: A Guide to Multithreaded Programming," Prentice Hall, California, USA, 1995.

[16] John L. Hennessy and David A. Patterson, Computer Architecture: A Quantitative Approach, pages 196-264, 2007.

[17] Faxén, K.-F.; , "Efficient Work Stealing for Fine Grained Parallelism," Parallel Processing (ICPP), 2010 39th International Conference on, vol., no., pp.313- 322, 13-16 Sept 2010.

[18] Karl-Filip Faxén and John Ardelius, "Manycore work stealing," In Proceedings of the 8th ACM International Conference on Computing Frontiers (CF ’11). ACM, New York, NY, USA, Article 10 , 2 pages, 2011.

[19] Karl-Filip Faxén, "Wool-A work stealing library." SIGARCH Comput. Archit. News 36, 5 (June 2009), 93-100, 2009.

[20] Sen, Siddhartha. "Dynamic Processor Allocation for Adaptively Parallel Work Stealing Jobs," Thesis, Chapter4, Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2004.

[21] Blumofe, R.D.; Leiserson, C.E.; , "Scheduling multithreaded computations by work stealing," Foundations of Computer Science, 1994 Proceedings., 35th An- nual Symposium on , vol., no., pp.356-368, 20-22 Nov 1994.

[22] Faxén, K.-F.; , "Efficient Work Stealing for Fine Grained Parallelism," Parallel Processing (ICPP), 2010 39th International Conference on , vol., no., pp.313- 322, 13-16 Sept. 2010.

[23] David B. Wagner and Bradley G. Calder. 1993. "Leapfrogging: a portable technique for implementing efficient futures." SIGPLAN Not. 28, 7 (July 1993), 208-217.

[24] Podobas, Artur and Brorsson, Mats and Faxén, Karl-Filip (2010) A Compari- son of some recent Task-based Parallel Programming Models. In: 3rd Workshop on Programmability Issues for Multi-Core Computers , 24 Jan 2010, Pisa, Italy.


[25] Karl-Filip Faxén, Christer Bengtsson, Mats Brorsson, Håkan Grahn, Erik Hagersten, Bengt Jonsson, Christoph Kessler, Björn Lisper, Per Stenström, Bertil Svensson, "Multicore computing - the state of the art," SICS, 2008.

[26] Enea, Enea OSE Architecture User’s Guide, REV.BL140702.

[27] Enea, Enea OSE Core User’s Guide, REV.BL140702.

[28] Enea, OSE Application Programming Interface Reference Manual, REV.BL330018.

[29] Patrik Stromblad, "White Paper: ENEA Multicore: High performance packet processing enabled with a hybrid SMP/AMP OS technology", Enea AB Whitepapers, 2009.

[30] Enea, "High-Availability RTOS for Complex, Distributed Systems," Enea AB, OSE 5.5 Datasheet, 2010.

[31] Magnus Karlsson, "The Future Multicore OS – A Visionary Look into the Future of Multicore Embedded Operating Systems", Enea.

[32] Ibrahim Hur and Calvin Lin, Memory scheduling for modern microprocessors. ACM Trans. Comput. Syst. 25, 4, Article 10 (December 2007), 2007.

[33] Kourosh Gharachorloo, Daniel Lenoski, James Laudon, Phillip Gibbons, Anoop Gupta, and John Hennessy, Memory consistency and event ordering in scalable shared-memory multiprocessors, SIGARCH Comput. Archit. News 18, 3a (May 1990), 15-26. 1990.

[34] Freescale, "e500mc Core Reference Manual", REV.1 03/2012.

[35] Freescale, "P4080 QorIQ Integrated Multicore Communication Processor Fam- ily Reference Manual", Rev.0, 04/2011.

[36] Freescale, "EREF: A Programmer’s Reference Manual for Freescale Embedded Processors," Rev. 1 12/2007.

[37] Wentzlaff, D.; Griffin, P.; Hoffmann, H.; Liewei Bao; Edwards, B.; Ramey, C.; Mattina, M.; Chyi-Chang Miao; Brown, J.F.; Agarwal, A.; , "On-Chip Interconnection Architecture of the Tile Processor," Micro, IEEE , vol.27, no.5, pp.15-31, Sept.-Oct. 2007.

[38] Tilera, "TILE Processor User Architecture Manual," page 274279, REL.2.4, DOC.NO.UG101, 05/2011.

[39] Power ISA Version 2.06 Revision B, July 23, 2010.

[40] Enea AB, Enea Linux User’s Guide, REV.20120419_1426.


[41] Robert Love, "Linux System Programming," Chapter 8, O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

[42] Enea software AB, ENEA OSE Pthreads User’s Guide.

[43] Enea software AB, ENEA OSE Pthreads Reference Manual.

[44] Enea AB, OSE System Programming Interface Reference Manual, REV.BL141108.

[45] Enea AB, OSE 5 Migration Guide, REV.BL150551.
