Porting a Library for Fine Grained Independent Task Parallelism to Enea OSE RTOS

Master of Science Thesis

YIHUI WANG

Master's Thesis at Xdin AB
Supervisors: Barbro Claesson, Xdin AB; Detlef Scholle, Xdin AB
Examiner: Ingo Sander, KTH

TRITA-ICT-EX-2012:141

Acknowledgements

I would like to express my sincere gratitude to my supervisors, Detlef Scholle and Barbro Claesson from Xdin AB, for giving me the precious opportunity to carry out this master thesis and for their constant support. I would also like to thank Ingo Sander, my examiner from KTH, for his useful suggestions and important guidance in writing this report.

I would like to give my thanks to the other thesis workers at Xdin, especially Sebastian Ullström and Johan Sundman Norberg, who helped me a lot with both the project and the thesis. The amazing time that we spent together became one of my most precious memories in life.

My thanks also go to everyone at Xdin AB and Enea AB, who have been friendly and hospitable.

Finally, I would like to thank my beloved family for their support throughout my entire life. I could not have come this far academically without them. In particular, I must acknowledge my friend Junzhe Tian for his encouragement and constant assistance throughout my master studies.

Yihui Wang
Stockholm, Sweden, June 2012

Abstract

Multi-core has started an era of improving the performance of computations by executing instructions in parallel. However, the improvement in performance is not linear in the number of cores, because of the overhead caused by inter-core communication and unbalanced load over the cores. Wool provides a solution to improve the performance of multi-core systems. It is a C library for fine grained independent task parallelism developed by Karl-Filip Faxén at SICS, which helps to keep the load balanced over the cores by work stealing and leapfrogging.

In this master thesis project, Wool is ported to the Enea OSE real-time operating system, aiming to supply an approach to improve the performance of multi-core systems. To reach this goal, multi-core architecture, task parallelism algorithms, as well as POSIX threads are studied. In addition, hardware synchronization primitives, which are defined by the processor, are studied and implemented in Wool. The target hardware for this study is the Freescale P4080 board with eight e500mc cores. Wool is ported on this target with both the Linux and OSE operating systems. Finally, the porting is tested and verified.

Contents

List of Figures

List of Tables

List of Abbreviations

1 Introduction
  1.1 Background
  1.2 Problem statement
  1.3 Goals
  1.4 Method
  1.5 Contributions

I Theoretical Study

2 Multi-core
  2.1 Multi-core operating system
  2.2 Comparison
  2.3 Summary

3 Parallel Computing
  3.1 Parallel performance metrics
  3.2 Levels of parallelism
    3.2.1 Instruction-level parallelism (ILP)
    3.2.2 Thread-level parallelism (TLP)
  3.3 Parallel programming models
  3.4 Design in parallel
  3.5 Summary

4 Threads
  4.1 Thread overview
    4.1.1 Process & thread
    4.1.2 Threaded program models
  4.2 Thread management
  4.3 Mutual exclusion
  4.4 Conditional variables
  4.5 Synchronization
    4.5.1 Atomic operations
    4.5.2 Hardware primitives
  4.6 Threads on multi-core
  4.7 Summary

5 Wool
  5.1 Basic concepts
    5.1.1 Worker
    5.1.2 Task pool
    5.1.3 Granularity in Wool
    5.1.4 Load balance in Wool
    5.1.5 Structure
  5.2 Work stealing
  5.3 Leap frogging
  5.4 Direct task stack algorithm
    5.4.1 Spawn
    5.4.2 Sync
  5.5 Wool optimization
    5.5.1 Sampling victim selection
    5.5.2 Set based victim selection
  5.6 Other multi-threaded scheduler
  5.7 Summary

II Implementation

6 OSE
  6.1 OSE fundamentals
  6.2 OSE process & IPC
  6.3 OSE for multi-core
    6.3.1 OSE load balancing
    6.3.2 Message passing APIs
  6.4 OSE pthreads
  6.5 Summary

7 P4080 and TILE64
  7.1 Features
    7.1.1 Instruction set
    7.1.2 Memory consistency
    7.1.3 Memory barrier
  7.2 Freescale QorIQ P4080 platform
    7.2.1 P4080 architecture
    7.2.2 e500mc core
  7.3 Tilera's TILE64
    7.3.1 TILE64 architecture
    7.3.2 TILE

8 Porting Wool to P4080
  8.1 Design
    8.1.1 Storage access ordering
    8.1.2 Atomic update primitives
  8.2 Implementation
    8.2.1 Prefetch & CAS
    8.2.2 Gcc inline assembler code
    8.2.3 Enea Linux
  8.3 Experiment
    8.3.1 Experiment: Fibonacci code with and without Wool
    8.3.2 Analysis

9 Porting Wool to OSE
  9.1 Design
    9.1.1 Malloc library
    9.1.2 Pthread
  9.2 Implementation
    9.2.1 Load module configuration
    9.2.2 Makefile
    9.2.3 System information
    9.2.4 Application on OSE multi-core
  9.3 Experiments
    9.3.1 Experiment on OSE & Linux
    9.3.2 Experiment on the performance vs. the number of cores (1)
    9.3.3 Experiment on the performance vs. the number of cores (2)

10 Conclusions and Future Work
  10.1 Conclusions
    10.1.1 Theoretical study
    10.1.2 Implementation
    10.1.3 Result Discussion
  10.2 Future work

Appendices

A Demo Results

Bibliography

List of Figures

1.1 Implementation Steps
2.1 Multi-core Architecture
2.2 Architecture of SMP System
2.3 Architecture of AMP System
4.1 Mutex Structure [14]
4.2 Condition Variable [15]
5.1 Work Stealing Structure
5.2 Work Stealing
5.3 Leap Frogging
5.4 Spawn
6.1 Processes in Blocks with Pools in Domains [26]
6.2 Message Passing
6.3 MCE: Hybrid AMP/SMP Multiprocessing [31]
7.1 P4080
7.2 Tilera's TILE64 Architecture [38]
8.1 Enea Linux
8.2 Fib (5)
9.1 Fib(43) with Wool Library on the Linux & OSE on P4080
9.2 Fib(12) with Large Task Body on Multiple Cores

List of Tables

2.1 Comparison of Multiple Processors and Multiple Cores
2.2 Comparison between Multiple Threads and Multiple Cores
3.1 Comparison of Shared Memory Model and Message Passing Model
4.1 Processes and Threads
4.2 Advantages and Disadvantages of Threads on Multi-core
7.1 Instruction Set Comparison
7.2 Memory Consistency Models
7.3 Memory Barriers & Instruction
8.1 Time for Fib(43) with Wool Library on the Linux X86 Platform
9.1 Fib(46) with Memalign, Malloc and Calloc
9.2 Fib(43) with Wool Library on Multiple Cores
10.1 Target

List of Abbreviations

Pthread POSIX thread

OSE Operating System Embedded

P4080 QorIQ P4080 Eight-Core Communications Processors

CMP Chip Multi-Processor

SMP Symmetric Multi-Processing

OS Operating System

SMT Simultaneous Multi-Threading

ILP Instruction-Level Parallelism

TLP Thread-Level Parallelism

API Application Programming Interface

LIFO Last In First Out

FIFO First In First Out

IPC Inter Process Communication

MCE Multi-Core Embedded

ISA Instruction Set Architecture

CISC Complex Instruction Set Computer

RISC Reduced Instruction Set Computing

MPPA Massively Parallel Processor Array

Mutex Mutual Exclusion

Fib Fibonacci

Chapter 1

Introduction

1.1 Background

This thesis project is conducted at Xdin AB¹, a technology and IT consulting company developing and delivering competence for world leading companies. The thesis is part of the MANY² project hosted by ITEA2³. The objective of MANY is to provide scalable, reusable and rapidly developed software for embedded systems.

Multi-core became popular in the recent decade; it extends the single-core approach with parallel computing. However, the overhead caused by synchronization among processors becomes a problem as the number of cores keeps increasing. An efficient task parallelism scheduler is needed to decrease the communication overhead and improve the performance.

Wool⁴ is a C library for fine grained independent task parallelism on top of POSIX threads (pthreads) [1], developed at SICS⁵ by Karl-Filip Faxén. It can be regarded as a low overhead scheduler for concurrent programs. The demand for efficient parallel computing makes it important to port Wool to OSE⁶ multi-core platforms to gain better performance. This master thesis focuses on porting Wool to the Enea OSE real-time operating system.

1.2 Problem statement

Wool is a C library concentrating on reducing the communication overhead and keeping the load balanced across multiple cores. It is designed to give better performance on multi-core systems, like the P4080⁷, because it helps to distribute the workload to each core with a low overhead. Wool currently works on Linux on X86⁸. To port Wool to Enea OSE on the P4080, good knowledge of the Wool task parallelism, the PowerPC instruction set and the OSE operating system is required. Differences between the operating systems and the hardware targets should be considered before the porting. The main problems are listed below.

¹http://xdin.com/
²Many-core programming and resource management for high-performance Embedded Systems
³http://www.itea2.org
⁴Wool's homepage: http://www.sics.se/ kff/wool/
⁵Swedish Institute of Computer Science
⁶A real-time embedded operating system created by Enea AB

P1: Wool task parallelism

Wool implements task parallelism. How tasks are assigned to different cores and cooperate to achieve more efficient parallelism is an interesting topic. Modifying the Wool source code to fit the new platform is the main part of the project, so it is important to find out which parts of the source code should be modified to port Wool to the new platform without changing the task parallelism.

P2: Pthread

Wool is a C library based on POSIX threads (pthreads), a library provided by the operating system. The Linux operating system implements pthreads, and all the thread functions and data types are declared in the pthread header file. However, the Enea OSE operating system is based on OSE processes and message passing instead of pthreads. Although most pthread APIs are implemented and supported by OSE to make it easier to port pthread-based applications, changes and reconfigurations of OSE pthreads remain problems in this project.

P3: Hardware synchronization primitives

Hardware synchronization primitives are used in the implementation of Wool, where they are used for optimization. These hardware primitives are defined by the processor vendors. To port Wool to the new target P4080, one must add these primitives for the specified target. This hardware-dependent code relies on the memory consistency model, the memory barriers and the type of the instruction set. The low-level assembly code has to be embedded in the C program.

1.3 Goals

This project is conducted in two phases (see Figure 1.1).

Step 1: Port Wool to P4080.
1. Configure the P4080 with a Linux kernel.
2. Modify the hardware-dependent code of Wool using gcc inline assembler.
3. Verify Wool on the P4080.

⁷http://www.freescale.com
⁸X86: a series of computer microprocessor instruction set architectures based on the Intel 8086 CPU.

Figure 1.1. Implementation Steps (Wool on Linux with pthreads on X86, on Linux on the P4080, and on OSE on the P4080)


Step 2: Port Wool to OSE.
1. Configure OSE on P4080.
2. Modify OSE libraries, including the pthread library.
3. Verify Wool upon OSE pthreads.

1.4 Method

This thesis lasts for 20 weeks and includes two phases: theoretical study and implementation. The theoretical study is conducted during the first 10 weeks. Background knowledge of Wool, OSE and the P4080 is acquired by reading manuals and papers.

In the second phase, the design and implementation of Wool on the P4080 are carried out first, followed by testing and verification of the implementation. Finally, the performance of Wool is measured on the new platform and a demonstration is conducted. The platform is a Freescale board with the QorIQ P4080 Eight-Core Communications Processor, and the implementation is done according to Xdin AB standards.

1.5 Contributions

This master thesis involves the study of parallel programming on multi-core, POSIX threads and the Wool task parallelism strategies. Enea Linux has been set up on the Freescale P4080 platform. The source code for the hardware-dependent primitives has been changed to fit the new platform, and a set of tests has been performed to verify the results.

Wool is also ported to Enea OSE on the Freescale P4080 platform. The main problem is that some of the OSE libraries differ from the Linux libraries. To fix the library problem, additional configurations and functions are used. Wool is configured as a load module in the implementation. By compiling the Wool library together with the application code, the application gains better performance. The performance of Wool has been tested on the new platform and the results are compared with those on Enea Linux.

Part I

Theoretical Study


Chapter 2

Multi-core

According to Moore's law, chip performance doubles every 18 months [2]. To keep up with this trend, micro-processor vendors used to improve computer performance by increasing the clock frequency of single-core processors. However, this increase in performance has reached a bottleneck due to power consumption and heat dissipation, which grow exponentially with the clock frequency. Multi-core therefore comes as a solution, improving performance with multiple execution units executing in parallel [3]. However, the massive parallelism makes it more difficult to program a multi-core system, and inter-core dependencies can decrease its performance as well.

A multi-core processor integrates more than one processing unit on a single chip [4]. Each unit is called a "core" and is independent. Multi-core is also called Chip Multi-Processor (CMP) because the cores fit on a single processor. The architecture of a multi-core processor generally includes several cores, one system bus, private or shared caches (there is always at least one cache level that is shared, which helps to speed up data transfers, reduce cache-coherency complexity and reduce data-storage redundancy) and a shared memory, as shown in Figure 2.1. It differs from a single-core processor in both architecture and parallel computing.

2.1 Multi-core operating system

Operating systems on multi-core are mainly divided into Symmetric Multi-Processing (SMP) and Asymmetric Multi-Processing (AMP). SMP requires homogeneous hardware with the same operating system on each core, while AMP is heterogeneous with different operating systems running on different cores. There are also bare-metal systems with no operating system running.

Figure 2.1. Multi-core Architecture (N cores, each with its own CPU and private L1 data and instruction caches, sharing an L2 cache, the system bus and a shared memory)

SMP systems

SMP systems (see Figure 2.2) are the most commonly used form; they allow any core to work on any task regardless of where the data is located in memory. There is a single operating system (OS) image for an SMP system. Each processor is able to access the shared memory and complete a given task, in contrast to a master/slave arrangement [5]. This makes it easy to distribute tasks among the cores to balance the workload dynamically.

Figure 2.2. Architecture of SMP System (one operating system image running applications across Core1 ... CoreN)

AMP systems

AMP systems (see Figure 2.3) are used in some embedded devices where cores are assigned different roles. Independent OS images are used on the cores, and each OS kernel has dedicated local memory as well as shared memory. Tasks cannot easily be moved between processors in AMP, and resources cannot be treated uniformly across the system [5].

Figure 2.3. Architecture of AMP System (different OS images, e.g. OSE and Linux, each running on its own core with its own applications)

2.2 Comparison with multi-processor & multi-threading

Multi-core systems differ from multi-processor systems and from multi-threading. To characterize multi-core systems clearly, we make a comparison below.

Multi-processor system

Multi-processor systems require more power because signals between cores are routed off-chip. Compared with a multi-processor system, a multi-core system is faster thanks to the inter-core bus and cache snooping [4]. The differences between them are shown in Table 2.1.

Multiple Processors                          | Multiple Cores
Two different chips connected by a bus       | Connected within a chip
Parallelism needs external software support  | Multiple processes run in parallel automatically
Heat consumption                             | Less heat consumption
-                                            | Lower package cost

Table 2.1. Comparison of Multiple Processors and Multiple Cores.

Multi-threading

Simultaneous Multi-Threading (SMT) belongs to instruction-level parallelism and can run on both single-core and multi-core. Multi-threading is used to avoid pipeline stalls: if one thread is waiting for a hardware resource to respond, another thread can take over and execute without waiting. However, if both threads are waiting for the same hardware resource, they will stall. The differences between multi-core and SMT are shown in Table 2.2.

Multiple Threads               | Multiple Cores
Instruction level parallelism  | Thread level parallelism
Share one core and L1 cache    | More than one set of cores and L1 caches
One large super-scalar core    | Several cores

Table 2.2. Comparison between Multiple Threads and Multiple Cores

2.3 Summary

Multi-core allows microprocessors to achieve a boost in performance without increasing the power consumption and the complexity of the hardware components. However, the improvement in performance can only be achieved when the multi-core architecture is combined with parallel programming. Multi-core confronts designers with difficulties like cache coherency and inter-core communication overhead. If these problems are handled improperly, performance may even decrease. To summarize, software designers should come up with an efficient parallel strategy to take full advantage of the multi-core architecture.

Chapter 3

Parallel Computing

Parallel computing executes computations simultaneously, in contrast to serial computing, which executes one instruction at a time. The main idea of parallelism is to break a problem into small parts, assign them to different execution units, execute them simultaneously and thus complete the problem faster [6]. Highly efficient parallel computation can improve the performance of the system, but programming parallel hardware is difficult because of the massive parallelism involved. Parallel computing models and challenges are discussed in this chapter.

3.1 Parallel performance metrics

There are various ways to evaluate the improvement in performance of a parallel computation. The actual speedup of a parallel computation can be measured as

    S_p = \frac{T_{serial}}{T_{parallel}}

In the ideal case, the upper bound of the speedup equals the number of CPUs. However, the speedup is limited not only by the number of CPUs, but also by the proportion of the program that can be parallelized, which can be estimated with Amdahl's law. By splitting a program into a serial part and a parallel part, the maximum speedup is

    S_p \le \frac{1}{(1 - P) + P/N}

where P is the proportion of the program that can be made parallel and N is the number of cores.


The upper limit of the speedup is determined by the serial fraction of the code, so by parallelizing as much as possible we can get better performance [4]. The basic idea of parallelism is to break a program of instructions into small pieces and execute them simultaneously. However, as the number of cores increases, the performance does not increase linearly due to the overhead caused by cache effects, extra synchronization and bus contention [7]. Therefore tasks (pieces of a program) need to be large enough to run in parallel with little overhead, but not so large that there is not enough work to run in parallel to keep the load balanced.
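To make Amdahl's law concrete, the short C program below (an illustration added here, not part of the original thesis) evaluates the speedup bound for an assumed parallel fraction on up to eight cores, matching the core count of the P4080.

    #include <stdio.h>

    /* Upper bound on speedup according to Amdahl's law:
     * S(P, N) = 1 / ((1 - P) + P / N)                     */
    static double amdahl_speedup(double p, int n)
    {
        return 1.0 / ((1.0 - p) + p / (double)n);
    }

    int main(void)
    {
        const double p = 0.9;   /* assume 90% of the program is parallelizable */
        for (int n = 1; n <= 8; n *= 2)
            printf("N = %d cores -> speedup <= %.2f\n", n, amdahl_speedup(p, n));
        return 0;
    }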

3.2 Levels of parallelism

There are different ways to implement parallelism to improve performance. By using parallel programming, we can decrease the time needed for a single problem and increase the throughput at the same time.

3.2.1 Instruction-level parallelism (ILP)

Instructions can be overlapped and executed in parallel when they are independent. With ILP, processors reorder the instruction pipeline, decompose instructions into sub-instructions and execute multiple instructions simultaneously. ILP is implicit parallelism, which reduces the latency of memory accesses by applying pipelining techniques and super-scalar architectures¹. Furthermore, by reordering instructions, processors can perform useful work instead of stalling on data and instruction dependencies [8].

3.2.2 Thread-level parallelism (TLP)

TLP is the main architecture for high performance multi-core or multi-processor systems. TLP means that when a thread is idle waiting for a memory access, another thread is initialized and run immediately, so that the pipeline can stay busy all the time. In addition, the throughput of the system is increased.

3.3 Parallel programming models

Parallel programming models exist as an abstraction above hardware and memory architectures [9]. They can be applied to any type of hardware or memory architecture. For example, a shared memory model on a distributed memory machine appears as shared memory to the users but is physically distributed. Several parallel programming models are described below.

¹Splitting an instruction into several stages: instruction fetch, instruction decode, execution, memory access, write back.

Shared memory

In this model, programs share a common piece of memory, which they access asynchronously. A typical example of this model is a global variable, which can be accessed and modified by all the programs. Shared variables can be protected by various mechanisms like locks (see Section 4.5), which help to control access to them [9].

Message passing

A distributed memory model is used in this model. Each processor owns its data in a private memory, and different processors exchange data by sending and receiving messages. Processors need to cooperate with each other; for example, a send must match a receive operation [9]. A comparison of the shared memory model and the message passing model is shown in Table 3.1.

                 | Shared memory               | Message passing
communication    | load and store              | send and receive
memory           | shared or private           | private
synchronization  | required when sharing data  | implicit

Table 3.1. Comparison of Shared Memory Model and Message Passing Model

3.4 Design in parallel

Though parallel computing improves performance, challenges like deadlock and race conditions exist because of competition for shared resources (processors, memory or devices) and sequential dependencies (data dependencies, communication and synchronization). To cope with these challenges, synchronization primitives are used, see Section 4.5. When designing in parallel, attention should be paid to the following aspects.

Decomposition

To convert a serial program into a parallel one, the first step is to break the program into several pieces. There are two ways to implement decomposition: domain decomposition and functional decomposition. Domain decomposition means that each task works on a partition of the data, while functional decomposition means that each task performs a partition of the overall work [10].


Inter-communication

If the partitioned tasks are assigned to different processors and they share data, communication must be performed, which introduces both latency and overhead. The cost of communication needs to be considered before decomposition. A process that stalls and waits for an event or some data is called blocking. Otherwise, the process is called non-blocking and can make progress without suspending while waiting.

Synchronization & data dependency

Synchronization is used to synchronize different tasks and ensure data correctness. Most synchronization operations, like semaphores, are based on memory barriers. A memory barrier makes the memory operations before the barrier visible to all the tasks before execution continues. Applications with little data dependency benefit most from a multi-core system because fewer of the costly synchronizations are needed.

Load balancing

In a multi-core system, the overall performance is limited by the slowest CPU. So it is important for each task to have a similar workload, keeping the workload on each CPU balanced. Dynamic work assignment can be used to keep the load balanced.

Granularity

Granularity is a quantitative measure of the ratio of computation to communication. According to the amount of work done between communication events, granularity is divided into fine-grained and coarse-grained parallelism. Fine-grained parallelism consists of a larger number of small tasks and requires more communication. It helps to keep the load balanced, at the cost of higher overhead.

3.5 Summary

To achieve better performance on multi-core, one should decompose a serial program into several fragments and execute them concurrently. Parallel computing enables concurrent computation but increases the programming complexity. Different levels of parallelism can be combined to gain better performance.

Designers need to pay attention to the challenges of parallelism. Parallel programs are designed upon parallelism models. The shared memory model and the message passing model are commonly used to implement parallelism. When applying the shared memory model, one should synchronize between processes to keep the memory coherent and to avoid other problems like deadlock. Solutions to these problems are given in Chapter 4.


Chapter 4

Threads

Threads are commonly used to implement parallelism in a multi-core system with shared memory. They are used to increase the throughput and decrease the idle time of processing units. There are several versions of threads; the most commonly used are POSIX threads (also called pthreads), a standardized C language threads API specified by the IEEE POSIX 1003.1c standard. Linux implements native POSIX threads, and our discussion is limited to pthreads in this chapter [11].

The pthread APIs can be grouped into four categories: thread management, mutex variables, condition variables and synchronization. The features of the pthread APIs make it simpler to guarantee memory coherency and implement synchronization.

4.1 Thread overview

A thread is an independent stream of instructions that can be scheduled to run as such by the operating system [12]. Threads are defined in the program environment and initialized by the compiler; they run simultaneously and independently within a process and are managed by the operating system. Threads are used to improve performance with less overhead and fewer system resources compared with processes. The actual execution unit of a thread is a processor [11].

4.1.1 Process & thread

A process includes the program in execution and all the resources that the program involves. Processes are created by the operating system and carry a large amount of information, like the program, control words, data and status. A thread is started by a process and is the basic execution element within a process, running as a logical flow. Several threads may execute in parallel within one process, which is then called a multi-threaded process. Each thread is independent because it has its own thread context, thread ID, pointer, stack, registers and condition word, and threads are identified by their thread IDs. A comparison between threads and processes is listed in Table 4.1.

Category           | Process                                                                  | Thread
Address space      | A process has its own address space, protected by the operating system  | Threads in the same process share the same address space
Interaction        | Shared locations in the operating system                                 | Shared memory within the process
Context switching  | Heavy: the entire process state must be preserved                        | Light: only the current register state needs to be saved

Table 4.1. Processes and Threads

Managing a thread requires fewer system resources and less overhead than managing a process. Multiple threads can overlap CPU work with I/O operations (while one thread is waiting for I/O, another thread is executed by the CPU). On the other hand, because of the threads, shared resources within the process must be synchronized, which increases the difficulty of writing and debugging the program.

4.1.2 Threaded program models

Parallel programming is well suited to threaded programs. Methods for designing parallel programs (like decomposing a serial application into small independent tasks) are given in Section 3.4. Here we discuss threaded program models.

Manager/worker: A manager is a single thread, which assigns work to other threads (workers). The manager takes charge of task assignments. The size of the worker pool can be either static or dynamic.

Pipeline: By breaking a task into smaller pieces, each thread takes care of one piece of code and executes in series. Different threads work concurrently and the task is executed in parallel.

Peer: Unlike the manager/worker model, threads are equal in the peer model. Each thread, a worker, can assign tasks to the others.


4.2 Thread management

A thread has four states: ready, running, blocked and terminated, which can be changed by the thread management APIs and by the status of shared resources. With these routines, threads can be created and attributes (joinable, scheduling) can be assigned.

Creating and terminating threads

There is one default thread per process, and other threads are created and initialized by the default one. Each thread is named by a unique ID (identifier). The threads become peers after they are created, which means there is neither hierarchy nor dependency between them. Each thread can create a new thread. Once its tasks are done, a thread terminates itself. Threads can also be terminated by other threads or by the main function. Thread attributes can be set by the arguments of the routines [11].

Joining and detaching threads

Joining is a way to synchronize threads. Once worker threads complete their tasks, they are joined with their master thread. Some threads do not need to be joined, so we detach them to free some system resources [11].
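A minimal pthread sketch of the management routines discussed above (creation, joining and detaching); the worker function and its argument are purely illustrative.

    #include <pthread.h>
    #include <stdio.h>

    /* Illustrative worker: prints its argument and returns it as the result. */
    static void *worker(void *arg)
    {
        long id = (long)arg;
        printf("worker %ld running\n", id);
        return (void *)id;
    }

    int main(void)
    {
        pthread_t t1, t2;
        void *result;

        pthread_create(&t1, NULL, worker, (void *)1L);   /* joinable by default */
        pthread_create(&t2, NULL, worker, (void *)2L);

        pthread_detach(t2);               /* t2 frees its resources when it exits */
        pthread_join(t1, &result);        /* wait for t1 and collect its result   */
        printf("t1 returned %ld\n", (long)result);
        return 0;
    }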

4.3 Mutual exclusion

Mutex is short for "mutual exclusion", which is used to protect data when multiple write operations occur. A mutex is one way to implement thread synchronization. Once a thread enters a critical section, other threads must wait until it has finished. Even if multiple threads ask for the same mutex, only one of them will succeed. This is useful in parallel execution because data on the critical path can only be modified by one thread at a time, so that races are avoided [11].

A mutex is implemented as a lock (see Section 4.5). Once it is created and initialized, threads attempt to lock it. While the one thread that succeeds in locking it performs its work, the losers block on that call, or they can avoid blocking by using trylock instead of lock. After a thread completes its work in the critical section, the mutex is released and becomes available for other threads to lock [13]. The structure of a mutex is shown in Figure 4.1.


Figure 4.1. Mutex Structure [14]
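The sketch below illustrates the lock/trylock/unlock pattern described above around a shared counter; it is a minimal example added for illustration, not code taken from Wool or OSE.

    #include <pthread.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static long counter = 0;              /* shared data in the critical section */

    void increment_blocking(void)
    {
        pthread_mutex_lock(&lock);        /* block until the mutex is acquired */
        counter++;
        pthread_mutex_unlock(&lock);      /* release it for the other threads  */
    }

    int increment_if_free(void)
    {
        if (pthread_mutex_trylock(&lock) == 0) {   /* do not block if contended */
            counter++;
            pthread_mutex_unlock(&lock);
            return 1;                     /* counter was updated                */
        }
        return 0;                         /* mutex was busy, caller may retry   */
    }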

4.4 Conditional variables

Conditional synchronization means a thread stays blocked until the system satisfies a condition. In other words, it makes the thread wait until the condition is satisfied, like waiting for a notification. It is another way to implement synchronization. Condition variables differ from mutexes because conditional synchronization synchronizes threads based on variable values instead of controlling thread access to protected data. What is more, multiple threads may be permitted access to the condition at the same time. The conditions are specified by the programmer [11].

Figure 4.2. Condition Variable [15]

A condition is implemented by sharing the status of the condition on the shared data. When a specific condition is satisfied, the thread is woken up. There are three operations on condition variables: wait(L)¹, signal(L)² and broadcast(L)³, which are atomic operations. While a thread is waiting on a condition variable, it is blocked until the condition is signaled. Signaling a condition variable is used to wake up another thread which is waiting for it. Broadcasting wakes up all threads that are in a blocking wait state [11]. In Figure 4.2, one thread waits on the condition ready, then wakes up and proceeds, while the other thread signals the condition ready.
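A condensed sketch of the wait/signal pattern of Figure 4.2, using a predicate ready guarded by a mutex; all names here are illustrative.

    #include <pthread.h>

    static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  ready_cond = PTHREAD_COND_INITIALIZER;
    static int ready = 0;                      /* the shared condition */

    void *waiter(void *arg)
    {
        pthread_mutex_lock(&m);
        while (!ready)                         /* re-check: wakeups may be spurious */
            pthread_cond_wait(&ready_cond, &m);  /* releases m while blocked */
        /* ... proceed, the condition now holds ... */
        pthread_mutex_unlock(&m);
        return NULL;
    }

    void *signaler(void *arg)
    {
        pthread_mutex_lock(&m);
        ready = 1;
        pthread_cond_signal(&ready_cond);      /* wake one waiter (broadcast wakes all) */
        pthread_mutex_unlock(&m);
        return NULL;
    }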

4.5 Synchronization

In a multi-core system, threads share memory and other resources, which requires synchronization to coordinate the parallel tasks, including serializing memory instructions, protecting shared resources and waiting for multiple tasks to reach a specified point. Synchronization keeps threads coherent and protects shared memory by constraining relative instruction orderings. However, synchronization can be a major factor in decreasing parallel speedup, because tasks have to wait for each other's completion.

4.5.1 Atomic operations

An atomic operation is a simple way to achieve synchronization by working on data types [13]. Atomic operations are performed with no possibility of interruption: an atomic operation is visible as either completed or not started, with no intermediate state [16]. Atomic operations are used to implement other synchronization primitives, like semaphores. They do not block competing threads when they access shared data, which makes it possible to achieve better performance [13].
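As an example of an atomic operation usable from C, the GCC __sync builtins expose compare-and-swap directly; the lock-free counter below is an illustration added here, not part of the thesis.

    /* Lock-free increment built on compare-and-swap (GCC __sync builtins). */
    static long counter = 0;

    void atomic_increment(void)
    {
        long old_val, new_val;
        do {
            old_val = counter;
            new_val = old_val + 1;
            /* Succeeds only if counter still equals old_val; otherwise retry. */
        } while (!__sync_bool_compare_and_swap(&counter, old_val, new_val));
    }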

4.5.2 Hardware primitives

Memory barriers

To achieve better performance, both CPUs and compilers reorder instructions to keep the pipeline fully utilized. However, after such reordering optimizations, the instructions which access shared memory may be performed out of order, which can lead to incorrect results, especially when there are data dependencies.

A memory barrier is a common way to synchronize threads on multi-core.

¹Release its lock and wait. Once it completes, it indicates that the lock is required by others.
²Execute the waiting thread once and then go on executing. The lock is still held by the original thread.
³Allow all the waiting threads to work; the lock is still held by the original one.


It is a non-blocking mechanism, implemented by instructions that ensure memory accesses are performed in the expected order by forcing the processor to see a load or store positioned in front of the barrier before it sees the ones after the barrier [13]. In a word, it enforces ordering on the memory operations of one thread and guarantees that all other threads have a consistent view of memory in the system. Memory barriers are defined and specified by the processor architecture.
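On the PowerPC e500mc target used later in this thesis, barriers are issued with the sync and lwsync instructions. The gcc inline-assembler wrappers below are a minimal sketch of how such barriers are typically embedded in C; they are illustrative and assume a PowerPC core that supports lwsync.

    /* Full barrier: orders all loads and stores (heavyweight sync). */
    static inline void full_barrier(void)
    {
        __asm__ __volatile__("sync" ::: "memory");
    }

    /* Lightweight barrier: orders accesses to cacheable memory (lwsync). */
    static inline void lw_barrier(void)
    {
        __asm__ __volatile__("lwsync" ::: "memory");
    }

    /* Compiler-only barrier: stops gcc from reordering across this point. */
    static inline void compiler_barrier(void)
    {
        __asm__ __volatile__("" ::: "memory");
    }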

4.6 Threads on multi-core

Multi-threading on multi-core differs from that on a single-core system, see Table 4.2. Threads need not wait for resources (like CPUs) on multi-core because they run independently on their own cores and do not compete for resources such as the floating point unit. There are two main differences between multi-threading on single-core and on multi-core: caching and priority.

Advantages                                 | Disadvantages
Increased performance                      | Data races
Better resource utilization                | Deadlocks
Efficient data sharing                     | Code complexity
Fewer resources needed to switch contexts  | Portability issues
Simple communication between tasks         | Testing and debugging difficulty

Table 4.2. Advantages and Disadvantages of Threads on Multi-core

Cache synchronization becomes an issue for threads on multi-core, because shared memory modified by two threads may cause them to interfere with each other [13]. Besides, threads with higher priority cannot ignore lower-priority ones in a multi-core system, because they can execute in parallel, which may lead the system into an unstable state.

4.7 Summary

The use of threads enables multi-core systems to achieve performance boosts by parallelizing computations. Threads on multi-core need to be synchronized to keep private caches coherent and to protect shared resources. Most synchronization techniques, like locks and atomic operations, generally involve the use of memory barriers and kernel-level synchronization [13], which are hardware dependent. However, synchronization ensures correctness at the sacrifice of performance: the stricter a barrier is, the more costly it is.


Parallel overhead is the barrier to achieving the desired speedup. These overheads and latencies are caused by thread creation, synchronization and scheduling mechanisms as well as inter-thread communication. To get the desired speedup, one should reduce bus contention as well as communication overhead. Therefore, a scheduler for a multi-core system is required that keeps the load balanced over the threads (so that the threads work in full parallelism) without inducing much overhead. It is up to the designer to trade off sufficient parallelism against parallel overhead and make the final decision.


Chapter 5

Wool — A low overhead C scheduler

To achieve good performance on multi-core, problems are broken down into independent tasks that execute concurrently. It is of great importance to apply an efficient scheduler to multi-threaded computations to keep the load balanced with a low overhead. Work stealing and leap frogging are dedicated to scheduling multi-threaded computations, and help the system to achieve good performance by distributing work to underutilized processors with a minimized communication overhead between threads.

Wool is a C library applying work stealing and leapfrogging, developed by Karl-Filip Faxén at SICS, aiming at improving the performance of multi-core systems by distributing sequential programs over multiple cores. Wool is a low overhead user level task scheduler, providing lightweight tasks on top of pthreads [1] (see Section 3.3). According to the test results shown in his paper [17], the performance of Wool is comparable with that of Cilk, Intel TBB and OpenMP on an eight core system.

5.1 Basic concepts

5.1.1 Worker

Each thread acts as a worker and is initialized with the same attributes. In Wool, each physical processor is assigned one thread by default, which means that each processor owns one worker. It is possible for each processor to have multiple workers, but one worker per processor is recommended because context switches and cache collisions on the same processor would increase the overhead. On the other hand, the number of workers should be large enough to keep each processor busy [1]. Workers act as peers in Wool (see Section 4.1.2) and take care of task management with a pool of tasks that are ready to execute [17].


5.1.2 Task pool

A task pool is a pool of tasks managed by a worker. As the task pool grows and shrinks dynamically, Wool implements it as a stack (a double-ended queue) [17]. Newly spawned tasks are placed on top of the pool, and old tasks can be stolen by other workers from the bottom of the queue; more details are given in Section 5.1.5.

5.1.3 Granularity in Wool

As mentioned in Section 3.4, granularity measures the ratio of computation to communication, which reflects the performance of the task parallelism. The efficiency of parallelism can be measured in two ways: the task granularity

    G_t = \frac{T_s}{N_t}    (5.1)

and the load balancing granularity

    G_l = \frac{T_s}{N_m}    (5.2)

where
T_s: the sequential execution time (with no task overheads),
N_t: the number of tasks spawned,
N_m: the number of task migrations, in our case the number of steals.

Task granularity represents the average useful work per task. Lower task granularity means higher overhead. Load balancing granularity measures the average useful work per steal [1].

Wool uses a fine grained parallelism strategy, which is good for load balancing. There is a finest grain constant that defines the smallest sequential program to execute. It is mainly used in the loop functions, where more than one task may be executed on one worker as sequential tasks instead of parallel tasks if the cost of one iteration is less than the finest grain constant. By defining this constant, Wool increases the load balancing granularity with a lower task granularity.

5.1.4 Load balance in Wool

The concept of load balance was introduced in Section 3.4. Load balance is an important factor in the performance of the system. A good scheduler distributes work evenly to each processor. In Wool, work stealing takes tasks from the bottom of the task pool, which is located closest to the root. In this way, fewer steals are needed to distribute the work evenly, because the stolen regions of parallelism are relatively larger [18]. A deeper discussion is given in Section 5.2.

5.1.5 Structure

A multi-threaded computation consists of multiple threads, with one thread per processor. A thread (usually running one worker) takes care of a pool of tasks, which is organized as a double-ended queue. Two pointers point to the top and the bottom of the queue, respectively. Newly created tasks are added to the top of the queue and are ready to execute, so the queue is also called a ready queue. A ready queue is treated as a stack by its own processor, which pushes and pops tasks at the top of the queue, LIFO¹. But it is regarded as a queue by other processors (which attempt to steal work), and tasks are stolen from the bottom of the queue, FIFO² [19]. Figure 5.1 shows this work stealing structure, and a simplified sketch of such a queue is given below.
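The following is a simplified sketch (my own illustration, not Wool's actual data structure) of a per-worker ready queue that its owner uses as a LIFO stack and thieves use as a FIFO queue:

    #include <stddef.h>

    #define POOL_SIZE 1024

    typedef struct task {
        void (*run)(struct task *);     /* the work this task performs  */
    } task_t;

    typedef struct worker {
        task_t *pool[POOL_SIZE];        /* the ready queue              */
        int top;                        /* next free slot (owner end)   */
        int bottom;                     /* oldest task (thief end)      */
    } worker_t;

    /* The owner pushes newly spawned tasks on the top (LIFO). */
    void push(worker_t *w, task_t *t) { w->pool[w->top++] = t; }

    /* The owner pops the most recently spawned task from the top. */
    task_t *pop(worker_t *w)
    {
        return (w->top > w->bottom) ? w->pool[--w->top] : NULL;
    }

    /* A thief steals the oldest task from the bottom (FIFO). A real
     * implementation needs synchronization (e.g. CAS) between the owner
     * and concurrent thieves; it is omitted here for clarity. */
    task_t *steal(worker_t *w)
    {
        return (w->top > w->bottom) ? w->pool[w->bottom++] : NULL;
    }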

Tasks are initialized and ready to execute when they are spawned. Newly created tasks are always inserted at the top of the task pool and are called children tasks, while the tasks that create them are called parent tasks. Parent tasks are located closer to the root than children tasks in a given task pool. The model of Wool is shown in Figure 5.1.

5.2 Work stealing

Work stealing is an algorithm for task parallelism and load balancing across processors [20]. Whenever a worker has no tasks to do, it attempts to steal tasks from another worker. The victim³ is chosen randomly. If the victim has no tasks in its own task pool, the steal attempt fails, and the thief chooses another victim. In Wool, workers choose victims linearly from a random starting choice of victim [20]. Once a thief⁴ steals successfully, it executes the task and returns the result to the victim; then the steal process ends. In this way, underutilized processors keep working instead of idling while other processors are busy.

¹Last In First Out
²First In First Out
³Victim: the worker which has tasks in its task pool; it becomes a victim when tasks are stolen by other workers.
⁴Thief: the worker which steals tasks from other workers.

Figure 5.1. Work Stealing Structure (each worker, one per processor, pushes and pops tasks at the top of its task pool, while thieves steal from the bottom)

A worker can be in one of three states.

Working: A worker is working on a task.

Stealing: A worker has no task to do and starts to steal. It keeps on stealing until it gets a task to do.

Spin: A worker stalls and waits for the results of the stolen task which has not been finished yet.

The work stealing algorithm starts with only the main task in the ready queue. The main task spawns tasks, and the other workers start to work by stealing them. The work stealing process is shown in Figure 5.1.

We assume that at some point in time there are four tasks in the system: three tasks in Worker1 and one task in WorkerN. There is no task in Worker2. In Worker1, Task A is the parent task, and the tasks on top of it are children tasks. Figure 5.2 shows the next time stamp: Worker1 spawns a new task (Task E) and puts it on top of its task pool; Worker2 has no work to do, so it steals from other workers randomly. It finds that there is a task in Worker1 and steals the task (Task A), so Worker2 becomes a thief; WorkerN has its own task, so it executes it (Task D). In the next time stamp, suppose that Task A spawns new tasks Task F and Task G. They are children tasks of Task A, so they are put in the same worker as their parent task (Task A), i.e. in Worker2.

Figure 5.2. Work Stealing (Worker2, the thief, has stolen Task A from Worker1, the victim, which has spawned Task E, while WorkerN executes Task D)

Compared with work sharing

Work stealing is lazy scheduling, which means that tasks are not handed over to other processors until those processors are idle. It achieves lower overhead than eager scheduling, which keeps assigning tasks in order to keep the load balanced. A typical eager scheduling strategy is work sharing, which attempts to migrate some of the tasks to other processors as soon as new tasks are generated, in order to distribute the workload [21].

One advantage of a work stealing algorithm is that tasks are migrated less frequently than with work sharing. Another advantage is that, by stealing tasks from the bottom of the queue, parent tasks are stolen, and these will generate children tasks later. Because parent tasks are close to the root, it is more efficient to steal one parent task than several children tasks. This keeps the load balanced and reduces overhead by reducing the number of steals.

5.3 Leap frogging

Leap frogging is a way to optimize the work stealing algorithm. It helps to enhance the load balance and improves performance by reducing the blocking time. In the work stealing algorithm, if a task has been stolen and is in execution, the victim is blocked and must wait for the result to come out. Leap frogging is implemented by letting the victim steal tasks from the thief while waiting, which in turn helps to finish its own task. In this way, it avoids deadlock and solves the memory problem, since the task that the victim steals back is always a task that the victim would have executed if no steal had happened [23]. Stealing work back from a random worker instead would have a number of drawbacks; for instance, a task pool might grow beyond its size, since stealing adds a new stack on top of the blocked task.

The leap frogging process is shown in Figure 5.3, which is the next time stamp after Figure 5.2. After Worker2 steals Task A, Task A spawns two more tasks: Task F and Task G. At the same time, Worker1 completes Task E, Task C and Task B, and WorkerN still works on Task D.

As there is no task left in Worker1, it begins to synchronize with Task A, which has been stolen by Worker2 and is still in execution. Instead of waiting for Task A to complete, Worker1 tries to steal tasks back from the thief and helps to finish Task A (a parent task can only finish once all its children tasks are done). So Task F is stolen back because it is at the bottom of the ready queue.

In this way, the workers can work in parallel and the load balance is enhanced. However, if the task that the victim steals back is very small and no other tasks can be stolen back, the victim will stall and wait for the result to come out. Parallelism is then lost in this extreme situation.

Figure 5.3. Leap Frogging (while its Task A is still being executed by the thief Worker2, Worker1 steals Task F back from the bottom of Worker2's ready queue)


5.4 Direct task stack algorithm

The Wool scheduler uses macros and inline functions to implement independent task parallelism. The basic operations are spawn and sync, which behave like an asynchronous function call. With a spawn, a task is created and put into the worker's ready queue, but it executes only when control reaches the corresponding sync, rather than being executed immediately [17]. The task may be done either by the processor which spawned it, or by a thief which steals it.
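As an illustration of this programming style, the sketch below shows the classic parallel Fibonacci written with the task macros described in Faxén's Wool papers (TASK_1, SPAWN, CALL, SYNC). The header name and the exact macro signatures are assumptions here and may differ between Wool versions, so treat this as a sketch rather than verified Wool code.

    #include "wool.h"             /* Wool task macros (assumed header name) */

    /* fib(n) spawns fib(n-1) as a stealable task, computes fib(n-2)
     * itself, then syncs to collect the spawned result.              */
    TASK_1(int, fib, int, n)
    {
        if (n < 2)
            return n;

        SPAWN(fib, n - 1);        /* put a child task in the ready queue    */
        int b = CALL(fib, n - 2); /* execute the other half inline          */
        int a = SYNC(fib);        /* inline it, or fetch the thief's result */
        return a + b;
    }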

5.4.1 Spawn

By spawning a task, the task is initialized and space is allocated for it. The task is put into the pool of the worker that spawns it. The task is neither executed nor returns a value until control reaches a sync. Any task in the task pool can be stolen by other workers. If it has not been stolen when control reaches the corresponding sync, it is executed inline by the same worker that spawned it. Otherwise, the worker gets the result back from the thief [1].

Figure 5.4. Spawn (a newly spawned task, Task4, is placed in the first empty slot at the top of Worker1's task pool and the top pointer moves one step up)

A spawn operation performs the following steps. First, it allocates a task descriptor on top of the task pool and initializes the descriptor with the input arguments. Secondly, it initializes the status of the task to NOT STOLEN; this field changes after the task is stolen by another worker. Then it records the owner of the task: if the task is not stolen, the owner is the worker itself, otherwise the owner is a thief. This field is important when synchronizing the task or leapfrogging. A task with a wrapper function⁵ can either be stolen by others or inlined by its owner. At the end of the process, the top pointer is moved one step up and points to the first blank slot, where the next spawned task will be put [19].
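In code form, the steps above amount to something like the following simplified sketch (independent of the earlier ready-queue sketch); the type and field names are hypothetical and chosen for illustration, not taken from Wool's implementation.

    typedef enum { NOT_STOLEN, STOLEN, DONE } task_status_t;

    typedef struct task {
        void (*wrapper)(struct task *);   /* runs the task body, stores the result */
        void *args;                       /* input arguments                       */
        task_status_t status;             /* NOT_STOLEN until a thief takes it     */
        struct worker *owner;             /* worker (or thief) executing the task  */
    } task_t;

    typedef struct worker {
        task_t pool[1024];                /* this worker's task pool               */
        int    top;                       /* first free slot above the pool        */
    } worker_t;

    /* Simplified spawn: fill in the next descriptor on top of the pool,
     * then move the top pointer so the task becomes visible and stealable. */
    void spawn(worker_t *self, void (*wrapper)(task_t *), void *args)
    {
        task_t *t  = &self->pool[self->top];
        t->wrapper = wrapper;
        t->args    = args;
        t->status  = NOT_STOLEN;
        t->owner   = self;
        self->top += 1;
    }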

5.4.2 Sync

A pair of spawn & sync works like a LIFO (last in, first out) stack: a spawn pushes a task onto the task pool while a sync pops the top task off. When the code reaches a sync, it finds the most recently spawned and unsynchronized task and executes it. By synchronizing, a task is popped from the task pool and resolved in one of the following ways, depending on the situation.

Case 1: If the task is not stolen, the worker itself executes it with a call.

Case 2: If the task is stolen and finished, the result is returned to the victim.

Case 3: If the task is stolen but not finished, leap frogging is performed.
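Continuing the simplified types from the spawn sketch in Section 5.4.1, the three cases can be expressed roughly as follows; leapfrog() is a hypothetical helper standing in for the leap frogging of Section 5.3, and this is again a sketch rather than Wool's actual code.

    void leapfrog(worker_t *self, worker_t *thief);   /* hypothetical helper */

    /* Simplified sync: pop the most recently spawned task and resolve it. */
    void sync_task(worker_t *self)
    {
        task_t *t = &self->pool[--self->top];   /* LIFO: latest unsynced task */

        if (t->status == NOT_STOLEN) {
            t->wrapper(t);                      /* Case 1: execute it inline  */
        } else {
            while (t->status != DONE) {
                /* Case 3: stolen but unfinished -- leapfrog by stealing tasks
                 * back from the thief instead of spinning idle.              */
                leapfrog(self, t->owner);
            }
            /* Case 2: stolen and finished -- the result is now available in
             * the task descriptor left behind by the thief.                  */
        }
    }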

5.5 Wool optimization

Work stealing and leap frogging reduce the communication overhead and keep the load balanced in a multi-core system. However, as the number of cores increases to 50 or more, the random work stealing algorithm leads to significant overhead, which affects the work stealing efficiency and cannot be ignored. Non-random victim selection has been proposed to solve this problem. Two advanced algorithms are discussed in this section: sampling victim selection and set based victim selection.

5.5.1 Sampling victim selection

Instead of stealing work randomly from the first victim found, a thief samples several workers before each steal operation and chooses to steal the task which is closest to the root. In the task tree, the tasks closer to the root are typically larger, so stealing tasks close to the root contributes to load balance with smaller overhead (fewer steal operations take place because larger tasks are stolen). Sampling victims and selecting among them takes extra time, but not as much as it takes to complete one steal operation. When the number of cores increases to a larger number, like 128 cores, the performance is much improved [18].

⁵The wrapper executes tasks and returns the value of the task functions.


5.5.2 Set based victim selection

When the number of thieves is significant, the contention for the task pools is fierce, which means many more steal attempts compared with the number of successful steals. In this case, each thief only steals from a subset of workers, given by a private random permutation P of the indices of the other workers (these workers are its potential victims) [18]. When a steal starts, the thief picks a random starting point in the permutation and proceeds through it attempting to steal. If there is no work to steal, it starts over from the starting point [18].

5.6 Other multi-threaded scheduler

Cilk

Cilk⁶ was designed by the MIT Laboratory for Computer Science; it is a general-purpose programming language designed for multi-threaded parallel computing. Cilk provides simple linguistic extensions to ANSI C, for example the keywords spawn and sync, to implement a work stealing scheduler. Cilk++ is open source, but the compiler (the ICC compiler with Intel Cilk Plus) and tool suite are commercial.

Intel TBB

Intel TBB⁷ is short for "Threading Building Blocks". It is open source and provides task-based parallelism in C++. The TBB library manages threads dynamically and provides templates for common parallel patterns. It is implemented with a work stealing scheduler and can keep the load balanced automatically.

OpenMP

The OpenMP API⁸ is dedicated to shared memory parallel programming. It is built on top of native threads. A number of compilers implement the OpenMP API, like gcc and Visual Studio. Supported languages include C/C++ and Fortran. However, it is designed for shared memory systems and does not work for distributed memory systems.

⁶http://supertech.csail.mit.edu/cilk/
⁷http://threadingbuildingblocks.org/
⁸http://openmp.org/wp/


Wool Optimizations & Limitations

Compared with the other multi-threaded schedulers, Wool is much simpler and has good performance at the cost of some limitations; for instance, only independent tasks may be executed in parallel [22]. In addition, task descriptors have a fixed size in the current Wool⁹ [24, 25].

5.7 Summary

Wool is a macro-based parallelism library in C. It is built on two basic operations: spawn and sync. Wool focuses on independent fine grained parallelism, based on work stealing and leap frogging. Its performance is comparable to Cilk, TBB and OpenMP. However, it is limited to independent tasks, which makes it less flexible.

⁹To gain the desired speedup, Wool makes all task descriptors the same size, to simplify the management of the task pool and to control the total size of the task arguments.

Part II

Implementation


Chapter 6

OSE

Enea OSE is short for "Operating System Embedded". It has two types of kernels: the OSE Real Time Kernel for embedded systems and the OSE Soft Kernel for host computers (like UNIX workstations and PCs). Both kernels can work in single-core and multi-core systems [26]. OSE is a distributed operating system based on message passing. This chapter is based on Enea documents and white papers [26, 27, 28, 29, 30].

6.1 OSE fundamentals

Load module

Modules are applications. They can be included in the core module (the core module is a module that is linked with the kernel at compile time), or they can be built as separately linked load modules and loaded at runtime [27]. Each load module is assigned a piece of memory in the system. It consists of a block of processes and a block pool. Memory is shared by the processes within a block. Communication within or across load modules is done by message passing. Each load module is linked to one core dynamically at run time, and parallelism is realized at the load module level [26]. Figure 6.1 shows the structure of OSE.

Memory pool

A memory pool is the basic type of memory, in which buffers and stacks are allocated. There is one global memory pool for system processes and data, which is called the system pool and is allocated in kernel memory. There are also local pools which belong to their own processes. These pools can be created dynamically or statically. The maximum size and the fragment sizes can also be changed dynamically.


Figure 6.1. Processes in Blocks with Pools in Domains [26]

Block

A block is used to group OSE processes together, and acts like a process in that it can be started, stopped and killed. Each block may have its own memory pool. If a pool becomes corrupted, this only affects the block connected to it, without any influence on other blocks. Normally, a new process (child process) is part of the same block as its parent unless otherwise specified when it is created.

Domain A domain is a memory region containing programs, which is formed by grouping one or several pools into a separate "domain". This avoids the danger that arises when a signal contains a pointer into the memory pool of the sender, since the receiving process would then have the ability to destroy that pool. By forming a domain, the user can choose to copy the signal buffer from the sender when a signal is sent across segment boundaries.

6.2 OSE process & IPC

The OSE kernel is based on message passing between processes. This IPC (inter-process communication) is implemented as a simple API between processes in a distributed single-core or multi-core system [29]. OSE processes are equivalent to POSIX threads with some special features, including an individual stack, a set of specific variables and register values. Processes within a load module share the CPU time allocated by the kernel based on their priorities. A process is therefore not always running; instead it has three states: ready, waiting and running (almost the same as a thread, see Chapter 4).

A static process is created by the kernel at the start of a load module and is not allowed to be killed during the existence of the load module. In contrast, dynamic processes can be created, configured and killed during run-time. Each process is assigned a process identity (ID) for its lifetime.

Figure 6.2. Message Passing

Inter-process communication and synchronization are handled by signals. A signal is used either as an acknowledgement message (semaphores, barriers or monitors) or as a data-carrying message. Each process is assigned one unique signal queue. Each message carries information about the sender, the receiver and the owner, which makes it easier to trace and redirect. A message can only be received by the dedicated process it is sent to, and it is up to that process to choose which message to accept. This scheme is called Direct Asynchronous Message Passing (DAMP). Message passing between cores is supported by OSE using the same message passing API as in the single-core version of OSE [28]. The message passing process is shown in Figure 6.2.
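To make the signal model concrete, the sketch below shows roughly how a data-carrying signal could be sent and received with the classic OSE signal calls (alloc, send, receive, free_buf) described in the OSE API reference [28]. The signal number and the signal struct are hypothetical and only for illustration.

#include "ose.h"                     /* OSE system header (assumed available) */

#define DATA_SIG 1001                /* hypothetical signal number */

struct data_sig {
    SIGSELECT sig_no;                /* every OSE signal starts with its number */
    int payload;
};

union SIGNAL {                       /* application-defined signal union */
    SIGSELECT sig_no;
    struct data_sig data;
};

/* Sender: allocate a signal buffer, fill it and pass ownership to the receiver. */
static void send_data(PROCESS receiver, int value)
{
    union SIGNAL *sig = alloc(sizeof(struct data_sig), DATA_SIG);
    sig->data.payload = value;
    send(&sig, receiver);
}

/* Receiver: wait for the signal, read the payload and return the buffer. */
static int wait_for_data(void)
{
    static SIGSELECT sel[] = { 1, DATA_SIG };    /* first entry: number of signal numbers */
    union SIGNAL *sig = receive(sel);
    int value = sig->data.payload;
    free_buf(&sig);
    return value;
}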

6.3 OSE for multi-core

The OSE multi-core version is available for PowerPC from OSE 5.4 and later and is called MCE (Multi-core Embedded). The architecture of MCE is shown in Figure 6.3. MCE is a hybrid of SMP and AMP (see Section 2.1): it behaves as SMP on the application level, with a homogeneous single OS image and a shared memory programming model, and it has AMP characteristics on the kernel level, with multiple schedulers for multiple cores and support for message passing [30]. One OSE image controls all the cores and shared resources. Cores are marked with cpu_id; cpu_id 0 is the main execution unit from which the system is started. OSE supports SMP transparently, so applications can be moved from single-core to multi-core without modification. With the kernel thread running on the main execution unit, programs in load modules are allowed to run on any execution unit. Entire programs can be moved between execution units while running, apart from interrupt or timer interrupt processes. Programs can also be locked to a given execution unit by the designer [26].

6.3.1 OSE load balancing
OSE offers signal interfaces in the program manager to measure load statistics and to move programs between execution units. Applications regard OSE as an SMP system since it handles the distribution of the work load over the cores, but to achieve determinism requirements it is up to the applications to control load balancing. Parallelism is at the load module level [33]. Thread-level parallelism (see Section 3.2.2) and dynamic load balancing are therefore desirable in OSE, which is what Wool provides.

6.3.2 Message passing APIs
The OSE message passing APIs are supported in MCE for communication across cores. The message passing model fits multi-core better, because in the shared memory model the inter-communication overhead grows with the number of cores; with a large number of cores, the shared memory model has a higher overhead than message passing. Moreover, explicit synchronization is not needed because it is implicit in the message passing itself. However, message passing may also induce higher latency when parallel programs have many dependencies.

6.4 OSE pthreads

POSIX threads are supported by OSE. The OSE pthread is a pthread emulation layer that implements most of the thread standard in POSIX 9945-1 (1996), which is essentially 1003.1c [27]. OSE implements a subset of the POSIX thread API, dedicated to simplifying the porting process for POSIX-thread-based applications. It is nevertheless recommended to use native OSE processes, which are more efficient than OSE pthreads.

Figure 6.3. MCE: Hybrid AMP/SMP Multiprocessing [31]

An OSE pthread can be regarded as a prioritized OSE process, except that killing a thread is not allowed (it can be terminated instead). Fast semaphores are used to implement OSE pthread mutexes, so they cannot be used for other purposes. Pthreads in the core module need no special configuration, while pthreads in load modules require that the program is configured in shared mode [27].

6.5 Summary

First, OSE is based on the message passing model, which is a good solution for multi-core. Secondly, OSE supports POSIX threads by mapping OSE processes to pthreads. Most pthread functions are supported, which makes OSE a friendly programming environment for migrating POSIX-compliant applications. Thirdly, the GCC compiler is supported by OSE, which makes it easier to port Wool, since the same compiler is used when Wool is compiled on Linux. Last but not least, LINX helps to communicate across cores in MCE regardless of the operating system, which makes it possible to build a version of Wool (with either pthreads or signals) that could run on a platform with multiple operating systems.


Chapter 7

P4080 and TILE64

The multi-core platforms Freescale QorIQ P4080 (eight cores) and Tilera TILE64 (64 cores) are discussed in this chapter with regard to their features and architectures. The Freescale P4080 is the target board for the porting; TILE64 is discussed for comparison.

7.1 Features

7.1.1 Instruction set
The instruction set architecture (ISA) is one of the most important features of a computer architecture related to programming. An instruction set includes a specification of a set of instructions, data types, addressing modes, etc. The operation bits of an instruction set describe the number of bits an instruction operates on in a specific CPU; mainstream CPUs are either 32-bit or 64-bit. Instruction sets can also be classified into CISC and RISC. RISC includes the most frequently used instructions, with the same number of bits in every instruction, while CISC emphasizes hardware with complex instructions, so the code size can be smaller. Several ISAs are compared in Table 7.1.

Architecture   Bits    Instruction set
PowerPC        32/64   RISC
Tile           32      RISC
IA-64          64      EPIC1
SPARC          64      RISC
X86-64         64      CISC
X86            32      CISC

Table 7.1. Instruction Set Comparison


Programs built on different instruction sets differ a lot; one CISC-style instruction may need to be implemented by several RISC-style instructions.

7.1.2 Memory consistency
To enhance the performance of multi-core systems, large caches are used to reduce the overhead of memory accesses. This raises the problem that memory operations may be performed out of order and that the same memory location may present different values to different processors [32]. Different computer architectures define different models addressing this problem for their memory accesses, called memory consistency models.

The memory consistency model is a contract between the shared memory and the programs running on it [33]. Models are mainly divided into sequential consistency and relaxed consistency. Sequential consistency performs the accesses to memory in the same order as the program order of each individual processor. However, sequential consistency is too restrictive and causes a lot of overhead. Relaxed consistency means that instruction reorderings are permitted, but synchronization is required when there are data dependencies. In a relaxed consistency model, memory barriers are needed to constrain the memory access order. Most processors define a relaxed memory consistency model.

Out-of-order Out-of-order execution means that the execution order2 of a program differs from the program order3 due to both compiler and CPU implementation optimizations. The perceived order4, which is determined by caching, interconnect and memory-system optimizations, may in turn differ from the execution order. Out-of-order execution is an optimization that gives better performance, but it may lead to unexpected results. Here is an example of out-of-order execution.

Initial state: x = 0, s = 0

    P1                      P2
    x = 42;                 while s == 0
    // memory barrier           ;
    s = 1;                  print x;

2The order that the individual instructions are executed on a given CPU.
3The order specified in the code of a given CPU.
4The order that a given CPU perceives its and other CPUs' memory operations.


If this piece of code is executed in program order, the result will be 42, since x is printed after the execution of s = 1. If the program is executed or perceived out of order, the result may be 0, because P2 may see s = 1 before it sees x = 42. Memory barriers are needed for protection in this case.
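The same scenario can be expressed in C. As a minimal sketch (not code from the thesis), gcc's __sync_synchronize() builtin can serve as a full memory barrier on both sides:

#include <stdio.h>

volatile int x = 0, s = 0;

void p1(void)                        /* producer, runs on one core */
{
    x = 42;
    __sync_synchronize();            /* make the write to x visible before s */
    s = 1;
}

void p2(void)                        /* consumer, runs on another core */
{
    while (s == 0)
        ;                            /* spin until the flag is set */
    __sync_synchronize();            /* order the flag read before the read of x */
    printf("%d\n", x);               /* now guaranteed to print 42 */
}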

Memory consistency models for different microprocessors
Different processors supply different memory consistency models. For instance, X86 and Sparc have strict constraints and do not reorder write operations, while PowerPC and IA64 have weaker constraints and need memory barriers to constrain the execution order and the perceived order. Table 7.2 lists the memory consistency models defined by different processors [36, 32, 38].

Order                IA64   PowerPC   Sparc   X86   TILE
Read after read?     Y      Y         N       N     Y
Read after write?    Y      Y         N       N     Y
Write after write?   Y      Y         N       N     Y
Write after read?    Y      Y         Y       Y     Y

Table 7.2. Memory Consistency Models

7.1.3 Memory barrier
Memory barriers (memory fences) are mainly used with weak consistency models to constrain the memory access order and ensure data correctness. There are different kinds of barriers; for example, a store fence (SFENCE) constrains the execution order of write operations to be the same as the program order. Generally, the stricter the memory barrier, the more costly it is, since the pipeline is forced to drop the pre-fetched instructions. Developers should therefore only use memory barriers when necessary. Memory barriers are hardware dependent and are defined by the processor vendors. Table 7.3 lists some types of memory barriers and their instructions.

Architecture   SFENCE               MFENCE
sparc          membar(StoreStore)   membar(StoreLoad or StoreStore)
x86_64         sfence               mfence
PowerPC        lwsync               msync
TILE           MF                   MF

Table 7.3. Memory Barriers & Instruction


7.2 Freescale QorIQ P4080 platform

7.2.1 P4080 architecture
The Freescale P4080 QorIQ integrated multi-core communication processor5 is based on eight Power Architecture6 processor cores – e500mc7.

The P4080 is a high performance networking platform, which can be used in routers, switches, base station controllers, and general-purpose embedded computing systems. Compared with multiple discrete devices, it offers a better performance and simplifies the board design [35].

Figure 7.1. P4080.

7.2.2 e500mc core
The P4080 is based on eight e500mc cores and supports both symmetric and asymmetric modes. Each core is a 32-bit low-power processor. The e500mc core is based on the Power Architecture technology dedicated to embedded systems. The e500mc is a super-scalar dual-issue processor (two instructions per clock cycle), which supports both out-of-order execution and in-order completion. With a seven-stage pipeline, e500mc cores are able to perform more instructions per clock [35].

5Freescale Semiconductor
6Power Architecture is a broad term to describe similar RISC instruction sets for microprocessors developed and manufactured by such companies as IBM, Freescale, AMCC, Tundra and P.A. Semi.
7e500: A 32-bit microprocessor core based on Power Architecture [34].


Memory consistency model
The Power ISA provides a weakly consistent memory access model to create opportunities for the processor to reschedule memory transactions [36]:
1. Read after read may be reordered, unless caching-inhibited storage is in use.
2. Write after write can be reordered.
3. Read after write and write after read can only be protected by msync.

Memory barrier primitives
msync: ensures that all reads and writes preceding msync have completed before subsequent instructions execute.
lwarx and stwcx: these instructions are used to perform a read-modify-write operation on memory. They ensure that only one processor modifies the memory location between the execution of the lwarx instruction and the stwcx instruction. With the combination of lwarx and stwcx, operations like prefetch and compare-and-exchange can be implemented.
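On PowerPC targets, gcc's atomic builtins are compiled down to exactly this lwarx/stwcx. pattern, so a portable way to obtain a compare-and-exchange in C (a generic sketch, not code from the thesis) is:

/* Returns the previous value of *p; the swap succeeded if it equals 'expected'. */
static inline int compare_and_exchange(volatile int *p, int expected, int desired)
{
    return __sync_val_compare_and_swap(p, expected, desired);
}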

7.3 Tilera’s TILE64

7.3.1 TILE64 architecture
TILE64 is a multi-core processor with a mesh network of 64 cores (each core is called a tile). It can be classified as a Massively Parallel Processor Array (MPPA). TILE64 delivers scalable performance, power efficiency and low processing latency [38].

7.3.2 TILE
A TILE core has a 32-bit RISC instruction set with a three-way VLIW8 pipeline for instruction-level parallelism.

Memory consistency model
There are two properties of the TILE memory consistency model: instruction reordering rules and store atomicity [38]. The model is a relaxed memory model (see Section 7.1.2). For a given processor, memory accesses, as well as the order in which they become visible to other processors, may be reordered, except in the following cases [38]:
1. Data dependencies within a single processor are enforced, including read after write, write after write and write after read.

8VLIW: Very long instruction word, which refers to a processor architecture designed to take advantage of instruction level parallelism [38]


Figure 7.2. Tilera’s TILE64 Architecture [38]

2. Local visible order is determined by data dependencies through registers or memory.
3. The global visible order cannot be determined by the local visible order.

Memory barrier primitives
TILE processors define a relaxed consistency model, which needs memory barriers to make memory visible and to guarantee that inter-network communications occur in order [37]. The TILE64 processor has instructions for memory fences and global barrier syncs.
MF: the memory fence (MF) instruction is provided to ensure that memory operations prior to the fence are globally visible before the operations after the fence [38].

Chapter 8

Porting Wool to P4080

In the design of the port of Wool to the Freescale P4080 platform with the Linux operating system, the idea is to walk through the source code and modify the code involving hardware primitives. As Wool is a C library built on top of the operating system, we do not have to take care of any code other than the hardware-dependent parts. The operating system running on the P4080 is Enea Linux, which supports all the libraries used in Wool.

8.1 Design

The hardware synchronization primitives applied in Wool are mainly memory-related code, like memory fences, atomic operations and memory allocation functions, which have been described in Chapters 4 & 7. The objective is to add the hardware synchronization primitives defined by the e500mc processor to Wool's source code, written in the gcc inline assembly format.

Freescale introduced the e500mc processor, based on Power ISA 2.061, in the QorIQ family of chips. The e500mc processor is a 32-bit processor supporting shared storage between processors [39]. Its predominant model is weakly consistent, which allows the reordering of code and provides an opportunity to improve performance over a stronger consistency model. In this case, it is up to the programmer to take care of the ordering and synchronization instructions when shared storage is used among multiple cores.

1A new instruction set, combining late versions of the POWER and PowerPC instruction sets. Designed by IBM and Freescale.

8.1.1 Storage access ordering
The different memory fences used in Wool should be defined in the header file. SFENCE is short for store fence; it ensures that all write operations preceding the barrier are committed before the subsequent write operations are issued. MFENCE is short for memory fence; it controls all memory accesses (both read and write operations) and ensures that memory accesses before the barrier are committed before the subsequent memory accesses are initiated. A memory fence induces more overhead than a load fence or a store fence, because the CPU discards all pre-fetched instructions and empties the pipeline. The e500mc core defines the following operations, which are implemented in Wool:

sync: provides full sequential consistency, so it is used as MFENCE in Wool; it is strict with respect to any instruction ordering.

msync: ensures that all instructions preceding msync have completed before msync completes. It also orders data accesses across all storage classes, so it can be used as MFENCE as well, as an alternative to sync.

lwsync: is a low-overhead memory fence, which constrains the following accesses: read after read, write after write and write after read. It is used as SFENCE in Wool.

isync: causes the prefetched instructions to be discarded by the core, which ensures that the instructions preceding isync are fetched and executed before the subsequent instructions. However, isync does not affect data accesses.
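As an illustration, this mapping could be expressed with gcc inline assembly roughly as follows (a minimal sketch; Wool's actual header may name and structure these macros differently):

/* full barrier: sync (or msync) as discussed above */
#define MFENCE()  __asm__ __volatile__ ("sync"   : : : "memory")
/* store fence: lwsync */
#define SFENCE()  __asm__ __volatile__ ("lwsync" : : : "memory")
/* discard prefetched instructions */
#define ISYNC()   __asm__ __volatile__ ("isync"  : : : "memory")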

8.1.2 Atomic update primitives
The atomic update primitives used in Wool are the prefetch and CAS operations, which are based on lwarx and stwcx. lwarx creates a reservation, and stwcx stores a word on the condition that a reservation created by lwarx still exists on the same storage location. The reservation is lost if another processor modifies the same storage location before stwcx, in which case a new reservation must be set up to perform the atomic operation.

8.2 Implementation

8.2.1 Prefetch & CAS
The prefetch primitive loads and replaces a word in storage atomically [39]. The following assembly code stores the new value in r4 to the storage location addressed by r3 and returns the old value of that location in r5. The key property of this operation is that it updates the memory value atomically [16].

loop:
    lwarx  r5,0,r3    # load the old value into r5 from the address in r3 and reserve
    stwcx. r4,0,r3    # store the new value in r4 to the address in r3 if still reserved
    bne-   loop       # retry if the reservation was lost

The compare-and-swap (CAS) primitive compares a value in a register with a word in storage [39]. It loads the old value from the location addressed by r3 into r6 and compares it with the expected value in r4. If they are equal, the contents of the location addressed by r3 are replaced with the given value in r5. The old value is returned in any case. This is a typical atomic CPU operation used to achieve synchronization: it prevents two threads from updating the value at the same time, because the second one will fail and must re-compute [16].

loop:
    lwarx  r6,0,r3    # load the old value and reserve
    cmpw   r4,r6      # does r4 equal r6?
    bne-   exit       # skip the store if not
    stwcx. r5,0,r3    # store the new value if still reserved
    bne-   loop       # retry if the reservation was lost
exit:
    mr     r4,r6      # return the value from storage

8.2.2 Gcc inline assembler code
The hardware primitives (assembly code) described in the design section need to be embedded into Wool's source code, which is written in C. The assembly code is therefore written as gcc inline assembler, which is supported by the gcc compiler. Gcc inline assembler must be written according to its basic rules. The basic format is:

asm ("instructions" : outputs : inputs : registers-modified);

The first parameter contains the instructions, separated by ";". The second and third parameters can be either a storage location or a register, and they can be left empty when no input or output is used. The assembly instructions used in Wool operate on memory, so the last parameter should include "memory" to declare that memory has been changed.
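Putting the pieces together, the CAS sequence from Section 8.2.1 could be wrapped in a C function roughly as below. This is a sketch under stated assumptions: the function name and the operand constraints are chosen here for illustration and are not Wool's exact code.

static inline int cas32(volatile int *ptr, int old_val, int new_val)
{
    int prev;
    __asm__ __volatile__(
        "1: lwarx   %0,0,%2  \n"     /* load the current value and set a reservation */
        "   cmpw    %0,%3    \n"     /* compare it with the expected old value       */
        "   bne-    2f       \n"     /* mismatch: give up                            */
        "   stwcx.  %4,0,%2  \n"     /* try to store the new value                   */
        "   bne-    1b       \n"     /* reservation lost: retry                      */
        "2:                  \n"
        : "=&r"(prev), "+m"(*ptr)
        : "r"(ptr), "r"(old_val), "r"(new_val)
        : "cc", "memory");
    return prev;                     /* caller checks prev == old_val for success */
}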

8.2.3 Enea Linux
The operating system used is Enea Linux, which is powered by the Yocto Project2 open source configuration. It provides standard tools and ensures quick access to the latest Board Support Packages (BSPs) for the most common hardware architectures [40]. The architecture of Enea Linux is shown in Figure 8.1. The operating system can assign tasks to different cores automatically.

2http://www.yoctoproject.org/


Figure 8.1. Enea Linux.

8.3 Experiment

A Fibonacci program is used as the test code. There are two versions of the test code: with Wool (fib_Wool) and without Wool (fib_wo). The main difference is that without Wool the tasks are executed immediately, while with Wool the tasks are SPAWNed and placed in a task pool instead of being executed immediately, and are executed only when they reach a corresponding SYNC. The execution time is measured for comparison.

8.3.1 Experiment: Fibonacci code with and without Wool
fib_Wool — pseudo test code with Wool:

    for each iteration:
        SPAWN( Fibonacci, n-1 );
        k = CALL( Fibonacci, n-2 );
        m = SYNC( Fibonacci );
        return m+k;

fib_wo — pseudo test code without Wool:

    for each iteration:
        k = Fibonacci(n-1);
        m = Fibonacci(n-2);
        return m + k;

Fibonacci(43) is tested on different platforms with different gcc optimization levels. From Table 8.1 we can see that Wool needs gcc optimization to achieve good performance. From the first two lines of the table, fib with Wool gains a speedup of 4.32 and 4.70 respectively by using the gcc optimization option, while fib without Wool has a speedup of 2.08 on the X86 platform and 1.54 on the P4080 platform. Wool's code is written to be optimized well by the compiler and gives a much better performance with the gcc optimization option.
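For reference, a Wool version of fib in actual C might look roughly like the sketch below, assuming the TASK_n task-declaration macros from the Wool users guide [1] (exact macro names can differ between Wool versions) and a header name of wool.h:

#include <stdio.h>
#include <stdlib.h>
#include "wool.h"

TASK_1( int, fib, int, n )
{
    if( n < 2 ) return n;
    SPAWN( fib, n-1 );               /* place fib(n-1) in the task pool  */
    int k = CALL( fib, n-2 );        /* execute fib(n-2) directly        */
    int m = SYNC( fib );             /* join (or execute) fib(n-1)       */
    return m + k;
}

TASK_2( int, main, int, argc, char **, argv )
{
    printf( "%d\n", CALL( fib, atoi( argv[1] ) ) );
    return 0;
}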

By comparing the results of fib with Wool and without Wool (gcc -O3), we can see that fib with Wool gives better performance on the P4080 than fib without Wool (9.2 s with Wool compared with 11.13 s without Wool). Things are different on the X86 platform, where fib with Wool needs more time to finish the tasks (5.14 s with Wool versus 4.7 s without Wool). This is because the program runs on one core on X86, and the overhead is created by the extra SPAWN and SYNC operations. When multiple cores are used in a system, Wool distributes tasks to different cores with a low overhead to give better performance. There is more discussion of this in Chapter 9.

Code       Gcc optimization   X86        P4080
fib_Wool   gcc -O0            22.23 s    43.25 s
fib_Wool   gcc -O3            5.14 s     9.2 s
fib_wo     gcc -O0            9.79 s     17.13 s
fib_wo     gcc -O3            4.7 s      11.13 s

Table 8.1. Time for Fib(43) with and without Wool on the X86 and P4080 Linux Platforms

8.3.2 Analysis
Take fib(5) as an example, and suppose that there are multiple cores in the system.

fib(5)
 +- fib(4)
 |   +- fib(3)
 |   |   +- fib(2)
 |   |   |   +- fib(1)
 |   |   |   +- fib(0)
 |   |   +- fib(1)
 |   +- fib(2)
 |       +- fib(1)
 |       +- fib(0)
 +- fib(3)
     +- fib(2)
     |   +- fib(1)
     |   +- fib(0)
     +- fib(1)

Figure 8.2. Fib(5)

From Figure 8.2 we can see that there are 15 tasks in total in fib(5). If fib(5) is executed without Wool, all 15 tasks will be done by one core and the overall time is 15 task execution times. If the Wool library is used, one processor will execute fib(5), spawn fib(4) and execute fib(3), then spawn fib(2) and execute fib(1), and then sync fib(3) and fib(5) (fib(4) and fib(2), which fib(5) and fib(3) respectively sync with, may be stolen by other cores). The critical path for completing fib(5) consists of at least 2 spawns, 2 syncs and 3 task units (supposing there is an infinite number of cores and the other spawned tasks are stolen and executed by other processors).

Each SPAWN and SYNC takes little time when each task unit is large. However, when each task unit consists only of SPAWN and SYNC (like fib_Wool), the overhead created by them cannot be ignored. As shown in Table 8.1, fib(43) with Wool running on one core (X86) takes 5.14 s, while it takes 4.7 s without Wool. The overhead mainly consists of the cost of the extra SPAWN and SYNC operations.

Chapter 9

Porting Wool to OSE

When porting Wool from the Linux operating system to Enea OSE, the library differences should be considered first. OSE is designed to work with OSE processes and message passing, but it also provides most pthread APIs. As Wool is built on top of pthreads, OSE pthreads are tried first. OSE also supports the gcc compiler, which makes it easier to port Wool. Modifications have been made in Wool to make it fit on OSE, and the OSE load module configuration has also been edited to support Wool.

9.1 Design

OSE supplies standard C libraries, but there exist differences between the OSE libraries and the Linux libraries. Some functions are not supported by OSE and need to be added when porting Wool.

9.1.1 Malloc library
Data alignment refers to the relationship between a data address and the size of memory blocks. "A variable located at a memory address that is a multiple of its size is said to be naturally aligned [41]." The alignment rules derive from the hardware. Some computer architectures have stringent requirements on data alignment, where unaligned data accesses result in a processor trap. Wool is supposed to work on multiple platforms, so data should be naturally aligned to make it fit on different platforms.
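As a small illustration (generic code, not from the thesis), natural alignment of a variable can be checked like this:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    double d;
    /* naturally aligned: the address is a multiple of the variable's size */
    printf("naturally aligned: %s\n",
           ((uintptr_t)&d % sizeof d) == 0 ? "yes" : "no");
    return 0;
}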

OSE malloc library The malloc library includes the heap memory allocation functions [28]. Four storage allocation functions are defined in this file, including malloc, realloc and calloc. Malloc allocates dynamic memory for an object of the specified size, in either private mode or shared mode. Calloc allocates dynamic memory for an array of objects of a specified size. Realloc is used to change the size of dynamically allocated memory.

The pointers returned by the storage allocation functions are unique and non-overlapping. A pointer remains valid from the point of allocation until deallocation [28]. Pointers can be set as private or shared. A shared pointer can be used to resize or deallocate a memory block from any OSE process, whereas a private pointer can only be operated on by the process that allocated it. Private pointers can be collected when the owner process terminates, even if they have not been deallocated explicitly.

An allocated pointer is private by default in OSE. In Wool, pointers should be set to shared mode in order to share them between OSE processes and synchronize among workers. This can be done in the following three ways: (1) the block execution model is shared; (2) the block environment variable HEAP_SHARED_MODE is set; (3) the code is compiled with MALLOC_SHARED.

Setting MALLOC_SHARED is the most convenient way; it forcibly makes allocations shared at compile time. This macro forcibly enables shared heap mode pointers when the standard allocation functions like malloc are used. An OSE error is generated if there is insufficient memory to satisfy an allocation attempt [28].

The compiler and C library handle data alignment. Memory returned via the malloc, calloc and realloc functions is aligned along an 8-byte boundary on 32-bit systems and a 16-byte boundary on 64-bit systems [41]. When a larger boundary is required, such as a page, additional functions like memalign are needed.

Linux malloc library Besides these basic functions, advanced memory allocation functions are included in the library supplied by Linux. memalign, which is used in Wool, is included in the Linux malloc library but not in the OSE malloc library.

Memalign allocates dynamic memory whose size in bytes is given by the second parameter. The returned memory address is a multiple of alignment, which must be a power of 2 [41].


void * memalign (size_t alignment, size_t size);

memalign is used when a larger boundary or stricter alignment is required. Its advantage is improved performance; more information regarding how it improves performance is available in [16].

Comparison A simple test is performed to compare the performance of malloc, calloc and memalign on the Linux platform. Results are shown in Table 9.1.

gcc optimization   memalign   malloc     calloc
gcc -O0            94.36 s    100.30 s   94.32 s
gcc -O3            21.94 s    22.32 s    22.36 s

Table 9.1. Fib(46) with Memalign, Malloc and Calloc

Memalign has the best performance because of the size of the boundary. Malloc and calloc have comparable performance, and both of them guarantee the correctness of the code with the same size of aligned memory. As there is no memalign function supplied by OSE, we substitute it with malloc at the expense of performance.
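The substitution can be expressed as a small compile-time switch, sketched below; the wrapper macro wool_alloc_aligned and the guard WOOL_OSE_PORT are hypothetical names introduced here only for illustration.

#include <stdlib.h>
#ifndef WOOL_OSE_PORT
#include <malloc.h>                  /* memalign is declared here on Linux/glibc */
#endif

/* On OSE there is no memalign, so fall back to plain malloc and accept the
 * default 8/16-byte alignment described above, at the expense of performance. */
#ifdef WOOL_OSE_PORT
#define wool_alloc_aligned(align, size)  malloc(size)
#else
#define wool_alloc_aligned(align, size)  memalign((align), (size))
#endif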

9.1.2 Pthread
The OSE pthread emulates the POSIX thread and supports most of the thread standard in POSIX 9945-1 (1996), which is essentially 1003.1c [27]. In general, native OSE processes are used instead of OSE pthreads to gain higher efficiency, but OSE pthreads make it easier to port POSIX applications, like Wool, to OSE.

Properties An OSE pthread is implemented on top of a prioritized OSE process and therefore has properties similar to those of an OSE process. Each pthread ID is unique and identical to its corresponding OSE process ID. Unlike OSE processes, OSE pthreads cannot be killed; they should be terminated as pthreads, e.g. with pthread_exit [42].

Pthreads are widely used in systems with shared memory. The OSE block in which pthreads run should be configured with a shared heap, so that all resources are taken from the same heap and are freed when the block is killed. Users can specify which OSE block a thread should belong to [27].


For OSE 5.5 and later, OSE pthreads can be used in both core modules and load modules. Since Wool is a C library, we build it as a load module, which should be configured in shared mode. More information is available in [44].

Mutex OSE pthread mutexes are implemented with fast semaphores, so fast semaphores cannot be used for other purposes in OSE pthread applications [27].

Condition variable Condition variable functions are not well supported in OSE [43]. They can, however, be implemented with OSE processes using condition_signal and condition_wait signaling. It is recommended to have a signaler process, a waiter process and a condition server process, which can also be customized according to the requirements. Unlike the other pthread APIs, however, they cannot be used without change.

Pthread bind with cores Programs are created within one core by default, so all of the pthreads started by Wool would be assigned to one core. To distribute pthreads over multiple cores, the program should be set in supervisor mode, so that it can make use of a system call that moves processes to different execution units. As the P4080 has 8 cores, we start 8 pthreads and bind each pthread to a core using the ose_bind_process() function [26].

The system call ose_bind_process is used to bind an OSE process, block or domain to a specified core. It can also be used to rebind (move) processes or blocks during run time. If a process is specified, only that process will be bound; if a block is specified, all its processes will be bound to that core together with the block; if a domain is specified, all the processes and blocks inside the domain are bound to the same core. This system call is only valid in an SMP system [44].
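The worker start-up could then look roughly like the sketch below. This is only an illustration under assumptions: the exact prototype of ose_bind_process is given in the OSE SPI reference [44] (here it is assumed to take a process ID and a core number), and an OSE pthread ID is used as its OSE process ID as stated in Section 9.1.2.

#include <pthread.h>
#include <stdint.h>
#include "ose.h"                     /* assumed to provide PROCESS and ose_bind_process */

#define NUM_CORES 8                  /* the P4080 has eight e500mc cores */

static void *worker_main(void *arg)
{
    unsigned int cpu = (unsigned int)(uintptr_t)arg;
    /* assumption: the pthread ID doubles as the OSE process ID (Section 9.1.2) */
    ose_bind_process((PROCESS)pthread_self(), cpu);
    /* ... run the Wool worker loop on this core ... */
    return NULL;
}

static void start_workers(void)
{
    pthread_t tid[NUM_CORES];
    for (uintptr_t i = 0; i < NUM_CORES; i++)
        pthread_create(&tid[i], NULL, worker_main, (void *)i);
}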

9.2 Implementation

9.2.1 Load module configuration
To enable Wool to run on multiple cores, the load module should be configured in shared mode. To bind a pthread to a dedicated CPU, the load module should be set in supervisor mode instead of user mode so that it can issue the required system call. Besides, the stack size should be large enough to give each worker enough memory space.

9.2.2 Makefile
Include all the source code that should be compiled together in the Makefile. The compiler will be called and will link all the source files included in the Makefile. For example, to compile fib with the support of the Wool library, both wool.c and fib.c should be included in the Makefile. The Makefile should be in the same folder as the application code to be built.

9.2.3 System information
Some multi-core API calls are used to check information about pthreads, load modules and cores. This system information is useful for debugging.

9.2.4 Application on OSE multi-core
The OSE architecture makes it possible to migrate an OSE single-core program to multi-core without changes. All single-core C runtime APIs are supported by OSE MCE [26]. Processes can be spread out onto multiple cores by MCE automatically to achieve parallelism, and they communicate with each other via OSE signaling in the same way as single-core applications. However, on a multi-core system one can neither assume that only one entity at a time executes code, nor that a higher priority provides exclusiveness, because the multiple parallel execution units can execute processes with different priorities at the same time. It is therefore of great importance to ensure that all dependent code is locked to the same core in a multi-core environment. As Wool is a scheduler trying to keep the load balanced among cores, all of the pthreads (actually prioritized processes) should be given the same priority [45].

9.3 Experiments

The same test case as on Linux is performed on OSE. The Fibonacci test code with and without Wool is tested.

9.3.1 Experiment on OSE & Linux
The target used for this test is the Freescale P4080 board. The test code is run on Enea Linux and Enea OSE respectively. Fib(43) is compiled with the highest optimization option provided by gcc. From the column chart in Figure 9.1 we can see that fib(43) on Linux and on OSE have comparable performance, with fib_Wool performing relatively better. The reason for such a small improvement was explained in Section 8.3.2, and the following test further supports that analysis.

Figure 9.1. Fib(43) with Wool Library on Linux & OSE on P4080

9.3.2 Experiment on the performance vs. the number of cores (1)
As it is possible to bind pthreads to cores, we can set the number of working cores. Two experiments are performed: one is fib(43), the other is fib(12) with a slowdown loop, which is used to enlarge the size of each task.

Code       Number of cores   Operating system   Time for fib(43) (s)
fib_Wool   1                 OSE                19.154
fib_Wool   2                 OSE                9.949
fib_Wool   3                 OSE                9.188
fib_Wool   4                 OSE                8.939
fib_Wool   5                 OSE                8.896
fib_Wool   6                 OSE                8.885
fib_Wool   7                 OSE                9.459
fib_Wool   8                 OSE                9.424

Table 9.2. Fib(43) with Wool Library on Multiple Cores

From Table 9.2, we can see that the performance is almost doubled using two cores compared with using one core. However, as the number of cores keeps increasing, the performance does not improve much; on the contrary, it begins to decrease when more than six cores are used. This means that the overhead induced by synchronization affects the performance. The reason may be that the basic version of Wool is used here; if the more advanced version were ported, the results would be expected to be better. To further verify whether Wool works well on a multi-core system, the following test is conducted.

9.3.3 Experiment on the performance vs. the number of cores (2)
An extra slowdown loop is added to the test code, which makes the program 20,000,000 times slower than before; that is, the size of each task is increased. The results are shown in Figure 9.2. We can see that the performance of fib_Wool improves continuously with the number of cores.

Figure 9.2. Fib(12) with Large Task Body on Multiple Cores

The red columns indicate the ideal speedup, while the blue columns show the real speedup. The differences between them are the overhead induced by extra synchronization among threads.

In Figure 9.2, the performance of Wool improves almost linearly. However, with 8 cores the overhead increases greatly. That is because Core0 acts as a worker when 8 cores are used; considering that Core0 also takes care of other system tasks, it is reasonable that the speed of the test code is affected.


Chapter 10

Conclusions and Future Work

10.1 Conclusions

This master thesis discusses Wool and its performance on multi-core systems. In the theoretical study part, multi-core, parallel programming and pthreads are discussed. As multi-core starts an era of improving the performance of computations, Wool provides a solution to carry out computations concurrently and efficiently. In the implementation part, Wool is ported to a new platform and a set of tests is performed on it.

10.1.1 Theoretical study
Answers to the questions raised in Chapter 1 are stated below.

P1: Wool task parallelism Wool implements task parallelism with work stealing and leap frogging. Source code related to hardware synchronization primitives and OSE libraries needs to be modified to fit the new platform.

P2: Pthreads The OSE multi-core edition MCE supports most pthread functions, which are built upon OSE processes. However, the pthread condition variable functions are not fully supported by OSE; if these functions are used, they must be implemented with OSE processes. This part is discussed in Chapter 6 and Chapter 9.

P3: Hardware architecture Wool is able to work on the P4080, with hardware-specific synchronization primitives guaranteeing its performance. On multi-core systems, the number of cores influences Wool's performance; features like sampling victim selection are added as the number of cores keeps growing. This part is illustrated in Chapter 5 and Chapter 9.

10.1.2 Implementation
To port Wool, a C library, to another platform, one should consider hardware-related code, library support, the compiler, and application configurations.

Porting from X86 to P4080 When porting Wool from one platform to another, the hardware-dependent code should be modified to fit the new platform. All the hardware-dependent code in Wool consists of memory-related primitives, which are defined by the processor. These primitives are written in the gcc inline assembler format so that they can be compiled together with the rest of the code.

Porting from Linux to OSE The operating system supplies the libraries and compilers, and both may affect the porting. To port Wool to OSE, the gcc compiler is used (gcc is also the compiler used on the Linux platform), but some functions are missing in the libraries provided by OSE, so proper changes must be made to the code to compensate. Another point is that the file system and the architecture of OSE are different from Linux, so the configurations should be changed accordingly.

10.1.3 Result Discussion
Table 10.1 shows the results of fib(43) on different platforms. fib(43) has more than one billion tasks spawned, executed and synced. The extra cost induced by spawns and syncs is quite low considering the large number of spawn and sync operations (one billion spawns and syncs cost 0.43 s, which is the difference between running with and without Wool on Linux(X86)).

Platform       With Wool   Without Wool
OSE(P4080)     9.42 s      11.50 s
Linux(P4080)   9.2 s       11.13 s
Linux(X86)     5.14 s      4.71 s

Table 10.1. Time for Fib(43) with and without Wool on the Target Platforms

Wool gives a better performance on the P4080 multi-core platform. The fundamental operations SPAWN and SYNC are quite cheap, so a low scheduler overhead can be achieved. Besides, Wool works with plain ordinary C, unlike the C++ used by Intel TBB or the C# used by Microsoft TPL. In this respect, Wool is a good choice for Enea OSE. However, as Wool implements independent task parallelism, its usage is limited to independent tasks, and users must take care of the task dependencies themselves.

10.2 Future work

(1) Though Wool can be run on OSE, it is recommended to replace the pthread primitives with OSE message passing, which would make it more efficient. Wool currently works on top of pthreads, which need explicit synchronization when workers communicate across cores; with OSE message passing, communication among cores does not need to be synchronized explicitly.

(2) The porting was done with the basic version of Wool. It is worth porting the newest version to OSE, which may give better performance with its more sophisticated algorithms.

(3) The Enea Hypervisor supports both the OSE and Linux operating systems. It is worth working out a hypervisor version of Wool (with both pthreads and signals) that can work across the system regardless of OS type.

(4) A series of target boards may enhance their performance with Wool, for example multi-core processors such as the Ambric Am2045, Sun SPARC T3 (Rainbowfalls) and IBM POWER7. They may become further target platforms.


Appendix A

Demo Results

ose5> ./romfs/wooll.gz -p 4 12
Worker started
Worker started
Worker started
Worker started
Worker 558303160 stole a task from worker 558175144
Worker 558596184 stole a task from worker 558175144
Worker 558461016 stole a task from worker 558303160
Worker 558175144 is leapfrogging with worker 558596184
Worker 558175144 stole a task from worker 558596184
Worker 558596184 is leapfrogging with worker 558175144
Worker 558596184 stole a task from worker 558175144
Worker 558175144 is leapfrogging with worker 558596184
Worker 558175144 stole a task from worker 558596184
Worker 558596184 is leapfrogging with worker 558175144
Worker 558596184 stole a task from worker 558175144
Worker 558175144 is leapfrogging with worker 558596184
Worker 558175144 stole a task from worker 558596184
Worker 558596184 is leapfrogging with worker 558175144
Worker 558596184 stole a task from worker 558303160
Worker 558175144 is leapfrogging with worker 558303160
Worker 558175144 stole a task from worker 558303160
Worker 558303160 is leapfrogging with worker 558175144
Worker 558303160 stole a task from worker 558175144
Worker 558175144 is leapfrogging with worker 558303160
Worker 558175144 stole a task from worker 558303160
Worker 558303160 is leapfrogging with worker 558175144
Worker 558303160 is leapfrogging with worker 558596184
Worker 558303160 stole a task from worker 558596184
Worker 558175144 stole a task from worker 558303160


Worker 558596184 is leapfrogging with worker 558303160
Worker 558596184 stole a task from worker 558303160
Worker 558303160 is leapfrogging with worker 558596184
Worker 558303160 is leapfrogging with worker 558175144
Worker 558303160 stole a task from worker 558175144
Worker 558175144 is leapfrogging with worker 558303160
Worker 558596184 stole a task from worker 558461016
Worker 558303160 is leapfrogging with worker 558461016
Worker 558303160 stole a task from worker 558461016
Worker 558175144 stole a task from worker 558303160
Worker 558303160 is leapfrogging with worker 558175144
Worker 558303160 stole a task from worker 558175144
Worker 558461016 is leapfrogging with worker 558303160
Worker 558175144 is leapfrogging with worker 558303160
Worker 558461016 is leapfrogging with worker 558596184
Worker 558461016 stole a task from worker 558596184
Worker 558303160 stole a task from worker 558461016
Worker 558175144 stole a task from worker 558303160
Worker 558303160 is leapfrogging with worker 558175144
Worker 558303160 stole a task from worker 558175144
Worker 558175144 is leapfrogging with worker 558303160
Worker 558175144 stole a task from worker 558303160
Worker 558303160 is leapfrogging with worker 558175144
Worker 558303160 stole a task from worker 558175144
Worker 558175144 is leapfrogging with worker 558303160
Worker 558303160 stole a task from worker 558461016
Worker 558461016 is leapfrogging with worker 558303160
Worker 558461016 stole a task from worker 558596184
Worker 558596184 is leapfrogging with worker 558461016
Worker 558596184 stole a task from worker 558461016
Worker 558461016 is leapfrogging with worker 558596184
Worker 558461016 stole a task from worker 558596184
Worker 558596184 is leapfrogging with worker 558461016
result = 144
time is 28.519000.
workers stoped
Yay!
ose5>

Bibliography

[1] Karl-Filip Faxén, "Wool 0.1 users guide," http://www.sics.se/kff/wool, June 1,2009

[2] M. Hill and M. Marty, "Amdahl’s law in the multicore era," IEEE Comput., vol. 41, no. 7, pp. 33-38, Jul. 2008.

[3] Kumar, R.; Farkas, K.I.; Jouppi, N.P.; Ranganathan, P.; Tullsen, D.M.; , "Single-ISA heterogeneous multi-core architectures: the potential for processor power reduction," Microarchitecture, 2003. MICRO-36. Proceedings. 36th An- nual IEEE/ACM International Symposium on , vol., no., pp. 81- 92, 3-5, Dec. 2003

[4] S. Akhter and J. Roberts. "Multi-Core Programming: Increasing Performance through Software Multi-threading," Intel Press, 2006.

[5] Paul N. Leroux and Robert Craig, "Easing the Transition to Multi-Core Pro- cessors," Information Quarterly, Vol. 5, No. 4, pp. 34-37, 2006

[6] Timothy M.; Beverly S.; and Berna M., Patterns for Parallel Programming (First ed.). Addison-Wesley Professional, 2004.

[7] Duranton, M.; , "The Challenges for High Performance Embedded Systems," Digital System Design: Architectures, Methods and Tools, DSD 2006. 9th EU- ROMICRO Conference on , vol., no., pp.3-7, 2006.

[8] Chu, M.; Ravindran, R.; Mahlke, S.; , "Data Access Partitioning for Fine- grain Parallelism on Multicore Architectures," Microarchitecture, 2007. MI- CRO 2007. 40th Annual IEEE/ACM International Symposium on , vol., no., pp.369-380, 1-5 Dec. 2007.

[9] Gene Golub and James M. Ortega. Scientific Computing: An Introduction with Parallel Computing, Academic Press Prof., Inc., San Diego, CA, USA, 1993.

[10] Nicholas Carriero and David Gelernter. 1989. How to write parallel programs: a guide to the perplexed. ACM Comput. Surv. 21, 3 (September 1989), 323-357.

[11] Blaise Barney, "POSIX Threads Programming," Lawrence Livermore National Laboratory, https://computing.llnl.gov/tutorials/pthreads.


[12] "IEEE Standard for Information Technology- Portable Operating System In- terface (POSIX)- Part 1: System Application Program Interface (API)- Amendment J: Advanced Real-time Extensions [C Language]," IEEE Std 1003.1j-2000 , vol., no., pp.0_1-88, 2000.

[13] Apple Developer, Threading Programming Guide, 2010.

[14] Paul Bridger, Concepts and Synchronisation Primitives, http://www.paulbridger.com/deadlock/, 2006.

[15] Bil Lewis and Daniel J. Berg, "Pthreads Primer: A Guide to Multithreaded Programming," Prentice Hall, California, USA, 1995.

[16] John L. Hennessy and David A. Patterson, Computer Architecture: A Quantitative Approach, pages 196-264, 2007.

[17] Faxén, K.-F.; , "Efficient Work Stealing for Fine Grained Parallelism," Parallel Processing (ICPP), 2010 39th International Conference on, vol., no., pp.313- 322, 13-16 Sept 2010.

[18] Karl-Filip Faxén and John Ardelius, "Manycore work stealing," In Proceedings of the 8th ACM International Conference on Computing Frontiers (CF ’11). ACM, New York, NY, USA, Article 10 , 2 pages, 2011.

[19] Karl-Filip Faxén, "Wool-A work stealing library." SIGARCH Comput. Archit. News 36, 5 (June 2009), 93-100, 2009.

[20] Sen, Siddhartha. "Dynamic Processor Allocation for Adaptively Parallel Work Stealing Jobs," Thesis, Chapter4, Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2004.

[21] Blumofe, R.D.; Leiserson, C.E.; , "Scheduling multithreaded computations by work stealing," Foundations of Computer Science, 1994 Proceedings., 35th An- nual Symposium on , vol., no., pp.356-368, 20-22 Nov 1994.

[22] Faxén, K.-F.; , "Efficient Work Stealing for Fine Grained Parallelism," Parallel Processing (ICPP), 2010 39th International Conference on , vol., no., pp.313- 322, 13-16 Sept. 2010.

[23] David B. Wagner and Bradley G. Calder. 1993. "Leapfrogging: a portable technique for implementing efficient futures." SIGPLAN Not. 28, 7 (July 1993), 208-217.

[24] Podobas, Artur and Brorsson, Mats and Faxén, Karl-Filip (2010) A Compari- son of some recent Task-based Parallel Programming Models. In: 3rd Workshop on Programmability Issues for Multi-Core Computers , 24 Jan 2010, Pisa, Italy.


[25] Karl-Filip Faxén, Christer Bengtsson, Mats Brorsson, Håkan Grahn, Erik Hagersten, Bengt Jonsson, Christoph Kessler, Björn Lisper, Per Stenström, Bertil Svensson, "Multicore computing - the state of the art," SICS, 2008.

[26] Enea, Enea OSE Architecture User’s Guide, REV.BL140702.

[27] Enea, Enea OSE Core User’s Guide, REV.BL140702.

[28] Enea, OSE Application Programming Interface Reference Manual, REV.BL330018.

[29] Patrik Stromblad, "White Paper: ENEA Multicore: High performance packet processing enabled with a hybrid SMP/AMP OS technology", Enea AB Whitepapers, 2009.

[30] Enea, "High-Availability RTOS for Complex, Distributed Systems," Enea AB, OSE 5.5 Datasheet, 2010.

[31] Magnus Karlsson, "The Future Multicore OS – A Visionary Look into the Future of Multicore Embedded Operating Systems", Enea.

[32] Ibrahim Hur and Calvin Lin, Memory scheduling for modern microprocessors. ACM Trans. Comput. Syst. 25, 4, Article 10 (December 2007), 2007.

[33] Kourosh Gharachorloo, Daniel Lenoski, James Laudon, Phillip Gibbons, Anoop Gupta, and John Hennessy, Memory consistency and event ordering in scalable shared-memory multiprocessors, SIGARCH Comput. Archit. News 18, 3a (May 1990), 15-26. 1990.

[34] Freescale, "e500mc Core Reference Manual", REV.1 03/2012.

[35] Freescale, "P4080 QorIQ Integrated Multicore Communication Processor Fam- ily Reference Manual", Rev.0, 04/2011.

[36] Freescale, "EREF: A Programmer’s Reference Manual for Freescale Embedded Processors," Rev. 1 12/2007.

[37] Wentzlaff, D.; Griffin, P.; Hoffmann, H.; Liewei Bao; Edwards, B.; Ramey, C.; Mattina, M.; Chyi-Chang Miao; Brown, J.F.; Agarwal, A.; , "On-Chip Interconnection Architecture of the Tile Processor," Micro, IEEE , vol.27, no.5, pp.15-31, Sept.-Oct. 2007.

[38] Tilera, "TILE Processor User Architecture Manual," page 274279, REL.2.4, DOC.NO.UG101, 05/2011.

[39] Power ISA Version 2.06 Revision B, July 23, 2010.

[40] Enea AB, Enea Linux User’s Guide, REV.20120419_1426.


[41] Robert Love, "Linux System Programming," Chapter 8, O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

[42] Enea software AB, ENEA OSE Pthreads User’s Guide.

[43] Enea software AB, ENEA OSE Pthreads Reference Manual.

[44] Enea AB, OSE System Programming Interface Reference Manual, REV.BL141108.

[45] Enea AB, OSE 5 Migration Guide, REV.BL150551.
