UvA-DARE (Digital Academic Repository)

On the compilation of a parallel language targeting the self-adaptive virtual processor

Bernard, T.A.M.

Publication date: 2011
Document version: Final published version

Link to publication

Citation for published version (APA): Bernard, T. A. M. (2011). On the compilation of a parallel language targeting the self-adaptive virtual processor. Print partners Ipskamp.

General rights It is not permitted to download or to forward/distribute the text or part of it without the consent of the author(s) and/or copyright holder(s), other than for strictly personal, individual use, unless the work is under an open content license (like Creative Commons).

Disclaimer/Complaints regulations If you believe that digital publication of certain material infringes any of your rights or (privacy) interests, please let the Library know, stating your reasons. In case of a legitimate complaint, the Library will make the material inaccessible and/or remove it from the website. Please use Ask the Library: https://uba.uva.nl/en/contact, or send a letter to: Library of the University of Amsterdam, Secretariat, Singel 425, 1012 WP Amsterdam, The Netherlands. You will be contacted as soon as possible.

UvA-DARE is a service provided by the library of the University of Amsterdam (https://dare.uva.nl)

Download date: 27 Sep 2021



Thomas A.M. Bernard

On the Compilation of a Parallel Language targeting the Self-Adaptive Virtual Processor

ACADEMIC DISSERTATION

for the degree of doctor at the University of Amsterdam, by authority of the Rector Magnificus, prof. dr. D.C. van den Boom, before a committee appointed by the Doctorate Board, to be defended in public in the Agnietenkapel on Friday, 11 March 2011, at 12:00

by

Thomas Antoine Marie Bernard
born in Meaux, France

Doctoral committee:

Promotor: prof. dr. C.R. Jesshope

Other members: prof. dr. P. Klint
prof. dr. G.J.M. Smit
dr. M. Beemster
dr. C.U. Grelck

Faculty: Faculty of Science (Faculteit der Natuurwetenschappen, Wiskunde en Informatica)

The work described in this thesis was carried out in the Computer Systems Architecture section of the University of Amsterdam, with the financial support of:

• the University of Amsterdam,
• the NWO Microgrids project,
• the European FP-7 Apple-CORE project,
• the Advanced School for Computing and Imaging (ASCI).


© Copyright 2010 by Thomas A.M. Bernard

ISBN 978-90-9026006-8

ASCI dissertation series number 211.

Author contact: [email protected] or [email protected]

Printed by: Print partners Ipskamp, Enschede

Our greatest glory is not in never falling, but in rising every time we fall.
Confucius

Contents

1 Introduction 1
  1.1 Classical microprocessor improvements ...... 2
  1.2 Multicore architectures ...... 2
  1.3 Exploiting concurrency as a solution ...... 3
  1.4 Impact of concurrency on software systems ...... 9
  1.5 Contribution of this thesis ...... 12
  1.6 Overview of this thesis ...... 12

I Foundations 15

2 Background in parallel computing systems 17
  2.1 Approaches in concurrent execution models ...... 18
  2.2 Relevant parallel architectures ...... 22
  2.3 Modeling concurrency in compilers ...... 24
  2.4 Requirements for a concurrent execution model ...... 25

3 SVP Execution Model and its Implementations 27
  3.1 Our approach to multicore programming ...... 28
  3.2 Presentation of the SVP execution model ...... 29
  3.3 Hardware implementation: Microgrid ...... 36
  3.4 Software implementation: µTC language ...... 44
  3.5 SVP system performance ...... 49
  3.6 Discussion and conclusion ...... 56


II Compilation for Parallel Computing Systems 57

4 From basics to advanced SVP compilation 59
  4.1 Basics in transformations ...... 60
  4.2 SVP compilation schemes ...... 63
  4.3 Under the hood of SVP compilation ...... 67
  4.4 Conclusion ...... 83

5 On the challenges of optimizations 85
  5.1 Hazards with optimizations ...... 86
  5.2 Investigating some optimizations ...... 87
  5.3 Discussion and conclusion ...... 97

6 Implementing the SVP compiler 99
  6.1 Role of the compiler ...... 100
  6.2 Compiler design decisions ...... 101
  6.3 Compilation challenges ...... 107
  6.4 Discussion and conclusion ...... 113

7 SVP evaluation 117
  7.1 Evaluation of SVP compilation ...... 118
  7.2 Evaluation of SVP computing system ...... 122
  7.3 Discussion and conclusion ...... 137

III Discussion and conclusion 139

8 Discussion and conclusion 141
  8.1 Thesis overview ...... 141
  8.2 Limitations ...... 142
  8.3 Future work ...... 144
  8.4 Conclusions ...... 146

A µTC language syntax summary 147

Summary 157

Samenvatting 162

Acknowledgements 165

Publications 167

Bibliography 174

Index 175


List of Figures

1.1 Partitioning a sequential cooking recipe into tasks ...... 5
1.2 Communication between concurrent tasks of a cooking recipe ...... 6
1.3 Synchronization between concurrent tasks of a cooking recipe ...... 6
1.4 Management of concurrent tasks of a cooking recipe ...... 7
1.5 Overview of a standard software system ...... 9
1.6 The bridge between Software world and Hardware world ...... 11

2.1 Computing system domains ...... 18

3.1 SVP parallel computing system overview ...... 28
3.2 Illustration of an SVP family creation ...... 29
3.3 An SVP family ...... 30
3.4 SVP inter-thread communication with a Global channel ...... 31
3.5 SVP inter-thread communication with a shared channel ...... 32
3.6 SVP inter-thread communication channels ...... 33
3.7 Illustration of an SVP concurrency tree ...... 35
3.8 Different states of an SVP thread ...... 40
3.9 SVP register window layout ...... 42
3.10 Mapping of hardware registers to architectural registers ...... 44
3.11 µTC example of a simplified reduction ...... 46
3.12 Thread function definition ...... 48
3.13 Gray area between create and sync ...... 49
3.14 Functional diagram of a 16-core Microgrid ...... 50


3.15 Speedup and stall rate for Livermore kernels 1 and 7 ...... 52
3.16 Speedup of sine function ...... 53
3.17 Speedup of Livermore kernel 3 ...... 54
3.18 Performance of FFT ...... 55

4.1 Basic compilation scheme T ...... 63
4.2 Simplified compilation scheme T for a thread function ...... 63
4.3 Compilation scheme T for a thread function ...... 65
4.4 Compilation scheme for µTC create action ...... 66
4.5 Compilation scheme for µTC break action ...... 66
4.6 Compilation scheme T involving a C function call ...... 67
4.7 Call gate inserted instead of function call ...... 67
4.8 The compilation process as a black box ...... 68
4.9 A simple compiler ...... 68
4.10 A classic optimizing three-stage compiler ...... 69
4.11 A modern optimizing three-stage compiler design ...... 70
4.12 A work-flow representation of a compilation process ...... 71
4.13 Composition of a µTC program with concurrent regions ...... 76
4.14 Creation graph example of a program ...... 77
4.15 The relationship of a single concurrent region ...... 79
4.16 Control flows of sequential and concurrent paradigms ...... 80
4.17 CFG representation of an SVP create block ...... 80
4.18 DFG of an SVP shared synchronized communication channel ...... 81

5.1 Example of optimization side-effects on SVP code ...... 87
5.2 Example of optimization side-effects on communication channels ...... 88
5.3 SSA transformation ...... 89
5.4 Example of unreachable code ...... 90
5.5 Example of valid code removal ...... 90
5.6 CFG representation of thread function “foo” ...... 91
5.7 Code example with CSE ...... 92
5.8 Code example with PRE ...... 92
5.9 Example of combining instruction ...... 93
5.10 Example of copy propagation ...... 94
5.11 Instruction reordering example ...... 94

5.12 Instruction reordering example with create sequence ...... 95
5.13 Dependency chain between operations ...... 96

6.1 Compiler composition of GCC 4.1 Core Release ...... 105
6.2 Location of changes in GCC-UTC ...... 106
6.3 Shared object used as a token to enforce sequential constraints ...... 110

7.1 Instruction mix of Livermore kernels in µTC ...... 119
7.2 Comparison of code size between unoptimized and optimized code ...... 120
7.3 Comparison of instruction size between hand-coded and compiled code ...... 121
7.4 Comparison of execution cycles between hand-coded and compiled Livermore kernels ...... 122
7.5 Functional diagram of a 64-core Microgrid ...... 123
7.6 BLAS DNRM2 in µTC ...... 125
7.7 Performance of DNRM2 on one SVP place ...... 126
7.8 N/P parallel reduction for the inner product ...... 128
7.9 IP performance, using N/P reduction ...... 129
7.10 Performance of the ESF ...... 132
7.11 Performance of the matrix-matrix product ...... 133
7.12 Computation kernel for the 1-D FFT ...... 134
7.13 Performance of the 1-D FFT ...... 135


List of Tables

3.1 List of SVP instructions which can be added to an existing ISA ...... 37
3.2 List of SVP register classes ...... 42
3.3 List of µTC constructs ...... 46
3.4 List of µTC types ...... 46
3.5 Create parameters which set up the family definition ...... 48


Chapter 1 Introduction

I think there is a world market for maybe five computers.
Thomas J. Watson, Chairman of IBM, 1943

Computer-based systems are ubiquitous in daily life: in fridges which are regulated by an embedded Integrated Circuit (IC); in car control systems which manage the major driving features and engine controls; etc. The commodity computer market is driven by demands to deliver more efficient and reliable computing appliances at frequent intervals: from a few months to a few years. These computing systems have to cope with the ever-growing increase of resource-intensive applications. The mobile phone market, which has exploded over the last two decades [1], is perhaps the most relevant market in which to observe this demand. A cell phone has evolved from a simple voice-calling device to an advanced multi-purpose personal assistant implementing the latest standards [2]. Not only do modern cell phone devices allow users to make voice calls, they also connect to the Internet, send and receive emails, organize the user's agenda, and much more. Also referred to as smart phones, these portable devices assist the user in her/his daily life with various applications: video calls, embedded cameras, phone games, on-line chatting applications, social networking, etc. Hardware demands in the cell phone market have therefore exploded under the pressure of embedding ever more advanced features in applications. Nonetheless, smart phones still have to deal with the physical issues inherited from traditional IC design: short battery life and excess heat after extended usage.


1.1 Classical microprocessor improvements

To allow such technology to exist, extensive research has been conducted by the semiconductor companies who provide the heart of computing systems: the integrated circuit, also known as the chip. The first microprocessor, or processor, emerged in the early 1970s, incorporating the functions of a computer's Central Processing Unit (CPU) on a single chip [3]: arithmetic unit, on-chip memory (registers and caches), control and logic unit, clock, pipelines. For the last five decades, IC improvements have been made by the two following methods:

• increasing the processor’s clock rate and, • increasing the number of instructions issued per cycle.

Announced in 1965 in [4], Moore's law predicts that approximately every two years the number of transistors on a chip doubles. Since then, the trend of processor performance has shown that the law reflects a certain reality in chip design. However, the more transistors present on a chip, the more physical issues appear, such as:

• power consumption, and
• heat dissipation.

These classical problems of chip improvement are difficult to overcome, for example, in the domain of embedded systems such as smart phone design, where they become concrete annoyances for users. Indeed, mobile computing devices have a short battery life, especially when running computationally expensive applications such as video calls, and they dissipate excessive heat after long usage. Consequently, semiconductor companies have to look at different chip design approaches for future chip generations. To summarize, demand for new computing systems persists, while the supply struggles to provide efficient solutions for relieving these systems of the aforementioned physical issues as long as it relies on conventional improvements. There is therefore an urgent need to find other ways to improve microprocessor design.

1.2 Multicore architectures

Exploiting parallelism is not new. Parallel programming paradigms, architectures and languages have been researched for decades; they have existed since the 1970s and 1980s [5]. However, they have never been mainstream [6]. In the last few years, though, the move to parallel architectures in mainstream computing became a reality [7]. New challenges appear when exploiting concurrent architectures, as reported in [8], and they have become urgent to solve. These concurrent architectures are called Chip-MultiProcessors (CMPs) and they are introduced as a solution to save mainstream computing in its never-ending quest for better performance.

Nielsen et al. [9] describe the challenges and the benefits of multicore architectures in the context of High-Performance Computing (HPC). Indeed, dealing with multiple processors on a die pushes to the fore the issue of taking full advantage of this technology. Improving the performance of applications requires exposure of extreme levels of software parallelism. Sohi [10] suggests that the appearance of CMPs changed the directions of the computing research agenda. Articles [11, 12] also present concerns with multicore architectures. We think that one of the biggest problems facing the computer industry today is the challenge of programming the expected progression of Many-Core Chip-MultiProcessors (MCCMPs) that will reflect the beyond-frequency advances in computer performance due to the vestiges of Moore's law. It is anticipated that tens or even hundreds of thousands of cores will be possible on a single chip at the end of silicon scaling. Moreover, it is well understood that power efficiency requires many simpler cores in a processor architecture rather than fewer more complex ones. However, exploiting these new architectures requires the exposure of explicit concurrency in the code they execute, in contrast to the implicit concurrency exploited in more complex cores. This in turn requires applications to expose this concurrency, either explicitly by the programmer using some concurrency model or automatically using parallelizing compilers, neither of which is easy [13]. This challenge is well acknowledged [14, 15, 16], but the concurrency revolution is happening now and urgently requires new tools and new ways of thinking [17].

1.3 Exploiting concurrency as a solution

1.3.1 The different levels of parallelism

The concurrency revolution is happening now in computing systems [17]. The impact of concurrency on software systems appears at all levels, from the application side to the architecture side. Achieving better performance using concurrency relies on finding parallelism in the algorithms of a given application. The parallelization of an algorithm into tasks is referred to as Task-Level Parallelism (TLP). TLP focuses on distributing execution tasks (e.g. threads) across different parallel computing nodes. Unfortunately, in some applications it is difficult to discover parallelism, for example in an inherently sequential algorithm that cannot be decomposed into small functional tasks. Hence, there is no benefit in executing such algorithms concurrently rather than sequentially. The parallelism limitations of an application are expressed with Amdahl's law: the speedup of a program using multiple computing nodes in parallel is limited by the sequential section of the program.

Therefore, it is important that an architecture can also take advantage of finer-grained concurrency at the level of operations, named Instruction-Level Parallelism (ILP). Thus, even if the functional parallelism of a given program is non-existent, there are still ways to improve performance by performing operations simultaneously. In addition to this, another type of concurrency can be exploited with Data-Level Parallelism (DLP). For instance with a loop (sometimes called Loop-Level Parallelism (LLP)), data parallelism is achieved when each computing node performs the same task on different pieces of distributed data.
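To make Amdahl's law concrete, the following minimal C sketch computes the theoretical speedup for a program in which a fraction p of the work can be parallelized over N computing nodes; the numbers chosen are illustrative only.

#include <stdio.h>

/* Amdahl's law: with a fraction p of the work parallelizable and N
 * computing nodes, the best achievable speedup is 1 / ((1 - p) + p / N). */
static double amdahl_speedup(double p, int n)
{
    return 1.0 / ((1.0 - p) + p / (double)n);
}

int main(void)
{
    /* Even with 95% of the work parallelized, 64 nodes give less than a
     * 16x speedup, and the limit as N grows is 20x: the sequential 5%
     * dominates. */
    int nodes[] = { 1, 4, 16, 64, 256 };
    for (int i = 0; i < 5; i++)
        printf("p = 0.95, N = %3d -> speedup = %.2f\n",
               nodes[i], amdahl_speedup(0.95, nodes[i]));
    return 0;
}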

1.3.2 Approaching concurrency with a cooking recipe example

Concurrency concepts are difficult to comprehend. For that reason, we use in this section a practical example: a cooking recipe problem. We use this problem (at a large size) to grasp the essence and the significance of concurrency concepts and their relevance to performance. A recipe from a cook book is fundamentally sequential when you read it. There is a list of steps to perform to accomplish the dish which is the end result. The sequential way to proceed is to start from the first step of the recipe and perform it. Once this first step is finished, the next one on the list can be processed, until completion. And this is the same for all of the following steps until the end of the recipe. The advantage of this is that the processing is deterministic; we know at any point in the recipe what the status of the overall execution is, and the outcome of a transition from one state to another is predictable. We are interested in rendering the recipe's execution parallel while keeping it deterministic. With a large problem size, the recipe has to be produced for 100 guests in a fully-equipped kitchen with a crew of cooks and a multi-unit cooker. It thus becomes important to execute the recipe more efficiently and faster; the sequential way would take too much time when serving the dish (i.e. the result of the recipe) to everyone within the shortest time between the first and the last guest served.

1. Boil salted water in a pot. Once the water is boiling, cook the pasta for 10 minutes. After draining the pasta, add some olive oil and stir.

2. For the sauce, slice garlic cloves on a board. In a pot, put some olive oil on a low-heat fire and add the sliced garlic. Mince the ham and add it to the pot. Stir regularly and add chopped basil leaves. Season the pot to your taste.

3. Once the pasta and the sauce are ready, mix both and cook them slowly for a few minutes.

The ingredients are assumed to be the data to compute on in this recipe problem. The resources on which to execute are a multi-unit cooker and a crew of cooks. From the steps, we can observe a trivial partitioning into tasks to make the recipe's execution faster. Using a TLP approach, we can isolate the sections that can be done independently. Figure 1.1 illustrates the partitioning of the sequentially-oriented recipe. For instance, Task 1's goal is the cooking of the pasta. The big problem of making the dish is decoupled into 3 sub-problems:

1. Task 1: Cook the pasta.

2. Task 2: Prepare the sauce.

3. Task 3: Assemble and finalize the dish.

The granularity of the task to be performed is important to distinguish while dealing with parallelism. The granularity can be coarse-grained, as with Task 1, which is a sequence of operations: pour water into a huge pot; light the fire on the cooker; salt the water; place the pot on the fire. The granularity can also be fine-grained and closer to the atomicity of an operation. For instance, in Task 2, the ham can be minced by multiple resources (i.e. multiple cooks), in the sense that the meat is the data, following a DLP approach: the ham is separated into several portions to be taken care of individually. Moreover, it is interesting to add that another type of parallelism can be used to extract concurrency: ILP can exploit simultaneous actions within Task 1, since lighting the fire on the cooker and pouring water into the pot can be done separately.


Figure 1.1: Partitioning a sequential cooking recipe into tasks. On the left-hand side, the cooking recipe is a sequence of steps that can be accomplished one after another. This sequence of processing is also called sequential execution. On the right-hand side, the recipe is decomposed into small tasks. Decomposing into smaller problems may enhance the execution of the overall problem by exposing task-level parallelism.

What matters, in the end, is to be able to accomplish the execution of the recipe faster, if possible, and remain sure of a correct result comparable to the sequential way of processing the recipe. This determinism of results is a priority in parallel processing, and to accomplish that, further steps in defining and refining the application are needed. Thus, once the tasks have been generated, it is important to formalize the communication that can occur between tasks. For instance, Task 3 requires data from Task 1, i.e. the cooked and drained pasta, as illustrated with Figure 1.2. Therefore, this exchange of data is notified explicitly in the recipe's definition and the cooks are thus aware of it.



Figure 1.2: Communication between concurrent tasks of a cooking recipe. Task 3 requires data from Task 1 (i.e. the cooked pasta) and Task 2 (i.e. the ready-to-be-added sauce). Therefore, communication channels are exposed and the flow of dependent data becomes visible. We can also see that Task 1 and Task 2 are independent of each other.

Moreover, it is also important to isolate the synchronization points between the different tasks of the cooking recipe. One task must first be accomplished before another, dependent task can be started. Figure 1.3 shows the points of synchronization necessary for good coordination while executing the cooking recipe. Task 3 waits for Task 1 and Task 2 to be finished before it can start its execution. Consequently, we mark the recipe with a synchronization point for Tasks 1 and 2 and we put it before Task 3.


Figure 1.3: Synchronization barriers between concurrent tasks of a cooking recipe. The diamond shapes represent the points of coordination necessary for proper execution. Without coordination between the tasks, there is no guarantee of handling inter-task dependencies properly. The Start and End are respectively the entry and exit points of the recipe.

Once the recipe has been partitioned into ‘small’ tasks, once the inter-task communication is clearly identified and once the synchronization points are visible, the concurrency management stage becomes the last step for the recipe designer (the chef) to accomplish. Concurrency management comprises: the mapping stage, which takes care of on which resource and where a task will be performed; scheduling, which considers all the inter-task dependencies to define and regulate the order of execution of the tasks and their operations; and dynamic resource management, which deals with the availability of resources during the execution of the recipe. This is illustrated with Figure 1.4. For instance, if a cook leaves the kitchen after the execution of the recipe has started, or if a cook comes in during the execution, tasks may have to be rescheduled and remapped onto the available resources.


Figure 1.4: Management of concurrent tasks of a cooking recipe comprises multiple steps. First, mapping specifies where a task is to execute. Scheduling, considering the dependencies of the program such as inter-task communication and synchronization points, orders the tasks to be executed. Dealing with multiple resources, which may be unknown at design time, implies a dynamic management of resources while executing the tasks.

Some execution issues can occur, for instance when one of the burners breaks down; the operation being accomplished on it is therefore not achieved. The entire recipe depends on that operation and becomes deadlocked. This is called a resource deadlock. A deadlock can also occur when the recipe has not been well designed by the chef: during execution, it just blocks unexpectedly and the operation or task waiting for another, dependent one to finish never starts. Similar to a deadlock, a livelock can happen if the cooks waiting for the ham never actually process it because they constantly change their states; overall, nothing progresses. Another problem, called a race condition, can occur when two cooks try to access the same portion of the ham, for example; they then compete to get it first. A way to prevent that is to set up a mutual exclusion (also called a mutex) to avoid the simultaneous use of a common resource.

To conclude, the major problem faced by concurrency is to be able to express these concepts properly and correctly while designing an application (i.e. the cooking recipe). We use this example as an analogy for a heterogeneous multi-unit target platform (i.e. the kitchen with the multi-unit cooker, the crew of cooks and the chef). Dealing with multicore architectures raises these concurrent programming problems.
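As a rough illustration of how the notions above (task creation, a synchronization barrier, and mutual exclusion) look in actual code, the following C sketch uses POSIX threads to mimic the recipe: Tasks 1 and 2 run concurrently, Task 3 waits for both, and a mutex guards the shared ham. The task bodies and names are hypothetical placeholders, not part of the recipe itself.

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t ham_lock = PTHREAD_MUTEX_INITIALIZER;
static int ham_portions = 8;          /* shared data: portions of ham to mince */

static void *cook_pasta(void *arg)    /* Task 1 */
{
    (void)arg;
    puts("Task 1: pasta cooked");
    return NULL;
}

static void *prepare_sauce(void *arg) /* Task 2 */
{
    (void)arg;
    /* Several cooks could run this; the mutex prevents a race on the ham. */
    pthread_mutex_lock(&ham_lock);
    while (ham_portions > 0)
        ham_portions--;
    pthread_mutex_unlock(&ham_lock);
    puts("Task 2: sauce ready");
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, cook_pasta, NULL);     /* Tasks 1 and 2 ...    */
    pthread_create(&t2, NULL, prepare_sauce, NULL);  /* ... run concurrently */

    /* Synchronization point: Task 3 cannot start before Tasks 1 and 2 end. */
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);

    puts("Task 3: dish assembled");                  /* Task 3 */
    return 0;
}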


1.3.3 Concurrent software design and issues

Software design evolves and requires new ways of thinking for programmers when designing applications [18]. Designing a concurrent application resembles what we have looked at with the cooking recipe example. Thus, there are several major notions, and their related issues, to keep in mind when designing a problem in a concurrent manner.

Partitioning The partitioning stage of design is intended to expose opportunities for parallel execution. Hence, the focus is on defining a large number of small tasks in order to yield what is termed a fine-grained decomposition of a problem.

Communication The tasks generated by the partitioning stage are intended to execute concurrently but cannot, in general, execute independently. The computation to be performed in one task will typically require data associated with another task. Data must then be transferred between tasks so as to allow computation to proceed. This information flow is specified in the communication phase of a design.

Synchronization In the third stage, development moves from the abstract toward the concrete. Developers revisit decisions made in the partitioning and communication stages with a view to obtaining a work-flow of the algorithm that will execute. The synchronization points are then marked to coordinate the program's execution correctly.

Concurrency Management In this stage of the parallel algorithm design process, developers specify where each task will execute, referred to as Mapping. This mapping problem does not arise on a single computing node, or even on some multiple computing nodes such as shared-memory computers that provide automatic task scheduling. Scheduling the tasks while executing the application is also part of this stage. Moreover, dynamic resource management handles the potential problems of mapping tasks as resources become available or unavailable in the set of resources.

Developers have to deal with the issues involved in these stages. In current approaches the description and management of concurrency are not decoupled, which results in a mixture of concerns that overwhelms developers, making parallel application development very difficult and error-prone. This lack of a clear separation of concerns also means a lack of appropriate high-level abstractions in both the architecture and the application, which precludes portability between different platforms. Thus what currently happens is that applications are either developed targeting specific platforms, or existing applications are retargeted to a platform through a painstaking process of static application mapping and the introduction of platform-specific functionality into the application itself.

1.4 Impact of concurrency on software systems

1.4.1 On a whole system

This concurrency revolution, presented in [13, 17], has an impact on software systems. Users utilize applications without knowledge of the machinery underneath, as shown in Figure 1.5. Applications are executed on the hardware; the operating system provides the interface between the software and the hardware. A multicore architecture requires an adapted toolchain in order to operate, comprising a concurrency-oriented operating system and applications. Users do not need to be aware of the machinery. Consequently, this toolchain must be adapted to handle the features of this new concurrent target platform. Nowadays, the major issue to cope with is the multicore programming menace, which is already acknowledged by the community [14, 15, 16]. The previous section exposed the concepts of concurrency that have to be embedded into software systems. The problem is that multicore architectures exist but the tools are not yet ready.


Figure 1.5: Overview of a standard software system. The user on top of the layers employs applications such as a game on a cell phone. The gaming application is then executed on the hardware while being interfaced by the operating system.

Moreover, although new concurrent target platforms are here, old sequential target platforms and their sequentially-oriented applications are still heavily used. For that reason, the software community must consider backward compatibility of old software systems with these new platforms. Execution toolchains are composed of various layers of software programs. Figure 1.5 depicts interactions between these different layers of a software system used in the execution of a program. Based on an architecture at the bottom, a series of software programs is required to make proper usage of the hardware machinery. The Operating System (OS) layer directly provides an interface between the hardware and other software layers. The OS provides a Run-Time System (RTS) managing the execution of applications. Moreover, the System Layer supplies the standard libraries for networking, the file system, etc. Applications can use specific components taken from the Framework Layer. In the end, the layers of software systems need to be adapted to concurrency for the coming multicore architectures.

The execution toolchain of a software system is necessary on all computing systems to execute applications. To build these applications, we need to look at the development toolchain used to render an executable application. Figure 1.6 shows the major tools necessary for software development.

1.4.2 Bridging the software and the hardware worlds

In this thesis, we look at a particular part of software system development which bridges the software side and the hardware side. Introducing concurrency concepts as a new paradigm requires appropriate concurrency-aware tools. Articles [19, 20] analyze the different existing parallel programming models and languages available to developers. They capture the abstraction of parallelism in these models and evaluate them. Despite the effort in programming model research, the tools are not yet ready. Figure 1.6 abstracts the levels of software and hardware design and development tools. For software developers, the goal is to rely on a tool that would just do the job after they express the tasks to be performed. In other words, developers want an easy-to-use and reliable toolchain for their development. Compilers are a major component in software development. Developers often rely on them to optimize their code for an architecture of which they do not need to be completely aware. The role of the compiler is to take advantage of the targeted architecture with respect to the tasks to be accomplished within the programmed applications. Therefore, pressure on this component (in Figure 1.6, see the central box) comes from the software side, where the compiler must support all source language features and must also understand the meaning of the program in order to get the most optimized result. Pressure also arises from the hardware side, which has no notion of the software implementation constraints; the compiler must know where in the program the code can be improved and tweaked to take the greatest advantage of the targeted architecture. Consequently, bridging these two worlds is the bottleneck where the compiler, the software system transforming a user-level language into a machine-level language, operates as a middle man. This middle-man tool has to be aware of both sides of this divide in order to function; the introduction of multicore architectures has already raised concerns [21]. The work in this thesis is based on an underlying compiler development for a new parallel programming language (called µTC, a C-like language with keyword extensions and new semantics) targeting a many-core architecture (called the Microgrid).


(Figure 1.6 shows the following layer stack: Applications; Software Frameworks: Libraries, Runtime, Debuggers, Profilers; Source languages; Compilers; Target / Machine languages; Binary files; Hardware Architectures: Integrated Circuits, Microprocessors, Semiconductors.)

Figure 1.6: The bridge between the software world and the hardware world. We can see the major layers present in software development for an application. The developer takes care of an application programmed in a source language using a development framework and tools such as a debugger or profiler. Libraries can be used as well in the application to enhance some of its properties. Compilers then convert the application into its corresponding machine-language representation. Once in binary format, the architecture executes the application using its microprocessor(s).

Furthermore, this thesis describes a novel concurrent execution model - the Self-Adaptive Virtual Processor (SVP) - which has been developed by the Computer Systems Architecture (CSA) group at the University of Amsterdam. Various implementation work related to this model is also part of the CSA group's research (the aforementioned µTC and Microgrid, respectively the language and hardware implementations). The main goal of SVP is to provide an operational computing system covering various areas of research, such as hardware emulation using cycle-accurate simulation, high-level program simulation on conventional architectures, memory simulation, concurrent language design, operating systems and compilers. Having an exotic programming language (µTC, in opposition to classic C-like languages) necessitates an adapted and dedicated toolchain; stress is especially put on compilers to perform valid program transformations without loss of program semantics and to improve the code so that execution is as efficient as possible.
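To give a flavour of the kind of code this compiler has to handle, the sketch below imitates the style of a µTC program creating a family of threads. It is only a rough illustration: the keyword spellings, the create() parameter list and the channel semantics shown here are assumptions made for exposition, and the actual µTC constructs are defined in Chapter 3 and Appendix A.

/* Rough, µTC-flavoured sketch (assumed syntax, for illustration only). */
thread void accumulate(shared int sum, int *data)
{
    index i;                 /* each thread in the family has its own index  */
    sum = sum + data[i];     /* a 'shared' acts as a synchronized channel    */
}                            /* between consecutive threads in the family    */

void vector_sum(int *data, int n, int *result)
{
    family fid;
    int sum = 0;
    create(fid; 0; n - 1; 1;) accumulate(sum, data);  /* spawn a family of n threads */
    sync(fid);                                        /* bulk synchronization point  */
    *result = sum;
}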


1.5 Contribution of this thesis

This thesis focuses on the multicore programming menace and how we tackle it by compiling the SVP model to its various implementations. Additionally, this thesis concentrates on the issues of bridging two major components of the system: the first is the language implementation as a concurrency description mechanism; the second is the architecture implementation as a multicore platform. This thesis illustrates the issue of generating efficient and correct code for this multicore architecture. Moreover, this thesis exposes the changes required and the challenges encountered to integrate native concurrency idioms and assumptions into an existing, conventional, sequentially-oriented compiler.

At first glance, this work targets a specific audience of technical computer scientists involved in compilation. Nevertheless, this work also represents a reflection on the limits of existing engineering methods to tackle the challenges of dealing with concurrency. The essence of the technical contribution of this work is then used as a theory of engineering on the limitations of current computing systems and other concurrency-based systems. In the context of facing the multicore programming menace, this work becomes relevant to an audience dealing with issues of concurrency-based compilation, language design, and multicore programming.

1.6 Overview of this thesis

In this thesis we aim to expose the changes and the challenges facing the integration of native concurrency idioms from a concurrent execution model into an existing sequentially-based system. In Chapter 2, we study past and current work related to parallel computing systems. In addition, we evaluate how close or how different our research is compared with this work at different levels of granularity (at the model, architecture, compiler and language levels). In Chapter 3, we describe a candidate to tackle the “multicore programming menace”: the Self-Adaptive Virtual Processor (SVP), an abstract concurrent execution model. This model combines fine-grained threads (concurrent composition is assumed at all levels) with dataflow synchronization between threads. As such it is a compromise between capturing maximal concurrency (i.e. dataflow) and providing efficient implementation (i.e. threads). It provides deadlock-free composition and captures locality and regularity via the constraints of the model, which means it is amenable to compiler analysis and code transformation even in the absence of a specific target architecture. Also, like the sequential and dataflow models, it provides determinism of results. We then depict two implementations derived from this SVP model: a hardware target implementation named Microgrids, and a source language implementation named µTC (pronounced ‘myoo - ti: - si:’). After that, in Chapter 4, we discuss the bridge between these two implementations, resulting in compiling from the source language µTC to the target architecture Microgrids. We also study the differences between classical compilation and SVP compilation. Chapter 5 investigates conflicts between SVP properties and conventional compiler optimizations. Chapter 6 then exposes the challenges of SVP compilation with current compilation systems; moreover, we explain the methods used to embed these concurrency concepts. Chapter 7 discusses the evaluation of SVP compilation with scientific problems. Finally, Chapter 8 summarizes this thesis work as a reflection on existing engineering methods, referred to as a theory of engineering, and on the future of compilers as concurrency-aware systems.


Part I

Foundations


Chapter 2 Background in parallel computing systems

If people never did silly things, nothing intelligent would ever get done.
Ludwig Wittgenstein, Philosopher

In this chapter, we present a selection of past and current work conducted in both academia and industry in parallel computing systems. We can thus evaluate the relevance of our research and position it against this work. Parallel computing has been researched since the 1970s, when most of its concepts were already defined [5]. At first focused on scientific computing and HPC, its move to mainstream computing became clear in the 2000s [7]. We note that it is difficult to position our research, since we tackle the multicore programming challenge with the full-system approach introduced in Chapter 3. Therefore, at this point, it is important, as illustrated in Figure 2.1, to distinguish the different domains and their influences. The comparisons with other work can be done at different levels of granularity of the system: at the model level, comparing concurrency approaches; at the system level, comparing the system as a whole against others; at the approach level, in ways of dealing with concurrency; and at the compiler level, in the sense of how it deals with concurrency. This chapter answers these questions and also exposes, at the end, the requirements that we consider priorities for a concurrent execution model.



Figure 2.1: Overview of computing system domains and their intersections. The domain of execution models encapsulates the domains of programming languages, compilers and architectures. These domains are derived from the execution models with their properties. The compiler domain overlaps both domains of languages and architectures.

2.1 Approaches in concurrent execution models

Computing systems involving concurrency at some level are already present in the market. With some exceptions, it is observed that approaches are made either at the architecture level or at the programming level. This section gathers an overview of execution models. It is important to note that this overview of approaches is not complete, since new approaches appear frequently. Moreover, we note that there are different angles for observing each model. The programming model is the first aspect; then, each execution model targets a specific area, which can be a particular architecture to program or more general-purpose targets. Furthermore, in the context of this thesis, we also look at each model's implementation in its tool set. Top-down approaches come from the software side and aim to target general architectures rather than a specific one. Bottom-up approaches project forces from the hardware layer onto the software layer; the resulting software definition is then dedicated to this architecture.

2.1.1 CUDA

A characteristic example of a typical hardware and programming model currently available in industry is NVidia's Compute Unified Device Architecture (CUDA) [22]. This execution model is pushed from the bottom up by the architecture. In other words, the CUDA execution model is dedicated to CUDA graphics hardware and is not portable to other architectures. CUDA makes use of GPU arrays by providing a programming interface that allows the offloading of computations from the CPU via a high-speed bus, for example AGP. The limitations of this model (both hardware and programming framework), when compared to the work in this thesis (the SVP execution model in Chapter 3), are twofold: the programming model requires highly explicit concurrency through the use of many code directives, and the hardware itself relies on an ‘accelerator’ model, where more general computations are carried out on a CPU, and hence computations carried out on the CUDA architecture are bound by bus speed limitations. Also, the scheduling of work/threads lacks the flexibility and concurrent composition allowed by the SVP execution model. The CUDA execution model defines concurrent regions where programmers do not need to manage thread creation or thread destruction. Moreover, CUDA employs Single-Program Multiple-Data (SPMD) techniques; a concurrent region will execute at run-time as a set of concurrent threads running at the same time.

2.1.2 Dataflow approach

Approaches with dataflow-like models have a long history of research, which has now received additional focus given the impending limits of conventional performance scaling. Lee et al. [23] present key points of dataflow architectures and multithreading. The dataflow model of execution offers attractive properties for parallel processing. This model captures a program's dependencies, is asynchronous, and bases instruction execution on operand availability; synchronization of parallel tasks is implicit in this model. Moreover, scheduling is achieved dynamically by satisfying program dependencies. Lee et al. also expose the convergence of the control-flow and dataflow models of execution; the authors expect to build hybrid architectures using a dataflow approach in conventional control-flow thread execution. To conclude, this article states that the eventual success of dataflow computers depends on their programmability. In this thesis, the SVP execution model tackles this challenge with the Microgrid dataflow-driven architecture implementation (cf. Section 3.3) and the imperative-language µTC implementation (cf. Section 3.4).

Historically, closely related work can be observed from around 20 years ago at MIT during Arvind's research on combining dataflow principles with Von Neumann computing [24, 25]. This work explored the potential benefit of providing the scheduling of dataflow computing on conventional instruction sets. In Transactional Memory (TM) [26], dependencies are discovered dynamically, in contrast to the static capturing of dependencies in the SVP model, which facilitates very fine-grained concurrency. TM can also be fine-grained as long as there are not too many dependencies. More recent work similar to the principles used in SVP is the WaveScalar architecture [27]. This achieves pure-dataflow execution by using an implicit program counter for memory read/write ordering. However, there are restrictions in the amount of concurrency that this can expose: constraints on memory reordering; broadcasting data tokens in a large parallel system (because of reads/writes to memory); dispatching instruction tokens in a large parallel system; and limitations in the exposure of data dependencies in a program because of memory limitations. These memory limitations are twofold: one memory is bounded (i.e. the synchronizing memory) and the other is practically unbounded with VM (i.e. regular memory), which implies a performance hit. In other words, regular memory poses no real limitation but synchronizing memory does. In conclusion, dataflow machines use a bottom-up approach, pushing properties up from the architecture to the software level.

2.1.3 UPC

Unified Parallel C (UPC) [28], designed for high-performance computing (HPC), is aimed at mainstream parallel architectures. The model presents explicit concurrent execution using a single shared, partitioned address space, referred to as a Partitioned Global Address Space (PGAS). Each variable may be directly read and written by any computational unit (i.e. processor), but each variable is physically associated with a single processor. UPC is an extension of the C programming language; in GUPC [29], it is implemented as a front-end extension of the GCC framework with a library invoked during compilation of UPC constructs. UPC constructs map onto hardware processes using high-level threads (a Pthreads implementation). The GCC implementation is extended with new language constructs at the front-end and middle-end levels. Even though UPC has a uniform programming model for both shared- and distributed-memory hardware, it does not capture data dependencies as properly as the SVP model does (cf. Chapter 3). The top-down approach of UPC is not to provide a full computing system, in contrast to other approaches, but to give a parallel language implementation with an associated compiler implementation.

2.1.4 OpenMP

OpenMP [30, 31] is a well-known industry standard for exploiting parallelism. OpenMP provides directives for the Fortran and C/C++ programming languages to support shared-memory architectures. OpenMP uses code pragmas to expose concurrency in programs, whereas other approaches prioritize language extension with special constructs. A compiler interprets this concurrency information (e.g. GCC with GOMP [32] supports the latest OpenMP 3.0 standard); the OpenMP standards use a dedicated library to support concurrency idioms. With its pragma approach to exposing concurrency in programs, OpenMP puts pressure on compilers with concurrency issues at the level of compiler optimizations. The concurrency constructs are not fully exposed to the compiler's internals, which generates gray areas in how to correctly compile concurrent regions. To deal with non-sequential OpenMP code, compiler extensions have been incorporated in GCC 4.4. Nonetheless, OpenMP targets in particular SMPs with coarse-grained concurrency and lacks efficient inter-thread synchronization [33]. Another drawback of using OpenMP is the large variety of code annotations and their options, which makes programming more complex for developers.
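As a brief illustration of the pragma style described above, the following C fragment parallelizes a loop with OpenMP; it is a minimal sketch, not code taken from the cited standards.

#include <stdio.h>
#include <omp.h>

int main(void)
{
    const int n = 1000000;
    double sum = 0.0;

    /* The pragma exposes the concurrency; without it the loop is plain
     * sequential C.  The reduction clause tells the compiler and runtime
     * how to combine the per-thread partial sums. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += 1.0 / (double)(i + 1);

    printf("harmonic(%d) = %f (computed by up to %d threads)\n",
           n, sum, omp_get_max_threads());
    return 0;
}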


2.1.5 Sieve

Sieve [34] can be seen as an alternative to OpenMP, and is released by Codeplay [35]. Sieve is a C++ parallel programming system which aims at simplifying the parallelization of concurrent regions. Its approach is a new construct that encapsulates parallel regions (over loops) to secure memory consistency. Side-effects within the Sieve block are delayed until the end of the scope (of this encapsulation). As with any other language extension or code annotation technique, Sieve suffers a little from the fact that an application's code needs to be redesigned/reengineered. Overheads for developers are not negligible when language constructs are too numerous or too complicated to use. Nonetheless, the advantage is the simplicity of this encapsulation. On top of that, the compiler has a clear view of the hazardous areas in the code where compiler optimizations could break code consistency.

2.1.6 Cilk

Cilk [36] is a general-purpose programming language designed for multithreaded parallel computing. The language extends the C programming language with a few keywords to handle explicit parallelism. In this top-down approach, the programmer identifies concurrent regions which can be run safely. Thread management is performed implicitly by the scheduler. Cilk has a fork-join concurrency approach, but it is not focused on dataflow machines. Moreover, the Cilk run-time scheduler is in charge of mapping concurrency to hardware resources, in contrast to other approaches where constructs map directly to hardware processes or operations. Frigo [37] discusses multithreaded programming in Cilk with a simple Fibonacci example. An investigation of many-core architectures in [38] presents the architectural requirements necessary to enable multicore programmability with Cilk. A major advantage of Cilk is the separation of concerns between parallelism exposure and resource mapping, which permits a Cilk program to run, without rewriting, on any number of processors, including just one.

2.1.7 Google Go

Google Go [39] is an imperative concurrent programming language based on the C programming language and designed by Google. Go defines explicit communication channels between concurrent regions. Go uses a top-down approach which aims at general-purpose parallel machines with shared memory. In terms of implementation, Go uses an extended version of the GCC compiler platform (a new language front-end on top of the compiler and small changes in the middle-end), which is named GCCGo [40]. Concurrent regions are mapped onto goroutines, which are executed in parallel with others, including the caller (or creator). A group of goroutines is multiplexed onto multiple threads; execution control is moved between them by blocking them when sending or receiving messages over channels. Moreover, in the current compiler implementation, goroutines are mapped onto Pthreads, which makes them heavier than lightweight threads.

2.1.8 Pthreads

The Portable Operating System Interface (POSIX) Threads (Pthreads) API [41] targets shared-memory multiprocessor architectures (e.g. SMPs) and is a set of routines which are incorporated in imperative languages (e.g. C/C++) to expose concurrency. The Pthreads method results in creating high-level threads and requires the developer's explicit attention for thread creation and thread destruction. Workload partitioning and task mapping are explicitly specified by developers, at design time, in the routine definitions. Moreover, developers must be aware of race conditions and deadlocks in their code when multiple threads access shared data. To help with that, the API provides mutexes for mutual exclusion (only one thread enters the critical section) and semaphores (several threads may enter the critical section at the same time). Skillicorn et al. [19] present the existing parallel programming models and languages in further detail.
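To illustrate the mutex/semaphore distinction mentioned above, the following C sketch (a minimal, hypothetical example) uses a counting semaphore to let at most two worker threads into the critical section at once, whereas a mutex would admit only one.

#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>

#define NWORKERS 4

static sem_t slots;   /* counting semaphore: at most 2 threads in the section */

static void *worker(void *arg)
{
    long id = (long)arg;
    sem_wait(&slots);                     /* acquire one of the two slots */
    printf("thread %ld inside critical section\n", id);
    sem_post(&slots);                     /* release the slot             */
    return NULL;
}

int main(void)
{
    pthread_t t[NWORKERS];
    sem_init(&slots, 0, 2);               /* initial value 2: two slots   */

    for (long i = 0; i < NWORKERS; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    for (int i = 0; i < NWORKERS; i++)
        pthread_join(t[i], NULL);

    sem_destroy(&slots);
    return 0;
}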

2.1.9 MPI

MPI [42], along with Pthreads [41], probably represents the most widely used method for explicit concurrency programming. Kasim et al. [20] made a survey of existing parallel programming models in which the pros and cons of each model are dissected. The MPI approach provides message-passing communication routines among computational processes to model a parallel program running on a distributed-memory system. MPI has implementations in several well-known programming languages, such as C/C++/Fortran, Java, and C#. Similar to the Pthreads method, it is the developer's responsibility to expose workload partitioning and task mapping. However, ‘thread’ (i.e. process) management is done implicitly, whereby it is not necessary to code the creation, scheduling, or destruction of processes. MPI thus yields high-level threads, and developers bear a large responsibility for how they partition the program's tasks to achieve good performance.
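The following C sketch (minimal and illustrative only) shows this message-passing style: each process computes a partial sum over its own slice of the data, and the results are combined with a reduction.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's id     */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* number of processes   */

    /* Workload partitioning is explicit: each process sums its own slice. */
    long long local = 0, total = 0;
    for (long long i = rank; i < 1000000; i += size)
        local += i;

    /* Combine the partial sums on process 0. */
    MPI_Reduce(&local, &total, 1, MPI_LONG_LONG, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("total = %lld\n", total);

    MPI_Finalize();
    return 0;
}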

2.2 Relevant parallel architectures

The ultimate goal of architecture design is the never-ending quest for performance gain. Due to the physical constraints preventing frequency scaling, architecture designers are looking at parallelism to dispatch workloads over multiple computational hardware units. The work in this thesis, the SVP execution model, describes a hardware implementation called the Microgrid in a DRISC manner [43]. We have selected several multicore architecture designs that contrast with the SVP execution model introduced in Chapter 3. The UltraSPARC T1 and T2 processors target Internet, database and network servers and are also referred to as Niagara [44] and Niagara II [45]. The latter has 8 cores and supports only 64 threads in hardware. Each core has its own L1 cache, FPU and two ALUs, and all cores share a 4MB L2 cache. The major difference between this and a Microgrid is twofold: one difference is parametric and one fundamental. As described above, a DRISC-based Microgrid can support 100s of cores and 10s of thousands of threads. In programming, Niagara utilizes speculative threading, whereas DRISC manages user- or compiler-generated concurrency explicitly, with conservative (dataflow) execution rather than speculative execution.

Although the Niagara may appear to be the closest work to that described in Chapter 3, WaveScalar from the University of Washington [27], with its dataflow execution model, is conceptually closer. Both models capture a program's dependencies and the difference is in the execution model. WaveScalar uses the dataflow firing rule, whereas DRISC uses RISC instruction execution with blocking on register reads. Both models contextualize their synchronizing memory, e.g. via waves in WaveScalar or, in DRISC, as concurrent thread contexts in an extended register file. In WaveScalar, all synchronizations are producer-consumer, but in DRISC there is a bulk synchronization on the termination of a family that extends synchronization to the model's shared memory. Both models define locality statically and create and distribute concurrency dynamically, which are necessary features for efficient compilation and execution. Both models also tolerate high levels of latency and can exploit this in a distributed cache-coherent shared memory. This property can greatly reduce the pressure on the processor-memory interface. The conceptual similarities were strengthened recently. In prior work on WaveScalar, the sequential model of programming was managed by providing an ordering on loads and stores, but this proved to be too sequential and the model has now been extended with the concept of threads to manage memory concurrency [27].

The other work that is conceptually related to Microthreading (i.e. the SVP execution model) is the Data-Driven Multithreading Chip Multiprocessor (DDM-CMP) [46] proposed by the University of Cyprus. DDM-CMP is a coarse-grained dataflow model based on managing blocking threads from the cache interface of a regular processor. Each core in this design has its own L1 cache incorporating a Thread Synchronization Unit (TSU), which implements synchronization on the blocking threads before executing those threads as a conventional code segment. A system network connects the cores and a TSU network connects the TSUs. The TSU stores a dataflow graph in its Graph Memory and a number of thread queues, which maintain the status of threads and code blocks. The scheduling of complete threads is done at run-time using the dataflow firing rule and a thread is never issued before all of its operands are ready, at which point it can execute to completion because of the blocking semantics.

The Intel 48-core Single-chip Cloud Computer (SCC) [47] resembles a scalable cluster of computational cores. Intel Labs mentions the incorporation of technologies intended to scale multicore processors to 100 cores and beyond, such as an on-chip network, advanced power management technologies and support for ‘message-passing’. The Computer Systems Architecture group at the University of Amsterdam received, at the end of 2010, an Intel SCC platform on which to perform programmability experiments using SVP as the execution model.

2.3 Modeling concurrency in compilers

Concurrent execution models require an adapted toolchain to render multicore programmability feasible. The way concurrency is exposed in the concurrent source language is a first constraint on how the toolchain is designed. As mentioned in Section 2.1, some concurrent programming languages embed in their design new language constructs which explicitly expose the parallelism in an application's code. Whether a compiler can see concurrent regions depends on whether parallelism is exposed through a library or natively integrated into the compiler's internals. Boehm [48] shows that a thread-based model implemented as a library (the Pthreads model is used in the experiments) limits the use of compiler optimizations: such optimizations may modify the semantics of the code generated for an application, which can make them unusable. Therefore, problems do not come only from the exposure of concurrency, but also from the way tools are designed. OpenMP [30] uses a pragma approach which is implemented in a compiler with library support; this limits accurate diagnostics and error handling. Addison et al. [49] mention overheads caused by program design and by the sequences of library calls needed to support concurrent regions. Modeling concurrency in compilers becomes a problem that can no longer be ignored. Gray areas in a program's code must be reduced to avoid problems; this is a well-known motivation for concurrent programming language design. However, the stress must also be put on their compiler implementation, where gray areas in the compiler's internal representations must be avoided as well.

Edward Lee [50] warns of the danger of exposing a thread-based programming model to the vagaries of non-deterministic scheduling in a multiprocessor environment. His own carefully managed software developments revealed bugs that had lain dormant in the code for years until exposed to a multicore environment, where scheduling is no longer a deterministic interleaving. Adapting applications to a new concurrent programming language requires considerable changes to structure or restructure the code of these applications. Gabb et al. [18] discuss new ways of engineering and designing concurrent applications, where programming models and their tools are important pieces of the programmability puzzle. Berkeley University, in [8], summarizes the key concepts of the multicore programming menace. The emphasis is also placed on the difficulty of making compilers converge on multicore technology.

Many compilers are available for research (retargeting or reuse). Nonetheless, there are hardly any which consider concurrency assumptions in their infrastructure. Small compilers such as LCC [51], ACK [52] and the Tiny C compiler [53] are not reconfigurable frameworks and offer little flexibility to retarget to new architectures and to add new components. Their small size gives a false impression of how simple concurrency extensions would be. Larger compiler frameworks embed a large set of target architectures and a number of aggressive inter- and intra-procedural optimizations. Typically, GCC [54] and LLVM [55] offer a good platform for concurrency support through the presence of extension ports. Attempts to use these frameworks for parallel programming have already been made [56, 57]. However, the very large size of these compilers and their complexity require a considerable amount of time and manpower to achieve production quality. The flexibility and modularity of the compiler infrastructure should offer ways to model and embed concurrency assumptions. This thesis looks at this question of modeling native concurrency in a compiler infrastructure.

2.4 Requirements for a concurrent execution model

Based on the previous sections, the point of this section is to abstract the essence of previous work in parallel computing in order to derive requirements for a concurrent execution model. The ultimate goal is a gain in performance for parallel architectures and, consequently, other ways to save mainstream computing from physical restrictions (the processor and memory walls). Given this, the move to multicore would provide a solution for future computing systems, but programmability is still a challenge [16, 14]. Backward compatibility of existing applications is of high importance; industry does not want to spend money on redesigning and reengineering applications that are already developed and certified to work correctly. The free lunch is over, says Sutter in [13]; the concurrency revolution is here with the presence of multicore architectures [17]. New tools and new thinking from the software industry are necessary to cope with the programmability of these multicore architectures. The concurrency revolution is at all levels of the chain, from the architecture up to the programming language, including the development tools [8]. Developers want a reasonably easy way of developing for novel parallel computing systems, to avoid the nightmares of concurrent programming and design. A major issue is then the need to recompile the same application as soon as the target platform changes. Concurrency management is best kept opaque to developers and left to the programming paradigm; developers should only have to expose the concurrency. Consequently, the target platform must be able to perform scheduling and resource management efficiently according to the program's description.

Moreover, reducing the impact of introducing this concurrency on current computing systems would be appreciated. This would make the reuse of existing, proven technology possible. Therefore, the design of new tools would benefit from former or antecedent technology if this can be reused in a concurrent context. The main problem remains the programmability of multicore architectures. So far, only sporadic computing systems implement concurrency as a complete paradigm, as opposed to labeling code to tag concurrent areas in a program. The CSA group's proposition is to envision a parallel computing system where concurrency and its concerns are treated at different layers of the system.

We consider the following concepts necessary in an execution model for dealing with parallel architectures and parallel programming models. In addition, there is a need for a separation of concerns between resource availability and their mapping, to ease the pressure on developers. Finally, parallel code should scale automatically on different architecture configurations without recompilation. We need a concurrent execution model that can offer and facilitate the following concepts:

• scalability over a distributed multicore architecture,
• programmability of multicore architectures,
• binary code compatibility across an arbitrary number of cores (no need to tailor the code with recompilations),
• dynamic support for concurrency and resource management.

The following chapter presents the Self-Adaptive Virtual Processor concurrent execution model, called SVP, and its software and hardware implementations - research undertaken by the CSA group. This thread-based execution model captures the previously enunciated concepts and is the basis of the parallel computing system used in this thesis.

Chapter 3

Self-Adaptive Virtual Processor Execution Model and its Implementations

Moving targets may be fine in hunting; when it comes to generating a compiler this is rather inconvenient, to say the least.
Sven-Bodo Scholz, University of Hertfordshire, UK

Chapter 2 presented related work in parallel computing and compared it with our approach, which is described in this chapter. We describe an abstract concurrent execution model named the Self-Adaptive Virtual Processor Execution Model as the basis of our parallel computing system. The CSA group works to define and refine this model, which is a candidate for the exploitation of concurrent systems. This execution model is the underlying cement of the research undertaken in this thesis; the following chapters use it as the basis of the SVP concurrency paradigm.

The contents of this chapter are based on these publications:

• T. Bernard, K. Bousias, L. Guang, C. R. Jesshope, M. Lankamp, M. W. van Tol and L. Zhang – “A General Model of Concurrency and its Implementation as Many-Core Dynamic RISC Processors”, in Proceedings of the International Conference on Embedded Computer Systems: Architectures, MOdeling and Simulation (IC-SAMOS 2008), pp. 1–9, ISBN 9781424419852, July 2008.
• T.A.M. Bernard, C. Grelck, M.A. Hicks, C.R. Jesshope, R. Poss – “Resource-agnostic programming for many-core microgrids”, 4th Workshop on Highly Parallel Processing on a Chip, HPPC 2010, August 2010.


We discuss the SVP concurrency model and its concurrency properties, such as the exposure of parallelism, synchronization mechanisms and means of communication. Moreover, we present the derivations from this execution model, such as the architecture and language implementations. We also present other implementations of the SVP model, such as a high-level simulator (using Pthreads technology). At the end of this chapter, we provide an evaluation (with results) of the SVP parallel computing system, comprising a hardware implementation as Microgrids and a set of hand-coded scientific problems.

3.1 Our approach to multicore programming

We, i.e. the CSA group, work on the SVP execution model and its derivations in the hardware, language and compiler implementations. We tackle the “multicore programming menace” by following a three-fold approach illustrated in Figure 3.1: software, compiler and architecture. In this chapter, we present the separation of concerns with this approach, which allows us to tweak and distribute responsibilities at different levels in the toolchain.


Figure 3.1: Overview of the SVP parallel computing system, where the model is implemented as a language implementation, µTC, and as a hardware implementation, the Microgrid. The tools (in rounded containers) depicted here are: the compiler translating from µTC to the Microgrid and implementing the SVP paradigm, and the cycle-accurate simulator of the Microgrid architecture.

Our group's research goal is to provide a whole parallel computing system derived from the SVP execution model and to evaluate it with appropriate benchmarks and applications. The software implementation encodes the SVP properties which permit the exposure of explicit parallelism in the source code of an application; however, there is no need to deal with the distribution of concurrency and the management of resources. The architecture implementation takes care of solving these issues; it thus removes from developers the concerns related to concurrency and resource management. Furthermore, the SVP-aware compiler preserves and potentially optimizes the concurrent applications targeting the Microgrid platform.

3.2 Presentation of the SVP execution model

The SVP execution model [58] offers a uniform means of capturing dynamic concurrency. This is done with a parameterized create action that forks a named family of identical blocking threads, as illustrated in Figure 3.2. This may be applied recursively; hence, the model's bulk synchronization is by named family (the sync action in Figure 3.2). An SVP thread may contain a number of synchronizing objects which provide dependency constraints on the execution of that thread. These objects are set by one thread and read by its adjacent thread, creating dependencies between asynchronously executing threads. These dependency constraints guarantee locality and freedom from deadlock (see [59] for formal proofs).


Figure 3.2: Illustration of an SVP family creation event with multiple concurrent control flows of threads. The child family is created at the issue point in the parent thread. All threads of the child family have terminated at the completion point. Note that this figure does not illustrate the inter-thread communication.

3.2.1 Thread family and creation

All SVP threads are grouped in families, as shown in Figure 3.3. A hierarchy appears where a parent thread A is the creator of a family of child threads B. Each child thread is identical in the task it accomplishes. In essence, an SVP family is a parameterized group of statically homogeneous, but dynamically heterogeneous, threads that are created en masse. The creation of child threads is performed in the order of their index, i.e. B(1), then B(2), then B(3) in Figure 3.3. The threads' executions are constrained by the dependencies they may contain in the description of their tasks. We focus on these dependencies in Section 3.2.2.


Figure 3.3: An SVP family is a group of child threads created by a parent thread. The figure represents a parent A creating a family of 3 child threads B. A family of threads may be bound to a set of resources. This figure does not illustrate the relationships between threads.

An SVP thread creates a family of threads with a parameterized create action, as illustrated in Figure 3.2, in a fork-like manner at the issue point. From this point, two separate domains are present: the parent and the children. Each SVP thread has its own flow of control and is only constrained in its execution by the exposure of data dependencies on other threads' execution. The completion point of the family is reached with the sync action, in a join-like manner, in the parent thread. There, all child threads have terminated; all control flows converge at this point. Every SVP thread can create families of its own; this allows composition and therefore makes the model hierarchical. As illustrated in Figure 3.3, a family is characterized by the index sequence of the threads to be created, a reference to the thread body and a definition of unidirectional synchronizing communication channels from, to and within the family.

3.2.2 Inter-thread communication

The SVP model provides explicit ways to expose the communication that takes place during a program's execution. Inter-thread communication is accomplished through synchronized communication channels. Here, "synchronized" means that a channel uses a dataflow mechanism with blocking reads and non-blocking writes. More precisely, a synchronization event is linked to a write into a communication channel. The first read from this channel consumes this synchronization event and unblocks the execution of the thread waiting for the data to arrive through the channel. After the first read, subsequent reads from the same channel trigger no further synchronization events.


Figure 3.4: Example of SVP inter-thread communication with a global communication channel. The communication resembles a star-like pattern, where the channel is written in the parent thread. From there, the data is distributed over the channels towards the child threads, where it will be read. The figure exposes the reads and the writes on the channel. The blocking-read property of the SVP model suspends the thread's execution until the data becomes available in the channel. A global communication channel is read-only in the child thread.

In Figure 3.4, we look at the mechanism of the global synchronized communication channel. The parent thread A creates a family of threads and defines only one global channel in the parameters of the create action. This channel is unidirectional, from the parent thread to the child threads. More precisely, it is written only once by the parent and is read-only for the child threads. Each child thread gets a copy of the data propagated over this channel from the parent thread. Therefore, the pattern of this channel resembles a star with as many points as there are child threads.

Figure 3.5 introduces the shared synchronized communication channel. This channel has two distinct ends in the figure: an incoming end symbolized by read(y_s) and an outgoing end symbolized by write(y_s). Also defined in the parameters of the create action, the dependency chain starts in the parent thread, where the outgoing shared channel is written once and becomes available for reading on the incoming shared channel of the first indexed thread of the family. The shared channel is implemented with a producer-consumer mechanism; it only describes communication between adjacent threads (i.e. threads with indices i and i+1), between the parent and the first indexed thread, and between the last indexed thread and the parent. A thread writes once into its outgoing shared channel and the consecutive thread only reads from its incoming shared channel. Consequently, this communication channel forms a ring in which each thread is a node of the communication pattern. The first read from this channel is synchronized (unblocking the thread's execution); only the first write to it is taken into account.


Figure 3.5: Example of SVP inter-thread communication with a shared communication channel. The communication resembles a ring-like pattern. With this communication channel, write and read accesses are synchronized and work with a producer-consumer mechanism. The write into the shared channel is made by the producer thread. The data is sent over the channel to the consumer thread (the adjacent thread with an incremented index value i+1), where the first read on it is synchronized. The shared communication channel, by definition, is written once; it can be read multiple times.

These two ways of communication expose dependencies between the threads, as summarized in Figure 3.6. Because of this explicit exposure of the dependency chains, the SVP model is deadlock free and guarantees locality. The formal proofs of these properties are given in [59].

3.2.3 Thread and family interruption

An SVP family can be interrupted during its execution in two different ways. The first is the break action, similar to the break semantics of the sequential model. This action is issued within one of the threads of the family; when it occurs, the remaining threads to be executed are stopped. The break also permits the transfer of an object from the breaking thread back to the parent thread. This is necessary when handling exceptions in thread families.


Figure 3.6: Example of SVP inter-thread communication after the issue point of an SVP create event in Parent A, using the two different SVP synchronized communication channels, i.e. the global synchronizing object x_g and the shared synchronizing object y_s. Parent A generates a family B of three threads with two thread parameters: x_g and y_s. The inter-thread communication is exposed with accessor methods on the synchronizing objects, read(Var) and write(Var) (they are not function calls), where Var is a synchronizing object. Note that the last read(y_s) (between child B(3) and Parent A) is valid in Parent A only after synchronization (the completion point).

Furthermore, break is necessary where the sequential model allows unbounded loops (e.g. while-loops); in the SVP model, a family of threads can be allocated in blocks of a given size and terminated from within one of its threads by the break action. In principle, unbounded families are possible; however, this is constrained by a dynamic space/time trade-off.

The second way of interrupting thread families is the kill action, issued from outside the targeted family of threads. The kill action takes the family identifier and destroys the entire set of threads of the targeted family without preserving any of the temporary results. Typically, this action is useful when an operating system wants to stop specific thread families after an unexpected event occurs. The difference between the two actions is that break is performed within the family of threads and can return an object to the creating environment, whereas kill is only performed from outside the thread family and targets the entire family, not a single thread.


3.2.4 Place of computation

The SVP model provides a separation of concerns between the definition of a program's function and its resource mapping. The latter is achieved by binding a named resource, called a place, to a family on its creation, which can happen at any level in the family hierarchy. The model defines a binding of a unit of work to a set of resources with the create action. By definition, a unit of work is a family of threads. This family is then assigned to a set of resources on which it runs; this set of resources is specified by place objects given in the parameters of the create action. Places are opaque and implementation-defined. Depending on the implementation, a place can be a virtual or a physical resource and can have various properties, allowing threads with specialized code to be run on heterogeneous systems via a common mechanism. Regardless of the implementation, there is a special class of place that must be supported: the exclusive place. This class has the property that, of all families created at that place, only one will run at a time. Exclusive places provide synchronization between unrelated threads and add non-determinism to the execution model. Typically, they will be used by system code, e.g. in sharing or allocating resources.
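To make the role of exclusive places more concrete, the following minimal sketch uses the µTC syntax introduced later in Section 3.4. It is an illustration only: the identifier excl is assumed to name an exclusive place obtained from system code, and the function names are invented for the example. Unrelated threads serialize access to a shared counter by creating single-thread families at the same exclusive place.

/* Hypothetical sketch: 'excl' is assumed to denote an exclusive place
   obtained from system code; the create syntax is that of Section 3.4. */
thread void bump(long* counter) {
    /* Only one family runs at the exclusive place at a time, so this
       read-modify-write is not interleaved with other bump() families. */
    *counter = *counter + 1;
}

thread void worker(long* counter) {
    family fid;
    create(fid; excl; 0; 1; 1;;) bump(counter);  /* single-thread family at place 'excl' */
    sync(fid);
}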

3.2.5 Memory consistency model

The SVP model uses a relaxed memory consistency model that is managed by create and sync events and possibly by writes to synchronizing objects. It provides a single, flat address space with a restricted consistency model, such as Location Consistency (LC) [60]. A thread in a family cannot reliably see memory writes made by other threads, except for those that its parent could see at the point where the family was created, those that any thread of a subordinate family can see after that family has terminated, and those dependent on a synchronizing object used as a reference when that object is written. The model requires LC's acquire and release operations to be implemented on synchronization actions such as the creation and termination of threads and writing to and reading from synchronizing objects. There is no way for threads to explicitly synchronize access to any memory location with another, unrelated thread.

Informally, the model defines memory consistency as follows: at any time, each thread has a consistent view on a subsection of memory, such that reads and writes to this view are well defined as long as that thread is the only one writing to that location between synchronization events. The consistency view is shared between threads on the synchronization events: create, sync, reading/writing global/shared communication channels and creating on an exclusive place. This means, specifically, that:

• for references passed through a global channel, threads in a family can only see and use what the parent thread could see at the point of the create, as long as the parent does not change the memory before the family terminates.

34 Chapter 3. SVP Execution Model and its Implementations

• the parent thread cannot see the changes made by the child threads until it has synchronized on the family's termination;
• a thread cannot see the writes made by a previous thread in the family, except for writes to objects which are sent via shared channels;
• in order to share data between unrelated threads, all threads must create a family on the same exclusive place to synchronize consistent access to a shared location.

Specifically, this consistency model makes no guarantee that a thread sees writes made by an unrelated thread at some point in time. This relieves the memory implementation from ensuring global consistency, allowing more experimentation and optimization in implementations.
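As an illustration of these rules, the following minimal sketch (again using the µTC syntax of Section 3.4; the array and function names are assumptions made for this example) shows a parent that may only rely on its children's memory writes after it has synchronized on the family.

/* Hypothetical sketch of the consistency rules: 'fill' and 'data' are
   names invented for this example. */
thread void fill(double* data) {
    index i;
    data[i] = 2.0 * i;       /* memory write made by a child thread */
}

thread void main(void) {
    double data[1000];
    family fid;
    create(fid;;0;1000;1;;) fill(data);
    /* Between create and sync, the parent cannot reliably see the
       children's writes to 'data' and must not modify that memory. */
    sync(fid);
    /* Only after sync are the children's writes guaranteed to be
       visible to the parent. */
    double first = data[0];
}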

3.2.6 Concurrency tree

Every SVP thread can create families of its own; this makes the model hierarchical. Therefore, an SVP program is represented by a hierarchy of families. This hierarchy resembles a tree of thread families and their organization, and consequently reflects the concurrency present in that SVP program. This representation is called a concurrency tree, where each node is a thread which creates a family of threads, called a creating thread. A leaf of the concurrency tree represents a non-creating thread, or leaf thread. Consequently, an SVP program is composed of families of creating threads and families of non-creating threads.


Figure 3.7: Representation of a program's concurrency via a concurrency tree. Each circle is a thread; circles positioned at the same level and with the same direct creator belong to the same thread family. The concurrency tree depicts all the thread families of an SVP program, including creating threads (i.e. nodes in the tree) and non-creating threads (i.e. leaves in the tree). Furthermore, the tree contains the dependency chains between the threads of a family and with their ascendant (i.e. the parent thread).

Figure 3.7 is an example of a program's concurrency tree. On the left-hand side, nodes (including leaves) are shown with their single ascendant and their descendants (nodes only). We name the ascendant of a family of threads the parent thread. In this figure, each circle is a thread; circles positioned at the same level and with the same direct parent belong to the same thread family. A parent thread is therefore a creating thread. Moreover, a thread A can create a family of threads B; these threads B are named the children of thread A. On the right-hand side of Figure 3.7, we focus on one thread family, which comprises the children, seen as leaf threads in the concurrency tree, and the parent thread. The concurrency tree contains information on the dependencies between parent and children, and between threads of the same family. These dependency chains are the global and shared communication channels.

Relevant information related to the program's concurrency can be extracted from the tree. For instance, the depth of a program's concurrency tree, calculated from the first parent at the top of the tree down to the lowest of the leaves, reflects the level of nesting of families, which can be a problem for resource mapping and can potentially cause resource starvation when too many thread families are nested. Other information related to the program's concurrency, such as the degree of creating and non-creating families, is useful to developers when exploring the concurrency patterns of a program.
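To relate the concurrency tree to code, the following minimal sketch (written in the µTC syntax of Section 3.4; the function names and bounds are invented for the example) shows a creating thread whose children are themselves creating threads, producing a two-level concurrency tree.

/* Hypothetical sketch: 'inner' threads are leaves of the concurrency
   tree, 'outer' threads are creating threads (internal nodes). */
thread void inner(double* row) {
    index j;
    row[j] = 0.0;                              /* leaf thread: creates no family */
}

thread void outer(double* m) {
    index i;
    family fid;
    create(fid;;0;100;1;;) inner(m + i * 100); /* each outer thread creates a child family */
    sync(fid);
}

thread void main(void) {
    double m[100 * 100];
    family fid;
    create(fid;;0;100;1;;) outer(m);           /* root of the concurrency tree */
    sync(fid);
}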

3.3 Hardware implementation: Microgrid

A Microgrid is a chip multiprocessor using microthreaded processors (also referred to as SVP cores) that implement the SVP model directly in hardware [61]. This section explains both the chip architecture and the processor architecture, which is based on a DRISC processor [43] and executes an extended standard instruction set. It describes the architectural features of the SVP multicore implementation only as far as necessary to understand the constraints on our compiler development.

3.3.1 The Microgrid: extension of an existing ISA

The SVP core implementation extends an existing ISA with extra instructions capturing the model properties. These instructions are shown in Table 3.1. In addition, the SVP core implements registers as i-structures [62], which can suspend and reschedule any number of hardware-supported threads (currently 256 per core); further implementation details are presented in [61]. Parameterized thread creation is implemented with the ALLOCATE and SET-like instructions that acquire and set a family table entry (i.e. an on-chip memory structure that holds information for thread management at the family level), followed by a CREATE instruction that reads the family table entry and terminates asynchronously by setting a register (the family's SYNC). As a side-effect, it iterates the creation of all threads defined by the family table entry (subject, e.g., to the maximum number of executing threads per core at any time).


This creation is constrained by the resources available or as specified in the family table. A minimum resource set is one thread table entry (i.e. similar to the family table, but at the level of threads) and one register context on a single core, which yields sequential execution. However, a created family of threads can be distributed, as far as resources are available, over a number of cores for throughput and over a number of threads per core to support latency tolerance.

allocate   gets a family entry in memory and sets the context and the place where the family will be run.
create     action which generates a family of threads.
swch       marks the use of long-latency-operations' results.
end        marks the end of the thread code.
break      within a thread's code, terminates the current family context and returns a value to the parent.
kill       from outside the targeted family, terminates the current family context.
setstart   sets the starting bound of the family.
setlimit   sets the limit bound of the family.
setstep    sets the step of the family.
setblock   sets the maximum number of threads per core.
setbreak   sets an object in the parent's context for collecting the family's breaking value.

Table 3.1: List of SVP instructions which can be added to an existing ISA.


3.3.2 Processor overview

A microthreaded processor (i.e. an SVP core) uses an in-order issue pipeline with both in-order and out-of-order completion of instructions. The core principle of the processor is to avoid stalls and speculative execution, in order to maximize utilization of the pipeline and energy efficiency, respectively. Cache misses, family and thread creation, synchronization and FPU operations are all implemented by issuing the operation without keeping track of outstanding dependencies in the pipeline, and continuing to execute instructions from the same thread. Any instruction that attempts to use the result of an operation which has not yet completed has its thread suspended on the target register; the thread is woken up when that register is written. The register file is modified by adding state bits to each register and logic for handling these states on reads and writes. This ability to execute many threads in a single pipeline without stalling gives a microthreaded processor the ability to hide large amounts of latency. To support this number of threads, the processor has data structures to store and manage the families and threads and a relatively large register file to hold all of the threads' contexts. Dynamic allocation of register contexts to families allows the thread context size to be tailored to its code, optimizing register file usage.

Microthreaded processors employ the following data structures to manage thread families and threads:

• family table, an on-chip memory structure that holds information for thread family management.
• thread table, an on-chip memory structure that holds information for thread management.
• active list, which gathers the threads that are currently active in the pipeline of an SVP core.
• waiting list, which collects the threads that are blocked until the data they wait for is delivered.
• ready list, which lists the threads that are ready to be executed in the pipeline and wait to be allocated.

3.3.3 Microthreaded pipeline

The pipeline in an SVP core is a simple in-order issue RISC pipeline with modifications. When the Read stage determines that one or more of the operands of an instruction are not available, it will change the instruction to a no-op that writes the thread's ID back to that register and marks it Waiting. The Fetch stage of the pipeline uses the processor's Active List to get the next thread to execute. It will read the Thread Table, Family Table and I-Cache and buffer the information while it executes that thread. A switch to another thread is required when:


• the current thread reaches the end of the cache-line,

• the thread executes a jump or branch,

• an instruction in the thread uses a register that is not Full, or

• an instruction has been annotated with a SWCH or END.

In the first two cases, the thread proceeds to a cache-line that may not be present in the I-Cache and, to avoid stalling the pipeline, execution switches to the next thread on the Active List. At the end of the pipeline, the thread is put onto the Ready List to fetch its new cache-line. In the third case, the thread needs data that is not present and cannot continue. The thread will suspend on the register and be put on the Ready List when the register is written. The SWCH instruction is used after an operation to mark the use of a previous long-latency operation's result. If that result is not yet ready, the thread's execution is blocked until the data becomes available. In the meantime, another thread's instructions are issued to the pipeline.

3.3.4 Family and thread management

SVP threads have different states, as illustrated in Figure 3.8. A thread's life starts when it is allocated; its state goes from empty to waiting. The thread then moves to its ready state as soon as its instruction data has been loaded. After that, the thread becomes running when it enters the pipeline. From this state, several cases can occur. First, the thread executes its computations completely and ends in the unused state, where it and its context are deallocated and returned to empty. However, if data is unavailable while running, the thread is suspended until the data is written, at which point it moves to waiting and is subsequently reactivated into its ready state. Another case occurs when the data is available but the ongoing operation carries a thread-switch tag (SWCH); the thread is then switched out and placed back in the ready state. Both cases trigger a thread switch, which activates another thread from the ready state and puts it in the running state, if one is available.

When a thread creates a family, an entry in the Family Table is allocated to store the family's state. This state includes the number of threads created so far for the family, parameters for creating new threads and a link to the parent thread. An independent hardware process is responsible for creating the threads of created families, up to a program-specified upper bound. All non-register state for the threads created on a processor is stored in the Thread Table. Threads start as soon as they are created and terminate after executing an instruction annotated with the special END annotation. When a thread terminates, its context is reused for the next thread in the family, unless all threads in the family have been created. All resources related to a family are released when all threads in that family have terminated and no more references to the family exist.

Threads can be in different states besides running, and threads in one state may need to be moved to another state en masse.


Figure 3.8: Different states of an SVP thread.

To do this efficiently, these states are managed with linked lists through a field in the Thread Table, such that moving threads from one state to another just means appending one linked list to another. This mechanism is used in the following cases:

• When a register is written that has several threads waiting on it, these threads are woken up by taking the head and tail pointer from the register and appending that list to the processor's list of threads that should be executed, the Ready List.
• Threads that are waiting on an I-Cache line for their instructions are linked in a list with the head pointer in the cache-line entry itself. This allows the I-Cache, when the line has been read, to immediately make all threads that were waiting on it available to the pipeline via the Active List.
• All allocated threads are also maintained in a per-family membership list (requiring a second link field per thread). This list allows the processor to append all threads of a family, regardless of their state, to the empty list when the family is forcibly terminated.

3.3.5 Thread context and semantics of objects

Each SVP thread has its own context of registers (register classes are listed in Table 3.2 and explained in Section 3.3.6) and private/shared memory areas. Inside a thread context, two types of objects are present: conventional objects, which are not synchronized, and synchronizing objects, which implement the SVP model's communication channels, i.e. the shared and global channels. The model's shared objects are identified in the code as a subset of a thread's register context, while register file address decoding transparently implements inter-context dependencies. This works in a similar manner to windowing in the SPARC architecture [63], except that the windows are not fixed, can be written to concurrently and are automatically distributed between the adjacent register files in a cluster of cores when a create instruction is distributed. In particular, the ALLOCATE instruction identifies the cluster of cores that a CREATE instruction will be distributed to, and this requires an implementation-dependent place type. Section 3.3.7 explains the layout of the register window and its mapping onto the architectural registers.

3.3.6 Thread communication in the Microgrid

SVP communication is implemented with unidirectional synchronized communication channels, as explained in Section 3.2.2. More precisely, these channels utilize synchronizing objects mapped to registers with states (implemented as i-structures [62]). One of the two ways of inter-thread communication is implemented by reading from and writing to shared synchronizing objects. This occurs between the parent and the first child thread, between adjacent threads created in a family (supporting only linear dependency chains), and between the last child thread and the parent, as illustrated in Figure 3.5. This linearity ensures the model's freedom from deadlock and also provides an abstraction of locality of communication between threads in a family.

In the Microgrid of SVP cores, for every shared synchronization required between threads, the compiler must allocate two registers in a thread's context: one which is read-only and waits for the previous thread's write, and one which is write-once and signals to the successor thread in the family. The two registers are called the dependent and shared registers respectively (for example, in Figure 3.6, read(y_s) and write(y_s) are respectively mapped to dependent and shared). These names are derived from a partitioning of the thread's context into four register classes, as presented in Table 3.2. This partition is specified by an assembly directive (i.e. ALLOCATE, see Table 3.1) at the beginning of the thread code. The data supplied is used on family creation to define a number of offsets into a thread's register context, which are then used to implement any communication required between distributed contexts, i.e. between a shared register on one core and a dependent register on another. This mechanism gives us a distributed-shared register file for all threads in a family, where thread-to-thread communication is restricted to the overlapping windows of one thread's shareds with the subsequent thread's dependents.

The other SVP communication channel is implemented over the global registers, which are read-only (in the child thread context). These registers are initialized during thread creation from the parent's context utilizing a 'pull' mechanism: during thread creation, thread initialization takes ("pulls") the created family's globals and shareds directly from the context of the parent thread. Figure 3.4 illustrates this with a single global synchronized communication channel in a family of 3 threads.

Register classes:
locals      These registers are read/write to the local thread only.
globals     These registers are written by the creating thread and are read-only to all threads within a family.
shareds     These registers are for communication between adjacent threads in the family; they are written once by the local thread and may be read locally or by the successor thread in the family. The synchronization event is on the first write.
dependents  These registers are read-only and access the previous thread's shared registers. The synchronization event is on the first read.

Table 3.2: List of SVP register classes.

As the overhead of thread creation and scheduling in our implementation is small (from 2 to 10 cycles for the former, and on every clock cycle for the latter), threads may be very fine-grained. It is not uncommon for a thread definition to comprise fewer than ten instructions. Thus, the normal approach of allocating a stack per thread from the heap is not an appropriate solution here. In SVP, where possible, every function is created as a thread of control, and allocating a stack from the heap would take as long as the thread's execution itself (e.g. hundreds of cycles). Section 3.3.7 explains the register layout and its mapping onto the architectural registers.

3.3.7 Register window and register mapping

The previous section partly mentioned the initialization of the thread context at thread creation. In essence, an SVP register window comprises 4 register classes: global, local, dependent and shared, as illustrated in Figure 3.9. Each virtual register window is a representation of the context of an SVP thread; this register window is mapped onto architectural registers as Figure 3.10 illustrates.


Figure 3.9: Layout of a virtual register window for an SVP thread. Four register classes ordered from the base to the top: Global, Dependent, Local, Shared.

Figure 3.9 depicts the structure of a virtual register window of size N (e.g. for the Alpha instruction set, used as the experimental target platform: 32 integer registers and 32 floating-point registers). The global region starts at the base of the register window and is followed respectively by the dependent region, the local region and the shared region. For the first thread, the first two regions (i.e. global and dependent) are initialized at thread creation from the creating environment (i.e. the parent's context). For the following threads, the global region is initialized from the creating environment, whereas the dependent region is initialized from the shared region of the previous thread in the index range. Therefore, the last region (i.e. shared) is bound to the dependent region of the adjacent thread (the following thread in the index range, or back to the parent thread if the current thread is the last indexed thread of the family). The register window is called virtual since the context is specific to the thread it is associated with. When the mapping is done by the architecture, several cases may occur:

• The entire family is executed on the same SVP core. The globals of the same family then map to the same architectural registers. The shareds and dependents of adjacent threads are mapped onto the same architectural register. The locals have their own segment of architectural hard registers.

• The family is spread over a cluster of SVP cores. The globals used by the threads located on the same SVP core are mapped onto the same architectural registers. When a {dependent, shared} pair of two adjacent threads is executed on two distinct SVP cores, the Microgrid takes care of distributing the data over the SVP cores. Implementation details are described in [61].

It is important to note that register mapping is completely opaque to the description of the program (i.e. the machine code). In other words, the way a family is executed does not need to be known before execution time. The Microgrid architecture allocates the distributed contexts (for the threads of the same family) according to the scheduling of the threads over the SVP cores. Figure 3.10 illustrates the case of the creation of a family of 3 threads located at the same place (i.e. the same SVP core). At initialization, the parent's context is required to store in its locals the parameters of the thread function being created. They need to be stored in a contiguous frame; they are passed via registers if there is room for all of them. Otherwise, the thread local storage (similar to a conventional stack, but with concurrency management) is used. To summarize, for threads with few parameters, these can be passed using shared or global registers, but this requires the compiler to make a register allocation within the necessarily limited context of registers. While this is generally possible at the lower levels of the concurrency tree, at higher levels the combined number of local objects and parameters to a thread may still require a stack. Given the limited number of hardware threads per chip (256 for each core on a chip), we are able to provide a fixed partition of the virtual address space to act as local memory for a thread (thread local storage). Thus, when required, this allows us to spill registers and pass parameters in a conventional manner. The overhead for this is much smaller, since it only requires a single instruction to initialize a register with a stack pointer, plus normal stack management overhead.


Figure 3.10: Mapping of hardware registers to architectural registers: example sharing between a parent thread A and a child family B of 3 threads, where all threads are created on the same core. The child family B has been created with 2 global thread parameters, 3 shared thread parameters and 4 locals. The offset for the first local register (available for exclusive use by its thread) in the virtual architectural register window of the parent is 10, whereas it is 5 for the child thread. In each thread, both the registers shared with the previous sibling (D, i.e. dependent) and the next sibling (S, i.e. shared) are visible.

3.4 Software implementation: µTC language

The µTC language is introduced in [64]; this section describes the language implementation of the SVP general abstract concurrent execution model as a C-based language. µTC modifies the C programming language with SVP extensions. SVP threads run programs defined by thread functions, expressed with a syntax similar to that of C functions. In every thread, an index variable is automatically predefined according to the index numbering in the family. A create construct triggers SVP family creation. Synchronizing SVP channels are exposed as thread function parameters in the C language (either shared or global) and behave like variables with special read and write semantics. Concurrent family creation is the default method of composing functions and loops in µTC programs.

The extra constructs in µTC are language primitives and can be compared to other approaches [20] where concurrency is available through external libraries or native primitives. µTC uses primitives because concurrency is assumed to be the normal method of program composition. Library interfaces capture a more coarse-grained approach and can induce a significant overhead. Boehm [48] describes the limits and dangers of library-based concurrency that we have tried to avoid in the SVP model.


As required by SVP, µTC enables resource-agnostic concurrent programs. The developer is only responsible for exposing the concurrency and revealing the dependencies. Mapping and scheduling are resolved by the hardware, and the program needs to be compiled only once to be executable on any Microgrid configuration.

3.4.1 The principles of the µTC language

The µTC abstract machine is defined through the following SVP extensions to the C abstract machine:

• thread functions, elementary programs for concurrent threads, defined by syntax similar to C functions;
• thread families of concurrent threads running the same thread function, each being distinguished by thread indices, integers automatically predefined to a different value when each thread starts;
• thread creation, an elementary action of the abstract machine causing the creation of a thread family;
• synchronizing objects, exposed as data objects in the C language with special read and write semantics; these are further separated into shared parameters, shared between adjacent threads in a family, global parameters, shared by all threads in a family, and termination synchronizers, which cause a reading thread to wait for the termination of an asynchronous operation;
• asynchronous termination, another elementary action of the abstract machine, causing the termination of an identified thread family from within one of its threads with the break action.

3.4.2 Introduction to µTC programming

µTC is a system-level programming language that exposes concurrency explicitly in the definition of tasks. The language embeds in its constructs the SVP properties, such as thread family creation with the create construct, the thread family synchronization barrier, etc. The developer is aware of the concurrency of the problem she/he defines; she/he only exposes the concurrency of a problem per se, without having to manage the scheduling or the mapping of the families. Figure 3.11 illustrates an example µTC program. The main construct shown is the reflection of SVP's create action. A thread family is identified by family_id and its range is bounded by the values of start, limit and step, as in common for-loops. The other parameters are the place identifier, for the SVP place at which the family should be created, and the blocking factor which, if used, limits the number of threads created per core by the run-time system for the family.


/* thread function definition: 1 shared, 2 globals. */
thread void ddot(shared double res, double* x, double* y) {
    index i;
    res = x[i] * y[i] + res;
}

thread void main(void) {
    double a;
    double b[1000], c[1000];   /* input vectors, assumed to be initialized */
    family fid;
    create(fid;;0;1000;1;;) ddot(a = 0, b, c);
    sync(fid);
}

Figure 3.11: A µTC example of a simplified reduction from the BLAS library. It represents a reduction computed by a family of 1000 threads indexed with ‘i’, declared with the µTC index construct. The read of ‘res’ synchronizes with the previous thread; the two memory loads can be performed concurrently before synchronization; the read of ‘res’ in the next thread completes after the current thread completes its write to ‘res’. After synchronization (i.e. after the µTC sync construct), ‘a’ contains the result of ddot in the parent.
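For comparison, a minimal sketch of the sequential C loop that the ddot family replaces (an illustration only; the function name is invented and the arrays are assumed to be initialized elsewhere):

/* Sequential C equivalent of the ddot family in Figure 3.11. */
double ddot_seq(const double* x, const double* y, int n) {
    double res = 0.0;
    for (int i = 0; i < n; i++) {
        res = x[i] * y[i] + res;   /* same dependency carried by the shared channel 'res' */
    }
    return res;
}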

3.4.3 New keywords

µTC modifies the C programming language with SVP extensions, listed in Table 3.3 and Table 3.4. Each function in µTC is a thread function and employs in its definition the thread keyword to distinguish itself from conventional C. In the thread definition, the parameters are SVP synchronized communication channels: global and shared channels. The use of the shared type differentiates one from the other. The bulk synchronization barrier is encoded in the sync construct, which must be within the same scope as the create it refers to, with the same family identifier.

thread   defines a thread function.
create   is the construct for thread family creation.
break    within a thread family, terminates the family and returns an object to the parent thread.
sync     barrier synchronization targeting a specific family.
kill     outside a family, terminates the threads of the targeted family.

Table 3.3: List of µTC constructs.

shared   specifies a shared synchronized communication channel.
index    specifies an index identifier for the threads of a family.
family   specifies a family identifier, unique per family.

Table 3.4: List of µTC types.

As SVP prescribes, it is possible to preempt the execution of threads via two distinct constructs. The break construct permits the preemption of threads from within one of the threads of the same family; the breaking thread can return an object to the parent environment as an exception-handling mechanism. The second way is the kill construct, which allows the external termination of all threads of the targeted family. Moreover, µTC provides an identification mechanism for each thread family: a specific identifier, declared before family creation using the family type. Within a thread body definition, it is possible to distinguish the indexed threads using the index construct.
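As a minimal sketch of a breakable family (an illustration only: the exact µTC expression syntax for returning a value with break is not shown in this chapter, so the 'break i;' form below, as well as the function names, are assumptions):

/* Hypothetical sketch: searching for the first index holding 'key'.
   'break i;' is an assumed syntax for terminating the family and
   returning the index to the parent through the break identifier
   given in the create construct. */
thread void find(long* v, long key) {
    index i;
    if (v[i] == key) {
        break i;                  /* assumed break-with-value syntax */
    }
}

thread void main(void) {
    long v[1000];                 /* assumed to be initialized */
    long result;
    family fid;
    create(fid;;0;1000;1;;result) find(v, 42);  /* 'result' collects the breaking value */
    sync(fid);
}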

3.4.4 Family creation

Family creation is a major property of the SVP model; it is encoded as the create construct in the µTC language. This main SVP create construct generates a parameterized group of statically homogeneous, but dynamically heterogeneous, threads encapsulated in a thread function fun(args...), as illustrated in Formula 3.1.

Figure 3.11 illustrates a µTC implementation of a program from the BLAS library. The program runs a simplified reduction in which the parent thread produces a family of threads. This example also shows how parallelism can be achieved through concurrent operations in the thread function ddot. It represents a reduction computed by a family of 1000 threads indexed with ‘i’, declared with the µTC index construct. The read of ‘res’ synchronizes with the previous thread; the two memory loads can be performed concurrently before synchronization; the read of ‘res’ in the next thread completes after the current thread completes its write to ‘res’. After synchronization (i.e. after the µTC sync construct), ‘a’ contains the result of ddot in the parent.

The main µTC construct is the implementation of the SVP create action, which generates a family of threads with the parameters specified in the construct:

create(family_id; place_id; start; limit; step; block; break_id) fun(args...).    (3.1)

At family creation, a family of threads is identified by a unique family identifier family id and shaped by the start, limit and step bounds, similar to a for-loop statement. It is also possible to define a breakable family; in that case an object break id must be defined, which will collect the result in the parent thread when one of the threads breaks and returns an object. The developer can also define an area of resources on which the family will be executed via the place id, as Section 3.2.4 describes. The create parameters are listed in Table 3.5.
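To illustrate the shaping parameters, the following sketch (our illustration, not from the thesis) contrasts a sequential C loop with a µTC family covering the same iteration space; ‘scale’ and the other names are arbitrary, and place, block and break id are left at their defaults.

void scale_seq(double *a, double *b, int n)
{
    for (int i = 0; i < n; i++)     /* start 0, limit n, step 1 */
        a[i] = 2.0 * b[i];
}

thread void scale(double *a, double *b)
{
    index i;                        /* replaces the loop counter */
    a[i] = 2.0 * b[i];
}

thread void scale_par(double *a, double *b, int n)
{
    family fid;
    create(fid;;0;n;1;;) scale(a, b);   /* family id; place; start; limit; step; block; break id */
    sync(fid);
}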

3.4.5 Object semantics in µTC

µTC has two distinct types of objects that can be distinguished by the location where they are defined: non-synchronizing and synchronizing objects.


family id: unique family identifier for the to-be-created family.
place id: place of SVP cores where the to-be-created family will run; the default is 0, i.e. the same resource that the parent uses.
start: start bound of the to-be-created family; the default value is 0.
limit: limit bound; the default value is 1.
step: step; the default is 1.
block: block bound, which corresponds to the maximum number of running SVP threads at any moment per SVP core; the default is 0, which is equivalent to the maximum number of running threads (per core) allocatable by the hardware.
break id: identifier which collects a value from the breaking child of the to-be-created family.

Table 3.5: Create parameters which set up the family definition.

The first type is simply declared within a thread function’s body. This type is named local; it is used to declare a regular internal function object. This object does not live outside the scope in which it is defined. Furthermore, no particular side-effects are encoded in the use of these objects. The second type is declared within the parameters of a thread function’s definition, as illustrated in Figure 3.12. These objects have special synchronized properties and are separated into global and shared objects.

thread void foo(shared double x, double y) {

}

Figure 3.12: Thread function definition: 1 shared x, 1 global y.

The synchronizing objects have special read-and-write accesses within the body of thread functions. The global synchronizing object is read-only within the scope of the thread function. Moreover, the shared synchronizing object maps, as defined in Section 3.2.2, onto a pair of objects {read, write}, respectively mapped onto a pair of {dependent, shared} registers in the corresponding assembly representation. By definition, a shared object cannot be written before it has been read. The incoming shared object (read from the dependent register) needs to be read first to consume its synchronization event; then the outgoing shared object (write to the shared register) can be written. Otherwise, the program’s execution will simply deadlock. Between the issue point (at a create construct) and the completion point (at a sync construct), within the same scope, the objects used for parameter passing to the to-be-created thread function are locked. This is due to architecture restrictions (cf. the ‘pull’ mechanism used for thread parameter initialization, in Section 3.3.6). Therefore, these objects are located in a gray area, delimited between these two points, where thread creation takes place.


{
    /* a and b are defined at this point */
    create(fid;;0;1000;1;;) foo(a,b);
    /* gray area, a and b must not be modified */
    sync(fid);
}

Figure 3.13: Gray area between the issue point (i.e. create) and the completion point (i.e. sync) of a thread family.

The paradigm has no means to know when the threads will be scheduled within this gray area; this is illustrated with the code sample in Figure 3.13. Consequently, any change to these objects within this gray area would provoke non-deterministic results in the family’s execution. At completion, the gray area ends and there are no longer locks on these objects.

3.4.6 Concurrency management

The developer exposes concurrent sections of a program and dependencies following the model’s constraints. There is no need to actually expose the schedule nor to map this concurrency onto resources, since the architecture will ensure this. This separation of concerns is an SVP property; it removes from developers this heavy burden, which would otherwise bind their implementation to specific architecture settings. Furthermore, another important SVP property embedded in µTC is resource-agnostic programming, where the configuration of the targeted architecture is not required at the time the program is being composed. The only thing a developer may be asked to do is to define an abstract place of computation where a thread family will run. Developers have constructs to manage the behavior of thread families. With the break construct, it becomes possible to reproduce WHILE loops from the sequential model, as sketched below. Furthermore, it is also possible to handle exceptions during thread execution. If an exception occurs, the break construct can return an object to the creating environment. The second construct is kill, which allows a family to be terminated if it behaves unexpectedly or runs beyond a defined limit.
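As an illustration of the break construct only (this sketch is ours, not taken from the thesis), a bounded search can emulate a WHILE-style loop: the family is created over a large index range and a thread terminates the whole family as soon as its condition is met. The names and the bound of 1000000 are arbitrary.

thread int find_first(double *data, double threshold)
{
    index i;
    if (data[i] > threshold)
        break(i);               /* terminates the family and returns i to the parent */
}

thread void search(double *data)
{
    family fid;
    int found = -1;             /* break id object collecting the breaking thread's value */
    create(fid;;0;1000000;1;;found) find_first(data, 0.5);
    sync(fid);
    /* if a thread broke, 'found' now holds its index */
}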

3.5 SVP system performance

This section is an initial evaluation of the SVP computing system comprising a target platform (i.e. the Microgrid simulator in Section 3.5.1) and a set of hand-compiled, highly optimized benchmarks (cf. Section 3.5.2). There is no SVP compiler involved in the experiments; in other words, the following evaluation is performed without a working SVP compiler.


In Chapter 7, Section 7.2 presents SVP system performance results obtained with the use of the SVP compiler. For further results on SVP system performance, the articles [61, 58, 65, 66, 67] are also available.

3.5.1 Target platform: Microgrid simulator

The evaluation performed in this section uses the SVP software simulator [68], which is a platform emulation of a Microgrid. The emulation captures state transitions down to the lowest level in the core pipelines. It also provides a fully parameterizable, cycle-accurate simulation of all hardware aspects of the Microgrid: core functional units, interconnects, network-on-chip, memory architecture, and external memory interface. We used parameters suitable for hardware that can be built using current technology [61, 65]. The simulation executes a unit of work (a family and any subordinate families) on a variable-sized cluster of cores connected in a ring network. The selection of cluster sizes has no impact on the performance of each cluster other than the number of cores. Figure 3.14 provides a schema of this configuration.

Figure 3.14: Microgrid tile of 4 clusters of 4 cores (P0..P3), sharing an FPU between 2 cores and an L2 cache between 4. It illustrates all on-chip networks: the local rings between cores and L2 caches, and the routers (R) for the chip-wide delegation network.

3.5.2 Methodology: benchmarks and evaluation

Since the SVP compiler is not utilized in this evaluation, the results presented are obtained from small hand-compiled kernels, which nevertheless are representative of the computation found in many large-scale applications. Our test kernels include several Livermore kernels [69], both independent and dependent ones, a Microgrid assembly version of the sine function using a Taylor series expansion, which is small and very sequential, and the fast Fourier transform.


The motivation for this work is both the validation of the model implementation and an initial evaluation of its performance, scalability and latency tolerance. The COMA memory system described in [70] is not yet incorporated into this emulation, and the results presented here use two extremes of memory implementation. The first is sequential and the second is an idealized parallel memory that is capable of handling multiple requests in parallel. Both have parameterized latency, and the latter a parameterized number of banks. Unless otherwise indicated, each core in the cluster has a family table with 64 entries, a thread table of 256 entries and a register file with 1024 registers. L1 I- and D-cache sizes were set to 1 KB per core for these results, except where indicated for comparison.

3.5.3 Data-parallel code

The first result set is for the data-parallel Livermore kernels 1 (in Figure 3.15(a)) and 7 (in Figure 3.15(b)) executed over a range of cores. Compiling such loops for the Microgrid is straightforward: the loop is captured by a single create and the thread code captures the loop body, minus the loop-control instructions (increment and branch). This code contains only local and global registers. The results record the execution time (in cycles) required for a problem size of 64K iterations. Figure 3.15 shows the speedup achieved for those kernels relative to their execution on a single processor. As can be seen, the speedup scales almost linearly, with a deviation from the ideal of 20-40% at 128 cores due to the start-up involved. These are the costs of distributing thread parameters between cores and of synchronizing between the cores on termination. Kernel 7 has more memory references than kernel 1 and its performance saturates earlier in the scaling. Both kernels were re-executed with a 32 KByte cache to show the effect of D-cache size on performance. As can be seen, this improves performance, but not significantly, which shows the latency tolerance of the Microgrid. Neither experiment is limited by memory bandwidth, by design. The stall rate shown in Figure 3.15 measures the percentage of time when the pipeline is stalled due to hazards. However, it does not include the time when the pipeline is completely empty. For high instructions per cycle (IPC), this gives us an indication of the utilization of the pipeline, and since the model uses simple in-order pipelines without multiple instruction issue and branch prediction, this metric is very important. It shows that even without those features, the architecture is able to use the pipeline very efficiently by interleaving many threads in the pipeline. We do not count completely idle cycles, as these can be detected and the core powered down. For low IPC, therefore, this measures energy efficiency.



Figure 3.15: Speedup and stall rate (%) against number of cores in a cluster for Liver- more kernels 1 (top) and 7 (bottom). Two sets of results are presented using a 1 KByte and 32 KByte D-cache. The simulation uses a random banked memory.


3.5.4 Families with inter-thread dependencies

This section illustrates how the model captures and exploits dependencies. The code executed here contains one or more thread-to-thread dependencies that require shared registers, which constrain the execution sequence. The test kernels are a Microgrid assembly version of the sine function, implemented as a Taylor expansion of 9 terms (this was implemented with a single memory bank), in Figure 3.16, and Livermore kernel 3, an inner product of 64K iterations, in Figure 3.17. As with the previous experiments, the same Microgrid binary code was executed on a variable-sized cluster. Figure 3.16 shows the speedup against the number of cores for the sine function. Threads are executed concurrently but dependencies between threads constrain the execution. Concurrency is still exploited locally in tolerating memory latency, and do-across schedules exploit a small amount of ILP in the thread code, where speedup is proportional to the number of independent instructions (before or after a dependency). When compared to sequential code, the Microgrid code is some 40% faster on a single processor and this increases to 70% faster on a cluster with 4 cores. This is due to the absence of any loop-control overheads and high pipeline utilization from the few local threads.

Figure 3.16: Speedup of sine function.

Loop-based code generally captures three classes of computation: data-parallel code, where the iterations are independent; recurrence relations, like the sine function; and reductions. Because of the commutative and associative properties of the latter, these operations can be implemented in any order and concurrency may be exploited. Theoretically, one can obtain O(N) speedup and a latency of O(log2 N) for binary operations. In the SVP model, several partial reductions can be performed in parallel, with a final reduction giving the required result. The features of the model that allow this are the ability to create nested families of threads, the ability to specify where a family will be executed (i.e. a place), and the abstract manner in which this is all captured.


Figure 3.17: Speedup of Livermore kernel 3, a reduction.

For an O(P) speedup, the Microgrid assembly code can use the number of cores in a cluster and create a single thread per core that implements a distributed reduction over the results of subordinate families executed locally, performing P partial reductions, one per processor. When this code is executed, near-linear speedup is achieved on up to 128 cores for 64K iterations, as illustrated in Figure 3.17. It should be noted that this code is also schedule-independent and the only dynamic information required is the number of cores in the default cluster. A sketch of this pattern in µTC is given below.
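The following µTC sketch is our own illustration of the pattern described above, not the hand-coded Microgrid assembly actually benchmarked; it assumes P divides the problem size N, and all names are arbitrary.

thread void leaf(shared double s, double *x)
{
    index i;
    s = s + x[i];                       /* local dependent chain within one core */
}

thread void partial(shared double sum, double *x, int chunk)
{
    index p;                            /* one thread per core */
    family f;
    double local = 0.0;
    create(f;;p * chunk;(p + 1) * chunk;1;;) leaf(local, x);   /* subordinate family */
    sync(f);
    sum = sum + local;                  /* distributed reduction over the partial sums */
}

thread void reduce(double *x, int N, int P)
{
    family fid;
    double total = 0.0;
    create(fid;;0;P;1;;) partial(total, x, N / P);
    sync(fid);
    /* 'total' holds the full reduction */
}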

3.5.5 FFT Performance

Without an SVP compiler, it is difficult to provide performance comparisons for large numbers of benchmarks because each simulation requires Microgrid assembly code to be produced by hand. However, in order for performance comparisons to be made against other published results, we have undertaken extensive simulation of the performance of a very common algorithm, namely the FFT. Figure 3.18 shows these results. Again the same binary code was executed on a range of clusters from 1 to 64 cores. The results are presented in MFLOPS assuming a 1.5 GHz core and one FPU per core. All FPU operations are assumed to be pipelined, with a latency of 3 cycles for additions and multiplications and 8 cycles for divisions. Results are given for FFTs of length 2^n, where n ranges from 8 to 20 in steps of 4.


Figure 3.18: Performance of FFT in MFLOPS against cluster size for FFTs of length 2^n.

Results show a maximum performance of 0.5 GFLOPS per processor, with linear scaling out to 64 cores for transform lengths of 64K and above. This maximum performance is limited by the number of FP operations issued per cycle, a static property of the code. The smallest transform of 256 points gives a maximum performance of between 5 and 10 GFLOPS and saturates at 12 cores, or about 10 threads per processor.


3.6 Discussion and conclusion

This chapter has presented the properties of the SVP execution model and its implementations in a parallel programming language (i.e. µTC) and in a chip multiprocessor architecture (i.e. the Microgrid of SVP cores). The details of both the software and the hardware implementation have been revealed in order to understand the SVP computing system; the complexity of these implementations is stressed to convey the effort required to provide the efficient SVP compiler discussed in the following chapters. The CSA group also has an interest in high-level simulations of the SVP model on conventional processors and investigates this with a pThreads implementation [71]. The last part of the chapter has presented an overview of potential performance with hand-coded scientific problems on the SVP computing system (comprising the Microgrid target simulator). The lack of a compiler limits the range of scientific problems that can be used for evaluation. The µTC parallel programming language presents an easier way to program these benchmarks, which will then be compiled by the SVP compiler. While hand-coding benchmarks for an unconventional processor architecture such as the Microgrid is possible, compiling programs automatically from a high-level concurrency-oriented language is often much more difficult. The next part of the thesis shows that an extended imperative-language compiler can successfully target the Microgrid architecture. This capability provides further evidence that the Microgrid architecture may be used as the foundation for a general-purpose multicore processor, with SVP as the underlying paradigm.

Acknowledgments

The SVP model and its implementation are a group effort and the author would like to mention all members of the Computer Systems Architecture group for their input to this work: Mike Lankamp, Michiel W. van Tol, Li Zhang, Konstantinos Bousias, Raphael Poss, Michael Hicks, Simon Polstra, Clemens Grelck and Chris Jesshope. The following former members of the group also contributed to this research: Joe Masters, Peter Knijnenburg and Guang Liang. The author would like to thank Raphael Poss for all the constructive criticism he offered during the elaboration of this chapter and for his contribution to the evaluation section. The author would also like to thank Mike Lankamp for providing the simulator of the target platform (the Microgrid) used in the evaluation section.

Part II

Compilation for Parallel Computing Systems


Chapter 4 From basics to advanced SVP compilation

The previous chapter introduced the SVP execution model and its hardware and software implementations. The CSA group aims to build a fully parallel computing system where the Self-Adaptive Virtual Processor execution model is the cement binding the components together. As Figure 3.1 explicitly illustrates, the SVP system consists of two implementations which are bridged via an SVP-aware compiler. In practice, a program written in the µTC language implementation is translated into a corresponding machine-level representation targeting the Microgrid hardware implementation. In this chapter, we first introduce the reader to the idiosyncrasies of program transformations in an abstract manner. We then discuss the compilation schemes that are performed in order to bridge the µTC language implementation and the Microgrid hardware implementation. To explain this, we also use a theoretical approach to describe the compilation of a concurrency-oriented language for a multicore architecture. The main contribution to the reader is a presentation of the major abstract steps of program compilation. New and unconventional computer architectures are easy to envision. Nonetheless, programming these architectures is an entirely different matter. This research aims to prove that the SVP architecture implementation can be targeted with an SVP-extended modern imperative compiler.

The contents of this chapter are based on this publication:

• T.A.M. Bernard, C. Grelck, and C.R. Jesshope – “On the Compilation of a Language for General Concurrent Target Architectures”, in Parallel Processing Letters, 20, (1), March 2010.


Hence, we will set up the concurrent context in opposition to conventional assumptions. Moreover, the reader will get a clear understanding of the transformations performed by our compiler, displayed in Figure 3.1. We will also stress the conceptual issues that can endanger the concurrent code.

4.1 Basics in compiler transformations

4.1.1 A compiler is a transformer

Compilers are major computer programs in computer science because of their central position, as shown in Figure 1.6. The reason why this component is significant is that it is the only bridge between two worlds: the software world and the hardware world. Compilers are an enabling technology allowing the software layers to work with the hardware layers. In practice, the main purpose of a compiler is to transform input code into semantically-equivalent output code without any semantic loss. The input code, written in a computer language referred to as the source language, is transformed into another computer language, named the target language. In most cases, the source language is a high-level language, e.g. Java, C, etc.; it is transformed into a target language which is lower-level than the source language, e.g. a machine language, i.e. a series of machine instructions.

T(X) ⇒ X′ (4.1)

A compiler can be seen as a transformer from an abstract perspective. Formula 4.1 represents a transformer T which takes a program X; the result of this transformation is the semantically-equivalent program X′. A simple analogy is to a natural language translator, who conceptually performs the same task by translating a language he/she hears (or reads) into another one. The translator gives results in spoken or written form. Moreover, a compiler is also capable of producing a more optimized output program. Its purpose then is not only to generate correctly transformed programs, but also an optimized program that runs more efficiently on the target machine, or that is compiled into a smaller size if the target machine is an embedded-system device. Again with our translator analogy, he can evaluate the input in order to render a simpler translation with fewer words expressing the same meaning, or even to reproduce the same semantics with non-complex words. Formula 4.2 represents an advanced transformer T able to generate an optimized output program X*. An extra input a may be required to select a set of parameters to control specific optimizations of the optimizing transformer. We can then command the transformer to give us multiple representations of the same input program; the translator would do similarly if we told him to use only layman's words.


T(X, a) ⇒ X* (4.2)

The major concern in translation is to preserve the semantics of the input program. During a G20 summit, we do not want the translator to misinterpret the speech about nuclear-weapon proliferation agreements, for instance. Therefore extreme care must be taken in the way an input is transformed. Consequently, the same applies to an optimized output when we want to obtain a more concise program as a result; we do not want to lose any semantics of the input program. Throughout any program optimization stage, the program's semantics must remain the same as the semantics of the input program X.
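As a trivial illustration of Formula 4.2 (our own example, not taken from the thesis), an optimizing transformer may rewrite the first form below into the second: the second version performs fewer operations, yet for every input the observable result is identical, so the semantics of X are preserved in X*.

/* X: the program as written by the developer */
int twice_sum(int a, int b)
{
    return (a + b) + (a + b);       /* two additions of the same subexpression */
}

/* X*: a semantically equivalent, optimized form */
int twice_sum_opt(int a, int b)
{
    int s = a + b;                  /* compute the common subexpression once */
    return 2 * s;
}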

4.1.2 Different types of compilers

Multiple kinds of compilers can be distinguished and derived from the first definition. Here is a list of the most commonly known1:

• A source-to-source compiler is a type of compiler that takes a high-level language as its input and outputs a high-level language. For example, an automatic parallelizing compiler will frequently take a high-level language program as input and then transform the code with the addition of parallel code annotations (e.g. OpenMP).

• A compiler-compiler is a compiler generator tool that creates a parser, interpreter, or compiler from some form of formal description. For instance, a parser generator, whose input is a grammar (usually in BNF) of a programming language, and whose generated output is the source code of a parser.

• A just-in-time compiler first translates applications into an intermediate representation known as bytecode. This bytecode compilation takes place prior to execution. Bytecode is not the machine code of any particular computer, and may be portable among computer architectures. It is then compiled to native machine code (i.e. dynamic compilation) when required, at run-time, prior to executing the code natively, e.g. the Java compiler.

We have introduced the different types of compilers for the sake of the reader, who needs to know that 'compiler' is a vague word in the broad computing world. In this thesis, when we refer to a compiler, we assume the most common use of a compiler, which is the translation from a high-level language (e.g. µTC) into a low-level representation (e.g. assembly code for the Microgrid).

1The reader should be aware that this list of compiler types is not mutually exclusive.


4.1.3 The principles of a compiler

Compilers are engineered objects and very large software systems built for specific transformation purposes. Building a compiler requires myriad design decisions, which will be explained throughout this chapter and Chapter 6; each of them has an impact on the resulting compiler. There are, however, two major principles of compilation that should not be compromised during a compiler's construction.

The compiler must preserve the semantics of the input program being compiled.

This first principle is fundamental in any transformation. The compiler must not lose any meaning of its input program. Correctness is therefore a priority in compiler transformation and it will be discussed later in the experimental evaluation of the compiler in Chapter 7. The second principle that a compiler must observe is more pragmatic.

The compiler must improve the input program in some perceptible way (in the case of an optimizing compiler).

A conventional compiler improves upon the input program in different ways depending on the targeted machine. The 'improvement' of a program is subject to interpretation. For instance, in the environment of embedded systems, a low usage of energy-expensive operations is a priority. Therefore, the compiler will preferentially use the least energy-consuming operations possible for a specific task. In general, a compiler also tries to optimize the memory usage of a program by using the on-chip memory as much as possible (which is cheaper in access time but present in smaller quantities than off-chip memory), even though, on some architectures, it is possible to run a program using only off-chip memory. However, the program executed would then be much slower in terms of execution cycles. Consequently, the improvement constraints in that case are the lowest possible execution time and the lowest use of time-costly off-chip memory. To summarize this section, we presented only what is necessary to understand the context of compiler transformations. The reader now has the prerequisites to comprehend the scope of the next section, which discusses the compilation schemes of the SVP compiler. In Section 4.3, we will explain in further detail, and in a more technical way, how modern compilers are built. For now, in the following section, we keep an abstract perspective on how the compilation is done. We want the reader to grasp in what context these transformations must occur and what the requirements for correct transformations are.


4.2 SVP compilation schemes

In this section, we express the SVP compilation transformations with a simple formalism. These SVP compilation transformations bridge the SVP language implementation (i.e. the µTC language) and the SVP hardware implementation (i.e. the Microgrid architecture). In the state of the art, these compilation transformations are most commonly called 'compilation schemes'. They describe the properties of a compilation process. The basic compilation scheme T is shown in Figure 4.1. Furthermore, this transformation rule T is an abstract representation of the SVP compiler. All the transformation rules described in this section are extensions of the regular C compilation schemes.

T⟦ code ⟧ ⇒ instructions

Figure 4.1: The basic compilation scheme T takes a section of code as input on the left-hand side; it is translated into a series of instructions and directives on the right-hand side.

SVP compilation is performed on each µTC thread function. In µTC, all functions are thread functions; regular C function calls are treated in a special way which is explained at the end of this section. Moreover, we consider a thread function as a procedure. The compilation process applies to a program which comprises one or more procedures. As shown in Figure 4.2, a thread function is processed through the compilation scheme T as a whole.

T⟦ thread break type thread name(thread args) { Body } ⟧  ⇒
      .asm thread name
      .registers gi, si, li  gf, sf, lf
    thread name:
      T⟦Body⟧⟦E⟦thread args⟧⟧
    end

Figure 4.2: Compilation scheme T for a thread function. Here, we do not yet consider the special treatment of the thread arguments (i.e. compilation scheme E); this is further expounded in Figure 4.3. The result of the transformation T is the corresponding assembly procedure (i.e. starting with ".asm thread name" and finishing with "end") on the right-hand side.

A thread function is transformed by the SVP compiler into a sequence of instructions and assembler directives, as the compilation scheme in Figure 4.2 illustrates. On the right-hand side of this figure, the assembler directive .registers is responsible for defining the register windows, where "gi, si, li" are compiler-calculated numbers of registers of the global, shared and local classes for the integer thread context of thread name, and "gf, sf, lf" respectively for the floating-point thread context. The end instruction is the exit point of the concurrent region. If Body does not contain further µTC constructs, then regular C compilation schemes apply.

63 Chapter 4. From basics to advanced SVP compilation apply. Figure 4.3 also illustrates the compilation of a thread function where any of its parameters uses a synchronized communication channel declared as shared or global in the µTC program. As discussed in Section 3.2.2 and Section 3.4.5, the channels are handled with care by the SVP compiler to preserve their seman- tics. The global variables are read-only in the scope Body. The SVP compiler does not assign any global variables in the produced code. At thread creation or when data becomes available in parent’s context, these variables are implic- itly initialized by the architecture. Moreover, all statements held in Body which are using variables declared as shared are preserved in the transformed pro- gram shown on the right-hand side of the figure. There is a distinction made by the compiler when a µTC shared variable is accessed: marked in the compi- lation scheme with accessor methods, write(id name) and read(id name) respec- tively mapped in shared and dependent register classes. Any thread function may create a subordinate family. The compilation scheme in Figure 4.4 shows the SVP create action on the left-hand side and the trans- formed output after compiler transformation on the right-hand side. The first component is the initialization of thread arguments in the parent context via T thread args . The allocate instruction holds offsets of the parent context for parameterJ passingK to the child context. “(w),(x)” are compiler-calculated off- sets in the parent context for the first passed global and first passed shared for integer variables and respectively “(y),(z)” for floating-point variables as ex- plained in [61]. Then, the set-like instructions set up the configuration of the family of threads to be created. The create instruction is responsible for the generation of the family of threads thread name.


T⟦ thread break type thread name(thread args) { Body } ⟧  ⇒
      .asm thread name
      .registers gi, si, li  gf, sf, lf
    thread name:
      T⟦Body⟧⟦E⟦thread args⟧⟧
    end

where E is defined as,

E⟦ type id name, Rest ⟧ ⇒ E⟦Rest⟧

E⟦ shared type id name, Rest ⟧ ⇒ {id name} ∪ E⟦Rest⟧

E⟦ type id name ⟧ ⇒ ∅

E⟦ shared type id name ⟧ ⇒ {id name}

and F as,

T⟦ id name1 = id name2; Rest ⟧⟦shared set⟧ ⇒

    write(id name1) = id name2;
    T⟦Rest⟧⟦shared set⟧           where id name1 ∈ {shared set}, id name2 ∉ {shared set}.

    id name1 = read(id name2);
    T⟦Rest⟧⟦shared set⟧           where id name2 ∈ {shared set}, id name1 ∉ {shared set}.

    write(id name1) = read(id name2);
    T⟦Rest⟧⟦shared set⟧           where id name1, id name2 ∈ {shared set}.

    id name1 = id name2;
    T⟦Rest⟧⟦shared set⟧           otherwise.

Figure 4.3: Compilation scheme T for a thread function where it must be determined whether any of the arguments are potential synchronizing objects, declared as shared variables in the µTC program (i.e. used as SVP synchronized communication channels). The compilation scheme E determines whether thread arguments are shared variables or not and returns a potentially empty set of shared variables. This set is then used in combination with the compilation scheme F for dealing with synchronizing variables used in the statements of Body. The compilation scheme F assumes that the code has been flattened. Moreover, the variable identifiers id name1 and id name2 can refer to the same variable and are then distinguished by the index number. write(x) and read(x) are not function calls but accessor methods on a synchronizing object x.
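To make this rewriting concrete, consider the following thread function (our illustration, with arbitrary names); the comment shows, informally, how the compiler conceptually rewrites the flattened accesses to the shared parameter with the read/write accessors, assuming 's' is in the shared set and 'v' is not.

thread void acc(shared double s, double v)
{
    s = s + v;
    /* after flattening, conceptually rewritten as:
     *    t        = read(s);   -- consume the incoming (dependent) value
     *    t        = t + v;     -- ordinary local computation
     *    write(s) = t;         -- produce the outgoing (shared) value
     */
}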


A thread function can be defined as breakable by using a break type other than void and by using the µTC break construct in its Body; it is transformed as shown in Figure 4.5. This transformation involves the evaluation of the value of the expression expr. The type of this value is then compared to the break type. This value will be collected in the parent's context, as shown in Figure 4.4, in the object defined with the setbreak instruction. Moreover, an extra exit edge is added to model a new potential exit point in the execution path of the program.

T⟦ create(family id; place id; start; limit; step; block; break id) thread name(thread args); ⟧  ⇒
    T⟦thread args⟧
    allocate T⟦family id⟧, (w), (x), (y), (z)
    setplace T⟦family id⟧, T⟦place id⟧
    setstart T⟦family id⟧, T⟦start⟧
    setlimit T⟦family id⟧, T⟦limit⟧
    setstep  T⟦family id⟧, T⟦step⟧
    setblock T⟦family id⟧, T⟦block⟧
    setbreak T⟦family id⟧, T⟦break id⟧
    create   T⟦family id⟧, thread name

Figure 4.4: Compilation scheme for the µTC create action. There are three parts to distinguish on the right-hand side: first, the initialization of the thread arguments with T⟦thread args⟧; second, the family settings of the to-be-created family with the set-like instructions; third, the create instruction.

T⟦ break(expr); ⟧ ⇒ break T⟦expr⟧

Figure 4.5: Compilation scheme for the µTC break action. The corresponding thread function is defined as breakable by the presence of a break type other than void.

The µTC language allows the use of regular C function calls. Nevertheless, the compiler transforms them as shown in Figure 4.6. On the right-hand side of the figure (i.e. the parent context), instead of conventional caller code, the compiler produces a family of one thread targeting a call gate and generates a family identifier. As shown in Figure 4.7, the call container function name(function args) is generated and wrapped around the conventionally generated sequential caller code in a separate context, the call gate. This allows the proper use of sequential functions in thread functions. The generated callee code is conventional sequential code.


T⟦ function name(function args) ⟧  ⇒
    allocate T⟦family id⟧, (w), (x), (y), (z)
    create   T⟦family id⟧, call gate function name
    sync     T⟦family id⟧

Figure 4.6: A special case of compilation scheme T where a regular C function call is present in the statement list held in Body. On the right-hand side, two assembly procedures are the result of the transformation: the call gate and the thread name. The compilation scheme C corresponds to regular C compilation, which, in this example, compiles the function call with sequential conventions.

.asm call gate                              .asm thread name
.registers gi, si, li  gf, sf, lf           .registers gi, si, li  gf, sf, lf
call gate function name:                    function name:
    T⟦function name(function args)⟧             T⟦statements;⟧
    end

Figure 4.7: The call gate with the conventionally generated caller code introduced by the compiler (on the left-hand side) and the conventionally generated callee code (on the right-hand side). Note that the compilation scheme T reflects regular C compilation schemes here.
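On the source side this wrapping is invisible to the µTC programmer. The following sketch (ours, with arbitrary names) shows a regular C function called from a thread function; according to Figures 4.6 and 4.7, the compiler replaces the conventional call with the allocation and creation of a single-thread family around the call gate, followed by a sync.

extern int legacy_sum(int *v, int n);   /* ordinary sequential C code */

thread void worker(int *v)
{
    int total = legacy_sum(v, 1000);    /* compiled as: allocate, create of the
                                           call gate, sync -- not as a plain call */
    /* ... use total ... */
}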

4.3 Under the hood of SVP compilation

In the previous section, we looked at the compilation schemes that the SVP compiler performs, using a formalism which reflects the requirements of the SVP compilation process. Section 4.1 already introduced the basics of transformations. In this section, we look under the hood of the compilation process to see, still from a theoretical standpoint, the steps of compilation. The SVP compilation schemes extend standard C compilation schemes. This section draws on knowledge from thorough compiler and compilation books such as [72, 73, 74].

4.3.1 Overview of compiler structure

Theoretically, a compiler is a single-box transformer, as illustrated in Figure 4.8. We refer to the input as the source program and the output as the target program. The black-box transformation corresponds to Formula 4.1. A compiler must both understand the source program presented for compilation and map its properties to the target machine. These distinct tasks suggest a division of labor and an explicit design within the black box. The compilation process is decomposed into two major pieces: the front-end and the back-end, as illustrated in Figure 4.9.


[diagram: Source Program → Compiler (transformer) → Target Program]

Figure 4.8: The compilation process as a black box.

A 'simple' compiler consists of a language-dedicated front-end that 'understands' the source-language program and a specific back-end that maps the program to the target machine. These two pieces are combined inside an infrastructure which drives the compilation process and provides data structures, tree and graph representations, etc. The front-end emits an internal version of the program, called the Intermediate Representation and shown as IR in Figure 4.9. At any point in the compilation process, the compiler has an IR of the input program, which will then be converted into the target machine code during the last stage of compilation.

[diagram: Source Program → Front-end → IR → Back-end → Target Program, inside the compiler]

Figure 4.9: Under the hood, a 'simple' compiler is a front-end dedicated to 'understanding' the source-language program and a back-end 'mapping the program's properties' to the target machine. The two pieces are combined within a single infrastructure.

We have described here a simple two-stage compiler. The first stage (i.e. the front-end) ensures that the source program is well formed and converts it into the IR. The second stage (i.e. the back-end) maps the IR program onto the instruction set and the finite resources of the target machine. In reality, compilers are more complex, with multiple front-ends and back-ends. A front-end is dedicated to a language and the back-end is specific to a target machine with its forms of instructions and its memory structure. Therefore, having a language-independent and machine-independent IR gives many more possibilities of reusing and combining the different front-ends with various back-ends. In Section 4.1, we have seen that one of the principles of compilation is to improve the input code. The two-stage compiler design is thus inappropriate and evolves into an optimizing three-stage compiler, as shown in Figure 4.10. The optimizer stage is an IR-to-IR transformer and performs one or several passes over the IR before emitting an optimized IR. Formula 4.2 represents an abstract vision of this optimizer. The compiler then has different levels of IR throughout the compilation process; the IR evolves according to the functionality of the passes. Finally, the back-end deals with an optimized IR when emitting the target program.


[diagram: Source Program → Front-end → IR → Middle-end → IR → Back-end → Target Program, inside the compiler]

Figure 4.10: A classic optimizing compiler is formed of three distinct stages. An optimizer is added in between the front-end and the back-end. This optimizer stage is often referred to as the middle-end.

The optimizer can make one or more passes over the IR, analyze the IR, and rewrite the IR. The optimizer may rewrite the IR in order to produce a faster target program or a smaller program from the back-end. Each pass of the optimizer has an objective and can be iterated until it reaches the optimal result for its functionality. Conceptually, in a three-stage compiler design, there is a clear separation of concerns inside the compilation process:

• the front-end's concern is to understand the source program, to report lexical, syntax and semantic errors, and to build the first IR form.

• the middle-end's concern is to improve the IR form as much as possible; this is done independently of the source language and of the target language.

• the back-end's concern is to map the optimized IR form onto the bounded set of resources of the target machine in a way that makes efficient use of those resources.

In practice, modern compilers are based on a three-stage design with a complex infrastructure that allows combinations between a vast set of target machines, a large number of optimization passes, and a wide set of source languages. The infrastructure provides a sufficient amount of control commands for the user to take advantage of the compiler's features (e.g. abstracted in Formula 4.2 as the set of parameters a controlling specific optimizations). The infrastructure also provides generic symbol tables, tree and graph representations, and generic, reusable (i.e. cross-platform compatible) methods. Figure 4.11 represents an advanced compiler design which we use as our research framework. We now look closer at the compilation stages. Of these three stages, the middle-end raises issues and problems while optimizing a program; these are discussed in Chapter 5.


[diagram: the three stages (front-end: scan/parser, CSA, AST; middle-end: IR, optimization passes generated and coordinated by an optimization manager; back-end: code generation, peephole analysis and optimizations) combined on top of a shared infrastructure with the compiler driver and data structures (graphs, symbol tables, trees, etc.), taking the source program to the target program]

Figure 4.11: A modern optimizing three-stage compiler design.

4.3.2 Classic compilation work-flow

Within the scope of an explanation for an imperative-based compiler, we now look at the compilation process itself. Represented as a work-flow in Figure 4.12, the compilation process is a sequence of stages performed inside the compiler. A stage is in turn defined by a series of passes. A pass is either an IR analysis, an IR optimization, or an IR rewrite. A source program being compiled goes through all stages and various passes. The compilation process therefore becomes opaque and complex to the compiler user. However, the user can use control commands to record the different IRs throughout the compilation process. Usually, these recorded IRs do not contain all the information carried through the different stages, mainly because they contain too much information to display. Moreover, the IR form is not trivial to understand (depending on the stage at which it has been recorded). This section introduces the different stages and passes of a classic compilation using a work-flow perspective.

The Front-end

The source program is first analyzed by the front-end of the compiler. The front-end is the first stage of the compilation. It gives most of the feedback to the user concerning whether the program is well-formed. This is done during a first pass over the source program, scanning the stream of characters used in defining the program's code. After scanning, a stream of tokens is generated from the input. The scanner is often referred to as the lexical analyzer. It aggregates symbols to form tokens and applies a set of predefined rules from the source language definition.


[work-flow diagram: Source Program (in source language) → FRONT-END (scan, parse, CSA, AST) → MIDDLE-END (SSA/LIR, CFG, optimization passes under an optimization driver) → BACK-END (DFG, optimizations, code generation, peephole optimizations, register allocation, code emission) → Target Program (in machine code)]

Figure 4.12: A work-flow representation of a compilation process. This abstracts a modern optimizing three-stage compiler design. The three stages are represented by containers; the source program comes as input and the target program is the output of the work-flow. Each square block (a block in a container) defines a task in the work-flow and interacts (as a pass) with the program's representation during compilation.


After the lexical analysis, the stream of tokens is sent through the parser of the language-specific front-end. The parser determines the type of the input tokens, for instance whether a token is a keyword of the source language or a regular expression. The parser recognizes the syntax of the source language and provides feedback to the compiler user about the syntactic errors made in the source program. The parser's output is a concrete model of the source program to be used by the later stages of the compiler. A token is now associated with a class and a location within the program; the parser keeps track of where the tokens are found while parsing. Keeping in mind that the front-end performs extended analysis of the source program's validity, the front-end contains a Context-Sensitive Analysis (CSA). It generates a large database concerning the details of the program. By checking how values are represented, verifying how values flow between variables, and analyzing external components used in the source program, the context-sensitive analysis understands the structure of the computation. The type-checking analysis is important to ensure that there is no problem related to wrong types flowing between the variables where values flow. All this information is extracted from the source code and then transformed into an abstract tree representation. This is the first Intermediate Representation (IR) of the program, which is the output of the front-end. Moreover, the front-end provides feedback on program irregularities resulting from invalid keywords, invalid regular expressions, invalid types, etc., along with diagnostic information that the compiler user can use for refining and correcting the input program.

The Middle-end

The middle-end starts when the Abstract Syntax Tree (AST) has been generated by the front-end. This tree-based IR contains information about the code being analyzed and translated. This IR is independent of the source language and is the abstract program representation inside the middle-end. This allows the middle-end to be generic and therefore reusable for different front-ends. Moreover, the IR used in the middle-end contains no information about the target language, which provides an abstraction layer yielding a target-independent IR. Thus, the middle-end can also be reused for multiple back-ends. The passes in the middle-end consume the IR, analyze the IR and rewrite the IR. The first pass of the middle-end generates a three-address IR (also called Static Single Assignment (SSA) form). In this representation, most operations have the form x ← y op z, with an operator op, two operands y and z and one result x. Some operators, such as an immediate load and a jump, need fewer arguments; sometimes, an operation with more than three addresses is needed. A new set of compiler-generated variables is introduced into the code, revealing new opportunities to improve the code on a simpler IR. The reason for this is that a variable is only assigned once within the same scope; this permits more aggressive optimizations.
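As an informal illustration of this lowering (our own example; the compiler-generated temporaries t1..t3 are arbitrary names), consider a C expression and its three-address form:

int f(int a, int b, int c)
{
    return a * b + c * 4;
    /* three-address form produced by the middle-end, conceptually:
     *    t1 <- a * b
     *    t2 <- c * 4
     *    t3 <- t1 + t2
     *    return t3
     * each temporary is assigned exactly once                      */
}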


Furthermore, at the beginning of the middle-end, another pass derives a Control-Flow Graph (CFG) from the tree IR. The CFG models the flow of control of the source program. The atomic part of a CFG is called a basic block. A basic block is a sequence of operations that execute together. Control always enters a basic block at its first operation (e.g. a target of a jump) and exits at its last operation (e.g. a jump operation). Edges are used to expose possible transfers of control from one block to another. The CFG is used in program analysis and by optimizations. As with any part of the IR, the CFG can be rewritten by any compiler pass. On top of the various IRs (i.e. AST, CFG, SSA), the compiler has other internal data structures which are used during the compilation passes. The symbol table is one of them and is an integral part of the compiler's IR. The compiler encounters a wide variety of objects during compilation: variables, defined constants, procedures, functions, labels, structures and files. As mentioned before, the compiler also generates many objects. The symbol table lists all the program's objects with information related to their type, their location in the code, their scope, their dimensions (for arrays), their fields (for records or structures), the number of parameters and their types (for functions), etc. These IRs are used by the program analysis passes in the middle-end. A source program is composed of one or more functions. Within the compiler, a function is abstracted as a procedure. Advanced program analysis involves inter-procedural analysis, which, for instance, gives feedback on values communicated between procedures. Here, the compiler analyzes which procedure call is related to which procedure. A type check is also performed to verify whether a call argument has the proper type for a specific procedure argument. Moreover, procedure calls are analyzed to produce a call graph, which gives the nesting level of a procedure from the program root perspective (in other words, the starting point). Using the IRs created by previous passes, the optimizations can be more accurate thanks to sharper information. For instance, Dead-Code Elimination (DCE) uses the CFG to evaluate which branches of the graph can never be reached by any flow; the DCE then simply removes the useless branches from the graph. The result of such a pass is really interesting since it can be used to reduce the code size of a given program. Another interesting pass, called Common Subexpression Elimination (CSE), searches for identical expressions evaluating to the same value and analyzes whether it is worthwhile to replace them with a single variable holding the computed value. In later stages, fewer operations will be generated and hence the target program will be smaller in size. Constant propagation looks for constant expressions whose values are known at compile time; these constant expressions are then substituted by their values. The main optimization done here is to produce fewer operations for the same result; the target program produced will be composed of fewer operations.
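The following C fragment (our own example, with arbitrary names) shows these three optimizations at once:

int area(int w, int h)
{
    int border = 2 * 3;                        /* constant folding/propagation: becomes 6 */
    int inner  = (w - border) * (h - border);
    int outer  = (w - border) * (h - border);  /* CSE: identical to 'inner', its value is reused */
    if (0) {
        inner = -1;                            /* unreachable branch: removed by dead-code elimination */
    }
    return inner + outer;
}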


The Back-end

The boundary between the middle-end and the back-end differs from one compiler to another. In modern compiler design, the back-end is not only responsible for generating the target program and for mapping it onto a limited amount of physical resources; it has nowadays evolved into an optimizing back-end. Further optimizations can be done before code emission (i.e. the last stage of compilation), such as Peephole Optimizations. These machine-dependent optimizations typically target small segments of code represented in the low-level IR (i.e. an IR close to machine code). They recognize sets of instructions that do not actually do anything, or that can be replaced by a leaner set of instructions. On some architectures, for example, there is a special addition instruction for large words which allows, with one single instruction, an addition with more than two operands. Moreover, some series of instructions can simply be substituted by faster ones; this is called Strength Reduction (a small sketch is given after the list below). These optimizations operate on a low-level IR which has been lowered when entering the back-end. Generally, this low-level IR looks almost like machine code, with pseudo-instructions using pseudo-registers (or virtual registers, which are non-architectural registers not yet register-allocated). Modern back-ends analyze the IR, if this has not yet been done at the middle-end stage, to extract dataflow information. Some modern compilers might re-perform analysis on the source program being compiled, such as Data-Flow Analysis (DFA). Since the representation of the program evolves throughout the compilation stages, the compiler passes require accurate information about the program and therefore frequent updates of the IRs. DFA is a static code analysis (i.e. performed at compile time) and investigates the whereabouts of values in a program. By observing how values flow, the compiler can know exactly how the code will behave at execution time. DFA is a necessity for optimal register allocation and other data-related optimizations. The last stage of compilation is one of the most important and often the one which makes the difference in terms of performance between compilers: code generation. Code generation is separated into three components:

1. Instruction Selection, by definition, chooses the proper instruction in the targeted instruction set for the selected operation.
2. Instruction Scheduling orders the selected instructions. This scheduling is a speed optimization which has a critical effect on pipelined machines. The instructions can be reordered to improve performance or hide latency, based on heuristics of the targeted machine.
3. Register Allocation (RA) allocates the variables of the program to hardware registers. The main concern is that the program can have an enormous number of variables to map onto a restricted number of physical registers on the target machine. Based on heuristics, the RA optimizes the mapping of variables to hardware registers as much as possible, for instance by keeping the most-often-used variables in hardware registers and thereby reducing the use of memory accesses (i.e. the stack or other memory structures).


Hardware registers have the quickest read-and-write access time compared to other architectural memory structures.
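As a small illustration of strength reduction (our own example; the actual rewriting depends on the target machine and is performed on the low-level IR, not on the C source), a back-end may replace an expensive operation by a cheaper equivalent:

unsigned scale8(unsigned x)
{
    return x * 8;        /* may be emitted as a shift, x << 3, instead of a multiply */
}

unsigned mod16(unsigned x)
{
    return x % 16;       /* may be rewritten as x & 15 for an unsigned operand */
}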

The last stage of compilation is code emission, which produces the final optimized representation of the program as the target program. To accomplish all these stages, the compiler uses a considerable number of data structures to represent the details of the input program. Modern compilers are usually 'reconfigurable' frameworks which provide a set of generic, cross-compatible methods and processes for different source languages and target architectures. The compiler framework provides:

• graph structures,

• tree structures,

• symbol tables or hash tables,

• singly-, doubly-, and multiply-linked list methods,

• graphical representations,

• diagnostic information,

• the infrastructure.

This section has explained how the classic compilation work-flow is a complicated and monumental process. Despite that, the compiler takes special care to follow the two main principles: preservation of the program's semantics and program optimization.

4.3.3 Compilation work-flow of the µTC language

After describing the classic compilation process (i.e. the work-flow of stages) and the SVP compilation (in Section 4.2), we now look at the correlation between the compilation of the SVP language implementation and conventional compilation. For the moment, we use an abstract level of representation to help the reader grasp the essence of SVP compilation. SVP compilation is comparable to a work-flow which takes as input a source program written in the µTC language and produces its corresponding representation in the Microthreaded Instruction Set as the target language. The SVP compilation process extends, in practice, the standard C compilation schemes.

Composition of a µTC program

First, we have to observe how a given concurrency-oriented program is built from a functional perspective (i.e. distinguish the functional blocks and the communication between the blocks). Indeed, the source program, coded in µTC, is composed of computational blocks which correspond to thread functions, as illustrated in Figure 4.13. From a model view, a thread function is a concurrent region. A µTC program consists of a hierarchy of threads grouped by families, in which the hierarchy is exposed with the concurrency tree. This concurrency tree hence shows the concurrent regions of a program and their relationships.

[Figure 4.13 diagram: a Program decomposes into thread functions; a thread function such as thread foo (...) { ... } is one concurrent region.]

Figure 4.13: A µTC program is composed of concurrent regions, shown on the left-hand side. A thread function is a concurrent region, as represented on the right-hand side. This thread function defines the work of a family of threads. The family of threads is created by the parent with the create action. For each thread of the family, a thread context is instantiated and contains its own context of variables. The context is constrained by dataflow dependencies. This figure does not represent any inter-thread dependencies.

Families of threads

A thread family encapsulates a group of threads at execution time. At compile time, however, a thread family's work corresponds to a thread function. A thread function being compiled is referred to as a thread procedure. In the execution model, a thread procedure is the atomic concurrent region and has its own context of variables. The communication channels are explicitly exposed in the program and represent the dependencies between threads (i.e. with the parent thread, the child threads, and the adjacent threads). In practice, the compiler processes a thread function separately from the rest of the program. Nevertheless, once a complete scan of the program has been performed, the compiler has a global view of the source program through the IRs. During SVP compilation, a µTC source program is mapped to a sequence of thread procedures. In Figure 4.13, a thread function in µTC is a thread procedure within the compiler and becomes a concurrent region at execution time. At execution time, a thread family is composed of heterogeneous threads which each have their own private context. Nevertheless, these threads are statically homogeneous in their µTC code description.


A proper internal representation of a µTC program

A thread family can therefore have multiple live contexts at execution time. Each indexed thread of this family has its own flow of control. The inter-context dependencies are exposed with synchronized communication channels. Consequently, after decoupling the µTC source program into a composition of thread procedures, the compiler must preserve their hierarchy in order to preserve the semantics of the source program. For that purpose, the compiler builds the concurrency tree on the same scheme as a call graph in the sequential paradigm. This concurrency tree can also be seen as a creation graph, illustrated in Figure 4.14. Note that the creation graph is context-sensitive: each concurrent region has its own context, and inter-context dependency information is annotated. Thread procedures are interconnected with the create action. As a comparison with the sequential model, a program is composed of routines that are interconnected with calls. A routine or sub-routine is, by definition, “a portion of code within a larger program, which performs a specific task and is relatively independent of the remaining code”. Hence, within the concurrent model, thread functions are the corresponding concurrent routines.

[Figure 4.14 diagram: a creation graph with nodes main, foo, baz and bar, annotated with their DEPTH in the creation hierarchy.]

Figure 4.14: Creation graph example of a given µTC program. Similar in appearance to the call graph, the creation graph is a representation of the concurrency tree of the program. The creation graph also contains the creation relationships between concurrent regions (i.e. thread functions) together with the thread parameters in the µTC program. The depth is the level of encapsulation of a create from the root of the program (i.e. the main thread).
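To picture how such a creation graph could be stored inside a compiler, the following C sketch represents each concurrent region as a node carrying its depth and the regions it creates. The data layout and names are hypothetical and only serve to illustrate the idea; they do not describe the actual SVP compiler.

#include <stdio.h>

#define MAX_CHILDREN 8

/* One node per thread function (concurrent region): its depth in the
   creation hierarchy and the regions it creates with the create action. */
typedef struct Region {
    const char *name;
    int depth;                          /* nesting level of the create */
    struct Region *children[MAX_CHILDREN];
    int nchildren;
} Region;

static void add_create(Region *parent, Region *child) {
    child->depth = parent->depth + 1;
    parent->children[parent->nchildren++] = child;
}

int main(void) {
    Region main_r = { "main", 1, { 0 }, 0 };
    Region foo    = { "foo",  0, { 0 }, 0 };
    Region bar    = { "bar",  0, { 0 }, 0 };
    add_create(&main_r, &foo);   /* main creates foo */
    add_create(&foo, &bar);      /* foo creates bar  */
    printf("%s is created at depth %d\n", bar.name, bar.depth);
    return 0;
}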

A peek at optimizations

The compiler, at the middle-end stage, performs an advanced inter-procedural analysis to verify which create action is related to which thread function. Since the creator of a thread function is identified, type checking of the thread parameters can be performed as well. Conventional intra-procedural optimizations are performed on each thread function with respect to the synchronized communication channels; this will be discussed in Chapter 5. Figure 4.15 represents the relationships of a concurrent region with the surrounding regions; these relationships must be exposed to the compiler to convey the SVP assumptions in compiler optimizations. The remaining compilation stages, such as register allocation and the other passes in code generation, act on the thread procedure being compiled. The resulting code representation of the source program is then emitted. The output of SVP compilation is a series of Microthreaded assembly procedures. To a non-concurrency-aware programmer, the resulting target program resembles a series of operations similar in form to sequential procedures.

Internal representations: augmenting the conventional CFG

The concurrency tree is visible to the SVP compiler; therefore, the compiler is able to perform several actions on it: read it, write it and transform it. Potential optimizations can restructure the tree shape in order to improve on the concurrency as exposed by the programmer. The compiler here applies the second principle of compilation we mentioned earlier: optimizing the source program. To perform such optimizations and further compilation stages, the compiler also requires an accurate CFG. The CFG indeed needs to be aware of the concurrent assumptions related to the concurrent regions and their relationships (cf. Figure 4.15). Moreover, the compiler has to mark a create action and its related sync action. The control flow ‘leaves’ the concurrent region at the create action and eventually comes back at its related sync action. In practice, there is a difference here from the sequential paradigm, where a call generates a branch in the control flow from the current context: while branching to the callee and until the end of its execution, the control flow is away from the caller's code. In our case, a separate flow of control is generated at the create action while the creator's flow of control continues after the create action, as illustrated in Figure 4.16. The conventional CFG does not make the assumption of multiple concurrent flows of control; therefore, it requires special attention and must be extended properly to support such assumptions, as illustrated in Figure 4.17. The CFG exposes the concurrent regions of a program and also the create and sync actions. The SVP compiler is then aware of which areas of the procedure may have more than one control flow. The sync action joins all of the flows of control from the child threads and the parent thread into the parent context. Therefore, after the sync action, the variables used between the child family and the parent are consistent.
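As a purely illustrative sketch of such an extension (the data structures and names are hypothetical and do not correspond to the actual compiler), an SVP-aware CFG might distinguish a concurrent create edge from ordinary sequential edges, so that later passes can see where a second flow of control begins:

#include <stdio.h>

/* Hypothetical edge kinds for an SVP-aware CFG: besides the usual
   fall-through and branch edges, a create edge starts a second,
   concurrent flow of control, and a sync edge joins it back. */
typedef enum { EDGE_FALLTHROUGH, EDGE_BRANCH, EDGE_CREATE, EDGE_SYNC } EdgeKind;

typedef struct {
    int from_block, to_block;
    EdgeKind kind;
} Edge;

static int starts_concurrent_flow(const Edge *e) {
    return e->kind == EDGE_CREATE;
}

int main(void) {
    /* The create node of Figure 4.17 has two outgoing arcs: */
    Edge to_parent_cont = { 2, 3, EDGE_FALLTHROUGH };  /* parent keeps running  */
    Edge to_thread_func = { 2, 9, EDGE_CREATE };       /* concurrent child flow */
    printf("%d %d\n",
           starts_concurrent_flow(&to_parent_cont),
           starts_concurrent_flow(&to_thread_func));
    return 0;
}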

Internal representations: updating the conventional DFG

While the concurrency tree contains information related to inter-region dependencies, the CFG contains much finer-grained information within the concurrent region. Indeed, the basic blocks composing the CFG map the finer behavior inside thread functions. Therefore, inter-context data relationships with the parent and/or with the potential child family (if this is the creating basic block) and with the adjacent threads are visible to the SVP compiler. The scope of variables within a concurrent region differs from the sequential assumptions when the variables are defined as synchronized communication channels.


[Figure 4.15 diagram: parent domain P creates a family domain C(3); arcs labelled R (read) and W (write) show the flow of the global channel G and the shared channel S through the thread contexts. The thread body shown is: thread foo (int x_g, shared int y_s) { ... temp = y_s; y_s = x_g + temp; ... }]

Figure 4.15: The relationship of a concurrent region with adjacent regions and its parent. Parent P creates a family C of 3 threads which has two parameters: a global x_g and a shared y_s. The tokens G and S respectively represent their flow (by the arcs as ‘read’ R and as ‘write’ W) through contexts of concurrent regions. The shared and global channels are marked in the internal code representation, and thus are handled with care.


[Figure 4.16 diagram: on the left, the sequential case with caller and callee, a call at line L and the return to line L+1; on the right, the concurrent case with parent and children, a create at line L and the corresponding sync at line L+i.]

Figure 4.16: Comparison of the control flows of a sequential call on the left-hand side and a concurrent create on the right-hand side. In the sequential paradigm, when a call occurs, the control flow leaves the caller's context and goes into the callee's until completion of the callee's job. Then, the control flow comes back to the operation immediately after the call operation (i.e. at line L+1). In the SVP concurrent paradigm, however, when a create takes place, the control flow separates into two: one flow goes on in the parent's context and the other flow spawns the child threads. There, this control flow separates again into each child thread. At completion (i.e. the sync operation at line L+i), the control flows join and, after this operation, only one flow remains in the parent's context.

[Figure 4.17 diagram: an entry block, a create node with two outgoing arcs (one to an external Thread Function (TF) block, one to the parent's continuation), a sync node where the arcs converge, and an exit block.]

Figure 4.17: CFG representation of an SVP create-sync block. There is one entry point to this block at the create operation and one exit point after the sync operation. The representation of the control flow separation is shown by two outgoing arcs from the create operation. The Thread Function (i.e. TF in the figure) is an external block and contains the CFG of the job to perform. At the sync, all the incoming arcs converge; subsequently, only one outgoing arc remains.

Through the DFA, the SVP compiler isolates the channels when building the Data-Flow Graph (DFG) of the source program. In SVP, a shared communication channel is defined as one single object in the source program which maps onto a pair of objects: an incoming and an outgoing object. In order to have appropriate RA and optimizations, the DFG contains this information with the proper representation of communication channels.

The same holds for marking the global communication channel as a read-only channel. Thus, the DFG is aware of the read-and-write accesses on SVP communication channels, as exposed in Figure 4.18.

[Figure 4.18 diagram: on the left, the thread function thread void foo(shared double a) { ... t = a; ... a = ...; }; on the right, its DFG with an IN(a) node read (R) by the first operation op1 and an OUT(a) node written (W) by the last operation opn.]

Figure 4.18: On the left-hand side, an SVP thread function using one single SVP shared synchronized communication channel. On the right-hand side, its corresponding DFG.

The CFG and DFG of the source program are particularly important inputs to inter- and intra-procedural optimizations. Any operation from a concurrent region may have a side-effect on another operation in another concurrent region. The SSA form, in the middle-end, is a good example of 3-address operations where the DFG is required to have information related to their inter-context data relationships.
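To make the split of one source-level shared channel into a pair of objects concrete, the following C sketch models, purely hypothetically, how the DFG side might keep one node for the incoming end and one for the outgoing end of a channel and record the read and write accesses on each; the names and fields are invented for illustration.

#include <stdio.h>

/* Hypothetical DFG-side bookkeeping: the single source-level shared
   variable 'a' is represented by two objects, an incoming end that the
   thread reads and an outgoing end that the thread writes. */
typedef enum { CHAN_INCOMING, CHAN_OUTGOING } ChannelEnd;

typedef struct {
    const char *var;      /* source-level name, e.g. "a"            */
    ChannelEnd  end;      /* which of the two ends this node is     */
    int reads, writes;    /* accesses recorded while building DFG   */
} ChannelNode;

int main(void) {
    ChannelNode in  = { "a", CHAN_INCOMING, 0, 0 };
    ChannelNode out = { "a", CHAN_OUTGOING, 0, 0 };
    in.reads++;           /* t = a;   reads the incoming end  */
    out.writes++;         /* a = ...; writes the outgoing end */
    printf("IN(a): %d read(s), OUT(a): %d write(s)\n", in.reads, out.writes);
    return 0;
}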

Special SVP operators

The back-end of a compiler framework comprises two separate parts: the first is the back-end generator, which is generic for any target platform; the second contains descriptions of target architectures: their instruction set, their memory structure, their register structure, etc. The back-end generator employs native operators which correspond to operations in an abstract target machine. This abstract target machine is an abstraction of the target platforms with adjustable and generic instruction formats. In the context of SVP compilation, SVP operators extend this abstract target machine to natively implement the SVP actions. From the front-end, the SVP constructs are carried through and lowered to SVP operators such as the create, sync, kill and break operators. Doing this makes it possible to retarget another platform with SVP properties. Moreover, the target description is also extended to map these SVP operators onto SVP instructions. The create construct in µTC splits into three blocks in the back-end:

• a first block for the arguments of the thread function,

• a second block for the settings of the create and the to-be-created family,


• a third block with the operators that generate the create itself.

The create implements a ‘special’ call with side-effects on the arguments used in the thread function argument list and on the control flows of the neighboring operations (in execution order). The architecture implements the create operation using a ‘pull’ mechanism which generates bindings between locations in the parent thread's resources and the child threads' resources. Consequently, between family creation (i.e. create) and family completion (i.e. sync), resources used for thread function arguments must neither be reassigned nor be reallocated. Moreover, the create operator implements the semantics for multiple control flows. The compiler is aware of the control-flow modifications at family creation. The control flow separates at the create operator into two parts: the code section right after the create operator and the code of the child threads about to be created. Multiple concurrent control flows generate inter-context dependencies, which are captured with SVP inter-thread communication channels. The control flows merge at family completion. Locks on resources (used for thread function arguments) are released at family completion. The create operator targets a thread function whose definition is controlled in a similar way to the sequential paradigm, using caller-callee controls that follow calling conventions. Other SVP actions have their own back-end operators, such as kill, break, sync, index, etc. The advantage for the compiler is the visibility of their actions and the verification of their semantics.


4.4 Conclusion

To summarize this chapter, we have looked at the compilation basics for conventional compilers to present the complex and deep process of compiling conventional programs. We have then expressed the differences with SVP compilation in sufficient detail to reflect the depth of change necessary to extend a conventional (imperative-language) compiler. This corresponds to the specifications defined for the SVP compiler we have built, with an abstract perspective of how it should transform µTC applications. For that purpose, we have given a concise overview of classic compilation to understand the peculiarities of SVP compilation; we have studied the steps involved in the process of compiling a concurrent language with a conventional compiler. Conventional optimizations are dangerous for any concurrency-oriented program defined with the SVP paradigm, and for concurrency in general. Extreme care has to be taken in that sense for any concurrency-related work in compiler research. The stress of compilation is put on the two major principles of compilation: preservation of code semantics and optimization of the input program in some way. Embedding SVP concurrency natively within the compilation per se allows the reuse of the compilation infrastructure, with consequent changes. This, however, comes at a cost for our research. Chapter 5 investigates the potential dangers of introducing concurrency in compiler optimization passes. Chapter 6 then discusses the research costs for SVP compilation.


Chapter 5 On the challenges of optimizations

A danger foreseen is half avoided.
Proverb

This chapter discusses the challenges of conventional compiler optimizations over SVP programs. We have previously discussed in Chapter 3 the properties of the SVP execution model and its implementations. SVP compilation is dissimilar to conventional (imperative) compilation, as we have explored in Chapter 4. Some SVP properties clash with the assumptions of the optimization algorithms used in modern compilation for imperative sequential paradigms. Research interest has been directed at the dangers of conventional optimizations over concurrency-oriented code. E. Lee [50] argues against thread-based programming models because of the new classes of bugs that appear in code using concurrent programming, data-parallel language extensions and message-passing libraries such as PVM and MPI. Gray areas appear for compilers in program descriptions, resulting in non-determinism. Boehm [48] explicitly presents the problems of compiling (with optimizations) concurrent code written with Pthreads, a concurrent programming standard. Major problems reside in how concurrency is exposed to the compiler. Sarkar [75] discusses the necessary changes for compiler frameworks to cope with the new paradigm shift of multicore programming. He mentions the indispensable work on compiler optimizations while addressing the challenges of code optimization for parallel programs. Sarkar discusses the ongoing evolutionary effort (understand here: extending existing methods) to support optimization of parallel programs in the context of existing compiler frameworks, and their inherent limitations for the long term. He then outlines what a revolutionary approach will entail, and identifies where its underlying paradigm shifts are likely to lie. In the context of this thesis work, it becomes relevant to investigate the scope of hazards that SVP programs encounter through compiler optimizations. This chapter sorts out the optimizations that must be constrained to work in this context; it also identifies which optimizations are not required at all when generating valid Microgrid code; finally, it presents novel optimizations that could benefit the code.

5.1 Hazards with optimizations

This section covers the dangers of conventional optimizations when compiling SVP programs. As stated in Section 4.1.3, the first important principle of compilation is preserving the semantics of the input program. This is our main concern after integrating new compilation schemes into the compiler's internals. The scientific question here is to observe and analyze the hazards of program optimizations over concurrency-oriented programs while achieving the second principle of compilation (i.e. optimize the program in some measurable way). In our research implementation, we have decided to embed the concurrency constructs into the compiler infrastructure. Therefore, the compiler has direct visibility of a program's concurrent structure. The validity of the internal representations, such as SSA, CFG, DFG, AST, etc., is an immediate concern in providing any compiler optimization with the proper abstraction of the problem to transform and optimize. With SVP, the concurrency tree is captured at ‘thread function’ granularity, and the existing IRs are extended with regard to the SVP concurrency constructs.

Purpose of compiler optimizations

During compilation, an optimization can be applied a single time or multiple times until a satisfactory solution is reached. The term optimization signifies that the compiler discovers a better solution to a problem than the one taken as input. The gain may be a few operations, a lower use of resources, etc. In the end, the program has to run at least as fast as the non-optimized one, but if possible with fewer instructions, fewer architectural registers, etc. Adding new compilation schemes (cf. Section 4.2) implies the addition of new assumptions in optimization algorithms (typically extra choices in a switch-case scheme); therefore, there are more chances to break code consistency. Consequently, we have extended the scope of existing optimizations to support the SVP concurrent paradigm.

SVP compilation and concurrent programming

The SVP compilation schemes require proper transformation rules for converting µTC code (with SVP and C constructs) into the appropriate operations.


On the one hand, we discuss here optimizations that can endanger the validity of the code. On the other hand, we discuss optimizations that can be applied to the code to make it more efficient, e.g. elimination of non-executable sections and concurrency tree restructuring. In concurrent programming and with SVP compilation, a major issue with concurrent regions is the presence of multiple control flows at the same time, whereas with the sequential paradigm one single flow is present at any time. This change of assumption breaks the sequential paradigm used in some optimizations, for instance when combining instructions. Some optimizations, taken individually, are harmless for concurrency-oriented programs. Nonetheless, a combination of optimizations might dramatically endanger the concurrent code without any notification to developers. These unrelated optimizations can have a disastrous impact on the code.

5.2 Investigating some optimizations

We have observed that, once the code has been flattened into 3-address code, some inter- and intra-procedural optimizations at the middle-end level can potentially break the SVP assumptions. Moreover, peephole optimizations (such as combining instructions) are also dangerous, especially with respect to the synchronized communication channels of the SVP paradigm. Figure 5.1 depicts an example of an unmodified optimization algorithm. The side-effects of this algorithm simply destroy the SVP semantics by removing SVP actions from the program's code.

[Figure 5.1 code (left-hand side): ... create(fid;...) foo(x,y); ... sync(fid); ...]

Figure 5.1: Example (on the left-hand side) of optimization side-effects on SVP code. On the right-hand side, the dead-code elimination simply removes the SVP actions, which are considered useless statements: the result of an SVP-unaware algorithm. This optimization must be constrained to work in this context.

The SVP compiler is extended with new compilation schemes on top of the existing C-language compilation schemes. The addition of new SVP operations and their corresponding properties requires special attention when interacting with compiler optimizations. The compiler must be aware of what it has to achieve. Moreover, the SVP synchronized communication channels have special semantics that clash with these optimizations, in particular the synchronization events bound to accesses to these channels. For instance, a shared communication channel has two ends: an incoming shared (i.e. a read from it) and an outgoing shared (i.e. a write to it). A synchronization event is bound to the first read of the incoming shared variable; another synchronization event is bound to the first write to an outgoing shared variable. Introducing temporary variables that break down a statement involving a shared variable can potentially corrupt SVP code consistency. Figure 5.2 illustrates a case where a synchronized communication channel is broken by the introduction of temporary variables.

[Figure 5.2 code. Left: { ... x = a + x; ... }  Right: { int x.25; int D.1199; D.1199 = a; x.25 = x + D.1199; }]

Figure 5.2: Example (on the left-hand side) of optimization side-effects on SVP communication channels. In the code, ‘x’ is a shared communication channel and ‘a’ a global channel. During the middle-end stage, the synchronization events may be broken by the introduction of temporary variables (‘x.25’ and ‘D.1199’); new reads and writes appear in the program. Note that the local declaration of ‘x.25’ must be propagated to a shared communication channel to keep the SVP semantics valid for subsequent compilation steps.

Compiler optimizations are numerous; we made the following selection of optimizations that must be constrained to work in SVP compilation: SSA, common subexpression elimination, partial redundancy elimination, dead code elimination, combining instructions, copy propagation and instruction reordering.

5.2.1 SSA

The ‘Static Single Assignment’ form, often abbreviated as SSA form or simply SSA, is both a compiler optimization and an intermediate representation (IR) in which every variable is assigned exactly once. Usually occurring at the beginning of the middle-end, the SSA transformation takes the existing variables in the original IR and splits them into versions, new variables typically indicated by the original name with a subscript, so that every definition gets its own version. Consequently, new local variables are introduced in the scope of the function. Doing this allows more aggressive optimizations. Pop [76] presents the benefits of integrating SSA into an imperative compiler implementation; restrictions on the code in SSA form are reduced and hazardous side-effects during optimizations are diminished. Figure 5.3 shows the transformation of a piece of code into an SSA representation. The SSA form makes use-def chains explicit, and each chain contains a single defining element (i.e. each use has exactly one reaching definition). A use is a point where the variable is consumed by a read; a def, or definition, is a point where the result of an operation is written into the variable. For this reason, the SSA form tends to add temporary variables in the local scope of the procedure. When integrating the SVP assumptions into the compiler's internals, the problem arises with the SVP communication channels.


[Figure 5.3 code. Left: x = y; y = y * 10; x = y;  Right (SSA): x_1 = y_1; y_2 = y_1 * 10; x_2 = y_2;]

Figure 5.3: SSA transformation of a simple piece of source code. On the left-hand side, a piece of source code is the input of the SSA transformer. The result is the SSA form on the right-hand side. The SSA transformation simplifies the properties of variables through their versioning.

The semantics of these channels can be broken by the transformation into SSA, with the introduction of new versioned variables. The synchronization events may simply be corrupted by reads from and writes to these new variables. The idea to solve this problem is to use an ‘evolutionary’ approach, as Sarkar [75] describes. In an ‘evolutionary’ manner, compilers incorporate new code into existing optimizations to extend their scope. ‘Revolutionary’ changes are necessary when new paradigm shifts arise and new components are needed, and/or compilers need to be rewritten from scratch to support the new paradigms. The changes for extending SSA are to propagate information related to synchronized communication channels into the new versions of variables, and to handle properly the appearance of temporary variables when these stand for synchronized communication channels. Therefore, the IR remains correct and safe for subsequent compiler optimizations.
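Figure 5.3 shows only straight-line code; at a control-flow join, SSA additionally introduces a phi node that selects between the versions arriving from the two branches. The following plain-C sketch mimics that situation: the 'after' function is written in an SSA-like style and the phi node is simulated with a conditional merge (illustrative only, not an actual IR).

#include <stdio.h>

/* Original form: the same variable x is assigned on both branches. */
static int before(int c, int a, int b) {
    int x;
    if (c) x = a; else x = b;
    return x + 1;
}

/* SSA-like form: each assignment targets a fresh version, and the merge
   point plays the role of a phi node choosing between x_1 and x_2. */
static int after_ssa_like(int c, int a, int b) {
    int x_1 = 0, x_2 = 0, x_3;
    if (c) x_1 = a; else x_2 = b;
    x_3 = c ? x_1 : x_2;          /* stands in for x_3 = phi(x_1, x_2) */
    return x_3 + 1;
}

int main(void) {
    printf("%d %d\n", before(1, 4, 7), after_ssa_like(1, 4, 7));
    return 0;
}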

5.2.2 Dead-code elimination

‘Dead-Code Elimination’ (DCE) is a compiler optimization that removes code that does not affect the program; this useless code is called dead code. Dead code includes code that can never be executed in the program and is often referred to as unreachable code. Moreover, dead code also covers dead variables, which become irrelevant to the program. The DCE is a very important and useful optimization, because removing such code has two benefits: it shrinks the program size, and it lets the running program avoid executing irrelevant operations, which reduces its running time. The DCE may occur in the middle-end and also in the back-end. As soon as an optimization pass has transformed the program's internal representation, new dead code may have appeared. Consequently, the DCE is invoked several times during compilation; it basically uses the CFG of the program (cf. Cooper's and Torczon's book [74], pages 498 to 505) and also information from the SSA form, as Cytron et al. [77] discuss. As illustrated in Figure 5.4, the leaves of the CFG may be useless code when they cannot be reached, or simply because they do not contribute anything before the program's procedure exits. Figure 5.5 illustrates a case where the µTC semantics are valid but the DCE algorithm has not yet been extended to support the SVP assumptions.

thread void foo (int x, int y) {
    return x + y;
    int z = x * y; /* unreachable statement */
}

Figure 5.4: The DCE algorithm discovers, in this C-language example, that the statement “int z = x * y;” will never be reached since the function exits at the return statement. Therefore, this statement will be removed.

The statement “x = x”, containing a shared synchronized communication channel, is simply removed, even though the statement is valid in µTC and has synchronization semantics: the incoming shared is read and the outgoing shared is written. In this case, the DCE algorithm optimizes away the statement.

thread void bar (shared int x) {
    ...
    x = x;
    ...
}

Figure 5.5: The DCE algorithm considers the statement “x = x” to be a useless statement, with a read and a write to the same variable. However, this is a valid µTC statement with synchronization semantics. It acts as a synchronization point with the previous adjacent thread, using the shared synchronized communication channel semantics.

Moreover, DCE also discovers useless code, i.e. code that is considered unnecessary for the computations. Beyond extending the CFG and the SSA forms to support the SVP assumptions, our aim has been to define specific cases in the DCE algorithm where the new SVP operators (such as the create construct or the sync construct) must not be removed and where synchronized communication channels must also not be removed. We follow an ‘evolutionary’ approach to improve the DCE algorithm. Furthermore, a write into an outgoing shared channel (i.e. a write into a shared variable) is seen as a useless statement by the conventional DCE algorithm. As a result, this statement would be removed and the SVP program's semantics broken. At execution time, the consequence is disastrous: the SVP hardware will encounter a deadlock at this point of the program, where the channel awaits a value that will never be written. Figure 5.6 represents the CFG of thread function “foo” of Figure 6.3.


[Figure 5.6 diagram: ENTRY → Block 1: int tmp = x; → Block 2: create (fid; 0; 1000; 1;;) bar(y); sync (fid); → Block 3: x = tmp; → EXIT]

Figure 5.6: Control Flow Graph of thread function foo from Figure 6.3. Block 1 is the entry point and block 3 is the exit. Block 2 is a create-sync block as illustrated in Figure 4.17. The variable ‘x’ is a shared synchronized communication channel. In block 3, the statement is considered necessary since ‘x’ is an outgoing shared which will send data to the adjacent thread. Similarly, create and sync constructs are necessary statements in block 2; the DCE algorithm is extended to support the SVP properties.
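The constraint discussed above can be pictured with a small C sketch of a DCE mark phase. This is a conceptual illustration only, not the thesis implementation: the statement descriptor and its fields are invented, but it shows how a statement touching a synchronized channel is never considered dead even when its value is otherwise unused.

#include <stdio.h>

typedef struct {
    const char *text;
    int value_is_used;          /* result feeds a live computation   */
    int has_side_effect;        /* stores, calls, I/O, ...           */
    int touches_sync_channel;   /* hypothetical SVP-awareness flag   */
} Stmt;

/* An SVP-aware mark phase keeps any statement that reads or writes a
   synchronized communication channel. */
static int is_dead(const Stmt *s) {
    return !s->value_is_used && !s->has_side_effect && !s->touches_sync_channel;
}

int main(void) {
    Stmt plain = { "int z = x * y;", 0, 0, 0 };   /* removable          */
    Stmt sync  = { "x = x;",         0, 0, 1 };   /* must be preserved  */
    printf("plain: %s, sync: %s\n",
           is_dead(&plain) ? "dead" : "kept",
           is_dead(&sync)  ? "dead" : "kept");
    return 0;
}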

5.2.3 Common subexpression elimination

‘Common Subexpression Elimination’ (CSE) is an optimization that searches for instances of identical expressions, which all evaluate to the same value. The CSE analyzes whether it is worthwhile to replace them with a single variable holding the computed value. Again, the problem with the SVP integration in the compiler infrastructure comes with the synchronized communication channels. Figure 5.7 depicts a case where an expression is substituted. In the case that a substituted expression contains a synchronization channel, the loss of synchronization events simply breaks the assumptions of the SVP programming language. The program's semantics are therefore broken. To prevent elimination of statements or substitution of expressions, the CSE algorithm has new cases to handle the presence of communication channels: it simply does not replace these expressions. The solution might make the program's assembly slightly bigger, but this is the price to pay to support this in the compiler and preserve the semantics. Further information about CSE is available in Muchnick's book [73], pages 378 to 396.


[Figure 5.7 code. Left: x = y * z + A; w = y * z * B;  Right: tmp = y * z; x = tmp + A; w = tmp * B;]

Figure 5.7: CSE works over this piece of code (on the left-hand side) and replaces the identical expression “y * z” found in the statements. On the right-hand side, the statements are simpler and smaller, which reduces the number of operations. Nonetheless, operating on statements where synchronized communication channels are involved may simply break the synchronization events of these channels.
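A common way to detect such identical expressions in straight-line code is local value numbering, sketched below in C. The sketch is a simplification for illustration only: variables are assumed to have already received value numbers, and the constrained SVP behavior would additionally refuse to reuse a number for expressions involving synchronized communication channels.

#include <stdio.h>

#define MAX_EXPRS 64

typedef struct { char op; int lhs, rhs; } Expr;   /* canonical expression */

static Expr expr_table[MAX_EXPRS];
static int  nexprs = 0;

/* Return a small id for "a op b", reusing a previous one if the same
   expression was already seen: the essence of CSE. */
static int value_number(char op, int a, int b) {
    for (int i = 0; i < nexprs; i++)
        if (expr_table[i].op == op && expr_table[i].lhs == a && expr_table[i].rhs == b)
            return i;                           /* common subexpression found */
    expr_table[nexprs].op = op;
    expr_table[nexprs].lhs = a;
    expr_table[nexprs].rhs = b;
    return nexprs++;
}

int main(void) {
    int y = 100, z = 101;                 /* pretend these are value numbers    */
    int t1 = value_number('*', y, z);     /* y * z                              */
    int t2 = value_number('*', y, z);     /* y * z encountered a second time    */
    printf("same value number: %s\n", t1 == t2 ? "yes" : "no");
    return 0;
}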

5.2.4 Partial redundancy elimination

‘Partial Redundancy Elimination’ (PRE) is a compiler optimization similar to CSE. The PRE eliminates expressions that are redundant on some but not necessarily all paths through a program. An expression is called partially redundant if the value computed by the expression is already available on some but not all paths through the program to that expression. Cooper's and Torczon's book [74], pages 393 to 404, presents PRE algorithms. An expression is considered fully redundant if the value computed by the expression is available on all paths through the program to that expression. PRE is combined with Lazy Code Motion (LCM), which can move the redundant expression earlier on the program's path; the removal of a partially redundant expression may reduce the code size of a code section. With SVP, the same situation arises as with CSE: the synchronized communication channels can break if the PRE algorithm tackles an expression where a channel is involved. Figure 5.8 illustrates an example of PRE over a block of statements.

[Figure 5.8 code. Left: x = 2 * z * y; k = 3 * z * y; w = 2 * z - y;  Right: tmp_0 = 2 * z; x = tmp_0 * y; k = 3 * z * y; w = tmp_0 - y;]

Figure 5.8: PRE works over the piece of code on the left-hand side. The expression “2 * z” is partially redundant in the first and third statements. Using LCM, the PRE algorithm moves this expression to the beginning of the statement list, as an assignment to a new temporary variable. Then, the partially redundant expression is substituted by the temporary variable. Fewer operations will then be produced in the resulting optimized code.

Moreover, PRE can eliminate partially redundant expressions by inserting the partially redundant expression on the paths that do not already compute it, thereby making the partially redundant expression fully redundant. This optimization is important to compensate for programmers' design choices that insert the same expressions at different code locations. The PRE algorithm is extended to support the SVP communication channels with care. Insertions and movements of expressions involving communication channels are restricted to preserve code consistency.

5.2.5 Combining instruction

‘Combining instruction’, sometimes called combine, is a compiler optimization which occurs at both the middle-end and the back-end. At both levels the concepts are similar, but there are different assumptions about the operations to combine. At the middle-end level, it ‘combines’ several abstract operations into fewer operations or a single one. At the back-end level, the same happens but with machine instructions instead of abstract operations. The SVP synchronized communication channels can be broken by this optimization, in which a statement may simply disappear or be merged into another, as illustrated in Figure 5.9. The fusion of an incoming shared, or outgoing shared, removes the semantics related to its synchronization events.

[Figure 5.9 code. Left: y = x + 1; z = y + 1;  Right: z = x + 2;]

Figure 5.9: Example of combining instructions. On the left-hand side, two statements with a dependency on variable ‘y’ between them. The combine algorithm collapses these two operations into a single canonical one on the right-hand side. The variable ‘y’ is consequently removed from the resulting code and there are fewer operations in the optimized code.

The main concern is to protect statements using synchronized communication channels from being modified in ways that violate the SVP semantics. To achieve that, an ‘evolutionary’ solution is taken: the algorithm is simply extended to prevent undefined modifications of these statements.

5.2.6 Copy propagation

‘Copy propagation’, sometimes called copy prop, is a generic compiler optimization in which occurrences of the targets of direct assignments are substituted with their values. For instance, a direct assignment is an instruction of the form “x = y”, which simply assigns the value of ‘y’ to ‘x’. Figure 5.10 presents a simple example. Copy propagation works with use-def and def-use chains, which provide the occurrences of these targets. A use-def chain contains a use of a variable and all its definitions, i.e. all the code locations where this variable is assigned. In contrast, a def-use chain contains the definition of a variable, for instance an assignment in the code, and all the code locations where this variable is used (i.e. read). The compiler takes care of updating these chains during optimizations.


[Figure 5.10 code. Left: x = y; z = 7 + x;  Right: z = 7 + y;]

Figure 5.10: Example of copy propagation. The input of the transformation is on the left-hand side. The copy propagation algorithm yields the code on the right-hand side. The substitution of ‘x’ by ‘y’ reduces the number of generated operations; the code size becomes smaller.

Further information about the algorithm is available in Muchnick's book [73], pages 356 to 362. During compilation, the problem remains the safety of the communication channels when the copy propagation pass runs.
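The following C sketch shows one common way such chains can be represented: each definition records the statements that use it, which is exactly the information copy propagation consults before substituting a copy. The layout is hypothetical and intended only as an illustration.

#include <stdio.h>
#include <stdlib.h>

/* A def-use chain: one definition (e.g. "x = y" at some statement) and
   the list of statements where the defined variable is read. */
typedef struct UseNode {
    int stmt_id;                  /* statement that reads the variable */
    struct UseNode *next;
} UseNode;

typedef struct {
    const char *var;              /* defined variable                  */
    int def_stmt;                 /* statement that assigns it         */
    UseNode *uses;                /* all reads reached by this def     */
} DefUseChain;

static void add_use(DefUseChain *d, int stmt_id) {
    UseNode *u = malloc(sizeof *u);
    u->stmt_id = stmt_id;
    u->next = d->uses;
    d->uses = u;
}

int main(void) {
    DefUseChain x_def = { "x", 1, NULL };   /* stmt 1: x = y      */
    add_use(&x_def, 2);                     /* stmt 2: z = 7 + x  */
    for (UseNode *u = x_def.uses; u; u = u->next)
        printf("def of %s at stmt %d is used at stmt %d\n",
               x_def.var, x_def.def_stmt, u->stmt_id);
    return 0;
}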

5.2.7 Instruction reordering

‘Instruction reordering’ is a compiler optimization mainly used with instruction scheduling during code generation. It reorders the instructions of a procedure depending on time-and-cost heuristics. Happening in the back-end while scheduling the program's instructions, this optimization moves instructions earlier or later in the instruction sequence. For instance, a long-latency operation, such as a load from memory, is preferably moved earlier in the code to allow time for its result to arrive before it is needed. In other words, the waiting time is interleaved with other instructions. A simple case is illustrated in Figure 5.11 with a create operation and its corresponding sync. A create operation and its corresponding sync operation must occur in that order: first the create, then its related sync. The semantics become different when instruction reordering pushes the sync operator before its corresponding create: the sync operation then becomes a no-op and the create operation loses its synchronization barrier.

[Figure 5.11 code. Left: ... create L5,0(L12) ... sync L5 ...; Right: the same instructions with the sync L5 reordered before the create L5,0(L12).]

Figure 5.11: Instruction reordering example with the input on the left-hand side (it shows a create-sync pair). The create operation appears first in the sequence of instructions, before its corresponding sync. The right-hand side shows valid executable code, generated by a non-updated algorithm, but with different semantics: the sync operation is then interpreted as a no-op and the create operation is not synchronized anymore.

With the SVP operators, the sequence of instructions in the family creation process has to follow a specific order. Another case of instruction reordering is illustrated in Figure 5.12: the create operator is pushed up before the related setting instructions and the allocate instruction, which allocates the data structure for the to-be-created family. The code semantics of the family creation are therefore broken.

[Figure 5.12 code.
Left (correct order):
    ...
    allocate L3,4,10,7,17
    setstart L3,L2
    setlimit L3,L4
    setblock L3,2
    load L14,foo(L1)
    create L3,0(L14)
    ...
Right (after incorrect reordering):
    ...
    load L14,foo(L1)
    create L3,0(L14)
    setstart L3,L2
    setlimit L3,L4
    setblock L3,2
    allocate L3,4,10,7,17
    ...]

Figure 5.12: Instruction reordering example with the input on the left-hand side (it shows a part of the family creation process). The right-hand side shows the result if the instruction reordering algorithm is not correctly extended. In this example, the create instruction appears in the sequence earlier than the allocate, which sets up the family entry and the place of computation. At execution time, the hardware stalls when generating the thread family before its data structure has been allocated. The code semantics are therefore broken.

Therefore, integrating new operators in the back-end must be done with care; the back-end must be aware of the proper sequence of operators before emitting instructions. Introducing a dependency chain throughout the series of instructions solves this problem, as shown in Figure 5.13. This optimization is used when the aggressive optimization mode of a compiler is enabled (e.g. optimization flags -O1, -O2 and -O3 in GCC). The dependency chain between operations prevents reordering by the compiler. With respect to SVP communication channels, the same must be done when the scheduler wants to reorder a write to a shared communication channel before the same channel has been consumed by a read. The extension of this compiler optimization has been done for optimization flag -O1 with an ‘evolutionary’ approach. Optimization flags -O2 and -O3 use their own pass of instruction reordering with a more complex algorithm. In the time frame of this research, these optimization flags have not been investigated; therefore, their use may cause what Figure 5.12 shows.


[Figure 5.13 code, three basic blocks:
Block A:  load L4,5 / load L5,17 / load L6,22
Block B:  allocate L3,4,10,7,17 / setstart L3,L2 / setlimit L3,L4 / setblock L3,2
Block C:  load L14,foo(L1) / create L3,0(L14)]

Figure 5.13: The dependency chain is between blocks and between operations. Three distinct basic blocks show the create process: the first for parameter passing (i.e. block A), the second to set up the family parameters (i.e. block B), the third to generate the create instruction (i.e. block C). The dependency chain between these three blocks prevents the compiler from shuffling them around: block C depends on block B, which depends on block A. A second dependency chain is internal to each block, over its sequence of instructions, preventing what Figure 5.12 illustrates.
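The effect of such a dependency chain on the scheduler can be illustrated with a small C sketch. The encoding is invented for this example only: each pseudo-instruction records which instruction it must follow, and a schedule is legal only if every such edge is respected, which rules out hoisting the create above the allocate and set* instructions.

#include <stdio.h>

typedef struct {
    const char *text;
    int must_follow;               /* index of predecessor in the chain, -1 if none */
} PseudoInsn;

/* Check that a proposed instruction order respects every dependency edge. */
static int schedule_is_legal(const int *order, const PseudoInsn *insns, int n) {
    int pos[16];
    for (int i = 0; i < n; i++) pos[order[i]] = i;
    for (int i = 0; i < n; i++)
        if (insns[i].must_follow >= 0 && pos[insns[i].must_follow] > pos[i])
            return 0;              /* a dependency edge is violated */
    return 1;
}

int main(void) {
    PseudoInsn insns[] = {
        { "allocate L3,...",  -1 },
        { "setstart L3,L2",    0 },
        { "setlimit L3,L4",    1 },
        { "setblock L3,2",     2 },
        { "create L3,0(L14)",  3 },
    };
    int bad[]  = { 4, 0, 1, 2, 3 };    /* create hoisted to the front */
    int good[] = { 0, 1, 2, 3, 4 };
    printf("bad schedule legal: %d, good schedule legal: %d\n",
           schedule_is_legal(bad, insns, 5),
           schedule_is_legal(good, insns, 5));
    return 0;
}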


5.3 Discussion and conclusion

5.3.1 SVP optimizations

With SVP, there is, in theory, no need for SVP-specific optimizations to achieve good performance on the Microgrid architecture configurations, as presented in Section 7.2. The Microgrid architecture tolerates high-latency instructions: the pipeline interleaves the threads' instructions when one instruction requires data which is not yet ready. Consequently, during this research, investigating SVP-specific optimizations was not a priority. However, with SVP compilation, the concurrency tree is the way of representing a program's concurrency. The concurrency tree contains information on each thread family: the maximum number of threads per core (i.e. the block) and the ideal place to run the threads (i.e. the place). Reshaping the concurrency tree of a program by changing the family block size and other family parameters is a potential direction. In order to achieve this, the target must be able to report back to the compiler, in some way, run-time metrics of the selected branch or the entire tree. This is called feedback compilation, where a compiler and a target work together to find a better way to optimize a program. At run-time, the Microgrid target profiles the behavior of the program's concurrency tree. Across several executions, the compiler receives relevant information from the target to tune the family and thread information contained in the program's concurrency tree. The optimizing algorithm will then reshape the workload of the thread families and find the best balance over the program's definition. Moreover, the target's resources are static at design time, but their availability may change at execution time. The compiler can introduce variables and code sections at compile time to permit other execution paths. These code sections and variables will be interpreted by the run-time system, which configures and optimizes execution based on this special information. These ideas arose when investigating SVP compilation during this thesis work. The amount of work to implement them was too high given the priority of getting a working compiler to run experiments and the time remaining; therefore, they are left for future work.

5.3.2 Conclusion

The investigation reported in this chapter contributes to the rest of this thesis, where compiler optimizations over concurrent code have to be taken into consideration. This investigation sorted out the optimizations that must be constrained to work in this context, such as SSA, DCE, CSE, etc. Some optimizations are simply not required to gain good performance with the Microgrid configurations, such as reordering long-latency operations sooner in the pipeline (the architecture already takes care of that at run-time by interleaving the threads' instructions in the pipeline). Other, novel optimizations that could benefit the code require a so-called ‘revolutionary’ approach. This approach results in adding an entirely new component to the compiler infrastructure, versus an ‘evolutionary’ approach consisting in extending existing optimization algorithms with new code sections.

With SVP and its implementations, we believe in exposing concurrency constructs directly to the compiler's internals; the compiler can then be aware of concurrency interactions during a program's compilation. The main objective for compilers is to avoid gray areas in the code in concurrent programming. In order to achieve that, ‘evolutionary’ methods are possible, with special handling cases in optimization algorithms. However, for future work, it would be interesting to investigate ‘revolutionary’ methods, such as the ones mentioned in Section 5.3.1. Furthermore, it would also be relevant to investigate the interactions of more aggressive optimizations, for instance optimization flags -O2 and -O3 in GCC. Chapter 6 will show the challenges encountered in extending the compiler framework and in enabling the reuse of existing technology (e.g. compiler optimizations and the imperative compiler infrastructure). To conclude, we have investigated how sequential assumptions clash with the SVP concurrent assumptions. The presence of a working SVP compiler proves that, despite the effort and the challenges, the integration of native concurrency in a compiler's internals is possible.

Chapter 6 Implementing the SVP compiler

Compilers are the Queen of Computing Science and Technology. They have long been the bridge from applications to systems, but now they determine which architectural features should be implemented in new hardware, as well as which new language features will be effective for software developers.
David J. Kuck, Intel

While Chapter 4 has discussed the basics of conventional compilation and also advanced SVP compilation, Chapter 5 has reported the hazards of compiler optimizations with SVP properties. Methods and solutions to expose the SVP properties in imperative-language compilation generated considerable research effort, which this chapter presents as an experience report. In this chapter, we discuss the effort invested in designing and extending our experimental compiler. The role of the compiler in the SVP system is critical, because it is the bottleneck within the SVP parallel computing system, bridging two worlds: the software and the hardware. The pressure on the compiler

The contents of this chapter are based on these publications:

• T.A.M. Bernard, C.R. Jesshope, and P.M.W. Knijnenburg – “Strategies for Compiling µTC to Novel Chip Multiprocessors”, in: International Symposium on Systems, Architectures, MOdeling and Simulation, S. Vassiliadis et al. (Eds.): SAMOS 2007, LNCS 4599, pp. 127-138, 2007.

• T.A.M. Bernard, C. Grelck, and C.R. Jesshope – “On the Compilation of a Language for General Concurrent Target Architectures”, in Parallel Processing Letters, 20, (1), March 2010.

is therefore very high. In that context, we discuss design decisions and their impact on compiler research, including limitations due to the experimental environment. One relevant question, for instance, concerns the reasons for not writing a new compiler from scratch and instead extending an existing framework. Another question concerns the selection of the research compiler platform. We then discuss compilation issues, both generic issues with concurrency and specific issues with the Microgrid platform.

6.1 Role of the compiler

As part of any computing system, the role of the compiler is often misunderstood in terms of the scope of its properties and functionality. What a compiler does and is supposed to do seems straightforward to developers: tell me whether my code is correct; if so, improve it for this platform; but above all, make it executable. Nevertheless, modern imperative-language compilers are, in general, very large software systems and are seen as black boxes by software developers, as discussed in Section 4.3. Their advanced infrastructure enables maximal modularity in combination with language front-ends and target platform back-ends. In addition, to please developers, compilers contain a huge variety of passes comprising transformations and optimizations to take maximal advantage of any input program. Consequently, compilers are usually evaluated with the following criteria:

• speed: for large software systems to compile, such as the compiler itself, the speed of compilation is relevant to developers to keep turnaround times convenient during software development.

• space: while compiling a program, compilers deal with huge amounts of data and memory structures, which directly impact the previous criterion of compilation speed. Therefore, using an optimal amount of memory space while compiling is a major criterion for an efficient compiler.

• feedback/diagnostics: developers are interested in a compiler that can let them know whether their program is correct. Therefore, the quality of diagnostics is another relevant criterion for selecting the appropriate compiler for any software development.

• debugging: based on diagnostics and feedback, compilers often offer inter-stage dumps of their representations, with extra information for advanced users, which can pinpoint problem locations in the code.

• compile-time code efficiency: this criterion is based on a combination of compiler components such as register allocation, the instruction scheduler, etc. The quality of the generated code directly depends on back-end features and optimization performance. The code size of an input program can be completely different, depending on whether the instruction scheduler is efficient for a specific architecture.

6.2 Compiler design decisions

We had to make design decisions in order to get results as soon as possible, considering research time, implementation, experiments, etc. This section summarizes these decisions and discusses their impact on the research. To give some perspective on the complexity of compiler research, the source-to-source SAC [78] compiler was created from scratch by a team of scientists in the 1990s. The idea was to build a compiler for a functional language targeting the C language. The requirements for a compiler were therefore different for that project, where the target is static and the language only evolves along with the compiler's features. Ever since, the SAC compiler has gained maturity and new features with time and considerable effort.1 In the context of this thesis, the language and architecture features evolved during compiler development undertaken by a small team (one PhD student as permanent researcher and developer, with temporary support from other CSA group members). Because of time constraints and available manpower, we decided not to start a new compiler from scratch. Instead, we looked for a suitable candidate compiler framework to be extended and used as a proof of concept. Writing a compiler from scratch is simply out of scope for this research. Moreover, there is no relevant scientific interest in reimplementing existing technology, which would consume a considerable amount of time.

6.2.1 Compiler selection

We have selected an imperative-language compiler for our research after investigating a set of available compilers from industry and academia. Several criteria are relevant to allow a rapid deployment and implementation, and then to focus on the important topics of this research. Besides the framework itself, the time constraints of this research and the available manpower made an active community of compiler developers and researchers an important selection factor. To reduce the overhead of reimplementation, there are also technical constraints related to the presence of a C front-end and an Alpha back-end. A big advantage would be a modular and easy-to-modify compiler, such as a compiler framework. Open-source software systems are interesting because there are no copyright constraints and all the source code is available. In the long run, we aim to have a candidate that can support 32- and 64-bit instruction sets to enable retargeting to other instruction sets for further investigation.

1Summarized from a private communication with Dr. C. Grelck


The CoSy compiler

The CoSy compiler framework [79] is an advanced modular compiler framework. The system is based on a central internal representation (CCMIR) which is an interface to all internal compiler components, named engines. These engines analyze, read, write and rewrite the CCMIR. The whole concept is to obtain a composable compiler depending on the requirements of a specific language, architecture and functionality. With this unorthodox design, the CoSy compiler is far ahead of its competitors in terms of flexibility to target new architectures, insert new engines and deploy a production-quality compiler extremely fast to industrial standards. In theory, the CCMIR could be annotated to hold concurrency information, and specific engines could be added or modified to support the SVP properties. Nonetheless, in the context of this thesis, there was a need for an Alpha back-end. The Microgrid target simulator already existed and employed an extended version of the Alpha instruction set. Unfortunately, developing an Alpha back-end from scratch was not relevant to the research topics of this thesis, and the overhead was considered too high given the restricted research time (the length of a PhD program).

The SUIF compiler

At first, the SUIF compiler [80, 81] was considered as a candidate for this research. We eventually turned it down because of a lack of documentation and of support from an active community. When this research started (in early 2006), the compiler was no longer supported. This compiler is designed to enable program analysis to parallelize sequential programs for shared-memory multiprocessors.

The Open64 compiler

Open64 [82] is an open-source research compiler targeting a 64-bit architecture. Open64 is a suite of optimizing compiler development tools designed for Intel Itanium processors. It employs 5 IR layers which get closer to machine language as the compilation process proceeds. This gives a distinct compartmentalization where new components can be added at different points of compilation. Open64 has C/C++ and Fortran 90/95 front-ends. However, there is no Alpha back-end in this framework. The University of Delaware (USA) is the gatekeeper of this project; nonetheless, no activity or support was present when this research started.

The ORC compiler

The ORC compiler [83] is a branch of the Open64 compiler. This compiler targets the IA-64 architecture, specifically the Itanium Processor Family. The project is based on a collaboration between Intel Corporation and the Chinese Academy of Sciences. The input languages are C/C++ and Fortran. The main purposes of ORC are new optimizations to better utilize Itanium architectural features and new features to facilitate future research. Issues similar to those of Open64 are present regarding the lack of support and documentation, plus the lack of an Alpha back-end. Therefore, we decided not to use this framework.

The LLVM compiler

Partly based on the GCC toolchain, the LLVM compiler has become more popular in the compiler domain in the last few years. At the beginning of this research, this compiler was not mature enough to be considered as a candidate. The initial motivation of LLVM was to be more aggressive and more efficient than the GCC compiler with a simpler infrastructure. Consequently, some parts of GCC have been reused, such as the different language front-ends. A convenient advantage of LLVM is its intermediate form (IF), which can be produced by external tools, such as the GCC toolchain, and then fed into the LLVM framework. If we had to choose a framework now rather than at the beginning of this research, LLVM would probably be the candidate we would be using.

The LCC compiler

The Little C Compiler (LCC) [51] is a small retargetable compiler for personal use and educational purposes: it is simple to understand and well documented. There are sufficient back-ends, including an Alpha one, which makes this compiler a decent candidate. However, it is limited; there is no real intermediate representation nor advanced optimizations embedded in this software system. Therefore, there is little interest in using it for compilation research. The compiler candidate needs to have a longer-term development and support scheme.

The GCC compiler

The GNU Compiler Collection (GCC) [54] is one of the best known compilers on the market. Present on most UNIX-based systems, it is an open-source project targeting 32- and 64-bit architectures (back-ends include Alpha, MIPS, SPARC, etc.). The input languages are C, C++, Objective-C, Fortran, Java, Ada, etc. Newer versions of GCC are frequently released and supported by a large community of developers and users. New features and corrections appear at each release. The good point for GCC is that it is suitable for a wide range of platforms. On the other hand, it is very difficult to understand its infrastructure and to modify it easily and quickly for our project with a small development team. Up-to-date documentation is still a problem, as with any very large software system [84]; the presence of an active community with a quick response time makes it a decent compromise. This imperative-language compiler has been on the scene for 20 years; it has evolved with an important set of optimizations and an efficient code generator, which still make GCC an efficient compiler. A new register allocator (the integrated register allocator) [85] has indeed brought big improvements, as have optimizations such as auto-vectorization. Nonetheless, GCC suffers from its infrastructure design decisions; the compiler shows its age. Consequently, extending it requires a lot of effort and time spent on verification and validation. This contrast became clear when we observed the difference in flexibility between CoSy and GCC while generating a dedicated compiler. CoSy enables this with its modularity-based construction scheme, whereas GCC can perform similar tasks but with more engineering effort. Nonetheless, recent effort is being made to modernize the entire infrastructure with the ICI plug-in initiative [86] and with the Modular-GCC initiative [87]. This compiler was chosen because it was the least problematic of the compilers evaluated.

6.2.2 Impact of design decisions

Approach to SVP compiler conception

The conception of the compiler prototype follows the underlying SVP idiom of exposing explicit concurrency in the source code (i.e. µTC). Obviously, the front-end must be able to parse and interpret the new constructs in the input language. One option is to follow a compiler improvement such as the work realized in GUPC [29], where a minimum of compiler components have been extended and a library has been added to the GCC framework; in that case, minimal changes were performed in the front-end to support the new keywords of the UPC language. However, in the context of our research, there is a need to extend deeper into the infrastructure in order to support SVP properties and the Microgrid architecture: extension of the Alpha back-end and back-end generator with SVP assumptions. In theory, µTC constructs map directly onto instructions for the Microgrid platform. For comparison, OpenMP is implemented as a library within the GCC framework [32]; consequently, at compile-time, the compiler inserts calls into this library each time it deals with concurrency constructs. In our compiler implementation, we made the design decision to fully embed the SVP concurrency idioms in the compiler's internals. Although this integration requires more engineering work, we believe in a long-term vision of supporting concurrency natively in the compiler framework. As a result, concurrency constructs are fully exposed to the compiler, which can optimize the program's representation with extended optimizations. This decision has partially caused the problems presented in this thesis. Conventional optimizations that work very efficiently can be reused to enhance the program's code with respect to SVP semantics; reusing compiler technology avoids wasting time reimplementing it from scratch in the context of this research.
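To make this contrast concrete, the following sketch (hypothetical, not taken from the actual GCC-UTC sources or any existing runtime library) shows roughly what the two strategies amount to from the compiler's point of view. In the library approach, the concurrent body is outlined into a plain C function and the optimizer only sees an opaque call to a runtime routine (here the invented name spawn_family); in the native approach, the µTC create/sync constructs remain visible in the compiler's internal representation until code generation.

    /* Library-style lowering (conceptual): the optimizer sees only a call. */
    void spawn_family(void (*body)(int, double *), int start, int limit, double *arg);

    void body(int i, double *a) { a[i] = a[i] * 2.0; }

    void run_library(double *a, int n)
    {
        /* spawn_family is a hypothetical runtime entry point. */
        spawn_family(body, 0, n, a);
    }

    /* Native µTC: the create/sync constructs stay visible to the compiler. */
    thread void body_t(double *a) { index i; a[i] = a[i] * 2.0; }

    thread void run_native(double *a, int n)
    {
        create(fid; ; 0; n; 1;) body_t(a);
        sync(fid);
    }

In the first variant, any concurrency-related property is hidden behind the call boundary; in the second, the optimizer can reason about the family and its channels directly.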


GCC structure

Figure 6.1 illustrates the partition of GCC components over the entire framework. We have calculated the lines of code (LOC) for each file of the compiler with SLOCCount [88]. We use as our platform the GCC 4.1 Core Release, which is a special release of the compiler toolchain with a single C front-end, all existing back-ends, and all optimization passes. The C front-end (FE) has a rather small size (around 4% of the overall amount) compared to GCC's back-end generator and back-end descriptions (BE) (65% of the software system). GCC has more than 30 working back-ends including x86, IA64, Alpha, MIPS, etc. The infrastructure (INFRA), with an 11% share, gathers garbage collection methods, data structures, debugging methods, the compiler driver, etc. that are utilized in the entire framework. With this framework, we could retarget other back-ends in the future to evaluate the SVP architecture implementation with various instruction sets.

[Pie chart: BE 64.82%, ME 13.32%, INFRA 11.44%, LIB 6.45%, FE 3.97%]

Figure 6.1: Compiler composition of GCC 4.1 Core Release. Partition of code lines for each major GCC component: front-end (FE), middle-end (ME), back-end (BE), libraries (LIB), infrastructure (INFRA). Measures calculated on an unmodified GCC 4.1 core release with a single C front-end, standard back-ends, and standard optimizations. These measures are calculated with SLOCCount [88].

As Figure 6.1 shows, the GCC 4.1 framework is dominated by back-end code. The reason is mainly the large number of back-ends and a back-end generator that allows the addition of new back-ends. GCC 4.1 compartmentalizes optimization passes and transformation passes; this makes the addition of new optimizations rather easy. This compiler is clearly designed for adding new optimizations and new back-ends. Nevertheless, as soon as we step off this paved road, the compiler internals are obscure, with large and undocumented code files. Moreover, heavy inter-file dependencies are present, which make the overall infrastructure troublesome.


Analysis of SVP extension

Selecting GCC as our research compiler framework allows the reuse of advanced compiler technology. Nonetheless, the scientific question is whether this technology can be reused when integrating concurrency idioms in the compiler's internals. We have calculated and summarized in Figure 6.2 the effort we have invested in each compiler component. Although measured in LOCs, the effort depicted here only gives a measure of implementation effort and not of the engineering and research effort. It is hardly possible to represent the research time spent on each LOC: sometimes a few LOC required considerable research effort to be safely introduced into the framework. This problem is well known when improving very large software systems. Therefore, the LOC counts indicate which parts required extensive modification to support the SVP assumptions, the SVP language implementation (i.e. the µTC language), and the SVP architecture implementation (i.e. the Microgrid target).

[Bar chart, location of changes in GCC-UTC (percentage of changed lines per component): FE 7.1%, ME 3.5%, BE 88.52%, LIB 0.15%, INFRA 0.74%]

Figure 6.2: Distribution of changes over GCC framework components. The GCC framework comprises 5 major components: front-end (FE), middle-end (ME), back-end (BE), libraries (LIB), infrastructure (INFRA). The changes are made in the GCC 4.1 core release, which is a special release of the compiler toolchain with a single C front-end, all existing back-ends, and all optimization passes. A main extension is the addition of a new, specific Alpha-extended back-end. Most changes in the middle-end are to handle SVP assumptions safely and properly during transformations.

Figure 6.2 shows large changes in GCC's back-end. A specific back-end supporting the SVP instructions has been created, based on the existing Alpha back-end. The back-end generator is extended with methods to support the architectural features related to the SVP operators (added to the abstract machine) and the synchronized communication channels. For that purpose, peephole and back-end optimizations are extended to respect SVP assumptions. Similarly, in the middle-end, inter- and intra-procedural optimizations are extended for the SVP assumptions that are exposed in the µTC language.

Generally, exceptions in optimizations had to be taken care of with special exception-handling cases, without any need to redesign the entire optimization algorithm. However, some aggressive optimizations have been completely disabled because of the hazardous effects they have on concurrent code. For instance, instruction reordering, employed by the instruction scheduler, may cause trouble when changing the order of instructions and thereby break SVP assumptions. The back-end generator is extended to support SVP actions (properties of the execution model) as native operators. The SVP implementation has the property of extending an existing set of actions (extension of an existing instruction set [43]). Therefore, we apply the same scheme in a way that exposes these properties in the back-end; the back-end gets extended. The create action is implemented as a native back-end operator comparable to a special call with specific side-effects. At the create issue point, the flow of control acts differently from being transferred to the routine as in the imperative paradigm. The create operator maps the separation of control flows and allows the back-end optimizations to deal with multiple concurrent control flows. Other SVP actions, such as break and kill, are also implemented as native operators. Native operators are part of the GCC abstract machine, which enables a more generic concept of operations. Thus, it becomes simpler to retarget to another instruction set.

6.3 Compilation challenges

This section collects the experience acquired during the compiler research and development, and the observations made while compiling a concurrent program. The challenges encountered are both generic and specific to the Microgrid target: e.g. the lack of information about new types or program structures throughout the compilation stages, and also carrying relevant information about concurrency from one stage to another without any loss.

6.3.1 Issues related to concurrency

The first of the challenges results from the way concurrency is exposed in the SVP concurrent programming paradigm. Integrating concurrent idioms into a modern imperative compiler is one technical contribution of this thesis.

General difficulties

Boehm [48] exposes the limits and dangers of compiler-driven optimizations when using library functions for exposing concurrency in code. However, integrating SVP concepts natively into an existing compiler infrastructure allows one to reuse existing and adaptable optimizations which have been researched and improved for decades. Thus, it allows one to perform more efficient code generation

instead of reimplementing from scratch all existing optimization algorithms. In order to achieve this, the compiler must be made aware of the SVP constructs and semantics and must adapt to these in the code, both in sequentially-based and in parallel-extended optimizations. Moreover, the compiler must protect concurrent regions in the code (especially synchronizing objects in µTC) from aggressive code optimizations.

Besides the well-known issues of compiler code size and compiler code complexity, the major problem encountered is that this sequentially-oriented compiler infrastructure is concurrency-agnostic. Such compilers have been developed over years following an incremental development pattern rather than rebuilding the whole system from scratch when new assumptions or components are inserted. Furthermore, the compiler assumes sequential execution throughout its compilation stages: from optimization algorithms to code generation schemes. Our approach requires concurrency support, not by using plug-ins or external libraries, but by extending the compiler infrastructure, i.e. the models presented in [20]. The novelty, and hence the contribution of this work, in addition to the obvious, i.e. an efficient working compiler for the Microgrid architecture, is in identifying and presenting the challenges encountered while extending a sequentially-oriented compiler.
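As an illustration of the danger Boehm describes (this example is ours, not taken from [48]), consider a plain C program where concurrency is hidden behind a library call. Because the compiler sees only ordinary memory accesses, it may legitimately cache the flag in a register, and the loop may then never observe the store performed by the other thread.

    #include <pthread.h>
    #include <unistd.h>
    #include <stdio.h>

    static int done = 0;   /* shared flag, no atomics or volatile: a data race */

    static void *worker(void *arg)
    {
        (void)arg;
        sleep(1);
        done = 1;          /* this store may never be seen by the spinning reader */
        return NULL;
    }

    int main(void)
    {
        pthread_t t;
        pthread_create(&t, NULL, worker, NULL);
        /* A concurrency-agnostic optimizer may hoist the load of 'done'
           out of the loop, turning this into an infinite loop. */
        while (!done)
            ;
        pthread_join(t, NULL);
        printf("finished\n");
        return 0;
    }

Making the compiler aware of concurrency, as is done here for SVP, removes this class of miscompilation, because synchronizing objects are distinguished from ordinary memory in the internal representations.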

Parallel assumptions

A bird's eye view of the integration of the SVP constructs would be the extension of the sequentially-oriented C front-end with the new language constructs, which then map into extra instructions and special features in the back-end, as the compiler transformations in Section 4.2 illustrate. However, looking closer at the compiler infrastructure, it also has to support new assumptions for dealing with the idioms in order to produce valid code. Therefore, new nodes in the tree representation and new objects in the internal representations are added to support parallel semantics throughout the compilation stages. The extension of the infrastructure with new idioms implies the addition of new semantics to the assumptions made in the compilation schemes. Hence, the high-level to low-level representation translations are extended to propagate the SVP semantics. Sequentially-oriented compilers (in our case, the GCC compiler) make the assumption that only a single thread of control is running at any time. In the µTC language, the thread function is the smallest composition unit, which is seen as a concurrent region with its own context of objects. Thus, a family of threads is a set of concurrent regions which are executed as multiple concurrent control flows running at the same time between the issue point (i.e. create) and the completion point (i.e. sync) in the parent context, as shown in Figure 3.2. Furthermore, communication is allowed between concurrent regions using synchronizing objects. In passing parameters, communication may occur between the issue point and the completion point in the parent thread and all threads using globals, and with the first child using shareds. Also, synchronized communication may occur between sibling threads using only shared synchronizing objects, as shown in Figure 3.6.

These objects are expressed in the internal compiler representations with an extra attribute of a given object type. Nevertheless, they cause problems with sequential assumptions in the optimizations if the compiler does not sufficiently distinguish between the synchronizing objects and the regular ones (cf. Chapter 5). For example, in Figure 3.11, a conventional C compiler would optimize away the statement computing the reduction in thread function ddot during dead-code elimination, since the code resides in a leaf of the control flow graph (CFG) which never returns anything (i.e. the statement is assumed useless). In SVP, the statement has different semantics since a synchronizing object is present. We also observe that some optimizations are harmless for sequential C code but dangerous for parallel µTC code, as reported in Chapter 5. Another example in Figure 6.3 uses a synchronizing object as a token across adjacent threads. With C semantics, the statements tmp = X; and X = tmp; would be optimized out by various standard optimizations such as copy propagation, instruction reordering or combining optimizations. With µTC semantics, the optimized program would not be semantically equivalent to the original program because of broken synchronized communication channels. The µTC compiler is aware of this communication and ensures its validity even in the case of aggressive rearrangements in instruction scheduling. Although those optimizations fully obey the C semantics, they can change the meaning of µTC programs and cause visible effects that break the µTC semantics. The optimizations have been adapted in order to avoid endangering the semantics of µTC code.

6.3.2 Issues related to the Microgrid

Part of the challenges are caused by architecture decisions that forced the compiler to be developed and extended in a specific way.

Resource mapping

In the case of µTC, a new kind of object has been added with different access semantics: shared synchronizing objects whose read and write operations have the i-structure semantics described in [62]. These are intended to directly map to registers with i-structure behavior in hardware. Here, we observe that the approach taken by the SVP core architecture implementation creates an issue that we believe has not been researched previously: as a way to maximize the number of threads per core within a limited physical register file, threads can be created with variable register windows, where a variable number of physical registers are mapped in the architectural register window of each thread for either local (exclusive) use, or physically shared as i-structures with other threads.
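For readers unfamiliar with i-structure semantics, the following self-contained C sketch (our illustration, not part of the SVP hardware or the µTC toolchain) models a single-assignment synchronizing cell in software: a read blocks until the cell has been written once, which is the behavior the Microgrid provides directly in its shared registers.

    #include <pthread.h>

    /* A software model of an i-structure cell: write once, reads block until full. */
    typedef struct {
        pthread_mutex_t lock;
        pthread_cond_t  filled;
        int             full;
        double          value;
    } istructure;

    void istructure_init(istructure *c)
    {
        pthread_mutex_init(&c->lock, NULL);
        pthread_cond_init(&c->filled, NULL);
        c->full = 0;
    }

    void istructure_write(istructure *c, double v)
    {
        pthread_mutex_lock(&c->lock);
        c->value = v;          /* single assignment: written exactly once */
        c->full = 1;
        pthread_cond_broadcast(&c->filled);
        pthread_mutex_unlock(&c->lock);
    }

    double istructure_read(istructure *c)
    {
        pthread_mutex_lock(&c->lock);
        while (!c->full)       /* reader suspends until the value is available */
            pthread_cond_wait(&c->filled, &c->lock);
        double v = c->value;
        pthread_mutex_unlock(&c->lock);
        return v;
    }

On the Microgrid, this suspend-on-empty behavior is provided by the register file itself rather than by a software lock, which is why the compiler must never spill such registers.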


/* thread function definition: 1 shared, 1 global. */
thread void foo(shared double X, double Y)
{
    int tmp = X;
    create(fid;;0;1000;1;;) bar(Y);
    sync(fid);
    X = tmp;
}

thread void main(void)
{
    /* point of use: create a family of 1000 threads */
    create(fid;;0;1000;1;;) foo(X = 0, b);
    sync(fid);
    /* after synchronization 'X' contains the result of foo in the parent. */
}

Figure 6.3: Example µTC: Shared object used as a token to enforce sequential constraints.

An example is given in Figure 3.10. This removes the long-standing assumption that the size of the architectural register window is a constant for all programs. Also, the compiler cannot reuse physically shared registers via spills: a spill itself would require a read from the register, thus forcing synchronization with the previous thread and reducing concurrency; a write of a local temporary value that motivates a spill would unblock any read in the next thread with the temporary value instead of the final desired dependency. Only local registers, exclusive to each thread, can be used to map temporaries and automatic variables. As a naive route towards reusing existing register allocators for the C language, we could arbitrarily select a common fixed number of registers for shared and local use by all programs; these numbers would be part of a standard application binary interface that compilers would adhere to when generating code. However, a conservative selection that would leave sufficient room both for sharing between threads and for local use by each thread would cause excessive pressure on the register file, by forcing the hardware SVP processes to allocate a large physical register window for every thread created; this in turn would reduce the local concurrency by capping the number of threads that can be simultaneously allocated on each core. As this has a direct impact on the amount of latency the architecture can tolerate on asynchronous operations, we deem it desirable to adapt the architectural register window size to the requirements of each thread function. The framework in which this issue is addressed can be described as follows:


• the selection of the layout of the visible register window for a given thread function is the responsibility of the compiler; it is provided to the assembler along with the generated code for the thread function;

• given a Microthreaded architecture, this selection is constrained by the architectural limits, namely the maximum number of visible architectural registers per class (e.g. integers/floats) and possible special registers which cannot be shared (e.g. ReadAsZero);

• due to µTC semantics, the selection is further constrained by the arbitrary composability of constructs allowed by C and, by extension, µTC; in particular, the maximum number of registers that can be declared as parameters for a given thread function F to be shared at run-time with its parent cannot be greater than the maximum number of local registers available for sharing by any thread function that can create a family of F.

We formalize this as follows: considering a single register class on an SVP architecture implementation where there are at most R visible registers, the compilation of a thread function Fp and any of its potential children Fc is constrained by the following formulae:

2·Sp + Gp + Lp ≤ R    (6.1)
Sc + Gc + Cp + Lp* ≤ Lp    (6.2)

where Sc/p, Gc/p and Lc/p are the numbers of shared, global and local registers in the register window layout declared to the assembler for Fc and Fp, Cp is the number of overhead local registers needed to perform the create operation, and Lp* is the number of local registers that must be reserved in Fp for other uses (e.g. the thread-local storage pointer). Formula 6.1 represents the architectural constraint; the number of shared registers appears twice since there are sharing regions with both adjacent threads, as illustrated in Figure 3.10. Formula 6.2 represents the sharing of the local register window of a thread with its child family. Both formulae imply a recursive constraint on the entire concurrency tree of a program. In addition to these "hard" constraints, the architecture also suggests optimizing register file usage, by choosing register windows for all thread functions involved in a program such that the total number of physically required registers is kept as small as possible. This framework is new in that an optimal solution to this formalized constraint system is a code generation algorithm where the number of local registers available for code generation is both an input and an output of register allocation. In particular, while the selection of a larger number of local registers in general allows for reduced memory accesses, it is even more important to allow for more concurrency by reducing pressure on the register file.
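As a concrete reading of formulae 6.1 and 6.2, the small C helper below (an illustration of the constraints only, not part of the SVP compiler) checks whether a proposed register window layout for a parent thread function and one of its child thread functions satisfies both inequalities for a given architectural limit R.

    #include <stdbool.h>

    /* Register window layout for one thread function (one register class). */
    typedef struct {
        int shared;      /* S: registers shared with adjacent threads       */
        int global;      /* G: read-only registers from the parent          */
        int local;       /* L: registers for exclusive local use            */
    } reg_window;

    /* Formula 6.1: 2*Sp + Gp + Lp <= R (shareds counted twice, one sharing
       region per adjacent thread). */
    bool fits_architecture(reg_window p, int R)
    {
        return 2 * p.shared + p.global + p.local <= R;
    }

    /* Formula 6.2: Sc + Gc + Cp + Lp_star <= Lp, where Cp is the create
       overhead and Lp_star the locals reserved in the parent for other uses. */
    bool child_fits_in_parent(reg_window p, reg_window c,
                              int create_overhead, int reserved_locals)
    {
        return c.shared + c.global + create_overhead + reserved_locals <= p.local;
    }

For example, with R = 31 a parent declaring 4 shared, 2 global and 20 local registers satisfies 6.1, and can then host child families whose shared and global declarations, together with the create overhead and reserved locals, fit within those 20 locals.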


Register window

Along with the register mapping, the SVP architecture implementation uses variable register classes (global, dependent, local and shared). These register classes are dynamically sized based on the thread function to which they correspond. One thread function's context has one register window, which differs from one thread function to another. In addition to the issue of the previous section, the back-end would require different register conventions for different thread functions. These thread functions are instantiated with the create mechanism, which reuses the call mechanism of the sequential model with special extensions to handle SVP side-effects on control flows and parameter passing. The calling convention is then not fixed with this register window scheme and register allocation is severely impacted, as previously described. The SVP compiler restricts itself to fixed register classes, since a fully variable back-end register allocator appeared to be too complex to realize within the research time frame of this thesis work.

Communication channels

The exposure of concurrent regions and of inter-thread communication is also a specific issue with the Microgrid target implementation. This issue has been extensively covered in Chapter 5. In theory, communication channels are synchronizing objects added in µTC and mixed with conventional, non-synchronizing objects. The synchronizing objects are also present in the Microgrid instruction set, with a mapping onto a pair of registers for shared communication channels. Representing communication channels in a concurrent programming paradigm is a challenge; SVP implements communication channels on registers. The language and hardware specifications of the implementation of these global and shared communication channels are the source of most of the aforementioned challenges. The main advantage of this implementation is that inter-thread communication is cheap when using registers, and thread family creation is cheap as well.


6.4 Discussion and conclusion

6.4.1 Language and architecture pressure

Compiler research implies the existence of an input language and a target architecture. In the context of the SVP computing system, changes in both language and architecture severely impacted the compiler. It is difficult to reflect in this thesis the scope of their impact beyond the work already described to accommodate the SVP implementation. It is clear that compiler research is driven by forces coming from both architecture and language, which put considerable pressure on the compiler functionality. A simple change in the instruction set may break internal compiler assumptions about the way it is processed during compilation. In the case of SVP, for instance, the use of a 'pull' mechanism with the create operation in the instruction set made the back-end more complicated than it could have been (cf. Sections 3.3.6 and 4.3.3). The back-end then has to be aware of the bindings between the resources of the argument list and the resources of the thread parameters. The 'pull' create mechanism takes the created family's globals and shareds directly from the context of the parent thread. This 'pull' mechanism has two downsides:

• it creates a dependency on the parent thread’s context that is unwanted for creates with no link with their parent, called continuation creates in the SVP model.

• it complicates the compiler design, as the registers that were shared with the child family would be removed from further allocation (frozen) between the create and sync.

On the contrary, a 'push' create mechanism would have made the compiler back-end much simpler. The parent thread would explicitly copy the globals and shareds over to the child family's context, thus removing any dependency on its context. Next to this, there is one dependency left on the parent context: the sync register. This register is cleared by create and written when the family has completed. This means that the register is still off-limits between create and sync in the program. To remove this last dependency, the create instruction is split into two: 'create' and 'sync'. The destination register of the 'push' create is cleared as usual but is written when the family's context has been created and globals and shareds can be written to it. The 'sync' instruction writes its destination register when the specified family has completed. Since both these instructions can be considered atomic by the compiler, no special register allocation logic is required between the create and sync points in a program. By having the parent explicitly request all interaction (such as sync and parameter passing) with the child family through its family identifier, families no longer need to know who and where their parent is.
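The following C sketch is only a conceptual model of the 'push' alternative described above (the names and the family-context structure are invented for illustration; the real mechanism lives in the instruction set and hardware). It shows why the parent no longer needs to keep registers frozen: all parameters are copied into the child family's context, and the only remaining interactions go through the family identifier.

    #include <stdio.h>

    typedef struct {
        double global0;   /* parameter pushed by the parent, read by all threads */
        double shared0;   /* running value of the shared channel                 */
        int    start, limit;
    } family_ctx;

    static family_ctx families[16];
    static int next_fid = 0;

    /* 'create': allocate a family context; the parent only keeps an identifier. */
    int create_family(int start, int limit)
    {
        int fid = next_fid++;
        families[fid].start = start;
        families[fid].limit = limit;
        return fid;
    }

    /* The parent explicitly pushes parameters into the child context: after this
       call no parent register needs to stay reserved for the child family. */
    void push_params(int fid, double g, double s)
    {
        families[fid].global0 = g;
        families[fid].shared0 = s;
    }

    /* Stand-in for the threads' work: accumulate global0 * i into the channel. */
    static void run_family(family_ctx *f)
    {
        for (int i = f->start; i < f->limit; i++)
            f->shared0 += f->global0 * i;
    }

    /* 'sync': wait for completion and read back the final shared value.  In this
       sequential model the work simply runs here; in hardware it is asynchronous. */
    double sync_family(int fid)
    {
        run_family(&families[fid]);
        return families[fid].shared0;
    }

    int main(void)
    {
        int fid = create_family(0, 1000);
        push_params(fid, 2.0, 0.0);   /* a copy, not a lingering register binding */
        printf("result = %f\n", sync_family(fid));
        return 0;
    }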


Despite platform and input language evolutions, the compiler must always be kept up to date to provide certified and efficient code. Small changes in the Microgrid target definition may damage the compilation process and generate hazardous code. Furthermore, the design of the µTC input language impacted the way in which the compiler research has been conducted. µTC is not an extension of the C language but a subset with new constructs and assumptions. Therefore, the C front-end properties must not be used entirely without the assurance of delivering SVP-certified code. With the presence of SVP communication channels, most compiler optimizations need to be certified before use. During this research, some optimizations have been studied and reported in Chapter 5. Evolution of µTC has to be well-advised to prevent any code hazards. The impact of both architecture and language changes on the compiler may be considerable. This thesis, in its scope, reflects the impact of integrating concurrency in compiler modeling as well as specific Microgrid issues.

6.4.2 Support native concurrency

Chapter 4 has depicted the compilation techniques investigated during the compiler development. The changes embedded in the compiler, as presented in Section 4.3, to integrate concurrency in a conventional compiler have costs for a research team in development overhead. Figure 6.2 shows the distribution of effort in implementing these changes in the SVP GCC-4.1 compiler. This thesis work presents solutions to support native concurrency in an imperative-language compiler. This chapter has shown the different parts to extend in order to enable SVP properties in the infrastructure of the compiler: the front-end, the middle-end, the back-end and the infrastructure itself.

6.4.3 Conclusion

The major contribution of this work is the existence of a working SVP compiler. The compiler development is as close as possible to production quality within the time constraints of this research. Although it is very difficult to obtain a complete verification of the compiler, it is more than sufficient for a proof-of-concept work. The other contributions are the presentation of the challenges of integrating concurrency into a compiler's infrastructure, and of the areas of danger in SVP compilation. The compiler has room for improvement, mainly because of research time limitations and manpower. The areas where it can be improved are code validation and verification, and also SVP-specific optimizations, where it would be interesting to observe what impact they would have on code performance and quality. Limitations on research time prohibited further compiler development, leaving these improvements for future work. The approach undertaken has been to embed concurrency idioms into a conventional imperative sequential compiler, as opposed to an external library approach.


Both approaches have advantages and disadvantages. Our thesis is to expose explicit concurrency to the system infrastructure and to avoid undefined behavior in concurrency. Furthermore, the design choice of extending an existing compiler framework, in contrast to writing a compiler from scratch, has reduced the amount of time needed to get a working compiler within the project duration, despite the human effort required to understand the compiler machinery. The presence of the compiler allows experiments with a set of non-trivial, larger benchmarks instead of simply having hand-coded versions. This chapter has exposed challenges in existing compiler assumptions when dealing with the programming of multicore architectures.


Chapter 7

SVP evaluation

Compiler development requires, as with any complex software development, special care with regard to code quality. A compiler can be thought of as a gas factory where an improvement to its infrastructure consists of adding or even removing pipes. After this operation, the gas factory must be checked for leakage in case the changes introduced some inconsistencies. Any extension of a complex software system suffers from consistency issues; the entire software system needs to be certified before it is safe to use. In the context of this research, we strove as far as possible for compiler correctness. Adding code, adding components, or removing segments from the compiler infrastructure may render it unstable and non-certified if compiler architects do not pay particular attention to side-effects. In this regard, Test-Driven Development (TDD) [89] enables, in short development cycles, the verification of changes made in the compiler internals. TDD relies on specific test cases and scenarios to trigger problems in compiler behavior or compiler-produced code. A few lines of changes can then be verified quickly, together with the range of side-effects they may have on other compiler components. Chapter 5 has discussed how SVP properties endanger conventional compiler optimizations if they are not extended. Therefore, achieving compiler correctness and correctness of the produced code are big challenges in our research. The presence of a working compiler for SVP is already a proof of concept. This chapter thus focuses on code analysis, after Chapter 6 has examined compiler conception and decisions.

The contents of this chapter are based on this publication:

• T.A.M. Bernard, C. Grelck, M.A. Hicks, C.R. Jesshope, R. Poss – "Resource-agnostic programming for many-core microgrids", 4th Workshop on Highly Parallel Processing on a Chip, HPPC 2010, Aug. 2010.


Chapter 3 has outlined an evaluation of the SVP software system including scientific problems expressed as highly-optimized hand-coded benchmarks for the SVP architecture implementation. The first part of this chapter reviews the SVP extension of the compiler. The evaluation gathers code analysis of scientific problems: code size, code composition, code correctness and compiler overhead. The idea is to monitor the impact and overhead of the SVP extension on the imperative-language compiler. We then analyze SVP-produced code to reveal the composition of programs and the distribution of SVP and conventional operations. Finally, performance differences between hand-coded (in the Microgrid ISA) programs and compiled programs give an estimate of how close our research compiler comes to an optimal solution when compiling a problem. The second part of this chapter contrasts with the results section of Chapter 3, which used hand-coded programs as benchmarks; the second part utilizes compiled programs as benchmarks. We provide an evaluation (with results) of the SVP parallel computing system comprising an SVP hardware implementation as the Microgrid, an SVP software implementation as the µTC language, an SVP GCC-based compiler implementation and a set of scientific problems programmed in µTC.

7.1 Evaluation of SVP compilation

This first section focuses on the SVP research compiler and the evaluation of SVP compilation. Compiler correctness is our main objective. Moreover, code analysis of scientific problems shows how the SVP compilation performs: the composition of SVP compiled programs (instruction mix), and the importance of SVP-aware optimizations in SVP compilation.

7.1.1 Methodology: benchmarks and evaluation

The Livermore kernels [69] are characterizations of scientific problems used for benchmarking in this chapter. In our compiler research, they correspond to an acceptable trade-off between program size and problem complexity. Their program size allows us to monitor each operation in compiled code with a human eye, whereas larger programs would complicate the monitoring with huge numbers of operations. Their problem complexity covers a collection of common scientific mathematical problems, where some of them can be parallelized easily and others require more attention. They test code correctness and evaluate the code quality of the SVP compiler (µTC input language). Our SVP compiler is an extension of a conventional imperative-language compiler. It is based on the GCC-4.1 framework and extended by the presented compilation schemes from the µTC language to Alpha-based SVP cores (cf. Chapter 4), for which we have a many-core (Microgrid) emulator which provides a cycle-accurate simulation of execution time [68]. The Livermore kernels gather a set of scientific kernels which have been coded in µTC: Hydrodynamics Fragment (HF); Inner Product (IP);

Banded Linear Equations (BLE); First Differential (FD); First Sum (FS); Incomplete Cholesky Conjugate Gradient (ICCG); Difference Predictors (DP); Tridiagonal Elimination (TDE); General Linear Recurrence Equations (GLRE); Equation of State Fragment (ESF). Our toolchain allows verification of compilation results and of execution on this platform against the execution of the same code on a conventional platform, by transforming µTC into sequential C code, compiling it and executing it on a conventional processor.
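A minimal sketch of this cross-checking step, assuming the existence of the two result arrays produced by the Microgrid run and by the sequentialized C run (the function name and tolerance are our own, not part of the actual toolchain):

    #include <math.h>
    #include <stdio.h>

    /* Compare the kernel output from the Microgrid simulator with the output of
       the sequentialized C version, allowing a small floating-point tolerance. */
    int results_match(const double *microgrid, const double *sequential,
                      int n, double eps)
    {
        for (int i = 0; i < n; i++) {
            if (fabs(microgrid[i] - sequential[i]) > eps) {
                fprintf(stderr, "mismatch at %d: %g vs %g\n",
                        i, microgrid[i], sequential[i]);
                return 0;
            }
        }
        return 1;
    }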

7.1.2 Code analysis and quality

Figure 7.1 presents the distribution of operations over the entire set of kernels in their assembly representation (i.e. the Microgrid ISA). Across the scientific problems, the proportion of extra SVP operations goes from 14% for ESF to 37% for FS, the ratio of SVP operations over the whole program. ESF has more arithmetic and logical operations (ALU ops) than FS. The number of SVP operations is therefore not negligible; these operations are responsible for thread creation (create in Figure 7.1) and for inter-thread communication in shared channels via shared-dependent operations (S/D ops). There are also directives for register window definition and the global data pointer (directives), and context switching tags (swch); note that the latter are not considered as operations but are still shown in Figure 7.1.

[Bar chart: number of instructions per Livermore kernel (BLE, DP, ESF, FD, FS, HF, ICCG, IP, TDE), broken down into ALU ops, swch, create, directives, S/D ops, branches, loads and stores]

Figure 7.1: Instruction mix of Livermore kernels in µTC. The number of instructions is calculated to expose the composition of each scientific problem. The extra SVP operations are counted in create, directives, S/D ops, and swch; these are the numbers of instructions for the create process, for the directives related to the register window definition, for inter-thread communication in shared channels, and for context switching.

Further code analysis is shown in Figure 7.2, where unoptimized (GCC -O0) and optimized (GCC -O1) µTC are compared. The unoptimized code has a bigger code size than the optimized version, because the GCC compiler does not use any code optimization and uses the stack for parameter passing and for storing intermediate results. Optimized versions prioritize the use of registers

before the stack for parameter passing; code size drastically diminishes, by almost 30% in all programs. Moreover, to achieve this gain, major compiler optimizations needed to be extended to support the SVP assumptions with the -O1 optimization flag, as reported in Chapter 5.

[Bar chart: code size per kernel (BLE, ESF, GLRE, HF, IP, TDE), unoptimized (-O0) versus optimized (-O1)]

Figure 7.2: Comparison of code size between unoptimized and optimized µTC code. Unoptimized code is generated with the debugging -O0 optimization flag of GCC, whereas optimized code uses optimization flag -O1. These results are generated with the SVP GCC-4.1 compiler. The y-axis represents the code size (bytes).

Figure 7.3 collects a comparison between hand-coded implementations and their corresponding representations in µTC. The focus here is to monitor how close the compiler is to an optimal solution. Hand-coded versions are extensively optimized by SVP hardware specialists; therefore, hand-coded programs benefit from every single Microgrid property. In some cases, the compiled code is much larger than the hand-coded version, e.g. HF is 204% larger in size than its hand-coded version. The worst case is IP, where the compiled version is 285% bigger. A possible criticism of the SVP compiler is the exclusive use of the -O1 optimization flag. The other optimization flags such as -O2 and -O3 might give better results; however, the impact of SVP properties on these compiler optimizations has not been investigated and is left for further research. The SVP compiler is unstable with these more aggressive optimization flags. Experiments with optimization flag -O2 mainly result in clashes with the instruction rescheduling pass of the compiler. Further work is required to take advantage of these optimization passes. It should be noted that the GCC compiler community itself struggles with the most aggressive optimization level (-O3) when combined with some code; this can be observed from the amount of active discussion on the GCC developer mailing list.

7.1.3 Code correctness

Figure 7.4 compares the execution time of compiled Livermore kernels against optimized hand-coded ones. These compiled execution times are normalized against the hand-coded versions, for execution on a single core and on a cluster of 32 cores.


[Bar chart: instruction/directive count per kernel (BLE, ESF, GLRE, HF, ICCG, IP, TDE), hand-coded versus compiled (-O1)]

Figure 7.3: Comparison of instruction/directive size between hand-coded and compiled µTC code (µTC implementation of Livermore kernels). Hand-coded programs correspond to an optimal solution of SVP compilation. µTC code is compiled with optimization flag -O1 with the SVP compiler. The bar chart displays an analysis of instruction/directive size for each scientific problem. These results are generated with the SVP GCC-4.1 compiler. The y-axis represents the number of instructions.

The code correctness is certified by the proper execution of the kernels on the target platform (Microgrid) simulator. We evaluate the compiled code against the hand-coded assembly versions, which can be considered as optimal solutions. We would expect the compiled code to be slower than the highly-optimized hand-coded assembly. We would also expect the code to scale with the number of cores, and here we see the compiled µTC code scaling much better than the assembly version. For a single core, the compiled code varies from being as good as the hand-coded version (BLE) to a factor of 2.8 times slower (IP). IP performs only one multiplication and one addition in each thread, so its less efficient address arithmetic cannot be masked. The average slowdown is just about 50%. When we execute on 32 cores, however, the smaller hand-compiled kernels execute less efficiently, as there are fewer threads per core and more stalls waiting for memory or synchronizations (e.g. in IP), whereas the longer kernels continue to utilize the pipeline more efficiently. At 32 cores the overhead of compiled code becomes less than 20% and in a few cases, HF, IP, BLE and ESF, it is very close to the hand-coded assembly version.


[Bar chart: normalized execution time per kernel (BLE, ESF, GLRE, HF, IP, TDE), single-core execution versus multicore execution]

Figure 7.4: Comparison of execution cycles between hand-coded Livermore kernels (set as one) and Livermore kernels compiled with the µTC GCC-based compiler with optimizations enabled (normalized), run on the Microgrid simulator. The blue bar corresponds to single-core execution, and the yellow bar to multicore execution (i.e. 32 cores in the experiments). The y-axis represents the ratio of the compiled benchmarks over the hand-coded ones (set as one).

7.2 Evaluation of SVP computing system

This second section collects an evaluation of the SVP computing system comprising a target platform (i.e. the Microgrid simulator, Section 7.2.1) and benchmarks (cf. Section 7.2.2) compiled with our research SVP compiler. A major aspect of this evaluation is the reuse of the same program binaries across different settings of the chip multi-processor (i.e. Microgrid) without any need for recompilation: number of cores, problem size, etc. This resource agnosticism is a property of the SVP execution model. This section provides an overview of the performance of the SVP computing system.

7.2.1 Target platform: Microgrid simulator

The SVP software simulator [68] keeps track of the full state of the system, simulates the pipeline, caches, network and memory, and takes contention on buffers and ports into account to provide a cycle-accurate simulation of execution time. The simulated system consists of 128 1.2 GHz cores spread among several differently-sized places. Each core has a 4-way set associative 1 kB I-Cache and D-Cache, and storage for 32 families, 256 threads, 1024 integer registers and 512 FP registers. A 4-way set associative 32 kB L2 cache serves 4 cores via a bus, and a COMA directory is connected with 8 L2 caches in a ring. Two DDR3-2400 channels are connected to the top-level directory ring. The evaluation performed in this section uses this platform emulation of a Microgrid. The emulation captures state transitions down to the lowest level in the cores' pipelines.


[Diagram: cluster rings of cores (1 FPU + 2 cores) with L2 caches and COMA directories; root directory ring connected to two DDR channels]

Figure 7.5: Functional diagram of a 64-core Microgrid.

It also provides a fully parameterizable, cycle-accurate simulation of all hardware aspects of the Microgrid: core functional units, interconnects, network-on-chip, memory architecture, and external memory interface. The memory architecture is a COMA derivative described in [70]. We used parameters suitable for hardware that can be built using current technology [61, 65]: 128 SVP cores implementing the DEC Alpha ISA, sharing 64 FPUs with separate pipelines for adds (2 pipelines), divs, muls and square root operations (1 pipeline each); 32 L2 caches of 32 kB each, shared by groups of 4 cores. L2 caches are in turn grouped in COMA rings of 8 caches connected to a COMA directory. The top-level directory ring is connected via two DDR3-2400 channels to external storage of arbitrary size. The performance figures presented in the next section should be considered in the light of the following hardware characteristics: the two DDR3-2400 channels provide 2400 million transfers/s of 64 bits each, i.e. a peak bandwidth of 38.4 GB/s overall for the external memory interface; each COMA ring provides a total bandwidth of 64 GB/s, shared among its participants; the bus between cores and L2 caches provides 64 GB/s of bandwidth, shared among 4 cores; the SVP cores are clocked at 1.2 GHz. The cores are grouped in 8 clusters (SVP places) of 1, 1, 2, 4, 8, 16, 32 and 64 cores, to allow running benchmarks on places of varying sizes. The selection of cluster sizes has no impact on the performance of each cluster other than through the number of cores. Figure 7.5 provides a schema of this configuration.

7.2.2 Methodology: benchmarks and evaluation

We present results spanning a selection of computation kernels present in most applications. The selection was made among known scientific problems (Livermore kernels, linear algebra, etc.) as they are commonly used in benchmarks. Our selection process was guided by the intent to expose the relationship between implementations, architecture parameters and actual observed performance.


Most results are presented with performance on cold caches (first use) in addition to the more common warm-cache performance. This allows us to quantify the impact of kernels that may be used only once in a larger computation. In all figures, the problem size indicates the number of items on which the kernel computation is applied. This closely matches the number of SVP threads defined. In order to analyze the performance, we need to understand the constraints on performance. For this we define two measures of arithmetic intensity (AI). The first, AI1, is the ratio of floating point operations to instructions issued. For a given kernel that is not I/O bound, this limits the FP performance. For P cores at 1.2 GHz, the peak performance we can expect is therefore P × AI1, the ideal case of full pipeline utilization (one apparent cycle per operation). In some circumstances, we know that execution is constrained by dependencies between floating point operations, so we modify AI1 to take this into account, giving an effective intensity AI1'. The second measure of arithmetic intensity is the ratio of FP operations to I/O operations, AI2, in FLOPs/byte. The I/O bandwidth IO is usually measured at the chip boundary (38.4 GB/s - the two DDR3-2400 channels provide 2400 million 64-bit transfers/s, i.e. 2400 × 2 × 8 = 38.4 GB/s), unless we can identify bottlenecks on the COMA rings (64 GB/s). These I/O bandwidths are independent of the number of cores used, so this also provides a hard performance limit. We can then combine these two intensities to obtain a maximum performance envelope for a given code and problem size. A program is constrained either by AI1, if P × AI1 ≤ AI2 × IO, or by AI2, when P × AI1 ≥ AI2 × IO.
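The combination of these two bounds can be written as a small helper (our own formulation of the envelope described above; the constants used in the example are the ones quoted in Section 7.2.1, and the intensities are illustrative only). It returns the maximum achievable FP rate for a given core count, clock frequency, AI1 (or AI1') and AI2.

    #include <stdio.h>

    /* Maximum performance envelope in GFLOP/s:
       - compute bound:  P * AI1 * clock (GFLOP/s), ideal pipeline utilization;
       - I/O bound:      AI2 (FLOPs/byte) * IO bandwidth (GB/s).               */
    double performance_envelope(int cores, double clock_ghz,
                                double ai1, double ai2, double io_gb_per_s)
    {
        double compute_bound = cores * ai1 * clock_ghz;
        double io_bound = ai2 * io_gb_per_s;
        return compute_bound < io_bound ? compute_bound : io_bound;
    }

    int main(void)
    {
        /* Example: 64 cores at 1.2 GHz, AI1 = 0.3, AI2 = 0.125 FLOPs/byte,
           38.4 GB/s of external memory bandwidth.                             */
        printf("%.2f GFLOP/s\n",
               performance_envelope(64, 1.2, 0.3, 0.125, 38.4));
        return 0;
    }

With these illustrative numbers the kernel is I/O bound: the compute bound (23.04 GFLOP/s) exceeds the I/O bound (4.8 GFLOP/s), so the latter caps the achievable rate.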

7.2.3 Sequential code

We consider the function DNRM2 of the BLAS library. This function computes the Euclidean norm of a vector: it sums up the squares of the vector elements, then computes the square root of the sum. In the sequential algorithm, each iteration requires one memory load, followed by an FP multiply-add. The add in each iteration has a data dependency on the previous iteration. We do not yet consider here the benefits of implementing this function using another algorithm based on e.g. a parallel reduction; this will be evaluated in Section 7.2.4. Importantly, one should consider that any architecture must deal with memory latencies of hundreds of cycles or more when scheduling instructions. Branch prediction and out-of-order issue can provide some latency tolerance, typically tens of cycles, which is sufficient to optimize performance when working from on-chip cache but not for larger data sets. Prefetching can do better on constant-stride accesses, but as memory latencies rise, the probability that prefetched data will remain in the cache diminishes. In our approach, the hardware provides latency hiding through multithreading in the pipeline: the load and the FP multiply form an independent prefix to the dependent add, so latency hiding is possible by overlapping them across threads and across cores. In principle, we can expect that the latency of loads can be hidden even with cold caches, and that the resulting performance is the sequential performance of the dependent FP operations, assuming no delays.

thread accum(shared double sum, double* v)
{
    index i;
    sum += v[i] * v[i];
}

thread dnrm2(shared double result, int n, double *v)
{
    create(fid; ; 0; n; 1;) accum(sum = 0, v);
    sync(fid);
    result = sqrt(sum);
}

Figure 7.6: BLAS DNRM2 in µTC.

We have a naive implementation of DNRM2 in µTC (cf. Figure 7.6). This compiles to code with 10 instructions per thread, including 3 FP operations. In each thread, the 7 initial instructions are entirely independent and can run concurrently. So, latency hiding applies and we can expect a visible 1-cycle cost per instruction. Then, the dependent adds are executed in sequence. The cost of this is between 6 and 11 cycles per add depending on the scheduling of threads (considering a 6-stage pipeline and chip layout), with the difference representing the cost of waking up a waiting thread and getting it to the register read stage of the pipeline, which could be overlapped by other independent instructions in the pipeline. To this we add the cost of the 7 independent instructions, yielding 6 + 7 to 11 + 7 cycles per thread. Even when overlapping adds with the independent instructions, we cannot hide the minimum 13 cycles of latency from add to add, as the adds are dependent. We then deduce the arithmetic intensity AI1 = number of FP operations ÷ number of instructions, i.e. AI1 = 3 ÷ 10 = 0.3. We then define the effective intensity AI1', considering the dependencies between adds: number of FP ops ÷ (number of instructions + cost max) ≤ AI1' ≤ number of FP ops ÷ (number of instructions + cost min), i.e. (3 ÷ (10 + 18)) ≈ 0.10 ≤ AI1' ≤ 0.13 ≈ (3 ÷ (10 + 13)). Deriving from AI1', this gives a theoretical maximum performance envelope - for 1.2 GHz cores - of 0.11 to 0.15 GFLOP/s on a single core. As Figure 7.7 shows, provided we have enough threads we observe just under 0.11 GFLOP/s on one core, i.e. within the expected range. We have confirmed experimentally that this sort of result can be reproduced for a variety of memory architecture configurations with diverse delays, with the only difference being the number of threads required to tolerate the different memory latencies.


[3D surface plots: BLAS1-DNRM2 performance in GFLOP/s versus #cores per SVP place and #problem size - (a) performance on cold caches, (b) performance on warm caches]

Figure 7.7: Performance of DNRM2 on one SVP place. The working set is 8 × #problem size bytes.


We observe that we can further increase the performance by increasing the number of cores, because the independent prefix instructions can be scheduled concurrently across cores. However, not much gain is to be expected since less than one third of the cycles required can be executed concurrently: even with ideal scheduling and no overhead, Amdahl's law would limit the speedup to a factor of 1.5. Any deviation from this expected performance is due to insufficient threads to hide the asynchronous operations such as those to memory (e.g. problem size N=2K with cold caches in Figure 7.7(a)). Note that with warm caches fewer threads are required to tolerate the memory latencies, which are now smaller. We nonetheless observe with warm caches (e.g. size N=50K at 32 cores in Figure 7.7(b)) that the performance drops beyond a problem size of N=50K. At that point, the caches are full and new cache lines are required for computations, i.e. cache eviction occurs. At larger problem sizes, the overall performance gets better since the number of threads provides enough latency hiding. We conclude this subsection by highlighting that we have achieved the automatic, fine-grained parallelization of the concurrent part of a purely sequential algorithm. This result is significant considering that our software implementation is naive and resource-agnostic, and that the hardware scheduler is not specifically optimized for this algorithm.

7.2.4 Parallel reductions

The case study in this section is the inner vector product (IP, loop 3). For this algorithm, we have assumed associativity of the floating point addition used in the reduction as an opportunity to parallelize across cores. Our program (cf. Figure 7.8) is a straightforward extension of the naive implementation in µTC, which could be identified and transformed automatically. It relies on the number of cores in the 'current place' being exposed to programs as a language primitive. It splits the reduction into two stages, where an initial family is created with one thread per core, and each thread in that family then creates a local family to perform a core-local reduction. In effect, this amounts to parallelizing the reduction among the cores. So, when the number of threads per core is significantly larger than the number of cores, the cost of the final reduction is negligible and the performance should scale linearly with the number of cores. Figure 7.9 shows the experimental results for this code. To analyze our results, we consider the case of one core first. The compiled code for the inner thread contains 7 instructions, including 2 FP operations and 2 loads. The arithmetic intensity is AI1 = 2 ÷ 7 ≈ 0.28. There are dependencies between threads and instructions (i.e. the dependent FP add); using the same methodology as before, we get a cost per thread of between 6 and 11 cycles. We deduce the effective intensity AI1', i.e. (2 ÷ (7 + 11)) ≈ 0.11 ≤ AI1' ≤ 0.15 ≈ (2 ÷ (7 + 6)). The effective intensity hence provides a theoretical maximum performance envelope on one core - for 1.2 GHz cores - of 0.13 to 0.18 GFLOP/s. The actual performance observed falls within this range (0.15 GFLOP/s). Assuming linear speedup on the number of cores, the expected performance for P cores is thus 0.15 × P, or a theoretical maximum of 9.6 GFLOP/s for the largest place size of 64 cores.

Assuming linear speedup on the number of cores, the expected performance for P cores is thus 0.15 × P, or a theoretical maximum of 9.6 GFLOP/s for the largest place size of 64 cores.

thread accum2(shared double sum, double* x, double *y)
{
    index i;
    sum += x[i] * y[i];
}

thread accum1(shared double sum, int sp, double* x, double *y)
{
    index i;
    create(fid; LOCAL; i*sp; (i+1)*sp; 1; ) accum2(sum1 = 0, x, y);
    sync(fid);
    sum += sum1;
}

thread lmk3(shared double result, int n, double *x, double *y)
{
    int p = current_number_of_cores();
    create(fid; ; 0; p; 1; 1) accum1(sum = 0, n/p, x, y);
    sync(fid);
    result = sum;
}

Figure 7.8: N/P parallel reduction for the inner product.
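For reference, the sequential loop that Figure 7.8 parallelizes is the textbook inner product; the sketch below is ours (the function name is illustrative and not part of the benchmark sources):

/* Sequential reference for the inner product (loop 3). The µTC code in
   Figure 7.8 splits this accumulation into one core-local partial sum per
   core and then combines the partial sums through the shared channel. */
double ip_sequential(int n, const double *x, const double *y)
{
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += x[i] * y[i];   /* 2 FP operations and 2 loads per iteration */
    return sum;
}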

However, the arithmetic intensity AI2 of this algorithm is just one FLOP per 8-byte word loaded from memory (2 FP operations per two 8-byte loads), i.e. 0.125 FLOPs per byte of I/O. By increasing the number of cores we eventually saturate the external memory bandwidth, which - as described in Section 7.2.1 - is 38.4 GB/s. When the working set does not fit in the L2 caches, yet another constraint is added: loads from memory must now be interleaved with evictions from the L2 cache when it is full. In the worst case, a single load may evict a cache line while the loaded line is used by only one thread before being evicted again. However, as threads are scheduled initially in index order, this extreme case is unlikely, and up to eight loads (64-byte cache lines) may be serviced for one load-eviction pair. Evictions are either injected into another cache in the same ring or update the root directory without requiring off-chip bandwidth. The COMA ring bandwidth - as described in Section 7.2.1 - is 64 GB/s.


LMK3: Inner Product - Performance (GFLOP/s) (Cold Caches)

[3-D surface plot: GFLOP/s versus #cores per SVP place and #problem size]

(a) Performance on cold caches

LMK3: Inner Product - Performance (GFLOP/s) (Warm Caches)

[3-D surface plot: GFLOP/s versus #cores per SVP place and #problem size]

(b) Performance on warm caches

Figure 7.9: IP performance, using N/P reduction. The working set is 16×#problem size bytes.


Thus, the ring bandwidth required for a single 8-byte load could be as much as two 64-byte cache-line transfers, giving a perceived bandwidth of 4 GB/s, which with an arithmetic intensity of one eighth of a FLOP per byte gives a worst-case expected performance of 0.5 GFLOP/s when the problem size is very much larger than the cache capacity, i.e. with frequent evictions. As our results show, with cold caches and sufficiently large problems we achieve 6.3 GFLOP/s, somewhat less than the theoretical maximum of 9.6 GFLOP/s. This is due to the overheads of the sequential reduction phase. With warm caches, as we increase the problem size further, the program becomes constrained by eviction bandwidth. However, the number of threads provides enough latency hiding beyond N=50K. We consider this ability of our resource-agnostic software implementation to be bound only by simple architectural constraints as another significant result and contribution of this thesis.
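The worst-case figure above follows directly from the stated parameters: if every useful 8-byte load costs two 64-byte cache-line transfers on the ring, then

\[
B_{\mathit{eff}} \;=\; 64~\text{GB/s} \times \frac{8~\text{B}}{2 \times 64~\text{B}} \;=\; 4~\text{GB/s},
\qquad
\text{Perf}_{\min} \;=\; B_{\mathit{eff}} \times \tfrac{1}{8}~\text{FLOP/B} \;=\; 0.5~\text{GFLOP/s}.
\]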

7.2.5 Data-parallel algorithms

We show here the behavior of three data-parallel algorithms which exhibit three different behavior patterns. The equation of state fragment (ESF) from the Livermore benchmark suite (loop 7) is a data-parallel kernel with seven local accesses to the same array data by different threads. If this locality can be exploited, the kernel has an arithmetic intensity AI2 of about 0.5 FLOPs per byte of I/O from off-chip memory. The matrix-matrix product from the Livermore benchmark suite (loop 21), although it has significant non-local access to data, has a very high arithmetic intensity AI2 as specified here: one matrix can remain in on-chip memory, giving an arithmetic intensity AI2 of over 3 FLOPs per byte of I/O to off-chip memory. The 1-D FFT lies somewhere between these two extremes: it has a logarithmic number of stages that can exploit reuse (linear for matrix multiplication) but poor locality of access to data. As for our other benchmarks, our µTC implementations here are straightforward parallelizations of the obvious sequential implementation and do not attempt any explicit mapping to hardware resources. We consider the ESF first. It takes 3 arrays and 3 scalars as input and computes an output array with the following expression:

X[k] = U[k] + R*( Z[k] + R*Y[k] )
     + T*( U[k+3] + R*( U[k+2] + R*U[k+1] )
     + T*( U[k+6] + Q*( U[k+5] + Q*U[k+4] ) ) );
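In outline, the µTC kernel is a single family of independent threads, one per output element. The sketch below is ours and simplified; thread and variable names are illustrative and it is not the benchmark source as compiled:

thread esf_body(double* X, double* U, double* Y, double* Z,
                double R, double T, double Q)
{
    /* One independent thread per output element k; there are no shared
       (dependent) arguments, so the family can spread freely over the
       cores of the current place. */
    index k;
    X[k] = U[k] + R*( Z[k] + R*Y[k] )
         + T*( U[k+3] + R*( U[k+2] + R*U[k+1] )
         + T*( U[k+6] + Q*( U[k+5] + Q*U[k+4] ) ) );
}

thread lmk7(int n, double* X, double* U, double* Y, double* Z,
            double R, double T, double Q)
{
    create(fid; ; 0; n; 1; ) esf_body(X, U, Y, Z, R, T, Q);
    sync(fid);
}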

The compiled code for the inner thread contains 33 instructions, including 9 loads, 1 store, 8 FP adds and 8 FP muls. As before, assuming latency hiding, each instruction should execute with a 'visible cost' of 1 cycle. The threads are independent; therefore, there is no need to determine the effective intensity AI'1. The performance bound based on this mix of instructions is thus 16 FLOPs in 33 instructions, which gives an arithmetic intensity AI1 of 0.49.

From this we derive a performance - considering 1.2 GHz cores - of 0.58 GFLOP/s. The observed performance on a single core is 0.49 GFLOP/s, i.e. ≈85% of the expected maximum (see Figure 7.10(a)). As for the IP above, when there are enough threads per core to hide latency, we observe linear speedup on up to 8 cores. After this point, the program becomes I/O bound. For this program, an arithmetic intensity AI2 of 0.5 FLOPs/byte, assuming the eviction of the cache lines containing the results, means a peak performance of 12.8 GFLOP/s would be expected based on perfect sharing. Our experiments show that we reach 10.3 GFLOP/s for P=64 and N=1K, i.e. a little over 80% of this theoretical maximum. The bound may be loose because of constraints from the internal cache traffic required to achieve this sharing. Again, as for the IP above, when using larger problem sizes L2 cache evictions cause the effective usable bandwidth to decrease, and the performance drops accordingly. Conversely, when using warm caches and smaller problem sizes, greater speedups can be achieved (see Figure 7.10(b)). With warm caches, the program is computation-bound and the performance flattens for sizes over N=2K, as cache lines then start being evicted to make room for new ones. The observed maximum - at P=64 cores, N=2K - is 22.9 GFLOP/s. The next benchmark, in Figure 7.11, is a matrix-matrix product which naively implements the product of 25×25 and 25×N matrices using a local IP algorithm. The IP operates in each core on rows of 25 consecutive elements in memory, thus achieving near-ideal locality of reference for the first matrix, which easily fits into on-chip cache. Even for large problems, the algorithm has a high arithmetic intensity as, for each column of the second matrix loaded, 1250 FLOPs are required to produce a column of the result (25 IPs). Assuming the results are always evicted, that gives 25 loads and 25 stores, or 400 bytes of I/O, giving an arithmetic intensity AI2 of 3.125 FLOPs/byte, or an I/O limit of 75 GFLOP/s, which exceeds the theoretical peak performance of this kernel. As previously, the IP requires 7 instructions to issue two FP operations, giving an arithmetic intensity AI1 of 2 ÷ 7 ≈ 0.28. The theoretical maximum performance with 1.2 GHz cores is therefore 0.34 GFLOP/s. We can then give a theoretical peak rate for P=64 cores, i.e. P × 0.34 ≈ 21.9 GFLOP/s. With cold caches (cf. Figure 7.11(a)), our experiments show an actual peak of 10.5 GFLOP/s, i.e. 48% of the maximum, for P=64 cores and N=1K.
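In outline, the µTC implementation applies the inner-product pattern of Figure 7.8 once per element of the 25×N result. The sketch below is ours; it assumes the second matrix is stored with contiguous 25-element columns, and the thread names and index arithmetic are illustrative, not the benchmark source as compiled:

thread mm_cell(double* A, double* B, double* C)
{
    /* One thread per result element: a 25-element inner product over a
       row of A (25 consecutive doubles) and a column of B, reusing the
       accum2 thread of Figure 7.8 as a core-local reduction family. */
    index idx;
    int i = idx % 25, j = idx / 25;
    create(fid; LOCAL; 0; 25; 1; ) accum2(sum = 0, &A[i*25], &B[j*25]);
    sync(fid);
    C[j*25 + i] = sum;
}

thread lmk21(double* A, double* B, double* C, int n)
{
    create(fid; ; 0; 25*n; 1; ) mm_cell(A, B, C);
    sync(fid);
}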


LMK7: Equation State Fragment - Performance (GFLOP/s) (Cold Caches)

[3-D surface plot: GFLOP/s versus #cores per SVP place and #problem size]

(a) Performance on cold caches

LMK7: Equation State Fragment - Performance (GFLOP/s) (Warm Caches)

[3-D surface plot: GFLOP/s versus #cores per SVP place and #problem size]

(b) Performance on warm caches

Figure 7.10: Performance of the ESF. The working set is 32 ×#problem size bytes.


LMK21: Matrix-Matrix Product - Performance (GFLOP/s) (Cold Caches)

[3-D surface plot: GFLOP/s versus #cores per SVP place and #problem size]

(a) Performance on cold caches

LMK21: Matrix-Matrix Product - Performance (GFLOP/s) (Warm Caches)

[3-D surface plot: GFLOP/s versus #cores per SVP place and #problem size]

(b) Performance on warm caches

Figure 7.11: Performance of the matrix-matrix product. The working set is ≈ 200×#problem size bytes.


In fact, with a little analysis we see that the performance is limited by the COMA ring bandwidth. For row data, each core will request a cache line every 56 cycles assuming perfect sharing (8 words per row), and there will be no sharing between the cores connected to an L2 cache, so the bandwidth at the L2 cache is 4 lines every 56 cycles. For column data, every thread will request a new cache line every 7 cycles, as there is no locality of access. Assuming perfect sharing between cores, i.e. every line is used for column data by every core before being evicted, this translates at the L2 cache to one line every 7 cycles, or 8 lines every 56 cycles. The total bandwidth demanded by the threads, with perfect sharing of row and column data, is therefore 12 lines every 56 cycles from each of the 8 caches connected to a ring, or 110 GB/s. As the COMA ring supplies 64 GB/s, the peak performance - on warm caches, cf. Figure 7.11(b) - is limited to 12.2 GFLOP/s. Of course, the performance drops for small problem sizes and large place sizes, when there are not enough threads per core to hide all latencies.

thread fft_inner(int le, cpx_t* x)
{
    index i;
    int w = i & (le - 1), j = (i - w) * 2 + w, ip = j + le;
    cpx_t t = { cos_sin[w].re * x[ip].re - cos_sin[w].im * x[ip].im,
                cos_sin[w].im * x[ip].re + cos_sin[w].re * x[ip].im };
    x[ip].re = x[j].re - t.re;
    x[ip].im = x[j].im - t.im;
    x[j].re  = x[j].re + t.re;
    x[j].im  = x[j].im + t.im;
}

thread fft_outer(cpx_t* x, int n, shared int token)
{
    index k;
    int le = 1 << (k - 1);
    int t = token;
    create(fid; ; 0; n/2; 1; ) fft_inner(le, x);
    sync(fid);
    token = t;
}

thread fft(cpx_t* x, int n)
{
    create(fid; ; 1; log2(n)+1; 1; ) fft_outer(x, n, 0);
    sync(fid);
}

Figure 7.12: Computation kernel for the 1-D FFT.


FFT-1D - Performance (GFLOP/s) (Cold Caches)

[3-D surface plot: GFLOP/s versus #cores per SVP place and #problem size]

(a) Performance on cold caches

FFT-1D - Performance (GFLOP/s) (Warm Caches)

[3-D surface plot: GFLOP/s versus #cores per SVP place and #problem size]

(b) Performance on warm caches

Figure 7.13: Performance of the 1-D FFT. The working set is 8×#problem size bytes, plus a lookup table.


We conclude this section by looking at the 1-D FFT, whose code is shown in Figure 7.12. Our µTC implementation performs the forward FFT kernel without the bit-reversal phase, as it would be used in a scientific application. The implementation, again, is straightforward. For a problem size N, the outer computation is sequential and the inner computation defines N/2 threads, each containing 4 loads, 4 FP muls, 3 FP adds, 3 FP subs and 4 stores. This thread contains 42 instructions, so on one core, assuming latency hiding, the theoretical cost is 42 cycles per thread, with each thread contributing 10 FLOPs. We hence determine the arithmetic intensity AI1 = 10 ÷ 42 ≈ 0.24 and derive from it the theoretical maximal performance, i.e. Perf = 1.2 × AI1 ≈ 0.28 GFLOP/s. The actual observed performance - on one core with cold caches, cf. Figure 7.13(a) - is 0.17 GFLOP/s, or ≈60% of the expected maximum. When the number of cores and the problem size increase, the program becomes I/O constrained because of the lack of cache injection, which results in more off-chip I/O accesses. With warm caches (cf. Figure 7.13(b)), the observed maximal performance - for P=64 and N=16K - is 0.40 GFLOP/s.


7.3 Discussion and conclusion

As already stated in the conclusion of Chapter 6, the major technical contribution of this thesis is the existence of a working SVP compiler. This working compiler enables the compilation of non-trivial and more complex scientific benchmarks; the SVP GCC-4.1 compiler is stable with the -O1 optimization flag (whose major optimizations include dead-code elimination, common subexpression elimination, copy propagation, operation combining, etc.). However, as is to be expected, performance is not as good as highly-optimized hand-coded assembly. The reason for this is that other optimization flags such as -O2 and -O3 are not stable with respect to SVP assumptions. For future work, the optimizations they enable, such as instruction rescheduling, require further investigation before they can be used safely. Increasing the number of FP operations per instruction, or per byte of I/O, would improve performance by relaxing the limit dictated by the arithmetic intensity calculations. Moreover, despite the scientific problems used in this chapter, the scope of the work undertaken did not allow further investigation with larger problems; this too is left for future work. Further investigation into compiling larger problems would reveal more of the concurrency issues relevant to compilation, but a lack of time and resources did not permit us to implement larger µTC programs. The Apple-CORE project [90] will provide large benchmarks for the SVP computing system. Our aim was not production quality but a proof of concept for a native integration of concurrency assumptions into the compiler's internals.

Acknowledgments

The author would like to thank Raphael Poss for all the constructive criticism he provided during the elaboration of this chapter and for his contribution to the SVP system evaluation section. The author would also like to thank Mike Lankamp for providing the simulator of the target platform (the Microgrid) used during evaluation and for his expertise on the hardware perspective of the benchmark analysis.


Part III

Discussion and conclusion


Chapter 8 Discussion and conclusion

For tomorrow belongs to the people who prepare for it today. African Proverb

In this dissertation, we have addressed the problem of integrating concurrency constructs (in both source and target languages) and assumptions (from the execution model) into an imperative-language compilation software system. As a technical contribution, we have developed a working compiler for the µTC language targeting the Microgrid architecture, extended with the properties of the SVP concurrent execution model. This work has served as a basis to investigate several aspects of concurrency for programming multicore architectures. In particular, it has focused on the integration of concurrency assumptions into a non-concurrency-aware transforming software system that bridges the two SVP implementations. This chapter reviews the different problems and results presented in this thesis and summarizes their implications.

8.1 Thesis overview

8.1.1 Programming multicore architectures with SVP

Chapter 2 has presented an overview of the different layers of existing concurrent software systems: execution models, parallel architectures, concurrent programming languages, and compilers. We have discussed the context of our research by looking at how to program multicore architectures. We have then described, in Chapter 3, the candidate used as a case study throughout this thesis: the Self-Adaptive Virtual Processor (SVP) execution model. This model

was resolved into software and hardware implementations with SVP properties. In Chapter 4, we have looked at compilation techniques to bridge a concurrency-oriented language to a concurrent target platform. Besides the clashes with compiler optimizations reported in Chapter 5, we have built a working compiler based on an existing imperative-language compiler framework. Chapter 6 has presented the challenges we have encountered during this research in embedding native concurrency idioms within a sequentially-oriented framework. Chapter 7 has evaluated the SVP extension to an imperative-language compiler with scientific problems. In this chapter, we use this experience to reflect on the current status of compiler research for concurrent systems. The multicore programming menace is still present, and the work required on tools to fit concurrent systems is far from finished.

8.1.2 SVP compilation

The motivation for integrating SVP properties into an existing imperative language is the opportunity to reuse existing compiler technology that is proven to work. Moreover, it has also been a way to produce results within the time constraints of this research. The other option would have been to implement a compiler from scratch; this would not have served our research goals, which were to:

• compile programs to the SVP hardware implementation,
• investigate the impact of concurrent programs on existing development tools,
• provide solutions to support concurrency assumptions,
• expose dangers in the relevant areas to monitor in compiler research.

8.2 Limitations

8.2.1 General limitations

Compiler research comprises a large range of topics that we could not cover within the time allocated to this thesis. We first restricted ourselves with the choice to embed concurrency directly into the compiler's infrastructure. In addition, the choice of GCC as the compiler constrained the development to a Test-Driven Development (TDD) scheme [89], which consumes time and resources before a working compiler is obtained. Compilers, as middle-men, are positioned between two worlds (software and hardware), with constant pressure to deliver more efficient compiled programs. With the appearance of multicore architectures and the lack of concurrent programming standards, compilers have the job of taking advantage of sequential programs (with and without concurrency annotations in the code) and discovering parallelism in the code, or just to

compile with an optimizing objective. The same happened in the context of our research, where compiler development was pressurized by frequent changes to the target platform. We observed that simple changes to the target's instruction set have a considerable impact on the compiler modifications needed to adapt to them. It is clear that close collaboration between compiler architects and architecture designers is a prerequisite for this kind of research in software systems. Here, we made the assumption that the software side is a passive actor in the system. Software standards indeed do not evolve as fast as architectures. Developers are skeptical of new standards and are reluctant to learn new programming techniques. Consequently, the pressure of finding and handling concurrency falls on the developer tools, such as compilers. During this research, we encountered challenges due to a moving target platform and its instruction set, including changes in the programming language. These impacted compiler development considerably, in ways that are hard to quantify. This stems mainly from our choice to expose concurrency explicitly in the source language with µTC [64] and in the target instruction set with the Microgrid [61]. µTC reuses parts of the C language and extends it with new constructs to support SVP properties. The Microgrid uses a DRISC approach [43] and extends a conventional instruction set with new instructions to support SVP properties. In theory, µTC constructs map directly onto Microgrid instructions with regard to SVP assumptions. However, in practice, SVP semantics break conventional assumptions, especially in compiler optimizations and C-language compilation schemes. The latter require extensive support in a compiler's infrastructure to preserve the two principles of compilation:

1. The compiler must preserve the semantics of the input program being compiled.

2. The compiler must improve the input program in some perceptible way (in the case of an optimizing compiler).

8.2.2 Current compiler limitations

The SVP compiler has proven to work, but it has limitations. The first limitation is related to the implementation of thread-local storage (TLS), i.e. concurrent stack management for a multicore architecture, investigated in [91]. Consequently, spilling techniques are not yet implemented in the compiler, which limits the number of parameters that can be passed at thread creation (cf. Section 3.3.7). The compiler cannot reuse physically shared registers via spills: a spill itself would require a read from the register, thus forcing synchronization with the previous thread and reducing concurrency. Due to time constraints, regular C function calls are not yet implemented. Supporting this in the back-end requires the work described in Section 4.2, combined with a complex back-end. This back-end forks

into two parts: the conventional sequential calling convention and the concurrency-creating convention. The thread creation mechanism reuses parts of the call mechanism and its calling convention. The back-end therefore needs to support different types of calling conventions, which is not the case in the compiler's infrastructure. As mentioned in Section 6.3.2, the Microgrid implementation employs variable (dynamically-sized) register windows for each thread function. In other words, this removes the long-standing assumption that the size of the architectural register window is a constant for all programs. The register allocator requires a fixed description of the target's registers before the compiler is built. Supporting variable register windows requires substantial changes to the concepts of register allocation; the calling convention and the memory description are no longer static. To make the compilation of programs feasible within the time of this research, we have fixed the register windows for the Microgrid implementation. Nonetheless, this directly impacts code quality: the register windows are not optimal in size, i.e. they are sometimes larger than strictly necessary.
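As an illustration of the cost of this workaround, consider a minimal thread function; the example and its name are ours, not taken from the benchmark sources, and the comments describe the consequence of a fixed window:

thread touch(double* a)
{
    /* This thread needs only an index, a pointer and one FP temporary.
       With variable register windows the hardware could allocate just
       these few registers; with the fixed-window workaround it receives
       the full window size chosen at compiler-build time, so the surplus
       registers are wasted. */
    index i;
    a[i] = a[i] + 1.0;
}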

8.3 Future work

8.3.1 Possible improvements

This working SVP compiler has room for improvement. First, the verification and evaluation steps will be pushed further with the results of the Apple-CORE project [90]. This project investigates the application of SVP concurrency to larger benchmarks and provides results for industry-standard benchmarks. The SVP compiler will then have a larger base for discovering further sleeping compiler defects beyond those already found. Moreover, the SVP model and its implementation are evolving continuously; the compiler will have to follow this evolution. The latest major improvement in the SVP hardware implementation is the change of the create mechanism from a 'pull' scheme to a 'push' one, as discussed in Section 6.4.1. This will considerably affect the back-end complexity: variable initialization (parameter passing) between the created family and its parent is no longer deferred beyond the create action. The compiler then does not have to preserve locks on resources for parameter passing, and the register allocator becomes simpler to design and implement. The main reason for the change is the recent addition of the continuation create to the SVP model to support, amongst other things, the implementation of system services in an operating system. In a continuation create, a family of threads is created with no remaining link to its creator beyond its issue point. This SVP property is required for further investigation into operating systems and resource management for SVP. Beyond SVP changes, the SVP compiler requires more work and investigation into the advanced and aggressive optimizations enabled by the GCC

optimization flags -O2 and -O3. One example is instruction reordering, which can break SVP dependency chains between operations: a shared communication channel may, for instance, be written before it is read. Other optimizations would require more investigation, especially SVP-related optimizations that could occur at the middle-end level. Optimizing the concurrency tree of a program is a potential direction, rebalancing the workload of thread families by reshaping/restructuring the block size and other family parameters. To achieve this, we could introduce feedback from the target, reporting metrics on the behavior of the concurrency tree during execution back to the compiler. Based on these metrics, the optimizing algorithm may reshape the concurrency tree to obtain a better balance of workload throughout the Microgrid. Feedback-directed compilation using GCC already exists with MILEPOST [92], which uses the GCC ICI plug-in mechanism [86]. Moreover, the target's resources are static at design time, but they may change at execution time. The compiler can introduce variables and code sections at compile time to permit other execution paths. These code sections and variables will be interpreted by the run-time system, which configures and optimizes based on this special information. This optimization can be done without feedback, simply with target parameters which become known at run time.
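Returning to the instruction-reordering hazard mentioned above, the following minimal µTC fragment (ours, with illustrative names) shows the dependency such a pass must respect:

thread chain(shared double acc, double* x)
{
    index i;
    /* Each thread reads the shared channel written by its predecessor and
       then writes it for its successor. If an optimization pass moves the
       write of 'acc' ahead of the read, the dependency chain through the
       family is broken and the result becomes undefined. */
    double prev = acc;      /* read: value produced by the previous thread */
    acc = prev + x[i];      /* write: value consumed by the next thread    */
}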

8.3.2 Future perspectives of Compiler Research

This research has prompted reflections beyond those related to SVP compiler research. The emphasis on compiler technology is becoming more important with the presence of multicore architectures. The compiler not only has to transform a program correctly and to improve it, but it also has to discover and take advantage of concurrency in the input program. Developers may use hints to expose concurrency in a program with various programming paradigms; compilers have to evolve as well and need to be recast with new compiler models to become more flexible and better compartmentalized. Most compilers suffer from a sequential compiler model with a fixed infrastructure (the fixed sequence of front-end, middle-end, and back-end), which is not flexible enough for the addition of new components or concepts. There is therefore a need for more flexibility and modularity in the next generation of compilers. In that sense, the CoSy framework [79] is already ahead of its time. And although the GCC framework struggles with this sequential compiler model, efforts are currently being made to modernize it: the ICI plug-in initiative [86] and the Modular-GCC initiative [87]. The next generations of imperative-language compilers will have to encapsulate an abstract representation good enough to model concurrency in the compiler's internals, and therefore to avoid concurrency discrepancies. Compiler optimizations are a major problem when dealing with concurrency properties; it is clear that the compiler wall will be hit if compiler models are not recast as a whole new system with concurrency embedded. This is an entire research area in which sleeping bugs in existing sequential programs will

be exposed when they encounter concurrency. In the end, from the user perspective, the developer community is too conservative to change its way of thinking about and designing applications. Therefore, development tools such as compilers will have to do the job and relieve that stress by handling concurrency in input programs.

8.4 Conclusions

To conclude this thesis work, we have addressed the problem of integrating concurrency constructs and assumptions into an imperative-language compilation software system. The context of this work is the programmability of multicore architectures, for which the SVP execution model is a potential candidate and is used as the basis of this study. The SVP execution model provides a solution as a whole, with different layers (and tools), each of which has its own responsibility and an adequate separation of concerns for managing concurrency. This thesis has focused on specific areas:

1. presentation of the SVP execution model,
2. compilation of a concurrency-oriented language to a multicore architecture,
3. investigation of interactions of concurrency assumptions in conventional compiler optimizations,
4. exposition of challenges with SVP compilation and concurrency matters.

We have shown that integrating the concurrency assumptions of the SVP execution model is technically possible, and the working SVP compiler stands as the technical contribution of this thesis. However, we have also drawn up the limits of this approach of embedding concurrency constructs into an imperative paradigm. The first distinct impact is that the existing tools are designed in a way that forces wholesale reengineering to obtain dedicated concurrency-aware tools. These compilers will need to be modular in their design to enable flexible retargeting to various architectures. Moreover, it will still take some time until concurrency standards appear and are approved by the developer community. Until that time, efforts in compilation will have to be made to provide production-quality solutions. The reason for this reengineering is mainly the clashes related to concurrency assumptions that break optimization algorithms and compilation schemes. Therefore, there is a need for dedicated solutions to the matter of programming multicore architectures.

Appendix A µTC language syntax summary

This appendix holds the complete µTC grammar, as defined by extending the C99 grammar with the µTC extension. primary-expression: identifier constant string-literal ( expression ) postfix-expression: primary-expression postfix-expression [ expression ] postfix-expression ( argument-expression-listopt ) postfix-expression . identifier postfix-expression -> identifier postfix-expression ++ postfix-expression -- ( type-name ) { initializer-list } ( type-name ) { initializer-list , } argument-expression-list assignment-expression argument-expression-list , assignment-expression unary-expression: postfix-expression ++ unary-expression -- unary-expression unary-operator cast-expression sizeof unary-expression


sizeof ( type-name ) unary-operator: one of & * + - ˜ ! cast-expression: unary-expression ( type-name ) cast-expression multiplicative-expression: cast-expression multiplicative-expression * cast-expression multiplicative-expression / cast-expression multiplicative-expression % cast-expression additive-expression: multiplicative-expression additive-expression + multiplicative-expression additive-expression - multiplicative-expression shift-expression: additive-expression shift-expression << additive-expression shift-expression >> additive-expression relational-expression: shift-expression relational-expression < shift-expression relational-expression > shift-expression relational-expression <= shift-expression relational-expression >= shift-expression equality-expression: relational-expression equality-expression == relational-expression equality-expression != relational-expression AND-expression: equality-expression AND-expression & equality-expression exclusive-OR-expression: AND-expression exclusive-OR-expression ˆ AND-expression inclusive-OR-expression: exclusive-OR-expression inclusive-OR-expression | exclusive-OR-expression logical-AND-expression: inclusive-OR-expression Appendix A. µTC language syntax summary

logical-AND-expression && inclusive-OR-expression logical-OR-expression: logical-AND-expression logical-OR-expression || logical-AND-expression conditional-expression: logical-OR-expression logical-OR-expression ? expression : conditional-expression assignment-expression: conditional-expression unary-expression assignment-operator assignment-expression assignment-operator: one of = *= /= %= += -= <<= >>= &= ˆ= |= expression: assignment-expression expression , assignment-expression constant-expression: conditional-expression declaration: declaration-specifiers init-declarator-listopt ; declaration-specifiers: storage-class-specifier declaration-specifiersopt

type-specifier declaration-specifiersopt type-qualifier declaration-specifiersopt function-specifier declaration-specifiersopt init-declarator-list: init-declarator init-declarator-list , init-declarator init-declarator: declarator declarator = initializer storage-class-specifier: typedef extern static auto register type-specifier: void char short Appendix A. µTC language syntax summary

int long float double signed unsigned Bool Complex Imaginary index place family struct-or-union-specifier enum-specifier typedef-name struct-or-union-specifier: struct-or-union identifieropt { struct-declaration-list } struct-or-union identifier struct-or-union: struct union struct-declaration-list: struct-declaration struct-declaration-list struct-declaration struct-declaration: specifier-qualifier-list struct-declarator-list ; specifier-qualifier-list: type-specifier specifier-qualifier-listopt

type-qualifier specifier-qualifier-listopt struct-declarator-list: struct-declarator struct-declarator-list , struct-declarator struct-declarator: declarator

declaratoropt : constant-expression enum-specifier: enum identifieropt { enumerator-list }

enum identifieropt { enumerator-list , } enum identifier enumerator-list: enumerator Appendix A. µTC language syntax summary

enumerator-list , enumerator enumerator: enumeration-constant enumeration-constant = constant-expression type-qualifier: const restrict volatile shared function-specifier: inline thread declarator: pointeropt direct-declarator direct-declarator: identifier ( declarator ) direct-declarator [ type-qualifier-listopt assignment-expressionopt ] direct-declarator [ static type-qualifier-listopt assignment-expression ] direct-declarator [ type-qualifier-list static assignment-expression ] direct-declarator [ type-qualifier-listopt * ] direct-declarator ( parameter-type-list ) direct-declarator ( identifier-listopt ) pointer: * type-qualifier-listopt

* type-qualifier-listopt pointer type-qualifier-list: type-qualifier type-qualifier-list type-qualifier parameter-type-list: parameter-list parameter-list , ... parameter-list: parameter-declaration parameter-list , parameter-declaration parameter-declaration: declaration-specifiers declarator

declaration-specifiers abstract-declaratoropt identifier-list: identifier Appendix A. µTC language syntax summary

identifier-list , identifier type-name: specifier-qualifier-list abstract-declaratoropt abstract-declarator: pointer

pointeropt direct-abstract-declarator direct-abstract-declarator: ( abstract-declarator )

direct-abstract-declaratoropt [ assignment-expressionopt ] direct-abstract-declaratoropt [ * ] direct-abstract-declaratoropt ( parameter-type-listopt ) typedef-name: identifier initializer: assignment-expression { initializer-list } { initializer-list , } initializer-list: designationopt initializer

initializer-list , designationopt initializer designation: designator-list = designator-list: designator designator-list designator designator: [ constant-expression ] . identifier statement: labeled-statement compound-statement expression-statement selection-statement iteration-statement jump-statement create-statement sync-statement kill-statement labeled-statement: identifier : statement case constant-expression : statement Appendix A. µTC language syntax summary

default : statement compound-statement: { block-item-listopt } block-item-list: block-item block-item-list block-item block-item: declaration statement expression-statement: expressionopt ; selection-statement: if ( expression ) statement if ( expression ) statement else statement switch ( expression ) statement iteration-statement: while ( expression ) statement do statement while ( expression ); for ( expressionopt ; expressionopt ; expressionopt ) statement for ( declaration expressionopt ; expressionopt ) statement jump-statement: goto identifier ; continue ; break expressionopt ; return expressionopt ; create-statement: create ( identifier ; expressionopt ; expression ; expression ; expression ; expressionopt ; identifieropt ) postfix-expression ( argument-expression-listopt ) sync-statement: sync ( identifier ); kill-statement: kill ( identifier ); translation-unit: external-declaration translation-unit external-declaration external-declaration: function-definition declaration function-definition: Appendix A. µTC language syntax summary

declaration-specifiers declarator declaration-listopt compound-statement declaration-list: declaration declaration-list declaration

Summary

Computer-based systems are ubiquitous in daily life: in fridges, which are regulated with an embedded Integrated Circuit (IC); in car control systems, which manage the major driving features and engine controls; etc. The commodity computer market is driven by demands to deliver more efficient and reliable computing appliances at frequent time intervals: from a few months to a few years. These computing systems have to cope with the ever-growing number of resource-intensive applications. IC (i.e. processor) improvements in computing systems are constrained to follow this never-ending quest for better performance. There are two major factors which contribute to this. The first is improved process technology, resulting in faster circuits by increasing a processor's clock rate. The second is the increase in processing resources with the integration of a larger number of transistors on chip. However, these improvements are nowadays reaching physical limitations: the more transistors present on a chip, the more physical issues arise, resulting in excessive power consumption and problems with heat dissipation. Consequently, there is an urgent need to find other ways to improve microprocessor design.

In the last few years, there has been a shift to parallel architectures in mainstream computing. Exploiting parallelism is not new; parallel programming paradigms, architectures and languages have been researched over the last three decades. This concurrency revolution is happening now in computing systems. Its impact on software systems appears at all levels, from the application level to the architecture level. Nonetheless, problems arise when dealing with concurrency. The main challenge is the efficient programming of such architectures: efficiency in building the applications from the developer's perspective, and efficiency in exploiting their full potential in terms of performance scalability. This fact remains, despite decades of research in parallel computing, because parallel programming is inherently difficult and non-intuitive for developers; humans tend to think sequentially. Unlike sequential programs, there are two issues in parallel application development. The first, and most common part, is the description of the algorithm required. This may be concurrent or may be automatically derived from a sequential description. The second, and by far the most difficult, is the management of concurrency. In this, the developer has to deal with issues of scheduling, mapping and dynamic resource management. In current approaches, the description and management of concurrency are not decoupled. This results in a mixture of concerns that overwhelms developers, making parallel application development very difficult and error-prone. This lack of a clear separation of concerns also means a lack of appropriate high-level abstractions in both the architecture and the application, which precludes portability between different platforms. Thus, what currently happens is that applications are either developed targeting specific platforms, or existing applications are retargeted to a platform through a painstaking process of static application mapping and the introduction of platform-specific functionality into the application itself.

This concurrency revolution has an impact on software systems. Users utilize applications without knowledge of the machinery underneath. Applications are executed on the hardware; the operating system provides the interface between the software and the hardware. A multicore architecture requires an adapted toolchain in order to operate, comprising a concurrency-oriented operating system and corresponding applications. Users do not need to be aware of the machinery; however, developers use the adapted toolchain for software system development and must be aware of the machinery, at least of its main concepts. Coupled with that, this toolchain must be adapted to handle the new features of this new concurrent target platform. Nowadays, the major issue for the computing community is to cope with the multicore programming menace. The problem is not that multicore architectures exist, but that the tools for exploiting them are not yet ready. In other words, the programmability of multicore architectures still remains a challenge. The main research focus of this thesis is a particular part of software system development which bridges the software side and the hardware side. Compilers are a major component in software system development. Developers often rely on them to optimize their code for an architecture of which they do not need to be completely aware. The role of the compiler is to take advantage of the targeted architecture with respect to the tasks to be accomplished within the programmed applications. This thesis, with its underlying compiler development for a new parallel programming language targeting a many-core architecture, looks at the impact of this concurrency revolution.

This thesis describes an abstract model called the Self-Adaptive Virtual Processor (SVP), which supports concurrent systems. Its main characteristic is that it provides a separation of concerns between concurrency description and concurrency management (scheduling and mapping). SVP provides abstractions that allow the description of an application's concurrency at multiple levels of granularity, thus tackling one of the major issues with regard to parallel architectures and programming. The goal is to write an application capable of exploiting the full potential of any concurrent platform configuration. Concurrency management is delegated to the SVP implementation, which is directly

implemented in the processor's ISA. Mapping is implemented in SVP by providing a mechanism to explicitly attach resources to components of a program, and this is performed at execution time. Dynamic concurrency management has two main advantages. The first is that it allows applications a higher level of portability over different platform configurations (e.g. where the number of processors differs). Secondly, and most importantly, applications can take full advantage of any SVP platform without any platform-specific optimizations. SVP is flexible, as it allows applications to exploit concurrency at all levels, from the task level to the loop level and even at the instruction level if the implementation of SVP supports it. SVP captures data dependencies using dataflow scheduling between its component threads, which in turn introduces a mechanism by which SVP programs can be self-scheduled. This thesis introduces, as foundations, a hardware implementation of SVP at the level of a processor's ISA and a software implementation as an extension of an imperative programming language.

This thesis investigates the changes and the challenges facing the integration of concurrency idioms and assumptions from a concurrent execution model into an existing sequentially-based imperative-language compiler. The compilation schemes of the concurrent language implementation reuse and modify the C-language standards. This thesis focuses on the differences between SVP compilation and conventional sequential compilation. Furthermore, this thesis investigates the impact of such an integration and shows the feasibility of reusing existing compiler technology that has been proven to work. Nonetheless, such an integration is a significant challenge, and its side-effects are relevant to understanding the quest for multicore programming tools.

At first glance, this work targets a specific audience of technical computer scientists involved in compilation. This work also represents a reflection on the limits of existing engineering methods when used to tackle the challenges of dealing with concurrency. The essence of the technical contribution of this work is then used to reflect, from an engineering standpoint, on the limitations of current computing systems and other concurrency-based systems. In the context of facing the multicore programming menace, this work becomes relevant to an audience dealing with concurrency-based compilation, language design, and multicore programming issues.


Samenvatting

Computergebaseerde systemen zijn overal aanwezig in ons dagelijks leven: koelkasten worden bestuurd door ge¨ıntegreerde circuits (IC’s); systemen in au- to’s staan je bij tijdens het rijden en regelen de motor; enzovoort. De consumen- tenmarkt voor computergebaseerde systemen wordt gedreven door een vraag naar meer efficiente en betrouwbare toepassingen, die zich met een regelmaat van enkele maanden tot jaren blijft herhalen. Deze computergebaseerde syste- men moeten kunnen voldoen aan de eisen van applicaties die bij elke generatie nog complexer worden en meer rekenkracht vereisen. De verbeteringen in de IC- (en dus ook processor-) technologie worden hierdoor gestuurd tot het vol- gen van deze oneindige honger naar meer rekenkracht. Er zijn twee manieren waarop dit gedaan wordt. De eerste is door de productietechnologie te verbe- teren waardoor de circuits op hogere snelheden kunnen draaien. De tweede manier is door de rekenkracht van een IC uit te breiden door meer transistoren toe te voegen op een chip. Het probleem is dat deze verbeteringen inmiddels de fysieke grenzen bereiken: met meer transistoren op een chip wordt er steeds meer energie verbruikt, die wordt omgezet in warmte welke ook weer moet worden afgevoerd. Hierdoor is het noodzakelijk dat er andere manieren ge- vonden worden om het ontwerp van microprocessoren te verbeteren. In de laatste jaren heeft er in de computerwereld een verschuiving plaats- gevonden in de richting van parallelle architecturen en systemen. Op zich is het uitbuiten van parallellisme niet nieuw; parallelle programmeermethoden, architecturen en talen zijn al uitgebreid ontwikkeld en onderzocht in de laat- ste drie decennia. Maar de echte parallelle revolutie in de informatica vindt nu plaats, en dit heeft grote invloed op alle niveaus binnen de systemen, vanaf het applicatieniveau tot aan de onderliggende architectuur. Echter, er ontstaan veel problemen bij het gebruik van parallellisme. De grootste uitdaging is om effi- ciente programma’s te schrijven voor dergelijke architecturen: efficient, als in de manier waarop een programmeur een applicatie bouwt, maar ook efficient in het maximaliseren van de schaalbaarheid van de prestaties. Dit is nog steeds een probleem, ondanks de jaren van onderzoek naar parallelle systemen. Pa-

159 Samenvatting rallel programmeren is moeilijk en niet intu¨ıtief voor ontwikkelaars; immers, mensen neigen er naar om sequentieel te denken. Anders dan bij sequentiele¨ programma’s zijn er twee problemen bij het ontwikkelen van parallelle appli- caties. Het eerste, en meest voorkomende, probleem is de beschrijving van het benodigde algoritme. Dit kan op zich al parallel zijn, of automatisch worden afgeleid van een sequentiele¨ beschrijving. Het tweede, en veruit meest lasti- ge, probleem is het beheren van het parallellisme. Hiervoor moet de ontwik- kelaar rekening houden met problemen als planning, plaatsing, en het dyna- misch beheren van bronnen. In de bestaande aanpakken zijn de beschrijving en het beheer van parallellisme niet ontkoppeld. Dit leidt er toe dat ontwik- kelaars overweldigd worden door de combinatie van deze twee problemen, wat de ontwikkeling van parallelle applicaties heel erg moeilijk en foutgevoe- lig maakt. Het gebrek aan een duidelijke scheiding betekent ook een gebrek aan hoog-niveau abstractie in de applicatie en de architectuur, wat de uitwis- selbaarheid van software tussen verschillende platformen uitsluit. Het gevolg hiervan is dat applicaties dan wel voor een specifiek platform ontwikkeld wor- den, of dat bestaande applicaties herschreven moeten worden voor een plat- form door middel van een moeizaam en pijnlijk proces van statische plaatsing en het introduceren van platformspecifieke code in de applicatie. De parallelle revolutie heeft een grote invloed op softwaresystemen. Appli- caties worden gebruikt zonder dat de gebruiker kennis heeft van de werking onder de motorkap. Applicaties worden uitgevoerd op de hardware; het be- sturingssysteem voorziet in de interface tussen software en hardware. Een architectuur met meerdere rekenkernen (multicore) vereist een aangepaste set softwaretools om goed benut te kunnen worden, bestaande uit een besturings- systeem en applicaties, die beiden toegespitst zijn op het gebruik van parallel- lisme. Gebruikers hoeven zich niet bewust te zijn van deze interne werking, maar ontwikkelaars moeten dit wel zijn, of op zijn minst op de hoogte zijn van de belangrijkste concepten. Zij hebben een aangepaste set tools nodig om hun applicaties mee te schrijven, die alle nieuwe kenmerken van een dergelijk pa- rallel platform ondersteunt. Op dit moment is het omgaan met het gevaar dat multicore programmeren heet de grootste uitdaging van de informaticagemeen- schap. Het probleem is niet dat multicore-systemen bestaan, maar dat de tools om hun kracht goed uit te buiten er nog niet klaar voor zijn. Met andere woor- den, het programmeren van een multicore-architectuur is nog steeds een uit- daging. Het belangrijkste deel van het onderzoek in dit proefschrift richt zich op een bepaald deel van software-ontwikkeling, dat specifiek de software- en hardware-werelden aan elkaar verbindt. Vertaalprogramma’s zijn een belang- rijk onderdeel in de ontwikkeling van softwaresystemen. Vaak vertrouwen ontwikkelaars er op dat deze tools hun code optimaliseren voor een bepaal- de architectuur zonder zich daar verder in te hoeven verdiepen. De rol van vertaalprogramma’s is om al het mogelijke voordeel uit de doelarchitectuur te halen voor de taken die een applicatie omvat. Dit proefschrift, met de onder- liggende ontwikkeling van een vertaalprogramma voor een nieuwe parallelle programmeertaal voor een manycore-architectuur (d.w.z. een architectuur met veel rekenkernen), geeft een beschouwing van de impact van deze parallelle re-

160 Samenvatting volutie. Dit proefschrift beschrijft SVP, de Self-Adaptive Virtual Processor, oftewel een zelfregelende virtuele processor; een abstract model dat gespecialiseerd is in parallelle systemen. Het hoofdkenmerk van dit model is dat het een schei- ding aanbrengt tussen het beschrijven en het beheren (planning, plaatsing) van parallellisme. SVP geeft ons de abstracties om het parallellisme binnen een pro- gramma te beschrijven op verschillende niveaus van granulariteit, en verhelpt daarmee een van de grote problemen van parallelle architecturen en program- meertalen. Het doel is om een applicatie te kunnen schrijven die het maximale kan halen uit welk parallel platform dan ook. Het beheer van het parallellisme wordt overgelaten aan de implementatie van SVP, die direct ingebouwd is in de instructieset van een processor. Plaatsing is ge¨ımplementeerd in SVP door een mechanisme beschikbaar te stellen waarbij bronnen expliciet aan componenten van een programma verbonden worden tijdens de executie. Dynamisch beheer van parallellisme heeft twee belangrijke voordelen. De eerste is dat het de uit- wisselbaarheid van applicaties tussen verschillende platformconfiguraties (bijv. een verschillend aantal processoren) ten goede komt. De tweede reden, en mis- schien nog wel de meest belangrijke, is dat applicaties het maximale uit elk mogelijk SVP-platform kunnen halen zonder specifieke optimalisaties hiervoor toe te passen. SVP is flexibel en geeft de mogelijkheid om parallellisme in ap- plicaties uit de drukken vanaf taakniveau, lusniveau tot instructieniveau, in- dien de SVP implementatie dat laatste ondersteunt. SVP gebruikt een planning gebaseerd op de datastroom tussen softwarecomponenten die ge¨ıdentificeerd is door de beschreven afhankelijkheden. Deze manier van planning levert een mechanisme op waarmee alle SVP-programma’s zelfplannend uitgevoerd kun- nen worden. Dit proefschrift introduceert, als basis, een hardware implemen- tatie van het SVP-model op het niveau van een processorinstructieset, en een implementatie van software die gebruik maakt van uitbreidingen van een im- peratieve programmeertaal. Dit proefschrift onderzoekt de veranderingen en uitdagingen die het integre- ren van een parallel dialect in een bestaand vertaalprogramma voor sequen- tiele¨ imperatieve programmeertalen brengt, waarbij een parallel executie mo- del in acht genomen moet worden. Voor de basis van de vertaalregels voor de parallelle taal wordt er, met enkele aanpassingen, veel hergebruikt van de C-programmeertaal standaard. Dit proefschrift concentreert zich op het ver- schil in vertaaltechnieken tussen SVP programma’s en sequentiele¨ program- ma’s. Ook wordt onderzocht wat de invloed is van een dergelijke integratie, en laat het de haalbaarheid zien van het hergebruik van bewezen bestaande ver- talertechnologie. Echter, een dergelijke integratie is een grote uitdaging, en de bijwerkingen die dit oplevert zijn belangrijk om de zoektocht naar nieuwe tools om multicores mee te programmeren te begrijpen. Op het eerste gezicht is het hier beschreven werk alleen bedoeld voor een spe- cifiek publiek van wetenschappers in de technische informatica die ge¨ınteresse- erd zijn in vertalertechniek. Dit werk presenteert ook een reflectie over wat de grenzen zijn van de bestaande aanpakken wanneer deze worden toegepast op

161 Samenvatting het omgaan met parallellisme en de uitdagingen die dat veroorzaakt. De essen- tie van de technische toegevoegde waarde van dit werk wordt daarna gebruikt als toetsingsmethode over de grenzen van huidige rekensystemen en andere parallel-gebaseerde systemen. In de context van het omgaan met het gevaar dat multicore programmeren heet, is dit werk relevant voor een publiek dat om moet gaan met de problemen van parallel-gebaseerde vertaling, het ontwerp van programmeertalen, en het programmeren van multicores.

Acknowledgements

I would like to thank my promotor Chris Jesshope, who gave me the opportunity to be involved in a challenging research project in the CSA group at the University of Amsterdam (UvA). It started in 2005 with a master's internship in the brand-new CSA group; Chris then offered me a position as a PhD candidate in his group. This opened a great chapter for me in which to explore research topics and to develop and refine my scientific insight. Moreover, I would like to remember my former supervisor Peter Knijnenburg who, despite his illness, tried to give as much guidance as he could in the early stages. Clemens Grelck joined the group in the last stage of my PhD; I would like to thank him for his thorough input on my research and thesis.

During my time at EISTI (engineering school), my M.Sc. professors, Chrys Baskiotis and Jorge Mayorquim, challenged me to pursue further the exploration and understanding of research topics in computer science. I would like to thank them for their guidance and directions in finding interesting and challenging domains of work. My classmates Philippe Olivier, Gerard Manguelle, Jean-Baptiste Le Stang, David Monclus, Georg Ramne, Benjamin Rigaud, Alexis Kofman, Cyril Ledru, Samya El Yousfi and I experienced teamwork and engineering on the smart-car project and other assignments.

During these years at the UvA, Kostas Bousias and 'Jony' Li Zhang were great officemates who gave me assistance and support during the difficult times, but also over an after-work beer and dinner. Michael Hicks has been a great help in proofreading this thesis and in collaborating on the compiler development along with Simon Polstra. I thank you both, and also my other colleagues: Mike Lankamp, Mark Thompson, Roberta Piscitelli, Toktam Tagahvi, Yang Qiang, Fu Jian, Irfan Uddin, Roy Bakker, Aram Visser, Peter van Stralen and Andy Pimentel. I would also like to mention my former colleagues Guang Liang, Joe Masters and Bram de Geus. Michiel van Tol, you were always good fun at the office and out for a drink with the other guys. Raphael Poss, you provided such good insight into my work and the group's research. You made me reflect a lot on research choices and the directions taken. Your input has been really helpful.

I was lucky to have had great support and fun with our secretaries Erik Hitipeuw, Alexis Salin and Dorien Bisselink. They helped a lot in getting through UvA bureaucracy. Thanks to you all! Big thanks to Erik!

During EU project meetings, I met Marcel Beemster and Joe van Vlijmen, both working in industry. They offered me good perspectives on my work and on myself from their industrial insight and experience. It gave the little push that was necessary to make a further effort during my research.

During my internship at NXP in Eindhoven, I encountered two major players in the field, Marc Duranton and Zbigniev Chamski. Marc, you pinpointed all the research challenges and proposed directions for realizing feasible research and scientific output. Zbigniev, your expertise in compilers and in GCC really gave me good support in the choices I made during my research. I also encountered colleagues who became friends: Anthony Martiniere, Rodrigo Baptista, Joao Paulo Carreiro, Camille Cortet, Alice Schwab, Amelie Onzon and Francois Jegaden. This NXP experience would not have been the same without you.

My time as a PhD candidate has been great, with friends around me who became my family in Amsterdam. Two persons come to mind first, who could be my two big brothers: Yves Fomekong Nanfack and Alfredo Tirado-Ramos. Heidi Denzel van Ramos, it was great to have met you. Even if it was sometimes hard to follow your flow of intellectual thoughts, we had good fun and great discussions. Along with Yves and Alfredo, Jordi Vidal and Carlos Tamulonis were part of our UvA gang. Thanks guys for the good times! Tim van Kasteren, you are also part of my family - my bro man! Whoppa! We had such great discussions about everything and nothing. Anton Bossenbroek, you opened up Dutch society and your family to me. I am really grateful for what you did, and to your parents Anne and Vim. Tom de Jong, you and your gang of friends gave me a lot of fun and perspective on the Dutch lifestyle and friendship. I am also pleased to have met Maaike Weber; we had fun discussions. Thank you all for the good memories!

I would like to give many thanks to my informal supervisor Marten Postma. Marten, although you are not an expert in my research area, you gave me so much good and thorough advice on how to conduct my research and my thesis. Thanks! For all the casual discussions at the UvA, I would like to thank Nol Chandapol, Eric Lorenz, Fabio Tiriticco, Narges Javaheri, Breanndan O Nuallain, Michael Sipior, Max Filatov, Isabelle Wartelle, Gohar Sargsyan and all the members of the CSP lab. Thanks to Adam Belloum for all our hallway discussions and your support.

Dealing with the Dutch language is quite a challenge for a French guy. Charlotta Olthof, you decided to invest in teaching me the language and the subtle things about Dutch society. You always found a way to push me further with my Dutch and with my own challenges. Thanks!

My time in Amsterdam would not have been the same without football. I had the chance to play with world-class players: Alex Ter Beek, Jurjen Meeuwissen, Victor Guevara, Arnaud Spangenberg, Pierre-Alain Breuil, Tibert van der Loop, Michel Ferreira, Alessia Amore, Andreas Angermayr, Akshay Katre, Alessia Gasparini, Mauro, Niels Turing, Gordon Lim, Cyrielle Allais, Wojciech Dzik, Pawel Dydio, Vladi Bocokic, Greg Seltzer, Chris Swan and all the others I played with or against. We had very bad games and good ones as well! When is the return game?? We drank lots of liters of beer after our amazing games.

My friends from my university time gave me comfort and support at all times as well: Daniel Doan, Stephane N'Guyen, Benoit Caron, Christophe Lepley, Filippe De Azevedo, Julien Cossais and Johann Bargoin. Germain Maurice, Jean-Marie Merizio, we had so much fun each time we met. Despite the distance, Nicolas Lefort, you still made it feel closer with your special jokes. ;-)

My family in Paris and in Toulouse has always been there for me, sharing the difficult times and also the great times. I feel really thankful to have them around despite the distance. Lisa, you have been around me in moments of doubt and joy. You always found a way to support me in going through it all, right to the end. Your words always had great vibes and gave me strength to achieve my goals and dreams. Thank you miss!


Publications

[1] T. Bernard, K. Bousias, B. de Geus, M. Lankamp, L. Zhang, A. Pimentel, P.M.W. Knijnenburg, and C.R. Jesshope. A Microthreaded Architecture and its Compiler. In M. Arenez, R. Doallo, B.B. Fraguela, and J. Tourino, editors, Proceedings of the 12th International Workshop on Compilers for Parallel Computers (CPC), pages 326–340, 2006.

[2] T. Bernard, C. Jesshope, and P.M.W. Knijnenburg. Microthreading: Model and Compiler. In Proceedings of Advanced Computer Architecture and Compilation for Embedded Systems (ACACES 2006), pages 101–104, L'Aquila, Italy, 2006. HiPEAC.

[3] T.A.M. Bernard, C.R. Jesshope, and P.M.W. Knijnenburg. Strategies for Compiling µTC to Novel Chip Multiprocessors. In International Symposium on Systems, Architectures, MOdeling and Simulation (SAMOS 07), volume 4599 of LNCS, pages 127–138. S. Vassiliadis et al., 2007.

[4] T. Bernard, K. Bousias, L. Guang, C.R. Jesshope, M. Lankamp, M.W. van Tol, and L. Zhang. A General Model of Concurrency and its Implementation as Many-Core Dynamic RISC Processors. In Proceedings of the International Conference on Embedded Computer Systems: Architectures, MOdeling and Simulation (IC-SAMOS 2008), pages 1–9, 2008.

[5] T.A.M. Bernard, C. Grelck, and C.R. Jesshope. On the Compilation of a Language for General Concurrent Target Architectures. Parallel Processing Letters, 20, March 2010.

[6] T.A.M. Bernard, C. Grelck, M.A. Hicks, C.R. Jesshope, and R. Poss. Resource-agnostic programming for many-core microgrids. In Proceedings of the 4th Workshop on Highly Parallel Processing on a Chip (HPPC 2010), August 2010.



Index

µTC keywords, 46
µTC language, 44
ACK, 24
Cilk, 21
CMP, 3
communication channels, 30, 41, 112
compilation
  CFG, 78
  challenges, 107
  classic work-flow, 70
  DFG, 78
  internal representation, 77
  of µTC language, 75
  SVP compilation schemes, 63
compiler
  back-end, 74
  conception, 104
  front-end, 70, 72
  middle-end, 72
  optimizations, 77, 86, 87
  principles, 62
  role, 100
  selection, 101
  structure, 67
  SVP extension, 106
  types, 61
concurrency management, 49
concurrency revolution, 3, 9
concurrency tree, 35, 76, 77
concurrent region, 76
context switching, 38
CoSy, 102, 145
CUDA, 18
Dataflow, 19
DDM-CMP, 23
DLP, 4
DRISC, 22
family creation, 47
family management, 39
GCC, 24, 103, 105, 106, 145
GCCGo, 21
GOMP, 20
Google Go, 21
GUPC, 20
ICI, 145
ILP, 4
LCC, 24, 103
LLP, 4
LLVM, 24, 103
MCCMP, 3
Microgrid ISA, 36
Microgrids, 36
microthreaded pipeline, 38
microthreaded processor, 38
ModularGCC, 145
MPI, 22
multicore programming, 3, 9, 12, 26, 28, 114, 141, 146
Niagara, 22
object semantics, 40, 47
Open64, 102
OpenCL, 18
OpenMP, 20
ORC, 102
parallel assumptions, 108
place, 34
Pthreads, 22
pull create mechanism, 41, 82, 113, 144
push create mechanism, 113, 144
register mapping, 42
register window, 42, 112
resource mapping, 109
SCC, 23
Sieve, 21
SUIF, 102
SVP
  approach, 28
  compiler extension, 106
  core, 38
  execution model, 29
  hardware implementation, 36
  inter-thread communication, 30
  place, 34
  software implementation, 44
TCC, 24
TDD, 117, 142
thread context, 40
thread creation, 29
thread family, 29
thread management, 39
TLP, 3
UPC, 20
WaveScalar, 23
