
KTH Royal Institute of Technology

School of Information and Communication Technology Electronic Systems

Automatic Software Synthesis from High-Level ForSyDe Models Targeting Massively Parallel Processors

Master of Science Thesis in System-on-Chip Design June 2013 TRITA–ICT–EX–2013:139

Author: George Ungureanu
Examiner: Assoc. Prof. Ingo Sander
Supervisors: Seyed Hosein Attarzadeh Niaki, Gabriel Hjort Blindell

Author: Ungureanu, George
Title: Automatic Software Synthesis from High-Level ForSyDe Models Targeting Massively Parallel Processors
Thesis number: TRITA–ICT–EX–2013:139
Royal Institute of Technology (KTH)
School of Information and Communication Technology (ICT)
Research Unit: Electronic Systems (ES)
Forum 105, 164 40 Kista, Sweden

Copyright ©2013 George Ungureanu All rights reserved. This work is licensed under the Creative Commons Attribution-NoDerivs cc by-nd 3.0 License. A copy of the license is found at http://creativecommons.org/licenses/by-nd/3.0/

This document was typeset in LaTeX with kp-fonts as font package. Most of the figures were produced with the TikZ package and the rest were drawn using Google Draw. Document build: Monday 24th June, 2013, 10:26

Abstract

In the past decade we have witnessed an abrupt shift to parallel computing, subsequent to the increasing demand for performance and functionality that can no longer be satisfied by conventional paradigms. As a consequence, the abstraction gap between applications and the underlying hardware has widened, triggering research in several directions in both industry and academia. This thesis project aims at analyzing some of these directions in order to offer a solution for bridging the abstraction gap between the description of a problem at a functional level and its implementation on a heterogeneous parallel platform using ForSyDe – a formal design methodology. The report treats applications employing data-parallel and time-parallel computation, and regards nvidia CUDA-enabled GPGPUs as the main backend platform. It proposes a heuristic transformation-and-refinement process based on analysis methods and design decisions to automate and aid in a correct-by-design backend code synthesis. Its purpose is to identify potential data parallelism and time parallelism in a high-level system. Furthermore, based on a basic platform model, it load-balances and maps the execution onto the best computation resources in an automated design flow. This design flow is embedded into an already existing tool, f2cc (ForSyDe-to-CUDA C), and tested for correctness on an industrial-scale image processing application aimed at monitoring inkjet print-head reliability.

Keywords: system design flow, high abstraction-level models, ForSyDe, GPGPU, CUDA, time-parallel, data-parallel

Acknowledgements

In the course of this thesis project several people have helped me accomplish my tasks and contributed in one way or another, and to them I am deeply grateful. First of all, I would like to thank my supervisors, Hosein and Gabriel, for all their support. Without Hosein’s scientific feedback my report would have been much less valuable, and without Gabriel’s active involvement in the software tool’s development and implementation my progress would have been even further delayed. Secondly, I would like to thank my mentors, Werner Zapka and Ingo Sander, for investing so much time and trust in my personal and professional development. Without their "leap of faith" regarding my trustworthiness I would never have had the chance to be involved in such exciting projects and work in such an amazing environment. Thirdly, I would like to thank my colleagues from XaarJet AB, who proved to be not only excellent professionals in their area of research, helping me to develop insight in areas I could never explore, but great friends as well. I am also grateful to my Master’s Program colleagues Marcus Miculcak and Ekrem Altinel. The excellent collaboration between us resulted in great outcomes, and the ideas presented in this report mostly emerged from the interesting debates and discussions I had with them. And last, but not least, I would like to show my deepest gratitude to my wife, Ana Maria, who has valiantly put up with me during the grim time when I worked on this thesis. Her unconditional support, care and understanding kept me going morally, and helped me yield results even when the workload was too heavy. This thesis is undoubtedly dedicated to her...

George Ungureanu Stockholm, June 2013

Contents

Contents
List of Figures
List of Tables
Listings
List of Abbreviations

1 Introduction
  1.1 Problem statement
  1.2 Motivation
  1.3 Objectives
  1.4 Document overview
    1.4.1 Part I
    1.4.2 Part II
    1.4.3 Part III
    1.4.4 Part IV

I Understanding the Problem

2 ForSyDe
  2.1 Introduction
  2.2 The modeling framework
  2.3 System modeling in ForSyDe-SystemC
    2.3.1 Signals
    2.3.2 Processes
    2.3.3 Testbenches
    2.3.4 Intermediate XML representation


3 Understanding Parallelism
  3.1 Parallelism in the many-core era
  3.2 A theoretical framework for parallelism
    3.2.1 Kleene’s partial recursive functions
    3.2.2 A functional taxonomy for parallel computation
  3.3 Parallel applications: the 13 Dwarfs
  3.4 Berkeley’s view: design methodology
    3.4.1 Application point of view
    3.4.2 Software point of view
    3.4.3 Hardware point of view

4 GPGPUs and General Programming with CUDA
  4.1 Brief introduction to GPGPUs
  4.2 GPGPU architecture
  4.3 General programming with CUDA
  4.4 CUDA streams

5 The f2cc Tool
  5.1 f2cc features
  5.2 f2cc architecture
  5.3 Alternatives to f2cc
    5.3.1 SkelCL
    5.3.2 SkePU
    5.3.3 Thrust
    5.3.4 Obsidian

6 Challenges

II Development and Implementation

7 The Component Framework
  7.1 The ForSyDe model
    7.1.1 f2cc approach
    7.1.2 Model limitations and future improvements
  7.2 The intermediate model representation
    7.2.1 f2cc approach
    7.2.2 Limitations and future improvements
  7.3 The process code
    7.3.1 f2cc approach
    7.3.2 Future improvements
  7.4 The GPGPU platform model
    7.4.1 Computation costs
    7.4.2 Communication costs
    7.4.3 Future improvements

8 Design Flow and Algorithms
  8.1 Model modifier algorithms
    8.1.1 Identifying data-parallel processes
    8.1.2 Optimizing platform mapping
    8.1.3 Load balancing the process network
    8.1.4 Pipelined model generation
    8.1.5 Future development
  8.2 Synthesizer algorithms
    8.2.1 Generating sequential code
    8.2.2 Scheduling and generating CUDA code
    8.2.3 Future development

9 Component Implementation
  9.1 The ForSyDe model architecture
  9.2 Module interconnection
  9.3 Component execution flow

III Final Remarks

10 Component Evaluation and Limitations
  10.1 Evaluation
  10.2 Limitations and future work

11 Concluding Remarks

IV Appendices

A Component documentation
  A.1 Building
  A.2 Preparations
  A.3 Running the tool
  A.4 Maintenance

B Proposing a ForSyDe Design Toolbox
  B.1 A simple explanatory example
  B.2 Layered refinements & Refinement loop
  B.3 Proposed architecture for the design flow tool

C Demonstrations

Bibliography

List of Figures

2.1 ForSyDe process network
2.2 ForSyDe process constructor
2.3 ForSyDe MoCs

3.1 Kleene’s composition rule
3.2 Kleene’s basic forms of composition
3.3 Kleene’s primitive recursiveness and minimization

4.1 nvidia CUDA architecture
4.2 nvidia CUDA thread division
4.3 nvidia CUDA streams

5.1 f2cc identification pattern
5.2 f2cc component connections
5.3 f2cc internal model
5.4 Obsidian program pattern

7.1 f2cc v0.1 internal model
7.2 f2cc v0.2 internal model
7.3 The ParallelComposite process
7.4 f2cc cross-hierarchy connections
7.5 Generating variable declaration code
7.6 CFunction structure in f2cc v0.1
7.7 CFunction structure in f2cc v0.2
7.8 Extracting variable information

8.1 Grouping potentially parallel processes
8.2 Building individual data paths
8.3 Loop unrolling
8.4 Modeling streamed execution


9.1 f2cc v0.2 internal model architecture
9.2 f2cc execution flow

B.1 Simple model after analysis
B.2 Simple model after refinements
B.3 Model analysis aspects
B.4 Hierarchical separation of transformation layers
B.5 Refinement loop

C.1 Demo example: input model
C.2 Demo example: model after flattening
C.3 Model after grouping equivalent comb processes
C.4 Model after grouping potentially parallel leaf processes
C.5 Model after removing redundant zipx and unzipx processes
C.6 Model after platform optimization
C.7 Model after load balancing
C.8 Model after creating pipeline directives

List of Tables

3.1 The 13 Dwarfs of parallel computation

8.1 Assigning data bursts to streams

10.1 The component’s status at the time of writing the report

Listings

2.1 ForSyDe-SystemC signal definition
2.2 ForSyDe-SystemC leaf process function definition
2.3 ForSyDe-SystemC composite process declaration
2.4 ForSyDe-SystemC testbench
2.5 ForSyDe-SystemC introspection
2.6 ForSyDe-SystemC intermediate XML format
4.1 Matrix multiplication in C
4.2 Matrix multiplication in CUDA - Host
4.3 Matrix multiplication in CUDA - Device
4.4 Concurrency in CUDA
5.1 SkelCL syntax example
5.2 SkePU function macros
5.3 SkePU syntax example
5.4 Thrust syntax example
5.5 Obsidian function declaration
5.6 Obsidian definition for pure and sync
7.1 GraphML port
7.2 XML port
7.3 Static type name declaration
7.4 Result of static type name declaration
7.5 Algorithm for parsing ForSyDe function code
8.1 Algorithm for identifying data parallel sections
8.2 Methods used by data parallel sections identification algorithm
8.3 Method used by data parallel sections identification algorithm
8.4 Proposed algorithm for identifying data parallel sections
8.5 Algorithm for platform optimization
8.6 Algorithm for load balancing
8.7 Method for data paths extraction, used by load balancing algorithm
8.8 Method for extracting and sorting contained sections
8.9 Method for splitting the process network into pipeline stages
8.10 Algorithm for code synthesis


8.11 Method for generating sequential code for composite processes
8.12 Top level for method for generating CUDA code
A.1 Platform model template
C.1 ForSyDe process function code
C.2 Extracted C code
C.3 Excerpt from the f2cc output logger
C.4 Sample sequential code: composite process execution wrapper
C.5 Sample parallel code: top level execution code
C.6 Sample parallel code: kernel function wrapper

List of Abbreviations

3D      three-dimensional
ANSI    American National Standards Institute
API     Application Program Interface
AST     Abstract Syntax Tree
CPU     Central Processing Unit
CT      Continuous Time (MoC)
CUDA    Compute Unified Device Architecture
DE      Discrete Event (MoC)
DI      Domain Interface
DRAM    Dynamic Random-Access Memory
DSL     Domain Specific Language
DUT     Design Under Test
EDSL    Embedded Domain Specific Language
ESL     Electronic System Level
f2cc    ForSyDe to CUDA C
ForSyDe Formal System Design
GPGPU   General Purpose Graphical Processing Unit
GPU     Graphical Processing Unit
GraphML Graph Markup Language
GUI     Graphical User Interface
HDL     Hardware Description Language
ILP     Instruction Level Parallelism
IP      Intellectual Property
ITRS    International Technology Roadmap for Semiconductors
MIMD    Multiple Instruction Multiple Data
MoC     Model of Computation
OS      Operating System
POM     Project Object Model
RTTI    Run-Time Type Information
SDF     Synchronous Data Flow (MoC)


SDK     Software Development Kit
SIMD    Single Instruction Multiple Data
SIMT    Single Instruction Multiple Thread
SM      Streaming Multiprocessor
SP      Streaming Processor
STL     Standard Template Library
SY      Synchronous (MoC)
UT      Untimed (MoC)
XML     Extensible Markup Language

Chapter 1 Introduction

This chapter presents the problem that will be approached throughout this thesis. The problem is stated first, followed by a brief motivation for this project in the current industrial context. Afterwards, a set of overarching goals is enumerated, followed by an overview of this report.

1.1 Problem statement

The current project aims at tackling the problem of mapping intensive parallel computation on platforms with resource support for data- and time-parallel computation, with special consideration to the leading many-core platform in industry, the General Purpose Graphical Processing Unit (GPGPU) [Kirk and Hwu, 2010]. As the design language for describing systems at a high level of abstraction, ForSyDe [Sander and Jantsch, 2004] will be used. ForSyDe is associated with a formal high-level system design methodology that raises the abstraction level in designing real-time embedded systems in order to aid the mapping on complex heterogeneous platforms through techniques like design space exploration, semantic-preserving transformations, refinement-through-replacement, etc.

The first problem that has to be treated is analyzing whether or not ForSyDe supports the description of parallel computation in harmony with the existing MoC-based framework. In this sense, a deep understanding of parallelism and its principles is necessary. The two main terms introduced in the current contribution, data parallelism and time parallelism, are presented in the context of parallel computation in Chapter 3.

The second problem that this project must attend to is the implementation of a mapping algorithm from a parallel ForSyDe model to a GPGPU backend. In order to do so, an existing tool called f2cc [Hjort Blindell, 2012]¹ has to be extended to support both the new ForSyDe-SystemC features and new data-parallel and time-parallel models.

¹ ForSyDe to CUDA C (f2cc) was developed and implemented by Gabriel Hjort Blindell as part of his Master’s Thesis in 2012.


The third and final problem treated by the ongoing thesis is the resulting software component’s validation with an industrial-scale application provided by XaarJet AB, a printing-oriented company.

1.2 Motivation

In the past decade we witnessed a dramatic shift of computation paradigms into the parallel domain, hence the dawn of the "many-core era". This shift was not a result of great innovation as much as a necessity to cope with the increasing demands for performance and functionality. It is compounded by the increasing complexity of both platforms and applications, which can no longer be handled by traditional design methods. Faced with the "parallel problem", both industry and academia came up with a number of solutions, which will be presented further in Chapter 3, Chapter 4 and Section 5.3. The main issue is that most of these solutions do not rest on a commonly agreed formal basis that could constitute a theoretical foundation for the parallel paradigm, just as Turing’s model was the foundation for the sequential paradigm. Furthermore, they represent points of view dispersed among research groups which mold the paradigms to their desired goals (productivity, backward-compatibility, verification, etc.). Most of the aforementioned solutions treat many-core parallel platforms as means of high-throughput computation. Strangely, one point of view has been ignored until now, especially by high-level programming models: treating many-cores as complex, heterogeneous and analyzable platforms. Hence, we invoke ForSyDe as a methodology to address these issues. Due to its inherent formalism, the complexity problem can be properly handled, enabling correct-by-design implementation solutions. Furthermore, the Model-of-Computation-based formalism [Sander and Jantsch, 2004] is a natural framework for expressing parallel computation, and consequently offers a good environment for a foundation for parallelism. The design flows associated with the ForSyDe methodology are based on analysis, design space exploration and semantic-preserving transformations, providing means to take advantage of architectural traits that are hard to exploit otherwise. The platform chosen for analysis is the GPGPU, since it is the most widely-used many-core platform in industry. GPGPUs are notoriously difficult to program and verify due to their low-level style of programming based on a sequential model. One application that requires high-throughput computation on a parallel platform, and whose development stagnated due to these issues, is provided by XaarJet AB. This application will be implemented in ForSyDe and used as an example for testing the current project.

1.3 Objectives

The main goal of this project is to investigate and offer proper solutions to the problems stated in Section 1.1. This task has been split into the following set of sub-goals:

1. Extensive literature study. This will include:
   • ForSyDe tutorials and research papers;
   • the f2cc architecture, tool API, implementation, and the thesis report;
   • relevant material related to parallel computation;
   • GPGPUs, their architecture and programming model;
   • alternatives to f2cc.
2. Devise a plan for expanding f2cc’s functionality.
3. Expand f2cc with new features provided by the ForSyDe-SystemC modeling framework, and implement a new frontend.
4. Implement a synthesis flow for pipelined CUDA code synthesis for f2cc.
5. Provide high quality code documentation.
6. Evaluate the improved f2cc tool with an image processing application provided by XaarJet AB.

As an optional goal, the analysis and proposal of a generic development flow tool for ForSyDe will be presented. This tool should easily embed the implemented synthesis flow but keep a high degree of flexibility in order to enable any type of flow. As it is beyond the scope of a M.Sc. thesis to implement such a fully generic tool, only proposals for future research will be delivered. Another relevant optional goal is the implementation of other types of parallelism, if time permits.

1.4 Document overview

The document is divided into four parts. The first part includes the background study performed in order to understand the full scale of the problem which we will encounter. The second part presents the individual steps of implementing the component. The third part closes the report with some concluding remarks based on evaluation results. The fourth part contains supplementary material that could not be included in the report body. The following sections aim to offer a reading guide for the current report.

1.4.1 Part I

The first part of the document digs into the theoretical problem and tries to analyse it from different perspectives. Its purpose is to provide the reader with enough knowledge to understand the full scale of the problem and the challenges that will arise during the component implementation. Chapter 2 briefly introduces the reader to the ForSyDe methodology and the ForSyDe-SystemC design framework. It presents the basic concepts and usage, focusing on the structures used in this project, and it points to further related material. The reader may skip this chapter provided he or she possesses previous knowledge of ForSyDe. Chapter 3 paves the way into the current industrial and academic context of dealing with parallelism. This chapter is of pure scientific interest, and it does not contain information directly related to the project’s implementation work efforts. Its purpose is to dig into the heart of the problem at hand at a high level, and to analyze it from different perspectives. A reader interested only in the project’s methodology may skip this chapter, keeping in mind that a few theoretical notions are defined and referenced in these sections. Still, a future ForSyDe developer is encouraged to read the provided material, since it offers valuable insight into the problems she or he might encounter.

Chapter 4 introduces the reader to the basic concepts of GPGPUs, as they are the main target platform. Material for further reading is referenced, and the basic usage of threads and streams is shown. This chapter constitutes the background for the implementation of the software component’s synthesizer module².

Chapter 5 briefly presents the current component that has to be improved, f2cc. It also analyses four alternatives to f2cc for synthesis flows targeting GPGPUs, available in the research community. The reader is strongly encouraged to read this chapter in order to understand the content in Part II.

Chapter 6, the final chapter of this part, lists and prioritizes the main challenges identified during Part I that need to be addressed by this project’s work efforts.

1.4.2 Part II

The second part concerns the development of the software component. An in-depth analysis of the software architecture and the algorithms used is presented. The part closes by putting together the previously presented components in order to depict the proposed design flow.

Chapter 7 introduces the reader to the main development framework that had to be improved in order both to deliver the desired goals and to embed this project’s design flow into the available design flow. The main features are presented in order to convey the magnitude of the work effort. Apart from the design decisions made, a comprehensive list of future improvements is proposed for a potential developer.

Chapter 8 presents the main theory and concepts behind the above-mentioned software tool. Its algorithms are both analyzed for scalability and provided with either optimized alternatives or proposals for future development, since they touch upon still young and largely unexplored issues.

Finally, Chapter 9 binds together the previous two chapters, briefly presents the component’s main implementation features and plots its execution flow.

1.4.3 Part III

The third part closes this report. An evaluation of the current state of the project is offered with regard to the initial goals, along with a list of proposals for future development. Chapter 10 evaluates the current state of the software component while listing and prioritizing proposals for future development, as they emerge from an overview of Part II. Chapter 11 concludes the M.Sc. project and gives a verdict with respect to the delivered versus the initially proposed goals.

² See Section 5.2.

1.4.4 Part IV

The fourth and last part contains the appendices. Appendix A contains documentation of the software component, including installation, preparation, usage and maintenance tips. In Appendix B, a few of the author’s personal reflections concerning the future development of ForSyDe with regard to a ForSyDe Development Toolkit are presented. Appendix C holds samples that demonstrate the software component’s usage and the results output by its intermediate steps. As an example, a core part of the Linescan application has been used.

Part I

Understanding the Problem

Chapter 2 ForSyDe

This chapter will briefly present ForSyDe (Formal System Design), a system design methodology that starts from a high-level formal description. The first section will introduce ForSyDe in the system design environment. The second section will provide a brief overview of the modeling framework, while the third section will show an example of how to model systems using the SystemC implementation. It is out of the scope of this report to provide fully comprehensive documentation, which is why the reader is encouraged to consult related documents like [Sander and Jantsch, 1999, Sander, 2003, Sander and Jantsch, 2004, Attarzadeh Niaki et al., 2012, Jakobsen et al., 2011] or the tutorials [ForSyDe, 2013].

2.1 Introduction

Keutzer et al. state that "in order to be effective, a design methodology that addresses complex systems must start at high levels of abstraction" [Keutzer et al., 2000]. ForSyDe is one such methodology for Electronic System Level (ESL) design that "raises the abstraction level of the design entry to cope with the increased complexity of (embedded systems)" [Attarzadeh Niaki et al., 2012]. ForSyDe’s main objective is to "move the design refinement from the implementation to the functional domain" [Sander and Jantsch, 2004], by capturing a design’s functionality inside a specification model. Thus, the designer works on this model, which hides the implementation details, and is able to "focus on what a system is supposed to do rather than how" [Hjort Blindell, 2012]. Working at a high level of abstraction has two main advantages. Firstly, it enables the designer to have an overview of the system and of the data flow. Secondly, it aids the identification of optimizations and of opportunities for better design decisions. One example, which will be treated extensively in this report, is the identification and exploitation of parallel patterns in algorithms described in ForSyDe. These patterns could not be exploited as naturally at compiler level, since the full context for cause and effect in the execution is lost.

Another key feature of ForSyDe is design transformation and refinement. By applying semantic-preserving transformations to the high-level model, and gradual refinement by adding backend-relevant information, one can achieve a correct implementation through a transparent process optimized for synthesis. Combining refinement with analysis at each design stage, the designer is able to reach an optimal implementation solution for the given problem. Perhaps the most important feature of ForSyDe is its formalism. Practically, the design starts from a formal model that expresses functionality. This aids in developing a correct-by-design system that can be both tested and validated at all levels of refinement, an especially difficult task to achieve without a formal basis. Also, the computational and structural features can be captured and analyzed formally, dismissing all ambiguities. This eliminates, at least theoretically, the need for post-verification and debugging, which is often the most expensive stage of a product realization process.

2.2 The modeling framework

The following subsection is based on material found in [Attarzadeh Niaki et al., 2012] and [Lee and Sangiovanni-Vincentelli, 1997].

To understand the mechanisms behind ForSyDe one should have a clear picture of its modeling framework, which determines its formal basis. In the following paragraphs, the basic concepts will be explained.

Structure

The system model is structured as a concurrent hierarchical process network. The components of a process network are processes and domain interfaces, connected through signals, as shown in Figure 2.1. The processes are triggered and synchronized only through signals, and the functions encapsulated by them are side-effect free. Hierarchy can be achieved through composite processes. They are formed by composing either leaf processes (like p1 ...p5 in Figure 2.1), or other composite processes.

Models of Computation

The Models of Computation (MoCs) describe the semantics of concurrency and computation of the processes in the system. Each process belongs to a MoC which explicitly describes its timing behavior. Currently the ForSyDe-SystemC framework supports four MoCs [ForSyDe, 2013], but more are being researched and developed. The supported MoCs are:

• The Synchronous Data Flow MoC (SDF), a variant of the Untimed MoC (UT), in which there is no explicit time description. In SDF the synchronization is done by passing tokens through signals, and it is suitable for describing analyzable streaming applications.

• The Synchronous MoC (SY), where it is assumed that neither computation nor communication takes time. It is suitable for describing digital systems or control systems, where the design model ignores timing details by implying synchronization with a master event.

Figure 2.1: ForSyDe process network (two composite processes containing the leaf processes p1...p5, belonging to MoC A and MoC B and connected through signals and the domain interfaces di1 and di2)

• The Discrete Event MoC (DE), where a time quantum is defined. It is suitable for describing test bench systems and modeling the environment. • The Continuous Time MoC (CT) that describes physical time. It is suitable for modeling analog components and physical processes.

Process Constructors

The process constructors enforce formal restrictions upon the design, to ensure analyzability and an organized structure. In order to create a leaf process in the model, the designer must choose a process constructor from the defined ForSyDe library. A process constructor takes side-effect-free functions and values as arguments and creates a process. The formal computation and communication semantics are embedded in the model based on the chosen constructor.

Figure 2.2: Example for creating a Mealy process using a ForSyDe process constructor. Source: adapted from [ForSyDe, 2013].

Figure 2.2 illustrates the concept of a process constructor by creating a process that implements a Mealy finite-state machine within the SY MoC. The process constructor defines the Model of Computation, the type of the process (finite-state machine), and the process interface. The functionality of the process is defined by a function f that specifies the calculation of the next state, another function g that specifies the calculation of the output, and a value v that specifies the initial value of the process.
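To make this concrete in code, below is a minimal, hedged sketch of how such a Mealy process could be created in ForSyDe-SystemC. The make_mealy helper, its argument order, and the function prototypes shown here are assumptions made by analogy with the make_comb2 and make_delay helpers of Listing 2.3; consult the API documentation [ForSyDe, 2013] for the exact signatures.

// Hypothetical sketch: an accumulating SY Mealy process.
// f: next-state function
void acc_ns_func(abst_ext<int>& next_st,
                 const abst_ext<int>& st, const abst_ext<int>& in)
{
    next_st = abst_ext<int>(st.from_abst_ext(0) + in.from_abst_ext(0));
}

// g: output-decoding function
void acc_od_func(abst_ext<int>& out,
                 const abst_ext<int>& st, const abst_ext<int>& in)
{
    out = abst_ext<int>(st.from_abst_ext(0));
}

// Inside an SC_CTOR, with in_sig and out_sig of type SY2SY<int>:
//     make_mealy("acc1", acc_ns_func, acc_od_func,
//                abst_ext<int>(0),   // v: the initial state
//                out_sig, in_sig);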

Domain Interfaces and Wrappers

The domain interfaces (DIs) are used to connect processes belonging to different MoCs. For MoCs of close abstraction levels, predefined DIs exist, as suggested by the bold lines in Figure 2.3. Other DIs (the dotted lines) are derived by composing existing DIs.


Figure 2.3: ForSyDe MoCs and their DIs

The wrappers are special processes which behave similarly to other processes, but which embed external models. They communicate their input/output to external simulators to co-simulate the model and assure system validation even if not all components are implemented in the ForSyDe framework. It is out of this thesis’ scope to study the effects and offer solutions for DIs and wrappers.

The Synchronous Model of Computation

This report will mainly focus on the SY MoC, since it is the only MoC considered in the design flow associated with this project’s software component. The SY MoC describes a timed concurrent system, implying that its events are globally ordered. This means that any two distinct events are either synchronous (they happen at the same moment and are associated with the same tag) or one unambiguously precedes the other [Lee and Sangiovanni-Vincentelli, 1997]. Two signals can be considered synchronous if all events in one signal are synchronous with the events in the other signal and vice-versa. A system is synchronous if every signal in the system is synchronous with every other signal in the system. Apart from ForSyDe, there are several languages that describe synchronicity, such as Lustre [Halbwachs et al., 1991], Esterel [Berry and Gonthier, 1992] or Argos [Maraninchi, 1991]. These languages describe events tagged as present (⊤) or absent (⊥). A key property is that the order of these event tags is absolute and unambiguous.

2.3 System modeling in ForSyDe-SystemC

The following section is based on material found in [ForSyDe, 2013]. This report assumes that the reader is familiar with programming in C++, understanding XML, and using the SystemC platform. For a comprehensive SystemC tutorial, the reader is encouraged to consult [ASIC World, 2013].

ForSyDe-SystemC is an implementation of the ForSyDe design framework using the SystemC kernel. SystemC is a template library for C++, an object-oriented language, with the main purpose of co-simulating and validating hardware-software systems at a high level of abstraction. Many elements of the SystemC language are not allowed to be used in ForSyDe-SystemC, and the ones which are used may appear in a different terminological context, to enforce formalism.

All ForSyDe-SystemC elements are implemented as classes inside the ForSyDe namespace. Each element belongs to a MoC, which is in fact a sub-namespace of the ForSyDe namespace. For example ForSyDe::SY holds all elements (processes, signals, DIs) belonging to the SY MoC.

2.3.1 Signals

Signals are bound to an input or an output of a ForSyDe process. They are typed and can be defined as belonging to a MoC by using their associated class from the respective MoC namespace. For the SY MoC there is a helper (template) class abst_ext<T> which is used to represent absent-extended values. Absent-extended values can be either absent, or present with a value of type T. Listing 2.1 defines a signal of the SY MoC called my_sig which carries tokens of type abst_ext<int>.

ForSyDe::SY::SY2SY<int> my_sig;

Listing 2.1: ForSyDe-SystemC signal definition
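Since absent-extended values appear in every SY process function, a short sketch of how abst_ext<T> behaves may help. It uses only the constructor and the from_abst_ext accessor that appear in Listings 2.2 and 2.4; any further queries the class may offer should be checked against the API documentation.

#include <forsyde.hpp>
using namespace ForSyDe::SY;

void abst_ext_demo()
{
    abst_ext<int> present_tok(3);  // a present token carrying the value 3
    abst_ext<int> absent_tok;      // a default-constructed token is absent

    // from_abst_ext returns the carried value, or the supplied
    // default when the token is absent (the pattern of Listing 2.2):
    int v1 = present_tok.from_abst_ext(0);  // v1 == 3
    int v2 = absent_tok.from_abst_ext(0);   // v2 == 0
    (void)v1; (void)v2;  // suppress unused-variable warnings
}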

2.3.2 Processes

Leaf processes are created using process constructors. Process constructors are templates provided by the library that are parameterized in order to create a process. The parameters to a process constructor can be initial values (e.g., initial states) or functions. From the C++ point of view, creating a leaf process out of a process constructor is equivalent to instantiating a C++ class and passing the required parameters to its constructor.

void mul_func(abst_ext<int>& out1,
              const abst_ext<int>& a, const abst_ext<int>& b)
{
    int inp1 = a.from_abst_ext(0);
    int inp2 = b.from_abst_ext(0);

#pragma ForSyDe begin mul_func
    out1 = inp1 * inp2;
#pragma ForSyDe end
}

Listing 2.2: ForSyDe-SystemC leaf process function definition

Listing 2.2 shows an example of defining a process constructor’s associated function. It looks like a regular C++ function definition, but a few particularities have to be taken into account:

• The function header contains the function name and the function parameters, in the order defined in the API (please consult the API documentation in [ForSyDe, 2013]). In this example, the function has two inputs, which have to be declared const, and one output.

• The function body, in which one can identify two separate parts: the computation part, between pragmas, which holds the C code that can be analyzed or further mapped to a platform; and the protocol part, outside the pragmas, with the sole purpose of wrapping / unwrapping the data variables into / from functional capsules like abst_ext.
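For comparison, add_func, which is referenced by the mulacc composite in Listing 2.3 but not listed in this chapter, would follow the same protocol/computation split. A plausible sketch (not taken from the thesis sources):

void add_func(abst_ext<int>& out1,
              const abst_ext<int>& a, const abst_ext<int>& b)
{
    // protocol part: unwrap the absent-extended inputs
    int inp1 = a.from_abst_ext(0);
    int inp2 = b.from_abst_ext(0);

#pragma ForSyDe begin add_func
    // computation part: plain C code that can be analyzed and
    // mapped to a backend platform
    out1 = inp1 + inp2;
#pragma ForSyDe end
}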

A composite process is the result of instantiation of other processes and wiring them together using signals. A set of rules should be respected in order to benefit from the ForSyDe features such as formal analysis, composability, etc. Otherwise, the system can still be simulated using SystemC kernel, but will not be able to follow a design flow. These rules are [ForSyDe, 2013]:

• A composite process is in fact a SystemC module derived from the sc_module class.

• A composite process is the result of instantiation and interconnection of other valid ForSyDe processes, no ad-hoc SystemC processes or modules are allowed.

• Ports of all child processes in a composite process are connected together using signals of SystemC channel type ForSyDe::[MoC]::[signal] (for example ForSyDe::SY::SY2SY).

• A composite process includes zero or more inputs and outputs of SystemC port types [MoC]_in and [MoC]_out (for example SY_in and SY_out).

• If an input port of a composite process should be connected to several child processes, an additional fanout process (i.e., ForSyDe::SY::fanout) is needed in between.

1  #ifndef MULACC_HPP
2  #define MULACC_HPP
3
4  #include <forsyde.hpp>
5  #include "mul.hpp"
6  #include "add.hpp"
7
8  using namespace ForSyDe::SY;
9
10 SC_MODULE(mulacc){
11     SY_in<int> a, b;
12     SY_out<int> result;
13
14     SY2SY<int> addi1, addi2, acci;
15
16     SC_CTOR(mulacc){
17         make_comb2("mul1", mul_func, addi1, a, b);
18
19         auto add1 = make_comb2("add1", add_func, acci, addi1, addi2);
20         add1->oport1(result);
21
22         make_delay("accum", abst_ext<int>(0), addi2, acci);
23     }
24 };
25
26 #endif

Listing 2.3: ForSyDe-SystemC composite process declaration

Having this knowledge in mind, the code from Listing 2.3 can be explained. The file includes the forsyde header library and the functions referenced by the process constructors. The SystemC constructs SC_MODULE and SC_CTOR are used to declare the composite process called mulacc. The leaf processes can be declared either in SystemC fashion, by connecting channels (signals) to ports, or by using helper functions like make_comb2 (line 17). 2.3. System modeling in ForSyDe-SystemC 15

SC_MODULE(top){
    SY2SY<int> srca, srcb, result;

    SC_CTOR(top){
        make_constant("constant1", abst_ext<int>(3), 10, srca);

        make_source("siggen1", s_func, abst_ext<int>(1), 10, srcb);

        auto mulacc1 = new mulacc("mulacc1");
        mulacc1->a(srca);
        mulacc1->b(srcb);
        mulacc1->result(result);

        make_sink("report1", report_func, result);
    }
};

Listing 2.4: ForSyDe-SystemC testbench

2.3.3 Testbenches

There are processes in each MoC that only produce / consume values and can be used for testing purposes. As seen in Listing 2.4, the testbench can be seen as a top module which connects the design under test (DUT – in this case the mulacc composite process) with these source / sink processes.
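To actually run the testbench, the standard SystemC entry point is needed. A minimal sketch, assuming the top module of Listing 2.4 is declared in a header called top.hpp:

#include <forsyde.hpp>
#include "top.hpp"

int sc_main(int argc, char* argv[])
{
    top top1("top1");  // instantiate the testbench module
    sc_start();        // hand control to the SystemC simulation kernel
    return 0;
}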

2.3.4 Intermediate XML representation

ForSyDe’s introspection feature enables it to extract structural information from the SystemC project files and encapsulate it in an XML format. The XML files represent an intermediate format that will further be fed to the system design flow, and they capture essential structural information. This information can be easily accessed, analyzed and modified (refined) by an automatic process. To enable introspection, one has to invoke the ForSyDe::XMLExport::traverse function to traverse the DUT’s top module at the start of the simulation, and to compile the design with the macro FORSYDE_INTROSPECTION defined. Listing 2.5 shows the syntax to enable the introspection feature while Listing 2.6 shows an example XML output.

#ifdef FORSYDE_INTROSPECTION
void start_of_simulation() {
    ForSyDe::XMLExport dumper("");
    dumper.traverse(this);
}
#endif

Listing 2.5: Enabling the introspection feature


Listing 2.6: Example of intermediate XML format

Chapter 3 Understanding Parallelism

This chapter aims at tackling the problem that has arisen due to the abrupt leap from the industry standard of single-processor sequential computation to many-core parallel computation. First, a short background will attempt to place the problem which industry is currently facing in the context of the many-core era. The second section will propose a theoretical framework defining parallelism, starting from Kleene’s computational model. The third and fourth sections will describe Berkeley’s view of the parallel problems and its views regarding the design of hardware and software systems. The fourth section will also compare Berkeley’s proposed methodologies with ForSyDe, and we will argue why ForSyDe is a proper methodology for designing heterogeneous systems embedding massively parallel many-core processors.

3.1 Parallelism in the many-core era

Today industry is facing an abrupt shift to parallel computing, which it is not yet ready to fully embrace. Over the past decades, the main means of pushing the IT industry forward was either increasing the clock frequency or other innovations that were inefficient in terms of transistors and power but kept the sequential programming model (ILP, deep pipelining, cache systems, etc.) [Hennessy and Patterson, 2011]. During this time there were several attempts to develop parallel computers, like MasPar [Blank, 1990], Kendall Square [Dunigan, 1992] or nCUBE [Hayes et al., 1986], but they failed due to the rapid development and increase of sequential performance. Indeed, compatibility with legacy programs, like C programs, was more valuable to industry than new innovations, and programmers accustomed to continuous improvement in sequential performance saw little need to explore parallelism. However, during the last decade industry reached its most important turning point by hitting the power limit a chip is able to dissipate, called "the power wall" in [Hennessy and Patterson, 2011]. As the International Technology Roadmap for Semiconductors (ITRS) was "replotted" during these years [ITRS, 2005, ITRS, 2007, ITRS, 2011], one could see an increasing discrepancy between earlier predictions (15 GHz in 2010, judging by the 2005 predictions [ITRS, 2005]) and actual processors’ sequential performance (currently Intel products are far below even the conservative 2007 predictions [ITRS, 2007]). This is an understandable phenomenon, due to the sudden changes in conventional wisdoms that had to be accepted by the industry. A comprehensive list of old versus new conventional wisdoms can be found in [Asanovic et al., 2006]. Apart from the well-known power wall, memory wall and ILP wall, which together constitute "the brick wall", we can point out the tenth conventional wisdom pair. According to it, programmers cannot rely on waiting for sequential performance increases instead of parallelizing their programs, since it would be a much longer wait for a faster sequential computer. Thus the current leap to parallelism is not based on a breakthrough in programming or architecture, but "(it) is actually a retreat from the more difficult task of building power-efficient, high-clock-rate, single-core chips" [Asanovic et al., 2009]. Indeed, the current solution for general computing is still replicating sequential processors into multi-cores, which has proven to work for a small number of cores (2 to 16) without drastic changes from the sequential paradigms and way of thinking. But this strategy is likely to face diminishing returns once the number of cores increases beyond 32 [Hennessy and Patterson, 2011], stepping into the many-core domain. Apart from that, the more pessimistic predictions in [ITRS, 2011] show an increased discrepancy between the performance users require and the performance devices deliver. Faced with this new knowledge and the new eleventh conventional wisdom, stating that "increasing parallelism is the primary way of increasing a processor’s performance", industry must adopt new paradigms and new functional models that maximise productivity in environments with thousands of cores. Asanovic et al. state that the only solution lies in the research community, and that "researchers [have to] meet the parallel challenge".

3.2 A theoretical framework for parallelism

As seen in Section 3.1, the difference between multi- and many-processors is qualitative rather than quantitative. While multi-processors can be regarded as multiple machines running sequentially, extended with scheduling constructs and with parallel execution mainly at program level, many-processors have a completely different foundation principle. Sequential processors have a strong foundation in Turing’s computational model, which led to the von Neumann machine. Although this model lasted for more than half a century, it no longer expresses execution platforms naturally. Maliţa et al. say that "Turing’s model cannot be used directly to found many-processors" [Maliţa and Ştefan, 2008]. Unfortunately, industry is conservative, and many of the available solutions are rather non-formal extensions of available topologies. This drawback is compounded by the theoretical weakness of the new domain. Parallel computation is still in its infancy and does not have a theoretical framework of its own that is unanimously accepted by both computer scientists and industry. During the past few decades, several groups of researchers adopted Kleene’s model of partial recursive functions [Kleene, 1936] as a computational model for parallelism [Papadimitriou, 2003, Beggs and Tucker, 2006, Chen et al., 1992]. The following subsections will define a formalism for parallel computation based on Kleene’s model, as presented in [Maliţa et al., 2006, Maliţa and Ştefan, 2009, Maliţa and Ştefan, 2008, Ştefan, 2010].

3.2.1 Kleene’s partial recursive functions

The following subsection is based on material found in [Maliţa et al., 2006, Maliţa and Ştefan, 2009, Maliţa and Ştefan, 2008].

In the same year that Turing published his paper, Kleene published the partial recursive functions. He defines computation using basic functions (zero, increment, projection) and rules (composition, primitive recursiveness and minimization). The main rule is composition; Figure 3.1 depicts a structure which computes Equation 3.1:

f(x_0, x_1, \dots, x_{n-1}) = g\bigl(h_0(x_0, x_1, \dots, x_{n-1}), \dots, h_{m-1}(x_0, x_1, \dots, x_{n-1})\bigr) \qquad (3.1)

where h_x represents an increment function and g represents a projection function. Both the first and the second level of computing are parallel; the only restriction is that g cannot start its computation before all h have finished.

Figure 3.1: The structure associated with the composition rule (the inputs x_0, x_1, ..., x_{n-1} feed the blocks h_0, h_1, ..., h_{m-1}, whose outputs feed g, producing out = f(x_0, x_1, ..., x_{n-1}))

A Universal Turing Machine is a sequential composition of functions (for example the h_i), thus the parallel aspect of the computation is lost. On the other hand, Kleene processes are inherently parallel: the h_i functions are independent, and g can also be independent if it works in a pipelined fashion on different input data x_0, x_1, ..., x_{n-1}. Thus, Kleene’s model is a natural starting point for a parallel computation model.

From the general form of composition expressed in Equation 3.1 one can derive several simplified forms that describe the other rules:

• data-parallel composition, described by Equation 3.2, is a limit case of Equation 3.1, where n = m and g is the identity function.

f(x_0, x_1, \dots, x_{n-1}) = [h_0(x_0), \dots, h_{n-1}(x_{n-1})] \qquad (3.2)

• serial composition, described by Equation 3.3, is defined for p applications of the composition with m = 1, where the function is applied on an input stream <x_0, x_1, ..., x_{n-1}> in a pipelined fashion.

f(x) = k_{p-1}(k_{p-2}(\dots k_0(x) \dots)) \qquad (3.3)

• reduction composition, described by Equation 3.4, is a special case of Equation 3.1, where h is the identity function and the input vector [x_0, x_1, ..., x_{n-1}] is reduced to a scalar.

f(x_0, x_1, \dots, x_{n-1}) = g(x_0, x_1, \dots, x_{n-1}) = out \qquad (3.4)

Figure 3.2: The basic forms of composition: (a) parallel composition, (b) serial composition, (c) reduction composition
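As an illustration that is not part of the original formalism, the three composition forms correspond to the familiar map, pipeline and reduce idioms. A minimal C++ sketch with stand-in functions h and g:

#include <algorithm>
#include <numeric>
#include <vector>

int h(int x)        { return x + 1; }  // a basic (increment-style) function
int g(int a, int b) { return a + b; }  // a reduction function

int main()
{
    std::vector<int> xs = {0, 1, 2, 3};
    std::vector<int> ys(xs.size());

    // data-parallel composition (Eq. 3.2): the same h applied
    // independently to each element -- a "map"
    std::transform(xs.begin(), xs.end(), ys.begin(), h);

    // serial composition (Eq. 3.3): a pipe k_0..k_{p-1} applied to one
    // input; on a stream, each stage may process a different element
    int piped = h(h(h(0)));

    // reduction composition (Eq. 3.4): the vector reduced to a scalar by g
    int reduced = std::accumulate(xs.begin(), xs.end(), 0, g);

    return (piped + reduced) % 256;
}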

The composition rule is strong and natural enough to describe almost all types of data-intensive problems and applications, and can be associated with many implementations. The last two rules, primitive recursiveness and minimization, are harder to express in implementations and less natural to associate with structural implementations.

Figure 3.3: Structure for two of Kleene’s rules: (a) primitive recursion, (b) minimization

Primitive recursion is described by Equation 3.5 and by a structure like the one in Figure 3.3a. This structure is fully parallel since, apart from the serial composition, it supports speculation at each level of computation through a reduction network. The function has an initial value, described by block H, which feeds the infinite pipeline. The reduction network R inputs an infinite vector of {scalar, predicate} pairs, corresponding to the predicated result of each stage. Thus the result will always return the scalar which is paired with the predicate having value 1.

f(x, y) = g(x, y, f(x, y-1)) \quad \text{where} \quad f(x, 0) = h(x) \qquad (3.5)

The minimization rule is described by Equation 3.6: it computes the function f(x) as the minimal value of y for which g(x,y) = 0. As with the previous rule, the structure depicted in Figure 3.3b is an example of applying minimization while keeping the concept of ideal parallelism by using speculation. Each block G_i computes the predicated value and returns a pair of the form {i, g(x,i) == 0}, and the reduction network R extracts the first pair having the predicate value 1 (if any).

f(x) = \min_y \, [g(x, y) = 0] \qquad (3.6)

3.2.2 A functional taxonomy for parallel computation

Since new mathematical models are emerging to describe parallelism, the huge diversity of solutions involved in actual implementations tends to make the classic computer taxonomies [Flynn, 1972, Xavier and Iyengar, 1998] obsolete. One such taxonomy, introduced in [Flynn, 1972], describes parallel machines from a structural point of view, where parallelism is symmetrically described using a two-dimensional space: data × programs. Current parallel applications cannot fit in one of Flynn’s categories (for example SIMD or MIMD), since they require more than one type of parallelism. In [Maliţa and Ştefan, 2008] and [Ştefan, 2010] a new, more consistent functional taxonomy is proposed, starting from the way a function is computed, as presented in Subsection 3.2.1. Five types of parallel computation are emphasized:

• Data-parallel computation, as seen in the data-parallel composition. It is applied on vectors, and each component of the output vector results from the predicated execution of the same program.

• Time-parallel computation, as seen in the serial composition. It applies a pipe of functions on input streams and, according to [Hennessy and Patterson, 2011], it is efficient if the length of the stream is much greater than the pipe’s length.

• Speculative-parallel computation, extracted as a functional approach for solving primitive recursion and minimization. This computation can be described by replacing Equation 3.1 with the limit case in Equation 3.7. It usually applies the same variable to slightly different functions.

• Reduction-parallel computation, deduced from the reduction composition. Each vector component relates equivalently to the reduction function.

• Thread-parallel computation, which is not directly presented in Subsection 3.2.1 but can be deduced by replacing Equation 3.1 with the limit case in Equation 3.8. It also describes the timing behaviour of interleaved threads.

h_i(x_1, \dots, x_m) = h_i(x), \qquad g(h_1(x), \dots, h_m(x)) = \{h_1(x), \dots, h_m(x)\} \qquad (3.7)

h_i(x_1, \dots, x_m) = h_i(x), \qquad g(h_1(x_1), \dots, h_m(x_m)) = \{h_1(x_1), \dots, h_m(x_m)\} \qquad (3.8)

Based on this taxonomy, the types of computation can be separated into two categories:

• complex computation, where parallelism is tightly interleaved, allowing efficient complex computations. It includes the thread-parallel computation.

• intensive computation, where parallelism is strongly segregated, allowing large-sized simple computations. It groups together the data-parallel, time-parallel, speculative-parallel and reduction-parallel computations.

[Maliţa et al., 2006] concludes that "any computation, if it is intensive, can be performed efficiently in parallel".

3.3 Parallel applications: the 13 Dwarfs

The parallel problem has been studied intensely in the last decade by numerous research groups focussing on multi- and many-core processing [Georgia Tech, 2008, Habanero, 2013, PPL, 2013, Illinois, 2013, Par Lab, 2013]. One of these groups originates from the University of California, Berkeley and consists of multidisciplinary researchers. In [Asanovic et al., 2006] and [Asanovic et al., 2009] they discuss an application-oriented approach that treats the parallel problem from different perspectives and at different layers of abstraction. They motivated this approach by examining parallelism at the extremes of the computing spectrum, namely embedded computing and high performance computing, arguing that "these two ends of the computing spectrum have more in common looking forward than they had in the past" [Asanovic et al., 2006]. By studying the success driven by parallelism for many of the applications, it is possible to synthesize feasible and correct solutions based on application requirements. Thus, the main approach is to "mine the parallelism experience" to get a broader view of the computation mechanisms. Also, since parallelism is still not clearly described by formal means, benchmarking programs cannot be used as measurements of innovation. Asanovic et al. argue that "there is a need to find a higher level of abstraction for reasoning about parallel application requirements" [Asanovic et al., 2006]. This point is valid, judging from the experience of successfully mapping high-performance scientific applications on embedded platforms. Extending the work of Phil Colella [Colella, 2004], the research team grouped similar applications into thirteen "dwarfs": equivalence classes based on similarity of computation and data movement, obtained by studying programming patterns. They are presented in Table 3.1.

1. Dense Linear Algebra. Data: dense matrices or vectors. Communication pattern: memory strides. Description: usually vector-vector, matrix-vector and matrix-matrix operations. Application examples: Block Tridiagonal Matrix, Symmetric Gauss-Seidel. Hardware: vector computers, array computers.

2. Sparse Linear Algebra. Data: compressed matrices. Communication pattern: indexed loads/stores. Description: data includes many zero values, compressed for low storage and bandwidth, with gather/scatter. Application examples: Conjugate Gradient. Hardware: vector computers.

3. Spectral Methods. Data: frequency domain. Communication pattern: multiple butterfly patterns. Description: combination of multiply-add operations and specific data permutations. Application examples: Fourier Transform. Hardware: DSPs, Zalink PDSP.

4. N-Body Methods. Data: discrete points. Communication pattern: interaction between points. Description: particle-particle methods, O(N²); hierarchical particle methods, O(N log N) or O(N). Application examples: Fast Multipole Method. Hardware: GRAPE, MGRAPE.

5. Structured Grids. Data: regular grids. Communication pattern: high spatial locality. Description: grid may be subdivided into finer grids ("Adaptive Mesh Refinement"); transitions between granularities may happen dynamically. Application examples: Multi-Grid, Scalar Pentadiagonal, Hydrodynamics. Hardware: QCDOC, BlueGene/L.

6. Unstructured Grids. Data: irregular grids. Communication pattern: multiple levels of memory reference. Description: location and connectivity determined from neighboring elements. Application examples: Unstructured Adaptive. Hardware: Tera Multi Threaded Architecture.

7. MapReduce. Data: –. Communication pattern: not dominant. Description: calculations depend on statistical results of repeated random trials; considered embarrassingly parallel. Application examples: Monte Carlo, ray tracer. Hardware: NSF Teragrid.

8. Combinatorial Logic. Data: large amounts of data. Communication pattern: bit-level operations. Description: simple operations on very large amounts of data, often exploiting bit-level parallelism. Application examples: encryption, Cyclic Redundancy Codes, IP NAT. Hardware: hardwired algorithms.

9. Graph Traversal. Data: nodes, objects. Communication pattern: many lookups. Description: algorithms involving many levels of indirection and small amounts of computation. Application examples: route lookup, XML parsing, collision detection. Hardware: Sun Niagara.

10. Dynamic Programming. Data: –. Communication pattern: –. Description: solve simpler overlapping sub-problems; used in optimization problems with many feasible solutions. Application examples: Viterbi decode, variable elimination. Hardware: –.

11. Back-track and Branch+Bound. Data: –. Communication pattern: –. Description: optimal solutions by dividing into subdomains and pruning suboptimal subproblems. Application examples: kernel regression, Network Simplex Algorithm. Hardware: –.

12. Graphical Models. Data: nodes. Communication pattern: –. Description: graphs where random variables are nodes and conditions are edges. Application examples: Bayesian networks, Hidden Markov Models. Hardware: –.

13. Finite State Machines. Data: states. Communication pattern: transitions. Description: behavior defined by states, transitions and events. Application examples: PNG, JPEG, MPEG-4, TCP, compilers. Hardware: –.

Table 3.1: The 13 Dwarfs of parallel computation

While the first twelve dwarfs show inherent parallelism, the parallelization of the thirteenth constitutes a challenge. The main reason is that it is difficult to split the computation into several parallel finite state machines. Although the Berkeley research group favors excluding the thirteenth dwarf from the parallel paradigm, considering it "embarrassingly sequential", architectures like Revolver [Öberg and Ellervee, 1998], the Integral Parallel Architecture [Ştefan, 2010], or the BEAM [Codreanu and Hobincu, 2010] demonstrate that these problems can successfully be parallelized, since they derive from the complex computation class from Subsection 3.2.2.

3.4 Berkeley’s view: design methodology

The following section is based on material found in [Asanovic et al., 2006, Asanovic et al., 2009].

As stated in the previous section, the multidisciplinary research team from Berkeley studied the parallel problem from a broad range of perspectives. Like ForSyDe, they support the idea of raising the level of abstraction for both programming and system design. The main difference, though, is that they adopt a programmer-friendly, application-oriented approach for the productivity layer rather than a formal starting point.

This section will present Berkeley's view in comparison to the ForSyDe methodology, in order to merge these two schools of thought into an even stronger conceptual foundation. In addition, we will try to demonstrate that ForSyDe is a proper methodology for approaching parallel computational problems as well, not only real-time embedded ones.

3.4.1 Application point of view

Section 3.3 pointed out the need to mine applications that demand more computing power and can absorb the increasing number of cores for the next decades, in order to provide concrete goals and metrics to evaluate progress. To this end, a number of applications are studied and developed based on different criteria: compelling in terms of marketing and social impact, short-term feasibility, longer-term potential, speed-up or efficiency requirements, platform coverage, potential to enable technology for other applications, and involvement in usage and evaluation of technology. Among the applications considered are: music and hearing, speech understanding, content-based image retrieval, intraoperative risk assessment, parallel browsers, 3D graphics, etc. Currently, ForSyDe has a suite of case studies originating in industrial or academic applications, some of which, for example Linescan, focus on industrial control. In the future, its application span could broaden into other user-oriented areas (for example actor-based parallel browsers [Jones et al., 2009]) by studying and following other successful attempts such as Berkeley's. ForSyDe has a good profile for many of the applications in Table 3.1. Since it expresses parallelism inherently, it could fit well in the above classes of applications.

3.4.2 Software point of view

The Berkeley research group admits that developing a software methodology for bridging the gap between users and the parallel IT industry is the most vexing challenge. One reason is the fact that many programmers are unable to understand parallel software. Another reason is that both compilers and operating systems have grown so large that they are resistant to changes. Also, it is not possible to properly measure improvement in parallel languages, since most of them are prototypes that reflect solely the researchers' point of view. Eight main ideas are presented, and they will be analysed separately in the following paragraphs.

Idea #1: Architecting parallel software with design patterns, not just parallel programming languages This is the first idea proposed by Asanovic et al. Since "automatic parallelism doesn't work" [Asanovic et al., 2009], they propose to re-architect software through a "design pattern language", explored in earlier works such as [Alexander, 1977, Gamma et al., 1993, Buschmann et al., 2007]. The pattern language is a collection of related and interlocking patterns, constructed such that the patterns flow into each other as the designer solves a design problem. Computational and structural patterns can be composed to create more complex patterns. These are conceptual tools that help a programmer reason about a software project and develop an architecture, but they are not themselves implementation mechanisms for producing code. Unfortunately, this is one of the ideas that arouse disputes between researchers belonging to the two schools of thought studied in this report. While the Berkeley group openly disagrees with using formalism as a starting point in a design methodology, ForSyDe enforces formal restrictions at early design stages. A reason against using a formal model as a starting point is that it is understood only by a narrow group of researchers, and that it limits expressiveness in designing solutions, at least for the uninitiated. We argue that this is a common misconception amongst the research groups and has to be overcome in order to fully take advantage of both concepts. Case studies have shown that, with structured thinking, the formal constraints do not limit expressiveness. On the contrary, describing computation through processes and signals aids the designer in keeping a clear picture of the entire system. It is simpler to mask MoC details under a design pattern than to assure formal correctness for large pattern-based systems. Through masking, the designer needs only minimal prior knowledge of the mathematical principles behind MoCs, while still being able to respect the formalism and fully take advantage of it. Thus, ForSyDe could easily be extended into a pattern framework, since it allows composable patterns of process networks. This subject is further treated in Section 5.3, and could be a relevant point of entry for future research.

Idea #2: Split productivity and efficiency layers, not just a single general-purpose layer Productivity, efficiency and correctness are inextricably linked and must be treated together during the system design stages; however, they are not a single-point solution and must be addressed in separate layers. The productivity layer uses a common composition and coordination language to glue together the libraries and programming frameworks produced by the efficiency-layer programmers. Implementation details are abstracted away at this layer. Customizations are made only at specified points and do not break the harmony of the design pattern.

The efficiency layer is very close to machine language, allowing the best possible algorithm to be written in the primitives of the layer. This is the working ground for specialist programmers trained in the details of parallel technology. This concept is powerfully rooted in the parallel programming community, which explains the multitude of template libraries, DSLs and language extensions for specific parallel platforms (i.e. the ones presented in Section 5.3) that appeared during the last decade. Although ForSyDe is not only a programming language, but rather a system design methodology, it follows this conceptual pattern. While the ForSyDe-Haskell or ForSyDe-SystemC design frameworks can be associated with the productivity layer, the suite of tools for analysis, transformation, refinement and synthesis can be associated with the efficiency layer. The schema proposed in Appendix B extends this seemingly simple but powerful idea.

Idea #3: Generating code with search-based autotuners, not compilers Since compilers have grown so large and are resistant to changes, one cannot rely on them to identify and optimise parallel applications. Instead, one useful lesson can be learned from autotuners. These are optimization tools that generate many variants of a kernel and measure each variant by running it on the target platform. Autotuners are built by efficiency-layer programmers. ForSyDe replaces the idea of auto-tuning with design space exploration mechanisms. Since they involve off-line analysis of systems before synthesizing implementation solutions, these mechanisms have a much higher potential for productivity performance. Even so, ForSyDe's development should be aware of autotuner mechanisms, since it can benefit from hybrid synthesis methods. One best-practice example is narrowing down the design space, then running several analyses in parallel on virtual platforms with different configurations and choosing the best solution.

Idea #4: Synthesis with sketching This idea encourages programmers to write "incomplete sketches" of programs, in which they provide an algorithmic skeleton and let the synthesizer fill in the holes in the sketch. As presented in Chapter 2, this is one of ForSyDe's ground rules, embodied by the abstraction of design details. Moreover, in earlier ForSyDe publications, process constructors were referred to as skeletons [Sander and Jantsch, 1999], which mirrors the concept of a "sketch".

Idea #5: Verification and testing, not one or the other The research group enforces modular verification and automated unit-test generation through high-level semantic constraints on the behavior of the individual modules (such as parallel frameworks and parallel libraries). They identified this to be a challenging problem, since most programmers find it convenient to specify local properties using assert statements or static program verification. As a consequence, these programmers would have a hard time adapting to the high-level constructs. Since the ForSyDe methodology starts from a formal, correct-by-design model and reaches an implementation mostly through semantic-preserving transformations, this problem no longer applies. Validation may be elegantly taken care of by the design's formalism, while early-stage testing can be achieved by executing the model [Attarzadeh Niaki et al., 2012].

Idea #6: Parallelism for energy efficiency Using multiple cores to complete a task is more efficient in terms of energy consumption [Hennessy and Patterson, 2011]. Several mechanisms are recommended, such as task multiplexing, the use of parallel algorithms to amortize instruction delivery, and message passing instead of cache coherency. There is a number of projects related to ForSyDe which address power estimation in system design [Zhu et al., 2008, Jakobsen et al., 2011]. Since energy is a pressing issue especially in embedded systems, this problem will remain a main topic for future ForSyDe research.

Idea #7: Space-time partitioning for deconstructed operating systems A spatial partition contains the physical resources of a parallel machine. Space-time partitioning virtualizes spatial partitions by time-multiplexing whole partitions onto the available hardware. As seen in [Ştefan, 2010], the partitioning can be done at a low instruction level or, as the Berkeley research group proposes, at a "deconstructed OS" level. Currently there is no OS support in ForSyDe, but implementing it could be seen as a mapping problem: the temporal dimension has to be described along with the partitioning of tasks to resources. Therefore, it counts as a design space exploration problem, and could be a relevant future research topic.

Idea #8: Programming model efforts inspired by psychological research This idea has been mentioned in [Asanovic et al., 2006], and identifies the need for researching a human-centric programming (in our case design) model, along with the other already widespread models: hardware-centric, application-centric and formalism-centric. Since humans write programs, studying human psychology could lead to breakthroughs in identifying sources of human errors, problem-solving abilities, maintaining complex systems, etc. The above-mentioned report also raises awareness of the risks of human experiments when testing a new model: examples in academic environments have shown that the test subjects' intuition has been challenged and changed during these experiments. Regarding this research topic "there has been no substantial progress to date" [Asanovic et al., 2006]. Even so, examples like [Hochstein et al., 2005, Malik et al., 2012] and other research in the field could constitute valuable resources for future ForSyDe studies on user or market impact.

3.4.3 Hardware point of view

Driven by the power wall, there are changes both in the programming model and in the architectural design of systems. Many-core architectures have implications that may potentially work in industry's advantage. Apart from program parallelism, many simple low-frequency cores imply lower resource consumption [Hennessy and Patterson, 2011], simplified design and verification, and a higher yield. Also, guided by Amdahl's law [Hennessy and Patterson, 2011], heterogeneous hardware platforms are encouraged, since they optimally exploit different aspects of computation. There are a number of guidelines for parallel hardware design in both [Asanovic et al., 2006] and [Asanovic et al., 2009]. Among them, we could point out new mechanisms for assuring data coherency, like transactional memory or message passing, and accurate, complete counters for performance and energy. Since these features favor both parallel computation and correct system design methods, ForSyDe's future development could be guided and eased by platforms implementing them.

As presented in Section 3.4, we can draw the conclusion that ForSyDe is indeed an adequate methodology both for synthesizing parallel software and for designing systems with parallel platforms. Its primary target, real-time embedded applications, can successfully be extended to parallel applications while keeping its concepts and philosophy intact. Furthermore, even adepts of functionalism should not be discouraged by the formal style of programming, since it is much easier for ForSyDe to "put on a functional coat" than for a functional approach to implement a formal methodology. This functional flavor can materialize in the form of MoC-based IP blocks. Nevertheless, the advantages of formalism are obvious.

Chapter 4

GPGPUs and General Programming with CUDA

In order to understand the tasks that f2cc needs to solve and the challenges further implied by the component implementation, one has to understand the target platform: the GPGPU. This chapter provides a short overview of the GPGPU landscape for a quick grasp of this very large domain. After a brief introduction, the reader is presented with the main GPGPU architectural features in Section 4.2, followed by an example of CUDA programming in Section 4.3. The chapter closes with an overview of CUDA streams, which constitute the main tool for delivering this thesis' objectives.

4.1 Brief introduction to GPGPUs

As their name suggests, GPUs were propelled by the graphics industry, especially gaming. Throughout their evolution, their main purpose was to render high-resolution 3D scenes in real time [Nickolls and Dally, 2010]. The marketing term GPU was coined by nvidia in 1999 when they released GeForce 256, "the world's first GPU" [nvidia, 2013b]. Since then, the number of transistors has increased from 23 million to 7 billion, and the computing power has risen from 480 mega-operations per second to 4.5 single-precision Tera-FLOPS for the nvidia GeForce GTX Titan in 2013 [nvidia, 2013c]. This tremendous boost in computing power has attracted developers from areas other than graphics processing [Nickolls and Dally, 2010, Kirk and Hwu, 2010, Joldes et al., 2010, Owens et al., 2008]. Pioneering developers had to express non-graphical computations through the graphics API shader languages, which were not ideal for general-purpose computing. Furthermore, inefficiencies due to poor load balancing between the CPU and the GPU made these platforms difficult to handle for other applications. Because of the increased demands in general-purpose computing using massively parallel processors, manufacturers opted to unify graphics and general-purpose computing into a single architecture [Lindholm et al., 2008]. Further modifications to the GPU architecture, like support for integer and floating-point arithmetic, synchronization barriers, etc., allowed GPUs to be programmed using a general-purpose imperative language like C. This led to the dawn of the General-Purpose GPU (GPGPU) in 2006, when nvidia launched GeForce 8800, the first unified graphics and computing architecture [Nickolls and Dally, 2010].

4.2 GPGPU architecture

The following subsection is based on material found in [Kirk and Hwu, 2010] and [Hjort Blindell, 2012]. For a more detailed presentation, the reader is encouraged to read the latter report.

The GPU architecture reflects its purpose, namely the processing of 3D graphics. Most of the graphical algorithms in the graphics pipeline [Kirk and Hwu, 2010] can be grouped under the Dense Linear Algebra motif (see Section 3.3). Therefore, many architectural features can be explained in relation to the first of Berkeley's Dwarfs and to the legacy compatibility with the graphics pipeline. The GPU is a throughput-oriented architecture, assuming plenty of parallelism and employing thousands of simple processing units, in contrast with the latency-oriented architecture of a CPU. Thus, staying loyal to the historical term "graphics accelerator", its purpose is not to replace the CPU, but to act as a co-processor. GPGPUs follow the principle that computation is much cheaper than memory transfers. The memory organization differs from that of a general-purpose CPU and has its own throughput-oriented memory hierarchy. Furthermore, raw computation is favored over lookup algorithms with precomputed values, since it is faster and less power-consuming. For processing individual graphical elements, such as pixels or triangles, a small program, named a kernel function in CUDA, is invoked. In order to keep the processing cores busy while waiting for long-latency operations, GPGPUs apply hardware multithreading to make thread switches with fine granularity. A kernel normally involves spawning, executing and retiring thousands of threads with minimal overhead.

Figure 4.1: Overview of the nvidia CUDA platform (clusters of streaming multiprocessors with SPs, SFUs, register files, caches and shared memory, connected to DRAM). Source: adapted from [Hjort Blindell, 2012]

Figure 4.1 illustrates a typical nvidia CUDA GPGPU. In the following enumeration, a brief explanation of the main architectural components is given:

• a cluster is a subdivision of the GPGPU, which contains a pair of streaming multiprocessors (SM);
• each SM has its storage elements connected to high-bandwidth DRAM interfaces. The DRAM memory as a whole is called the global memory;
• each SM has 8 streaming processors (SP), known as CUDA cores. They are the primary computing units, and each one is fully pipelined and in-order. The number of SPs, as well as the number of SMs, depends on the device;
• the special-function unit (SFU) computes fast approximations for certain functions, such as sin x or 1/√x;
• the register file holds the thread context. Parts of it can be spilled into a portion of the DRAM called local memory, a name referring to its scope rather than its locality;
• the shared memory is a small, low-latency, multibanked on-chip memory which allows application-controlled caching of DRAM data;
• the constant cache is used for reducing the latency of operations with constants;
• the texture memory caches neighboring elements of 2D matrices, for improved performance in this type of memory access;
• newer CUDA GPGPUs are equipped with L1 and L2 caches as well.

In order to manage the large population of threads and to achieve scalability, the threads are divided into a set of hierarchical groups: grids, thread blocks and warps [Kirk and Hwu, 2010]. Figure 4.2 illustrates this organization, shortly explained in the following bullet points:

Figure 4.2: Thread division and organization (host and device, grids, thread blocks and threads) as seen from the programmer's point of view. Source: adapted from [Kirk and Hwu, 2010, Hjort Blindell, 2012]

• a grid is formed of all the threads belonging to the same kernel invocation. Its geometry and size are configurable by the programmer;
• a thread block contains the threads to execute, arranged in either a 1D, 2D or 3D geometry, configurable by the programmer. The geometry limitations depend on the device's generation. Each SM supports up to 8 thread blocks, and the GPU dynamically balances the workload across SMs;
• threads are executed in a manner similar to SIMD execution, called Single Instruction Multiple Thread (SIMT), where threads adjacent to each other are grouped in warps;
• warps eligible for execution are scheduled and executed as a whole. It is thus the programmer's task to assure that the warps are distributed so that there are no idle resources during execution. Also, thread divergence within warps (caused by data-dependent branch instructions) leads to idle resources;
• the CPU and the GPGPU are logically separated as host and device. As suggested in Figure 4.2, each kernel call on the host executes the kernel on the device.
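To make the thread hierarchy concrete, the short kernel below (an illustrative example, not taken from the thesis; the name saxpy and all parameters are assumptions) shows how each thread computes a unique global index from its block and thread IDs. The guard is needed because the grid may contain more threads than data elements.

__global__ void saxpy(int n, float a, const float* x, float* y) {
    // each thread derives a unique global index from its block and thread IDs
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                 // the grid may be larger than the data set
        y[i] = a * x[i] + y[i];
}

// invoked from the host with enough 256-thread blocks to cover n elements:
// saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, x, y);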

4.3 General programming with CUDA

The following subsection is based on material found in [Kirk and Hwu, 2010]. This text will provide a very brief example of CUDA programming. For a more comprehensive tutorial, the reader is encouraged to consult [Hjort Blindell, 2012, Kirk and Hwu, 2010]. For CUDA programmers, GPUs are massively parallel processors programmed in C with extensions. nvidia provides a software development kit (SDK) consisting of a set of libraries and a compiler called nvcc, so that developers can build applications in a familiar sequential environment. In the terms of Section 3.4, CUDA's approach is a hardware-centric one, since it has constructs for explicit control of the hardware architecture. Although it enables an experienced developer to take advantage of the hardware features and tweak programs to boost performance, this approach is a tedious one, and not optimal in terms of productivity on a large scale. Its similarity to C, however, enables skilled programmers to port appropriate C programs to CUDA C "in a matter of days" [Kirk and Hwu, 2010]. In order to understand the CUDA C programming style, an example program performing matrix multiplication is shown in Listings 4.1, 4.2 and 4.3. The multiplication between matrix A and matrix B is done by calculating the dot product between a row in A and a column in B, as suggested in Equation 4.1¹.

c_{i,j} = row_{A,i} · col_{B,j}    (4.1)

Listing 4.1 illustrates a pure C approach to the given problem. It computes each element as a dot product between the two matrices A and B, in an iterative fashion that can be optimized by compilers for sequential machines. The CUDA C implementation of the same problem takes a rather different approach. The developers can no longer rely on automatic compiler optimizations; they have to take full responsibility for the proper utilization of resources, which can differ from application to application. As seen in Figure 4.2, there is a logical separation between host and device. Since there will be two programs syncing with each other, two files have to be written. Listing 4.2 shows the host code. Generally, a CUDA program follows five steps, identified in the given code:

1. allocate memory on the device, with the cudaMalloc library function;
2. copy data from host to device, using cudaMemcpy;

¹ For more on matrix multiplication, the reader may consult [Hennessy and Patterson, 2011] or any linear algebra manual.

void matrixMult(int* a, int* b, int* c) {
    int i, j, k;
    for (i = 0; i < N; ++i) {
        for (j = 0; j < N; ++j) {
            int sum = 0;
            for (k = 0; k < N; ++k) {
                sum += a[i * N + k] * b[j + k * N];
            }
            c[i * N + j] = sum;
        }
    }
}

Listing 4.1: C code for matrix multiplication

3. perform data calculations on the device, by calling a kernel function. The syntax shows the grid and thread configuration for the given kernel;
4. copy the result back to the host, with cudaMemcpy;
5. free the allocated memory, achieved with cudaFree.

Listing 4.3 illustrates one choice of implementation for the given problem on the GPGPU device. This choice was selected since it shows the usage of threads, synchronization barriers and shared memory – the main mechanisms that will be employed in the current project. The kernel declaration is marked by the keyword __global__. Two variables are allocated in the shared memory, Da and Db, to temporarily store intermediate blocks of data². These locations are shared by multiple threads to drastically reduce the memory access time. To allow computations on matrices larger than supported by the device resources, the matrices can be split across multiple kernel invocations, and thus across multiple thread blocks. As seen in Listing 4.3, individual elements in large matrices can be identified by indexing them with displacements relative to the thread index (threadIdx) and block index (blockIdx). Perhaps the most significant change in matrixMult is that the two outer loops have been removed. Since CUDA is a parallel platform, calculating each element iteratively would be a waste of resources, because the compiler cannot identify the potential for parallelism. Instead, each element is calculated in parallel by a thread, identified by its indices. Synchronization barriers (__syncthreads) have been used to separate the computation steps. They ensure that all threads have finished loading their data or performing their calculations before proceeding to the next tile.

² For more on the blocked matrix multiplication algorithm, the reader may consult [Hennessy and Patterson, 2011] and [Kirk and Hwu, 2010].

// Memory allocation on device
cudaMalloc((void **) &Ma, size);
cudaMalloc((void **) &Mb, size);
cudaMalloc((void **) &Mc, size);

// Host to device memory copy
cudaMemcpy(Ma, a, size, cudaMemcpyHostToDevice);
cudaMemcpy(Mb, b, size, cudaMemcpyHostToDevice);

// Kernel invocation
dim3 gridDimension(1, 1);
dim3 blockDimension(N, N);
matrixMult<<<gridDimension, blockDimension>>>
    ((int *) Ma, (int *) Mb, (int *) Mc);

// Device to host memory copy
cudaMemcpy(c, Mc, size, cudaMemcpyDeviceToHost);

// Device memory deallocation (the device pointers, not the host arrays)
cudaFree(Ma); cudaFree(Mb); cudaFree(Mc);

Listing 4.2: CUDA Host code for matrix multiplication

__global__ void matrixMult(int* a, int* b, int* c) {
    __shared__ int Da[TILE_SIZE][TILE_SIZE];
    __shared__ int Db[TILE_SIZE][TILE_SIZE];

    int bx = blockIdx.x;  int by = blockIdx.y;
    int tx = threadIdx.x; int ty = threadIdx.y;
    int row = by * TILE_SIZE + ty;
    int col = bx * TILE_SIZE + tx;

    // Calculate dot product
    int i, sum = 0;
    for (i = 0; i < N / TILE_SIZE; ++i) {
        int k, sum_tmp = 0;

        // Load tile
        Da[ty][tx] = a[row * N + TILE_SIZE * i + tx];
        Db[ty][tx] = b[col + (TILE_SIZE * i + ty) * N];
        __syncthreads();

        // Calculate partial dot product
        for (k = 0; k < TILE_SIZE; ++k) {
            sum_tmp += Da[ty][k] * Db[k][tx];
        }
        sum += sum_tmp;
        __syncthreads();
    }
    c[row * N + col] = sum;
}

Listing 4.3: CUDA Device code for matrix multiplication
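Note that the host code in Listing 4.2 launches a single block of N×N threads, whereas the tiled kernel in Listing 4.3 assumes one thread block per output tile. A launch configuration matching the tiled kernel would look like the following sketch (an assumption added for illustration, not part of the original listings):

// one TILE_SIZE x TILE_SIZE thread block per output tile
dim3 blockDimension(TILE_SIZE, TILE_SIZE);
dim3 gridDimension(N / TILE_SIZE, N / TILE_SIZE);
matrixMult<<<gridDimension, blockDimension>>>
    ((int *) Ma, (int *) Mb, (int *) Mc);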

4.4 CUDA streams

In March 2010 nvidia launched the Fermi architecture, which allowed concurrency between CPU computation, multiple kernels, one memory transfer from host to device and one from device to host. This feature favours complex computation (see Subsection 3.2.2) through the use of CUDA streams.

A stream is a sequence of operations that execute in issue-order on the GPU [Sanders and Kandrot, 2010]. The programming model allows CUDA operations in different streams to run concurrently and to be interleaved. Figure 4.3 shows a pipeline behavior induced by stream concurrency.

Figure 4.3: Concurrency example with CUDA streams: H2D copies, kernels and D2H copies in three non-default streams overlap with each other and with CPU execution. Source: adapted from [nvidia, 2013a]

The default stream, or stream 0, is used when no stream is specified. It implies that all operations are synchronous between host and device. Listing 4.2 shows an example of a kernel called in the default stream.

In order to implement concurrency, a number of requirements have to be fulfilled [nvidia, 2013a]. First, the concurrent CUDA operations have to be in different, non-0 streams. Secondly, the data transfers have to be made using cudaMemcpyAsync, and from page-locked memory on the host (allocated with cudaMallocHost). Thirdly, the user has to assure that there are enough resources available for full concurrency: overlapping transfers must go in different directions at any point in time, and there must be enough device resources (like SMs, blocks, registers, etc.). Listing 4.4 demonstrates the use of streams (H2D and D2H stand for the cudaMemcpyHostToDevice and cudaMemcpyDeviceToHost transfer kinds).

cudaStream_t stream1, stream2, stream3, stream4;
cudaStreamCreate(&stream1);
...
cudaMalloc(&dev1, size);
cudaMallocHost(&host1, size);   // pinned memory required on host
...
// potentially overlapped section
cudaMemcpyAsync(dev1, host1, size, H2D, stream1);
kernel2<<<grid, block, 0, stream2>>>(..., dev2, ...);
kernel3<<<grid, block, 0, stream3>>>(..., dev3, ...);
cudaMemcpyAsync(host4, dev4, size, D2H, stream4);
some_CPU_method();
// end of potentially overlapped section
...

Listing 4.4: Using streams in CUDA. Source: [nvidia, 2013a]
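To illustrate how these requirements produce the pipeline behavior of Figure 4.3, the sketch below (an assumed example, not from the thesis or the nvidia material; all variable and kernel names are hypothetical) splits a large input into chunks and issues each chunk's transfer-kernel-transfer sequence to its own stream, so the stages of different chunks may overlap.

// hIn/hOut are assumed allocated with cudaMallocHost (pinned memory),
// dIn/dOut with cudaMalloc; process is an arbitrary kernel.
const int nStreams = 3;
cudaStream_t streams[nStreams];
for (int s = 0; s < nStreams; ++s)
    cudaStreamCreate(&streams[s]);

for (int c = 0; c < nChunks; ++c) {
    cudaStream_t st = streams[c % nStreams];   // round-robin over the streams
    size_t offset = c * chunkSize;
    cudaMemcpyAsync(dIn + offset, hIn + offset, chunkBytes,
                    cudaMemcpyHostToDevice, st);
    process<<<blocks, threads, 0, st>>>(dIn + offset, dOut + offset, chunkSize);
    cudaMemcpyAsync(hOut + offset, dOut + offset, chunkBytes,
                    cudaMemcpyDeviceToHost, st);
}
for (int s = 0; s < nStreams; ++s) {           // wait for all pipelines to drain
    cudaStreamSynchronize(streams[s]);
    cudaStreamDestroy(streams[s]);
}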

In this chapter the reader has been briefly introduced to the topic of GPGPUs and general programming with CUDA. While the reader is encouraged to read more about this topic in order to fully understand the mechanisms behind a successful CUDA program, the information provided is enough to stand as a technical background for the synthesizer module and its methods presented in Chapter 8.

Chapter 5

The f2cc Tool

This chapter will introduce the reader to the f2cc tool, the main component to be improved during the current project. Section 5.1 presents the tool's main features and its usage. Since the last chapter presented CUDA programming basics, it is proper now to offer an insight into the component's architecture, in Section 5.2. The chapter ends with a presentation of alternative solutions for CUDA code generation and their methodologies, in order to compare them with ForSyDe.

5.1 f2cc features

f2cc stands for ForSyDe to CUDA C. It is a ForSyDe tool whose main purpose is to synthesize CUDA-enabled GPGPU backend code from a high-level ForSyDe model. It was developed by Gabriel Hjort Blindell as part of his Master's Thesis in 2012 [Hjort Blindell, 2012]. The tool inputs high-level ForSyDe models as GraphML¹ files. Apart from structural information, these intermediate representations encapsulate C code for each leaf process². It offers support for the following ForSyDe process constructors: MapSY, ParallelMapSY, ZipWithNSY, UnzipxSY, ZipxSY, DelaySY³. The component generates either CUDA C for running on nvidia GPGPUs or sequential C code for running on CPUs. The output is controlled via a command-line switch. For GPGPU code synthesis, it identifies contained Unzipx-Map-Zipx sections like the one in Figure 5.1, coalesces them into an internal ParallelMapSY process and wraps an optimized CUDA kernel around them. For sequential code synthesis, it uses an internal scheduling mechanism to identify the correct process execution order, and generates C code afterwards.

¹ At the time of the tool's development ForSyDe-SystemC was not yet released, thus f2cc inputted ForSyDe-Haskell generated GraphML files.
² At the time of the tool's development there was not yet a tool for C code extraction from Haskell processes, thus the C code for each process had to be hand-written.
³ The process nomenclature corresponds to the ForSyDe-Haskell naming convention.


Figure 5.1: Parallel patterns identified by f2cc

f2cc uses an internal object-based representation of the ForSyDe model, for easy user access and parsing. The frontend module of the tool extracts data from the GraphML input files and translates it into the internal representation, so that all transformations happen internally. One last feature of f2cc is that it has the option to make use of the CUDA shared memory for minimising transfer costs within parallel processes.

5.2 f2cc architecture

f2cc is an independent tool implemented in an object-oriented style in C++. All its classes and components, with their APIs, are documented using Doxygen. To avoid naming clashes, all components that belong to the tool are implemented in their own namespace, f2cc. Its main modules and their interconnections are shown in Figure 5.2. In principle, the tool's data flow follows the path described in the next enumeration. Further information about the whole process can be found in [Hjort Blindell, 2012].

1. parse the input GraphML file;
2. translate the data into the tool's own intermediate ForSyDe object-based representation;
3. perform modifications upon the intermediate model, like redundant process elimination, or finding, coalescing and wrapping of contained sections into ParallelMapSY;
4. input the resulting model to the synthesis process, which finds a correct sequential schedule and wraps parallel sections into CUDA kernels;
5. output the resulting host and device code, or the sequential code respectively.

The frontend module holds classes and definitions for frontend parsing. It contains general functions, and the GraphML parser that converts input data into the internal format. As suggested in Figure 5.2, this module makes use of a third-party library called TinyXML++, included in the ticpp module. This library provides f2cc with XML parser functions. The module mainly used in the next step of the component's execution path is forsyde. It includes classes for the internal representation of the ForSyDe model's components, and for the model modifier methods called throughout the process. In order to avoid clashes with other tools or components, every class or method that belongs to this category resides under the f2cc::Forsyde namespace.

Figure 5.2: Component modules and their connections (frontend, forsyde, ticpp, logger, exceptions, tools, synthesizer, language, config). Source: [Hjort Blindell, 2012]

The internal model is equipped with a class for each ForSyDe process constructor recognized by the component. Every object instantiated at runtime encapsulates all the information needed (extracted from the model or assumed, as will be presented in Chapter 7) to describe a ForSyDe process. The relations between the classes used in the internal representation follow the pattern in Figure 5.3.

Figure 5.3: Classes used for f2cc internal model representation and the relations between them. Source: adapted from [Hjort Blindell, 2012]
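Since the figure itself cannot be reproduced here, the following C++ sketch (illustrative only, not f2cc's actual declarations) captures the relations it depicts: a Model owns Processes, a Process owns Ports, each process-constructor class is a Process, a MapSY has a CFunction, and CoalescedMapSY and ParallelMapSY specialize MapSY.

#include <vector>

class Port { /* connection endpoint for signals */ };
class CFunction { /* C code of a leaf process */ };

class Process {                          // base class of all constructors
public:
    virtual ~Process() {}
    std::vector<Port*> ports;            // a Process has many Ports
};

class Model {
public:
    std::vector<Process*> processes;     // a Model has many Processes
};

class MapSY : public Process {
public:
    CFunction* function;                 // a MapSY has a CFunction
};
class ZipWithNSY : public Process {};
class DelaySY    : public Process {};
class ZipxSY     : public Process {};
class UnzipxSY   : public Process {};

class CoalescedMapSY : public MapSY {};          // coalesced chain of MapSY
class ParallelMapSY  : public CoalescedMapSY {}; // data-parallel section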

The main model modifier identifies contained sections, coalesces them and transforms them into a ParallelMapSY process, for easy wrapping into a CUDA kernel. The analysis of contained sections is done by traversing the model in reverse order and identifying the processes between an UnzipxSY and a ZipxSY. Coalescing and conversion are done at the model level, and all model modifications consist mainly of adding/removing processes and redirecting signals. When all the necessary model modifications are ready, the synthesizer module takes over. Its two main classes are Synthesizer, with methods for generating CUDA C or sequential C code, and the Schedule Finder, with methods for identifying a correct sequential schedule for running a ForSyDe model. The language module is referenced throughout execution by the previous modules, and it holds containers for storing the C code for each process. It also contains string-based methods for manipulating the contained code. Each executing module reports its runtime events to the logger module. The tools module contains common miscellaneous methods, and the config class contains user-specified settings for the current program invocation. The exceptions module defines all exception classes used throughout the program.
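Returning to the contained-section analysis mentioned above, the following self-contained sketch (an assumption about the general idea, not f2cc's actual implementation; all types and names are hypothetical) shows how such a search could look: starting from a ZipxSY, predecessor links are followed backwards, and the section is accepted only if every path terminates in one and the same UnzipxSY.

#include <set>
#include <string>
#include <vector>

// Hypothetical stand-in for f2cc's internal model.
struct Process {
    std::string type;                    // e.g. "UnzipxSY", "MapSY", "ZipxSY"
    std::vector<Process*> predecessors;  // processes feeding this one
};

// Walks backwards from p, collecting visited processes into 'section'.
// Fails if a path dead-ends or reaches a different UnzipxSY than the
// one found so far.
static bool collect(Process* p, Process*& unzipx, std::set<Process*>& section) {
    if (p->type == "UnzipxSY") {
        if (unzipx != nullptr && unzipx != p)
            return false;                // two different entry points
        unzipx = p;
        return true;
    }
    if (!section.insert(p).second)
        return true;                     // already visited via another path
    if (p->predecessors.empty())
        return false;                    // dead end: not fed by an UnzipxSY
    for (Process* pred : p->predecessors)
        if (!collect(pred, unzipx, section))
            return false;
    return true;
}

// Returns the processes strictly between the UnzipxSY and the given
// ZipxSY, or an empty set if the section is not contained.
std::set<Process*> findContainedSection(Process* zipx) {
    std::set<Process*> section;
    Process* unzipx = nullptr;
    for (Process* pred : zipx->predecessors)
        if (!collect(pred, unzipx, section))
            return {};
    return unzipx ? section : std::set<Process*>{};
}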

5.3 Alternatives to f2cc

GPGPUs are notoriously difficult to program, due to the low level of programming languages like CUDA and OpenCL. Because of this, code synthesis from a high-level language is an intensely studied domain in the research community, and several research groups have focused their work in the past few years on providing proper tools and methodologies for increasing yields and productivity in GPGPU programming. In Chapter 3 it has been shown that, unlike most of the research efforts in the field, ForSyDe is not just a parallel programming language, and that it is flexible in describing and handling parallel constructs since it formally supports the parallel computation framework. Since GPGPUs are the most popular parallel platforms in industry, it is a reasonable decision to start exploring the high-throughput parallel domain with applications that target them. In order to do so, a review of the available alternatives for GPU code synthesis has to be made. This will aid in comparing their approaches with ForSyDe's and in examining their conceptual compatibility with a formal methodology. [Hjort Blindell, 2012] presents two alternative DSLs for GPGPU synthesis: SkePU and Obsidian. The following subsections extend the list with additional approaches.

5.3.1 SkelCL

The following section is based on material found in [Steuwer et al., 2013]. SkelCL is a frontend for OpenCL, developed at the University of Münster. It is built as a C++ library that offers pre-implemented recurring computation and communication patterns which simplify programming for single- and multi-GPU systems. These patterns⁴ are called skeletons. The skeletons raise the abstraction level for programming and shield the programmer from boilerplate code, in the same manner as ForSyDe process constructors do⁵. Formally, a skeleton is a "higher-order function that executes one or more user-defined functions in a pre-defined parallel manner, while hiding the details of parallelism and communication from the user" [Steuwer et al., 2013]. The SkelCL framework offers four basic skeletons, one extended skeleton and two data containers. The containers offered by SkelCL are Vector and Matrix. Vector is an abstraction for a contiguous memory area that is accessible by both host and device, and implements optimal transfers between them. Matrix behaves as a 2D vector and takes care of memory organization for 2D applications. The skeletons implemented by SkelCL are:

⁴ The reader may notice the resemblance with Idea #1 in Section 3.4.
⁵ In fact, in its early development stages, ForSyDe named the process constructors skeletons.

• Map: applies a unary function f to each element of an input vector v_in, and is described by Equation 5.1;
• Zip: operates on two input vectors and applies a binary operator ⊕ to all pairs of elements. It is described by Equation 5.2;
• Reduce: computes a scalar value r from a vector using a binary operator ⊕. It is described by Equation 5.3;
• Scan (or prefix-sum): yields an output vector where each element is obtained by applying a binary operator ⊕ to the elements of the input vector up to the current element's index. It is described by Equation 5.4;
• MapOverlap: the extended skeleton, used with either the vector or the matrix data type; it applies f to each element of an input matrix m_in while taking into account the neighboring elements within the range [−d, +d]. It is described by Equation 5.5.

v_out[i] = f(v_in[i])    (5.1)

v_out[i] = v_inl[i] ⊕ v_inr[i]    (5.2)

r = v[0] ⊕ v[1] ⊕ ... ⊕ v[n−1]    (5.3)

v_out[i] = ⊕_{j=0}^{i−1} v_in[j]    (5.4)

m_out[i,j] = f( m_in[i−d, j−d] ··· m_in[i−d, j] ··· m_in[i−d, j+d]
                      ⋮               ⋮               ⋮
                m_in[i, j−d]   ··· m_in[i, j]   ··· m_in[i, j+d]
                      ⋮               ⋮               ⋮
                m_in[i+d, j−d] ··· m_in[i+d, j] ··· m_in[i+d, j+d] )    (5.5)

int main(int argc, char const* argv[]) {
    SkelCL::init();                        // initializes SkelCL
    // Create skeletons
    Reduce<float(float)> sum("float func(float x, float y){ return x + y; }");
    Zip<float(float, float)> mult("float func(float x, float y){ return x * y; }");

    Vector<float> A(SIZE); fillVector(A);  // create input vectors
    Vector<float> B(SIZE); fillVector(B);

    Vector<float> C = sum(mult(A, B));     // execute skeletons
    cout << "Result: " << C.front();       // print result
}

Listing 5.1: SkelCL program computing the dot product of two vectors. The same program in OpenCL would take 59 lines of code. Source: [Steuwer et al., 2013]

A strong resemblance with ForSyDe is that SkelCL also uses formal means for describing communication and computation. Despite that, one can see that the two methodologies started from different areas of interest. While SkelCL implements optimal solutions for increasing performance and productivity on GPGPUs, it implements no concept of MoC. As a result, it cannot describe complex heterogeneous systems, and the programming model cannot be extended farther than GPGPUs. Nevertheless, its strong mathematical foundation in parallel computation⁶ should serve as an exemplary model for ForSyDe in further exploring the parallel computation domain.

5.3.2 SkePU

The following subsection is based on material found in [Enmyren and Kessler, 2010]. SkePU is a skeleton programming library for multiple GPGPU backends, developed at Linköping University, in the Department of Computer and Information Science. It is a C++ template library which provides a simple and unified interface for specifying data-parallel computation, and its target platforms are CUDA, OpenCL, OpenMP and the sequential CPU. It also provides support for multiple GPUs. The approach is similar to SkelCL's in many ways: SkePU also provides a list of skeletons and data containers that operate similarly and generate backend code. The only data container at the time of writing this report was the vector, which employs lazy memory copying in order to minimize bottlenecks. The basic skeletons provided are Map and Reduce, with formal descriptions similar to those in Equation 5.1 and Equation 5.3. The other skeletons provided are compositions of these two basic ones: MapReduce, MapOverlap and MapArray.

UNARY_FUNC(name, type1, param1, func)
UNARY_FUNC_CONSTANT(name, type1, param1, const1, func)
BINARY_FUNC(name, type1, param1, param2, func)
BINARY_FUNC_CONSTANT(name, type1, param1, param2, const1, func)
TERTIARY_FUNC(name, type1, param1, param2, param3, func)
TERTIARY_FUNC_CONSTANT(name, type1, param1, param2, param3, const1, func)
OVERLAP_FUNC(name, type1, over, param1, func)
ARRAY_FUNC(name, type1, param1, param2, func)

Listing 5.2: SkePU function macros. Source: [Enmyren and Kessler, 2010]

Listing 5.2 shows the function macros available in SkePU. They offer a standardized yet large degree of expressiveness in solving problems with the aid of skeletons. Listing 5.3 demonstrates the use of the Map skeleton, associated with a binary function. The main addition that SkePU offers is the multi-platform support: each skeleton contains member functions that correspond to the supported backends. Essentially, the function macros expand into implementations of the same function in CUDA, OpenCL, OpenMP and even sequential CPU code. Also, the authors state that the interface is general enough to be expanded to other platforms than the ones mentioned. Despite the apparent flexibility and generality of SkePU's approach, it is still limited to static code mapping. Also, since ForSyDe was born in the domain of real-time heterogeneous embedded platforms, it has a higher, more abstract view of systems, a view which can integrate parallel platforms from a formal perspective. Thus, strong features like system analysis and manipulation are not exercised when regarding solely code generation.

⁶ The reader can clearly see the resemblance between the parallel taxonomy derived from Kleene's model (Subsection 3.2.2) and the skeletons. Basically, the first four skeletons are implementations of the data, reduction and speculative parallel computations, while the last one is a composition of the data and speculative parallel computations.

BINARY_FUNC(plus, double, a, b, return a+b)

int main() {
    skepu::Map<plus> sum(new plus);
    skepu::Vector<double> v0(10, 10);
    skepu::Vector<double> v1(10, 5);
    skepu::Vector<double> r;

    sum(v0, v1, r);
    std::cout << "Result: " << r << std::endl;
}

Listing 5.3: SkePU syntax example. Source: [Enmyren and Kessler, 2010]

int main(void) {
    // generate 16M random numbers on the host
    thrust::host_vector<int> h_vec(1 << 24);
    thrust::generate(h_vec.begin(), h_vec.end(), rand);

    // transfer data to the device
    thrust::device_vector<int> d_vec = h_vec;

    // sort data on the device
    thrust::sort(d_vec.begin(), d_vec.end());

    // transfer data back to host
    thrust::copy(d_vec.begin(), d_vec.end(), h_vec.begin());
}

Listing 5.4: A Thrust program which sorts data on the GPU. Source: [Bell and Hoberock, 2011]

Hence, SkePU's approach, although useful for future research, is not suited to a (full) system design methodology. Still, its optimization mechanisms will definitely aid in the implementation of a synthesis stage in the ForSyDe design flow.

5.3.3 Thrust

The following subsection is based on material in [Bell and Hoberock, 2011]. Thrust is a high-level API which mimics the C++ STL, provided with CUDA version 4.0. It implements an abstraction layer for the CUDA platform meant to increase productivity⁷. It can be utilized for rapid prototyping of CUDA applications, as well as in production, since it provides robust implementations with increased performance. Thrust provides two main data containers, device_vector and host_vector, which come with member functions and iterators. Among the most important tools of this library are the custom iterators, like the permutation, constant and counting iterators. These features give an application-oriented flavour to the programming environment, much like MATLAB or other similar languages, rather than a hardware-oriented one. Listing 5.4 illustrates a typical program written with Thrust.

⁷ Equivalent to the productivity layer, see Section 3.4.

As can be seen, Thrust's approach is completely different from ForSyDe's. It is merely a productivity layer targeting only CUDA, and there is no method that can be imported into the ForSyDe methodology, since no type of formalism is involved. The reason why this set of tools has been included in the list of alternatives is to present an application-oriented approach which enforces a general programming style, much appreciated in industry. The patterns employed may be useful for future ForSyDe research on user impact. Design patterns may present themselves as formal IP blocks.

5.3.4 Obsidian

The following subsection is based on material found in [Svensson et al., 2010]. Obsidian is a domain-specific language (DSL) embedded in the functional programming language Haskell, targeting data-parallel programming on GPUs. It generates nvidia CUDA code and, at the time of writing this report, it considered only the generation of single kernels, without coordination. Obsidian aims at simplifying the development of GPU kernels by using familiar constructs from Haskell, like map, reduce and foldr, permutation functions like rev, riffle and unriffle, recursion, and other special combinators designed for combining GPU programs. As in the previously presented environments, the computation is described as computation between arrays, and combinators are used instead of direct indexing into structures. This favors developing prototypes and experimenting with different partitionings and choices when implementing an algorithm, without thinking of the architectural details. Once a data array is formed, the programmer may use Obsidian versions of Haskell library functions, with the restriction that they only operate on Obsidian arrays. Listing 5.5 shows an example where fmap is used to declare a function incr that increments each value in an array.

incr :: Arr a -> Arr a
incr = fmap (+1)

Listing 5.5: Declaration of an Obsidian function incr. Source: [Hjort Blindell, 2012]

As mentioned, Obsidian features a set of combinators to construct GPGPU kernels. A typical GPGPU kernel is represented by the type a :-> b, and is depicted in Figure 5.4. The combinators used are pure and sync, and they are defined as in Listing 5.6. pure turns one or more arrays into a kernel, while sync is used either to store data in the shared memory between kernels, or to apply synchronization barriers.


Figure 5.4: A GPU program of type a :-> b is represented in Obsidian as pure computation interspersed by syncs. Source: adapted from [Svensson et al., 2010]

The generated CUDA code has "satisfactory" [Svensson et al., 2010] performance for some applications, while for others it falls short.

pure :: (a -> b) -> (a :-> b)
sync :: Flatten a => Arr a :-> Arr a

Listing 5.6: Definition of the pure and sync combinators in Obsidian. Source: [Hjort Blindell, 2012]

Obsidian, like all the other approaches, is a platform-specific environment. Although it works on raising the abstraction level of development through high-level constructs (for computation and synchronization), it has little in common with a system design methodology like ForSyDe. Obsidian focuses on imposing restrictions on the programming model, forcing the developer to adapt his or her algorithm to fit them, in order to enable platform-specific optimization. ForSyDe, on the other hand, models platform-independent computation, and platform optimizations are done through design space exploration. Thus, a relevant approach in a ForSyDe design flow would be to model GPGPUs as a target platform and enable design space exploration on the MoC-based algorithm.

There are many more similar research or commercial tools available that serve more or less the same purpose: synthesizing low-level GPGPU code from a high-level model. Among them, we can enumerate DPSKEL (a dynamic programming skeletal-based environment targeting multi-GPU systems), CUDPP and Brook (other C++ template libraries), Vertigo (a Haskell EDSL optimized for DirectX 9), Accelerate (another Haskell EDSL), PyGPU (a Python library), etc. The multitude of tools demonstrates the intensity of research in this particular area. Strangely though, among them there is no system design methodology like ForSyDe. Since its approach differs from that of most of the available tools, ForSyDe can borrow only a few conceptual methods from them. Studying their optimization mechanisms, on the other hand, could prove useful for the development of specific design space exploration algorithms.

Chapter 6

Challenges

This chapter lists the main challenges that the component development and implementation will face.

Based on the material covered in Part I of this report, the following set of challenges has been identified. They are all related to the f2cc tool, since it will be the main software component extended and developed throughout this project. Based on the module that they are associated with, the new goals have been grouped into five categories and annotated by priority as either High, Medium or Low. Since time is a limiting factor, only the challenges marked with High and Medium priorities will be treated. Challenges with Low priority shall be taken into account only if time permits, and are otherwise out of the scope of this M.Sc. thesis.

frontend related challenges:
• Implement a new XML parser (High) The current ForSyDe-Haskell generated GraphML representation is obsolete, and is replaced by the new XML intermediate representation. Therefore, it is necessary to build a parser inside the frontend module of f2cc in order to input and interpret ForSyDe-SystemC models.
• Identify a minimal set of additional XML annotations (High) Data-parallel computation can be identified from the high-level functional model, but time-parallel computation needs additional information regarding architectural details and constraints. Since ForSyDe does not yet specify design constraints or platform descriptions, this information has to be assumed. The challenge is to find a minimal set of implied information so that it does not impose restrictions on future ForSyDe research.
• Propose topics for further investigation (Low) Since the intermediate format is still under development and hasn't reached its final form, new requirements can be identified actively. One should be aware of the future research and propose solutions for extracting information not yet available. It is more appropriate to extract missing information through various means than to infer it inside the tool.


model related challenges:

• Offer model support for the new frontend (High) The fact that the XML representation encapsulates more information than the previous GraphML could be considered a major advantage. Since both structural and computational information is easier to extract, this could ease further steps like model transformations, pattern identification and even code synthesis.
• Develop an identification algorithm for the data-parallel sections (High) Currently, f2cc is limited to identifying contained split-map-merge sections as data parallel. Consequently, it misses some opportunities to exploit models that do not fit this pattern. The composite process described by the new XML representation has an improved potential for expressing patterns that can be exploited to identify parallel computation. Hence, a new pattern recognition algorithm is necessary.
• Implement model modifiers for the synthesis of time-parallel code (High) In order to correctly wrap streamed CUDA kernels that employ time-parallelism, a set of model modifiers has to be implemented. These modifiers split "contained" sections into "pipelined" ones.
• Implement a load balancing algorithm (High) Improper usage of streams can degrade performance instead of improving it. To avoid such situations, a correct load-balancing algorithm has to be employed, one that splits contained sections correctly so that the best resource usage occurs. This load-balancing algorithm has to take into account the supplementary model annotations.
• Implement an internal model dumper (Medium) An internal model dumper may be an invaluable addition for debugging purposes. By being able to dump the internal model, one can verify that the model modifiers are working correctly.
• Conform with the new ForSyDe naming convention (Low) There have been some minor modifications in the ForSyDe naming convention regarding process constructors. In order to avoid confusion, the internal model of f2cc should conform with these changes as well.
• Implement a graph-based internal representation model (Low) Although the current internal representation model is sufficient for the task at hand, it is a limiting factor for future ForSyDe development. All the analysis, traversal and modifier algorithms have to be implemented manually, and the object-based containers may not be optimized for performance. On the other hand, graph theory is intensely studied and there are enough optimized template-based libraries with "out-of-the-box" functions that could prove invaluable for future ForSyDe development. This problem will be treated only if time permits it.
• Explore the possibility of identifying and implementing other types of parallelism (Low) If time permits, new solutions for identifying and implementing various types of parallel computation can be researched. The different types of parallelism are presented in Subsection 3.2.2 and some implementation solutions can be inspired by Section 5.3.

synthesizer related challenges:
• Implement an algorithm that takes care of signal conversion (High) The GraphML representation does not include information regarding the data type carried by signals. Therefore, the association between a signal and its variable had to be inferred using a specific algorithm. Since this information comes "for free" in the new XML representation, signal conversion requires a different algorithm.
• Implement a CUDA stream wrapper (High) This task is straightforward, but it is critical for enabling the synthesis of time-parallel code on an nvidia CUDA-enabled GPGPU.
  The synthesis of optimized streamed CUDA code strongly depends on a good load-balancing algorithm.

• Further optimize the use of shared memory (Low)
  Currently, f2cc employs the use of CUDA shared memory, but more optimizations are possible. If time permits, this issue can be treated.

language related challenges:

• Develop separate tool for C code extraction (High)
  f2cc uses string-based code parsers embedded in the language module. Although they are sufficient for parsing C code, they are not enough for identifying C++ or SystemC templates. Since developing an internal C++ parser cannot fit inside the time slot of this project, it will be developed as a separate tool based on an existing code parser with a C++ grammar. Since such a tool is developed in parallel with the current project, the work effort associated with this goal will not be assigned an important time frame. Instead it will be regarded as a temporary solution that will be improved in the future.

• Conform with new C header model (High)
  There is a discrepancy between the C code implied by the GraphML model input to f2cc and the ForSyDe-SystemC function declaration. This has to be taken care of as well for a proper interpretation of the processes' code.

• Take care of C++ and ForSyDe containers (High)
  The ForSyDe-SystemC framework makes extensive use of C++ templates and data types. Also, the protocol part, where signal containers are wrapped and unwrapped, cannot be identified by a C parser either. There has to be an association algorithm for a correct interpretation of the source code.

General implementation challenges:

• Make use of current algorithms and modules as much as possible (High)
  Since f2cc is a fully-functional tool, an important part of its code, classes and methods can be reused to increase productivity.

• Maintain backward-compatibility (High)
  Backward-compatibility must not be damaged. The component has to be compatible with the initial GraphML models.

• Implement a new execution path for ForSyDe-SystemC models (High)
  There are too many discrepancies between the ForSyDe-Haskell generated GraphML and the ForSyDe-SystemC generated XML representations. The frontend, model modifier,

  synthesizer and language algorithms are more or less different. Therefore it is easier to implement a new execution path and ensure backward-compatibility than to embed the new functionality in the same path and risk damaging existing code.

• Merge the old and the new execution paths (Low)
  Once the new component is validated, one can merge the two execution paths for code cleanness.

Part II

Development and Implementation

Chapter 7

The Component Framework

This chapter describes the software requirements and the architectural traits that a tool like f2cc demands in order to satisfy the desired functionality. Since the current project will develop and improve an already existing tool while maintaining backward compatibility, special attention will be paid to the existing features and possible ways to improve them. Finally, for each feature, the design decisions made in the implementation process will be presented and justified, along with solutions for future development. Part II will provide an overview of both the existing tool implementation described in [Hjort Blindell, 2012], f2cc v0.1, and the new component improved as part of this thesis' contribution, f2cc v0.2.

7.1 The ForSyDe model

As presented in Chapter 2, the ForSyDe modeling framework originally provided three main building blocks for describing systems: the process; the signal, which transports data between processes through ports; and the domain interface, which is not treated in the current contribution. Starting with the ForSyDe-SystemC modeling framework, a new building block was introduced: the composite process, which describes compositions of other processes. The composite process enables the designer to express hierarchy in the design more naturally (similar to Hardware Description Languages – HDL), while providing the potential tool developer with means for enhanced model analysis and manipulation. Models can now be analyzed and grouped locally, and replication or coalescing can be done without using intermediate data structures, thus both the designer and the tool developer can have a clear view of the model even during intermediate design flow steps. Also, patterns like data parallelism are easier to identify since it is not necessary to parse the whole model, favoring early pattern recognition.

7.1.1 f2cc approach

In order to use, analyze and manipulate the ForSyDe model, f2cc needs its own internal data representation that can be extracted from the XML or GraphML representations generated by

either of the ForSyDe implementations. Since every step in the execution flow revolves around this model representation, it can be considered the "backbone" of f2cc, as it plays a similar role as the backbone presented in Appendix B.

f2cc v0.1

In Section 5.2 the existing tool's architecture is described as implemented in [Hjort Blindell, 2012]. From its point of view, a process is an object that derives from the Process class (Figure 5.3). These objects encapsulate data relevant for the model description outputted by the ForSyDe-Haskell design framework. The main features of f2cc are depicted in Figure 7.1.

Figure 7.1: Visual representation of the internal model in f2cc v0.1

As can be seen, f2cc's internal model supports the description and manipulation of a ForSyDe process network with respect to the development stage at the time. The process network is denominated in f2cc v0.1 as a Model. The Model contains a list of unique process IDs associated with processes. The entries in this list point to Process objects which have special properties, as described by their equivalent ForSyDe process constructors. Each Process object has a unique ID, a list of input ports, and a list of output ports. These Ports are objects identified by their ID, and they each contain a pointer to another port belonging to a different process. Two connected ports are double-linked, meaning that both objects encapsulate pointers to one another. A port is connected to only one other port; if another connection is formed, the initial connection is broken. Some processes, like mapSY, zipwithnSY, coalescedMapSY and parallelMapSY [Hjort Blindell, 2012], encapsulate C functions, further described in Section 7.3. The Model also contains a list of inputs and a list of outputs, which are in fact "one-way" pointers to ports. These pointers can be considered starting and ending points for model parsing algorithms. The implementation of this model has been optimized for a limited set of actions, undertaken by the algorithms in f2cc. Special consideration has been paid to high performance execution by using combsets, lists, and optimized memory accesses for critical methods.
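A minimal sketch of the double-linking just described (class and member names hypothetical, not f2cc's actual API): connecting a port first breaks any existing link on both sides, so each port is always linked to at most one peer.

```cpp
// Sketch of the double-linked Port mechanism (names hypothetical).
class Port {
  public:
    // Connect this port to 'other', breaking any previous connection
    // on either side so a port is never linked to more than one peer.
    void connect(Port* other) {
        if (other == connected_ || other == nullptr) return;
        unconnect();
        other->unconnect();
        connected_ = other;        // double link: both objects
        other->connected_ = this;  // encapsulate pointers to one another
    }
    // Break the current connection on both sides, if any.
    void unconnect() {
        if (connected_ != nullptr) {
            connected_->connected_ = nullptr;
            connected_ = nullptr;
        }
    }
  private:
    Port* connected_ = nullptr;
};
```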

f2cc v0.2

Although the framework provided by f2cc v0.1 has an intuitive interface and facilitates fast code execution, its main weakness is its inflexibility towards the new features introduced with the ForSyDe-SystemC modeling framework, which were not known at the time of the tool's development. This weakness is amplified by the fact that every module of the software component is highly dependent on the internal model, so any change in the existing structure could render the whole tool unusable. Since maintaining backward compatibility is a high priority goal, the main challenge is to keep the internal model API unmodified, while providing new functionalities transparent to the execution flow in v0.1. Facing this challenge, a number of design decisions were made which favoured the intuitive patterns in the object-based representation from v0.1 over an optimized execution time, which would have implied changing the whole structure of the tool.

The new features are depicted in Figure 7.2, which reflects the design decisions made. The Model from v0.1 is replaced by the Process Network. From outside, it has to be seen in the same manner: a collection of processes with two lists of pointers to the input/output ports. There are now two types of processes with distinct natures and functions, leaf processes and composite processes. Due to this, they have been separated into two distinct lists, accessed with different methods.

The Process objects, although renamed into Leaf objects to disambiguate them from Composite objects, suffered minor modifications concerning their interface. Their names have been conformed to the new ForSyDe terminology: mapSY and zipwithnSY have been renamed to comb, and copySY has been renamed to fanout. Also, support for coalescedMapSY and parallelMapSY has been dropped, since the execution flow in v0.2 will not use these processes.

Internally however, a leaf process' structure is different. Instead of only one unique ID, all processes (including composites) are described by a hierarchical path that enables their correct placement in the process network. They have to have a context, and using a process without context results in an invalid model exception. A relation between two processes (child, parent, first child, sibling, sibling's child, etc.¹) can be extracted from the hierarchy path, for further manipulation. Another main change is that actor leaf processes (comb) do not hold C functions any more; instead, functions are held by the Process Network. An actor leaf process just points to a function in that list, enabling easy identification of data parallel processes and reducing storage space in case they occur.

The Composite processes are a new class of processes, sharing some traits with both the Process Network and Leaf processes. As seen in Figure 7.2, they may contain leafs or other composites, but unlike the Process Network, their scope of vision includes only their first children. Their content is described by their name, which is the equivalent of the component name in the ForSyDe-SystemC design framework. By using this design framework property (further described in Section 7.2) it is now possible to identify data parallel sets of processes by comparing their parents' component names.

The Ports now contain more information than in v0.1. They encapsulate data type information, extracted directly from the XML representation. Thus, the data type coherence does not have to be tested when a process network is built, since it is assured by the

¹ the nomenclature uses terms similar to the ones used in tree structures [Knuth, 1997]
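Deriving such relations from hierarchical paths can be sketched as below; the helper names are hypothetical and only illustrate the idea.

```cpp
// Sketch: deriving relations between processes from their hierarchical
// paths (helper names hypothetical).
#include <algorithm>
#include <string>
#include <vector>

using Hierarchy = std::vector<std::string>;  // e.g. {"root", "composite1", "comb1"}

// True if 'parent' is the direct (first) parent of 'child'.
bool isFirstChild(const Hierarchy& parent, const Hierarchy& child) {
    return child.size() == parent.size() + 1 &&
           std::equal(parent.begin(), parent.end(), child.begin());
}

// True if 'a' and 'b' share the same direct parent, i.e. are siblings.
bool isSibling(const Hierarchy& a, const Hierarchy& b) {
    return !a.empty() && a.size() == b.size() &&
           std::equal(a.begin(), a.end() - 1, b.begin());
}
```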

Figure 7.2: Visual representation of the internal model in f2cc v0.2

Figure 7.3: Visual representation of the ParallelComposite process

ForSyDe-SystemC design framework. Furthermore, ports belonging to actor leaf processes point directly to a C variable in the process' C function. In this way, model coherence, data coherence and enhanced access to information are enforced.

Because the "port-to-port" pattern had to be maintained, the access mechanisms had to remain unmodified. Therefore the Ports belonging to Composite objects were derived into a new class, the IOPort. The IOPort is similar to the Port, with the distinction that it has two connections: one outside the composite process and one inside it. The outside connection can lead either to sibling processes or to its first parent, while the inside connection can lead only to its first children. These restrictions are enforced by the internal mechanisms employed by the IOPort's methods.

Another compatibility feature is that composite processes have to be transparent as far as the algorithms from v0.1 are concerned. Therefore port-to-port accesses offer the possibility of recursive access until the connected Leaf Port has been reached, ignoring the intermediate Composite IOPorts. The search direction is handled by the methods' internal mechanism, which checks the caller's hierarchical relation to the callee. Also, the methods for breaking connections offer the same recursive mechanisms, enabling the composite processes' transparency.

In order to express parallelism internally, a new process type has been implemented in f2cc's internal model: the ParallelComposite. This process compensates for the lack of coalescedMapSY and parallelMapSY. It is structured as a Composite process, but it describes multiple instantiations of the same process. This is possible through a new property: the number of processes. This has multiple implications, the most important one being the correlation between the ports connected outside and the ones connected inside, as seen in Figure 7.3.
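The recursive resolution that makes composites transparent can be sketched as follows; the types and members are hypothetical, and the per-hop direction choice (which in the real methods is derived by comparing the caller's and callee's hierarchy paths) is abstracted as a single flag.

```cpp
// Sketch of recursive port-to-port resolution through composite IOPorts
// (types and members hypothetical).
struct PortBase {
    virtual ~PortBase() = default;
    virtual bool isIOPort() const { return false; }
    PortBase* outside = nullptr;  // link towards siblings / first parent
    PortBase* inside  = nullptr;  // link towards first children (IOPorts only)
};

struct IOPort : PortBase {
    bool isIOPort() const override { return true; }
};

// Follow connections through intermediate composite IOPorts until a
// leaf port is reached, making the composites transparent to callers.
PortBase* resolveLeafPort(PortBase* port, bool goInside) {
    while (port != nullptr && port->isIOPort())
        port = goInside ? port->inside : port->outside;
    return port;
}
```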

7.1.2 Model limitations and future improvements

Although this model improves usability and suffices for the desired functionality, it is rather inflexible for future development. Both the compatibility and the new features made the implementation too complex to be scaled and handled efficiently.

Figure 7.4: Visualization of the cross-hierarchy connection mechanism by three examples. The processes are depicted by their tree structure. cpx denotes a composite process' ID, and lfx represents a leaf process' ID. An intersection between a red line and a dotted line represents a new IOPort that has to be automatically generated

Also, since the time frame permitted implementing only a prototype of the component, some features still need to be added and the existing ones need thorough testing and debugging. For example, the design flow would be greatly aided by a mechanism for connecting two ports anywhere in the process network. The implications of such an action can be seen in Figure 7.4. For each transition to and from a composite process, an IOPort has to be generated. Its generation has to take into account the source where the connection method has been called. Based on the relation between the current process and the destination process, the framework should compute the position of the next IOPort.

Another priority for future improvement is developing the link between ports and code variables even further. Ideally, the data flow traceability should not stop at the process level. Instead it should go further, down to code level. By splitting the code into an abstract syntax tree (AST), the signals can be traced and analyzed from the process input to its output, a feature which may prove invaluable for model analysis.

Since introducing a new feature might require a high overhead in both development time and end performance, it is the author's belief that in order to continue the development of a design flow tool for ForSyDe, the current internal model has to be left aside in favour of a more flexible one. This new internal framework has to satisfy three main requirements:

• it has to scale better with the new features that are continuously being developed in ForSyDe, and with custom features needed for different design flows.

• it has to provide an intuitive interface so that, ideally, a broader community of developers is able to contribute to the tool's development. The current interface is a good starting point.

• it has to be implemented with high performance on very large models in mind. Being the "backbone" of the tool set, performance is critical.

As presented in Appendix B, a proposed internal model would mainly use a fast tree library with storage facilities (most likely an XML or database library, like RapidXML² or SQLite³) together with a graph library (like the Boost Graph Library⁴). Both library types offer the functionality and performance needed for the task at hand. The model can benefit from an XML database by using its inherent tree hierarchy, and its data storage facilities can cope with the new or custom features introduced in ForSyDe. A graph library can aid the design flow with its fast out-of-the-box analysis algorithms, most needed in the design flow. The main task for future development is to bind these libraries into an intuitive API specific to ForSyDe models.

² http://rapidxml.sourceforge.net/manual.html
³ http://www.sqlite.org/
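As an illustration of what such an "out-of-the-box" graph backend could look like, here is a sketch using the Boost Graph Library; the vertex property layout and identifiers are assumptions, not part of the proposed design.

```cpp
// Sketch of a graph-library-backed process network (Boost Graph Library;
// property layout hypothetical).
#include <boost/graph/adjacency_list.hpp>
#include <string>

struct ProcessInfo {
    std::string id;           // hierarchical process ID
    std::string constructor;  // e.g. "comb", "zipx"
};

using ProcessGraph = boost::adjacency_list<
    boost::vecS, boost::vecS, boost::bidirectionalS, ProcessInfo>;

int main() {
    ProcessGraph g;
    auto u = boost::add_vertex(ProcessInfo{"comb1", "comb"}, g);
    auto v = boost::add_vertex(ProcessInfo{"zipx1", "zipx"}, g);
    boost::add_edge(u, v, g);  // a signal from comb1 to zipx1
    // Traversal, search and analysis algorithms then come "for free".
}
```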

7.2 The intermediate model representation

The intermediate model representation is outputted by the ForSyDe modeling framework and contains data extracted from the ForSyDe model. It encapsulates information about the process interconnections and the process constructors. Beginning with the ForSyDe-SystemC design framework, additional information was included regarding the data types transported by signals and arriving in / departing from ports.

7.2.1 f2cc approach

Based on the data encapsulated in the intermediate XML or GraphML files, f2cc builds an internal model using the functions included in the frontend module. Depending on which pieces of information are available, the design flow relies on extracted, inferred or assumed data. The following paragraphs will present both the old and the new versions of f2cc, in order to justify and provide context for the design decisions taken in the tool's development.

f2cc v0.1

A comprehensive documentation of the information contained in the input GraphML files can be found in chapter 9 of [Hjort Blindell, 2012]. f2cc inputs a single GraphML file which contains a flat process network, as presented in Subsection 7.1.1. The input file holds both structural information as XML nodes, and C code encapsulated by a data XML element. Since at the time of the tool's development the ForSyDe-Haskell modeling framework did not dump C code, the C functions had to be included manually for each actor process. One important design decision taken in the development and implementation of f2cc v0.1 was to infer signal data types from the C code in the synthesis stage of its execution flow. This decision was influenced by the fact that there is no information regarding the data transported by signals or ports in the GraphML file. This fact is reflected in the internal model representation (Figure 7.1), since ports hold no data type information. Due to this lack of information, it was impossible to infer the sizes of array data types, thus sizes had to be provided manually in the GraphML file. Furthermore, structural information such as port direction had to be inferred from a strong naming convention, and any deviation from this convention results in a parse exception.

⁴ http://www.boost.org/doc/libs/1_53_0/libs/graph/doc/index.html

Listing 7.1: Example of port transporting an array in the GraphML intermediate format

Also, since the only function code present in the GraphML is C code, the only data types assumed and supported by f2cc v0.1 are the ANSI C basic data types.

f2cc v0.2

f2cc v0.2 uses the XML intermediate files dumped by ForSyDe-SystemC to harvest model data. Since this representation is very different from the ForSyDe-Haskell generated GraphML representation, assuring backward-compatibility was never an issue, and it will not be considered in future implementations either. A different frontend parser takes care of building an enhanced internal model from the available input data. The ForSyDe-SystemC modeling framework dumps multiple XML files for one process network, one file for each composite process available. This efficiently reduces storage space for multiple instantiations of the same composite process and, as presented in Subsection 7.1.1, enables easy identification of data parallel sections. Since the XML files contain data type information for ports and signals, and structural information is provided through explicit tags, there is no need for the inferences made in v0.1. As can be seen in Listing 7.2 and in Figure 7.2, this information is present in the intermediate model from the early stages.

Listing 7.2: Example of port transporting an integer in the XML intermediate format

ForSyDe-SystemC's introspection module dumps run-time type information (RTTI). The RTTI successfully identifies both ANSI C standard types and classes (custom structures). When it comes to STL types though, the information regarding the base data type wrapped by the template is lost. Since SystemC is a collection of C++ classes, and the ForSyDe-SystemC design framework often employs C++ STL types, a mechanism for preserving type information had to be provided. For the moment, the static declaration of custom templates was employed, using the macro DEFINE_TYPE_NAME provided by ForSyDe-SystemC. This means that every time the designer uses a custom data type, he or she has to make sure to declare it. These names can afterwards be interpreted by the f2cc methods into an internal data type representation and translated into an ANSI C type necessary for CUDA C code synthesis.

DEFINE_TYPE_NAME(std::vector<std::vector<int> >, "vector<vector<int>>");

Listing 7.3: Example of using the static type name declaration during the system design.

Listing 7.4: The difference between not using and using the static type name declaration.

Concerning the data provided by the XML intermediate representation, some temporary methods for data size extraction, necessary for cost calculations, have been provided; they are further presented in Section 7.4. Unlike in v0.1, in v0.2 the function code for actor processes is extracted directly from the ForSyDe model, thus the designer does not have to provide the C code manually. The extraction process is further presented in Section 7.3. As part of the implementation effort, an XML dumper class has been provided as well. It offers methods for dumping the internal model representation into an XML file similar to the intermediate ForSyDe model. Its usage can be seen in Appendix C.

7.2.2 Limitations and future improvements

The solutions in the previous subsection were presented as temporary, since they need further study and development until they can be considered syntactically and functionally correct and flexible enough to be included in a large-scale tool. Most of the improvements would imply increased and direct support from the ForSyDe design framework. For example, since the ForSyDe-SystemC design framework is built on a C++ environment, the data types do not have fixed sizes. The current component, especially its analysis part, is highly dependent on information provided by the design framework. This means that if the system was designed on a different machine than the one that performs the analysis (for example an x64 and an x86 machine), the results are likely to be erroneous. Due to this fact, a standardized ForSyDe set of data types should be developed (similar to u_int16, int32, etc.) and offered support for analysis. Also, design flows that target hardware backends like HDLs would greatly benefit from support for data types controlled at bit level. Another feature that could benefit the whole design flow would be advanced support for complex template data types. This would mean either enhanced support for recognition and introspection of composite STL types or, as mentioned before, the development of a standardized ForSyDe set of fully analyzable data containers.
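A sketch of what such a standardized type set could look like, built on the fixed-width types of <cstdint> (the namespace and aliases are hypothetical):

```cpp
// Sketch of a standardized ForSyDe data type set with machine-independent
// sizes (names hypothetical).
#include <cstdint>

namespace ForSyDe {
    using int32  = std::int32_t;   // always 4 bytes
    using uint16 = std::uint16_t;  // always 2 bytes
}
// An analysis tool could then rely on sizeof(ForSyDe::int32) being the
// same on the design machine and on the analysis machine.
```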

7.3 The process function code

As presented in Subsection 7.1.1, actor processes hold function code. This code stands as the basis for synthesizing code for the different ForSyDe backends. Currently C code is preferred for denoting these processes' functionality, since it is the most widespread language for computing machines and most targeted platforms are equipped with C compilers.

7.3.1 f2cc approach

f2cc extracts the function code as text, analyzes it and encapsulates it into a CFunction object. The function body as such, since it consists of C code, is left unmodified and is relevant for the tool only in the final stage of the design flow, namely during the backend code synthesis.

f2cc v0.1

Text manipulation functions are applied to the code extracted from the GraphML file. The C function is separated into body and header and stored into a CFunction object. The header is further analyzed and split into function name and variables, as in Figure 7.6. The header contains parameters as CVariable objects, each with an associated CDataType object. The objects structure all the information in such a way that it is accessible to the code generation mechanisms associated with these variables, as can be seen in the examples from Figure 7.5.

Figure 7.5: Generating C-style declaration code for a double array, based on information found in a CVariable object (name: sampIn, type: double, array: T, array size: 500, pointer: F, const: F); the generated code allocates the array with malloc(500 * sizeof(double))

Figure 7.6: CFunction structure in f2cc v0.1 (a function name, one return/out parameter and a list of in parameters, each with a CDataType, plus the body)

Figure 7.7: CFunction structure in f2cc v0.2 (a function name, lists of in and out parameters, each with a CDataType, plus the body)

f2cc v0.2

f2cc v0.2 uses the same mechanism for storing a C function and similar object containers for that purpose. Since code now has to be generated for composite processes as well as leaf processes, the functions are no longer restricted to only one output. Instead, a list of CVariable objects associated with output parameters is provided for each CFunction, as in Figure 7.7. As mentioned in Section 7.2, the function code is extracted directly from the ForSyDe model. Since the ForSyDe-SystemC modeling framework uses C++ constructs, a basic set of methods for converting the function code into C code has been implemented and included into a C parsing frontend. Currently, these methods employ only text manipulation, but they suffice for the desired functionality.
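A minimal sketch of these containers (field names assumed from the text, not f2cc's actual declarations), highlighting the v0.2 change from a single output to a list of outputs:

```cpp
// Sketch of the function containers described in Figures 7.5-7.7
// (member names hypothetical).
#include <cstddef>
#include <string>
#include <vector>

struct CDataType {
    std::string type;              // e.g. "double"
    bool        is_array   = false;
    std::size_t array_size = 0;
    bool        is_pointer = false;
    bool        is_const   = false;
};

struct CVariable {
    std::string name;
    CDataType   data_type;
};

struct CFunction {
    std::string            name;
    std::vector<CVariable> in_parameters;
    std::vector<CVariable> out_parameters;  // v0.1 allowed only one output
    std::string            body;            // raw C code, used only at synthesis
};
```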

The function extraction algorithm is presented in Listing 7.5. The parsing methods are invoked every time a new comb process is built during the XML parsing. The source code file name is built from the process constructor function name. This means that the designer has to respect a set of conventions when naming both process functions and function code files, as suggested in Appendix A. The coding style is subject to a few restrictions as well, also listed in Appendix A. When the code parsing is complete, the associated CFunction object contains C code; connections between ports and variables are made immediately, and type sizes are adjusted accordingly. The extracted variables have ANSI C data types and have direct equivalents in the function body. An example of extracting data type information from ForSyDe STL types is shown in Figure 7.8.

forsydeCode ← Read(source_file)
new_CFunction_obj ← empty CFunction()
equivalence_list ← empty list

for each line ∈ forsydeCode do
    if line ∈ function declaration then
        extract function_name, function_arguments from line
        for each argument ∈ function_arguments
            extract variable_name, variable_type from argument
            new_CVariable_obj ← CVariable(variable_name, variable_type)
            add new_CVariable_obj to the new_CFunction_obj input list / output list

    if line ∈ ("#pragma ForSyDe begin", "#pragma ForSyDe end") then
        rename macros with macro definitions in line
        add line to new_CFunction_obj body

    if line ∈ variable wrapping/unwrapping section then
        extract lhs_name, rhs_name from line
        add lhs_name, rhs_name to equivalence_list

for each equivalence ∈ equivalence_list
    analyze equivalence
    rename pointed variable from new_CFunction_obj variable lists with equivalent name

return new_CFunction_obj

Listing 7.5: Pseudocode for ForSyDe function code parser

Figure 7.8: Extracting variable information (right) from the ForSyDe function code (upper left) and the XML intermediate representation (lower left). The variable name is later renamed according to its occurrence in the function body (middle left). The extracted information includes the name (in_state, occurring as "state" and "in"), type (double), array (T), array size (27) and const (T)

7.3.2 Future improvements

The current solution for code extraction, although it suffices for the proposed tasks, is very limited and inflexible. Since it is purely based on text parsing methods, a number of restrictions have to be respected, restrictions that limit the usability of the design framework.

A proper way to store function information is as its AST or another description of the functionality, rather than as code text. Thus an intermediate form for the code would greatly benefit the whole design flow. As suggested in Subsection 7.2.2, dataflow traceability through code may aid the design space exploration further than the current process network-based analysis.

Because developing an intermediate model for code analysis may prove too resource-consuming, support for C code extraction needs to be improved further. Currently, ForSyDe-specific constructs residing in the body, like the usage of absent values, are ignored or removed. A proper way of generating semantically correct C constructs has to be developed.

Also, STL data type support is currently rudimentary. A proper way of dealing with templates and containers has to be developed. For now, only std::array is supported, since it is static and its container size reflects the number of elements transported, thus structural data is easily extracted.
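The static nature of std::array is what makes its size recoverable; a small illustration (not f2cc code) of how the element count is part of the type itself:

```cpp
// Illustration: the size of std::array is encoded statically in its type,
// so it can be recovered without running the program.
#include <array>
#include <cstddef>

template <typename T, std::size_t N>
constexpr std::size_t elementCount(const std::array<T, N>&) { return N; }

static_assert(elementCount(std::array<int, 27>{}) == 27,
              "the container size is a compile-time constant");
```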

ForSyDe's range of applications can be broadened by implementing support for dynamically-sized vectors. Although they would greatly increase the difficulty of system analysis, heuristic algorithms associated with vectors may be developed. Also, extending the idea of vectors into processes with dynamic parameters (for example parallel processes) may be associated with mapping to systems with dynamic resource allocation (for example run-time thread spawning).

7.4 The GPGPU platform model

In Chapter 8 a heuristic algorithm for load balancing is presented. This algorithm is based on a platform model which is not currently described by any available ForSyDe tool. Due to the limited time frame, the current project developed a minimal set of descriptive empirical attributes that can provide enough information to support the load balancing algorithm.

An XML file has been written to model the GPGPU as a platform. Currently this file contains only seven nodes, each containing an attribute. These attributes are constants that roughly describe the platform's execution patterns.

7.4.1 Computation costs

The computation costs describe the running time of individual processes. These costs are inferred from rough approximations of the processes' running time in the context of their ForSyDe model running on a test platform (e.g. a sequential CPU). Although the execution on a GPGPU is influenced by many more factors than those provided, these coefficients suffice for proof-of-concept purposes.

Calculating cost values implies two factors: an average run-time estimation coefficient and a platform cost coefficient. The calculation is done with one of Equations 7.1, 7.2 and 7.3.

$C_{leaf,seq} = k_{seq} \cdot C_{leaf} \qquad C_{leaf,par} = k_{par} \cdot C_{leaf}$   (7.1)

$C_{comp,seq} = \sum_{proc \in comp} C_{proc,seq} \qquad C_{comp,par} = \sum_{proc \in comp} C_{proc,par}$   (7.2)

$C_{pcomp,seq} = N \cdot C_{comp,seq} \qquad C_{pcomp,par} = C_{comp,par}$   (7.3)

For a leaf process, the cost calculations are straightforward (Equation 7.1). The cost associated with sequential execution on the host CPU ($C_{leaf,seq}$) is computed by multiplying the cost coefficient for running the process on a sequential platform ($k_{seq}$) with the process' run-time estimation coefficient ($C_{leaf}$). For computing the cost associated with parallel execution on the GPGPU device ($C_{leaf,par}$), a parallel platform cost coefficient ($k_{par}$) is used instead.

For a composite process (Equation 7.2), both the sequential ($C_{comp,seq}$) and parallel ($C_{comp,par}$) execution costs are calculated by summing all the associated costs of the contained processes. For the communication happening internally between the contained processes, a different type of analysis is employed, as will be shown in the next subsection.

A parallel composite process (Equation 7.3) gets its sequential execution cost ($C_{pcomp,seq}$) by multiplying the number of processes ($N$) with the sequential execution cost of a single instantiation of this process ($C_{comp,seq}$). When calculating the parallel execution cost though, $N$ is ignored. This implies the assumption that the parallel platform (GPGPU) is infinitely parallel, and that there are enough resources to execute all the process threads at once. Although not realistic, this assumption is sufficient for demonstrating the load balancing algorithm.
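Taken together, Equations 7.1 through 7.3 amount to a simple recursion over the process hierarchy; a sketch under that reading follows (the data structures are hypothetical, not f2cc's internal model):

```cpp
// Sketch of the cost rules in Equations 7.1-7.3 (structures hypothetical;
// coefficient names follow the text).
#include <vector>

struct PlatformModel { double k_seq; double k_par; };

struct Process {
    double c_leaf = 0.0;            // designer-provided run-time estimate
    std::vector<Process> children;  // non-empty for (parallel) composites
    int n = 1;                      // instance count; >1 for ParallelComposite
};

double costSeq(const Process& p, const PlatformModel& m) {
    if (p.children.empty()) return m.k_seq * p.c_leaf;      // Eq. 7.1
    double sum = 0.0;
    for (const auto& c : p.children) sum += costSeq(c, m);  // Eq. 7.2
    return p.n * sum;                                       // Eq. 7.3
}

double costPar(const Process& p, const PlatformModel& m) {
    if (p.children.empty()) return m.k_par * p.c_leaf;      // Eq. 7.1
    double sum = 0.0;
    for (const auto& c : p.children) sum += costPar(c, m);  // Eq. 7.2
    return sum;  // Eq. 7.3: N ignored, assuming an infinitely parallel device
}
```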

The cost coefficients kseq and kpar are extracted from the platform model, while the rest of the factors are included in the ForSyDe intermediate model representation. The run time execution coefficients for leaf processes (Cleaf ) are provided manually by the designer, by writing a new attribute for the leaf_process nodes: cost. An example is shown below.
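Only the leaf_process node and the cost attribute are fixed by the text; the surrounding XML structure in this sketch is assumed for illustration:

```xml
<leaf_process name="comb1" cost="10">
  <process_constructor name="comb" moc="sy"/>
</leaf_process>
```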

7.4.2 Communication costs

The communication costs describe the time necessary for data to be transferred between computing resources. Unlike the computation costs, the main parameter, namely the size of the transported data, is precisely calculated, not approximated. Since part of the work effort was to implement a mechanism that extracts signal sizes⁵, the only assumed information is the transfer mechanism costs, which are synthesized into a set of transfer coefficients and provided in the platform model file. The operations revolving around transfer costs follow the pattern in Equation 7.4.

$C_{transfer} = k_{transfer} \cdot n_{elements} \cdot size_{datatype}$   (7.4)

where $C_{transfer}$ is the communication cost associated with one type of transfer, $k_{transfer}$ is the transfer mechanism cost for the same type of transfer, $n_{elements}$ is the fixed array size and $size_{datatype}$ is the signal data type size in bytes.

⁵ for this purpose a modified ForSyDe library has been used, with enhanced introspection features. Whether or not the modifications should feature in future releases of ForSyDe remains to be studied.

The types of transfer described in f2cc v0.2 are: host-to-device (H2D), device-to-host (D2H), device inter-thread (D2D), device intra-thread (T2T), and host-to-host (H2H).
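Equation 7.4 with one coefficient per transfer type can be sketched as follows; the structures are hypothetical, and the coefficient values would come from the platform model file:

```cpp
// Sketch of the transfer-cost rule in Equation 7.4 with per-type
// coefficients (structures hypothetical).
#include <cstddef>
#include <map>

enum class Transfer { H2D, D2H, D2D, T2T, H2H };

double transferCost(const std::map<Transfer, double>& k_transfer,
                    Transfer type,
                    std::size_t n_elements,      // fixed array size
                    std::size_t size_datatype) { // data type size in bytes
    return k_transfer.at(type) * n_elements * size_datatype;  // Eq. 7.4
}
```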

7.4.3 Future improvements

As presented, the current solution for modeling GPGPU execution is minimal and destined only for proof-of-concept purposes. In order to obtain real improvements, two areas need to be developed further: the model analysis part and the platform description part.

Currently there is no ForSyDe analysis tool available, other than the run-time introspection and model testbench present in the ForSyDe-SystemC modeling framework. In Appendix B a separate tool is proposed for static and dynamic analyses. With its help, a more elaborate and proper run-time cost extraction is possible. Since process execution is influenced by different factors on different platforms, this aspect has to be reflected as correctly as possible in the analysis phase. For static analysis, a high-level code analysis would be helpful, while for dynamic analysis, better costs may be extracted from running partial testbenches on virtual platforms or target platforms.

Also, a thorough platform description has to be modelled in order to make realistic predictions on both execution and data transfer. The current cost calculations (Equations 7.1 through 7.4) take into account only a set of empirical constants. These equations need to be developed to depend on real architectural or implementation traits, or at least on statistical values.

Chapter 8

Design Flow and Algorithms

This chapter will present the main algorithms implemented by the current component. These algorithms are tightly connected to the component framework presented in Chapter 7, and many of the design decisions taken in their development depend on it. Following the previous chapter's model, solutions for future improvement will be mentioned after describing the algorithms used.

8.1 Model modifier algorithms

The first part of the design flow was presented in Chapter 7; it involves code and model parsing and the extraction of the data necessary for building an internal model. It shall be assumed that the internal model has been built and verified by the new frontend and that there is enough information to continue with the second major step of the design flow: the model modifications. These modifications are necessary in order to synthesize the desired CUDA code, and are the results of decisions made after a thorough search and analysis of the intermediate model.

8.1.1 Identifying data-parallel processes

Identifying data-parallel processes is the first step of the model modifier algorithms. As presented in [Hjort Blindell, 2012], there are four approaches for this task:

1. let the software component decide where and when potential data parallelism can be exploited and always execute it on the GPGPU device.

2. same as approach 1, but execute on the GPGPU only parts of the code where there is sufficient data for such execution to be beneficial, and let the component take this decision.

3. let the model developer decide where and when potential data parallelism can be exploited and always execute it on the GPGPU device.

4. same as approach 3, but execute on the GPGPU only parts of the code where there is sufficient data for such execution to be beneficial, and let the component take this decision.


f2cc v0.1 applies approach 1, and identifies data parallel sections by searching for unzipx-map-zipx patterns. In f2cc v0.2, methods tackling approach 2 are implemented, laying a foundation for future development. The identification is made using different methods than in v0.1, since the model now holds more information. Future development will consider approaches 3 and 4 as well, but at the moment they are out of the scope of this Master's Thesis project.

Implemented algorithm

The pseudo-code in Listing 8.1 describes the top level of the data parallel process identification algorithm. It reflects one early design decision that was forced by the framework's context: flattening the process network before any analysis. Since flattening the process network is a decision imposed by the current framework, it is not optimal and reduces the advantages of manipulating composite processes, rendering a high algorithm time complexity, as shown in the following paragraphs. An improved algorithm associated with an improved framework is presented at the end of this section.

root ← ProcessNetwork.root

for each composite ∈ root.composites do
    FlattenCompositeProcess(composite)

equivalent_comb_groups ← ExtractEquivalentCombs(root)
for each comb_group ∈ equivalent_comb_groups do
    pcomp ← createParallelComposite(root, comb_group)
    add pcomp to root, ProcessNetwork

equivalent_leaf_groups ← ExtractEquivalentLeafs(root)
while ∃ unresolved ∈ equivalent_leaf_groups do
    for each leaf_group ∈ equivalent_leaf_groups do
        pcomp ← createParallelComposite(root, leaf_group)
        add pcomp to root, ProcessNetwork
    equivalent_leaf_groups ← ExtractEquivalentLeafs(root)

RemoveRedundantZipsUnzips(root)

Listing 8.1: Top level for the algorithms for identifying data parallel sections

The algorithms in Listing 8.2 expand the functions from Listing 8.1. They simplify the implementation mechanisms in order to express only their functionality, and may not fully reflect the implementation details. The algorithm results in a flat process network containing only leafs for non-parallel processes and parallel composites for potentially parallel processes.

As its name suggests, FlattenCompositeProcess() destroys the hierarchy of a composite process and brings all leaf processes to the same level. It is applied recursively for all child branches, and it systematically raises the hierarchy by one level for all leaf processes at the end of these branches. The method's results can be observed in Figure C.1 (input model) and Figure C.2 (modified model) from Appendix C.

Since this algorithm is recursive, its time complexity scales very quickly with O(2^N), where N is the number of composite processes, and it strongly depends on the complexity of the hierarchy tree. Another aspect that has to be regarded is that, due to the intermediate connections that have to be destroyed when moving a leaf process, the algorithm adds O(n), where n is the number of leaf processes, which scales with the number of ports contained by the process. Since all the composite processes are destroyed, the N term disappears from future complexity calculations.

function FlattenCompositeProcess(composite)
    for each child_composite ∈ composite.composites do
        FlattenCompositeProcess(child_composite)
    for each child_leaf ∈ composite.leafs do
        move child_leaf to composite.parent
        redirect child_leaf.data_flow through child_leaf

function ExtractEquivalentCombs(composite)
    grouped_equivalent_processes ← empty list of lists
    table_of_equivalences ← empty combset(identifiers, comb lists)
    for each leaf ∈ composite.leafs do
        if leaf is comb and leaf ∈ table_of_equivalences then
            function_name ← table_of_equivalences.identifier(leaf)
            if FoundDependencyUp(/Down)stream(leaf, equivalence_list) then
                new_pair ← new pair(function_name, leaf)
                add new_pair to table_of_equivalences
            else add leaf to table_of_equivalences.lists(function_name)
    for each equivalence_list ∈ table_of_equivalences do
        if equivalence_list.size() > 1 then
            add equivalence_list to grouped_equivalent_processes
    return grouped_equivalent_processes

function ExtractEquivalentLeafs(composite)
    grouped_equivalent_processes ← empty list of lists
    table_of_equivalences ← empty combset(identifiers, leaf lists)
    for each leaf ∈ composite.leafs do
        if leaf is not zipx or unzipx then
            if leaf is connected to a zipx or unzipx then
                identifier ← the ID of the connected zipx or unzipx
                if identifier ∈ table_of_equivalences then
                    add leaf to table_of_equivalences.list(identifier)
                else
                    new_pair ← new pair(identifier, leaf)
                    add new_pair to table_of_equivalences
    for each equivalence_list ∈ table_of_equivalences do
        if equivalence_list.size() > 1 then
            add equivalence_list to grouped_equivalent_processes
    return grouped_equivalent_processes

function createParallelComposite(parent, group_of_equivalent_processes)
    for each leaf ∈ group_of_equivalent_processes do
        count number_of_processes
    new_pcomp ← new ParallelComposite(number_of_processes)
    add new_pcomp to parent
    reference_process ← group_of_equivalent_processes.pop_front()
    integrate reference_process into new_pcomp
    zips_and_unzips ← new Zipx/Unzipx() ∀ reference_process.ports
    redirect reference_process.data_flow through zips_and_unzips
    for each process ∈ group_of_equivalent_processes do
        redirect process.data_flow through zips_and_unzips
        erase process
    return new_pcomp

function FoundDependencyUp(/Down)stream(leaf, to_compare_with)
    mark leaf as visited
    if leaf is delay then return false
    for each port ∈ leaf.in/out_ports do
        connected_leaf ← port.connected_leaf_port.process
        if connected_leaf has been visited then return false
        if leaf ∈ to_compare_with then return true
        else if FoundDependencyUp(/Down)stream(connected_leaf, to_compare_with) then return true
    return false

Listing 8.2: Methods used by the data parallel sections identification algorithm

ExtractEquivalentCombs() is a method that searches for all the comb processes in the system and groups them into equivalent processes. A set of equivalent combs that are potentially data-parallel point to the same function and are not data-dependent. In order to identify data-parallel processes, a system of lookup tables is employed, along with an additional parse through the process' data path to search for dependencies, through the function FoundDependencyUp(/Down)stream(). This function is still experimental, and it needs to be studied further whether the dependency verification should target the whole data path between the process network's input and its output, or only the data path between two delay processes. Currently this feature may be activated through a flag. The method's behavior can be observed in Figure C.3 from Appendix C.

This function traverses a flat process network for each comb process and could be greatly enhanced through systematic groupings of processes. Since all processes have to be verified as to whether they are combs, the search time scales with O(n). And for every comb process found, a dependency check is activated which essentially parses through the whole process network, with a worst case of O(n) (since visited processes are ignored). Thus the full time complexity scales with O(n^2). Another list sweep is performed at the end of the method, where all the lists in the built lookup table are grouped into a list of lists, but its complexity can be considered negligible compared to the process network parsing. Also, with a correct lookup table system, verifying whether a process is found in a group is reduced to O(1). This applies to the following methods as well, thus this aspect will be neglected.

ExtractEquivalentLeafs() is similar to the previous method, but instead of parsing through the full data path, only neighbour processes are visited. This method identifies leaf processes that have the same neighbours and groups them into separate lists, so that they too become parallel composites. This aids in simplifying process networks that contain potentially parallel processes which do not necessarily respect the unzipx-map-zipx pattern, but may have a more complex pattern, as in Figure 8.1. The method's effect can be observed upon the test model in Figure C.4 from Appendix C. This method is called systematically until the component can make sure that there is no further potential for data parallelism, since grouping parallel composites may generate further opportunities for parallelism.

Figure 8.1: Grouping potentially parallel processes with the same source(s) and target(s)

Since the potentially parallel combs have been grouped into parallel composite processes, parsing through the system will take less time. Thus, in the worst case scenario the method runs in O(n) time, and the nested search depends only on the number of processes, which will be denoted n_p. Thus the full time complexity of this method is O(n · n_p).

createParallelComposite() is a method that mainly integrates a set of equivalent and potentially parallel leaf processes into a parallel composite process. The integration implies creating the parallel composite, moving the leaf inside the new process, redirecting the data flow, and assuring data type and process integrity. To make the new parallel composite transparent, all its input and output ports are connected to the rest of the process network through sets of zip and unzip processes. Thus, when redirecting the data flow through the new parallel composite, these zips/unzips are used instead, aiding the future identification of potentially data parallel processes. The results of this method applied upon the example model can be seen in both Figure C.3 and Figure C.4 from Appendix C. As seen in Listing 8.1, this method is called for every set of equivalent processes found, and is continually called until there are no equivalent processes left. For each group of potentially parallel processes, the method's time complexity scales with O(n · n_p), depending on the number of equivalent processes, since for every one of them the data flow needs to be redirected.

function RemoveRedundantZipsUnzips(composite)
    for each leaf ∈ composite.leafs do
        if leaf is zipx or unzipx then
            equivalence_set ← empty combset(connected_id, current_id)
            for each port ∈ leaf.in(out)_ports do
                if port.connected_port.process ∈ equivalence_set then
                    integrate port.signal into equivalence_set.signal
                    erase port.connected_port, port
                else
                    new_equivalence ← new pair(connected_id, current_id)
                    add new_equivalence to equivalence_set
    for each leaf ∈ composite.leafs do
        if leaf is zipx or unzipx and leaf has only one in(out) port then
            redirect flow ignoring leaf
            erase leaf

Listing 8.3: Pseudo-code for RemoveRedundantZipsUnzips() method

The final method in this algorithm is RemoveRedundantZipsUnzips(), listed in Listing 8.3. It is used for grouping bundles of signals together into arrays, rendering fewer redundant data paths. After this grouping is made, the zipx and unzipx processes with only one connection are removed from the process network, since they do not play any role. The method's effect can be observed by comparing Figure C.4 with Figure C.5. Since this method consists of two parsings of the full list of leaf processes, the execution time would be O(n + n), thus the time complexity scales with O(n).

Proposed algorithm

As seen in Listing 8.1, the algorithm used for identifying data parallel processes scales quickly when increasing the data set. Its time complexity grows exponentially with the number of composite processes, and predominantly quadratically with the number of leaf processes (due to repeated parsings of the full process network, for each leaf process). We propose an algorithm for future development that would increase the execution performance of the current identification algorithm, offering a logarithmic growth rate for both leaf and composite processes. This algorithm is described in Listing 8.4. Although at first glance the algorithm seems to employ more nested loops than the previous one, this impression is deceiving. The process network tree is parsed and flattened from root

root ← ProcessNetwork.root

while ∃ root.composites do
    equivalent_process_groups ← ExtractEquivalentProcesses(root)
    while ∃ unresolved ∈ equivalent_process_groups do
        for each process_group ∈ equivalent_process_groups do
            pcomp ← createParallelComposite(root, process_group)
            add pcomp to root, ProcessNetwork
        equivalent_process_groups ← ExtractEquivalentProcesses(root)

    for each composite ∈ root.composites do
        move composite.leafs and composite.composites to root
        redirect data flow through composite.leafs and composite.composites
        erase composite

RemoveRedundantZipsUnzips(root)

Listing 8.4: Top level for the proposed algorithm for identifying data parallel sections

to branches, not vice versa. For each level, all data parallelism is systematically identified, resolved and grouped for that specific level, after which the composite information along with its hierarchy is destroyed. Since potentially data parallel processes are identified early on, before reaching the branches, each composite that is transformed into a parallel composite "cuts" a branch from the hierarchical tree, thus all methods associated with that branch are skipped. Also, the leaf processes in the ramifications are destroyed and synthesized into the new parallel composites. This gives the algorithm a time complexity with a curve proportional to O(log n).

8.1.2 Optimizing platform mapping

As mentioned in Subsection 8.1.1, the current component tries to identify sections where execution on a GPGPU could be beneficial. This is done through a set of cost calculations and it employs the platform model described as part of the component framework in Section 7.4. The algorithm is described in the pseudo-code from Listing 8.5.

root ← ProcessNetwork.root

while optimization is not finished do
    for each process ∈ root.processes do
        current_cost ← total_cost(process on current platform)
        changed_cost ← total_cost(process on changed platform)
        if changed_cost < current_cost then
            map process on changed platform
            mark optimization as not finished

Listing 8.5: Algorithm for platform optimization

The algorithm assumes that the data parallel processes have been identified and grouped into parallel composites. By default, all leaf processes are mapped to the host (i.e. for sequential execution) and parallel composite processes are mapped to the device (i.e. for parallel execution). The algorithm parses through all processes in the process network root and calculates their total cost assuming that they are mapped on either host or device. If the total cost on a changed platform ends up being less than the cost on the current platform, then that process' mapping directive is changed. This triggers another parse through the process network at the

end of this algorithm, since new opportunities for optimizations may arise after this change. In the worst case scenario, all processes systematically change platforms, rendering the algorithm's time complexity O(n^2)², but this is very unlikely. In the average case the algorithm finishes after O(n) or O(2n).

$C_{total,platform} = C_{proc,platform} + \sum_{port \in proc} C_{transfer,platform}$   (8.1)

The total cost calculation is done with Equation 8.1. The execution cost $C_{proc,platform}$ is calculated depending on the platform with either Equation 7.1 or Equation 7.3. The transfer cost $C_{transfer,platform}$ is calculated with Equation 7.4, where $k_{transfer}$ can be $k_{H2D}$, $k_{D2H}$, $k_{D2D}$ or $k_{H2H}$, depending on the process' placement in the network.
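Combining Listing 8.5 with Equation 8.1 gives the following sketch; the structures are hypothetical, and in the real tool the transfer costs depend on the mappings of neighbouring processes and would be recomputed on every pass:

```cpp
// Sketch of the platform mapping heuristic (Listing 8.5 + Eq. 8.1);
// data structures hypothetical.
#include <vector>

enum class Platform { Host, Device };

struct MappedProcess {
    Platform mapping;
    double   exec_cost[2];      // cost on Host / Device (Eq. 7.1 or 7.3)
    double   transfer_cost[2];  // summed port transfer costs (Eq. 7.4)
};

double totalCost(const MappedProcess& p, Platform on) {
    int i = static_cast<int>(on);
    return p.exec_cost[i] + p.transfer_cost[i];  // Eq. 8.1
}

void optimizeMapping(std::vector<MappedProcess>& procs) {
    bool finished = false;
    while (!finished) {
        finished = true;
        for (auto& p : procs) {
            Platform other = (p.mapping == Platform::Host) ? Platform::Device
                                                           : Platform::Host;
            if (totalCost(p, other) < totalCost(p, p.mapping)) {
                p.mapping = other;  // remap and re-run the whole pass, since
                finished  = false;  // new optimization opportunities may arise
            }
        }
    }
}
```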

8.1.3 Load balancing the process network

The main purpose of this component is to load balance the execution of a process network in order to efficiently map it to a parallel platform that supports time-parallel computation (see Subsection 3.2.2). The outcome should be a parallel execution of the process network in a pipelined fashion, as presented in Section 4.4. In order to do so, the critical section of the process network has to be identified.

1 root ← ProcessNetwork.root
2 datapaths ← ExtractDataPaths(root)
3 quantum_cost ← FindCriticalCost(root)
4 contained_sections ← ExtractAndSortContainedSectionsByCost(datapaths)
5 for each contained_section ∈ contained_sections do
6     SplitPipelineStages(contained_section)
7     if cost was modified then goto line 5

Listing 8.6: Top level for the algorithm for load balancing

This algorithm can be reused in many design flows that target platforms which provide resources for time parallelism. In our case, this platform is the GPGPU device, but it may very well be used for other platforms that benefit from load-balanced pipeline stages. The algorithm is straightforward, and the only source of time complexity scaling at the top level is the parsing through contained sections at the end. The guard at line 7 can be activated only once, in case the quantum cost belongs to a process's execution and the calculation algorithm omitted adding the transfer costs (which are also unbreakable). The following paragraphs will present the main methods used by this algorithm.

Data paths extraction

The load balancing algorithm implies repeated analyses upon the paths followed by data. For this reason, a sensible decision would be to extract the data paths from the beginning and apply the algorithm upon them instead of parsing the process network every time an analysis needs to be done.

²since the equivalent processes have been erased and the model contains only grouped parallel composites, the initial number of processes is no longer relevant. From now on n will denote the new (reduced) number of processes.

    function ExtractDataPaths(composite)
        group_of_paths ← empty group(datapaths)
        for each port ∈ composite.out_ports do
            group_of_paths.append(ParsePath(port.connected_port.process, empty_path, root))
        return group_of_paths

    function ParsePath(process, path, root)
        mark process as visited ∈ path
        new_group_of_paths ← empty group(datapaths)
        new_path ← path
        add process to new_path
        for each port ∈ process.in_ports do
            next_process ← port.connected_port.process
            if next_process is root then
                add new_path to group_of_paths
            else
                if next_process was visited in the same path then
                    mark new_path as loop
                    add new_path to group_of_paths
                else new_group_of_paths.append(ParsePath(next_process, new_path, root))
        return new_group_of_paths

Listing 8.7: Method for data paths extraction, used by load balancing algorithm

The method ExtractDataPaths() solves this task and is described by the pseudo-code in Listing 8.7. It parses through the process network starting from its outputs and adds a new data path every time it reaches either the inputs or a visited process. When reaching a visited process, it marks the current data path as a loop, since loops imply further analyses in future steps of the algorithm. This results in an unrolled loop like in Figure 8.3.


Figure 8.2: Building individual data paths


Figure 8.3: Loop unrolling

The method returns a set of linear paths like in Figure 8.2. Its time complexity is difficult to characterize, since it depends on the result of the previous algorithms. Had the method RemoveRedundantZipsUnzips() from Subsubsection 8.1.1 not been applied, this method would have scaled with O(2^n), analysing and extracting many redundant paths. Instead, we can affirm that the method currently executes in a time somewhere between O(n²), for a network filled with ramifications, and O(n), in a straightforward network. Again, the costs for searching through groups and verifying whether processes were visited are ignored, since they are implemented with combsets as lookup tables, rendering their time complexity O(1).
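For illustration, a skeleton of the ParsePath() recursion from Listing 8.7 with the visited check backed by a hash set, so that each membership test is O(1) as claimed above. All type and member names here are invented for the example and do not reflect the f2cc API:

    #include <string>
    #include <unordered_set>
    #include <vector>

    struct Proc { std::string id; std::vector<Proc*> inputs; };
    struct Path { std::vector<Proc*> procs; bool is_loop = false; };

    static void parse_path(Proc *p, Path path, const Proc *root,
                           std::unordered_set<std::string> visited,  // per-path copy
                           std::vector<Path> &out) {
        visited.insert(p->id);
        path.procs.push_back(p);
        for (Proc *next : p->inputs) {
            if (next == root) {
                out.push_back(path);            // reached a system input
            } else if (visited.count(next->id)) {
                Path loop = path;
                loop.is_loop = true;            // closed a loop: stop here
                out.push_back(loop);
            } else {
                parse_path(next, path, root, visited, out);
            }
        }
    }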

Computing the quantum cost and the number of bursts

This section describes the quantum cost: a cost that cannot be split any further. Based on this cost, the load balancing algorithm has to balance the process network, using execution and transfer costs, so that the pipeline stages mirror the critical section for an optimum ratio between performance and resource distribution.

Another key parameter introduced at this stage is the number of data bursts. It can be defined as the number of interleaved execution tracks that can be performed in parallel, similar to CUDA streams. Splitting the data to be processed into separate streams has the advantage of overlapping communication with computation but, to be advantageous, these two mechanisms have to have similar execution times (see Figure 4.3). The optimum distribution of data streams is strongly dependent on the ratio between the maximum communication cost and the maximum computation cost.

The method FindCriticalCost() from Listing 8.6 performs the above-mentioned tasks. It involves browsing the processes in the network once and the data paths once. Among the data paths, only the ones containing loop sections are analyzed; the others are ignored. Judging by this, we can affirm that the method's time complexity scales with O(n) in the average case, and O(n²) in the very unlikely case where all data paths are full loops that enclose all the processes in the network.

This method's main task is to search and apply Equation 8.2 through Equation 8.6, and afterwards extract the maximum costs through Equation 8.7 and Equation 8.8. The main factors are:

• C_{max,H2D} and C_{max,D2H} are the transfer costs between host and device, which may constitute the main bottleneck in many applications. Since this cost is "unbreakable", it is a candidate as the quantum in pipelining.

• C_{max,par} is the critical computation cost among the processes mapped for a device with resources for time parallelism. In this case, this platform is the GPGPU device.

• C_{max,seq} is the total cost of processes running on a platform that does not offer resources for time parallelism. In our case, this platform is the CPU host. Since this execution cannot be split into pipeline stages, it may very well be considered a bottleneck and treated as such.

• C_{max,Δ} is the maximum cost of the processes in a loop. Since the current component is not equipped with an algorithm for splitting loops further while preserving the system's semantics, loops are only allowed to be split into pipelined sections as many times as there are delay processes available. Further methods will take this fact into account.

• C_{max,comp} is the maximum computation cost, describing the critical execution time in the process network.

• C_{max,comm} is the maximum communication cost, describing the critical transfer time in the process network.

C_{max,H2D} = \max(C_{H2D})    (8.2)

C_{max,D2H} = \max(C_{D2H})    (8.3)

C_{max,par} = \max(C_{proc,par})    (8.4)

C_{max,seq} = \sum_{proc \in seq} C_{proc,seq}    (8.5)

C_{max,Δ} = \max\left( \frac{\sum_{proc \in loop} C_{proc,par}}{n_Δ + 1} \right), \forall loop    (8.6)

C_{max,comp} = \max(C_{max,par}, C_{max,seq}, C_{max,Δ})    (8.7)

C_{max,comm} = \max(C_{max,H2D}, C_{max,D2H})    (8.8)

After the critical costs are calculated, a re-evaluation is done in order to find the optimum number of bursts and to fix the quantum cost Q. The number of bursts is found by applying Equation 8.9. The equation can be justified by the fact that splitting the execution into pipeline stages, on a platform where the main bottleneck is the data transfer, may prove useless if the cost for execution is greater than the cost for transfer. In that case, there will be no time parallelism involved, thus all the data will be processed in one burst at a time. Otherwise, the number of bursts may be described as the number of times that the slowest process execution may be overlapped with the slowest data transfer. When splitting the execution into data bursts the maximum transfer cost lowers, as can be seen in Equation 8.10, thus Q will reflect this.

N_{bursts} = \lceil C_{max,comm} / C_{max,comp} \rceil   when C_{max,comm} > C_{max,comp};   1 otherwise    (8.9)

Q = \max\left( C_{max,comp}, \frac{C_{max,comm}}{N_{bursts}} \right)    (8.10)
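As a hypothetical numeric illustration (the values are not taken from the report): with C_{max,comm} = 600 and C_{max,comp} = 200, Equation 8.9 gives N_{bursts} = \lceil 600/200 \rceil = 3, and Equation 8.10 gives Q = \max(200, 600/3) = 200; splitting the transfers into three bursts lets each of them hide behind the slowest computation.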

The information extracted at this stage will be used both for further model transformations and for the synthesis of pipelined code in Section 8.2.

Sorting the contained sections by cost

After the extraction of individual data paths, a second extraction is performed. This time, contained sections are extracted from the individual data paths. A contained section is a stream of neighbouring processes in a data path that are mapped for parallel execution. While they are being extracted, the contained section groups are sorted by their cost. The contained section cost respects Equation 8.11.

C_{contained,par} = \sum_{proc \in contained} C_{proc,par}    (8.11)

The method ExtractAndSortContainedSectionsByCost() is used for the above-mentioned task. It takes into account whether the contained section is part of a larger loop and calculates the loop cost with Equation 8.6. Since loops can be broken only as far as the number of delay

    function ExtractAndSortContainedSectionsByCost(datapaths)
        sorted_csections ← empty group(contained_sections)
        for each datapath ∈ datapaths do
            extract contained_sections from datapath
            for each section ∈ contained_sections do
                if section is loop then loop_cost ← cost(loop)
                sect_cost ← cost(section)
                if loop_cost > sect_cost then sect_cost ← loop_cost
                csect_pair ← pair(sect_cost, section)
                add csect_pair to sorted_csections
        return sorted_csections

Listing 8.8: Method for extracting and sorting contained sections by their cost

processes permits it, these sections' costs should not be treated separately. Regarding its time complexity, we can say that it scales with O((n/2)·n) in the worst-case scenario, where all processes alternate platforms, making all contained sections one process wide, while all of them are part of one system-wide loop, forcing a second browsing through the whole system every time. Otherwise, a balanced system would be parsed by this algorithm in around O(n) time, since the supplementary inner loops tend to be balanced out by the groupings into paths.

Load balancing main method

The main method for the load balancing algorithm is SplitPipelineStages() and it uses all the data extracted previously. It is applied upon the contained sections in reverse order of their cost, fixing a strict priority for resolving the load balance. This enables the analysis of the critical path first, since its analysis may change the quantum cost and render the load balancing invalid.

The pseudo-code in Listing 8.9 presents the current method. As can be seen, the method starts by filling up already existing pipeline stages until they reach the maximum allowed cost Q. These stages were formed in previous steps for contained sections with a higher priority, but which have processes in common with the current section. For this purpose, the synchronization cost plays a very important role, along with the computation cost. Depending on the process's context, the synchronization cost may be one of the following: C_{H2D} or C_{D2H} for transfers between host and device, C_{D2D} for transfers between two processes belonging to the same pipeline stage (thus mapped to the same kernel), or C_{T2T} for transfers between two processes belonging to different pipeline stages (thus mapped to different kernels, introducing a new type of cost). The processes which are left unmapped are then analyzed for assignment to pipeline stages. After further splitting them into coherent sections, a further cost calculation is done, similar to the previous one. As seen in line 29, there is a guard that verifies whether the quantum cost Q is still relevant or has been surpassed by the synchronization and computation cost of a single (critical) process. In that case, the algorithm is halted and invalidated, and a new stage splitting is performed with respect to the new quantum cost. Since the algorithm is a heuristic one, based on empirical cost calculations and parsed from process to process, the result may not be optimal concerning the ratio between performance and resource allocation. This is why, to increase the chances of delivering an optimal solution, a second analysis is done, this time in the reverse order of the process execution (i.e. from right to left). This doubles the amount of time needed to perform the method, but may provide a

     1  function SplitPipelineStages(contained_section)
     2      already_assigned_processes ← empty list
     3      for each process ∈ contained_section do
     4          if process is assigned to pipeline stage then
     5              add process to already_assigned_processes
     6      for each assigned_proc ∈ already_assigned_processes do
     7          if assigned_proc is before/after an unassigned process ∈ contained_section then
     8              proc_to_assign ← that unassigned process
     9              assigned_stage ← assigned_proc.stage
    10              new_sync_cost ← sync_cost(proc_to_assign)
    11              old_sync_cost ← sync_cost(assigned_proc)
    12              new_stage_cost ← assigned_stage.cost − old_sync_cost + new_sync_cost
    13              if new_stage_cost < Q then
    14                  assign proc_to_assign to assigned_stage
    15                  mark flag goto.non_interrupt ⇒ line 2
    16      unassigned_sections ← contained_section.split(unassigned processes)
    17      for each section ∈ unassigned_sections do
    18          left_right_stages ← empty group(sections, costs)
    19          right_left_stages ← empty group(sections, costs)
    20          for each process ∈ section, left to right do
    21              if process is section.first then
    22                  sync_cost ← sync_cost(process).input
    23              sync_cost ← sync_cost + sync_cost(process).output
    24              computation_cost ← comp_cost(process)
    25              if computation_cost + sync_cost < Q then
    26                  add process, sync_cost, computation_cost to left_right_stages.last
    27              else if process is the only one in the stage then
    28                  Q ← computation_cost + sync_cost
    29                  return, invalidating the algorithm so far
    30              else
    31                  new_stage ← new stage(process, sync_cost, computation_cost)
    32                  add new_stage to left_right_stages
    33          for each process ∈ section, right to left do
    34              if process is section.last then
    35                  sync_cost ← sync_cost(process).output
    36              sync_cost ← sync_cost + sync_cost(process).input
    37              repeat steps 24–32, filling right_left_stages
    38          demands ← right_left_stages.sync_cost < left_right_stages.sync_cost and right_left_stages.size < left_right_stages.size
    39          if demands are satisfied then
    40              for each process in right_left_stages do
    41                  assign process to right_left_stages.stage
    42          else
    43              for each process in left_right_stages do
    44                  assign process to left_right_stages.stage

Listing 8.9: Method for splitting the process network into pipeline stages based on cost information

better mapping concerning synchronization costs and the number of pipeline stages. The actual mapping of processes to pipeline stages is done after deciding which of the previous results is optimal, a decision that is subject to change in future improvements. The ideal number of pipeline stages should respect Equation 8.12: it should be either the number of processes in the section or the ratio between the section cost and the quantum cost, but due to mapping mechanisms it may be higher. This is why a second search is beneficial.

N_{sections} = \min\left( n_{proc \in contained}, \left\lceil \frac{C_{contained,par}}{Q} \right\rceil \right)    (8.12)
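As a hypothetical illustration: a contained section of 4 processes with C_{contained,par} = 500 and Q = 180 yields N_{sections} = \min(4, \lceil 500/180 \rceil) = 3 pipeline stages.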

The method's effect on a ForSyDe model can be seen by studying the differences between Figure C.6³ and Figure C.7 from Appendix C. In the former model all processes are assigned to stage 0, while in the latter model they hold different stage mapping information.

³see the notes in Appendix C for the reason why Figure C.6 is different than Figure C.5

Due to the algorithm's complexity and its strong dependence upon all the previous steps, it is difficult to analyze its scalability. It employs many switch statements and guards, and its execution flow is hard to follow. The only certainty is that the number of calculations decreases as the data paths are analyzed, since once a process has been assigned to a pipeline stage, all the algorithm steps revolving around it are skipped. Also, processes mapped for the sequential platform are not taken into consideration at all. Thus we can affirm that the time complexity roughly scales with O(n).

8.1.4 Pipelined model generation

In order to enable the component to synthesize CUDA code, a series of directives needs to be provided to the synthesizer module. The model already holds enough information to enable the mapping of processes to kernels, but a "cleaner" way is to modify the model so that it does most of the synthesizer's work: gathering the correct functions under a kernel and generating wrappers around them. Thus a final model modifier algorithm is necessary, to group the processes associated with a pipeline stage under a parallel composite. The effect of this algorithm on the ForSyDe model can be seen in Figure C.8 from Appendix C. The algorithm is similar to the method CreateParallelComposite() from Listing 8.2. It is a succession of creating new parallel composites and filling them with processes, while assuring model integrity. Because of the current framework's instability, the model integrity checks take most of the execution time. Since all processes mapped for time-parallel execution are visited only once, we can affirm that the algorithm scales with O(n).

8.1.5 Future development

Many aspects of the algorithms and the implementation mechanisms are still to be studied and developed. The current component is just a prototype and provides only a theoretical basis for a much larger area: model manipulation and design space exploration as part of the ForSyDe system design flow. As in the previous chapter, a few proposed directions for further research will be listed.

As mentioned, the current algorithms and the design decisions taken in their implementation were forced by the current model framework. The aspect that suffered the most was the performance during the identification and extraction of data parallel sections. The algorithm proposed in Subsubsection 8.1.1 can greatly increase performance, from O(2^{n_c}) and O(n_l^2), respectively, to O(log n_l). Also, the new algorithm should be implemented with minimum code and development overhead, on a framework like the one proposed in Section B.3 and Subsection 7.1.2. Therefore, after developing the ForSyDe Toolbox backbone, the development of this algorithm has the highest priority.

In Listing 8.6, the load balancing algorithm employs two extractions: one for data paths, and another for contained sections. While this separation is natural, considering that they employ different analyses, these two algorithms may be merged into a single browsing of the whole system. While increasing the performance, this improvement may lower the algorithm's flexibility and reusability, thus its effect needs to be studied.

In Listing 8.8, the sorting method may trigger the calculation of the same loop cost multiple times, if there exist multiple contained sections included in the same loop. This value could be stored in a separate lookup table, or a container could be provided for a loop object to avoid such a situation. The development overhead, though, would not have been justified compared to the performance gain. Still, it may be regarded as a possible enhancement.

The platform optimization algorithm in Listing 8.5 studies one process at a time when taking the decision of changing the platform. If there exists a chain of processes that would benefit from changing the platform altogether instead of separately, it will be ignored. A method for identifying such processes needs to be developed.

Apart from purely implementation issues, there is a number of algorithmic issues that have to be intensely studied and formally validated in order to continue the component's implementation. First of all, the problems revolving around splitting loops while preserving the semantics of the system need to be solved. Currently a safe solution is adopted, which still needs to be formally verified: allowing a loop to be split only as many times as there are delay elements in the data path. This both restricts the splitting position and diminishes the potential for parallelism in case of system-wide loops. This issue needs to be studied and further treated. There are also still many effects that the separation into pipeline stages may have upon the process network, which have to be studied as well, for example removing a delay element at a separation.

As can be seen, most of the algorithms and features provided in this contribution have a predominantly engineering profile: we present the problem and then try to find the best solutions in the given time frame. For this reason, an in-depth study of the related research, in order to find the roots of the problems, is definitely necessary.
For example, automatic parallelization has been intensively studied in the past years, and the scientific literature may hold answers for our problems as well. [Feautrier, 1996] describes an automatic parallelization process which transforms the code in question into a polyhedral representation called a polytope, and uses mathematical methods to perform automatic partitioning of data depending on the frequency of their access. An enhanced version of this algorithm is presented in [Baskaran et al., 2008], containing a method for estimating when such allocations are beneficial. The load balancing algorithm still needs to be formally validated. Although it delivers the desired results, it is still an engineering solution that may not be ideal in all situations. An analytical way of tackling the problem and delivering the optimum result needs to be studied in the research community.

8.2 Synthesizer algorithms

The second type of algorithms provided in the current contribution are the code synthesis algorithms. They are employed by the synthesizer module, as part of the flow depicted in Figure 5.2. These transformations target only CUDA platforms and are the equivalents of the code optimizations in the flow depicted in Figure B.4. The core function of this stage, called several times, is the scheduler, which is presented and thoroughly described in [Hjort Blindell, 2012] and was developed as part of f2cc v0.1. Its purpose is to find an optimal sequential schedule for executing processes so that the system's semantics are preserved. The current contribution did not develop the algorithm any further, apart from integrating it into ForSyDe v0.2. The algorithm is described by the pseudo-code in Listing 8.10. Its purpose is to perform a final set of parsings and generate a function code file (.c for C and .cu for CUDA) and a header file (.h). It treats each type of process differently:

• comb leaf processes are not processed any further, since they already have the function

    check ProcessNetwork
    root ← ProcessNetwork.root
    for each composite ∈ root.composites do
        FindSchedule(composite)
    for each composite ∈ ProcessNetwork.composites do
        if target platform is CUDA and composite is root then
            generateCudaKernelWrapper(composite)
        else generateWrapperForComposite(composite)
    generateKernelConfigurationFunction()
    functions ← ProcessNetwork.functions
    code ← generateDocumentation()
    for each fct ∈ functions do
        code ← code + fct.string
    generateCodeHeader(functions.last)

Listing 8.10: Top level for the algorithm for code synthesis

body. Their header instead is used for function invocation inside their parent's (a composite process) wrapper function.

• other leaf processes are treated only in their parent's wrapper function, where specific code is generated for each one, as will be presented in Subsection 8.2.1.

• composite processes are analyzed for a sequential schedule of their contained processes. The synthesizer module generates a wrapper function that invokes its child processes' functions in the order determined by the found schedule.

• parallel composite processes are treated like composite processes. If they are targeted for sequential platforms, they are invoked multiple times. Otherwise, on parallel platforms, their execution implies the usage of parallel threads.

The following subsections will present the specific methods for the two approaches of synthesizing either C or CUDA code.

8.2.1 Generating sequential code

The sequential code generation methods are used system-wide for both target platforms. Only the root process is skipped for CUDA mapping, since that requires a CUDA kernel. As previously mentioned, only composite and parallel composite processes are analyzed and associated with wrapper functions. The generation of these wrappers is described in Listing 8.11. The method starts by extracting signals and delay variables from the composite process. These new elements and their generation are presented in section 8.5 of [Hjort Blindell, 2012]; they are necessary to generate the variables that transmit data between the process functions in the wrapper body. In f2cc v0.1 their role was crucial, since the ForSyDe model did not carry information about the data types transported between processes, and a full system parse was necessary to back-trace these types from the function code. In f2cc v0.2 this problem no longer exists, since all the necessary information is available in the model. Storing this information in a set of containers that will directly be transformed into variables, though, eases the synthesis process. Their extraction does not imply a full network parse; instead, information is gathered from the existing ports and encapsulated in signal, respectively delay variable, object containers.

    function generateWrapperForComposite(composite)
        new_function ← empty function
        extract signals from composite
        extract delay_variables from composite
        for each port ∈ composite.ports do
            new_input(output)_parameter ← variable(port)
            add new_input(output)_parameter to new_function.parameters
        body ← GenerateCompositeDefinitionCode(composite.schedule)
        RenameVariables(body)
        add body to new_function
        add new_function to ProcessNetwork and composite.wrapper
        if composite is root then
            GenerateRootExecutionCode(new_function)
            add new_function to ProcessNetwork.functions

Listing 8.11: Method for generating sequential wrapper code for composite processes

The second step in creating a sequential function wrapper is building the function header. For this, the composite process's inside ports are parsed and transformed into input/output function parameters. Since all the information is already present in the ports, creating parameters is a matter of binding this information into new containers.

The next step is generating the function body with GenerateCompositeDefinitionCode(). This method takes the composite and its schedule as arguments and builds the execution code, in the following order (a sketch of the resulting wrapper shape follows this list):

1. generate variable declaration code from signals.

2. generate delay variable declaration and initialization code.

3. generate the first step of the delay execution code (i.e. loading the previous value from the delay variable).

4. generate execution code for all other processes. Depending on the process type, the processes are treated differently:
   • fanout processes just copy the input values to their output variables;
   • zipx and unzipx distribute their input values to/from multiple other variables;
   • comb processes invoke the process function through a function call generated from its header;
   • composite processes invoke the previously generated wrapper execution code, in the same way as for comb processes;
   • parallel composite processes invoke the associated composite wrapper execution code, but inside a for loop that reflects the number of processes;

5. generate the final step of the delay execution code (i.e. storing the next state value in the delay variable).

6. generate signal clean-up code. All the variables that have been allocated during execution need to be deallocated from memory.

The next step is to browse again through the function body and rename variables. Since signals have unique IDs which are combinations of characters involving process IDs, port IDs, and hierarchy paths, they are hard to follow when manual debugging is necessary. Thus a set of more intuitive names is assigned to the variables.
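To make the six steps above concrete, the following sketch shows the general shape such a generated wrapper could take. It is hand-written for illustration under assumed names (composite_wrapper, comb_f1, comb_f2, N) and is not actual f2cc output; see Listing C.4 in Appendix C for real generated code:

    #define N 64  /* assumed signal width for the example */

    /* hypothetical process functions extracted from the ForSyDe model */
    static void comb_f1(const float *in, float *out) {
        for (int i = 0; i < N; ++i) out[i] = in[i] + 1.0f;
    }
    static void comb_f2(const float *in, float delayed, float *out) {
        for (int i = 0; i < N; ++i) out[i] = in[i] * delayed;
    }

    void composite_wrapper(const float *in, float *out) {
        float sig_a[N], sig_b[N];      /* 1. variables declared from signals  */
        static float delay_v = 0.0f;   /* 2. delay variable, initialized      */
        float delayed = delay_v;       /* 3. load previous value of the delay */
        comb_f1(in, sig_a);            /* 4. comb: plain function invocation  */
        for (int i = 0; i < N; ++i)    /*    fanout: copy input to output     */
            sig_b[i] = sig_a[i];
        comb_f2(sig_b, delayed, out);
        delay_v = out[0];              /* 5. store next state in the delay    */
        /* 6. heap-allocated signals would be freed here (none in this sketch) */
    }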

As seen in Listing 8.11, for the root composite process two wrappers are generated. The second one is the top module function accessed by the C header. An example of generated code for a composite process wrapper can be seen in Listing C.4 from Appendix C.

8.2.2 Scheduling and generating CUDA code

The CUDA code generation mechanism is invoked if the component's target platform is set to GPGPU. It consists of three steps:

• generate execution code for the parallel composite processes, with a method similar to the one in Listing 8.11.

• generate a scheduler for splitting and executing pipelined code streams with the contained sections belonging to the root process.

• generate kernel wrapper functions (ideally one for each contained section), configure them for optimal execution and run them through an elaborate set of mechanisms.

Basic concepts for the mapping algorithm

Before presenting the method's mechanisms, the theoretical background that the method is based on needs to be presented. All the development efforts for the current component have been targeted at providing enough information to make it possible to implement a ForSyDe system on a platform that supports data and time parallelism. Cost analyses, data extraction, model modifications and all other operations built up to this purpose: offering an optimized data parallel and pipelined solution. Although the final purpose of this component is to generate CUDA code that employs streams for an interleaved execution, all the stages of the design flow were developed to be as general as possible, so that they can be treated separately from nvidia GPGPUs. Ideally, they should be flexible enough to be reused in other design flows than the current one, targeting a large set of parallel platforms. Following the same philosophy, even this last algorithm was conceived to be as independent as possible from the specific GPU implementation traits (although a fully independent approach is impossible). This way, future development may have a starting point for different research directions. The mapping algorithm revolves around two parameters that are extracted during different stages of the design flow:

• the number of stages, or N_{stages}, denotes the number of pipeline stages that may be interleaved for parallel execution. This parameter is extracted further on in Listing 8.12. Apart from the number of parallel composites associated with stages, this parameter needs to add up the data transfers between host and device that happen during the scheduled execution of processes.

• the number of data bursts, or N_{bursts}, denotes the number of times the transfer costs are to be split so that they are balanced by the load of the slowest executing process. This parameter has been described in Equation 8.9.

As presented in Section 4.4, the streams have the potential of overlapping the data transfer with the device execution, and (for devices supporting this type of actions) interleaving the execution of multiple kernels and the host. Although this is a device-specific property, the architecture provided by devices with the above-mentioned compute capability may be regarded (at least in early stages) as a general pipelined architecture.

The execution model targeted by this component is depicted in Figure 8.4, which shows the performance advantages of pipelined execution. The plot models the repeated execution of three data sets on a GPGPU. Since the ratio between the maximum computation time and the maximum transfer time is approximately 3, the data transfer has been split into three bursts. Since the number of pipeline stages is 3 kernels + 2 transfers + 1 CPU = 6, which is less than the number of streams available in the GPU (16 for compute capability 2.0 or higher), the component will allocate 6 streams for interleaved execution. If the system is fed an infinite stream of data sets, provided there are no resource conflicts during execution and the pipeline stages are perfectly split (there is no synchronization overhead) so that the resource utilization becomes 1, the execution has the potential to perform in 1/N_{stages} of the initial time. In other words, the system execution time is reduced to the critical execution time. Practically though, Figure 8.4 shows that these overheads are inevitable.

The figure presents two approaches for executing the system in a pipelined fashion. The first one involves synchronizing the device after each stage. This results in a predictable execution pattern that respects the semantics of the ForSyDe SY MoC. This model is a first approach for future research in using GPGPUs in systems which require a predictable description of their timing behavior. As seen in Figure 8.4, the synchronization induces an overhead for the stages that have lower costs than the quantum cost. The second approach does not synchronize the device after each pipeline stage, leaving the CUDA scheduler to assign processes as long as there are resources available. This leads to an unpredictable behavior and a number of resource clashes, which may affect performance and cancel the head start over the synchronous approach (as happens in Figure 8.4). Such an execution model could be compared with the ForSyDe SDF MoC, where processes are executed as soon as there is data available.

Assigning data bursts to streams is a "revolving barrel" problem. Each stream is assigned a displacement with which it browses through the data sets, as suggested in Table 8.1.

    stream   1  2  3  4  5  1  2  3  4  5  1  2  3  4  5  1
    burst    1  2  3  4  5  6  7  1  2  3  4  5  6  7  1  2

Table 8.1: Assigning data bursts to streams for a system with Nbursts = 7 and Nstreams = 5
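A minimal CUDA sketch of this "revolving barrel" assignment, under assumed names (kernel_k1 standing in for a generated kernel, run_bursts for the generated dispatch code); for the copies to actually overlap with computation, the host buffers h_in/h_out are assumed to be pinned (allocated with cudaMallocHost):

    #include <cuda_runtime.h>

    __global__ void kernel_k1(const float *in, float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i] * 2.0f;       /* stand-in process function */
    }

    void run_bursts(const float *h_in, float *h_out, float *d_in, float *d_out,
                    int n_bursts, int burst_elems, int n_streams) {
        cudaStream_t streams[16];               /* 16 = stream limit cited above */
        for (int s = 0; s < n_streams; ++s) cudaStreamCreate(&streams[s]);

        size_t bytes = burst_elems * sizeof(float);
        for (int b = 0; b < n_bursts; ++b) {
            /* "revolving barrel": burst b goes to stream b mod n_streams */
            cudaStream_t s = streams[b % n_streams];
            size_t off = (size_t)b * burst_elems;
            cudaMemcpyAsync(d_in + off, h_in + off, bytes,
                            cudaMemcpyHostToDevice, s);
            kernel_k1<<<(burst_elems + 255) / 256, 256, 0, s>>>
                     (d_in + off, d_out + off, burst_elems);
            cudaMemcpyAsync(h_out + off, d_out + off, bytes,
                            cudaMemcpyDeviceToHost, s);
        }
        for (int s = 0; s < n_streams; ++s) {
            cudaStreamSynchronize(streams[s]);
            cudaStreamDestroy(streams[s]);
        }
    }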

The synthesizer algorithm

The pseudocode in Listing 8.12 describes the top level of the CUDA code generation algorithm. It is applied upon a process network's root, and it generates several functions associated with either kernel functions or kernel wrappers. The generation of these kernels is thoroughly described in [Hjort Blindell, 2012], thus only the important aspects will be stressed, the reader being encouraged to consult the documentation for f2cc v0.1 for further details.

Figure 8.4: Streamed execution model. Comparison between not using and using streams in two configurations (panels: Non-Streamed Execution; Streamed Execution with Device Synchronization; Streamed Execution without Device Synchronization)

The first action after extracting signals and delay variables is to add the prefix __device__

    function generateCudaKernelWrapper(root)
        new_function ← empty function
        extract signals from root
        extract delay_variables from root
        for each port ∈ root.ports do
            new_input(output)_parameter ← variable(port)
            add new_input(output)_parameter to new_function.parameters
        for each process ∈ root.processes do
            if process is mapped for device execution then
                process.cfunction.prefix ← "__device__"

        kernel_schedules ← empty list of schedules
        num_stages ← 0
        for each contained_section ∈ root.schedule do
            num_stages ← num_stages + 2
            add empty_schedule to kernel_schedules
            for each process ∈ contained_section do
                num_stages ← num_stages + 1
                add process to kernel_schedules.last
            compute n_proc
            kernel_exec ← GenerateExecutionForKernelComposite(kernel_schedules.last, n_proc)
            kernel_func ← generateCudaKernelWrapperFunction(kernel_exec)
            add kernel_exec, kernel_func to ProcessNetwork.functions

        new_body ← generateCudaRootCode(root, kernel_schedules, num_bursts, num_stages)
        add new_body to new_function
        add new_function to ProcessNetwork.functions and root.wrapper

Listing 8.12: Method for generating wrappers associated with CUDA code

for all the functions belonging to processes mapped for device execution. This prefix lets the compiler know that the specific function will be executed on the device. The next step is to browse all the previously identified contained sections (chains of processes mapped for parallel execution on the device). For each section, the number of pipeline stages N_{stages} is calculated as suggested in the previous subsection. Also, the processes are added to a new schedule which will aid in generating the kernel execution function for that particular contained section. Before generating the kernel execution function and the kernel wrapper associated with a contained section, the number of parallel processes (which will further be used to calculate the number of threads) is calculated using Equation 8.13. Since the data "wideness" has been split into bursts, the number of processes contained by the parallel composites is no longer relevant and needs to be readjusted.

N_{proc,kernel} = \left\lceil \frac{N_{proc,pcomp}}{N_{bursts}} \right\rceil    (8.13)
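For instance, with hypothetical values N_{proc,pcomp} = 4096 and N_{bursts} = 4, Equation 8.13 yields N_{proc,kernel} = 1024 parallel processes (and hence threads) per kernel invocation.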

The synthesis of a kernel execution function with GenerateExecutionForKernelComposite() is similar to the method described in Listing 8.11. The main difference is that device synchronization code is now added, in case the first approach in Figure 8.4 is chosen. The function prefix is set to __device__ as well. The kernel wrapper created with generateCudaKernelWrapperFunction() sets the CUDA indexes according to the future kernel configuration, and invokes the execution function. Its prefix is set to __global__. More information about the kernel wrappers is found in [Hjort Blindell, 2012]. The top level function is the last one synthesized by the current component. This function contains code for gathering device information and, based on it, calculating the optimum configuration for the invoked kernels. Its mechanisms are also explained in [Hjort Blindell, 2012]. The main contribution of v0.2 to this method is introducing the mechanism for mapping kernels to streams and distributing the data associated with them. Since for each contained section the component generates a (presumably smaller) kernel, each having its own N_{proc,kernel}, the top module function's main role is to invoke them while preserving the execution model in Figure 8.4. This function creates an array of N_{streams} streams, where N_{streams} is calculated with Equation 8.14 and may be considered the number of pipeline stages, as long as it is smaller than the maximum number of resources provided by the GPGPU device⁴.

N_{streams} = \min(N_{max,streams}, N_{stages})    (8.14)

All the signals between the contained sections and the non-contained processes in the top module are split into N_{streams} as well, since each stream operates on its own set of data. For each signal, the component generates code for allocating device and host memory at the beginning of the system's execution, and for deallocating it at the end. The system execution is then wrapped inside a while loop that waits for streams to finish their execution, interrogating them in a revolving manner. Each time a stream is identified to have finished its execution and still has data left in its work set, its associated kernel is invoked, after calculating an ideal kernel configuration, and its work set index is incremented. An example of a generated kernel execution function and its associated kernel wrapper can be seen in Listing C.6, and a top level function that invokes this kernel is found in Listing C.5. For page layout reasons, parts of the code had to be removed and replaced with (...). This code is either reused from f2cc v0.1, in which case its meaning is explained in [Hjort Blindell, 2012], or repeated over multiple data sets.
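The pending-and-reinvoke loop described above might look as follows. This is a hedged sketch, not the generated code from Listing C.5; streams, bursts_left and launch_burst are invented stand-ins for the generated bookkeeping and kernel invocation code:

    #include <cuda_runtime.h>

    void schedule_streams(cudaStream_t *streams, int n_streams,
                          int *bursts_left, void (*launch_burst)(int stream_id)) {
        int active = 0;
        for (int s = 0; s < n_streams; ++s)
            if (bursts_left[s] > 0) ++active;
        while (active > 0) {
            for (int s = 0; s < n_streams; ++s) {   /* revolving interrogation */
                if (bursts_left[s] == 0) continue;
                /* cudaSuccess: all work queued on this stream has finished */
                if (cudaStreamQuery(streams[s]) == cudaSuccess) {
                    launch_burst(s);   /* compute configuration, invoke kernel,
                                          increment the stream's work set index */
                    if (--bursts_left[s] == 0) --active;
                }
            }
        }
        cudaDeviceSynchronize();       /* wait for the last launched bursts */
    }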

8.2.3 Future development

All the theoretical foundation provided in Part I of this document and the implementation efforts presented in Part II built up to one particular task: offering a data parallel and time parallel solution for a ForSyDe model. For this purpose, GPGPUs have been used as platform. But since the main contribution is intended to provide a foundation for multiple research directions, more resources have been invested in the conceptual part and the algorithms than in platform-specific optimizations for performance. Therefore the end result of this component may not be fully optimal or fully correct from a CUDA developer's point of view. The synthesizer module has the greatest potential for future optimization, and will require the largest work effort for future development and improvement. Although GPGPUs are notorious for demanding efficient hardware resource usage through clever coding, apart from the features already implemented in f2cc v0.1 and the features presented in this report, no specific optimizations were employed. This can be justified both by the limited time frame available and by the predominant proof-of-concept profile of this project. Regarding the sequential code generation, there is still room for much performance improvement. For example, although the inputs and outputs of generated composite execution functions are passed as pointers to spare execution time, the internal variables associated with signals and signal manipulation processes (zipx, unzipx, fanout) still employ data copying

⁴for example, the maximum number of streams provided by an nvidia GPGPU with compute capability 2.1 is 16.

mechanisms for passing values, for data integrity reasons. The sequential execution time would be greatly reduced if a correct mechanism for transporting pointers were found.

The CUDA code generated by the component could not be validated because of the limited time frame, and both the schedule and the execution model still require thorough research and development. Currently there is no support for models with more than one contained section, which would require the execution of more than one kernel. Although the model is built up for this specific purpose, there are still many implementation issues that have to be handled before a stable result can be offered. Also, both the execution model and the generated code are far from optimal. For example, the kernel configuration function is invoked every time a kernel is invoked, even if its parameters may not have changed, rendering unnecessary overhead on the CPU execution. Also, the invocation of the same CUDA kernel on multiple streams may not be considered best practice in the GPGPU programming community, but was chosen, again, for purely proof-of-concept reasons.

Concerning the synthesizer algorithms, many improvements are still to be made in that area as well. For example, there is currently no support for generated parallel composite processes whose child processes do not have the same fixed width⁵, nor kernel generation for this type of processes. Also, several stages may be merged or mechanisms changed in order to further increase the tool's execution performance. For example, as mentioned, the signal objects may be considered redundant in a model where ports already contain data type information, but their usage aided the reuse of several methods from f2cc v0.1, and hastened the tool implementation.

The development of a robust synthesizer tool will be possible only with help from the research community. Since there is much related research targeting the generation of optimized CUDA code, many of the ideas may be harvested and compiled for our own purpose. For example, [Udupa et al., 2009] presents a method for generating pipelined execution models using CUDA streams from a high level language called StreamIt. A more extensive study of the available research than the one already performed in Part I will be carried out as part of future development.

⁵this explains the reason why the model in Figure C.5 could not have been fed to the synthesizer, thus the more strict model in Figure C.6 was adopted.

Chapter 9 Component Implementation

This chapter will present a few of the main implementation traits. Although the main concepts behind the framework have already been presented, showing a part of the implementation is beneficial in order to get a grip on the internal mechanisms used during the design flow, and on the complexity of the project. Detailed descriptions of the architecture can be found in the component's API documentation, provided as part of the tool suite, and a potential developer is encouraged to consult it.

9.1 The ForSyDe model architecture

The ForSyDe model presented in Section 7.1 is implemented in a predominantly object-oriented manner. It was conceived as a superset of the model used in f2cc v0.1, depicted in Figure 5.3. Its inheritance and collaboration graph is presented in Figure 9.1. The relations between the model classes can be determined from the above-mentioned figure. Each Process contains a Hierarchy object, which is a list of Ids with a set of methods for finding relations. The Process class acts as parent for the Leaf and Composite classes, thus both these child classes can be regarded as processes. Also, their lists of Ports, respectively IOPorts, can be regarded from the outside as interfaces, since they both inherit basic traits from the Interface class. A Model is a class that contains lists with pointers to leafs and composites. The Composite is both a Process and a Model, since its main role is to encapsulate other processes, but it has to be seen as a process from the outside. The ProcessNetwork is another class that inherits traits from the Model. Its main role is to serve as a "via" for selecting the main components in the ForSyDe model. For the backward compatibility reasons mentioned earlier in Section 7.1, the process network points to ports which are inputs or outputs for the system. The CFunction, apart from its body which is plain text, contains CVariables, which themselves each encapsulate a CDataType. As mentioned in Section 7.1, the comb processes point to a CFunction while Map processes still encapsulate them, to preserve backward compatibility. Also, the fact that ports and "ioports" contain data type information can be seen from their inclusions.



Figure 9.1: The restructured classes used for the f2cc v0.2 internal model

Although not previously mentioned, the composite processes do include a CFunction container. It will be filled later, in the synthesis stage, when the tool generates execution functions for each composite process.
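A minimal C++ sketch of the class relations read off Figure 9.1; the real f2cc classes carry many more members and methods, and the names below are simplified for illustration:

    #include <list>
    #include <string>

    class Id { std::string s; };
    class Hierarchy { std::list<Id> path; };        // owned by every Process

    class Interface { /* common port traits */ };
    class Port   : public Interface { /* leaf port, carries a CDataType */ };
    class IOPort : public Interface { /* border port of a composite */ };

    class Process { Hierarchy hierarchy; };         // common parent class
    class Leaf : public Process { std::list<Port> ports; };
    class Model {
        std::list<Leaf*> leafs;                     // pointers, as in the text
        std::list<class Composite*> composites;
    };
    class Composite : public Process, public Model { // process outside, model inside
        std::list<IOPort> ioports;
    };
    class ParallelComposite : public Composite { int num_processes; };
    class ProcessNetwork : public Model {
        std::list<Port*> inputs, outputs;           // points to system I/O ports
    };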

9.2 Module interconnection

The interconnection between the main modules has been maintained as in Figure 5.2. Since the component has grown considerably since v0.1, most of the modules have been restructured or even split into sub-modules. The following paragraphs present only the main changes. For a full view of the component, the reader is encouraged to consult the API documentation.

As in f2cc v0.1, all methods and classes are bound under the f2cc namespace to avoid naming clashes with other libraries. The tools module, using the f2cc::tools namespace, still contains methods common to all modules and has been broadened with new methods. The same can be said about the config module, which now contains both data about the platform model and new switches and flags associated with the new execution flow. The logger and exceptions modules were left unmodified.

The frontend module has been provided with a new set of classes associated with the new component. The Frontend main class has been linked to an XmlParser class which extracts the information necessary to build the internal ForSyDe model. A subclass of XmlParser is CParser, which parses ForSyDe code to transform it into CFunction objects, as presented in Section 7.3. Separate from these classes, a new XmlDumper class has been provided, with methods for generating XML files from intermediate stages of the model modification.

The forsyde module underwent the most modifications. It has been split into three submodules. One resides under the f2cc::Forsyde namespace, and it holds the classes associated with the internal ForSyDe model. The second resides under the f2cc::Forsyde::SY namespace and is a subset of the ForSyDe model, specific to the SY MoC. This leaves space for future development, should it be decided to offer model support for more MoCs. This would imply describing process constructors with the same name but different semantics; grouping them into different namespaces takes care of the naming issues. The third submodule contains the model modifier classes for both f2cc v0.1 and v0.2.

The language module as such underwent minor modifications. Since it contains the language-related container classes, just a few methods were added to support the new type of functions. The whole C++ and ForSyDe-SystemC language support is handled by the frontend. The synthesizer module was developed separately from the existing one, thus a new class with synthesizer methods was added.

9.3 Component execution flow

Now that the main component's algorithmic, implementation and architectural features have been presented, the reader should be familiar with its internal mechanisms for creating, analysing, modifying and generating code from ForSyDe models. This makes possible a review of all the steps executed in the tool's design flow, as depicted in Figure 9.2.


Figure 9.2: The f2cc component execution flow

The component starts its execution by invoking the frontend, which inputs a ForSyDe model. While v0.1 inputs graphml files with C code annotations, v0.2 inputs xml files generated from ForSyDe-SystemC and hpp files with C++ code directly from the framework model. The choice of following a particular execution flow is made automatically, by verifying the input file extension. The frontend has specific parsers for each execution flow, which are invoked according to the above-mentioned choice. After the input files have been parsed, both execution flows result in an internal model representation using the new internal model. This model serves as input for the following stage, the modification stage. This stage is handled by different modifiers for each tool version, and ends up in another (modified) internal representation. As suggested in the depiction, in v0.2 the intermediate states are continuously monitored and the XmlDumper is invoked for dumping these models, resulting in multiple xml files. The model outputted by the modifier serves as input for the synthesizer module. Again, there are different synthesizers for each version of the component. The synthesizer in v0.2 inputs platform information stored as cost information in another xml file. This final step ends up generating either CUDA C code or sequential C code, in accordance with the user's choice.
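A sketch of the automatic flow selection based on the input file extension; the function names in the comments are invented for illustration and do not match the f2cc sources:

    #include <string>

    static bool ends_with(const std::string &s, const std::string &suffix) {
        return s.size() >= suffix.size() &&
               s.compare(s.size() - suffix.size(), suffix.size(), suffix) == 0;
    }

    void run_f2cc(const std::string &input_file) {
        if (ends_with(input_file, ".graphml")) {
            // v0.1 flow: GraphML input with C code annotations
            // parseGraphml(input_file); modifyModelV01(); synthesizeV01();
        } else if (ends_with(input_file, ".xml")) {
            // v0.2 flow: ForSyDe-SystemC XML plus .hpp function code
            // parseXml(input_file); modifyModelV02();  // dumps intermediate XML
            // synthesizeV02();                         // reads platform costs
        }
    }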

Part III

Final Remarks

Chapter 10 Component Evaluation and Limitations

This chapter will attempt to evaluate the current state of the Master’s Thesis project. The evaluation will be performed based on the output generated so far and a final verdict will be given. The second part will review all the limitations of the current component and will group together the proposals for future development based on their priority.

10.1 Evaluation

At the time of writing this report, the component's status is reflected by Table 10.1. The table presents the main tasks performed by the f2cc tool, grouped by the module which performs them. All other features are ignored, since they cannot be measured and deliver no result. Judging the frontend module, we can affirm that its performance is satisfactory. The XML parser works well: all XML files generated by the enhanced version of ForSyDe-SystemC (see Subsection 7.4.2) are parsed and analyzed for data extraction. The C function code parser performs acceptably as well: process function files from the ForSyDe model, written in C++ with annotations (see Appendix A), can be parsed and encapsulated in semantically-correct C function objects. The internal model construction is also done properly, according to the f2cc model (see Section 7.1), and is subject to further manipulation along the execution flow. Finally, the XML model dumper works correctly, therefore plotting the intermediate structures is possible. Analyzing the forsyde module, the results are satisfactory as well, although much development remains to be done. The internal f2cc model representation respects the ForSyDe protocol and stores enough information to make further model manipulations possible. Moreover, the manipulations are correctly performed according to Chapter 8. The data parallel sections are identified using an advanced model analysis algorithm that determines potentially parallel processes from random ForSyDe patterns that would otherwise be difficult to point out by a human designer. The platform optimization algorithm performs well when considering single processes in their context, but still needs improvement for cases where chains of processes would benefit from platform re-mapping.


    Module        Function                                           Status                   Sample(s)

    frontend      parse XML structural files                         working, satisfactory    Figure C.1
                  parse C++ function code                            working, satisfactory    Listings C.1 & C.2
                  build internal ForSyDe model                       working, satisfactory    Figure C.1
                  dump XML from model                                working, satisfactory    Figures C.1 to C.8

    forsyde       storage of internal model                          working, satisfactory
                  model manipulation                                 working, limited         Figures C.1 to C.8
                  advanced identification of data parallel sections  working, satisfactory    Figures C.1 to C.5
                  platform optimization                              working, limited
                  load balancing based on platform model             working, satisfactory    Figures C.6 to C.8

    synthesizer   process sequential scheduler                       working, satisfactory    Listing C.3
                  pipeline execution scheduler                       working, satisfactory    Listings C.3 & C.5
                  C code generation                                  usable, incomplete       Listing C.4
                  CUDA code generation                               not usable, incomplete   Listings C.5 & C.6

Table 10.1: The component's status at the time of writing this report

The load balancing employs a novel algorithm based on cost analysis extracted from a platform model (presumably provided by future ForSyDe tools), which performs well. Although functioning within satisfactory parameters, these algorithms require further research and development.

The synthesizer module could not be finalized in the time frame assigned for this project, thus, as Table 10.1 states, the code synthesis features, although functioning, do not yet generate proper code that can be used or tested. The C code, although usable, still needs some manual debugging to be run and tested (mostly concerning variable allocations or syntax). The CUDA code, on the other hand, as mentioned in Subsection 8.2.3, still requires implementation effort in order to generate proper code. The state of the current output may be verified in Listing C.5, and it can be seen that it respects the principles presented in Subsection 8.2.2. Still, there is a long way to go until syntactically correct CUDA code is generated, especially for ForSyDe models that would require the synthesis of multiple kernels. Even so, the schedulers work correctly and deliver the expected outcome. As a result, both performance evaluation and functional verification of the generated code are impossible for now.

Therefore, the final verdict is that the project is still too early in its development to be properly evaluated. For now, the demonstration provided in Appendix C, which shows intermediate steps of the tool's execution flow, will have to suffice for validating the project's current development stage. Based on this material, we can affirm that although the final results are not available for the time being, the f2cc component behaves in the desired way, as described in Chapter 8. The f2cc tool version 0.2 is therefore currently in its pre-alpha stage, and the release of a stable version will have to be delayed.

10.2 Limitations and future work

Part II of this report presents the contributions made in implementing the current component. Each time a new feature, concept or mechanism is introduced, its limitations and proposals for future contributions are stated. This section provides a short summary of these proposals, with the purpose of offering a clear and condensed list of goals for future development to be prioritized. As in Chapter 6, the priorities have been labelled High, Medium and Low. The proposals for improving the component framework are:

• Develop a new internal model based on existing tree and graph libraries (High). Since it has been shown on numerous occasions that the current model is very limited in its flexibility and scalability, a new model has been proposed in both Appendix B and Subsection 7.1.2. Its development is crucial for the project's further progress.

• Develop an API that masks all operations employed in the model manipulations (Medium). The manipulation of the internal model, although greatly aided by the current API, is still laborious and unstable. Since the new hierarchy structure introduces a new dimension for internal management, it has to be masked in order to increase development productivity and attract a larger community of developers. One of the proposed operations is presented in Subsection 7.1.2.

• Create a link between the structural model and the functional code (Low). Although invaluable for research and experimental purposes, this feature, as presented in Subsection 7.1.2, does not have a high priority yet compared to the other, more urgent goals.

• Improve support for/from the ForSyDe-SystemC design framework (High). In order to embed the component into a fully automated design flow, direct support from ForSyDe-SystemC is crucial. In Subsection 7.2.2 a few proposals are made in this sense.

• Impose a standardized set of data types (High). Although this feature would seem obvious in the context of system design and analysis, a study of its full impact on the design framework and language is necessary.

• Enhance support for recognition and extraction of complex data types and containers (High). The importance of complex data types has been seen in the development of the Linescan model and other ForSyDe models. As proposed in Subsection 7.2.2, mechanisms for recognition and manipulation of these data types need to be embedded into the ForSyDe framework.

• Offer support for dynamic systems with variable parameters (Low). These systems would greatly broaden ForSyDe's range of applications, as suggested in Subsection 7.3.2. Currently, though, this goal is only speculative, and further research is necessary to determine its validity.

• Develop a model for storing code's functionality based on its AST (Medium). As long as code manipulation and storage is based on text methods, it is subject to errors and imposes unnecessary restrictions upon the design framework, weakening ForSyDe's potential and attractiveness for new users.

• Develop mechanisms for functional code analysis and manipulation (Low). This task connects the previous goal with creating a link to the ForSyDe structural model. For the time being it is a distant goal, of purely scientific interest.

• Develop a tool for model run-time analysis in addition to the run-time functional validation (Medium). Since the current component employs an empirical model for cost description, it would be proper to extract these costs from run-time analyses of the execution patterns. Such a tool is proposed in Appendix B.

• Provide a real model of the platforms based on real measurements and architectural traits (High). This issue is discussed in Subsection 7.4.3. For the design space exploration algorithms to make sense, a real model of the platform is needed.

• Improve the platform description algorithms (High). Currently, the running platform is described using the simple (linear) equations enumerated in Section 7.4, involving some empirical constants (a sketch of such a linear model is given at the end of this section). These formulas need to be further developed.

Regarding the algorithmic background, the following proposals can be made:

• Replace the algorithm for identifying data parallel sections (High). Subsection 8.1.1 presented a much faster algorithm than the one currently implemented for identifying potentially data parallel processes; it was dropped due to the model's inflexibility to complex manipulations. Once the new internal model library is ready, the implementation of this algorithm has the highest priority.

• Increase the performance of the current algorithms (Medium). Although some proposals have been stated in Subsection 8.1.5, algorithm performance is currently a much weaker concern than the proof of concept. Nevertheless, performance needs to be taken into account when considering the component's scalability.

• Formally validate the algorithms (High). The algorithms developed as part of the current contribution have been presented and their effect has been shown. A formal verification is out of the scope of this Master's Thesis, but necessary for employing them in future design flows.

• Further develop model manipulation and design space exploration algorithms (High). In the present contribution, only a small part of ForSyDe's full potential has been explored. Further research is still necessary in order to provide better solutions. Continuing with an extensive study of the current scientific literature may provide valuable resources.

• Complete the implementation of the synthesizer and improve its algorithms (High). As stated in Section 10.1, the project is not completed yet. In order to be fully evaluated, it has to be finalized by providing a fully functional and correct synthesizer for both C and CUDA code.

• Optimize the generated code (Medium). Since the project is still in its early stages, proof of concept and scientific exploration are more important than platform-specific optimizations. In order to offer a full-scale industrial tool, however, this task will have to be assigned the highest priority. Even so, the platform optimizations will strongly depend on the targeted industry or market,

which are not known yet. As with the design space exploration algorithms, finding and exploiting the potential for optimization can be aided by the research community, which has proven to be very active in this particular domain in the past few years.

• Offer support for generating multiple kernels (High). All the design flow stages support and build models that would render multiple kernel invocations. Even so, the very last step, namely the synthesis of the top-level invocation function, still needs a theoretical basis in order to generate code that invokes multiple kernels. This issue is stated in Subsection 8.2.3.
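For the platform description proposal above, the following is a minimal sketch of what such a linear model could look like; the constants and their names are hypothetical, and are not those actually used in Section 7.4:

// Hypothetical linear cost model: a process's cost on a platform is
// estimated from a fixed overhead, its computation load and the data
// it transfers, using empirically calibrated constants.
struct PlatformConstants {
    double cost_per_op;      // average cost of one operation
    double cost_per_byte;    // average cost of transferring one byte
    double fixed_overhead;   // e.g. kernel launch / invocation overhead
};

double processCost(const PlatformConstants& p, double ops, double bytes) {
    return p.fixed_overhead + p.cost_per_op * ops + p.cost_per_byte * bytes;
}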

Chapter 11 Concluding Remarks

This chapter closes the current project by summarizing the main contributions to both system design and the academic community. Its achievements are listed with respect to the initial goals fixed in the introduction and to the challenges identified in Chapter 6.

This report presents a method for analyzing and manipulating formally described systems in order to understand their potential for parallel computation and to map them onto platforms offering resources for time and data parallelism. The language chosen for describing systems is ForSyDe, a language that has been shown to meet all the demands of describing parallel systems, and which is associated with a methodology with high potential for developing automatic design flows. As design framework, ForSyDe-SystemC has been used. As target platforms, GPGPUs were chosen, since they are the leading platform for massively parallel processing in industry and there is much ongoing research revolving around them. Finally, as design flow tool, an existing component called f2cc, previously developed in [Hjort Blindell, 2012], which already offered code synthesis mechanisms for nvidia GPGPUs, was chosen to be expanded.

As regards the scientific contribution, both the theoretical background and the implementation efforts offer a strong foundation in multiple directions for future research. The "parallel paradox", as stated in Chapter 3, is analyzed from different angles in an attempt to reach the heart of the problem. While offering an insight into the available paradigms in Part I, the current contribution builds its own views on dealing with parallelism in Part II, proposing methods and solutions for the main challenges that arose during the development process. A secondary track has also been kept throughout the project, with the main purpose of building specifications for a potential ForSyDe development toolkit, synthesized in Appendix B.

Considering the foundation built by the current project, it may be considered a success. Regarding the initial objectives, however, not every target could be achieved in the available time frame, some of them remaining open problems for future work:

• An extensive literature study has been conducted and relevant ideas have been extracted.


• A plan for expanding f2cc's functionality has been presented.
• New features specific to the previously unsupported ForSyDe-SystemC design framework have been provided.
• A synthesis flow for data and time parallel code targeting CUDA-enabled GPGPUs has been developed and embedded into the existing f2cc flow.
• A high-level model of the industrial-scale application Linescan, provided by XaarJet AB, was implemented and serves as demonstration model in Appendix C.
• High-quality code documentation was provided, with both in-line comments and Doxygen-generated API documentation.
• The evaluation of the improved f2cc tool was impossible in the given time frame, due to the unfinished state of the code generation mechanisms.

Concerning the component challenges identified in Chapter 6, the project finds itself in the following situation:

• all the frontend-related goals have been achieved, even the low-priority ones.
• the model-related challenges have been met, except for the low-priority ones, which fell too far outside the scope of this Master's Thesis.
• the synthesizer-related goals have been partially achieved. Optimizing the use of shared memory was ruled out due to its low priority, and the stream wrapper, although provided, needs further development in order to generate correct CUDA code.
• all language-related tasks were accomplished, although a different approach was chosen for code extraction: text-based manipulation from inside f2cc, with enhanced support from the ForSyDe-SystemC design framework.
• the general challenges were faced successfully. Although the low-priority goal of merging the two execution flows was dropped, since it was unrealistic from the start, both backward compatibility and execution flow integration were delivered in good condition.

Part IV

Appendices

Appendix A Component documentation

This appendix presents how to build and use the software component in its current development state. Since the tool is part of the ForSyDe design flow, a few guides on how to set up its context are provided, assuming the user has already acquired the necessary ForSyDe knowledge. Since the tool is still in its pre-alpha stage, a few guidelines for maintenance and future development are provided in the last section of this appendix.

A.1 Building

The current software tool has been developed and run on Ubuntu Linux 12.04, Ubuntu Linux 12.10 and Xubuntu Linux 12.04, and it should compile and run on any Linux distribution1. The component as such uses standard libraries and tools that come with the Linux distribution, such as make and g++. In case these dependencies are not fulfilled for some reason, the user must install them. On a Debian-based distribution, running the following command in the terminal will fix this.

$ sudo apt-get install build-essential

The source code comes with its own makefile. To build the component, go to the path where the source code was extracted and run:

$ make

This produces the binary file along with its library files, object files and all the other files associated with the tool. The program has its own folder structure, which is generated and placed in the bin subfolder. To clean the build, execute:

$ make clean

1 No warranties can be made for compiling and running under Windows, since it would most likely require a few minor modifications in the source code regarding file handling and data acquisition.


A.2 Preparations

Although the tool itself uses tools and libraries provided with the Linux distribution, it is part of a design flow based on existing ForSyDe tools. The system to be synthesized into parallel CUDA code is designed in the ForSyDe-SystemC design framework. For setting up and using ForSyDe-SystemC, it is strongly advisable to consult the ForSyDe wiki page [ForSyDe, 2013]. As stated in the official documentation, the design framework depends on SystemC2, and it uses features from the Boost libraries and the C++11 standard. At the time of writing this report, a custom ForSyDe library with enhanced introspection features was used. These features implement data type recognition and data size extraction; whether or not they should appear in the official releases is still to be discussed. If it is decided that another solution is to be used, the component's wiki page will hold this information and further guides. ForSyDe system development is performed according to [ForSyDe, 2013], with the following constraints:

• each process function needs to be defined in a separate file called [procname]_func.hpp, where [procname] is the name given to the process. The function also has to be named following the same convention: [procname]_func.

• each time a new complex data type is used, it has to be defined for proper introspection using the DEFINE_TYPE_NAME macro, as in the following example:

DEFINE_TYPE_NAME(std::array, "std::array");

• the only STL data type allowed for the time being is std::array, since it holds structural information that can be extracted by the ForSyDe introspection module.

• code between #pragmas needs to be pure C code, since no semantic modification is done to that part.

• the code outside the #pragmas has the main role of wrapping variables into, or unwrapping them from, signals. This area must contain no functionality, hence it will be ignored.

After the system model has been created and validated, and its structural information has been dumped, its XML files need to be annotated with cost information. This means that the designer needs to add run-time approximations for each process3, by adding a cost attribute to each process_constructor, as sketched below. This step may be automated by further introspection support from the ForSyDe library, but this kind of support would lead to a lot of non-formal workarounds, so a correct way of extracting run-time information is necessary. The platform model (see Section 7.4) needs to be provided as well. The tool comes with a template that looks like Listing A.1. This file is created in the bin/config subfolder the first time the component is built. Depending on the user's own platform model, the values contained in this file need to be adjusted.
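Purely as an illustration of this annotation step (the excerpt below is hypothetical and simplified, not an exact reproduction of the ForSyDe-SystemC dump format), an annotated process could look like:

<!-- hypothetical, simplified excerpt of an annotated structural file -->
<leaf_process name="acorr_od">
  <process_constructor name="comb" moc="sy" cost="120"/>
</leaf_process>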

2 A good tutorial on setting up SystemC with popular IDEs is found at [Bjerge, 2007]. 3 For a run-time analysis tool, the user may consult Callgrind or KCacheGrind.

Listing A.1: The platform model XML template.
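Purely as a hypothetical sketch of the kind of content such a platform model file could hold (all element and attribute names below are invented, and the actual template shipped with the tool differs):

<!-- hypothetical sketch of a platform model file, not the actual template -->
<platform>
  <device name="cpu" type="sequential" cost_per_op="1.0" cost_per_byte="0.1"/>
  <device name="gpu" type="parallel" cost_per_op="0.05" cost_per_byte="0.4"
          transfer_overhead="500.0"/>
</platform>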

To set up the synthesis of the ForSyDe model, the internal tool paths need to be filled accordingly:

• bin/inputs: contains all the input structural XML or GraphML files that serve as input models for the f2cc tool.
• bin/outputs: will hold the generated .h, .c or .cu files.
• bin/func: contains the process function files, as they were used for the validation process in ForSyDe-SystemC. These files are needed by the f2cc v0.2 execution flow.
• bin/config: as mentioned above, holds the platform model.
• bin/intermediate: will hold the intermediate XML representations of the internal f2cc model, as they are dumped between the main steps of the model modification stage.

A.3 Running the tool

After all the preparations in Section A.2 have been made, the tool can be run. It is controlled from the command prompt through command-line arguments. To synthesize a model file with the default options, execute:

$ ./f2cc top_level_input_file

The argument top_level_input_file coincides with the XML structural file corresponding to the top module of a ForSyDe-SystemC system, or with the GraphML structural file associated with a ForSyDe-Haskell system. All the command-line option arguments need to be placed before the input file, but their order is not important. For a full list of the option arguments and their explanations, run:

$ ./f2cc -h

A.4 Maintenance

Since f2cc version 0.2 is still in a pre-alpha state, it does not come with any user documentation, because this is due to change. The source code, though, has been documented using Doxygen annotations. To dynamically generate API documentation that reflects all the source code changes, the makefile is equipped with a command that invokes Doxygen to parse through the source. This requires that Doxygen is installed, using the following command line:

$ sudo apt-get install doxygen

The API documentation, in HTML format, resides in the source/docs subfolder and is generated using the following command line:

$ make docs

The source project's module structure has suffered only minor modifications since version 0.1. Therefore the guidelines from Appendix A in [Hjort Blindell, 2012] are still valid, with the additions presented in Section 9.2. Improving the component by adding process support thus implies:

• adding the process' class description into its correct MoC folder and namespace, and inheriting its base attributes from the Leaf class (a hypothetical sketch follows this list).

• providing frontend support so that the new process is recognized and parsed correctly, and the internal model is able to include it.

• if the process requires any further analysis or affects the current analysis or model modifier algorithms, the model modifier classes need to add support for it (ModelModifier for the v0.1 execution flow and ModelModifierSysC for the v0.2 execution flow).

• the synthesizer needs to be provided with methods for interpreting this new process and generating code from it.

As expected, any new algorithm that refers to model analysis or manipulation needs to be provided with a public function in one of the model modifier classes (depending on the design flow they regard). Also, any improvement or new feature regarding platform synthesis, code generation or optimizations should be reflected in one of the two synthesizer classes.
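As a hypothetical illustration of the first step in the list above (the Leaf base class is stubbed here only so that the sketch is self-contained; its real interface in f2cc differs, and the process type itself is invented):

#include <string>

// Stub of the internal model's base class, for illustration only.
class Leaf {
  public:
    explicit Leaf(const std::string& id) : id_(id) {}
    virtual ~Leaf() {}
    virtual std::string type() const = 0;   // identifies the process type
  protected:
    std::string id_;
};

// A new process type inherits its base attributes from Leaf and identifies
// itself so that the frontend and the synthesizer can recognize it.
class MyNewProcess : public Leaf {
  public:
    explicit MyNewProcess(const std::string& id) : Leaf(id) {}
    std::string type() const { return "mynewprocess"; }
};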

Appendix B Proposing a ForSyDe Design Toolbox

This appendix presents some of the author's personal reflections that arose while being immersed in the ForSyDe methodology and studying the ForSyDe-SystemC framework. While trying to solve an industrial problem using ForSyDe, a number of features were identified and analyzed from the author's point of view. A series of work reports was written while identifying these features.

The following sections do not intend to solve any given problem, and they are not specifically treated during this project. Their purpose is to briefly synthesize the content of those work reports, to ease the settling of some concepts that may aid the development of future ForSyDe research.

B.1 A simple explanatory example

In order to explain the full scale of the ForSyDe design flow, a very simplistic example is presented. It is assumed that, after the designer has modelled a system in the modelling framework, a tool analyzes it for its computation and communication behavior. This tool is further presented in Section B.2. The analysis tool should provide quantitative measurements as costs (empirically denominated as low, medium or high), together with the potential for further manipulation or design space exploration. The analyzed system looks like the one in Figure B.1. Following a set of refinements based on analysis information, design space exploration and other similar mechanisms, the system in Figure B.1 would result in the refined system in Figure B.2. A suitable set of transformations is:

• process 2, identified as potentially parallelizable (e.g. matrix multiplication), could be suggested to be mapped onto an intensive parallel1 platform. If such is the case, its computation cost will be recalculated to medium.

• process 6 could be suggested to be mapped onto a cache-based / branch-prediction-based platform. This suggestion is enforced by its low communication intensity.

1 More on parallel taxonomies can be found in Subsection 3.2.2.


[Figure B.1: Example of ForSyDe model after analysis. A six-process network annotated with computation costs (1: Med, 2: High, 3: Med, 4: Low, 5: Low, 6: High), channel communication intensities (L/M/H) and textual remarks such as "high potential for parallelization", "identified potential for code optimization", "distribution role", "high complexity" and "no identified potential for optimization".]

[Figure B.2: Example of ForSyDe model after transformations & refinements. The same network after refinement: process 2's cost is lowered to Med, the processes are grouped into four pipeline stages, and the stages are mapped onto Platform A (complex + intensive parallel) and Platform B (sequential).]

• a good practice would be to load-balance the pipeline stages on a complex parallel1 platform. As seen in Figure B.1, process 3 cannot be further optimized for parallelism, thus it is not susceptible to further design space exploration.

• the high communication between process 4 and process 5 could be considered a potential bottleneck. Consequently, the best choice is to merge these two processes, as shown in Figure B.2, and to lower the communication overhead using low-latency memories and other code optimization techniques.

B.2 Layered refinements & Refinement loop

As seen in Section B.1, the bridge from a high-level description to a low-level implementation is a long and tedious one, and definitely not straightforward. It involves several design decisions based on a proper analysis of the system.

First, the analysis mechanism, in order to be efficient and to favor proper design decisions, has to be far more complex than presented in Section B.1. Figure B.3 depicts a few aspects of model behavioral and structural analysis. These analyses are grouped into static analyses, achieved by parsing and analyzing either the code or the model, and run-time analyses, accomplished by running the specified process on virtual platforms described by relevant features:

• cyclomatic complexity is a measurement that describes a program's complexity (e.g. loops, jumps, decision points), commonly computed as M = E - N + 2P over the program's control-flow graph, with E edges, N nodes and P connected components [McCabe, 1976]. Usually, complex programs benefit from specific types of platforms (cache-based, bubble-free [Codreanu and Hobincu, 2010], etc.).

• channel structure captures information about the data, and the amount of data, encapsulated and transported by each channel associated with a ForSyDe signal.

• computation intensity measurements can be made by analyzing the program's behavior at run time. Some computation details can be identified, for example arithmetic intensity, defined as computation versus memory accesses [Hennessy and Patterson, 2011], and simple memory access patterns, which could prove useful in further code or mapping optimizations.

• performance analysis: although counting the CPI or the execution time may not be relevant, especially for high-level models, a run-time analysis would provide early estimates of potential bottlenecks and critical paths in the system.

• communication intensity describes how much a channel is used. A good analysis is obtained by running the model in a real-case scenario; by counting the channel calls and knowing the channel's structure, a relevant measurement of the communication intensity is possible.

[Figure B.3: Some aspects of ForSyDe analysis. An analysis flow split into static code analysis (cyclomatic complexity, channel structure) and run-time analysis (computation intensity, performance analysis, communication intensity), with example metrics such as instructions vs. memory accesses, relative execution time, simple memory access patterns, channel calls and CPI.]

A thorough analysis of the transformation patterns leads to adopting a layered topology for describing the ForSyDe design flow. In Figure B.4 a hierarchical model for separating design transformation layers is depicted. It enforces re-usability of the same refinements across different design flows based on their similarity, or on the similarities between targeted platforms or MoCs. As the graphic suggests, the farther one goes into refinement, the more MoC information is lost or becomes irrelevant; on the other hand, more implementation and mapping details show up as backend and platform information. Three main hierarchical levels can be identified:

• purely high-level model transformations, which affect only the model of the design and have no outcome on the final implementation.

• model transformations with mapping information, which include mainly model transformations dependent on the platform and relevant only for a specific class of applications. Further classifications can be made, for example: class of applications → class of platforms → platform.

• code optimizations, the final stages of the refinement process, when the high-level model is no longer relevant. This implies that the means of communication and computation have already been described and/or mapped, but code optimizations are still possible.

[Figure B.4: Hierarchical separation of refinements. From a high-level, MoC-based description, through purely high-level model transformations, application- and platform-specific model transformations with mapping information, and code optimizations, down to a low-level backend implementation; MoC information decreases and backend information increases along the way.]

Finally, a refinement loop is defined as a quantum means to achieve the proper transformation for a design stage. It is applied upon an intermediate representation of the design and results in another intermediate representation. Each hierarchical layer may have several refinement loops or none at all, depending on the desired goal. A simple sketch is drawn in Figure B.5; it implies that the transformation is done after a decision based upon a proper analysis that has the proper input data. The picture also implies that the author believes the refinement loop should not be completely transparent to the designer, or at least that the designer should be allowed into the decision process if he wishes so.

[Figure B.5: A sketch for a general refinement loop. An intermediate representation (structure + mapping) is analyzed using the platform model, code analysis information and design constraints; a decision (taken by a human or a machine) selects a transformation from a transformation library, which produces the next intermediate representation.]
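The refinement loop of Figure B.5 can also be summarized in code. The following self-contained C++ sketch is purely illustrative: every type and function name is invented here, and a real toolbox would operate on actual ForSyDe intermediate representations:

#include <string>

// Invented placeholder types for the sketch.
struct IntermediateRepresentation { /* structure + mapping information */ };
struct AnalysisReport {
    std::string suggestedTransformation;  // picked from a transformation library
    bool refined;                         // true when no transformation is needed
};

// Analysis: combines the platform model, code analysis information and the
// design constraints (all omitted here) into a decision basis.
AnalysisReport analyze(const IntermediateRepresentation&) {
    AnalysisReport r;
    r.suggestedTransformation = "mergeProcesses";
    r.refined = true;  // the stub immediately declares the design refined
    return r;
}

// Transformation: applies the chosen transformation, yielding a new
// intermediate representation (the stub returns it unchanged).
IntermediateRepresentation transform(const IntermediateRepresentation& ir,
                                     const std::string&) {
    return ir;
}

// The loop itself: analyze, decide (here fully automatic; Figure B.5 also
// allows a human in the decision), transform, repeat.
IntermediateRepresentation refinementLoop(IntermediateRepresentation ir) {
    for (;;) {
        AnalysisReport report = analyze(ir);
        if (report.refined)
            break;
        ir = transform(ir, report.suggestedTransformation);
    }
    return ir;
}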

B.3 Proposed architecture for the design flow tool

Currently, ForSyDe as a design methodology is still in its youth, thus there is no unified toolbox yet that embeds all its features. Driven by this fact, we propose future research into a unified software architecture that implements the ForSyDe design flow, based on examples found in the open-source community. One especially attractive project model is inspired by the Apache Maven build automation tool [Maven, 2007]. Maven fixes a Project Object Model (POM) which, as in the object-oriented paradigm, describes a hierarchical structure of projects and sub-projects. It also describes dependencies between projects, dependency inheritance and plugin usage. The most important feature is that the primary means to extend a project is through plugins: once a project has set its foundation, technically anyone can write plugins without possessing thorough knowledge of the project's internal architecture. By following Maven's model, a ForSyDe Toolkit can be built from two main entities:

• the backbone, which will play the role of hub for the toolkit. It will be built following the Model-View-Controller (MVC) [Krasner and Pope, 1988] architecture and will serve as both interface and controller for the plugins. It will implement three sets of classes: an interface set, for interacting with the plugins; a control set, for running and automating plugins and taking care of dependencies; and a data set, encapsulating intermediate data for the plugins.

• the plugins, which will be individual components of the ForSyDe Toolkit, developed as separate tools. They will be controlled by, and communicate only with, the backbone through its API. They can be transformation libraries, parsing tools, analysis tools, backend synthesizers, automation tools (which implement design flows), visualization tools, or even GUIs. A sketch of such a plugin interface is given below.

The advantages of such an architecture would be the separation and individual development of different components by different research parties, with minimum overhead. By following the POM, backward compatibility will be easy to assure, and the handling of dependencies will be taken care of through automation rather than software development. Furthermore, it will enable parallel execution of the tools, since they have their own separate environments, hence high-performance processing can be exploited.
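As a purely hypothetical illustration of this backbone/plugin split (no such API exists at the time of writing, and all names below are invented), a minimal C++ sketch could look as follows:

#include <map>
#include <string>

// A plugin is an individual tool that only talks to the backbone.
class Plugin {
  public:
    virtual ~Plugin() {}
    virtual std::string name() const = 0;
    // consumes and produces intermediate data owned by the backbone
    virtual void run(std::map<std::string, std::string>& data) = 0;
};

// The backbone acts as hub: it registers plugins (interface set), runs them
// (control set) and owns the intermediate data they exchange (data set).
class Backbone {
  public:
    void registerPlugin(Plugin* p) { plugins_[p->name()] = p; }
    void run(const std::string& name) {
        std::map<std::string, Plugin*>::iterator it = plugins_.find(name);
        if (it != plugins_.end())
            it->second->run(data_);
    }
  private:
    std::map<std::string, Plugin*> plugins_;
    std::map<std::string, std::string> data_;
};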

Appendix C Demonstrations

This appendix demonstrates the usage of the current component by showing the intermediate results output at the time of writing this report.

As mentioned in the project's introduction, the main application studied and used as an example to demonstrate f2cc's usage is called Linescan and was provided by the company XaarJet AB. It is an industrial-scale image processing application with the purpose of real-time monitoring of hardware systems. Since it involves processing massive amounts of data and safety-critical system control loops, it is a good application for demonstrating the potential of ForSyDe as a system design methodology. As part of the Master's Thesis contribution, core parts of Linescan have been modelled in ForSyDe in order to be fed into experimental design flows that would result in correct-by-design implementations on heterogeneous platforms. The core functionality chosen for testing f2cc was profiled in the provider's reports as being potentially time- and data-parallel. The material presented in the remainder of this appendix shows samples from different stages of the design flow associated with f2cc v0.2. Although the end result of the component could not be tested and validated, due to the lack of time available for debugging, the results are nevertheless shown to reflect the current state of the tool's development. The test system has been modelled in ForSyDe-SystemC as described in Appendix A. A sample of the function code, both as implemented in ForSyDe and as extracted by the f2cc frontend module, can be seen in Listing C.1 and Listing C.2. The ForSyDe model has been dumped into its XML structural representation and parsed by f2cc, and an internal model was successfully built. This is demonstrated by Figure C.1, which plots the internal model. Afterwards, the model modifier algorithms applied their methods upon this internal representation: the model flattening (Figure C.2), the grouping of potentially data-parallel comb processes (Figure C.3), the grouping of the rest of the potentially parallel processes (Figure C.4), the removal of redundant processes (Figure C.5), the load balancing of the process network (Figure C.7) and finally the remodelling of the system in order to describe pipelined execution (Figure C.8).


Figures C.3 to C.5 show the result of grouping potentially parallel processes even from different data flows, provided there is no data dependency between them. As can be seen, the data flow is merged into xcorr_ns and then split again, while still respecting the correct semantics of the system. Since it has been shown that further development is still needed to provide a solution for synthesizing kernels containing processes with different widths, this optimization is given up, leaving the two data paths separated in Figure C.6. Also, since the network contains no process that would benefit from changing the initially-mapped platform, platform optimization could not be observed on the provided model. Excerpts from the logger output are provided in Listing C.3 to demonstrate the working state of both the sequential scheduler and the pipeline stage scheduler. Listing C.4, Listing C.5 and Listing C.6 show samples of backend code generated by the current component. As can be seen, the sequential code is in an almost finished state; with minor adjustments and polishing, the generated sequential program can even be compiled and run. The CUDA code, though, although complex, still needs much further adjustment in order to be validated.

// type name definitions used by the ForSyDe dumper to represent types in the
// XML files (the template arguments in angle brackets were lost in the text
// extraction of this listing and are marked with <...>)
DEFINE_TYPE_NAME(std::array<...>, "std::array<...>");
DEFINE_TYPE_NAME(std::array<...>, "std::array<...>");

// process function definition
void acorr_od(abst_ext<...> &output,
              const abst_ext<...> &in_state) {

    std::array<...> state = unsafe_from_abst_ext(in_state);

#pragma ForSyDe begin od
    unsigned int position = (unsigned int)state[0];
    int sw = (int)state[1];
    int W = sw * 2;
    int sampWin_begin_idx = (position + (BUFFER_SIZE / 2 - sw));

    double retValue[BUFFER_SIZE] = { 0 };
    int k, d, c_idx, d_idx;
    for (int b = -sw; b < sw; b++) {
        for (int c = 0; c <= W; c++) {
            d = c + b + 1;
            c_idx = (sampWin_begin_idx + c) % (BUFFER_SIZE + 1);
            d_idx = (sampWin_begin_idx + d) % (BUFFER_SIZE + 1);
            k = b + sw;
            if ((d >= 0) && (d
            // ... (the body of this condition and the end of the pragma
            // section were lost in the text extraction)

    std::array<...> out_vector = reinterpret_cast<std::array<...>&>(retValue);
    output = abst_ext<...>(out_vector);
};

Listing C.1: Sample ForSyDe process function code written in C++ for simulation in the ForSyDe-SystemC framework

void acorr_od(double *retValue, const double *state) {

    unsigned int position = (unsigned int)state[0];
    int sw = (int)state[1];
    int W = sw * 2;
    int sampWin_begin_idx = (position + (24 / 2 - sw));

    int k, d, c_idx, d_idx;
    for (int b = -sw; b < sw; b++) {
        for (int c = 0; c <= W; c++) {
            d = c + b + 1;
            c_idx = (sampWin_begin_idx + c) % (24 + 1);
            d_idx = (sampWin_begin_idx + d) % (24 + 1);
            k = b + sw;
            if ((d >= 0) && (d
            // ... (the remainder of this listing was lost in the text extraction)

Listing C.2: Sample C process function code extracted by the CParser module from the code in Listing C.1

[Figure C.1: Input model. Zoomed detail. A plot of the internal f2cc model built from the dumped structural file (hallo.xml): unzipX processes distribute the double[16] and unsigned int[16] inputs over sixteen f2cc0_DDFA* subsystems composed of xcorr_ns, xcorr_od, acorr_od, averager, sub, div, oblock, delay and fanout processes, whose outputs are collected by a zipX process into a double[16] signal.]

[Plot of the flattened internal model (flattened.xml), presumably Figure C.2: the hierarchy of the f2cc0_DDFA* subsystems has been removed, leaving all xcorr_ns, xcorr_od, acorr_od, delay and fanout processes at the top level of the process network.]
double[27] double[24] double[27] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[27] double[24] double[27] double[24] double[27] double[24] double[27] double[24] double[27] double[24] double[27] double[24] xcorr_od fanout xcorr_od fanout fanout fanout fanout fanout fanout fanout fanout fanout xcorr_od fanout xcorr_od fanout xcorr_od fanout xcorr_od fanout xcorr_od fanout xcorr_od fanout double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24]

double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] sub sub sub sub sub sub sub sub sub sub sub sub sub sub sub sub double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24]

double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] div div div div div div div div div div div div div div div div double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24]

double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] fanout fanout fanout fanout fanout fanout fanout fanout fanout fanout fanout fanout fanout fanout fanout fanout double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24]

double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] averager averager averager averager averager averager averager averager averager averager averager averager averager averager averager averager double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24]

double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[26] double[24] double[24] double[24] double[24] double[24] double[24] sub sub sub sub sub sub sub sub sub sub sub sub sub fanout sub sub sub double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[24] double[26] double[26] double[24] double[24] double[24]

double[26] double[24] double[26] double[24] double[26] double[24] double[26] double[24] double[26] double[24] double[26] double[24] double[26] double[24] double[26] double[24] double[26] double[24] double[26] double[24] double[26] double[24] double[26] double[24] double[26] double[24] double[26] double[24] double[26] double[24] double[26] double[24] oblock_ns oblock_ns oblock_ns oblock_ns oblock_ns oblock_ns oblock_ns oblock_ns oblock_ns oblock_ns oblock_ns oblock_ns oblock_ns oblock_ns oblock_ns oblock_ns double[26] double[26] double[26] double[26] double[26] double[26] double[26] double[26] double[26] double[26] double[26] double[26] double[26] double[26] double[26] double[26]

double[26] double[26] double[26] double[26] double[26] double[26] double[26] double[26] double[26] double[26] double[26] double[26] double[26] double[26] double[26] double[26] delay delay delay delay delay delay delay delay delay delay delay delay delay delay delay delay double[26] double[26] double[26] double[26] double[26] double[26] double[26] double[26] double[26] double[26] double[26] double[26] double[26] double[26] double[26] double[26]

double[26] double[26] double[26] double[26] double[26] double[26] double[26] double[26] double[26] double[26] double[26] double[26] double[26] double[26] double[26] double[26] fanout fanout fanout fanout fanout fanout fanout fanout fanout fanout fanout fanout oblock_od fanout fanout fanout double[26] double[26] double[26] double[26] double[26] double[26] double[26] double[26] double[26] double[26] double[26] double[26] double[26] double[26] double[26] double[26] double[26] double[26] double[26] double[26] double[26] double[26] double[26] double[26] double double[26] double[26] double[26] double[26] double[26] double[26]

double[26] double[26] double[26] double[26] double[26] double[26] double[26] double[26] double[26] double[26] double[26] double[26] double[26] double[26] double[26] oblock_od oblock_od oblock_od oblock_od oblock_od oblock_od oblock_od oblock_od oblock_od oblock_od oblock_od oblock_od oblock_od oblock_od oblock_od double double double double double double double double double double double double double double double

double double double double double double double double double double double double double double double double zipX double[16]

double[16] : double[16]

Figure C.2: The model after flattening. Zoomed detail 120 C. Demonstrations

[Figure C.3 (flattened1.xml): The model after grouping equivalent comb processes. Zoomed detail.]

[Figure C.4 (flattened2.xml): The model after grouping potentially parallel leaf processes. Full scale.]

[Figure C.5 (flattenAndParallelize.xml): The model after removing redundant zipx and unzipx processes. Zoomed detail.]

[Figure C.6 (optimizePlatform.xml): The model after platform optimization for a different example. Zoomed detail. The leaf processes carry cost annotations: xcorr_ns and acorr_ns cost 10, acorr_od 260, xcorr_od 267, sub 25, div 23, averager 30, oblock_ns 30, oblock_od 5; delay and fanout processes cost 0.]

[Figure C.7 (loadBalanced.xml): The model after applying the load balancing algorithm. Zoomed detail. The cost annotations are unchanged from Figure C.6.]
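The cost annotations above are what drive the split into the pipeline stages shown next, in Figure C.8. As a rough illustration of the idea only (a minimal sketch, not the load balancing algorithm of this thesis), the following C++ snippet greedily cuts one branch of the network into stages whose accumulated cost stays below a budget; the process costs are copied from the figure annotations, while the budget value is invented:

    #include <cstdio>
    #include <utility>
    #include <vector>

    int main() {
        // (name, cost) pairs along one branch of the network, taken from the
        // figure annotations; delay/fanout processes (cost 0) are omitted.
        std::vector<std::pair<const char*, int>> chain = {
            {"xcorr_ns", 10}, {"xcorr_od", 267}, {"sub", 25}, {"div", 23},
            {"averager", 30}, {"sub", 25}, {"oblock_ns", 30}, {"oblock_od", 5}};
        const int budget = 270;  // hypothetical per-stage cost budget
        int stage = 1;
        int acc = 0;
        for (const auto& p : chain) {
            // open a new stage when the next process would exceed the budget
            if (acc > 0 && acc + p.second > budget) { ++stage; acc = 0; }
            acc += p.second;
            std::printf("stage_%d: %s (cost %d)\n", stage, p.first, p.second);
        }
        return 0;
    }

With these numbers the expensive correlators end up isolated in stages of their own, which mirrors the stage_2 and stage_4 schedules reported by the tool in Listing C.3.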

[Figure C.8 (pipelined.xml): The model after creating pipeline directives. Zoomed detail. The network is partitioned into the parallel composites pcomp_1[16], pcomp_2[16], pcomp_10[16] and pcomp_12[16].]

1  (...)
2
3  2013-06-10 15:57:10 [INFO] - NEW MODEL INFO:
4      Number of leafs: 20
5      Number of composites: 5
6      Number of inputs: 2
7      Number of outputs: 1
8  2013-06-10 15:57:10 [INFO] - Checking that the internal processnetwork is valid for
9      synthesis...
10 2013-06-10 15:57:10 [INFO] - All checks passed
11 2013-06-10 15:57:10 [INFO] - Generating sequential schedules for all composite
12     processes...
13 2013-06-10 15:57:10 [INFO] - Generating process schedule for ParDDFA...
14 2013-06-10 15:57:10 [INFO] - Process schedule for ParDDFA:
15     pcomp_1, pcomp_2, pcomp_10, pcomp_12
16 2013-06-10 15:57:10 [INFO] - Generating process schedule for stage_1...
17 2013-06-10 15:57:10 [INFO] - Process schedule for stage_1:
18     f2cc0_DDFA0_ACorr_buffer, f2cc0_DDFA0_ACorr_bufFan,
19     f2cc0_DDFA0_XCorr_buffer, f2cc0_DDFA0_XCorr_bufFan,
20     f2cc0_DDFA0_sampFan, f2cc0_DDFA0_swFan,
21     f2cc0_DDFA0_ACorr_acorr_ns, f2cc0_DDFA0_XCorr_xcorr_ns
22 2013-06-10 15:57:10 [INFO] - Generating process schedule for stage_2...
23 2013-06-10 15:57:10 [INFO] - Process schedule for stage_2:
24     f2cc0_DDFA0_XCorr_xcorr_od
25 2013-06-10 15:57:10 [INFO] - Generating process schedule for stage_3...
26 2013-06-10 15:57:10 [INFO] - Process schedule for stage_3:
27     f2cc0_DDFA0_OutB_buffer, f2cc0_DDFA0_OutB_bufFan,
28     f2cc0_DDFA0_SubDiv_inFan, f2cc0_DDFA0_SubDiv_sub,
29     f2cc0_DDFA0_SubDiv_div, f2cc0_DDFA0_AvgSub_inFan,
30     f2cc0_DDFA0_AvgSub_averager, f2cc0_DDFA0_AvgSub_sub,
31     f2cc0_DDFA0_OutB_oblock_ns, f2cc0_DDFA0_OutB_oblock_od
32 2013-06-10 15:57:10 [INFO] - Generating process schedule for stage_4...
33 2013-06-10 15:57:10 [INFO] - Process schedule for stage_4:
34     f2cc0_DDFA0_ACorr_acorr_od
35 2013-06-10 15:57:10 [INFO] - Generating wrapper functions for composite processes...
36 2013-06-10 15:57:10 [INFO] - Creating signal variables for "pcomp_2"...
37 2013-06-10 15:57:10 [INFO] - Created 2 signal(s).
38 2013-06-10 15:57:10 [INFO] - Creating delay variables for "pcomp_2"...
39 2013-06-10 15:57:10 [INFO] - Created 0 delay variable(s)
40 2013-06-10 15:57:10 [INFO] - Creating signal variables for "pcomp_12"...
41 2013-06-10 15:57:10 [INFO] - Created 15 signal(s).
42 2013-06-10 15:57:10 [INFO] - Creating delay variables for "pcomp_12"...
43 2013-06-10 15:57:10 [INFO] - Created 1 delay variable(s)
44
45 (...)
46
47 2013-06-10 15:57:10 [INFO] - Generating streamed CUDA kernel execution functions for
48     adjacent parallel composite processes...
49
50 (...)
51
52 2013-06-10 15:57:10 [INFO] - Creating a CUDA kernel wrapper from the contained section
53     "pcomp_1--pcomp_12"...
54 2013-06-10 15:57:10 [INFO] - USING SHARED MEMORY FOR INPUT DATA: NO
55 2013-06-10 15:57:10 [INFO] - Creating signal variables for "f2cc0"...
56 2013-06-10 15:57:10 [INFO] - Created 7 signal(s).
57
58 (...)
59
60 2013-06-10 15:57:10 [INFO] - Creating the top level execution function...
61 2013-06-10 15:57:10 [INFO] - Optimizing kernel for 1 burst(s) and 6 stage(s)...
62
63 (...)

Listing C.3: Excerpt from the f2cc output logger. The highlighted lines 14, 17, 23, 26 and 33 show the sequential schedule, while line 61 shows the calculated parameters necessary for pipeline stage mapping.
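The schedules above respect the data dependencies of the process network: a process is listed only after everything that feeds it. A minimal sketch of how such a sequential schedule can be obtained, assuming a plain post-order depth-first traversal (not necessarily the exact strategy implemented in f2cc, and with a hypothetical Process structure):

    #include <set>
    #include <string>
    #include <vector>

    // Hypothetical network node: a process and the processes feeding it.
    struct Process {
        std::string name;
        std::vector<Process*> predecessors;
    };

    // Post-order depth-first visit: a process is appended to the schedule
    // only after everything it depends on has been scheduled.
    static void visit(Process* p, std::set<Process*>& seen,
                      std::vector<std::string>& schedule) {
        if (!seen.insert(p).second)
            return;  // already visited (also stops feedback loops via delays)
        for (Process* pred : p->predecessors)
            visit(pred, seen, schedule);
        schedule.push_back(p->name);
    }

    // Schedule a composite by walking back from its output processes.
    std::vector<std::string> makeSchedule(const std::vector<Process*>& outputs) {
        std::set<Process*> seen;
        std::vector<std::string> schedule;
        for (Process* out : outputs)
            visit(out, seen, schedule);
        return schedule;  // e.g. pcomp_1, pcomp_2, pcomp_10, pcomp_12
    }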

void stage_1_exec_wrapper(double* out1, double* out2, double in1, unsigned int in2) {
    int i; // Can safely be removed if the compiler warns
           // about it being unused

    // Declare signal variables
    double* fanout1 = new double[27];
    double fanout2;
    unsigned int fanout3;
    double* delay4 = new double[27];
    double* comb5 = new double[27];
    double* delay6 = new double[27];
    double* comb7 = new double[27];
    double* fanout8 = new double[27];
    double fanout9;
    unsigned int fanout10;

    // Declare delay variables
    static double v_delay_element2[27] = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
                                          0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0};
    static double v_delay_element1[27] = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
                                          0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0};

    // Execute leafs
    for (i = 0; i < 27; i++) {
        delay4[i] = v_delay_element1[i];
    }
    for (i = 0; i < 27; i++) {
        delay6[i] = v_delay_element2[i];
    }
    for (i = 0; i < 27; i++) {
        out2[i] = delay4[i];
    }
    for (i = 0; i < 27; i++) {
        fanout1[i] = delay4[i];
    }
    for (i = 0; i < 27; i++) {
        out1[i] = delay6[i];
    }
    for (i = 0; i < 27; i++) {
        fanout8[i] = delay6[i];
    }
    fanout2 = in1;
    fanout9 = in1;
    fanout3 = in2;
    fanout10 = in2;
    acorr_ns(comb5, fanout1, fanout2, fanout3);
    xcorr_ns(comb7, fanout8, fanout9, fanout10);
    for (i = 0; i < 27; i++) {
        v_delay_element1[i] = comb5[i];
    }
    for (i = 0; i < 27; i++) {
        v_delay_element2[i] = comb7[i];
    }

    // Clean up memory
    delete[] fanout1;
    delete[] delay4;
    delete[] comb5;
    delete[] delay6;
    delete[] comb7;
    delete[] fanout8;
}

Listing C.4: Sample from a generated sequential function. This is the wrapper function for process pcomp_1 from Figure C.8.
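Since the delay variables are declared static, consecutive invocations of the wrapper carry the 27-sample delay state from one firing to the next. A minimal host-side sketch of driving it, with invented input values (this driver is not part of the generated code):

    #include <cstdio>

    // Signature as generated in Listing C.4; out1 receives the xcorr window
    // and out2 the acorr window, each 27 elements wide.
    void stage_1_exec_wrapper(double* out1, double* out2, double in1, unsigned int in2);

    int main() {
        double xcorr_win[27];
        double acorr_win[27];
        // invented sample and switch values, one pair per firing
        const double samples[4] = {0.5, 0.25, 0.125, 0.0625};
        const unsigned int switches[4] = {0, 1, 0, 1};
        for (int n = 0; n < 4; ++n) {
            // the static v_delay_element arrays inside the wrapper keep the
            // history alive between these calls
            stage_1_exec_wrapper(xcorr_win, acorr_win, samples[n], switches[n]);
        }
        std::printf("head of window after 4 firings: %f\n", xcorr_win[0]);
        return 0;
    }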

void cudaParDDFA(double* out1, double* in1, unsigned int* in2, unsigned long long N) {
    int i; // Can safely be removed if the compiler warns about it being unused
    cudaStream_t stream[6];
    struct cudaDeviceProp prop;
    int max_threads_per_block;
    int shared_memory_per_sm;
    int num_multicores;
    int full_utilization_thread_count;
    int is_timeout_activated;

    // Get GPGPU device information
    // method presented and explained in [Hjort Blindell, 2012]
    // (...)

    // Declare signal variables
    double* out1_device[6];
    // @todo Better error handling
    if (cudaMalloc((void**) &out1_device[1], 16 * sizeof(double)) != cudaSuccess) {
        printf("ERROR: Failed to allocate GPU memory\n");
        exit(-1);
    }
    // (...)
    unsigned int* in2_device[6];
    // (...)

    unsigned long data_index[6] = {0, 1, 2, 3, 4, 5};
    unsigned long number_of_bursts = N * 1;
    char finished = 0;

    // Start executing the kernels in a revolving barrel pattern as suggested in Table 8.1
    while (!finished) { // while there is still data needed to be processed
        for (i = 0; i < 6; i++) {
            if ((data_index[i] < number_of_bursts)
                    && (cudaStreamQuery(stream[i]) == cudaSuccess)) {
                // H2D transfer
                if (cudaMemcpyAsync((void*) in1_device[i], (void*) &in1[data_index[i]],
                        16 * sizeof(double), cudaMemcpyHostToDevice,
                        stream[i]) != cudaSuccess) {
                    printf("ERROR: Failed to copy data to GPU\n");
                    exit(-1);
                }
                if (cudaMemcpyAsync((void*) in2_device[i], (void*) &in2[data_index[i]],
                        16 * sizeof(unsigned int), cudaMemcpyHostToDevice,
                        stream[i]) != cudaSuccess) {
                    printf("ERROR: Failed to copy data to GPU\n");
                    exit(-1);
                }
                // Execute kernel
                if (is_timeout_activated) {
                    // Prevent the kernel from timing out by splitting up the work
                    // into smaller pieces through multiple kernel invocations
                    int num_threads_left_to_execute = 16;
                    int index_offset = 0;
                    while (num_threads_left_to_execute > 0) {
                        // method presented and explained in [Hjort Blindell, 2012]
                        // (...)
                    }
                }
                else {
                    struct KernelConfig config = calculateBestKernelConfig(16,
                            max_threads_per_block, 1 * sizeof(double),
                            shared_memory_per_sm);
                    pcomp_1_kernel_stage_kernel<<<config.grid, config.threadBlock,
                            0, stream[i]>>>(out1_device[i], in1_device[i],
                            in2_device[i], index_offset_device[i], 0);
                }
                // D2H transfer
                if (cudaMemcpyAsync((void*) &out1[data_index[i]], (void*) out1_device[i],
                        16 * sizeof(double), cudaMemcpyDeviceToHost,
                        stream[i]) != cudaSuccess) {
                    printf("ERROR: Failed to copy data from GPU\n");
                    exit(-1);
                }
                data_index[i] += 6;
            }
        }
        finished = (data_index[0] >= number_of_bursts) && (data_index[1] >= number_of_bursts)
                && (data_index[2] >= number_of_bursts) && (data_index[3] >= number_of_bursts)
                && (data_index[4] >= number_of_bursts) && (data_index[5] >= number_of_bursts);
    }
    // Free allocated memory
    if (cudaFree((void*) out1_device[1]) != cudaSuccess) {
        printf("ERROR: Failed to free GPU memory\n");
        exit(-1);
    }
    // (...)
}

Listing C.5: Sample from a generated CUDA function. This is the top level execution code for the process network from Figure C.8. It reflects the current stage of the tool's ability to generate CUDA code.
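The revolving barrel pattern in the listing hands stream i the bursts i, i + 6, i + 12, and so on, so that transfers and kernel executions belonging to different streams can overlap. The index bookkeeping alone can be replayed in a small host-only sketch, with an invented burst count:

    #include <cstdio>

    int main() {
        const int kStreams = 6;                    // one slot per pipeline stage
        const unsigned long number_of_bursts = 20; // invented burst count
        unsigned long data_index[kStreams] = {0, 1, 2, 3, 4, 5};
        bool finished = false;
        while (!finished) {
            finished = true;
            for (int i = 0; i < kStreams; ++i) {
                if (data_index[i] < number_of_bursts) {
                    std::printf("stream %d processes burst %lu\n", i, data_index[i]);
                    data_index[i] += kStreams;     // this stream's next burst
                    finished = false;
                }
            }
        }
        return 0;
    }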

__device__
void pcomp_1_kernel_stage_exec(double* out1, double in1, unsigned int in2) {
    int i; // Can safely be removed if the compiler warns
           // about it being unused

    // Declare signal variables
    double* par_composite1 = new double[27];
    double* par_composite2 = new double[24];
    double* par_composite3 = new double[24];
    double* par_composite4 = new double[27];

    // Execute leafs
    cudaDeviceSync();
    stage_1_exec_wrapper(par_composite1, par_composite4, in1, in2);
    cudaDeviceSync();
    stage_4_exec_wrapper(par_composite2, par_composite4);
    cudaDeviceSync();
    stage_2_exec_wrapper(par_composite3, par_composite1);
    cudaDeviceSync();
    stage_3_exec_wrapper(out1, par_composite2, par_composite3);
    cudaDeviceSync();

    // Clean up memory
    delete[] par_composite1;
    delete[] par_composite2;
    delete[] par_composite3;
    delete[] par_composite4;
}

__global__
void pcomp_1_kernel_wrapper(double* out1, double* in1, unsigned int* in2, int index_offset) {
    unsigned int global_index = (blockIdx.x * blockDim.x + threadIdx.x) + index_offset;
    if (global_index < 16) {
        int in1_index = global_index * 0;
        int in2_index = global_index * 0;
        int out1_index = global_index * 0;
        pcomp_1_kernel_stage_exec(&out1[out1_index], in1[in1_index], in2[in2_index]);
    }
}

Listing C.6: Sample from a generated CUDA function. This is a kernel function generated for executing the data parallel section in the process network from Figure C.8. It reflects the current stage of the tool's ability to generate CUDA kernel functions.
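The guard on global_index in the kernel wrapper is necessary because the launch configuration rounds the thread count up to whole blocks, so the last block may contain surplus threads that must not touch memory. A sketch of the ceiling division involved (the KernelConfig fields are assumptions and need not match those of f2cc's calculateBestKernelConfig):

    // Hypothetical stand-in for the role of calculateBestKernelConfig; the
    // real helper also weighs shared memory usage, which is ignored here.
    struct KernelConfig {
        int grid;         // number of thread blocks
        int threadBlock;  // threads in each block
    };

    KernelConfig roundUpConfig(int work_items, int max_threads_per_block) {
        KernelConfig c;
        c.threadBlock =
            work_items < max_threads_per_block ? work_items : max_threads_per_block;
        // ceiling division: the last block may hold threads beyond work_items,
        // which is exactly what the global_index guard in the kernel filters out
        c.grid = (work_items + c.threadBlock - 1) / c.threadBlock;
        return c;
    }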

Bibliography

[Alexander, 1977] Alexander, C. (1977). A Pattern Language: Towns, Buildings, Construction, volume 2. Oxford University Press, USA.

[Asanovic et al., 2006] Asanovic, K., Bodik, R., Catanzaro, B. C., Gebis, J. J., Husbands, P., Keutzer, K., Patterson, D. A., Plishker, W. L., Shalf, J., and Williams, S. W. (2006). The landscape of parallel computing research: A view from Berkeley. Technical Report UCB/EECS-2006-183, EECS Department, University of California, Berkeley.

[Asanovic et al., 2009] Asanovic, K., Bodik, R., Demmel, J., Keaveny, T., Keutzer, K., Kubiatowicz, J., Morgan, N., Patterson, D., Sen, K., and Wawrzynek, J. (2009). A view of the parallel computing landscape. Communications of the ACM, 52(10):56–67.

[ASIC World, 2013] ASIC World (last update: 17/03/2013). SystemC tutorial. Available from: http://www.asic-world.com/systemc/tutorial.html.

[Attarzadeh Niaki et al., 2012] Attarzadeh Niaki, S. H., Jakobsen, M. K., Sulonen, T., and Sander, I. (2012). Formal heterogeneous system modeling with SystemC. In 2012 Forum on Specification and Design Languages (FDL), pages 160–167.

[Baskaran et al., 2008] Baskaran, M. M., Bondhugula, U., Krishnamoorthy, S., Ramanujam, J., Rountev, A., and Sadayappan, P. (2008). Automatic data movement and computation mapping for multi-level parallel architectures with explicitly managed memories. In Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP '08, pages 1–10, New York, NY, USA. ACM.

[Beggs and Tucker, 2006] Beggs, E. J. and Tucker, J. V. (2006). Embedding infinitely parallel computation in Newtonian kinematics. Applied Mathematics and Computation, 178(1):25–43.

[Bell and Hoberock, 2011] Bell, N. and Hoberock, J. (2011). Thrust: A productivity-oriented library for CUDA. GPU Computing Gems: Jade Edition, pages 359–372.

[Berry and Gonthier, 1992] Berry, G. and Gonthier, G. (1992). The Esterel synchronous programming language: Design, semantics, implementation. Science of Computer Programming, 19(2):87–152.


[Bjerge, 2007] Bjerge, K. (2007). Guide for getting started with SystemC development. Technical report, Danish Technological Institute. Available from: http://www.dti.dk/_root/media/27325_SystemC_Getting_Started_artikel.pdf.

[Blank, 1990] Blank, T. (1990). The MasPar MP-1 architecture. In Compcon Spring '90. Intellectual Leverage. Digest of Papers. 35th IEEE Computer Society International Conference, pages 20–24.

[Buschmann et al., 2007] Buschmann, F., Henney, K., and Schmidt, D. C. (2007). Pattern Oriented Software Architecture: On Patterns and Pattern Languages, volume 6. John Wiley & Sons.

[Chen et al., 1992] Chen, G.-H., Wang, B.-F., and Lu, C.-J. (1992). On the parallel computation of the algebraic path problem. Parallel and Distributed Systems, IEEE Transactions on, 3(2):251–256.

[Codreanu and Hobincu, 2010] Codreanu, V. and Hobincu, R. (2010). Performance gain from data and control dependency elimination in embedded processors. In Electronics and Telecommunications (ISETC), 2010 9th International Symposium on, pages 47–50.

[Colella, 2004] Colella, P. (2004). Defining software requirements for scientific computing. Presentation.

[Ştefan, 2010] Ştefan, G. M. (2010). Integral parallel architecture in system-on-chip designs. Faculty of Electronics, Tc. and IT, Politehnica University of Bucharest, România. Available from: http://arh.pub.ro/gstefan/2010ucas.pdf.

[Dunigan, 1992] Dunigan, T. H. (1992). Kendall Square multiprocessor: Early experiences and performance. Technical report, Mathematical Sciences Section, Oak Ridge National Laboratory.

[Enmyren and Kessler, 2010] Enmyren, J. and Kessler, C. W. (2010). SkePU: A multi-backend skeleton programming library for multi-GPU systems. In Proceedings of the Fourth International Workshop on High-Level Parallel Programming and Applications, pages 5–14.

[Feautrier, 1996] Feautrier, P. (1996). Automatic parallelization in the polytope model. In Perrin, G.-R. and Darte, A., editors, The Data Parallel Programming Model, number 1132 in Lecture Notes in Computer Science, pages 79–103. Springer Berlin Heidelberg.

[Flynn, 1972] Flynn, M. J. (1972). Some computer organizations and their effectiveness. Computers, IEEE Transactions on, 100(9):948–960.

[ForSyDe, 2013] ForSyDe (2013). The ForSyDe Homepage. Page Version ID: 24. Available from: https://forsyde.ict.kth.se/trac.

[Gamma et al., 1993] Gamma, E., Helm, R., Johnson, R., and Vlissides, J. (1993). Design patterns: Abstraction and reuse of object-oriented design. ECOOP '93 – Object-Oriented Programming, pages 406–431.

[Georgia Tech, 2008] Georgia Tech (last update: 10/25/2008). STI center of competence for the Cell Broadband Engine processor. Available from: http://sti.cc.gatech.edu/.

[Habanero, 2013] Habanero (last update: 11/03/2013). Habanero multicore software research project - Rice University campus wiki. Available from: https://wiki.rice.edu/confluence/display/HABANERO/Habanero+Multicore+Software+Research+Project.

[Halbwachs et al., 1991] Halbwachs, N., Caspi, P., Raymond, P., and Pilaud, D. (1991). The synchronous data flow programming language LUSTRE. Proceedings of the IEEE, 79(9):1305–1320.
[Hayes et al., 1986] Hayes, J. P., Mudge, T. N., Stout, Q. F., Colley, S., and Palmer, J. (1986). Architecture of a hypercube supercomputer. In Proceedings of the 1986 International Conference on Parallel Processing, pages 653–660.
[Hennessy and Patterson, 2011] Hennessy, J. L. and Patterson, D. A. (2011). Computer Architecture, Fifth Edition: A Quantitative Approach. Morgan Kaufmann, 5th edition.
[Hjort Blindell, 2012] Hjort Blindell, G. (2012). Synthesizing software from a ForSyDe model targeting GPGPUs. Master’s thesis, Dept. ICT, Royal Institute of Technology (KTH), Stockholm, Sweden.
[Hochstein et al., 2005] Hochstein, L., Carver, J., Shull, F., Asgari, S., and Basili, V. (2005). Parallel programmer productivity: A case study of novice parallel programmers. In Supercomputing, 2005. Proceedings of the ACM/IEEE SC 2005 Conference, pages 35–35.
[Illinois, 2013] Illinois, U. (accessed: 17/03/2013). Universal Parallel Computing Research Center at the University of Illinois. Available from: http://www.upcrc.illinois.edu/.
[ITRS, 2005] ITRS (2005). International Technology Roadmap for Semiconductors, 2005 edition, Executive Summary. Technical report. Available from: http://www.itrs.net/Links/2005ITRS/ExecSum2005.pdf.
[ITRS, 2007] ITRS (2007). International Technology Roadmap for Semiconductors, 2007 edition, Executive Summary. Technical report. Available from: http://www.itrs.net/Links/2007ITRS/ExecSum2007.pdf.
[ITRS, 2011] ITRS (2011). International Technology Roadmap for Semiconductors, 2011 edition, System Drivers. Technical report. Available from: http://www.itrs.net/Links/2011ITRS/2011Chapters/2011SysDrivers.pdf.
[Jakobsen et al., 2011] Jakobsen, M. K., Madsen, J., Niaki, S. H. A., Sander, I., and Hansen, J. (2011). System level modelling with open source tools. In Embedded World Conference 2011, Nuremberg, Germany.
[Joldes et al., 2010] Joldes, G. R., Wittek, A., and Miller, K. (2010). Real-time nonlinear finite element computations on GPU – application to neurosurgical simulation. Computer Methods in Applied Mechanics and Engineering, 199(49):3305–3314.
[Jones et al., 2009] Jones, C. G., Liu, R., Meyerovich, L., Asanovic, K., and Bodik, R. (2009). Parallelizing the web browser. In Proceedings of the First USENIX Workshop on Hot Topics in Parallelism.
[Keutzer et al., 2000] Keutzer, K., Newton, A. R., Rabaey, J. M., and Sangiovanni-Vincentelli, A. (2000). System-level design: Orthogonalization of concerns and platform-based design. Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on, 19(12):1523–1543.
[Kirk and Hwu, 2010] Kirk, D. B. and Hwu, W.-m. W. (2010). Programming Massively Parallel Processors: A Hands-on Approach. Morgan Kaufmann, 1st edition.
[Kleene, 1936] Kleene, S. C. (1936). General recursive functions of natural numbers. Mathematische Annalen, 112(1):727–742.

[Knuth, 1997] Knuth, D. E. (1997). The Art of Computer Programming: Fundamental Algorithms, volume 1, chapter 2.3. Addison-Wesley Professional.
[Krasner and Pope, 1988] Krasner, G. E. and Pope, S. T. (1988). A description of the model-view-controller user interface paradigm in the Smalltalk-80 system. Journal of Object-Oriented Programming, 1(3):26–49.
[Lee and Sangiovanni-Vincentelli, 1997] Lee, E. A. and Sangiovanni-Vincentelli, A. (1997). Comparing models of computation. In Proceedings of the 1996 IEEE/ACM International Conference on Computer-Aided Design, pages 234–241.
[Lindholm et al., 2008] Lindholm, E., Nickolls, J., Oberman, S., and Montrym, J. (2008). nvidia Tesla: A unified graphics and computing architecture. Micro, IEEE, 28(2):39–55.
[Maliţa and Ştefan, 2008] Maliţa, M. and Ştefan, G. (2008). On the many-processor paradigm. In Proceedings of the 2008 World Congress in Computer Science, Computer Engineering and Applied Computing, vol. PDPTA, volume 8.
[Maliţa and Ştefan, 2009] Maliţa, M. and Ştefan, G. (2009). Integral parallel architecture & Berkeley’s motifs. In Application-specific Systems, Architectures and Processors, 2009. ASAP 2009. 20th IEEE International Conference on, pages 191–194.
[Maliţa et al., 2006] Maliţa, M., Ştefan, G., and Stoian, M. (2006). Complex vs. intensive in parallel computation. In Computing in the Global Information Technology, 2006. ICCGI’06. International Multi-Conference on, pages 26–26.
[Malik et al., 2012] Malik, M., Li, T., Sharif, U., Shahid, R., El-Ghazawi, T., and Newby, G. (2012). Productivity of GPUs under different programming paradigms. Concurrency and Computation: Practice and Experience.
[Maraninchi, 1991] Maraninchi, F. (1991). The Argos language: Graphical representation of automata and description of reactive systems. In IEEE Workshop on Visual Languages, volume 3.
[Maven, 2007] Apache Software Foundation (2007). Apache Maven project. Available from: http://maven.apache.org/.
[McCabe, 1976] McCabe, T. (1976). A complexity measure. IEEE Transactions on Software Engineering, SE-2(4):308–320.
[Nickolls and Dally, 2010] Nickolls, J. and Dally, W. J. (2010). The GPU computing era. Micro, IEEE, 30(2):56–69.
[nvidia, 2013a] nvidia (2013a). CUDA C/C++ streams and concurrency. Available from: http://developer.download.nvidia.com/CUDA/training/StreamsAndConcurrencyWebinar.pdf.
[nvidia, 2013b] nvidia (accessed: 24/03/2013). GeForce 256. Available from: http://www.nvidia.com/page/geforce256.html.
[nvidia, 2013c] nvidia (accessed: 24/03/2013). GeForce GTX TITAN graphics card design details | nvidia. Available from: http://www.nvidia.com/titan-graphics-card/design.
[Öberg and Ellervee, 1998] Öberg, J. and Ellervee, P. (1998). Revolver: A high-performance MIMD architecture for collision free computing. In Euromicro Conference, 1998. Proceedings. 24th, volume 1, pages 301–308.

[Owens et al., 2008] Owens, J. D., Houston, M., Luebke, D., Green, S., Stone, J. E., and Phillips, J. C. (2008). GPU computing. Proceedings of the IEEE, 96(5):879–899.
[Papadimitriou, 2003] Papadimitriou, C. H. (2003). Computational complexity. In Encyclopedia of Computer Science, 4th edition, number ISBN:0-470-86412-5, pages 260–265. John Wiley and Sons Ltd.
[Par Lab, 2013] Par Lab (accessed: 17/03/2013). The Parallel Computing Laboratory. Available from: http://parlab.eecs.berkeley.edu/.
[PPL, 2013] PPL (accessed: 17/03/2013). Stanford Pervasive Parallelism Laboratory. Available from: http://ppl.stanford.edu/main/.
[Sander, 2003] Sander, I. (2003). System modeling and design refinement in ForSyDe. PhD thesis, Dept. IMIT, KTH, Stockholm, Sweden.
[Sander and Jantsch, 1999] Sander, I. and Jantsch, A. (1999). System synthesis based on a formal computational model and skeletons. In Proceedings of the IEEE Computer Society Workshop on VLSI ’99, pages 32–39.
[Sander and Jantsch, 2004] Sander, I. and Jantsch, A. (2004). System modeling and transformational design refinement in ForSyDe [Formal System Design]. Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on, 23(1):17–32.
[Sanders and Kandrot, 2010] Sanders, J. and Kandrot, E. (2010). CUDA by Example: An Introduction to General-Purpose GPU Programming. Number ISBN:0132180138. Addison-Wesley Professional.
[Steuwer et al., 2013] Steuwer, M., Gorlatch, S., Buß, M., and Breuer, S. (2013). Using the SkelCL library for high-level GPU programming of 2D applications. In Euro-Par 2012: Parallel Processing Workshops, pages 370–380.
[Svensson et al., 2010] Svensson, J., Claessen, K., and Sheeran, M. (2010). GPGPU kernel implementation and refinement using Obsidian. Procedia Computer Science, 1(1):2065–2074.
[Udupa et al., 2009] Udupa, A., Govindarajan, R., and Thazhuthaveetil, M. J. (2009). Software pipelined execution of stream programs on GPUs. In International Symposium on Code Generation and Optimization, 2009. CGO 2009, pages 200–209.
[Xavier and Iyengar, 1998] Xavier, C. and Iyengar, S. S. (1998). Introduction to Parallel Algorithms, volume 1 of Wiley Series on Parallel and Distributed Computing. Wiley-Interscience.
[Zhu et al., 2008] Zhu, J., Sander, I., and Jantsch, A. (2008). Energy efficient streaming applications with guaranteed throughput on MPSoCs. In Proceedings of the 8th ACM International Conference on Embedded Software, pages 119–128.