Synthesis and Exploration of Loop Accelerators for Systems-on-a-Chip

Submitted to the Faculty of Engineering of the Universität Erlangen-Nürnberg for the attainment of the degree

DOKTOR-INGENIEUR

submitted by

Hritam Dutta

Erlangen, 2011. Approved as a dissertation by the Faculty of Engineering of the Universität Erlangen-Nürnberg.

Date of submission: January 10, 2011
Date of doctoral defense: March 3, 2011
Dean: Prof. Dr.-Ing. Reinhard German
Reviewers: Prof. Dr.-Ing. Jürgen Teich and Prof. Christian Lengauer, Ph.D.

Acknowledgements

I owe my deepest gratitude to my adviser, Professor Jürgen Teich, for always being enthusiastic to propose and discuss new ideas. He also provided me with a great amount of freedom, and with valuable scientific and editorial feedback. I would also like to thank Professor Christian Lengauer for agreeing to serve on my dissertation committee and for his suggestions to improve the dissertation. My sincere gratitude also goes to Professor Bernard Pottier and Professor Ulrich Rüde for introducing me to new ideas and fields of research. My special thanks go to all colleagues, especially Frank Hannig, Dmitrij Kissler, Joachim Keinert, Richard Membarth, Moritz Schmid, Jens Gladigau, and Dirk Koch, for brainstorming sessions and intensive co-operation, which led to key scientific progress and the enrichment of my knowledge. I appreciate Frank’s patience in reading the whole dissertation and making valuable suggestions. I was fortunate to have wonderful office mates in Mateusz Majer and Tobias Ziermann, and thank them both for all the technical and non-technical discussions. My sincere acknowledgements also go to external colleagues Sebastian Siegel, Rainer Schaffer (TU Dresden), Wolfgang Haid (ETH Zürich), and Samar Yazdani (UBO, Brest) for co-operation on important research problems. I also appreciate the efforts of undergraduate students, especially Teddy Zhai and Holger Ruckdeschel, in the software development of the PARO methodology. I am also deeply indebted to Sonja Heidner and Ina Derr for helping me sort out several administrative issues. I greatly value the friendship of all the people who made my stay in Erlangen a real pleasure. My family has been a constant source of love, concern, support, and strength all these years. I would like to express my heart-felt gratitude to my family and dedicate this dissertation to them.

Hritam Dutta

Contents

1. Introduction
   1.1. Next Generation Applications
   1.2. Accelerator based SoC Architectures
   1.3. Programming Models for SoC
   1.4. Problem Definition
   1.5. Contributions and Bibliographic notes
   1.6. A Guided Tour through the Thesis

2. Fundamentals and Related Work
   2.1. Algorithm Specification in the Polytope Model
      2.1.1. Fundamentals: Algorithm Specification
      2.1.2. Specification of Communicating Loop Nests
      2.1.3. Related Work
   2.2. A Generic Accelerator Scheme
      2.2.1. Characterization and Classification of Loop Accelerators
      2.2.2. Accelerator Subsystem for Streaming Application
   2.3. High-level Synthesis of Hardware Accelerators
      2.3.1. Front End: Loop Transformations
         2.3.1.1. Program Transformations
         2.3.1.2. Tiling
      2.3.2. Front End: Scheduling
         2.3.2.1. Global Scheduling and Binding
         2.3.2.2. Local Scheduling and Resource Binding
      2.3.3. Back End: Synthesis
         2.3.3.1. Synthesis of Processor Element
         2.3.3.2. Synthesis of Array Interconnection Structure
         2.3.3.3. Synthesis of Control Hardware
   2.4. Accelerator Design Space Exploration
   2.5. High-level Synthesis Tools
   2.6. Conclusion

3. Accelerator Generation: Loop Transformations and Back End
   3.1. Loop Optimizations for Accelerator Tuning
      3.1.1. Loop Transformations
         3.1.1.1. Loop Permutation
         3.1.1.2. Loop Tiling
      3.1.2. Hierarchical Tiling
         3.1.2.1. Tiling: Decomposition of the Iteration Space
         3.1.2.2. Embedding: Splitting of Data Dependencies
         3.1.2.3. Iteration dependent Conditions
         3.1.2.4. Parallelization of Tiled Piecewise Linear Algorithms
      3.1.3. Results: Scalability and Overhead of Hierarchical Tiling
   3.2. Controller Generation
      3.2.1. Accelerator Control Engine: Architecture and Synthesis Methodology
         3.2.1.1. Counter Generation
         3.2.1.2. Determination of Processor Element Type
         3.2.1.3. Global and Local Controller Unit
         3.2.1.4. Propagation of Global Control and Counter Signals
      3.2.2. I/O Communication Controller
         3.2.2.1. Buffer Modeling and Synthesis
         3.2.2.2. I/O Controller Synthesis
   3.3. Results
      3.3.1. Embedded Computation Motifs
      3.3.2. Impact of Transformations on Controller Overhead
   3.4. Conclusion

4. Accelerator Subsystem for Streaming Applications: Synthesis and System Integration
   4.1. Communicating Loop Model
      4.1.1. Loop Graph
      4.1.2. Accelerator Model
      4.1.3. Mapping: Putting it all together
   4.2. Automated Generation of a Communicating Accelerator Subsystem
      4.2.1. Modeling of Communication Channels
         4.2.1.1. Simplified Windowed Synchronous Data Flow Model
         4.2.1.2. Conversion from the Polyhedral Model to the Data Flow Representation
      4.2.2. Multi-dimensional FIFO: Architecture and Synthesis
   4.3. Synthesis of Accelerators for MPSoCs
      4.3.1. Interface Synthesis
         4.3.1.1. Accelerator Memory Map Generation
         4.3.1.2. Hardware Wrapper
         4.3.1.3. Software Driver
      4.3.2. Accelerator Integration in SoC
   4.4. Results
      4.4.1. Overhead of Communication Primitives
      4.4.2. Accelerators as Components in SoC
   4.5. Conclusion

5. Design Space Exploration: Accelerator Tuning
   5.1. Single Accelerator Exploration
      5.1.1. Model Representation and Problem Definition
      5.1.2. Multiple Objectives
      5.1.3. Objective Functions
         5.1.3.1. Rapid Estimation Models
      5.1.4. Optimization Engine
         5.1.4.1. Baseline: Random or Exhaustive Search
         5.1.4.2. Evolutionary Algorithms
   5.2. Performance Analysis of Accelerators in an SoC System
      5.2.1. Modular Performance Analysis (MPA)
      5.2.2. Objective Parameter Estimation for Accelerators
         5.2.2.1. Accelerator Performance: Service Curve Estimation
      5.2.3. Optimal Configuration Selection in System Context
      5.2.4. Case Study
         5.2.4.1. Motion JPEG Decoder
   5.3. Conclusion and Summary

6. Conclusions and Outlook
   6.1. Conclusion
   6.2. Future Work

A. Glossary

B. Hermite Normal Form

C. Loop Benchmarks

German Part

Bibliography

List of Abbreviations

Curriculum Vitae


1. Introduction

The insatiable craving for more and more performance and the famous Moore's law have been the driving forces behind the evolution of computer architectures. On the other hand, green computing, which stands for reducing the environmental impact of computing devices, is urging increased energy efficiency. Therefore, next generation embedded, desktop, and supercomputing applications are placing extreme performance demands at low energy cost.

According to the ITRS Roadmap 2007 [101], heterogeneous massively parallel computing is the paradigm that needs to be embraced in order to meet the demands of the next generation applications. A heterogeneous massively parallel computing platform consists of numerous specialized cores around multiple general purpose processors, which are not identical from the programming or the implementation standpoint. In the area of embedded computing, these architectures are also called systems-on-a-chip (SoC). They are the next step in embedded architecture evolution, where homogeneous many-core architectures are augmented with task-specific specialized cores called accelerators. The block diagram in Figure 1.1 illustrates such an SoC architecture. The control intensive general purpose software is executed sequentially on the processors, whereas the data intensive parts of the applications, such as loop programs, are offloaded to the corresponding tailored accelerators. According to Gries [79], the following approaches are needed in order to realize the full potential of SoC architectures:

• Task-specific processors: The need for performance at low power cost led to the inclusion of task-specific processors, also known as acceleration engines. They can be dedicated or offer domain-specific programmability.

• Correct-by-construction design: Verification and integration account for a large chunk of the development time of SoCs; therefore, the automatic generation of a dedicated accelerator architecture and of program code for processors by a compiler from a high-level specification should replace the error-prone manual design process. Embedded computing is embracing new software tools for correct-by-construction design.

• Hardware/software co-design: The term refers to early system development on higher levels of abstraction and subsequent refinement onto an accelerator/processor system.


Figure 1.1.: SoC architecture that has multiple processors (CPUs, DSP), special purpose accelerators (e.g., video, FFT, and encryption engines, matrix multiplication, RGB2YUV), connectivity IPs (e.g., USB, ...), memories, and controllers, which can be accommodated on a single chip.

• Design space exploration: The architect must be aided in pruning the large design space and in identifying optimal designs through systematic exploration.

It is usually the task of the designer to handcraft the data intensive parts (i.e., loop programs) in hardware description languages like VHDL to obtain the best-fit accelerator in terms of performance, cost, and power. The optimizations of such loop specifications exploit partitioning, efficient data reuse, transfer, storage, and other transformations in the programmer's bag of tricks. Furthermore, the manual process of synthesis and exploration of the accelerator design space is becoming increasingly tedious and error-prone in the face of increasing hardware and software complexity. Therefore, a major challenge to be surmounted for SoC architectures is the lack of mapping methodologies for the correct-by-construction design of task-specific processors (i.e., accelerators). Furthermore, the tools must support accelerator integration in an SoC (i.e., HW/SW co-design) and the discovery of best-fit accelerator designs (i.e., design space exploration). Several electronic system level (ESL) synthesis tools partly answer the problem by enabling the automatic generation of register transfer level (RTL) descriptions of hardware accelerators or of program code for domain-specific acceleration engines. In this dissertation, we address all the needs for realizing an automated design flow for synthesis, integration, and exploration of accelerator designs, for a given algorithm or application specified in a high-level language.

In this chapter, we first characterize the nature and requirements of next generation applications in Section 1.1. Subsequently, we discuss the historical developments, future trends, and the motivation for accelerator-based systems-on-a-chip in Section 1.2.


Section 1.3 gives an overview of programming models and system software for harnessing accelerator-based SoC architectures. The problem statements and contributions of the dissertation are summarized in Sections 1.4 and 1.5, respectively. The outline of the rest of the dissertation is given in Section 1.6.

1.1. Next Generation Applications

Embedded applications, their workload, and underlying algorithms are in general characterized by (a) different levels of parallelism: the applications not only exhibit task level parallelism, but the algorithm primitives also contain loop and operation level parallelism; (b) high computation intensity: the applications are characterized by a high ratio of computation to communication operations; (c) real-time processing of small data types with extensive data reorganizations: the applications, especially in embedded computing, are characterized by hard and soft deadlines specified by the perception of responsiveness. The currently existing architectures, compilers, and programming models were developed for different sets of embedded applications. They hardly meet the high performance demands and rigorous power constraints of next generation streaming applications.

The compelling next generation streaming applications are the driving force for SoC architectures. The applications usually involve a convergence of pre-processing, recognition, mining, and analysis workloads [30]. The pre-processing is concerned with enhancing the quality of input data (e.g., signal, image, video). The recognition, mining, and synthesis (RMS) part is concerned with the modelling, identification, and evaluation of events or objects. Let us consider an angiography application from medical imaging, which deals with interactive catheter guidance and minimal-invasive vessel treatment. A major problem faced by such medical imaging applications is the poor signal to noise ratio (SNR) of the images due to limited dosage and exposure for health reasons. Limited dosage especially is an issue if fluoroscopic images are acquired. Fluoro images are not used for diagnostic purposes, but for guiding a catheter, placing a device (e.g., a stent), and other interactive interventional procedures taking up to several hours. Therefore, the use of image pre-processing algorithms for reducing the noise along with the preservation of visual structures is a major field of research. Usually, a sequence of algorithms in an imaging pipeline is used to tackle the different types of noise, as shown in Figure 1.2. The RMS part of the application deals with modelling, the recognition of stenosis (narrowing of blood vessels), aneurysm (localized, blood-filled dilation due to disease), and embolization (selective occlusion of blood vessels), and helping the doctors evaluate the complications during the surgery. The whole application pipeline can require up to teraflops ($10^{12}$ operations per second) of performance with stringent power constraints. Table 1.1 summarizes some other applications from different fields, their performance requirements, and typical power constraints. Just as office and presentation software were the killer applications for sequential processors in the 1990s, some of the above-mentioned applications are already the driving forces of next generation architectures.


Figure 1.2.: An example of the problems and applications in medical imaging, here for angiography.

Computationally intensive algorithms are the building blocks of the applications in Table 1.1. These applications usually involve mathematical models, which in turn can be solved using mathematical algorithms. For example, computational fluid dynamics applications are based on partial differential equations (e.g., the Navier-Stokes equations), which in turn are solved by iterative grid solvers like the Gauss-Seidel method. In [7], the underlying communication and computation patterns of the algorithms that are the building blocks of many application benchmarks were identified. These numerical patterns were classified into 13 groups called dwarfs. These numerical primitives are dense and sparse matrix algorithms, spectral methods, n-body algorithms, structured/unstructured grids, Monte Carlo methods, finite-state machines, combinational logic, graph traversal, dynamic programming, backtracking, and branch and bound (B&B) methods. These so-called dwarfs are mostly written as loop programs. A major focus of research activities is on programming models and accelerator architectures for such loop kernels. The applications usually contain multiple dwarfs, which possibly communicate with each other (see Figure 1.2). Therefore, the performance boost not only depends on a single loop kernel but also on the composition of multiple communicating loop kernels. Hence, there is a drastic need for new architectures, programming models, and software to pave the way for new applications.


Area | Applications | Performance | Power
--- | --- | --- | ---
Mobile and Wireless Computing | Speech recognition, video compression, network coding and encryption, holography | 10-40 Gops | 100 mW
High Performance Computing | Computational fluid dynamics, molecular dynamics, life sciences, oil and gas, climate modelling | 100-10000 Gops | 100-1000 kW
Medical Imaging and Equipment | 3D reconstruction, image registration and segmentation, battery-driven health monitoring | 1-1000 Gops | 100 mW-100 W
Automotive | Lane, collision and pedestrian detection, driving assistance systems | 1-100 Gops | 500 mW-10 W
Home and Desktop Applications | Gaming physics, ray tracing, CAD/CAM/CAE/EDA tools, web mining, digital content creation | 10-1000 Gops | 20-500 W
Business | Portfolio selection, smart cameras [186], asset liability management [30] | 1-1000 Gops | 1-100 W

Table 1.1.: Summary of computationally intensive next generation applications with typical performance and power requirements.

These applications are characterized by the presence of multiple computationally intensive loop algorithms. In the next section, we explore state-of-the-art architectures which cater to the needs of these applications.

1.2. Accelerator based SoC Architectures

Historically, Moore's law has been the propelling force behind the evolution of mainstream computer architectures. One piece of the old conventional wisdom was that increasing the clock frequency would improve processor performance; however, this also led to a sharp increase in power consumption. In recent years, there has been a slowdown in processor performance growth because of the limited increase in clock frequency due to physical constraints and power considerations. The new conventional wisdom is that increasing parallelism is the primary method for increasing processor performance without increasing the power consumption substantially [7]. Informally, the new corollary of Moore's law is that the number of processors per chip doubles every two years. If Moore's law continues, one will be able to pack multiple processors, accelerators, and IPs on a single chip, leading to an increased functional diversity (also known as "more than Moore"). In case of a breakdown of Moore's law, however, only new technologies like optical interconnects, 3-D chips, and accelerator-based processing can extract more performance.

There is a growing acceptance of accelerator-based architectures in the complete spectrum of computing, from supercomputing and mainstream to embedded computing. The supercomputer which broke the petaflop ($10^{15}$ operations per second) barrier is IBM Roadrunner [11]. This system consists of 6,948 dual-core Opteron processors, as well as 12,960 Cell processors, which function as accelerators. Remarkably, it is also the fourth-most energy-efficient supercomputer in the world on the Green500 list [68]. Similarly, in mainstream desktop computing, there has been a major thrust on the use of general purpose graphics processing units (GPGPUs) as acceleration engines for high performance applications. The Tesla GPGPU from nVidia, consisting of 30 multiprocessors each with 8 ALUs (multiply-accumulate operation), already has a peak performance of 936 GFlops (single precision) [124]. The Cell processor used in Sony PlayStation gaming consoles [103] is another major player in the mainstream acceleration arena. Acceleration technology for supercomputing and mainstream desktop computing exists in the form of add-on coprocessors or boards that are integrated into systems. For embedded computing, accelerators bring significant additional performance to a system-on-chip. TI's OMAP and ST's Nomadik are some of the known SoC systems catering to the needs of the mobile communication industry [185].

SoC architectures combine the flexibility of CPUs with the performance of accelerators on a single chip. The CPU is responsible for the housekeeping control functions, which are inherently sequential in nature. The accelerators are dedicated to boosting the performance of the parallel part of the application. The performance and power efficiency of the accelerators comes from exploiting the following features: (a) parallelism, (b) control amortization, (c) reconfigurability, and (d) hierarchical storage. Modern embedded processor architectures include features like branch prediction, out-of-order execution, deep execution pipelines, and others to exploit instruction level parallelism. However, studies showed that such features had diminishing returns on performance, which was outweighed by power and energy usage [75]. Therefore, accelerator architectures rely on a simplified datapath without functions like data caches or prefetching, and have multiple small processor elements (PEs) with partly general and partly specialized datapaths. The PEs have less control overhead, as each PE synchronously executes a copy of the same program. Therefore, the control cost (i.e., the instruction fetch/decode unit or global controller) is amortized over multiple PEs. This is also known as SIMD (single instruction, multiple data) in parallel architecture jargon. To counter the memory bottleneck and hide data latency, hierarchical storage is used. The accelerator models have a main memory, which has a high access latency. They also have local memory types such as intermediate local buffers (scratch-pad memory), internal local memory, and register files in the PEs for faster access.


Figure 1.3.: (a) Power efficiency versus flexibility trade-offs of different accelerator alternatives in embedded computing, ranging from embedded processors (ARM, ca. 1 Mops/mW), DSPs (TI C6x, ca. 5 Mops/mW), GPGPUs (ca. 10 Mops/mW), FPGAs (Virtex-II Pro, ca. 20 Mops/mW), and domain-specific programmable processors (WPPA, ca. 25 Mops/mW) up to dedicated hardware loop accelerators (ca. 100 Mops/mW); (b) dedicated hardware loop accelerator.

Certain accelerator architectures provide the possibility of defining an application-specific instruction set and datapath at synthesis time (configurable), or different program functionality at run time (reconfigurable). The accelerators used in embedded computing differ in their characteristics in terms of power efficiency and flexibility (see Figure 1.3(a)). Our focus is on accelerating loop kernels and their composition with maximum efficiency. Therefore, we focus on non-programmable accelerators throughout the dissertation. Furthermore, the mapping principles for non-programmable accelerators can be extended to programmable processor arrays. The accelerator architecture can be in the form of a processor array as shown in Figure 1.3(b). The power efficiency comes at the cost of flexibility, as such accelerators do not support multi-threading, reconfigurability, and programmability. In order to support the designer, synthesis tools and compilation methods for the correct-by-construction design of hardware accelerators from an algorithm specification in a high-level language are required. Furthermore, these design tools must support SoC integration and design space exploration [79].

To summarize, accelerator-based system-on-chip (SoC) architectures are the next step in the evolution of embedded architectures in order to meet the performance demands and stringent power constraints. Hardware accelerators offer maximum efficiency at the cost of flexibility among all accelerator types. In the next section, an overview of programming models for realizing accelerator-based SoC architectures is given.


Figure 1.4.: (a) Double roof model for embedded system design [167]; (b) Y-chart approach for iterative synthesis at each abstraction level [110].

1.3. Programming Models for SoC

In the previous sections, we discussed several application and architecture models. The applications are typically computationally intensive algorithms specified by communicating loop programs, whereas the architecture template is a heterogeneous massively parallel architecture characterized by the presence of multiple processors, accelerators, and other IP cores. The task of the programming model is to match and map an application to the architecture. It encompasses the areas of languages, compilers, and libraries for efficient system design. In the area of embedded computing, the term electronic system level (ESL) design is used synonymously with programming model for system-on-chip (SoC) platforms.

The double roof model shown in Figure 1.4(a) is a nice visual cue illustrating a top-down system design process. The right and left sides of the model are concerned with the hardware and the software part of the system, respectively [167]. The upper and lower roofs separate the behavioural and implementation aspects of the design. The different levels of abstraction show the design refinement steps for hardware and software. In the beginning, one can create modular application functionality at the system level. This can be done using popular programming languages like C and C++, or model-based design frameworks like Simulink. The application is then partitioned onto a complex architecture that includes both hardware and software components (system synthesis). This is also known as hardware/software partitioning. The task of generating an RTL description (hardware synthesis) or a software program (software synthesis) is necessary for the implementation. There are also further abstraction layers for logic and code synthesis, i.e., the generation of netlists and executable object files for hardware and software, respectively.

In this thesis, we study aspects of system and hardware synthesis. The system synthesis aspects cover system integration and the deployment of accelerator IPs in an SoC, whereas hardware synthesis concerns itself with the generation of loop accelerators from loop descriptions in a given high-level language. Therefore, we deal mainly with the shaded abstraction layers in Figure 1.4(a). The tasks to be fulfilled in each abstraction layer form an iterative process, which can also be illustrated with the help of the Y-chart approach shown in Figure 1.4(b). Given an application and an architecture, one can synthesize and refine designs using this approach. The synthesis includes the basic tasks of allocating resources, binding functionality to resources, and scheduling. These mapping tasks are carried out by a compiler.

Figure 1.5 shows the design flow for accelerator generation and integration as developed in this thesis. The application is modelled by a graph representing communicating loop programs. This not only helps to handle the system complexity, but also naturally represents the application's task parallelism. Therefore, the input is a modular description of the application consisting of multiple loop nests, where each loop nest is described in a high-level language. Also, the design constraints on resources, compiler transformation parameters, and golden simulation data for testbench generation have to be provided. The computation synthesis relies on techniques like loop tiling for parallelization. The compiler front end contains a transformation toolbox for massaging the given input program, or may rely on the programmer's knowledge to provide appropriate directives in the loop description. A major task of the compiler is to exploit architectural features like the multiple levels of parallelism available in the architecture (functional units, processing elements) and the communication structure (registers, local buffers, and external memory). Loop transformations in combination with scheduling generate an intermediate representation (IR) of an accelerator implementation, mostly in the form of a processor array as shown in Figure 1.3(b). This requires the generation of the PE datapath, a local/global controller, an I/O controller, and buffers depending on the selected allocation and schedule.

Subsequently, for each loop in the graph describing an application, a dedicated accelerator RTL is synthesized. After back end optimizations, the compiler generates the architecture-specific (HW/SW) code for hardware/software co-design and deployment. Using standard industry synthesis tools, the accelerator RTL in VHDL can then be compiled to a target architecture (e.g., ASICs or FPGAs). In case of communicating loop applications, the hardware accelerators are hooked to other accelerators using dedicated communication subsystems.


Figure 1.5.: Overview of the design flow, covering processor array synthesis, controller generation, and I/O interface synthesis in the back end; communication engine, driver, and interconnect synthesis for system integration; and design space exploration with search heuristics (e.g., evolutionary algorithms, simulated annealing), estimation of the area, power, and performance metrics, and decision making based on real-time calculus.

In order to couple the accelerator to a processor, interface generation is a necessary step. It requires the generation of a memory map, glue logic for an accelerator wrapper, as well as a device driver for data communication and synchronization. This depends on throughput, execution order, allocation, and scheduling. The compiler produces an implementation, which can be analyzed in terms of design metrics like performance, cost, and power. These numbers can then be used to suggest architectural improvements, application modifications, or a different mapping strategy to obtain a best-fit design. This iterative process of improving the system is called design space exploration. In our design flow, the design space exploration combines the use of modern search heuristics and analytical methods, and tries to find the best-fit accelerator architectures depending on architecture allocation, compiler parameters, and system workload scenarios.

The programming model of our design flow for accelerator generation deals with several aspects at the system synthesis and hardware synthesis levels. It contains novel loop transformations and back end optimizations in controller and communication architecture generation for hardware accelerator synthesis; furthermore, it supports the integration of accelerators at the system level and design space exploration.


Since programming models are central to this work, elaborate groundwork on the fundamentals of, and related work on, programming models for accelerators is presented in the next chapter. In this thesis, we tackle different problems associated with the generation of hardware accelerators.

1.4. Problem Definition

The problems can be summarized as follows:

1. How do we map a single loop nest onto a dedicated hardware accelerator? The compiler community has been facing this problem for a long time. In the last two decades, a lot of research in academia and industry has spawned state-of-the-art design tools and methodologies such as PICO-Express [155], MMalpha [81], PARO [93], and many others. The first problem for such a design methodology is automatic parallelization. The second, perpetual challenge of such tools lies in matching a given loop program to the resource constraints of a given architecture (e.g., number of PEs, functional units, memory banks, I/O pins). Finally, the back end problem of efficient code and hardware RTL generation for the corresponding loop program cannot be underestimated.

2. How do we generate the acceleration engine for an application consisting of multiple communicating loop nests, and how do we integrate such an accelerator into an SoC? Multiple communicating loops are usually mapped onto a pipeline of dedicated hardware accelerators. Each of the accelerators has parallel read and write operations with a particular execution order. Therefore, a customized communication subsystem is needed for the efficient transfer of multi-dimensional arrays between the accelerators. Furthermore, in an SoC, there are different communication architectures like buses, dedicated point-to-point channels, or networks-on-chip for coupling the accelerators. Therefore, the glue logic for coupling an accelerator over the interconnect of choice needs to be generated. The driver program for data transfer and synchronization between the processor and the accelerator device needs to be generated depending on the selected allocation and schedule.

3. How do we identify optimal accelerator designs? The selection of an optimal architecture can be daunting due to a plethora of design decisions. The choices include architecture parameters like the number of processing elements (PEs), functional units (FUs), and registers. They can also be compiler optimization parameters like a loop tiling strategy describing the allocation and execution order. These parameters are important design decisions for system architects, as they directly influence not only performance, but also power and chip area.


An exhaustive search of the parameter space is not feasible because of time constraints. Furthermore, the optimal choice also depends on system properties like bus bandwidth, workload behaviour, etc. Therefore, one needs techniques for the efficient evaluation of accelerator designs, for searching the parameter space, and for the consideration of system behaviour.

4. What are appropriate benchmarks for evaluating the proposed programming models and generated accelerator architectures? The benchmarks should capture the communication and computation patterns of computationally intensive programs.

1.5. Contributions and Bibliographic notes

Solutions to the above problems and the major steps of our design flow are illustrated in Figure 1.5. The following is a brief summary of the novel contributions of this dissertation.

1. Traditional design flows depend on the programmer's knowledge to specify the degree of parallelism, balance computational intensity, and exploit data locality. Central to our design flow is a sophisticated transformation called hierarchical tiling, which assists the designer in matching or specifying the degree of parallelism (number of PEs), the local memory, and the requisite communication bandwidth of the accelerator architecture. Other flows contain at most transformations which can match only one or the other criterion [46, 48]. We also introduce a novel methodology for control generation, which reduces the control hardware overhead by moving common control predicates to a global controller. Therefore, the controller cost is amortized over multiple PEs. This leads to a significant reduction in control path area compared to existing approaches [43, 45, 47]. The back end of our design flow automatically generates an RTL description of the PE datapath, the local/global controller, and a memory interface corresponding to the selected partitioning and scheduling strategy [51, 50, 93]. Therefore, we are able to design hardware accelerators with 2.5x, 4.5x, and 50x improvements in terms of area, power, and performance with respect to embedded processors. Furthermore, a correct-by-construction design flow improves the design productivity by a factor of 100.

2. Communication synthesis for the data transfer and synchronization between loop accelerators has been a major challenge. The complexity of the problem arises from the fact that an optimal memory mapping and address generation in a communication subsystem for parallel data access and out-of-order communication depends on the allocation and the scheduling choices. The problem of communication synthesis is solved by leveraging the windowed synchronous data flow (WSDF) model [107]. In this context, an intermediate representation of communicating loops in the polyhedral model and a unified methodology for its projection onto the WSDF model are proposed. Finally, we present an architecture template and a synthesis methodology, and evaluate the overhead of the communication primitive [44]. For the integration of an accelerator in an SoC, we also generate interconnect glue logic and a software driver for data transfer and synchronization with the CPU [56]. It is shown that, when using accelerators as co-processors in a HW/SW system, the performance gain scales down by an order of magnitude when compared to a pure hardware accelerator-based system.

flow (WSDF) model for communication synthesis [107]. In this context, an intermediate representation of communicating loops in the polyhedral model and a unified methodology for its projection onto the WSDF model is pro- posed. Finally, we present an architecture template, a synthesis methodology, and evaluate the overhead of the communication primitive [44]. For the inte- gration of an accelerator in an SoC, we also generate interconnect glue logic and a software driver for data transfer and synchronization with the CPU [56]. It is shown that, when using accelerators as co-processor in a HW/SW system, the performance gain scales down by an order of magnitude when compared to a pure hardware accelerator-based system.

3. Accelerator designs may have many design knobs, like the number of functional units and high-level transformations like loop tiling, which affect the performance, power, and cost trade-off. Exhaustive exploration of the design space is prohibitive due to time constraints. Therefore, we use modern search heuristics based on evolutionary algorithms to identify Pareto-optimal designs (i.e., designs with the best trade-offs). We also decouple the search for Pareto-optimal designs from the selection of best-fit accelerator designs for a given system workload behaviour. The proposed analytical method for finding the best-fit accelerators from a Pareto-optimal set of designs is based on real-time calculus [172]. Therefore, the systematic tuning of accelerators, simultaneously taking into account the objectives of performance, area, power consumption, and system workload in a reasonable time of about an hour, is another major contribution [52].

4. Finally, the validation of the proposed design flow with the help of new standard benchmarks and complex applications consolidates its novelty and efficiency [54]. The benchmarks are algorithms and applications from [7] and are used to evaluate the corresponding application acceleration engines and their system integration. The efficiency of the developed design flow is illustrated by the generation of accelerators with substantial speedups and a high quality of results for many compute-intensive algorithms.

The solutions proposed in this thesis can also be extended to generate programmable accelerators. Some other completed works not included in this dissertation are:

• Extension of the design flow for hardware accelerators to a unified mapping methodology for coarse-grained reconfigurable architectures [88, 89, 90, 60] and weakly programmable processor arrays [42, 49, 55, 87, 112, 166].

• Acceleration of multi-resolution imaging algorithms on general purpose graphics processing units (GPGPUs) and the Cell processor [131, 130, 132, 129].


1.6. A Guided Tour through the Thesis

In Chapter 2, we present the fundamentals of our input specification, compilation model, and accelerator architectures. Section 2.1 presents the modelling and language specification of the nested loop programs considered. For the purpose of modelling communicating loops, data flow models of computation are discussed. The characteristics of state-of-the-art loop accelerator architectures are presented in Section 2.2. The principles of the compilation technology for mapping loop programs onto accelerator architectures are explained in Section 2.3. After a discussion and differentiation of related work on design space exploration in Section 2.4, we differentiate our design flow from existing high-level synthesis tools in Section 2.5. Finally, we close the chapter with conclusions and a summary of the remaining challenges in Section 2.6.

The problem of efficient accelerator synthesis is dealt with in Chapter 3. The loop optimizations developed in our compilation technology are presented in Section 3.1. Essential to our design flow is the automation of a loop transformation called hierarchical tiling, which allows constraint-aware allocation and scheduling on the accelerator architecture. The different loop transformations introduce control and communication overhead. Therefore, in Section 3.2, we propose a back end transformation for the automatic generation of a hybrid controller architecture with a local/global control path and a communication controller. Computationally intensive algorithms from embedded systems are used to illustrate and benchmark our methodology in Section 3.3.1. The characteristics and results of the generated hardware in terms of performance, power, and area cost are given in Section 3.3.2.

The second problem of generating acceleration engines for applications composed of multiple communicating loops and their integration in an SoC is discussed in Chapter 4. The modelling of communicating loops and platforms is explained in Section 4.1. The synthesis of the communication subsystem for a pipeline of accelerators is the focus of Section 4.2. The generation of glue logic for the coupling of accelerators over high performance buses, and of host drivers for interacting with an accelerator device, is handled in Section 4.3. Different applications are used as case studies for demonstrating our design flow for the generation of accelerator subsystems.

The last problem of identifying an optimal accelerator architecture for a given application, depending on resource allocation, compiler parameters, and system behaviour, is the matter of discussion in Chapter 5. The modelling of architecture design parameters and objectives, fast evaluation through hardware resource estimation, and different search techniques like evolutionary algorithms for discovering good designs are presented in Section 5.1. In Section 5.2, a methodology based on real-time calculus is used for identifying best-fit designs from a Pareto-optimal set of designs, depending on the system workload contracts.

Apart from the conclusions at the end of each chapter, the final Chapter 6 concludes the work by highlighting the contributions of the dissertation and possible future work.

2. Fundamentals and Related Work

In the previous chapter, we highlighted accelerator-based heterogeneous systems-on-chip as a solution to the high performance needs and stringent power constraints of next generation applications. In this chapter, we give an overview of the fundamentals and related work. The fundamentals of the underlying models for algorithm and application specification are presented in Section 2.1. In Section 2.2, we analyze and classify the accelerator architectures. Section 2.3 is a primer on the compiler technology used for mapping algorithms onto hardware loop accelerators, communication synthesis for accelerator systems, and SoC integration. Section 2.4 discusses the specific problem of searching for Pareto-optimal and best-fit accelerators for a given system. In Section 2.5, we differentiate our design flow methodology from other existing state-of-the-art tools for accelerator synthesis. The chapter ends in Section 2.6 with a summary of the differentiation and of the open problems to be solved in this dissertation.

2.1. Algorithm Specification in the Polytope Model

In embedded electronics applications, 80% of the execution time is typically spent in 20% of the program code [162]. These computationally intensive parts of the program are usually loop nests, which contain inherent parallelism. Hence, a lot of research in the field of parallel computing has been spent on the modelling, parallelization, scheduling, allocation, code generation, and memory management of loop nests.

2.1.1. Fundamentals: Algorithm Specification

The polytope model is universally accepted as a basis for the mathematical optimization of loop nests in compiler theory [123, 66]. In this model, all the iterations of a nested for loop form the iteration space, which can be modelled by a Z-polyhedron.

Definition 2.1.1 (Z-polyhedron). A Z-polyhedron is mathematically defined as the set of integer solutions of a system of affine inequalities

\[
  \mathcal{I} = \{\, I \in \mathbb{Z}^n \mid A\,I + b \geq 0 \,\} \tag{2.1}
\]

where $b \in \mathbb{Z}^m$ and $A \in \mathbb{Z}^{m \times n}$ denote an integer vector and an integer matrix, respectively.


A polytope is a bounded polyhedron. A Z-polyhedron can model the iteration space of a nested loop: each loop iteration is an integer point in the Z-polyhedron, and the loop counter bounds define the set of half-spaces (as affine inequalities) whose intersection gives Equation (2.1). For example, the loop nest of a bilateral filter algorithm is given in Figure 2.1(a). Assume the image dimensions to be 1024 × 1024 pixels and the mask dimensions to be 4 × 4, i.e., M = 1024, N = 1024, P = 4, Q = 4. The iteration space of the bilateral filter algorithm can then be given as the following Z-polyhedron:

\[
  \mathcal{I} = \left\{ I = \begin{pmatrix} i \\ j \\ m \\ n \end{pmatrix} \in \mathbb{Z}^4 \;\middle|\;
  \begin{pmatrix}
    1 & 0 & 0 & 0 \\
    -1 & 0 & 0 & 0 \\
    0 & 1 & 0 & 0 \\
    0 & -1 & 0 & 0 \\
    0 & 0 & 1 & 0 \\
    0 & 0 & -1 & 0 \\
    0 & 0 & 0 & 1 \\
    0 & 0 & 0 & -1
  \end{pmatrix}
  \begin{pmatrix} i \\ j \\ m \\ n \end{pmatrix}
  +
  \begin{pmatrix} 0 \\ 1023 \\ 0 \\ 1023 \\ 0 \\ 3 \\ 0 \\ 3 \end{pmatrix}
  \geq 0 \right\} \tag{2.2}
\]

The resulting polyhedron is visualized in Figure 2.1(b) for M = 8, N = 8, P = 4, Q = 1 (Q = 1 for the sake of visualization). In [164], the concept of linearly bounded lattices (LBL) was introduced, which can represent any arbitrary n-dimensional integer point set (i.e., any loop nest).

Definition 2.1.2 (Linearly Bounded Lattice). A linearly bounded lattice (LBL) denotes an iteration space of the form:

\[
  \mathcal{I} = \{\, I \in \mathbb{Z}^n \mid I = M\kappa + c \;\wedge\; A\kappa \geq b \,\} \tag{2.3}
\]

where $\kappa \in \mathbb{Z}^l$, $M \in \mathbb{Z}^{n \times l}$, $c \in \mathbb{Z}^n$, $A \in \mathbb{Z}^{m \times l}$, and $b \in \mathbb{Z}^m$. $\{\kappa \in \mathbb{Z}^l \mid A\kappa \geq b\}$ denotes the set of integral points within a Z-polyhedron or, in case of boundedness, within a Z-polytope in $\mathbb{Z}^l$. This set is affinely mapped onto the iteration vectors I using an affine transformation ($I = M\kappa + c$).
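For intuition, consider a small illustrative example (ours, not taken from the original text): a two-dimensional loop nest with counters running from 0 to 6 in steps of 2 has the even points of a 4 × 4 grid as its iteration space, which is an LBL with M being twice the identity and c = 0:

\[
  \mathcal{I} = \left\{ I \in \mathbb{Z}^2 \;\middle|\;
  I = \begin{pmatrix} 2 & 0 \\ 0 & 2 \end{pmatrix} \kappa
  \;\wedge\;
  \begin{pmatrix} 1 & 0 \\ -1 & 0 \\ 0 & 1 \\ 0 & -1 \end{pmatrix} \kappa \geq
  \begin{pmatrix} 0 \\ -3 \\ 0 \\ -3 \end{pmatrix}
  \right\}
\]

Since M is not the identity here, this point set is not a Z-polyhedron; the lattice part is what makes strided loops expressible.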

If M is an identity matrix, then an LBL is a Z-polyhedron. Throughout the thesis, we assume that the matrix M is square and of full rank. Hence, each vector κ maps to a unique iteration point I. A further extension of the Z-polyhedron is the parametric Z-polyhedron, which can model loops with parametric bounds [64]. There is a rich mathematical framework for operations based on these polytope models. For example, the Fourier-Motzkin elimination algorithm [156] can be used to decide the existence of a solution to a system of affine inequalities as given in Equation (2.1). Parametric integer programming (PIP) is an extension to handle parametric bounds and integer solutions [64]. Ehrhart polynomials can be used for counting the number of integer points inside a parametrized Z-polytope as a function of its parameters [177]. Loop nests can thus be represented in the polyhedral model, which offers a rich mathematical framework for loop transformations.
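As a quick sanity check of the counting view, the following minimal C sketch (our illustration, not part of the original tool flow) enumerates the integer points of the box-shaped iteration space of Equation (2.2) and compares the count with the closed form M · N · P · Q, which is what an Ehrhart polynomial would deliver symbolically for this simple parametrized polytope:

```c
#include <stdio.h>

int main(void) {
    const int M = 1024, N = 1024, P = 4, Q = 4;
    long count = 0;
    /* every point of the box 0<=i<M, 0<=j<N, 0<=m<P, 0<=n<Q
       satisfies the eight inequalities of Equation (2.2) */
    for (int i = 0; i < M; i++)
        for (int j = 0; j < N; j++)
            for (int m = 0; m < P; m++)
                for (int n = 0; n < Q; n++)
                    count++;
    printf("enumerated: %ld, closed form M*N*P*Q: %ld\n",
           count, (long)M * N * P * Q);
    return 0;
}
```

For general (non-box) polytopes the count is no longer a simple product, which is where Fourier-Motzkin elimination, PIP, and Ehrhart theory come into play.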


Figure 2.1.: (a) A loop nest description of the bilateral filter algorithm (declaring the input image u[M][N], the output image y[M-P][N-Q], the closeness mask c[P][Q], and the similarity mask s[P][Q] obtained via a LUT, with outer loops over the image and inner loops over the mask) and (b) the iteration space graph (dependencies are not shown).
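Since the listing of Figure 2.1(a) is referenced repeatedly in the following, a hedged C reconstruction is given below. It follows the bilateral filter equations and statements S1-S14 introduced in Example 2.1.1 below; the loop order and the zero-contribution boundary handling match the surrounding text, while the exact statement arrangement inside the kernel (and the full-size output array) are our assumptions:

```c
#define M 1024
#define N 1024
#define P 4
#define Q 4

extern float LUT(float d);   /* tabulated similarity weights, cf. Eq. (2.7) */

void bilateral(const float u[M][N], const float c[P][Q], float y[M][N])
{
    for (int j = 0; j < N; j++)            /* outer loops over the image */
        for (int i = 0; i < M; i++) {
            float ysum = 0.0f, msum = 0.0f;
            for (int m = 0; m < P; m++)    /* inner loops over the mask  */
                for (int n = 0; n < Q; n++) {
                    /* boundary handling: out-of-range pixels contribute 0 */
                    float pix = (i - m >= 0 && j - n >= 0) ? u[i - m][j - n]
                                                           : 0.0f;
                    /* closeness weight times tabulated similarity weight */
                    float a = c[m][n] * LUT(pix - u[i][j]);
                    ysum += a * pix;       /* weighted pixel sum          */
                    msum += a;             /* sum of mask weights         */
                }
            y[i][j] = ysum / msum;         /* normalized output pixel     */
        }
}
```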

Dynamic Piecewise Linear Algorithms

Loop nests are ubiquitous in imperative languages for representing the repetitive patterns of computationally intensive functions. A lot of work has been done in the area of high performance compilers on extracting parallelism from loop nests [158]. On the other hand, recurrence equations in the polytope model have been used to organize computations in a single assignment manner with an implicit order of execution. Hannig et al. define the class of dynamic piecewise linear algorithms [95] for the specification of loop kernels by sets of recurrence equations.

Definition 2.1.3 (DPLA). A dynamic piecewise linear algorithm consists of a set of N quantified equations, $S_1[I], \ldots, S_i[I], \ldots, S_N[I]$. Each equation $S_i[I]$ is of the form

\[
  \forall I \in \mathcal{I}_i : \quad
  x_i[P_i I + f_i] = \mathcal{F}_i\big(\ldots, x_j[Q_j I - d_{ji}], \ldots\big)
  \quad \text{if} \;\; \mathcal{C}_i^{I}(I) \wedge \mathcal{C}_i^{RT}[I] \tag{2.4}
\]

where $x_i$, $x_j$ are linearly indexed variables, $\mathcal{F}_i$ denotes arbitrary functions, $P_i$, $Q_j$ are constant rational indexing matrices, and $f_i$, $d_{ji}$ are constant rational vectors of corresponding dimension. The dots $\ldots$ denote similar arguments. $\mathcal{I}_i \subseteq \mathbb{Z}^n$ is a linearly bounded lattice (LBL) called the iteration space of the quantified equation $S_i[I]$. The set of all vectors $P_i I + f_i$, $I \in \mathcal{I}_i$ is called the index space of the variable $x_i$. Furthermore, in order to account for irregularities in programs, we allow quantified equations $S_i[I]$ to have iteration dependent conditionals $\mathcal{C}_i^{I}(I)$ and run-time dependent conditionals $\mathcal{C}_i^{RT}[I]$. The iteration dependent conditions can be equivalently expressed by $I \in \mathcal{I}_{C_i} \subseteq \mathbb{Z}^n$, where the space $\mathcal{I}_{C_i}$ is an iteration space called the condition space. A run-time conditional ($\mathcal{C}_i^{RT}[I] = \mathcal{F}_b(\ldots, x_j[Q_j I - d_{ji}], \ldots)$) is given by a Boolean-valued function involving constants and indexed variables. In case of uniform dependencies, i.e., if $P_i$, $Q_j$ are all identity matrices, the DPLA is also called a dynamic piecewise regular algorithm (DPRA).

The following example shows the specification of a bilateral filter algorithm as a DPLA.

Example 2.1.1 Consider the loop nest of the bilateral filter shown in Figure 2.1(a). The bilateral filter is described by Equation (2.5), where y(i, j) is the processed output pixel and u(i−m, j−n) describes the neighbourhood of input pixels whose values are required for the computation. The convolution is carried out with a closeness mask c and a similarity mask s. The coefficients of the similarity mask depend on the difference of the origin pixel and the neighbourhood pixels, as shown in Figure 2.2(a). They also depend on the geometric spread constant $\sigma_d^2$ and the photometric spread constant $\sigma_r^2$ [54].

\[
  y(i,j) = \sum_{m=0}^{P} \sum_{n=0}^{Q} u(i-m, j-n) \cdot c(m,n) \cdot s(i,j,m,n) \tag{2.5}
\]
\[
  c(m,n) = e^{-(m^2+n^2)/2\sigma_d^2} \tag{2.6}
\]
\[
  s(i,j,m,n) = e^{-(u(i,j)-u(i-m,j-n))^2/2\sigma_r^2} \tag{2.7}
\]

Equations (2.5)-(2.7) can also be represented by the following DPLA:

S1:  u[i,j,m,n] = U[i−m, j−n]                  if i−m ≥ 0 ∧ j−n ≥ 0
S2:  c[i,j,m,n] = C[m,n]                       if i = 0 ∧ j = 0
S3:  u[i,j,m,n] = 0                            if i−m < 0 ∨ j−n < 0
S4:  d[i,j,m,n] = u[i,j,m,n] − u[i,j,0,0]
S5:  s[i,j,m,n] = LUT(d[i,j,m,n])
S6:  a[i,j,m,n] = s[i,j,m,n] · c[i,j,m,n]
S7:  z[i,j,m,n] = a[i,j,m,n] · u[i,j,m,n]
S8:  y1[i,j,m,n] = z[i,j,m,n]                  if m = 0 ∧ n = 0
S9:  y1[i,j,m,n] = y1[i,j,m,n−1] + z[i,j,m,n]  if n > 0
S10: y[i,j,m,n] = y1[i,j,m−1,n] + y[i,j,m,n]   if m > 0 ∧ n = Q−1
S11: m1[i,j,m,n] = a[i,j,m,n]                  if m = 0 ∧ n = 0
S12: m1[i,j,m,n] = m1[i,j,m−1,n] + a[i,j,m,n]  if m > 0
S13: m[i,j,m,n] = m[i,j,m,n−1] + m1[i,j,m,n]   if n > 0 ∧ m = P−1
S14: yo[i,j,m,n] = y[i,j,m,n] / m[i,j,m,n]     if m = P−1 ∧ n = Q−1


The iteration space is the Z-polyhedron given in Equation (2.2). The weights s[i,j,m,n] are computed as the exponential function of the difference of the pixel value (u[i,j,m,n]) and the center pixel (u[i,j,0,0]). The possible values of the exponential function are stored in a look-up table defined by the function LUT. The variables z[i,j,m,n] build the product of the mask coefficients (a[i,j,m,n]) and the corresponding pixels (u[i,j,m,n]). The variables y1[i,j,m,n] and y[i,j,m,n] store the intermediate and final weighted pixel sums. Similarly, m1[i,j,m,n] and m[i,j,m,n] store the intermediate and final sums of the mask coefficients. Finally, yo[i,j,m,n] computes the output pixel value. After the embedding and localization transformations (see Section 2.3.1.1), the DPLA can be represented as a DPRA. Statement S1 is then replaced by

S1^1: u[i,j,m,n] = u[i−1,j,m−1,n]  if i > 0 ∧ m > 0
S1^2: u[i,j,m,n] = u[i,j−1,m,n−1]  if j > 0 ∧ n > 0
S1^3: u[i,j,m,n] = U[i−m, j−n]     if m = 0

Recurrence equations denote a more general framework than sequential loop nests, since the dependence vectors in recurrence equations are not limited to being lexicographically positive, as is the case for loop nests. This leads to the computability problem, that is, a system of recurrence equations might not always be computable. In [34], Darte et al. showed that the problem is related to finding an explicit order of computations, i.e., a schedule. Our design flow uses an algorithm specification in the form of recurrence equations for the class of algorithms falling under DPLAs. For the entry of dynamic piecewise linear algorithms into our design tool, the language PAULA has been developed [86, 94].

Dependence Graph

A dependence graph leverages the geometrical representation of the polyhedral model for the representation of dynamic piecewise linear algorithms. The topology of the unfolded dependence graph (also known as the iteration space graph) is specified by the iteration space, i.e., each node of the graph is an iteration of the loop nest. An edge contains the dependency information, consisting of a variable name and the relative displacement of the source and target iterations. The representation of the iteration space graph can be very large, if not unbounded, as the number of nodes is equal to the number of iterations. DPRAs can also be succinctly represented by a multi-graph G called the reduced dependence graph (RDG).

Definition 2.1.4 (DPRA: Reduced Dependence Graph). Let a dynamic piecewise regular algorithm be given. Then the corresponding reduced dependence graph (RDG) is a graph G = (V, E, D), where V denotes the set of nodes and E ⊆ V × V denotes the set of edges. For each variable $x_i$ of the DPRA, there exists a node $v_i \in V$. Also, there exists an edge $e = (v_i, v_j) \in E$ annotated with an n-dimensional dependence vector $d_{ij} \in D$ if the variable $x_j$ is dependent on the variable $x_i$ of the algorithm.
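As an illustration, the following minimal C sketch (our own, with made-up field names; it does not reflect the internal data structures of the PARO tool) encodes a fragment of the bilateral filter RDG, namely the edge of statement S9 that accumulates y1 along the n-direction:

```c
#include <stdio.h>

#define NDIM 4  /* dimension n of the iteration space (i, j, m, n) */

typedef struct { const char *var; } node_t;            /* one node per variable   */
typedef struct { int src, dst; int d[NDIM]; } edge_t;  /* edge + dependence vector */

int main(void) {
    /* fragment of the bilateral filter RDG of Example 2.1.1:
       z feeds y1 within the same iteration (d = 0), and statement S9
       accumulates y1 along the n-direction (d = (0,0,0,1)) */
    node_t v[] = { { "z" }, { "y1" } };
    edge_t e[] = {
        { 0, 1, { 0, 0, 0, 0 } },
        { 1, 1, { 0, 0, 0, 1 } },
    };
    for (unsigned k = 0; k < sizeof e / sizeof e[0]; k++)
        printf("%s -> %s, d = (%d, %d, %d, %d)\n",
               v[e[k].src].var, v[e[k].dst].var,
               e[k].d[0], e[k].d[1], e[k].d[2], e[k].d[3]);
    return 0;
}
```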


Figure 2.2.: (a) Bilateral filter, (b) reduced dependence graph.

A dependence vector $d_{ij} \neq 0$ indicates a dependence between variables across different iterations. The extended version of the RDG used in our design flow contains different types of nodes and edges, each with attributes denoting the variable, the functionality, the iteration space, and the indexing function [86]. The reduced dependence graph of the bilateral filter algorithm in Example 2.1.1 is shown in Figure 2.2(b). Each node is associated with an equation and the corresponding operation. The edges are annotated with dependence vectors, which denote inter-iteration dependencies. The reduced dependence graph is a compact graph representation of the corresponding loop program formulated by recurrence equations. It is needed for solving the local allocation and scheduling problem. In the next section, we describe the fundamentals of the modular modeling of applications with multiple communicating loops.

2.1.2. Specification of Communicating Loop Nests

Modularity is advocated for application specification: an application is divided into smaller parts, which are scheduled separately. The modular application representation not only exhibits task-level parallelism explicitly, but is also intuitive and allows good software engineering practices. The lack of analyzability and expressiveness of sequential languages like C, C++, and Java led to research on Models of Computation (MoC). A MoC denotes the semantics of the interaction between modules or components.


Figure 2.3.: Multi-rate SDF system of a DAT-to-CD rate converter, which converts a 44.1 kHz sampling rate to 48 kHz.

A MoC is also helpful for system synthesis, analysis, verification, and optimization. Therefore, there has been a significant amount of work on models of computation for representing data and control intensive applications. Well-known MoCs are differential equations, Petri nets, Statecharts, communicating sequential processes (CSP), and many others [25]. For modeling compute intensive streaming applications, the focus has been on data flow MoCs like synchronous data flow (SDF) [121], cyclo-static data flow [18], multi-dimensional SDF [138], Kahn process networks [104], and their variants. The fundamental structure of data flow models of computation is a graph, where the nodes represent processes called actors and the edges represent data communication in terms of so-called tokens. The functionality of the actors can be specified in an arbitrary programming language.

The need for modelling digital signal processing applications for software systems inspired SDF in the seminal work by Lee et al. [121]. An SDF graph is a network of synchronous nodes which models a fixed production rate p and consumption rate c of tokens to specify the relative sample rates of each process in a signal processing system (see Figure 2.3). The basic idea is that a process reads and writes a fixed number of tokens (c and p, respectively) each time it fires. One can determine a static schedule and buffer requirements, and analyze properties like deadlocks, by solving the balance equations.

The major disadvantage associated with system level modelling using SDF is the lack of a representation for multi-dimensional data arrays, which are ubiquitous in streaming image and video processing applications. Therefore, an extension called multi-dimensional SDF (MD-SDF) [138] was proposed, where the edges represent production and consumption rates as vectors $\vec{p}$ and $\vec{c}$ denoting multi-dimensional array transfers. MD-SDF is, however, not able to model the execution order (i.e., the order in which loop iterations are executed). Therefore, it is not able to represent out-of-order communication arising due to blocking or tiling [44]. In order to remove these modelling limitations, windowed data flow (WDF) and its static counterpart windowed synchronous data flow (WSDF) were introduced in [108]. WSDF can model applications characterized by the presence of multiple communicating loops and offers a framework for buffer estimation and communication synthesis. As it will be required in this thesis for modelling applications given by communicating loop nests, we introduce this MoC in detail next.
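To make the balance equations mentioned above concrete before turning to WSDF: for every edge between a producer firing $r_{src}$ times and a consumer firing $r_{snk}$ times per graph iteration, $p \cdot r_{src} = c \cdot r_{snk}$ must hold. The following minimal C sketch (with illustrative rates, not the actual rates of Figure 2.3) computes the smallest integral repetition vector for a chain of actors:

```c
#include <stdio.h>

static long gcd(long a, long b) { return b ? gcd(b, a % b) : a; }

int main(void) {
    /* chain of 4 actors; edge i connects actor i to actor i+1 */
    long p[] = {2, 3, 1};  /* tokens produced per firing of the edge source */
    long c[] = {3, 2, 2};  /* tokens consumed per firing of the edge sink   */
    enum { N_ACTORS = 4 };
    long r[N_ACTORS];
    r[0] = 1;
    for (int i = 0; i < N_ACTORS - 1; i++) {
        /* enforce the balance equation p[i]*r[i] = c[i]*r[i+1]; scale all
           repetition counts found so far to keep r[i+1] integral */
        long g = gcd(r[i] * p[i], c[i]);
        long scale = c[i] / g;
        for (int j = 0; j <= i; j++) r[j] *= scale;
        r[i + 1] = r[i] * p[i] / c[i];
    }
    for (int i = 0; i < N_ACTORS; i++)
        printf("actor %d fires %ld times per iteration\n", i, r[i]);
    return 0;
}
```

For the example rates, the smallest repetition vector is (6, 4, 6, 3); a static schedule then fires each actor that often per graph iteration, which also bounds the required buffer sizes.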

Definition 2.1.5 Windowed Synchronous Data Flow (WSDF) [108] is a data flow graph, i.e., a tuple G = (V, E, D), where V is a set of nodes representing processes, E is a set containing directed edges, and D = (p, v, c, ∆c, δ, b_s, b_t, O^write, O^read) is a set of data labels that are partial functions from E into some specified range of values.

The WSDF graph is explained in detail as follows. Each vertex of a windowed synchronous data flow graph represents a process, also called an actor, whose functionality is described by a nested loop program. Here, too, the functionality can be specified in an arbitrary programming language. The kernel of the loop nest contains the computations, which read from the input ports, process the data, and provide the results on the output ports.

Virtual token, v: Each edge e ∈ E of the data flow graph represents the transport of a multi-dimensional array (also called a virtual token) of size v ∈ ℕ^n from the edge source to its sink, with n ∈ ℕ being the number of dimensions of the transferred variable. Intuitively, it may represent a complete image or a sub-image (for n = 2).

Producer token vector, p, and consumer token vector, c: The n-dimensional arrays being transported are given by a virtual token v. However, they are not produced and consumed as a whole. Instead, each source invocation generates a so-called effective token which consists of p ∈ ℕ^n data elements. It characterizes the fine-grained parallel data transfer of pixels or small array blocks. Each sink invocation, in turn, consumes an effective token whose size is given by c ∈ ℕ^n.

Communication order, O^write and O^read: They model the communication order of the source and sink loop, respectively. These functions contain a list of vectors which defines a sequence of source and sink invocations, defining a block hierarchy of write and read operations. For example, a simple raster scan order of the read execution of an s × s image is represented as O^read = [(s,1)^T, (s,s)^T].

Window movement sampling vector, ∆c, and data offset vector, δ: The window movement sampling vector ∆c gives the distance between two consecutive sink invocations. This is useful for modelling upsampling and downsampling operations. The data offset vector δ represents a multi-dimensional array that denotes the minimum amount of data tokens required to start the computation. It is required for modelling feedback loops [106].

Boundary extension vectors, b_s and b_t: The vectors b_s and b_t represent the virtual border extensions for dealing with boundary handling methods, which arise when pixels on the border strip do not have enough neighbouring pixels to process the filter algorithm. They are used to model zero padding and symmetric extensions for image processing applications like the discrete wavelet transform [28].

Several applications like binary morphological reconstruction, the lifting-based wavelet kernel, JPEG2000, etc. can be represented in WSDF (see [106]). In order to supplement the understanding of the above introduced concepts, we consider the following example.



Figure 2.4.: (a) The shaded pixels show the pixels already processed in different stages of the accelerator pipeline, (b) the graph denotes the windowed synchronous data flow notation; all edges of the graph have the same values of p, c, v, O^write, O^read, ∆c, δ, therefore only a single edge has been annotated, and only the boundary extension vectors b_s, b_t differ, (c) non-localized version of the 3 × 3 convolution loop program (actor Conv3x3), (d) the same loop program with localized dependencies and boundary handling.


Example 2.1.2 A Stereo Depth Extraction (SDE) algorithm finds the distance of objects from the camera with the help of two fixed cameras [35]. The application consists of several convolution filters followed by a sum of absolute differences (SAD) computation. The SDE algorithm represented in WSDF notation is shown in Figure 2.4(b), which illustrates the communication parameters per edge. The chosen mapping leads to a rate-matched production and consumption of a single pixel. Therefore, p = c = (1,1)^T denotes the effective tokens contributing to the transport of the complete image array, v = (512,512)^T, for all edges. Similarly, the execution order of the computation nodes is raster scan order (i.e., row major). Therefore, O^write = [(512,1)^T, (512,512)^T] and O^read = [(512,1)^T, (512,512)^T] represent the communication orders of all edges. The sliding window moves by one pixel in the horizontal direction and one pixel in the vertical direction (hence, ∆c = (1,1)^T). As in realistic applications, the sliding window can transcend the image boundary, in which case the pixel array is virtually extended with a border, as illustrated in Figure 2.4(a). The data offset vector δ is a zero vector as there is no feedback loop.

A modular model of computation, WSDF, has thus been chosen for representing applications. In the next section, we present other approaches for the modelling of loop programs and types of concurrency.

2.1.3. Related Work

Loop nests and recurrence equations are popular models for representing computationally intensive algorithms. Most high-level synthesis (HLS) tools use loop nests in imperative languages for specification. In order to support loop optimizations and parallelization, they build on compilers like gcc [76], LLVM [119], and Open64 [143], which use a so-called static single assignment (SSA) form as intermediate representation. However, SSA is restricted to the basic block level and does not apply to the loop level. Therefore, almost all high-level synthesis tools rely on loop unrolling to extract parallelism.

Unlike other HLS methodologies, we rely on recurrence equations (DPLAs) for the representation of loop kernels. The major advantage is the explicit representation of parallelism, which allows powerful transformations like tiling and scheduling, leading to a higher quality of results (QoR). Karp et al. first described dependence vectors and systems of uniform recurrence equations in their seminal work on structured computations [105]. Since then, several works have contributed to the modelling of recurrence equations through iteration dependent conditions (RIAs) [151], affine dependencies (SAREs) [150], representation of iteration spaces as LBLs (PLAs) [164], reductions (AIAs) [58], and data-dependent conditions (DPLAs) [95]. We refer to [86] for more details on the evolution of modelling by recurrence equations.

A restricted class of loop nests called static control parts (SCoPs) [13] can be converted to recurrence equations. Feautrier [65] introduced an algorithm for converting

these loop nests into single assignment form as a set of recurrence equations. In recent years, there have been novel ideas on the transformation of loop programs to single assignment form using the polyhedral model, which are being integrated into industry-grade compilers [148].

Application modelling The traditional categories offering concurrency in programming languages are based on the usage of (a) actor/process networks, (b) data-parallel/SPMD (single program, multiple data), or (c) shared-memory/dynamic threads.

The design methodology in high-performance embedded computing has traditionally used actor/process networks to compose complex systems from communicating agents. The cumbersome access to shared and mutated state in the shared-memory and data-parallel programming models is adequately compensated by the safe and intuitive programming with communicating components in actor/process networks. A huge body of research on data flow graphs and tools like Simulink [99] and Ptolemy [25] supports actor-based modelling.

Communicating sequential processes (CSP) [96] and Kahn process networks (KPN) [104] model applications where components are sequential processes that run concurrently. Programming languages like Handel-C [159] and Occam [154] are based on the CSP model of computation. In KPN, the communication takes place over unbounded FIFOs with blocking reads. ESPAM [142] is a design flow based on KPN for synthesizing multiprocessor systems-on-chip (SoC). DoL [173] is a software framework, also based on the Kahn process network model of computation, for specifying applications for multi-processor systems. A mix of XML and C programs is used to specify the process network.

Data flow models like SDF and their variants are good for representing applications without complex control flow. Several hardware and software programming systems like StreamIT [174], PeaCE [84], and others are rooted in the SDF MoC and its extensions. One can also use other data flow models of computation describing communicating loops like multi-dimensional synchronous data flow (MD-SDF) [138] or windowed synchronous data flow graphs (WSDF) [108]. WSDF is a framework built on MD-SDF which can undertake memory estimation and communication synthesis.

The task-level parallelism of communicating loops was also represented as a polyhedral reduced dependence graph (PRDG) for stream-based function (SBF) networks by Rijpkema et al. in [153]. The polytope model was used to describe applications consisting of communicating loops in [27]; that work concentrated on efficient data access and storage for programmable processors. Alpha [178] is a single assignment language based on the polyhedral model of compilation, which is targeted at the synthesis of systolic arrays for corresponding loop programs.

The Kahn process networks (KPN) and data flow graphs are powerful models for

representing and mapping communicating loops onto parallel hardware. One can determine a static schedule and buffer requirements, and analyze properties like deadlocks. Powerful synthesis methods have been developed in the polyhedral model, leading to a high quality of results (QoR). There are very few works which try to exploit the best of both worlds, i.e., the analysis and communication synthesis capability of data flow models, and the high-QoR synthesis in the polytope model, respectively. The interaction between these two domains in the context of multi-dimensional modelling has received little consideration. Keinert et al. utilized analysis in the polyhedral domain for memory estimation and the synthesis of channels in WSDF graphs [106]. Sen et al. studied the problem of converting affine nested loop programs modeled as polyhedral reduced dependence graphs (PRDG) into binary parametrized cyclo-static data flow graphs [38]. The Array-OL specification model has been projected onto KPN for mapping onto distributed architectures [5].

The related work on loop nests, recurrence equations, different categories of concurrency, and some models of computation has shown that there are different approaches for modelling applications containing loop programs. In this thesis, we propose an intermediate modular representation of communicating loops in the polyhedral model called the loop graph. We also propose a methodology for converting a loop graph into a corresponding windowed synchronous data flow representation. The data flow model is then leveraged for the synthesis of HW/HW communication and HW/SW communication for communicating loop accelerators that can be synthesized automatically for each loop nest.

2.2. A Generic Accelerator Scheme

A loop accelerator is a specialized processor which executes computationally intensive tasks specified by a loop nest with high performance and in a power-efficient manner. Often, applications are characterized by multiple communicating loops. In this case, a pipeline of accelerators is needed for the custom execution of the application. It must be noted that we use the term loop accelerator both for a single accelerator and for a subsystem consisting of a pipeline of accelerators. The accelerators can also act as co-processors to a general purpose processor. Therefore, custom integration into a system-on-chip (SoC) is an important aspect of the whole architecture. In the following subsections, we (a) classify and characterize a single accelerator architecture model and (b) focus on accelerator pipeline interface architectures for system integration.

2.2.1. Characterization and Classification of Loop Accelerators

Accelerators may vary in granularity, ranging from dedicated hardware circuits to programmable coarse-grained processors like PACT-XPP [14] and WPPAs [55]. A scheme of an accelerator-based system architecture developed in the course of this thesis is shown in Figure 2.5.



Figure 2.5.: (a) Overall system architecture of an SoC containing multiple communicating loop accelerators, (b) a generic accelerator engine.

The architecture components of the accelerator can be broadly divided into three categories: data path, control path, and storage. The computation unit of an accelerator may consist of several (non-)programmable and/or reconfigurable processor elements (PEs). The data paths of the PEs contain functional units (FUs), registers, and local memory. The binding of operations of a given loop kernel onto the limited resources of a PE (i.e., resource sharing) requires a local control path. The control path is used to control the computations in the data path of an accelerator; it contains finite state machine (FSM) logic for controlling multiplexers. In the case of programmable accelerators, the local control path consists of a program memory, a program counter, and an instruction decoder. In addition, a global control and counter unit, which controls the global schedule of the loop iterations, is needed. Global control signals are propagated to the PEs through the interconnect network. The memory within accelerator PEs (i.e., registers and local memory) stores data for reuse. Several banks of addressable memory or FIFOs (also known as scratch-pad memory) are available at the border for providing the necessary data bandwidth. The accelerators can also act as co-processors in an SoC system.

Each accelerator may be defined differently depending on the performance, flexibility, and cost trade-off, which is determined by the granularity of the data path, control path, and storage. They can be broadly classified as (a) stream-centric, (b) reconfiguration-centric, and (c) hardware-centric.

Stream-centric accelerator architectures refer to homogeneous or heterogeneous multiprocessor chip architectures. They are characterized by the presence of many processor cores and single-precision or double-precision floating-point units, and support

multi-threading. The popularity of digital signal processors (DSPs), the Cell broadband engine [103], GPUs like Tesla [124], Larrabee [82], and others supports their viability. They are usually used as coprocessors or add-on boards in supercomputing and mainstream computing.

Reconfiguration-centric accelerators refer to dynamically reconfigurable processors, also called coarse-grained reconfigurable arrays. These architectures are characterized by the presence of multiple lightweight PEs with ALUs, multipliers, and multiple contexts for reconfiguring the processor at run-time. Several coarse-grained array architectures have been developed both in academia and in industry, like PACT XPP [14], ADRES [128], the Montium Tile Processor from Recore Systems [152], DRP from NEC [136], MorphoSys [160], WPPA [55], and many others. These architectures have their particular idiosyncrasies with respect to the degree of programmability, memory structure, and communication bandwidth. A brief survey on dynamically reconfigurable accelerator architectures is presented in [4].

The hardware-centric category includes dedicated loop accelerators. Unlike the previous categories, these accelerators are non-programmable. They may implement a computationally intensive application with far higher performance and power efficiency than a programmable/reconfigurable processor. They exploit loop-level parallelism through a processor array architecture with PEs organized in a grid structure such as shown in Figure 2.5. Each PE has a local control path, registers for storage, and delay registers as interconnect memory.

The architecture model of a hardware accelerator contains information on the resources and their binding possibilities in a library. Each component of such a high-level resource library of functional units (adders, ALUs, multipliers, comparators), connectors (multiplexers, demultiplexers), and memories (registers) is annotated with its area, power, bit-width, latency, and pipeline rate. An overview of typical resources is given in Table 5.2 on page 153 in Chapter 5. We use the PAULA language [94] not only for specifying loop behavior but also for describing the architecture model. We refer to [94] for the definition of architecture models and loop programs.

There has been a plethora of handcrafted accelerator IPs for different applications in signal and image processing, wireless, and other domains. The accelerators can be realized on field programmable gate arrays (FPGAs) or as application-specific integrated circuits (ASICs). These platforms can also realize the complete SoC.

After having described different types of loop accelerators, we concentrate on hardware-centric accelerators, which implement a loop program with the highest efficiency and can be realized either on ASICs or FPGAs. In the next section, we look at aspects of accelerator generation for applications consisting of multiple loops.

2.2.2. Accelerator Subsystem for Streaming Application

Computationally intensive applications are typically characterized by multiple communicating loops, such as the stereo depth extraction (SDE) algorithm shown in Example 2.1.2 and Figure 2.4.

Apart from the high-level synthesis of a hardware loop accelerator, two other aspects are important.

Firstly, in the case of a custom mapping of each given loop nest to a hardware accelerator, it is not possible to implement different loops on the same piece of hardware. The favored approach is to have a pipeline of hardware accelerators connected over a dedicated communication subsystem, as also shown in Figure 2.5. FIFOs or SRAMs can be used to implement the communication memory. The synchronization between the individual accelerators can take place with the help of a timing controller or blocking read/write operations. The channel synthesis depends on the throughput, scheduling, and algorithm parameters of the source and target loop accelerators, as they determine memory mapping, parallel data access, and out-of-order communication. Using these parameters, one can undertake not only memory estimation, but also custom memory architecture synthesis and address generation. In the final step, the overall system is assembled by instantiation of the generated hardware accelerators interconnected by channels. We discuss different approaches of related work in the field of accelerator synthesis in Section 2.5 on the basis of support for multi-rate systems, data reuse, out-of-order communication, parallel data access, and memory mapping. As will be shown later, the major shortcoming of other works based on multi-dimensional models of computation (KPN, SDF, and others) is that none of them solves or supports all of these requirements in their entirety. In [106], the WSDF model has been used for buffer analysis and communication synthesis. We leverage this model of computation for the communication synthesis of communicating accelerators in Chapter 4.

Secondly, the accelerators or accelerator pipelines are often integrated into a single SoC. In order to support communication and synchronization between processors and the accelerators, software device drivers depending on the schedule and algorithm parameters of the accelerators need to be generated. The integration also requires the generation of a hardware wrapper for interfacing the system to the accelerator over the interconnect of choice (bus, P2P, or NoC). In Chapter 4, we introduce our methodology for the automated synthesis of the memory map, software driver, and interface circuits under the aegis of system integration.

Despite the classification into different accelerator architectures, the basic structure of multiple PEs, typically organized in a mesh structure, including distributed local memories and control paths, with memory banks girdling the accelerator data path, is similar for all accelerator categories. The mapping process of applications onto stream-centric/reconfigurable processors is similar and often referred to as multi-core compilation, whereas the generation of a hardware accelerator circuit for a loop program is achieved by high-level synthesis. The main difference between both is that in multi-core compilation, the application is mapped onto a given architecture, while using high-level synthesis, a suitable architecture for the application is generated. In this dissertation, we concentrate on compilation for hardware-centric accelerators (i.e., based on high-level synthesis). The loop iterations can be executed simultaneously on multiple processing elements (PEs) in the accelerator data path (loop-level parallelism).

The complex operations inside the loop kernel for processing a single loop iteration can be mapped onto functional units within a PE (instruction or operation level parallelism).

To summarize, the task of high-level synthesis tools is not only to generate a hardware accelerator from a high-level description of a loop nest; the twin tasks of communication synthesis and system integration are also necessary for the platform-based synthesis of the accelerator subsystem for streaming applications. In the next section, we give an overview of the fundamentals of our design flow for high-level synthesis.

2.3. High-level Synthesis of Hardware Accelerators

In the previous section, we identified the problem of generating hardware-centric accelerators as our subject of interest. Now, the problem of automatically generating a hardware accelerator from a specification of a loop nest in a high-level language will be solved through high-level synthesis (HLS). In this section, we give a primer on our HLS-based design flow (front end and back end). Figure 2.6 shows the design flow called PARO [53, 92] for generation of hardware accelerators. The front end deals with source-to-source loop transformations, scheduling and allocation. The back end is responsible for RTL/code generation for each accelerator including the generation of communication channels and interfaces.

2.3.1. Front End: Loop Transformations

The front end includes the parsing of the input loop algorithm, loop optimizations, allocation, and scheduling. Each loop algorithm is individually specified using the PAULA language [94]. This language is designed to specify a loop nest with the properties of dynamic piecewise linear algorithms (DPLAs) as the underlying semantics. The typical design trajectory to generate such an accelerator is shown in Fig. 2.6. In the first step, the input program along with the architecture specification in the PAULA language is parsed. The restrictions on the initial PAULA specification for hardware synthesis are that

• all the equations of the algorithm must be in output normal form¹. The equations must have affine (i.e., Q_j ≠ E) or uniform dependencies (i.e., Q_j = E), where E is the identity matrix. The functions F_i can also be reduction operations like sums or products.

• The iteration space of the algorithm must be a Z-polytope.

¹ A PLA with each equation of the form x_i[I'] = F_i(..., x_j[Q'_ji · I' − d'_ji], ...) is said to be in output normal form according to [164].



Figure 2.6.: PARO design flow for accelerator generation [53, 92].

The high-level synthesis toolbox contains source-to-source program transformations which recast an input loop program with affine dependencies that is not in output normal form into an equivalent DPLA program in PAULA satisfying the above properties. Furthermore, algorithms like multi-rate downsampling that have iteration spaces in the form of linearly bounded lattices can also be rewritten as Z-polytopes. The major aims of the transformations are program simplification and the tuning of the synthesis output for maximizing or minimizing attributes like performance, latency, and others. Some of the major transformations are summarized in the following subsection.

2.3.1.1. Program Transformations

Standard compiler optimizations like common sub-expression elimination (CSE), constant and variable propagation, dead-code elimination, operator strength reduction, expression splitting, and others [137] are available. They lead to improved register and functional unit usage through the elimination of redundant code and the optimization of arithmetic expressions. The CSE transformation identifies identical expressions and replaces them by a common variable. Dead-code elimination removes redundant code which is not used in the program. Operator strength reduction replaces costly operations like integer divisions and multiplications with a combination of shifts, adds, and subtracts.
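To make two of the listed optimizations concrete, the following minimal C sketch shows CSE and operator strength reduction on an invented loop kernel; it is an illustration, not output of the PARO toolbox.

    #include <stddef.h>

    /* Before: the sub-expression (a[i] + b[i]) is computed twice, and the
       multiplication by 8 is comparatively costly. */
    void kernel_before(const int *a, const int *b, int *y, size_t n) {
        for (size_t i = 0; i < n; i++)
            y[i] = (a[i] + b[i]) * 8 + (a[i] + b[i]);
    }

    /* After CSE and operator strength reduction: the common sub-expression
       is computed once, and the multiplication by a power of two becomes
       a shift. */
    void kernel_after(const int *a, const int *b, int *y, size_t n) {
        for (size_t i = 0; i < n; i++) {
            int t = a[i] + b[i];   /* common sub-expression held once */
            y[i] = (t << 3) + t;   /* 8*t rewritten as a shift and an add */
        }
    }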


A major restriction of our design flow is that the equations specifying a dynamic piecewise linear algorithm must be in output normal form. The transformation output normal form (ONF) converts an equation to its output normal form. It involves affine transformations of the index space and the dependencies. For example,

\[
  x[i+1, j+1] = a[i+1, j] \cdot b[i, j-1]
  \;\;\longrightarrow\;\;
  x[i, j] = a[i, j-1] \cdot b[i-1, j-2]
\]

Loop perfectization is a transformation which converts non-perfectly nested loops into perfectly nested loops. There also exist affine loop transformations like loop skewing, embedding, and others [158] which can be modeled in the polytope model. The I/O variables in a DPLA description of the nested loop algorithm with different numbers of indices need to be embedded into an index space with the same dimension as the iteration space. Mathematically, embedding can be described by an affine transformation Λ · I + γ such that a variable y is embedded into the index space as y[Λ · I + γ] [164].

Loop unrolling is a major optimization transformation which exposes parallelism in a loop program. Loop unrolling by a factor n expands the loop kernel by enumerating n − 1 consecutive iterations.

Localization is a transformation which converts affine data dependencies into uniform data dependencies by the propagation of variables from one iteration point to neighboring iteration points [164, 126] (a sketch is given at the end of this subsection). The transformation enables maximum data reuse within the processor array and hence minimizes the amount of external I/O communication (with peripheral memory) by replacing broadcasts with short propagations.

Expression splitting is applied for the decomposition of complex expressions. After its application, the program has equations with expressions containing only a single function and its arguments on the right hand side.

After parsing the program and applying transformations, certain pre-processing steps need to be carried out before the generation of a reduced dependence graph. Each operation like "+" or "∗" in the PAULA language has to be replaced by a corresponding function add(a,b) or mul(a,b). The expression tree of each equation is parsed recursively. Subsequently, using the function declarations in the architecture model and the node attributes of the expression tree, each node is assigned a return data type. This information on the return data type and function name becomes a node attribute in the reduced dependence graph. After replacing operations by functions, propagating data types, and expression splitting, a reduced dependence graph (RDG) can be generated. The nodes of the RDG are annotated with binding possibilities, return data type, equation number, node type, variable name, and iteration space. The most important transformation in our design flow, tiling, is discussed in the next subsection.
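As the promised sketch of localization, consider a coefficient that is broadcast along one loop dimension; the following C fragment (invented variable names, not PAULA/PARO output) shows the broadcast being replaced by a short propagation:

    #define N 8
    #define M 8

    /* assumed inputs/outputs, invented for this sketch */
    int a[M], u[N][M], y[N][M], a_loc[N][M];

    void before_localization(void) {
        /* a[j] is broadcast: the affine dependence lets every row i
           read the same external coefficient. */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < M; j++)
                y[i][j] = a[j] * u[i][j];
    }

    void after_localization(void) {
        /* the coefficient is read from memory only at i == 0 and then
           propagated with the uniform dependence (i,j) <- (i-1,j). */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < M; j++) {
                a_loc[i][j] = (i == 0) ? a[j] : a_loc[i - 1][j];
                y[i][j] = a_loc[i][j] * u[i][j];
            }
    }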


(a)
u[8][8]; // input image
y[8][8]; // output image
for (j=0; j<8; j++)
  for (i=0; i<8; i++) {
    y[i][j] = g*u[i][j] + o; // S1(i,j)
  }

(b)
u[8][8]; // input image
y[8][8]; // output image
for (j2=0; j2<4; j2++)       // outer tile
  for (i2=0; i2<4; i2++)
    for (j1=0; j1<2; j1++)   // inner tile
      for (i1=0; i1<2; i1++) {
        y[i1][j1][i2][j2] = g*u[i1][j1][i2][j2] + o; // S1'(i1,j1,i2,j2)
      }

Figure 2.7.: (a) A perfectly nested loop nest and its iteration space, (b) Corresponding loop nest and its iteration space after tiling.

2.3.1.2. Tiling

Tiling is a transformation which enables matching a loop algorithm implementation to the accelerator architecture. It covers the iteration space of the computation using congruent hyperplanes or parallelepipeds called tiles. Through the selection of the tile size and the tiling strategy, one decides the global allocation of processor elements, the memory usage, and the communication bandwidth.

The transformation, also known as loop tiling or partitioning, has been studied in detail for compilers. Its use has led to efficient loop implementations through better cache reuse on sequential processors, and to implementations of algorithms on parallel architectures from supercomputers to multi-DSPs and FPGAs [189, 158]. For hardware accelerators in the form of processor arrays (PAs), it is carried out in order to match a loop nest implementation to resource constraints such as the number of processors and the processor array dimension.

We illustrate the transformation with a gain-offset algorithm, a perfectly nested loop without inter-iteration dependencies. Figure 2.7(a) shows the loop nest and the corresponding iteration space. Figure 2.7(b) shows the corresponding loop nest after tiling the iteration space with a tile given by the tiling matrix P = diag(2, 2).

Mathematically, given a tiling matrix P, the iteration space is decomposed as follows:


\[
  \mathcal{I} \;\mapsto\; \underbrace{\mathcal{I}^1}_{\text{(inner tile)}} \oplus \underbrace{\mathcal{I}^2}_{\text{(outer tile)}} , \quad \text{where}
\]
\[
  \mathcal{I}^1 \oplus \mathcal{I}^2 = \{\, I = I_1 + P \cdot I_2 \mid I_1 \in \mathcal{I}^1 \,\wedge\, I_2 \in \mathcal{I}^2 \,\wedge\, P \in \mathbb{Z}^{n \times n} \,\} \tag{2.8}
\]

I^1 and I^2 denote the iterations inside a tile and the origins of the tiles, respectively. Different parallelization strategies can be undertaken. Well known among them are the local sequential and global parallel (LSGP) scheme (also known as outer loop parallelization) and the local parallel and global sequential (LPGS) scheme (also known as inner loop parallelization) [24, 164]. These strategies allocate the iterations to processor elements in a processor array and are discussed in detail in Section 2.3.2.1.

The twin problems of loop tiling are: (a) How do we obtain the tiled iteration space? (b) How do the statements inside the loop kernel have to be modified? Furthermore, hierarchical tiling might be necessary for matching loop programs to architectures with multiple hierarchies of memory and parallelism. Therefore, both problems need to be solved in the context of hierarchical tiling. In [164], the problem is solved for a single level of partitioning that encompasses both LSGP and LPGS. In [58], co-partitioning is introduced as 2-level partitioning, which solves the code generation problem only in the context of uniform dependencies. Note that there is no existing work which solves both problems in the context of generic hierarchical tiling. In this thesis, we introduce a methodology for (a) the tiling of the iteration space and (b) the code generation of statements for the tiled loop in the context of recurrence equations in Section 3.1.2. This enables automated hierarchical tiling.
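As a worked instance of Equation (2.8) for the tiling of Figure 2.7, where P = diag(2, 2) covers the 8 × 8 iteration space, the iteration point I = (5, 3)^T decomposes uniquely as

\[
  I = \begin{pmatrix} 5 \\ 3 \end{pmatrix}
    = I_1 + P \cdot I_2
    = \begin{pmatrix} 1 \\ 1 \end{pmatrix}
    + \begin{pmatrix} 2 & 0 \\ 0 & 2 \end{pmatrix} \begin{pmatrix} 2 \\ 1 \end{pmatrix} ,
\]

i.e., it is iteration I_1 = (1, 1)^T inside the tile with origin index I_2 = (2, 1)^T.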

2.3.2. Front End: Scheduling

The designer needs to deal with the problem of scheduling, which assigns loop iterations and loop kernel operations to time steps for execution on the accelerator's PEs. Furthermore, the loop computations must be allocated to the processing elements with optimal granularity in terms of computation intensity, data parallelism, and locality.

The derivation of latency-optimal static schedules in our design flow is based on the formulation and solution of a mixed integer linear program (MILP) [86]. Particular to our method is a holistic approach, where the schedule within a processor element (local schedule) and the schedule between all PEs of the processor array (global schedule) are optimized simultaneously. The global allocation of iterations to processor elements is determined by the tiling strategy, whereas the local allocation of resources within a PE is specified in the architecture model or determined from a prescribed throughput.


2.3.2.1. Global Scheduling and Binding

The intuitive nature and minimal overhead of linear scheduling and allocation functions have made them popular for mapping loop computations. This is realized by using the following affine transformation:

\[
  \begin{pmatrix} p \\ t \end{pmatrix}
  = \underbrace{\begin{pmatrix} Q \\ \lambda \end{pmatrix}}_{\Theta} \cdot I
  + \begin{pmatrix} q \\ \gamma \end{pmatrix} \tag{2.9}
\]

which assigns to each iteration point I ∈ I a processor index p ∈ P (allocation) and a time index t ∈ T (scheduling). Here, Q ∈ ℤ^(s×n) and λ ∈ ℤ^(1×n) denote the allocation matrix and the scheduling vector, respectively. The offset vectors q ∈ ℤ^s and γ ∈ ℤ are optional. T ⊂ ℤ is called the time space, that is, the set of all time steps where executions take place. P ⊂ ℤ^s is called the processor space. In the literature, the affine transformation Θ is known as space-time mapping [117], scheduling function and processor allocation [69], and scattering function [13].

The processor allocation requires the determination of Q, which depends on the tiling strategy. The tiling transformation initially decomposes the iteration space into inner and outer tile loops, I ↦ I^1 ⊕ I^2. Depending on the tiling scheme, the elements within a tile have to execute in parallel (LPGS) or sequentially (LSGP), and the tiles themselves will be executed either sequentially (LPGS) or in parallel (LSGP). The advantage of the LPGS method is the minimal requirement of local memory for each PE. However, this benefit is offset by high communication costs and external memory requirements. In LSGP, the communication cost is minimal, but either the amount of local memory or the number of required PEs is controlled by the selection of the tile size. Therefore, the programmer plays a major role in determining the degree of parallelism through tile size and tiling strategy selection. Mathematically, the processor allocation is given by

\[
  p = \underbrace{\begin{pmatrix} E & 0 \end{pmatrix}}_{Q} \begin{pmatrix} I_1 \\ I_2 \end{pmatrix} \;\;\text{(LPGS)}
  \qquad
  p = \underbrace{\begin{pmatrix} 0 & E \end{pmatrix}}_{Q} \begin{pmatrix} I_1 \\ I_2 \end{pmatrix} \;\;\text{(LSGP)}
\]

where 0 and E denote the zero and identity matrix, respectively. I_1 and I_2 each denote n-dimensional iteration vectors for the indexing of iterations within a tile and the origins of tiles, respectively. The allocation matrix Q contains the information on which loop variables are executed sequentially or in parallel. To summarize, Q is a linear allocation matrix which is determined by the tiling strategy (i.e., LPGS, LSGP, or hierarchical tiling) and the tiling matrices. The scheduling function is represented by the affine transformation

\[
  t_i(I) = \lambda \cdot I + \tau(v_i) \tag{2.10}
\]

Here, λ is called the global schedule vector and provides the start time of each iteration point I; τ(v_i) gives the start time of each variable computation within the loop iteration. The overall start time of a variable (node) v_i of a reduced dependence graph G at iteration point I is given by Equation (2.10). The schedule vector λ and τ(v_i) determine the global and local schedule, respectively. In the case of a tiled iteration space with the LSGP or LPGS tiling scheme, the scheduling function may be defined as follows:

\[
  t_i(I) = \begin{pmatrix} \lambda_{par} & \lambda_{seq} \end{pmatrix}
           \begin{pmatrix} I_1 \\ I_2 \end{pmatrix} + \tau(v_i)
  \quad \text{(LPGS)} \tag{2.11}
\]
\[
  t_i(I) = \begin{pmatrix} \lambda_{seq} & \lambda_{par} \end{pmatrix}
           \begin{pmatrix} I_1 \\ I_2 \end{pmatrix} + \tau(v_i)
  \quad \text{(LSGP)} \tag{2.12}
\]

where λ_par and λ_seq are the parts of the schedule vector which correspond to the iteration variables being executed in parallel and sequentially, respectively. For the iterations or tiles which shall be executed sequentially, a sequential order of execution, also called a scanning order, has to be defined. This information may be represented by a loop matrix, which can be given by the user or determined by a heuristic.

Definition 2.3.1 (Loop matrix [164]) R = (r_1 r_2 ... r_s) ∈ ℤ^(s×s) determines the ordering of iteration points within a parallelepiped-shaped tile. The iteration points in the direction of r_1 are mapped side by side. Iteration points in the direction of r_2 are separated by blocks of points in the direction of r_1, and so on. The ordering is similar to a sequential nested loop program where the loop index i_k corresponds to iterations in the direction of r_k. The innermost loop index is i_1, and the outermost loop index is i_s.

Example 2.3.1 In Figure 2.8(a), the tiled iteration space of a gain-offset algorithm is shown. After selecting the loop matrix R and choosing the LSGP strategy, a space-time mapping is obtained as shown in Figure 2.8(b). The allocation implies that the iteration variables i_2 and j_2 are executed in parallel and denote the processor dimensions, whereas i_1 and j_1 are executed sequentially. Similarly for LPGS, the space-time mapping is shown in Figure 2.8(c). The allocation implies that the iteration variables i_1 and j_1 are executed in parallel and denote the processor dimensions, whereas i_2 and j_2 are executed sequentially. The processor array implementations determined by the space-time mappings for LSGP and LPGS are shown in Figures 2.8(d) and 2.8(e), respectively.

A linear schedule may be obtained by solving a latency minimization problem formulated as a mixed integer linear program similar to the one presented in [86, 171]. Latency is defined as the execution time of the loop nest, i.e., the time difference between the first and last loop operation (see Glossary). The minimization of the resulting latency L is the objective function.



Figure 2.8.: (a) Tiled iteration space, (b) space-time mapping and loop matrix for LSGP tiling, (c) space-time mapping and loop matrix for LPGS tiling (R is the loop matrix), (d) processor array for LSGP, (e) processor array for LPGS.

The constraints in the MILP correspond to precedence constraints and sequentialization constraints [86]. On solving the MILP with solvers such as CPLEX [100], one obtains the global schedule λ as well as the local schedule τ and the binding. The details of the integer linear program for determining the schedule can be found in [86]. To summarize, given the RDG, the iteration space, the processor allocation function, and the loop matrix, one can determine a linear global schedule λ and a local schedule τ by minimizing the overall latency using an MILP formulation.
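For intuition, a condensed form of the precedence constraints can be stated for a single uniform dependence; this is a simplified textbook rendering, not the full constraint set of [86]. If x_i[I] depends on x_j[I − d_ji] and w(v_j) denotes the execution time of v_j, then

\[
  \lambda \cdot d_{ji} + \tau(v_i) - \tau(v_j) \;\geq\; w(v_j) ,
\]

which simply states that v_i at iteration I may not start before its operand, computed by v_j at iteration I − d_ji, is available.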

2.3.2.2. Local Scheduling and Resource Binding

The problems of allocation, binding, and scheduling need to be solved in order to obtain the information needed to synthesize a loop accelerator. The allocation assigns to each resource type r_i the number of available instances per processor element (PE), α(r_i). The binding possibility β(v_i) defines the resources (such as adders, multipliers, and other functional units) on which the functionality of the RDG node v_i can be executed. The execution time w(v_i, r_i) and the pipeline rate p(v_i, r_i) of node v_i on resource r_i are given in the architecture model. The term local scheduling refers to the determination of the start time τ(v_i) of each node for the execution of the loop iteration. The local schedule determines the throughput, which depends on the allocation of functional units. The so-called iteration interval is the time between consecutively scheduled loop iterations and is defined as follows:

Definition 2.3.2 (Iteration interval): The iteration interval II is the number of time steps (clock cycles) between the evaluation of two successive instances of a variable

from consecutive iterations within the same processing element [171].

Therefore, for scheduling, a suitable II may be chosen, or alternatively, the II may be determined depending on the allocation of FUs. In order to specify resource constraints, a so-called resource graph has to be specified, which expresses the binding possibilities of operations to functional units. The ILP model for determining the global schedule is extended with resource constraints for finding schedules [135, 167, 86]. The instance binding is done afterward using a modified left-edge algorithm [93].
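A useful lower bound on the iteration interval follows directly from the allocation. The bound below is the standard resource bound, stated under the assumption of fully pipelined functional units (pipeline rate 1); it is not taken from [86]:

\[
  II \;\geq\; \max_{r_k} \left\lceil \frac{|\{\, v_i \mid \beta(v_i) = r_k \,\}|}{\alpha(r_k)} \right\rceil .
\]

For example, if a loop kernel contains seven additions and a PE allocates α(ADD) = 2 adders, then II ≥ ⌈7/2⌉ = 4 regardless of the dependence structure.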

Example 2.3.2 Let an iteration space and a loop matrix defining the raster scan order be given. The space-time mapping for the bilateral filter introduced in Example 2.1.1 and shown in Figure 2.1 is as follows:

\[
  \begin{pmatrix} p_1 \\ p_2 \\ t \end{pmatrix}
  =
  \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 1 & 1 & 1 & 1024 \end{pmatrix}
  \cdot
  \begin{pmatrix} m \\ n \\ i \\ j \end{pmatrix}
\]

2.3.3. Back End: Synthesis The back end is responsible for the synthesis of the data path, control path, and communication network of the accelerator. The back end of our design flow for synthesizing an accelerator in form of a processor array entails the following steps:

• synthesis of the processor elements,

• synthesis of the interconnection structure, and

38 2.3. High-level Synthesis of Hardware Accelerators

(a) m (b) 1 2 3 i 2 3 4 ... 1 2 1 2 3 4 1024 3 4 5 DIV S14 S14 n j ... 1 2 ...... 1025 1026 LUT S5 S5

... 1 2 ...... SUB S4 S4

... 1 1 2 2 ... ADD S11S12S11S12

... 1 1 2 2 ... ADD S8 S9 S8 S9 ...... 1 2 ...... MUL S7 S7

... 1 2 ...... MUL S6 S6

... 0 1 2 3 4 5 6 7

Figure 2.9.: (a) Iteration space graph of bilateral filter annotated with dependencies and global schedule, (b) Gantt chart of local schedule of operations inside an iteration executed within each PE respectively with iteration interval of 1.

• synthesis of the control structure. In the following subsections, we give a brief overview of these steps and identify con- trol generation as a challenging problem to be solved in the context of sophisticated tiling techniques.

2.3.3.1. Synthesis of Processor Element The purpose of processor element synthesis is to generate an RTL-level hardware description for each individual processor element. A processor element (PE) consists of the processor core and a local controller. The processor core implements the data path where the actual computations are performed. The synthesis of the processor requires a scheduled reduced dependence graph. During the binding phase, each operation in the loop body was assigned a functional unit (resource) that executes the operation. These functional units are instantiated in the processor core. In case of reuse of functional units, input multiplexers are required in order to select the correct operands at each time step. Just like functional units, the registers may also be reused. Therefore, multiplexers are also required for register sharing. Apart from resource sharing, the selection of input variables also requires multiplexers in case of border handling or iteration dependent conditions. All the control signals for the multiplexers come from the hardwired control path, which is presented in Section 2.3.3.3. The interconnection between the functional units can be directly derived from the reduced dependence graph. The number of time steps an intermediate result must be stored is similar to the interconnect between PEs and can be computed by

39 2. Fundamentals and Related Work

Equation (2.14). After the hardware description of each processor type has been synthesized in an intermediate RTL, the corresponding processor elements can be instantiated in the processor array. The RTL of instantiated units are converted to VHDL which can be synthesized by commercial synthesis place and route tools for ASICs as well as FPGAs.

2.3.3.2. Synthesis of Array Interconnection Structure The processor array interconnection structure is responsible for the propagation of data and control signals within the hardware accelerator. The interconnect architec- ture consist of dedicated links with delay shift registers for each dependency. One can determine the synthesis of the processor interconnection for a data dependency, d ji as implied by a equation, xi [I] = x j I d ji by first determining the processor displacement, d p as − ji   p d = Q I Q I d ji = Q d ji (2.13) ji · − · − · where the allocation matrix is Q. The equation  finds the difference of processor index of PEs executing the source and sink iteration of the data dependency. Then, depending on the schedule vector, λ, the temporal delay for each data dependency is t determined by time displacement, d ji which can be calculated using Equation (2.14).

t d = λ d ji + τ x j + w x j τ(xi) (2.14) ji · − Applying processor allocation and scheduling,  the semantically equivalent space- p t time equation is x j [p,t] = xi p d ji,t d ji . In processor arrays, intermediate re- − − p sults are propagated from sourceh PE, pt to thei target PE, pt +d ji using shift registers, FIFOs or even external memory. The length of these shift registers is given by the t time displacement, d ji.

2.3.3.3. Synthesis of Control Hardware A very important step during the synthesis of processor arrays is to generate efficient control structures that orchestrate the correct computation of the algorithm on the processor array. In the proposed architecture, different types of control signals are generated:

• Iteration dependent control signals: For algorithms with iteration dependent conditionals, each processor can perform different operations, and use different input data depending on the current iteration point I , that means, control signals for functional units and input multiplexers must∈ be I generated depending on the value of I.

40 2.3. High-level Synthesis of Hardware Accelerators

External pixel_center (Output) FIFO pixel_input output pixels SUB LUT Adaptive Mask write enable FIFO full fixed mask MUL External (Input) FIFO sum_pixel MUL

input sum_weights ADD pixels FIFO ADD rd enable empty Computation Kernel

Pipelined Divider Input/Output output FIFO Controller Internal Buffer pixel

enable Global Counter Internal Buffer

Global Controller

Controller Bilateral Filter IP

Figure 2.10.: Bilateral filter accelerator architecture (II=1).

• Local schedule control signals: If functional units are reused for multiple oper- ations in the loop body, input multiplexers must be controlled to select the cor- rect input data. The same applies in case of reuse of internal registers. Multi- functional units like ALUs need to know which operation must be performed at a certain time step.

• I/O control signals: Finally, access to external memory and FIFOs requires that additional control signals and addresses have to be generated.

In order to maintain the scalability of processor arrays, it is crucial that the size of the control path remains nearly constant, regardless of the problem size (i.e., the size of iteration space and the size of the processor array ). In order to handle large scale problems| asI| well as to balance local memory requirements|P| with I/O-bandwidth, and to use different hierarchies of parallelism and memory, one needs a sophisti- cated transformation called hierarchical tiling. Innately, the applications are data flow dominant and have almost no control flow, but the application of hierarchical tiling techniques has the disadvantage of introducing a more complex control flow. In Section 3.2.1, an efficient methodology for the automated control architecture syn- thesis in the context of hierarchical tiling is given. The different steps of processor array synthesis are illustrated with help of our running example of the bilateral filter introduced in Example 2.1.1.

41 2. Fundamentals and Related Work

Example 2.3.3 Figure 2.10 shows the generated hardware accelerator for the bi- lateral filter. It consists of three parts, namely (i) the computation kernel, (ii) the internal delay buffers, and (iii) the controller. The computation kernel performs the arithmetic operations and deploys heavy pipelining (see Gantt chart in Figure 2.9(b)) such that several pixels can be processed in parallel. Similar to the handcrafted de- signs, the internal delay buffers temporarily store the input pixels such that each input pixel is read only once in order to avoid I/O bottlenecks. Their size can be calculated from the schedule and the data dependencies. It may be noted that the size of the line buffers is 1024 + 1 which can be inferred from the algorithm description obtained after the space-time mapping developed in Example 2.3.2. The allocation defines the dimension of processor array as 3 3 which is same as the filter mask dimension. Similarly, the number of line buffers× (=3 1 = 2) can also be inferred from the iteration conditions. The controller is responsible− for keeping track of the image pixels being processed and orchestrates the correct computations and I/O. For this purpose, the global counter generates the coordinates of the currently processed pixel from which the global controller derives the control signals. They select which conditional statement to execute such that correct border processing and I/O access is possible. The pixel coordinates provided by the global counter also allow gener- ation of the correct read, write enable signals for the input and output FIFOs. In case the input FIFO is empty or the output FIFO is full, the I/O controller stops the accelerator.

2.4. Accelerator Design Space Exploration

In this section, we present some related work on design space exploration. Design space exploration in embedded systems refers to the search for design parameters for meeting the application-specific requirements (e.g., chip area cost, power dissi- pation, and performance). The design alternatives depend on the synthesis in differ- ent abstraction layers like logic synthesis, high-level synthesis, software compilation, system synthesis. For example, in our design flow for accelerator synthesis, resource allocation in the architecture model and tiling parameters have a major influence on accelerator’s area, power dissipation, and performance. Another important observa- tion is that the exploration problem is characterized by the presence of millions of design alternatives due to a large architecture/compiler parameter space. Therefore, exhaustive search is expensive and intelligent heuristics like genetic algorithms and others are used for fast exploration of the design space. There has been a lot of research work on efficient design space exploration at dif- ferent levels of abstraction. Very often, these approaches are integrated in the design flow for synthesis. In the field of software compilation, iterative compilation and auto-tuning approaches have been used to match loop algorithms to a given pro- cessor architecture with multiple memory hierarchies [113, 181]. In these methods,

42 2.4. Accelerator Design Space Exploration different versions of the loop program depending on loop unrolling, loop blocking, or other transformation parameters are generated, and timed on the processor to find the optimal code. These procedures lead to superior results in terms of performance portability across different multi-processor architectures [182]. Recently in [149], a genetic algorithm that leverages a polyhedral compiler-in-loop, was used to traverse the loop transformation search space. The idea of exploring architecture and com- piler parameter space simultaneously for configurable application-specific instruction set processors (ASIPs) has been studied in [70]. For design space exploration, effi- cient pruning techniques based on Pareto-dominance and binary search have been combined to explore a vast search space of processor parameters such as register files, number/type of functional units, and compiler optimizations. In [2], an efficient architecture and compiler co-exploration based on analytical models and Pareto sim- ulated annealing algorithm has been presented for single processor architectures. At system level, genetic algorithms have been used to efficiently explore a parametrized system-on-chip architecture to find Pareto-optimal configurations in a multi-objective design space in [144, 63].

For accelerator architectures, an automated design space exploration for embed- ded computer systems consisting of a VLIW processor and/or a customized sys- tolic array accelerator is presented in [1]. Here, exhaustive search methods for sub- problems combined with heuristics for walking the design space are presented. Pilato et al. [145] use evolutionary algorithms for area and throughput optimization for ac- celerators realized on FPGAs, considering only resource allocation. In [161], an iterative exploration considering architectural constraints and as compiler transfor- mation loop unrolling is presented. Banerjee et al. target loop transformations and architecture specification for the exploration of accelerators generated from initial specifications in Matlab [9]. They also undertake an iterative compilation approach, but do not consider modern heuristics like evolutionary algorithms. In the ROCCC project, fast area and delay estimators for exploration of accelerators generated from C specification for fast iterative compilation are presented [118].

In all the above-mentioned works, the system constraints and the workload behavior are not taken into account. Furthermore, many of the mentioned related works have not investigated the use of modern heuristics like evolutionary algorithms for design space exploration. In Chapter 5, an efficient search of the design parameter space, consisting of the accelerator architecture allocation and compiler transformations, based on modern heuristics like multi-objective evolutionary algorithms is presented. Furthermore, another novel contribution is that we incorporate real-time calculus for the dimensioning of accelerators according to given system contracts, which enables performance analysis and dimensioning of communicating loop accelerators.
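The multi-objective comparisons underlying such searches rest on Pareto dominance; the following sketch shows the test for design points with three objectives (the struct and its field names are illustrative assumptions, not taken from any of the cited works).

#include <stdbool.h>

/* A design point with three objectives to be minimized. */
typedef struct { double area, power, latency; } DesignPoint;

/* a dominates b if a is no worse in every objective and strictly better in one. */
bool dominates(const DesignPoint *a, const DesignPoint *b) {
    bool no_worse = a->area <= b->area && a->power <= b->power &&
                    a->latency <= b->latency;
    bool better   = a->area <  b->area || a->power <  b->power ||
                    a->latency < b->latency;
    return no_worse && better;
}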


Figure 2.11.: (a) ROCCC accelerator architecture template, (b) PICO non-programmable accelerator connected to a custom VLIW processor.

2.5. High-level Synthesis Tools

The major aim of this thesis is to obtain a full-fledged design flow for the generation of hardware loop accelerators from a high-level language. In this section, we finally compare some of the existing design flows and methodologies based on specification, synthesis, support of communicating loops, system integration, and design space exploration.

DEFACTO [41]: This design flow converts a sequential C implementation into an RTL description of an accelerator in VHDL using the SUIF compiler front end [183] and a custom back end. It uses a hardware structure called tapped delay line and scalar replacement for data reuse. Furthermore, a parametrizable memory controller can be generated, which is optimized for external memory access. In addition, a custom data layout is undertaken by the compiler, where the array data is distributed across multiple memory banks according to the chosen loop unrolling. It can also implement applications consisting of multiple loops, even over multi-FPGA systems. Until now, there have been no efforts on targeting embedded SoC platforms; therefore, interface synthesis is not an issue. The design space exploration aims at maximizing accelerator performance subject to area capacity constraints, but is not regarded as a multi-objective design space exploration problem.

Streamroller [115]: Kudlur et al. propose a methodology for the generation of accelerator pipelines with prescribed throughput requirements. The resulting data path is similar to a VLIW template, which is generated by a compiler based on the


SUIF compiler front end and the Trimaran compiler as back end. The accelerators are connected over SRAM buffers. The authors are now evolving a methodology [98] for mapping streaming applications described in StreamIt [174] onto FPGA-based architectures. The basic building blocks of the applications are filters and queues implemented on SRAM buffers. The system integration is the task of the designer. Furthermore, design space exploration is not yet considered.

ROCCC [82]: This high-level synthesis tool, formerly known as SA-C [19], converts a restricted subset of C into VHDL. The scalar pipeline data path as shown in Figure 2.11(a) is generated from an intermediate representation produced by the Machine-SUIF compiler system [97]. Outstanding in the design flow is the ability to handle while loops and the smart buffers for storage. The smart buffers facilitate data reuse in loop algorithms through an architecture consisting of dedicated address generators and multiple connected registers. The data communication engine between on-chip block RAM and the inter-chip data streams is not part of the code generation [82]. Therefore, it is the task of the designer to generate hardware and software for the communication between multiple accelerators or between accelerator and processor, respectively. There has been an investigation of the impact of compiler transformations like loop unrolling on objectives like accelerator area, power, and performance. However, systematic design space exploration is not addressed.

MMAlpha [81]: MMAlpha has a long history in leveraging the polytope model for high-level synthesis of parallel hardware accelerators in the form of a processor array. A high-level description of the loops in the ALPHA single assignment language can be manipulated using loop transformations to generate the accelerator. Instead of a scalar pipeline data path, the accelerator data path consists of multiple PEs, and the local controller of each PE orchestrates the computation. In [33], a methodology for the generation of a hardware/software interface for system integration is proposed. However, it is the task of the designer to generate a communication subsystem for multiple loop accelerators. Derrien et al. proposed models for the power estimation of special purpose accelerators in [39]. The authors also find optimal tiling parameters for minimum energy consumption per PE.

PICO [155]: Formerly developed at Hewlett-Packard, introduced by Synfora, and now part of the Symphony C compiler from Synopsys [163]. This tool supports accelerator-RTL generation, communicating loops, and interface synthesis from algorithm descriptions in ANSI C. The overview of an architecture template is shown in Figure 2.11(b), where the VLIW processor is configured to run the sequential code annotated with driver code of the accelerator. The accelerator is a pipeline of processor arrays, which implements communicating loops. It is not able to support complex out-of-order communication. The processor array is generated from a non-programmable accelerator (NPA) template defining a grid of PEs with local memory. It is connected to a VLIW processor and an external memory. Instead of blocking read/write, the communication between NPAs takes place with the help of a timing controller. It was one of the first design flows integrating design space exploration [1].


ESPAM-PICO [176]: The ESPAM design flow has the ability to synthesize multi-processor designs from algorithm descriptions in C [142]. The algorithms are written as static affine nested loop programs (SANLP), which are converted to Kahn process network (KPN) descriptions. Distributed memory is used for the communication between system components. Recently, in [176], the PICO tool was integrated in the design flow to generate the hardware accelerators. It uses a tightly coupled accelerator block (TCAB) wrapper model with read, write, and execute units for the integration of hardware accelerators.

Simulink [170], PeaCE [84]: The design flows PeaCE [84] and Simulink [170] compose complex systems using communicating blocks. This not only helps the user to cope with the application complexity, but also allows easy extraction of the contained task level parallelism. However, hardware synthesis still requires a lot of manual intervention from the user. In particular, the algorithms have to be redesigned in such a way that module communication takes place on the granularity of pixels and exploits data reuse.

Gaspard2 [22]: The Array-OL specification language has been specifically developed to model data intensive signal and image processing applications. The Gaspard project extends the Array-OL language and allows modelling, simulation, testing, and code generation of SoC applications and hardware architectures. Starting from a UML specification, it generates SystemC code. It has not yet been targeted for accelerator generation, although the model can appropriately represent the multi-dimensional application and apply some very sophisticated transformations.

Spark [83]: The SPARK framework was particularly targeted at mapping control intensive functional blocks in signal and image processing onto dedicated hardware.

There are several industry-strength high-level synthesis tools like Mentor Graphics CatapultC [133], Forte Cynthesizer [71], Cadence C-to-Silicon [26], the Altera C2H compiler [120], and the Synphony C compiler [163], which convert C/C++ or SystemC code into an RTL design. AutoPilot [190] follows an approach based on converting a virtual instruction set derived using the LLVM compiler [119]. Bluespec is another impressive design system, which not only targets compute intensive applications, but also control intensive applications. The design entry is done in the object-oriented language SystemVerilog [141]. An excellent overview of the state-of-the-art HLS tools is given in [32].

Table 2.1 gives an overview of the comparison of different design flows for accelerator generation. The differentiation is based on the language, model of computation, loop transformations, hardware accelerator generation, support for data reuse, accelerator generation for communicating loops, interface synthesis for system-on-chip integration, and design space exploration. Our design flow for accelerator synthesis and exploration is able to support all the above aspects in their entirety.


Tools          | Lang.        | MoC              | Loop Trafo.        | HW Gen. | Data Reuse | Com. Loop | Sys. Int. | DSE
DEFACTO        | C            | -                | Unrolling          | X       | X          | X         | ×         | X
Streamroller   | C, StreamIT  | SDF              | -                  | X       | ×          | ×         | ×         | ×
RoCCC          | C            | -                | Unrolling          | -       | X          | X         | ×         | X
MMAlpha        | Alpha        | Polyhedral       | Tiling             | X       | X          | ×         | X         | ×
PICO (Synfora) | C            | Polyhedral       | Tiling             | X       | X          | X         | X         | X
ESPAM          | C            | KPN              | Unfolding          | X       | ×          | X         | X         | X
Gaspard2       | UML          | Array-OL         | Tiling             | ×       | X          | X         | ×         | ×
PARO           | PAULA        | Polyhedral, WSDF | Hierarch. Tiling   | X       | X          | X         | X         | X
Handel-C       | Handel C     | CSP              | Unrolling, Manual  | X       | X          | X         | ×         | ×
Simulink       | Simulink     | -                | -                  | X       | ×          | X         | ×         | ×
Matlab (Match) | Matlab       | -                | -                  | X       | X          | ×         | ×         | ×
Spark          | C            | -                | Unrolling          | X       | ×          | X         | ×         | ×

Table 2.1.: Comparison of design flows for hardware loop accelerator generation in terms of language, model of computation, support for loop transforma- tions, hardware generation, data reuse, communicating loops, system in- tegration, and design space exploration.

2.6. Conclusion

In this chapter, we presented the fundamentals of the design flow for generating hardware accelerators from nested loop programs. The initial specification of loop programs in the polyhedral model can lead to a high quality of results. A modular model of computation (MoC), the windowed synchronous data flow (WSDF) model, is considered for modelling applications consisting of communicating loop programs. Therefore, a major open question is how one converts an application specification consisting of communicating loops in the polytope model to an equivalent WSDF MoC. We also classified different accelerator architectures and identified hardware accelerators as the focus of research. The mapping problem for such an accelerator is solved by means of high-level synthesis. After discussing elementary loop transformations, we pinpointed tiling as a major transformation for matching the parallel implementation to accelerator architectures with multiple levels of parallelism and memory. In this context, a methodology supporting automatic tiling of the loop iteration space and kernel code generation is needed. Subsequently, an overview on global and local scheduling was given, which enables the generation of an intermediate RTL description of the accelerator in the form of a processor array. It has been observed that hierarchical tiling introduces many control conditionals, which are dependent on the iteration variables, thereby increasing the amount of control flow in the scheduled and allocated specification. Hence, a holistic methodology for the generation of a controller engine supporting hierarchical tiling and external communication is important.

In the context of communicating loops, accelerator design involves a set of new challenges: For a proper modelling of the communication behavior, we propose to model the application by the windowed synchronous data flow (WSDF) model. Another major challenge is the dimensioning and the automatic generation of the communication subsystem between the accelerators for hooking up the accelerator subsystem. Furthermore, accelerators are not foreseen as the only system component, but also as co-processors in a system-on-chip. Therefore, system integration is an important issue. Finally, related work on design space exploration has been analyzed with respect to multi-objective optimization and efficient search heuristics. Lastly, existing methodologies were compared in the context of the above-mentioned challenges.

3. Accelerator Generation: Loop Transformations and Back End

In order to meet the orthogonal objectives of high performance, ample flexibility, and low cost, accelerator-based systems-on-chip were identified as quintessential architectures. To realize the full potential of these architectures, the twin problems of the lack of mapping tools for accelerator generation and of automated multi-accelerator system design need to be tackled. In this chapter, we address the important aspects of the first problem, i.e., the optimal mapping for accelerator generation.

The considered accelerator architectures are characterized by multiple hierarchies of parallelism and memories. They have different levels of memory along with a large number of processing elements (PEs), where each PE can further contain multiple functional units. This leads to the problem of accelerator matching, i.e., of matching computationally intensive loop accelerator implementations with the multiple hierarchies of parallelism and memory of the architecture. We introduce a new transformation called hierarchical tiling in Section 3.1. This transformation partitions the iteration space of the loop algorithms with hierarchies of multiple congruent tiles to produce different variants of loop code in the form of a DPLA. Depending on the selected tiling strategy and the corresponding tile sizes, the accelerator resource allocation in terms of PEs, local memory, and memory banks is determined. The principles of some other important source-to-source loop transformations are also given in Section 3.1.

Innately, the considered loop applications are data flow dominant and have almost no control flow. However, the application of tiling techniques has the disadvantage of introducing more complex control and communication flow. Therefore, a generic controller, which orchestrates the data transfer and computation according to the given schedule, needs to be generated. In Section 3.2.1, we present a methodology for the automatic generation of the control engines of such accelerators. The key insight of identifying and separating local and global control signals leads to a considerable reduction in controller cost. Another formidable problem for the integration of accelerators in a system-on-chip is interface synthesis. In this context, a number of questions have to be answered: What should be the number and the size of the local buffers? Accelerators hide the memory latency and provide high I/O bandwidth by including multiple local buffer banks. How does the data transfer to/from the local buffers correspond to the static schedule and the dynamic system behaviour? A controller generates and processes status signals (start, done, empty, full, ...) for the interaction with external components. The I/O controller is also responsible for generating enable signals and addresses, and for evaluating FIFO status flags for the data transfer from the memory to the processor array of the accelerator. The methodology for interface controller and custom memory synthesis is presented in Section 3.2.2.

Loop optimizations are important for realizing accelerators; however, they also lead to the introduction of complex control, which must be realized efficiently by the controller architecture. In this context, the question arises, what is the overhead of the data path, the memory, the I/O channels, and their controller on the accelerator in terms of area, power, and throughput? Often, the performance bottleneck lies in the delay of the critical path in the controller architecture. Also, the area requirements of the controller could be larger than those of the accelerator data path itself in certain cases. Therefore, it is important to study the overhead of the different accelerator subcomponents and its relation to loop tiling and scheduling. Apart from this overhead analysis, we show the benefits of our design methodology in terms of automated loop accelerator generation for a wide class of algorithms in Section 3.3. Finally, the contributions of this chapter are summarized in Section 3.4.

3.1. Loop Optimizations for Accelerator Tuning

In this section, we present some standard loop optimizations and a novel transformation called hierarchical tiling for accelerator tuning. The generic hardware model of a programmable and a non-programmable accelerator is shown in Figure 3.1(a). Accelerators contain several homogeneous processing elements (PEs), which are organized in a one- or two-dimensional grid structure. The PEs contain multiple functional units performing simple add and mul operations as well as special functions like trigonometric or exponential functions. Similarly, the memory model of an accelerator, as shown in Figure 3.1(b), may be hierarchical. The accelerator data is stored in the main memory and can be transferred by the CPU or by a direct memory access (DMA) unit into the local buffers. The border PEs in the accelerator have dedicated access to the FIFO banks for the parallel access of data. Register banks and shift registers enable further data reuse within each PE. The task of efficient programming or generation of such loop accelerators needs to fulfil several optimization goals, including the efficient access of the different memory levels along with the optimal usage of all compute hierarchies. This problem has been well studied in the high performance computing research community for the efficient mapping of loop algorithms onto single processor and multi-processor systems. Several popular compiler optimizations like prefetching, tiling, loop unrolling, and loop permutation have been developed, which lead to several variants of accelerator implementations [137]. These automated optimizations replace manual tweaking of the programs and allow utilizing the accelerator's architecture model effectively. Recently, it has been observed for simple applications like matrix-matrix multiplication that the application of hierarchical tiling (also known as multi-level tiling) in combination with the above mentioned optimizations yields


the best performance results for CPUs, multi-cores, and graphics processors [180, 12]. Here, the key insight is that tiling at different hierarchies enables the targeting of different levels of memories and computation units. In the following section, we revise important loop transformations like loop permutation and tiling. Then, we concentrate on hierarchical tiling and present a new methodology for the automatic hierarchical tiling of loop nests described as dynamic piecewise linear algorithms.

Figure 3.1.: (a) Accelerator hardware model. (b) Accelerator memory model.

3.1.1. Loop Transformations

Many toolchains for converting human-readable high-level languages into massively parallel hardware circuits and parallelized program code are based on standard and advanced compiler optimizations. We use the following running example to illustrate the effect of compiler loop transformations on efficient accelerator synthesis.

Example 3.1.1 The Matrix-Matrix Multiplication (MMM)¹ is a simple and ubiquitous computationally intensive linear algebra algorithm, which offers numerous challenges to a high performance application programmer. The following program code shows its implementation in the programming language C. Each element of the result matrix C[i][j] is the dot product of the i-th row of A and the j-th column of B.

¹The first MEMOCODE 2007 HW/SW co-design contest involved the challenge of writing the fastest co-designed/optimized implementation of a complex matrix-matrix multiplication.


Figure 3.2.: (a) Iteration space graph of matrix-matrix multiplication. (b) Hardware accelerator executing the inner kernel of the matrix-matrix product.

The first step towards accelerating the application is to generate a special hardware circuit for executing the multiply-add instruction in line 6. Therefore, the simplest accelerator consists of a single PE with a MUL and an ADD functional unit.

1 void mmmKernel(Number* A, Number* B, Number* C, int N){
2   int i, j, k;
3   for (i = 0; i < N; i++)
4     for (j = 0; j < N; j++)
5       for (k = 0; k < N; k++)
6         C[i*N+j] += A[i*N+k] * B[k*N+j]; /* multiply-add, row-major layout */
7 }

The accelerator architecture shown in Figure 3.2(b) consists of multiply and add functional units for multiplying and adding the numbers corresponding to the loop kernel. The software iterates over the loop counter variables k, j, and i and is responsible for the data transfer. Therefore, there is an overhead of $3N^3$ read and write operations, as each iteration has two reads and one write operation. This could have been avoided if data reuse within the accelerator had been exploited. A lot of data reuse and parallelization information is lost in the above C description of the algorithm. The following program code (in PAULA notation) represents the MMM algorithm formulated as a DPLA. The matrices A and B are embedded into the iteration space by the equations $a[i,0,k] = A_{ik}$ and $b[0,j,k] = B_{kj}$. The data dependence vectors $(0\ 1\ 0)^T$, $(1\ 0\ 0)^T$, and $(0\ 0\ 1)^T$ represent the internal data reuse for the matrices A, B, and C, respectively. The iteration space is given by $\mathcal{I} = \{I = (i\ j\ k)^T \in \mathbb{Z}^3 \mid 0 \le i, j, k \le N-1\}$. For $N = 8$, the iteration space graph (ISG) is shown with the data dependence vectors in Figure 3.2(a).

program matmul {
  variable A 3 in integer<16>;
  variable B 3 in integer<16>;
  variable C 3 out integer<16>;
  variable a, b, c, z 3 integer<16>;

  parameter N = 512;

  par (i >= 0 and i <= N-1 and j >= 0 and j <= N-1 and k >= 0 and k <= N-1)
  {
    a[i, j, k] = A[i, 0, k]                 if (j == 0);   // read matrix A
    b[i, j, k] = B[0, j, k]                 if (i == 0);   // read matrix B
    a[i, j, k] = a[i, j-1, k]               if (j > 0);    // data reuse
    b[i, j, k] = b[i-1, j, k]               if (i > 0);    // data reuse
    z[i, j, k] = a[i, j, k] * b[i, j, k];
    c[i, j, k] = c[i, j, k-1] + z[i, j, k]  if (k > 0);
    c[i, j, k] = z[i, j, k]                 if (k == 0);
    C[i, j, k] = c[i, j, k]                 if (k == N-1); // write C
  }
}

The two essential rudiments for obtaining higher performance are:

• increasing the PE count of the accelerator, i.e., parallelism

• better memory access through data reuse.

Using the above matrix-matrix product specification, the following subsections show how loop transformations can be leveraged to exploit these two principles.

3.1.1.1. Loop Permutation

A loop permutation may improve the performance through better memory access by utilizing spatial locality. The efficient implementation of computationally intensive algorithms often requires the intermediate storage of matrices, images, or signals. Usually, these large data sets cannot be stored completely in on-chip memory due to size constraints. Therefore, an efficient data transfer between the slower off-chip main memory (DRAM), the faster intermediate on-chip memory (cache, local buffer), and the accelerator is mandatory. An efficient memory controller always exploits spatial locality,


Figure 3.3.: (a) I/O matrices are embedded into a common iteration space, whose values are propagated by reuse vectors. (b) Memory access pattern of the original version (ijk). (c) Access pattern after loop permutation (ikj).

which refers to the fact that the likelihood of referencing a memory location is higher if a neighbouring memory location was just referenced. The naive implementation of the MMM specification leads to low performance because of the inefficient use of data locality. This is illustrated in Figure 3.3(b). For the MMM example in C, there exist six different formulations (ijk, ikj, jik, jki, kij, kji) obtained by loop permutation. The initial version (ijk) refers to the execution of k as the innermost loop and i as the outermost loop. Here, the matrix B is accessed in column-major order, whereas the matrix A is accessed in row-major order. The access pattern of matrix A has a stride of 1. However, the column-major access of matrix B has stride N, which does not exploit spatial locality. Loop permutation is a unimodular transformation, which changes the schedule order of the loop program [10]. The idea of permuting is to switch the inner and the outer loop. Figure 3.3(c) illustrates the effect of the permutation transformation for MMM on the memory access by interchanging the j and the k loop (ikj). It can be seen that all matrices now exploit spatial locality (i.e., the matrices B and C have a stride of 1), which may lead to better overall performance. With respect to the C program, only lines 4 and 5 of the initial code need to be interchanged.

In our methodology, loop permutation is equivalent to a column permutation of the so-called loop matrix (see Definition 2.3.1). This can be obtained by the product of the given loop matrix $R \in \mathbb{Z}^{m \times m}$ with a permutation matrix $P_m \in \mathbb{Z}^{m \times m}$. The permutation matrix is obtained by an elementary column interchange of the identity matrix $I_m$. For the ikj loop order formulation of the running example, the new loop matrix becomes


Figure 3.4.: (a) LPGS tiling of MMM specification by a factor of 4, (b) corresponding accelerator architecture, (c) memory access of the I/O matrices.

$$R_{new} = R_{old} \cdot P_m = \begin{pmatrix} 0 & 0 & N \\ 0 & N & 0 \\ N & 0 & 0 \end{pmatrix} \begin{pmatrix} 0 & 0 & 1 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix} = \begin{pmatrix} 0 & N & 0 \\ N & 0 & 0 \\ 0 & 0 & N \end{pmatrix}$$
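For illustration, a sketch of the permuted (ikj) C variant, obtained by interchanging lines 4 and 5 of the initial code (the flat row-major indexing is an assumption carried over from above):

void mmmKernelIKJ(Number* A, Number* B, Number* C, int N){
    int i, j, k;
    for (i = 0; i < N; i++)
        for (k = 0; k < N; k++)      /* formerly the innermost loop */
            for (j = 0; j < N; j++)  /* innermost: stride-1 access to B and C */
                C[i*N+j] += A[i*N+k] * B[k*N+j];
}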

3.1.1.2. Loop Tiling

Loop permutation exploits the spatial locality of memory accesses, but neither the parallelism of multiple PEs in an accelerator nor the temporal locality in case of large arrays. The concept of temporal locality is that a memory location that is referenced at one point in time will be referenced again sometime in the near future. Loop tiling as a transformation was introduced briefly in Sections 2.3.1.2 and 2.3.2.1. It may increase the overall performance further by exploiting temporal locality and allocating computations onto multiple PEs in an accelerator. This can be observed in Figure 3.3(c) when a large matrix size inhibits the storage of the matrices in an intermediate memory (i.e., cache or local buffer of the accelerator). Each iteration of the i loop requires the storage of the complete B matrix and of a row of matrix C. Therefore, if the intermediate memory cannot store $N^2 + N$ data elements, a large I/O overhead to the main memory is incurred, which may lead to a significant communication bottleneck. In order to exploit temporal locality, the iteration space is divided into congruent tiles of size $s \times s \times s$, the tile size. This reduces the required intermediate storage to $s^2 + s$ matrix elements (e.g., for $N = 512$ and $s = 64$, the requirement drops from 262,656 to 4,160 elements). Tiling or blocking can then also be understood as dividing the computation into $(N/s)^3$ smaller matrix multiplications of matrices of size $s$ in the case of perfect tiling. In terms of the loop program, the loop depth doubles from 3 to 6 (i.e., the number of iteration variables is 6 for the tiled program).

55 3. Accelerator Generation: Loop Transformations and Back End

The loop tiling transformation includes the problems of tiled code generation and of parallelization strategies like LPGS (local parallel, global sequential, also referred to as inner loop parallelization) or LSGP (local sequential, global parallel, often also referred to as clustering, blocking, or outer loop parallelization) [169]. Mathematically, the formal definition of the tiling methodology and strategy is given by Equation (2.8). LPGS tiling, also known as strip mining, is done by splitting a single loop into two nested loops; the inner loop contains the iterations within so-called strips [158], whereas the outer loop steps between consecutive strips. The iterations within the inner loop can be mapped onto independent PEs or vector units. The strips are then processed sequentially according to a given schedule order. In the following C code fragment, a tiled version of the original MMM source code is presented.

void mmmKernel(Number* A, Number* B, Number* C, int N){
    int i, j, k, j1;
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j += 4)          /* steps between strips of width 4 */
            for (k = 0; k < N; k++)
                for (j1 = 0; j1 < 4; j1++)  /* iterations inside a strip: parallel */
                    C[i*N+j+j1] += A[i*N+k] * B[k*N+(j+j1)];
}

This transformation is illustrated in Figure 3.4. The four iterations inside a tile are executed in parallel by the loop accelerator. The LPGS scheme is associated with a minimal local memory consumption of the accelerator PEs and is therefore characterized by a high communication overhead. In LSGP, the communication cost is minimal, but only either the amount of local memory or the number of required PEs can be controlled. Recent research results in tuning high performance implementations enforce the usage of multiple hierarchies of tiling for matching the memory hierarchy and the different levels of parallelism in an architecture [180, 181, 111]. In the next section, we present a source-to-source loop transformation called n-hierarchical tiling, where n is the number of introduced tiling levels.
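Complementing the LPGS fragment above, a minimal sketch of an LSGP-style partitioning of the same kernel (the block count of 4 and the assumption that N is divisible by 4 are illustrative only):

void mmmKernelLSGP(Number* A, Number* B, Number* C, int N){
    int p, i, j, k;
    for (p = 0; p < 4; p++)                          /* blocks: one per PE, in parallel */
        for (i = 0; i < N; i++)
            for (k = 0; k < N; k++)
                for (j = p*(N/4); j < (p+1)*(N/4); j++)  /* local sequential */
                    C[i*N+j] += A[i*N+k] * B[k*N+j];
}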

3.1.2. Hierarchical Tiling

In the previous section, we identified tiling as a major transformation for matching loop nest implementations to architecture constraints. The use of multiple levels of tiling aims at targeting the multiple levels of memory, as in the work of Eckhardt et al. [59]. The authors propose two levels of tiling, called copartitioning, for targeting the foreground memory (i.e., registers) and the background memory (caches, RAM, disk).


The repeated use of copartitioning separately for each level of background memory (i.e., caches, RAM, disk) for the optimization of memory accesses is also proposed, which is a case of hierarchical tiling. Furthermore, state-of-the-art parallel architectures are characterized by multiple levels of parallelism like sub-word parallelism, VLIW parallelism, and multiple PEs, which are clustered in processor arrays. Therefore, a formal definition of hierarchical tiling is needed to target multiple levels of parallelism and memory structures, as they could require more than two levels of tiling. Such tiling techniques require a hierarchy of tiling matrices in order to partition the iteration space, not only for software compilation but also for high-level synthesis. In this section, we explain hierarchical tiling using an example and present a method for automating the transformation.

Copartitioning [57] is an example of a 2-level hierarchical tiling, where the iteration space is first partitioned into LS (local sequential) tiles. This tiled iteration space is then tiled once more using GS (global sequential) tiles. In copartitioning, the iteration points within the LS tiles are executed sequentially. All LS tiles within a GS tile are executed in parallel by the processor array. Therefore, the number of processors in the array is equal to the number of LS tiles within a GS tile. The GS tiles are executed sequentially. Copartitioning uses both the LSGP and the LPGS method in order to balance local memory requirements with I/O bandwidth and has the advantage of problem size independence. Copartitioning is formally defined as follows.

Definition 3.1.1 (Copartitioning) Copartitioning is defined by tiling the m-dimensional iteration space $\mathcal{I}$ into the iteration spaces $\mathcal{I}_1$, $\mathcal{I}_2$, and $\mathcal{I}_3$, i.e.,

$$\text{Copartitioning}: \mathbb{Z}^m \to \mathbb{Z}^{3m}, \quad \mathcal{I} \mapsto \mathcal{I}_1 \oplus \mathcal{I}_2 \oplus \mathcal{I}_3, \quad \text{where}$$

$$\mathcal{I}_1 \oplus \mathcal{I}_2 \oplus \mathcal{I}_3 = \{I = I_1 + P_1' \cdot I_2 + P_2' \cdot I_3 \mid I_1 \in \mathcal{I}_1 \wedge I_2 \in \mathcal{I}_2 \wedge I_3 \in \mathcal{I}_3\}$$

using two congruent tiles defined by two tiling matrices $P_1 \in \mathbb{Z}^{m \times m}$ and $P_2 \in \mathbb{Z}^{m \times m}$, where $P_1' = P_1$ and $P_2' = P_1 \cdot P_2$. $\mathcal{I}_1 \subset \mathbb{Z}^m$ represents the points within the LS (inner) tiles, $\mathcal{I}_2 \subset \mathbb{Z}^m$ accounts for the regular repetition of the origins of the LS tiles, and $\mathcal{I}_3 \subset \mathbb{Z}^m$ accounts for the regular repetition of the GS (outer) tiles. The space-time mapping in case of copartitioning is an affine transformation of the form

$$\begin{pmatrix} p \\ t \end{pmatrix} = \begin{pmatrix} 0 & E & 0 \\ \lambda_1 & \lambda_2 & \lambda_3 \end{pmatrix} \begin{pmatrix} I_1 \\ I_2 \\ I_3 \end{pmatrix} \qquad (3.1)$$

where $E \in \mathbb{Z}^{m \times m}$ is the identity matrix and $\lambda_1, \lambda_2, \lambda_3 \in \mathbb{Z}^{1 \times m}$.

Similarly, an n-hierarchical tiling method partitions the iteration space $\mathcal{I}$ into $n+1$ iteration spaces. Formally, this is represented as

$$\mathcal{I}_1 \oplus \ldots \oplus \mathcal{I}_{n+1} = \{I = I_1 + P_1' \cdot I_2 + \ldots + P_n' \cdot I_{n+1} \mid I_1 \in \mathcal{I}_1 \wedge I_2 \in \mathcal{I}_2 \wedge \ldots \wedge I_{n+1} \in \mathcal{I}_{n+1}\}$$
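The resulting loop structure of a copartitioned one-dimensional iteration space can be sketched in C as follows (the loop bound N is assumed to be divisible by the inner tile size s1 times the PE count s2; all names are illustrative):

/* 2-level hierarchical tiling (copartitioning) of a single loop of size N. */
for (i3 = 0; i3 < N/(s1*s2); i3++)      /* GS tiles: executed sequentially   */
    for (i2 = 0; i2 < s2; i2++)         /* LS tiles in a GS tile: one per PE */
        for (i1 = 0; i1 < s1; i1++)     /* points in an LS tile: sequential  */
            body(i3*s1*s2 + i2*s1 + i1);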


Figure 3.5.: Dependence graph of a 6-tap FIR filter and the dependence graph after copartitioning. The affine data dependency of the variable A is not shown.

n-hierarchical tiling not only changes the loop code in terms of the iteration space (i.e., loop depth, variables, and bounds), but the statements inside the loop kernel are also changed in the new iteration space

$$\mathcal{I}_{tiled} = \{(I_1, I_2, \ldots, I_{n+1}) \mid I_1 \in \mathbb{Z}^m, \ldots, I_{n+1} \in \mathbb{Z}^m\}$$

Other hierarchical tiling schemes can be realized using an appropriate selection of the affine transformation, which characterizes the scheduling and the allocation of the iteration points. The problem of determining an optimal sequencing index (i.e., $\lambda_1, \lambda_2, \ldots$) is solved by a Mixed Integer Linear Programming (MILP) formulation [86]. Very often, tiling is not quite straightforward and intuitive for the programmer. Therefore, in this section, we present a methodology for the automatic hierarchical tiling of loop programs which are described as a set of recurrence equations. We use an FIR filter for illustrating our methodology.

Example 3.1.2 An FIR (Finite Impulse Response) filter is described by the simple difference equation $y(i) = \sum_{j=0}^{N-1} a(j) \cdot u(i-j)$ with $0 \le i < T$, where N denotes the number of filter taps, $a(j)$ the filter coefficients, $u(i)$ the filter input, and $y(i)$ the filter result. A 6-tap FIR filter can be expressed by the following DPLA with the iteration domain $\mathcal{I} = \{(i, j) \mid 0 \le i \le T-1 \wedge 0 \le j \le N-1\}$, where $T = 8$ and $N = 6$.


a[i, j] = A[0, j]
u[i, j] = U[i, j]                  if  j = 0
u[i, j] = 0                        if  i = 0 ∧ j > 0
u[i, j] = u[i-1, j-1]              if  i > 0 ∧ j > 0
z[i, j] = a[i, j] · u[i, j]
y[i, j] = z[i, j]                  if  j = 0
y[i, j] = y[i, j-1] + z[i, j]      if  j > 0
Y[i]    = y[i, j]                  if  j = N-1

The original iteration domain with the data dependencies between the individual variables is shown in Figure 3.5(a).
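For reference, a direct C realization of the difference equation above (a sketch; the coefficient and input values are illustrative, and u(i-j) is taken as 0 for negative indices):

#include <stdio.h>

int main(void) {
    enum { T = 8, N = 6 };
    int a[N] = {1, 2, 3, 3, 2, 1};        /* illustrative filter coefficients */
    int u[T] = {1, 0, 0, 0, 0, 0, 0, 0};  /* illustrative input (unit impulse) */
    int y[T];
    for (int i = 0; i < T; i++) {
        y[i] = 0;
        for (int j = 0; j < N; j++)
            if (i - j >= 0)               /* u(i-j) assumed 0 for negative indices */
                y[i] += a[j] * u[i - j];
        printf("y(%d) = %d\n", i, y[i]);
    }
    return 0;
}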

We first illustrate the problem of hierarchical tiling for an FIR filter loop program. The above DPLA program of an FIR filter is tiled to realize a hierarchically tiled FIR filter in the following example.

Example 3.1.3 The FIR filter undergoes a 2-hierarchical tiling (copartitioning) with tiles represented by the following matrices.

$$P_1 = \begin{pmatrix} 2 & 0 \\ 0 & 3 \end{pmatrix} \qquad P_2 = \begin{pmatrix} 2 & 0 \\ 0 & 2 \end{pmatrix}$$

$P_1$ and $P_2$ denote the number of iterations inside an inner tile and the number of PEs in the processor array in case of copartitioning, respectively. The tiled FIR filter is represented as a dynamic piecewise linear algorithm, where the iteration space and the program body are given in the following. The tiled iteration space of the program along with the data dependencies between the iterations is shown in Figure 3.5(b). The iteration space can be interpreted as a set of three nested for loops of the FIR filter as shown in Figure 3.6(a). It scans the iterations of the original iteration space, where $\mathcal{I}_1$ denotes the iteration space of the inner tile, whereas $\mathcal{I}_2$ and $\mathcal{I}_3$ denote the iteration spaces which correspond to the origins of the inner and the outer tiles, respectively. Formally, the iteration spaces are defined by $\mathcal{I} \to \mathcal{I}_1 \oplus \mathcal{I}_2 \oplus \mathcal{I}_3$, where

$$\mathcal{I}_1 = \{(i_1\ j_1)^T \in \mathbb{Z}^2 \mid 0 \le i_1 \le 1 \wedge 0 \le j_1 \le 2\} \quad \text{(Loop 1)}$$
$$\mathcal{I}_2 = \{(i_2\ j_2)^T \in \mathbb{Z}^2 \mid 0 \le i_2 \le 1 \wedge 0 \le j_2 \le 1\} \quad \text{(Loop 2)}$$
$$\mathcal{I}_3 = \{i_3 \in \mathbb{Z} \mid 0 \le i_3 \le 1\} \quad \text{(Loop 3)}$$

The iteration variable $j_3 = 0$ and is therefore not shown in the tiled iteration space. The statements in the loop kernel are transformed into the following set of recurrence equations in DPLA notation:

a[i1,j1,i2,j2,i3] = A[i1,j1,i2,j2,i3]               if i1 = 0 ∧ i2 = 0 ∧ i3 = 0
u[i1,j1,i2,j2,i3] = U[i1,j1,i2,j2,i3]               if j1 = 0 ∧ j2 = 0
u[i1,j1,i2,j2,i3] = 0                               if i1 = 0 ∧ i2 = 0 ∧ i3 = 0 ∧ j1 + j2 > 0
u[i1,j1,i2,j2,i3] = u[i1-1, j1-1, i2, j2, i3]       if i1 > 0 ∧ j1 > 0
u[i1,j1,i2,j2,i3] = u[i1+1, j1-1, i2-1, j2, i3]     if i1 = 0 ∧ j1 > 0 ∧ i2 > 0
u[i1,j1,i2,j2,i3] = u[i1-1, j1+2, i2, j2-1, i3]     if i1 > 0 ∧ j1 = 0 ∧ j2 > 0
u[i1,j1,i2,j2,i3] = u[i1+1, j1+2, i2-1, j2-1, i3]   if i1 = 0 ∧ j1 = 0 ∧ i2 > 0 ∧ j2 > 0
u[i1,j1,i2,j2,i3] = u[i1+1, j1-1, i2+1, j2, i3-1]   if i1 = 0 ∧ j1 > 0 ∧ i2 = 0 ∧ i3 > 0
u[i1,j1,i2,j2,i3] = u[i1+1, j1+2, i2+1, j2-1, i3-1] if i1 = 0 ∧ j1 = 0 ∧ i2 = 0 ∧ j2 > 0 ∧ i3 > 0
z[i1,j1,i2,j2,i3] = a[i1,j1,i2,j2,i3] · u[i1,j1,i2,j2,i3]
y[i1,j1,i2,j2,i3] = z[i1,j1,i2,j2,i3]               if j1 = 0 ∧ j2 = 0
y[i1,j1,i2,j2,i3] = y[i1, j1-1, i2, j2, i3] + z[i1,j1,i2,j2,i3]    if j1 > 0
y[i1,j1,i2,j2,i3] = y[i1, j1+2, i2, j2-1, i3] + z[i1,j1,i2,j2,i3]  if j1 = 0 ∧ j2 > 0
Y[i1,j1,i2,j2,i3] = y[i1,j1,i2,j2,i3]               if j1 = 2 ∧ j2 = 1

For the FIR filter, a loop nest of depth 2 after 2-hierarchical tiling (copartitioning) results in a loop nest of depth 6. One can verify that the index point $I = (4,3)$ is uniquely mapped to $I_1 = (0,0)$, $I_2 = (0,1)$, and $I_3 = (1,0)$ after copartitioning. The data dependency of the variable u leads to several new equations to account for the embedding of an equation in the new tiled iteration space. For example, $I = (4,3)$ receives the u value from $I = (3,2)$ in the original iteration space. In the new iteration space of six dimensions, the same iteration, i.e., $I_1 = (0,0)$, $I_2 = (0,1)$, and $I_3 = (1,0)$, receives the u value from $I_1 = (1,2)$, $I_2 = (1,0)$, and $I_3 = (0,0)$. This is accounted for by the last equation of u. The multiple equations arise due to the different cases of data dependencies crossing the tiles.
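The decomposition of an index point can be checked mechanically; the following self-contained sketch recomputes $I = I_1 + P_1 \cdot I_2 + P_1 P_2 \cdot I_3$ for the values above (the code is illustrative only, not part of the PARO tool):

#include <stdio.h>

int main(void) {
    int P1[2][2]   = {{2, 0}, {0, 3}};
    int P1P2[2][2] = {{4, 0}, {0, 6}};   /* product P1 * P2 */
    int I1[2] = {0, 0}, I2[2] = {0, 1}, I3[2] = {1, 0};
    int I[2];
    for (int r = 0; r < 2; r++)
        I[r] = I1[r]
             + P1[r][0]*I2[0]   + P1[r][1]*I2[1]
             + P1P2[r][0]*I3[0] + P1P2[r][1]*I3[1];
    printf("I = (%d, %d)\n", I[0], I[1]);  /* prints I = (4, 3) */
    return 0;
}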

When comparing the tiled FIR filter program above with the initial program in Example 3.1.2, one may notice that the number of equations in the tiled program is substantially larger than the number of equations in the initial program. Furthermore, the iteration space of the tiled program is different from the iteration space of the initial program. Therefore, given a DPLA with both uniform and affine data dependencies, and given the hierarchy n and the tiling matrices, the following questions need to be answered to enable automated n-hierarchical tiling:

• How do we generate the tiled iteration space $\mathcal{I}_{tiled}$ given the hierarchical tiling matrices?


Figure 3.6(a) shows the tiled iteration space represented as a loop nest:

for (i3 = 0; i3 < 2; i3++)
  for (j3 = 0; j3 < 1; j3++)           // Loop 3
    for (i2 = 0; i2 < 2; i2++)
      for (j2 = 0; j2 < 2; j2++)       // Loop 2
        for (i1 = 0; i1 < 2; i1++)
          for (j1 = 0; j1 < 3; j1++) { // Loop 1
            // statement body
          }

Figure 3.6(b) depicts the tiling methodology: given the tiling matrices $P_1, P_2, \ldots, P_n$ and the hierarchy n, the loop program undergoes the iteration space decomposition (tiling), the embedding of data dependencies and conditions (expand), and global scheduling (reduce).
Figure 3.6.: (a) Tiled iteration space represented as for loop. (b) Overview of tiling methodology.

• How does one obtain an output DPLA preserving the data dependencies after hierarchical tiling (code generation)?

• How do we allocate and schedule the iterations of the output DPLA for efficient hardware generation?

Our approach for hierarchical tiling answers the above questions. It consists of the steps: (a) tiling, (b) expand, i.e., the embedding of data dependencies and control conditions, and (c) reduce, i.e., global scheduling (see Figure 3.6(b)). The first step is the tiling of the iteration space, which is equivalent to the problem of loop tiling in compiler theory. In our methodology, we go one step further by embedding the data dependencies in the new tiled iteration space. The advantage of this data dependence analysis step is that we remain in the polytope framework, which offers the possibility of mapping the algorithms onto massively parallel architectures. Hierarchical tiling not only embeds the data dependencies but also the iteration dependent conditions in the new iteration space $\mathcal{I}_{tiled}$. Furthermore, the new data dependencies are also associated with new unique iteration conditions. Finally, global scheduling is an important step for describing the place and time coordinates of execution of each iteration in the tiled iteration space, and for hardware generation. The introduced n-hierarchical tiling method encompasses all possible tiling techniques (i.e., LSGP, LPGS, copartitioning, ...). We use congruent parallelotope-shaped tiles for tiling, which can also describe non-rectangular tiles.

As already mentioned, tiling is known in modern compiler theory under different nuances like strip-mining or loop tiling [158]. Traditional loop tiling as in compiler theory introduces integer division, mod, ceil, and floor operators. The DTSE methodology [27] from IMEC is another compilation method based on the polytope model, which addresses hierarchical tiling, but only for general purpose embedded processors and not for processor arrays or multi-processor systems. Tiling in the context of recurrence equations for a single hierarchy as in LSGP and LPGS was presented in [164] and was extended for affine data dependencies in [168]. The idea of copartitioning was introduced in [57]. However, the exact methodology for automating the transformation has not been studied. One of the main contributions of the presented transformation is the generalization of the tiling approaches presented in [168, 57] to any given hierarchy. Hierarchical tiling methods have also been studied in [111]. Here, the authors also present a method for generating multi-level tiled loops for parametric tiles. However, they do not transform the statements in the loop kernel. In the LooPo framework [78], tiling is performed after the space-time mapping. Another outstanding extension in this framework is the possibility of handling while loops [80]. Our approach is similar to the traditional approach, where tiling is done before the space-time mapping. The code generation scheme for tiling in the Pluto framework [21] uses Fourier-Motzkin elimination and Cloog [13] for fixed tile sizes. A major difference also lies in the architecture target: most of the above frameworks aim at code generation for existing multi-processor systems, whereas, here, we intend the mapping onto hardware accelerators in the form of processor arrays or even synthesize them from scratch.

3.1.2.1. Tiling: Decomposition of the Iteration Space

In this section, we present a method for realizing the tiled iteration space of a DPLA. Traditional loop tiling decomposes a loop nest into two nested loops. The outer loop iterates over the tile origins and the inner loop steps over the single iterations within a tile. Given the simple loop program in C notation:

for (i = 0; i < N; i++){a[i]=...}

The resulting tiled program, using a 1-d tile of size S according to [158], is then

for (i1 = 0; i1 <= floor((N-1)/S); i1++)
    for (i2 = S*i1; i2 <= min(S*i1 + S - 1, N-1); i2++)
        a[i2] = ...;

Here, the loop bounds are not divisible by the tile size, which leads to complex division and floor operations. Furthermore, if the tiles are not orthogonal, then complex floor, division, min, and max operations in the loop bounds result. Our tiling of an iteration space for recurrence equations differs from traditional loop tiling. We introduce dummy operations for invalid iteration points instead of introducing complex floor and division operations in the loop bounds. Furthermore, the recurrence equations are embedded in the new iteration space. Therefore, tiling as proposed in this chapter leads to the following loop code:

for (i1 = 0; i1 < (N+S-1)/S; i1 += 1)   /* ceil(N/S) tiles */
    for (i2 = 0; i2 < S; i2++) {
        if (i1*S + i2 >= N) { /* dummy op */ }
        else a[i1][i2] = ...;
    }

Here, the loop bounds are statically determined due to the assumption of fixed tile sizes. Loop tiling of a loop nest of depth 2 gives a loop nest of depth 4, with $\mathcal{I} \to \mathcal{I}_1 \oplus \mathcal{I}_2$, where $\mathcal{I}_1$ and $\mathcal{I}_2$ denote the iterations inside the tiles and the origins of the tiles, respectively. Similarly, an n-hierarchical tiling converts the global iteration space $\mathcal{I}$ of dimension m into an $(n+1) \cdot m$-dimensional iteration space, since $\mathcal{I} \to \mathcal{I}_1 \oplus \mathcal{I}_2 \oplus \ldots \oplus \mathcal{I}_{n+1}$, where $\mathcal{I}_1$ accounts for the index points in the innermost tiles, $\mathcal{I}_2$ accounts for the regular repetition of the innermost tiles (i.e., of $\mathcal{I}_1$), and so on; collectively, they form the new iteration space. The tiles are parallelepipeds and are described by n tiling matrices $(P_1, P_2, \ldots, P_n)$ and a tiling offset vector q. The method for obtaining the new iteration space should be generic such that it covers simple loop tiling and copartitioning. The following theorem describes the formula for obtaining the tiled iteration space.

Theorem 3.1.1 Let the initial iteration space be $\mathcal{I} = \{I \in \mathbb{Z}^m \mid AI \ge b\}$. Then, given n tiling matrices $(P_1, P_2, \ldots, P_n)$, an n-hierarchically tiled iteration space $\mathcal{I}_{tiled}$ is obtained as follows:

$$\mathcal{I}_{tiled} = \left\{ \begin{pmatrix} I_1 \\ I_2 \\ \vdots \\ I_{n+1} \end{pmatrix} \,\middle|\, \begin{pmatrix} A_1 & 0 & \cdots & 0 \\ 0 & A_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & 0 \\ 0 & 0 & \cdots & A_{n+1} \end{pmatrix} \begin{pmatrix} I_1 \\ I_2 \\ \vdots \\ I_{n+1} \end{pmatrix} \ge \begin{pmatrix} b_1 \\ b_2 \\ \vdots \\ b_{n+1} \end{pmatrix} \right\} \qquad (3.2)$$

$$\mathcal{I}_1 = \{I_1 \in \mathbb{Z}^m \mid A_1 I_1 \ge b_1\}, \quad \mathcal{I}_2 = \{I_2 \in \mathbb{Z}^m \mid A_2 I_2 \ge b_2\}, \quad \ldots, \quad \mathcal{I}_{n+1} = \{I_{n+1} \in \mathbb{Z}^m \mid A_{n+1} I_{n+1} \ge b_{n+1}\},$$

where

$$A_j = \begin{pmatrix} \sigma_j \cdot \mathrm{adj}(P_j) \\ -\sigma_j \cdot \mathrm{adj}(P_j) \end{pmatrix}, \quad b_j = \begin{pmatrix} 0 \\ -\sigma_j \cdot \det(P_j) \cdot o + w_j \end{pmatrix} \quad \text{for } j = 1, \ldots, n \qquad (3.3)$$

$$A_{n+1} I_{n+1} \ge b_{n+1} \ \equiv\ \mathrm{Proj}\left( \begin{pmatrix} -A_n' P_n' & A_n' \\ 0 & A \end{pmatrix} \begin{pmatrix} I_{n+1} \\ I \end{pmatrix} \ge \begin{pmatrix} b_n' + A_n' \cdot q \\ b \end{pmatrix} \right) \qquad (3.4)$$

$$A_n' = \begin{pmatrix} \sigma_n \cdot \mathrm{adj}(P_n') \\ -\sigma_n \cdot \mathrm{adj}(P_n') \end{pmatrix}, \quad b_n' = \begin{pmatrix} 0 \\ -\sigma_n \cdot \det(P_n') \cdot o + w_n' \end{pmatrix} \qquad (3.5)$$

Here, $\mathcal{I}_1$ denotes the iteration space that belongs to the tile defined by $P_1$, and $\mathcal{I}_2$ contains the origins of the tiles defined by $P_1$. Similarly, $\mathcal{I}_{n+1}$ contains the origins of the tiles defined by $P_n'$ such that they cover the original iteration space. Also, in Equations (3.3) and (3.4), $o = (1 \ldots 1)^T \in \mathbb{Z}^m$ and $w_j = (u_1^j \ldots u_m^j)^T \in \mathbb{Z}^m$, where

$$u_i^j = \frac{1}{g_i^j} \prod_{k=1}^{m} g_k^j \quad \forall\, 1 \le i \le m \qquad \text{and} \qquad g_k^j = \gcd\left( P_{j,k,l} \mid l \in \{1 \ldots m\} \right)$$

where $P_{j,k,l}$ is the $(k,l)$ element of the tiling matrix $P_j$. Also, let $\sigma_i = \frac{\det(P_i')}{|\det(P_i')|}$, $P_n' = \prod_{i=1}^{n} P_i$, and let $\mathrm{adj}(P)$ denote the adjugate matrix.
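As a worked instance of Equation (3.3), consider the inner tile $P_1 = \begin{pmatrix} 2 & 0 \\ 0 & 3 \end{pmatrix}$ from Example 3.1.3: $\sigma_1 = 1$, $\mathrm{adj}(P_1) = \begin{pmatrix} 3 & 0 \\ 0 & 2 \end{pmatrix}$, $\det(P_1) = 6$, $g_1^1 = 2$, $g_2^1 = 3$, hence $u_1^1 = 3$, $u_2^1 = 2$, and $w_1 = (3\ 2)^T$, so that

$$A_1 = \begin{pmatrix} 3 & 0 \\ 0 & 2 \\ -3 & 0 \\ 0 & -2 \end{pmatrix}, \qquad b_1 = \begin{pmatrix} 0 \\ 0 \\ -3 \\ -4 \end{pmatrix},$$

i.e., $0 \le i_1 \le 1$ and $0 \le j_1 \le 2$; these are exactly the first four rows of Equation (3.9) in Example 3.1.4 below.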

Proof 3.1.1 If P is the tiling matrix, then, per definition, it contains the side vectors of a tile as linearly independent column vectors. According to the Minkowski theorem [156], the iteration space of the points within the tile can be represented as

$$\mathcal{I} = \{I \in \mathbb{Z}^n \mid I = P\kappa \ \wedge\ z \le \kappa < o,\ \kappa \in \mathbb{Q}^n\}, \quad z = (0 \ldots 0)^T, \quad o = (1 \ldots 1)^T$$

In [86], it was shown that the following implicit definition is equivalent to the above Minkowski characterization:

$$\mathcal{I} = \{I \in \mathbb{Z}^n \mid AI \ge b\} = \left\{ I \in \mathbb{Z}^n \,\middle|\, \begin{pmatrix} \sigma \cdot \mathrm{adj}(P) \\ -\sigma \cdot \mathrm{adj}(P) \end{pmatrix} I \ge \begin{pmatrix} 0 \\ -\sigma \cdot \det(P) \cdot o + w \end{pmatrix} \right\} \qquad (3.6)$$


where $\sigma = \frac{\det(P)}{|\det(P)|}$, z is the zero vector, and $o = (1 \ldots 1)^T \in \mathbb{Z}^n$. Also,

$$w = (u_1 \ldots u_n)^T \in \mathbb{Z}^n, \quad \text{where} \quad u_i = \frac{1}{g_i} \prod_{k=1}^{n} g_k \ \ \forall\, 1 \le i \le n, \qquad g_k = \gcd\left( p_{k,l} \mid l \in \{1, \ldots, n\} \right)$$

On replacing P with the corresponding tiling matrix of each hierarchy, i.e., $P_1, \ldots, P_n$, in Equation (3.6), we obtain the iteration spaces $\mathcal{I}_1, \ldots, \mathcal{I}_n$. Now, it remains to be determined how the outermost tiles cover the original iteration space. The outermost tile is given by the tiling matrix $P_n' = \prod_{i=1}^{n} P_i$. The origins of the outermost tiles $I_{n+1}$ are the same as for a tiling with matrix $P_n'$. In this case, the iteration points $I_n'$ inside the tile are described by the polytope (using the Minkowski characterization)

$$\mathcal{I}_n' = \{I_n' \in \mathbb{Z}^n \mid A_n' I_n' \ge b_n'\} = \left\{ I_n' \in \mathbb{Z}^n \,\middle|\, \underbrace{\begin{pmatrix} \sigma \cdot \mathrm{adj}(P_n') \\ -\sigma \cdot \mathrm{adj}(P_n') \end{pmatrix}}_{A_n'} I_n' \ge \underbrace{\begin{pmatrix} 0 \\ -\sigma \cdot \det(P_n') \cdot o + w_n' \end{pmatrix}}_{b_n'} \right\} \qquad (3.7)$$

The above points can also be obtained as $I_n' = I - (P_n' I_{n+1} + q)$, where $I_{n+1}$ denotes the origins of the outermost tiles. Therefore, we have

$$A_n' \left( I - (P_n' I_{n+1} + q) \right) \ge b_n' \qquad (3.8)$$

as the system of inequalities describing all points $I_n'$. Then, the polytope $A_{n+1} I_{n+1} \ge b_{n+1}$ describing the origins of the non-empty outermost tiles can be determined by the projection in Equation (3.4). The upper row in Equation (3.4) for determining the polytope $A_{n+1} I_{n+1} \ge b_{n+1}$ is the same as Equation (3.8). Proj defines the projection onto the subspace defined by the variables in $I_{n+1}$, eliminating all variables I. This is done using the Fourier-Motzkin elimination [156]. Therefore, we obtain a hierarchical tiling of the iteration space $\mathcal{I}$. □
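The Fourier-Motzkin elimination used in Equation (3.4) can be illustrated on a deliberately small system (illustrative numbers only): eliminating y from $\{y \ge 0,\ x + y \le 3,\ x \ge 0\}$ pairs the lower bound $y \ge 0$ with the upper bound $y \le 3 - x$, yielding $0 \le 3 - x$ and hence the projection $0 \le x \le 3$ onto the x-subspace.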

Simple loop tiling can also be interpreted from Equation (3.2) as the special case of $n = 1$:

$$\mathcal{I}_1 = \{I_1 \mid A_1 I_1 \ge b_1\}, \quad \mathcal{I}_2 = \{I_2 \mid A_2 I_2 \ge b_2\}, \quad \text{where}$$

$$A_1 = \begin{pmatrix} \sigma \cdot \mathrm{adj}(P) \\ -\sigma \cdot \mathrm{adj}(P) \end{pmatrix}, \qquad b_1 = \begin{pmatrix} 0 \\ -\sigma \cdot \det(P) \cdot o + w_1 \end{pmatrix}$$

$$A_2 I_2 \ge b_2 \ \equiv\ \mathrm{Proj}\left( \begin{pmatrix} -A_1 P & A_1 \\ 0 & A \end{pmatrix} \begin{pmatrix} I_2 \\ I \end{pmatrix} \ge \begin{pmatrix} b_1 + A_1 \cdot q \\ b \end{pmatrix} \right)$$

This is the same result as proposed in the seminal work of Ancourt et al. [6], where this outer tile set is given by

$$\{I_2 \in \mathcal{I}_2 \mid \exists I \in \mathcal{I} \text{ s.t. } AI \ge b \wedge A_1(I - P I_2) \ge b_1\} = \{I_2 \in \mathcal{I}_2 \mid A_2 I_2 \ge b_2\}$$

Example 3.1.4 The iteration space of the FIR filter in Example 3.1.2, $\mathcal{I} = \{(i\ j)^T \mid 0 \le i \le T-1,\ 0 \le j \le N-1\}$ for $T = 8$ and $N = 6$, is copartitioned (i.e., 2-hierarchically tiled) with the help of the tiling matrices $P_1 = \begin{pmatrix} 2 & 0 \\ 0 & 3 \end{pmatrix}$ and $P_2 = \begin{pmatrix} 2 & 0 \\ 0 & 2 \end{pmatrix}$ for the inner and outer tiles, respectively. Then, after applying hierarchical tiling according to Theorem 3.1.1, one obtains the new iteration space as shown in Equation (3.9).

$$\mathcal{I}_{tiled} = \left\{ \begin{pmatrix} i_1 \\ j_1 \\ i_2 \\ j_2 \\ i_3 \\ j_3 \end{pmatrix} \,\middle|\, \begin{pmatrix} 3 & 0 & 0 & 0 & 0 & 0 \\ 0 & 2 & 0 & 0 & 0 & 0 \\ -3 & 0 & 0 & 0 & 0 & 0 \\ 0 & -2 & 0 & 0 & 0 & 0 \\ 0 & 0 & 2 & 0 & 0 & 0 \\ 0 & 0 & 0 & 2 & 0 & 0 \\ 0 & 0 & -2 & 0 & 0 & 0 \\ 0 & 0 & 0 & -2 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 & -4 & 0 \\ 0 & 0 & 0 & 0 & 0 & -6 \end{pmatrix} \begin{pmatrix} i_1 \\ j_1 \\ i_2 \\ j_2 \\ i_3 \\ j_3 \end{pmatrix} \ge \begin{pmatrix} 0 \\ 0 \\ -3 \\ -4 \\ 0 \\ 0 \\ -2 \\ -2 \\ 0 \\ 0 \\ -4 \\ 0 \end{pmatrix} \right\} \qquad (3.9)$$

After simplifying the bounds, one can write the tiled iteration space in terms of a for loop as shown earlier in Figure 3.6. The PARO output is

par (j3 == 0 and i1 >= 0 and j1 >= 0 and i1 <= 1 and j1 <= 2 and i2 >= 0 and j2 >= 0 and i2 <= 1 and j2 <= 1 and i3 >= 0 and i3 <= 1)


Let the initial iteration space of a loop algorithm be given. The tiling transformation requires n tiling matrices for an n-hierarchical tiling (e.g., n = 2 for copartitioning). Then, the new iteration space is given by Equation (3.2). It may be noted that the tiled iteration space can be written in terms of for loops, which scan all points inside the tiled iteration space. This is a code generation problem. In Section 3.2.1.1, we present a novel scheme for scanning the tiled iteration space, which is necessary for implementing the parallelization scheme (i.e., LSGP, LPGS, ...). It must be noted here that the presented methodology deals with fixed loop boundaries and tile sizes. An important future work is the research on a methodology for parametrized tiling, i.e., where loop bounds and tile sizes are expressed by parameters. This would be necessary for the dynamic mapping of loop programs onto so-called resource-aware invasive architectures [165].

3.1.2.2. Embedding: Splitting of Data Dependencies

The existing affine and uniform iteration-carried dependencies need to be embedded in the new tiled iteration space. This step introduces new equations with new data dependencies and iteration conditions in case the data dependencies cross different tiles. Each general recurrence equation in a DPLA (as in Def. 2.1.3) is brought into the following form

$$x[I] = \mathcal{F}(\ldots, y[QI - d], \ldots) \quad \forall I \in \mathcal{I}^c$$

by a transformation called output normal form. For the sake of simplicity, the above equation is written in the following equivalent form [168]:

$$x[I] = \mathcal{F}(\ldots, z[I], \ldots) \quad \forall I \in \mathcal{I}^c;$$
$$z[I] = y[QI - d] \quad \forall I \in \mathcal{I}^c;$$

The variables and the dependencies are embedded in the new tiled iteration space $\mathcal{I}_{tiled}$ as follows:

$$x[I_1, I_2, \ldots, I_{n+1}] = \mathcal{F}(\ldots, z[I_1, I_2, \ldots, I_{n+1}], \ldots)$$
$$z[I_1, I_2, \ldots, I_{n+1}] = y[QI_1 - d - R_0,\ QI_2 - (R_1 - R_0),\ \ldots,\ QI_n - (R_{n-1} - R_{n-2}),\ QI_{n+1} + R_{n-1}]$$

By the definition of tiling,

$$I = \hat{I}_1 + \hat{I}_2 + \ldots + \hat{I}_{n+1}, \quad \text{where} \quad \hat{I}_1 = I_1, \quad \hat{I}_2 = P_1' I_2, \quad \ldots, \quad \hat{I}_{n+1} = P_n' I_{n+1} \qquad (3.10)$$


Therefore, one can write

$$QI - d = Q(\hat{I}_1 + \hat{I}_2 + \ldots + \hat{I}_{n+1}) - d = (QI_1 - d - R_0) + (QI_2 - R_1 + R_0) + \ldots + (QI_n - (R_{n-1} - R_{n-2})) + (QI_{n+1} + R_{n-1})$$

as all the $R_i$ terms cancel each other when added together in the above equation. The values of $(R_0, R_1, \ldots, R_{n-1})$ give the new data dependencies. Hence, the problem is to find all distinct $(R_0, R_1, \ldots, R_{n-1})$ such that

$$QI_1 - d - R_0 \in \mathcal{I}_1$$
$$H_1(QI_2 - (R_1 - R_0)) \in \mathcal{I}_2 \qquad (3.11)$$
$$\vdots$$
$$H_n(QI_{n+1} + R_{n-1}) \in \mathcal{I}_{n+1}$$

where $H_n = P_n'^{-1}$. In [168], it was shown for $n = 1$ (i.e., simple tiling) that one can set up a constraint polytope and enumerate all its points to find the different possible values of $R_0$. In this section, we extend the method to the embedding of data dependencies in case of a general n-hierarchical tiling. We need to define the embedding operation for hierarchical tiling and use an induction for defining the operation. We start with the embedding transformation for copartitioning (i.e., n = 2), which leads to an equation of the form:

$$x[I_1, I_2, I_3] = \mathcal{F}(\ldots, z[I_1, I_2, I_3], \ldots)$$
$$z[I_1, I_2, I_3] = y[QI_1 - d - R_0,\ QI_2 + R_0 - R_1,\ QI_3 + R_1]$$

Therefore, the problem is to find all $(R_0, R_1)$ such that

$$QI_1 - d - R_0 \in \mathcal{I}_1 \qquad (3.12)$$
$$H_1(QI_2 + R_0 - R_1) \in \mathcal{I}_2 \qquad (3.13)$$
$$H_2(QI_3 + R_1) \in \mathcal{I}_3 \qquad (3.14)$$

Assume the target and the source iteration of a data dependence vector are given by $(I_1^1, I_2^1, I_3^1)$ and $(I_1^2, I_2^2, I_3^2)$, respectively. From Equation (3.2) and Equation (3.14), we infer

$$I_3^1 = I_3 \ \wedge\ A_3 \cdot I_3^1 \ge b_3 \qquad (3.15)$$
$$H_2 \cdot Q \cdot I_3^1 + H_2 R_1 = I_3^2 \ \wedge\ A_3 \cdot I_3^2 \ge b_3 \qquad (3.16)$$

Hence, using the above Equations (3.15) and (3.16), $R_1$ must satisfy

$$R_1 = H_2^{-1} \left( I_3^2 - H_2 Q I_3^1 \right) = P_2' I_3^2 - Q P_2' I_3^1 \qquad (3.17)$$


Similarly, from Equation (3.2) and Equation (3.13), we infer

$$I_2^1 = I_2 \ \wedge\ A_2 \cdot I_2^1 \ge b_2$$
$$H_1 \cdot Q \cdot I_2^1 + H_1 (R_0 - R_1) = I_2^2 \ \wedge\ A_2 \cdot I_2^2 \ge b_2$$

Hence, $R_0$ must satisfy

$$H_1 R_0 = H_1 R_1 + I_2^2 - H_1 Q I_2^1 = H_1 \left( P_2' I_3^2 - Q P_2' I_3^1 \right) + I_2^2 - H_1 Q I_2^1$$

which implies that

$$R_0 = P_2' I_3^2 - Q P_2' I_3^1 + H_1^{-1} \left( I_2^2 - H_1 Q I_2^1 \right) = P_2' I_3^2 - Q P_2' I_3^1 + P_1' I_2^2 - Q P_1' I_2^1 \qquad (3.18)$$

Lastly, $A_1 \cdot I_1^1 \ge b_1$ and $A_1(Q \cdot I_1^1 - d - R_0) \ge b_1$ (from Equation (3.12) and $I_1^1 = I_1$). By replacing $R_0$, we get the following set of inequalities, the so-called constraint polytope:

$$\begin{pmatrix} A_1 \cdot Q & A_1 Q \cdot P_1' & -A_1 \cdot P_1' & A_1 Q \cdot P_2' & -A_1 \cdot P_2' \\ A_1 & 0 & 0 & 0 & 0 \\ 0 & A_2 & 0 & 0 & 0 \\ 0 & 0 & A_2 & 0 & 0 \\ 0 & 0 & 0 & A_3 & 0 \\ 0 & 0 & 0 & 0 & A_3 \end{pmatrix} \begin{pmatrix} I_1^1 \\ I_2^1 \\ I_2^2 \\ I_3^1 \\ I_3^2 \end{pmatrix} \ge \begin{pmatrix} b_1 + A_1 d \\ b_1 \\ b_2 \\ b_2 \\ b_3 \\ b_3 \end{pmatrix} \qquad (3.19)$$

The above polytope has 5m variables, and one must enumerate all its integral points to find the distinct $(R_0, R_1)$. This is done by replacing the distinct point coordinates obtained on enumeration, i.e., $I_2^1$, $I_2^2$, $I_3^1$, $I_3^2$, in Equation (3.18) and Equation (3.17). For each distinct value, a new equation is generated. The enumeration is done by scanning the polytope for integer points lying in the convex hull of the polytope. The complexity and scalability of this approach are discussed in Section 3.1.3.

The above derivation can be extended by induction to an n-hierarchical tiling and gives the following set of inequalities for the constraint polytope:

$$A_1 \cdot QI_1^1 + \sum_{i=2}^{n+1} \left( A_1 Q \cdot P_{i-1}' I_i^1 - A_1 \cdot P_{i-1}' I_i^2 \right) \ge b_1 + A_1 d$$
$$A_1 I_1^1 \ge b_1$$
$$A_i I_i^1 \ge b_i \quad \forall\, 2 \le i \le n+1$$
$$A_i I_i^2 \ge b_i \quad \forall\, 2 \le i \le n+1$$

Similarly, by enumerating the above polytope for each data dependency (Q and d) and finding the distinct values of $(R_0, \ldots, R_{n-1})$, one can add new equations to the partitioned description of the algorithm. The matrix Q and the vector d represent the affine and the regular part of the data dependencies in the initial program. In Section 3.1.3, it will be shown that enumeration is a feasible approach for determining the data dependencies in the hierarchically tiled iteration space for several realistic loop benchmarks.

69 3. Accelerator Generation: Loop Transformations and Back End

Equations Q d R0 R1

0 0 T T T a[i1, j1,i2, j2,i3] = a[0, j1,0, j2,0] (0 0) (0 0) (0 0) 0 1! T T T u[i1,...] = u[i1 1, j1 1,i2, j2,i3] E (1 1) (0 0) (0 0) − − T T T u[i1,...] = u[i1 + 1, j1 1,i2 1, j2,i3] E (1 1) (1 0) (0 0) − − T T T u[i1,...] = u[i1 1, j1 + 2,i2, j2 1,i3] E (1 1) (0 1) (0 0) − − T T T u[i1,...] = u[i1 + 1, j1 + 2,i2 1, j2 1,i3] E (1 1) (1 1) (0 0) − − T T T u[i1,...] = u[i1 + 1, j1 1,i2 + 1, j2,i3 1] E (1 1) (0 0) ( 1 0) − − T T − T u[i1,...] = u[i1 + 1, j1 + 2,i2 + 1, j2 1,i3 1] E (1 1) (0 1) ( 1 0) − − T T − T x[i1,...] = a[i1, j1,i2, j2,i3, j3] u[i1,...] E (0 0) (0 0) (0 0) · T T T y[i1,...] = 0 + x[i1, j1,i2, j2,i3] E (0 0) (0 0) (0 0) T T T y[i1,...] = y[i1, j1 1,i2, j2,i3] + x[...] E (0 1) (0 0) (0 0) − T T T y[i1,...] = y[i1, j1 + 2,i2, j2 1,i3] + x[...] E (0 1) (0 1) (0 0) − T T T Y [i1,...] = y[i1, j1,i2, j2,i3] E (0 1) (0 0) (0 0)

Table 3.1.: The table describes the set of new equations obtained after copartitioning of the FIR filter. description of the algorithm. The matrix Q and vector d represent the affine and regular part of the data dependencies in the initial program. In Section 3.1.3, it will be shown that enumeration is a feasible approach for determining the data dependencies in the hierarchically tiled iteration space for several realistic loop benchmarks.

Example 3.1.5 For our running FIR filter example one obtains the set of new equa- tions as shown in Table 3.1. The variable u with uniform data dependency ( (1 1)T) gives 6 distinct values of (R0,R1). Therefore, the number of equations in the obtained output DPLA for variable u is only 6. E is the identity matrix. The input variable U is also embedded in the tiled iteration space.

In case of tiling, the intuitive explanation of the larger number of equations is due to the data dependencies crossing the tiles. The circuit interpretation of the program is that one needs multiplexers to select the correct input, which could come from the memory, neighbouring PE, or the same PE. The control signals to the multiplexers are determined by the iteration-based control conditions as shown in next section.

3.1.2.3. Iteration dependent Conditions In this subsection, we will embed the initial iteration conditions in the tiled iteration space. Furthermore, embedding of data dependencies leads to new equations, which in turn are associated with unique iteration-based conditions. For an n-hierarchical

70 3.1. Loop Optimizations for Accelerator Tuning tiling, each new equation has the following new conditions depending on the value of (R0,...,Rn 1): −

A1(Q I1 d R0) b1 · − − ≥ A2(QI2 H1(R1 R0)) b2 − − ≥. .

An+1(QIn+1 + HnRn 1) bn+1 − ≥

This is because the conditions QI1 d R0 1, QI2 H1(R1 R0) 2,..., − − ∈ I − − ∈ I QIn+1 + HnRn 1 n+1 need to be guaranteed. The conditions contain a lot of in- equalities which− may∈ I create a lot of control overhead in the hardware implementation. However, after removal of redundant inequalities (inequalities that may be omitted from the specification without changing the iteration condition), one obtains a sim- plified form of the iteration-based conditions for practical examples. If the initial iteration-based condition (I c) is a linearly bounded lattice, which can be repre- sented as follows. ∈ I c = I = A I + b As I bs I { · ∧ · ≥ } then the transformed iteration-based condition is given by

1 1 1 1 AsA− I1 + AsA− I2 + ... + AsA− In 1 bs + AsA− b · · · + ≥ where (I1,...,In+1) denotes the iteration vector of the tiled iteration space.

Example 3.1.6 Table 3.2 summarizes the iteration-based control conditions obtained for the given FIR filter example. The conditions correspond to the equations as given in Table 3.1. These conditions need to be evaluated by the processor array to control multiplexers that select the input for correct evaluation of variables. Furthermore, a counter is needed to provide the value of the iteration variables in case they are chosen for sequential execution by the parallelization strategy. The methodology for efficient synthesis of control units is discussed in Section 3.2. Once the control conditions are obtained, one obtains the output DPLA as shown in Example 3.1.3. Finally, the iterations and operations need to be scheduled on the processor array. This is done using an affine transformation realizing different tiling schemes and is discussed in the next section.

3.1.2.4. Parallelization of Tiled Piecewise Linear Algorithms An important aspect of tiling involves allocation of PEs and global scheduling of iterations. Scheduling determines a cycle-accurate execution of the algorithm on the

71 3. Accelerator Generation: Loop Transformations and Back End

Control Conditionals Q d R0 R1 0 0 a Empty (0 0)T (0 0)T (0 0)T 0 1! T T T u i1 > 0 j1 > 0 E (1 1) (0 0) (0 0) ∧ T T T u i1 = 0 j1 > 0 i2 > 0 E (1 1) (1 0) (0 0) ∧ ∧ T T T u i1 > 0 j1 = 0 j2 > 0 E (1 1) (0 1) (0 0) ∧ ∧ T T T u i1 = 0 j1 = 0 i2 > 0 j2 > 0 E (1 1) (1 1) (0 0) ∧ ∧ ∧ T T T u i1 = 0 j1 > 0 i2 = 0 i3 > 0 E (1 1) (0 0) ( 1 0) ∧ ∧ ∧ T T − T u i1 = 0 j1 = 0 i2 = 0 j2 > 0 i3 > 0 E (1 1) (0 1) ( 1 0) ∧ ∧ ∧ ∧ − x E (0 0)T (0 0)T (0 0)T T T T y j1 = 0 j2 = 0 j3 = 0 E (0 0) (0 0) (0 0) ∧ ∧ T T T y j1 > 0 E (0 1) (0 0) (0 0) T T T y j1 = 0 j2 > 0 E (0 1) (0 1) (0 0) ∧ T T T Y j3 = 0 j2 = 1 j2 = 2 E (0 1) (0 0) (0 0) ∧ ∧ Table 3.2.: The iteration dependent conditions for the copartitioned FIR filter.

architecture and exploitation of parallelism in its data path. The LPGS and LSGP strategies of parallelization are associated with inner and outer loop parallelization, respectively. This is mathematically represented by the space-time mapping as given in Equation (2.11) and Equation (2.12). Similarly, copartitioning is represented by the space-time mapping given in Equation (3.1). We illustrate the copartitioning strategy of tiling and parallelization with help of the running FIR filter example.

Example 3.1.7 For the FIR filter in Example 3.1.3, a feasible space-time mapping of the copartitioned DPLA is shown in Equation (3.20). The schedule for the chosen copartitioning scheme such that iterations within the inner tile are executed sequen- tially on the same processor. Whereas the inner tiles are executed in parallel on different processors.

p = (i2 j2) λ = 3 1 4 3 8 (3.20)  

The description obtained after applying the above affine transformation to the output DPLA of the FIR filter in Example 3.1.3 is shown in Equation (3.21). One can derive

72 3.1. Loop Optimizations for Accelerator Tuning

a a Local Local From Controller 0 Controller Counter Mem U00 u u Global Mem Mem 0 Controller A00 A01 * *

+ + Mem PE p PE p y y 0,0 0,1 0 U00 0 Yout

Mem a a U10 PE p PE p From 1,0 1,1 Mem U10 Local Local Controller Controller u u control signals * * counter values data signals + + y y 0 0 Yout

Figure 3.7.:2 2 processor array architecture implementing the copartitioned FIR filter× according to Example 3.1.7. the description of the processor array architecture from the following description.

a[p1, p2,t] = a[0, j1,0, j2,0, j3] n UA(i1,i2,i3, j3) if j1 = 0 p2 = 0 ∧ u[p1, p2,t 4] if i1 > 0 j1 > 0  − ∧  u[p 1, p ,t 2] if i = 0 j > 0 p > 0  1 2 1 1 1  − − ∧ ∧  u[p1, p2 1,t 4] if i1 > 0 j1 = 0 p2 > 0 u[p1, p2,t] =  − − ∧ ∧  u[p1 1, p2 1,t 2] if i1 = 0 j1 = 0 p1 > 0 p2 > 0  − − − ∧ ∧ ∧ u[p1 + 1, p2,t 2] if i1 = 0 j1 > 0 p1 = 0 i3 > 0 − ∧ ∧ ∧  u p p t if i j p p  [ 1 + 1, 2 1, 2] 1 = 0 1 = 0 1 = 0 2 > 0  − − ∧ ∧ ∧  i3 > 0  ∧ x[p1, p2,t] = a[p1, p2,t] u[p1, p2,t] (3.21)  · x[p1, p2,t] if j1 = 0 p2 = 0 j3 = 0 ∧ ∧ y[p1, p2,t] = y[p1, p2,t 1] + x[p1, p2,t] if j1 > 0  −  y[p1, p2 1,t 1] + x[p1, p2,t] if j1 = 0 p2 > 0 − − ∧ Yout [p1, p2,t] = y[p1, p2,t] if j3 = 0 p2 = 1 j1 = 2 (3.22)  ∧ ∧ The obtained processor array architecture is shown in Fig. 3.7. The resulting archi- tecture is a 2 2 processor array. The boxes denote delay registers. The number of registers can be× obtained by multiplying the schedule vector λ with the dependency vector. E.g., the data dependency vector of u, (1 1 0 0 0 0)T leads to a connection

73 3. Accelerator Generation: Loop Transformations and Back End with four registers. The steps needed for performing an n-hierarchical tiling are summarized in Algo- rithm 3.1. After tiling of the iteration space of the DPLA, each data dependency is embedded in the new iteration space. Also, corresponding to each data dependency, the iteration-based condition is determined. Finally, the space-time mapping is cho- sen so as to define the allocation and scheduling. The user can specify the tiling and parallelization strategy in order to match the implementation to the architecture in terms of availability of PEs and memories.

Algorithm 3.1 An algorithm for n-hierarchical tiling of a DPLA Require: : Iteration space, n: hierarchy of tiling, P1,...,Pn: tiling matrices, DPLA: dynamicI piecewise linear algorithm Ensure: tiled: tiled Iteration space, DPLAtiled: tiled dynamic piecewise linear algo- rithmI code 1: tiled= Tiling( ,n,P1,...,Pn) I I 2: for all variables xi in equations Si of DPLA with dependencies, Q, d do 3: new dependencies, d(R0,...,Rn 1)=Expand( tiled,d,Q,n,P1,...,Pn) − I 4: for each distinct d(R0,...,Rn 1) do −tiled 5: compute condition space, =Expand( C , tiled,d,Q,n,P1,...,Pn) ICi I i I 6: Update new equations with new data dependencies and conditional space in the tiled DPLA code 7: end for 8: end for 9: perform space-time mapping

3.1.3. Results: Scalability and Overhead of Hierarchical Tiling In this section, we study the scalability of n-hierarchical tiling using different loop algorithms as benchmarks. The scalability study is done in order to learn about the effect of the iteration space size and its dimension on the execution time of the trans- formation. This is important because the code generation of statements inside the loop kernel depends on the scanning of the constraint polytope. The tiled iteration space is found applying the Fourier-Motzkin elimination once for the outermost tile (see Equation (3.4)) and simple matrix operations for inner tiles. In addition, the ef- fect of the hierarchy of tiling on the run-time of the transformation needs to be stud- ied. Figure 3.8 shows the execution time for performing of hierarchical loop tiling transformation for different loop programs like FIR filter (FIR), matrix multiplication (MMM), and edge detection (ED). The different versions 1 and 2 represent variants with different loop bounds of 100 and 1000, respectively. n denotes the number of hierarchical tiling levels. The experiments were carried out on an Intel Core2Duo processor running at 1.86 GHz with L2 cache size of 1 MB. It can be observed that

74 3.2. Controller Generation

100 n=1 n=2 n=3

10

1 execution time(s)

0.1

0.01 FIR1 FIR2 MMM1 MMM2 ED1 ED2

Figure 3.8.: Execution times for performing the hierarchical loop tiling transforma- tion for different loop programs like FIR filter, matrix multiplication, and edge detection. n denotes the number of hierarchical tiling levels. the execution time of hierarchical tiling for a given n is independent of the iteration space size of same loop program. However, different loop programs require different execution time because of the different number of data dependencies and loop state- ments. For the same loop program, the execution time increases with the number n of tiling levels, the reason being that the enumeration requires a conversion of the constraint polytope from the set of inequalities, i.e., half space representation to dual vertex representation (a linear combination of so-called lines, a convex combination of vertices, and a positive combination of extreme rays). This depends exponentially on the number of inequalities in the half space representation. The number of in- equalities in turn increases linearly with hierarchy n of tiling. The proposed method suffices for accelerator architectures with three levels of hierarchy in parallelism and memory model.

3.2. Controller Generation

In order to handle large scale problems, to balance local memory requirements with I/O-bandwidth, and to employ different hierarchies of parallelism and memory, a sophisticated transformation called hierarchical tiling was introduced in the previous section. A methodology for control generation that generates an efficient controller architecture is needed to control the execution of the operations within a processor array according to the determined allocation and scheduling, because of the following reasons:

75 3. Accelerator Generation: Loop Transformations and Back End

1. The tiling transformation not only increases the loop nest’s code size but also introduces a more complex control flow in the program. This can also be ob- served in Example 3.1.3, where the piecewise linear representation of the tiled FIR filter has several additional iteration dependent if conditions as compared to the initial program. This arises due to data reuse cases when tiling a loop algorithm. Therefore, control unit is required which depending on the current time steps synchronizes the data flow of data through the whole processor ar- ray. Hence, a control engine is needed for implementing the global schedule, λ (see Equation 2.10).

2. The access to external memory and FIFOs requires additional iteration depen- dent control signals and address generation.

3. The resource sharing inside each processing element due to limited availability of registers and functional units requires a local control unit that synchronizes the flow of data through the multiplexers. In other words, the control engine is responsible for implementing the local schedule, τ(vi) (see Equation 2.10).

Therefore, as discussed in Section 2.3.3.3, (a) iteration dependent control signals, (b) I/O control signals, and (c) internal control signals have to be generated by a suitable control engine. Control synthesis and optimization, although being a well studied problem, takes care only of the last part (i.e., local schedule) [135, 167]. The reason is that traditional high-level synthesis uses at most loop unrolling, which replicates the loop kernel so as to expose parallelism for hardware circuit implementation. How- ever, the first and second cases discussed above arise due to tiling of the loop program onto a regular array consisting of tightly-coupled processing elements. This leads to the problem of implementing the global schedule, which cannot be effectively dealt with by current high-level synthesis tools. The difficulty of the control generation problem is illustrated with help of the co- partitioned FIR filter in Example 3.1.3 on page 59. Each of the iteration dependent conditions after copartitioning as seen in the right hand side of the DPLA can be represented in one of the following forms2:

A1 I1 b1 A2 p b2 A3 I3 b3 (3.23) · ≥ ∧ · ≥ ∧ · ≥ A1 I1 + A2 p + A3 I3 b (3.24) · · · ≥ T The original iteration space co-ordinates I after tiling are given by Itiled = (I1,I2,I3) . Therefore, the calculation of memory addresses and control predicates, which are affine functions of Itiled as shown in Equations (3.23) and (3.24), can be done only if given the following:

2 Note, p = I2 directly follows from the definition of the space-time mapping for copartitioning, cf. Definition 3.1.1.

76 3.2. Controller Generation

• inner tile co-ordinates I1: For each given time step t and processor p, the inner tile co-ordinates I1 of the iteration vector being executed have to be known.

• tile co-ordinates I2: Due to the defined space-time mapping for copartitioning (cf. Definition 3.1.1 on page 57), p = I2. Therefore, the processor index is equivalent to tile co-ordinate I2.

• outer tile origin co-ordinates I3: For each time step t and processor p, the outer (GS) tile origin I3 of the iteration vector being executed is required. We generate a 2m-dimensional global counter which keeps track of the tile co-ordinates m m I1 Z and I3 Z . I.e., the counter implements the scanning code for the sequen- tial∈ tile-coordinates∈ according to the given loop matrix, which determines the global schedule. In the next subsection, we present an efficient methodology for the control synthesis.

3.2.1. Accelerator Control Engine: Architecture and Synthesis Methodology In the following, we propose a hierarchical control engine architecture consisting of a global and a local controller. In order to reduce the amount of control conditions, the purpose of global control is to calculate once the iteration conditions which are independent of the processor index in a global controller, and thus does away with the computation of the condition by each local controller of individual processor element. The most important part of the global controller is the global counter, which keeps track of the iteration vectors needed for calculating iteration dependent conditions and address generation according to the given schedule. The global controller also contains the global decoder, which calculates the iteration conditions being inde- pendent of the processor index (i.e., iteration conditions, which are executed in all processors). The counter and the global decoder signals are propagated into the pro- cessor array using appropriate delay registers. This architecture style of separating iteration conditions for execution in a local or a global controller leads to a consider- able reduction in area cost of the control path as also shown in [47, 43]. This concept avoids unnecessary control decoding in each PE which would have lead to additional logic resources. Only those iteration-based conditions that depend on the processor index are implemented by a local controller which is generated for each of the PEs. It also contains the internal controller, which takes care of the resource sharing, local schedule, and module selection. Figure 3.7 shows a 2 2 processor array realization of a copartitioned FIR filter of Example 3.1.3. The following× four steps constitute our approach for control path generation.

1. Scanning of the tiled iteration space: In this step, a global counter for producing values of the sequential iteration space variables is determined.

77 3. Accelerator Generation: Loop Transformations and Back End

2. Determination of PE types: This step finds processor regions of the same type based on iteration conditions. This helps in classification of control predicates for workload distinction between local controllers and global controller.

3. Initialization of local and global control signals: In this step, local, global, and internal controllers for the computation of control predicates are generated.

4. Propagation of control and iteration variables: Here, the requisite interconnect and corresponding delays required for the propagation of global counter and control signals are determined.

In the next subsection, we present each of the steps in detail.

3.2.1.1. Counter Generation

In the previous section, n-hierarchical tiling was introduced which transformed the m (n+1) m iteration space, tiling : tiled, Z tiled Z · . The tiled iteration space I → I I ∈ ∧I ∈ tiled is given by Equation 3.2. For a given loop program with iteration space, tiled = I T I I = (I1 I2 ... In+1) AI b , the global allocation and scheduling is assumed to be given by the following| space-time≥ mapping. 

I1 p E 0 ... 0 0 ... 0 . =  .  (3.25) t ! λ1 ... λk 1 λk λk+1 ... λn+1 ! − I  n+1  θ   It can be assumed| without loss of generality{z that all the loop} iteration variables apart from Ik are executed sequentially, whereas Ik is executed in parallel. Ik = p from Equation (3.25) implies that Ik for each processor is constant and is equivalent to the processor index p. In order to evaluate each iteration condition, the value of (I1,...,Ik 1,Ik+1,...,In+1) needs to be computed first. Since θ is in general not in- − vertible, it is not possible to determine the value of Iseq = (I1,...,Ik 1,Ik+1,...,In+1) from processor index p and time t. − Therefore, a counter is needed to scan the internal points of a tile which are exe- cuted sequentially, i.e., (I1,...,Ik 1,Ik+1,...,In+1). The purpose of this section is to synthesize a d-nested counter, which− produces values of the required iteration vari- ables seq (e.g., I1, I3 for copartitioning, I1 for LSGP, or I2 for LPGS) as specified by the globalI schedule. For the space-time mapping given in Equation 3.25 d = n. In a d-nested counter, each counter produces the m-dimensional iteration vectors corre- sponding to I1,...,Ik 1,Ik+1,...,In+1, respectively. For an efficient implementation of each of these d counters,− the concept of path strides was introduced [86].

78 3.2. Controller Generation

Definition 3.2.1 (Path Strides) The path strides of an iteration space I = (i1,i2,...,im) given by an m-dimensional parallelotope, are defined by a matrix ~S = (~s1 ~s2 ... ~sm), where ~si are vectors, which are added to the iteration point to get the next iteration. The step~s1 is added to iteration point I until it crosses the iteration space boundary in direction i1. The stride~s2 takes it back into the iteration space. Similarly, stride~s j is added to iteration point I if it crosses iteration space in direction i j. Such conditions determining the selection of strides are called stride conditions.

Let, the tiled iteration space seq = Iseq = (I1,...,Ik 1,Ik+1,...,In+1) and the cor- I − responding loop matrices (R1,...,Rk 1,Rk+1,...,Rn+1) be given. For the sequential part of tiled iteration space, we require− n-nested counters can be described which can be described without DO-loops [23] as follows:

I1 = I2 = ... = In+1 = 0 label1 : I1 = next1(I1) goto label (3.26) . .

labelk 1 : Ik 1 = nextk 1(Ik 1) goto label − − − − labelk+1 : Ik+1 = nextk+1(Ik+1) goto label . .

labeln+1 : In+1 = nextn+1(In+1) goto label label : (3.27) max I f (I1 = I ) (3.28) 6 1 goto label1 else I1 = 0 . . max I f (In 1 = I ) + 6 n+1 goto labeln+1

The function nextl determines the value of next iteration vector Il l with help of a path stride vector and stride conditions as follows: ∈ I

s ~s1 if C1(Il) s ~s2 if C2(Il) nextl : nextl(Il) = Il + . . (3.29) . .  s ~sm if Cm(Il)  The condition Cs(I ) selects the increment~s , which is added for obtaining the value of i l i the iteration vector corresponding to the next execution time step. The enable signal

79 3. Accelerator Generation: Loop Transformations and Back End

2 i2 2 6 10 42 46 50 i2 1 r1 1 5 9 13 17 21 41 45 49 53 57 61 1 2 i1 i1 0 4 8 12 16 20 24 28 32 40 44 48 52 56 60 64 68 72

r2 11 15 19 23 27 31 51 55 59 63 67 71

22 26 30 62 66 70

0 1 i3

Figure 3.9.: Execution of points within a tile defined by the iteration space given by Equation (3.30) and corresponding to the schedule λ = (4 3 40). The dashed arrows denote the path strides (i.e., increments). −

e C (Il) is generated and used to freeze the computation and to stop the processor array, if at a given time step, a iteration point lies in the iteration space Il and hence, needs to be computed. This is often the case for affine schedules for non-rectangular tiles as will also be shown in next example. The counter is incremented every II (Iteration Interval) cycles which gives the time interval between execution of two consecutive iterations. We explain the computation function next with the following example, before determining the requisite path stride and the corresponding stride conditions.

Example 3.2.1 Fig. 3.9 shows the iteration space of an inner tile I1 = (i1,i2,i3) given 3 6 0 by Equation (3.30). Let, loop matrix R = 3 3 0 defines the sequential exe-  −  0 0 1  3  cution first in r1 direction and then in r2 direction , which leads to the schedule vector λJ, s.t λ = (4 3 40). Let the iteration space be given by the following Z-polyhedron. − 3 6 0 0 3 3 0 0  −  i   0 0 1 1 0 A1I1 b1   i   (3.30) ≥ ⇔  3 6 0  2 ≥  24       − −  i3  −   3 3 0    24   −    −   0 0 1   1   −   −      3For determination of schedule vector for sequential execution, loop matrix is required whose col- umn vectors define the sequential order of execution. This is given by the user or can be deter- mined during scheduling [86]. Therefore loop matrix R1,...,Rk 1,Rk+1,...,Rn+1 are necessary − pre-requisite for sequentially executed iteration vectors I1,...,Ik 1,Ik+1,...,In+1. −

80 3.2. Controller Generation

Global loop counter Global decoder Global Controller Mod-II t Counter Counter 1 I1 Stride-LUT inc Enable

Enable logic 1 1 FUs Loop h s1 ... sm Counter 2 conditionals C I2 counters inc . FSM MUX regs

Decoder Cs . regs FU regs . FUFU regs Comparator PE(0) Counter n In n n s ... s inc 1 m FSM MUX regs regs FU regs FUFU regs PE(1)

Figure 3.10.: Hardware realization of the global counter for the processor array.

Therefore, a counter is required which produces the values of I1 = (i1,i2,i3) syn- chronously for each time step t = λ I1. The description of the counter in form of Equation (3.29), as verified from Figure· 3.9, can be written as follows:

label1 : I1 = next(I1) max I f (I1 = I ) (3.31) 6 1 goto label1 else I1 = 0

1 1 2 0 i1 8 − − −  1  if  1 1 0 i2  8   − ≥ −   0   0 0 3 i3  2     −    −         i i  2 1 2 0 i1 8 1 1  −  next i2 = i2 +  3  if  1 1 0 i2  8  ;      − − ≥ − i i  3 3  0   0 0 3 i3  2         −    −              8 1 2 0 i1 8  −  if  0   1 1 0 i2  8   − ≥  1 0 0 3 i 2     3      −    −  The stride conditions are given by the iteration space boundaries  in respective  direc-  tions. The iteration interval II is 1 for the example which implies that the counter updates with next function every cycle. The enable condition is given by Equation (3.30), i.e., when the counter value which lies in the iteration space.

After determination of the next function for iteration spaces I1,...,Ik 1,Ik+1,...,In+1 the counter can be generated with help of description in Equation 3.28.− Figure 3.10

81 3. Accelerator Generation: Loop Transformations and Back End shows the hardware architecture of a n-dimensional global counter. On the basis of the iteration vector values I1,...,Ik 1,Ik+1,...,In+1 produced by the n counters, the s − stride conditions Ci and enable conditions are evaluated in the decoder inside the global controller. Based on the stride conditions, the corresponding increments are selected from a stride look-up-table (LUT) and added to the counter variables. The modulo-II counter together with the calculated enable condition generates the enable logic for the counters. In order to obtain the path strides and stride conditions for the next function, we perform the following steps:

1. The tiled iteration space is transformed to an equivalent orthogonal space.

2. The strides and stride conditions are found in the transformed orthogonal space.

3. The strides are then transformed back to original space for the path strides and the stride conditions in the tiled iteration space.

As will be shown for the running example in Figure 3.11 on page 87, a transforma- 1 tion matrix pair T and T − is needed, which maps the tiled iteration space onto a 1 rectangular space 0 and back (i.e., T : 0 and T − : 0 ). The loop matrix R is equivalent to theI tiling matrix, whereI −→I the column vectorsI → contain I information on the scanning order. Furthermore, the images of the original iteration spaces must be m integers in the rectangular space, i.e., 0 Z . I ⊆ m m Lemma 3.2.1 If I0 = TI I Z , then I0 Z iff T is an integral matrix. ∧ ∈ ∈ Proof 3.2.1 see [77]  The following theorem gives the construction of the transformation matrix.

Theorem 3.2.2 Given, a loop matrix, R of dimension m. Then the transformation of the iteration space, to an orthogonal space, 0 is given by matrix, T, I I

det(R) 1 det(R) T = σ G R− , where σ = (3.32) det(G) det(R)   | | g1 0 ... 0 . . 0 g .. . G =  2 , where . .. ..  . . . 0     0 ... 0 gm    gi = gcd rij  j=1...m ∀ 

82 3.2. Controller Generation

Proof 3.2.2 Since column vectors of R are the side vectors of the tile. Therefore, 1 column vectors of T − should be parallel to column vectors of R. This also implies 1 that the row vectors of T are parallel to the rows of R− . Furthermore, since we map to a rectangular space in a standard right-handed coordinate system (where the x axis goes to the right and where the y axis goes up). 1 det(R) Hence, T = σVR− , where V is a m m diagonal matrix and σ = det(R) . × | | 1 adj(R) Also, R− = det(R) , where adj(R) and det(R) are the adjugate and the determinant T of R. By finding the greatest common denominator of each row of R = (r1 r2 ... rm) (ri is row vector of R) , then the matrix R can be rewritten as T T 0 0 0 R = (r1 r2 ... rm) = r1 r2 ... rm G, where  

g1 0 ... 0 .. .  0 g2 . . G = , g = gcd r . .. .. i ij  . . . 0  j=1...m   ∀   0 ... 0 gm    Furthermore,  

T 0 0 0 adj(R) = adj r1 r2 ... rm G     T 0 0 0 = adj(G)adj r1 r2 ... rm (adj(AB) = adj(B)adj(A))    Hence,

T 0 0 0 adj(G)adj r1 r2 ... rm 1 adj(R) T = σVR− = σV = σV   det(R) det (R)  T 0 0 0 adj r1 r2 ... rm 1 = σVG− det(G)    det(R)  Now using Lemma (3.2.1), one must select diagonal matrix V such that T is an inte- gral matrix. Therefore, if

det(R) V = G det(G) T adj r0 r0 ... r0 T det(R) 1 2 m Then, T = σ G G 1det(G)   = σ adj r0 r0 ... r0 , wh- det(G) −  det(R)  · 1 2 m ich is an integer matrix.   

83 3. Accelerator Generation: Loop Transformations and Back End

Hence, using Lemma 3.2.1, we infer that T according to Equation (3.32) is a valid transformation matrix. Similar transformation matrix has also been chosen in [86, 77] for scheduling. 

Since T is a non-unimodular matrix, the corresponding orthogonal space has holes. The non-filled points in the rectangular iteration space are called holes (see Figure 3.11). These points do not have an integer pre-image in the original iteration space. The next step is to find the strides and stride conditions for the next function in Equa- tion 3.29 for the transformed iteration space. Such a function will count the actual points and leaves out the holes. Formally, for the scanning code of the rectangular T T 4 0 0 0 0 0 0 0 0 space , I = i1 i2 ... im , the strides S = s1 s2 ... sm and the mutually exclu-  C0 I C0 I C0 I  sive stride conditions s1 ( ), s2 ( ),..., sm ( ) for selecting the strides must be found. The major challenge is therefore, (a) to select stride, si such that the next point is inte- gral, and (b) to find a stride condition C0 I which corresponds to the situation when si+1 ( ) taking stride si leads to an iteration point lying outside the tile boundary. Therefore, on satisfying C0 I , the stride s0 is taken. We use a property of Hermite normal si+1 ( ) i+1 form H0 of transformation matrix T that columns of H0 determines the lattice points of the boundary of fundamental brick (see Appendix B). Unlike other methodologies which use Fourier-Motzkin elimination [13], the method is more intuitive. The fol- lowing theorem is used to determine the requisite strides and stride conditions in the transformed hyper-rectangular iteration space, given the transformation matrix T.

Theorem 3.2.3 If H0 is the Hermite Normal Form of the transformation matrix T = 1 VR− . Then, the path strides are given by

v 1 v 1 0 0 1,1 0 0 1,1 0 h1,1 h1,2 − h1,1 ... h1,m h1,m 1 + − h1,1 − h10 ,1 − − − h10 ,1    .  .     0 . . 0 h2,2 . .   0  .. vm 2,m 2 1  S = 0 0 . h0 h0 + − − − h0  m 2,m m 2,m 1 h0 m 2,m 2   − − − − − m 2,m 2 − −     − −     . .. .. vm 1,m 1 1  . . . h0 − − − h0  m 1,m h0 m 1,m 1   − − m 1,m 1 − −    − −    0 ... 0 h0   mm   (3.33) 

Proof 3.2.3 The Hermite normal form (see Appendix B) H0 of the transformation matrix T is given by

4 Such scanning code for nexti function must be found for all i, i = 1,...,k 1,k + 1,...,n + 1 I −

84 3.2. Controller Generation

0 0 0 h1,1 h1,2 ... h1,m . . 0 h0 .. . 0 ~ 0 ~ 0 ~ 0  2,2  H = h1 h2 ... hm = ...... 0  . hm 1,m     −   0 ... 0 h0   m,m    0 Intuitively, the diagonal element h j, j gives the increment of the iteration variable i j T 0 0 0 0 for the nested update of I = i1 i2 ... im [74, 77, 189]. Therefore, the first path T 0 0   stride is~s1 = h1,1 0 ... 0 . The corresponding stride condition when the iteration vector lies within the rectangular domain is given by i0 < v1 1 1 ... i0 < vm m 1. 1 , − ∧ ∧ m , − Therefore, the stride condition corresponding to~s1 is

T T i0 i0 ... i0 ((1 v11)(1 v22) ... (1 vmm)) (3.34) − 1 − 2 − m ≥ − − −   whereas in the column vector,~h j, the non-diagonal element is equivalent to the offset to the adjacent iteration in direction, i j [74, 77, 189] (see also Hermite normal form in Appendix B). According to the definition of path strides (see Definition 3.2.1), the first stride is taken as long as iteration point I crosses the iteration space. For the transformed rectangular iteration space, the corresponding boundary of the iteration space is i1 v1,1 1. Now the last iteration point in direction i1,Ii1 does not neces- ≤ − T sarily lie on the boundary and is multiple of ~s0 . Therefore, Ii = x h0 0 ... 0 . 1 1 · 1,1 v 1 0  1,1  The value x should minimize x h1,1 (v1,1 1), which leads to x = − . There- · − − h10 ,1   fore, the path stride in direction i2 can be obtained as the difference of the offset in direction i2 and the last iteration in direction i1. Therefore,

v 1 0 1,1 0 h1,2 − h1,1 h10 ,1 h0      2,2  0 ~s2 = . . −  .     .       0   0          The corresponding stride condition for stride~s2 is given by,

T T i0 i0 ... i0 ((v11 1)(1 v22) ... (1 vmm)) (3.35) 1 − 2 − m ≥ − − −   For lexicographic scanning in other directions, il, l > 2, an offset of hi,l 1 needs to be added to for all rows, i l 2. Hence by induction and similar reasoning,− − we ≤ −

85 3. Accelerator Generation: Loop Transformations and Back End obtain v 1 0 1,1 0 h1,m 1 + − h1,1 − − h10 ,1  .   .   ~  vm 2,m 2 1  ~sm = hm h0 + − − − h0  m 2,m 1 h0 m 2,m 2  −  − − − m 2,m 2 − −    − −   vm 1,m 1 1  − − − h0   h0 m 1,m 1   m 1,m 1 − −    − −    0    Therefore, we obtain the path stride vector for the rectangular space as columns of S0 = (~s1 ~s2 ... ~sm) as in Equation (3.33). The corresponding stride condition for stride vector ~s j is given by,

T 0 0 0 0 T i1 ... i j 1 i j ... im (v1,1 1) ... (v j 1,, j 1 1)(1 v j, j)... (1 vm,m) − − − ≥ − − − − − − (3.36)     In the final step, the path strides need to be transformed back to the original domain. This is simply done by transforming the path stride in the rectangular space to the original domain as follows:

1 S = T − S0 (3.37)

s 0 Similarly, the stride condition for s j, Cj(I ) is given by

T diag(1 ... 1 1 ... 1) T I diag(1 ... 1 1 ... 1)(v1 1 1 ... vm m 1) − − · · ≥ − − , − , − j 1 j 1 − − (3.38) For illustration|{z} purposes, we show the derivation|{z} of path strides and stride conditions using the running Example 3.2.1 as follows:

Example 3.2.2 For the iteration space and the loop matrix given in Example 3.2.1, using Theorem (3.2.2) and (3.2.3) gives

1 2 3 0 0 9 9 0 1 2 0 det(R) 1 27 1 1 T = σ GR− = 1 − 0 3 0 − 0 = 1 1 0 det(G) − · 9   9 9   −  0 0 1 0 0 1 0 0 3       V  1    R− | {z }| {z }

86 3.2. Controller Generation

r1

j2 2 6 10

1 5 9 13 17 21 T j1 0 4 8 12 16 20 24 28 32 1 T − 11 15 19 23 27 31

′ j2 22 26 30

r2

′ j1

Figure 3.11.: (a) Iteration space of original tile (b) Transformed orthogonal domain, where iteration point are represented by black points.

1 2 0 3 1 0 H = HNF 1 1 0 = 0 1 0  −    0 0 3 0 0 3      T    It may be noted that the first, the| second,{z and the} third column of the Hermite normal 0 0 0 form are equivalent to offset in direction i1, i2, and i3 of transformed orthogonal domain. After using Theorem (3.2.3), we get the path stride matrix for the rectangular domain as

3 1 8 3 0 1 + 8 3 3 8 8 − 3 − − 3 − − S = 0 1 0 8 1 = 0 1 8    − 1      −  0 0 3 0 0 3           After transforming back to the original domain, we get the path stride matrix as

1 2 3 3 0 3 8 8 1 2 8 0 1 1 − − − − S = − 0 0 1 8 = 1 3 0  3 3  −   −  0 0 1 0 0 3 0 0 1  3      1  S    T − Each column of| the path{z stride} matrix| denotes{z a} corresponding stride. The stride conditions are obtained for s1, s2, s3 by applying Equation (3.38). Therefore, the counter can constructed as given in Example 3.2.1.

87 3. Accelerator Generation: Loop Transformations and Back End

The counter generation methodology is summarized in Algorithm 3.2. It requires as input all the loop matrices R1,R2,...,Rn+1 of the tiled hierarchically tiled itera- tion spaces 1, 2,..., n+1 which are to be executed sequentially. The novelty of the methodologyI forI counterI generation proposed in this section is that it encompasses all possible tiling techniques (i.e., LPGS, LSGP, copartitioning and other hierarchical tiling methods) for congruent parallelepiped tiles. It must be noted that the rect- angular tiles are a special case of our methodology. It must also be noted that the methodology is valid only if there exists a linear affine schedule.

Algorithm 3.2 Algorithm for scan counter generation Require: Loop matrix, R1,R2,...,Rn+1 Ensure: Scan counter according to Equation (3.28) 1: for all Loop matrix, R1,R2,...,Rn+1 do 2: Determine transformation matrix, T using Equation (3.32) 3: Determine Hermite normal form of T, H0 = HNF(T) 4: Determine strides ~s0 as columns of S0 in the transformed space using Equa- tion (3.33) 0 5: for all stride,~si do 6: Determine stride condition using Equation (3.38) 7: end for 8: Transform the stride matrix and stride condition back to the original space using Eqs. (3.37) and (3.38) 9: Determine the next function as given in Equation (3.29) 10: end for 11: Construct scan counter as in Equation (3.28) 12: Update the Scan Counter every II cycles, where II is the iteration interval

3.2.1.2. Determination of Processor Element Type The main aim of the determination of PE types is to

• Separate predicates which can be executed by a global controller from those which must be locally computed.

• Generate specialized hardware implementations for each individual processor type, which contain only the required functionality to execute all equations defined for the respective processor element type. For example, in a systolic array, only the border processors are supposed to communicate with external memory.

Without the determination of different PE types, the methodology would implement the local control model, i.e., all predicates would be computed in a local controller

88 3.2. Controller Generation inside each PE. The following strategy is used for the separation of predicates to be implemented in a local or a global controller. The iteration dependent conditions based on processor index p are of the form as in Equation ( 3.24 on page 76). They are decoded by a local controller in each PE, if A2 = 0. Control conditions of types 6 as in Equation (3.23) and in Equation (3.24) (only if A2 = 0) are decoded in a global controller. However, processor regions associated with global control signals have to be iden- tified. The first step of processor type classification is to calculate for each equation Si of the DPLA and its corresponding global condition I , the set of processors ICi (I) S , which will execute this equation at least once. That is P i

S = p : I I , p = Q I + q (3.39) P i ∈ P ∈ I ∩ ICi (I) · n o The problem of identifying the processor regions is equivalent to finding a non- intersecting set of polyhedra associated with the processor space of the corresponding Sk statement. The different processor types i, where = i, i j = 0/ : i = j P P i=1 P P ∩ P 6 can be obtained by finding intersection sets of the S . The quadratic complexity of P i finding intersection sets using pairwise comparison of all Si is in practice feasible for real world examples [93]. P

Example 3.2.3 For the DPLA of the FIR filter and the selected space-time map- ping in Example 3.1.7 on page 72, the accelerator architecture has four proces- sor types depending on the regions, PE1 ( 1 = p : p1 = 0 p2 = 0 ), PE2 P { ∈ P ∧ } ( 2 = p : p1 > 0 p2 = 0 ) , PE3 ( 3 = p : p1 = 0 p2 > 0 ), and P { ∈ P ∧ } P { ∈ P ∧ } PE4 ( 4 = p : p1 > 0 p2 > 0 ), respectively. This is obtained by intersec- tion ofP iteration{ ∈ conditions P which∧ depend} on the processor index.

3.2.1.3. Global and Local Controller Unit A loop program has several iteration dependent conditions due to data-reuse, mem- ory access, and conditional execution. This section deals with the automated control unit generation. The if conditions also known as housekeeping code, describe the conditional execution of the recurrence equations. The conditions are classified into processor independent (type G, i.e., global control) and dependent parts (type L, i.e., local control). The if conditions for copartitioning under type (L) are characterized by a processor index dependent equation (given A2 = 0) as in the following Equa- tion (3.40): 6

˜ ˜ T 3 m if I , = I = (I1 I2 I3) Z · A1 I1 + A2 I2 + A3 I3 b (3.40) ∈ I I { ∈ | · · · ≥ } T The “if” conditional under type (G) is described in the space I = (I1 I3) , explicitly describes those iterative conditions that are independent of processor index p (since

89 3. Accelerator Generation: Loop Transformations and Back End

p = I2 for copartitioning) as shown in Equation (3.41) and can therefore be imple- mented by a global controller which checks for the following iteration condition. ˆ ˆ T 3 m if I , = I = (I1 I2 I3) Z · A1 I1 b1 A2 I2 b2 A3I3 b (3.41) ∈ I I { ∈ | · ≥ ∧ · ≥ ∧ ≥ } The processor independent conditions are evaluated in the global controller. The processor dependent condition need to be evaluated within the local controller of each PE. Therefore, for a DPLA, the iteration dependent conditions can be represented in the DPLA equations as follows:

1 1 1 (...,x j[I d1 1]...) if I ˜ (L) I ˆ (G) F1 − , ∈ I1 ∧ ∈ I1 . . x1[I] =  . .  W1 W1 W1  (...,x j[I dW 1],...) if I ˜ (L) I ˆ (G) F1 − 1, ∈ I1 ∧ ∈ I1 . . . (3.42) . . . 1 1 1 (...,x j[I d1 K],...) if I ˜ (L) I ˆ (G) FK − , ∈ IK ∧ ∈ IK . . xK[I] =  . .  WK WK WK  (...,x j[I dW K],...) if I ˜ (L) I ˆ (G) FK − K , ∈ IK ∧ ∈ IK  T where I = (I1 I2 I3) denotes the loop iteration vector after copartitioning. For each equation, iteration conditions are separated into processor independent and dependent parts on the basis of an AND term and the space-time mapping information. The global iteration conditions are then evaluated within a global decoder using logic consisting of adders, multipliers, and comparators. These signals are then propagated to the local controller of each PE as will be shown in next section. A local controller not only decodes processor dependent iteration conditions but is also responsible for orchestrating the local schedule given by τ(vi), i.e., the start time of all operation in loop body (see Equation (2.10)). So additional logic is required to assure the correct execution behavior. In an iterative schedule, the iteration interval II defines the number of time steps between the start of two subsequent iteration points. Since during every iteration interval, the same sequence of control signals must be generated, the required control functionality can be implemented by a modulo-II counter [16, 93]. Its output is connected to a decoder logic, which generates the control signals for (a) resource sharing of functional units and multiplexers, (b) FUs supporting multiple operations, and (c) a clock enable signal for storing intermediate results in registers. This modulo-II counter could also be implemented globally; however, cost analysis has shown that the required delay registers would be more expensive than the local modulo-II counters. Unlike the memory resources, the size of the control unit is independent of the tiling parameters. The local and global control units are problem size independent, as the number of control variables is independent of the number of iteration points in the iteration space. The propagation of the global control signals and counter variables to the individual PEs is discussed in the next section.

90 3.2. Controller Generation

3.2.1.4. Propagation of Global Control and Counter Signals

The counter variables denoting the iteration vectors I1,...,Ik 1,Ik+1,...,In+1 and the global control signals need to be propagated through the processor− array to the PEs using interconnect delay registers; therefore, all the PEs are connected to the global controller with an appropriate number of delay registers, ∆d. The number of delay registers for PE p is equal to the number of time steps required by the global signals to travel to processor element p, i.e.

∆d = tmin(p) = λpar p

λpar is the part of the schedule vector which corresponds to the iteration space, k mapped onto processors. The signals must be synchronized with the start-time of theI PEs, tmin(p). The start time of the processor array is assumed to be zero, i.e., tmin = 0. Similar to data signals, the control signals are also propagated into the processor array, where each PE receives the global control signal from the neighbouring PE, (p dp), where dp denotes link (interconnect) to neighbouring processor. Thereby, the− regularity of the processor array is maintained. The length of the interconnect delay registers is given by the following equation:

∆dp = λpardp

The control interconnection network is given by the propagation direction of each processor dp. For the processor array, the iteration variables and the global control signals for the PE at origin are directly taken from the counter and global controller, respectively. All the possible propagation vectors dp are chosen as orthogonal inter- T T connection links, i.e., dp = (1 0) ,(0 1) ,... in case of 2-d processor array. { }

Example 3.2.4 The global counter produces the value of sequential iteration vectors I1 = (i1, j1) and I3 = (i3, j3) in Example 3.1.7 on page 72.For the selected space- time mapping, almost all the control conditions are of the type (G) as in Equation (3.23). For example, c1 = ( j1 = 0) and c2 = ( j2 = 0) are to be evaluated as iteration condition c1 c2. c1 is processor independent and c2 is processor dependent as I2 ∧ is mapped onto processor index. Therefore, c1 is evaluated in global controller and propagated to local controller of the PEs for further processing. c2 is true only for PE(0,0) and PE(1,0) and is set to false in other PEs. The local controller then evaluates the condition c1 c2. Almost all iteration conditions for the example are evaluated in the global controller∧ and propagated into the processor array. For the selected space-time mapping in Example 3.1.7 on page 72, the interconnect delay can be calculated as follows:

91 3. Accelerator Generation: Loop Transformations and Back End

1 ∆(1 0)T = 4 3 = 4 0 !   0 ∆(0 1)T = 4 3 = 3 1 !   Therefore, the control signals are propagated with delay of 4 and 3 cycles in p1 and p2 direction, respectively. The key characteristic of our methodology is the use of combined global and lo- cal control facilities. This hybrid version of the control path lies between a com- plete global control path for SIMD architectures and a local control path for multi- processor architectures. This strategy reduces the required control overhead area at the cost of delay registers for propagation and improves the clock frequency of the hardware accelerator [43].

3.2.2. I/O Communication Controller The last few decades have seen a significant increase in memory access in terms of processor cycles. This conventional wisdom is also known as “Memory Wall”. The accelerators also need to address this problem in order to hide the memory latency and provide high bandwidth. Therefore, the streaming workloads must be stored in local buffers, memory banks, or scratch-pad memories close to the accelerators. In addi- tion, controller engines for address generation and synchronization of I/O communi- cation for the local buffers are required. Its automatic generation requires knowledge of the undertaken scheduling, tiling, and allocation strategies. In this section, we present a methodology for the generation of custom memory architectures and their corresponding I/O controllers.

3.2.2.1. Buffer Modeling and Synthesis The problem of buffer modeling is to generate a custom memory architecture for parallel access of data by the hardware accelerator. To this end, memory mapping considers the question, how to distribute I/O arrays or variables to different physical memories constituting the custom memory architecture? In case of hardware accel- erators, the memory mapping is determined by the chosen space-time mapping. As discussed in Section 2.3.2.1, the space-time mapping determines the allocation of the loop iterations to the PEs through the allocation matrix, Q. This allocation is in turn decided by the tiling strategy. Each I/O variable and its dimension are defined with keywords in and out in the DPLA program specified in the PAULA language (see program on page 53).

92 3.2. Controller Generation

Definition 3.2.2 (Processor mapping of I/O variables) The iteration conditions, I ∈ C corresponding to I/O variables, xi in the loop program define the index space of I i the I/O variables. The corresponding processor space for the I/O variable, xi can be obtained by the following mapping f :I P. →

i = p p = Q I I C (3.43) P { ∈ P| · ∧ ∈ I i } where is the processor space of the hardware accelerator. P The task of memory mapping is to generate memory modules for the parallel access of array elements to the corresponding processor elements.

Definition 3.2.3 (Memory mapping of the I/O variables) is defined as memory mod- ules, which feed contained I/O data elements to the processor elements on which they are mapped. The memory space of I/O variable, xi is given by the following mapping g : I M → n i = m Z m = Q I I Ci (3.44) M { ∈ | · ∧ ∈ I } where for each element m in the memory space of I/O variable i, i, a memory connected to corresponding processor p is generated. M The reason for the same mapping matrix is that in our model, I/O variables with same chosen matrix Q are mapped onto processor and memory which requires minimum communication.

Example 3.2.5 Consider the FIR filter in Example 3.1.7 with the given space-time mapping. Using Equation (3.44), one can determine the memory space of variable U, A, and Y. The memory space of variable, U is U = 0 i2 1 j2 = 0 . Therefore, there are two memory banks containing valuesM of{ U≤ and≤ are∧ connected} to PE(0,0) and PE(1,0), respectively. Similarly, there are two memory banks for variable A as can also be seen in Figure (3.7). The approximate heuristic for setting the buffer sizes uses the value of the number of iterations corresponding to I/O variable for a single processor element. Therefore, given the maximum number of iterations of I/O variable xi mapped on a PE requiring the I/O variable, is the approximated size of the buffer. In the next subsection, we discuss address generation, FIFO control, and synchro- nization mechanisms. The memory modules can be generated in two different modes: FIFO and addressable memory. In case of fine granular HW-HW communication be- tween accelerators, the memory can be synthesized or configured in FIFO mode, as it is viable for fine granular communication between accelerators. The HW-SW communications between accelerator and processors can take place over different communication alternatives like buses and others. However, different rates of data

93 3. Accelerator Generation: Loop Transformations and Back End production and the control overhead for the exchange of status signals for each data- transfer would be a significant bottleneck. This recommends applying a burst model of data transfer. The configuration of memory as addressable memory (RAM) is apt for such a scenario.

3.2.2.2. I/O Controller Synthesis In the previous section, the methodology for buffer generation and for parallel access of I/O data was presented. In this section, we generate the signals for accessing the buffers according to a chosen schedule. In case of FIFO mode, these signals are read/write (R/W) enable and valid signals. For the RAM interface, the address values are needed in addition to the R/W enable signals. It must be noted that in case of FIFO mode, the data is stored in memory according to the schedule determining the consumption of data elements. In the next chapter, a communication primitive called multi-dimensional FIFO is presented which reorders the data elements in case of HW-HW communication according to given schedule. All the information pertaining to the generation of memory control signals is ob- tained from the statements in the given loop specification, which use/define the I/O variables. The control signals are generated with help of output Boolean signals eval- uated according to the I/O variables iteration condition. Whereas, the address gener- ation unit depends on the affine access function of the I/O variables. The evaluation of an address requires the values of the loop counter variables, which are provided by the global counter. The evaluation of the iteration condition corresponding to the I/O statement takes place in the global controller, which subsequently propagates the read/write enable signals to the I/O controller. The I/O controller generates the necessary signals for the local buffer. The scheme of the I/O controller is shown in Fig. 3.12(a). The mutual components of the I/O controller for both FIFO and addressable memory (RAM) are • Modulo-II counter: Since an I/O access takes place every II cycles. • Hold signal: In case the counter is disabled, the I/O access is not allowed. This situation arises in case of non-tight schedules, where a new iteration is not executed every II cycles. • Global control signals such as conditions from the decoder and the counter enable signals are evaluated together for determining valid I/O access. The counter enable indicates whether the values of the loop counters are still in the iteration space. I/O access will be forbidden if this condition is not satisfied. The logical conjunction of the common signals is used for enabling or disabling the I/O controller. The left and right lower component on different sides in Fig- ure 3.12(a), uses a common signal and other inputs to generate signals for the FIFO and the addressable memory mode, respectively.

94 3.2. Controller Generation

(a) (b) I/O Controller valid(A) Modulo-II AND AND CE Counter conditionals FIFO Enable AND read controller (A) enable enable hold empty FIFO(A) PE(0) FIFO controller Address generator status flags FIFO CAST State check Loop read logic counter enable

AND empty ADD FIFO(A) PE(1)

Control Signals: valid re/we address re/we Data Signals:

Figure 3.12.: (a) Detailed component view of the I/O controller for both FIFO and RAM mode. (b) The valid signal for the computation kernel. The short lines perpendicular to the interconnection signals denote delay registers.

The valid signal from the FIFO state check logic shown in Fig. 3.12, is responsible for identifying the state of the computation kernel. That is, if any FIFO of an I/O variable reaches the empty or the full state, the FIFO cannot be read or written any more. All valid signals guarantee that the functional units inside the computation kernel can execute further even though some input/output FIFOs are empty/full, if the I/O access to the FIFOs is not required. Only when the I/O controller generates the valid signal and the common signal is true, the read or write enable signal is generated. In case of the RAM interface, the enable signal is generated by the common com- ponent containing the modulo-II counter as shown in Figure 3.12. All loop counter signals which correspond to the value of the iteration variables are used in the address generation unit. The cast component is used, if the data type of the counter value does not has the same bit-width. The address of the input and the output data corresponds to its iteration position in the polytope. Each input and output variable can be mapped onto multiple memory banks. The schedule order of a variable access is same for all memory banks. Hence, only a single controller or address generation unit for each variable is needed and not for each memory bank corresponding to the variable. The memory signals are propa- gated with delays to the memory banks. These delays correspond to the offset of the start-time of the neighbouring memory modules, which belong to the same variable. The length of the delay register is defined by the equation, td = min λ (I1 I2) , { · − } where I1 and I2 denote the iteration vectors accessing the neighboring memory banks of the same I/O variable. This optimization avoids requiring a dedicated I/O con- troller for each memory bank. We use our example of the bilateral filter to illustrate


[Figure 3.13; the bilateral filter loop program shown in the figure:]

variable A in 2 integer <16>
variable U in 2 integer <16>
variable Y out 2 integer <16>
FORALL (i >= 0 and i <= M-1)
 {FORALL (j >= 0 and j <= M-1)
  {FORALL (m >= 0 and m <= N-1)
   { a[i,j,m] = A[0,j,m]*LUT(u[i,j,m]-u[i,j,0])
     IF (m==0) THEN
     { u[i,j,m] = U[i,j]              // Read input
       z[i,j,m] = a[i,j,m] * u[i,j,m]
       s[i,j,m] = 0 + a[i,j,m]
       y[i,j,m] = 0 + z[i,j,m]
     }
     ELSE
     { z[i,j,m] = a[i,j,m] * u[i,j,m]
       s[i,j,m] = s[i,j,m-1] + a[i,j,m]
       y[i,j,m] = y[i,j,m-1] + z[i,j,m]
       IF (i==0) THEN
         u[i,j,m] = 0
       ELSE
         u[i,j,m] = u[i-1,j,m-1]      // data reuse
     }
     IF (m == N-1)
       Y[i,j] = y[i,j,m]/s[i,j,m]     // Output
   }
  }
 }

Figure 3.13.: (a) Data flow graph with numbers indicating the start times of the iterations. (b) Corresponding accelerator with processor array, memory, global and I/O controller subsystem.

Example 3.2.6 The corresponding data flow graph of the bilateral filter is shown in Figure 3.13(a) with M = 8 and N = 4. The variables A, U, and Y denote the input mask coefficients, the input image pixels, and the output image pixels, respectively. The iteration conditions determine the conditional execution of statements, including the input and output assignment statements. The tiles represent the tiling of the iteration space, which leads to the allocation of two corresponding processors. Each processor executes the points within its tile sequentially. Formally, this introduces extra loop dimensions m1 and m2 instead of m, similar to loop blocking [158]. Due to the selected tiling strategy, the iteration variables m2, i, and j are executed sequentially, whereas iterations with the same m1 are executed on the same processor, whose processor index is given by the value of m1 (i.e., p = m1). The scheduling for the data flow graph in Figure 3.13 gives λ = (4 32 3 2) for the iteration vector I = (i j m1 m2). The obtained iteration interval is II = 2, which can be explained by the availability of only a single functional unit for the critical operation. For this example, the input and output variables U and Y are associated with the iteration conditions m == 0 and m == N−1, respectively. This corresponds to m1 == 0 ∧ m2 == 0 and m1 == 1 ∧ m2 == 1 for N = 4 in the tiled program. Hence, the input variable U and the output variable Y are each mapped to a single memory bank, connected to PE(0) and PE(1), respectively (p = m1), corresponding to the processor-dependent part of the conditional, i.e., m1 == 0 and m1 == 1. The connections of the memory banks to the PEs are also shown in Figure 3.13(b). The global decoder generates the control signals corresponding to m == 0 and m == N−1 for the input and output FIFO controller, respectively (i.e., m2 == 0 and m2 == 1, respectively). The conditional is true every 4th cycle. In case of a FIFO interface, as shown in Figure 3.13(b), variable A is mapped onto PE(0) and PE(1). The read enable for each FIFO and its

empty flag are required to feed back its state to the I/O controller for A. All these signals, along with the clock enable (CE) from outside, are evaluated to determine the clock enable signal (CE) of the computation kernel. The loop counter variables m2, i, and j are used for the address generation for the memories U and Y. The execution order of the processor array is m2, i, j, which is defined by the schedule vector λ, where p = m1 and m2, i, j are executed sequentially. The address of variable U is given by m2 + (N/2 − 1) · i + (N/2 − 1) · M · j.

The RTL functionality is verified by an auto-generated VHDL testbench with input and output data created by a functional simulation of the accelerator. The accelerators with I/O interface can exist independently on an FPGA or be coupled to a processor in an SoC. The methodology for SoC integration will be presented in Chapter 4.
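As a small illustration, the affine address computation for variable U can be written as in the following sketch; it assumes the address expression as reconstructed above, and the function name addr_U is hypothetical.

/* Affine address generation for variable U of the bilateral filter
 * example, assuming addr = m2 + (N/2 - 1)*i + (N/2 - 1)*M*j.
 * For M = 8, N = 4 and m2 = 0, this yields addr = i + 8*j, i.e.,
 * the 64 addresses of the 8x8 image in column-major order. */
static inline int addr_U(int i, int j, int m2, int M, int N)
{
    const int si = N / 2 - 1;        /* stride along i */
    const int sj = (N / 2 - 1) * M;  /* stride along j */
    return m2 + si * i + sj * j;
}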

3.3. Results

In this section, we show the benefits of our methodology in terms of automated loop accelerator generation for a wide class of algorithms. Our design tool has also been used to quantify the effect of loop tiling and the requisite iteration interval on control overhead in terms of logic area, power, and performance.

3.3.1. Embedded Computation Motifs

In [7], underlying communication and computation patterns common to many applications and standard benchmarks were identified. These numerical patterns were classified into 13 groups called dwarfs. We are interested in loop accelerators for systems-on-chip. Hence, from the embedded computing standpoint, the following classes of patterns have been selected as benchmarks:

• Dense Matrix Algorithm: complex matrix-matrix multiplication

• Structured Grid Algorithm: edge detection, bilateral filter

• Spectral Algorithm: discrete cosine transform

• Dynamic Programming Algorithm: Smith/Waterman algorithm

• Combinational Algorithm: cyclic redundancy check

The dense matrix algorithms from linear algebra, like matrix-matrix multiplication, matrix-vector multiplication, QR decomposition, singular value decomposition, and others, are very often found in signal and image processing systems. These algorithms use multiple matrices as data structures and are characterized by regular data accesses. The complex matrix-matrix multiplication (N = 64), i.e., a matrix-matrix multiplication whose matrix elements are complex numbers, is used to quantify the performance, area, and power overhead for this class of algorithms.



Figure 3.14.: Area and power consumption of the loop accelerators for the different loop benchmarks.

The structured-grid algorithms, like edge detection, bilateral filters, image correlation, and other image processing algorithms, have a regular grid as data structure, where each point is updated by a computation characterized by spatial locality, that is, the computation depends on the values of neighboring grid points. An algorithm may also sample only a subset of the grid, as in downsampling. The image processing algorithms discrete wavelet transform (DWT) and downsampling (DS), and, as a signal processing algorithm, a 5-tap FIR filter, are considered as benchmarks. Images of dimension 512 × 512 are used. Furthermore, the algorithms are characterized by complex border processing schemes.

Spectral algorithms like DCT, FFT, IDCT, and IFFT convert data from the spatial domain to the frequency domain, or vice versa. The discrete cosine transform (DCT) is a well-known signal processing algorithm used in compression and can be computed as a two-stage matrix multiplication X = C^T · x · C. However, the unique pattern of the constant elements in the matrix C allows saving several operations. Furthermore, the separability of the 2-D transform into two 1-D DCTs, a horizontal DCT (hDCT) and a vertical DCT (vDCT), reduces the number of operations. Therefore, we consider an hDCT for an 8 × 8 image block as loop benchmark, as it is used in compression schemes like Motion JPEG [146].

Dynamic programming is a method of solving complex problems by breaking them down into simpler steps. The Smith/Waterman (SW) algorithm is used in bioinformatics for sequence alignment, which determines similar regions between two DNA chromosomes or protein sequences. The major part of the total calculation time is taken by the similarity matrix score calculation [3]. Therefore, this loop is selected as the benchmark for this class of algorithms.


Algorithm      CMM     DWT    DS     FIR    DCT    SW     CRC32
Speed-up       215     50     30     17     12     27     3
PAULA (LoC)    115     121    68     58     121    67     79
VHDL (LoC)     26205   6904   2368   8124   9122   9150   2172
Prod. gain     227     57     35     140    75     136    27

Table 3.4.: Speed-up and productivity gain in terms of lines of code of our design methodology.

Combinational logic algorithms exploit bit-level parallelism by performing operations on multiple data. Examples are the DES (Data Encryption Standard) and cyclic redundancy check (CRC) algorithms. The cyclic redundancy check, or CRC, is a technique based on polynomial arithmetic for detecting errors in digital data. As benchmark for this class of algorithms, we use CRC-32, which works on 32-bit messages [179]. The PARO programs of these benchmarks are shown in Appendix C.

Two other motifs often seen in embedded computing are graph traversal and finite state machine (FSM) algorithms. Graph traversal algorithms, like collision detection, quicksort, and others, have little computation and many levels of indirection when visiting the nodes of a graph. The traditional data dependence analysis for our class of algorithms is not able to model indirections through pointers. Furthermore, FSM-like algorithms, like Huffman decoding, are sequential in nature and therefore not of interest.

The accelerator configuration of the complex matrix-matrix multiplication (CMM) is a 4 × 4 PE array performing complex number multiplications, with a local memory for data reuse of the matrices. The DWT performs a 5/3 lossy compression on 16-bit images of dimension 512 × 512; the accelerator has an iteration interval of 1 and enough functional units in a single PE to perform 15 multiply-add and shift operations in parallel. The downsampler accelerator outputs a single pixel for each 2 × 2 image window and has an iteration interval of 1. The DCT accelerator contains several addition, subtraction, and multiply units in a single PE, which process 8 pixels with an iteration interval of 1. The accelerator for the Smith-Waterman algorithm is a 1 × 8 processor array with identical cascaded PEs performing arithmetic and max operations. The CRC-32 accelerator processes each 32-bit message in 8 cycles.

The area and power requirements of the corresponding accelerator implementations of all the above algorithms on an FPGA are shown in Figure 3.14. All synthesis results are obtained using Xilinx ISE 9.2 on the Xilinx Virtex-2 FPGA (xc2v8000-4-ff1517). Microblaze is a soft-core processor for SoCs realized on Xilinx FPGAs. The area and dynamic power of the Microblaze clocked at 50 MHz are 4062 slices and 1480 mW, respectively [187].


Algorithm      MMM    BPF    IDCT (II=16)   SOBEL
Overhead (%)   12.5   16.2   46.5           8.2

Table 3.6.: Controller overhead in percent of the total design area (LUTs) for different algorithms.

A quite optimistic reference performance of 30 MOPS (million operations per second) is assumed for the Microblaze implementations. The area and power of the Microblaze are used as baseline reference for comparing loop accelerators to processors. At the cost of flexibility, the accelerators have an average advantage of 2.5x and 4.5x in terms of area and power consumption, respectively, over the soft processor cores for FPGA technology. Considering peak performance, the accelerators have an advantage of almost 50x (average gain over all benchmarks). For the selected mapping configurations, the gains for the individual algorithms are 215, 50, 30, 17, 12, 27, and 3 times, respectively, as also shown in Table 3.4. The configuration reflects the standard degree of parallelism chosen for a hardware implementation of these loop algorithms. The theoretical peak performance assumes that the data is available for computation.

An important aspect is the productivity gain of using our methodology, as also shown in Table 3.4. From an average software description with about 100 lines of code (LoC), an average of about 10000 LoC in VHDL describing the hardware architecture of the accelerator is obtained. This improvement of roughly 100x in productivity shows the importance of using a high-level design methodology for realizing the potential of hardware accelerators.

3.3.2. Impact of Compiler Transformations on Controller Overhead

The compiler transformation of loop tiling determines not only the processor array dimensions, but also the local memory requirements within each processor element. In [91], we have shown that area and power consumption grow linearly and the clock rate stays almost constant with a linear increase in the number of PEs of the hardware accelerator, which in turn depends on the tiling parameters. However, the effect of tiling on the overhead of the control path is not well studied. Furthermore, one can specify the resource allocation for each PE or the required iteration interval (II). Higher iteration intervals correspond to lower throughput, which also implies a lower resource allocation for each PE. However, in [56], we have shown that a higher II does not lead to a linear decrease in area, due to control overhead.

Therefore, our design tool has been used to quantify the effect of loop tiling and a requisite iteration interval on the control overhead in terms of logic area, power, and performance for the following benchmark algorithms:

• MMM: matrix-matrix multiplication

• BPF: band-pass filter

• IDCT: inverse discrete cosine transform

• SOBEL: image edge detection algorithm



Figure 3.15.: Control overhead in percent of the total design area (LUTs) for the BPF for different tile sizes; the determinant of the tiling matrix gives the number of PEs on the x-axis. Control overhead in percent of the total design area (LUTs) for the IDCT for different throughput rates.

The accelerator configuration is a 4 × 4, 3 × 3, and 1 × 1 processor array for (MMM, BPF), SOBEL, and IDCT, respectively. The hierarchical analysis of the designs was done using the Xilinx floorplanner. As a result, the fraction of the control overhead in terms of logic area for these algorithms lies between 10% and 50%, as shown in Table 3.6.

The upper graph in Figure 3.15 shows the effect of the tiling parameters on the control area overhead for the BPF algorithm. It can be seen that with tiling matrices corresponding to a larger number of PEs, the control area overhead converges to a constant fraction. In this case, the controller area overhead settles at around 15% of the total area of the accelerator.


Figure 3.16.: Clock frequency and power efficiency of the matrix-matrix multiplication.

The overhead converges for larger accelerators because the cost of the global and I/O controller is constant, whereas the total size of the local control scales linearly with the number of PEs. Hence, for tilings leading to a larger number of PEs, the control overhead approaches a constant fraction.

The IDCT accelerator was also synthesized for different throughput requirements. It was observed that a higher II is associated with a larger controller cost, because the number of states of the controller for the local schedule is proportional to the II. At the same time, a higher II is associated with lower resource requirements in terms of functional units like multipliers and adders. Therefore, the control overhead as a fraction of the total design increases, as shown by the lower graph in Figure 3.15. For higher throughput, i.e., a lower II, the number of controller states is smaller, whereas a larger number of functional units is required. The controller cost in terms of area can take up to 50% of the IDCT accelerator design, as shown in Table 3.6 and Figure 3.15.

The effect of the tiling parameters on the achievable clock frequency is shown in Figure 3.16. The critical path passes through the global and the local controller. Interestingly, the maximal clock frequency does not decrease with an increasing number of processor elements, due to the pipelining of the control signals along with the interconnect delays. The clock frequency for the different designs varies between 100 and 120 MHz. For the estimation of the dynamic power for FPGA technology, Xilinx XPower [188] was used in combination with the post-place & route simulation models of the accelerator designs. Determining the average power consumed by the accelerator additionally requires a ModelSim simulation of the generated accelerator RTL for a set of inputs. The obtained average power is multiplied with the latency to obtain the total energy. For example, an energy efficiency of 20-25 MOPS/mW (considering only the dynamic power) is obtained for the matrix multiplication example, as shown in the right graph in Figure 3.16.
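Written out, the energy and efficiency figures are obtained as:

$$E = \bar{P}_{\mathrm{dyn}} \cdot t_{\mathrm{lat}}, \qquad \text{power efficiency} = \frac{\text{performance}}{\bar{P}_{\mathrm{dyn}}}\ \left[\mathrm{MOPS/mW}\right],$$

so the stated 20-25 MOPS/mW is equivalent to 20-25 operations delivered per nJ of dynamic energy.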


Figure 3.17.: Energy-delay product (EDP) map for the matrix-matrix multiplication benchmark.

This implies that the power efficiency does not decline with larger processor arrays or a larger controller overhead.

It has been shown that the controller has a substantial impact on the logic area, power, and performance, depending on the chosen tiling, throughput, and resource allocation. For tiling, the control overhead converges to a constant fraction for larger processor arrays, whereas a higher iteration interval leads to a larger control overhead. Therefore, finding an optimal controller configuration requires finding an optimal accelerator configuration (i.e., tiling and II). In order to find these parameters, we selected the energy-delay product (EDP) as a neutral metric [75]. The EDP is a good metric, since its minimization rewards exactly those architectural and compiler decisions that contribute most to energy and area efficiency. In Figure 3.17, the product of the normalized energy (pJ/op) and the normalized delay (ns/op) for different processor array configurations, again for the matrix multiplication (MMM) case study, is shown. The different tiling parameters lead to processor arrays with 2 × 1 to 8 × 7 PEs. It can be observed that for smaller accelerators, the metric is quite large due to lower power efficiency. The optimal value is achieved for processor arrays of size 6 × 6 and 8 × 6. Larger processor array accelerators intrinsically minimize the metric because of better energy efficiency and clock speeds.
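In symbols, with E the total energy and t the execution time for a fixed number of operations (and assuming, as the EDP name suggests, that the throughput axis enters as its reciprocal, i.e., as a delay per operation), the metric plotted in Figure 3.17 is:

$$\mathrm{EDP} = \frac{E}{\#\mathrm{ops}} \cdot \frac{t}{\#\mathrm{ops}} \qquad \left[\mathrm{pJ/op} \cdot \mathrm{ns/op}\right],$$

and the accelerator configuration (tiling, II) with minimal EDP is selected.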


3.4. Conclusion

Programmable or specific acceleration engines provide a viable architecture for high performance embedded computing algorithms. Loop transformations like tiling are important for realizing an optimal accelerator implementation. In this chapter, we presented an automated source-to-source transformation called hierarchical tiling, which can match or specify all requirements of an accelerator, such as the number of processor elements, the local memory, and the I/O bandwidth.

In this chapter, we also presented a unified methodology for the generation of controllers. The generated controllers are not only responsible for synchronizing the computations according to a local and a global schedule, but also for the data transfer from parallel memory banks in the accelerator. The methodology also entails the generation of a custom memory architecture for parallel data transfers to a processor array such that I/O is not a bottleneck for the throughput. The synthesis results show a reasonable controller overhead in terms of area and power, depending on the tiling and throughput parameters: the larger the iteration interval, the higher the control overhead, which can reach up to 50%. Several realistic loop benchmarks have also been synthesized as accelerators, illustrating the orders of magnitude gained in terms of area, power, and performance.

The accelerators can be embedded into an SoC by coupling them to a processor or another IP in a multi-accelerator system. The communication between accelerators needs to be supported by special communication structures. In the next chapter, we discuss the automatic generation of dedicated communication subsystems.

4. Accelerator Subsystem for Streaming Applications: Synthesis and System Integration

Heterogeneous SoC architectures are ubiquitous in consumer devices [185]. These architectures are characterized by the presence of multiple processors along with domain-specific acceleration engines. Such heterogeneous platforms are finding increasing acceptance in high performance embedded computing. The computationally intensive parts of a program are usually loop nests, which contain inherent parallelism. Therefore, a significant amount of research has been done on accelerating such loop nests on dedicated hardware accelerators or programmable domain-specific processors. However, embedded streaming applications are characterized by the presence of multiple communicating loop nests (see Figure 4.1), where the data streams through loop kernels arranged in a certain graph topology. This task level parallelism of communicating loop nests can be exploited by the use of fixed-function accelerator pipelines.

There are several challenges in realizing a streaming application as an accelerator subsystem. The first problem is how to generate an optimal communication subsystem linking the dedicated loop accelerators in a pipeline. The coarse-grained data movements must be overlapped with computation. Streaming applications operate on large amounts of data with limited lifetime. The memory access patterns are known due to the static scheduling of the accelerators. Therefore, the communication subsystem should allow deterministic parallel data access.

Furthermore, these acceleration engines must be coupled with software, offering features and flexibility (hardware/software co-design). Therefore, the second major problem is how to integrate the accelerators in an SoC. There are several interconnect alternatives, ranging from point-to-point connections, over buses, to networks-on-chip (NoCs). Hence, generic interconnect glue logic and a driver program for accessing the accelerator should be generated.

Finally, an integrated design flow for the correct-by-construction synthesis of a streaming application needs to be developed. This eases verification, integration across the processor/accelerator boundary, and smooth design refinement.

We address the above problems in the context of this dissertation. The major contributions of this chapter are:



Figure 4.1.: (a) Stereo depth extraction, (b) Motion JPEG application, (c) matrix-matrix multiplication D = B × (A × C).

1. A novel intermediate dependence graph called mapped loop graph is introduced for the representation of communicating loops and their mapping information, like allocation and scheduling, in the polyhedral model.

2. A methodology for the projection of the mapped loop graph in the polyhedral model onto a data flow model of computation called windowed synchronous data flow (WSDF) is developed.

3. Methods for the generation of an efficient communication primitive for data transfer and synchronization between the loop accelerators, leveraging the windowed synchronous data flow model, are presented. Also, the automated generation of accelerator memory maps, device drivers, and glue logic for the memory-mapped integration of the accelerators into an SoC is supported.

4. A design flow methodology for the correct-by-construction synthesis of accelerator pipelines and their subsequent integration in an SoC is presented and validated with the help of real-world benchmarks.

This chapter is organized as follows: In Section 4.1, a detailed description of graph models for the specification of communicating loop applications, accelerator-based platforms, and synthesis is given. A methodology for the automated synthesis of FIFO channels connecting the accelerators is described in Section 4.2. The section also contains a novel method for the conversion of communicating loop descriptions in the polyhedral model into the WSDF model of computation.



Figure 4.2.: The partitioning problem defines the mapping of loop kernels onto accelerators or processors, forming the SoC system.

The IP subsystem or the individual accelerators need to be integrated in the SoC platform. Section 4.3.1 discusses the automatic generation of memory maps, software drivers, and hardware wrappers for the memory-mapped integration of such accelerator subsystems in an SoC. Subsequently, Section 4.3.2 presents the design trajectory for the generation of an accelerator-based SoC instance from a high-level application, platform, and mapping description. Finally, the conclusion closes the chapter.

4.1. Communicating Loop Model

Applications characterized by the presence of communicating loop nests can be represented using dependence graphs. Similarly, graph models can also describe SoC architectures and the mapping of communicating loop nests onto SoC architectures. In this section, we present graph models for representing applications, architectures, and mappings.

In the graph representation of such applications, the nodes are the computation kernels and the edges denote the communication of the array data. This paradigm is also called the stream programming model. It is not only intuitive for the programmers, but also exposes parallelism and communication to the compiler and the underlying architecture. The SoC architecture model consists of several components, such as general-purpose processors, accelerators, DMA engines, and memory. The term hardware/software partitioning denotes the decision process of mapping computation kernels onto SW (CPU) or HW (accelerator) resources. Figure 4.2 shows an example application graph, its partitioning, and its mapping onto an SoC architecture. The aim of this section is the modelling of streaming applications, accelerator-based SoCs, and the mapping, which defines the execution.


4.1.1. Loop Graph

The reduced dependence graph (RDG) was defined (cf. Def. 2.1.4) for modeling loop- and operation-level parallelism of a single loop nest. Now, we define the loop graph to model the task level parallelism of communicating loop nests.

Definition 4.1.1 (Loop Graph) A loop graph G is a directed acyclic graph defined by a pair G = ⟨V, E⟩, where each vertex v_i ∈ V denotes a loop kernel and each edge e_{i,j} ∈ E denotes a multi-dimensional data dependency between the loop kernels v_i and v_j. Each node has a pair ⟨I_k, RDG_k⟩ associated with it, where k is the identifier of the loop program. Each edge also has a tuple ⟨var, I_i^var, I_j^var, D⟩, where I_i^var and I_j^var are the source and sink iteration spaces of the transported multi-dimensional I/O variable var, and D is a partial function representing the dependence between the source and sink array variable var.

Each vertex v ∈ V of the loop graph G = ⟨V, E⟩ represents a process (also called actor), whose functionality is described by a nested loop program. Each actor can be classified as one of the following types:

• Source: The variables, signals, images, or other input data consumed by the application are characterized by a source node.

• Normal: Nodes of type normal denote the processing kernels of the streaming application.

• Constant: For each constant data array in the application, there exists a node of type constant.

• Split and Merge: In order to characterize the splitting or the merging of the data flow to/from different branches, this type of node is used.

• Sink: The sink nodes receive the output data of the application.

The description of the functionality of the nodes is given by the associated pair ⟨I, RDG⟩, where I is the iteration space of the loop program. In case of a hierarchically partitioned loop¹, I = I_1 ⊕ I_2 ⊕ ... ⊕ I_n, where I_1 is the iteration space of the innermost tile and I_n is the iteration space of the outermost tile. RDG is the reduced dependence graph (see Definition 2.1.4), which contains information on the iteration domain and the access function of each variable of the loop kernel.

The edge e_{i,j} of the loop graph is defined by the multi-dimensional array variable var, which is produced and consumed by loop i and loop j, respectively.

¹Well-known partitioning techniques are multi-projection, LSGP (local sequential global parallel, often referred to as blocking or outer loop parallelization), and LPGS (local parallel global sequential, also referred to as cyclic(1) mapping or inner loop parallelization).
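As a data structure, such a loop graph could be held in memory roughly as follows; this is an illustrative sketch, not PARO's actual internal representation, and all type and field names are made up.

/* Illustrative in-memory representation of a loop graph. */
typedef enum { SRC, NORMAL, CONSTANT, SPLIT, MERGE, SINK } node_type_t;

typedef struct {
    node_type_t type;
    void *iteration_space;  /* I_k: possibly hierarchically tiled     */
    void *rdg;              /* reduced dependence graph of the kernel */
} loop_node_t;

typedef struct {
    int src, snk;           /* indices of source and sink nodes       */
    const char *var;        /* transported multi-dimensional variable */
    void *i_src, *i_snk;    /* source and sink data spaces of var     */
    void *redistribution;   /* partial function D between the spaces  */
} loop_edge_t;

typedef struct {
    loop_node_t *nodes; int n_nodes;
    loop_edge_t *edges; int n_edges;
} loop_graph_t;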


for(i=0;i<2;i++)               for(i1=0;i1<3;i1++)
  for(j=0;j<6;j++){              for(j1=0;j1<4;j1++){
    A[i,j]=...                     A[i1,j1]=...
  }                              }

D : (6 1) (i j)^T = (4 1) (i1 j1)^T

Figure 4.3.: Loop graph representation of two communicating for loops.

I_i^var and I_j^var define the source and sink iteration spaces of the transported variable and are also known as the source and sink data spaces, respectively. They can be derived from the RDGs of the source and sink loops. Intuitively, the iteration domain of each variable can be interpreted as a multi-dimensional array. The parallel computations of the source accelerator produce array data with a distribution that may not match the distribution required by the sink accelerator. Therefore, the links of the loop graph must indicate not only the array distributions produced and consumed by the source and sink accelerator, but also the dependency between the distributions. The dependency between the source iteration vector I_src ∈ I_i^var and the sink iteration vector I_snk ∈ I_j^var is given by a redistribution equation D, which is defined as follows:

Definition 4.1.2 (Redistribution equation) The redistribution equation between a source loop src and a sink loop snk is defined by the polyhedron

A · I_src − B · I_snk = 0,  I_src ∈ I_i^var ∧ I_snk ∈ I_j^var,

where A and B are matrices denoting the affine dependency mapping function between the source and the sink iteration vector. The redistribution equation defines an affine dependence relation for the I/O variable var transported over the edge connecting the source and the sink loop program. Intuitively, it can be interpreted as an array write and read access of the source and the sink loop kernel, which refer to the same memory location (i.e., A I_src = B I_snk = I). The following simple loop graph example is given to facilitate the understanding.

Example 4.1.1 Figure 4.3 shows two communicating for loops, which are obtained by partitioning the iteration domain I = {(i) | 0 ≤ i ≤ 11} with different tiles of size 6 and 4. The resulting loop graph shows the loop programs in its nodes, the multi-dimensional arrays, and the redistribution equation. For each iteration in the source data space Z-polytope, the dependency equation gives the corresponding iteration in the sink data space Z-polytope.
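Because both loops address the same flat 12-element array, the redistribution equation 6i + j = 4·i1 + j1 can be checked by simple enumeration; a minimal C sketch for this example:

#include <stdio.h>

/* Enumerate the redistribution of Example 4.1.1: the source writes A
 * with tiles of size 6, the sink reads it with tiles of size 4; both
 * address the same 12-element array, so 6*i + j == 4*i1 + j1. */
int main(void)
{
    for (int i = 0; i < 2; ++i)
        for (int j = 0; j < 6; ++j) {
            int flat = 6 * i + j;             /* global element index */
            int i1 = flat / 4, j1 = flat % 4; /* sink iteration       */
            printf("src (%d,%d) -> snk (%d,%d)\n", i, j, i1, j1);
        }
    return 0;
}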


The different distributions arise due to different tilings of the same data array. This is often the case in block-based algorithms like the DCT. Therefore, the dependence between the source and sink data space must be given. As stated earlier, copartitioning [57] was defined as the tiling of the iteration space into spaces I_1, I_2, and I_3, using two congruent tiles. I_1 ∈ Z^n represents the points within the inner tiles, I_2 ∈ Z^n accounts for the regular repetition of the origins of the inner tiles, and I_3 ∈ Z^n accounts for the regular repetition of the outer tiles. Intuitively, for copartitioning, P_1 describes the iteration points inside the tile and P_2 describes the number of PEs in the processor array grid. We consider a multi-dimensional array var[0...a_1−1, ..., 0...a_n−1] of size ~a = (a_1 ... a_n)^T that is produced according to copartitioning(P_1, P_2) by P_2 output processors of the source accelerator. This array must be redistributed to P_2' input processors of the sink accelerator, which undergoes copartitioning(P_1', P_2'). Note that here the tiling matrices represent the tiling of the data space and not of the entire iteration space. Using Definition 3.1.1 for copartitioning, one can rewrite the redistribution equation as follows.

$$\begin{pmatrix} E & P_1 & P_1 P_2 \end{pmatrix} \begin{pmatrix} I_1 \\ I_2 \\ I_3 \end{pmatrix} = \begin{pmatrix} E & P_1' & P_1' P_2' \end{pmatrix} \begin{pmatrix} I_1' \\ I_2' \\ I_3' \end{pmatrix} \tag{4.1}$$

The redistribution equation has the above form in case of different tilings of the data space. The data space is the iteration space of an I/O variable. Intuitively, the data space is equivalent to the multi-dimensional data array; therefore, the data spaces can be represented as rectangular iteration spaces.

Definition 4.1.3 (Rectangular iteration space) defines the iteration space I_~b = {I ∈ Z^n | 0 ≤ I < ~b}, given by a positive integral vector ~b, where n is the dimension of ~b.

Assumption: We represent the source and the sink iteration spaces of a variable var (i.e., the data spaces) by rectangular iteration spaces, i.e., I_src^var = I_~a and I_snk^var = I_~b. For example, I = {(i j)^T ∈ Z^2 | 0 ≤ i < 4 ∧ 0 ≤ j < 3} can be represented as I_(4,3).

In the next section, we present a graph model for representing the SoC architecture containing the accelerators.

4.1.2. Accelerator Model

The target platform consists of accelerators, processors, and buses. It can be modeled by a so-called architecture graph. This graph is a representation of the system architecture components, which execute and implement the nodes and edges of the loop graph defining the computation and communication. The architecture graph is defined as follows:


Definition 4.1.4 (Architecture graph) is a bipartite graph defined by G_a = ⟨V_C ∪ V_B, E⟩, where V_C denotes the processing components, V_B denotes the communication components, and E ⊆ (V_C × V_B) denotes the set of undirected edges.

The processing nodes (V_C = V_A ∪ V_P ∪ V_O) of the architecture graph are classified into the set of accelerator nodes V_A, processor nodes V_P, and other IP cores V_O. A computing node a ∈ V_A denotes an accelerator and is responsible for the parallel implementation of a node in the loop graph. A processor node p ∈ V_P denotes a processor, which provides a sequential implementation of the loop graph nodes. Other IP cores and memories are represented by the node type V_O.

The nodes in V_B represent different communication channel alternatives, like buses or point-to-point communication over dedicated FIFO memories, which implement the edges in the loop graph. They also include architecture components like resident memory, interface circuits, controller logic, and device drivers. The resident memory is needed to store intermediate data elements. The device drivers are software programs in the processing nodes for accessing the other processing components. Interface circuits or wrappers are responsible for protocol conversion in case of incompatible protocols at the source and sink processing nodes. The controller logic takes care of the synchronization between the processing and communication nodes. In terms of the channel architecture, a blocking read/write synchronization mechanism is employed. Self-timed scheduling is deployed, in which the processor/accelerator invocations are controlled by the availability of I/O data.

4.1.3. Mapping: Putting it all together

Given the mapping of a loop graph to an architecture graph, the problem of synthesis is to generate the implementation. The mapping information binds the functionality and communication of the loop graph to the architecture graph (binding). It also includes configuration information like the parallelism (allocation) and the execution order (scheduling).

Definition 4.1.5 (Binding) β ⊆ V × V_C maps each node v ∈ V of the loop graph G = ⟨V, E⟩ to a processing node v_c ∈ V_C of the architecture graph.

A possible binding (v, v_c) may denote a sequential implementation of a loop kernel on a processor, or a parallel execution on a dedicated hardware accelerator. Therefore, the binding β also defines the hardware/software partitioning. The issue of selecting the optimal binding β* from the many possible bindings has been dealt with using heuristics and integer linear programming by Niemann et al. in [140]. The synthesis problem consists of two sub-problems, namely computation synthesis and communication synthesis. The computation synthesis involves synthesizing a binding (v, v_a), i.e., an accelerator in form of a processor array, whereas a binding (v, v_p), i.e., a binding to a processor, is realized in form of a software program.


Definition 4.1.6 (Node configuration) ∆_v : V → ⟨Q, L⟩ of the vertices in a loop graph denotes the allocation Q and the scheduling order (loop matrix) L.

Q defines the processor allocation in the space-time mapping equation. The tiling strategy determines this processor allocation function. For a tiled iteration space I = I_1 ⊕ I_2, I_1 is the iteration space of the inner tile and I_2 is the iteration space describing the origins of the outer tiles. In LPGS tiling, the processor space is mapped onto I_1 (inner loop parallelization), and for LSGP, the processor space is mapped onto I_2 (outer loop parallelization). For copartitioning, I = I_1 ⊕ I_2 ⊕ I_3, the processor space is mapped onto I_2. Intuitively, Q decides which loop variables are executed in parallel. L is the loop matrix, which gives the execution order of the sequentially executed iterations. For example, in an image processing application, row-major and column-major are the two typical execution orders for the image pixels.
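As an illustration, for a two-dimensional iteration vector (i j)^T and under the convention (assumed here purely for illustration) that the first column of L spans the innermost, fastest-varying scan direction, the two orders correspond to:

$$L_{\text{row-major}} = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}, \qquad L_{\text{column-major}} = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}.$$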

Definition 4.1.7 (Edge binding) ψ ⊆ E × V_B binds each edge of the loop graph to a communication channel in the architecture graph. An edge binding is said to be valid if the source and sink nodes of the edge have a defined binding.

The generation of communication channels according to the edge binding ψ for data transfer between accelerators or processors takes into account the granularity, size, rate, and communication order of the produced and consumed data. These depend on the allocation and scheduling functions of the source and the sink node. Hence, both the architecture and the communication implementation are determined by the allocation and schedule order parameters. The inter-kernel communication between the loop kernels follows a producer/consumer model. Therefore, important properties of the communication between loop kernels, which need to be taken into account, are:

• What is the granularity and size of a data transfer? The communication pattern may include fine-grained data transfers like signals, pixels, or coarse-grained data in form of vectors, tiles, or even complete images.

• What is the rate of production and consumption of data tokens by the source and sink loop kernels? This is important for multi-rate signal and image processing applications, where intermediate buffers must store the requisite data.

• What is the pattern of data production and consumption? The write and read order of the source and the sink loop determine not only the intermediate buffer size, but also the latency of the communication subsystem.

Similar to the node configuration required for the accelerator generation, we describe the communication semantics using an edge configuration (see Def. 4.1.8).



Figure 4.4.: (a) An example of a loop graph, an architecture graph, and their binding. (b) The node binding β and the edge binding ψ. (c) The node and edge configurations show the processor allocation and the schedule order for the iteration space.

Definition 4.1.8 (Edge configuration) ∆_e : E → ⟨var, Q_src^var, Q_snk^var, L_src^var, L_snk^var⟩ of an edge of the loop graph is a tuple, which defines the allocation and schedule order information of the source and the sink node of the edge.

The dimension and the size of a produced and a consumed data token are decided by the processor spaces of the source and sink loop accelerators. Q_src^var and Q_snk^var decide the allocation of iterations of the data space to the source and sink accelerator PEs. In combination with the iteration space of the I/O variable var, this can be used to describe the dimension and granularity of the produced and consumed data tokens. The execution orders of the source and the sink iterations describe the patterns of production and consumption of the multi-dimensional I/O arrays. They are given by partial functions, which are described by the loop matrices and are defined by L_src^var : E → Z^{n×n} and L_snk^var : E → Z^{n×n}, respectively. Finally, we refine the loop graph by annotating it with the mapping information, i.e., the binding and configuration information. This refined loop graph is defined as follows:

Definition 4.1.9 (Mapped Loop Graph) G is a loop graph, where each node v_k additionally has a partial function called node configuration ∆_{v_k} associated with it, and each edge e_{i,j} has a partial function called edge configuration ∆_{e_{i,j}}. In a mapped loop graph, the tuple ⟨var, I_src^var, I_snk^var, D, Q_src^var, Q_snk^var, L_src^var, L_snk^var⟩, whose last four components constitute ∆_e, defines the communication properties of an edge e_{src,snk} in the mapped loop graph.


Example 4.1.2 Figure 4.4 shows an example illustrating the concepts of a mapped loop graph. The loop graph has four communicating loop kernels. The iteration domain of the loops 1 and 4 is I = {(i j)^T | 0 ≤ i < 8 ∧ 0 ≤ j < 4}. The iteration domains of loops 2 and 3 are obtained by LSGP and LPGS tiling of I with 4 × 2 tiles. From the binding information β, we can observe that loops 2 and 3 are mapped onto dedicated hardware and loops 1 and 4 onto the same processor. The communication between loops 2 and 3 is mapped onto a FIFO, whereas the other communication takes place over the bus, as given by the edge binding ψ. The node configuration ∆_v shows that loops 1 and 4 are executed sequentially in row-major order, whereas loops 2 and 3 undergo outer (LSGP) and inner (LPGS) loop parallelization, respectively.

The nodes of the graph represent the system functionality, while the edges model the communication of the multi-dimensional arrays. It may be noted that the edge configuration functions Q_src^var, Q_snk^var, L_src^var, L_snk^var can be determined from the vertex descriptions of the mapped loop graph. However, this chosen redundancy leads to a decoupling between functionality and communication. This is beneficial for the design of complex systems, as the vertices and edges can be synthesized independently of each other. Furthermore, the system analysis is simplified significantly, because it is possible to use the abstract view provided by the communication semantics, instead of considering the internal details of the vertices.

We also restrict the discussion to orthogonal tiling matrices, which can be represented as diagonal matrices. Again, the tiling matrices denote the partitioning of the data space and not of the iteration space; they can be obtained by projecting the tiles onto the data space. Let P_1 = diag(~a_1), P_2 = diag(~a_2) and P_1' = diag(~b_1), P_2' = diag(~b_2) be the tiling matrices for the source and the sink loop data space in case of copartitioning, respectively. P_1 describes the iteration points inside a tile and P_2 describes the number of output PEs in the source processor array grid. The source and the sink iteration spaces of variable var are then given by

$$\mathcal{I}_{src}^{var} \rightarrow \underbrace{\mathcal{I}_{\vec{a}_1}}_{\text{inner tile}} \oplus \underbrace{\mathcal{I}_{\vec{a}_2}}_{\text{proc. space}} \oplus \underbrace{\mathcal{I}_{\vec{a}_3}}_{\text{outer tile}}, \qquad \mathcal{I}_{snk}^{var} \rightarrow \underbrace{\mathcal{I}_{\vec{b}_1}}_{\text{inner tile}} \oplus \underbrace{\mathcal{I}_{\vec{b}_2}}_{\text{proc. space}} \oplus \underbrace{\mathcal{I}_{\vec{b}_3}}_{\text{outer tile}} \tag{4.2}$$

where ~a_3 ~a_2 ~a_1 = ~x and ~b_3 ~b_2 ~b_1 = ~x (element-wise products), and I_~x is the data space of the transported multi-dimensional array. The global allocation of the output PEs in case of copartitioning is given by

$$p = \underbrace{\begin{pmatrix} 0 & E & 0 \end{pmatrix}}_{Q} \begin{pmatrix} I_1 \\ I_2 \\ I_3 \end{pmatrix} = I_2$$

We use the succinct representation I_(~a_1,~a_2,~a_3) for the source data space I_src^var. It must be noted that sequential execution, LSGP, and LPGS can be represented as special cases of copartitioning².



Figure 4.5.: (a) Two communicating matrix multiplications (E = (A × B) × D) shown as a mapped loop graph, (b) LSGP tiling of the iteration space, (c) corresponding accelerators as processor arrays, showing the transferred data arrays.

The global schedule is determined by the loop matrix, which determines the sequential execution order. The loop matrix L = (l_1 l_2 ... l_s) ∈ Z^{s×s} determines the ordering of the iteration points within a parallelotope-shaped tile of the data space. The following example shows a mapped loop graph of two communicating matrix-matrix multiplications, each undergoing a different tiling.

Example 4.1.3 Figure 4.5(a) shows two communicating matrix-matrix multiplications (E = (A × B) × D) as a mapped loop graph, where each matrix is of dimension 8 × 8. The iteration space is tiled for the source and the sink loop using the tiling matrices diag(2,8,8), diag(4,1,1) and diag(4,8,4), diag(2,1,2), respectively. The

² In case of sequential execution, P_1 = diag(~1), P_2 = diag(~1). For LPGS, P_1 = diag(~1), P_2 = diag(~p). For LSGP, P_1 = diag(~a), P_2 = diag(~p), where ~a ~p = ~x.

data space is not of the same dimension as the iteration space (see Figure 4.5). Therefore, the data space is tiled by the tiling matrices P_1 = diag(2,8), P_2 = diag(4,1) and P_1' = diag(4,4), P_2' = diag(2,2), respectively. The dependence equation shows the relation between the iteration variables of the data space of array C, corresponding to the transported array variable. By definition of the allocation function for copartitioning, the part of the iteration space indicated by the arrow in Figure 4.5(a) is executed in parallel. The loop matrices L1 and L2 show the sequential execution order of the inner and the outer tile of the data space, which is also shown by arrows in Figure 4.5(c).

The problem of accelerator synthesis was dealt with in the previous chapter: given the node configuration, the iteration domain, and the loop description, a dedicated accelerator is generated. The major remaining challenge is to synthesize the communication channels for the data transfer between communicating accelerators and to achieve their synchronization. There are different cases of this problem, depending on the implementation of the source and the sink node:

1. The source and the sink node of the edge are mapped onto processor nodes v_p of the architecture graph.

2. The source and the sink node of the edge are mapped onto an accelerator type node v_A and a processor type node v_p, or vice versa.

3. The source and sink nodes are both mapped onto accelerator type nodes v_a of the architecture graph.

In the first case, either the interface circuit of the processor to the bus or another communication alternative is available; therefore, only the drivers for the data transfer and the synchronization are needed. In the second case, an interface circuit for the accelerator needs to be developed, along with a software driver for the processor. In the last case, a dedicated communication channel for accelerator-to-accelerator communication needs to be developed. This is illustrated in Figure 4.6, which shows the implementation for the mapping of the loop graph given in Example 4.1.2.

The task of the communication synthesis is to generate dedicated communication FIFOs, software drivers, and hardware wrappers in order to facilitate the communication; a sketch of such a driver is shown below. In the next section, we deal with the communication synthesis problem for the last case, as it is required for the generation of dedicated accelerator subsystems.
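Figure 4.6 hints at the generated driver template ("int StreamWrite(int n, int dat[]) {//blocking write ..."); a plausible completion of the blocking write/read pair could look as follows, where the memory-mapped register addresses and names are purely hypothetical.

/* Hypothetical memory-mapped accelerator registers. */
#define ACC_STATUS_FULL  (*(volatile int *)0xA0000000)
#define ACC_STATUS_EMPTY (*(volatile int *)0xA0000004)
#define ACC_DATA_IN      (*(volatile int *)0xA0000008)
#define ACC_DATA_OUT     (*(volatile int *)0xA000000C)

/* Blocking write of n data words to the accelerator channel. */
int StreamWrite(int n, int dat[])
{
    for (int i = 0; i < n; ++i) {
        while (ACC_STATUS_FULL)   /* busy-wait until space is available */
            ;
        ACC_DATA_IN = dat[i];
    }
    return n;
}

/* Blocking read of n data words from the accelerator channel. */
int StreamRead(int n, int dat[])
{
    for (int i = 0; i < n; ++i) {
        while (ACC_STATUS_EMPTY)  /* busy-wait until data is available */
            ;
        dat[i] = ACC_DATA_OUT;
    }
    return n;
}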

4.2. Automated Generation of a Communicating Accelerator Subsystem

We follow a two-pronged approach for the generation of the accelerator subsystem by treating the accelerator synthesis and the communication synthesis separately.



Figure 4.6.: (a) Specification from Example 4.1.2, (b) accelerator synthesis, (c) communication synthesis.

The communication synthesis problem depends on the accelerator allocation and the scheduling parameters, as they determine the parallel access to data, the memory mapping, and out-of-order communication. In this section, we leverage the conversion of the loop graph from the polyhedral model to the windowed synchronous data flow (WSDF) model of computation [108] for the synthesis of hardware-to-hardware communication subsystems.

The polytope model was used to describe communicating loops by means of a polyhedral dependency graph in [153, 27, 178]. However, in [67], it was shown that scheduling is not scalable for complex applications; therefore, modularity for application development was advocated. The concept of a mapped loop graph takes a modular approach for representing communicating loops and mapping information. In Section 2.1.3, we discussed some formal models for representing data access patterns of streaming applications, like synchronous data flow (SDF) [121], cyclostatic data flow [18], multi-dimensional SDF [138], and their variants. The advantage of the WSDF model is that it is capable of representing the schedule order, which is necessary for the synthesis of the communication subsystem. There is some related work which leverages a different model by projection, for the purpose of synthesis or analysis, like Array-OL to Kahn process networks [5], or polyhedral reduced dependence graphs (PRDG) into binary parametrized cyclostatic data flow graphs [38]. The communication synthesis must perform the array redistribution such that the communication subsystem is not the bottleneck. This problem has been intensively studied using different models in the context of high performance computing systems [40] and systems-on-chip [72]. For the communication between hardware accelerators,

fractional SDF has been used for communication synthesis in [102]. In [175], a methodology for out-of-order communication in Kahn process networks was presented. However, none of the presented methodologies supports parallel access and out-of-order communication for multi-dimensional applications in their entirety.

4.2.1. Modeling of Communication Channels

The loop graph specifies the application in the polyhedral model. Traditionally, the polyhedral model has been used for the generation of loop accelerators in form of processor arrays and for the code generation for parallel architectures; in other words, it has concentrated on architecture synthesis. Data flow models of computation, in contrast, have primarily concentrated on communication semantics and analysis. Our solution approach for the communication synthesis problem includes the conversion of the polyhedral representation of a mapped loop graph into the multi-dimensional WSDF model of computation. This model of computation, proposed in [108], offers the possibility of schedulability and buffer analysis, as well as the synthesis of communication primitives.

4.2.1.1. Simplified Windowed Synchronous Data Flow Model

Earlier, in Chapter 2.1.2, a data flow model of computation called windowed synchronous data flow (WSDF) was discussed for modeling applications characterized by communicating loops. In this section, we show that not all features of the WSDF model are needed in our case; therefore, we propose a simplified WSDF model for modelling communicating loop nests.

Each vertex of the WSDF graph represents a process, whose functionality is described by a nested loop program. The kernel of the loop nest contains the computations, which read from the input ports, process the data, and provide the results on the output ports. The WSDF communication edge is later synthesized as a so-called multi-dimensional FIFO, which adheres to FIFO semantics, i.e., it behaves like a one-dimensional FIFO with full, empty, r/w count, enable, and data signals. A data read access can take place only if the empty and read count flags indicate that data is available. Similarly, a write access can take place only when the full and write count flags indicate that enough empty space is available for writing the data. A conventional FIFO has the same read and write order; the parallel read and write access to a multi-dimensional array with a different communication order is fitted to the FIFO semantics by integrating the information on the production and consumption order. Furthermore, the buffer size can be derived by sophisticated calculations taking into account the data dependencies as well as the production and consumption order [107].

In our accelerator design system, issues such as sliding windows, upsampling, downsampling, and border extensions are handled within the loop program; therefore,

118 4.2. Automated Generation of a Communicating Accelerator Subsystem the WSDF semantics in our case would not need boundary extension vectors ~bs and ~bt. Furthermore, sliding window vector ∆~c, which simplifies the notation for down- sampling or upsampling, is also described using conditionals within the loop kernel. Therefore, we simplify the WSDF notation as follows:

Definition 4.2.1 (Simplified WSDF) is a data flow graph, i.e., a tuple G = (V, E, D), where V is a set of nodes representing processes, E is a set containing directed edges, and D = (~p, ~v, ~c, O_write, O_read) is a set of data labels that are partial functions of E into some specified range of values. The labels are the same as in Definition 2.1.5 on page 21.

Therefore, the boundary vectors ~b_s and ~b_t and the sliding window vector ∆~c are not required anymore and can be set to their default values: the boundary extension vector defaults to the zero vector and the window sampling vector to the unit vector. We use the simplified WSDF framework for communication synthesis [44]. In the next section, we derive the simplified WSDF representation from the polyhedral notation of the mapped loop graph (see Def. 4.1.9).
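In software terms, the FIFO semantics described in the previous paragraphs amount to guarding every access with the fill-state flags; a minimal C sketch with illustrative names:

#include <stdbool.h>

/* Fill-state interface of the (multi-dimensional) FIFO as seen by the
 * accelerators; the names are illustrative. */
typedef struct {
    bool full, empty;
    int  rd_count, wr_count;   /* tokens readable / slots writable */
} md_fifo_state_t;

/* A read of n tokens may fire only if enough data is present. */
static bool can_read(const md_fifo_state_t *f, int n)
{
    return !f->empty && f->rd_count >= n;
}

/* A write of n tokens may fire only if enough space is free. */
static bool can_write(const md_fifo_state_t *f, int n)
{
    return !f->full && f->wr_count >= n;
}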

4.2.1.2. Conversion from the Polyhedral Model to the Data Flow Representation

In this section, we discuss the conversion of mapped loop graph descriptions in the polyhedral model to the simplified windowed synchronous data flow notation. The polyhedral model offers a rich framework for accelerator synthesis, whereas the benefits of using the WSDF model of computation are:

• buffer estimation and schedulability analysis (deadlock, determinism) of a self-timed schedule

• automated generation of dedicated communication channels, which are required for a hardware platform or a hardware/software platform

A loop description can be automatically classified as belonging to the WSDF model of computation if the communication orders of all input and output variables of an actor are identical. In other words, the communication order is determined by the loop schedule. This is true for the considered class of loop algorithms and our affine scheduling technique. Therefore, all communicating loop nests belonging to our class of algorithms can be converted to the WSDF model.

There is a one-to-one correspondence between the node descriptions of the mapped loop graph and of the WSDF graph. Hence, we can describe a vertex of a WSDF graph with the help of the description of the corresponding vertex in the mapped loop graph; it contains the information on the program, the iteration space, and the reduced dependence graph. This is a trivial step, discussed only for the sake of completeness. The tuple ⟨var, I_src^var, I_snk^var, D, Q_src^var, Q_snk^var, L_src^var, L_snk^var⟩ describes the I/O array variable, the iteration spaces, the dependency representing the array transfer, the processor allocation, and the execution order of the transported I/O variable.


[Figure 4.7 shows the source loop nest with the write statement $z[I] = y[\theta_{src} I - d_{src}] \;\forall I \in \mathcal{I}^z_{src}$, allocation $Q_{src}$, and schedule $L_{src}$; the sink loop nest with the read statement $x[I] = z[\theta_{snk} I - d_{snk}] \;\forall I \in \mathcal{I}^x_{snk}$, allocation $Q_{snk}$, and schedule $L_{snk}$; and the connecting edge tuple $\langle z, \mathcal{I}^z_{src}, \mathcal{I}^z_{snk}, D, Q_{src}, Q_{snk}, L_{src}, L_{snk} \rangle$.]

Figure 4.7.: Polyhedral representation of communicating loop nests as mapped loop graph.

We need to project the mapped loop graph notation onto the corresponding WSDF edge notation $\langle \vec{p}, \vec{v}, \vec{c}, O_{write}, O_{read} \rangle$. For the source and the sink loop nest shown in Figure 4.7, $\mathcal{I}_{src}$ and $\mathcal{I}_{snk}$ are the iteration spaces of the source loop program and the sink loop program, which contain the set of all loop iterations. Figure 4.7 shows the read and write statements in the communicating loops, which include the reference to the declared I/O variable $z$, corresponding to the transported multi-dimensional array. The iteration conditions of the I/O statements define the iteration space of the I/O variables. Intuitively, this data space corresponds to the multi-dimensional array declaration. In Figure 4.7, for the source loop, the indexing function for the I/O variable $z$ is an identity matrix (due to the output normal form), and the access is limited by the iteration-dependent condition $I \in \mathcal{I}^z_{src}$. Hence, the source data space, i.e., the iteration space of the source variable, is $\mathcal{I}^z_{src}$. Similarly, for the statement using the variable $z$ in the sink iteration space, the iteration condition is $I \in \mathcal{I}^x_{snk}$. Thus, the sink data space is given by $\mathcal{I}^z_{snk} = \{ I \mid I \in \mathcal{I}^x_{snk} \}$. The affine indexing function $\theta_{snk}$ determines the access to the array defined by $\mathcal{I}^z_{snk}$ and is realized in the I/O controller inside the sink accelerator. The major characteristics of the projection problem are:

• The complete multi-dimensional array must be transported; thus, the source and the sink iteration spaces of an I/O variable refer to the same array.

• The source node performs computations, which produce the tokens $\vec{p}$ composing the multi-dimensional data array. The token production parameters depend on the parallel data write access, which in turn depends on the allocation of the data space of the source variable. The source communication order then defines the sequence in which the tokens are composed to form the output array. This communication order parameter depends on the global schedule order of the source accelerator.

• The sink node extracts the consumer token $\vec{c}$ from the multi-dimensional array for performing a single computation. The token consumption depends on the parallel data read access, which in turn depends on the allocation of the data space of the sink variable. The sink communication order defines the sequence in which the tokens are extracted from the input array.

The projection problem must take the redistribution equation into account. Due to the tiling of the common data space, the redistribution equation (Equation (4.1)) can be represented as:

$$\underbrace{\begin{pmatrix} E & P_1 & P_1 P_2 \end{pmatrix}}_{A} \begin{pmatrix} I_{\vec{a}_1} \\ I_{\vec{a}_2} \\ I_{\vec{a}_3} \end{pmatrix} = \underbrace{\begin{pmatrix} E & P'_1 & P'_1 P'_2 \end{pmatrix}}_{B} \begin{pmatrix} I_{\vec{b}_1} \\ I_{\vec{b}_2} \\ I_{\vec{b}_3} \end{pmatrix} \qquad (4.3)$$

where $I_{\vec{a}_1}, I_{\vec{a}_2}, I_{\vec{a}_3} \in \mathcal{I}^{var}_{src}$ and $I_{\vec{b}_1}, I_{\vec{b}_2}, I_{\vec{b}_3} \in \mathcal{I}^{var}_{snk}$. It is seldom the simple case that both the source and the sink follow the same parallelization strategy (i.e., $A = B$). The case $A \neq B \Rightarrow \mathcal{I}^{var}_{src} \neq \mathcal{I}^{var}_{snk}$ is often observed due to different tiling and parallelization strategies of the source and the sink loop. However, it is the same multi-dimensional array that is being transported. Therefore, if $\mathcal{I}^{var} = \mathcal{I}_{\vec{x}}$ with $\vec{x} = \vec{a}_1 \vec{a}_2 \vec{a}_3 = \vec{b}_1 \vec{b}_2 \vec{b}_3$ refers to the common multi-dimensional array being transported, then the different tiling leads to incompatible data spaces. In the following, we assume that the source and the sink data space undergo copartitioning$(P_1, P_2)$ and copartitioning$(P'_1, P'_2)$, respectively, where $P_1$ and $P_2$ describe the inner and outer tiles of the data space.

The problem of determining the WSDF parameters characterizing the production and consumption of the data array depends on the selected tiling. We propose Algorithm 4.1 for the parameter conversion from the polyhedral model to the WSDF model. Input to Algorithm 4.1 are the I/O data spaces of the source and sink accelerators, which have to be represented as copartitioned iteration spaces, i.e., the source data space $\mathcal{I}^{var}_{src} = \mathcal{I}_{(\vec{a}_1, \vec{a}_2, \vec{a}_3)}$ and the sink data space $\mathcal{I}^{var}_{snk} = \mathcal{I}_{(\vec{b}_1, \vec{b}_2, \vec{b}_3)}$ (see Footnote 2 on page 115). Furthermore, the loop matrices characterizing the sequential execution of the corresponding loop variables are provided. The loop matrices are adjusted to the data space.

The algorithm differentiates whether tiling and parallelization lead to contiguous data tokens. In case of sequential execution, copartitioning$(\vec{1}, \vec{1})$, or inner loop parallelization, copartitioning$(\vec{1}, \vec{p})$, the data tokens are contiguous in the multi-dimensional array. In case of inner loop parallelization, all data elements of the I/O array variable in a tile are produced as a single token. In this case, the production and the consumption token vectors are equivalent to the number of output or input processor elements of the processor array producing or consuming the array variable. For copartitioning, the number of I/O PEs is determined by the tiling matrix $P_2$. Therefore, we infer $\vec{p} = \vec{a}_2$ and $\vec{c} = \vec{b}_2$ from Equation (4.2). The virtual token vector refers to the common multi-dimensional array $\mathcal{I}_{\vec{x}}$, which is tiled differently. Therefore, the virtual token is $\vec{v} = \vec{a}_1 \vec{a}_2 \vec{a}_3 = \vec{b}_1 \vec{b}_2 \vec{b}_3$. The write and the read order are each derived from the corresponding loop matrices.


Algorithm 4.1 Parameter conversion of a mapped loop graph edge to a WSDF edge
Require: I/O variable $var$; source data space $\mathcal{I}^{var}_{src} = \mathcal{I}_{(\vec{a}_1, \vec{a}_2, \vec{a}_3)}$; sink data space $\mathcal{I}^{var}_{snk} = \mathcal{I}_{(\vec{b}_1, \vec{b}_2, \vec{b}_3)}$; source loop matrix $L^{var}_{src}$; sink loop matrix $L^{var}_{snk}$
Ensure: producer token vector $\vec{p}$, consumer token vector $\vec{c}$, virtual token vector $\vec{v}$, write communication order $O_{write}$, read communication order $O_{read}$
1: // if source and sink loop undergo LPGS tiling/sequential execution
2: if $(\vec{a}_1 = \vec{1} \lor \vec{a}_2 = \vec{1}) \land (\vec{b}_1 = \vec{1} \lor \vec{b}_2 = \vec{1})$ then
3:   virtual token $\vec{v} = \vec{a}_1 \vec{a}_2 \vec{a}_3 = \vec{b}_1 \vec{b}_2 \vec{b}_3$
4:   producer token $\vec{p} = \vec{a}_2$
5:   consumer token $\vec{c} = \vec{b}_2$
6:   $O_{write} = [\vec{p},\ \vec{p} \cdot \vec{L}_1,\ \vec{p} \cdot (\vec{L}_1 + \vec{L}_2), \ldots, \vec{p} \cdot (\sum_i^n \vec{L}_i)]$, where $L^{var}_{src} = (\vec{L}_1\ \vec{L}_2 \ldots \vec{L}_n)$
7:   $O_{read} = [\vec{c},\ \vec{c} \cdot \vec{L}_1,\ \vec{c} \cdot (\vec{L}_1 + \vec{L}_2), \ldots, \vec{c} \cdot (\sum_i^n \vec{L}_i)]$, where $L^{var}_{snk} = (\vec{L}_1\ \vec{L}_2 \ldots \vec{L}_n)$
8: else
9:   // one of the loops undergoes LSGP/copartitioning; introduce an intermediate copy actor with edges connecting the source and sink actors to the copy actor
10:  for the source edge of the copy actor do
11:    virtual token $\vec{v} = (\vec{v}_1, \vec{v}_2, \vec{v}_3)$, $\vec{v}_1 = \gcd(\vec{a}_1, \vec{b}_1)$, $\vec{v}_2 = \frac{\vec{a}_1}{\gcd(\vec{a}_1, \vec{b}_1)}$, and $\vec{v}_3 = \vec{a}_2 \vec{a}_3$
12:    producer token $\vec{p} = (\vec{1}, \vec{1}, \vec{a}_2)$
13:    consumer token $\vec{c} = (\vec{1}, \vec{c}_1, \vec{c}_2)$, $\vec{c}_1 = \frac{\vec{a}_1}{\gcd(\vec{a}_1, \vec{b}_1)}$ and $\vec{c}_2 = \frac{\mathrm{lcm}(\vec{a}_1 \vec{a}_2,\, \vec{b}_1 \vec{b}_2)}{\vec{a}_1}$
14:    $O_{write} = [\vec{p},\ \vec{p} + \vec{L}_1,\ \vec{p} + \vec{L}_1 + \vec{L}_2, \ldots, \vec{p} + \sum_i^n \vec{L}_i]$, where $L^{var}_{src} = (\vec{L}_1\ \vec{L}_2 \ldots \vec{L}_n)$
15:    $O_{read} = [\vec{c},\ \vec{c} + \vec{L}_1,\ \vec{c} + (\vec{L}_1 + \vec{L}_2), \ldots, \vec{c} + (\sum_i^n \vec{L}_i)]$, where $L^{var}_{snk} = (\vec{L}_1\ \vec{L}_2 \ldots \vec{L}_n)$
16:  end for
17:  for the sink edge of the copy actor do
18:    virtual token $\vec{v} = (\vec{v}_1, \vec{v}_2, \vec{v}_3)$, $\vec{v}_1 = \gcd(\vec{a}_1, \vec{b}_1)$, $\vec{v}_2 = \frac{\vec{b}_1}{\gcd(\vec{a}_1, \vec{b}_1)}$, and $\vec{v}_3 = \vec{b}_2 \vec{b}_3$
19:    producer token $\vec{p} = (\vec{1}, \vec{p}_1, \vec{p}_2)$, $\vec{p}_1 = \frac{\vec{b}_1}{\gcd(\vec{a}_1, \vec{b}_1)}$ and $\vec{p}_2 = \frac{\mathrm{lcm}(\vec{a}_1 \vec{a}_2,\, \vec{b}_1 \vec{b}_2)}{\vec{b}_1}$
20:    consumer token $\vec{c} = (\vec{1}, \vec{1}, \vec{b}_2)$
21:    $O_{write} = [\vec{p},\ \vec{p} + \vec{L}_1,\ \vec{p} + \vec{L}_1 + \vec{L}_2, \ldots, \vec{p} + \sum_i^n \vec{L}_i]$, where $L^{var}_{src} = (\vec{L}_1\ \vec{L}_2 \ldots \vec{L}_n)$
22:    $O_{read} = [\vec{c},\ \vec{c} + \vec{L}_1,\ \vec{c} + (\vec{L}_1 + \vec{L}_2), \ldots, \vec{c} + (\sum_i^n \vec{L}_i)]$, where $L^{var}_{snk} = (\vec{L}_1\ \vec{L}_2 \ldots \vec{L}_n)$
23:  end for
24: end if

In case of sequential execution or inner loop parallelization, there is only a single loop matrix. In case of outer loop parallelization or general copartitioning, the data tokens are spread across the array, so that the produced and the consumed tokens must be reordered to compose the common data array. The purpose of the copy actor is the regular consumption and production of tokens to support parallel access. In order to prevent the communication subsystem from becoming the bottleneck, the common tokens must be consumed and produced in parallel. Therefore, the second case of Algorithm 4.1 presents the embedding into a data space where the tokens are contiguous and can thus be represented in the WSDF model. It can be shown that the production and consumption pattern repeats after every $\mathrm{lcm}(\vec{a}_1 \vec{a}_2,\, \vec{b}_1 \vec{b}_2)$ data elements.
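The case distinction in line 2 of Algorithm 4.1 can be phrased, per array dimension, as a one-line predicate. The following C sketch is our own paraphrase of that guard (the function name is hypothetical):

/* tokens are contiguous iff each loop runs sequentially or with inner
   loop parallelization, i.e., its inner or its outer tile is trivial */
static int tokens_contiguous(long a1, long a2, long b1, long b2)
{
    return (a1 == 1 || a2 == 1) && (b1 == 1 || b2 == 1);
}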

Theorem 4.2.1 For a source loop undergoing copartitioning$(\mathrm{diag}(\vec{a}_1), \mathrm{diag}(\vec{a}_2))$ and a sink loop undergoing copartitioning$(\mathrm{diag}(\vec{b}_1), \mathrm{diag}(\vec{b}_2))$, the communication pattern is repeated every $\mathrm{lcm}(\vec{a}_1 \vec{a}_2,\, \vec{b}_1 \vec{b}_2)$ data elements.

Proof 4.2.1 $P_1 = \mathrm{diag}(\vec{a}_1)$, $P_2 = \mathrm{diag}(\vec{a}_2)$ and $P'_1 = \mathrm{diag}(\vec{b}_1)$, $P'_2 = \mathrm{diag}(\vec{b}_2)$ are the tiling matrices of the source and sink loop data space, respectively. Furthermore, from the redistribution equation (4.3), we have

$$\begin{pmatrix} E & P_1 & P_1 P_2 \end{pmatrix} \begin{pmatrix} I_{\vec{a}_1} \\ I_{\vec{a}_2} \\ I_{\vec{a}_3} \end{pmatrix} = \begin{pmatrix} E & P'_1 & P'_1 P'_2 \end{pmatrix} \begin{pmatrix} I_{\vec{b}_1} \\ I_{\vec{b}_2} \\ I_{\vec{b}_3} \end{pmatrix}$$

I.e.,

$$I_{\vec{a}_1} + P_1 I_{\vec{a}_2} + P_1 P_2 I_{\vec{a}_3} = I_{\vec{b}_1} + P'_1 I_{\vec{b}_2} + P'_1 P'_2 I_{\vec{b}_3} = I$$

Let $L = (l_1, \ldots, l_n)$ be $\mathrm{lcm}(\vec{a}_1 \vec{a}_2,\, \vec{b}_1 \vec{b}_2)$, i.e., the least common multiple of $\vec{a}_1 \vec{a}_2$ and $\vec{b}_1 \vec{b}_2$, and let $\vec{a}_1 = (a_{11}, a_{12}, \ldots, a_{1n})$ and $\vec{a}_2 = (a_{21}, a_{22}, \ldots, a_{2n})$. By definition of copartitioning, the source loop output is mapped onto a $\mathrm{diag}(\vec{a}_2)$ processor array grid and the sink loop input is mapped onto a $\mathrm{diag}(\vec{b}_2)$ processor array grid. Then, any output iteration $I = (i_1, i_2, \ldots, i_n)$ is mapped onto the same source processor element $p_{src}$ as output iteration $L + I$, where

$$p_{src} = \left( \left\lfloor \frac{i_1}{a_{11}} \right\rfloor \bmod a_{21}, \ldots, \left\lfloor \frac{i_n}{a_{1n}} \right\rfloor \bmod a_{2n} \right) = \left( \left\lfloor \frac{l_1 + i_1}{a_{11}} \right\rfloor \bmod a_{21}, \ldots, \left\lfloor \frac{l_n + i_n}{a_{1n}} \right\rfloor \bmod a_{2n} \right)$$

since $l_1 = \mathrm{lcm}(a_{11} a_{21},\, b_{11} b_{21})$, i.e., $l_1 = x_1 \cdot a_{11} a_{21}$, it is divisible by $a_{11}$. Also, $\frac{l_1}{a_{11}}$ is divisible by $a_{21}$, which implies that $\left\lfloor \frac{i_1}{a_{11}} \right\rfloor \bmod a_{21} = \left\lfloor \frac{l_1 + i_1}{a_{11}} \right\rfloor \bmod a_{21}$. Hence, $I$ and $L + I$ are mapped onto the same source processor. Similarly, it can be shown that the iteration vectors $I$ and $L + I$ are mapped onto the same sink processor element

$$p_{snk} = \left( \left\lfloor \frac{i_1}{b_{11}} \right\rfloor \bmod b_{21}, \ldots, \left\lfloor \frac{i_n}{b_{1n}} \right\rfloor \bmod b_{2n} \right) = \left( \left\lfloor \frac{l_1 + i_1}{b_{11}} \right\rfloor \bmod b_{21}, \ldots, \left\lfloor \frac{l_n + i_n}{b_{1n}} \right\rfloor \bmod b_{2n} \right)$$

Therefore, the communication pattern repeats every $L$ data elements, as they are mapped onto the same source and sink processing elements. $\square$

Furthermore, $\gcd(\vec{a}_1, \vec{b}_1)$ data elements are produced sequentially. Therefore, in order to support parallel access, $n_{mem}$ memory banks are required to produce common data tokens:

$$n_{mem} = \frac{\mathrm{lcm}(\vec{a}_1 \vec{a}_2,\, \vec{b}_1 \vec{b}_2)}{\gcd(\vec{a}_1, \vec{b}_1)}$$

This information is used for the construction of the copy actor, with production and consumption reflecting the above number of independent parallel accesses. This is part of the algorithm for the producer and consumer token of the introduced copy actor. It can be shown that the construction is a general case, which is also valid for sequential execution or inner loop parallelization. Similar to the other parameters, $L_{src}$ is modified to $L^{var}_{src}$ and $L_{snk}$ is modified to $L^{var}_{snk}$ by embedding into the common data space. The solution approach to the projection problem is illustrated with the help of the following example.
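The bank count formula is easy to evaluate per array dimension. A minimal C sketch (our own; the helper names are illustrative) is given below, using the copartitionings of Example 4.2.1 as test input:

#include <stdio.h>

static long gcd(long x, long y) { while (y) { long t = x % y; x = y; y = t; } return x; }
static long lcm(long x, long y) { return x / gcd(x, y) * y; }

/* n_mem = lcm(a1*a2, b1*b2) / gcd(a1, b1), per dimension */
static long n_mem(long a1, long a2, long b1, long b2)
{
    return lcm(a1 * a2, b1 * b2) / gcd(a1, b1);
}

int main(void)
{
    /* Example 4.2.1, LSGP case: source copartitioning((6),(2)),
       sink copartitioning((4),(3)) => lcm(12,12)/gcd(6,4) = 6 banks */
    printf("n_mem = %ld\n", n_mem(6, 2, 4, 3));
    return 0;
}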

Example 4.2.1 We present the loop program in Figure 4.8(a) to illustrate Algorithm 4.1. The data spaces of the two communicating for loops are obtained by partitioning the same iteration domain $\mathcal{I} = \{ (i) \mid 0 \le i \le 11 \}$ with different tiles of size 6 and 4, respectively. Selecting different allocations like LPGS (inner loop parallelization) or LSGP (outer loop parallelization) causes different patterns of production and consumption of data elements. In case of inner loop parallelization of both loops, the processor allocation is given by

$$p = \underbrace{(1\ 0)}_{Q_{src} = Q_{snk}} \begin{pmatrix} j \\ i \end{pmatrix}$$

This allocation is equivalent to copartitioning$((1),(6))$ and copartitioning$((1),(4))$ for the source and the sink data space, respectively. In this case, the rectangular iteration spaces of the source and the sink data space can also be represented


[Figure 4.8(a) shows the two communicating loop nests of the toy example,

for(i=0;i<2;i++)            for(i1=0;i1<3;i1++)
  for(j=0;j<6;j++){           for(j1=0;j1<4;j1++){
    A[i,j]=...                  A[i1,j1]=...
  }                           }

with the array transfer dependency $D: (6\ 1) \begin{pmatrix} i \\ j \end{pmatrix} = (4\ 1) \begin{pmatrix} i_1 \\ j_1 \end{pmatrix}$. Panels (b)-(d) illustrate the resulting token production and consumption patterns over the data elements 1 to 12.]

Figure 4.8.: (a) loop graph of a toy example, (b) WSDF model of computation on LPGS or inner loop parallelization, (c) problem illustration of non-contiguous data tokens on outer loop parallelization or LSGP, and (d) WSDF notation for LSGP tiling.

as $\mathcal{I}_{(1,6,2)}$ and $\mathcal{I}_{(1,4,3)}$, respectively. The production and consumption token vectors result in $\vec{p} = \vec{a}_2 = 6$ and $\vec{c} = \vec{b}_2 = 4$ if the generated data tokens are contiguous (LPGS tiling). By definition, the virtual token is the common layout of the transported multi-dimensional array. Therefore, the virtual token evaluates to $\vec{v} = \vec{a}_1 \vec{a}_2 \vec{a}_3 = \vec{b}_1 \vec{b}_2 \vec{b}_3 = 6 \times 2 = 4 \times 3 = 12$. Indeed, the source and the sink accelerators must fire $\vec{a}_3$ and $\vec{b}_3$ times for producing and consuming a virtual token. The write and read orders are given by $O_{write} = [(6),(12)]$ and $O_{read} = [(4),(12)]$, respectively, which are directly derived from the loop matrices, given by $L^{var}_{src} = (2)$ and $L^{var}_{snk} = (3)$. This is also illustrated in Figure 4.8(b). In the case that both loops undergo outer loop parallelization, the processor allocation is given by

$$p = \underbrace{(0\ 1)}_{Q_{src} = Q_{snk}} \begin{pmatrix} j \\ i \end{pmatrix}$$

This allocation is equivalent to copartitioning$((6),(2))$ and copartitioning$((4),(3))$ for the source and sink data spaces, respectively. The corresponding rectangular iteration spaces for the source and the sink can be represented by $\mathcal{I}_{(6,2,1)}$ and $\mathcal{I}_{(4,3,1)}$, respectively. The generated data tokens are not contiguous, because $P_1 \neq \mathrm{diag}(\vec{1})$. This is depicted in Figure 4.8(c). The arrows show the non-contiguous tokens being produced and consumed by the source and sink accelerators. The dotted arrows must be produced for parallel access. The corresponding token parameters of the source edge of the copy actor are

$$\vec{v} = \begin{pmatrix} \gcd(6,4) \\ \frac{6}{\gcd(6,4)} \\ 2 \cdot 1 \end{pmatrix} = \begin{pmatrix} 2 \\ 3 \\ 2 \end{pmatrix}$$

$$\vec{p} = \begin{pmatrix} 1 \\ 1 \\ 2 \end{pmatrix} \quad \land \quad \vec{c} = \begin{pmatrix} 1 \\ \frac{6}{\gcd(6,4)} \\ \frac{\mathrm{lcm}(6 \cdot 2,\, 4 \cdot 3)}{6} \end{pmatrix} = \begin{pmatrix} 1 \\ 3 \\ 2 \end{pmatrix}$$

In consequence, the write and read orders on the source edge are

$$O_{write} = \left[ \begin{pmatrix} 1 \\ 1 \\ 2 \end{pmatrix}, \begin{pmatrix} 2 \\ 3 \\ 2 \end{pmatrix} \right], \quad O_{read} = \left[ \begin{pmatrix} 1 \\ 3 \\ 2 \end{pmatrix}, \begin{pmatrix} 2 \\ 3 \\ 2 \end{pmatrix} \right]$$

If Algorithm 4.1 is applied, the token vectors of the sink edge of the copy actor are


$$\vec{v} = \begin{pmatrix} \gcd(6,4) \\ \frac{4}{\gcd(6,4)} \\ 3 \cdot 1 \end{pmatrix} = \begin{pmatrix} 2 \\ 2 \\ 3 \end{pmatrix}$$

$$\vec{p} = \begin{pmatrix} 1 \\ \frac{4}{\gcd(6,4)} \\ \frac{\mathrm{lcm}(6 \cdot 2,\, 4 \cdot 3)}{4} \end{pmatrix} = \begin{pmatrix} 1 \\ 2 \\ 3 \end{pmatrix} \quad \land \quad \vec{c} = \begin{pmatrix} 1 \\ 1 \\ 3 \end{pmatrix}$$

Similarly, the write and read orders on the sink edge are

$$O_{write} = \left[ \begin{pmatrix} 1 \\ 2 \\ 3 \end{pmatrix}, \begin{pmatrix} 2 \\ 2 \\ 3 \end{pmatrix} \right], \quad O_{read} = \left[ \begin{pmatrix} 1 \\ 1 \\ 3 \end{pmatrix}, \begin{pmatrix} 2 \\ 2 \\ 3 \end{pmatrix} \right]$$

The conversion of the mapped loop graph in the polyhedral model into the WSDF model of computation can be summarized as follows:

• Each communicating loop nest, written in single assignment form as a DPLA, can be inferred as an actor describing a node of the WSDF model of computation.

• The data structure used for communication between the nodes of the loop programs is a multi-dimensional array. This multi-dimensional array is represented as an iteration space, called data space, in the polyhedral notation. The information on data space, allocation, and scheduling can be used to derive the parameters of the edges in the WSDF model. The parameters include a producer and a consumer token vector, which produce common virtual tokens.

• The iteration spaces of the source and the sink I/O variables are seldom the same; i.e., there are complex dependencies between source and sink data elements arising due to different tiling and parallelization. In order to account for such incompatible arrays, a copy actor must be introduced to reorganize the source array tokens into the sink array in case the data tokens are not contiguous.

In the next section, we discuss the synthesis of dedicated communication channels called multi-dimensional FIFOs, which correspond to a WSDF edge.

4.2.2. Multi-dimensional FIFO: Architecture and Synthesis

The previous section explained the modeling of the semantics of communicating loop programs. In this section, we explain the communication primitive, which is generated for data transfer between the hardware loop accelerators, depending on the simplified WSDF parameters.


[Figure 4.9 shows the multi-dimensional FIFO between a source and a sink accelerator: the processing elements (PE1-PE4) of the source accelerator write via channel selectors into multiple memory channels (dual-port BRAMs), and the channel selectors on the sink side read the data out again. An I/O controller with a source iterator counter, a fill level control unit (full/empty, write_en/read_en), address generation with decoders, and a sink-side I/O controller with a sink iterator counter complete the architecture.]

Figure 4.9.: Hardware communication infrastructure.

Unlike classical FIFO communication, this communication primitive must support parallel write and read data access for the producer and consumer accelerator, respectively. In addition, out-of-order communication, arising due to different tiling and scheduling strategies, must be supported. A novel communication template called multi-dimensional FIFO (MD-FIFO) was presented by Keinert et al. in [109, 106] for the data exchange between the loop accelerators. Although it follows the FIFO semantics, it contains multiple channels, complex address generation hardware, and fill level control, which support parallel data access and out-of-order communication. An overview of the multi-dimensional FIFO architecture is shown in Figure 4.9. The architecture is similar to a one-dimensional FIFO due to interface signals like full, empty, enable, and data. A data read access can take place only if the empty flag indicates that data is available. Similarly, a write access can only take place when the full flag indicates that enough empty space is available for writing the data. The communication architecture can be divided into two parts:

• memory subsystem

• controller part, consisting of memory channel selectors, address generation, and the fill level control logic

The necessary input for generating this communication architecture is the semantics of the corresponding edges in the WSDF model, which in turn is obtained from the

mapped loop graph. We explain the different parts of the communication subsystem in the following example.

Example 4.2.2 In a scenario that requires the transmission of the edge map for feature extraction, the Sobel edge detection algorithm and the block-based DCT are part of the image coding pipeline. The output data space of the edge detection is a 512×512-sized array, where each data element of the image is produced sequentially. This data space can also be represented as $\mathcal{I}_{((1,1),(1,1),(512,512))}$ by an equivalent copartitioning$((1,1),(1,1))$. Each tile contains a single iteration, and there is only 1 PE, which generates the output array elements in a sequential, row-major order. Similarly, the input data space of the DCT Tiler accelerator is an 8×8×64×64-sized multi-dimensional array, where 8×8 elements are consumed by the DCT Tiler accelerator. This data space can also be represented as $\mathcal{I}_{((8,8),(1,1),(64,64))}$ by an equivalent copartitioning$((8,8),(1,1))$. Each tile contains 8×8 iterations. The PE array consumes 8×8 array elements sequentially, and then produces the blocks sequentially 64×64 times in a row-major order. Figure 4.10(b) shows the execution order of the edge detection source accelerator and the DCT Tiler sink accelerator. The WSDF parameters for the edge are the producer token vector $\vec{p} = (1\ 1)^T$, the consumer token vector $\vec{c} = (1\ 1)^T$, the virtual token vector $\vec{v} = (512\ 512)^T$, the source write order $O_{write} = [(1\ 1)^T, (512\ 1)^T, (512\ 512)^T]$, and the sink read order $O_{read} = [(1\ 1)^T, (1\ 8)^T, (8\ 8)^T, (512\ 512)^T]$. Similarly, the communication semantics between a horizontal and a vertical DCT, which together form the DCT operation, is shown in Figure 4.10(d).

The memory structure is characterized by the presence of multiple dual-port RAMs, which enable parallel reads and writes of multiple data elements. The parallel read and write access alongside the out-of-order communication can be supported only if the multi-dimensional array is partitioned into multiple memory channels. The memory channels can be mapped efficiently onto physical memories to exploit larger bit-widths, i.e., the parallel tokens are concatenated and are connected to and from a single memory cell. The number of memory channels is given by [106]

$$m = \mathrm{lcm}(\vec{p}, \vec{c})$$

Hence, for the ED-DCT example, only a single memory channel forms the communication subsystem. The address generation unit has two purposes. First, the correct data must be read. The memory address is derived by linearization in the production order; i.e., the source write address is simply incremented by one and mapped back to zero if equal to $B$. It is the task of the sink address generation to produce addresses for reading the correct data. The determination of the sink address $addr(\vec{i}_{DCT})$ is more intricate, since it must be determined using the corresponding source iteration vector $\vec{i}_{ED}$, which can lead to complex address generation schemes.
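The source-side linearization can be stated in two lines of C; the following helper is our own illustration of the increment-and-wrap behavior described above (the function name is hypothetical):

/* next source write address: linearized in production order,
   incremented by one and wrapped back to zero at buffer size B */
static unsigned next_wr_addr(unsigned addr, unsigned B)
{
    return (addr + 1 == B) ? 0 : addr + 1;
}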


[Figure 4.10 contents: (a) the loop graph Sobel → DCT Tiler → DCT with the data spaces $\mathcal{I}_{((1,1),(1,1),(512,512))}$, $\mathcal{I}_{((8,8),(1,1),(64,64))}$, and $\mathcal{I}_{((1,1),(8,1),(1,8))}$; (b) the source addresses written row by row (1, 2, ..., 512; 513, ...) versus the sink addresses read block by block (1..8, 513..520, ..., 3585, ...), together with the incremental sink address computation

IF(i1<7)                           addr := addr+1
ELSEIF(i1==7 && j1<7)              addr := addr+505
ELSEIF(i1==7 && j1==7 && i2<63)    addr := addr-3583
ELSEIF(i1==7 && j1==7 && i2==63)   addr := addr+1
ENDIF

(c) the corresponding WSDF parameters $\vec{p} = (1\ 1)^T$, $\vec{c} = (1\ 1)^T$, $\vec{v} = (512\ 512)^T$, $O_{read} = [(1\ 1)^T, (1\ 8)^T, (8\ 8)^T, (512\ 512)^T]$, $O_{write} = [(1\ 1)^T, (512\ 1)^T, (512\ 512)^T]$; and (d) hDCT → vDCT with the data spaces $\mathcal{I}_{((1,1),(8,1),(1,8))}$ and $\mathcal{I}_{((1,1),(1,8),(8,1))}$.]

Figure 4.10.: (a) The loop graph of a Sobel filter communicating with a horizontal DCT, which reads 8×8 data sequentially. (b) The numbers show the address of the data elements, (c) corresponding WSDF model, (d) WSDF model of horizontal DCT, communicating with vertical DCT.

For example, we need to generate 1, 2, 3, 4, 5, 6, 7, 8, 513, ..., 520, ..., 3585, ... as addresses. This can be simplified using the observation that appropriate increments lead to a simplified address generation with nested if-then-else statements [109]. This is also shown in Figure 4.10(b). The fact that linearization leads to simple addressing functions and good memory efficiency is the basis of several heuristics for determining the minimum memory requirement (i.e., determining the optimal $B$) [122, 106, 27]. Second, depending on the source and the sink address, the data must be forwarded to the correct memory channel in case of multiple channels. This can be done by comparing the generated address in the decoder of the channel selector, in order to generate the multiplexer and demultiplexer control signals. Hence, it is the task of the channel selectors to divert the incoming data tokens into the correct channel.

The fill level control unit is responsible for the synchronization with the source and sink accelerators, so that valid data is read correctly and is not overwritten. The fill level control part of the multi-dimensional FIFO is more complicated than that of a classical FIFO due to the out-of-order communication. It uses two counters to keep track of the number of data elements in the memory channels, depending on the amount of read and write accesses. Furthermore, the number of possible write and read operations for the source and the sink must be kept track of. Initially, the number of possible writes is equivalent to the buffer size, and the number of reads is zero. Every time a data item is written or read, the source and sink counters are decremented and incremented accordingly. The number of possible reads is also updated after every write operation for the sink level control by calculating the maximum number of allowable read operations [109]. This problem is formulated as a parametric integer linear program [106]. Future work is to obtain the conditionals for address generation and fill level control using the Hermite normal form.
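To make the increment scheme of Figure 4.10(b) concrete, the following self-contained C sketch (our own rendering; the loop variable names follow the figure) walks through the sink addresses of the ED-DCT example:

#include <stdio.h>

/* Incremental sink address generation for the ED-DCT example:
   a 512x512 image is read in 8x8 blocks; (i1,j1) iterate inside a
   block, i2 counts the blocks of a block row, j2 the block rows.
   The increments +1, +505, -3583, +1 are taken from Figure 4.10(b). */
int main(void)
{
    long addr = 1;  /* address of the first consumed data element */
    for (long j2 = 0; j2 < 64; j2++)
        for (long i2 = 0; i2 < 64; i2++)
            for (long j1 = 0; j1 < 8; j1++)
                for (long i1 = 0; i1 < 8; i1++) {
                    /* ... read the data element at 'addr' here ... */
                    if (i1 < 7)       addr += 1;     /* inside a block row    */
                    else if (j1 < 7)  addr += 505;   /* next row of the block */
                    else if (i2 < 63) addr -= 3583;  /* next block in the row */
                    else              addr += 1;     /* next block row        */
                }
    printf("final addr = %ld\n", addr);
    return 0;
}

In the next section, we examine the integration of accelerators in a system-on-chip.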

4.3. Synthesis of Accelerators for MPSoCs

The architecture model of an SoC consists of different components like CPUs (software), accelerators (hardware), memories, and connectivity controllers (e.g., TFT display, USB). The physical communication style between hardware, software, and other components is memory-mapped. As the name suggests, memory-mapped I/O needs addresses for the registers of each of the IP components in the shared address space; i.e., the CPU program communicates with the input, output, and status ports of the accelerator IP using normal memory read and write instructions. In a high-level language like C, one can use pointers for manipulating addresses.

Figure 4.11 shows the processing flow of such a hardware/software-based SoC. The CPU initiates the data transfer from the main memory. The data is then stored in an appropriate input memory of the accelerator. The accelerator subsystem reads data from its input memory depending on the status flags stored in the control registers of


[Figure 4.11 depicts the CPU, the main memory, the bus interface with read/write units, the control registers (BRAM controller), the glue logic, the input/output data registers, and the accelerator core.]

Figure 4.11.: Processing flow in an accelerator-based SoC system. 1) Data is fetched from the main memory by the CPU/DMA to the accelerator memory. 2) The accelerator executes and reads data in parallel. 3) Data is written back to the accelerator memory. 4) Data is stored back in the main memory by the CPU according to the finish control register status.

the accelerator. The computed data is then stored back in the output memory of the accelerator. The CPU polls the finish control register; depending on its status, the output data is then read out and transmitted to the main memory. The control status registers are used by the CPU to check the status of the memory (i.e., full/empty) and for starting or stopping the computation in the accelerator. The execution is single-threaded, as the CPU writes the data into the accelerator memory and waits for the computation in the accelerator to finish. This blocking synchronization means that steps 1 and 4 in Figure 4.11 are executed sequentially, whereas all other steps follow a pipelined model of execution. This is also known as busy-wait I/O, and the process of continuously reading the status registers is called polling [184]. Therefore, the abstract communication style is buffered, blocked, and asynchronous.

The major problem that needs to be addressed to support the memory mapping and processing flow is communication synthesis in the context of HW/SW communication (i.e., between the processor and the accelerator module). Communication synthesis involves the sub-problems of channel binding, communication refinement, and interface generation [62]. Channel binding defines the interconnect topology of the system. Among the different available interconnect possibilities, like point-to-point connections, buses, and networks-on-chip, we bind the accelerator/software communication channels onto buses. This loosely coupled communication offers a trade-off between performance and flexibility. However, the concepts introduced in this thesis are also


[Figure 4.12 contents: (a) a 2×2 processor array with the local buffers A_00, A_01, B_00, B_10 and C_00, C_01, C_10, C_11; (b) the generated memory map

/********************* Memory Map ****************
 * Base Address  Size  Description
 * ----------------------------------------------
 * 0x28000000     16   A
 * 0x28000010     16   B
 * 0x28000020     16   C
 * 0x28000030      1   Ctrl_A
 * 0x28000031      1   Ctrl_B
 * 0x28000032      1   Ctrl_C
 *************************************************/

with the buffer base addresses BaseAddr=0x28000000, Addr_A_00=BaseAddr+0x00, Addr_A_01=BaseAddr+0x08, Addr_B_00=BaseAddr+0x10, Addr_B_10=BaseAddr+0x18, Addr_C_00=BaseAddr+0x20, Addr_C_01=BaseAddr+0x24, Addr_C_10=BaseAddr+0x28, Addr_C_11=BaseAddr+0x2C, Addr_A_Ctrl=BaseAddr+0x30, Addr_B_Ctrl=BaseAddr+0x31, Addr_C_Ctrl=BaseAddr+0x32; (c) the C header definitions

#define A_BASE      (0x28000000)
#define B_BASE      (0x28000010)
#define C_BASE      (0x28000020)
#define Ctrl_A_BASE (0x28000030)
#define Ctrl_B_BASE (0x28000031)
#define Ctrl_C_BASE (0x28000032)
]

Figure 4.12.: (a) 2×2 processor array accelerator for matrix-matrix multiplication. (b) Memory map for the accelerator. (c) C header definition for accelerator access.

applicable to other alternatives. Communication refinement refers to the dimensioning of the features of the communication support; in our case, one can set the address or data width and the number of masters and slaves on the bus. Interface synthesis deals with the generation of hardware and software communication structures like hardware wrappers and device drivers. The channel binding and communication refinement are specified by the system architect. In the next subsection, we examine and solve the interface synthesis problem.

4.3.1. Interface Synthesis

The generic hardware/software interface template consists of adapters (hardware wrappers) and device drivers. The adapters are used to interface incompatible protocols; in our case, this means interfacing the bus signals with the accelerator signals. The device drivers are responsible for data transfer and synchronization between the accelerator and the processor. The software drivers in our case access the accelerator directly using memory-mapped I/O. Therefore, the support for memory mapping must be integrated in the adapters. The memory map, adapter (hardware wrapper or interface circuit), and device driver generation, which are necessary for interface synthesis, are discussed in the following subsections.

4.3.1.1. Accelerator Memory Map Generation

The first step is to establish a memory map in order to support memory-mapped I/O. The base address indicates the location of the accelerator registers or memory in the shared address space of the SoC. The memory map generation is illustrated with the help of the following example.


Example 4.3.1 A matrix-multiplication accelerator is shown in Figure 4.12(a). The variables A, B, and C are stored in 2, 2, and 4 local buffers of size 8, 8, and 4, respectively. The base addresses of the local buffers of the I/O variables A, B, and C are stored in the generated memory map as shown in Figure 4.12(b). These hexadecimal addresses are used in C header definitions for read and write access to/from the accelerator IP (see Figure 4.12(c)).

Algorithm 4.2 Accelerator memory map generation
Require: $v_a \in$ AG, hardware component of the architecture graph
Ensure: Memory map for $v_a$, MMap
1: set base address, baseAddress
2: // for all (in/out)going edges of $v_a$ whose src/snk are mapped onto a processor
3: for all $e \in E$ s.t. $(src(e) = v_a \land snk(e) = v_p) \lor (snk(e) = v_a \land src(e) = v_p)$ do
4:   // for all local buffers for storing variable $var = var(e)$
5:   for all $m \in \mathcal{M}_{var}$, $(\mathcal{M}_{var} = Q \mathcal{I}_{var})$ do
6:     key = memName(m); value = baseAddress
7:     MMap.insert(key, value)
8:     width = MemSize(m)
9:     baseAddress = baseAddress + width
10:  end for
11: end for
12: // for all (in/out)going edges of $v_a$ whose src/snk are mapped onto a processor
13: for all $e \in E$ s.t. $(src(e) = v_a \land snk(e) = v_p) \lor (snk(e) = v_a \land src(e) = v_p)$ do
14:   // for all control registers for storing variable $var = var(e)$
15:   if mode == RAM then
16:     value = baseAddress
17:     MMap.insert("statusReg", value)
18:   else
19:     for all I/O variables $var = var(e)$ do
20:       key = ctrlRegName(var); value = baseAddress
21:       MMap.insert(key, value)
22:       baseAddress = baseAddress + 1
23:     end for
24:   end if
25: end for

The base address of each of the local buffers is stored in the memory map. The base addresses are also required for generating an address-decoding FSM in the hardware wrapper. The pseudo-code for generating the memory map of an accelerator $v_a$ of the architecture graph is given in Algorithm 4.2. The memory map generation is needed only for those accelerators which are connected to a processor. The initial base address is selected

such that the accelerator address space does not lie in the reserved address space of the system. In order to generate the memory map of an accelerator, we iterate over all input and output edges of vertex $v_a$; then, for each I/O variable $var$ on the edge, we find the local buffers allocated to it. Using Equation (3.44), the memory space for I/O variable $var$ can be found. Intuitively, all the points in the memory space $\mathcal{M}_{var}$ of variable $var$ represent the index of a dedicated memory bank, which is connected to a PE in the accelerator (see Section 3.2.2.1). For each memory bank, a base address is assigned. The base address is obtained by adding to the last base address an offset equivalent to the size of the last memory bank. The same process of assigning a base address to all control registers is also performed. For the FIFO model, the status register contains the status information for each I/O variable. For the RAM model, the status registers consist only of a start and a finish signal.
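The address assignment loop of Algorithm 4.2 can be mimicked in a few lines of C. The following sketch is our own illustration; the bank names and sizes follow Example 4.3.1, while the data structure itself is hypothetical:

#include <stdio.h>

typedef struct { const char *name; unsigned size; } bank_t;

int main(void)
{
    /* local buffers and control registers of the matrix-multiplication
       accelerator of Example 4.3.1, placed back to back */
    bank_t banks[] = {
        {"A_00", 8}, {"A_01", 8}, {"B_00", 8}, {"B_10", 8},
        {"C_00", 4}, {"C_01", 4}, {"C_10", 4}, {"C_11", 4},
        {"Ctrl_A", 1}, {"Ctrl_B", 1}, {"Ctrl_C", 1},
    };
    unsigned base = 0x28000000u;  /* outside the reserved address space */
    for (size_t i = 0; i < sizeof banks / sizeof banks[0]; i++) {
        printf("%-6s -> 0x%08X\n", banks[i].name, base);
        base += banks[i].size;    /* next bank starts after this one */
    }
    return 0;
}

Running this reproduces the memory map of Figure 4.12(b), e.g., A_01 at 0x28000008 and C_00 at 0x28000020.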

4.3.1.2. Hardware Wrapper

In this section, we discuss the generation of a hardware wrapper in the context of hardware accelerator integration in an SoC. The major parts of the hardware wrapper are a protocol conversion FSM and an address decoding logic. The hardware wrapper must convert the bus signals into accelerator signals. In our approach, we solve the problem of protocol conversion by using a generic memory controller for translating the bus interface signals into memory-like interface signals (see Figure 4.11). The advantage of using a memory controller instead of customized logic is the free availability of a generic optimized memory controller for any particular bus, whereas an interface generator would have been required otherwise to generate the customized interface. Subsequently, the memory-like signals are converted into accelerator signals by so-called glue logic. In the glue logic, addresses from the bus are decoded to identify whether the loop accelerator is being accessed by the processor. Furthermore, the input/output/status data has to be routed to/from the corresponding accelerator memory. Therefore, glue logic is generated for routing the data correctly, depending on the address information. The glue logic performs the address decoding of the accelerator address and generates the synchronization signals for memory-mapped access, whereas the memory controller solves the protocol conversion problem for free. The architecture of the hardware wrapper is shown in Figure 4.13(b). Once the software device driver writes data in the accelerator address range, the BRAM controller converts the bus signals to memory-like signals and forwards them (i.e., address, data, and control signals) to the glue logic. The BRAM address signal (BRAM_Addr) to the accelerator memory is decoded so that the correct local buffer is selected for the load/store operation. The input data signal (BRAM_Din) reads the accelerator output data or the status register of the accelerator memory. The output data signal (BRAM_Dout) is used to transfer the input data to the accelerator memory. The accelerator IP designed by the user can have a variable number of buffers


[Figure 4.13 contents: (a) the Bram_Addr bit structure — 8 bits to select the base address of the accelerator, 21 bits to select the memory banks of the accelerator, and 3 bits for the bank address, e.g., 00100100000000000000000000001 001; (b) the glue logic with an address decoder, a demultiplexer routing Bram_Dout to the input buffers (A_00, A_01, B_00, B_10), and a multiplexer collecting Bram_Din from the output buffers (C_00, C_01, C_10, C_11) and the control register, driven by Bram_En, Bram_Addr, and Bram_WEn; (c) the address decoding logic:

s_data_prog_mmm_in_0_nid107_A00_true <= '1'
    when s_addr(28 to 29) = "00" else '0';
s_data_prog_mmm_in_0_nid107_A01_true <= '1'
    when s_addr(28 to 29) = "01" else '0';
s_data_prog_mmm_in_0_nid108_B00_true <= '1'
    when s_addr(28 to 29) = "10" else '0';
s_data_prog_mmm_in_0_nid107_B10_true <= '1'
    when s_addr(28 to 29) = "11" else '0';
]

Figure 4.13.: (a) Address big-endian byte structure of Bram_Addr, (b) glue logic architecture, (c) VHDL code snippet showing the address decoding logic.

(input/output) and control registers inside the IP; therefore, there is a need for automation of the hardware wrapper generation process. Address decoding is the process of generating the select signals by decoding the address signals for each memory bank in the accelerator system; i.e., the address decoder in the glue logic evaluates the address and enable signals to produce the select signals for the multiplexers/demultiplexers. The address bus lines are split into three sections via a 32-bit big-endian data structure, as shown in Figure 4.13(a). The L most significant bits are used to generate the select signal for the accelerator device, where the memory space is aligned at a 2^L byte boundary of the address space. The S least significant signals are passed on as addresses to the different memory banks or control registers. The bit region (L : 32−S) is reserved for selecting the local buffer or the control registers of the accelerator.

Algorithm 4.3 illustrates the steps for automatically generating the hardware wrapper of the accelerator IP. The required input is the accelerator memory map (see Algorithm 4.2), which contains the generated base address of each of the local buffers inside the accelerator IP. The memory map also contains the sizes of the buffers and control registers. The output is an RTL description in VHDL of the hardware wrapper, which consists of a memory controller, an address decoding mechanism, and multiplexer logic. At first, based on the maximum width of all the buffers, the algorithm computes how many bits S are required to access each of the locations inside a buffer. This is done by using a simple logarithmic function. Then, that many least significant bits of the address port are passed directly to the accelerator. Using the size of the alignment boundary, the number of most significant bits L corresponding to the accelerator region is calculated. Finally, the memory map is iterated over all keys, and depending on the base address value, the address decoding logic is generated for the selection of the correct buffer. The VHDL code snippet for the address decoding logic is shown in Figure 4.13(c).

It shows the decoding logic for routing the data corresponding to the address region into the correct buffer. The number of cases in the logic is equivalent to the number of regions inside the IP, which is the sum of the number of all buffers (input and output) and control registers inside the IP. The wiring of all generated components, i.e., memory controller, glue logic, and accelerator, gives the hardware wrapper of the accelerator IP.

Algorithm 4.3 Hardware wrapper generation for accelerator $v_a$
Require: memory map of $v_a$, width, busWidth
Ensure: HW wrapper (VHDL)
1: set generics BaseAddress, HighAddress, and DataWidth for the BRAM controller IP
2: generate BRAM controller
3: // least significant bits for storing the address within a local buffer
4: S = log2(width)
5: // most significant bits for selecting the accelerator
6: L = log2(accelerator alignment boundary)
7: reserve ((busWidth − S) : busWidth) bits for the local buffer address
8: reserve (L : (busWidth − S)) bits for the selection of the local buffer or control register region
9: iterate over the memory map to generate the address decoder logic
10: generate glue logic
11: wire BRAM controller, glue logic, and accelerator RTL to form the hardware wrapper

Example 4.3.2 For the memory map of the matrix multiplication algorithm in Example 4.3.1, the alignment is done at a 256 = 2^8 byte boundary. The hexadecimal value 0x28000000 is assigned as base address to the loop accelerator, as shown in Figure 4.12. The L = 8 most significant bits encode the base address information of the accelerator: 0x28 will be decoded from the address and used to detect an access to the accelerator. The address 0x28000009, where the base address is offset by the hexadecimal value 0x09, points to the buffer A01. The bits (8:29) are used to select the access to memory bank A01. The last 3 least significant bits, as also shown in Figure 4.13(a), are passed on as the address to the memory bank A01 for writing data (BRAM_Dout).
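The address split of this example can be checked with a few shifts and masks. The following C sketch is our own illustration; for simplicity it only covers the two-bit bank field of the VHDL snippet in Figure 4.13(c):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint32_t addr = 0x28000009u;         /* access to buffer A01          */
    uint32_t dev  =  addr >> 24;         /* L = 8 MSBs: device select     */
    uint32_t bank = (addr >> 3) & 0x3u;  /* 2 bits: bank A00/A01/B00/B10  */
    uint32_t off  =  addr & 0x7u;        /* S = 3 LSBs: word in the bank  */
    printf("dev=0x%02X bank=%u offset=%u\n",
           (unsigned)dev, (unsigned)bank, (unsigned)off);
    /* prints dev=0x28 bank=1 offset=1, i.e., word 1 of buffer A01 */
    return 0;
}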

4.3.1.3. Software Driver

In embedded systems design, it is a recommended approach to provide drivers for accessing the IPs from processors in order to hide the complexity from the end user. Therefore, for all accelerator IPs communicating with processors, it is necessary to provide drivers for data communication and synchronization purposes. In this section, we discuss the automated generation of driver programs.


[Figure 4.14 contents: the loop graph P → A → C, where the producer P() and the consumer C() run on the processor and access the accelerator A:

volatile int *OUT_BASE=BRAM_CNTLR_0_BASEADDR+0x1000;
volatile int *IN_BASE=BRAM_CNTLR_0_BASEADDR;
volatile int *CTRL_BASE=BRAM_CNTLR_0_BASEADDR+0x2000;

P(){
  for(i=0; i<64; i++)
    for(j=0; j<64; j++) {
      out[i,j]=f(...);
      streamWrite(*out, i, j);
    }
}

streamWrite(int* out, int i, int j){
  if(*CTRL_BASE!=FULL)
    OUT_BASE[i*64+j]=out[i*64+j];
}

C(){
  for(i=0; i<64; i++)
    for(j=0; j<64; j++) {
      streamRead(*in, i, j);
      ... = g(in);
    }
}

streamRead(int* in, int i, int j){
  if(*(CTRL_BASE+0x1)!=EMPTY)
    in[i*64+j]=IN_BASE[i*64+j];
}
]

Figure 4.14.: The producer and consumer loop programs running on a processor using the memory map information, and driver routines for accessing the accelerator.

The software driver consists of a header definition of the accelerator memory map and stream read/write functions for reading and writing data from/to the accelerator. These read/write functions must be inserted into the program running on the processor. Figure 4.14 shows the structure of the loop programs running on a processor and a non-programmable accelerator, respectively. The driver provides a programming interface consisting of the streamRead() and streamWrite() functions, as shown in Figure 4.14. The functions P() and C() denote the producer and consumer loop programs running on the processor, which access the accelerator. These loop programs, apart from executing their own kernel, are also responsible for the data transfer to and from the accelerator. The set of nested loops iterates over the corresponding iteration domains and updates the I/O variables. The streamWrite() and streamRead() functions then transfer the output and input data to and from the accelerator, respectively. The user may need to insert qualifying if statements if the I/O domain is not the same as the iteration domain. The following steps need to be undertaken for the generation of a software driver (i.e., streamRead() and streamWrite()):

1. parse the memory map and generate the C header information

2. generate the synchronization part of the software driver

3. generate the read and write procedures of the software driver functions

In the first step, a C-language header file that provides an abstract interface to the hardware is generated. It allows access to the local buffers and control registers of the accelerators by names rather than by addresses. This offers portability in the sense that if the memory map of the accelerator changes, then only the header definitions need to be changed. For each key and base address pair in the memory map, a define macro is generated to assign a symbolic key name to the base address, as shown in Figure 4.14. The memory-mapped buffers and control registers can now be manipulated with the help of pointers or array accesses. The access variable is defined as volatile. The volatile definition implies that the compiler cannot make any assumptions about the data being stored in the registers, since the data can be changed not only by software, but also by the accelerator. Therefore, incorrect compiler optimizations can be avoided. It must be noted that the hardware wrapper routes the data into the correct memory bank depending on the address of the read or write statement. Therefore, the software driver need not address each local buffer by name but only the base address of the buffer corresponding to the memory variable. Polling is used for synchronization: the processor repeatedly checks whether data can be written to or read from the accelerator. If the accelerator memory is configured as a FIFO, then polling checks whether it is empty or full for synchronizing the read and write operations. If the accelerator memory is configured in RAM mode, then the number of data elements to be read or written is equivalent to the buffer size. The synchronization is done through start and finish signals in the control registers of the accelerator. Therefore, the synchronization code is added as a guard to the read/write statements. In the final step, the software driver module must be generated, which reads and/or writes that particular accelerator's control registers directly.
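For the RAM-mode case, the polling-based synchronization can be sketched as follows. This is our own illustration: the register addresses are borrowed from Example 4.3.1, and the names and flag values are hypothetical, not generated PARO identifiers:

#include <stdint.h>

#define CTRL_START  ((volatile uint32_t *)0x28000030u)
#define CTRL_FINISH ((volatile uint32_t *)0x28000032u)
#define OUT_BASE    ((volatile uint32_t *)0x28000020u)

/* RAM mode: start the accelerator, busy-wait on the finish flag,
   then read back a fixed number of elements (the buffer size). */
void run_and_read(uint32_t *result, int n)
{
    *CTRL_START = 1;             /* kick off the computation        */
    while (*CTRL_FINISH == 0)    /* busy-wait I/O (polling)         */
        ;                        /* volatile forces a fresh re-read */
    for (int i = 0; i < n; i++)
        result[i] = OUT_BASE[i];
}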

4.3.2. Accelerator Integration in SoC

In this section, we present a design method putting together accelerator synthesis and communication synthesis for a seamless integration of accelerators in an SoC.

In the previous sections, we discussed the modeling of communicating loops as building blocks of complex applications. After binding the loop graph to an architecture graph, we have a set of loops which are assigned for implementation onto different system components. These components can be processors executing software loops or dedicated hardware loop accelerators.

The task of communication synthesis is to obtain the implementation of the abstract channels interconnecting the system components. In the previous section, we transformed the mapped loop graph (i.e., the loop graph with allocation and scheduling information) in the polyhedral model into a data flow description called the windowed synchronous data flow (WSDF) model. Using this representation, we are able to synthesize hardware channels called multi-dimensional FIFOs for accelerator (HW) to accelerator (HW) communication. In addition, a hardware wrapper, a software driver, and a memory map are generated in order to interconnect processors (SW) and accelerators (HW).

In this section, we discuss the design flow which puts together communication synthesis and accelerator synthesis for the integration of accelerators in a given multiprocessor SoC platform. There has been a flurry of research in the field of automated SoC synthesis and accelerator IP integration. In [61], hardware/software interfaces are generated from data flow descriptions; however, it does not take into account


[Figure 4.15 depicts the design flow: a loop graph with RDGs is annotated by HW-SW partitioning with allocation, scheduling, and resource binding; accelerator synthesis (HW) and software synthesis (SW) then proceed alongside interface synthesis. Interface synthesis comprises the WSDF conversion of the communication subsystem edges, $\langle \mathcal{I}, RDG, Q, L \rangle \to \langle \vec{p}, \vec{v}, \vec{c}, O_{src}, O_{snk} \rangle$, as well as memory map generation and software driver generation (e.g., streamWrite(){...}). The final composition wires the processor with the application software and driver, the interface (glue logic, memory map, BRAM controller FSM), and the accelerator.]

Figure 4.15.: Design flow for accelerator integration in an SoC.

loop scheduling and allocation. Loop scheduling and allocation is addressed in [15]; however, it concentrates only on hardware wrapper and interface generation aspects, and the accelerator driver generation remains the task of the programmer. In the Chinook project [31], device drivers and hardware wrappers are generated from a control flow graph. However, the communication between the processor and the accelerator is fine-grained, as the CFG depicts dependencies between intra-loop operations and not between different loops. A library-based approach is used for channel implementation, i.e., protocol conversion and device drivers, in [36]. A memory mapping algorithm is presented in [125] which, using the scheduling information, directs interface synthesis such that a minimum number of registers and multiplexers is required in the hardware wrapper. In our synthesis approach, we solve the problem of the HW/SW communication subsystem for communicating loops in its entirety.

Figure 4.15 shows our design methodology for composing an SoC containing loop

accelerators. Starting with an initial application description of communicating loop programs, the initial step of HW-SW partitioning defines the mapping of the loop graph onto a given abstract SoC architecture description. The task of automated SoC synthesis is to generate a platform model with the given processors, accelerators, and communication subsystem from the abstract problem specification. After the initial mapping, the loop graph is annotated with binding, scheduling, and allocation information. Subsequently, the synthesis of the accelerators, the software for the processors, and the communication interfaces is undertaken. These tasks can be carried out independently of each other. The accelerator synthesis is done by applying the design flow trajectory shown in Figure 4.15. The software synthesis maps the sequential implementation of the loops onto the microprocessors; it is the responsibility of the user to provide the software descriptions. The platform components are instantiated from the architecture library. Interface synthesis establishes the communication infrastructure and is categorized into different cases. If the source and sink loops are both implemented as accelerators, then a dedicated communication primitive with FIFO semantics is generated. If either the source or the sink is implemented on a processor, then a memory map of the accelerator, a device driver for the processor, and an interface circuit for the accelerator need to be generated additionally. The software running on the processors must then be modified by adding the driver routines. During the final integration step, the generated RTL, the hardware/software interface, the software programs, and the platform instance are assembled together. Embedded compiler and linker tools for the software part and synthesis tools for realizing the hardware accelerators aid this composition. There are several tools which aid the creation of the overall system-on-chip, like the Xilinx Embedded Development Kit (EDK) [187], as used in our case.

4.4. Results

In this section, we study the overhead of the communication hardware and the interface components in terms of performance and area cost.

4.4.1. Overhead of Communication Primitives

We use the discussed examples as case studies to quantify the area and throughput overhead of the multi-dimensional FIFO. Table 4.2 summarizes the area requirement of the different components of the dedicated hardware pipeline for the applications. The accelerator area in Table 4.2 is equivalent to the area cost of the processing hardware. The communication area is the overhead of the multi-dimensional FIFO connecting the processing hardware. The clock frequency denotes the attainable clock speed of the communication architecture. All synthesis results are obtained using


Application   Accelerator       Comm. area        Clock   Overhead
              (LUT,FF,BRAM)     (LUT,FF,BRAM)     (MHz)   (%)
SDE           (2598,2068,15)    (372,204,0)       350     (11.2, 0)
ED-DCT        (1437,1493,3)     (280,157,5)       320     (13.4, 62.5)
hDCT-vDCT     (2044,2426,0)     (1717,2314,22)    321     (47.1, 100)

Table 4.2.: Area and clock overhead of communication subsystem for a dedicated accelerator pipeline. The overhead percentage gives the logic and memory overhead of the communication subsystem.

Xilinx ISE 9.2 on a Xilinx Virtex-2 FPGA (xc2v8000-4-ff1517). All applications are characterized by an input and an output bit-width of 16.

The SDE (stereo depth extraction) accelerator consists of dedicated hardware for each of the loop programs. It has 5 accelerators and 4 multi-dimensional FIFOs, which allow a throughput rate of one pixel per cycle. Since the in-order communication and sequential execution could also be handled by classical FIFOs, it is important to observe the overhead of the multi-dimensional FIFOs. The area overhead of the FIFOs is proportional to log2(B), where B is the depth of the memory channel. The memory size B = 2 is chosen to quantify the overhead. Each of the FIFOs has an area cost of (93 LUTs, 52 FFs, 0 BRAMs).

The ED-DCT is the accelerator for the communicating edge detection and the discrete cosine transform tiler (DCT-Tiler). The dedicated hardware implementation consists of two accelerators, where the DCT-Tiler and the edge detection process one pixel per cycle. The communication is of out-of-order nature. With a minimum memory size of the channel, B = 512 × 7 + 8, the communication subsystem consists of 5 BRAMs.

The last example, the communicating horizontal DCT and vertical DCT, needs a fully parallel communication subsystem to support a throughput of 8 pixels per cycle. Therefore, parallel access in combination with out-of-order communication leads to a high area cost of the communication subsystem to support the transpose operation. The transported multi-dimensional data array is partitioned onto 64 virtual memory channels, which can later be efficiently mapped onto 22 physical memories.

The multi-dimensional FIFOs are not a throughput bottleneck, as the critical path does not lie in the communication subsystem, and they can be used in high-performance applications. The overhead of communication can be quantified as a percentage of the total area cost. This is calculated with the help of a slice-packing formula (see Equation (5.3)). One may observe that, with an optimal memory size, the area overhead of the communication architecture depends on the nature of the communication, as also shown in Table 4.2. In case of the simple in-order communication for stereo depth extraction, the FIFO subsystem accounts for only 11% and 0% of the logic and memory


            SW        HW      HW/SW    Communication
Time (µs)   253905    656     131906   131155
Speedup     1x        387x    1.92x    —

Table 4.4.: The execution time of a 256×256 Sobel edge detection for the software, hardware, and hardware/software co-design implementation variants.

area overhead, respectively. The ED-DCT example needs to store B = 512 × 7 + 8 data tokens in the communication subsystem. Therefore, it accounts for 13% and 62% of the logic and memory area, respectively. Finally, the out-of-order communication of hDCT-vDCT is associated not only with complex FIFO logic, but also with a large number of memories for providing parallel data access. This is shown by 47% and 100% of logic area and memory, respectively. One can summarize that, depending on the type of communication, a large amount of logic and memory resources may be required for the communication subsystem within the whole accelerator subsystem. Therefore, one may conclude that communication synthesis is as important as the accelerator synthesis problem. We solve this often neglected problem of communication synthesis efficiently.

4.4.2. Accelerators as Components in SoC

A hardware wrapper is used to embed accelerators as components in an SoC over a bus. The hardware wrapper consists of a memory controller and glue logic for the conversion of signals. The fixed part is the BRAM interface controller for protocol conversion; it has an area requirement of 25 LUTs and 9 FFs for a 32-bit address and data width, and a maximum frequency of 176 MHz on a Xilinx Virtex2Pro FPGA. The area of the glue logic depends on the number of accelerator buffers and their size. The glue logic for a simple edge detection accelerator with a single input and output memory and a few status registers has an area requirement of 20 LUTs and 68 flip-flops. Therefore, for the edge detection example, the hardware wrapper makes up 4% (using the slice-packing formula in Equation (5.3)) of the area cost of the accelerator component.

In the previous chapter, the peak performance of the hardware accelerators was presented. Those numbers were obtained under the assumption that communication is not a bottleneck. Here, we show the performance gain of a hardware/software implementation over a software implementation in an SoC, where the software is responsible for the communication. We use the edge detection example for illustrating the observations. The loop program implements a Sobel edge detection for a 256×256 image. The system architecture consists of a software implementation on a 300 MHz


PowerPC CPU on a Virtex2Pro FPGA (XC2VP30) with 256 MB DDR-RAM. A 100 MHz clock is used for the processor local bus and the accelerator. The whole synthesis is done with the help of the Xilinx EDK and ISE tools [187]. The performance gain over the software implementation, as shown in Table 4.4, is only 2x. This is in stark contrast to the performance gain of 387x predicted when using the loop accelerator in a complete hardware solution. The reason is the communication bottleneck: the communication of the data is sequential and takes up to 52% of the total execution time. The hardware accelerator therefore only reduces the remaining 48% of the execution time through parallel execution; due to Amdahl's law, only a meager speedup of 2x is obtained, which cannot be improved even by using more parallelization. This suggests the use of DMA and other efficient data caching schemes for reducing the communication time. For matrix-matrix multiplication with matrices of size 256 × 256, the accelerator/processor solution (196,057 µs) obtains a speed-up of 26x over the software solution (5,117,087 µs). This problem has a higher computational intensity, i.e., the computation to communication ratio is higher for this loop benchmark.
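The Amdahl argument can be checked with a few lines of arithmetic. The following Python sketch uses only the values of Table 4.4; all variable names are illustrative.

t_sw = 253905e-6     # software-only execution time in seconds (Table 4.4)
t_comm = 131155e-6   # sequential communication time in seconds
hw_speedup = 387.0   # speedup of the pure compute part in hardware

f_seq = t_comm / t_sw   # non-accelerated (sequential) fraction, ~0.52
t_hwsw = f_seq * t_sw + (1.0 - f_seq) * t_sw / hw_speedup

print(f"sequential fraction: {f_seq:.2f}")
print(f"overall speedup: {t_sw / t_hwsw:.2f}x")  # ~1.9x, matching Table 4.4
# Even an infinitely fast accelerator is capped at 1/f_seq, i.e. ~1.9x.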

4.5. Conclusion

In this chapter, we have presented a novel bridge from the polyhedral model to the data flow model. This derivation not only allows a clean representation of communication semantics and an analysis capability, but also allows the automated synthesis of dedicated communication subsystems [109]. The communication primitive supports parallel access and out-of-order communication with a simple FIFO-like interface. The presented graph model for representing communicating loops in the polyhedral framework can be targeted for mapping onto system-on-chip platforms. In order to use loop accelerators as co-processors or components in a heterogeneous SoC, the automated generation of memory maps, hardware wrappers, and software drivers, based on the allocation and scheduling parameters of the loop accelerator, is undertaken. These components enable integration over a system bus and the access of a loop accelerator from the software running on the CPU. The results show that communication synthesis is an important problem, which is solved efficiently in this chapter so that the data transfer is not a bottleneck for the performance gain.

5. Design Space Exploration: Accelerator Tuning

In a real-world accelerator design problem, it is imperative to select a best-fit accelerator architecture in terms of multiple objectives, given several constraints and workload scenarios. A best-fit solution is normally a Pareto-optimal accelerator design in terms of the major objectives of area, power, and throughput, with enough performance to satisfy the system specifications on area/power cost and quality of service (QoS).

In the previous chapters, a design methodology that aims at the automatic generation of accelerator subsystems for computationally intensive algorithms was presented. A major problem faced by system architects using such a methodology is an explosion in the number of designs in terms of accelerator properties such as area, performance, and power. This happens because of the freedom in the selection of major compiler transformations such as tiling (different tile size, shape, and strategy) and architecture configurations (different number of functional units and registers in the architecture model). The parameters span a so-called design space leading to solutions of different quality. Therefore, the problem of finding a best-fit accelerator subsystem entails efficient design space exploration (DSE).

The three important traits of a good design space search are that it is as fast, as accurate, and as early as possible in the design methodology. Usually, the huge number of design points in the search space and their tedious evaluation forbid the exhaustive determination of a best-fit solution. Therefore, intelligent search based on heuristics, coupled with a succinct evaluation of the different system properties, is of utmost importance for fast exploration. Furthermore, fast and early evaluation must be balanced against the contrary need for accurate estimation of system properties in early design phases. The problem of finding a best-fit accelerator can be divided into the following problems:

• Efficient search of Pareto-optimal accelerator designs

• Selection of a best-fit design from the set of Pareto-optimal designs based on system specifications and workload behavior

The first problem involves accelerator modeling based on the architecture and compiler parameters, identification of objectives, and the choice of an evaluation function for the determination or approximation of the objectives. Subsequently, different

search techniques like random search, evolutionary algorithms, and others can be used for the expeditious location of feasible Pareto solutions. The second problem involves matching the characteristics of the accelerator engines to the system requirements on area, power, and performance. An efficient design space exploration for accelerator architectures requires solving both problems.

Therefore, this chapter is organized into two sections. Section 5.1 deals with the first problem, i.e., the exploration of Pareto-optimal accelerator designs. The model representation and objective evaluation, which are necessary prerequisites for design space exploration, are presented in Sections 5.1.1–5.1.3. Finally, the optimization engine for the identification of Pareto-optimal designs is treated in Section 5.1.4. The second major problem, the identification of best-fit accelerators from Pareto-optimal designs, given the system requirements, is discussed in Section 5.2.

5.1. Single Accelerator Exploration

One of the fundamental problems of computer engineering is architecture and application optimization with respect to multiple objectives. We discussed some of the related work on multi-objective design space exploration (DSE) in Section 2.4 and observed that multi-objective Evolutionary Algorithms (MOEAs) are often used for DSE at the system level, the software level, and in high-level synthesis. Since we deal with non-programmable accelerators for FPGAs, our DSE problem involves high-level synthesis. Novel to our exploration approach, compared to other related work, is the use of MOEAs with elitism and the consideration of the compiler transformation of loop tiling. Most other works on DSE in high-level synthesis concentrate only on the influence of architectural parameters like resource allocation [145, 114], consider at most loop unrolling as a compiler parameter [161, 118, 9], and often do not use modern heuristics for exploration. The design space exploration problem is illustrated with the help of the following example.

Example 5.1.1 The discrete cosine transform (DCT) is chosen to illustrate the design space exploration problem. The data flow graph of the DCT algorithm is shown in Figure 5.1(a). It contains 32 multiplications, 16 additions, and 16 subtractions. The allocation of the maximum number of resources (32,16,16) leads to the highest throughput, i.e., eight pixels are processed every II = 1 cycle, whereas the minimum resource allocation (1,1,1) leads to the minimum throughput, i.e., eight pixels are produced every II = 32 cycles. There are 32 × 16 × 16 = 8192 possible combinations of architectural parameters that lead to diverse accelerator designs. The effect of allocating a different number of adders and multipliers on the throughput and accelerator area is shown in Figure 5.1(b). The filled points show the Pareto-optimal accelerator designs. Therefore, the major challenge is to discover the Pareto front, or a set of solutions near to it, as quickly as possible using search heuristics.


Figure 5.1.: (a) Data flow graph of an 8-point DCT; all operations are constant additions, subtractions, and multiplications. (b) Area vs. throughput trade-off for different DCT accelerator configurations.

Design space exploration for identifying Pareto-optimal accelerator designs is a difficult problem. Firstly, the number of possible design points proliferates very fast. For the simple example of the DCT, one can have 8192 different accelerator designs considering only architecture parameters, i.e., the resource allocation of different functional units. Similarly, for the matrix multiplication (for 64 × 64 matrices), depending only on a compiler parameter like loop tiling, one can have 64 × 64 × 64 = 2^18 design points. Therefore, there exists an absurdly large number of possible accelerator designs when varying both architecture and compiler parameters. This forbids an exhaustive search for finding the best solutions in terms of the objectives of performance, area, and power. Normally, the exact evaluation of each design point requires scheduling and synthesis. Static scheduling based on integer linear programming (ILP) has an execution time in the order of seconds. It gives the statistics on performance in terms of throughput and latency. The subsequently generated RTL description of the accelerator needs to be simulated and evaluated by ASIC or FPGA synthesis tools to obtain the area and power statistics. This is a time-consuming process taking up to several minutes. Hence, RTL synthesis is the bottleneck in the fast evaluation of designs.

For instance, for the 2^18 ≈ 0.26 million design points, with an evaluation time of 30 minutes per point, an exhaustive search would require almost 15 years. This motivates the need for an efficient design space exploration. In order to speed up the exploration process, a two-pronged strategy is needed, which uses (a) intelligent search algorithms based on modern heuristics and (b) fast evaluation functions based on estimation.
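The estimate is easily verified with a short calculation, sketched here in Python:

points = 2 ** 18                 # ~0.26 million design points
total_minutes = points * 30      # 30 minutes of RTL synthesis per point
years = total_minutes / 60 / 24 / 365
print(f"{points} points -> {years:.1f} years")   # ~15.0 years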


5.1.1. Model Representation and Problem Definition

The architecture and compiler parameters are the key knobs which lead to an explosion of the design space of customized accelerators. Let SA denote the set of possible architecture parameters and SC the set of possible compiler parameters in the following. Then the accelerator design space SD, consisting of all feasible accelerator design points, can be defined by

SD = SA × SC = {(sa, sc) | sa ∈ SA ∧ sc ∈ SC}

In general, the architecture of a programmable or a non-programmable accelerator is described by several parameters. The major difference is that for non-programmable accelerators, some architecture parameters depend on the compiler transformations. For example, the number of PEs, the number of memory banks, and the size of the local buffers are determined by the tiling matrix and strategy. The resource allocation of functional units (FUs), like the number of adders, multipliers, and others, comprises architectural parameters with a major influence on the accelerator area and power cost, as already shown in Chapter 3. However, fewer FUs do not necessarily imply less area, since a larger number of registers as well as multiplexers to select their correct input would be required; the allocation of more FUs leads to a reduced number of registers and multiplexers, and less control overhead. The number of available registers can also be allocated; thus, it is also an architecture parameter. For programmable architectures, the architecture parameters are independent of the compiler transformations as they are fixed by the system designer; in this case, the number of integer units, float units, branch units, memory units, the register file size, and the cache (size, associativity, ...) are the usual architecture parameters. In the following sections, we will concentrate on non-programmable accelerators, although the methodology can easily be extended to programmable accelerators. Our accelerator architecture parameter space SA consists of

• The resource allocation of each distinct functional unit (FU) within a PE of the accelerator

• The number of allocated registers within a PE

The set of possible architecture parameters can be represented as follows:

SA = {X ∈ Z^n | X = (x1, x2, ..., xn), L1 ≤ x1 ≤ U1, ..., Ln ≤ xn ≤ Un}

Any allocation can be represented by an n-dimensional vector X = (x1, x2, ..., xn), where the i-th component of the vector represents the allocated number of units of the i-th resource. This resource can be a functional unit or a register. For the purpose of design space exploration, an upper bound Ui and a lower bound Li, representing

the maximum and minimum possible number of the i-th resource allocated, are given in order to limit the search space. For example, (2,8,4,2) could be a possible resource allocation of adders, multipliers, shifters, and registers, respectively.

The compiler parameter space can consist of standard optimizations and loop transformations. Standard compiler optimizations like dead-code elimination, common sub-expression elimination, and others are integrated in our compiler for the generation of non-programmable accelerators; these pre-processing transformations are always applied by default. Therefore, our compiler transformation space SC consists only of

• loop tiling parameters (tile size, tiling strategy)

Loop tiling has a major influence on the design objectives. It not only determines the degree of parallelism of the accelerator through the number of processor elements (PEs), but also the granularity of communication and the local buffer requirements, as elaborated in Chapter 3. For the sake of simplicity of representation, we consider only rectangular tiles in the following. In this case, the tiling matrix is a diagonal matrix, i.e., P = diag(t1, t2, ..., tn). This can also be represented by an n-dimensional vector T = (t1, t2, ..., tn). The application of tiling partitions the iteration space into tiles. The tiling strategy, i.e., LPGS or LSGP, allocates PEs onto inner tiles or outer tiles, respectively¹. Therefore, we define the compiler parameter space as

SC = {T ∈ Z^n | T = (t1, t2, ..., tn)^t, L1 ≤ t1 ≤ U1, ..., Ln ≤ tn ≤ Un}

The compiler space defines the set of possible tiling matrices for loop tiling. The upper and lower bounds for the tile sizes can be specified to limit the design space. For a given loop program, the configuration space of its corresponding loop accelerator design is then given by

SD = SA × SC

The configuration parameters of the design space need to be encoded as a so-called genotype. In the terminology of Evolutionary Algorithms, the genotype is the genetic representation of an individual. An example of the representation of the configuration space of our DSE problem parameters as a genotype is shown in Figure 5.2. The first part of the genotype encodes the tiling matrix; the second part encodes the allocation of functional units. After modelling the design parameters as a genotype, we discuss the objectives of the accelerator design space exploration in the next section.

¹In case of n-hierarchical tiling, n tiling matrices are required. In our study, we limit n = 1, i.e., LSGP and LPGS are the only strategies considered during design space exploration; that is, simple tiling is considered.



Figure 5.2.: Integer genotype, encoding the architecture constraints and compiler parameters for accelerator synthesis.
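To make the encoding concrete, the following Python sketch shows one possible genotype representation following Figure 5.2. The gene names and bounds are illustrative assumptions, not the exact PARO encoding.

from dataclasses import dataclass
import random

@dataclass
class Gene:
    name: str   # e.g. "tile_x" (compiler) or "ADD" (architecture)
    low: int    # lower bound L_i
    high: int   # upper bound U_i

# Compiler parameters (tile sizes) first, then FU allocations, as in Figure 5.2.
GENES = [Gene("tile_x", 1, 64), Gene("tile_y", 1, 64),
         Gene("ADD", 1, 16), Gene("SUB", 1, 16), Gene("MUL", 1, 32)]

def random_genotype(genes=GENES):
    # Sample one point of the design space S_D = S_A x S_C.
    return [random.randint(g.low, g.high) for g in genes]

print(random_genotype())   # e.g. [8, 4, 2, 1, 8]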

5.1.2. Multiple Objectives

After defining the search space of the design space exploration problem, we need to define objective functions for comparing the designs. For design space exploration, the following objective functions will be optimized simultaneously for determining the set of Pareto-optimal accelerator designs:

• Performance: The first objective is to minimize the maximum execution latency L, or alternatively to minimize the iteration interval II. The throughput and the latency are given in terms of the iteration interval II (cycles) and the number of execution cycles L for computing the complete problem instance.

• Area: The area cost A of the hardware accelerator is determined by the resource usage incurred by the set of allocated components: PEs, controller, and interface logic of the accelerator. Since all the benchmarks have been studied for FPGA technology, the area will be estimated by the number of required slices, as they are the basic building blocks of an FPGA.

• Power: Average static and dynamic power dissipation plays a major role in determining the battery life of accelerator-based embedded SoCs. Therefore, minimizing the power P is another important objective.

Definition 5.1.1 A multi-objective optimization problem can be defined formally as

minimize F(X) = (f1(X), f2(X), ..., fn(X))
subject to X = (x1, x2, ..., xk) ∈ F

with n objective functions defined by a vector of objective functions F(X) and k optimization variables defined by the vector X. F defines the feasible search space for the optimization variables.

Therefore, the formal description of the accelerator design space exploration problem is the following multi-objective optimization formulation:

minimize F(X) = (L(X), A(X), P(X))    (5.1)


where the vector X ∈ SD contains the architecture and compiler parameters for a particular accelerator design. The optimization variables, i.e., the compiler and architecture parameters, are represented as a genotype (see Figure 5.2). The system designer would like to find the set of Pareto-optimal designs in terms of performance L, area A, and power P. In the next section, we discuss the efficient evaluation of these objectives.

5.1.3. Objective Functions

After naming the different objectives in the previous section, this section describes the objective functions for determining the area, power, and performance of the loop accelerator. The design space exploration must use the evaluation information on area and power to search for Pareto-optimal solutions. One may follow three different approaches for the determination of the objective functions for area, power, and performance. The first approach analyzes only the dependence graph of the loop program and considers the compiler and architecture constraints in order to estimate area and power [127]. The second approach performs high-level synthesis to determine the RTL components of the accelerator; the area and power macro-models of the RTL components are summed together to obtain the objective function, similar to [29]. The third approach even performs place and route of the RTL components using state-of-the-art synthesis tools so as to evaluate area and power more accurately. Such synthesis tools not only convert the accelerator RTL description into a netlist, but also give precise numbers on area, power, and clock frequency. The drawback is that synthesis may take a considerably long time. We follow the second approach in the following, as it offers a trade-off between the accuracy of the estimates and the run-time of the exploration.

Therefore, given the RDG of the loop program and the representation of the architecture and compiler parameters as a genotype, only the high-level synthesis problem must be solved for each explored design point. The steps involved are scheduling and RTL synthesis, which are elaborated in Chapter 2. The scheduling problem is solved by a mixed integer linear program (MILP) formulation of the given architecture constraints, with the minimization of the maximum latency as objective function. Thereby, we obtain the performance characterized by the latency L or the iteration interval II. As also shown in Figure 5.1(b), numerous candidate solutions may lead to the same performance, but they may have different area and power. This internal structure of area and power is not available in the formulation of the scheduling problem, since information on multiplexers and registers is only available after synthesis. Therefore, it is necessary to estimate the area cost A and the power consumption P from an accelerator RTL description. Our estimation approach for area and power is presented in the following subsection.


5.1.3.1. Rapid Estimation Models

In order to speed up the automated exploration, estimation models are developed for the determination of area and power.

Area The area cost of an accelerator implementation on Virtex FPGAs is determined in terms of 4-input look-up tables (LUTs), slice flip-flops (D-type flip-flops), 18-bit DSP multipliers (MUL), and 18 Kbit block RAMs (BRAMs). These components are the basic elements of an FPGA for logic implementation, storage, and efficient multiplication. For characterizing the accelerator area cost, an accurate prediction of the total number of FPGA resources is required. Table 5.2 shows the resource consumption of different RTL components for FPGA technology. The data path resource usage depends on the type and number of functional units and registers, and on their bit-width w. Functional units like dividers are instantiated as IP cores from libraries; hence, they are characterized by a number in the database for the particular bit-width. The controller logic is implemented with the help of counters, comparators, and logical gates like NAND and NOR; furthermore, modulo-counters and one-hot LUTs are also part of the global controller. The resource sharing of functional units and registers leads to multiplexers. The area cost of multiplexers depends not only on the bit-width w, but also on the number of inputs n. The storage units are implemented in the form of delay shift registers, which can be realized with BRAMs or distributed logic (i.e., LUTs and FFs); their area cost depends on the length of the delay d and on the bit-width w. We refer to [139] for the derivation of proper estimation formulas. The final area cost A is then determined by summing up the area cost of each RTL component constituting the accelerator architecture using the following formulas:

ALUT = Σ_{i ∈ RTL} A^i_LUT,   AFF = Σ_{i ∈ RTL} A^i_FF    (5.2)

where the sums over all allocated RTL components, denoted by the set RTL, give the area in terms of LUTs (ALUT) and flip-flops (AFF). Although the macro-models, based on parameters and pre-characterized results for each component, are accurate, it must be noted that there is an over-estimation error due to hardware-dependent and hardware-independent optimizations undertaken by the synthesis tools during place and route. However, for the sake of fast design space exploration, this error can be ignored, since the relative accuracy is more important for comparing different designs. The basic blocks of FPGAs are slices, which are made of LUTs and flip-flops (FFs). Slice packing refers to the determination of the slice usage from the resource usage given in terms of LUTs and flip-flops. We use the following formula, proposed for slice packing in [157], for characterizing the area:

A = ASlices = α0 · ALUT + α1 · AFF + α2 · ALUT · AFF    (5.3)


RTL Component           Area (LUTs)                Area (FFs)     Area (MUL)   Area (BRAM)
Comparator              ⌈w/4⌉                      0              0            0
Counter                 w                          w              0            0
Delay Registers (BRAM)  0                          0              0            ⌈d·w/16384⌉
Delay Registers (LUT)   ⌈d/16⌉·w                   ⌈d/16⌉·w       0            0
Logical Gate            ⌈(w−1)/3⌉                  0              0            0
Modulo-Counter          w                          w              0            0
Multiplexer (n inputs)  (n−1)·⌈w/2⌉                0              0            0
One-Hot LUT             ⌈w/4⌉                      0              0            0
Shift Registers         0                          d·w            0            0
Registers               0                          w              0            0
FU (Adder)              w                          0              0            0
FU (Multiplier)         0                          0              1            0
FU (Mult, 16)           144                        172            0            0
FU (Subtractor)         w                          0              0            0
FU (Divider, 8)         231                        571            0            0

Table 5.2.: Resource requirement of different RTL accelerator components for Virtex FPGA technology. w denotes the bit-width of the components.

where ALUT and AFF are the estimated numbers of LUTs and FFs in the design as obtained from Equation (5.2). The factors α0 = 0.45, α1 = 0.35, and α2 = 2.07 · 10^−7 are set for determining the area in terms of slices for the Virtex FPGAs [157]. Therefore, we estimate the area cost by adding the pre-characterized values of each component of the accelerator RTL.
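As an illustration, the slice-packing formula of Equation (5.3) with the quoted α factors can be evaluated as follows; the example LUT/FF counts are hypothetical.

ALPHA0, ALPHA1, ALPHA2 = 0.45, 0.35, 2.07e-7   # Virtex factors from [157]

def estimated_slices(a_lut, a_ff):
    # Slice-packing estimate after Equation (5.3).
    return ALPHA0 * a_lut + ALPHA1 * a_ff + ALPHA2 * a_lut * a_ff

# Example: an accelerator estimated at 4000 LUTs and 3000 flip-flops.
print(round(estimated_slices(4000, 3000)))   # ~2852 slices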

Power The power consumption of an accelerator consists of static and dynamic power. The static power is fixed for a given underlying FPGA architecture and may be assumed independent of the resource usage. The dynamic power consists of logic power, signal/clock power, and I/O power; the logic power along with the signal/clock power depends on the resource usage. The power consumption depends on several factors like the switching activity, clock frequency, supply voltage, resource usage, routing density, and toggle rates. There are different tools like XPower Analyzer and web-based power tools for estimating the power consumption [188]. We use pre-characterization-based macro-modelling to determine the power consumption per

access of the LUTs and flip-flops. Our power model for FPGA architectures assumes a voltage source of 1.5 V, a nominal clock frequency of 100 MHz, and a toggle rate of 20%. The toggle rate of 20% is used for a worst-case estimate in Xilinx power estimation tools [188]. The toggle rate describes how often the output changes with respect to the input clock. The switching activity of an FU is assumed proportional to its utilization within the iteration interval II. With this information, one can determine that the power consumption of each LUT for a switching activity of 1% amounts to PLUT = 6 µW. The switching activity si is determined by the scheduling and is proportional to the corresponding resource utilization of each RTL component. Similarly, for each flip-flop, the power consumption at a switching activity of 1% is estimated as PFF = 5 µW. Subsequently, given the switching activity and the resource usage, one can determine the power consumption P as follows:

P = Σ_{i ∈ RTL} (si · A^i_LUT · PLUT + si · A^i_FF · PFF)

where si is the switching activity of the i-th RTL component in the set of allocated components RTL, and A^i_LUT and A^i_FF give the area of the RTL component in terms of look-up tables and flip-flops. The final power consumption is then determined by the summation of the individual contributions of each RTL component.

Apart from the efficient evaluation of objectives, intelligent heuristics are needed for exploring the accelerator design space. The next section presents our optimization engine for discovering Pareto-optimal solutions.
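The power macro-model can be sketched in the same way; the component list below is hypothetical, and only the per-activity constants PLUT = 6 µW and PFF = 5 µW are taken from the text.

P_LUT_UW = 6.0   # uW per LUT at 1% switching activity
P_FF_UW = 5.0    # uW per flip-flop at 1% switching activity

def estimated_power_uw(components):
    # components: list of (s_i in percent, A_LUT, A_FF) per allocated RTL block
    return sum(s * (a_lut * P_LUT_UW + a_ff * P_FF_UW)
               for s, a_lut, a_ff in components)

rtl = [(20.0, 1200, 900),  # hypothetical datapath at 20% activity
       (5.0, 150, 80)]     # hypothetical controller at 5% activity
print(f"{estimated_power_uw(rtl) / 1000:.1f} mW")   # ~240.5 mW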

5.1.4. Optimization Engine

There is no dearth of techniques to explore the search space for Pareto-optimal solutions. In this section, we present modern heuristics based on Evolutionary Algorithms used in our design space exploration problem. An overview of the resulting design space exploration framework is shown in Figure 5.3. It shows that different heuristics can be used to vary the genotype (architecture and compiler parameters). The loop accelerators are then generated and evaluated for the corresponding parameters. The Pareto-optimal accelerator designs (i.e., the set of non-dominated designs obtained during the exploration) are then stored in an archive.

5.1.4.1. Baseline: Random or Exhaustive Search

Exact brute-force exploration involves evaluating each candidate of the search space for determining the set of Pareto-optimal solutions. This exhaustive approach requires the systematic generation of all points in the search space SD. As argued earlier, in most cases an exhaustive search is prohibitive due to time constraints.



Figure 5.3.: Design space exploration framework.

We propose the use of random search techniques that, as the name says, select candidates arbitrarily and store only the non-dominated solutions. The random search solution can be used to compare the effectiveness of other search methods like those based on Evolutionary Algorithms.

5.1.4.2. Evolutionary Algorithms

In contrast to classical search algorithms like hill-climbing, simulated annealing, and others, Evolutionary Algorithms not only start with multiple candidates (a population) in the search space, but also instill competition among the candidates based on the principle of survival of the fittest, as in a natural selection process. The template of an Evolutionary Algorithm is shown in Algorithm 5.1. It consists of different steps like decoding, selection, and variation. The whole multi-objective exploration flow is also summarized in Figure 5.4(a). Selection refers to preserving good solutions for ensuring convergence to optimal solutions. Variation refers to creating a new population from an existing population. Decoding refers to the problem of evaluating the fitness of the individuals, which in our case corresponds to solving a high-level synthesis problem, i.e., utilizing the PARO compiler.

An initial population P0(X) = {X1, X2, ..., XP} of size P is generated randomly or using a deterministic rule for generation 0. For our exploration problem, each individual of the population is encoded as a genotype as described in Section 5.1.1. Subsequently, the objective function for decoding the fitness of each individual in the population is calculated. For our accelerator exploration problem, the fitness function is given by F(X) as in Equation (5.1), which gives the performance, area cost,


Algorithm 5.1 Evolutionary Algorithm
Require: Initial population P0(X)
Ensure: Pareto-optimal non-dominated set of solutions
  gen ← 0; evalFitness(P(X))
  while not termination condition do
    for all individuals X ∈ P(X) in the population do
      gen ← gen + 1
      Pgen(X) ← selector(Pgen−1(X))
      Pgen(X) ← variator(Pgen(X))
      decode(Pgen(X))
    end for
  end while

and power consumption. Given the resource allocation and compiler parameters for the loop program, the steps of binding, scheduling, and accelerator synthesis need to be undertaken for determining the fitness. The compiler-in-the-loop for solving the high-level synthesis problem uses the PARO design methodology. The determination and estimation of the objective function based on the PARO design methodology is discussed in Section 5.1.3.

Once the fitness function is calculated, multiple best-fit individuals are selected for the formation of the new population at generation gen + 1. The best offspring then replace the individuals with the worst fitness values in the population. An important feature of multi-objective Evolutionary Algorithms (MOEAs) is elitism, i.e., to form the next generation, MOEAs choose the N best individuals from a pool of the current and offspring populations. It is part of the selection process as shown in Figure 5.4(a). For the MOEAs chosen here, the individuals are associated with a non-domination rank and a crowding distance for characterizing non-dominance and diversity. The N best individuals are selected by a lexicographic fitness scheme based on binary tournament selection, where non-dominated solutions with better rank and larger crowding distance are preferred [134]. Tournament selection refers to the process of randomly selecting two individuals from the current and offspring populations; the individual with the higher fitness goes into the mating pool. The mating pool consisting of parents is further bred through variators (i.e., genetic operations like crossover and mutation), which give birth to the new offspring.

The mutation operator, as the name suggests, modifies a genotype with a low rate defined by the probability Pmut. For our problem, this may change the resource allocation of a particular functional unit or change the tiling matrix used during loop tiling. The crossover operator is based on the reproduction technique, where two parent genotypes are split and swapped to produce two new offspring; this is also known as single-point crossover. The crossover is applied with a higher probability given by Pcross.
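A minimal Python sketch of Algorithm 5.1 is given below. The fitness function is a placeholder: in the actual flow, decoding an individual means invoking the PARO compiler-in-the-loop (MILP scheduling plus the area/power macro-models of Section 5.1.3), and selection uses the multi-objective rank/crowding scheme rather than the scalarized tournament shown here.

import random

def evaluate(genotype):
    # Placeholder fitness; stands in for scheduling + cost estimation.
    return (sum(genotype), max(genotype), min(genotype))

def crossover(a, b):
    # Single-point crossover as in Figure 5.4(b).
    cut = random.randrange(1, len(a))
    return a[:cut] + b[cut:], b[:cut] + a[cut:]

def mutate(geno, bounds, p_mut=0.1):
    # Re-sample each gene with probability p_mut (Figure 5.4(c)).
    return [random.randint(*bounds[i]) if random.random() < p_mut else g
            for i, g in enumerate(geno)]

def evolve(bounds, pop_size=30, generations=20, p_cross=0.8):
    pop = [[random.randint(lo, hi) for lo, hi in bounds]
           for _ in range(pop_size)]
    for _ in range(generations):
        def tournament():
            # Binary tournament on the scalarized placeholder fitness.
            a, b = random.sample(pop, 2)
            return a if evaluate(a) < evaluate(b) else b
        offspring = []
        while len(offspring) < pop_size:
            p1, p2 = tournament(), tournament()
            c1, c2 = crossover(p1, p2) if random.random() < p_cross else (p1, p2)
            offspring += [mutate(c1, bounds), mutate(c2, bounds)]
        pop = offspring[:pop_size]
    return pop

print(evolve([(1, 64), (1, 64), (1, 16), (1, 16), (1, 32)])[0])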



Figure 5.4.: (a) Coupled multi-objective exploration for high-level synthesis, (b) single-point crossover, (c) mutation.

This operation gives two new individuals with different resource allocations and compiler parameters. The crossover and mutation operators are shown in Figures 5.4(b) and (c), respectively. The heuristic is repeated until a termination condition is met; the termination condition is often given by a maximum number of generations genmax defined by the user. In order to evaluate our approach for design space exploration, we use the following set of loop kernels as benchmarks:

• DCT: The loop kernel of the discrete cosine transform contains 64 operations, 32 of them multiplications, the rest being additions and subtractions. Depending on the chosen resource allocation of adders and multipliers, different accelerator implementations are obtained.

• Complex-MMM: The complex matrix-matrix multiplication of two 64 × 64 matrices contains 4 multiplications and 3 additions and subtractions in the loop kernel. Depending on the chosen tiling matrix, the accelerator can have up to 16 × 16 PEs in the processor array implementation, due to the chosen bounds on the tiling matrices, with up to 4 multipliers and 3 adders/subtractors in each PE.

The above benchmarks are compute-intensive loop kernels written in the PAULA language. Pareto-optimal accelerator implementations of these loop nests obtained during our design space exploration are shown in Figure 5.5. The exploration is successful in identifying most of the Pareto-optimal designs within 2 hours of run-time. We also analyze the different search heuristics based on these benchmarks. The design space exploration (DSE) has been implemented in Java with the help of the Opt4J framework² using multi-objective Evolutionary Algorithms like NSGA, SPEA, and others.

²opt4j.sourceforge.net



Figure 5.5.: Non-dominated loop accelerator designs in terms of area, performance, and power for the complex matrix-matrix multiplication and DCT benchmarks.

This framework includes modern heuristics based on Evolutionary Algorithms for multi-objective optimization. The design space exploration is evaluated with the help of the following search heuristics:

• RAND: random search algorithm
• EA: standard Evolutionary Algorithm
• NSGA: Non-dominated Sorting Genetic Algorithm [37]
• SPEA: Strength-Pareto Evolutionary Algorithm [191]

The random search algorithm samples the design space arbitrarily and stores/updates the set of non-dominated solutions in a database called an archive. As discussed earlier, it serves as a baseline for comparing the evolutionary heuristics. For the sake of comparison, the random search is divided into batches comprising a generation, such that the total number of evaluations is the same as for the Evolutionary Algorithms. The simple Evolutionary Algorithm (EA) is configured as given in Table 5.4 for each run; it uses the genetic operators crossover and mutation, but makes no use of elitism for selection.

We also study two elitist MOEAs: the Strength Pareto Evolutionary Algorithm (SPEA2) [191] and the Non-dominated Sorting Genetic Algorithm (NSGA2) [37], which make use of elitism (i.e., a non-dominated individual is selected for crossover and mutation with a strictly positive probability). For a detailed discussion of the elitism mechanism of both algorithms, we refer to [192]. The number of individuals which compete against each other in a tournament to become a parent must be set for the selectors in both MOEAs. For the elitist MOEAs, the number of tournaments is set to 4


for our experiments. All the exploration runs of the elitist Evolutionary Algorithms are also carried out with the parameters shown in Table 5.4.

Parameter                       Value
Population size, N              30
Number of generations, gen      20
Crossover probability, Pcross   0.8
Mutation probability, Pmut      0.1

Table 5.4.: Experimental setup for each run of the multi-objective exploration.

The simulation runs for the different loop algorithms lead to a set of non-dominated designs. Ten different runs are carried out for each algorithm and filtered to establish the corresponding reference Pareto-optimal solutions; the reference Pareto-optimal set is obtained by filtering all the runs from the different search heuristics for the non-dominated points.

Before comparing the search heuristics and analyzing the effect of the different Evolutionary Algorithm parameters, metrics for characterizing the accuracy of the solutions need to be defined. Before the assessment, the objective function values should be mapped to the same interval. This is done by finding upper and lower bounds of each element of the objective function; subsequently, each objective (area, power, and latency) is normalized to the range [1,2]. The additive ε-indicator Iε+ is a binary quality indicator that can be used to compare the quality of two Pareto set approximations relative to each other [116].

Definition 5.1.2 (Additive ε-indicator Iε+): Given two normalized non-dominated sets A and B, the maximum value d which needs to be subtracted as an offset from the Pareto-optimal points in set A, such that the offset set dominates all points of set B, is called the additive ε-indicator Iε+. Intuitively, if the measure is negative, then set A already dominates set B. The proximity of the solutions generated by a run of a multi-objective EA to the reference Pareto set of solutions is proportional to the value of the additive ε-indicator Iε+: the smaller the value, the closer the Pareto-optimal solutions are to the reference set. Formally, it is defined as

Iε+(A, B) = max_{y ∈ B} min_{x ∈ A} max_{1 ≤ i ≤ dim} {fi(x) − fi(y)}

where x and y are the non-dominated designs in set A and B, respectively. We aim to compare the convergence of the different search algorithms with the help of this metric.
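The indicator is straightforward to compute for small sets; a Python sketch for normalized objective vectors (minimization) follows, with hypothetical example sets.

def eps_indicator(a_set, b_set):
    # I_eps+(A, B): smallest offset d so that A, shifted by d, dominates B.
    return max(min(max(fx - fy for fx, fy in zip(x, y)) for x in a_set)
               for y in b_set)

a = [(1.0, 1.2), (1.2, 1.0)]   # normalized objective vectors in [1,2]
b = [(1.3, 1.4), (1.5, 1.1)]
print(eps_indicator(a, b))   # -0.1: a already dominates b
print(eps_indicator(b, a))   #  0.3: b must improve by 0.3 to dominate a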



Figure 5.6.: Box plots showing the convergence of different search heuristics with an increasing number of generations for the matrix-matrix multiplication algorithm.

In order to compare the different heuristics based on the simulation runs, box plots based on the additive ε-indicator are used. A box plot is a convenient way of depicting a data set in terms of sample minimum, lower quartile, median, upper quartile, and sample maximum; the lower and upper boundaries of the box denote the lower and upper quartiles. The comparison is done by calculating the additive ε-indicator value Iε+ of all the runs of an algorithm compared to the reference set. The reference set is obtained by filtering all the non-dominated solutions from the different runs; it must be noted that the size of the design space prohibits the exhaustive determination of the exact Pareto sets. The box plots of the different search heuristics looking for the Pareto-optimal loop accelerators for the matrix multiplication algorithm are shown in Figure 5.6. The median value of the indicator over all 10 runs is denoted by the dark line in the figure. From the experimental runs, we can conclude that

• The heuristics based on Evolutionary Algorithms outperform random search (RAND) for realistic loop accelerator benchmarks in terms of convergence to the reference Pareto front. The SPEA2 algorithm outperforms the NSGA variant within 20 generations: the median value


of the convergence measure Iε+ is closer to the optimal value of 0 for SPEA (Iε+(NSGA) = 0.106 and Iε+(SPEA) = 0.098). However, the use of elitism brings no advantage in comparison to the simple evolutionary heuristic: surprisingly, the simple Evolutionary Algorithm (EA) had superior convergence over the elitist Evolutionary Algorithms (Iε+(EA) = 0.077). The random search has inferior convergence compared to the evolution-based heuristics (Iε+(RAND) = 0.141).

• The convergence rate slows down for all the heuristics after 10 generations. An increase in population size does not help in getting closer to the exact reference Pareto-set.

• A single run of SPEA2, NSGA, EA, and RAND takes on average 6631 seconds, 5567 seconds, 5521 seconds, and 1267 seconds, respectively. Therefore, random search is faster than the evolution-based search heuristics, at the cost of inferior convergence.

Therefore, Evolutionary Algorithms are a viable search heuristic for design space exploration. Future work would be to make use of statistical significance tests for confirming the above observations. Furthermore, one could analyze other heuristics like simulated annealing, particle swarm optimization (PSO), and others. In the next section, we study the problem of selecting the best-fit accelerator engine from a set of non-dominated designs, given a workload scenario.

5.2. Performance Analysis of Accelerators in an SoC System

SoC architectures are realized either as homogeneous tiled core architectures or as heterogeneous architectures containing application-specific acceleration engines. One of the major challenges is the efficient performance analysis and exploration of the vast design space arising from the numerous mapping possibilities. In the previous section, we studied efficient heuristics for identifying Pareto-optimal accelerator engines; however, we did not consider the performance requirements. Fig. 5.7 shows a typical template of a heterogeneous SoC containing an inverse discrete cosine transform (IDCT) accelerator IP core. The trade-off between area and throughput for different Pareto-optimal designs of the IDCT accelerator is also illustrated in Fig. 5.7. The proper matching of the area and performance requirements of the accelerator to the overall system behavior is important, in particular with respect to the communication behaviour. Hence, the fastest implementation of the IDCT might be an overkill if the worst-case rate of input events requires a much lower throughput. Therefore, a major challenge is to couple such accelerator design tools with performance analysis tools, to iteratively identify the rate-matched, i.e., best-fit acceleration engine, which leads to significant savings in area and power cost.



Figure 5.7.: Example of an SoC architecture including several potential acceleration engines, and design space exploration of an IDCT accelerator engine.

Existing popular approaches for performance analysis are based on "Excel sheet" analysis or simulation (e.g., with SystemC) [17]. A simulation-based performance analysis of an SoC is presented in [147]. In [20], a framework for the performance analysis of component-based software-only systems is presented. In this section, we use modular performance analysis based on real-time calculus for providing worst-case guarantees, because it is orders of magnitude faster than simulation approaches and more accurate than "Excel sheet" analysis, as it accounts for traffic bursts and resource sharing [172].

In particular, we address the following problem: As we are concerned with a multi-objective optimization problem (cost, power, and throughput), there is not only one optimal solution, but typically a set of optimal solutions, the so-called Pareto-optimal solutions. Let a set of Pareto-optimal hardware accelerator designs and different workload scenarios for these hardware accelerators be given. Then the question arises: What is the best-fit hardware accelerator that can handle this workload? In our approach, different load scenarios from simulation are used to obtain worst-case traffic numbers. The worst-case input scenario, along with the service models of the accelerators, is used for modular performance analysis. To summarize, the contributions of this section are:

• Selection of an optimal hardware accelerator engine in terms of area and throughput with worst-case guarantees using modular performance analysis.

• Fast and accurate characterization of the hardware accelerator performance with service curves using polyhedral theory.

• Presentation of a motion JPEG (M-JPEG) case study application for illustrating the benefits of the proposed methodology.



Figure 5.8.: (a) Node with arrival and service curves. (b) Backlog and delay, represented as the vertical and horizontal difference between the upper arrival curve and the lower service curve.

5.2.1. Modular Performance Analysis (MPA)

Modular performance analysis (MPA) denotes a framework based on real-time calculus for the investigation of the performance of real-time embedded systems. For a component in an embedded system with a given input event stimulus, the processing rate on this component will always be the minimum of the service capacity of the component and the rate of the input stimulus. This is the paradigm of real-time calculus [172].

In Figure 5.8(a), the building blocks of performance analysis using the real-time calculus are shown. The inputs to the node are the arrival curves of the events of the data streams and the service curve of the node. The output arrival curve and the output service curve describe the processed stream of events and the remaining service capacity of the processing node, respectively. These building blocks can be hooked together for the analysis of a communicating architecture network. Mathematically, the input stimulus is modelled using αu(∆) and αl(∆), which denote the maximum and minimum number of input events in a time interval ∆. The service capacity of a component is modelled using βu(∆) and βl(∆), denoting the maximum and minimum available service rate for the input events in a time interval ∆.

Outgoing arrival/service curves as well as the delay and the buffer size are determined from the incoming arrival and service curves according to equations defined by real-time calculus [172]. The upper and lower output arrival curves αu′(∆), αl′(∆)

and output service curves βu′(∆), βl′(∆) are given by the following equations [172].

αu′(∆) = min{ inf_{0≤u≤∆} [ sup_{v≥0} { αu(u+v) − βl(v) } + βu(∆−u) ], βu(∆) }
αl′(∆) = inf_{0≤u≤∆} { αl(u) + βl(∆−u) }
βu′(∆) = sup_{0≤u≤∆} { βu(u) − αl(u) }
βl′(∆) = sup_{0≤u≤∆} { βl(u) − αu(u) }

In a simple embedded system, the output arrival curve represents the output data rate depending on the processor availability and the input data event arrival. The output service curve is equivalent to the remaining processor availability. The buffer size refers to the minimum size of the memory for storing the traffic bursts, and the delay refers to the maximum latency of the system. The delay and the buffer size are given by the following equations:

delay ≤ sup_{u≥0} { inf { τ ≥ 0 : αu(u) ≤ βl(u + τ) } }    (5.4)
backlog ≤ sup_{u≥0} { αu(u) − βl(u) }    (5.5)

Intuitively, the delay is bounded by the maximum horizontal distance between αu and βl, whereas the buffer size is bounded by the maximum vertical distance between them, as also shown in Figure 5.8. We refer to [172] for the methods for evaluating the above equations. The equations are implemented in the MPA toolbox in Matlab³. The toolbox includes several models for describing typical event streams (e.g., periodic with jitter) and service models (e.g., TDMA). Therefore, after modeling the input behavior and the accelerator service capacity, modular performance analysis can be used to find designs satisfying the system properties, like delay and buffer requirements.
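For intuition, the bounds (5.4) and (5.5) can be evaluated numerically on sampled curves. The following Python sketch uses hypothetical staircase curves and is not the MPA toolbox implementation.

def backlog_bound(alpha_u, beta_l):
    # Maximum vertical distance between alpha^u and beta^l, Eq. (5.5).
    return max(a - b for a, b in zip(alpha_u, beta_l))

def delay_bound(alpha_u, beta_l):
    # Maximum horizontal distance, Eq. (5.4); beta_l must cover the horizon.
    worst = 0
    for u, a in enumerate(alpha_u):
        # smallest tau with alpha^u(u) <= beta^l(u + tau)
        tau = next(t for t in range(len(beta_l) - u) if a <= beta_l[u + t])
        worst = max(worst, tau)
    return worst

alpha_u = [min(4 + k, 12) for k in range(9)]  # burst of 4, then 1 event/slot
beta_l = list(range(13))                      # constant service: 1 event/slot
print(backlog_bound(alpha_u, beta_l))  # 4 -> buffer of 4 events
print(delay_bound(alpha_u, beta_l))    # 4 -> at most 4 slots of delay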

5.2.2. Objective Parameter Estimation for Accelerators

The design space of hardware accelerators is spanned by the choice of the tiling strategies and the choice of the resource constraints of the architecture model. In this section, we illustrate this observation and summarize our exploration approach for finding Pareto-optimal designs. In the LPGS tiling scheme, all iteration points within a tile are executed in parallel, whereas the tiles are executed sequentially (see Fig. 5.9(a)).

³http://www.mpa.ethz.ch/rtctoolbox



Figure 5.9.: (a) Dependence graph of an adaptive filter after clustering (LPGS) with tile (1 × 4) and its corresponding accelerator. (b) Dependence graph of the filter after tiling (LSGP) the iteration space with tile (8 × 2). The numbers in/at the nodes denote the start times of the iterations and their operations, respectively.

In Fig. 5.9(a), for a 4-tap adaptive filter, the output (j = 3) is produced by PE3 (2 MUL, 2 ADD) every cycle, whereas in Fig. 5.9(b), the output is produced by PE1 (1 MUL, 1 ADD) only every fourth cycle. Therefore, the throughput depends on the resource allocation.

In copartitioning, as introduced earlier, the iteration space is first partitioned into LS (local sequential) tiles; this tiled iteration space is tiled once more using GS (global sequential) tiles, as shown in Fig. 5.10. The start times of the iterations of the output variable are shown in Fig. 5.10. It illustrates the bursty nature of the outputs, as 16 outputs are produced in 8 cycles with a period of 32 cycles. Obviously, the choice of the tiling matrices directly influences the number of PEs. Further, the allocation information on resource constraints (functional units) for each PE is specified in the architecture model of the input program. In Figs. 5.9 and 5.10, one observes that different tiling strategies and resource allocations lead to hardware accelerators with a different number of PEs, resources, and performance. Therefore, the selection of a best-fit accelerator hardware requires a framework for performance analysis in design space exploration.


Algorithm 5.2 summarizes the search for finding the Pareto-optimal designs in terms of area, power cost, and throughput, and their corresponding performance characterization as service curves for modular performance analysis.

Algorithm 5.2 EXPLORE
Require: Intermediate representation (dependence graph) of the algorithm, set P of tiling matrices, set R of resource constraints in the architecture model
Ensure: Set of service curves β of the Pareto-optimal set of hardware designs
1: for all candidates (P, R) ∈ Search(P × R) do
2:   (L, II, λ) ← Scheduling(P, R)
3:   (A, Pow) ← AnalyzeRTLCost(P, R)
4:   if (A, Pow, II) is non-dominated then
5:     β ← β ∪ ServiceCurve(P, R, λ)
6:   end if
7: end for

The search heuristic determines the candidates for Pareto-optimal designs as discussed in the previous section on design space exploration. As a result of scheduling, the minimal latency L, the corresponding iteration interval II, and the area A and power cost Pow are determined. The iteration interval II is used for rating the performance, as it is inversely proportional to the throughput. Modular performance analysis requires modeling the accelerator performance as service curves. The service curves of all Pareto-optimal accelerator designs thus need to be determined, as discussed in the next section.
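The loop of Algorithm 5.2 can be sketched in Python as follows; Scheduling, AnalyzeRTLCost, and ServiceCurve are placeholders for the MILP scheduler, the macro-model evaluation of Section 5.1.3, and the service-curve construction of the next section.

def dominates(a, b):
    # a weakly dominates b: no worse in any objective and not identical.
    return all(x <= y for x, y in zip(a, b)) and a != b

def explore(candidates, scheduling, analyze_rtl_cost, service_curve):
    archive = []  # pairs of ((A, Pow, II), service curve beta)
    for p, r in candidates:
        _latency, ii, lam = scheduling(p, r)
        area, power = analyze_rtl_cost(p, r)
        point = (area, power, ii)
        if not any(dominates(old, point) for old, _ in archive):
            # drop archive entries that the new point dominates, then add it
            archive = [(o, b) for o, b in archive if not dominates(point, o)]
            archive.append((point, service_curve(p, r, lam)))
    return archive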

5.2.2.1. Accelerator Performance: Service Curve Estimation

In this section, we determine a service curve for an accelerator based on its allocation and scheduling. An important question for modeling the performance is: what can be characterized as an event? For streaming applications, a single problem instance can be viewed as an event (e.g., a single frame for a streaming video filter application). A partition of the problem instance defined by the algorithm specification can also be viewed as an event (e.g., in the IDCT, one macro-block or one row of a macro-block can be classified as an event). An event at the finest level of granularity represents the iteration outputs of the application, for instance, each pixel of a frame in a video streaming application. Therefore, the modeling of curves requires an event definition. It is not necessary to derive the service curves from simulation traces, because of the static scheduling undertaken for loop accelerator generation. In the context of loop accelerators, we characterize a single output iteration as an event. Here, we can consider two methods.

Method 1 is a fast estimation of the service curve of an accelerator that can be generated during scheduling.



Figure 5.10.: The output space of the matrix multiplication with 8 × 8 matrices on a 2-hierarchical partitioning. The output data is produced in bursts.

The number of output events NO, corresponding to the production of output variables at border PEs within the latency period L, gives the service curve. In this case, the service curve β(∆) can be represented by the following piecewise linear approximation:

β(∆) = r·∆, where r = NO / L    (5.6)

However, the above curve models the average throughput and cannot represent the bursty behaviour caused by tiling.

Method 2: For more accurate modelling, the service curve of each PE needs to be calculated individually. The final service curve is then given by the sum of the individual service curves of all output PEs, i.e., the PEs which produce and write the output variables. For calculating the service curve for a 2-level hierarchical partitioning (also known as copartitioning), the following steps need to be carried out:

• Determination of the output processors PE1, PE2, ..., PEn: Let IO be the iteration space of the output variable. This is determined by the iteration condition of the output variable in the loop program. Then, the n distinct elements of the output processor space P (= {Q · I | I ∈ IO}) give the set of output processors PE1, PE2, ..., PEn ∈ P.

• Determination of the service curve for each output processor: For each processor PEi, the following parameters need to be determined to calculate the service curve:



Figure 5.11.: Service curve of the matrix multiplication accelerator. The dotted service curve is that of a single PE.

– N_PEi denotes the number of output iterations of the processor element PE_i.

– N_PEi(LS) denotes the maximum number of outputs in the corresponding local sequential (LS) tile executed by processor PE_i.

– γ_min is the time difference between the start of the first and the end of the last output of the same local sequential (LS) tile.

– γ_max is the time difference between the execution of two successively scheduled LS tiles.

Then, the service curve is represented as a piecewise linear curve with three segments, and is given by

β_i(∆) = min { r∆, r_1∆ + N_PEi(LS), N_PEi }    (5.7)

where r = N_PEi(LS) / γ_min and r_1 = N_PEi(LS) / γ_max are measures for the short-term and long-term burstiness of the output, respectively.

• Find the final service curve as a function of the individual service curves of the n processors: the output service curve is given by β(∆) = ∑_{i=1}^{n} β_i(∆).

Example 5.2.1 For the matrix multiplication example in Fig. 5.10, N_PEi(LS) = 4, N_PEi = 16, γ_min = 4, and γ_max = 32. The service curve for a single PE is obtained using Equation (5.7), and is shown together with the curve of the entire accelerator in Fig. 5.11. LSGP and LPGS are special cases of a 2-level hierarchical partitioning. For LSGP, γ_min = γ_max, as the LS tile is the only one to be executed by a corresponding processor. For the same reason, N_PEi(LS) = N_PEi. For LPGS, N_PEi(LS) = 1 and γ_min = γ_max, as each iteration can be considered a local sequential tile.

The important problem to be solved for the accurate derivation of the service curves is to find the required variables N (number of output iterations) and γ (dependent on the schedule). The first problem can be solved by counting the index points lying within a polytope; this counting problem is solved by computing the iteration space volume as explained in [8]. As an approximation of γ, the equation γ = max(λ·I_1) − min(λ·I_2), where I_1, I_2 ∈ I_O, is used.

The arrival curves must also be modified, since in realistic systems an accelerator has to process more than one event from different input streams to trigger an output event. Therefore, if an accelerator consumes n_i (i = 1, ..., m) events from m streams to produce one output event, then the bounds on the arrival curves must be modified as [α_i^u / n_i, α_i^l / n_i] and then taken as input for an abstract AND component [85]. For example, in a matrix multiplication, one has inputs from two different streams (the two matrices).
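Equations (5.6) and (5.7) are simple enough to evaluate directly. The following Python sketch uses our own helper functions (not part of the PARO tool flow) and the values of Example 5.2.1; it assumes that all four PEs of Fig. 5.10 are output PEs.

def beta_avg(delta, n_out, latency):
    # Method 1, Eq. (5.6): average-rate approximation with r = N_O / L.
    return (n_out / latency) * delta

def beta_pe(delta, n_pe, n_ls, gamma_min, gamma_max):
    # Method 2, Eq. (5.7): three-segment service curve of one output PE.
    r = n_ls / gamma_min      # short-term burst rate within an LS tile
    r1 = n_ls / gamma_max     # long-term rate from LS tile to LS tile
    return min(r * delta, r1 * delta + n_ls, n_pe)

# Example 5.2.1: N_PEi(LS) = 4, N_PEi = 16, gamma_min = 4, gamma_max = 32;
# the accelerator curve is the sum over the assumed four output PEs.
def beta_accel(delta, n=4):
    return sum(beta_pe(delta, 16, 4, 4, 32) for _ in range(n))

for d in (8, 32, 64, 128):
    print(d, beta_accel(d))   # saturates at 64 outputs, as in Fig. 5.11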

5.2.3. Optimal Configuration Selection in System Context

The modeling of a hardware accelerator's performance is simplified because of static scheduling. The worst-case input throughput is obtained by analyzing several simulation traces or from an experienced system architect, and is modeled as an arrival curve α. Algorithm 5.3 can then be used to select a best-fit design based on the trade-off between buffer size, delay, and resource utilization.

Algorithm 5.3 MATCH
Require: input arrival curve α, set B of service curves of Pareto-optimal hardware accelerators
Ensure: optimal set B_opt of service curves β
1: B_opt ← {}
2: for all candidates β ∈ B do
3:   DELAY ← RTCDEL(α, β)
4:   BUFFER ← RTCBUF(α, β)
5:   UTILIZATION ← RTCUTIL(α, β)
6:   if UTILIZATION < 100 then
7:     B_opt ← B_opt ∪ {β}
8:   end if
9: end for

The algorithm computes the DELAY, BUFFER, and UTILIZATION for the given arrival curve and the service curve of each Pareto-optimal hardware accelerator. The delay and buffer size are calculated using min-plus algebra operations (see


Figure 5.12.: M-JPEG decoder (Source, Parser, Huffmann Decoder, Entropy Decoder, Dequantization, Inverse ZigZag, Inverse Discrete Cosine Transform, Frame Shuffler, YCbCr Decoder, Sink). All algorithms are implemented in hardware. Therefore, there is no output service curve.

Equations (5.4) and (5.5)) and integrated as the functions RTCDEL and RTCBUF (intuitively, the maximum horizontal and vertical distance between the arrival and service curve) in the RTC toolbox. The utilization can be calculated as U = α^u(t_max) / β(t_max), where t_max is the end time. If the resource utilization is less than 100%, then the corresponding hardware accelerator is added to the set of solutions. A resource utilization greater than 100% would imply that the accelerator does not have enough performance or throughput to process the input data rates. If the rate of the input is known, then one can simply plug the iteration interval into the scheduling problem. However, in the case of applications with multiple loops and dynamic input behaviour, this rate information is available only after simulation and design space exploration. Therefore, exploration is necessary for fast modular performance analysis. The accelerator is completely dedicated to the processing of the input stimuli; therefore, the upper and lower service curves are the same (β^u(∆) = β^l(∆)). The optimal configuration may not be the fastest accelerator implementation, but a rate-matched implementation. The rate-matched implementation satisfies the throughput requirement corresponding to the worst-case input stimulus.
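For illustration, the three quantities computed in Algorithm 5.3 can be sketched in Python for curves sampled at the integer time intervals ∆ = 0, 1, ..., T. This is a discretized simplification of our own making of the min-plus operations of Equations (5.4) and (5.5); it is not the API of the RTC toolbox.

def rtc_buf(alpha_u, beta_l):
    # RTCBUF: maximum vertical distance between the curves, i.e., the
    # worst-case backlog that must be buffered.
    return max(a - b for a, b in zip(alpha_u, beta_l))

def rtc_del(alpha_u, beta_l):
    # RTCDEL: maximum horizontal distance between the curves, i.e., the
    # worst-case delay until the service has caught up with the arrivals.
    worst = 0
    for t, a in enumerate(alpha_u):
        s = next((u for u in range(t, len(beta_l)) if beta_l[u] >= a), None)
        if s is None:
            return float('inf')  # service never catches up within the horizon
        worst = max(worst, s - t)
    return worst

def rtc_util(alpha_u, beta):
    # Utilization U = alpha_u(t_max) / beta(t_max), in percent.
    return 100.0 * alpha_u[-1] / beta[-1]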

5.2.4. Case Study

In this section, we apply the presented approach of performance analysis to match the throughput of an IDCT accelerator in a motion JPEG decoder to a given workload. I.e., we find a best-fit accelerator for a given set of workload scenarios.

5.2.4.1. Motion JPEG Decoder

The sequence of algorithms in the M-JPEG decoder is illustrated in Figure 5.12. The IDCT is a data-intensive stage, which is usually implemented in hardware. Our design tool is used to synthesize the IDCT accelerator for different resource constraints, each resulting in a different throughput. The fastest IDCT (II = 1) requires 3562 slices on a Xilinx Virtex1000 FPGA, whereas an IDCT with an iteration interval of 8 occupies only 2224 slices because of the resource sharing of 4 multipliers, 4 adders, and 4 subtractors. For obtaining a realistic estimation of the input simulation traces to the


Figure 5.13.: (a) Arrival and service curves (II = 8) for the IDCT stage in M-JPEG; (b) zoomed view of the arrival curves. Both plots show #(macroblock rows) during ∆ over the time interval ∆ (10^-4 ms).

IDCT, a SystemC model of the M-JPEG pipeline was written. The inputs for the M-JPEG simulation are sequences of encoded 176×144 QCIF images with varying degrees of compression.

Fig. 5.13(a) shows the service curve for II = 8. Fig. 5.13(b) shows the upper and lower output arrival curves of the inverse zig-zag stage of M-JPEG, which is the input to the IDCT component. These arrival curves are obtained from the SystemC simulation traces (worst case) at the input of the IDCT hardware. The curves illustrate the bursty nature of the IDCT input. These curves are then taken as the input arrival curve α for the IDCT. Afterward, they are matched with the service curves β of the set of Pareto-optimal hardware accelerators. Considering only area and throughput, the Pareto-optimal set contains only a few hardware designs (II = 1, 2, 4, 8, 16). The results of applying Algorithm 5.3 are shown in Table 5.5.

The consideration of worst-case input event streams in the simulation setup shows a resource utilization of only 7% for the fastest implementation (II = 1) of the IDCT. I.e., the accelerator is busy only 7% of the time. Therefore, one can increase the resource utilization by incorporating an IDCT accelerator with a larger iteration interval (i.e., lower throughput). The lower throughput requirement allows for a solution with a lower number of resources, which also reduces the accelerator area (by 37.6% for II = 8). The delay and buffer sizes refer to the internal FIFO requirements for storing input bursts, and they are determined through Equations (5.4) and (5.5) in the MPA


toolbox, respectively, and are different from the latency and the accelerator area determined during design space exploration.

  II (cycles)   Delay (10^-4 ms)   Buffer (#Cells)   Area Reduction (%)
       1              0.14                7                  0
       2              0.28                7                 14.2
       4              0.56                7                 31.2
       8            192.92             1072                 37.6
      16               -                  -                   -

Table 5.5.: Trade-off between buffer size, delay, and area reduction with respect to the solution with II = 1 for the IDCT butterfly implementation. "-" indicates a resource utilization of more than 100 percent.

The computed delay and buffer sizes are further optimization objectives, which need to be compared. The jump in the computed delay and buffer size in Table 5.5 is caused by the worst-case behaviour, which requires the storage of bursts in FIFO buffers. On close observation, the Pareto-optimal hardware design with II = 4 can also be selected, based on the area, delay, performance, and buffer sizes. The results hold for a particular implementation of all other components in the M-JPEG pipeline. In case of a different implementation of the other components, the simulations need to be repeated for obtaining the arrival curves. The simulation of a single design containing a complete hardware implementation of the M-JPEG decoder takes about 62 s, as compared to 0.07 s using the MPA toolbox. Therefore, the combination of functional simulation with analytic methods can speed up the search for best-fit designs by orders of magnitude.
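For reference, an upper arrival curve such as the one used above can be derived from a trace of event timestamps by sliding a window of length ∆ over the trace. The following sketch is our own helper with a hypothetical bursty trace; it is not the SystemC trace extraction used in this case study.

from bisect import bisect_left

def alpha_upper(timestamps, delta):
    # alpha_u(delta): the maximum number of events in any window of length
    # delta; the maximum is always attained by a window starting at an event.
    ts = sorted(timestamps)
    return max((bisect_left(ts, t + delta) - i for i, t in enumerate(ts)),
               default=0)

# Hypothetical bursty trace: bursts of 8 back-to-back events every 64 cycles.
trace = [64 * burst + i for burst in range(4) for i in range(8)]
print([alpha_upper(trace, d) for d in (1, 8, 64, 128)])   # [1, 8, 8, 16]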

5.3. Conclusion and Summary

The freedom in choosing the architecture allocation and the compiler transformations leads to an explosion in the number of potential loop accelerator designs. Therefore, the task of finding Pareto-optimal designs in terms of performance, area, and power cost is of utmost importance. An exhaustive exploration of all designs with exact determination of the design objectives cannot be undertaken in reasonable time. In this chapter, we introduced an exploration framework utilizing intelligent search heuristics based on multi-objective evolutionary algorithms (MOEAs) for identifying an approximation of the set of Pareto-optimal designs. In addition, an estimation-based approach for determining the design objectives further speeds up the exploration, such that design spaces consisting of thousands of designs can be explored in less than an hour. The selection of best-fit designs from a given set of Pareto-optimal designs considering workload contracts was also shown. This problem

is important, as the required throughput rates will be determined by the environment or by the communicating neighbor accelerator blocks in a complex SoC design. Hence, the required throughput rate will only be known once the overall architecture has been fixed. It was shown here how modular performance analysis can be used to find best-fit hardware accelerator engines in terms of area and throughput for an SoC, with worst-case performance guarantees for given work contracts in the design phase. A motion JPEG case study was chosen to validate the benefits of the methodology in combination with simulation. It shows that choosing a rate-matched Pareto-optimal design may lead to a 31% reduction in the area of an IDCT accelerator IP core. The matching of hardware accelerator performance by service curves using polyhedral theory indicates that similar work on optimal cache and functional unit usage can be extended and used to model software performance on a general purpose processor [27]. Therefore, the combination with modular performance analysis as a plug-in within HW/SW compiler tools is an important step towards the optimization of accelerator-based SoC designs. In the design phase, the system architect can also perform design space exploration for each accelerator and then use real-time calculus for the performance analysis of different combinations of Pareto-optimal designs for the accelerators. This is the next exploration step, which needs to be investigated in the future.


6. Conclusions and Outlook

This dissertation presents novel contributions to the methodology for the automated synthesis and exploration of hardware accelerators for computationally intensive nested loop programs. In this chapter, we summarize the key contributions of the dissertation. In addition, an overview of possible future work in the area of accelerator synthesis and code generation is given.

6.1. Conclusion

The compelling next generation streaming applications containing several computationally intensive nested loop programs are the driving force behind system-on-a-chip (SoC) architectures. SoC platforms are characterized by the presence of diverse components like traditional processors augmented with programmable or non-programmable accelerators. The accelerators implement the computationally intensive loop programs of a given application. An SoC platform can improve the performance, optimize power, and reduce cost by orders of magnitude due to the specialized execution on these accelerators. However, the effort for programming and synthesizing the loop accelerators for streaming applications is still enormous. Therefore, the PARO design methodology is presented in this thesis, which enables the automated synthesis and exploration of accelerators for streaming applications. The methodology contains novel contributions in the areas of compiler transformations, control/communication synthesis, and design space exploration of loop accelerators.

In this dissertation, we present a source-to-source transformation called hierarchical tiling, which restructures the loop descriptions given in a high-level language within a polyhedral framework using multiple hierarchies of tiles. This important extension of the well-known loop tiling transformation enables the utilization of multiple levels of parallelism and memory in an accelerator architecture. With hierarchical tiling, the system designer is thus able to specify the degree of parallelism (number of PEs), the local memory usage, and the requisite communication bandwidth of the accelerator architecture. Other design flows contain simple tiling or loop unrolling, which can specify only one or the other criterion, but not all of them.

The transformation, however, may create loop code characterized by the presence of many more control conditions, which could become a performance bottleneck. Therefore, a novel back-end methodology for control generation was presented. It contains an efficient method for synthesizing a global counter, which scans the loop

program according to the specified tiling strategy. Furthermore, the combination of local and global controller techniques leads to a low hardware overhead for the control code implementation, as compared to previous works on control generation, which used only local control facilities. I/O communication can also become a performance bottleneck. Therefore, depending on the tiling parameters and the schedule, a custom memory architecture with multiple banks, address generators, and I/O controllers is automatically generated.

The consideration of front-end transformations like hierarchical tiling and the novel back-end synthesis of the control and I/O communication units leads to highly efficient accelerator implementations. We used several loop benchmarks from different classes of loop algorithms, ranging from linear algebra and image processing to networking. The synthesized loop accelerators show an average gain of around 2.5x, 4.5x, and 50x in terms of area, power, and performance over embedded processors. The design flow leads to an increased productivity gain of up to 100x, due to the ease of programming in a high-level language rather than cumbersome RTL coding in VHDL.

Streaming applications are characterized by the presence of multiple communicating loops. With state-of-the-art polyhedral techniques, one can generate fast individual loop accelerators, but there exists no methodology for the automated generation of the intermediate communication hardware, which is often a major bottleneck. In this context, a novel intermediate dependence graph called mapped loop graph was introduced for the modular representation of the task level parallelism of communicating loops and their mapping information in the polyhedral model. Traditionally, the polyhedral model has been used for computation synthesis, and the data flow models of computation for communication analysis. Therefore, a methodology is proposed for the projection of a mapped loop graph in the polyhedral model onto the model parameters of a data flow model of computation called windowed synchronous data flow (WSDF). Subsequently, the generation of an efficient dedicated communication primitive called multi-dimensional FIFO from the windowed synchronous data flow model is undertaken. The communication synthesis is able to generate communication hardware for data transfer and synchronization between the loop accelerators, which can handle parallel access and out-of-order communication in their entirety, unlike existing approaches.

The accelerators can also be used as co-processors in an SoC. In order to ease the integration of accelerators in an SoC, a methodology for the automated generation of a memory map, a software driver, and a hardware wrapper was proposed. The software drivers are the programs running on the processors, which are responsible for the data transfer and synchronization to/from the accelerator. The hardware wrapper, in turn, implements the protocol conversion of signals for the integration of accelerators over a system bus. The experimental results also show that the performance gain may scale down by an order of magnitude for a hardware/software co-design as compared to a pure hardware implementation due to the communication bottleneck. Still, the hardware/software co-design approach may offer an advantage of 2-20x over pure

software solutions. The dissertation tackles the often neglected problems of control and communication synthesis in the context of loop transformations, and gives efficient and generic solutions.

The selection of an optimal architecture can be daunting due to a plethora of architecture and compiler design decisions. Exhaustive exploration of the design space is prohibitive due to excessively large execution times. Therefore, we propose a method using modern search heuristics based on evolutionary algorithms and the estimation of objectives to identify Pareto-optimal designs (i.e., non-dominated designs with the best trade-offs) in terms of area cost, power consumption, and performance. This not only reduces the exploration time to a matter of around an hour for relevant accelerator benchmarks, but also delivers better design solutions than random search techniques. For a given workload scenario, a best-fit accelerator can be chosen from the Pareto-optimal set of designs. The proposed analytical method for finding the best-fit accelerator is based on real-time calculus and uses system performance models.

To summarize, the major contributions of the thesis enable the realization of loop accelerators. Several other research problems need to be solved in the future to ease the programming of next generation multiprocessor system-on-chip accelerator platforms.

6.2. Future Work

In this dissertation, we have looked at the static compilation of loop programs onto massively parallel hardware accelerator engines to be used within system-on-chip architectures. Another important aspect of compilers for future multi-processor system-on-chip architectures is the dynamic component. The applications must not only be able to utilize architecture features and their availability, which can be unknown at compile-time, but also be able to react to the system behaviour (temperature, power, faults), workload features (throughput), and others. For example, the highly dynamic nature of signal processing applications and their environment may require a programmer to support different encodings in compression applications like MPEG-4. Therefore, a new paradigm of computing called invasive computing has been proposed in [165]. It enables applications to explore and dynamically spread their computations and execute them depending on the circumstances at run-time. In order to support invasive computing in a programmable processor array accelerator, several architectural and compiler innovations are needed: On the compiler side, loop transformations like hierarchical tiling should be extended to handle parameters (e.g., dynamic loop bounds) that are unknown at compile-time. Furthermore, several architecture innovations need to be undertaken in tightly-coupled processor arrays (TCPAs). The controller architecture must not only be able to handle parameters, but one also needs to develop a generic VLIW control PE with limited programmability such that the computation control and I/O control do not throttle the computation speed. Therefore, the concepts introduced in this thesis must be extended for generating control and

invasion code for TCPAs. Broadly speaking, a code generation methodology and architecture improvements are needed to handle dynamic parallel computation, control, and communication.

The implementation of multiple loops on programmable arrays may make use of reconfiguration (i.e., time multiplexing). The ideas on communication synthesis for non-programmable arrays presented in this dissertation can be extended to handle the communication synthesis for programmable arrays with reconfiguration capabilities. Furthermore, since communication is the bottleneck in the acceleration achievable using hardware accelerators in an SoC setup, the investigation of communication alternatives like DMA or data caches is necessary.

Last but not least, the design space exploration of accelerator architectures might benefit from the use of machine learning algorithms. These are currently used in the MILEPOST project for finding optimization parameters and flags for gcc compilation [73]. Furthermore, the ongoing research work on cyclic dependencies in real-time calculus would enable the analysis of communicating loop accelerators with backpressure. This would further reduce the amount of simulation effort in system design. This set of challenges needs to be considered for the design automation of accelerators in complex next-generation SoC architectures.

A. Glossary

Accelerator: a dedicated device that enhances performance by faster execution of a specific workload. The device may require an invocation from a host program on the general purpose system parts.

Design space exploration: a process which refers to the search for Pareto-optimal designs. For accelerators, it shows the relationship between the measured objectives of the architecture (like area, power, and performance) for a range of parameter values that each represent particular architecture and compiler design choices.

FPGA: is an integrated circuit containing logic blocks and reconfigurable interconnects, which is designed to be configured by the customer/designer after manufacturing, for realizing simple and complex combinational functions.

Evolutionary algorithm: is a generic population-based meta-heuristic optimization algorithm. The objective is to converge to the true Pareto front of a multi-objective optimization problem, which normally consists of a diverse set of points.

Latency: of a pipelined loop nest execution is the time difference between the start of the first operation of the first loop iteration and the end of the last operation of the last loop iteration.

Memory map: is a data structure that contains the information on the memory space, regarding the size of the total memory, reserved regions, and the address space assigned to the different system components.

Moore's Law: proposed in 1965, states that the number of transistors on a chip will double about every two years (which was later modified to 18 months). The popular version is that processor performance doubles every two years.

Pareto optimal accelerator: a decision vector x ∈ X is said to be non-dominated regarding a set A ⊆ X iff ∄ a ∈ A : a dominates x. Moreover, x is said to be Pareto-optimal iff x is non-dominated regarding X. Let x = (cost, throughput) ∈ X denote a decision vector and X the decision space of all vectors x. For any two decision vectors x1, x2 ∈ X, x1 dominates x2 if and only if (cost(x1) < cost(x2) ∧ throughput(x1) ≥ throughput(x2)) ∨ (cost(x1) ≤ cost(x2) ∧ throughput(x1) > throughput(x2)).


Pareto Front: for a given multi-objective problem F(x) and Pareto optimal set P, the Pareto front PF is defined as: PF = { u = F(x) | x ∈ P }.

Polytope model: is a framework for loop nest optimization and implementation, where the iteration spaces of loops are modeled as Z-polytopes.

Processor array: is a type of integrated circuit which has a massively parallel array of hundreds of processing elements interconnected to each other in a grid network.

Single assignment: is a representation in which one cannot bind a value to a variable if a value is already assigned to that variable.

SoC: stands for "systems-on-a-chip" and refers to the integration of heterogeneous system components like processors, buses, accelerators, and others on a single chip.

Software driver: is a program running on a processor, which allows processor programs to interact with the hardware accelerator device.

B. Hermite Normal Form

Given a square non-singular integer matrix A ∈ Z^(n×n), there exists a unimodular matrix U ∈ Z^(n×n) and a matrix H ∈ Z^(n×n), known as the Hermite normal form (HNF) of A, such that H = AU. The entries of H satisfy:

1. H is upper right triangular, that is, h_(i,j) = 0 for all i > j,

2. h_(i,i) > 0 for all i, and

3. h_(i,i) > h_(i,j) ≥ 0 for all i < j.

The right multiplication of A by a unimodular matrix U corresponds to a sequence of the following elementary column operations:

• Interchange two columns.
• Multiply a column by −1.
• Add an integral multiple of one column to another.

The Hermite normal form can be found in polynomial time by using a sequence of the above defined elementary column operations [156].

Figure B.1.: The lattice generated by the columns of R = ( 1 2 ; 1 -1 ) is denoted by the filled points, e.g., (0,0), (1,1), and (3,0).

For example, the lattice in Figure B.1 is generated by the matrix R = ( 1 2 ; 1 -1 ), and its Hermite normal form is H = ( 3 1 ; 0 1 ). The matrix H can be considered as corresponding to the tiles in Figure B.1: the diagonal elements determine the size of the tiles, and the non-diagonal elements determine the offset of the tiles [74].
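For illustration, the column-operation procedure can be sketched in a few lines of Python. This naive version is our own and omits the safeguards that keep intermediate values polynomially bounded [156]; it reproduces H for the example matrix R.

def hnf(A):
    # Column-operation Hermite normal form of a square non-singular integer
    # matrix: upper right triangular, positive diagonal, and reduced
    # off-diagonal entries (0 <= h[i][j] < h[i][i] for j > i).
    H = [row[:] for row in A]
    n = len(H)

    def add_col(dst, src, k):  # column dst += k * column src
        for r in range(n):
            H[r][dst] += k * H[r][src]

    for i in range(n - 1, -1, -1):        # process rows bottom-up
        for j in range(i):                # Euclid: zero out row i left of col i
            while H[i][j] != 0:
                if H[i][i] != 0:
                    add_col(j, i, -(H[i][j] // H[i][i]))
                if H[i][j] != 0:          # move the smaller entry onto col i
                    for r in range(n):
                        H[r][i], H[r][j] = H[r][j], H[r][i]
        if H[i][i] < 0:                   # make the diagonal entry positive
            for r in range(n):
                H[r][i] = -H[r][i]
        for j in range(i + 1, n):         # reduce entries right of the diagonal
            add_col(j, i, -(H[i][j] // H[i][i]))
    return H

print(hnf([[1, 2], [1, -1]]))             # [[3, 1], [0, 1]]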

C. Loop Benchmarks

Cyclic Redundancy Check (CRC)

/* Algorithm: CRC32
 * Source: Hackers Delight, chapter 14 */
include("examples/Architecture/shifter_32.arch.paro")
include("examples/Architecture/alu_32.arch.paro")
program crc32 {
  variable mesg 2 in unsigned integer<32>;
  variable init_crc 2 in unsigned integer<32>;
  variable crc 2 unsigned integer<32>;
  variable crc_in 2 unsigned integer<32>;
  variable byte 2 unsigned integer<32>;
  variable byte_in 2 unsigned integer<32>;
  variable z 2 unsigned integer<32>;
  variable y 2 unsigned integer<32>;
  variable cond 2 integer<32>;
  variable crc_out 2 out unsigned integer<32>;
  parameter I1 = 8; // Number of Mesg
  parameter N = 1;
  par (i >= 0 and i <= N-1 and i1 >= 0 and i1 <= I1-1) {
    byte[i,i1] = mesg[i,0] if (i1 == 0);
    byte[i,i1] = byte[i,i1-1] << 1 if (i1 > 0);
    crc_in[i,i1] = init_crc[i,i1] if (i1 == 0);
    crc_in[i,i1] = cast<unsigned integer<32> >(crc[i,i1-1]) if (i1 > 0);
    // input is already 32-bit reversed
    z[i,i1] = cast<unsigned integer<32> >(crc_in[i,i1] << 1);
    cond[i,i1] = ((cast<integer<32> >(crc_in[i,i1])) ^ (cast<integer<32> >(byte[i,i1])));
    y[i,i1] = cast<unsigned integer<32> >(cast<unsigned integer<32> >(z[i,i1]) ^ cast<unsigned integer<32> >(0x04C11DB7));
    crc[i,i1] = ifrt(cond[i,i1] < 0, y[i,i1], z[i,i1]);
    crc_out[i,i1] = crc[i,i1] if (i1 == I1-1);
  }
}


Complex Matrix-Matrix Multiplication (CMM)

/* Algorithm: Complex Matrix Multiplication, partitioned */
include("examples/Architecture/adder_16.arch.paro")
include("examples/Architecture/multiplier_16x16_16.arch.paro")
include("examples/Architecture/subtractor_16.arch.paro")
include("examples/Architecture/shifter_32.arch.paro")
include("examples/Architecture/alu_32.arch.paro")
program matmul {
  variable A 6 in signed integer<32>;
  variable B 6 in signed integer<32>;
  variable C 6 out signed integer<32>;
  variable tempA 6 integer<32>;
  variable tempB 6 integer<32>;
  variable tempC 6 integer<32>;
  variable tempCr 6 integer<32>;
  variable ar 6 integer<16>;
  variable ai 6 integer<16>;
  variable br 6 integer<16>;
  variable bi 6 integer<16>;
  variable bi_tmp 6 integer<32>;
  variable cr 6 integer<16>;
  variable ci 6 integer<16>;
  variable zr 6 integer<16>;
  variable zi 6 integer<16>;
  parameter I1 = 64;
  parameter J1 = 64;
  parameter K1 = 64;
  parameter I2 = 1;
  parameter J2 = 1;
  parameter K2 = 1;
  par (i1 >= 0 and j1 >= 0 and k1 >= 0 and i1 <= I1-1 and j1 <= J1-1 and k1 <= K1-1 and i2 >= 0 and i2 <= I2-1 and j2 >= 0 and j2 <= J2-1 and k2 >= 0 and k2 <= K2-1) {
    tempA[i1,j1,k1,i2,j2,k2] = A[i1,0,k1,i2,0,k2] if (j1 == 0 and j2 == 0);
    tempB[i1,j1,k1,i2,j2,k2] = B[0,j1,k1,0,j2,k2] if (i1 == 0 and i2 == 0);
    // get low 16 bits
    ar[i1,j1,k1,i2,j2,k2] = cast<integer<16> >(tempA[i1,j1,k1,i2,j2,k2] >> 16) if (j1 == 0 and j2 == 0);
    // get high 16 bits
    ai[i1,j1,k1,i2,j2,k2] = cast<integer<16> >(tempA[i1,j1,k1,i2,j2,k2] & 0x0000FFFF) if (j1 == 0 and j2 == 0);
    // get low 16 bits
    br[i1,j1,k1,i2,j2,k2] = cast<integer<16> >(tempB[i1,j1,k1,i2,j2,k2] >> 16) if (i1 == 0 and i2 == 0);
    // get high 16 bits
    bi_tmp[i1,j1,k1,i2,j2,k2] = tempB[i1,j1,k1,i2,j2,k2] & cast<integer<32> >(0x0000FFFF) if (i1 == 0 and i2 == 0);
    bi[i1,j1,k1,i2,j2,k2] = cast<integer<16> >(bi_tmp[i1,j1,k1,i2,j2,k2]) if (i1 == 0 and i2 == 0);
    // real part A
    ar[i1,j1,k1,i2,j2,k2] = ar[i1,j1-1,k1,i2,j2,k2] if (j1 > 0);
    ar[i1,j1,k1,i2,j2,k2] = ar[i1,j1+J1-1,k1,i2,j2-1,k2] if (j1 == 0 and j2 > 0);
    // imag part A
    ai[i1,j1,k1,i2,j2,k2] = ai[i1,j1-1,k1,i2,j2,k2] if (j1 > 0);
    ai[i1,j1,k1,i2,j2,k2] = ai[i1,j1+J1-1,k1,i2,j2-1,k2] if (j1 == 0 and j2 > 0);
    // real part B
    br[i1,j1,k1,i2,j2,k2] = br[i1-1,j1,k1,i2,j2,k2] if (i1 > 0);
    br[i1,j1,k1,i2,j2,k2] = br[i1+I1-1,j1,k1,i2-1,j2,k2] if (i1 == 0 and i2 > 0);
    // image part B
    bi[i1,j1,k1,i2,j2,k2] = bi[i1-1,j1,k1,i2,j2,k2] if (i1 > 0);
    bi[i1,j1,k1,i2,j2,k2] = bi[i1+I1-1,j1,k1,i2-1,j2,k2] if (i1 == 0 and i2 > 0);
    // real part C
    zr[i1,j1,k1,i2,j2,k2] = ar[i1,j1,k1,i2,j2,k2] * br[i1,j1,k1,i2,j2,k2] - ai[i1,j1,k1,i2,j2,k2] * bi[i1,j1,k1,i2,j2,k2];
    zi[i1,j1,k1,i2,j2,k2] = ai[i1,j1,k1,i2,j2,k2] * br[i1,j1,k1,i2,j2,k2] + ar[i1,j1,k1,i2,j2,k2] * bi[i1,j1,k1,i2,j2,k2];
    // image part C
    cr[i1,j1,k1,i2,j2,k2] = cr[i1,j1,k1-1,i2,j2,k2] + zr[i1,j1,k1,i2,j2,k2] if (k1 > 0);
    ci[i1,j1,k1,i2,j2,k2] = ci[i1,j1,k1-1,i2,j2,k2] + zi[i1,j1,k1,i2,j2,k2] if (k1 > 0);
    cr[i1,j1,k1,i2,j2,k2] = zr[i1,j1,k1,i2,j2,k2] if (k1 == 0 and k2 == 0);
    ci[i1,j1,k1,i2,j2,k2] = zi[i1,j1,k1,i2,j2,k2] if (k1 == 0 and k2 == 0);
    cr[i1,j1,k1,i2,j2,k2] = cr[i1,j1,k1+K1-1,i2,j2,k2-1] + zr[i1,j1,k1,i2,j2,k2] if (k1 == 0 and k2 > 0);
    ci[i1,j1,k1,i2,j2,k2] = ci[i1,j1,k1+K1-1,i2,j2,k2-1] + zi[i1,j1,k1,i2,j2,k2] if (k1 == 0 and k2 > 0);
    // merge real and image parts of C
    tempCr[i1,j1,k1,i2,j2,k2] = (cast<integer<32> >(cr[i1,j1,k1,i2,j2,k2])) << 16;
    tempC[i1,j1,k1,i2,j2,k2] = tempCr[i1,j1,k1,i2,j2,k2] | (cast<integer<32> >(ci[i1,j1,k1,i2,j2,k2]));
    C[i1,j1,k1,i2,j2,k2] = tempC[i1,j1,k1,i2,j2,k2] if (k1 == K1-1 and k2 == K2-1);
  }
}

Discrete Wavelet Transform (DWT)

/* $Id: DWT.paro $
 * Algorithm: 1) DWT with boundary treatment
 *            2) and easily readable (yet antiparallel) dependencies
 *               requiring index shift.
 *            3) Source: Najjar, ROCCC
 * Status: successfully simulated (functional, synthesized RTL)
 */
include("examples/Architecture/adder_16.arch.paro")
include("examples/Architecture/subtractor_16.arch.paro")
include("examples/Architecture/multiplier_16x16_16.arch.paro")
include("examples/Architecture/comparator_16.arch.paro")
//allocation adder_16 4;
//allocation multiplier_16x16_16 1;
//allocation subtractor_16 3;
//allocation comparator_16 3;
program Sobel {
  variable pi 2 in integer<16>;
  variable p 2 integer<16>;
  variable p_0 2 integer<16>;  // x-1, y-1
  variable p_1 2 integer<16>;  // x, y-1
  variable p_2 2 integer<16>;  // x+1, y-1
  variable p_3 2 integer<16>;  // x-1, y
  variable p_4 2 integer<16>;  // x, y
  variable p_5 2 integer<16>;  // x+1, y
  variable p_6 2 integer<16>;  // x-1, y+1
  variable p_7 2 integer<16>;  // x, y+1
  variable p_8 2 integer<16>;  // x+1, y+1
  variable p_9 2 integer<16>;  // x-1, y
  variable p_10 2 integer<16>; // x, y
  variable p_11 2 integer<16>; // x+1, y
  variable p_12 2 integer<16>; // x-1, y+1
  variable p_13 2 integer<16>; // x, y+1
  variable p_14 2 integer<16>; // x+1, y+1
  variable sum 2 integer<16>;
  variable s 2 integer<16>;
  variable po 2 out integer<16>;
  parameter IMGSIZE_X = 512;
  parameter IMGSIZE_Y = 512;
  par (x >= 0 and x <= IMGSIZE_X-1 and y >= 0 and y <= IMGSIZE_Y-1) {
    /* input */
    p[x,y] = pi[x,y];
    /* collect pixels for convolution, boundary treatment */
    // inner image
    p_0[x,y] = p[x-2,y-4] if (x>2 and x4 and y2 and x4 and y2 and x4 and y2 and x4 and y2 and x4 and y2 and x4 and y2 and x4 and y2 and x4 and y2 and x4 and y2 and x4 and y2 and x4 and y2 and x4 and y2 and x4 and y2 and x4 and y2 and x4 and y
    // border pixels
    p_0[x,y] = 0 if (x>=0 and x<3 and y>=0 and y<5);
    p_1[x,y] = 0 if (x>=0 and x<3 and y>=0 and y<5);
    p_2[x,y] = 0 if (x>=0 and x<3 and y>=0 and y<5);
    p_3[x,y] = 0 if (x>=0 and x<3 and y>=0 and y<5);
    p_4[x,y] = 0 if (x>=0 and x<3 and y>=0 and y<5);
    p_5[x,y] = 0 if (x>=0 and x<3 and y>=0 and y<5);
    p_6[x,y] = 0 if (x>=0 and x<3 and y>=0 and y<5);
    p_7[x,y] = 0 if (x>=0 and x<3 and y>=0 and y<5);
    p_8[x,y] = 0 if (x>=0 and x<3 and y>=0 and y<5);
    p_9[x,y] = 0 if (x>=0 and x<3 and y>=0 and y<5);
    p_10[x,y] = 0 if (x>=0 and x<3 and y>=0 and y<5);
    p_11[x,y] = 0 if (x>=0 and x<3 and y>=0 and y<5);
    p_12[x,y] = 0 if (x>=0 and x<3 and y>=0 and y<5);
    p_13[x,y] = 0 if (x>=0 and x<3 and y>=0 and y<5);
    p_14[x,y] = 0 if (x>=0 and x<3 and y>=0 and y<5);
    /* convolution */
    sum[x,y] = 6*p_0[x,y] + 6*p_1[x,y] + 6*p_2[x,y] + 2*p_3[x,y] + 2*p_4[x,y] + 2*p_5[x,y] + -1*p_6[x,y] + -1*p_7[x,y] + -1*p_8[x,y] + 8*p_9[x,y] + 8*p_10[x,y] + 8*p_11[x,y] + -4*p_12[x,y] + -4*p_13[x,y] + -4*p_14[x,y];
    /* sum */
    s[x,y] = sum[x,y] << 3;
    /* output */
    po[x,y] = s[x,y];
  }
}

Downsampler (DS)

program downsampler_filter_block {
  typealias input_t integer<16>;
  typealias output_t integer<16>;
  typealias internal_t integer<16>;
  // input images
  variable g_in0_0 2 in input_t;
  variable g_in0_1 2 in input_t;
  variable g_in1_0 2 in input_t;
  variable g_in1_1 2 in input_t;
  variable g_in_tmp0_0 2 internal_t;
  variable g_in_tmp0_1 2 internal_t;
  variable g_in_tmp1_0 2 internal_t;
  variable g_in_tmp1_1 2 internal_t;
  // 3x3 window of input images
  variable g 4 internal_t;
  variable hx 2 internal_t;
  variable hy 2 internal_t;
  variable f_out 2 out output_t;
  // image size
  parameter INPUT_IMGSIZE_X = 512;
  parameter INPUT_IMGSIZE_Y = 512;
  parameter IMGSIZE_X = 256; #INPUT_IMGSIZE_X/2;
  parameter IMGSIZE_Y = 256; #INPUT_IMGSIZE_Y/2;
  pb_main: par (x >= 0 and x <= IMGSIZE_X-1 and y >= 0 and y <= IMGSIZE_Y-1) {
    /* collect pixels of g_in for convolution, boundary treatment */
    g_in_tmp0_0[x,y] = cast<internal_t>(g_in0_0[x,y]);
    g_in_tmp0_1[x,y] = cast<internal_t>(g_in0_1[x,y]);
    g_in_tmp1_0[x,y] = cast<internal_t>(g_in1_0[x,y]);
    g_in_tmp1_1[x,y] = cast<internal_t>(g_in1_1[x,y]);
    // inner image
    g[x,y,0,0] = g_in_tmp1_1[x-1,y-1] if (x>0 and y>0);
    g[x,y,0,1] = g_in_tmp0_1[x ,y-1] if (x>0 and y>0);
    g[x,y,0,2] = g_in_tmp1_1[x ,y-1] if (x>0 and y>0);
    g[x,y,1,0] = g_in_tmp1_0[x-1,y ] if (x>0 and y>0);
    g[x,y,1,1] = g_in_tmp0_0[x ,y ] if (x>0 and y>0);
    g[x,y,1,2] = g_in_tmp1_0[x ,y ] if (x>0 and y>0);
    g[x,y,2,0] = g_in_tmp1_1[x-1,y ] if (x>0 and y>0);
    g[x,y,2,1] = g_in_tmp0_1[x ,y ] if (x>0 and y>0);
    g[x,y,2,2] = g_in_tmp1_1[x ,y ] if (x>0 and y>0);
    // left border
    g[x,y,0,0] = g_in_tmp1_1[x ,y-1] if (x==0 and y>0);
    g[x,y,0,1] = g_in_tmp0_1[x ,y-1] if (x==0 and y>0);
    g[x,y,0,2] = g_in_tmp1_1[x ,y-1] if (x==0 and y>0);
    g[x,y,1,0] = g_in_tmp1_0[x ,y ] if (x==0 and y>0);
    g[x,y,1,1] = g_in_tmp0_0[x ,y ] if (x==0 and y>0);
    g[x,y,1,2] = g_in_tmp1_0[x ,y ] if (x==0 and y>0);
    g[x,y,2,0] = g_in_tmp1_1[x ,y ] if (x==0 and y>0);
    g[x,y,2,1] = g_in_tmp0_1[x ,y ] if (x==0 and y>0);
    g[x,y,2,2] = g_in_tmp1_1[x ,y ] if (x==0 and y>0);
    // upper border
    g[x,y,0,0] = g_in_tmp1_1[x-1,y ] if (x>0 and y==0);
    g[x,y,0,1] = g_in_tmp0_1[x ,y ] if (x>0 and y==0);
    g[x,y,0,2] = g_in_tmp1_1[x ,y ] if (x>0 and y==0);
    g[x,y,1,0] = g_in_tmp1_0[x-1,y ] if (x>0 and y==0);
    g[x,y,1,1] = g_in_tmp0_0[x ,y ] if (x>0 and y==0);
    g[x,y,1,2] = g_in_tmp1_0[x ,y ] if (x>0 and y==0);
    g[x,y,2,0] = g_in_tmp1_1[x-1,y ] if (x>0 and y==0);
    g[x,y,2,1] = g_in_tmp0_1[x ,y ] if (x>0 and y==0);
    g[x,y,2,2] = g_in_tmp1_1[x ,y ] if (x>0 and y==0);
    // upper left corner
    g[x,y,0,0] = g_in_tmp1_1[x ,y ] if (x==0 and y==0);
    g[x,y,0,1] = g_in_tmp0_1[x ,y ] if (x==0 and y==0);
    g[x,y,0,2] = g_in_tmp1_1[x ,y ] if (x==0 and y==0);
    g[x,y,1,0] = g_in_tmp1_0[x ,y ] if (x==0 and y==0);
    g[x,y,1,1] = g_in_tmp0_0[x ,y ] if (x==0 and y==0);
    g[x,y,1,2] = g_in_tmp1_0[x ,y ] if (x==0 and y==0);
    g[x,y,2,0] = g_in_tmp1_1[x ,y ] if (x==0 and y==0);
    g[x,y,2,1] = g_in_tmp0_1[x ,y ] if (x==0 and y==0);
    g[x,y,2,2] = g_in_tmp1_1[x ,y ] if (x==0 and y==0);

    hy[x,y] = 1*g[x,y,0,0] + 2*g[x,y,0,1] + 1*g[x,y,0,2] + 2*g[x,y,1,0] + 4*g[x,y,1,1] + 2*g[x,y,1,2] + 1*g[x,y,2,0] + 2*g[x,y,2,1] + 1*g[x,y,2,2];

    f_out[x,y] = cast<output_t>(hy[x,y]);
  }
}

Discrete Cosine Transform (DCT)

program DCT_stage2 {
  typealias input_t integer<8>;
  typealias output_t integer<8>;
  variable x0 2 in input_t;
  variable x1 2 in input_t;
  variable x2 2 in input_t;
  variable x3 2 in input_t;
  variable x4 2 in input_t;
  variable x5 2 in input_t;
  variable x6 2 in input_t;
  variable x7 2 in input_t;
  variable z0 2 out output_t;
  variable z1 2 out output_t;
  variable z2 2 out output_t;
  variable z3 2 out output_t;
  variable z4 2 out output_t;
  variable z5 2 out output_t;
  variable z6 2 out output_t;
  variable z7 2 out output_t;
  variable P01 2 integer<16>;
  variable P02 2 integer<16>;
  variable P11 2 integer<16>;
  variable P12 2 integer<16>;
  variable P21 2 integer<16>;
  variable P22 2 integer<16>;
  variable P31 2 integer<16>;
  variable P32 2 integer<16>;
  // zi[k,0]: row k, column i of result matrix
  par (k >= 0 and k <= 7 and l == 0) {
    P01[k,l] = 23170*x0[k,l] + 30274*x2[k,l] + 23170*x4[k,l] + 12540*x6[k,l];
    P02[k,l] = 32138*x1[k,l] + 27246*x3[k,l] + 18205*x5[k,l] + 6393*x7[k,l];
    P11[k,l] = 23170*x0[k,l] + 12540*x2[k,l] - 23170*x4[k,l] - 30274*x6[k,l];
    P12[k,l] = 27246*x1[k,l] - 6393*x3[k,l] - 32138*x5[k,l] - 18205*x7[k,l];
    P21[k,l] = 23170*x0[k,l] - 12540*x2[k,l] - 23170*x4[k,l] + 30274*x6[k,l];
    P22[k,l] = 18205*x1[k,l] - 32138*x3[k,l] + 6393*x5[k,l] + 27246*x7[k,l];
    P31[k,l] = 23170*x0[k,l] - 30274*x2[k,l] + 23170*x4[k,l] - 12540*x6[k,l];
    P32[k,l] = 6393*x1[k,l] - 18205*x3[k,l] + 27246*x5[k,l] - 32138*x7[k,l];
    z0[k,l] = cast<output_t>(P01[k,l] + P02[k,l]);
    z1[k,l] = cast<output_t>(P11[k,l] + P12[k,l]);
    z2[k,l] = cast<output_t>(P21[k,l] + P22[k,l]);
    z3[k,l] = cast<output_t>(P31[k,l] + P32[k,l]);
    z4[k,l] = cast<output_t>(P31[k,l] - P32[k,l]);
    z5[k,l] = cast<output_t>(P21[k,l] - P22[k,l]);
    z6[k,l] = cast<output_t>(P11[k,l] - P12[k,l]);
    z7[k,l] = cast<output_t>(P01[k,l] - P02[k,l]);
  }
}

Finite Impulse Response (FIR)

include("examples/Architecture/adder_16.arch.paro")
include("examples/Architecture/multiplier_16x16_16.arch.paro")
program FIR {
  typealias coeff_t integer<16>;
  typealias input_t integer<16>;
  typealias prod_t integer<16>;
  typealias output_t integer<16>;
  variable a_in 2 in coeff_t;
  variable u_in 2 in input_t;
  variable a 2 coeff_t;
  variable u 2 input_t;
  variable x 2 prod_t;
  variable y 2 output_t;
  variable y_out 2 out output_t;
  parameter N = 5;
  parameter T = 10;
  par (i >= 0 and i <= T-1 and j >= 0 and j <= N-1) {
    a[i,j] = a_in[0,j] if (i == 0);
    a[i,j] = a[i-1,j] if (i > 0);
    u[i,j] = u_in[i,0] if (j == 0);
    u[i,j] = 0 if (i == 0 and j > 0);
    u[i,j] = u[i-1,j-1] if (i > 0 and j > 0);
    x[i,j] = a[i,j] * u[i,j];
    y[i,j] = x[i,j] if (j == 0);
    y[i,j] = y[i,j-1] + x[i,j] if (j > 0);
    y_out[i,j] = y[i,j] if (j == N-1);
  }
}

Smith-Waterman Algorithm (SW)

program smithWaterman {
  variable s 2 integer<16>;
  variable t 2 integer<16>;
  variable A 2 integer<16>;
  variable A1 2 integer<16>;
  variable A2 2 integer<16>;
  variable A3 2 integer<16>;
  variable A4 2 integer<16>;
  variable S 2 in integer<16>;
  variable T 2 in integer<16>;
  variable Vout 2 out integer<16>;
  variable V 2 integer<16>;
  parameter N = 8;
  parameter M = 8;
  par (x >= 0 and x <= N-1 and y >= 0 and y <= M-1) {
    // Input: Embedding
    s[x,y] = S[x,0] if (y == 0);
    t[x,y] = T[0,y] if (x == 0);
    V[x,y] = 0 if (x < 1 and y < 1);
    // localization
    s[x,y] = s[x,y-1] if (y > 0);
    t[x,y] = t[x-1,y] if (x > 0);
    A[x,y] = s[x,y] - t[x,y];
    // if condition
    A1[x,y] = ifrt(A[x,y] == 0, V[x-1,y-1], V[x-1,y-1] + 100);
    A2[x,y] = A1[x,y] - V[x-1,y] - 10;
    A3[x,y] = ifrt(A2[x,y] > 0, A1[x,y], V[x-1,y] + 10);
    A4[x,y] = A3[x,y] - V[x,y-1] - 100;
    V[x,y] = ifrt(A4[x,y] > 0, A3[x,y], V[x,y-1] + 100) if (x > 0 and y > 0);
    Vout[x,y] = V[x,y];
  }
}


German Part

German Title and Summary

Synthese und Exploration von Schleifenbeschleunigern für Ein-Chip-Systeme

A "system-on-a-chip (SoC)" denotes the integration of different components, such as conventional processors together with programmable or non-programmable loop accelerators, on a single piece of silicon. Many computationally intensive loop programs, e.g., from the domains of signal processing or image processing, can be executed efficiently on loop accelerators. Owing to these accelerators, SoC platforms can optimize the performance, area cost, and power consumption of embedded applications and reduce them by orders of magnitude. Although such SoCs can be realized on FPGAs or ASICs, the effort for realizing the loop accelerators is still very high. To remedy this, this work presents the PARO design methodology. In essence, it is a compiler that generates an RTL description of the loop accelerators in the form of processor arrays. Beyond that, this dissertation presents novel contributions in the areas of compiler transformations, synthesis, and design space exploration of loop accelerators.

Tiling is a well-known compiler transformation. This dissertation presents hierarchical tiling, a transformation that restructures the loop descriptions in the high-level language in order to exploit multiple levels of parallelism and memory in an accelerator architecture. The transformation not only partitions the iteration space of the loops, but also accounts for the new data dependencies between the multiple hierarchies of tiles and adapts the control conditions accordingly. With this transformation, the system designer is able to specify the degree of parallelism (number of PEs), the local memory usage, and the required communication bandwidth of the accelerator architecture. Other design methodologies contain transformations that can take only one of these criteria into account. The transformation presented in this work produces a loop program that contains many more control conditions, which can cause a performance bottleneck. For this reason, a generic method for control generation was presented, which combines local and global control facilities for an efficient execution of the loops. In addition, a matching memory architecture with multiple memory banks, address generators, and I/O controllers is generated automatically, so that communication does not become the bottleneck. With all these techniques, one is able to generate loop accelerators that achieve average gains of factors of 2.5, 4.5, and 50 in terms of area, power, and performance over embedded processors. The presented design flow leads to a 100-fold increase in productivity, owing to the ease of programming in a high-level language instead of cumbersome RTL coding in VHDL.

Typical streaming applications contain several communicating loops. For a modular representation of the parallelism of communicating loops and their mapping information as a dependence graph in the polyhedral model, the so-called loop graph was developed. A unified methodology for the projection of a loop graph in the polyhedral model onto the so-called "windowed synchronous data flow (WSDF)" model was presented. The so-called multi-dimensional FIFOs for data transfer and synchronization between the loop accelerators were then derived from the WSDF parameters. Hence, the communication hardware is not the bottleneck of an accelerator chain consisting of several communicating loop accelerators and multi-dimensional FIFOs. The accelerators can also be used as co-processors in an SoC. To support the integration of accelerators in an SoC, a methodology for the automated generation of the memory map, a driver software, and a hardware interface was presented. The software drivers are the processor programs that are responsible for the data transfer and synchronization to/from the accelerators. The interface implements the protocol conversion of signals for the integration of accelerators over a system bus. Experimental results further show that the performance gain of a hardware/software co-design is an order of magnitude lower than that of a pure hardware implementation. This is due to the large hardware/software communication overhead. Nevertheless, the accelerators offer a performance advantage of 2- to 20-fold over pure software solutions and, moreover, a considerably lower power consumption.

Because of the multitude of architecture and compiler design decisions, the selection of an optimal architecture can be very laborious: the complete exploration of the design space requires very long run times, which is why we employ modern meta search heuristics, such as evolutionary algorithms, and the efficient estimation of objectives such as area cost and power consumption. This not only reduces the run time of the exploration for relevant accelerator benchmarks, but also delivers better results than random search. For given system constraints, a suitable accelerator must be selected from the Pareto-optimal set of designs. The proposed analytical method for determining the best-fit accelerator is based on real-time calculus. In summary, this work provides important contributions to the realization of loop accelerators.


Bibliography

[1] Santosh G. Abraham and B. R. Rau. Efficient design space exploration in PICO. In CASES ’00: Proceedings of the 2000 international conference on Compilers, Architecture, and Synthesis for embedded systems, pages 71–79, New York, NY, USA, 2000. ACM.

[2] Giovanni Agosta, Gianluca Palermo, and Cristina Silvano. Efficient architec- ture/compiler co-exploration using analytical models. Design Automation for Embedded Systems, 11(1):1–23, 2007.

[3] Altera. White paper: Implementation of the Smith-Waterman algorithm on a reconfigurable supercomputing platform. Whitepaper, September 2007. www.altera.com/literature/wp/wp-01035.pdf.

[4] Hideharu Amano. A Survey on Dynamically Reconfigurable Processors. IEICE Transactions on Communications, E89-B:3179–3187, 2006.

[5] Abdelkader Amar, Pierre Boulet, and Philippe Dumont. Projection of the Array-OL Specification Language onto the Kahn Process Network Computation Model. In International Symposium on Parallel Architectures, Algorithms, and Networks, pages 496–503, 2005.

[6] Corinne Ancourt and François Irigoin. Scanning polyhedra with do loops. In Proceedings of the third ACM SIGPLAN symposium on Principles and practice of parallel programming, PPOPP '91, pages 39–50, New York, NY, USA, 1991. ACM.

[7] Krste Asanovic, Rastislav Bodik, James Demmel, Tony Keaveny, Kurt Keutzer, John Kubiatowicz, Nelson Morgan, David Patterson, Koushik Sen, John Wawrzynek, David Wessel, and Katherine Yelick. A view of the parallel computing landscape. Communications of the ACM, 52:56–67, October 2009.

[8] D. Avis. lrs: A Revised Implementation of the Reverse Search Vertex Enumeration Algorithm. In G. Kalai and G. Ziegler, editors, Polytopes – Combinatorics and Computation, pages 177–198. Birkhäuser-Verlag, DMV Seminar Band 29, 2000.


[9] P. Banerjee, V. Saxena, J. Uribe, M. Haldar, A. Nayak, V. Kim, D. Bagchi, S. Pal, N. Tripathi, and R. Anderson. Making area-performance tradeoffs at the high level using the AccelFPGA compiler for FPGAs. In FPGA '03: Proceedings of the 2003 ACM/SIGDA eleventh international symposium on Field programmable gate arrays, pages 237–243, New York, NY, USA, 2003. ACM.

[10] Utpal K. Banerjee. Dependence Analysis for Supercomputing. Kluwer Academic Publishers, Norwell, MA, USA, 1988.

[11] Kevin J. Barker, Kei Davis, Adolfy Hoisie, Darren J. Kerbyson, Mike Lang, Scott Pakin, and Jose C. Sancho. Entering the petaflop era: the architecture and performance of Roadrunner. In SC ’08: Proceedings of the 2008 ACM/IEEE conference on Supercomputing, pages 1–11, Piscataway, NJ, USA, 2008. IEEE Press.

[12] Muthu Manikandan Baskaran, Uday Bondhugula, Sriram Krishnamoorthy, J. Ramanujam, Atanas Rountev, and P. Sadayappan. A compiler framework for optimization of affine loop nests for GPGPUs. In ICS ’08: Proceedings of the 22nd annual international conference on Supercomputing, pages 225–234, New York, NY, USA, 2008. ACM.

[13] C. Bastoul. Code generation in the polyhedral model is easier than you think. In PACT'13: IEEE International Conference on Parallel Architecture and Compilation Techniques, pages 7–16, September 2004.

[14] V. Baumgarte, G. Ehlers, Frank May, A. Nückel, Martin Vorbach, and Markus Weinhardt. PACT XPP – A Self-Reconfigurable Data Processing Architecture. The Journal of Supercomputing, 26(2):167–184, 2003.

[15] M. Bednara and J. Teich. Interface Synthesis for FPGA Based VLSI Processor Arrays. In Proc. of the International Conference on Engineering of Reconfigurable Systems and Algorithms (ERSA 02), pages 74–80, Las Vegas, Nevada, U.S.A., June 2002.

[16] Marcus Bednara. Design Automation for Massively Parallel Processor Arrays: Transforming Regular Algorithms to Reconfigurable Hardware. PhD thesis, University of Erlangen-Nuremberg, Department of Computer Science 12, Erlangen, Germany, 2004.

[17] Luca Benini, Davide Bertozzi, Davide Bruni, Nicola Drago, Franco Fummi, and Massimo Poncino. SystemC Cosimulation and Emulation of Multiprocessor SoC Designs. Computer, 36(4):53–59, 2003.

200 Bibliography

[18] Greet Bilsen, Marc Engels, Rudy Lauwereins, and Jean Peperstraete. Cyclo-static dataflow. IEEE Transactions on Signal Processing, 44(2):397–408, February 1996.

[19] A. P. W. Bohm, B. Draper, W. Najjar, J. Hammes, R. Rinker, M. Chawathe, and C. Ross. One-Step Compilation of Image Processing Applications to FPGAs. In FCCM '01: Proceedings of the 9th Annual IEEE Symposium on Field-Programmable Custom Computing Machines, pages 209–218, Washington, DC, USA, 2001. IEEE Computer Society.

[20] Egor Bondarev, Michel R. V. Chaudron, and Erwin A. de Kock. Exploring Performance Trade-offs of a JPEG Decoder using the DeepCompass Framework. In Proceedings of the International Workshop on Software and Performance (WOSP), pages 153–163, Buenos Aires, Argentina, 2007.

[21] Uday Bondhugula, Albert Hartono, J. Ramanujam, and P. Sadayappan. A practical automatic polyhedral program optimization system. In ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), pages 101–113, June 2008.

[22] P. Boulet, J.-L. Dekeyser, J.-L. Levaire, P. Marquet, J. Soula, and A. Demeure. Visual data-parallel programming for signal processing applications. In 9th Euromicro Workshop on Parallel and Distributed Processing, PDP 2001, pages 105–112, 2001.

[23] Pierre Boulet and Paul Feautrier. Scanning polyhedra without do-loops. In Proceedings of the 1998 International Conference on Parallel Architectures and Compilation Techniques, PACT ’98, pages 4–11, Washington, DC, USA, 1998. IEEE Computer Society.

[24] Jichun Bu and Ed F. Deprettere. Processor clustering for the design of optimal fixed-size systolic arrays. In Valero, Kung, Lang, and Fortes, editors, ASAP: IEEE conference on Application Specific Array Processor, pages 402–413. IEEE Computer Society Press, 1991.

[25] Joseph Buck, Soonhoi Ha, Edward A. Lee, and David G. Messerschmitt. Ptolemy: a framework for simulating and prototyping heterogeneous systems. In Readings in hardware/software co-design, pages 527–543, Norwell, MA, USA, 2002. Kluwer Academic Publishers.

[26] Cadence Design Systems, Inc. C-to-Silicon Compiler, 2009. http://www.cadence.com/products/sd/silicon_compiler.

[27] F. Catthoor, K. Danckaert, C. Kulkarni, E. Brockmeyer, P. G. Kjeldsberg, T. Van Achteren, and T. Omnes. Data access and storage management for embedded programmable processors. ISBN 0-7923-7689-7. Kluwer Academic Publishers, Boston, 2002.

[28] Chaitali Chakrabarti. A DWT-based encoder architecture for symmetrically extended images. In Proceedings of the International Symposium on Circuits and Systems, pages 123–126, 1999.

[29] Deming Chen, Jason Cong, Yiping Fan, and Zhiru Zhang. High-Level Power Estimation and Low-Power Design Space Exploration for FPGAs. In ASP-DAC '07: Proceedings of the 2007 Asia and South Pacific Design Automation Conference, pages 529–534, Washington, DC, USA, 2007. IEEE Computer Society.

[30] Y. K. Chen, J. Chhugani, P. Dubey, C. J. Hughes, D. Kim, S. Kumar, V. W. Lee, A. D. Nguyen, and M. Smelyanskiy. Convergence of recognition, mining, and synthesis workloads and its implications. In Proceedings of the IEEE, volume 96, pages 790–807, 2008.

[31] Pai H. Chou, Ross B. Ortega, and Gaetano Borriello. The Chinook hardware/software co-synthesis system. In ISSS '95: Proceedings of the 8th international symposium on System synthesis, pages 22–27, New York, NY, USA, 1995. ACM.

[32] Philippe Coussy and Adam Morawiec, editors. High-Level Synthesis from Algorithm to Digital Circuit, volume XVI. Springer, 2008.

[33] Alain Darte, Steven Derrien, and Tanguy Risset. Hardware/software interface for multi-dimensional processor arrays. In ASAP '05: Proceedings of the 2005 IEEE International Conference on Application-Specific Systems, Architecture Processors, pages 28–35, Washington, DC, USA, 2005. IEEE Computer Society.

[34] Alain Darte and Frédéric Vivien. Revisiting the decomposition of Karp, Miller and Winograd. In ASAP '95: Proceedings of the IEEE International Conference on Application Specific Array Processors, pages 13–25, Washington, DC, USA, 1995. IEEE Computer Society.

[35] Abhishek Das, William J. Dally, and Peter Mattson. Compiling for stream processing. In PACT '06: Proceedings of the 15th international conference on Parallel Architectures and Compilation Techniques, pages 33–42, New York, NY, USA, 2006. ACM.

[36] Jean-Marc Daveau, Gilberto Fernandes Marchioro, Tarek Ben-Ismail, and Ahmed Amine Jerraya. Protocol selection and interface generation for hw-sw codesign. IEEE Transactions on Very Large Scale Integration Systems, 5(1):136–144, 1997.


[37] Kalyanmoy Deb, Samir Agrawal, Amrit Pratap, and T. Meyarivan. A fast elitist non-dominated sorting genetic algorithm for multi-objective optimisation: NSGA-II. In PPSN VI: Proceedings of the 6th International Conference on Parallel Problem Solving from Nature, pages 849–858, London, UK, 2000. Springer-Verlag.

[38] Ed F. Deprettere, Todor Stefanov, Shuvra S. Bhattacharyya, and Mainak Sen. Affine nested loop programs and their binary parameterized dataflow graph counterparts. In ASAP: IEEE International Conference on Application-Specific Systems, Architecture and Processors, pages 186–190, 2006.

[39] Steven Derrien and Sanjay Rajopadhye. Energy/Power Estimation of Regular Processor Arrays. In ISSS '02: Proceedings of the 15th international symposium on System Synthesis, pages 50–55, New York, NY, USA, 2002. ACM.

[40] Frederic Desprez, Jack Dongarra, Antoine Petitet, Cyril Randriamaro, and Yves Robert. Scheduling block-cyclic array redistribution. IEEE Transactions on Parallel and Distributed Systems, 9:192–205, 1998.

[41] Pedro C. Diniz, Mary W. Hall, Joonseok Park, Byoungro So, and Heidi E. Ziegler. Automatic mapping of C to FPGAs with the DEFACTO compilation and synthesis system. Microprocessors and Microsystems, 29(2-3):51–62, 2005.

[42] Hritam Dutta, Frank Hannig, Alexey Kupriyanov, Dmitrij Kissler, Jürgen Teich, Rainer Schaffer, Sebastian Siegel, Renate Merker, and Bernard Pottier. Massively Parallel Processor Architectures: A Co-design Approach. In Proceedings of the 3rd International Workshop on Reconfigurable Communication Centric System-on-Chips (ReCoSoC), pages 61–68, Montpellier, France, June 2007. Univ. Montpellier.

[43] Hritam Dutta, Frank Hannig, Holger Ruckdeschel, and Jürgen Teich. Efficient Control Generation for Mapping Nested Loop Programs onto Processor Arrays. Journal of Systems Architecture, 53(5–6):300–309, May 2007.

[44] Hritam Dutta, Frank Hannig, Moritz Schmid, and Joachim Keinert. Modeling and Synthesis of Communication Subsystems for Loop Accelerator Pipelines. In Proceedings of the 21st IEEE International Conference on Application-specific Systems, Architectures, and Processors (ASAP), pages 125–132, Rennes, France, July 2010. IEEE Computer Society.

[45] Hritam Dutta, Frank Hannig, and Jürgen Teich. Controller Synthesis for Mapping Partitioned Programs on Array Architectures. Technical Report 03–2005, University of Erlangen-Nuremberg, Department of CS 12, Hardware-Software-Co-Design, Am Weichselgarten 3, 91058 Erlangen, Germany, November 2005.

[46] Hritam Dutta, Frank Hannig, and Jürgen Teich. A Formal Methodology for Hierarchical Partitioning of Piecewise Linear Algorithms. Technical Report 04-2006, University of Erlangen-Nuremberg, Department of CS 12, Hardware-Software-Co-Design, Am Weichselgarten 3, 91058 Erlangen, Germany, April 2006.

[47] Hritam Dutta, Frank Hannig, and Jürgen Teich. Controller Synthesis for Mapping Partitioned Programs on Array Architectures. In Werner Grass, Bernhard Sick, and Klaus Waldschmidt, editors, Proceedings of the 19th International Conference on Architecture of Computing Systems (ARCS), volume 3894 of Lecture Notes in Computer Science (LNCS), pages 176–191, Frankfurt am Main, Germany, March 2006. Springer.

[48] Hritam Dutta, Frank Hannig, and Jürgen Teich. Hierarchical Partitioning for Piecewise Linear Algorithms. In Proceedings of the 5th International Conference on Parallel Computing in Electrical Engineering (PARELEC), pages 153–160, Bialystok, Poland, September 2006. IEEE Computer Society.

[49] Hritam Dutta, Frank Hannig, and Jürgen Teich. Mapping of Nested Loop Programs onto Massively Parallel Processor Arrays with Memory and I/O Constraints. In Friedhelm Meyer auf der Heide and Burkhard Monien, editors, Proceedings of the 6th International Heinz Nixdorf Symposium, New Trends in Parallel & Distributed Computing, volume 181 of HNI-Verlagsschriftenreihe, pages 97–119, Paderborn, Germany, January 2006. Heinz Nixdorf Institut, Universität Paderborn.

[50] Hritam Dutta, Frank Hannig, and Jürgen Teich. PARO: A Design Tool for Automatic Generation of Hardware Accelerators. In Proceedings of ACACES 2008 Poster Abstracts: Advanced Computer Architecture and Compilation for Embedded Systems, pages 317–320, L'Aquila, Italy, July 2008. Academia Press, Ghent.

[51] Hritam Dutta, Frank Hannig, and Jürgen Teich. The PARO Design Tool for Automatic Generation of Hardware Accelerators, March 2008. Interactive Presentation at the Friday Workshop, The New Wave of High-Level Synthesis, Design, Automation and Test in Europe (DATE), Munich, Germany.

[52] Hritam Dutta, Frank Hannig, and Jürgen Teich. Performance Matching of Hardware Acceleration Engines for Heterogeneous MPSoC using Modular Performance Analysis. In Proceedings of the 22nd International Conference on Architecture of Computing Systems (ARCS), volume 5455 of Lecture Notes in Computer Science (LNCS), pages 233–245, Delft, The Netherlands, January 2009. Springer.

[53] Hritam Dutta, Frank Hannig, and Jürgen Teich. PARO – A Design Tool for Synthesis of Hardware Accelerators for SoCs, March 2010. Tool Presentation at the University Booth at Design, Automation and Test in Europe (DATE), Dresden, Germany.

[54] Hritam Dutta, Frank Hannig, Jürgen Teich, Benno Heigl, and Heinz Hornegger. A Design Methodology for Hardware Acceleration of Adaptive Filter Algorithms in Image Processing. In Proceedings of the 17th IEEE International Conference on Application-specific Systems, Architectures, and Processors (ASAP), pages 331–337, Steamboat Springs, CO, USA, September 2006. IEEE Computer Society.

[55] Hritam Dutta, Dmitrij Kissler, Frank Hannig, Alexey Kupriyanov, Jürgen Teich, and Bernard Pottier. A Holistic Approach for Tightly Coupled Reconfigurable Parallel Processors. Microprocessors and Microsystems, 33(1):53–62, February 2009.

[56] Hritam Dutta, Jiali Zhai, Frank Hannig, and Jürgen Teich. Impact of Loop Tiling on the Controller Logic of Hardware Acceleration Engines. In Proceedings of the 20th IEEE International Conference on Application-specific Systems, Architectures, and Processors (ASAP), pages 161–168, Boston, MA, USA, July 2009. IEEE Computer Society.

[57] U. Eckhardt and R. Merker. Hierarchical Algorithm Partitioning at System Level for an Improved Utilization of Memory Structures. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 18(1):14–24, 1999.

[58] Uwe Eckhardt. Algorithmus-Architektur-Codesign für den Entwurf digitaler Systeme mit eingebettetem Prozessorarray und Speicherhierarchie. PhD thesis, Technische Universität Dresden, June 2001.

[59] Uwe Eckhardt and Renate Merker. Optimization of the background memory utilization by partitioning. In Proceedings of the 10th international symposium on System synthesis, ISSS ’97, pages 82–89, Washington, DC, USA, 1997. IEEE Computer Society.

[60] Sven Eisenhardt, Thomas Schweizer, Julio A. de Oliveira Filho, Tobias Oppold, Wolfgang Rosenstiel, Alexander Thomas, Jürgen Becker, Frank Hannig, Dmitrij Kissler, Hritam Dutta, Jürgen Teich, Heiko Hinkelmann, Peter Zipf, and Manfred Glesner. SPP1148 Booth: Coarse-Grained Reconfiguration. In Proceedings of the International Conference on Field Programmable Logic and Applications (FPL), pages 349–350, Heidelberg, Germany, September 2008.

[61] Michael Eisenring and Jürgen Teich. Domain-specific interface generation from dataflow specifications. In CODES/CASHE '98: Proceedings of the 6th International Workshop on Hardware/Software Codesign, pages 43–47, Washington, DC, USA, 1998. IEEE Computer Society.

[62] Petru Eles, Krzysztof Kuchcinski, and Zebo Peng. System Synthesis with VHDL: A Transformational Approach. Kluwer Academic Publishers, Norwell, MA, USA, 1998.

[63] Cagkan Erbas, Selin Cerav-Erbas, and Andy D. Pimentel. Multiobjective optimization and evolutionary algorithms for the application mapping problem in multiprocessor system-on-chip design. IEEE Transactions on Evolutionary Computation, 10(3):358–374, 2006.

[64] Paul Feautrier. Parametric integer programming. RAIRO Recherche Opérationnelle, 22(3):243–268, 1988.

[65] Paul Feautrier. Dataflow analysis of array and scalar references. International Journal of Parallel Programming, 20(1):23–53, February 1991.

[66] Paul Feautrier. The Data Parallel Programming Model: Foundations, HPF Realization, and Scientific Applications, chapter Automatic Parallelization in the Polytope Model, pages 79–103. Lecture Notes in Computer Science. Springer-Verlag, 1996.

[67] Paul Feautrier. Scalable and structured scheduling. International Journal of Parallel Programming, 34(5):459–487, 2006.

[68] Wu-Chun Feng and Tom Scogland. The Green500 List: Year One. In 5th IEEE Workshop on High-Performance, Power-Aware Computing (in conjunction with the 23rd International Parallel & Distributed Processing Symposium), pages 1–7, Rome, Italy, May 2009.

[69] Dirk Fimmel and Renate Merker. Design of processor arrays for reconfigurable architectures. The Journal of Supercomputing, 19(1):41–56, 2001.

[70] Dirk Fischer, Jürgen Teich, Ralph Weper, Uwe Kastens, and Michael Thies. Design space characterization for architecture/compiler co-exploration. In CASES '01: Proceedings of the 2001 International Conference on Compilers, Architecture, and Synthesis for Embedded Systems, pages 108–115, New York, NY, USA, 2001. ACM.

[71] Forte Design Systems. Forte Cynthesizer. www.forteds.com.

[72] Antoine Fraboulet and Tanguy Risset. Master interface for on-chip hardware accelerator burst communications. Journal of VLSI Signal Processing Systems, 49(1):73–85, 2007.

[73] Grigori Fursin, Cupertino Miranda, Olivier Temam, Mircea Namolaru, Elad Yom-Tov, Ayal Zaks, Bilha Mendelson, Phil Barnard, Elton Ashton, Eric Courtois, Francois Bodin, Edwin Bonilla, John Thomson, Hugh Leather, Chris Williams, and Michael O'Boyle. MILEPOST GCC: machine learning based research compiler. In Proceedings of the GCC Developers' Summit, pages 1–13, Ottawa, Canada, June 2008.

[74] William J. Gilbert. Bricklaying and the Hermite Normal Form. The American Mathematical Monthly, 100(3):242–245, 1993.

[75] Ricardo Gonzalez and Mark Horowitz. Energy Dissipation in General Purpose Processors. IEEE Journal of Solid-State Circuits, 31(9):1277–1284, 1996.

[76] Brian J. Gough and Richard M. Stallman. An Introduction to GCC. Network Theory Ltd., 2004.

[77] Georgios Goumas, Maria Athanasaki, and Nectarios Koziris. Automatic code generation for executing tiled nested loops onto parallel architectures. In SAC ’02: Proceedings of the 2002 ACM symposium on Applied computing, pages 876–881, New York, NY, USA, 2002. ACM.

[78] Martin Griebl, Peter Faber, and Christian Lengauer. Space-time mapping and tiling: a helpful combination. Concurrency and Computation: Practice and Experience, 16(3):221–246, March 2004.

[79] Matthias Gries. Methods for evaluating and covering the design space during early design development. Technical Report UCB/ERL M03/32, Electronics Research Lab, University of California at Berkeley, August 2003.

[80] Armin Größlinger, Martin Griebl, and Christian Lengauer. Introducing non-linear parameters to the polyhedron model. In Michael Gerndt and Edmond Kereku, editors, Proc. 11th Workshop on Compilers for Parallel Computers (CPC 2004), Research Report Series, pages 1–12. LRR-TUM, Technische Universität München, July 2004.

[81] Anne-Claire Guillou, Patrice Quinton, and Tanguy Risset. Hardware synthesis for systems of recurrence equations with multidimensional schedule. International Journal of Embedded Systems, 3(4):271–284, 2008.

[82] Zhi Guo, Walid Najjar, and Betul Buyukkurt. Efficient hardware code generation for FPGAs. ACM Transactions on Architecture and Code Optimization, 5(1):1–26, 2008.

[83] Sumit Gupta, Nikil D. Dutt, Rajesh K. Gupta, and Alexandru Nicolau. SPARK: A High-Level Synthesis Framework for Applying Parallelizing Compiler Transformations. In Proceedings of the International Conference on VLSI Design, pages 461–466, January 2003.

[84] Soonhoi Ha, Sungchan Kim, Choonseung Lee, Youngmin Yi, Seongnam Kwon, and Young-Pyo Joo. PeaCE: A hardware-software codesign environment for multimedia embedded systems. ACM Transactions on Design Automation of Electronic Systems (TODAES), 12(3):1–25, 2007.

[85] Wolfgang Haid and Lothar Thiele. Complex task activation schemes in system level performance analysis. In Proc. 5th Int'l Conf. on Hardware/Software Codesign and System Synthesis (CODES+ISSS), pages 173–178, Salzburg, Austria, October 2007. ACM Press.

[86] Frank Hannig. Scheduling Techniques for High-Throughput Loop Accelerators. PhD thesis, University of Erlangen-Nuremberg, Germany, Munich, August 2009.

[87] Frank Hannig, Hritam Dutta, Alexey Kupriyanov, Jürgen Teich, Rainer Schaffer, Sebastian Siegel, Renate Merker, Ronan Keryell, Bernard Pottier, Daniel Chillet, Daniel Ménard, and Olivier Sentieys. Co-Design of Massively Parallel Embedded Processor Architectures. In Proceedings of the first International Workshop on Reconfigurable Communication Centric System-on-Chips (ReCoSoC), pages 27–34, Montpellier, France, June 2005. Univ. Montpellier II.

[88] Frank Hannig, Hritam Dutta, and Jürgen Teich. Mapping of Regular Nested Loop Programs to Coarse-grained Reconfigurable Arrays – Constraints and Methodology. In Proceedings of the 18th International Parallel and Distributed Processing Symposium (IPDPS), Santa Fe, NM, USA, April 2004. IEEE Computer Society.

[89] Frank Hannig, Hritam Dutta, and Jürgen Teich. Regular Mapping for Coarse-grained Reconfigurable Architectures. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), volume V, pages 57–60, Montréal, Quebec, Canada, May 2004. IEEE Signal Processing Society.

[90] Frank Hannig, Hritam Dutta, and Jürgen Teich. Mapping a Class of Dependence Algorithms to Coarse-grained Reconfigurable Arrays: Architectural Parameters and Methodology. International Journal of Embedded Systems, 2(1/2):114–127, January 2006.

[91] Frank Hannig, Hritam Dutta, and Jürgen Teich. Parallelization Approaches for Hardware Accelerators – Loop Unrolling versus Loop Partitioning. In Proceedings of the 22nd International Conference on Architecture of Computing Systems (ARCS), volume 5455 of Lecture Notes in Computer Science (LNCS), pages 16–27, Delft, The Netherlands, March 2009. Springer.

[92] Frank Hannig, Hritam Dutta, and Jürgen Teich. PARO – A Design Tool for the Automatic Generation of Hardware Accelerators, July 2009. Tool Presentation at the Demo Night of the 20th IEEE International Conference on Application-specific Systems, Architectures, and Processors (ASAP).

[93] Frank Hannig, Holger Ruckdeschel, Hritam Dutta, and Jürgen Teich. PARO: Synthesis of Hardware Accelerators for Multi-Dimensional Dataflow-Intensive Applications. In Proceedings of the Fourth International Workshop on Applied Reconfigurable Computing (ARC), volume 4943 of Lecture Notes in Computer Science (LNCS), pages 287–293, London, United Kingdom, March 2008. Springer.

[94] Frank Hannig, Holger Ruckdeschel, and Jürgen Teich. The PAULA Language for Designing Multi-Dimensional Dataflow-Intensive Applications. In Proceedings of the GI/ITG/GMM-Workshop – Methoden und Beschreibungssprachen zur Modellierung und Verifikation von Schaltungen und Systemen, pages 129–138, Freiburg, Germany, March 2008. Shaker.

[95] Frank Hannig and Jürgen Teich. Dynamic Piecewise Linear/Regular Algorithms. In Proceedings of the Fourth International Conference on Parallel Computing in Electrical Engineering (PARELEC), pages 79–84, Dresden, Germany, September 2004. IEEE Computer Society.

[96] C. A. R. Hoare. Algebraic specifications and proofs for communicating sequential processes. In Proceedings of the NATO Advanced Study Institute on Logic of Programming and Calculi of Discrete Design, pages 277–300, New York, NY, USA, 1987. Springer-Verlag New York, Inc.

[97] Glenn H. Holloway and Michael D. Smith. The Machine-SUIF Control Flow Graph Library. Technical report, Division of Engineering and Applied Sciences, Harvard University, 1998.

[98] Amir Hormati, Manjunath Kudlur, Scott Mahlke, David Bacon, and Rodric Rabbah. Optimus: efficient realization of streaming applications on FPGAs. In CASES ’08: Proceedings of the 2008 international conference on Compilers, architectures and synthesis for embedded systems, pages 41–50, New York, NY, USA, 2008. ACM.

[99] Kai Huang, Sang-Il Han, Katalin Popovici, Lisane B. de Brisolara, Xavier Guerin, Lei Li, Xiaolang Yan, Soo-Ik Chae, Luigi Carro, and Ahmed Amine Jerraya. Simulink-Based MPSoC Design Flow: Case Study of Motion-JPEG and H.264. In Proceedings of the Design Automation Conference (DAC), pages 39–42, 2007.

[100] ILOG, Inc. CPLEX solver, 2003. http://www.ilog.fr/products/cplex/.

[101] International Technology Roadmap for Semiconductors. International Technology Roadmap for Semiconductors, 2007 Edition, 2007. http://www.itrs.net/Links/2007ITRS/Home2007.htm.

[102] Hyunuk Jung and Soonhoi Ha. Hardware synthesis from coarse-grained dataflow specification for fast HW/SW cosynthesis. In CODES+ISSS '04: Proceedings of the 2nd IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis, pages 24–29, 2004.

[103] James A. Kahle, Michael N. Day, Peter Hofstee, Charles R. Johns, Theodore R. Maeurer, and David J. Shippy. Introduction to the Cell multiprocessor. IBM Journal of Research and Development, 49(4-5):589–604, 2005.

[104] Gilles Kahn. The semantics of a simple language for parallel programming. In Proceedings of IFIP Congress 74, Stockholm, Sweden, pages 471–475, 1974.

[105] Richard M. Karp, Raymond E. Miller, and Shmuel Winograd. The organization of computations for uniform recurrence equations. Journal of the Association for Computing Machinery, 14(3):563–590, 1967.

[106] Joachim Keinert. Data Flow Based System Level Modeling, Analysis, and Synthesis of High-Performance Streaming Image Processing Applications. PhD thesis, University of Erlangen-Nuremberg, 2009.

[107] Joachim Keinert, Hritam Dutta, Frank Hannig, Christian Haubelt, and Jürgen Teich. Model-Based Synthesis and Optimization of Static Multi-Rate Image Processing Algorithms. In Proceedings of the Conference on Design, Automation and Test in Europe (DATE), pages 135–140, Nice, France, April 2009. IEEE Computer Society.

[108] Joachim Keinert, Christian Haubelt, and Jürgen Teich. Modeling and analysis of windowed synchronous algorithms. In International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages III-892–III-895, 2006.

[109] Joachim Keinert, Christian Haubelt, and Jürgen Teich. Synthesis of multi-dimensional high-speed FIFOs for out-of-order communication. In Architecture of Computing Systems (ARCS 2008), volume 4934 of Lecture Notes in Computer Science (LNCS), pages 130–143. Springer, 2008.

[110] Bart Kienhuis, Ed F. Deprettere, Pieter van der Wolf, and Kees A. Vissers. A methodology to design programmable embedded systems – the Y-chart approach. In Embedded Processor Design Challenges, pages 18–37, 2002.

[111] DaeGon Kim, Lakshminarayanan Renganarayanan, Dave Rostron, Sanjay Rajopadhye, and Michelle Mills Strout. Multi-level tiling: M for the price of one. In SC '07: Proceedings of the 2007 ACM/IEEE Conference on Supercomputing, pages 1–12, New York, NY, USA, 2007. ACM.

[112] Dmitrij Kissler, Hritam Dutta, Alexey Kupriyanov, Frank Hannig, and Jürgen Teich. A High-Speed Dynamic Reconfigurable Multilevel Parallel Architecture, March 2008. Hardware and Software Demo at the University Booth at Design, Automation and Test in Europe (DATE), Munich, Germany.

[113] P. M. W. Knijnenburg, T. Kisuki, and M. F. P. O'Boyle. Iterative compilation. In Embedded Processor Design Challenges: Systems, Architectures, Modeling, and Simulation (SAMOS), pages 171–187, New York, NY, USA, 2002. Springer-Verlag New York, Inc.

[114] Vyas Krishnan and Srinivas Katkoori. A genetic algorithm for the design space exploration of datapaths during high-level synthesis. IEEE Transactions on Evolutionary Computation, 10(3):213–229, 2006.

[115] Manjunath Kudlur, Kevin Fan, and Scott Mahlke. Streamroller: automatic synthesis of prescribed throughput accelerator pipelines. In CODES+ISSS ’06: Proceedings of the 4th international conference on Hardware/software codesign and system synthesis, pages 270–275, New York, NY, USA, 2006. ACM.

[116] Simon Kuenzli. Efficient Design Space Exploration for Embedded Systems. PhD thesis, ETH Zurich, April 2006.

[117] Robert H. Kuhn. Transforming Algorithms for Single-Stage and VLSI Architectures. In Workshop on Interconnection Networks for Parallel and Distributed Processing, pages 11–19, April 1980.

[118] Dhananjay Kulkarni, Walid A. Najjar, Robert Rinker, and Fadi J. Kurdahi. Compile-time area estimation for LUT-based FPGAs. ACM Transactions on Design Automation of Electronic Systems, 11(1):104–122, 2006.

[119] Chris Lattner and Vikram Adve. LLVM: A Compilation Framework for Lifelong Program Analysis & Transformation. In Proceedings of the 2004 International Symposium on Code Generation and Optimization (CGO'04), pages 75–88, Palo Alto, California, March 2004.

[120] David Lau, Orion Pritchard, and Philippe Molson. Automated Generation of Hardware Accelerators with Direct Memory Access from ANSI/ISO Standard C Functions. In FCCM '06: Proceedings of the 14th Annual IEEE Symposium on Field-Programmable Custom Computing Machines, pages 45–56, Washington, DC, USA, 2006. IEEE Computer Society.

[121] Edward Ashford Lee and David Messerschmitt. Static scheduling of synchronous data flow programs for digital signal processing. IEEE Transactions on Computers, 36:24–35, 1987.

[122] Vincent Lefebvre and Paul Feautrier. Optimizing storage size for static control programs in automatic parallelizers. In Euro-Par ’97: Proceedings of the Third International Euro-Par Conference on Parallel Processing, pages 356–363, London, UK, 1997. Springer-Verlag.

[123] Christian Lengauer. Loop parallelization in the polytope model. In CONCUR ’93: Proceedings of the 4th International Conference on Concurrency Theory, pages 398–416, London, UK, 1993. Springer-Verlag.

[124] E. Lindholm, J. Nickolls, S. Oberman, and J. Montrym. NVIDIA Tesla: A Unified Graphics and Computing Architecture. IEEE Micro, 28(2):39–55, 2008.

[125] M. Luthra, Sumit Gupta, Nikil Dutt, Rajesh Gupta, and A. Nicolau. Interface synthesis using memory mapping for an FPGA platform. In Proceedings of the 21st International Conference on Computer Design (ICCD), pages 140–145, October 2003.

[126] Manju Manjunathaiah, Graham M. Megson, Sanjay V. Rajopadhye, and Tanguy Risset. Uniformization of affine dependence programs for parallel embedded system design. In Proceedings of the 2001 International Conference on Parallel Processing (ICPP), pages 205–213, 2001.

[127] Roel Meeuws, Yana Yankova, Koen Bertels, Georgi Gaydadjiev, and Stamatis Vassiliadis. A quantitative prediction model for hardware/software partitioning. In Field Programmable Logic and Applications (FPL), pages 735–739, 2007.

[128] Bingfeng Mei, Serge Vernalde, Diederik Verkest, Hugo de Man, and Rudy Lauwereins. ADRES: An Architecture with Tightly Coupled VLIW Processor and Coarse-Grained Reconfigurable Matrix. In Proceedings of the 13th International Conference on Field Programmable Logic and Applications (FPL), pages 61–70, 2003.

[129] Richard Membarth, Hritam Dutta, Frank Hannig, and Jürgen Teich. Efficient Mapping of Streaming Applications for Image Processing on Graphics Cards. (To appear in) Transactions on High-Performance Embedded Architectures and Compilers (Transactions on HiPEAC), February 2011.

[130] Richard Membarth, Frank Hannig, Hritam Dutta, and Jürgen Teich. Efficient Mapping of Multiresolution Image Filtering Algorithms on Graphics Processors. In Koen Bertels, Nikitas Dimopoulos, Christina Silvano, and Stephan Wong, editors, Proceedings of the 9th International Workshop on Systems, Architectures, Modeling, and Simulation (SAMOS), volume 5657 of Lecture Notes in Computer Science (LNCS), pages 277–288, Island of Samos, Greece, July 2009. Springer.

[131] Richard Membarth, Frank Hannig, Hritam Dutta, and Jürgen Teich. Optimization Flow for Algorithm Mapping on Graphics Cards. In Proceedings of ACACES 2009 Poster Abstracts: Advanced Computer Architecture and Compilation for Embedded Systems, pages 229–232, Terrassa, Spain, July 2009. Academia Press, Ghent.

[132] Richard Membarth, Philipp Kutzer, Hritam Dutta, Frank Hannig, and Jürgen Teich. Acceleration of Multiresolution Imaging Algorithms: A Comparative Study. In Proceedings of the 20th IEEE International Conference on Application-specific Systems, Architectures, and Processors (ASAP), pages 211–214, Boston, MA, USA, July 2009. IEEE Computer Society.

[133] Mentor Graphics. Catapult C. www.mentor.com/products/esl.

[134] Zbigniew Michalewicz and David B. Fogel. How to Solve It: Modern Heuristics. Springer Verlag, 2000.

[135] Giovanni De Micheli. Synthesis and Optimization of Digital Circuits. McGraw-Hill Higher Education, 1994.

[136] Masato Motomura. A Dynamically Reconfigurable Processor Architecture. Microprocessor Forum, October 2002.

[137] Steven Muchnick. Advanced Compiler Design and Implementation. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1997.

[138] Praveen K. Murthy and Edward A. Lee. Multidimensional synchronous dataflow. IEEE Transactions on Signal Processing, 50:3306–3309, 2002.

[139] Anshuman Nayak, Malay Haldar, Alok N. Choudhary, and Prithviraj Banerjee. Accurate Area and Delay Estimators for FPGAs. In DATE '02: Design, Automation and Test in Europe, pages 862–869, 2002.

[140] Ralf Niemann and Peter Marwedel. Hardware/software partitioning using integer programming. In EDTC '96: Proceedings of the 1996 European Conference on Design and Test, pages 473–479, Washington, DC, USA, 1996. IEEE Computer Society.

[141] Rishiyur S. Nikhil. Bluespec System Verilog: efficient, correct RTL from high level specifications. In IEEE International Conference on Formal Methods and Models for Co-Design (MEMOCODE), pages 69–70, 2004.

[142] Hristo Nikolov, Todor Stefanov, and Ed Deprettere. Multi-processor system design with ESPAM. In CODES+ISSS '06: Proceedings of the 4th International Conference on Hardware/Software Codesign and System Synthesis, pages 211–216, New York, NY, USA, 2006. ACM.

[143] Open64 Workshop, held in conjunction with the IEEE/ACM International Symposium on Code Generation and Optimization (CGO), pages xv–xvi, Los Alamitos, CA, USA, 2009. IEEE Computer Society.

[144] Maurizio Palesi and Tony Givargis. Multi-objective design space exploration using genetic algorithms. In Proceedings of the Tenth International Symposium on Hardware/Software Codesign (CODES), pages 67–72, 2002.

[145] Christian Pilato, Antonino Tumeo, Gianluca Palermo, Fabrizio Ferrandi, Pier Luca Lanzi, and Donatella Sciuto. Improving evolutionary exploration to area-time optimization of FPGA designs. Journal of Systems Architecture - Embedded Systems Design, 54(11):1046–1057, 2008.

[146] Latha Pillai. Video compression using DCT. Xilinx Application Note XAPP610, 2002. direct.xilinx.com/support/documentation/application_notes/xapp610.pdf.

[147] Andy D. Pimentel, Cagkan Erbas, and Simon Polstra. A Systematic Approach to Exploring Embedded System Architectures at Multiple Abstraction Levels. IEEE Transactions on Computers, 55(2):99–112, 2006.

[148] Sebastian Pop, Albert Cohen, Cédric Bastoul, Sylvain Girbal, P. Jouvelot, and N. Vasilache. GRAPHITE: Loop optimizations based on the polyhedral model for GCC. In Proc. of the 4th GCC Developers' Summit, pages 179–198, Ottawa, Canada, June 2006.

[149] Louis-Noël Pouchet, Cédric Bastoul, Albert Cohen, and John Cavazos. Iterative optimization in the polyhedral model: Part II, multidimensional time. In ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI'08), pages 90–100, Tucson, Arizona, June 2008. ACM Press.

[150] Sanjay V. Rajopadhye and Richard Fujimoto. Synthesizing systolic arrays from recurrence equations. Parallel Computing, 14(2):163–189, 1990.

[151] Sailesh K. Rao. Regular iterative algorithms and their implementations on processor arrays. PhD thesis, Stanford University, Stanford, CA, USA, 1986.

[152] Recore Systems. www.recoresystems.com.

[153] Edwin Rijpkema, Ed F. Deprettere, and Bart Kienhuis. Deriving process networks from nested loop algorithms. Parallel Processing Letters, 10(2/3):165–176, 2000.

[154] A. W. Roscoe and C. A. R. Hoare. The laws of Occam programming. Technical monograph PRG-53, University of Oxford Computer Laboratory, 1986.

[155] Robert Schreiber, Shail Aditya, Scott Mahlke, Vinod Kathail, B. Ramakrishna Rau, Darren Cronquist, and Mukund Sivaraman. PICO-NPA: High-Level Synthesis of Nonprogrammable Hardware Accelerators. Journal of VLSI Signal Processing Systems, 31(2):127–142, 2002.

[156] Alexander Schrijver. Theory of Linear and Integer Programming. Wiley-Interscience Series in Discrete Mathematics. John Wiley & Sons, Chichester, New York, USA, 1986.

[157] Paul Schumacher and Pradip Jha. Fast and accurate resource estimation of RTL-based designs targeting FPGAs. In Field Programmable Logic and Applications (FPL), pages 59–64, 2008.

[158] Carter Shanklin and Leda Ortega, editors. High Performance Compilers for Parallel Computing. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1995.

[159] Richard Sharp. Higher-Level Hardware Synthesis, volume 2963 of Lecture Notes in Computer Science. Springer, 2004.

[160] Hartej Singh, Ming-Hau Lee, Guangming Lu, Nader Bagherzadeh, Fadi J. Kurdahi, and Eliseu M. Chaves Filho. MorphoSys: An Integrated Reconfigurable System for Data-Parallel and Computation-Intensive Applications. IEEE Transactions on Computers, 49(5):465–481, 2000.

[161] Byoungro So, Mary W. Hall, and Pedro C. Diniz. A compiler approach to fast hardware design space exploration in FPGA-based systems. In PLDI ’02: Proceedings of the ACM SIGPLAN 2002 Conference on Programming language design and implementation, pages 165–176, New York, NY, USA, 2002. ACM.

[162] Dinesh C. Suresh, Satya R. Mohanty, Walid A. Najjar, Laxmi N. Bhuyan, and Frank Vahid. Loop level analysis of security and network applications. In Workshop on Computer Architecture Evaluation using Commercial Workloads (CAECW-03), February 2003.

[163] Synopsys, Inc. Synphony C Compiler, 2010. http://www.synopsys.com/Tools/SLD/HLS/Pages/SynphonyC-Compiler.aspx.

[164] Jürgen Teich. A Compiler for Application-Specific Processor Arrays. PhD thesis, Institut für Mikroelektronik, Universität des Saarlandes, Saarbrücken, Germany, September 1993.

[165] Jürgen Teich. Invasive Algorithms and Architectures. it - Information Tech- nology, 50(5):300–310, 2008.

[166] Jürgen Teich, Frank Hannig, Holger Ruckdeschel, Hritam Dutta, Dmitrij Kissler, and Andrej Stravet. A Unified Retargetable Design Methodology for Dedicated and Re-Programmable Multiprocessor Arrays: Case Study and Quantitative Evaluation. In Proceedings of the International Conference on Engineering of Reconfigurable Systems and Algorithms (ERSA), invited paper, pages 14–24, Las Vegas, NV, USA, June 2007. CSREA Press.

[167] Jürgen Teich and Christian Haubelt. Digitale Hardware/Software-Systeme: Synthese und Optimierung. Springer-Verlag, Berlin Heidelberg, 2nd edition, 2007.

[168] Jürgen Teich and Lothar Thiele. Exact partitioning of affine dependence algorithms. In SAMOS: Embedded Processor Design Challenges: Systems, Architectures, Modeling, and Simulation, pages 135–153, New York, NY, USA, 2002. Springer-Verlag New York, Inc.

[169] Jürgen Teich, Lothar Thiele, and Li Zhang. Scheduling of Partitioned Regular Algorithms on Processor Arrays with Constrained Resources. Journal of VLSI Signal Processing, 17(1):5–20, September 1997.

[170] The MathWorks. Simulink. www.mathworks.com/products/simulink/.

[171] L. Thiele. Scheduling of Uniform Algorithms with Resource Constraints. Journal of VLSI Signal Processing, 10:295–310, 1995.

[172] Lothar Thiele, Ernesto Wandeler, and Samarjit Chakraborty. A Stream-Oriented Component Model for Performance Analysis of Multiprocessor DSPs. IEEE Signal Processing Magazine, 22(3):38–46, May 2005.

[173] Lothar Thiele, Iuliana Bacivarov, Wolfgang Haid, and Kai Huang. Mapping Applications to Tiled Multiprocessor Embedded Systems. In Proc. 7th Int'l Conference on Application of Concurrency to System Design (ACSD'07), pages 29–40, Bratislava, Slovak Republic, July 2007.

[174] Bill Thies, Michal Karczmarek, and Saman Amarasinghe. StreamIt: A language for streaming applications. In International Conference on Compiler Construction, pages 179–196, 2001.

[175] Alexandru Turjan, Bart Kienhuis, and Ed Deprettere. A Compile Time Based Approach for Solving Out-of-Order Communication in Kahn Process Networks. In ASAP '02: Proceedings of the IEEE International Conference on Application-Specific Systems, Architectures, and Processors, pages 17–28, 2002.

[176] Sven van Haastregt and Bart Kienhuis. Automated synthesis of streaming C applications to process networks in hardware. In Proceedings of the Design, Automation and Test in Europe Conference (DATE), pages 890–893, 2009.

[177] Sven Verdoolaege, Rachid Seghir, Kristof Beyls, Vincent Loechner, and Maurice Bruynooghe. Analytical computation of Ehrhart polynomials: enabling more compiler analyses and optimizations. In CASES '04: Proceedings of the 2004 International Conference on Compilers, Architecture, and Synthesis for Embedded Systems, pages 248–258, New York, NY, USA, 2004. ACM.

[178] Hervé Le Verge, Christophe Mauras, and Patrice Quinton. The ALPHA language and its use for the design of systolic arrays. VLSI Signal Processing, 3(3):173–182, 1991.

[179] Henry S. Warren. Hacker’s Delight. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 2002.

[180] R. Clint Whaley and Jack J. Dongarra. Automatically tuned linear algebra software. In Supercomputing '98: Proceedings of the 1998 ACM/IEEE Conference on Supercomputing (CDROM), pages 1–27, Washington, DC, USA, 1998. IEEE Computer Society.

[181] R. Clint Whaley, Antoine Petitet, and Jack J. Dongarra. Automated empirical optimizations of software and the ATLAS project. Parallel Computing, 27(1–2):3–35, 2001.

[182] Samuel Williams, Leonid Oliker, Richard Vuduc, John Shalf, Katherine Yelick, and James Demmel. Optimization of sparse matrix-vector multiplication on emerging multicore platforms. In SC '07: Proceedings of the 2007 ACM/IEEE Conference on Supercomputing, pages 1–12, New York, NY, USA, 2007. ACM.

[183] Robert P. Wilson, Robert S. French, Christopher S. Wilson, Saman P. Amarasinghe, Jennifer-Ann M. Anderson, Steven W. K. Tjiang, Shih-Wei Liao, Chau-Wen Tseng, Mary W. Hall, Monica S. Lam, and John L. Hennessy. SUIF: An infrastructure for research on parallelizing and optimizing compilers. SIGPLAN Notices, 29(12):31–37, 1994.

[184] Wayne Wolf. Computers as components: principles of embedded computing system design. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2001.

[185] Wayne Wolf. The Future of Multiprocessor Systems-on-Chips. In Proceed- ings of the 41st annual Design Automation Conference, pages 681–685, Los Alamitos, CA, USA, 2004. IEEE Computer Society.

[186] Wayne Wolf, Burak Ozer, and Tiehan Lv. Smart cameras as embedded systems. Computer, 35(9):48–53, 2002.

[187] Xilinx Inc. Xilinx Platform Studio, documentation, 2009. http://www.xilinx.com/ise/embedded/edk_pstudio.htm.

[188] Xilinx Inc. XPower Analyzer, 2009. http://www.xilinx.com/products/design_tools/logic_design/verification/xpower.htm.

[189] Jingling Xue. Loop Tiling for Parallelism. Kluwer Academic Publishers, Norwell, MA, USA, 2000.

[190] Z. Zhang, Y. Fan, W. Jiang, G. Han, C. Yang, and J. Cong. AutoPilot: A platform-based ESL synthesis system. In High-Level Synthesis: From Algorithm to Digital Circuit, chapter 6, pages 99–112. Springer, 2008.

[191] E. Zitzler, M. Laumanns, and L. Thiele. SPEA2: Improving the Strength Pareto Evolutionary Algorithm for Multiobjective Optimization. In K.C. Giannakoglou et al., editors, Evolutionary Methods for Design, Optimisation and Control with Application to Industrial Problems (EUROGEN 2001), pages 95–100. International Center for Numerical Methods in Engineering (CIMNE), 2002.

[192] E. Zitzler and L. Thiele. Multiobjective evolutionary algorithms: A comparative case study and the strength Pareto evolutionary algorithm. IEEE Transactions on Evolutionary Computation, 3(4):257–271, 1999.

List of Abbreviations

ASIC ...... Application-specific Integrated Circuit
BRAM ...... Block Random Access Memory
CFG ...... Control-Flow Graph
DCT ...... Discrete Cosine Transform
DMA ...... Direct Memory Access
DPLA ...... Dynamic Piecewise Linear Algorithm
DSE ...... Design Space Exploration
EWDF ...... Elliptic Wave Digital Filter
FF ...... Flip-Flop
FIR ...... Finite Impulse Response Filter
FPGA ...... Field Programmable Gate Array
GA ...... Genetic Algorithm
GPU ...... Graphics Processing Unit
HLS ...... High-Level Synthesis
LA ...... Loop Accelerator
LoC ...... Lines of Code
LUT ...... Look-up-Table
MoC ...... Model of Computation
MOEA ...... Multi-Objective Evolutionary Algorithm
MPSoC ...... Multi-processor System-on-Chip
MPA ...... Modular Performance Analysis
NSGA ...... Non-dominated Sorting Genetic Algorithm
PE ...... Processor Element
PA ...... Processor Array
PLA ...... Piecewise Linear Algorithm
QoR ...... Quality of Results
RTC ...... Real-time Calculus
RTL ...... Register Transfer Level
SoC ...... System-on-Chip
SPEA ...... Strength Pareto Evolutionary Algorithm
TCPA ...... Tightly-Coupled Processor Array
WSDF ...... Windowed Synchronous Dataflow

Curriculum Vitae

Hritam Dutta was born in Bokaro, India on April 5, 1979. He received his Bachelor of Science (B.Sc.) and Master of Science (M.Sc.) degrees in Mathematics and Computing from the Indian Institute of Technology, Kharagpur, India in 2000 and 2002, respectively. After obtaining his second Master of Science (M.Sc.) degree in Computational Engineering from the University of Erlangen-Nuremberg in 2005, he joined the Chair of Hardware-Software-Co-Design (Prof. Dr.-Ing. Jürgen Teich) at the University of Erlangen-Nuremberg, Germany as a research assistant. There, he was employed in the DFG P2R project CoMap: Co-Design of Massively Parallel Embedded Processor Architectures from 2005 to 2010. Besides his studies, he gained further research exposure during an internship at the Indian Institute of Remote Sensing (2000), as an intern at Siemens Medical Solutions (2004), and as a visiting research employee at the University Bretagne-Occidentale, Brest, France (2010). His research interests include models and methods for parallel and distributed computing in embedded systems. In 2000, he received a student fellowship from the Indian Academy of Sciences. In 2003, he received the "Siemens TOPAZ" scholarship from Siemens AG for pursuing higher studies at the University of Erlangen-Nuremberg. Hritam Dutta has been a reviewer for several international conferences and for the Elsevier journal Microprocessors and Microsystems.
