A Coarse Grained Reconfigurable Architecture Framework supporting Macro-Dataflow Execution

A Thesis Submitted for the Degree of Doctor of Philosophy in the Faculty of Engineering

by Keshavan Varadarajan

Supercomputer Education and Research Centre
INDIAN INSTITUTE OF SCIENCE
BANGALORE – 560 012, INDIA

DECEMBER 2012

© Keshavan Varadarajan 2012

To my grandfather
Late H. Keshavachar

Acknowledgments

Before I start thanking people, I would like to state that this piece of paper can neither capture the extent of my gratitude nor the entire list of people to whom I am thankful. I name a few people who have directly helped me; there are many unnamed people who make it possible for us to work and attempt to make a meaningful contribution to society.

Dreams drive innovation, and the dreamer-in-chief in this case was my guide: Prof. S K Nandy. I thank him for the dreams, for giving the initial impetus so that we could take it forward, and for giving us the means to achieve it: monetary, intellectual, advisory and equipment. I would like to thank Dr. Ranjani Narayan for many things: helping me make it into this institution, the numerous discussions, patient paper reviews and steadfast belief in REDEFINE. I thank Dr. Balakrishnan Srinivasan for his timely and incisive comments that helped us perceive the architecture in a newer light and identify the shortcomings of the architecture. I would like to thank Prof. Bharadwaj Amrutur for the numerous discussions on caches, and I am sorry that I wasted your time without any results to show. His encouragement has worked wonders on me. To Profs. R Govindarajan and Matthew Jacob, I express my most sincere gratitude. I learnt my first lessons from them. Subsequently, their belief in me has helped me sail the rough waters of the PhD. I thank Prof. R Govindarajan for helping me secure my scholarship in some of the most testing times I have ever faced. I would like to thank Prof. Y C Tay of the National University of Singapore for the opportunity he provided to work with him. I thank Prof. Georgi Gaydadjiev for the encouragement he gave me and the wonderful opportunity to work with him in the Netherlands, which I could not take up.

Mythri Alle, I thank you wholeheartedly for several things. We started our PhD journeys together and now we end them nearly together. Without you, this PhD may have been very difficult or may not even have been possible. Next, to my dearest friend Rajdeep Mondal. Dude, without your belief that anything can be completed in time and your insightful comments, the Bluespec implementation would not have been possible. Prasenjit Biswas and Saptarsi Das, I thank you for the many discussions and technical inputs. I rely on these people until this very day for technical inputs. Sanjay Kumar, without you life would have been very boring in the CAD lab. You have been the most helpful innumerable times and the most reliable person when some task needs to be offloaded. Ganesh Garga, my sincere thanks for all the work you put into REDEFINE and thanks for being a friend in the initial years of my stay at IISc. I would also like to thank Alexander Fell for all the work he put into the NoC. To my other friends: Aparna Mandke, Basavaraj Talwar, Vishal Sharda, Ritesh Rajore, Ramesh Reddy and Nimmy Joseph, it was a pleasure working with you. Bharath "Amba" Ravikumar, Sujay Mysore, Swaroop Krishnamurthy and Poornima Hatti have been my friends through thick and thin. Without them the journey of the PhD would have been nigh impossible to complete.

Last but not least, I express my deepest gratitude to my family members: my parents, Anta, Akka, Jiju, Kutti, Sampath, U-ma, K-pa, chitti, chittappa and Bhartu. I will not belittle their contribution by attempting to state it in words. Finally, I would like to thank the guiding light who spoke in many voices, some known and some unknown.

Abstract

A Coarse-Grained Reconfigurable Architecture (CGRA) is a processing platform comprising an interconnection of coarse-grained computation units (viz. Function Units (FUs), Arithmetic Logic Units (ALUs)). These units communicate directly, using send-receive like primitives, as opposed to the shared memory based communication used in multi-core processors. CGRAs are a well-researched topic and the design space of a CGRA is quite large. The design space can be represented as a 7-tuple (C,N,T,P,O,M,H) where the terms have the following meanings: C - choice of computation unit, N - choice of interconnection network, T - choice of number of context frames (single or multiple), P - presence of partial reconfiguration, O - choice of orchestration mechanism, M - design of the memory hierarchy and H - host-CGRA coupling. In this thesis, we develop an architectural framework for a macro-dataflow based CGRA where we make the following choices for each of these parameters: C - ALU, N - Network-on-Chip (NoC), T - multiple contexts, P - support for partial reconfiguration, O - macro-dataflow based orchestration, M - data memory banks placed at the periphery of the reconfigurable fabric (reconfigurable fabric is the name given to the interconnection of computation units), H - loose coupling between host and CGRA, enabling our CGRA to execute an application independent of the host processor's intervention. The motivations for developing such a CGRA are:

• To execute applications efficiently through reduction in reconfiguration time (i.e. the time needed to transfer instructions and data to the reconfigurable fabric) and reduction in execution time through better exploitation of all forms of parallelism: Instruction Level Parallelism (ILP), Data Level Parallelism (DLP) and Thread/Task Level Parallelism (TLP). We choose a macro-dataflow based orchestration framework in combination with partial reconfiguration so as to ease exploitation of TLP and DLP. Macro-dataflow serves as a lightweight synchronization mechanism. We experiment with two variants of the macro-dataflow orchestration unit, namely: a hardware-controlled orchestration unit and a compiler-controlled orchestration unit. We employ a NoC as it helps reduce the reconfiguration overhead.

• To permit customization of the CGRA for a particular domain through the use of domain-specific custom Intellectual Property (IP) blocks. This aids in improving both application performance and energy efficiency.

• To develop a CGRA which is completely programmable and accepts any program written using the C89 standard. The compiler and the architecture were co-developed to ensure that every feature of the architecture could be automatically programmed by the compiler from an application.

In this CGRA framework, the orchestration mechanism (O) and the host-CGRA coupling (H) are kept fixed and we permit design space exploration of the other terms in the 7-tuple design space. The mode of compilation and execution remains invariant under these changes; hence we refer to it as a framework.

We now elucidate the compilation and execution flow for this CGRA framework. An application written in the C language is compiled and transformed into a set of temporal partitions, referred to as HyperOps in this thesis. The macro-dataflow orchestration unit selects a HyperOp for execution when all its inputs are available. The instructions and operands for a ready HyperOp are transferred to the reconfigurable fabric for execution. Each ALU (in the computation unit) is capable of waiting for the availability of the input data prior to issuing instructions. We permit the launch and execution of a temporal partition to progress in parallel, which reduces the reconfiguration overhead. We further cut launch delays by keeping loops persistent on the fabric, thus eliminating the need to launch their instructions repeatedly.

The CGRA framework has been implemented using Bluespec System Verilog. We evaluate the performance of two CGRA instances: one for cryptographic applications and another for linear algebra kernels. We also run other general purpose integer and floating point applications to demonstrate the generic nature of these optimizations. We explore various microarchitectural optimizations, viz. pipeline optimizations (i.e. changing the value of T), different forms of macro-dataflow orchestration such as a hardware-controlled orchestration unit and a compiler-controlled orchestration unit, different execution modes including resident loops, pipeline parallelism, changes to the router, etc. As a result of these optimizations we observe a 2.5× improvement in performance as compared to the base version. The reconfiguration overhead was hidden by overlapping the launch of instructions with execution. The perceived reconfiguration overhead is reduced drastically to about 9-11 cycles for each HyperOp, independent of the size of the HyperOp. This can be mainly attributed to data-dependent instruction execution and the use of the NoC. The overhead of macro-dataflow orchestration was reduced to a minimum with the compiler-controlled orchestration unit. To benchmark the performance of these CGRA instances, we compare their performance with an Intel Core 2 Quad running at 2.66 GHz. On the cryptographic CGRA instance, running at 700 MHz, we observe one to two orders of magnitude improvement in performance for cryptographic applications, and up to one order of magnitude performance degradation on the linear algebra CGRA instance.
This relatively poor performance on linear algebra kernels can be attributed to the inability to exploit ILP across computation units interconnected by the NoC, the long latency in accessing data memory placed at the periphery of the reconfigurable fabric, and the unavailability of pipelined floating point units (which is critical to the performance of linear algebra kernels). The superior performance on the cryptographic kernels can be attributed to a higher computation to load instruction ratio, a careful choice of custom IP block, the ability to construct large HyperOps, which allows a greater portion of the communication to be performed directly (as against communication through a register file in a general purpose processor), and the use of the resident loops execution mode.

The power consumption of a computation unit employed in the cryptography CGRA instance, along with its router, is about 76 mW, as estimated by Synopsys Design Vision using the Faraday 90nm technology library for an activity factor of 0.5. The power of other instances would depend on the specific instantiation of the domain-specific units. This implies that for a reconfigurable fabric of size 5 × 6 the total power consumption is about 2.3 W. The area and power (84 mW) dissipated by the macro-dataflow orchestration unit, which is common to both instances, are comparable to those of a single computation unit, making it an effective and low overhead technique to exploit TLP.

Contents

Acknowledgments

Abstract

List of Figures

List of Tables

List of Acronyms

1 Introduction
  1.1 Evolution of Reconfigurable Architectures
  1.2 A different view point
  1.3 Motivation for developing a new CGRA
  1.4 Overall Setting of this Thesis
  1.5 Organization of the Thesis
  1.6 Note to the Reader/Reviewer

2 Background and Related Work
  2.1 Reconfigurable Architectures
  2.2 Classification of Reconfigurable Architectures
  2.3 Compilation for Reconfigurable Architectures
  2.4 Modern CGRAs
    2.4.1 RAW Architecture
    2.4.2 PACT eXtreme Processing Platform
    2.4.3 TRIPS
    2.4.4 IPFlex-DAPDNA
    2.4.5 ADRES
    2.4.6 Wavescalar
    2.4.7 NEC-Dynamically Reconfigurable Processor
    2.4.8 Ambric
    2.4.9 Polymorphic Pipeline Array
  2.5 Summary

3 Macro Dataflow Execution Model for a CGRA
  3.1 Why Dataflow?
  3.2 Macro Dataflow
  3.3 Macro Operations
    3.3.1 Macro Dataflow Graph
      3.3.1.1 Macro Operations
      3.3.1.2 Input Merge
      3.3.1.3 Conditionals
      3.3.1.4 Loop Constants
      3.3.1.5 Function Call and Contexts
      3.3.1.6 Memory Access
      3.3.1.7 Conditions for a Well-behaved graph
  3.4 Execution Model
    3.4.1 Orchestration Unit
      3.4.1.1 Context Memory Allocation
      3.4.1.2 Loop Throttling
      3.4.1.3 Schedule, Terminate and Purge
  3.5 Granularity of Macro Operations
  3.6 Conclusion

4 Design of the Reconfigurable Fabric
  4.1 Programmability is of paramount importance
  4.2 Improving Fabric Utilization
  4.3 A Unified Interconnection
  4.4 Domain-specific customization to achieve better performance
  4.5 High-Level Design choices
  4.6 Design of the Computation unit (C, T)
  4.7 Design of the Router and the Reconfigurable Fabric (N)
  4.8 Design of the Load-Store Unit (M)
  4.9 Relating to the Parametric Design Space
  4.10 Conclusion

5 Design of the Macro-Dataflow Orchestration Subsystem
  5.1 Overview of Operation
  5.2 Context Memory Update Logic and Orchestration Unit
    5.2.1 Determining the Consumer's Instance Number
    5.2.2 Determining the location within the Context Memory
    5.2.3 Handling of Loop Constants
    5.2.4 Design of the Orchestration Unit
  5.3 Design of the Resource Allocator
  5.4 Design of the Instruction and Data Transfer Unit
  5.5 Conclusion

6 Experimental Framework and Results
  6.1 Experimental Framework
  6.2 Choice of Applications
  6.3 Compilation Overview
  6.4 Execution Overview
    6.4.1 Understanding Time Spent During Execution
  6.5 Results
  6.6 Conclusion

7 Microarchitectural Optimizations
  7.1 Reducing Fabric Execution Time (FET)
    7.1.1 Reducing Temporal Distance between Memory Operations
    7.1.2 Eliminating the Priority Encoder
    7.1.3 Reducing Temporal Distance between Dependent Instructions
  7.2 Reducing Inter-HyperOp Launch Time (IHLT)
  7.3 Evaluating the impact on FET and IHLT
    7.3.1 Impact on FET
    7.3.2 Impact on IHLT
    7.3.3 Impact on overall execution time
  7.4 Reducing Perceived HyperOp Launch Time (PHLT)
    7.4.1 Interleaving Instruction and Data Load
    7.4.2 Resident Loops
    7.4.3 Note on different techniques to reduce PHLT
  7.5 Understanding the Cumulative Effect
  7.6 Synthesizing Hardware Modules
  7.7 Comparing Performance with other processors
    7.7.1 Comparison with a General Purpose Processor (GPP)
    7.7.2 Comparison with Field Programmable Gate Arrays (FPGAs)
    7.7.3 Comparison with CGRAs employing FPGAs
    7.7.4 Addressing some shortcomings
      7.7.4.1 Alternative Techniques for Load-Store Overhead Reduction
  7.8 Exploiting Pipeline Parallelism and Task Level Parallelism
  7.9 Conclusion

8 Conclusions and Future Work
  8.1 Future Work
  8.2 Directions for Future Research

References

Publications

List of Figures

3.1 The schematic shows how two application substructures can be simultaneously realized on an array of computation units.
3.2 A tree of stacks is needed when exploiting parallelism that exists among application substructures.
3.3 Schematic figure showing different application substructures, where a loop and several parallel tasks are placed between sequential substructures.
3.4 All input edges on a macro operation may not be valid. However, all such inputs are mutually exclusive, thus allowing an implicit merge node.
3.5 Schematic showing three substructures A, B and C. The decision on invocation of C is taken in substructure B, and A is another predecessor of C which generates inputs for it without consideration of the decision taken in B.
3.6 Schematic of the execution system.

4.1 The plot shows the percentage of nodes having a given outdegree. This has been plotted for different applications.
4.2 Schematic diagram showing the internals of a computation unit connected to a router.
4.3 Instruction selection logic in a computation unit. V - valid bit; P - predicate value; E - predicate expected.
4.4 Schematic diagram of the write back stage.
4.5 Block diagram of the router.
4.6 Reconfigurable fabric in which computation units are interconnected using the honeycomb topology. The blocks with wavy shading are the peripheral routers.
4.7 Schematic diagram of the Load-Store Unit.
4.8 The connection of the Load-Store Unit and Orchestrator with the reconfigurable fabric is shown.

5.1 Schematic of the Macro-Dataflow Orchestration Subsystem.
5.2 Four possible cases of HyperOp to HyperOp communication through scalar variables (shown with solid edges).
5.3 Schematic of the Instruction and Data Transfer Unit.

6.1 Block diagram showing the interconnection between the reconfigurable fabric and the macro-dataflow orchestration subsystem.
6.2 An illustration of resource allocation is shown.
6.3 The various steps in HyperOp execution and the time spent in various activities. Total HyperOp Launch Time (THLT); Perceived HyperOp Launch Time (PHLT); Fabric Execution Time (FET); Inter-HyperOp Launch Time (IHLT).
6.4 Plot shows the portion of the overall execution time spent in various tasks.

7.1 The path taken by the load request and the trigger edge to the subsequent memory operation are shown.
7.2 The path taken by the load request and the local trigger edge to the subsequent memory operation are shown.
7.3 Modified compute element structure reduces the minimum temporal distance between two dependent instructions to two.
7.4 The figure shows a Control Flow Graph (CFG) with the HyperOps overlaid on it, indicating which basic blocks are a part of the same HyperOp. The boxes represent the basic blocks and the ovals the HyperOps.
7.5 The plots show the normalized execution time of the CGRA for configurations I and II. The normalization is performed with respect to the base configuration.
7.6 FET and Clocks Per Instruction (CPI) improvements due to optimizations within the computation unit.
7.7 Plot comparing the IHLT component in the overall execution time for various configurations.
7.8 The plots show the normalized fraction of PHLT w.r.t. overall execution time for configurations I and II. The normalization is performed with respect to the base configuration.
7.9 The plots show the normalized PHLT, FET and overall execution time for configuration III, normalized with respect to configuration II. The last plot compares the execution time for configuration III to the execution time recorded for configuration I.
7.10 The overall execution time with configuration IV normalized with respect to configuration III.
7.11 A single plot showing the improvement in performance for each of the architectural optimizations with respect to the base configuration.
7.12 Plot shows the comparison in performance between a Core 2 Quad and our CGRA.
7.13 Normalized FET for HyperOps on the Cryptographic Fabric.
7.14 Normalized FET for HyperOps on the Floating Point Fabric.
7.15 Normalized FET for the floating point applications when pipelined floating point units are employed.

List of Tables

2.1 The table indicates the values for all the terms in the 7-tuple design space of a CGRA. The terms in the 7-tuple design space have the following meanings: C - choice of computation unit, N - choice of interconnection network, T - choice of number of context frames (single or multiple), P - presence of partial reconfiguration, O - choice of orchestration mechanism, M - design of memory hierarchy and H - host-CGRA coupling. The meaning of other terms: C/F: Control flow; D/F: Dataflow.

4.1 The values of various parameters have been decided based on the overarching goals that were set forth at the outset.

6.1 List of all FUs and the latency in clock cycles for the Cryptography fabric and floating-point fabric.
6.2 Execution time in cycles for the Cryptography applications, Integer applications and floating-point applications.
6.3 CPI recorded for various applications while executing on the reconfigurable fabric.

7.1 The various hardware configurations.
7.2 The area, power and frequency estimates for various modules.
7.3 Comparison of the throughput achieved by FPGA and CGRA.
7.4 Cycles recorded on Molen for various applications.
7.5 The overall execution time recorded for configuration IV and configuration VI and the percentage difference between the two execution times.

8.1 Table summarizing the various simulation configurations.

List of Acronyms

ADRES Architecture for Dynamically Reconfigurable Embedded Systems
AES Advanced Encryption Standard
ALU Arithmetic Logic Unit
ASAP As Soon As Possible
ASIC Application Specific Integrated Circuit
BDL Behavioral Design Language
CDFG Control-Dataflow Graph
CFG Control Flow Graph
CGRA Coarse-Grained Reconfigurable Architecture
CPI Clocks Per Instruction
CPLD Complex Programmable Logic Device
CPU Central Processing Unit
CRC Cyclic Redundancy Check
DAP Digital Application Processor
DFG Dataflow Graph
DLP Data Level Parallelism
DMA Direct Memory Access
DNA Distributed Network Architecture
DRESC Dynamically Reconfigurable Embedded System Compiler
DRP Dynamically Reconfigurable Processor
DSP Digital Signal Processor
ECC Elliptic Curve Cryptography
ECDH Elliptic Curve Diffie–Hellman
ECDSA Elliptic Curve Digital Signing Algorithm
ECPA Elliptic Curve Point Addition
ECPD Elliptic Curve Point Doubling
EDGE Explicit Data Graph Execution
ETS Explicit Token Store
FDLS Force Directed List Scheduling
FIFO First In First Out
FET Fabric Execution Time
FFT Fast Fourier Transform
FPGA Field Programmable Gate Array
FNC Function
FSM Finite State Machine
FU Function Unit
GALS Globally Asynchronous Locally Synchronous
GPP General Purpose Processor
HDL Hardware Description Language
HLL High Level Language
IDCT Inverse Discrete Cosine Transform
IHLT Inter-HyperOp Launch Time
ILP Instruction Level Parallelism
ISA Instruction Set Architecture
I/O Input/Output
IP Intellectual Property
LLVM Low Level Virtual Machine
LUT Look Up Table
Mbps Megabits Per Second
MFLOPS Mega Floating-Point Operations Per Second
NoC Network-on-Chip
NURA Non-Uniform Register Access
PAE Processing Array Element
PAL Programmable Array Logic
PC Program Counter
PE Processing Element
PHLT Perceived HyperOp Launch Time
PLA Programmable Logic Array
PPA Polymorphic Pipeline Array
RC Reconfigurable Cell
RISC Reduced Instruction Set Computer
RAM Random Access Memory
ROM Read Only Memory
RTL Register Transfer Level
SHA Secure Hash Algorithm
tCFG tree Control Flow Graph
THLT Total HyperOp Launch Time
TTA Transport Triggered Architecture
TLP Thread/Task Level Parallelism
TTDA Tagged Token Dataflow Architecture
SSA Static Single Assignment
SoC System on Chip
VLIW Very Long Instruction Word
XPP eXtreme Processing Platform

Chapter 1

Introduction

It really does not matter how much you learn in a PhD, what matters is what the world learns from your PhD - Prof. Bharadwaj Amrutur1

Today, the processor industry is at a point where heterogeneity in the processing platform is the answer to providing performance scaling for the next decade and more. Heterogeneity in processing platforms is achieved by inter-mingling various forms of computing processors including GPPs, reconfigurable processors, Application Specific Integrated Circuits (ASICs), etc. (Borkar and Chien, 2011). On heterogeneous processing platforms, higher performance and energy efficiency can be achieved through better assignment of applications to processor types based on application characteristics. This implies there is a need to define how these varied processing cores are integrated, in terms of hardware and in terms of programmability. Papers by Hill and Marty (2008) and Borkar and Chien (2011) predict that a heterogeneous mix of small and large processor cores may be the future of processors. However, the design space is large and remains to be completely explored. Further, Hill and Marty (2008) discuss the possibility of designing dynamic cores where each of these small processors cooperates to execute a large task. This form of execution is greatly reminiscent of a Coarse-Grained Reconfigurable Architecture (CGRA) style of execution. This model is vastly different from the way parallelism was exploited on the Niagara processor (Kongetira et al., 2005), where threads are executed on different processors and communication is facilitated through shared memory semantics.

1 This statement has always kept me aware that it is not my satisfaction of learning that matters, but my contribution that matters. While I am not sure if there is any substantial learning, I know I have tried.

In this thesis, we present the architecture of a CGRA which was developed from the ground up along with its compiler, where cores are dynamically composed from small computation units to execute a large application. This thesis serves to document the design objectives, choices, decisions and the impact of these decisions encountered while architecting this CGRA. We begin this thesis with a brief description of the evolution of reconfigurable architectures. We then list the motivations which served as guiding principles during the design of our CGRA. We end this chapter with a summary of the chapters that follow and a guide to the reader/reviewer on various possible orders in which the chapters could be read.

1.1 Evolution of Reconfigurable Architectures

The term "reconfigurable architecture", due to its lack of a clear definition, is popularly used to refer to either very specific solutions such as FPGAs or, colloquially, to multi-mode ASICs. The key to understanding what constitutes a reconfigurable architecture lies in its evolution from early hardware solutions such as the Programmable Logic Arrays (PLAs). PLAs were introduced in 1970 by Texas Instruments. PLAs were a 2-D grid of AND and OR gates which were appropriately interconnected so that any logic expression in sum-of-products form could be realized (as long as the number of inputs to the logic expression was equal to the number of logic pins on the PLA). PLAs were "mask-programmed" during manufacture. A variant of this, which includes fuses that could be programmed one time, was called Programmable Array Logic (PAL). The rigidity of these devices, i.e. their one-time programmability, led to the invention of programmable Read Only Memory (ROM) based devices. These devices were called Generic Array Logic. This was followed by Complex Programmable Logic Devices (CPLDs), which could be used in places where a much larger number (thousands to hundreds of thousands) of gates were needed. All these technologies, however, were still not programmable in the "field", i.e. programming could not be performed at the customer site.

FPGAs, which were designed as an interconnection of Look Up Tables (LUTs), were the first to achieve "field programmability". The devices were fabricated by the vendors and shipped to customers, who could program the device at their own site. The programs were specified in hardware description languages such as Verilog and VHDL. FPGAs achieved phenomenal success in the 1990s and are now synonymous with reconfigurable architectures. The structure of FPGAs makes them well suited to modeling large state machines and other logic expressions that operate at the bit level.

In order to model an application that requires larger mathematical operations, viz. floating point addition, these mathematical operators are emulated on the FPGA. However, this emulation of the mathematical operators is not as efficient as a custom function unit present within a GPP. A new form of reconfigurable architecture, which was an interconnection of coarse-grained mathematical operators (instead of fine-grained LUTs), emerged as the answer. Several of these CGRAs were designed by academia and industry. A detailed review of these can be found in the survey papers by Hartenstein (2001), Compton and Hauck (2002) and Amano (2006). Unlike the FPGAs of that time, most CGRAs were capable of runtime reconfiguration, i.e. the ability to change the hardware configuration quickly so that the configuration could be changed periodically, and included support for high-level languages in the compilation process instead of the hardware description languages used by FPGA vendors. While modern FPGAs continue to retain their primary characteristic of being bit-programmable, they have included support for several of these technologies through instantiation of Digital Signal Processor (DSP) slices, support for partial reconfiguration and permitting synthesis from high-level languages onto FPGAs. The entire gamut of silicon solutions spanning from PLAs to CGRAs constitutes the class of reconfigurable architectures. The most fundamental characteristic which is common to all of them is the spatial parallelism that can be exploited in these devices (DeHon and Wawrzynek, 1999). We revisit this definition in chapter 2 to refine it further.

1.2 A different view point

Very Long Instruction Word (VLIW) processors emerged as a means to address the scalability problems associated with superscalar processors. VLIW processors allowed the compiler to explicitly indicate which instructions were to be executed in parallel, as opposed to a superscalar processor which attempted to detect this at runtime. It was soon discovered that the VLIW too had scalability problems; beyond a certain issue width, the number of read/write ports on the register file could not be increased, as it leads to an energy-inefficient design. This led to the development of clustered-VLIW processors. In the clustered-VLIW processor the issue width was spread across multiple processors of a cluster. Each processor in the cluster was connected to a single register file. They could collectively issue a much larger number of instructions every clock cycle.

Data to be transferred from one processor to another within the cluster had to be sent through explicit transfer instructions. The instructions within all the processors of a cluster proceeded in lock-step; a stall emanating from any of the processors would cause a stall of the entire cluster. The RAW processor was designed to address this severe limitation. The RAW processor is an interconnection of several simple processors (MIPS R2000), and each processor could step through its instructions without regard to the state of the other processors. When data is sent to a neighbouring processor the send primitive is employed, and when data is to be received the receive primitive is used. A receive instruction for which data has not been received stalls the pipeline until the arrival of the data. Unlike the Niagara processor (Kongetira et al., 2005), the small processors did not execute individual threads. An application written in a high-level language was partitioned into different instruction streams, each of which was assigned to these simple processors. These simple processors collectively executed a large program, making it a CGRA-style of execution as opposed to a multi-core style of execution2. In the process of making the pipelines of all the processors independent, the designers of RAW permitted pipeline stalls when the data for the receive instruction was not available. Prior to this, in VLIW architectures pipeline stalls were only enforced when the data from a load had not yet arrived. The introduction of receive instructions in all processors at appropriate places made it possible to collectively work on the same program without the need for explicit synchronization constructs. Data-dependent instruction execution, of the form demonstrated here, is the most lightweight implicit synchronization mechanism. This led to renewed interest in dataflow processing, whose influence can be seen in the TRIPS processor (Burger et al., 2004). Unlike the static instruction scheduling employed in RAW, TRIPS employs dynamic instruction scheduling. A detailed description of these architectures and their differences is presented in chapter 2.
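To make the send/receive mechanism concrete, the following is a minimal C sketch, under stated assumptions, of two instruction streams cooperating on one expression. The send/recv helpers and the one-slot channel are hypothetical stand-ins for the operand network; they are not the RAW ISA, and the blocking receive is what provides the implicit synchronization described above.

    #include <stdio.h>

    /* Illustrative sketch only: a one-slot blocking channel standing in for
     * the operand network between two computation units.  The names send/recv
     * and the two "instruction streams" below are assumptions for
     * illustration, not the RAW ISA. */
    typedef struct { int full; int data; } channel_t;

    static void send(channel_t *c, int v) { c->data = v; c->full = 1; }

    static int recv(channel_t *c) {
        while (!c->full) { /* a real receive stalls the pipeline here */ }
        c->full = 0;
        return c->data;
    }

    /* Stream on unit 0: produces a partial result and forwards it. */
    static void unit0(channel_t *to_unit1, int a, int b) {
        int t = a * b;          /* compute */
        send(to_unit1, t);      /* explicit transfer to the neighbour */
    }

    /* Stream on unit 1: blocks until the operand arrives, then continues.
     * No lock or barrier is needed; the data dependence is the
     * synchronization. */
    static void unit1(channel_t *from_unit0, int c) {
        int t = recv(from_unit0);
        printf("%d\n", t + c);
    }

    int main(void) {
        channel_t ch = {0, 0};
        unit0(&ch, 3, 4);       /* run sequentially here for simplicity */
        unit1(&ch, 5);          /* prints 17 */
        return 0;
    }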

1.3 Motivation for developing a new CGRA

Transport Triggered Architecture (TTA) (Corporaal, 1997) addressed the VLIW's need for many read/write ports on the register file by performing transport scheduling along with instruction scheduling. Corporaal (1997) exposed to the compiler not just the instruction width and the types of FUs available, but also the transports of data from one FU to another FU or to a register file.

2 We cover this in greater detail in chapter 2.

The number of operand buses interconnecting the various FUs was known to the compiler, and the transport of data from an FU was scheduled by the compiler. It was observed by Corporaal (1997) that close to 30% of the writebacks for the SPEC95 benchmark were never used by the processor. This architecture served as the precursor to our CGRA. We wanted to expand this architecture into a 2-D structure and schedule instructions in much the same manner as TTA. This, we believed, would solve the problem of the lack of complete programmability in the CGRAs we had encountered until then. However, unlike the TTA, we decided to employ a 2-D network of ALUs instead of FUs, as we had identified the use of FUs as one of the shortcomings of the DAP-DNA architecture (Sato et al., 2005; Sugawara et al., 2004) (details of which too are available in chapter 2). We wanted to extend the idea of TTA to allow for the creation of variable issue-width VLIW processors3, i.e. to allow for the creation of different-sized VLIW cores based on application requirements. Further, we wanted the ability to use the remaining resources on the fabric (after the creation of one VLIW core) to instantiate other VLIW cores, in line with the idea of dynamic cores expressed by Hill and Marty (2008). Instantiation of multiple VLIW cores at the same time is essential only in the presence of TLP. In order to exploit TLP (when available) we decided to use dynamic-dataflow based task scheduling, so that the overhead of creating new tasks and synchronizing among tasks is very low. During our initial years, we identified that custom-IP blocks are key to obtaining performance and energy efficiency. Therefore, the CGRA and the compilation flow had to support the integration of custom-IP blocks.

The motivations for developing a new CGRA can be summarized as follows:

• A CGRA that can be programmed through a high-level language specification (viz. the C language) with little or no manual overrides. Further, the CGRA should accept any program written using the C89 standard.

• A CGRA that achieves good performance through reduction of reconfiguration time and execution time. The reduction in execution time can be achieved through exploitation of available ILP, DLP and TLP.

• A CGRA that can be customized for a specific domain through instantiation of domain-specific FUs. This helps improve the performance and energy consumption of the CGRA.

• A CGRA which is coupled loosely to the host processor such that the CGRA and host processor can be efficiently utilized. This provides us the ability to offload large and independent tasks onto the CGRA, while the host processor continues to execute other tasks.

3 This ability translates into two requirements: (i) the ability to exploit ILP by scheduling different portions of the instruction stream on different nodes of the 2-D structure, and (ii) the ability to create such a group of processing elements at runtime.

An architecture that catered to all of these requirements did not, to the best of our knowledge, exist at the point of its inception. The primary contribution of this thesis is the architecture of a CGRA that addresses most of these requirements.

1.4 Overall Setting of this Thesis

This thesis is a part of a larger project in which the CGRA, its compiler, the partitioning and mapping algorithms, and the NoC were developed. This thesis pertains to the architecture of the CGRA and its implementation. Other theses cover different aspects of this work:

1. The thesis by Alle (2012) provides the details of the compiler.
2. The thesis by Krishnamoorthy (2010) provides the details of the partitioning and mapping algorithms that can be used along with this architecture.
3. The NoC is detailed in the thesis by Fell (2012).

1.5 Organization of the Thesis

The thesis describes the architecture and design of our CGRA. This description is preceded by a description of the background and related work in chapter 2. Chapter 3 lays down the theoretical foundation for a macro-dataflow based execution model on a CGRA. This chapter is the keystone: it covers all aspects of the architectural behavior and defines the interface between the compiler and the architecture. The following chapter, chapter 4, covers the microarchitectural details of the 2-D array of computation units. Chapter 5 covers the microarchitectural details of the macro-dataflow orchestration subsystem. The experimental framework and results of our initial evaluations are presented in chapter 6. We perform several microarchitectural improvements and evaluate the effectiveness of each of these techniques; this is presented in chapter 7.

The conclusions and a proposal for future work are presented in chapter 8.

1.6 Note to the Reader/Reviewer

A reader/reviewer of this thesis may follow two possible orders (while reading the core chapters):

1. chapters 3 → 4 → 5 → 6 → 7
2. chapters 3 → 5 → 4 → 6 → 7

A reader/reviewer with a background in CGRAs would find our use of macro-dataflow orchestration quite unique, as CGRAs have traditionally relied on the host processor for sequencing through the set of configurations. Another characteristic which is certainly unique among CGRAs is the exclusive use of a NoC as the interconnection network between computation units. Over the course of the thesis, we evaluate the effects of these choices.

A reader/reviewer who is familiar with literature from the architectural space pertaining to GPPs would find that we have evaluated one of the design points proposed by Hill and Marty (2008)4 (Symmetric Multicore with Sixteen 1-BCE cores). While both Hill and Marty (2008) and Borkar and Chien (2011) describe a heterogeneous solution as a possible design point, details such as the capability of each of these small cores, the type of interconnection network to be employed, and the programming model are not specified. We evaluate a particular configuration of this design point and the results presented here apply to that subset only. However, the reader/reviewer would be able to draw inferences about other design points from the data presented in this thesis.

4 Our design and architecture predate the publication by Hill and Marty (2008). As the model proposed by Hill and Marty (2008) is generic, almost all CGRAs would fit it.

Chapter 2

Background and Related Work

What is a Reconfigurable Processor? Is there a (widely accepted) definition? – Dr. Vinod Kathail1

“Using FPGAs for computing led the way to a general class of computer organizations which we now call reconfigurable computing architectures. The key characteristics distinguishing these machines are that they both:

• can be customized to solve any problem after device fabrication

• exploits a large degree of spatially customized computation in order to perform their computation."

This is an excerpt from a seminal paper on reconfigurable computing by DeHon and Wawrzynek (1999). Unfortunately, in the context of today's processors, this description could refer to a whole gamut of solutions; these include FPGAs, multi-core processors, and architectures such as RAW (Agarwal et al., 1997; Lee et al., 1998; Waingold et al., 1997), Wavescalar (Swanson et al., 2003, 2007), TRIPS (Burger et al., 2004; Sankaralingam et al., 2003), DAP-DNA (Sato et al., 2005; Sugawara et al., 2004), NEC-DRP (Amano et al., 2004; Suzuki et al., 2004; Toi et al., 2006), Polymorphic Pipeline Array (PPA) (Park et al., 2009, 2010), PACT-eXtreme Processing Platform (XPP) (Cardoso and Weinhardt, 2002; Technologies, 2006a,b,c), Architecture for Dynamically Reconfigurable Embedded Systems (ADRES) (Mei et al., 2005, 2002, 2003), etc. In the subsequent sections, we identify some of the key characteristics that differentiate reconfigurable processors from other processing paradigms.

1 This statement motivated some of our very initial search for the definition of Reconfigurable Architecture. Today, after a lot of searching, we know that it was an evolution from different origins, rather than a defined class. Thank you for the question.

2.1 Reconfigurable Architectures

A reconfigurable architecture is a processing platform which comprises an interconnection of computation units. This interconnection of computation units is referred to as the reconfigurable fabric. The computation units (referred to as processing elements by Amano (2006)) can be anything from the fine-grained elements of a PLA or a LUT employed in an FPGA to much coarser-grained units, viz. ALUs. A plurality of these computation units can be dynamically combined to execute a larger functionality. The interconnection employed could have different topologies (viz. bus-based, 2-D mesh, honeycomb) and the switching functionality could be implemented in many ways (viz. programmable interconnection network, NoC, point-to-point interconnection network, bus). Execution on a reconfigurable architecture is preceded by a programming of the computation units and the interconnection network. The "instructions" to the compute unit determine the function to be executed, and the "configuration" for the interconnection network determines the direction of switching upon the arrival of data from a certain direction2. The "instruction" and "configuration" together are referred to as the hardware context. The key observation one can make from this execution paradigm is the presence of explicit communication between the computation units, facilitated by the interconnection network. This is unlike a multi-core processor, where communication is accomplished through the use of shared memory semantics. An application written for a reconfigurable architecture is compiled/synthesized into a collection of communicating operations, where operations are mapped to computation units and the communication is facilitated through the interconnection network. This is in contrast to the explicit creation of threads in applications written for multi-core processing platforms.

Amano (2006), in his survey paper, indicates that architectures like RAW (Agarwal et al., 1997) would fall under the class of tile processors, as each computation unit on the fabric occupies a much larger area than in traditional CGRAs and is thus more powerful. By this reasoning, TRIPS (Sankaralingam et al., 2003) and Wavescalar (Swanson et al., 2003), which have similarly large computation units, would not belong to the class of reconfigurable architectures and are instead called tile processors. However, this definition based on area is very qualitative (as there is no reference reconfigurable architecture against which it can be compared) and therefore does not help in easy classification. In this thesis we do not differentiate between CGRAs and tile processors and refer to both these types of architectures as CGRAs.

2 The configuration of the interconnection network depends on the type of the interconnection network. This description applies to a programmable interconnection network.

2.2 Classification of Reconfigurable Architectures

The micro-architectural design space of a CGRA is quite large and can be expressed as a 7-tuple (C,N,T,P,O,M,H). Each element of the 7-tuple can be assigned different values from a set of choices. The choice of value for each of the elements of this 7-tuple space has an impact on the programmability, performance and power/energy of the resulting CGRA. The use case determines the constraints placed on the design of the CGRA, such as area, power and energy. These constraints limit the design space, as certain design choices may lead to a point outside the area, power and/or energy limits. The parameters listed above are a superset of the parameters identified by Amano (2006) and Compton and Hauck (2002). The explanation for each of the terms in the 7-tuple is given below.

Computation unit (C): The choice of computation unit is the first and the most critical choice to be made during the design of a reconfigurable architecture: the choice is between fine-grained units and coarse-grained units. A fine-grained computation unit has a typical input bit width of 1-4 bits (and can be implemented as a LUT). This type of unit is very efficient when realizing designs that involve bit-level manipulations or designs with low-level state machines. On the other hand, coarse-grained units typically comprise FUs or ALUs with input bit widths ranging from 8 to 32. A unit which implements a single functionality is referred to as an FU, and ALU is used to refer to a unit which implements multiple functions. While FUs require little or no configuration (and thus have low reconfiguration overhead), the design of an interconnection of FUs can be quite challenging. Determining the organization of the FUs (i.e. which FUs are connected to which FUs) is critical to the efficient utilization and performance of the reconfigurable fabric. ALUs, on the other hand, require configuration information (and thus have higher reconfiguration overhead) but make the task of instruction placement easier due to the versatility of the ALU.

In the case of an ALU, the architect must also decide whether the fabric would be a homogeneous collection of computation units or a heterogeneous one. A homogeneous fabric makes the task of instruction placement very simple, whereas a heterogeneous fabric may present a more pragmatic choice: making all ALUs equally capable may lead to an infeasible design point in terms of area, power, etc., based on the use case. The coarse-grained units are efficient in executing programs with coarse-grained mathematical functions, viz. add, subtract. A reconfigurable architecture with coarse-grained computation units is referred to as a Coarse-Grained Reconfigurable Architecture.

Interconnection Network (N): The designer can choose one of the following types of interconnections: programmable, point-to-point, bus or a Network-on-Chip. A programmable interconnection network includes switches which are pre-programmed through writes to the memory cells connected to them. A point-to-point network allows data forwarding from a node to only its immediate neighbours. In order to perform a multi-hop communication one needs forwarding instructions at each of the intermediate nodes. These instructions are a part of the configuration. A NoC is an interconnection of routers where each router checks the destination field of the packet to be forwarded and determines the appropriate neighbouring router to which it needs to be forwarded. A programmable interconnection network incurs a configuration cost associated with determining the behavior of each switch. This configuration must be passed along with the hardware context. On the other hand, a point-to-point network incurs no cost associated with programming the interconnection network, but appropriate forwarding instructions need to be inserted in intermediate nodes. NoC based solutions also do not incur configuration cost, but need to implement a routing algorithm in hardware. Unlike the NoC and programmable interconnection networks, the point-to-point interconnection network directly connects two computation units. On a reconfigurable fabric employing a NoC (or a programmable interconnection network) the computation unit is connected to a router (or a switch) and the routers (or switches) are interconnected together. The point-to-point interconnection network is efficient for near-neighbour communication since it connects neighbouring compute units directly. This form of network does not incur an additional routing latency between neighbouring computation units. Apart from these basic types, a designer may choose a hierarchical interconnection where different network types are used at different levels of the fabric. For example, a close group of compute elements can be interconnected through point-to-point links, and the groups of compute elements can be interconnected with a NoC.

In the case of hierarchical interconnections, bus-based communication links can also be provided (in addition to the set of choices mentioned above) at the first level. The designer may also choose to employ multiple networks of different kinds, each with a specific purpose; we refer to this as a hybrid network. Other than the type, the designer also needs to choose the topology of the interconnection network. The 2-D mesh is the most popular among them. Other topologies include bus-based, honeycomb, hexagonal, etc. The interconnection network between the computation units has the same width as the bit-width of the computation unit, or is largely determined by it.
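As an illustration of the NoC option, the sketch below shows a destination-based routing decision of the kind a router implements in hardware. The XY policy and the field names are assumptions chosen for illustration on a 2-D mesh; the honeycomb fabric described later in this thesis uses a different routing function.

    #include <stdio.h>

    /* Illustrative sketch only: a packet header carrying a destination, and a
     * dimension-ordered (XY) routing decision for a 2-D mesh NoC.  The field
     * names and the XY policy are assumptions for illustration. */
    typedef enum { PORT_LOCAL, PORT_EAST, PORT_WEST, PORT_NORTH, PORT_SOUTH } port_t;

    typedef struct {
        int dest_x, dest_y;   /* destination router coordinates            */
        unsigned payload;     /* operand or instruction word being carried */
    } packet_t;

    /* Each router compares the packet's destination with its own coordinates
     * and forwards the packet one hop closer; no per-hop configuration is
     * loaded, which is why a NoC carries no routing entries in the context. */
    static port_t route_xy(const packet_t *p, int my_x, int my_y) {
        if (p->dest_x > my_x) return PORT_EAST;
        if (p->dest_x < my_x) return PORT_WEST;
        if (p->dest_y > my_y) return PORT_NORTH;
        if (p->dest_y < my_y) return PORT_SOUTH;
        return PORT_LOCAL;    /* arrived: deliver to the attached compute unit */
    }

    int main(void) {
        packet_t p = { 3, 1, 0xCAFE };
        printf("router (1,1) forwards to port %d\n", route_xy(&p, 1, 1)); /* EAST */
        return 0;
    }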

Support for Single or Multiple Contexts (T): Another question which needs to be answered early on during the design of a reconfigurable architecture is whether it is expected to run kernels/applications whose sizes are much larger than what can be accommodated within one hardware context. If it is, then the application needs to be divided into multiple temporal partitions. In such a scenario, support for multiple contexts helps in reducing the time needed to switch from one temporal partition to the other through prefetch of the next temporal partition. Almost all FPGAs support storage of a single context.

Support for Partial Reconfiguration (P): Partial reconfiguration refers to writing the configuration to a specific region of the reconfigurable fabric without affecting the configuration loaded in other regions of the reconfigurable fabric. This feature allows (i) overwriting a portion of the configuration for runtime modifications (viz. FPGAs) and (ii) having temporal partitions which partially occupy the fabric. This permits loading of temporal partitions while other temporal partitions continue to execute undisturbed on the fabric. Partial reconfiguration can be supported along with single-context or multi-context reconfiguration models. Partial reconfiguration is a common feature found on modern FPGAs.

Choice of Type of Orchestration (O): In a reconfigurable architecture where the entire application cannot be completely mapped, the application is transformed into multiple temporal partitions. These temporal partitions are time-multiplexed onto the reconfigurable fabric. The order in which these temporal partitions are executed is determined based on program order. The orchestration of these is typically managed by an external entity (external to the reconfigurable fabric).

Choosing a sequential execution model for such an orchestration unit leads to a simple hardware implementation. However, in this model the exploitation of TLP needs to be done through a software mechanism for maintaining state (viz. through creation of a new call stack for each thread which is instantiated). An alternate mechanism would be to use a dataflow based orchestration model. In this model, temporal partitions are launched as and when they are ready for execution. The third form of orchestration is a reduction-oriented orchestration of temporal partitions. A reduction-oriented orchestration would require support for context switches on the reconfigurable fabric; implementing context save on a reconfigurable fabric incurs a high overhead. Another form of orchestration, which has been employed in NEC-DRP (Amano et al., 2004; Suzuki et al., 2004; Toi et al., 2006), is based on a state machine. The state machine determines the next state based on the current state and the results of certain operations. The state space is computed based on the application specification. This model is not a different paradigm and can be used to implement either sequential or dataflow execution paradigms.
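A minimal sketch of the dataflow style of orchestration described above: a temporal partition is launched once all of its inputs have arrived. The structure and names here are illustrative assumptions only, not the orchestration unit detailed in chapter 5.

    #include <stdio.h>
    #include <stdbool.h>

    /* Illustrative sketch only: dataflow-style orchestration launches a
     * temporal partition when all of its inputs are available.  The structure
     * and names are assumptions for illustration. */
    typedef struct {
        const char *name;
        int inputs_expected;
        int inputs_arrived;
        bool launched;
    } partition_t;

    /* Called whenever a producing partition delivers an operand. */
    static void deliver_input(partition_t *p) {
        p->inputs_arrived++;
    }

    /* The orchestrator scans for ready partitions and launches them in any
     * order; no program-order sequencing or software thread creation is
     * needed. */
    static void orchestrate(partition_t *parts, int n) {
        for (int i = 0; i < n; i++) {
            partition_t *p = &parts[i];
            if (!p->launched && p->inputs_arrived >= p->inputs_expected) {
                p->launched = true;
                printf("launching %s\n", p->name);
            }
        }
    }

    int main(void) {
        partition_t parts[2] = { {"P0", 1, 0, false}, {"P1", 2, 0, false} };
        deliver_input(&parts[0]);   /* P0 becomes ready          */
        deliver_input(&parts[1]);   /* P1 still waiting on one   */
        orchestrate(parts, 2);      /* launches P0 only          */
        deliver_input(&parts[1]);
        orchestrate(parts, 2);      /* now launches P1           */
        return 0;
    }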

Organization of the memory hierarchy (M): Early reconfigurable architectures were designed for streaming inputs only and did not have a notion of a memory hierarchy. The data for the computation were available from the input ports. The lack of a memory hierarchy limits the type of programs that can be executed on these architectures. Any program which performs memory accesses that cannot be determined at compile time, or whose memory requirements exceed the available memory, cannot be executed on them (viz. NEC-DRP (Amano et al., 2004; Suzuki et al., 2004; Toi et al., 2006)). To address this, a memory hierarchy in the form of scratch-pad memories, block Random Access Memories (block-RAMs) or L1 data caches is connected to the periphery of the fabric. The presence of such a general purpose memory hierarchy allows any program to be executed on the CGRA. A third alternative would be to have an L1 cache or scratch-pad memory local to each computation unit on the CGRA. In this design, each computation unit occupies a larger area (viz. RAW (Agarwal et al., 1997; Lee et al., 1998; Waingold et al., 1997)). This design requires the compiler to determine the data-sharing mechanism between the computation units. Another possible design would be to have some memory units in place of computation units at certain positions on the fabric. This limits the total distance that needs to be traversed in order to access the memory.

Coupling with host processor (H): Reconfigurable processors are typically coupled with host processors. This coupling can take one of the four forms reported by Compton and Hauck (2002):

1. as an FU within the Central Processing Unit (CPU)
2. as a co-processor tightly coupled with the CPU
3. as a processing unit connected to the cache, or
4. connected to the CPU through an Input/Output (I/O) interface

The extent of coupling determines the extent to which the reconfigurable processor can operate independently of the host processor. If the reconfigurable processor is embedded within the CPU as an FU, it needs to interact with the host processor's pipeline on a per-instruction basis. When connected to the CPU as a co-processor, it can operate independent of the host processor for 10-100s of cycles and allows the host processor and the co-processor to execute in parallel. In the third mode, when connected with the cache, it can operate as a data cache or as an additional processing unit. In the last mode, when connected over the I/O interface, it allows the reconfigurable processor to work for long durations independent of the processor. The former two choices cause a tight coupling between the CGRA and the host processor, and the latter two are examples of loose coupling. In tightly coupled architectures, the task of orchestration is off-loaded to the host processor, which serves as a short-term scheduler. In other cases, the host processor only serves as a long-term scheduler. This choice, of tightly coupling the CGRA and the host processor, curtails the possible choices for the other subsystems: the computation unit is most likely to be an FU, the preferable network would be a programmable interconnection network or a point-to-point interconnection network, etc. This form of coupling is used mostly in processors that support instruction set extensions. In the loose coupling configuration, the reconfigurable architecture executes for 100s of cycles prior to interacting with the host processor. A loose coupling opens up the entire design space for the other subsystems. This model is used mostly to execute entire applications as opposed to accelerating instruction sequences. This is one of the earliest decisions to be taken, as this choice has an impact on the possible choices for other parameters.

Implementation of the Orchestrator: We have mentioned in the previous paragraphs the various forms of orchestration and the various degrees of coupling between the host processor and the CGRA. In many CGRAs the orchestrator is implemented in the form of a control program on the host processor. Some others use a dedicated hardware entity to orchestrate between the various temporal partitions. A third form of implementation of the orchestration is referred to as self-reconfiguration. In self-reconfiguration, each computation unit or one of the computation units on the fabric decides to reconfigure itself or the entire fabric and requests the instructions and data to be loaded for the subsequent temporal partition. We do not express this as a separate term of the design space, since it is related to the implementation of the orchestrator and the interaction between the CGRA and the host processor.

2.3 Compilation for Reconfigurable Architectures

Apart from those listed above, another important design parameter is the design entry point, i.e. the type of language in which the application is specified. Most fine-grained reconfigurable architectures have traditionally supported Hardware Description Languages (HDLs) such as Verilog and VHDL. Coarse-grained reconfigurable architectures typically support high-level languages, viz. a subset of C, the full C language etc. HDL is considered unsuitable for CGRAs since the operations supported in a CGRA are of coarser granularity. This decision is an important one and determines the type of support needed in hardware. For example, if the full C language is to be supported then support for loads/stores is needed. However, some reconfigurable architectures assume that data is streamed into the computation unit. Such an assumption cannot be made when pointers are present in the C code or in the presence of non-affine array accesses (Hennessy and Patterson, 2003). When an application is specified in HDL, compilation involves transforming it into its corresponding netlist, transforming the netlist into structures which are implementable on the specific reconfigurable processor (viz. aggregating one or more gates so that they can be mapped onto a 4-input LUT, or modifying an add function to use the carry-chain etc.), performing a placement of these mappable structures onto computation units on the reconfigurable processor, followed by routing. The routing step involves determining the exact paths between two communicating computation units and appropriately programming the path to facilitate this communication. This process is very similar to the synthesis design flow used in the context of ASICs.

When a high-level language is used as the design entry point, two different compilation flows are possible for the generation of context information, namely high-level synthesis and compilation into an executable. In high-level synthesis the high-level language description is transformed into an intermediate language, which is then translated into a HDL description. This HDL description can then be targeted to a reconfigurable processor, in a manner similar to compilation for FPGAs. An instance of this form of compilation is discussed in the context of the NEC-Dynamically Reconfigurable Processor (DRP) in section 2.4.7. A different approach to compiling a high-level language description for a reconfigurable architecture is similar to compilation for a general-purpose processor; it produces an executable for the reconfigurable fabric. The high-level language description is translated into an intermediate form. This intermediate representation is partitioned into temporal partitions such that each temporal partition can be executed on the reconfigurable fabric. Each temporal partition is converted into a hardware context. Each temporal partition is further partitioned into spatial partitions, each of which is accommodated within the same computation unit. On reconfigurable architectures with homogeneous compute elements and a single context, this spatial partitioning is trivial; in reconfigurable architectures with multiple contexts, spatial partitioning refers to the creation of groups of operations that map onto the same computation units. This is followed by placement, where spatial partitions are mapped to computation units, and by a routing step which determines and allocates the path for communication between two computation units. This is finally translated into binary hardware contexts for execution. In reconfigurable architectures which have software-based orchestration of hardware contexts, i.e. where the host processor serves as the orchestrator, a control program needs to be generated to be executed on the host processor. It is to be noted that in all these approaches, the application is partitioned into temporal partitions and the execution of each temporal partition is accomplished through a compile-time determined composition of computation units. The composition may indicate the exact location within the reconfigurable fabric where it is to be placed, or the determination of the exact location may be postponed to runtime. Compilation of applications for tile processors (viz. RAW, TRIPS; as referred to by Amano (2006)) is akin to the compilation flow for a CGRA (with a high-level language as design input). The application is partitioned into temporal partitions and spatial partitions. Instructions within a spatial partition are sequenced through the use of a local controller.

The instructions exchange data with other instructions executing on different computation units. This exchange of data is facilitated over the interconnection network. An instruction within a spatial partition may be stalled until the arrival of data from another spatial partition. The execution of temporal partitions happens through compile-time determined compositions, in a manner similar to a CGRA. Due to the similarity between tile processors and CGRAs, we will henceforth use the term CGRA to encompass both CGRAs and tile processors. On the other hand, in a multi-core, the use of multiple processing cores by the same application requires the programmer to create software threads or processes. Each thread can be executed independently on these cores. These threads are mostly independent tasks which communicate infrequently. Data communication is facilitated through the use of techniques such as shared memory. We differentiate a multi-core from a reconfigurable architecture primarily based on their programming methodology and communication semantics (shared memory vs explicit communication). It should be understood that fine-grained reconfigurable architectures, CGRAs and multi-core processors form a continuum of processing solutions. The difference in granularity of the computation units makes one communication model more amenable to implementation than the others on these different platforms.
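To make the flow above concrete, the following Python sketch walks an application through temporal partitioning, spatial partitioning, placement and routing, ending in per-context placement and route information. It is a minimal illustration built on our own assumptions (greedy partitioning, round-robin grouping, trivial placement); it does not correspond to any particular compiler surveyed in this chapter.

# A minimal sketch (our own assumptions, not a specific compiler) of the
# executable-producing compilation flow for a CGRA: temporal partitioning,
# spatial partitioning, placement and routing, ending in hardware contexts.

from dataclasses import dataclass, field

@dataclass
class Operation:
    name: str
    deps: list = field(default_factory=list)   # names of producer operations

def temporal_partition(ops, max_ops_per_context):
    """Greedily split the operation list into temporal partitions that fit the fabric."""
    return [ops[i:i + max_ops_per_context]
            for i in range(0, len(ops), max_ops_per_context)]

def spatial_partition(partition, num_units):
    """Group operations of one temporal partition onto computation units (round robin)."""
    groups = [[] for _ in range(num_units)]
    for idx, op in enumerate(partition):
        groups[idx % num_units].append(op)
    return groups

def place_and_route(groups):
    """Assign each group to a unit and record unit-to-unit routes for dependences."""
    placement = {op.name: unit for unit, group in enumerate(groups) for op in group}
    routes = [(placement[dep], placement[op.name])
              for group in groups for op in group for dep in op.deps
              if dep in placement and placement[dep] != placement[op.name]]
    return placement, routes

if __name__ == "__main__":
    ops = [Operation("a"), Operation("b", ["a"]), Operation("c", ["a"]),
           Operation("d", ["b", "c"])]
    for tp in temporal_partition(ops, max_ops_per_context=4):
        groups = spatial_partition(tp, num_units=2)
        placement, routes = place_and_route(groups)
        print("context:", placement, "routes:", routes)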

2.4 Modern CGRAs

CGRAs have been in existence for nearly two decades now. In fact, the timing of this thesis coincides with the completion of a decade since the paper titled “A Decade of Reconfigurable Computing" by Hartenstein (2001) was published. This paper summarizes several CGRAs which were active at that time. Several of these architectures are not covered in this section. We restrict the related work survey to those architectures which appeared starting from RAW (Agarwal et al., 1997).

2.4.1 RAW Architecture

The primary philosophy of RAW (Agarwal et al., 1997; Lee et al., 1998; Waingold et al., 1997) is to use simple and scalable hardware; the complexity is transferred to the compiler and runtime system. The authors carried forward the design principles of VLIW processors. There are certain important differences between VLIW and RAW architectures. VLIW architectures suffer from scalability issues pertaining to register file ports and wire delays. RAW tries to address this by making two important changes.

The RAW architecture is a matrix of tiles, each housing a simple processor pipeline which has its own register file, instruction memory, data memory and custom logic which can be used to synthesize application specific accelerators. The switch is integrated into the processor pipeline, allowing processors to access data from a neighbouring processor’s register file. They refer to this as Non-Uniform Register Access (NURA). They employ a hybrid network: a point-to-point interconnection network links all computation units, and the compiler schedules the transports on it (i.e. all conflicts are known statically and hence resolved by the compiler through appropriate scheduling of data transports); a dynamic network comprising a NoC is used in cases where the compiler is unable to schedule the transports in advance. The authors indicate that ILP is exploited by scheduling independent instructions on to different tiles. However, it should be noted that not all parallel instructions can be scheduled in neighbouring processors (especially so if these parallel instructions produce data for another instruction which immediately follows them in the ASAP levels).

Applications are specified in C or FORTRAN. The RAWCC compiler is implemented using the SUIF (Wilson et al., 1994) infrastructure. The initial phase performs traditional compiler optimizations. This is followed by a phase which performs basic block orchestration and control orchestration. During basic block orchestration, a number of initial code transformations are performed which enable the construction of a data dependence graph. This is followed by an instruction partitioning stage. Instruction partitioning is achieved by first clustering those instructions which are sequential or have ILP that cannot be exploited due to the delays involved on the interconnection network. Subsequently the clusters are merged such that the count of instruction partitions matches the number of tiles. This stage does not assign instruction partitions to tiles. This is followed by global data partitioning, in which each scalar value is assigned a home tile on which it will be resident. This stage groups together, into a single tile, those data elements which are frequently accessed by the instructions of a tile. This is followed by the instruction and data placement phase, where the groups of data and instruction partitions are assigned specific tiles, on which they remain resident. Appropriate communication code is generated, in the form of send and receive primitives. The receive primitive is blocking and awaits the arrival of the message initiated by send. A send may arrive prior to the execution of the receive primitive. The compiler determines and schedules the various computation and communication events.

The compiler should ensure that the schedule of communication generated does not cause a deadlock on the network. The use of a blocking receive primitive imparts the ability to withstand the long unpredictable latencies associated with memory access. This property is a key requirement for real world systems. Further, the statically computed schedule is not affected by the unknown latencies, as the ordering of the events remains the same. During execution, the processor pipeline continues to forward results on the static network to enable communication between non-neighbouring tiles, even if the pipeline is stalled awaiting the arrival of a data value. This proceeds only if the communication itself is not control dependent on the value of the data to be received. The dynamic network, which implements wormhole routing with dynamic decisions, can be used to transport dynamic events. Dynamic events occur when the compiler cannot statically determine data dependencies or the latency of an operation. One of the important features of this architecture is its ability to convey branch decisions in a distributed architecture (instead of computing them at various places). The RAW processor implements asynchronous global branching. In this scheme, the tile which houses the decision variable determines the branch decision and broadcasts the result to the other tiles. The other tiles execute the appropriate branch without the need for explicit synchronization. Further, to minimize this communication, a control localization optimization is sometimes performed, i.e. instructions belonging to the same basic block are localized to the same tile, thus avoiding the broadcast of the branch decision. The RAW architecture uses two separate instruction streams, i.e. one for the processor and one for the switch. The end result is configuration information that is very similar to an FPGA, but at the granularity of word-sized instructions as opposed to bit-sized manipulation.
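The role of the blocking receive primitive in tolerating unpredictable latencies can be illustrated with a small behavioural model. The sketch below is our own approximation, not RAWCC-generated code: a bounded queue stands in for one static-network channel between two tiles, a put models send and a blocking get models receive.

# A hypothetical model (not RAWCC output) of the send/receive primitives:
# 'send' deposits a word on a channel, 'receive' blocks until the word arrives,
# so a statically computed schedule tolerates unpredictable memory latencies.

import threading, queue, random, time

channel = queue.Queue(maxsize=4)   # one static-network channel between two tiles

def tile0():
    # Producer tile: the load latency is unpredictable, the ordering is not.
    for value in (10, 20, 30):
        time.sleep(random.uniform(0.0, 0.05))  # models a variable-latency load
        channel.put(value)                      # send: may block if the channel is full

def tile1():
    # Consumer tile: receive blocks until the matching send has arrived.
    total = 0
    for _ in range(3):
        total += channel.get()                  # blocking receive
    print("tile1 computed", total)

t0, t1 = threading.Thread(target=tile0), threading.Thread(target=tile1)
t0.start(); t1.start(); t0.join(); t1.join()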

2.4.2 PACT eXtreme Processing Platform

PACT-XPP (Cardoso and Weinhardt, 2002; Technologies, 2006a,b,c) is a CGRA that has no explicit host processor, i.e. the functionality of the host processor is subsumed by special-purpose units on the reconfigurable fabric. It contains a set of VLIW-like processors called Function (FNC) Processing Array Elements (PAEs). These can execute up to eight 16-bit instructions in the same clock cycle. The eight processing units can be used to execute four back-to-back condition dependent instructions in a single clock cycle. Other than the FNC-PAE, the processing array contains two other types of PAEs, namely: ALU-PAE and RAM/IO-PAE.

Each ALU-PAE contains three units and each unit can execute a single functionality. The central unit of the ALU-PAE is a basic 28-bit ALU which can execute arithmetic and logic operations. Two other units called FREG and BREG, on either side of the 28-bit ALU, support data forwarding between ALUs and other specific functionality such as additional add/subtract units etc. The RAM/IO-PAE is similar to the ALU-PAE in structure; the ALU is replaced by a memory unit or an I/O unit. A point-to-point interconnection network is employed, with one cycle latency between the PAEs. The FREG and BREG units in the ALU-PAE and RAM/IO-PAE help forward data. Horizontal and vertical links are available for interconnecting the RAM/IO-PAEs and ALU-PAEs. Only horizontal links are available on the FNC-PAEs. The PACT-XPP compiler (Cardoso and Weinhardt, 2002; Technologies, 2006a) accepts applications specified in C/C++. Applications are compiled for the FNC-PAE, by the FNC C compiler, and the execution is profiled to identify hotspots within the code. These hotspots, which comprise function calls and innermost loops, are good candidates for acceleration on the ALU-PAEs and RAM/IO-PAEs. Compute-intensive segments of the code are compiled into a dataflow graph and each node of the dataflow graph is assigned to a single unit within an ALU-PAE or FNC-PAE. This kind of arrangement performs well if the portion of the code mapped to the dataflow fabric is executed repeatedly, so as to amortize the cost of configuring the dataflow fabric. The reconfigurable fabric implements the static dataflow execution paradigm (with acknowledgements for every data transfer). This enables exploitation of pipeline parallelism on the reconfigurable fabric, especially in the context of loops. Control dominated codes are run on the FNC-PAE. The compiler for the reconfigurable fabric, referred to as the XPP Vectorizing Compiler, does not handle C++ constructs, pointers and floating-point operations. The reconfigurable fabric allows the execution of codes larger than what can be accommodated within one hardware context. The reconfigurable fabric or the FNC-PAE can trigger a reconfiguration. PACT-XPP supports multiple hardware context buffers and supports loading of the context buffers while execution is in progress. It also supports partial reconfiguration of the reconfigurable fabric. The PACT-XPP architecture and the ADRES architecture (explained in section 2.4.5) have a tight coupling between the host processor and the reconfigurable fabric. The static dataflow implemented on the reconfigurable fabric is similar to what has been implemented in DAP-DNA (section 2.4.4). However, unlike DAP-DNA, PACT-XPP supports more than one type of operation within each ALU-PAE, which makes it easier to map applications to the reconfigurable fabric.


2.4.3 TRIPS

TRIPS (Burger et al., 2004; Sankaralingam et al., 2003) is the first processor implemented based on the Explicit Data Graph Execution (EDGE) architecture. The instructions in TRIPS name the destinations to which the result of the operation is to be forwarded. This is unlike traditional von-Neumann processors, which specify their source operands (typically in a register file). In this manner, the consumers of an instruction in the dataflow graph are explicitly specified, as opposed to the implicit mechanism used in other von-Neumann based architectures. It should be noted at this point that TRIPS is meant to be a general-purpose processor (while CGRAs have been mostly employed in the context of embedded applications). CGRAs have been using this model of explicit reference to destinations and direct delivery of data to the consumer since their inception. TRIPS is a statically placed and dynamically issued processor, i.e. the placement of instructions is done statically by the compiler but the issue of instructions proceeds in a dataflow manner. The TRIPS architecture contains a matrix of 4 × 4 execution units. Each execution unit has its own instruction buffer, where various hardware contexts (i.e. instructions) are stored, and operand buffers to store the data operands. The execution unit can issue at most one instruction every clock cycle and supports both integer and floating point instructions (other than floating point division). The execution units are connected by a lightweight point-to-point interconnection. Each execution unit can hold up to 128 contexts; it can choose an instruction for execution independent of other execution units. There are four banks of register files connected along the columns to enable exchange of data across temporal partitions. There are also four banks of instruction and data caches connected along the rows, which supply data and instructions to the execution units. There is one global tile which determines the loading of newer contexts and performs branch prediction to load the next context (details of which can be found in the paper by Burger et al. (2004)). Applications to be executed on TRIPS can be specified in either C or FORTRAN. Applications are compiled into TRIPS blocks, which are similar to hyperblocks (Mahlke et al., 1992). These TRIPS blocks are of fixed size (128 instructions) and have additional modifications specific to TRIPS. The TRIPS blocks increase the amount of straight-line code through predication.

The compiler then transforms the hyperblocks into a dataflow graph for code generation. The instructions are suitably partitioned into groups which reside within the same execution unit. These partitions are then suitably mapped to reduce the communication distance. A single TRIPS block can span several hardware contexts and this collection of hardware contexts which are part of the same TRIPS block is referred to as an A-frame. The architecture allows up to 16 hardware contexts per A-frame and up to 32 A-frames can be loaded at the same time. The architecture supports three different execution mechanisms for exploiting ILP, TLP and DLP. In a configuration termed D-morph, it allows extensive speculation to improve the exploitation of ILP. In this configuration, the G-tile allows loading multiple speculated A-frames. The architecture supports speculative execution of A-frames; the results are committed to the register files and memory upon detecting a correct prediction, or they are rolled back. To allow exploitation of TLP, TRIPS supports an execution mechanism called T-morph. In this configuration, it allows 2 frames per thread. A hyperblock belonging to a thread is atomically executed and then the hyperblock belonging to another thread is chosen. The register file and data caches are appropriately partitioned for use by the various threads. In the S-morph configuration, used for exploiting DLP, a super A-frame is constructed by replicating the A-frames belonging to a loop. A more recent publication (Putnam et al., 2011) details the second version of the architecture, called E2, and shows that a vectored execution mode can be supported in the modified architecture.
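The difference between destination-naming and source-naming encodings can be made concrete with a toy interpreter. The encoding below is our own invention for illustration and is not the TRIPS ISA: each instruction lists (consumer, operand slot) targets and issues once both of its operand slots have been filled, mirroring the dataflow issue described above.

# A toy dataflow-issue interpreter (our own encoding, not the TRIPS ISA):
# each instruction names the consumers of its result instead of naming
# source registers; an instruction issues once all its operand slots are full.

OPS = {"add": lambda a, b: a + b, "mul": lambda a, b: a * b, "const": None}

# program: id -> (opcode, immediate, targets); targets are (consumer id, slot)
program = {
    0: ("const", 3, [(2, 0)]),
    1: ("const", 4, [(2, 1), (3, 0)]),
    2: ("add", None, [(3, 1)]),      # 3 + 4
    3: ("mul", None, []),            # 4 * (3 + 4)
}

operands = {i: {} for i in program}          # filled operand slots per instruction
ready = [i for i, (op, _, _) in program.items() if op == "const"]

while ready:
    i = ready.pop()
    op, imm, targets = program[i]
    result = imm if op == "const" else OPS[op](operands[i][0], operands[i][1])
    for consumer, slot in targets:           # forward the result to named consumers
        operands[consumer][slot] = result
        if len(operands[consumer]) == 2:     # add/mul are dyadic in this toy ISA
            ready.append(consumer)
print("instruction 3 produced", result)      # 28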

2.4.4 IPFlex-DAPDNA

IPFlex’s DAPDNA (Sato et al., 2005; Sugawara et al., 2004) was one of the first commercially available general-purpose dynamically reconfigurable processors. The hardware included a Reduced Instruction Set Computer (RISC) core referred to as the Digital Application Processor (DAP). This serves as the host processor. A reconfigurable fabric of 376 Processing Elements (PEs), referred to as the Distributed Network Architecture (DNA), is also present. There are two types of PEs, namely: data processing PEs and data I/O PEs. Each PE implements only one function (i.e. it is not a general purpose ALU). The PEs are interconnected by a programmable interconnection network. The reconfigurable fabric is subdivided into 6 segments, where each segment contains 8×8 PEs, forming a total of 376 PEs in the DNA. The DNA has the ability to store up to three additional hardware configurations while execution of one configuration is in progress.

The reconfiguration of the DNA can be triggered either by the DAP or autonomously by the DNA. The application design flow proceeds as follows: an application written in the C language is first compiled and run on the DAP. The hotspots within the code are identified and those hotspots are rewritten in a different language called Dataflow C (a variant of C). This is compiled onto the DNA (some manual intervention may be needed). During execution, the DAP transfers control to the DNA for execution and blocks until the data is available from the DNA. While the idea of a coarse-grained replacement for an FPGA was interesting, there was one flaw in the design as we perceive it. The DAP-DNA fabric is an interconnection of single-function PEs. The organization of the PEs, i.e. which PE is connected to which other PE, is predetermined. The choice and cardinality of specific types of PEs and their placement are therefore critical. Absence of the required number of certain PEs implies that an appropriate level of PE utilization cannot be achieved within that section of the fabric. Even if sufficient numbers of all PEs were available, their placement has an impact on the latency and hence the performance. Later CGRAs avoided the use of simple FUs and employed fully functional ALUs.

2.4.5 ADRES

ADRES (Mei et al., 2005, 2003) has a tightly coupled VLIW processor and a coarse-grained reconfigurable fabric. The VLIW processor is formed by a set of FUs which share a register file and have an instruction delivery and decode pipeline. This set of FUs, which forms the VLIW processor, in fact forms the first row of the reconfigurable fabric. The remaining elements in the reconfigurable fabric are Reconfigurable Cells (RCs). Each RC contains a FU and a small private register file. ADRES is a template architecture which can be domain specialized. Therefore the choice of functionality in each FU is left to the specific customization. Each FU can handle word-sized operations (i.e. 32-bit operations). The FUs are interconnected using a point-to-point mesh topology. However, the topology too can be varied using the configuration file (Mei et al., 2005). The RCs have access to the register file associated with the VLIW processor. This innovative design of closely coupling the host processor and the reconfigurable fabric allows very close interactions between the two. The application to be compiled for this platform can be specified in the C language. The compilation framework, called the Dynamically Reconfigurable Embedded System Compiler (DRESC) (Mei et al., 2002), compiles code for both the VLIW processor and the CGRA.

The compiler identifies innermost loops with feed-forward dependences and transforms them into a hardware context for the CGRA. The interface between the portions of the application that execute on the CGRA and the VLIW is identified by the compiler and is either left in the shared register file or stored in memory. The innermost loops are modulo scheduled to improve the efficiency of execution. Multiple hardware contexts can be stored within each RC and context switching can be performed within a single cycle. The RC has access to the data memory hierarchy; the data memory hierarchy closely resembles the hierarchy in modern general-purpose processors.
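Modulo scheduling, as used for the innermost loops, selects an initiation interval (II) no smaller than the resource-constrained and recurrence-constrained lower bounds. The snippet below illustrates only these standard lower bounds under a simplified resource model; it is not the DRESC scheduler.

# Illustrative lower bounds used when modulo scheduling a loop (not DRESC itself):
# ResMII from resource usage, RecMII from dependence cycles, II >= max of the two.

import math

def res_mii(op_count, num_fus):
    """Resource-constrained minimum initiation interval."""
    return math.ceil(op_count / num_fus)

def rec_mii(cycles):
    """Recurrence-constrained MII: for each dependence cycle, ceil(latency/distance)."""
    return max(math.ceil(lat / dist) for lat, dist in cycles) if cycles else 1

# Example: 10 operations on a fabric with 8 usable FUs, and one loop-carried
# recurrence of total latency 6 spanning 2 iterations.
ops, fus = 10, 8
cycles = [(6, 2)]
ii = max(res_mii(ops, fus), rec_mii(cycles))
print("II >=", ii)   # max(ceil(10/8), ceil(6/2)) = 3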

2.4.6 Wavescalar

Wavescalar (Swanson et al., 2003, 2007) is an Instruction Set Architecture (ISA) based on the dataflow paradigm. The processor that implements this ISA is called the WaveCache. The WaveCache loads instructions from memory and assigns them to PEs for execution. The instructions remain within the WaveCache until they are evicted from the cache. They are evicted when the working set of the application changes and these instructions are no longer needed. The processor behaves like a cache, fetching instructions and data when they are referenced and not already available in the cache. It evicts instructions/data when their locations are needed by other instructions/data and the currently resident data is no longer needed. The processor contains an array of interconnected PEs. Each PE contains an ALU and implements a dynamic dataflow based firing logic. There is an instruction store and three operand stores. Apart from this there is a tracker board which tracks ready instructions. In case an operand is resident within the PE for too long, it can be evicted to the operand store in memory. When all operands of an instruction are available, the instruction executes. The PE implements a five-stage pipeline. Bypass stages are available to enable back-to-back firing of instructions. The results of a computation are directly delivered to the consuming instruction. Two PEs are connected together to form a pod. The PEs within a pod have access to each other’s bypass bus, enabling back-to-back firing of instructions in the two different PEs of a pod. Four pods are interconnected through a pipelined bus to form a domain and four domains are interconnected through a network to form a cluster. This hierarchical interconnection enables the delivery of data to the appropriate instructions.

An application written in a high-level language, viz. C, can be compiled automatically for the Wavescalar processor. The compilation partitions the application into blocks, called waves. A wave is akin to a code block (Iannucci, 1988). Since the architecture implements the dynamic dataflow execution model (Arvind and Culler, 1986), each data value must be identified by a tag. All instructions within a wave share the same tag. There is a wave number associated with the execution of every wave. One of the highlights of the Wavescalar ISA is its ability to implement the in-order memory semantics, demanded by programs written in imperative languages, on a dataflow machine. In order to implement it, the compiler assigns identifiers to all loads and stores within the wave based on program order. It also records the predecessor and successor identifiers (to work around loads and stores which may not be executed due to conditional statements). The hardware store buffer needs to ensure that it picks up loads and stores as per the numbering, predecessor and/or successor information provided by the compiler. There is one store buffer per domain. There is also one data cache per domain. The main difference between TRIPS (refer section 2.4.3) and Wavescalar is in the execution paradigm. The sequencing of hyperblocks in TRIPS happens as per von-Neumann semantics and the execution of instructions within a hyperblock happens as per static dataflow semantics. In Wavescalar the execution happens as per dynamic dataflow execution semantics. It is unclear to us why the WaveCache was implemented as a cache. Instruction window management can be explicitly performed by a compiler and has been done so in several cases. By implementing a cache, the management of the instruction window has been moved to the hardware, even though the information about instruction liveness is available at compile time. Operands which are produced much before their consumption are stored along with their instruction in the WaveCache and are evicted by the hardware if space is needed. In practice, it is our observation that such operands are not found in large numbers and just storing them in memory would be sufficient in most cases.
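The wave-ordered memory mechanism can be pictured with a small model in which each memory operation carries its sequence number and the sequence number of its dynamic successor, and the store buffer issues operations in that chained order even when they arrive out of order. This is our own simplification (in particular, we assume the dynamic successor is known), not the WaveCache store-buffer logic.

# A simplified model of wave-ordered memory (our own simplification, not the
# WaveCache store buffer): each memory operation carries its sequence number and
# the sequence number of its dynamic successor; the store buffer issues
# operations in that chained order even if they arrive out of order.

def wave_ordered_issue(arrivals, first_seq):
    """arrivals: list of (seq, succ, label) in arrival order; returns issue order."""
    waiting, issued, expected = {}, [], first_seq
    for seq, succ, label in arrivals:
        waiting[seq] = (succ, label)
        # Drain every operation whose turn has come.
        while expected in waiting:
            nxt, lbl = waiting.pop(expected)
            issued.append(lbl)
            expected = nxt
    return issued

# Two memory ops of one wave arriving out of program order; op 2 was on a
# not-taken path, so op 1 names op 3 as its successor.
arrivals = [(3, None, "store C"), (1, 3, "load A")]
print(wave_ordered_issue(arrivals, first_seq=1))   # ['load A', 'store C']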

2.4.7 NEC-Dynamically Reconfigurable Processor

NEC-DRP (Amano et al., 2004; Suzuki et al., 2004; Toi et al., 2006) has the most distinctive compilation methodology and hardware structure of all the CGRAs we have discussed thus far. It uses a high-level synthesis based transformation from a high-level language to hardware contexts. DRP comprises several tiles, where each tile contains a matrix of PEs. Along with the matrix of PEs, each tile also includes a state-transition controller, several memories along the periphery (referred to as vertical and horizontal memories) and memory controllers.

Each PE contains an 8-bit ALU and an 8-bit data manipulation unit. The data manipulation unit supports shifting and masking of bits. Each PE also includes sixteen 8-bit registers in a register file and an additional 8-bit flip flop. Each PE can store sixteen configurations at any point in time. The switching between configurations is controlled by the global state transition controller. Each PE can receive two 8-bit inputs and produce one 8-bit output. Further, unlike the previously discussed CGRAs, it allows multiple PEs to be connected sequentially and execute within the same clock cycle. This implies that the clock frequency at which an application runs is determined by the configuration. This style of operation is akin to an FPGA. The compilation methodology is geared towards this style of hardware and is similar to the compilation flow followed on an FPGA. Applications are specified in a high-level language which is a derivative of the C language, referred to as the Behavioral Design Language (BDL). This language natively supports hardware specific constructs, viz. input and output ports, bit-level extraction and bit concatenation. The syntax is similar to C, and some constructs of the C language, such as recursion and dynamic memory allocation, are not supported. Pointers may be used if the compiler can statically determine the address location referred to by the pointer. An application specified in BDL is transformed into a tree Control Flow Graph (tCFG). Optimizations, viz. constant propagation and common subexpression elimination, are performed on the tCFG and it is then transformed into a Control-Dataflow Graph (CDFG). This is followed by resource allocation, where resources include PEs, routes on the interconnection network, switches along the selected route, registers, flip flops etc. Apart from this, the overall control flow is determined and captured as state transitions. The design is broken up into multiple contexts through a multi-step process. Each context is then transformed into a special RTL notation called multi-context Verilog. Once it has been converted to Verilog, it is taken through a synthesis process akin to the design flow followed for an ASIC. However, the synthesis tool is appropriately modified to generate an output for DRP. For example, the synthesis process must be repeated several times, once for each context. The state transitions are typically modeled as a hard-wired Finite State Machine (FSM) in an ASIC, whereas in the context of DRP they need to be mapped to the state transition controller. For complete details on the differences between the synthesis process for an ASIC and for DRP, refer to Toi et al. (2006). The hardware contexts, once compiled, are loaded on to the context memories associated with each PE.

The switching of the contexts is governed by the state transition controller, which can compute the next state within one clock cycle. The state transition controller determines the next context to run based on the evaluation of conditionals. Several tiles can run several different processes independently and concurrently. DRP can be integrated with System on Chips (SoCs) or ASICs.
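The state transition controller can be viewed as a compact FSM that selects the next context from the current context and a condition flag computed by the PEs. The table-driven sketch below is a generic illustration under our own assumptions about context names and flags; it is not NEC's controller design.

# A generic illustration (not NEC's design) of a state transition controller:
# the next hardware context is chosen in one step from the current context and
# a condition flag evaluated by the PEs in that context.

# (current_context, condition_flag) -> next_context
transition_table = {
    ("init", True):  "loop_body",
    ("loop_body", True):  "loop_body",   # loop condition still holds
    ("loop_body", False): "epilogue",
    ("epilogue", True):  "done",
    ("epilogue", False): "done",
}

def run(flags, start="init"):
    context, trace = start, []
    for flag in flags:
        trace.append(context)
        context = transition_table[(context, flag)]
    trace.append(context)
    return trace

print(run([True, True, True, False, True]))
# ['init', 'loop_body', 'loop_body', 'loop_body', 'epilogue', 'done']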

2.4.8 Ambric

Ambric (Butts, 2007; Halfhill, 2006) is an interconnection of brics, where each bric contains two computation units and two Random Access Memory (RAM) units. Each computation unit contains two Streaming RISC processor cores. Each of these cores contains a single ALU and is primarily meant to handle input and output related tasks (viz. preparing the data for processing by the main cores). The main processing cores are two Streaming RISC DSP processor cores. Each of these contains three ALUs, two of which are in series (i.e. the input of one is connected to the output of the other) and a third ALU which is in parallel to both of them. The third ALU is capable of rapidly executing operations such as Multiply-Accumulate, Sum of Absolute Differences etc. in a pipelined manner. The four processors, i.e. the two Streaming RISC cores and the two Streaming RISC DSP cores, are interconnected through crossbars. These crossbars are also connected to the external communication interface, enabling the cores to send and receive data to/from other computation units and RAM units. The crossbars are software controlled. Each of these four cores has local storage, which can be used to store both instructions and data. The Streaming RISC DSP cores are also connected to the RAM unit. The RAM unit contains four banks of memory, each of which is 1-2KB in size. These memories can be configured to store instructions or data. They can also be configured to stream addresses and data or operate as a First In First Out (FIFO) buffer. A pair comprising a computation unit and its RAM unit is connected to a mirrored copy of a compute and RAM unit, such that each computation unit is connected directly to two RAM units; this forms a bric. These brics are interconnected through Ambric registers. At the input end, each of these registers exposes a signal called accept and at the output end a signal called valid. When both these signals are high, data transfer takes place. If the accepting register is not free to accept an input, the accept signal is not asserted and the backpressure causes the data generating processor to stall until space is available on the output channel. Each Ambric register can store two data operands.

This mechanism allows different brics to operate at different clock frequencies, where the clock frequency of each bric can be changed at runtime by the software. Several brics may be combined at runtime to form dynamic cores. Such a scheme may be used if the space available in a RAM unit is insufficient for holding all the data; several brics can be grouped together to get access to a larger storage. The first Ambric processor contained a 5×9 array of brics. The chip also includes DDR2 and I/O controllers. The design entry point for this processor is the Java language. The programmer writes a Java program, followed by an aStruct specification. The aStruct specification indicates how many objects are instantiated for each Java class, the type and number of communication channels needed and their interconnections. This is then synthesized onto the processor array by the Ambric compiler. Even though the compiler accepts Java based programs, the executable is not in the form of byte code; the program is transformed into an Ambric-specific binary by the compiler using the Java and aStruct specifications. Each Java class of the application is expected to be single threaded. Ambric does not support the Java thread model. Ambric has a well-structured hardware design implemented with an eye towards energy efficiency. This is evident in the Globally Asynchronous Locally Synchronous (GALS) like structure which is employed. Each computation unit too has sufficient compute power to process several instructions in parallel. The presence of large RAM units makes it especially suitable for executing video encoding and decoding applications, which have a high memory requirement. The latency for accessing the RAM units and the latency for sending data from one Streaming RISC DSP core to another are unclear. The idea of using a Java-like language for programming and mapping an object to a core (a core is formed by the grouping of one or more brics) is quite interesting and intuitive. While the aStruct specification is easy to write, the programmer may create artificial deadlocks if sufficient buffers are not allocated between objects. Additional software checks may be needed to detect such errors.
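The accept/valid handshake and the resulting backpressure can be captured in a short behavioural model. The sketch below reflects only the description above (a two-operand register that refuses data when full, stalling the producer); the class and method names are our own and it is not Ambric hardware or tooling.

# A behavioural model (our own sketch, not Ambric hardware) of an Ambric-style
# channel register: data moves only when 'valid' (producer has data) and
# 'accept' (register has space) are both true; a full register back-pressures
# the producer, which stalls until space frees up.

class ChannelRegister:
    CAPACITY = 2                       # an Ambric register holds two operands

    def __init__(self):
        self.slots = []

    @property
    def accept(self):                  # asserted when the register can take data
        return len(self.slots) < self.CAPACITY

    def push(self, value):             # producer side: only succeeds when accept
        if not self.accept:
            return False               # producer must stall (backpressure)
        self.slots.append(value)
        return True

    def pop(self):                     # consumer side: only succeeds when data exists
        return self.slots.pop(0) if self.slots else None

reg = ChannelRegister()
print(reg.push(1), reg.push(2), reg.push(3))   # True True False -> producer stalls
reg.pop()                                      # consumer drains one operand
print(reg.push(3))                             # True -> producer resumes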

2.4.9 Polymorphic Pipeline Array

PPA (Park et al., 2009, 2010) is a CGRA-influenced multi-core solution. It is designed primarily to address two shortcomings of current CGRAs (as its designers perceive them); these include

• CGRAs have typically been employed to accelerate inner loops and functions. Outer loops are typically executed on a host processor.


• The reconfiguration models described in section 2.2 only address the interconnection between a CGRA and a host processor, and this interconnection typically takes the form of a bus.

Park et al. (2009) address these shortcomings through the creation of flexible multi-cores which are tightly interconnected and can be dynamically composed to form a virtual core.

The PPA architecture comprises many FUs interconnected through a mesh network. A set of four FUs forms a core. All FUs support the execution of 32-bit basic integer arithmetic instructions and only one among them supports multiplication. Each FU contains a register file which is local to it. The interconnection within the core is a register-file-to-register-file interconnection. Each core contains an I-cache which supplies instructions to all FUs within it. It also contains a loop buffer for storing instructions which are repeatedly executed in a loop. The FUs within a core share a common predicate register file. The interconnection between the cores also allows direct access to the register files of other cores. Other interconnects are used for virtualization. Many cores can be dynamically composed to form a virtual core. The larger core can be used to spread instructions across many cores for execution. The application design entry point is the C language. The compiler statically determines the time taken for each instruction execution and statically schedules communication after route selection, to ensure that there are no conflicts. In the case of virtualization, there is sufficient hardware support (in the virtual controller) to reschedule instructions which are reassigned at runtime to a different core. DLP which exists within a loop is exploited through the use of modulo scheduled loops. The architecture also has support to launch multiple coarse-grained tasks in parallel (on different cores). PPA can be coupled with a host processor. Since large tasks (viz. nested loops) are assigned to the PPA, the interaction with the host processor is very limited and data can be transferred through Direct Memory Access (DMA). We strongly feel (as also reflected in the ideas presented in Borkar and Chien (2011) and Hill and Marty (2008)) that in order to be energy efficient while obtaining better performance, it is essential to define FUs specific to a certain domain and support large custom IP blocks. Further, PPA in its current form is not capable of withstanding long unpredictable latencies. This support would be crucial for it to become a mainstream processor.

2.5 Summary

Table 2.1: The table indicates the values of all the terms in the 7-tuple design space of a CGRA. The terms in the 7-tuple design space have the following meanings: C - choice of computation unit, N - choice of interconnection network, T - choice of number of context frames (single or multiple), P - presence of partial reconfiguration, O - choice of orchestration mechanism, M - design of memory hierarchy and H - host-CGRA coupling. The meaning of other terms: C/F: control flow; D/F: dataflow.

Architecture | C | N | T | P | O | M | H
RAW | RISC Core | Hybrid | yes | – | self, C/F | Local | None
PACT XPP | ALU | point-point | no | yes | self & external, C/F | periphery | tight coupling
TRIPS | ALU | NoC | yes | no | C/F | periphery | None
DAPDNA | FU | programmable | yes | no | C/F | periphery | tight coupling
ADRES | FU | point-point | yes | no | C/F | periphery | tight coupling
Wavescalar | ALU | hierarchical | yes | – | self, D/F | local* | None
NEC-DRP | ALU | programmable | yes | no | state machine | periphery | –
Ambric | ALU | point-point | yes | – | – | local | –
PPA | ALU | point-point | yes | yes | – | periphery | –

In this chapter, we looked at various CGRA architectures. Some of them, viz. RAW and PPA, keep the hardware extremely simple while transferring all the complexity to the compiler. At the other end of the continuum are architectures such as TRIPS and Wavescalar, which use dataflow semantics in execution. Architectures such as NEC-DRP rely on high-level synthesis like compilation flows to target byte-width CGRAs. Architectures such as PACT-XPP and ADRES have a very tight integration of the host processor and the reconfigurable fabric, which helps in fast switching of execution between the reconfigurable fabric and the host processor. The common technique used in all of these examples (without exception) is the use of explicit communication. Data-driven execution has been a common underlying technique used in almost all of these architectures to withstand unpredictable latencies. We summarize, in table 2.1, the features of each of these processors based on the 7-tuple space we proposed in the beginning of this chapter.

Each of these processors has been designed with different motivations in mind. As mentioned in chapter 1, we intend to build a CGRA that has a loose coupling with the host processor, to maximize the utilization of both these entities. We employ a NoC as it helps in reducing the configuration overhead. Further, we intend to use a dataflow based orchestration mechanism to better exploit task level parallelism. In chapter 3, we derive the conditions for correct execution. These conditions for correct execution need to be implemented partially in hardware and partially by the compiler. In chapter 4, we motivate the design decisions for the computation unit and the interconnection network. Chapter 5 describes the design decisions with regard to the macro-dataflow orchestration subsystem.

The Road goes ever on and on
Down from the door where it began.
Now far ahead the Road has gone,
And I must follow, if I can,
Pursuing it with eager feet,
Until it joins some larger way
Where many paths and errands meet.
And whither then? I cannot say.
- J R R Tolkien in The Lord of the Rings

Chapter 3

Macro Dataflow Execution Model for a CGRA

Why Dataflow? – Prof. Masahiro Fujita
It seems Arvind has affected your mind... – Prof. Yale Patt1

We intend to design a CGRA with a dataflow based orchestration unit, as motivated in chapter 1. The dataflow semantics used at the orchestration unit operate at the level of temporal partitions (unlike traditional dataflow systems, which operate at the level of instructions). In this chapter we motivate the need for a dataflow based orchestration unit and derive the conditions for correct execution, followed by a description of the model of execution to be followed on our CGRA.

3.1 Why Dataflow?

An application to be executed on a CGRA is compiled and converted into one or more hardware contexts. Each hardware context indicates the way the computation units and the interconnection network should behave. If the computation units comprise ALUs, then the hardware context can be an instruction or a set of instructions. If the interconnection network is programmable, then the configuration determines the direction of switching (at each switch); if the interconnection network is a NoC, then the network configuration arrives along with the data to be routed.

1These quotations are from some of my interactions during my PhD life. All these statements have profoundly impacted the development of this project and are quoted here as a tribute to each of these people who have so affected it.

A hardware context may span some or all computation units of the fabric. If a hardware context does not span all computation units, the reconfigurable fabric may have the capability to permit another hardware context to be loaded on the remaining computation units. We refer to this ability to allow different configurations to independently load and execute on the fabric as partial-reconfiguration2; in the absence of this ability it is referred to as total-reconfiguration. CGRAs that only support total reconfiguration typically have multiple configuration planes where the next configuration can be preloaded while the execution of another configuration is in progress. This helps in fast switching between configurations. Both schemes allow piecewise execution of the application. Piecewise execution of an application requires the application to be partitioned into temporal partitions such that each temporal partition can independently execute on the interconnection of computation units. Since the configuration associated with a temporal partition is distributed across several computation units, saving context for the purpose of context switching would be impractical; this is due to the amount of data to be saved and the coordination needed in saving the data. In this chapter, we will refer to a temporal partition as an application substructure. While a temporal partition is a substructure of the application, application substructures need not always be time multiplexed on the reconfigurable fabric, as implied by the term temporal partition; two application substructures can be run in parallel if sufficient resources are available, as shown in figure 3.1. Therefore, any application substructure once loaded on the fabric must run to completion, i.e. execution of an application substructure is atomic and uninterruptible. While partitioning an application into substructures for a CGRA, it is desirable to have application substructures whose schedule is known at compile time, so that the subsequent hardware context may be preloaded while the current hardware context executes. However, this is not feasible in the presence of data-dependent conditionals. Therefore, scheduling in this context has to be carried out at runtime, based on the evaluation of conditional statements. An obvious scheduling mechanism would be to employ simple counter based scheduling (viz. the Program Counter (PC) based scheduling mechanism in a von-Neumann machine). In this scheme, the various application substructures are executed in sequence and, upon encountering a branch, the counter is set to the application substructure along the direction in which the execution continues.

2In the context of FPGAs, partial reconfiguration refers to the ability to change a portion of the configuration. In our context, it refers to the ability to load another configuration if the fabric is not completely utilized.


Figure 3.1: The schematic shows how two application substructures can be simultaneously realized on an array of computation units.

A similar scheme is employed in PACT-XPP (Cardoso and Weinhardt, 2002; Technologies, 2006c) and is referred to as self-reconfiguration. In this scheme, each application substructure indicates the next application substructure to be executed. Such application substructures cannot be eagerly scheduled, at least not until the evaluation of the conditional statement which governs their execution3. Further, an application substructure does not start execution either until all its inputs have arrived or until all its predecessors (application substructures that produce inputs for it) have been launched for execution. The reason is illustrated with an example. Let us assume two application substructures S1 and S2 are governed by the same predicate. The predicate for S1 arrives first and we launch S1. Further, let us assume that S1 and S2 are quite large and cannot be launched at the same time. If S1 is data dependent on S2, this leads to a deadlock: since S1 cannot be preempted (there is no ability to context switch), S2 cannot be launched and hence some of the data operands for S1 are never produced. For these reasons, the scheduling of application substructures either needs to adhere to strict execution semantics or the substructures need to be scheduled in a topologically sorted order. For the sake of design simplicity, we assume that strict execution semantics are used4.

3In the presence of a mechanism that squashes an incorrectly scheduled application substructure, an eager schedule is possible. This mechanism is used in TRIPS (Sankaralingam et al., 2003). We do not consider this optimization at this point in time; it is discussed briefly in chapter 7.
4If the application substructure is launched prior to the arrival of all its inputs, then a mechanism to forward the inputs to the fabric would be needed.

Figure 3.2: A tree of stacks is needed when exploiting parallelism that exists among application substructures.

In CGRAs that permit partial reconfiguration, it is possible to execute two or more application substructures at the same time (depending on resource availability). Parallelism among application substructures exists when there is data level parallelism or task level parallelism. In such a scenario, multiple parallel execution streams need to be simultaneously handled. When multiple execution streams are forked from a single source, a tree of stacks (Culler et al., 1993) (figure 3.2) is created. The stack from which the fork was initiated must await the completion of all these tasks. The scheduling mechanism described above can still be used to fork the execution of multiple application substructures. However, the function that forked the multiple streams cannot resume until all the streams complete. Thus, there is a need for explicit synchronization (barrier synchronization) at this point. The scheduling mechanism needs to be enhanced to support this synchronization. Barrier synchronization across several tasks is a commonly employed synchronization method in distributed systems. However, unlike traditional distributed systems, the substructures that are scheduled on the fabric run for 10-100s of cycles. This implies that the cost of synchronization should be low in order to make the exploitation of parallelism feasible. Hence, we employ the dataflow paradigm for scheduling these application substructures. In dataflow execution semantics, the synchronization is implicit; an application substructure firing as per a dataflow schedule may not start execution until all its input operands arrive.

Several other forms of parallelism can be exploited. However, in all these cases there is a need for explicit synchronization. The dataflow paradigm provides the lightest synchronization mechanism between parallel entities. Several dataflow processors were developed by Arvind and Culler (1986); Arvind and Nikhil (1990); Papadopoulos and Culler (1998). The disadvantages of these architectures are well known and have been documented in several studies (Lee and Hurson, 1993). The disadvantages were due to the implementation of implicit synchronization at the granularity of individual instructions, where an instruction does not execute until its operands arrive.

3.2 Macro Dataflow

Unlike traditional dataflow execution semantics, which operate at the level of monadic or dyadic operations, the operations in our context are application substructures, which can be viewed as macro dataflow operations. Since we employ the dataflow paradigm, the application substructures are referred to as macro dataflow operations. We refer to the unit that schedules these macro dataflow operations as the orchestration unit. The orchestration unit may be implemented in software or hardware and the details of its actual implementation are discussed in chapters 5 and 7. In this chapter, we describe the macro dataflow based execution model. This specifies
a) the expected properties for these macro operations (i.e. application substructures) to be schedulable and well-behaved. Well-behavedness is needed to limit the resource requirement of the runtime system. The compiler that generates these application substructures needs to ensure that they are well-behaved. The properties of such macro operations are discussed in section 3.3.
b) the expected behavior of the runtime orchestration unit for correct execution. The details of the working of the orchestration unit are elucidated in section 3.4.
While we describe the conditions needed for correct execution, the choice of granularity of the application substructure is implementation dependent, i.e. dependent on the fabric implementation and the language which is used as the design entry point.
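The behaviour expected of the orchestration unit can be summarized by a small sketch in which a macro operation is launched only when all of its expected input operands have arrived (strict firing). This is an illustrative model under our own naming assumptions, not the implementation described in chapters 5 and 7.

# An illustrative model (not the implementation of chapters 5 and 7) of strict,
# dataflow-driven orchestration: a macro operation (application substructure)
# is launched only after all its input operands have arrived.

class Orchestrator:
    def __init__(self, expected_inputs):
        # expected_inputs: macro-op name -> number of operands it waits for
        self.expected = expected_inputs
        self.arrived = {name: {} for name in expected_inputs}

    def deliver(self, op, slot, value):
        """Record one operand; launch the macro operation when it becomes strict-ready."""
        self.arrived[op][slot] = value
        if len(self.arrived[op]) == self.expected[op]:
            self.launch(op, self.arrived[op])

    def launch(self, op, operands):
        # In hardware this would allocate computation units and load the context.
        print(f"launching {op} with operands {operands}")

orch = Orchestrator({"loop_body": 2, "epilogue": 1})
orch.deliver("loop_body", 0, 42)       # not launched yet: one operand missing
orch.deliver("epilogue", 0, 7)         # launched: all (one) operands present
orch.deliver("loop_body", 1, 8)        # launched: both operands present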

3.3 Macro Operations

Consider the example scenarios given in figure 3.3. Each of these schematics shows different substructures within the application.

(a) Sequential Code - Loop - Sequential Code; (b) Sequential Code - Tasks - Sequential Code

Figure 3.3: Schematic figure showing different application substructures, where a loop and several parallel tasks are placed between sequential substructures.

In figure 3.3a, the substructure containing the loop may have DLP and one may wish to run multiple instances of it, while a single instance of each of the remaining substructures is run. As mentioned previously, each application substructure is executed in a strict manner, i.e. an application substructure cannot be executed until all its inputs are available. These application substructures cannot be context switched. The inputs to each application substructure are stored in the context memory associated with it, until it is selected for execution. In the example shown in figure 3.3a, A is instantiated first, followed by B. The execution of A and B may proceed in parallel. The data dependence establishes a set of partial orders for execution. To ensure schedulability there must exist at least one total order which satisfies the partial orders needed for execution, i.e. the application substructures must respect the convexity condition (Sarkar and Hennessy, 1986).

3.3.1 Macro Dataflow Graph

The partial orders between the various application substructures can be represented in the form of a graph. In this graph, the nodes are the application substructures and the edges indicate their data interactions. Such a graph is akin to a dataflow graph. The properties of such a graph are listed below.

(a) Not all input edges have valid data; (b) Input edges with implicit merge node

Figure 3.4: All input edges on a macro operation may not be valid. However, all such inputs are mutually exclusive, thus allowing an implicit merge node.

3.3.1.1 Macro Operations

The nodes of this graph, unlike those of a dataflow graph, do not represent (micro) operations but macro operations. Each macro operation has several input and output edges, unlike nodes in a dataflow graph, which are either monadic or dyadic (Arvind and Nikhil, 1990). Micro-operations of a dataflow graph consume one set of inputs from all input edges and produce one set of outputs on all their output edges (switch and merge are exceptions). However, since the macro operations, represented by application substructures, are a composition of several of these micro-operations (including switch and merge), they do not necessarily consume inputs on all input edges and produce outputs along all output edges. Such a macro operation is best represented by an actor (Janneck, 2003)5. The macro operations also contain edges on which data is produced once and consumed many times. These kinds of input edges are not present in a micro-operation.

3.3.1.2 Input Merge

As can be observed from figure 3.4a, data may not be available on all input edges at all times. For the application structure shown in figure 3.4a, while some inputs are generated by A for the first iteration of B, in subsequent iterations this data needs to be generated by the macro operations belonging to the previous iteration (which may include the previous iteration of B or any such macro operation within the loop). Examples include loop constants6, the loop index etc. However, it is evident that only one of these inputs is generated at runtime. Since they are mutually exclusive, it is equivalent to an implicit merge being present at some input edges (indicated in figure 3.4b). Whenever data for an input edge is not produced in a macro operation, that edge has to have an implicit merge with another input edge, where the two edges carry data under mutually exclusive conditions.

5The actor model presented by Janneck (2003) differs slightly from the model presented by Hewitt (1977) and Agha (1986).

3.3.1.3 Conditionals

The macro operations may not produce outputs on all output edges. For example, in figure 3.3a the data on the back edge from B to B is only produced as long as the loop is active, and the data on the edge (B,C) is only produced when the loop becomes inactive. The inputs to a macro operation, along with its internal state, determine whether data is produced on a certain output edge. In this case, the macro operation does not satisfy the condition of well-behavedness as it does not produce outputs along all output edges. In the converse case, when data is produced on all output edges, the data operands intended for macro operations on the not-taken path need to be purged. A simple way to circumvent this problem is to prevent delivery of data operands to macro operations until the decision of the branch, which determines whether they are executed, is taken. We refer to the macro operation where the branch condition is evaluated as the branch macro operation. In this case, data produced by other macro operations, which precede the branch macro operation, is forwarded to the branch macro operation. This can lead to an increase in the number of inputs to the branch macro operation. Another way to handle this is to permit the delivery of data before the branch decision is taken and then issue a special request to purge all the data delivered before the decision was taken. We address this in section 3.3.1.7.

3.3.1.4 Loop Constants

Any substructure containing a loop may have several input arguments which remain invariant throughout the execution of the loop. Such loop invariants or constants need not be repeatedly delivered to the partition. These loop constants can be stored and repeatedly used. However, upon exit from the loop, the storage space used by the loop constants needs to be released. The release of these resources is discussed in detail in a later section.

6These can be stored, as opposed to regenerated, for efficiency.

3.3.1.5 Function Call and Contexts

A function call is allowed within a substructure if the processor has the ability to perform function calls (for example, if a function call stack is implemented as in a control flow based processor). However, in most CGRAs supporting function calls is difficult, since saving a context which is distributed across several processors on the reconfigurable fabric is hard. In the absence of this support, the caller macro operation copies the inputs for the function into a separate context memory. When all the inputs for the function are available, the function executes. This mode of execution is akin to the execution model of parallel tasks, even though the execution proceeds in a serial manner. To handle multiple simultaneous invocations of the same function, there need to be multiple context memories, one for each function being invoked. Every function invocation is associated with an instance identifier and a separate context memory is allocated at runtime. For example, in figure 3.3b substructures B, C and D are parallel invocations of functions.

3.3.1.6 Memory Access

Thus far, we have assumed that all input data to a substructure is made available in the context memory. However, it may not be practically possible to copy all data from a global store to the specified context memory. In such a scenario, a lightweight mechanism is needed to ensure that the processor executing the successor substructure has the specified data available at the memory store. In order to achieve the same, we introduce an explicit memory precedence edge between these macro operations in the macro dataflow graph. This input, associated with the memory precedence edge, is written to by the predecessor substructure that last updates the data. The value of this sequencing input is never used during execution of the successor substructure, but just serves as a means to delay the scheduling of the macro operation until the data is updated. It should be noted that the writing of the sequencing input to a substructure may need to be preceded by appropriate memory synchronization actions as dictated by the memory consistency model implemented on the execution system.
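As a concrete illustration of this sequencing input, the short Python sketch below models a memory precedence edge as an extra operand slot whose value is never inspected. All names and the data layout are assumptions made only for this example, not the actual mechanism.

    # Illustrative model of a memory precedence edge: the successor waits on a
    # dummy sequencing operand that the predecessor writes only after its store.
    memory = {}

    class Substructure:
        def __init__(self, n_data_inputs):
            # last slot is reserved for the sequencing (memory precedence) input
            self.slots = [None] * (n_data_inputs + 1)

        def deliver(self, slot, value):
            self.slots[slot] = value

        def ready(self):
            return all(v is not None for v in self.slots)

    producer_done_token = object()               # the value is never inspected
    consumer = Substructure(n_data_inputs=1)

    # Predecessor: update the global store, then signal via the precedence edge.
    memory[0x100] = 7
    consumer.deliver(0, 3)                       # ordinary data operand
    consumer.deliver(1, producer_done_token)     # sequencing input (value unused)

    assert consumer.ready()                      # safe to schedule: memory[0x100] is visible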

Figure 3.5: Schematic showing three substructures A, B and C. The decision on invocation of C is taken in substructure B, while A is another predecessor of C which generates inputs for it without consideration of the decision taken in B.

3.3.1.7 Conditions for a Well-behaved Graph

A function call invocation, as stated above, requires dynamic allocation of the context memory. Since machine resources are finite, these resources need to be appropriately reclaimed when the function terminates. Function termination could mean that the function returns to the caller or that the function invocation did not occur due to the presence of a conditional statement. However, some of the inputs for the function may have already been produced and written to the context memory7. These inputs written to the context memory need to be purged if the function terminates. Further, if some of the inputs to this function were in transit at the time of reclamation of resources, erroneous execution can result. As indicated by Arvind and Nikhil (1990), this places a requirement on the graph to be well-behaved. The conditions for a macro dataflow graph to be well-behaved are a superset of the conditions imposed on a dataflow graph.

• The macro operation is well-behaved: Since the macro operation is a composite actor containing several micro-operations, this composition should be well-behaved. An actor is said to be well-behaved when it leaves the hardware in the same state as it was in prior to execution. The well-behavedness of a macro operation needs to be maintained both in the context memory and in the context of execution of the micro-operations. Every micro-operation within the substructure, given one token on every input edge, produces one token on every output edge. Most arithmetic and logical operators easily satisfy this rule. The operators which are not well-behaved include switch and merge8 (Arvind and Culler, 1986). These can be made well-behaved as described by Arvind and Culler (1986); Arvind and Nikhil (1990). Thus, it can be shown that each macro operation is well-behaved.

• The interaction between macro operations is well-behaved: Every macro operation must consume all inputs and produce all outputs. However, as observed previously, a macro operation does not guarantee generation of all outputs. In order to guarantee correctness:

  – Non-Loop Macro Operations: For every input that is generated for a macro operation N by a macro operation M, the data should be consumed by macro operation N, or a terminate request for macro operation N has to be issued to clear the data. Further, not all the inputs for execution of macro operation N may be produced when it is on the not-taken path. Therefore, along with the termination request, the number of inputs N is expected to receive (from other predecessors) prior to the release of the resources associated with the macro operation is supplied.

  – Loop Macro Operations: A macro operation M which iterates several times may not generate all outputs during the course of its execution. Such a macro operation is well-behaved if outputs along the back-edge are generated only when the loop is re-entered, and if outputs associated with the exit path of the loop are generated only by the last iteration of the loop.

• In the case of loops, the loop constants stored in the context memory of the substructure need to be purged and the storage released after the termination of the loop.

A non-loop macro operation is one which is not re-instantiated several times. If a macro operation contains a loop and the macro operation itself is not re-instantiated every time, then it is a non-loop actor and the rules governing loop macro operations are not applicable in that context.

In order to make macro dataflow graphs well-behaved, we need a mechanism by which unused data can be purged. Further, these purge requests must indicate the number of operands expected. The compiler determines the number of outputs that would have been computed and delivered to any macro operation M prior to every conditional statement that influences the execution of M. This evaluation is done at every conditional statement (i.e. for each basic block in a control flow graph). The runtime orchestration unit must ensure that it honors this count value when a termination signal arrives and that the resource allocation is not revoked until the specified number of input operands has arrived. Unlike Synchronous Dataflow, the determinism only extends to the number of input tokens for a macro operation and does not apply to the aspect of schedulability.

7One may argue that if the input data is not delivered up to the point of the decision, such a situation may not arise. However, for the sake of generality we handle the case when a portion of the inputs is delivered before the decision on invocation is made.
8We do not employ merge nodes in our application.
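The compile-time count described above can be sketched as follows; the graph encoding (a consumer-to-producers map annotated with the guarding conditional, if any) is purely illustrative and is not the representation used by our compiler.

    # For every (guarding conditional, guarded macro op) pair, count the inputs of
    # the macro op whose producers execute regardless of that conditional. This is
    # the operand count a terminate/purge request must wait for at run time.
    producers = {                     # consumer -> list of (producer, guard or None)
        "N": [("A", None), ("A", None), ("B", "cond1")],
    }

    def expected_inputs_before(consumer, conditional):
        return sum(1 for _, guard in producers[consumer] if guard != conditional)

    # Inputs of N delivered without waiting for cond1: the two operands from A.
    assert expected_inputs_before("N", "cond1") == 2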

3.4 Execution Model

In the previous section, we described properties of macro operations. An application can be represented as a macro dataflow graph, where each macro operation can be executed when all its inputs are available. We now describe the execution model of a system that schedules macro operations. An abstract model of the execution system is shown in figure 3.6. The figure shows an orchestration unit connected to a context allocation area. The context memories for various macro operations are allocated in this space and a tagged-token dataflow execution model (Arvind and Nikhil, 1990) is employed. In this model, a macro operation may be instantiated several times during the execution of the application. Every instance of the macro operation is uniquely identified, among all other active instances, by a tag9. The tagged-token approach exploits the highest amount of parallelism (Arvind and Culler, 1986). Each data operand sent to the orchestration unit must identify the destination macro operation and its instance number through a tag field. The tag field needs to be computed on the processor, along with the operand. A detailed description of the orchestration unit is provided in the following section.

9The tags are not unique for every instance and the number of uniquely identified instances is determined by the number of bits associated with the tag.
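As an illustration of such a tag field, the following sketch packs a macro operation identifier and an instance number into a single integer tag. The field widths and helper names here are arbitrary choices for the example, not those of our implementation.

    # Illustrative packing of a tagged token: the tag identifies the destination
    # macro operation and its dynamic instance (field widths are illustrative).
    MACRO_ID_BITS = 10
    INSTANCE_BITS = 6

    def make_token(macro_id, instance, operand_slot, value):
        assert macro_id < (1 << MACRO_ID_BITS) and instance < (1 << INSTANCE_BITS)
        tag = (macro_id << INSTANCE_BITS) | instance
        return {"tag": tag, "slot": operand_slot, "value": value}

    def split_tag(tag):
        return tag >> INSTANCE_BITS, tag & ((1 << INSTANCE_BITS) - 1)

    tok = make_token(macro_id=7, instance=3, operand_slot=1, value=99)
    assert split_tag(tok["tag"]) == (7, 3)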

Figure 3.6: Schematic of the execution system: an orchestration unit connected to the context allocation area and to an array of processors. Macro operations are launched onto the processors, and data produced for other macro operations is written to the context memory.

In this model, an application is partitioned into several macro operations that satisfy the convexity condition. The first macro operation of a program can be executed when all the inputs for the program are available. This macro operation is selected by an Orchestration Unit for scheduling and the configuration associated with the macro operation is transferred to the reconfigurable fabric for execution. The macro operation executes and produces data operands for other macro operations. The data for the other macro operations are stored within the context memory (refer figure 3.6). When the expected number of inputs for a macro operation is available, it is selected for scheduling. It is to be noted that the scheduling of the next macro operation may proceed even if the first macro operation is still being executed.

3.4.1 Orchestration Unit

The primary tasks of the orchestration unit are: a) allocation of context memory to every instance of a macro operation; b) scheduling of ready macro operations; c) termination of macro operations or purging of a context memory; and d) managing function invocations.

3.4.1.1 Context Memory Allocation

Each instance of a macro operation is associated with its own context memory. This allocation can be performed whenever a new instance of a macro operation is instantiated10. In this case, since the allocation is done at runtime, a predecessor macro operation does not know the address of the context memory, within the context allocation area, of the macro operation. Thus, a name service, which translates the macro operation identifier and its dynamic instance number (tag) to a memory address, is needed. Such a model would be akin to the Tagged-Token Dataflow Architecture (Arvind and Nikhil, 1990). A less hardware intensive alternative exists; in this scheme, the compiler determines the number of context memory allocations needed for the function (based on the number of macro operations contained within it and the maximum number of active macro operations at any point in time). At the start of the function, the required number of (contiguous) context memories is allocated. The start address of the context memory chunk is passed to the function instance as an input. Subsequently all macro operations within the function add the appropriate offset in order to deliver data to a specific macro operation.

10Instantiation can be automatic based on the arrival of the first input to that instance of the macro operation.

This obviates the need for a name service. This scheme is akin to the Explicit Token Store Architecture (Papadopoulos and Culler, 1998). It is to be noted that a static allocation of context memory for every macro operation cannot be performed. While such a solution may work for some applications, support for function recursion and the ability to exploit task level parallelism would not be available.
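A minimal sketch of this base-plus-offset addressing, assuming a flat context allocation area, compiler-assigned offsets and a simple bump-pointer allocator (all of which are illustrative assumptions, not the actual hardware mechanism):

    # Illustrative Explicit-Token-Store-style addressing: context memory address =
    # chunk base (allocated at function entry) + compiler-assigned offset.
    CONTEXT_AREA_SIZE = 1024
    context_area = [None] * CONTEXT_AREA_SIZE
    next_free = 0

    def allocate_chunk(n_frames, frame_size):
        """Allocate contiguous context frames at the start of a function."""
        global next_free
        base = next_free
        next_free += n_frames * frame_size
        assert next_free <= CONTEXT_AREA_SIZE, "context allocation area overrun"
        return base

    def deliver(base, macro_offset, slot, value, frame_size=4):
        context_area[base + macro_offset * frame_size + slot] = value

    # A function with 3 macro operations, each holding up to 4 operands.
    base = allocate_chunk(n_frames=3, frame_size=4)
    deliver(base, macro_offset=2, slot=1, value=17)   # no name service needed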

3.4.1.2 Loop Throttling

Since the allocation of context memory is performed at run time, there is a need to throttle the uncontrolled instantiation of loop iterations, in the absence of which the context allocation area may be overrun, leading to a deadlock (Culler, 1985). In order to perform loop throttling in the scheme based on the Tagged Token Dataflow Architecture, the name service must not allow the creation of more than k instance entries for the same macro operation. Similarly, in the Explicit Token Store based implementation, explicit sequencing arcs are needed to prevent reuse of the same context memory before the previous use of that location has completed.
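For the tagged-token scheme, the throttling rule amounts to the check sketched below; the bound K and the bookkeeping structure are illustrative assumptions for this example.

    # Illustrative loop throttling: at most K live instances per macro operation.
    K = 4
    live_instances = {}          # macro_id -> set of active instance numbers

    def try_instantiate(macro_id, instance):
        active = live_instances.setdefault(macro_id, set())
        if len(active) >= K and instance not in active:
            return False         # throttled: defer this iteration
        active.add(instance)
        return True

    def retire(macro_id, instance):
        live_instances[macro_id].discard(instance)

    assert all(try_instantiate("loop_body", i) for i in range(4))
    assert not try_instantiate("loop_body", 4)     # fifth concurrent instance blocked
    retire("loop_body", 0)
    assert try_instantiate("loop_body", 4)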

3.4.1.3 Schedule, Terminate and Purge

The orchestration unit needs to maintain metadata on the number of input operands available within various context memories. An important condition needed for ensuring correct execution is that updates to the available-inputs counter guarantee the availability of all those inputs in the context memory. Further, if multiple inputs for a context memory arrive simultaneously, appropriate synchronization constructs must be employed while updating the counter. A hardware orchestration unit may be implemented by allowing one input write to a given context memory in a single cycle. When a loop constant is written to the context memory, then apart from the input counter, another counter indicating the number of loop constants also needs to be maintained. In processors where loop macro operations are realized as multiple instantiations of the macro operation (when the processor does not support execution of loops), the input counter is reset to the value of the loop constant count until such time that the loop terminates. Updates to the loop constant counter too must be protected by appropriate synchronization constructs to guard against multiple simultaneous updates to it.

When the number of input operands matches the expected count for the specific macro operation, the macro operation is chosen for scheduling. Yet another useful feature would be the ability to launch multiple instances of the same macro operation. Such a scheme would be useful when DLP is available.

If the orchestration unit receives a termination request (which happens when the condition governing the execution of a macro operation evaluates to false), then the macro operation has to be terminated. However, the termination cannot be done until the expected number of inputs has arrived. The expected number is determined based on how many inputs have been transferred prior to the decision making (without awaiting the decision). If the macro operation is terminated before the arrival of these inputs (i.e. while they are in transit), these inputs may arrive after this context memory has been reallocated to a different macro operation and will lead to erroneous execution. When a loop terminates, the loop constants stored in the context memory (or in a separate store) need to be purged. However, this purging too should not be carried out until all the expected loop constants have arrived, to prevent erroneous execution.
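The counter discipline of this subsection can be summarised by the following behavioural sketch; it is a software model written only to illustrate the rules, and the class, fields and return values are invented for the example rather than taken from the hardware design.

    # Illustrative model of the orchestration unit's per-instance bookkeeping.
    class ContextEntry:
        def __init__(self, expected_inputs):
            self.expected = expected_inputs
            self.arrived = 0
            self.terminate_after = None    # set when a terminate request arrives

        def on_operand(self):
            self.arrived += 1
            if self.terminate_after is not None:
                return "purge" if self.arrived >= self.terminate_after else "wait"
            return "launch" if self.arrived >= self.expected else "wait"

        def on_terminate(self, operands_sent_before_decision):
            self.terminate_after = operands_sent_before_decision
            return "purge" if self.arrived >= self.terminate_after else "wait"

    e = ContextEntry(expected_inputs=3)
    assert e.on_operand() == "wait"
    # The branch resolves to not-taken after two operands were already sent.
    assert e.on_terminate(operands_sent_before_decision=2) == "wait"
    assert e.on_operand() == "purge"   # last in-flight operand arrives; safe to reclaim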

3.5 Granularity of Macro Operations

Thus far, we have described the various properties of macro operations but have not described how macro operations are created. As mentioned previously, when instructions are spread across multiple processors of a reconfigurable fabric, saving context prior to a function call is very difficult to implement. Therefore, a function call is not usually a part of a macro operation. The decision on the inclusion/exclusion of loops within a macro operation is determined by the capabilities of the processor11. In order to support loops, the processor needs the ability to detect the end of an iteration. This is needed to determine when to restart the instructions corresponding to the next iteration of the loop.

Iannucci (1988) and Sarkar and Hennessy (1986) generate macro actors (referred to as scheduling quanta by Iannucci (1988)) by grouping together nodes of a dataflow graph. In both cases a functional language was used to express the application. When an imperative language is used to write an application, macro operations can be constructed by grouping basic blocks of the application. It has been shown by Alle et al. (2009) that basic blocks need to be processed in topological order to ensure that the convexity condition is met. Architectures such as TRIPS (Sankaralingam et al., 2003), PPA (Park et al., 2009) and ADRES (Mei et al., 2002) employ a variant of hyperblocks.

The number of instructions that can be included within the same macro operation depends on several factors. These include:

• The choice of computation unit. If FUs are employed as computation units, then the number of instructions that can be included in the macro operation depends on the type of the individual FUs and their cardinality. When multi-function ALUs are employed as computation units, the number of instructions that can be included in a macro operation is equal to the number of computation units multiplied by the number of instructions that can be accommodated within each computation unit. In contrast, in the former case the number of instructions that can be accommodated might be less than the available instruction storage capacity of the computation units.

• The number of context frames. This refers to the number of instructions that can be stored within each computation unit. If each computation unit can hold more than one operation, then the capacity of the macro operation increases.

It must be noted that the macro dataflow based orchestration unit will work for any choice of computation unit (from an FU to a RISC core) or choice of interconnection network. The orchestration unit only determines the next macro operation to be executed and does not concern itself with the mode of execution of the instructions of a macro operation. The design of the reconfigurable fabric is elucidated in chapter 4.

11The tree of stacks mentioned previously is with respect to the execution of several macro operations in parallel and this pertains to the inclusion of function calls within a macro operation.
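The grouping of basic blocks discussed in this section can be sketched as a greedy pass over a topological order of the blocks, subject to an instruction-count capacity. This is a simplification written only for illustration; it is not the algorithm of Alle et al. (2009), and the graph representation and capacity limit are assumptions.

    # Illustrative grouping: pack basic blocks, visited in topological order,
    # into macro operations limited by an instruction-count capacity.
    from graphlib import TopologicalSorter

    def group_basic_blocks(deps, instr_count, capacity):
        """deps: block -> set of predecessor blocks; returns a list of macro ops."""
        order = list(TopologicalSorter(deps).static_order())
        macro_ops, current, used = [], [], 0
        for block in order:
            if current and used + instr_count[block] > capacity:
                macro_ops.append(current)
                current, used = [], 0
            current.append(block)
            used += instr_count[block]
        if current:
            macro_ops.append(current)
        return macro_ops

    deps = {"B0": set(), "B1": {"B0"}, "B2": {"B0"}, "B3": {"B1", "B2"}}
    sizes = {"B0": 10, "B1": 20, "B2": 25, "B3": 8}
    # e.g. [['B0', 'B1'], ['B2', 'B3']]; the exact grouping depends on the order chosen
    print(group_basic_blocks(deps, sizes, capacity=36))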

3.6 Conclusion

In this chapter, we described the macro dataflow execution model for efficient execution on a CGRA. An application is partitioned into application substructures. A macro dataflow graph composed of macro dataflow operations (i.e. application substructures) can be constructed. These macro operations can receive many inputs and produce many outputs. Correct execution can only be ensured when a macro operation is scheduled after all its inputs arrive, due to the absence of a context switching mechanism. A well-behaved macro operation may produce fewer than its maximum number of outputs; however, it needs to ensure that all macro operations which are data-dependent on it are eventually executed or terminated. In order to ensure this, we developed a model that can exactly determine the number of expected inputs prior to termination of a macro operation. An orchestration unit with finite resources can schedule well-behaved macro operations. The expected behavior of the runtime orchestration unit was described in detail in section 3.4. The model presented here is agnostic to the granularity of macro operations, the choice of computation unit and the interconnection. The inclusion of a loop within a macro operation is to be determined by the capabilities of the computation units on which it is to be executed.

The model of dataflow computation presented in this chapter extends the dataflow model specified by Arvind and Nikhil (1990). Specifically, the following extensions are made to this model of dataflow:

1. The notion of a macro operation has been extended from a collection of serial instructions to a collection of instructions which can include even loops.

2. We define the conditions for well-behavedness of such a macro dataflow operation.

3. Due to the enforcement of program order through memory precedence edges, we have eliminated the need for I-Structures (Arvind et al., 1989), which have been used in several dataflow architectures. Since the memory ordering is enforced by the hardware, the memory hierarchy is greatly simplified.

This model serves as the interface specification between the compiler and our CGRA implementation. The compiler generates well-behaved application substructures which can be scheduled and executed with finite resources, and the runtime orchestration unit within the CGRA must adhere to the rules laid down for correct execution. In the next two chapters, we present the design of the reconfigurable fabric and the macro dataflow orchestration unit.

Chapter 4

Design of the Reconfigurable Fabric

Custom IP blocks are key to performance in the future - Prof. Arvind (told to Prof. Nandy)

There is a need for having ‘refrigerators’ in future processors - Prof. Yale Patt1

A singular distinguishing feature of a CGRA is its ability to spread the computation across an interconnected set of FUs or ALUs and these units communicate and cooperate to produce the result. It is thus evident that the reconfigurable fabric (which comprises an interconnection of computation units) is the most critical component of a CGRA. Some of the overarching objectives that guided the design of the reconfigurable fabric on our CGRA include:

• Ensuring programmability is not sacrificed for performance

• Improving utilization of the reconfigurable fabric through an appropriate choice of computation units

• Employing, where possible, a single interconnection network, for achieving power efficiency

• Creating a framework which can support different forms of domain-specific customizations.

1These quotations are from some of my interactions during my PhD life. All these statements have profoundly impacted the development of this project and are quoted here as a tribute to each of these people who have so affected it.

The meaning and motivations for each of these objectives are explained in the subsequent sections. In this chapter, we determine the values to be assigned to the following parameters in the parametric design space (refer chapter 2) that help meet the design objectives that have been set forth:

• C - Choice of computation unit

• N - Choice of interconnection network

• T - Choice between single context and multiple contexts

• P - Decision on whether partial reconfiguration will be supported.

• M - Choice of memory hierarchy

This is followed by a description of the design of the reconfigurable fabric.

4.1 Programmability is of paramount importance

Most CGRAs use a high-level language, viz. the C language, as the design entry point for programming a CGRA. However, many of them support only a subset of the language. One of our primary goals while developing this CGRA was to support any program written in the C language (C89 standard). While some constructs may be realized efficiently and others may not, the CGRA and its compiler should be able to find a mapping for any valid C program. Most CGRAs that support a subset of the C language do not permit pointer-based accesses, recursion etc. The reason for this restriction is that CGRAs are usually used to accelerate loops. Well-structured loops can be written without the use of these constructs. However sound the reasoning may seem, it usually leads to re-engineering of the source code. Our design philosophy is that, for a platform to be usable and commercially viable, it must support the entire gamut of constructs available in the language. Some constructs such as pointers may not lead to efficient realization; however, any C code can be directly compiled on to this platform and the developer may optimize only those portions that are critical for performance. An intended side effect of this feature is that the platform is capable of executing any code and not just loops. This breaks the tight coupling needed between the host processor and the reconfigurable fabric.

4.2 Improving Fabric Utilization

The choice of computation unit on a CGRA spans the entire range from a single FU or a small set of FUs up to a full-fledged ALU as seen on modern general purpose processors. This choice is probably the most critical one with regard to the design of the reconfigurable fabric. Employing a single FU reduces the area of each computation unit and permits having a large number of them on the fabric. On the other hand, for a given interconnection network, it limits the number (and hence the type) of FUs with which each FU can interact. To illustrate, let us assume a reconfigurable fabric which comprises FUs interconnected through a mesh topology. Any FU can have at most four neighbours. The type of each of these four neighbouring FUs can be chosen in multiple ways, viz. it may be chosen based on statistical observation of instruction sequences. If the choice is such that, for an instruction sequence in a certain application, the producer and consumer are not directly connected, then it results in multi-hop communication. This results in under-utilization of the fabric (as the intermediate FU is not being used) and the number of instructions that can be packed within a configuration is less than the number of FUs available. Other possible solutions include having different types of FUs in the neighbourhood of a given type of FU in different locations on the fabric. Such solutions make the task of instruction placement extremely difficult. One may also increase the degree of each node in the interconnection network so that all types of FUs are reachable in one hop from any other FU. A cheaper alternative which provides the same reachability is an ALU. However, the disadvantage of employing an ALU is that, in order to increase utilization, the area of each computation unit increases manifold when compared to a computation unit which employs a FU. Therefore, the number of instructions that can be accommodated on the fabric at any point in time comes down, assuming one instruction is assigned to a computation unit. Having smaller configurations implies that configurations need to be loaded frequently, thus obviating any advantage of spatial computation. In order to ameliorate this, the number of instructions that can be accommodated per computation unit is increased, i.e. a multi-context reconfigurable fabric is used.

4.3 A Unified Interconnection

Many architectures (viz. Raw, TRIPS) employ multiple networks for the transport of instructions, data and control messages. If a point-to-point interconnection network is used to transport data operands, it is imperative that a different network is used to transport instructions. The network for transporting instructions is only active at the time of configuration and is subsequently not used. In an ideal reconfigurable architecture, the ratio of time spent in computation to the time spent in loading the configuration is expected to be maximal. Therefore, it is possible to merge the instruction and data delivery networks with no impact on the performance of a single-configuration CGRA. However, on a multi-configuration CGRA, if execution of another configuration is in progress then loading of instructions would affect the data transport due to the use of shared resources. Some of these effects can be ameliorated through appropriate choice of routing paths and placement of configurations (in the presence of partial reconfiguration) within the reconfigurable fabric. In our CGRA, we employ a NoC and evaluate the viability of this choice for a reconfigurable fabric. The NoC serves as a single unified network for transporting instructions and data. The NoC expects the destination to be specified as a part of the packet, which is then routed based on a hard-wired routing algorithm. This choice also reduces the amount of time needed to transfer the configuration associated with an application substructure, as the configuration for the interconnection is now encoded as a destination address (employing a smaller number of bits) along with the instruction.

4.4 Domain-specific customization to achieve better performance

Most CGRAs are deployed as a part of an embedded system. This class of systems needs to achieve performance as specified by the throughput requirements of the application and at the same time be as energy-efficient as possible. Embedded systems achieve this through the use of application/domain-specific units which accelerate one or more kernels which critically impact power and/or performance. Thus, the CGRA must support mechanisms to instantiate application/domain-specific accelerators without much change to the other parts of the CGRA. It may not be possible to have application/domain-specific accelerators in all computation units. In such a case, the choice of other FUs to be included along with the accelerator in the same computation unit must be based on the typical frequency of interactions between the domain-specific unit and the general-purpose units. Such a choice will render the reconfigurable fabric heterogeneous.

4.5 High-Level Design choices

Based on the discussion presented in the previous subsections, our reconfigurable fabric is designed as an interconnection of ALUs. We employ a single NoC as the interconnection network. The computation units are most efficient when they have a simple pipeline. The computation units are designed to issue a single instruction every clock cycle. Ideally, we should be able to exploit instruction level parallelism by scheduling parallel instructions on different computation units. Such an interconnection of computation units can directly transfer data from the producer to the consumer instruction through the use of the NoC. Our statistics (shown in the plots of figure 4.1) indicate that a large percentage of the instructions have a very small number (typically 1-2) of consuming instructions. Zero-destination instructions refer to store operations. As is evident from the cumulative line plots, nodes with outdegrees of 0, 1, 2 and 3 account for almost all of the nodes in the dataflow graph. A higher outdegree is typically observed on nodes which distribute a predicate after evaluation of a condition, or on load instructions which access data that is used extensively in the rest of the computation. The consumer instructions may be present in different computation units and the result of an operation may need to be delivered to multiple computation units. We deliver the results to all consumer instructions instead of using a common storage location.
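The statistic plotted in figure 4.1 is straightforward to compute from an application's dataflow graph; the sketch below shows one way to do so, with an invented successor-list representation used purely for illustration.

    # Percentage of dataflow-graph nodes with each outdegree (cf. figure 4.1).
    from collections import Counter

    def outdegree_percentages(successors):
        """successors: node -> list of consumer nodes."""
        counts = Counter(len(v) for v in successors.values())
        total = len(successors)
        return {deg: 100.0 * n / total for deg, n in sorted(counts.items())}

    g = {"ld": ["add"], "add": ["mul", "st"], "mul": ["st"], "st": []}
    print(outdegree_percentages(g))   # {0: 25.0, 1: 50.0, 2: 25.0}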

Akin to other CGRAs, we place the instruction and data memory at the periphery of the fabric. Further, to allow future extendibility, we assume that all data memory accesses, through load/store instructions, are of variable latency. This assumption permits us to replace the data memory by a data cache. The choice of data memory versus data cache is based on the amount of data expected to be stored over the lifetime of the application. In the case of streaming applications, the data arrives over the input interface and, once it is processed, it is no longer needed. Therefore, a small amount of data memory would suffice. However, in applications where data is processed from a storage device, the amount of data memory needed may be quite large, making it impossible to have such a large on-chip memory. In such cases, a data cache is preferred. In order to withstand the variable latencies associated with loads, the computation unit needs to determine if the data is available prior to execution. It is to be observed that the NoC too adds a certain amount of non-determinism to the delay between the source and the destination.

Figure 4.1: The plots show the percentage of nodes having a given outdegree, for different applications: (a) AES Decrypt and Encrypt; (b) CRC, SHA-1 and IDCT; (c) Elliptic Curve Point Addition and Doubling; (d) Matrix kernels: Matrix Multiply, LU decomposition and Givens Rotation; (e) FFT and MRI-Q.

In the presence of a fair arbitration policy, the probability of selecting a packet among several packets vying for the same output port in a given clock cycle is equal. Therefore, the exact arrival time of a packet cannot be determined. To address this requirement of determining the availability of the operands prior to execution of the instructions, we introduce a reservation station at each computation unit. The reservation station is common to all FUs present within the computation unit. The presence bit associated with each operand is used to ascertain if the data operand is available. The result of a computation is delivered directly to the destination computation unit and to the appropriate location within it. Unlike a superscalar processor, where a bus based interconnection network is employed, the NoC transports the result from the source to the destination. The disadvantage of using the NoC is that only one packet (to a single destination) may be sent every clock cycle. When multiple destinations (instructions) are present, it takes as many clock cycles to transfer the result. As is evident from figure 4.1, many instructions have a single destination, thus the overhead is not expected to be very high. The design of the reconfigurable fabric, which adheres to the high-level design decisions described above, is elucidated in the subsequent sections.

4.6 Design of the Computation unit (C, T )

As shown in figure 4.2, the computation unit comprises:

1. Instruction, Operand and Predicate stores
2. Instruction selection logic
3. Packet Creation stage
4. Writeback stage

Each computation unit can store multiple instructions along with their operands and predicates. We implement multiple contexts in the form of multiple instruction buffers per computation unit. A maximum of three operands per instruction is supported. The storage for the instructions and the various operands is separate and can be written to independently. Each instruction includes the opcode for the operation to be executed and a set of at most three destinations. Predicates indicate whether the instructions will be executed. If the predicate is false (and the instruction expects a predicate), the instruction is squashed.

Figure 4.2: Schematic diagram showing the internals of a computation unit (instruction selection, ALU/compute, packet creation and writeback stages) connected to a router.

Figure 4.3: Instruction selection logic in a computation unit, showing the instruction store, operand stores 1-3 and the predicate store. V - valid bit; P - predicate value; E - predicate expected.

Each of these entries (i.e. instruction, operands and predicate) has a valid bit associated with it that indicates its availability. These valid bits are AND'd together to determine whether an instruction is ready for execution. In the case of the predicate, the valid bit and the value of the predicate are considered only if a predicate is expected. In the case of operations which require fewer than three inputs, the operand valid bits for the unused operands are preset at the time of instruction loading. The instruction selection logic is shown in figure 4.3. If multiple instructions become ready for execution in the same cycle, then one among them must be chosen for execution. A priority encoder consumes all Instruction Ready signals and picks one among them for execution. This mechanism ensures that any instruction which is ready for execution is forwarded to the ALU. The use of valid bits for each operand makes the computation unit resilient to the variable latencies. The selected instruction and its operands are copied into the pipeline register between the instruction selection stage and the ALU stage. Once an instruction is selected, its valid bit is reset so that it is not scheduled again.

The ALU, as mentioned previously, includes several FUs. The number and type of FUs supported can be configured at the time of design instantiation. The ALU can support one instruction every clock cycle. Multiple instructions in flight are currently not supported. Most operations are single cycle. The exceptions to this include integer multiplication, floating-point operations and some domain-specific units. If a multi-cycle operation is in progress, then the ALU does not accept another instruction until the current instruction completes execution. The result of the computation and the list of destinations (extracted from the instruction) are copied into the pipeline register between the ALU stage and the packet creation stage. If the destination of a computation is an instruction in the same computation unit, then it is forwarded to the write-back stage, with appropriate bits indicating the intended destination. If the destination is an instruction in a different computation unit, a packet is constructed and forwarded to the write-back stage, from where it is written to the router's input port in the following clock cycle. Every clock cycle, the packet creation stage can process at most one destination in the same computation unit and one destination in a different computation unit (since the write back stage has only one port for each of these destinations). In case not all the destinations of an instruction can be processed within the same cycle, the pipeline is stalled until all destinations are processed.
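A behavioural sketch of the instruction selection logic described above (valid bits AND'd together, with a priority encoder over the ready signals) is given below; the entry format and field names are invented for this example and do not reflect the actual hardware description.

    # Behavioural model of the instruction selection logic (cf. figure 4.3).
    def instruction_ready(e):
        pred_ok = (not e["pred_expected"]) or (e["pred_valid"] and e["pred_value"])
        return e["instr_valid"] and all(e["op_valid"]) and pred_ok

    def priority_encode(entries):
        """Return the index of the first ready instruction, or None."""
        for i, e in enumerate(entries):
            if instruction_ready(e):
                return i
        return None

    entries = [
        {"instr_valid": True, "op_valid": [True, False, True],
         "pred_expected": False, "pred_valid": False, "pred_value": False},
        {"instr_valid": True, "op_valid": [True, True, True],
         "pred_expected": True, "pred_valid": True, "pred_value": True},
    ]
    selected = priority_encode(entries)
    assert selected == 1
    entries[selected]["instr_valid"] = False    # reset so it is not scheduled again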

Figure 4.4: Schematic diagram of the write back stage, showing the packet-type decoders, the arbiters for the operand stores, the predicate merge and the instruction store, with inputs from the packet creation unit and from the router.

The write back stage receives packets from the packet creation stage and incoming packets from the router. Packets intended for the same computation unit are written into the appropriate storage locations and packets meant for other computation units are enqueued into the router. If the incoming packets from the packet creation stage and the router are meant to be written to different register banks, then they can be supported in parallel. It may be recalled that the instruction store, operand stores and predicate store are all held in different register banks (as seen in figure 4.3). However, if both packets are meant for the same register bank, then only one of them is allowed to write to the register bank. There is a fixed priority ordering (local writebacks > router writes) which is implemented by the arbiters shown in figure 4.4. The write back stage also updates the valid bits when writing the instructions, operands and predicates. The predicate bits are all stored in a single register (instead of a register file), where each bit refers to an instruction in the instruction store. This structure allows us to update multiple predicate bits at the same time, hence the name "Predicate Merge" in figure 4.4. When a computation unit completes execution or squashing of all instructions, it marks its status as free. Prior to declaring its status as free, it determines whether all the instructions have been received. The number of instructions to be received is sent as a part of a special packet, prior to the transfer of instructions.
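The fixed-priority arbitration can be illustrated with the small sketch below; the packet representation and function name are assumptions made only for this example.

    # Fixed priority at the write back stage: when a local writeback and a router
    # packet target the same register bank, the local writeback wins and the
    # router packet is deferred to a later cycle (illustrative model).
    def arbitrate(local_pkt, router_pkt):
        """Each packet is (bank, payload) or None. Returns (accepted, deferred)."""
        if local_pkt and router_pkt and local_pkt[0] == router_pkt[0]:
            return [local_pkt], [router_pkt]      # same bank: local writeback wins
        return [p for p in (local_pkt, router_pkt) if p], []

    accepted, deferred = arbitrate(("operand1", 5), ("operand1", 9))
    assert accepted == [("operand1", 5)] and deferred == [("operand1", 9)]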

Figure 4.5: Block diagram of the router, showing the input and output ports (to/from north, east, west and south), the round robin arbiter and the multi-write FIFOs.

4.7 Design of the Router and the Reconfigurable Fabric (N)

Another critical component of the reconfigurable fabric is the router. As shown in figure 4.2, each computation unit is connected to a router. An interconnection of these routers is referred to as the NoC. Each router has the structure shown in figure 4.5. Each router has a fixed number of input ports and output ports. The number of input ports and output ports is determined by the topology of the interconnection. If the honeycomb topology is used, then a router needs three input ports (and output ports) to interact with the three neighbouring routers and one input port (output port) to interact with the computation unit to which it is attached. If the mesh topology is used, then a total of five input/output ports are needed: four ports for the four neighbours and one for the computation unit. In figure 4.5 we show the structure of a router for the honeycomb topology. The router for the mesh topology is similar in structure. The router shown in figure 4.5 implements an output-buffered mechanism. This helps in eliminating back-to-back registers between the packet creation stage and the input port of the router. This does not cause much change to the performance of the router. At the output, the router implements a multi-write FIFO. A multi-write FIFO allows multiple packets to be enqueued every cycle. Only one packet can be dequeued from this FIFO every clock cycle.

Figure 4.6: Reconfigurable fabric in which computation units are interconnected using the honeycomb topology. The blocks with wavy shading are the peripheral routers.

When multiple packets are enqueued, the round robin arbiter determines the order in which the packets are enqueued. If the space in the output FIFO is not sufficient to enqueue all packets to the intended output port, the packets to be enqueued are determined by the round robin arbiter. If a packet is accepted, the acknowledgement line is asserted high and the packet is removed from the pipeline register. The honeycomb router implements the routing algorithm reported by Fell et al. (2009a). The router for the mesh implements a deadlock-free routing algorithm. The router accepts relative addresses. The relative address of the destination is updated prior to enqueuing in the output FIFO. When a router receives a packet with destination address (0, 0), it forwards the packet to the computation unit.

The reconfigurable fabric is an interconnection of computation units. Each computation unit is connected to a router and the routers are interconnected in either the honeycomb or the mesh topology. As indicated by Satrawala et al. (2007), there are only three possible planar tessellations, namely honeycomb, mesh and hexagon. Of the three possible topologies, honeycomb has the least hardware overhead since it has the least degree per node. Almost all of our experiments use the honeycomb topology. As reported by Asanovic et al. (2006), communication characteristics for different applications are different and hence the topology of the fabric needs to be chosen in an application aware manner. While the choice of interconnection network topology is a very important design parameter, in this thesis we do not delve further into it.
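Relative addressing can be illustrated with the generic dimension-order sketch below. This is not the honeycomb routing algorithm of Fell et al. (2009a); the port names and sign conventions are arbitrary choices for the example. Each hop updates the packet's relative offset, and an offset of (0, 0) means the packet is delivered to the attached computation unit.

    # Illustrative relative-address routing (generic XY order on a mesh).
    def route_step(dx, dy):
        """Return (output_port, new_dx, new_dy) for one hop."""
        if dx == 0 and dy == 0:
            return "local", 0, 0                       # deliver to computation unit
        if dx != 0:
            return ("east", dx - 1, dy) if dx > 0 else ("west", dx + 1, dy)
        return ("north", dx, dy - 1) if dy > 0 else ("south", dx, dy + 1)

    dx, dy, hops = 2, -1, []
    while (dx, dy) != (0, 0):
        port, dx, dy = route_step(dx, dy)
        hops.append(port)
    assert hops == ["east", "east", "south"]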

In figure 4.6, each of the white boxes represents a tile. A tile is a computation unit and its associated router (as shown in figure 4.2). The boxes at the periphery of the fabric (boxes with wavy shading) are referred to as peripheral routers. The peripheral routers do not have a computation unit associated with them. They are instead connected to other modules such as the Load-Store unit, instruction memory etc. The fabric has a horizontal toroidal link. This link helps reduce the number of hops between computation units and the nearest peripheral router. Since any computation unit that wants to load/store data must forward the request to the data memory at the periphery, it is important to reduce the time lag in reaching the periphery of the fabric. For ensuring deadlock-free routing with the honeycomb topology, it is essential to maintain the size of the fabric as a multiple of 3 (Fell et al., 2009b). Since 3 × 3 is very small, we choose a fabric of size 6 × 6.

4.8 Design of the Load-Store Unit (M)

The load-store units (shown in figure 4.7) are connected to the peripheral routers and provide access to the data memory. A load-store unit processes load/store requests. A load request specifies the memory address, the expected data width (based on the data type) and a destination. The destination is the instruction that consumes the result of the load. A store request comprises data, an address and the type of data to be written. In the case of a load request, the destination address is stored in a FIFO and retrieved when the memory responds. A data packet is constructed and sent over the NoC to the destination. The load-store units are connected to each peripheral router (shown in figure 4.6). The address range of the data memory is partitioned across these data memory banks. The higher order bits of the address are used to determine the bank to which the address belongs. Each computation unit issuing a load/store request inspects these higher order bits to determine the load-store unit to which the said request needs to be forwarded.

A critical requirement while issuing load-stores in a distributed manner (i.e. from different computation units) is ensuring memory consistency. The order in which these requests reach the load-store unit must be as specified by the program order. In order to relax this model, we add the restriction that only loads/stores referring to the same address must appear at the load-store unit in program order. Loads/stores referring to different addresses may be received at the load-store unit in any order. However, this assumes the existence of perfect alias information.

Figure 4.7: Schematic diagram of the Load-Store Unit, comprising request processing, response processing, the data memory and the response address FIFO.

Figure 4.8: The connection of the Load-Store Units and the Macro Dataflow Orchestration Subsystem with the reconfigurable fabric is shown; they attach to the fabric through a switch and the peripheral routers.

In the absence of perfect alias information, all loads/stores in the same alias set must be received at the load-store unit as specified by the program order. In order to achieve this, the compiler adds memory precedence edges between loads/stores in the same alias set. When a memory precedence edge is present, the load/store request additionally includes the location of the subsequent load/store instruction in program order. The load-store unit sends a trigger to this instruction indicating that it may now be issued. The trigger flag is copied into the predicate store and is treated as a predicate by the computation unit. The mechanism of sending a trigger to the subsequent instruction is necessary because this instruction may be placed in a computation unit closer to the load-store unit than the preceding load/store instruction in program order. The trigger mechanism eliminates any window in which these requests could appear at the load-store unit out of order. The mechanism described does not address the ordering between two unrelated loads/stores. In order to implement a complete memory consistency model, there is a need for memory synchronization primitives, viz. test-and-set and swap. To ensure memory ordering between the memory synchronization instructions and loads/stores, the current mechanism of adding memory precedence edges can be extended.
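A behavioural sketch of the trigger mechanism follows; the request fields (including the optional successor field that names the follow-on load/store in the same alias set) are assumptions invented for this illustration, not the actual packet format.

    # Behavioural sketch of the load-store unit's trigger mechanism for memory
    # precedence edges (packet fields are illustrative).
    data_memory = {}
    triggers_sent = []

    def handle_request(req):
        if req["op"] == "store":
            data_memory[req["addr"]] = req["value"]
        elif req["op"] == "load":
            # a real implementation would send the datum to req["dest"] over the NoC
            _ = data_memory.get(req["addr"])
        if req.get("successor") is not None:
            # release the next load/store in the same alias set
            triggers_sent.append(req["successor"])

    handle_request({"op": "store", "addr": 0x40, "value": 11,
                    "successor": ("cu_3", "instr_7")})     # trigger acts as a predicate
    handle_request({"op": "load", "addr": 0x40, "dest": ("cu_5", "instr_2"),
                    "successor": None})
    assert triggers_sent == [("cu_3", "instr_7")]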

4.9 Relating to the Parametric Design Space

With regard to the parametric design space defined in chapter 2, the values of some of the parameters have been identified based on the high-level motivations set forth in chapter 1. These parameters and their values are listed in table 4.1. The use of the NoC with relative addressing automatically makes the fabric capable of supporting partial reconfiguration, i.e. multiple macro dataflow operations can be executed simultaneously on the reconfigurable fabric. This is made possible because each router employs a relative addressing mechanism while transferring data.

Table 4.1: The values of various parameters have been decided based on the overarching goals that were set forth at the outset.

C  ALU
N  NoC
T  Multiple contexts with the ability to dynamically choose any context
P  Partial reconfiguration is supported due to the use of the NoC
M  Load-store memory banks placed at the periphery of the fabric

This implies that any data transfers that occur only need to know the destination address with respect to the source computation unit. Thus a macro dataflow operation can be placed anywhere on the reconfigurable fabric as long as the relative placement of source and destination instructions remains the same.

4.10 Conclusion

In summary, in the reconfigurable fabric of our CGRA, ALUs are employed within the computation units, and each computation unit can store and execute several instructions. The interconnection network is a light-weight NoC. The load-store units are placed at the periphery of the fabric. The latency of load/store operations is assumed to be variable (i.e. not fixed), permitting the replacement of the data memories by data caches for future extendibility. The use of the NoC and the load-store units necessitates a check for the availability of input operands prior to the execution of instructions. In order to implement this, we employ a reservation station-like structure for instruction selection. The result of a computation is directly delivered to the consuming instruction and up to three destinations are stored as a part of the instruction. As mentioned, the choice of a NoC is a design trade-off made to have a single interconnection and to reduce reconfiguration overhead. Further details of the impact of the NoC are discussed in chapters 6 and 7.

Chapter 5

Design of the Macro-Dataflow Orchestration Subsystem

It is slow, expensive and horribly complex - Dr. S. Balakrishnan1

The reconfigurable fabric executes the instructions of a macro operation. However, the other tasks, such as:

1. the selection of a macro operation for execution,
2. determining the computation units for the macro operation on the reconfigurable fabric,
3. transferring the relevant instructions and data of the macro operation to the reconfigurable fabric, and
4. facilitating communication between two macro operations,

have to be managed externally. This external entity is referred to as the Macro-Dataflow Orchestration Subsystem. In this chapter, we explain the design of the macro dataflow orchestration subsystem that handles all of these tasks based on the theoretical framework set forth in chapter 3.

5.1 Overview of Operation

The macro-dataflow orchestration subsystem which implements the aforementioned functionalities is shown in figure 5.1.

1These quotations are from some of my interactions during my PhD life. All these statements have profoundly impacted the development of this project and are quoted here as a tribute to each of these people who have so affected it.

Figure 5.1: Schematic of the Macro-Dataflow Orchestration Subsystem, comprising the orchestration unit, the resource allocator, the instruction and data transfer unit, the context memory and the update logic; it receives resource status from the fabric and transfers instructions and data to the fabric.

1. The Orchestrator unit selects an appropriate macro operation for execution. In the context of our CGRA, we refer to a macro operation as a HyperOp.

2. A HyperOp that is selected for execution needs to be assigned computation units on the reconfigurable fabric where it can be executed. Each HyperOp specifies the number of computation units, the type of computation units (in case of a heterogeneous reconfigurable fabric) and the exact interconnection between them. The required interconnection pattern is specified as a computation unit requirement matrix of fixed size (in our case 5 × 5). The hardware unit referred to as the Resource Allocator (refer figure 5.1) attempts to allocate a group of computation units on the fabric, such that their placement with respect to each other is as given by the requirement matrix. This is explained in section 5.3.

3. Once the group of computation units is identified, the instruction and data transfer unit transfers instructions, constants and input data to these computation units. This unit receives the (x, y) coordinates from the resource allocator and the HyperOp's input operands from the orchestrator. Once the instructions and data are transferred on to these computation units, the execution of the HyperOp ensues. This is explained in section 5.4.

4. During the course of execution, a HyperOp may produce data for another HyperOp.


As mentioned in chapter 3, a HyperOp may not start execution until all its input operands are available in the context memory. Thus, the consumer HyperOp will not be available on the reconfigurable fabric and the data for it needs to be written into the context memory. This data is forwarded to the update unit (refer figure 5.1), which is responsible for determining the appropriate location within the context memory where the operand needs to be stored. When all operands for another HyperOp are available in the context memory, it is selected for execution and the process repeats. This is explained in section 5.2.

In this chapter, we will present the design details of the units we have alluded to thus far.

5.2 Context Memory Update Logic and Orchestration Unit

5.2.1 Determining the Consumer's Instance Number

When a HyperOp produces data that is to be consumed by another HyperOp, there are two mechanisms by which this can be accomplished: it can be written to the data memory through the load-store unit, or it can be written to the context memory. In our implementation, a change to a vector is routed through the data memory and a change to a scalar is routed through the context memory. Other possible implementations include (i) routing both scalars and vectors through the data memory, (ii) writing both scalars and vectors to the context memory, and (iii) vectors to the context memory and scalars to the data memory. In the case of the context memory, the input data is available to the HyperOp at the start of its execution. In the case of the data memory, the effective address may need to be computed, followed by a load request to the data memory. The load request incurs a round-trip delay. Thus communication through the context memory is faster than communication through the data memory. This rules out options (i) and (iii). Option (ii) is not a viable solution, as explained subsequently. Let us first consider the case of transferring a scalar operand from one HyperOp to another. A data operand for a HyperOp identifies the static identifier of the destination HyperOp (as assigned by the compiler). However, this information is insufficient to identify the location within the context memory. During the execution of the application, several instances of the HyperOp may be instantiated.

Further, several instances of the same HyperOp may be active at the same time if the HyperOp participates in a loop or if the function to which it belongs is invoked multiple times. In order to identify the instance of the HyperOp, a separate HyperOp instance number is associated with every HyperOp. We divide the instance number field into 2 components.

1. One component identifies the function instance number and the other component identifies the loop hierarchy within which it is present. The function instance number is 5 bits wide permitting at most 32 function instances to be active at the same time.

2. The loop instance number is 8 bits wide, with 2 bits being used for each nesting depth (permitting a nesting depth of 4). The compiler has checks in place to ensure that the code does not have a nesting depth greater than 4. The loop instance number is arranged such that the least significant bits always represent the instance number of the deepest loop encountered thus far. The two bits to the left of the least significant bits indicate the instance number of the loop that contains the innermost loop, and so on. A sketch of this packing is given after the list.
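For concreteness, a minimal sketch of this instance-number packing follows. It assumes that the 5-bit function-instance field sits immediately above the 8-bit loop-instance field; the helper names and the exact bit placement are our assumptions for illustration, not the documented hardware encoding.

#include <stdint.h>

/* Hypothetical 13-bit instance number layout (an assumption for illustration):
 * bits [12:8] = function instance number (5 bits, up to 32 live functions)
 * bits [7:0]  = loop instance number, 2 bits per nesting level, with the
 *               deepest loop occupying the least significant bits.          */
typedef uint16_t instance_t;

static inline instance_t make_instance(unsigned func_inst, unsigned loop_inst)
{
    return (instance_t)(((func_inst & 0x1Fu) << 8) | (loop_inst & 0xFFu));
}

static inline unsigned func_part(instance_t in) { return (in >> 8) & 0x1Fu; }
static inline unsigned loop_part(instance_t in) { return in & 0xFFu; }

/* Example: iteration 2 of an inner loop inside iteration 1 of its enclosing
 * loop, within function instance 3: loop field = (1 << 2) | 2 = 0x06.       */
/* instance_t example = make_instance(3, 0x06); */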

In the case of a new function instance, a function instance number is assigned to the function from a global pool. Communication to and from a function occurs only at call and return sites. In the case of a call, the instance number of the destination HyperOp is kept the same as the parent's up to the instant the HyperOp is scheduled for execution. When the HyperOp is scheduled, if it is the first HyperOp of a function, a new instance number (obtained from the global pool) replaces the function instance number. The loop instance number of a new function call is cleared and reset to zero. In the case of a return, the instance number of the destination is passed as a parameter. When the return is encountered, this instance number is used directly to address the destination HyperOp instance. The return also triggers the release of the function's instance number, so that it may be reused. The Orchestration unit maintains lists of available and in-use function instance numbers. It assigns a function instance number upon a function call (and moves it to the in-use list), and upon a return the instance number is moved back to the available list.

2 This is not a limitation, since code with nesting depth greater than 4 can be implemented such that each set of four nested loops is present in a different function. However, in a subsequent scheme we address both the limitation on the maximum number of function instances and the limitation on loop depth.

The case of HyperOps participating in loops is handled differently. In order to determine the instance number for which the data is intended, the runtime system is supplied with the instance number of the source HyperOp and a compiler hint describing the relationship between the source and destination instance numbers. Based on the compiler-specified hint and the instance number of the source HyperOp, the destination HyperOp's instance number can be computed. There are four possible cases of HyperOp-to-HyperOp communication, which lead to different instance number relationships. These are shown in figure 5.2. Figure 5.2a shows two HyperOps that are present at the same nesting depth. The last two bits of the iteration index of each of the loops in the hierarchy constitute the loop instance number. It is evident from figure 5.2a that the instance numbers of producer HyperOp 1 and consumer HyperOp 2 are always equal. Figure 5.2b shows a different case, where the scalar is passed from HyperOp 2 to HyperOp 1 and the iteration index of HyperOp 1 is always one greater than that of HyperOp 2. Therefore, in this case the producer's instance number needs to be incremented by 1 to obtain the instance number of the consumer. Figure 5.2c indicates a case where a new nesting depth is encountered. In this case, the loop instance number needs to be shifted left by 2 bits to obtain the destination HyperOp's instance number. Figure 5.2d shows the inverse case, where the destination HyperOp is at a lower level of nesting than the producer HyperOp. Therefore, the instance number of the producer HyperOp needs to be shifted right by 2 bits to obtain the instance number of the destination HyperOp. Apart from these four cases, a few more cases are possible due to the nesting of loops: the innermost loop may skip a level of loop nesting while exiting, and with a nesting depth of four this creates three other possibilities. These seven cases cover all possible communication between two HyperOps when a scalar variable is being modified. There is also the possibility of data being delivered from a source HyperOp that is at a nesting depth higher than that of the destination HyperOp and is not in the containing loop of the destination HyperOp. In most of these cases, these scalar variables are loop invariants. This is not supported, and all such data has to be routed through the immediate parent loop. The reason for this is explained in section 5.2.3. The presence of just two bits per loop level implies that at most four instances of the same loop can be active at a time.
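The instance-number arithmetic implied by these cases can be summarised as a small helper. The sketch below is illustrative only: the hint names and the modulo-4 wrap of the innermost counter are our assumptions, and only the four basic cases of figure 5.2 are shown.

#include <stdint.h>

/* Hypothetical compiler hints corresponding to the four cases of figure 5.2. */
enum inst_hint { HINT_SAME, HINT_NEXT_ITER, HINT_ENTER_LOOP, HINT_EXIT_LOOP };

/* producer/consumer: the 8-bit loop-instance field, 2 bits per nesting level,
 * with the innermost loop in the least significant bits.                     */
static uint8_t consumer_loop_instance(uint8_t producer, enum inst_hint hint)
{
    switch (hint) {
    case HINT_SAME:        /* fig. 5.2a: producer and consumer at same depth  */
        return producer;
    case HINT_NEXT_ITER:   /* fig. 5.2b: next iteration of the same loop      */
        /* only the innermost two bits advance; they wrap modulo 4            */
        return (uint8_t)((producer & 0xFCu) | ((producer + 1u) & 0x03u));
    case HINT_ENTER_LOOP:  /* fig. 5.2c: consumer is one nesting level deeper */
        return (uint8_t)(producer << 2);
    case HINT_EXIT_LOOP:   /* fig. 5.2d: consumer is one nesting level outer  */
        return (uint8_t)(producer >> 2);
    }
    return producer;
}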

3 We are assuming that the iteration index is always incremented. In either case, the instance number is always incremented without consideration of the iteration index; the iteration index was used only for the purposes of illustration.

[Figure 5.2: Four possible cases of HyperOp to HyperOp communication through scalar variables (shown with solid edges). (a) Within the same iteration; (b) from one iteration to the next iteration; (c) from outside the loop to the first iteration; (d) from within the loop to a HyperOp outside.]

If an update to a vector by an instance of a HyperOp were consumed by another instance of the same HyperOp, but with an instance number difference greater than 4, then the lack of sufficient instance number bits would render it impossible to communicate. For this reason, when vector data is exchanged between two HyperOps it is always routed through the data memory. However, in cases where the distance between vector updates is within the limit permitted by the hardware, the compiler can route this through the context memory.

5.2.2 Determining the Location within the Context Memory

We have thus far explained the rationale and technique for generating the instance number of the consumer based on the producer's instance number. Once the consumer HyperOp's static identifier and instance number are known, it should be possible to determine the exact location within the context memory. The context memory is organized into several fixed-size contexts, where each context can accommodate all the inputs for a HyperOp instance. In our implementation, the maximum number of operands a HyperOp can consume is 16. By placing all the operands for the same HyperOp instance at the same location, the task of detecting a HyperOp's readiness to execute is made simpler. A separate store indicates the number of expected inputs for each HyperOp. When the number of input operands received for a HyperOp equals the expected number of inputs, the HyperOp is ready for execution. The context for a HyperOp instance is allocated when the first operand for that instance arrives. Subsequently, all operands for the same instance are copied into the same context. In order to track this, we maintain a lookup table in hardware. This lookup table is indexed by the HyperOp's static identifier. The lookup table is four-way associative, i.e. it can store up to four instances, each tagged with a HyperOp instance number. When the consuming HyperOp's static identifier and instance number are known, the hardware determines from the lookup table whether a context has been allocated. If no allocation exists, an appropriate context location is assigned to this HyperOp identifier and instance number. A corresponding entry is made in the lookup table, so that subsequent operands for this HyperOp identifier and instance are forwarded to the same location. When the HyperOp instance is scheduled, or if it is terminated (due to the arrival of a false predicate), the context is released and the entry in the lookup table is erased.
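A behavioural sketch of this allocation step is given below. The table dimensions, structure fields and the trivial free-context counter are illustrative assumptions; the actual hardware maintains a proper free list and stalls when no way is available.

#include <stdint.h>

#define NUM_SETS     1024  /* one set per static HyperOp identifier (10 bits) */
#define WAYS         4     /* associativity of the lookup table               */
#define NUM_CONTEXTS 64    /* fixed-size contexts in the context memory       */

struct ctx_entry {                /* one way of one lookup-table set          */
    int      valid;
    uint16_t instance;            /* tag: HyperOp instance number             */
    uint8_t  ctx_index;           /* location within the context memory       */
    uint8_t  arrived;             /* operands received so far                 */
};

static struct ctx_entry table[NUM_SETS][WAYS];

/* Placeholder allocator; the hardware tracks and recycles released contexts. */
static uint8_t next_ctx = 0;
static uint8_t alloc_context(void) { return (uint8_t)(next_ctx++ % NUM_CONTEXTS); }

/* Deliver one operand for (hyperop_id, instance). Returns the context index
 * and sets *ready when the expected number of operands has been received.    */
static int update_context(uint16_t hyperop_id, uint16_t instance,
                          uint8_t expected, int *ready)
{
    struct ctx_entry *set = table[hyperop_id % NUM_SETS];
    for (int w = 0; w < WAYS; w++) {              /* hit: existing instance    */
        if (set[w].valid && set[w].instance == instance) {
            set[w].arrived++;
            *ready = (set[w].arrived == expected);
            return set[w].ctx_index;
        }
    }
    for (int w = 0; w < WAYS; w++) {              /* miss: first operand       */
        if (!set[w].valid) {
            set[w] = (struct ctx_entry){ 1, instance, alloc_context(), 1 };
            *ready = (expected == 1);
            return set[w].ctx_index;
        }
    }
    return -1;  /* all four ways in use: the update pipeline would stall here  */
}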

4 This is a configurable parameter in the hardware implementation and compiler.

5.2.3 Handling of Loop Constants

Some of the input data for HyperOps which are executed repeatedly are produced once and have to be used several times. Since the producer HyperOp exits after producing the data once, there is a need for a mechanism that delivers this data to each instance of the consuming HyperOp. Such loop-invariant data are stored in a separate data store within the update unit. Whenever a new instance of a HyperOp is created (by the arrival of some input operand), the update unit searches the loop-invariant data store for the presence of loop-invariant operands. If they are present, these are transferred to the context memory of the HyperOp instance. The loop-invariant data for a HyperOp is cleared upon the receipt of a special command, which the Orchestration unit generates upon successful evaluation of the exit condition. This is explained in greater detail in the subsequent section. The loop-invariant data store is indexed by the HyperOp identifier of the destination HyperOp and is tagged with the instance number of the source HyperOp, akin to the lookup table mentioned previously. However, this store is direct mapped. Unlike the lookup table, which stores only the address within the context memory, the entire context needs to be stored in the loop-invariant data store. This is because the number and positions of the loop-invariant data are not fixed. When a new HyperOp instance is created, the loop-invariant data store is searched using the destination HyperOp's static identifier. The source HyperOp instance number is determined by shifting the destination HyperOp's instance number right by two bits. This is because the compiler appropriately modifies the dataflow graph such that loop invariants can only be generated by the immediately enclosing outer loop. In the absence of this scheme, the search for loop invariants would need multiple passes, each time with the instance number of the destination HyperOp being shifted right by two further bits. Neither the compiler nor the execution guarantees that the loop-invariant data arrives at the orchestration subsystem prior to the normal data for the HyperOp. Thus, for correct execution, the context memory is searched for an entry for the destination HyperOp when a new loop invariant is written.
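A corresponding sketch of the loop-invariant lookup follows, reusing the instance-number layout assumed earlier; the sizes, field names and the mask-based copy are assumptions made for illustration, not the actual store organisation.

#include <stdint.h>

#define MAX_OPERANDS 16

/* Hypothetical direct-mapped loop-invariant store: one full context per
 * destination HyperOp identifier, tagged with the producer's instance number. */
struct li_entry {
    int      valid;
    uint16_t src_instance;            /* tag: producing (outer-loop) instance  */
    uint32_t operand[MAX_OPERANDS];   /* the whole context is stored           */
    uint16_t present;                 /* bitmap of valid operand slots         */
};

static struct li_entry li_store[1024];

/* Called when a new HyperOp instance is created in the context memory.
 * dest_instance uses the layout assumed earlier: loop field in bits [7:0].    */
static void copy_loop_invariants(uint16_t dest_id, uint16_t dest_instance,
                                 uint32_t ctx[MAX_OPERANDS], uint16_t *ctx_mask)
{
    /* invariants come from the immediately enclosing loop: shift the loop
     * portion of the destination instance number right by two bits            */
    uint16_t src = (uint16_t)((dest_instance & ~0xFFu) |
                              ((dest_instance & 0xFFu) >> 2));
    struct li_entry *e = &li_store[dest_id % 1024];
    if (!e->valid || e->src_instance != src)
        return;                                   /* no invariants recorded    */
    for (int i = 0; i < MAX_OPERANDS; i++) {
        if (e->present & (1u << i)) {
            ctx[i] = e->operand[i];               /* deliver into the context  */
            *ctx_mask |= (uint16_t)(1u << i);
        }
    }
}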

5.2.4 Design of the Orchestration Unit

The Orchestration Unit is responsible for:

1. Identifying if a HyperOp is ready for execution (based on the availability of all its input operands)

2. Selecting a HyperOp for execution or squashing from a list of several possible ready HyperOps

3. Allocating a function instance number, if the HyperOp to be scheduled is the first HyperOp of a function (i.e. a function call has been issued).

4. Reclaiming the function instance number when the return instruction is executed

5. Issuing instructions to the loop-invariant data store to clear the data for a specific HyperOp.

A HyperOp is ready for execution when all its inputs (including the predicate input) are available. The number of input operands expected by a HyperOp is known at compile time, and this metadata about the HyperOp is loaded into a lookup table prior to the start of execution. Every time an operand is copied into the context location of a HyperOp instance, a counter is updated. When the counter value equals the expected number of inputs, the HyperOp instance is ready for execution. When several HyperOp instances are ready for execution, a mechanism is needed to appropriately choose the next HyperOp instance for execution. This choice is crucial in ensuring optimal utilization and forward progress. Details of this can be found in papers by Culler (1985) and Satrawala and Nandy (2009). However, our experience indicates that in most programs written in an imperative language, not many HyperOp instances become ready for execution at the same time. In fact, in most cases, just one ready HyperOp instance is available at any time. In cases where TLP is available, the HyperOp instances are either different iterations of the same loop or function calls that can execute in parallel. The number of parallel macro operations active at any time is quite small. Therefore, we use a very simple mechanism, a priority encoder, to select the next HyperOp instance to be executed. The parallelism profile is not expected to be as high as that reported by Culler (1985), due to the difference in the granularity of operations (Culler (1985) deals with dataflow machines which operate at the granularity of an instruction). The static identifier of the selected HyperOp is forwarded to the resource allocator for finding a mapping on the fabric, and the data associated with the HyperOp's instance is forwarded to the instruction and data transfer unit. The context memory associated with the HyperOp's instance is deallocated. When a HyperOp instance is ready

for execution, but the predicate associated with it is false, then the HyperOp instance is squashed and the context memory is deallocated.
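In software form, the selection step reduces to scanning a ready bitmap in fixed priority order. The 64-entry bitmap and the squash comment below are illustrative assumptions standing in for the hardware priority encoder, not its actual implementation.

#include <stdint.h>

/* Pick the lowest-numbered ready context, mimicking a fixed-priority encoder
 * over a 64-entry ready bitmap (one bit per allocated context).
 * Returns the selected context index, or -1 if nothing is ready.
 * If the selected instance carries a false predicate, the caller squashes it
 * and releases its context instead of dispatching it to the fabric.          */
static int select_next_hyperop(uint64_t ready_bitmap)
{
    for (int i = 0; i < 64; i++)
        if (ready_bitmap & (1ULL << i))
            return i;
    return -1;
}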

As mentioned in chapter 3, a function call is very similar to the invocation of another HyperOp. When data to the first HyperOp of a function is made available, a function call is said to have been made. However, a key difference (w.r.t. a normal HyperOp) is that a new function instance number needs to be allotted to the HyperOp instance. At compile time, if a HyperOp is the first HyperOp within a function it is appropriately marked and this information is available in the metadata alongside the expected number of inputs. When a HyperOp instance is selected for execution, if the metadata indicates that it is the first HyperOp of a function, it is allotted a new instance number.

When a return instruction is encountered within the function, a message is sent to the orchestration unit (from the reconfigurable fabric) indicating the HyperOp instance of the function which is executing the return. The function instance number of the HyperOp is reclaimed and moved to the available instance numbers pool.

The loop-invariant data associated with a HyperOp can only be purged when all iterations of the HyperOp have completed. The completion of all iterations can be determined by the evaluation of the exit condition of the HyperOp. The compiler introduces special types of HyperOps called Purge HyperOps. These HyperOps contain as data operands the list of HyperOps whose loop-invariant data needs to be cleared. The purge HyperOps also consume a predicate input. The value of the predicate is directly dependent on the evaluation of the loop exit condition. If the condition evaluates to true, the purge HyperOp is executed. When the HyperOp type is detected as a purge HyperOp, the orchestrator iterates through the list of HyperOps to be purged and sends out a loop-invariant data purge signal to the update unit for each of the HyperOps in the purge list. For efficiency in implementation, if the predicate of a purge HyperOp is received as false, the context area associated with it is not reclaimed and the predicate is not recorded.

Note: Thus far, we have documented the rationale and workings of the initial design of the Orchestrator. This component, along with the necessary compilation steps, is where the gap between an imperative language and dataflow execution is bridged. We present a simplified design of the orchestrator with different trade-offs in chapter 7.

5.3 Design of the Resource Allocator

During the process of compilation, a HyperOp is partitioned into a number of partitioned-HyperOps or p-HyperOps, such that each p-HyperOp can be mapped to a single computation unit on the fabric. This mapping from p-HyperOps to computation units needs to be injective but need not be surjective. This implies that every p-HyperOp needs to have a mapping to a computation unit on the fabric, while the inverse need not be true. The compiler does the partitioning and mapping, and it provides the hardware with a binary computation unit requirement matrix. This matrix indicates the desired interconnection pattern between the computation units needed for execution of the HyperOp. The resource allocator attempts to find a sub-matrix match between the binary computation unit requirement matrix and the binary computation unit availability matrix. The computation unit availability matrix is a binary matrix indicating which computation units are busy and which are available for allocation to another HyperOp. A computation unit is busy when it is currently servicing a HyperOp. When all instructions in a computation unit have completed execution, the computation unit declares itself free.
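The matching step can be viewed as a sliding-window search of the availability matrix for a window that covers every unit demanded by the requirement matrix. The sketch below assumes a 5 × 5 requirement matrix and a 5 × 6 fabric, and ignores heterogeneity of computation-unit types; it is an illustration, not the allocator's actual hardware algorithm.

#include <stdbool.h>

#define REQ  5                /* requirement matrix is 5 x 5                  */
#define ROWS 5                /* fabric dimensions assumed for illustration   */
#define COLS 6

/* Find a top-left fabric coordinate (*px, *py) such that every computation
 * unit demanded by req[][] is currently free (1) in avail[][].               */
static bool find_placement(const int req[REQ][REQ],
                           const int avail[ROWS][COLS], int *px, int *py)
{
    for (int y = 0; y + REQ <= ROWS; y++) {
        for (int x = 0; x + REQ <= COLS; x++) {
            bool fits = true;
            for (int i = 0; i < REQ && fits; i++)
                for (int j = 0; j < REQ && fits; j++)
                    if (req[i][j] && !avail[y + i][x + j])
                        fits = false;
            if (fits) { *py = y; *px = x; return true; }
        }
    }
    return false;             /* no free window matches: the HyperOp waits    */
}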

5.4 Design of the Instruction and Data Transfer Unit

Once a HyperOp instance is selected for execution, its instructions, constants and data operands need to be transferred to the computation units identified for it on the reconfigurable fabric. A block diagram of the instruction and data transfer unit is shown in figure 5.3. The instruction and data transfer unit is connected to five instruction memory banks. These contain the instructions, constants and data transfer instructions for all the HyperOps. When a HyperOp is to be scheduled, the HyperOp's static identifier is used to determine the address within the instruction memories. The address, along with the count of instructions to be transferred, is sent to each instruction memory bank. There is a small unit alongside the instruction memory which transfers the instructions to the packet creation blocks. The packet creation blocks transform each instruction into a packet. They appropriately set the (x, y) address of the destination based on the location identified by the resource allocator and compile-time configuration information identifying the specific computation unit. The packet is forwarded to the identified computation unit through an appropriate peripheral router. The peripheral router on the same row as the destination is chosen for the transfer. To transfer the packet to the appropriate peripheral router, the crossbar is employed.

[Figure 5.3: Schematic of the Instruction and Data Transfer Unit, showing the instruction address lookup table, the five instruction memory banks with their instruction streaming units, the packet creation blocks, the input operand storage and the crossbar.]

The compiler guarantees that no two instructions belonging to different memory banks need to be forwarded to the same peripheral router at run time. In order to transport data to the computation units, data transfer packets are used. The data transfer instructions indicate which of the HyperOp's input operands need to be transferred to which consuming instruction in the computation unit. The transfer proceeds by first transferring all the instructions, followed by the constants and data. This ordering enables us to employ a speculative HyperOp prefetch mechanism described by Krishnamoorthy et al. (2010). Due to the presence of five instruction memory banks and an equal number of packet processing units, up to five transfers can be performed simultaneously onto the fabric.

The transfer proceeds by identifying the right row along which the desired computation unit resides and transferring the packet to the appropriate peripheral router (refer figure 4.6 in chapter 4) in that row. The crossbar aids in transferring the packets created by any of the packet creation units to the appropriate peripheral routers. The instructions in the memory are so organized that all instructions to computation units that map to the same row are present in the same instruction memory bank. This eliminates the possibility of a conflict at the crossbar.

5.5 Conclusion

In this chapter, we presented the details of the macro-dataflow orchestration subsystem. This unit comprises the context memory update and orchestration unit, the resource allocator and the instruction and data transfer unit. The context memory update unit implements hardware-controlled allocation of context locations. In order to implement this, it employs a lookup table that helps in determining the location of a macro operation instance within the context memory. It also includes a loop-invariant data store that helps in delivering loop-invariant data to every instance of a macro operation. The orchestration unit selects a ready macro operation instance for execution and in the process deallocates the context memory associated with it. The selected macro operation is forwarded to the resource allocator for determining an appropriate set of computation units where it may be executed. The instruction and data transfer unit accesses the instruction memory and transfers instructions, constants and data to the computation units identified by the resource allocator.

Experimental Framework and Results

You are using a light-weight NoC - Prof. Nader Bagherzadeh (to Dr. Ratna Krishnamoorthy)
I am still not convinced that one would need an NoC - Dr. Ramanathan Sethuraman (to Dr. S. Balakrishnan)1

In chapter 3, we presented the macro-dataflow execution model for a CGRA. In the subsequent chapters, chapter 4 and chapter 5, we presented the design details of the reconfigurable fabric and the macro-dataflow orchestration subsystem, which are components of the CGRA. In this chapter, we first present the experimental framework and the applications used for performance evaluation of this CGRA. We also provide a brief description of the compilation process. Subsequently, we present the results of the performance evaluation, followed by an analysis of the time spent in various activities during execution of the application.

6.1 Experimental Framework

The block diagram in figure 6.1 shows the reconfigurable fabric connected to the macro-dataflow orchestration subsystem through the peripheral routers. The reconfigurable fabric, as mentioned previously, is an interconnection of computation units. The computation unit is capable of storing 16 instructions in its reservation station.

1 These quotations are from some of my interactions during my PhD life. All these statements have profoundly influenced the development of this project and are quoted here as a tribute to each of these people who have so affected it.

[Figure 6.1: Block diagram showing the interconnection between the reconfigurable fabric and the macro-dataflow orchestration subsystem. The fabric comprises compute units with routers, load-store units, switches and peripheral routers.]

Along with each instruction, there is storage available for its 3 data operands and predicate. When all operands of an instruction are available, it is ready for execution. The instructions are selected for execution as and when they are ready. If more than one instruction is ready in a single clock cycle, then a priority encoder is used to arbitrate among them. The ALU within each computation unit comprises several FUs. The type of FUs employed in a computation unit depends on the domain for which the reconfigurable fabric has been customized. In this thesis, we experiment with two different domain customizations: one for cryptography and the other for supporting floating-point applications. The FUs supported in these domain-specific fabrics are listed in table 6.1, alongside the latency of each FU type in cycles. The latency numbers presented are obtained from the Register Transfer Level (RTL) implementations of these units. The floating-point fabric employs two types of computation units: one computation unit for integer operations and another for floating-point operations. The implementation details of the floating-point unit are available in Choudhary (2011). In the floating-point fabric, of the 5 × 6 computation units, 5 are type II (floating-point) units and the remaining units are type I units. The details of each of the FUs are listed below:

Integer Add and Compare Unit: This unit supports unsigned and signed addition and subtraction. It also supports all comparison operators >, <, =, <=, >=. The operation also specifies the width of the operation i.e. 8-bit, 16-bit and 32-bit. All operations are completed within one clock cycle.

Integer Multiply Unit: This unit multiplies two signed or unsigned numbers and produces a single 32-bit output (akin to the behavior in C language semantics). We do not support retrieval of the 64-bit result, as there is no programmatic technique available in C to do so (unless the data type is changed). This operation too supports 8-bit, 16-bit and 32-bit operations. The operation completes in 6 clock cycles. The unit is not pipelined and only one operation may be in progress at any point in time.

Bitwise and Logic Operations: The Bitwise operations AND, OR, XOR, NOT are supported. These operations are supported on 8-bit, 16-bit and 32-bit widths. Boolean operands are treated as single bit operands and are accepted for logical operations. All of these operations complete execution in a single cycle.

Data Transport Operations: The data transport instructions include the MOV and MOVToHyperOp opcodes. The MOV opcode moves data anywhere within the reconfigurable fabric. The destination field specifies the relative (x, y) coordinates of the destination computation unit, the type of data transferred (viz. operand 1, operand 2, operand 3, predicate) and the particular index within the register file where this data is to be copied. It supports all possible data widths. The destination of a MOV operation is encoded by the compiler. The other opcode, MOVToHyperOp, is used when sending data to the orchestrator subsystem. As discussed in chapter 5, this instruction specifies the HyperOp's static identifier, the instance number of the producer HyperOp, a compiler hint to compute the instance number of the consumer HyperOp and the data to be transported. The consumer HyperOp's instance number is computed using these. All data widths are supported in this instruction. There are additional bits to indicate if this marks the return of a function or if this data is loop-invariant data. This information is used by the orchestrator subsystem to update the data suitably in the context memory. The explanation for the same can be found in chapter 5. The result of a MOVToHyperOp is sent to the peripheral router present along the same row as the computation unit executing the instruction.

Table 6.1: List of all FUs and the latency in clock cycles for the Cryptography fabric and floating-point fabric.

Cryptography Fabric
  Type of CE   FU                                         Latency (cycles)
  I            Add & Compare unit                         1
  I            Integer Multiply                           6
  I            Bitwise and Logical Operations             1
  I            Data Transport Operations                  1
  I            Load-Store Operations                      1
  I            Shift Operations                           3
  II           Field Multiplication & Barrett Reduction   3
  II           Field Squarer                              1

Floating Point Fabric
  Type of CE   FU                                         Latency (cycles)
  I            Add & Compare unit                         1
  I            Integer Multiply                           6
  I            Bitwise and Logical Operations             1
  I            Data Transport Operations                  1
  I            Load-Store Operations                      1
  I            Shift Operations                           1
  II           Floating Point Add & Subtract              9
  II           Floating Point Multiply                    6
  II           Floating Point Divide                      21
  II           Floating Point Sin & Cos                   29
  II           Floating Point Square Root                 36
  II           Integer to/from Floating Point conversion  1

Load-Store Operations: The load/store instructions are the mechanism to access the data memory. A load/store request can be of 8-bit, 16-bit or 32-bit width. The memory address must be appropriately aligned (based on the data width). The load request contains the memory address, the triggered destination and the data destination. The triggered destination is the location of the next load/store which must execute after the load/store request under consideration. As discussed earlier, this is required to maintain correct order among memory operations. The data destination indicates the location of the consumer instruction on the fabric. The store request includes the address, the data and the trigger destination. The FU takes a single cycle to determine the appropriate load-store unit (based on the higher-order address bits) and create a load/store packet to be sent to the load-store unit.

Shift Operations: The shift operations, shift-left and shift-right, are supported for all bit widths. The shift-left and shift-right in the cryptography fabric are supported using the field multiplier. The scheme for supporting these operations on a field multiplier is elucidated by Das et al. (2011). The floating-point fabric uses a dedicated barrel shifter to support these operations, so they are accomplished in a single cycle.

Field Multiplication & Barrett Reduction: A single unit is used to implement Galois Field multiplication over binary fields and Barrett reduction. The design of this unit has been presented by Das et al. (2011). The unit supports 8-bit, 16-bit and 32-bit width operands. In the case of 8-bit and 16-bit operations, the unit supports a vectored execution mode. Each operation takes 3 clock cycles to execute. These operations are extensively used in the context of Elliptic Curve Cryptography.

Field Squarer: This too is an operation defined over Galois fields. It performs squaring of a number in a Galois field. The result is in unreduced form, and the operation takes a single clock cycle to complete.

Floating Point Add and Subtract: This unit supports floating-point addition and subtraction on 64-bit floating-point numbers. The unit takes 9 cycles to compute the result and is unpipelined.

Floating Point Multiplication, Floating Point Division, Square root, Sine and Cosine: These operations are all supported in hard- ware. The algorithm implemented is a variant of the algorithm reported by Goto and Wong (1995). The details of this can be found in Choudhary (2011). The unit is unpipelined and supports only 64-bit floating-point numbers.

Each of the load/store units is connected to a 128KB data memory bank. The maximum permitted address space per bank is 2^29 bytes. The bank size was chosen to be 128KB for purposes of simulation. An equal amount of memory per bank is allocated to the instruction memory. These instruction memory banks are connected to the instruction and data transfer unit. A context memory (within the macro-dataflow orchestration subsystem) of size 4KB is used. This can accommodate 64 contexts, each with 16 data operands of 4 bytes each. As mentioned in chapter 5, there is a lookup table which is used

2 We did not use the filesystem to simulate the memory, as it was not possible to implement the file seek function in the version of Bluespec which we used. So we simulated it using a large register file, which is held in memory.

to translate the HyperOp's static identifier and instance number into a location in the context memory. The lookup table in the update unit is 80Kbits in size. This is sufficient to implement a 4-way lookup table with a 10-bit HyperOp identifier. The loop-invariant store is 64KB. The CGRA was implemented in Bluespec System Verilog. Designs written in Bluespec can be translated automatically into C++ or Verilog. We employ the C++ flow for all our simulations. The synthesis is performed using the Verilog generated from the Bluespec specification.
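The stated 80Kbit size of this lookup table is consistent with the following breakdown; the split of each 20-bit entry into tag, context index and status bits is our assumption, not a documented layout.

\[
2^{10}\ \text{sets} \times 4\ \text{ways} \times 20\ \text{bits per entry} = 81920\ \text{bits} = 80\,\text{Kbit},
\]

where a 20-bit entry can hold a 13-bit instance-number tag (5 function-instance bits and 8 loop-instance bits), a 6-bit context index (64 contexts) and a valid bit.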

6.2 Choice of Applications

In order to evaluate performance, we choose a set of applications which includes a mix of cryptography applications, integer kernels and floating-point kernels. Cryptographic protocols are used for a wide range of tasks, but the most important of these are message encryption/decryption (to provide confidentiality) and message authentication (to ensure authenticity of the message and non-repudiation). The settings in which these algorithms work can be broadly classified into two types: the symmetric key setting and the public key setting. In symmetric key cryptography, the communicating parties share a common secret, the key. This key is used in encryption/decryption and in the computation of message authentication codes. In public key cryptography, each user has two keys: a private key which is kept secret and a public key which is shared with all other users in the network. The public key and private key are chosen such that the private key may not be derived from the public key. For secure communication, the sender uses the public key of the receiver to encrypt the message, while the private key is used to decrypt the message. For achieving message authentication (and non-repudiation), the sender uses his/her own private key to sign the message and the receiver can use the public key to determine the veracity of the message. For the purposes of performance evaluation, we have chosen the Advanced Encryption Standard (AES) symmetric key algorithm and the Secure Hash Algorithm (SHA)-1 symmetric key message authentication algorithm. We also evaluate the performance of Elliptic Curve Point Addition (ECPA) and Elliptic Curve Point Doubling (ECPD).

3 The maximum number of HyperOps we have encountered in the course of our experiments, even on real-life programs, is 370. The 10 bits allocated for the HyperOp identifier should be sufficient in most cases.
4 Worst-case design. This could have been made smaller by using a portion of the HyperOp identifier in the tag portion.

AES: AES is used extensively in many communication protocols. The algorithm operates at the granularity of a byte. In the first step, a byte-level substitution is performed. This operation can be implemented as a simple lookup table; alternatively, the value to be substituted can be computed as an inversion over the Galois field GF(2^8) followed by an affine transform. The subsequent step, referred to as shift rows, rotates each row (consisting of four bytes) of the resulting matrix left by 0, 1, 2 and 3 bytes respectively. This is followed by a step called mix columns, in which each column is multiplied with a constant, where multiplication is defined over a Galois field. Finally, a step named add round key is performed, in which the round sub-key is combined with the 4 × 4 data block. These four steps are repeated nine times for a key size of 128 bits. The details are available in Daemen and Rijmen (2002). In our experiments we run the AES-128 encryption and decryption algorithms.

SHA-1: SHA-1 is used to compute a cryptographic hash function for message authentication. The algorithm is used extensively in several protocols. In SHA-1, the input message is converted into a 160-bit message digest (National Institute of Standards and Technology, 2002).

ECPA and ECPD: ECPA and ECPD are fundamental building blocks for Elliptic Curve Cryptography (ECC) algorithms, viz. the Elliptic Curve Digital Signature Algorithm (ECDSA) and Elliptic Curve Diffie–Hellman (ECDH). ECDH is a key-agreement protocol used for establishing a shared key between two parties that intend to employ a symmetric encryption/decryption algorithm during communication. The key agreement protocol itself uses a public key-private key mechanism to ensure secure key agreement. All ECC algorithms are public key-private key based algorithms. ECDSA is an example of a public key-private key based message authentication and non-repudiation technique. One of the conditions essential for the viability of public key-private key cryptographic algorithms is the inability to derive the private key given the public key. This is made possible in ECC due to the computational infeasibility of the Elliptic Curve Discrete Logarithm Problem: given a point P on the curve and another point Q on the same curve such that Q = nP, where n is a scalar, the problem of finding n given P and Q is computationally infeasible for sufficiently large values of n. This property is used in generating public key-private key pairs. Future encryption standards are expected to make extensive use of ECC algorithms. The result of the scalar multiplication, nP, is computed by the repeated application

of ECPA and ECPD. Thus, the computational efficiency of the ECPA and ECPD kernels is of paramount importance. ECPA determines the point of intersection of the line connecting the two points under consideration with the elliptic curve, whose equation is known. Elliptic curve point doubling finds the point of intersection of the tangent at the point under consideration with the elliptic curve. Both these kernels involve field multiplication and doubling over large binary fields, viz. GF(2^163) or larger. The field multiplication over large fields is performed using the Karatsuba-Ofman algorithm (Karatsuba and Ofman, 1963). More details of these algorithms and techniques can be found in Menezes (1994).
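For concreteness, the classical left-to-right double-and-add loop below shows how repeated ECPD and ECPA compute nP. The point type and the two kernel functions are stand-ins for the fabric kernels (stubbed here so the sketch compiles), and identity handling is assumed to live inside the addition kernel.

#include <stdint.h>

struct ec_point { uint32_t x, y; int infinity; };   /* simplified representation */

/* Stubs standing in for the ECPA and ECPD kernels described above. */
static struct ec_point ecpa(struct ec_point p, struct ec_point q) { (void)q; return p; }
static struct ec_point ecpd(struct ec_point p)                    { return p; }

/* Left-to-right double-and-add: Q = nP from repeated point doubling/addition. */
static struct ec_point scalar_mul(uint32_t n, struct ec_point p)
{
    struct ec_point q = { 0, 0, 1 };        /* start from the point at infinity */
    for (int bit = 31; bit >= 0; bit--) {
        q = ecpd(q);                        /* always double                    */
        if ((n >> bit) & 1u)
            q = ecpa(q, p);                 /* add when the scalar bit is set   */
    }
    return q;
}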

Apart from these cryptographic kernels, we also run other integer kernels, which include Cyclic Redundancy Check (CRC), Inverse Discrete Cosine Transform (IDCT) and Sobel edge detection. Running these kernels on the cryptography fabric shows the versatility of the platform in executing non-cryptographic kernels. We specifically pick these kernels based on the ratio of computation to load/store operations. CRC is a small kernel which is load/store centric with very little computation; CRC is run on a 256-byte data block in our experiments. Sobel edge detection has a large number of loads/stores and an equal number of compute instructions. The edge detection algorithm is run for an image of size 382 × 204. IDCT is a balanced kernel with both memory accesses and a large number of computations (unlike the cryptographic kernels, which have far more computation than memory access); it is computed for a block of size 8 × 8. The list of floating-point kernels includes matrix kernels such as matrix multiplication, LU factorization and QR factorization based on Givens rotation. We also simulate general floating-point kernels, namely the Fast Fourier Transform (FFT) and MRI-Q. The matrix multiply kernel has a simple nested loop structure with more load instructions than compute instructions. Matrix multiplication operates over the entire matrix, unlike LU factorization and QR factorization, which operate over the upper or lower triangular matrices only. The iterations of the innermost loops of LU factorization and QR factorization are independent of each other and can potentially execute in parallel. QR factorization includes an if-else statement, unlike the strictly predictable loops in LU factorization and matrix multiplication. All matrix kernels are executed on a matrix of size 30 × 30. FFT has a completely different loop structure when compared to the matrix kernels. In FFT, the array access patterns follow different arithmetic progressions in different iterations. The common difference of these arithmetic progressions follows a geometric progression.

The twiddle factors for FFT are computed instead of being loaded from data memory (Prashank et al., 2010). In all our experiments, we run a 2048-point FFT. MRI-Q computes the Q matrix needed during 3-D reconstruction of an MRI scan. The Q matrix is a convolution kernel and is dependent on the scan trajectory of the MRI and the size of the scan image produced by it. The details of the computation are presented by Stone et al. (2008). The computation involves the use of trigonometric functions in addition to loads, stores and floating-point operations. This algorithm is implemented for an image size of 3 × 10. All of the floating-point kernels used here have a large percentage of load and store instructions.

6.3 Compilation Overview

An application written in C is transformed into an executable for the CGRA by the compiler. In the first step, the application is compiled with the Low Level Virtual Machine (LLVM) framework (Lattner and Adve, 2004), which transforms the code into an intermediate representation in Static Single Assignment (SSA) form. The intermediate representation is in terms of the virtual ISA defined by LLVM. From this intermediate representation, we derive the CFG and Dataflow Graph (DFG) for each function in the application. The compiler creates HyperOps by merging several basic blocks, until either the instruction count limit or the input limit is exceeded. While selecting basic blocks for inclusion within a HyperOp, the compiler traverses the CFG in a depth-biased topological order. This order of processing ensures that HyperOps satisfy the convexity condition and that all instructions within a HyperOp are governed by the same predicate. Having several instructions with complementary predicates leads to squashing of instructions after they have been loaded on the fabric, which leads to inefficient execution. In the case that a single basic block has more instructions than what is permitted within a HyperOp, the basic block is split and mapped into different HyperOps. The DFG for each HyperOp is constructed from the function's optimized DFG. The function's DFG includes memory precedence edges, as described in chapter 3. The HyperOp's DFG is appropriately modified such that each instruction has no more than 3 destinations. In cases where more than 3 destinations are present, appropriate MOV instructions are inserted to propagate the result of an instruction. The HyperOp's DFG is partitioned such that each partition of the HyperOp can be accommodated in a single computation unit.
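A schematic (and deliberately simplified) view of the basic-block merging step is sketched below. The limit values, the pre-ordered block array and the omitted convexity, predicate and block-splitting checks are all assumptions made for illustration; the real pass operates on the LLVM intermediate representation.

#include <stddef.h>

#define MAX_HYPEROP_INSTRS 64   /* illustrative limits, not the real ones      */
#define MAX_HYPEROP_INPUTS 16

struct basic_block {
    int n_instrs;       /* instructions in this basic block                    */
    int n_new_inputs;   /* operands it would add to the HyperOp's inputs       */
    int hyperop;        /* output: HyperOp this block is assigned to           */
};

/* Greedily grow HyperOps over blocks already arranged in depth-biased
 * topological order; convexity and same-predicate checks are omitted, and a
 * block larger than the limit would be split by the real compiler.            */
static void form_hyperops(struct basic_block *bb, size_t n)
{
    int cur = 0, instrs = 0, inputs = 0;
    for (size_t i = 0; i < n; i++) {
        if (instrs + bb[i].n_instrs > MAX_HYPEROP_INSTRS ||
            inputs + bb[i].n_new_inputs > MAX_HYPEROP_INPUTS) {
            cur++;                       /* close current HyperOp, start a new one */
            instrs = 0;
            inputs = 0;
        }
        bb[i].hyperop = cur;
        instrs += bb[i].n_instrs;
        inputs += bb[i].n_new_inputs;
    }
}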

5 Maximum allowed destinations per instruction.

These partitions need not satisfy the convexity condition. Partitioning algorithms with different objective functions may be employed, such as reducing communication between the partitions (Krishnamoorthy et al., 2011a), or placing parallel instructions in different partitions such that ILP exploitation is maximized. In the current implementation, we determine individual instruction "threads" (i.e. sequences of dependent instructions with no parallelism) and merge two such threads to generate a partition. The partitions thus created need to be mapped to different computation units on the fabric such that the communication distance between them is optimized to reduce the overall execution time. Several possible optimizations may be employed and one such scheme is reported in Krishnamoorthy et al. (2011b). We currently employ a very naive scheme, where highly interacting partitions are placed close together along the same column. Placing instructions along the same column maximizes the number of instruction transfers that can be effected in parallel. The mapping algorithm only specifies the relative placement of the various partitions of a HyperOp with respect to each other. This virtual mapping ensures that a HyperOp can be placed at any location within the reconfigurable fabric (refer 4.9). Once the partitioning and mapping phases are completed, the HyperOps are transformed into an executable for the CGRA. This work is not a part of this thesis and is presented here only for the sake of completeness. More details can be found in the following theses: Alle (2012) and Krishnamoorthy (2010).

6.4 Execution Overview

The HyperOps of an application compiled for this architecture are scheduled one at a time, as and when they become ready for execution. A HyperOp is ready for execution when all its input operands are available. The first HyperOp of an application (pertaining to the main() function in the C code) is always ready for execution. The trigger for execution is supplied by the host processor along with any command line inputs which the program may consume. A ready HyperOp is selected by the orchestration unit in the macro-dataflow orchestration subsystem (refer figure 6.1). The identifier of the selected HyperOp is forwarded to the resource allocator, which determines the location on the reconfigurable fabric where this HyperOp can be launched. The resource allocator makes use of the binary computation-unit requirement matrix (output of the mapping phase) and the binary computation-unit availability matrix (which contains information on which computation units are unallocated on the reconfigurable fabric).

[Figure 6.2: Process of resource allocation, showing the computation unit requirement matrix, the computation unit availability matrix, and the resulting computation unit allocation (busy compute units versus compute units chosen for allocation).]

The resource allocator determines the (x, y) coordinates where the HyperOp can be placed within the fabric. Figure 6.2 shows the process by which resource allocation is performed. This information is presented to the Instruction and Data Transfer Unit along with the input operands for the HyperOp. The Instruction and Data Transfer unit retrieves the instructions from the instruction memory banks, and the instructions, constants and input operands are streamed into the computation units identified by the resource allocator. Instruction execution commences as soon as inputs are available for any instruction. The transfer of instruction and data operands proceeds in parallel with instruction execution. During the course of execution, instructions produce data operands which may be consumed by other instructions within the same computation unit, by instructions belonging to the same HyperOp but residing in other computation units, or which may be meant for other HyperOps. Data operands meant for other HyperOps are forwarded to the Update Logic in the macro-dataflow orchestration subsystem, which is connected to the periphery of the fabric (refer figure 6.1). The data operands sent to the update logic are appended with appropriate metadata which helps the update logic determine the instance number of the destination HyperOp. The update logic consults the context memory lookup table to determine the location within the context memory where other operands of the destination HyperOp instance are available. The data is forwarded to the context memory along with the location within the context memory where it needs to be placed.

[Figure 6.3: The various steps in HyperOp execution and the time spent in various activities: Total HyperOp Launch Time (THLT), Perceived HyperOp Launch Time (PHLT), Fabric Execution Time (FET) and Inter-HyperOp Launch Time (IHLT).]

The orchestration unit constantly monitors the context memory for HyperOps which may become ready for execution. Once a HyperOp is ready for execution, the process repeats. Figure 6.3 shows the various steps in the execution of a HyperOp.

6.4.1 Understanding Time Spent During Execution

The overall execution time is divided into several components, each of which is explained below.

Total HyperOp Launch Time (THLT): The time taken to transfer the instructions, data and constants for a single HyperOp is referred to as Total HyperOp Launch Time (THLT).

Perceived HyperOp Launch Time (PHLT): The time between the start of the instruction transfer and the execution of the first instruction on the fabric is referred to as the Perceived HyperOp Launch Time (PHLT).

Fabric Execution Time (FET): The time between the execution of the first instruction and the execution time of the last instruction of that HyperOp is referred to as the HyperOp's Fabric Execution Time (FET). There is usually an overlap between the time it takes to launch the HyperOp and the time taken to execute instructions, which is shown in figure 6.3.

Inter-HyperOp Launch Time (IHLT): The time between the execution of the last instruction on the fabric and the selection of resources on the fabric for the next HyperOp is referred to as the Inter-HyperOp Launch Time (IHLT). Figure 6.3 shows a scenario where the next HyperOp to be executed is determined after the completion of the previous HyperOp's execution. This may not be the case between all HyperOps. The next HyperOp to be executed may be determined during the execution of the current HyperOp. In such a case the IHLT is negative, indicating overlap between HyperOp selection and the execution of the previous HyperOp. The definitions of FET, PHLT and IHLT are provided in the context of individual HyperOps. These per-HyperOp values are accumulated across all HyperOps in an application; in the rest of this thesis, any reference to FET, PHLT and IHLT denotes these accumulated values. IHLT, FET and PHLT are the three components which constitute the overall execution time. The relationship between these terms and the overall execution time can be defined as follows:

\[
\mathit{TotalTime} = \mathit{PHLT} + \mathit{FET}\cdot c + \mathit{IHLT}\cdot(-1)^{x}\cdot c \tag{6.1}
\]

where $c = 1/p$, $p$ being the average parallelism across HyperOps, and

\[
x = \begin{cases} 0 & \text{if } p = 1,\\ 1 & \text{if } p > 1. \end{cases} \tag{6.2}
\]

As is evident from the equation above, the IHLT contribution is negative if the number of HyperOps running in parallel is greater than one. The FET component, expressed as a fraction of the overall execution time, can exceed one if the parallelism is greater than one. This is because the FET is computed as the sum total of the time spent in fabric execution across all HyperOps.
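As a purely illustrative example with made-up numbers: suppose the accumulated PHLT is 2000 cycles, the accumulated FET is 6000 cycles, the accumulated IHLT is 1000 cycles and the average parallelism is p = 2, so that c = 1/2 and x = 1. Then

\[
\mathit{TotalTime} = 2000 + 6000 \times \tfrac{1}{2} - 1000 \times \tfrac{1}{2} = 4500\ \text{cycles},
\]

i.e. the overlap halves the contribution of the accumulated FET and turns the IHLT contribution negative.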

6.5 Results

In this section, we present the results of the performance evaluation obtained by running the aforementioned applications/kernels on our CGRA. The execution time in cycles for all applications is shown in table 6.2. In order to gain an understanding of the performance of the various subsystems of the CGRA, we need to subdivide the total execution time into the time spent in various activities during execution. In figure 6.4, we show the FET, PHLT and IHLT for all applications as a portion of the overall execution time.

Table 6.2: Execution time in cycles for the cryptography applications, integer applications and floating-point applications.

Cryptography and Integer Applications
  CRC                        33411
  AES-Decryption              6800
  AES-Encryption              7965
  ECPA                       11243
  ECPD                        6137
  SHA-1                      42523
  IDCT                        9761
  Sobel Edge Detection    20978652

Floating Point Applications
  Matrix Multiplication    4021993
  LU Factorization         1104892
  QR Factorization-Givens  2519652
  FFT                      3550832
  MRI-Q                      12191

The time spent in each of these activities across all HyperOps is aggregated and presented as a fraction of the overall execution time in figure 6.4. For efficient execution, the PHLT and IHLT must be made as small as possible. The absolute value of the FET too must be minimized to reduce the overall execution time. In figure 6.4, the IHLT for AES-encrypt, LU factorization and QR factorization is negative. This implies that the subsequent HyperOp is decided while the execution of another HyperOp is in progress. In the remaining cases the time spent in IHLT is 16-33%. In the case of LU factorization and QR factorization, FET expressed as a fraction of the overall execution time is greater than 1, while the IHLT component of the overall execution time is much less than 0. FET is computed as a summation of the time taken to execute each of these HyperOps. When two or more HyperOps execute in parallel, the FET is expected to be much larger than the time elapsed in executing these HyperOps. In fact, up to 3 iterations of the innermost loop run in parallel. The IHLT component is negative since it overlaps with the execution of another HyperOp. PHLT constitutes 14-35% of the overall execution time (and always remains positive). For efficient execution, PHLT and IHLT need to be reduced as much as possible. In order to reduce PHLT, we either need to reduce THLT or increase the overlap between the launching of a HyperOp and the execution of instructions on the fabric.

[Figure 6.4: Portion of the overall execution time spent in various tasks (Fabric Execution Time component, HyperOp Launch Time component and Inter-HyperOp Launch Time component) for each application.]

Since it is not possible to reduce THLT, we explore ways to reduce PHLT by increasing the overlap between instruction execution and instruction launch. As mentioned previously, if the next HyperOp to be launched is selected while execution of a HyperOp is in progress, the IHLT is negative, which implies that it does not add to the overall execution time. In applications where IHLT is positive, we need to reduce the time spent in processing each operand in order to reduce IHLT and hence the overall execution time. We explore techniques to reduce PHLT and IHLT in chapter 7. To evaluate the effectiveness of the reconfigurable fabric, we measured the CPI for all applications. The results are presented in table 6.3. The table shows that all of the applications (irrespective of the type of the application) have a CPI greater than one. This implies that a sufficient number of instructions is not being executed every clock cycle. Our investigations indicate that the need to deliver triggers from the load-store unit delays the issue of successive loads. In most cases, the loads are present at the beginning of the instruction stream (especially in loops), thus delaying

Table 6.3: CPI recorded for various applications while executing on the reconfigurable fabric.

Application              CPI
CRC                      3.35
AES-Decrypt              2.42
AES-Encrypt              1.43
ECPA                     1.62
ECPD                     1.68
SHA-1                    1.67
Sobel                    2.08
IDCT                     1.47
Matrix Multiplication    2.53
LU factorization         6.09
QR factorization         5.24
FFT                      3.26
MRI-Q                    3.97

all dependent instructions; this elongates the FET of the HyperOp. Another important factor is that the minimum time between the launch of dependent instructions is 4 cycles, due to the 4 cycle deep pipeline. This implies that, to ensure that at least one instruction is executed every clock cycle, an ILP of 4 is needed at each computation unit. Communication over the network incurs an additional two cycles over and above the pipeline delay. This causes further degradation in performance. These shortcomings are addressed in the subsequent chapter.
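The relationship between pipeline depth, network latency and the ILP required per computation unit can be made concrete with a small back-of-the-envelope sketch. The function below is illustrative only (it is not part of our toolchain); the cycle counts are the ones stated above.

    # Illustrative sketch: independent instructions needed per computation unit so that
    # one instruction can complete every cycle despite the dependent-issue latency.
    def ilp_needed(pipeline_depth_cycles: int, noc_extra_cycles: int = 0) -> int:
        return pipeline_depth_cycles + noc_extra_cycles

    # Dependent instructions issue 4 cycles apart because of the 4-deep pipeline.
    print(ilp_needed(4))      # 4: an ILP of 4 is needed when results stay local
    # Results delivered over the NoC incur 2 additional cycles.
    print(ilp_needed(4, 2))   # 6: even more ILP is needed when data crosses the network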

6.6 Conclusion

In this chapter we presented the results of executing cryptographic, integer and floating point applications on our CGRA. The time spent during execution can be divided into three components, namely: the time for launching a HyperOp, the time for executing the instructions of this HyperOp on the reconfigurable fabric, and the time needed to select the next HyperOp for execution. We observed that all three of these time components need to be reduced in order to make the execution efficient. These are addressed through microarchitectural optimizations, which are described in the next chapter.

When you first start off trying to solve a problem, the first solutions you come up with are very complex, and most people stop there. But if you keep going, and live with the problem and peel more layers of the onion off, you can often times arrive at some very elegant and simple solutions.
- Steve Jobs in a 2006 interview with Newsweek and MSNBC

Chapter 7

Microarchitectural Optimizations

Why didn’t you try ETS? - Prof. Derek Chiou

In the previous chapter, we observed that the execution time can be divided into Fabric Execution Time (FET), Inter-HyperOp Launch Time (IHLT) and Perceived HyperOp Launch Time (PHLT). We further observed from the results that each of these components of the execution time needs to be reduced in order to achieve efficient execution. In this chapter, we explore techniques to reduce FET, followed by a microarchitectural change which reduces the IHLT and optimizations to reduce the PHLT.

7.1 Reducing FET

7.1.1 Reducing Temporal Distance between Memory Operations

In chapter 6, we noted that the poor performance results are on account of the long delay between the issue of two loads which may alias. Trigger edges are needed between stores and loads, and between stores and stores. However, several trigger edges may be essential from a single store, and a synchronization tree is needed to ensure that all the loads prior to that have completed before the load is issued. In such cases, it may be more efficient (i.e. it involves fewer nodes in the dataflow graph) to directly chain the loads related to the same store instead of using the aforementioned structure. The compiler makes an appropriate

[Figure 7.1 diagram: a load and a store placed in different computation units; legend: Load-Store Unit, Computation Unit + Router, Load-Store Request edge, Peripheral Router, Trigger Edge.]

Figure 7.1: The paths taken by the load request and by the trigger edge to the subsequent memory operation are shown.

choice based on the efficiency of either scheme. When loads are linked together, trigger edges traverse the entire distance from the first load's computation unit to the load-store unit and then from the load-store unit to the second load's computation unit. This round trip delay can be quite long and increases the time taken to issue loads at the beginning of an iteration. This round trip is shown in figure 7.1. In order to reduce this overhead, we place all loads and stores from the same alias set in the same computation unit. The trigger edges are now delivered from one instruction in the computation unit to another instruction within the same computation unit (as shown in figure 7.2). In order to ensure correctness, we need to ensure that load/store request packets sent on the NoC, originating from the same source, are received at the destination in the same order. This requirement is easy to meet since the NoC always chooses the same path between a given source and destination. This requirement translates into the router giving preference to the oldest packet from the same input link while selecting a packet for forwarding to the next hop. The structure of the router described in chapter 4 implicitly meets this requirement due to the use of FIFOs at the output. The scheme described above does not address the case when the number of loads and stores in the same alias set exceeds the number of instructions that can be accommodated

[Figure 7.2 diagram: the load and store placed within the same computation unit; legend as in figure 7.1.]

Figure 7.2: The paths taken by the load request and by the local trigger edge to the subsequent memory operation are shown.

within the same computation unit. The technique of sending a trigger from the source to the destination instruction directly does not work if the instructions are in different computation units, since the ordering of packets originating from two different sources to the same destination cannot be maintained. Therefore, in order to ensure correctness, the trigger needs to be routed through the load-store unit, as per the original scheme.
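A compiler-side sketch of this placement decision is given below. It is illustrative only and assumes hypothetical data structures (each memory operation is tagged with its alias-set identifier, and a computation unit holds at most cu_capacity instructions); it is not the partitioning pass used in our compiler. Alias sets that fit within one computation unit use local triggers; larger sets fall back to triggers routed through the load-store unit.

    from collections import defaultdict

    def place_memory_ops(mem_ops, cu_capacity):
        """mem_ops: list of (op_id, alias_set_id) pairs.
        Returns a placement of co-located alias sets and the sets that fall back."""
        by_set = defaultdict(list)
        for op_id, alias_set in mem_ops:
            by_set[alias_set].append(op_id)

        placement = {}        # op_id -> computation unit index
        fallback_sets = []    # alias sets whose triggers go via the load-store unit
        next_cu = 0
        for alias_set, ops in by_set.items():
            if len(ops) <= cu_capacity:
                for op in ops:
                    placement[op] = next_cu   # triggers stay local to this unit
                next_cu += 1
            else:
                fallback_sets.append(alias_set)
        return placement, fallback_sets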

7.1.2 Eliminating the Priority Encoder

Another important observation from the experiments presented in section 6.5 is that, in most cases, just one instruction is ready for execution at any point in time. This implies that the priority encoder present within the computation unit, which is used to select ready instructions for execution, has very sparse usage. This is on account of the behavior of the partitioning algorithm, which divides the instructions within a HyperOp among the various computation units. The partitioning algorithm assigns instructions which have long-range ILP (i.e. the common successor of these parallel instructions is many As Soon As Possible (ASAP) levels away from them) to different computation units. The instructions which have short-range ILP (i.e. the interaction

between the common successor of these parallel instructions is very few ASAP levels away from these instructions) are assigned to the same computation unit. The compiler can schedule these instructions statically, even in the presence of the variable delays experienced in load/store accesses and the variable delays on account of the NoC. In the presence of large ILP with long-range interactions, the compiler cannot easily determine the right order among parallel instructions, since the data for any of these could arrive before or after the others (given the variable latencies of the reconfigurable fabric). When only short-range ILP is present, the number of choices for the compiler is limited and the difference in performance due to a change in the order of instructions is expected to be very small. For this reason, we eliminate the priority encoder and the compiler performs an efficient instruction ordering which takes into account the critical path in order to minimize the overall execution time. The elimination of the priority encoder reduces the overall area of the computation unit. It also opens up newer design points, viz. a higher number of instructions per computation unit. Previously, the number of instructions within a computation unit was not scalable due to the presence of the priority encoder, whose complexity increases quadratically with the number of instructions. The removal of the priority encoder further simplifies the load and store instruction sequentialization. The loads and stores are arranged in program order and they are issued one after another. There is no need for explicit triggers between two instructions within the same computation unit. When instructions are present in two different computation units, the trigger must arrive from the load-store unit to prevent any possible reversal in arrival order at the load-store unit. With the introduction of this sequential execution (which is scheduled by the compiler), we have effectively combined *-T (Papadopoulos et al., 1993) like execution semantics on the reconfigurable fabric with a macro-dataflow execution.
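A minimal sketch of such a compiler-side ordering is shown below. It is illustrative, not the ordering pass of our compiler: instructions mapped to a computation unit are emitted in a topological order of their local dataflow graph, breaking ties in favour of the instruction with the longer critical path (longest path to any sink). The graph representation is a hypothetical one.

    def order_instructions(succs):
        """succs: dict mapping an instruction to the list of its successors."""
        # Longest path to a sink, used as the scheduling priority.
        height = {}
        def h(n):
            if n not in height:
                height[n] = 1 + max((h(s) for s in succs.get(n, [])), default=0)
            return height[n]

        # Ensure every node has an entry and compute in-degrees.
        indeg = {n: 0 for n in succs}
        for n in list(succs):
            for s in succs[n]:
                indeg[s] = indeg.get(s, 0) + 1
        for n in indeg:
            succs.setdefault(n, [])

        ready = sorted((n for n, d in indeg.items() if d == 0), key=h, reverse=True)
        order = []
        while ready:
            n = ready.pop(0)              # most critical ready instruction first
            order.append(n)
            for s in succs[n]:
                indeg[s] -= 1
                if indeg[s] == 0:
                    ready.append(s)
            ready.sort(key=h, reverse=True)
        return order

    # Example: loads in program order followed by their dependent arithmetic.
    print(order_instructions({"ld1": ["add"], "ld2": ["add"], "add": []}))
    # ['ld1', 'ld2', 'add']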

7.1.3 Reducing Temporal Distance between Dependent Instructions

As mentioned in chapter 6, since the pipeline within the computation unit is four cycles deep, the execution of two dependent instructions is separated by four cycles. This can be reduced by appropriately creating a bypass network. In order to reduce the temporal distance, the data from the ALU stage is bypassed to the instruction selection stage. When instructions have three destinations, this causes pipeline

stalls, since only one packet may be sent out on each interface from the writeback stage. When more than one destination is within the same computation unit, or more than one destination is in different computation units, the delivery of the results to these destinations gets serialized. This serialization causes the subsequent instruction to stall, as the writeback stage is held for an additional cycle. For certain combinations of destinations (i.e. all destinations within the same computation unit or all destinations in different computation units) it is possible to incur a two cycle pipeline stall. In order to address this, we allow one of the destinations of an instruction to be processed within the ALU stage. This makes it possible to process all destinations with at most one stall cycle. The revised block diagram of the pipeline stages within the computation unit is shown in figure 7.3. Since one of the destinations is processed in the ALU stage, the dependent instruction can be selected for execution in the following cycle and executed in the cycle following that. This helps us schedule the dependent instruction within two cycles of the source instruction.

[Figure 7.3 block diagram: the computation unit pipeline with Instruction Selection, Operand Register File, ALU/Compute with Packet Creation, and Writeback stages, with router interfaces to the West, East and South.]

Figure 7.3: The modified compute element structure reduces the minimum temporal distance between two dependent instructions to two cycles.
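The stall accounting behind this change can be illustrated with a small model. It is a simplification of the behaviour described above (it ignores which interface each destination uses) and the function itself is purely hypothetical.

    def writeback_stall_cycles(num_destinations: int, alu_handles_one: bool) -> int:
        """Extra cycles the writeback stage is held beyond its own single cycle."""
        remaining = num_destinations - (1 if alu_handles_one else 0)
        return max(0, remaining - 1)  # the first remaining packet uses the normal cycle

    print(writeback_stall_cycles(3, alu_handles_one=False))  # 2: worst-case stall
    print(writeback_stall_cycles(3, alu_handles_one=True))   # 1: at most one stall cycle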

7.2 Reducing IHLT

IHLT is the time between the completion of a HyperOp's execution on the fabric and the selection of the next HyperOp for execution. A bulk of this time is spent in processing the operands to be copied into the context memory by the update unit. The update unit described in chapter 5 allocates and deallocates context areas for the HyperOps from the context memory. This scheme is similar to the Tagged Token Dataflow Architecture (TTDA) proposed by Arvind and Nikhil (1990). It computes the HyperOp's instance number and uses it along with the HyperOp's static identifier to determine the location within the context memory. We also observed that the storage requirements of the instance lookup table and the loop-invariant data table far exceed the amount of storage needed for the context memory. Our rationale in implementing a hardware managed context memory was that the number of active HyperOp instances was unknown at compile time. Such a generalized assumption led to an over-investment in hardware. A compiler based context area allocation scheme would suffice. Such a scheme, as we will see over the course of the description, leads to a lower area design. The hardware managed context area allocation scheme can exploit parallelism even when the compiler cannot detect it. However, in the case of an imperative language the amount of parallelism at the level of HyperOps is detected at compile time and needs to be explicitly enabled by the compiler. The interactions (i.e. exchanges of data) between HyperOps are known at compile time. These interactions can be represented by a directed graph termed the HyperOp Interaction Graph, whose vertices represent the HyperOps and whose edges indicate transfers of data between them. If a HyperOp generates data for another HyperOp, it follows from the condition of strict execution that the source HyperOp is either present on the reconfigurable fabric or has completed execution, and that the destination HyperOp exists only within the context memory (since at least one input operand for it has not yet arrived). This implies that these interacting HyperOps can use the same context area within the context memory. For HyperOps on complementary paths (i.e. whose execution is governed by complementary predicates) the same context memory location can also be assigned. However, no such conclusion can be drawn about other HyperOps which do not interact with each other (other than under the two conditions stated above), and these are therefore said to conflict with each other. From the HyperOp Interaction Graph, it is therefore possible to assign context areas to each HyperOp such that no two HyperOps

[Figure 7.4 diagram: a CFG containing basic blocks A, B, C, D, E, F, G, H, I and J, with ovals overlaid to indicate which basic blocks belong to the same HyperOp.]

Figure 7.4: The figure shows a CFG with the HyperOps overlaid on it, indicating which basic blocks are a part of the same HyperOp. The boxes represent the basic blocks and the ovals the HyperOps.

which have been allocated the same context location conflict. This can be solved by transforming it into a graph-coloring problem. Since the compiler statically allocates context areas for each HyperOp, it is no longer necessary to maintain an instance lookup table in hardware. In order to reduce the overhead of the loop-invariant data store, we store the loop-invariant data in the context area of the HyperOp and reuse it repeatedly. In order to prevent overwriting of the loop-invariant data, the context area of a HyperOp that contains loop-invariant data is said to conflict with all other HyperOps within the loop. This allows us to completely eliminate the loop-invariant store. The purging of loop-invariant data is done by a special request at the end of the loop. The aforesaid conditions for conflict are not sufficient to ensure correct execution. To ensure correct execution there is a need for certain synchronization points, i.e. the execution needs to wait for the occurrence of certain events. This is illustrated with an example. Figure 7.4 shows a CFG, where each basic block is represented by a box and the HyperOps to which these basic blocks belong are represented by ovals. During the course of execution, the HyperOp containing basic block A produces data for the HyperOp containing H. Let us assume that the conditional statement within the HyperOp containing A evaluates to a value such that execution takes the left hand side path, i.e. the A to B path. Thus the HyperOp containing H needs to be terminated. The HyperOp containing A sends a termination request to H and proceeds with the execution. Let us assume that this request to terminate H takes a long time to reach the orchestration subsystem (due to

possible network congestion). By the time the packet containing the termination request reaches the orchestration subsystem, the HyperOp containing A restarts and produces another data item for H. This can lead to incorrect behavior, as the HyperOp containing H now contains more data than expected by the termination request. Conversely, if the terminate request reached the HyperOp containing H prior to the arrival of the first data item produced, the termination is not carried out as the expected number of inputs has not arrived. Let us further assume that the second data item arrives just before the first, in which case the termination request will consume the second data item and terminate the HyperOp instance. A new HyperOp instance is then created with the first data item. This too can lead to incorrect execution. This occurs because the first instance and the second instance of the HyperOp containing H conflict with each other and execution of the producer for the second instance started prior to the termination of the first instance. This problem can be avoided if the loop is not restarted until the terminate request is completed by the orchestration subsystem. In order to address this problem in a generalized manner, we permit each HyperOp to wait for the occurrence of an event. An event includes the scheduling of a HyperOp instance, the termination of a HyperOp instance or the purging of a HyperOp's loop invariant data. Upon occurrence of any of these events, a mechanism is needed to inform another HyperOp within the context memory. In the above example the HyperOp containing J waits for the termination of H, i.e. the scheduling of HyperOp J cannot proceed until an explicit event indicating the termination of H reaches J. The events which require notification and the HyperOps which need to wait for such notifications are computed by the compiler. We refer to these HyperOps which need to be notified as synchronization HyperOps. Any HyperOp can be designated a synchronization HyperOp. In the case of function calls, as mentioned in chapter 3, runtime allocation of context memory is essential. In order to address this, the compiler performs static allocation for a function. The context area locations allotted within a function are treated as offsets with respect to the first HyperOp of the function. At the time of function invocation, the orchestration subsystem reserves a fixed number of contiguous context memory locations as requested by the function call. The starting address of this contiguous region in the context memory is assigned to the first HyperOp of the function. The destination context memory locations specified in the MOVToHyperOp instructions are encoded as offsets with respect to the context memory location

of the source HyperOp. During function invocation, if the appropriate number of context memory locations cannot be reserved, then the application is terminated (similar to the case of a stack overflow in von-Neumann architectures). In the compiler allocated context memory scheme just described, the operation of the update unit is highly simplified. The ALU computes the exact location within the context memory. This location is passed along with the data operand to the update unit. A counter associated with each context area is incremented upon the arrival of each operand. Along with data operand 0, a separate metadata word containing the HyperOp's static identifier, the expected number of input operands and the number of context locations expected by a function call1 (if it is a call) is passed to the update unit. The update unit uses the input count information from the metadata, and when the input count matches the expected count the HyperOp instance is selected for execution. The selection no longer relies on a priority encoder. Instead, all ready HyperOps are pushed into FIFOs and are picked up for processing from the FIFO. The use of a FIFO as opposed to a priority encoder can lead to blocking of packets on the router interfaces between the reconfigurable fabric and the macro-dataflow orchestration subsystem. If either the resource allocator or the instruction and data transfer unit stalls, and the number of HyperOps ready to be executed exceeds the total available capacity of the FIFO, then such a blocking of packets could occur. With appropriate FIFO sizing and software based techniques (viz. k-loop bounding) such eventualities can be completely avoided. A termination request (i.e. a false predicate) indicates the number of inputs expected to arrive before the instance can be terminated. When the required number of inputs is available, the HyperOp instance is terminated and an event notification is forwarded to the intended recipient as specified by the termination request. A purge request for loop invariant data operates in a manner similar to the termination request. The compiler controlled orchestration scheme drastically reduces the area of the orchestration subsystem and simplifies the logic considerably. This also leads to a lower overhead, in cycles, for processing each operand arriving at the update unit. This implementation is akin to the Explicit Token Store (ETS) model as described by Papadopoulos and Culler (1998). It is to be noted that the performance of the compiler-controlled scheme is completely independent of the number of context memory locations. The number of context memory locations is a resource requirement, the absence of which

1A function call is indicated by a bit in the metadata.

leads to execution termination. The improvement in performance is expected to be obtained from the use of simplified hardware.
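The context-area assignment described above can be sketched as a small graph-colouring pass. The code below is illustrative only and assumes a hypothetical conflict graph (interacting HyperOps and HyperOps on complementary paths may share an area, so conflict edges exist only between the remaining pairs); it is not the allocator implemented in our compiler. The number of colours used is the number of context areas required.

    def assign_context_areas(hyperops, conflicts):
        """hyperops: iterable of HyperOp identifiers.
        conflicts: set of frozensets {h1, h2} that must not share a context area.
        Returns a mapping from HyperOp to context area index."""
        area = {}
        # Colour higher-degree HyperOps first (a simple Welsh-Powell style heuristic).
        degree = {h: sum(1 for c in conflicts if h in c) for h in hyperops}
        for h in sorted(hyperops, key=lambda x: degree[x], reverse=True):
            used = {area[o] for o in area if frozenset((h, o)) in conflicts}
            colour = 0
            while colour in used:
                colour += 1
            area[h] = colour
        return area

    # Example: A and B interact (so they may share), but both conflict with C.
    print(assign_context_areas(["A", "B", "C"],
                               {frozenset(("A", "C")), frozenset(("B", "C"))}))
    # {'C': 0, 'A': 1, 'B': 1}: two context areas suffice for the three HyperOps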

7.3 Evaluating the impact on FET and IHLT

The microarchitectural optimizations presented thus far were designed to improve the FET and IHLT. We present the results of their impact on the overall execution time, IHLT, FET and PHLT. The normalized execution times are shown for two different microarchitectural configurations: (i) 16 instructions per computation unit and (ii) 32 instructions per computation unit. The configurations are summarized in table 7.1. The execution time is normalized with respect to the base configuration. The compiler configuration across all three microarchitectural configurations is identical, other than the compiler changes needed to enable (i) compiler controlled orchestration, (ii) assignment of load-store instructions from the same alias set to a single computation unit and (iii) ordering of instructions in the computation unit so as to satisfy data dependences, due to the elimination of the priority encoder. The results, shown in figure 7.5, indicate that the overall execution time for configuration I is lower than the execution time for the base configuration. In the case of configuration II, we observe that it performs worse than configuration I in half of the cases and better in the other half.

7.3.1 Impact on FET

The FET for configurations I and II, normalized with respect to the FET of the base configuration, is shown in figure 7.6a. The comparison of the CPI is shown in figure 7.6b. We observe that in all cases the FET and CPI are lower for configuration II

Table 7.1: The various hardware configurations.

Base Configuration   Computation unit with 16 instructions; 4 cycle pipeline; hardware controlled context memory
Configuration I      Computation unit with 16 instructions; 2 cycle pipeline; compiler controlled context memory
Configuration II     Computation unit with 32 instructions; 2 cycle pipeline; compiler controlled context memory

[Figure 7.5 plot: execution time normalized with respect to the base configuration (y-axis) for each application (x-axis), for configurations I and II.]

Figure 7.5: The plots show the normalized execution time of the CGRA for configurations I and II. The normalization is performed with respect to the base configuration.

when compared to configuration I. This is on account of the reduction in the number of computation units across which instructions are spread and the resultant reduction in the overall communication on the reconfigurable fabric. A lot more instructions communicate locally (i.e. within the same computation unit). In the case of most applications, the percentage reduction in FET is also reflected in the percentage reduction in CPI. However, in some of these cases (AES-Decrypt, AES-Encrypt, ECPA, ECPD and IDCT) the reduction in CPI is not commensurate with the reduction in FET. HyperOps in configuration II can pack more instructions: HyperOps are limited by the number of instructions that can be accommodated on the reconfigurable fabric, and with the increase in the number of instructions in each computation unit, each HyperOp can now contain a larger number of instructions. With the lower instruction limit, we observe that some large basic blocks (especially in AES-Decrypt, AES-Encrypt, ECPA, ECPD, SHA-1 and IDCT) are split into multiple HyperOps. When a basic block is split, additional data movement instructions need to be added to the HyperOp to move data from one part of the basic block to another. With the increased instruction limit available in configuration II, many of these large basic blocks fit within one HyperOp, thus reducing

[Figure 7.6a plot: Fabric Execution Time for configurations I and II, normalized with respect to the base configuration, for each application (FET improvement).]
[Figure 7.6b plot: Cycles Per Instruction for configurations I and II, normalized with respect to the base configuration, for each application (CPI improvement).]

Figure 7.6: FET and CPI improvements due to optimizations within the computation unit.

[Figure 7.7 plot: IHLT portion of the overall execution time (y-axis) for each application (x-axis), for the base configuration and configurations I and II.]

Figure 7.7: Plot comparing the IHLT component in the overall execution time for various configurations.

the number of communication instructions. This brings down the overall instruction count. The dual effect of the decrease in FET and the drop in the number of instructions causes a smaller drop in CPI when compared to the drop in FET. In the case of Matrix Multiplication, LU factorization, QR factorization and SHA-1, we observe that the reduction in FET is not matched by the CPI in configuration I (when compared to the original scheme reported in chapter 6). In all these applications the number of instructions has reduced on account of the way in which the context memory is managed. In the case of the hardware managed context memory, additional overhead is incurred in transferring the list of HyperOps to be purged on a loop exit.

7.3.2 Impact on IHLT

The plot in figure 7.7 shows the IHLT component of the overall execution time. The plot shows that in all cases where IHLT was positive in the base configuration (refer to figure 6.4 in chapter 6), the IHLT is lower with the compiler controlled context memory (configurations I and II). In applications where the IHLT was negative due to overlap between

execution of a HyperOp and selection of the next HyperOp, we see a drop in the amount of overlap time. As mentioned in section 7.3.1, the FET has dropped for each of these applications due to optimizations in the computation unit. Thus, the amount of time available for overlap is lower. It is also observed that in most applications the IHLT recorded for configuration II is more than that for configuration I. The instruction ordering within the computation unit is done using the Force Directed List Scheduling (FDLS) algorithm (Paulin and Knight, 1989). Since an instruction to move data to the orchestration unit is a leaf node (with no successors), the algorithm tends to delay it, giving higher importance to other nodes. Since the number of instruction slots in configuration II is higher than in configuration I (32 instructions as opposed to 16), this causes a greater extent of sequentialization and thus increases the delay. In the case of AES-encrypt, AES-decrypt, ECPA, ECPD and SHA-1 it was mentioned that the number of HyperOps reduced, since the fabric can accommodate a larger number of instructions. In the AES-encrypt and AES-decrypt kernels the reduction in the number of HyperOps reduces the number of interactions between the HyperOps and therefore reduces the IHLT component of the execution time. In the case of ECPA and ECPD the number of HyperOps reduces, however this leads to an elimination of overlap between the execution of HyperOps2. The loss in HyperOp level parallelism causes the IHLT component to become positive.
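The tendency of a height-driven list scheduler to push such move instructions to the end of the schedule can be seen in the small sketch below. It is a simplification used for illustration (priority equals the longest path to a sink), not a reimplementation of FDLS, and the instruction names are made up.

    def schedule_priorities(succs):
        """succs: instruction -> list of successors. Returns instruction -> height."""
        height = {}
        def h(n):
            if n not in height:
                height[n] = 1 + max((h(s) for s in succs.get(n, [])), default=0)
            return height[n]
        for n in succs:
            h(n)
        return height

    # 'mov_to_orch' delivers a result to the orchestration unit and has no successors.
    graph = {"ld": ["mul"], "mul": ["add"], "add": ["mov_to_orch"], "mov_to_orch": []}
    print(schedule_priorities(graph))
    # {'mov_to_orch': 1, 'add': 2, 'mul': 3, 'ld': 4}: the move instruction has the lowest
    # priority, so it is scheduled last and the consumer HyperOp sees a longer IHLT.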

7.3.3 Impact on overall execution time

The optimizations described above helped reduce the FET in all applications. Applications where the number of HyperOps reduces owing to the increased instruction storage see a reduction in the overall execution time. In the other applications, where the number of HyperOps is not reduced, an increase in IHLT and in the overall execution time is observed. In these cases it is observed that the PHLT has increased, as seen in figure 7.8.

2A single very large basic block was previously split into several HyperOps due to the instruction count limit. These HyperOps, belonging to the same basic block, exhibited HyperOp level parallelism. With the increase in the number of instructions, the entire basic block is now included within the same HyperOp, thus eliminating the HyperOp level parallelism.

[Figure 7.8 plot: Perceived HyperOp Launch Time normalized with respect to the base configuration (y-axis) for each application (x-axis), for configurations I and II.]

Figure 7.8: The plots show the PHLT fraction of the overall execution time for configurations I and II. The normalization is performed with respect to the base configuration.

7.4 Reducing PHLT

One of the reasons for the poorer performance of configuration II with respect to configuration I is the increased PHLT (see figure 7.8). The increase in PHLT is due to the increase in the number of instructions within each computation unit. As indicated in chapter 5, the instruction and data transfer unit first transfers all the instructions followed by all the data and constants. The first instruction may start execution only after all the instructions have been loaded and its operands are available. An increase in the number of instructions within each computation unit evidently increases the PHLT. The PHLT does not increase for AES-Decrypt, AES-Encrypt, ECPA, ECPD and IDCT due to a drop in the number of instructions, as explained previously. In this section, we attempt to reduce the PHLT through two different techniques, namely: (i) interleaved instruction and data loading and (ii) resident loops.

7.4.1 Interleaving Instruction and Data Load

In order to reduce PHLT, the order of instruction and data loading is changed such that an instruction and its data operands are loaded together. This helps cut PHLT drastically at the cost of an increase in FET. If the increase in FET is not very high, a benefit may be seen in the overall execution time. We will refer to this interleaving scheme as configuration III. Configuration III retains all the remaining characteristics of configuration II. The comparison between the PHLT and FET for the interleaved and non-interleaved schemes is shown in figure 7.9. The plots show a clear trend: the PHLT is down to 25% or lower (refer to figure 7.9a). The FET has increased by 2-30% when compared to configuration II (refer to figure 7.9b). The overall execution time has reduced in the range of 9-30% when compared to configuration II (refer to figure 7.9c). The overall execution time recorded for configuration III is as good as or better than that of configuration I, despite the increase in FET (refer to figure 7.9d). The most important effect of this optimization is the reduction of the PHLT to a very small value in the range of 9-11 cycles. These values are independent of the HyperOp size. This implies that in applications where the number of instructions is large and the FET is large, the overhead of PHLT is reduced considerably (viz. AES-Encrypt and Decrypt, ECPA and ECPD). In applications such as QR factorization and LU factorization, which have moderately sized HyperOps, the decrease in PHLT does not decrease the overall execution

[Figure 7.9a plot: PHLT for configuration III normalized with respect to configuration II.]
[Figure 7.9b plot: FET for configuration III normalized with respect to configuration II.]
[Figure 7.9c plot: overall execution time for configuration III normalized with respect to configuration II.]
[Figure 7.9d plot: overall execution time for configuration III normalized with respect to configuration I.]

Figure 7.9: The plots show the PHLT, FET and overall execution time for configuration III, normalized with respect to configuration II. The last plot compares the execution time for configuration III to that recorded for configuration I.

time due to an increase in FET. However, in other applications (viz. SHA-1, Matrix Multiplication, IDCT etc.) the reduction in the PHLT is large enough to compensate for the increase in FET.
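The difference between the two loading orders can be sketched as follows. The HyperOp description used here is hypothetical (each instruction is paired with the operands and constants already available at launch time); the point is only the ordering of launch packets, not the actual packet format used by the instruction and data transfer unit.

    def launch_sequence(instructions, interleaved):
        """instructions: list of (instr, [operands]). Returns the ordered launch packets."""
        packets = []
        if not interleaved:
            packets += [("INSTR", i) for i, _ in instructions]    # all instructions first
            for _, ops in instructions:
                packets += [("DATA", op) for op in ops]           # then all the data
        else:
            for i, ops in instructions:
                packets.append(("INSTR", i))                      # each instruction ...
                packets += [("DATA", op) for op in ops]           # ... with its own data
        return packets

    hyperop = [("ld r1", ["addr0"]), ("add r2", ["const5"]), ("st r2", [])]
    print(launch_sequence(hyperop, interleaved=False))  # baseline: execution waits longest
    print(launch_sequence(hyperop, interleaved=True))   # configuration III ordering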

7.4.2 Resident Loops

The technique of interleaving instructions and data helps reduce PHLT, but does not eliminate the HyperOp launch time. If HyperOps that are executed repeatedly (i.e. HyperOps that are part of a loop) are kept resident on the fabric, then the overhead of launching the HyperOp several times is completely eliminated. However, such a scheme would not directly work, since the reconfigurable fabric does not have

support for executing loops as in a static dataflow machine (Arvind and Culler, 1986). Nor does it have support for tagging data belonging to multiple iterations. To execute loops on the fabric, instructions belonging to a computation unit should not be invalidated after completing one iteration. However, to ensure correct execution, no instruction should restart until all instructions within the loop have completed the iteration. In order to ensure this, once a computation unit completes execution of all its instructions it informs an on-fabric leader. One computation unit on the fabric is designated the leader. The leader computation unit awaits similar messages from all the computation units taking part in the execution of the loop. When all complete messages have been received, the leader computation unit sends a restart message to all the member computation units. However, if all complete messages have been received and the loop termination condition is met, then the leader computation unit sends terminate messages to all the computation units. This model can be very easily implemented in software through the introduction of additional instructions. We use a hardware-based technique to avoid the overhead of executing these instructions. A hardware event counter is introduced in each computation unit. This event counter counts the number of complete messages that arrive. When the value of the counter meets the pre-specified count, the event counter informs the instructions propagating the restart or terminate messages. The scheme described above works for a single level of nesting. However, when multiple levels of nesting are present it would be advantageous to accommodate multiple levels of a loop on the fabric. In many cases, the outer levels of the loop have very few compute instructions. Loading a separate HyperOp for executing these low overhead instructions is not advantageous. We therefore permit multiple levels of a loop to be resident on the fabric. When multiple levels of loop nesting are present, one leader computation unit is assigned to each loop. When the loop termination condition evaluates to true, it informs the outer level loop while simultaneously informing all member computation units to reset all loop-invariant inputs (these will be regenerated by the next iteration of the outer loop). The outer loop completes execution of the code following the inner loop and then the iteration restarts. When the iteration restarts, all loop-invariants for the inner loop are produced once again and the execution of the inner loop restarts. When the outermost loop completes execution of all iterations, it causes termination signals to be sent to all computation units taking part in all loops. The HyperOps corresponding to the inner and outer loops

[Figure 7.10 plot: overall execution time with configuration IV normalized with respect to configuration III (y-axis) for each application (x-axis).]

Figure 7.10: The overall execution time with configuration IV normalized with respect to configuration III.

are all merged into a larger HyperOp, along with control instructions for the leader computation unit, and are referred to as Loop-HyperOps. We refer to this as configuration IV. Other than the use of Loop-HyperOps and event counters in each computation unit, the hardware configuration remains identical to configuration III. The results of these optimizations are presented in figure 7.10. This technique helps in reducing PHLT (as there are no HyperOp launches) and IHLT (as it is replaced by on-fabric communication). When loops are kept resident on the fabric, at the end of each iteration the completion status is communicated to the leader computation unit, which in turn responds with a restart signal once all completion signals have been received. Thus there is a round trip delay which is incurred in this model of execution. If the number of computation units spanned by a Loop-HyperOp is small, then the overhead of communicating the completion and restart signals is quite low. However, if many computation units are used then this overhead is higher. If the increase in FET on account of the sequentialization of iterations and/or the overhead of the completion and restart signals is much smaller than the drop in IHLT and PHLT, then an improvement in the overall execution time is observed. This behaviour is observed in applications such as CRC, FFT, Matrix Multiplication, Sobel Edge Detection and IDCT. In all these applications, the size of the Loop-HyperOp is quite small, leading to a lower overhead.
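The leader's role can be summarised with a small behavioural sketch. This is an illustration of the protocol described above rather than the Bluespec implementation; the member count and the termination test are assumed to be supplied by the compiler.

    class LoopLeader:
        def __init__(self, num_members, loop_done):
            self.num_members = num_members   # computation units taking part in the loop
            self.loop_done = loop_done       # callable: True when the loop must terminate
            self.completed = 0               # hardware event counter

        def on_complete_message(self):
            """Called once per member computation unit at the end of an iteration."""
            self.completed += 1
            if self.completed < self.num_members:
                return None                  # still waiting for the other members
            self.completed = 0               # all members have finished this iteration
            return "TERMINATE" if self.loop_done() else "RESTART"

    # Example: a loop spanning three computation units that runs for two iterations.
    done_flags = iter([False, True])
    leader = LoopLeader(3, lambda: next(done_flags))
    print([leader.on_complete_message() for _ in range(6)])
    # [None, None, 'RESTART', None, None, 'TERMINATE']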

This technique also facilitates overlap of execution between the

outer and inner loops which helps improve FET. Another factor to be observed is that when loops are resident on fabric, execution of all iterations following the first iteration does not await the arrival of instructions. The execution starts as soon as the data is available. This too helps reduce the overall FET across several iterations. This factor is observed in all nested loop kernels such as Matrix Multiplication, LU factorization and QR factorization. However, QR factorization does not gain much due to the sequentialization of iterations. When loops are not kept resident on the fabric, there is a large overlap between execution of successive iterations of QR and LU factorizations on the fabric.

In MRI-Q the FET is the most dominant factor. The FET in MRI-Q does not benefit from keeping the loop resident on the fabric, as each iteration includes several long latency floating-point operations. The reduction in PHLT and IHLT does not offset the increase in FET due to the overhead of the completion and restart signals. This leads to a very minor increase in the overall execution time. In effect, Loop-HyperOps have little or no effect on the performance of MRI-Q.

AES-Decrypt and AES-Encrypt exhibit very different trends when the loops are kept resident on the fabric. In the case of AES-Decrypt we observe a 20% reduction in overall execution time with configuration IV when compared to configuration III. This reduction can be attributed to the lower FET per iteration of the loop. In the AES-Decrypt kernel there is a lot of parallelism at the beginning of the iteration and execution is very sequential towards the end of the iteration. In configuration III the FET gets stretched, since the parallel instructions at the beginning of the iteration await the arrival of instructions and/or input operands. This wait is eliminated in configuration IV, resulting in a reduction of the overall execution time. AES-Encrypt, being the inverse of AES-Decrypt, has the exact opposite parallelism profile. Less parallelism at the beginning of the iteration implies that, during execution of the kernel in configuration III, the execution is not stalled by the arrival of instructions. Therefore, no substantial gain in performance is obtained by keeping the loops resident. The execution time of AES-Encrypt is slightly worse (about 3%) than the execution time recorded with configuration III, due to the runtime overhead of sending complete messages and receiving a restart message. This overhead cannot be hidden.

7.4.3 Note on different techniques to reduce PHLT

In this chapter, we described two possible techniques for reducing PHLT. As mentioned in chapter 5, speculative prefetch is another technique which can be employed to reduce the PHLT. This technique is described in detail by Krishnamoorthy et al. (2010). Speculative prefetch and interleaved instruction and data loading help reduce the PHLT, while the technique of keeping loops resident on the fabric reduces the number of HyperOp launches (due to the elimination of repeated launches of HyperOps participating in loops). Each of these techniques has its own distinct advantages and disadvantages. Speculative prefetch helps reduce the PHLT without any impact on FET3. However, it comes with a hardware overhead (a lookup table). On the other hand, interleaved loading of instructions and data does not have any overhead in hardware; however, it stretches the FET by about 2-30%, as shown in figure 7.9. Keeping loops resident on the fabric is possible only if the loop is small enough to be accommodated on the fabric. If the loop can launch one or more iterations in parallel, it may not be advantageous to keep the loops resident on the fabric. An appropriate choice must be made for the specific application based on application characteristics. Applications such as QR and LU factorization would benefit when speculative prefetch is used, as the FET remains the same.
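These trade-offs can be condensed into a hedged decision sketch. The predicates below are illustrative assumptions summarising the discussion, not checks performed by our compiler.

    def choose_phlt_technique(loop_fits_on_fabric, iterations_overlap,
                              lookup_table_budget_available):
        if loop_fits_on_fabric and not iterations_overlap:
            # Resident loops remove repeated HyperOp launches entirely, but forfeit
            # the overlap between iterations, so avoid them when iterations run in parallel.
            return "resident loops (configuration IV)"
        if lookup_table_budget_available:
            # Speculative prefetch hides the launch time without stretching FET,
            # at the cost of a hardware lookup table (Krishnamoorthy et al., 2010).
            return "speculative prefetch"
        # Interleaved loading needs no extra hardware but stretches FET by about 2-30%.
        return "interleaved instruction and data loading (configuration III)"

    print(choose_phlt_technique(False, True, True))  # e.g. QR/LU factorization -> prefetch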

7.5 Understanding the Cumulative Effect

Thus far, we have seen the impact of four sets of architectural optimizations. The impact on performance was presented in each case relative to the previous architectural configuration, to understand the relative change in performance. In order to gain an understanding of the performance improvement with respect to all of these changes together, we present a single plot (figure 7.11) where all the performance improvements are normalized with respect to the base configuration (refer to table 7.1). On average, the application performance has improved 2.44× due to all of these optimizations. The best improvement is observed in ECPA and ECPD, where the improvement is greater than 4×, and the least improvement is seen in the case of QR factorization. QR factorization exhibits an interesting trend. The best performance is recorded for configuration I, i.e. the use of 16 instructions per computation unit leads to

3The impact on the FET is dependent on the relative placement of the speculatively loaded HyperOp and the currently executing HyperOp. If they were placed such that the instruction packets for the speculatively loaded HyperOp traverse paths which are used for data transfer within the currently executing HyperOp, then an impact on FET would be observed.

[Figure 7.11 plot: execution time normalized with respect to the base configuration (y-axis) for each application (x-axis), for configurations I, II, III and IV.]

Figure 7.11: A single plot showing the improvement in performance for each of the architectural optimizations with respect to the base configuration.

the best performance, which can be attributed to the drop in overlap between the parallel execution of multiple iterations when moving from configuration I to configuration II (as indicated previously).

7.6 Synthesizing Hardware Modules

In order to determine the silicon area, operating frequency and power dissipation of the various modules, we synthesize these modules using the 90nm Faraday High Speed Low VT technology. Synthesis was performed using Synopsys Design Vision. We only synthesize the hardware for configuration IV (presented in section 7.4.2). This configuration includes the compiler controlled context memory and a computation unit with 32 instructions and sequential instruction selection. Each computation unit also includes an event counter for counting the number of complete messages. We synthesize a computation unit which is customized for the Cryptographic fabric (described in chapter 6; table 6.1). The integer and multiplier modules are instantiations of appropriate modules from the Synopsys DesignWare IP Library, while the remaining FUs have been specifically designed for our CGRA. The area, frequency and power estimates obtained from

Synopsys Design Vision are shown in table 7.2. The area, power and frequency figures provided for the Update and Orchestration Unit do not include the context memory4. All power estimates have been obtained at the highest possible activity factor of 0.5.

Table 7.2: The area, power and frequency estimates for various modules.

Module                         Area in mm2   Frequency in MHz   Dynamic Power in mW
Computation unit               0.25          454                52.49
Router                         0.098         454                24
Update & Orchestration Unit    0.3           700                84

One of the concerns with a CGRA is the presence of a large number of computation units and hence a larger possibility of leakage power. The very structure of a CGRA makes it amenable to leakage power optimization through coarse-grained Vdd gating, which can be implemented at the level of each computation unit. As in an FPGA where such a scheme is implemented, the leakage power saved depends on the utilization of the CGRA (Rahman et al., 2006).

7.7 Comparing Performance with other processors

In this section, we present a comparison of the performance of the various applications with a GPP and with other reconfigurable processors. The comparison with the GPP is complete, due to the easy accessibility of the processor and compiler. In order to compare the performance with the GPP, the same C code executed on the CGRA is compiled for the GPP and the results are gathered after execution. This approach gives us a mechanism to evaluate the effectiveness of the proposed CGRA in comparison with a GPP. Comparisons are also presented with FPGA based implementations and with reconfigurable architectures employing FPGAs. The comparison with FPGA based implementations is based on results available in the existing literature. Implementing all algorithms in Verilog/VHDL in a performance-efficient manner is difficult and time-consuming and yet does not guarantee optimality. The comparison with other reconfigurable architectures is also based on published results in the academic literature.

4Memory modules are typically custom generated using a memory compiler.

7.7.1 Comparison with a GPP

In order to benchmark the performance of the applications on the CGRA, we compare its performance with the execution time on a Core 2 Quad operating at 2.66GHz (Q8400). The performance comparison is shown in figure 7.12. The x-axis shows the various applications and the y-axis shows the ratio of the time taken on our CGRA to that on the GPP. The values above the x-axis indicate applications in which the CGRA does not perform well and the values below the x-axis indicate the applications in which the CGRA performs very well. AES-Encrypt, AES-Decrypt, ECPA and ECPD perform exceptionally well, between one and close to two orders of magnitude better than the GPP. The reason for the exceptional performance of AES-Decrypt, AES-Encrypt, ECPA and ECPD can be attributed to the higher computation to load instruction ratio and the use of specialized FUs which aid these computations. The simpler pipeline and the use of specialized FUs help achieve lower power dissipation while not compromising on performance. On the other hand, all the floating point applications, IDCT, SHA-1 and CRC do not perform well when compared to the GPP.

Figure 7.12: Plot shows the comparison in performance between a Core 2 Quad and our CGRA.

7.7.2 Comparison with FPGAs

The comparison with FPGAs is presented for four applications, namely: AES-Encrypt, SHA-1, FFT and IDCT. We could not obtain relevant implementations for the other applications (for some applications we could

Table 7.3: Comparison of the throughput achieved by FPGA and CGRA.

Application    FPGA Throughput   CGRA Throughput
AES-Encrypt    1410              34.47
SHA-1          5900              6.50
FFT            3800              41.58
IDCT           4.27              0.17

not find similar application configurations and for others no implementation is available). The results are presented in table 7.3. We present the throughput achieved for each of these applications. In the case of AES-Encrypt and SHA-1 the throughput is expressed in Megabits Per Second (Mbps), in Mega Floating Point Operations Per Second (MFLOPS) for FFT and in Mega Samples per second for IDCT. The performance information for each of these applications is obtained from the following references: AES-Encrypt: Bulens et al. (2008), SHA-1: Lee et al. (2009), FFT: Hemmert and Underwood (2005), IDCT: Sima et al. (2001). The difference in performance is of the order of 24× (recorded for IDCT) to 907× (recorded for SHA-1). Excluding SHA-1, which is a poorly performing application, the performance difference is of the order of 10-100×. The difference in performance is directly attributable to the nature of processing. All of these implementations work on streaming data (and sometimes use pipelined execution). The overheads of the CGRA are mainly attributable to the nature of instruction execution (involving fetch, decode, execute and writeback) as opposed to a data triggered LUT lookup, which is a single cycle operation.

7.7.3 Comparison with CGRAs employing FPGAs

Molen (Chaves et al., 2006; Chaves et al., 2008) uses an FPGA for implementing instruction set extensions. Molen uses a PowerPC architecture with an FPGA used as a hardware assist to improve processor performance. The main kernel is offloaded onto the FPGA. The computation offloaded onto the FPGA is either manually coded in HDL or is automatically translated from a High Level Language (HLL) to HDL. Software based control code is run on the PowerPC processor and it invokes a hardware function, which is emulated on the FPGA. The cycle comparison between our CGRA and Molen for a few applications is presented in table 7.4. As observed previously, SHA-1 performs poorly and the AES kernels perform much better than on Molen. When compared to an HDL approach, our CGRA does not perform very well. However, when compared with a software approach or with an HLL

Table 7.4: Execution times recorded on Molen and on our CGRA for various applications.

Application    Time on Molen   Time on our CGRA
AES-Encrypt    5.6µs           3.712µs
AES-Decrypt    5.6µs           3.015µs
SHA-1          3.96µs          38.28µs

based approach to mapping on an FPGA, we perform well. An analysis of the instruction execution trace reveals that the CGRA does not perform well due to the following reasons:

• The load latency, even from the closest computation unit, is 10 cycles. Therefore, all load-dependent operations incur long delays.

• We are unable to exploit ILP unless the interactions between the parallel instructions are quite far apart in the dataflow graph.

• All our computation units perform non-pipelined, multi-cycle operations. This implies that even if independent instructions are present, the execution of an instruction needs to await the completion of the previously issued instruction.

• Dependent instructions are not executed in back-to-back clock cycles.

These factors do not affect the AES kernels, ECPA and ECPD, since we were able to exploit long-range ILP. As mentioned previously, these applications have a higher compute to load instruction ratio.

7.7.4 Addressing some shortcomings

We try to address some of these shortcomings and evaluate their effectiveness on a software simulator. The time between a request to load data and the arrival of that data is typically 10 cycles or higher. This is due to the round trip delay for the request originating from a computation unit to reach the load-store unit, the memory access time and the time taken for the response data to be received at another computation unit. In order to cut this delay, the data memory is directly connected to the computation units issuing loads/stores, so that one half of the round trip delay is eliminated. The memory access latency is set to two clock cycles. The time taken for the response from the load to reach a computation unit on the fabric remains the same as before. Before we describe the next optimization implemented, we

take a digression to describe why this design choice is the best possible one in the context of our CGRA.

7.7.4.1 Alternative Techniques for Load-Store Overhead Reduction

Another possible technique is to use local memory units connected to each computation unit. This technique can reduce the access time for data from each computation unit to just two clock cycles (unlike the previous scheme, where the data needs to travel from the load-store unit to the destination, which can incur more than two cycles). However, this scheme leads to a non-uniform access latency. If the distribution of data (across the memory units connected to each computation unit) is such that most data accesses are non-local, then the scheme would perform no better than the previously described scheme. In order to make most data accesses local, memory partitioning needs to be performed. Memory partitioning involves the following:

1. Scalar variables are already partitioned and the data is placed in the registers of each local computation unit. In case more than one consumer exists, the data is replicated. The use of SSA semantics makes it easy to determine the final value written to the scalar.

2. Vector variables need to be partitioned when local memories are employed. However, our current compilation methodology aims at exploiting ILP and not DLP. Vector data partitioning is useful only in the case of exploitation of DLP.

Biswas et al. (2010) employ local memories when customizing the platform to emulate systolic execution of matrix algorithms. In this work, Biswas et al. (2010) exploit DLP by executing several iterations in parallel. This is described in greater detail in section 7.8. The use of local memories is not suitable in the case of streaming applications, where the data is read once, processed and an output generated, i.e. there is no reuse of the data. In such a case, the use of local memory does not provide any added performance benefit. For the above stated reasons, the choice of connecting memory directly to certain computation units is better suited to the applications and compilation methodology being employed. Yet another optimization we model in this simulator is the execution of dependent instructions in subsequent clock cycles. The reservation station examines the destinations of the instruction to be scheduled, and if the destination happens to be the subsequent instruction within the reservation station, then appropriate bits are set in the outgoing instruction to indicate that the data should be retained within the ALU. Appropriate bits are also set within the reservation station to indicate that the next instruction to be executed will not receive data

[Figure 7.13 plot: Fabric Execution Time normalized with respect to configuration III (y-axis) for selected HyperOps of the cryptography and integer applications (x-axis).]

Figure 7.13: Normalized FET for HyperOps on Cryptographic Fabric

from the instruction just issued to the ALU. In the next clock cycle, if all the data operands for the instruction have been received (other than the one which is to be generated by the previous instruction), then it is forwarded to the ALU for scheduling. The ALU uses the result of the previous computation for performing the computation. These two optimizations were implemented because of their minor impact on the frequency and power consumption. Instead of executing the whole application, we only execute critical HyperOps for each of these applications in the software simulator5. The plots in figure 7.13 and figure 7.14 show the FET recorded with these optimizations, normalized with respect to configuration III, for the corresponding HyperOps. The HyperOps chosen are all those which are a part of a loop and thus determine the performance of the application. Configuration IV is not used for comparison as we do not support Loop-HyperOps on the software simulator. Figure 7.13 shows a 10-20% improvement in performance for most applications, the exceptions being the CRC, ECPA-2 and SHA-1 HyperOps. These applications have a higher percentage of loads (when compared to the others) and thus see the benefit of both optimizations. CRC is a very small kernel with two loads and thus a huge reduction in execution time is observed. ECPA-2 gains primarily on account of the reduction in load latency. The load instructions and their consumers are present within the same computation unit. This reduces the time for each load instruction by half. However, an additional performance improvement is observed due to the poor instruction ordering in

5 Since these optimizations only impact FET, the software simulator was not designed to execute the whole application.

Figure 7.14: Normalized FET for HyperOps on Floating Point Fabric. (Y-axis: Normalized Fabric Execution Time w.r.t. configuration III; X-axis: HyperOps of Applications, namely MatrixMultiplication-1 to -5, LU-1 to -5, Givens-1 to -7, FFT-1 to -5 and MRI-Q-1 to -3.)

This HyperOp contains 12 loads followed by 12 output instructions which deliver these load outputs to other HyperOps. Each load has two outgoing edges, one to the output instruction and another to the next load in program order. The instruction schedule decided by the compiler prioritized the output instruction over the load instruction. Since execution happens in order, the output instruction blocks the execution of the subsequent load instruction, causing a very large execution time for this HyperOp. With the reduction in the time for data availability from a load, the execution time and wait time reduce drastically, contributing to a 90% reduction in execution time. Floating-point kernels, on the other hand, exhibit a larger improvement in performance owing to the higher number of loads in these applications and to the back-to-back execution of dependent instructions. However, the core loops of floating point applications such as Matrix Multiplication-3, LU-2, FFT-2 and MRI-Q-2 do not see a large reduction in FET due to long latency floating point operations. These operations are the most dominant factor that determines the execution time of the HyperOp. In FFT-2 and MRI-Q-2 the sin and cos operations dominate the execution time, and the benefits from faster data availability from the load-store unit and the reduced wait time between dependent instructions do not significantly change the execution time. We estimated the overall performance of these applications using the HyperOps' FET (as it is the most dominant factor) and compared it with the performance of the GPP. We were able to bring down the worst-case performance from a 13× slowdown relative to the GPP (for CRC) to about a 7× slowdown (for LU factorization).

Figure 7.15: Normalized FET for the floating point applications when pipelined floating point units are employed. (Y-axis: Normalized Fabric Execution Time w.r.t. configuration IV; X-axis: HyperOps of MatrixMultiplication-1 to -5, LU-1 to -5, Givens-1 to -7, FFT-1 to -5 and MRI-Q-1 to -3.)

In order to eliminate the performance overhead due to non-pipelined floating-point units, we modified the software simulator to support this feature. We refer to this configuration with pipelined floating point units as Configuration V. The FET of the HyperOps for the floating point applications is shown in figure 7.15. A 12% improvement in FET is observed on average. No change in FET is observed in those HyperOps which do not perform floating point operations (viz. MatrixMultiplication-1). In HyperOps that contain floating point operations, a 22% gain in performance is observed on average. However, as indicated previously, one other factor contributes to the poor performance when compared with the GPP, i.e. our inability to exploit short-range ILP. We list possible solutions to this problem in the subsequent chapter on Future Work (chapter 8).
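The overall comparison quoted above is derived from the FETs of the critical loop HyperOps. The fragment below is a minimal sketch of that style of first-order estimation; it is ours, not the thesis tooling, and every number in it (FETs, trip counts, the GPP cycle count) is a made-up placeholder.

    # Illustrative first-order performance estimate from HyperOp FETs.
    # All numbers are placeholders; the thesis uses measured FETs of the
    # critical loop HyperOps and compares against GPP execution time.

    def estimate_cycles(loop_hyperops):
        """Sum FET x trip_count over the HyperOps that form the application's loops."""
        return sum(h["fet_cycles"] * h["trip_count"] for h in loop_hyperops)

    # Hypothetical data for one kernel.
    hyperops = [
        {"name": "LU-2", "fet_cycles": 210, "trip_count": 1000},
        {"name": "LU-3", "fet_cycles": 95,  "trip_count": 1000},
    ]

    cgra_cycles = estimate_cycles(hyperops)
    gpp_cycles = 45_000            # placeholder GPP measurement
    slowdown = cgra_cycles / gpp_cycles
    print(f"estimated CGRA cycles: {cgra_cycles}, slowdown vs GPP: {slowdown:.1f}x")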

7.8 Exploiting Pipeline Parallelism and Task Level Parallelism

Pipeline parallelism in loops can be very easily exploited on CGRAs due to the large number of computation units available on the fabric. Examples of this form of parallelism exploited on our CGRA have been demonstrated and documented in our publications, Fell et al. (2009a) and Prashank et al. (2010). These results were achieved through semi-automatic compilation, and the 1024-point FFT performed 5.8× better than a GPP. Exploiting pipeline parallelism requires no changes to the architecture.

The technique relies on precalculated latencies for each stage of the pipeline, placed in hardware. In order to be able to support these kinds of pipelines, a technique is needed for estimating the latencies accurately for any application kernel. Assuming fixed latencies works only under conditions of streaming data, so the technique employed in these papers (Fell et al., 2009a; Prashank et al., 2010) does not extend to applications which do not have streaming data. In the absence of streaming data there is a need to support data prefetch, as indicated in Biswas et al. (2010).

In Biswas et al. (2010), we demonstrate how a systolic structure can be emulated on our CGRA. We place a scratch pad memory within the computation unit to allow storing certain values (sine and cosine values) needed for the matrix computations (Givens rotation). In order to support data prefetch, the load-store units are enhanced to include a set of load instructions in auto-increment mode. The design allowed not just simple increments on the load address but also permitted evaluation of a more complex expression involving (say) a multiplication and an addition. The appropriate expressions for these were manually determined. There are techniques known in the literature for identifying such chains of recurrences (Bachmann et al., 1994); the scalar evolution analysis framework of GCC (Pop et al., 2005) performs this kind of analysis. These load instructions supply data to the point of consumption. The scheme used for resident loops can be employed to provide a flow control mechanism. However, it has been our experience that the overhead of the loop restart mechanism (explained in section 7.4.2) can be quite large and nullifies the advantage obtained by placing load instructions in auto-increment mode in the load-store unit. In order to circumvent this problem, we can restart the loop after a statically computed latency for each iteration. Static computation of these delays may not be possible in all cases, due to variable load latencies and the unpredictability of packet arrival in a NoC. When two packets arrive at a router and request the same output direction at the same time, the arbitration mechanism employed by the router may choose either of the two packets based on prior choices. In the absence of such collisions, it is possible to statically predict the delays, assuming that the data operands are available in the data memory. If operands are not available in the data memory and need to be fetched from a lower level of the memory hierarchy or received from the host processor, this mechanism of computing static delays may not work.
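The address expressions evaluated by the enhanced load-store unit can be viewed as chains of recurrences in the sense of Bachmann et al. (1994). The sketch below is our own illustration, not the thesis implementation: a basic recurrence (base, step) produces base, base+step, base+2*step, and the step may itself be a recurrence, which covers the multiply-and-add style address expressions mentioned above.

    # Minimal chain-of-recurrences evaluator for auto-increment load addresses.
    # A CR is either a constant or a tuple (base, step) meaning
    # value(0) = base, value(i+1) = value(i) + step(i).

    def cr_values(cr, n):
        """Yield the first n values of a (possibly nested) chain of recurrences."""
        if not isinstance(cr, tuple):          # plain constant step or address
            for _ in range(n):
                yield cr
            return
        base, step = cr
        value = base
        steps = cr_values(step, n)             # the step itself may evolve
        for _ in range(n):
            yield value
            value += next(steps)

    # (0x1000, 4): a simple auto-increment load walking an int32 array.
    print(list(cr_values((0x1000, 4), 5)))     # [4096, 4100, 4104, 4108, 4112]

    # (0, (4, 8)): addresses 0, 4, 16, 36, 64, i.e. a growing stride of the kind
    # produced by a multiply-and-add address expression.
    print(list(cr_values((0, (4, 8)), 5)))     # [0, 4, 16, 36, 64]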

Table 7.5: The overall execution time recorded for configuration IV and configuration VI and the percentage difference between the two execution times.

Application        Execution time (Config IV)   Execution time (Config VI)   Percentage Difference
CRC                13926                        14196                         1.94
AES-Decrypt        2111                         2106                         -0.24
AES-Encrypt        2599                         2643                          1.69
ECPA               2596                         2591                         -0.19
ECPD               1445                         1442                         -0.21
SHA-1              17227                        17292                         0.38
IDCT               4088                         4088                          0
Sobel              13233854                     13468569                      1.77
Matrix Multiply    1973315                      1973318                       0.00
LU                 692717                       692722                        0.001
QR                 2355696                      2355694                      -0.00
FFT                1896295                      1905288                       0.47
MRI-Q              8915                         8915                          0

A modified router design that helps us achieve the twin objectives of lowering the reconfiguration overhead obtained by the use of routers and achieving predictability in data movement is presented next. The round-robin arbiter of figure 4.5 in chapter 4 is replaced with fixed priority logic. The elimination of the round-robin priority logic reduces the area of the router and makes the execution predictable. We refer to this as configuration VI. This change does not affect the performance of the applications (worst-case drop of < 2%), as can be seen in table 7.5; the baseline execution numbers are those reported for configuration IV. With this modified router, the schemes proposed by Biswas et al. (2010), Fell et al. (2009a) and Prashank et al. (2010) become feasible and pipeline parallelism can be exploited.
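To make the arbitration change concrete, the following is a minimal sketch (ours, for illustration only; the actual router design is described in chapter 4) contrasting the two policies. The fixed-priority grant depends only on the current requests, not on arbitration history, which is what makes packet delays statically predictable:

    # Illustrative arbitration policies for one router output port.
    # 'requests' is a list of booleans, one per input port.

    def fixed_priority_grant(requests):
        """Always grant the lowest-numbered requesting port: no state, fully predictable."""
        for port, req in enumerate(requests):
            if req:
                return port
        return None

    def round_robin_grant(requests, last_granted):
        """Grant the first requester after the previously granted port.
        The outcome depends on arbitration history (last_granted)."""
        n = len(requests)
        for offset in range(1, n + 1):
            port = (last_granted + offset) % n
            if requests[port]:
                return port
        return None

    reqs = [False, True, False, True]
    print(fixed_priority_grant(reqs))        # always 1 for these requests
    print(round_robin_grant(reqs, 1))        # 3: depends on who was granted last
    print(round_robin_grant(reqs, 3))        # 1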

There are several factors which influence the effectiveness of pipeline parallelism. These include the number of instructions within each pipeline stage, the time taken to execute each pipeline stage and the number of pipeline stages. If the number of instructions within a pipeline stage is larger than the number of instructions that can be accommodated within each computation unit, then the instructions of a pipeline stage spill over to many computation units. If the number of computation units assigned per pipeline stage matches the maximum ILP that can be exploited, then the execution of the pipeline stage is efficient. If the number of instructions becomes so large that they spill over several computation units, then a lot of time is spent communicating between these computation units (without any gain in parallel instruction execution). All instructions within a pipeline stage need to be completed prior to restarting the pipeline stage, whereas instructions in different pipeline stages can execute in parallel (on different data sets). This implies that pipeline parallelism performs best when the "right" number of computation units is employed. This number is computed analytically for the QR factorization kernel in Biswas et al. (2010), where the schedule of the various instructions is derived from deterministic computation and memory access latencies. However, in some cases the pipeline stage may be very large, lowering the effectiveness of the scheme. Support for task-level parallelism is implicit within the system. Whenever a HyperOp is ready for execution (i.e. has received all its inputs), the macro-dataflow orchestration subsystem determines an appropriate location on the execution fabric and then transfers the instructions and data operands to the identified locations on the fabric. Such overlapped execution was observed between the iterations of the innermost loops of QR factorization and LU factorization. An overlap in execution was also observed between HyperOps 2 and 3 in AES-decrypt for configuration I. It is therefore evident that the platform does not require any changes to support task level parallelism.
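The analytical sizing for the QR kernel is in Biswas et al. (2010); the fragment below is only a back-of-the-envelope restatement of the argument above, with all parameters ours: a stage needs enough computation units to hold its instructions, but any units beyond the stage's exploitable ILP only add communication.

    import math

    def stage_unit_count(stage_instructions, instrs_per_unit, exploitable_ilp):
        """Units forced by instruction capacity, and whether that exceeds the
        efficient count (roughly the stage's exploitable ILP)."""
        forced = math.ceil(stage_instructions / instrs_per_unit)
        efficient = exploitable_ilp
        # True in the second field means time is lost to inter-unit communication.
        return forced, forced > efficient

    # Hypothetical stage: 90 instructions, 32-instruction units, ILP of about 2.
    print(stage_unit_count(90, 32, 2))      # (3, True): stage spills beyond its useful ILP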

7.9 Conclusion

In this chapter we explored various microarchitectural optimizations to reduce FET, IHLT and PHLT. The highlight of all these optimizations is that we were able to achieve a nearly fixed PHLT for all HyperOps by hiding a substantial portion of it. Various optimizations of the computation unit, including increased bypasses, back-to-back instruction execution and optimizations to improve load-store performance, helped improve the FET. The IHLT was improved by a complete redesign of the Update and Orchestration unit, rendering the hardware simpler while improving the performance. A performance comparison with the GPP indicates that while we perform very well on kernels which have a higher compute to load instruction ratio, we do not perform well on applications that are load-dominated. The architecture is unable to exploit the available ILP due to the long computation unit to computation unit communication latency. This factor, along with the placement of load-store units at the periphery of the reconfigurable fabric, critically affects the performance. Another important factor that contributed to the performance is the appropriate selection of specialized FUs.

The choice of specialized FUs in combination with a simple pipeline helped us achieve a lower power dissipation. In the next chapter on Conclusions and Future Work, we summarize the good and ill effects of our choices and re-imagine the architecture to address the deficiencies while retaining its strengths.

Chapter 8

Conclusions and Future Work

You are ready to graduate when you know that given another chance you would implement it differently. – Prof. S K Nandy

In this thesis, we presented the architecture of a CGRA which was built to achieve better programmability, to deliver higher performance through the exploitation of ILP, DLP and TLP, to enable instantiation of domain-specific FUs, and to be capable of executing a full program if need be. To exploit TLP and DLP, we implemented a macro-dataflow based orchestration unit. The orchestration unit is responsible for determining the time and order of execution of various application substructures. These application substructures, or macro operations, are derived from the application's dataflow and control flow graphs. The macro operations need to be well-behaved: any macro operation receiving data operands must either execute or be terminated, so that the memory location associated with the macro operation is released. Further, loop invariants too must be purged when the last iteration of the loop has executed. A macro operation is selected for execution when all its input operands are available. This is preferred since implementing context-switch capability is very difficult on CGRAs. The reconfigurable fabric of the CGRA comprises a set of ALUs interconnected by a NoC. Each ALU was designed with storage for multiple instructions to improve the instruction density for the given silicon area. Instructions are executed in a data-dependent manner, i.e. an instruction executes as soon as its data operands are available. This helps in withstanding the variable latencies associated with load operations and communication on the NoC.
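A minimal sketch of this firing rule follows (ours, purely illustrative of the semantics; the class names and the dictionary-based context entry are not the actual context memory organisation): a macro operation becomes ready exactly when its operand count reaches the expected number of inputs.

    # Illustrative macro-dataflow firing rule: a HyperOp (macro operation) fires
    # once all of its input operands have arrived at its context memory entry.

    class ContextEntry:
        def __init__(self, hyperop_id, expected_inputs):
            self.hyperop_id = hyperop_id
            self.expected = expected_inputs
            self.operands = {}                 # operand position -> value

        def deliver(self, position, value):
            """A context update: store the operand and report readiness."""
            self.operands[position] = value
            return self.ready()

        def ready(self):
            return len(self.operands) == self.expected

    class Orchestrator:
        def __init__(self):
            self.ready_queue = []

        def on_update(self, entry, position, value):
            if entry.deliver(position, value):
                # All inputs present: select a fabric location and launch.
                # Well-behavedness guarantees the entry is eventually released.
                self.ready_queue.append(entry.hyperop_id)

    orc = Orchestrator()
    h = ContextEntry("HyperOp-7", expected_inputs=2)
    orc.on_update(h, 0, 3.14)
    orc.on_update(h, 1, 42)
    print(orc.ready_queue)                     # ['HyperOp-7']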

The reconfigurable fabric supports heterogeneous computation units, i.e. all computation units need not support the same functionality. The computation unit can be designed in a domain-specific manner, and the compiler and architecture seamlessly support this domain specialization. During the process of domain specialization, various microarchitectural designs of the reconfigurable fabric can be explored. The orchestration unit is oblivious to the nature of the reconfigurable fabric and the paradigm of execution employed on it. We achieved almost all of the objectives we outlined previously, with the exception of one: we were not able to completely exploit the available ILP due to our choice of computation unit and interconnection network. We also observed that the performance of linear algebra kernels suffered on account of the long latency associated with memory access. On a positive note, the compiler controlled orchestration unit is of low overhead and helps in exploiting TLP with little software intervention. The key lessons from our experience are listed below.

• Compiler Controlled Orchestrator has better performance and lower area: The compiler controlled orchestrator performs better owing to lesser involvement of hardware. The compiler takes many decisions statically. The runtime overhead involves an addition, which is performed on the reconfigurable fabric. This renders the context memory update unit simpler, and it thus occupies a lower area. The orchestrator takes a fixed 4 cycles to schedule an application substructure once all data operands for it have arrived. Therefore, macro-dataflow serves as a low-overhead mechanism to exploit TLP with little or no software changes (such as the creation of threads).

• Data-dependent Execution and Stateless Interconnection Network helped lower reconfiguration overhead: Data-dependent instruction execution was primarily intended to withstand variable latency load operations (which is expected in the presence of caches). In such a case, all dependent downstream instructions too must await the arrival of their data operands prior to the start of execution. However, a positive effect of this design choice is a reduction in reconfiguration overhead. An instruction executes as soon as its data operands are available and does not await the arrival of all instructions. The presence of a stateless network (i.e. requiring no programming) makes the network always ready to transport packets. The combination of these factors helps us reduce the reconfiguration overhead considerably.

• Larger computation units are preferred when communication cost is higher: Network latency determines which instruction partition strategy would be effective.

As the cost of network communication increases, better performance is achieved by minimizing communication. In order to minimize communication, large groups of closely interacting instructions are placed within the same computation unit. The choice of computation unit determines the efficiency of execution: if a large number of instructions are assigned to the same computation unit, a wider-issue computation unit will be better able to exploit the available parallelism. Therefore, the design of the computation unit is highly dependent on the network latency. It follows that the design of the computation unit must be complete enough to meet the specific requirements of that particular interconnection network, replete with a bypass network and other aids to achieve good performance. In other words, as mentioned in Emer et al. (2007), single computation unit (core) performance is very important. A back-of-the-envelope sketch of this trade-off between communication cost and computation unit width follows this list.

• Short memory latency is critical to performance: Placing data memory banks at the periphery of the reconfigurable fabric does not yield good performance. Memory access in most cases must complete in 1-2 clock cycles.

• Domain Specialization is the key to power and performance efficiency: In this thesis, we did not delve too much into the importance of domain specialization of the reconfigurable fabric. However, there are a few lessons from some of our experiments. Custom function units (a.k.a. the refrigerators referred to by Prof. Patt in Emer et al. (2007)) are key to achieving domain specialization. The design space for custom FUs is very large. In the case of AES encryption or decryption, the most obvious choice of custom function unit is an ASIC for the algorithm. An alternative would be to create a custom function unit which implements one iteration of the algorithm; such a choice would make it possible to shift between the 128, 192 and 256 bit modes. Another choice is to use an accelerator for specific steps of the AES algorithm such as the S-box or Mix Round Keys. All of these choices are application-specific and not domain-specific. A non-obvious choice would be to employ a field multiplier and express the S-box and Mix Round Key operations in terms of a field multiplication. This choice allows us to accelerate other applications such as ECPA and ECPD. We employ a field multiplier due to its wider applicability. In embedded systems, the choice of custom function units is sometimes driven by the throughput requirements of the specific application, which leads to a pruning of the design space.

Making the right choice among the various design points can be based on the energy efficiency of the specialized unit. Employing highly specialized units enables us to shut these units off when the application is not active, and reduces the execution time when the application is active. On the other hand, a more generalized unit may have wider applicability and hence reduce the running time of various applications (though to a lesser degree for the specific application), while the time for which it can be shut off is also lower. The choice that leads to lower energy is preferable.
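The trade-off stated in the third lesson can be caricatured with a simple cost model. The sketch below is entirely back-of-the-envelope and not drawn from the thesis experiments; all parameter values are invented. It only illustrates why, as the per-hop latency grows, packing dependent instructions into fewer, wider units wins.

    # Back-of-the-envelope model of the computation-unit width vs. network latency trade-off.
    # All parameters are illustrative, not measured values from the thesis.

    def est_time(num_instr, num_units, issue_width, ilp, cross_unit_edges, hop_latency):
        # Issue throughput is capped by the aggregate issue width and the available ILP;
        # every producer-consumer edge on the critical path that crosses units pays a hop.
        compute = num_instr / min(num_units * issue_width, ilp)
        return compute + cross_unit_edges * hop_latency

    ilp, n = 8, 64
    for hop in (1, 2, 4):
        many_small = est_time(n, num_units=8, issue_width=1, ilp=ilp,
                              cross_unit_edges=6, hop_latency=hop)
        one_wide = est_time(n, num_units=1, issue_width=4, ilp=ilp,
                            cross_unit_edges=0, hop_latency=hop)
        print(f"hop latency {hop}: 8 narrow units -> {many_small:.0f} cycles, "
              f"1 wide unit -> {one_wide:.0f} cycles")

With a one-cycle hop the narrow units win (14 vs 16 cycles in this toy example); at two or more cycles per hop the single wide unit wins, which is the qualitative behaviour described in the lesson above.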

8.1 Future Work

Some major features and experiments which can be taken up in future to further enhance the CGRA are:

• Macro-Dataflow Orchestration Unit: The compiler controlled orchestration unit can be extended such that the current limitation of 128 context memory locations can be raised to much larger numbers by spilling older entries of the stack on to global memory (in a manner similar to the way the Register Stack Engine is implemented on the Itanium processor (Sharangpani and Arora, 2000)).

• Integrating Context Memory with Data Memory: The context memory and data memory may be merged into the same location, with the introduction of two types of stores: a normal store, which writes data to a memory location, and a context update store, which increments a count prior to writing data to a memory location. This allows a potentially large amount of context information to be stored per HyperOp.

• A framework for energy determination: In order to determine the dynamic energy consumption, it is necessary to determine the activity factor due to the execution of the various HyperOps. To do this, we need to reconstruct the active and inactive periods of the various modules from the execution trace. Using the static power estimates generated by Synopsys Design Vision, it is then possible to determine the energy consumption by appropriately combining the activity factor, the duration of execution and the estimated frequency of operation.

Such a framework will help in evaluating the energy efficiency of various domain specializations (a small illustrative sketch of this style of energy accounting follows this list).

• Experiment to determine the best HyperOp size, fabric size and computation unit configuration: The size of the reconfigurable fabric is 6 × 6 in all experiments. This choice was primarily motivated by its applicability to honeycomb topologies (the size was neither too small nor too big; only multiples of three are permitted). The size of the HyperOp (i.e. the number of instructions within it) is currently set at the maximum number of instructions that can be accommodated on the fabric (in configuration III; refer to chapter 7). Having large HyperOps reduces the extent of off-fabric communication. However, due to the limited number of instructions that can be mapped to a single computation unit, the communication between instructions of a HyperOp may be spread over a large area, leading to inefficient execution. The aim of this experiment is to determine the best configuration from a performance and energy efficiency perspective.
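As a small illustration of the energy bookkeeping proposed above (ours; the module names, power figures and frequency are placeholders, and the activity factor is folded into the per-module active cycle counts that would come from the execution trace):

    # Illustrative per-module energy estimate from an execution trace.
    # Power figures (mW) and active windows (cycles) are made-up placeholders.

    def module_energy(power_mw, active_cycles, freq_mhz):
        """Energy in nanojoules: P * t, with t = cycles / f (mW * us = nJ)."""
        time_us = active_cycles / freq_mhz     # cycles / (cycles per microsecond)
        return power_mw * time_us

    estimated_freq_mhz = 500                   # assumed operating frequency
    trace = {                                  # module -> (est. power mW, active cycles)
        "compute_unit_0": (2.5, 12_000),
        "router_0":       (0.8,  9_500),
        "load_store_0":   (1.6,  7_200),
    }

    total_nj = sum(module_energy(p, c, estimated_freq_mhz) for p, c in trace.values())
    print(f"estimated energy over the trace: {total_nj:.1f} nJ")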

8.2 Directions for Future Research

From the conclusions listed above it is evident that future CGRAs must employ either:

• smaller cores with lower issue width, which employ richer and faster interconnections, or

• larger cores with higher issue width, where the latency of communication can be high.

The use of a richer, faster interconnection, in the former solution, comes at the cost of higher power dissipation. A possible solution to this problem is the use of hierarchical interconnections, where faster, richer communication is used within a small group of processor cores and the groups of processors themselves are interconnected by a (relatively) higher latency interconnection network. Another interesting question that needs to be answered is what constitutes a small core and what constitutes a large core. From our experience we believe that the smallest possible core must be larger than the computation unit employed in our CGRA; examples of such a small core include the MIPS R2000 and the OpenRISC 1000 processor. The largest possible core must be smaller than a modern day GPP, so that it remains possible to exploit DLP and pipeline parallelism. We need to evaluate these choices for better application performance and better energy efficiency.

It was mentioned in the conclusion that the data memory banks must be placed closer to the computation units in order to minimize the data access latency. Two possible solutions have been explored in the literature, namely: 1. a fabric with alternating compute and memory units, as employed in Ambric (Butts, 2007); 2. a data cache exclusively connected to each computation unit, as employed in Raw (Waingold et al., 1997). The first solution seems quite attractive, as many processors will be able to access the memory directly. This memory can be used to pass large data which may need to be exchanged between two processors during execution, or as a store for shared data. However, the downside is that there can be contention for access to the memory. The second solution, on the other hand, does not have any contention for access; however, accessing shared data may lead to non-uniform latencies for different processors. In some cases, the data may need to be duplicated.

The Road goes ever on and on
Out from the door where it began.
Now far ahead the Road has gone,
Let others follow it who can!
Let them a journey new begin,
But I at last with weary feet
Will turn towards the lighted inn,
My evening-rest and sleep to meet.

- J R R Tolkien in The Lord of the Rings

Appendix

In this section we tabulate the details of all configurations so as to serve as an easy reference for the reader/reviewer.

Table 8.1: Summary of the various simulation configurations.

Feature                                 Base Config   Config I        Config II       Config III      Config IV
No. of instructions per compute unit    16            16              32              32              32
Instruction Sequencing                  Dataflow      In order        In order        In order        In order
Pipeline Depth of computation unit      4             2               2               2               2
Load-Store Instruction placement        any           chained units   chained units   chained units   chained units
                                                      per alias set   per alias set   per alias set   per alias set
Fabric Size                             5 × 6         5 × 6           5 × 6           5 × 6           5 × 6
Resident Loop Support                   No            No              No              No              Yes
Instruction and Data load sequence      Sequential    Sequential      Sequential      Interleaved     Interleaved
Orchestrator                            Hardware      Compiler        Compiler        Compiler        Compiler
                                        Control       Control         Control         Control         Control

References

Agarwal, A., Amarasinghe, S., Barua, R., Frank, M., Lee, W., Sarkar, V., Srikrishna, D., and Taylor, M. (1997). “The RAW compiler project.” In “Proceedings of the Second SUIF Compiler Workshop,” pages 21–23. Citeseer.

Agha, G. (1986). Actors: a model of concurrent computation in distributed systems. MIT Press, Cambridge, MA, USA.

Alle, M. (2012). Compiling for Coarse-Grained Reconfigurable Architectures based on dataflow execution paradigm. Ph.D., Indian Institute of Science.

Alle, M., Varadarajan, K., Fell, A., Nandy, S. K., and Narayan, R. (2009). “Compiling Techniques for Coarse Grained Runtime Recon- figurable Architectures.” In “Proceedings of the 5th International Workshop on Reconfigurable Computing: Architectures, Tools and Applications,” pages 204–215. Springer-Verlag, Berlin, Heidelberg.

Amano, H. (2006). “A survey on dynamically reconfigurable processors.” IEICE transactions on communications, E89-B(12): 3179–3187. URL http://search.ieice.org/bin/pdf_link.php? category=B&lang=E&year=2006&fname=e89-b_12_ 3179&abst=.

Amano, H., Inuo, T., Kami, H., Fujii, T., and Suzuki, M. (2004). “Techniques for Virtual Hardware on a Dynamically Reconfigurable Processor - An Approach to Tough Cases -.” In “Field Programmable Logic,” pages 464–473. URL http://www.springerlink.com/index/hp384j4qfc7867je.pdf.

Arvind, Nikhil, R. S., and Pingali, K. K. (1989). “I-structures: data structures for parallel computing.” ACM Trans. Program. Lang. Syst., 11(4): 598–632. URL http://doi.acm.org/10.1145/69558.69562.

Arvind, K. and Culler, D. E. (1986). Dataflow architectures, pages 225–253. Annual Reviews Inc., Palo Alto, CA, USA. URL http: //portal.acm.org/citation.cfm?id=17814.17824. Arvind, K. and Nikhil, R. S. (1990). “Executing a Program on the MIT Tagged-Token Dataflow Architecture.” IEEE Trans. Comput., 39: 300–318. URL http://dx.doi.org/10.1109/12.48862. Asanovic, K., Bodik, R., Catanzaro, B., Gebis, J., Husbands, P., Keutzer, K., Patterson, D., Plishker, W., Shalf, J., Williams, S., and Yelick, K. (2006). “The Landscape of Parallel Computing Re- search: A View from Berkeley.” Technical report, University of Cali- fornia, Berkeley, Berkeley, CA, USA. URL http://www.citeulike. org/user/hawkestein/article/1099472. Bachmann, O., Wang, P., and Zima, E. V. (1994). “Chains of Recur- rences - a method to expedite evaluation of closed-form functions.” In “International Symposium on Symbolic and algebraic computa- tion,” pages 242–249. ACM. URL http://dl.acm.org/citation. cfm?doid=190347.190423. Biswas, P., Varadarajan, K., Alle, M., Nandy, S. K., and Narayan, R. (2010). “Design space exploration of systolic realization of QR factorization on a runtime reconfigurable platform.” In “ICSAMOS,” pages 265–272. Borkar, S. and Chien, A. A. (2011). “The future of .” Communications of the ACM, 54(5): 67–77. Bulens, P., Standaert, F., Quisquater, J.-J., Pellegrin, P., and Rouvroy, G. (2008). “Implementation of the AES-128 on Virtex- 5 FPGAs.” In “Proceedings of the Cryptology in Africa 1st international conference on Progress in cryptology (AFRICAC- RYPT’08),” pages 16–26. URL http://www.springerlink.com/ index/6842r1g274l47575.pdf. Burger, D., Keckler, S., McKinley, K., Dahlin, M., John, L., Lin, C., Moore, C., Burrill, J., McDonald, R., and Yoder, W. (2004). “Scaling to the end of silicon with EDGE architectures.” Com- puter, 37(7): 44–55. URL http://ieeexplore.ieee.org/lpdocs/ epic03/wrapper.htm?arnumber=1310240. Butts, M. (2007). “Synchronization through communication in a massively parallel processor array.” Micro, IEEE, 27(5): 32–40. URL http://ieeexplore.ieee.org/xpls/abs_all.jsp? arnumber=4378781. References 145

Cardoso, J. and Weinhardt, M. (2002). “XPP-VC: A C compiler with temporal partitioning for the PACT-XPP architecture.” In “Field-Programmable Logic and Applications,” URL http://www.springerlink.com/index/2rwnvdfwv79wev9u.pdf.

Chaves, R., Kuzmanov, G., Sousa, L., and Vassiliadis, S. (2006). “Rescheduling for optimized SHA-1 calculation.” In “Proceed- ings of the 6th international conference on Embedded Com- puter Systems: architectures, Modeling, and Simulation (SAMOS ’06),” pages 425–434. URL http://www.springerlink.com/index/ 44604332P3308521.pdf.

Chaves, R., Sousa, L., Kuzmanov, G., and Vassiliadis, S. (2008). “Polymorphic AES Encryption Implementation.” URL http://ce.et. tudelft.nl/publicationfiles/1095_657_AES_molen.pdf.

Choudhary, N. (2011). Resource Efficient Enhanced Floating Point Functional Unit For REDEFINE. Master of technology, Indian Institute of Science.

Compton, K. and Hauck, S. (2002). “Reconfigurable computing: a survey of systems and software.” ACM Computing Surveys, 34(2): 171–210. URL http://portal.acm.org/citation.cfm? doid=508352.508353.

Corporaal, H. (1997). Microprocessor Architectures: From VLIW to TTA. John Wiley & Sons, Inc., New York, NY, USA.

Culler, D., Goldstein, S., Schauser, K., and von Eicken, T. (1993). “TAM - A Compiler Controlled Threaded Abstract Machine.” Journal of Parallel and Distributed Computing, 18(3): 347–370. URL http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.71.7242&rep=rep1&type=pdf.

Culler, D. E. (1985). “Resource Management for the Tagged Token Dataflow Architecture.” Technical report, Massachusetts Institute of Technology, Cambridge, MA, USA.

Daemen, J. and Rijmen, V. (2002). The Design of Rijndael: AES - The Advanced Encryption Standard. Springer-Verlag, Heidelberg, Germany.

Das, S., Varadarajan, K., Garga, G., Mondal, R., Narayan, R., and Nandy, S. K. (2011). “A Method for Flexible Reduction over Binary Fields using a Field Multiplier.” In “SECRYPT,” pages 50–58.

DeHon, A. and Wawrzynek, J. (1999). “Reconfigurable com- puting: what, why, and implications for design automation.” In “Design Automation Conference,” Section 7, pages 610– 615. Ieee. URL http://ieeexplore.ieee.org/lpdocs/epic03/ wrapper.htm?arnumber=782016. Emer, J., Hill, M., Patt, Y., Yi, J., Chiou, D., and Sendag, R. (2007). “Single-threaded vs. multithreaded: Where should we focus?” Mi- cro, IEEE, 27(6): 14–24. URL http://ieeexplore.ieee.org/xpls/ abs_all.jsp?arnumber=4437716. Fell, A. (2012). RECONNECT: A flexible Router Architecture for Network-on-Chips. Ph.d., Indian Institute of Science. Fell, A., Alle, M., Varadarajan, K., Biswas, P., Das, S., Chetia, J., Nandy, S. K., and Narayan, R. (2009a). “Streaming FFT on REDEFINE-v2: an application-architecture design space explora- tion.” In “Proceedings of the 2009 international conference on , architecture, and synthesis for embedded systems,” CASES ’09, pages 127–136. ACM, New York, NY, USA. URL http://doi.acm.org/10.1145/1629395.1629414. Fell, A., Biswas, P., Chetia, J., Nandy, S., and Narayan, R. (2009b). “Generic routing rules and a scalable access en- hancement for the Network-on-Chip RECONNECT.” In “2009 IEEE International SOC Conference (SOCC),” Vc, pages 251– 254. IEEE. URL http://ieeexplore.ieee.org/lpdocs/epic03/ wrapper.htm?arnumber=5398048. Goto, E. and Wong, W. f. (1995). “Fast Evaluation of the Elementary Functions in Single Precision.” IEEE Trans. Comput., 44: 453–457. URL http://dx.doi.org/10.1109/12.372037. Halfhill, T. R. (2006). “Ambric’s New Parallel Processor.” Micropro- cessor Report, pages 1–9. Hartenstein, R. (2001). “A decade of reconfigurable computing: a visionary retrospective.” In “Proceedings Design, Automation and Test in Europe. Conference and Exhibition 2001,” pages 642–649. IEEE Comput. Soc. Hemmert, K. and Underwood, K. (2005). “An Analysis of the Double-Precision Floating-Point FFT on FPGAs.” In “13th Annual IEEE Symposium on Field-Programmable Custom Computing Ma- chines (FCCM’05),” pages 171–180. Ieee. URL http://ieeexplore. ieee.org/lpdocs/epic03/wrapper.htm?arnumber=1508537. References 147

Hennessy, J. L. and Patterson, D. A. (2003). Computer Architecture: A Quantitative Approach. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 3 edition. Hewitt, C. (1977). “Viewing control structures as patterns of passing messages.” Artificial Intelligence, 8(3): 323– 364. URL http://www.sciencedirect.com/science/article/ B6TYF-4810S9X-W/2/517cb5846d1a1160d2cac1dfca757f5b. Hill, M. D. and Marty, M. R. (2008). “Amdahl’s Law in the Multicore Era.” Computer, 41(7): 33–38. URL http://ieeexplore.ieee.org/ lpdocs/epic03/wrapper.htm?arnumber=4563876. Iannucci, R. A. (1988). “Toward a dataflow/von Neumann hybrid architecture.” In “Proceedings of the 15th Annual International Symposium on Computer architecture,” ISCA ’88, pages 131–140. IEEE Computer Society Press, Los Alamitos, CA, USA. URL http: //portal.acm.org/citation.cfm?id=52400.52416. Janneck, J. W. (2003). “Actors and their Composition.” Formal Aspects of Computing, 15(4): 349–369. URL http: //www.springerlink.com/openurl.asp?genre=article&id=doi: 10.1007/s00165-003-0016-3. Karatsuba, A. and Ofman, Y. (1963). “Multiplication of Many- Digital Numbers by Automatic Computers.” Proceedings of the USSR Academy of Sciences, 145: 293–294. Kongetira, P., Aingaran, K., and Olukotun, K. (2005). “Niagara : A 32-Way Multithreaded SPARC Processor.” IEEE Micro, pages 21–29. Krishnamoorthy, R. (2010). Compiler Optimizations for Coarse Grained Reconfigurable Architectures. Ph.d., The University of Tokyo, Japan. Krishnamoorthy, R., Das, S., Varadarajan, K., Alle, M., Fujita, M., Nandy, S., and Narayan, R. (2011a). “Data Flow Graph Partition- ing Algorithms and Their Evaluations for Optimal Spatio-temporal Computation on a Coarse Grain Reconfigurable Architecture.” IPSJ Transactions on System LSI Design Methodology, 4(0): 193–209. Krishnamoorthy, R., Varadarajan, K., Fujita, M., and Nandy, S. K. (2011b). “Interconnect-Topology Independent Mapping algorithm for a Coarse Grained Reconfigurable Architecture.” In “Proceedings of the International Conference on Field Programmable Technology,” FPT ’11. IEEE Computer Society, Washington, DC, USA. 148 References

Krishnamoorthy, R., Varadarajan, K., Garga, G., Alle, M., Nandy, S. K., Narayan, R., and Fujita, M. (2010). “Towards minimizing execution delays on dynamically reconfigurable processors: a case study on REDEFINE.” In “Proceedings of the 2010 international conference on Compilers, architectures and synthesis for embedded systems,” CASES ’10, pages 77–86. ACM, New York, NY, USA. URL http://doi.acm.org/10.1145/1878921.1878935. Lattner, C. and Adve, V. (2004). “LLVM: A Compilation Framework for Lifelong Program Analysis & Transformation.” In “Proceedings of the international symposium on Code generation and optimization: feedback-directed and runtime optimization,” CGO ’04, pages 75–. IEEE Computer Society, Washington, DC, USA. URL http://dl.acm. org/citation.cfm?id=977395.977673. Lee, B. and Hurson, A. (1993). “Issues in dataflow com- puting.” Advances in computers, 37: 285–333. URL http://citeseerx.ist.psu.edu/viewdoc/download?doi=10. 1.1.77.3908&rep=rep1&type=pdf. Lee, E.-H., Lee, J.-H., Park, I.-H., and Cho, K.-R. (2009). “Imple- mentation of high-speed SHA-1 architecture.” IEICE Electronics Express, 6(16): 1174–1179. URL http://joi.jlc.jst.go.jp/JST. JSTAGE/elex/6.1174?from=CrossRef. Lee, W., Barua, R., Frank, M., Srikrishna, D., Babb, J., Sarkar, V., and Amarasinghe, S. (1998). “Space-time scheduling of instruction-level parallelism on a raw machine.” SIGOPS Oper. Syst. Rev., 32: 46–57. URL http://doi.acm.org/10.1145/384265. 291018. Mahlke, S., Lin, D., Chen, W., Hank, R., and Bringmann, R. (1992). “Effective Compiler Support For Predicated Execution Using The Hyperblock.” In “[1992] Proceedings the 25th Annual International Symposium on MICRO 25,” pages 45–54. IEEE Comput. Soc. Press. URL http://ieeexplore.ieee.org/lpdocs/ epic03/wrapper.htm?arnumber=696999. Mei, B., Lambrechts, A., Verkest, D., Mignolet, J.-Y., and Lauwereins, R. (2005). “Architecture Exploration for a Re- configurable Architecture Template.” IEEE Design and Test of Computers, 22(2): 90–101. URL http://ieeexplore.ieee.org/ lpdocs/epic03/wrapper.htm?arnumber=1413142. Mei, B., Vernalde, S., Verkest, D., De Man, H., and Lauwereins, R. (2002). “DRESC : A Retargetable Compiler for Coarse-Grained References 149

Reconfigurable Architectures.” In “IEEE International Conference on Field-Programmable Technology, (FPT).”, pages 166–173. IEEE Com- puter Society. URL http://ieeexplore.ieee.org/xpl/freeabs_ all.jsp?arnumber=1188678. Mei, B., Vernalde, S., Verkest, D., De Man, H., and Lauwereins, R. (2003). “ADRES: An architecture with tightly coupled VLIW processor and coarse-grained reconfigurable matrix.” In “Field- Programmable Logic and Applications,” pages 61–70. Springer. URL http://www.springerlink.com/index/03YT3XEH60R8971K.pdf. Menezes, A. J. (1994). Elliptic Curve Public Key Cryptosystems. Kluwer Academic Publishers, Norwell, MA, USA. National Institute of Standards and Technology (2002). “Se- cure Hash Standard.” Technical report, US Department of Commerce. URL http://csrc.nist.gov/publications/fips/ fips180-2/fips180-2withchangenotice.pdf. Papadopoulos, G. M., Boughton, G. A., Greiner, R., and Beckerle, M. J. (1993). “*T: integrated building blocks for parallel computing.” In “Proceedings of the 1993 ACM/IEEE conference on Supercom- puting,” Supercomputing ’93, pages 624–635. ACM, New York, NY, USA. URL http://doi.acm.org/10.1145/169627.169811. Papadopoulos, G. M. and Culler, D. E. (1998). “Monsoon: an ex- plicit token-store architecture.” In “25 years of the international symposia on Computer architecture (selected papers),” ISCA ’98, pages 398–407. ACM, New York, NY, USA. URL http://doi.acm. org/10.1145/285930.285999. Park, H., Park, Y., and Mahlke, S. (2009). “Polymorphic pipeline array: a flexible multicore accelerator with virtualized execution for mobile multimedia applications.” In “Proceedings of the 42nd Annual IEEE/ACM Symposium on Microarchitecture (MICRO’09),” URL http://portal.acm.org/citation.cfm?id=1669160. Park, Y., Park, H., and Mahlke, S. (2010). “Resource recycling: putting idle resources to work on a composable accelerator.” In “Proceedings of the 2010 international conference on Compilers, architectures and synthesis for embedded systems,” pages 21–30. URL http://portal.acm.org/citation.cfm?id=1878925. Paulin, P. G. and Knight, J. P. (1989). “Scheduling and Binding Algorithms for High-Level Synthesis.” In “26th ACM/IEEE Design Automation Conference,” pages 1–6. ACM Press. 150 References

Pop, S., Cohen, A., and Silber, G.-A. (2005). “Induction variable analysis with delayed abstractions.” In “High Performance Embed- ded Architectures,” pages 1–22. URL http://www.springerlink. com/index/bk24451u71q85616.pdf. Prashank, N. T., Prasadarao, M., Dutta, A., Varadarajan, K., Alle, M., Nandy, S. K., and Narayan, R. (2010). “Enhancements for variable N-point streaming FFT/IFFT on REDEFINE, a runtime re- configurable architecture.” In “ICSAMOS,” pages 178–184. Putnam, A., Smith, A., and Burger, D. (2011). “Dynamic vectoriz- ation in the E2 dynamic multicore architecture.” SIGARCH Com- puter Architecture, 38(4): 27–32. URL http://portal.acm.org/ citation.cfm?id=1926373. Rahman, A., Das, S., and Tuan, T. (2006). “Determination of power gating granularity for FPGA fabric.” In “Custom Integrated Circuits Conference,” Cicc, pages 9–12. IEEE Computer Society. URL http: //ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=4114899. Sankaralingam, K., Nagarajan, R., Liu, H., Kim, C., Huh, J., Bur- ger, D., Keckler, S. W., and Moore, C. R. (2003). “Exploiting ILP, TLP, and DLP with the polymorphous TRIPS architecture.” In “ACM SIGARCH Computer Architecture News,” volume 31, page 422. URL http://portal.acm.org/citation.cfm?id=871656.859667. Sarkar, V. and Hennessy, J. (1986). “Partitioning parallel programs for macro-dataflow.” In “Proceedings of the 1986 ACM conference on LISP and functional programming,” LFP ’86, pages 202–211. ACM, New York, NY, USA. URL http://doi.acm.org/10.1145/ 319838.319863. Sato, T., Watanabe, H., and Shiba, K. (2005). “Implementation of dynamically reconfigurable processor DAPDNA-2.” In “2005 IEEE VLSI-TSA International Symposium on VLSI Design, Automation and Test, 2005.(VLSI-TSA-DAT).”, pages 323–324. IEEE. URL http: //ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1500086. Satrawala, A., Varadarajan, K., Lie, M., Nandy, S., and Narayan, R. (2007). “Redefine: Architecture of a soc fabric for runtime composition of computation structures.” In “Field Program- mable Logic and Applications, 2007. FPL 2007. International Conference on,” pages 558–561. IEEE. URL http://ieeexplore. ieee.org/lpdocs/epic03/wrapper.htm?arnumber=4380716http: //ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=4380716. References 151

Satrawala, a. N. and Nandy, S. K. (2009). “RETHROTTLE: Exe- cution throttling in the REDEFINE SoC architecture.” In “2009 International Symposium on Systems, Architectures, Modeling, and Simulation,” June, pages 82–91. Ieee. URL http://ieeexplore. ieee.org/lpdocs/epic03/wrapper.htm?arnumber=5289245. Sharangpani, H. and Arora, H. (2000). “Itanium Processor Microar- chitecture.” Micro, IEEE, 20(5): 24–43. URL http://ieeexplore. ieee.org/xpls/abs_all.jsp?arnumber=877948. Sima, M., Cotofana, S., van Eijndhoven, J. T. J., Vassiliadis, S., and Vissers, K. (2001). “An 8x8 IDCT Implementation on an FPGA-augmented TriMedia.” In “Field-Programmable Custom Com- puting Machines (FCCM’01),” pages 160–169. Rohnert Park, Cali- fornia. URL http://ieeexplore.ieee.org/xpls/abs_all.jsp? arnumber=1420912. Stone, S., Haldar, J., Tsao, S., and Hwu, W.-m. (2008). “Acceler- ating advanced MRI reconstructions on GPUs.” Journal of Parallel and Distributed Computing, 68(10): 1307–1318. URL http://www. sciencedirect.com/science/article/pii/S0743731508000919. Sugawara, T., Ide, K., and Sato, T. (2004). “Dynamically Recon- figuable Processor Implemented with IPFlex’s DAPDNA Technology.” IEICE Transaction on Informations and Systems, E87-D(8): 1997– 2003. Suzuki, N., Kurotaki, S., Suzuki, M., Kaneko, N., Yamada, Y., Deguchi, K., Hasegawa, Y., Amano, H., Anjo, K., Motomura, M., Wakabayashi, K., Toi, T., and Awashima, T. (2004). “Im- plementing and Evaluating Stream Applications on the Dynamic- ally Reconfigurable Processor.” In “12th Annual IEEE Symposium on Field-Programmable Custom Computing Machines,” pages 328– 329. Ieee. URL http://ieeexplore.ieee.org/lpdocs/epic03/ wrapper.htm?arnumber=1364662. Swanson, S., Michelson, K., Schwerin, A., and Oskin, M. (2003). “WaveScalar.” In “Proceedings of the 36th annual IEEE/ACM Interna- tional Symposium on Microarchitecture,” page 291. IEEE Computer Society. URL http://portal.acm.org/citation.cfm?id=956417. 956546. Swanson, S., Schwerin, A., Mercaldi, M., Petersen, A., Putnam, A., Michelson, K., Oskin, M., and Eggers, S. (2007). “The Wavescalar Architecture.” ACM Transactions on Computer Systems (TOCS), 25(2): 152 References

1–54. URL http://portal.acm.org/citation.cfm?id=1233307. 1233308. Technologies, X. (2006a). “XPP Technologies Programming XPP-III Processors White Paper.” URL http://www.pactxpp.com/download/ XPP-III_programming_WP.pdf. Technologies, X. (2006b). “XPP Technologies Reconfiguration on XPP-III Processors White Paper.” URL http://www.pactxpp.com/ download/XPP-III_reconfiguration_WP.pdf. Technologies, X. (2006c). “XPP Technologies XPP-III Processor Overview White Paper.” URL http://www.pactxpp.com/download/ XPP-III_overview_WP.pdf. Toi, T., Nakamura, N., Kato, Y., Awashima, T., Wakabayashi, K., and Jing, L. (2006). “High-Level Synthesis Challenges and Solu- tions for a Dynamically Reconfigurable Processor.” In “IEEE/ACM International Conference on Computer Aided Design,” pages 702– 708. Ieee. URL http://ieeexplore.ieee.org/lpdocs/epic03/ wrapper.htm?arnumber=4110255. Waingold, E., Taylor, M., Srikrishna, D., Sarkar, V., Lee, W., Lee, V., Kim, J., Frank, M., Finch, P., Barua, R., Babb, J., Amarasinghe, S., and Agarwal, a. (1997). “Baring it all to software: Raw ma- chines.” Computer, 30(9): 86–93. URL http://ieeexplore.ieee. org/lpdocs/epic03/wrapper.htm?arnumber=612254. Wilson, R., French, R., Wilson, C., Amarasinghe, S., Anderson, J., Tjiang, S., Liao, S., Tseng, C., Hall, M., Lam, M., and Hennessy, J. (1994). “The SUIF Compiler System: a Parallelizing and Optimizing Research Compiler.” Technical report, Stanford University, Stanford, CA, USA. Publications

Alle, M., Varadarajan, K., Fell, A., Nandy, S. K., and Narayan, R. (2008a). “Compiling Techniques for Coarse Grained Runtime Reconfigurable Architectures.” In “ARC’09: Proceedings of the 5th IEEE International Workshop on Applied Reconfigurable Computing,” Contribution:Sections 2.3 and 2.4 that describe tag-generation for dynamic dataflow and support for pipelining HyperOps respectively. Also Section 3.3 that presents the corresponding results.

Alle, M., Varadarajan, K., Fell, A., C., R. R., Joseph, N., Das, S., Biswas, P., Chetia, J., Rao, A., Nandy, S. K., and Narayan, R. (2009). “REDEFINE: Runtime reconfigurable polymorphic ASIC.” ACM Trans. Embed. Comput. Syst., 9: 11:1–11:48. URL http://doi. acm.org/10.1145/1596543.1596545. Contribution:Section 1 that presents the philosophy of the work, Section 2 that describes the microarchitecture, Section 4.3 that describes implementation of few modules, Section 5.2 that presents synthesis results.

Alle, M., Varadarajan, K., Joseph, N., Reddy, C. R., Fell, A., Nandy, S. K., and Narayan, R. (2008b). “Synthesis of Application Ac- celerators on Runtime Reconfigurable Hardware.” In “ASAP ’08: Proceedings of the 19th IEEE International Conference on Applic- ation specific Systems, Architectures and Processors,” Contribu- tion:Section 2 that describes the architecture of the CGRA and part of section 4 that presents results.

Biswas, P., Udupa, P. P., Mondal, R., Varadarajan, K., Alle, M., Nandy, S. K., and Narayan, R. (2010a). “Accelerating Numerical Linear Algebra Kernels on a Scalable Run Time Reconfigurable Plat- form.” In “Proceedings of the 2010 IEEE Annual Symposium on VLSI,” ISVLSI ’10, pages 161–166. IEEE Computer Society, Washing- ton, DC, USA. URL http://dx.doi.org/10.1109/ISVLSI.2010.65.

Biswas, P., Varadarajan, K., Alle, M., Nandy, S., and Narayan, R. (2010b). “Design space exploration of systolic realization of QR

factorization on a runtime reconfigurable platform.” In “Embedded Computer Systems (SAMOS), 2010 International Conference on,” pages 265 –272.

Das, S., Varadarajan, K., Garga, G., Mondal, R., Narayan, R., and Nandy, S. K. (2011). “A Method for Flexible Reduction over Binary Fields using a Field Multiplier.” In “SECRYPT,” pages 50–58. Contribution:The design of the multiplier was jointly proposed by the first two authors.

Fell, A., Alle, M., Varadarajan, K., Biswas, P., Das, S., Chetia, J., Nandy, S. K., and Narayan, R. (2009). “Streaming FFT on REDEFINE-v2: an application-architecture design space explora- tion.” In “Proceedings of the 2009 international conference on Compilers, architecture, and synthesis for embedded systems,” CASES ’09, pages 127–136. ACM, New York, NY, USA. URL http: //doi.acm.org/10.1145/1629395.1629414. Contribution:Design of components described in section 2 and Section 3 that describes design for support of custom functional units and persistent Hyper- Ops.

Joseph, N., Reddy, C. R., Varadarajan, K., Alle, M., Fell, A., Nandy, S. K., and Narayan, R. (2008). “RECONNECT: A NoC for polymorphic ASICs using a Low Overhead Single Cycle Router.” In “ASAP ’08: Proceedings of the 19th IEEE International Conference on Application specific Systems, Architectures and Processors,” Contribution:I was involved in the design of the NoC to suit the requirements of the architecture. The implementation of the NoC was done by Nimmy Joseph and C. Ramesh Reddy.

Krishnamoorthy, R., Das, S., Varadarajan, K., Alle, M., Fujita, M., Nandy, S. K., and Narayan, R. (2011a). “Data Flow Graph Partitioning Algorithms and their Evaluations for Optimal Spatio- Temporal Computation on a Coarse Grain Reconfigurable Architec- ture.” IPSJ Transactions on System LSI Design Methodology (TSLDM7). Contribution:Necessary architectural support and involved in dis- cussions to evaluate the suitability of various choices in the context of our framework.

Krishnamoorthy, R., Varadarajan, K., Fujita, M., Alle, M., Nandy, S. K., and Narayan, R. (2011b). “Dataflow graph partitioning for optimal spatio-temporal computation on a coarse grain reconfigurable architecture.” In “Proceedings of the 7th international conference on Reconfigurable computing: architectures, tools

and applications,” ARC’11, pages 125–132. Springer-Verlag, Berlin, Heidelberg. URL http://dl.acm.org/citation.cfm?id=1987535. 1987554. Contribution:Necessary architectural support and in- volved in discussions to evaluate the suitability of various choices in the context of our framework. Krishnamoorthy, R., Varadarajan, K., Fujita, M., and Nandy, S. K. (2011c). “Interconnect-Topology Independent Mapping algorithm for a Coarse Grained Reconfigurable Architecture.” In “Proceedings of the International Conference on Field Programmable Technology,” FPT ’11. IEEE Computer Society, Washington, DC, USA. Contribu- tion:Integrated the proposed mapping algorithm into the current flow and helped with performing simulations. Krishnamoorthy, R., Varadarajan, K., Garga, G., Alle, M., Nandy, S. K., Narayan, R., and Fujita, M. (2010). “Towards minimizing execution delays on dynamically reconfigurable processors: a case study on REDEFINE.” In “Proceedings of the 2010 international conference on Compilers, architectures and synthesis for embed- ded systems,” CASES ’10, pages 77–86. ACM, New York, NY, USA. URL http://doi.acm.org/10.1145/1878921.1878935. Contribu- tion:Section 3.1 that describes Modified architecture to supports prefetch of HyperOps and allows speculative launch of HyperOps. Satrawala, A. N., Varadarajan, K., Alle, M., Nandy, S. K., and Narayan, R. (2007). “REDEFINE: Architecture of a SOC Fabric for Runtime Composition of Computation Structures.” In “FPL ’07: Proceedings of the International Conference on Field Programmable Logic and Applications,” Contribution:This is a paper that describes the philosophy of our work. This was the foundation for our work. All authors share equal contribution in this paper. Varadarajan, K., Alle, M., Narayan, R., and Nandy, S. K. (2013). “Dynamic Dataflow Scheduling in a Coarse-Grained Reconfigurable Architecture.” Contribution:Manuscript under preparation, to be submitted to an appropriate journal. The architectural philosophy and the sections pertaining to the architectural support are the first author’s contribution.