A Coarse Grained Reconfigurable Architecture Framework Supporting
A Coarse Grained Reconfigurable Architecture Framework supporting Macro-Dataflow Execution

A Thesis Submitted for the Degree of Doctor of Philosophy in the Faculty of Engineering

by Keshavan Varadarajan

Supercomputer Education and Research Centre
INDIAN INSTITUTE OF SCIENCE
BANGALORE – 560 012, INDIA
DECEMBER 2012

Keshavan Varadarajan 2012

To my grandfather Late H. Keshavachar

Acknowledgments

Before I start thanking people, I would like to state that this piece of paper can neither capture the extent of my gratitude nor the entire list of people to whom I am thankful. I name a few people who have directly helped me; there are many unnamed people who make it possible for us to work and attempt to make a meaningful contribution to society.

Dreams drive innovation and the dreamer-in-chief in this case was my guide: Prof. S K Nandy. I thank him for the dreams, for giving the initial impetus so that we could take it forward and for giving us the means to achieve it: monetary, intellectual, advisory and equipment. I would like to thank Dr. Ranjani Narayan for many things: helping me make it into this institution, the numerous discussions, patient paper reviews and steadfast belief in REDEFINE. I thank Dr. Balakrishnan Srinivasan for his timely and incisive comments that helped us perceive the architecture in a newer light and identify its shortcomings. I would like to thank Prof. Bharadwaj Amrutur for the numerous discussions on caches, and I am sorry that I took up his time without any results to show. His encouragement has worked wonders on me.

To Profs. R Govindarajan and Matthew Jacob, I express my most sincere gratitude. I learnt my first lessons of computer architecture from them. Subsequently, their belief in me has helped me sail the rough waters of the PhD. I thank Prof. R Govindarajan for helping me secure my scholarship in some of the most testing times I have ever faced. I would like to thank Prof.
Y C Tay of the National University of Singapore for the opportunity he provided to work with him. I thank Prof. Georgi Gaydadjiev for the encouragement he gave me and the wonderful opportunity to work with him in the Netherlands, which I could not take up.

Mythri Alle, I thank you wholeheartedly for several things. We started our PhD journeys together and now we end them nearly together. Without you, this PhD would have been very difficult or may not even have been possible. Next, to my dearest friend Rajdeep Mondal. Dude, without your belief that anything can be completed in time and your insightful comments, the Bluespec implementation would not have been possible. Prasenjit Biswas and Saptarsi Das, I thank you for the many discussions and technical inputs. I rely on these people until this very day for technical inputs. Sanjay Kumar, without you life would have been very boring in CAD lab. You have been most helpful innumerable times and the most reliable person when some task needed to be offloaded. Ganesh Garga, my sincere thanks for all the work you put into REDEFINE and thanks for being a friend in the initial years of my stay at IISc. I would also like to thank Alexander Fell for all the work he put into the NoC. To my other friends: Aparna Mandke, Basavaraj Talwar, Vishal Sharda, Ritesh Rajore, Ramesh Reddy and Nimmy Joseph, it was a pleasure working with you. Bharath "Amba" Ravikumar, Sujay Mysore, Swaroop Krishnamurthy and Poornima Hatti have been my friends through thick and thin. Without them the journey of the PhD would have been nigh impossible to complete.

Last but not the least, I express my deepest gratitude to my family members: my parents, Anta, Akka, Jiju, Kutti, Sampath, U-ma, K-pa, chitti, chittappa and Bhartu. I will not belittle their contribution by attempting to state it in words. Finally, I would like to thank the guiding light who spoke in many voices, some known and some unknown.
Abstract

A Coarse-Grained Reconfigurable Architecture (CGRA) is a processing platform which constitutes an interconnection of coarse-grained computation units (viz. Function Units (FUs), Arithmetic Logic Units (ALUs)). These units communicate directly, through send-receive like primitives, as opposed to the shared memory based communication used in multi-core processors. CGRAs are a well-researched topic and the design space of a CGRA is quite large. The design space can be represented as a 7-tuple (C, N, T, P, O, M, H) where each of the terms has the following meaning: C - choice of computation unit, N - choice of interconnection network, T - choice of number of context frames (single or multiple), P - presence of partial reconfiguration, O - choice of orchestration mechanism, M - design of memory hierarchy and H - host-CGRA coupling. In this thesis, we develop an architectural framework for a Macro-Dataflow based CGRA where we make the following choice for each of these parameters: C - ALU, N - Network-on-Chip (NoC), T - multiple contexts, P - support for partial reconfiguration, O - Macro-Dataflow based orchestration, M - data memory banks placed at the periphery of the reconfigurable fabric (reconfigurable fabric is the name given to the interconnection of computation units), H - loose coupling between host processor and CGRA, enabling our CGRA to execute an application independent of the host processor's intervention. The motivations for developing such a CGRA are:

• To execute applications efficiently through reduction in reconfiguration time (i.e. the time needed to transfer instructions and data to the reconfigurable fabric) and reduction in execution time through better exploitation of all forms of parallelism: Instruction Level Parallelism (ILP), Data Level Parallelism (DLP) and Thread/Task Level Parallelism (TLP). We choose a macro-dataflow based orchestration framework in combination with partial reconfiguration so as to ease exploitation of TLP and DLP. Macro-dataflow serves as a lightweight synchronization mechanism. We experiment with two variants of the macro-dataflow orchestration unit, namely the hardware controlled orchestration unit and the compiler controlled orchestration unit. We employ a NoC as it helps reduce the reconfiguration overhead.

• To permit customization of the CGRA for a particular domain through the use of domain-specific custom Intellectual Property (IP) blocks. This improves both application performance and energy efficiency.

• To develop a CGRA which is completely programmable and accepts any program written using the C89 standard. The compiler and the architecture were co-developed to ensure that every feature of the architecture could be automatically programmed through an application by a compiler.

In this CGRA framework, the orchestration mechanism (O) and the host-CGRA coupling (H) are kept fixed and we permit design space exploration of the other terms in the 7-tuple design space. The mode of compilation and execution remains invariant under these changes, hence it is referred to as a framework.

We now elucidate the compilation and execution flow for this CGRA framework. An application written in the C language is compiled and transformed into a set of temporal partitions, referred to as HyperOps in this thesis. The macro-dataflow orchestration unit selects a HyperOp for execution when all its inputs are available. The instructions and operands for a ready HyperOp are transferred to the reconfigurable fabric for execution. Each ALU (in the computation unit) is capable of waiting for the availability of the input data prior to issuing instructions. We permit the launch and execution of a temporal partition to progress in parallel, which reduces the reconfiguration overhead. We further cut launch delays by keeping loops persistent on the fabric, thus eliminating the need to launch their instructions again.
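The firing rule described above (a HyperOp becomes eligible only once all of its inputs are available) can be illustrated with a small software sketch. This is not the thesis's orchestration unit: the `HyperOp` class, its fields, and `run_macro_dataflow` are hypothetical names chosen for illustration, and launch and execution on the fabric are reduced to appending a name to a list.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class HyperOp:
    """Illustrative stand-in for a temporal partition (hypothetical fields)."""
    name: str
    num_inputs: int                                # operands required before firing
    consumers: list = field(default_factory=list)  # HyperOps fed by this one
    received: int = 0                              # operands delivered so far


def run_macro_dataflow(initially_ready):
    """Fire HyperOps in an order that respects the all-inputs-available rule."""
    ready = deque(initially_ready)
    fired = []
    while ready:
        op = ready.popleft()
        fired.append(op.name)           # stand-in for launch + execution
        for consumer in op.consumers:
            consumer.received += 1      # deliver one operand to each consumer
            if consumer.received == consumer.num_inputs:
                ready.append(consumer)  # last input arrived: now eligible
    return fired


# A diamond-shaped dependence graph: "store" needs both "scale" and "offset".
load = HyperOp("load", 0)
scale = HyperOp("scale", 1)
offset = HyperOp("offset", 1)
store = HyperOp("store", 2)
load.consumers = [scale, offset]
scale.consumers = [store]
offset.consumers = [store]

order = run_macro_dataflow([load])
print(order)
```

In this sketch `store` is enqueued only after its second operand arrives, which is the synchronization the macro-dataflow model provides without explicit locks or barriers.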
The CGRA framework has been implemented using Bluespec SystemVerilog. We evaluate the performance of two of these CGRA instances: one for cryptographic applications and another for linear algebra kernels. We also run other general purpose integer and floating point applications to demonstrate the generic nature of these optimizations. We explore various microarchitectural optimizations, viz. pipeline optimizations (i.e. changing the value of T), different forms of macro-dataflow orchestration such as the hardware controlled orchestration unit and the compiler controlled orchestration unit, different execution modes including resident loops, pipeline parallelism, changes to the router etc. As a result of these optimizations we observe a 2.5× improvement in performance as compared to the base version. The reconfiguration overhead was hidden by overlapping the launch of instructions with execution. The perceived reconfiguration overhead is reduced drastically to about 9-11 cycles for each HyperOp, independent of the size of the HyperOp. This can be mainly attributed to the data dependent instruction execution and the use of the NoC. The overhead of the macro-dataflow execution unit was reduced to a minimum with the compiler controlled orchestration unit. To benchmark the performance of these CGRA instances, we compare them with an Intel Core 2 Quad running at 2.66GHz. On the cryptographic CGRA instance, running at 700MHz, we observe one to two orders of magnitude improvement in performance for cryptographic applications; on the linear algebra CGRA instance, we observe up to one order of magnitude performance degradation.
This relatively poor performance of linear algebra kernels can be attributed to the inability to exploit ILP across computation units interconnected by the NoC, the long latency in accessing data memory placed at the periphery of the reconfigurable fabric and the unavailability of pipelined floating point units (which are critical to the performance of linear algebra kernels). The superior performance of the cryptographic kernels can be attributed to a higher computation-to-load instruction ratio, careful choice of the custom IP block, the ability to construct large HyperOps, which allows a greater portion of the communication to be performed directly (as against communication through a register file in a general purpose processor), and the use of the resident loops execution mode.
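The 7-tuple design space (C, N, T, P, O, M, H) from the abstract can be written down as a simple record, which makes it explicit which axes the framework fixes and which it leaves open for exploration. The sketch below is only an illustration: the class name, field names, and string values are assumptions made here, not identifiers from the thesis or its tools.

```python
from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class DesignPoint:
    """One point in the 7-tuple design space (field names are illustrative)."""
    computation_unit: str          # C
    network: str                   # N
    context_frames: str            # T: "single" or "multiple"
    partial_reconfiguration: bool  # P
    orchestration: str             # O
    memory_hierarchy: str          # M
    host_coupling: str             # H


# The abstract states that O and H are kept fixed across the framework.
FIXED_AXES = ("orchestration", "host_coupling")

# The choices the abstract lists for this thesis.
THESIS_CHOICE = DesignPoint(
    computation_unit="ALU",
    network="Network-on-Chip",
    context_frames="multiple",
    partial_reconfiguration=True,
    orchestration="macro-dataflow",
    memory_hierarchy="data memory banks at fabric periphery",
    host_coupling="loose",
)


def explorable_axes():
    """Return the axes that are open to design space exploration."""
    return [f.name for f in fields(DesignPoint) if f.name not in FIXED_AXES]


# A hypothetical variant that changes only an explorable axis (T).
single_context = replace(THESIS_CHOICE, context_frames="single")
print(explorable_axes())
```

Keeping the record immutable and deriving variants with `replace` mirrors the idea that compilation and execution stay the same while individual axes are varied.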