Synthesis and Exploration of Loop Accelerators for Systems-on-a-Chip

Submitted to the Faculty of Engineering of the Universität Erlangen-Nürnberg for the attainment of the degree

DOKTOR-INGENIEUR

submitted by

Hritam Dutta

Erlangen, 2011. Approved as a dissertation by the Faculty of Engineering of the Universität Erlangen-Nürnberg.

Date of submission: January 10, 2011
Date of doctoral defense: March 3, 2011
Dean: Prof. Dr.-Ing. Reinhard German
Reviewers: Prof. Dr.-Ing. Jürgen Teich and Prof. Christian Lengauer, Ph.D.

Acknowledgements

I owe my deepest gratitude to my adviser, Professor Jürgen Teich, for always being enthusiastic to propose and discuss new ideas. He also provided me with a great amount of freedom, and with valuable scientific and editorial feedback. I would also like to thank Professor Christian Lengauer for agreeing to serve on my dissertation committee and for his suggestions to improve the dissertation. My sincere gratitude also goes to Professor Bernard Pottier and Professor Ulrich Rüde for introducing me to new ideas and fields of research. My special thanks go to all colleagues, especially Frank Hannig, Dmitrij Kissler, Joachim Keinert, Richard Membarth, Moritz Schmid, Jens Gladigau, and Dirk Koch, for brainstorming sessions and intensive co-operation, which led to key scientific progress and the enrichment of my knowledge. I appreciate Frank’s patience in reading the whole dissertation and making valuable suggestions. I was fortunate to have wonderful office mates in Mateusz Majer and Tobias Ziermann, and thank them both for all the technical and non-technical discussions. My sincere acknowledgements also go to external colleagues Sebastian Siegel, Rainer Schaffer (TU Dresden), Wolfgang Haid (ETH Zürich), and Samar Yazdani (UBO, Brest) for co-operation on important research problems. I also appreciate the efforts of undergraduate students, especially Teddy Zhai and Holger Ruckdeschel, in the software development of the PARO methodology. I am also deeply indebted to Sonja Heidner and Ina Derr for helping me sort out several administrative issues. I greatly value the friendship of all the people who made my stay in Erlangen a real pleasure. My family has been a constant source of love, concern, support, and strength all these years. I would like to express my heart-felt gratitude to my family and dedicate this dissertation to them.

Hritam Dutta

Contents

1. Introduction
   1.1. Next Generation Applications
   1.2. Accelerator based SoC Architectures
   1.3. Programming Models for SoC
   1.4. Problem Definition
   1.5. Contributions and Bibliographic notes
   1.6. A Guided Tour through the Thesis

2. Fundamentals and Related Work
   2.1. Algorithm Specification in the Polytope Model
      2.1.1. Fundamentals: Algorithm Specification
      2.1.2. Specification of Communicating Loop Nests
      2.1.3. Related Work
   2.2. A Generic Accelerator Scheme
      2.2.1. Characterization and Classification of Loop Accelerators
      2.2.2. Accelerator Subsystem for Streaming Application
   2.3. High-level Synthesis of Hardware Accelerators
      2.3.1. Front End: Loop Transformations
         2.3.1.1. Program Transformations
         2.3.1.2. Tiling
      2.3.2. Front End: Scheduling
         2.3.2.1. Global Scheduling and Binding
         2.3.2.2. Local Scheduling and Resource Binding
      2.3.3. Back End: Synthesis
         2.3.3.1. Synthesis of Processor Element
         2.3.3.2. Synthesis of Array Interconnection Structure
         2.3.3.3. Synthesis of Control Hardware
   2.4. Accelerator Design Space Exploration
   2.5. High-level Synthesis Tools
   2.6. Conclusion

3. Accelerator Generation: Loop Transformations and Back End
   3.1. Loop Optimizations for Accelerator Tuning
      3.1.1. Loop Transformations
         3.1.1.1. Loop Permutation
         3.1.1.2. Loop Tiling
      3.1.2. Hierarchical Tiling
         3.1.2.1. Tiling: Decomposition of the Iteration Space
         3.1.2.2. Embedding: Splitting of Data Dependencies
         3.1.2.3. Iteration dependent Conditions
         3.1.2.4. Parallelization of Tiled Piecewise Linear Algorithms
      3.1.3. Results: Scalability and Overhead of Hierarchical Tiling
   3.2. Controller Generation
      3.2.1. Accelerator Control Engine: Architecture and Synthesis Methodology
         3.2.1.1. Counter Generation
         3.2.1.2. Determination of Processor Element Type
         3.2.1.3. Global and Local Controller Unit
         3.2.1.4. Propagation of Global Control and Counter Signals
      3.2.2. I/O Communication Controller
         3.2.2.1. Buffer Modeling and Synthesis
         3.2.2.2. I/O Controller Synthesis
   3.3. Results
      3.3.1. Embedded Computation Motifs
      3.3.2. Impact of Transformations on Controller Overhead
   3.4. Conclusion

4. Accelerator Subsystem for Streaming Applications: Synthesis and System Integration
   4.1. Communicating Loop Model
      4.1.1. Loop Graph
      4.1.2. Accelerator Model
      4.1.3. Mapping: Putting it all together
   4.2. Automated Generation of a Communicating Accelerator Subsystem
      4.2.1. Modeling of Communication Channels
         4.2.1.1. Simplified Windowed Synchronous Data Flow Model
         4.2.1.2. Conversion from the Polyhedral Model to the Data Flow Representation
      4.2.2. Multi-dimensional FIFO: Architecture and Synthesis
   4.3. Synthesis of Accelerators for MPSoCs
      4.3.1. Interface Synthesis
         4.3.1.1. Accelerator Memory Map Generation
         4.3.1.2. Hardware Wrapper
         4.3.1.3. Software Driver
      4.3.2. Accelerator Integration in SoC
   4.4. Results
      4.4.1. Overhead of Communication Primitives
      4.4.2. Accelerators as Components in SoC
   4.5. Conclusion

5. Design Space Exploration: Accelerator Tuning
   5.1. Single Accelerator Exploration
      5.1.1. Model Representation and Problem Definition
      5.1.2. Multiple Objectives
      5.1.3. Objective Functions
         5.1.3.1. Rapid Estimation Models
      5.1.4. Optimization Engine
         5.1.4.1. Baseline: Random or Exhaustive Search
         5.1.4.2. Evolutionary Algorithms
   5.2. Performance Analysis of Accelerators in an SoC System
      5.2.1. Modular Performance Analysis (MPA)
      5.2.2. Objective Parameter Estimation for Accelerators
         5.2.2.1. Accelerator Performance: Service Curve Estimation
      5.2.3. Optimal Configuration Selection in System Context
      5.2.4. Case Study
         5.2.4.1. Motion JPEG Decoder
   5.3. Conclusion and Summary

6. Conclusions and Outlook
   6.1. Conclusion
   6.2. Future Work

A. Glossary

B. Hermite Normal Form

C. Loop Benchmarks

German Part

Bibliography

List of Abbreviations

Curriculum Vitae


1. Introduction

The insatiable craving for more and more performance and the famous Moore's law have been the driving forces behind the evolution of computer architectures. On the other hand, green computing, which stands for reducing the environmental impact of computing devices, is urging increased energy efficiency. Therefore, next generation embedded, desktop, and supercomputing applications are placing extreme performance demands at low energy cost.

According to the ITRS Roadmap 2007 [101], heterogeneous massively parallel computing is the paradigm that needs to be embraced in order to meet the demands of the next generation applications. A heterogeneous massively parallel computing platform consists of numerous specialized cores around multiple general purpose processors, which are not identical from the programming or the implementation standpoint. In the area of embedded computing, these architectures are also called systems-on-a-chip (SoC). They are the next step in embedded architecture evolution, where homogeneous many-core architectures are augmented with task-specific specialized cores called accelerators. The block diagram in Figure 1.1 illustrates such an SoC architecture. The control intensive general purpose software is executed sequentially on the processors, whereas the data intensive parts of the applications, such as loop programs, are offloaded to the corresponding tailored accelerators. According to Gries [79], the following approaches are needed in order to realize the full potential of SoC architectures:

• Task-specific processors: The need for performance at low power cost led to the inclusion of task-specific processors, also known as acceleration engines. They can be dedicated or offer domain-specific programmability.

• Correct-by-construction design: Verification and integration account for a large chunk of the development time of SoCs; therefore, the automatic generation of a dedicated accelerator architecture and of program code for processors by a compiler from a high-level specification should replace the error-prone manual design process. Embedded computing is embracing new software tools for correct-by-construction design.

• Hardware/software co-design: The term refers to early system development on higher levels of abstraction and subsequent refinement onto an accelerator/processor system.


Figure 1.1.: SoC architecture that has multiple processors (CPUs, DSP), special purpose accelerators (e.g., video, FFT, and encryption engines, matrix multiplication, RGB2YUV), connectivity IPs (e.g., USB, ...), memories, and controllers, which can be accommodated on a single chip.

• Design space exploration: The architect must be aided in pruning the large design space and in identifying optimal designs through systematic exploration.

It is usually the task of the designer to handcraft the data intensive parts (i.e., loop programs) in hardware description languages like VHDL to obtain the best-fit accelerator in terms of performance, cost, and power. The optimizations of such loop specifications exploit partitioning, efficient data reuse, transfer, storage, and other transformations in the programmer's bag of tricks. Furthermore, the manual process of synthesis and exploration of the accelerator design space is becoming increasingly tedious and error-prone in the face of increasing hardware and software complexity. Therefore, a major challenge to be surmounted for SoC architectures is the lack of mapping methodologies for the correct-by-construction design of task-specific processors (i.e., accelerators). Furthermore, the tools must support accelerator integration in an SoC (i.e., HW/SW co-design) and the discovery of best-fit accelerator designs (i.e., design space exploration). Several electronic system level (ESL) synthesis tools partly answer the problem by enabling the automatic generation of register transfer level (RTL) descriptions of hardware accelerators or of program code for domain-specific acceleration engines. In this dissertation, we address all the needs for realizing an automated design flow for synthesis, integration, and exploration of accelerator designs, for a given algorithm or application specified in a high-level language.

In this chapter, we first characterize the nature and requirements of next generation applications in Section 1.1. Subsequently, we discuss the historical developments, future trends, and the motivation for accelerator-based systems-on-a-chip in Section 1.2.


Section 1.3 gives an overview of programming models and system software for harnessing accelerator-based SoC architectures. The problem statements and contributions of the dissertation are summarized in Sections 1.4 and 1.5, respectively. The outline of the rest of the dissertation is given in Section 1.6.

1.1. Next Generation Applications

Embedded applications, their workload, and underlying algorithms are in general characterized by (a) different levels of parallelism: the applications not only exhibit task level parallelism, but the algorithm primitives also contain loop and operation level parallelism; (b) high computation intensity: the applications are characterized by a high ratio of computation to communication operations; (c) real-time processing of small data types with extensive data reorganizations: the applications, especially in embedded computing, are characterized by hard and soft deadlines specified by the perception of responsiveness. The currently existing architectures, compilers, and programming models were developed for different sets of embedded applications. They hardly meet the high performance demands and rigorous power constraints of next generation streaming applications.

The compelling next generation streaming applications are the driving force for SoC architectures. The applications usually involve a convergence of pre-processing, recognition, mining, and analysis workloads [30]. The pre-processing is concerned with enhancing the quality of input data (e.g., signal, image, video). The recognition, mining, and synthesis (RMS) part is concerned with the modelling, identification, and evaluation of events or objects. Let us consider an angiography application from medical imaging, which deals with interactive catheter guidance and minimal-invasive vessel treatment. A major problem faced by such medical imaging applications is the poor signal to noise ratio (SNR) of the images due to limited dosage and exposure for health reasons. Limited dosage especially is an issue if fluoroscopic images are acquired. Fluoro images are not used for diagnostic purposes, but for guiding a catheter, placing a device (e.g., a stent), and other interactive interventional procedures taking up to several hours. Therefore, the use of image pre-processing algorithms for reducing the noise along with the preservation of visual structures is a major field of research. Usually, a sequence of algorithms in an imaging pipeline is used to tackle the different types of noise, as shown in Figure 1.2. The RMS part of the application deals with modelling, the recognition of stenosis (narrowing of blood vessels), aneurysm (localized, blood-filled dilation due to disease), and embolization (selective occlusion of blood vessels), and helping the doctors evaluate the complications during the surgery. The whole application pipeline can require up to teraflops ($10^{12}$ operations per second) of performance with stringent power constraints. Table 1.1 summarizes some other applications from different fields, their performance requirements, and typical power constraints. Just as office and presentation software were the killer applications for sequential processors in the 1990s, some of the above-mentioned applications are already the driving forces of next generation architectures.


Figure 1.2.: An example of the problems and applications in medical imaging, here for angiography.

Computationally intensive algorithms are the building blocks of the applications in Table 1.1. These applications usually involve mathematical models, which in turn can be solved using mathematical algorithms. For example, computational fluid dynamics applications are based on partial differential equations (e.g., the Navier-Stokes equations), which in turn are solved by iterative grid solvers like the Gauss-Seidel method. In [7], the underlying communication and computation patterns of the algorithms that are the building blocks of many application benchmarks were identified. These numerical patterns were classified into 13 groups called dwarfs. These numerical primitives are dense and sparse matrix algorithms, spectral methods, n-body algorithms, structured/unstructured grids, Monte Carlo methods, finite-state machines, combinational logic, graph traversal, dynamic programming, backtracking, and branch and bound (B&B) methods. These so-called dwarfs are mostly written as loop programs. A major focus of research activities is on programming models and accelerator architectures for such loop kernels. The applications usually contain multiple dwarfs, which possibly communicate with each other (see Figure 1.2). Therefore, the performance boost not only depends on a single loop kernel but also on the composition of multiple communicating loop kernels. Hence, there is a drastic need for new architectures, programming models, and software to pave the way for new applications.


Area | Applications | Performance | Power
--- | --- | --- | ---
Mobile and Wireless Computing | Speech recognition, video compression, network coding and encryption, holography | 10-40 Gops | 100 mW
High Performance Computing | Computational fluid dynamics, molecular dynamics, life sciences, oil and gas, climate modelling | 100-10000 Gops | 100-1000 kW
Medical Imaging and Equipment | 3D reconstruction, image registration and segmentation, battery-driven health monitoring | 1-1000 Gops | 100 mW-100 W
Automotive | Lane, collision and pedestrian detection, driving assistance systems | 1-100 Gops | 500 mW-10 W
Home and Desktop Applications | Gaming physics, ray tracing, CAD/CAM/CAE/EDA tools, web mining, digital content creation | 10-1000 Gops | 20-500 W
Business | Portfolio selection, smart cameras [186], asset liability management [30] | 1-1000 Gops | 1-100 W

Table 1.1.: Summary of computationally intensive next generation applications with typical performance and power requirements.

These applications are characterized by the presence of multiple computationally intensive loop algorithms. In the next section, we explore state-of-the-art architectures which cater to the needs of these applications.

1.2. Accelerator based SoC Architectures

Historically, Moore's law has been the propelling force behind the evolution of mainstream computer architectures. One piece of the old conventional wisdom was that increasing the clock frequency would improve processor performance; however, this also led to a sharp increase in power consumption. In recent years, there has been a slowdown in processor performance growth because of the limited increase in clock frequency due to physical constraints and power considerations. The new conventional wisdom is that increasing parallelism is the primary method for increasing processor performance without increasing the power consumption substantially [7]. Informally, the new corollary of Moore's law is that the number of processors per chip doubles every two years. If Moore's law continues, one will be able to pack multiple processors, accelerators, and IPs on a single chip, leading to an increased functional diversity (also known as "more than Moore"). In case of a breakdown of Moore's law, however, only new technologies like optical interconnects, 3-D chips, and accelerator-based processing can extract more performance.

There is a growing acceptance of accelerator-based architectures in the complete spectrum of computing, from supercomputing and mainstream to embedded computing. The supercomputer which broke the petaflop ($10^{15}$ operations per second) barrier is IBM Roadrunner [11]. This system consists of 6,948 dual-core Opteron processors, as well as 12,960 Cell processors, which function as accelerators. Remarkably, it is also the fourth-most energy-efficient supercomputer in the world on the Green500 list [68]. Similarly, in mainstream desktop computing, there has been a major thrust on the use of general purpose graphics processing units (GPGPUs) as acceleration engines for high performance applications. The Tesla GPGPU from nVidia, consisting of 30 multiprocessors each with 8 ALUs (multiply-accumulate operation), already has a peak performance of 936 GFlops (single precision) [124]. The Cell processor used in Sony PlayStation gaming consoles [103] is another major player in the mainstream acceleration arena. Acceleration technology for supercomputing and mainstream desktop computing exists in the form of add-on coprocessors or boards that are integrated into systems. For embedded computing, accelerators bring significant additional performance to a system-on-chip. TI's OMAP and ST's Nomadik are some of the known SoC systems catering to the needs of the mobile communication industry [185].

SoC architectures combine the flexibility of CPUs with the performance of accelerators on a single chip. The CPU is responsible for the housekeeping control functions, which are inherently sequential in nature. The accelerators are dedicated to boosting the performance of the parallel part of the application. The performance and power efficiency of the accelerators comes from exploiting the following features: (a) parallelism, (b) control amortization, (c) reconfigurability, and (d) hierarchical storage. Modern embedded processor architectures include features like branch prediction, out-of-order execution, deep execution pipelines, and others to exploit instruction level parallelism. However, studies showed that such features had diminishing returns on performance, which was outweighed by power and energy usage [75]. Therefore, accelerator architectures rely on a simplified datapath without functions like data caches or prefetching, and have multiple small processor elements (PEs) with partly general and partly specialized datapaths. The PEs have less control overhead, as each PE synchronously executes a copy of the same program. Therefore, the control cost (i.e., the instruction fetch/decode unit or global controller) is amortized over multiple PEs. This is also known as SIMD (single instruction, multiple data) in parallel architecture jargon. To counter the memory bottleneck and hide data latency, hierarchical storage is used. The accelerator models have a main memory, which has a high access latency. They also have local memory types such as intermediate local buffers (scratch-pad memory), internal local memory, and register files in the PEs for faster access.


Figure 1.3.: (a) Power efficiency versus flexibility trade-offs of different accelerator alternatives in embedded computing, ranging from embedded processors (ARM, ca. 1 Mops/mW), DSPs (TI C6x, ca. 5 Mops/mW), GPGPUs (ca. 10 Mops/mW), FPGAs (Virtex-II Pro, ca. 20 Mops/mW), and domain-specific programmable processors (WPPA, ca. 25 Mops/mW) up to dedicated hardware loop accelerators (ca. 100 Mops/mW); (b) dedicated hardware loop accelerator.

Certain accelerator architectures provide the possibility of defining an application-specific instruction set and datapath at synthesis time (configurable), or different program functionality at run time (reconfigurable). The accelerators used in embedded computing differ in their characteristics in terms of power efficiency and flexibility (see Figure 1.3(a)). Our focus is on accelerating loop kernels and their composition with maximum efficiency. Therefore, we focus on non-programmable accelerators throughout the dissertation. Furthermore, the mapping principles for non-programmable accelerators can be extended to programmable processor arrays. The accelerator architecture can be in the form of a processor array as shown in Figure 1.3(b). The power efficiency comes at the cost of flexibility, as such accelerators do not support multi-threading, reconfigurability, and programmability. In order to support the designer, synthesis tools and compilation methods for the correct-by-construction design of hardware accelerators from an algorithm specification in a high-level language are required. Furthermore, these design tools must support SoC integration and design space exploration [79].

To summarize, accelerator-based system-on-chip (SoC) architectures are the next step in the evolution of embedded architectures in order to meet the performance demands and stringent power constraints. Hardware accelerators offer maximum efficiency at the cost of flexibility among all accelerator types. In the next section, an overview of programming models for realizing accelerator-based SoC architectures is given.


Figure 1.4.: (a) Double roof model for embedded system design [167]; (b) Y-chart approach for iterative synthesis at each abstraction level [110].

1.3. Programming Models for SoC

In the previous sections, we discussed several application and architecture models. The applications are typically computationally intensive algorithms specified by communicating loop programs, whereas the architecture template is a heterogeneous massively parallel architecture characterized by the presence of multiple processors, accelerators, and other IP cores. The task of the programming model is to match and map an application to the architecture. It encompasses the areas of languages, compilers, and libraries for efficient system design. In the area of embedded computing, the term electronic system level (ESL) design is used synonymously with programming model for system-on-chip (SoC) platforms.

The double roof model shown in Figure 1.4(a) is a nice visual cue illustrating a top-down system design process. The right and left sides of the model are concerned with the hardware and the software part of the system, respectively [167]. The upper and lower roofs separate the behavioural and implementation aspects of the design. The different levels of abstraction show the design refinement steps for hardware and software. In the beginning, one can create modular application functionality at the system level. This can be done using popular programming languages like C and C++, or model-based design frameworks like Simulink. The application is then partitioned onto a complex architecture that includes both hardware and software components (system synthesis). This is also known as hardware/software partitioning. The task of generating an RTL description (hardware synthesis) or a software program (software synthesis) is necessary for the implementation. There are also further abstraction layers for logic and code synthesis, i.e., the generation of netlists and executable object files for hardware and software, respectively.

In this thesis, we study aspects of system and hardware synthesis. The system synthesis aspects cover system integration and the deployment of accelerator IPs in an SoC, whereas hardware synthesis concerns itself with the generation of loop accelerators from loop descriptions in a given high-level language. Therefore, we deal mainly with the shaded abstraction layers in Figure 1.4(a). The tasks to be fulfilled in each abstraction layer form an iterative process, which can also be illustrated with the help of the Y-chart approach shown in Figure 1.4(b). Given an application and an architecture, one can synthesize and refine designs using this approach. The synthesis includes the basic tasks of allocating resources, binding functionality to resources, and scheduling. These mapping tasks are carried out by a compiler.

Figure 1.5 shows the design flow for accelerator generation and integration as developed in this thesis. The application is modelled by a graph representing communicating loop programs. This not only helps to handle the system complexity, but also naturally represents the application's task parallelism. Therefore, the input is a modular description of the application consisting of multiple loop nests, where each loop nest is described in a high-level language. Also, the design constraints on resources, compiler transformation parameters, and golden simulation data for testbench generation have to be provided. The computation synthesis relies on techniques like loop tiling for parallelization. The compiler front end contains a transformation toolbox for massaging the given input program, or may rely on the programmer's knowledge to provide appropriate directives in the loop description. A major task of the compiler is to exploit architectural features like the multiple levels of parallelism available in the architecture (functional units, processing elements) and the communication structure (registers, local buffers, and external memory). Loop transformations in combination with scheduling generate an intermediate representation (IR) of an accelerator implementation, mostly in the form of a processor array as shown in Figure 1.3(b). This requires the generation of the PE datapath, a local/global controller, an I/O controller, and buffers depending on the selected allocation and schedule.

Subsequently, for each loop in the graph describing an application, a dedicated accelerator RTL is synthesized. After back end optimizations, the compiler generates the architecture-specific (HW/SW) code for hardware/software co-design and deployment. Using standard industry synthesis tools, the accelerator RTL in VHDL can then be compiled to a target architecture (e.g., ASICs or FPGAs). In case of communicating loop applications, the hardware accelerators are hooked to other accelerators using dedicated communication subsystems.


Figure 1.5.: Overview of the design flow, covering processor array synthesis, controller generation, and I/O interface synthesis in the back end; communication engine, driver, and interconnect synthesis for system integration; and design space exploration with search heuristics (e.g., evolutionary algorithms, simulated annealing), estimation of the area, power, and performance metrics, and decision making based on real-time calculus.

In order to couple the accelerator to a processor, interface generation is a necessary step. It requires the generation of a memory map, glue logic for an accelerator wrapper, as well as a device driver for data communication and synchronization. This depends on throughput, execution order, allocation, and scheduling. The compiler produces an implementation, which can be analyzed in terms of design metrics like performance, cost, and power. These numbers can then be used to suggest architectural improvements, application modifications, or a different mapping strategy to obtain a best-fit design. This iterative process of improving the system is called design space exploration. In our design flow, the design space exploration combines the use of modern search heuristics and analytical methods, and tries to find the best-fit accelerator architectures depending on architecture allocation, compiler parameters, and system workload scenarios.

The programming model of our design flow for accelerator generation deals with several aspects at the system synthesis and hardware synthesis levels. It contains novel loop transformations and back end optimizations in controller and communication architecture generation for hardware accelerator synthesis; furthermore, it supports the integration of accelerators at the system level and design space exploration.


Since programming models are central to this work, elaborate groundwork on the fundamentals of, and related work on, programming models for accelerators is presented in the next chapter. In this thesis, we tackle different problems associated with the generation of hardware accelerators.

1.4. Problem Definition

The problems can be summarized as follows:

1. How do we map a single loop nest onto a dedicated hardware accelerator? The compiler community has been facing this problem for a long time. In the last two decades, a lot of research in academia and industry has spawned state-of-the-art design tools and methodologies such as PICO-Express [155], MMalpha [81], PARO [93], and many others. The first problem for such a design methodology is automatic parallelization. The second, perpetual challenge of such tools lies in matching a given loop program to the resource constraints of a given architecture (e.g., number of PEs, functional units, memory banks, I/O pins). Finally, the back end problem of efficient code and hardware RTL generation for the corresponding loop program cannot be underestimated.

2. How do we generate the acceleration engine for an application consisting of multiple communicating loop nests, and how do we integrate such an accelerator into an SoC? Multiple communicating loops are usually mapped onto a pipeline of dedicated hardware accelerators. Each of the accelerators has parallel read and write operations with a particular execution order. Therefore, a customized communication subsystem is needed for the efficient transfer of multi-dimensional arrays between the accelerators. Furthermore, in an SoC, there are different communication architectures like buses, dedicated point-to-point channels, or networks-on-chip for coupling the accelerators. Therefore, the glue logic for coupling an accelerator over the interconnect of choice needs to be generated. The driver program for data transfer and synchronization between the processor and the accelerator device needs to be generated depending on the selected allocation and schedule.

3. How do we identify optimal accelerator designs? The selection of an optimal architecture can be daunting due to a plethora of design decisions. The choices include architecture parameters like the number of processing elements (PEs), functional units (FUs), and registers. They can also be compiler optimization parameters like a loop tiling strategy describing the allocation and execution order. These parameters are important design decisions for system architects, as they directly influence not only performance, but also power and chip area.


An exhaustive search of the parameter space is not feasible because of time constraints. Furthermore, the optimal choice also depends on system properties like bus bandwidth, workload behaviour, etc. Therefore, one needs techniques for the efficient evaluation of accelerator designs, for searching the parameter space, and for the consideration of system behaviour.

4. What are appropriate benchmarks for evaluating the proposed programming models and generated accelerator architectures? The benchmarks should capture the communication and computation patterns of computationally intensive programs.

1.5. Contributions and Bibliographic notes

Solutions to the above problems and the major steps of our design flow are illustrated in Figure 1.5. The following is a brief summary of the novel contributions of this dissertation.

1. Traditional design flows depend on the programmer's knowledge to specify the degree of parallelism, balance computational intensity, and exploit data locality. Central to our design flow is a sophisticated transformation called hierarchical tiling, which assists the designer in matching or specifying the degree of parallelism (number of PEs), the local memory, and the requisite communication bandwidth of the accelerator architecture. Other flows contain at most transformations which can match only one or the other criterion [46, 48]. We also introduce a novel methodology for control generation, which reduces the control hardware overhead by moving common control predicates to a global controller. Therefore, the controller cost is amortized over multiple PEs. This leads to a significant reduction in control path area compared to existing approaches [43, 45, 47]. The back end of our design flow automatically generates an RTL description of the PE datapath, the local/global controller, and a memory interface corresponding to the selected partitioning and scheduling strategy [51, 50, 93]. Therefore, we are able to design hardware accelerators with 2.5x, 4.5x, and 50x improvements in terms of area, power, and performance with respect to embedded processors. Furthermore, a correct-by-construction design flow improves the design productivity by a factor of 100.

2. Communication synthesis for the data transfer and synchronization between loop accelerators has been a major challenge. The complexity of the problem arises from the fact that an optimal memory mapping and address generation in a communication subsystem for parallel data access and out-of-order communication depends on the allocation and the scheduling choices. The problem of communication synthesis is solved by leveraging the windowed synchronous data flow (WSDF) model [107]. In this context, an intermediate representation of communicating loops in the polyhedral model and a unified methodology for its projection onto the WSDF model are proposed. Finally, we present an architecture template and a synthesis methodology, and evaluate the overhead of the communication primitive [44]. For the integration of an accelerator in an SoC, we also generate interconnect glue logic and a software driver for data transfer and synchronization with the CPU [56]. It is shown that, when using accelerators as co-processors in a HW/SW system, the performance gain scales down by an order of magnitude when compared to a pure hardware accelerator-based system.

flow (WSDF) model for communication synthesis [107]. In this context, an intermediate representation of communicating loops in the polyhedral model and a unified methodology for its projection onto the WSDF model is pro- posed. Finally, we present an architecture template, a synthesis methodology, and evaluate the overhead of the communication primitive [44]. For the inte- gration of an accelerator in an SoC, we also generate interconnect glue logic and a software driver for data transfer and synchronization with the CPU [56]. It is shown that, when using accelerators as co-processor in a HW/SW system, the performance gain scales down by an order of magnitude when compared to a pure hardware accelerator-based system.

3. Accelerator designs may have many design knobs, like the number of functional units and high-level transformations like loop tiling, which affect the performance, power, and cost trade-off. Exhaustive exploration of the design space is prohibitive due to time constraints. Therefore, we use modern search heuristics based on evolutionary algorithms to identify Pareto-optimal designs (i.e., designs with the best trade-offs). We also decouple the search for Pareto-optimal designs from the selection of best-fit accelerator designs for a given system workload behaviour. The proposed analytical method for finding the best-fit accelerators from a Pareto-optimal set of designs is based on real-time calculus [172]. Therefore, the systematic tuning of accelerators, simultaneously taking into account the objectives of performance, area, power consumption, and system workload in a reasonable time of about an hour, is another major contribution [52].

4. Finally, the validation of the proposed design flow with the help of new standard benchmarks and complex applications consolidates its novelty and efficiency [54]. The benchmarks are algorithms and applications from [7] and are used to evaluate the corresponding application acceleration engines and their system integration. The efficiency of the developed design flow is illustrated by the generation of accelerators with substantial speedups and a high quality of results for many compute-intensive algorithms.

The solutions proposed in this thesis can also be extended to generate programmable accelerators. Some other completed works not included in this dissertation are:

• Extension of the design flow for hardware accelerators to a unified mapping methodology for coarse-grained reconfigurable architectures [88, 89, 90, 60] and weakly programmable processor arrays [42, 49, 55, 87, 112, 166].

• Acceleration of multi-resolution imaging algorithms on general purpose graphics processing units (GPGPUs) and the Cell processor [131, 130, 132, 129].


1.6. A Guided Tour through the Thesis

In Chapter 2, we present the fundamentals of our input specification, compilation model, and accelerator architectures. Section 2.1 presents the modelling and language specification of the nested loop programs considered. For the purpose of modelling communicating loops, data flow models of computation are discussed. The characteristics of state-of-the-art loop accelerator architectures are presented in Section 2.2. The principles of the compilation technology for mapping loop programs onto accelerator architectures are explained in Section 2.3. After a discussion and differentiation of related work on design space exploration in Section 2.4, we differentiate our design flow from existing high-level synthesis tools in Section 2.5. Finally, we close the chapter with conclusions and a summary of the remaining challenges in Section 2.6.

The problem of efficient accelerator synthesis is dealt with in Chapter 3. The loop optimizations developed in our compilation technology are presented in Section 3.1. Essential to our design flow is the automation of a loop transformation called hierarchical tiling, which allows constraint-aware allocation and scheduling on the accelerator architecture. The different loop transformations introduce control and communication overhead. Therefore, in Section 3.2, we propose a back end transformation for the automatic generation of a hybrid controller architecture with a local/global control path and a communication controller. Computationally intensive algorithms from embedded systems are used to illustrate and benchmark our methodology in Section 3.3.1. The characteristics and results of the generated hardware in terms of performance, power, and area cost are given in Section 3.3.2.

The second problem of generating acceleration engines for applications composed of multiple communicating loops and their integration in an SoC is discussed in Chapter 4. The modelling of communicating loops and platforms is explained in Section 4.1. The synthesis of the communication subsystem for a pipeline of accelerators is the focus of Section 4.2. The generation of glue logic for the coupling of accelerators over high performance buses, and of host drivers for interacting with an accelerator device, is handled in Section 4.3. Different applications are used as case studies for demonstrating our design flow for the generation of accelerator subsystems.

The last problem of identifying an optimal accelerator architecture for a given application, depending on resource allocation, compiler parameters, and system behaviour, is the matter of discussion in Chapter 5. The modelling of architecture design parameters and objectives, fast evaluation through hardware resource estimation, and different search techniques like evolutionary algorithms for discovering good designs are presented in Section 5.1. In Section 5.2, a methodology based on real-time calculus is used for identifying best-fit designs from a Pareto-optimal set of designs, depending on the system workload contracts.

Apart from the conclusions at the end of each chapter, the final Chapter 6 concludes the work by highlighting the contributions of the dissertation and possible future work.

2. Fundamentals and Related Work

In the previous chapter, we highlighted accelerator-based heterogeneous systems-on-chip as a solution to the high performance needs and stringent power constraints of next generation applications. In this chapter, we give an overview of the fundamentals and related work. The fundamentals of the underlying models for algorithm and application specification are presented in Section 2.1. In Section 2.2, we analyze and classify the accelerator architectures. Section 2.3 is a primer on the compiler technology used for mapping algorithms onto hardware loop accelerators, communication synthesis for accelerator systems, and SoC integration. Section 2.4 discusses the specific problem of searching for Pareto-optimal and best-fit accelerators for a given system. In Section 2.5, we differentiate our design flow methodology from other existing state-of-the-art tools for accelerator synthesis. The chapter ends in Section 2.6 with a summary of the differentiation and of the open problems to be solved in this dissertation.

2.1. Algorithm Specification in the Polytope Model

In embedded electronics applications, 80% of the execution time is typically spent in 20% of the program code [162]. These computationally intensive parts of the program are usually loop nests, which contain inherent parallelism. Hence, a lot of research in the field of parallel computing has been spent on the modelling, parallelization, scheduling, allocation, code generation, and memory management of loop nests.

2.1.1. Fundamentals: Algorithm Specification

The polytope model is universally accepted as a basis for the mathematical optimization of loop nests in compiler theory [123, 66]. In this model, all the iterations of a nested for loop form the iteration space, which can be modelled by a Z-polyhedron.

Definition 2.1.1 (Z-polyhedron). A Z-polyhedron is mathematically defined as the set of integer solutions of a system of affine inequalities

\[
  \mathcal{I} = \{\, I \in \mathbb{Z}^n \mid A\,I + b \geq 0 \,\} \tag{2.1}
\]

where $b \in \mathbb{Z}^m$ and $A \in \mathbb{Z}^{m \times n}$ denote an integer vector and an integer matrix, respectively.


A polytope is a bounded polyhedron. A Z-polyhedron can model the iteration space of a nested loop: each loop iteration is an integer point in the Z-polyhedron, and the loop counter bounds define the set of half-spaces (as affine inequalities) whose intersection gives Equation (2.1). For example, the loop nest of a bilateral filter algorithm is given in Figure 2.1(a). Assume the image dimensions to be 1024 × 1024 pixels and the mask dimensions to be 4 × 4, i.e., M = 1024, N = 1024, P = 4, Q = 4. The iteration space of the bilateral filter algorithm can then be given as the following Z-polyhedron:

\[
  \mathcal{I} = \left\{ I = \begin{pmatrix} i \\ j \\ m \\ n \end{pmatrix} \in \mathbb{Z}^4 \;\middle|\;
  \begin{pmatrix}
    1 & 0 & 0 & 0 \\
    -1 & 0 & 0 & 0 \\
    0 & 1 & 0 & 0 \\
    0 & -1 & 0 & 0 \\
    0 & 0 & 1 & 0 \\
    0 & 0 & -1 & 0 \\
    0 & 0 & 0 & 1 \\
    0 & 0 & 0 & -1
  \end{pmatrix}
  \begin{pmatrix} i \\ j \\ m \\ n \end{pmatrix}
  +
  \begin{pmatrix} 0 \\ 1023 \\ 0 \\ 1023 \\ 0 \\ 3 \\ 0 \\ 3 \end{pmatrix}
  \geq 0 \right\} \tag{2.2}
\]

The resulting polyhedron is visualized in Figure 2.1(b) for M = 8, N = 8, P = 4, Q = 1 (Q = 1 for the sake of visualization). In [164], the concept of linearly bounded lattices (LBL) was introduced, which can represent any arbitrary n-dimensional integer point set (i.e., any loop nest).

Definition 2.1.2 (Linearly Bounded Lattice). A linearly bounded lattice (LBL) denotes an iteration space of the form:

\[
  \mathcal{I} = \{\, I \in \mathbb{Z}^n \mid I = M\kappa + c \;\wedge\; A\kappa \geq b \,\} \tag{2.3}
\]

where $\kappa \in \mathbb{Z}^l$, $M \in \mathbb{Z}^{n \times l}$, $c \in \mathbb{Z}^n$, $A \in \mathbb{Z}^{m \times l}$, and $b \in \mathbb{Z}^m$. $\{\kappa \in \mathbb{Z}^l \mid A\kappa \geq b\}$ denotes the set of integral points within a Z-polyhedron or, in case of boundedness, within a Z-polytope in $\mathbb{Z}^l$. This set is affinely mapped onto the iteration vectors I using an affine transformation ($I = M\kappa + c$).
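For intuition, consider a small illustrative example (ours, not taken from the original text): a two-dimensional loop nest with counters running from 0 to 6 in steps of 2 has the even points of a 4 × 4 grid as its iteration space, which is an LBL with M being twice the identity and c = 0:

\[
  \mathcal{I} = \left\{ I \in \mathbb{Z}^2 \;\middle|\;
  I = \begin{pmatrix} 2 & 0 \\ 0 & 2 \end{pmatrix} \kappa
  \;\wedge\;
  \begin{pmatrix} 1 & 0 \\ -1 & 0 \\ 0 & 1 \\ 0 & -1 \end{pmatrix} \kappa \geq
  \begin{pmatrix} 0 \\ -3 \\ 0 \\ -3 \end{pmatrix}
  \right\}
\]

Since M is not the identity here, this point set is not a Z-polyhedron; the lattice part is what makes strided loops expressible.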

If M is an identity matrix, then an LBL is a Z-polyhedron. Throughout the thesis, we assume that the matrix M is square and of full rank. Hence, each vector κ maps to a unique iteration point I. A further extension of the Z-polyhedron is the parametric Z-polyhedron, which can model loops with parametric bounds [64]. There is a rich mathematical framework for operations based on these polytope models. For example, the Fourier-Motzkin elimination algorithm [156] can be used to decide the existence of a solution to a system of affine inequalities as given in Equation (2.1). Parametric integer programming (PIP) is an extension to handle parametric bounds and integer solutions [64]. Ehrhart polynomials can be used for counting the number of integer points inside a parametrized Z-polytope as a function of its parameters [177]. Loop nests can thus be represented in the polyhedral model, which offers a rich mathematical framework for loop transformations.
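As a quick sanity check of the counting view, the following minimal C sketch (our illustration, not part of the original tool flow) enumerates the integer points of the box-shaped iteration space of Equation (2.2) and compares the count with the closed form M · N · P · Q, which is what an Ehrhart polynomial would deliver symbolically for this simple parametrized polytope:

```c
#include <stdio.h>

int main(void) {
    const int M = 1024, N = 1024, P = 4, Q = 4;
    long count = 0;
    /* every point of the box 0<=i<M, 0<=j<N, 0<=m<P, 0<=n<Q
       satisfies the eight inequalities of Equation (2.2) */
    for (int i = 0; i < M; i++)
        for (int j = 0; j < N; j++)
            for (int m = 0; m < P; m++)
                for (int n = 0; n < Q; n++)
                    count++;
    printf("enumerated: %ld, closed form M*N*P*Q: %ld\n",
           count, (long)M * N * P * Q);
    return 0;
}
```

For general (non-box) polytopes the count is no longer a simple product, which is where Fourier-Motzkin elimination, PIP, and Ehrhart theory come into play.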


Figure 2.1.: (a) A loop nest description of the bilateral filter algorithm (declaring the input image u[M][N], the output image y[M-P][N-Q], the closeness mask c[P][Q], and the similarity mask s[P][Q] obtained via a LUT, with outer loops over the image and inner loops over the mask) and (b) the iteration space graph (dependencies are not shown).
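Since the listing of Figure 2.1(a) is referenced repeatedly in the following, a hedged C reconstruction is given below. It follows the bilateral filter equations and statements S1-S14 introduced in Example 2.1.1 below; the loop order and the zero-contribution boundary handling match the surrounding text, while the exact statement arrangement inside the kernel (and the full-size output array) are our assumptions:

```c
#define M 1024
#define N 1024
#define P 4
#define Q 4

extern float LUT(float d);   /* tabulated similarity weights, cf. Eq. (2.7) */

void bilateral(const float u[M][N], const float c[P][Q], float y[M][N])
{
    for (int j = 0; j < N; j++)            /* outer loops over the image */
        for (int i = 0; i < M; i++) {
            float ysum = 0.0f, msum = 0.0f;
            for (int m = 0; m < P; m++)    /* inner loops over the mask  */
                for (int n = 0; n < Q; n++) {
                    /* boundary handling: out-of-range pixels contribute 0 */
                    float pix = (i - m >= 0 && j - n >= 0) ? u[i - m][j - n]
                                                           : 0.0f;
                    /* closeness weight times tabulated similarity weight */
                    float a = c[m][n] * LUT(pix - u[i][j]);
                    ysum += a * pix;       /* weighted pixel sum          */
                    msum += a;             /* sum of mask weights         */
                }
            y[i][j] = ysum / msum;         /* normalized output pixel     */
        }
}
```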

Dynamic Piecewise Linear Algorithms

Loop nests are ubiquitous in imperative languages for representing the repetitive patterns of computationally intensive functions. A lot of work has been done in the area of high performance compilers on extracting parallelism from loop nests [158]. On the other hand, recurrence equations in the polytope model have been used to organize computations in a single assignment manner with an implicit order of execution. Hannig et al. define the class of dynamic piecewise linear algorithms [95] for the specification of loop kernels by sets of recurrence equations.

Definition 2.1.3 (DPLA). A dynamic piecewise linear algorithm consists of a set of N quantified equations, $S_1[I], \ldots, S_i[I], \ldots, S_N[I]$. Each equation $S_i[I]$ is of the form

\[
  \forall I \in \mathcal{I}_i : \quad
  x_i[P_i I + f_i] = \mathcal{F}_i\big(\ldots, x_j[Q_j I - d_{ji}], \ldots\big)
  \quad \text{if} \;\; \mathcal{C}_i^{I}(I) \wedge \mathcal{C}_i^{RT}[I] \tag{2.4}
\]

where $x_i$, $x_j$ are linearly indexed variables, $\mathcal{F}_i$ denotes arbitrary functions, $P_i$, $Q_j$ are constant rational indexing matrices, and $f_i$, $d_{ji}$ are constant rational vectors of corresponding dimension. The dots $\ldots$ denote similar arguments. $\mathcal{I}_i \subseteq \mathbb{Z}^n$ is a linearly bounded lattice (LBL) called the iteration space of the quantified equation $S_i[I]$. The set of all vectors $P_i I + f_i$, $I \in \mathcal{I}_i$ is called the index space of the variable $x_i$. Furthermore, in order to account for irregularities in programs, we allow quantified equations $S_i[I]$ to have iteration dependent conditionals $\mathcal{C}_i^{I}(I)$ and run-time dependent conditionals $\mathcal{C}_i^{RT}[I]$. The iteration dependent conditions can be equivalently expressed by $I \in \mathcal{I}_{C_i} \subseteq \mathbb{Z}^n$, where the space $\mathcal{I}_{C_i}$ is an iteration space called the condition space. A run-time conditional ($\mathcal{C}_i^{RT}[I] = \mathcal{F}_b(\ldots, x_j[Q_j I - d_{ji}], \ldots)$) is given by a Boolean-valued function involving constants and indexed variables. In case of uniform dependencies, i.e., if $P_i$, $Q_j$ are all identity matrices, the DPLA is also called a dynamic piecewise regular algorithm (DPRA).

The following example shows the specification of a bilateral filter algorithm as a DPLA.

Example 2.1.1 Consider the loop nest of the bilateral filter shown in Figure 2.1(a). The bilateral filter is described by Equation (2.5), where y(i, j) is the processed output pixel and u(i−m, j−n) describes the neighbourhood of input pixels whose values are required for the computation. The convolution is carried out with a closeness mask c and a similarity mask s. The coefficients of the similarity mask depend on the difference of the origin pixel and the neighbourhood pixels, as shown in Figure 2.2(a). They also depend on the geometric spread constant $\sigma_d^2$ and the photometric spread constant $\sigma_r^2$ [54].

\[
  y(i,j) = \sum_{m=0}^{P} \sum_{n=0}^{Q} u(i-m, j-n) \cdot c(m,n) \cdot s(i,j,m,n) \tag{2.5}
\]
\[
  c(m,n) = e^{-(m^2+n^2)/2\sigma_d^2} \tag{2.6}
\]
\[
  s(i,j,m,n) = e^{-(u(i,j)-u(i-m,j-n))^2/2\sigma_r^2} \tag{2.7}
\]

Equations (2.5)-(2.7) can also be represented by the following DPLA:

S1:  u[i,j,m,n] = U[i−m, j−n]                  if i−m ≥ 0 ∧ j−n ≥ 0
S2:  c[i,j,m,n] = C[m,n]                       if i = 0 ∧ j = 0
S3:  u[i,j,m,n] = 0                            if i−m < 0 ∨ j−n < 0
S4:  d[i,j,m,n] = u[i,j,m,n] − u[i,j,0,0]
S5:  s[i,j,m,n] = LUT(d[i,j,m,n])
S6:  a[i,j,m,n] = s[i,j,m,n] · c[i,j,m,n]
S7:  z[i,j,m,n] = a[i,j,m,n] · u[i,j,m,n]
S8:  y1[i,j,m,n] = z[i,j,m,n]                  if m = 0 ∧ n = 0
S9:  y1[i,j,m,n] = y1[i,j,m,n−1] + z[i,j,m,n]  if n > 0
S10: y[i,j,m,n] = y1[i,j,m−1,n] + y[i,j,m,n]   if m > 0 ∧ n = Q−1
S11: m1[i,j,m,n] = a[i,j,m,n]                  if m = 0 ∧ n = 0
S12: m1[i,j,m,n] = m1[i,j,m−1,n] + a[i,j,m,n]  if m > 0
S13: m[i,j,m,n] = m[i,j,m,n−1] + m1[i,j,m,n]   if n > 0 ∧ m = P−1
S14: yo[i,j,m,n] = y[i,j,m,n] / m[i,j,m,n]     if m = P−1 ∧ n = Q−1


The iteration space is the Z-polyhedron given in Equation (2.2). The weights s[i,j,m,n] are computed as the exponential function of the difference of the pixel value (u[i,j,m,n]) and the center pixel (u[i,j,0,0]). The possible values of the exponential function are stored in a look-up table defined by the function LUT. The variables z[i,j,m,n] build the product of the mask coefficients (a[i,j,m,n]) and the corresponding pixels (u[i,j,m,n]). The variables y1[i,j,m,n] and y[i,j,m,n] store the intermediate and final weighted pixel sums. Similarly, m1[i,j,m,n] and m[i,j,m,n] store the intermediate and final sums of the mask coefficients. Finally, yo[i,j,m,n] computes the output pixel value. After the embedding and localization transformations (see Section 2.3.1.1), the DPLA can be represented as a DPRA. Statement S1 is then replaced by

S1^1: u[i,j,m,n] = u[i−1,j,m−1,n]  if i > 0 ∧ m > 0
S1^2: u[i,j,m,n] = u[i,j−1,m,n−1]  if j > 0 ∧ n > 0
S1^3: u[i,j,m,n] = U[i−m, j−n]     if m = 0

Recurrence equations denote a more general framework than sequential loop nests, since the dependence vectors in recurrence equations are not limited to being lexicographically positive, as is the case for loop nests. This leads to the computability problem, that is, a system of recurrence equations might not always be computable. In [34], Darte et al. showed that the problem is related to finding an explicit order of computations, i.e., a schedule. Our design flow uses an algorithm specification in the form of recurrence equations for the class of algorithms falling under DPLAs. For the entry of dynamic piecewise linear algorithms into our design tool, the language PAULA has been developed [86, 94].

Dependence Graph

A dependence graph leverages the geometrical representation of the polyhedral model for the representation of dynamic piecewise linear algorithms. The topology of the unfolded dependence graph (also known as the iteration space graph) is specified by the iteration space, i.e., each node of the graph is an iteration of the loop nest. An edge contains the dependency information, consisting of a variable name and the relative displacement of the source and target iterations. The representation of the iteration space graph can be very large, if not unbounded, as the number of nodes is equal to the number of iterations. DPRAs can also be succinctly represented by a multi-graph G called the reduced dependence graph (RDG).

Definition 2.1.4 (DPRA: Reduced Dependence Graph). Let a dynamic piecewise regular algorithm be given. Then the corresponding reduced dependence graph (RDG) is a graph G = (V, E, D), where V denotes the set of nodes and E ⊆ V × V denotes the set of edges. For each variable $x_i$ of the DPRA, there exists a node $v_i \in V$. Also, there exists an edge $e = (v_i, v_j) \in E$ annotated with an n-dimensional dependence vector $d_{ij} \in D$ if the variable $x_j$ is dependent on the variable $x_i$ of the algorithm.
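As an illustration, the following minimal C sketch (our own, with made-up field names; it does not reflect the internal data structures of the PARO tool) encodes a fragment of the bilateral filter RDG, namely the edge of statement S9 that accumulates y1 along the n-direction:

```c
#include <stdio.h>

#define NDIM 4  /* dimension n of the iteration space (i, j, m, n) */

typedef struct { const char *var; } node_t;            /* one node per variable   */
typedef struct { int src, dst; int d[NDIM]; } edge_t;  /* edge + dependence vector */

int main(void) {
    /* fragment of the bilateral filter RDG of Example 2.1.1:
       z feeds y1 within the same iteration (d = 0), and statement S9
       accumulates y1 along the n-direction (d = (0,0,0,1)) */
    node_t v[] = { { "z" }, { "y1" } };
    edge_t e[] = {
        { 0, 1, { 0, 0, 0, 0 } },
        { 1, 1, { 0, 0, 0, 1 } },
    };
    for (unsigned k = 0; k < sizeof e / sizeof e[0]; k++)
        printf("%s -> %s, d = (%d, %d, %d, %d)\n",
               v[e[k].src].var, v[e[k].dst].var,
               e[k].d[0], e[k].d[1], e[k].d[2], e[k].d[3]);
    return 0;
}
```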


Figure 2.2.: (a) Bilateral filter, (b) reduced dependence graph.

A dependence vector $d_{ij} \neq 0$ indicates a dependence between variables across different iterations. The extended version of the RDG used in our design flow contains different types of nodes and edges, each with attributes denoting the variable, the functionality, the iteration space, and the indexing function [86]. The reduced dependence graph of the bilateral filter algorithm in Example 2.1.1 is shown in Figure 2.2(b). Each node is associated with an equation and the corresponding operation. The edges are annotated with dependence vectors, which denote inter-iteration dependencies. The reduced dependence graph is a compact graph representation of the corresponding loop program formulated by recurrence equations. It is needed for solving the local allocation and scheduling problem. In the next section, we describe the fundamentals of the modular modeling of applications with multiple communicating loops.

2.1.2. Specification of Communicating Loop Nests

Modularity is advocated for application specification: an application is divided into smaller parts, which are scheduled separately. The modular application representation not only exhibits task-level parallelism explicitly, but is also intuitive and allows good software engineering practices. The lack of analyzability and expressiveness of sequential languages like C, C++, and Java led to research on Models of Computation (MoC). A MoC denotes the semantics of the interaction between modules or components.


Figure 2.3.: Multi-rate SDF system of a DAT-to-CD rate converter, which converts a 44.1 kHz sampling rate to 48 kHz.

A MoC is also helpful for system synthesis, analysis, verification, and optimization. Therefore, there has been a significant amount of work on models of computation for representing data and control intensive applications. Well-known MoCs are differential equations, Petri nets, Statecharts, communicating sequential processes (CSP), and many others [25]. For modeling compute intensive streaming applications, the focus has been on data flow MoCs like synchronous data flow (SDF) [121], cyclo-static data flow [18], multi-dimensional SDF [138], Kahn process networks [104], and their variants. The fundamental structure of data flow models of computation is a graph, where the nodes represent processes called actors and the edges represent data communication in terms of so-called tokens. The functionality of the actors can be specified in an arbitrary programming language.

The need for modelling digital signal processing applications for software systems inspired SDF in the seminal work by Lee et al. [121]. An SDF graph is a network of synchronous nodes which models a fixed production rate p and consumption rate c of tokens to specify the relative sample rates of each process in a signal processing system (see Figure 2.3). The basic idea is that a process reads and writes a fixed number of tokens (c and p, respectively) each time it fires. One can determine a static schedule and buffer requirements, and analyze properties like deadlocks, by solving the balance equations.

The major disadvantage associated with system level modelling using SDF is the lack of a representation for multi-dimensional data arrays, which are ubiquitous in streaming image and video processing applications. Therefore, an extension called multi-dimensional SDF (MD-SDF) [138] was proposed, where the edges represent production and consumption rates as vectors $\vec{p}$ and $\vec{c}$ denoting multi-dimensional array transfers. MD-SDF is, however, not able to model the execution order (i.e., the order in which loop iterations are executed). Therefore, it is not able to represent out-of-order communication arising due to blocking or tiling [44]. In order to remove these modelling limitations, windowed data flow (WDF) and its static counterpart windowed synchronous data flow (WSDF) were introduced in [108]. WSDF can model applications characterized by the presence of multiple communicating loops and offers a framework for buffer estimation and communication synthesis. As it will be required in this thesis for modelling applications given by communicating loop nests, we introduce this MoC in detail next.
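To make the balance equations mentioned above concrete before turning to WSDF: for every edge between a producer firing $r_{src}$ times and a consumer firing $r_{snk}$ times per graph iteration, $p \cdot r_{src} = c \cdot r_{snk}$ must hold. The following minimal C sketch (with illustrative rates, not the actual rates of Figure 2.3) computes the smallest integral repetition vector for a chain of actors:

```c
#include <stdio.h>

static long gcd(long a, long b) { return b ? gcd(b, a % b) : a; }

int main(void) {
    /* chain of 4 actors; edge i connects actor i to actor i+1 */
    long p[] = {2, 3, 1};  /* tokens produced per firing of the edge source */
    long c[] = {3, 2, 2};  /* tokens consumed per firing of the edge sink   */
    enum { N_ACTORS = 4 };
    long r[N_ACTORS];
    r[0] = 1;
    for (int i = 0; i < N_ACTORS - 1; i++) {
        /* enforce the balance equation p[i]*r[i] = c[i]*r[i+1]; scale all
           repetition counts found so far to keep r[i+1] integral */
        long g = gcd(r[i] * p[i], c[i]);
        long scale = c[i] / g;
        for (int j = 0; j <= i; j++) r[j] *= scale;
        r[i + 1] = r[i] * p[i] / c[i];
    }
    for (int i = 0; i < N_ACTORS; i++)
        printf("actor %d fires %ld times per iteration\n", i, r[i]);
    return 0;
}
```

For the example rates, the smallest repetition vector is (6, 4, 6, 3); a static schedule then fires each actor that often per graph iteration, which also bounds the required buffer sizes.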

Definition 2.1.5 Windowed Synchronous Data Flow (WSDF) [108] is a data flow graph, i.e., a tuple G = (V, E, D), where V is a set of nodes representing processes, E is a set containing directed edges, and D = (p, v, c, ∆c, δ, b_s, b_t, O^write, O^read) is a set of data labels that are partial functions from E into some specified range of values.

The WSDF graph is explained in detail as follows. Each vertex of a windowed synchronous data flow graph represents a process, also called an actor, whose functionality is described by a nested loop program. Here, too, the functionality can be specified in an arbitrary programming language. The kernel of the loop nest contains the computations, which read from the input ports, process the data, and provide the results on the output ports.

Virtual token, v: Each edge e ∈ E of the data flow graph represents the transport of a multi-dimensional array (also called a virtual token) of size v ∈ ℕ^n from the edge source to its sink, with n ∈ ℕ being the number of dimensions of the transferred variable. Intuitively, it may represent a complete image or a sub-image (for n = 2).

Producer token vector, p, and consumer token vector, c: The n-dimensional arrays being transported are given by a virtual token v. However, they are not produced and consumed as a whole. Instead, each source invocation generates a so-called effective token which consists of p ∈ ℕ^n data elements. It characterizes the fine-grained parallel data transfer of pixels or small array blocks. Each sink invocation, in turn, consumes an effective token whose size is given by c ∈ ℕ^n.

Communication order, O^write and O^read: They model the communication order of the source and sink loop, respectively. These functions contain a list of vectors which defines a sequence of source and sink invocations, defining a block hierarchy of write and read operations. For example, a simple raster scan order of the read execution of an s × s image is represented as O^read = [(s,1)^T, (s,s)^T].

Window movement sampling vector, ∆c, and data offset vector, δ: The window movement sampling vector ∆c gives the distance between two consecutive sink invocations. This is useful for modelling upsampling and downsampling operations. The data offset vector δ represents a multi-dimensional array that denotes the minimum amount of data tokens required to start the computation. It is required for modelling feedback loops [106].

Boundary extension vectors, b_s and b_t: The vectors b_s and b_t represent the virtual border extensions for dealing with boundary handling methods, which arise when pixels on the border strip do not have enough neighbouring pixels to process the filter algorithm. They are used to model zero padding and symmetric extensions for image processing applications like the discrete wavelet transform [28].

Several applications like binary morphological reconstruction, the lifting-based wavelet kernel, JPEG2000, etc. can be represented in WSDF (see [106]). In order to supplement the understanding of the above introduced concepts, we consider the following example.



Figure 2.4.: (a) The shaded pixels show the pixels already processed in different stages of the accelerator pipeline, (b) the graph denotes the windowed synchronous data flow notation; all edges of the graph have the same values of p, c, v, O^write, O^read, ∆c, δ, therefore only a single edge has been annotated, and only the boundary extension vectors b_s, b_t differ, (c) non-localized version of the 3 × 3 convolution loop program (actor Conv3x3), (d) the same loop program with localized dependencies and boundary handling.


Example 2.1.2 A Stereo Depth Extraction (SDE) algorithm finds the distance of objects from the camera with the help of two fixed cameras [35]. The application consists of several convolution filters followed by a sum of absolute differences (SAD) computation. The SDE algorithm represented in WSDF notation is shown in Figure 2.4(b), which illustrates the communication parameters per edge. The chosen mapping leads to a rate-matched production and consumption of a single pixel. Therefore, p = c = (1,1)^T denotes the effective tokens contributing to the transport of the complete image array, v = (512,512)^T, for all edges. Similarly, the execution order of the computation nodes is raster scan order (i.e., row major). Therefore, O^write = [(512,1)^T, (512,512)^T] and O^read = [(512,1)^T, (512,512)^T] represent the communication orders of all edges. The sliding window moves by one pixel in the horizontal direction and one pixel in the vertical direction (hence, ∆c = (1,1)^T). As in realistic applications, the sliding window can transcend the image boundary, in which case the pixel array is virtually extended with a border, as illustrated in Figure 2.4(a). The data offset vector δ is a zero vector as there is no feedback loop.

A modular model of computation, WSDF, has thus been chosen for representing applications. In the next section, we present other approaches for the modelling of loop programs and types of concurrency.

2.1.3. Related Work

Loop nests and recurrence equations are popular models for representing computationally intensive algorithms. Most high-level synthesis (HLS) tools use loop nests in imperative languages for specification. In order to support loop optimizations and parallelization, they build on compilers like gcc [76], LLVM [119], and Open64 [143], which use a so-called static single assignment (SSA) form as intermediate representation. However, SSA is restricted to the basic block level and does not apply to the loop level. Therefore, almost all high-level synthesis tools rely on loop unrolling to extract parallelism.

Unlike other HLS methodologies, we rely on recurrence equations (DPLAs) for the representation of loop kernels. The major advantage is the explicit representation of parallelism, which allows powerful transformations like tiling and scheduling, leading to a higher quality of results (QoR). Karp et al. first described dependence vectors and systems of uniform recurrence equations in their seminal work on structured computations [105]. Since then, several works have contributed to the modelling of recurrence equations through iteration dependent conditions (RIAs) [151], affine dependencies (SAREs) [150], representation of iteration spaces as LBLs (PLAs) [164], reductions (AIAs) [58], and data-dependent conditions (DPLAs) [95]. We refer to [86] for more details on the evolution of modelling by recurrence equations.

A restricted class of loop nests called static control parts (SCoPs) [13] can be converted to recurrence equations. Feautrier [65] introduced an algorithm for converting

these loop nests into single assignment form as a set of recurrence equations. In recent years, there have been novel ideas on the transformation of loop programs to single assignment form using the polyhedral model, which are being integrated into industry-grade compilers [148].

Application modelling The traditional categories offering concurrency in programming languages are based on the usage of (a) actor/process networks, (b) data-parallel/SPMD (single program, multiple data), or (c) shared-memory/dynamic threads.

The design methodology in high-performance embedded computing has traditionally used actor/process networks to compose complex systems from communicating agents. The cumbersome access to shared and mutated state in the shared-memory and data-parallel programming models is adequately compensated by the safe and intuitive programming with communicating components in actor/process networks. A huge body of research on data flow graphs and tools like Simulink [99] and Ptolemy [25] supports actor-based modelling.

Communicating sequential processes (CSP) [96] and Kahn process networks (KPN) [104] model applications where components are sequential processes that run concurrently. Programming languages like Handel-C [159] and Occam [154] are based on the CSP model of computation. In KPN, the communication takes place over unbounded FIFOs with blocking reads. ESPAM [142] is a design flow based on KPN for synthesizing multiprocessor systems-on-chip (SoC). DoL [173] is a software framework, also based on the Kahn process network model of computation, for specifying applications for multi-processor systems. A mix of XML and C programs is used to specify the process network.

Data flow models like SDF and their variants are good for representing applications without complex control flow. Several hardware and software programming systems like StreamIT [174], PeaCE [84], and others are rooted in the SDF MoC and its extensions. One can also use other data flow models of computation describing communicating loops like multi-dimensional synchronous data flow (MD-SDF) [138] or windowed synchronous data flow graphs (WSDF) [108]. WSDF is a framework built on MD-SDF which can undertake memory estimation and communication synthesis.

The task-level parallelism of communicating loops was also represented as a polyhedral reduced dependence graph (PRDG) for stream-based function (SBF) networks by Rijpkema et al. in [153]. The polytope model was used to describe applications consisting of communicating loops in [27]; that work concentrated on efficient data access and storage for programmable processors. Alpha [178] is a single assignment language based on the polyhedral model of compilation, which is targeted at the synthesis of systolic arrays for corresponding loop programs.

The Kahn process networks (KPN) and data flow graphs are powerful models for

representing and mapping communicating loops onto parallel hardware. One can determine a static schedule and buffer requirements, and analyze properties like deadlocks. Powerful synthesis methods have been developed in the polyhedral model, leading to a high quality of results (QoR). There are very few works which try to exploit the best of both worlds, i.e., the analysis and communication synthesis capability of data flow models, and the high-QoR synthesis in the polytope model, respectively. The interaction between these two domains in the context of multi-dimensional modelling has received little consideration. Keinert et al. utilized analysis in the polyhedral domain for memory estimation and the synthesis of channels in WSDF graphs [106]. Sen et al. studied the problem of converting affine nested loop programs modeled as polyhedral reduced dependence graphs (PRDG) into binary parametrized cyclo-static data flow graphs [38]. The Array-OL specification model has been projected onto KPN for mapping onto distributed architectures [5].

The related work on loop nests, recurrence equations, different categories of concurrency, and some models of computation has shown that there are different approaches for modelling applications containing loop programs. In this thesis, we propose an intermediate modular representation of communicating loops in the polyhedral model called the loop graph. We also propose a methodology for converting a loop graph into a corresponding windowed synchronous data flow representation. The data flow model is then leveraged for the synthesis of HW/HW communication and HW/SW communication for communicating loop accelerators that can be synthesized automatically for each loop nest.

2.2. A Generic Accelerator Scheme

A loop accelerator is a specialized processor which executes computationally intensive tasks specified by a loop nest with high performance and in a power-efficient manner. Often, applications are characterized by multiple communicating loops. In this case, a pipeline of accelerators is needed for the custom execution of the application. It must be noted that we use the term loop accelerator both for a single accelerator and for a subsystem consisting of a pipeline of accelerators. The accelerators can also act as co-processors to a general purpose processor. Therefore, custom integration into a system-on-chip (SoC) is an important aspect of the whole architecture. In the following subsections, we (a) classify and characterize a single accelerator architecture model and (b) focus on accelerator pipeline interface architectures for system integration.

2.2.1. Characterization and Classification of Loop Accelerators

Accelerators may vary in granularity, ranging from dedicated hardware circuits to programmable coarse-grained processors like PACT-XPP [14] and WPPAs [55]. A scheme of an accelerator-based system architecture developed in the course of this thesis is shown in Figure 2.5.



Figure 2.5.: (a) Overall system architecture of an SoC containing multiple communicating loop accelerators, (b) a generic accelerator engine.

The architecture components of the accelerator can be broadly divided into three categories: data path, control path, and storage. The computation unit of an accelerator may consist of several (non-)programmable and/or reconfigurable processor elements (PEs). The data paths of the PEs contain functional units (FUs), registers, and local memory. The binding of operations of a given loop kernel onto the limited resources of a PE (i.e., resource sharing) requires a local control path. The control path is used to control the computations in the data path of an accelerator; it contains finite state machine (FSM) logic for controlling multiplexers. In the case of programmable accelerators, the local control path consists of a program memory, a program counter, and an instruction decoder. In addition, a global control and counter unit, which controls the global schedule of the loop iterations, is needed. Global control signals are propagated to the PEs through the interconnect network. The memory within accelerator PEs (i.e., registers and local memory) stores data for reuse. Several banks of addressable memory or FIFOs (also known as scratch-pad memory) are available at the border for providing the necessary data bandwidth. The accelerators can also act as co-processors in an SoC system.

Each accelerator may be defined differently depending on the performance, flexibility, and cost trade-off, which is determined by the granularity of the data path, control path, and storage. They can be broadly classified as (a) stream-centric, (b) reconfiguration-centric, and (c) hardware-centric.

Stream-centric accelerator architectures refer to homogeneous or heterogeneous multiprocessor chip architectures. They are characterized by the presence of many processor cores and single-precision or double-precision floating-point units, and support

multi-threading. The popularity of digital signal processors (DSPs), the Cell broadband engine [103], GPUs like Tesla [124], Larrabee [82], and others supports their viability. They are usually used as coprocessors or add-on boards in supercomputing and mainstream computing.

Reconfiguration-centric accelerators refer to dynamically reconfigurable processors, also called coarse-grained reconfigurable arrays. These architectures are characterized by the presence of multiple lightweight PEs with ALUs, multipliers, and multiple contexts for reconfiguring the processor at run-time. Several coarse-grained array architectures have been developed both in academia and in industry, like PACT XPP [14], ADRES [128], the Montium Tile Processor from Recore Systems [152], DRP from NEC [136], MorphoSys [160], WPPA [55], and many others. These architectures have their particular idiosyncrasies with respect to the degree of programmability, memory structure, and communication bandwidth. A brief survey on dynamically reconfigurable accelerator architectures is presented in [4].

The hardware-centric category includes dedicated loop accelerators. Unlike the previous categories, these accelerators are non-programmable. They may implement a computationally intensive application with far higher performance and power efficiency than a programmable/reconfigurable processor. They exploit loop-level parallelism through a processor array architecture with PEs organized in a grid structure such as shown in Figure 2.5. Each PE has a local control path, registers for storage, and delay registers as interconnect memory.

The architecture model of a hardware accelerator contains information on the resources and their binding possibilities in a library. Each component of such a high-level resource library of functional units (adders, ALUs, multipliers, comparators), connectors (multiplexers, demultiplexers), and memories (registers) is annotated with its area, power, bit-width, latency, and pipeline rate. An overview of typical resources is given in Table 5.2 on page 153 in Chapter 5. We use the PAULA language [94] not only for specifying loop behavior but also for describing the architecture model. We refer to [94] for the definition of architecture models and loop programs.

There has been a plethora of handcrafted accelerator IPs for different applications in signal and image processing, wireless, and other domains. The accelerators can be realized on field programmable gate arrays (FPGAs) or as application-specific integrated circuits (ASICs). These platforms can also realize the complete SoC.

After having described different types of loop accelerators, we concentrate on hardware-centric accelerators, which implement a loop program with the highest efficiency and can be realized either on ASICs or FPGAs. In the next section, we look at aspects of accelerator generation for applications consisting of multiple loops.

2.2.2. Accelerator Subsystem for Streaming Application

Computationally intensive applications are typically characterized by multiple communicating loops, such as the stereo depth extraction (SDE) algorithm shown in Example 2.1.2 and Figure 2.4.

Apart from the high-level synthesis of a hardware loop accelerator, two other aspects are important.

Firstly, in the case of a custom mapping of each given loop nest to a hardware accelerator, it is not possible to implement different loops on the same piece of hardware. The favored approach is to have a pipeline of hardware accelerators connected over a dedicated communication subsystem, as also shown in Figure 2.5. FIFOs or SRAMs can be used to implement the communication memory. The synchronization between the individual accelerators can take place with the help of a timing controller or blocking read/write operations. The channel synthesis depends on the throughput, scheduling, and algorithm parameters of the source and target loop accelerators, as they determine memory mapping, parallel data access, and out-of-order communication. Using these parameters, one can undertake not only memory estimation, but also custom memory architecture synthesis and address generation. In the final step, the overall system is assembled by instantiation of the generated hardware accelerators interconnected by channels. We discuss different approaches of related work in the field of accelerator synthesis in Section 2.5 on the basis of support for multi-rate systems, data reuse, out-of-order communication, parallel data access, and memory mapping. As will be shown later, the major shortcoming of other works based on multi-dimensional models of computation (KPN, SDF, and others) is that none of them solves or supports all of these requirements in their entirety. In [106], the WSDF model has been used for buffer analysis and communication synthesis. We leverage this model of computation for the communication synthesis of communicating accelerators in Chapter 4.

Secondly, the accelerators or accelerator pipelines are often integrated into a single SoC. In order to support communication and synchronization between processors and the accelerators, software device drivers depending on the schedule and algorithm parameters of the accelerators need to be generated. The integration also requires the generation of a hardware wrapper for interfacing the system to the accelerator over the interconnect of choice (bus, P2P, or NoC). In Chapter 4, we introduce our methodology for the automated synthesis of the memory map, software driver, and interface circuits under the aegis of system integration.

Despite the classification into different accelerator architectures, the basic structure of multiple PEs, typically organized in a mesh structure, including distributed local memories and control paths, with memory banks girdling the accelerator data path, is similar for all accelerator categories. The mapping process of applications onto stream-centric/reconfigurable processors is similar and often referred to as multi-core compilation, whereas the generation of a hardware accelerator circuit for a loop program is achieved by high-level synthesis. The main difference between both is that in multi-core compilation, the application is mapped onto a given architecture, while using high-level synthesis, a suitable architecture for the application is generated. In this dissertation, we concentrate on compilation for hardware-centric accelerators (i.e., based on high-level synthesis). The loop iterations can be executed simultaneously on multiple processing elements (PEs) in the accelerator data path (loop-level parallelism).

The complex operations inside the loop kernel for processing a single loop iteration can be mapped onto functional units within a PE (instruction or operation level parallelism).

To summarize, the task of high-level synthesis tools is not only to generate a hardware accelerator from a high-level description of a loop nest; the twin tasks of communication synthesis and system integration are also necessary for the platform-based synthesis of the accelerator subsystem for streaming applications. In the next section, we give an overview of the fundamentals of our design flow for high-level synthesis.

2.3. High-level Synthesis of Hardware Accelerators

In the previous section, we identified the problem of generating hardware-centric accelerators as our subject of interest. Now, the problem of automatically generating a hardware accelerator from a specification of a loop nest in a high-level language will be solved through high-level synthesis (HLS). In this section, we give a primer on our HLS-based design flow (front end and back end). Figure 2.6 shows the design flow called PARO [53, 92] for generation of hardware accelerators. The front end deals with source-to-source loop transformations, scheduling and allocation. The back end is responsible for RTL/code generation for each accelerator including the generation of communication channels and interfaces.

2.3.1. Front End: Loop Transformations

The front end includes the parsing of the input loop algorithm, loop optimizations, allocation, and scheduling. Each loop algorithm is individually specified using the PAULA language [94]. This language is designed to specify a loop nest with the properties of dynamic piecewise linear algorithms (DPLAs) as the underlying semantics. The typical design trajectory to generate such an accelerator is shown in Fig. 2.6. In the first step, the input program along with the architecture specification in the PAULA language is parsed. The restrictions on the initial PAULA specification for hardware synthesis are that

• all the equations of the algorithm must be in output normal form¹. The equations must have affine (i.e., Q_j ≠ E) or uniform dependencies (i.e., Q_j = E), where E is the identity matrix. The functions F_i can also be reduction operations like sums or products.

• The iteration space of the algorithm must be a Z-polytope.

¹ A PLA with each equation of the form x_i[I'] = F_i(..., x_j[Q'_ji · I' − d'_ji], ...) is said to be in output normal form according to [164].



Figure 2.6.: PARO design flow for accelerator generation [53, 92].

The high-level synthesis toolbox contains source-to-source program transformations which recast an input loop program with affine dependencies that is not in output normal form into an equivalent DPLA program in PAULA satisfying the above properties. Furthermore, algorithms like multi-rate downsampling that have iteration spaces in the form of linearly bounded lattices can also be rewritten as Z-polytopes. The major aims of the transformations are program simplification and the tuning of the synthesis output for maximizing or minimizing attributes like performance, latency, and others. Some of the major transformations are summarized in the following subsection.

2.3.1.1. Program Transformations

Standard compiler optimizations like common sub-expression elimination (CSE), constant and variable propagation, dead-code elimination, operator strength reduction, expression splitting, and others [137] are available. They lead to improved register and functional unit usage through the elimination of redundant code and the optimization of arithmetic expressions. The CSE transformation identifies identical expressions and replaces them by a common variable. Dead-code elimination removes redundant code which is not used in the program. Operator strength reduction replaces costly operations like integer divisions and multiplications with a combination of shifts, adds, and subtracts.
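To make two of the listed optimizations concrete, the following minimal C sketch shows CSE and operator strength reduction on an invented loop kernel; it is an illustration, not output of the PARO toolbox.

    #include <stddef.h>

    /* Before: the sub-expression (a[i] + b[i]) is computed twice, and the
       multiplication by 8 is comparatively costly. */
    void kernel_before(const int *a, const int *b, int *y, size_t n) {
        for (size_t i = 0; i < n; i++)
            y[i] = (a[i] + b[i]) * 8 + (a[i] + b[i]);
    }

    /* After CSE and operator strength reduction: the common sub-expression
       is computed once, and the multiplication by a power of two becomes
       a shift. */
    void kernel_after(const int *a, const int *b, int *y, size_t n) {
        for (size_t i = 0; i < n; i++) {
            int t = a[i] + b[i];   /* common sub-expression held once */
            y[i] = (t << 3) + t;   /* 8*t rewritten as a shift and an add */
        }
    }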


A major restriction of our design flow is that the equations specifying a dynamic piecewise linear algorithm must be in output normal form. The transformation output normal form (ONF) converts an equation to its output normal form. It involves affine transformations of the index space and the dependencies. For example,

\[
  x[i+1, j+1] = a[i+1, j] \cdot b[i, j-1]
  \;\;\longrightarrow\;\;
  x[i, j] = a[i, j-1] \cdot b[i-1, j-2]
\]

Loop perfectization is a transformation which converts non-perfectly nested loops into perfectly nested loops. There also exist affine loop transformations like loop skewing, embedding, and others [158] which can be modeled in the polytope model. The I/O variables in a DPLA description of the nested loop algorithm with different numbers of indices need to be embedded into an index space with the same dimension as the iteration space. Mathematically, embedding can be described by an affine transformation Λ · I + γ such that a variable y is embedded into the index space as y[Λ · I + γ] [164].

Loop unrolling is a major optimization transformation which exposes parallelism in a loop program. Loop unrolling by a factor n expands the loop kernel by enumerating n − 1 consecutive iterations.

Localization is a transformation which converts affine data dependencies into uniform data dependencies by the propagation of variables from one iteration point to neighboring iteration points [164, 126] (a sketch is given at the end of this subsection). The transformation enables maximum data reuse within the processor array and hence minimizes the amount of external I/O communication (with peripheral memory) by replacing broadcasts with short propagations.

Expression splitting is applied for the decomposition of complex expressions. After its application, the program has equations with expressions containing only a single function and its arguments on the right hand side.

After parsing the program and applying transformations, certain pre-processing steps need to be carried out before the generation of a reduced dependence graph. Each operation like "+" or "∗" in the PAULA language has to be replaced by a corresponding function add(a,b) or mul(a,b). The expression tree of each equation is parsed recursively. Subsequently, using the function declarations in the architecture model and the node attributes of the expression tree, each node is assigned a return data type. This information on the return data type and function name becomes a node attribute in the reduced dependence graph. After replacing operations by functions, propagating data types, and expression splitting, a reduced dependence graph (RDG) can be generated. The nodes of the RDG are annotated with binding possibilities, return data type, equation number, node type, variable name, and iteration space. The most important transformation in our design flow, tiling, is discussed in the next subsection.
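As the promised sketch of localization, consider a coefficient that is broadcast along one loop dimension; the following C fragment (invented variable names, not PAULA/PARO output) shows the broadcast being replaced by a short propagation:

    #define N 8
    #define M 8

    /* assumed inputs/outputs, invented for this sketch */
    int a[M], u[N][M], y[N][M], a_loc[N][M];

    void before_localization(void) {
        /* a[j] is broadcast: the affine dependence lets every row i
           read the same external coefficient. */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < M; j++)
                y[i][j] = a[j] * u[i][j];
    }

    void after_localization(void) {
        /* the coefficient is read from memory only at i == 0 and then
           propagated with the uniform dependence (i,j) <- (i-1,j). */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < M; j++) {
                a_loc[i][j] = (i == 0) ? a[j] : a_loc[i - 1][j];
                y[i][j] = a_loc[i][j] * u[i][j];
            }
    }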


(a)
u[8][8]; // input image
y[8][8]; // output image
for (j=0; j<8; j++)
  for (i=0; i<8; i++) {
    y[i][j] = g*u[i][j] + o; // S1(i,j)
  }

(b)
u[8][8]; // input image
y[8][8]; // output image
for (j2=0; j2<4; j2++)       // outer tile
  for (i2=0; i2<4; i2++)
    for (j1=0; j1<2; j1++)   // inner tile
      for (i1=0; i1<2; i1++) {
        y[i1][j1][i2][j2] = g*u[i1][j1][i2][j2] + o; // S1'(i1,j1,i2,j2)
      }

Figure 2.7.: (a) A perfectly nested loop nest and its iteration space, (b) Corresponding loop nest and its iteration space after tiling.

2.3.1.2. Tiling

Tiling is a transformation which enables matching a loop algorithm implementation to the accelerator architecture. It covers the iteration space of the computation using congruent hyperplanes or parallelepipeds called tiles. Through the selection of the tile size and the tiling strategy, one decides the global allocation of processor elements, the memory usage, and the communication bandwidth.

The transformation, also known as loop tiling or partitioning, has been studied in detail for compilers. Its use has led to efficient loop implementations through better cache reuse on sequential processors, and to implementations of algorithms on parallel architectures from supercomputers to multi-DSPs and FPGAs [189, 158]. For hardware accelerators in the form of processor arrays (PAs), it is carried out in order to match a loop nest implementation to resource constraints such as the number of processors and the processor array dimension.

We illustrate the transformation with a gain-offset algorithm, a perfectly nested loop without inter-iteration dependencies. Figure 2.7(a) shows the loop nest and the corresponding iteration space. Figure 2.7(b) shows the corresponding loop nest after tiling the iteration space with a tile given by the tiling matrix P = diag(2, 2).

Mathematically, given a tiling matrix P, the iteration space is decomposed as follows:


\[
  \mathcal{I} \;\mapsto\; \underbrace{\mathcal{I}^1}_{\text{(inner tile)}} \oplus \underbrace{\mathcal{I}^2}_{\text{(outer tile)}} , \quad \text{where}
\]
\[
  \mathcal{I}^1 \oplus \mathcal{I}^2 = \{\, I = I_1 + P \cdot I_2 \mid I_1 \in \mathcal{I}^1 \,\wedge\, I_2 \in \mathcal{I}^2 \,\wedge\, P \in \mathbb{Z}^{n \times n} \,\} \tag{2.8}
\]

I^1 and I^2 denote the iterations inside a tile and the origins of the tiles, respectively. Different parallelization strategies can be undertaken. Well known among them are the local sequential and global parallel (LSGP) scheme (also known as outer loop parallelization) and the local parallel and global sequential (LPGS) scheme (also known as inner loop parallelization) [24, 164]. These strategies allocate the iterations to processor elements in a processor array and are discussed in detail in Section 2.3.2.1.

The twin problems of loop tiling are: (a) How do we obtain the tiled iteration space? (b) How do the statements inside the loop kernel have to be modified? Furthermore, hierarchical tiling might be necessary for matching loop programs to architectures with multiple hierarchies of memory and parallelism. Therefore, both problems need to be solved in the context of hierarchical tiling. In [164], the problem is solved for a single level of partitioning that encompasses both LSGP and LPGS. In [58], co-partitioning is introduced as 2-level partitioning, which solves the code generation problem only in the context of uniform dependencies. Note that there is no existing work which solves both problems in the context of generic hierarchical tiling. In this thesis, we introduce a methodology for (a) the tiling of the iteration space and (b) the code generation of statements for the tiled loop in the context of recurrence equations in Section 3.1.2. This enables automated hierarchical tiling.
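As a worked instance of Equation (2.8) for the tiling of Figure 2.7, where P = diag(2, 2) covers the 8 × 8 iteration space, the iteration point I = (5, 3)^T decomposes uniquely as

\[
  I = \begin{pmatrix} 5 \\ 3 \end{pmatrix}
    = I_1 + P \cdot I_2
    = \begin{pmatrix} 1 \\ 1 \end{pmatrix}
    + \begin{pmatrix} 2 & 0 \\ 0 & 2 \end{pmatrix} \begin{pmatrix} 2 \\ 1 \end{pmatrix} ,
\]

i.e., it is iteration I_1 = (1, 1)^T inside the tile with origin index I_2 = (2, 1)^T.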

2.3.2. Front End: Scheduling

The designer needs to deal with the problem of scheduling, which assigns loop iterations and loop kernel operations to time steps for execution on the accelerator's PEs. Furthermore, the loop computations must be allocated to the processing elements with optimal granularity in terms of computation intensity, data parallelism, and locality.

The derivation of latency-optimal static schedules in our design flow is based on the formulation and solution of a mixed integer linear program (MILP) [86]. Particular to our method is a holistic approach, where the schedule within a processor element (local schedule) and the schedule between all PEs of the processor array (global schedule) are optimized simultaneously. The global allocation of iterations to processor elements is determined by the tiling strategy, whereas the local allocation of resources within a PE is specified in the architecture model or determined from a prescribed throughput.


2.3.2.1. Global Scheduling and Binding

The intuitive nature and minimal overhead of linear scheduling and allocation functions have made them popular for mapping loop computations. This is realized by using the following affine transformation:

\[
  \begin{pmatrix} p \\ t \end{pmatrix}
  = \underbrace{\begin{pmatrix} Q \\ \lambda \end{pmatrix}}_{\Theta} \cdot I
  + \begin{pmatrix} q \\ \gamma \end{pmatrix} \tag{2.9}
\]

which assigns to each iteration point I ∈ I a processor index p ∈ P (allocation) and a time index t ∈ T (scheduling). Here, Q ∈ ℤ^(s×n) and λ ∈ ℤ^(1×n) denote the allocation matrix and the scheduling vector, respectively. The offset vectors q ∈ ℤ^s and γ ∈ ℤ are optional. T ⊂ ℤ is called the time space, that is, the set of all time steps where executions take place. P ⊂ ℤ^s is called the processor space. In the literature, the affine transformation Θ is known as space-time mapping [117], scheduling function and processor allocation [69], and scattering function [13].

The processor allocation requires the determination of Q, which depends on the tiling strategy. The tiling transformation initially decomposes the iteration space into inner and outer tile loops, I ↦ I^1 ⊕ I^2. Depending on the tiling scheme, the elements within a tile have to execute in parallel (LPGS) or sequentially (LSGP), and the tiles themselves will be executed either sequentially (LPGS) or in parallel (LSGP). The advantage of the LPGS method is the minimal requirement of local memory for each PE. However, this benefit is offset by high communication costs and external memory requirements. In LSGP, the communication cost is minimal, but either the amount of local memory or the number of required PEs is controlled by the selection of the tile size. Therefore, the programmer plays a major role in determining the degree of parallelism through tile size and tiling strategy selection. Mathematically, the processor allocation is given by

\[
  p = \underbrace{\begin{pmatrix} E & 0 \end{pmatrix}}_{Q} \begin{pmatrix} I_1 \\ I_2 \end{pmatrix} \;\;\text{(LPGS)}
  \qquad
  p = \underbrace{\begin{pmatrix} 0 & E \end{pmatrix}}_{Q} \begin{pmatrix} I_1 \\ I_2 \end{pmatrix} \;\;\text{(LSGP)}
\]

where 0 and E denote the zero and identity matrix, respectively. I_1 and I_2 each denote n-dimensional iteration vectors for the indexing of iterations within a tile and the origins of tiles, respectively. The allocation matrix Q contains the information on which loop variables are executed sequentially or in parallel. To summarize, Q is a linear allocation matrix which is determined by the tiling strategy (i.e., LPGS, LSGP, or hierarchical tiling) and the tiling matrices. The scheduling function is represented by the affine transformation

\[
  t_i(I) = \lambda \cdot I + \tau(v_i) \tag{2.10}
\]

Here, λ is called the global schedule vector and provides the start time of each iteration point I; τ(v_i) gives the start time of each variable computation within the loop iteration. The overall start time of a variable (node) v_i of a reduced dependence graph G at iteration point I is given by Equation (2.10). The schedule vector λ and τ(v_i) determine the global and local schedule, respectively. In the case of a tiled iteration space with the LSGP or LPGS tiling scheme, the scheduling function may be defined as follows:

\[
  t_i(I) = \begin{pmatrix} \lambda_{par} & \lambda_{seq} \end{pmatrix}
           \begin{pmatrix} I_1 \\ I_2 \end{pmatrix} + \tau(v_i)
  \quad \text{(LPGS)} \tag{2.11}
\]
\[
  t_i(I) = \begin{pmatrix} \lambda_{seq} & \lambda_{par} \end{pmatrix}
           \begin{pmatrix} I_1 \\ I_2 \end{pmatrix} + \tau(v_i)
  \quad \text{(LSGP)} \tag{2.12}
\]

where λ_par and λ_seq are the parts of the schedule vector which correspond to the iteration variables being executed in parallel and sequentially, respectively. For the iterations or tiles which shall be executed sequentially, a sequential order of execution, also called a scanning order, has to be defined. This information may be represented by a loop matrix, which can be given by the user or determined by a heuristic.

Definition 2.3.1 (Loop matrix [164]) R = (r_1 r_2 ... r_s) ∈ ℤ^(s×s) determines the ordering of iteration points within a parallelepiped-shaped tile. The iteration points in the direction of r_1 are mapped side by side. Iteration points in the direction of r_2 are separated by blocks of points in the direction of r_1, and so on. The ordering is similar to a sequential nested loop program where the loop index i_k corresponds to iterations in the direction of r_k. The innermost loop index is i_1, and the outermost loop index is i_s.

Example 2.3.1 In Figure 2.8(a), the tiled iteration space of a gain-offset algorithm is shown. After selecting the loop matrix R and choosing the LSGP strategy, a space-time mapping is obtained as shown in Figure 2.8(b). The allocation implies that the iteration variables i_2 and j_2 are executed in parallel and denote the processor dimensions, whereas i_1 and j_1 are executed sequentially. Similarly for LPGS, the space-time mapping is shown in Figure 2.8(c). The allocation implies that the iteration variables i_1 and j_1 are executed in parallel and denote the processor dimensions, whereas i_2 and j_2 are executed sequentially. The processor array implementations determined by the space-time mappings for LSGP and LPGS are shown in Figures 2.8(d) and 2.8(e), respectively.

A linear schedule may be obtained by solving a latency minimization problem formulated as a mixed integer linear program similar to the one presented in [86, 171]. Latency is defined as the execution time of the loop nest, i.e., the time difference between the first and last loop operation (see Glossary). The minimization of the resulting latency L is the objective function.



Figure 2.8.: (a) Tiled iteration space, (b) space-time mapping and loop matrix for LSGP tiling, (c) space-time mapping and loop matrix for LPGS tiling (R is the loop matrix), (d) processor array for LSGP, (e) processor array for LPGS.

The constraints in the MILP correspond to precedence constraints and sequentialization constraints [86]. On solving the MILP with solvers such as CPLEX [100], one obtains the global schedule λ as well as the local schedule τ and the binding. The details of the integer linear program for determining the schedule can be found in [86]. To summarize, given the RDG, the iteration space, the processor allocation function, and the loop matrix, one can determine a linear global schedule λ and a local schedule τ by minimizing the overall latency using an MILP formulation.
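For intuition, a condensed form of the precedence constraints can be stated for a single uniform dependence; this is a simplified textbook rendering, not the full constraint set of [86]. If x_i[I] depends on x_j[I − d_ji] and w(v_j) denotes the execution time of v_j, then

\[
  \lambda \cdot d_{ji} + \tau(v_i) - \tau(v_j) \;\geq\; w(v_j) ,
\]

which simply states that v_i at iteration I may not start before its operand, computed by v_j at iteration I − d_ji, is available.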

2.3.2.2. Local Scheduling and Resource Binding

The problems of allocation, binding, and scheduling need to be solved in order to obtain the information needed to synthesize a loop accelerator. The allocation assigns to each resource type r_i the number of available instances per processor element (PE), α(r_i). The binding possibility β(v_i) defines the resources (such as adders, multipliers, and other functional units) on which the functionality of the RDG node v_i can be executed. The execution time w(v_i, r_i) and the pipeline rate p(v_i, r_i) of node v_i on resource r_i are given in the architecture model. The term local scheduling refers to the determination of the start time τ(v_i) of each node for the execution of the loop iteration. The local schedule determines the throughput, which depends on the allocation of functional units. The so-called iteration interval is the time between consecutively scheduled loop iterations and is defined as follows:

Definition 2.3.2 (Iteration interval): The iteration interval II is the number of time steps (clock cycles) between the evaluation of two successive instances of a variable

from consecutive iterations within the same processing element [171].

Therefore, for scheduling, a suitable II may be chosen, or alternatively, the II may be determined depending on the allocation of FUs. In order to specify resource constraints, a so-called resource graph has to be specified, which expresses the binding possibilities of operations to functional units. The ILP model for determining the global schedule is extended with resource constraints for finding schedules [135, 167, 86]. The instance binding is done afterward using a modified left-edge algorithm [93].
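A useful lower bound on the iteration interval follows directly from the allocation. The bound below is the standard resource bound, stated under the assumption of fully pipelined functional units (pipeline rate 1); it is not taken from [86]:

\[
  II \;\geq\; \max_{r_k} \left\lceil \frac{|\{\, v_i \mid \beta(v_i) = r_k \,\}|}{\alpha(r_k)} \right\rceil .
\]

For example, if a loop kernel contains seven additions and a PE allocates α(ADD) = 2 adders, then II ≥ ⌈7/2⌉ = 4 regardless of the dependence structure.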

Example 2.3.2 Let an iteration space and a loop matrix defining the raster scan order be given. The space-time mapping for the bilateral filter introduced in Example 2.1.1 and shown in Figure 2.1 is as follows:

\[
  \begin{pmatrix} p_1 \\ p_2 \\ t \end{pmatrix}
  =
  \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 1 & 1 & 1 & 1024 \end{pmatrix}
  \cdot
  \begin{pmatrix} m \\ n \\ i \\ j \end{pmatrix}
\]

2.3.3. Back End: Synthesis The back end is responsible for the synthesis of the data path, control path, and communication network of the accelerator. The back end of our design flow for synthesizing an accelerator in form of a processor array entails the following steps:

• synthesis of the processor elements,

• synthesis of the interconnection structure, and

38 2.3. High-level Synthesis of Hardware Accelerators

(a) m (b) 1 2 3 i 2 3 4 ... 1 2 1 2 3 4 1024 3 4 5 DIV S14 S14 n j ... 1 2 ...... 1025 1026 LUT S5 S5

... 1 2 ...... SUB S4 S4

... 1 1 2 2 ... ADD S11S12S11S12

... 1 1 2 2 ... ADD S8 S9 S8 S9 ...... 1 2 ...... MUL S7 S7

... 1 2 ...... MUL S6 S6

... 0 1 2 3 4 5 6 7

Figure 2.9.: (a) Iteration space graph of bilateral filter annotated with dependencies and global schedule, (b) Gantt chart of local schedule of operations inside an iteration executed within each PE respectively with iteration interval of 1.

• synthesis of the control structure. In the following subsections, we give a brief overview of these steps and identify con- trol generation as a challenging problem to be solved in the context of sophisticated tiling techniques.

2.3.3.1. Synthesis of Processor Element The purpose of processor element synthesis is to generate an RTL-level hardware description for each individual processor element. A processor element (PE) consists of the processor core and a local controller. The processor core implements the data path where the actual computations are performed. The synthesis of the processor requires a scheduled reduced dependence graph. During the binding phase, each operation in the loop body was assigned a functional unit (resource) that executes the operation. These functional units are instantiated in the processor core. In case of reuse of functional units, input multiplexers are required in order to select the correct operands at each time step. Just like functional units, the registers may also be reused. Therefore, multiplexers are also required for register sharing. Apart from resource sharing, the selection of input variables also requires multiplexers in case of border handling or iteration dependent conditions. All the control signals for the multiplexers come from the hardwired control path, which is presented in Section 2.3.3.3. The interconnection between the functional units can be directly derived from the reduced dependence graph. The number of time steps an intermediate result must be stored is similar to the interconnect between PEs and can be computed by

39 2. Fundamentals and Related Work

Equation (2.14). After the hardware description of each processor type has been synthesized in an intermediate RTL, the corresponding processor elements can be instantiated in the processor array. The RTL of instantiated units are converted to VHDL which can be synthesized by commercial synthesis place and route tools for ASICs as well as FPGAs.

2.3.3.2. Synthesis of Array Interconnection Structure The processor array interconnection structure is responsible for the propagation of data and control signals within the hardware accelerator. The interconnect architec- ture consist of dedicated links with delay shift registers for each dependency. One can determine the synthesis of the processor interconnection for a data dependency, d ji as implied by a equation, xi [I] = x j I d ji by first determining the processor displacement, d p as − ji   p d = Q I Q I d ji = Q d ji (2.13) ji · − · − · where the allocation matrix is Q. The equation  finds the difference of processor index of PEs executing the source and sink iteration of the data dependency. Then, depending on the schedule vector, λ, the temporal delay for each data dependency is t determined by time displacement, d ji which can be calculated using Equation (2.14).

t d = λ d ji + τ x j + w x j τ(xi) (2.14) ji · − Applying processor allocation and scheduling,  the semantically equivalent space- p t time equation is x j [p,t] = xi p d ji,t d ji . In processor arrays, intermediate re- − − p sults are propagated from sourceh PE, pt to thei target PE, pt +d ji using shift registers, FIFOs or even external memory. The length of these shift registers is given by the t time displacement, d ji.

2.3.3.3. Synthesis of Control Hardware A very important step during the synthesis of processor arrays is to generate efficient control structures that orchestrate the correct computation of the algorithm on the processor array. In the proposed architecture, different types of control signals are generated:

• Iteration dependent control signals: For algorithms with iteration dependent conditionals, each processor can perform different operations, and use different input data depending on the current iteration point I , that means, control signals for functional units and input multiplexers must∈ be I generated depending on the value of I.

40 2.3. High-level Synthesis of Hardware Accelerators

External pixel_center (Output) FIFO pixel_input output pixels SUB LUT Adaptive Mask write enable FIFO full fixed mask MUL External (Input) FIFO sum_pixel MUL

input sum_weights ADD pixels FIFO ADD rd enable empty Computation Kernel

Pipelined Divider Input/Output output FIFO Controller Internal Buffer pixel

enable Global Counter Internal Buffer

Global Controller

Controller Bilateral Filter IP

Figure 2.10.: Bilateral filter accelerator architecture (II=1).

• Local schedule control signals: If functional units are reused for multiple oper- ations in the loop body, input multiplexers must be controlled to select the cor- rect input data. The same applies in case of reuse of internal registers. Multi- functional units like ALUs need to know which operation must be performed at a certain time step.

• I/O control signals: Finally, access to external memory and FIFOs requires that additional control signals and addresses have to be generated.

In order to maintain the scalability of processor arrays, it is crucial that the size of the control path remains nearly constant, regardless of the problem size (i.e., the size of iteration space and the size of the processor array ). In order to handle large scale problems| asI| well as to balance local memory requirements|P| with I/O-bandwidth, and to use different hierarchies of parallelism and memory, one needs a sophisti- cated transformation called hierarchical tiling. Innately, the applications are data flow dominant and have almost no control flow, but the application of hierarchical tiling techniques has the disadvantage of introducing a more complex control flow. In Section 3.2.1, an efficient methodology for the automated control architecture syn- thesis in the context of hierarchical tiling is given. The different steps of processor array synthesis are illustrated with help of our running example of the bilateral filter introduced in Example 2.1.1.

41 2. Fundamentals and Related Work

Example 2.3.3 Figure 2.10 shows the generated hardware accelerator for the bi- lateral filter. It consists of three parts, namely (i) the computation kernel, (ii) the internal delay buffers, and (iii) the controller. The computation kernel performs the arithmetic operations and deploys heavy pipelining (see Gantt chart in Figure 2.9(b)) such that several pixels can be processed in parallel. Similar to the handcrafted de- signs, the internal delay buffers temporarily store the input pixels such that each input pixel is read only once in order to avoid I/O bottlenecks. Their size can be calculated from the schedule and the data dependencies. It may be noted that the size of the line buffers is 1024 + 1 which can be inferred from the algorithm description obtained after the space-time mapping developed in Example 2.3.2. The allocation defines the dimension of processor array as 3 3 which is same as the filter mask dimension. Similarly, the number of line buffers× (=3 1 = 2) can also be inferred from the iteration conditions. The controller is responsible− for keeping track of the image pixels being processed and orchestrates the correct computations and I/O. For this purpose, the global counter generates the coordinates of the currently processed pixel from which the global controller derives the control signals. They select which conditional statement to execute such that correct border processing and I/O access is possible. The pixel coordinates provided by the global counter also allow gener- ation of the correct read, write enable signals for the input and output FIFOs. In case the input FIFO is empty or the output FIFO is full, the I/O controller stops the accelerator.

2.4. Accelerator Design Space Exploration

In this section, we present some related work on design space exploration. Design space exploration in embedded systems refers to the search for design parameters for meeting the application-specific requirements (e.g., chip area cost, power dissi- pation, and performance). The design alternatives depend on the synthesis in differ- ent abstraction layers like logic synthesis, high-level synthesis, software compilation, system synthesis. For example, in our design flow for accelerator synthesis, resource allocation in the architecture model and tiling parameters have a major influence on accelerator’s area, power dissipation, and performance. Another important observa- tion is that the exploration problem is characterized by the presence of millions of design alternatives due to a large architecture/compiler parameter space. Therefore, exhaustive search is expensive and intelligent heuristics like genetic algorithms and others are used for fast exploration of the design space. There has been a lot of research work on efficient design space exploration at dif- ferent levels of abstraction. Very often, these approaches are integrated in the design flow for synthesis. In the field of software compilation, iterative compilation and auto-tuning approaches have been used to match loop algorithms to a given pro- cessor architecture with multiple memory hierarchies [113, 181]. In these methods,

42 2.4. Accelerator Design Space Exploration different versions of the loop program depending on loop unrolling, loop blocking, or other transformation parameters are generated, and timed on the processor to find the optimal code. These procedures lead to superior results in terms of performance portability across different multi-processor architectures [182]. Recently in [149], a genetic algorithm that leverages a polyhedral compiler-in-loop, was used to traverse the loop transformation search space. The idea of exploring architecture and com- piler parameter space simultaneously for configurable application-specific instruction set processors (ASIPs) has been studied in [70]. For design space exploration, effi- cient pruning techniques based on Pareto-dominance and binary search have been combined to explore a vast search space of processor parameters such as register files, number/type of functional units, and compiler optimizations. In [2], an efficient architecture and compiler co-exploration based on analytical models and Pareto sim- ulated annealing algorithm has been presented for single processor architectures. At system level, genetic algorithms have been used to efficiently explore a parametrized system-on-chip architecture to find Pareto-optimal configurations in a multi-objective design space in [144, 63].

For accelerator architectures, an automated design space exploration for embed- ded computer systems consisting of a VLIW processor and/or a customized sys- tolic array accelerator is presented in [1]. Here, exhaustive search methods for sub- problems combined with heuristics for walking the design space are presented. Pilato et al. [145] use evolutionary algorithms for area and throughput optimization for ac- celerators realized on FPGAs, considering only resource allocation. In [161], an iterative exploration considering architectural constraints and as compiler transfor- mation loop unrolling is presented. Banerjee et al. target loop transformations and architecture specification for the exploration of accelerators generated from initial specifications in Matlab [9]. They also undertake an iterative compilation approach, but do not consider modern heuristics like evolutionary algorithms. In the ROCCC project, fast area and delay estimators for exploration of accelerators generated from C specification for fast iterative compilation are presented [118].

In all the above-mentioned works, the system constraints and the workload behavior are not taken into account. Furthermore, many of the mentioned related works have not investigated the use of modern heuristics like evolutionary algorithms for design space exploration. In Chapter 5, an efficient search of the design parameter space, consisting of the accelerator architecture allocation and compiler transformations, based on modern heuristics like multi-objective evolutionary algorithms is presented. Furthermore, another novel contribution is that we incorporate real-time calculus for the dimensioning of accelerators according to given system contracts, which enables performance analysis and dimensioning of communicating loop accelerators.
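The multi-objective comparisons underlying such searches rest on Pareto dominance; the following sketch shows the test for design points with three objectives (the struct and its field names are illustrative assumptions, not taken from any of the cited works).

#include <stdbool.h>

/* A design point with three objectives to be minimized. */
typedef struct { double area, power, latency; } DesignPoint;

/* a dominates b if a is no worse in every objective and strictly better in one. */
bool dominates(const DesignPoint *a, const DesignPoint *b) {
    bool no_worse = a->area <= b->area && a->power <= b->power &&
                    a->latency <= b->latency;
    bool better   = a->area <  b->area || a->power <  b->power ||
                    a->latency < b->latency;
    return no_worse && better;
}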


Figure 2.11.: (a) ROCCC accelerator architecture template, (b) PICO non-programmable accelerator connected to a custom VLIW processor.

2.5. High-level Synthesis Tools

The major aim of this thesis is to obtain a full-fledged design flow for the generation of hardware loop accelerators from a high-level language. In this section, we finally compare some of the existing design flows and methodologies based on specification, synthesis, support of communicating loops, system integration, and design space exploration.

DEFACTO [41]: This design flow converts a sequential C implementation into an RTL description of an accelerator in VHDL using the SUIF compiler front end [183] and a custom back end. It uses a hardware structure called tapped delay line and scalar replacement for data reuse. Furthermore, a parametrizable memory controller can be generated, which is optimized for external memory access. In addition, a custom data layout is undertaken by the compiler, where the array data is distributed across multiple memory banks according to the chosen loop unrolling. It can also implement applications consisting of multiple loops, even over multi-FPGA systems. Until now, there have been no efforts on targeting embedded SoC platforms; therefore, interface synthesis is not an issue. The design space exploration aims at maximizing accelerator performance subject to area capacity constraints, but is not regarded as a multi-objective design space exploration problem.

Streamroller [115]: Kudlur et al. propose a methodology for the generation of accelerator pipelines with prescribed throughput requirements. The resulting data path is similar to a VLIW template, which is generated by a compiler based on the


SUIF compiler front end and the Trimaran compiler as back end. The accelerators are connected over SRAM buffers. The authors are now evolving a methodology [98] for mapping streaming applications described in StreamIt [174] onto FPGA-based architectures. The basic building blocks of the applications are filters and queues implemented on SRAM buffers. The system integration is the task of the designer. Furthermore, design space exploration is not yet considered.

ROCCC [82]: This high-level synthesis tool, formerly known as SA-C [19], converts a restricted subset of C into VHDL. The scalar pipeline data path as shown in Figure 2.11(a) is generated from an intermediate representation produced by the Machine-SUIF compiler system [97]. Outstanding in the design flow is the ability to handle while loops and the smart buffers for storage. The smart buffers facilitate data reuse in loop algorithms through an architecture consisting of dedicated address generators and multiple connected registers. The data communication engine between on-chip block RAM and the inter-chip data streams is not part of the code generation [82]. Therefore, it is the task of the designer to generate hardware and software for the communication between multiple accelerators or between accelerator and processor, respectively. There has been an investigation of the impact of compiler transformations like loop unrolling on objectives like accelerator area, power, and performance. However, systematic design space exploration is not addressed.

MMAlpha [81]: MMAlpha has a long history in leveraging the polytope model for high-level synthesis of parallel hardware accelerators in the form of a processor array. A high-level description of the loops in the ALPHA single assignment language can be manipulated using loop transformations to generate the accelerator. Instead of a scalar pipeline data path, the accelerator data path consists of multiple PEs, and the local controller of each PE orchestrates the computation. In [33], a methodology for the generation of a hardware/software interface for system integration is proposed. However, it is the task of the designer to generate a communication subsystem for multiple loop accelerators. Derrien et al. proposed models for the power estimation of special purpose accelerators in [39]. The authors also find optimal tiling parameters for minimum energy consumption per PE.

PICO [155]: Formerly developed at Hewlett-Packard, introduced by Synfora, and now part of the Symphony C compiler from Synopsys [163]. This tool supports accelerator-RTL generation, communicating loops, and interface synthesis from algorithm descriptions in ANSI C. The overview of an architecture template is shown in Figure 2.11(b), where the VLIW processor is configured to run the sequential code annotated with driver code of the accelerator. The accelerator is a pipeline of processor arrays, which implements communicating loops. It is not able to support complex out-of-order communication. The processor array is generated from a non-programmable accelerator (NPA) template defining a grid of PEs with local memory. It is connected to a VLIW processor and an external memory. Instead of blocking read/write, the communication between NPAs takes place with the help of a timing controller. It was one of the first design flows integrating design space exploration [1].


ESPAM-PICO [176]: The ESPAM design flow has the ability to synthesize multi-processor designs from algorithm descriptions in C [142]. The algorithms are written as static affine nested loop programs (SANLP), which are converted to Kahn process network (KPN) descriptions. Distributed memory is used for the communication between system components. Recently, in [176], the PICO tool was integrated in the design flow to generate the hardware accelerators. It uses a tightly coupled accelerator block (TCAB) wrapper model with read, write, and execute units for the integration of hardware accelerators.

Simulink [170], PeaCE [84]: The design flows PeaCE [84] and Simulink [170] compose complex systems using communicating blocks. This not only helps the user to cope with the application complexity, but also allows easy extraction of the contained task level parallelism. However, hardware synthesis still requires a lot of manual intervention from the user. In particular, the algorithms have to be redesigned in such a way that module communication takes place on the granularity of pixels and exploits data reuse.

Gaspard2 [22]: The Array-OL specification language has been specifically developed to model data intensive signal and image processing applications. The Gaspard project extends the Array-OL language and allows modelling, simulation, testing, and code generation of SoC applications and hardware architectures. Starting from a UML specification, it generates SystemC code. It has not yet been targeted for accelerator generation, although the model can appropriately represent the multi-dimensional application and apply some very sophisticated transformations.

Spark [83]: The SPARK framework was particularly targeted at mapping control intensive functional blocks in signal and image processing onto dedicated hardware.

There are several industry-strength high-level synthesis tools like Mentor Graphics CatapultC [133], Forte Cynthesizer [71], Cadence C-to-Silicon [26], the Altera C2H compiler [120], and the Synphony C compiler [163], which convert C/C++ or SystemC code into an RTL design. AutoPilot [190] follows an approach based on converting a virtual instruction set derived using the LLVM compiler [119]. Bluespec is another impressive design system, which not only targets compute intensive applications, but also control intensive applications. The design entry is done in the object-oriented language SystemVerilog [141]. An excellent overview of the state-of-the-art HLS tools is given in [32].

Table 2.1 gives an overview of the comparison of different design flows for accelerator generation. The differentiation is based on the language, model of computation, loop transformations, hardware accelerator generation, support for data reuse, accelerator generation for communicating loops, interface synthesis for system-on-chip integration, and design space exploration. Our design flow for accelerator synthesis and exploration is able to support all the above aspects in their entirety.


Tools          | Lang.        | MoC              | Loop Trafo.        | HW Gen. | Data Reuse | Com. Loop | Sys. Int. | DSE
DEFACTO        | C            | -                | Unrolling          | X       | X          | X         | ×         | X
Streamroller   | C, StreamIT  | SDF              | -                  | X       | ×          | ×         | ×         | ×
RoCCC          | C            | -                | Unrolling          | -       | X          | X         | ×         | X
MMAlpha        | Alpha        | Polyhedral       | Tiling             | X       | X          | ×         | X         | ×
PICO (Synfora) | C            | Polyhedral       | Tiling             | X       | X          | X         | X         | X
ESPAM          | C            | KPN              | Unfolding          | X       | ×          | X         | X         | X
Gaspard2       | UML          | Array-OL         | Tiling             | ×       | X          | X         | ×         | ×
PARO           | PAULA        | Polyhedral, WSDF | Hierarch. Tiling   | X       | X          | X         | X         | X
Handel-C       | Handel C     | CSP              | Unrolling, Manual  | X       | X          | X         | ×         | ×
Simulink       | Simulink     | -                | -                  | X       | ×          | X         | ×         | ×
Matlab (Match) | Matlab       | -                | -                  | X       | X          | ×         | ×         | ×
Spark          | C            | -                | Unrolling          | X       | ×          | X         | ×         | ×

Table 2.1.: Comparison of design flows for hardware loop accelerator generation in terms of language, model of computation, support for loop transforma- tions, hardware generation, data reuse, communicating loops, system in- tegration, and design space exploration.

2.6. Conclusion

In this chapter, we presented the fundamentals of the design flow for generating hardware accelerators from nested loop programs. The initial specification of loop programs in the polyhedral model can lead to a high quality of results. A modular model of computation (MoC), the windowed synchronous data flow (WSDF) model, is considered for modelling applications consisting of communicating loop programs. Therefore, a major open question is how one converts an application specification consisting of communicating loops in the polytope model to an equivalent WSDF MoC. We also classified different accelerator architectures and identified hardware accelerators as the focus of research. The mapping problem for such an accelerator is solved by means of high-level synthesis. After discussing elementary loop transformations, we pinpointed tiling as a major transformation for matching the parallel implementation to accelerator architectures with multiple levels of parallelism and memory. In this context, a methodology supporting automatic tiling of the loop iteration space and kernel code generation is needed. Subsequently, an overview on global and local scheduling was given, which enables the generation of an intermediate RTL description of the accelerator in the form of a processor array. It has been observed that hierarchical tiling introduces many control conditionals, which are dependent on the iteration variables, thereby increasing the amount of control flow in the scheduled and allocated specification. Hence, a holistic methodology for the generation of a controller engine supporting hierarchical tiling and external communication is important.

In the context of communicating loops, accelerator design involves a set of new challenges: For a proper modelling of the communication behavior, we propose to model the application by the windowed synchronous data flow (WSDF) model. Another major challenge is the dimensioning and the automatic generation of the communication subsystem between the accelerators for hooking up the accelerator subsystem. Furthermore, accelerators are not foreseen as the only system component, but also as co-processors in a system-on-chip. Therefore, system integration is an important issue. Finally, related work on design space exploration has been analyzed with respect to multi-objective optimization and efficient search heuristics. Lastly, existing methodologies were compared in the context of the above-mentioned challenges.

3. Accelerator Generation: Loop Transformations and Back End

In order to meet the orthogonal objectives of high performance, ample flexibility, and low cost, accelerator-based systems-on-chip were identified as quintessential architectures. To realize the full potential of these architectures, the twin problems of the lack of mapping tools for accelerator generation and of automated multi-accelerator system design need to be tackled. In this chapter, we address the important aspects of the first problem, i.e., the optimal mapping for accelerator generation.

The considered accelerator architectures are characterized by multiple hierarchies of parallelism and memories. They have different levels of memory along with a large number of processing elements (PEs), where each PE can further contain multiple functional units. This leads to the problem of accelerator matching, i.e., of matching computationally intensive loop accelerator implementations with the multiple hierarchies of parallelism and memory of the architecture. We introduce a new transformation called hierarchical tiling in Section 3.1. This transformation partitions the iteration space of the loop algorithms with hierarchies of multiple congruent tiles to produce different variants of loop code in the form of a DPLA. Depending on the selected tiling strategy and the corresponding tile sizes, the accelerator resource allocation in terms of PEs, local memory, and memory banks is determined. The principles of some other important source-to-source loop transformations are also given in Section 3.1.

Innately, the considered loop applications are data flow dominant and have almost no control flow. However, the application of tiling techniques has the disadvantage of introducing more complex control and communication flow. Therefore, a generic controller, which orchestrates the data transfer and computation according to the given schedule, needs to be generated. In Section 3.2.1, we present a methodology for the automatic generation of the control engines of such accelerators. The key insight of identifying and separating local and global control signals leads to a considerable reduction in controller cost. Another formidable problem for the integration of accelerators in a system-on-chip is interface synthesis. In this context, a number of questions have to be answered: What should be the number and the size of the local buffers? Accelerators hide the memory latency and provide high I/O bandwidth by including multiple local buffer banks. How does the data transfer to/from the local buffers correspond to the static schedule and the dynamic system behaviour? A controller generates and processes status signals (start, done, empty, full, ...) for the interaction with external components. The I/O controller is also responsible for generating enable signals and addresses, and for evaluating FIFO status flags for the data transfer from the memory to the processor array of the accelerator. The methodology for interface controller and custom memory synthesis is presented in Section 3.2.2.

Loop optimizations are important for realizing accelerators; however, they also lead to the introduction of complex control, which must be realized efficiently by the controller architecture. In this context, the question arises, what is the overhead of the data path, the memory, the I/O channels, and their controller on the accelerator in terms of area, power, and throughput? Often, the performance bottleneck lies in the delay of the critical path in the controller architecture. Also, the area requirements of the controller could be larger than those of the accelerator data path itself in certain cases. Therefore, it is important to study the overhead of the different accelerator subcomponents and its relation to loop tiling and scheduling. Apart from this overhead analysis, we show the benefits of our design methodology in terms of automated loop accelerator generation for a wide class of algorithms in Section 3.3. Finally, the contributions of this chapter are summarized in Section 3.4.

3.1. Loop Optimizations for Accelerator Tuning

In this section, we present some standard loop optimizations and a novel transformation called hierarchical tiling for accelerator tuning. The generic hardware model of a programmable and a non-programmable accelerator is shown in Figure 3.1(a). Accelerators contain several homogeneous processing elements (PEs), which are organized in a one- or two-dimensional grid structure. The PEs contain multiple functional units performing simple add and mul operations as well as special functions like trigonometric or exponential functions. Similarly, the memory model of an accelerator, as shown in Figure 3.1(b), may be hierarchical. The accelerator data is stored in the main memory and can be transferred by the CPU or by a direct memory access (DMA) unit into the local buffers. The border PEs in the accelerator have dedicated access to the FIFO banks for the parallel access of data. Register banks and shift registers enable further data reuse within each PE. The task of efficient programming or generation of such loop accelerators needs to fulfil several optimization goals, including the efficient access of the different memory levels along with the optimal usage of all compute hierarchies. This problem has been well studied in the high performance computing research community for the efficient mapping of loop algorithms onto single processor and multi-processor systems. Several popular compiler optimizations like prefetching, tiling, loop unrolling, and loop permutation have been developed, which lead to several variants of accelerator implementations [137]. These automated optimizations replace manual tweaking of the programs and allow utilizing the accelerator's architecture model effectively. Recently, it has been observed for simple applications like matrix-matrix multiplication that the application of hierarchical tiling (also known as multi-level tiling) in combination with the above mentioned optimizations yields


the best performance results for CPUs, multi-cores, and graphics processors [180, 12]. Here, the key insight is that tiling at different hierarchies enables the targeting of different levels of memories and computation units. In the following section, we revise important loop transformations like loop permutation and tiling. Then, we concentrate on hierarchical tiling and present a new methodology for the automatic hierarchical tiling of loop nests described as dynamic piecewise linear algorithms.

Figure 3.1.: (a) Accelerator hardware model. (b) Accelerator memory model.

3.1.1. Loop Transformations

Many toolchains for converting human-readable high-level languages into massively parallel hardware circuits and parallelized program code are based on standard and advanced compiler optimizations. We use the following running example to illustrate the effect of compiler loop transformations on efficient accelerator synthesis.

Example 3.1.1 The Matrix-Matrix Multiplication (MMM)¹ is a simple and ubiquitous computationally intensive linear algebra algorithm, which offers numerous challenges to a high performance application programmer. The following program code shows its implementation in the programming language C. Each element of the result matrix C[i][j] is the dot product of the i-th row of A and the j-th column of B.

¹The first MEMOCODE 2007 HW/SW co-design contest involved the challenge of writing the fastest co-designed/optimized implementation of a complex matrix-matrix multiplication.


Figure 3.2.: (a) Iteration space graph of matrix-matrix multiplication. (b) Hardware accelerator executing the inner kernel of the matrix-matrix product.

The first step towards accelerating the application is to generate a special hardware circuit for executing the multiply-add instruction in line 6. Therefore, the simplest accelerator consists of a single PE with a MUL and an ADD functional unit.

1 void mmmKernel(Number* A, Number* B, Number* C, int N){
2   int i, j, k;
3   for (i = 0; i < N; i++)
4     for (j = 0; j < N; j++)
5       for (k = 0; k < N; k++)
6         C[i*N+j] += A[i*N+k] * B[k*N+j]; /* multiply-add, row-major layout */
7 }

The accelerator architecture shown in Figure 3.2(b) consists of multiply and add functional units for multiplying and adding the numbers corresponding to the loop kernel. The software iterates over the loop counter variables k, j, and i and is responsible for the data transfer. Therefore, there is an overhead of $3N^3$ read and write operations, as each iteration has two reads and one write operation. This could have been avoided if data reuse within the accelerator had been exploited. A lot of data reuse and parallelization information is lost in the above C description of the algorithm. The following program code (in PAULA notation) represents the MMM algorithm formulated as a DPLA. The matrices A and B are embedded into the iteration space by the equations $a[i,0,k] = A_{ik}$ and $b[0,j,k] = B_{kj}$. The data dependence vectors $(0\ 1\ 0)^T$, $(1\ 0\ 0)^T$, and $(0\ 0\ 1)^T$ represent the internal data reuse for the matrices A, B, and C, respectively. The iteration space is given by $\mathcal{I} = \{I = (i\ j\ k)^T \in \mathbb{Z}^3 \mid 0 \le i, j, k \le N-1\}$. For $N = 8$, the iteration space graph (ISG) is shown with the data dependence vectors in Figure 3.2(a).

program matmul {
  variable A 3 in integer<16>;
  variable B 3 in integer<16>;
  variable C 3 out integer<16>;
  variable a, b, c, z 3 integer<16>;

  parameter N = 512;

  par (i >= 0 and i <= N-1 and j >= 0 and j <= N-1 and k >= 0 and k <= N-1)
  {
    a[i, j, k] = A[i, 0, k]                 if (j == 0);   // read matrix A
    b[i, j, k] = B[0, j, k]                 if (i == 0);   // read matrix B
    a[i, j, k] = a[i, j-1, k]               if (j > 0);    // data reuse
    b[i, j, k] = b[i-1, j, k]               if (i > 0);    // data reuse
    z[i, j, k] = a[i, j, k] * b[i, j, k];
    c[i, j, k] = c[i, j, k-1] + z[i, j, k]  if (k > 0);
    c[i, j, k] = z[i, j, k]                 if (k == 0);
    C[i, j, k] = c[i, j, k]                 if (k == N-1); // write C
  }
}

The two essential rudiments for obtaining higher performance are:

• increasing the PE count of the accelerator, i.e., parallelism

• better memory access through data reuse.

Using the above matrix-matrix product specification, the following subsections show how loop transformations can be leveraged to exploit these two principles.

3.1.1.1. Loop Permutation

A loop permutation may improve the performance through better memory access by utilizing spatial locality. The efficient implementation of computationally intensive algorithms often requires the intermediate storage of matrices, images, or signals. Usually, these large data sets cannot be stored completely in on-chip memory due to size constraints. Therefore, an efficient data transfer between the slower off-chip main memory (DRAM), the faster intermediate on-chip memory (cache, local buffer), and the accelerator is mandatory. An efficient memory controller always exploits spatial locality,


Figure 3.3.: (a) I/O matrices are embedded into a common iteration space, whose values are propagated by reuse vectors. (b) Memory access pattern of the original version (ijk). (c) Access pattern after loop permutation (ikj).

which refers to the fact that the likelihood of referencing a memory location is higher if a neighbouring memory location was just referenced. The naive implementation of the MMM specification leads to low performance because of the inefficient use of data locality. This is illustrated in Figure 3.3(b). For the MMM example in C, there exist six different formulations (ijk, ikj, jik, jki, kij, kji) obtained by loop permutation. The initial version (ijk) refers to the execution of k as the innermost loop and i as the outermost loop. Here, the matrix B is accessed in column-major order, whereas the matrix A is accessed in row-major order. The access pattern of matrix A has a stride of 1. However, the column-major access of matrix B has stride N, which does not exploit spatial locality. Loop permutation is a unimodular transformation, which changes the schedule order of the loop program [10]. The idea of permuting is to switch the inner and the outer loop. Figure 3.3(c) illustrates the effect of the permutation transformation for MMM on the memory access by interchanging the j and the k loop (ikj). It can be seen that all matrices now exploit spatial locality (i.e., the matrices B and C have a stride of 1), which may lead to better overall performance. With respect to the C program, only lines 4 and 5 of the initial code need to be interchanged.

In our methodology, loop permutation is equivalent to a column permutation of the so-called loop matrix (see Definition 2.3.1). This can be obtained by the product of the given loop matrix $R \in \mathbb{Z}^{m \times m}$ with a permutation matrix $P_m \in \mathbb{Z}^{m \times m}$. The permutation matrix is obtained by an elementary column interchange of the identity matrix $I_m$. For the ikj loop order formulation of the running example, the new loop matrix becomes


Figure 3.4.: (a) LPGS tiling of MMM specification by a factor of 4, (b) corresponding accelerator architecture, (c) memory access of the I/O matrices.

$$R_{new} = R_{old} \cdot P_m = \begin{pmatrix} 0 & 0 & N \\ 0 & N & 0 \\ N & 0 & 0 \end{pmatrix} \begin{pmatrix} 0 & 0 & 1 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix} = \begin{pmatrix} 0 & N & 0 \\ N & 0 & 0 \\ 0 & 0 & N \end{pmatrix}$$
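For illustration, a sketch of the permuted (ikj) C variant, obtained by interchanging lines 4 and 5 of the initial code (the flat row-major indexing is an assumption carried over from above):

void mmmKernelIKJ(Number* A, Number* B, Number* C, int N){
    int i, j, k;
    for (i = 0; i < N; i++)
        for (k = 0; k < N; k++)      /* formerly the innermost loop */
            for (j = 0; j < N; j++)  /* innermost: stride-1 access to B and C */
                C[i*N+j] += A[i*N+k] * B[k*N+j];
}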

3.1.1.2. Loop Tiling

Loop permutation exploits the spatial locality of memory accesses, but neither the parallelism of multiple PEs in an accelerator nor the temporal locality in case of large arrays. The concept of temporal locality is that a memory location that is referenced at one point in time will be referenced again sometime in the near future. Loop tiling as a transformation was introduced briefly in Sections 2.3.1.2 and 2.3.2.1. It may increase the overall performance further by exploiting temporal locality and allocating computations onto multiple PEs in an accelerator. This can be observed in Figure 3.3(c) when a large matrix size inhibits the storage of the matrices in an intermediate memory (i.e., cache or local buffer of the accelerator). Each iteration of the i loop requires the storage of the complete B matrix and of a row of matrix C. Therefore, if the intermediate memory cannot store $N^2 + N$ data elements, a large I/O overhead to the main memory is incurred, which may lead to a significant communication bottleneck. In order to exploit temporal locality, the iteration space is divided into congruent tiles of size $s \times s \times s$, the tile size. This reduces the required intermediate storage to $s^2 + s$ matrix elements (e.g., for $N = 512$ and $s = 64$, the requirement drops from 262,656 to 4,160 elements). Tiling or blocking can then also be understood as dividing the computation into $(N/s)^3$ smaller matrix multiplications of matrices of size $s$ in the case of perfect tiling. In terms of the loop program, the loop depth doubles from 3 to 6 (i.e., the number of iteration variables is 6 for the tiled program).

55 3. Accelerator Generation: Loop Transformations and Back End

The loop tiling transformation includes the problems of tiled code generation and of parallelization strategies like LPGS (local parallel, global sequential, also referred to as inner loop parallelization) or LSGP (local sequential, global parallel, often also referred to as clustering, blocking, or outer loop parallelization) [169]. Mathematically, the formal definition of the tiling methodology and strategy is given by Equation (2.8). LPGS tiling, also known as strip mining, is done by splitting a single loop into two nested loops; the inner loop contains the iterations within so-called strips [158], whereas the outer loop steps between consecutive strips. The iterations within the inner loop can be mapped onto independent PEs or vector units. The strips are then processed sequentially according to a given schedule order. In the following C code fragment, a tiled version of the original MMM source code is presented.

void mmmKernel(Number* A, Number* B, Number* C, int N){
    int i, j, k, j1;
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j += 4)          /* steps between strips of width 4 */
            for (k = 0; k < N; k++)
                for (j1 = 0; j1 < 4; j1++)  /* iterations inside a strip: parallel */
                    C[i*N+j+j1] += A[i*N+k] * B[k*N+(j+j1)];
}

This transformation is illustrated in Figure 3.4. The four iterations inside a tile are executed in parallel by the loop accelerator. The LPGS scheme is associated with a minimal local memory consumption of the accelerator PEs and is therefore characterized by a high communication overhead. In LSGP, the communication cost is minimal, but only either the amount of local memory or the number of required PEs can be controlled. Recent research results in tuning high performance implementations enforce the usage of multiple hierarchies of tiling for matching the memory hierarchy and the different levels of parallelism in an architecture [180, 181, 111]. In the next section, we present a source-to-source loop transformation called n-hierarchical tiling, where n is the number of introduced tiling levels.
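Complementing the LPGS fragment above, a minimal sketch of an LSGP-style partitioning of the same kernel (the block count of 4 and the assumption that N is divisible by 4 are illustrative only):

void mmmKernelLSGP(Number* A, Number* B, Number* C, int N){
    int p, i, j, k;
    for (p = 0; p < 4; p++)                          /* blocks: one per PE, in parallel */
        for (i = 0; i < N; i++)
            for (k = 0; k < N; k++)
                for (j = p*(N/4); j < (p+1)*(N/4); j++)  /* local sequential */
                    C[i*N+j] += A[i*N+k] * B[k*N+j];
}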

3.1.2. Hierarchical Tiling

In the previous section, we identified tiling as a major transformation for matching loop nest implementations to architecture constraints. The use of multiple levels of tiling aims at targeting the multiple levels of memory, as in the work of Eckhardt et al. [59]. The authors propose two levels of tiling, called copartitioning, for targeting the foreground memory (i.e., registers) and the background memory (caches, RAM, disk).


The repeated use of copartitioning separately for each level of background memory (i.e., caches, RAM, disk) for the optimization of memory accesses is also proposed, which is a case of hierarchical tiling. Furthermore, state-of-the-art parallel architectures are characterized by multiple levels of parallelism like sub-word parallelism, VLIW parallelism, and multiple PEs, which are clustered in processor arrays. Therefore, a formal definition of hierarchical tiling is needed to target multiple levels of parallelism and memory structures, as they could require more than two levels of tiling. Such tiling techniques require a hierarchy of tiling matrices in order to partition the iteration space, not only for software compilation but also for high-level synthesis. In this section, we explain hierarchical tiling using an example and present a method for automating the transformation.

Copartitioning [57] is an example of a 2-level hierarchical tiling, where the iteration space is first partitioned into LS (local sequential) tiles. This tiled iteration space is then tiled once more using GS (global sequential) tiles. In copartitioning, the iteration points within the LS tiles are executed sequentially. All LS tiles within a GS tile are executed in parallel by the processor array. Therefore, the number of processors in the array is equal to the number of LS tiles within a GS tile. The GS tiles are executed sequentially. Copartitioning uses both the LSGP and the LPGS method in order to balance local memory requirements with I/O bandwidth and has the advantage of problem size independence. Copartitioning is formally defined as follows.

Definition 3.1.1 (Copartitioning) Copartitioning is defined by tiling the m-dimensional iteration space $\mathcal{I}$ into the iteration spaces $\mathcal{I}_1$, $\mathcal{I}_2$, and $\mathcal{I}_3$, i.e.,

$$\text{Copartitioning}: \mathbb{Z}^m \to \mathbb{Z}^{3m}, \quad \mathcal{I} \mapsto \mathcal{I}_1 \oplus \mathcal{I}_2 \oplus \mathcal{I}_3, \quad \text{where}$$

$$\mathcal{I}_1 \oplus \mathcal{I}_2 \oplus \mathcal{I}_3 = \{I = I_1 + P_1' \cdot I_2 + P_2' \cdot I_3 \mid I_1 \in \mathcal{I}_1 \wedge I_2 \in \mathcal{I}_2 \wedge I_3 \in \mathcal{I}_3\}$$

using two congruent tiles defined by two tiling matrices $P_1 \in \mathbb{Z}^{m \times m}$ and $P_2 \in \mathbb{Z}^{m \times m}$, where $P_1' = P_1$ and $P_2' = P_1 \cdot P_2$. $\mathcal{I}_1 \subset \mathbb{Z}^m$ represents the points within the LS (inner) tiles, $\mathcal{I}_2 \subset \mathbb{Z}^m$ accounts for the regular repetition of the origins of the LS tiles, and $\mathcal{I}_3 \subset \mathbb{Z}^m$ accounts for the regular repetition of the GS (outer) tiles. The space-time mapping in case of copartitioning is an affine transformation of the form

$$\begin{pmatrix} p \\ t \end{pmatrix} = \begin{pmatrix} 0 & E & 0 \\ \lambda_1 & \lambda_2 & \lambda_3 \end{pmatrix} \begin{pmatrix} I_1 \\ I_2 \\ I_3 \end{pmatrix} \qquad (3.1)$$

where $E \in \mathbb{Z}^{m \times m}$ is the identity matrix and $\lambda_1, \lambda_2, \lambda_3 \in \mathbb{Z}^{1 \times m}$.

Similarly, an n-hierarchical tiling method partitions the iteration space $\mathcal{I}$ into $n+1$ iteration spaces. Formally, this is represented as

$$\mathcal{I}_1 \oplus \ldots \oplus \mathcal{I}_{n+1} = \{I = I_1 + P_1' \cdot I_2 + \ldots + P_n' \cdot I_{n+1} \mid I_1 \in \mathcal{I}_1 \wedge I_2 \in \mathcal{I}_2 \wedge \ldots \wedge I_{n+1} \in \mathcal{I}_{n+1}\}$$
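The resulting loop structure of a copartitioned one-dimensional iteration space can be sketched in C as follows (the loop bound N is assumed to be divisible by the inner tile size s1 times the PE count s2; all names are illustrative):

/* 2-level hierarchical tiling (copartitioning) of a single loop of size N. */
for (i3 = 0; i3 < N/(s1*s2); i3++)      /* GS tiles: executed sequentially   */
    for (i2 = 0; i2 < s2; i2++)         /* LS tiles in a GS tile: one per PE */
        for (i1 = 0; i1 < s1; i1++)     /* points in an LS tile: sequential  */
            body(i3*s1*s2 + i2*s1 + i1);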


Figure 3.5.: Dependence graph of a 6-tap FIR filter and the dependence graph after copartitioning. The affine data dependency of the variable A is not shown.

n-hierarchical tiling not only changes the loop code in terms of the iteration space (i.e., loop depth, variables, and bounds), but the statements inside the loop kernel are also changed in the new iteration space

$$\mathcal{I}_{tiled} = \{(I_1, I_2, \ldots, I_{n+1}) \mid I_1 \in \mathbb{Z}^m, \ldots, I_{n+1} \in \mathbb{Z}^m\}$$

Other hierarchical tiling schemes can be realized using an appropriate selection of the affine transformation, which characterizes the scheduling and the allocation of the iteration points. The problem of determining an optimal sequencing index (i.e., $\lambda_1, \lambda_2, \ldots$) is solved by a Mixed Integer Linear Programming (MILP) formulation [86]. Very often, tiling is not quite straightforward and intuitive for the programmer. Therefore, in this section, we present a methodology for the automatic hierarchical tiling of loop programs which are described as a set of recurrence equations. We use an FIR filter for illustrating our methodology.

Example 3.1.2 An FIR (Finite Impulse Response) filter is described by the simple difference equation $y(i) = \sum_{j=0}^{N-1} a(j) \cdot u(i-j)$ with $0 \le i < T$, where N denotes the number of filter taps, $a(j)$ the filter coefficients, $u(i)$ the filter input, and $y(i)$ the filter result. A 6-tap FIR filter can be expressed by the following DPLA with the iteration domain $\mathcal{I} = \{(i, j) \mid 0 \le i \le T-1 \wedge 0 \le j \le N-1\}$, where $T = 8$ and $N = 6$.


a[i, j] = A[0, j]
u[i, j] = U[i, j]                  if  j = 0
u[i, j] = 0                        if  i = 0 ∧ j > 0
u[i, j] = u[i-1, j-1]              if  i > 0 ∧ j > 0
z[i, j] = a[i, j] · u[i, j]
y[i, j] = z[i, j]                  if  j = 0
y[i, j] = y[i, j-1] + z[i, j]      if  j > 0
Y[i]    = y[i, j]                  if  j = N-1

The original iteration domain with the data dependencies between the individual variables is shown in Figure 3.5(a).
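For reference, a direct C realization of the difference equation above (a sketch; the coefficient and input values are illustrative, and u(i-j) is taken as 0 for negative indices):

#include <stdio.h>

int main(void) {
    enum { T = 8, N = 6 };
    int a[N] = {1, 2, 3, 3, 2, 1};        /* illustrative filter coefficients */
    int u[T] = {1, 0, 0, 0, 0, 0, 0, 0};  /* illustrative input (unit impulse) */
    int y[T];
    for (int i = 0; i < T; i++) {
        y[i] = 0;
        for (int j = 0; j < N; j++)
            if (i - j >= 0)               /* u(i-j) assumed 0 for negative indices */
                y[i] += a[j] * u[i - j];
        printf("y(%d) = %d\n", i, y[i]);
    }
    return 0;
}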

We first illustrate the problem of hierarchical tiling for an FIR filter loop program. The above DPLA program of an FIR filter is tiled to realize a hierarchically tiled FIR filter in the following example.

Example 3.1.3 The FIR filter undergoes a 2-hierarchical tiling (copartitioning) with tiles represented by the following matrices.

$$P_1 = \begin{pmatrix} 2 & 0 \\ 0 & 3 \end{pmatrix} \qquad P_2 = \begin{pmatrix} 2 & 0 \\ 0 & 2 \end{pmatrix}$$

$P_1$ and $P_2$ denote the number of iterations inside an inner tile and the number of PEs in the processor array in case of copartitioning, respectively. The tiled FIR filter is represented as a dynamic piecewise linear algorithm, where the iteration space and the program body are given in the following. The tiled iteration space of the program along with the data dependencies between the iterations is shown in Figure 3.5(b). The iteration space can be interpreted as a set of three nested for loops of the FIR filter as shown in Figure 3.6(a). It scans the iterations of the original iteration space, where $\mathcal{I}_1$ denotes the iteration space of the inner tile, whereas $\mathcal{I}_2$ and $\mathcal{I}_3$ denote the iteration spaces which correspond to the origins of the inner and the outer tiles, respectively. Formally, the iteration spaces are defined by $\mathcal{I} \to \mathcal{I}_1 \oplus \mathcal{I}_2 \oplus \mathcal{I}_3$, where

$$\mathcal{I}_1 = \{(i_1\ j_1)^T \in \mathbb{Z}^2 \mid 0 \le i_1 \le 1 \wedge 0 \le j_1 \le 2\} \quad \text{(Loop 1)}$$
$$\mathcal{I}_2 = \{(i_2\ j_2)^T \in \mathbb{Z}^2 \mid 0 \le i_2 \le 1 \wedge 0 \le j_2 \le 1\} \quad \text{(Loop 2)}$$
$$\mathcal{I}_3 = \{i_3 \in \mathbb{Z} \mid 0 \le i_3 \le 1\} \quad \text{(Loop 3)}$$

The iteration variable $j_3 = 0$ and is therefore not shown in the tiled iteration space. The statements in the loop kernel are transformed into the following set of recurrence equations in DPLA notation:

a[i1,j1,i2,j2,i3] = A[i1,j1,i2,j2,i3]               if i1 = 0 ∧ i2 = 0 ∧ i3 = 0
u[i1,j1,i2,j2,i3] = U[i1,j1,i2,j2,i3]               if j1 = 0 ∧ j2 = 0
u[i1,j1,i2,j2,i3] = 0                               if i1 = 0 ∧ i2 = 0 ∧ i3 = 0 ∧ j1 + j2 > 0
u[i1,j1,i2,j2,i3] = u[i1-1, j1-1, i2, j2, i3]       if i1 > 0 ∧ j1 > 0
u[i1,j1,i2,j2,i3] = u[i1+1, j1-1, i2-1, j2, i3]     if i1 = 0 ∧ j1 > 0 ∧ i2 > 0
u[i1,j1,i2,j2,i3] = u[i1-1, j1+2, i2, j2-1, i3]     if i1 > 0 ∧ j1 = 0 ∧ j2 > 0
u[i1,j1,i2,j2,i3] = u[i1+1, j1+2, i2-1, j2-1, i3]   if i1 = 0 ∧ j1 = 0 ∧ i2 > 0 ∧ j2 > 0
u[i1,j1,i2,j2,i3] = u[i1+1, j1-1, i2+1, j2, i3-1]   if i1 = 0 ∧ j1 > 0 ∧ i2 = 0 ∧ i3 > 0
u[i1,j1,i2,j2,i3] = u[i1+1, j1+2, i2+1, j2-1, i3-1] if i1 = 0 ∧ j1 = 0 ∧ i2 = 0 ∧ j2 > 0 ∧ i3 > 0
z[i1,j1,i2,j2,i3] = a[i1,j1,i2,j2,i3] · u[i1,j1,i2,j2,i3]
y[i1,j1,i2,j2,i3] = z[i1,j1,i2,j2,i3]               if j1 = 0 ∧ j2 = 0
y[i1,j1,i2,j2,i3] = y[i1, j1-1, i2, j2, i3] + z[i1,j1,i2,j2,i3]    if j1 > 0
y[i1,j1,i2,j2,i3] = y[i1, j1+2, i2, j2-1, i3] + z[i1,j1,i2,j2,i3]  if j1 = 0 ∧ j2 > 0
Y[i1,j1,i2,j2,i3] = y[i1,j1,i2,j2,i3]               if j1 = 2 ∧ j2 = 1

For the FIR filter, a loop nest of depth 2 after 2-hierarchical tiling (copartitioning) results in a loop nest of depth 6. One can verify that the index point $I = (4,3)$ is uniquely mapped to $I_1 = (0,0)$, $I_2 = (0,1)$, and $I_3 = (1,0)$ after copartitioning. The data dependency of the variable u leads to several new equations to account for the embedding of an equation in the new tiled iteration space. For example, $I = (4,3)$ receives the u value from $I = (3,2)$ in the original iteration space. In the new iteration space of six dimensions, the same iteration, i.e., $I_1 = (0,0)$, $I_2 = (0,1)$, and $I_3 = (1,0)$, receives the u value from $I_1 = (1,2)$, $I_2 = (1,0)$, and $I_3 = (0,0)$. This is accounted for by the last equation of u. The multiple equations arise due to the different cases of data dependencies crossing the tiles.
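The decomposition of an index point can be checked mechanically; the following self-contained sketch recomputes $I = I_1 + P_1 \cdot I_2 + P_1 P_2 \cdot I_3$ for the values above (the code is illustrative only, not part of the PARO tool):

#include <stdio.h>

int main(void) {
    int P1[2][2]   = {{2, 0}, {0, 3}};
    int P1P2[2][2] = {{4, 0}, {0, 6}};   /* product P1 * P2 */
    int I1[2] = {0, 0}, I2[2] = {0, 1}, I3[2] = {1, 0};
    int I[2];
    for (int r = 0; r < 2; r++)
        I[r] = I1[r]
             + P1[r][0]*I2[0]   + P1[r][1]*I2[1]
             + P1P2[r][0]*I3[0] + P1P2[r][1]*I3[1];
    printf("I = (%d, %d)\n", I[0], I[1]);  /* prints I = (4, 3) */
    return 0;
}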

When comparing the tiled FIR filter program above with the initial program in Example 3.1.2, one may notice that the number of equations in the tiled program is substantially larger than the number of equations in the initial program. Furthermore, the iteration space of the tiled program is different from the iteration space of the initial program. Therefore, given a DPLA with both uniform and affine data dependencies, and given the hierarchy n and the tiling matrices, the following questions need to be answered to enable automated n-hierarchical tiling:

• How do we generate the tiled iteration space $\mathcal{I}_{tiled}$ given the hierarchical tiling matrices?


Figure 3.6(a) shows the tiled iteration space represented as a loop nest:

for (i3 = 0; i3 < 2; i3++)
  for (j3 = 0; j3 < 1; j3++)           // Loop 3
    for (i2 = 0; i2 < 2; i2++)
      for (j2 = 0; j2 < 2; j2++)       // Loop 2
        for (i1 = 0; i1 < 2; i1++)
          for (j1 = 0; j1 < 3; j1++) { // Loop 1
            // statement body
          }

Figure 3.6(b) depicts the tiling methodology: given the tiling matrices $P_1, P_2, \ldots, P_n$ and the hierarchy n, the loop program undergoes the iteration space decomposition (tiling), the embedding of data dependencies and conditions (expand), and global scheduling (reduce).
Figure 3.6.: (a) Tiled iteration space represented as for loop. (b) Overview of tiling methodology.

• How does one obtain an output DPLA preserving the data dependencies after hierarchical tiling (code generation)?

• How do we allocate and schedule the iterations of the output DPLA for efficient hardware generation?

Our approach for hierarchical tiling answers the above questions. It consists of the steps: (a) tiling, (b) expand, i.e., the embedding of data dependencies and control conditions, and (c) reduce, i.e., global scheduling (see Figure 3.6(b)). The first step is the tiling of the iteration space, which is equivalent to the problem of loop tiling in compiler theory. In our methodology, we go one step further by embedding the data dependencies in the new tiled iteration space. The advantage of this data dependence analysis step is that we remain in the polytope framework, which offers the possibility of mapping the algorithms onto massively parallel architectures. Hierarchical tiling not only embeds the data dependencies but also the iteration dependent conditions in the new iteration space $\mathcal{I}_{tiled}$. Furthermore, the new data dependencies are also associated with new unique iteration conditions. Finally, global scheduling is an important step for describing the place and time coordinates of execution of each iteration in the tiled iteration space, and for hardware generation. The introduced n-hierarchical tiling method encompasses all possible tiling techniques (i.e., LSGP, LPGS, copartitioning, ...). We use congruent parallelotope-shaped tiles for tiling, which can also describe non-rectangular tiles.

As already mentioned, tiling is known in modern compiler theory under different nuances like strip-mining or loop tiling [158]. Traditional loop tiling as in compiler theory introduces integer division, mod, ceil, and floor operators. The DTSE methodology [27] from IMEC is another compilation method based on the polytope model, which addresses hierarchical tiling, but only for general purpose embedded processors and not for processor arrays or multi-processor systems. Tiling in the context of recurrence equations for a single hierarchy as in LSGP and LPGS was presented in [164] and was extended for affine data dependencies in [168]. The idea of copartitioning was introduced in [57]. However, the exact methodology for automating the transformation has not been studied. One of the main contributions of the presented transformation is the generalization of the tiling approaches presented in [168, 57] to any given hierarchy. Hierarchical tiling methods have also been studied in [111]. Here, the authors also present a method for generating multi-level tiled loops for parametric tiles. However, they do not transform the statements in the loop kernel. In the LooPo framework [78], tiling is performed after the space-time mapping. Another outstanding extension in this framework is the possibility of handling while loops [80]. Our approach is similar to the traditional approach, where tiling is done before the space-time mapping. The code generation scheme for tiling in the Pluto framework [21] uses Fourier-Motzkin elimination and Cloog [13] for fixed tile sizes. A major difference also lies in the architecture target: most of the above frameworks aim at code generation for existing multi-processor systems, whereas, here, we intend the mapping onto hardware accelerators in the form of processor arrays or even synthesize them from scratch.

3.1.2.1. Tiling: Decomposition of the Iteration Space

In this section, we present a method for realizing the tiled iteration space of a DPLA. Traditional loop tiling decomposes a loop nest into two nested loops. The outer loop iterates over the tile origins and the inner loop steps over the single iterations within a tile. Given the simple loop program in C notation:

for (i = 0; i < N; i++){a[i]=...}

The resulting tiled program, using a 1-d tile of size S according to [158], is then

for (i1 = 0; i1 <= floor((N-1)/S); i1++)
    for (i2 = S*i1; i2 <= min(S*i1 + S - 1, N-1); i2++)
        a[i2] = ...;

Here, the loop bounds are not divisible by the tile size, which leads to complex division and floor operations. Furthermore, if the tiles are not orthogonal, then complex floor, division, min, and max operations in the loop bounds result. Our tiling of an iteration space for recurrence equations differs from traditional loop tiling. We introduce dummy operations for invalid iteration points instead of introducing complex floor and division operations in the loop bounds. Furthermore, the recurrence equations are embedded in the new iteration space. Therefore, tiling as proposed in this chapter leads to the following loop code:

for (i1 = 0; i1 < (N+S-1)/S; i1 += 1)   /* ceil(N/S) tiles */
    for (i2 = 0; i2 < S; i2++) {
        if (i1*S + i2 >= N) { /* dummy op */ }
        else a[i1][i2] = ...;
    }

Here, the loop bounds are statically determined due to the assumption of fixed tile sizes. Loop tiling of a loop nest of depth 2 gives a loop nest of depth 4, with $\mathcal{I} \to \mathcal{I}_1 \oplus \mathcal{I}_2$, where $\mathcal{I}_1$ and $\mathcal{I}_2$ denote the iterations inside the tiles and the origins of the tiles, respectively. Similarly, an n-hierarchical tiling converts the global iteration space $\mathcal{I}$ of dimension m into an $(n+1) \cdot m$-dimensional iteration space, since $\mathcal{I} \to \mathcal{I}_1 \oplus \mathcal{I}_2 \oplus \ldots \oplus \mathcal{I}_{n+1}$, where $\mathcal{I}_1$ accounts for the index points in the innermost tiles, $\mathcal{I}_2$ accounts for the regular repetition of the innermost tiles (i.e., of $\mathcal{I}_1$), and so on; collectively, they form the new iteration space. The tiles are parallelepipeds and are described by n tiling matrices $(P_1, P_2, \ldots, P_n)$ and a tiling offset vector q. The method for obtaining the new iteration space should be generic such that it covers simple loop tiling and copartitioning. The following theorem describes the formula for obtaining the tiled iteration space.

Theorem 3.1.1 Let the initial iteration space be $\mathcal{I} = \{I \in \mathbb{Z}^m \mid AI \ge b\}$. Then, given n tiling matrices $(P_1, P_2, \ldots, P_n)$, an n-hierarchically tiled iteration space $\mathcal{I}_{tiled}$ is obtained as follows:

$$\mathcal{I}_{tiled} = \left\{ \begin{pmatrix} I_1 \\ I_2 \\ \vdots \\ I_{n+1} \end{pmatrix} \,\middle|\, \begin{pmatrix} A_1 & 0 & \cdots & 0 \\ 0 & A_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & 0 \\ 0 & 0 & \cdots & A_{n+1} \end{pmatrix} \begin{pmatrix} I_1 \\ I_2 \\ \vdots \\ I_{n+1} \end{pmatrix} \ge \begin{pmatrix} b_1 \\ b_2 \\ \vdots \\ b_{n+1} \end{pmatrix} \right\} \qquad (3.2)$$

$$\mathcal{I}_1 = \{I_1 \in \mathbb{Z}^m \mid A_1 I_1 \ge b_1\}, \quad \mathcal{I}_2 = \{I_2 \in \mathbb{Z}^m \mid A_2 I_2 \ge b_2\}, \quad \ldots, \quad \mathcal{I}_{n+1} = \{I_{n+1} \in \mathbb{Z}^m \mid A_{n+1} I_{n+1} \ge b_{n+1}\},$$

where

$$A_j = \begin{pmatrix} \sigma_j \cdot \mathrm{adj}(P_j) \\ -\sigma_j \cdot \mathrm{adj}(P_j) \end{pmatrix}, \quad b_j = \begin{pmatrix} 0 \\ -\sigma_j \cdot \det(P_j) \cdot o + w_j \end{pmatrix} \quad \text{for } j = 1, \ldots, n \qquad (3.3)$$

$$A_{n+1} I_{n+1} \ge b_{n+1} \ \equiv\ \mathrm{Proj}\left( \begin{pmatrix} -A_n' P_n' & A_n' \\ 0 & A \end{pmatrix} \begin{pmatrix} I_{n+1} \\ I \end{pmatrix} \ge \begin{pmatrix} b_n' + A_n' \cdot q \\ b \end{pmatrix} \right) \qquad (3.4)$$

$$A_n' = \begin{pmatrix} \sigma_n \cdot \mathrm{adj}(P_n') \\ -\sigma_n \cdot \mathrm{adj}(P_n') \end{pmatrix}, \quad b_n' = \begin{pmatrix} 0 \\ -\sigma_n \cdot \det(P_n') \cdot o + w_n' \end{pmatrix} \qquad (3.5)$$

Here, $\mathcal{I}_1$ denotes the iteration space that belongs to the tile defined by $P_1$, and $\mathcal{I}_2$ contains the origins of the tiles defined by $P_1$. Similarly, $\mathcal{I}_{n+1}$ contains the origins of the tiles defined by $P_n'$ such that they cover the original iteration space. Also, in Equations (3.3) and (3.4), $o = (1 \ldots 1)^T \in \mathbb{Z}^m$ and $w_j = (u_1^j \ldots u_m^j)^T \in \mathbb{Z}^m$, where

$$u_i^j = \frac{1}{g_i^j} \prod_{k=1}^{m} g_k^j \quad \forall\, 1 \le i \le m \qquad \text{and} \qquad g_k^j = \gcd\left( P_{j,k,l} \mid l \in \{1 \ldots m\} \right)$$

where $P_{j,k,l}$ is the $(k,l)$ element of the tiling matrix $P_j$. Also, let $\sigma_i = \frac{\det(P_i')}{|\det(P_i')|}$, $P_n' = \prod_{i=1}^{n} P_i$, and let $\mathrm{adj}(P)$ denote the adjugate matrix.
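As a worked instance of Equation (3.3), consider the inner tile $P_1 = \begin{pmatrix} 2 & 0 \\ 0 & 3 \end{pmatrix}$ from Example 3.1.3: $\sigma_1 = 1$, $\mathrm{adj}(P_1) = \begin{pmatrix} 3 & 0 \\ 0 & 2 \end{pmatrix}$, $\det(P_1) = 6$, $g_1^1 = 2$, $g_2^1 = 3$, hence $u_1^1 = 3$, $u_2^1 = 2$, and $w_1 = (3\ 2)^T$, so that

$$A_1 = \begin{pmatrix} 3 & 0 \\ 0 & 2 \\ -3 & 0 \\ 0 & -2 \end{pmatrix}, \qquad b_1 = \begin{pmatrix} 0 \\ 0 \\ -3 \\ -4 \end{pmatrix},$$

i.e., $0 \le i_1 \le 1$ and $0 \le j_1 \le 2$; these are exactly the first four rows of Equation (3.9) in Example 3.1.4 below.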

Proof 3.1.1 If P is the tiling matrix, then, per definition, it contains the side vectors of a tile as linearly independent column vectors. According to the Minkowski theorem [156], the iteration space of the points within the tile can be represented as

$$\mathcal{I} = \{I \in \mathbb{Z}^n \mid I = P\kappa \ \wedge\ z \le \kappa < o,\ \kappa \in \mathbb{Q}^n\}, \quad z = (0 \ldots 0)^T, \quad o = (1 \ldots 1)^T$$

In [86], it was shown that the following implicit definition is equivalent to the above Minkowski characterization:

$$\mathcal{I} = \{I \in \mathbb{Z}^n \mid AI \ge b\} = \left\{ I \in \mathbb{Z}^n \,\middle|\, \begin{pmatrix} \sigma \cdot \mathrm{adj}(P) \\ -\sigma \cdot \mathrm{adj}(P) \end{pmatrix} I \ge \begin{pmatrix} 0 \\ -\sigma \cdot \det(P) \cdot o + w \end{pmatrix} \right\} \qquad (3.6)$$


where $\sigma = \frac{\det(P)}{|\det(P)|}$, z is the zero vector, and $o = (1 \ldots 1)^T \in \mathbb{Z}^n$. Also,

$$w = (u_1 \ldots u_n)^T \in \mathbb{Z}^n, \quad \text{where} \quad u_i = \frac{1}{g_i} \prod_{k=1}^{n} g_k \ \ \forall\, 1 \le i \le n, \qquad g_k = \gcd\left( p_{k,l} \mid l \in \{1, \ldots, n\} \right)$$

On replacing P with the corresponding tiling matrix of each hierarchy, i.e., $P_1, \ldots, P_n$, in Equation (3.6), we obtain the iteration spaces $\mathcal{I}_1, \ldots, \mathcal{I}_n$. Now, it remains to be determined how the outermost tiles cover the original iteration space. The outermost tile is given by the tiling matrix $P_n' = \prod_{i=1}^{n} P_i$. The origins of the outermost tiles $I_{n+1}$ are the same as for a tiling with matrix $P_n'$. In this case, the iteration points $I_n'$ inside the tile are described by the polytope (using the Minkowski characterization)

$$\mathcal{I}_n' = \{I_n' \in \mathbb{Z}^n \mid A_n' I_n' \ge b_n'\} = \left\{ I_n' \in \mathbb{Z}^n \,\middle|\, \underbrace{\begin{pmatrix} \sigma \cdot \mathrm{adj}(P_n') \\ -\sigma \cdot \mathrm{adj}(P_n') \end{pmatrix}}_{A_n'} I_n' \ge \underbrace{\begin{pmatrix} 0 \\ -\sigma \cdot \det(P_n') \cdot o + w_n' \end{pmatrix}}_{b_n'} \right\} \qquad (3.7)$$

The above points can also be obtained as $I_n' = I - (P_n' I_{n+1} + q)$, where $I_{n+1}$ denotes the origins of the outermost tiles. Therefore, we have

$$A_n' \left( I - (P_n' I_{n+1} + q) \right) \ge b_n' \qquad (3.8)$$

as the system of inequalities describing all points $I_n'$. Then, the polytope $A_{n+1} I_{n+1} \ge b_{n+1}$ describing the origins of the non-empty outermost tiles can be determined by the projection in Equation (3.4). The upper row in Equation (3.4) for determining the polytope $A_{n+1} I_{n+1} \ge b_{n+1}$ is the same as Equation (3.8). Proj defines the projection onto the subspace defined by the variables in $I_{n+1}$, eliminating all variables I. This is done using the Fourier-Motzkin elimination [156]. Therefore, we obtain a hierarchical tiling of the iteration space $\mathcal{I}$. □
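The Fourier-Motzkin elimination used in Equation (3.4) can be illustrated on a deliberately small system (illustrative numbers only): eliminating y from $\{y \ge 0,\ x + y \le 3,\ x \ge 0\}$ pairs the lower bound $y \ge 0$ with the upper bound $y \le 3 - x$, yielding $0 \le 3 - x$ and hence the projection $0 \le x \le 3$ onto the x-subspace.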

Simple loop tiling can also be interpreted from Equation (3.2) as the special case of $n = 1$:

$$\mathcal{I}_1 = \{I_1 \mid A_1 I_1 \ge b_1\}, \quad \mathcal{I}_2 = \{I_2 \mid A_2 I_2 \ge b_2\}, \quad \text{where}$$

$$A_1 = \begin{pmatrix} \sigma \cdot \mathrm{adj}(P) \\ -\sigma \cdot \mathrm{adj}(P) \end{pmatrix}, \qquad b_1 = \begin{pmatrix} 0 \\ -\sigma \cdot \det(P) \cdot o + w_1 \end{pmatrix}$$

$$A_2 I_2 \ge b_2 \ \equiv\ \mathrm{Proj}\left( \begin{pmatrix} -A_1 P & A_1 \\ 0 & A \end{pmatrix} \begin{pmatrix} I_2 \\ I \end{pmatrix} \ge \begin{pmatrix} b_1 + A_1 \cdot q \\ b \end{pmatrix} \right)$$

This is the same result as proposed in the seminal work of Ancourt et al. [6], where this outer tile set is given by

$$\{I_2 \in \mathcal{I}_2 \mid \exists I \in \mathcal{I} \text{ s.t. } AI \ge b \wedge A_1(I - P I_2) \ge b_1\} = \{I_2 \in \mathcal{I}_2 \mid A_2 I_2 \ge b_2\}$$

Example 3.1.4 The iteration space of the FIR filter in Example 3.1.2, $\mathcal{I} = \{(i\ j)^T \mid 0 \le i \le T-1,\ 0 \le j \le N-1\}$ for $T = 8$ and $N = 6$, is copartitioned (i.e., 2-hierarchically tiled) with the help of the tiling matrices $P_1 = \begin{pmatrix} 2 & 0 \\ 0 & 3 \end{pmatrix}$ and $P_2 = \begin{pmatrix} 2 & 0 \\ 0 & 2 \end{pmatrix}$ for the inner and outer tiles, respectively. Then, after applying hierarchical tiling according to Theorem 3.1.1, one obtains the new iteration space as shown in Equation (3.9).

$$\mathcal{I}_{tiled} = \left\{ \begin{pmatrix} i_1 \\ j_1 \\ i_2 \\ j_2 \\ i_3 \\ j_3 \end{pmatrix} \,\middle|\, \begin{pmatrix} 3 & 0 & 0 & 0 & 0 & 0 \\ 0 & 2 & 0 & 0 & 0 & 0 \\ -3 & 0 & 0 & 0 & 0 & 0 \\ 0 & -2 & 0 & 0 & 0 & 0 \\ 0 & 0 & 2 & 0 & 0 & 0 \\ 0 & 0 & 0 & 2 & 0 & 0 \\ 0 & 0 & -2 & 0 & 0 & 0 \\ 0 & 0 & 0 & -2 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 & -4 & 0 \\ 0 & 0 & 0 & 0 & 0 & -6 \end{pmatrix} \begin{pmatrix} i_1 \\ j_1 \\ i_2 \\ j_2 \\ i_3 \\ j_3 \end{pmatrix} \ge \begin{pmatrix} 0 \\ 0 \\ -3 \\ -4 \\ 0 \\ 0 \\ -2 \\ -2 \\ 0 \\ 0 \\ -4 \\ 0 \end{pmatrix} \right\} \qquad (3.9)$$

After simplifying the bounds, one can write the tiled iteration space in terms of a for loop as shown earlier in Figure 3.6. The PARO output is

par (j3 == 0 and i1 >= 0 and j1 >= 0 and i1 <= 1 and j1 <= 2 and i2 >= 0 and j2 >= 0 and i2 <= 1 and j2 <= 1 and i3 >= 0 and i3 <= 1)


Let the initial iteration space of a loop algorithm be given. The tiling transformation requires n tiling matrices for an n-hierarchical tiling (e.g., n = 2 for copartitioning). Then, the new iteration space is given by Equation (3.2). It may be noted that the tiled iteration space can be written in terms of for loops, which scan all points inside the tiled iteration space. This is a code generation problem. In Section 3.2.1.1, we present a novel scheme for scanning the tiled iteration space, which is necessary for implementing the parallelization scheme (i.e., LSGP, LPGS, ...). It must be noted here that the presented methodology deals with fixed loop boundaries and tile sizes. An important future work is the research on a methodology for parametrized tiling, i.e., where loop bounds and tile sizes are expressed by parameters. This would be necessary for the dynamic mapping of loop programs onto so-called resource-aware invasive architectures [165].

3.1.2.2. Embedding: Splitting of Data Dependencies

The existing affine and uniform iteration-carried dependencies need to be embedded in the new tiled iteration space. This step introduces new equations with new data dependencies and iteration conditions in case the data dependencies cross different tiles. Each general recurrence equation in a DPLA (as in Def. 2.1.3) is brought into the following form

$$x[I] = \mathcal{F}(\ldots, y[QI - d], \ldots) \quad \forall I \in \mathcal{I}^c$$

by a transformation called output normal form. For the sake of simplicity, the above equation is written in the following equivalent form [168]:

$$x[I] = \mathcal{F}(\ldots, z[I], \ldots) \quad \forall I \in \mathcal{I}^c;$$
$$z[I] = y[QI - d] \quad \forall I \in \mathcal{I}^c;$$

The variables and the dependencies are embedded in the new tiled iteration space $\mathcal{I}_{tiled}$ as follows:

$$x[I_1, I_2, \ldots, I_{n+1}] = \mathcal{F}(\ldots, z[I_1, I_2, \ldots, I_{n+1}], \ldots)$$
$$z[I_1, I_2, \ldots, I_{n+1}] = y[QI_1 - d - R_0,\ QI_2 - (R_1 - R_0),\ \ldots,\ QI_n - (R_{n-1} - R_{n-2}),\ QI_{n+1} + R_{n-1}]$$

By the definition of tiling,

$$I = \hat{I}_1 + \hat{I}_2 + \ldots + \hat{I}_{n+1}, \quad \text{where} \quad \hat{I}_1 = I_1, \quad \hat{I}_2 = P_1' I_2, \quad \ldots, \quad \hat{I}_{n+1} = P_n' I_{n+1} \qquad (3.10)$$


Therefore, one can write

$$QI - d = Q(\hat{I}_1 + \hat{I}_2 + \ldots + \hat{I}_{n+1}) - d = (QI_1 - d - R_0) + (QI_2 - R_1 + R_0) + \ldots + (QI_n - (R_{n-1} - R_{n-2})) + (QI_{n+1} + R_{n-1})$$

as all the $R_i$ terms cancel each other when added together in the above equation. The values of $(R_0, R_1, \ldots, R_{n-1})$ give the new data dependencies. Hence, the problem is to find all distinct $(R_0, R_1, \ldots, R_{n-1})$ such that

$$QI_1 - d - R_0 \in \mathcal{I}_1$$
$$H_1(QI_2 - (R_1 - R_0)) \in \mathcal{I}_2 \qquad (3.11)$$
$$\vdots$$
$$H_n(QI_{n+1} + R_{n-1}) \in \mathcal{I}_{n+1}$$

where $H_n = P_n'^{-1}$. In [168], it was shown for $n = 1$ (i.e., simple tiling) that one can set up a constraint polytope and enumerate all its points to find the different possible values of $R_0$. In this section, we extend the method to the embedding of data dependencies in case of a general n-hierarchical tiling. We need to define the embedding operation for hierarchical tiling and use an induction for defining the operation. We start with the embedding transformation for copartitioning (i.e., n = 2), which leads to an equation of the form:

$$x[I_1, I_2, I_3] = \mathcal{F}(\ldots, z[I_1, I_2, I_3], \ldots)$$
$$z[I_1, I_2, I_3] = y[QI_1 - d - R_0,\ QI_2 + R_0 - R_1,\ QI_3 + R_1]$$

Therefore, the problem is to find all $(R_0, R_1)$ such that

$$QI_1 - d - R_0 \in \mathcal{I}_1 \qquad (3.12)$$
$$H_1(QI_2 + R_0 - R_1) \in \mathcal{I}_2 \qquad (3.13)$$
$$H_2(QI_3 + R_1) \in \mathcal{I}_3 \qquad (3.14)$$

Assume the target and the source iteration of a data dependence vector are given by $(I_1^1, I_2^1, I_3^1)$ and $(I_1^2, I_2^2, I_3^2)$, respectively. From Equation (3.2) and Equation (3.14), we infer

$$I_3^1 = I_3 \ \wedge\ A_3 \cdot I_3^1 \ge b_3 \qquad (3.15)$$
$$H_2 \cdot Q \cdot I_3^1 + H_2 R_1 = I_3^2 \ \wedge\ A_3 \cdot I_3^2 \ge b_3 \qquad (3.16)$$

Hence, using the above Equations (3.15) and (3.16), $R_1$ must satisfy

$$R_1 = H_2^{-1} \left( I_3^2 - H_2 Q I_3^1 \right) = P_2' I_3^2 - Q P_2' I_3^1 \qquad (3.17)$$


Similarly, from Equation (3.2) and Equation (3.13), we infer

$$I_2^1 = I_2 \ \wedge\ A_2 \cdot I_2^1 \ge b_2$$
$$H_1 \cdot Q \cdot I_2^1 + H_1 (R_0 - R_1) = I_2^2 \ \wedge\ A_2 \cdot I_2^2 \ge b_2$$

Hence, $R_0$ must satisfy

$$H_1 R_0 = H_1 R_1 + I_2^2 - H_1 Q I_2^1 = H_1 \left( P_2' I_3^2 - Q P_2' I_3^1 \right) + I_2^2 - H_1 Q I_2^1$$

which implies that

$$R_0 = P_2' I_3^2 - Q P_2' I_3^1 + H_1^{-1} \left( I_2^2 - H_1 Q I_2^1 \right) = P_2' I_3^2 - Q P_2' I_3^1 + P_1' I_2^2 - Q P_1' I_2^1 \qquad (3.18)$$

Lastly, $A_1 \cdot I_1^1 \ge b_1$ and $A_1(Q \cdot I_1^1 - d - R_0) \ge b_1$ (from Equation (3.12) and $I_1^1 = I_1$). By replacing $R_0$, we get the following set of inequalities, the so-called constraint polytope:

$$\begin{pmatrix} A_1 \cdot Q & A_1 Q \cdot P_1' & -A_1 \cdot P_1' & A_1 Q \cdot P_2' & -A_1 \cdot P_2' \\ A_1 & 0 & 0 & 0 & 0 \\ 0 & A_2 & 0 & 0 & 0 \\ 0 & 0 & A_2 & 0 & 0 \\ 0 & 0 & 0 & A_3 & 0 \\ 0 & 0 & 0 & 0 & A_3 \end{pmatrix} \begin{pmatrix} I_1^1 \\ I_2^1 \\ I_2^2 \\ I_3^1 \\ I_3^2 \end{pmatrix} \ge \begin{pmatrix} b_1 + A_1 d \\ b_1 \\ b_2 \\ b_2 \\ b_3 \\ b_3 \end{pmatrix} \qquad (3.19)$$

The above polytope has 5m variables, and one must enumerate all its integral points to find the distinct $(R_0, R_1)$. This is done by replacing the distinct point coordinates obtained on enumeration, i.e., $I_2^1$, $I_2^2$, $I_3^1$, $I_3^2$, in Equation (3.18) and Equation (3.17). For each distinct value, a new equation is generated. The enumeration is done by scanning the polytope for integer points lying in the convex hull of the polytope. The complexity and scalability of this approach are discussed in Section 3.1.3.

The above derivation can be extended by induction to an n-hierarchical tiling and gives the following set of inequalities for the constraint polytope:

$$A_1 \cdot QI_1^1 + \sum_{i=2}^{n+1} \left( A_1 Q \cdot P_{i-1}' I_i^1 - A_1 \cdot P_{i-1}' I_i^2 \right) \ge b_1 + A_1 d$$
$$A_1 I_1^1 \ge b_1$$
$$A_i I_i^1 \ge b_i \quad \forall\, 2 \le i \le n+1$$
$$A_i I_i^2 \ge b_i \quad \forall\, 2 \le i \le n+1$$

Similarly, by enumerating the above polytope for each data dependency (Q and d) and finding the distinct values of $(R_0, \ldots, R_{n-1})$, one can add new equations to the partitioned description of the algorithm. The matrix Q and the vector d represent the affine and the regular part of the data dependencies in the initial program. In Section 3.1.3, it will be shown that enumeration is a feasible approach for determining the data dependencies in the hierarchically tiled iteration space for several realistic loop benchmarks.

69 3. Accelerator Generation: Loop Transformations and Back End

Equations Q d R0 R1

0 0 T T T a[i1, j1,i2, j2,i3] = a[0, j1,0, j2,0] (0 0) (0 0) (0 0) 0 1! T T T u[i1,...] = u[i1 1, j1 1,i2, j2,i3] E (1 1) (0 0) (0 0) − − T T T u[i1,...] = u[i1 + 1, j1 1,i2 1, j2,i3] E (1 1) (1 0) (0 0) − − T T T u[i1,...] = u[i1 1, j1 + 2,i2, j2 1,i3] E (1 1) (0 1) (0 0) − − T T T u[i1,...] = u[i1 + 1, j1 + 2,i2 1, j2 1,i3] E (1 1) (1 1) (0 0) − − T T T u[i1,...] = u[i1 + 1, j1 1,i2 + 1, j2,i3 1] E (1 1) (0 0) ( 1 0) − − T T − T u[i1,...] = u[i1 + 1, j1 + 2,i2 + 1, j2 1,i3 1] E (1 1) (0 1) ( 1 0) − − T T − T x[i1,...] = a[i1, j1,i2, j2,i3, j3] u[i1,...] E (0 0) (0 0) (0 0) · T T T y[i1,...] = 0 + x[i1, j1,i2, j2,i3] E (0 0) (0 0) (0 0) T T T y[i1,...] = y[i1, j1 1,i2, j2,i3] + x[...] E (0 1) (0 0) (0 0) − T T T y[i1,...] = y[i1, j1 + 2,i2, j2 1,i3] + x[...] E (0 1) (0 1) (0 0) − T T T Y [i1,...] = y[i1, j1,i2, j2,i3] E (0 1) (0 0) (0 0)

Table 3.1.: The table describes the set of new equations obtained after copartitioning of the FIR filter. description of the algorithm. The matrix Q and vector d represent the affine and regular part of the data dependencies in the initial program. In Section 3.1.3, it will be shown that enumeration is a feasible approach for determining the data dependencies in the hierarchically tiled iteration space for several realistic loop benchmarks.

Example 3.1.5 For our running FIR filter example one obtains the set of new equa- tions as shown in Table 3.1. The variable u with uniform data dependency ( (1 1)T) gives 6 distinct values of (R0,R1). Therefore, the number of equations in the obtained output DPLA for variable u is only 6. E is the identity matrix. The input variable U is also embedded in the tiled iteration space.

In case of tiling, the intuitive explanation of the larger number of equations is due to the data dependencies crossing the tiles. The circuit interpretation of the program is that one needs multiplexers to select the correct input, which could come from the memory, neighbouring PE, or the same PE. The control signals to the multiplexers are determined by the iteration-based control conditions as shown in next section.

3.1.2.3. Iteration dependent Conditions In this subsection, we will embed the initial iteration conditions in the tiled iteration space. Furthermore, embedding of data dependencies leads to new equations, which in turn are associated with unique iteration-based conditions. For an n-hierarchical

70 3.1. Loop Optimizations for Accelerator Tuning tiling, each new equation has the following new conditions depending on the value of (R0,...,Rn 1): −

A1(Q I1 d R0) b1 · − − ≥ A2(QI2 H1(R1 R0)) b2 − − ≥. .

An+1(QIn+1 + HnRn 1) bn+1 − ≥

This is because the conditions QI1 d R0 1, QI2 H1(R1 R0) 2,..., − − ∈ I − − ∈ I QIn+1 + HnRn 1 n+1 need to be guaranteed. The conditions contain a lot of in- equalities which− may∈ I create a lot of control overhead in the hardware implementation. However, after removal of redundant inequalities (inequalities that may be omitted from the specification without changing the iteration condition), one obtains a sim- plified form of the iteration-based conditions for practical examples. If the initial iteration-based condition (I c) is a linearly bounded lattice, which can be repre- sented as follows. ∈ I c = I = A I + b As I bs I { · ∧ · ≥ } then the transformed iteration-based condition is given by

1 1 1 1 AsA− I1 + AsA− I2 + ... + AsA− In 1 bs + AsA− b · · · + ≥ where (I1,...,In+1) denotes the iteration vector of the tiled iteration space.

Example 3.1.6 Table 3.2 summarizes the iteration-based control conditions obtained for the given FIR filter example. The conditions correspond to the equations as given in Table 3.1. These conditions need to be evaluated by the processor array to control multiplexers that select the input for correct evaluation of variables. Furthermore, a counter is needed to provide the value of the iteration variables in case they are chosen for sequential execution by the parallelization strategy. The methodology for efficient synthesis of control units is discussed in Section 3.2. Once the control conditions are obtained, one obtains the output DPLA as shown in Example 3.1.3. Finally, the iterations and operations need to be scheduled on the processor array. This is done using an affine transformation realizing different tiling schemes and is discussed in the next section.

3.1.2.4. Parallelization of Tiled Piecewise Linear Algorithms An important aspect of tiling involves allocation of PEs and global scheduling of iterations. Scheduling determines a cycle-accurate execution of the algorithm on the

71 3. Accelerator Generation: Loop Transformations and Back End

Control Conditionals Q d R0 R1 0 0 a Empty (0 0)T (0 0)T (0 0)T 0 1! T T T u i1 > 0 j1 > 0 E (1 1) (0 0) (0 0) ∧ T T T u i1 = 0 j1 > 0 i2 > 0 E (1 1) (1 0) (0 0) ∧ ∧ T T T u i1 > 0 j1 = 0 j2 > 0 E (1 1) (0 1) (0 0) ∧ ∧ T T T u i1 = 0 j1 = 0 i2 > 0 j2 > 0 E (1 1) (1 1) (0 0) ∧ ∧ ∧ T T T u i1 = 0 j1 > 0 i2 = 0 i3 > 0 E (1 1) (0 0) ( 1 0) ∧ ∧ ∧ T T − T u i1 = 0 j1 = 0 i2 = 0 j2 > 0 i3 > 0 E (1 1) (0 1) ( 1 0) ∧ ∧ ∧ ∧ − x E (0 0)T (0 0)T (0 0)T T T T y j1 = 0 j2 = 0 j3 = 0 E (0 0) (0 0) (0 0) ∧ ∧ T T T y j1 > 0 E (0 1) (0 0) (0 0) T T T y j1 = 0 j2 > 0 E (0 1) (0 1) (0 0) ∧ T T T Y j3 = 0 j2 = 1 j2 = 2 E (0 1) (0 0) (0 0) ∧ ∧ Table 3.2.: The iteration dependent conditions for the copartitioned FIR filter.

architecture and exploitation of parallelism in its data path. The LPGS and LSGP strategies of parallelization are associated with inner and outer loop parallelization, respectively. This is mathematically represented by the space-time mapping as given in Equation (2.11) and Equation (2.12). Similarly, copartitioning is represented by the space-time mapping given in Equation (3.1). We illustrate the copartitioning strategy of tiling and parallelization with help of the running FIR filter example.

Example 3.1.7 For the FIR filter in Example 3.1.3, a feasible space-time mapping of the copartitioned DPLA is shown in Equation (3.20). The schedule for the chosen copartitioning scheme such that iterations within the inner tile are executed sequen- tially on the same processor. Whereas the inner tiles are executed in parallel on different processors.

p = (i2 j2) λ = 3 1 4 3 8 (3.20)  

The description obtained after applying the above affine transformation to the output DPLA of the FIR filter in Example 3.1.3 is shown in Equation (3.21). One can derive

72 3.1. Loop Optimizations for Accelerator Tuning

a a Local Local From Controller 0 Controller Counter Mem U00 u u Global Mem Mem 0 Controller A00 A01 * *

+ + Mem PE p PE p y y 0,0 0,1 0 U00 0 Yout

Mem a a U10 PE p PE p From 1,0 1,1 Mem U10 Local Local Controller Controller u u control signals * * counter values data signals + + y y 0 0 Yout

Figure 3.7.:2 2 processor array architecture implementing the copartitioned FIR filter× according to Example 3.1.7. the description of the processor array architecture from the following description.

a[p1, p2,t] = a[0, j1,0, j2,0, j3] n UA(i1,i2,i3, j3) if j1 = 0 p2 = 0 ∧ u[p1, p2,t 4] if i1 > 0 j1 > 0  − ∧  u[p 1, p ,t 2] if i = 0 j > 0 p > 0  1 2 1 1 1  − − ∧ ∧  u[p1, p2 1,t 4] if i1 > 0 j1 = 0 p2 > 0 u[p1, p2,t] =  − − ∧ ∧  u[p1 1, p2 1,t 2] if i1 = 0 j1 = 0 p1 > 0 p2 > 0  − − − ∧ ∧ ∧ u[p1 + 1, p2,t 2] if i1 = 0 j1 > 0 p1 = 0 i3 > 0 − ∧ ∧ ∧  u p p t if i j p p  [ 1 + 1, 2 1, 2] 1 = 0 1 = 0 1 = 0 2 > 0  − − ∧ ∧ ∧  i3 > 0  ∧ x[p1, p2,t] = a[p1, p2,t] u[p1, p2,t] (3.21)  · x[p1, p2,t] if j1 = 0 p2 = 0 j3 = 0 ∧ ∧ y[p1, p2,t] = y[p1, p2,t 1] + x[p1, p2,t] if j1 > 0  −  y[p1, p2 1,t 1] + x[p1, p2,t] if j1 = 0 p2 > 0 − − ∧ Yout [p1, p2,t] = y[p1, p2,t] if j3 = 0 p2 = 1 j1 = 2 (3.22)  ∧ ∧ The obtained processor array architecture is shown in Fig. 3.7. The resulting archi- tecture is a 2 2 processor array. The boxes denote delay registers. The number of registers can be× obtained by multiplying the schedule vector λ with the dependency vector. E.g., the data dependency vector of u, (1 1 0 0 0 0)T leads to a connection

73 3. Accelerator Generation: Loop Transformations and Back End with four registers. The steps needed for performing an n-hierarchical tiling are summarized in Algo- rithm 3.1. After tiling of the iteration space of the DPLA, each data dependency is embedded in the new iteration space. Also, corresponding to each data dependency, the iteration-based condition is determined. Finally, the space-time mapping is cho- sen so as to define the allocation and scheduling. The user can specify the tiling and parallelization strategy in order to match the implementation to the architecture in terms of availability of PEs and memories.

Algorithm 3.1 An algorithm for n-hierarchical tiling of a DPLA Require: : Iteration space, n: hierarchy of tiling, P1,...,Pn: tiling matrices, DPLA: dynamicI piecewise linear algorithm Ensure: tiled: tiled Iteration space, DPLAtiled: tiled dynamic piecewise linear algo- rithmI code 1: tiled= Tiling( ,n,P1,...,Pn) I I 2: for all variables xi in equations Si of DPLA with dependencies, Q, d do 3: new dependencies, d(R0,...,Rn 1)=Expand( tiled,d,Q,n,P1,...,Pn) − I 4: for each distinct d(R0,...,Rn 1) do −tiled 5: compute condition space, =Expand( C , tiled,d,Q,n,P1,...,Pn) ICi I i I 6: Update new equations with new data dependencies and conditional space in the tiled DPLA code 7: end for 8: end for 9: perform space-time mapping

3.1.3. Results: Scalability and Overhead of Hierarchical Tiling In this section, we study the scalability of n-hierarchical tiling using different loop algorithms as benchmarks. The scalability study is done in order to learn about the effect of the iteration space size and its dimension on the execution time of the trans- formation. This is important because the code generation of statements inside the loop kernel depends on the scanning of the constraint polytope. The tiled iteration space is found applying the Fourier-Motzkin elimination once for the outermost tile (see Equation (3.4)) and simple matrix operations for inner tiles. In addition, the ef- fect of the hierarchy of tiling on the run-time of the transformation needs to be stud- ied. Figure 3.8 shows the execution time for performing of hierarchical loop tiling transformation for different loop programs like FIR filter (FIR), matrix multiplication (MMM), and edge detection (ED). The different versions 1 and 2 represent variants with different loop bounds of 100 and 1000, respectively. n denotes the number of hierarchical tiling levels. The experiments were carried out on an Intel Core2Duo processor running at 1.86 GHz with L2 cache size of 1 MB. It can be observed that

74 3.2. Controller Generation

100 n=1 n=2 n=3

10

1 execution time(s)

0.1

0.01 FIR1 FIR2 MMM1 MMM2 ED1 ED2

Figure 3.8.: Execution times for performing the hierarchical loop tiling transforma- tion for different loop programs like FIR filter, matrix multiplication, and edge detection. n denotes the number of hierarchical tiling levels. the execution time of hierarchical tiling for a given n is independent of the iteration space size of same loop program. However, different loop programs require different execution time because of the different number of data dependencies and loop state- ments. For the same loop program, the execution time increases with the number n of tiling levels, the reason being that the enumeration requires a conversion of the constraint polytope from the set of inequalities, i.e., half space representation to dual vertex representation (a linear combination of so-called lines, a convex combination of vertices, and a positive combination of extreme rays). This depends exponentially on the number of inequalities in the half space representation. The number of in- equalities in turn increases linearly with hierarchy n of tiling. The proposed method suffices for accelerator architectures with three levels of hierarchy in parallelism and memory model.

3.2. Controller Generation

In order to handle large scale problems, to balance local memory requirements with I/O-bandwidth, and to employ different hierarchies of parallelism and memory, a sophisticated transformation called hierarchical tiling was introduced in the previous section. A methodology for control generation that generates an efficient controller architecture is needed to control the execution of the operations within a processor array according to the determined allocation and scheduling, because of the following reasons:

75 3. Accelerator Generation: Loop Transformations and Back End

1. The tiling transformation not only increases the loop nest’s code size but also introduces a more complex control flow in the program. This can also be ob- served in Example 3.1.3, where the piecewise linear representation of the tiled FIR filter has several additional iteration dependent if conditions as compared to the initial program. This arises due to data reuse cases when tiling a loop algorithm. Therefore, control unit is required which depending on the current time steps synchronizes the data flow of data through the whole processor ar- ray. Hence, a control engine is needed for implementing the global schedule, λ (see Equation 2.10).

2. The access to external memory and FIFOs requires additional iteration depen- dent control signals and address generation.

3. The resource sharing inside each processing element due to limited availability of registers and functional units requires a local control unit that synchronizes the flow of data through the multiplexers. In other words, the control engine is responsible for implementing the local schedule, τ(vi) (see Equation 2.10).

Therefore, as discussed in Section 2.3.3.3, (a) iteration dependent control signals, (b) I/O control signals, and (c) internal control signals have to be generated by a suitable control engine. Control synthesis and optimization, although being a well studied problem, takes care only of the last part (i.e., local schedule) [135, 167]. The reason is that traditional high-level synthesis uses at most loop unrolling, which replicates the loop kernel so as to expose parallelism for hardware circuit implementation. How- ever, the first and second cases discussed above arise due to tiling of the loop program onto a regular array consisting of tightly-coupled processing elements. This leads to the problem of implementing the global schedule, which cannot be effectively dealt with by current high-level synthesis tools. The difficulty of the control generation problem is illustrated with help of the co- partitioned FIR filter in Example 3.1.3 on page 59. Each of the iteration dependent conditions after copartitioning as seen in the right hand side of the DPLA can be represented in one of the following forms2:

A1 I1 b1 A2 p b2 A3 I3 b3 (3.23) · ≥ ∧ · ≥ ∧ · ≥ A1 I1 + A2 p + A3 I3 b (3.24) · · · ≥ T The original iteration space co-ordinates I after tiling are given by Itiled = (I1,I2,I3) . Therefore, the calculation of memory addresses and control predicates, which are affine functions of Itiled as shown in Equations (3.23) and (3.24), can be done only if given the following:

2 Note, p = I2 directly follows from the definition of the space-time mapping for copartitioning, cf. Definition 3.1.1.

76 3.2. Controller Generation

• inner tile co-ordinates I1: For each given time step t and processor p, the inner tile co-ordinates I1 of the iteration vector being executed have to be known.

• tile co-ordinates I2: Due to the defined space-time mapping for copartitioning (cf. Definition 3.1.1 on page 57), p = I2. Therefore, the processor index is equivalent to tile co-ordinate I2.

• outer tile origin co-ordinates I3: For each time step t and processor p, the outer (GS) tile origin I3 of the iteration vector being executed is required. We generate a 2m-dimensional global counter which keeps track of the tile co-ordinates m m I1 Z and I3 Z . I.e., the counter implements the scanning code for the sequen- tial∈ tile-coordinates∈ according to the given loop matrix, which determines the global schedule. In the next subsection, we present an efficient methodology for the control synthesis.

3.2.1. Accelerator Control Engine: Architecture and Synthesis Methodology In the following, we propose a hierarchical control engine architecture consisting of a global and a local controller. In order to reduce the amount of control conditions, the purpose of global control is to calculate once the iteration conditions which are independent of the processor index in a global controller, and thus does away with the computation of the condition by each local controller of individual processor element. The most important part of the global controller is the global counter, which keeps track of the iteration vectors needed for calculating iteration dependent conditions and address generation according to the given schedule. The global controller also contains the global decoder, which calculates the iteration conditions being inde- pendent of the processor index (i.e., iteration conditions, which are executed in all processors). The counter and the global decoder signals are propagated into the pro- cessor array using appropriate delay registers. This architecture style of separating iteration conditions for execution in a local or a global controller leads to a consider- able reduction in area cost of the control path as also shown in [47, 43]. This concept avoids unnecessary control decoding in each PE which would have lead to additional logic resources. Only those iteration-based conditions that depend on the processor index are implemented by a local controller which is generated for each of the PEs. It also contains the internal controller, which takes care of the resource sharing, local schedule, and module selection. Figure 3.7 shows a 2 2 processor array realization of a copartitioned FIR filter of Example 3.1.3. The following× four steps constitute our approach for control path generation.

1. Scanning of the tiled iteration space: In this step, a global counter for producing values of the sequential iteration space variables is determined.

77 3. Accelerator Generation: Loop Transformations and Back End

2. Determination of PE types: This step finds processor regions of the same type based on iteration conditions. This helps in classification of control predicates for workload distinction between local controllers and global controller.

3. Initialization of local and global control signals: In this step, local, global, and internal controllers for the computation of control predicates are generated.

4. Propagation of control and iteration variables: Here, the requisite interconnect and corresponding delays required for the propagation of global counter and control signals are determined.

In the next subsection, we present each of the steps in detail.

3.2.1.1. Counter Generation

In the previous section, n-hierarchical tiling was introduced which transformed the m (n+1) m iteration space, tiling : tiled, Z tiled Z · . The tiled iteration space I → I I ∈ ∧I ∈ tiled is given by Equation 3.2. For a given loop program with iteration space, tiled = I T I I = (I1 I2 ... In+1) AI b , the global allocation and scheduling is assumed to be given by the following| space-time≥ mapping. 

I1 p E 0 ... 0 0 ... 0 . =  .  (3.25) t ! λ1 ... λk 1 λk λk+1 ... λn+1 ! − I  n+1  θ   It can be assumed| without loss of generality{z that all the loop} iteration variables apart from Ik are executed sequentially, whereas Ik is executed in parallel. Ik = p from Equation (3.25) implies that Ik for each processor is constant and is equivalent to the processor index p. In order to evaluate each iteration condition, the value of (I1,...,Ik 1,Ik+1,...,In+1) needs to be computed first. Since θ is in general not in- − vertible, it is not possible to determine the value of Iseq = (I1,...,Ik 1,Ik+1,...,In+1) from processor index p and time t. − Therefore, a counter is needed to scan the internal points of a tile which are exe- cuted sequentially, i.e., (I1,...,Ik 1,Ik+1,...,In+1). The purpose of this section is to synthesize a d-nested counter, which− produces values of the required iteration vari- ables seq (e.g., I1, I3 for copartitioning, I1 for LSGP, or I2 for LPGS) as specified by the globalI schedule. For the space-time mapping given in Equation 3.25 d = n. In a d-nested counter, each counter produces the m-dimensional iteration vectors corre- sponding to I1,...,Ik 1,Ik+1,...,In+1, respectively. For an efficient implementation of each of these d counters,− the concept of path strides was introduced [86].

78 3.2. Controller Generation

Definition 3.2.1 (Path Strides) The path strides of an iteration space I = (i1,i2,...,im) given by an m-dimensional parallelotope, are defined by a matrix ~S = (~s1 ~s2 ... ~sm), where ~si are vectors, which are added to the iteration point to get the next iteration. The step~s1 is added to iteration point I until it crosses the iteration space boundary in direction i1. The stride~s2 takes it back into the iteration space. Similarly, stride~s j is added to iteration point I if it crosses iteration space in direction i j. Such conditions determining the selection of strides are called stride conditions.

Let, the tiled iteration space seq = Iseq = (I1,...,Ik 1,Ik+1,...,In+1) and the cor- I − responding loop matrices (R1,...,Rk 1,Rk+1,...,Rn+1) be given. For the sequential part of tiled iteration space, we require− n-nested counters can be described which can be described without DO-loops [23] as follows:

I1 = I2 = ... = In+1 = 0 label1 : I1 = next1(I1) goto label (3.26) . .

labelk 1 : Ik 1 = nextk 1(Ik 1) goto label − − − − labelk+1 : Ik+1 = nextk+1(Ik+1) goto label . .

labeln+1 : In+1 = nextn+1(In+1) goto label label : (3.27) max I f (I1 = I ) (3.28) 6 1 goto label1 else I1 = 0 . . max I f (In 1 = I ) + 6 n+1 goto labeln+1

The function nextl determines the value of next iteration vector Il l with help of a path stride vector and stride conditions as follows: ∈ I

s ~s1 if C1(Il) s ~s2 if C2(Il) nextl : nextl(Il) = Il + . . (3.29) . .  s ~sm if Cm(Il)  The condition Cs(I ) selects the increment~s , which is added for obtaining the value of i l i the iteration vector corresponding to the next execution time step. The enable signal

79 3. Accelerator Generation: Loop Transformations and Back End

2 i2 2 6 10 42 46 50 i2 1 r1 1 5 9 13 17 21 41 45 49 53 57 61 1 2 i1 i1 0 4 8 12 16 20 24 28 32 40 44 48 52 56 60 64 68 72

r2 11 15 19 23 27 31 51 55 59 63 67 71

22 26 30 62 66 70

0 1 i3

Figure 3.9.: Execution of points within a tile defined by the iteration space given by Equation (3.30) and corresponding to the schedule λ = (4 3 40). The dashed arrows denote the path strides (i.e., increments). −

e C (Il) is generated and used to freeze the computation and to stop the processor array, if at a given time step, a iteration point lies in the iteration space Il and hence, needs to be computed. This is often the case for affine schedules for non-rectangular tiles as will also be shown in next example. The counter is incremented every II (Iteration Interval) cycles which gives the time interval between execution of two consecutive iterations. We explain the computation function next with the following example, before determining the requisite path stride and the corresponding stride conditions.

Example 3.2.1 Fig. 3.9 shows the iteration space of an inner tile I1 = (i1,i2,i3) given 3 6 0 by Equation (3.30). Let, loop matrix R = 3 3 0 defines the sequential exe-  −  0 0 1  3  cution first in r1 direction and then in r2 direction , which leads to the schedule vector λJ, s.t λ = (4 3 40). Let the iteration space be given by the following Z-polyhedron. − 3 6 0 0 3 3 0 0  −  i   0 0 1 1 0 A1I1 b1   i   (3.30) ≥ ⇔  3 6 0  2 ≥  24       − −  i3  −   3 3 0    24   −    −   0 0 1   1   −   −      3For determination of schedule vector for sequential execution, loop matrix is required whose col- umn vectors define the sequential order of execution. This is given by the user or can be deter- mined during scheduling [86]. Therefore loop matrix R1,...,Rk 1,Rk+1,...,Rn+1 are necessary − pre-requisite for sequentially executed iteration vectors I1,...,Ik 1,Ik+1,...,In+1. −

80 3.2. Controller Generation

Global loop counter Global decoder Global Controller Mod-II t Counter Counter 1 I1 Stride-LUT inc Enable

Enable logic 1 1 FUs Loop h s1 ... sm Counter 2 conditionals C I2 counters inc . FSM MUX regs

Decoder Cs . regs FU regs . FUFU regs Comparator PE(0) Counter n In n n s ... s inc 1 m FSM MUX regs regs FU regs FUFU regs PE(1)

Figure 3.10.: Hardware realization of the global counter for the processor array.

Therefore, a counter is required which produces the values of I1 = (i1,i2,i3) syn- chronously for each time step t = λ I1. The description of the counter in form of Equation (3.29), as verified from Figure· 3.9, can be written as follows:

label1 : I1 = next(I1) max I f (I1 = I ) (3.31) 6 1 goto label1 else I1 = 0

1 1 2 0 i1 8 − − −  1  if  1 1 0 i2  8   − ≥ −   0   0 0 3 i3  2     −    −         i i  2 1 2 0 i1 8 1 1  −  next i2 = i2 +  3  if  1 1 0 i2  8  ;      − − ≥ − i i  3 3  0   0 0 3 i3  2         −    −              8 1 2 0 i1 8  −  if  0   1 1 0 i2  8   − ≥  1 0 0 3 i 2     3      −    −  The stride conditions are given by the iteration space boundaries  in respective  direc-  tions. The iteration interval II is 1 for the example which implies that the counter updates with next function every cycle. The enable condition is given by Equation (3.30), i.e., when the counter value which lies in the iteration space.

After determination of the next function for iteration spaces I1,...,Ik 1,Ik+1,...,In+1 the counter can be generated with help of description in Equation 3.28.− Figure 3.10

81 3. Accelerator Generation: Loop Transformations and Back End shows the hardware architecture of a n-dimensional global counter. On the basis of the iteration vector values I1,...,Ik 1,Ik+1,...,In+1 produced by the n counters, the s − stride conditions Ci and enable conditions are evaluated in the decoder inside the global controller. Based on the stride conditions, the corresponding increments are selected from a stride look-up-table (LUT) and added to the counter variables. The modulo-II counter together with the calculated enable condition generates the enable logic for the counters. In order to obtain the path strides and stride conditions for the next function, we perform the following steps:

1. The tiled iteration space is transformed to an equivalent orthogonal space.

2. The strides and stride conditions are found in the transformed orthogonal space.

3. The strides are then transformed back to original space for the path strides and the stride conditions in the tiled iteration space.

As will be shown for the running example in Figure 3.11 on page 87, a transforma- 1 tion matrix pair T and T − is needed, which maps the tiled iteration space onto a 1 rectangular space 0 and back (i.e., T : 0 and T − : 0 ). The loop matrix R is equivalent to theI tiling matrix, whereI −→I the column vectorsI → contain I information on the scanning order. Furthermore, the images of the original iteration spaces must be m integers in the rectangular space, i.e., 0 Z . I ⊆ m m Lemma 3.2.1 If I0 = TI I Z , then I0 Z iff T is an integral matrix. ∧ ∈ ∈ Proof 3.2.1 see [77]  The following theorem gives the construction of the transformation matrix.

Theorem 3.2.2 Given, a loop matrix, R of dimension m. Then the transformation of the iteration space, to an orthogonal space, 0 is given by matrix, T, I I

det(R) 1 det(R) T = σ G R− , where σ = (3.32) det(G) det(R)   | | g1 0 ... 0 . . 0 g .. . G =  2 , where . .. ..  . . . 0     0 ... 0 gm    gi = gcd rij  j=1...m ∀ 

82 3.2. Controller Generation

Proof 3.2.2 Since column vectors of R are the side vectors of the tile. Therefore, 1 column vectors of T − should be parallel to column vectors of R. This also implies 1 that the row vectors of T are parallel to the rows of R− . Furthermore, since we map to a rectangular space in a standard right-handed coordinate system (where the x axis goes to the right and where the y axis goes up). 1 det(R) Hence, T = σVR− , where V is a m m diagonal matrix and σ = det(R) . × | | 1 adj(R) Also, R− = det(R) , where adj(R) and det(R) are the adjugate and the determinant T of R. By finding the greatest common denominator of each row of R = (r1 r2 ... rm) (ri is row vector of R) , then the matrix R can be rewritten as T T 0 0 0 R = (r1 r2 ... rm) = r1 r2 ... rm G, where  

g1 0 ... 0 .. .  0 g2 . . G = , g = gcd r . .. .. i ij  . . . 0  j=1...m   ∀   0 ... 0 gm    Furthermore,  

T 0 0 0 adj(R) = adj r1 r2 ... rm G     T 0 0 0 = adj(G)adj r1 r2 ... rm (adj(AB) = adj(B)adj(A))    Hence,

T 0 0 0 adj(G)adj r1 r2 ... rm 1 adj(R) T = σVR− = σV = σV   det(R) det (R)  T 0 0 0 adj r1 r2 ... rm 1 = σVG− det(G)    det(R)  Now using Lemma (3.2.1), one must select diagonal matrix V such that T is an inte- gral matrix. Therefore, if

det(R) V = G det(G) T adj r0 r0 ... r0 T det(R) 1 2 m Then, T = σ G G 1det(G)   = σ adj r0 r0 ... r0 , wh- det(G) −  det(R)  · 1 2 m ich is an integer matrix.   

83 3. Accelerator Generation: Loop Transformations and Back End

Hence, using Lemma 3.2.1, we infer that T according to Equation (3.32) is a valid transformation matrix. Similar transformation matrix has also been chosen in [86, 77] for scheduling. 

Since T is a non-unimodular matrix, the corresponding orthogonal space has holes. The non-filled points in the rectangular iteration space are called holes (see Figure 3.11). These points do not have an integer pre-image in the original iteration space. The next step is to find the strides and stride conditions for the next function in Equa- tion 3.29 for the transformed iteration space. Such a function will count the actual points and leaves out the holes. Formally, for the scanning code of the rectangular T T 4 0 0 0 0 0 0 0 0 space , I = i1 i2 ... im , the strides S = s1 s2 ... sm and the mutually exclu-  C0 I C0 I C0 I  sive stride conditions s1 ( ), s2 ( ),..., sm ( ) for selecting the strides must be found. The major challenge is therefore, (a) to select stride, si such that the next point is inte- gral, and (b) to find a stride condition C0 I which corresponds to the situation when si+1 ( ) taking stride si leads to an iteration point lying outside the tile boundary. Therefore, on satisfying C0 I , the stride s0 is taken. We use a property of Hermite normal si+1 ( ) i+1 form H0 of transformation matrix T that columns of H0 determines the lattice points of the boundary of fundamental brick (see Appendix B). Unlike other methodologies which use Fourier-Motzkin elimination [13], the method is more intuitive. The fol- lowing theorem is used to determine the requisite strides and stride conditions in the transformed hyper-rectangular iteration space, given the transformation matrix T.

Theorem 3.2.3 If H0 is the Hermite Normal Form of the transformation matrix T = 1 VR− . Then, the path strides are given by

v 1 v 1 0 0 1,1 0 0 1,1 0 h1,1 h1,2 − h1,1 ... h1,m h1,m 1 + − h1,1 − h10 ,1 − − − h10 ,1    .  .     0 . . 0 h2,2 . .   0  .. vm 2,m 2 1  S = 0 0 . h0 h0 + − − − h0  m 2,m m 2,m 1 h0 m 2,m 2   − − − − − m 2,m 2 − −     − −     . .. .. vm 1,m 1 1  . . . h0 − − − h0  m 1,m h0 m 1,m 1   − − m 1,m 1 − −    − −    0 ... 0 h0   mm   (3.33) 

Proof 3.2.3 The Hermite normal form (see Appendix B) H0 of the transformation matrix T is given by

4 Such scanning code for nexti function must be found for all i, i = 1,...,k 1,k + 1,...,n + 1 I −

84 3.2. Controller Generation

0 0 0 h1,1 h1,2 ... h1,m . . 0 h0 .. . 0 ~ 0 ~ 0 ~ 0  2,2  H = h1 h2 ... hm = ...... 0  . hm 1,m     −   0 ... 0 h0   m,m    0 Intuitively, the diagonal element h j, j gives the increment of the iteration variable i j T 0 0 0 0 for the nested update of I = i1 i2 ... im [74, 77, 189]. Therefore, the first path T 0 0   stride is~s1 = h1,1 0 ... 0 . The corresponding stride condition when the iteration vector lies within the rectangular domain is given by i0 < v1 1 1 ... i0 < vm m 1. 1 , − ∧ ∧ m , − Therefore, the stride condition corresponding to~s1 is

T T i0 i0 ... i0 ((1 v11)(1 v22) ... (1 vmm)) (3.34) − 1 − 2 − m ≥ − − −   whereas in the column vector,~h j, the non-diagonal element is equivalent to the offset to the adjacent iteration in direction, i j [74, 77, 189] (see also Hermite normal form in Appendix B). According to the definition of path strides (see Definition 3.2.1), the first stride is taken as long as iteration point I crosses the iteration space. For the transformed rectangular iteration space, the corresponding boundary of the iteration space is i1 v1,1 1. Now the last iteration point in direction i1,Ii1 does not neces- ≤ − T sarily lie on the boundary and is multiple of ~s0 . Therefore, Ii = x h0 0 ... 0 . 1 1 · 1,1 v 1 0  1,1  The value x should minimize x h1,1 (v1,1 1), which leads to x = − . There- · − − h10 ,1   fore, the path stride in direction i2 can be obtained as the difference of the offset in direction i2 and the last iteration in direction i1. Therefore,

v 1 0 1,1 0 h1,2 − h1,1 h10 ,1 h0      2,2  0 ~s2 = . . −  .     .       0   0          The corresponding stride condition for stride~s2 is given by,

T T i0 i0 ... i0 ((v11 1)(1 v22) ... (1 vmm)) (3.35) 1 − 2 − m ≥ − − −   For lexicographic scanning in other directions, il, l > 2, an offset of hi,l 1 needs to be added to for all rows, i l 2. Hence by induction and similar reasoning,− − we ≤ −

85 3. Accelerator Generation: Loop Transformations and Back End obtain v 1 0 1,1 0 h1,m 1 + − h1,1 − − h10 ,1  .   .   ~  vm 2,m 2 1  ~sm = hm h0 + − − − h0  m 2,m 1 h0 m 2,m 2  −  − − − m 2,m 2 − −    − −   vm 1,m 1 1  − − − h0   h0 m 1,m 1   m 1,m 1 − −    − −    0    Therefore, we obtain the path stride vector for the rectangular space as columns of S0 = (~s1 ~s2 ... ~sm) as in Equation (3.33). The corresponding stride condition for stride vector ~s j is given by,

T 0 0 0 0 T i1 ... i j 1 i j ... im (v1,1 1) ... (v j 1,, j 1 1)(1 v j, j)... (1 vm,m) − − − ≥ − − − − − − (3.36)     In the final step, the path strides need to be transformed back to the original domain. This is simply done by transforming the path stride in the rectangular space to the original domain as follows:

1 S = T − S0 (3.37)

s 0 Similarly, the stride condition for s j, Cj(I ) is given by

T diag(1 ... 1 1 ... 1) T I diag(1 ... 1 1 ... 1)(v1 1 1 ... vm m 1) − − · · ≥ − − , − , − j 1 j 1 − − (3.38) For illustration|{z} purposes, we show the derivation|{z} of path strides and stride conditions using the running Example 3.2.1 as follows:

Example 3.2.2 For the iteration space and the loop matrix given in Example 3.2.1, using Theorem (3.2.2) and (3.2.3) gives

1 2 3 0 0 9 9 0 1 2 0 det(R) 1 27 1 1 T = σ GR− = 1 − 0 3 0 − 0 = 1 1 0 det(G) − · 9   9 9   −  0 0 1 0 0 1 0 0 3       V  1    R− | {z }| {z }

86 3.2. Controller Generation

r1

j2 2 6 10

1 5 9 13 17 21 T j1 0 4 8 12 16 20 24 28 32 1 T − 11 15 19 23 27 31

′ j2 22 26 30

r2

′ j1

Figure 3.11.: (a) Iteration space of original tile (b) Transformed orthogonal domain, where iteration point are represented by black points.

1 2 0 3 1 0 H = HNF 1 1 0 = 0 1 0  −    0 0 3 0 0 3      T    It may be noted that the first, the| second,{z and the} third column of the Hermite normal 0 0 0 form are equivalent to offset in direction i1, i2, and i3 of transformed orthogonal domain. After using Theorem (3.2.3), we get the path stride matrix for the rectangular domain as

3 1 8 3 0 1 + 8 3 3 8 8 − 3 − − 3 − − S = 0 1 0 8 1 = 0 1 8    − 1      −  0 0 3 0 0 3           After transforming back to the original domain, we get the path stride matrix as

1 2 3 3 0 3 8 8 1 2 8 0 1 1 − − − − S = − 0 0 1 8 = 1 3 0  3 3  −   −  0 0 1 0 0 3 0 0 1  3      1  S    T − Each column of| the path{z stride} matrix| denotes{z a} corresponding stride. The stride conditions are obtained for s1, s2, s3 by applying Equation (3.38). Therefore, the counter can constructed as given in Example 3.2.1.

87 3. Accelerator Generation: Loop Transformations and Back End

The counter generation methodology is summarized in Algorithm 3.2. It requires as input all the loop matrices R1,R2,...,Rn+1 of the tiled hierarchically tiled itera- tion spaces 1, 2,..., n+1 which are to be executed sequentially. The novelty of the methodologyI forI counterI generation proposed in this section is that it encompasses all possible tiling techniques (i.e., LPGS, LSGP, copartitioning and other hierarchical tiling methods) for congruent parallelepiped tiles. It must be noted that the rect- angular tiles are a special case of our methodology. It must also be noted that the methodology is valid only if there exists a linear affine schedule.

Algorithm 3.2 Algorithm for scan counter generation Require: Loop matrix, R1,R2,...,Rn+1 Ensure: Scan counter according to Equation (3.28) 1: for all Loop matrix, R1,R2,...,Rn+1 do 2: Determine transformation matrix, T using Equation (3.32) 3: Determine Hermite normal form of T, H0 = HNF(T) 4: Determine strides ~s0 as columns of S0 in the transformed space using Equa- tion (3.33) 0 5: for all stride,~si do 6: Determine stride condition using Equation (3.38) 7: end for 8: Transform the stride matrix and stride condition back to the original space using Eqs. (3.37) and (3.38) 9: Determine the next function as given in Equation (3.29) 10: end for 11: Construct scan counter as in Equation (3.28) 12: Update the Scan Counter every II cycles, where II is the iteration interval

3.2.1.2. Determination of Processor Element Type The main aim of the determination of PE types is to

• Separate predicates which can be executed by a global controller from those which must be locally computed.

• Generate specialized hardware implementations for each individual processor type, which contain only the required functionality to execute all equations defined for the respective processor element type. For example, in a systolic array, only the border processors are supposed to communicate with external memory.

Without the determination of different PE types, the methodology would implement the local control model, i.e., all predicates would be computed in a local controller

88 3.2. Controller Generation inside each PE. The following strategy is used for the separation of predicates to be implemented in a local or a global controller. The iteration dependent conditions based on processor index p are of the form as in Equation ( 3.24 on page 76). They are decoded by a local controller in each PE, if A2 = 0. Control conditions of types 6 as in Equation (3.23) and in Equation (3.24) (only if A2 = 0) are decoded in a global controller. However, processor regions associated with global control signals have to be iden- tified. The first step of processor type classification is to calculate for each equation Si of the DPLA and its corresponding global condition I , the set of processors ICi (I) S , which will execute this equation at least once. That is P i

S = p : I I , p = Q I + q (3.39) P i ∈ P ∈ I ∩ ICi (I) · n o The problem of identifying the processor regions is equivalent to finding a non- intersecting set of polyhedra associated with the processor space of the corresponding Sk statement. The different processor types i, where = i, i j = 0/ : i = j P P i=1 P P ∩ P 6 can be obtained by finding intersection sets of the S . The quadratic complexity of P i finding intersection sets using pairwise comparison of all Si is in practice feasible for real world examples [93]. P

Example 3.2.3 For the DPLA of the FIR filter and the selected space-time map- ping in Example 3.1.7 on page 72, the accelerator architecture has four proces- sor types depending on the regions, PE1 ( 1 = p : p1 = 0 p2 = 0 ), PE2 P { ∈ P ∧ } ( 2 = p : p1 > 0 p2 = 0 ) , PE3 ( 3 = p : p1 = 0 p2 > 0 ), and P { ∈ P ∧ } P { ∈ P ∧ } PE4 ( 4 = p : p1 > 0 p2 > 0 ), respectively. This is obtained by intersec- tion ofP iteration{ ∈ conditions P which∧ depend} on the processor index.

3.2.1.3. Global and Local Controller Unit A loop program has several iteration dependent conditions due to data-reuse, mem- ory access, and conditional execution. This section deals with the automated control unit generation. The if conditions also known as housekeeping code, describe the conditional execution of the recurrence equations. The conditions are classified into processor independent (type G, i.e., global control) and dependent parts (type L, i.e., local control). The if conditions for copartitioning under type (L) are characterized by a processor index dependent equation (given A2 = 0) as in the following Equa- tion (3.40): 6

˜ ˜ T 3 m if I , = I = (I1 I2 I3) Z · A1 I1 + A2 I2 + A3 I3 b (3.40) ∈ I I { ∈ | · · · ≥ } T The “if” conditional under type (G) is described in the space I = (I1 I3) , explicitly describes those iterative conditions that are independent of processor index p (since

89 3. Accelerator Generation: Loop Transformations and Back End

p = I2 for copartitioning) as shown in Equation (3.41) and can therefore be imple- mented by a global controller which checks for the following iteration condition. ˆ ˆ T 3 m if I , = I = (I1 I2 I3) Z · A1 I1 b1 A2 I2 b2 A3I3 b (3.41) ∈ I I { ∈ | · ≥ ∧ · ≥ ∧ ≥ } The processor independent conditions are evaluated in the global controller. The processor dependent condition need to be evaluated within the local controller of each PE. Therefore, for a DPLA, the iteration dependent conditions can be represented in the DPLA equations as follows:

1 1 1 (...,x j[I d1 1]...) if I ˜ (L) I ˆ (G) F1 − , ∈ I1 ∧ ∈ I1 . . x1[I] =  . .  W1 W1 W1  (...,x j[I dW 1],...) if I ˜ (L) I ˆ (G) F1 − 1, ∈ I1 ∧ ∈ I1 . . . (3.42) . . . 1 1 1 (...,x j[I d1 K],...) if I ˜ (L) I ˆ (G) FK − , ∈ IK ∧ ∈ IK . . xK[I] =  . .  WK WK WK  (...,x j[I dW K],...) if I ˜ (L) I ˆ (G) FK − K , ∈ IK ∧ ∈ IK  T where I = (I1 I2 I3) denotes the loop iteration vector after copartitioning. For each equation, iteration conditions are separated into processor independent and dependent parts on the basis of an AND term and the space-time mapping information. The global iteration conditions are then evaluated within a global decoder using logic consisting of adders, multipliers, and comparators. These signals are then propagated to the local controller of each PE as will be shown in next section. A local controller not only decodes processor dependent iteration conditions but is also responsible for orchestrating the local schedule given by τ(vi), i.e., the start time of all operation in loop body (see Equation (2.10)). So additional logic is required to assure the correct execution behavior. In an iterative schedule, the iteration interval II defines the number of time steps between the start of two subsequent iteration points. Since during every iteration interval, the same sequence of control signals must be generated, the required control functionality can be implemented by a modulo-II counter [16, 93]. Its output is connected to a decoder logic, which generates the control signals for (a) resource sharing of functional units and multiplexers, (b) FUs supporting multiple operations, and (c) a clock enable signal for storing intermediate results in registers. This modulo-II counter could also be implemented globally; however, cost analysis has shown that the required delay registers would be more expensive than the local modulo-II counters. Unlike the memory resources, the size of the control unit is independent of the tiling parameters. The local and global control units are problem size independent, as the number of control variables is independent of the number of iteration points in the iteration space. The propagation of the global control signals and counter variables to the individual PEs is discussed in the next section.

90 3.2. Controller Generation

3.2.1.4. Propagation of Global Control and Counter Signals

The counter variables denoting the iteration vectors I1,...,Ik 1,Ik+1,...,In+1 and the global control signals need to be propagated through the processor− array to the PEs using interconnect delay registers; therefore, all the PEs are connected to the global controller with an appropriate number of delay registers, ∆d. The number of delay registers for PE p is equal to the number of time steps required by the global signals to travel to processor element p, i.e.

∆d = tmin(p) = λpar p

λpar is the part of the schedule vector which corresponds to the iteration space, k mapped onto processors. The signals must be synchronized with the start-time of theI PEs, tmin(p). The start time of the processor array is assumed to be zero, i.e., tmin = 0. Similar to data signals, the control signals are also propagated into the processor array, where each PE receives the global control signal from the neighbouring PE, (p dp), where dp denotes link (interconnect) to neighbouring processor. Thereby, the− regularity of the processor array is maintained. The length of the interconnect delay registers is given by the following equation:

∆dp = λpardp

The control interconnection network is given by the propagation direction of each processor dp. For the processor array, the iteration variables and the global control signals for the PE at origin are directly taken from the counter and global controller, respectively. All the possible propagation vectors dp are chosen as orthogonal inter- T T connection links, i.e., dp = (1 0) ,(0 1) ,... in case of 2-d processor array. { }

Example 3.2.4 The global counter produces the value of sequential iteration vectors I1 = (i1, j1) and I3 = (i3, j3) in Example 3.1.7 on page 72.For the selected space- time mapping, almost all the control conditions are of the type (G) as in Equation (3.23). For example, c1 = ( j1 = 0) and c2 = ( j2 = 0) are to be evaluated as iteration condition c1 c2. c1 is processor independent and c2 is processor dependent as I2 ∧ is mapped onto processor index. Therefore, c1 is evaluated in global controller and propagated to local controller of the PEs for further processing. c2 is true only for PE(0,0) and PE(1,0) and is set to false in other PEs. The local controller then evaluates the condition c1 c2. Almost all iteration conditions for the example are evaluated in the global controller∧ and propagated into the processor array. For the selected space-time mapping in Example 3.1.7 on page 72, the interconnect delay can be calculated as follows:

91 3. Accelerator Generation: Loop Transformations and Back End

1 ∆(1 0)T = 4 3 = 4 0 !   0 ∆(0 1)T = 4 3 = 3 1 !   Therefore, the control signals are propagated with delay of 4 and 3 cycles in p1 and p2 direction, respectively. The key characteristic of our methodology is the use of combined global and lo- cal control facilities. This hybrid version of the control path lies between a com- plete global control path for SIMD architectures and a local control path for multi- processor architectures. This strategy reduces the required control overhead area at the cost of delay registers for propagation and improves the clock frequency of the hardware accelerator [43].

3.2.2. I/O Communication Controller The last few decades have seen a significant increase in memory access in terms of processor cycles. This conventional wisdom is also known as “Memory Wall”. The accelerators also need to address this problem in order to hide the memory latency and provide high bandwidth. Therefore, the streaming workloads must be stored in local buffers, memory banks, or scratch-pad memories close to the accelerators. In addi- tion, controller engines for address generation and synchronization of I/O communi- cation for the local buffers are required. Its automatic generation requires knowledge of the undertaken scheduling, tiling, and allocation strategies. In this section, we present a methodology for the generation of custom memory architectures and their corresponding I/O controllers.

3.2.2.1. Buffer Modeling and Synthesis The problem of buffer modeling is to generate a custom memory architecture for parallel access of data by the hardware accelerator. To this end, memory mapping considers the question, how to distribute I/O arrays or variables to different physical memories constituting the custom memory architecture? In case of hardware accel- erators, the memory mapping is determined by the chosen space-time mapping. As discussed in Section 2.3.2.1, the space-time mapping determines the allocation of the loop iterations to the PEs through the allocation matrix, Q. This allocation is in turn decided by the tiling strategy. Each I/O variable and its dimension are defined with keywords in and out in the DPLA program specified in the PAULA language (see program on page 53).

92 3.2. Controller Generation

Definition 3.2.2 (Processor mapping of I/O variables) The iteration conditions, I ∈ C corresponding to I/O variables, xi in the loop program define the index space of I i the I/O variables. The corresponding processor space for the I/O variable, xi can be obtained by the following mapping f :I P. →

i = p p = Q I I C (3.43) P { ∈ P| · ∧ ∈ I i } where is the processor space of the hardware accelerator. P The task of memory mapping is to generate memory modules for the parallel access of array elements to the corresponding processor elements.

Definition 3.2.3 (Memory mapping of the I/O variables) is defined as memory mod- ules, which feed contained I/O data elements to the processor elements on which they are mapped. The memory space of I/O variable, xi is given by the following mapping g : I M → n i = m Z m = Q I I Ci (3.44) M { ∈ | · ∧ ∈ I } where for each element m in the memory space of I/O variable i, i, a memory connected to corresponding processor p is generated. M The reason for the same mapping matrix is that in our model, I/O variables with same chosen matrix Q are mapped onto processor and memory which requires minimum communication.

Example 3.2.5 Consider the FIR filter in Example 3.1.7 with the given space-time mapping. Using Equation (3.44), one can determine the memory space of variable U, A, and Y. The memory space of variable, U is U = 0 i2 1 j2 = 0 . Therefore, there are two memory banks containing valuesM of{ U≤ and≤ are∧ connected} to PE(0,0) and PE(1,0), respectively. Similarly, there are two memory banks for variable A as can also be seen in Figure (3.7). The approximate heuristic for setting the buffer sizes uses the value of the number of iterations corresponding to I/O variable for a single processor element. Therefore, given the maximum number of iterations of I/O variable xi mapped on a PE requiring the I/O variable, is the approximated size of the buffer. In the next subsection, we discuss address generation, FIFO control, and synchro- nization mechanisms. The memory modules can be generated in two different modes: FIFO and addressable memory. In case of fine granular HW-HW communication be- tween accelerators, the memory can be synthesized or configured in FIFO mode, as it is viable for fine granular communication between accelerators. The HW-SW communications between accelerator and processors can take place over different communication alternatives like buses and others. However, different rates of data

93 3. Accelerator Generation: Loop Transformations and Back End production and the control overhead for the exchange of status signals for each data- transfer would be a significant bottleneck. This recommends applying a burst model of data transfer. The configuration of memory as addressable memory (RAM) is apt for such a scenario.

3.2.2.2. I/O Controller Synthesis In the previous section, the methodology for buffer generation and for parallel access of I/O data was presented. In this section, we generate the signals for accessing the buffers according to a chosen schedule. In case of FIFO mode, these signals are read/write (R/W) enable and valid signals. For the RAM interface, the address values are needed in addition to the R/W enable signals. It must be noted that in case of FIFO mode, the data is stored in memory according to the schedule determining the consumption of data elements. In the next chapter, a communication primitive called multi-dimensional FIFO is presented which reorders the data elements in case of HW-HW communication according to given schedule. All the information pertaining to the generation of memory control signals is ob- tained from the statements in the given loop specification, which use/define the I/O variables. The control signals are generated with help of output Boolean signals eval- uated according to the I/O variables iteration condition. Whereas, the address gener- ation unit depends on the affine access function of the I/O variables. The evaluation of an address requires the values of the loop counter variables, which are provided by the global counter. The evaluation of the iteration condition corresponding to the I/O statement takes place in the global controller, which subsequently propagates the read/write enable signals to the I/O controller. The I/O controller generates the necessary signals for the local buffer. The scheme of the I/O controller is shown in Fig. 3.12(a). The mutual components of the I/O controller for both FIFO and addressable memory (RAM) are • Modulo-II counter: Since an I/O access takes place every II cycles. • Hold signal: In case the counter is disabled, the I/O access is not allowed. This situation arises in case of non-tight schedules, where a new iteration is not executed every II cycles. • Global control signals such as conditions from the decoder and the counter enable signals are evaluated together for determining valid I/O access. The counter enable indicates whether the values of the loop counters are still in the iteration space. I/O access will be forbidden if this condition is not satisfied. The logical conjunction of the common signals is used for enabling or disabling the I/O controller. The left and right lower component on different sides in Fig- ure 3.12(a), uses a common signal and other inputs to generate signals for the FIFO and the addressable memory mode, respectively.

94 3.2. Controller Generation

(a) (b) I/O Controller valid(A) Modulo-II AND AND CE Counter conditionals FIFO Enable AND read controller (A) enable enable hold empty FIFO(A) PE(0) FIFO controller Address generator status flags FIFO CAST State check Loop read logic counter enable

AND empty ADD FIFO(A) PE(1)

Control Signals: valid re/we address re/we Data Signals:

Figure 3.12.: (a) Detailed component view of the I/O controller for both FIFO and RAM mode. (b) The valid signal for the computation kernel. The short lines perpendicular to the interconnection signals denote delay registers.

The valid signal from the FIFO state check logic shown in Fig. 3.12, is responsible for identifying the state of the computation kernel. That is, if any FIFO of an I/O variable reaches the empty or the full state, the FIFO cannot be read or written any more. All valid signals guarantee that the functional units inside the computation kernel can execute further even though some input/output FIFOs are empty/full, if the I/O access to the FIFOs is not required. Only when the I/O controller generates the valid signal and the common signal is true, the read or write enable signal is generated. In case of the RAM interface, the enable signal is generated by the common com- ponent containing the modulo-II counter as shown in Figure 3.12. All loop counter signals which correspond to the value of the iteration variables are used in the address generation unit. The cast component is used, if the data type of the counter value does not has the same bit-width. The address of the input and the output data corresponds to its iteration position in the polytope. Each input and output variable can be mapped onto multiple memory banks. The schedule order of a variable access is same for all memory banks. Hence, only a single controller or address generation unit for each variable is needed and not for each memory bank corresponding to the variable. The memory signals are propa- gated with delays to the memory banks. These delays correspond to the offset of the start-time of the neighbouring memory modules, which belong to the same variable. The length of the delay register is defined by the equation, td = min λ (I1 I2) , { · − } where I1 and I2 denote the iteration vectors accessing the neighboring memory banks of the same I/O variable. This optimization avoids requiring a dedicated I/O con- troller for each memory bank. We use our example of the bilateral filter to illustrate


[Figure 3.13; the bilateral filter loop program shown in the figure:]

variable A in 2 integer <16>
variable U in 2 integer <16>
variable Y out 2 integer <16>
FORALL (i >= 0 and i <= M-1)
 {FORALL (j >= 0 and j <= M-1)
  {FORALL (m >= 0 and m <= N-1)
   { a[i,j,m] = A[0,j,m]*LUT(u[i,j,m]-u[i,j,0])
     IF (m==0) THEN
     { u[i,j,m] = U[i,j]              // Read input
       z[i,j,m] = a[i,j,m] * u[i,j,m]
       s[i,j,m] = 0 + a[i,j,m]
       y[i,j,m] = 0 + z[i,j,m]
     }
     ELSE
     { z[i,j,m] = a[i,j,m] * u[i,j,m]
       s[i,j,m] = s[i,j,m-1] + a[i,j,m]
       y[i,j,m] = y[i,j,m-1] + z[i,j,m]
       IF (i==0) THEN
         u[i,j,m] = 0
       ELSE
         u[i,j,m] = u[i-1,j,m-1]      // data reuse
     }
     IF (m == N-1)
       Y[i,j] = y[i,j,m]/s[i,j,m]     // Output
   }
  }
 }

Figure 3.13.: (a) Data flow graph with numbers indicating the start times of the iterations. (b) Corresponding accelerator with processor array, memory, global and I/O controller subsystem.

Example 3.2.6 The corresponding data flow graph of the bilateral filter is shown in Figure 3.13(a) with M = 8 and N = 4. The variables A, U, and Y denote the input mask coefficients, the input image pixels, and the output image pixels, respectively. The iteration conditions determine the conditional execution of statements, including the input and output assignment statements. The tiles represent the tiling of the iteration space, which leads to the allocation of two corresponding processors. Each processor executes the points within its tile sequentially. Formally, this introduces extra loop dimensions m1 and m2 instead of m, similar to loop blocking [158]. Due to the selected tiling strategy, the iteration variables m2, i, and j are executed sequentially, whereas iterations with the same m1 are executed on the same processor, whose processor index is given by the value of m1 (i.e., p = m1). The scheduling for the data flow graph in Figure 3.13 gives λ = (4 32 3 2) for the iteration vector I = (i j m1 m2). The obtained iteration interval is II = 2, which can be explained by the availability of only a single functional unit for the critical operation. For this example, the input and output variables U and Y are associated with the iteration conditions m == 0 and m == N−1, respectively. This corresponds to m1 == 0 ∧ m2 == 0 and m1 == 1 ∧ m2 == 1 for N = 4 in the tiled program. Hence, the input variable U and the output variable Y are each mapped to a single memory bank, connected to PE(0) and PE(1), respectively (p = m1), corresponding to the processor-dependent part of the conditional, i.e., m1 == 0 and m1 == 1. The connections of the memory banks to the PEs are also shown in Figure 3.13(b). The global decoder generates the control signals corresponding to m == 0 and m == N−1 for the input and output FIFO controller, respectively (i.e., m2 == 0 and m2 == 1, respectively). The conditional is true every 4th cycle. In case of a FIFO interface, as shown in Figure 3.13(b), variable A is mapped onto PE(0) and PE(1). The read enable for each FIFO and its

empty flag are required to feed back its state to the I/O controller for A. All these signals, along with the clock enable (CE) from outside, are evaluated to determine the clock enable signal (CE) of the computation kernel. The loop counter variables m2, i, and j are used for the address generation for the memories U and Y. The execution order of the processor array is m2, i, j, which is defined by the schedule vector λ, where p = m1 and m2, i, j are executed sequentially. The address of variable U is given by m2 + (N/2 − 1) · i + (N/2 − 1) · M · j.

The RTL functionality is verified by an auto-generated VHDL testbench with input and output data created by a functional simulation of the accelerator. The accelerators with I/O interface can exist independently on an FPGA or be coupled to a processor in an SoC. The methodology for SoC integration will be presented in Chapter 4.
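As a small illustration, the affine address computation for variable U can be written as in the following sketch; it assumes the address expression as reconstructed above, and the function name addr_U is hypothetical.

/* Affine address generation for variable U of the bilateral filter
 * example, assuming addr = m2 + (N/2 - 1)*i + (N/2 - 1)*M*j.
 * For M = 8, N = 4 and m2 = 0, this yields addr = i + 8*j, i.e.,
 * the 64 addresses of the 8x8 image in column-major order. */
static inline int addr_U(int i, int j, int m2, int M, int N)
{
    const int si = N / 2 - 1;        /* stride along i */
    const int sj = (N / 2 - 1) * M;  /* stride along j */
    return m2 + si * i + sj * j;
}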

3.3. Results

In this section, we show the benefits of our methodology in terms of automated loop accelerator generation for a wide class of algorithms. Our design tool has also been used to quantify the effect of loop tiling and the requisite iteration interval on control overhead in terms of logic area, power, and performance.

3.3.1. Embedded Computation Motifs

In [7], underlying communication and computation patterns common to many applications and standard benchmarks were identified. These numerical patterns were classified into 13 groups called dwarfs. We are interested in loop accelerators for systems-on-chip. Hence, from the embedded computing standpoint, the following classes of patterns have been selected as benchmarks:

• Dense Matrix Algorithm: complex matrix-matrix multiplication

• Structured Grid Algorithm: edge detection, bilateral filter

• Spectral Algorithm: discrete cosine transform

• Dynamic Programming Algorithm: Smith/Waterman algorithm

• Combinational Algorithm: cyclic redundancy check

The dense matrix algorithms from linear algebra, like matrix-matrix multiplication, matrix-vector multiplication, QR decomposition, singular value decomposition, and others, are very often found in signal and image processing systems. These algorithms use multiple matrices as data structures and are characterized by regular data accesses. The complex matrix-matrix multiplication (N = 64), i.e., a matrix-matrix multiplication whose matrix elements are complex numbers, is used to quantify the performance, area, and power overhead for this class of algorithms.



Figure 3.14.: Area and power consumption of the loop accelerators for the different loop benchmarks.

The structured-grid algorithms, like edge detection, bilateral filters, image correlation, and other image processing algorithms, have a regular grid as data structure, where each point is updated by a computation characterized by spatial locality, that is, the computation depends on the values of neighboring grid points. An algorithm may also sample only a subset of the grid, as in downsampling. The image processing algorithms discrete wavelet transform (DWT) and downsampling (DS), and, as a signal processing algorithm, a 5-tap FIR filter, are considered as benchmarks. Images of dimension 512 × 512 are used. Furthermore, the algorithms are characterized by complex border processing schemes.

Spectral algorithms like DCT, FFT, IDCT, and IFFT convert data from the spatial domain to the frequency domain, or vice versa. The discrete cosine transform (DCT) is a well-known signal processing algorithm used in compression and can be computed as a two-stage matrix multiplication X = C^T · x · C. However, the unique pattern of the constant elements in the matrix C allows saving several operations. Furthermore, the separability of the 2-D transform into two 1-D DCTs, a horizontal DCT (hDCT) and a vertical DCT (vDCT), reduces the number of operations. Therefore, we consider an hDCT for an 8 × 8 image block as loop benchmark, as it is used in compression schemes like Motion JPEG [146].

Dynamic programming is a method of solving complex problems by breaking them down into simpler steps. The Smith/Waterman (SW) algorithm is used in bioinformatics for sequence alignment, which determines similar regions between two DNA chromosomes or protein sequences. The major part of the total calculation time is taken by the similarity matrix score calculation [3]. Therefore, this loop is selected as the benchmark for this class of algorithms.


Algorithm      CMM     DWT    DS     FIR    DCT    SW     CRC32
Speed-up       215     50     30     17     12     27     3
PAULA (LoC)    115     121    68     58     121    67     79
VHDL (LoC)     26205   6904   2368   8124   9122   9150   2172
Prod. gain     227     57     35     140    75     136    27

Table 3.4.: Speed-up and productivity gain in terms of lines of code of our design methodology.

Combinational logic algorithms exploit bit-level parallelism by performing operations on multiple data. Examples are the DES (Data Encryption Standard) and cyclic redundancy check (CRC) algorithms. The cyclic redundancy check, or CRC, is a technique based on polynomial arithmetic for detecting errors in digital data. As benchmark for this class of algorithms, we use CRC-32, which works on 32-bit messages [179]. The PARO programs of these benchmarks are shown in Appendix C.

Two other motifs often seen in embedded computing are graph traversal and finite state machine (FSM) algorithms. Graph traversal algorithms, like collision detection, quicksort, and others, have little computation and many levels of indirection when visiting the nodes of a graph. The traditional data dependence analysis for our class of algorithms is not able to model indirections through pointers. Furthermore, FSM-like algorithms, like Huffman decoding, are sequential in nature and therefore not of interest.

The accelerator configuration of the complex matrix-matrix multiplication (CMM) is a 4 × 4 PE array performing complex number multiplications, with a local memory for data reuse of the matrices. The DWT performs a 5/3 lossy compression on 16-bit images of dimension 512 × 512; the accelerator has an iteration interval of 1 and enough functional units in a single PE to perform 15 multiply-add and shift operations in parallel. The downsampler accelerator outputs a single pixel for each 2 × 2 image window and has an iteration interval of 1. The DCT accelerator contains several addition, subtraction, and multiply units in a single PE, which process 8 pixels with an iteration interval of 1. The accelerator for the Smith-Waterman algorithm is a 1 × 8 processor array with identical cascaded PEs performing arithmetic and max operations. The CRC-32 accelerator processes each 32-bit message in 8 cycles.

The area and power requirements of the corresponding accelerator implementations of all the above algorithms on an FPGA are shown in Figure 3.14. All synthesis results are obtained using Xilinx ISE 9.2 on the Xilinx Virtex-2 FPGA (xc2v8000-4-ff1517). Microblaze is a soft-core processor for SoCs realized on Xilinx FPGAs. The area and dynamic power of the Microblaze clocked at 50 MHz are 4062 slices and 1480 mW, respectively [187].


Algorithm      MMM    BPF    IDCT (II=16)   SOBEL
Overhead (%)   12.5   16.2   46.5           8.2

Table 3.6.: Controller overhead in percent of the total design area (LUTs) for different algorithms.

A quite optimistic reference performance of 30 MOPS (million operations per second) is assumed for the Microblaze implementations. The area and power of the Microblaze are used as baseline reference for comparing loop accelerators to processors. At the cost of flexibility, the accelerators have an average advantage of 2.5x and 4.5x in terms of area and power consumption, respectively, over the soft processor cores for FPGA technology. Considering peak performance, the accelerators have an advantage of almost 50x (average gain over all benchmarks). For the selected mapping configurations, the gains for the individual algorithms are 215, 50, 30, 17, 12, 27, and 3 times, respectively, as also shown in Table 3.4. The configuration reflects the standard degree of parallelism chosen for a hardware implementation of these loop algorithms. The theoretical peak performance assumes that the data is available for computation.

An important aspect is the productivity gain of using our methodology, as also shown in Table 3.4. From an average software description with about 100 lines of code (LoC), an average of about 10000 LoC in VHDL describing the hardware architecture of the accelerator is obtained. This improvement of roughly 100x in productivity shows the importance of using a high-level design methodology for realizing the potential of hardware accelerators.

3.3.2. Impact of Compiler Transformations on Controller Overhead

The compiler transformation of loop tiling determines not only the processor array dimensions, but also the local memory requirements within each processor element. In [91], we have shown that area and power consumption grow linearly and the clock rate stays almost constant with a linear increase in the number of PEs of the hardware accelerator, which in turn depends on the tiling parameters. However, the effect of tiling on the overhead of the control path is not well studied. Furthermore, one can specify the resource allocation for each PE or the required iteration interval (II). Higher iteration intervals correspond to lower throughput, which also implies a lower resource allocation for each PE. However, in [56], we have shown that a higher II does not lead to a linear decrease in area, due to control overhead.

Therefore, our design tool has been used to quantify the effect of loop tiling and a requisite iteration interval on the control overhead in terms of logic area, power, and performance for the following benchmark algorithms:

• MMM: matrix-matrix multiplication

• BPF: band-pass filter

• IDCT: inverse discrete cosine transform

• SOBEL: image edge detection algorithm



Figure 3.15.: Control overhead in percent of the total design area (LUTs) for the BPF for different tile sizes; the determinant of the tiling matrix gives the number of PEs on the x-axis. Control overhead in percent of the total design area (LUTs) for the IDCT for different throughput rates.

The accelerator configuration is a 4 × 4, 3 × 3, and 1 × 1 processor array for (MMM, BPF), SOBEL, and IDCT, respectively. The hierarchical analysis of the designs was done using the Xilinx floorplanner. As a result, the fraction of the control overhead in terms of logic area for these algorithms lies between 10% and 50%, as shown in Table 3.6.

The upper graph in Figure 3.15 shows the effect of the tiling parameters on the control area overhead for the BPF algorithm. It can be seen that with tiling matrices corresponding to a larger number of PEs, the control area overhead converges to a constant fraction. In this case, the controller area overhead settles at around 15% of the total area of the accelerator.


Figure 3.16.: Clock frequency and power efficiency of the matrix-matrix multiplication.

The overhead converges for larger accelerators because the cost of the global and I/O controller is constant, whereas the total size of the local control scales linearly with the number of PEs. Hence, for tilings leading to a larger number of PEs, the control overhead approaches a constant fraction.

The IDCT accelerator was also synthesized for different throughput requirements. It was observed that a higher II is associated with a larger controller cost, because the number of states of the controller for the local schedule is proportional to the II. At the same time, a higher II is associated with lower resource requirements in terms of functional units like multipliers and adders. Therefore, the control overhead as a fraction of the total design increases, as shown by the lower graph in Figure 3.15. For higher throughput, i.e., a lower II, the number of controller states is smaller, whereas a larger number of functional units is required. The controller cost in terms of area can take up to 50% of the IDCT accelerator design, as shown in Table 3.6 and Figure 3.15.

The effect of the tiling parameters on the achievable clock frequency is shown in Figure 3.16. The critical path passes through the global and the local controller. Interestingly, the maximal clock frequency does not decrease with an increasing number of processor elements, due to the pipelining of the control signals along with the interconnect delays. The clock frequency for the different designs varies between 100 and 120 MHz. For the estimation of the dynamic power for FPGA technology, Xilinx XPower [188] was used in combination with the post-place & route simulation models of the accelerator designs. Determining the average power consumed by the accelerator additionally requires a ModelSim simulation of the generated accelerator RTL for a set of inputs. The obtained average power is multiplied with the latency to obtain the total energy. For example, an energy efficiency of 20-25 MOPS/mW (considering only the dynamic power) is obtained for the matrix multiplication example, as shown in the right graph in Figure 3.16.
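Written out, the energy and efficiency figures are obtained as:

$$E = \bar{P}_{\mathrm{dyn}} \cdot t_{\mathrm{lat}}, \qquad \text{power efficiency} = \frac{\text{performance}}{\bar{P}_{\mathrm{dyn}}}\ \left[\mathrm{MOPS/mW}\right],$$

so the stated 20-25 MOPS/mW is equivalent to 20-25 operations delivered per nJ of dynamic energy.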


Figure 3.17.: Energy-delay product (EDP) map for the matrix-matrix multiplication benchmark.

This implies that the power efficiency does not decline with larger processor arrays or a larger controller overhead.

It has been shown that the controller has a substantial impact on the logic area, power, and performance, depending on the chosen tiling, throughput, and resource allocation. For tiling, the control overhead converges to a constant fraction for larger processor arrays, whereas a higher iteration interval leads to a larger control overhead. Therefore, finding an optimal controller configuration requires finding an optimal accelerator configuration (i.e., tiling and II). In order to find these parameters, we selected the energy-delay product (EDP) as a neutral metric [75]. The EDP is a good metric, since its minimization rewards exactly those architectural and compiler decisions that contribute most to energy and area efficiency. In Figure 3.17, the product of the normalized energy (pJ/op) and the normalized delay (ns/op) for different processor array configurations, again for the matrix multiplication (MMM) case study, is shown. The different tiling parameters lead to processor arrays with 2 × 1 to 8 × 7 PEs. It can be observed that for smaller accelerators, the metric is quite large due to lower power efficiency. The optimal value is achieved for processor arrays of size 6 × 6 and 8 × 6. Larger processor array accelerators intrinsically minimize the metric because of better energy efficiency and clock speeds.
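In symbols, with E the total energy and t the execution time for a fixed number of operations (and assuming, as the EDP name suggests, that the throughput axis enters as its reciprocal, i.e., as a delay per operation), the metric plotted in Figure 3.17 is:

$$\mathrm{EDP} = \frac{E}{\#\mathrm{ops}} \cdot \frac{t}{\#\mathrm{ops}} \qquad \left[\mathrm{pJ/op} \cdot \mathrm{ns/op}\right],$$

and the accelerator configuration (tiling, II) with minimal EDP is selected.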


3.4. Conclusion

Programmable or specific acceleration engines provide a viable architecture for high performance embedded computing algorithms. Loop transformations like tiling are important for realizing an optimal accelerator implementation. In this chapter, we presented an automated source-to-source transformation called hierarchical tiling, which can match or specify all requirements of an accelerator, such as the number of processor elements, the local memory, and the I/O bandwidth.

In this chapter, we also presented a unified methodology for the generation of controllers. The generated controllers are not only responsible for synchronizing the computations according to a local and a global schedule, but also for the data transfer from parallel memory banks in the accelerator. The methodology also entails the generation of a custom memory architecture for parallel data transfers to a processor array such that I/O is not a bottleneck for the throughput. The synthesis results show a reasonable controller overhead in terms of area and power, depending on the tiling and throughput parameters: the larger the iteration interval, the higher the control overhead, which can reach up to 50%. Several realistic loop benchmarks have also been synthesized as accelerators, illustrating the orders of magnitude gained in terms of area, power, and performance.

The accelerators can be embedded into an SoC by coupling them to a processor or another IP in a multi-accelerator system. The communication between accelerators needs to be supported by special communication structures. In the next chapter, we discuss the automatic generation of dedicated communication subsystems.

4. Accelerator Subsystem for Streaming Applications: Synthesis and System Integration

Heterogeneous SoC architectures are ubiquitous in consumer devices [185]. These architectures are characterized by the presence of multiple processors along with domain-specific acceleration engines. Such heterogeneous platforms are finding increasing acceptance in high performance embedded computing. The computationally intensive parts of a program are usually loop nests, which contain inherent parallelism. Therefore, a significant amount of research has been done on accelerating such loop nests on dedicated hardware accelerators or programmable domain-specific processors. However, embedded streaming applications are characterized by the presence of multiple communicating loop nests (see Figure 4.1), where the data streams through loop kernels arranged in a certain graph topology. This task level parallelism of communicating loop nests can be exploited by the use of fixed-function accelerator pipelines.

There are several challenges in realizing a streaming application as an accelerator subsystem. The first problem is how to generate an optimal communication subsystem linking the dedicated loop accelerators in a pipeline. The coarse-grained data movements must be overlapped with computation. Streaming applications operate on large amounts of data with limited lifetime. The memory access patterns are known due to the static scheduling of the accelerators. Therefore, the communication subsystem should allow deterministic parallel data access.

Furthermore, these acceleration engines must be coupled with software, offering features and flexibility (hardware/software co-design). Therefore, the second major problem is how to integrate the accelerators in an SoC. There are several interconnect alternatives, ranging from point-to-point connections, over buses, to networks-on-chip (NoCs). Hence, generic interconnect glue logic and a driver program for accessing the accelerator should be generated.

Finally, an integrated design flow for the correct-by-construction synthesis of a streaming application needs to be developed. This eases verification, integration across the processor/accelerator boundary, and smooth design refinement.

We address the above problems in the context of this dissertation. The major contributions of this chapter are:



Figure 4.1.: (a) Stereo depth extraction, (b) Motion JPEG application, (c) matrix-matrix multiplication D = B × (A × C).

1. A novel intermediate dependence graph called mapped loop graph is introduced for the representation of communicating loops and their mapping information, like allocation and scheduling, in the polyhedral model.

2. A methodology for the projection of the mapped loop graph in the polyhedral model onto a data flow model of computation called windowed synchronous data flow (WSDF) is developed.

3. Methods for the generation of an efficient communication primitive for data transfer and synchronization between the loop accelerators, leveraging the windowed synchronous data flow model, are presented. Also, the automated generation of accelerator memory maps, device drivers, and glue logic for the memory-mapped integration of the accelerators into an SoC is supported.

4. A design flow methodology for the correct-by-construction synthesis of accelerator pipelines and their subsequent integration in an SoC is presented and validated with the help of real-world benchmarks.

This chapter is organized as follows: In Section 4.1, a detailed description of graph models for the specification of communicating loop applications, accelerator-based platforms, and synthesis is given. A methodology for the automated synthesis of FIFO channels connecting the accelerators is described in Section 4.2. The section also contains a novel method for the conversion of communicating loop descriptions in the polyhedral model into the WSDF model of computation.



Figure 4.2.: The partitioning problem defines the mapping of loop kernels onto accelerators or processors, forming the SoC system.

The IP subsystem or the individual accelerators need to be integrated in the SoC platform. Section 4.3.1 discusses the automatic generation of memory maps, software drivers, and hardware wrappers for the memory-mapped integration of such accelerator subsystems in an SoC. Subsequently, Section 4.3.2 presents the design trajectory for the generation of an accelerator-based SoC instance from a high-level application, platform, and mapping description. Finally, the conclusion closes the chapter.

4.1. Communicating Loop Model

Applications characterized by the presence of communicating loop nests can be represented using dependence graphs. Similarly, graph models can also describe SoC architectures and the mapping of communicating loop nests onto SoC architectures. In this section, we present graph models for representing applications, architectures, and mappings.

In the graph representation of such applications, the nodes are the computation kernels and the edges denote the communication of the array data. This paradigm is also called the stream programming model. It is not only intuitive for the programmers, but also exposes parallelism and communication to the compiler and the underlying architecture. The SoC architecture model consists of several components, such as general-purpose processors, accelerators, DMA engines, and memory. The term hardware/software partitioning denotes the decision process of mapping computation kernels onto SW (CPU) or HW (accelerator) resources. Figure 4.2 shows an example application graph, its partitioning, and its mapping onto an SoC architecture. The aim of this section is the modelling of streaming applications, accelerator-based SoCs, and the mapping, which defines the execution.


4.1.1. Loop Graph

The reduced dependence graph (RDG) was defined (cf. Def. 2.1.4) for modeling loop- and operation-level parallelism of a single loop nest. Now, we define the loop graph to model the task level parallelism of communicating loop nests.

Definition 4.1.1 (Loop Graph) A loop graph G is a directed acyclic graph defined by a pair G = ⟨V, E⟩, where each vertex v_i ∈ V denotes a loop kernel and each edge e_{i,j} ∈ E denotes a multi-dimensional data dependency between the loop kernels v_i and v_j. Each node has a pair ⟨I_k, RDG_k⟩ associated with it, where k is the identifier of the loop program. Each edge also has a tuple ⟨var, I_i^var, I_j^var, D⟩, where I_i^var and I_j^var are the source and sink iteration spaces of the transported multi-dimensional I/O variable var, and D is a partial function representing the dependence between the source and sink array variable var.

Each vertex v ∈ V of the loop graph G = ⟨V, E⟩ represents a process (also called actor), whose functionality is described by a nested loop program. Each actor can be classified as one of the following types:

• Source: The variables, signals, images, or other input data consumed by the application are characterized by a source node.

• Normal: Nodes of type normal denote the processing kernels of the streaming application.

• Constant: For each constant data array in the application, there exists a node of type constant.

• Split and Merge: In order to characterize the splitting or the merging of the data flow to/from different branches, this type of node is used.

• Sink: The sink nodes receive the output data of the application.

The description of the functionality of the nodes is given by the associated pair ⟨I, RDG⟩, where I is the iteration space of the loop program. In case of a hierarchically partitioned loop¹, I = I_1 ⊕ I_2 ⊕ ... ⊕ I_n, where I_1 is the iteration space of the innermost tile and I_n is the iteration space of the outermost tile. RDG is the reduced dependence graph (see Definition 2.1.4), which contains information on the iteration domain and the access function of each variable of the loop kernel.

The edge e_{i,j} of the loop graph is defined by the multi-dimensional array variable var, which is produced and consumed by loop i and loop j, respectively.

¹Well-known partitioning techniques are multi-projection, LSGP (local sequential global parallel, often referred to as blocking or outer loop parallelization), and LPGS (local parallel global sequential, also referred to as cyclic(1) mapping or inner loop parallelization).
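As a data structure, such a loop graph could be held in memory roughly as follows; this is an illustrative sketch, not PARO's actual internal representation, and all type and field names are made up.

/* Illustrative in-memory representation of a loop graph. */
typedef enum { SRC, NORMAL, CONSTANT, SPLIT, MERGE, SINK } node_type_t;

typedef struct {
    node_type_t type;
    void *iteration_space;  /* I_k: possibly hierarchically tiled     */
    void *rdg;              /* reduced dependence graph of the kernel */
} loop_node_t;

typedef struct {
    int src, snk;           /* indices of source and sink nodes       */
    const char *var;        /* transported multi-dimensional variable */
    void *i_src, *i_snk;    /* source and sink data spaces of var     */
    void *redistribution;   /* partial function D between the spaces  */
} loop_edge_t;

typedef struct {
    loop_node_t *nodes; int n_nodes;
    loop_edge_t *edges; int n_edges;
} loop_graph_t;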


for(i=0;i<2;i++)               for(i1=0;i1<3;i1++)
  for(j=0;j<6;j++){              for(j1=0;j1<4;j1++){
    A[i,j]=...                     A[i1,j1]=...
  }                              }

D : (6 1) (i j)^T = (4 1) (i1 j1)^T

Figure 4.3.: Loop graph representation of two communicating for loops.

I_i^var and I_j^var define the source and sink iteration spaces of the transported variable and are also known as the source and sink data spaces, respectively. They can be derived from the RDGs of the source and sink loops. Intuitively, the iteration domain of each variable can be interpreted as a multi-dimensional array. The parallel computations of the source accelerator produce array data with a distribution that may not match the distribution required by the sink accelerator. Therefore, the links of the loop graph must indicate not only the array distributions produced and consumed by the source and sink accelerator, but also the dependency between the distributions. The dependency between the source iteration vector I_src ∈ I_i^var and the sink iteration vector I_snk ∈ I_j^var is given by a redistribution equation D, which is defined as follows:

Definition 4.1.2 (Redistribution equation) The redistribution equation between a source loop src and a sink loop snk is defined by the polyhedron

A · I_src − B · I_snk = 0,  I_src ∈ I_i^var ∧ I_snk ∈ I_j^var,

where A and B are matrices denoting the affine dependency mapping function between the source and the sink iteration vector. The redistribution equation defines an affine dependence relation for the I/O variable var transported over the edge connecting the source and the sink loop program. Intuitively, it can be interpreted as an array write and read access of the source and the sink loop kernel, which refer to the same memory location (i.e., A I_src = B I_snk = I). The following simple loop graph example is given to facilitate the understanding.

Example 4.1.1 Figure 4.3 shows two communicating for loops, which are obtained by partitioning the iteration domain I = {(i) | 0 ≤ i ≤ 11} with different tiles of size 6 and 4. The resulting loop graph shows the loop programs in its nodes, the multi-dimensional arrays, and the redistribution equation. For each iteration in the source data space Z-polytope, the dependency equation gives the corresponding iteration in the sink data space Z-polytope.
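Because both loops address the same flat 12-element array, the redistribution equation 6i + j = 4·i1 + j1 can be checked by simple enumeration; a minimal C sketch for this example:

#include <stdio.h>

/* Enumerate the redistribution of Example 4.1.1: the source writes A
 * with tiles of size 6, the sink reads it with tiles of size 4; both
 * address the same 12-element array, so 6*i + j == 4*i1 + j1. */
int main(void)
{
    for (int i = 0; i < 2; ++i)
        for (int j = 0; j < 6; ++j) {
            int flat = 6 * i + j;             /* global element index */
            int i1 = flat / 4, j1 = flat % 4; /* sink iteration       */
            printf("src (%d,%d) -> snk (%d,%d)\n", i, j, i1, j1);
        }
    return 0;
}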


The different distributions arise due to different tilings of the same data array. This is often the case in block-based algorithms like the DCT. Therefore, the dependence between the source and sink data space must be given. As stated earlier, copartitioning [57] was defined as the tiling of the iteration space into spaces I_1, I_2, and I_3, using two congruent tiles. I_1 ∈ Z^n represents the points within the inner tiles, I_2 ∈ Z^n accounts for the regular repetition of the origins of the inner tiles, and I_3 ∈ Z^n accounts for the regular repetition of the outer tiles. Intuitively, for copartitioning, P_1 describes the iteration points inside the tile and P_2 describes the number of PEs in the processor array grid. We consider a multi-dimensional array var[0...a_1−1, ..., 0...a_n−1] of size ~a = (a_1 ... a_n)^T that is produced according to copartitioning(P_1, P_2) by P_2 output processors of the source accelerator. This array must be redistributed to P_2' input processors of the sink accelerator, which undergoes copartitioning(P_1', P_2'). Note that here the tiling matrices represent the tiling of the data space and not of the entire iteration space. Using Definition 3.1.1 for copartitioning, one can rewrite the redistribution equation as follows.

$$\begin{pmatrix} E & P_1 & P_1 P_2 \end{pmatrix} \begin{pmatrix} I_1 \\ I_2 \\ I_3 \end{pmatrix} = \begin{pmatrix} E & P_1' & P_1' P_2' \end{pmatrix} \begin{pmatrix} I_1' \\ I_2' \\ I_3' \end{pmatrix} \tag{4.1}$$

The redistribution equation has the above form in case of different tilings of the data space. The data space is the iteration space of an I/O variable. Intuitively, the data space is equivalent to the multi-dimensional data array; therefore, the data spaces can be represented as rectangular iteration spaces.

Definition 4.1.3 (Rectangular iteration space) defines the iteration space I_~b = {I ∈ Z^n | 0 ≤ I < ~b}, given by a positive integral vector ~b, where n is the dimension of ~b.

Assumption: We represent the source and the sink iteration spaces of a variable var (i.e., the data spaces) by rectangular iteration spaces, i.e., I_src^var = I_~a and I_snk^var = I_~b. For example, I = {(i j)^T ∈ Z^2 | 0 ≤ i < 4 ∧ 0 ≤ j < 3} can be represented as I_(4,3).

In the next section, we present a graph model for representing the SoC architecture containing the accelerators.

4.1.2. Accelerator Model

The target platform consists of accelerators, processors, and buses. It can be modeled by a so-called architecture graph. This graph is a representation of the system architecture components, which execute and implement the nodes and edges of the loop graph defining the computation and communication. The architecture graph is defined as follows:


Definition 4.1.4 (Architecture graph) is a bipartite graph defined by G_a = ⟨V_C ∪ V_B, E⟩, where V_C denotes the processing components, V_B denotes the communication components, and E ⊆ (V_C × V_B) denotes the set of undirected edges.

The processing nodes (V_C = V_A ∪ V_P ∪ V_O) of the architecture graph are classified into the set of accelerator nodes V_A, processor nodes V_P, and other IP cores V_O. A computing node a ∈ V_A denotes an accelerator and is responsible for the parallel implementation of a node in the loop graph. A processor node p ∈ V_P denotes a processor, which provides a sequential implementation of the loop graph nodes. Other IP cores and memories are represented by the node type V_O.

The nodes in V_B represent different communication channel alternatives, like buses or point-to-point communication over dedicated FIFO memories, which implement the edges in the loop graph. They also include architecture components like resident memory, interface circuits, controller logic, and device drivers. The resident memory is needed to store intermediate data elements. The device drivers are software programs in the processing nodes for accessing the other processing components. Interface circuits or wrappers are responsible for protocol conversion in case of incompatible protocols at the source and sink processing nodes. The controller logic takes care of the synchronization between the processing and communication nodes. In terms of the channel architecture, a blocking read/write synchronization mechanism is employed. Self-timed scheduling is deployed, in which the processor/accelerator invocations are controlled by the availability of I/O data.

4.1.3. Mapping: Putting it all together

Given the mapping of a loop graph to an architecture graph, the problem of synthesis is to generate the implementation. The mapping information binds the functionality and communication of the loop graph to the architecture graph (binding). It also includes configuration information like the parallelism (allocation) and the execution order (scheduling).

Definition 4.1.5 (Binding) β ⊆ V × V_C maps each node v ∈ V of the loop graph G = ⟨V, E⟩ to a processing node v_c ∈ V_C of the architecture graph.

A possible binding (v, v_c) may denote a sequential implementation of a loop kernel on a processor, or a parallel execution on a dedicated hardware accelerator. Therefore, the binding β also defines the hardware/software partitioning. The issue of selecting the optimal binding β* from the many possible bindings has been dealt with using heuristics and integer linear programming by Niemann et al. in [140]. The synthesis problem consists of two sub-problems, namely computation synthesis and communication synthesis. The computation synthesis involves synthesizing a binding (v, v_a), i.e., an accelerator in form of a processor array, whereas a binding (v, v_p), i.e., a binding to a processor, is realized in form of a software program.


Definition 4.1.6 (Node configuration) ∆_v : V → ⟨Q, L⟩ of the vertices in a loop graph denotes the allocation Q and the scheduling order (loop matrix) L.

Q defines the processor allocation in the space-time mapping equation. The tiling strategy determines this processor allocation function. For a tiled iteration space I = I_1 ⊕ I_2, I_1 is the iteration space of the inner tile and I_2 is the iteration space describing the origins of the outer tiles. In LPGS tiling, the processor space is mapped onto I_1 (inner loop parallelization), and for LSGP, the processor space is mapped onto I_2 (outer loop parallelization). For copartitioning, I = I_1 ⊕ I_2 ⊕ I_3, the processor space is mapped onto I_2. Intuitively, Q decides which loop variables are executed in parallel. L is the loop matrix, which gives the execution order of the sequentially executed iterations. For example, in an image processing application, row-major and column-major are the two typical execution orders for the image pixels.
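As an illustration, for a two-dimensional iteration vector (i j)^T and under the convention (assumed here purely for illustration) that the first column of L spans the innermost, fastest-varying scan direction, the two orders correspond to:

$$L_{\text{row-major}} = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}, \qquad L_{\text{column-major}} = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}.$$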

Definition 4.1.7 (Edge binding) ψ ⊆ E × V_B binds each edge of the loop graph to a communication channel in the architecture graph. An edge binding is said to be valid if the source and sink nodes of the edge have a defined binding.

The generation of communication channels according to the edge binding ψ for data transfer between accelerators or processors takes into account the granularity, size, rate, and communication order of the produced and consumed data. These depend on the allocation and scheduling functions of the source and the sink node. Hence, both the architecture and the communication implementation are determined by the allocation and schedule order parameters. The inter-kernel communication between the loop kernels follows a producer/consumer model. Therefore, important properties of the communication between loop kernels, which need to be taken into account, are:

• What is the granularity and size of a data transfer? The communication pattern may include fine-grained data transfers like signals, pixels, or coarse-grained data in form of vectors, tiles, or even complete images.

• What is the rate of production and consumption of data tokens by the source and sink loop kernels? This is important for multi-rate signal and image processing applications, where intermediate buffers must store the requisite data.

• What is the pattern of data production and consumption? The write and read order of the source and the sink loop determine not only the intermediate buffer size, but also the latency of the communication subsystem.

Similar to the node configuration required for the accelerator generation, we describe the communication semantics using an edge configuration (see Def. 4.1.8).



Figure 4.4.: (a) An example of a loop graph, an architecture graph, and their binding. (b) The node binding β and the edge binding ψ. (c) The node and edge configurations show the processor allocation and the schedule order for the iteration space.

Definition 4.1.8 (Edge configuration) ∆_e : E → ⟨var, Q_src^var, Q_snk^var, L_src^var, L_snk^var⟩ of an edge of the loop graph is a tuple, which defines the allocation and schedule order information of the source and the sink node of the edge.

The dimension and the size of a produced and a consumed data token are decided by the processor spaces of the source and sink loop accelerators. Q_src^var and Q_snk^var decide the allocation of iterations of the data space to the source and sink accelerator PEs. In combination with the iteration space of the I/O variable var, this can be used to describe the dimension and granularity of the produced and consumed data tokens. The execution orders of the source and the sink iterations describe the patterns of production and consumption of the multi-dimensional I/O arrays. They are given by partial functions, which are described by the loop matrices and are defined by L_src^var : E → Z^{n×n} and L_snk^var : E → Z^{n×n}, respectively. Finally, we refine the loop graph by annotating it with the mapping information, i.e., the binding and configuration information. This refined loop graph is defined as follows:

Definition 4.1.9 (Mapped Loop Graph) G is a loop graph, where each node v_k additionally has a partial function called node configuration ∆_{v_k} associated with it, and each edge e_{i,j} has a partial function called edge configuration ∆_{e_{i,j}}. In a mapped loop graph, the tuple ⟨var, I_src^var, I_snk^var, D, Q_src^var, Q_snk^var, L_src^var, L_snk^var⟩, whose last four components constitute ∆_e, defines the communication properties of an edge e_{src,snk} in the mapped loop graph.


Example 4.1.2 Figure 4.4 shows an example illustrating the concepts of a mapped loop graph. The loop graph has four communicating loop kernels. The iteration domain of the loops 1 and 4 is I = {(i j)^T | 0 ≤ i < 8 ∧ 0 ≤ j < 4}. The iteration domains of loops 2 and 3 are obtained by LSGP and LPGS tiling of I with 4 × 2 tiles. From the binding information β, we can observe that loops 2 and 3 are mapped onto dedicated hardware and loops 1 and 4 onto the same processor. The communication between loops 2 and 3 is mapped onto a FIFO, whereas the other communication takes place over the bus, as given by the edge binding ψ. The node configuration ∆_v shows that loops 1 and 4 are executed sequentially in row-major order, whereas loops 2 and 3 undergo outer (LSGP) and inner (LPGS) loop parallelization, respectively.

The nodes of the graph represent the system functionality, while the edges model the communication of the multi-dimensional arrays. It may be noted that the edge configuration functions Q_src^var, Q_snk^var, L_src^var, L_snk^var can be determined from the vertex descriptions of the mapped loop graph. However, this chosen redundancy leads to a decoupling between functionality and communication. This is beneficial for the design of complex systems, as the vertices and edges can be synthesized independently of each other. Furthermore, the system analysis is simplified significantly, because it is possible to use the abstract view provided by the communication semantics, instead of considering the internal details of the vertices.

We also restrict the discussion to orthogonal tiling matrices, which can be represented as diagonal matrices. Again, the tiling matrices denote the partitioning of the data space and not of the iteration space; they can be obtained by projecting the tiles onto the data space. Let P_1 = diag(~a_1), P_2 = diag(~a_2) and P_1' = diag(~b_1), P_2' = diag(~b_2) be the tiling matrices for the source and the sink loop data space in case of copartitioning, respectively. P_1 describes the iteration points inside a tile and P_2 describes the number of output PEs in the source processor array grid. The source and the sink iteration spaces of variable var are then given by

$$\mathcal{I}_{src}^{var} \rightarrow \underbrace{\mathcal{I}_{\vec{a}_1}}_{\text{inner tile}} \oplus \underbrace{\mathcal{I}_{\vec{a}_2}}_{\text{proc. space}} \oplus \underbrace{\mathcal{I}_{\vec{a}_3}}_{\text{outer tile}}, \qquad \mathcal{I}_{snk}^{var} \rightarrow \underbrace{\mathcal{I}_{\vec{b}_1}}_{\text{inner tile}} \oplus \underbrace{\mathcal{I}_{\vec{b}_2}}_{\text{proc. space}} \oplus \underbrace{\mathcal{I}_{\vec{b}_3}}_{\text{outer tile}} \tag{4.2}$$

where ~a_3 ~a_2 ~a_1 = ~x and ~b_3 ~b_2 ~b_1 = ~x (element-wise products), and I_~x is the data space of the transported multi-dimensional array. The global allocation of the output PEs in case of copartitioning is given by

$$p = \underbrace{\begin{pmatrix} 0 & E & 0 \end{pmatrix}}_{Q} \begin{pmatrix} I_1 \\ I_2 \\ I_3 \end{pmatrix} = I_2$$

We use the succinct representation I_(~a_1,~a_2,~a_3) for the source data space I_src^var. It must be noted that sequential execution, LSGP, and LPGS can be represented as special cases of copartitioning².



Figure 4.5.: (a) Two communicating matrix multiplications (E = (A × B) × D) shown as a mapped loop graph, (b) LSGP tiling of the iteration space, (c) corresponding accelerators as processor arrays, showing the transferred data arrays.

The global schedule is determined by the loop matrix, which determines the sequential execution order. The loop matrix L = (l_1 l_2 ... l_s) ∈ Z^{s×s} determines the ordering of the iteration points within a parallelotope-shaped tile of the data space. The following example shows a mapped loop graph of two communicating matrix-matrix multiplications, each undergoing a different tiling.

Example 4.1.3 Figure 4.5(a) shows two communicating matrix-matrix multiplications (E = (A × B) × D) as a mapped loop graph, where each matrix is of dimension 8 × 8. The iteration space is tiled for the source and the sink loop using the tiling matrices diag(2,8,8), diag(4,1,1) and diag(4,8,4), diag(2,1,2), respectively. The

² In case of sequential execution, P_1 = diag(~1), P_2 = diag(~1). For LPGS, P_1 = diag(~1), P_2 = diag(~p). For LSGP, P_1 = diag(~a), P_2 = diag(~p), where ~a ~p = ~x.

data space is not of the same dimension as the iteration space (see Figure 4.5). Therefore, the data space is tiled by the tiling matrices P_1 = diag(2,8), P_2 = diag(4,1) and P_1' = diag(4,4), P_2' = diag(2,2), respectively. The dependence equation shows the relation between the iteration variables of the data space of array C, corresponding to the transported array variable. By definition of the allocation function for copartitioning, the part of the iteration space indicated by the arrow in Figure 4.5(a) is executed in parallel. The loop matrices L1 and L2 show the sequential execution order of the inner and the outer tile of the data space, which is also shown by arrows in Figure 4.5(c).

The problem of accelerator synthesis was dealt with in the previous chapter: given the node configuration, the iteration domain, and the loop description, a dedicated accelerator is generated. The major remaining challenge is to synthesize the communication channels for the data transfer between communicating accelerators and to achieve their synchronization. There are different cases of this problem, depending on the implementation of the source and the sink node:

1. The source and the sink node of the edge are mapped onto processor nodes v_p of the architecture graph.

2. The source and the sink node of the edge are mapped onto an accelerator type node v_A and a processor type node v_p, or vice versa.

3. The source and sink nodes are both mapped onto accelerator type nodes v_a of the architecture graph.

In the first case, either the interface circuit of the processor to the bus or another communication alternative is available; therefore, only the drivers for the data transfer and the synchronization are needed. In the second case, an interface circuit for the accelerator needs to be developed, along with a software driver for the processor. In the last case, a dedicated communication channel for accelerator-to-accelerator communication needs to be developed. This is illustrated in Figure 4.6, which shows the implementation for the mapping of the loop graph given in Example 4.1.2.

The task of the communication synthesis is to generate dedicated communication FIFOs, software drivers, and hardware wrappers in order to facilitate the communication; a sketch of such a driver is shown below. In the next section, we deal with the communication synthesis problem for the last case, as it is required for the generation of dedicated accelerator subsystems.
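Figure 4.6 hints at the generated driver template ("int StreamWrite(int n, int dat[]) {//blocking write ..."); a plausible completion of the blocking write/read pair could look as follows, where the memory-mapped register addresses and names are purely hypothetical.

/* Hypothetical memory-mapped accelerator registers. */
#define ACC_STATUS_FULL  (*(volatile int *)0xA0000000)
#define ACC_STATUS_EMPTY (*(volatile int *)0xA0000004)
#define ACC_DATA_IN      (*(volatile int *)0xA0000008)
#define ACC_DATA_OUT     (*(volatile int *)0xA000000C)

/* Blocking write of n data words to the accelerator channel. */
int StreamWrite(int n, int dat[])
{
    for (int i = 0; i < n; ++i) {
        while (ACC_STATUS_FULL)   /* busy-wait until space is available */
            ;
        ACC_DATA_IN = dat[i];
    }
    return n;
}

/* Blocking read of n data words from the accelerator channel. */
int StreamRead(int n, int dat[])
{
    for (int i = 0; i < n; ++i) {
        while (ACC_STATUS_EMPTY)  /* busy-wait until data is available */
            ;
        dat[i] = ACC_DATA_OUT;
    }
    return n;
}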

4.2. Automated Generation of a Communicating Accelerator Subsystem

We follow a two-pronged approach for the generation of the accelerator subsystem by treating the accelerator synthesis and the communication synthesis separately.



Figure 4.6.: (a) Specification from Example 4.1.2, (b) accelerator synthesis, (c) communication synthesis.

The communication synthesis problem depends on the accelerator allocation and the scheduling parameters, as they determine the parallel access to data, the memory mapping, and out-of-order communication. In this section, we leverage the conversion of the loop graph from the polyhedral model to the windowed synchronous data flow (WSDF) model of computation [108] for the synthesis of hardware-to-hardware communication subsystems.

The polytope model was used to describe communicating loops by means of a polyhedral dependency graph in [153, 27, 178]. However, in [67], it was shown that scheduling is not scalable for complex applications; therefore, modularity for application development was advocated. The concept of a mapped loop graph takes a modular approach for representing communicating loops and mapping information. In Section 2.1.3, we discussed some formal models for representing data access patterns of streaming applications, like synchronous data flow (SDF) [121], cyclostatic data flow [18], multi-dimensional SDF [138], and their variants. The advantage of the WSDF model is that it is capable of representing the schedule order, which is necessary for the synthesis of the communication subsystem. There is some related work which leverages a different model by projection, for the purpose of synthesis or analysis, like Array-OL to Kahn process networks [5], or polyhedral reduced dependence graphs (PRDG) into binary parametrized cyclostatic data flow graphs [38]. The communication synthesis must perform the array redistribution such that the communication subsystem is not the bottleneck. This problem has been intensively studied using different models in the context of high performance computing systems [40] and systems-on-chip [72]. For the communication between hardware accelerators,

fractional SDF has been used for communication synthesis in [102]. In [175], a methodology for out-of-order communication in Kahn process networks was presented. However, none of the presented methodologies supports parallel access and out-of-order communication for multi-dimensional applications in their entirety.

4.2.1. Modeling of Communication Channels

The loop graph specifies the application in the polyhedral model. Traditionally, the polyhedral model has been used for the generation of loop accelerators in form of processor arrays and for the code generation for parallel architectures; in other words, it has concentrated on architecture synthesis. Data flow models of computation, in contrast, have primarily concentrated on communication semantics and analysis. Our solution approach for the communication synthesis problem includes the conversion of the polyhedral representation of a mapped loop graph into the multi-dimensional WSDF model of computation. This model of computation, proposed in [108], offers the possibility of schedulability and buffer analysis, as well as the synthesis of communication primitives.

4.2.1.1. Simplified Windowed Synchronous Data Flow Model

Earlier, in Chapter 2.1.2, a data flow model of computation called windowed synchronous data flow (WSDF) was discussed for modeling applications characterized by communicating loops. In this section, we show that not all features of the WSDF model are needed in our case; therefore, we propose a simplified WSDF model for modelling communicating loop nests.

Each vertex of the WSDF graph represents a process, whose functionality is described by a nested loop program. The kernel of the loop nest contains the computations, which read from the input ports, process the data, and provide the results on the output ports. The WSDF communication edge is later synthesized as a so-called multi-dimensional FIFO, which adheres to FIFO semantics, i.e., it behaves like a one-dimensional FIFO with full, empty, r/w count, enable, and data signals. A data read access can take place only if the empty and read count flags indicate that data is available. Similarly, a write access can take place only when the full and write count flags indicate that enough empty space is available for writing the data. A conventional FIFO has the same read and write order; the parallel read and write access to a multi-dimensional array with a different communication order is fitted to the FIFO semantics by integrating the information on the production and consumption order. Furthermore, the buffer size can be derived by sophisticated calculations taking into account the data dependencies as well as the production and consumption order [107].

In our accelerator design system, issues such as sliding windows, upsampling, downsampling, and border extensions are handled within the loop program; therefore,

118 4.2. Automated Generation of a Communicating Accelerator Subsystem the WSDF semantics in our case would not need boundary extension vectors ~bs and ~bt. Furthermore, sliding window vector ∆~c, which simplifies the notation for down- sampling or upsampling, is also described using conditionals within the loop kernel. Therefore, we simplify the WSDF notation as follows:

Definition 4.2.1 (Simplified WSDF) is a data flow graph, i.e., a tuple G = (V, E, D), where V is a set of nodes representing processes, E is a set containing directed edges, and D = (~p, ~v, ~c, O_write, O_read) is a set of data labels that are partial functions of E into some specified range of values. The labels are the same as in Definition 2.1.5 on page 21.

Therefore, the boundary vectors ~b_s and ~b_t and the sliding window vector ∆~c are not required anymore and can be set to their default values: the boundary extension vector defaults to the zero vector and the window sampling vector to the unit vector. We use the simplified WSDF framework for communication synthesis [44]. In the next section, we derive the simplified WSDF representation from the polyhedral notation of the mapped loop graph (see Def. 4.1.9).
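In software terms, the FIFO semantics described in the previous paragraphs amount to guarding every access with the fill-state flags; a minimal C sketch with illustrative names:

#include <stdbool.h>

/* Fill-state interface of the (multi-dimensional) FIFO as seen by the
 * accelerators; the names are illustrative. */
typedef struct {
    bool full, empty;
    int  rd_count, wr_count;   /* tokens readable / slots writable */
} md_fifo_state_t;

/* A read of n tokens may fire only if enough data is present. */
static bool can_read(const md_fifo_state_t *f, int n)
{
    return !f->empty && f->rd_count >= n;
}

/* A write of n tokens may fire only if enough space is free. */
static bool can_write(const md_fifo_state_t *f, int n)
{
    return !f->full && f->wr_count >= n;
}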

4.2.1.2. Conversion from the Polyhedral Model to the Data Flow Representation

In this section, we discuss the conversion of mapped loop graph descriptions in the polyhedral model to the simplified windowed synchronous data flow notation. The polyhedral model offers a rich framework for accelerator synthesis, whereas the benefits of using the WSDF model of computation are:

• buffer estimation and schedulability analysis (deadlock, determinism) of a self-timed schedule

• automated generation of dedicated communication channels, which are required for a hardware platform or a hardware/software platform

A loop description can be automatically classified as belonging to the WSDF model of computation if the communication orders of all input and output variables of an actor are identical. In other words, the communication order is determined by the loop schedule. This is true for the considered class of loop algorithms and our affine scheduling technique. Therefore, all communicating loop nests belonging to our class of algorithms can be converted to the WSDF model.

There is a one-to-one correspondence between the node descriptions of the mapped loop graph and of the WSDF graph. Hence, we can describe a vertex of a WSDF graph with the help of the description of the corresponding vertex in the mapped loop graph; it contains the information on the program, the iteration space, and the reduced dependence graph. This is a trivial step, discussed only for the sake of completeness. The tuple ⟨var, I_src^var, I_snk^var, D, Q_src^var, Q_snk^var, L_src^var, L_snk^var⟩ describes the I/O array variable, the iteration spaces, the dependency representing the array transfer, the processor allocation, and the execution order of the transported I/O variable.


[Figure 4.7 shows the source loop nest with the write statement $z[I] = y[\theta_{src} I - d_{src}] \;\forall I \in \mathcal{I}^z_{src}$, allocation $Q_{src}$, and schedule $L_{src}$; the sink loop nest with the read statement $x[I] = z[\theta_{snk} I - d_{snk}] \;\forall I \in \mathcal{I}^x_{snk}$, allocation $Q_{snk}$, and schedule $L_{snk}$; and the connecting edge tuple $\langle z, \mathcal{I}^z_{src}, \mathcal{I}^z_{snk}, D, Q_{src}, Q_{snk}, L_{src}, L_{snk} \rangle$.]

Figure 4.7.: Polyhedral representation of communicating loop nests as mapped loop graph.

We need to project the mapped loop graph notation onto the corresponding WSDF edge notation $\langle \vec{p}, \vec{v}, \vec{c}, O_{write}, O_{read} \rangle$. For the source and the sink loop nest shown in Figure 4.7, $\mathcal{I}_{src}$ and $\mathcal{I}_{snk}$ are the iteration spaces of the source loop program and the sink loop program, which contain the set of all loop iterations. Figure 4.7 shows the read and write statements in the communicating loops, which include the reference to the declared I/O variable $z$, corresponding to the transported multi-dimensional array. The iteration conditions of the I/O statements define the iteration space of the I/O variables. Intuitively, this data space corresponds to the multi-dimensional array declaration. In Figure 4.7, for the source loop, the indexing function for the I/O variable $z$ is an identity matrix (due to the output normal form), and the access is limited by the iteration-dependent condition $I \in \mathcal{I}^z_{src}$. Hence, the source data space, i.e., the iteration space of the source variable, is $\mathcal{I}^z_{src}$. Similarly, for the statement using the variable $z$ in the sink iteration space, the iteration condition is $I \in \mathcal{I}^x_{snk}$. Thus, the sink data space is given by $\mathcal{I}^z_{snk} = \{ I \mid I \in \mathcal{I}^x_{snk} \}$. The affine indexing function $\theta_{snk}$ determines the access to the array defined by $\mathcal{I}^z_{snk}$ and is realized in the I/O controller inside the sink accelerator. The major characteristics of the projection problem are:

• The complete multi-dimensional array must be transported; thus, the source and the sink iteration spaces of an I/O variable refer to the same array.

• The source node performs computations, which produce the tokens $\vec{p}$ composing the multi-dimensional data array. The token production parameters depend on the parallel data write access, which in turn depends on the allocation of the data space of the source variable. The source communication order then defines the sequence in which the tokens are composed to form the output array. This communication order parameter depends on the global schedule order of the source accelerator.

• The sink node extracts the consumer token $\vec{c}$ from the multi-dimensional array for performing a single computation. The token consumption depends on the parallel data read access, which in turn depends on the allocation of the data space of the sink variable. The sink communication order defines the sequence in which the tokens are extracted from the input array.

The projection problem must take the redistribution equation into account. Due to the tiling of the common data space, the redistribution equation (Equation (4.1)) can be represented as:

$$\underbrace{\begin{pmatrix} E & P_1 & P_1 P_2 \end{pmatrix}}_{A} \begin{pmatrix} I_{\vec{a}_1} \\ I_{\vec{a}_2} \\ I_{\vec{a}_3} \end{pmatrix} = \underbrace{\begin{pmatrix} E & P'_1 & P'_1 P'_2 \end{pmatrix}}_{B} \begin{pmatrix} I_{\vec{b}_1} \\ I_{\vec{b}_2} \\ I_{\vec{b}_3} \end{pmatrix} \qquad (4.3)$$

where $I_{\vec{a}_1}, I_{\vec{a}_2}, I_{\vec{a}_3} \in \mathcal{I}^{var}_{src}$ and $I_{\vec{b}_1}, I_{\vec{b}_2}, I_{\vec{b}_3} \in \mathcal{I}^{var}_{snk}$. It is seldom the simple case that both the source and the sink follow the same parallelization strategy (i.e., $A = B$). The case $A \neq B \Rightarrow \mathcal{I}^{var}_{src} \neq \mathcal{I}^{var}_{snk}$ is often observed due to different tiling and parallelization strategies of the source and the sink loop. However, it is the same multi-dimensional array that is being transported. Therefore, if $\mathcal{I}^{var} = \mathcal{I}_{\vec{x}}$ with $\vec{x} = \vec{a}_1 \vec{a}_2 \vec{a}_3 = \vec{b}_1 \vec{b}_2 \vec{b}_3$ refers to the common multi-dimensional array being transported, then the different tiling leads to incompatible data spaces. In the following, we assume that the source and the sink data space undergo copartitioning$(P_1, P_2)$ and copartitioning$(P'_1, P'_2)$, respectively, where $P_1$ and $P_2$ describe the inner and outer tiles of the data space.

The problem of determining the WSDF parameters characterizing the production and consumption of the data array depends on the selected tiling. We propose Algorithm 4.1 for the parameter conversion from the polyhedral model to the WSDF model. Input to Algorithm 4.1 are the I/O data spaces of the source and sink accelerators, which have to be represented as copartitioned iteration spaces, i.e., the source data space $\mathcal{I}^{var}_{src} = \mathcal{I}_{(\vec{a}_1, \vec{a}_2, \vec{a}_3)}$ and the sink data space $\mathcal{I}^{var}_{snk} = \mathcal{I}_{(\vec{b}_1, \vec{b}_2, \vec{b}_3)}$ (see Footnote 2 on page 115). Furthermore, the loop matrices characterizing the sequential execution of the corresponding loop variables are provided. The loop matrices are adjusted to the data space.

The algorithm differentiates whether tiling and parallelization lead to contiguous data tokens. In case of sequential execution, copartitioning$(\vec{1}, \vec{1})$, or inner loop parallelization, copartitioning$(\vec{1}, \vec{p})$, the data tokens are contiguous in the multi-dimensional array. In case of inner loop parallelization, all data elements of the I/O array variable in a tile are produced as a single token. In this case, the production and the consumption token vectors are equivalent to the number of output or input processor elements of the processor array producing or consuming the array variable. For copartitioning, the number of I/O PEs is determined by the tiling matrix $P_2$. Therefore, we infer $\vec{p} = \vec{a}_2$ and $\vec{c} = \vec{b}_2$ from Equation (4.2). The virtual token vector refers to the common multi-dimensional array $\mathcal{I}_{\vec{x}}$, which is tiled differently. Therefore, the virtual token is $\vec{v} = \vec{a}_1 \vec{a}_2 \vec{a}_3 = \vec{b}_1 \vec{b}_2 \vec{b}_3$. The write and the read order are each derived from the corresponding loop matrices.


Algorithm 4.1 Parameter conversion of a mapped loop graph edge to a WSDF edge
Require: I/O variable $var$; source data space $\mathcal{I}^{var}_{src} = \mathcal{I}_{(\vec{a}_1, \vec{a}_2, \vec{a}_3)}$; sink data space $\mathcal{I}^{var}_{snk} = \mathcal{I}_{(\vec{b}_1, \vec{b}_2, \vec{b}_3)}$; source loop matrix $L^{var}_{src}$; sink loop matrix $L^{var}_{snk}$
Ensure: producer token vector $\vec{p}$, consumer token vector $\vec{c}$, virtual token vector $\vec{v}$, write communication order $O_{write}$, read communication order $O_{read}$
1: // if source and sink loop undergo LPGS tiling/sequential execution
2: if $(\vec{a}_1 = \vec{1} \lor \vec{a}_2 = \vec{1}) \land (\vec{b}_1 = \vec{1} \lor \vec{b}_2 = \vec{1})$ then
3:   virtual token $\vec{v} = \vec{a}_1 \vec{a}_2 \vec{a}_3 = \vec{b}_1 \vec{b}_2 \vec{b}_3$
4:   producer token $\vec{p} = \vec{a}_2$
5:   consumer token $\vec{c} = \vec{b}_2$
6:   $O_{write} = [\vec{p},\ \vec{p} \cdot \vec{L}_1,\ \vec{p} \cdot (\vec{L}_1 + \vec{L}_2), \ldots, \vec{p} \cdot (\sum_i^n \vec{L}_i)]$, where $L^{var}_{src} = (\vec{L}_1\ \vec{L}_2 \ldots \vec{L}_n)$
7:   $O_{read} = [\vec{c},\ \vec{c} \cdot \vec{L}_1,\ \vec{c} \cdot (\vec{L}_1 + \vec{L}_2), \ldots, \vec{c} \cdot (\sum_i^n \vec{L}_i)]$, where $L^{var}_{snk} = (\vec{L}_1\ \vec{L}_2 \ldots \vec{L}_n)$
8: else
9:   // one of the loops undergoes LSGP/copartitioning; introduce an intermediate copy actor with edges connecting the source and sink actors to the copy actor
10:  for the source edge of the copy actor do
11:    virtual token $\vec{v} = (\vec{v}_1, \vec{v}_2, \vec{v}_3)$, $\vec{v}_1 = \gcd(\vec{a}_1, \vec{b}_1)$, $\vec{v}_2 = \frac{\vec{a}_1}{\gcd(\vec{a}_1, \vec{b}_1)}$, and $\vec{v}_3 = \vec{a}_2 \vec{a}_3$
12:    producer token $\vec{p} = (\vec{1}, \vec{1}, \vec{a}_2)$
13:    consumer token $\vec{c} = (\vec{1}, \vec{c}_1, \vec{c}_2)$, $\vec{c}_1 = \frac{\vec{a}_1}{\gcd(\vec{a}_1, \vec{b}_1)}$ and $\vec{c}_2 = \frac{\mathrm{lcm}(\vec{a}_1 \vec{a}_2,\, \vec{b}_1 \vec{b}_2)}{\vec{a}_1}$
14:    $O_{write} = [\vec{p},\ \vec{p} + \vec{L}_1,\ \vec{p} + \vec{L}_1 + \vec{L}_2, \ldots, \vec{p} + \sum_i^n \vec{L}_i]$, where $L^{var}_{src} = (\vec{L}_1\ \vec{L}_2 \ldots \vec{L}_n)$
15:    $O_{read} = [\vec{c},\ \vec{c} + \vec{L}_1,\ \vec{c} + (\vec{L}_1 + \vec{L}_2), \ldots, \vec{c} + (\sum_i^n \vec{L}_i)]$, where $L^{var}_{snk} = (\vec{L}_1\ \vec{L}_2 \ldots \vec{L}_n)$
16:  end for
17:  for the sink edge of the copy actor do
18:    virtual token $\vec{v} = (\vec{v}_1, \vec{v}_2, \vec{v}_3)$, $\vec{v}_1 = \gcd(\vec{a}_1, \vec{b}_1)$, $\vec{v}_2 = \frac{\vec{b}_1}{\gcd(\vec{a}_1, \vec{b}_1)}$, and $\vec{v}_3 = \vec{b}_2 \vec{b}_3$
19:    producer token $\vec{p} = (\vec{1}, \vec{p}_1, \vec{p}_2)$, $\vec{p}_1 = \frac{\vec{b}_1}{\gcd(\vec{a}_1, \vec{b}_1)}$ and $\vec{p}_2 = \frac{\mathrm{lcm}(\vec{a}_1 \vec{a}_2,\, \vec{b}_1 \vec{b}_2)}{\vec{b}_1}$
20:    consumer token $\vec{c} = (\vec{1}, \vec{1}, \vec{b}_2)$
21:    $O_{write} = [\vec{p},\ \vec{p} + \vec{L}_1,\ \vec{p} + \vec{L}_1 + \vec{L}_2, \ldots, \vec{p} + \sum_i^n \vec{L}_i]$, where $L^{var}_{src} = (\vec{L}_1\ \vec{L}_2 \ldots \vec{L}_n)$
22:    $O_{read} = [\vec{c},\ \vec{c} + \vec{L}_1,\ \vec{c} + (\vec{L}_1 + \vec{L}_2), \ldots, \vec{c} + (\sum_i^n \vec{L}_i)]$, where $L^{var}_{snk} = (\vec{L}_1\ \vec{L}_2 \ldots \vec{L}_n)$
23:  end for
24: end if

In case of sequential execution or inner loop parallelization, there is only a single loop matrix. In case of outer loop parallelization or general copartitioning, the data tokens are spread across the array, so that the produced and the consumed tokens must be reordered to compose the common data array. The purpose of the copy actor is the regular consumption and production of tokens to support parallel access. In order to prevent the communication subsystem from becoming the bottleneck, the common tokens must be consumed and produced in parallel. Therefore, the second case of Algorithm 4.1 presents the embedding into a data space where the tokens are contiguous and can thus be represented in the WSDF model. It can be shown that the production and consumption pattern repeats after every $\mathrm{lcm}(\vec{a}_1 \vec{a}_2,\, \vec{b}_1 \vec{b}_2)$ data elements.
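The case distinction in line 2 of Algorithm 4.1 can be phrased, per array dimension, as a one-line predicate. The following C sketch is our own paraphrase of that guard (the function name is hypothetical):

/* tokens are contiguous iff each loop runs sequentially or with inner
   loop parallelization, i.e., its inner or its outer tile is trivial */
static int tokens_contiguous(long a1, long a2, long b1, long b2)
{
    return (a1 == 1 || a2 == 1) && (b1 == 1 || b2 == 1);
}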

Theorem 4.2.1 For a source loop undergoing copartitioning$(\mathrm{diag}(\vec{a}_1), \mathrm{diag}(\vec{a}_2))$ and a sink loop undergoing copartitioning$(\mathrm{diag}(\vec{b}_1), \mathrm{diag}(\vec{b}_2))$, the communication pattern is repeated every $\mathrm{lcm}(\vec{a}_1 \vec{a}_2,\, \vec{b}_1 \vec{b}_2)$ data elements.

Proof 4.2.1 $P_1 = \mathrm{diag}(\vec{a}_1)$, $P_2 = \mathrm{diag}(\vec{a}_2)$ and $P'_1 = \mathrm{diag}(\vec{b}_1)$, $P'_2 = \mathrm{diag}(\vec{b}_2)$ are the tiling matrices of the source and sink loop data space, respectively. Furthermore, from the redistribution equation (4.3), we have

$$\begin{pmatrix} E & P_1 & P_1 P_2 \end{pmatrix} \begin{pmatrix} I_{\vec{a}_1} \\ I_{\vec{a}_2} \\ I_{\vec{a}_3} \end{pmatrix} = \begin{pmatrix} E & P'_1 & P'_1 P'_2 \end{pmatrix} \begin{pmatrix} I_{\vec{b}_1} \\ I_{\vec{b}_2} \\ I_{\vec{b}_3} \end{pmatrix}$$

I.e.,

$$I_{\vec{a}_1} + P_1 I_{\vec{a}_2} + P_1 P_2 I_{\vec{a}_3} = I_{\vec{b}_1} + P'_1 I_{\vec{b}_2} + P'_1 P'_2 I_{\vec{b}_3} = I$$

Let $L = (l_1, \ldots, l_n)$ be $\mathrm{lcm}(\vec{a}_1 \vec{a}_2,\, \vec{b}_1 \vec{b}_2)$, i.e., the least common multiple of $\vec{a}_1 \vec{a}_2$ and $\vec{b}_1 \vec{b}_2$, and let $\vec{a}_1 = (a_{11}, a_{12}, \ldots, a_{1n})$ and $\vec{a}_2 = (a_{21}, a_{22}, \ldots, a_{2n})$. By definition of copartitioning, the source loop output is mapped onto a $\mathrm{diag}(\vec{a}_2)$ processor array grid and the sink loop input is mapped onto a $\mathrm{diag}(\vec{b}_2)$ processor array grid. Then, any output iteration $I = (i_1, i_2, \ldots, i_n)$ is mapped onto the same source processor element $p_{src}$ as output iteration $L + I$, where

$$p_{src} = \left( \left\lfloor \frac{i_1}{a_{11}} \right\rfloor \bmod a_{21}, \ldots, \left\lfloor \frac{i_n}{a_{1n}} \right\rfloor \bmod a_{2n} \right) = \left( \left\lfloor \frac{l_1 + i_1}{a_{11}} \right\rfloor \bmod a_{21}, \ldots, \left\lfloor \frac{l_n + i_n}{a_{1n}} \right\rfloor \bmod a_{2n} \right)$$

since $l_1 = \mathrm{lcm}(a_{11} a_{21},\, b_{11} b_{21})$, i.e., $l_1 = x_1 \cdot a_{11} a_{21}$, it is divisible by $a_{11}$. Also, $\frac{l_1}{a_{11}}$ is divisible by $a_{21}$, which implies that $\left\lfloor \frac{i_1}{a_{11}} \right\rfloor \bmod a_{21} = \left\lfloor \frac{l_1 + i_1}{a_{11}} \right\rfloor \bmod a_{21}$. Hence, $I$ and $L + I$ are mapped onto the same source processor. Similarly, it can be shown that the iteration vectors $I$ and $L + I$ are mapped onto the same sink processor element

$$p_{snk} = \left( \left\lfloor \frac{i_1}{b_{11}} \right\rfloor \bmod b_{21}, \ldots, \left\lfloor \frac{i_n}{b_{1n}} \right\rfloor \bmod b_{2n} \right) = \left( \left\lfloor \frac{l_1 + i_1}{b_{11}} \right\rfloor \bmod b_{21}, \ldots, \left\lfloor \frac{l_n + i_n}{b_{1n}} \right\rfloor \bmod b_{2n} \right)$$

Therefore, the communication pattern repeats every $L$ data elements, as they are mapped onto the same source and sink processing elements. $\square$

Furthermore, $\gcd(\vec{a}_1, \vec{b}_1)$ data elements are produced sequentially. Therefore, in order to support parallel access, $n_{mem}$ memory banks are required to produce common data tokens:

$$n_{mem} = \frac{\mathrm{lcm}(\vec{a}_1 \vec{a}_2,\, \vec{b}_1 \vec{b}_2)}{\gcd(\vec{a}_1, \vec{b}_1)}$$

This information is used for the construction of the copy actor, with production and consumption reflecting the above number of independent parallel accesses. This is part of the algorithm for the producer and consumer token of the introduced copy actor. It can be shown that the construction is a general case, which is also valid for sequential execution or inner loop parallelization. Similar to the other parameters, $L_{src}$ is modified to $L^{var}_{src}$ and $L_{snk}$ is modified to $L^{var}_{snk}$ by embedding into the common data space. The solution approach to the projection problem is illustrated with the help of the following example.
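The bank count formula is easy to evaluate per array dimension. A minimal C sketch (our own; the helper names are illustrative) is given below, using the copartitionings of Example 4.2.1 as test input:

#include <stdio.h>

static long gcd(long x, long y) { while (y) { long t = x % y; x = y; y = t; } return x; }
static long lcm(long x, long y) { return x / gcd(x, y) * y; }

/* n_mem = lcm(a1*a2, b1*b2) / gcd(a1, b1), per dimension */
static long n_mem(long a1, long a2, long b1, long b2)
{
    return lcm(a1 * a2, b1 * b2) / gcd(a1, b1);
}

int main(void)
{
    /* Example 4.2.1, LSGP case: source copartitioning((6),(2)),
       sink copartitioning((4),(3)) => lcm(12,12)/gcd(6,4) = 6 banks */
    printf("n_mem = %ld\n", n_mem(6, 2, 4, 3));
    return 0;
}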

Example 4.2.1 We present the loop program in Figure 4.8(a) to illustrate Algorithm 4.1. The data spaces of the two communicating for loops are obtained by partitioning the same iteration domain $\mathcal{I} = \{ (i) \mid 0 \le i \le 11 \}$ with different tiles of size 6 and 4, respectively. Selecting different allocations like LPGS (inner loop parallelization) or LSGP (outer loop parallelization) causes different patterns of production and consumption of data elements. In case of inner loop parallelization of both loops, the processor allocation is given by

$$p = \underbrace{(1\ 0)}_{Q_{src} = Q_{snk}} \begin{pmatrix} j \\ i \end{pmatrix}$$

This allocation is equivalent to copartitioning$((1),(6))$ and copartitioning$((1),(4))$ for the source and the sink data space, respectively. In this case, the rectangular iteration spaces of the source and the sink data space can also be represented


[Figure 4.8(a) shows the two communicating loop nests of the toy example,

for(i=0;i<2;i++)            for(i1=0;i1<3;i1++)
  for(j=0;j<6;j++){           for(j1=0;j1<4;j1++){
    A[i,j]=...                  A[i1,j1]=...
  }                           }

with the array transfer dependency $D: (6\ 1) \begin{pmatrix} i \\ j \end{pmatrix} = (4\ 1) \begin{pmatrix} i_1 \\ j_1 \end{pmatrix}$. Panels (b)-(d) illustrate the resulting token production and consumption patterns over the data elements 1 to 12.]

Figure 4.8.: (a) loop graph of a toy example, (b) WSDF model of computation on LPGS or inner loop parallelization, (c) problem illustration of non-contiguous data tokens on outer loop parallelization or LSGP, and (d) WSDF notation for LSGP tiling.

as $\mathcal{I}_{(1,6,2)}$ and $\mathcal{I}_{(1,4,3)}$, respectively. The production and consumption token vectors result in $\vec{p} = \vec{a}_2 = 6$ and $\vec{c} = \vec{b}_2 = 4$ if the generated data tokens are contiguous (LPGS tiling). By definition, the virtual token is the common layout of the transported multi-dimensional array. Therefore, the virtual token evaluates to $\vec{v} = \vec{a}_1 \vec{a}_2 \vec{a}_3 = \vec{b}_1 \vec{b}_2 \vec{b}_3 = 6 \times 2 = 4 \times 3 = 12$. Indeed, the source and the sink accelerators must fire $\vec{a}_3$ and $\vec{b}_3$ times for producing and consuming a virtual token. The write and read orders are given by $O_{write} = [(6),(12)]$ and $O_{read} = [(4),(12)]$, respectively, which are directly derived from the loop matrices, given by $L^{var}_{src} = (2)$ and $L^{var}_{snk} = (3)$. This is also illustrated in Figure 4.8(b). In the case that both loops undergo outer loop parallelization, the processor allocation is given by

$$p = \underbrace{(0\ 1)}_{Q_{src} = Q_{snk}} \begin{pmatrix} j \\ i \end{pmatrix}$$

This allocation is equivalent to copartitioning$((6),(2))$ and copartitioning$((4),(3))$ for the source and sink data spaces, respectively. The corresponding rectangular iteration spaces for the source and the sink can be represented by $\mathcal{I}_{(6,2,1)}$ and $\mathcal{I}_{(4,3,1)}$, respectively. The generated data tokens are not contiguous, because $P_1 \neq \mathrm{diag}(\vec{1})$. This is depicted in Figure 4.8(c). The arrows show the non-contiguous tokens being produced and consumed by the source and sink accelerators. The dotted arrows must be produced for parallel access. The corresponding token parameters of the source edge of the copy actor are

$$\vec{v} = \begin{pmatrix} \gcd(6,4) \\ \frac{6}{\gcd(6,4)} \\ 2 \cdot 1 \end{pmatrix} = \begin{pmatrix} 2 \\ 3 \\ 2 \end{pmatrix}$$

$$\vec{p} = \begin{pmatrix} 1 \\ 1 \\ 2 \end{pmatrix} \quad \land \quad \vec{c} = \begin{pmatrix} 1 \\ \frac{6}{\gcd(6,4)} \\ \frac{\mathrm{lcm}(6 \cdot 2,\, 4 \cdot 3)}{6} \end{pmatrix} = \begin{pmatrix} 1 \\ 3 \\ 2 \end{pmatrix}$$

In consequence, the write and read orders on the source edge are

$$O_{write} = \left[ \begin{pmatrix} 1 \\ 1 \\ 2 \end{pmatrix}, \begin{pmatrix} 2 \\ 3 \\ 2 \end{pmatrix} \right], \quad O_{read} = \left[ \begin{pmatrix} 1 \\ 3 \\ 2 \end{pmatrix}, \begin{pmatrix} 2 \\ 3 \\ 2 \end{pmatrix} \right]$$

If Algorithm 4.1 is applied, the token vectors of the sink edge of the copy actor are


$$\vec{v} = \begin{pmatrix} \gcd(6,4) \\ \frac{4}{\gcd(6,4)} \\ 3 \cdot 1 \end{pmatrix} = \begin{pmatrix} 2 \\ 2 \\ 3 \end{pmatrix}$$

$$\vec{p} = \begin{pmatrix} 1 \\ \frac{4}{\gcd(6,4)} \\ \frac{\mathrm{lcm}(6 \cdot 2,\, 4 \cdot 3)}{4} \end{pmatrix} = \begin{pmatrix} 1 \\ 2 \\ 3 \end{pmatrix} \quad \land \quad \vec{c} = \begin{pmatrix} 1 \\ 1 \\ 3 \end{pmatrix}$$

Similarly, the write and read orders on the sink edge are

$$O_{write} = \left[ \begin{pmatrix} 1 \\ 2 \\ 3 \end{pmatrix}, \begin{pmatrix} 2 \\ 2 \\ 3 \end{pmatrix} \right], \quad O_{read} = \left[ \begin{pmatrix} 1 \\ 1 \\ 3 \end{pmatrix}, \begin{pmatrix} 2 \\ 2 \\ 3 \end{pmatrix} \right]$$

The conversion of the mapped loop graph in the polyhedral model into the WSDF model of computation can be summarized as follows:

• Each communicating loop nest, written in single assignment form as a DPLA, can be inferred as an actor describing a node of the WSDF model of computation.

• The data structure used for communication between the nodes of the loop programs is a multi-dimensional array. This multi-dimensional array is represented as an iteration space, called data space, in the polyhedral notation. The information on data space, allocation, and scheduling can be used to derive the parameters of the edges in the WSDF model. The parameters include a producer and a consumer token vector, which produce common virtual tokens.

• The iteration spaces of the source and the sink I/O variables are seldom the same; i.e., there are complex dependencies between source and sink data elements arising due to different tiling and parallelization. In order to account for such incompatible arrays, a copy actor must be introduced to reorganize the source array tokens into the sink array in case the data tokens are not contiguous.

In the next section, we discuss the synthesis of dedicated communication channels called multi-dimensional FIFOs, which correspond to a WSDF edge.

4.2.2. Multi-dimensional FIFO: Architecture and Synthesis

The previous section explained the modeling of the semantics of communicating loop programs. In this section, we explain the communication primitive, which is generated for data transfer between the hardware loop accelerators, depending on the simplified WSDF parameters.


[Figure 4.9 shows the multi-dimensional FIFO between a source and a sink accelerator: the processing elements (PE1-PE4) of the source accelerator write via channel selectors into multiple memory channels (dual-port BRAMs), and the channel selectors on the sink side read the data out again. An I/O controller with a source iterator counter, a fill level control unit (full/empty, write_en/read_en), address generation with decoders, and a sink-side I/O controller with a sink iterator counter complete the architecture.]

Figure 4.9.: Hardware communication infrastructure.

Unlike classical FIFO communication, this communication primitive must support parallel write and read data access for the producer and consumer accelerator, respectively. In addition, out-of-order communication, arising due to different tiling and scheduling strategies, must be supported. A novel communication template called multi-dimensional FIFO (MD-FIFO) was presented by Keinert et al. in [109, 106] for the data exchange between the loop accelerators. Although it follows the FIFO semantics, it contains multiple channels, complex address generation hardware, and fill level control, which support parallel data access and out-of-order communication. An overview of the multi-dimensional FIFO architecture is shown in Figure 4.9. The architecture is similar to a one-dimensional FIFO due to interface signals like full, empty, enable, and data. A data read access can take place only if the empty flag indicates that data is available. Similarly, a write access can only take place when the full flag indicates that enough empty space is available for writing the data. The communication architecture can be divided into two parts:

• memory subsystem

• controller part, consisting of memory channel selectors, address generation, and the fill level control logic

The necessary input for generating this communication architecture is the semantics of the corresponding edges in the WSDF model, which in turn is obtained from the

mapped loop graph. We explain the different parts of the communication subsystem in the following example.

Example 4.2.2 In a scenario that requires the transmission of the edge map for feature extraction, the Sobel edge detection algorithm and the block-based DCT are part of the image coding pipeline. The output data space of the edge detection is a 512×512-sized array, where each data element of the image is produced sequentially. This data space can also be represented as $\mathcal{I}_{((1,1),(1,1),(512,512))}$ by an equivalent copartitioning$((1,1),(1,1))$. Each tile contains a single iteration, and there is only 1 PE, which generates the output array elements in a sequential, row-major order. Similarly, the input data space of the DCT Tiler accelerator is an 8×8×64×64-sized multi-dimensional array, where 8×8 elements are consumed by the DCT Tiler accelerator. This data space can also be represented as $\mathcal{I}_{((8,8),(1,1),(64,64))}$ by an equivalent copartitioning$((8,8),(1,1))$. Each tile contains 8×8 iterations. The PE array consumes 8×8 array elements sequentially, and then produces the blocks sequentially 64×64 times in a row-major order. Figure 4.10(b) shows the execution order of the edge detection source accelerator and the DCT Tiler sink accelerator. The WSDF parameters for the edge are the producer token vector $\vec{p} = (1\ 1)^T$, the consumer token vector $\vec{c} = (1\ 1)^T$, the virtual token vector $\vec{v} = (512\ 512)^T$, the source write order $O_{write} = [(1\ 1)^T, (512\ 1)^T, (512\ 512)^T]$, and the sink read order $O_{read} = [(1\ 1)^T, (1\ 8)^T, (8\ 8)^T, (512\ 512)^T]$. Similarly, the communication semantics between a horizontal and a vertical DCT, which together form the DCT operation, is shown in Figure 4.10(d).

The memory structure is characterized by the presence of multiple dual-port RAMs, which enable parallel reads and writes of multiple data elements. The parallel read and write access alongside the out-of-order communication can be supported only if the multi-dimensional array is partitioned into multiple memory channels. The memory channels can be mapped efficiently onto physical memories to exploit larger bit-widths, i.e., the parallel tokens are concatenated and are connected to and from a single memory cell. The number of memory channels is given by [106]

$$m = \mathrm{lcm}(\vec{p}, \vec{c})$$

Hence, for the ED-DCT example, only a single memory channel forms the communication subsystem. The address generation unit has two purposes. First, the correct data must be read. The memory address is derived by linearization in the production order; i.e., the source write address is simply incremented by one and mapped back to zero if equal to $B$. It is the task of the sink address generation to produce addresses for reading the correct data. The determination of the sink address $addr(\vec{i}_{DCT})$ is more intricate, since it must be determined using the corresponding source iteration vector $\vec{i}_{ED}$, which can lead to complex address generation schemes.
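The source-side linearization can be stated in two lines of C; the following helper is our own illustration of the increment-and-wrap behavior described above (the function name is hypothetical):

/* next source write address: linearized in production order,
   incremented by one and wrapped back to zero at buffer size B */
static unsigned next_wr_addr(unsigned addr, unsigned B)
{
    return (addr + 1 == B) ? 0 : addr + 1;
}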


[Figure 4.10 contents: (a) the loop graph Sobel → DCT Tiler → DCT with the data spaces $\mathcal{I}_{((1,1),(1,1),(512,512))}$, $\mathcal{I}_{((8,8),(1,1),(64,64))}$, and $\mathcal{I}_{((1,1),(8,1),(1,8))}$; (b) the source addresses written row by row (1, 2, ..., 512; 513, ...) versus the sink addresses read block by block (1..8, 513..520, ..., 3585, ...), together with the incremental sink address computation

IF(i1<7)                           addr := addr+1
ELSEIF(i1==7 && j1<7)              addr := addr+505
ELSEIF(i1==7 && j1==7 && i2<63)    addr := addr-3583
ELSEIF(i1==7 && j1==7 && i2==63)   addr := addr+1
ENDIF

(c) the corresponding WSDF parameters $\vec{p} = (1\ 1)^T$, $\vec{c} = (1\ 1)^T$, $\vec{v} = (512\ 512)^T$, $O_{read} = [(1\ 1)^T, (1\ 8)^T, (8\ 8)^T, (512\ 512)^T]$, $O_{write} = [(1\ 1)^T, (512\ 1)^T, (512\ 512)^T]$; and (d) hDCT → vDCT with the data spaces $\mathcal{I}_{((1,1),(8,1),(1,8))}$ and $\mathcal{I}_{((1,1),(1,8),(8,1))}$.]

Figure 4.10.: (a) The loop graph of a Sobel filter communicating with a horizontal DCT, which reads 8×8 data sequentially. (b) The numbers show the address of the data elements, (c) corresponding WSDF model, (d) WSDF model of horizontal DCT, communicating with vertical DCT.

For example, we need to generate 1, 2, 3, 4, 5, 6, 7, 8, 513, ..., 520, ..., 3585, ... as addresses. This can be simplified using the observation that appropriate increments lead to a simplified address generation with nested if-then-else statements [109]. This is also shown in Figure 4.10(b). The fact that linearization leads to simple addressing functions and good memory efficiency is the basis of several heuristics for determining the minimum memory requirement (i.e., determining the optimal $B$) [122, 106, 27]. Second, depending on the source and the sink address, the data must be forwarded to the correct memory channel in case of multiple channels. This can be done by comparing the generated address in the decoder of the channel selector, in order to generate the multiplexer and demultiplexer control signals. Hence, it is the task of the channel selectors to divert the incoming data tokens into the correct channel.

The fill level control unit is responsible for the synchronization with the source and sink accelerators, so that valid data is read correctly and is not overwritten. The fill level control part of the multi-dimensional FIFO is more complicated than that of a classical FIFO due to the out-of-order communication. It uses two counters to keep track of the number of data elements in the memory channels, depending on the amount of read and write accesses. Furthermore, the number of possible write and read operations for the source and the sink must be kept track of. Initially, the number of possible writes is equivalent to the buffer size, and the number of reads is zero. Every time a data item is written or read, the source and sink counters are decremented and incremented accordingly. The number of possible reads is also updated after every write operation for the sink level control by calculating the maximum number of allowable read operations [109]. This problem is formulated as a parametric integer linear program [106]. Future work is to obtain the conditionals for address generation and fill level control using the Hermite normal form.
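To make the increment scheme of Figure 4.10(b) concrete, the following self-contained C sketch (our own rendering; the loop variable names follow the figure) walks through the sink addresses of the ED-DCT example:

#include <stdio.h>

/* Incremental sink address generation for the ED-DCT example:
   a 512x512 image is read in 8x8 blocks; (i1,j1) iterate inside a
   block, i2 counts the blocks of a block row, j2 the block rows.
   The increments +1, +505, -3583, +1 are taken from Figure 4.10(b). */
int main(void)
{
    long addr = 1;  /* address of the first consumed data element */
    for (long j2 = 0; j2 < 64; j2++)
        for (long i2 = 0; i2 < 64; i2++)
            for (long j1 = 0; j1 < 8; j1++)
                for (long i1 = 0; i1 < 8; i1++) {
                    /* ... read the data element at 'addr' here ... */
                    if (i1 < 7)       addr += 1;     /* inside a block row    */
                    else if (j1 < 7)  addr += 505;   /* next row of the block */
                    else if (i2 < 63) addr -= 3583;  /* next block in the row */
                    else              addr += 1;     /* next block row        */
                }
    printf("final addr = %ld\n", addr);
    return 0;
}

In the next section, we examine the integration of accelerators in a system-on-chip.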

4.3. Synthesis of Accelerators for MPSoCs

The architecture model of an SoC consists of different components like CPUs (software), accelerators (hardware), memories, and connectivity controllers (e.g., TFT display, USB). The physical communication style between hardware, software, and other components is memory-mapped. As the name suggests, memory-mapped I/O needs addresses for the registers of each of the IP components in the shared address space; i.e., the CPU program communicates with the input, output, and status ports of the accelerator IP using normal memory read and write instructions. In a high-level language like C, one can use pointers for manipulating addresses.

Figure 4.11 shows the processing flow of such a hardware/software-based SoC. The CPU initiates the data transfer from the main memory. The data is then stored in an appropriate input memory of the accelerator. The accelerator subsystem reads data from its input memory depending on the status flags stored in the control registers of


[Figure 4.11 depicts the CPU, the main memory, the bus interface with read/write units, the control registers (BRAM controller), the glue logic, the input/output data registers, and the accelerator core.]

Figure 4.11.: Processing flow in an accelerator-based SoC system. 1) Data is fetched from the main memory by the CPU/DMA to the accelerator memory. 2) The accelerator executes and reads data in parallel. 3) Data is written back to the accelerator memory. 4) Data is stored back in the main memory by the CPU according to the finish control register status.

the accelerator. The computed data is then stored back in the output memory of the accelerator. The CPU polls the finish control register; depending on its status, the output data is then read out and transmitted to the main memory. The control status registers are used by the CPU to check the status of the memory (i.e., full/empty) and for starting or stopping the computation in the accelerator. The execution is single-threaded, as the CPU writes the data into the accelerator memory and waits for the computation in the accelerator to finish. This blocking synchronization means that steps 1 and 4 in Figure 4.11 are executed sequentially, whereas all other steps follow a pipelined model of execution. This is also known as busy-wait I/O, and the process of continuously reading the status registers is called polling [184]. Therefore, the abstract communication style is buffered, blocked, and asynchronous.

The major problem that needs to be addressed to support the memory mapping and processing flow is communication synthesis in the context of HW/SW communication (i.e., between the processor and the accelerator module). Communication synthesis involves the sub-problems of channel binding, communication refinement, and interface generation [62]. Channel binding defines the interconnect topology of the system. Among the different available interconnect possibilities, like point-to-point connections, buses, and networks-on-chip, we bind the accelerator/software communication channels onto buses. This loosely coupled communication offers a trade-off between performance and flexibility. However, the concepts introduced in this thesis are also


[Figure 4.12 contents: (a) a 2×2 processor array with the local buffers A_00, A_01, B_00, B_10 and C_00, C_01, C_10, C_11; (b) the generated memory map

/********************* Memory Map ****************
 * Base Address  Size  Description
 * ----------------------------------------------
 * 0x28000000     16   A
 * 0x28000010     16   B
 * 0x28000020     16   C
 * 0x28000030      1   Ctrl_A
 * 0x28000031      1   Ctrl_B
 * 0x28000032      1   Ctrl_C
 *************************************************/

with the buffer base addresses BaseAddr=0x28000000, Addr_A_00=BaseAddr+0x00, Addr_A_01=BaseAddr+0x08, Addr_B_00=BaseAddr+0x10, Addr_B_10=BaseAddr+0x18, Addr_C_00=BaseAddr+0x20, Addr_C_01=BaseAddr+0x24, Addr_C_10=BaseAddr+0x28, Addr_C_11=BaseAddr+0x2C, Addr_A_Ctrl=BaseAddr+0x30, Addr_B_Ctrl=BaseAddr+0x31, Addr_C_Ctrl=BaseAddr+0x32; (c) the C header definitions

#define A_BASE      (0x28000000)
#define B_BASE      (0x28000010)
#define C_BASE      (0x28000020)
#define Ctrl_A_BASE (0x28000030)
#define Ctrl_B_BASE (0x28000031)
#define Ctrl_C_BASE (0x28000032)
]

Figure 4.12.: (a) 2×2 processor array accelerator for matrix-matrix multiplication. (b) Memory map for the accelerator. (c) C header definition for accelerator access.

applicable to other alternatives. Communication refinement refers to the dimensioning of the features of the communication support; in our case, one can set the address or data width and the number of masters and slaves on the bus. Interface synthesis deals with the generation of hardware and software communication structures like hardware wrappers and device drivers. The channel binding and communication refinement are specified by the system architect. In the next subsection, we examine and solve the interface synthesis problem.

4.3.1. Interface Synthesis

The generic hardware/software interface template consists of adapters (hardware wrappers) and device drivers. The adapters are used to interface incompatible protocols; in our case, this means interfacing the bus signals with the accelerator signals. The device drivers are responsible for data transfer and synchronization between the accelerator and the processor. The software drivers in our case access the accelerator directly using memory-mapped I/O. Therefore, the support for memory mapping must be integrated in the adapters. The memory map, adapter (hardware wrapper or interface circuit), and device driver generation, which are necessary for interface synthesis, are discussed in the following subsections.

4.3.1.1. Accelerator Memory Map Generation

The first step is to establish a memory map in order to support memory-mapped I/O. The base address indicates the location of the accelerator registers or memory in the shared address space of the SoC. The memory map generation is illustrated with the help of the following example.


Example 4.3.1 A matrix-multiplication accelerator is shown in Figure 4.12(a). The variables A, B, and C are stored in 2, 2, and 4 local buffers of size 8, 8, and 4, respectively. The base addresses of the local buffers of the I/O variables A, B, and C are stored in the generated memory map as shown in Figure 4.12(b). These hexadecimal addresses are used in C header definitions for read and write access to/from the accelerator IP (see Figure 4.12(c)).

Algorithm 4.2 Accelerator memory map generation
Require: $v_a \in$ AG, hardware component of the architecture graph
Ensure: Memory map for $v_a$, MMap
1: set base address, baseAddress
2: // for all (in/out)going edges of $v_a$ whose src/snk are mapped onto a processor
3: for all $e \in E$ s.t. $(src(e) = v_a \land snk(e) = v_p) \lor (snk(e) = v_a \land src(e) = v_p)$ do
4:   // for all local buffers for storing variable $var = var(e)$
5:   for all $m \in \mathcal{M}_{var}$, $(\mathcal{M}_{var} = Q \mathcal{I}_{var})$ do
6:     key = memName(m); value = baseAddress
7:     MMap.insert(key, value)
8:     width = MemSize(m)
9:     baseAddress = baseAddress + width
10:  end for
11: end for
12: // for all (in/out)going edges of $v_a$ whose src/snk are mapped onto a processor
13: for all $e \in E$ s.t. $(src(e) = v_a \land snk(e) = v_p) \lor (snk(e) = v_a \land src(e) = v_p)$ do
14:   // for all control registers for storing variable $var = var(e)$
15:   if mode == RAM then
16:     value = baseAddress
17:     MMap.insert("statusReg", value)
18:   else
19:     for all I/O variables $var = var(e)$ do
20:       key = ctrlRegName(var); value = baseAddress
21:       MMap.insert(key, value)
22:       baseAddress = baseAddress + 1
23:     end for
24:   end if
25: end for

The base address of each of the local buffers is stored in the memory map. The base addresses are also required for generating an address-decoding FSM in the hardware wrapper. The pseudo-code for generating the memory map of an accelerator $v_a$ of the architecture graph is given in Algorithm 4.2. The memory map generation is needed only for those accelerators which are connected to a processor. The initial base address is selected

such that the accelerator address space does not lie in the reserved address space of the system. In order to generate the memory map of an accelerator, we iterate over all input and output edges of vertex $v_a$; then, for each I/O variable $var$ on the edge, we find the local buffers allocated to it. Using Equation (3.44), the memory space for I/O variable $var$ can be found. Intuitively, all the points in the memory space $\mathcal{M}_{var}$ of variable $var$ represent the index of a dedicated memory bank, which is connected to a PE in the accelerator (see Section 3.2.2.1). For each memory bank, a base address is assigned. The base address is obtained by adding to the last base address an offset equivalent to the size of the last memory bank. The same process of assigning a base address to all control registers is also performed. For the FIFO model, the status register contains the status information for each I/O variable. For the RAM model, the status registers consist only of a start and a finish signal.
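The address assignment loop of Algorithm 4.2 can be mimicked in a few lines of C. The following sketch is our own illustration; the bank names and sizes follow Example 4.3.1, while the data structure itself is hypothetical:

#include <stdio.h>

typedef struct { const char *name; unsigned size; } bank_t;

int main(void)
{
    /* local buffers and control registers of the matrix-multiplication
       accelerator of Example 4.3.1, placed back to back */
    bank_t banks[] = {
        {"A_00", 8}, {"A_01", 8}, {"B_00", 8}, {"B_10", 8},
        {"C_00", 4}, {"C_01", 4}, {"C_10", 4}, {"C_11", 4},
        {"Ctrl_A", 1}, {"Ctrl_B", 1}, {"Ctrl_C", 1},
    };
    unsigned base = 0x28000000u;  /* outside the reserved address space */
    for (size_t i = 0; i < sizeof banks / sizeof banks[0]; i++) {
        printf("%-6s -> 0x%08X\n", banks[i].name, base);
        base += banks[i].size;    /* next bank starts after this one */
    }
    return 0;
}

Running this reproduces the memory map of Figure 4.12(b), e.g., A_01 at 0x28000008 and C_00 at 0x28000020.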

4.3.1.2. Hardware Wrapper

In this section, we discuss the generation of a hardware wrapper in the context of hardware accelerator integration in an SoC. The major parts of the hardware wrapper are a protocol conversion FSM and an address decoding logic. The hardware wrapper must convert the bus signals into accelerator signals. In our approach, we solve the problem of protocol conversion by using a generic memory controller for translating the bus interface signals into memory-like interface signals (see Figure 4.11). The advantage of using a memory controller instead of customized logic is the free availability of a generic optimized memory controller for any particular bus, whereas an interface generator would have been required otherwise to generate the customized interface. Subsequently, the memory-like signals are converted into accelerator signals by so-called glue logic. In the glue logic, addresses from the bus are decoded to identify whether the loop accelerator is being accessed by the processor. Furthermore, the input/output/status data has to be routed to/from the corresponding accelerator memory. Therefore, glue logic is generated for routing the data correctly, depending on the address information. The glue logic performs the address decoding of the accelerator address and generates the synchronization signals for memory-mapped access, whereas the memory controller solves the protocol conversion problem for free. The architecture of the hardware wrapper is shown in Figure 4.13(b). Once the software device driver writes data in the accelerator address range, the BRAM controller converts the bus signals to memory-like signals and forwards them (i.e., address, data, and control signals) to the glue logic. The BRAM address signal (BRAM_Addr) to the accelerator memory is decoded so that the correct local buffer is selected for the load/store operation. The input data signal (BRAM_Din) reads the accelerator output data or the status register of the accelerator memory. The output data signal (BRAM_Dout) is used to transfer the input data to the accelerator memory. The accelerator IP designed by the user can have a variable number of buffers


[Figure 4.13 contents: (a) the Bram_Addr bit structure — 8 bits to select the base address of the accelerator, 21 bits to select the memory banks of the accelerator, and 3 bits for the bank address, e.g., 00100100000000000000000000001 001; (b) the glue logic with an address decoder, a demultiplexer routing Bram_Dout to the input buffers (A_00, A_01, B_00, B_10), and a multiplexer collecting Bram_Din from the output buffers (C_00, C_01, C_10, C_11) and the control register, driven by Bram_En, Bram_Addr, and Bram_WEn; (c) the address decoding logic:

s_data_prog_mmm_in_0_nid107_A00_true <= '1'
    when s_addr(28 to 29) = "00" else '0';
s_data_prog_mmm_in_0_nid107_A01_true <= '1'
    when s_addr(28 to 29) = "01" else '0';
s_data_prog_mmm_in_0_nid108_B00_true <= '1'
    when s_addr(28 to 29) = "10" else '0';
s_data_prog_mmm_in_0_nid107_B10_true <= '1'
    when s_addr(28 to 29) = "11" else '0';
]

Figure 4.13.: (a) Address big-endian byte structure of Bram_Addr, (b) glue logic architecture, (c) VHDL code snippet showing the address decoding logic.

(input/output) and control registers inside the IP; therefore, there is a need for automation of the hardware wrapper generation process. Address decoding is the process of generating the select signals by decoding the address signals for each memory bank in the accelerator system; i.e., the address decoder in the glue logic evaluates the address and enable signals to produce the select signals for the multiplexers/demultiplexers. The address bus lines are split into three sections via a 32-bit big-endian data structure, as shown in Figure 4.13(a). The L most significant bits are used to generate the select signal for the accelerator device, where the memory space is aligned at a 2^L byte boundary of the address space. The S least significant signals are passed on as addresses to the different memory banks or control registers. The bit region (L : 32−S) is reserved for selecting the local buffer or the control registers of the accelerator.

Algorithm 4.3 illustrates the steps for automatically generating the hardware wrapper of the accelerator IP. The required input is the accelerator memory map (see Algorithm 4.2), which contains the generated base address of each of the local buffers inside the accelerator IP. The memory map also contains the sizes of the buffers and control registers. The output is an RTL description in VHDL of the hardware wrapper, which consists of a memory controller, an address decoding mechanism, and multiplexer logic. At first, based on the maximum width of all the buffers, the algorithm computes how many bits S are required to access each of the locations inside a buffer. This is done by using a simple logarithmic function. Then, that many least significant bits of the address port are passed directly to the accelerator. Using the size of the alignment boundary, the number of most significant bits L corresponding to the accelerator region is calculated. Finally, the memory map is iterated over all keys, and depending on the base address value, the address decoding logic is generated for the selection of the correct buffer. The VHDL code snippet for the address decoding logic is shown in Figure 4.13(c).

It shows the decoding logic for routing the data corresponding to the address region into the correct buffer. The number of cases in the logic is equivalent to the number of regions inside the IP, which is the sum of the number of all buffers (input and output) and control registers inside the IP. The wiring of all generated components, i.e., memory controller, glue logic, and accelerator, gives the hardware wrapper of the accelerator IP.

Algorithm 4.3 Hardware wrapper generation for accelerator $v_a$
Require: memory map of $v_a$, width, busWidth
Ensure: HW wrapper (VHDL)
1: set generics BaseAddress, HighAddress, and DataWidth for the BRAM controller IP
2: generate BRAM controller
3: // least significant bits for storing the address within a local buffer
4: S = log2(width)
5: // most significant bits for selecting the accelerator
6: L = log2(accelerator alignment boundary)
7: reserve ((busWidth − S) : busWidth) bits for the local buffer address
8: reserve (L : (busWidth − S)) bits for the selection of the local buffer or control register region
9: iterate over the memory map to generate the address decoder logic
10: generate glue logic
11: wire BRAM controller, glue logic, and accelerator RTL to form the hardware wrapper

Example 4.3.2 For the memory map of the matrix multiplication algorithm in Example 4.3.1, the alignment is done at a 256 = 2^8 byte boundary. The hexadecimal value 0x28000000 is assigned as base address to the loop accelerator, as shown in Figure 4.12. The L = 8 most significant bits encode the base address information of the accelerator: 0x28 will be decoded from the address and used to detect an access to the accelerator. The address 0x28000009, where the base address is offset by the hexadecimal value 0x09, points to the buffer A01. The bits (8:29) are used to select the access to memory bank A01. The last 3 least significant bits, as also shown in Figure 4.13(a), are passed on as the address to the memory bank A01 for writing data (BRAM_Dout).
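The address split of this example can be checked with a few shifts and masks. The following C sketch is our own illustration; for simplicity it only covers the two-bit bank field of the VHDL snippet in Figure 4.13(c):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint32_t addr = 0x28000009u;         /* access to buffer A01          */
    uint32_t dev  =  addr >> 24;         /* L = 8 MSBs: device select     */
    uint32_t bank = (addr >> 3) & 0x3u;  /* 2 bits: bank A00/A01/B00/B10  */
    uint32_t off  =  addr & 0x7u;        /* S = 3 LSBs: word in the bank  */
    printf("dev=0x%02X bank=%u offset=%u\n",
           (unsigned)dev, (unsigned)bank, (unsigned)off);
    /* prints dev=0x28 bank=1 offset=1, i.e., word 1 of buffer A01 */
    return 0;
}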

4.3.1.3. Software Driver

In embedded systems design, it is a recommended approach to provide drivers for accessing the IPs from processors in order to hide the complexity from the end user. Therefore, for all accelerator IPs communicating with processors, it is necessary to provide drivers for data communication and synchronization purposes. In this section, we discuss the automated generation of driver programs.


[Figure 4.14 contents: the loop graph P → A → C, where the producer P() and the consumer C() run on the processor and access the accelerator A:

volatile int *OUT_BASE=BRAM_CNTLR_0_BASEADDR+0x1000;
volatile int *IN_BASE=BRAM_CNTLR_0_BASEADDR;
volatile int *CTRL_BASE=BRAM_CNTLR_0_BASEADDR+0x2000;

P(){
  for(i=0; i<64; i++)
    for(j=0; j<64; j++) {
      out[i,j]=f(...);
      streamWrite(*out, i, j);
    }
}

streamWrite(int* out, int i, int j){
  if(*CTRL_BASE!=FULL)
    OUT_BASE[i*64+j]=out[i*64+j];
}

C(){
  for(i=0; i<64; i++)
    for(j=0; j<64; j++) {
      streamRead(*in, i, j);
      ... = g(in);
    }
}

streamRead(int* in, int i, int j){
  if(*(CTRL_BASE+0x1)!=EMPTY)
    in[i*64+j]=IN_BASE[i*64+j];
}
]

Figure 4.14.: The producer and consumer loop programs running on a processor using the memory map information, and driver routines for accessing the accelerator.

The software driver consists of a header definition of the accelerator memory map and stream read/write functions for reading and writing data from/to the accelerator. These read/write functions must be inserted into the program running on the processor. Figure 4.14 shows the structure of the loop programs running on a processor and a non-programmable accelerator, respectively. The driver provides a programming interface consisting of the streamRead() and streamWrite() functions, as shown in Figure 4.14. The functions P() and C() denote the producer and consumer loop programs running on the processor, which access the accelerator. These loop programs, apart from executing their own kernel, are also responsible for the data transfer to and from the accelerator. The set of nested loops iterates over the corresponding iteration domains and updates the I/O variables. The streamWrite() and streamRead() functions then transfer the output and input data to and from the accelerator, respectively. The user may need to insert qualifying if statements if the I/O domain is not the same as the iteration domain. The following steps need to be undertaken for the generation of a software driver (i.e., streamRead() and streamWrite()):

1. parse the memory map and generate the C header information

2. generate the synchronization part of the software driver

3. generate the read and write procedures of the software driver functions

In the first step, a C-language header file that provides an abstract interface to the hardware is generated. It allows access to the local buffers and control registers of the accelerators by names rather than by addresses. This offers portability in the sense that if the memory map of the accelerator changes, then only the header definitions need to be changed. For each key and base address pair in the memory map, a define macro is generated to assign a symbolic key name to the base address, as shown in Figure 4.14. The memory-mapped buffers and control registers can now be manipulated with the help of pointers or array accesses. The access variable is defined as volatile. The volatile definition implies that the compiler cannot make any assumptions about the data being stored in the registers, since the data can be changed not only by software, but also by the accelerator. Therefore, incorrect compiler optimizations can be avoided. It must be noted that the hardware wrapper routes the data into the correct memory bank depending on the address of the read or write statement. Therefore, the software driver need not address each local buffer by name but only the base address of the buffer corresponding to the memory variable. Polling is used for synchronization: the processor repeatedly checks whether data can be written to or read from the accelerator. If the accelerator memory is configured as a FIFO, then polling checks whether it is empty or full for synchronizing the read and write operations. If the accelerator memory is configured in RAM mode, then the number of data elements to be read or written is equivalent to the buffer size. The synchronization is done through start and finish signals in the control registers of the accelerator. Therefore, the synchronization code is added as a guard to the read/write statements. In the final step, the software driver module must be generated, which reads and/or writes that particular accelerator's control registers directly.
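For the RAM-mode case, the polling-based synchronization can be sketched as follows. This is our own illustration: the register addresses are borrowed from Example 4.3.1, and the names and flag values are hypothetical, not generated PARO identifiers:

#include <stdint.h>

#define CTRL_START  ((volatile uint32_t *)0x28000030u)
#define CTRL_FINISH ((volatile uint32_t *)0x28000032u)
#define OUT_BASE    ((volatile uint32_t *)0x28000020u)

/* RAM mode: start the accelerator, busy-wait on the finish flag,
   then read back a fixed number of elements (the buffer size). */
void run_and_read(uint32_t *result, int n)
{
    *CTRL_START = 1;             /* kick off the computation        */
    while (*CTRL_FINISH == 0)    /* busy-wait I/O (polling)         */
        ;                        /* volatile forces a fresh re-read */
    for (int i = 0; i < n; i++)
        result[i] = OUT_BASE[i];
}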

4.3.2. Accelerator Integration in SoC

In this section, we present a design method putting together accelerator synthesis and communication synthesis for a seamless integration of accelerators in an SoC.

In the previous sections, we discussed the modeling of communicating loops as building blocks of complex applications. After binding the loop graph to an architecture graph, we have a set of loops which are assigned for implementation onto different system components. These components can be processors executing software loops or dedicated hardware loop accelerators.

The task of communication synthesis is to obtain the implementation of the abstract channels interconnecting the system components. In the previous section, we transformed the mapped loop graph (i.e., the loop graph with allocation and scheduling information) in the polyhedral model into a data flow description called the windowed synchronous data flow (WSDF) model. Using this representation, we are able to synthesize hardware channels called multi-dimensional FIFOs for accelerator (HW) to accelerator (HW) communication. In addition, a hardware wrapper, a software driver, and a memory map are generated in order to interconnect processors (SW) and accelerators (HW).

In this section, we discuss the design flow which puts together communication synthesis and accelerator synthesis for the integration of accelerators in a given multiprocessor SoC platform. There has been a flurry of research in the field of automated SoC synthesis and accelerator IP integration. In [61], hardware/software interfaces are generated from data flow descriptions; however, it does not take into account


[Figure 4.15 depicts the design flow: a loop graph with RDGs is annotated by HW-SW partitioning with allocation, scheduling, and resource binding; accelerator synthesis (HW) and software synthesis (SW) then proceed alongside interface synthesis. Interface synthesis comprises the WSDF conversion of the communication subsystem edges, $\langle \mathcal{I}, RDG, Q, L \rangle \to \langle \vec{p}, \vec{v}, \vec{c}, O_{src}, O_{snk} \rangle$, as well as memory map generation and software driver generation (e.g., streamWrite(){...}). The final composition wires the processor with the application software and driver, the interface (glue logic, memory map, BRAM controller FSM), and the accelerator.]

Figure 4.15.: Design flow for accelerator integration in an SoC.

loop scheduling and allocation. Loop scheduling and allocation is addressed in [15]; however, it concentrates only on hardware wrapper and interface generation aspects, and the accelerator driver generation remains the task of the programmer. In the Chinook project [31], device drivers and hardware wrappers are generated from a control flow graph. However, the communication between the processor and the accelerator is fine-grained, as the CFG depicts dependencies between intra-loop operations and not between different loops. A library-based approach is used for channel implementation, i.e., protocol conversion and device drivers, in [36]. A memory mapping algorithm is presented in [125] which, using the scheduling information, directs interface synthesis such that a minimum number of registers and multiplexers is required in the hardware wrapper. In our synthesis approach, we solve the problem of the HW/SW communication subsystem for communicating loops in its entirety.

Figure 4.15 shows our design methodology for composing an SoC containing loop

accelerators. Starting with an initial application description of communicating loop programs, the initial step of HW-SW partitioning defines the mapping of the loop graph onto a given abstract SoC architecture description. The task of automated SoC synthesis is to generate a platform model with the given processors, accelerators, and communication subsystem from the abstract problem specification. After the initial mapping, the loop graph is annotated with binding, scheduling, and allocation information. Subsequently, the synthesis of the accelerators, the software for the processors, and the communication interfaces is undertaken. These tasks can be carried out independently of each other. The accelerator synthesis is done by applying the design flow trajectory shown in Figure 4.15. The software synthesis maps the sequential implementation of the loops onto the microprocessors; it is the responsibility of the user to provide the software descriptions. The platform components are instantiated from the architecture library. Interface synthesis establishes the communication infrastructure and is categorized into different cases. If the source and sink loops are both implemented as accelerators, then a dedicated communication primitive with FIFO semantics is generated. If either the source or the sink is implemented on a processor, then a memory map of the accelerator, a device driver for the processor, and an interface circuit for the accelerator need to be generated additionally. The software running on the processors must then be modified by adding the driver routines. During the final integration step, the generated RTL, the hardware/software interface, the software programs, and the platform instance are assembled together. Embedded compiler and linker tools for the software part and synthesis tools for realizing the hardware accelerators aid this composition. There are several tools which aid the creation of the overall system-on-chip, like the Xilinx Embedded Development Kit (EDK) [187], as used in our case.

4.4. Results

In this section, we study the overhead of the communication hardware and the interface components in terms of performance and area cost.

4.4.1. Overhead of Communication Primitives

We use the discussed examples as case studies to quantify the area and throughput overhead of the multi-dimensional FIFO. Table 4.2 summarizes the area requirement of the different components of the dedicated hardware pipeline for the applications. The accelerator area in Table 4.2 is equivalent to the area cost of the processing hardware. The communication area is the overhead of the multi-dimensional FIFO connecting the processing hardware. The clock frequency denotes the attainable clock speed of the communication architecture. All synthesis results are obtained using


Application   Accelerator       Comm. area        Clock   Overhead
              (LUT,FF,BRAM)     (LUT,FF,BRAM)     (MHz)   (%)
SDE           (2598,2068,15)    (372,204,0)       350     (11.2, 0)
ED-DCT        (1437,1493,3)     (280,157,5)       320     (13.4, 62.5)
hDCT-vDCT     (2044,2426,0)     (1717,2314,22)    321     (47.1, 100)

Table 4.2.: Area and clock overhead of communication subsystem for a dedicated accelerator pipeline. The overhead percentage gives the logic and memory overhead of the communication subsystem.

Xilinx ISE 9.2 on a Xilinx Virtex-2 FPGA (xc2v8000-4-ff1517). All applications are characterized by an input and an output bit-width of 16.

The SDE (stereo depth extraction) accelerator consists of dedicated hardware for each of the loop programs. It has 5 accelerators and 4 multi-dimensional FIFOs, which allow a throughput rate of one pixel per cycle. Since the in-order communication and sequential execution could also be handled by classical FIFOs, it is important to observe the overhead of the multi-dimensional FIFOs. The area overhead of the FIFOs is proportional to log2(B), where B is the depth of the memory channel. The memory size B = 2 is chosen to quantify the overhead. Each of the FIFOs has an area cost of (93 LUTs, 52 FFs, 0 BRAMs).

The ED-DCT is the accelerator for the communicating edge detection and the discrete cosine transform tiler (DCT-Tiler). The dedicated hardware implementation consists of two accelerators, where the DCT-Tiler and the edge detection process one pixel per cycle. The communication is of out-of-order nature. With a minimum memory size of the channel, B = 512 × 7 + 8, the communication subsystem consists of 5 BRAMs.

The last example, the communicating horizontal DCT and vertical DCT, needs a fully parallel communication subsystem to support a throughput of 8 pixels per cycle. Therefore, parallel access in combination with out-of-order communication leads to a high area cost of the communication subsystem to support the transpose operation. The transported multi-dimensional data array is partitioned onto 64 virtual memory channels, which can later be efficiently mapped onto 22 physical memories.

The multi-dimensional FIFOs are not a throughput bottleneck, as the critical path does not lie in the communication subsystem, and they can be used in high-performance applications. The overhead of communication can be quantified as a percentage of the total area cost. This is calculated with the help of a slice-packing formula (see Equation (5.3)). One may observe that, with an optimal memory size, the area overhead of the communication architecture depends on the nature of the communication, as also shown in Table 4.2. In case of the simple in-order communication for stereo depth extraction, the FIFO subsystem accounts for only 11% and 0% of the logic and memory


            SW        HW      HW/SW    Communication
Time (µs)   253905    656     131906   131155
Speedup     1x        387x    1.92x    —

Table 4.4.: The execution time of a 256×256 Sobel edge detection for the software, hardware, and hardware/software co-design implementation variants.

area overhead, respectively. The ED-DCT example needs to store B = 512 × 7 + 8 data tokens in the communication subsystem. Therefore, it accounts for 13% and 62% of the logic and memory area, respectively. Finally, the out-of-order communication of hDCT-vDCT is associated not only with complex FIFO logic, but also with a large number of memories for providing parallel data access. This is shown by 47% and 100% of logic area and memory, respectively. One can summarize that, depending on the type of communication, a large amount of logic and memory resources may be required for the communication subsystem within the whole accelerator subsystem. Therefore, one may conclude that communication synthesis is as important as the accelerator synthesis problem. We solve this often neglected problem of communication synthesis efficiently.

4.4.2. Accelerators as Components in SoC

A hardware wrapper is used to embed accelerators as components in an SoC over a bus. The hardware wrapper consists of a memory controller and glue logic for the conversion of signals. The fixed part is the BRAM interface controller for protocol conversion; it has an area requirement of 25 LUTs and 9 FFs for a 32-bit address and data width, and a maximum frequency of 176 MHz on a Xilinx Virtex2Pro FPGA. The area of the glue logic depends on the number of accelerator buffers and their size. The glue logic for a simple edge detection accelerator with a single input and output memory and a few status registers has an area requirement of 20 LUTs and 68 flip-flops. Therefore, for the edge detection example, the hardware wrapper makes up 4% (using the slice-packing formula in Equation (5.3)) of the area cost of the accelerator component.

In the previous chapter, the peak performance of the hardware accelerators was presented. Those numbers were obtained under the assumption that communication is not a bottleneck. Here, we show the performance gain of a hardware/software implementation over a software implementation in an SoC, where the software is responsible for the communication. We use the edge detection example for illustrating the observations. The loop program implements a Sobel edge detection for a 256×256 image. The system architecture consists of a software implementation on a 300 MHz


PowerPC CPU on a Virtex2Pro FPGA (XC2VP30) with 256 MB DDR-RAM. A 100 MHz clock is used for the processor local bus and the accelerator. The whole synthesis is done with the help of the Xilinx EDK and ISE tools [187]. The performance gain over the software implementation, as shown in Table 4.4, is only 2x. This is in stark contrast to the performance gain of 387x predicted when using the loop accelerator in a complete hardware solution. The reason is the communication bottleneck: the communication of the data is sequential and takes up to 52% of the total execution time. The hardware accelerator therefore only reduces the remaining 48% of the execution time through parallel execution; due to Amdahl's law, only a meager speedup of 2x is obtained, which cannot be improved even by using more parallelization. This suggests the use of DMA and other efficient data caching schemes for reducing the communication time. For matrix-matrix multiplication with matrices of size 256 × 256, the accelerator/processor solution (196,057 µs) obtains a speed-up of 26x over the software solution (5,117,087 µs). This problem has a higher computational intensity, i.e., the computation to communication ratio is higher for this loop benchmark.
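The Amdahl argument can be checked with a few lines of arithmetic. The following Python sketch uses only the values of Table 4.4; all variable names are illustrative.

t_sw = 253905e-6     # software-only execution time in seconds (Table 4.4)
t_comm = 131155e-6   # sequential communication time in seconds
hw_speedup = 387.0   # speedup of the pure compute part in hardware

f_seq = t_comm / t_sw   # non-accelerated (sequential) fraction, ~0.52
t_hwsw = f_seq * t_sw + (1.0 - f_seq) * t_sw / hw_speedup

print(f"sequential fraction: {f_seq:.2f}")
print(f"overall speedup: {t_sw / t_hwsw:.2f}x")  # ~1.9x, matching Table 4.4
# Even an infinitely fast accelerator is capped at 1/f_seq, i.e. ~1.9x.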

4.5. Conclusion

In this chapter, we have presented a novel bridge from the polyhedral model to the data flow model. This derivation not only allows a clean representation of communication semantics and an analysis capability, but also allows the automated synthesis of dedicated communication subsystems [109]. The communication primitive supports parallel access and out-of-order communication with a simple FIFO-like interface. The presented graph model for representing communicating loops in the polyhedral framework can be targeted for mapping onto system-on-chip platforms. In order to use loop accelerators as co-processors or components in a heterogeneous SoC, the automated generation of memory maps, hardware wrappers, and software drivers, based on the allocation and scheduling parameters of the loop accelerator, is undertaken. These components enable integration over a system bus and the access of a loop accelerator from the software running on the CPU. The results show that communication synthesis is an important problem, which is solved efficiently in this chapter so that the data transfer is not a bottleneck for the performance gain.

5. Design Space Exploration: Accelerator Tuning

In a real-world accelerator design problem, it is imperative to select a best-fit accelerator architecture in terms of multiple objectives, given several constraints and workload scenarios. A best-fit solution is normally a Pareto-optimal accelerator design in terms of the major objectives of area, power, and throughput, with enough performance to satisfy the system specifications on area/power cost and quality of service (QoS).

In the previous chapters, a design methodology that aims at the automatic generation of accelerator subsystems for computationally intensive algorithms was presented. A major problem faced by system architects using such a methodology is an explosion in the number of designs in terms of accelerator properties such as area, performance, and power. This happens because of the freedom in the selection of major compiler transformations such as tiling (different tile size, shape, and strategy) and architecture configurations (different number of functional units and registers in the architecture model). The parameters span a so-called design space leading to solutions of different quality. Therefore, the problem of finding a best-fit accelerator subsystem entails efficient design space exploration (DSE).

The three important traits of a good design space search are that it is as fast, as accurate, and as early as possible in the design methodology. Usually, the huge number of design points in the search space and their tedious evaluation forbid the exhaustive determination of a best-fit solution. Therefore, intelligent search based on heuristics, coupled with a succinct evaluation of the different system properties, is of utmost importance for fast exploration. Furthermore, fast and early evaluation must be balanced against the contrary need for accurate estimation of system properties in early design phases. The problem of finding a best-fit accelerator can be divided into the following problems:

• Efficient search of Pareto-optimal accelerator designs

• Selection of a best-fit design from the set of Pareto-optimal designs based on system specifications and workload behavior

The first problem involves accelerator modeling based on the architecture and compiler parameters, identification of objectives, and the choice of an evaluation function for the determination or approximation of the objectives. Subsequently, different

search techniques like random search, evolutionary algorithms, and others can be used for the expeditious location of feasible Pareto solutions. The second problem involves matching the characteristics of the accelerator engines to the system requirements on area, power, and performance. An efficient design space exploration for accelerator architectures requires solving both problems.

Therefore, this chapter is organized into two sections. Section 5.1 deals with the first problem, i.e., the exploration of Pareto-optimal accelerator designs. The model representation and objective evaluation, which are necessary prerequisites for design space exploration, are presented in Sections 5.1.1–5.1.3. Finally, the optimization engine for the identification of Pareto-optimal designs is treated in Section 5.1.4. The second major problem, the identification of best-fit accelerators from Pareto-optimal designs, given the system requirements, is discussed in Section 5.2.

5.1. Single Accelerator Exploration

One of the fundamental problems of computer engineering is architecture and application optimization with respect to multiple objectives. We discussed some of the related work on multi-objective design space exploration (DSE) in Section 2.4 and observed that multi-objective Evolutionary Algorithms (MOEAs) are often used for DSE at the system level, the software level, and in high-level synthesis. Since we deal with non-programmable accelerators for FPGAs, our DSE problem involves high-level synthesis. Novel to our exploration approach, compared to other related work, is the use of MOEAs with elitism and the consideration of the compiler transformation of loop tiling. Most other works on DSE in high-level synthesis concentrate only on the influence of architectural parameters like resource allocation [145, 114], consider at most loop unrolling as a compiler parameter [161, 118, 9], and often do not use modern heuristics for exploration. The design space exploration problem is illustrated with the help of the following example.

Example 5.1.1 The discrete cosine transform (DCT) is chosen to illustrate the design space exploration problem. The data flow graph of the DCT algorithm is shown in Figure 5.1(a). It contains 32 multiplications, 16 additions, and 16 subtractions. The allocation of the maximum number of resources (32,16,16) leads to the highest throughput, i.e., eight pixels are processed every II = 1 cycle, whereas the minimum resource allocation (1,1,1) leads to the minimum throughput, i.e., eight pixels are produced every II = 32 cycles. There are 32 × 16 × 16 = 8192 possible combinations of architectural parameters that lead to diverse accelerator designs. The effect of allocating a different number of adders and multipliers on the throughput and accelerator area is shown in Figure 5.1(b). The filled points show the Pareto-optimal accelerator designs. Therefore, the major challenge is to discover the Pareto front, or a set of solutions near to it, as quickly as possible using search heuristics.


Figure 5.1.: (a) Data flow graph of an 8-point DCT; all operations are constant additions, subtractions, and multiplications. (b) Area vs. throughput trade-off for different DCT accelerator configurations.

Design space exploration for identifying Pareto-optimal accelerator designs is a difficult problem. Firstly, the number of possible design points proliferates very fast. For the simple example of the DCT, one can have 8192 different accelerator designs considering only architecture parameters, i.e., the resource allocation of different functional units. Similarly, for the matrix multiplication (for 64 × 64 matrices), depending only on a compiler parameter like loop tiling, one can have 64 × 64 × 64 = 2^18 design points. Therefore, there exists an absurdly large number of possible accelerator designs when varying both architecture and compiler parameters. This forbids an exhaustive search for finding the best solutions in terms of the objectives of performance, area, and power. Normally, the exact evaluation of each design point requires scheduling and synthesis. Static scheduling based on integer linear programming (ILP) has an execution time in the order of seconds. It gives the statistics on performance in terms of throughput and latency. The subsequently generated RTL description of the accelerator needs to be simulated and evaluated by ASIC or FPGA synthesis tools to obtain the area and power statistics. This is a time-consuming process taking up to several minutes. Hence, RTL synthesis is the bottleneck in the fast evaluation of designs.

For instance, for the 2^18 ≈ 0.26 million design points, with an evaluation time of 30 minutes per point, an exhaustive search would require almost 15 years. This motivates the need for an efficient design space exploration. In order to speed up the exploration process, a two-pronged strategy is needed, which uses (a) intelligent search algorithms based on modern heuristics and (b) fast evaluation functions based on estimation.
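The estimate is easily verified with a short calculation, sketched here in Python:

points = 2 ** 18                 # ~0.26 million design points
total_minutes = points * 30      # 30 minutes of RTL synthesis per point
years = total_minutes / 60 / 24 / 365
print(f"{points} points -> {years:.1f} years")   # ~15.0 years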


5.1.1. Model Representation and Problem Definition

The architecture and compiler parameters are the key knobs which lead to an explosion of the design space of customized accelerators. Let SA denote the set of possible architecture parameters and SC the set of possible compiler parameters in the following. Then the accelerator design space SD, consisting of all feasible accelerator design points, can be defined by

SD = SA × SC = {(sa, sc) | sa ∈ SA ∧ sc ∈ SC}

In general, the architecture of a programmable or a non-programmable accelerator is described by several parameters. The major difference is that for non-programmable accelerators, some architecture parameters depend on the compiler transformations. For example, the number of PEs, the number of memory banks, and the size of the local buffers are determined by the tiling matrix and strategy. The resource allocation of functional units (FUs), like the number of adders, multipliers, and others, comprises architectural parameters with a major influence on the accelerator area and power cost, as already shown in Chapter 3. However, fewer FUs do not necessarily imply less area, since a larger number of registers as well as multiplexers to select their correct input would be required; the allocation of more FUs leads to a reduced number of registers and multiplexers, and less control overhead. The number of available registers can also be allocated; thus, it is also an architecture parameter. For programmable architectures, the architecture parameters are independent of the compiler transformations as they are fixed by the system designer; in this case, the number of integer units, float units, branch units, memory units, the register file size, and the cache (size, associativity, ...) are the usual architecture parameters. In the following sections, we will concentrate on non-programmable accelerators, although the methodology can easily be extended to programmable accelerators. Our accelerator architecture parameter space SA consists of

• The resource allocation of each distinct functional unit (FU) within a PE of the accelerator

• The number of allocated registers within a PE

The set of possible architecture parameters can be represented as follows:

SA = {X ∈ Z^n | X = (x1, x2, ..., xn), L1 ≤ x1 ≤ U1, ..., Ln ≤ xn ≤ Un}

Any allocation can be represented by an n-dimensional vector X = (x1, x2, ..., xn), where the i-th component of the vector represents the allocated number of units of the i-th resource. This resource can be a functional unit or a register. For the purpose of design space exploration, an upper bound Ui and a lower bound Li, representing

the maximum and minimum possible number of the i-th resource allocated, are given in order to limit the search space. For example, (2,8,4,2) could be a possible resource allocation of adders, multipliers, shifters, and registers, respectively.

The compiler parameter space can consist of standard optimizations and loop transformations. Standard compiler optimizations like dead-code elimination, common sub-expression elimination, and others are integrated in our compiler for the generation of non-programmable accelerators; these pre-processing transformations are always applied by default. Therefore, our compiler transformation space SC consists only of

• loop tiling parameters (tile size, tiling strategy)

Loop tiling has a major influence on the design objectives. It not only determines the degree of parallelism of the accelerator through the number of processor elements (PEs), but also the granularity of communication and the local buffer requirements, as elaborated in Chapter 3. For the sake of simplicity of representation, we consider only rectangular tiles in the following. In this case, the tiling matrix is a diagonal matrix, i.e., P = diag(t1, t2, ..., tn). This can also be represented by an n-dimensional vector T = (t1, t2, ..., tn). The application of tiling partitions the iteration space into tiles. The tiling strategy, i.e., LPGS or LSGP, allocates PEs onto inner tiles or outer tiles, respectively¹. Therefore, we define the compiler parameter space as

SC = {T ∈ Z^n | T = (t1, t2, ..., tn)^t, L1 ≤ t1 ≤ U1, ..., Ln ≤ tn ≤ Un}

The compiler space defines the set of possible tiling matrices for loop tiling. The upper and lower bounds for the tile sizes can be specified to limit the design space. For a given loop program, the configuration space of its corresponding loop accelerator design is then given by

SD = SA × SC

The configuration parameters of the design space need to be encoded as a so-called genotype. In the terminology of Evolutionary Algorithms, the genotype is the genetic representation of an individual. An example of the representation of the configuration space of our DSE problem parameters as a genotype is shown in Figure 5.2. The first part of the genotype encodes the tiling matrix; the second part encodes the allocation of functional units. After modelling the design parameters as a genotype, we discuss the objectives of the accelerator design space exploration in the next section.

¹In case of n-hierarchical tiling, n tiling matrices are required. In our study, we limit n = 1, i.e., LSGP and LPGS are the only strategies considered during design space exploration; that is, simple tiling is considered.



Figure 5.2.: Integer genotype, encoding the architecture constraints and compiler parameters for accelerator synthesis.
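To make the encoding concrete, the following Python sketch shows one possible genotype representation following Figure 5.2. The gene names and bounds are illustrative assumptions, not the exact PARO encoding.

from dataclasses import dataclass
import random

@dataclass
class Gene:
    name: str   # e.g. "tile_x" (compiler) or "ADD" (architecture)
    low: int    # lower bound L_i
    high: int   # upper bound U_i

# Compiler parameters (tile sizes) first, then FU allocations, as in Figure 5.2.
GENES = [Gene("tile_x", 1, 64), Gene("tile_y", 1, 64),
         Gene("ADD", 1, 16), Gene("SUB", 1, 16), Gene("MUL", 1, 32)]

def random_genotype(genes=GENES):
    # Sample one point of the design space S_D = S_A x S_C.
    return [random.randint(g.low, g.high) for g in genes]

print(random_genotype())   # e.g. [8, 4, 2, 1, 8]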

5.1.2. Multiple Objectives

After defining the search space of the design space exploration problem, we need to define objective functions for comparing the designs. For design space exploration, the following objective functions will be optimized simultaneously for determining the set of Pareto-optimal accelerator designs:

• Performance: The first objective is to minimize the maximum execution latency L, or alternatively to minimize the iteration interval II. The throughput and the latency are given in terms of the iteration interval II (cycles) and the number of execution cycles L for computing the complete problem instance.

• Area: The area cost A of the hardware accelerator is determined by the resource usage incurred by the set of allocated components: PEs, controller, and interface logic of the accelerator. Since all the benchmarks have been studied for FPGA technology, the area will be estimated by the number of required slices, as they are the basic building blocks of an FPGA.

• Power: Average static and dynamic power dissipation plays a major role in determining the battery life of accelerator-based embedded SoCs. Therefore, minimizing the power P is another important objective.

Definition 5.1.1 A multi-objective optimization problem can be defined formally as

minimize F(X) = (f1(X), f2(X), ..., fn(X))
subject to X = (x1, x2, ..., xk) ∈ F

with n objective functions defined by a vector of objective functions F(X) and k optimization variables defined by the vector X. F defines the feasible search space for the optimization variables.

Therefore, the formal description of the accelerator design space exploration problem is the following multi-objective optimization formulation:

minimize F(X) = (L(X), A(X), P(X))    (5.1)


where the vector X ∈ SD contains the architecture and compiler parameters for a particular accelerator design. The optimization variables, i.e., the compiler and architecture parameters, are represented as a genotype (see Figure 5.2). The system designer would like to find the set of Pareto-optimal designs in terms of performance L, area A, and power P. In the next section, we discuss the efficient evaluation of these objectives.

5.1.3. Objective Functions

After naming the different objectives in the previous section, this section describes the objective functions for determining the area, power, and performance of the loop accelerator. The design space exploration must use the evaluation information on area and power to search for Pareto-optimal solutions. One may follow three different approaches for the determination of the objective functions for area, power, and performance. The first approach analyzes only the dependence graph of the loop program and considers the compiler and architecture constraints in order to estimate area and power [127]. The second approach performs high-level synthesis to determine the RTL components of the accelerator; the area and power macro-models of the RTL components are summed together to obtain the objective function, similar to [29]. The third approach even performs place and route of the RTL components using state-of-the-art synthesis tools so as to evaluate area and power more accurately. Such synthesis tools not only convert the accelerator RTL description into a netlist, but also give precise numbers on area, power, and clock frequency. The drawback is that synthesis may take a considerably long time. We follow the second approach in the following, as it offers a trade-off between the accuracy of the estimates and the run-time of the exploration.

Therefore, given the RDG of the loop program and the representation of the architecture and compiler parameters as a genotype, only the high-level synthesis problem must be solved for each explored design point. The steps involved are scheduling and RTL synthesis, which are elaborated in Chapter 2. The scheduling problem is solved by a mixed integer linear program (MILP) formulation of the given architecture constraints, with the minimization of the maximum latency as objective function. Thereby, we obtain the performance characterized by the latency L or the iteration interval II. As also shown in Figure 5.1(b), numerous candidate solutions may lead to the same performance, but they may have different area and power. This internal structure of area and power is not available in the formulation of the scheduling problem, since information on multiplexers and registers is only available after synthesis. Therefore, it is necessary to estimate the area cost A and the power consumption P from an accelerator RTL description. Our estimation approach for area and power is presented in the following subsection.


5.1.3.1. Rapid Estimation Models

In order to speed up the automated exploration, estimation models are developed for the determination of area and power.

Area The area cost of an accelerator implementation on Virtex FPGAs is determined in terms of 4-input look-up tables (LUTs), slice flip-flops (D-type flip-flops), 18-bit DSP multipliers (MUL), and 18 Kbit block RAMs (BRAMs). These components are the basic elements of an FPGA for logic implementation, storage, and efficient multiplication. For characterizing the accelerator area cost, an accurate prediction of the total number of FPGA resources is required. Table 5.2 shows the resource consumption of different RTL components for FPGA technology. The data path resource usage depends on the type and number of functional units and registers, and on their bit-width w. Functional units like dividers are instantiated as IP cores from libraries; hence, they are characterized by a number in the database for the particular bit-width. The controller logic is implemented with the help of counters, comparators, and logical gates like NAND and NOR; furthermore, modulo-counters and one-hot LUTs are also part of the global controller. The resource sharing of functional units and registers leads to multiplexers. The area cost of multiplexers depends not only on the bit-width w, but also on the number of inputs n. The storage units are implemented in the form of delay shift registers, which can be realized with BRAMs or distributed logic (i.e., LUTs and FFs); their area cost depends on the length of the delay d and on the bit-width w. We refer to [139] for the derivation of proper estimation formulas. The final area cost A is then determined by summing up the area cost of each RTL component constituting the accelerator architecture using the following formulas:

ALUT = Σ_{i ∈ RTL} A^i_LUT,   AFF = Σ_{i ∈ RTL} A^i_FF    (5.2)

where the sums over all allocated RTL components, denoted by the set RTL, give the area in terms of LUTs (ALUT) and flip-flops (AFF). Although the macro-models, based on parameters and pre-characterized results for each component, are accurate, it must be noted that there is an over-estimation error due to hardware-dependent and hardware-independent optimizations undertaken by the synthesis tools during place and route. However, for the sake of fast design space exploration, this error can be ignored, since the relative accuracy is more important for comparing different designs. The basic blocks of FPGAs are slices, which are made of LUTs and flip-flops (FFs). Slice packing refers to the determination of the slice usage from the resource usage given in terms of LUTs and flip-flops. We use the following formula, proposed for slice packing in [157], for characterizing the area:

A = ASlices = α0 · ALUT + α1 · AFF + α2 · ALUT · AFF    (5.3)


RTL Component           Area (LUTs)                Area (FFs)     Area (MUL)   Area (BRAM)
Comparator              ⌈w/4⌉                      0              0            0
Counter                 w                          w              0            0
Delay Registers (BRAM)  0                          0              0            ⌈d·w/16384⌉
Delay Registers (LUT)   ⌈d/16⌉·w                   ⌈d/16⌉·w       0            0
Logical Gate            ⌈(w−1)/3⌉                  0              0            0
Modulo-Counter          w                          w              0            0
Multiplexer (n inputs)  (n−1)·⌈w/2⌉                0              0            0
One-Hot LUT             ⌈w/4⌉                      0              0            0
Shift Registers         0                          d·w            0            0
Registers               0                          w              0            0
FU (Adder)              w                          0              0            0
FU (Multiplier)         0                          0              1            0
FU (Mult, 16)           144                        172            0            0
FU (Subtractor)         w                          0              0            0
FU (Divider, 8)         231                        571            0            0

Table 5.2.: Resource requirement of different RTL accelerator components for Virtex FPGA technology. w denotes the bit-width of the components.

where ALUT and AFF are the estimated numbers of LUTs and FFs in the design as obtained from Equation (5.2). The factors α0 = 0.45, α1 = 0.35, and α2 = 2.07 · 10^−7 are set for determining the area in terms of slices for the Virtex FPGAs [157]. Therefore, we estimate the area cost by adding the pre-characterized values of each component of the accelerator RTL.
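As an illustration, the slice-packing formula of Equation (5.3) with the quoted α factors can be evaluated as follows; the example LUT/FF counts are hypothetical.

ALPHA0, ALPHA1, ALPHA2 = 0.45, 0.35, 2.07e-7   # Virtex factors from [157]

def estimated_slices(a_lut, a_ff):
    # Slice-packing estimate after Equation (5.3).
    return ALPHA0 * a_lut + ALPHA1 * a_ff + ALPHA2 * a_lut * a_ff

# Example: an accelerator estimated at 4000 LUTs and 3000 flip-flops.
print(round(estimated_slices(4000, 3000)))   # ~2852 slices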

Power The power consumption of an accelerator consists of static and dynamic power. The static power is fixed for a given underlying FPGA architecture and may be assumed independent of the resource usage. The dynamic power consists of logic power, signal/clock power, and I/O power; the logic power along with the signal/clock power depends on the resource usage. The power consumption depends on several factors like the switching activity, clock frequency, supply voltage, resource usage, routing density, and toggle rates. There are different tools like XPower Analyzer and web-based power tools for estimating the power consumption [188]. We use pre-characterization-based macro-modelling to determine the power consumption per

access of the LUTs and flip-flops. Our power model for FPGA architectures assumes a voltage source of 1.5 V, a nominal clock frequency of 100 MHz, and a toggle rate of 20%. The toggle rate of 20% is used for a worst-case estimate in Xilinx power estimation tools [188]. The toggle rate describes how often the output changes with respect to the input clock. The switching activity of an FU is assumed proportional to its utilization within the iteration interval II. With this information, one can determine that the power consumption of each LUT for a switching activity of 1% amounts to PLUT = 6 µW. The switching activity si is determined by the scheduling and is proportional to the corresponding resource utilization of each RTL component. Similarly, for each flip-flop, the power consumption at a switching activity of 1% is estimated as PFF = 5 µW. Subsequently, given the switching activity and the resource usage, one can determine the power consumption P as follows:

P = Σ_{i ∈ RTL} (si · A^i_LUT · PLUT + si · A^i_FF · PFF)

where si is the switching activity of the i-th RTL component in the set of allocated components RTL, and A^i_LUT and A^i_FF give the area of the RTL component in terms of look-up tables and flip-flops. The final power consumption is then determined by the summation of the individual contributions of each RTL component.

Apart from the efficient evaluation of objectives, intelligent heuristics are needed for exploring the accelerator design space. The next section presents our optimization engine for discovering Pareto-optimal solutions.
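The power macro-model can be sketched in the same way; the component list below is hypothetical, and only the per-activity constants PLUT = 6 µW and PFF = 5 µW are taken from the text.

P_LUT_UW = 6.0   # uW per LUT at 1% switching activity
P_FF_UW = 5.0    # uW per flip-flop at 1% switching activity

def estimated_power_uw(components):
    # components: list of (s_i in percent, A_LUT, A_FF) per allocated RTL block
    return sum(s * (a_lut * P_LUT_UW + a_ff * P_FF_UW)
               for s, a_lut, a_ff in components)

rtl = [(20.0, 1200, 900),  # hypothetical datapath at 20% activity
       (5.0, 150, 80)]     # hypothetical controller at 5% activity
print(f"{estimated_power_uw(rtl) / 1000:.1f} mW")   # ~240.5 mW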

5.1.4. Optimization Engine

There is no dearth of techniques to explore the search space for Pareto-optimal solutions. In this section, we present modern heuristics based on Evolutionary Algorithms used in our design space exploration problem. An overview of the resulting design space exploration framework is shown in Figure 5.3. It shows that different heuristics can be used to vary the genotype (architecture and compiler parameters). The loop accelerators are then generated and evaluated for the corresponding parameters. The Pareto-optimal accelerator designs (i.e., the set of non-dominated designs obtained during the exploration) are then stored in an archive.

5.1.4.1. Baseline: Random or Exhaustive Search

Exact brute-force exploration involves evaluating each candidate of the search space for determining the set of Pareto-optimal solutions. This exhaustive approach requires the systematic generation of all points in the search space SD. As argued earlier, in most cases an exhaustive search is prohibitive due to time constraints.



Figure 5.3.: Design space exploration framework.

We propose the use of random search techniques that, as the name says, select candidates arbitrarily and store only the non-dominated solutions. The random search solution can be used to compare the effectiveness of other search methods like those based on Evolutionary Algorithms.

5.1.4.2. Evolutionary Algorithms

In contrast to classical search algorithms like hill-climbing, simulated annealing, and others, Evolutionary Algorithms not only start with multiple candidates (a population) in the search space, but also instill competition among the candidates based on the principle of survival of the fittest, as in a natural selection process. The template of an Evolutionary Algorithm is shown in Algorithm 5.1. It consists of different steps like decoding, selection, and variation. The whole multi-objective exploration flow is also summarized in Figure 5.4(a). Selection refers to preserving good solutions for ensuring convergence to optimal solutions. Variation refers to creating a new population from an existing population. Decoding refers to the problem of evaluating the fitness of the individuals, which in our case corresponds to solving a high-level synthesis problem, i.e., utilizing the PARO compiler.

An initial population P0(X) = {X1, X2, ..., XP} of size P is generated randomly or using a deterministic rule for generation 0. For our exploration problem, each individual of the population is encoded as a genotype as described in Section 5.1.1. Subsequently, the objective function for decoding the fitness of each individual in the population is calculated. For our accelerator exploration problem, the fitness function is given by F(X) as in Equation (5.1), which gives the performance, area cost,


Algorithm 5.1 Evolutionary Algorithm
Require: Initial population P0(X)
Ensure: Pareto-optimal non-dominated set of solutions
  gen ← 0; evalFitness(P(X))
  while not termination condition do
    for all individuals X ∈ P(X) in the population do
      gen ← gen + 1
      Pgen(X) ← selector(Pgen−1(X))
      Pgen(X) ← variator(Pgen(X))
      decode(Pgen(X))
    end for
  end while

and power consumption. Given the resource allocation and compiler parameters for the loop program, the steps of binding, scheduling, and accelerator synthesis need to be undertaken for determining the fitness. The compiler-in-the-loop for solving the high-level synthesis problem uses the PARO design methodology. The determination and estimation of the objective function based on the PARO design methodology is discussed in Section 5.1.3.

Once the fitness function is calculated, multiple best-fit individuals are selected for the formation of the new population at generation gen + 1. The best offspring then replace the individuals with the worst fitness values in the population. An important feature of multi-objective Evolutionary Algorithms (MOEAs) is elitism, i.e., to form the next generation, MOEAs choose the N best individuals from a pool of the current and offspring populations. It is part of the selection process as shown in Figure 5.4(a). For the MOEAs chosen here, the individuals are associated with a non-domination rank and a crowding distance for characterizing non-dominance and diversity. The N best individuals are selected by a lexicographic fitness scheme based on binary tournament selection, where non-dominated solutions with better rank and larger crowding distance are preferred [134]. Tournament selection refers to the process of randomly selecting two individuals from the current and offspring populations; the individual with the higher fitness goes into the mating pool. The mating pool consisting of parents is further bred through variators (i.e., genetic operations like crossover and mutation), which give birth to the new offspring.

The mutation operator, as the name suggests, modifies a genotype with a low rate defined by the probability Pmut. For our problem, this may change the resource allocation of a particular functional unit or change the tiling matrix used during loop tiling. The crossover operator is based on the reproduction technique, where two parent genotypes are split and swapped to produce two new offspring; this is also known as single-point crossover. The crossover is applied with a higher probability given by Pcross.
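A minimal Python sketch of Algorithm 5.1 is given below. The fitness function is a placeholder: in the actual flow, decoding an individual means invoking the PARO compiler-in-the-loop (MILP scheduling plus the area/power macro-models of Section 5.1.3), and selection uses the multi-objective rank/crowding scheme rather than the scalarized tournament shown here.

import random

def evaluate(genotype):
    # Placeholder fitness; stands in for scheduling + cost estimation.
    return (sum(genotype), max(genotype), min(genotype))

def crossover(a, b):
    # Single-point crossover as in Figure 5.4(b).
    cut = random.randrange(1, len(a))
    return a[:cut] + b[cut:], b[:cut] + a[cut:]

def mutate(geno, bounds, p_mut=0.1):
    # Re-sample each gene with probability p_mut (Figure 5.4(c)).
    return [random.randint(*bounds[i]) if random.random() < p_mut else g
            for i, g in enumerate(geno)]

def evolve(bounds, pop_size=30, generations=20, p_cross=0.8):
    pop = [[random.randint(lo, hi) for lo, hi in bounds]
           for _ in range(pop_size)]
    for _ in range(generations):
        def tournament():
            # Binary tournament on the scalarized placeholder fitness.
            a, b = random.sample(pop, 2)
            return a if evaluate(a) < evaluate(b) else b
        offspring = []
        while len(offspring) < pop_size:
            p1, p2 = tournament(), tournament()
            c1, c2 = crossover(p1, p2) if random.random() < p_cross else (p1, p2)
            offspring += [mutate(c1, bounds), mutate(c2, bounds)]
        pop = offspring[:pop_size]
    return pop

print(evolve([(1, 64), (1, 64), (1, 16), (1, 16), (1, 32)])[0])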



Figure 5.4.: (a) Coupled multi-objective exploration for high-level synthesis, (b) single-point crossover, (c) mutation.

This operation gives two new individuals with different resource allocations and compiler parameters. The crossover and mutation operators are shown in Figures 5.4(b) and (c), respectively. The heuristic is repeated until a termination condition is met; the termination condition is often given by a maximum number of generations genmax defined by the user. In order to evaluate our approach for design space exploration, we use the following set of loop kernels as benchmarks:

• DCT: The loop kernel of the discrete cosine transform contains 64 operations, 32 of them multiplications, the rest being additions and subtractions. Depending on the chosen resource allocation of adders and multipliers, different accelerator implementations are obtained.

• Complex-MMM: The complex matrix-matrix multiplication of two 64 × 64 matrices contains 4 multiplications and 3 additions and subtractions in the loop kernel. Depending on the chosen tiling matrix, the accelerator can have up to 16 × 16 PEs in the processor array implementation, due to the chosen bounds on the tiling matrices, with up to 4 multipliers and 3 adders/subtractors in each PE.

The above benchmarks are compute-intensive loop kernels written in the PAULA language. Pareto-optimal accelerator implementations of these loop nests obtained during our design space exploration are shown in Figure 5.5. The exploration is successful in identifying most of the Pareto-optimal designs within 2 hours of run-time. We also analyze the different search heuristics based on these benchmarks. The design space exploration (DSE) has been implemented in Java with the help of the Opt4J framework² using multi-objective Evolutionary Algorithms like NSGA, SPEA, and others.

²opt4j.sourceforge.net



Figure 5.5.: Non-dominated loop accelerator designs in terms of area, performance, and power for the complex matrix-matrix multiplication and DCT benchmarks.

This framework includes modern heuristics based on Evolutionary Algorithms for multi-objective optimization. The design space exploration is evaluated with the help of the following search heuristics:

• RAND: random search algorithm
• EA: standard Evolutionary Algorithm
• NSGA: Non-dominated Sorting Genetic Algorithm [37]
• SPEA: Strength-Pareto Evolutionary Algorithm [191]

The random search algorithm samples the design space arbitrarily and stores/updates the set of non-dominated solutions in a database called an archive. As discussed earlier, it serves as a baseline for comparing the evolutionary heuristics. For the sake of comparison, the random search is divided into batches comprising a generation, such that the total number of evaluations is the same as for the Evolutionary Algorithms. The simple Evolutionary Algorithm (EA) is configured as given in Table 5.4 for each run; it uses the genetic operators crossover and mutation, but makes no use of elitism for selection.

We also study two elitist MOEAs: the Strength Pareto Evolutionary Algorithm (SPEA2) [191] and the Non-dominated Sorting Genetic Algorithm (NSGA2) [37], which make use of elitism (i.e., a non-dominated individual is selected for crossover and mutation with a strictly positive probability). For a detailed discussion of the elitism mechanism of both algorithms, we refer to [192]. The number of individuals which compete against each other in a tournament to become a parent must be set for the selectors in both MOEAs. For the elitist MOEAs, the number of tournaments is set to 4


for our experiments. All the exploration runs of the elitist Evolutionary Algorithms are also carried out with the parameters shown in Table 5.4.

Parameter                       Value
Population size, N              30
Number of generations, gen      20
Crossover probability, Pcross   0.8
Mutation probability, Pmut      0.1

Table 5.4.: Experimental setup for each run of the multi-objective exploration.

The simulation runs for the different loop algorithms lead to a set of non-dominated designs. Ten different runs are carried out for each algorithm and filtered to establish the corresponding reference Pareto-optimal solutions; the reference Pareto-optimal set is obtained by filtering all the runs from the different search heuristics for the non-dominated points.

Before comparing the search heuristics and analyzing the effect of the different Evolutionary Algorithm parameters, metrics for characterizing the accuracy of the solutions need to be defined. Before the assessment, the objective function values should be mapped to the same interval. This is done by finding upper and lower bounds of each element of the objective function; subsequently, each objective (area, power, and latency) is normalized to the range [1,2]. The additive ε-indicator Iε+ is a binary quality indicator that can be used to compare the quality of two Pareto set approximations relative to each other [116].

Definition 5.1.2 (Additive ε-indicator Iε+): Given two normalized non-dominated sets A and B, the maximum value d which needs to be subtracted as an offset from the Pareto-optimal points in set A, such that the offset set dominates all points of set B, is called the additive ε-indicator Iε+. Intuitively, if the measure is negative, then set A already dominates set B. The proximity of the solutions generated by a run of a multi-objective EA to the reference Pareto set of solutions is proportional to the value of the additive ε-indicator Iε+: the smaller the value, the closer the Pareto-optimal solutions are to the reference set. Formally, it is defined as

Iε+(A, B) = max_{y ∈ B} min_{x ∈ A} max_{1 ≤ i ≤ dim} {fi(x) − fi(y)}

where x and y are the non-dominated designs in set A and B, respectively. We aim to compare the convergence of the different search algorithms with the help of this metric.
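The indicator is straightforward to compute for small sets; a Python sketch for normalized objective vectors (minimization) follows, with hypothetical example sets.

def eps_indicator(a_set, b_set):
    # I_eps+(A, B): smallest offset d so that A, shifted by d, dominates B.
    return max(min(max(fx - fy for fx, fy in zip(x, y)) for x in a_set)
               for y in b_set)

a = [(1.0, 1.2), (1.2, 1.0)]   # normalized objective vectors in [1,2]
b = [(1.3, 1.4), (1.5, 1.1)]
print(eps_indicator(a, b))   # -0.1: a already dominates b
print(eps_indicator(b, a))   #  0.3: b must improve by 0.3 to dominate a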



Figure 5.6.: Box plots showing the convergence of different search heuristics with an increasing number of generations for the matrix-matrix multiplication algorithm.

In order to compare the different heuristics based on the simulation runs, box plots based on the additive ε-indicator are used. A box plot is a convenient way of depicting a data set in terms of sample minimum, lower quartile, median, upper quartile, and sample maximum; the lower and upper boundaries of the box denote the lower and upper quartiles. The comparison is done by calculating the additive ε-indicator value Iε+ of all the runs of an algorithm compared to the reference set. The reference set is obtained by filtering all the non-dominated solutions from the different runs; it must be noted that the size of the design space prohibits the exhaustive determination of the exact Pareto sets. The box plots of the different search heuristics looking for the Pareto-optimal loop accelerators for the matrix multiplication algorithm are shown in Figure 5.6. The median value of the indicator over all 10 runs is denoted by the dark line in the figure. From the experimental runs, we can conclude that

• The heuristics based on Evolutionary Algorithms outperform random search (RAND) for realistic loop accelerator benchmarks in terms of convergence to the reference Pareto front. The SPEA2 algorithm outperforms the NSGA variant within 20 generations: the median value


of the convergence measure Iε+ is closer to the optimal value of 0 for SPEA (Iε+(NSGA) = 0.106 and Iε+(SPEA) = 0.098). However, the use of elitism brings no advantage in comparison to the simple evolutionary heuristic: surprisingly, the simple Evolutionary Algorithm (EA) had superior convergence over the elitist Evolutionary Algorithms (Iε+(EA) = 0.077). The random search has inferior convergence compared to the evolution-based heuristics (Iε+(RAND) = 0.141).

• The convergence rate slows down for all the heuristics after 10 generations. An increase in population size does not help in getting closer to the exact reference Pareto-set.

• A single run of SPEA2, NSGA, EA, and RAND takes on average 6631 seconds, 5567 seconds, 5521 seconds, and 1267 seconds, respectively. Therefore, random search is faster than the evolution-based search heuristics, at the cost of inferior convergence.

Therefore, Evolutionary Algorithms are a viable search heuristic for design space exploration. Future work would be to make use of statistical significance tests for confirming the above observations. Furthermore, one could analyze other heuristics like simulated annealing, particle swarm optimization (PSO), and others. In the next section, we study the problem of selecting the best-fit accelerator engine from a set of non-dominated designs, given a workload scenario.

5.2. Performance Analysis of Accelerators in an SoC System

SoC architectures are realized either as homogeneous tiled core architectures or as heterogeneous architectures containing application-specific acceleration engines. One of the major challenges is the efficient performance analysis and exploration of the vast design space arising from the numerous mapping possibilities. In the previous section, we studied efficient heuristics for identifying Pareto-optimal accelerator engines; however, we did not consider the performance requirements. Fig. 5.7 shows a typical template of a heterogeneous SoC containing an inverse discrete cosine transform (IDCT) accelerator IP core. The trade-off between area and throughput for different Pareto-optimal designs of the IDCT accelerator is also illustrated in Fig. 5.7. The proper matching of the area and performance requirements of the accelerator to the overall system behavior is important, in particular with respect to the communication behaviour. Hence, the fastest implementation of the IDCT might be an overkill if the worst-case rate of input events requires a much lower throughput. Therefore, a major challenge is to couple such accelerator design tools with performance analysis tools, to iteratively identify the rate-matched, i.e., best-fit acceleration engine, which leads to significant savings in area and power cost.



Figure 5.7.: Example of an SoC architecture including several potential acceleration engines, and design space exploration of an IDCT accelerator engine.

Existing popular approaches for performance analysis are based on "Excel sheet" analysis or simulation (e.g., with SystemC) [17]. A simulation-based performance analysis of an SoC is presented in [147]. In [20], a framework for the performance analysis of component-based software-only systems is presented. In this section, we use modular performance analysis based on real-time calculus for providing worst-case guarantees, because it is orders of magnitude faster than simulation approaches and more accurate than "Excel sheet" analysis, as it accounts for traffic bursts and resource sharing [172].

In particular, we address the following problem: As we are concerned with a multi-objective optimization problem (cost, power, and throughput), there is not only one optimal solution, but typically a set of optimal solutions, the so-called Pareto-optimal solutions. Let a set of Pareto-optimal hardware accelerator designs and different workload scenarios for these hardware accelerators be given. Then the question arises: What is the best-fit hardware accelerator that can handle this workload? In our approach, different load scenarios from simulation are used to obtain worst-case traffic numbers. The worst-case input scenario, along with the service models of the accelerators, is used for modular performance analysis. To summarize, the contributions of this section are:

• Selection of an optimal hardware accelerator engine in terms of area and throughput with worst-case guarantees using modular performance analysis.

• Fast and accurate characterization of the hardware accelerator performance with service curves using polyhedral theory.

• Presentation of a motion JPEG (M-JPEG) case study application for illustrating the benefits of the proposed methodology.



Figure 5.8.: (a) Node with arrival and service curves. (b) Backlog and delay, represented as the vertical and horizontal difference between the upper arrival curve and the lower service curve.

5.2.1. Modular Performance Analysis (MPA)

Modular performance analysis (MPA) denotes a framework based on real-time calculus for the investigation of the performance of real-time embedded systems. For a component in an embedded system with a given input event stimulus, the processing rate on this component will always be the minimum of the service capacity of the component and the rate of the input stimulus. This is the paradigm of real-time calculus [172].

In Figure 5.8(a), the building blocks of performance analysis using the real-time calculus are shown. The inputs to the node are the arrival curves of the events of the data streams and the service curve of the node. The output arrival curve and the output service curve describe the processed stream of events and the remaining service capacity of the processing node, respectively. These building blocks can be hooked together for the analysis of a communicating architecture network. Mathematically, the input stimulus is modelled using αu(∆) and αl(∆), which denote the maximum and minimum number of input events in a time interval ∆. The service capacity of a component is modelled using βu(∆) and βl(∆), denoting the maximum and minimum available service rate for the input events in a time interval ∆.

Outgoing arrival/service curves as well as the delay and the buffer size are determined from the incoming arrival and service curves according to equations defined by real-time calculus [172]. The upper and lower output arrival curves αu′(∆), αl′(∆)

and output service curves βu′(∆), βl′(∆) are given by the following equations [172].

αu′(∆) = min{ inf_{0≤u≤∆} [ sup_{v≥0} { αu(u+v) − βl(v) } + βu(∆−u) ], βu(∆) }
αl′(∆) = inf_{0≤u≤∆} { αl(u) + βl(∆−u) }
βu′(∆) = sup_{0≤u≤∆} { βu(u) − αl(u) }
βl′(∆) = sup_{0≤u≤∆} { βl(u) − αu(u) }

In a simple embedded system, the output arrival curve represents the output data rate depending on the processor availability and the input data event arrival. The output service curve is equivalent to the remaining processor availability. The buffer size refers to the minimum size of the memory for storing the traffic bursts, and the delay refers to the maximum latency of the system. The delay and the buffer size are given by the following equations:

delay ≤ sup_{u≥0} { inf { τ ≥ 0 : αu(u) ≤ βl(u + τ) } }    (5.4)
backlog ≤ sup_{u≥0} { αu(u) − βl(u) }    (5.5)

Intuitively, the delay is bounded by the maximum horizontal distance between αu and βl, whereas the buffer size is bounded by the maximum vertical distance between them, as also shown in Figure 5.8. We refer to [172] for the methods for evaluating the above equations. The equations are implemented in the MPA toolbox in Matlab³. The toolbox includes several models for describing typical event streams (e.g., periodic with jitter) and service models (e.g., TDMA). Therefore, after modeling the input behavior and the accelerator service capacity, modular performance analysis can be used to find designs satisfying the system properties, like delay and buffer requirements.
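For intuition, the bounds (5.4) and (5.5) can be evaluated numerically on sampled curves. The following Python sketch uses hypothetical staircase curves and is not the MPA toolbox implementation.

def backlog_bound(alpha_u, beta_l):
    # Maximum vertical distance between alpha^u and beta^l, Eq. (5.5).
    return max(a - b for a, b in zip(alpha_u, beta_l))

def delay_bound(alpha_u, beta_l):
    # Maximum horizontal distance, Eq. (5.4); beta_l must cover the horizon.
    worst = 0
    for u, a in enumerate(alpha_u):
        # smallest tau with alpha^u(u) <= beta^l(u + tau)
        tau = next(t for t in range(len(beta_l) - u) if a <= beta_l[u + t])
        worst = max(worst, tau)
    return worst

alpha_u = [min(4 + k, 12) for k in range(9)]  # burst of 4, then 1 event/slot
beta_l = list(range(13))                      # constant service: 1 event/slot
print(backlog_bound(alpha_u, beta_l))  # 4 -> buffer of 4 events
print(delay_bound(alpha_u, beta_l))    # 4 -> at most 4 slots of delay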

5.2.2. Objective Parameter Estimation for Accelerators

The design space of hardware accelerators is spanned by the choice of the tiling strategies and the choice of the resource constraints of the architecture model. In this section, we illustrate this observation and summarize our exploration approach for finding Pareto-optimal designs. In the LPGS tiling scheme, all iteration points within a tile are executed in parallel, whereas the tiles are executed sequentially (see Fig. 5.9(a)).

³http://www.mpa.ethz.ch/rtctoolbox



Figure 5.9.: (a) Dependence graph of an adaptive filter after clustering (LPGS) with tile (1 × 4) and its corresponding accelerator. (b) Dependence graph of the filter after tiling (LSGP) the iteration space with tile (8 × 2). The numbers in/at the nodes denote the start times of the iterations and their operations, respectively.

In Fig. 5.9(a), for a 4-tap adaptive filter, the output (j = 3) is produced by PE3 (2 MUL, 2 ADD) every cycle, whereas in Fig. 5.9(b), the output is produced by PE1 (1 MUL, 1 ADD) only every fourth cycle. Therefore, the throughput depends on the resource allocation.

In copartitioning, as introduced earlier, the iteration space is first partitioned into LS (local sequential) tiles; this tiled iteration space is tiled once more using GS (global sequential) tiles, as shown in Fig. 5.10. The start times of the iterations of the output variable are shown in Fig. 5.10. It illustrates the bursty nature of the outputs, as 16 outputs are produced in 8 cycles with a period of 32 cycles. Obviously, the choice of the tiling matrices directly influences the number of PEs. Further, the allocation information on resource constraints (functional units) for each PE is specified in the architecture model of the input program. In Figs. 5.9 and 5.10, one observes that different tiling strategies and resource allocations lead to hardware accelerators with a different number of PEs, resources, and performance. Therefore, the selection of a best-fit accelerator hardware requires a framework for performance analysis in design space exploration.


Algorithm 5.2 summarizes the search for finding the Pareto-optimal designs in terms of area, power cost, and throughput, and their corresponding performance characterization as service curves for modular performance analysis.

Algorithm 5.2 EXPLORE
Require: Intermediate representation (dependence graph) of the algorithm, set P of tiling matrices, set R of resource constraints in the architecture model
Ensure: Set of service curves β of the Pareto-optimal set of hardware designs
1: for all candidates (P, R) ∈ Search(P × R) do
2:   (L, II, λ) ← Scheduling(P, R)
3:   (A, Pow) ← AnalyzeRTLCost(P, R)
4:   if (A, Pow, II) is non-dominated then
5:     β ← β ∪ ServiceCurve(P, R, λ)
6:   end if
7: end for

The search heuristic determines the candidates for Pareto-optimal designs as discussed in the previous section on design space exploration. As a result of scheduling, the minimal latency L, the corresponding iteration interval II, and the area A and power cost Pow are determined. The iteration interval II is used for rating the performance, as it is inversely proportional to the throughput. Modular performance analysis requires modeling the accelerator performance as service curves. The service curves of all Pareto-optimal accelerator designs thus need to be determined, as discussed in the next section.
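The loop of Algorithm 5.2 can be sketched in Python as follows; Scheduling, AnalyzeRTLCost, and ServiceCurve are placeholders for the MILP scheduler, the macro-model evaluation of Section 5.1.3, and the service-curve construction of the next section.

def dominates(a, b):
    # a weakly dominates b: no worse in any objective and not identical.
    return all(x <= y for x, y in zip(a, b)) and a != b

def explore(candidates, scheduling, analyze_rtl_cost, service_curve):
    archive = []  # pairs of ((A, Pow, II), service curve beta)
    for p, r in candidates:
        _latency, ii, lam = scheduling(p, r)
        area, power = analyze_rtl_cost(p, r)
        point = (area, power, ii)
        if not any(dominates(old, point) for old, _ in archive):
            # drop archive entries that the new point dominates, then add it
            archive = [(o, b) for o, b in archive if not dominates(point, o)]
            archive.append((point, service_curve(p, r, lam)))
    return archive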

5.2.2.1. Accelerator Performance: Service Curve Estimation

In this section, we determine a service curve for an accelerator based on its allocation and scheduling. An important question for modeling the performance is: what can be characterized as an event? For streaming applications, a single problem instance can be viewed as an event (e.g., a single frame for a streaming video filter application). A partition of the problem instance defined by the algorithm specification can also be viewed as an event (e.g., in the IDCT, one macro-block or one row of a macro-block can be classified as an event). An event at the finest level of granularity represents the iteration outputs of the application, for instance, each pixel of a frame in a video streaming application. Therefore, the modeling of curves requires an event definition. It is not necessary to derive the service curves from simulation traces, because of the static scheduling undertaken for loop accelerator generation. In the context of loop accelerators, we characterize a single output iteration as an event. Here, we can consider two methods.

Method 1 is a fast estimation of the service curve of an accelerator that can be generated during scheduling.



Figure 5.10.: The output space of the matrix multiplication with 8 × 8 matrices on a 2-hierarchical partitioning. The output data is produced in bursts.

The number of output events NO, corresponding to the production of output variables at border PEs within the latency period L, gives the service curve. In this case, the service curve β(∆) can be represented by the following piecewise linear approximation:

β(∆) = r·∆, where r = NO / L    (5.6)

However, the above curve models the average throughput and cannot represent the bursty behaviour caused by tiling.

Method 2: For more accurate modelling, the service curve of each PE needs to be calculated individually. The final service curve is then given by the sum of the individual service curves of all output PEs, i.e., the PEs which produce and write the output variables. For calculating the service curve for a 2-level hierarchical partitioning (also known as copartitioning), the following steps need to be carried out:

• Determination of the output processors PE1, PE2, ..., PEn: Let IO be the iteration space of the output variable. This is determined by the iteration condition of the output variable in the loop program. Then, the n distinct elements of the output processor space P (= {Q · I | I ∈ IO}) give the set of output processors PE1, PE2, ..., PEn ∈ P.

• Determination of the service curve for each output processor: For each processor PEi, the following parameters need to be determined to calculate the service curve:



Figure 5.11.: Service curve of the matrix multiplication accelerator. The dotted service curve is that of a single PE.

– N_PEi denotes the number of output iterations of the processor element PE_i.

– N_PEi(LS) denotes the maximum number of outputs in the corresponding local sequential (LS) tile executed by processor PE_i.

– γ_min is the time difference between the start of the first and the end of the last output of the same local sequential (LS) tile.

– γ_max is the time difference between the execution of two successively scheduled LS tiles.

Then, the service curve is represented as a piecewise linear curve with three segments, and is given by

β_i(∆) = min { r∆, r_1∆ + N_PEi(LS), N_PEi }    (5.7)

where r = N_PEi(LS) / γ_min and r_1 = N_PEi(LS) / γ_max are measures for the short-term and long-term burstiness of the output, respectively.

• Find the final service curve as a function of the individual service curves of the n processors: the output service curve is given by β(∆) = ∑_{i=1}^{n} β_i(∆).

Example 5.2.1 For the matrix multiplication example in Fig. 5.10, N_PEi(LS) = 4, N_PEi = 16, γ_min = 4, and γ_max = 32. The service curve for a single PE is obtained using Equation (5.7), and is shown together with the curve of the entire accelerator in Fig. 5.11. LSGP and LPGS are special cases of a 2-level hierarchical partitioning. For LSGP, γ_min = γ_max, as the LS tile is the only one to be executed by a corresponding processor. For the same reason, N_PEi(LS) = N_PEi. For LPGS, N_PEi(LS) = 1 and γ_min = γ_max, as each iteration can be considered a local sequential tile.

The important problem to be solved for the accurate derivation of the service curves is to find the required variables N (number of output iterations) and γ (dependent on the schedule). The first problem can be solved by counting the index points lying within a polytope; this counting problem is solved by computing the iteration space volume as explained in [8]. As an approximation of γ, the equation γ = max(λ·I_1) − min(λ·I_2), where I_1, I_2 ∈ I_O, is used.

The arrival curves must also be modified, since in realistic systems an accelerator has to process more than one event from different input streams to trigger an output event. Therefore, if an accelerator consumes n_i (i = 1, ..., m) events from m streams to produce one output event, then the bounds on the arrival curves must be modified as [α_i^u / n_i, α_i^l / n_i] and then taken as input for an abstract AND component [85]. For example, in a matrix multiplication, one has inputs from two different streams (the two matrices).
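Equations (5.6) and (5.7) are simple enough to evaluate directly. The following Python sketch uses our own helper functions (not part of the PARO tool flow) and the values of Example 5.2.1; it assumes that all four PEs of Fig. 5.10 are output PEs.

def beta_avg(delta, n_out, latency):
    # Method 1, Eq. (5.6): average-rate approximation with r = N_O / L.
    return (n_out / latency) * delta

def beta_pe(delta, n_pe, n_ls, gamma_min, gamma_max):
    # Method 2, Eq. (5.7): three-segment service curve of one output PE.
    r = n_ls / gamma_min      # short-term burst rate within an LS tile
    r1 = n_ls / gamma_max     # long-term rate from LS tile to LS tile
    return min(r * delta, r1 * delta + n_ls, n_pe)

# Example 5.2.1: N_PEi(LS) = 4, N_PEi = 16, gamma_min = 4, gamma_max = 32;
# the accelerator curve is the sum over the assumed four output PEs.
def beta_accel(delta, n=4):
    return sum(beta_pe(delta, 16, 4, 4, 32) for _ in range(n))

for d in (8, 32, 64, 128):
    print(d, beta_accel(d))   # saturates at 64 outputs, as in Fig. 5.11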

5.2.3. Optimal Configuration Selection in System Context

The modeling of a hardware accelerator's performance is simplified because of static scheduling. The worst-case input throughput is obtained by analyzing several simulation traces or from an experienced system architect, and is modeled as an arrival curve α. Algorithm 5.3 can then be used to select a best-fit design based on the trade-off between buffer size, delay, and resource utilization.

Algorithm 5.3 MATCH
Require: input arrival curve α, set B of service curves of Pareto-optimal hardware accelerators
Ensure: optimal set B_opt of service curves β
1: B_opt ← {}
2: for all candidates β ∈ B do
3:   DELAY ← RTCDEL(α, β)
4:   BUFFER ← RTCBUF(α, β)
5:   UTILIZATION ← RTCUTIL(α, β)
6:   if UTILIZATION < 100 then
7:     B_opt ← B_opt ∪ {β}
8:   end if
9: end for

The algorithm computes the DELAY, BUFFER, and UTILIZATION for the given arrival curve and the service curve of each Pareto-optimal hardware accelerator. The delay and buffer size are calculated using min-plus algebra operations (see


Figure 5.12.: M-JPEG decoder (Source, Parser, Huffmann Decoder, Entropy Decoder, Dequantization, Inverse ZigZag, Inverse Discrete Cosine Transform, Frame Shuffler, YCbCr Decoder, Sink). All algorithms are implemented in hardware. Therefore, there is no output service curve.

Equations (5.4) and (5.5)) and integrated as the functions RTCDEL and RTCBUF (intuitively, the maximum horizontal and vertical distance between the arrival and service curve) in the RTC toolbox. The utilization can be calculated as U = α^u(t_max) / β(t_max), where t_max is the end time. If the resource utilization is less than 100%, then the corresponding hardware accelerator is added to the set of solutions. A resource utilization greater than 100% would imply that the accelerator does not have enough performance or throughput to process the input data rates. If the rate of the input is known, then one can simply plug the iteration interval into the scheduling problem. However, in the case of applications with multiple loops and dynamic input behaviour, this rate information is available only after simulation and design space exploration. Therefore, exploration is necessary for fast modular performance analysis. The accelerator is completely dedicated to the processing of the input stimuli; therefore, the upper and lower service curves are the same (β^u(∆) = β^l(∆)). The optimal configuration may not be the fastest accelerator implementation, but a rate-matched implementation. The rate-matched implementation satisfies the throughput requirement corresponding to the worst-case input stimulus.
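For illustration, the three quantities computed in Algorithm 5.3 can be sketched in Python for curves sampled at the integer time intervals ∆ = 0, 1, ..., T. This is a discretized simplification of our own making of the min-plus operations of Equations (5.4) and (5.5); it is not the API of the RTC toolbox.

def rtc_buf(alpha_u, beta_l):
    # RTCBUF: maximum vertical distance between the curves, i.e., the
    # worst-case backlog that must be buffered.
    return max(a - b for a, b in zip(alpha_u, beta_l))

def rtc_del(alpha_u, beta_l):
    # RTCDEL: maximum horizontal distance between the curves, i.e., the
    # worst-case delay until the service has caught up with the arrivals.
    worst = 0
    for t, a in enumerate(alpha_u):
        s = next((u for u in range(t, len(beta_l)) if beta_l[u] >= a), None)
        if s is None:
            return float('inf')  # service never catches up within the horizon
        worst = max(worst, s - t)
    return worst

def rtc_util(alpha_u, beta):
    # Utilization U = alpha_u(t_max) / beta(t_max), in percent.
    return 100.0 * alpha_u[-1] / beta[-1]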

5.2.4. Case Study

In this section, we apply the presented approach of performance analysis to match the throughput of an IDCT accelerator in a motion JPEG decoder to a given workload. I.e., we find a best-fit accelerator for a given set of workload scenarios.

5.2.4.1. Motion JPEG Decoder

The sequence of algorithms in the M-JPEG decoder is illustrated in Figure 5.12. The IDCT is a data-intensive stage, which is usually implemented in hardware. Our design tool is used to synthesize the IDCT accelerator for different resource constraints, each resulting in a different throughput. The fastest IDCT (II = 1) requires 3562 slices on a Xilinx Virtex1000 FPGA, whereas an IDCT with an iteration interval of 8 occupies only 2224 slices because of the resource sharing of 4 multipliers, 4 adders, and 4 subtractors. For obtaining a realistic estimation of the input simulation traces to the


Figure 5.13.: (a) Arrival and service curves (II = 8) for the IDCT stage in M-JPEG; (b) zoomed view of the arrival curves. Both plots show #(macroblock rows) during ∆ over the time interval ∆ (10^-4 ms).

IDCT, a SystemC model of the M-JPEG pipeline was written. The inputs for the M-JPEG simulation are sequences of encoded 176×144 QCIF images with varying degrees of compression.

Fig. 5.13(a) shows the service curve for II = 8. Fig. 5.13(b) shows the upper and lower output arrival curves of the inverse zig-zag stage of M-JPEG, which is the input to the IDCT component. These arrival curves are obtained from the SystemC simulation traces (worst case) at the input of the IDCT hardware. The curves illustrate the bursty nature of the IDCT input. These curves are then taken as the input arrival curve α for the IDCT. Afterward, they are matched with the service curves β of the set of Pareto-optimal hardware accelerators. Considering only area and throughput, the Pareto-optimal set contains only a few hardware designs (II = 1, 2, 4, 8, 16). The results of applying Algorithm 5.3 are shown in Table 5.5.

The consideration of worst-case input event streams in the simulation setup shows a resource utilization of only 7% for the fastest implementation (II = 1) of the IDCT. I.e., the accelerator is busy only 7% of the time. Therefore, one can increase the resource utilization by incorporating an IDCT accelerator with a larger iteration interval (i.e., lower throughput). The lower throughput requirement allows for a solution with a lower number of resources, which also reduces the accelerator area (by 37.6% for II = 8). The delay and buffer sizes refer to the internal FIFO requirements for storing input bursts, and they are determined through Equations (5.4) and (5.5) in the MPA


toolbox, respectively, and are different from the latency and the accelerator area determined during design space exploration.

  II (cycles)   Delay (10^-4 ms)   Buffer (#Cells)   Area Reduction (%)
       1              0.14                7                  0
       2              0.28                7                 14.2
       4              0.56                7                 31.2
       8            192.92             1072                 37.6
      16               -                  -                   -

Table 5.5.: Trade-off between buffer size, delay, and area reduction with respect to the solution with II = 1 for the IDCT butterfly implementation. "-" indicates a resource utilization of more than 100 percent.

The computed delay and buffer sizes are further optimization objectives, which need to be compared. The jump in the computed delay and buffer size in Table 5.5 is caused by the worst-case behaviour, which requires the storage of bursts in FIFO buffers. On close observation, the Pareto-optimal hardware design with II = 4 can also be selected, based on the area, delay, performance, and buffer sizes. The results hold for a particular implementation of all other components in the M-JPEG pipeline. In case of a different implementation of the other components, the simulations need to be repeated for obtaining the arrival curves. The simulation of a single design containing a complete hardware implementation of the M-JPEG decoder takes about 62 s, as compared to 0.07 s using the MPA toolbox. Therefore, the combination of functional simulation with analytic methods can speed up the search for best-fit designs by orders of magnitude.
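For reference, an upper arrival curve such as the one used above can be derived from a trace of event timestamps by sliding a window of length ∆ over the trace. The following sketch is our own helper with a hypothetical bursty trace; it is not the SystemC trace extraction used in this case study.

from bisect import bisect_left

def alpha_upper(timestamps, delta):
    # alpha_u(delta): the maximum number of events in any window of length
    # delta; the maximum is always attained by a window starting at an event.
    ts = sorted(timestamps)
    return max((bisect_left(ts, t + delta) - i for i, t in enumerate(ts)),
               default=0)

# Hypothetical bursty trace: bursts of 8 back-to-back events every 64 cycles.
trace = [64 * burst + i for burst in range(4) for i in range(8)]
print([alpha_upper(trace, d) for d in (1, 8, 64, 128)])   # [1, 8, 8, 16]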

5.3. Conclusion and Summary

The freedom in choosing the architecture allocation and the compiler transformations leads to an explosion in the number of potential loop accelerator designs. Therefore, the task of finding Pareto-optimal designs in terms of performance, area, and power cost is of utmost importance. An exhaustive exploration of all designs with exact determination of the design objectives cannot be undertaken in reasonable time. In this chapter, we introduced an exploration framework utilizing intelligent search heuristics based on multi-objective evolutionary algorithms (MOEAs) for identifying an approximation of the set of Pareto-optimal designs. In addition, an estimation-based approach for determining the design objectives further speeds up the exploration, such that design spaces consisting of thousands of designs can be explored in less than an hour. The selection of best-fit designs from a given set of Pareto-optimal designs considering workload contracts was also shown. This problem

is important, as the required throughput rates will be determined by the environment or by the communicating neighbor accelerator blocks in a complex SoC design. Hence, the required throughput rate will only be known once the overall architecture has been fixed. It was shown here how modular performance analysis can be used to find best-fit hardware accelerator engines in terms of area and throughput for an SoC, with worst-case performance guarantees for given work contracts in the design phase. A motion JPEG case study was chosen to validate the benefits of the methodology in combination with simulation. It shows that choosing a rate-matched Pareto-optimal design may lead to a 31% reduction in the area of an IDCT accelerator IP core. The matching of hardware accelerator performance by service curves using polyhedral theory indicates that similar work on optimal cache and functional unit usage can be extended and used to model software performance on a general purpose processor [27]. Therefore, the combination with modular performance analysis as a plug-in within HW/SW compiler tools is an important step towards the optimization of accelerator-based SoC designs. In the design phase, the system architect can also perform design space exploration for each accelerator and then use real-time calculus for the performance analysis of different combinations of Pareto-optimal designs for the accelerators. This is the next exploration step, which needs to be investigated in the future.


6. Conclusions and Outlook

This dissertation presents novel contributions to the methodology for the automated synthesis and exploration of hardware accelerators for computationally intensive nested loop programs. In this chapter, we summarize the key contributions of the dissertation. In addition, an overview of possible future work in the area of accelerator synthesis and code generation is given.

6.1. Conclusion

The compelling next generation streaming applications containing several computationally intensive nested loop programs are the driving force behind system-on-a-chip (SoC) architectures. SoC platforms are characterized by the presence of diverse components like traditional processors augmented with programmable or non-programmable accelerators. The accelerators implement the computationally intensive loop programs of a given application. An SoC platform can improve the performance, optimize power, and reduce cost by orders of magnitude due to the specialized execution on these accelerators. However, the effort for programming and synthesizing the loop accelerators for streaming applications is still enormous. Therefore, the PARO design methodology is presented in this thesis, which enables the automated synthesis and exploration of accelerators for streaming applications. The methodology contains novel contributions in the areas of compiler transformations, control/communication synthesis, and design space exploration of loop accelerators.

In this dissertation, we present a source-to-source transformation called hierarchical tiling, which restructures the loop descriptions given in a high-level language within a polyhedral framework using multiple hierarchies of tiles. This important extension of the well-known loop tiling transformation enables the utilization of multiple levels of parallelism and memory in an accelerator architecture. With hierarchical tiling, the system designer is thus able to specify the degree of parallelism (number of PEs), the local memory usage, and the requisite communication bandwidth of the accelerator architecture. Other design flows contain simple tiling or loop unrolling, which can specify only one or the other criterion, but not all of them.

The transformation, however, may create loop code characterized by the presence of many more control conditions, which could become a performance bottleneck. Therefore, a novel back-end methodology for control generation was presented. It contains an efficient method for synthesizing a global counter, which scans the loop

program according to the specified tiling strategy. Furthermore, the combination of local and global controller techniques leads to a low hardware overhead for the control code implementation, as compared to previous works on control generation, which used only local control facilities. I/O communication can also become a performance bottleneck. Therefore, depending on the tiling parameters and the schedule, a custom memory architecture with multiple banks, address generators, and I/O controllers is automatically generated.

The consideration of front-end transformations like hierarchical tiling and the novel back-end synthesis of the control and I/O communication units leads to highly efficient accelerator implementations. We used several loop benchmarks from different classes of loop algorithms, ranging from linear algebra and image processing to networking. The synthesized loop accelerators show an average gain of around 2.5x, 4.5x, and 50x in terms of area, power, and performance over embedded processors. The design flow leads to an increased productivity gain of up to 100x, due to the ease of programming in a high-level language rather than cumbersome RTL coding in VHDL.

Streaming applications are characterized by the presence of multiple communicating loops. With state-of-the-art polyhedral techniques, one can generate fast individual loop accelerators, but there exists no methodology for the automated generation of the intermediate communication hardware, which is often a major bottleneck. In this context, a novel intermediate dependence graph called mapped loop graph was introduced for the modular representation of the task level parallelism of communicating loops and their mapping information in the polyhedral model. Traditionally, the polyhedral model has been used for computation synthesis, and the data flow models of computation for communication analysis. Therefore, a methodology is proposed for the projection of a mapped loop graph in the polyhedral model onto the model parameters of a data flow model of computation called windowed synchronous data flow (WSDF). Subsequently, the generation of an efficient dedicated communication primitive called multi-dimensional FIFO from the windowed synchronous data flow model is undertaken. The communication synthesis is able to generate communication hardware for data transfer and synchronization between the loop accelerators, which can handle parallel access and out-of-order communication in their entirety, unlike existing approaches.

The accelerators can also be used as co-processors in an SoC. In order to ease the integration of accelerators in an SoC, a methodology for the automated generation of a memory map, a software driver, and a hardware wrapper was proposed. The software drivers are the programs running on the processors, which are responsible for the data transfer and synchronization to/from the accelerator. The hardware wrapper, in turn, implements the protocol conversion of signals for the integration of accelerators over a system bus. The experimental results also show that the performance gain may scale down by an order of magnitude for a hardware/software co-design as compared to a pure hardware implementation due to the communication bottleneck. Still, the hardware/software co-design approach may offer an advantage of 2-20x over pure

software solutions. The dissertation tackles the often neglected problems of control and communication synthesis in the context of loop transformations, and gives efficient and generic solutions.

The selection of an optimal architecture can be daunting due to a plethora of architecture and compiler design decisions. Exhaustive exploration of the design space is prohibitive due to excessively large execution times. Therefore, we propose a method using modern search heuristics based on evolutionary algorithms and the estimation of objectives to identify Pareto-optimal designs (i.e., non-dominated designs with the best trade-offs) in terms of area cost, power consumption, and performance. This not only reduces the exploration time to a matter of around an hour for relevant accelerator benchmarks, but also delivers better design solutions than random search techniques. For a given workload scenario, a best-fit accelerator can be chosen from the Pareto-optimal set of designs. The proposed analytical method for finding the best-fit accelerator is based on real-time calculus and uses system performance models.

To summarize, the major contributions of the thesis enable the realization of loop accelerators. Several other research problems need to be solved in the future to ease the programming of next generation multiprocessor system-on-chip accelerator platforms.

6.2. Future Work

In this dissertation, we have looked at the static compilation of loop programs onto massively parallel hardware accelerator engines to be used within system-on-chip architectures. Another important aspect of compilers for future multi-processor system-on-chip architectures is the dynamic component. The applications must not only be able to utilize architecture features and their availability, which can be unknown at compile-time, but also be able to react to the system behaviour (temperature, power, faults), workload features (throughput), and others. For example, the highly dynamic nature of signal processing applications and their environment may require a programmer to support different encodings in compression applications like MPEG-4. Therefore, a new paradigm of computing called invasive computing has been proposed in [165]. It enables applications to explore and dynamically spread their computations and execute them depending on the circumstances at run-time. In order to support invasive computing in a programmable processor array accelerator, several architectural and compiler innovations are needed: On the compiler side, loop transformations like hierarchical tiling should be extended to handle parameters (e.g., dynamic loop bounds) that are unknown at compile-time. Furthermore, several architecture innovations need to be undertaken in tightly-coupled processor arrays (TCPAs). The controller architecture must not only be able to handle parameters, but one also needs to develop a generic VLIW control PE with limited programmability such that the computation control and I/O control do not throttle the computation speed. Therefore, the concepts introduced in this thesis must be extended for generating control and

invasion code for TCPAs. Broadly speaking, a code generation methodology and architecture improvements are needed to handle dynamic parallel computation, control, and communication.

The implementation of multiple loops on programmable arrays may make use of reconfiguration (i.e., time multiplexing). The ideas on communication synthesis for non-programmable arrays presented in this dissertation can be extended to handle the communication synthesis for programmable arrays with reconfiguration capabilities. Furthermore, since communication is the bottleneck in the acceleration achievable using hardware accelerators in an SoC setup, the investigation of communication alternatives like DMA or data caches is necessary.

Last but not least, the design space exploration of accelerator architectures might benefit from the use of machine learning algorithms. These are currently used in the MILEPOST project for finding optimization parameters and flags for gcc compilation [73]. Furthermore, the ongoing research work on cyclic dependencies in real-time calculus would enable the analysis of communicating loop accelerators with backpressure. This would further reduce the amount of simulation effort in system design. This set of challenges needs to be considered for the design automation of accelerators in complex next-generation SoC architectures.

A. Glossary

Accelerator: a dedicated device that enhances performance by faster execution of a specific workload. The device may require an invocation from a host program on the general purpose system parts.

Design space exploration: a process which refers to the search for Pareto-optimal designs. For accelerators, it shows the relationship between the measured objectives of the architecture (like area, power, and performance) for a range of parameter values that each represent particular architecture and compiler design choices.

FPGA: is an integrated circuit containing logic blocks and reconfigurable interconnects, which is designed to be configured by the customer/designer after manufacturing, for realizing simple and complex combinational functions.

Evolutionary algorithm: is a generic population-based meta-heuristic optimization algorithm. The objective is to converge to the true Pareto front of a multi-objective optimization problem, which normally consists of a diverse set of points.

Latency: of a pipelined loop nest execution is the time difference between the start of the first operation of the first loop iteration and the end of the last operation of the last loop iteration.

Memory map: is a data structure that contains the information on the memory space, regarding the size of the total memory, reserved regions, and the address space assigned to the different system components.

Moore's Law: proposed in 1965, states that the number of transistors on a chip will double about every two years (which was later modified to 18 months). The popular version is that processor performance doubles every two years.

Pareto optimal accelerator: a decision vector x ∈ X is said to be non-dominated regarding a set A ⊆ X iff ∄ a ∈ A : a dominates x. Moreover, x is said to be Pareto-optimal iff x is non-dominated regarding X. Let x = (cost, throughput) ∈ X denote a decision vector and X the decision space of all vectors x. For any two decision vectors x1, x2 ∈ X, x1 dominates x2 if and only if (cost(x1) < cost(x2) ∧ throughput(x1) ≥ throughput(x2)) ∨ (cost(x1) ≤ cost(x2) ∧ throughput(x1) > throughput(x2)).


Pareto Front: for a given multi-objective problem F(x) and Pareto optimal set P, the Pareto front PF is defined as: PF = { u = F(x) | x ∈ P }.

Polytope model: is a framework for loop nest optimization and implementation, where the iteration spaces of loops are modeled as Z-polytopes.

Processor array: is a type of integrated circuit which has a massively parallel array of hundreds of processing elements interconnected to each other in a grid network.

Single assignment: is a representation in which one cannot bind a value to a variable if a value is already assigned to that variable.

SoC: stands for "systems-on-a-chip" and refers to the integration of heterogeneous system components like processors, buses, accelerators, and others on a single chip.

Software driver: is a program running on a processor, which allows processor programs to interact with the hardware accelerator device.

B. Hermite Normal Form

Given a square non-singular integer matrix A ∈ Z^(n×n), there exists a unimodular matrix U ∈ Z^(n×n) and a matrix H ∈ Z^(n×n), known as the Hermite normal form (HNF) of A, such that H = AU. The entries of H satisfy:

1. H is upper right triangular, that is, h_(i,j) = 0 for all i > j,

2. h_(i,i) > 0 for all i, and

3. h_(i,i) > h_(i,j) ≥ 0 for all i < j.

The right multiplication of A by a unimodular matrix U corresponds to a sequence of the following elementary column operations:

• Interchange two columns.
• Multiply a column by −1.
• Add an integral multiple of one column to another.

The Hermite normal form can be found in polynomial time by using a sequence of the above defined elementary column operations [156].

Figure B.1.: The lattice generated by the columns of R = ( 1 2 ; 1 -1 ) is denoted by the filled points, e.g., (0,0), (1,1), and (3,0).

For example, the lattice in Figure B.1 is generated by the matrix R = ( 1 2 ; 1 -1 ), and its Hermite normal form is H = ( 3 1 ; 0 1 ). The matrix H can be considered as corresponding to the tiles in Figure B.1: the diagonal elements determine the size of the tiles, and the non-diagonal elements determine the offset of the tiles [74].
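For illustration, the column-operation procedure can be sketched in a few lines of Python. This naive version is our own and omits the safeguards that keep intermediate values polynomially bounded [156]; it reproduces H for the example matrix R.

def hnf(A):
    # Column-operation Hermite normal form of a square non-singular integer
    # matrix: upper right triangular, positive diagonal, and reduced
    # off-diagonal entries (0 <= h[i][j] < h[i][i] for j > i).
    H = [row[:] for row in A]
    n = len(H)

    def add_col(dst, src, k):  # column dst += k * column src
        for r in range(n):
            H[r][dst] += k * H[r][src]

    for i in range(n - 1, -1, -1):        # process rows bottom-up
        for j in range(i):                # Euclid: zero out row i left of col i
            while H[i][j] != 0:
                if H[i][i] != 0:
                    add_col(j, i, -(H[i][j] // H[i][i]))
                if H[i][j] != 0:          # move the smaller entry onto col i
                    for r in range(n):
                        H[r][i], H[r][j] = H[r][j], H[r][i]
        if H[i][i] < 0:                   # make the diagonal entry positive
            for r in range(n):
                H[r][i] = -H[r][i]
        for j in range(i + 1, n):         # reduce entries right of the diagonal
            add_col(j, i, -(H[i][j] // H[i][i]))
    return H

print(hnf([[1, 2], [1, -1]]))             # [[3, 1], [0, 1]]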

C. Loop Benchmarks

Cyclic Redundancy Check (CRC)

/* Algorithm: CRC32
 * Source: Hackers Delight, chapter 14 */
include("examples/Architecture/shifter_32.arch.paro")
include("examples/Architecture/alu_32.arch.paro")
program crc32 {
  variable mesg 2 in unsigned integer<32>;
  variable init_crc 2 in unsigned integer<32>;
  variable crc 2 unsigned integer<32>;
  variable crc_in 2 unsigned integer<32>;
  variable byte 2 unsigned integer<32>;
  variable byte_in 2 unsigned integer<32>;
  variable z 2 unsigned integer<32>;
  variable y 2 unsigned integer<32>;
  variable cond 2 integer<32>;
  variable crc_out 2 out unsigned integer<32>;
  parameter I1 = 8; // Number of Mesg
  parameter N = 1;
  par (i >= 0 and i <= N-1 and i1 >= 0 and i1 <= I1-1) {
    byte[i,i1] = mesg[i,0] if (i1 == 0);
    byte[i,i1] = byte[i,i1-1] << 1 if (i1 > 0);
    crc_in[i,i1] = init_crc[i,i1] if (i1 == 0);
    crc_in[i,i1] = cast<unsigned integer<32> >(crc[i,i1-1]) if (i1 > 0);
    // input is already 32-bit reversed
    z[i,i1] = cast<unsigned integer<32> >(crc_in[i,i1] << 1);
    cond[i,i1] = ((cast<integer<32> >(crc_in[i,i1])) ^ (cast<integer<32> >(byte[i,i1])));
    y[i,i1] = cast<unsigned integer<32> >(cast<unsigned integer<32> >(z[i,i1]) ^ cast<unsigned integer<32> >(0x04C11DB7));
    crc[i,i1] = ifrt(cond[i,i1] < 0, y[i,i1], z[i,i1]);
    crc_out[i,i1] = crc[i,i1] if (i1 == I1-1);
  }
}


Complex Matrix-Matrix Multiplication (CMM)

/* Algorithm: Complex Matrix Multiplication, partitioned */
include("examples/Architecture/adder_16.arch.paro")
include("examples/Architecture/multiplier_16x16_16.arch.paro")
include("examples/Architecture/subtractor_16.arch.paro")
include("examples/Architecture/shifter_32.arch.paro")
include("examples/Architecture/alu_32.arch.paro")
program matmul {
  variable A 6 in signed integer<32>;
  variable B 6 in signed integer<32>;
  variable C 6 out signed integer<32>;
  variable tempA 6 integer<32>;
  variable tempB 6 integer<32>;
  variable tempC 6 integer<32>;
  variable tempCr 6 integer<32>;
  variable ar 6 integer<16>;
  variable ai 6 integer<16>;
  variable br 6 integer<16>;
  variable bi 6 integer<16>;
  variable bi_tmp 6 integer<32>;
  variable cr 6 integer<16>;
  variable ci 6 integer<16>;
  variable zr 6 integer<16>;
  variable zi 6 integer<16>;
  parameter I1 = 64;
  parameter J1 = 64;
  parameter K1 = 64;
  parameter I2 = 1;
  parameter J2 = 1;
  parameter K2 = 1;
  par (i1 >= 0 and j1 >= 0 and k1 >= 0 and i1 <= I1-1 and j1 <= J1-1 and k1 <= K1-1 and i2 >= 0 and i2 <= I2-1 and j2 >= 0 and j2 <= J2-1 and k2 >= 0 and k2 <= K2-1) {
    tempA[i1,j1,k1,i2,j2,k2] = A[i1,0,k1,i2,0,k2] if (j1 == 0 and j2 == 0);
    tempB[i1,j1,k1,i2,j2,k2] = B[0,j1,k1,0,j2,k2] if (i1 == 0 and i2 == 0);
    // get low 16 bits
    ar[i1,j1,k1,i2,j2,k2] = cast<integer<16> >(tempA[i1,j1,k1,i2,j2,k2] >> 16) if (j1 == 0 and j2 == 0);
    // get high 16 bits
    ai[i1,j1,k1,i2,j2,k2] = cast<integer<16> >(tempA[i1,j1,k1,i2,j2,k2] & 0x0000FFFF) if (j1 == 0 and j2 == 0);
    // get low 16 bits
    br[i1,j1,k1,i2,j2,k2] = cast<integer<16> >(tempB[i1,j1,k1,i2,j2,k2] >> 16) if (i1 == 0 and i2 == 0);
    // get high 16 bits
    bi_tmp[i1,j1,k1,i2,j2,k2] = tempB[i1,j1,k1,i2,j2,k2] & cast<integer<32> >(0x0000FFFF) if (i1 == 0 and i2 == 0);
    bi[i1,j1,k1,i2,j2,k2] = cast<integer<16> >(bi_tmp[i1,j1,k1,i2,j2,k2]) if (i1 == 0 and i2 == 0);
    // real part A
    ar[i1,j1,k1,i2,j2,k2] = ar[i1,j1-1,k1,i2,j2,k2] if (j1 > 0);
    ar[i1,j1,k1,i2,j2,k2] = ar[i1,j1+J1-1,k1,i2,j2-1,k2] if (j1 == 0 and j2 > 0);
    // imag part A
    ai[i1,j1,k1,i2,j2,k2] = ai[i1,j1-1,k1,i2,j2,k2] if (j1 > 0);
    ai[i1,j1,k1,i2,j2,k2] = ai[i1,j1+J1-1,k1,i2,j2-1,k2] if (j1 == 0 and j2 > 0);
    // real part B
    br[i1,j1,k1,i2,j2,k2] = br[i1-1,j1,k1,i2,j2,k2] if (i1 > 0);
    br[i1,j1,k1,i2,j2,k2] = br[i1+I1-1,j1,k1,i2-1,j2,k2] if (i1 == 0 and i2 > 0);
    // image part B
    bi[i1,j1,k1,i2,j2,k2] = bi[i1-1,j1,k1,i2,j2,k2] if (i1 > 0);
    bi[i1,j1,k1,i2,j2,k2] = bi[i1+I1-1,j1,k1,i2-1,j2,k2] if (i1 == 0 and i2 > 0);
    // real part C
    zr[i1,j1,k1,i2,j2,k2] = ar[i1,j1,k1,i2,j2,k2] * br[i1,j1,k1,i2,j2,k2] - ai[i1,j1,k1,i2,j2,k2] * bi[i1,j1,k1,i2,j2,k2];
    zi[i1,j1,k1,i2,j2,k2] = ai[i1,j1,k1,i2,j2,k2] * br[i1,j1,k1,i2,j2,k2] + ar[i1,j1,k1,i2,j2,k2] * bi[i1,j1,k1,i2,j2,k2];
    // image part C
    cr[i1,j1,k1,i2,j2,k2] = cr[i1,j1,k1-1,i2,j2,k2] + zr[i1,j1,k1,i2,j2,k2] if (k1 > 0);
    ci[i1,j1,k1,i2,j2,k2] = ci[i1,j1,k1-1,i2,j2,k2] + zi[i1,j1,k1,i2,j2,k2] if (k1 > 0);
    cr[i1,j1,k1,i2,j2,k2] = zr[i1,j1,k1,i2,j2,k2] if (k1 == 0 and k2 == 0);
    ci[i1,j1,k1,i2,j2,k2] = zi[i1,j1,k1,i2,j2,k2] if (k1 == 0 and k2 == 0);
    cr[i1,j1,k1,i2,j2,k2] = cr[i1,j1,k1+K1-1,i2,j2,k2-1] + zr[i1,j1,k1,i2,j2,k2] if (k1 == 0 and k2 > 0);
    ci[i1,j1,k1,i2,j2,k2] = ci[i1,j1,k1+K1-1,i2,j2,k2-1] + zi[i1,j1,k1,i2,j2,k2] if (k1 == 0 and k2 > 0);
    // merge real and image parts of C
    tempCr[i1,j1,k1,i2,j2,k2] = (cast<integer<32> >(cr[i1,j1,k1,i2,j2,k2])) << 16;
    tempC[i1,j1,k1,i2,j2,k2] = tempCr[i1,j1,k1,i2,j2,k2] | (cast<integer<32> >(ci[i1,j1,k1,i2,j2,k2]));
    C[i1,j1,k1,i2,j2,k2] = tempC[i1,j1,k1,i2,j2,k2] if (k1 == K1-1 and k2 == K2-1);
  }
}

Discrete Wavelet Transform (DWT)

/* $Id: DWT.paro $
 * Algorithm: 1) DWT with boundary treatment
 *            2) and easily readable (yet antiparallel) dependencies
 *               requiring index shift.
 *            3) Source: Najjar, ROCCC
 * Status: successfully simulated (functional, synthesized RTL)
 */
include("examples/Architecture/adder_16.arch.paro")
include("examples/Architecture/subtractor_16.arch.paro")
include("examples/Architecture/multiplier_16x16_16.arch.paro")
include("examples/Architecture/comparator_16.arch.paro")
//allocation adder_16 4;
//allocation multiplier_16x16_16 1;
//allocation subtractor_16 3;
//allocation comparator_16 3;
program Sobel {
  variable pi 2 in integer<16>;
  variable p 2 integer<16>;
  variable p_0 2 integer<16>;  // x-1, y-1
  variable p_1 2 integer<16>;  // x, y-1
  variable p_2 2 integer<16>;  // x+1, y-1
  variable p_3 2 integer<16>;  // x-1, y
  variable p_4 2 integer<16>;  // x, y
  variable p_5 2 integer<16>;  // x+1, y
  variable p_6 2 integer<16>;  // x-1, y+1
  variable p_7 2 integer<16>;  // x, y+1
  variable p_8 2 integer<16>;  // x+1, y+1
  variable p_9 2 integer<16>;  // x-1, y
  variable p_10 2 integer<16>; // x, y
  variable p_11 2 integer<16>; // x+1, y
  variable p_12 2 integer<16>; // x-1, y+1
  variable p_13 2 integer<16>; // x, y+1
  variable p_14 2 integer<16>; // x+1, y+1
  variable sum 2 integer<16>;
  variable s 2 integer<16>;
  variable po 2 out integer<16>;
  parameter IMGSIZE_X = 512;
  parameter IMGSIZE_Y = 512;
  par (x >= 0 and x <= IMGSIZE_X-1 and y >= 0 and y <= IMGSIZE_Y-1) {
    /* input */
    p[x,y] = pi[x,y];
    /* collect pixels for convolution, boundary treatment */
    // inner image
    p_0[x,y] = p[x-2,y-4] if (x>2 and x4 and y2 and x4 and y2 and x4 and y2 and x4 and y2 and x4 and y2 and x4 and y2 and x4 and y2 and x4 and y2 and x4 and y2 and x4 and y2 and x4 and y2 and x4 and y2 and x4 and y2 and x4 and y2 and x4 and y
    // border pixels
    p_0[x,y] = 0 if (x>=0 and x<3 and y>=0 and y<5);
    p_1[x,y] = 0 if (x>=0 and x<3 and y>=0 and y<5);
    p_2[x,y] = 0 if (x>=0 and x<3 and y>=0 and y<5);
    p_3[x,y] = 0 if (x>=0 and x<3 and y>=0 and y<5);
    p_4[x,y] = 0 if (x>=0 and x<3 and y>=0 and y<5);
    p_5[x,y] = 0 if (x>=0 and x<3 and y>=0 and y<5);
    p_6[x,y] = 0 if (x>=0 and x<3 and y>=0 and y<5);
    p_7[x,y] = 0 if (x>=0 and x<3 and y>=0 and y<5);
    p_8[x,y] = 0 if (x>=0 and x<3 and y>=0 and y<5);
    p_9[x,y] = 0 if (x>=0 and x<3 and y>=0 and y<5);
    p_10[x,y] = 0 if (x>=0 and x<3 and y>=0 and y<5);
    p_11[x,y] = 0 if (x>=0 and x<3 and y>=0 and y<5);
    p_12[x,y] = 0 if (x>=0 and x<3 and y>=0 and y<5);
    p_13[x,y] = 0 if (x>=0 and x<3 and y>=0 and y<5);
    p_14[x,y] = 0 if (x>=0 and x<3 and y>=0 and y<5);
    /* convolution */
    sum[x,y] = 6*p_0[x,y] + 6*p_1[x,y] + 6*p_2[x,y] + 2*p_3[x,y] + 2*p_4[x,y] + 2*p_5[x,y] + -1*p_6[x,y] + -1*p_7[x,y] + -1*p_8[x,y] + 8*p_9[x,y] + 8*p_10[x,y] + 8*p_11[x,y] + -4*p_12[x,y] + -4*p_13[x,y] + -4*p_14[x,y];
    /* sum */
    s[x,y] = sum[x,y] << 3;
    /* output */
    po[x,y] = s[x,y];
  }
}

Downsampler (DS)

program downsampler_filter_block {
  typealias input_t integer<16>;
  typealias output_t integer<16>;
  typealias internal_t integer<16>;
  // input images
  variable g_in0_0 2 in input_t;
  variable g_in0_1 2 in input_t;
  variable g_in1_0 2 in input_t;
  variable g_in1_1 2 in input_t;
  variable g_in_tmp0_0 2 internal_t;
  variable g_in_tmp0_1 2 internal_t;
  variable g_in_tmp1_0 2 internal_t;
  variable g_in_tmp1_1 2 internal_t;
  // 3x3 window of input images
  variable g 4 internal_t;
  variable hx 2 internal_t;
  variable hy 2 internal_t;
  variable f_out 2 out output_t;
  // image size
  parameter INPUT_IMGSIZE_X = 512;
  parameter INPUT_IMGSIZE_Y = 512;
  parameter IMGSIZE_X = 256; #INPUT_IMGSIZE_X/2;
  parameter IMGSIZE_Y = 256; #INPUT_IMGSIZE_Y/2;
  pb_main: par (x >= 0 and x <= IMGSIZE_X-1 and y >= 0 and y <= IMGSIZE_Y-1) {
    /* collect pixels of g_in for convolution, boundary treatment */
    g_in_tmp0_0[x,y] = cast<internal_t>(g_in0_0[x,y]);
    g_in_tmp0_1[x,y] = cast<internal_t>(g_in0_1[x,y]);
    g_in_tmp1_0[x,y] = cast<internal_t>(g_in1_0[x,y]);
    g_in_tmp1_1[x,y] = cast<internal_t>(g_in1_1[x,y]);
    // inner image
    g[x,y,0,0] = g_in_tmp1_1[x-1,y-1] if (x>0 and y>0);
    g[x,y,0,1] = g_in_tmp0_1[x ,y-1] if (x>0 and y>0);
    g[x,y,0,2] = g_in_tmp1_1[x ,y-1] if (x>0 and y>0);
    g[x,y,1,0] = g_in_tmp1_0[x-1,y ] if (x>0 and y>0);
    g[x,y,1,1] = g_in_tmp0_0[x ,y ] if (x>0 and y>0);
    g[x,y,1,2] = g_in_tmp1_0[x ,y ] if (x>0 and y>0);
    g[x,y,2,0] = g_in_tmp1_1[x-1,y ] if (x>0 and y>0);
    g[x,y,2,1] = g_in_tmp0_1[x ,y ] if (x>0 and y>0);
    g[x,y,2,2] = g_in_tmp1_1[x ,y ] if (x>0 and y>0);
    // left border
    g[x,y,0,0] = g_in_tmp1_1[x ,y-1] if (x==0 and y>0);
    g[x,y,0,1] = g_in_tmp0_1[x ,y-1] if (x==0 and y>0);
    g[x,y,0,2] = g_in_tmp1_1[x ,y-1] if (x==0 and y>0);
    g[x,y,1,0] = g_in_tmp1_0[x ,y ] if (x==0 and y>0);
    g[x,y,1,1] = g_in_tmp0_0[x ,y ] if (x==0 and y>0);
    g[x,y,1,2] = g_in_tmp1_0[x ,y ] if (x==0 and y>0);
    g[x,y,2,0] = g_in_tmp1_1[x ,y ] if (x==0 and y>0);
    g[x,y,2,1] = g_in_tmp0_1[x ,y ] if (x==0 and y>0);
    g[x,y,2,2] = g_in_tmp1_1[x ,y ] if (x==0 and y>0);
    // upper border
    g[x,y,0,0] = g_in_tmp1_1[x-1,y ] if (x>0 and y==0);
    g[x,y,0,1] = g_in_tmp0_1[x ,y ] if (x>0 and y==0);
    g[x,y,0,2] = g_in_tmp1_1[x ,y ] if (x>0 and y==0);
    g[x,y,1,0] = g_in_tmp1_0[x-1,y ] if (x>0 and y==0);
    g[x,y,1,1] = g_in_tmp0_0[x ,y ] if (x>0 and y==0);
    g[x,y,1,2] = g_in_tmp1_0[x ,y ] if (x>0 and y==0);
    g[x,y,2,0] = g_in_tmp1_1[x-1,y ] if (x>0 and y==0);
    g[x,y,2,1] = g_in_tmp0_1[x ,y ] if (x>0 and y==0);
    g[x,y,2,2] = g_in_tmp1_1[x ,y ] if (x>0 and y==0);
    // upper left corner
    g[x,y,0,0] = g_in_tmp1_1[x ,y ] if (x==0 and y==0);
    g[x,y,0,1] = g_in_tmp0_1[x ,y ] if (x==0 and y==0);
    g[x,y,0,2] = g_in_tmp1_1[x ,y ] if (x==0 and y==0);
    g[x,y,1,0] = g_in_tmp1_0[x ,y ] if (x==0 and y==0);
    g[x,y,1,1] = g_in_tmp0_0[x ,y ] if (x==0 and y==0);
    g[x,y,1,2] = g_in_tmp1_0[x ,y ] if (x==0 and y==0);
    g[x,y,2,0] = g_in_tmp1_1[x ,y ] if (x==0 and y==0);
    g[x,y,2,1] = g_in_tmp0_1[x ,y ] if (x==0 and y==0);
    g[x,y,2,2] = g_in_tmp1_1[x ,y ] if (x==0 and y==0);

    hy[x,y] = 1*g[x,y,0,0] + 2*g[x,y,0,1] + 1*g[x,y,0,2] + 2*g[x,y,1,0] + 4*g[x,y,1,1] + 2*g[x,y,1,2] + 1*g[x,y,2,0] + 2*g[x,y,2,1] + 1*g[x,y,2,2];

    f_out[x,y] = cast<output_t>(hy[x,y]);
  }
}

Discrete Cosine Transform (DCT)

program DCT_stage2 {
  typealias input_t integer<8>;
  typealias output_t integer<8>;
  variable x0 2 in input_t;
  variable x1 2 in input_t;
  variable x2 2 in input_t;
  variable x3 2 in input_t;
  variable x4 2 in input_t;
  variable x5 2 in input_t;
  variable x6 2 in input_t;
  variable x7 2 in input_t;
  variable z0 2 out output_t;
  variable z1 2 out output_t;
  variable z2 2 out output_t;
  variable z3 2 out output_t;
  variable z4 2 out output_t;
  variable z5 2 out output_t;
  variable z6 2 out output_t;
  variable z7 2 out output_t;
  variable P01 2 integer<16>;
  variable P02 2 integer<16>;
  variable P11 2 integer<16>;
  variable P12 2 integer<16>;
  variable P21 2 integer<16>;
  variable P22 2 integer<16>;
  variable P31 2 integer<16>;
  variable P32 2 integer<16>;
  // zi[k,0]: row k, column i of result matrix
  par (k >= 0 and k <= 7 and l == 0) {
    P01[k,l] = 23170*x0[k,l] + 30274*x2[k,l] + 23170*x4[k,l] + 12540*x6[k,l];
    P02[k,l] = 32138*x1[k,l] + 27246*x3[k,l] + 18205*x5[k,l] + 6393*x7[k,l];
    P11[k,l] = 23170*x0[k,l] + 12540*x2[k,l] - 23170*x4[k,l] - 30274*x6[k,l];
    P12[k,l] = 27246*x1[k,l] - 6393*x3[k,l] - 32138*x5[k,l] - 18205*x7[k,l];
    P21[k,l] = 23170*x0[k,l] - 12540*x2[k,l] - 23170*x4[k,l] + 30274*x6[k,l];
    P22[k,l] = 18205*x1[k,l] - 32138*x3[k,l] + 6393*x5[k,l] + 27246*x7[k,l];
    P31[k,l] = 23170*x0[k,l] - 30274*x2[k,l] + 23170*x4[k,l] - 12540*x6[k,l];
    P32[k,l] = 6393*x1[k,l] - 18205*x3[k,l] + 27246*x5[k,l] - 32138*x7[k,l];
    z0[k,l] = cast<output_t>(P01[k,l] + P02[k,l]);
    z1[k,l] = cast<output_t>(P11[k,l] + P12[k,l]);
    z2[k,l] = cast<output_t>(P21[k,l] + P22[k,l]);
    z3[k,l] = cast<output_t>(P31[k,l] + P32[k,l]);
    z4[k,l] = cast<output_t>(P31[k,l] - P32[k,l]);
    z5[k,l] = cast<output_t>(P21[k,l] - P22[k,l]);
    z6[k,l] = cast<output_t>(P11[k,l] - P12[k,l]);
    z7[k,l] = cast<output_t>(P01[k,l] - P02[k,l]);
  }
}

Finite Impulse Response (FIR)

include("examples/Architecture/adder_16.arch.paro")
include("examples/Architecture/multiplier_16x16_16.arch.paro")
program FIR {
  typealias coeff_t integer<16>;
  typealias input_t integer<16>;
  typealias prod_t integer<16>;
  typealias output_t integer<16>;
  variable a_in 2 in coeff_t;
  variable u_in 2 in input_t;
  variable a 2 coeff_t;
  variable u 2 input_t;
  variable x 2 prod_t;
  variable y 2 output_t;
  variable y_out 2 out output_t;
  parameter N = 5;
  parameter T = 10;
  par (i >= 0 and i <= T-1 and j >= 0 and j <= N-1) {
    a[i,j] = a_in[0,j] if (i == 0);
    a[i,j] = a[i-1,j] if (i > 0);
    u[i,j] = u_in[i,0] if (j == 0);
    u[i,j] = 0 if (i == 0 and j > 0);
    u[i,j] = u[i-1,j-1] if (i > 0 and j > 0);
    x[i,j] = a[i,j] * u[i,j];
    y[i,j] = x[i,j] if (j == 0);
    y[i,j] = y[i,j-1] + x[i,j] if (j > 0);
    y_out[i,j] = y[i,j] if (j == N-1);
  }
}

Smith-Waterman Algorithm (SW)

program smithWaterman {
  variable s 2 integer<16>;
  variable t 2 integer<16>;
  variable A 2 integer<16>;
  variable A1 2 integer<16>;
  variable A2 2 integer<16>;
  variable A3 2 integer<16>;
  variable A4 2 integer<16>;
  variable S 2 in integer<16>;
  variable T 2 in integer<16>;
  variable Vout 2 out integer<16>;
  variable V 2 integer<16>;
  parameter N = 8;
  parameter M = 8;
  par (x >= 0 and x <= N-1 and y >= 0 and y <= M-1) {
    // Input: Embedding
    s[x,y] = S[x,0] if (y == 0);
    t[x,y] = T[0,y] if (x == 0);
    V[x,y] = 0 if (x < 1 and y < 1);
    // localization
    s[x,y] = s[x,y-1] if (y > 0);
    t[x,y] = t[x-1,y] if (x > 0);
    A[x,y] = s[x,y] - t[x,y];
    // if condition
    A1[x,y] = ifrt(A[x,y] == 0, V[x-1,y-1], V[x-1,y-1] + 100);
    A2[x,y] = A1[x,y] - V[x-1,y] - 10;
    A3[x,y] = ifrt(A2[x,y] > 0, A1[x,y], V[x-1,y] + 10);
    A4[x,y] = A3[x,y] - V[x,y-1] - 100;
    V[x,y] = ifrt(A4[x,y] > 0, A3[x,y], V[x,y-1] + 100) if (x > 0 and y > 0);
    Vout[x,y] = V[x,y];
  }
}


German Part

German Title and Summary

Synthese und Exploration von Schleifenbeschleunigern für Ein-Chip-Systeme

A "system-on-a-chip (SoC)" denotes the integration of different components, such as conventional processors together with programmable or non-programmable loop accelerators, on a single piece of silicon. Many computationally intensive loop programs, e.g., from the domains of signal processing or image processing, can be executed efficiently on loop accelerators. Owing to these accelerators, SoC platforms can optimize the performance, area cost, and power consumption of embedded applications and reduce them by orders of magnitude. Although such SoCs can be realized on FPGAs or ASICs, the effort for realizing the loop accelerators is still very high. To remedy this, this work presents the PARO design methodology. In essence, it is a compiler that generates an RTL description of the loop accelerators in the form of processor arrays. Beyond that, this dissertation presents novel contributions in the areas of compiler transformations, synthesis, and design space exploration of loop accelerators.

Tiling is a well-known compiler transformation. This dissertation presents hierarchical tiling, a transformation that restructures the loop descriptions in the high-level language in order to exploit multiple levels of parallelism and memory in an accelerator architecture. The transformation not only partitions the iteration space of the loops, but also accounts for the new data dependencies between the multiple hierarchies of tiles and adapts the control conditions accordingly. With this transformation, the system designer is able to specify the degree of parallelism (number of PEs), the local memory usage, and the required communication bandwidth of the accelerator architecture. Other design methodologies contain transformations that can take only one of these criteria into account. The transformation presented in this work produces a loop program that contains many more control conditions, which can cause a performance bottleneck. For this reason, a generic method for control generation was presented, which combines local and global control facilities for an efficient execution of the loops. In addition, a matching memory architecture with multiple memory banks, address generators, and I/O controllers is generated automatically, so that communication does not become the bottleneck. With all these techniques, one is able to generate loop accelerators that achieve average gains of factors of 2.5, 4.5, and 50 in terms of area, power, and performance over embedded processors. The presented design flow leads to a 100-fold increase in productivity, owing to the ease of programming in a high-level language instead of cumbersome RTL coding in VHDL.

Typical streaming applications contain several communicating loops. For a modular representation of the parallelism of communicating loops and their mapping information as a dependence graph in the polyhedral model, the so-called loop graph was developed. A unified methodology for the projection of a loop graph in the polyhedral model onto the so-called "windowed synchronous data flow (WSDF)" model was presented. The so-called multi-dimensional FIFOs for data transfer and synchronization between the loop accelerators were then derived from the WSDF parameters. Hence, the communication hardware is not the bottleneck of an accelerator chain consisting of several communicating loop accelerators and multi-dimensional FIFOs. The accelerators can also be used as co-processors in an SoC. To support the integration of accelerators in an SoC, a methodology for the automated generation of the memory map, a driver software, and a hardware interface was presented. The software drivers are the processor programs that are responsible for the data transfer and synchronization to/from the accelerators. The interface implements the protocol conversion of signals for the integration of accelerators over a system bus. Experimental results further show that the performance gain of a hardware/software co-design is an order of magnitude lower than that of a pure hardware implementation. This is due to the large hardware/software communication overhead. Nevertheless, the accelerators offer a performance advantage of 2- to 20-fold over pure software solutions and, moreover, a considerably lower power consumption.

Because of the multitude of architecture and compiler design decisions, the selection of an optimal architecture can be very laborious: the complete exploration of the design space requires very long run times, which is why we employ modern meta search heuristics, such as evolutionary algorithms, and the efficient estimation of objectives such as area cost and power consumption. This not only reduces the run time of the exploration for relevant accelerator benchmarks, but also delivers better results than random search. For given system constraints, a suitable accelerator must be selected from the Pareto-optimal set of designs. The proposed analytical method for determining the best-fit accelerator is based on real-time calculus. In summary, this work provides important contributions to the realization of loop accelerators.


Bibliography

[1] Santosh G. Abraham and B. R. Rau. Efficient design space exploration in PICO. In CASES ’00: Proceedings of the 2000 international conference on Compilers, Architecture, and Synthesis for embedded systems, pages 71–79, New York, NY, USA, 2000. ACM.

[2] Giovanni Agosta, Gianluca Palermo, and Cristina Silvano. Efficient architec- ture/compiler co-exploration using analytical models. Design Automation for Embedded Systems, 11(1):1–23, 2007.

[3] Altera. White paper: Implementation of the Smith-Waterman algorithm on a reconfigurable supercomputing platform. Whitepaper, September 2007. www.altera.com/literature/wp/wp-01035.pdf.

[4] Hideharu Amano. A Survey on Dynamically Reconfigurable Processors. IEICE Transactions on Communications, E89-B:3179–3187, 2006.

[5] Abdelkader Amar, Pierre Boulet, and Philippe Dumont. Projection of the Array-OL Specification Language onto the Kahn Process Network Computation Model. In International Symposium on Parallel Architectures, Algorithms, and Networks, pages 496–503, 2005.

[6] Corinne Ancourt and François Irigoin. Scanning polyhedra with do loops. In Proceedings of the third ACM SIGPLAN symposium on Principles and practice of parallel programming, PPOPP '91, pages 39–50, New York, NY, USA, 1991. ACM.

[7] Krste Asanovic, Rastislav Bodik, James Demmel, Tony Keaveny, Kurt Keutzer, John Kubiatowicz, Nelson Morgan, David Patterson, Koushik Sen, John Wawrzynek, David Wessel, and Katherine Yelick. A view of the parallel computing landscape. Communications of the ACM, 52:56–67, October 2009.

[8] D. Avis. lrs: A Revised Implementation of the Reverse Search Vertex Enumeration Algorithm. In G. Kalai and G. Ziegler, editors, Polytopes – Combinatorics and Computation, pages 177–198. Birkhäuser-Verlag, DMV Seminar Band 29, 2000.


[9] P. Banerjee, V. Saxena, J. Uribe, M. Haldar, A. Nayak, V. Kim, D. Bagchi, S. Pal, N. Tripathi, and R. Anderson. Making area-performance tradeoffs at the high level using the AccelFPGA compiler for FPGAs. In FPGA '03: Proceedings of the 2003 ACM/SIGDA eleventh international symposium on Field programmable gate arrays, pages 237–243, New York, NY, USA, 2003. ACM.

[10] Utpal K. Banerjee. Dependence Analysis for Supercomputing. Kluwer Academic Publishers, Norwell, MA, USA, 1988.

[11] Kevin J. Barker, Kei Davis, Adolfy Hoisie, Darren J. Kerbyson, Mike Lang, Scott Pakin, and Jose C. Sancho. Entering the petaflop era: the architecture and performance of Roadrunner. In SC ’08: Proceedings of the 2008 ACM/IEEE conference on Supercomputing, pages 1–11, Piscataway, NJ, USA, 2008. IEEE Press.

[12] Muthu Manikandan Baskaran, Uday Bondhugula, Sriram Krishnamoorthy, J. Ramanujam, Atanas Rountev, and P. Sadayappan. A compiler framework for optimization of affine loop nests for GPGPUs. In ICS ’08: Proceedings of the 22nd annual international conference on Supercomputing, pages 225–234, New York, NY, USA, 2008. ACM.

[13] C. Bastoul. Code generation in the polyhedral model is easier than you think. In PACT'13: IEEE International Conference on Parallel Architecture and Compilation Techniques, pages 7–16, September 2004.

[14] V. Baumgarte, G. Ehlers, Frank May, A. Nückel, Martin Vorbach, and Markus Weinhardt. PACT XPP – A Self-Reconfigurable Data Processing Architecture. The Journal of Supercomputing, 26(2):167–184, 2003.

[15] M. Bednara and J. Teich. Interface Synthesis for FPGA Based VLSI Processor Arrays. In Proc. of the International Conference on Engineering of Reconfigurable Systems and Algorithms (ERSA 02), pages 74–80, Las Vegas, Nevada, U.S.A., June 2002.

[16] Marcus Bednara. Design Automation for Massively Parallel Processor Arrays: Transforming Regular Algorithms to Reconfigurable Hardware. PhD thesis, University of Erlangen-Nuremberg, Department of Computer Science 12, Erlangen, Germany, 2004.

[17] Luca Benini, Davide Bertozzi, Davide Bruni, Nicola Drago, Franco Fummi, and Massimo Poncino. SystemC Cosimulation and Emulation of Multiprocessor SoC Designs. Computer, 36(4):53–59, 2003.

200 Bibliography

[18] Greet Bilsen, Marc Engels, Rudy Lauwereins, and Jean Peperstraete. Cyclo-static dataflow. IEEE Transactions on Signal Processing, 44(2):397–408, February 1996.

[19] A. P. W. Bohm, B. Draper, W. Najjar, J. Hammes, R. Rinker, M. Chawathe, and C. Ross. One-Step Compilation of Image Processing Applications to FPGAs. In FCCM '01: Proceedings of the 9th Annual IEEE Symposium on Field-Programmable Custom Computing Machines, pages 209–218, Washington, DC, USA, 2001. IEEE Computer Society.

[20] Egor Bondarev, Michel R. V. Chaudron, and Erwin A. de Kock. Exploring Performance Trade-offs of a JPEG Decoder using the DeepCompass Framework. In Proceedings of the International Workshop on Software and Performance (WOSP), pages 153–163, Buenos Aires, Argentina, 2007.

[21] Uday Bondhugula, Albert Hartono, J. Ramanujam, and P. Sadayappan. A practical automatic polyhedral program optimization system. In ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), pages 101–113, June 2008.

[22] P. Boulet, J.-L. Dekeyser, J.-L. Levaire, P. Marquet, J. Soula, and A. Demeure. Visual data-parallel programming for signal processing applications. In 9th Euromicro Workshop on Parallel and Distributed Processing, PDP 2001, pages 105–112, 2001.

[23] Pierre Boulet and Paul Feautrier. Scanning polyhedra without do-loops. In Proceedings of the 1998 International Conference on Parallel Architectures and Compilation Techniques, PACT ’98, pages 4–11, Washington, DC, USA, 1998. IEEE Computer Society.

[24] Jichun Bu and Ed F. Deprettere. Processor clustering for the design of optimal fixed-size systolic arrays. In Valero, Kung, Lang, and Fortes, editors, ASAP: IEEE conference on Application Specific Array Processor, pages 402–413. IEEE Computer Society Press, 1991.

[25] Joseph Buck, Soonhoi Ha, Edward A. Lee, and David G. Messerschmitt. Ptolemy: a framework for simulating and prototyping heterogeneous systems. In Readings in hardware/software co-design, pages 527–543, Norwell, MA, USA, 2002. Kluwer Academic Publishers.

[26] Cadence Design Systems, Inc. C-to-Silicon Compiler, 2009. http://www.cadence.com/products/sd/silicon_compiler.

[27] F. Catthoor, K. Danckaert, C. Kulkarni, E. Brockmeyer, P. G. Kjeldsberg, T. Van Achteren, and T. Omnes. Data access and storage management for embedded programmable processors. ISBN 0-7923-7689-7. Kluwer Academic Publishers, Boston, 2002.

[28] Chaitali Chakrabarti. A DWT-based encoder architecture for symmetrically extended images. In Proceedings of the International Symposium on Circuits and Systems, pages 123–126, 1999.

[29] Deming Chen, Jason Cong, Yiping Fan, and Zhiru Zhang. High-Level Power Estimation and Low-Power Design Space Exploration for FPGAs. In ASP-DAC '07: Proceedings of the 2007 Asia and South Pacific Design Automation Conference, pages 529–534, Washington, DC, USA, 2007. IEEE Computer Society.

[30] Y. K. Chen, J. Chhugani, P. Dubey, C. J. Hughes, D. Kim, S. Kumar, V. W. Lee, A. D. Nguyen, and M. Smelyanskiy. Convergence of recognition, mining, and synthesis workloads and its implications. In Proceedings of the IEEE, volume 96, pages 790–807, 2008.

[31] Pai H. Chou, Ross B. Ortega, and Gaetano Borriello. The Chinook hardware/software co-synthesis system. In ISSS '95: Proceedings of the 8th international symposium on System synthesis, pages 22–27, New York, NY, USA, 1995. ACM.

[32] Philippe Coussy and Adam Morawiec, editors. High-Level Synthesis from Algorithm to Digital Circuit, volume XVI. Springer, 2008.

[33] Alain Darte, Steven Derrien, and Tanguy Risset. Hardware/software interface for multi-dimensional processor arrays. In ASAP '05: Proceedings of the 2005 IEEE International Conference on Application-Specific Systems, Architecture Processors, pages 28–35, Washington, DC, USA, 2005. IEEE Computer Society.

[34] Alain Darte and Frédéric Vivien. Revisiting the decomposition of Karp, Miller and Winograd. In ASAP '95: Proceedings of the IEEE International Conference on Application Specific Array Processors, pages 13–25, Washington, DC, USA, 1995. IEEE Computer Society.

[35] Abhishek Das, William J. Dally, and Peter Mattson. Compiling for stream processing. In PACT '06: Proceedings of the 15th international conference on Parallel Architectures and Compilation Techniques, pages 33–42, New York, NY, USA, 2006. ACM.

[36] Jean-Marc Daveau, Gilberto Fernandes Marchioro, Tarek Ben-Ismail, and Ahmed Amine Jerraya. Protocol selection and interface generation for hw-sw codesign. IEEE Transactions on Very Large Scale Integration Systems, 5(1):136–144, 1997.


[37] Kalyanmoy Deb, Samir Agrawal, Amrit Pratap, and T. Meyarivan. A fast elitist non-dominated sorting genetic algorithm for multi-objective optimisation: NSGA-II. In PPSN VI: Proceedings of the 6th International Conference on Parallel Problem Solving from Nature, pages 849–858, London, UK, 2000. Springer-Verlag.

[38] Ed F. Deprettere, Todor Stefanov, Shuvra S. Bhattacharyya, and Mainak Sen. Affine nested loop programs and their binary parameterized dataflow graph counterparts. In ASAP: IEEE International Conference on Application-Specific Systems, Architecture and Processors, pages 186–190, 2006.

[39] Steven Derrien and Sanjay Rajopadhye. Energy/Power Estimation of Regular Processor Arrays. In ISSS '02: Proceedings of the 15th international symposium on System Synthesis, pages 50–55, New York, NY, USA, 2002. ACM.

[40] Frederic Desprez, Jack Dongarra, Antoine Petitet, Cyril Randriamaro, and Yves Robert. Scheduling block-cyclic array redistribution. IEEE Transactions on Parallel and Distributed Systems, 9:192–205, 1998.

[41] Pedro C. Diniz, Mary W. Hall, Joonseok Park, Byoungro So, and Heidi E. Ziegler. Automatic mapping of C to FPGAs with the DEFACTO compilation and synthesis system. Microprocessors and Microsystems, 29(2-3):51–62, 2005.

[42] Hritam Dutta, Frank Hannig, Alexey Kupriyanov, Dmitrij Kissler, Jürgen Teich, Rainer Schaffer, Sebastian Siegel, Renate Merker, and Bernard Pottier. Massively Parallel Processor Architectures: A Co-design Approach. In Proceedings of the 3rd International Workshop on Reconfigurable Communication Centric System-on-Chips (ReCoSoC), pages 61–68, Montpellier, France, June 2007. Univ. Montpellier.

[43] Hritam Dutta, Frank Hannig, Holger Ruckdeschel, and Jürgen Teich. Efficient Control Generation for Mapping Nested Loop Programs onto Processor Arrays. Journal of Systems Architecture, 53(5–6):300–309, May 2007.

[44] Hritam Dutta, Frank Hannig, Moritz Schmid, and Joachim Keinert. Modeling and Synthesis of Communication Subsystems for Loop Accelerator Pipelines. In Proceedings of the 21st IEEE International Conference on Application-specific Systems, Architectures, and Processors (ASAP), pages 125–132, Rennes, France, July 2010. IEEE Computer Society.

[45] Hritam Dutta, Frank Hannig, and Jürgen Teich. Controller Synthesis for Mapping Partitioned Programs on Array Architectures. Technical Report 03–2005, University of Erlangen-Nuremberg, Department of CS 12, Hardware-Software-Co-Design, Am Weichselgarten 3, 91058 Erlangen, Germany, November 2005.

[46] Hritam Dutta, Frank Hannig, and Jürgen Teich. A Formal Methodology for Hierarchical Partitioning of Piecewise Linear Algorithms. Technical Report 04-2006, University of Erlangen-Nuremberg, Department of CS 12, Hardware-Software-Co-Design, Am Weichselgarten 3, 91058 Erlangen, Germany, April 2006.

[47] Hritam Dutta, Frank Hannig, and Jürgen Teich. Controller Synthesis for Mapping Partitioned Programs on Array Architectures. In Werner Grass, Bernhard Sick, and Klaus Waldschmidt, editors, Proceedings of the 19th International Conference on Architecture of Computing Systems (ARCS), volume 3894 of Lecture Notes in Computer Science (LNCS), pages 176–191, Frankfurt am Main, Germany, March 2006. Springer.

[48] Hritam Dutta, Frank Hannig, and Jürgen Teich. Hierarchical Partitioning for Piecewise Linear Algorithms. In Proceedings of the 5th International Conference on Parallel Computing in Electrical Engineering (PARELEC), pages 153–160, Bialystok, Poland, September 2006. IEEE Computer Society.

[49] Hritam Dutta, Frank Hannig, and Jürgen Teich. Mapping of Nested Loop Programs onto Massively Parallel Processor Arrays with Memory and I/O Constraints. In Friedhelm Meyer auf der Heide and Burkhard Monien, editors, Proceedings of the 6th International Heinz Nixdorf Symposium, New Trends in Parallel & Distributed Computing, volume 181 of HNI-Verlagsschriftenreihe, pages 97–119, Paderborn, Germany, January 2006. Heinz Nixdorf Institut, Universität Paderborn.

[50] Hritam Dutta, Frank Hannig, and Jürgen Teich. PARO: A Design Tool for Automatic Generation of Hardware Accelerators. In Proceedings of ACACES 2008 Poster Abstracts: Advanced Computer Architecture and Compilation for Embedded Systems, pages 317–320, L'Aquila, Italy, July 2008. Academia Press, Ghent.

[51] Hritam Dutta, Frank Hannig, and Jürgen Teich. The PARO Design Tool for Automatic Generation of Hardware Accelerators, March 2008. Interactive Presentation at the Friday Workshop, The New Wave of High-Level Synthesis, Design, Automation and Test in Europe (DATE), Munich, Germany.

[52] Hritam Dutta, Frank Hannig, and Jürgen Teich. Performance Matching of Hardware Acceleration Engines for Heterogeneous MPSoC using Modular Performance Analysis. In Proceedings of the 22nd International Conference on Architecture of Computing Systems (ARCS), volume 5455 of Lecture Notes in Computer Science (LNCS), pages 233–245, Delft, The Netherlands, January 2009. Springer.

[53] Hritam Dutta, Frank Hannig, and Jürgen Teich. PARO – A Design Tool for Synthesis of Hardware Accelerators for SoCs, March 2010. Tool Presentation at the University Booth at Design, Automation and Test in Europe (DATE), Dresden, Germany.

[54] Hritam Dutta, Frank Hannig, Jürgen Teich, Benno Heigl, and Heinz Hornegger. A Design Methodology for Hardware Acceleration of Adaptive Filter Algorithms in Image Processing. In Proceedings of the 17th IEEE International Conference on Application-specific Systems, Architectures, and Processors (ASAP), pages 331–337, Steamboat Springs, CO, USA, September 2006. IEEE Computer Society.

[55] Hritam Dutta, Dmitrij Kissler, Frank Hannig, Alexey Kupriyanov, Jürgen Teich, and Bernard Pottier. A Holistic Approach for Tightly Coupled Reconfigurable Parallel Processors. Microprocessors and Microsystems, 33(1):53–62, February 2009.

[56] Hritam Dutta, Jiali Zhai, Frank Hannig, and Jürgen Teich. Impact of Loop Tiling on the Controller Logic of Hardware Acceleration Engines. In Proceedings of the 20th IEEE International Conference on Application-specific Systems, Architectures, and Processors (ASAP), pages 161–168, Boston, MA, USA, July 2009. IEEE Computer Society.

[57] U. Eckhardt and R. Merker. Hierarchical Algorithm Partitioning at System Level for an Improved Utilization of Memory Structures. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 18(1):14–24, 1999.

[58] Uwe Eckhardt. Algorithmus-Architektur-Codesign für den Entwurf digitaler Systeme mit eingebettetem Prozessorarray und Speicherhierarchie. PhD thesis, Technische Universität Dresden, June 2001.

[59] Uwe Eckhardt and Renate Merker. Optimization of the background memory utilization by partitioning. In Proceedings of the 10th international symposium on System synthesis, ISSS ’97, pages 82–89, Washington, DC, USA, 1997. IEEE Computer Society.

[60] Sven Eisenhardt, Thomas Schweizer, Julio A. de Oliveira Filho, Tobias Oppold, Wolfgang Rosenstiel, Alexander Thomas, Jürgen Becker, Frank Hannig, Dmitrij Kissler, Hritam Dutta, Jürgen Teich, Heiko Hinkelmann, Peter Zipf, and Manfred Glesner. SPP1148 Booth: Coarse-Grained Reconfiguration. In Proceedings of the International Conference on Field Programmable Logic and Applications (FPL), pages 349–350, Heidelberg, Germany, September 2008.

[61] Michael Eisenring and Jürgen Teich. Domain-specific interface generation from dataflow specifications. In CODES/CASHE '98: Proceedings of the 6th International Workshop on Hardware/Software Codesign, pages 43–47, Washington, DC, USA, 1998. IEEE Computer Society.

[62] Petru Eles, Krzysztof Kuchcinski, and Zebo Peng. System Synthesis with VHDL: A Transformational Approach. Kluwer Academic Publishers, Norwell, MA, USA, 1998.

[63] Cagkan Erbas, Selin Cerav-Erbas, and Andy D. Pimentel. Multiobjective optimization and evolutionary algorithms for the application mapping problem in multiprocessor system-on-chip design. IEEE Transactions on Evolutionary Computation, 10(3):358–374, 2006.

[64] Paul Feautrier. Parametric integer programming. RAIRO Recherche Opérationnelle, 22(3):243–268, 1988.

[65] Paul Feautrier. Dataflow analysis of array and scalar references. International Journal of Parallel Programming, 20(1):23–53, February 1991.

[66] Paul Feautrier. The Data Parallel Programming Model: Foundations, HPF Realization, and Scientific Applications, chapter Automatic Parallelization in the Polytope Model, pages 79–103. Lecture Notes in Computer Science. Springer-Verlag, 1996.

[67] Paul Feautrier. Scalable and structured scheduling. International Journal of Parallel Programming, 34(5):459–487, 2006.

[68] Wu-Chun Feng and Tom Scogland. The Green500 List: Year One. In 5th IEEE Workshop on High-Performance, Power-Aware Computing (in conjunction with the 23rd International Parallel & Distributed Processing Symposium), pages 1–7, Rome, Italy, May 2009.

[69] Dirk Fimmel and Renate Merker. Design of processor arrays for reconfigurable architectures. The Journal of Supercomputing, 19(1):41–56, 2001.

[70] Dirk Fischer, Jürgen Teich, Ralph Weper, Uwe Kastens, and Michael Thies. Design space characterization for architecture/compiler co-exploration. In CASES '01: Proceedings of the 2001 International Conference on Compilers, Architecture, and Synthesis for Embedded Systems, pages 108–115, New York, NY, USA, 2001. ACM.

[71] Forte Design Systems. Forte Cynthesizer. www.forteds.com.

[72] Antoine Fraboulet and Tanguy Risset. Master interface for on-chip hardware accelerator burst communications. Journal of VLSI Signal Processing Systems, 49(1):73–85, 2007.

[73] Grigori Fursin, Cupertino Miranda, Olivier Temam, Mircea Namolaru, Elad Yom-Tov, Ayal Zaks, Bilha Mendelson, Phil Barnard, Elton Ashton, Eric Courtois, Francois Bodin, Edwin Bonilla, John Thomson, Hugh Leather, Chris Williams, and Michael O'Boyle. MILEPOST GCC: machine learning based research compiler. In Proceedings of the GCC Developers' Summit, pages 1–13, Ottawa, Canada, June 2008.

[74] William J. Gilbert. Bricklaying and the Hermite Normal Form. The American Mathematical Monthly, 100(3):242–245, 1993.

[75] Ricardo Gonzalez and Mark Horowitz. Energy Dissipation in General Purpose Processors. IEEE Journal of Solid-State Circuits, 31(9):1277–1284, 1996.

[76] Brian J. Gough and Richard M. Stallman. An Introduction to GCC. Network Theory Ltd., 2004.

[77] Georgios Goumas, Maria Athanasaki, and Nectarios Koziris. Automatic code generation for executing tiled nested loops onto parallel architectures. In SAC ’02: Proceedings of the 2002 ACM symposium on Applied computing, pages 876–881, New York, NY, USA, 2002. ACM.

[78] Martin Griebl, Peter Faber, and Christian Lengauer. Space-time mapping and tiling: a helpful combination. Concurrency and Computation: Practice and Experience, 16(3):221–246, March 2004.

[79] Matthias Gries. Methods for evaluating and covering the design space during early design development. Technical Report UCB/ERL M03/32, Electronics Research Lab, University of California at Berkeley, August 2003.

[80] Armin Größlinger, Martin Griebl, and Christian Lengauer. Introducing non-linear parameters to the polyhedron model. In Michael Gerndt and Edmond Kereku, editors, Proc. 11th Workshop on Compilers for Parallel Computers (CPC 2004), Research Report Series, pages 1–12. LRR-TUM, Technische Universität München, July 2004.

[81] Anne-Claire Guillou, Patrice Quinton, and Tanguy Risset. Hardware synthesis for systems of recurrence equations with multidimensional schedule. International Journal of Embedded Systems, 3(4):271–284, 2008.

[82] Zhi Guo, Walid Najjar, and Betul Buyukkurt. Efficient hardware code generation for FPGAs. ACM Transactions on Architecture and Code Optimization, 5(1):1–26, 2008.

[83] Sumit Gupta, Nikil D. Dutt, Rajesh K. Gupta, and Alexandru Nicolau. SPARK: A High-Level Synthesis Framework for Applying Parallelizing Compiler Transformations. In Proceedings of the International Conference on VLSI Design, pages 461–466, January 2003.

[84] Soonhoi Ha, Sungchan Kim, Choonseung Lee, Youngmin Yi, Seongnam Kwon, and Young-Pyo Joo. PeaCE: A hardware-software codesign environment for multimedia embedded systems. ACM Transactions on Design Automation of Electronic Systems (TODAES), 12(3):1–25, 2007.

[85] Wolfgang Haid and Lothar Thiele. Complex task activation schemes in system level performance analysis. In Proc. 5th Int'l Conf. on Hardware/Software Codesign and System Synthesis (CODES+ISSS), pages 173–178, Salzburg, Austria, October 2007. ACM Press.

[86] Frank Hannig. Scheduling Techniques for High-Throughput Loop Accelerators. PhD thesis, University of Erlangen-Nuremberg, Germany, Munich, August 2009.

[87] Frank Hannig, Hritam Dutta, Alexey Kupriyanov, Jürgen Teich, Rainer Schaffer, Sebastian Siegel, Renate Merker, Ronan Keryell, Bernard Pottier, Daniel Chillet, Daniel Ménard, and Olivier Sentieys. Co-Design of Massively Parallel Embedded Processor Architectures. In Proceedings of the first International Workshop on Reconfigurable Communication Centric System-on-Chips (ReCoSoC), pages 27–34, Montpellier, France, June 2005. Univ. Montpellier II.

[88] Frank Hannig, Hritam Dutta, and Jürgen Teich. Mapping of Regular Nested Loop Programs to Coarse-grained Reconfigurable Arrays – Constraints and Methodology. In Proceedings of the 18th International Parallel and Distributed Processing Symposium (IPDPS), Santa Fe, NM, USA, April 2004. IEEE Computer Society.

[89] Frank Hannig, Hritam Dutta, and Jürgen Teich. Regular Mapping for Coarse-grained Reconfigurable Architectures. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), volume V, pages 57–60, Montréal, Quebec, Canada, May 2004. IEEE Signal Processing Society.

[90] Frank Hannig, Hritam Dutta, and Jürgen Teich. Mapping a Class of Dependence Algorithms to Coarse-grained Reconfigurable Arrays: Architectural Parameters and Methodology. International Journal of Embedded Systems, 2(1/2):114–127, January 2006.

[91] Frank Hannig, Hritam Dutta, and Jürgen Teich. Parallelization Approaches for Hardware Accelerators – Loop Unrolling versus Loop Partitioning. In Proceedings of the 22nd International Conference on Architecture of Computing Systems (ARCS), volume 5455 of Lecture Notes in Computer Science (LNCS), pages 16–27, Delft, The Netherlands, March 2009. Springer.

[92] Frank Hannig, Hritam Dutta, and Jürgen Teich. PARO – A Design Tool for the Automatic Generation of Hardware Accelerators, July 2009. Tool Presentation at the Demo Night of the 20th IEEE International Conference on Application-specific Systems, Architectures, and Processors (ASAP).

[93] Frank Hannig, Holger Ruckdeschel, Hritam Dutta, and Jürgen Teich. PARO: Synthesis of Hardware Accelerators for Multi-Dimensional Dataflow-Intensive Applications. In Proceedings of the Fourth International Workshop on Applied Reconfigurable Computing (ARC), volume 4943 of Lecture Notes in Computer Science (LNCS), pages 287–293, London, United Kingdom, March 2008. Springer.

[94] Frank Hannig, Holger Ruckdeschel, and Jürgen Teich. The PAULA Language for Designing Multi-Dimensional Dataflow-Intensive Applications. In Proceedings of the GI/ITG/GMM-Workshop – Methoden und Beschreibungssprachen zur Modellierung und Verifikation von Schaltungen und Systemen, pages 129–138, Freiburg, Germany, March 2008. Shaker.

[95] Frank Hannig and Jürgen Teich. Dynamic Piecewise Linear/Regular Algorithms. In Proceedings of the Fourth International Conference on Parallel Computing in Electrical Engineering (PARELEC), pages 79–84, Dresden, Germany, September 2004. IEEE Computer Society.

[96] C. A. R. Hoare. Algebraic specifications and proofs for communicating sequential processes. In Proceedings of the NATO Advanced Study Institute on Logic of Programming and Calculi of Discrete Design, pages 277–300, New York, NY, USA, 1987. Springer-Verlag New York, Inc.

[97] Glenn H. Holloway and Michael D. Smith. The Machine-SUIF Control Flow Graph Library. Technical report, Division of Engineering and Applied Sciences, Harvard University, 1998.

[98] Amir Hormati, Manjunath Kudlur, Scott Mahlke, David Bacon, and Rodric Rabbah. Optimus: efficient realization of streaming applications on FPGAs. In CASES ’08: Proceedings of the 2008 international conference on Compilers, architectures and synthesis for embedded systems, pages 41–50, New York, NY, USA, 2008. ACM.

[99] Kai Huang, Sang-Il Han, Katalin Popovici, Lisane B. de Brisolara, Xavier Guerin, Lei Li, Xiaolang Yan, Soo-Ik Chae, Luigi Carro, and Ahmed Amine Jerraya. Simulink-Based MPSoC Design Flow: Case Study of Motion-JPEG and H.264. In Proceedings of the Design Automation Conference (DAC), pages 39–42, 2007.

[100] ILOG, Inc. CPLEX solver, 2003. http://www.ilog.fr/products/cplex/.

[101] International Technology Roadmap for Semiconductors. International Technology Roadmap for Semiconductors, 2007 Edition, 2007. http://www.itrs.net/Links/2007ITRS/Home2007.htm.

[102] Hyunuk Jung and Soonhoi Ha. Hardware synthesis from coarse-grained dataflow specification for fast HW/SW cosynthesis. In CODES+ISSS '04: Proceedings of the 2nd IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis, pages 24–29, 2004.

[103] James A. Kahle, Michael N. Day, Peter Hofstee, Charles R. Johns, Theodore R. Maeurer, and David J. Shippy. Introduction to the Cell multiprocessor. IBM Journal of Research and Development, 49(4-5):589–604, 2005.

[104] Gilles Kahn. The semantics of a simple language for parallel programming. In Proceedings of IFIP Congress 74, Stockholm, Sweden, pages 471–475, 1974.

[105] Richard M. Karp, Raymond E. Miller, and Shmuel Winograd. The organization of computations for uniform recurrence equations. Journal of the Association for Computing Machinery, 14(3):563–590, 1967.

[106] Joachim Keinert. Data Flow Based System Level Modeling, Analysis, and Synthesis of High-Performance Streaming Image Processing Applications. PhD thesis, University of Erlangen-Nuremberg, 2009.

[107] Joachim Keinert, Hritam Dutta, Frank Hannig, Christian Haubelt, and Jürgen Teich. Model-Based Synthesis and Optimization of Static Multi-Rate Image Processing Algorithms. In Proceedings of the Conference on Design, Automation and Test in Europe (DATE), pages 135–140, Nice, France, April 2009. IEEE Computer Society.

[108] Joachim Keinert, Christian Haubelt, and Jürgen Teich. Modeling and analysis of windowed synchronous algorithms. In International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages III-892–III-895, 2006.

[109] Joachim Keinert, Christian Haubelt, and Jürgen Teich. Synthesis of multi-dimensional high-speed FIFOs for out-of-order communication. In Architecture of Computing Systems (ARCS 2008), volume 4934 of Lecture Notes in Computer Science (LNCS), pages 130–143. Springer, 2008.

[110] Bart Kienhuis, Ed F. Deprettere, Pieter van der Wolf, and Kees A. Vissers. A methodology to design programmable embedded systems – the Y-chart approach. In Embedded Processor Design Challenges, pages 18–37, 2002.

[111] DaeGon Kim, Lakshminarayanan Renganarayanan, Dave Rostron, Sanjay Rajopadhye, and Michelle Mills Strout. Multi-level tiling: M for the price of one. In SC '07: Proceedings of the 2007 ACM/IEEE Conference on Supercomputing, pages 1–12, New York, NY, USA, 2007. ACM.

[112] Dmitrij Kissler, Hritam Dutta, Alexey Kupriyanov, Frank Hannig, and Jürgen Teich. A High-Speed Dynamic Reconfigurable Multilevel Parallel Architecture, March 2008. Hardware and Software Demo at the University Booth at Design, Automation and Test in Europe (DATE), Munich, Germany.

[113] P. M. W. Knijnenburg, T. Kisuki, and M. F. P. O'Boyle. Iterative compilation. In Embedded Processor Design Challenges: Systems, Architectures, Modeling, and Simulation (SAMOS), pages 171–187, New York, NY, USA, 2002. Springer-Verlag New York, Inc.

[114] Vyas Krishnan and Srinivas Katkoori. A genetic algorithm for the design space exploration of datapaths during high-level synthesis. IEEE Transactions on Evolutionary Computation, 10(3):213–229, 2006.

[115] Manjunath Kudlur, Kevin Fan, and Scott Mahlke. Streamroller: automatic synthesis of prescribed throughput accelerator pipelines. In CODES+ISSS ’06: Proceedings of the 4th international conference on Hardware/software codesign and system synthesis, pages 270–275, New York, NY, USA, 2006. ACM.

[116] Simon Kuenzli. Efficient Design Space Exploration for Embedded Systems. PhD thesis, ETH Zurich, April 2006.

[117] Robert H. Kuhn. Transforming Algorithms for Single-Stage and VLSI Architectures. In Workshop on Interconnection Networks for Parallel and Distributed Processing, pages 11–19, April 1980.

[118] Dhananjay Kulkarni, Walid A. Najjar, Robert Rinker, and Fadi J. Kurdahi. Compile-time area estimation for LUT-based FPGAs. ACM Transactions on Design Automation of Electronic Systems, 11(1):104–122, 2006.

[119] Chris Lattner and Vikram Adve. LLVM: A Compilation Framework for Lifelong Program Analysis & Transformation. In Proceedings of the 2004 International Symposium on Code Generation and Optimization (CGO'04), pages 75–88, Palo Alto, California, March 2004.

[120] David Lau, Orion Pritchard, and Philippe Molson. Automated Generation of Hardware Accelerators with Direct Memory Access from ANSI/ISO Standard C Functions. In FCCM '06: Proceedings of the 14th Annual IEEE Symposium on Field-Programmable Custom Computing Machines, pages 45–56, Washington, DC, USA, 2006. IEEE Computer Society.

[121] Edward Ashford Lee and David Messerschmitt. Static scheduling of synchronous data flow programs for digital signal processing. IEEE Transactions on Computers, 36:24–35, 1987.

[122] Vincent Lefebvre and Paul Feautrier. Optimizing storage size for static control programs in automatic parallelizers. In Euro-Par ’97: Proceedings of the Third International Euro-Par Conference on Parallel Processing, pages 356–363, London, UK, 1997. Springer-Verlag.

[123] Christian Lengauer. Loop parallelization in the polytope model. In CONCUR ’93: Proceedings of the 4th International Conference on Concurrency Theory, pages 398–416, London, UK, 1993. Springer-Verlag.

[124] E. Lindholm, J. Nickolls, S. Oberman, and J. Montrym. NVIDIA Tesla: A Unified Graphics and Computing Architecture. IEEE Micro, 28(2):39–55, 2008.

[125] M. Luthra, Sumit Gupta, Nikil Dutt, Rajesh Gupta, and A. Nicolau. Interface synthesis using memory mapping for an FPGA platform. In Proceedings of the 21st International Conference on Computer Design (ICCD), pages 140–145, October 2003.

[126] Manju Manjunathaiah, Graham M. Megson, Sanjay V. Rajopadhye, and Tanguy Risset. Uniformization of affine dependence programs for parallel embedded system design. In Proceedings of the 2001 International Conference on Parallel Processing (ICPP), pages 205–213, 2001.

[127] Roel Meeuws, Yana Yankova, Koen Bertels, Georgi Gaydadjiev, and Stamatis Vassiliadis. A quantitative prediction model for hardware/software partitioning. In Field Programmable Logic and Applications (FPL), pages 735–739, 2007.

[128] Bingfeng Mei, Serge Vernalde, Diederik Verkest, Hugo de Man, and Rudy Lauwereins. ADRES: An Architecture with Tightly Coupled VLIW Processor and Coarse-Grained Reconfigurable Matrix. In Proceedings of the 13th International Conference on Field Programmable Logic and Applications (FPL), pages 61–70, 2003.

[129] Richard Membarth, Hritam Dutta, Frank Hannig, and Jürgen Teich. Efficient Mapping of Streaming Applications for Image Processing on Graphics Cards. (To appear in) Transactions on High-Performance Embedded Architectures and Compilers (Transactions on HiPEAC), February 2011.

[130] Richard Membarth, Frank Hannig, Hritam Dutta, and Jürgen Teich. Efficient Mapping of Multiresolution Image Filtering Algorithms on Graphics Processors. In Koen Bertels, Nikitas Dimopoulos, Christina Silvano, and Stephan Wong, editors, Proceedings of the 9th International Workshop on Systems, Architectures, Modeling, and Simulation (SAMOS), volume 5657 of Lecture Notes in Computer Science (LNCS), pages 277–288, Island of Samos, Greece, July 2009. Springer.

[131] Richard Membarth, Frank Hannig, Hritam Dutta, and Jürgen Teich. Optimization Flow for Algorithm Mapping on Graphics Cards. In Proceedings of ACACES 2009 Poster Abstracts: Advanced Computer Architecture and Compilation for Embedded Systems, pages 229–232, Terrassa, Spain, July 2009. Academia Press, Ghent.

[132] Richard Membarth, Philipp Kutzer, Hritam Dutta, Frank Hannig, and Jürgen Teich. Acceleration of Multiresolution Imaging Algorithms: A Comparative Study. In Proceedings of the 20th IEEE International Conference on Application-specific Systems, Architectures, and Processors (ASAP), pages 211–214, Boston, MA, USA, July 2009. IEEE Computer Society.

[133] Mentor Graphics. Catapult C. www.mentor.com/products/esl.

[134] Zbigniew Michalewicz and David B. Fogel. How to Solve It: Modern Heuristics. Springer Verlag, 2000.

[135] Giovanni De Micheli. Synthesis and Optimization of Digital Circuits. McGraw-Hill Higher Education, 1994.

[136] Masato Motomura. A Dynamically Reconfigurable Processor Architecture. Microprocessor Forum, October 2002.

[137] Steven Muchnick. Advanced Compiler Design and Implementation. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1997.

[138] Praveen K. Murthy and Edward A. Lee. Multidimensional synchronous dataflow. IEEE Transactions on Signal Processing, 50:3306–3309, 2002.

[139] Anshuman Nayak, Malay Haldar, Alok N. Choudhary, and Prithviraj Banerjee. Accurate Area and Delay Estimators for FPGAs. In DATE '02: Design, Automation and Test in Europe, pages 862–869, 2002.

[140] Ralf Niemann and Peter Marwedel. Hardware/software partitioning using integer programming. In EDTC '96: Proceedings of the 1996 European Conference on Design and Test, pages 473–479, Washington, DC, USA, 1996. IEEE Computer Society.

[141] Rishiyur S. Nikhil. Bluespec System Verilog: efficient, correct RTL from high level specifications. In IEEE International Conference on Formal Methods and Models for Co-Design (MEMOCODE), pages 69–70, 2004.

[142] Hristo Nikolov, Todor Stefanov, and Ed Deprettere. Multi-processor system design with ESPAM. In CODES+ISSS '06: Proceedings of the 4th International Conference on Hardware/Software Codesign and System Synthesis, pages 211–216, New York, NY, USA, 2006. ACM.

[143] Open64 Workshop, held in conjunction with the IEEE/ACM International Symposium on Code Generation and Optimization (CGO), pages xv–xvi, Los Alamitos, CA, USA, 2009. IEEE Computer Society.

[144] Maurizio Palesi and Tony Givargis. Multi-objective design space exploration using genetic algorithms. In Proceedings of the Tenth International Symposium on Hardware/Software Codesign (CODES), pages 67–72, 2002.

[145] Christian Pilato, Antonino Tumeo, Gianluca Palermo, Fabrizio Ferrandi, Pier Luca Lanzi, and Donatella Sciuto. Improving evolutionary exploration to area-time optimization of FPGA designs. Journal of Systems Architecture - Embedded Systems Design, 54(11):1046–1057, 2008.

[146] Latha Pillai. Video compression using DCT. Xilinx Application Note XAPP610, 2002. direct.xilinx.com/support/documentation/application_notes/xapp610.pdf.

[147] Andy D. Pimentel, Cagkan Erbas, and Simon Polstra. A Systematic Approach to Exploring Embedded System Architectures at Multiple Abstraction Levels. IEEE Transactions on Computers, 55(2):99–112, 2006.

[148] Sebastian Pop, Albert Cohen, Cédric Bastoul, Sylvain Girbal, P. Jouvelot, and N. Vasilache. GRAPHITE: Loop optimizations based on the polyhedral model for GCC. In Proc. of the 4th GCC Developers' Summit, pages 179–198, Ottawa, Canada, June 2006.

[149] Louis-Noël Pouchet, Cédric Bastoul, Albert Cohen, and John Cavazos. Iterative optimization in the polyhedral model: Part II, multidimensional time. In ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI'08), pages 90–100, Tucson, Arizona, June 2008. ACM Press.

[150] Sanjay V. Rajopadhye and Richard Fujimoto. Synthesizing systolic arrays from recurrence equations. Parallel Computing, 14(2):163–189, 1990.

[151] Sailesh K. Rao. Regular iterative algorithms and their implementations on processor arrays. PhD thesis, Stanford University, Stanford, CA, USA, 1986.

[152] Recore Systems. www.recoresystems.com.

[153] Edwin Rijpkema, Ed F. Deprettere, and Bart Kienhuis. Deriving process networks from nested loop algorithms. Parallel Processing Letters, 10(2/3):165–176, 2000.

[154] A. W. Roscoe and C. A. R. Hoare. The laws of Occam programming. Technical monograph PRG-53, University of Oxford Computer Laboratory, 1986.

[155] Robert Schreiber, Shail Aditya, Scott Mahlke, Vinod Kathail, B. Ramakrishna Rau, Darren Cronquist, and Mukund Sivaraman. PICO-NPA: High-Level Synthesis of Nonprogrammable Hardware Accelerators. Journal of VLSI Signal Processing Systems, 31(2):127–142, 2002.

[156] Alexander Schrijver. Theory of Linear and Integer Programming. Wiley-Interscience Series in Discrete Mathematics. John Wiley & Sons, Chichester, New York, USA, 1986.

[157] Paul Schumacher and Pradip Jha. Fast and accurate resource estimation of RTL-based designs targeting FPGAs. In Field Programmable Logic and Applications (FPL), pages 59–64, 2008.

[158] Carter Shanklin and Leda Ortega, editors. High Performance Compilers for Parallel Computing. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1995.

[159] Richard Sharp. Higher-Level Hardware Synthesis, volume 2963 of Lecture Notes in Computer Science. Springer, 2004.

[160] Hartej Singh, Ming-Hau Lee, Guangming Lu, Nader Bagherzadeh, Fadi J. Kurdahi, and Eliseu M. Chaves Filho. MorphoSys: An Integrated Reconfigurable System for Data-Parallel and Computation-Intensive Applications. IEEE Transactions on Computers, 49(5):465–481, 2000.

[161] Byoungro So, Mary W. Hall, and Pedro C. Diniz. A compiler approach to fast hardware design space exploration in FPGA-based systems. In PLDI ’02: Proceedings of the ACM SIGPLAN 2002 Conference on Programming language design and implementation, pages 165–176, New York, NY, USA, 2002. ACM.

[162] Dinesh C. Suresh, Satya R. Mohanty, Walid A. Najjar, Laxmi N. Bhuyan, and Frank Vahid. Loop level analysis of security and network applications. In Workshop on Computer Architecture Evaluation using Commercial Workloads (CAECW-03), February 2003.

[163] Synopsys, Inc. Synphony C Compiler, 2010. http://www.synopsys.com/Tools/SLD/HLS/Pages/SynphonyC-Compiler.aspx.

[164] Jürgen Teich. A Compiler for Application-Specific Processor Arrays. PhD thesis, Institut für Mikroelektronik, Universität des Saarlandes, Saarbrücken, Germany, September 1993.

[165] Jürgen Teich. Invasive Algorithms and Architectures. it - Information Tech- nology, 50(5):300–310, 2008.

[166] Jürgen Teich, Frank Hannig, Holger Ruckdeschel, Hritam Dutta, Dmitrij Kissler, and Andrej Stravet. A Unified Retargetable Design Methodology for Dedicated and Re-Programmable Multiprocessor Arrays: Case Study and Quantitative Evaluation. In Proceedings of the International Conference on Engineering of Reconfigurable Systems and Algorithms (ERSA), invited paper, pages 14–24, Las Vegas, NV, USA, June 2007. CSREA Press.

[167] Jürgen Teich and Christian Haubelt. Digitale Hardware/Software-Systeme: Synthese und Optimierung. Springer-Verlag, Berlin Heidelberg, 2nd edition, 2007.

[168] Jürgen Teich and Lothar Thiele. Exact partitioning of affine dependence algorithms. In SAMOS: Embedded Processor Design Challenges: Systems, Architectures, Modeling, and Simulation, pages 135–153, New York, NY, USA, 2002. Springer-Verlag New York, Inc.

[169] Jürgen Teich, Lothar Thiele, and Li Zhang. Scheduling of Partitioned Regular Algorithms on Processor Arrays with Constrained Resources. Journal of VLSI Signal Processing, 17(1):5–20, September 1997.

[170] The MathWorks. Simulink. www.mathworks.com/products/simulink/.

[171] L. Thiele. Scheduling of Uniform Algorithms with Resource Constraints. Journal of VLSI Signal Processing, 10:295–310, 1995.

[172] Lothar Thiele, Ernesto Wandeler, and Samarjit Chakraborty. A Stream-Oriented Component Model for Performance Analysis of Multiprocessor DSPs. IEEE Signal Processing Magazine, 22(3):38–46, May 2005.

[173] Lothar Thiele, Iuliana Bacivarov, Wolfgang Haid, and Kai Huang. Mapping Applications to Tiled Multiprocessor Embedded Systems. In Proc. 7th Int'l Conference on Application of Concurrency to System Design (ACSD'07), pages 29–40, Bratislava, Slovak Republic, July 2007.

[174] Bill Thies, Michal Karczmarek, and Saman Amarasinghe. StreamIt: A language for streaming applications. In International Conference on Compiler Construction, pages 179–196, 2001.

[175] Alexandru Turjan, Bart Kienhuis, and Ed Deprettere. A Compile Time Based Approach for Solving Out-of-Order Communication in Kahn Process Networks. In ASAP '02: Proceedings of the IEEE International Conference on Application-Specific Systems, Architectures, and Processors, pages 17–28, 2002.

[176] Sven van Haastregt and Bart Kienhuis. Automated synthesis of streaming C applications to process networks in hardware. In Proceedings of the Design, Automation and Test in Europe Conference (DATE), pages 890–893, 2009.

[177] Sven Verdoolaege, Rachid Seghir, Kristof Beyls, Vincent Loechner, and Maurice Bruynooghe. Analytical computation of Ehrhart polynomials: enabling more compiler analyses and optimizations. In CASES '04: Proceedings of the 2004 International Conference on Compilers, Architecture, and Synthesis for Embedded Systems, pages 248–258, New York, NY, USA, 2004. ACM.

[178] Hervé Le Verge, Christophe Mauras, and Patrice Quinton. The ALPHA language and its use for the design of systolic arrays. VLSI Signal Processing, 3(3):173–182, 1991.

[179] Henry S. Warren. Hacker’s Delight. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 2002.

[180] R. Clint Whaley and Jack J. Dongarra. Automatically tuned linear algebra software. In Supercomputing '98: Proceedings of the 1998 ACM/IEEE Conference on Supercomputing (CDROM), pages 1–27, Washington, DC, USA, 1998. IEEE Computer Society.

[181] R. Clint Whaley, Antoine Petitet, and Jack J. Dongarra. Automated empirical optimizations of software and the ATLAS project. Parallel Computing, 27(1–2):3–35, 2001.

[182] Samuel Williams, Leonid Oliker, Richard Vuduc, John Shalf, Katherine Yelick, and James Demmel. Optimization of sparse matrix-vector multiplication on emerging multicore platforms. In SC '07: Proceedings of the 2007 ACM/IEEE Conference on Supercomputing, pages 1–12, New York, NY, USA, 2007. ACM.

[183] Robert P. Wilson, Robert S. French, Christopher S. Wilson, Saman P. Amarasinghe, Jennifer-Ann M. Anderson, Steven W. K. Tjiang, Shih-Wei Liao, Chau-Wen Tseng, Mary W. Hall, Monica S. Lam, and John L. Hennessy. SUIF: An infrastructure for research on parallelizing and optimizing compilers. SIGPLAN Notices, 29(12):31–37, 1994.

[184] Wayne Wolf. Computers as components: principles of embedded computing system design. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2001.

[185] Wayne Wolf. The Future of Multiprocessor Systems-on-Chips. In Proceed- ings of the 41st annual Design Automation Conference, pages 681–685, Los Alamitos, CA, USA, 2004. IEEE Computer Society.

[186] Wayne Wolf, Burak Ozer, and Tiehan Lv. Smart cameras as embedded systems. Computer, 35(9):48–53, 2002.

[187] Xilinx Inc. Xilinx Platform Studio, documentation, 2009. http://www.xilinx.com/ise/embedded/edk_pstudio.htm.

[188] Xilinx Inc. XPower Analyzer, 2009. http://www.xilinx.com/products/design_tools/logic_design/verification/xpower.htm.

[189] Jingling Xue. Loop Tiling for Parallelism. Kluwer Academic Publishers, Norwell, MA, USA, 2000.

[190] Z. Zhang, Y. Fan, W. Jiang, G. Han, C. Yang, and J. Cong. AutoPilot: A platform-based ESL synthesis system. In High-Level Synthesis: From Algorithm to Digital Circuit, chapter 6, pages 99–112. Springer, 2008.

[191] E. Zitzler, M. Laumanns, and L. Thiele. SPEA2: Improving the Strength Pareto Evolutionary Algorithm for Multiobjective Optimization. In K.C. Giannakoglou et al., editors, Evolutionary Methods for Design, Optimisation and Control with Application to Industrial Problems (EUROGEN 2001), pages 95–100. International Center for Numerical Methods in Engineering (CIMNE), 2002.

[192] E. Zitzler and L. Thiele. Multiobjective evolutionary algorithms: A comparative case study and the strength Pareto evolutionary algorithm. IEEE Transactions on Evolutionary Computation, 3(4):257–271, 1999.

List of Abbreviations

ASIC ...... Application-specific Integrated Circuit
BRAM ...... Block Random Access Memory
CFG ...... Control-Flow Graph
DCT ...... Discrete Cosine Transform
DMA ...... Direct Memory Access
DPLA ...... Dynamic Piecewise Linear Algorithm
DSE ...... Design Space Exploration
EWDF ...... Elliptic Wave Digital Filter
FF ...... Flip-Flop
FIR ...... Finite Impulse Response Filter
FPGA ...... Field Programmable Gate Array
GA ...... Genetic Algorithm
GPU ...... Graphics Processing Unit
HLS ...... High-Level Synthesis
LA ...... Loop Accelerator
LoC ...... Lines of Code
LUT ...... Look-up-Table
MoC ...... Model of Computation
MOEA ...... Multi-Objective Evolutionary Algorithm
MPSoC ...... Multi-processor System-on-Chip
MPA ...... Modular Performance Analysis
NSGA ...... Non-dominated Sorting Genetic Algorithm
PE ...... Processor Element
PA ...... Processor Array
PLA ...... Piecewise Linear Algorithm
QoR ...... Quality of Results
RTC ...... Real-time Calculus
RTL ...... Register Transfer Level
SoC ...... System-on-Chip
SPEA ...... Strength Pareto Evolutionary Algorithm
TCPA ...... Tightly-Coupled Processor Array
WSDF ...... Windowed Synchronous Dataflow

Curriculum Vitae

Hritam Dutta was born in Bokaro, India on April 5, 1979. He received his Bachelor of Science (B.Sc.) and Master of Science (M.Sc.) degrees in Mathematics and Computing from the Indian Institute of Technology, Kharagpur, India in 2000 and 2002, respectively. After obtaining his second Master of Science (M.Sc.) degree in Computational Engineering from the University of Erlangen-Nuremberg in 2005, he joined the Chair of Hardware-Software-Co-Design (Prof. Dr.-Ing. Jürgen Teich) at the University of Erlangen-Nuremberg, Germany as a research assistant. There, he was employed in the DFG P2R project CoMap: Co-Design of Massively Parallel Embedded Processor Architectures from 2005 to 2010. Besides his studies, he gained further research exposure during an internship at the Indian Institute of Remote Sensing (2000), as an intern at Siemens Medical Solutions (2004), and as a visiting research employee at the University Bretagne-Occidentale, Brest, France (2010). His research interests include models and methods for parallel and distributed computing in embedded systems. In 2000, he received a student fellowship from the Indian Academy of Sciences. In 2003, he received the "Siemens TOPAZ" scholarship from Siemens AG for pursuing higher studies at the University of Erlangen-Nuremberg. Hritam Dutta has been a reviewer for several international conferences and for the Elsevier journal Microprocessors and Microsystems.
