Architectural Frameworks for Automated Design and Optimization of Hardware Accelerators
ARCHITECTURAL FRAMEWORKS FOR AUTOMATED DESIGN AND OPTIMIZATION OF HARDWARE ACCELERATORS

A Dissertation Presented to the Faculty of the Graduate School of Cornell University in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

by
Tao Chen
May 2018

© 2018 Tao Chen
ALL RIGHTS RESERVED

Tao Chen, Ph.D.
Cornell University 2018

As technology scaling slows down and provides only diminishing improvements in general-purpose processor performance, computing systems increasingly rely on customized accelerators to meet the performance and energy-efficiency requirements of emerging applications. For example, today's mobile SoCs rely on accelerators to perform compute-intensive tasks, and datacenters are starting to deploy accelerators for applications such as web search and machine learning. This trend is expected to continue, and future systems will contain even more specialized accelerators. However, the traditional hardware-oriented accelerator design methodology is costly and inefficient because it requires significant manual effort in the design process. This development model is unsustainable in a future where a wide variety of accelerators must be designed for a large number of applications. To solve this problem, the development cost of accelerators must be drastically reduced, which calls for more productive design methodologies that can create high-quality accelerators with low manual effort. This thesis addresses this challenge with architectural frameworks that combine novel accelerator architectures with automated design and optimization frameworks, enabling high-performance and energy-efficient accelerators to be designed with minimal manual effort.
Specifically, the first part of the thesis proposes a framework for automatically generating accelerators that can effectively tolerate long, variable memory latencies, which improves performance and reduces design effort by removing the need to manually create data-preloading logic. The framework leverages architectural mechanisms such as memory prefetching and access/execute decoupling, together with automated compiler analysis, to generate accelerators that intelligently preload data needed in the future from main memory.

The second part of the thesis proposes a framework for building parallel accelerators that leverages concepts from task-based parallel programming, enabling software programmers to quickly create high-performance accelerators using familiar parallel programming paradigms, without requiring low-level hardware design expertise. The framework uses a computation model that supports dynamic parallelism in addition to static parallelism, and includes a flexible architecture with dynamic scheduling that enables mapping a wide range of parallel applications to hardware accelerators with good performance. In addition, we designed a unified language that can be mapped to both software and hardware, enabling programmers to create parallel software and parallel accelerators in a unified framework.

The third part of the thesis proposes a framework that enables accelerators to perform intelligent dynamic voltage and frequency scaling (DVFS) to achieve good energy efficiency for interactive and real-time applications. The framework combines program analysis and machine learning to train predictors that accurately predict the computation time needed for each job and adjust the DVFS levels to reduce energy consumption.

BIOGRAPHICAL SKETCH

Tao Chen attended Fudan University from 2008 to 2012, where he received his Bachelor of Science degree (with distinction) in Microelectronics.
After graduating from Fudan University, he began pursuing his Ph.D. degree in the School of Electrical and Computer Engineering at Cornell University, where he worked with his advisor, Professor G. Edward Suh, on topics in the field of computer architecture, with a focus on hardware accelerators.

This dissertation is dedicated to my parents.

ACKNOWLEDGEMENTS

Six years ago, I arrived at Cornell to pursue my Ph.D. degree. At that time, I was a young student who was nervous about the challenges ahead and uncertain whether I could make it to the end. Six years later, I have successfully completed this dissertation and become a doctor. I am extremely grateful to the many people who helped me along this exciting and rewarding journey.

First and foremost, I would like to express my sincerest gratitude to my advisor, Professor G. Edward Suh. Throughout my Ph.D. study, Ed supported me without reservation and provided valuable guidance, advice, encouragement, and help whenever I needed them. Ed gave me the freedom to pursue research directions that I am passionate about, while at the same time providing the guidance I needed to stay on the right path. Ed is always ready to offer his generous help, whether it is brainstorming ideas, revising a paper, or perfecting a conference talk. He is also always encouraging when I face difficulties, which helped me stay optimistic and motivated through the challenging journey of working toward a Ph.D. degree. I am deeply grateful to him.

I would like to thank my committee members, Professor David H. Albonesi and Professor Zhiru Zhang. Dave is a role model to me as a great computer architect who is passionate about research and teaching. Dave's course on memory systems was one of the most exciting classes I took, and it inspired me to pursue research on memory optimizations for accelerators.
Zhiru's vision and his pioneering work on high-level synthesis are a major source of inspiration for my research. He also provided many helpful suggestions and comments that greatly improved my work.

I would like to thank Professor Christopher Batten for his guidance and support, and for generously sharing the research infrastructure that his group developed. Chris also mentored me on the parallel accelerator project and provided many insightful suggestions and advice. I am sincerely grateful to him.

Special thanks to my friends and colleagues at CSL who helped me tremendously, both with my research and with navigating graduate school. I would like to thank the members of the Suh Research Group. I want to thank Daniel Lo for providing many helpful comments and insights that greatly helped my research. I would like to thank Ruirui (Raymond) Huang and Wing-kei (KK) Yu for sharing their experiences as senior Ph.D. students. Special thanks to Yao Wang for providing great suggestions and direction throughout my Ph.D. journey. I would also like to thank Andrew Ferraiuolo, Mohamed Ismail, Benjamin Wu, Weizhe (Will) Hua, and Mulong Luo for their support and friendship, which made my life as a Ph.D. student much more enjoyable. Special thanks to Shreesha Srinath for being both a mentor and a good friend. I enjoyed discussing and debating research ideas with him, and I also benefited from his suggestions and guidance as a senior student. I would also like to thank Xiaodong Wang, Gai Liu, Steve Dai, and all the other CSL students, whom I am fortunate to call friends. I am proud to be a member of this brilliant community.

I would like to thank my girlfriend, Lin, for being caring and supportive of my life and research. Her encouragement helped me push forward in times of difficulty, and her warmth made me feel delighted every day.

Finally, I would like to express my deepest gratitude to my parents, Xin Chen and Meihua Liu, for their unconditional love and support.
They taught me to be persistent and optimistic, and that hard work pays off, which has carried me this far in my academic endeavors. They encouraged me to think independently and supported me no matter what decisions I made in my life. I am proud to have them as my parents, and I hope I have made them proud of me too.

TABLE OF CONTENTS

Biographical Sketch ... iii
Dedication ... iv
Acknowledgements ... v
Table of Contents ... vii
List of Tables ... x
List of Figures ... xi

1 Introduction ... 1
  1.1 Background ... 1
  1.2 Design Complexity of Accelerators ... 3
  1.3 Thesis Contributions and Organization ... 6

2 Memory Optimization Framework for Efficient Data Supply ... 9
  2.1 Introduction ... 9
  2.2 Overview ... 12
    2.2.1 System Architecture ... 12
    2.2.2 High-Level Synthesis ... 13
    2.2.3 Impact of Memory Accesses on Accelerator Performance ... 14
    2.2.4 Data Preloading Framework ... 16
  2.3 Prefetching ... 18
  2.4 Decoupled Access/Execute ... 19
    2.4.1 Access Unit ... 23
    2.4.2 Memory Units ... 24
    2.4.3 Execute Unit ... 26
    2.4.4 Deadlock Avoidance ... 26
    2.4.5 Customization of Memory Units ... 27
    2.4.6 Automated DAE Accelerator Generation ... 28
  2.5 Evaluation ... 29
    2.5.1 Methodology ... 30
    2.5.2 Experimental Setup ... 31
    2.5.3 Baseline Validation ... 33
    2.5.4 Performance Results ... 34
    2.5.5 Area, Power, and Energy Results ... 36
    2.5.6 Design Space Exploration: Queue Size ... 42

3 Parallel Accelerator Framework ... 44
  3.1 Introduction ... 44
  3.2 Computation Model for Dynamic Parallelism ... 48
    3.2.1 Primitives ... 48
    3.2.2 Continuation Passing ... 50
    3.2.3 Scheduling the Computation ... 54
    3.2.4 Function Calls ... 56
  3.3 Accelerator Architecture ... 56
    3.3.1 FlexArch Tile and PE Architecture ... 58
    3.3.2 LiteArch Tile and PE Architecture ... 64
    3.3.3 Networks ... 64
    3.3.4 Memory Hierarchy ... 65
    3.3.5 CPU-Accelerator Interface ... 66
  3.4 Design Methodology and Framework ... 67
    3.4.1 Architectural Template ... 67
    3.4.2 Algorithm Description Format ... 68
    3.4.3 Accelerator RTL Generation ... 69
  3.5 Unified Framework for Parallel Accelerators and Software ... 71
    3.5.1 CPPWD-TBB Library ... 72
    3.5.2 Programmability ... 74
  3.6 Evaluation ... 75
    3.6.1 Benchmarks ... 76
    3.6.2 Design Effort Comparison ...