Bespoke Behavioral Processors

BESPOKE BEHAVIORAL PROCESSORS

by Rohit Sreekumar

APPROVED BY SUPERVISORY COMMITTEE:
Dr. Benjamin Carrion Schaefer, Chair
Dr. Lakshman Tamil
Dr. Yang Hu

Copyright © 2020 Rohit Sreekumar. All rights reserved.

This thesis is dedicated to my parents, R Sreekumar and Girija Sreekumar, and my dear friends.

BESPOKE BEHAVIORAL PROCESSORS

by ROHIT SREEKUMAR, B.Tech

THESIS

Presented to the Faculty of The University of Texas at Dallas in Partial Fulfillment of the Requirements for the Degree of MASTER OF SCIENCE IN COMPUTER ENGINEERING

THE UNIVERSITY OF TEXAS AT DALLAS

May 2020

ACKNOWLEDGMENTS

I would like to thank God Almighty for instilling in me the confidence and drive to successfully complete this study.

I express my deepest gratitude to my graduate thesis advisor, Dr. Benjamin Carrion Schaefer, for the opportunity to work with him on a research-oriented project. His continued support, guidance and, most of all, his belief in me enabled me to successfully complete my thesis. I learned immensely from his in-depth knowledge and experience during my work. Thank you, sir.

I would like to express my gratitude to Dr. Lakshman Tamil and Dr. Yang Hu for being on my evaluation committee.

I would like to give special thanks to my fiancée for her love, support and immense encouragement throughout. I would like to thank my family for their constant motivation and support, and for providing me with words of wisdom during my difficult times.

I am thankful to the Department of Electrical and Computer Engineering at The University of Texas at Dallas for their help and for providing me with ample research facilities throughout my work.

March 2020

BESPOKE BEHAVIORAL PROCESSORS

Rohit Sreekumar, MSCE
The University of Texas at Dallas, 2020

Supervising Professor: Dr. Benjamin Carrion Schaefer

Many emerging applications require simple controllers that run the exact same application continuously. These include medical devices and IoT devices of various kinds.
Because of the nature of these applications, they have to be ultra-low power and small. Most of these applications are computationally inexpensive and are therefore amenable to execution on a simple low-power microprocessor. One of the problems of using a general purpose processor is that not all of its resources are required for a specific application, so there is a large potential for simplifying the processor to achieve lower area and power. In addition, these processors can be specified at the behavioral level, using High-Level Synthesis (HLS) to generate the RTL automatically. This opens a window for additional optimizations, as the processor can be pruned and re-synthesized at different VLSI design levels in order to obtain a smaller and more power-efficient processor. This work presents a methodology to automatically customize a behavioral RISC processor for a given workload such that its area and power are significantly reduced compared to the original processor. Compared to previous work that customizes a given processor at the gate netlist level only, the proposed method reduces the area and power significantly by raising the level of abstraction.

TABLE OF CONTENTS

ACKNOWLEDGMENTS
ABSTRACT
LIST OF FIGURES
LIST OF TABLES
CHAPTER 1 INTRODUCTION
1.1 Thesis Motivation
1.2 Thesis Contribution and Organization
CHAPTER 2 APPLICATION SPECIFIC INSTRUCTION SET PROCESSORS
2.1 Introduction
2.2 Application Specific Processors
2.2.1 Definition
2.2.2 ASIP vs General CPU
2.3 ASIP Design Flow
2.4 Synopsys Processor Designer
2.5 Cadence Xtensa
CHAPTER 3 HIGH LEVEL SYNTHESIS
3.1 Introduction to VLSI Design
3.2 VLSI Design Flow and its Applications
3.3 Introduction To High Level Synthesis
3.4 High Level Synthesis Design Flow
3.4.1 Resource Allocation
3.4.2 Scheduling
3.4.3 Binding
3.4.4 RTL Code Generation
3.5 Advantages of HLS
3.6 Disadvantages of HLS
3.7 Commercial HLS Tools
3.7.1 Vivado HLS
3.7.2 Catapult C
3.7.3 C to Silicon
3.7.4 CyberWorkBench
CHAPTER 4 BESPOKE PROCESSORS
4.1 Introduction
4.2 Motivational Example
4.3 Bespoke Processor Proposed Method
4.3.1 Behavioral Pruning
4.3.2 RTL Pruning
4.3.3 Gate Netlist Pruning
4.4 Experimental Results
4.4.1 Experimental Setup
4.4.2 Experimental Results
CHAPTER 5 CONCLUSION AND FUTURE WORK
5.1 Conclusion
5.2 Future Work
REFERENCES
BIOGRAPHICAL SKETCH
CURRICULUM VITAE

LIST OF FIGURES

2.1 ASIP design flow. [1]
2.2 Synopsys Processor Design overview [2]
2.3 Cadence Xtensa Design Flow. [3]
3.1 VLSI design flow
3.2 HLS Gajski-Kuhn Y-chart [18]
3.3 HLS Design Flow
3.4 Control and Data Flow graph
3.5 Resource allocation example
3.6 Scheduling example
3.7 Binding example
4.1 Motivational example. (a) Synthesizable behavioral description snippet of scalar MIPS processor. (b) RT-Level block diagram of processor. (c) Gate netlist view. (d) Application to be run on the processor (average of 8 numbers).
4.2 Area, power and timing reduction after each stage
4.3 Overview of complete bespoke processor proposed flow
4.4 Proposed method vs Previous method area savings
4.5 Area savings per level of abstraction
4.6 Power savings per level of abstraction
4.7 Delay savings per level of abstraction
4.8 Synthetic Benchmarks Area Savings

LIST OF TABLES

4.1 Supported MIPS Instruction Set
4.2 Benchmark details
4.3 Run Time for Iterative Reduction Approach
4.4 Run Time for Direct Reduction Approach

LIST OF ABBREVIATIONS

ALAP As Late As Possible
ALU Arithmetic Logic Unit
ASAP As Soon As Possible
ASIP Application Specific Instruction Set Processor
CISC Complex Instruction Set Computer
CDFG Control and Data Flow Graph
CPU Central Processing Unit
DSP Digital Signal Processing
HDL Hardware Description Language
HLS High Level Synthesis
I/O Input Output
IoT Internet of Things
LISA Language for Instruction Set Architectures
MOS Metal Oxide Semiconductor
RAM Random Access Memory
RISC Reduced Instruction Set Computer
ROM Read Only Memory
RTL Register Transfer Level
VHDL Very High Speed Integrated Circuit Hardware Description Language
VLIW Very Long Instruction Word
VLSI Very Large Scale Integration

CHAPTER 1 INTRODUCTION

The Internet of Things (IoT), a term coined to describe a network of interconnected devices, is arguably one of the most exciting and fastest-growing technologies today. There are currently over 6 billion connected devices, and this number is predicted to exceed 30 billion by 2022. Over the same period, the global data volume is estimated to grow from 4.4 zettabytes to 44 zettabytes. IoT applications range from smart homes and cities to healthcare, transportation and industrial settings.

IoT systems typically require an embedded processor, a communication circuit (e.g. Wi-Fi, Bluetooth) and multiple sensors. The main problem is that these types of systems are often battery operated and rely on renewable energy to recharge the batteries, so they need to be ultra-low power. At the same time, they often execute specific, static applications. This raises the question of whether these IoT systems can be tailored to further reduce their power consumption.

1.1 Thesis Motivation

A large number of applications require ultra-low power hardware and are cost sensitive.
At the same time, these applications are not very computationally demanding and can therefore be executed on a general purpose processor. These applications include wearable [10; 23] and IoT applications [11; 24].

One of the problems with general purpose processors is that they are not as power efficient as dedicated solutions like ASICs. To address this, state-of-the-art processors make use of different levels of adaptive power management techniques such as power gating [15; 12] and event-driven programming through interrupts [14]. These techniques help reduce the power consumption of the unused parts of the processor, but they are often restricted to a coarse granularity and lead to area overheads due to the need to include different clock domains, gating logic, etc.

1.2 Thesis Contribution and Organization

A bespoke processor is created from an existing general purpose microprocessor by tailoring it to a particular target application. This thesis raises the abstraction level of the tailoring procedure, which previously operated only on the gate netlist, so that tailoring starts from a higher level of abstraction. The methodology follows an iterative tailoring procedure: tailoring begins at the behavioral description of the processor, where all unused lines of code are removed, followed by an RTL-level reduction and finally a gate-level netlist reduction.

The remaining chapters of the thesis are organized as follows. Chapter 2 introduces ASIPs, components used in Systems-on-Chip whose instruction set is tailored to a specific application, together with their design flow and the tools that support it. Chapter 3 introduces the concept of High Level Synthesis, its design flow, the pros and cons of HLS and the various tools for HLS. Chapter 4 introduces the concept of bespoke processors, the proposed methodology for the generation of behavioral bespoke processors and the experimental results obtained. Chapter 5 presents the conclusion.
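
As a rough illustration of the behavioral pruning step described in Section 1.2, the following Python sketch models a processor as a table of instruction handlers and removes the handlers that a fixed workload never exercises. The toy instruction set, the handler names, `prune`, `run` and the sample program are all hypothetical and are not taken from the processor description used in this thesis. The RTL and gate netlist reduction stages are not modeled.

```python
from typing import Callable, Dict, List, Set, Tuple

# Hypothetical toy model: an instruction is (opcode, dest, src, operand).
Instr = Tuple[str, int, int, int]
Handler = Callable[[List[int], int, int, int], None]


def _add(r: List[int], d: int, a: int, b: int) -> None:
    r[d] = r[a] + r[b]


def _sub(r: List[int], d: int, a: int, b: int) -> None:
    r[d] = r[a] - r[b]


def _addi(r: List[int], d: int, a: int, imm: int) -> None:
    r[d] = r[a] + imm


def _and(r: List[int], d: int, a: int, b: int) -> None:
    r[d] = r[a] & r[b]


def _srl(r: List[int], d: int, a: int, shamt: int) -> None:
    r[d] = r[a] >> shamt


# The "general purpose" processor supports every opcode (assumed subset).
FULL_PROCESSOR: Dict[str, Handler] = {
    "add": _add, "sub": _sub, "addi": _addi, "and": _and, "srl": _srl,
}


def used_opcodes(program: List[Instr]) -> Set[str]:
    """Collect the opcodes that the static workload actually contains."""
    return {instr[0] for instr in program}


def prune(processor: Dict[str, Handler], program: List[Instr]) -> Dict[str, Handler]:
    """Keep only the handlers that the workload needs."""
    used = used_opcodes(program)
    return {op: handler for op, handler in processor.items() if op in used}


def run(processor: Dict[str, Handler], program: List[Instr], num_regs: int = 8) -> List[int]:
    """Execute the program on the given processor model and return the registers."""
    regs = [0] * num_regs
    for op, d, a, b in program:
        processor[op](regs, d, a, b)
    return regs


# Workload: average of 8 numbers, accumulated in r1 and divided by 8 via a shift.
VALUES = [10, 20, 30, 40, 50, 60, 70, 80]
PROGRAM: List[Instr] = [("addi", 1, 1, v) for v in VALUES] + [("srl", 2, 1, 3)]

bespoke = prune(FULL_PROCESSOR, PROGRAM)
# The pruned model must behave identically to the full one on this workload.
assert run(bespoke, PROGRAM) == run(FULL_PROCESSOR, PROGRAM)
```

In this sketch the pruned model retains only the `addi` and `srl` handlers, while the comparison against the full model stands in for the equivalence check that any pruning flow needs before the reduced processor can be trusted on its workload.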

CHAPTER 2 APPLICATION SPECIFIC INSTRUCTION SET PROCESSORS

2.1 Introduction

Advancements in semiconductor fabrication technologies have enabled solutions with short product cycles to cope with constantly varying application functionality.
