
Ultra-low-power Design and Implementation of Application-specific Instruction-set Processors for Ubiquitous Sensing and Computing

NING MA

Doctoral Thesis in Electronics and Computer Systems
Stockholm, Sweden 2015

KTH School of Information and Communication Technology
SE-164 40 Kista, Stockholm
SWEDEN

TRITA-ICT 2015:11
ISSN 1653-6363
ISRN KTH/ICT-2015/11-SE
ISBN 978-91-7595-692-3

Academic dissertation which, with the permission of KTH Royal Institute of Technology, is submitted for public examination for the degree of Doctor of Technology in Electronics and Computer Systems on Wednesday, 4 November 2015, at 10:00 in Sal B, Electrum, KTH Royal Institute of Technology, Kista 164 40, Stockholm.

© Ning Ma, September 2015

Printed by: Universitetsservice US AB

Abstract

The feature size of transistors keeps shrinking with the development of technology, which enables ubiquitous sensing and computing. However, with the breakdown of Dennard scaling caused by the difficulty of further lowering the supply voltage, the power density increases significantly. The consequence is that, for a given power budget, the energy efficiency of the hardware resources must be improved to maximize the performance. Application-specific integrated circuits (ASICs) obtain high energy efficiency at the cost of low flexibility for various applications, while general-purpose processors (GPPs) gain generality at the expense of efficiency.

To provide both high energy efficiency and flexibility, this dissertation explores the ultra-low-power design of application-specific instruction-set processors (ASIPs) for ubiquitous sensing and computing. Two application scenarios, i.e. high-throughput compute-intensive processing for multimedia and low-throughput low-cost processing for the Internet of Things (IoT), are implemented in the proposed ASIPs.

Multimedia for human-computer interaction is always characterized by high data throughput. To design processors for networked multimedia streams, customizing application-specific accelerators controlled by the embedded processor is exploited. By abstracting the common features from multiple coding standards, video decoding accelerators are implemented for networked multi-standard multimedia stream processing. Fabricated in 0.13 µm CMOS technology, the processor running at 216 MHz is capable of decoding real-time high-definition video streams with a power consumption of 414 mW. When even higher throughput is required, such as in multi-view video coding applications, multiple customized processors are connected with an on-chip network. Design problems are further studied for selecting the capability of single processors, the number of processors, the capacity of the communication network, as well as the task assignment schemes.

In the IoT scenario, low processing throughput but high energy efficiency and adaptability are demanded for a wide spectrum of devices. In this case, a tile processor including a multi-mode router and dual cores is proposed and implemented. The multi-mode router supports both circuit and wormhole switching to facilitate inter-silicon extension for providing on-demand performance. The control-centric dual-core processor uses control words to directly manipulate all hardware resources. Such a mechanism avoids introducing complex control logic, and the hardware utilization is increased. Programmable control words enable reconfigurability of the processor for supporting general-purpose ISAs, application-specific instructions and dedicated implementations. Reducing global data transfer also increases the energy efficiency. Finally, a single tile processor, together with a network of bare dies and a network of packaged chips, has been demonstrated as the result. The processor is implemented in 65 nm low-leakage CMOS technology and achieves an energy efficiency of 101.4 GOPS/W for each core.

To my family

Acknowledgments

First and foremost, I would like to express my deep gratitude to my advisors: Prof. Li-Rong Zheng, Assoc. Prof. Zhonghai Lu and Dr. Zhuo Zou. I am heartily thankful to Li-Rong for providing me the opportunity to be a doctoral student at KTH and for all the support during these years. His vast expertise and broad vision greatly inspire my research. I am sincerely grateful to Zhonghai for his professional guidance and constructive suggestions. I appreciate all the discussions with him and have benefited a lot from them. I am really thankful to Zhuo for his valuable advice on everything from doing research and writing scientific papers to everyday trivia. It is exciting and helpful to work with him.

I would also like to express my gratitude to Dr. Zhibo Pang for his assistance in the early phase of my research. I gained experience in both research and everyday life during that period. I am indebted to Stefan Blixt at Imsys AB for broadening and deepening my knowledge about processors through plentiful technical discussions. His patience and enthusiasm greatly encourage me.

I would like to thank Dr. Qiang Chen and Dr. Geng Yang for sharing experiences and giving suggestions on both work and life. Many thanks to the previous and current colleagues at KTH and Imsys for all kinds of help that made the Ph.D. journey pleasurable.

I am grateful to Prof. Hannu Tenhunen for reviewing the draft of this dissertation. Many thanks to Dávid Juhász and Nicholas Rose for proofreading this dissertation and providing so many valuable comments. I would also like to express my sincere acknowledgment to Prof. Jari Nurmi from Tampere University of Technology for acting as my opponent, as well as to Prof. Peeter Ellervee from Tallinn University of Technology, Adj. Prof. Lauri Koskinen from University of Turku, and Assoc. Prof. Cristina Rusu from ACREO AB for serving as my committee members.

I also want to thank all the people I have met for making every day special!

Finally, I wish to express my gratitude to my wife Youjia for her understanding, patience and support, to my little boy Kevin for his innocent smiles, and to my parents for their unconditional love and constant support. This dissertation is dedicated to you.

Ning Ma
Stockholm, September 2015

Abbreviations

ALU    Arithmetic Logic Unit
ASIC   Application-specific Integrated Circuit
ASIP   Application-specific Instruction-set Processor
CIF    Common Intermediate Format
CISC   Complex Instruction Set Computer
CMOS   Complementary Metal Oxide Semiconductor
CMP    Chip Multiprocessor
CPU    Central Processing Unit
CS     Circuit Switching
CW     Control Word
DF     Deblocking Filter
DM     Data Memory
DMA    Direct Memory Access
DSP    Digital Signal Processor
FEC    Forward Error Correction
FIFO   First In First Out
FPGA   Field-Programmable Gate Array
FSM    Finite State Machine
FU     Function Unit
GOP    Group of Pictures
GPP    General-purpose Processor
HD     High Definition
HEVC   High Efficiency Video Coding
IM     Instruction Memory
IoT    Internet of Things
IQ     Inverse Quantization
ISA    Instruction Set Architecture
IT     Inverse Transformation
MAC    Multiplication and Accumulate unit
MB     Macro Block
MC     Motion Compensation
MPEG   Moving Picture Experts Group
MVC    Multi-view Video Coding
NoC    Network on Chip
NRE    Non-recurring Engineering
RAM    Random Access Memory
RISC   Reduced Instruction Set Computer
ROM    Read-only Memory


SDRAM  Synchronous Dynamic Random Access Memory
SIMD   Single Instruction Multiple Data
SMT    Sequence Mapping Table
SoC    System on Chip
UHD    Ultra High Definition
VC     Virtual Channel
VCEG   Video Coding Experts Group
VLD    Variable Length Decoding
VLIW   Very Long Instruction Word
VGA    Video Graphics Array
WS     Wormhole Switching

List of Publications

Papers included in the thesis:

1. Ning Ma, Zhonghai Lu, and Li-Rong Zheng. “System design of full HD MVC decoding on mesh-based multicore NoCs,” In Microprocessors and Microsystems (Elsevier), Volume 35, Issue 2, March 2011, Pages 217-229.

2. Ning Ma, Zhuo Zou, Zhonghai Lu, and Li-Rong Zheng. “Design and Imple- mentation of Multi-mode Routers for Large-scale Inter-core networks,” Minor Revision (requires no re-evaluation), Integration, the VLSI Journal (Elsevier), 2015

3. Ning Ma, Zhibo Pang, Jun Chen, Hannu Tenhunen and Li-Rong Zheng. “A 5Mgate/414mW networked media SoC in 0.13um CMOS with 720p multi-standard video decoding,” In Proceedings of IEEE Asian Solid-State Circuits Conference A-SSCC, pp. 385-388, 16-18 Nov. 2009

4. Ning Ma, Zhibo Pang, Hannu Tenhunen and Li-Rong Zheng. “An ASIC-design-based configurable SOC architecture for networked media,” In Proceedings of IEEE International Symposium on System-on-Chip, pp. 1-4, Nov. 2008.

5. Ning Ma, Zhonghai Lu, Zhibo Pang, and Li-Rong Zheng. “System-level exploration of mesh-based NoC architectures for multimedia applications,” In Proceedings of IEEE International SOC Conference (SOCC), pp. 99-104, 27-29 Sept. 2010

6. Ning Ma, Zhuo Zou, Zhonghai Lu, and Li-Rong Zheng. “Implementing MVC Decoding on Homogeneous NoCs: Circuit Switching or Wormhole Switching,” In Proceedings of 23rd Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP), pp. 387-391, 4-6 March 2015

7. Ning Ma, Zhuo Zou, Zhonghai Lu, Stefan Blixt and Li-Rong Zheng. “A hierarchical reconfigurable micro-coded multi-core processor for IoT applications,” In Proceedings of IEEE International Symposium on Reconfigurable and Communication-Centric Systems-on-Chip (ReCoSoC), pp. 1-4, 26-28 May 2014

8. Ning Ma, Zhuo Zou, Zhonghai Lu, Yuxiang Huan, Stefan Blixt and Li-Rong Zheng. “A 2-mW Multi-mode Router Design with Dual-core Processor in 65 nm LL CMOS for Inter-silicon Communication,” Manuscript submitted to Microelectronics Journal (Elsevier), 2015

9. Ning Ma, Zhuo Zou, Yuxiang Huan, Stefan Blixt, Zhonghai Lu and Li-Rong Zheng. “A 101.4 GOPS/W Reconfigurable and Scalable Control-centric Embedded Processor for Domain-specific Applications,” Manuscript submitted to IEEE International Symposium on Circuits and Systems (ISCAS), 2016

Papers not included in the thesis:

10. Ning Ma, Zhuo Zou, Yuxiang Huan, Stefan Blixt, Zhonghai Lu and Li-Rong Zheng. “A Cacheless Control-centric Dual-core X3 Processor for Real-time Energy-efficient Processing,” Manuscript submitted to Special Issue on Sustainable Processor Architectures and Applications, Journal of Microprocessors and Microsystems (Elsevier), 2015

11. Yuxiang Huan, Ning Ma, Stefan Blixt, Zhuo Zou, and Lirong Zheng, “A 61 µA/MHz Reconfigurable Application-Specific Processor and System-on-Chip for Internet-of-Things,” In Proceedings of IEEE International SOC Conference (SOCC), pp. 235-239, 9-11 Sept. 2015.

Contents

Contents

List of Figures

1 Introduction
   1.1 Background
   1.2 Motivation
   1.3 Thesis contributions and organization

2 Low-power application-specific instruction-set processors
   2.1 Low-power techniques
   2.2 Applications
   2.3 Architectures
   2.4 Approaches

3 Single-ASIP implementation for multi-standard multimedia processing
   3.1 Introduction
   3.2 Design methodology
   3.3 Video coding structure
   3.4 Accelerator-enriched architecture
   3.5 Design instance
   3.6 Implementation results
   3.7 Summary

4 Multi-processor architecture exploration for multi-view video decoding
   4.1 Background
   4.2 Design space exploration
   4.3 Communication network
   4.4 Summary

5 Customizable tile-ASIP implementation for Internet of Things

   5.1 Background
   5.2 Multi-mode router
   5.3 Control-centric processor
   5.4 Tile implementation
   5.5 Summary

6 Conclusions
   6.1 Thesis summary
   6.2 Future work

Bibliography

List of Figures

1.1 Time line of computers
1.2 Trends of video applications
1.3 One scenario of IoT
1.4 Applications and implementations

2.1 Comparison of ASIC, GPP and ASIP
2.2 Architectures of ASIPs
2.3 Design flow for extending an existing processor
2.4 Design flow for customizing a processor

3.1 Video coding standards
3.2 Design flow
3.3 HW/SW partition
3.4 General structure of video coding
3.5 Accelerator-enriched architecture
3.6 Variety of data transferring
3.7 Design instance for networked multimedia processing
3.8 Video decoding pipeline
3.9 Die photo and power consumption
3.10 Test system and application scenario

4.1 Typical prediction structure
4.2 An example of 4x4 2D mesh NoC
4.3 Mapping multimedia applications to NoC
4.4 Simplified system
4.5 Work flow for the system exploration
4.6 Hierarchical structure of MVC
4.7 Picture-level parallelism
4.8 A possible data flow for picture-level parallel processing
4.9 Design options for picture-level task assignment
4.10 A possible data flow for view-level parallel processing
4.11 Design options for view-level task assignment
4.12 Router structures

5.1 Proposed multi-mode router
5.2 Maximum throughput with only CS or WS activated
5.3 The comparison of throughput and latency
5.4 Areas for routers with different configurations
5.5 Overall throughput (OT) and power efficiency (PE)
5.6 Complete structure of the on-chip multi-mode router in one tile
5.7 Architecture overview
5.8 Structure of one control word
5.9 Mapping from applications to the processor
5.10 Comparison of instruction processing
5.11 Control word timing
5.12 Hierarchical implementation structure
5.13 Die photo together with gate count breakdown
5.14 Power consumption under different working modes and conditions
5.15 Network on a board and network in a package

Chapter 1

Introduction

1.1 Background

When the first electronic general-purpose computer ENIAC was announced in 1946, it used over 17,000 vacuum tubes and covered 140 m², with a power consumption of 174 kW [1]. Programming it required manually setting switches and cables. The vacuum tubes were later substituted by discrete transistors, and further by integrated circuits. The miniaturization and integration of devices greatly decreased the size and power consumption as well as increased the performance, which enabled personal and portable computers. Figure 1.1 shows a brief history of computers. With the utilization of CMOS and following Moore's law, a monolithic integrated circuit can hold more and more transistors, since the feature size scales down roughly every two years [2]. Table 1.1 gives the power scaling for fixed-size chips.


Figure 1.1: Time line of computers

Table 1.1: Power scaling for a fixed-sized chip, from [4]

Symbol        Dennard scaling   Post-Dennard scaling   Notes
Q             S²                S²                     Quantity of transistors
C             1/S               1/S                    Capacitance
F             S                 S                      Frequency
V             1/S               1                      Voltage
P = QCFV²     1                 S²                     Power consumption

Assuming a scaling factor of S, if the size of a chip remains the same as that of the previous generation, the quantity of transistors and the maximum working frequency scale by factors of S² and S, respectively. The capability of a chip of the same size is thus improved by a factor of S³. To keep the power consumption constant, the supply voltage needs to be scaled down correspondingly, which is called Dennard scaling [3]. During this period, the performance of processors was increased by raising the clock frequency, supported by more and more sophisticated architectures, e.g. deeper pipelines, speculation, hierarchical caches, etc., enabled by the additional transistors. The powerful yet small processors boosted the miniaturization of computers, and today's mobile phones are more capable than the computers of earlier generations.

Further miniaturization merges computers into the ambient environment and facilitates ubiquitous sensing and computing empowered by smart electronics. Application scenarios include both high-data-throughput applications such as human-computer interaction and low-data-throughput applications such as the Internet of Things (IoT). Ultra-low-power processors are highly demanded for such miniaturized devices. On the other hand, while the transistor size continues to shrink, the supply voltage cannot be scaled down correspondingly due to leakage problems, which breaks Dennard scaling [4]. As shown in Table 1.1, if all the transistors work at the maximum frequency, the power consumption will increase exponentially. For a given power budget, either the chip size or the activity of the transistors must be reduced.

The traditional low-power techniques such as clock gating and power gating, which either lower the working frequency or completely shut down the circuit, decrease the power consumption but degrade the performance at the same time. Dynamic voltage and frequency scaling (DVFS) is utilized in [5, 6, 7] to adapt the supply voltage and clock frequency to provide the proper performance for different application scenarios. When high performance is not needed, a sub-threshold supply voltage can even be applied to save energy at the cost of large delay and slow speed, as reported in [8, 9]. In order to lower the power consumption while maximizing the performance, energy efficiency and hardware utilization must be effectively improved in the post-Dennard era. The additional transistors provided by Moore's law should be utilized efficiently instead of only increasing the clock frequency of processors.
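The scaling relations in Table 1.1 can be made concrete with a short sketch (not part of the original thesis). It evaluates P = Q·C·F·V² over a few technology generations under both scaling assumptions; the scaling factor S and all values are normalized and purely illustrative.

#include <stdio.h>
#include <math.h>

/* Illustrative sketch: relative power of a fixed-sized chip after n
 * technology generations, following Table 1.1.  S is the assumed scaling
 * factor per generation; all quantities are normalized to 1.0 at n = 0. */
int main(void) {
    const double S = 1.4;                     /* assumed scaling factor */
    for (int n = 0; n <= 4; n++) {
        double q = pow(S, 2.0 * n);           /* transistor count: S^2 per generation */
        double c = pow(S, -1.0 * n);          /* capacitance:      1/S per generation */
        double f = pow(S, 1.0 * n);           /* frequency:        S per generation   */

        double v_dennard = pow(S, -1.0 * n);  /* Dennard: voltage scales with 1/S */
        double v_post    = 1.0;               /* post-Dennard: voltage stays flat */

        double p_dennard = q * c * f * v_dennard * v_dennard;  /* stays ~1        */
        double p_post    = q * c * f * v_post * v_post;        /* grows as S^(2n) */

        printf("gen %d: P_dennard = %.2f, P_post_dennard = %.2f\n",
               n, p_dennard, p_post);
    }
    return 0;
}

Under Dennard scaling the power of a fixed-sized chip stays constant, while in the post-Dennard case it grows by S² per generation, which is exactly the exponential increase discussed above.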

Figure 1.2: Trends of video applications. (a) Picture resolutions, from CIF and VGA through HD (1920x1080) and 4K UHD (3840x2160) to 8K UHD (7680x4320); (b) views, from single view through stereo to multiple views.

Therefore, architectural-level optimization for high energy efficiency becomes critical. This dissertation explores the design of ultra-low-power application-specific instruction-set processors (ASIPs) for ubiquitous sensing and computing in both high-throughput compute-intensive processing of multimedia and low-throughput low-cost processing of IoT scenarios. The low power consumption and high performance are realized by increasing the energy efficiency through customization and specialization at the architectural level for particular scenarios.

High-throughput applications

Multimedia processing in human-computer interaction always needs high-performance processors due to the extremely large data throughput. In the case of video processing, for example, the resolution of videos has increased from CIF (352x288) and VGA (640x480) to HD (1920x1080), and even to 8K UHD (7680x4320) [10]. Digital Cinema Initiatives (DCI) has already adopted 2K (2048x1080) and 4K (4096x2160) images [11]. NHK is going to carry out experimental broadcasting of Super Hi-Vision (7680x4320) in 2020 [12]. From another point of view, the included views of video streams can vary from a single view through stereo to multiple views, as shown in Figure 1.2.


Figure 1.3: One scenario of IoT

3D video is an important application and has a great impact on our daily life [13, 14, 15]. The increase in resolution and number of views has resulted in a tens- or even hundreds-fold increase in data volume. This trend will continue, since it is people's instinct to pursue higher quality and a more vivid experience. Considering the variety of similar applications and their rapid updates, flexibility is also a merit of the design.

Low-throughput applications

As one of the enabling technologies for ubiquitous sensing and computing, the Internet of Things (IoT) will connect everything in the surroundings seamlessly [16]. There will be billions or trillions of devices embedded in those “things”. Figure 1.3 shows one of the scenarios. Lightweight architectures capable of sensing, processing and communication are highly demanded in an extremely low-power and low-cost manner. Low power consumption is one of the most important characteristics required of IoT devices due to their limited power supply. Furthermore, a certain computational capability is still necessary for signal processing and communication. The processors should also feature flexibility and scalability to adapt to the various and vast numbers of IoT devices.

Architecture alternatives

Many architectures might be used for the aforementioned application scenarios. General-purpose processors (GPPs) have good programmability and can be easily applied to different applications. However, limited performance and high power consumption make them insufficient for high-throughput multimedia processing, particularly in portable devices [17, 18]. The low energy efficiency of GPPs is also an obstacle for IoT applications.

Digital signal processors (DSPs) enhance the computational capability for digital signal processing and keep good programmability as well [19]. But the performance is still not adequate for sophisticated multimedia applications, and DSPs are not suitable for IoT applications because their capability for general-purpose control is weak and they are not energy-efficient enough. Application-specific integrated circuits (ASICs) are superior to other architectures from the perspective of both performance and energy efficiency [20, 21, 22, 23, 24]. The drawback is the low flexibility and scalability when adapting to different applications, which results in high design and implementation cost as well as long time-to-market. Application-specific instruction-set processors (ASIPs) can improve the energy efficiency and provide flexibility as well [25, 26]. The programmable processor provides the adaptability for similar applications, while the customized function units guarantee the necessary performance. Implementations of ASIPs for multimedia and IoT have been reported in [27, 28, 29, 30, 31].

1.2 Motivation

The two mentioned scenarios represent two extremes of the application space. Different strategies are necessary for designing ultra-low-power, customizable, energy-efficient processors according to the implementation targets. This dissertation explores and verifies the design strategies by proposing and realizing the corresponding processors.

Networked multimedia processing is the performance-demanding application. Furthermore, there are many formats for the multimedia streams. In order to handle networked multimedia streams, ASIPs with high performance for high-definition video coding and flexibility for multiple formats should be designed. In this work, video accelerators supporting high-definition multi-standard video decoding are designed and implemented. These accelerators are controlled by specific instructions issued by the embedded processor, and both the number and the computational capability of the accelerators are adjustable to provide the needed performance.

As introduced in the last section, another trend for video applications is the utilization of multiple views of the same scene. The videos from different view directions are recorded simultaneously. Multi-view video coding (MVC) has been standardized as an amendment of the H.264/MPEG-4 AVC video coding standard [32], and can be used in 3D video, free-view video applications, etc. Compared with the architecture for single-view applications, the performance must be improved many times over to process video with multiple views. A reasonable solution is to employ multiple processors by exploiting the parallelism in MVC. Considering the huge volume of data exchanged among different processors, an efficient communication architecture is required. Network-on-chip (NoC) based multi-processors are thus used in the architecture proposed for MVC applications in this dissertation.

Figure 1.4: Applications and implementations. Networked multi-standard multimedia decoding is handled by a dual-core processor with video accelerators, multi-view video decoding by NoC-based architectures, and IoT applications by a dual-core control-centric processor with an integrated router.

For the applications discussed above, the design of processors is performance-driven, while from the IoT perspective, high energy efficiency under extremely low-power conditions is more important. At the same time, to adapt to different scenarios, the ASIPs should also feature flexibility and scalability. Based upon the preceding considerations, a reconfigurable and scalable control-centric dual-core processor with an on-chip multi-mode router is proposed and implemented. The high energy efficiency is achieved by increasing the hardware utilization of functional units, decreasing non-functional units, and reducing excessive global data transfer. The reorganizable functional units enable reconfigurability of the processor, and the on-chip multi-mode router extends the scalability of the processor even in the post-fabrication stage. Figure 1.4 summarizes the applications, together with the required features and the proposed systems in this dissertation.

1.3 Thesis contributions and organization

This dissertation is organized as follows:

Chapter 2

This chapter reviews the related work on the design of low-power ASIPs. The possible application scenarios, including multimedia, communication, biomedical applications, etc., are presented, followed by the implementation architectures. The two approaches for designing ASIPs are given at the end of the chapter.

Chapter 3

The design and implementation of an ASIP for networked multimedia processing are described in this chapter. The design flow for software/hardware co-design is first given, followed by an analysis of the application. The video processing accelerators are then customized and integrated with the embedded processor cores to construct a high-performance, low-power processor. The chip implementation and test system are demonstrated as the results.

• Contributions: The design strategy of utilizing customizable accelerators together with embedded processors is verified for applications where high performance and energy efficiency are required. As a result, an accelerator-enriched dual-core architecture is proposed for performance-demanding applications. A design instance for high-definition multi-standard video decoding is then fabricated and tested, showing high performance and low power consumption.

The included papers are:

• Paper I. Ning Ma, Zhibo Pang, Hannu Tenhunen and Li-Rong Zheng. “An ASIC-design-based configurable SOC architecture for networked media,” In Proceedings of International Symposium on System-on-Chip, pp. 1-4, Nov. 2008.

• Paper II. Ning Ma, Zhibo Pang, Jun Chen, Hannu Tenhunen and Li-Rong Zheng. “A 5Mgate/414mW networked media SoC in 0.13um CMOS with 720p multi-standard video decoding,” In Proceedings of IEEE Asian Solid-State Circuits Conference A-SSCC, pp. 385-388, 16-18 Nov. 2009

Chapter 4

This chapter explores the design of a multi-processor architecture for multi-view video decoding. It begins with a theoretical analysis of a simplified system, from which the exploration work flow is derived. To implement multi-view video decoding, both picture-level and view-level task assignment schemes are introduced and discussed for the homogeneous multi-processor architecture. Based on the exploration platform, the design options for each task assignment scheme are studied and summarized.

• Contributions: A multi-processor architecture is proposed and explored for multi-view video applications, which demand even higher performance. To the best of our knowledge, a scalable and flexible architecture for MVC decoding had not been studied before. It is also found that, to achieve a specific performance target, the computation and communication subsystems should be balanced so that all the resources can be utilized efficiently.

The included papers are:

• Paper III. Ning Ma, Zhonghai Lu, and Li-Rong Zheng. “System design of full HD MVC decoding on mesh-based multicore NoCs,” In Microprocessors and Microsystems (Elsevier), Volume 35, Issue 2, March 2011, Pages 217-229.

• Paper IV. Ning Ma, Zhonghai Lu, Zhibo Pang, and Li-Rong Zheng. “System-level exploration of mesh-based NoC architectures for multimedia applications,” In Proceedings of International SOC Conference (SOCC), pp. 99-104, 27-29 Sept. 2010

• Paper V. Ning Ma, Zhuo Zou, Zhonghai Lu, and Li-Rong Zheng. “Implementing MVC Decoding on Homogeneous NoCs: Circuit Switching or Wormhole Switching,” In Proceedings of 23rd Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP), pp. 387-391, 4-6 March 2015

Chapter 5

The energy-efficient processor for IoT applications is proposed and implemented in this chapter. It covers two topics: designing multi-mode routers for modular extension and proposing a processor architecture for high energy efficiency. The necessity of supporting both circuit and wormhole switching in one router to compose the inter-silicon network is discussed first, and the corresponding multi-mode router is then introduced. Aiming to increase the energy efficiency by improving the hardware utilization, the control-centric processor architecture is proposed. The implementation results of the module chip integrating both the processor and the router are demonstrated at the end of this chapter.

• Contributions: A quantitative comparison of circuit and wormhole switching is presented to give guidelines for choosing the proper switching mode in a specific scenario. To adapt to various application scenarios, a multi-mode router supporting both circuit and wormhole switching is proposed and implemented. The multi-mode router is further integrated with a reconfigurable dual-core control-centric processor to construct the processor tile. The tile can either be used independently or compose a large-scale multi-processor network to provide high performance, energy efficiency and flexibility.

The included papers are:

• Paper VI. Ning Ma, Zhuo Zou, Zhonghai Lu, and Li-Rong Zheng. “Design and Implementation of Multi-mode Routers for Large-scale Inter-core networks,” Minor Revision (requires no re-evaluation), Integration, the VLSI Journal (Elsevier), 2015

• Paper VII. Ning Ma, Zhuo Zou, Zhonghai Lu, Yuxiang Huan, Stefan Blixt and Li-Rong Zheng. “A 2-mW Multi-mode Router Design with Dual-core Processor in 65 nm LL CMOS for Inter-silicon Communication,” Manuscript submitted to Microelectronics Journal (Elsevier), 2015

• Paper VIII. Ning Ma, Zhuo Zou, Zhonghai Lu, Stefan Blixt and Li-Rong Zheng. “A hierarchical reconfigurable micro-coded multi-core processor for IoT applications,” In Proceedings of International Symposium on Reconfigurable and Communication-Centric Systems-on-Chip (ReCoSoC), pp. 1-4, 26-28 May 2014

• Paper IX. Ning Ma, Zhuo Zou, Yuxiang Huan, Stefan Blixt, Zhonghai Lu and Li-Rong Zheng. “A 101.4 GOPS/W Reconfigurable and Scalable Control-centric Embedded Processor for Domain-specific Applications,” Manuscript submitted to IEEE International Symposium on Circuits and Systems (ISCAS), 2016

Chapter 6

This chapter concludes the dissertation and indicates future work, ranging from processor architecture and programming models to application mapping. Brain-inspired computing will be the target of that work, to further decrease the power consumption while achieving high performance.


Chapter 2

Low-power application-specific instruction-set processors

For specific applications, many architectures can be employed as alternative implementations. An application-specific integrated circuit (ASIC) utilizes specialized hardware units and fixed data paths to achieve parallel and dedicated processing. High performance and energy efficiency are its advantages; the weaknesses are poor generality, high design complexity and non-recurring engineering (NRE) cost. Furthermore, as more transistors can be integrated on one single chip, the design complexity has greatly increased. Verifying and testing such large-scale circuits pose tremendous challenges for ASIC implementations [33, 34].

On the other hand, general-purpose processors (GPPs) serialize the operations of the arithmetic logic unit (ALU) through instructions for various applications. Wide generality is their merit, and implementing a specific application reduces to designing the sequence of instructions for the GPP. The time-to-market is greatly shortened compared with an ASIC implementation. However, GPP implementations suffer from low performance and poor energy efficiency because of the serial processing of instructions and the overhead of fetching and decoding them.

Application-specific instruction-set processors (ASIPs) try to balance performance, efficiency and generality for specific applications. Dedicated hardware accelerators, data paths or processing flows are customized for the related applications to provide sufficient performance and efficiency. Meanwhile, the specialized functions alleviate the instruction-processing overhead of GPPs. The instruction-controlled processing or reconfigurable logic guarantees generality for similar applications, and gains programmability and flexibility compared with ASICs. The comparison of ASIC, GPP and ASIP is shown in Figure 2.1.

Figure 2.1: Comparison of ASIC, GPP and ASIP in terms of performance, design complexity, energy efficiency and generality

2.1 Low-power techniques

Decreasing the working frequency or the supply voltage is the direct way to reduce power consumption, but it lowers performance as well. Clock gating [35, 36, 37] and power gating [38, 39, 40] are often employed to deactivate circuits when they are not used. Besides, dynamic voltage and frequency scaling (DVFS) tunes the operating conditions of processors at a finer grain to provide the proper performance in diverse application scenarios [41]. In [42], the processor can work at full speed with a supply voltage of 1.14 V and a frequency of 1 GHz, consuming 125.3 W; it can also work at a low power consumption of 24.7 W at 0.7 V and 125 MHz. Fine-grained adaptive DVFS is implemented in [43] by applying multiple DVFS configurations based on real-time power and temperature data collected through on-die power and thermal meters. The purpose is to achieve the maximum performance while keeping temperature and power under predefined constraints. In [44], an on-chip controller performs per-core DVFS by monitoring workload, temperature, voltage and current dynamically. The supply voltage in [45] can even be lowered to 210 mV for a JPEG encoder working at 5 MHz.

Heterogeneity is another way to lower the power consumption. ARM's big.LITTLE heterogeneous multi-core processor [46] runs the big core for short durations at peak frequency and the LITTLE core for the majority of the time at moderate operating frequencies. Up to 75 % of the energy can be saved without decreasing the delivered performance. Similarly, in [47] a high-performance CPU1 runs the performance-intensive tasks with much higher power consumption, and general applications are mapped to the power-efficient CPU2. Two architecturally identical but physically different cores are utilized in [48]. One core is implemented using worst-case corners to achieve the target frequency, while the other one uses typical corners to gain energy efficiency.

The resulting energy savings can be up to 20 %. Asynchronous design is proposed in [49] for neural signal processors. Handshaking protocol circuitry is added between function units for data exchange. Compared with the synchronous design, the asynchronous processor shows a 2.3x reduction in power.

Integrating customized application-specific function units gains both low power consumption and high performance. In [50], fine-grained parallelism is implemented by SIMD architectures to accelerate classic sequence alignment procedures. Together with the multi-core structure, 25x less energy is used while achieving similar performance to a quad-core Core i7 3820 processor. A data compression algorithm is mapped to a specific circuit in [51] for lowering the communication data rate in artificial vision systems. The energy reduction ratio can be up to 85 % compared with its base processor. Linear matrix and vector operations are enhanced for adaptive filters in neural coding in [52], where the proposed application-specific instruction-set processors provide high performance and flexibility at low power consumption compared with implementations on commercial CPUs.
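The savings obtained by the frequency and voltage scaling discussed above follow from the usual dynamic-power approximation P ≈ a·C·V²·f. The following minimal C sketch (not from any of the cited designs) shows a simple DVFS policy that picks the lowest operating point still meeting a frame deadline; the operating points, activity factor and capacitance are hypothetical values chosen only for illustration.

#include <stdio.h>

/* Illustrative sketch of a DVFS policy: choose the lowest voltage/frequency
 * pair that still meets the workload deadline, and estimate the dynamic
 * power as P = a * C * V^2 * f.  All numbers below are assumptions. */

typedef struct {
    double freq_mhz;   /* clock frequency  */
    double volt;       /* supply voltage   */
} OpPoint;

static double dyn_power_mw(OpPoint p, double act, double cap_nf) {
    /* P[mW] = a * C[nF] * V^2 * f[MHz] */
    return act * cap_nf * p.volt * p.volt * p.freq_mhz;
}

int main(void) {
    /* hypothetical operating points of an embedded core, lowest first */
    OpPoint points[] = { {27.0, 0.8}, {108.0, 1.0}, {216.0, 1.2} };
    double cycles_per_frame = 4.0e6;   /* assumed workload            */
    double deadline_ms      = 40.0;    /* e.g. 25 fps -> 40 ms/frame  */

    for (int i = 0; i < 3; i++) {
        double time_ms = cycles_per_frame / (points[i].freq_mhz * 1e3);
        if (time_ms <= deadline_ms) {  /* first (lowest) point that meets it */
            printf("run at %.0f MHz / %.1f V: %.1f ms/frame, ~%.0f mW dynamic\n",
                   points[i].freq_mhz, points[i].volt, time_ms,
                   dyn_power_mw(points[i], 0.2, 5.0));
            break;
        }
    }
    return 0;
}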

2.2 Applications

ASIPs are implemented for applications such as multimedia processing, communication, health care, etc.

High efficiency video coding (HEVC) is the new standard for video compression, and it can achieve a very high compression ratio at high implementation complexity. An ASIP is proposed in [53] targeting the acceleration of motion compensation, which is the most compute-intensive task in the video coding process. Three different models with diverse hardware configurations are evaluated to show the trade-offs between implementation cost and performance. Dedicated hardware accelerators for inter prediction and entropy coding can be found in the ASIP designed in [54] for the H.264/AVC video coding standard. Hardware accelerators work in parallel with the main processor and are coordinated through the customized instructions. Image detection is another application scenario. The Histogram of Oriented Gradients algorithm is run in the Support Vector Machine implemented by an ASIP in [55]. SIMD instructions, wide memory interfaces and VLIW extensions are added to a RISC processor to achieve high performance, power efficiency and programmability. An audio ASIP is implemented in [56] to decode MPEG-2/4 AAC audio streams. Specific instructions for inverse DCT, Huffman decoding, inverse quantization, etc. are integrated to accelerate the decoding speed.

Multi-standard multi-mode channel coding in wireless communication is another important application scenario for ASIPs. Two turbo decoders are presented in [57] with different design objectives. The TurbASIP supplies the maximum flexibility by supporting all of the different modes in existing wireless communication standards. Diverse parallelism techniques are exploited, and the devised architecture together with dedicated instructions is properly selected for the target flexibility.

The other turbo decoder, TDecASIP, aims to increase the efficiency by limiting the flexibility and supporting the turbo codes with the related parameters specified in the 3GPP LTE, WiMAX and DVB-RCS standards. Both the instruction set architecture and the memory organization are simplified to increase the execution efficiency. A multi-standard forward error correction ASIP processor is designed in [58], offering QC-LDPC decoding for WiMAX, Turbo decoding for 3GPP-LTE and convolutional code decoding. By analyzing the supported decoding algorithms, a general-purpose algorithm that can be further tuned to the specific decoding algorithm is selected, and the common parts are abstracted and implemented by shared functional blocks. Optimized instructions manage the operations and data paths of the functional blocks to implement specific decoding algorithms. In [59], a baseband ASIP processor is designed for multi-mode MIMO detection by parameterizing the dedicated hardware block for specific algorithms.

ASIPs can also be found in promising biomedical applications. In [60], a dual-core system is proposed for wearable sensor devices, which employs an ASIP core optimized for processing ECG, EMG and EEG signals and a low-power embedded processor as a controller. The energy efficiency is greatly improved compared with the implementation in a general-purpose processor. An integrated SoC is implemented in [61] for digital hearing aids. The embedded ASIP extends a RISC instruction set by adding dedicated hardware accelerators and compacting certain instructions to provide both flexibility and power efficiency for digital audio signal processing. In [51], a RISC processor and run-length coding accelerators compose the ASIP for data compression in artificial vision systems. Special instructions are designed for data transferring and run-length counting. The energy consumption is greatly reduced for both the compression and the decompression process compared with the implementation on the base RISC processor. DNA sequence alignment is the target application scenario in [62]. SIMD instructions for vector-vector and vector-scalar arithmetic, logic, load/store and control are implemented to exploit fine-grained parallelism in the related algorithms, and coarse-grained parallelism is achieved by utilizing a multi-processor architecture for querying multiple sequences.

Besides, in other applications such as compressed sensing [63], encryption [64], packet classification [65], etc., which are scenarios with an emphasis on both flexibility and energy efficiency, the utilization of ASIPs can meet the requirements properly.

2.3 Architectures

Figure 2.2 shows the architectures that are utilized for ASIPs. Dedicated hardware accelerators can serve as execution units invoked by application-specific instructions, as shown in Figure 2.2(a). In [66], a bit-stream engine designed for variable length coding in multiple video formats is integrated into a conventional general-purpose processor, where the bitwise computation is the bottleneck. The coding/decoding instructions share a common data path with the general-purpose instructions but use different execution paths. The processor gains a performance increase for variable length coding and maintains the flexibility as well.

Figure 2.2: Architectures of ASIPs: (a) accelerator, (b) SIMD, (c) VLIW, (d) multiple cores

For the application of software-defined radio in [67], a trigonometric math unit, a Viterbi complex unit and a floating-point control law accelerator are coupled to a general-purpose CPU. The ASIP can provide enough performance for the existing complex wireless communication algorithms and shows the possibility of implementing additional standards.

Single instruction multiple data (SIMD) architectures exploit data-level parallelism to accelerate compute-intensive tasks. Multiple data are operated on with the same operation on multiple execution units simultaneously under the control of one instruction. A vector function unit (VFU) is designed in [68] for the Fast Fourier Transform (FFT) used in various wireless communication standards. The VFU can perform sixteen 16-bit MACs or eight 32-bit MACs simultaneously. The memory bandwidth is enlarged to ease the data transfer between the vector register file and the main memory. In [69], arithmetic, logic and load/store SIMD instructions realize data-parallel processing on vectors of 8 or 16 elements, according to the size of the operated pixel block in one image, when implementing the Advanced Video Coding (AVS) standard. The tasks with a heavy computational load, such as motion compensation, deblocking, etc., are greatly accelerated by the SIMD instructions.

Figure 2.2(c) presents the very long instruction word (VLIW) architecture, which provides instruction-level parallel processing to increase the execution throughput and hardware utilization. Multiple independent instructions are issued and executed concurrently. In [70], the VLIW architecture is optimized for stereoscopic video applications. The basic instruction processing unit, i.e. the vector unit, is replicated to utilize possible instruction parallelism. A VLIW processor is implemented on an FPGA for image processing in [71]. The application is first realized using a high-level programming language. By analyzing the generated assembly code, the independent instructions that can be executed in parallel are bound into the very long instructions.

To further increase the performance, task- or thread-level parallelism can be utilized by multi-core or multiprocessor architectures. In [72], multiple ASIP cores are connected by a butterfly NoC network and can be dynamically configured through a dedicated communication structure. By selecting the suitable number of active ASIP cores and the corresponding configuration parameters, multi-mode and multi-standard turbo decoding have been implemented. Turbo decoders are also implemented in [73] by connecting ASIPs and memories with networks to increase the throughput. The proposed system employs a ring-style topology and enables simultaneous data exchange between the ASIPs and the memories. Multiple ASIPs are connected through a shared memory in [74] for H.264 video encoding. Each ASIP is responsible for coding the macroblocks (MB) in one slice of a picture. The MBs in different slices are processed by different ASIPs in parallel. Furthermore, several video accelerators are integrated into each ASIP to increase its processing performance. According to the throughput requirements for different video resolutions, the platform supports different configurations of the ASIP cores. A multi-core platform for multi-format video decoding is proposed in [75], where macroblock-row-level parallelism is investigated and utilized by multiple ASIP cores. Every single ASIP core features a dual-issue VLIW architecture and designated SIMD instructions for pixel-level acceleration.
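To make the data-level parallelism of Figure 2.2(b) concrete, the following C sketch (illustrative only, not taken from any of the cited designs) contrasts a scalar pixel loop with an 8-wide operation of the kind a single SIMD instruction would perform; the vector width and the saturating-add operation are assumptions.

#include <stdio.h>
#include <stdint.h>

/* A scalar core spends roughly one instruction per pixel, while a SIMD
 * unit applies the same operation to a whole vector of pixels at once.
 * The "vector op" is modelled here as a C function; in an ASIP it would
 * be one custom instruction operating on a vector register. */

#define VLEN 8  /* assumed SIMD vector width in pixels */

static void add_sat_u8_scalar(const uint8_t *a, const uint8_t *b,
                              uint8_t *out, int n) {
    for (int i = 0; i < n; i++) {               /* n iterations on a scalar core */
        int s = a[i] + b[i];
        out[i] = (uint8_t)(s > 255 ? 255 : s);  /* saturate to 8 bits */
    }
}

static void add_sat_u8_vector(const uint8_t *a, const uint8_t *b, uint8_t *out) {
    /* conceptually one SIMD instruction: VLEN saturating adds in parallel */
    for (int lane = 0; lane < VLEN; lane++) {
        int s = a[lane] + b[lane];
        out[lane] = (uint8_t)(s > 255 ? 255 : s);
    }
}

int main(void) {
    uint8_t a[VLEN] = {10, 250, 30, 40, 200, 60, 70, 80};
    uint8_t b[VLEN] = {10,  10, 10, 10, 100, 10, 10, 10};
    uint8_t r_scalar[VLEN], r_vector[VLEN];

    add_sat_u8_scalar(a, b, r_scalar, VLEN);  /* ~8 instructions' worth of work */
    add_sat_u8_vector(a, b, r_vector);        /* ~1 vector instruction          */

    for (int i = 0; i < VLEN; i++)
        printf("%u ", r_vector[i]);
    printf("\n");
    return 0;
}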

2.4 Approaches

ASIPs are implemented either by extending an existing processor or by customizing a new one. The former approach strengthens off-the-shelf processor cores with application-specific function units. Application-specific instructions together with the original general-purpose instruction set compose the final application-specific instruction set. The tool chain for software development is inherited from the existing processor. The design flow of this type of ASIP is shown in Figure 2.3. Based on the analysis of the applications, the time-consuming tasks are accelerated by the designated instructions, while the controlling tasks are managed by the general-purpose instructions.

In [50], 56 SIMD instructions for biological sequence alignment are implemented on a 32-bit Harvard RISC processor to efficiently utilize the fine-grained parallelism in the corresponding algorithms. By adding the specialized instruction set in the back end, an already existing compiler is extended to support the new ASIP. 40 carefully crafted specialized instructions for slice header processing are integrated into the base ISA of a 32-bit RISC processor (ARP32) for video decoding in [76]. Each customized instruction is mapped to an intrinsic function by the ARP32 compiler. The base ISA handles general-purpose controlling and processing, while the customized instructions deal with the complex slice header processing functions. The AES encryption algorithm is mapped to the vector units in the ASIP designed in [77]. By integrating the AES vector units in a 16-bit default processor, the proposed ASIP can be used in wireless sensor node applications, and the security service is provided by the add-on application-specific instructions.


Figure 2.3: Design flow for extending an existing processor

The other implementation approach is to customize a processor, which generates a new processor. The instruction set is tailored to support only the target applications. The control units, execution units, data paths and pipelines are all customized to maximize the performance for a cluster of applications. The software tool chain also needs to be adjusted to fit the new processor. The corresponding design flow is summarized in Figure 2.4. Compared with the previous design methodology, the totally new instruction set and architecture gain high performance and energy efficiency for the target applications. However, the target applications should include few general-purpose controlling and processing tasks, due to the limitations of the instruction set.


Figure 2.4: Design flow for customizing a processor

In [78], an ASIP designed for elliptic curve cryptography employs a finite-state machine (FSM) to control the execution of eight designated instructions implementing the functions of point addition and doubling. Different algorithms can be realized by programming in assembly code, and the final program is stored in the internal ROM. Multi-standard forward error correction (FEC) decoding is supported in the SIMD ASIP proposed in [79]. The computational units, data paths and memory system are all customized according to the common features of the FEC algorithms. In order to decode a specific format, assembly code is used to designate the different fields of the instructions for controlling the computation and the data flow. As a result, the proposed ASIP can provide high-throughput decoding of turbo, QC-LDPC and CC codes. The ASIP in [80] implements motion estimation (ME) for video codecs. Three instructions, including Sum of Absolute Differences (SAD), Compare, and Mode Decision, together with an efficient hardware architecture, are designed to handle the real-time processing of ME for HD video.

This dissertation utilizes both of the two approaches. Moreover, the proposed customizable control-centric processor supports both approaches simultaneously. Either existing general-purpose ISAs or completely customized ISAs can be mapped to the processor as needed.
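As a concrete illustration of the kind of operation such custom instructions replace, the following sketch (not from the cited designs) is a plain C sum of absolute differences over a 16x16 macroblock; an ME accelerator or a SAD instruction like the one mentioned above collapses the two inner loops into hardware and evaluates many candidate positions per cycle. The data used in main() is synthetic.

#include <stdio.h>
#include <stdlib.h>

/* Reference C implementation (illustrative only) of the sum of absolute
 * differences between a 16x16 current block and a candidate reference
 * block, the core operation of block-based motion estimation. */

#define MB 16

static unsigned sad_16x16(const unsigned char cur[MB][MB],
                          const unsigned char ref[MB][MB]) {
    unsigned sum = 0;
    for (int y = 0; y < MB; y++)
        for (int x = 0; x < MB; x++)
            sum += (unsigned)abs((int)cur[y][x] - (int)ref[y][x]);
    return sum;
}

int main(void) {
    unsigned char cur[MB][MB], ref[MB][MB];
    /* fill the blocks with synthetic data for demonstration */
    for (int y = 0; y < MB; y++)
        for (int x = 0; x < MB; x++) {
            cur[y][x] = (unsigned char)(x + y);
            ref[y][x] = (unsigned char)(x + y + 2);
        }
    printf("SAD = %u\n", sad_16x16(cur, ref));  /* 2 * 256 = 512 */
    return 0;
}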


Chapter 3

Single-ASIP implementation for multi-standard multimedia processing

3.1 Introduction

Large data volume is one of the most important features of multimedia applications. For high-definition (HD) video applications, if the resolution of the frames is 1920x1080 and the frame rate is 30 fps, about 93 MB/s of raw data has to be processed for the YUV 4:2:0 format. This is a great challenge for both transmitting and storing the data. Moreover, the requirement for higher resolution is growing, and ultra HD (UHD) has been proposed. For 4K UHD and 8K UHD, the data volume is 4 and 16 times that of HD video, respectively.

To make the data fit into the capacity of networks and storage devices, diverse effective compression algorithms have been proposed and applied. They employ advanced algorithms to gain a higher compression ratio, which results in a high implementation cost as well. There is always a trade-off between compression ratio and implementation complexity. However, since the computational capability of processors keeps increasing, more complex compression algorithms can be mapped to the processor. The result is that the data rate has been decreased significantly for the same quality; equivalently, the same data rate achieves much better quality.

There are mainly two organizations standardizing the video compression process: the ITU-T Video Coding Experts Group (VCEG) and the ISO/IEC Moving Picture Experts Group (MPEG), which developed the H.26x series and the MPEG series of standards, respectively. Figure 3.1 shows the performance of some of those standards, with compression ratios from [81]. Video coding has evolved from H.261, which only supports CIF (352x288) and QCIF (176x144), to the recently published H.265, which supports up to 8K UHD, and from MPEG-1, which could achieve a compression ratio of around 26:1, to MPEG-H Part 2 (H.265), whose compression ratio can reach around 400:1 [81].
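The raw-data-rate figures quoted above can be reproduced with a short sketch (not part of the original thesis): in YUV 4:2:0 each pixel costs 1.5 bytes on average (one luma sample plus a quarter of each chroma sample), so the uncompressed rate is width x height x 1.5 x frame rate.

#include <stdio.h>

/* Illustrative calculation of the uncompressed video data rate for
 * YUV 4:2:0 at a given resolution and frame rate (1 MB = 10^6 bytes). */
static double raw_rate_mbs(int w, int h, double fps) {
    return w * h * 1.5 * fps / 1e6;
}

int main(void) {
    printf("HD  1920x1080 @30fps: %7.1f MB/s\n", raw_rate_mbs(1920, 1080, 30));
    printf("4K  3840x2160 @30fps: %7.1f MB/s\n", raw_rate_mbs(3840, 2160, 30));
    printf("8K  7680x4320 @30fps: %7.1f MB/s\n", raw_rate_mbs(7680, 4320, 30));
    return 0;
}

The HD case evaluates to about 93 MB/s, and the 4K and 8K cases are exactly 4 and 16 times larger, matching the factors stated above.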


Figure 3.1: Video coding standards (relative compression ratio and supported picture resolution versus year of standardization)

Many video coding standards can thus be selected for different application scenarios, taking both the required performance and the implementation cost into consideration. There are also other popular video coding formats such as RealVideo, WMV, the VC series, the VP series, and so on. Various video formats exist simultaneously on the Internet. To process networked media, a flexible architecture supporting multi-standard multimedia processing is required. If ASICs are used for decoding multi-standard multimedia, a decoder for each standard has to be integrated [82, 83], which increases the implementation cost and hinders flexibility.

In this chapter, a dual-core ASIP with dedicated video processing accelerators is proposed and implemented to support multi-standard video decoding for network applications. High performance is guaranteed by the dedicated video processing accelerators on one hand, and the dual-core processor supplies the flexibility on the other hand. Working at full speed, the ASIP is capable of decoding real-time 1280x720@25fps MPEG-2/MPEG-4/RealVideo video streams at 216 MHz with a maximum power consumption of 414 mW, which can be lowered to 95 mW at a working frequency of 27 MHz for 352x288@25fps video decoding.

3.2 Design methodology

The design flow is shown in Figure 3.2, and the detailed description is as follows:


Figure 3.2: Design flow

• Application analysis: multimedia streams from the Internet need to be processed. There are many different formats for both audio and video. In this case, the MPEG-2/MPEG-4/RealVideo formats for HD video content are the targets.

• Function definition: the major functions for this application are network protocol processing, media data depacketization, multi-standard audio decoding and multi-standard video decoding.

• HW/SW partition: to make the architecture flexible, we use software to implement most of the required features, except the ones with extremely high performance demands. Hardware accelerators are needed for the variable length decoding, inverse quantization, inverse transformation, motion compensation and deblocking functions used in video decoding, as shown in Figure 3.3.


Figure 3.3: HW/SW partition

• HW/SW interface design: dedicated instructions are designed for the embedded processors to control the video accelerators.

• HW/SW implementation: the related software and hardware are implemented in parallel.

• HW/SW co-verification: during this phase, the whole system is verified for the application, and the maximum performance is tested. If the performance is not adequate for HD video decoding, adjustments are made between the software and the hardware to reach the required performance.

3.3 Video coding structure

To design suitable accelerators, the video coding algorithms are studied first. Most video coding standards employ block-based algorithms, which use pixel blocks of a certain size as the basic coding units. For example, 16x16-pixel blocks are commonly used in most of the standards, except the latest HEVC/H.265 standard, where blocks can be as large as 64x64 pixels. The coding structures of most standards are similar, as shown in Figure 3.4; the differences lie in the detailed implementation algorithms.

In the general structure of video compression, prediction is the most efficient technique to decrease the data volume. Instead of sending the raw data, only the residuals between the raw data and the predictions are put into the final stream. The prediction can be performed from the neighboring pixels in the same frame, which is known as intra-frame prediction or spatial prediction.


Figure 3.4: General structure of video coding

known as intra-frame prediction or spatial prediction. Considering the similarities between consecutive frames in a video sequence, the prediction can also be made by referring to previously coded frames, which is inter-frame prediction or temporal prediction. Furthermore, to get more accurate values for inter-frame prediction, motion estimation is performed between the coded frame and the current frame to find motion vectors, which indicate how the blocks move from the previous frame to the current one. The motion vectors help to achieve accurate prediction from the coded frames. To ensure that the decoder gets the same data as those at the encoder end, a decoder is included in the encoder structure, as shown in Figure 3.4 with dashed lines, and only the decoded frames are used as references for prediction.

The residuals after prediction are transformed from the spatial to the frequency domain to ease the further data compression in the quantization phase. The high-frequency part is quantized with a large step size because the human eye is less sensitive to high-frequency content. Finally, the quantized coefficients together with the motion vectors are coded using entropy coding and then formatted to compose the compressed stream.
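As a rough numerical illustration of such frequency-dependent quantization, a small Python sketch is given below. It is not the exact scheme of MPEG-2, MPEG-4 or RealVideo; the 8x8 block size, the step sizes and the distance-based frequency measure are assumptions made only for the example.

```python
import numpy as np

def quantize_block(coeffs, base_step=4.0, hf_scale=4.0):
    """Quantize an 8x8 block of transform coefficients, using a larger
    step size at high-frequency positions (coarser quantization where
    the eye is less sensitive)."""
    h, w = coeffs.shape
    # Distance from the DC (top-left) position as a crude frequency measure in [0, 1].
    freq = np.add.outer(np.arange(h), np.arange(w)) / (h + w - 2)
    step = base_step * (1.0 + (hf_scale - 1.0) * freq)   # larger step at high frequency
    return np.round(coeffs / step).astype(int), step

def dequantize_block(levels, step):
    """Decoder side: reconstruct approximate coefficients."""
    return levels * step

# Example: a synthetic coefficient block with a strong DC term.
block = np.zeros((8, 8)); block[0, 0] = 200.0; block[7, 7] = 30.0
levels, step = quantize_block(block)
recon = dequantize_block(levels, step)
print(levels[0, 0], levels[7, 7])   # the high-frequency coefficient is kept coarsely
```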

Figure 3.5: Accelerator-enriched architecture


Figure 3.6: Variety of data transferring

3.4 Accelerator-enriched architecture

To implement applications with requirements of high performance and energy efficiency, an accelerator-enriched architecture is proposed, as shown in Figure 3.5. The CPU group is responsible for high-level task parsing and processing. The application-specific function units (FU) together with the tightly coupled data memories (DM) provide high performance and energy efficiency for low-level data processing. They are controlled by the embedded processor (DSP or CPU) through application-specific instructions, and can work in parallel to maximize the utilization. The deployment of an instruction memory (IM) decouples the embedded processor from the FUs, and enables both parts to run independently. To avoid global data movement, neighboring FUs share a DM in between, and an FU can access both of its neighboring DMs, as shown in Figure 3.6(a). Furthermore, any two of the DMs can exchange local data managed by the DMA controller (DMAC), as shown in Figure 3.6(b). If data transfer with external memories is required, DMA operations are issued by the DMAC for fast data access, as shown in Figure 3.6(c).
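A minimal behavioural sketch of this memory arrangement is shown below; the class names, memory sizes and operations are illustrative assumptions, not the actual hardware interfaces.

```python
class DataMemory:
    """A local data memory (DM) shared by two neighbouring function units."""
    def __init__(self, size):
        self.words = [0] * size

class FunctionUnit:
    """A function unit (FU) that reads its left DM and writes its right DM,
    so results flow to the next FU without a global data move."""
    def __init__(self, name, left_dm, right_dm, op):
        self.name, self.left, self.right, self.op = name, left_dm, right_dm, op

    def process(self, n):
        for i in range(n):
            self.right.words[i] = self.op(self.left.words[i])

class DMAC:
    """DMA controller: copies data between any two DMs (or external memory)."""
    @staticmethod
    def transfer(src, dst, n):
        dst.words[:n] = src.words[:n]

# Example: two FUs chained through a shared DM, then a local DMA exchange.
dm0, dm1, dm2 = DataMemory(16), DataMemory(16), DataMemory(16)
dm0.words[:4] = [1, 2, 3, 4]
fu1 = FunctionUnit("FU1", dm0, dm1, lambda x: x * 2)
fu2 = FunctionUnit("FU2", dm1, dm2, lambda x: x + 1)
fu1.process(4); fu2.process(4)
DMAC.transfer(dm2, dm0, 4)
print(dm2.words[:4])   # [3, 5, 7, 9]
```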

3.5 Design instance

Based on the proposed architecture and analysis of the application, a design instance for networked multimedia processing is shown in Figure 3.7.


Figure 3.7: Design instance for networked multimedia processing

One of the RISC processors, HRISC, is utilized to run the operating system, manage network devices, configure the system environment, process network protocols, depacketize multimedia data, etc. The other RISC processor, MRISC, mainly works on the multimedia processing. Multi-standard audio decoding and low-definition multi-standard video decoding are implemented on the MRISC. For high-definition video decoding, the hardware video decoding accelerators are activated. In that case, MRISC analyzes the video stream, collects the necessary parameters and assigns tasks to the corresponding accelerators by writing customized instructions to the CMD cache. The video enhancement unit (VEU) consists of the VLD, IQ, IT, MC and DF modules, which are derived from the general decoder model in the last section and are responsible for high-definition video decoding. The VEU is designed by adding the individual specific features of MPEG-2/MPEG-4/RealVideo to the common structure. The VEU parses instructions from the CMD cache and performs operations on the data stored in the corresponding buffers. These instructions differ from normal instructions, which are generally simple arithmetic, logic, or data moving operations. For each module in the VEU, one instruction includes the required parameters and operations for a whole coding block. When processing starts, each module performs the corresponding operations for one coding block and writes the results to the proper memory space without requiring the processors to be involved. In order to make the proposed architecture capable of decoding high-definition videos featuring a huge data volume, an efficient data communication structure is necessary. In the proposed architecture, internal buffers are placed between the VEU modules to localize the video data and accelerate the computation. The data transfer between internal buffers and external memories is managed by the DMA module. Due to the deployment of internal buffers, the modules in the VEU are able to work in parallel in a pipelined way at the block level, as shown in Figure 3.8, and at most five blocks can be processed simultaneously by the VEU.

Figure 3.8: Video decoding pipeline
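To make the block-level pipelining described above concrete, a small scheduling sketch follows. Treating every VEU stage as taking one time slot per block is a simplifying assumption, not a measured property of the hardware.

```python
STAGES = ["VLD", "IQ", "IT", "MC", "DF"]

def pipeline_schedule(num_blocks):
    """Return, for each time slot, which block occupies which VEU stage.
    With five stages, at most five blocks are in flight simultaneously."""
    schedule = []
    for t in range(num_blocks + len(STAGES) - 1):
        slot = {}
        for s, stage in enumerate(STAGES):
            block = t - s
            if 0 <= block < num_blocks:
                slot[stage] = block
        schedule.append(slot)
    return schedule

for t, slot in enumerate(pipeline_schedule(6)):
    print(f"t={t}: " + ", ".join(f"{stage}<-blk{b}" for stage, b in slot.items()))
```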

Flexibility

The flexibility of the proposed architecture is three-fold:

• Dual-core processor: when high performance is not necessary, the VEU can be turned off, and two embedded RISC processors are fully programmable for various multimedia applications.

• Activity of VEU modules: the number of active modules in VEU can be adjusted for the specific performance requirement. The unnecessary modules can be turned off to save power.

• Working frequency of each module: the working frequency of the modules in the VEU, together with the two embedded processors, can be tuned according to the specific applications to provide adequate performance while keeping power consumption at a low level.

3.6 Implementation results

The ASIP is implemented in 0.13 µm CMOS technology with a gate count of around 5 million, and the die size is 6.4 mm x 6.4 mm, as shown in Figure 3.9 together with the power consumption. The maximum working frequency is 216 MHz, at which 1280x720 @ 25 fps MPEG-2/MPEG-4/RealVideo streams are decoded with a power consumption of 414 mW. Only 13% is consumed by the VEU, which shows the efficiency of the hardware accelerators. For low-resolution videos, such as CIF-sized pictures with a resolution of 352x288, the working frequency can be lowered to 27 MHz, at which point the power consumption is only 95 mW. The test system and application scenario are demonstrated in Figure 3.10.

Figure 3.9: Die photo and power consumption

NRISC32: network, UI; MRISC32: media streams parsing; VEU: MB-level video decoding

Figure 3.10: Test system and application scenario. (Adapted from [84])

3.7 Summary

To implement networked multi-standard multimedia processing applications, a dual-core ASIP with high performance and high flexibility is proposed and implemented. Highly specific accelerators provide the necessary performance. The generality of the accelerators enables adaptability to other similar applications. In addition, high flexibility is provided by the embedded processors. The balance between performance-to-power ratio and flexibility is made by software/hardware co-design.

[Figure: clock cycles needed for (a) MPEG-2 and (b) RealVideo decoding of test sequences at HD, SD, CIF and QCIF resolutions versus system frequency, with the frequency limit indicated]

Memory organization and data transmission structure are also key issues for applications with high throughput requirements. In this architecture, distributed internal buffers are deployed to increase the processing throughput, and DMA is implemented to speed up the data transmission between internal buffers and external memories. The buses connect memories and functional units, and the communication throughput is adequate for this application. But for applications with heavy communication load, a more efficient communication infrastructure will be explored in the next chapter.

Chapter 4

Multi-processor architecture exploration for multi-view video decoding

For applications with even higher performance requirements, multiple ASIPs can be connected to increase the performance while providing sufficient flexibility. This chapter explores multi-processor architectures for multi-view video processing, including design space exploration and implementation of the communication infrastructure.

4.1 Background

3D video applications are emerging in our daily life, such as 3DTV, 3D movies, 3D football matches, etc. To present 3D video, the pictures from two or more views have to be recorded, coded and transmitted. Compared with single-view video, the data volume and throughput are multiplied many times. Because similarities exist between pictures from different views, many algorithms have been proposed to compress multi-view video efficiently by making use of the correlations between different views [85]. As a result, the performance of decoders needs to be improved many times compared with that required for single-view video decoding. This poses challenges not only for the processing units but also for the communication architectures. Some sophisticated ASICs have already been reported for decoding MVC video in [21, 86, 87]. High-speed, high-throughput circuits can provide adequate performance for high-definition multi-view video decoding. However, the design complexity is considerable, and flexibility and scalability are limited. By exploiting parallelism and employing multi-processor architectures, the applications can be implemented in another way. We adopt this approach, which has several advantages:

• Low design complexity: the design problems are simplified to choosing a single ASIP with the proper performance, selecting the necessary number of ASIPs and connecting those processors with an efficient communication structure.

• High flexibility: these processors can be re-programmed to adapt to similar applications.

• High scalability: the on-demand performance can be provided for a particular application by adjusting the number of processors.

Multi-view video coding (MVC)

As Annex H of H.264 [32], MVC inherits all the sophisticated coding algorithms of H.264 for single-view video coding. Besides, inter-view prediction is utilized as a very efficient technique to obtain a large compression ratio [88]. For single-view video coding, inter-frame prediction uses temporally neighboring frames as the reference to predict the data in the current frame, which greatly increases the coding efficiency compared with techniques for still image encoding. In MVC, not only temporally neighboring frames but also spatially adjacent frames, i.e. frames from neighboring views, are utilized to predict the current frame. The merit is that coding efficiency is significantly increased. The drawback is the complicated coding structure, which is an obstacle for parallel processing. A typical prediction structure in one group of pictures (GOP) is shown in Figure 4.1 for eight views, with 64 pictures in total in one GOP. In the base view, encoding or decoding of the pictures does not rely on pictures from other views. A single-view decoder can process only the data in the base view, which keeps the bit stream compatible with normal single-view decoders. For anchor pictures, temporal prediction is not applied, and only inter-view prediction can be utilized. The purpose is to prevent error propagation between GOPs and to support random access among GOPs. For the other pictures, both temporal inter-frame prediction and spatial inter-view prediction are employed to reduce data redundancy. As a result, MVC streams are characterized by a larger number of pictures to be processed and complex dependences among the pictures. To handle MVC streams, several times the data processing throughput is required compared with single-view streams.

Network on Chip (NoC)

To connect multiple ASIPs, an efficient data communication infrastructure should be designed to provide sufficient throughput. In [50], multiple ASIPs are connected via a bus structure together with a shared memory for data exchanging, as well as a master core for managing the work flow and collecting the results. Burst mode and DMA operations are provided in the bus system for data transmission to reduce the possible contention. However, for applications with heavy data load,


Figure 4.1: Typical prediction structure; one square represents one picture, and arrows show the reference relations.


Figure 4.2: An example of 4x4 2D mesh NoC


Figure 4.3: Mapping multimedia applications to NoC; (a) pipelining processing phases, (b) tile processing

the non-sharable and non-scalable bus structure becomes the bottleneck in a multi-ASIP system. By connecting the ASIPs with an on-chip network, a network-on-chip (NoC) [89] can supply high communication bandwidth, high efficiency and scalability.

NoC architectures decouple the communication subsystem from the computation subsystem. Data communication is performed in a dedicated on-chip network, and the computation part only needs to interact with the network interface for data exchanging. Design problems related to the communication subsystem include the selection of topology, task assignment scheme, switching modes, flow control, routing algorithms, and so on [90]. In this work, we choose the 2D-mesh topology considering its intrinsic modularity and scalability. An example of a 4x4 2D-mesh network is shown in Figure 4.2. Besides, to implement NoCs for specific applications, design decisions for the computation part, such as choosing the processing capability of each node, selecting a suitable number of nodes, etc., also need to be taken into consideration. The computation capability and communication capacity should be balanced to achieve optimized overall performance for the whole system.

4.2 Design space exploration

There are two strategies for implementing multimedia applications on NoC architectures. One is partitioning the whole process into several phases, implementing different phases on different NoC nodes and pipelining those phases to gain performance improvement, as shown in Figure 4.3(a). The merit is the simple traffic pattern and deterministic data dependency between consecutive nodes in the network, while the drawback is the limited scalability, constrained by the partitioned processing phases. Examples of this mapping strategy can be found in [91, 92]. Another


Figure 4.4: Simplified system: (a) the whole system, (b) one unit, (c) work schedule with inadequate link bandwidth, (d) work schedule with excessive link bandwidth; Sd is the task size and td its dispatching time; Sc is the result size and tc its collecting time; tp is the processing time. (Adapted from [93])

implementation strategy is based on parallel tile processing, shown in Figure 4.3(b). Each tile is capable of dealing with the whole process, and the performance is increased by enabling concurrent tile processing. The scalability is extended at the cost of complex traffic patterns and a possibly large volume of data transferred among the tiles. Emphasizing scalability and flexibility, the second strategy is utilized for the following exploration.

Exploration for a simplified system

A theoretical analysis is first performed to formulate the exploration work flow. To ease the theoretical analysis, a simplified system as shown in Figure 4.4(a) is studied to explore the design space. The data dependency between processing tiles is eliminated, which means the processing tiles get all the necessary data from the task distribution node and send the results only to the converging node. Assuming that the task size (Sd) is larger than the result size (Sc), the relation


Figure 4.5: Work flow for the system exploration

of the system processing throughput (Θsp), single-node processing throughput (θp), physical link bandwidth (BWlink) and number of processing nodes (Np) has been theoretically analyzed in [93]. For the case of Sd smaller than Sc, a similar analysis can be applied. Combined with simulation, for a given target system processing throughput (Θsp), optimized configurations of θp, BWlink and Np can be explored through the procedure shown in Figure 4.5.

1. By performing a theoretical analysis as that in [93], the reference value of link bandwidth (BWlinkγ ) is calculated from the given value of target system processing throughput.

2. For each network with certain size (Np), based on BWlinkγ and the relation of BWlink versus θp and Np, the minimum required value of θp is acquired.


Figure 4.6: Hierarchical structure of MVC (Adapted from [94])

3. For each Np and θp, sweep BWlink around BWlinkγ and evaluate the configurations on the simulation platform.

4. If the simulated system processing throughput is larger than the given value, stop sweeping BWlink, change the network size and go to step 2 to find another group of BWlink, θp and Np. (A code sketch of this exploration loop is given after the list.)
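The sketch below follows the four steps above. The functions reference_bandwidth, min_node_throughput and simulate are hypothetical stand-ins for the theoretical model of [93] and the simulation platform; they are passed in as callables because their internals are not specified here.

```python
def explore(target_tp, net_sizes, bw_sweep_factors,
            reference_bandwidth, min_node_throughput, simulate):
    """Find (Np, theta_p, BW_link) configurations whose simulated system
    processing throughput reaches the target, following steps 1-4 above."""
    bw_ref = reference_bandwidth(target_tp)               # step 1: reference bandwidth
    configs = []
    for n_p in net_sizes:                                  # step 2: per network size
        theta_p = min_node_throughput(bw_ref, n_p)         # minimum single-node throughput
        for factor in bw_sweep_factors:                    # step 3: sweep around bw_ref
            bw = factor * bw_ref
            if simulate(n_p, theta_p, bw) >= target_tp:    # step 4: check simulated result
                configs.append((n_p, theta_p, bw))
                break                                      # stop sweeping, try next size
    return configs

# Usage (with user-supplied model/simulator callables):
#   explore(target_tp, net_sizes=[9, 16, 25], bw_sweep_factors=[0.8, 1.0, 1.2], ...)
```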

Exploration for MVC implementation

Based on the aforementioned analysis, the NoC system for MVC decoding is explored accordingly, and the target is to find optimized configurations of the system to implement full HD (1920x1080) MVC decoding at 30 fps (frames per second). Figure 4.6 shows the hierarchical structure of MVC streams. The basic processing unit is the macroblock (MB), which consists of 16x16 pixels. Neighboring MBs or MBs in neighboring frames may be utilized to generate the current MB. One picture is composed of a number of MBs. If only spatial prediction is employed, the picture is an I frame. If only forward pictures in the time domain can be used as the reference to predict the current picture, the picture is a P frame. If both forward and backward pictures in the time domain are selectable as reference pictures, the current picture is a B frame. Thus, at the picture level, decoding an I frame requires no data from other frames, while decoding P and B frames requires data from other frames. In MVC, pictures further constitute views, as introduced in the previous section. Decoding the stream of one view may need to refer to data from other views. On top of views, GOPs are processed independently of each other, and can be executed in parallel. Task assignment at different levels of MVC will generate diverse traffic patterns, and result in varied configurations of the system. If the task assignment is at the MB level, frequent data transfers lead to performance degradation. On the other hand, GOP-level task assignment needs large storage space in each node and causes a large processing delay as well. So the picture-level and view-level task assignments are studied for implementing MVC in the following sections.

Set index: 1, 2, 3, 4, 5, 6, 7, 8; number of pictures included: 1, 2, 5, 10, 12, 14, 12, 8

Figure 4.7: Picture-level parallelism for the typical MVC prediction structure (Adapted from [94])

Picture-level task assignment

Data dependences exist not only between pictures of the same view, but also between pictures in different views. However, by analyzing the dependences of pictures in one GOP, the pictures can be subdivided into several sets, and in each set the pictures are independent of each other [95]. The pictures with a larger set index may need the pictures with a smaller index as the reference, as shown in Figure 4.7. To process each set, the dispatching node sends all the necessary data to a processing node for single-picture decoding. Results are collected by the converging node. Only after all the pictures with a smaller set index are processed can the tasks for processing the pictures with a larger set index be issued. A shared memory is deployed between the dispatching and converging nodes. A possible data flow for picture-level task assignment is depicted in Figure 4.8. Picture-level task assignment in this work only issues tasks of decoding pictures in the same set to processing nodes, and transferring data between processing nodes is avoided. In this case, the exploration algorithm proposed for the simplified system can be directly utilized for the analysis. After the exploration in [94], possible configurations of the system are summarized in Figure 4.9. Both 3x3 and 4x4 mesh NoCs can be selected, depending on the physical link bandwidth and single-node performance. With 3x3 networks, the minimum performance requirement for a single node is the capability of processing a full HD single-view stream at 50 fps, corresponding


Figure 4.8: A possible data flow of core (1,3) for the picture-level parallel processing (Adapted from [94])


Figure 4.9: Design options for picture-level task assignment; Γ for network size, Φp for single-node performance and BL for link bandwidth (Adapted from [94])

to a link bandwidth of 78.6 Gbps. If the link bandwidth cannot reach 78.6 Gbps, the minimum requirement is 38.4 Gbps. In that case, the configuration can be either a small network (3x3) with high single-node performance (60 fps) or a large network (4x4) with relatively low single-node performance (50 fps). With 4x4 networks, the requirement for single-node performance can be lowered to 40 fps. It means that, if one processor was initially designed to implement single-view full HD


Figure 4.10: A possible data flow of core (1,3) for the view-level parallel processing (Adapted from [94])

decoding at 30 fps, then supporting eight-view MVC decoding requires only a 33% performance improvement.
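A simplified sketch of the set-ordered dispatching described above follows. The per-set picture counts are those listed for Figure 4.7, while the round-robin node assignment and the print-based barrier are simplifying assumptions rather than the scheme used on the real platform.

```python
PICTURES_PER_SET = {1: 1, 2: 2, 3: 5, 4: 10, 5: 12, 6: 14, 7: 12, 8: 8}

def dispatch_gop(num_nodes):
    """Issue picture-decoding tasks set by set: pictures inside one set are
    independent and run in parallel, but a set is issued only after all sets
    with a smaller index have been collected by the converging node."""
    for set_idx in sorted(PICTURES_PER_SET):
        tasks = [(set_idx, pic) for pic in range(PICTURES_PER_SET[set_idx])]
        # Round-robin the independent pictures of this set over the processing nodes.
        for i, task in enumerate(tasks):
            node = i % num_nodes
            print(f"node {node}: decode picture {task}")
        # Barrier: wait for the whole set before issuing the next one.
        print(f"-- set {set_idx} converged --")

dispatch_gop(num_nodes=9)   # e.g. a 3x3 mesh
```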

View-level task assignment

The view-level task assignment scheme issues single-view decoding tasks to each processing node. Considering the dependences between views, data exchange among the processing nodes is necessary. Figure 4.10 shows an example of the data flow for decoding view 2 on node (1,3). The task dispatching node sends coded streams to node (1,3) for decoding, and meanwhile, data from node (1,0) for view 0 are also sent to node (1,3) as the reference. Furthermore, decoded streams need to be sent to the converging node for result collection and to nodes (0,1) and (3,1) as the reference for view 1 and view 3. The view-level task assignment scheme complicates the traffic pattern in the network. However, the exploration algorithm for the simplified system can still be utilized as a preliminary analysis. After that, by tuning the single-node performance and physical link bandwidth, optimized configurations of the NoC system can be obtained from the simulation platform to meet the performance requirement. Feasible configurations of the system using the view-level task assignment scheme are shown in Figure 4.11. For eight-view MVC decoding, if only one GOP can be processed at a time, at most eight processing nodes are needed, which is limited by the view number. In this case, increasing the network size will not help to increase the system performance. This explains the fact that options 1 and 3, as well as options 2 and 4, have the same requirement for single-node performance and


Figure 4.11: Design options for view-level task assignment; Γ for network size, Φp for single-node performance and BL for link bandwidth; n-GOP means n GOPs can be processed at the same time (Adapted from [94])

link bandwidth, even though the network size is different. To utilize more processing nodes, some GOPs should be processed in parallel. Option 7 shows the case of three GOPs being processed simultaneously with a 5x5 network, when the requirement for single-node performance is only 20 fps for single-view stream processing and the link bandwidth is 9.6 Gbps. However, since introducing GOP-level parallelism increases the implementation cost and the processing delay, the maximum number of GOPs processed in parallel should be constrained to a small value.

4.3 Communication network

Besides the network topology, the selection of switching technique also has a large effect on both the performance and the implementation cost. Circuit switching (CS) and wormhole switching (WS) are two commonly used techniques for on-chip networks. Circuit switching reserves the whole end-to-end path before starting to send data, and the data transmission uses a dedicated link without sharing it with other transmissions. The advantages are the lightweight architecture and small transmission delay. The disadvantages are the large link setup delay and lower network utilization. In contrast, in wormhole switching a data transmission is divided into many flits and starts immediately once there is any free buffer in the downstream node. A header flit contains the routing information, and body flits only need to follow the header flit to reach the destination node. To increase the utilization of the network, virtual channels (VC) can be deployed to share the physical channels. Different data transmissions can use different virtual channels but share one physical channel. The merits are the zero setup delay and higher network utilization. The drawbacks are the uncertain transmission delay and the relatively high implementation cost.
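A first-order latency model that reproduces this trade-off qualitatively is sketched below. The per-hop setup cost and the higher effective per-flit cost assumed for WS are illustrative knobs only, not values taken from the thesis or from [96]; whether WS really pays a higher per-flit cost depends on the router implementation.

```python
def cs_latency(hops, flits, setup_per_hop=10, cycles_per_flit=1.0):
    """Circuit switching: a one-time end-to-end path setup (forward request
    plus backward acknowledgement), then the message streams at link rate."""
    return 2 * hops * setup_per_hop + flits * cycles_per_flit

def ws_latency(hops, flits, route_per_hop=3, cycles_per_flit=1.2):
    """Wormhole switching: no setup phase; the header is routed hop by hop and
    body flits pipeline behind it, here with an assumed per-flit overhead."""
    return hops * route_per_hop + flits * cycles_per_flit

# With these assumed numbers, small messages favour WS and very large ones CS,
# and the crossover moves to larger messages as the hop count grows.
for flits in (10, 100, 1000):
    print(flits, cs_latency(hops=4, flits=flits), ws_latency(hops=4, flits=flits))
```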


Figure 4.12: Router structures of (a) circuit switching and (b) wormhole switching (Adapted from [97])

Circuit switching is suitable for applications with large, infrequent data transmissions; otherwise, wormhole switching should be used [96]. For a specific application scenario, the decision on the suitable switching technique still needs to be made by evaluating performance and cost. In this case, the analysis is performed for MVC decoding in [97]. The simulation router models of CS and WS used for evaluating the performance and cost are shown in Figure 4.12. Five pairs of in/out ports connect four neighboring routers with the local processor. In the CS router, any one of the output ports can only be linked to one input port during one transmission. For the WS router, a number of VCs share one input port, and one output port can be associated with several VCs during data transfer. To find the suitable switching technique for MVC decoding, the optimized configuration of WS in terms of link bandwidth, buffer depth and number of VCs is first explored. Then, the optimized structure of WS is compared with CS on decoding speed, network utilization and delay for decoding MVC streams with the same NoC topology. The results indicate that employing CS and WS achieves similar performance for MVC decoding. Considering its simpler architecture, which results in lower implementation cost, CS should be used when implementing MVC decoding with NoC architectures.

4.4 Summary

By connecting multiple ASIPs in an NoC manner, the system provides high performance together with scalability and flexibility. To implement MVC decoding with a single ASIP, the performance has to be improved by a factor determined by the number of views supported. With multi-processor NoC architectures, the required performance improvement can be much smaller, both with picture-level and view-level task assignment schemes. Meanwhile, to gain the necessary performance for the whole system, design options including single-node performance, network size, link bandwidth and task assignment schemes are balanced. If the single-node performance is not enough, it can be compensated by choosing a suitable task assignment scheme, network size and link bandwidth, which is also applicable when the network size or link bandwidth is inadequate. For the communication structure, circuit switching rather than wormhole switching is appropriate for MVC decoding, considering their similar performance and the lightweight architecture of CS.


Chapter 5

Customizable tile-ASIP implementation for Internet of Things

In this chapter, we focus on the implementation of a low-power customizable ASIP for the Internet of Things (IoT). In the vision of IoT, everything will be connected and can talk to each other [98, 99]. The ubiquitous sensing, processing and communication should be seamlessly performed by the attached devices, which demands hardware architectures with characteristics like a small memory requirement, flexibility and adaptability to various devices with diverse performance demands, and low power consumption but sufficient computational performance for functions such as digital signal processing, communication protocol processing, encryption, etc.

5.1 Background

Modular design

In IoT applications, the demanded processor performance varies greatly across devices, from sensor nodes to servers, and across the vast number of application scenarios. The straightforward solution is to design and implement a specific processor for each application scenario. However, the design, verification and mask costs are extremely high and become the obstacles for this solution [100]. A modular design, instead, utilizes mature extensible modules to compose a specific system for the target performance and cost. A suitable number of modules and proper connections can be selected after analyzing the application. The implementation costs are greatly decreased. The challenge, as a result, is the implementation of such extensible and efficient modules. The hardware-level composition of the system should be transparent to the high-level programming, and the performance and power efficiency should also be close to those of a single-processor implementation.

The concept of “Lego Chip” is proposed in [100]. In this concept, a system including two or more chips behaves like a single SoC. Diverse but mature Lego Chips are seamlessly connected via the highly efficient Lego Interconnects to achieve the target performance. The engineering cost can be dramatically decreased by using that technique. The XMOS architecture and its implementation are introduced in [101]. Multiple processors can be connected with point-to-point communication links, which allows the system to be built on a multiprocessor chip, in packages, on boards or as a distributed system. A number of threads can be directly mapped to the hardware resources in each core and controlled by a hardware scheduler. The event-driven threads are able to reduce power when waiting for events. For constructing the multi-core system, each core has its global address and can communicate with the others through control and data packets. Networks-in-package are implemented in [102] by connecting four NoC chips with off-chip links in one package to demonstrate the scalability and modularity of the system. For each NoC chip, heterogeneous function units are integrated together with the packet-switched communication infrastructure. If higher performance is required, a suitable number of NoC dies can be clustered to construct the networks-in-package. This alleviates the problems of very large scale circuits such as low yield and high cost.

Processor architecture

When the computer was first introduced, the memory size was limited and the compilers were very simple. Many instructions which could perform various complex functions were designed together with the necessary hardware units. Those instructions constitute the basic instruction set architecture (ISA), based on which applications could be implemented. In order to make the memory requirement smaller and the programming easier, abundant instructions were included in the ISA for various functionalities. In this case, due to the complex operations implemented by different instructions, multiple cycles are needed to execute one instruction and the execution time for different instructions varies, which makes the execution of instructions difficult to pipeline. Compared with the aforementioned design methodology, which is called complex instruction set computer (CISC), the reduced instruction set computer (RISC) simplifies the functionality of instructions. Most of the instructions finish in a single cycle. The hardware design is simplified. Complex functions are implemented by combining the simple instructions in a suitable way, which is done by a compiler. The memory requirement for storing executable code is thus larger and the code density is lower in RISC [103]. With the rapid development of semiconductor technology, the speed and number of transistors have greatly increased for single chips. The memory space for storing code is not a problem any more. By pipelining the execution of instructions, the speed of RISCs can be very fast due to the simple hardware architecture. Caches are deployed to fill the gap between the extremely fast executing core logic and the

relatively low-speed memories. Owing to the simple hardware architecture and fast execution speed, RISCs became popular for embedded applications. For the IoT applications in this case, lightweight architectures with low power consumption and high energy efficiency are highly demanded. Memories account for a large proportion of the power consumption [104]. Small memories should be used considering both the area and the power consumption. The utilization of hardware units should be exploited as much as possible. Meanwhile, due to the variety of application scenarios in IoT, reconfigurability and scalability are also of importance for IoT processors.

In this work, a reconfigurable and scalable control-centric processor with an integrated multi-mode router is proposed and implemented as one tile for possible modular extension. The router supports circuit and wormhole switching simultaneously to fit the different traffic patterns of various application scenarios. The techniques that ease modular extension also apply to the multi-mode router. Aiming at high energy efficiency, the processor simplifies the control logic, and the basic functional units are directly managed by the sequence mapping table (SMT) saved in internal storage of the processor. Decoding hardware is thus avoided. By designing the corresponding SMT, different ISAs can be implemented on the processor. Code density is increased through optimizing the SMT for specific applications. Dedicated SMTs can even be mapped to the application directly when extremely high energy efficiency is demanded.

5.2 Multi-mode router

NoC routers that utilize both circuit and wormhole switching to improve communication capacity are popular in many studies [105, 106, 107, 108, 109, 110, 111, 112]. The main purpose of those studies is to combine the merits of the two switching techniques to increase the transmission speed and lower the latency, but not to fit the diverse traffic patterns by adaptively selecting the proper switching technique for particular application scenarios, which is emphasized in this work for the modular design. The proposed router architecture is shown in Figure 5.1. Circuit switching (CS) and wormhole switching (WS) utilize common physical channels and control logic. The use of a specific switching mode is determined by the header sent from the source node in the network. A header in CS will set up an end-to-end link between the source and destination nodes, while a header in WS allocates one available virtual channel (VC) for the following data packets. A maximum of four virtual channels can share one physical channel if it is not occupied by circuit-switched messages. A circuit-switched message will preempt a physical channel being used for WS. As a result, each physical channel supports dynamic and free switching between CS and WS for different application scenarios.

CR: registers for circuit switching; SS: switching mode selection; RC: routing computation; VA: virtual channel allocation; RR: round-robin

Figure 5.1: Proposed multi-mode router (Adapted from [113])

Necessity

As described in the last chapter, CS is suitable for large, infrequent data transmissions, and applying WS is appropriate in other cases. However, "large" or "small" is relative to the hop count and network status of the transmission. Even if the data are "large" for one transmission, they might be "small" for another transmission. Figure 5.2 gives the simulation results of the maximum throughput with only CS or WS mode activated for a 10x10 mesh NoC when the maximum hop count of each transmission is restricted to 2, 4 and 6. When the hop count is increased, the performance degradation of CS is much larger than that of WS. This means that CS is more sensitive to the distance of data transmission. For the message size labeled "P1" in Figure 5.2, if the maximum hop count is 2, employing CS is suitable. However, when increasing the maximum hop count to 4 and 6, utilizing WS is appropriate. This shows the importance of the capability to dynamically adjust the switching mode to actual traffic patterns. In some application scenarios, "large" and "small" data transmissions need to be handled at the same time. In that case, using only one switching technique will restrict the performance of the network. The previous 10x10 mesh NoC is still used as an example, and a maximum hop count of 4 is applied. Assume that there are two types of data messages in one application at the same time, 10-flit messages (message size of 320 bits in Figure 5.2) as "small" data and 100-flit messages (message size of 3200 bits in Figure 5.2) as "large" data, and they

Figure 5.2: The maximum throughput with only CS mode (CSM) or WS mode (WSM) activated; Hn means that the maximum hop count is n; S1 and S2 are the message sizes corresponding to the throughput cross points of the CSM and WSM for H2 and H4, respectively (Adapted from [113])

Figure 5.3: The comparison of throughput and latency for 10-flit messages and 100-flit messages with only CSM, only WSM and mixed CSM/WSM (Adapted from [113])


Figure 5.4: Areas for routers with buffer depth of 8, 16 and 32, and link width from 8 to 64 bits

are generated with the same probability. Three possible strategies can be utilized for this application: CS for all messages, WS for all messages, and mixed CS/WS chosen adaptively according to message size. The simulation results for throughput and latency are shown in Figure 5.3. The mixed mode provides both higher performance and lower latency. The previous two situations show the necessity of supporting CS and WS simultaneously in one router when various traffic patterns may be present on the network.
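A sketch of such a per-message mode decision follows. The crossover sizes in the table are invented stand-ins for points like S1 and S2 obtained from simulations such as Figure 5.2, not the actual measured values.

```python
# Hypothetical crossover message sizes (bits) above which CS outperforms WS,
# indexed by the hop count of the transmission (longer paths favour WS).
CS_CROSSOVER_BITS = {1: 200, 2: 400, 3: 900, 4: 1800}

def choose_mode(message_bits, hops):
    """Pick the switching mode per message: circuit switching for large,
    short-distance transfers, wormhole switching otherwise."""
    threshold = CS_CROSSOVER_BITS.get(hops)
    if threshold is not None and message_bits >= threshold:
        return "CS"
    return "WS"

print(choose_mode(3200, 2))   # large message, short path  -> CS
print(choose_mode(320, 4))    # small message, long path   -> WS
```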

Implementation

The proposed router is implemented with parameterized buffer depth of the VCs and link width of the physical channels. The areas for routers with buffer depths of 8, 16 and 32 and link widths from 8 to 64 bits are given in Figure 5.4 after synthesis in a 65 nm low leakage CMOS process. The areas almost double when the buffer depth or link width is doubled. The corresponding throughput and power efficiency are shown in Figure 5.5, under the conditions of a 10x10 network, 40-flit messages, a maximum hop count of 4, and CS for hop counts not larger than 2, otherwise WS. Increasing the buffer depth improves performance but decreases power efficiency. Because of that, a buffer depth of 8 is selected in this work. Nevertheless, increasing the link width increases both the performance and the power efficiency. As a proof of concept, one tile, including the processors and the proposed router with buffer depth of 8 and link width of 8 bits, has been fabricated in a 65 nm low leakage CMOS process. Figure 5.6 depicts the complete structure of the implemented

50 14000 180

12000 160 140

(Gbps) 10000  120 8Ͳflit_OT (Gbps/W)  8000 100 16 Ͳflit_OT 6000 80 32 Ͳflit_OT throughput 

60 efficiency 4000  8Ͳflit_PE 40 16 Ͳflit_PE

Figure 5.5: Overall throughput (OT) and power efficiency (PE) for buffer depths of 8, 16 and 32 flits with flit width from 8 to 64 bits (Adapted from [113])


Figure 5.6: Complete structure of the on-chip multi-mode router in one tile (Adapted from [114])

router. The processor interface exchanges data messages between the processor and the multi-mode router. The link interface connecting the external links with the internal router also performs serial-to-parallel and parallel-to-serial signal conversion to reduce the required pins on one tile, thus decreasing the area and power consumption of the physical chip.

5.3 Control-centric processor

The proposed architecture of the customizable ASIP is shown in Figure 5.7. A dual-core reconfigurable control-centric processor is integrated together with the


Figure 5.7: Architecture overview

multi-mode router, internal memories, power management unit, clock generation unit, debug interfaces, and other peripherals.

Control-centric architecture

In each core (CORE_L or CORE_R), the functional units and data paths are directly manipulated by a control word (CW) without internal interpretation. The result is simplified control logic and the capability of parallel execution of all functional units. The structure of one CW is shown in Figure 5.8. One CW is partitioned into several sections, each of which contains the necessary control bits for the corresponding functional units or data paths. CWs defining specific functions constitute SMTs, which are saved either in CROM for the fixed configurations of the processor or in CRAM to provide reconfigurability. SMTs can be customized to support general-purpose ISAs or application-specific instructions, and even to directly implement a particular application, as shown in Figure 5.9 for the application mapping schemes. The comparison of energy efficiency and complexity for those schemes is also presented in that figure. The implementations with general-purpose ISAs gain low design complexity, while dedicated SMTs for specific applications achieve higher energy efficiency. When the processor is used to execute instructions, the fetching (IF), decoding (ID) and executing (EX) processes are all carried out by the SMT with the same

Figure 5.8: Structure of one control word; the sections control sequencing, the ALU, the MAC and shifter, internal and external memory access, and other units
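To illustrate how such a control word could directly drive the functional units, a hypothetical encoding is sketched below. The field names, their order and the dispatch interface are assumptions made for the example, not the fabricated chip's format.

```python
from dataclasses import dataclass

@dataclass
class ControlWord:
    """One CW split into sections; each section drives one unit directly,
    so all units can act in parallel in the same execution cycle."""
    seq_ctrl:  int   # sequencing: selection of the next CW (branching)
    alu_op:    int   # ALU operation select
    mac_shift: int   # MAC and shifter control
    int_mem:   int   # internal memory access control
    ext_mem:   int   # external memory access control
    misc:      int   # other units (I/O, router interface, ...)

def apply(cw, units):
    """Dispatch every section of the CW to its unit in the same cycle,
    with no instruction decoding step in between."""
    units["seq"](cw.seq_ctrl)
    units["alu"](cw.alu_op)
    units["mac"](cw.mac_shift)
    units["imem"](cw.int_mem)
    units["emem"](cw.ext_mem)
    units["misc"](cw.misc)

# An SMT is then simply an indexed table of such control words:
smt = [ControlWord(0, 1, 0, 2, 0, 0), ControlWord(1, 3, 1, 0, 0, 0)]
```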


Figure 5.9: Mapping from applications to the processor (Adapted from [114])


Figure 5.10: Instructions processing in the proposed processor compared with other data-centric processors

hardware units, compared with data-centric processors which employ dedicated hardware units for instruction fetching (IF), decoding (ID) and executing (EX), respectively, as described in Figure 5.10. For data-centric processors, pipelines are

normally utilized to increase the processing throughput. Extra hardware resources are required to support pipelining of the instructions, since the successive instruction starts to be processed before the processing of the current one is finished. In contrast, a control-centric processor changes the functions of the same hardware units with different configurations to perform the corresponding operations. The hardware can thus be utilized to a high extent to achieve high energy efficiency.

Low power considerations

The low power design considerations are listed below:

• Reconfigurable architecture: the reprogrammable SMT supports both general-purpose and application-specific instructions. Therefore, dense software programs can be generated. The results are a small memory requirement and a low access frequency to the memory, which both lower the power consumption.

• On-chip memories: the on-chip ROM and RAM are used both for storing SMTs and for application programs. For many IoT applications, the on-chip memories are enough for the programs, and the power consumed on the pins for connecting external memories can be saved.

• Power management: different working modes are designed to provide proper performance for adapting to different scenarios. The activity and working frequency of each core and of the peripherals can be dynamically adjusted following changes in the workload.

High performance

The two cores are equipped with the same kinds of computational units. Besides the general-purpose ALU, a multiply-accumulate unit (MAC) together with a shifter is also integrated in each core to increase the performance for digital signal processing. Moreover, all the function units can be activated simultaneously and work in parallel, controlled by the SMT. By organizing the function units in a suitable way, hardware resources are utilized efficiently and high performance can be achieved. For each single core, to keep the critical path short and guarantee sufficient execution time for each control word, one execution cycle consists of two clock cycles. The address of the SMT memory is given in the first cycle, while the data are available in the next cycle, as shown in Figure 5.11(a). By connecting the two cores to the same memories and letting them access the memories alternately, the memories are utilized efficiently and the performance is doubled without the need for arbiters between the two cores, as shown in Figure 5.11(b).
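A toy model of this interleaved timing is given below. It is purely illustrative and only reproduces the address/data phase alternation of Figure 5.11, not the real memory interface.

```python
def dual_core_timing(num_cycles):
    """Print which phase (Address or Data) each core is in per clock cycle.
    CORE_R is shifted by one clock relative to CORE_L, so the shared memory
    receives an address every clock cycle and no arbiter is needed."""
    for clk in range(num_cycles):
        core_l = "Address" if clk % 2 == 0 else "Data"
        core_r = "Data"    if clk % 2 == 0 else "Address"
        print(f"clk {clk}: CORE_L={core_l:7s}  CORE_R={core_r}")

dual_core_timing(6)
```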


Figure 5.11: Control word timing

Layer 5: application. Layer 4: high-level language coding (C, Java, etc.). Layer 3: machine instructions / assembly level. Layer 2: control words / SMT coding. Layer 1: CORE_L, CORE_R, router.

Figure 5.12: Hierarchical implementation structure

Flexibility

For applications to be implemented on the proposed architecture, the hierarchical flexible structure is shown in Figure 5.12. The application-layer flexibility is realized by providing a certain number of predefined common functions to support rapid mapping from applications to the processor. Besides, general-purpose programming is applicable in layers 3 and 4 by using high-level languages or assembly code. For some special applications with critical timing or power requirements, the re-programmability in layer 2, which is the SMT coding layer, can be exploited. As for the flexibility in layer 1, i.e. the hardware layer, the number of active cores is adjustable for different applications. If the performance of two cores is not sufficient, more cores can work together by being connected through the integrated multi-mode router to construct a network. The memory structure is also flexible. For applications with a small memory requirement,

Technology: UMC 65 nm LL
Gate count: 1.8 million
Die size: 1.875 mm x 1.875 mm
Memories: 80 KB + 256 KB ROM; 40 KB + 160 KB SRAM
Frequency: 350 MHz @ 1.2 V
Power consumption: 22 mW (6.9 mW per core)
Efficiency: 50.7 GOPS/W per core
Switching mode: circuit/wormhole
Pad number: 112

Gate count breakdown: memory 92.59%; the remaining 7.41% is shared by CORE_L, CORE_R (0.77%), peripherals, the router and other logic.

Figure 5.13: Die photo together with gate count breakdown

internal memories are enough. If more memory is required, external SDRAM can be connected to the processor. The internal and external memories use a uniform memory space, which is transparent to higher-level programming. Besides, the pins for the SDRAM interface are shared with general-purpose I/O pins, and the functions of those pins are configurable depending on the connectivity of external memories.

5.4 Tile implementation

The customizable tile-ASIP including dual cores and the multi-mode router has been fabricated in 65 nm low leakage CMOS process. The die photo and gate count breakdown are presented in Figure 5.13. Most of the gates are used by the memories. Adding the second core (CORE_R) contributes only 0.77% to the gates, but doubles the performance of the processor. The maximum working frequency is 350 MHz with power consumption of 22 mW when all the hardware resources are activated. If one core is turned off, the power consumption is decreased to 15 mW. In order to keep the status of the processor when all clock sources are shut off, 0.74

56                                                               

                                       

Power consumption under different working modes

Working mode    Power consumption*
Dual cores      22 mW
Single core     15 mW
Standby         8.4 mW
Halt            0.74 mW

*with clock frequency 350 MHz, core voltage 1.2 V.


Figure 5.14: Power consumption under different working modes and conditions

mW is needed. The power consumption for different configurations is shown in Figure 5.14. Based on this figure, an appropriate working strategy can be selected to achieve the on-demand performance while keeping the power consumption as low as possible. With the proposed ASIPs, on-board networks with packaged chips and in-package networks with bare dies are feasible to further increase the performance of the whole system. Such configurations are demonstrated in Figure 5.15. A maximum of 256 tiles can be connected, with a single-link bandwidth of up to 1.4 Gbps.
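As a usage illustration of these working modes, a small average-power estimate for a duty-cycled node is sketched below. The mode powers are taken from the table above, while the duty-cycle fractions are invented for the example.

```python
# Measured mode powers at 350 MHz, 1.2 V (from the table above), in mW.
MODE_POWER_MW = {"dual": 22.0, "single": 15.0, "standby": 8.4, "halt": 0.74}

def average_power(schedule):
    """schedule: list of (mode, fraction_of_time); fractions should sum to 1."""
    return sum(MODE_POWER_MW[mode] * frac for mode, frac in schedule)

# Hypothetical IoT duty cycle: mostly halted, with brief single-core bursts.
print(round(average_power([("single", 0.02), ("standby", 0.08), ("halt", 0.90)]), 2), "mW")
```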

5.5 Summary

This chapter discussed the design and implementation of a customizable ASIP for low power applications, which demand high energy efficiency. To achieve that, a multi-mode router and a control-centric processor are integrated to provide on-demand performance. By utilizing the multi-mode router, diverse inter-silicon networks can be constructed to meet the performance requirement of specific applications. With the control-centric processor, the control logic is simplified, and hardware resources are directly operated by the control words of the SMT inside the processor. By properly orchestrating the SMTs, both general-purpose ISAs and application-specific instructions can be mapped to the processor. Furthermore, a customized SMT can dedicate all resources of the processor to a particular application if required. Finally, the tile-ASIP is fabricated in a 65 nm low leakage CMOS process, and different working strategies are proposed for various application scenarios.


[Figure 5.15 also shows link signal traces: serial clock, FDC (forward ctl.), FDD0 (forward data0) and BDC (backward ctl.), sending a header in circuit switching and sending data in wormhole switching.]

Figure 5.15: Scalable network on a board and network in a package for different application scenarios (Adapted from [114])

Chapter 6

Conclusions

6.1 Thesis summary

This dissertation explores the design and implementation of energy-efficient ASIPs for ubiquitous sensing and computing, covering both high-throughput compute-intensive multimedia processing and low-throughput low-cost processing for IoT scenarios.

For high-throughput processing of networked multi-standard multimedia, single ASIPs integrating general-purpose processors with application-specific accelerators are proposed and implemented. Control-intensive tasks such as protocol processing, depacketizing and stream parsing are better suited to the embedded processor, while the dedicated accelerators are efficient for the computation-intensive functions. The instruction-controlled accelerators and the general-purpose processor also provide the flexibility to support both existing and future standards. The chip has been fabricated, and a test system has been built to handle networked multimedia streams.

Multiple ASIPs are connected with an on-chip network to gain even higher performance and flexibility for applications such as multi-view video decoding. The design problems, including the selection of the number of ASIPs, the required performance of a single ASIP, the link bandwidth of the network and the task assignment schemes, have been studied for high-definition eight-view MVC decoding. To use multi-ASIP architectures, parallel processing at the proper level has been explored. Both picture-level and view-level task assignment schemes are proposed, and the corresponding communication and computation subsystems are analyzed to obtain the design options for the whole system. The results show that a limitation in one factor can be compensated by enhancing the others, provided that the minimum requirement for each factor is guaranteed.
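The trade-off stated above can be illustrated with a back-of-the-envelope feasibility check. All numbers in the C sketch below are placeholders rather than results from the MVC study; it only demonstrates how a shortfall in one factor (per-ASIP throughput, number of ASIPs, or link bandwidth) can be compensated by the others once each factor meets its own minimum.

    #include <stdbool.h>
    #include <stdio.h>

    typedef struct {
        double asip_mbps;   /* decoding throughput of one ASIP, Mbit/s */
        int    num_asips;   /* number of ASIPs connected to the NoC    */
        double link_mbps;   /* bandwidth of a single NoC link, Mbit/s  */
    } design_point_t;

    /* A design point is taken as feasible when the aggregate computation covers
     * the stream and each link can carry its share of the traffic (assumed here
     * to be an even split across the ASIPs). */
    static bool feasible(design_point_t d, double stream_mbps)
    {
        bool enough_compute = d.asip_mbps * (double)d.num_asips >= stream_mbps;
        bool enough_links   = d.link_mbps >= stream_mbps / (double)d.num_asips;
        return enough_compute && enough_links;
    }

    int main(void)
    {
        const double stream = 400.0;             /* hypothetical 8-view HD stream */
        design_point_t a = { 60.0, 8, 60.0 };    /* many modest ASIPs             */
        design_point_t b = { 220.0, 2, 120.0 };  /* few strong ASIPs, thin links  */
        design_point_t c = { 220.0, 2, 210.0 };  /* few strong ASIPs, fat links   */
        printf("A: %d  B: %d  C: %d\n", feasible(a, stream),
               feasible(b, stream), feasible(c, stream)); /* prints A: 1  B: 0  C: 1 */
        return 0;
    }

In this toy example, stronger links (point C) compensate for having fewer ASIPs, whereas point B fails because one factor stays below its minimum, mirroring the conclusion above.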

A tile-ASIP is designed and implemented for applications with emphasis on low power consumption, high energy efficiency and high adaptability. The integrated multi-mode router supports both circuit and wormhole switching simultaneously, and the two schemes can be dynamically selected to adapt to the actual traffic pattern. The embedded processor core utilizes control words to directly operate the functional units, avoiding complex control logic and thereby increasing the hardware utilization. The control words compose the sequence mapping table, which is stored in the on-chip memories. By creating the proper tables, general-purpose ISAs, application-specific instructions and mappings dedicated to particular applications can be implemented by the processor. Dual processor cores are deployed sharing both the control and data storage, which doubles the performance and further improves the energy efficiency. Thanks to the integration of the multi-mode router, a suitable number of tile-ASIPs can be connected with the inter-silicon network to provide on-demand performance for diverse application scenarios.

To conclude, different strategies for designing ASIPs are employed for diverse application scenarios in ubiquitous sensing and computing. Application-specific accelerators effectively increase the performance at low power consumption, but they carry a high design and verification cost together with relatively low flexibility. They are therefore best used in ASIPs to which only similar applications will be mapped; multi-standard multimedia processing is one example. When applications demand even higher performance, connecting existing high-performance ASIPs into a multi-ASIP architecture is a feasible solution instead of further increasing the single-ASIP performance. In contrast, the tile-ASIP increases energy efficiency by improving hardware utilization. As a result of the reconfigurability of the processor and the scalability provided by the on-chip multi-mode router, the ability of the tile-ASIP to provide on-demand performance lowers the cost and achieves high energy efficiency.
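As a final illustration, the C sketch below shows one conceivable policy for the dynamic selection between the two switching modes of the multi-mode router mentioned above. The thesis only states that circuit and wormhole switching coexist and are selected to match the traffic pattern; the payload-size rule and the threshold used here are assumptions for illustration, not the router's actual decision logic.

    #include <stdio.h>

    typedef enum { SWITCH_WORMHOLE, SWITCH_CIRCUIT } switch_mode_t;

    /* Long streaming transfers amortize the circuit set-up handshake, so they are
     * sent over a reserved circuit; short messages travel as wormhole packets. */
    static switch_mode_t pick_mode(unsigned payload_bytes)
    {
        const unsigned circuit_threshold = 1024;  /* assumed break-even point */
        return (payload_bytes >= circuit_threshold) ? SWITCH_CIRCUIT : SWITCH_WORMHOLE;
    }

    int main(void)
    {
        printf("64 B  -> %s\n", pick_mode(64)   == SWITCH_CIRCUIT ? "circuit" : "wormhole");
        printf("8 KiB -> %s\n", pick_mode(8192) == SWITCH_CIRCUIT ? "circuit" : "wormhole");
        return 0;
    }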

6.2 Future work

One revolutionary approach to achieving high performance at extremely low power consumption is brain-inspired computing [4, 115, 116]. The human brain runs at very low power but performs many complex tasks that even current computers cannot handle. Specialization is one of the most important features for achieving high energy efficiency [117]: different parts of the brain are trained and customized to perform diverse functions. For the tile-ASIP proposed in this dissertation, the reconfigurable and customizable processor cores together with the integrated router make it a competitive building block for constructing a brain-inspired computer. Future research can be performed in the following three aspects to facilitate the implementation of brain-inspired computing with the tile-ASIP.

Hardware improvement

In the current version of the tile-ASIP, fixed configurations of the ASIP are saved in the internal ROM. To enhance reconfigurability and customizability, re-programmable non-volatile memories can be utilized to replace the ROM. Meanwhile, the SMTs

and instructions will share the internal RAM to maximize flexibility, instead of using independent storage as in the current version. The integrated router is designed for mesh networks, and up to 256 nodes are supported. The reliability and adaptability can be increased by supporting irregular topologies, adding fault detection and correction algorithms, etc. Energy-efficient inter-router connections could also be developed. Besides, for both the router and the processor, event-driven working modes are necessary to further increase the energy efficiency by activating the resources only during the periods when incoming events are being recorded.

Programming model

The programming of a single tile-ASIP, including its SMTs and application programs, is supported. To ease the integration of hundreds or even thousands of tile-ASIPs, programming models for configuring and scheduling multiple tile-ASIPs need to be studied and formulated. Besides, since SMTs can be reprogrammed at run time to change the processor behavior, in-system optimization for specific tasks is desired, supported by a customized software tool chain.

Application mapping

Brain-inspired computing has drawn much attention due to the extremely high energy efficiency of the human brain compared with traditional computers. The brain works with high specialization, low processing duty cycles and low-voltage operation [4]. One million programmable spiking neurons and 256 million configurable synapses are implemented in a single chip in [115], and many such chips can be tiled to support even more complex neural networks. In [116], 18 ARM968 processing cores are integrated as a chip multiprocessor (CMP) for simulating large-scale spiking neural networks; the architecture can scale up to 65536 chips, enabling the modeling of up to 1 billion neurons. Considering the customizable processor core, its tightly coupled internal memories, and the scalable communication structure of the proposed tile-ASIP, the mapping of brain-inspired computing onto this platform can be studied as future work.


Bibliography

[1] C.D. Martin. ENIAC: press conference that shook the world. Technology and Society Magazine, IEEE, 14(4):3–10, Winter 1995.

[2] G. E. Moore. Cramming more components onto integrated circuits. Electronics, 38(8):114–117, 1965.

[3] R.H. Dennard, V.L. Rideout, E. Bassous, and A.R. LeBlanc. Design of ion-implanted MOSFET's with very small physical dimensions. Solid-State Circuits, IEEE Journal of, 9(5):256–268, Oct 1974.

[4] M.B. Taylor. A landscape of the new dark silicon design regime. Micro, IEEE, 33(5):8–19, Sept 2013.

[5] Z. Toprak-Deniz, M. Sperling, J. Bulzacchelli, G. Still, R. Kruse, Seongwon Kim, D. Boerstler, T. Gloekler, R. Robertazzi, K. Stawiasz, T. Diemoz, G. English, D. HUI, P. Muench, and J. Friedrich. Distributed system of digitally controlled microregulators enabling per-core DVFS for the POWER8TM. In IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC), pages 98–99, Feb 2014.

[6] D. Jacquet, F. Hasbani, P. Flatresse, R. Wilson, F. Arnaud, G. Cesana, T. Di Gilio, C. Lecocq, T. Roy, A. Chhabra, C. Grover, O. Minez, J. Uginet, G. Durieu, C. Adobati, D. Casalotto, F. Nyer, P. Menut, A. Cathelin, I. Vongsavady, and P. Magarshack. A 3 GHz dual core processor ARM CortexTM-A9 in 28 nm UTBB FD-SOI CMOS with ultra-wide voltage range and energy efficiency optimization. IEEE Journal of Solid-State Circuits, 49(4):812–826, April 2014.

[7] Y. Akgul, D. Puschini, S. Lesecq, E. Beigne, I. Miro-Panades, P. Benoit, and L. Torres. Power management through DVFS and dynamic body biasing in FD-SOI circuits. In 51st ACM/EDAC/IEEE Design Conference (DAC), pages 1–6, June 2014.

[8] K. Craig, Y. Shakhsheer, S. Arrabi, S. Khanna, J. Lach, and B.H. Calhoun. A 32 b 90 nm processor implementing panoptic DVS achieving energy efficient

operation from sub-threshold to high performance. IEEE Journal of Solid-State Circuits, 49(2):545–552, Feb 2014.

[9] S.R. Sridhara, M. DiRenzo, S. Lingam, Seok-Jun Lee, R. Blazquez, J. Maxey, S. Ghanem, Yu-Hung Lee, R. Abdallah, P. Singh, and M. Goel. Microwatt embedded processor platform for medical system-on-chip applications. IEEE Journal of Solid-State Circuits, 46(4):721–730, April 2011.

[10] E. Nakasu. Super hi-vision on the horizon: A future TV system that conveys an enhanced sense of reality and presence. Consumer Electronics Magazine, IEEE, 1(2):36–42, 2012.

[11] DCI DCSS v12 with errata 2012-1010. http://www.dcimovies.com/, 2012.

[12] T. Ito. Future television - Super Hi-Vision and beyond. In IEEE Asian Solid State Circuits Conference (A-SSCC), pages 1–4, 2010.

[13] M. Ikeda, T. Onishi, T. Sano, A. Sagata, H. Iwasaki, Y. Nakajima, K. Nitta, Y. Takahashi, K. Yokohari, D. Kobayashi, K. Kamikura, and H. Jozawa. MVC real-time video encoder for full-HDTV 3D video. In IEEE International Conference on Consumer Electronics (ICCE), pages 166–167, 2012.

[14] G. Van Wallendael, S. Van Leuven, J. De Cock, F. Bruls, and R. Van De Walle. 3D video compression based on high efficiency video coding. IEEE Transactions on Consumer Electronics, 58(1):137–145, 2012.

[15] J. Dick, H. Almeida, L.D. Soares, and P. Nunes. 3D holoscopic video coding using MVC. In EUROCON - International Conference on Computer as a Tool (EUROCON), 2011 IEEE, pages 1–4, 2011.

[16] Luigi Atzori, Antonio Iera, and Giacomo Morabito. The internet of things: A survey. Computer Networks, 54(15):2787 – 2805, 2010.

[17] Guo-An Jian, Ting-Yu Huang, Jui-Chin Chu, and Jiun-In Guo. Optimization of VC-1/H.264/AVS video decoders on embedded processors. In Sixth International Conference on Information Technology: New Generations, pages 1313–1318, 2009.

[18] K. Nishihara, A. Hatabu, and T. Moriyoshi. Parallelization of H.264 video decoder for embedded multicore processor. In IEEE International Conference on Multimedia and Expo, pages 329–332, 2008.

[19] F. Pescador, M.J. Garrido, E. Juarez, and C. Sanz. On an implementation of HEVC video decoders with DSP technology. In IEEE International Conference on Consumer Electronics (ICCE), pages 121–122, 2013.

[20] Chao-Tsung Huang, M. Tikekar, C. Juvekar, V. Sze, and A. Chandrakasan. A 249Mpixel/s HEVC video-decoder chip for quad full HD applications. In 2013 IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC), pages 162–163, 2013.

[21] Dajiang Zhou, Jinjia Zhou, Jiayi Zhu, Peilin Liu, and S. Goto. A 2Gpixel/s H.264/AVC HP/MVC video decoder chip for Super Hi-Vision and 3DTV/FTV applications. In IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC), pages 224–226, 2012.

[22] Tzu-Der Chuang, Pei-Kuei Tsung, Pin-Chih Lin, Lo-Mei Chang, Tsung-Chuan Ma, Yi-Hau Chen, and Liang-Gee Chen. A 59.5mW scalable/multi-view video decoder chip for Quad/3D Full HDTV and video streaming applications. In IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC), pages 330–331, 2010.

[23] Chien-Chang Lin, Jia-Wei Chen, Hsiu-Cheng Chang, Yao-Chang Yang, Yi- Huan Ou Yang, Ming-Chih Tsai, Jiun-In Guo, and Jinn-Shyan Wang. A 160k gates/4.5 KB SRAM H.264 video decoder for HDTV applications. IEEE Journal of Solid-State Circuits, 42(1):170–182, 2007.

[24] H. Martin, E. San Millan, P. Peris-Lopez, and J. Tapiador. Efficient ASIC implementation and analysis of two EPC-C1G2 RFID authentication protocols. IEEE Sensors Journal, PP(99):1–1, 2013.

[25] K. Keutzer, S. Malik, and A.R. Newton. From ASIC to ASIP: the next design discontinuity. In IEEE International Conference on Computer Design: VLSI in Computers and Processors, pages 84–90, 2002.

[26] M.K. Jain, M. Balakrishnan, and A. Kumar. ASIP design methodologies: survey and issues. In Fourteenth International Conference on VLSI Design, pages 76–81, 2001.

[27] Antoni Portero, G. Talavera, Marc Moreno, J. Carrabina, and F. Catthoor. Methodology for energy-flexibility space exploration and mapping of multimedia applications to single-processor platform styles. IEEE Transactions on Circuits and Systems for Video Technology, 21(8):1027–1039, 2011.

[28] M. Rashid, L. Apvrille, and R. Pacalet. Application specific processors for multimedia applications. In 11th IEEE International Conference on Computational Science and Engineering, pages 109–116, 2008.

[29] SungDae Kim and MyungH. Sunwoo. ASIP approach for implementation of H.264/AVC. Journal of Signal Processing Systems, 50(1):53–67, 2008.

[30] Zheng Shen, Hu He, Yanjun Zhang, and Yihe Sun. A video specific instruction set architecture for ASIP design. VLSI Design, 2007(2):4:1–4:7, April 2007.

[31] Erich Wenger and Johann Grobschadl. An 8-bit AVR-Based elliptic curve cryptographic RISC processor for the internet of things. In Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Workshops, MICROW ’12, pages 39–46, Washington, DC, USA, 2012. IEEE Computer Society.

[32] ITU-T SERIES H: Audiovisual and multimedia systems infrastructure of audiovisual services - coding of moving video: Advanced video coding for generic audiovisual services. In Joint Video Team (JVT) of ISO/IEC MPEG and ITU-T VCEG, JVT-AF11, Nov. 2009.

[33] George Kornaros and Dionisios Pnevmatikatos. Dynamic power and thermal management of noc-based heterogeneous mpsocs. ACM Trans. Reconfigurable Technol. Syst., 7(1):1:1–1:26, February 2014.

[34] Brett H. Meyer, Adam S. Hartman, and Donald E. Thomas. Cost-effective lifetime and yield optimization for noc-based mpsocs. ACM Trans. Des. Autom. Electron. Syst., 19(2):12:1–12:33, March 2014.

[35] L. Sterpone, L. Carro, D. Matos, S. Wong, and F. Fakhar. A new reconfigurable clock-gating technique for low power SRAM-based FPGAs. In Design, Automation Test in Europe Conference Exhibition (DATE), 2011, pages 1–6, March 2011.

[36] Dingguo Wei, Chun Zhang, Yan Cui, Hong Chen, and Zhihua Wang. Design of a low-cost low-power baseband-processor for UHF RFID tag with asynchronous design technique. In Circuits and Systems (ISCAS), 2012 IEEE International Symposium on, pages 2789–2792, May 2012.

[37] L. Benini, P. Siegel, and G. De Micheli. Saving power by synthesizing gated clocks for sequential circuits. Design Test of Computers, IEEE, 11(4):32–41, Winter 1994.

[38] Kwangok Jeong, A.B. Kahng, Seokhyeong Kang, T.S. Rosing, and R. Strong. MAPG: Memory access power gating. In Design, Automation Test in Europe Conference Exhibition (DATE), 2012, pages 1054–1059, March 2012.

[39] Chiu-Kuo Chen, Yi-Yuan Wang, Zong-Han Hsieh, E. Chua, Wai-Chi Fang, and Tzyy-Ping Jung. A low power independent component analysis processor in 90nm CMOS technology for portable EEG signal processing systems. In Circuits and Systems (ISCAS), 2011 IEEE International Symposium on, pages 801–804, May 2011.

[40] H. Singh, K. Agarwal, D. Sylvester, and K.J. Nowka. Enhanced leakage reduction techniques using intermediate strength power gating. Very Large Scale Integration (VLSI) Systems, IEEE Transactions on, 15(11):1215–1224, Nov 2007.

[41] T. Burd, T. Pering, A. Stratakos, and R. Brodersen. A dynamic voltage scaled microprocessor system. In Solid-State Circuits Conference, 2000. Digest of Technical Papers. ISSCC. 2000 IEEE International, pages 294–295, Feb 2000.

[42] J. Howard, S. Dighe, Y. Hoskote, S. Vangal, D. Finan, G. Ruhl, D. Jenkins, H. Wilson, N. Borkar, G. Schrom, F. Pailet, S. Jain, T. Jacob, S. Yada, S. Marella, P. Salihundam, V. Erraguntla, M. Konow, M. Riepen, G. Droege, J. Lindemann, M. Gries, T. Apel, K. Henriss, T. Lund-Larsen, S. Steibl, S. Borkar, V. De, R. Van Der Wijngaart, and T. Mattson. A 48-core IA-32 message-passing processor with DVFS in 45nm CMOS. In Solid-State Circuits Conference Digest of Technical Papers (ISSCC), 2010 IEEE International, pages 108–109, Feb 2010.

[43] V. Krishnaswamy, J. Brooks, G. Konstadinidis, C. McAllister, H. Pham, S. Turullols, J.L. Shin, Yifan YangGong, and Haowei Zhang. Fine-grained adaptive power management of the SPARC M7 processor. In Solid-State Circuits Conference - (ISSCC), 2015 IEEE International, pages 1–3, Feb 2015.

[44] E.J. Fluhr, S. Baumgartner, D. Boerstler, J.F. Bulzacchelli, T. Diemoz, D. Dreps, G. English, J. Friedrich, A. Gattiker, T. Gloekler, C. Gonzalez, J.D. Hibbeler, K.A. Jenkins, Yong Kim, P. Muench, R. Nett, J. Paredes, J. Pille, D. Plass, P. Restle, R. Robertazzi, D. Shan, D. Siljenberg, M. Sperling, K. Stawiasz, G. Still, Z. Toprak-Deniz, J. Warnock, G. Wiedemeier, and V. Zyuban. The 12-core POWER8 processor with 7.6 Tb/s I/O bandwidth, integrated voltage regulation, and resonant clocking. Solid-State Circuits, IEEE Journal of, 50(1):10–23, Jan 2015.

[45] N. Reynders and W. Dehaene. A 210mV 5MHz variation-resilient near-threshold JPEG encoder in 40nm CMOS. In Solid-State Circuits Conference Digest of Technical Papers (ISSCC), 2014 IEEE International, pages 456–457, Feb 2014.

[46] Advances in big.little technology for power and energy savings. http://www.thinkbiglittle.com/.

[47] Jungyul Pyo, Youngmin Shin, Hoi-Jin Lee, Sung il Bae, Min-Su Kim, Kwangil Kim, Ken Shin, Yohan Kwon, Heungchul Oh, Jaeyoung Lim, Dong wook Lee, Jongho Lee, Inpyo Hong, Kyungkuk Chae, Heon-Hee Lee, Sung-Wook Lee, Seongho Song, Chung-Hee Kim, Jin-Soo Park, Heesoo Kim, Sunghee Yun, Uk-Rae Cho, Jae Cheol Son, and Sungho Park. 20nm high-K metal-gate heterogeneous 64b quad-core CPUs and hexa-core GPU for high-performance and energy-efficient mobile application processor. In IEEE International Solid-State Circuits Conference - (ISSCC), pages 1–3, Feb 2015.

[48] A. Gomez, C. Pinto, A. Bartolini, D. Rossi, L. Benini, H. Fatemi, and J.P. de Gyvez. Reducing energy consumption in -based platforms

with low design margin co-processors. In Design, Automation Test in Europe Conference Exhibition (DATE), 2015, pages 269–272, March 2015.

[49] T.-T. Liu and J.M. Rabaey. A 0.25 V 460 nW asynchronous neural signal processor with inherent leakage suppression. IEEE Journal of Solid-State Circuits, 48(4):897–906, April 2013.

[50] N. Neves, N. Sebastiao, D. Matos, P. Tomas, P. Flores, and N. Roma. Multi-core SIMD ASIP for next-generation sequencing and alignment biochip platforms. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 23(7):1287–1300, July 2015.

[51] T. Sugiura, S. Nakatsuka, Jaehoon Yu, Y. Takeuchi, and M. Imai. An efficient data compression method for artificial vision systems and its low energy implementation using ASIP technology. In IEEE Biomedical Circuits and Systems Conference (BioCAS), pages 81–84, Oct 2014.

[52] Y. Xin, W.X.Y. Li, Z. Zhang, R.C.C. Cheung, D. Song, and T.W. Berger. An application specific instruction set processor (asip) for adaptive filters in neural prosthetics. Computational Biology and Bioinformatics, IEEE/ACM Transactions on, PP(99):1–1, 2015.

[53] Lee Minkyu, Song Yong Ho, and Chung Ki-Seok. An ASIP approach for interpolation performance enhancement in HEVC decoder. In 4th IEEE International Conference on Network Infrastructure and Digital Content (IC-NIDC), pages 232–235, Sept 2014.

[54] Kwang Woo Lee, Sung Dae Kim, and M.H. Sunwoo. VSIP: Video Specific Instruction Set Processor for H.264/AVC. In IEEE Workshop on Signal Processing and Implementation, pages 124–129, Oct 2006.

[55] A. Gupta and A. Pal. Accelerating SVM on Ultra Low Power ASIP for High Throughput Streaming Applications. In 28th International Conference on VLSI Design (VLSID), pages 517–522, Jan 2015.

[56] Suk Hyun Yoon, Myung Hoon Sunwoo, and Jong Ha Moon. Design of a high-quality audio-specific DSP core. In IEEE Workshop on Signal Processing Systems Design and Implementation, pages 509–513, Nov 2005.

[57] Purushotham Murugappa, Amer Baghdadi, and Michel Jezequel. ASIP design for multi-standard channel decoders. In Cyrille Chavet and Philippe Coussy, editors, Advanced Hardware Design for Error Correcting Codes, pages 151–175. Springer International Publishing, 2015.

[58] Zhenzhi Wu and D. Liu. Flexible multistandard FEC with ASIP methodology. In IEEE 25th International Conference on Application-specific Systems, Architectures and Processors (ASAP), pages 210–218, June 2014.

[59] Ubaid Ahmad, Min Li, Amir Amin, Meng Li, Liesbet Van der Perre, Rudy Lauwereins, and Sofie Pollin. An energy-efficient reconfigurable ASIP supporting multi-mode MIMO detection. Journal of Signal Processing Systems, pages 1–17, 2015.

[60] Frank Bouwens, Jos Huisken, Harmke De Groot, Martijn Bennebroek, Anteneh Abbo, Octavio Santana, Jef van Meerbergen, and Antoine Fraboulet. A dual-core system solution for wearable health monitors. In Proceedings of the 21st Edition of the Great Lakes Symposium on Great Lakes Symposium on VLSI, GLSVLSI ’11, pages 379–382, New York, NY, USA, 2011. ACM.

[61] Li-Ming Chen, Zeng-Hui Yu, Cheng-Ying Chen, Xiao-Yu Hu, Jun Fan, Jun Yang, and Yong Hei. A 1-V, 1.2-mA fully integrated SoC for digital hearing aids. Microelectronics Journal, 46(1):12 – 19, 2015.

[62] N. Neves, N. Sebastiao, A. Patricio, D. Matos, P. Tomas, P. Flores, and N. Roma. BioBlaze: Multi-core SIMD ASIP for DNA sequence alignment. In IEEE 24th International Conference on Application-Specific Systems, Architectures and Processors (ASAP), pages 241–244, June 2013.

[63] J. Constantin, A. Dogan, O. Andersson, P. Meinerzhagen, J.N. Rodrigues, D. Atienza, and A. Burg. TamaRISC-CS: An ultra-low-power application-specific processor for compressed sensing. In IEEE/IFIP 20th International Conference on VLSI and System-on-Chip (VLSI-SoC), pages 159–164, Oct 2012.

[64] Yi Wang and Yajun Ha. A Performance and Area Efficient ASIP for Higher-Order DPA-Resistant AES. IEEE Journal on Emerging and Selected Topics in Circuits and Systems, 4(2):190–202, June 2014.

[65] O. Ahmed and S. Areibi. An efficient application-specific instruction-set processor for packet classification. In International Conference on Reconfigurable Computing and FPGAs (ReConFig), pages 1–6, Dec 2013.

[66] Seung-Hyun Choi, Neungsoo Park, YongHo Song, and Seong-Won Lee. ASiPEC: An application specific instruction-set processor for high performance entropy coding. In James J. (Jong Hyuk) Park, Yi Pan, Han-Chieh Chao, and Gangman Yi, editors, Ubiquitous Computing Application and Wireless Sensor, volume 331 of Lecture Notes in , pages 67–75. Springer Netherlands, 2015.

[67] Srinivas Lingam, Timothy Schmidl, and Anuj Batra. Software defined radio for smart utility networks. In Proceedings of the 2014 ACM Workshop on Software Radio Implementation Forum, SRIF ’14, pages 37–44, New York, NY, USA, 2014. ACM.

[68] Wenxiang Wang, Ling Li, Guangfei Zhang, Dong Liu, and Ji Qiu. An application specific instruction set processor optimized for FFT. In Circuits and Systems (MWSCAS), 2011 IEEE 54th International Midwest Symposium on, pages 1–4, Aug 2011.

[69] M. Koziri, D. Zacharis, I. Katsavounidis, and N. Bellas. Implementation of the AVS video decoder on a heterogeneous dual-core SIMD processor. Consumer Electronics, IEEE Transactions on, 57(2):673–681, May 2011.

[70] G. Paya-Vayaa, J. Martin-Langerwerf, C. Banz, F. Giesemann, P. Pirsch, and H. Blume. VLIW architecture optimization for an efficient computation of stereoscopic video applications. In Green Circuits and Systems (ICGCS), 2010 International Conference on, pages 457–462, June 2010.

[71] Vincent Brost, Fan Yang, and Charles Meunier. Flexible VLIW processor based on FPGA for efficient embedded real-time image processing. Journal of Real-Time Image Processing, 9(1):47–59, 2014.

[72] V. Lapotre, P. Murugappa, G. Gogniat, A. Baghdadi, M. Hubner, and J.-P. Diguet. A dynamically reconfigurable multi-ASIP architecture for multistandard and multimode turbo decoding. Very Large Scale Integration (VLSI) Systems, IEEE Transactions on, PP(99):1–1, 2015.

[73] Q. Yang, X. Zhou, G.E. Sobelman, and X. Li. Network-on-chip for turbo decoders. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, PP(99):1–1, 2015.

[74] Hong Chinh Doan, H. Javaid, and S. Parameswaran. Flexible and scalable implementation of H.264/AVC encoder for multiple resolutions using ASIPs. In Design, Automation and Test in Europe Conference and Exhibition (DATE), 2014, pages 1–6, March 2014.

[75] Hyun-Ho Jo, Yong-Jo Ahn, Dae-Beom Kang, Bongil Ji, Dong-Gyu Sim, and Jae-Jin Lee. Flexible multi-core platform for a multiple-format video decoder. Journal of Signal Processing Systems, 80(2):163–179, 2015.

[76] Dipan Kumar Mandal, Mihir Mody, Mahesh Mehendale, Naresh Yadav, Ghone Chaitanya, Piyali Goswami, Hetul Sanghvi, and Niraj Nandan. Accelerating H.264/HEVC video slice processing using application specific instruction set processor. In Consumer Electronics (ICCE), 2015 IEEE International Conference on, pages 408–411, Jan 2015.

[77] I. Tsekoura, G. Selimis, J. Hulzink, F. Catthoor, J. Huisken, H. de Groot, and C. Goutis. Exploration of cryptographic ASIP for wireless sensor nodes. In 17th IEEE International Conference on Electronics, Circuits, and Systems (ICECS), pages 827–830, Dec 2010.

[78] H. Marzouqi, M. Al-Qutayri, K. Salah, D. Schinianakis, and T. Stouraitis. A high-speed FPGA implementation of an RSD-based ECC processor. Very Large Scale Integration (VLSI) Systems, IEEE Transactions on, PP(99):1–1, 2015.

[79] Z. Wu and D. Liu. High-throughput trellis processor for multistandard FEC decoding. Very Large Scale Integration (VLSI) Systems, IEEE Transactions on, PP(99):1–1, 2015.

[80] Hee Kwan Eun, Sung Jo Hwang, and Myung Hoon Sunwoo. Novel IME instructions and their hardware architecture for ME ASIP. In International SoC Design Conference (ISOCC), pages 139–142, Nov 2010.

[81] J. Ohm, G.J. Sullivan, H. Schwarz, Thiow Keng Tan, and T. Wiegand. Comparison of the coding efficiency of video coding standards - including high efficiency video coding (HEVC). Circuits and Systems for Video Technology, IEEE Transactions on, 22(12):1669–1684, 2012.

[82] Chih-Da Chien, Chien-Chang Lin, Yi-Hung Shih, He-Chun Chen, Chia-Jui Huang, Cheng-Yen Yu, Chih-Liang Chen, Ching-Hwa Cheng, and Jiun-In Guo. A 252kgate/71mW multi-standard multi-channel video decoder for high definition video applications. In IEEE International Solid-State Circuits Conference. ISSCC 2007. Digest of Technical Papers., pages 282–603, 2007.

[83] Tsu-Ming Liu, Ting-An Lin, Sheng- Wang, Wen-Ping Lee, Jiun-Yan Yang, Kang-Cheng Hou, and Chen-Yi Lee. A 125 µW , fully scalable MPEG-2 and H.264/AVC video decoder for mobile applications. Solid-State Circuits, IEEE Journal of, 42(1):161–169, 2007.

[84] Ning Ma, Zhibo Pang, Jun Chen, H. Tenhunen, and Li-Rong Zheng. A 5Mgate/414mW networked media SoC in 0.13um with 720p multi-standard video decoding. In Solid-State Circuits Conference, 2009. A-SSCC 2009. IEEE Asian, pages 385–388, Nov 2009.

[85] A. Smolic, K. Mueller, N. Stefanoski, J. Ostermann, A. Gotchev, G.B. Akar, G. Triantafyllidis, and A. Koz. Coding algorithms for 3DTV: A survey. Circuits and Systems for Video Technology, IEEE Transactions on, 17(11):1606–1621, 2007.

[86] Chi-Cheng Ju, Tsu-Ming Liu, Yung-Chang Chang, Chih-Ming Wang, Chun-Chia Chen, Hue-Min Lin, Chia-Yun Cheng, Min-Hao Chiu, Sheng-Jen Wang, Ping Chao, MJ Hu, Hao-Wei Li, and Chung-Hung Tsai. A 775-uW/fps/view H.264/MVC decoder chip compliant with 3D Blu-ray specifications. In IEEE International Symposium on Circuits and Systems (ISCAS), pages 1440–1443, 2012.

[87] Pei-Kuei Tsung, Ping-Chih Lin, Kuan-Yu Chen, Tzu-Der Chuang, Hsin-Jung Yang, Shao-Yi Chien, Li-Fu Ding, Wei-Yin Chen, Chih-Chi Cheng, Tung-Chien Chen, and Liang-Gee Chen. A 216fps 4096*2160p 3DTV set-top box SoC for free-viewpoint 3DTV applications. In IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC), pages 124–126, 2011.

[88] A. Vetro, T. Wiegand, and G.J. Sullivan. Overview of the stereo and multi-view video coding extensions of the H.264/MPEG-4 AVC standard. Proceedings of the IEEE, 99(4):626–642, 2011.

[89] Axel Jantsch and Hannu Tenhunen. Networks on Chip. Kluwer Academic Publishers, 2003.

[90] Radu Marculescu, Umit Y. Ogras, Li-Shiuan Peh, Natalie Enright Jerger, and Yatin Hoskote. Outstanding research problems in NoC design: System, microarchitecture, and circuit perspectives. In IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 28(1), pages 3–21, Jan. 2009.

[91] G. Tuveri, S. Secchi, P. Meloni, L. Raffo, and E. Cannella. A runtime adaptive H.264 video-decoding MPSoC platform. In Design and Architectures for Signal and Image Processing (DASIP), 2013 Conference on, pages 149–156, Oct 2013.

[92] Comparative reliability analysis between AMBA and network-on-chip: An MPEG-2 case study. In SOC Conference, 2009. SOCC 2009. IEEE Interna- tional, pages 247–250, Sept 2009.

[93] Ning Ma, Zhonghai Lu, Zhibo Pang, and Lirong Zheng. System-level exploration of mesh-based NoC architectures for multimedia applications. In IEEE International SOC Conference (SOCC), pages 99–104, Sept 2010.

[94] Ning Ma, Zhonghai Lu, and Lirong Zheng. System design of full HD MVC decoding on mesh-based multicore NoCs. Microprocess. Microsyst., 35(2):217–229, March 2011.

[95] Yi Pang, Lifeng Sun, Jiangtao (Gene) Wen, Fengyan Zhang, Weidong Hu, Wei Feng, and Shiqiang Yang. A framework for heuristic scheduling for parallel processing on multicore architecture: A case study with multiview video coding. In IEEE Transactions on Circuits and Systems for Video Technology, Volume 19, Issue 11, pages 1658–1666, Nov. 2009.

[96] J. Duato, S. Yalamanchili, and L. Ni. Interconnection Network - An Engineering Approach. IEEE Computer Society Press, 1997.

[97] Ning Ma, Zhuo Zou, Zhonghai Lu, and Lirong Zheng. Implementing MVC decoding on homogeneous NoCs: Circuit switching or wormhole switching. In 23rd Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP), pages 387–391, March 2015.

[98] Hakima Chaouchi. The internet of things: Connecting objects. 2010.

[99] Jayavardhana Gubbi, Rajkumar Buyya, Slaven Marusic, and Marimuthu Palaniswami. Internet of Things (IoT): A vision, architectural elements, and future directions. Future Generation Computer Systems, 29(7):1645 – 1660, 2013.

[100] S. Sutardja. The future of IC design innovation. In IEEE International Solid-State Circuits Conference - (ISSCC), pages 1–6, Feb 2015.

[101] D. May. The XMOS architecture and XS1 chips. Micro, IEEE, 32(6):28–37, Nov 2012.

[102] Kangmin Lee, Se-Joong Lee, Donghyun Kim, Kwanho Kim, Gawon Kim, Joungho Kim, and Hoi-Jun Yoo. Networks-on-chip and networks-in-package for high-performance SoC platforms. In Asian Solid-State Circuits Conference, pages 485–488, Nov 2005.

[103] T. Jamil. RISC versus CISC. IEEE Potentials, 14(3):13–16, 1995.

[104] N. Ickes, Y. Sinangil, F. Pappalardo, E. Guidetti, and A.P. Chandrakasan. A 10 pj/cycle ultra-low-voltage 32-bit microprocessor system-on-chip. In 2011 Proceedings of ESSCIRC (ESSCIRC), pages 159–162, 2011.

[105] G. Jiang, Z. Li, F. Wang, and S. Wei. Mapping of embedded applications on hybrid networks-on-chip with multiple switching mechanisms. Embedded Systems Letters, IEEE, PP(99):1–1, 2015.

[106] Abbas Mazloumi and Mehdi Modarressi. A hybrid packet/circuit-switched router to accelerate memory access in NoC-based chip multiprocessors. In Design, Automation Test in Europe Conference Exhibition (DATE), pages 908–911, March 2015.

[107] G. Chen, M.A. Anders, H. Kaul, S.K. Satpathy, S.K. Mathew, S.K. Hsu, A. Agarwal, R.K. Krishnamurthy, V. De, and S. Borkar. A 340 mV-to-0.9 V 20.2 Tb/s source-synchronous hybrid packet/circuit-switched 16x16 network-on-chip in 22 nm tri-gate CMOS. Solid-State Circuits, IEEE Journal of, 50(1):59–67, Jan 2015.

[108] Mark A. Anders, Himanshu Kaul, Ram K. Krishnamurthy, and Shekhar Y. Borkar. Hybrid circuit/packet switched network for energy efficient on-chip interconnections. Low Power Networks-on-Chip, pages 3–20, 2011.

[109] M. Modarressi, H. Sarbazi-Azad, and M. Arjomand. A hybrid packet-circuit switched on-chip network based on SDM. In Design, Automation Test in Europe Conference Exhibition (DATE ’09), pages 566–569, April 2009.

[110] Angelo Kuti Lusala and Jean-Didier Legat. Combining SDM-based circuit switching with packet switching in a NoC for real-time applications. In IEEE International Symposium on Circuits and Systems (ISCAS), pages 2505–2508, May 2011.

[111] M. Modarressi, A. Tavakkol, and H. Sarbazi-Azad. Virtual point-to-point connections for NoCs. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 29(6):855–868, June 2010.

[112] N. Enright Jerger, Li-Shiuan Peh, and M. Lipasti. Circuit-switched coherence. In Second ACM/IEEE International Symposium on Networks-on-Chip (NoCS), pages 193–202, April 2008.

[113] Ning Ma, Zhuo Zou, Zhonghai Lu, and Lirong Zheng. Design and implementation of multi-mode routers for large-scale inter-core networks. Minor Revision (requires no re-evaluation), 2015.

[114] Ning Ma, Zhuo Zou, Yuxiang Huan, Stefan Blixt, Zhonghai Lu, and Lirong Zheng. A 101.4 GOPS/W reconfigurable and scalable control-centric embedded processor for domain-specific applications. Manuscript submitted for publication, 2015.

[115] Paul A. Merolla, John V. Arthur, Rodrigo Alvarez-Icaza, Andrew S. Cassidy, Jun Sawada, Filipp Akopyan, Bryan L. Jackson, Nabil Imam, Chen Guo, Yutaka Nakamura, Bernard Brezzo, Ivan Vo, Steven K. Esser, Rathinakumar Appuswamy, Brian Taba, Arnon Amir, Myron D. Flickner, William P. Risk, Rajit Manohar, and Dharmendra S. Modha. A million spiking-neuron integrated circuit with a scalable communication network and interface. Science, 345(6197):668–673, 2014.

[116] E. Painkras, L.A. Plana, J. Garside, S. Temple, F. Galluppi, C. Patterson, D.R. Lester, A.D. Brown, and S.B. Furber. SpiNNaker: A 1-W 18-core system-on-chip for massively-parallel neural network simulation. Solid-State Circuits, IEEE Journal of, 48(8):1943–1953, Aug 2013.

[117] Yu-Ting Chen, Jason Cong, Michael Gill, Glenn Reinman, and Bingjun Xiao. Customizable Computing. Morgan & Claypool Publishers, 2015.
