
Ultra-low-power Design and Implementation of Application-specific Instruction-set Processors for Ubiquitous Sensing and Computing

NING MA

Doctoral Thesis in Electronics and Computer Systems
Stockholm, Sweden 2015

KTH School of Information and Communication Technology
SE-164 40 Kista, Stockholm
SWEDEN

TRITA-ICT 2015:11
ISSN 1653-6363
ISRN KTH/ICT-2015/11-SE
ISBN 978-91-7595-692-3

Academic dissertation which, with the permission of KTH Royal Institute of Technology, is submitted for public examination for the degree of Doctor of Technology in Electronics and Computer Systems on Wednesday, 4 November 2015, at 10:00 in Sal B, Electrum, KTH Royal Institute of Technology, Kista 164 40, Stockholm.

© Ning Ma, September 2015

Printed by: Universitetsservice US AB

Abstract

The feature size of transistors keeps shrinking with the development of technology, which enables ubiquitous sensing and computing. However, with the breakdown of Dennard scaling caused by the difficulty of further lowering the supply voltage, the power density increases significantly. The consequence is that, for a given power budget, the energy efficiency of the hardware resources must be improved to maximize the performance. Application-specific integrated circuits (ASICs) obtain high energy efficiency at the cost of low flexibility for various applications, while general-purpose processors (GPPs) gain generality at the expense of efficiency.

To provide both high energy efficiency and flexibility, this dissertation explores the ultra-low-power design of application-specific instruction-set processors (ASIPs) for ubiquitous sensing and computing. Two application scenarios, i.e. high-throughput compute-intensive processing for multimedia and low-throughput low-cost processing for the Internet of Things (IoT), are implemented in the proposed ASIPs.

Multimedia for human-computer interaction is always characterized by high data throughput. To design processors for networked multimedia streams, customizing application-specific accelerators controlled by the embedded processor is exploited. By abstracting the common features from multiple coding standards, video decoding accelerators are implemented for networked multi-standard multimedia stream processing. Fabricated in 0.13 µm CMOS technology, the processor running at 216 MHz is capable of decoding real-time high-definition video streams with a power consumption of 414 mW. When even higher throughput is required, such as in multi-view video coding applications, multiple customized processors are connected with an on-chip network. Design problems are further studied for selecting the capability of single processors, the number of processors, the capacity of the communication network, as well as the task assignment schemes.

In the IoT scenario, low processing throughput but high energy efficiency and adaptability are demanded for a wide spectrum of devices. In this case, a tile processor including a multi-mode router and dual cores is proposed and implemented. The multi-mode router supports both circuit and wormhole switching to facilitate inter-silicon extension for providing on-demand performance. The control-centric dual-core processor uses control words to directly manipulate all hardware resources. Such a mechanism avoids introducing complex control logic, and the hardware utilization is increased. Programmable control words enable reconfigurability of the processor for supporting general-purpose ISAs, application-specific instructions and dedicated implementations. Reducing global data transfer also increases the energy efficiency. Finally, a single tile processor, together with a network of bare dies and a network of packaged chips, has been demonstrated as the result. The processor is implemented in 65 nm low-leakage CMOS technology and achieves an energy efficiency of 101.4 GOPS/W for each core.

To my family

Acknowledgments

First and foremost, I would like to express my deep gratitude to my advisors: Prof. Li-Rong Zheng, Assoc. Prof. Zhonghai Lu and Dr. Zhuo Zou. I am heartily thankful to Li-Rong for providing me the opportunity to be a doctoral student at KTH and for all the support during these years. His vast expertise and broad vision greatly inspire my research. I am sincerely grateful to Zhonghai for his professional guidance and constructive suggestions. I appreciate all the discussions with him and have benefited a lot from them. I am really thankful to Zhuo for his valuable advice on everything from doing research and writing scientific papers to everyday trivia. It is exciting and helpful to work with him.

I would also like to express my gratitude to Dr. Zhibo Pang for his assistance in the early phase of my research. I gained experience in both research and everyday life during that period. I am indebted to Stefan Blixt at Imsys AB for broadening and deepening my knowledge about processors through plentiful technical discussions. His patience and enthusiasm greatly encourage me.

I would like to thank Dr. Qiang Chen and Dr. Geng Yang for sharing experiences and giving suggestions on both work and life. Many thanks to the previous and current colleagues at KTH and Imsys for all kinds of help that made the Ph.D. journey pleasurable.

I am grateful to Prof. Hannu Tenhunen for reviewing the draft of this dissertation. Many thanks to Dávid Juhász and Nicholas Rose for proofreading this dissertation and providing so many valuable comments. I would also like to express my sincere acknowledgment to Prof. Jari Nurmi from Tampere University of Technology for acting as my opponent, as well as to Prof. Peeter Ellervee from Tallinn University of Technology, Adj. Prof. Lauri Koskinen from University of Turku, and Assoc. Prof. Cristina Rusu from ACREO AB for serving as my committee members.

I also want to thank all the people I have met for making every day special!

Finally, I wish to express my gratitude to my wife Youjia for her understanding, patience and support, to my little boy Kevin for his innocent smiles, and to my parents for their unconditional love and constant support. This dissertation is dedicated to you.

Ning Ma
Stockholm, September 2015

Abbreviations

ALU    Arithmetic Logic Unit
ASIC   Application-specific Integrated Circuit
ASIP   Application-specific Instruction-set Processor
CIF    Common Intermediate Format
CISC   Complex Instruction Set Computer
CMOS   Complementary Metal Oxide Semiconductor
CMP    Chip Multiprocessor
CPU    Central Processing Unit
CS     Circuit Switching
CW     Control Word
DF     Deblocking Filter
DM     Data Memory
DMA    Direct Memory Access
DSP    Digital Signal Processor
FEC    Forward Error Correction
FIFO   First In First Out
FPGA   Field-Programmable Gate Array
FSM    Finite State Machine
FU     Function Unit
GOP    Group of Pictures
GPP    General-purpose Processor
HD     High Definition
HEVC   High Efficiency Video Coding
IM     Instruction Memory
IoT    Internet of Things
IQ     Inverse Quantization
ISA    Instruction Set Architecture
IT     Inverse Transformation
MAC    Multiplication and Accumulate unit
MB     Macro Block
MC     Motion Compensation
MPEG   Moving Picture Experts Group
MVC    Multi-view Video Coding
NoC    Network on Chip
NRE    Non-recurring Engineering
RAM    Random Access Memory
RISC   Reduced Instruction Set Computer
ROM    Read-only Memory


SDRAM  Synchronous Dynamic Random Access Memory
SIMD   Single Instruction Multiple Data
SMT    Sequence Mapping Table
SoC    System on Chip
UHD    Ultra High Definition
VC     Virtual Channel
VCEG   Video Coding Experts Group
VLD    Variable Length Decoding
VLIW   Very Long Instruction Word
VGA    Video Graphics Array
WS     Wormhole Switching

List of Publications

Papers included in the thesis:

1. Ning Ma, Zhonghai Lu, and Li-Rong Zheng. “System design of full HD MVC decoding on mesh-based multicore NoCs,” In Microprocessors and Microsystems (Elsevier), Volume 35, Issue 2, March 2011, Pages 217-229.

2. Ning Ma, Zhuo Zou, Zhonghai Lu, and Li-Rong Zheng. “Design and Imple- mentation of Multi-mode Routers for Large-scale Inter-core networks,” Minor Revision (requires no re-evaluation), Integration, the VLSI Journal (Elsevier), 2015

3. Ning Ma, Zhibo Pang, Jun Chen, Hannu Tenhunen and Li-Rong Zheng. “A 5Mgate/414mW networked media SoC in 0.13um CMOS with 720p multi-standard video decoding,” In Proceedings of IEEE Asian Solid-State Circuits Conference A-SSCC, pp. 385-388, 16-18 Nov. 2009

4. Ning Ma, Zhibo Pang, Hannu Tenhunen and Li-Rong Zheng. “An ASIC-design-based configurable SOC architecture for networked media,” In Proceedings of IEEE International Symposium on System-on-Chip, pp. 1-4, Nov. 2008.

5. Ning Ma, Zhonghai Lu, Zhibo Pang, and Li-Rong Zheng. “System-level exploration of mesh-based NoC architectures for multimedia applications,” In Proceedings of IEEE International SOC Conference (SOCC), pp. 99-104, 27-29 Sept. 2010

6. Ning Ma, Zhuo Zou, Zhonghai Lu, and Li-Rong Zheng. “Implementing MVC Decoding on Homogeneous NoCs: Circuit Switching or Wormhole Switching,” In Proceedings of 23rd Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP), pp. 387-391, 4-6 March 2015

7. Ning Ma, Zhuo Zou, Zhonghai Lu, Stefan Blixt and Li-Rong Zheng. “A hierarchical reconfigurable micro-coded multi-core processor for IoT applications,” In Proceedings of IEEE International Symposium on Reconfigurable and Communication-Centric Systems-on-Chip (ReCoSoC), pp. 1-4, 26-28 May 2014

8. Ning Ma, Zhuo Zou, Zhonghai Lu, Yuxiang Huan, Stefan Blixt and Li-Rong Zheng. “A 2-mW Multi-mode Router Design with Dual-core Processor in 65 nm LL CMOS for Inter-silicon Communication,” Manuscript submitted to Microelectronics Journal (Elsevier), 2015

9. Ning Ma, Zhuo Zou, Yuxiang Huan, Stefan Blixt, Zhonghai Lu and Li-Rong Zheng. “A 101.4 GOPS/W Reconfigurable and Scalable Control-centric Embedded Processor for Domain-specific Applications,” Manuscript submitted to IEEE International Symposium on Circuits and Systems (ISCAS), 2016

Papers not included in the thesis:

10. Ning Ma, Zhuo Zou, Yuxiang Huan, Stefan Blixt, Zhonghai Lu and Li-Rong Zheng. “A Cacheless Control-centric Dual-core X3 Processor for Real-time Energy-efficient Processing,” Manuscript submitted to Special Issue on Sustainable Processor Architectures and Applications, Journal of Microprocessors and Microsystems (Elsevier), 2015

11. Yuxiang Huan, Ning Ma, Stefan Blixt, Zhuo Zou, and Lirong Zheng, “A 61 µA/MHz Reconfigurable Application-Specific Processor and System-on-Chip for Internet-of-Things,” In Proceedings of IEEE International SOC Conference (SOCC), pp. 235-239, 9-11 Sept. 2015.

Contents

Contents

List of Figures

1 Introduction
   1.1 Background
   1.2 Motivation
   1.3 Thesis contributions and organization

2 Low-power application-specific instruction-set processors
   2.1 Low-power techniques
   2.2 Applications
   2.3 Architectures
   2.4 Approaches

3 Single-ASIP implementation for multi-standard multimedia processing
   3.1 Introduction
   3.2 Design methodology
   3.3 Video coding structure
   3.4 Accelerator-enriched architecture
   3.5 Design instance
   3.6 Implementation results
   3.7 Summary

4 Multi-processor architecture exploration for multi-view video decoding
   4.1 Background
   4.2 Design space exploration
   4.3 Communication network
   4.4 Summary

5 Customizable tile-ASIP implementation for Internet of Things

   5.1 Background
   5.2 Multi-mode router
   5.3 Control-centric processor
   5.4 Tile implementation
   5.5 Summary

6 Conclusions
   6.1 Thesis summary
   6.2 Future work

Bibliography

List of Figures

1.1 Time line of computers
1.2 Trends of video applications
1.3 One scenario of IoT
1.4 Applications and implementations

2.1 Comparison of ASIC, GPP and ASIP
2.2 Architectures of ASIPs
2.3 Design flow for extending an existing processor
2.4 Design flow for customizing a processor

3.1 Video coding standards
3.2 Design flow
3.3 HW/SW partition
3.4 General structure of video coding
3.5 Accelerator-enriched architecture
3.6 Variety of data transferring
3.7 Design instance for networked multimedia processing
3.8 Video decoding pipeline
3.9 Die photo and power consumption
3.10 Test system and application scenario

4.1 Typical prediction structure
4.2 An example of 4x4 2D mesh NoC
4.3 Mapping multimedia applications to NoC
4.4 Simplified system
4.5 Work flow for the system exploration
4.6 Hierarchical structure of MVC
4.7 Picture-level parallelism
4.8 A possible data flow for picture-level parallel processing
4.9 Design options for picture-level task assignment
4.10 A possible data flow for view-level parallel processing
4.11 Design options for view-level task assignment
4.12 Router structures

5.1 Proposed multi-mode router
5.2 Maximum throughput with only CS or WS activated
5.3 The comparison of throughput and latency
5.4 Areas for routers with different configurations
5.5 Overall throughput (OT) and power efficiency (PE)
5.6 Complete structure of the on-chip multi-mode router in one tile
5.7 Architecture overview
5.8 Structure of one control word
5.9 Mapping from applications to the processor
5.10 Comparison of instruction processing
5.11 Control word timing
5.12 Hierarchical implementation structure
5.13 Die photo together with gate count breakdown
5.14 Power consumption under different working modes and conditions
5.15 Network on a board and network in a package

Chapter 1

Introduction

1.1 Background

When the first electronic general-purpose computer ENIAC was announced in 1946, it used over 17,000 vacuum tubes and covered 140 m², with a power consumption of 174 kW [1]. Programming it required manually setting switches and cables. The vacuum tubes were later substituted by discrete transistors, and further by integrated circuits. The miniaturization and integration of devices greatly decreased the size and power consumption as well as increased the performance, which enabled personal and portable computers. Figure 1.1 shows a brief history of computers. With the utilization of CMOS and following Moore's law, a monolithic integrated circuit can hold more and more transistors, since the feature size scales down roughly every two years [2]. Table 1.1 gives the power scaling for fixed-size chips.


Figure 1.1: Time line of computers

Table 1.1: Power scaling for a fixed-sized chip, from [4]

Symbol        Dennard scaling   Post-Dennard scaling   Notes
Q             S²                S²                     Quantity of transistors
C             1/S               1/S                    Capacitance
F             S                 S                      Frequency
V             1/S               1                      Voltage
P = QCFV²     1                 S²                     Power consumption

Assuming a scaling factor of S, if the size of a chip remains the same as that of the previous generation, the quantity of transistors and the maximum working frequency scale by factors of S² and S, respectively. The capability of a chip of the same size is thus improved by a factor of S³. To keep the power consumption constant, the supply voltage needs to be scaled down correspondingly, which is called Dennard scaling [3]. During this period, the performance of processors was increased by raising the clock frequency, supported by more and more sophisticated architectures, e.g. deeper pipelines, speculation, hierarchical caches, etc., enabled by the additional transistors. The powerful yet small processors boosted the miniaturization of computers, and today's mobile phones are more capable than the computers of earlier generations.

Further miniaturization merges computers into the ambient environment and facilitates ubiquitous sensing and computing empowered by smart electronics. Application scenarios include both high-data-throughput applications such as human-computer interaction and low-data-throughput applications such as the Internet of Things (IoT). Ultra-low-power processors are highly demanded for such miniaturized devices. On the other hand, while the transistor size continues to shrink, the supply voltage cannot be scaled down correspondingly due to leakage problems, which breaks Dennard scaling [4]. As shown in Table 1.1, if all the transistors work at the maximum frequency, the power consumption will increase exponentially. For a given power budget, either the chip size or the activity of the transistors must be reduced.

The traditional low-power techniques such as clock gating and power gating, which either lower the working frequency or completely shut down the circuit, decrease the power consumption but degrade the performance at the same time. Dynamic voltage and frequency scaling (DVFS) is utilized in [5, 6, 7] to adapt the supply voltage and clock frequency to provide the proper performance for different application scenarios. When high performance is not needed, a sub-threshold supply voltage can even be applied to save energy at the cost of large delay and slow speed, as reported in [8, 9]. In order to lower the power consumption while maximizing the performance, energy efficiency and hardware utilization must be effectively improved in the post-Dennard era. The additional transistors provided by Moore's law should be utilized efficiently instead of only increasing the clock frequency of processors.
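The scaling relations in Table 1.1 can be made concrete with a short sketch (not part of the original thesis). It evaluates P = Q·C·F·V² over a few technology generations under both scaling assumptions; the scaling factor S and all values are normalized and purely illustrative.

#include <stdio.h>
#include <math.h>

/* Illustrative sketch: relative power of a fixed-sized chip after n
 * technology generations, following Table 1.1.  S is the assumed scaling
 * factor per generation; all quantities are normalized to 1.0 at n = 0. */
int main(void) {
    const double S = 1.4;                     /* assumed scaling factor */
    for (int n = 0; n <= 4; n++) {
        double q = pow(S, 2.0 * n);           /* transistor count: S^2 per generation */
        double c = pow(S, -1.0 * n);          /* capacitance:      1/S per generation */
        double f = pow(S, 1.0 * n);           /* frequency:        S per generation   */

        double v_dennard = pow(S, -1.0 * n);  /* Dennard: voltage scales with 1/S */
        double v_post    = 1.0;               /* post-Dennard: voltage stays flat */

        double p_dennard = q * c * f * v_dennard * v_dennard;  /* stays ~1        */
        double p_post    = q * c * f * v_post * v_post;        /* grows as S^(2n) */

        printf("gen %d: P_dennard = %.2f, P_post_dennard = %.2f\n",
               n, p_dennard, p_post);
    }
    return 0;
}

Under Dennard scaling the power of a fixed-sized chip stays constant, while in the post-Dennard case it grows by S² per generation, which is exactly the exponential increase discussed above.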

Figure 1.2: Trends of video applications. (a) Picture resolutions, from CIF and VGA through HD (1920x1080) and 4K UHD (3840x2160) to 8K UHD (7680x4320); (b) views, from single view through stereo to multiple views.

Therefore, architectural-level optimization for high energy efficiency becomes critical. This dissertation explores the design of ultra-low-power application-specific instruction-set processors (ASIPs) for ubiquitous sensing and computing in both high-throughput compute-intensive processing of multimedia and low-throughput low-cost processing of IoT scenarios. The low power consumption and high performance are realized by increasing the energy efficiency through customization and specialization at the architectural level for particular scenarios.

High-throughput applications

Multimedia processing in human-computer interaction always needs high-performance processors due to the extremely large data throughput. In the case of video processing, for example, the resolution of videos has increased from CIF (352x288) and VGA (640x480) to HD (1920x1080), and even to 8K UHD (7680x4320) [10]. Digital Cinema Initiatives (DCI) has already adopted 2K (2048x1080) and 4K (4096x2160) images [11]. NHK is going to carry out experimental broadcasting of Super Hi-Vision (7680x4320) in 2020 [12]. From another point of view, the included views of video streams can vary from a single view through stereo to multiple views, as shown in Figure 1.2.


Figure 1.3: One scenario of IoT

3D video is an important application and has a great impact on our daily life [13, 14, 15]. The increase in resolution and number of views has resulted in a tens- or even hundreds-fold increase in data volume. This trend will continue, since it is people's instinct to pursue higher quality and a more vivid experience. Considering the variety of similar applications and their rapid updates, flexibility is also a merit of the design.

Low-throughput applications

As one of the enabling technologies for ubiquitous sensing and computing, the Internet of Things (IoT) will connect everything in the surroundings seamlessly [16]. There will be billions or trillions of devices embedded in those “things”. Figure 1.3 shows one of the scenarios. Lightweight architectures capable of sensing, processing and communication are highly demanded in an extremely low-power and low-cost manner. Low power consumption is one of the most important characteristics required of IoT devices due to their limited power supply. Furthermore, a certain computational capability is still necessary for signal processing and communication. The processors should also feature flexibility and scalability to adapt to the various and vast numbers of IoT devices.

Architecture alternatives

Many architectures might be used for the aforementioned application scenarios. General-purpose processors (GPPs) have good programmability and can be easily applied to different applications. However, limited performance and high power consumption make them insufficient for high-throughput multimedia processing, particularly in portable devices [17, 18]. The low energy efficiency of GPPs is also an obstacle for IoT applications.

Digital signal processors (DSPs) enhance the computational capability for digital signal processing and keep good programmability as well [19]. But the performance is still not adequate for sophisticated multimedia applications, and DSPs are not suitable for IoT applications because their capability for general-purpose control is weak and they are not energy-efficient enough. Application-specific integrated circuits (ASICs) are superior to other architectures from the perspective of both performance and energy efficiency [20, 21, 22, 23, 24]. The drawback is the low flexibility and scalability when adapting to different applications, which results in high design and implementation cost as well as long time-to-market. Application-specific instruction-set processors (ASIPs) can improve the energy efficiency and provide flexibility as well [25, 26]. The programmable processor provides the adaptability for similar applications, while the customized function units guarantee the necessary performance. Implementations of ASIPs for multimedia and IoT have been reported in [27, 28, 29, 30, 31].

1.2 Motivation

The two mentioned scenarios represent two extremes of the application space. Different strategies are necessary for designing ultra-low-power, customizable, energy-efficient processors according to the implementation targets. This dissertation explores and verifies the design strategies by proposing and realizing the corresponding processors.

Networked multimedia processing is the performance-demanding application. Furthermore, there are many formats for the multimedia streams. In order to handle networked multimedia streams, ASIPs with high performance for high-definition video coding and flexibility for multiple formats should be designed. In this work, video accelerators supporting high-definition multi-standard video decoding are designed and implemented. These accelerators are controlled by specific instructions issued by the embedded processor, and both the number and the computational capability of the accelerators are adjustable to provide the needed performance.

As introduced in the last section, another trend for video applications is the utilization of multiple views of the same scene. The videos from different view directions are recorded simultaneously. Multi-view video coding (MVC) has been standardized as an amendment of the H.264/MPEG-4 AVC video coding standard [32], and can be used in 3D video, free-view video applications, etc. Compared with the architecture for single-view applications, the performance must be improved many times over to process video with multiple views. A reasonable solution is to employ multiple processors by exploiting the parallelism in MVC. Considering the huge volume of data exchanged among different processors, an efficient communication architecture is required. Network-on-chip (NoC) based multi-processors are thus used in the architecture proposed for MVC applications in this dissertation.

Figure 1.4: Applications and implementations. Networked multi-standard multimedia decoding is handled by a dual-core processor with video accelerators, multi-view video decoding by NoC-based architectures, and IoT applications by a dual-core control-centric processor with an integrated router.

For the applications discussed above, the design of processors is performance-driven, while from the IoT perspective, high energy efficiency under extremely low-power conditions is more important. At the same time, to adapt to different scenarios, the ASIPs should also feature flexibility and scalability. Based upon the preceding considerations, a reconfigurable and scalable control-centric dual-core processor with an on-chip multi-mode router is proposed and implemented. The high energy efficiency is achieved by increasing the hardware utilization of functional units, decreasing non-functional units, and reducing excessive global data transfer. The reorganizable functional units enable reconfigurability of the processor, and the on-chip multi-mode router extends the scalability of the processor even in the post-fabrication stage. Figure 1.4 summarizes the applications, together with the required features and the proposed systems in this dissertation.

1.3 Thesis contributions and organization

This dissertation is organized as follows:

Chapter 2

This chapter reviews the related work on the design of low-power ASIPs. The possible application scenarios, including multimedia, communication, biomedical applications, etc., are presented, followed by the implementation architectures. The two approaches for designing ASIPs are given at the end of the chapter.

Chapter 3

The design and implementation of an ASIP for networked multimedia processing are described in this chapter. The design flow for software/hardware co-design is first given, followed by an analysis of the application. The video processing accelerators are then customized and integrated with the embedded processor cores to construct a high-performance, low-power processor. The chip implementation and test system are demonstrated as the results.

• Contributions: The design strategy of utilizing customizable accelerators together with embedded processors is verified for applications where high performance and energy efficiency are required. As a result, an accelerator-enriched dual-core architecture is proposed for performance-demanding applications. A design instance for high-definition multi-standard video decoding is then fabricated and tested, showing high performance and low power consumption.

The included papers are:

• Paper I. Ning Ma, Zhibo Pang, Hannu Tenhunen and Li-Rong Zheng. “An ASIC-design-based configurable SOC architecture for networked media,” In Proceedings of International Symposium on System-on-Chip, pp. 1-4, Nov. 2008.

• Paper II. Ning Ma, Zhibo Pang, Jun Chen, Hannu Tenhunen and Li-Rong Zheng. “A 5Mgate/414mW networked media SoC in 0.13um CMOS with 720p multi-standard video decoding,” In Proceedings of IEEE Asian Solid-State Circuits Conference A-SSCC, pp. 385-388, 16-18 Nov. 2009

Chapter 4

This chapter explores the design of a multi-processor architecture for multi-view video decoding. It begins with a theoretical analysis of a simplified system, from which the exploration work flow is derived. To implement multi-view video decoding, both picture-level and view-level task assignment schemes are introduced and discussed for the homogeneous multi-processor architecture. Based on the exploration platform, the design options for each task assignment scheme are studied and summarized.

• Contributions: A multi-processor architecture is proposed and explored for multi-view video applications, which demand even higher performance. To the best of our knowledge, a scalable and flexible architecture for MVC decoding had not been studied before. It is also found that, to achieve a specific performance target, the computation and communication subsystems should be balanced so that all the resources can be utilized efficiently.

The included papers are:

• Paper III. Ning Ma, Zhonghai Lu, and Li-Rong Zheng. “System design of full HD MVC decoding on mesh-based multicore NoCs,” In Microprocessors and Microsystems (Elsevier), Volume 35, Issue 2, March 2011, Pages 217-229.

• Paper IV. Ning Ma, Zhonghai Lu, Zhibo Pang, and Li-Rong Zheng. “System-level exploration of mesh-based NoC architectures for multimedia applications,” In Proceedings of International SOC Conference (SOCC), pp. 99-104, 27-29 Sept. 2010

• Paper V. Ning Ma, Zhuo Zou, Zhonghai Lu, and Li-Rong Zheng. “Implementing MVC Decoding on Homogeneous NoCs: Circuit Switching or Wormhole Switching,” In Proceedings of 23rd Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP), pp. 387-391, 4-6 March 2015

Chapter 5

The energy-efficient processor for IoT applications is proposed and implemented in this chapter. It covers two topics: designing multi-mode routers for modular extension and proposing a processor architecture for high energy efficiency. The necessity of supporting both circuit and wormhole switching in one router to compose the inter-silicon network is discussed first, and the corresponding multi-mode router is then introduced. Aiming to increase the energy efficiency by improving the hardware utilization, the control-centric processor architecture is proposed. The implementation results of the module chip integrating both the processor and the router are demonstrated at the end of this chapter.

• Contributions: A quantitative comparison of circuit and wormhole switching is presented to give guidelines for choosing the proper switching mode in a specific scenario. To adapt to various application scenarios, a multi-mode router supporting both circuit and wormhole switching is proposed and implemented. The multi-mode router is further integrated with a reconfigurable dual-core control-centric processor to construct the processor tile. The tile can either be used independently or compose a large-scale multi-processor network to provide high performance, energy efficiency and flexibility.

The included papers are:

• Paper VI. Ning Ma, Zhuo Zou, Zhonghai Lu, and Li-Rong Zheng. “Design and Implementation of Multi-mode Routers for Large-scale Inter-core networks,” Minor Revision (requires no re-evaluation), Integration, the VLSI Journal (Elsevier), 2015

• Paper VII. Ning Ma, Zhuo Zou, Zhonghai Lu, Yuxiang Huan, Stefan Blixt and Li-Rong Zheng. “A 2-mW Multi-mode Router Design with Dual-core Processor in 65 nm LL CMOS for Inter-silicon Communication,” Manuscript submitted to Microelectronics Journal (Elsevier), 2015

• Paper VIII. Ning Ma, Zhuo Zou, Zhonghai Lu, Stefan Blixt and Li-Rong Zheng. “A hierarchical reconfigurable micro-coded multi-core processor for IoT applications,” In Proceedings of International Symposium on Reconfigurable and Communication-Centric Systems-on-Chip (ReCoSoC), pp. 1-4, 26-28 May 2014

• Paper IX. Ning Ma, Zhuo Zou, Yuxiang Huan, Stefan Blixt, Zhonghai Lu and Li-Rong Zheng. “A 101.4 GOPS/W Reconfigurable and Scalable Control-centric Embedded Processor for Domain-specific Applications,” Manuscript submitted to IEEE International Symposium on Circuits and Systems (ISCAS), 2016

Chapter 6

This chapter concludes the dissertation and indicates future work, ranging from processor architecture and programming models to application mapping. Brain-inspired computing will be the target of that work, to further decrease the power consumption while achieving high performance.


Chapter 2

Low-power application-specific instruction-set processors

For specific applications, many architectures can be employed as alternative implementations. An application-specific integrated circuit (ASIC) utilizes specialized hardware units and fixed data paths to achieve parallel and dedicated processing. High performance and energy efficiency are its advantages; the weaknesses are poor generality, high design complexity and non-recurring engineering (NRE) cost. Furthermore, as more transistors can be integrated on one single chip, the design complexity has greatly increased. Verifying and testing such large-scale circuits pose tremendous challenges for ASIC implementations [33, 34].

On the other hand, general-purpose processors (GPPs) serialize the operations of the arithmetic logic unit (ALU) through instructions for various applications. Wide generality is their merit, and implementing a specific application reduces to designing the sequence of instructions for the GPP. The time-to-market is greatly shortened compared with an ASIC implementation. However, GPP implementations suffer from low performance and poor energy efficiency because of the serial processing of instructions and the overhead of fetching and decoding them.

Application-specific instruction-set processors (ASIPs) try to balance performance, efficiency and generality for specific applications. Dedicated hardware accelerators, data paths or processing flows are customized for the related applications to provide sufficient performance and efficiency. Meanwhile, the specialized functions alleviate the instruction-processing overhead of GPPs. The instruction-controlled processing or reconfigurable logic guarantees generality for similar applications, and gains programmability and flexibility compared with ASICs. The comparison of ASIC, GPP and ASIP is shown in Figure 2.1.

Figure 2.1: Comparison of ASIC, GPP and ASIP in terms of performance, design complexity, energy efficiency and generality

2.1 Low-power techniques

Decreasing the working frequency or the supply voltage is the direct way to reduce power consumption, but it lowers performance as well. Clock gating [35, 36, 37] and power gating [38, 39, 40] are often employed to deactivate circuits when they are not used. Besides, dynamic voltage and frequency scaling (DVFS) tunes the operating conditions of processors at a finer grain to provide the proper performance in diverse application scenarios [41]. In [42], the processor can work at full speed with a supply voltage of 1.14 V and a frequency of 1 GHz, consuming 125.3 W; it can also work at a low power consumption of 24.7 W at 0.7 V and 125 MHz. Fine-grained adaptive DVFS is implemented in [43] by applying multiple DVFS configurations based on real-time power and temperature data collected through on-die power and thermal meters. The purpose is to achieve the maximum performance while keeping temperature and power under predefined constraints. In [44], an on-chip controller performs per-core DVFS by monitoring workload, temperature, voltage and current dynamically. The supply voltage in [45] can even be lowered to 210 mV for a JPEG encoder working at 5 MHz.

Heterogeneity is another way to lower the power consumption. ARM's big.LITTLE heterogeneous multi-core processor [46] runs the big core for short durations at peak frequency and the LITTLE core for the majority of the time at moderate operating frequencies. Up to 75 % of the energy can be saved without decreasing the delivered performance. Similarly, in [47] a high-performance CPU1 runs the performance-intensive tasks with much higher power consumption, and general applications are mapped to the power-efficient CPU2. Two architecturally identical but physically different cores are utilized in [48]. One core is implemented using worst-case corners to achieve the target frequency, while the other one uses typical corners to gain energy efficiency.

The resulting energy savings can be up to 20 %. Asynchronous design is proposed in [49] for neural signal processors. Handshaking protocol circuitry is added between function units for data exchange. Compared with the synchronous design, the asynchronous processor shows a 2.3x reduction in power.

Integrating customized application-specific function units gains both low power consumption and high performance. In [50], fine-grained parallelism is implemented by SIMD architectures to accelerate classic sequence alignment procedures. Together with the multi-core structure, 25x less energy is used while achieving similar performance to a quad-core Core i7 3820 processor. A data compression algorithm is mapped to a specific circuit in [51] for lowering the communication data rate in artificial vision systems. The energy reduction ratio can be up to 85 % compared with its base processor. Linear matrix and vector operations are enhanced for adaptive filters in neural coding in [52], where the proposed application-specific instruction-set processors provide high performance and flexibility at low power consumption compared with implementations on commercial CPUs.
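The savings obtained by the frequency and voltage scaling discussed above follow from the usual dynamic-power approximation P ≈ a·C·V²·f. The following minimal C sketch (not from any of the cited designs) shows a simple DVFS policy that picks the lowest operating point still meeting a frame deadline; the operating points, activity factor and capacitance are hypothetical values chosen only for illustration.

#include <stdio.h>

/* Illustrative sketch of a DVFS policy: choose the lowest voltage/frequency
 * pair that still meets the workload deadline, and estimate the dynamic
 * power as P = a * C * V^2 * f.  All numbers below are assumptions. */

typedef struct {
    double freq_mhz;   /* clock frequency  */
    double volt;       /* supply voltage   */
} OpPoint;

static double dyn_power_mw(OpPoint p, double act, double cap_nf) {
    /* P[mW] = a * C[nF] * V^2 * f[MHz] */
    return act * cap_nf * p.volt * p.volt * p.freq_mhz;
}

int main(void) {
    /* hypothetical operating points of an embedded core, lowest first */
    OpPoint points[] = { {27.0, 0.8}, {108.0, 1.0}, {216.0, 1.2} };
    double cycles_per_frame = 4.0e6;   /* assumed workload            */
    double deadline_ms      = 40.0;    /* e.g. 25 fps -> 40 ms/frame  */

    for (int i = 0; i < 3; i++) {
        double time_ms = cycles_per_frame / (points[i].freq_mhz * 1e3);
        if (time_ms <= deadline_ms) {  /* first (lowest) point that meets it */
            printf("run at %.0f MHz / %.1f V: %.1f ms/frame, ~%.0f mW dynamic\n",
                   points[i].freq_mhz, points[i].volt, time_ms,
                   dyn_power_mw(points[i], 0.2, 5.0));
            break;
        }
    }
    return 0;
}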

2.2 Applications

ASIPs are implemented for applications such as multimedia processing, communication, health care, etc.

High efficiency video coding (HEVC) is the new standard for video compression, and it can achieve a very high compression ratio at high implementation complexity. An ASIP is proposed in [53] targeting the acceleration of motion compensation, which is the most compute-intensive task in the video coding process. Three different models with diverse hardware configurations are evaluated to show the trade-offs between implementation cost and performance. Dedicated hardware accelerators for inter prediction and entropy coding can be found in the ASIP designed in [54] for the H.264/AVC video coding standard. Hardware accelerators work in parallel with the main processor and are coordinated through the customized instructions. Image detection is another application scenario. The Histogram of Oriented Gradients algorithm is run in the Support Vector Machine implemented by an ASIP in [55]. SIMD instructions, wide memory interfaces and VLIW extensions are added to a RISC processor to achieve high performance, power efficiency and programmability. An audio ASIP is implemented in [56] to decode MPEG-2/4 AAC audio streams. Specific instructions for inverse DCT, Huffman decoding, inverse quantization, etc. are integrated to accelerate the decoding speed.

Multi-standard multi-mode channel coding in wireless communication is another important application scenario for ASIPs. Two turbo decoders are presented in [57] with different design objectives. The TurbASIP supplies the maximum flexibility by supporting all of the different modes in existing wireless communication standards. Diverse parallelism techniques are exploited, and the devised architecture together with dedicated instructions is properly selected for the target flexibility.

The other turbo decoder, TDecASIP, aims to increase the efficiency by limiting the flexibility and supporting the turbo codes with the related parameters specified in the 3GPP LTE, WiMAX and DVB-RCS standards. Both the instruction set architecture and the memory organization are simplified to increase the execution efficiency. A multi-standard forward error correction ASIP processor is designed in [58], offering QC-LDPC decoding for WiMAX, Turbo decoding for 3GPP-LTE and convolutional code decoding. By analyzing the supported decoding algorithms, a general-purpose algorithm that can be further tuned to the specific decoding algorithm is selected, and the common parts are abstracted and implemented by shared functional blocks. Optimized instructions manage the operations and data paths of the functional blocks to implement specific decoding algorithms. In [59], a baseband ASIP processor is designed for multi-mode MIMO detection by parameterizing the dedicated hardware block for specific algorithms.

ASIPs can also be found in promising biomedical applications. In [60], a dual-core system is proposed for wearable sensor devices, which employs an ASIP core optimized for processing ECG, EMG and EEG signals and a low-power embedded processor as a controller. The energy efficiency is greatly improved compared with the implementation in a general-purpose processor. An integrated SoC is implemented in [61] for digital hearing aids. The embedded ASIP extends a RISC instruction set by adding dedicated hardware accelerators and compacting certain instructions to provide both flexibility and power efficiency for digital audio signal processing. In [51], a RISC processor and run-length coding accelerators compose the ASIP for data compression in artificial vision systems. Special instructions are designed for data transferring and run-length counting. The energy consumption is greatly reduced for both the compression and the decompression process compared with the implementation on the base RISC processor. DNA sequence alignment is the target application scenario in [62]. SIMD instructions for vector-vector and vector-scalar arithmetic, logic, load/store and control are implemented to exploit fine-grained parallelism in the related algorithms, and coarse-grained parallelism is achieved by utilizing a multi-processor architecture for querying multiple sequences.

Besides, in other applications such as compressed sensing [63], encryption [64], packet classification [65], etc., which are scenarios with an emphasis on both flexibility and energy efficiency, the utilization of ASIPs can meet the requirements properly.

2.3 Architectures

Figure 2.2 shows the architectures that are utilized for ASIPs. Dedicated hardware accelerators can serve as execution units invoked by application-specific instructions, as shown in Figure 2.2(a). In [66], a bit-stream engine designed for variable length coding in multiple video formats is integrated into a conventional general-purpose processor, where the bitwise computation is the bottleneck. The coding/decoding instructions share a common data path with the general-purpose instructions but use different execution paths. The processor gains a performance increase for variable length coding and maintains the flexibility as well.

Figure 2.2: Architectures of ASIPs: (a) accelerator, (b) SIMD, (c) VLIW, (d) multiple cores

For the application of software-defined radio in [67], a trigonometric math unit, a Viterbi complex unit and a floating-point control law accelerator are coupled to a general-purpose CPU. The ASIP can provide enough performance for the existing complex wireless communication algorithms and shows the possibility of implementing additional standards.

Single instruction multiple data (SIMD) architectures exploit data-level parallelism to accelerate compute-intensive tasks. Multiple data are operated on with the same operation on multiple execution units simultaneously under the control of one instruction. A vector function unit (VFU) is designed in [68] for the Fast Fourier Transform (FFT) used in various wireless communication standards. The VFU can perform sixteen 16-bit MACs or eight 32-bit MACs simultaneously. The memory bandwidth is enlarged to ease the data transfer between the vector register file and the main memory. In [69], arithmetic, logic and load/store SIMD instructions realize data-parallel processing on vectors of 8 or 16 elements, according to the size of the operated pixel block in one image, when implementing the Advanced Video Coding (AVS) standard. The tasks with a heavy computational load, such as motion compensation, deblocking, etc., are greatly accelerated by the SIMD instructions.

Figure 2.2(c) presents the very long instruction word (VLIW) architecture, which provides instruction-level parallel processing to increase the execution throughput and hardware utilization. Multiple independent instructions are issued and executed concurrently. In [70], the VLIW architecture is optimized for stereoscopic video applications. The basic instruction processing unit, i.e. the vector unit, is replicated to utilize possible instruction parallelism. A VLIW processor is implemented on an FPGA for image processing in [71]. The application is first realized using a high-level programming language. By analyzing the generated assembly code, the independent instructions that can be executed in parallel are bound into the very long instructions.

To further increase the performance, task- or thread-level parallelism can be utilized by multi-core or multiprocessor architectures. In [72], multiple ASIP cores are connected by a butterfly NoC network and can be dynamically configured through a dedicated communication structure. By selecting the suitable number of active ASIP cores and the corresponding configuration parameters, multi-mode and multi-standard turbo decoding have been implemented. Turbo decoders are also implemented in [73] by connecting ASIPs and memories with networks to increase the throughput. The proposed system employs a ring-style topology and enables simultaneous data exchange between the ASIPs and the memories. Multiple ASIPs are connected through a shared memory in [74] for H.264 video encoding. Each ASIP is responsible for coding the macroblocks (MB) in one slice of a picture. The MBs in different slices are processed by different ASIPs in parallel. Furthermore, several video accelerators are integrated into each ASIP to increase its processing performance. According to the throughput requirements for different video resolutions, the platform supports different configurations of the ASIP cores. A multi-core platform for multi-format video decoding is proposed in [75], where macroblock-row-level parallelism is investigated and utilized by multiple ASIP cores. Every single ASIP core features a dual-issue VLIW architecture and designated SIMD instructions for pixel-level acceleration.
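To make the data-level parallelism of Figure 2.2(b) concrete, the following C sketch (illustrative only, not taken from any of the cited designs) contrasts a scalar pixel loop with an 8-wide operation of the kind a single SIMD instruction would perform; the vector width and the saturating-add operation are assumptions.

#include <stdio.h>
#include <stdint.h>

/* A scalar core spends roughly one instruction per pixel, while a SIMD
 * unit applies the same operation to a whole vector of pixels at once.
 * The "vector op" is modelled here as a C function; in an ASIP it would
 * be one custom instruction operating on a vector register. */

#define VLEN 8  /* assumed SIMD vector width in pixels */

static void add_sat_u8_scalar(const uint8_t *a, const uint8_t *b,
                              uint8_t *out, int n) {
    for (int i = 0; i < n; i++) {               /* n iterations on a scalar core */
        int s = a[i] + b[i];
        out[i] = (uint8_t)(s > 255 ? 255 : s);  /* saturate to 8 bits */
    }
}

static void add_sat_u8_vector(const uint8_t *a, const uint8_t *b, uint8_t *out) {
    /* conceptually one SIMD instruction: VLEN saturating adds in parallel */
    for (int lane = 0; lane < VLEN; lane++) {
        int s = a[lane] + b[lane];
        out[lane] = (uint8_t)(s > 255 ? 255 : s);
    }
}

int main(void) {
    uint8_t a[VLEN] = {10, 250, 30, 40, 200, 60, 70, 80};
    uint8_t b[VLEN] = {10,  10, 10, 10, 100, 10, 10, 10};
    uint8_t r_scalar[VLEN], r_vector[VLEN];

    add_sat_u8_scalar(a, b, r_scalar, VLEN);  /* ~8 instructions' worth of work */
    add_sat_u8_vector(a, b, r_vector);        /* ~1 vector instruction          */

    for (int i = 0; i < VLEN; i++)
        printf("%u ", r_vector[i]);
    printf("\n");
    return 0;
}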

2.4 Approaches

ASIPs are implemented either by extending an existing processor or by customizing a new one. The former approach strengthens off-the-shelf processor cores with application-specific function units. Application-specific instructions together with the original general-purpose instruction set compose the final application-specific instruction set. The tool chain for software development is inherited from the existing processor. The design flow of this type of ASIP is shown in Figure 2.3. Based on the analysis of the applications, the time-consuming tasks are accelerated by the designated instructions, while the controlling tasks are managed by the general-purpose instructions.

In [50], 56 SIMD instructions for biological sequence alignment are implemented on a 32-bit Harvard RISC processor to efficiently utilize the fine-grained parallelism in the corresponding algorithms. By adding the specialized instruction set in the back end, an already existing compiler is extended to support the new ASIP. 40 carefully crafted specialized instructions for slice header processing are integrated into the base ISA of a 32-bit RISC processor (ARP32) for video decoding in [76]. Each customized instruction is mapped to an intrinsic function by the ARP32 compiler. The base ISA handles general-purpose controlling and processing, while the customized instructions deal with the complex slice header processing functions. The AES encryption algorithm is mapped to the vector units in the ASIP designed in [77]. By integrating the AES vector units in a 16-bit default processor, the proposed ASIP can be used in wireless sensor node applications, and the security service is provided by the add-on application-specific instructions.


Figure 2.3: Design flow for extending an existing processor

The other implementation approach is to customize a processor, which generates a new processor. The instruction set is tailored to support only the target applications. The control units, execution units, data paths and pipelines are all customized to maximize the performance for a cluster of applications. The software tool chain also needs to be adjusted to fit the new processor. The corresponding design flow is summarized in Figure 2.4. Compared with the previous design methodology, the totally new instruction set and architecture gain high performance and energy efficiency for the target applications. However, the target applications should include few general-purpose controlling and processing tasks, due to the limitations of the instruction set.


Figure 2.4: Design flow for customizing a processor

In [78], an ASIP designed for elliptic curve cryptography employs a finite-state machine (FSM) to control the execution of eight designated instructions implementing the functions of point addition and doubling. Different algorithms can be realized by programming in assembly code, and the final program is stored in the internal ROM. Multi-standard forward error correction (FEC) decoding is supported in the SIMD ASIP proposed in [79]. The computational units, data paths and memory system are all customized according to the common features of the FEC algorithms. In order to decode a specific format, assembly code is used to designate the different fields of the instructions for controlling the computation and the data flow. As a result, the proposed ASIP can provide high-throughput decoding of turbo, QC-LDPC and CC codes. The ASIP in [80] implements motion estimation (ME) for video codecs. Three instructions, including Sum of Absolute Differences (SAD), Compare, and Mode Decision, together with an efficient hardware architecture, are designed to handle the real-time processing of ME for HD video.

This dissertation utilizes both of the two approaches. Moreover, the proposed customizable control-centric processor supports both approaches simultaneously. Either existing general-purpose ISAs or completely customized ISAs can be mapped to the processor as needed.
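As a concrete illustration of the kind of operation such custom instructions replace, the following sketch (not from the cited designs) is a plain C sum of absolute differences over a 16x16 macroblock; an ME accelerator or a SAD instruction like the one mentioned above collapses the two inner loops into hardware and evaluates many candidate positions per cycle. The data used in main() is synthetic.

#include <stdio.h>
#include <stdlib.h>

/* Reference C implementation (illustrative only) of the sum of absolute
 * differences between a 16x16 current block and a candidate reference
 * block, the core operation of block-based motion estimation. */

#define MB 16

static unsigned sad_16x16(const unsigned char cur[MB][MB],
                          const unsigned char ref[MB][MB]) {
    unsigned sum = 0;
    for (int y = 0; y < MB; y++)
        for (int x = 0; x < MB; x++)
            sum += (unsigned)abs((int)cur[y][x] - (int)ref[y][x]);
    return sum;
}

int main(void) {
    unsigned char cur[MB][MB], ref[MB][MB];
    /* fill the blocks with synthetic data for demonstration */
    for (int y = 0; y < MB; y++)
        for (int x = 0; x < MB; x++) {
            cur[y][x] = (unsigned char)(x + y);
            ref[y][x] = (unsigned char)(x + y + 2);
        }
    printf("SAD = %u\n", sad_16x16(cur, ref));  /* 2 * 256 = 512 */
    return 0;
}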


Chapter 3

Single-ASIP implementation for multi-standard multimedia processing

3.1 Introduction

Large data volume is one of the most important features of multimedia applications. For high-definition (HD) video applications, if the resolution of the frames is 1920x1080 and the frame rate is 30 fps, about 93 MB/s of raw data has to be processed for the YUV 4:2:0 format. This is a great challenge for both transmitting and storing the data. Moreover, the requirement for higher resolution is growing, and ultra HD (UHD) has been proposed. For 4K UHD and 8K UHD, the data volume is 4 and 16 times that of HD video, respectively.

To make the data fit into the capacity of networks and storage devices, diverse effective compression algorithms have been proposed and applied. They employ advanced algorithms to gain a higher compression ratio, which results in a high implementation cost as well. There is always a trade-off between compression ratio and implementation complexity. However, since the computational capability of processors keeps increasing, more complex compression algorithms can be mapped to the processor. The result is that the data rate has been decreased significantly for the same quality; equivalently, the same data rate achieves much better quality.

There are mainly two organizations standardizing the video compression process: the ITU-T Video Coding Experts Group (VCEG) and the ISO/IEC Moving Picture Experts Group (MPEG), which developed the H.26x series and the MPEG series of standards, respectively. Figure 3.1 shows the performance of some of those standards, with compression ratios from [81]. Video coding has evolved from H.261, which only supports CIF (352x288) and QCIF (176x144), to the recently published H.265, which supports up to 8K UHD, and from MPEG-1, which could achieve a compression ratio of around 26:1, to MPEG-H Part 2 (H.265), whose compression ratio can reach around 400:1 [81].
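The raw-data-rate figures quoted above can be reproduced with a short sketch (not part of the original thesis): in YUV 4:2:0 each pixel costs 1.5 bytes on average (one luma sample plus a quarter of each chroma sample), so the uncompressed rate is width x height x 1.5 x frame rate.

#include <stdio.h>

/* Illustrative calculation of the uncompressed video data rate for
 * YUV 4:2:0 at a given resolution and frame rate (1 MB = 10^6 bytes). */
static double raw_rate_mbs(int w, int h, double fps) {
    return w * h * 1.5 * fps / 1e6;
}

int main(void) {
    printf("HD  1920x1080 @30fps: %7.1f MB/s\n", raw_rate_mbs(1920, 1080, 30));
    printf("4K  3840x2160 @30fps: %7.1f MB/s\n", raw_rate_mbs(3840, 2160, 30));
    printf("8K  7680x4320 @30fps: %7.1f MB/s\n", raw_rate_mbs(7680, 4320, 30));
    return 0;
}

The HD case evaluates to about 93 MB/s, and the 4K and 8K cases are exactly 4 and 16 times larger, matching the factors stated above.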


Figure 3.1: Video coding standards (relative compression ratio and supported picture resolution versus year of standardization)

Many video coding standards can thus be selected for different application scenarios, taking both the required performance and the implementation cost into consideration. There are also other popular video coding formats such as RealVideo, WMV, the VC series, the VP series, and so on. Various video formats exist simultaneously on the Internet. To process networked media, a flexible architecture supporting multi-standard multimedia processing is required. If ASICs are used for decoding multi-standard multimedia, a decoder for each standard has to be integrated [82, 83], which increases the implementation cost and hinders flexibility.

In this chapter, a dual-core ASIP with dedicated video processing accelerators is proposed and implemented to support multi-standard video decoding for network applications. High performance is guaranteed by the dedicated video processing accelerators on one hand, and the dual-core processor supplies the flexibility on the other hand. Working at full speed, the ASIP is capable of decoding real-time 1280x720@25fps MPEG-2/MPEG-4/RealVideo video streams at 216 MHz with a maximum power consumption of 414 mW, which can be lowered to 95 mW at a working frequency of 27 MHz for 352x288@25fps video decoding.

3.2 Design methodology

The design flow is shown in Figure 3.2, and the detailed description is as follows:


Figure 3.2: Design flow

• Application analysis: multimedia streams from the Internet need to be processed. There are many different formats for both audio and video. In this case, the MPEG-2/MPEG-4/RealVideo formats for HD video content are the targets.

• Function definition: the major functions for this application are network protocol processing, media data depacketization, multi-standard audio decoding and multi-standard video decoding.

• HW/SW partition: to make the architecture flexible, we use software to implement most of the required features, except the ones with extremely high performance demands. Hardware accelerators are needed for the variable length decoding, inverse quantization, inverse transformation, motion compensation and deblocking functions used in video decoding, as shown in Figure 3.3.


Figure 3.3: HW/SW partition

• HW/SW interface design: dedicated instructions are designed for the embedded processors to control the video accelerators.

• HW/SW implementation: the related software and hardware are implemented in parallel.

• HW/SW co-verification: during this phase, the whole system is verified for the application, and the maximum performance is tested. If the performance is not adequate for HD video decoding, adjustments are made between the software and the hardware to reach the required performance.

3.3 Video coding structure

To design suitable accelerators, the video coding algorithms are studied first. Most video coding standards employ block-based algorithms, which use pixel blocks of a certain size as the basic coding units. For example, 16x16-pixel blocks are commonly used in most of the standards, except the latest HEVC/H.265 standard, where blocks can be as large as 64x64 pixels. The coding structures of most standards are similar, as shown in Figure 3.4; the differences lie in the detailed implementation algorithms.

In the general structure of video compression, prediction is the most efficient technique to decrease the data volume. Instead of sending the raw data, only the residuals between the raw data and the predictions are put into the final stream. The prediction can be performed from the neighboring pixels in the same frame, which is known as intra-frame prediction or spatial prediction.


Figure 3.4: General structure of video coding

known as intra-frame prediction or spatial prediction. Considering the similarities between consecutive frames in a video sequence, the prediction can also be made by referring to previously coded frames, which is inter-frame prediction or temporal prediction. Furthermore, to get more accurate values for inter-frame prediction, motion estimation is performed between the coded frame and the current frame to find motion vectors, which indicate how the blocks move from the previous frame to the current one. The motion vectors help to achieve accurate prediction from the coded frames. To ensure that the decoder gets the same data as those at the encoder end, a decoder is included in the encoder structure, as shown in Figure 3.4 with dashed lines, and only the decoded frames are used as references for prediction.

The residuals after prediction are transformed from the spatial to the frequency domain to ease the further data compression in the quantization phase. The high-frequency part is quantized with a large step size because the human eye is less sensitive to high-frequency content. Finally, the quantized coefficients together with the motion vectors are coded using entropy coding and then formatted to compose the compressed stream.
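As a rough numerical illustration of such frequency-dependent quantization, a small Python sketch is given below. It is not the exact scheme of MPEG-2, MPEG-4 or RealVideo; the 8x8 block size, the step sizes and the distance-based frequency measure are assumptions made only for the example.

```python
import numpy as np

def quantize_block(coeffs, base_step=4.0, hf_scale=4.0):
    """Quantize an 8x8 block of transform coefficients, using a larger
    step size at high-frequency positions (coarser quantization where
    the eye is less sensitive)."""
    h, w = coeffs.shape
    # Distance from the DC (top-left) position as a crude frequency measure in [0, 1].
    freq = np.add.outer(np.arange(h), np.arange(w)) / (h + w - 2)
    step = base_step * (1.0 + (hf_scale - 1.0) * freq)   # larger step at high frequency
    return np.round(coeffs / step).astype(int), step

def dequantize_block(levels, step):
    """Decoder side: reconstruct approximate coefficients."""
    return levels * step

# Example: a synthetic coefficient block with a strong DC term.
block = np.zeros((8, 8)); block[0, 0] = 200.0; block[7, 7] = 30.0
levels, step = quantize_block(block)
recon = dequantize_block(levels, step)
print(levels[0, 0], levels[7, 7])   # the high-frequency coefficient is kept coarsely
```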

Figure 3.5: Accelerator-enriched architecture


Figure 3.6: Variety of data transferring

3.4 Accelerator-enriched architecture

To implement applications with requirements of high performance and energy efficiency, an accelerator-enriched architecture is proposed, as shown in Figure 3.5. The CPU group is responsible for high-level task parsing and processing. The application-specific function units (FU) together with the tightly coupled data memories (DM) provide high performance and energy efficiency for low-level data processing. They are controlled by the embedded processor (DSP or CPU) through application-specific instructions, and can work in parallel to maximize the utilization. The deployment of an instruction memory (IM) decouples the embedded processor from the FUs, and enables both parts to run independently. To avoid global data movement, neighboring FUs share a DM in between, and an FU can access both of its neighboring DMs, as shown in Figure 3.6(a). Furthermore, any two of the DMs can exchange local data managed by the DMA controller (DMAC), as shown in Figure 3.6(b). If data transfer with external memories is required, DMA operations are issued by the DMAC for fast data access, as shown in Figure 3.6(c).
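A minimal behavioural sketch of this memory arrangement is shown below; the class names, memory sizes and operations are illustrative assumptions, not the actual hardware interfaces.

```python
class DataMemory:
    """A local data memory (DM) shared by two neighbouring function units."""
    def __init__(self, size):
        self.words = [0] * size

class FunctionUnit:
    """A function unit (FU) that reads its left DM and writes its right DM,
    so results flow to the next FU without a global data move."""
    def __init__(self, name, left_dm, right_dm, op):
        self.name, self.left, self.right, self.op = name, left_dm, right_dm, op

    def process(self, n):
        for i in range(n):
            self.right.words[i] = self.op(self.left.words[i])

class DMAC:
    """DMA controller: copies data between any two DMs (or external memory)."""
    @staticmethod
    def transfer(src, dst, n):
        dst.words[:n] = src.words[:n]

# Example: two FUs chained through a shared DM, then a local DMA exchange.
dm0, dm1, dm2 = DataMemory(16), DataMemory(16), DataMemory(16)
dm0.words[:4] = [1, 2, 3, 4]
fu1 = FunctionUnit("FU1", dm0, dm1, lambda x: x * 2)
fu2 = FunctionUnit("FU2", dm1, dm2, lambda x: x + 1)
fu1.process(4); fu2.process(4)
DMAC.transfer(dm2, dm0, 4)
print(dm2.words[:4])   # [3, 5, 7, 9]
```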

3.5 Design instance

Based on the proposed architecture and analysis of the application, a design instance for networked multimedia processing is shown in Figure 3.7.


Figure 3.7: Design instance for networked multimedia processing

One of the RISC processors, HRISC, is utilized to run the operating system, manage network devices, configure the system environment, process network protocols, depacketize multimedia data, etc. The other RISC processor, MRISC, mainly works on the multimedia processing. Multi-standard audio decoding and low-definition multi-standard video decoding are implemented on the MRISC. For high-definition video decoding, the hardware video decoding accelerators are activated. In that case, MRISC analyzes the video stream, collects the necessary parameters and assigns tasks to the corresponding accelerators by writing customized instructions to the CMD cache. The video enhancement unit (VEU) consists of the VLD, IQ, IT, MC and DF modules, which are derived from the general decoder model in the last section and are responsible for high-definition video decoding. The VEU is designed by adding the individual specific features of MPEG-2/MPEG-4/RealVideo to the common structure. The VEU parses instructions from the CMD cache and performs operations on the data stored in the corresponding buffers. These instructions differ from normal instructions, which are generally simple arithmetic, logic, or data moving operations. For each module in the VEU, one instruction includes the required parameters and operations for a whole coding block. When processing starts, each module performs the corresponding operations for one coding block and writes the results to the proper memory space without requiring the processors to be involved. In order to make the proposed architecture capable of decoding high-definition videos featuring a huge data volume, an efficient data communication structure is necessary. In the proposed architecture, internal buffers are placed between the VEU modules to localize the video data and accelerate the computation. The data transfer between internal buffers and external memories is managed by the DMA module. Due to the deployment of internal buffers, the modules in the VEU are able to work in parallel in a pipelined way at the block level, as shown in Figure 3.8, and at most five blocks can be processed simultaneously by the VEU.

Figure 3.8: Video decoding pipeline
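To make the block-level pipelining described above concrete, a small scheduling sketch follows. Treating every VEU stage as taking one time slot per block is a simplifying assumption, not a measured property of the hardware.

```python
STAGES = ["VLD", "IQ", "IT", "MC", "DF"]

def pipeline_schedule(num_blocks):
    """Return, for each time slot, which block occupies which VEU stage.
    With five stages, at most five blocks are in flight simultaneously."""
    schedule = []
    for t in range(num_blocks + len(STAGES) - 1):
        slot = {}
        for s, stage in enumerate(STAGES):
            block = t - s
            if 0 <= block < num_blocks:
                slot[stage] = block
        schedule.append(slot)
    return schedule

for t, slot in enumerate(pipeline_schedule(6)):
    print(f"t={t}: " + ", ".join(f"{stage}<-blk{b}" for stage, b in slot.items()))
```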

Flexibility

The flexibility of the proposed architecture is three-fold:

• Dual-core processor: when high performance is not necessary, the VEU can be turned off, and two embedded RISC processors are fully programmable for various multimedia applications.

• Activity of VEU modules: the number of active modules in VEU can be adjusted for the specific performance requirement. The unnecessary modules can be turned off to save power.

• Working frequency of each module: the working frequency of the modules in the VEU, together with the two embedded processors, can be tuned according to the specific applications to provide adequate performance while keeping power consumption at a low level.

3.6 Implementation results

The ASIP is implemented in 0.13 µm CMOS technology with a gate count of around 5 million, and the die size is 6.4 mm x 6.4 mm, as shown in Figure 3.9 together with the power consumption. The maximum working frequency is 216 MHz, at which 1280x720 @ 25 fps MPEG-2/MPEG-4/RealVideo streams are decoded with a power consumption of 414 mW. Only 13% is consumed by the VEU, which shows the efficiency of the hardware accelerators. For low-resolution videos, such as CIF-sized pictures with a resolution of 352x288, the working frequency can be lowered to 27 MHz, at which point the power consumption is only 95 mW. The test system and application scenario are demonstrated in Figure 3.10.

Figure 3.9: Die photo and power consumption

NRISC32: network, UI; MRISC32: media streams parsing; VEU: MB-level video decoding

Figure 3.10: Test system and application scenario. (Adapted from [84])

3.7 Summary

To implement networked multi-standard multimedia processing applications, a dual-core ASIP with high performance and high flexibility is proposed and implemented. Highly specific accelerators provide the necessary performance. The generality of the accelerators enables adaptability to other similar applications. In addition, high flexibility is provided by the embedded processors. The balance between performance-to-power ratio and flexibility is made by software/hardware co-design.

[Figure: clock cycles needed for (a) MPEG-2 and (b) RealVideo decoding of test sequences at HD, SD, CIF and QCIF resolutions versus system frequency, with the frequency limit indicated]

Memory organization and data transmission structure are also key issues for applications with high throughput requirements. In this architecture, distributed internal buffers are deployed to increase the processing throughput, and DMA is implemented to speed up the data transmission between internal buffers and external memories. The buses connect memories and functional units, and the communication throughput is adequate for this application. But for applications with heavy communication load, a more efficient communication infrastructure will be explored in the next chapter.

Chapter 4

Multi-processor architecture exploration for multi-view video decoding

For applications with even higher performance requirements, multiple ASIPs can be connected to increase the performance while providing sufficient flexibility. This chapter explores multi-processor architectures for multi-view video processing, including design space exploration and implementation of the communication infrastructure.

4.1 Background

3D video applications are emerging in our daily life, such as 3DTV, 3D movies, 3D football matches, etc. To present 3D video, the pictures from two or more views have to be recorded, coded and transmitted. Compared with single-view video, the data volume and throughput are multiplied many times. Because similarities exist between pictures from different views, many algorithms have been proposed to compress multi-view video efficiently by making use of the correlations between different views [85]. As a result, the performance of decoders needs to be improved many times compared with that required for single-view video decoding. This poses challenges not only for the processing units but also for the communication architectures. Some sophisticated ASICs have already been reported for decoding MVC video in [21, 86, 87]. High-speed, high-throughput circuits can provide adequate performance for high-definition multi-view video decoding. However, the design complexity is considerable, and flexibility and scalability are limited. By exploiting parallelism and employing multi-processor architectures, the applications can be implemented in another way. We adopt this approach, which has several advantages:

• Low design complexity: the design problems are simplified to choosing a single ASIP with the proper performance, selecting the necessary number of ASIPs and connecting those processors with an efficient communication structure.

• High flexibility: these processors can be re-programmed to adapt to similar applications.

• High scalability: the on-demand performance can be provided for a particular application by adjusting the number of processors.

Multi-view video coding (MVC)

As Annex H of H.264 [32], MVC inherits all the sophisticated coding algorithms of H.264 for single-view video coding. Besides, inter-view prediction is utilized as a very efficient technique to obtain a large compression ratio [88]. For single-view video coding, inter-frame prediction uses temporally neighboring frames as the reference to predict the data in the current frame, which greatly increases the coding efficiency compared with techniques for still image encoding. In MVC, not only temporally neighboring frames but also spatially adjacent frames, i.e. frames from neighboring views, are utilized to predict the current frame. The merit is that coding efficiency is significantly increased. The drawback is the complicated coding structure, which is an obstacle for parallel processing. A typical prediction structure in one group of pictures (GOP) is shown in Figure 4.1 for eight views, with 64 pictures in total in one GOP. In the base view, encoding or decoding of the pictures does not rely on pictures from other views. A single-view decoder can process only the data in the base view, which keeps the bit stream compatible with normal single-view decoders. For anchor pictures, temporal prediction is not applied, and only inter-view prediction can be utilized. The purpose is to prevent error propagation between GOPs and to support random access among GOPs. For the other pictures, both temporal inter-frame prediction and spatial inter-view prediction are employed to reduce data redundancy. As a result, MVC streams are characterized by a larger number of pictures to be processed and complex dependences among the pictures. To handle MVC streams, several times the data processing throughput is required compared with single-view streams.

Network on Chip (NoC)

To connect multiple ASIPs, an efficient data communication infrastructure should be designed to provide sufficient throughput. In [50], multiple ASIPs are connected via a bus structure together with a shared memory for data exchanging, as well as a master core for managing the work flow and collecting the results. Burst mode and DMA operations are provided in the bus system for data transmission to reduce the possible contention. However, for applications with heavy data load,


Figure 4.1: Typical prediction structure; one square represents one picture, and arrows show the reference relations.


Figure 4.2: An example of 4x4 2D mesh NoC


Figure 4.3: Mapping multimedia applications to NoC; (a) pipelining processing phases, (b) tile processing

the non-sharable and non-scalable bus structure becomes the bottleneck in a multi-ASIP system. By connecting the ASIPs with an on-chip network, a network-on-chip (NoC) [89] can supply high communication bandwidth, high efficiency and scalability.

NoC architectures decouple the communication subsystem from the computation subsystem. Data communication is performed in a dedicated on-chip network, and the computation part only needs to interact with the network interface for data exchanging. Design problems related to the communication subsystem include the selection of topology, task assignment scheme, switching modes, flow control, routing algorithms, and so on [90]. In this work, we choose the 2D-mesh topology considering its intrinsic modularity and scalability. An example of a 4x4 2D-mesh network is shown in Figure 4.2. Besides, to implement NoCs for specific applications, design decisions for the computation part, such as choosing the processing capability of each node, selecting a suitable number of nodes, etc., also need to be taken into consideration. The computation capability and communication capacity should be balanced to achieve optimized overall performance for the whole system.

4.2 Design space exploration

There are two strategies for implementing multimedia applications on NoC architectures. One is partitioning the whole process into several phases, implementing different phases on different NoC nodes and pipelining those phases to gain performance improvement, as shown in Figure 4.3(a). The merit is the simple traffic pattern and deterministic data dependency between consecutive nodes in the network, while the drawback is the limited scalability, constrained by the partitioned processing phases. Examples of this mapping strategy can be found in [91, 92]. Another


Figure 4.4: Simplified system: (a) the whole system, (b) one unit, (c) work schedule with inadequate link bandwidth, (d) work schedule with excessive link bandwidth; Sd is the task size and td its dispatching time; Sc is the result size and tc its collecting time; tp is the processing time. (Adapted from [93])

implementation strategy is based on parallel tile processing, shown in Figure 4.3(b). Each tile is capable of dealing with the whole process, and the performance is increased by enabling concurrent tile processing. The scalability is extended at the cost of complex traffic patterns and a possibly large volume of data transferred among the tiles. Emphasizing scalability and flexibility, the second strategy is utilized for the following exploration.

Exploration for a simplified system

A theoretical analysis is first performed to formulate the exploration work flow. To ease the theoretical analysis, a simplified system as shown in Figure 4.4(a) is studied to explore the design space. The data dependency between processing tiles is eliminated, which means the processing tiles get all the necessary data from the task distribution node and send the results only to the converging node. Assuming that the task size (Sd) is larger than the result size (Sc), the relation


Figure 4.5: Work flow for the system exploration

of the system processing throughput (Θsp), single-node processing throughput (θp), physical link bandwidth (BWlink) and number of processing nodes (Np) has been theoretically analyzed in [93]. For the case of Sd smaller than Sc, a similar analysis can be applied. Combined with simulation, for a given target system processing throughput (Θsp), optimized configurations of θp, BWlink and Np can be explored through the procedure shown in Figure 4.5.

1. By performing a theoretical analysis as that in [93], the reference value of link bandwidth (BWlinkγ ) is calculated from the given value of target system processing throughput.

2. For each network with certain size (Np), based on BWlinkγ and the relation of BWlink versus θp and Np, the minimum required value of θp is acquired.


Figure 4.6: Hierarchical structure of MVC (Adapted from [94])

3. For each Np and θp, sweep BWlink around BWlinkγ and evaluate the configurations on the simulation platform.

4. If the simulated system processing throughput is larger than the given value, stop sweeping BWlink, change the network size and go to step 2 to find another group of BWlink, θp and Np. (A code sketch of this exploration loop is given after the list.)
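The sketch below follows the four steps above. The functions reference_bandwidth, min_node_throughput and simulate are hypothetical stand-ins for the theoretical model of [93] and the simulation platform; they are passed in as callables because their internals are not specified here.

```python
def explore(target_tp, net_sizes, bw_sweep_factors,
            reference_bandwidth, min_node_throughput, simulate):
    """Find (Np, theta_p, BW_link) configurations whose simulated system
    processing throughput reaches the target, following steps 1-4 above."""
    bw_ref = reference_bandwidth(target_tp)               # step 1: reference bandwidth
    configs = []
    for n_p in net_sizes:                                  # step 2: per network size
        theta_p = min_node_throughput(bw_ref, n_p)         # minimum single-node throughput
        for factor in bw_sweep_factors:                    # step 3: sweep around bw_ref
            bw = factor * bw_ref
            if simulate(n_p, theta_p, bw) >= target_tp:    # step 4: check simulated result
                configs.append((n_p, theta_p, bw))
                break                                      # stop sweeping, try next size
    return configs

# Usage (with user-supplied model/simulator callables):
#   explore(target_tp, net_sizes=[9, 16, 25], bw_sweep_factors=[0.8, 1.0, 1.2], ...)
```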

Exploration for MVC implementation

Based on the aforementioned analysis, the NoC system for MVC decoding is explored accordingly, and the target is to find optimized configurations of the system to implement full HD (1920x1080) MVC decoding at 30 fps (frames per second). Figure 4.6 shows the hierarchical structure of MVC streams. The basic processing unit is the macroblock (MB), which consists of 16x16 pixels. Neighboring MBs or MBs in neighboring frames may be utilized to generate the current MB. One picture is composed of a number of MBs. If only spatial prediction is employed, the picture is an I frame. If only forward pictures in the time domain can be used as the reference to predict the current picture, the picture is a P frame. If both forward and backward pictures in the time domain are selectable as reference pictures, the current picture is a B frame. Thus, at the picture level, decoding an I frame requires no data from other frames, while decoding P and B frames requires data from other frames. In MVC, pictures further constitute views, as introduced in the previous section. Decoding the stream of one view may need to refer to data from other views. On top of views, GOPs are processed independently of each other, and can be executed in parallel. Task assignment at different levels of MVC will generate diverse traffic patterns, and result in varied configurations of the system. If the task assignment is at the MB level, frequent data transfers lead to performance degradation. On the other hand, GOP-level task assignment needs large storage space in each node and causes a large processing delay as well. So the picture-level and view-level task assignments are studied for implementing MVC in the following sections.

Set index: 1, 2, 3, 4, 5, 6, 7, 8; number of pictures included: 1, 2, 5, 10, 12, 14, 12, 8

Figure 4.7: Picture-level parallelism for the typical MVC prediction structure (Adapted from [94])

Picture-level task assignment

Data dependences exist not only between pictures of the same view, but also between pictures in different views. However, by analyzing the dependences of pictures in one GOP, the pictures can be subdivided into several sets, and in each set the pictures are independent of each other [95]. The pictures with a larger set index may need the pictures with a smaller index as the reference, as shown in Figure 4.7. To process each set, the dispatching node sends all the necessary data to a processing node for single-picture decoding. Results are collected by the converging node. Only after all the pictures with a smaller set index are processed can the tasks for processing the pictures with a larger set index be issued. A shared memory is deployed between the dispatching and converging nodes. A possible data flow for picture-level task assignment is depicted in Figure 4.8. Picture-level task assignment in this work only issues tasks of decoding pictures in the same set to processing nodes, and transferring data between processing nodes is avoided. In this case, the exploration algorithm proposed for the simplified system can be directly utilized for the analysis. After the exploration in [94], possible configurations of the system are summarized in Figure 4.9. Both 3x3 and 4x4 mesh NoCs can be selected, depending on the physical link bandwidth and single-node performance. With 3x3 networks, the minimum performance requirement for a single node is the capability of processing a full HD single-view stream at 50 fps, corresponding


Figure 4.8: A possible data flow of core (1,3) for the picture-level parallel processing (Adapted from [94])


Figure 4.9: Design options for picture-level task assignment; Γ for network size, Φp for single-node performance and BL for link bandwidth (Adapted from [94])

to a link bandwidth of 78.6 Gbps. If the link bandwidth cannot reach 78.6 Gbps, the minimum requirement is 38.4 Gbps. In that case, the configuration can be either a small network (3x3) with high single-node performance (60 fps) or a large network (4x4) with relatively low single-node performance (50 fps). With 4x4 networks, the requirement for single-node performance can be lowered to 40 fps. It means that, if one processor was initially designed to implement single-view full HD


Figure 4.10: A possible data flow of core (1,3) for the view-level parallel processing (Adapted from [94])

decoding at 30 fps, then supporting eight-view MVC decoding requires only a 33% performance improvement.
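A simplified sketch of the set-ordered dispatching described above follows. The per-set picture counts are those listed for Figure 4.7, while the round-robin node assignment and the print-based barrier are simplifying assumptions rather than the scheme used on the real platform.

```python
PICTURES_PER_SET = {1: 1, 2: 2, 3: 5, 4: 10, 5: 12, 6: 14, 7: 12, 8: 8}

def dispatch_gop(num_nodes):
    """Issue picture-decoding tasks set by set: pictures inside one set are
    independent and run in parallel, but a set is issued only after all sets
    with a smaller index have been collected by the converging node."""
    for set_idx in sorted(PICTURES_PER_SET):
        tasks = [(set_idx, pic) for pic in range(PICTURES_PER_SET[set_idx])]
        # Round-robin the independent pictures of this set over the processing nodes.
        for i, task in enumerate(tasks):
            node = i % num_nodes
            print(f"node {node}: decode picture {task}")
        # Barrier: wait for the whole set before issuing the next one.
        print(f"-- set {set_idx} converged --")

dispatch_gop(num_nodes=9)   # e.g. a 3x3 mesh
```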

View-level task assignment

The view-level task assignment scheme issues single-view decoding tasks to each processing node. Considering the dependences between views, data exchange among the processing nodes is necessary. Figure 4.10 shows an example of the data flow for decoding view 2 on node (1,3). The task dispatching node sends coded streams to node (1,3) for decoding, and meanwhile, data from node (1,0) for view 0 are also sent to node (1,3) as the reference. Furthermore, decoded streams need to be sent to the converging node for result collection and to nodes (0,1) and (3,1) as the reference for view 1 and view 3. The view-level task assignment scheme complicates the traffic pattern in the network. However, the exploration algorithm for the simplified system can still be utilized as a preliminary analysis. After that, by tuning the single-node performance and physical link bandwidth, optimized configurations of the NoC system can be obtained from the simulation platform to meet the performance requirement. Feasible configurations of the system using the view-level task assignment scheme are shown in Figure 4.11. For eight-view MVC decoding, if only one GOP can be processed at a time, at most eight processing nodes are needed, which is limited by the view number. In this case, increasing the network size will not help to increase the system performance. This explains the fact that options 1 and 3, as well as options 2 and 4, have the same requirement for single-node performance and


Figure 4.11: Design options for view-level task assignment; Γ for network size, Φp for single-node performance and BL for link bandwidth; n-GOP means n GOPs can be processed at the same time (Adapted from [94])

link bandwidth, even though the network size is different. To utilize more processing nodes, some GOPs should be processed in parallel. Option 7 shows the case of three GOPs being processed simultaneously with a 5x5 network, when the requirement for single-node performance is only 20 fps for single-view stream processing and the link bandwidth is 9.6 Gbps. However, since introducing GOP-level parallelism increases the implementation cost and the processing delay, the maximum number of GOPs processed in parallel should be constrained to a small value.

4.3 Communication network

Besides the network topology, the selection of switching technique also has a large effect on both the performance and the implementation cost. Circuit switching (CS) and wormhole switching (WS) are two commonly used techniques for on-chip networks. Circuit switching reserves the whole end-to-end path before starting to send data, and the data transmission uses a dedicated link without sharing it with other transmissions. The advantages are the lightweight architecture and small transmission delay. The disadvantages are the large link setup delay and lower network utilization. In contrast, in wormhole switching a data transmission is divided into many flits and starts immediately once there is any free buffer in the downstream node. A header flit contains the routing information, and body flits only need to follow the header flit to reach the destination node. To increase the utilization of the network, virtual channels (VC) can be deployed to share the physical channels. Different data transmissions can use different virtual channels but share one physical channel. The merits are the zero setup delay and higher network utilization. The drawbacks are the uncertain transmission delay and the relatively high implementation cost.
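A first-order latency model that reproduces this trade-off qualitatively is sketched below. The per-hop setup cost and the higher effective per-flit cost assumed for WS are illustrative knobs only, not values taken from the thesis or from [96]; whether WS really pays a higher per-flit cost depends on the router implementation.

```python
def cs_latency(hops, flits, setup_per_hop=10, cycles_per_flit=1.0):
    """Circuit switching: a one-time end-to-end path setup (forward request
    plus backward acknowledgement), then the message streams at link rate."""
    return 2 * hops * setup_per_hop + flits * cycles_per_flit

def ws_latency(hops, flits, route_per_hop=3, cycles_per_flit=1.2):
    """Wormhole switching: no setup phase; the header is routed hop by hop and
    body flits pipeline behind it, here with an assumed per-flit overhead."""
    return hops * route_per_hop + flits * cycles_per_flit

# With these assumed numbers, small messages favour WS and very large ones CS,
# and the crossover moves to larger messages as the hop count grows.
for flits in (10, 100, 1000):
    print(flits, cs_latency(hops=4, flits=flits), ws_latency(hops=4, flits=flits))
```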


Figure 4.12: Router structures of (a) circuit switching and (b) wormhole switching (Adapted from [97])

Circuit switching is suitable for applications with large, infrequent data transmissions; otherwise, wormhole switching should be used [96]. For a specific application scenario, the decision on the suitable switching technique still needs to be made by evaluating performance and cost. In this case, the analysis is performed for MVC decoding in [97]. The simulation router models of CS and WS used for evaluating the performance and cost are shown in Figure 4.12. Five pairs of in/out ports connect four neighboring routers with the local processor. In the CS router, any one of the output ports can only be linked to one input port during one transmission. For the WS router, a number of VCs share one input port, and one output port can be associated with several VCs during data transfer. To find the suitable switching technique for MVC decoding, the optimized configuration of WS in terms of link bandwidth, buffer depth and number of VCs is first explored. Then, the optimized structure of WS is compared with CS on decoding speed, network utilization and delay for decoding MVC streams with the same NoC topology. The results indicate that employing CS and WS achieves similar performance for MVC decoding. Considering its simpler architecture, which results in lower implementation cost, CS should be used when implementing MVC decoding with NoC architectures.

4.4 Summary

By connecting multiple ASIPs in an NoC manner, the system provides high performance together with scalability and flexibility. To implement MVC decoding with a single ASIP, the performance has to be improved by a factor determined by the number of views supported. With multi-processor NoC architectures, the required performance improvement can be much smaller, both with picture-level and view-level task assignment schemes. Meanwhile, to gain the necessary performance for the whole system, design options including single-node performance, network size, link bandwidth and task assignment schemes are balanced. If the single-node performance is not enough, it can be compensated by choosing a suitable task assignment scheme, network size and link bandwidth, which is also applicable when the network size or link bandwidth is inadequate. For the communication structure, circuit switching rather than wormhole switching is appropriate for MVC decoding, considering their similar performance and the lightweight architecture of CS.


Chapter 5

Customizable tile-ASIP implementation for Internet of Things

In this chapter, we focus on the implementation of a low-power customizable ASIP for the Internet of Things (IoT). In the vision of IoT, everything will be connected and can talk to each other [98, 99]. The ubiquitous sensing, processing and communication should be seamlessly performed by the attached devices, which demands hardware architectures with characteristics like a small memory requirement, flexibility and adaptability to various devices with diverse performance demands, and low power consumption but sufficient computational performance for functions such as digital signal processing, communication protocol processing, encryption, etc.

5.1 Background

Modular design

In IoT applications, the demanded processor performance varies greatly across devices, from sensor nodes to servers, and across the vast number of application scenarios. The straightforward solution is to design and implement a specific processor for each application scenario. However, the design, verification and mask costs are extremely high and become the obstacles for this solution [100]. A modular design, instead, utilizes mature extensible modules to compose a specific system for the target performance and cost. A suitable number of modules and proper connections can be selected after analyzing the application. The implementation costs are greatly decreased. The challenge, as a result, is the implementation of such extensible and efficient modules. The hardware-level composition of the system should be transparent to the high-level programming, and the performance and power efficiency should also be close to those of a single-processor implementation.

The concept of “Lego Chip” is proposed in [100]. In this concept, a system including two or more chips behaves like a single SoC. Diverse but mature Lego Chips are seamlessly connected via the highly efficient Lego Interconnects to achieve the target performance. The engineering cost can be dramatically decreased by using that technique. The XMOS architecture and its implementation are introduced in [101]. Multiple processors can be connected with point-to-point communication links, which allows the system to be built on a multiprocessor chip, in packages, on boards or as a distributed system. A number of threads can be directly mapped to the hardware resources in each core and controlled by a hardware scheduler. The event-driven threads are able to reduce power when waiting for events. For constructing the multi-core system, each core has its global address and can communicate with the others through control and data packets. Networks-in-package are implemented in [102] by connecting four NoC chips with off-chip links in one package to demonstrate the scalability and modularity of the system. For each NoC chip, heterogeneous function units are integrated together with the packet-switched communication infrastructure. If higher performance is required, a suitable number of NoC dies can be clustered to construct the networks-in-package. This alleviates the problems of very large scale circuits such as low yield and high cost.

Processor architecture

When the computer was first introduced, the memory size was limited and the compilers were very simple. Many instructions which could perform various complex functions were designed together with the necessary hardware units. Those instructions constitute the basic instruction set architecture (ISA), based on which applications could be implemented. In order to make the memory requirement smaller and the programming easier, abundant instructions were included in the ISA for various functionalities. In this case, due to the complex operations implemented by different instructions, multiple cycles are needed to execute one instruction and the execution time for different instructions varies, which makes the execution of instructions difficult to pipeline. Compared with the aforementioned design methodology, which is called complex instruction set computer (CISC), the reduced instruction set computer (RISC) simplifies the functionality of instructions. Most of the instructions finish in a single cycle. The hardware design is simplified. Complex functions are implemented by combining the simple instructions in a suitable way, which is done by a compiler. The memory requirement for storing executable code is thus larger and the code density is lower in RISC [103]. With the rapid development of semiconductor technology, the speed and number of transistors have greatly increased for single chips. The memory space for storing code is not a problem any more. By pipelining the execution of instructions, the speed of RISCs can be very fast due to the simple hardware architecture. Caches are deployed to fill the gap between the extremely fast executing core logic and the

relatively low-speed memories. Owing to the simple hardware architecture and fast execution speed, RISCs became popular for embedded applications. For the IoT applications in this case, lightweight architectures with low power consumption and high energy efficiency are highly demanded. Memories account for a large proportion of the power consumption [104]. Small memories should be used considering both the area and the power consumption. The utilization of hardware units should be exploited as much as possible. Meanwhile, due to the variety of application scenarios in IoT, reconfigurability and scalability are also of importance for IoT processors.

In this work, a reconfigurable and scalable control-centric processor with an integrated multi-mode router is proposed and implemented as one tile for possible modular extension. The router supports circuit and wormhole switching simultaneously to fit the different traffic patterns of various application scenarios. The techniques that ease modular extension also apply to the multi-mode router. Aiming at high energy efficiency, the processor simplifies the control logic, and the basic functional units are directly managed by the sequence mapping table (SMT) saved in internal storage of the processor. Decoding hardware is thus avoided. By designing the corresponding SMT, different ISAs can be implemented on the processor. Code density is increased through optimizing the SMT for specific applications. Dedicated SMTs can even be mapped to the application directly when extremely high energy efficiency is demanded.

5.2 Multi-mode router

NoC routers that utilize both circuit and wormhole switching to improve communication capacity are popular in many studies [105, 106, 107, 108, 109, 110, 111, 112]. The main purpose of those studies is to combine the merits of the two switching techniques to increase the transmission speed and lower the latency, but not to fit the diverse traffic patterns by adaptively selecting the proper switching technique for particular application scenarios, which is emphasized in this work for the modular design. The proposed router architecture is shown in Figure 5.1. Circuit switching (CS) and wormhole switching (WS) utilize common physical channels and control logic. The use of a specific switching mode is determined by the header sent from the source node in the network. A header in CS will set up an end-to-end link between the source and destination nodes, while a header in WS allocates one available virtual channel (VC) for the following data packets. A maximum of four virtual channels can share one physical channel if it is not occupied by circuit-switched messages. A circuit-switched message will preempt a physical channel being used for WS. As a result, each physical channel supports dynamic and free switching between CS and WS for different application scenarios.

CR: registers for circuit switching; SS: switching mode selection; RC: routing computation; VA: virtual channel allocation; RR: round-robin

Figure 5.1: Proposed multi-mode router (Adapted from [113])

Necessity

As described in the last chapter, CS is suitable for large, infrequent data transmissions, and applying WS is appropriate in other cases. However, "large" or "small" is relative to the hop count and network status of the transmission. Even if the data are "large" for one transmission, they might be "small" for another transmission. Figure 5.2 gives the simulation results of the maximum throughput with only CS or WS mode activated for a 10x10 mesh NoC when the maximum hop count of each transmission is restricted to 2, 4 and 6. When the hop count is increased, the performance degradation of CS is much larger than that of WS. This means that CS is more sensitive to the distance of data transmission. For the message size labeled "P1" in Figure 5.2, if the maximum hop count is 2, employing CS is suitable. However, when increasing the maximum hop count to 4 and 6, utilizing WS is appropriate. This shows the importance of the capability to dynamically adjust the switching mode to actual traffic patterns. In some application scenarios, "large" and "small" data transmissions need to be handled at the same time. In that case, using only one switching technique will restrict the performance of the network. The previous 10x10 mesh NoC is still used as an example, and a maximum hop count of 4 is applied. Assume that there are two types of data messages in one application at the same time, 10-flit messages (message size of 320 bits in Figure 5.2) as "small" data and 100-flit messages (message size of 3200 bits in Figure 5.2) as "large" data, and they

Figure 5.2: The maximum throughput with only CS mode (CSM) or WS mode (WSM) activated; Hn means that the maximum hop count is n; S1 and S2 are the message sizes corresponding to the throughput cross points of the CSM and WSM for H2 and H4, respectively (Adapted from [113])

Figure 5.3: The comparison of throughput and latency for 10-flit messages and 100-flit messages with only CSM, only WSM and mixed CSM/WSM (Adapted from [113])


Figure 5.4: Areas for routers with buffer depth of 8, 16 and 32, and link width from 8 to 64 bits

are generated with the same probability. Three possible strategies can be utilized for this application: CS for all messages, WS for all messages, and mixed CS/WS chosen adaptively according to message size. The simulation results for throughput and latency are shown in Figure 5.3. The mixed mode provides both higher performance and lower latency. The previous two situations show the necessity of supporting CS and WS simultaneously in one router when various traffic patterns may be present on the network.
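A sketch of such a per-message mode decision follows. The crossover sizes in the table are invented stand-ins for points like S1 and S2 obtained from simulations such as Figure 5.2, not the actual measured values.

```python
# Hypothetical crossover message sizes (bits) above which CS outperforms WS,
# indexed by the hop count of the transmission (longer paths favour WS).
CS_CROSSOVER_BITS = {1: 200, 2: 400, 3: 900, 4: 1800}

def choose_mode(message_bits, hops):
    """Pick the switching mode per message: circuit switching for large,
    short-distance transfers, wormhole switching otherwise."""
    threshold = CS_CROSSOVER_BITS.get(hops)
    if threshold is not None and message_bits >= threshold:
        return "CS"
    return "WS"

print(choose_mode(3200, 2))   # large message, short path  -> CS
print(choose_mode(320, 4))    # small message, long path   -> WS
```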

Implementation

The proposed router is implemented with parameterized buffer depth of the VCs and link width of the physical channels. The areas for routers with buffer depths of 8, 16 and 32 and link widths from 8 to 64 bits are given in Figure 5.4 after synthesis in a 65 nm low leakage CMOS process. The areas almost double when the buffer depth or link width is doubled. The corresponding throughput and power efficiency are shown in Figure 5.5, under the conditions of a 10x10 network, 40-flit messages, a maximum hop count of 4, and CS for hop counts not larger than 2, otherwise WS. Increasing the buffer depth improves performance but decreases power efficiency. Because of that, a buffer depth of 8 is selected in this work. Nevertheless, increasing the link width increases both the performance and the power efficiency. As a proof of concept, one tile, including the processors and the proposed router with buffer depth of 8 and link width of 8 bits, has been fabricated in a 65 nm low leakage CMOS process. Figure 5.6 depicts the complete structure of the implemented

50 14000 180

12000 160 140

(Gbps) 10000  120 8Ͳflit_OT (Gbps/W)  8000 100 16 Ͳflit_OT 6000 80 32 Ͳflit_OT throughput 

60 efficiency 4000  8Ͳflit_PE 40 16 Ͳflit_PE

Figure 5.5: Overall throughput (OT) and power efficiency (PE) for buffer depths of 8, 16 and 32 flits with flit width from 8 to 64 bits (Adapted from [113])


Figure 5.6: Complete structure of the on-chip multi-mode router in one tile (Adapted from [114])

router. The processor interface exchanges data messages between the processor and the multi-mode router. The link interface connecting the external links with the internal router also performs serial-to-parallel and parallel-to-serial signal conversion to reduce the required pins on one tile, thus decreasing the area and power consumption of the physical chip.

5.3 Control-centric processor

The proposed architecture of the customizable ASIP is shown in Figure 5.7. A dual-core reconfigurable control-centric processor is integrated together with the


Figure 5.7: Architecture overview

multi-mode router, internal memories, power management unit, clock generation unit, debug interfaces, and other peripherals.

Control-centric architecture

In each core (CORE_L or CORE_R), the functional units and data paths are directly manipulated by a control word (CW) without internal interpretation. The result is simplified control logic and the capability of parallel execution of all functional units. The structure of one CW is shown in Figure 5.8. One CW is partitioned into several sections, each of which contains the necessary control bits for the corresponding functional units or data paths. CWs defining specific functions constitute SMTs, which are saved either in CROM for the fixed configurations of the processor or in CRAM to provide reconfigurability. SMTs can be customized to support general-purpose ISAs or application-specific instructions, and even to directly implement a particular application, as shown in Figure 5.9 for the application mapping schemes. The comparison of energy efficiency and complexity for those schemes is also presented in that figure. The implementations with general-purpose ISAs gain low design complexity, while dedicated SMTs for specific applications achieve higher energy efficiency. When the processor is used to execute instructions, the fetching (IF), decoding (ID) and executing (EX) processes are all carried out by the SMT with the same

Figure 5.8: Structure of one control word; the sections control sequencing, the ALU, the MAC and shifter, internal and external memory access, and other units
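To illustrate how such a control word could directly drive the functional units, a hypothetical encoding is sketched below. The field names, their order and the dispatch interface are assumptions made for the example, not the fabricated chip's format.

```python
from dataclasses import dataclass

@dataclass
class ControlWord:
    """One CW split into sections; each section drives one unit directly,
    so all units can act in parallel in the same execution cycle."""
    seq_ctrl:  int   # sequencing: selection of the next CW (branching)
    alu_op:    int   # ALU operation select
    mac_shift: int   # MAC and shifter control
    int_mem:   int   # internal memory access control
    ext_mem:   int   # external memory access control
    misc:      int   # other units (I/O, router interface, ...)

def apply(cw, units):
    """Dispatch every section of the CW to its unit in the same cycle,
    with no instruction decoding step in between."""
    units["seq"](cw.seq_ctrl)
    units["alu"](cw.alu_op)
    units["mac"](cw.mac_shift)
    units["imem"](cw.int_mem)
    units["emem"](cw.ext_mem)
    units["misc"](cw.misc)

# An SMT is then simply an indexed table of such control words:
smt = [ControlWord(0, 1, 0, 2, 0, 0), ControlWord(1, 3, 1, 0, 0, 0)]
```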


Figure 5.9: Mapping from applications to the processor (Adapted from [114])


Figure 5.10: Instructions processing in the proposed processor compared with other data-centric processors

hardware units, compared with data-centric processors which employ dedicated hardware units for instruction fetching (IF), decoding (ID) and executing (EX), respectively, as described in Figure 5.10. For data-centric processors, pipelines are

normally utilized to increase the processing throughput. Extra hardware resources are required to support pipelining of the instructions, since the successive instruction starts to be processed before the processing of the current one is finished. In contrast, a control-centric processor changes the functions of the same hardware units with different configurations to perform the corresponding operations. The hardware can thus be utilized to a high extent to achieve high energy efficiency.

Low power considerations

The low power design considerations are listed below:

• Reconfigurable architecture: the reprogrammable SMT supports both general-purpose and application-specific instructions. Therefore, dense software programs can be generated. The results are a small memory requirement and a low access frequency to the memory, which both lower the power consumption.

• On-chip memories: the on-chip ROM and RAM are used both for storing SMTs and for application programs. For many IoT applications, the on-chip memories are enough for the programs, and the power consumed on the pins for connecting external memories can be saved.

• Power management: different working modes are designed to provide proper performance for adapting to different scenarios. The activity and working frequency of each core and of the peripherals can be dynamically adjusted following changes in the workload.

High performance

The two cores are equipped with the same kinds of computational units. Besides the general-purpose ALU, a multiply-accumulate unit (MAC) together with a shifter is also integrated in each core to increase the performance for digital signal processing. Moreover, all the function units can be activated simultaneously and work in parallel, controlled by the SMT. By organizing the function units in a suitable way, hardware resources are utilized efficiently and high performance can be achieved. For each single core, to keep the critical path short and guarantee sufficient execution time for each control word, one execution cycle consists of two clock cycles. The address of the SMT memory is given in the first cycle, while the data are available in the next cycle, as shown in Figure 5.11(a). By connecting the two cores to the same memories and letting them access the memories alternately, the memories are utilized efficiently and the performance is doubled without the need for arbiters between the two cores, as shown in Figure 5.11(b).
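A toy model of this interleaved timing is given below. It is purely illustrative and only reproduces the address/data phase alternation of Figure 5.11, not the real memory interface.

```python
def dual_core_timing(num_cycles):
    """Print which phase (Address or Data) each core is in per clock cycle.
    CORE_R is shifted by one clock relative to CORE_L, so the shared memory
    receives an address every clock cycle and no arbiter is needed."""
    for clk in range(num_cycles):
        core_l = "Address" if clk % 2 == 0 else "Data"
        core_r = "Data"    if clk % 2 == 0 else "Address"
        print(f"clk {clk}: CORE_L={core_l:7s}  CORE_R={core_r}")

dual_core_timing(6)
```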


Figure 5.11: Control word timing

Layer 5: application. Layer 4: high-level language coding (C, Java, etc.). Layer 3: machine instructions / assembly level. Layer 2: control words / SMT coding. Layer 1: CORE_L, CORE_R, router.

Figure 5.12: Hierarchical implementation structure

Flexibility

For applications to be implemented on the proposed architecture, the hierarchical flexible structure is shown in Figure 5.12. The application-layer flexibility is realized by providing a certain number of predefined common functions to support rapid mapping from applications to the processor. Besides, general-purpose programming is applicable in layers 3 and 4 by using high-level languages or assembly code. For some special applications with critical timing or power requirements, the re-programmability in layer 2, which is the SMT coding layer, can be exploited. As for the flexibility in layer 1, i.e. the hardware layer, the number of active cores is adjustable for different applications. If the performance of two cores is not sufficient, more cores can work together by being connected through the integrated multi-mode router to construct a network. The memory structure is also flexible. For applications with a small memory requirement,

Technology: UMC 65 nm LL
Gate count: 1.8 million
Die size: 1.875 mm x 1.875 mm
Memories: 80 KB + 256 KB ROM; 40 KB + 160 KB SRAM
Frequency: 350 MHz @ 1.2 V
Power consumption: 22 mW (6.9 mW per core)
Efficiency: 50.7 GOPS/W per core
Switching mode: circuit/wormhole
Pad number: 112

Gate count breakdown: memory 92.59%; the remaining 7.41% is shared by CORE_L, CORE_R (0.77%), peripherals, the router and other logic.

Figure 5.13: Die photo together with gate count breakdown

internal memories are enough. If more memory is required, external SDRAM can be connected to the processor. The internal and external memories use a uniform memory space, which is transparent to higher-level programming. Besides, the pins for the SDRAM interface are shared with general-purpose I/O pins, and the functions of those pins are configurable depending on the connectivity of external memories.

5.4 Tile implementation

The customizable tile-ASIP including dual cores and the multi-mode router has been fabricated in 65 nm low leakage CMOS process. The die photo and gate count breakdown are presented in Figure 5.13. Most of the gates are used by the memories. Adding the second core (CORE_R) contributes only 0.77% to the gates, but doubles the performance of the processor. The maximum working frequency is 350 MHz with power consumption of 22 mW when all the hardware resources are activated. If one core is turned off, the power consumption is decreased to 15 mW. In order to keep the status of the processor when all clock sources are shut off, 0.74

56                                                               

                                       

Power consumption under different working modes

Working mode    Power consumption*
Dual cores      22 mW
Single core     15 mW
Standby         8.4 mW
Halt            0.74 mW

*with clock frequency 350 MHz, core voltage 1.2 V.


Figure 5.14: Power consumption under different working modes and conditions

mW is needed. The power consumption for different configurations is shown in Figure 5.14. Based on this figure, an appropriate working strategy can be selected to achieve the on-demand performance while keeping the power consumption as low as possible. With the proposed ASIPs, on-board networks with packaged chips and in-package networks with bare dies are feasible to further increase the performance of the whole system. Such configurations are demonstrated in Figure 5.15. A maximum of 256 tiles can be connected, with a single-link bandwidth of up to 1.4 Gbps.
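As a usage illustration of these working modes, a small average-power estimate for a duty-cycled node is sketched below. The mode powers are taken from the table above, while the duty-cycle fractions are invented for the example.

```python
# Measured mode powers at 350 MHz, 1.2 V (from the table above), in mW.
MODE_POWER_MW = {"dual": 22.0, "single": 15.0, "standby": 8.4, "halt": 0.74}

def average_power(schedule):
    """schedule: list of (mode, fraction_of_time); fractions should sum to 1."""
    return sum(MODE_POWER_MW[mode] * frac for mode, frac in schedule)

# Hypothetical IoT duty cycle: mostly halted, with brief single-core bursts.
print(round(average_power([("single", 0.02), ("standby", 0.08), ("halt", 0.90)]), 2), "mW")
```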

5.5 Summary

This chapter discussed the design and implementation of a customizable ASIP for low power applications, which demand high energy efficiency. To achieve that, a multi-mode router and a control-centric processor are integrated to provide on-demand performance. By utilizing the multi-mode router, diverse inter-silicon networks can be constructed to meet the performance requirement of specific applications. With the control-centric processor, the control logic is simplified, and hardware resources are directly operated by the control words of the SMT inside the processor. By properly orchestrating the SMTs, both general-purpose ISAs and application-specific instructions can be mapped to the processor. Furthermore, a customized SMT can dedicate all resources of the processor to a particular application if required. Finally, the tile-ASIP is fabricated in a 65 nm low leakage CMOS process, and different working strategies are proposed for various application scenarios.


[Figure 5.15 also shows link signal traces: serial clock, FDC (forward ctl.), FDD0 (forward data0) and BDC (backward ctl.), sending a header in circuit switching and sending data in wormhole switching.]

Figure 5.15: Scalable network on a board and network in a package for different application scenarios (Adapted from [114])

Chapter 6

Conclusions

6.1 Thesis summary

This dissertation explores the design and implementation of energy-efficient ASIPs for ubiquitous sensing and computing, covering both high-throughput compute-intensive multimedia processing and low-throughput low-cost processing for IoT scenarios.

For high-throughput processing of networked multi-standard multimedia, single ASIPs integrating general-purpose processors with application-specific accelerators are proposed and implemented. Control-intensive tasks such as protocol processing, depacketizing and stream parsing are better suited to the embedded processor, while the dedicated accelerators are efficient for the computation-intensive functions. The instruction-controlled accelerators and the general-purpose processor also provide the flexibility to support both existing and future standards. The chip has been fabricated, and a test system has been built to handle networked multimedia streams.

Multiple ASIPs are connected with an on-chip network to gain even higher performance and flexibility for applications such as multi-view video decoding. The design problems, including the selection of the number of ASIPs, the required performance of a single ASIP, the link bandwidth of the network and the task assignment schemes, have been studied for high-definition eight-view MVC decoding. To use multi-ASIP architectures, parallel processing at the proper level has been explored. Both picture-level and view-level task assignment schemes are proposed, and the corresponding communication and computation subsystems are analyzed to obtain the design options for the whole system. The results show that a limitation in one factor can be compensated by enhancing the others, provided that the minimum requirement for each factor is guaranteed.
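The trade-off stated above can be illustrated with a back-of-the-envelope feasibility check. All numbers in the C sketch below are placeholders rather than results from the MVC study; it only demonstrates how a shortfall in one factor (per-ASIP throughput, number of ASIPs, or link bandwidth) can be compensated by the others once each factor meets its own minimum.

    #include <stdbool.h>
    #include <stdio.h>

    typedef struct {
        double asip_mbps;   /* decoding throughput of one ASIP, Mbit/s */
        int    num_asips;   /* number of ASIPs connected to the NoC    */
        double link_mbps;   /* bandwidth of a single NoC link, Mbit/s  */
    } design_point_t;

    /* A design point is taken as feasible when the aggregate computation covers
     * the stream and each link can carry its share of the traffic (assumed here
     * to be an even split across the ASIPs). */
    static bool feasible(design_point_t d, double stream_mbps)
    {
        bool enough_compute = d.asip_mbps * (double)d.num_asips >= stream_mbps;
        bool enough_links   = d.link_mbps >= stream_mbps / (double)d.num_asips;
        return enough_compute && enough_links;
    }

    int main(void)
    {
        const double stream = 400.0;             /* hypothetical 8-view HD stream */
        design_point_t a = { 60.0, 8, 60.0 };    /* many modest ASIPs             */
        design_point_t b = { 220.0, 2, 120.0 };  /* few strong ASIPs, thin links  */
        design_point_t c = { 220.0, 2, 210.0 };  /* few strong ASIPs, fat links   */
        printf("A: %d  B: %d  C: %d\n", feasible(a, stream),
               feasible(b, stream), feasible(c, stream)); /* prints A: 1  B: 0  C: 1 */
        return 0;
    }

In this toy example, stronger links (point C) compensate for having fewer ASIPs, whereas point B fails because one factor stays below its minimum, mirroring the conclusion above.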

A tile-ASIP is designed and implemented for applications with emphasis on low power consumption, high energy efficiency and high adaptability. The integrated multi-mode router supports both circuit and wormhole switching simultaneously, and the two schemes can be dynamically selected to adapt to the actual traffic pattern. The embedded processor core utilizes control words to directly operate the functional units, avoiding complex control logic and thereby increasing the hardware utilization. The control words compose the sequence mapping table, which is stored in the on-chip memories. By creating the proper tables, general-purpose ISAs, application-specific instructions and mappings dedicated to particular applications can be implemented by the processor. Dual processor cores are deployed sharing both the control and data storage, which doubles the performance and further improves the energy efficiency. Thanks to the integration of the multi-mode router, a suitable number of tile-ASIPs can be connected with the inter-silicon network to provide on-demand performance for diverse application scenarios.

To conclude, different strategies for designing ASIPs are employed for diverse application scenarios in ubiquitous sensing and computing. Application-specific accelerators effectively increase the performance at low power consumption, but they carry a high design and verification cost together with relatively low flexibility. They are therefore best used in ASIPs to which only similar applications will be mapped; multi-standard multimedia processing is one example. When applications demand even higher performance, connecting existing high-performance ASIPs into a multi-ASIP architecture is a feasible solution instead of further increasing the single-ASIP performance. In contrast, the tile-ASIP increases energy efficiency by improving hardware utilization. As a result of the reconfigurability of the processor and the scalability provided by the on-chip multi-mode router, the ability of the tile-ASIP to provide on-demand performance lowers the cost and achieves high energy efficiency.
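As a final illustration, the C sketch below shows one conceivable policy for the dynamic selection between the two switching modes of the multi-mode router mentioned above. The thesis only states that circuit and wormhole switching coexist and are selected to match the traffic pattern; the payload-size rule and the threshold used here are assumptions for illustration, not the router's actual decision logic.

    #include <stdio.h>

    typedef enum { SWITCH_WORMHOLE, SWITCH_CIRCUIT } switch_mode_t;

    /* Long streaming transfers amortize the circuit set-up handshake, so they are
     * sent over a reserved circuit; short messages travel as wormhole packets. */
    static switch_mode_t pick_mode(unsigned payload_bytes)
    {
        const unsigned circuit_threshold = 1024;  /* assumed break-even point */
        return (payload_bytes >= circuit_threshold) ? SWITCH_CIRCUIT : SWITCH_WORMHOLE;
    }

    int main(void)
    {
        printf("64 B  -> %s\n", pick_mode(64)   == SWITCH_CIRCUIT ? "circuit" : "wormhole");
        printf("8 KiB -> %s\n", pick_mode(8192) == SWITCH_CIRCUIT ? "circuit" : "wormhole");
        return 0;
    }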

6.2 Future work

One revolutionary approach to achieving high performance at extremely low power consumption is brain-inspired computing [4, 115, 116]. The human brain runs at very low power but performs many complex tasks that even current computers cannot handle. Specialization is one of the most important features for achieving high energy efficiency [117]: different parts of the brain are trained and customized to perform diverse functions. For the tile-ASIP proposed in this dissertation, the reconfigurable and customizable processor cores together with the integrated router make it a competitive building block for constructing a brain-inspired computer. Future research can be performed in the following three aspects to facilitate the implementation of brain-inspired computing with the tile-ASIP.

Hardware improvement

In the current version of the tile-ASIP, fixed configurations of the ASIP are saved in the internal ROM. To enhance reconfigurability and customizability, re-programmable non-volatile memories can be utilized to replace the ROM. Meanwhile, the SMTs

and instructions will share the internal RAM to maximize flexibility, instead of using independent storage as in the current version. The integrated router is designed for mesh networks, and up to 256 nodes are supported. The reliability and adaptability can be increased by supporting irregular topologies, adding fault detection and correction algorithms, etc. Energy-efficient inter-router connections could also be developed. Besides, for both the router and the processor, event-driven working modes are necessary to further increase the energy efficiency by activating the resources only during the periods when incoming events are being recorded.

Programming model

The programming of a single tile-ASIP, including its SMTs and application programs, is supported. To ease the integration of hundreds or even thousands of tile-ASIPs, programming models for configuring and scheduling multiple tile-ASIPs need to be studied and formulated. Besides, since SMTs can be reprogrammed at run time to change the processor behavior, in-system optimization for specific tasks is desired, supported by a customized software tool chain.

Application mapping

Brain-inspired computing has drawn much attention due to the extremely high energy efficiency of the human brain compared with traditional computers. The brain works with high specialization, low processing duty cycles and low-voltage operation [4]. One million programmable spiking neurons and 256 million configurable synapses are implemented in a single chip in [115], and many such chips can be tiled to support even more complex neural networks. In [116], 18 ARM968 processing cores are integrated as a chip multiprocessor (CMP) for simulating large-scale spiking neural networks; the architecture can scale up to 65536 chips, enabling the modeling of up to 1 billion neurons. Considering the customizable processor core, its tightly coupled internal memories, and the scalable communication structure of the proposed tile-ASIP, the mapping of brain-inspired computing onto this platform can be studied as future work.


Bibliography

[1] C.D. Martin. ENIAC: press conference that shook the world. Technology and Society Magazine, IEEE, 14(4):3–10, Winter 1995.

[2] G. E. Moore. Cramming more components onto integrated circuits. Electronics, 38(8):114–117, 1965.

[3] R.H. Dennard, V.L. Rideout, E. Bassous, and A.R. LeBlanc. Design of ion-implanted MOSFET's with very small physical dimensions. Solid-State Circuits, IEEE Journal of, 9(5):256–268, Oct 1974.

[4] M.B. Taylor. A landscape of the new dark silicon design regime. Micro, IEEE, 33(5):8–19, Sept 2013.

[5] Z. Toprak-Deniz, M. Sperling, J. Bulzacchelli, G. Still, R. Kruse, Seongwon Kim, D. Boerstler, T. Gloekler, R. Robertazzi, K. Stawiasz, T. Diemoz, G. English, D. HUI, P. Muench, and J. Friedrich. Distributed system of digitally controlled microregulators enabling per-core DVFS for the POWER8TM. In IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC), pages 98–99, Feb 2014.

[6] D. Jacquet, F. Hasbani, P. Flatresse, R. Wilson, F. Arnaud, G. Cesana, T. Di Gilio, C. Lecocq, T. Roy, A. Chhabra, C. Grover, O. Minez, J. Uginet, G. Durieu, C. Adobati, D. Casalotto, F. Nyer, P. Menut, A. Cathelin, I. Vongsavady, and P. Magarshack. A 3 GHz dual core processor ARM CortexTM-A9 in 28 nm UTBB FD-SOI CMOS with ultra-wide voltage range and energy efficiency optimization. IEEE Journal of Solid-State Circuits, 49(4):812–826, April 2014.

[7] Y. Akgul, D. Puschini, S. Lesecq, E. Beigne, I. Miro-Panades, P. Benoit, and L. Torres. Power management through DVFS and dynamic body biasing in FD-SOI circuits. In 51st ACM/EDAC/IEEE Design Conference (DAC), pages 1–6, June 2014.

[8] K. Craig, Y. Shakhsheer, S. Arrabi, S. Khanna, J. Lach, and B.H. Calhoun. A 32 b 90 nm processor implementing panoptic DVS achieving energy efficient

operation from sub-threshold to high performance. IEEE Journal of Solid-State Circuits, 49(2):545–552, Feb 2014.

[9] S.R. Sridhara, M. DiRenzo, S. Lingam, Seok-Jun Lee, R. Blazquez, J. Maxey, S. Ghanem, Yu-Hung Lee, R. Abdallah, P. Singh, and M. Goel. Microwatt embedded processor platform for medical system-on-chip applications. IEEE Journal of Solid-State Circuits, 46(4):721–730, April 2011.

[10] E. Nakasu. Super hi-vision on the horizon: A future TV system that conveys an enhanced sense of reality and presence. Consumer Electronics Magazine, IEEE, 1(2):36–42, 2012.

[11] DCI DCSS v12 with errata 2012-1010. http://www.dcimovies.com/, 2012.

[12] T. Ito. Future television - Super Hi-Vision and beyond. In IEEE Asian Solid State Circuits Conference (A-SSCC), pages 1–4, 2010.

[13] M. Ikeda, T. Onishi, T. Sano, A. Sagata, H. Iwasaki, Y. Nakajima, K. Nitta, Y. Takahashi, K. Yokohari, D. Kobayashi, K. Kamikura, and H. Jozawa. MVC real-time video encoder for full-HDTV 3D video. In IEEE International Conference on Consumer Electronics (ICCE), pages 166–167, 2012.

[14] G. Van Wallendael, S. Van Leuven, J. De Cock, F. Bruls, and R. Van De Walle. 3D video compression based on high efficiency video coding. IEEE Transactions on Consumer Electronics, 58(1):137–145, 2012.

[15] J. Dick, H. Almeida, L.D. Soares, and P. Nunes. 3D holoscopic video coding using MVC. In EUROCON - International Conference on Computer as a Tool (EUROCON), 2011 IEEE, pages 1–4, 2011.

[16] Luigi Atzori, Antonio Iera, and Giacomo Morabito. The internet of things: A survey. Computer Networks, 54(15):2787 – 2805, 2010.

[17] Guo-An Jian, Ting-Yu Huang, Jui-Chin Chu, and Jiun-In Guo. Optimization of VC-1/H.264/AVS video decoders on embedded processors. In Sixth International Conference on Information Technology: New Generations, pages 1313–1318, 2009.

[18] K. Nishihara, A. Hatabu, and T. Moriyoshi. Parallelization of H.264 video decoder for embedded multicore processor. In IEEE International Conference on Multimedia and Expo, pages 329–332, 2008.

[19] F. Pescador, M.J. Garrido, E. Juarez, and C. Sanz. On an implementation of HEVC video decoders with DSP technology. In IEEE International Conference on Consumer Electronics (ICCE), pages 121–122, 2013.

[20] Chao-Tsung Huang, M. Tikekar, C. Juvekar, V. Sze, and A. Chandrakasan. A 249Mpixel/s HEVC video-decoder chip for quad full HD applications. In 2013 IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC), pages 162–163, 2013.

[21] Dajiang Zhou, Jinjia Zhou, Jiayi Zhu, Peilin Liu, and S. Goto. A 2Gpixel/s H.264/AVC HP/MVC video decoder chip for Super Hi-Vision and 3DTV/FTV applications. In IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC), pages 224–226, 2012.

[22] Tzu-Der Chuang, Pei-Kuei Tsung, Pin-Chih Lin, Lo-Mei Chang, Tsung-Chuan Ma, Yi-Hau Chen, and Liang-Gee Chen. A 59.5mW scalable/multi-view video decoder chip for Quad/3D Full HDTV and video streaming applications. In IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC), pages 330–331, 2010.

[23] Chien-Chang Lin, Jia-Wei Chen, Hsiu-Cheng Chang, Yao-Chang Yang, Yi- Huan Ou Yang, Ming-Chih Tsai, Jiun-In Guo, and Jinn-Shyan Wang. A 160k gates/4.5 KB SRAM H.264 video decoder for HDTV applications. IEEE Journal of Solid-State Circuits, 42(1):170–182, 2007.

[24] H. Martin, E. San Millan, P. Peris-Lopez, and J. Tapiador. Efficient ASIC implementation and analysis of two EPC-C1G2 RFID authentication protocols. IEEE Sensors Journal, PP(99):1–1, 2013.

[25] K. Keutzer, S. Malik, and A.R. Newton. From ASIC to ASIP: the next design discontinuity. In IEEE International Conference on Computer Design: VLSI in Computers and Processors, pages 84–90, 2002.

[26] M.K. Jain, M. Balakrishnan, and A. Kumar. ASIP design methodologies: survey and issues. In Fourteenth International Conference on VLSI Design, pages 76–81, 2001.

[27] Antoni Portero, G. Talavera, Marc Moreno, J. Carrabina, and F. Catthoor. Methodology for energy-flexibility space exploration and mapping of multimedia applications to single-processor platform styles. IEEE Transactions on Circuits and Systems for Video Technology, 21(8):1027–1039, 2011.

[28] M. Rashid, L. Apvrille, and R. Pacalet. Application specific processors for multimedia applications. In 11th IEEE International Conference on Computational Science and Engineering, pages 109–116, 2008.

[29] SungDae Kim and MyungH. Sunwoo. ASIP approach for implementation of H.264/AVC. Journal of Signal Processing Systems, 50(1):53–67, 2008.

[30] Zheng Shen, Hu He, Yanjun Zhang, and Yihe Sun. A video specific instruction set architecture for ASIP design. VLSI Design, 2007(2):4:1–4:7, April 2007.

[31] Erich Wenger and Johann Grobschadl. An 8-bit AVR-Based elliptic curve cryptographic RISC processor for the internet of things. In Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Workshops, MICROW ’12, pages 39–46, Washington, DC, USA, 2012. IEEE Computer Society.

[32] ITU-T SERIES H: Audiovisual and multimedia systems infrastructure of audiovisual services - coding of moving video: Advanced video coding for generic audiovisual services. In Joint Video Team (JVT) of ISO/IEC MPEG and ITU-T VCEG, JVT-AF11, Nov. 2009.

[33] George Kornaros and Dionisios Pnevmatikatos. Dynamic power and thermal management of noc-based heterogeneous mpsocs. ACM Trans. Reconfigurable Technol. Syst., 7(1):1:1–1:26, February 2014.

[34] Brett H. Meyer, Adam S. Hartman, and Donald E. Thomas. Cost-effective lifetime and yield optimization for noc-based mpsocs. ACM Trans. Des. Autom. Electron. Syst., 19(2):12:1–12:33, March 2014.

[35] L. Sterpone, L. Carro, D. Matos, S. Wong, and F. Fakhar. A new reconfigurable clock-gating technique for low power SRAM-based FPGAs. In Design, Automation Test in Europe Conference Exhibition (DATE), 2011, pages 1–6, March 2011.

[36] Dingguo Wei, Chun Zhang, Yan Cui, Hong Chen, and Zhihua Wang. Design of a low-cost low-power baseband-processor for UHF RFID tag with asynchronous design technique. In Circuits and Systems (ISCAS), 2012 IEEE International Symposium on, pages 2789–2792, May 2012.

[37] L. Benini, P. Siegel, and G. De Micheli. Saving power by synthesizing gated clocks for sequential circuits. Design Test of Computers, IEEE, 11(4):32–41, Winter 1994.

[38] Kwangok Jeong, A.B. Kahng, Seokhyeong Kang, T.S. Rosing, and R. Strong. MAPG: Memory access power gating. In Design, Automation Test in Europe Conference Exhibition (DATE), 2012, pages 1054–1059, March 2012.

[39] Chiu-Kuo Chen, Yi-Yuan Wang, Zong-Han Hsieh, E. Chua, Wai-Chi Fang, and Tzyy-Ping Jung. A low power independent component analysis processor in 90nm CMOS technology for portable EEG signal processing systems. In Circuits and Systems (ISCAS), 2011 IEEE International Symposium on, pages 801–804, May 2011.

[40] H. Singh, K. Agarwal, D. Sylvester, and K.J. Nowka. Enhanced leakage reduction techniques using intermediate strength power gating. Very Large Scale Integration (VLSI) Systems, IEEE Transactions on, 15(11):1215–1224, Nov 2007.

[41] T. Burd, T. Pering, A. Stratakos, and R. Brodersen. A dynamic voltage scaled microprocessor system. In Solid-State Circuits Conference, 2000. Digest of Technical Papers. ISSCC. 2000 IEEE International, pages 294–295, Feb 2000.

[42] J. Howard, S. Dighe, Y. Hoskote, S. Vangal, D. Finan, G. Ruhl, D. Jenkins, H. Wilson, N. Borkar, G. Schrom, F. Pailet, S. Jain, T. Jacob, S. Yada, S. Marella, P. Salihundam, V. Erraguntla, M. Konow, M. Riepen, G. Droege, J. Lindemann, M. Gries, T. Apel, K. Henriss, T. Lund-Larsen, S. Steibl, S. Borkar, V. De, R. Van Der Wijngaart, and T. Mattson. A 48-core IA-32 message-passing processor with DVFS in 45nm CMOS. In Solid-State Circuits Conference Digest of Technical Papers (ISSCC), 2010 IEEE International, pages 108–109, Feb 2010.

[43] V. Krishnaswamy, J. Brooks, G. Konstadinidis, C. McAllister, H. Pham, S. Turullols, J.L. Shin, Yifan YangGong, and Haowei Zhang. Fine-grained adaptive power management of the SPARC M7 processor. In Solid-State Circuits Conference - (ISSCC), 2015 IEEE International, pages 1–3, Feb 2015.

[44] E.J. Fluhr, S. Baumgartner, D. Boerstler, J.F. Bulzacchelli, T. Diemoz, D. Dreps, G. English, J. Friedrich, A. Gattiker, T. Gloekler, C. Gonzalez, J.D. Hibbeler, K.A. Jenkins, Yong Kim, P. Muench, R. Nett, J. Paredes, J. Pille, D. Plass, P. Restle, R. Robertazzi, D. Shan, D. Siljenberg, M. Sperling, K. Stawiasz, G. Still, Z. Toprak-Deniz, J. Warnock, G. Wiedemeier, and V. Zyuban. The 12-core POWER8 processor with 7.6 Tb/s I/O bandwidth, integrated voltage regulation, and resonant clocking. Solid-State Circuits, IEEE Journal of, 50(1):10–23, Jan 2015.

[45] N. Reynders and W. Dehaene. A 210mV 5MHz variation-resilient near-threshold JPEG encoder in 40nm CMOS. In Solid-State Circuits Conference Digest of Technical Papers (ISSCC), 2014 IEEE International, pages 456–457, Feb 2014.

[46] Advances in big.little technology for power and energy savings. http://www.thinkbiglittle.com/.

[47] Jungyul Pyo, Youngmin Shin, Hoi-Jin Lee, Sung il Bae, Min-Su Kim, Kwangil Kim, Ken Shin, Yohan Kwon, Heungchul Oh, Jaeyoung Lim, Dong wook Lee, Jongho Lee, Inpyo Hong, Kyungkuk Chae, Heon-Hee Lee, Sung-Wook Lee, Seongho Song, Chung-Hee Kim, Jin-Soo Park, Heesoo Kim, Sunghee Yun, Uk-Rae Cho, Jae Cheol Son, and Sungho Park. 20nm high-K metal-gate heterogeneous 64b quad-core CPUs and hexa-core GPU for high-performance and energy-efficient mobile application processor. In IEEE International Solid-State Circuits Conference - (ISSCC), pages 1–3, Feb 2015.

[48] A. Gomez, C. Pinto, A. Bartolini, D. Rossi, L. Benini, H. Fatemi, and J.P. de Gyvez. Reducing energy consumption in -based platforms

with low design margin co-processors. In Design, Automation Test in Europe Conference Exhibition (DATE), 2015, pages 269–272, March 2015.

[49] T.-T. Liu and J.M. Rabaey. A 0.25 V 460 nW asynchronous neural signal processor with inherent leakage suppression. IEEE Journal of Solid-State Circuits, 48(4):897–906, April 2013.

[50] N. Neves, N. Sebastiao, D. Matos, P. Tomas, P. Flores, and N. Roma. Multi-core SIMD ASIP for next-generation sequencing and alignment biochip platforms. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 23(7):1287–1300, July 2015.

[51] T. Sugiura, S. Nakatsuka, Jaehoon Yu, Y. Takeuchi, and M. Imai. An efficient data compression method for artificial vision systems and its low energy implementation using ASIP technology. In IEEE Biomedical Circuits and Systems Conference (BioCAS), pages 81–84, Oct 2014.

[52] Y. Xin, W.X.Y. Li, Z. Zhang, R.C.C. Cheung, D. Song, and T.W. Berger. An application specific instruction set processor (asip) for adaptive filters in neural prosthetics. Computational Biology and Bioinformatics, IEEE/ACM Transactions on, PP(99):1–1, 2015.

[53] Lee Minkyu, Song Yong Ho, and Chung Ki-Seok. An ASIP approach for interpolation performance enhancement in HEVC decoder. In 4th IEEE International Conference on Network Infrastructure and Digital Content (IC-NIDC), pages 232–235, Sept 2014.

[54] Kwang Woo Lee, Sung Dae Kim, and M.H. Sunwoo. VSIP: Video Specific Instruction Set Processor for H.264/AVC. In IEEE Workshop on Signal Processing and Implementation, pages 124–129, Oct 2006.

[55] A. Gupta and A. Pal. Accelerating SVM on Ultra Low Power ASIP for High Throughput Streaming Applications. In 28th International Conference on VLSI Design (VLSID), pages 517–522, Jan 2015.

[56] Suk Hyun Yoon, Myung Hoon Sunwoo, and Jong Ha Moon. Design of a high-quality audio-specific DSP core. In IEEE Workshop on Signal Processing Systems Design and Implementation, pages 509–513, Nov 2005.

[57] Purushotham Murugappa, Amer Baghdadi, and Michel Jezequel. ASIP design for multi-standard channel decoders. In Cyrille Chavet and Philippe Coussy, editors, Advanced Hardware Design for Error Correcting Codes, pages 151–175. Springer International Publishing, 2015.

[58] Zhenzhi Wu and D. Liu. Flexible multistandard FEC with ASIP methodology. In IEEE 25th International Conference on Application-specific Systems, Architectures and Processors (ASAP), pages 210–218, June 2014.

[59] Ubaid Ahmad, Min Li, Amir Amin, Meng Li, Liesbet Van der Perre, Rudy Lauwereins, and Sofie Pollin. An energy-efficient reconfigurable ASIP supporting multi-mode MIMO detection. Journal of Signal Processing Systems, pages 1–17, 2015.

[60] Frank Bouwens, Jos Huisken, Harmke De Groot, Martijn Bennebroek, Anteneh Abbo, Octavio Santana, Jef van Meerbergen, and Antoine Fraboulet. A dual-core system solution for wearable health monitors. In Proceedings of the 21st Edition of the Great Lakes Symposium on Great Lakes Symposium on VLSI, GLSVLSI ’11, pages 379–382, New York, NY, USA, 2011. ACM.

[61] Li-Ming Chen, Zeng-Hui Yu, Cheng-Ying Chen, Xiao-Yu Hu, Jun Fan, Jun Yang, and Yong Hei. A 1-V, 1.2-mA fully integrated SoC for digital hearing aids. Microelectronics Journal, 46(1):12 – 19, 2015.

[62] N. Neves, N. Sebastiao, A. Patricio, D. Matos, P. Tomas, P. Flores, and N. Roma. BioBlaze: Multi-core SIMD ASIP for DNA sequence alignment. In IEEE 24th International Conference on Application-Specific Systems, Architectures and Processors (ASAP), pages 241–244, June 2013.

[63] J. Constantin, A. Dogan, O. Andersson, P. Meinerzhagen, J.N. Rodrigues, D. Atienza, and A. Burg. TamaRISC-CS: An ultra-low-power application-specific processor for compressed sensing. In IEEE/IFIP 20th International Conference on VLSI and System-on-Chip (VLSI-SoC), pages 159–164, Oct 2012.

[64] Yi Wang and Yajun Ha. A Performance and Area Efficient ASIP for Higher-Order DPA-Resistant AES. IEEE Journal on Emerging and Selected Topics in Circuits and Systems, 4(2):190–202, June 2014.

[65] O. Ahmed and S. Areibi. An efficient application-specific instruction-set processor for packet classification. In International Conference on Reconfigurable Computing and FPGAs (ReConFig), pages 1–6, Dec 2013.

[66] Seung-Hyun Choi, Neungsoo Park, YongHo Song, and Seong-Won Lee. ASiPEC: An application specific instruction-set processor for high performance entropy coding. In James J. (Jong Hyuk) Park, Yi Pan, Han-Chieh Chao, and Gangman Yi, editors, Ubiquitous Computing Application and Wireless Sensor, volume 331 of Lecture Notes in , pages 67–75. Springer Netherlands, 2015.

[67] Srinivas Lingam, Timothy Schmidl, and Anuj Batra. Software defined radio for smart utility networks. In Proceedings of the 2014 ACM Workshop on Software Radio Implementation Forum, SRIF ’14, pages 37–44, New York, NY, USA, 2014. ACM.

[68] Wenxiang Wang, Ling Li, Guangfei Zhang, Dong Liu, and Ji Qiu. An application specific instruction set processor optimized for FFT. In Circuits and Systems (MWSCAS), 2011 IEEE 54th International Midwest Symposium on, pages 1–4, Aug 2011.

[69] M. Koziri, D. Zacharis, I. Katsavounidis, and N. Bellas. Implementation of the AVS video decoder on a heterogeneous dual-core SIMD processor. Consumer Electronics, IEEE Transactions on, 57(2):673–681, May 2011.

[70] G. Paya-Vayaa, J. Martin-Langerwerf, C. Banz, F. Giesemann, P. Pirsch, and H. Blume. VLIW architecture optimization for an efficient computation of stereoscopic video applications. In Green Circuits and Systems (ICGCS), 2010 International Conference on, pages 457–462, June 2010.

[71] Vincent Brost, Fan Yang, and Charles Meunier. Flexible VLIW processor based on FPGA for efficient embedded real-time image processing. Journal of Real-Time Image Processing, 9(1):47–59, 2014.

[72] V. Lapotre, P. Murugappa, G. Gogniat, A. Baghdadi, M. Hubner, and J.-P. Diguet. A dynamically reconfigurable multi-ASIP architecture for multistandard and multimode turbo decoding. Very Large Scale Integration (VLSI) Systems, IEEE Transactions on, PP(99):1–1, 2015.

[73] Q. Yang, X. Zhou, G.E. Sobelman, and X. Li. Network-on-chip for turbo decoders. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, PP(99):1–1, 2015.

[74] Hong Chinh Doan, H. Javaid, and S. Parameswaran. Flexible and scalable implementation of H.264/AVC encoder for multiple resolutions using ASIPs. In Design, Automation and Test in Europe Conference and Exhibition (DATE), 2014, pages 1–6, March 2014.

[75] Hyun-Ho Jo, Yong-Jo Ahn, Dae-Beom Kang, Bongil Ji, Dong-Gyu Sim, and Jae-Jin Lee. Flexible multi-core platform for a multiple-format video decoder. Journal of Signal Processing Systems, 80(2):163–179, 2015.

[76] Dipan Kumar Mandal, Mihir Mody, Mahesh Mehendale, Naresh Yadav, Ghone Chaitanya, Piyali Goswami, Hetul Sanghvi, and Niraj Nandan. Accelerating H.264/HEVC video slice processing using application specific instruction set processor. In Consumer Electronics (ICCE), 2015 IEEE International Conference on, pages 408–411, Jan 2015.

[77] I. Tsekoura, G. Selimis, J. Hulzink, F. Catthoor, J. Huisken, H. de Groot, and C. Goutis. Exploration of cryptographic ASIP for wireless sensor nodes. In 17th IEEE International Conference on Electronics, Circuits, and Systems (ICECS), pages 827–830, Dec 2010.

[78] H. Marzouqi, M. Al-Qutayri, K. Salah, D. Schinianakis, and T. Stouraitis. A high-speed FPGA implementation of an RSD-based ECC processor. Very Large Scale Integration (VLSI) Systems, IEEE Transactions on, PP(99):1–1, 2015.

[79] Z. Wu and D. Liu. High-throughput trellis processor for multistandard FEC decoding. Very Large Scale Integration (VLSI) Systems, IEEE Transactions on, PP(99):1–1, 2015.

[80] Hee Kwan Eun, Sung Jo Hwang, and Myung Hoon Sunwoo. Novel IME instructions and their hardware architecture for ME ASIP. In International SoC Design Conference (ISOCC), pages 139–142, Nov 2010.

[81] J. Ohm, G.J. Sullivan, H. Schwarz, Thiow Keng Tan, and T. Wiegand. Comparison of the coding efficiency of video coding standards - including high efficiency video coding (HEVC). Circuits and Systems for Video Technology, IEEE Transactions on, 22(12):1669–1684, 2012.

[82] Chih-Da Chien, Chien-Chang Lin, Yi-Hung Shih, He-Chun Chen, Chia-Jui Huang, Cheng-Yen Yu, Chih-Liang Chen, Ching-Hwa Cheng, and Jiun-In Guo. A 252kgate/71mW multi-standard multi-channel video decoder for high definition video applications. In IEEE International Solid-State Circuits Conference. ISSCC 2007. Digest of Technical Papers., pages 282–603, 2007.

[83] Tsu-Ming Liu, Ting-An Lin, Sheng- Wang, Wen-Ping Lee, Jiun-Yan Yang, Kang-Cheng Hou, and Chen-Yi Lee. A 125 µW , fully scalable MPEG-2 and H.264/AVC video decoder for mobile applications. Solid-State Circuits, IEEE Journal of, 42(1):161–169, 2007.

[84] Ning Ma, Zhibo Pang, Jun Chen, H. Tenhunen, and Li-Rong Zheng. A 5Mgate/414mW networked media SoC in 0.13um with 720p multi-standard video decoding. In Solid-State Circuits Conference, 2009. A-SSCC 2009. IEEE Asian, pages 385–388, Nov 2009.

[85] A. Smolic, K. Mueller, N. Stefanoski, J. Ostermann, A. Gotchev, G.B. Akar, G. Triantafyllidis, and A. Koz. Coding algorithms for 3DTV: A survey. Circuits and Systems for Video Technology, IEEE Transactions on, 17(11):1606–1621, 2007.

[86] Chi-Cheng Ju, Tsu-Ming Liu, Yung-Chang Chang, Chih-Ming Wang, Chun-Chia Chen, Hue-Min Lin, Chia-Yun Cheng, Min-Hao Chiu, Sheng-Jen Wang, Ping Chao, MJ Hu, Hao-Wei Li, and Chung-Hung Tsai. A 775-uW/fps/view H.264/MVC decoder chip compliant with 3D Blu-ray specifications. In IEEE International Symposium on Circuits and Systems (ISCAS), pages 1440–1443, 2012.

[87] Pei-Kuei Tsung, Ping-Chih Lin, Kuan-Yu Chen, Tzu-Der Chuang, Hsin-Jung Yang, Shao-Yi Chien, Li-Fu Ding, Wei-Yin Chen, Chih-Chi Cheng, Tung-Chien Chen, and Liang-Gee Chen. A 216fps 4096*2160p 3DTV set-top box SoC for free-viewpoint 3DTV applications. In IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC), pages 124–126, 2011.

[88] A. Vetro, T. Wiegand, and G.J. Sullivan. Overview of the stereo and multi-view video coding extensions of the H.264/MPEG-4 AVC standard. Proceedings of the IEEE, 99(4):626–642, 2011.

[89] Axel Jantsch and Hannu Tenhunen. Networks on Chip. Kluwer Academic Publishers, 2003.

[90] Radu Marculescu, Umit Y. Ogras, Li-Shiuan Peh, Natalie Enright Jerger, and Yatin Hoskote. Outstanding research problems in NoC design: System, microarchitecture, and circuit perspectives. In IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 28(1), pages 3–21, Jan. 2009.

[91] G. Tuveri, S. Secchi, P. Meloni, L. Raffo, and E. Cannella. A runtime adaptive H.264 video-decoding MPSoC platform. In Design and Architectures for Signal and Image Processing (DASIP), 2013 Conference on, pages 149–156, Oct 2013.

[92] Comparative reliability analysis between AMBA and network-on-chip: An MPEG-2 case study. In SOC Conference, 2009. SOCC 2009. IEEE Interna- tional, pages 247–250, Sept 2009.

[93] Ning Ma, Zhonghai Lu, Zhibo Pang, and Lirong Zheng. System-level exploration of mesh-based NoC architectures for multimedia applications. In IEEE International SOC Conference (SOCC), pages 99–104, Sept 2010.

[94] Ning Ma, Zhonghai Lu, and Lirong Zheng. System design of full HD MVC decoding on mesh-based multicore NoCs. Microprocess. Microsyst., 35(2):217–229, March 2011.

[95] Yi Pang, Lifeng Sun, Jiangtao (Gene) Wen, Fengyan Zhang, Weidong Hu, Wei Feng, and Shiqiang Yang. A framework for heuristic scheduling for parallel processing on multicore architecture: A case study with multiview video coding. In IEEE Transactions on Circuits and Systems for Video Technology, Volume 19, Issue 11, pages 1658–1666, Nov. 2009.

[96] J. Duato, S. Yalamanchili, and L. Ni. Interconnection Network - An Engineering Approach. IEEE Computer Society Press, 1997.

[97] Ning Ma, Zhuo Zou, Zhonghai Lu, and Lirong Zheng. Implementing MVC decoding on homogeneous NoCs: Circuit switching or wormhole switching. In 23rd Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP), pages 387–391, March 2015.

[98] Hakima Chaouchi. The internet of things: Connecting objects. 2010.

[99] Jayavardhana Gubbi, Rajkumar Buyya, Slaven Marusic, and Marimuthu Palaniswami. Internet of Things (IoT): A vision, architectural elements, and future directions. Future Generation Computer Systems, 29(7):1645 – 1660, 2013.

[100] S. Sutardja. The future of IC design innovation. In IEEE International Solid-State Circuits Conference - (ISSCC), pages 1–6, Feb 2015.

[101] D. May. The XMOS architecture and XS1 chips. Micro, IEEE, 32(6):28–37, Nov 2012.

[102] Kangmin Lee, Se-Joong Lee, Donghyun Kim, Kwanho Kim, Gawon Kim, Joungho Kim, and Hoi-Jun Yoo. Networks-on-chip and networks-in-package for high-performance SoC platforms. In Asian Solid-State Circuits Conference, pages 485–488, Nov 2005.

[103] T. Jamil. RISC versus CISC. IEEE Potentials, 14(3):13–16, 1995.

[104] N. Ickes, Y. Sinangil, F. Pappalardo, E. Guidetti, and A.P. Chandrakasan. A 10 pj/cycle ultra-low-voltage 32-bit microprocessor system-on-chip. In 2011 Proceedings of ESSCIRC (ESSCIRC), pages 159–162, 2011.

[105] G. Jiang, Z. Li, F. Wang, and S. Wei. Mapping of embedded applications on hybrid networks-on-chip with multiple switching mechanisms. Embedded Systems Letters, IEEE, PP(99):1–1, 2015.

[106] Abbas Mazloumi and Mehdi Modarressi. A hybrid packet/circuit-switched router to accelerate memory access in NoC-based chip multiprocessors. In Design, Automation Test in Europe Conference Exhibition (DATE), pages 908–911, March 2015.

[107] G. Chen, M.A. Anders, H. Kaul, S.K. Satpathy, S.K. Mathew, S.K. Hsu, A. Agarwal, R.K. Krishnamurthy, V. De, and S. Borkar. A 340 mV-to-0.9 V 20.2 Tb/s source-synchronous hybrid packet/circuit-switched 16x16 network-on-chip in 22 nm tri-gate CMOS. Solid-State Circuits, IEEE Journal of, 50(1):59–67, Jan 2015.

[108] Mark A. Anders, Himanshu Kaul, Ram K. Krishnamurthy, and Shekhar Y. Borkar. Hybrid circuit/packet switched network for energy efficient on-chip interconnections. Low Power Networks-on-Chip, pages 3–20, 2011.

[109] M. Modarressi, H. Sarbazi-Azad, and M. Arjomand. A hybrid packet-circuit switched on-chip network based on SDM. In Design, Automation Test in Europe Conference Exhibition (DATE ’09), pages 566–569, April 2009.

[110] Angelo Kuti Lusala and Jean-Didier Legat. Combining SDM-based circuit switching with packet switching in a NoC for real-time applications. In IEEE International Symposium on Circuits and Systems (ISCAS), pages 2505–2508, May 2011.

[111] M. Modarressi, A. Tavakkol, and H. Sarbazi-Azad. Virtual point-to-point connections for NoCs. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 29(6):855–868, June 2010.

[112] N. Enright Jerger, Li-Shiuan Peh, and M. Lipasti. Circuit-switched coherence. In Second ACM/IEEE International Symposium on Networks-on-Chip (NoCS), pages 193–202, April 2008.

[113] Ning Ma, Zhuo Zou, Zhonghai Lu, and Lirong Zheng. Design and implementation of multi-mode routers for large-scale inter-core networks. Minor Revision (requires no re-evaluation), 2015.

[114] Ning Ma, Zhuo Zou, Yuxiang Huan, Stefan Blixt, Zhonghai Lu, and Lirong Zheng. A 101.4 GOPS/W reconfigurable and scalable control-centric embedded processor for domain-specific applications. Manuscript submitted for publication, 2015.

[115] Paul A. Merolla, John V. Arthur, Rodrigo Alvarez-Icaza, Andrew S. Cassidy, Jun Sawada, Filipp Akopyan, Bryan L. Jackson, Nabil Imam, Chen Guo, Yutaka Nakamura, Bernard Brezzo, Ivan Vo, Steven K. Esser, Rathinakumar Appuswamy, Brian Taba, Arnon Amir, Myron D. Flickner, William P. Risk, Rajit Manohar, and Dharmendra S. Modha. A million spiking-neuron integrated circuit with a scalable communication network and interface. Science, 345(6197):668–673, 2014.

[116] E. Painkras, L.A. Plana, J. Garside, S. Temple, F. Galluppi, C. Patterson, D.R. Lester, A.D. Brown, and S.B. Furber. SpiNNaker: A 1-W 18-core system-on-chip for massively-parallel neural network simulation. Solid-State Circuits, IEEE Journal of, 48(8):1943–1953, Aug 2013.

[117] Yu-Ting Chen, Jason Cong, Michael Gill, Glenn Reinman, and Bingjun Xiao. Customizable Computing. Morgan & Claypool Publishers, 2015.
