Parallel Ultra Low Power

João Pedro Alves Vieira

Thesis to obtain the Master of Science Degree in Electrical and Computer Engineering

Supervisors: Prof. Aleksandar Ilic, Prof. Leonel Augusto Pires Seabra de Sousa

Examination Committee
Chairperson: Prof. Gonçalo Nuno Gomes Tavares
Supervisor: Prof. Leonel Augusto Pires Seabra de Sousa
Member of the Committee: Prof. Paulo Ferreira Godinho Flores

December 2017

Acknowledgments

First of all, a special thank you goes to my family and closest friends, who supported me throughout this journey, especially when it got tough. I would like to thank Professor James C. Hoe and Professor Peter Milder from Carnegie Mellon University, who were tireless in helping to debug a major issue that was found. I would also like to thank my supervisors for their guidance and insights.

Resumo

The future of the portable electronics market will be built around the Internet of Things, where everyday objects will be connected to the internet and possibly controlled by other devices. These devices have started to appear in our daily activities and are expected to grow substantially in the near future, for example health monitors, light bulbs, thermostats, fitness wristbands, etc. Most of these wireless devices with sensors depend on batteries, for which an energy-efficient mode of operation is essential, achieved by developing devices with architectures capable of meeting both low-power and real-time performance requirements. This Thesis aims to improve the energy efficiency of a low-power processor, namely PULPino. To achieve this, hardware accelerators were added to it in a modular way, with the objective of encouraging the development of new accelerators by the open-source community. To test the viability of this approach, two different kinds of accelerators were individually attached. First, a cryptographic SHA-3 accelerator, which implements a widely used hash algorithm and can improve security in IoT devices. Second, an FFT accelerator, widely used in digital signal processing applications. Both accelerators were tested on PULPino regarding their acceleration capabilities and energy-efficiency improvements, achieving energy savings of up to 99% and 66%, and speedups of 185 and 3 times for SHA-3 and FFT respectively, relative to a non-accelerated version of the algorithms executed on PULPino with a RI5CY core.

Keywords: Internet of Things, Power Consumption, Embedded System, Energy Efficiency.

Abstract

The future of the portable electronics market will be built around the Internet of Things (IoT), where everyday objects will be connected to the internet and possibly controlled by other devices. In fact, examples of these devices, such as health monitors, light bulbs, thermostats and fitness wristbands, have already started to take part in our daily activities and are expected to experience tremendous growth in the near future. Most of these devices rely on battery-powered wireless transceivers combined with sensors, where it is essential to sustain energy-efficient execution by developing device architectures capable of delivering both low power and real-time computing performance. Within the scope of IoT applications, this Thesis aims to boost the energy efficiency of a state-of-the-art ultra-low-power processor, namely PULPino. This challenge was tackled by modularly attaching hardware accelerators to it. They connect to PULPino through a low-power, plug-and-play custom AXI-lite interface, with the objective of encouraging the development of new accelerators by the growing PULPino open-source community. To test the viability of this approach, two kinds of accelerators were individually attached: first, a cryptographic SHA-3 accelerator, implementing a commonly used hash algorithm that could improve the security of IoT applications; and second, an FFT accelerator, implementing an algorithm widely used in Digital Signal Processing (DSP) applications. Both accelerators were tested on PULPino for their speedup and energy-efficiency capabilities, achieving energy savings of up to 99% and 66%, and speedups of 185 and 3 times, for SHA-3 and FFT respectively, in comparison to a non-accelerated version of the algorithms executed on the PULPino RI5CY core configuration.

Keywords: Internet of Things, Ultra-low-power, Embedded System, Energy-Efficiency.

Contents

Resumo
Abstract
List of Figures
Glossary

1 Introduction
   1.1 Motivation
   1.2 Main Objectives
   1.3 Main Contribution of this Thesis
   1.4 Outline

2 Background
   2.1 State-of-the-Art: PULP - Parallel Ultra Low Power Platform
   2.2 PULPino
   2.3 Additional PULPino's Core Configurations
   2.4 Interconnect Networks
      2.4.1 Cache Coherent Interconnect for Accelerators (CCIX)
      2.4.2 GEN-Z
      2.4.3 Open Coherent Accelerator Processor Interface (OpenCAPI)
      2.4.4 Standards Comparison
   2.5 Hardware Accelerators
   2.6 Summary

3 Hardware/Software Co-design
   3.1 AXI Protocol
      3.1.1 AXI Interconnect
   3.2 Overall System Architecture
      3.2.1 Hardware Interface
      3.2.2 Software Interface
   3.3 Hardware Accelerators
   3.4 Summary

4 Implementation and Experimental Work
   4.1 Target Device
   4.2 System Configuration
   4.3 New AXI Interconnect Slave
   4.4 New Accelerator
   4.5 Summary

5 Experimental Results
   5.1 Software vs Hardware
      5.1.1 SHA-3
      5.1.2 FFT
   5.2 Power Efficiency
      5.2.1 SHA-3
      5.2.2 FFT
   5.3 Summary

6 Conclusions and Future Work

References

A Software-only Algorithms
   A.1 SHA-3
   A.2 FFT

List of Figures

2.1 PULP cluster with 4 cores
2.2 Comparison between RI5CY and ARM's Cortex-M4
2.3 RISC-V pipeline
2.4 LSU Software vs Hardware
2.5 Shuffle instruction diagram
2.6 Area breakdown of three core configurations
2.7 Energy consumption comparison between three core configurations
2.8 Use cases of CCIX
2.9 Comparison between typical CPU-memory interface and Gen-Z Media Controller
2.10 Gen-Z architecture aggregating different types of media devices
2.11 Comparison of CCIX, Gen-Z and OpenCAPI main features
2.12 Comparison between SPIRAL generated design and LogiCore FFT v4.1

3.1 PULPino’s SoC block diagram...... 24 3.2 PULPino’s memory map...... 25 3.3 AXI4 node overview...... 26 3.4 PULPino with attached accelerators block diagram...... 27 3.5 SHA-3 kernel overview architecture...... 31 3.6 SHA-3 padding module’s architecture...... 31 3.7 SHA-3 permutation module’s architecture...... 32 3.8 SHA-3 accelerator data path...... 33 3.9 SPIRAL Fast Fourier Transform(FFT) iterative architecture...... 35 3.10 SPIRAL Fast Fourier Transform(FFT) fully streaming architecture...... 35 3.11 FFT accelerator’s data path...... 36

4.1 Xilinx Zynq-7000 SoC block diagram overview
4.2 Implementation block diagram

5.1 SHA-3 computation speedup using hardware accelerator
5.2 FFT computation speedup using hardware accelerator
5.3 SHA-3 computation power versus energy ratio
5.4 SHA-3 accelerator energy saved on multiple frequencies

5.5 FFT accelerator, dynamic and static on-chip power consumption
5.6 FFT accelerator, static on-chip power consumption
5.7 FFT accelerator, energy saved vs computation time
5.8 FFT accelerator, computation energy vs energy ratio (SW/HW)

Acronyms

AMBA Advanced Microcontroller Bus Architecture.

APB Advanced Peripheral Bus.

AXI Advanced eXtensible Interface.

CCIX Cache Coherent Interconnect for Accelerators.

CNN Convolutional Neural Networks.

DCT Discrete Cosine Transforms.

DMA Direct Memory Access.

DSP Digital Signal Processing.

DVFS dynamic voltage and frequency scaling.

FFT Fast Fourier Transform.

FIR Finite Impulse Response.

FPU Floating Point Unit.

FSBL First-Stage Boot Loader.

I2C Inter-Integrated Circuit.

IoT Internet of Things.

IPC Instructions per Cycle.

LANs Local Area Networks.

LNU Logarithmic Number Unit.

LSU load-store-unit.

LUTs Look-Up Tables.

MCU Microcontroller.

NoCs Network-on-chip.

OpenCAPI Open Coherent Accelerator Processor Interface.

PL Programmable Logic.

PS Processing System.

PULP Parallel Ultra Low Power Processor.

PWM Pulse Width Modulation.

SAIF Switching Activity Interchange Format.

SANs System/Storage Area Network.

SHA Secure Hash Algorithm.

SoC System on Chip.

SPI Serial Peripheral Interface.

ULP Ultra-Low-Power.

WANs Wide Area Networks.

Chapter 1

Introduction

In order to satisfy the growing demands of the current consumer electronics market, it is estimated that there will be around 50 billion Internet-connected devices ("things") by 2020. Every year, Internet of Things (IoT) semiconductor device revenues increase by more than 30%, compared to a total semiconductor revenue growth rate of about 5.5% [1]. A large part of the IoT relies on battery-powered wireless transceivers combined with sensors, usually designated as motes (short for remote), enabling interaction with the environment. Such devices demand ultra-low-power circuits and are usually controlled by a Microcontroller (MCU) responsible for sensor interaction (i.e., for gathering the data) and light-weight processing. In the IoT topology, besides the basic motes that in their most primitive form gather data, there are also end nodes that can be configured as gateways for the motes, gathering data from them and eventually pre-processing it before sending it to the cloud; as a consequence, they reduce the amount of data that has to be communicated. One solution for end nodes is presented in this Thesis, namely PULP (Section 2.1), which is able to satisfy both the computing and the energy-efficiency requirements of IoT applications by taking advantage of parallelism. Based on the PULP solution, its further simplification into a more basic unit (PULPino) that fits the motes' requirements is the main goal of this Thesis, in order to provide an adequate substitute for the MCU with the same kind of functionalities (PWM, timers, SPI, I2C, etc.).
IoT nodes address a wide range of applications, which might be optimized in different ways to further achieve enhanced energy efficiency and performance. One approach that fits this goal is to attach hardware accelerators to the IoT node. In this Thesis, a set of well-known types of applications from which an IoT node might benefit was carefully chosen, from the Digital Signal Processing (DSP) and cryptography research areas. DSP applications include a wide variety of kernels that are executed repeatedly and might require considerable computational power. Cryptography on IoT systems is currently a hot topic: due to the reduced computational capabilities and low-power characteristics of such systems, providing the required security can be challenging. In order to reach higher levels of security in such embedded systems, it is necessary not only to reduce the computation time of cryptographic algorithms, but also to reduce the energy consumption associated with them.

To tackle these challenges and take a step towards the objective of boosting the energy efficiency of an ULP system targeting IoT nodes, open-source kernels were integrated into accelerators, one for each application (DSP [2] and cryptography [3]), following a modular approach, and an interface between the accelerators and the processor was developed. Thus, all accelerators attach to the processor and follow the same communication protocol. This approach encourages the open-source community to develop and share their own custom accelerator IPs, continuing the path towards open-source hardware that the release of PULPino (presented further ahead) was intended to foster.

1.1 Motivation

In general, IoT nodes can be characterized by a combination of different features and requirements, such as power constraints, size, communication, processing capabilities and availability. Most use cases of IoT node applications require Ultra-Low-Power (ULP) operation and high energy efficiency [4]. A multi-core energy-efficient platform appears as a good solution to address this issue, with the purpose of satisfying the computational requirements of recent IoT applications, which demand flexible processing of data originating from several sensors such as heartbeat, ECG, accelerometers, cameras and microphones [4]. The presented multi-core platform (PULP) was released as open-source hardware by making available only one of its single cores, enabling anyone to develop on top of it. Due to their intrinsic low-power profile, the released cores are suited for IoT applications, matching the requirements of restricted energy and hardware flexibility of IoT nodes.

1.2 Main Objectives

The main objective of this Thesis is to investigate and further exploit the capabilities of PULPino [5], an ultra-low-power core capable of working within power envelopes of a few mW, in the emergent area of IoT. The final goal of this Thesis is to boost the energy efficiency of PULPino-based systems in order to satisfy IoT constraints and requirements. The existing approaches in the literature are mainly focused on exploiting the capabilities of the PULP platform [6] for different purposes, such as computer vision applications [7–9], extensions and hardware acceleration [10–13] and heterogeneous programmable accelerators [14]. However, there are no existing scientific works that focus on PULPino (the single-core derivation of PULP), especially for IoT purposes. The work proposed in this Thesis aims at closing this gap.
This Thesis differs from the state-of-the-art by developing modular and easily attachable hardware accelerators to further boost the energy efficiency and performance of PULPino-based systems. The development of each accelerator is based on the integration of an open-source kernel for the target application. For the kernels to communicate with the processor, an interface was developed and defined as their communication protocol. Such an implementation requires the development of extra hardware that wraps the kernel into one top-level module, known as the accelerator. In the same way that PULPino encourages open-source hardware, this modular approach to PULPino accelerators stimulates the open-source community to further enhance and develop accelerators for different kinds of applications. Each user might test and deploy the accelerator that best fits their application. This might spark the start of an open-source library of accelerators for a wide variety of applications in the IoT domain.

1.3 Main Contribution of this Thesis

The main contribution of this Thesis is the development of an accelerator-processor interface targeting low-power embedded devices. The interface is based on the AXI-lite protocol, which is compatible with AXI-based buses widely used in embedded hardware designs. The interface provides an easy, plug-and-play manner of deploying an accelerator onto the processor's AXI bus: it does not require any further configuration besides the connection of the standard AXI signals, enabling faster and less painful custom hardware design and development targeting PULPino-based systems.

1.4 Outline

This Thesis is structured as follows. Chapter 2 presents the state-of-the-art approaches together with a detailed description of the PULPino core's architecture, i.e., the fundamental topics for the scope of this Thesis. Chapter 3 provides details about the hardware/software co-design and the overall system architecture used to attach and integrate the hardware accelerators. Chapter 4 provides details about the experimental work, the system's configuration and how to integrate and deploy an accelerator in the targeted platform. Chapter 5 presents an analysis of the results obtained from the experimental work. Finally, Chapter 6 draws the final conclusions and discusses future work.

Chapter 2

Background

This chapter presents the state-of-the-art, featuring relevant scientific works and the main characteristics of PULP, a novel cluster platform intended to be released as open-source hardware in 2018. Since PULPino represents a small part of PULP (one core), has already been released and is the main focus of this Thesis, its most relevant features are subsequently described. Additionally, the core configurations released in August 2017 are also featured. Since the goal of this Thesis is the deployment of hardware accelerators on the PULPino platform, which interfaces through an interconnect bus, state-of-the-art interconnect networks specially designed for such applications are addressed. Hardware acceleration is the last topic approached in this chapter, featuring an overview of state-of-the-art implementations of hardware accelerators related to the kinds of applications targeted in the scope of this Thesis.

2.1 State-of-the-Art: PULP - Parallel Ultra Low Power Platform

PULP is a joint project between the Integrated Systems Laboratory (IIS) of ETH Zurich and the Energy-efficient Embedded Systems (EEES) group of UNIBO. This project aims to develop an open-source scalable hardware and software platform with the objective of breaking the pJ/operation barrier within power envelopes of a few mW [15]. It supports OpenMP, OpenCL and OpenVX, as presented in [6], thus enabling an easier development of parallel algorithms, and it overcomes the power constraints of battery-powered applications, which are restricted to a power envelope of a few mW. PULP's architecture is tuned for efficient near-threshold operation, being optimized for 28nm UTBB FD-SOI technology, providing an extended range of supply voltage, body bias and improved electrostatic control [6]. A PULP cluster embeds a configurable number of RISC-V based cores with a shared instruction cache and scratchpad memory. Figure 2.1 illustrates a PULP cluster with 4 cores [5]. It has already been taped out with OpenRISC-based and RISC-V based cores and achieves an energy efficiency of 193 MOps/mW in a 28nm FDSOI technology [16–18].
It is also possible to scale the system, adjusting the computing capabilities and power consumption to the application, by clock-gating the SRAM blocks and the individual cores, whose number can be reduced down to a single core. The scaling is controlled by the power manager unit of the cluster and it is tightly coupled with dynamic voltage and frequency scaling (DVFS). As a result, the performance of a 28nm FDSOI implementation can be adjusted up to 2 GOPS by scaling the voltage from 0.32V to 1.15V with the cores operating at 500MHz.

Figure 2.1: PULP cluster with 4 cores and 8 TCDM banks of 8kB SCM and 8kB SRAM each, and a shared instruction cache of 4kB [5].

The PULP cluster is perfectly suited for IoT endpoint devices due to its efficiency and low power consumption, while still keeping high computational power [5]. Since only the individual cores are currently available under an open hardware license, and their architecture is already optimized for ULP, they can be intended for IoT remote nodes that do not require as much computational power as an endpoint device. Although PULP is a relatively recent platform (introduced in 2014), it has been the subject of several scientific works targeting different application areas and architecture extensions, as referenced further ahead. In the following text, some of the most relevant state-of-the-art works are described.

Computer Vision Applications

PULP has been used in a set of applications regarding energy-efficient computer vision, by taking advantage of its parallel computing capabilities and support for OpenMP [19]. In [7], as a use case for PULP, it is shown that a computationally demanding vision kernel based on Convolutional Neural Networks (CNN) can be quickly and efficiently switched from a low-power, low frame-rate operating point to a high frame-rate one when a detection is performed. PULP's performance was thereby scaled from 1x to 354x, reaching a peak performance/power efficiency of 211 GOPS/W.

A similar approach to the one from [7] was used in [8], considering a different use case that addresses a motion estimation algorithm for smart surveillance. A CNN-based algorithm was implemented for video surveillance, in which it is possible to scale from a low-power, low frame-rate state up to a high-performance state. Based on this, a sample benchmark was developed, intended for applications in the nano-UAV field, where PULP was used to accelerate the estimation of optical flow from frames produced by an ULP imager, with the objective of autonomous hovering and navigation, achieving results of 14 µJ per frame at 60 fps. Future work aims to improve PULP with the purpose of being competitive with HW accelerators while maintaining the possibility of being programmed with general-purpose software.

Further advances were made regarding smart camera sensors targeting ultra-low-power vision applications using PULP, and its usage within a case study of moving-object detection, as presented in [9]. A 10.6 µW low-resolution contrast-based imager featuring internal analog pre-processing was developed. This local processing allows the total amount of digital data to be sent out of the node to be reduced by 91%. This is done by having a context-aware analog circuit as the imager, which only dispatches meaningful post-processed data to the processing unit, thus reducing the sensor-to-processor bandwidth by 31x with respect to transmitting a full pixel frame.

Extensions and HW Acceleration

Hardware extensions were also explored in order to bring new energy-efficient solutions, e.g., in the case of the shared logarithmic number unit (LNU) implemented in [10], which is an energy-efficient alternative to a conventional Floating Point Unit (FPU). This LNU, optimized for ultra-low-power operation on the PULP multi-core system, is efficiently shared by all the cores. For typical nonlinear processing tasks, this design can be up to 4.2x more energy-efficient than a private-FPU design. This work was continued in [11], where a novel transformation for a hardware-efficient LNU implementation was considered. An area reduction of up to 35% was achieved while supporting additional functionality. Implemented in a 65nm technology, the novel LNU was demonstrated to be up to 4.35x faster than a standard FPU.

In [12], work was done towards the development of Hardware Convolution Engines (HWCEs), i.e., ultra-low-energy co-processors for accelerating convolutions. First implementations concluded that augmenting the PULP cluster with HWCEs could lead to an average boost of 40x or more in energy efficiency on convolutional workloads. Moreover, to take advantage of these improvements, the previously referred implementation was applied to computer vision in [13]. The ability of CNN-based classifiers to "compress" low-information-density data, such as images, into highly informative classification tags makes them suitable for IoT scenarios. A 65nm system-on-chip implementing a hybrid HW/SW CNN accelerator that meets the energy requirements of IoT targets was proposed.

Heterogeneous Programmable Accelerator

PULP was also used as an accelerator in heterogeneous systems for speeding up computation-intensive algorithms. In [14], a heterogeneous architecture was developed by coupling a Cortex-M series MCU with PULP, supporting the offload of parallel computational kernels from the MCU to PULP by taking advantage of the OpenMP programming model supported by PULP.

On the IoT scope, [20] proposed Fulmine, a SoC based on a tightly-coupled multi-core cluster for near-sensor data analysis, as a promising path to Internet of Things (IoT) endpoints. It minimizes the energy spent on communication as well as the network load and, at the same time, addresses security by making hardware-supported encryption functions available, while also supporting software programmability for regular computing tasks.

Still in the context of computer vision, but also using PULP as a heterogeneous programmable accelerator, [21] presents a novel implementation of an ultra-low-power system based on PULP together with a TI MSP430 microcontroller. It proposes a solution for the market of wearable devices, which currently cannot offer continuous data monitoring due to very short battery life. It could bring new functionalities, such as sports performance enhancement, elderly monitoring, disease management or several other applications in sports, fitness, gaming or even entertainment. PULP enables context classification based on convolutional neural networks, achieving very low-power operation with 2.2 mJ per classification while reaching speedups of up to 500x with respect to the TI MSP430 operating under the same power restrictions.

2.2 PULPino

PULPino is a small single-core system based on PULP, as previously mentioned in Section 2.1. As such, PULPino represents a first step towards the release of the full PULP as an open-source multi-core platform. Being part of PULP, PULPino inherits its IPs and cores, focusing on ease of use and simplicity. Its open-source release was done in February 2016 under the Solderpad hardware license, including complete RTL sources, all IPs, the RI5CY core based on RISC-V, an environment for RTL simulation and the complete FPGA build flow. In January 2016, the first ASIC, called Imperio, was taped out with PULPino on it. The main characteristics of the PULPino core are presented in Figure 2.2.

Figure 2.2: Comparison between RI5CY and ARM's Cortex-M4 [22].

PULPino offers reduced power consumption and area for the same manufacturing technology and conditions when compared with ARM's Cortex-M4. It also features an IPC close to one, full support for the base integer instruction set (RV32I), compressed instructions (RV32C) and partial support for the multiplication instruction set extension (RV32M). Non-standard extensions have been implemented featuring hardware loops, post-incrementing load and store instructions, ALU and MAC operations. Dot-product and sum-of-dot-products instructions on 8-bit and 16-bit data types allow up to 4 multiplications and accumulations to be performed in one cycle, consuming the same power as a 32-bit MAC operation. Support for real-time operating systems such as FreeRTOS was also added. A low-power mode is available, in which only a simple event unit remains active, able to wake the core up upon the arrival of an event or interrupt; all other components are clock-gated, consuming minimal power. The core with the extended ISA is on average 37% faster for general-purpose applications. It can also achieve average speedups on convolutions of up to 3.9x [5]. With these extensions, the core is only 15x less energy-efficient than a state-of-the-art hardware accelerator, but has the advantage of being a general-purpose architecture that can be used for a wide range of applications [5, 15].

Pipeline Architecture

Since the target is ULP operation, some considerations about the pipeline should be taken into account. The number of pipeline stages is one key aspect of the design: a high number of stages increases the overall throughput and allows higher frequencies, but it also increases the latency and the likelihood of data and control hazards. In this case, some high-end optimizations (such as speculation and branch prediction) might not be the best approaches to overcome this issue, since they cause an increase in the overall power consumption. The organization of the pipeline is illustrated in Figure 2.3; it consists of four stages: instruction fetch (IF), instruction decode (ID), execute (EX) and write-back (WB). The ALU is extended with fixed-point arithmetic and enhanced multipliers that support dotp operations (while still keeping the same timing), since the critical path is mainly determined by the memory interface. The cluster can achieve frequencies of 350-400MHz under typical conditions in a 65nm implementation, reaching higher frequencies than commercially available MCUs, which usually operate in the range of 200MHz [5].

Figure 2.3: Simplified block diagram of the RISC-V four-stage pipeline [5].

Instruction Fetch Unit

The RISC-V standard supports compressed instructions, which are 16-bit wide. Since the instruction cache holds both standard and compressed instructions, it is possible to get misaligned instructions when an odd number of 16-bit instructions is followed by a 32-bit instruction, which would require an additional fetch cycle and stall the processor. Thanks to the pre-fetch buffer, shown in the block diagram of Figure 2.3, it is possible to fetch 128 bits (a complete cache line) instead of a single instruction, which reduces the accesses to the shared instruction cache, since 4 to 8 instructions can be fetched in one access. The misaligned-instruction problem is solved by having an additional register that stores the last instruction. This register contains the lower 16-bit part of a 32-bit instruction, which can be combined with the higher part and forwarded to the Instruction Decode stage. This prevents stalls when a misaligned instruction occurs, except for jumps, branches or hardware loops [5].

Hardware Loops

Hardware loops, or zero-overhead loops, allow a piece of code to be executed multiple times while eliminating the overhead of branches and of updating a counter, being a very common feature in DSP. The hardware loop controller can be configured through software by defining the start address (pointing to the first instruction of the loop), the end address (pointing to the last instruction to be executed in the loop), and by setting the counter value that is decremented every time the loop is completed. These configuration registers are mapped in the Control/Status Register (CSR) block illustrated in Figure 2.3. This configuration can also be handled during interrupts or exceptions through a set of dedicated instructions, which are automatically inserted by the modified GCC compiler [5, 23]. Moreover, improvements were made regarding a loop buffer that acts as a cache holding the loop instructions, thus eliminating fetch delays [24].
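
As an illustration of the kind of code that benefits from this feature, the C fragment below shows a fixed-trip-count loop that the modified GCC toolchain can, in principle, map onto a hardware loop. The function and variable names are invented for this sketch and are not part of the PULPino sources.

```c
#include <stdint.h>

/* Illustrative only: a regular, counted loop of the kind the modified
 * PULP GCC toolchain can turn into a hardware (zero-overhead) loop.
 * Once the start/end addresses and the trip count are configured, the
 * body repeats without per-iteration branch or counter-update
 * instructions. The regular x[i]/y[i] access pattern also benefits from
 * the post-incrementing load/store instructions described next. */
void scale_accumulate(int32_t *y, const int32_t *x, int32_t c, int n)
{
    for (int i = 0; i < n; i++) {
        y[i] += c * x[i];
    }
}
```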

Load-store Unit

The load-store unit (LSU) is responsible for accessing the data memory, and it can load and store 32-bit, 16-bit and 8-bit words. A post-increment addressing mode was added to this unit, which performs a load/store instruction while simultaneously increasing the base address by a specified offset. This feature leads to speedups of up to 20% when memory access patterns are regular, as normally found in loops (e.g., matrix multiplication). The address increment is embedded in the memory access instructions, thus eliminating the need to use separate instructions to handle pointers.

Support for unaligned data memory accesses, which can happen frequently, is also provided. This support detects when an unaligned access occurs and stores the high-word data in a temporary register, which is combined with the lower word when the second access is issued. This feature has advantages in code size, as shown in Figure 2.4, and in the number of cycles required to access unaligned data.

Figure 2.4: a) Support for unaligned access in software (5 instructions/cycles) and b) with hardware support (1 instruction, 2 cycles) [5].

Packed SIMD Support

Support for subword parallelism is provided in PULPino, which consists in packing multiple sub-words into a word and processing the whole word at once. This allows data-level parallelism to be exploited through small-scale SIMD processing, since the same instruction is applied to all the sub-words within the whole word [25]. Since this is a 32-bit processor, it can compute up to four bytes in parallel. This is advantageous for IoT applications because data acquired from sensors is frequently 8-bit or 16-bit. To accomplish this, the ALU integrates a vectorized datapath segmented into two or four lanes, where vectorial operations like addition and subtraction are computed as four sub-operations. A shuffle instruction was also implemented, able to generate an output that is any combination of sub-words of the two input operands, as illustrated in Figure 2.5, where a third operand sets the selection criteria [5].

Figure 2.5: Shuffle instruction example diagram. Allows the combination of bytes from r1 and r2 through mask encoding [5].
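
To make the shuffle semantics more concrete, the following plain-C model builds each output byte from one of the eight source bytes of the two input operands, selected by a per-byte field of the mask. The mask encoding used here is an assumption made purely for illustration; it is not the actual RI5CY instruction encoding.

```c
#include <stdint.h>

/* Plain-C sketch of a byte shuffle: each result byte is picked from one
 * of the eight source bytes of r1 and r2, according to a 3-bit selector
 * per byte in the mask operand (illustrative encoding only). */
static uint32_t shuffle_bytes(uint32_t r1, uint32_t r2, uint32_t mask)
{
    uint8_t src[8];
    for (int i = 0; i < 4; i++) {
        src[i]     = (uint8_t)(r2 >> (8 * i));  /* selectors 0..3: bytes of r2 */
        src[i + 4] = (uint8_t)(r1 >> (8 * i));  /* selectors 4..7: bytes of r1 */
    }
    uint32_t out = 0;
    for (int i = 0; i < 4; i++) {
        uint8_t sel = (uint8_t)((mask >> (8 * i)) & 0x7u);
        out |= (uint32_t)src[sel] << (8 * i);
    }
    return out;
}
```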

Fixed-Point Support

Fixed-point operations can be thought of as a low-cost floating-point alternative. Considering that many applications do not require floating-point accuracy, simpler fixed-point arithmetic can be used, saving power and area [5]. Fixed-point numbers are usually given in the Q-format, which means that a number is represented in the format Qm.n, where m is the number of integer bits and n the number of fractional bits. These core extensions were designed to support any Q-format, only limited by m + n < 32. Conversions from one fixed-point representation to another require the number to be normalized by shifting its bits; before eliminating the extra bits, it might be desirable to perform a rounding operation to improve the accuracy. Hence, an add-round-normalize instruction is provided, which can save code size and execution time by adding the rounding term while shifting the number by the immediate value. A clip instruction is also available to check whether the number lies between two values and saturate the result to the upper or lower bound if it is out of range [5].
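
A minimal C sketch of what the round-and-normalize step amounts to after a Q-format multiplication is given below. It is a functional illustration only, assuming a signed Qm.n representation with n between 1 and 30; it is not PULPino code and does not model the single-instruction fusion described above.

```c
#include <stdint.h>

/* Multiply two Qm.n numbers: the 64-bit intermediate keeps full
 * precision, the added constant rounds by half an LSB of the final
 * format, and the shift normalizes back to Qm.n. Assumes 1 <= n <= 30. */
static int32_t q_mul(int32_t a, int32_t b, unsigned n)
{
    int64_t prod = (int64_t)a * (int64_t)b;   /* intermediate with 2n fractional bits */
    prod += (int64_t)1 << (n - 1);            /* rounding term */
    return (int32_t)(prod >> n);              /* normalize to n fractional bits */
}

/* Saturating clip, mirroring the clip instruction mentioned above. */
static int32_t q_clip(int32_t x, int32_t lo, int32_t hi)
{
    return (x < lo) ? lo : (x > hi) ? hi : x;
}
```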

Iterative Divider and Multiplication

A long integer division algorithm was chosen because it reuses the existing comparators, shifters and adders of the ALU. This divider has a variable latency of 2 to 32 cycles, depending on the input operands. Despite being slower than a dedicated divider, it has a low area overhead [5]. The multiplier block has three modules: a 32x32 multiplier, a fractional multiplier and two dot-product multipliers. It is capable of multiplying two vectors and accumulating the result in one cycle. The dot-product multiplier allows up to four multiplications and three additions to be performed in one operation, as follows: d = a[0] · b[0] + a[1] · b[1] + a[2] · b[2] + a[3] · b[3], where a and b are 32-bit registers, a[i] and b[i] correspond to individual bytes of these registers, and d is a 32-bit result. This multiplier also offers fixed-point support, which may require rounding and normalization after the operation, both accomplished in a single instruction. It is important to note that, for these operations, rounding and then normalizing reduces the overall error [5].
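
The following plain-C function is a functional reference for the dot-product operation above, with the four bytes of each 32-bit register treated as signed 8-bit values (the signedness is an assumption made for this illustration); RI5CY performs the equivalent work, optionally with accumulation, in a single instruction.

```c
#include <stdint.h>

/* Functional model of d = a[0]*b[0] + a[1]*b[1] + a[2]*b[2] + a[3]*b[3],
 * where a[i] and b[i] are the bytes packed into the 32-bit operands. */
static int32_t dotp_bytes(uint32_t a, uint32_t b)
{
    int32_t d = 0;
    for (int i = 0; i < 4; i++) {
        int8_t ai = (int8_t)(a >> (8 * i));   /* extract byte i of a */
        int8_t bi = (int8_t)(b >> (8 * i));   /* extract byte i of b */
        d += (int32_t)ai * (int32_t)bi;       /* multiply-accumulate  */
    }
    return d;
}
```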

2.3 Additional PULPino’s Core Configurations

After the initial release of PULPino in February 2016, based on the previously presented RI5CY architecture [5], new cores with different configurations were additionally released in August 2017:

• RI5CY + FPU: the same previously presented RI5CY core, enhanced with a single-precision Floating Point Unit (FPU) compliant with the IEEE-754 standard for floating-point arithmetic [26].

• Zero-riscy: an area-efficient core with a 2-stage pipeline, implementing the RISC-V RV32-IMC instruction set, like the RI5CY core [26].

• Micro-riscy: An even smaller core than the previous ones, implementing RV32-EC instruction set. It features only 16 general purpose registers and no hardware multiplication support [26].

The new core alternatives give the hardware designer improved control over area efficiency, allowing the core that best fits the target application to be chosen and moving towards more efficient and area-effective embedded system designs. Fig. 2.6 shows an area breakdown of the three different core architectures.

Figure 2.6: Area breakdown of the three core configurations. Results from an ASIC synthesis run [26].

As can be noticed, the RI5CY core has the biggest area footprint, which might seem a drawback. However, it presents itself as the best choice for DSP applications, featuring a set of hardware extensions and optimizations for that purpose, as presented in Section 2.2. Zero-riscy has its area reduced by more than half with respect to the full-featured RI5CY core. Designed to be small and efficient, it operates with a 2-stage pipelined, area-efficient RISC-V based core that can be configured to fully support four ISA configurations: the RV32I base integer instruction set, the RV32E base integer instruction set, the RV32C standard extension for compressed instructions and, finally, the RV32M integer multiplication and division instruction extension. Moreover, to reduce the area footprint, some enhanced instructions supported by RI5CY and RI5CY + FPU were removed, namely hardware loops, post-increment load/store, fixed-point, bit manipulation and packed-SIMD instructions. The addition of the FPU, although not shown in the graph under analysis, implies an extra area of 28.6 kGE (kilo gate equivalents), a 1.7x area increase, when added to the RI5CY core. No further details about area or energy consumption are available for the RI5CY + FPU core. Micro-riscy, the smallest core, is 3.5x smaller than RI5CY. Even more optimized than the previous ones, it mainly targets ULP control applications in which area and power consumption requirements prevail over performance. As depicted in Fig. 2.7, micro-riscy has the lowest energy consumption when executing Runtime (a control-intensive application with very few ALU operations). In that figure, all the cores run at a low frequency of 100 kHz, implemented in a UMC65 technology at 1.2V and 25°C. As can be seen, depending on the application benchmark, some cores perform better, due to their architectures being optimized for such applications. For 2D convolution, RI5CY has the lowest energy consumption due to its enhanced instructions, which improve performance on DSP applications [26].

Figure 2.7: Energy consumption comparison between the three core configurations [26].

2.4 Interconnect Networks

Computer-based systems consist of individual components connected together and communicating with each other. This reasoning does not only apply to internal computer components, but may also be applied to computers themselves, resulting in networks of connected computers. These networks rely on communication standards to establish rules on how the data is converted and transferred among the several components. There are four networking domains into which interconnect networks may be classified, based on the number of connected devices and their proximity:

1. Wide Area Networks(WANs) - Connect a wide range of distributed computer systems around the globe over large distances.

2. Local Area Networks(LANs) - Usually to connect computer systems across small areas of a few kilometers. May also be used in machine rooms or through buildings.

3. System/Storage Area Networks (SANs) - Used to connect multiple processors, or in processor-memory connections within multiprocessor or multicomputer systems. SANs may also be used within servers and data center environments, where connections between storage and I/O components are required, usually with a distance span of a few tens of meters between them.

4. Networks-on-chip (NoCs) - Networks used for interconnecting micro-architecture functional units within chips, e.g., caches, processors, register files or IP cores. Currently this kind of network supports a few tens to a few hundred of such devices connected together, within a distance on the order of centimeters. Nowadays, some proprietary designs are gaining wider use, e.g., Sonics' Smart Interconnect, IBM's CoreConnect or ARM's AMBA. There are also recent standards with the purpose of improving the interconnectivity between accelerators and other system components (e.g., CCIX, Gen-Z and OpenCAPI) [27].

The three new standards categorized above as NoCs were announced in 2016 and are being developed towards the goal of optimizing and easing the connection between accelerators and processors in a tightly-coupled manner. The driving forces behind these new standards are related to a better exploitation of new and emerging memory/storage technologies (streamlined software stacks, direct memory access, etc.) and to solutions based on open standards. There are three distinct standards mainly because different groups of companies have been working to solve similar problems, and therefore each approach has its differences; in the near future, a convergence of standards is likely. In the next sections, all three standards are explained in more detail.

2.4.1 Cache Coherent Interconnect for Accelerators (CCIX)

CCIX was founded with the purpose of enabling a new class of interconnect, driven by emerging acceleration applications such as 4G/5G wireless technology, in-memory databases, network processing and machine learning. It allows processors based on different ISAs to perform peer processing with multiple acceleration devices such as FPGAs, GPUs, custom ASICs, etc. CCIX uses a tightly coupled interface between processor, accelerators and memory, together with hardware cache coherence across the links. Data sharing does not require drivers or interrupts. Fig. 2.8 presents some possible system configurations using the CCIX specification. Some of CCIX's main targets are: low-latency main memory expansion; extending processor cache coherency to network/storage adapters, accelerators, etc. [28].

Figure 2.8: Use cases of CCIX [28].

2.4.2 GEN-Z

GEN-Z is defined as a high-performance, low-latency, memory-semantic fabric enabling communication among all the devices in a system. GEN-Z enables the creation of an ecosystem in which a wide variety of high-performance solutions can communicate together, allowing a unification of communication paths and simplifying software through load/store memory semantics.

Figure 2.9: Comparison between (a) a typical CPU-memory interface and (b) decoupling the media from the SoC with a Gen-Z media controller.

A typical CPU/memory interface is shown in Fig. 2.9a, in which the SoC (containing a media controller) connects to a DRAM interface over a memory bus. In comparison, as depicted in Fig. 2.9b, the media controller is decoupled from the SoC, being placed where it makes more sense to be: with the media module. This important change, made possible by the Gen-Z fabric, allows every compute entity to be agnostic and disaggregated. The Gen-Z architecture easily aggregates different types of media, devices or resources and allows them to scale independently from any other resource in the system, as shown in Fig. 2.10. It may also function as a gateway to other networks, e.g., Ethernet and InfiniBand [29].

Figure 2.10: Gen-Z architecture aggregating different types of media devices [29].

2.4.3 Open Coherent Accelerator Processor Interface (OpenCAPI)

OpenCAPI is an open interface protocol that allows any processor to attach to coherent user-level accelerators, I/O devices and advanced memories (accessible via read/write or DMA semantics). The semantics used to communicate with the multiple components are agnostic to the processor architecture. The main attributes of OpenCAPI are high bandwidth, low latency and being based on virtual addresses, which are handled on the host processor to simplify the attached devices and, consequently, ease the devices' interoperability between systems of different architectures [28].

2.4.4 Standards Comparison

The three emerging standards referred to before are likely to converge in one consortium between several companies, in a way that they might complement each other and evolve into a more advanced and complete standard. Figure 2.11 summarizes the main specifications detailed before, making it possible to compare their main features. In essence, Gen-Z is a new data access technology that allows operations with directly attached or disaggregated memory/storage; CCIX enables coherency among several heterogeneous components; and OpenCAPI provides coherent accesses between system memory and accelerators, through virtual addresses supported by the host processor.

Figure 2.11: Comparison of CCIX, Gen-Z and OpenCAPI main features [28].

All these standards are suited for high-performance IPs and applications, which usually do not target low-power operation and require high-throughput data rates (as seen in Fig. 2.11). Within the scope of this Thesis, a low-power interconnect technology is required, one that aims at simplicity and a reduced hardware footprint while still meeting the processor/accelerator data bandwidth requirements. Therefore, none of the state-of-the-art standards presented is suited to be implemented as the processor-accelerator interface in a PULPino-based system.

2.5 Hardware Accelerators

A hardware accelerator is a specialized unit designed to perform a very specific task or set of tasks, achieving higher performance and energy efficiency than a general-purpose CPU for that specific application. The use of accelerators is not new; in fact, it dates back to the 1980s, with the deployment of floating-point co-processors as one of the first adoptions of accelerators. Since then, they have been widely featured in SoC architectures for embedded system designs over the past few decades [30]. Nowadays, hardware accelerator design faces challenges such as flexibility and design cost [31]. A design is flexible when it may address a large set of applications with the same initial design. Most accelerators are based on fixed functions only used in a specific target application; therefore, providing high flexibility requires a large number of accelerators to cover a wide set of applications. Accelerators are especially suited for real-time applications, I/O processing, data streaming (network traffic, video, audio, etc.), specific "complex" functions (DCT, FFT, exp, log, etc.) or specific "complex" algorithms such as neural networks. Designing such systems as hand-written RTL implementations is highly tedious, time-consuming and consequently costly. Possible solutions to overcome these issues may be found in high-level synthesis tools, which provide an automated way to generate digital hardware by interpreting an algorithm written in a higher-level language (C, C++, SystemC, Chisel, etc.) and translating it into synthesizable hardware that fits the application's goals. Xilinx has been paving the way for such implementations with the Vivado HLS tool [32]. Another approach is to create universal and optimized interfaces between processors and accelerators, as shown in Section 2.4. The development of custom hardware solutions is usually associated with high design costs, due to the many hours, weeks or even years that some designs take to complete, mainly because of traditional design methods: the design is mostly intuition-driven, decisions have to be made upfront and may lead to costly miscalculations [31]. A possible and attainable solution is presented in this Thesis, by defining a light-weight interface between the processor and the accelerator, based on the AXI-lite standard. It allows the reuse of accelerator hardware designs in any system that supports the AXI specification, which is widely used. Associated with an embedded processor, it enhances flexibility and decreases design cost by reusing the same accelerators in different platforms and scenarios. Accelerator performance analysis takes into account a critical parameter, the speedup: it measures how many times the use of an accelerator reduces the execution time that a non-accelerated system would take to complete the exact same task. The speedup is influenced by the accelerator computing time, the synchronization with the processor and the data transfer bandwidth, which are usually the main bottlenecks of the system. This analysis translates into the equation for the accelerator's total execution time:

$T_{acc} = t_{in} + t_{comp} + t_{out},$

in which $t_{in}$ and $t_{out}$ represent the time taken to transfer the input and output data into and out of the accelerator, respectively, and $t_{comp}$ is the accelerator's computing time.
Some problems in specific applications were tackled by using PULP itself as an accelerator, as referred to in Section 2.1, although none of those works combines PULPino with hardware acceleration as done in this Thesis. In the mentioned section, some applications of PULP target security in IoT nodes [20] and Digital Signal Processing (DSP) [10][11]. However, all of them are tightly coupled with PULP's architecture, making it impossible to easily switch between different kinds of accelerators without the effort of redesigning the whole architecture and hardware once again. As said before, the high cost of custom hardware design encourages the reuse of IPs, which is one of the main goals of this Thesis, while still targeting low-power embedded applications, just like PULP.
Cryptographic and Digital Signal Processing (DSP) functions are highly suitable for hardware acceleration. They usually require a considerable amount of operations per input and are frequently used throughout an application's algorithm. These functions are generally computed inefficiently on general-purpose CPUs, which lack extensions for such specific functions. Examples of improvements for this kind of function are [33] and [34], in which SIMD instruction set extensions, applied to the ARM NEON and Intel AVX (only in [33]) architectures, were developed to support applications such as SHA-3, Keyak and Ketje. They achieved significant performance improvements over software, in comparison with cryptographic applications executed on general-purpose CPUs. SHA-3 has not only been improved by SIMD instruction extensions, but also by specialized hardware accelerators, as proposed in [35]. That accelerator, implemented as a co-processor compliant with the RoCC interface, is based on a parametrized implementation using automated tools and is integrated with a Rocket RISC-V processor. Developed in the Chisel hardware construction language, it offers different levels of configuration regarding performance, energy efficiency and size. A tool was also developed to help the designer choose the configuration that achieves an optimal operating point, based on the requirements of the system under development. Besides the previous cryptographic applications, [36] additionally explores RSA and Blowfish cryptography, with the goal of providing a design that focuses on optimizing the critical path of each cryptographic algorithm. [36] is based on a customized co-processor to improve the overall throughput on an FPGA platform, achieving reductions in energy consumption and improvements in performance over a standard software implementation.

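Returning to the execution-time model above: although not stated explicitly in the text, the speedup it refers to can be written directly in terms of that model, with $T_{sw}$ denoting the execution time of the software-only implementation:

$$S = \frac{T_{sw}}{T_{acc}} = \frac{T_{sw}}{t_{in} + t_{comp} + t_{out}}$$

Acceleration therefore only pays off when $S > 1$, i.e., when the data-transfer overhead $t_{in} + t_{out}$ does not outweigh the reduction in computation time.
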
Besides security, Digital Signal Processing (DSP) is also a huge subject on which most of today's computing-intensive applications are based [37]; for instance, functions like adaptive Finite Impulse Response (FIR) filters, the Fast Fourier Transform (FFT) and Discrete Cosine Transforms (DCT) are required in a wide range of applications, such as audio/video compression and decompression, convolution, encryption, computer vision, digital switches and many other systems [38]. Some of these applications are currently used in novel IoT or ultra-low-power systems, as targeted in [7], [8], [9], [19] and [21] regarding computer vision (as presented in Section 2.1). Apart from computer vision, PULPino provides several improvements that enhance performance and reduce energy consumption (see Section 2.2) for the kinds of applications DSP addresses. Today's main FPGA manufacturers already provide full-featured DSP cores, e.g., the FFT cores [39, 40] (the first provided by Xilinx) and the FIR cores [41, 42]. A more optimized solution that claims to beat such proprietary core implementations is SPIRAL, a hardware generation framework and system for linear transforms introduced in [43]. It takes a problem specification as input, together with additional configurations that shape the datapath; the configuration method resembles the ones used in the previously mentioned cores. The system automatically customizes the algorithm, maps it to a datapath and finally produces a synthesizable RTL description, ready for FPGA or ASIC implementation. A comparison between these generated designs and the previously mentioned Xilinx FFT core was performed and is portrayed in Fig. 2.12. The graphs expose throughput and latency for a DFT design with 1024 samples (16-bit fixed point) on a Xilinx Virtex-5 FPGA. It was synthesized with Xilinx ISE, and all the data relative to performance and cost were obtained after the place-and-route stage. As seen in Figure 2.12, the SPIRAL design outperforms the Xilinx LogiCore FFT core in both latency and throughput. It might have an extra area cost in comparison with the Xilinx one, although that can be managed by the designer in the generation tool [2] (further detailed in Section 3.3).

Figure 2.12: Comparison between the SPIRAL-generated design and the Xilinx LogiCore FFT v4.1: (a) latency (µs) vs. area (slices); (b) throughput (million samples/s) vs. area (slices) vs. performance (Gop/s). Results from a 1024-sample DFT (16-bit fixed point) on a Xilinx Virtex-5 FPGA [2].

2.6 Summary

In this chapter, the state-of-the-art of the PULP platform was presented, starting with its main principles of operation and further elaborating on the relevant scientific works performed on this platform, namely in computer vision applications, extensions and HW acceleration, and heterogeneous programmable accelerators. Afterwards, a more detailed description of the PULPino core was presented, addressing its main features that are relevant for the scope of this Thesis, and the new core configurations for PULPino were introduced. As the objective of this Thesis is related to hardware acceleration, this topic was addressed along with the state-of-the-art interconnect networks related to it.

Chapter 3

Hardware/Software Co-design

This chapter presents the proposed hardware/software architecture, which, from a high-level system view, is composed of PULPino and several attached accelerators. PULPino, the processing unit made available as open source, was adapted and modified to accommodate attachable accelerators on the existing AXI bus. The chapter starts with a presentation of the main protocol used (AXI) and its main sub-components. In order to introduce the hardware developed under the goal of this Thesis, the overall system architecture is then presented, complemented by a detailed description of the software and hardware interfaces. Furthermore, the architecture of each accelerator is explained.

3.1 AXI Protocol

This section introduces the specifications, communication protocols and general operation of the AXI protocol, which allows the main system components in PULPino to be memory mapped, making it possible to access those components from the core with simple load/store instructions. AMBA AXI4, following its antecedent AXI3, is an open standard specified by ARM. It facilitates the connection and management of functional blocks in SoC designs, especially in those with a large number of controllers and peripherals. AXI4 was introduced with the Advanced Microcontroller Bus Architecture (AMBA)-4 in 2011, keeping backwards compatibility with the previous AXI3 specification. It is now a de facto standard for embedded processors, being royalty-free and having well-documented specifications. AMBA facilitates the way design blocks connect to each other, encouraging modular systems that do not depend on a particular technology and can be reused across different systems and applications, while maintaining high-performance and low-power communication.

In essence, AMBA integrates a set of protocol specifications [44]:

• Advanced eXtensible Interface(AXI)4 - Further explained in Section 3.1.

• AXI Coherency Extensions (ACE) - Extends AXI with additional coherency support that allows multiple processors to share memory in a coherent manner.

• Advanced High-performance Bus (AHB) - supports larger bus widths (64/128 bit) with full-duplex operation; AXI4 offers increased performance over it.

• Advanced Peripheral Bus (APB) - designed to interface with peripherals that have simple interfaces and low-power profiles. It provides low-bandwidth control accesses using a reduced-complexity signal list derived from AHB.

AXI4 has several subsets, namely AXI-full, AXI-lite and AXI-Stream. These may interface with each other through bridges and protocol converters. AXI-full targets high-performance, high-clock-frequency system designs, mainly featuring support for: multiple region interfaces; quality-of-service signaling; unaligned data transfers using data bursts; separate control and data phases; separate read and write data channels; simultaneous reads and writes; multiple outstanding addresses; and out-of-order transactions. All AXI transactions operate on the basis of a valid/ready signal handshake. The source of the data asserts the valid signal when it has valid data to transfer. Once the destination is ready to receive data, it asserts the ready signal. When both valid and ready are asserted, data is transferred between source and destination. Another useful signal is the last signal (tlast in AXI-Stream), responsible for marking the last data beat of a burst transaction. The protocol operates in a master-slave paradigm, meaning that each end of the connection is required to be either a master or a slave. It uses 5 different channels: read address, read data, write address, write data and write response. It is capable of bursts of up to 256 beats, meaning that it allows 256 individual data transfers in a single transaction based on the same starting address.
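As a behavioural illustration only (not part of the AXI specification, and with the names Channel, valid, ready and cycle chosen freely), the following C++ sketch models the valid/ready handshake of one channel: a beat is transferred only in a cycle where both signals are asserted.

#include <cstdint>
#include <iostream>

// Minimal behavioural model of one AXI channel.
struct Channel {
    bool     valid = false;   // driven by the source of the data
    bool     ready = false;   // driven by the destination
    uint32_t data  = 0;

    // Returns true when a transfer happens in the current cycle.
    bool cycle() const { return valid && ready; }
};

int main() {
    Channel ch;
    ch.valid = true;  ch.data = 0xCAFEBABE;    // source has valid data
    ch.ready = false;                          // destination not ready: no transfer
    std::cout << "cycle 0: transfer = " << ch.cycle() << '\n';
    ch.ready = true;                           // destination ready: beat is accepted
    std::cout << "cycle 1: transfer = " << ch.cycle() << '\n';
    return 0;
}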

AXI-Lite is a lighter implementation of the AXI-full protocol. It uses the same 5 channels, although bursts are not allowed, and each transfer is limited to a data width of 32 or 64 bits. AXI-lite is suited for simple implementations that do not have high bandwidth requirements; it is typically used to set up and control components that require configuration before use. The ease of configuration is related to the fact that AXI-Lite supports from 4 to 512 individually addressable slave registers, each of which may be written to or read from. With its simpler implementation comes a smaller footprint, advantageous in ultra-low-power systems. Despite its reduced performance, it is possible to bridge back to AXI-full, the more complex specification that PULPino uses, allowing interaction between both protocol variants and easily bridging between low- and high-throughput systems.

AXI-Stream uses a single data channel in which the data flows in one direction only, from master to slave. It is designed for applications that require high-bandwidth data transfers and low latency. Since it only has a data channel, it does not require addresses to carry out data transactions; these transactions start once the slave and the master are ready to receive and to send data, respectively. It supports unlimited burst lengths, making it suited for applications that require data streaming. To indicate the end of a transaction, it uses the tlast signal, asserted on the last word sent over the data channel.

AXI-full is used in the current PULPino design to connect components such as the debug unit and the SPI slave. Other components, like the instruction and data memories, the core and the peripherals, make use of AXI-full indirectly, since they require a bridge to translate between different interfaces and to communicate through an AXI interconnect (further detailed in Section 3.1.1). On the other hand, AXI-lite, being a much simpler and lighter protocol, is responsible in the design proposed in this Thesis for handling the communication between core and accelerator. One of the reasons AXI-Stream is not used in the proposed design is that AXI-lite provides a more generic interface, supporting both configuration and a "stream" of data. AXI-Stream is mainly suited for data streaming, and its lack of individually addressable slave registers makes it less desirable for configuring components. For instance, an AXI-lite slave register can be used to provide a synchronization mechanism between processor and accelerator.

3.1.1 AXI Interconnect

PULPino integrates multiple components using an interconnect network supported by the AMBA standard. As depicted in Fig. 3.1, it uses a main AXI interconnect block and a bridge to the Advanced Peripheral Bus (APB) to connect peripherals, both featuring 32-bit wide data channels. It also includes an advanced debug unit that enables access to both RAMs, the core registers and memory-mapped IOs via JTAG.

Figure 3.1: PULPino's SoC block diagram [45].

PULPino has its components connected through an AXI interconnect block, which allows all of them to be mapped into a memory space, providing a homogeneous view of the system. A memory map has been defined, in which all the components have user-configurable address spaces, as shown in Fig. 3.2. This map may be extended by the user, adding new address ranges to incorporate additional memory-mapped hardware components. For instance, if an accelerator needs to be added, a new set of addresses (start and end) is configured in the AXI interconnect configuration sources. These components are ideally meant to be as plug-and-play as possible and to have a "universal" interface, towards the goal of being easily deployed in any given IP.

Figure 3.2: PULPino's memory map [45].

The interconnect block provides a way for multiple masters and slaves to be connected to several blocks at once. A typical implementation of an AXI interconnect provides clock conversion mechanisms, data width conversions and even FIFOs if necessary. To route all the slaves and masters together it has a central crossbar. The routing mechanism may be configured to be based on an address space for each component (used in the presented design) or on a solution as simple as round-robin. PULPino has a custom AXI interconnect block, whose design objectives are the following [44]:

• Suitable for high-bandwidth and low-latency designs;

• Operation at high-frequency without using complex bridges;

• Meet the interface needs of a wide range of components;

• Suitable for memory controllers that have high initial access latency;

• More flexible implementation of interconnect architectures;

• Compatibility with previously existing APB and AHB interfaces.

Figure 3.3: AXI4 node overview for an NxM interconnect node [46].

The axi_node is a SystemVerilog soft IP, defined as the top-level source file for the AXI4 crossbar. Fig. 3.3 depicts the block diagram of an NxM interconnect node, mainly composed of four parts [46]:

1. Slices: optional buffers inserted on each master/slave port. They provide buffering and cut the critical paths that could lead to timing degradation;

2. Request trees: one for each target port, containing the arbitration tree, request decoders and error management;

3. Response trees: one for each initiator port, including the arbitration tree and the decoders for responses;

4. Configuration block: used to map the initiator regions (memory mapped). It can also implement a first layer of protection, although this is not used in PULPino, by limiting the connectivity from a target port to an initiator port.

The initial memory map is defined at the instantiation of the axi_node, by configuring a set of parameters:

• NB_MASTER: number of memory-mapped master components;

• NB_SLAVE: number of memory-mapped slave components;

• {start_addr_i, end_addr_i}: arrays of 32-bit start/end addresses; e.g., if NB_MASTER = 4 then four pairs of start and end addresses must follow. As another example, for the data memory start_addr_i = 0x0010_0000 and end_addr_i = 0x0010_8000, as shown in Fig. 3.2.

3.2 Overall System Architecture

This section addresses the proposed overall system architecture, based on PULPino. In order to improve PULPino's energy efficiency, hardware acceleration was provided in a loosely-coupled manner. The following sections describe how the acceleration was implemented, allowing the execution time of a given task to be reduced. The goal is to develop a generic plug-n-play interface so that accelerators can easily be attached to the core (Section 3.2.1), without having to adapt the interface each time a different one is introduced. The processor interacts with the accelerator through simple load/store instructions, which were also used as processor-accelerator synchronization mechanisms, further detailed in Section 3.2.2. The work developed under the scope of this Thesis was to attach accelerators that interface with the processor through the implemented AXI connection.

Figure 3.4: PULPino with attached accelerators block diagram.

The difference between the original PULPino architecture and the newly developed one can be noticed by comparing Figures 3.1 and 3.4. The accelerator interfaces with an AXI-lite to AXI-full converter, as shown in Fig. 3.4. The AXI-full interface of the converter is connected to PULPino's AXI interconnect block, which handles the communications coming from the processor. When the processor issues a load/store to the accelerator's memory space, the AXI interconnect block performs all the required address decoding to redirect the read/write to it. It is possible to have more than one accelerator at a time, by defining a separate address space for each one.

3.2.1 Hardware interface

The hardware interface between the accelerator and the remaining system requires only the standard AXI signals to operate, which was one of the main design goals. The accelerator block has a single AXI-lite port as interface, as depicted in Fig. 3.4. This block behaves like a wrapper that integrates the kernel and contains all the hardware required to handle the interface between it and the AXI-lite slave registers. This integration may consist, for instance, of storing the processor's incoming data and feeding it into the kernel's inputs according to its timing requirements. The same applies to handling the kernel's output data, afterwards read by the processor. All the kernel's control signals are provided as well. The AXI-Lite interface that connects to the accelerator has the following configuration:

• Address width: 2 bits, addressing 4 slave registers of 32 bits each;

• Data width: 32-bit;

• Read/Write mode. In this mode both read and write channels are enabled.

The data width is restricted to 32 bits, since the processor only supports 32-bit operations. The number of registers used was a design choice, with the goal of using the minimum hardware possible while still meeting the functionality required by the interface. Once the kernel has been properly integrated within the wrapper, the accelerator block is ready to be attached to the AXI-lite to AXI-full protocol converter block, which allows each accelerator to be connected to the AXI interconnect block and therefore memory mapped and addressable by the processor. Each accelerator to be plugged in has to have an independent memory region configured in the AXI interconnect block and its own protocol converter, as depicted in Fig. 3.4. The design option of giving each accelerator its own converter might seem a poor choice for an energy-efficient SoC; however, this is a lightweight protocol converter that corresponds to less than 1% of the total hardware of the PULPino platform. Both the converter and the accelerator are instantiated in PULPino's core_region.sv top module file. This module instantiates all core-related components: data and instruction memories, the RISC-V core and its debug module. To connect the accelerator to the converter, a new AXI4 slave bus was instantiated, which contains all the signals required to connect two AXI-full interfaced components. The AXI slave port of the converter is connected to a master AXI port of the AXI interconnect block; on the other hand, the accelerator's AXI-lite slave port connects to the AXI-lite master port of the converter. Once all these requirements are met, the accelerator is ready to be plugged into the converter and connected to the AXI interconnect block. In essence, this architecture aims to deliver an infrastructure with a well-defined interface between processor and accelerator, encouraging the open-source community to develop a set of PULPino-compatible accelerators and creating an easy way for developers to test new kernels on the go, reducing the development time and complexity of such hardware systems.

3.2.2 Software Interface

The processor needs to interact with the accelerator in order to benefit from its capabilities. This is done through load/store instructions addressing the memory region assigned to it. In essence, the processor needs to send the data to be processed and afterwards fetch the computed results. The processor also needs to know when the results are ready to be fetched, so a synchronization mechanism is required. To fulfil these requirements, a processor-accelerator communication protocol was defined, based on reads/writes to the available AXI-lite slave registers. Table 3.1 provides an overview of each register and its functionality depending on the type of operation (read/write).

Table 3.1: AXI-Lite slave register write/read map functionalities.

Register    Write        Read
slv_reg0    Reset        Done
slv_reg1    Input Data   N.A.
slv_reg2    Last Data    N.A.
slv_reg3    Optional     Result

For instance, when the user wants to reset the accelerator, a write to its first address (for demonstration purposes, let us assume the first address is 0) must be performed, sending the hexadecimal value 0x01010101. This is the first step before starting to send new data into the accelerator. Then, the address should be increased by one unit and the data streamed into address 1. The last data value should be sent to address 2, which indicates to the accelerator that this is the last data input. Once the computation is done, the output values are available to be read from address 3. The processor may start to fetch results when the Done signal, represented by the hexadecimal value 0xdeadbeef, is read from address 0. This method requires the processor to check this register in polling mode. An alternative is to associate the Done signal with an interrupt vector, which would raise a flag once the accelerator's computation is done, redirecting the program counter to the proper interrupt service routine and handling the output data as desired. Although this was not implemented, it is a strong recommendation for future work: from the energy-efficiency point of view, it would allow the processor to stay asleep, in a low-power operating mode, while it waits for the computation, waking up upon the interrupt. The feature of stepping out of the sleep state when an interrupt is triggered is available on PULPino, as mentioned in Section 2.2. An optional functionality for slv_reg3 was added to fill a need that some accelerators might have; for instance, it can be used to pass a configuration value or information about the input data, which happens in the SHA-3 accelerator as detailed in Section 3.3.
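From the processor's side, this protocol reduces to a handful of volatile loads/stores. The following C++ sketch walks through the reset, data, last-data, polling and result-fetch steps; the base address ACC_BASE, the register offsets and the run_accelerator helper are illustrative assumptions, since the real addresses depend on the memory region configured in the AXI interconnect.

#include <cstdint>

// Hypothetical base address of one accelerator's AXI-lite region.
#define ACC_BASE 0x00200000u

static volatile uint32_t *const slv_reg0 = (volatile uint32_t *)(ACC_BASE + 0x0); // write: reset   / read: done
static volatile uint32_t *const slv_reg1 = (volatile uint32_t *)(ACC_BASE + 0x4); // write: input data
static volatile uint32_t *const slv_reg2 = (volatile uint32_t *)(ACC_BASE + 0x8); // write: last data
static volatile uint32_t *const slv_reg3 = (volatile uint32_t *)(ACC_BASE + 0xC); // write: optional / read: result

// Streams 'len' input words into the accelerator and reads back 'res_len' result words.
void run_accelerator(const uint32_t *in, int len, uint32_t *res, int res_len) {
    *slv_reg0 = 0x01010101;               // step 1: reset the accelerator
    for (int i = 0; i < len - 1; i++)
        *slv_reg1 = in[i];                // step 2: stream the input data
    *slv_reg2 = in[len - 1];              // step 3: mark the last input word
    while (*slv_reg0 != 0xdeadbeef)       // step 4: poll the Done value
        ;
    for (int i = 0; i < res_len; i++)
        res[i] = *slv_reg3;               // step 5: fetch the results
}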

3.3 Hardware Accelerators

The integration of open-source kernels into one top-level accelerator is one of the main targets of this Thesis, intended to demonstrate its energy-efficiency impact on the overall system operation. The kernels were chosen with the first criterion of being open source, to match the kind of licensing provided by PULPino's team. This way, the hardware developed in this Thesis may contribute to PULPino's open-source community and keep the framework growing. The kernels used in the accelerators presented next were not internally modified; based on their interfaces, control specifications and timing requirements, the required hardware was developed to integrate each kernel within the accelerator block. The developed hardware that wraps the kernel inside the accelerator acts as a bridge between the accelerator's AXI-lite interface and the kernel's interface. In the following subsections, two accelerators are proposed. Due to the kernels' different requirements, different hardware designs were needed for each, although the main structure is the same, given their common requirement of streaming input data on each clock cycle. The processor cannot meet such a requirement, since each read/write through AXI-lite takes 10/11 clock cycles, respectively. Therefore, additional hardware is needed to accommodate the input data from the processor and then feed it into the kernel. Apart from this requirement shared by both kernels, the control signals to operate them are different.

SHA-3

In any kind of modern computer system, security is an extremely important feature. The most basic security or authentication systems use hashing algorithms. These take a stream of data as input and return a fixed-size hash for that specific input message. Such hash functions are required to generate unique hashes for any given message; the message cannot be recovered from the hash and the hash should be easy to compute [35]. NIST has standardized the Secure Hash Algorithm (SHA), SHA-3 being the most recent one in the family. It is a cryptographic hash function, originally known as Keccak, developed after successful attacks on MD5 and SHA-0 and theoretical attacks on SHA-1. A cryptographic kernel was chosen due to the recurring importance of security in low-power IoT devices, which still represents a challenge for today's common applications. A SHA-3 accelerator aims to provide a faster computation of the hash function with reduced energy consumption on the target low-power processor. The hardware implementation of this function exploits parallelism, which would not be possible with the single-core processor under analysis. The input message may take any size, while the output length remains the same; the output may take the lengths n ∈ {224, 256, 384, 512} bits. The current implementation has the highest security level, 512 bits, among all SHA-3 variants. Keccak is based on the sponge construction approach, which, through random permutation functions, allows any amount of input data, leading to great flexibility. The padding of a message M to a sequence of 576-bit blocks is denoted by M || pad[576](|M|). It makes use of multi-rate padding, denoted by pad10*1, appending a single 1 bit, followed by the number of 0 bits needed, followed by a single 1 bit, such that the length of the result is a multiple of the block length (576 in the current design) [47]. The number of 576-bit blocks of P is denoted by |P|_576, and the i-th block of P by P_i. The number of blocks determines the number of times the permutation f is executed, as shown in Algorithm 1.

Algorithm 1 The sponge construction [3]
1: procedure SPONGE
2:   Interface: Z = SHA-3-512(M), M ∈ Z2*, Z ∈ Z2^512
3:   P = M || pad[576](|M|)
4:   s = 0^1600
5:   for i = 0 to |P|_576 - 1 do
6:     s = f(s ⊕ (P_i || 0^(1600-576)))
7:   end for
8:   return ⌊s⌋_r
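As a sanity check of the pad10*1 rule used in line 3 of the algorithm, the small C++ sketch below pads a bit string to a multiple of the 576-bit rate. It is only an illustration of the rule stated above, not the kernel's implementation, and the function name pad10star1 is chosen freely.

#include <vector>
#include <cstddef>
#include <iostream>

// pad10*1: append a 1 bit, then the minimum number of 0 bits, then a final 1 bit,
// so that the total length becomes a multiple of the rate r (576 bits here).
std::vector<bool> pad10star1(std::vector<bool> msg, std::size_t r = 576) {
    msg.push_back(true);                       // leading 1
    while ((msg.size() + 1) % r != 0)          // zeros until one bit short of a block
        msg.push_back(false);
    msg.push_back(true);                       // trailing 1
    return msg;
}

int main() {
    std::vector<bool> m(44 * 8, false);        // a 44-byte (352-bit) message, as used in the later tests
    std::cout << pad10star1(m).size() << " bits\n";  // prints 576, i.e. exactly one block
    return 0;
}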

The kernel was developed by Homer Hsing and is available on the OpenCores website under the Apache license (version 2) [3]. The code is FPGA-vendor independent and fully optimized. It uses only one clock domain, without any latches. Capable of computing a 512-bit hash result in 29 clock cycles, it is based on a padding module followed by a permutation module, as shown in Fig. 3.5.

Figure 3.5: SHA-3 kernel overview architecture [3].

The input is limited to a width of 32 bits. Since this is far less than 576 bits, the padding module, whose architecture is shown in Fig. 3.6, uses a buffer to assemble the user input. When this buffer reaches maximum capacity, the permutation module is notified that a valid buffer output is ready and starts the calculation. The buffer is then cleared and the padding module waits once again for new input.

Figure 3.6: SHA-3 padding module’s architecture [3].

After the padding is complete, its output becomes the permutation block's new input, as depicted in Fig. 3.7. The permutation is performed by combinational logic over 24 rounds, with the corresponding round constant selected for each round. A 1600-bit register stores the output, although only 512 bits are selected from it, resulting in the final hash value.

Figure 3.7: SHA-3 permutation module's architecture [3].

While the permutation module is computing the current input, the padding module is preparing the next one. The permutation takes 24 clock cycles, meaning that the padding must get the next 576 bits ready in time. The kernel presents the input/output ports shown in Table 3.2. To start computing a hash value, the core must be reset by holding the reset signal synchronously high during one clock cycle. This procedure must be repeated for every new hash computation. For instance, if the kernel computed SHA-3-512("FooBar"), it should be reset before computing SHA-3-512("XPTO"). The padder uses the input signal byte_num in the last input block, which indicates how many bytes that input holds, since the message M may not be a multiple of 32 bits. If the last input block is reduced to 1 byte, it should be aligned to the most significant byte. Letting "A" be a 1-byte message, the input signal in should be driven as in[31:24] = "A". Notice that if the message length is a multiple of the in width, an additional zero-length input block should be provided. For example, let the input be:

in = "ABCD", is_last = 0

Then set: is_last = 1, byte_num = 0
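The following C++ sketch illustrates how a driver could split an arbitrary-length ASCII message into 32-bit in words and derive is_last/byte_num according to these rules, including the extra zero-length block when the length is a multiple of 4 bytes. The packing assumed here places the first byte in the most significant position, as in the 1-byte example above; feed_kernel is a hypothetical stand-in for driving the kernel's ports.

#include <cstdint>
#include <cstdio>
#include <string>

// Stand-in for driving the kernel's in / is_last / byte_num ports
// (in the real design this is handled by the accelerator wrapper).
static void feed_kernel(uint32_t in, bool is_last, unsigned byte_num) {
    std::printf("in=0x%08x is_last=%d byte_num=%u\n", in, (int)is_last, byte_num);
}

// Splits an ASCII message into 32-bit words, first character in bits [31:24].
static void send_message(const std::string &msg) {
    std::size_t full_words = msg.size() / 4;
    unsigned    tail       = unsigned(msg.size() % 4);    // bytes in a partial last word

    for (std::size_t w = 0; w < full_words; w++) {
        uint32_t word = 0;
        for (unsigned b = 0; b < 4; b++)
            word |= uint32_t(uint8_t(msg[4 * w + b])) << (24 - 8 * b);
        feed_kernel(word, false, 0);                      // not the last block
    }

    if (tail != 0) {                                      // partial word, MSB-aligned
        uint32_t word = 0;
        for (unsigned b = 0; b < tail; b++)
            word |= uint32_t(uint8_t(msg[4 * full_words + b])) << (24 - 8 * b);
        feed_kernel(word, true, tail);                    // is_last = 1, byte_num = tail
    } else {                                              // length multiple of 4 bytes:
        feed_kernel(0, true, 0);                          // extra zero-length last block
    }
}

int main() {
    send_message("ABCD");   // reproduces the example: one full word, then a zero-length last block
    return 0;
}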

Table 3.2: SHA-3 kernel's input/output ports.

Port          Width  Direction  Description
clk           1      In         Clock
reset         1      In         Synchronous, positively asserted reset
in            32     In         Input data
byte_num      2      In         Number of bytes of in
in_ready      1      In         Input is valid or not
is_last       1      In         Current input is the last or not
buffer_full   1      Out        Buffer is full or not
out           512    Out        Hash result
out_ready     1      Out        Result is ready or not

To comply with these requirements, a wrapper to accommodate the SHA-3 kernel was developed. It complies with the specifications provided in Section 3.2 and is based on the data path depicted in Fig. 3.8.

Figure 3.8: SHA-3 accelerator data path.

The design aims for efficiency and reduced hardware usage, using only one FIFO (32 words of 32 bits) and a set of control signals. The input data is redirected from the processor into the kernel as soon as it arrives (every 11 clock cycles). The FIFO acts as a buffer for the event in which the kernel's internal buffer reaches full capacity. If it does, the buffer_full signal is asserted; upon such an event, the input data is held in the FIFO until the kernel is ready to continue the computation. Afterwards, the data is fed into the kernel's input at the previous rate. When the last input word is sent by the processor, the FIFO may still hold data due to previous buffer_full events; such data is streamed into the kernel on every clock cycle after the last input word, which is the one sent to the slv_reg2 AXI-lite register, asserting is_last and selecting the previously received byte_num value. As explained before, the byte_num signal has to follow certain rules, based on the length of the last message block. The user is responsible for this handling, by sending a last dummy input word with the value zero into the slv_reg2 register after the last message block. When the out_ready signal is asserted, the 512-bit hash is made available at the output port, being iteratively fetched by the processor in blocks of 32 bits. This output is redirected to the slv_reg3 register; since it is limited to a 32-bit word, an auxiliary counter is used to iterate through the 512-bit result. It starts by sending the least significant word, iterating 16 times until the most significant word is sent. Afterwards, the accelerator needs to be reset in order to proceed with further computations. Metrics regarding the accelerator performance, obtained through simulation and hardware tests, are presented in Chapter 5.

FFT

The Fast Fourier Transform (FFT) is widely used in DSP and many other application fields, as a fast and more efficient algorithm to compute the Discrete Fourier Transform (DFT). It allows the conversion of a signal from its original domain to the frequency domain. While the DFT has a complexity of O(n^2), the FFT complexity is O(n log n), where n is the data size [48]. Many applications require a highly efficient computation of this operation, often leading to a hardware implementation. The algorithm may be mapped onto many different architectures, depending on the hardware restrictions and performance requirements of the application [43]. This Thesis addresses a low-power processor targeting IoT applications, which often require DSP algorithms, namely FFTs. Therefore, it is important to have a commonly used DSP accelerator such as the one addressed here, aiming at high energy-efficiency and performance. The purpose was not to develop an FFT kernel from scratch, but instead to select the most appropriate one, having a low-power profile and low hardware resource requirements. At the same time, it should be available under an open-source license, as stated before. The FFT kernel was wrapped together with an AXI-lite interface, through which it communicates with the processor. All the additional hardware to bridge between the kernel's and AXI-lite interfaces was designed with the goal of reducing the hardware used, while efforts were made to profit from the kernel's performance. The AXI-Lite interface may limit the input data feed rate, and hence the kernel's throughput, in case the AXI-lite data transactions cannot keep up with it.

Table 3.3: SPIRAL FFT kernel online configuration parameters [2].

Parameter               Value            Range                           Description
Problem specification
transform size          256              4-32768                         number of samples
direction               forward          forward or inverse DFT
data type               fixed point      fixed or floating point
fixed point precision   32               4-32 bits
mode                    unscaled         scaled or unscaled
Parameters controlling implementation
architecture            iterative        iterative or fully streaming
radix                   2                2, 4, 16                        size of DFT basic block
streaming width         2                2-256                           number of complex words per cycle
data ordering           natural in/out   natural or digit-reversed       data order
BRAM budget             -1                                               maximum number of BRAMs (-1 = no limit)
permutation method      JACM'09          JACM'09 [49] or DATE'09 [50]

The chosen FFT kernel was developed by SPIRAL - Software/Hardware Generation for DSP Algorithms [43], which won the ACM TODAES Best Paper Award 2014. SPIRAL provides an online tool [2] for hardware generation, outputting a generated FFT kernel in Verilog according to the chosen parameters. These parameters may take the values specified in Table 3.3; the "Value" column lists the parameters chosen for the kernel used here, the remaining available options (where applicable) are described in the "Range" column, and a short description of each parameter's meaning is given in the last column. A kernel with n = 256 samples was chosen, for the forward DFT defined as:

y = DFT_n x,    DFT_n = [ e^(-2πjkl/n) ]_{k,l = 0,...,n-1}

where y is the n-point output vector and x the n-point input vector. The data type chosen was fixed point, due to the processor's lack of hardware floating-point support. Since the current AXI-Lite configuration is set to 32-bit messages, the fixed-point precision was set to 32 bits, together with the unscaled arithmetic mode.
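For reference, the definition above translates directly into the O(n^2) DFT sketched below in C++; it uses double-precision complex arithmetic rather than the kernel's 32-bit fixed point, and is only a reference formulation, not the hardware implementation.

#include <complex>
#include <vector>
#include <cmath>

// Direct evaluation of y = DFT_n x, with DFT_n = [exp(-2*pi*j*k*l/n)].
std::vector<std::complex<double>> dft(const std::vector<std::complex<double>> &x) {
    const std::size_t n  = x.size();
    const double      pi = std::acos(-1.0);
    std::vector<std::complex<double>> y(n);
    for (std::size_t k = 0; k < n; k++) {
        std::complex<double> acc(0.0, 0.0);
        for (std::size_t l = 0; l < n; l++) {
            double angle = -2.0 * pi * double(k) * double(l) / double(n);
            acc += x[l] * std::complex<double>(std::cos(angle), std::sin(angle));
        }
        y[k] = acc;
    }
    return y;
}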

Figure 3.9: SPIRAL Fast Fourier Transform (FFT) iterative architecture [2].

Figure 3.10: SPIRAL Fast Fourier Transform(FFT) fully streaming architecture [2].

The parameters controlling the implementation are addressed next. Both architectures, iterative and streaming, were tested under the scope of this Thesis, and the developed hardware is prepared to accommodate both. The iterative one is slower than the streaming version, since the data stream has to iterate O(log n) times (n being the size of the DFT to be computed) over a single stage, as shown in Fig. 3.9, and cannot begin a new input vector before the last one is processed. In a streaming architecture the data stream flows in and out of the system continuously: the architecture consists of O(log n) cascaded stages, each composed of computation and data reordering components, as depicted in Fig. 3.10. The radix defines the size of the DFT basic block, controlling the number of points processed in a basic computational block. In the case of an iterative architecture, the problem size n must be a power of the chosen radix, meaning that n = r^k for an integer k; this restriction does not apply to a streaming architecture. The streaming width must be a multiple of the chosen radix. It controls the input data stream width (defined as w in Figs. 3.9 and 3.10); increasing this parameter by a factor of k consequently increases the system's parallelism by a factor of k. As specified in Table 3.3, the data ordering may be selected from natural in/out, natural input/reversed output and reversed input/natural output; in the reversed ordering the MSBs become the LSBs and vice-versa. The "BRAM budget" option allows choosing the maximum number of BRAM blocks when Xilinx FPGAs are targeted, by adding synthesis directives, interpreted by the Xilinx tools, to the generated Verilog code; for an unrestricted number of BRAMs, the value -1 should be used. The permutation method (last line of Table 3.3) defines, as the name indicates, how the permutation is done. Of the two available methods, DATE'09 [50] is patent free, but requires almost twice the amount of SRAM, with reduced performance and higher logic cost. The second method, JACM'09 [49], which is patented (protected by U.S. Patent No. 8,321,823), is based on a different and improved technique [2].

Given the previous configurations for the FFT kernel, a single Verilog file is generated by SPIRAL's online tool [2]. For the kernel's integration in the accelerator, it is instantiated within a wrapper, interfacing with the processor through AXI-Lite. This integration was done with additional hardware, whose data path is depicted in Fig. 3.11, acting as a bridge between the FFT kernel and the AXI-Lite interface. The resulting top-level hardware block is denominated the accelerator.

Figure 3.11: FFT accelerator's data path.

The input data is redirected from the AXI-Lite slave registers, which contain the data sent by the processor, based on the protocol defined in Section 3.2.2. The input data width, defined by w, may be 32 or 16 bits, since the kernel is able to compute both types of data; the hardware design only implements the 32-bit word width option. The input data is redirected by the demultiplexer, at the moment it arrives, into a buffer register. Its purpose is to store the data until it has been completely filled, since the processor sends one 32-bit word per communication. The register's data width corresponds to 2*N*w, with N being the FFT kernel's number of complex inputs, each composed of a real and an imaginary part, individually w bits wide. After the buffer is full, its data is stored in the FIFO; thus, the FIFO has the same data width as the buffer and a data depth related to the number of input samples, as defined in the online tool [2] explained before. After the processor is done sending all the data samples, the FIFO will be full and ready to stream all the stored data into the FFT kernel. The need for a streamed input on each clock cycle after the kernel's next input signal is asserted is the main reason to include a FIFO memory in the design, since the AXI-Lite interface can only provide one new message from the processor every 11 clock cycles. After the first kernel input, it takes a known number of clock cycles until the computed output is ready, defined by the latency, which depends on the kernel configuration chosen in [2]. Upon assertion of the next_out signal by the kernel, its output is streamed into the same FIFO. The signal sel_in_fifo selects the FIFO's input between the data to be computed and the output results to be stored. Similarly, the signal sel_out_fifo multiplexes the FIFO's output data, redirecting it either to the kernel's input or to the output multiplexer operated by the sel_out_reg control signal. This last mux on the data path splits the FIFO's 2Nw-bit output word into smaller words of width w, compliant with the supported AXI-Lite slave register width. The hardware design whose data path is shown in Fig. 3.11 is ready to handle different FFT kernel configurations, with minor adjustments to the accelerator's input parameters and to the kernel's component declaration and instantiation. These parameters adjust the data width of several signals, as well as the counters used to control them. All the configuration options of the online generator tool [2] are covered, with the exception of the fixed-point precision, fixed to 32 bits, and the number of samples, fixed to 256. The accelerator's top-module parameters, illustrated with a worked example after the list, are the following:

• FFT_INOUT_NR: number of inputs/outputs that the current kernel configuration can handle.

• DATA_WIDTH_FIFO: the data width of each word stored in the FIFO, calculated as FFT_INOUT_NR * w, with w = 32 bits.

• DATA_DEPTH_FIFO: the data depth of the FIFO, i.e. how many words of DATA_WIDTH_FIFO width are stored in it. The value is the number of input data samples divided by the chosen radix, e.g. samples/radix = 256/2 = 128.
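As referenced above, the short C++ sketch below works out these parameter values for the configuration of Table 3.3, following the formulas stated in the list. FFT_INOUT_NR = 4 is an assumed example value (two complex samples, i.e. four 32-bit words, per transfer); the actual value depends on the generated kernel.

#include <cstdio>

// Worked example of the accelerator's top-module parameters (256 samples, radix 2, w = 32 bits).
constexpr unsigned W               = 32;
constexpr unsigned SAMPLES         = 256;
constexpr unsigned RADIX           = 2;
constexpr unsigned FFT_INOUT_NR    = 4;                   // assumed: 2 complex samples per cycle
constexpr unsigned DATA_WIDTH_FIFO = FFT_INOUT_NR * W;    // 4 * 32  = 128 bits per FIFO word
constexpr unsigned DATA_DEPTH_FIFO = SAMPLES / RADIX;     // 256 / 2 = 128 FIFO words

int main() {
    std::printf("DATA_WIDTH_FIFO = %u bits, DATA_DEPTH_FIFO = %u words\n",
                DATA_WIDTH_FIFO, DATA_DEPTH_FIFO);
    return 0;
}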

3.4 Summary

This chapter introduced the target platform on which the development was done, followed by a description of the AXI protocol, on which the accelerators' interface is based. The developed interface was detailed from both the hardware and the software points of view. The architectures of each accelerator and kernel were presented, as well as the hardware developed to wrap all the components together.

Chapter 4

Implementation and Experimental Work

This chapter presents the basic steps and procedures to set up a working environment to start with PULPino on the chosen development board: from the first step of configuring the system and getting it up and running, to deploying the bitstream into the FPGA and executing a program on the core. Along the development process, simulation was an essential step, so it is also presented how to simulate such a system. During all these development phases some drawbacks and problems were found; they will be detailed and possible solutions discussed. As stated before, PULPino is an open-source project, therefore all the base sources were retrieved from the project's GitHub page [51]. Given that this was its first release, and not yet a mature project, it is expected to have incompatibilities and unsolved issues/bugs. This was one of the main setbacks found: having to deal with a release that was not well documented from a technical point of view (only a very basic user manual and a datasheet are available). Due to the recent release (dating from 2016), there is still a very small active open-source community working with this platform, which hinders the resolution of many of the issues that arise. Regarding the development board, a ZedBoard was chosen to conduct the necessary tests and collect results, as presented in the next section.

4.1 Target Device

The development board was chosen in accordance with the PULPino developers' specifications. PULPino is mainly targeted at RTL simulation and ASICs, although there is also an FPGA version supported on the ZedBoard. The FPGA version is not optimized for performance and efficiency, since it is used mainly for emulation rather than as a standalone platform. The ZedBoard carries a Xilinx Zynq-7000 family All Programmable System on Chip (SoC), XC7Z020-CLG484-1. This device series enables extensive system-level integration and flexibility through its main hardware, software and I/O programmability. Most of its internal system-level components have GUI configuration tools available, which contribute to reducing development time and easing debugging, by auto-generating the required source code from the user's hardware specification/requirements.

Figure 4.1: Xilinx Zynq-7000 SoC block diagram overview [52]

In Fig. 4.1, the main components of the Zynq device range are represented. The main components of this design are the Programmable Logic (PL) and the Processing System (PS). The PL is derived from Xilinx 7-series FPGA technology, namely Artix-7 for the present XC7Z020 device. The integrated PL block is available for the user to deploy custom-designed hardware, as in any other FPGA, through hardware description languages such as Verilog, VHDL or SystemVerilog. The given low-range Zynq-7000 PL features 85,000 programmable logic cells, 53,200 Look-Up Tables (LUTs), 220 Digital Signal Processing (DSP) slices and 4.9 Mb of BRAM memory.

The PS itself features an Application Processing Unit composed of two ARM Cortex-A9 hard cores, with a maximum frequency of 866 MHz on the featured device and capable of 1 GHz on higher-end SoCs. It also features two cache levels, L1 and L2, of 32 KB (for each of the instruction and data caches, per core) and 512 KB, respectively. Additionally, the PS includes 256 KB of on-chip memory (OCM), and each processor has its own FPU. Together with the memory controller, the processors have access to the external 512 MB DDR memory. I/O peripherals, including SPI, CAN, UARTs, I2C, USB and Ethernet interfaces, are also available to the PS through the central interconnect block.

The Zynq-7000 SoC is booted in a multi-stage process that involves the boot ROM and the First-Stage Boot Loader (FSBL). The boot process initializes and cleans up the system, and prepares it to boot from the selected external boot device. Once it is concluded, the FSBL is executed and the system's main components (PS and PL) are configured accordingly, for instance by loading a lightweight operating system into the PS and a bitstream into the PL.

4.2 System Configuration

Prior to PULPino's deployment on the FPGA, some essential system configurations and main steps were taken. The recommended toolchain is Vivado 2015.1 from Xilinx, for synthesis and implementation, and ModelSim 10.2c from Mentor for simulation. To compile the C/C++ source files meant to be executed on the addressed core, a RISC-V toolchain was used. It may be the official RISC-V toolchain provided by the University of California, Berkeley, or the custom one from ETH Zurich. The latter was used, due to its support for all the ISA extensions present in the RI5CY core (see Section 2.2). A dual-core ARM Cortex-A9 is also part of the Xilinx SoC in use, therefore its compilers are also important. Since its software is compiled with the Xilinx SDK, gcc-arm-none-eabi and gcc-arm-linux-gnueabi are required, together with the lib32 libraries. The configuration stages presented below describe the path that must be followed in order to get PULPino up and running, ready to be tested. An external hardware view of the system's block diagram is shown in Fig. 4.2, wherein the Zynq-7000 SoC, on the ZedBoard, communicates with the PC via a serial interface (UART).

Figure 4.2: Implementation block diagram. Communications between PS-PL and PC-PS.

Boot

There are multiple methods for booting a Linux system on a Zynq SoC; here, the SD flash memory method was used. The ARM processor boot is a three-stage process: an internal boot ROM stores stage-0 boot code, which configures the processor and the necessary peripherals to start fetching the First-Stage Bootloader (FSBL) code from the SD card. The card is composed of a root and a boot partition, configured according to the Xilinx specifications [53]. The FSBL is copied to the SoC's on-chip memory and then executed. The FSBL includes all the required initialization code for the peripherals used in the PS, and configures the PL with the bitstream. The third step is to bring the OS into the SoC's memory from the SD card, because when the processor is powered on the memory is empty. This is done by the u-boot bootloader. Apart from loading the Linux buildroot OS, it performs other tasks that the kernel might not be able to do, such as configuring the clock frequency, loading the device tree and enabling boot commands. The loaded buildroot OS has all the necessary drivers and configurations to initialize along with its peripherals, using the hardware description present in the device tree [54].

Generating the Bitstream

Unlike most Vivado projects, this one operates through a makefile that sets up the environment and executes Tcl (Vivado) commands to create the project, add the required sources, set properties and compile order, and synthesize and implement according to an area-optimized strategy. It is composed of two different projects. A top project contains the PS, the AXI buses and the required AXI converters to interface both with the core (via SPI) and with the FPGA GPIO ports. A second project contains PULPino and all its RTL sources, meaning that only this second Vivado project would be needed in the case of a standalone implementation of PULPino on an FPGA. It would facilitate the development/debugging process and allow the use of Vivado's full block-design potential if this second project could be added to the top one as a design block; however, due to an incompatibility between Vivado's block design and PULPino's sources, this is not possible. Consequently, Vivado's debug tools and automatic block connection features (one of its remarkable development advantages) are not available, translating into much more difficult and time-consuming problem solving and development.

Deploy the Bitstream

Once the bitstream is generated, it can be programmed into the FPGA in at least two different ways: on Processing System (PS) boot, loading the bitstream file from the SD card and programming the FPGA while u-boot/Linux is booting up; or using the Xilinx XMD tool on the PC the board is connected to. In the latter case, the command "fpga -f bitstream.bit" uploads it into the FPGA. Note that if this method is chosen, the core needs to be reset before uploading any pre-compiled program into it. The reset should be done by issuing the spiload script on PULPino (via serial port), loading an empty stimulus file into the memories, which forces the core to reset.

Execute a Program on PULPino

When the bitstream is deployed and the core reset, it is ready to receive the proper stimulus files, which contain all the data to be uploaded into the memories. Once the C/C++ code is compiled, a stimulus file is generated containing each memory address and the value to be stored at that address. This file may be uploaded to the FPGA either by saving it directly on the SD card or by connecting to the board via ssh and using secure copy (scp). Once the file is on the board, a script (spiload) running on the PS loads the stimulus file into the FPGA. It not only loads the file via SPI into the memories, but also defines the boot address, resets the core, and listens to the core outputs, which are redirected to the Linux stdout. This process could also be performed "manually", step by step, using a JTAG debug tool that connects to the internal AXI bus and has access to the whole address space. However, a working version compatible with this core version is not available; such a tool exists only for previous versions of the core (e.g. the or1k or or10n core versions).

Simulation

An important phase of development is testing an application in a controlled environment, in which it is easier to debug and to verify its inner workings, for example through waveform analysis. ModelSim 10.2c is the default platform on which PULPino was tested. All the simulation scripts (available with the PULPino project) were conceived to fit this platform's requirements and interface. They reproduce the behavior of PULPino as if it were running on the ZedBoard, meaning that they either load the stimulus file over SPI, reproducing the behavior of spiload (see Section 4.2), or load the stimuli directly into the memory. The loading method can be chosen by presetting the ModelSim MEMLOAD argument. If there is a need to simulate PULPino in another simulation engine, all the simulation files, including the required libraries and simulation sources, need to be redesigned to fit the new simulator's requirements. For instance, the simulation sources of PULPino are not compatible with the Vivado simulator, mainly due to some SystemVerilog features that are not supported (e.g. dynamic arrays), making it difficult and time consuming to port them to new simulation platforms or hardware description languages.

The simulation environment is built using CMake. A bash script located in the sw folder of PULPino's git repository needs to be configured with the paths to the RI5CY toolchain, ModelSim and the PULPino git directory, and to enable the use of compressed instructions. The next step is compiling all the RTL libraries using ModelSim, which is done with the previously generated makefile. In this step, many compilation errors may appear if the ModelSim version is not exactly the recommended one; even with the same version, some extra features might not be enabled by the license, also leading to compilation errors. Once all the libraries are ready, the simulation can be launched by issuing, for instance, "make helloworld.vsim", which opens the ModelSim GUI. Some faulty ModelSim RTL sources and libraries were detected when optimization is enabled; therefore, it is not recommended to use optimizations during simulation, as they may lead to errors and untrustworthy results.

When a post-synthesis or post-implementation simulation is needed, a set of possible approaches was tested:

1. Use of the Vivado simulator. This method requires adapting the SystemVerilog simulation sources and testbench, since the tool does not support all kinds of constructs, e.g. dynamic data structures.

2. Generate a post-synthesis/implementation netlist to be simulated in ModelSim. A netlist contains all the hardware blocks (LUTs, DSPs, etc.) and the connections between them, from which the synthesized/implemented schematic is generated. Compiling the netlist in ModelSim requires all of Vivado's Unisim libraries to be properly imported into ModelSim; these libraries contain all the hardware elements used by Vivado during netlist generation. However, some Unisim elements are not compatible with the ModelSim compiler. Even if all the libraries compile without errors, this method only allows verifying the correct functioning of the post-synthesis/implementation design. During a development phase, if there is a need to debug the generated hardware in simulation, this method is not recommended, since all the nets have names that were automatically generated by the tool and do not resemble the original, user-defined ones.

3. Using ModelSim as the default simulator in Vivado. After synthesizing/implementing the project, it is possible to simulate it directly from Vivado, using ModelSim to compile and run PULPino's simulation sources.

4.3 New AXI Interconnect Slave

To tackle the challenge of attaching a new accelerator to the AXI interconnect bus through an AXI-lite to AXI-full converter, without any automatic configuration tools (such as the ones Vivado provides), a new RAM memory was first attached, in order to test communications with the core by issuing load/store instructions. It interfaces with the AXI-full specification in a new custom address region, through a wrapper that translates between AXI-full and the RAM read/write interface. The wrapper used is the same one that comes with PULPino's data memory, wherein the protocol conversion is already implemented. The objective is to test the custom connections made at the core's top-level design, in which the components are instantiated and connected to each other. It was necessary to choose a free AXI interconnect address space region to house the new memory, in accordance with PULPino's memory map previously presented in Fig. 3.2. The address region chosen is next to the existing data memory: from 0x00108100 to 0x00110100.

Listing 4.1: AXI Interconnect Instantiation in System Verilog

axi_node_intf_wrap
#(
    .NB_MASTER      ( 4 ),
    .NB_SLAVE       ( 3 ),
    .AXI_ADDR_WIDTH ( `AXI_ADDR_WIDTH ),
    .AXI_DATA_WIDTH ( `AXI_DATA_WIDTH ),
    .AXI_ID_WIDTH   ( `AXI_ID_MASTER_WIDTH ),
    .AXI_USER_WIDTH ( `AXI_USER_WIDTH )
)
axi_interconnect_i
(
    .clk          ( clk_int    ),
    .rst_n        ( rst_n_int  ),
    .test_en_i    ( testmode_i ),

    .master       ( slaves  ),
    .slave        ( masters ),

    .start_addr_i ( { 32'h0010_8100, 32'h1A10_0000, 32'h0010_0000, 32'h0000_0000 } ),
    .end_addr_i   ( { 32'h0011_0100, 32'h1A11_FFFF, 32'h0010_7FFF, 32'h0008_FFFF } )
);

The memory map is defined with start and end address vectors, as shown in Listing 4.1. Associated with each region is a new AXI master bus; therefore, the number of masters is defined by the parameter NB_MASTER = 4, matching the number of defined address regions. These buses connect the AXI interconnect to the remaining components attached to the core (instruction/data memories, peripherals and other additional components). The newly instantiated memory was tested using GDB as a debug tool, issuing writes and reads on the defined address range, proving functionality and validating the configurations and new components. Based on these tests, the work moved on to the next stage of adding an AXI converter, which would need to be capable of converting from AXI-full to AXI-lite, supporting communications between processor and accelerator.
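A memory test of the kind performed here can also be run from the RISC-V core itself, as in the sketch below. The region boundaries are the ones configured in Listing 4.1; the stride and the XOR pattern are arbitrary illustrative choices, and spot-checking a few addresses with GDB, as described above, serves the same purpose.

#include <cstdint>

// Simple write/read-back test over the newly mapped RAM region (0x00108100 - 0x00110100).
#define TEST_RAM_START 0x00108100u
#define TEST_RAM_END   0x00110100u

int test_new_ram(void) {
    volatile uint32_t *p;
    int errors = 0;

    // Write an address-derived pattern with a coarse stride (every 64th word).
    for (p = (volatile uint32_t *)TEST_RAM_START; p < (volatile uint32_t *)TEST_RAM_END; p += 64)
        *p = (uint32_t)(uintptr_t)p ^ 0xA5A5A5A5u;

    // Read the pattern back and count mismatches.
    for (p = (volatile uint32_t *)TEST_RAM_START; p < (volatile uint32_t *)TEST_RAM_END; p += 64)
        if (*p != ((uint32_t)(uintptr_t)p ^ 0xA5A5A5A5u))
            errors++;

    return errors;   // 0 means every tested location stored and returned its value
}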

4.4 New Accelerator

Following the previous work in Section 4.3, the same procedure of defining a memory region applies when attaching a new accelerator to the AXI interconnect; a top-level overview is shown in Figure 3.4. The accelerator is instantiated in the core_region.sv file, wherein all the core-related hardware blocks also reside (debug unit, data and instruction memories, protocol converters, memory multiplexers and the RISC-V core). In the same core region, the accelerator, holding an AXI-lite slave interface, is connected to the master interface of the AXI-full to AXI-lite converter. This converter is based on the Vivado AXI protocol converter block, although it was optimized to fit the requirements of PULPino's AXI-full interface, complying with the protocol specifications already in use. All the additional compatibility with AXI3 (see Section 3.1) was removed, hence all AXI communications are based on the AXI4 specification. The new accelerator takes all the AXI-lite signals as inputs and has all the necessary logic for them. Additionally, the kernel block needs to be instantiated and any further necessary hardware added, to be able to work with the AXI-lite mode of operation (using slave registers). The additional hardware added to the addressed accelerators is covered in the SHA-3 and FFT subsections of Section 3.3.

4.5 Summary

In the previous sections of this chapter, all the steps required to set up the environment in which PULPino was tested were presented. After system configuration and boot, the bitstream must be generated and deployed into the FPGA before programs can be uploaded and executed on it. Additionally, this chapter detailed the approach taken to deploy a new AXI interconnect slave/accelerator.

Chapter 5

Experimental Results

This chapter presents the experimental results obtained from the tests performed on PULPino, with and without the hardware accelerators addressed in this Thesis. Both algorithms, FFT and SHA-3, are under analysis. The goal is to compare the performance of the software and hardware implementations, drawing conclusions on the attainable speedup and energy savings. Moreover, energy consumption and efficiency are also compared.

5.1 Software vs Hardware

To measure the speedup that a hardware accelerator provides over a software-only implementation, test benches were set up to verify the performance and energy efficiency of both implementations. This section presents the software-only algorithms (which do not require an accelerator) and the ones that interact with the accelerators, used to compute SHA-3 and FFT. The software-only algorithms were adjusted to operate under conditions similar to the accelerators', thus allowing a fair comparison between hardware and software implementations. Both accelerators were synthesized and implemented with the same tools and optimization strategy: "Flow Area Optimized High" for synthesis and "Area Explore" for implementation. These strategies were chosen by PULPino's developers as the most adequate for the release under analysis, taking into account that the purpose is to operate under the restricted power envelopes of the IoT domain, while still meeting the computational requirements.

5.1.1 SHA-3

The algorithm used to implement SHA-3 in software was based on the implementation in [55] (Appendix A), written in C++. It was compiled using gcc with -O3 optimization and all the available instruction extensions enabled, and configured with the same number of permutation rounds and a 512-bit hash output, matching the hardware accelerator kernel configuration. The base input test message used was "The quick brown fox jumps over the lazy dog ". Having a size of 44 bytes when translated from ASCII to binary, it is a well-known pangram (a sentence that includes all the letters of the alphabet), commonly used as a test message for different kinds of hash and encryption algorithms. To test the accelerator, the algorithm shown in Listing 5.1 was developed. It interfaces with the accelerator according to the specifications given in Section 3.2.2. In this case, the optional functionality of slv_reg3 was used to send the byte_num configuration value needed by the SHA-3 kernel. The AXI-lite register addresses are defined at the beginning of Listing 5.1; the address region was set as described in Section 4.3. The program starts by resetting the accelerator at line 14, followed by the byte_num value written to slv_reg3 at line 16. Then it is ready to start sending the input message NR_MSG times (set for test-bench purposes), in the loop starting at line 19. The last input word is sent to slv_reg2; in this case it is a dummy word, because the byte_num value is equal to zero (see Section 3.3). After the last input is sent, the program waits for the computation to finish (line 36). When the output hash is ready to be fetched, the value of slv_reg0 equals 0xdeadbeef. Finally, the resulting 512-bit hash is read from slv_reg3. To acquire the number of clock cycles the algorithm takes to finish, a timer incrementing at every clock cycle was used, reset and started at the beginning of the algorithm (lines 11 and 12) and stopped at its end (line 42). Afterwards, code to print the outputs, for instance to the serial port, may be added for debugging or user-interface purposes.

Listing 5.1: SHA-3 interface with accelerator C++ code

 1  #define SLV_REG0 0x00200000
 2  #define SLV_REG1 0x00200004
 3  #define SLV_REG2 0x00200008
 4  #define SLV_REG3 0x0020000C
 5  #define NR_MSG   10
 6
 7  void main() {
 8      volatile int *axi_lite_reg = (volatile int *)(SLV_REG0);
 9      unsigned int aux[16];
10
11      reset_timer();
12      start_timer();
13
14      *axi_lite_reg = 0x01010101;                    // reset: write to slv_reg0
15      axi_lite_reg = (volatile int *)(SLV_REG3);
16      *axi_lite_reg = 0x00000000;                    // byte_num = 0 (message length is a multiple of 4 bytes)
17      axi_lite_reg = (volatile int *)(SLV_REG1);     // point to slv_reg1 to write the input data
18
19      for (int k = 0; k < NR_MSG; k++) {             // send the message NR_MSG times
20          *axi_lite_reg = 0x54686520;                // "The "
21          *axi_lite_reg = 0x71756963;                // "quic"
22          *axi_lite_reg = 0x6b206272;                // "k br"
23          *axi_lite_reg = 0x6f776e20;                // "own "
24          *axi_lite_reg = 0x666f7820;                // "fox "
25          *axi_lite_reg = 0x6a756d70;                // "jump"
26          *axi_lite_reg = 0x73206f76;                // "s ov"
27          *axi_lite_reg = 0x65722074;                // "er t"
28          *axi_lite_reg = 0x6865206c;                // "he l"
29          *axi_lite_reg = 0x617a7920;                // "azy "
30          *axi_lite_reg = 0x646f6720;                // "dog "
31      }
32      axi_lite_reg = (volatile int *)(SLV_REG2);     // last write
33      *axi_lite_reg = 0x00000000;                    // dummy write when byte_num = 0
34
35      axi_lite_reg = (volatile int *)(SLV_REG0);
36      while ((unsigned)*axi_lite_reg != 0xdeadbeef); // wait for computation completion
37
38      axi_lite_reg = (volatile int *)(SLV_REG3);     // read from slv_reg3
39      for (int j = 0; j <= 15; j++)                  // fetch the 512-bit hash
40          aux[j] = *axi_lite_reg;
41
42      stop_timer();
43  }
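The 32-bit constants written inside the loop of Listing 5.1 are simply the ASCII bytes of the test message packed four at a time, first character in the most significant byte. A small sketch of how such constants can be derived is shown below; pack_words is an illustrative helper, not part of the test code.

#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

// Packs an ASCII string into 32-bit words, first character in bits [31:24],
// matching the values used in Listing 5.1 (message length is a multiple of 4).
std::vector<uint32_t> pack_words(const std::string &msg) {
    std::vector<uint32_t> words;
    for (std::size_t i = 0; i + 3 < msg.size(); i += 4)
        words.push_back((uint32_t(uint8_t(msg[i]))     << 24) |
                        (uint32_t(uint8_t(msg[i + 1])) << 16) |
                        (uint32_t(uint8_t(msg[i + 2])) <<  8) |
                         uint32_t(uint8_t(msg[i + 3])));
    return words;
}

int main() {
    for (uint32_t w : pack_words("The quick brown fox jumps over the lazy dog "))
        std::printf("0x%08x\n", w);   // 0x54686520, 0x71756963, ..., 0x646f6720
    return 0;
}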

A set of tests were performed with different sized messages defined by NR MSG, always with the same text of the message before but replicated up to 10 times. The hashes were checked in every different message, to verify the well functioning of both systems (with and without acceleration). After executing the tests to evaluate the amount of clock cycles required for both implementations, with a single 40MHz clock (maximum frequency on FPGA), is possible to verify the speedup in Figure 5.1. Through an analysis of the graph can be concluded, that a significant speedup is achieved over a

Figure 5.1: SHA-3 computation speedup using hardware accelerator. Multiple message sizes were tested.

From an analysis of the graph, it can be concluded that a significant speedup is achieved over a software-only implementation by adding a dedicated hardware accelerator. A speedup of 104 times is achieved for a 44-byte message, meaning that the hardware-accelerated implementation is 104 times faster than the software-only one, and a speedup of up to 185 times is observed for a 440-byte message, in comparison to the non-accelerated version. The speedup tends to increase with the length of the input message, as the linear regression line indicates.
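For reference, the speedup values quoted here (and in the FFT analysis that follows) correspond to the ratio between the number of clock cycles taken by the software-only run and by the hardware-accelerated run of the same workload:

\[
\mathrm{Speedup} = \frac{\mathit{ClkCycles}_{\mathit{SW\text{-}only}}}{\mathit{ClkCycles}_{\mathit{accelerated}}}
\]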

5.1.2 FFT

A well-known algorithm was used to test the performance of the FFT in software-only mode, namely the Cooley-Tukey FFT, whose implementation, written in C++, was based on the algorithm in [56] (Appendix A). As with the SHA-3 case presented previously, it was compiled with the top optimization level and all of PULPino's instruction extensions enabled. It requires only the stock implementation of PULPino to be executed, since no hardware acceleration is involved in this software-only test bench. Unfortunately, the hardware presented in Section 3.3, implementing the FFT accelerator, is only fully functional in simulation. Due to the lack of debugging tools, it was only possible to conclude that a problem is present in the multiplier units: after mapping the design into hardware, these units were not outputting the most significant 32 bits of the 64-bit result of a 32x32-bit multiplication. Despite the malfunction of this unit, the hardware was properly mapped after synthesis and implementation, making it possible to experimentally evaluate the performance and efficiency of the FFT accelerator. Both speedup and power consumption results are valid and comparable with the SHA-3 accelerator, since they rely on the same tools, optimization settings and platform. In order to test the software-only and accelerated algorithms that implement the FFT, a data set of 256 complex samples was defined as input. Since a complex number is composed of a real and an imaginary part, the total number of inputs amounts to 512 values of 32 bits each, corresponding to a total of 2Kbytes of input and output data.
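As a concrete reference for this data set, the sketch below mirrors the interleaved real/imaginary layout used by the software-only implementation in Appendix A.2 and sent word by word to the accelerator in Listing 5.2; the helper name load_complex_input is illustrative and not part of the original sources.

#define NINPUTS 256                 /* complex samples, as in Appendix A.2 */

int buf[2 * NINPUTS];               /* 512 x 32-bit words = 2 Kbytes */

/* Interleave the real and imaginary parts of each sample into buf[],
 * which is the buffer consumed by fft() and streamed to slv_reg1. */
void load_complex_input(const int *re, const int *im) {
    for (int i = 0; i < NINPUTS; ++i) {
        buf[2 * i]     = re[i];     /* real part of sample i      */
        buf[2 * i + 1] = im[i];     /* imaginary part of sample i */
    }
}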

The chosen FFT kernel, integrated in the accelerator as detailed in Section 3.3, has several possible configurations. This allows a deeper analysis of different setups, establishing a term of comparison between them and identifying which one is more beneficial, either from an energy-efficiency or from a performance point of view. The defined configuration has fixed and variable parameters. The fixed ones are the problem specification (see Table 3.3):

• Transform size: 256 input complex data samples;

• Direction: Forward DFT;

• Data type: Fixed Point;

• Fixed point precision: 32-bit;

• Mode: unscaled.

The variable parameters control the translation of the FFT algorithm into the kernel's Verilog file. Multiple setups were tested by changing these parameters, and their experimental results are presented further on. To test the FFT accelerator, the algorithm shown in Listing 5.2 was developed, allowing the core to interface with it. It can be decomposed into four main sections. First, it resets the accelerator at line 8; then it sends the inputs to slv_reg1 and the last input to slv_reg2, at lines 11 and 15, respectively. Next, it waits for the computation to complete after sending the last input, at line 18. Finally, the results are fetched from slv_reg3 at line 21. The algorithm is completely independent of the hardware configuration of the accelerator.

Listing 5.2: C++ code interfacing with the FFT accelerator

 1  void main() {
 2      volatile int *axi_lite_reg = (volatile int *)(ADDR0);
 3      int aux[512];
 4
 5      reset_timer();
 6      start_timer();
 7
 8      *axi_lite_reg = 0x01010101;                // reset doing a write on slv_reg0
 9      axi_lite_reg = (volatile int *)(ADDR1);    // write on slv_reg1
10
11      for (int i = 0; i < 511; i++)
12          *axi_lite_reg = buf[i];
13
14      axi_lite_reg = (volatile int *)(ADDR2);    // last write
15      *axi_lite_reg = buf[511];
16
17      axi_lite_reg = (volatile int *)(ADDR0);
18      while ((*axi_lite_reg) != 0xdeadbeef);     // wait for computation completion
19
20      axi_lite_reg = (volatile int *)(ADDR3);    // fetch results from slv_reg3
21      for (int j = 0; j <= 511; j++)
22          aux[j] = *axi_lite_reg;
23
24      stop_timer();
25  }

Tests were performed with different parameters, by changing the architecture and the radix (see Table 3.3). When the radix is increased, the streaming width is automatically increased to match the number of input words it requires. The architecture may vary between the iterative and the streaming versions (Section 3.3). Figure 5.2 shows the speedup achieved by adding the FFT accelerator. The software-only FFT algorithm takes a total of 38126 clock cycles, which is the basis for the speedup calculations; Figure 5.2 presents the ratio between the number of clock cycles required to compute the FFT algorithm on PULPino with and without hardware acceleration. It is noticeable from Figure 5.2 that the speedup of the stream version overcomes the iterative one, although it is achieved at an extra hardware cost. The analysis of extra hardware versus power consumption is presented further on, in Section 5.2. Less significant speedup results were achieved with the FFT when compared to the previous SHA-3 analysis. This is in part due to the higher amount of compressed instructions used in the FFT algorithm, in comparison with the SHA-3 one. The FFT software implementation executes a total of 39458 instructions, of which 34179 are compressed, i.e. 87% compressed instructions. On the other hand, the SHA-3 software-only algorithm has only 26% compressed instructions, with 64299 compressed out of a total of 251677 instructions. RISC-V compressed instructions are claimed to reduce code size while also improving performance and energy-efficiency [57].

Figure 5.2: FFT computation speedup using hardware accelerator, for multiple radices implemented in iterative or stream mode.

5.2 Power Efficiency

This section addresses how the hardware accelerators developed under the scope of this Thesis influence the power efficiency of the whole system (PULPino + accelerators). It intends to show that, with the use of hardware accelerators, both performance and energy consumption can be improved. Adding hardware implies an increase of fabric area on ASICs or more logic resources on an FPGA. For applications that directly benefit from the acceleration, it can significantly reduce the computation time, allowing the processor to enter sleep mode at an earlier stage. Most IoT embedded systems, which PULPino targets, can directly benefit from this improvement, since they usually operate in a run-to-halt mode, in which all the required computation is done as fast as possible before entering sleep mode. For hardware acceleration to be energy efficient, the energy saved by entering sleep mode sooner needs to overcome the extra static energy cost the accelerators bring. Measuring the real power consumption of the presented systems is not a simple task. Due to restrictions of the required development platform (ZedBoard), it is not possible to measure the real power consumption of the FPGA fabric alone, only to estimate it using the available tools, explained further ahead. To overcome this issue, a development board able to power the FPGA chip with an external power supply would be needed, as well as real-time control over the FPGA's package temperature, due to its influence on power consumption measurements [58]. With such hardware restrictions, the setup used to measure the system's power consumption relies on Modelsim as the simulation tool. Modelsim records the activity of all active signals of the application under analysis, which is compiled into a Switching Activity Interchange Format (SAIF) file. In Vivado, the hardware is synthesized and implemented, and the simulation runs over the implemented hardware, which corresponds to the hardware effectively mapped into the board's

FPGA. In this simulation, the SAIF file is loaded to improve its accuracy, providing average switching activity information about the active signals. More information about the simulation tool and the justification for choosing Modelsim as the default simulation tool of Vivado was previously detailed in Section 4.2. The following sections present the results of the several power consumption tests performed on the hardware accelerators developed under the scope of this Thesis. The goal was to measure their power consumption in the attainable states of operation, such as different sizes of encrypted messages for SHA-3 or multiple radix/architecture configurations for the FFT, and to conclude which one is the most power efficient regarding computation time, static and dynamic power consumption and overall computation energy, by tracing energy-savings graphs that support this analysis. Some of the graphs presented in the next sections do not contain results for all three frequencies (40MHz, 20MHz and 5MHz), because some of them are similar and do not add relevant curves.
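To make the run-to-halt argument above slightly more concrete, a deliberately simplified energy balance can be written for a task that repeats with period \(T\); this formulation is only illustrative and is not part of the measurement methodology used in this Thesis:

\[
P_{\mathit{active}}^{\mathit{acc}} \, t_{\mathit{acc}} + P_{\mathit{idle}}^{\mathit{acc}} \,(T - t_{\mathit{acc}}) \;<\; P_{\mathit{active}}^{\mathit{sw}} \, t_{\mathit{sw}} + P_{\mathit{idle}}^{\mathit{sw}} \,(T - t_{\mathit{sw}})
\]

Here \(t_{\mathit{acc}} < t_{\mathit{sw}}\) are the computation times with and without the accelerator, and the idle powers differ by the accelerator's extra static power; whenever the inequality holds, the accelerated system consumes less energy per period despite the added hardware.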

5.2.1 SHA-3

The energy efficiency of the SHA-3 accelerator was tested by performing multiple encryptions with different message sizes. They were performed on PULPino implementations with and without the hardware accelerator; the software-only implementation contains no additional hardware, in order to obtain the most accurate power measurements. The initial message length is 44 bytes, corresponding to the commonly used sentence referenced before in Section 5.1.1, and the next test messages correspond to the initial message replicated and concatenated up to 10 times. Figure 5.3 depicts the power measurement results obtained from the computations described previously.

Figure 5.3: PULPino with SHA-3 accelerator: computation energy with and without hardware acceleration, combined with the achieved energy ratio (SW/HW) at 5MHz.

It presents the computation energy, which corresponds to the total amount of energy required to perform the message encryption. Notice that the total energy required by PULPino with the SHA-3 accelerator is multiplied by a factor of 10 in order to be perceptible in the graph; otherwise it would not be visible next to the software-only computation energy. Both were calculated according to Equation 5.1:

\[
\mathit{ComputationEnergy} = \frac{1}{\mathit{Freq}} \times \mathit{ClkCycles} \times \mathit{Power} \tag{5.1}
\]

In this equation, Freq corresponds to PULPino's main operating frequency, ClkCycles to the number of clock cycles required by the processor to conclude the computation, and Power to the on-chip power consumption estimated by Vivado. This power estimate is composed of two main components, dynamic and static power. The dynamic power originates from the logic switching activity, while the static power represents the power consumed by the FPGA logic when no signals are toggling. For PULPino with the SHA-3 accelerator, the dynamic and static on-chip power consumption values are 189mW and 123mW, respectively. On the stock version of PULPino, the dynamic and static values are lower, 47mW and 121mW, respectively, since it does not include the additional hardware the accelerator brings. Nevertheless, the additional static power is very small, adding only 2mW, which translates into a 1.7% increase in static power; this is the amount of extra power the accelerator consumes when idle.

Figure 5.4: PULPino with SHA-3 accelerator: energy saved at 40MHz, 20MHz and 5MHz of main clock frequency, for different encrypted message sizes.

Since the hardware does not vary with the message size, there is no need to collect new on-chip power consumption data for each computation (corresponding to each bar in Figure 5.3). Consequently, the power figures presented before are the same throughout the calculations for the hardware-accelerated and software-only results. The right vertical axis of Figure 5.3 represents the energy ratio, depicted by the solid black line on the graph, which corresponds to the number of times the energy required for the computation was reduced by using the accelerator.

In this 5MHz showcase, it is reduced by up to 160 times when the message length is 440 bytes. As the tendency line of the energy ratio indicates, the energy savings tend to increase with the length of the message, leading to more efficient usage of the SHA-3 accelerator as the message length increases. The same pattern is common to the results at all remaining frequencies, although the maximum achieved energy ratio varies: at 20MHz a maximum energy ratio of 114 times was achieved, while at 40MHz it only goes up to 100 times. All of these values correspond to an input message of 440 bytes. Figure 5.4 depicts the energy savings achieved by using the SHA-3 accelerator, showing how much energy, in percentage, is saved in comparison with the stock version of PULPino, without hardware acceleration. From the obtained results, the lowest energy savings start at 98.23% for a 44-byte message at 40MHz, going up to 99.39% at 5MHz with a message size of 440 bytes. At lower frequencies the energy savings are higher, starting with a difference of 0.68% between the highest and lowest frequencies for a message size of 44 bytes. This delta tends to decrease to 0.39% for 440-byte messages, i.e. as the message size increases. Thus, the operating frequency tends to have less impact on the energy saved for longer messages, meaning that "long" messages can be computed faster, by increasing the main clock frequency, with less impact on the energy savings.
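As a quick cross-check of these figures, the energy ratio and the percentage of energy saved are related by a simple expression; the sketch below illustrates it (the helper names are illustrative, and the only numeric input, the ratio of 160, is the value already quoted above).

#include <stdio.h>

/* Equation 5.1: energy in joules, given frequency [Hz], cycle count and power [W]. */
double computation_energy(double freq_hz, double clk_cycles, double power_w) {
    return (1.0 / freq_hz) * clk_cycles * power_w;
}

/* Percentage of energy saved for a given SW/HW energy ratio. */
double energy_saved_pct(double energy_ratio) {
    return (1.0 - 1.0 / energy_ratio) * 100.0;
}

int main(void) {
    /* The 5MHz, 440-byte SHA-3 case: an energy ratio of 160 (as quoted above)
     * gives (1 - 1/160) * 100 = 99.375%, in line with the reported 99.39%. */
    printf("energy saved: %.2f%%\n", energy_saved_pct(160.0));
    return 0;
}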

5.2.2 FFT

For PULPino with the FFT accelerator, the power results were obtained in a similar manner to the SHA-3 ones, with a slight difference: instead of varying the input data size as before, the same input was tested on multiple configurations of the FFT accelerator's architecture. More precisely, on radix 2 and radix 4, using both the iterative and stream architectures (more details in Section 3.3). Although the speedup results presented previously also include figures for the radix 16 architecture, it was not possible to successfully translate it into hardware on the FPGA, due to problems similar to those stated in Section 5.1.2 regarding the faulty multiplier unit mapped by Vivado. Nevertheless, the power results for radix 2 and 4 are sufficient to draw conclusions on the energy efficiency of PULPino with the attached FFT accelerator. The power results for PULPino without hardware accelerator correspond to the configuration in which no extra hardware was added. Figure 5.5 shows the total on-chip power consumption in mW, divided into two parts: dynamic and static power. Each column of the graph represents a different hardware configuration computing the same input data with the FFT algorithm. The SW-only column corresponds to the version in which no accelerator is attached to PULPino, while all the remaining columns come from the multiple architectures of the FFT accelerator attached to PULPino. As can be seen from the graph in Figure 5.5, the accelerated version always consumes more on-chip power than the SW-only version, which is explained by the additional hardware required by the accelerator. Stream architectures, being more resource-hungry than the iterative ones, have a higher overall on-chip power consumption, while also achieving superior speedups, as shown in Section 5.1.2. This increase in total on-chip power is mainly due to dynamic power, since static power varies significantly less, as shown in Figure 5.6.

Figure 5.5: PULPino with FFT accelerator: dynamic and static on-chip power consumption at 40MHz.

Even though the variations are relatively small when compared with the dynamic power, the increase of static power in the stream architectures over the iterative ones is noticeable. The use of an FFT accelerator translates into a maximum increase of 5mW of static power, which corresponds to 4% of the total on-chip static power of PULPino without accelerators. This means that when the processor is idle, a maximum increase of only 4% in power consumption would occur, which is an acceptable figure for a system targeting IoT embedded devices with restricted power envelopes, which usually operate in a run-to-halt mode and may consequently spend a considerable amount of time in idle or sleep mode. The purpose of attaching accelerators to PULPino is to enhance its power efficiency. Accordingly, Figure 5.7 presents the power saved, in percentage, by using the FFT accelerator when computing an FFT algorithm, achieving a maximum saving of 66% with the radix 2 iterative architecture configuration. Testing the same architectures as before at multiple frequencies allows analyzing which one tends to save more energy, combined with the computation time the algorithm takes to finish execution. The "optimal" mode of operation is achieved when the computation time is minimum and the energy saving is maximum. Thus, if the simple ratio between these two results is calculated for every column of the graphs, it is possible to point out which configuration that is. Therefore,

\( \frac{\mathit{EnergySaved}}{\mathit{ComputationTime}} = 1.75 \), obtained with the radix 2 iterative configuration at 40MHz, is the highest ratio among all tested setups. Of all the configurations, this is the one that requires the fewest FPGA resources; thus, at the higher frequency of 40MHz, it seems to be the best configuration for balancing energy savings and computation time. Even so, this might not be the best configuration for every embedded application, since each one has its own energy restrictions and computation-time requirements. The graph in Figure 5.7 can be a useful guide in finding out which architecture and frequency best fit a certain set of system requirements.

Figure 5.6: PULPino with FFT accelerator: static on-chip power consumption at 40MHz.

Figure 5.8 portrays a graph that combines computation energy and energy ratio, calculated from the same results used in the previous graphs, with the same input data at 40MHz of main clock frequency. The computation energy figures are based on the same Equation 5.1 presented in Section 5.2.1. These results also confirm that the FFT accelerator architecture that saves the most energy is the radix 2 iterative version, as can be seen by analyzing the energy ratio line, which corresponds to the ratio between the energy consumption (in the same graph) of the software-only and hardware-accelerated versions.

5.3 Summary

In this chapter, the analysis of the experimental results was presented, starting with the speedup achieved through the hardware accelerators. All the results refer to the SHA-3 and FFT accelerators developed under the scope of this Thesis. Afterwards, a power efficiency analysis of these accelerators was presented, in which conclusions were drawn about their power consumption and overall impact on the energy-efficiency of PULPino.

Figure 5.7: PULPino with FFT accelerator: energy saved vs computation time.

Figure 5.8: PULPino with FFT accelerator: computation energy vs energy ratio (SW/HW) at 40MHz.

Chapter 6

Conclusions and Future Work

In conclusion, the initial goal of boosting the energy efficiency of PULPino for applications in embedded IoT devices operating within restricted power envelopes was successfully accomplished in this Thesis. The improvements were achieved by attaching two different hardware accelerators, namely a cryptographic SHA-3 accelerator and a digital signal processing FFT accelerator.

In order to successfully attach accelerators and deal with the heterogeneity between accelerator and processor, a custom low-power AXI-lite based interface was developed. It has the advantage of providing a simple, plug-n-play way for current and future accelerators to interface with the processor, encouraging the development of new attachable accelerators by the open-source community, since PULPino was released under an open-source license. Consequently, it saves development time through the reuse of hardware designs, paving the way for more modular embedded systems, in which it is possible to add the most suitable accelerator for a certain final application. Under the scope of this Thesis, two accelerators were attached and evaluated for speedup and energy efficiency: SHA-3 and FFT. Speedups of 185 times on the SHA-3 algorithm and 3 times on the FFT were achieved, with higher speedups attainable as the input data size increases, as shown by the presented tendency lines. As stated, the cryptographic algorithm presents itself as the more suitable for acceleration in comparison to the FFT. This can be explained by the low percentage of compressed instructions that the non-accelerated SHA-3 algorithm translates into, only 26% against 87% for the non-accelerated FFT algorithm; apart from that, the type of processing performed in SHA-3 is more suitable for acceleration in hardware than the FFT. RISC-V compressed instructions are claimed to reduce code size while enhancing energy-efficiency and performance [57]. Regarding energy savings, the SHA-3 and FFT accelerators can save up to 99.39% and 66% of energy, respectively. For the FFT accelerator, an "optimal" point of operation was proposed, setting it to a radix 2 iterative configuration at the maximum attainable frequency of 40MHz; among the several tested configurations, this one obtained the best ratio between energy savings and the amount of time required to compute the FFT algorithm. Other modes of operation might be better suited, depending on the energy

requirements of the target application.

Regarding future work, there is always room for improvement in the current AXI-lite interface, which connects the accelerator to the main AXI interconnect bus. Other kinds of accelerators might benefit from additional control signals or other custom features that could be added to this interface. Apart from AXI-lite, there are other kinds of buses that could improve data communication between the accelerator and the AXI interconnect bus, such as AXI-Stream, which has advantages for data streaming but lacks individual control registers. This could be overcome by implementing a more complex communication protocol over the data stream, achieving higher data throughput while still being able to control the accelerator without any external signals. In this Thesis, the processor is the one fetching the required data from the memories into the accelerator over the AXI bus. Nevertheless, higher data interchange could be achieved by adding Direct Memory Access (DMA) functionality to the accelerator, allowing it to access the data directly from memory. Despite making it possible to achieve higher data throughput on what is usually a known bottleneck, it also brings additional hardware, which might increase the overall power consumption of a system targeting low-power applications. Another possible improvement, for future work, is to enable the control signals received from the accelerator to trigger intrinsic PULPino interrupts, meaning that the processor would not have to operate in polling mode. With interrupts, when the external signal is received, a flag is raised and the proper interrupt service routine is executed. As PULPino already provides such a feature for its peripherals, a similar implementation could be developed for the new attachable accelerators, leaving room for further improvements in its overall energy-efficiency.
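A minimal sketch of what this interrupt-driven wait could look like is shown below, replacing the busy-polling loop of Listings 5.1 and 5.2. The handler name, the completion flag and the use of the RISC-V wfi instruction are illustrative assumptions; the actual hook-up would depend on how the accelerator's completion signal is routed to PULPino's interrupt controller.

#include <stdint.h>

static volatile uint32_t accel_done = 0;

/* Hypothetical interrupt service routine, bound to the accelerator's
 * completion line through PULPino's event/interrupt unit. */
void accelerator_isr(void) {
    accel_done = 1;
}

/* Replacement for "while (*axi_lite_reg != 0xdeadbeef);": the core sleeps
 * until an interrupt arrives instead of continuously polling the bus. */
static inline void wait_for_accelerator(void) {
    while (!accel_done)
        asm volatile ("wfi");   /* wait-for-interrupt: low-power stall */
    accel_done = 0;
}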

References

[1] S. Davis, K. Holland, J. Yang, M. A. Fury, and L. Shon-Roy. The era of iot advancing cmp consum- ables growth. In International Conference on Planarization/CMP Technology (ICPT), 2015.

[2] DFT/FFT IP Core Generator, 2017. URL http://www.spiral.net/hardware/dftgen.html.

[3] H. Hsing. SHA3 Core Specification, 2013.

[4] M. Alioto. Ultra Low Power Design Approaches for IoT. In HOTCHIPS, 2015.

[5] M. Gautschi, P. D. Schiavone, A. Traber, I. Loi, A. Pullini, D. Rossi, E. Flamand, F. K. Gurkaynak, and L. Benini. A near-threshold risc-v core with dsp extensions for scalable iot endpoint devices. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2016.

[6] D. Rossi, F. Conti, A. Marongiu, A. Pullini, I. Loi, M. Gautschi, G. Tagliavini, P. Flatresse, and L. Benini. Pulp: A parallel ultra-low-power platform for next generation iot applications. In HOTCHIPS, 2015.

[7] F. Conti, D. Rossi, A. Pullini, I. Loi, and L. Benini. Energy-efficient vision on the PULP platform for ultra-low power parallel computing. In Proceedings of the 2014 IEEE Workshop on Signal Processing Systems, Piscataway, NJ, 2014. IEEE.

[8] F. Conti, D. Rossi, A. Pullini, I. Loi, and L. Benini. Pulp: A ultra-low power parallel accelerator for energy-efficient and flexible embedded vision. Journal of Signal Processing Systems, 84(3):339– 354, 2016. ISSN 1939-8115. doi: 10.1007/s11265-015-1070-9. URL http://dx.doi.org/10. 1007/s11265-015-1070-9.

[9] M. Rusci, D. Rossi, M. Lecca, M. Gottardi, L. Benini, and E. Farella. Energy-efficient design of an always-on smart visual trigger. In 2016 IEEE International Smart Cities Conference (ISC2), pages 1–6, Sept 2016. doi: 10.1109/ISC2.2016.7580824.

[10] M. Gautschi, M. Schaffner, F. K. Gürkaynak, and L. Benini. 4.6 A 65nm CMOS 6.4-to-29.2pJ/FLOP@0.8V shared logarithmic floating point unit for acceleration of nonlinear function kernels in a tightly coupled processor cluster. In 2016 IEEE International Solid-State Circuits Conference (ISSCC), pages 82–83, Jan 2016. doi: 10.1109/ISSCC.2016.7417917.

[11] Y. Popoff, F. Scheidegger, M. Schaffner, M. Gautschi, F. K. Gürkaynak, and L. Benini. High-efficiency logarithmic number unit design based on an improved cotransformation scheme. In 2016 Design, Automation Test in Europe Conference Exhibition (DATE), pages 1387–1392, March 2016.

[12] F. Conti and L. Benini. A ultra-low-energy convolution engine for fast brain-inspired vision in mul- ticore clusters. In 2015 Design, Automation Test in Europe Conference Exhibition (DATE), pages 683–688, March 2015. doi: 10.7873/DATE.2015.0404.

[13] A. Pullini, F. Conti, D. Rossi, I. Loi, M. Gautschi, and L. Benini. A heterogeneous multi-core system- on-chip for energy efficient brain inspired vision. In 2016 IEEE International Symposium on Circuits and Systems (ISCAS), pages 2910–2910, May 2016. doi: 10.1109/ISCAS.2016.7539213.

[14] F. Conti, D. Palossi, A. Marongiu, D. Rossi, and L. Benini. Enabling the heterogeneous accelerator model on ultra-low power microcontroller platforms. In 2016 Design, Automation Test in Europe Conference Exhibition (DATE), pages 1201–1206, March 2016.

[15] PULP - An Open Parallel Ultra-Low-Power Processing-Platform, 2016. URL http:// iis-projects.ee.ethz.ch/index.php/PULP.

[16] D. Rossi, A. Pullini, I. Loi, M. Gautschi, F. K. Gurkaynak,¨ A. Bartolini, P. Flatresse, and L. Benini. A 60 gops/w, -1.8 v to 0.9 v body bias ulp cluster in 28 nm utbb fd-soi technology. Solid-State Electronics, 117:170–184, 2015.

[17] D. Rossi, A. Pullini, I. Loi, M. Gautschi, F. K. Gurkaynak,¨ J. Constantin, A. Bartolini, I. Miro-Panades, E. Beigne,` F. Clermidy, F. Abouzeid, P. Flatresse, and L. Benini. 193 mops/mw @ 162 mops, 0.32v to 1.15v voltage range multi-core accelerator for energy efficient parallel and sequential digital processing. Cool Chips XIX, pages 1–3, 2016.

[18] A. Pullini, F. Conti, D. Rossi, I. Loi, M. Gautschi, and L. Benini. A heterogeneous multi-core system- on-chip for energy efficient brain inspired vision. ISCAS, pages 2–4, 2016.

[19] D. Rossi, I. Loi, F. Conti, G. Tagliavini, A. Pullini, and A. Marongiu. Energy efficient parallel com- puting on the pulp platform with support for openmp. In 2014 IEEE 28th Convention of Electrical Electronics Engineers in Israel (IEEEI), pages 1–5, Dec 2014. doi: 10.1109/EEEI.2014.7005803.

[20] F. Conti, R. Schilling, P. D. Schiavone, A. Pullini, D. Rossi, F. K. Gurkaynak,¨ M. Muehlberghuber, M. Gautschi, I. Loi, G. Haugou, S. Mangard, and L. Benini. An iot endpoint system-on-chip for secure and energy-efficient near-sensor analytics. IEEE Transactions on Circuits and Systems I: Regular Papers, PP(99):1–14, 2017. ISSN 1549-8328. doi: 10.1109/TCSI.2017.2698019.

[21] F. Conti, D. Palossi, R. Andri, M. Magno, and L. Benini. Accelerated visual context classification on a low-power smartwatch. IEEE Transactions on Human-Machine Systems, 47(1):19–30, Feb 2017. ISSN 2168-2291. doi: 10.1109/THMS.2016.2623482.

[22] PULPino: A small single-core RISC-V SoC, 2016. URL iis-projects.ee.ethz.ch/images/d/d0/ Pulpino_poster_riscv2015.pdf.

[23] A. Traber and M. Gautschi. RI5CY: User Manual, 2016.

[24] G.-R. Uh, Y. Wang, D. Whalley, S. Jinturkar, C. Burns, and V. Cao. Techniques for Effec- tively Exploiting a Zero Overhead Loop Buffer, pages 157–172. Springer Berlin Heidelberg, Berlin, Heidelberg, 2000. ISBN 978-3-540-46423-5. doi: 10.1007/3-540-46423-9 11. URL http://dx.doi.org/10.1007/3-540-46423-9_11.

[25] R. B. Lee. Subword parallelism with MAX-2, volume 16, pages 51–59. IEEEMicro, 1996.

[26] Pulp-Platform Documentation, 2017. URL http://www.pulp-platform.org/documentation/.

[27] J. L. Hennessy and D. A. Patterson. Computer Architecture: A Quantitative Approach. The Morgan Kaufmann Series in Computer Architecture and Design. Elsevier Science, San Francisco, CA, USA, 5th edition, 2011.

[28] B. Benton. Ccix, gen-z, opencapi: Overview & comparison. In OPENFABRICS ALLIANCE, 2017.

[29] Gen-Z-Consortium. Gen-Z Overview, 2016.

[30] Y. Shao and D. Brooks. Research Infrastructures for Hardware Accelerators. Synthesis Lectures on Computer Architecture. Morgan & Claypool Publishers, 2015. ISBN 9781627058322. URL https://books.google.pt/books?id=uzEECwAAQBAJ.

[31] Research Infrastructures for Accelerator Centric Architectures, 2017. URL http://accelerator. eecs.harvard.edu/isca14tutorial/isca2014-tutorial-all.pdf.

[32] Xilinx. Vivado Design Suite User Guide - High-Level Synthesis, 2017.

[33] H. K. Rawat and P. Schaumont. Simd instruction set extensions for keccak with applications to sha-3, keyak and ketje. In Proceedings of the Hardware and Architectural Support for Security and Privacy 2016, HASP 2016, pages 4:1–4:8, New York, NY, USA, 2016. ACM. ISBN 978-1-4503- 4769-3. doi: 10.1145/2948618.2948622. URL http://doi.acm.org/10.1145/2948618.2948622.

[34] H. Rawat and P. Schaumont. Vector instruction set extensions for efficient computation of keccak. IEEE Transactions on Computers, PP(99):1–1, 2017. ISSN 0018-9340. doi: 10.1109/TC.2017. 2700795.

[35] C. Schmidt and A. Izraelevitz. A fast parameterized sha3 accelerator. Technical Report UCB/EECS- 2015-204, EECS Department, University of California, Berkeley, Oct 2015. URL http://www2. eecs.berkeley.edu/Pubs/TechRpts/2015/EECS-2015-204.html.

[36] C. Liu, R. Duarte, O. Granados, J. Tang, and J. Andrian. Critical path based hardware acceleration for cryptosystems 1, 2012.

[37] P. Gaydecki and I. of Electrical Engineers. Foundations of Digital Signal Processing: The- ory, Algorithms and Hardware Design. IEE circuits and systems series: Institution of Electri- cal Engineers. Institution of Engineering and Technology, 2004. ISBN 9780852964316. URL https://books.google.pt/books?id=6Qo7NvX3vz4C.

[38] I. Kramberger. DSP acceleration using a reconfigurable FPGA. In Industrial Electronics, 1999. ISIE '99. Proceedings of the IEEE International Symposium on, volume 3, pages 1522–1525 vol.3, 1999. doi: 10.1109/ISIE.1999.797022.

[39] Xilinx. Fast Fourier Transform v9.0 - LogiCORE IP Product Guide, 2015.

[40] Intel. FFT IP Core - User Guide, 2017.

[41] Xilinx. FIR Compiler v7.2 - LogiCORE IP Product Guide, 2015.

[42] Altera. FIR Compiler- User Guide, 2011.

[43] P. A. Milder, F. Franchetti, J. C. Hoe, and M. Püschel. Computer generation of hardware for linear digital signal processing transforms. ACM Transactions on Design Automation of Electronic Systems, 17(2), 2012.

[44] Xilinx. AXI Reference Guide, 2011.

[45] A. Traber and M. Gautschi. PULPino: Datasheet, 2016.

[46] I. Loi. AXI 4 NODE Application note, 2014.

[47] M. P. G. Bertoni, J. Daemen. The Keccak reference, version 3, 2011.

[48] C. Van Loan. Computational Frameworks for the Fast Fourier Transform. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 1992. ISBN 0-89871-285-8.

[49] P. A. Milder, F. Franchetti, J. C. Hoe, and M. Puschel.¨ Hardware implementation of the discrete Fourier transform with non-power-of-two problem size. In International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2010.

[50] M. Püschel, P. A. Milder, and J. C. Hoe. Permuting streaming data using RAMs. Journal of the ACM, 56(2):10:1–10:34, 2009.

[51] PULPino’s Github online repository, 2016. URL https://github.com/pulp-platform/pulpino.

[52] Xilinx. UG585 - Zynq-7000 AP SoC Technical Reference Manual, 2016.

[53] Xilinx’s Tutorial - Prepare Boot Medium, 2016. URL http://www.wiki.xilinx.com/Prepare+Boot+ Medium.

[54] Zynq-7000 All Programmable SoC Software Developers Guide, 2015. URL https://www.xilinx. com/support/documentation/user_guides/ug821-zynq-7000-swdev.pdf.

[55] A baseline Keccak implementation, 2011. URL https://github.com/coruus/saarinen-keccak/ tree/master/readable_keccak.

[56] A Simple and Efficient FFT Implementation in C++, 2017. URL http://www.drdobbs.com/cpp/ a-simple-and-efficient-fft-implementatio/199500857?pgno=1.

[57] A. Waterman. Improving energy efficiency and reducing code size with RISC-V compressed. Master's thesis, EECS Department, University of California, Berkeley, May 2011. URL http://www2.eecs.berkeley.edu/Pubs/TechRpts/2011/EECS-2011-63.html.

[58] R. P. Duarte and C.-S. Bouganis. Arc 2014 over-clocking klt designs on fpgas under process, voltage, and temperature variation. ACM Trans. Reconfigurable Technol. Syst., 9(1):7:1–7:17, Nov. 2015. ISSN 1936-7406. doi: 10.1145/2818380. URL http://doi.acm.org/10.1145/2818380.

Appendix A

Software-only Algorithms

A.1 SHA-3

67 1 // keccak.c 2 // 19-Nov-11 Markku-Juhani O. Saarinen 3 // A baseline Keccak (3rd round) implementation. 4 5 #include "common.h" 6 7 #define KECCAK_ROUNDS 24 8 9 #define ROTL64(x, y) (((x) << (y)) | ((x) >> (64 - (y)))) 10 11 #if __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__ 12 #define __bswap_64(x) \ 13 ( (((x) & 0xff00000000000000ull) >> 56) \ 14 | (((x) & 0x00ff000000000000ull) >> 40) \ 15 | (((x) & 0x0000ff0000000000ull) >> 24) \ 16 | (((x) & 0x000000ff00000000ull) >> 8) \ 17 | (((x) & 0x00000000ff000000ull) << 8) \ 18 | (((x) & 0x0000000000ff0000ull) << 24) \ 19 | (((x) & 0x000000000000ff00ull) << 40) \ 20 | (((x) & 0x00000000000000ffull) << 56)) 21 #elif __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__ 22 #define __bswap_64(x) (x) 23 #else 24 #error Unsupported endianness 25 #endif 26 27 const uint64_t keccakf_rndc[24] = 28 { 29 0x0000000000000001, 0x0000000000008082, 0x800000000000808a, 30 0x8000000080008000, 0x000000000000808b, 0x0000000080000001, 31 0x8000000080008081, 0x8000000000008009, 0x000000000000008a, 32 0x0000000000000088, 0x0000000080008009, 0x000000008000000a, 33 0x000000008000808b, 0x800000000000008b, 0x8000000000008089, 34 0x8000000000008003, 0x8000000000008002, 0x8000000000000080, 35 0x000000000000800a, 0x800000008000000a, 0x8000000080008081, 36 0x8000000000008080, 0x0000000080000001, 0x8000000080008008 37 }; 38 39 const int keccakf_rotc[24] = 40 { 41 1, 3, 6, 10, 15, 21, 28, 36, 45, 55, 2, 14, 42 27, 41, 56, 8, 25, 43, 62, 18, 39, 61, 20, 44 43 }; 44 45 const int keccakf_piln[24] = 46 { 47 10, 7, 11, 17, 18, 3, 5, 16, 8, 21, 24, 4, 48 15, 23, 19, 13, 12, 2, 20, 14, 22, 9, 6, 1 49 }; 50 51 static inline int mod5(int a) { 52 while (a > 9) { 53 int s = 0; /* accumulator for the sum of the digits */ 54 while (a != 0) { 55 s = s + (a & 7); 56 a = (a >> 3) * 3; 57 } 58 a = s; 59 } 60 /* note, at this point: a < 10 */ 61 if (a > 4) a = a - 5; 62 return a; 63 } 64 // update the state with given number of rounds 65 66 void keccakf(uint64_t st[25], int rounds) 67 { 68 int i, j, round; 69 uint64_t t, bc[5]; 70 71 for (round = 0; round < rounds; round++) { 72 73 // Theta 74 for (i = 0; i < 5; i++) 75 bc[i] = st[i] ^ st[i + 5] ^ st[i + 10] ^ st[i + 15] ^ st[i + 20]; 76 77 for (i = 0; i < 5; i++) { 78 t = bc[mod5(i + 4)] ^ ROTL64(bc[mod5(i + 1)], 1); 79 for (j = 0; j < 25; j += 5) 80 st[j + i] ^= t; 81 } 82 83 // Rho Pi 84 t = st[1]; 85 for (i = 0; i < 24; i++) { 86 j = keccakf_piln[i]; 87 bc[0] = st[j]; 88 st[j] = ROTL64(t, keccakf_rotc[i]); 89 t = bc[0]; 90 } 91 92 // Chi 93 for (j = 0; j < 25; j += 5) { 94 for (i = 0; i < 5; i++) 95 bc[i] = st[j + i]; 96 for (i = 0; i < 5; i++) 97 st[j + i] ^= (~bc[mod5(i + 1)]) & bc[mod5(i + 2)]; 98 } 99 100 // Iota 101 st[0] ^= keccakf_rndc[round]; 102 } 103 } 104 105 // compute a keccak hash (md) of given byte length from "in" 106 107 int do_keccak(const uint8_t *in, int inlen, uint8_t *md, int mdlen) 108 { 109 uint64_t st[25]; 110 uint8_t temp[144]; 111 int i, rsiz, rsizw; 112 113 rsiz = 200 - 2 * mdlen; 114 rsizw = rsiz / 8; 115 116 memset(st, 0, sizeof(st)); 117 118 for ( ; inlen >= rsiz; inlen -= rsiz, in += rsiz) { 119 for (i = 0; i < rsizw; i++) 120 st[i] ^= __bswap_64(((uint64_t *) in)[i]); 121 keccakf(st, KECCAK_ROUNDS); 122 } 123 124 // last block and padding 125 memcpy(temp, in, inlen); 126 temp[inlen++] = 1; 127 memset(temp + inlen, 0, rsiz - inlen); 128 temp[rsiz - 1] |= 0x80; 129 130 for (i = 0; i < rsizw; i++) 131 st[i] ^= __bswap_64(((uint64_t *) temp)[i]); 132 133 keccakf(st, KECCAK_ROUNDS); 134 135 #if __BYTE_ORDER__ == 
__ORDER_BIG_ENDIAN__ 136 137 for (i = 0; i < mdlen / 8; i++) 138 ((uint64_t *) md)[i] = __bswap_64(((uint64_t *) st)[i]); 139 140 int remaining = mdlen % 8; 141 for (i = 0; i < remaining; i++) 142 ((uint8_t *) md)[mdlen - remaining + i] = ((uint8_t *) st)[mdlen + remaining - i - 1]; 143 #else 144 memcpy(md, st, mdlen); 145 #endif 146 147 return 0; 148 } 1 #include "common.h" 2 3 typedef struct { 4 int mdlen; 5 char *msgstr; 6 uint8_t md[64]; 7 } test_triplet_t; 8 9 static const test_triplet_t testvec = { 10 11 64, "The quick brown fox jumps over the lazy dog ", { 12 0x07, 0xb8, 0x47, 0x18, 0xDC, 0xBA, 0x3C, 0x74, 13 0x61, 0x9B, 0xA1, 0xFA, 0x7F, 0x57, 0xDF, 0xE7, 14 0x76, 0x9D, 0x3F, 0x66, 0x98, 0xA8, 0xB3, 0x3F, 15 0xA1, 0x01, 0x83, 0x89, 0x70, 0xA1, 0x31, 0xE6, 16 0x21, 0xCC, 0xFD, 0x05, 0xFE, 0xFF, 0xBC, 0x11, 17 0x80, 0xF2, 0x63, 0xC2, 0x7F, 0x1A, 0xDA, 0xB4, 18 0x60, 0x95, 0xD6, 0xF1, 0x25, 0x33, 0x14, 0x72, 19 0x4B, 0x5C, 0xBF, 0x78, 0x28, 0x65, 0x8E, 0x6A } 20 21 }; 22 23 uint8_t md3[64] __sram; 24 25 uint8_t *md __sram = md3; 26 27 28 extern int do_keccak(const uint8_t *in, int, uint8_t *out, int); 29 30 void keccak_test() { 31 // for (int i = 0; i < 4; i++) 32 do_keccak((uint8_t *) testvec.msgstr, strlen(testvec.msgstr), md, testvec.mdlen); 33 } 34 35 void test_setup() { 36 } 37 38 void test_clear() { 39 //for (int i = 0; i < 4; i++) 40 memset(md, 0, testvec.mdlen); 41 } 42 43 void test_run() { 44 keccak_test(); 45 } 46 47 int test_check() { 48 //for (int i = 0; i < 4; i++) 49 if (0 != memcmp(md, testvec.md, testvec.mdlen)) 50 return 0; 51 return 1; 52 } A.2 FFT

71 1 #include "common.h" 2 3 static int wprBase[] __sram = { 4 32767, 32758, 32729, 32679, 32610, 32522, 32413, 32286, 5 32138, 31972, 31786, 31581, 31357, 31114, 30853, 30572, 6 30274, 29957, 29622, 29269, 28899, 28511, 28106, 27684, 7 27246, 26791, 26320, 25833, 25330, 24812, 24279, 23732, 8 23170, 22595, 22006, 21403, 20788, 20160, 19520, 18868, 9 18205, 17531, 16846, 16151, 15447, 14733, 14010, 13279, 10 12540, 11793, 11039, 10279, 9512, 8740, 7962, 7180, 11 6393, 5602, 4808, 4011, 3212, 2411, 1608, 804, 12 0, -804, -1608, -2411, -3212, -4011, -4808, -5602, 13 -6393, -7180, -7962, -8740, -9512, -10279, -11039, -11793, 14 -12540, -13279, -14010, -14733, -15447, -16151, -16846, -17531, 15 -18205, -18868, -19520, -20160, -20788, -21403, -22006, -22595, 16 -23170, -23732, -24279, -24812, -25330, -25833, -26320, -26791, 17 -27246, -27684, -28106, -28511, -28899, -29269, -29622, -29957, 18 -30274, -30572, -30853, -31114, -31357, -31581, -31786, -31972, 19 -32138, -32286, -32413, -32522, -32610, -32679, -32729, -32758, 20 }; 21 22 static int wpiBase[] __sram = { 23 0, 804, 1608, 2411, 3212, 4011, 4808, 5602, 24 6393, 7180, 7962, 8740, 9512, 10279, 11039, 11793, 25 12540, 13279, 14010, 14733, 15447, 16151, 16846, 17531, 26 18205, 18868, 19520, 20160, 20788, 21403, 22006, 22595, 27 23170, 23732, 24279, 24812, 25330, 25833, 26320, 26791, 28 27246, 27684, 28106, 28511, 28899, 29269, 29622, 29957, 29 30274, 30572, 30853, 31114, 31357, 31581, 31786, 31972, 30 32138, 32286, 32413, 32522, 32610, 32679, 32729, 32758, 31 32767, 32758, 32729, 32679, 32610, 32522, 32413, 32286, 32 32138, 31972, 31786, 31581, 31357, 31114, 30853, 30572, 33 30274, 29957, 29622, 29269, 28899, 28511, 28106, 27684, 34 27246, 26791, 26320, 25833, 25330, 24812, 24279, 23732, 35 23170, 22595, 22006, 21403, 20788, 20160, 19520, 18868, 36 18205, 17531, 16846, 16151, 15447, 14733, 14010, 13279, 37 12540, 11793, 11039, 10279, 9512, 8740, 7962, 7180, 38 6393, 5602, 4808, 4011, 3212, 2411, 1608, 804, 39 }; 40 41 void fft(int *data, int len) { 42 43 int max = len; 44 len <<= 1; 45 int wstep = 1; 46 while (max > 2) { 47 int *wpr = wprBase; 48 int *wpi = wpiBase; 49 50 for (int m = 0; m < max; m +=2) { 51 int wr = *wpr; 52 int wi = *wpi; 53 wpr+= wstep; 54 wpi+= wstep; 55 56 int step = max << 1; 57 58 for (int i = m; i < len; i += step) { 59 int j = i + max; 60 61 int tr = data[i] - data[j]; 62 int ti = data[i+1] - data[j+1]; 63 64 data[i] += data[j]; 65 data[i+1] += data[j+1]; 66 67 int xr = ((wr * tr + wi * ti) << 1) + 0x8000; 68 int xi = ((wr * ti - wi * tr) << 1) + 0x8000; 69 70 data[j] = xr >> 16; 71 data[j+1] = xi >> 16; 72 } 73 } 74 max >>= 1; 75 wstep <<= 1; 76 } 77 78 { 79 int step = max << 1; 80 81 for (int i = 0; i < len; i += step) { 82 int j = i + max; 83 84 int tr = data[i] - data[j]; 85 int ti = data[i+1] - data[j+1]; 86 87 data[i] += data[j]; 88 data[i+1] += data[j+1]; 89 90 91 data[j] = tr; 92 data[j+1] = ti; 93 } 94 } 95 96 97 #define SWAP(a, b) tmp=(a); (a)=(b); (b)=tmp 98 99 data--; 100 int j = 1; 101 for (int i = 1; i < len; i += 2) { 102 if(j > i) { 103 int tmp; 104 SWAP(data[j], data[i]); 105 SWAP(data[j+1], data[i+1]); 106 } 107 int m = len>> 1; 108 while (m >= 2 && j >m) { 109 j -= m; 110 m >>= 1; 111 } 112 j += m; 113 } 114 } 1 #include "common.h" 2 3 #define NINPUTS 256 4 5 short buf[2*NINPUTS] __sram; 6 7 int dataR1[NINPUTS] = { 8 /* inputs for test 1 */ 9 2, -4, -3, -8, -10, -11, -23, 11, 32, 10, 11, 8, 3, 3, -7, -5, 1, -4, -4, -4, 10 -9, -5, -4, -8, -5, -2, 0, 0, -6, -7, -2, 3, 3, 8, 15, 10, 6, 6, 1, 4, -1, 
11 -10, -4, -2, -9, -5, -7, -8, -2, -5, -6, -2, -3, 1, -3, -8, -6, 0, 5, 4, 15, 12 17, 6, 5, 2, 0, 2, -3, -5, 0, -5, -5, -4, -9, -6, -2, -4, -4, -3, -1, 1, -5, 13 -7, -4, 3, 5, 6, 20, 16, 8, 7, 3, 7, 4, -5, -4, -3, -8, -6, -7, -7, -1, -2, 14 -2, -2, -3, 1, 1, -5, -4, 2, 6, 7, 13, 17, 8, 7, 6, 2, 7, 4, 0, -3, -6, -2, 15 -3, -7, -7, -4, -5, -4, -2, 1, 4, -2, -4, -1, 3, 5, 5, 18, 19, 9, 7, 2, 4, 2, 16 -6, -5, 0, -1, -2, -5, -8, -2, -4, -7, -5, -4, 1, 0, -5, -4, 1, 3, 3, 7, 15, 17 11, 6, 5, 2, 6, 3, -5, -4, -4, -7, -6, -9, -8, -3, -4, -5, -5, -4, 1, -3, -7, 18 -5, 0, 4, 3, 12, 15, 7, 5, 4, 1, 1, -5, -7, -1, -2, -5, -4, -8, -7, -3, -6, 19 -6, -5, -3, 0, -5, -6, -3, 1, 2, 3, 13, 14, 9, 6, 3, 4, 3, -4, -6, -3, -5, 20 -5, -6, -9, -5, -3, -5, -5, -3, 0, 0, -5, -3, 1, 3, 3, 9, 16, 10, 6, 6, 6, 8, 21 2, -2, -2, 22 }; 23 24 int dataI1[NINPUTS] = { 25 /* inputs for test 1 */ 26 1, -1, -1, -2, -2, -2, -3, -1, 0, 0, 1, 2, 2, 3, 3, 2, 2, 2, 2, 1, 0, 0, -1, 27 -1, -2, -2, -2, -3, -3, -3, -3, -2, -2, -1, 0, 1, 1, 2, 3, 3, 3, 3, 2, 2, 1, 28 1, 0, 0, -1, -1, -2, -2, -2, -2, -3, -3, -3, -2, -2, -1, 0, 1, 1, 2, 3, 3, 4, 29 3, 3, 3, 2, 2, 1, 1, 0, 0, -1, -1, -2, -2, -3, -3, -3, -3, -2, -2, -1, 0, 1, 30 2, 3, 3, 4, 4, 4, 3, 3, 2, 2, 1, 0, 0, -1, -1, -2, -2, -2, -3, -3, -3, -3, 31 -2, -2, -1, 0, 1, 2, 2, 3, 3, 4, 3, 3, 3, 2, 2, 1, 0, -1, -1, -2, -2, -3, -3, 32 -4, -3, -4, -3, -2, -2, 0, 1, 1, 2, 2, 3, 3, 3, 2, 2, 2, 2, 1, 0, 0, -1, -2, 33 -2, -3, -3, -3, -4, -4, -3, -3, -2, -1, 0, 0, 1, 2, 2, 3, 3, 3, 3, 2, 2, 1, 34 1, 0, 0, -1, -1, -2, -2, -2, -3, -3, -3, -3, -2, -2, -1, 0, 1, 2, 2, 3, 3, 3, 35 2, 2, 2, 2, 1, 0, 0, 0, -1, -1, -2, -2, -3, -3, -3, -3, -3, -2, -2, 0, 1, 1, 36 2, 3, 3, 4, 4, 3, 3, 3, 2, 1, 1, 0, 0, -1, -2, -2, -2, -3, -3, -3, -3, -2, 37 -2, -1, 0, 1, 2, 2, 3, 4, 4, 4, 3, 38 }; 39 40 int ref[2*NINPUTS] = { 41 /* outputs for test 1 */ 42 6, 13, -47, -11, 91, 38, 44, 30, 48, 21, 48, 13, 43 71, 26, 92, 41, 162, 76, 598, 139, -1002, -60, -284, 59, 44 -210, 40, -154, 65, -100, 67, -23, 35, -98, 92, -18, 96, 45 -125, 53, -414, 121, 135, 80, 78, 37, 62, 57, 29, 19, 46 40, -16, 84, 115, 23, -6, 52, 8, -3, -52, 283, 126, 47 123, -38, 50, -63, 28, -16, 31, -78, 31, -36, -2, -62, 48 -12, -54, -33, -43, -1, 60, -12, -141, -44, -49, -21, -68, 49 -62, -59, -32, -61, -39, -23, -48, -13, -70, -14, -28, -7, 50 5, -98, -28, 10, -50, -10, -32, 2, -42, 11, -4, 36, 51 -65, 37, -9, 19, -52, 28, -94, 3, 228, 74, 53, 73, 52 69, 22, 52, 52, 56, 8, 21, 80, 55, 6, 41, 5, 53 54, 21, 95, 83, 8, -75, -6, -27, 23, -32, 14, -27, 54 20, -34, -2, -57, -2, -28, -7, -32, -11, -21, 19, -70, 55 -20, -7, -16, -32, -25, -15, -27, -17, -21, -13, -25, 0, 56 -10, -7, 17, -20, 48, -41, -154, 48, -59, 75, -44, 45, 57 -19, 42, 1, 39, -18, 33, -2, 43, -1, 36, 30, 33, 58 48, 150, 46, -33, 10, -11, 17, 5, 23, 9, 36, 2, 59 29, 2, 22, -9, -1, -16, 11, -8, 47, -38, -1, -14, 60 2, -20, 7, 4, 9, -25, -2, 7, 5, -30, -5, -1, 61 8, 0, 14, -18, -7, 0, -6, 2, 10, -10, -4, 8, 62 7, -3, 8, -3, 9, 7, -8, 10, 2, -3, 12, 8, 63 19, -7, -1, -4, -2, -9, -3, 9, -3, 6, 18, -2, 64 10, -1, 2, -1, -1, -6, 0, 5, -4, 10, 4, 1, 65 0, -10, -6, 7, -4, 4, 8, 21, 8, -9, 3, 19, 66 4, 32, 14, -6, -1, 29, 0, 13, 11, 22, 16, 9, 67 56, 55, 5, -11, 2, 28, 22, 9, 25, 8, 22, 12, 68 17, -2, 13, -6, 7, 10, 40, 31, 72, -156, 39, -36, 69 5, -32, 12, -46, -16, -44, 17, -55, -21, -48, -22, -35, 70 -50, -65, -141, -52, 40, 37, 12, 17, -21, -15, -16, -15, 71 -28, 6, -10, 3, -21, 24, -14, 18, -20, 4, 8, 62, 72 -12, 11, -6, 38, -9, 31, 1, 75, 24, 38, 12, 37, 73 26, 38, -11, 31, 5, 80, 98, -86, 
64, -30, 31, -18, 74 61, -10, 21, -63, 50, -9, 55, -55, 66, -13, 48, -53, 75 219, -56, -86, -15, -58, -36, -12, -19, -56, -42, 5, -34, 76 -27, -27, -17, 4, -48, -3, -20, -6, -1, 83, -29, 3, 77 -65, 7, -45, 10, -44, 29, -23, 62, -58, 49, -28, 52, 78 -39, 49, -14, 147, 27, -68, -7, 45, -8, 41, 10, 65, 79 31, 40, 32, 63, 36, 6, 56, 71, 122, 39, 289, -121, 80 1, 45, 56, 4, 33, 6, 96, -99, 47, 10, 43, -16, 81 67, -40, 81, -38, 126, -58, -303, -140, -63, -55, 8, -75, 82 -45, -88, 3, -31, -38, -54, -66, -54, -79, -34, -112, -52, 83 -396, -18, 216, -26, 69, -15, 42, -17, 25, -15, 34, -26, 84 40, 1, 38, -10, 141, -27, -83, 30, 85 }; 86 87 extern void fft(int *, int); 88 89 void test_clear() { 90 for (int i = 0; i < NINPUTS; ++i) { 91 buf[2 * i] = dataR1[i]; 92 buf[2 * i + 1] = dataI1[i]; 93 } 94 } 95 96 void test_run(int n) { 97 fft(buf, NINPUTS); 98 } 99 100 int test_check() { 101 for (int i = 0; i != 2 * NINPUTS; ++i) 102 if (buf[i] != ref[i]) 103 return 0; 104 105 return 1; 106 } 76