Parallel Ultra Low Power

João Pedro Alves Vieira

Thesis to obtain the Master of Science Degree in Electrical and Computer Engineering

Supervisors: Prof. Aleksandar Ilic, Prof. Leonel Augusto Pires Seabra de Sousa

Examination Committee
Chairperson: Prof. Gonçalo Nuno Gomes Tavares
Supervisor: Prof. Leonel Augusto Pires Seabra de Sousa
Member of the Committee: Prof. Paulo Ferreira Godinho Flores

December 2017

Acknowledgments

First of all, a special thank you goes to my family and closest friends, who supported me throughout this journey, especially when it got tough. I would like to thank Professor James C. Hoe and Professor Peter Milder from Carnegie Mellon University, who were tireless in helping to debug a major issue that was found. I would also like to thank my supervisors for their guidance and insights.

Resumo

The future of the portable electronics market will be built around the Internet of Things, where everyday objects will be connected to the internet and possibly controlled by other devices. These devices have started to appear in our daily activities and are expected to grow substantially in the near future, for example health monitors, light bulbs, thermostats, fitness wristbands, etc. Most of these wireless devices with sensors depend on batteries, for which an energy-efficient mode of operation is essential, achieved by developing devices with architectures capable of meeting both low-power and real-time performance requirements. This Thesis aims to improve the energy efficiency of a low-power processor, namely PULPino. To achieve this, hardware accelerators were added to it in a modular way, with the objective of encouraging the development of new accelerators by the open-source community. To test the viability of this approach, two different kinds of accelerators were individually attached. First, a cryptographic SHA-3 accelerator, which implements a widely used hash algorithm and can improve security in IoT devices. Second, an FFT accelerator, widely used in digital signal processing applications. Both accelerators were tested on PULPino regarding their acceleration capabilities and energy-efficiency improvements, achieving energy savings of up to 99% and 66%, and speedups of 185 and 3 times for SHA-3 and FFT respectively, relative to a non-accelerated version of the algorithms executed on PULPino with a RI5CY core.

Keywords: Internet of Things, Power Consumption, Embedded System, Energy Efficiency.

Abstract

The future of the portable electronics market will be built around the Internet of Things (IoT), where everyday objects will be connected to the internet and possibly controlled by other devices. In fact, examples of these devices, such as health monitors, light bulbs, thermostats and fitness wristbands, have already started to take part in our daily activities and are expected to experience tremendous growth in the near future. Most of these devices rely on battery-powered wireless transceivers combined with sensors, where it is essential to sustain energy-efficient execution by developing device architectures capable of delivering both low power and real-time computing performance. Within the scope of IoT applications, this Thesis aims to boost the energy efficiency of a state-of-the-art ultra-low-power processor, namely PULPino. This challenge was tackled by modularly attaching hardware accelerators to it. They connect to PULPino through a low-power, plug-and-play custom AXI-lite interface, with the objective of encouraging the development of new accelerators by the growing PULPino open-source community. To test the viability of this approach, two kinds of accelerators were individually attached: first, a cryptographic SHA-3 accelerator, implementing a commonly used hash algorithm that could improve the security of IoT applications; and second, an FFT accelerator, implementing an algorithm widely used in Digital Signal Processing (DSP) applications. Both accelerators were tested on PULPino for their speedup and energy-efficiency capabilities, achieving energy savings of up to 99% and 66%, and speedups of 185 and 3 times, for SHA-3 and FFT respectively, in comparison to a non-accelerated version of the algorithms executed on the PULPino RI5CY core configuration.

Keywords: Internet of Things, Ultra-low-power, Embedded System, Energy-Efficiency.

Contents

Resumo
Abstract
List of Figures
Glossary

1 Introduction
   1.1 Motivation
   1.2 Main Objectives
   1.3 Main Contribution of this Thesis
   1.4 Outline

2 Background
   2.1 State-of-the-Art: PULP - Parallel Ultra Low Power Platform
   2.2 PULPino
   2.3 Additional PULPino's Core Configurations
   2.4 Interconnect Networks
      2.4.1 Cache Coherent Interconnect for Accelerators (CCIX)
      2.4.2 GEN-Z
      2.4.3 Open Coherent Accelerator Processor Interface (OpenCAPI)
      2.4.4 Standards Comparison
   2.5 Hardware Accelerators
   2.6 Summary

3 Hardware/Software Co-design
   3.1 AXI Protocol
      3.1.1 AXI Interconnect
   3.2 Overall System Architecture
      3.2.1 Hardware Interface
      3.2.2 Software Interface
   3.3 Hardware Accelerators
   3.4 Summary

4 Implementation and Experimental Work
   4.1 Target Device
   4.2 System Configuration
   4.3 New AXI Interconnect Slave
   4.4 New Accelerator
   4.5 Summary

5 Experimental Results
   5.1 Software vs Hardware
      5.1.1 SHA-3
      5.1.2 FFT
   5.2 Power Efficiency
      5.2.1 SHA-3
      5.2.2 FFT
   5.3 Summary

6 Conclusions and Future Work

References

A Software-only Algorithms
   A.1 SHA-3
   A.2 FFT

List of Figures

2.1 PULP cluster with 4 cores
2.2 Comparison between RI5CY and ARM's Cortex-M4
2.3 RISC-V pipeline
2.4 LSU Software vs Hardware
2.5 Shuffle instruction diagram
2.6 Area breakdown of three core configurations
2.7 Energy consumption comparison between three core configurations
2.8 Use cases of CCIX
2.9 Comparison between typical CPU-memory interface and Gen-Z Media Controller
2.10 Gen-Z architecture aggregating different types of media devices
2.11 Comparison of CCIX, Gen-Z and OpenCAPI main features
2.12 Comparison between SPIRAL generated design and LogiCore FFT v4.1

3.1 PULPino’s SoC block diagram...... 24 3.2 PULPino’s memory map...... 25 3.3 AXI4 node overview...... 26 3.4 PULPino with attached accelerators block diagram...... 27 3.5 SHA-3 kernel overview architecture...... 31 3.6 SHA-3 padding module’s architecture...... 31 3.7 SHA-3 permutation module’s architecture...... 32 3.8 SHA-3 accelerator data path...... 33 3.9 SPIRAL Fast Fourier Transform(FFT) iterative architecture...... 35 3.10 SPIRAL Fast Fourier Transform(FFT) fully streaming architecture...... 35 3.11 FFT accelerator’s data path...... 36

4.1 Xilinx Zynq-7000 SoC block diagram overview
4.2 Implementation block diagram

5.1 SHA-3 computation speedup using hardware accelerator
5.2 FFT computation speedup using hardware accelerator
5.3 SHA-3 computation power versus energy ratio
5.4 SHA-3 accelerator energy saved on multiple frequencies

5.5 FFT accelerator, dynamic and static on-chip power consumption
5.6 FFT accelerator, static on-chip power consumption
5.7 FFT accelerator, energy saved vs computation time
5.8 FFT accelerator, computation energy vs energy ratio (SW/HW)

Acronyms

AMBA Advanced Microcontroller Bus Architecture.

APB Advanced Peripheral Bus.

AXI Advanced eXtensible Interface.

CCIX Cache Coherent Interconnect for Accelerators.

CNN Convolutional Neural Networks.

DCT Discrete Cosine Transforms.

DMA Direct Memory Access.

DSP Digital Signal Processing.

DVFS dynamic voltage and frequency scaling.

FFT Fast Fourier Transform.

FIR Finite Impulse Response.

FPU Floating Point Unit.

FSBL First-Stage Boot Loader.

I2C Inter-Integrated Circuit.

IoT Internet of Things.

IPC Instructions per Cycle.

LANs Local Area Networks.

LNU Logarithmic Number Unit.

LSU load-store-unit.

LUTs Look-Up Tables.

MCU Microcontroller.

NoCs Network-on-chip.

OpenCAPI Open Coherent Accelerator Processor Interface.

PL Programmable Logic.

PS Processing System.

PULP Parallel Ultra Low Power Processor.

PWM Pulse Width Modulation.

SAIF Switching Activity Interchange Format.

SANs System/Storage Area Network.

SHA Secure Hash Algorithm.

SoC System on Chip.

SPI Serial Peripheral Interface.

ULP Ultra-Low-Power.

WANs Wide Area Networks.

Chapter 1

Introduction

In order to satisfy the growing demands of the current consumer electronics market, it is estimated that there will be around 50 billion Internet-connected devices ("things") by 2020. Every year, Internet of Things (IoT) semiconductor device revenues increase by more than 30%, compared to a total semiconductor revenue growth rate of about 5.5% [1]. A large part of the IoT relies on battery-powered wireless transceivers combined with sensors, usually designated as motes (short for remote), enabling interaction with the environment. Such devices demand ultra-low-power circuits and are usually controlled by a Microcontroller (MCU) responsible for sensor interaction (i.e., for gathering the data) and light-weight processing. In the IoT topology, besides the basic motes that in their most primitive form gather data, there are also end nodes that can be configured as gateways for the motes, gathering data from them and eventually pre-processing it before sending it to the cloud; as a consequence, they reduce the amount of data that has to be communicated. One solution for end nodes is presented in this Thesis, namely PULP (Section 2.1), which is able to satisfy both the computing and the energy-efficiency requirements of IoT applications by taking advantage of parallelism. Based on the PULP solution, its further simplification into a more basic unit (PULPino) that fits the motes' requirements is the main goal of this Thesis, in order to provide an adequate substitute for the MCU with the same kind of functionalities (PWM, timers, SPI, I2C, etc.).
IoT nodes address a wide range of applications, which might be optimized in different ways to further achieve enhanced energy efficiency and performance. One approach that fits this goal is to attach hardware accelerators to the IoT node. In this Thesis, a set of well-known types of applications from which an IoT node might benefit was carefully chosen, from the Digital Signal Processing (DSP) and cryptography research areas. DSP applications include a wide variety of kernels that are executed repeatedly and might require considerable computational power. Cryptography on IoT systems is currently a hot topic: due to the reduced computational capabilities and low-power characteristics of such systems, providing the required security can be challenging. In order to reach higher levels of security in such embedded systems, it is necessary not only to reduce the computation time of cryptographic algorithms, but also to reduce the energy consumption associated with them.

To tackle these challenges and take a step towards the objective of boosting the energy efficiency of an ULP system targeting IoT nodes, open-source kernels were integrated into accelerators, one for each application (DSP [2] and cryptography [3]), following a modular approach, and an interface between the accelerators and the processor was developed. Thus, all accelerators attach to the processor and follow the same communication protocol. This approach encourages the open-source community to develop and share their own custom accelerator IPs, continuing the path towards open-source hardware that the release of PULPino (presented further ahead) was intended to foster.

1.1 Motivation

In general, IoT nodes can be characterized by a combination of different features and requirements, such as power constraints, size, communication, processing capabilities and availability. Most use cases of IoT node applications require Ultra-Low-Power (ULP) operation and high energy efficiency [4]. A multi-core energy-efficient platform appears as a good solution to address this issue, with the purpose of satisfying the computational requirements of recent IoT applications, which demand flexible processing of data originating from several sensors such as heartbeat, ECG, accelerometers, cameras and microphones [4]. The presented multi-core platform (PULP) was released as open-source hardware by making available only one of its single cores, enabling anyone to develop on top of it. Due to their intrinsic low-power profile, the released cores are suited for IoT applications, matching the requirements of restricted energy and hardware flexibility of IoT nodes.

1.2 Main Objectives

The main objective of this Thesis is to investigate and further exploit the capabilities of PULPino [5], an ultra-low-power core capable of working within power envelopes of a few mW, in the emergent area of IoT. The final goal of this Thesis is to boost the energy efficiency of PULPino-based systems in order to satisfy IoT constraints and requirements. The existing approaches in the literature are mainly focused on exploiting the capabilities of the PULP platform [6] for different purposes, such as computer vision applications [7–9], extensions and hardware acceleration [10–13] and heterogeneous programmable accelerators [14]. However, there are no existing scientific works that focus on PULPino (the single-core derivation of PULP), especially for IoT purposes. The work proposed in this Thesis aims at closing this gap.
This Thesis differs from the state-of-the-art by developing modular and easily attachable hardware accelerators to further boost the energy efficiency and performance of PULPino-based systems. The development of each accelerator is based on the integration of an open-source kernel for the target application. For the kernels to communicate with the processor, an interface was developed and defined as their communication protocol. Such an implementation requires the development of extra hardware that wraps the kernel into one top-level module, known as the accelerator. In the same way that PULPino encourages open-source hardware, this modular approach to PULPino accelerators stimulates the open-source community to further enhance and develop accelerators for different kinds of applications. Each user might test and deploy the accelerator that best fits their application. This might spark the start of an open-source library of accelerators for a wide variety of applications in the IoT domain.

1.3 Main Contribution of this Thesis

The main contribution of this Thesis is the development of an accelerator-processor interface targeting low-power embedded devices. The interface is based on the AXI-lite protocol, which is compatible with AXI-based buses widely used in embedded hardware designs. The interface provides an easy, plug-and-play manner of deploying an accelerator onto the processor's AXI bus: it does not require any further configuration besides the connection of the standard AXI signals, enabling faster and less painful custom hardware design and development targeting PULPino-based systems.

1.4 Outline

This Thesis is structured as follows. Chapter 2 presents the state-of-the-art approaches together with a detailed description of the PULPino core's architecture, i.e., the fundamental topics for the scope of this Thesis. Chapter 3 provides details about the hardware/software co-design and the overall system architecture used to attach and integrate the hardware accelerators. Chapter 4 provides details about the experimental work, the system's configuration and how to integrate and deploy an accelerator in the targeted platform. Chapter 5 presents an analysis of the results obtained from the experimental work. Finally, Chapter 6 draws the final conclusions and discusses future work.

Chapter 2

Background

This chapter presents the state-of-the-art, featuring relevant scientific works and the main characteristics of PULP, a novel cluster platform intended to be released as open-source hardware in 2018. Since PULPino represents a small part of PULP (one core), has already been released and is the main focus of this Thesis, its most relevant features are subsequently described. Additionally, the core configurations released in August 2017 are also featured. Since the goal of this Thesis is the deployment of hardware accelerators on the PULPino platform, which interfaces through an interconnect bus, state-of-the-art interconnect networks specially designed for such applications are addressed. Hardware acceleration is the last topic approached in this chapter, featuring an overview of state-of-the-art implementations of hardware accelerators related to the kinds of applications targeted in the scope of this Thesis.

2.1 State-of-the-Art: PULP - Parallel Ultra Low Power Platform

PULP is a joint project between the Integrated Systems Laboratory (IIS) of ETH Zurich and the Energy-efficient Embedded Systems (EEES) group of UNIBO. This project aims to develop an open-source scalable hardware and software platform with the objective of breaking the pJ/operation barrier within power envelopes of a few mW [15]. It supports OpenMP, OpenCL and OpenVX, as presented in [6], thus enabling an easier development of parallel algorithms, and it overcomes the power constraints of battery-powered applications, which are restricted to a power envelope of a few mW. PULP's architecture is tuned for efficient near-threshold operation, being optimized for 28nm UTBB FD-SOI technology, providing an extended range of supply voltage, body bias and improved electrostatic control [6]. A PULP cluster embeds a configurable number of RISC-V based cores with a shared instruction cache and scratchpad memory. Figure 2.1 illustrates a PULP cluster with 4 cores [5]. It has already been taped out with OpenRISC-based and RISC-V based cores and achieves an energy efficiency of 193 MOps/mW in a 28nm FDSOI technology [16–18].
It is also possible to scale the system, adjusting the computing capabilities and power consumption to the application, by clock-gating the SRAM blocks and the individual cores, whose number can be reduced down to a single core. The scaling is controlled by the power manager unit of the cluster and it is tightly coupled with dynamic voltage and frequency scaling (DVFS). As a result, the performance of a 28nm FDSOI implementation can be adjusted up to 2 GOPS by scaling the voltage from 0.32V to 1.15V with the cores operating at 500MHz.

Figure 2.1: PULP cluster with 4 cores and 8 TCDM banks of 8kB SCM and 8kB SRAM each, and a shared instruction cache of 4kB [5].

The PULP cluster is perfectly suited for IoT endpoint devices due to its efficiency and low power consumption, while still keeping high computational power [5]. Since only the individual cores are currently available under an open hardware license, and their architecture is already optimized for ULP, they can be intended for IoT remote nodes that do not require as much computational power as an endpoint device. Although PULP is a relatively recent platform (introduced in 2014), it has been the subject of several scientific works targeting different application areas and architecture extensions, as referenced further ahead. In the following text, some of the most relevant state-of-the-art works are described.

Computer Vision Applications

PULP has been used in a set of applications regarding energy-efficient computer vision, by taking advantage of its parallel computing capabilities and support for OpenMP [19]. In [7], as a use case for PULP, it is shown that a computationally demanding vision kernel based on Convolutional Neural Networks (CNN) can be quickly and efficiently switched from a low-power, low frame-rate operating point to a high frame-rate one when a detection is performed. PULP's performance was thereby scaled from 1x to 354x, reaching a peak performance/power efficiency of 211 GOPS/W.

A similar approach to the one from [7] was used in [8], considering a different use case that addresses a motion estimation algorithm for smart surveillance. A CNN-based algorithm was implemented for video surveillance, in which it is possible to scale from a low-power, low frame-rate state up to a high-performance state. Based on this, a sample benchmark was developed, intended for applications in the nano-UAV field, where PULP was used to accelerate the estimation of optical flow from frames produced by an ULP imager, with the objective of autonomous hovering and navigation, achieving results of 14 µJ per frame at 60 fps. Future work aims to improve PULP with the purpose of being competitive with HW accelerators while maintaining the possibility of being programmed with general-purpose software.

Further advances were made regarding smart camera sensors targeting ultra-low-power vision applications using PULP, and its usage within a case study of moving-object detection, as presented in [9]. A 10.6 µW low-resolution contrast-based imager featuring internal analog pre-processing was developed. This local processing allows the total amount of digital data to be sent out of the node to be reduced by 91%. This is done by having a context-aware analog circuit as the imager, which only dispatches meaningful post-processed data to the processing unit, thus reducing the sensor-to-processor bandwidth by 31x with respect to transmitting a full pixel frame.

Extensions and HW Acceleration

Hardware extensions were also explored in order to bring new energy-efficient solutions, e.g., in the case of the shared logarithmic number unit (LNU) implemented in [10], which is an energy-efficient alternative to a conventional Floating Point Unit (FPU). This LNU, optimized for ultra-low-power operation on the PULP multi-core system, is efficiently shared by all the cores. For typical nonlinear processing tasks, this design can be up to 4.2x more energy-efficient than a private-FPU design. This work was continued in [11], where a novel transformation for a hardware-efficient LNU implementation was considered. An area reduction of up to 35% was achieved while supporting additional functionality. Implemented in a 65nm technology, the novel LNU was demonstrated to be up to 4.35x faster than a standard FPU.

In [12], work was done towards the development of Hardware Convolution Engines (HWCEs), i.e., ultra-low-energy co-processors for accelerating convolutions. First implementations concluded that augmenting the PULP cluster with HWCEs could lead to an average boost of 40x or more in energy efficiency on convolutional workloads. Moreover, to take advantage of these improvements, the previously referred implementation was applied to computer vision in [13]. The ability of CNN-based classifiers to "compress" low-information-density data, such as images, into highly informative classification tags makes them suitable for IoT scenarios. A 65nm system-on-chip implementing a hybrid HW/SW CNN accelerator that meets the energy requirements of IoT targets was proposed.

Heterogeneous Programmable Accelerator

PULP was also used as an accelerator in heterogeneous systems for speeding up computation-intensive algorithms. In [14], a heterogeneous architecture was developed by coupling a Cortex-M series MCU with PULP, supporting the offload of parallel computational kernels from the MCU to PULP by taking advantage of the OpenMP programming model supported by PULP.

On the IoT scope, [20] proposed Fulmine, a SoC based on a tightly-coupled multi-core cluster for near-sensor data analysis, as a promising path to Internet of Things (IoT) endpoints. It minimizes the energy spent on communication as well as the network load and, at the same time, addresses security by making hardware-supported encryption functions available, while also supporting software programmability for regular computing tasks.

Still in the context of computer vision, but also using PULP as a heterogeneous programmable accelerator, [21] presents a novel implementation of an ultra-low-power system based on PULP together with a TI MSP430 microcontroller. It proposes a solution for the market of wearable devices, which currently cannot offer continuous data monitoring due to very short battery life. It could bring new functionalities, such as sports performance enhancement, elderly monitoring, disease management or several other applications in sports, fitness, gaming or even entertainment. PULP enables context classification based on convolutional neural networks, achieving very low-power operation with 2.2 mJ per classification while reaching speedups of up to 500x with respect to the TI MSP430 operating under the same power restrictions.

2.2 PULPino

PULPino is a small single-core system based on PULP, as previously mentioned in Section 2.1. As such, PULPino represents a first step towards the release of the full PULP as an open-source multi-core platform. Being part of PULP, PULPino inherits its IPs and cores, focusing on ease of use and simplicity. Its open-source release was done in February 2016 under the Solderpad hardware license, including complete RTL sources, all IPs, the RI5CY core based on RISC-V, an environment for RTL simulation and the complete FPGA build flow. In January 2016, the first ASIC, called Imperio, was taped out with PULPino on it. The main characteristics of the PULPino core are presented in Figure 2.2.

Figure 2.2: Comparison between RI5CY and ARM's Cortex-M4 [22].

PULPino offers reduced power consumption and area for the same manufacturing technology and conditions when compared with ARM's Cortex-M4. It also features an IPC close to one, full support for the base integer instruction set (RV32I), compressed instructions (RV32C) and partial support for the multiplication instruction set extension (RV32M). Non-standard extensions have been implemented featuring hardware loops, post-incrementing load and store instructions, ALU and MAC operations. Dot-product and sum-of-dot-products instructions on 8-bit and 16-bit data types allow up to 4 multiplications and accumulations to be performed in one cycle, consuming the same power as a 32-bit MAC operation. Support for real-time operating systems such as FreeRTOS was also added. A low-power mode is available, in which only a simple event unit remains active, able to wake the core up upon the arrival of an event or interrupt; all other components are clock-gated, consuming minimal power. The core with the extended ISA is on average 37% faster for general-purpose applications. It can also achieve average speedups on convolutions of up to 3.9x [5]. With these extensions, the core is only 15x less energy-efficient than a state-of-the-art hardware accelerator, but has the advantage of being a general-purpose architecture that can be used for a wide range of applications [5, 15].

Pipeline Architecture

Since the target is ULP operation, some considerations about the pipeline should be taken into account. The number of pipeline stages is one key aspect of the design: a high number of stages increases the overall throughput and allows higher frequencies, but it also increases the latency and the likelihood of data and control hazards. In this case, some high-end optimizations (such as speculation and branch prediction) might not be the best approaches to overcome this issue, since they cause an increase in the overall power consumption. The organization of the pipeline is illustrated in Figure 2.3; it consists of four stages: instruction fetch (IF), instruction decode (ID), execute (EX) and write-back (WB). The ALU is extended with fixed-point arithmetic and enhanced multipliers that support dotp operations (while still keeping the same timing), since the critical path is mainly determined by the memory interface. The cluster can achieve frequencies of 350-400MHz under typical conditions in a 65nm implementation, reaching higher frequencies than commercially available MCUs, which usually operate in the range of 200MHz [5].

Figure 2.3: Simplified block diagram of the RISC-V four-stage pipeline [5].

Instruction Fetch Unit

The RISC-V standard supports compressed instructions, which are 16-bit wide. Since the instruction cache holds both standard and compressed instructions, it is possible to get misaligned instructions when an odd number of 16-bit instructions is followed by a 32-bit instruction, which would require an additional fetch cycle and stall the processor. Thanks to the pre-fetch buffer, shown in the block diagram of Figure 2.3, it is possible to fetch 128 bits (a complete cache line) instead of a single instruction, which reduces the accesses to the shared instruction cache, since 4 to 8 instructions can be fetched in one access. The misaligned-instruction problem is solved by having an additional register that stores the last instruction. This register contains the lower 16-bit part of a 32-bit instruction, which can be combined with the higher part and forwarded to the Instruction Decode stage. This prevents stalls when a misaligned instruction occurs, except for jumps, branches or hardware loops [5].

Hardware Loops

Hardware loops, or zero-overhead loops, allow a piece of code to be executed multiple times while eliminating the overhead of branches and of updating a counter, being a very common feature in DSP. The hardware loop controller can be configured through software by defining the start address (pointing to the first instruction of the loop), the end address (pointing to the last instruction to be executed in the loop), and by setting the counter value that is decremented every time the loop is completed. These configuration registers are mapped in the Control/Status Register (CSR) block illustrated in Figure 2.3. This configuration can also be handled during interrupts or exceptions through a set of dedicated instructions, which are automatically inserted by the modified GCC compiler [5, 23]. Moreover, improvements were made regarding a loop buffer that acts as a cache holding the loop instructions, thus eliminating fetch delays [24].
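
As an illustration of the kind of code that benefits from this feature, the C fragment below shows a fixed-trip-count loop that the modified GCC toolchain can, in principle, map onto a hardware loop. The function and variable names are invented for this sketch and are not part of the PULPino sources.

```c
#include <stdint.h>

/* Illustrative only: a regular, counted loop of the kind the modified
 * PULP GCC toolchain can turn into a hardware (zero-overhead) loop.
 * Once the start/end addresses and the trip count are configured, the
 * body repeats without per-iteration branch or counter-update
 * instructions. The regular x[i]/y[i] access pattern also benefits from
 * the post-incrementing load/store instructions described next. */
void scale_accumulate(int32_t *y, const int32_t *x, int32_t c, int n)
{
    for (int i = 0; i < n; i++) {
        y[i] += c * x[i];
    }
}
```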

Load-store Unit

The load-store unit (LSU) is responsible for accessing the data memory, and it can load and store 32-bit, 16-bit and 8-bit words. A post-increment addressing mode was added to this unit, which performs a load/store instruction while simultaneously increasing the base address by a specified offset. This feature leads to speedups of up to 20% when memory access patterns are regular, as normally found in loops (e.g., matrix multiplication). The address increment is embedded in the memory access instructions, thus eliminating the need to use separate instructions to handle pointers.

Support for unaligned data memory accesses, which can happen frequently, is also provided. This support detects when an unaligned access occurs and stores the high-word data in a temporary register, which is combined with the lower word when the second access is issued. This feature has advantages in code size, as shown in Figure 2.4, and in the number of cycles required to access unaligned data.

Figure 2.4: a) Support for unaligned access in software (5 instructions/cycles) and b) with hardware support (1 instruction, 2 cycles) [5].

Packed SIMD Support

Support for subword parallelism is provided in PULPino, which consists in packing multiple sub-words into a word and processing the whole word at once. This allows data-level parallelism to be exploited through small-scale SIMD processing, since the same instruction is applied to all the sub-words within the whole word [25]. Since this is a 32-bit processor, it can compute up to four bytes in parallel. This is advantageous for IoT applications because data acquired from sensors is frequently 8-bit or 16-bit. To accomplish this, the ALU integrates a vectorized datapath segmented into two or four lanes, where vectorial operations like addition and subtraction are computed as four sub-operations. A shuffle instruction was also implemented, able to generate an output that is any combination of sub-words of the two input operands, as illustrated in Figure 2.5, where a third operand sets the selection criteria [5].

Figure 2.5: Shuffle instruction example diagram. Allows the combination of bytes from r1 and r2 through mask encoding [5].
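
To make the shuffle semantics more concrete, the following plain-C model builds each output byte from one of the eight source bytes of the two input operands, selected by a per-byte field of the mask. The mask encoding used here is an assumption made purely for illustration; it is not the actual RI5CY instruction encoding.

```c
#include <stdint.h>

/* Plain-C sketch of a byte shuffle: each result byte is picked from one
 * of the eight source bytes of r1 and r2, according to a 3-bit selector
 * per byte in the mask operand (illustrative encoding only). */
static uint32_t shuffle_bytes(uint32_t r1, uint32_t r2, uint32_t mask)
{
    uint8_t src[8];
    for (int i = 0; i < 4; i++) {
        src[i]     = (uint8_t)(r2 >> (8 * i));  /* selectors 0..3: bytes of r2 */
        src[i + 4] = (uint8_t)(r1 >> (8 * i));  /* selectors 4..7: bytes of r1 */
    }
    uint32_t out = 0;
    for (int i = 0; i < 4; i++) {
        uint8_t sel = (uint8_t)((mask >> (8 * i)) & 0x7u);
        out |= (uint32_t)src[sel] << (8 * i);
    }
    return out;
}
```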

Fixed-Point Support

Fixed-point operations can be thought of as a low-cost floating-point alternative. Considering that many applications do not require floating-point accuracy, simpler fixed-point arithmetic can be used, saving power and area [5]. Fixed-point numbers are usually given in the Q-format, which means that a number is represented in the format Qm.n, where m is the number of integer bits and n the number of fractional bits. These core extensions were designed to support any Q-format, only limited by m + n < 32. Conversions from one fixed-point representation to another require the number to be normalized by shifting its bits; before eliminating the extra bits, it might be desirable to perform a rounding operation to improve the accuracy. Hence, an add-round-normalize instruction is provided, which can save code size and execution time by adding the rounding term while shifting the number by the immediate value. A clip instruction is also available to check whether the number lies between two values and saturate the result to the upper or lower bound if it is out of range [5].
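
A minimal C sketch of what the round-and-normalize step amounts to after a Q-format multiplication is given below. It is a functional illustration only, assuming a signed Qm.n representation with n between 1 and 30; it is not PULPino code and does not model the single-instruction fusion described above.

```c
#include <stdint.h>

/* Multiply two Qm.n numbers: the 64-bit intermediate keeps full
 * precision, the added constant rounds by half an LSB of the final
 * format, and the shift normalizes back to Qm.n. Assumes 1 <= n <= 30. */
static int32_t q_mul(int32_t a, int32_t b, unsigned n)
{
    int64_t prod = (int64_t)a * (int64_t)b;   /* intermediate with 2n fractional bits */
    prod += (int64_t)1 << (n - 1);            /* rounding term */
    return (int32_t)(prod >> n);              /* normalize to n fractional bits */
}

/* Saturating clip, mirroring the clip instruction mentioned above. */
static int32_t q_clip(int32_t x, int32_t lo, int32_t hi)
{
    return (x < lo) ? lo : (x > hi) ? hi : x;
}
```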

Iterative Divider and Multiplication

A long integer division algorithm was chosen because it reuses the existing comparators, shifters and adders of the ALU. This divider has a variable latency of 2 to 32 cycles, depending on the input operands. Despite being slower than a dedicated divider, it has a low area overhead [5]. The multiplier block has three modules: a 32x32 multiplier, a fractional multiplier and two dot-product multipliers. It is capable of multiplying two vectors and accumulating the result in one cycle. The dot-product multiplier allows up to four multiplications and three additions to be performed in one operation, as follows: d = a[0] · b[0] + a[1] · b[1] + a[2] · b[2] + a[3] · b[3], where a and b are 32-bit registers, a[i] and b[i] correspond to individual bytes of these registers, and d is a 32-bit result. This multiplier also offers fixed-point support, which may require rounding and normalization after the operation, both accomplished in a single instruction. It is important to note that, for these operations, rounding and then normalizing reduces the overall error [5].
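
The following plain-C function is a functional reference for the dot-product operation above, with the four bytes of each 32-bit register treated as signed 8-bit values (the signedness is an assumption made for this illustration); RI5CY performs the equivalent work, optionally with accumulation, in a single instruction.

```c
#include <stdint.h>

/* Functional model of d = a[0]*b[0] + a[1]*b[1] + a[2]*b[2] + a[3]*b[3],
 * where a[i] and b[i] are the bytes packed into the 32-bit operands. */
static int32_t dotp_bytes(uint32_t a, uint32_t b)
{
    int32_t d = 0;
    for (int i = 0; i < 4; i++) {
        int8_t ai = (int8_t)(a >> (8 * i));   /* extract byte i of a */
        int8_t bi = (int8_t)(b >> (8 * i));   /* extract byte i of b */
        d += (int32_t)ai * (int32_t)bi;       /* multiply-accumulate  */
    }
    return d;
}
```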

2.3 Additional PULPino’s Core Configurations

After the initial release of PULPino in February 2016, based on the previously presented RI5CY architecture [5], new cores with different configurations were additionally released in August 2017:

• RI5CY + FPU: the same previously presented RI5CY core, enhanced with a single-precision Floating Point Unit (FPU) compliant with the IEEE-754 standard for floating-point arithmetic [26].

• Zero-riscy: an area-efficient core with a 2-stage pipeline, implementing the RISC-V RV32-IMC instruction set, like the RI5CY core [26].

• Micro-riscy: An even smaller core than the previous ones, implementing RV32-EC instruction set. It features only 16 general purpose registers and no hardware multiplication support [26].

The new core alternatives give the hardware designer improved control over area efficiency, allowing the core that best fits the target application to be chosen and moving towards more efficient and area-effective embedded system designs. Fig. 2.6 shows an area breakdown of the three different core architectures.

Figure 2.6: Area breakdown of the three core configurations. Results from an ASIC synthesis run [26].

As can be noticed, the RI5CY core has the biggest area footprint, which might seem a drawback. However, it presents itself as the best choice for DSP applications, featuring a set of hardware extensions and optimizations for that purpose, as presented in Section 2.2. Zero-riscy has its area reduced by more than half with respect to the full-featured RI5CY core. Designed to be small and efficient, it operates with a 2-stage pipelined, area-efficient RISC-V based core that can be configured to fully support four ISA configurations: the RV32I base integer instruction set, the RV32E base integer instruction set, the RV32C standard extension for compressed instructions and, finally, the RV32M integer multiplication and division instruction extension. Moreover, to reduce the area footprint, some enhanced instructions supported by RI5CY and RI5CY + FPU were removed, namely hardware loops, post-increment load/store, fixed-point, bit manipulation and packed-SIMD instructions. The addition of the FPU, although not shown in the graph under analysis, implies an extra area of 28.6 kGE (kilo gate equivalents), a 1.7x area increase, when added to the RI5CY core. No further details about area or energy consumption are available for the RI5CY + FPU core. Micro-riscy, the smallest core, is 3.5x smaller than RI5CY. Even more optimized than the previous ones, it mainly targets ULP control applications in which area and power consumption requirements prevail over performance. As depicted in Fig. 2.7, micro-riscy has the lowest energy consumption when executing Runtime (a control-intensive application with very few ALU operations). In that figure, all the cores run at a low frequency of 100 kHz, implemented in a UMC65 technology at 1.2V and 25°C. As can be seen, depending on the application benchmark, some cores perform better, due to their architectures being optimized for such applications. For 2D convolution, RI5CY has the lowest energy consumption due to its enhanced instructions, which improve performance on DSP applications [26].

Figure 2.7: Energy consumption comparison between the three core configurations [26].

2.4 Interconnect Networks

Computer-based systems consist of individual components connected together and communicating with each other. This reasoning does not only apply to internal computer components, but may also be applied to computers themselves, resulting in networks of connected computers. These networks rely on communication standards to establish rules on how the data is converted and transferred among the several components. There are four networking domains into which interconnect networks may be classified, based on the number of connected devices and their proximity:

1. Wide Area Networks(WANs) - Connect a wide range of distributed computer systems around the globe over large distances.

2. Local Area Networks(LANs) - Usually to connect computer systems across small areas of a few kilometers. May also be used in machine rooms or through buildings.

3. System/Storage Area Networks (SANs) - Used to connect multiple processors, or in processor-memory connections within multiprocessor or multicomputer systems. SANs may also be used within servers and data center environments, where connections between storage and I/O components are required, usually with a distance span of a few tens of meters between them.

4. Networks-on-chip (NoCs) - Networks used for interconnecting micro-architecture functional units within chips, e.g., caches, processors, register files or IP cores. Currently this kind of network supports a few tens to a few hundred of such devices connected together, within a distance on the order of centimeters. Nowadays, some proprietary designs are gaining wider use, e.g., Sonics' Smart Interconnect, IBM's CoreConnect or ARM's AMBA. There are also recent standards with the purpose of improving the interconnectivity between accelerators and other system components (e.g., CCIX, Gen-Z and OpenCAPI) [27].

The three new standards categorized above as NoCs were announced in 2016 and are being developed towards the goal of optimizing and easing the connection between accelerators and processors in a tightly-coupled manner. The driving forces behind these new standards are related to a better exploitation of new and emerging memory/storage technologies (streamlined software stacks, direct memory access, etc.) and to solutions based on open standards. There are three distinct standards mainly because different groups of companies have been working to solve similar problems, and therefore each approach has its differences; in the near future, a convergence of standards is likely. In the next sections, all three standards are explained in more detail.

2.4.1 Cache Coherent Interconnect for Accelerators (CCIX)

CCIX was founded with the purpose of enabling a new class of interconnect, driven by emerging acceleration applications such as 4G/5G wireless technology, in-memory databases, network processing and machine learning. It allows processors based on different ISAs to perform peer processing with multiple acceleration devices such as FPGAs, GPUs, custom ASICs, etc. CCIX uses a tightly coupled interface between processor, accelerators and memory, together with hardware cache coherence across the links. Data sharing does not require drivers or interrupts. Fig. 2.8 presents some possible system configurations using the CCIX specification. Some of CCIX's main targets are: low-latency main memory expansion; extending processor cache coherency to network/storage adapters, accelerators, etc. [28].

Figure 2.8: Use cases of CCIX [28].

2.4.2 GEN-Z

GEN-Z is defined as a high-performance, low-latency, memory-semantic fabric enabling communication among all the devices in a system. GEN-Z enables the creation of an ecosystem in which a wide variety of high-performance solutions can communicate together, allowing a unification of communication paths and simplifying software through load/store memory semantics.

Figure 2.9: Comparison between (a) a typical CPU-memory interface and (b) decoupling the media from the SoC with a Gen-Z media controller.

A typical CPU/memory interface is shown in Fig. 2.9a, in which the SoC (containing a media controller) connects to a DRAM interface over a memory bus. In comparison, as depicted in Fig. 2.9b, the media controller is decoupled from the SoC, being placed where it makes more sense to be: with the media module. This important change, made possible by the Gen-Z fabric, allows every compute entity to be agnostic and disaggregated. The Gen-Z architecture easily aggregates different types of media, devices or resources and allows them to scale independently from any other resource in the system, as shown in Fig. 2.10. It may also function as a gateway to other networks, e.g., Ethernet and InfiniBand [29].

Figure 2.10: Gen-Z architecture aggregating different types of media devices [29].

2.4.3 Open Coherent Accelerator Processor Interface (OpenCAPI)

OpenCAPI is an open interface protocol that allows any processor to attach to coherent user-level accelerators, I/O devices and advanced memories (accessible via read/write or DMA semantics). The semantics used to communicate with the multiple components are agnostic to the processor architecture. The main attributes of OpenCAPI are high bandwidth, low latency and being based on virtual addresses, which are handled on the host processor to simplify the attached devices and, consequently, ease the devices' interoperability between systems of different architectures [28].

2.4.4 Standards Comparison

The three emerging standards referred to before are likely to converge in one consortium between several companies, in a way that they might complement each other and evolve into a more advanced and complete standard. Figure 2.11 summarizes the main specifications detailed before, making it possible to compare their main features. In essence, Gen-Z is a new data access technology that allows operations with directly attached or disaggregated memory/storage; CCIX enables coherency among several heterogeneous components; and OpenCAPI provides coherent accesses between system memory and accelerators, through virtual addresses supported by the host processor.

Figure 2.11: Comparison of CCIX, Gen-Z and OpenCAPI main features [28].

All these standards are suited for high-performance IPs and applications, which usually do not target low-power operation and require high-throughput data rates (as seen in Fig. 2.11). Within the scope of this Thesis, a low-power interconnect technology is required, one that aims at simplicity and a reduced hardware footprint while still meeting the processor/accelerator data bandwidth requirements. Therefore, none of the state-of-the-art standards presented is suited to be implemented as the processor-accelerator interface in a PULPino-based system.

2.5 Hardware Accelerators

A hardware accelerator is a specialized unit designed to perform a very specific task or set of tasks, achieving higher performance and energy efficiency than a general-purpose CPU for that specific application. The use of accelerators is not new; in fact, it dates back to the 1980s, with the deployment of floating-point co-processors as one of the first adoptions of accelerators. Since then, they have been widely featured in SoC architectures for embedded system designs over the past few decades [30]. Nowadays, hardware accelerator design faces challenges such as flexibility and design cost [31]. A design is flexible when it may address a large set of applications with the same initial design. Most accelerators are based on fixed functions only used in a specific target application; therefore, providing high flexibility requires a large number of accelerators to cover a wide set of applications. Accelerators are especially suited for real-time applications, I/O processing, data streaming (network traffic, video, audio, etc.), specific "complex" functions (DCT, FFT, exp, log, etc.) or specific "complex" algorithms such as neural networks. Designing such systems as hand-written RTL implementations is highly tedious, time-consuming and consequently costly. Possible solutions to overcome these issues may be found in high-level synthesis tools, which provide an automated way to generate digital hardware by interpreting an algorithm written in a higher-level language (C, C++, SystemC, Chisel, etc.) and translating it into synthesizable hardware that fits the application's goals. Xilinx has been paving the way for such implementations with the Vivado HLS tool [32]. Another approach is to create universal and optimized interfaces between processors and accelerators, as shown in Section 2.4. The development of custom hardware solutions is usually associated with high design costs, due to the many hours, weeks or even years that some designs take to complete, mainly because of traditional design methods: the design is mostly intuition-driven, decisions have to be made upfront and may lead to costly miscalculations [31]. A possible and attainable solution is presented in this Thesis, by defining a light-weight interface between the processor and the accelerator, based on the AXI-lite standard. It allows the reuse of accelerator hardware designs in any system that supports the AXI specification, which is widely used. Associated with an embedded processor, it enhances flexibility and decreases design cost by reusing the same accelerators in different platforms and scenarios. Accelerator performance analysis takes into account a critical parameter, the speedup: it measures how many times the use of an accelerator reduces the execution time that a non-accelerated system would take to complete the exact same task. The speedup is influenced by the accelerator computing time, the synchronization with the processor and the data transfer bandwidth, which are usually the main bottlenecks of the system. This analysis translates into the equation for the accelerator's total execution time:

$T_{acc} = t_{in} + t_{comp} + t_{out},$

in which $t_{in}$ and $t_{out}$ represent the time taken to transfer the input and output data into and out of the accelerator, respectively, and $t_{comp}$ is the accelerator's computing time.
Some problems in specific applications were tackled by using PULP itself as an accelerator, as referred to in Section 2.1, although none of those works combines PULPino with hardware acceleration as done in this Thesis. In the mentioned section, some applications of PULP target security in IoT nodes [20] and Digital Signal Processing (DSP) [10][11]. However, all of them are tightly coupled with PULP's architecture, making it impossible to easily switch between different kinds of accelerators without the effort of redesigning the whole architecture and hardware once again. As said before, the high cost of custom hardware design encourages the reuse of IPs, which is one of the main goals of this Thesis, while still targeting low-power embedded applications, just like PULP.
Cryptographic and Digital Signal Processing (DSP) functions are highly suitable for hardware acceleration. They usually require a considerable amount of operations per input and are frequently used throughout an application's algorithm. These functions are generally computed inefficiently on general-purpose CPUs, which lack extensions for such specific functions. Examples of improvements for this kind of function are [33] and [34], in which SIMD instruction set extensions, applied to the ARM NEON and Intel AVX (only in [33]) architectures, were developed to support applications such as SHA-3, Keyak and Ketje. They achieved significant performance improvements over software, in comparison with cryptographic applications executed on general-purpose CPUs. SHA-3 has not only been improved by SIMD instruction extensions, but also by specialized hardware accelerators, as proposed in [35]. That accelerator, implemented as a co-processor compliant with the RoCC interface, is based on a parametrized implementation using automated tools and is integrated with a Rocket RISC-V processor. Developed in the Chisel hardware construction language, it offers different levels of configuration regarding performance, energy efficiency and size. A tool was also developed to help the designer choose the configuration that achieves an optimal operating point, based on the requirements of the system under development. Besides the previous cryptographic applications, [36] additionally explores RSA and Blowfish cryptography, with the goal of providing a design that focuses on optimizing the critical path of each cryptographic algorithm. [36] is based on a customized co-processor to improve the overall throughput on an FPGA platform, achieving reductions in energy consumption and improvements in performance over a standard software implementation.

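Returning to the execution-time model above: although not stated explicitly in the text, the speedup it refers to can be written directly in terms of that model, with $T_{sw}$ denoting the execution time of the software-only implementation:

$$S = \frac{T_{sw}}{T_{acc}} = \frac{T_{sw}}{t_{in} + t_{comp} + t_{out}}$$

Acceleration therefore only pays off when $S > 1$, i.e., when the data-transfer overhead $t_{in} + t_{out}$ does not outweigh the reduction in computation time.
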
Besides security, Digital Signal Processing (DSP) is also a huge subject on which most of today's computing-intensive applications are based [37]; for instance, functions like adaptive Finite Impulse Response (FIR) filters, the Fast Fourier Transform (FFT) and Discrete Cosine Transforms (DCT) are required in a wide range of applications, such as audio/video compression and decompression, convolution, encryption, computer vision, digital switches and many other systems [38]. Some of these applications are currently used in novel IoT or ultra-low-power systems, as targeted in [7], [8], [9], [19] and [21] regarding computer vision (as presented in Section 2.1). Apart from computer vision, PULPino provides several improvements that enhance performance and reduce energy consumption (see Section 2.2) for the kinds of applications DSP addresses. Today's main FPGA manufacturers already provide full-featured DSP cores, e.g., the FFT cores [39, 40] (the first provided by Xilinx) and the FIR cores [41, 42]. A more optimized solution that claims to beat such proprietary core implementations is SPIRAL, a hardware generation framework and system for linear transforms introduced in [43]. It takes a problem specification as input, together with additional configurations that shape the datapath; the configuration method resembles the ones used in the previously mentioned cores. The system automatically customizes the algorithm, maps it to a datapath and finally produces a synthesizable RTL description, ready for FPGA or ASIC implementation. A comparison between these generated designs and the previously mentioned Xilinx FFT core was performed and is portrayed in Fig. 2.12. The graphs expose throughput and latency for a DFT design with 1024 samples (16-bit fixed point) on a Xilinx Virtex-5 FPGA. It was synthesized with Xilinx ISE, and all the data relative to performance and cost were obtained after the place-and-route stage. As seen in Figure 2.12, the SPIRAL design outperforms the Xilinx LogiCore FFT core in both latency and throughput. It might have an extra area cost in comparison with the Xilinx one, although that can be managed by the designer in the generation tool [2] (further detailed in Section 3.3).

Figure 2.12: Comparison between the SPIRAL-generated design and the Xilinx LogiCore FFT v4.1: (a) latency (µs) vs. area (slices); (b) throughput (million samples/s) vs. area (slices) vs. performance (Gop/s). Results from a 1024-sample DFT (16-bit fixed point) on a Xilinx Virtex-5 FPGA [2].

2.6 Summary

In this chapter, the state-of-the-art of the PULP platform was presented, starting with its main principles of operation and further elaborating on the relevant scientific works performed on this platform, namely in computer vision applications, extensions and HW acceleration, and heterogeneous programmable accelerators. Afterwards, a more detailed description of the PULPino core was presented, addressing its main features that are relevant for the scope of this Thesis, and the new core configurations for PULPino were introduced. As the objective of this Thesis is related to hardware acceleration, this topic was addressed along with the state-of-the-art interconnect networks related to it.

Chapter 3

Hardware/Software Co-design

This chapter presents the proposed hardware/software architecture, which, from a high-level system view, is composed of PULPino and several attached accelerators. PULPino, the processing unit made available as open source, was adapted and modified to accommodate attachable accelerators on the existing AXI bus. The chapter starts with a presentation of the main protocol used (AXI) and its main sub-components. In order to introduce the hardware developed under the goal of this Thesis, the overall system architecture is then presented, complemented by a detailed description of the software and hardware interfaces. Furthermore, the architecture of each accelerator is explained.

3.1 AXI Protocol

This section introduces the specifications, communication protocols and general operation of the AXI protocol, which allows the main system components in PULPino to be memory mapped, making it possible to access those components from the core with simple load/store instructions. AMBA AXI4, following its antecedent AXI3, is an open standard specified by ARM. It facilitates the connection and management of functional blocks in SoC designs, especially in those with a large number of controllers and peripherals. AXI4 was introduced with the Advanced Microcontroller Bus Architecture (AMBA)-4 in 2011, keeping backwards compatibility with the previous AXI3 specification. It is now a de facto standard for embedded processors, being royalty-free and having well-documented specifications. AMBA facilitates the way design blocks connect to each other, encouraging modular systems that do not depend on a particular technology and can be reused across different systems and applications, while maintaining high-performance and low-power communication.

In essence, AMBA integrates a set of protocol specifications [44]:

• Advanced eXtensible Interface(AXI)4 - Further explained in Section 3.1.

• AXI Coherency Extensions (ACE) - Extends AXI with additional coherency support that allows multiple processors to share memory in a coherent manner.

• Advanced High-performance Bus (AHB) - supports larger bus widths (64/128 bit) with full-duplex operation; AXI4 offers increased performance over it.

• Advanced Peripheral Bus (APB) - designed to interface with peripherals that have simple interfaces and low-power profiles. It provides low-bandwidth control accesses using a reduced-complexity signal list derived from AHB.

AXI4 has several subsets, namely AXI-full, AXI-lite and AXI-Stream. These may interface with each other through bridges and protocol converters. AXI-full targets high-performance, high-clock-frequency system designs, mainly featuring support for: multiple region interfaces; quality-of-service signaling; unaligned data transfers using data bursts; separate control and data phases; separate read and write data channels; simultaneous reads and writes; multiple outstanding addresses; and out-of-order transactions. All AXI transactions operate on the basis of a valid/ready signal handshake. The source of the data asserts the valid signal when it has valid data to transfer. Once the destination is ready to receive data, it asserts the ready signal. When both valid and ready are asserted, data is transferred between source and destination. Another useful signal is the last signal (tlast in AXI-Stream), responsible for marking the last data beat of a burst transaction. The protocol operates in a master-slave paradigm, meaning that each end of the connection is required to be either a master or a slave. It uses 5 different channels: read address, read data, write address, write data and write response. It is capable of bursts of up to 256 beats, meaning that it allows 256 individual data transfers in a single transaction based on the same starting address.
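As a behavioural illustration only (not part of the AXI specification, and with the names Channel, valid, ready and cycle chosen freely), the following C++ sketch models the valid/ready handshake of one channel: a beat is transferred only in a cycle where both signals are asserted.

#include <cstdint>
#include <iostream>

// Minimal behavioural model of one AXI channel.
struct Channel {
    bool     valid = false;   // driven by the source of the data
    bool     ready = false;   // driven by the destination
    uint32_t data  = 0;

    // Returns true when a transfer happens in the current cycle.
    bool cycle() const { return valid && ready; }
};

int main() {
    Channel ch;
    ch.valid = true;  ch.data = 0xCAFEBABE;    // source has valid data
    ch.ready = false;                          // destination not ready: no transfer
    std::cout << "cycle 0: transfer = " << ch.cycle() << '\n';
    ch.ready = true;                           // destination ready: beat is accepted
    std::cout << "cycle 1: transfer = " << ch.cycle() << '\n';
    return 0;
}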

AXI-Lite is a lighter implementation of the AXI-full protocol. It uses the same 5 channels, although bursts are not allowed, and each transfer is limited to a data width of 32 or 64 bits. AXI-lite is suited for simple implementations that do not have high bandwidth requirements; it is typically used to set up and control components that require configuration before use. The ease of configuration is related to the fact that AXI-Lite supports from 4 to 512 individually addressable slave registers, each of which may be written to or read from. With its simpler implementation comes a smaller footprint, advantageous in ultra-low-power systems. Despite its reduced performance, it is possible to bridge back to AXI-full, the more complex specification that PULPino uses, allowing interaction between both protocol variants and easily bridging between low- and high-throughput systems.

AXI-Stream uses a single data channel in which the data flows in one direction only, from master to slave. It is designed for applications that require high-bandwidth data transfers and low latency. Since it only has a data channel, it does not require addresses to carry out data transactions; these transactions start once the slave and the master are ready to receive and to send data, respectively. It supports unlimited burst lengths, making it suited for applications that require data streaming. To indicate the end of a transaction, it uses the tlast signal, asserted on the last word sent over the data channel.

AXI-full is used in the current PULPino design to connect components such as the debug unit and the SPI slave. Other components, like the instruction and data memories, the core and the peripherals, make use of AXI-full indirectly, since they require a bridge to translate between different interfaces and to communicate through an AXI interconnect (further detailed in Section 3.1.1). On the other hand, AXI-lite, being a much simpler and lighter protocol, is responsible in the design proposed in this Thesis for handling the communication between core and accelerator. One of the reasons AXI-Stream is not used in the proposed design is that AXI-lite provides a more generic interface, supporting both configuration and a "stream" of data. AXI-Stream is mainly suited for data streaming, and its lack of individually addressable slave registers makes it less desirable for configuring components. For instance, an AXI-lite slave register can be used to provide a synchronization mechanism between processor and accelerator.

3.1.1 AXI Interconnect

PULPino integrates multiple components using an interconnect network supported by the AMBA standard. As depicted in Fig. 3.1, it uses a main AXI interconnect block and a bridge to the Advanced Peripheral Bus (APB) to connect peripherals, both featuring 32-bit wide data channels. It also includes an advanced debug unit that enables access to both RAMs, the core registers and memory-mapped IOs via JTAG.

Figure 3.1: PULPino's SoC block diagram [45].

PULPino has its components connected through an AXI interconnect block, which allows all of them to be mapped into a memory space, providing a homogeneous view of the system. A memory map has been defined, in which all the components have user-configurable address spaces, as shown in Fig. 3.2. This map may be extended by the user, adding new address ranges to incorporate additional memory-mapped hardware components. For instance, if an accelerator needs to be added, a new set of addresses (start and end) is configured in the AXI interconnect configuration sources. These components are ideally meant to be as plug-and-play as possible and to have a "universal" interface, towards the goal of being easily deployed in any given IP.

Figure 3.2: PULPino's memory map [45].

The interconnect block provides a way for multiple masters and slaves to be connected to several blocks at once. A typical implementation of an AXI interconnect provides clock conversion mechanisms, data width conversions and even FIFOs if necessary. To route all the slaves and masters together it has a central crossbar. The routing mechanism may be configured to be based on an address space for each component (used in the presented design) or on a solution as simple as round-robin. PULPino has a custom AXI interconnect block, whose design objectives are the following [44]:

• Suitable for high-bandwidth and low-latency designs;

• Operation at high-frequency without using complex bridges;

• Meet the interface needs of a wide range of components;

• Suitable for memory controllers that have high initial access latency;

• More flexible implementation of interconnect architectures;

• Compatibility with previously existing APB and AHB interfaces.

Figure 3.3: AXI4 node overview for an NxM interconnect node [46].

The axi_node is a SystemVerilog soft IP, defined as the top-level source file for the AXI4 crossbar. Fig. 3.3 depicts the block diagram of an NxM interconnect node, mainly composed of four parts [46]:

1. Slices: optional buffers inserted on each master/slave port. They provide buffering and cut the critical paths that could lead to timing degradation;

2. Request trees: one for each target port, containing the arbitration tree, request decoders and error management;

3. Response trees: one for each initiator port, including the arbitration tree and the decoders for responses;

4. Configuration block: used to map the initiator regions (memory mapped). It can also implement a first layer of protection, although this is not used in PULPino, by limiting the connectivity from a target port to an initiator port.

The initial memory map is defined at the instantiation of the axi_node, by configuring a set of parameters:

• NB_MASTER: number of memory-mapped master components;

• NB_SLAVE: number of memory-mapped slave components;

• {start_addr_i, end_addr_i}: arrays of 32-bit start/end addresses; e.g., if NB_MASTER = 4 then four pairs of start and end addresses must follow. As another example, for the data memory start_addr_i = 0x0010_0000 and end_addr_i = 0x0010_8000, as shown in Fig. 3.2.

3.2 Overall System Architecture

This section addresses the proposed overall system architecture, based on PULPino. In order to improve PULPino's energy efficiency, hardware acceleration was provided in a loosely-coupled manner. The following sections describe how the acceleration was implemented, allowing the execution time of a given task to be reduced. The goal is to develop a generic plug-n-play interface so that accelerators can easily be attached to the core (Section 3.2.1), without having to adapt the interface each time a different one is introduced. The processor interacts with the accelerator through simple load/store instructions, which were also used as processor-accelerator synchronization mechanisms, further detailed in Section 3.2.2. The work developed under the scope of this Thesis was to attach accelerators that interface with the processor through the implemented AXI connection.

Figure 3.4: PULPino with attached accelerators block diagram.

The difference between the original PULPino architecture and the newly developed one can be noticed by comparing Figures 3.1 and 3.4. The accelerator interfaces with an AXI-lite to AXI-full converter, as shown in Fig. 3.4. The AXI-full interface of the converter is connected to PULPino's AXI interconnect block, which handles the communications coming from the processor. When the processor issues a load/store to the accelerator's memory space, the AXI interconnect block performs all the required address decoding to redirect the read/write to it. It is possible to have more than one accelerator at a time, by defining a separate address space for each one.

3.2.1 Hardware interface

The hardware interface between the accelerator and the remaining system requires only the standard AXI signals to operate, which was one of the main design goals. The accelerator block has a single AXI-lite port as interface, as depicted in Fig. 3.4. This block behaves like a wrapper that integrates the kernel and contains all the hardware required to handle the interface between it and the AXI-lite slave registers. This integration may consist, for instance, of storing the processor's incoming data and feeding it into the kernel's inputs according to its timing requirements. The same applies to handling the kernel's output data, afterwards read by the processor. All the kernel's control signals are provided as well. The AXI-Lite interface that connects to the accelerator has the following configuration:

• Address width: 2 bits, addressing 4 slave registers of 32 bits each;

• Data width: 32-bit;

• Read/Write mode. In this mode both read and write channels are enabled.

The data width is restricted to 32 bits, since the processor only supports 32-bit operations. The number of registers used was a design choice, with the goal of using the minimum hardware possible while still meeting the functionality required by the interface. Once the kernel has been properly integrated within the wrapper, the accelerator block is ready to be attached to the AXI-lite to AXI-full protocol converter block, which allows each accelerator to be connected to the AXI interconnect block and therefore memory mapped and addressable by the processor. Each accelerator to be plugged in has to have an independent memory region configured in the AXI interconnect block and its own protocol converter, as depicted in Fig. 3.4. The design option of giving each accelerator its own converter might seem a poor choice for an energy-efficient SoC; however, this is a lightweight protocol converter that corresponds to less than 1% of the total hardware of the PULPino platform. Both the converter and the accelerator are instantiated in PULPino's core_region.sv top module file. This module instantiates all core-related components: data and instruction memories, the RISC-V core and its debug module. To connect the accelerator to the converter, a new AXI4 slave bus was instantiated, which contains all the signals required to connect two AXI-full interfaced components. The AXI slave port of the converter is connected to a master AXI port of the AXI interconnect block; on the other hand, the accelerator's AXI-lite slave port connects to the AXI-lite master port of the converter. Once all these requirements are met, the accelerator is ready to be plugged into the converter and connected to the AXI interconnect block. In essence, this architecture aims to deliver an infrastructure with a well-defined interface between processor and accelerator, encouraging the open-source community to develop a set of PULPino-compatible accelerators and creating an easy way for developers to test new kernels on the go, reducing the development time and complexity of such hardware systems.

3.2.2 Software Interface

The processor needs to interact with the accelerator in order to benefit from its capabilities. This is done through load/store instructions addressing the memory region assigned to it. In essence, the processor needs to send the data to be processed and afterwards fetch the computed results. The processor also needs to know when the results are ready to be fetched, so a synchronization mechanism is required. To fulfil these requirements, a processor-accelerator communication protocol was defined, based on reads/writes to the available AXI-lite slave registers. Table 3.1 provides an overview of each register and its functionality depending on the type of operation (read/write).

Table 3.1: AXI-Lite slave register write/read map functionalities.

Register    Write        Read
slv_reg0    Reset        Done
slv_reg1    Input Data   N.A.
slv_reg2    Last Data    N.A.
slv_reg3    Optional     Result

For instance, when the user wants to reset the accelerator, a write to its first address (for demonstration purposes, let us assume the first address is 0) must be performed, sending the hexadecimal value 0x01010101. This is the first step before starting to send new data into the accelerator. Then, the address should be increased by one unit and the data streamed into address 1. The last data value should be sent to address 2, which indicates to the accelerator that this is the last data input. Once the computation is done, the output values are available to be read from address 3. The processor may start to fetch results when the Done signal, represented by the hexadecimal value 0xdeadbeef, is read from address 0. This method requires the processor to check this register in polling mode. An alternative is to associate the Done signal with an interrupt vector, which would raise a flag once the accelerator's computation is done, redirecting the program counter to the proper interrupt service routine and handling the output data as desired. Although this was not implemented, it is a strong recommendation for future work: from the energy-efficiency point of view, it would allow the processor to stay asleep, in a low-power operating mode, while it waits for the computation, waking up upon the interrupt. The feature of stepping out of the sleep state when an interrupt is triggered is available on PULPino, as mentioned in Section 2.2. An optional functionality for slv_reg3 was added to fill a need that some accelerators might have; for instance, it can be used to pass a configuration value or information about the input data, which happens in the SHA-3 accelerator as detailed in Section 3.3.
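From the processor's side, this protocol reduces to a handful of volatile loads/stores. The following C++ sketch walks through the reset, data, last-data, polling and result-fetch steps; the base address ACC_BASE, the register offsets and the run_accelerator helper are illustrative assumptions, since the real addresses depend on the memory region configured in the AXI interconnect.

#include <cstdint>

// Hypothetical base address of one accelerator's AXI-lite region.
#define ACC_BASE 0x00200000u

static volatile uint32_t *const slv_reg0 = (volatile uint32_t *)(ACC_BASE + 0x0); // write: reset   / read: done
static volatile uint32_t *const slv_reg1 = (volatile uint32_t *)(ACC_BASE + 0x4); // write: input data
static volatile uint32_t *const slv_reg2 = (volatile uint32_t *)(ACC_BASE + 0x8); // write: last data
static volatile uint32_t *const slv_reg3 = (volatile uint32_t *)(ACC_BASE + 0xC); // write: optional / read: result

// Streams 'len' input words into the accelerator and reads back 'res_len' result words.
void run_accelerator(const uint32_t *in, int len, uint32_t *res, int res_len) {
    *slv_reg0 = 0x01010101;               // step 1: reset the accelerator
    for (int i = 0; i < len - 1; i++)
        *slv_reg1 = in[i];                // step 2: stream the input data
    *slv_reg2 = in[len - 1];              // step 3: mark the last input word
    while (*slv_reg0 != 0xdeadbeef)       // step 4: poll the Done value
        ;
    for (int i = 0; i < res_len; i++)
        res[i] = *slv_reg3;               // step 5: fetch the results
}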

3.3 Hardware Accelerators

The integration of open-source kernels into one top-level accelerator is one of the main targets of this Thesis, intended to demonstrate its energy-efficiency impact on the overall system operation. The kernels were chosen with the first criterion of being open source, to match the kind of licensing provided by PULPino's team. This way, the hardware developed in this Thesis may contribute to PULPino's open-source community and keep the framework growing. The kernels used in the accelerators presented next were not internally modified; based on their interfaces, control specifications and timing requirements, the required hardware was developed to integrate each kernel within the accelerator block. The developed hardware that wraps the kernel inside the accelerator acts as a bridge between the accelerator's AXI-lite interface and the kernel's interface. In the following subsections, two accelerators are proposed. Due to the kernels' different requirements, different hardware designs were needed for each, although the main structure is the same, given their common requirement of streaming input data on each clock cycle. The processor cannot meet such a requirement, since each read/write through AXI-lite takes 10/11 clock cycles, respectively. Therefore, additional hardware is needed to accommodate the input data from the processor and then feed it into the kernel. Apart from this requirement shared by both kernels, the control signals to operate them are different.

SHA-3

In any kind of modern computer system, security is an extremely important feature. The most basic security or authentication systems use hashing algorithms. These take a stream of data as input and return a fixed-size hash for that specific input message. Such hash functions are required to generate unique hashes for any given message; the message cannot be recovered from the hash and the hash should be easy to compute [35]. NIST has standardized the Secure Hash Algorithm (SHA), SHA-3 being the most recent one in the family. It is a cryptographic hash function, originally known as Keccak, developed after successful attacks on MD5 and SHA-0 and theoretical attacks on SHA-1. A cryptographic kernel was chosen due to the recurring importance of security in low-power IoT devices, which still represents a challenge for today's common applications. A SHA-3 accelerator aims to provide a faster computation of the hash function with reduced energy consumption on the target low-power processor. The hardware implementation of this function exploits parallelism, which would not be possible with the single-core processor under analysis. The input message may take any size, while the output length remains the same; the output may take the lengths n ∈ {224, 256, 384, 512} bits. The current implementation has the highest security level, 512 bits, among all SHA-3 variants. Keccak is based on the sponge construction approach, which, through random permutation functions, allows any amount of input data, leading to great flexibility. The padding of a message M to a sequence of 576-bit blocks is denoted by M || pad[576](|M|). It makes use of multi-rate padding, denoted by pad10*1, appending a single 1 bit, followed by the number of 0 bits needed, followed by a single 1 bit, such that the length of the result is a multiple of the block length (576 in the current design) [47]. The number of 576-bit blocks of P is denoted by |P|_576, and the i-th block of P by P_i. The number of blocks determines the number of times the permutation f is executed, as shown in Algorithm 1.

Algorithm 1 The sponge construction [3]
1: procedure SPONGE
2:   Interface: Z = SHA-3-512(M), M ∈ Z2*, Z ∈ Z2^512
3:   P = M || pad[576](|M|)
4:   s = 0^1600
5:   for i = 0 to |P|_576 - 1 do
6:     s = f(s ⊕ (P_i || 0^(1600-576)))
7:   end for
8:   return ⌊s⌋_r
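As a sanity check of the pad10*1 rule used in line 3 of the algorithm, the small C++ sketch below pads a bit string to a multiple of the 576-bit rate. It is only an illustration of the rule stated above, not the kernel's implementation, and the function name pad10star1 is chosen freely.

#include <vector>
#include <cstddef>
#include <iostream>

// pad10*1: append a 1 bit, then the minimum number of 0 bits, then a final 1 bit,
// so that the total length becomes a multiple of the rate r (576 bits here).
std::vector<bool> pad10star1(std::vector<bool> msg, std::size_t r = 576) {
    msg.push_back(true);                       // leading 1
    while ((msg.size() + 1) % r != 0)          // zeros until one bit short of a block
        msg.push_back(false);
    msg.push_back(true);                       // trailing 1
    return msg;
}

int main() {
    std::vector<bool> m(44 * 8, false);        // a 44-byte (352-bit) message, as used in the later tests
    std::cout << pad10star1(m).size() << " bits\n";  // prints 576, i.e. exactly one block
    return 0;
}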

The kernel was developed by Homer Hsing and is available on the OpenCores website under the Apache license (version 2) [3]. The code is FPGA-vendor independent and fully optimized. It uses only one clock domain, without any latches. Capable of computing a 512-bit hash result in 29 clock cycles, it is based on a padding module followed by a permutation module, as shown in Fig. 3.5.

Figure 3.5: SHA-3 kernel overview architecture [3].

The input is limited to a width of 32 bits. Since this is far less than 576 bits, the padding module, whose architecture is shown in Fig. 3.6, uses a buffer to assemble the user input. When this buffer reaches maximum capacity, the permutation module is notified that a valid buffer output is ready and starts the calculation. The buffer is then cleared and the padding module waits once again for new input.

Figure 3.6: SHA-3 padding module’s architecture [3].

After the padding is complete, its output becomes the permutation block's new input, as depicted in Fig. 3.7. The permutation is performed by combinational logic over 24 rounds, with the corresponding round constant selected for each round. A 1600-bit register stores the output, although only 512 bits are selected from it, resulting in the final hash value.

Figure 3.7: SHA-3 permutation module's architecture [3].

While the permutation module is computing the current input, the padding module is preparing the next one. The permutation takes 24 clock cycles, meaning that the padding must get the next 576 bits ready in time. The kernel presents the input/output ports shown in Table 3.2. To start computing a hash value, the core must be reset by holding the reset signal synchronously high during one clock cycle. This procedure must be repeated for every new hash computation. For instance, if the kernel computed SHA-3-512("FooBar"), it should be reset before computing SHA-3-512("XPTO"). The padder uses the input signal byte_num in the last input block, which indicates how many bytes that input holds, since the message M may not be a multiple of 32 bits. If the last input block is reduced to 1 byte, it should be aligned to the most significant byte. Letting "A" be a 1-byte message, the input signal in should be driven as in[31:24] = "A". Notice that if the message length is a multiple of the in width, an additional zero-length input block should be provided. For example, let the input be:

in = "ABCD", is_last = 0

Then set: is_last = 1, byte_num = 0
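The following C++ sketch illustrates how a driver could split an arbitrary-length ASCII message into 32-bit in words and derive is_last/byte_num according to these rules, including the extra zero-length block when the length is a multiple of 4 bytes. The packing assumed here places the first byte in the most significant position, as in the 1-byte example above; feed_kernel is a hypothetical stand-in for driving the kernel's ports.

#include <cstdint>
#include <cstdio>
#include <string>

// Stand-in for driving the kernel's in / is_last / byte_num ports
// (in the real design this is handled by the accelerator wrapper).
static void feed_kernel(uint32_t in, bool is_last, unsigned byte_num) {
    std::printf("in=0x%08x is_last=%d byte_num=%u\n", in, (int)is_last, byte_num);
}

// Splits an ASCII message into 32-bit words, first character in bits [31:24].
static void send_message(const std::string &msg) {
    std::size_t full_words = msg.size() / 4;
    unsigned    tail       = unsigned(msg.size() % 4);    // bytes in a partial last word

    for (std::size_t w = 0; w < full_words; w++) {
        uint32_t word = 0;
        for (unsigned b = 0; b < 4; b++)
            word |= uint32_t(uint8_t(msg[4 * w + b])) << (24 - 8 * b);
        feed_kernel(word, false, 0);                      // not the last block
    }

    if (tail != 0) {                                      // partial word, MSB-aligned
        uint32_t word = 0;
        for (unsigned b = 0; b < tail; b++)
            word |= uint32_t(uint8_t(msg[4 * full_words + b])) << (24 - 8 * b);
        feed_kernel(word, true, tail);                    // is_last = 1, byte_num = tail
    } else {                                              // length multiple of 4 bytes:
        feed_kernel(0, true, 0);                          // extra zero-length last block
    }
}

int main() {
    send_message("ABCD");   // reproduces the example: one full word, then a zero-length last block
    return 0;
}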

Table 3.2: SHA-3 kernel's input/output ports.

Port          Width  Direction  Description
clk           1      In         Clock
reset         1      In         Synchronous, positively asserted reset
in            32     In         Input data
byte_num      2      In         Number of bytes of in
in_ready      1      In         Input is valid or not
is_last       1      In         Current input is the last or not
buffer_full   1      Out        Buffer is full or not
out           512    Out        Hash result
out_ready     1      Out        Result is ready or not

To comply with these requirements, a wrapper to accommodate the SHA-3 kernel was developed. It complies with the specifications provided in Section 3.2 and is based on the data path depicted in Fig. 3.8.

Figure 3.8: SHA-3 accelerator data path.

The design aims for efficiency and reduced hardware usage, using only one FIFO (32 words of 32 bits) and a set of control signals. The input data is redirected from the processor into the kernel as soon as it arrives (every 11 clock cycles). The FIFO acts as a buffer for the event in which the kernel's internal buffer reaches full capacity. If it does, the buffer_full signal is asserted; upon such an event, the input data is held in the FIFO until the kernel is ready to continue the computation. Afterwards, the data is fed into the kernel's input at the previous rate. When the last input word is sent by the processor, the FIFO may still hold data due to previous buffer_full events; such data is streamed into the kernel on every clock cycle after the last input word, which is the one sent to the slv_reg2 AXI-lite register, asserting is_last and selecting the previously received byte_num value. As explained before, the byte_num signal has to follow certain rules, based on the length of the last message block. The user is responsible for this handling, by sending a last dummy input word with the value zero into the slv_reg2 register after the last message block. When the out_ready signal is asserted, the 512-bit hash is made available at the output port, being iteratively fetched by the processor in blocks of 32 bits. This output is redirected to the slv_reg3 register; since it is limited to a 32-bit word, an auxiliary counter is used to iterate through the 512-bit result. It starts by sending the least significant word, iterating 16 times until the most significant word is sent. Afterwards, the accelerator needs to be reset in order to proceed with further computations. Metrics regarding the accelerator performance, obtained through simulation and hardware tests, are presented in Chapter 5.

FFT

The Fast Fourier Transform (FFT) is widely used in DSP and many other application fields, as a fast and more efficient algorithm to compute the Discrete Fourier Transform (DFT). It allows the conversion of a signal from its original domain to the frequency domain. While the DFT has a complexity of O(n^2), the FFT complexity is O(n log n), where n is the data size [48]. Many applications require a highly efficient computation of this operation, often leading to a hardware implementation. The algorithm may be mapped onto many different architectures, depending on the hardware restrictions and performance requirements of the application [43]. This Thesis addresses a low-power processor targeting IoT applications, which often require DSP algorithms, namely FFTs. Therefore, it is important to have a commonly used DSP accelerator such as the one addressed here, aiming at high energy-efficiency and performance. The purpose was not to develop an FFT kernel from scratch, but instead to select the most appropriate one, having a low-power profile and low hardware resource requirements. At the same time, it should be available under an open-source license, as stated before. The FFT kernel was wrapped together with an AXI-lite interface, through which it communicates with the processor. All the additional hardware to bridge between the kernel's and AXI-lite interfaces was designed with the goal of reducing the hardware used, while efforts were made to profit from the kernel's performance. The AXI-Lite interface may limit the input data feed rate, and hence the kernel's throughput, in case the AXI-lite data transactions cannot keep up with it.

Table 3.3: SPIRAL FFT kernel online configuration parameters [2].

Parameter               Value            Range                           Description
Problem specification
transform size          256              4-32768                         number of samples
direction               forward          forward or inverse DFT
data type               fixed point      fixed or floating point
fixed point precision   32               4-32 bits
mode                    unscaled         scaled or unscaled
Parameters controlling implementation
architecture            iterative        iterative or fully streaming
radix                   2                2, 4, 16                        size of DFT basic block
streaming width         2                2-256                           number of complex words per cycle
data ordering           natural in/out   natural or digit-reversed       data order
BRAM budget             -1                                               maximum number of BRAMs (-1 = no limit)
permutation method      JACM'09          JACM'09 [49] or DATE'09 [50]

The chosen FFT kernel was developed by SPIRAL - Software/Hardware Generation for DSP Algorithms [43], which won the ACM TODAES Best Paper Award 2014. SPIRAL provides an online tool [2] for hardware generation, outputting a generated FFT kernel in Verilog according to the chosen parameters. These parameters may take the values specified in Table 3.3; the "Value" column lists the parameters chosen for the kernel used here, the remaining available options (where applicable) are described in the "Range" column, and a short description of each parameter's meaning is given in the last column. A kernel with n = 256 samples was chosen, for the forward DFT defined as:

y = DFT_n x,    DFT_n = [ e^(-2πjkl/n) ]_{k,l = 0,...,n-1}

where y is the n-point output vector and x the n-point input vector. The data type chosen was fixed point, due to the processor's lack of hardware floating-point support. Since the current AXI-Lite configuration is set to 32-bit messages, the fixed-point precision was set to 32 bits, together with the unscaled arithmetic mode.
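For reference, the definition above translates directly into the O(n^2) DFT sketched below in C++; it uses double-precision complex arithmetic rather than the kernel's 32-bit fixed point, and is only a reference formulation, not the hardware implementation.

#include <complex>
#include <vector>
#include <cmath>

// Direct evaluation of y = DFT_n x, with DFT_n = [exp(-2*pi*j*k*l/n)].
std::vector<std::complex<double>> dft(const std::vector<std::complex<double>> &x) {
    const std::size_t n  = x.size();
    const double      pi = std::acos(-1.0);
    std::vector<std::complex<double>> y(n);
    for (std::size_t k = 0; k < n; k++) {
        std::complex<double> acc(0.0, 0.0);
        for (std::size_t l = 0; l < n; l++) {
            double angle = -2.0 * pi * double(k) * double(l) / double(n);
            acc += x[l] * std::complex<double>(std::cos(angle), std::sin(angle));
        }
        y[k] = acc;
    }
    return y;
}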

Figure 3.9: SPIRAL Fast Fourier Transform (FFT) iterative architecture [2].

Figure 3.10: SPIRAL Fast Fourier Transform(FFT) fully streaming architecture [2].

The parameters controlling the implementation are addressed next. Both architectures, iterative and streaming, were tested under the scope of this Thesis, and the developed hardware is prepared to accommodate both. The iterative one is slower than the streaming version, since the data stream has to iterate O(log n) times (n being the size of the DFT to be computed) over a single stage, as shown in Fig. 3.9, and cannot begin a new input vector before the last one is processed. In a streaming architecture the data stream flows in and out of the system continuously: the architecture consists of O(log n) cascaded stages, each composed of computation and data reordering components, as depicted in Fig. 3.10. The radix defines the size of the DFT basic block, controlling the number of points processed in a basic computational block. In the case of an iterative architecture, the problem size n must be a power of the chosen radix, meaning that n = r^k for an integer k; this restriction does not apply to a streaming architecture. The streaming width must be a multiple of the chosen radix. It controls the input data stream width (defined as w in Figs. 3.9 and 3.10); increasing this parameter by a factor of k consequently increases the system's parallelism by a factor of k. As specified in Table 3.3, the data ordering may be selected from natural in/out, natural input/reversed output and reversed input/natural output; in the reversed ordering the MSBs become the LSBs and vice-versa. The "BRAM budget" option allows choosing the maximum number of BRAM blocks when Xilinx FPGAs are targeted, by adding synthesis directives, interpreted by the Xilinx tools, to the generated Verilog code; for an unrestricted number of BRAMs, the value -1 should be used. The permutation method (last line of Table 3.3) defines, as the name indicates, how the permutation is done. Of the two available methods, DATE'09 [50] is patent free, but requires almost twice the amount of SRAM, with reduced performance and higher logic cost. The second method, JACM'09 [49], which is patented (protected by U.S. Patent No. 8,321,823), is based on a different and improved technique [2].

Given the previous configurations for the FFT kernel, a single Verilog file is generated by SPIRAL's online tool [2]. For the kernel's integration in the accelerator, it is instantiated within a wrapper, interfacing with the processor through AXI-Lite. This integration was done with additional hardware, whose data path is depicted in Fig. 3.11, acting as a bridge between the FFT kernel and the AXI-Lite interface. The resulting top-level hardware block is denominated the accelerator.

Figure 3.11: FFT accelerator's data path.

The input data is redirected from the AXI-Lite slave registers, which contain the data sent by the processor, based on the protocol defined in Section 3.2.2. The input data width, defined by w, may be 32 or 16 bits, since the kernel is able to compute both types of data; the hardware design only implements the 32-bit word width option. The input data is redirected by the demultiplexer, at the moment it arrives, into a buffer register. Its purpose is to store the data until it has been completely filled, since the processor sends one 32-bit word per communication. The register's data width corresponds to 2*N*w, with N being the FFT kernel's number of complex inputs, each composed of a real and an imaginary part, individually w bits wide. After the buffer is full, its data is stored in the FIFO; thus, the FIFO has the same data width as the buffer and a data depth related to the number of input samples, as defined in the online tool [2] explained before. After the processor is done sending all the data samples, the FIFO will be full and ready to stream all the stored data into the FFT kernel. The need for a streamed input on each clock cycle after the kernel's next input signal is asserted is the main reason to include a FIFO memory in the design, since the AXI-Lite interface can only provide one new message from the processor every 11 clock cycles. After the first kernel input, it takes a known number of clock cycles until the computed output is ready, defined by the latency, which depends on the kernel configuration chosen in [2]. Upon assertion of the next_out signal by the kernel, its output is streamed into the same FIFO. The signal sel_in_fifo selects the FIFO's input between the data to be computed and the output results to be stored. Similarly, the signal sel_out_fifo multiplexes the FIFO's output data, redirecting it either to the kernel's input or to the output multiplexer operated by the sel_out_reg control signal. This last mux on the data path splits the FIFO's 2Nw-bit output word into smaller words of width w, compliant with the supported AXI-Lite slave register width. The hardware design whose data path is shown in Fig. 3.11 is ready to handle different FFT kernel configurations, with minor adjustments to the accelerator's input parameters and to the kernel's component declaration and instantiation. These parameters adjust the data width of several signals, as well as the counters used to control them. All the configuration options of the online generator tool [2] are covered, with the exception of the fixed-point precision, fixed to 32 bits, and the number of samples, fixed to 256. The accelerator's top-module parameters, illustrated with a worked example after the list, are the following:

• FFT_INOUT_NR: number of inputs/outputs that the current kernel configuration can handle.

• DATA_WIDTH_FIFO: the data width of each word stored in the FIFO, calculated as FFT_INOUT_NR * w, with w = 32 bits.

• DATA_DEPTH_FIFO: the data depth of the FIFO, i.e. how many words of DATA_WIDTH_FIFO width are stored in it. The value is the number of input data samples divided by the chosen radix, e.g. samples/radix = 256/2 = 128.
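As referenced above, the short C++ sketch below works out these parameter values for the configuration of Table 3.3, following the formulas stated in the list. FFT_INOUT_NR = 4 is an assumed example value (two complex samples, i.e. four 32-bit words, per transfer); the actual value depends on the generated kernel.

#include <cstdio>

// Worked example of the accelerator's top-module parameters (256 samples, radix 2, w = 32 bits).
constexpr unsigned W               = 32;
constexpr unsigned SAMPLES         = 256;
constexpr unsigned RADIX           = 2;
constexpr unsigned FFT_INOUT_NR    = 4;                   // assumed: 2 complex samples per cycle
constexpr unsigned DATA_WIDTH_FIFO = FFT_INOUT_NR * W;    // 4 * 32  = 128 bits per FIFO word
constexpr unsigned DATA_DEPTH_FIFO = SAMPLES / RADIX;     // 256 / 2 = 128 FIFO words

int main() {
    std::printf("DATA_WIDTH_FIFO = %u bits, DATA_DEPTH_FIFO = %u words\n",
                DATA_WIDTH_FIFO, DATA_DEPTH_FIFO);
    return 0;
}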

3.4 Summary

This chapter introduced the target platform on which the development was done, followed by a description of the AXI protocol, on which the accelerators' interface is based. The developed interface was detailed from both the hardware and the software points of view. The architectures of each accelerator and kernel were presented, as well as the hardware developed to wrap all the components together.

Chapter 4

Implementation and Experimental Work

This chapter presents the basic steps and procedures to set up a working environment to start with PULPino on the chosen development board: from the first step of configuring the system and getting it up and running, to deploying the bitstream into the FPGA and executing a program on the core. Along the development process, simulation was an essential step, so it is also presented how to simulate such a system. During all these development phases some drawbacks and problems were found; they will be detailed and possible solutions discussed. As stated before, PULPino is an open-source project, therefore all the base sources were retrieved from the project's GitHub page [51]. Given that this was its first release, and not yet a mature project, it is expected to have incompatibilities and unsolved issues/bugs. This was one of the main setbacks found: having to deal with a release that was not well documented from a technical point of view (only a very basic user manual and a datasheet are available). Due to the recent release (dating from 2016), there is still a very small active open-source community working with this platform, which hinders the resolution of many of the issues that arise. Regarding the development board, a ZedBoard was chosen to conduct the necessary tests and collect results, as presented in the next section.

4.1 Target Device

The development board was chosen in accordance with the PULPino developers' specifications. PULPino is mainly targeted at RTL simulation and ASICs, although there is also an FPGA version supported on the ZedBoard. The FPGA version is not optimized for performance and efficiency, since it is used mainly for emulation rather than as a standalone platform. The ZedBoard carries a Xilinx Zynq-7000 family All Programmable System on Chip (SoC), XC7Z020-CLG484-1. This device series enables extensive system-level integration and flexibility through its main hardware, software and I/O programmability. Most of its internal system-level components have GUI configuration tools available, which contribute to reducing development time and easing debugging, by auto-generating the required source code from the user's hardware specification/requirements.

Figure 4.1: Xilinx Zynq-7000 SoC block diagram overview [52]

In Fig. 4.1, the main components of the Zynq device range are represented. The main components of this design are the Programmable Logic (PL) and the Processing System (PS). The PL is derived from Xilinx 7-series FPGA technology, namely Artix-7 for the present XC7Z020 device. The integrated PL block is available for the user to deploy custom-designed hardware, as in any other FPGA, through hardware description languages such as Verilog, VHDL or SystemVerilog. The given low-range Zynq-7000 PL features 85,000 programmable logic cells, 53,200 Look-Up Tables (LUTs), 220 Digital Signal Processing (DSP) slices and 4.9 Mb of BRAM memory.

The PS itself features an Application Processing Unit composed of two ARM Cortex-A9 hard cores, with a maximum frequency of 866 MHz on the featured device and capable of 1 GHz on higher-end SoCs. It also features two cache levels, L1 and L2, of 32 KB (for each of the instruction and data caches, per core) and 512 KB, respectively. Additionally, the PS includes 256 KB of on-chip memory (OCM), and each processor has its own FPU. Together with the memory controller, the processors have access to the external 512 MB DDR memory. I/O peripherals, including SPI, CAN, UARTs, I2C, USB and Ethernet interfaces, are also available to the PS through the central interconnect block.

The Zynq-7000 SoC is booted in a multi-stage process that involves the boot ROM and the First-Stage Boot Loader (FSBL). The boot process initializes and cleans up the system, and prepares it to boot from the selected external boot device. Once it is concluded, the FSBL is executed and the system's main components (PS and PL) are configured accordingly, for instance by loading a lightweight operating system into the PS and a bitstream into the PL.

4.2 System Configuration

Prior to PULPino's deployment on the FPGA, some essential system configurations and main steps were taken. The recommended toolchain is Vivado 2015.1 from Xilinx, for synthesis and implementation, and ModelSim 10.2c from Mentor for simulation. To compile the C/C++ source files meant to be executed on the addressed core, a RISC-V toolchain was used. It may be the official RISC-V toolchain provided by the University of California, Berkeley, or the custom one from ETH Zurich. The latter was used, due to its support for all the ISA extensions present in the RI5CY core (see Section 2.2). A dual-core ARM Cortex-A9 is also part of the Xilinx SoC in use, therefore its compilers are also important. Since its software is compiled with the Xilinx SDK, gcc-arm-none-eabi and gcc-arm-linux-gnueabi are required, together with the lib32 libraries. The configuration stages presented below describe the path that must be followed in order to get PULPino up and running, ready to be tested. An external hardware view of the system's block diagram is shown in Fig. 4.2, wherein the Zynq-7000 SoC, on the ZedBoard, communicates with the PC via a serial interface (UART).

Figure 4.2: Implementation block diagram. Communications between PS-PL and PC-PS.

Boot

There are multiple methods for booting a Linux system on a Zynq SoC; here, the SD flash memory method was used. The ARM processor boot is a three-stage process: an internal boot ROM stores stage-0 boot code, which configures the processor and the necessary peripherals to start fetching the First-Stage Bootloader (FSBL) code from the SD card. The card is composed of a root and a boot partition, configured according to the Xilinx specifications [53]. The FSBL is copied to the SoC's on-chip memory and then executed. The FSBL includes all the required initialization code for the peripherals used in the PS, and configures the PL with the bitstream. The third step is to bring the OS into the SoC's memory from the SD card, because when the processor is powered on the memory is empty. This is done by the u-boot bootloader. Apart from loading the Linux buildroot OS, it performs other tasks that the kernel might not be able to do, such as configuring the clock frequency, loading the device tree and enabling boot commands. The loaded buildroot OS has all the necessary drivers and configurations to initialize along with its peripherals, using the hardware description present in the device tree [54].

Generating the Bitstream

Unlike most Vivado projects, this one operates through a makefile that sets up the environment and executes Tcl (Vivado) commands to create the project, add the required sources, set properties and compile order, and synthesize and implement according to an area-optimized strategy. It is composed of two different projects. A top project contains the PS, the AXI buses and the required AXI converters to interface both with the core (via SPI) and with the FPGA GPIO ports. A second project contains PULPino and all its RTL sources, meaning that only this second Vivado project would be needed in the case of a standalone implementation of PULPino on an FPGA. It would facilitate the development/debugging process and allow the use of Vivado's full block-design potential if this second project could be added to the top one as a design block; however, due to an incompatibility between Vivado's block design and PULPino's sources, this is not possible. Consequently, Vivado's debug tools and automatic block connection features (one of its remarkable development advantages) are not available, translating into much more difficult and time-consuming problem solving and development.

Deploy the Bitstream

Once the bitstream is generated, it can be programmed into the FPGA in at least two different ways: on Processing System (PS) boot, loading the bitstream file from the SD card and programming the FPGA while u-boot/Linux is booting up; or using the Xilinx XMD tool on the PC the board is connected to. In the latter case, the command "fpga -f bitstream.bit" uploads it into the FPGA. Note that if this method is chosen, the core needs to be reset before uploading any pre-compiled program into it. The reset should be done by issuing the spiload script on PULPino (via serial port), loading an empty stimulus file into the memories, which forces the core to reset.

Execute a Program on PULPino

When the bitstream is deployed and the core reset, it is ready to receive the proper stimulus files, which contain all the data to be uploaded into the memories. Once the C/C++ code is compiled, a stimulus file is generated containing each memory address and the value to be stored at that address. This file may be uploaded to the FPGA either by saving it directly on the SD card or by connecting to the board via ssh and using secure copy (scp). Once the file is on the board, a script (spiload) running on the PS loads the stimulus file into the FPGA. It not only loads the file via SPI into the memories, but also defines the boot address, resets the core, and listens to the core outputs, which are redirected to the Linux stdout. This process could also be performed "manually", step by step, using a JTAG debug tool that connects to the internal AXI bus and has access to the whole address space. However, a working version compatible with this core version is not available; such a tool exists only for previous versions of the core (e.g. the or1k or or10n core versions).

Simulation

An important phase of development is testing an application in a controlled environment, in which it is easier to debug and to verify its inner workings, for example through waveform analysis. ModelSim 10.2c is the default platform on which PULPino was tested. All the simulation scripts (available with the PULPino project) were conceived to fit this platform's requirements and interface. They reproduce the behavior of PULPino as if it were running on the ZedBoard, meaning that they either load the stimulus file over SPI, reproducing the behavior of spiload (see Section 4.2), or load the stimuli directly into the memory. The loading method can be chosen by presetting the ModelSim MEMLOAD argument. If there is a need to simulate PULPino in another simulation engine, all the simulation files, including the required libraries and simulation sources, need to be redesigned to fit the new simulator's requirements. For instance, the simulation sources of PULPino are not compatible with the Vivado simulator, mainly due to some SystemVerilog features that are not supported (e.g. dynamic arrays), making it difficult and time consuming to port them to new simulation platforms or hardware description languages.

The simulation environment is built using CMake. A bash script located in the sw folder of PULPino's git repository needs to be configured with the paths to the RI5CY toolchain, ModelSim and the PULPino git directory, and to enable the use of compressed instructions. The next step is compiling all the RTL libraries using ModelSim, which is done with the previously generated makefile. In this step, many compilation errors may appear if the ModelSim version is not exactly the recommended one; even with the same version, some extra features might not be enabled by the license, also leading to compilation errors. Once all the libraries are ready, the simulation can be launched by issuing, for instance, "make helloworld.vsim", which opens the ModelSim GUI. Some faulty ModelSim RTL sources and libraries were detected when optimization is enabled; therefore, it is not recommended to use optimizations during simulation, as they may lead to errors and untrustworthy results.

When a post-synthesis or post-implementation simulation is needed, a set of possible approaches was tested:

1. Use of the Vivado simulator. This method requires adapting the SystemVerilog simulation sources and testbench, since the tool does not support all kinds of constructs, e.g. dynamic data structures.

2. Generate a post-synthesis/implementation netlist to be simulated in ModelSim. A netlist contains all the hardware blocks (LUTs, DSPs, etc.) and the connections between them, from which the synthesized/implemented schematic is generated. Compiling the netlist in ModelSim requires all of Vivado's Unisim libraries to be properly imported into ModelSim; these libraries contain all the hardware elements used by Vivado during netlist generation. However, some Unisim elements are not compatible with the ModelSim compiler. Even if all the libraries compile without errors, this method only allows verifying the correct functioning of the post-synthesis/implementation design. During a development phase, if there is a need to debug the generated hardware in simulation, this method is not recommended, since all the nets have names that were automatically generated by the tool and do not resemble the original, user-defined ones.

3. Using ModelSim as the default simulator in Vivado. After synthesizing/implementing the project, it is possible to simulate it directly from Vivado, using ModelSim to compile and run PULPino's simulation sources.

4.3 New AXI Interconnect Slave

To tackle the challenge of attaching a new accelerator to the AXI interconnect bus through an AXI-lite to AXI-full converter, without any automatic configuration tools (such as the ones Vivado provides), a new RAM memory was first attached, in order to test communications with the core by issuing load/store instructions. It interfaces with the AXI-full specification in a new custom address region, through a wrapper that translates between AXI-full and the RAM read/write interface. The wrapper used is the same one that comes with PULPino's data memory, wherein the protocol conversion is already implemented. The objective is to test the custom connections made at the core's top-level design, in which the components are instantiated and connected to each other. It was necessary to choose a free AXI interconnect address space region to house the new memory, in accordance with PULPino's memory map previously presented in Fig. 3.2. The address region chosen is next to the existing data memory: from 0x00108100 to 0x00110100.

Listing 4.1: AXI Interconnect Instantiation in System Verilog

axi_node_intf_wrap
#(
    .NB_MASTER      ( 4 ),
    .NB_SLAVE       ( 3 ),
    .AXI_ADDR_WIDTH ( `AXI_ADDR_WIDTH ),
    .AXI_DATA_WIDTH ( `AXI_DATA_WIDTH ),
    .AXI_ID_WIDTH   ( `AXI_ID_MASTER_WIDTH ),
    .AXI_USER_WIDTH ( `AXI_USER_WIDTH )
)
axi_interconnect_i
(
    .clk          ( clk_int    ),
    .rst_n        ( rst_n_int  ),
    .test_en_i    ( testmode_i ),

    .master       ( slaves  ),
    .slave        ( masters ),

    .start_addr_i ( { 32'h0010_8100, 32'h1A10_0000, 32'h0010_0000, 32'h0000_0000 } ),
    .end_addr_i   ( { 32'h0011_0100, 32'h1A11_FFFF, 32'h0010_7FFF, 32'h0008_FFFF } )
);

The memory map is defined with start and end address vectors, as shown in Listing 4.1. Associated with each region is a new AXI master bus; therefore, the number of masters is defined by the parameter NB_MASTER = 4, matching the number of defined address regions. These buses connect the AXI interconnect to the remaining components attached to the core (instruction/data memories, peripherals and other additional components). The newly instantiated memory was tested using GDB as a debug tool, issuing writes and reads on the defined address range, proving functionality and validating the configurations and new components. Based on these tests, the work moved on to the next stage of adding an AXI converter, which would need to be capable of converting from AXI-full to AXI-lite, supporting communications between processor and accelerator.
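A memory test of the kind performed here can also be run from the RISC-V core itself, as in the sketch below. The region boundaries are the ones configured in Listing 4.1; the stride and the XOR pattern are arbitrary illustrative choices, and spot-checking a few addresses with GDB, as described above, serves the same purpose.

#include <cstdint>

// Simple write/read-back test over the newly mapped RAM region (0x00108100 - 0x00110100).
#define TEST_RAM_START 0x00108100u
#define TEST_RAM_END   0x00110100u

int test_new_ram(void) {
    volatile uint32_t *p;
    int errors = 0;

    // Write an address-derived pattern with a coarse stride (every 64th word).
    for (p = (volatile uint32_t *)TEST_RAM_START; p < (volatile uint32_t *)TEST_RAM_END; p += 64)
        *p = (uint32_t)(uintptr_t)p ^ 0xA5A5A5A5u;

    // Read the pattern back and count mismatches.
    for (p = (volatile uint32_t *)TEST_RAM_START; p < (volatile uint32_t *)TEST_RAM_END; p += 64)
        if (*p != ((uint32_t)(uintptr_t)p ^ 0xA5A5A5A5u))
            errors++;

    return errors;   // 0 means every tested location stored and returned its value
}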

4.4 New Accelerator

Following the previous work in Section 4.3, the same procedure of defining a memory region applies when attaching a new accelerator to the AXI interconnect; a top-level overview is shown in Figure 3.4. The accelerator is instantiated in the core_region.sv file, wherein all the core-related hardware blocks also reside (debug unit, data and instruction memories, protocol converters, memory multiplexers and the RISC-V core). In the same core region, the accelerator, holding an AXI-lite slave interface, is connected to the master interface of the AXI-full to AXI-lite converter. This converter is based on the Vivado AXI protocol converter block, although it was optimized to fit the requirements of PULPino's AXI-full interface, complying with the protocol specifications already in use. All the additional compatibility with AXI3 (see Section 3.1) was removed, hence all AXI communications are based on the AXI4 specification. The new accelerator takes all the AXI-lite signals as inputs and has all the necessary logic for them. Additionally, the kernel block needs to be instantiated and any further necessary hardware added, to be able to work with the AXI-lite mode of operation (using slave registers). The additional hardware added to the addressed accelerators is covered in the SHA-3 and FFT subsections of Section 3.3.

4.5 Summary

In the previous sections of this chapter, all the steps required to set up the environment in which PULPino was tested were presented. After system configuration and boot, the bitstream must be generated and deployed into the FPGA before programs can be uploaded and executed on it. Additionally, this chapter detailed the approach taken to deploy a new AXI interconnect slave/accelerator.

Chapter 5

Experimental Results

This chapter presents the experimental results obtained from the tests performed on PULPino, with and without the hardware accelerators addressed in this Thesis. Both algorithms, FFT and SHA-3, are under analysis. The goal is to compare the performance of the software and hardware implementations, drawing conclusions on the attainable speedup and energy savings. Moreover, energy consumption and efficiency are also compared.

5.1 Software vs Hardware

To measure the speedup that a hardware accelerator provides over a software-only implementation, test benches were set up to verify the performance and energy efficiency of both implementations. This section presents the software-only algorithms (which do not require an accelerator) and the ones that interact with the accelerators, used to compute SHA-3 and FFT. The software-only algorithms were adjusted to operate under conditions similar to the accelerators', thus allowing a fair comparison between hardware and software implementations. Both accelerators were synthesized and implemented with the same tools and optimization strategy: "Flow Area Optimized High" for synthesis and "Area Explore" for implementation. These strategies were chosen by PULPino's developers as the most adequate for the release under analysis, taking into account that the purpose is to operate under the restricted power envelopes of the IoT domain, while still meeting the computational requirements.

5.1.1 SHA-3

The algorithm used to implement SHA-3 in software was based on the implementation in [55] (Appendix A), written in C++. It was compiled using gcc with -O3 optimization and all the available instruction extensions enabled, and configured with the same number of permutation rounds and a 512-bit hash output, matching the hardware accelerator kernel configuration. The base input test message used was "The quick brown fox jumps over the lazy dog ". Having a size of 44 bytes when translated from ASCII to binary, it is a well-known pangram (a sentence that includes all the letters of the alphabet), commonly used as a test message for different kinds of hash and encryption algorithms. To test the accelerator, the algorithm shown in Listing 5.1 was developed. It interfaces with the accelerator according to the specifications given in Section 3.2.2. In this case, the optional functionality of slv_reg3 was used to send the byte_num configuration value needed by the SHA-3 kernel. The AXI-lite register addresses are defined at the beginning of Listing 5.1; the address region was set as described in Section 4.3. The program starts by resetting the accelerator at line 14, followed by the byte_num value written to slv_reg3 at line 16. Then it is ready to start sending the input message NR_MSG times (set for test-bench purposes), in the loop starting at line 19. The last input word is sent to slv_reg2; in this case it is a dummy word, because the byte_num value is equal to zero (see Section 3.3). After the last input is sent, the program waits for the computation to finish (line 36). When the output hash is ready to be fetched, the value of slv_reg0 equals 0xdeadbeef. Finally, the resulting 512-bit hash is read from slv_reg3. To acquire the number of clock cycles the algorithm takes to finish, a timer incrementing at every clock cycle was used, reset and started at the beginning of the algorithm (lines 11 and 12) and stopped at its end (line 42). Afterwards, code to print the outputs, for instance to the serial port, may be added for debugging or user-interface purposes.

Listing 5.1: SHA-3 interface with accelerator C++ code

 1  #define SLV_REG0 0x00200000
 2  #define SLV_REG1 0x00200004
 3  #define SLV_REG2 0x00200008
 4  #define SLV_REG3 0x0020000C
 5  #define NR_MSG   10
 6
 7  void main() {
 8      volatile int *axi_lite_reg = (volatile int *)(SLV_REG0);
 9      unsigned int aux[16];
10
11      reset_timer();
12      start_timer();
13
14      *axi_lite_reg = 0x01010101;                    // reset: write to slv_reg0
15      axi_lite_reg = (volatile int *)(SLV_REG3);
16      *axi_lite_reg = 0x00000000;                    // byte_num = 0 (message length is a multiple of 4 bytes)
17      axi_lite_reg = (volatile int *)(SLV_REG1);     // point to slv_reg1 to write the input data
18
19      for (int k = 0; k < NR_MSG; k++) {             // send the message NR_MSG times
20          *axi_lite_reg = 0x54686520;                // "The "
21          *axi_lite_reg = 0x71756963;                // "quic"
22          *axi_lite_reg = 0x6b206272;                // "k br"
23          *axi_lite_reg = 0x6f776e20;                // "own "
24          *axi_lite_reg = 0x666f7820;                // "fox "
25          *axi_lite_reg = 0x6a756d70;                // "jump"
26          *axi_lite_reg = 0x73206f76;                // "s ov"
27          *axi_lite_reg = 0x65722074;                // "er t"
28          *axi_lite_reg = 0x6865206c;                // "he l"
29          *axi_lite_reg = 0x617a7920;                // "azy "
30          *axi_lite_reg = 0x646f6720;                // "dog "
31      }
32      axi_lite_reg = (volatile int *)(SLV_REG2);     // last write
33      *axi_lite_reg = 0x00000000;                    // dummy write when byte_num = 0
34
35      axi_lite_reg = (volatile int *)(SLV_REG0);
36      while ((unsigned)*axi_lite_reg != 0xdeadbeef); // wait for computation completion
37
38      axi_lite_reg = (volatile int *)(SLV_REG3);     // read from slv_reg3
39      for (int j = 0; j <= 15; j++)                  // fetch the 512-bit hash
40          aux[j] = *axi_lite_reg;
41
42      stop_timer();
43  }
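The 32-bit constants written inside the loop of Listing 5.1 are simply the ASCII bytes of the test message packed four at a time, first character in the most significant byte. A small sketch of how such constants can be derived is shown below; pack_words is an illustrative helper, not part of the test code.

#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

// Packs an ASCII string into 32-bit words, first character in bits [31:24],
// matching the values used in Listing 5.1 (message length is a multiple of 4).
std::vector<uint32_t> pack_words(const std::string &msg) {
    std::vector<uint32_t> words;
    for (std::size_t i = 0; i + 3 < msg.size(); i += 4)
        words.push_back((uint32_t(uint8_t(msg[i]))     << 24) |
                        (uint32_t(uint8_t(msg[i + 1])) << 16) |
                        (uint32_t(uint8_t(msg[i + 2])) <<  8) |
                         uint32_t(uint8_t(msg[i + 3])));
    return words;
}

int main() {
    for (uint32_t w : pack_words("The quick brown fox jumps over the lazy dog "))
        std::printf("0x%08x\n", w);   // 0x54686520, 0x71756963, ..., 0x646f6720
    return 0;
}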

A set of tests were performed with different sized messages defined by NR MSG, always with the same text of the message before but replicated up to 10 times. The hashes were checked in every different message, to verify the well functioning of both systems (with and without acceleration). After executing the tests to evaluate the amount of clock cycles required for both implementations, with a single 40MHz clock (maximum frequency on FPGA), is possible to verify the speedup in Figure 5.1. Through an analysis of the graph can be concluded, that a significant speedup is achieved over a

Figure 5.1: SHA-3 computation speedup using hardware accelerator. Multiple message sizes were tested.

From an analysis of the graph, it can be concluded that a significant speedup is achieved over a software-only implementation by adding a dedicated hardware accelerator. A speedup of 104 times is achieved for a 44-byte message, meaning that the hardware-accelerated implementation is 104 times faster than the software-only one, and a speedup of up to 185 times is observed for a 440-byte message, in comparison to the non-accelerated version. The speedup tends to increase with the length of the input message, as the linear regression line indicates.
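For reference, the speedup values quoted here (and in the FFT analysis that follows) correspond to the ratio between the number of clock cycles taken by the software-only run and by the hardware-accelerated run of the same workload:

\[
\mathrm{Speedup} = \frac{\mathit{ClkCycles}_{\mathit{SW\text{-}only}}}{\mathit{ClkCycles}_{\mathit{accelerated}}}
\]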

5.1.2 FFT

A well-known algorithm was used to test the performance of the FFT in software-only mode, namely the Cooley-Tukey FFT, whose implementation, written in C++, was based on the algorithm in [56] (Appendix A). As with the SHA-3 case presented previously, it was compiled with the top optimization level and all of PULPino's instruction extensions enabled. It requires only the stock implementation of PULPino to be executed, since no hardware acceleration is involved in this software-only test bench. Unfortunately, the hardware presented in Section 3.3, implementing the FFT accelerator, is only fully functional in simulation. Due to the lack of debugging tools, it was only possible to conclude that a problem is present in the multiplier units: after mapping the design into hardware, these units were not outputting the most significant 32 bits of the 64-bit result of a 32x32-bit multiplication. Despite the malfunction of this unit, the hardware was properly mapped after synthesis and implementation, making it possible to experimentally evaluate the performance and efficiency of the FFT accelerator. Both speedup and power consumption results are valid and comparable with the SHA-3 accelerator, since they rely on the same tools, optimization settings and platform. In order to test the software-only and accelerated algorithms that implement the FFT, a data set of 256 complex samples was defined as input. Since a complex number is composed of a real and an imaginary part, the total number of inputs amounts to 512 values of 32 bits each, corresponding to a total of 2Kbytes of input and output data.
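As a concrete reference for this data set, the sketch below mirrors the interleaved real/imaginary layout used by the software-only implementation in Appendix A.2 and sent word by word to the accelerator in Listing 5.2; the helper name load_complex_input is illustrative and not part of the original sources.

#define NINPUTS 256                 /* complex samples, as in Appendix A.2 */

int buf[2 * NINPUTS];               /* 512 x 32-bit words = 2 Kbytes */

/* Interleave the real and imaginary parts of each sample into buf[],
 * which is the buffer consumed by fft() and streamed to slv_reg1. */
void load_complex_input(const int *re, const int *im) {
    for (int i = 0; i < NINPUTS; ++i) {
        buf[2 * i]     = re[i];     /* real part of sample i      */
        buf[2 * i + 1] = im[i];     /* imaginary part of sample i */
    }
}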

The chosen FFT kernel, integrated in the accelerator as detailed in Section 3.3, has several possible configurations. This allows a deeper analysis of different setups, establishing a term of comparison between them and identifying which one is more beneficial, either from an energy-efficiency or from a performance point of view. The defined configuration has fixed and variable parameters. The fixed ones are the problem specification (see Table 3.3):

• Transform size: 256 input complex data samples;

• Direction: Forward DFT;

• Data type: Fixed Point;

• Fixed point precision: 32-bit;

• Mode: unscaled.

The variable parameters control the translation of the FFT algorithm into the kernel's Verilog file. Multiple setups were tested by changing these parameters, and their experimental results are presented further on. To test the FFT accelerator, the algorithm shown in Listing 5.2 was developed, allowing the core to interface with it. It can be decomposed into four main sections. First, it resets the accelerator at line 8; then it sends the inputs to slv_reg1 and the last input to slv_reg2, at lines 11 and 15, respectively. Next, it waits for the computation to complete after sending the last input, at line 18. Finally, the results are fetched from slv_reg3 at line 21. The algorithm is completely independent of the hardware configuration of the accelerator.

Listing 5.2: C++ code interfacing with the FFT accelerator

 1  void main() {
 2      volatile int *axi_lite_reg = (volatile int *)(ADDR0);
 3      int aux[512];
 4
 5      reset_timer();
 6      start_timer();
 7
 8      *axi_lite_reg = 0x01010101;                // reset doing a write on slv_reg0
 9      axi_lite_reg = (volatile int *)(ADDR1);    // write on slv_reg1
10
11      for (int i = 0; i < 511; i++)
12          *axi_lite_reg = buf[i];
13
14      axi_lite_reg = (volatile int *)(ADDR2);    // last write
15      *axi_lite_reg = buf[511];
16
17      axi_lite_reg = (volatile int *)(ADDR0);
18      while ((*axi_lite_reg) != 0xdeadbeef);     // wait for computation completion
19
20      axi_lite_reg = (volatile int *)(ADDR3);    // fetch results from slv_reg3
21      for (int j = 0; j <= 511; j++)
22          aux[j] = *axi_lite_reg;
23
24      stop_timer();
25  }

Tests were performed with different parameters, by changing the architecture and the radix (see Table 3.3). When the radix is increased, the streaming width is automatically increased to match the number of input words it requires. The architecture may vary between the iterative and the streaming versions (Section 3.3). Figure 5.2 shows the speedup achieved by adding the FFT accelerator. The software-only FFT algorithm takes a total of 38126 clock cycles, which is the basis for the speedup calculations; Figure 5.2 presents the ratio between the number of clock cycles required to compute the FFT algorithm on PULPino with and without hardware acceleration. It is noticeable from Figure 5.2 that the speedup of the stream version overcomes the iterative one, although it is achieved at an extra hardware cost. The analysis of extra hardware versus power consumption is presented further on, in Section 5.2. Less significant speedup results were achieved with the FFT when compared to the previous SHA-3 analysis. This is in part due to the higher amount of compressed instructions used in the FFT algorithm, in comparison with the SHA-3 one. The FFT software implementation executes a total of 39458 instructions, of which 34179 are compressed, i.e. 87% compressed instructions. On the other hand, the SHA-3 software-only algorithm has only 26% compressed instructions, with 64299 compressed out of a total of 251677 instructions. RISC-V compressed instructions are claimed to reduce code size while also improving performance and energy-efficiency [57].

Figure 5.2: FFT computation speedup using hardware accelerator, for multiple radices implemented in iterative or stream mode.

5.2 Power Efficiency

This section addresses how the hardware accelerators developed under the scope of this Thesis influence the power efficiency of the whole system (PULPino + accelerators). It intends to show that, with the use of hardware accelerators, both performance and energy consumption can be improved. Adding hardware implies an increase of fabric area on ASICs or more logic resources on an FPGA. For applications that directly benefit from the acceleration, it can significantly reduce the computation time, allowing the processor to enter sleep mode at an earlier stage. Most IoT embedded systems, which PULPino targets, can directly benefit from this improvement, since they usually operate in a run-to-halt mode, in which all the required computation is done as fast as possible before entering sleep mode. For hardware acceleration to be energy efficient, the energy saved by entering sleep mode sooner needs to overcome the extra static energy cost the accelerators bring. Measuring the real power consumption of the presented systems is not a simple task. Due to restrictions of the required development platform (ZedBoard), it is not possible to measure the real power consumption of the FPGA fabric alone, only to estimate it using the available tools, explained further ahead. To overcome this issue, a development board able to power the FPGA chip with an external power supply would be needed, as well as real-time control over the FPGA's package temperature, due to its influence on power consumption measurements [58]. With such hardware restrictions, the setup used to measure the system's power consumption relies on Modelsim as the simulation tool. Modelsim records the activity of all active signals of the application under analysis, which is compiled into a Switching Activity Interchange Format (SAIF) file. In Vivado, the hardware is synthesized and implemented, and the simulation runs over the implemented hardware, which corresponds to the hardware effectively mapped into the board's

FPGA. In this simulation, the SAIF file is loaded to improve its accuracy, providing average switching activity information about the active signals. More information about the simulation tool and the justification for choosing Modelsim as the default simulation tool of Vivado was previously detailed in Section 4.2. The following sections present the results of the several power consumption tests performed on the hardware accelerators developed under the scope of this Thesis. The goal was to measure their power consumption in the attainable states of operation, such as different sizes of encrypted messages for SHA-3 or multiple radix/architecture configurations for the FFT, and to conclude which one is the most power efficient regarding computation time, static and dynamic power consumption and overall computation energy, by tracing energy-savings graphs that support this analysis. Some of the graphs presented in the next sections do not contain results for all three frequencies (40MHz, 20MHz and 5MHz), because some of them are similar and do not add relevant curves.
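To make the run-to-halt argument above slightly more concrete, a deliberately simplified energy balance can be written for a task that repeats with period \(T\); this formulation is only illustrative and is not part of the measurement methodology used in this Thesis:

\[
P_{\mathit{active}}^{\mathit{acc}} \, t_{\mathit{acc}} + P_{\mathit{idle}}^{\mathit{acc}} \,(T - t_{\mathit{acc}}) \;<\; P_{\mathit{active}}^{\mathit{sw}} \, t_{\mathit{sw}} + P_{\mathit{idle}}^{\mathit{sw}} \,(T - t_{\mathit{sw}})
\]

Here \(t_{\mathit{acc}} < t_{\mathit{sw}}\) are the computation times with and without the accelerator, and the idle powers differ by the accelerator's extra static power; whenever the inequality holds, the accelerated system consumes less energy per period despite the added hardware.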

5.2.1 SHA-3

The energy efficiency of the SHA-3 accelerator was tested by performing multiple encryptions with different message sizes. They were performed on PULPino implementations with and without the hardware accelerator; the software-only implementation contains no additional hardware, in order to obtain the most accurate power measurements. The initial message length is 44 bytes, corresponding to the commonly used sentence referenced before in Section 5.1.1, and the next test messages correspond to the initial message replicated and concatenated up to 10 times. Figure 5.3 depicts the power measurement results obtained from the computations described previously.

Figure 5.3: PULPino with SHA-3 accelerator: computation energy with and without hardware acceleration, combined with the achieved energy ratio (SW/HW) at 5MHz.

It presents the computation energy, which corresponds to the total amount of energy required to perform the message encryption. Notice that the total energy required by PULPino with the SHA-3 accelerator is multiplied by a factor of 10 in order to be perceptible in the graph; otherwise it would not be visible next to the software-only computation energy. Both were calculated according to Equation 5.1:

\[
\mathit{ComputationEnergy} = \frac{1}{\mathit{Freq}} \times \mathit{ClkCycles} \times \mathit{Power} \tag{5.1}
\]

In this equation, Freq corresponds to PULPino's main operating frequency, ClkCycles to the number of clock cycles required by the processor to conclude the computation, and Power to the on-chip power consumption estimated by Vivado. This power estimate is composed of two main components, dynamic and static power. The dynamic power originates from the logic switching activity, while the static power represents the power consumed by the FPGA logic when no signals are toggling. For PULPino with the SHA-3 accelerator, the dynamic and static on-chip power consumption values are 189mW and 123mW, respectively. On the stock version of PULPino, the dynamic and static values are lower, 47mW and 121mW, respectively, since it does not include the additional hardware the accelerator brings. Nevertheless, the additional static power is very small, adding only 2mW, which translates into a 1.7% increase in static power; this is the amount of extra power the accelerator consumes when idle.

Figure 5.4: PULPino with SHA-3 accelerator: energy saved at 40MHz, 20MHz and 5MHz of main clock frequency, for different encrypted message sizes.

Since the hardware does not vary with the message size, there is no need to collect new on-chip power consumption data for each computation (corresponding to each bar in Figure 5.3). Consequently, the power figures presented before are the same throughout the calculations for the hardware-accelerated and software-only results. The right vertical axis of Figure 5.3 represents the energy ratio, depicted by the solid black line on the graph, which corresponds to the number of times the energy required for the computation was reduced by using the accelerator.

In this 5MHz showcase, it is reduced by up to 160 times when the message length is 440 bytes. As the tendency line of the energy ratio indicates, the energy savings tend to increase with the length of the message, leading to more efficient usage of the SHA-3 accelerator as the message length increases. The same pattern is common to the results at all remaining frequencies, although the maximum achieved energy ratio varies: at 20MHz a maximum energy ratio of 114 times was achieved, while at 40MHz it only goes up to 100 times. All of these values correspond to an input message of 440 bytes. Figure 5.4 depicts the energy savings achieved by using the SHA-3 accelerator, showing how much energy, in percentage, is saved in comparison with the stock version of PULPino, without hardware acceleration. From the obtained results, the lowest energy savings start at 98.23% for a 44-byte message at 40MHz, going up to 99.39% at 5MHz with a message size of 440 bytes. At lower frequencies the energy savings are higher, starting with a difference of 0.68% between the highest and lowest frequencies for a message size of 44 bytes. This delta tends to decrease to 0.39% for 440-byte messages, i.e. as the message size increases. Thus, the operating frequency tends to have less impact on the energy saved for longer messages, meaning that "long" messages can be computed faster, by increasing the main clock frequency, with less impact on the energy savings.
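As a quick cross-check of these figures, the energy ratio and the percentage of energy saved are related by a simple expression; the sketch below illustrates it (the helper names are illustrative, and the only numeric input, the ratio of 160, is the value already quoted above).

#include <stdio.h>

/* Equation 5.1: energy in joules, given frequency [Hz], cycle count and power [W]. */
double computation_energy(double freq_hz, double clk_cycles, double power_w) {
    return (1.0 / freq_hz) * clk_cycles * power_w;
}

/* Percentage of energy saved for a given SW/HW energy ratio. */
double energy_saved_pct(double energy_ratio) {
    return (1.0 - 1.0 / energy_ratio) * 100.0;
}

int main(void) {
    /* The 5MHz, 440-byte SHA-3 case: an energy ratio of 160 (as quoted above)
     * gives (1 - 1/160) * 100 = 99.375%, in line with the reported 99.39%. */
    printf("energy saved: %.2f%%\n", energy_saved_pct(160.0));
    return 0;
}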

5.2.2 FFT

For PULPino with the FFT accelerator, the power results were obtained in a similar manner to the SHA-3 ones, with a slight difference: instead of varying the input data size as before, the same input was tested on multiple configurations of the FFT accelerator's architecture. More precisely, on radix 2 and radix 4, using both the iterative and stream architectures (more details in Section 3.3). Although the speedup results presented previously also include figures for the radix 16 architecture, it was not possible to successfully translate it into hardware on the FPGA, due to problems similar to those stated in Section 5.1.2 regarding the faulty multiplier unit mapped by Vivado. Nevertheless, the power results for radix 2 and 4 are sufficient to draw conclusions on the energy efficiency of PULPino with the attached FFT accelerator. The power results for PULPino without hardware accelerator correspond to the configuration in which no extra hardware was added. Figure 5.5 shows the total on-chip power consumption in mW, divided into two parts: dynamic and static power. Each column of the graph represents a different hardware configuration computing the same input data with the FFT algorithm. The SW-only column corresponds to the version in which no accelerator is attached to PULPino, while all the remaining columns come from the multiple architectures of the FFT accelerator attached to PULPino. As can be seen from the graph in Figure 5.5, the accelerated version always consumes more on-chip power than the SW-only version, which is explained by the additional hardware required by the accelerator. Stream architectures, being more resource-hungry than the iterative ones, have a higher overall on-chip power consumption, while also achieving superior speedups, as shown in Section 5.1.2. This increase in total on-chip power is mainly due to dynamic power, since static power varies significantly less, as shown in Figure 5.6.

Figure 5.5: PULPino with FFT accelerator: dynamic and static on-chip power consumption at 40MHz.

Even though the variations are relatively small when compared with the dynamic power, the increase of static power in the stream architectures over the iterative ones is noticeable. The use of an FFT accelerator translates into a maximum increase of 5mW of static power, which corresponds to 4% of the total on-chip static power of PULPino without accelerators. This means that when the processor is idle, a maximum increase of only 4% in power consumption would occur, which is an acceptable figure for a system targeting IoT embedded devices with restricted power envelopes, which usually operate in a run-to-halt mode and may consequently spend a considerable amount of time in idle or sleep mode. The purpose of attaching accelerators to PULPino is to enhance its power efficiency. Accordingly, Figure 5.7 presents the power saved, in percentage, by using the FFT accelerator when computing an FFT algorithm, achieving a maximum saving of 66% with the radix 2 iterative architecture configuration. Testing the same architectures as before at multiple frequencies allows analyzing which one tends to save more energy, combined with the computation time the algorithm takes to finish execution. The "optimal" mode of operation is achieved when the computation time is minimum and the energy saving is maximum. Thus, if the simple ratio between these two results is calculated for every column of the graphs, it is possible to point out which configuration that is. Therefore,

\( \frac{\mathit{EnergySaved}}{\mathit{ComputationTime}} = 1.75 \), obtained with the radix 2 iterative configuration at 40MHz, is the highest ratio among all tested setups. Of all the configurations, this is the one that requires the fewest FPGA resources; thus, at the higher frequency of 40MHz, it seems to be the best configuration for balancing energy savings and computation time. Even so, this might not be the best configuration for every embedded application, since each one has its own energy restrictions and computation-time requirements. The graph in Figure 5.7 can be a useful guide in finding out which architecture and frequency best fit a certain set of system requirements.

Figure 5.6: PULPino with FFT accelerator: static on-chip power consumption at 40MHz.

Figure 5.8 portrays a graph that combines computation energy and energy ratio, calculated from the same results used in the previous graphs, with the same input data at 40MHz of main clock frequency. The computation energy figures are based on the same Equation 5.1 presented in Section 5.2.1. These results also confirm that the FFT accelerator architecture that saves the most energy is the radix 2 iterative version, as can be seen by analyzing the energy ratio line, which corresponds to the ratio between the energy consumption (in the same graph) of the software-only and hardware-accelerated versions.

5.3 Summary

In this chapter, the analysis of the experimental results was presented, starting with the speedup achieved through the hardware accelerators. All the results refer to the SHA-3 and FFT accelerators developed under the scope of this Thesis. Afterwards, a power efficiency analysis of these accelerators was presented, in which conclusions were drawn about their power consumption and overall impact on the energy-efficiency of PULPino.

Figure 5.7: PULPino with FFT accelerator: energy saved vs computation time.

Figure 5.8: PULPino with FFT accelerator: computation energy vs energy ratio (SW/HW) at 40MHz.

Chapter 6

Conclusions and Future Work

In conclusion, the initial goal of boosting the energy efficiency of PULPino for applications in embedded IoT devices operating within restricted power envelopes was successfully accomplished in this Thesis. The improvements were achieved by attaching two different hardware accelerators, namely a cryptographic SHA-3 accelerator and a digital signal processing FFT accelerator.

In order to successfully attach accelerators and deal with the heterogeneity between accelerator and processor, a custom low-power AXI-lite based interface was developed. It has the advantage of providing a simple, plug-n-play way for current and future accelerators to interface with the processor, encouraging the development of new attachable accelerators by the open-source community, since PULPino was released under an open-source license. Consequently, it saves development time through the reuse of hardware designs, paving the way for more modular embedded systems, in which it is possible to add the most suitable accelerator for a certain final application. Under the scope of this Thesis, two accelerators were attached and evaluated for speedup and energy efficiency: SHA-3 and FFT. Speedups of 185 times on the SHA-3 algorithm and 3 times on the FFT were achieved, with higher speedups attainable as the input data size increases, as shown by the presented tendency lines. As stated, the cryptographic algorithm presents itself as the more suitable for acceleration in comparison to the FFT. This can be explained by the low percentage of compressed instructions that the non-accelerated SHA-3 algorithm translates into, only 26% against 87% for the non-accelerated FFT algorithm; apart from that, the type of processing performed in SHA-3 is more suitable for acceleration in hardware than the FFT. RISC-V compressed instructions are claimed to reduce code size while enhancing energy-efficiency and performance [57]. Regarding energy savings, the SHA-3 and FFT accelerators can save up to 99.39% and 66% of energy, respectively. For the FFT accelerator, an "optimal" point of operation was proposed, setting it to a radix 2 iterative configuration at the maximum attainable frequency of 40MHz; among the several tested configurations, this one obtained the best ratio between energy savings and the amount of time required to compute the FFT algorithm. Other modes of operation might be better suited, depending on the energy

requirements of the target application.

Regarding future work, there is always room for improvement in the current AXI-lite interface, which connects the accelerator to the main AXI interconnect bus. Other kinds of accelerators might benefit from additional control signals or other custom features that could be added to this interface. Apart from AXI-lite, there are other kinds of buses that could improve data communication between the accelerator and the AXI interconnect bus, such as AXI-Stream, which has advantages for data streaming but lacks individual control registers. This could be overcome by implementing a more complex communication protocol over the data stream, achieving higher data throughput while still being able to control the accelerator without any external signals. In this Thesis, the processor is the one fetching the required data from the memories into the accelerator over the AXI bus. Nevertheless, higher data interchange could be achieved by adding Direct Memory Access (DMA) functionality to the accelerator, allowing it to access the data directly from memory. Despite making it possible to achieve higher data throughput on what is usually a known bottleneck, it also brings additional hardware, which might increase the overall power consumption of a system targeting low-power applications. Another possible improvement, for future work, is to enable the control signals received from the accelerator to trigger intrinsic PULPino interrupts, meaning that the processor would not have to operate in polling mode. With interrupts, when the external signal is received, a flag is raised and the proper interrupt service routine is executed. As PULPino already provides such a feature for its peripherals, a similar implementation could be developed for the new attachable accelerators, leaving room for further improvements in its overall energy-efficiency.
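A minimal sketch of what this interrupt-driven wait could look like is shown below, replacing the busy-polling loop of Listings 5.1 and 5.2. The handler name, the completion flag and the use of the RISC-V wfi instruction are illustrative assumptions; the actual hook-up would depend on how the accelerator's completion signal is routed to PULPino's interrupt controller.

#include <stdint.h>

static volatile uint32_t accel_done = 0;

/* Hypothetical interrupt service routine, bound to the accelerator's
 * completion line through PULPino's event/interrupt unit. */
void accelerator_isr(void) {
    accel_done = 1;
}

/* Replacement for "while (*axi_lite_reg != 0xdeadbeef);": the core sleeps
 * until an interrupt arrives instead of continuously polling the bus. */
static inline void wait_for_accelerator(void) {
    while (!accel_done)
        asm volatile ("wfi");   /* wait-for-interrupt: low-power stall */
    accel_done = 0;
}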

References

[1] S. Davis, K. Holland, J. Yang, M. A. Fury, and L. Shon-Roy. The era of iot advancing cmp consum- ables growth. In International Conference on Planarization/CMP Technology (ICPT), 2015.

[2] DFT/FFT IP Core Generator, 2017. URL http://www.spiral.net/hardware/dftgen.html.

[3] H. Hsing. SHA3 Core Specification, 2013.

[4] M. Alioto. Ultra Low Power Design Approaches for IoT. In HOTCHIPS, 2015.

[5] M. Gautschi, P. D. Schiavone, A. Traber, I. Loi, A. Pullini, D. Rossi, E. Flamand, F. K. Gurkaynak, and L. Benini. A near-threshold risc-v core with dsp extensions for scalable iot endpoint devices. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2016.

[6] D. Rossi, F. Conti, A. Marongiu, A. Pullini, I. Loi, M. Gautschi, G. Tagliavini, P. Flatresse, and L. Benini. Pulp: A parallel ultra-low-power platform for next generation iot applications. In HOTCHIPS, 2015.

[7] F. Conti, D. Rossi, A. Pullini, I. Loi, and L. Benini. Energy-efficient vision on the PULP platform for ultra-low power parallel computing. In Proceedings of the 2014 IEEE Workshop on Signal Processing Systems, Piscataway, NJ, 2014. IEEE.

[8] F. Conti, D. Rossi, A. Pullini, I. Loi, and L. Benini. Pulp: A ultra-low power parallel accelerator for energy-efficient and flexible embedded vision. Journal of Signal Processing Systems, 84(3):339– 354, 2016. ISSN 1939-8115. doi: 10.1007/s11265-015-1070-9. URL http://dx.doi.org/10. 1007/s11265-015-1070-9.

[9] M. Rusci, D. Rossi, M. Lecca, M. Gottardi, L. Benini, and E. Farella. Energy-efficient design of an always-on smart visual trigger. In 2016 IEEE International Smart Cities Conference (ISC2), pages 1–6, Sept 2016. doi: 10.1109/ISC2.2016.7580824.

[10] M. Gautschi, M. Schaffner, F. K. Gürkaynak, and L. Benini. 4.6 A 65nm CMOS 6.4-to-29.2pJ/FLOP@0.8V shared logarithmic floating point unit for acceleration of nonlinear function kernels in a tightly coupled processor cluster. In 2016 IEEE International Solid-State Circuits Conference (ISSCC), pages 82–83, Jan 2016. doi: 10.1109/ISSCC.2016.7417917.

[11] Y. Popoff, F. Scheidegger, M. Schaffner, M. Gautschi, F. K. Gürkaynak, and L. Benini. High-efficiency logarithmic number unit design based on an improved cotransformation scheme. In 2016 Design, Automation Test in Europe Conference Exhibition (DATE), pages 1387–1392, March 2016.

[12] F. Conti and L. Benini. A ultra-low-energy convolution engine for fast brain-inspired vision in mul- ticore clusters. In 2015 Design, Automation Test in Europe Conference Exhibition (DATE), pages 683–688, March 2015. doi: 10.7873/DATE.2015.0404.

[13] A. Pullini, F. Conti, D. Rossi, I. Loi, M. Gautschi, and L. Benini. A heterogeneous multi-core system- on-chip for energy efficient brain inspired vision. In 2016 IEEE International Symposium on Circuits and Systems (ISCAS), pages 2910–2910, May 2016. doi: 10.1109/ISCAS.2016.7539213.

[14] F. Conti, D. Palossi, A. Marongiu, D. Rossi, and L. Benini. Enabling the heterogeneous accelerator model on ultra-low power microcontroller platforms. In 2016 Design, Automation Test in Europe Conference Exhibition (DATE), pages 1201–1206, March 2016.

[15] PULP - An Open Parallel Ultra-Low-Power Processing-Platform, 2016. URL http:// iis-projects.ee.ethz.ch/index.php/PULP.

[16] D. Rossi, A. Pullini, I. Loi, M. Gautschi, F. K. Gurkaynak,¨ A. Bartolini, P. Flatresse, and L. Benini. A 60 gops/w, -1.8 v to 0.9 v body bias ulp cluster in 28 nm utbb fd-soi technology. Solid-State Electronics, 117:170–184, 2015.

[17] D. Rossi, A. Pullini, I. Loi, M. Gautschi, F. K. Gurkaynak,¨ J. Constantin, A. Bartolini, I. Miro-Panades, E. Beigne,` F. Clermidy, F. Abouzeid, P. Flatresse, and L. Benini. 193 mops/mw @ 162 mops, 0.32v to 1.15v voltage range multi-core accelerator for energy efficient parallel and sequential digital processing. Cool Chips XIX, pages 1–3, 2016.

[18] A. Pullini, F. Conti, D. Rossi, I. Loi, M. Gautschi, and L. Benini. A heterogeneous multi-core system- on-chip for energy efficient brain inspired vision. ISCAS, pages 2–4, 2016.

[19] D. Rossi, I. Loi, F. Conti, G. Tagliavini, A. Pullini, and A. Marongiu. Energy efficient parallel com- puting on the pulp platform with support for openmp. In 2014 IEEE 28th Convention of Electrical Electronics Engineers in Israel (IEEEI), pages 1–5, Dec 2014. doi: 10.1109/EEEI.2014.7005803.

[20] F. Conti, R. Schilling, P. D. Schiavone, A. Pullini, D. Rossi, F. K. Gurkaynak,¨ M. Muehlberghuber, M. Gautschi, I. Loi, G. Haugou, S. Mangard, and L. Benini. An iot endpoint system-on-chip for secure and energy-efficient near-sensor analytics. IEEE Transactions on Circuits and Systems I: Regular Papers, PP(99):1–14, 2017. ISSN 1549-8328. doi: 10.1109/TCSI.2017.2698019.

[21] F. Conti, D. Palossi, R. Andri, M. Magno, and L. Benini. Accelerated visual context classification on a low-power smartwatch. IEEE Transactions on Human-Machine Systems, 47(1):19–30, Feb 2017. ISSN 2168-2291. doi: 10.1109/THMS.2016.2623482.

[22] PULPino: A small single-core RISC-V SoC, 2016. URL iis-projects.ee.ethz.ch/images/d/d0/ Pulpino_poster_riscv2015.pdf.

[23] A. Traber and M. Gautschi. RI5CY: User Manual, 2016.

[24] G.-R. Uh, Y. Wang, D. Whalley, S. Jinturkar, C. Burns, and V. Cao. Techniques for Effec- tively Exploiting a Zero Overhead Loop Buffer, pages 157–172. Springer Berlin Heidelberg, Berlin, Heidelberg, 2000. ISBN 978-3-540-46423-5. doi: 10.1007/3-540-46423-9 11. URL http://dx.doi.org/10.1007/3-540-46423-9_11.

[25] R. B. Lee. Subword parallelism with MAX-2, volume 16, pages 51–59. IEEEMicro, 1996.

[26] Pulp-Platform Documentation, 2017. URL http://www.pulp-platform.org/documentation/.

[27] J. L. Hennessy and D. A. Patterson. Computer Architecture: A Quantitative Approach. The Morgan Kaufmann Series in Computer Architecture and Design. Elsevier Science, San Francisco, CA, USA, 5th edition, 2011.

[28] B. Benton. Ccix, gen-z, opencapi: Overview & comparison. In OPENFABRICS ALLIANCE, 2017.

[29] Gen-Z-Consortium. Gen-Z Overview, 2016.

[30] Y. Shao and D. Brooks. Research Infrastructures for Hardware Accelerators. Synthesis Lectures on Computer Architecture. Morgan & Claypool Publishers, 2015. ISBN 9781627058322. URL https://books.google.pt/books?id=uzEECwAAQBAJ.

[31] Research Infrastructures for Accelerator Centric Architectures, 2017. URL http://accelerator. eecs.harvard.edu/isca14tutorial/isca2014-tutorial-all.pdf.

[32] Xilinx. Vivado Design Suite User Guide - High-Level Synthesis, 2017.

[33] H. K. Rawat and P. Schaumont. Simd instruction set extensions for keccak with applications to sha-3, keyak and ketje. In Proceedings of the Hardware and Architectural Support for Security and Privacy 2016, HASP 2016, pages 4:1–4:8, New York, NY, USA, 2016. ACM. ISBN 978-1-4503- 4769-3. doi: 10.1145/2948618.2948622. URL http://doi.acm.org/10.1145/2948618.2948622.

[34] H. Rawat and P. Schaumont. Vector instruction set extensions for efficient computation of keccak. IEEE Transactions on Computers, PP(99):1–1, 2017. ISSN 0018-9340. doi: 10.1109/TC.2017. 2700795.

[35] C. Schmidt and A. Izraelevitz. A fast parameterized sha3 accelerator. Technical Report UCB/EECS- 2015-204, EECS Department, University of California, Berkeley, Oct 2015. URL http://www2. eecs.berkeley.edu/Pubs/TechRpts/2015/EECS-2015-204.html.

[36] C. Liu, R. Duarte, O. Granados, J. Tang, and J. Andrian. Critical path based hardware acceleration for cryptosystems 1, 2012.

[37] P. Gaydecki and I. of Electrical Engineers. Foundations of Digital Signal Processing: The- ory, Algorithms and Hardware Design. IEE circuits and systems series: Institution of Electri- cal Engineers. Institution of Engineering and Technology, 2004. ISBN 9780852964316. URL https://books.google.pt/books?id=6Qo7NvX3vz4C.

[38] I. Kramberger. DSP acceleration using a reconfigurable FPGA. In Industrial Electronics, 1999. ISIE '99. Proceedings of the IEEE International Symposium on, volume 3, pages 1522–1525 vol.3, 1999. doi: 10.1109/ISIE.1999.797022.

[39] Xilinx. Fast Fourier Transform v9.0 - LogiCORE IP Product Guide, 2015.

[40] Intel. FFT IP Core - User Guide, 2017.

[41] Xilinx. FIR Compiler v7.2 - LogiCORE IP Product Guide, 2015.

[42] Altera. FIR Compiler- User Guide, 2011.

[43] P. A. Milder, F. Franchetti, J. C. Hoe, and M. Püschel. Computer generation of hardware for linear digital signal processing transforms. ACM Transactions on Design Automation of Electronic Systems, 17(2), 2012.

[44] Xilinx. AXI Reference Guide, 2011.

[45] A. Traber and M. Gautschi. PULPino: Datasheet, 2016.

[46] I. Loi. AXI 4 NODE Application note, 2014.

[47] M. P. G. Bertoni, J. Daemen. The Keccak reference, version 3, 2011.

[48] C. Van Loan. Computational Frameworks for the Fast Fourier Transform. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 1992. ISBN 0-89871-285-8.

[49] P. A. Milder, F. Franchetti, J. C. Hoe, and M. Puschel.¨ Hardware implementation of the discrete Fourier transform with non-power-of-two problem size. In International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2010.

[50] M. Püschel, P. A. Milder, and J. C. Hoe. Permuting streaming data using RAMs. Journal of the ACM, 56(2):10:1–10:34, 2009.

[51] PULPino’s Github online repository, 2016. URL https://github.com/pulp-platform/pulpino.

[52] Xilinx. UG585 - Zynq-7000 AP SoC Technical Reference Manual, 2016.

[53] Xilinx’s Tutorial - Prepare Boot Medium, 2016. URL http://www.wiki.xilinx.com/Prepare+Boot+ Medium.

[54] Zynq-7000 All Programmable SoC Software Developers Guide, 2015. URL https://www.xilinx. com/support/documentation/user_guides/ug821-zynq-7000-swdev.pdf.

[55] A baseline Keccak implementation, 2011. URL https://github.com/coruus/saarinen-keccak/ tree/master/readable_keccak.

[56] A Simple and Efficient FFT Implementation in C++, 2017. URL http://www.drdobbs.com/cpp/ a-simple-and-efficient-fft-implementatio/199500857?pgno=1.

[57] A. Waterman. Improving energy efficiency and reducing code size with RISC-V compressed. Master's thesis, EECS Department, University of California, Berkeley, May 2011. URL http://www2.eecs.berkeley.edu/Pubs/TechRpts/2011/EECS-2011-63.html.

[58] R. P. Duarte and C.-S. Bouganis. Arc 2014 over-clocking klt designs on fpgas under process, voltage, and temperature variation. ACM Trans. Reconfigurable Technol. Syst., 9(1):7:1–7:17, Nov. 2015. ISSN 1936-7406. doi: 10.1145/2818380. URL http://doi.acm.org/10.1145/2818380.

Appendix A

Software-only Algorithms

A.1 SHA-3

67 1 // keccak.c 2 // 19-Nov-11 Markku-Juhani O. Saarinen 3 // A baseline Keccak (3rd round) implementation. 4 5 #include "common.h" 6 7 #define KECCAK_ROUNDS 24 8 9 #define ROTL64(x, y) (((x) << (y)) | ((x) >> (64 - (y)))) 10 11 #if __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__ 12 #define __bswap_64(x) \ 13 ( (((x) & 0xff00000000000000ull) >> 56) \ 14 | (((x) & 0x00ff000000000000ull) >> 40) \ 15 | (((x) & 0x0000ff0000000000ull) >> 24) \ 16 | (((x) & 0x000000ff00000000ull) >> 8) \ 17 | (((x) & 0x00000000ff000000ull) << 8) \ 18 | (((x) & 0x0000000000ff0000ull) << 24) \ 19 | (((x) & 0x000000000000ff00ull) << 40) \ 20 | (((x) & 0x00000000000000ffull) << 56)) 21 #elif __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__ 22 #define __bswap_64(x) (x) 23 #else 24 #error Unsupported endianness 25 #endif 26 27 const uint64_t keccakf_rndc[24] = 28 { 29 0x0000000000000001, 0x0000000000008082, 0x800000000000808a, 30 0x8000000080008000, 0x000000000000808b, 0x0000000080000001, 31 0x8000000080008081, 0x8000000000008009, 0x000000000000008a, 32 0x0000000000000088, 0x0000000080008009, 0x000000008000000a, 33 0x000000008000808b, 0x800000000000008b, 0x8000000000008089, 34 0x8000000000008003, 0x8000000000008002, 0x8000000000000080, 35 0x000000000000800a, 0x800000008000000a, 0x8000000080008081, 36 0x8000000000008080, 0x0000000080000001, 0x8000000080008008 37 }; 38 39 const int keccakf_rotc[24] = 40 { 41 1, 3, 6, 10, 15, 21, 28, 36, 45, 55, 2, 14, 42 27, 41, 56, 8, 25, 43, 62, 18, 39, 61, 20, 44 43 }; 44 45 const int keccakf_piln[24] = 46 { 47 10, 7, 11, 17, 18, 3, 5, 16, 8, 21, 24, 4, 48 15, 23, 19, 13, 12, 2, 20, 14, 22, 9, 6, 1 49 }; 50 51 static inline int mod5(int a) { 52 while (a > 9) { 53 int s = 0; /* accumulator for the sum of the digits */ 54 while (a != 0) { 55 s = s + (a & 7); 56 a = (a >> 3) * 3; 57 } 58 a = s; 59 } 60 /* note, at this point: a < 10 */ 61 if (a > 4) a = a - 5; 62 return a; 63 } 64 // update the state with given number of rounds 65 66 void keccakf(uint64_t st[25], int rounds) 67 { 68 int i, j, round; 69 uint64_t t, bc[5]; 70 71 for (round = 0; round < rounds; round++) { 72 73 // Theta 74 for (i = 0; i < 5; i++) 75 bc[i] = st[i] ^ st[i + 5] ^ st[i + 10] ^ st[i + 15] ^ st[i + 20]; 76 77 for (i = 0; i < 5; i++) { 78 t = bc[mod5(i + 4)] ^ ROTL64(bc[mod5(i + 1)], 1); 79 for (j = 0; j < 25; j += 5) 80 st[j + i] ^= t; 81 } 82 83 // Rho Pi 84 t = st[1]; 85 for (i = 0; i < 24; i++) { 86 j = keccakf_piln[i]; 87 bc[0] = st[j]; 88 st[j] = ROTL64(t, keccakf_rotc[i]); 89 t = bc[0]; 90 } 91 92 // Chi 93 for (j = 0; j < 25; j += 5) { 94 for (i = 0; i < 5; i++) 95 bc[i] = st[j + i]; 96 for (i = 0; i < 5; i++) 97 st[j + i] ^= (~bc[mod5(i + 1)]) & bc[mod5(i + 2)]; 98 } 99 100 // Iota 101 st[0] ^= keccakf_rndc[round]; 102 } 103 } 104 105 // compute a keccak hash (md) of given byte length from "in" 106 107 int do_keccak(const uint8_t *in, int inlen, uint8_t *md, int mdlen) 108 { 109 uint64_t st[25]; 110 uint8_t temp[144]; 111 int i, rsiz, rsizw; 112 113 rsiz = 200 - 2 * mdlen; 114 rsizw = rsiz / 8; 115 116 memset(st, 0, sizeof(st)); 117 118 for ( ; inlen >= rsiz; inlen -= rsiz, in += rsiz) { 119 for (i = 0; i < rsizw; i++) 120 st[i] ^= __bswap_64(((uint64_t *) in)[i]); 121 keccakf(st, KECCAK_ROUNDS); 122 } 123 124 // last block and padding 125 memcpy(temp, in, inlen); 126 temp[inlen++] = 1; 127 memset(temp + inlen, 0, rsiz - inlen); 128 temp[rsiz - 1] |= 0x80; 129 130 for (i = 0; i < rsizw; i++) 131 st[i] ^= __bswap_64(((uint64_t *) temp)[i]); 132 133 keccakf(st, KECCAK_ROUNDS); 134 135 #if __BYTE_ORDER__ == 
__ORDER_BIG_ENDIAN__ 136 137 for (i = 0; i < mdlen / 8; i++) 138 ((uint64_t *) md)[i] = __bswap_64(((uint64_t *) st)[i]); 139 140 int remaining = mdlen % 8; 141 for (i = 0; i < remaining; i++) 142 ((uint8_t *) md)[mdlen - remaining + i] = ((uint8_t *) st)[mdlen + remaining - i - 1]; 143 #else 144 memcpy(md, st, mdlen); 145 #endif 146 147 return 0; 148 } 1 #include "common.h" 2 3 typedef struct { 4 int mdlen; 5 char *msgstr; 6 uint8_t md[64]; 7 } test_triplet_t; 8 9 static const test_triplet_t testvec = { 10 11 64, "The quick brown fox jumps over the lazy dog ", { 12 0x07, 0xb8, 0x47, 0x18, 0xDC, 0xBA, 0x3C, 0x74, 13 0x61, 0x9B, 0xA1, 0xFA, 0x7F, 0x57, 0xDF, 0xE7, 14 0x76, 0x9D, 0x3F, 0x66, 0x98, 0xA8, 0xB3, 0x3F, 15 0xA1, 0x01, 0x83, 0x89, 0x70, 0xA1, 0x31, 0xE6, 16 0x21, 0xCC, 0xFD, 0x05, 0xFE, 0xFF, 0xBC, 0x11, 17 0x80, 0xF2, 0x63, 0xC2, 0x7F, 0x1A, 0xDA, 0xB4, 18 0x60, 0x95, 0xD6, 0xF1, 0x25, 0x33, 0x14, 0x72, 19 0x4B, 0x5C, 0xBF, 0x78, 0x28, 0x65, 0x8E, 0x6A } 20 21 }; 22 23 uint8_t md3[64] __sram; 24 25 uint8_t *md __sram = md3; 26 27 28 extern int do_keccak(const uint8_t *in, int, uint8_t *out, int); 29 30 void keccak_test() { 31 // for (int i = 0; i < 4; i++) 32 do_keccak((uint8_t *) testvec.msgstr, strlen(testvec.msgstr), md, testvec.mdlen); 33 } 34 35 void test_setup() { 36 } 37 38 void test_clear() { 39 //for (int i = 0; i < 4; i++) 40 memset(md, 0, testvec.mdlen); 41 } 42 43 void test_run() { 44 keccak_test(); 45 } 46 47 int test_check() { 48 //for (int i = 0; i < 4; i++) 49 if (0 != memcmp(md, testvec.md, testvec.mdlen)) 50 return 0; 51 return 1; 52 } A.2 FFT

71 1 #include "common.h" 2 3 static int wprBase[] __sram = { 4 32767, 32758, 32729, 32679, 32610, 32522, 32413, 32286, 5 32138, 31972, 31786, 31581, 31357, 31114, 30853, 30572, 6 30274, 29957, 29622, 29269, 28899, 28511, 28106, 27684, 7 27246, 26791, 26320, 25833, 25330, 24812, 24279, 23732, 8 23170, 22595, 22006, 21403, 20788, 20160, 19520, 18868, 9 18205, 17531, 16846, 16151, 15447, 14733, 14010, 13279, 10 12540, 11793, 11039, 10279, 9512, 8740, 7962, 7180, 11 6393, 5602, 4808, 4011, 3212, 2411, 1608, 804, 12 0, -804, -1608, -2411, -3212, -4011, -4808, -5602, 13 -6393, -7180, -7962, -8740, -9512, -10279, -11039, -11793, 14 -12540, -13279, -14010, -14733, -15447, -16151, -16846, -17531, 15 -18205, -18868, -19520, -20160, -20788, -21403, -22006, -22595, 16 -23170, -23732, -24279, -24812, -25330, -25833, -26320, -26791, 17 -27246, -27684, -28106, -28511, -28899, -29269, -29622, -29957, 18 -30274, -30572, -30853, -31114, -31357, -31581, -31786, -31972, 19 -32138, -32286, -32413, -32522, -32610, -32679, -32729, -32758, 20 }; 21 22 static int wpiBase[] __sram = { 23 0, 804, 1608, 2411, 3212, 4011, 4808, 5602, 24 6393, 7180, 7962, 8740, 9512, 10279, 11039, 11793, 25 12540, 13279, 14010, 14733, 15447, 16151, 16846, 17531, 26 18205, 18868, 19520, 20160, 20788, 21403, 22006, 22595, 27 23170, 23732, 24279, 24812, 25330, 25833, 26320, 26791, 28 27246, 27684, 28106, 28511, 28899, 29269, 29622, 29957, 29 30274, 30572, 30853, 31114, 31357, 31581, 31786, 31972, 30 32138, 32286, 32413, 32522, 32610, 32679, 32729, 32758, 31 32767, 32758, 32729, 32679, 32610, 32522, 32413, 32286, 32 32138, 31972, 31786, 31581, 31357, 31114, 30853, 30572, 33 30274, 29957, 29622, 29269, 28899, 28511, 28106, 27684, 34 27246, 26791, 26320, 25833, 25330, 24812, 24279, 23732, 35 23170, 22595, 22006, 21403, 20788, 20160, 19520, 18868, 36 18205, 17531, 16846, 16151, 15447, 14733, 14010, 13279, 37 12540, 11793, 11039, 10279, 9512, 8740, 7962, 7180, 38 6393, 5602, 4808, 4011, 3212, 2411, 1608, 804, 39 }; 40 41 void fft(int *data, int len) { 42 43 int max = len; 44 len <<= 1; 45 int wstep = 1; 46 while (max > 2) { 47 int *wpr = wprBase; 48 int *wpi = wpiBase; 49 50 for (int m = 0; m < max; m +=2) { 51 int wr = *wpr; 52 int wi = *wpi; 53 wpr+= wstep; 54 wpi+= wstep; 55 56 int step = max << 1; 57 58 for (int i = m; i < len; i += step) { 59 int j = i + max; 60 61 int tr = data[i] - data[j]; 62 int ti = data[i+1] - data[j+1]; 63 64 data[i] += data[j]; 65 data[i+1] += data[j+1]; 66 67 int xr = ((wr * tr + wi * ti) << 1) + 0x8000; 68 int xi = ((wr * ti - wi * tr) << 1) + 0x8000; 69 70 data[j] = xr >> 16; 71 data[j+1] = xi >> 16; 72 } 73 } 74 max >>= 1; 75 wstep <<= 1; 76 } 77 78 { 79 int step = max << 1; 80 81 for (int i = 0; i < len; i += step) { 82 int j = i + max; 83 84 int tr = data[i] - data[j]; 85 int ti = data[i+1] - data[j+1]; 86 87 data[i] += data[j]; 88 data[i+1] += data[j+1]; 89 90 91 data[j] = tr; 92 data[j+1] = ti; 93 } 94 } 95 96 97 #define SWAP(a, b) tmp=(a); (a)=(b); (b)=tmp 98 99 data--; 100 int j = 1; 101 for (int i = 1; i < len; i += 2) { 102 if(j > i) { 103 int tmp; 104 SWAP(data[j], data[i]); 105 SWAP(data[j+1], data[i+1]); 106 } 107 int m = len>> 1; 108 while (m >= 2 && j >m) { 109 j -= m; 110 m >>= 1; 111 } 112 j += m; 113 } 114 } 1 #include "common.h" 2 3 #define NINPUTS 256 4 5 short buf[2*NINPUTS] __sram; 6 7 int dataR1[NINPUTS] = { 8 /* inputs for test 1 */ 9 2, -4, -3, -8, -10, -11, -23, 11, 32, 10, 11, 8, 3, 3, -7, -5, 1, -4, -4, -4, 10 -9, -5, -4, -8, -5, -2, 0, 0, -6, -7, -2, 3, 3, 8, 15, 10, 6, 6, 1, 4, -1, 
11 -10, -4, -2, -9, -5, -7, -8, -2, -5, -6, -2, -3, 1, -3, -8, -6, 0, 5, 4, 15, 12 17, 6, 5, 2, 0, 2, -3, -5, 0, -5, -5, -4, -9, -6, -2, -4, -4, -3, -1, 1, -5, 13 -7, -4, 3, 5, 6, 20, 16, 8, 7, 3, 7, 4, -5, -4, -3, -8, -6, -7, -7, -1, -2, 14 -2, -2, -3, 1, 1, -5, -4, 2, 6, 7, 13, 17, 8, 7, 6, 2, 7, 4, 0, -3, -6, -2, 15 -3, -7, -7, -4, -5, -4, -2, 1, 4, -2, -4, -1, 3, 5, 5, 18, 19, 9, 7, 2, 4, 2, 16 -6, -5, 0, -1, -2, -5, -8, -2, -4, -7, -5, -4, 1, 0, -5, -4, 1, 3, 3, 7, 15, 17 11, 6, 5, 2, 6, 3, -5, -4, -4, -7, -6, -9, -8, -3, -4, -5, -5, -4, 1, -3, -7, 18 -5, 0, 4, 3, 12, 15, 7, 5, 4, 1, 1, -5, -7, -1, -2, -5, -4, -8, -7, -3, -6, 19 -6, -5, -3, 0, -5, -6, -3, 1, 2, 3, 13, 14, 9, 6, 3, 4, 3, -4, -6, -3, -5, 20 -5, -6, -9, -5, -3, -5, -5, -3, 0, 0, -5, -3, 1, 3, 3, 9, 16, 10, 6, 6, 6, 8, 21 2, -2, -2, 22 }; 23 24 int dataI1[NINPUTS] = { 25 /* inputs for test 1 */ 26 1, -1, -1, -2, -2, -2, -3, -1, 0, 0, 1, 2, 2, 3, 3, 2, 2, 2, 2, 1, 0, 0, -1, 27 -1, -2, -2, -2, -3, -3, -3, -3, -2, -2, -1, 0, 1, 1, 2, 3, 3, 3, 3, 2, 2, 1, 28 1, 0, 0, -1, -1, -2, -2, -2, -2, -3, -3, -3, -2, -2, -1, 0, 1, 1, 2, 3, 3, 4, 29 3, 3, 3, 2, 2, 1, 1, 0, 0, -1, -1, -2, -2, -3, -3, -3, -3, -2, -2, -1, 0, 1, 30 2, 3, 3, 4, 4, 4, 3, 3, 2, 2, 1, 0, 0, -1, -1, -2, -2, -2, -3, -3, -3, -3, 31 -2, -2, -1, 0, 1, 2, 2, 3, 3, 4, 3, 3, 3, 2, 2, 1, 0, -1, -1, -2, -2, -3, -3, 32 -4, -3, -4, -3, -2, -2, 0, 1, 1, 2, 2, 3, 3, 3, 2, 2, 2, 2, 1, 0, 0, -1, -2, 33 -2, -3, -3, -3, -4, -4, -3, -3, -2, -1, 0, 0, 1, 2, 2, 3, 3, 3, 3, 2, 2, 1, 34 1, 0, 0, -1, -1, -2, -2, -2, -3, -3, -3, -3, -2, -2, -1, 0, 1, 2, 2, 3, 3, 3, 35 2, 2, 2, 2, 1, 0, 0, 0, -1, -1, -2, -2, -3, -3, -3, -3, -3, -2, -2, 0, 1, 1, 36 2, 3, 3, 4, 4, 3, 3, 3, 2, 1, 1, 0, 0, -1, -2, -2, -2, -3, -3, -3, -3, -2, 37 -2, -1, 0, 1, 2, 2, 3, 4, 4, 4, 3, 38 }; 39 40 int ref[2*NINPUTS] = { 41 /* outputs for test 1 */ 42 6, 13, -47, -11, 91, 38, 44, 30, 48, 21, 48, 13, 43 71, 26, 92, 41, 162, 76, 598, 139, -1002, -60, -284, 59, 44 -210, 40, -154, 65, -100, 67, -23, 35, -98, 92, -18, 96, 45 -125, 53, -414, 121, 135, 80, 78, 37, 62, 57, 29, 19, 46 40, -16, 84, 115, 23, -6, 52, 8, -3, -52, 283, 126, 47 123, -38, 50, -63, 28, -16, 31, -78, 31, -36, -2, -62, 48 -12, -54, -33, -43, -1, 60, -12, -141, -44, -49, -21, -68, 49 -62, -59, -32, -61, -39, -23, -48, -13, -70, -14, -28, -7, 50 5, -98, -28, 10, -50, -10, -32, 2, -42, 11, -4, 36, 51 -65, 37, -9, 19, -52, 28, -94, 3, 228, 74, 53, 73, 52 69, 22, 52, 52, 56, 8, 21, 80, 55, 6, 41, 5, 53 54, 21, 95, 83, 8, -75, -6, -27, 23, -32, 14, -27, 54 20, -34, -2, -57, -2, -28, -7, -32, -11, -21, 19, -70, 55 -20, -7, -16, -32, -25, -15, -27, -17, -21, -13, -25, 0, 56 -10, -7, 17, -20, 48, -41, -154, 48, -59, 75, -44, 45, 57 -19, 42, 1, 39, -18, 33, -2, 43, -1, 36, 30, 33, 58 48, 150, 46, -33, 10, -11, 17, 5, 23, 9, 36, 2, 59 29, 2, 22, -9, -1, -16, 11, -8, 47, -38, -1, -14, 60 2, -20, 7, 4, 9, -25, -2, 7, 5, -30, -5, -1, 61 8, 0, 14, -18, -7, 0, -6, 2, 10, -10, -4, 8, 62 7, -3, 8, -3, 9, 7, -8, 10, 2, -3, 12, 8, 63 19, -7, -1, -4, -2, -9, -3, 9, -3, 6, 18, -2, 64 10, -1, 2, -1, -1, -6, 0, 5, -4, 10, 4, 1, 65 0, -10, -6, 7, -4, 4, 8, 21, 8, -9, 3, 19, 66 4, 32, 14, -6, -1, 29, 0, 13, 11, 22, 16, 9, 67 56, 55, 5, -11, 2, 28, 22, 9, 25, 8, 22, 12, 68 17, -2, 13, -6, 7, 10, 40, 31, 72, -156, 39, -36, 69 5, -32, 12, -46, -16, -44, 17, -55, -21, -48, -22, -35, 70 -50, -65, -141, -52, 40, 37, 12, 17, -21, -15, -16, -15, 71 -28, 6, -10, 3, -21, 24, -14, 18, -20, 4, 8, 62, 72 -12, 11, -6, 38, -9, 31, 1, 75, 24, 38, 12, 37, 73 26, 38, -11, 31, 5, 80, 98, -86, 
64, -30, 31, -18, 74 61, -10, 21, -63, 50, -9, 55, -55, 66, -13, 48, -53, 75 219, -56, -86, -15, -58, -36, -12, -19, -56, -42, 5, -34, 76 -27, -27, -17, 4, -48, -3, -20, -6, -1, 83, -29, 3, 77 -65, 7, -45, 10, -44, 29, -23, 62, -58, 49, -28, 52, 78 -39, 49, -14, 147, 27, -68, -7, 45, -8, 41, 10, 65, 79 31, 40, 32, 63, 36, 6, 56, 71, 122, 39, 289, -121, 80 1, 45, 56, 4, 33, 6, 96, -99, 47, 10, 43, -16, 81 67, -40, 81, -38, 126, -58, -303, -140, -63, -55, 8, -75, 82 -45, -88, 3, -31, -38, -54, -66, -54, -79, -34, -112, -52, 83 -396, -18, 216, -26, 69, -15, 42, -17, 25, -15, 34, -26, 84 40, 1, 38, -10, 141, -27, -83, 30, 85 }; 86 87 extern void fft(int *, int); 88 89 void test_clear() { 90 for (int i = 0; i < NINPUTS; ++i) { 91 buf[2 * i] = dataR1[i]; 92 buf[2 * i + 1] = dataI1[i]; 93 } 94 } 95 96 void test_run(int n) { 97 fft(buf, NINPUTS); 98 } 99 100 int test_check() { 101 for (int i = 0; i != 2 * NINPUTS; ++i) 102 if (buf[i] != ref[i]) 103 return 0; 104 105 return 1; 106 } 76