Design andTitolo implementation presentazione of an efficient Floating Point sottotitoloUnit (FPU) in the IoT era Milano, XX mese 20XX

OPRECOMP summer of code Andrea Galimberti NiPS Summer School Andrea Romanoni Sept 3-6, 2019 Perugia, Italy Davide Zoni The computing platforms in the Internet-of-Things (IoT) era

Efficient computing: challenges and opportunities in the Internet-of-Things (HiPEAC vision 2019*)

 Distributed data processing

 Specific requirements and constraints at each stage

 Revise the design of the System-on-Chip to maximize efficiency

* HiPEAC Vision 2019: https://www.hipeac.net/vision/2019/

OPRECOMP 2019 - A. Galimberti, A. Romanoni and D. Zoni 2 / - Open-hardware platforms for research

Motivations (HiPEAC 2019):  Design open-hardware computing platforms for efficiency, security and safety  Trustfulness and security requirements Successful open-hardware projects  Computational efficiency Limitations of current open-hardware solutions:  Monolithic solutions, not modular  Targeting efficient computation Spectre and meltdown: ICT cannot  Mainly focused to ASIC design flow rely on closed-hardware solutions  Not for research in microarchitecture

Let’s focus on a single optimization direction ...

Floating point data transfer and computation dominate the energy consumption

OPRECOMP 2019 - A. Galimberti, A. Romanoni and D. Zoni 3 / 11 The motivation behind the bfloat16 format

Brain-inspired Float (bfloat16): same format of the IEEE754 single-precision (SP) floating-point standard but truncates the mantissa field from 23 bits to 7 bits.

Float32 Exp (8 bits) Mantissa (23 bits)

S E E E E E E E E M M M M M M M M M M M M M M M M M M M M M M M ~2-(126+23) < x < ~2(128)

MRRE = 0.5 * ulp * B = Float16 = 0.5 * 2-23 * 2 = Exp (5 bits) Mantissa (10 bits) = 2-23

S E E E E E M M M M M M M M M M ~2-(14+10) < x < ~2(16) MRRE = 0.5 * ulp * B = = 0.5 * 2-10 * 2 = bFloat16 = 2-10 Exp (8 bits) Mantissa (7 bits) S E E E E E E E E M M M M M M M ~2-(126+7) < x < ~2(128) MRRE = 0.5 * ulp * B = = 0.5 * 2-7 * 2 = 16 bits = 2-7 Motivations:

 Better timing and smaller area

 For machine learning tasks a 7 bit mantissa is enough (Google and )

 Still possible to represent tiny values

 Easier debugging than float16 since bfloat16 shares the same format of the IEEE754-SP standard

OPRECOMP 2019 - A. Galimberti, A. Romanoni and D. Zoni 4 / 11 single SystemVerilog design that implements bFloat and IEEE-754 single-precision

Goals and Efficient FPU for IoT SoCs, i.e., lamp-platform: status of the both timing and area efficient project

Reproducible results:  new set of benchmarks targeting machine learning, vision and IA  validation infrastructure using Vivado, coupled with SystemVerilog DPI-C and random features. Free tools only

OPRECOMP 2019 - A. Galimberti, A. Romanoni and D. Zoni 5 / 11 Design and verification methodology

opCode, FPU_top FPU_top_tb Flexibility op1,op2  Two SV packages implementing doOp, IEEE754_SP_pkg rndMode Res, the functions for each operation bFloat_pkg IsValid, flags according to the selected size add_sub of the mantissa;  Each operation can be either Generate mul random implemented or not operands Efficiency: and div opcode Check  Rounding scheme can be results i2f configured to consider N LSB

f2i bits to optimize timing/area; Constrained rnd verification cmp using gcc and

DPI-C

OPRECOMP 2019 - A. Galimberti, A. Romanoni and D. Zoni 6 / 11 Area and timing: preliminary results

 We optimize each operation in the FPU to improve the timing

 The reduction in the mantissa improves both area and timing

 The proposed FPU allows to selectively implement a subset of the available operations

 NOTE: Flip-Flops results are not reported

OPRECOMP 2019 - A. Galimberti, A. Romanoni and D. Zoni 7 / 11 New set of benchmarks targeting machine learning, vision and IA

Ongoing

FPU integration in the lamp-platform

OPRECOMP 2019 - A. Galimberti, A. Romanoni and D. Zoni 8 / 11 FPU benchmarks for IoT System-on-Chip*

Motivations:

 Bare-metal C code for IoT devices (different from PARSEC and SPLASH2 bench suites)

 Consider emerging computing scenarios: (computer vision, computer graphics, machine learning)

Errors as function of the bits in the mantissa (tolerance 0.001)

* Project website: www.lamp-platform.org

OPRECOMP 2019 - A. Galimberti, A. Romanoni and D. Zoni 9 / 11 Lightweight Application-specific Modular Processor (LAMP)*

Key idea: The Raspberry successful story Our vision: A small and affordable computer that you can LAMP-platform: a flexibile and use to learn programming affordable System-on-Chip for investigating IoT challenges

Contributions:  FPGA targets with up to 4k LUTs with CPU (a RISC-V 5 stage CPU is offered as well )

 Modular design to target different design metrics  For example: Side-channel evaluation using the same design in post-synthesis, post- implementation and board

 Designed for Artix 7 FPGA family

 Easily plug different computing devices (request/response data/instruction interface)

* Project website: www.lamp-platform.org

OPRECOMP 2019 - A. Galimberti, A. Romanoni and D. Zoni 10 / 11 Andrea Galimberti, PhD Student Email: [email protected]

Thank you Andrea Romanoni, PhD Email: [email protected]

Davide Zoni, PhD Email: [email protected]

OPRECOMP 2019 - A. Galimberti, A. Romanoni and D. Zoni 11 / 11