Design andTitolo implementation presentazione of an efficient Floating Point sottotitoloUnit (FPU) in the IoT era Milano, XX mese 20XX
OPRECOMP summer of code Andrea Galimberti NiPS Summer School Andrea Romanoni Sept 3-6, 2019 Perugia, Italy Davide Zoni The computing platforms in the Internet-of-Things (IoT) era
Efficient computing: challenges and opportunities in the Internet-of-Things (HiPEAC vision 2019*)
Distributed data processing
Specific requirements and constraints at each stage
Revise the design of the System-on-Chip to maximize efficiency
* HiPEAC Vision 2019: https://www.hipeac.net/vision/2019/
OPRECOMP 2019 - A. Galimberti, A. Romanoni and D. Zoni 2 / - Open-hardware platforms for research
Motivations (HiPEAC 2019): Design open-hardware computing platforms for efficiency, security and safety Trustfulness and security requirements Successful open-hardware projects Computational efficiency Limitations of current open-hardware solutions: Monolithic solutions, not modular Targeting efficient computation Spectre and meltdown: ICT cannot Mainly focused to ASIC design flow rely on closed-hardware solutions Not for research in microarchitecture
Let’s focus on a single optimization direction ...
Floating point data transfer and computation dominate the energy consumption
OPRECOMP 2019 - A. Galimberti, A. Romanoni and D. Zoni 3 / 11 The motivation behind the bfloat16 format
Brain-inspired Float (bfloat16): same format of the IEEE754 single-precision (SP) floating-point standard but truncates the mantissa field from 23 bits to 7 bits.
Float32 Exp (8 bits) Mantissa (23 bits)
S E E E E E E E E M M M M M M M M M M M M M M M M M M M M M M M ~2-(126+23) < x < ~2(128)
MRRE = 0.5 * ulp * B = Float16 = 0.5 * 2-23 * 2 = Exp (5 bits) Mantissa (10 bits) = 2-23
S E E E E E M M M M M M M M M M ~2-(14+10) < x < ~2(16) MRRE = 0.5 * ulp * B = = 0.5 * 2-10 * 2 = bFloat16 = 2-10 Exp (8 bits) Mantissa (7 bits) S E E E E E E E E M M M M M M M ~2-(126+7) < x < ~2(128) MRRE = 0.5 * ulp * B = = 0.5 * 2-7 * 2 = 16 bits = 2-7 Motivations:
Better timing and smaller area
For machine learning tasks a 7 bit mantissa is enough (Google and Intel)
Still possible to represent tiny values
Easier debugging than float16 since bfloat16 shares the same format of the IEEE754-SP standard
OPRECOMP 2019 - A. Galimberti, A. Romanoni and D. Zoni 4 / 11 single SystemVerilog design that implements bFloat and IEEE-754 single-precision
Goals and Efficient FPU for IoT SoCs, i.e., lamp-platform: status of the both timing and area efficient project
Reproducible results: new set of benchmarks targeting machine learning, vision and IA validation infrastructure using Xilinx Vivado, coupled with SystemVerilog DPI-C and random features. Free tools only
OPRECOMP 2019 - A. Galimberti, A. Romanoni and D. Zoni 5 / 11 Design and verification methodology
opCode, FPU_top FPU_top_tb Flexibility op1,op2 Two SV packages implementing doOp, IEEE754_SP_pkg rndMode Res, the functions for each operation bFloat_pkg IsValid, flags according to the selected size add_sub of the mantissa; Each operation can be either Generate mul random implemented or not operands Efficiency: and div opcode Check Rounding scheme can be results i2f configured to consider N LSB
f2i bits to optimize timing/area; Constrained rnd verification cmp using gcc and Xilinx Vivado
DPI-C
OPRECOMP 2019 - A. Galimberti, A. Romanoni and D. Zoni 6 / 11 Area and timing: preliminary results
We optimize each operation in the FPU to improve the timing
The reduction in the mantissa improves both area and timing
The proposed FPU allows to selectively implement a subset of the available operations
NOTE: Flip-Flops results are not reported
OPRECOMP 2019 - A. Galimberti, A. Romanoni and D. Zoni 7 / 11 New set of benchmarks targeting machine learning, vision and IA
Ongoing
FPU integration in the lamp-platform
OPRECOMP 2019 - A. Galimberti, A. Romanoni and D. Zoni 8 / 11 FPU benchmarks for IoT System-on-Chip*
Motivations:
Bare-metal C code for IoT devices (different from PARSEC and SPLASH2 bench suites)
Consider emerging computing scenarios: (computer vision, computer graphics, machine learning)
Errors as function of the bits in the mantissa (tolerance 0.001)
* Project website: www.lamp-platform.org
OPRECOMP 2019 - A. Galimberti, A. Romanoni and D. Zoni 9 / 11 Lightweight Application-specific Modular Processor (LAMP)*
Key idea: The Raspberry successful story Our vision: A small and affordable computer that you can LAMP-platform: a flexibile and use to learn programming affordable System-on-Chip for investigating IoT challenges
Contributions: FPGA targets with up to 4k LUTs with CPU (a RISC-V 5 stage CPU is offered as well )
Modular design to target different design metrics For example: Side-channel evaluation using the same design in post-synthesis, post- implementation and board
Designed for Artix 7 FPGA family
Easily plug different computing devices (request/response data/instruction interface)
* Project website: www.lamp-platform.org
OPRECOMP 2019 - A. Galimberti, A. Romanoni and D. Zoni 10 / 11 Andrea Galimberti, PhD Student Email: [email protected]
Thank you Andrea Romanoni, PhD Email: [email protected]
Davide Zoni, PhD Email: [email protected]
OPRECOMP 2019 - A. Galimberti, A. Romanoni and D. Zoni 11 / 11