RISC-V Hardware-Accelerated Dynamic Binary Translation
6th RISC-V Workshop

Simon Rokicki, Irisa / Université de Rennes 1
Steven Derrien, Irisa / Université de Rennes 1
Erven Rohou, Inria Rennes

Embedded Systems

Tight constraints in:
• Power consumption
• Production cost
• Performance

Systems on a Chip

• Complex heterogeneous designs
• Heterogeneity brings new power/performance trade-offs

[Figure: power vs. performance trade-off space, from the in-order core up to the out-of-order superscalar]

Systems on a Chip

• Complex heterogeneous designs
• Heterogeneity brings new power/performance trade-offs
• Are there better trade-offs?

[Figure: same power vs. performance trade-off space, with the VLIW placed between the in-order core and the out-of-order superscalar]

HW/SW Dynamic Compilation for Adaptive Embedded Systems

Architectural choice

[Figure: out-of-order pipeline (decode, renaming, ROB) vs. VLIW, both issuing ins1–ins4]

Out-of-Order processor:
• Dynamic scheduling
• Performance portability
• Poor energy efficiency

VLIW processor:
• Static scheduling
• No portability
• High energy efficiency

The best of both worlds?

Dynamically translate native binaries into VLIW binaries:
• Performance close to an out-of-order processor
• Energy consumption close to a VLIW processor

RISC-V binaries → Dynamic Binary Translation → VLIW binaries → VLIW

Existing approaches

Transmeta Code Morphing Software & Crusoe architectures:
• x86 on a VLIW architecture
• User experience polluted by the cold-code execution penalty

NVIDIA Denver architecture:
• ARM on a VLIW architecture
• Translation overhead is critical
• Too little information available on these closed platforms

Our contribution

• Hardware-accelerated DBT framework
 Makes the DBT cheaper (time & energy)
 First approach that tries to accelerate the binary translation itself
• Open-source framework
 Enables further research

RISC-V binaries → Dynamic Binary Translation → VLIW binaries → VLIW
(with hardware accelerators supporting the translation)

Outline

• Hybrid-DBT platform
  • How does it work?
  • What does it cost?
  • Focus on optimization levels
• Experimental study
  • Impact on translation overhead
  • Impact on area utilization
  • Performance results
• Conclusion & future work

How does it work?

• RISC-V binaries cannot be executed directly on the VLIW

• Direct, naive translation from native to VLIW binaries
• Does not take advantage of instruction-level parallelism

RISC-V binaries → [Optimization level 0: direct translation, no ILP] → VLIW binaries

• Build an intermediate representation (CFG + dependencies)
• Reschedule instructions on the VLIW execution units

RISC-V binaries → VLIW binaries, via:
• Optimization level 0: direct translation (no ILP)
• Optimization level 1: IR builder → IR → IR scheduler (exploits ILP)

• Code profiling to detect hotspots
• Optimization level 1 applied only on hotspots

RISC-V binaries → VLIW binaries, via:
• Optimization level 0: direct translation (no ILP), with profiling inserted
• Optimization level 1: IR builder → IR → IR scheduler (exploits ILP)
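The hotspot mechanism above can be sketched in software. This is a hedged illustration, not the Hybrid-DBT implementation: the structure, function names, and the `HOT_THRESHOLD` value are all assumptions made for the example.

```c
#include <stdint.h>

/* Sketch of hotspot detection: the first-pass translator inserts a
 * profiling stub at the entry of each translated block; once a block's
 * counter crosses a threshold, the block is handed to optimization
 * level 1 (IR builder + scheduler). All names and the threshold are
 * illustrative, not taken from Hybrid-DBT. */

#define HOT_THRESHOLD 64  /* illustrative trigger point */

typedef struct {
    uint32_t exec_count;  /* incremented by the inserted profiling stub */
    int      opt_level;   /* 0 = direct translation, 1 = rescheduled */
} block_profile_t;

/* Called on each block entry; returns 1 exactly once, when the block
 * becomes hot and should be enqueued for level-1 reoptimization. */
static int profile_block_entry(block_profile_t *b) {
    if (b->opt_level == 0 && ++b->exec_count >= HOT_THRESHOLD) {
        b->opt_level = 1;
        return 1;
    }
    return 0;
}
```

Because the check is a counter increment plus one compare, the profiling cost per block entry stays negligible compared to the translated code itself.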

Hardware Accelerated Dynamic Binary Translation 14 level Optimization level Optimization 1 0 • • What Need Cycle/ to instr does accelerate 150 cycle/ Translation : binaries RISC Direct number 400 cycle/ - V IR instr Builder it time of cycles to translate one RISC No ILP cost instr consuming Hardware Accelerated Dynamic Binary Translation ? binaries parts ofthe translation VLIW IR ILP - V instruction 500 cycle/ IR Scheduler Profiling Insert Insert instr

VLIW 15 Hybrid-DBT framework

• Hardware acceleration of critical steps of DBT • Can be seen as a hardware accelerated compiler back-end

RISC-V binaries → VLIW binaries, via:
• Optimization level 0 (first-pass translation, inserts profiling): software pass
• Optimization level 1 (IR builder → IR → IR scheduler): hardware accelerators

Optimization level 0

• Implemented as a finite state machine
• Translates each native instruction separately
• Produces 1 VLIW instruction per cycle
• 1 RISC-V instruction => up to 2 VLIW instructions
• Simple because the ISAs are similar

[Figure: the first-pass translation re-packs the RISC-V fields (imm12 | rs1 | funct | rd | opcode) into the VLIW encoding (imm13 | rs1 | rd | opcode)]
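The field re-packing above can be illustrated in C (the real translator is a hardware FSM). The RISC-V I-type layout is the standard one; the VLIW field positions and the `0x11` opcode are purely illustrative assumptions.

```c
#include <stdint.h>

/* Sketch of the level-0 datapath: re-pack a RISC-V I-type instruction
 * (imm12 | rs1 | funct3 | rd | opcode) into a hypothetical VLIW encoding
 * (imm13 | rs1 | rd | opcode). Only the RISC-V side is the real format;
 * the VLIW layout here is an illustrative assumption. */
static uint32_t translate_itype(uint32_t riscv) {
    uint32_t imm12 = riscv >> 20;           /* bits 31..20 */
    uint32_t rs1   = (riscv >> 15) & 0x1f;  /* bits 19..15 */
    uint32_t rd    = (riscv >> 7)  & 0x1f;  /* bits 11..7  */
    /* sign-extend the 12-bit immediate into a 13-bit field */
    uint32_t imm13 = (imm12 & 0x800) ? (imm12 | 0x1000) : imm12;
    uint32_t vliw_opcode = 0x11;            /* illustrative VLIW ADDI */
    return (imm13 << 17) | (rs1 << 12) | (rd << 7) | vliw_opcode;
}
```

In the real FSM the `funct` field would select the VLIW opcode; it is hard-coded here to keep the sketch short.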

Optimization level 1

VLIW binaries → IR builder → IR → IR scheduler → VLIW binaries

• Build a higher-level intermediate representation
• Perform instruction scheduling
• Critical to start exploiting VLIW capabilities

Choosing an Intermediate Representation

IR advantages:
• Direct access to dependencies and successors
• Regular structure (no pointers, no variable-size fields)

[Figure: example IR for a small basic block (st @g3 = 0; ld r1 = @g2; addi g1 = g1, 1; sub r3 = r1, g1; st @g2 = r3; mov r3 = 0): each instruction is a fixed-size 128-bit word, split at bits 96/64/32/0, holding op, registers[4], the dependence counters nbDep / nbDSucc / nbSucc, and succNames[8]]
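A minimal C sketch of such a fixed-size IR entry follows. The field names and the 128-bit total come from the slide; the individual field widths are illustrative guesses, not the actual Hybrid-DBT encoding.

```c
#include <stdint.h>

/* Sketch of the fixed-size IR entry: each instruction is one 128-bit word
 * with no pointers and no variable-length fields, which is what lets the
 * hardware accelerators stream it. Field widths below are assumptions. */
typedef struct {
    uint8_t op;            /* opcode */
    uint8_t registers[4];  /* destination + up to three sources */
    uint8_t nbDep;         /* number of unresolved input dependencies */
    uint8_t nbDSucc;       /* number of data successors */
    uint8_t nbSucc;        /* total number of successors */
    uint8_t succNames[8];  /* indices of the successor instructions */
} ir_entry_t;              /* 1+4+1+1+1+8 = 16 bytes = 128 bits */
```

Because every entry has the same size, the i-th instruction of a block sits at a fixed offset, so dependencies and successors are reachable without pointer chasing.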

Details on hardware accelerators

VLIW binaries → IR builder → IR → IR scheduler → VLIW binaries

• IR builder: one-pass dependency analysis
• IR scheduler: list-scheduling algorithm
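The list-scheduling step can be sketched in C using the dependence counters carried by the IR (nbDep, successor names). Everything here is an illustrative software model: the issue width of 4, the unit latency, and all names are assumptions, not the hardware algorithm itself.

```c
/* Minimal list-scheduling sketch: instructions whose nbDep counter reaches
 * zero become ready; each cycle fills up to ISSUE_WIDTH slots and decrements
 * the nbDep of successors (unit latency assumed). Illustrative only. */

#define ISSUE_WIDTH 4
#define MAX_INSTR   64

typedef struct {
    int nbDep;    /* unresolved predecessors */
    int nbSucc;   /* number of successors */
    int succ[8];  /* successor indices */
} node_t;

/* Schedules n instructions of a DAG; cycle_of[i] receives the issue cycle
 * of instruction i. Returns the schedule length in cycles. */
static int list_schedule(node_t *g, int n, int *cycle_of) {
    int ready[MAX_INSTR], nready = 0, done = 0, cycle = 0;
    for (int i = 0; i < n; i++)
        if (g[i].nbDep == 0) ready[nready++] = i;
    while (done < n) {
        int issued = nready < ISSUE_WIDTH ? nready : ISSUE_WIDTH;
        int next[MAX_INSTR], nnext = 0;
        for (int k = 0; k < issued; k++) {
            int i = ready[k];
            cycle_of[i] = cycle;
            done++;
            for (int s = 0; s < g[i].nbSucc; s++)
                if (--g[g[i].succ[s]].nbDep == 0)  /* successor now ready */
                    next[nnext++] = g[i].succ[s];
        }
        /* carry over ready instructions that did not fit this cycle */
        for (int k = issued; k < nready; k++) next[nnext++] = ready[k];
        for (int k = 0; k < nnext; k++) ready[k] = next[k];
        nready = nnext;
        cycle++;
    }
    return cycle;
}
```

On a diamond-shaped DAG (one root, two independent middles, one join) this yields a 3-cycle schedule with the two middle instructions issued in the same VLIW cycle.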

• Developing such accelerators directly in VHDL is out of reach
• Accelerators are developed using High-Level Synthesis:
  • Loop unrolling/pipelining
  • Memory partitioning
  • Memory access factorization
  • Explicit forwarding
• See the DATE'17 paper for more details!

Outline

• Hybrid-DBT platform
  • How does it work?
  • What does it cost?
  • Focus on optimization levels
• Experimental study
  • Impact on translation overhead
  • Impact on area utilization
  • Performance results
• Conclusion & future work

Impact on translation overhead

• The VLIW baseline is executed with ST200simVLIW
• Fully functional Hybrid-DBT platform on FPGA
  • JIT processor: Nios II
  • Altera DE2-115 board

Software translation cost: optimization level 0 (first-pass translation) 150 cycles/instr; optimization level 1: IR builder 400 cycles/instr, IR scheduler 500 cycles/instr
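As a back-of-the-envelope check of these per-instruction figures, the software translation cost can be modeled in a few lines. The model (and the assumption that hot code first pays the level-0 cost before being reoptimized) is illustrative, not a measurement.

```c
#include <stdint.h>

/* Rough model of the software DBT cost, using the slide's figures:
 * level 0 costs 150 cycles/instr; level 1 adds 400 (IR builder)
 * + 500 (IR scheduler) cycles/instr on top. Illustrative arithmetic. */
static uint64_t sw_translation_cycles(uint64_t n_instr, int opt_level) {
    uint64_t cost = 150u * n_instr;       /* every instr goes through level 0 */
    if (opt_level >= 1)
        cost += (400u + 500u) * n_instr;  /* IR builder + IR scheduler */
    return cost;
}
```

For a 1000-instruction hotspot this gives 150,000 cycles at level 0 and 1,050,000 cycles once level 1 runs, which is why accelerating the IR builder and scheduler matters most.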

• Cost of optimization level 0 using the hardware accelerator

[Figure: First-Pass Translator: cycles/instruction and speed-up vs. software DBT, per benchmark]

• Cost of optimization level 1 using the hardware accelerator

[Figure: IR Builder and IR Scheduler: cycles/instruction drop from roughly 37–144 in software to 12–15 with the accelerators; speed-up vs. software DBT, per benchmark]

Impact on area/resource cost

• Resource usage for all our platform components (ASIC 65 nm, NAND-equivalent gates)

[Figure: gate counts: VLIW core 19,220; DBT processor 7,626; IR Scheduler 6,300; First-Pass Translator 5,019; IR Builder 779; the DBT components are the overhead added by Hybrid-DBT]

Performance results

• Comparison against out-of-order architectures
• Compare area, power consumption and performance with BOOM

[Figure: power consumption (mW) of BOOM vs. VLIW4 (roughly a 5x gap), and speed-up of OoO vs. VLIW on adpcm dec, g721 dec, g721 enc, matmul; one benchmark stays at optimization level 0]

Conclusion

• Presentation of the Hybrid-DBT framework
  • Hardware-accelerated DBT
  • Open-source RISC-V-to-VLIW DBT framework
  • Tested FPGA prototype

• Sources are available on GitHub: https://github.com/srokicki/HybridDBT

RISC-V binaries → Dynamic Binary Translation → VLIW binaries → VLIW
(with hardware accelerators supporting the translation)

Simty: a synthesizable SIMT CPU

• GPU-style SIMT execution assembles vector instructions across threads of SPMD applications
• Alternative to vector processing, based on a scalar ISA
• Simty: proof of concept for SIMT on general-purpose binaries
• Runs the RISC-V instruction set (RV32I), no modification
• Written in synthesizable VHDL
• Warp size and warp count adjustable at synthesis time
• 10-stage in-order pipeline

Scales up to 2048 threads per core with 64 warps × 32 threads

More details at https://team.inria.fr/pacap/simty/

Questions?
• https://github.com/srokicki/HybridDBT
• https://team.inria.fr/pacap/simty/

FPGA synthesis of Simty

• On Altera Cyclone IV

[Figure: logic area (LEs), memory area (M9Ks) and frequency (MHz) for various warp configurations]

• Up to 2048 threads per core: 64 warps × 32 threads
• Sweet spot: 8×8 to 32×16 (multithreading depth for latency hiding × SIMD width for throughput)
 The overhead of per-PC control is easily amortized