Hardware-Accelerated Dynamic Binary Translation 6Th RISC-V Workshop

RISC-V Hardware-Accelerated Dynamic Binary Translation 6th RISC-V Workshop Rokicki Simon - Irisa / Université de Rennes 1 Steven Derrien - Irisa / Université de Rennes 1 Erven Rohou - Inria Rennes Embedded Systems Tight constraints in • Power consumption • Production cost • Performance Hardware Accelerated Dynamic Binary Translation 2 Systems on a Chip • Complex heterogeneous designs • Heterogeneity brings new power/performance trade-off Out-of-order Superscalar Power In-order core Overhead from in-order to Out-of-Order Performance Hardware Accelerated Dynamic Binary Translation 3 Systems on a Chip • Complex heterogeneous designs • Heterogeneity brings new power/performance trade-off • Are there better trade-off? Out-of-order Superscalar Power In-order core VLIW Performance HW/SW Dynamic Compilation for Adaptive Embedded Systems 4 Architectural choice … Ins4 Ins4 Ins4 … ins3 ins3 ins3 … ins4 ins3 ins2 ins1 Renaming ROB & VLIW … ins2 ins2 ins2 … Decode ins1 ins1 ins1 Out-of-Order processor VLIW processor • Dynamic Scheduling • Static scheduling • Performance portability • No portability • Poor energy efficiency • High energy efficiency Hardware Accelerated Dynamic Binary Translation 5 The best of both world ? Dynamically translate native binaries into VLIW binaries: • Performance close to Out-of-Order processor • Energy consumption close to VLIW processor Binaries Dynamic Binary VLIW (RISC-V) Translation Binaries VLIW Hardware Accelerated Dynamic Binary Translation 6 Existing approaches • Transmeta Code Morphing Software & Crusoe architectures • x86 on VLIW architecture • User experience polluted by cold-code execution penalty • Nvidia Denver architecture • ARM on VLIW architecture • Translation overhead is critical • Too few information on closed platforms Hardware Accelerated Dynamic Binary Translation 7 Our contribution • Hardware accelerated DBT framework Make the DBT cheaper (time & energy) First approach that try to accelerate binary translation • Open source framework Allows research Binaries Dynamic Binary VLIW (RISC-V) Translation Binaries VLIW Hardware Accelerators Hardware Accelerated Dynamic Binary Translation 7 Outline • Hybrid-DBT Platform • How does it work? • What does it cost? • Focus on optimization levels • Experimental Study • Impact on translation overhead • Impact on area utilization • Performance results • Conclusion & Future work Hardware Accelerated Dynamic Binary Translation 9 Outline • Hybrid-DBT Platform • How does it work? • What does it cost? • Focus on optimization levels • Experimental Study • Impact on translation overhead • Impact on area utilization • Performance results • Conclusion & Future work Hardware Accelerated Dynamic Binary Translation 10 How does it work? • RISC-V binaries cannot be executed on VLIW RISC-V binaries VLIW Hardware Accelerated Dynamic Binary Translation 11 How does it work? • Direct, naive translation from native to VLIW binaries • Does not take advantage of Instruction Level Parallelism RISC-V VLIW binaries binaries VLIW Optimization Direct level 0 Translation No ILP Hardware Accelerated Dynamic Binary Translation 12 How does it work? • Build an Intermediate Representation (CFG + dependencies) • Reschedule Instructions on VLIW execution units RISC-V VLIW binaries binaries VLIW Optimization Direct level 0 Translation No ILP Optimization ILP level 1 IR Builder IR IR Scheduler Hardware Accelerated Dynamic Binary Translation 13 How does it work? • Code profiling to detect hotspot • Optimization level 1 only on hotspots RISC-V VLIW binaries binaries VLIW Optimization Direct Insert level 0 Translation No ILP Profiling Optimization ILP level 1 IR Builder IR IR Scheduler Hardware Accelerated Dynamic Binary Translation 14 What does it cost? • Cycle/instr : number of cycles to translate one RISC-V instruction • Need to accelerate time consuming parts of the translation RISC-V VLIW binaries binaries VLIW Optimization Direct Insert level 0 Translation No ILP Profiling 150 cycle/instr Optimization ILP level 1 IR Builder IR IR Scheduler 400 cycle/instr 500 cycle/instr Hardware Accelerated Dynamic Binary Translation 15 Hybrid-DBT framework • Hardware acceleration of critical steps of DBT • Can be seen as a hardware accelerated compiler back-end RISC-V VLIW binaries binaries VLIW Optimization First-Pass Insert level 0 Translation No ILP Profiling Software pass Optimization ILP Hardware level 1 IR Builder IR IR Scheduler accelerators Hardware Accelerated Dynamic Binary Translation 16 Optimization level 0 • Implemented as a Finite State Machine • Translate each native instruction separately • Produces 1 VLIW instruction per cycle • 1 RISC-V instruction => up to 2 VLIW instructions • Simple because ISA are similar RISC-V VLIW binaries binaries imm12 rs1 funct rd opcode First-Pass imm13 rs1 rd opcode Translation Hardware Accelerated Dynamic Binary Translation 17 Optimization level 1 VLIW VLIW IR Builder IR IR Scheduler binaries binaries • Build an higher level intermediate representation • Perform Instruction Scheduling • Critical to start exploiting VLIW capabilities Hardware Accelerated Dynamic Binary Translation 18 Choosing an Intermediate Representation IR advantages: • Direct access to dependencies and successors • Regular structure (no pointers / variable size) g1 1 g2 g3 0 nbDep nbDSucc nbSucc 96 64 32 0 2 1 0 op registers[4] succNames[8] + ld st 0 - st @g3 = 0 0 0 1 1 - - - - - - - 1 - ld r1 = @g2 1 2 2 3 4 - - - - - - 3 0 - 2 - addi g1 = g1 1 0 1 1 3 - - - - - - - 5 3 - sub r3 = r1 g1 2 1 2 4 5 - - - - - - mv 4 - st @g2 = r3 2 0 1 5 - - - - - - - 4 st 5- mov r3 = 0 2 0 0 - - - - - - - - r 3 Hardware Accelerated Dynamic Binary Translation 19 Details on hardware accelerators VLIW VLIW IR Builder IR IR Scheduler binaries binaries One-pass dependencies analysis List-scheduling algorithm • Developing such accelerators using VHDL is out of reach • Accelerators are developed using High-Level Synthesis • Loops unrolling/pipelining • Memory partitioning See DATE’17 paper • Memory accesses factorization for more details ! • Explicit forwarding Hardware Accelerated Dynamic Binary Translation 20 Outline • Hybrid-DBT Platform • How does it work? • What does it cost? • Focus on optimization levels • Experimental Study • Impact on translation overhead • Impact on area utilization • Performance results • Conclusion & Future work Hardware Accelerated Dynamic Binary Translation 21 Impact on translation overhead • VLIW baseline is executed with ST200simVLIW • Fully functionnal Hybrid-DBT platform on FPGA • JIT processor: Nios II • Altera DE2-115 RISC-V VLIW binaries binaries VLIW Optimization First-Pass level 0 Translation 150 cycle/instr IR Optimization IR Builder IR level 1 Scheduler 400 cycle/instr 500 cycle/instr Hardware Accelerated Dynamic Binary Translation 22 Impact on translation overhead • Cost of optimization level 0 using the hardware accelerator cycle/instruction Speed-up vs Software DBT 5 200 4 150 3 100 2 1 50 0 0 First-Pass Translator IR Builder IR Scheduler Hardware Accelerated Dynamic Binary Translation 23 Impact on translation overhead • Cost of optimization level 1 using the hardware accelerator cycle/instruction Speed-up vs Software DBT 200 40 35 150 30 144 132 25 100 106 112 106 116 105 20 15 50 61 60 10 37 5 0 14 15 13 14 13 13 13 13 12 13 0 First-Pass Translator IR Builder IR Scheduler Hardware Accelerated Dynamic Binary Translation 24 Impact on area/resource cost • Resource usage for all our platform components • ASIC 65nm NAND equivalent gates 25000 20000 19 220 Overhead from Hybrid-DBT 15000 10000 7 626 6 300 5 019 5000 779 0 VLIW DBT IR First-Pass IR Builder Processor Scheduler Translator Hardware Accelerated Dynamic Binary Translation 25 Performance results • Comparison against OoO architectures • Compare area, power consumption and performance with BOOM Power Consumption (mW) Speed-up OoO vs VLIW 2000 0,5 1500 0,4 Stay at opt level 0 5x 0,3 1000 0,2 500 0,1 0 0 BOOM VLIW4 adpcm dec g721 dec g721 enc matmul HW/SW Dynamic Compilation for Adaptive Embedded Systems 26 Conclusion • Presentation of Hybrid-DBT framework • Hardware accelerated DBT • Open-source DBT framework RISC-V to VLIW • Tested FPGA prototype • Sources are available on GitHub: https://github.com/srokicki/HybridDBT Binaries Dynamic Binary VLIW (RISC-V) Translation Binaries VLIW Hardware Accelerators Hardware Accelerated Dynamic Binary Translation 27 Simty: a synthesizable SIMT CPU GPU-style SIMT execution assembles vector instructions across threads of SPMD applications Alternative to vector processing based on scalar ISA Simty: proof of concept for SIMT on general-purpose binaries Runs the RISC-V instruction set (RV32I), no modification Written in synthesizable VHDL Warp size and warp count adjustable at synthesis time 10-stage in-order pipeline Scales up to 2048 threads per core with 64 warps × 32 threads More details on https://team.inria.fr/pacap/simty/ !! 28 Questions ? https://github.com/srokicki/HybridDBT https://team.inria.fr/pacap/simty/ Hardware Accelerated Dynamic Binary Translation 29 FPGA synthesis of Simty On Altera Cyclone IV Logic area (LEs) Memory area (M9Ks) Frequency (MHz) • Up to 2048 threads per core: 64 warps × 32 threads • Sweet spot: 8x8 to 32x16 Latency hiding Throughput multithreading depth SIMD width Overhead of per-PC control is easily amortized 30.

Load more