
RISC-V Hardware-Accelerated Dynamic Binary Translation
6th RISC-V Workshop
Simon Rokicki - IRISA / Université de Rennes 1
Steven Derrien - IRISA / Université de Rennes 1
Erven Rohou - Inria Rennes

Embedded Systems
• Tight constraints on:
  • Power consumption
  • Production cost
  • Performance

Systems on a Chip
• Complex heterogeneous designs
• Heterogeneity brings new power/performance trade-offs
• Are there better trade-offs?
[Plot: power vs. performance, showing an in-order core, an out-of-order superscalar core and the overhead of going from in-order to out-of-order, with a VLIW added as a candidate design point.]

Architectural choice
[Diagram: an out-of-order pipeline (decode, renaming, ROB) next to a VLIW issuing pre-scheduled bundles.]
• Out-of-Order processor: dynamic scheduling, performance portability, poor energy efficiency
• VLIW processor: static scheduling, no portability, high energy efficiency

The best of both worlds?
Dynamically translate native binaries into VLIW binaries:
• Performance close to an Out-of-Order processor
• Energy consumption close to a VLIW processor
[Flow: RISC-V binaries -> Dynamic Binary Translation -> VLIW binaries -> VLIW core.]

Existing approaches
• Transmeta Code Morphing Software & Crusoe architectures
  • x86 on a VLIW architecture
  • User experience polluted by the cold-code execution penalty
• Nvidia Denver architecture
  • ARM on a VLIW architecture
  • Translation overhead is critical
• Too little information available on these closed platforms

Our contribution
• Hardware-accelerated DBT framework
  • Makes the DBT cheaper (time & energy)
  • First approach that tries to accelerate the binary translation step itself
• Open-source framework
  • Enables further research
[Flow: RISC-V binaries -> Dynamic Binary Translation (backed by hardware accelerators) -> VLIW binaries -> VLIW core.]

Outline
• Hybrid-DBT platform
  • How does it work?
  • What does it cost?
  • Focus on optimization levels
• Experimental study
  • Impact on translation overhead
  • Impact on area utilization
  • Performance results
• Conclusion & future work

How does it work?
• RISC-V binaries cannot be executed directly on the VLIW

How does it work?
• Direct, naive translation from native to VLIW binaries (optimization level 0)
• Does not take advantage of Instruction-Level Parallelism (no ILP)

How does it work?
• Build an Intermediate Representation (CFG + dependencies)
• Reschedule instructions on the VLIW execution units (optimization level 1: IR Builder -> IR -> IR Scheduler)

How does it work?
• Code profiling to detect hotspots
• Optimization level 1 is applied only on hotspots
[Flow: RISC-V binaries go through level 0 (direct translation, no ILP, with profiling inserted); hot blocks additionally go through level 1 (IR Builder -> IR -> IR Scheduler); both produce VLIW binaries.]
A rough sketch of this two-level flow follows below.
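To make the two-level flow concrete, here is a minimal C sketch of how such a DBT dispatch loop can be organized: every block is first translated with the cheap level-0 pass, the profiling inserted at level 0 counts executions, and only blocks that become hot pay for IR construction and rescheduling. This is an illustration under assumptions, not the Hybrid-DBT source code: the Block structure, the helper names and the hotness threshold are all hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdbool.h>

/* Illustrative sketch only: names, types and the hotness threshold are
 * assumptions, not the actual Hybrid-DBT implementation. */

#define HOT_THRESHOLD 64   /* assumed number of executions before re-optimizing */

typedef struct Block {
    uint32_t *riscv_code;    /* native RISC-V instructions             */
    int       riscv_count;
    uint32_t *vliw_code;     /* current VLIW translation (NULL if none) */
    uint32_t  exec_counter;  /* incremented by the inserted profiling   */
    bool      optimized;     /* already went through level 1?           */
} Block;

/* Level 0: direct, in-order translation, no ILP extracted (hypothetical helper). */
extern void first_pass_translate(Block *b);

/* Level 1: build the IR (CFG + dependencies), then reschedule the
 * instructions on the VLIW execution units (hypothetical helpers). */
extern void ir_build(Block *b);
extern void ir_schedule(Block *b);

extern void run_vliw(Block *b);  /* execute the translated block */

/* Simplified dispatch loop of the translator. */
void execute_block(Block *b)
{
    if (b->vliw_code == NULL)          /* first encounter: cheap translation */
        first_pass_translate(b);

    b->exec_counter++;                 /* profiling inserted at level 0 */

    if (!b->optimized && b->exec_counter >= HOT_THRESHOLD) {
        ir_build(b);                   /* hotspot: pay for level 1 once */
        ir_schedule(b);
        b->optimized = true;
    }
    run_vliw(b);
}
```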
What does it cost?
• Cycle/instr: number of cycles needed to translate one RISC-V instruction
• Need to accelerate the time-consuming parts of the translation
• Software cost of each pass: first-pass translation (level 0) about 150 cycles/instr; IR Builder about 400 cycles/instr; IR Scheduler about 500 cycles/instr

Hybrid-DBT framework
• Hardware acceleration of the critical steps of the DBT
• Can be seen as a hardware-accelerated compiler back-end
• The First-Pass Translation, the IR Builder and the IR Scheduler become hardware accelerators; the remaining steps (such as profiling) stay as software passes

Optimization level 0
• Implemented as a Finite State Machine
• Translates each native instruction separately
• Produces 1 VLIW instruction per cycle
• 1 RISC-V instruction maps to up to 2 VLIW instructions
• Simple because the ISAs are similar: a RISC-V encoding (imm12, rs1, funct, rd, opcode) is remapped to the VLIW encoding (imm13, rs1, rd, opcode)

Optimization level 1 (VLIW binaries -> IR Builder -> IR -> IR Scheduler -> VLIW binaries)
• Build a higher-level intermediate representation
• Perform instruction scheduling
• Critical to start exploiting the VLIW capabilities

Choosing an Intermediate Representation
IR advantages:
• Direct access to dependencies and successors
• Regular structure (no pointers, no variable-size records)
[Figure: each IR instruction is one 128-bit word holding the opcode, up to four register operands, the counters nbDep, nbDSucc and nbSucc, and up to eight successor names (succNames[8]); illustrated on a small sequence (st @g3 = 0; ld r1 = @g2; addi g1 = g1, 1; sub r3 = r1, g1; st @g2 = r3; mov r3 = 0) and its dependency graph.]

Details on hardware accelerators
• IR Builder: one-pass dependency analysis
• IR Scheduler: list-scheduling algorithm
• Developing such accelerators directly in VHDL is out of reach
• The accelerators are developed using High-Level Synthesis:
  • Loop unrolling/pipelining
  • Memory partitioning
  • Memory-access factorization
  • Explicit forwarding
• See the DATE'17 paper for more details!
A rough software sketch of the IR layout and of list scheduling is given below.
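As a rough software model of the IR described above and of the list-scheduling pass that the IR Scheduler accelerates, the sketch below keeps each instruction in a fixed-size record mirroring the fields of the 128-bit encoding (op, registers[4], nbDep, nbDSucc, nbSucc, succNames[8]) and issues ready instructions bundle by bundle onto a fixed number of VLIW slots. The issue width, the block-size bound and the helper code are assumptions made for illustration; the real accelerators are generated with High-Level Synthesis and also handle operation latencies, slot types and scheduling priorities, which this sketch ignores.

```c
#include <stdint.h>

/* Sketch of one IR entry: a fixed-size record with no pointers, mirroring
 * the fields of the 128-bit per-instruction encoding. Exact bit packing
 * is omitted here. */
typedef struct IRInstr {
    uint8_t op;            /* opcode                                */
    uint8_t registers[4];  /* destination and source operands       */
    uint8_t nbDep;         /* number of unresolved predecessors     */
    uint8_t nbDSucc;       /* number of data successors             */
    uint8_t nbSucc;        /* total number of successors            */
    uint8_t succNames[8];  /* indices of the successor instructions */
} IRInstr;

#define ISSUE_WIDTH 4      /* assumed number of VLIW issue slots      */
#define MAX_INSTR   256    /* block-size bound assumed by this sketch */

/* Simplified list scheduling: each cycle, pick up to ISSUE_WIDTH
 * instructions whose dependencies are all resolved (nbDep == 0), then
 * release their successors for the following cycles. Writes the bundle
 * number of each instruction into cycle_of[] and returns the schedule
 * length. A real list scheduler would also rank ready instructions by
 * critical path and model operation latencies and slot types. */
int list_schedule(const IRInstr *ir, int n, int cycle_of[])
{
    uint8_t deps[MAX_INSTR];
    int remaining = n, cycle = 0;

    for (int i = 0; i < n; i++)
        deps[i] = ir[i].nbDep;

    while (remaining > 0) {
        int bundle[ISSUE_WIDTH];
        int issued = 0;

        /* Select the instructions issued in this bundle. */
        for (int i = 0; i < n && issued < ISSUE_WIDTH; i++) {
            if (deps[i] == 0) {
                cycle_of[i] = cycle;
                deps[i] = 0xFF;        /* mark as already issued */
                bundle[issued++] = i;
            }
        }
        /* Release successors only after the selection, so a dependent
         * instruction never lands in the same bundle as its producer. */
        for (int k = 0; k < issued; k++)
            for (int s = 0; s < ir[bundle[k]].nbSucc; s++)
                deps[ir[bundle[k]].succNames[s]]--;

        remaining -= issued;
        cycle++;                       /* next VLIW bundle */
    }
    return cycle;
}
```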
Outline
• Experimental study
  • Impact on translation overhead
  • Impact on area utilization
  • Performance results
• Conclusion & future work

Impact on translation overhead
• The VLIW baseline is executed with ST200simVLIW
• Fully functional Hybrid-DBT platform on FPGA
  • JIT processor: Nios II
  • Board: Altera DE2-115
• Software translation cost (recap): first-pass translation about 150 cycles/instr, IR Builder about 400 cycles/instr, IR Scheduler about 500 cycles/instr

Impact on translation overhead
• Cost of optimization level 0 using the hardware accelerator
[Bar chart: cycles per instruction and speed-up versus the software DBT for the accelerated First-Pass Translator.]

Impact on translation overhead
• Cost of optimization level 1 using the hardware accelerators
[Bar chart: cycles per instruction and speed-up versus the software DBT for the accelerated IR Builder and IR Scheduler across the benchmark set.]

Impact on area/resource cost
• Resource usage of all the platform components
• ASIC 65 nm, NAND-equivalent gates
[Bar chart of gate counts for the VLIW core, the DBT (JIT) processor and the three accelerators (IR Scheduler, First-Pass Translator, IR Builder); the values read from the chart are 19,220; 7,626; 6,300; 5,019 and 779 gates, with the Hybrid-DBT components marked as the overhead added on top of the VLIW core.]

Performance results
• Comparison against out-of-order architectures
• Compare area, power consumption and performance with BOOM
[Charts: power consumption (mW) of BOOM versus VLIW4, and speed-up of the OoO versus the VLIW on adpcm dec, g721 dec, g721 enc and matmul; the charts carry the annotations "5x" and "stay at opt level 0".]

Conclusion
• Presentation of the Hybrid-DBT framework
  • Hardware-accelerated DBT
  • Open-source RISC-V-to-VLIW DBT framework
  • Tested FPGA prototype
• Sources are available on GitHub: https://github.com/srokicki/HybridDBT

Simty: a synthesizable SIMT CPU
• GPU-style SIMT execution assembles vector instructions across threads of SPMD applications
• An alternative to vector processing, based on a scalar ISA
• Simty: proof of concept for SIMT on general-purpose binaries
  • Runs the RISC-V instruction set (RV32I), with no ISA modification
  • Written in synthesizable VHDL
  • Warp size and warp count adjustable at synthesis time
  • 10-stage in-order pipeline
  • Scales up to 2048 threads per core with 64 warps x 32 threads
• More details on https://team.inria.fr/pacap/simty/

Questions?
https://github.com/srokicki/HybridDBT
https://team.inria.fr/pacap/simty/

FPGA synthesis of Simty
• On Altera Cyclone IV
[Charts: logic area (LEs), memory area (M9Ks) and frequency (MHz) as a function of the warp count x warp size configuration.]
• Up to 2048 threads per core: 64 warps x 32 threads
• Sweet spot: 8x8 to 32x16, trading latency hiding (multithreading depth) against throughput (SIMD width)
• The overhead of per-PC control is easily amortized
A toy sketch of the per-PC SIMT scheduling idea is given below.
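To illustrate the per-PC SIMT control that these bonus slides describe, the toy C sketch below models one scheduling step of a warp: pick one program counter among the active scalar threads (here the smallest PC, a reconvergence-friendly policy used purely for illustration), group every thread sitting at that PC into an execution mask, and issue a single instruction for all of them. This is a software thought experiment under assumptions, not Simty's actual pipeline or arbitration logic.

```c
#include <stdint.h>
#include <stdbool.h>

#define WARP_SIZE 32   /* assumed warp width; adjustable at synthesis in Simty */

typedef struct Warp {
    uint32_t pc[WARP_SIZE];     /* per-thread program counters  */
    bool     active[WARP_SIZE]; /* threads that have not exited */
} Warp;

/* One SIMT scheduling step (toy model): choose a PC among the active
 * threads, build the mask of threads sitting at that PC, and execute
 * the instruction once for all of them. Returns the execution mask,
 * or 0 when the whole warp has finished. */
uint32_t simt_step(Warp *w, void (*execute)(uint32_t pc, uint32_t mask))
{
    uint32_t next_pc = UINT32_MAX;
    for (int t = 0; t < WARP_SIZE; t++)
        if (w->active[t] && w->pc[t] < next_pc)
            next_pc = w->pc[t];        /* min-PC policy (illustrative) */
    if (next_pc == UINT32_MAX)
        return 0;                      /* warp finished */

    uint32_t mask = 0;
    for (int t = 0; t < WARP_SIZE; t++)
        if (w->active[t] && w->pc[t] == next_pc)
            mask |= 1u << t;           /* lanes sharing this fetch */

    execute(next_pc, mask);            /* one fetch/decode, many lanes */
    return mask;
}
```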