RISC-V Hardware-Accelerated Dynamic Binary Translation
6th RISC-V Workshop
Simon Rokicki - IRISA / Université de Rennes 1
Steven Derrien - IRISA / Université de Rennes 1
Erven Rohou - Inria Rennes

Embedded Systems
Systems on a Chip

Tight constraints in:
• Power consumption
• Production cost
• Performance
• Complex heterogeneous designs
• Heterogeneity brings new power/performance trade-offs
[Chart: power vs. performance; an in-order core sits at low power and low performance, an out-of-order superscalar at high power and high performance, with a large overhead in moving from in-order to out-of-order]

• Are there better trade-offs?
[Chart: the same power/performance space with a VLIW added at low power and intermediate performance]
Architectural choice

[Diagram: an out-of-order pipeline (decode, renaming, reorder buffer) dynamically selects instructions ins1…ins4 for issue, while a VLIW issues instructions exactly as scheduled by the compiler]

Out-of-Order processor:
• Dynamic scheduling
• Performance portability
• Poor energy efficiency

VLIW processor:
• Static scheduling
• No portability
• High energy efficiency
The best of both worlds?
Dynamically translate native binaries into VLIW binaries:
• Performance close to an Out-of-Order processor
• Energy consumption close to a VLIW processor
RISC-V binaries → Dynamic Binary Translation → VLIW binaries → VLIW processor
Existing approaches
• Transmeta Code Morphing Software & Crusoe architectures
  • x86 on a VLIW architecture
  • User experience polluted by the cold-code execution penalty
• Nvidia Denver architecture
  • ARM on a VLIW architecture

• Translation overhead is critical
• Too little public information on these closed platforms
Our contribution
• Hardware-accelerated DBT framework
  • Makes DBT cheaper in time and energy
  • First approach that tries to accelerate the binary translation itself
• Open-source framework
  • Enables further research
RISC-V binaries → Dynamic Binary Translation (with hardware accelerators) → VLIW binaries → VLIW processor
Outline
• Hybrid-DBT Platform
  • How does it work?
  • What does it cost?
  • Focus on optimization levels
• Experimental Study
  • Impact on translation overhead
  • Impact on area utilization
  • Performance results
• Conclusion & Future work
How does it work?
• RISC-V binaries cannot be executed on the VLIW
How does it work?
• Direct, naive translation from native to VLIW binaries
• Does not take advantage of Instruction-Level Parallelism

Optimization level 0: Direct Translation (no ILP)
How does it work?
• Build an Intermediate Representation (CFG + dependencies)
• Reschedule instructions on the VLIW execution units

Optimization level 0: Direct Translation (no ILP)
Optimization level 1: IR Builder → IR → IR Scheduler (ILP)
How does it work?
• Code profiling to detect hotspots
• Optimization level 1 is applied only on hotspots

Optimization level 0: Direct Translation (no ILP) + inserted profiling
Optimization level 1: IR Builder → IR → IR Scheduler (ILP)
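The profiling step above can be sketched in software terms. This is a minimal illustration, assuming one execution counter per translated block and a fixed promotion threshold; the names and the threshold value are invented for the sketch and are not taken from the Hybrid-DBT sources:

```c
#include <stdint.h>
#include <stdbool.h>

#define HOTSPOT_THRESHOLD 64  /* illustrative: promote after 64 executions */

/* One counter per translated block; level-0 code increments it on entry. */
typedef struct {
    uint32_t src_addr;   /* address of the block in the RISC-V binary   */
    uint32_t exec_count; /* incremented by the profiling stub           */
    bool     optimized;  /* already rescheduled at optimization level 1 */
} block_profile;

/* Called from the profiling stub inserted by the level-0 translator.
 * Returns true exactly once, when the block crosses the threshold and
 * should be handed to the IR builder / IR scheduler (level 1). */
bool profile_tick(block_profile *b) {
    if (b->optimized)
        return false;
    b->exec_count++;
    if (b->exec_count >= HOTSPOT_THRESHOLD) {
        b->optimized = true; /* promote once, then stop counting */
        return true;
    }
    return false;
}
```

A profiling stub inserted by the level-0 translation would call `profile_tick` on block entry; when it returns true, the block is promoted to optimization level 1.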
What does it cost?

• Cycle/instr: the number of cycles needed to translate one RISC-V instruction
• Need to accelerate the time-consuming parts of the translation

Optimization level 0: Direct Translation (no ILP) - 150 cycles/instr, inserts profiling
Optimization level 1: IR Builder - 400 cycles/instr; IR Scheduler - 500 cycles/instr
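Given these per-instruction translation costs, a back-of-the-envelope model shows why level 1 is reserved for hotspots. The helper below is purely illustrative (not part of the framework) and only assumes the 150 and 400+500 cycles/instruction figures quoted above:

```c
#include <stdint.h>

/* Software DBT costs from the slide, in cycles per RISC-V instruction. */
#define LEVEL0_COST 150u          /* direct translation        */
#define LEVEL1_COST (400u + 500u) /* IR builder + IR scheduler */

/* Cycles spent translating a block of n instructions at each level. */
uint64_t translate_cost_level0(uint64_t n) { return n * LEVEL0_COST; }
uint64_t translate_cost_level1(uint64_t n) { return n * LEVEL1_COST; }

/* Number of times a block must run for level-1 optimization to pay off,
 * given the cycles saved per execution by the rescheduled code. */
uint64_t break_even_runs(uint64_t n, uint64_t saved_per_run) {
    uint64_t cost = translate_cost_level1(n);
    return (cost + saved_per_run - 1) / saved_per_run; /* ceiling */
}
```

For a 100-instruction block, level-1 translation costs 90,000 cycles, so it only pays off for code that is re-executed many times - exactly what the profiling step detects.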
Hybrid-DBT framework
• Hardware acceleration of the critical steps of DBT
• Can be seen as a hardware-accelerated compiler back-end
Optimization level 0: First-Pass Translation (software pass, no ILP) + inserted profiling
Optimization level 1: IR Builder → IR → IR Scheduler (hardware accelerators, ILP)
Optimization level 0
• Implemented as a Finite State Machine
• Translates each native instruction separately
• Produces 1 VLIW instruction per cycle
• 1 RISC-V instruction => up to 2 VLIW instructions
• Simple because the two ISAs are similar
[Diagram: a RISC-V instruction word (imm12 | rs1 | funct | rd | opcode) is remapped by the First-Pass Translation into a VLIW instruction word (imm13 | rs1 | rd | opcode)]
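The field remapping above can be illustrated for an RV32I I-type instruction such as ADDI. The RISC-V field positions below follow the ISA specification; the VLIW encoding widths are invented for the sketch (the real Hybrid-DBT encoding differs), and the up-to-two-output case for instructions with no direct VLIW equivalent is left out:

```c
#include <stdint.h>

/* Decoded fields of a RISC-V I-type instruction (e.g. ADDI). */
typedef struct {
    int32_t imm;     /* sign-extended 12-bit immediate */
    uint8_t rs1, rd;
    uint8_t funct3;
    uint8_t opcode;
} rv_itype;

rv_itype decode_itype(uint32_t insn) {
    rv_itype d;
    d.opcode = insn & 0x7f;
    d.rd     = (insn >> 7)  & 0x1f;
    d.funct3 = (insn >> 12) & 0x07;
    d.rs1    = (insn >> 15) & 0x1f;
    d.imm    = (int32_t)insn >> 20; /* arithmetic shift sign-extends */
    return d;
}

/* Re-encode into an illustrative VLIW syllable: imm13 | rs1 | rd | opcode
 * (13 + 5 + 5 + 9 bits; widths chosen for the sketch only). */
uint32_t encode_vliw(rv_itype d, uint16_t vliw_opcode) {
    return ((uint32_t)(d.imm & 0x1fff) << 19) |
           ((uint32_t)d.rs1 << 14) |
           ((uint32_t)d.rd  << 9)  |
           (vliw_opcode & 0x1ff);
}
```

Since both ISAs are simple RISC encodings, the FSM mostly shuffles bit fields like this, which is why a single native instruction can be handled per cycle.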
Optimization level 1
VLIW binaries → IR Builder → IR → IR Scheduler → VLIW binaries
• Build a higher-level intermediate representation
• Perform instruction scheduling
• Critical to start exploiting the VLIW capabilities
Choosing an Intermediate Representation
IR advantages:
• Direct access to dependencies and successors
• Regular structure (no pointers, no variable size)
[Figure: each IR instruction is a fixed-size 128-bit word with fields nbDep, nbDSucc, nbSucc, op, registers[4] and succNames[8]; the example encodes the dependency graph of a small sequence over g1, g2, g3:
0 - st @g3 = 0
1 - ld r1 = @g2
2 - addi g1 = g1, 1
3 - sub r3 = r1, g1
4 - st @g2 = r3
5 - mov r3 = 0]
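The fixed-size encoding lends itself to a plain struct with no pointers and no variable-size fields, which is what makes it hardware-friendly. A minimal sketch, with field widths chosen for illustration rather than copied from the actual 128-bit layout:

```c
#include <stdint.h>

/* One IR instruction packed into a fixed-size 128-bit word (16 bytes).
 * Counts, opcode and operands sit next to up to eight 8-bit successor
 * names, so the scheduler can index dependencies directly. */
typedef struct {
    uint8_t nbDep;        /* unresolved incoming dependencies      */
    uint8_t nbDSucc;      /* successors through data dependencies  */
    uint8_t nbSucc;       /* total successors                      */
    uint8_t op;           /* opcode                                */
    uint8_t registers[4]; /* destination + source operands         */
    uint8_t succNames[8]; /* indices of successor IR instructions  */
} ir_insn;

/* Record that instruction `succ` depends on `ins`. */
void add_successor(ir_insn *ins, uint8_t succ, int is_data_dep) {
    ins->succNames[ins->nbSucc++] = succ;
    if (is_data_dep)
        ins->nbDSucc++;
}
```

Because every instruction occupies exactly 16 bytes, the IR Builder and IR Scheduler accelerators can address it with simple index arithmetic instead of pointer chasing.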
Details on hardware accelerators
VLIW binaries → IR Builder (one-pass dependency analysis) → IR → IR Scheduler (list-scheduling algorithm) → VLIW binaries
• Developing such accelerators directly in VHDL is out of reach
• Accelerators are developed using High-Level Synthesis:
  • Loop unrolling/pipelining
  • Memory partitioning
  • Memory-access factorization
  • Explicit forwarding
• See the DATE'17 paper for more details
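The list-scheduling step can be sketched as follows. This is a generic greedy list scheduler, assuming a dependency DAG, unit latencies and a hypothetical 4-issue VLIW; the real IR Scheduler accelerator additionally models execution-unit types and instruction latencies:

```c
#include <stdint.h>

#define ISSUE_WIDTH 4   /* illustrative 4-issue VLIW */
#define MAX_INSNS   32

typedef struct {
    uint8_t nbDep;   /* unresolved predecessors                */
    uint8_t nbSucc;
    uint8_t succ[8]; /* indices of dependent instructions      */
    int     cycle;   /* assigned issue cycle (filled in here)  */
} sched_insn;

/* Greedy list scheduling over a dependency DAG: each cycle, issue up
 * to ISSUE_WIDTH instructions whose dependencies are all resolved
 * (unit latency assumed).  Returns the schedule length in cycles. */
int list_schedule(sched_insn *g, int n) {
    int scheduled = 0, cycle = 0;
    int done[MAX_INSNS] = {0};
    while (scheduled < n) {
        int issued[ISSUE_WIDTH], count = 0;
        /* pick this cycle's ready instructions first... */
        for (int i = 0; i < n && count < ISSUE_WIDTH; i++)
            if (!done[i] && g[i].nbDep == 0) {
                g[i].cycle = cycle;
                issued[count++] = i;
            }
        /* ...then release their successors for later cycles */
        for (int k = 0; k < count; k++) {
            sched_insn *ins = &g[issued[k]];
            done[issued[k]] = 1;
            for (int s = 0; s < ins->nbSucc; s++)
                g[ins->succ[s]].nbDep--;
            scheduled++;
        }
        cycle++;
    }
    return cycle;
}
```

Releasing successors only after the whole cycle's batch has been selected keeps dependent instructions in later cycles, which matches the unit-latency assumption of the sketch.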
Impact on translation overhead
• VLIW baseline is executed with ST200simVLIW
• Fully functional Hybrid-DBT platform on FPGA
  • JIT processor: Nios II
  • Altera DE2-115 board
Optimization level 0: First-Pass Translation - 150 cycles/instr (software)
Optimization level 1: IR Builder - 400 cycles/instr; IR Scheduler - 500 cycles/instr (software)

Impact on translation overhead
• Cost of optimization level 0 using the hardware accelerator

[Chart: cycles/instruction (0–5) and speed-up vs. the software DBT (up to 200×) for the First-Pass Translator, IR Builder and IR Scheduler]

Impact on translation overhead
• Cost of optimization level 1 using the hardware accelerators

[Chart: cycles/instruction per benchmark for the First-Pass Translator, IR Builder and IR Scheduler, with the corresponding speed-ups vs. the software DBT (speed-up axis up to 40×)]

Impact on area/resource cost
• Resource usage for all platform components (ASIC 65 nm, NAND-equivalent gates):

VLIW: 19,220
DBT processor: 7,626
IR Scheduler: 6,300
First-Pass Translator: 5,019
IR Builder: 779

Everything except the VLIW itself is the area overhead of Hybrid-DBT.
Performance results

• Comparison against OoO architectures
• Compare area, power consumption and performance with BOOM

[Charts: power consumption (mW) of BOOM vs. a 4-issue VLIW (roughly 5× lower for the VLIW), and speed-up of the OoO relative to the VLIW on adpcm dec, g721 dec, g721 enc and matmul, with an annotation for a benchmark that stays at optimization level 0]
Conclusion
• Presentation of the Hybrid-DBT framework
  • Hardware-accelerated DBT
  • Open-source RISC-V-to-VLIW DBT framework
  • Tested FPGA prototype
• Sources are available on GitHub: https://github.com/srokicki/HybridDBT
RISC-V binaries → Dynamic Binary Translation (with hardware accelerators) → VLIW binaries → VLIW processor
Simty: a synthesizable SIMT CPU

• GPU-style SIMT execution assembles vector instructions across threads of SPMD applications
• Alternative to vector processing, based on a scalar ISA
• Simty: proof of concept for SIMT on general-purpose binaries
• Runs the RISC-V instruction set (RV32I) with no modification
• Written in synthesizable VHDL
• Warp size and warp count adjustable at synthesis time
• 10-stage in-order pipeline
• Scales up to 2048 threads per core with 64 warps × 32 threads
More details at https://team.inria.fr/pacap/simty/

Questions?
https://github.com/srokicki/HybridDBT
https://team.inria.fr/pacap/simty/
FPGA synthesis of Simty (Altera Cyclone IV)
[Charts: logic area (LEs), memory area (M9Ks) and frequency (MHz) across Simty configurations]
• Up to 2048 threads per core: 64 warps × 32 threads
• Sweet spot: 8×8 to 32×16 (multithreading depth for latency hiding × SIMD width for throughput)
• Overhead of per-PC control is easily amortized