RAW2020, May 18-19, 2020

Kyutech

A -based for Deep Learning Processors

Qian Zhao1 ([email protected]) , Yasuhiro Nakahara2, Motoki Amagasaki2, Masahiro Iida2, and Takaichi Yoshida1

1 Kyushu Institute of Technology, 2 Kumamoto University

Japan Introduction A fixed ISA-based controller may limit the flexibility of DPU

Instruction mem (ISA-based Program) (Horizontal microcode) SW

HW We propose a Control Unit microcode-based (Instruction to control unit, which Once the control Control Unit control sequence) allows control logic logic is hard-wired, (Soft-sequencer) to be determined by we cannot change it . for feature upgrade, bug fix, etc. Processing Unit Processing Unit (PEs, Data mem, …) (PEs, Data mem, …)

ISA architecture Our proposal Provide full control over the DPU in a writable , and thus delivers high flexibility and shorter dev. life cycle.

1 ISA: Instruction set architecture, DPU: Deep learning processing unit DPU architecture using our approach

DPU Microcode MEM

Soft-sequencer-based Control Unit - SW-defined control sequence for entire Fixed system - Cycle-accurate Pre- & Post- Processing signal generation processing logics block array (PB array) Weight MEM Feature MEM

Datapath Control signals

Microcode-based control unit is suitable for DPU because: • DPUs have fewer instructions than general-purpose processors. • The control flow and data flow of DPUs are much less complex and more predictable than those of general-purpose processors. 2 Soft-sequencer control unit design

Soft-sequencer structure Microcode format

Control store Address Microcode

Address Microcode PC CUSTOM WAIT generation controller controller {pc[14:0],1'b0} unit (AGU)

CLK Output latches RSTL PURGE Ctrl signals from Address Custom ctrl microcode signals signals

Soft-sequencer fetches a microcode and generates control signal sequences according to the binary status saved in it. • Simple flow control like loops are implemented in the PC controller, allowing the size of the microcode to be significantly shortened. • The AGU holds and computes addresses in user memory. • A designer can add a custom controller to generate signals by combining other control signals. 3 State transition of PC and AGU design

PC=PC-1 Sequential execution INST_WAIT RSTL & WAIT=1 Waiting PURGE WAIT=0 PC=0 PC=PC+1 INST_JUMP

CNT= Initial Normal INST_COUNTER Loop execution PC=INST_PC

PC state Loop CNT=1 transition diagram CNT>1

PC=PC+1 CNT=CNT-1 INST_JUMP

ADDR=INST_ADD R RSTL INST_ADDR_WE Set with PURGE Microcode • In our DPU design, three addresses for ADDR=0 ADDR=ADDR+ !INST_ADDR_WEINST_ADDR_BAK INST_ADDR &ADDR_BAK==0 feature memory write, feature memory !INST_ADDR_BAK Initial Normal read, and weight memory read are INST_ADDR_BAK !INST_ADDR_BA&ADDR_BAK!=0 generated by the AGU. K ADDR_BAK=ADDR • Flexible address sequence generation, Address like: AUG state backup • 1, 2, 3, 4 transition diagram • 1, 3, 5, 7 ADDR=ADDR_BAK ADDR_BAK = 0 • 1, 2, 3, 4, 2, 3, 4, 5 4 (using backup & restore) Restore from backup Implementation case study

FPGA implementation Resource utilization • Device: ADM-PCIE-KU3 • DPU: 32*32 PB array • Freq.: 100MHz • Power: 1.255W in total, control unit(<1%), control store(<3%)

ASIC implementation • : TSMC 40-nm Processing • DPU: 64*64 PB array block array • Freq.: 350MHz • Area: control unit (5%), control store (5.2%) Weight MEM Weight Feature MEM Feature • Power: 0.516W in total, control unit(<0.1%), Soft-sequencer & control store(5.2%) Control store logic Control store

Microprogramming DPU layout • Demo DNN: 4 CNN layers, 2 FC layers, CIFAR-10 task • Lines-of-code: 759 lines of microprogram 5 Conclusion

• Problem: A fixed ISA-based controller may limit the flexibility of DPU.

• Approach: We propose a microcode-based control unit, which allows control logic to be determined by SW.

• Pros: Full control over the DPU in a writable firmware, and thus delivers high flexibility and shorter dev. life cycle.

• Result: We evaluate the proposed control unit using two DPU design cases implemented by FPGA and ASIC. The results show that high flexibility of the proposed control unit can be achieved without obvious performance overhead.

6