A Microcode-Based Control Unit for Deep Learning Processors
Total Page:16
File Type:pdf, Size:1020Kb
RAW2020, May 18-19, 2020 Kyutech A Microcode-based Control Unit for Deep Learning Processors Qian Zhao1 ([email protected]) , Yasuhiro Nakahara2, Motoki Amagasaki2, Masahiro Iida2, and Takaichi Yoshida1 1 Kyushu Institute of Technology, 2 Kumamoto University Japan Introduction A fixed ISA-based controller may limit the flexibility of DPU Instruction mem Control store (ISA-based Program) (Horizontal microcode) SW HW We propose a Control Unit microcode-Based (Instruction to control unit, which Once the control Control Unit control sequence) allows control logic logic is hard-wired, (Soft-sequencer) to Be determined By we cannot change it software. for feature upgrade, bug fix, etc. Processing Unit Processing Unit (PEs, Data mem, …) (PEs, Data mem, …) ISA architecture Our proposal Provide full control over the DPU in a writable firmware, and thus delivers high flexibility and shorter dev. life cycle. 1 ISA: Instruction set architecture, DPU: Deep learning processing unit DPU architecture using our approach DPU Microcode MEM Soft-sequencer-based Control Unit - SW-defined control sequence for entire Fixed datapath system - Cycle-accurate Pre- & Post- Processing signal generation processing logics block array (PB array) Weight MEM Feature MEM Datapath Control signals Microcode-based control unit is suitable for DPU because: • DPUs have fewer instructions than general-purpose processors. • The control flow and data flow of DPUs are much less complex and more predictable than those of general-purpose processors. 2 Soft-sequencer control unit design Soft-sequencer structure Microcode format Control store Address Microcode Address Microcode PC CUSTOM WAIT generation controller controller {pc[14:0],1'b0} unit (AGU) CLK Output latches RSTL PURGE Ctrl signals from Address Custom ctrl microcode signals signals Soft-sequencer fetches a microcode and generates control signal sequences according to the binary status saved in it. • Simple flow control like loops are implemented in the PC controller, allowing the size of the microcode to be significantly shortened. • The AGU holds and computes addresses in user memory. • A designer can add a custom controller to generate signals by combining other control signals. 3 State transition of PC and AGU design PC=PC-1 Sequential execution INST_WAIT RSTL & WAIT=1 Waiting PURGE WAIT=0 PC=0 PC=PC+1 INST_JUMP CNT= Initial Normal INST_COUNTER Loop execution PC=INST_PC PC state Loop CNT=1 transition diagram CNT>1 PC=PC+1 CNT=CNT-1 INST_JUMP ADDR=INST_ADD R RSTL INST_ADDR_WE Set with PURGE Microcode • In our DPU design, three addresses for ADDR=0 ADDR=ADDR+ !INST_ADDR_WEINST_ADDR_BAK INST_ADDR &ADDR_BAK==0 feature memory write, feature memory !INST_ADDR_BAK Initial Normal read, and weight memory read are INST_ADDR_BAK !INST_ADDR_BA&ADDR_BAK!=0 generated By the AGU. K ADDR_BAK=ADDR • FlexiBle address sequence generation, Address like: AUG state backup • 1, 2, 3, 4 transition diagram • 1, 3, 5, 7 ADDR=ADDR_BAK ADDR_BAK = 0 • 1, 2, 3, 4, 2, 3, 4, 5 4 (using Backup & restore) Restore from backup Implementation case study FPGA implementation Resource utilization • Device: ADM-PCIE-KU3 • DPU: 32*32 PB array • Freq.: 100MHz • Power: 1.255W in total, control unit(<1%), control store(<3%) ASIC implementation • Process: TSMC 40-nm Processing • DPU: 64*64 PB array block array • Freq.: 350MHz • Area: control unit (5%), control store (5.2%) Weight MEM Weight Feature MEM Feature • Power: 0.516W in total, control unit(<0.1%), Soft-sequencer & control store(5.2%) Control store logic Control store Microprogramming DPU layout • Demo DNN: 4 CNN layers, 2 FC layers, CIFAR-10 task • Lines-of-code: 759 lines of microprogram 5 Conclusion • Problem: A fixed ISA-based controller may limit the flexibility of DPU. • Approach: We propose a microcode-based control unit, which allows control logic to be determined by SW. • Pros: Full control over the DPU in a writable firmware, and thus delivers high flexibility and shorter dev. life cycle. • Result: We evaluate the proposed control unit using two DPU design cases implemented by FPGA and ASIC. The results show that high flexibility of the proposed control unit can be achieved without obvious performance overhead. 6.