A Stochastic-Computing based Deep Learning Framework using Adiabatic Quantum-Flux-Parametron Superconducting Technology

Ruizhe Cai, Ao Ren (Northeastern University, USA) {cai.ruiz,ren.ao}@husky.neu.edu
Olivia Chen (Yokohama National University, Japan) [email protected]
Ning Liu, Caiwen Ding (Northeastern University, USA) {liu.ning,ding.ca}@husky.neu.edu
Xuehai Qian (University of Southern California, USA) [email protected]
Jie Han (University of Alberta, Canada) [email protected]
Wenhui Luo (Yokohama National University, Japan) [email protected]
Nobuyuki Yoshikawa (Yokohama National University, Japan) [email protected]
Yanzhi Wang (Northeastern University, USA) [email protected]

arXiv:1907.09077v1 [cs.NE] 22 Jul 2019

ABSTRACT
The Adiabatic Quantum-Flux-Parametron (AQFP) superconducting technology has been recently developed, achieving the highest energy efficiency among superconducting logic families and potentially a 10^4-10^5 gain compared with state-of-the-art CMOS. In 2016, the successful fabrication and testing of AQFP-based circuits at the scale of 83,000 JJs demonstrated the scalability and potential of implementing large-scale systems using AQFP. As a result, AQFP is promising for high-performance computing and deep space applications, with Deep Neural Network (DNN) inference acceleration as an important example.

Besides ultra-high energy efficiency, AQFP exhibits two unique characteristics. The first is its deep-pipelining nature: each AQFP logic gate is connected to an AC clock signal, which increases the difficulty of avoiding RAW hazards. The second is the unique opportunity of true random number generation (RNG) using a single AQFP buffer, far more efficient than RNG in CMOS. We point out that these two characteristics make AQFP especially compatible with the stochastic computing (SC) technique, which uses a time-independent bit sequence for value representation and is compatible with the deep-pipelining nature. Further, the application of SC to DNNs has been investigated in prior work, and its suitability has been illustrated since SC is more compatible with approximate computation.

This work is the first to develop an SC-based DNN acceleration framework using AQFP technology. The deep-pipelining nature of AQFP circuits translates into the difficulty of designing accumulators/counters in AQFP, which makes the prior SC-based DNN design unsuitable.
We overcome this limitation by taking into account different properties of CONV and FC layers: (i) the inner product calculation in FC layers has a larger number of inputs than that in CONV layers; (ii) an accurate activation function is critical in CONV rather than FC layers. Based on these observations, we propose (i) accurate integration of the summation and activation function in CONV layers using a bitonic sorting network and a feedback loop, and (ii) a low-complexity categorization block for FC layers based on a chain of majority gates. For a complete design we also develop (i) an ultra-efficient stochastic number generator in AQFP, (ii) a high-accuracy sub-sampling (pooling) block in AQFP, and (iii) majority synthesis for further performance improvement, together with automatic buffer/splitter insertion to meet the requirements of AQFP circuits. Experimental results suggest that the proposed SC-based DNN using AQFP can achieve up to 6.8 × 10^4 times higher energy efficiency compared to a CMOS-based implementation while maintaining 96% accuracy on the MNIST dataset.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
ISCA '19, June 22–26, 2019, Phoenix, AZ, USA
© 2019 Association for Computing Machinery. ACM ISBN 978-1-4503-6669-4/19/06...$15.00
https://doi.org/10.1145/3307650.3322270

Figure 1: Junction-level schematic of an AQFP buffer.
KEYWORDS
Stochastic Computing, Deep Learning, Adiabatic Quantum-Flux-Parametron, Superconducting

ACM Reference Format:
Ruizhe Cai, Ao Ren, Olivia Chen, Ning Liu, Caiwen Ding, Xuehai Qian, Jie Han, Wenhui Luo, Nobuyuki Yoshikawa, and Yanzhi Wang. 2019. A Stochastic-Computing based Deep Learning Framework using Adiabatic Quantum-Flux-Parametron Superconducting Technology. In The 46th Annual International Symposium on Computer Architecture (ISCA '19), June 22–26, 2019, Phoenix, AZ, USA. ACM, New York, NY, USA, 12 pages. https://doi.org/10.1145/3307650.3322270

1 INTRODUCTION
Wide-ranging applications of deep neural networks (DNNs) in image classification, computer vision, autonomous driving, embedded and IoT systems, etc., call for high-performance and energy-efficient implementation of the inference phase of DNNs. To simultaneously achieve high performance and energy efficiency, hardware acceleration of DNNs, including FPGA- and ASIC-based implementations, has been extensively investigated [4, 6-9, 12, 13, 17-20, 28, 29, 33, 34, 37, 39-41, 48, 49, 51, 53-55]. However, most of these designs are CMOS based and suffer from a performance limitation as Moore's Law is reaching its end.

Being widely known for low energy dissipation and ultra-fast switching speed, Josephson Junction (JJ) based superconductor logic families have been proposed and implemented to process analog and digital signals [24]. Superconductor logic is perceived to be an important candidate to replace state-of-the-art CMOS due to its superior potential in operation speed and energy efficiency, as recognized by the U.S. IARPA C3 and SuperTools programs and the Japan MEXT-JSPS project. Adiabatic quantum-flux-parametron (AQFP) logic is an energy-efficient superconductor logic family based on the quantum-flux-parametron (QFP) [26]. AQFP logic achieves high energy efficiency by adopting adiabatic switching [23], in which the potential energy profile evolves from a single well to a double well so that the logic state can change quasi-statically. The energy-delay product (EDP) of AQFP circuits fabricated using processes such as the AIST standard process 2 (STP2) [30] and the MIT-LL SFQ process [47] is only three orders of magnitude larger than the quantum limit [45]. AQFP can potentially achieve a 10^4-10^5 energy efficiency gain compared with state-of-the-art CMOS (even two orders of magnitude energy efficiency gain when cooling energy is accounted for), with a clock frequency of several GHz. Recently, the successful fabrication and testing of AQFP-based implementations at the scale of 83,000 JJs have demonstrated the scalability and potential of implementing large-scale circuits/systems using AQFP [31]. As a result, the AQFP technology is promising for high-performance computing, with DNN inference acceleration an important example.

The AQFP technology uses AC bias/excitation currents as both multi-phase clock signal and power supply [43] to mitigate the power consumption overhead of DC bias in other superconducting logic technologies. Besides the ultra-high energy efficiency, AQFP exhibits two unique characteristics. The first is the deep-pipelining nature: each AQFP logic gate is connected to an AC clock signal and occupies one clock phase, which increases the difficulty of avoiding RAW (Read after Write) hazards in conventional binary computing. The second is the unique opportunity of true random number generation (RNG) using a single AQFP buffer (double JJ) [16, 44], which is far more efficient than RNG in CMOS.

We point out that these two characteristics make the AQFP technology especially compatible with the stochastic computing (SC) technique [14][1], which allows the implementation of basic operations using an extremely small hardware footprint. SC uses a time-independent bit sequence for value representation and lacks RAW dependency among the bits in the stream. As a result, it is compatible with the deep-pipelining nature of superconducting logic. Furthermore, one important limiting factor of SC, i.e., the overhead of RNG [27][32], can be largely mitigated in AQFP.

SC is inherently an approximate computing technique [15][52], and there have been disputes about the suitability of SC for precise computing applications [3]. On the other hand, the DNN inference engine is essentially an approximate computing application. This is because the final classification result depends on the relative score/logit values of different classes instead of absolute values. Recent work [35][3] has pointed out the suitability of SC for DNN acceleration, and [50] has further proved the equivalence between SC-based DNNs and binary neural networks, where the latter originate from the deep learning society [11]. All the above discussions suggest the potential to build SC-based DNN acceleration using AQFP technology.

This paper is the first to develop an SC-based DNN acceleration framework using AQFP superconducting technology.

Figure 2: Example of AQFP logic gates: (a) MAJ, (b) AND, (c) NOR, (d) SPLITTER.
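The majority-based gate set of Fig. 2 can be sanity-checked with a toy Boolean model. This is an illustrative sketch, not an AQFP-level simulation; the `or_gate` variant (majority with a constant '1') is our extrapolation and not a construction given in the paper.

```python
def maj(a: int, b: int, c: int) -> int:
    """3-input majority: output is 1 when at least two inputs are 1 (Fig. 2(a))."""
    return (a & b) | (b & c) | (a & c)

def and_gate(a: int, b: int) -> int:
    """AND as a majority gate with a constant-0 third input (Fig. 2(b))."""
    return maj(a, b, 0)

def or_gate(a: int, b: int) -> int:
    """OR as a majority gate with a constant-1 input (our extrapolation)."""
    return maj(a, b, 1)

# Exhaustive truth-table check of both derived gates
for a in (0, 1):
    for b in (0, 1):
        assert and_gate(a, b) == (a & b)
        assert or_gate(a, b) == (a | b)
```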

We adopt the bipolar format in SC because weights and inputs can be positive or negative, and we build the stochastic number generation block in AQFP with ultra-high efficiency. The deep-pipelining nature of AQFP circuits translates into the difficulty of designing accumulators/counters in AQFP, which makes the prior SC-based DNN design [35] unsuitable. We overcome this limitation by taking into account the different properties of CONV and FC layers. In summary, the contributions are listed as follows: (i) an ultra-efficient stochastic number generator in AQFP; (ii) integration of summation and activation function using bitonic sorting in CONV layers; (iii) a high-accuracy sub-sampling block in AQFP; (iv) a low-complexity categorization block for FC layers; (v) majority synthesis for further performance improvement and automatic buffer/splitter insertion for AQFP requirements. Experimental results suggest that the proposed SC-based DNN using AQFP can achieve up to 6.9 × 10^4 higher energy efficiency compared to its CMOS-implemented counterpart and 96% accuracy on the MNIST dataset [21].

Figure 3: (a) Four-phase clocking scheme for AQFP circuits; (b) data propagation between neighbouring clock phases.
The proposed blocks can be up to 7.76 × 10^5 times more energy efficient than their CMOS implementations. Simulation results suggest that the proposed sorter-based DNN blocks can achieve extremely low inaccuracy. In addition, we have successfully verified the functionality of a feature extraction chip, which is fabricated using the AIST 10 kA/cm^2 Niobium high-speed standard process (HSTP), embedded in a cryoprobe, and inserted into a liquid Helium Dewar to cool down to 4.2 K.

2 BACKGROUND

2.1 AQFP Superconducting Logic
As shown in Fig. 1, the basic logic structure of AQFP circuits is the buffer, consisting of a double-Josephson-Junction SQUID [10]. An AQFP logic gate is mainly driven by AC power, which serves as both excitation current and power supply. By applying the excitation current Ix, typically on the order of hundreds of micro-amperes, excitation fluxes are applied to the superconducting loops via L1, L2, Lx1 and Lx2. Depending on the small input current Iin, either the left or the right loop stores one single flux quantum. Consequently, the device is capable of acting as a buffer cell, as the logic state can be represented by the direction of the output current Iout. The storage position of the quantum flux in the left or the right loop can be encoded as logic '1' or '0'.

The AQFP inverter and constant cell are designed from the AQFP buffer. The AQFP inverter is implemented by negating the coupling coefficient of the output in the AQFP buffer, while the constant gate in AQFP is designed using asymmetric excitation flux inductance in the AQFP buffer. The AQFP splitter is also implemented based on the AQFP buffer, as shown in Fig. 2. Please note that, unlike CMOS gates, which are connected to fan-out directly, all AQFP gates need to be connected to splitters for fan-out.

The AQFP standard cell library is built via the minimalist design approach [46], in other words, designing more complicated gates in a bottom-up manner. The logic gate shown in Fig. 2 (a) is a majority gate, as the output 'd' depends on the number of '1's at the inputs 'a' to 'c'. Replacing any buffer with an inverter produces a variation of the majority gate (such as the minority gate). As illustrated in Fig. 2 (b)-(c), an AQFP AND gate is implemented by two buffers and one constant '0', while an AQFP NOR gate is implemented by two inverters and one constant '1'. Unlike conventional CMOS technology, both combinational and sequential AQFP logic cells are driven by AC power. In addition, the AC power serves as a clock signal to synchronize the outputs of all gates in the same clock phase. Consequently, data propagation in AQFP circuits requires overlapping of clock signals from neighbouring phases. Fig. 3 presents an example of the typical clocking scheme of AQFP circuits and the data flow between neighbouring clock phases. In this clocking scheme, each AQFP logic gate is connected with an AC clock signal and occupies one clock phase, which makes AQFP circuits "deep-pipelining" in nature. Such a clock-based synchronization characteristic also requires that all inputs of each gate have the same delay (in clock phases) from the primary inputs.

2.2 Stochastic Computing
Stochastic computing (SC) is a paradigm that represents a number, named a stochastic number, by counting the number of ones in a bit-stream. For example, the bit-stream 0100110100 contains four ones in a ten-bit stream; thus it represents x = P(X = 1) = 4/10 = 0.4.
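The unipolar representation just described can be checked numerically with a short software model; the stream length and values here are illustrative only.

```python
import random

rng = random.Random(11)
n = 4096
a, b = 0.5, 0.4            # unipolar values in [0, 1]
# Encode each value as a bit-stream with P(1) equal to the value
sa = [1 if rng.random() < a else 0 for _ in range(n)]
sb = [1 if rng.random() < b else 0 for _ in range(n)]
# Unipolar multiplication: bit-wise AND of two independent streams
sab = [u & v for u, v in zip(sa, sb)]
print(sum(sab) / n)        # close to a * b = 0.2
```

The decoded fraction of ones converges to a·b as the stream length grows, which is the single-AND-gate multiplication discussed below.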

Figure 4: (a) Unipolar encoding and (b) bipolar encoding. (c) AND gate for unipolar multiplication. (d) XNOR gate for bipolar multiplication. (e) MUX gate for addition.

Figure 5: Feature extraction block of CMOS-based SC-based DNN implementations in prior work.

In the bit-stream, each bit is independent and identically distributed (i.i.d.) and can be generated in hardware using stochastic number generators (SNGs). Obviously, the length of the bit-stream can significantly affect the calculation accuracy in SC [36]. In addition to this unipolar encoding format, SC can also represent numbers in the range of [-1, 1] using the bipolar encoding format. In this scenario, a real number x is processed by P(X = 1) = (x + 1)/2. Thus 0.4 can be represented by 1011011101, as P(X = 1) = (0.4 + 1)/2 = 7/10, and -0.5 can be represented by 10010000, as shown in Figure 4(b), with P(X = 1) = (-0.5 + 1)/2 = 2/8.

Compared to conventional computing, the major advantage of stochastic computing is the significantly lower hardware cost for a large category of arithmetic calculations. A summary of the basic computing components in SC, such as multiplication and addition, is shown in Figure 4. As an illustrative example, a unipolar multiplication can be performed by a single AND gate since P(A · B = 1) = P(A = 1)P(B = 1), and a bipolar multiplication is performed by a single XNOR gate since c = 2P(C = 1) - 1 = 2(P(A = 1)P(B = 1) + P(A = 0)P(B = 0)) - 1 = (2P(A = 1) - 1)(2P(B = 1) - 1) = ab. Besides multiplications and additions, SC-based activation functions have also been developed [35][5]. As a result, SC has become an interesting and promising approach to implement DNNs [38][3][22] with high performance/energy efficiency and minor accuracy degradation.

3 MOTIVATION AND CHALLENGES OF THE PROPOSED WORK
One observation we make is that the SC technique, with its extremely small hardware footprint compared with binary computing, is especially compatible with the deep-pipelining nature of AQFP circuits. This is because SC uses a time-independent bit sequence for value representation, and there is no data dependency across a number of consecutive clock cycles (equal to the bit-stream length, ranging from 128 to 2,048). In this way, RAW hazards can be avoided without bubble insertion, which is unavoidable in conventional binary computing implemented using AQFP. Additionally, it is important to note that the overhead of RNG, which is significant in SC using CMOS logic (it even accounts for 40%-60% of the hardware footprint), can be largely mitigated in AQFP. This is because of the extremely high efficiency of true RNG implementation (instead of pseudo-RNG) in AQFP, thanks to its unique operating nature.

As illustrated in Fig. 7 (a), a 1-bit true RNG can be implemented using the equivalent circuit of an AQFP buffer. The direction of the output current Iout represents the logic state 1 or 0, and Iout is determined by the input current Iin. However, when Iin = 0, the output current Iout will be randomly 0 or 1 depending on the thermal noise, as shown in Fig. 7 (b). As a result, an on-chip RNG is achieved with two JJs in AQFP, and an independent random bit (0 or 1) is generated in each clock cycle.

In a nutshell, we forecast SC to become a competitive technique for AQFP logic on a wide set of applications. For the acceleration of the DNN inference phase in particular, SC has been demonstrated as a competitive technique to reduce hardware footprint while limiting accuracy loss [2][35]. Compared with precise computing applications [3], SC is more suitable for approximate computing, of which DNNs are an example. Moreover, the equivalence between SC-based DNNs and binary neural networks (BNNs) has recently been proven [50]. As the latter originate from the deep learning society [11] and many accuracy enhancement techniques have been developed, these advances can be migrated to bring the accuracy of SC-based DNNs close to software-based, floating-point accuracy levels.

Despite the above advantages of SC-based DNNs using AQFP, certain challenges need to be overcome to realize an efficient implementation. One major challenge is the difficulty of implementing accumulators/counters due to the super-deep pipelining nature. This is because one single addition takes multiple clock phases, say n, in the deep-pipelining structure. The accumulation operation can then only be executed once every n clock phases to avoid RAW hazards. This results in throughput degradation, especially when a large number is allowed in the accumulator (as in the baseline structure of SC-based DNNs [35], as we shall see later). Similarly, efficient implementation of finite state machines (FSMs) in AQFP is also challenging.

4 PROPOSED SC-BASED DNN FRAMEWORK USING AQFP

4.1 System Architecture: Difference from the Prior Arts
Fig. 5 demonstrates an M-input feature extraction block as defined in the prior work [35] on SC-based DNNs (built in CMOS), which serves as a basic building block in CONV layers for feature extraction and in FC layers for categorization. This structure performs the inner product and activation calculation in the SC domain. The whole SC-based DNN chip comprises parallel and/or cascaded connections of such structures: the former for computing multiple outputs within a DNN layer, the latter for cascaded DNN layers.

The functionality of the feature extraction block is to implement yi = ψ(wi^T x + bi), where x and wi are the input vector and one weight vector, respectively; bi is a bias value; and ψ is the activation function. In the CMOS-based implementation of the feature extraction block [35], multiplication in the inner product calculation is implemented using XNOR gates (both weights and inputs are bipolar stochastic numbers); summation is implemented using an approximate parallel counter (APC) for higher accuracy or using a mux-tree (adder-tree) for a low hardware footprint; activation is implemented using a binary counter (for the APC output) or an FSM (for the mux-tree output).

The aforesaid challenge in accumulator/counter and FSM implementations in AQFP translates into the difficulty of a direct, AQFP-based implementation of the above feature extraction block. More specifically, the binary counter or FSM that is necessary for the activation function implementation can only be executed once in multiple clock cycles to avoid RAW hazards, resulting in throughput degradation. This limitation needs to be overcome in our proposed SC-based DNN architecture using AQFP.

Fig. 6 demonstrates our proposed SC-based DNN architecture using AQFP, which uses different implementation structures of inner product/activation for CONV layers and FC layers. For CONV layers, the number of inputs for the inner product computation is smaller than that for FC layers. Meanwhile, the requirement on the accumulation operation is more significant (in terms of effect on overall accuracy) in the CONV layers than in the FC layers. Based on these observations, we propose (i) accurate integration of summation and activation function in CONV layers using a bitonic sorting network and a feedback loop, and (ii) a low-complexity categorization block for FC layers based on a chain of majority gates. The former is more accurate and more complicated, which suits CONV layers, which have a higher impact on overall accuracy but a smaller number of inputs. The latter has low complexity and suits FC layers, which have a lower impact on overall accuracy but a large number of inputs. The overall accuracy loss will be minor and under control, as shall be shown in the experiments.

In the following subsections, we describe in detail the proposed feature extraction block for CONV layers and the categorization block for FC layers. Despite their distinct structures, both blocks essentially implement the inner product and activation yi = ψ(wi^T x + bi) as discussed before, and multiplication is implemented with XNOR in the SC domain. The inputs of both blocks come from the previous DNN layer or from the primary inputs, and are stochastic numbers in principle. The weights are stored (hardwired) in the AQFP chip in binary format to reduce the hardware footprint, and are transformed to stochastic format through RNGs and comparators. In the following, we also describe (i) the stochastic number generator that can be implemented with ultra-high efficiency in AQFP, and (ii) an accurate sub-sampling block in AQFP for the pooling operation, which has higher accuracy than the version in [35].

A stochastic number generator (SNG) is a key part of the SC scheme, which converts a binary number to its corresponding stochastic format. This is achieved via comparison between the binary number (to convert) and a stream of random numbers [15][1], and a uniformly distributed RNG
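The bipolar encoding and XNOR multiplication used by both blocks can be checked numerically with a short software model; the stream length and the values below are illustrative assumptions, not parameters from the paper.

```python
import random

def to_bipolar_stream(x, n, rng):
    """Encode x in [-1, 1] as an n-bit bipolar stochastic stream, P(1) = (x+1)/2."""
    p = (x + 1) / 2
    return [1 if rng.random() < p else 0 for _ in range(n)]

def from_bipolar_stream(s):
    """Decode a bipolar stochastic stream back to a value in [-1, 1]."""
    return 2 * sum(s) / len(s) - 1

rng = random.Random(7)
n = 8192
a, b = 0.6, -0.5
sa = to_bipolar_stream(a, n, rng)
sb = to_bipolar_stream(b, n, rng)
# Bipolar multiplication: bit-wise XNOR of two independent streams
sprod = [1 - (u ^ v) for u, v in zip(sa, sb)]
print(from_bipolar_stream(sprod))  # close to a * b = -0.3
```

Because the streams are independent, the XNOR output stream decodes to approximately a·b, matching the identity c = (2P(A=1)-1)(2P(B=1)-1) from Sec. 2.2.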

is needed for this conversion process. In AQFP, an n-bit true RNG can be implemented using n 1-bit true RNGs, whose layout is shown in Fig. 9, each using two JJs for an efficient implementation as discussed in the previous section. An SNG can then be implemented using the n-bit true RNG and an n-bit comparator, where n is the number of bits of the source binary number.

For more efficient hardware utilization, we propose a true random number generator matrix, in which each true RNG unit can be exploited in more than one SNG. As shown in Fig. 8, an N × N RNG matrix is capable of generating 4N N-bit random numbers in parallel. Each true RNG unit is shared among four random numbers. This design maintains limited correlation because all unit RNGs are independent and each two output random numbers share only a single bit in common.

Figure 6: Proposed SC-based DNN architecture using AQFP.
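The SNG construction (uniform RNG plus comparator) can be sketched in software as follows. The seeded PRNG here is a software stand-in for the AQFP true RNG, and the parameter names are our own, not from the paper.

```python
import random

def sng(value_bin, n, length, rng):
    """Binary-to-stochastic conversion: emit a 1 whenever the n-bit binary
    input exceeds a fresh n-bit uniform random number (comparator output)."""
    return [1 if value_bin > rng.randrange(2 ** n) else 0 for _ in range(length)]

rng = random.Random(3)
n = 8                      # source binary precision (bits)
value_bin = 192            # represents 192/256 = 0.75 in unipolar format
stream = sng(value_bin, n, 4096, rng)
print(sum(stream) / len(stream))  # close to 0.75
```

Each output bit is 1 with probability value_bin / 2^n, so the stream decodes to the intended unipolar value.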

Figure 7: (a) 1-bit true RNG in AQFP; (b) output distribution (which will converge to 0 and 1).

Figure 8: True RNG cluster consisting of N × N unit true RNGs, where each unit is shared by four N-bit random numbers.

4.2 Integration of Summation and Activation Function in CONV Layers
To overcome the difficulty of accumulator implementation in AQFP, we re-formulate the operation of the SC-based feature extraction block from a different angle. The stochastic number output of the inner product (and activation), SO, should reflect the summation of the input-weight products, whose stochastic format can be viewed as a matrix SP. Each row of SP is the stochastic number/stream of one input-weight product. Therefore, the binary value represented by SO is:

(2 × Σ_{i=1..N} SO_i − N) / N = clip( (2 × Σ_{i=1..N} Σ_{j=1..M} SP_{i,j} − N × M) / N, −1, 1 )    (1)

where N is the length of the stochastic stream and M is the number of inputs. The clip operation restricts the value between the given bounds. This formulation accounts for the inner product (summation) and the activation function. Consequently,

Σ_{i=1..N} SO_i = clip( Σ_{i=1..N} Σ_{j=1..M} SP_{i,j} − (M − 1)/2 × N, 0, N )    (2)

That is, the total number of 1's in the output stochastic stream, SO, should equal the total number of 1's in the input matrix, SP, minus (M − 1)/2 × N. The n-th bit of SO can be determined by the following value:

Σ_{i=1..n} Σ_{j=1..M} SP_{i,j} − n × (M − 1)/2 − Σ_{i=1..n−1} SO_i
  = Σ_{i=1..n−1} ( Σ_{j=1..M} SP_{i,j} − (M − 1)/2 − SO_i ) + Σ_{j=1..M} SP_{n,j} − (M − 1)/2    (3)

More precisely, the n-th bit of SO is 1 if the above value is greater than 0, and is 0 otherwise. The above calculation needs to accumulate D_i = Σ_{j=1..M} SP_{i,j} − (M − 1)/2 − SO_i in each clock cycle. To simplify the computation given the hardware limitations, we propose an effective, accurate approximation technique, shown in Algorithm 1.

As described in Algorithm 1, each column of the input bit-stream matrix SP (corresponding to the current clock cycle), combined with the remaining M bits from the previous iteration, is fed to the sorting block. By performing binary sorting, the sorting block decides whether the number of 1's in all inputs is greater than (M − 1)/2 or not. More specifically, the (M − 1)/2-th bit of the sorted result can be used to determine the next stochastic bit of SO (to be 1 or 0). The M bits following the ((M − 1)/2 + 1)-th output bit, whose number of 1's represents Σ_{i=1..n} clip(D_i, 0, 1), are fed back to the sorting block. This step acts as subtraction, since the top (M − 1)/2 bits are ignored, while the surplus 1's can still be used in the following iteration.

Algorithm 1: Proposed sorter-based feature extraction block (implemented by SC in hardware)
input : SP is the matrix containing all input-weight products and bias
        N is the bit-stream length
        M is the input size
output: SO is the activated inner product.
1 Dprev = 0; // initialize feedback vector to all 0
2 for i++ < N do
3   Di = SP[:, i]; // current column
4   Ds = sort((Di, Dprev), descending); // sort the vector consisting of the current column and the previous feedback in descending order
5   SO[i] = Ds[(M − 1)/2]; // check the (M − 1)/2-th bit for output, as the top (M − 1)/2 bits are subtracted each iteration
6   Dprev = Ds[(M + 1)/2 : (M + 1)/2 + M]; // feed back the M bits following the (M − 1)/2-th bit for the next iteration

Figure 9: Layout of 1-bit true RNG using AQFP.
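Algorithm 1 can be modeled in software to check its arithmetic. This is a behavioral sketch using 0-indexed Python lists (the output bit then lands at index (M−1)/2 of the descending-sorted vector), not a gate-level description.

```python
def sorter_feature_extract(SP):
    """Behavioral model of Algorithm 1. SP is an M x N list of 0/1 rows
    (input-weight product streams); returns the output stream SO."""
    M, N = len(SP), len(SP[0])
    h = (M - 1) // 2          # number of top bits discarded per cycle
    d_prev = [0] * M          # feedback vector
    SO = []
    for i in range(N):
        col = [row[i] for row in SP]             # current column of SP
        ds = sorted(col + d_prev, reverse=True)  # binary descending sort
        SO.append(ds[h])                         # next output bit
        d_prev = ds[h + 1 : h + 1 + M]           # surplus 1's fed back
    return SO

# Hand-checked example with M = 3, N = 5: each output 1 consumes
# (M-1)/2 + 1 = 2 ones, and surplus ones carry over between cycles.
SP = [[1, 1, 0, 1, 1],
      [1, 1, 0, 1, 0],
      [1, 0, 0, 0, 0]]
print(sorter_feature_extract(SP))  # [1, 1, 0, 1, 0]
```

Here SP contains 8 ones in total and N × (M−1)/2 = 5, so Eq. (2) predicts Σ SO = 3, matching the three 1's in the model's output.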

6 Feature Extraction Block U[0] S[0] SP X[0] XNOR [0] U[1] S[1] W[0] M-input Bitonic X[1] XNOR [1] U[2] S[2] W[1] Sorter X[2] W[2] XNOR [2] SO U[3] S[3] 2M-input Bitonic X[3] [(M-1)/2] W[3] XNOR [3]

U[4] S[4] Merger [(M+1)/2] … X[M] … XNOR [M-1] W[M] [(M+1)/2+1] U[5] S[5] [(M+1)/2+2] U[6] S[6] [(M+1)/2+3] U[7] S[7] … [(M-1)/2+M] A MAX(A, B) Descending Ascending … B MIN(A, B) Figure 10: Example of 8-input binary bitonic sorter. Figure 12: Proposed sorter based feature-extraction Descending sequence 2N-input Descending Bitonic Sorter 2N-input Bitonic Merger

block. N-input Bitonic

A[0] Bitonic Sorter S[0] M[0] A[0] MAX S[0] Descending

Merger A[1] N-input S[1] M[1] A[1] MAX S[1]

A[2] S[2] M[2] A[2] MAX S[2] A[3] S[3] M[3] A[3] MAX S[3] two sorted sequences are then merged using a bitonic merger … 2N-input … … Merger Bitonic … … … … A[N-1] S[N-1] A[N-1] MAX S[N-1]

N-input Bitonic Ascending sequence to form the final sorted vector. When M is an even number, a A[N] S[0] B[0] MIN S[0] Bitonic Sorter Ascending B[1] MIN Merger S[1] A[N+1] N-input S[1] A[N+2] S[2] B[2] MIN S[2] neutral noise of repeated 0 and 1 stochastic sequence (whose ………… A[N+3] S[3] ………… B[3] MIN S[3] … … … … … … …

value is 0) is added to the input, in order to mitigate the A[2N-1] S[N-1] M[2N-1] B[N-1] MIN S[N-1]

(a) (b) limitation that (M − 1)/2 cannot be fully represented in an (2N+1)-input Descending Bitonic Sorter integer format in this case. N-input Descending A[0] S[0] MAX N-input Bitonic S[0] O[0] Bitonic Sorter A[1] S[1] MAX Merger S[1] O[1] As shown in Fig. 13, the output of the proposed feature ex- A[2] S[2] MAX S[2] O[2] A[3] S[3] MAX S[3] O[3] … … … … … … … traction block resembles a shifted rectified linear unit (ReLU). A[N-1] S[N-1] MAX S[N-1] O[N-1]

A[N] MID Table 1 shows the absolute error of the proposed bitonic N-input Ascending

N-input Bitonic 0 A[N+1] S[0] MIN S[0] Bitonic Sorter

Merger A[N+2] S[1] MIN S[1] MUX sorter based feature extraction block. The performance of A[N+3] S[2] MIN S[2] O[N:2:N] A[N+4] S[3] MIN S[3] the proposed sorter-based feature extraction block is very … … … … … … 1

consistent, and does not degrade as the input size increases. The inaccuracy is no more than 0.06 for bit-streams longer than 1024.

Figure 11: (a) Bitonic sorter for even-numbered inputs. (b) Bitonic merger for even-numbered inputs. (c) Bitonic sorter for odd-numbered inputs.

The binary sorting operation can be realized using the bitonic sorting network [25], which applies a divide-and-merge technique. Fig. 10 shows an 8-input binary bitonic sorter, in which each sorting unit can be implemented with an OR gate for the maximum and an AND gate for the minimum of its two input bits. A bitonic sorter has two stages: first, the top and bottom halves of the inputs are sorted, with the bottom half sorted in the opposite order of the top half; then the two sorted parts are merged. As shown in Fig. 11 (a), a binary bitonic sorter with an even number of inputs can be built from two half-size bitonic sorters and a bitonic merger, which is shown in Fig. 11 (b). For a binary bitonic sorter with an odd number of inputs, we propose to add a three-input sorter in the first merging stage to sort the maximum of the bottom half, the middle input, and the minimum of the top half. The maximum of the three is fed to the top merger; the minimum is fed to the bottom merger; and the median, which is either the median or the minimum of the entire input, is used as a select signal to decide the bottom-half output. This design allows any binary bitonic sorter to be implemented in a modular manner. The three-input sorter can be implemented using an AND gate, an OR gate, and a majority gate for the median.

As shown in Fig. 12, the operation in (3) can be realized using a bitonic sorting circuit. As the feedback vector is already sorted, only the input column needs to be sorted; the final sorted vector can then be generated by merging the sorted input column with the feedback vector.

Figure 13: Activated output of the proposed feature extraction block.

Table 1: Absolute inaccuracy of the bitonic sorter-based feature extraction block.

             Bit-stream length
Input size   128      256      512      1024     2048
9            0.1131   0.0847   0.0676   0.0573   0.0511
25           0.1278   0.0896   0.0674   0.0536   0.0434
49           0.1267   0.0954   0.0705   0.0528   0.0468
81           0.129    0.0937   0.0685   0.0531   0.0396
121          0.1359   0.0942   0.0654   0.0513   0.0374

4.3 Sub-Sampling Block in AQFP

The sub-sampling operation is realized using the pooling block. The max-pooling operation is not compatible with AQFP, as it requires an FSM for implementation, while the multiplexer-based average pooling implementation [35] suffers from high inaccuracy as the input size grows.

We address these limitations and propose an accurate, AQFP-based average pooling block. The output stochastic number/stream of the average-pooling block, SO, should satisfy:

\frac{2\sum_{i=1}^{N} SO_i - N}{N} = \mathrm{clip}\left(\frac{2\sum_{i=1}^{N}\sum_{j=1}^{M} SP_{i,j} - N \times M}{N \times M},\, -1,\, 1\right)    (4)

Therefore:

\sum_{i=1}^{N} SO_i = \mathrm{clip}\left(\frac{\sum_{i=1}^{N}\sum_{j=1}^{M} SP_{i,j}}{M},\, 0,\, N\right)    (5)

In other words, the number of 1's in SO should be equal to the number of 1's in SP divided by M (the number of inputs). Using the same idea as in the proposed feature extraction block, the sub-sampling block can also be realized using a sorting circuit.

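Since both the feature-extraction and sub-sampling blocks are built on the binary bitonic sorting primitive, a behavioral sketch may help. The following plain-Python model illustrates gate functionality only (it is not an AQFP netlist and does not model gate count or timing): for single bits, a compare-exchange unit is just an OR gate (maximum) and an AND gate (minimum), and the three-input sorter adds a majority (MAJ) gate for the median.

```python
def sort2(a, b):
    """Two-input binary sorting unit: (max, min) = (OR, AND)."""
    return a | b, a & b

def maj3(a, b, c):
    """Three-input majority gate; in AQFP it costs the same as a 2-input AND/OR."""
    return (a & b) | (a & c) | (b & c)

def sort3(a, b, c):
    """Three-input binary sorter: OR for the max, MAJ for the median, AND for the min."""
    return a | b | c, maj3(a, b, c), a & b & c

def bitonic_merge(bits, descending=True):
    """Merge a bitonic 0/1 sequence into sorted order via half-cleaners."""
    n = len(bits)
    if n == 1:
        return bits
    half = n // 2
    b = list(bits)
    for i in range(half):
        mx, mn = sort2(b[i], b[i + half])
        b[i], b[i + half] = (mx, mn) if descending else (mn, mx)
    return bitonic_merge(b[:half], descending) + bitonic_merge(b[half:], descending)

def bitonic_sort(bits, descending=True):
    """Divide-and-merge bitonic sorter for power-of-two input sizes:
    sort the two halves in opposite orders, then merge the bitonic result."""
    n = len(bits)
    if n <= 1:
        return list(bits)
    top = bitonic_sort(bits[:n // 2], descending)
    bottom = bitonic_sort(bits[n // 2:], not descending)
    return bitonic_merge(top + bottom, descending)
```

The odd-input extension of Fig. 11 (c) would splice `sort3` into the first merging stage, as described above; the sketch covers the even (power-of-two) case that the recursive construction handles directly.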
Algorithm 2: Proposed sorter-based average-pooling block (implemented by SC in hardware)

  input:  SP — the matrix containing all inputs
          N  — the bit-stream length
          M  — the input size
  output: SO — the average of its inputs

  1  Dprev = 0;                            // initialize the feedback vector to all 0s
  2  for i = 1 to N do
  3      Di = SP[:, i];                    // current input column
  4      Ds = sort((Di, Dprev), descending);  // sort the vector consisting of the current column and the previous feedback in descending order
  5      if Ds[M] == 1 then
  6          Dprev = Ds[M+1 : 2M];         // at least M 1s accumulated: output a 1 and feed back the surplus bits
  7      else
  8          Dprev = Ds[1 : M];            // fewer than M 1s since the last output 1: feed back the saved 1s for the next iteration
  9      SO[i] = Ds[M];

As described in Algorithm 2, the current column of inputs and the previous feedback are sorted to group all 1's received since the last 1 produced in SO. The M-th bit determines both the next output bit and the feedback bits. If a 1 is produced, the top M 1's are discarded and the remaining bits are fed back to the input; otherwise, the top M bits are fed back for further accumulation. In summary, this allows one 1 to be produced in SO for every M 1's in SP.

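The control flow of Algorithm 2 can be sketched in plain Python, with a software sort standing in for the bitonic sorter/merger hardware (a behavioral model, not the AQFP implementation):

```python
def sorter_average_pool(SP, M, N):
    """Sketch of Algorithm 2: sorter-based average pooling in the SC domain.
    SP is an M x N list of bit rows; returns the output bit-stream SO."""
    D_prev = [0] * M                               # feedback vector, initialized to all 0s
    SO = []
    for i in range(N):
        D_i = [SP[j][i] for j in range(M)]         # current input column
        D_s = sorted(D_i + D_prev, reverse=True)   # descending sort of the 2M bits
        SO.append(D_s[M - 1])                      # the M-th bit is the output bit
        if D_s[M - 1] == 1:
            D_prev = D_s[M:]                       # >= M ones: drop the top M, keep the surplus
        else:
            D_prev = D_s[:M]                       # < M ones: keep all accumulated ones
    return SO
```

Because the feedback vector can never carry M or more 1's, the number of 1's emitted in SO is exactly the number of 1's in SP divided by M (integer division), matching Eq. (5).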
The average pooling block is also implemented using a bitonic sorter, as shown in Fig. 14. Similar to the design of the feature extraction block, only one M-input sorter is needed to sort the input column, as the feedback vector is already sorted. The final sorted vector can be generated using a bitonic merger that merges the sorted input column and the previous feedback. The output bit also acts as the select signal for the feedback vector.

Figure 14: Proposed bitonic sorter based sub-sampling block.

The proposed sub-sampling block produces very low inaccuracy, as presented in Table 2: usually far less than 0.01 for bit-streams longer than 1024. Even when the input size is very small, the proposed bitonic sorter-based average pooling block still operates with negligible error, which is no more than 0.025 regardless of the bit-stream length.

Table 2: Absolute inaccuracy of the bitonic sorter-based average-pooling block.

             Bit-stream length
Input size   128      256      512      1024     2048
4            0.0249   0.0163   0.0115   0.0085   0.0058
9            0.0173   0.0112   0.0079   0.0055   0.0039
16           0.0141   0.0089   0.0061   0.0042   0.0030
25           0.0122   0.0078   0.0049   0.0033   0.0024
36           0.0105   0.0065   0.0043   0.0029   0.0019

4.4 Categorization Block for FC Layers in AQFP

Categorization blocks serve the (final) FC layers of DNNs, in which each output is the inner product of the corresponding input and weight vectors. We find it not ideal to use the same implementation for the categorization block as for the feature-extraction block of CONV layers: the categorization block usually involves more inputs, leading to a higher hardware footprint. In addition, the categorization result is reflected in the ranking of the outputs rather than relying on the calculation accuracy of the inner products. For these two reasons, we propose to implement the categorization block in a less complicated manner that maintains the relative ranking of the outputs without offering precise inner products.

The proposed categorization block is implemented using majority logic, whose output is the majority of all its inputs, i.e., the output is 1 if the input vector has more 1's than 0's, and 0 otherwise.

The relative value/importance of each output can be reflected in this way, thereby fulfilling the requirement of the categorization block in FC layers. The proposed categorization logic can be realized using a simple majority chain structure. Thanks to the nature of AQFP technology, a three-input majority gate costs the same hardware resource as a two-input AND/OR gate. As shown in Fig. 15, a multi-input majority function can be factorized into a multi-level chain of three-input majority gates based on the following equation: Maj(x0, x1, x2, x3, x4) = Maj(Maj(x0, x1, x2), x3, x4).

Figure 15: Categorization block implementation using majority chain structure.
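A behavioral sketch of the Fig. 15 structure in plain Python: the wide majority is folded into three-input MAJ gates, consuming two new inputs per stage. Note that for stochastic bit-streams this chain is a hardware-friendly approximation of the exact multi-input majority; the resulting inaccuracy is what Table 3 quantifies.

```python
def maj3(a, b, c):
    # Three-input majority gate; in AQFP it costs the same as a 2-input AND/OR gate.
    return (a & b) | (a & c) | (b & c)

def majority_chain(bits):
    """One time-slice SP[0,i]..SP[M-1,i] of the input streams, odd length M >= 3.
    Folds the inputs through a chain of 3-input MAJ gates, as in Fig. 15."""
    assert len(bits) >= 3 and len(bits) % 2 == 1
    acc = maj3(bits[0], bits[1], bits[2])
    for k in range(3, len(bits), 2):
        acc = maj3(acc, bits[k], bits[k + 1])
    return acc
```

For example, `majority_chain([1, 1, 1, 0, 0])` returns 0 even though three of the five inputs are 1, illustrating why the chain is an approximation rather than an exact majority vote.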
Table 3 shows the top-1 inaccuracy of the proposed categorization block. We simulate 10 categorization outputs with 100, 200, 500, and 800 different inputs under different bit-stream lengths. The inaccuracy is evaluated as the relative difference between the highest output value in software and in the SC domain. The relative inaccuracy can be limited to 0.4%; i.e., if the largest output outscores the second largest by more than 0.4%, the majority chain-based categorization block gives the correct classification. Since in most categorization cases the highest output is far greater than the rest, the proposed majority chain categorization block should be able to operate with high accuracy.

Table 3: Relative inaccuracy of the majority chain-based categorization block.

             Bit-stream length
Input size   128       256       512       1024      2048
100          0.3718%   0.2198%   0.1235%   0.0620%   0.0376%
200          0.2708%   0.2106%   0.1671%   0.0743%   0.0301%
500          0.2769%   0.2374%   0.1201%   0.0687%   0.0393%
800          0.2780%   0.1641%   0.1269%   0.0585%   0.0339%

5 SOFTWARE RESULTS AND HARDWARE PROTOTYPE TESTING

The proposed SC-based DNN using AQFP is evaluated in various aspects: 1) component-wise hardware performance; 2) application-level performance; and 3) system-level hardware utilization efficiency. For the hardware performance of the four basic blocks, we compare their energy efficiency and latency with CMOS-based counterparts. The application-level performance is evaluated by testing the accuracy of the proposed framework on the MNIST [21] dataset. Finally, the overall hardware resource utilization for two different DNNs (one for the MNIST dataset and one with a deeper architecture for more complex tasks) is compared with SC-based DNN implementations using CMOS.

In addition, we have verified the functionality of a feature extraction chip, as shown in Fig. 16 (a), which is fabricated with the AIST 10 kA/cm2 Niobium high-speed standard process (HSTP) [42]. The functionality of the chip is verified through 4.2 K low-temperature measurements. As shown in Fig. 16 (b), the chip under test is embedded in a cryoprobe and inserted into a liquid helium Dewar to cool down to 4.2 K, with the protection of a double-layer magnetic shield. On-chip I/Os are connected to a cable terminal through the cryoprobe. A data pattern generator, an arbitrary function generator, and a power supply are employed to generate the data inputs and the excitation current. Oscilloscopes with pre-amplifiers are used to read the outputs.

Figure 16: AQFP chip testing.

5.1 Hardware Utilization

To validate the efficiency of the proposed SC-based DNN using AQFP, we first compare component-wise and overall hardware utilization with CMOS-based implementations. The CMOS implementation is synthesized in a 40nm SMIC CMOS process using Synopsys Design Compiler. We test and compare SNGs, feature extraction blocks, sub-sampling blocks, and categorization blocks under different physical configurations.

Table 4 shows the hardware utilization comparison for SNGs that generate 1024-bit stochastic bit-streams based on 10-bit random numbers generated by RNGs. The AQFP-based design is much more efficient, given its advantage in true random number generation in addition to the clusterized RNG design. The proposed AQFP SNG operates at 1.48 × 10^5 times higher energy efficiency compared to its CMOS counterpart.

Table 4: Hardware utilization of the stochastic number generator.

              Energy (pJ)            Delay (ns)
Output size   AQFP        CMOS       AQFP   CMOS
100           9.700E-5    14.42      0.2    0.6
500           4.850E-4    72.11      0.2    0.6
800           7.760E-4    115.4      0.2    0.6

Table 5: Hardware utilization of the feature extraction block.

              Energy (pJ)            Delay (ns)
Input size    AQFP        CMOS       AQFP   CMOS
9             2.972E-4    320.819    2.2    1024.0
25            1.350E-3    520.704    3.4    1228.8
49            3.978E-3    843.469    4.8    1535.0
81            9.168E-3    1099.776   6.6    1741.8
121           1.333E-2    2948.496   6.8    1946.6
500           9.147E-2    6807.552   10.8   2455.6
800           0.186       9804.800   12.4   2868.2

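As a quick arithmetic cross-check (not from the paper), the headline improvement factors quoted in Section 5.1 can be reproduced from the smallest-configuration rows of Tables 4 and 5 (energies in pJ):

```python
# CMOS energy divided by AQFP energy, per table row
sng_ratio = 14.42 / 9.700e-5       # Table 4, 100-output SNG
fe_ratio = 320.819 / 2.972e-4      # Table 5, 9-input feature extraction block

assert 1.4e5 < sng_ratio < 1.5e5   # text quotes 1.48 x 10^5
assert 1.0e6 < fe_ratio < 1.1e6    # text quotes 1.08 x 10^6
```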
The bitonic sorter-based feature extraction block using AQFP is very energy-efficient: its energy efficiency is 1.08 × 10^6 times higher than that of the CMOS-based implementation, while it is also much faster thanks to its effective bitonic sorter-based architecture. For very large and dense layers, we still treat them as feature extraction layers, as they derive global features. The sorter-based implementation, regardless of the large input size, can still maintain reasonable hardware performance.

The bitonic sorter-based average pooling block using AQFP outperforms its CMOS counterpart by 3.12 × 10^5 times in energy efficiency and up to 465.46 times in computational speedup. This relative margin is lower because the CMOS average pooling block simply uses a multiplexer and is therefore less complex than the proposed bitonic sorter-based implementation. However, as aforementioned, the proposed sorter-based average-pooling block achieves extremely low inaccuracy.

Table 6: Hardware utilization of the sub-sampling block.

              Energy (pJ)            Delay (ns)
Input size    AQFP        CMOS       AQFP   CMOS
4             5.898E-5    18.432     1.2    614.3
9             3.007E-4    21.504     2.4    716.8
16            9.063E-4    23.552     3.4    819.2
25            1.359E-3    24.576     3.6    819.2
36            2.946E-3    32.768     5      921.6

Finally, the categorization block achieves up to 7.76 × 10^5 times better energy efficiency compared to its CMOS counterpart. Its consumption grows linearly, given the majority chain design, whose size also grows linearly as the input size increases. Overall, the four proposed basic blocks are far more energy-efficient than their CMOS-based implementations.

Table 7: Hardware utilization of the categorization block.

              Energy (pJ)            Delay (ns)
Input size    AQFP        CMOS       AQFP   CMOS
100           1.008E-2    7825.408   10     1945.6
200           3.957E-2    17131.220  20     2252.8
500           0.244       37396.480  50     2867.2
800           0.624       58880.409  80     4300.8

5.2 Application Performance

With all components developed and configured, we build a convolutional neural network with the architecture Conv3_x – AvgPool – Conv3_x – AvgPool – FC500 – FC800 – OutLayer, as indicated in Table 8. The validity of the network is proven by performing MNIST [21] classification. As the first step toward implementing an SC-based DNN using AQFP with high inference accuracy, the network is trained taking all limitations of AQFP and SC into consideration, thereby preventing accuracy degradation as much as possible.

Table 8: DNN Layer Configuration.

Layer Name   Kernel Shape    Stride
Conv3_x      [3 × 3, 32]     1
Conv5_x      [5 × 5, 32]     1
Conv7_x      [7 × 7, 64]     1
Conv9_x      [9 × 9, 128]    1
AvgPool      [2 × 2]         2
FC500        500             –
FC800        800             –

As shown in Table 9, the inference accuracy of this shallow neural network (SNN) is 97.91% when implemented in AQFP, while that of its CMOS counterpart is 97.35%. Although their accuracies are comparable, the hardware performance of the AQFP-based implementation is far superior: the energy improvement of the AQFP-based implementation is up to 5.4 × 10^4 times, and its throughput improvement is 35.9 times, compared to its CMOS counterpart.

As discussed above, the AQFP-based neural network can achieve acceptable inference accuracy with remarkable hardware performance when a shallow neural network is implemented. To further explore the feasibility of building a deep neural network in AQFP, we build a deep neural network with the architecture Conv3_x – Conv3_x – AvgPool – Conv5_x – Conv5_x – AvgPool – Conv7_x – FC500 – FC800 – OutLayer, with the configuration of each layer indicated in Table 8. As shown in Table 9, the AQFP-based DNN achieves 96.95% accuracy in MNIST classification, and the energy efficiency of AQFP is even more remarkable when the network is deeper: the energy improvement is up to 6.9 × 10^4 times and the throughput improvement is 29 times, compared to the CMOS-based implementation.

Table 9: Network Performance Comparison.

Network   Platform   Accuracy   Energy (µJ)   Throughput (images/ms)
SNN       Software   99.04%     –             –
          CMOS       97.35%     39.46         231
          AQFP       97.91%     5.606E-4      8305
DNN       Software   99.17%     –             –
          CMOS       96.62%     219.37        229
          AQFP       96.95%     2.482E-3      6667

6 CONCLUSION

In this work, we propose a stochastic-computing deep learning framework using Adiabatic Quantum-Flux-Parametron superconducting technology, which achieves the highest energy efficiency among superconducting logic families. However, the deep-pipelining nature of AQFP circuits makes prior SC-based DNN designs unsuitable. By taking this limitation and other characteristics of AQFP into account, we redesign the neural network components of the SC-based DNN. We propose (i) an accurate integration of

summation and activation function using a bitonic sorting network and feedback loop, (ii) a low-complexity categorization block based on a chain of majority gates, (iii) an ultra-efficient stochastic number generator in AQFP, (iv) a high-accuracy sub-sampling (pooling) block in AQFP, and (v) majority synthesis for further performance improvement and automatic buffer/splitter insertion to meet the requirements of AQFP circuits. Experimental results show that our proposed AQFP-based DNN can achieve up to 6.9 × 10^4 times higher energy efficiency compared to the CMOS-based implementation while maintaining 96% accuracy on the MNIST dataset.

ACKNOWLEDGMENTS

This paper is in part supported by National Science Foundation CNS-1704662, CCF-1717754, CCF-1750656, and Defense Advanced Research Projects Agency (DARPA) MTO seedling project.

REFERENCES

[1] Armin Alaghi and John P. Hayes. 2013. Survey of stochastic computing. ACM Transactions on Embedded Computing Systems (TECS) 12, 2s (2013), 92.
[2] Armin Alaghi, Weikang Qian, and John P. Hayes. 2018. The promise and challenge of stochastic computing. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 37, 8 (2018), 1515–1531.
[3] Arash Ardakani, François Leduc-Primeau, Naoya Onizawa, Takahiro Hanyu, and Warren J. Gross. 2017. VLSI implementation of deep neural network using integral stochastic computing. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 25, 10 (2017), 2688–2699.
[4] Suyoung Bang, Jingcheng Wang, Ziyun Li, Cao Gao, Yejoong Kim, Qing Dong, Yen-Po Chen, Laura Fick, Xun Sun, Ron Dreslinski, Trevor Mudge, Hun Seok Kim, David Blaauw, and Dennis Sylvester. 2017. 14.7 A 288µW programmable deep-learning processor with 270KB on-chip weight storage using non-uniform memory hierarchy for mobile intelligence. In Solid-State Circuits Conference (ISSCC), 2017 IEEE International. IEEE, 250–251.
[5] Bradley D. Brown and Howard C. Card. 2001. Stochastic neural computation. I. Computational elements. IEEE Transactions on Computers 50, 9 (2001), 891–905.
[6] Ruizhe Cai, Ao Ren, Ning Liu, Caiwen Ding, Luhao Wang, Xuehai Qian, Massoud Pedram, and Yanzhi Wang. 2018. VIBNN: Hardware Acceleration of Bayesian Neural Networks. SIGPLAN Not. 53, 2 (March 2018), 476–488. https://doi.org/10.1145/3296957.3173212
[7] Tianshi Chen, Zidong Du, Ninghui Sun, Jia Wang, Chengyong Wu, Yunji Chen, and Olivier Temam. 2014. DianNao: A small-footprint high-throughput accelerator for ubiquitous machine-learning. ACM Sigplan Notices 49, 4 (2014), 269–284.
[8] Yunji Chen, Tao Luo, Shaoli Liu, Shijin Zhang, Liqiang He, Jia Wang, Ling Li, Tianshi Chen, Zhiwei Xu, Ninghui Sun, and Olivier Temam. 2014. DaDianNao: A machine-learning supercomputer. In Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture. IEEE Computer Society, 609–622.
[9] Yu-Hsin Chen, Tushar Krishna, Joel S. Emer, and Vivienne Sze. 2017. Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks. IEEE Journal of Solid-State Circuits 52, 1 (2017), 127–138.
[10] John Clarke and Alex I. Braginski. 2006. The SQUID handbook: Applications of SQUIDs and SQUID systems. John Wiley & Sons.
[11] Matthieu Courbariaux, Itay Hubara, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. 2016. Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or −1. arXiv preprint arXiv:1602.02830 (2016).
[12] Giuseppe Desoli, Nitin Chawla, Thomas Boesch, Surinder-pal Singh, Elio Guidetti, Fabio De Ambroggi, Tommaso Majo, Paolo Zambotti, Manuj Ayodhyawasi, Harvinder Singh, and Nalin Aggarwal. 2017. 14.1 A 2.9 TOPS/W deep convolutional neural network SoC in FD-SOI 28nm for intelligent embedded systems. In Solid-State Circuits Conference (ISSCC), 2017 IEEE International. IEEE, 238–239.
[13] Zidong Du, Robert Fasthuber, Tianshi Chen, Paolo Ienne, Ling Li, Tao Luo, Xiaobing Feng, Yunji Chen, and Olivier Temam. 2015. ShiDianNao: Shifting vision processing closer to the sensor. In ACM SIGARCH Computer Architecture News, Vol. 43. ACM, 92–104.
[14] Brian R. Gaines. 1967. Stochastic computing. In Proceedings of the April 18–20, 1967, spring joint computer conference. ACM, 149–156.
[15] Brian R. Gaines. 1969. Stochastic computing systems. In Advances in information systems science. Springer, 37–172.
[16] James E. Gentle. 2006. Random number generation and Monte Carlo methods. Springer Science & Business Media.
[17] Song Han, Junlong Kang, Huizi Mao, Yiming Hu, Xin Li, Yubin Li, Dongliang Xie, Hong Luo, Song Yao, Yu Wang, Huazhong Yang, and William J. Dally. 2017. ESE: Efficient Speech Recognition Engine with Sparse LSTM on FPGA. In FPGA. 75–84.
[18] Song Han, Xingyu Liu, Huizi Mao, Jing Pu, Ardavan Pedram, Mark A. Horowitz, and William J. Dally. 2016. EIE: Efficient inference engine on compressed deep neural network. In Proceedings of the 43rd International Symposium on Computer Architecture. IEEE Press, 243–254.
[19] Patrick Judd, Jorge Albericio, Tayler Hetherington, Tor M. Aamodt, and Andreas Moshovos. 2016. Stripes: Bit-serial deep neural network computing. In Microarchitecture (MICRO), 2016 49th Annual IEEE/ACM International Symposium on. IEEE, 1–12.
[20] Hyoukjun Kwon, Ananda Samajdar, and Tushar Krishna. 2018. MAERI: Enabling Flexible Dataflow Mapping over DNN Accelerators via Reconfigurable Interconnects. In Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems. ACM, 461–475.
[21] Yann LeCun and Corinna Cortes. 2010. MNIST handwritten digit database. http://yann.lecun.com/exdb/mnist/
[22] Vincent T. Lee, Armin Alaghi, John P. Hayes, Visvesh Sathe, and Luis Ceze. 2017. Energy-efficient hybrid stochastic-binary neural networks for near-sensor computing. In Proceedings of the Conference on Design, Automation & Test in Europe. European Design and Automation Association, 13–18.
[23] K. Likharev. 1977. Dynamics of some single flux quantum devices: I. Parametric quantron. IEEE Transactions on Magnetics 13, 1 (January 1977), 242–244. https://doi.org/10.1109/TMAG.1977.1059351
[24] K. K. Likharev and V. K. Semenov. 1991. RSFQ logic/memory family: a new Josephson-junction technology for sub-terahertz-clock-frequency digital systems. IEEE Transactions on Applied Superconductivity 1, 1 (March 1991), 3–28. https://doi.org/10.1109/77.80745
[25] Kathy J. Liszka and Kenneth E. Batcher. 1993. A generalized bitonic sorting network. In Parallel Processing, 1993. ICPP 1993. International Conference on, Vol. 1. IEEE, 105–108.
[26] K. Loe and E. Goto. 1985. Analysis of flux input and output Josephson pair device. IEEE Transactions on Magnetics 21, 2 (March 1985), 884–887. https://doi.org/10.1109/TMAG.1985.1063734
[27] Pierre L'Ecuyer. 2012. Random number generation. In Handbook of Computational Statistics. Springer, 35–71.
[28] Divya Mahajan, Jongse Park, Emmanuel Amaro, Hardik Sharma, Amir Yazdanbakhsh, Joon Kyung Kim, and Hadi Esmaeilzadeh. 2016. Tabla: A unified template-based framework for accelerating statistical machine learning. In High Performance Computer Architecture (HPCA), 2016 IEEE International Symposium on. IEEE, 14–26.
[29] Bert Moons, Roel Uytterhoeven, Wim Dehaene, and Marian Verhelst. 2017. 14.5 Envision: A 0.26-to-10 TOPS/W subword-parallel dynamic-voltage-accuracy-frequency-scalable Convolutional Neural Network processor in 28nm FDSOI. In Solid-State Circuits Conference (ISSCC), 2017 IEEE International. IEEE, 246–247.
[30] Shuichi Nagasawa, Yoshihito Hashimoto, Hideaki Numata, and Shuichi Tahara. 1995. A 380 ps, 9.5 mW Josephson 4-Kbit RAM operated at a high bit yield. IEEE Transactions on Applied Superconductivity 5, 2 (1995), 2447–2452.
[31] T. Narama, F. China, N. Takeuchi, T. Ortlepp, Y. Yamanashi, and N. Yoshikawa. 2016. Yield evaluation of 83k-junction adiabatic-quantum-flux-parametron circuit. In 2016 Applied Superconductivity Conference (ASC2016).
[32] Harald Niederreiter. 1992. Random number generation and quasi-Monte Carlo methods. Vol. 63. SIAM.
[33] Jiantao Qiu, Jie Wang, Song Yao, Kaiyuan Guo, Boxun Li, Erjin Zhou, Jincheng Yu, Tianqi Tang, Ningyi Xu, Sen Song, Yu Wang, and Huazhong Yang. 2016. Going deeper with embedded FPGA platform for convolutional neural network. In Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. ACM, 26–35.
[34] Brandon Reagen, Paul Whatmough, Robert Adolf, Saketh Rama, Hyunkwang Lee, Sae Kyu Lee, José Miguel Hernández-Lobato, Gu-Yeon Wei, and David Brooks. 2016. Minerva: Enabling low-power, highly-accurate deep neural network accelerators. In Proceedings of the 43rd International Symposium on Computer Architecture. IEEE Press, 267–278.
[35] Ao Ren, Zhe Li, Caiwen Ding, Qinru Qiu, Yanzhi Wang, Ji Li, Xuehai Qian, and Bo Yuan. 2017. SC-DCNN: Highly-scalable deep convolutional neural network using stochastic computing. ACM SIGOPS Operating Systems Review 51, 2 (2017), 405–418.
[36] Ao Ren, Zhe Li, Yanzhi Wang, Qinru Qiu, and Bo Yuan. 2016. Designing reconfigurable large-scale deep learning systems using stochastic computing. In Rebooting Computing (ICRC), IEEE International Conference on. IEEE, 1–7.
[37] Hardik Sharma, Jongse Park, Divya Mahajan, Emmanuel Amaro, Joon Kyung Kim, Chenkai Shao, Asit Mishra, and Hadi Esmaeilzadeh. 2016. From high-level deep neural models to FPGAs. In Microarchitecture (MICRO), 2016 49th Annual IEEE/ACM International Symposium on. IEEE, 1–12.
[38] Hyeonuk Sim and Jongeun Lee. 2017. A new stochastic computing multiplier with application to deep convolutional neural networks. In Proceedings of the 54th Annual Design Automation Conference 2017. ACM, 29.
[39] Jaehyeong Sim, Jun-Seok Park, Minhye Kim, Dongmyung Bae, Yeongjae Choi, and Lee-Sup Kim. 2016. 14.6 A 1.42 TOPS/W deep convolutional neural network recognition processor for intelligent IoE systems. In Solid-State Circuits Conference (ISSCC), 2016 IEEE International. IEEE, 264–265.
[40] Mingcong Song, Kan Zhong, Jiaqi Zhang, Yang Hu, Duo Liu, Weigong Zhang, Jing Wang, and Tao Li. 2018. In-Situ AI: Towards Autonomous and Incremental Deep Learning for IoT Systems. In High Performance Computer Architecture (HPCA), 2018 IEEE International Symposium on. IEEE, 92–103.
[41] Naveen Suda, Vikas Chandra, Ganesh Dasika, Abinash Mohanty, Yufei Ma, Sarma Vrudhula, Jae-sun Seo, and Yu Cao. 2016. Throughput-optimized OpenCL-based FPGA accelerator for large-scale convolutional neural networks. In Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. ACM, 16–25.
[42] Naoki Takeuchi, Shuichi Nagasawa, Fumihiro China, Takumi Ando, Mutsuo Hidaka, Yuki Yamanashi, and Nobuyuki Yoshikawa. 2017. Adiabatic quantum-flux-parametron cell library designed using a 10 kA cm−2 niobium fabrication process. Superconductor Science and Technology 30, 3 (2017), 035002.
[43] Naoki Takeuchi, Dan Ozawa, Yuki Yamanashi, and Nobuyuki Yoshikawa. 2013. An adiabatic quantum flux parametron as an ultra-low-power logic device. Superconductor Science and Technology 26, 3 (2013), 035010.
[44] Naoki Takeuchi, Yuki Yamanashi, and Nobuyuki Yoshikawa. 2013. Measurement of 10 zJ energy dissipation of adiabatic quantum-flux-parametron logic using a superconducting resonator. Applied Physics Letters 102, 5 (2013), 052602.
[45] Naoki Takeuchi, Yuki Yamanashi, and Nobuyuki Yoshikawa. 2014. Energy efficiency of adiabatic superconductor logic. Superconductor Science and Technology 28, 1 (2014), 015003. https://doi.org/10.1088/0953-2048/28/1/015003
[46] Naoki Takeuchi, Yuki Yamanashi, and Nobuyuki Yoshikawa. 2015. Adiabatic quantum-flux-parametron cell library adopting minimalist design. Journal of Applied Physics 117, 17 (2015), 173912.
[47] Sergey K. Tolpygo, Vladimir Bolkhovsky, Terence J. Weir, Alex Wynn, Daniel E. Oates, Leonard M. Johnson, and Mark A. Gouker. 2016. Advanced fabrication processes for superconducting very large-scale integrated circuits. IEEE Transactions on Applied Superconductivity 26, 3 (2016), 1–10.
[48] Yaman Umuroglu, Nicholas J. Fraser, Giulio Gambardella, Michaela Blott, Philip Leong, Magnus Jahre, and Kees Vissers. 2017. FINN: A framework for fast, scalable binarized neural network inference. In Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. ACM, 65–74.
[49] Swagath Venkataramani, Ashish Ranjan, Subarno Banerjee, Dipankar Das, Sasikanth Avancha, Ashok Jagannathan, Ajaya Durg, Dheemanth Nagaraj, Bharat Kaul, Pradeep Dubey, and Anand Raghunathan. 2017. ScaleDeep: A Scalable Compute Architecture for Learning and Evaluating Deep Networks. In Proceedings of the 44th Annual International Symposium on Computer Architecture. ACM, 13–26.
[50] Yanzhi Wang, Zheng Zhan, Jiayu Li, Jian Tang, Bo Yuan, Liang Zhao, Wujie Wen, Siyue Wang, and Xue Lin. 2018. On the Universal Approximation Property and Equivalence of Stochastic Computing-based Neural Networks and Binary Neural Networks. arXiv preprint arXiv:1803.05391 (2018).
[51] Paul N. Whatmough, Sae Kyu Lee, Hyunkwang Lee, Saketh Rama, David Brooks, and Gu-Yeon Wei. 2017. 14.3 A 28nm SoC with a 1.2 GHz 568nJ/prediction sparse deep-neural-network engine with >0.1 timing error rate tolerance for IoT applications. In Solid-State Circuits Conference (ISSCC), 2017 IEEE International. IEEE, 242–243.
[52] Dongbin Xiu. 2010. Numerical methods for stochastic computations: a spectral method approach. Princeton University Press.
[53] Chen Zhang, Zhenman Fang, Peipei Zhou, Peichen Pan, and Jason Cong. 2016. Caffeine: Towards uniformed representation and acceleration for deep convolutional neural networks. In Computer-Aided Design (ICCAD), 2016 IEEE/ACM International Conference on. IEEE, 1–8.
[54] Chen Zhang, Di Wu, Jiayu Sun, Guangyu Sun, Guojie Luo, and Jason Cong. 2016. Energy-efficient CNN implementation on a deeply pipelined FPGA cluster. In Proceedings of the 2016 International Symposium on Low Power Electronics and Design. ACM, 326–331.
[55] Ritchie Zhao, Weinan Song, Wentao Zhang, Tianwei Xing, Jeng-Hau Lin, Mani B. Srivastava, Rajesh Gupta, and Zhiru Zhang. 2017. Accelerating Binarized Convolutional Neural Networks with Software-Programmable FPGAs. In FPGA. 15–24.