ICCAD: G: Light in Artificial Intelligence: Efficient Neurocomputing with Optical Neural Networks

Jiaqi Gu, [email protected], ACM ID: 7981158, ECE Department, University of Texas at Austin, Austin, TX, USA 78712

1 PROBLEM AND MOTIVATION
Deep neural networks have received an explosion of interest for their superior performance in various intelligent tasks and high impact on our lives. The computing capacity is in an arms race with the rapidly escalating model size and data amount for intelligent information processing. Practical application scenarios, e.g., autonomous vehicles, data centers, and edge devices, have strict energy efficiency, latency, and bandwidth constraints, raising a surging need to develop more efficient computing solutions. However, as Moore's law is winding down, it becomes increasingly challenging for conventional electrical processors to support such massively parallel and energy-hungry artificial intelligence (AI) workloads. The limited clock frequency, millisecond-level latency, high heat density, and large energy consumption of CPUs, FPGAs, and GPUs motivate us to seek alternative solutions beyond conventional electronics. Silicon photonics is a promising hardware platform that could represent a paradigm shift in efficient AI acceleration with its CMOS compatibility, intrinsic parallelism of optics, and near-zero power consumption. With potentially petaFLOPS per mm² execution speed and attojoule/MAC computational efficiency, fully optical neural networks (ONNs) demonstrate orders-of-magnitude higher performance than their electrical counterparts [1–6]. However, previous ONN designs have large footprints and noise robustness issues, which prevent practical applications of photonic accelerators.

In this work, we propose to explore efficient neuromorphic computing solutions with optical neural networks. Various photonic integrated circuit designs and software-hardware co-optimization methods are explored and presented here to enable high-performance photonic accelerators with lower area cost, better energy efficiency, higher variation-robustness, and more on-device learnability.

Figure 1: Challenges in current ONN design stack and the proposed hardware/software co-design solutions.

Figure 2: (a) Matrix multiplication achieved by cascaded MZI arrays [1]. (b) Micro-ring resonator (MRR) based ONN [13].

2 BACKGROUND AND RELATED WORK
Optical computing has been demonstrated as a promising substitution for electronics in efficient artificial intelligence due to its ultra-high bandwidth, sub-nanosecond latency, and attojoule/MAC energy efficiency. Early research efforts focused on diffractive free-space optical computing, optical reservoir computing [7], and spike processing [8, 9] to achieve optical multi-layer perceptrons (MLPs). Recently, integrated optical neural networks (ONNs) have attracted extensive research interest given their compactness, energy efficiency, and electronics-compatibility [1, 3, 4, 6]. Figure 1 shows the ONN design stack, including architecture and circuit design, model optimization, and final deployment and on-chip training. Due to the complexity of photonic analog circuit design and the non-ideality of physical chip deployment, the power of ONNs will not be fully unleashed without careful optimization of scalability, robustness, and learnability.

In the first design stage, neural architectures and their corresponding photonic circuits are jointly designed to map neurocomputing to optical components. Previously, Shen et al. [1] successfully demonstrated a singular-value-decomposition-based coherent ONN constructed from cascaded Mach-Zehnder interferometer (MZI) arrays, shown in Fig. 2(a). Their photonic tensor core prototype demonstrates order-of-magnitude higher inference throughput and comparable accuracy on a vowel recognition task compared with GPUs [1]. Zhao et al. [10] proposed a slimmed ONN architecture that cuts the area cost by 30-50% through a software-hardware co-design methodology. These MZI-based ONNs generally have a large area cost and low robustness due to phase accumulation errors in the cascaded MZI meshes [10-12]. Micro-ring resonator (MRR) based incoherent ONNs have been proposed that build MRR weight banks for matrix multiplication [13, 14], shown in Fig. 2(b). The MRR-ONN has a small footprint, but it suffers from robustness issues due to MRR sensitivity. To enable scalable optical AI accelerators, novel photonic structures are in high demand to construct compact, efficient, and expressive optical neural architectures.

Given an ONN architecture, the second stage is ONN model optimization. Specifically, we need to determine the optical component configurations that achieve the AI task with high performance and fidelity. Previously, hardware-unaware model training was adopted to obtain a theoretically trained ONN model. However, the trained ONN weights are not necessarily implementable given limited device control precision and non-ideal process variations [11, 12, 15]. Prior work has shown that error accumulation effects cause undesired sensitivity of analog ONN chips to noise, which generally leads to unacceptable accuracy drops and even complete malfunction [11, 15]. Existing training methods lack practical estimation of device non-ideality, which leaves large room for hardware-software co-design to bridge the performance gap between theoretical models and physically deployed ONN chips.
The third stage happens after ONN chip manufacturing. Once there is any change in the environment, tasks, or data distribution, automatic tuning and on-chip training are performed in situ to calibrate the circuit states and quickly adapt the ONN chip accordingly. Thanks to the ultra-fast execution speed and reconfigurability of silicon photonic chips, ONNs are also perfect self-learning platforms. Such self-learnability is especially beneficial for offloading centralized cloud-based training to resource-limited edge devices, which not only saves expensive communication cost but also boosts the intelligence of edge computing units. To enable such on-device learnability, prior work attempts to perform in situ ONN training to boost the training efficiency. Brute-force device tuning [1] and evolutionary algorithms [16] have been proposed to search for an optimal device configuration with a large number of ONN queries. Adjoint variable methods [17] are applied to directly generate and read out the gradients w.r.t. device configurations in situ to achieve parallel on-chip training. Though the training speed is already orders of magnitude higher than software training, their scalability and robustness are inadequate for practical on-chip learning due to algorithmic inefficiency or prohibitive hardware overhead.

Overall, existing studies still fail to provide hardware-efficient, robust, and self-learnable ONN designs. Hence, better ONN architecture designs and advanced circuit-algorithm co-optimization are still in great demand. Therefore, we propose a holistic ONN design solution to help build scalable, reliable, and adaptive photonic accelerators with the following methodologies that can be fully integrated:
• Frequency-Domain ONN Architecture: for the first time, neural computing is mapped to a general optical frequency domain with massive parallelism. We propose a compact and energy-efficient ONN architecture based on learnable photonic butterfly meshes. Hardware-aware structured pruning is applied to further boost the hardware efficiency by ∼10× [18-20].
• Nonideality-Aware ONN Optimization: limited device control resolution and process variations are considered during ONN optimization for the first time. A noise-aware quantization flow is proposed to achieve considerable robustness improvement under practical circuit variations with minimum training overhead [11, 12].
• Efficient ONN On-Chip Learning: we propose an efficient ONN on-chip learning framework, which is the first to enable scalable and self-learnable intelligence in integrated optics. Our power-aware sparse zeroth-order optimization flow considerably boosts the on-device training speed by 3-8×, boosts scalability by ∼20×, and saves >90% of the training power consumption [21, 22].

As shown in Fig. 1, the proposed algorithms focus on different stages of the ONN design stack and synergistically put forward the practical application of photonic AI accelerators.

3 APPROACH AND UNIQUENESS
In this study, we propose a holistic solution to enable efficient photonic accelerator design, including a learnable frequency-domain ONN architecture FFT-ONN for improving area efficiency, a noise-aware quantization scheme ROQ for robust ONN design, and an efficient ONN on-chip learning framework FLOPS leveraging stochastic zeroth-order optimization for in situ ONN training.

3.1 FFT-ONN: Hardware-Efficient ONN Architecture
Though ONNs have ultra-fast execution speed, they typically have a large area cost due to the physical limits of optical devices. Previous MZI-based ONNs [1, 10] consume a large number of MZIs and thus fail to provide efficient and scalable photonic solutions. We focus on answering the following critical questions: 1) how to construct a new ONN architecture using much fewer optical components without sacrificing its expressiveness, 2) how to efficiently map high-dimensional neural networks to 2-dimensional (2D) photonic integrated circuits (PICs), and 3) how to further cut down the power consumption of optical devices with minimum performance degradation [18-20].

Proposed Frequency-Domain Optical MLP: To remedy the area cost issue, we propose a novel optical multi-layer perceptron architecture based on the fast Fourier transform (FFT), as shown in Fig. 3. Instead of implementing general matrix multiplication, we adopt a block-circulant weight matrix as an efficient substitution. This block-circulant matrix can efficiently realize restricted linear projection via an FFT-based fast multiplication algorithm: after being transformed into the Fourier domain, the matrix multiplication can be cast to lightweight element-wise multiplication between two Fourier-domain signals. A photonic butterfly network is designed using compact photonic devices to achieve on-chip optical FFT/IFFT. Frequency-domain weights are implemented in the element-wise multiplication (EM) stage to achieve complex-valued multiplication by leveraging the polarization feature of light. Optical splitter and combiner trees are designed to perform light-speed fan-out and partial product accumulation, respectively. This framework has provably comparable model expressivity with classical NNs. Without sacrificing any classification accuracy, this frequency-domain optical multi-layer perceptron [18] demonstrates 3-4× lower area cost compared with previous ONNs [1, 10].
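To make the block-circulant computation concrete, the sketch below shows in plain NumPy how one circulant block's matrix-vector product reduces to element-wise multiplication in the Fourier domain, which is the restricted linear projection that FFT-ONN realizes optically. The block size and helper names are illustrative only, not taken from the FFT-ONN implementation.

```python
import numpy as np

def circulant_block_matvec(w_col, x):
    """y = C @ x for a k x k circulant block C defined by its first column
    w_col, computed with the FFT-based fast algorithm:
    y = IFFT( FFT(w_col) * FFT(x) )."""
    return np.real(np.fft.ifft(np.fft.fft(w_col) * np.fft.fft(x)))

def block_circulant_matvec(W_cols, x):
    """Matrix-vector product with a (p*k) x (q*k) block-circulant weight.
    W_cols[i, j] holds the defining column of block (i, j); the outer loops
    play the role of the optical splitter/combiner trees (fan-out and
    partial-product accumulation)."""
    p, q, k = W_cols.shape
    x_blocks = x.reshape(q, k)
    y = np.zeros((p, k))
    for i in range(p):
        for j in range(q):
            y[i] += circulant_block_matvec(W_cols[i, j], x_blocks[j])
    return y.reshape(p * k)

# sanity check of one block against an explicitly built circulant matrix
k = 4
w, x = np.random.randn(k), np.random.randn(k)
C = np.array([[w[(i - j) % k] for j in range(k)] for i in range(k)])
assert np.allclose(C @ x, circulant_block_matvec(w, x))
```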

Figure 3: FFT-ONN [18] architecture and ONN chip tape-out.

Proposed Frequency-Domain Optical CNN: To support highly parallel 2D convolutional neural network (CNN) acceleration, we propose a learnable frequency-domain optical CNN architecture, FFT-ONN-v2, which moves beyond the traditional FFT-based design methodology and fundamentally differs from traditional im2col-based spatial CNN accelerators. To match high-dimensional convolutions to 2D photonic circuits, we map the computation to the spatial, temporal, and spectral dimensions of the PIC, fully unleashing the massive parallelism of ONNs. We decompose the 2D spatial convolution into two cascaded 1D frequency-domain convolutions along rows and columns. In the column-wise convolution, the feature maps are projected to the frequency domain via learnable butterfly transform structures column by column, multiplied by the frequency-domain kernels, and projected back by the reversed transform. We use multiple ultra-compact micro-disk (MD) weight banks to directly implement all complex-valued kernels in one shot. We allow an augmented solution space for better model expressivity without the conjugate symmetry constraints of the traditional FFT-based method. Wavelength-division multiplexing (WDM) techniques are adopted to allow extensive hardware reuse and massive parallelism across channels, leading to order-of-magnitude area reduction and throughput improvement.
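The column-wise stage described above can be prototyped in a few lines. The NumPy sketch below uses an ideal FFT in place of the learnable butterfly transform and a single frequency-domain kernel vector standing in for a micro-disk weight bank, so it only illustrates the dataflow, not the published FFT-ONN-v2 design; the analogous row-wise stage is cascaded after it to complete the decomposed 2D operation.

```python
import numpy as np

def columnwise_freq_conv(feat, kernel_f):
    """1D circular convolution along each column, computed in the frequency
    domain: transform column by column, multiply element-wise by the
    frequency-domain kernel, then apply the reversed transform."""
    F = np.fft.fft(feat, axis=0)                 # forward transform per column
    return np.real(np.fft.ifft(F * kernel_f[:, None], axis=0))

H, W = 8, 6
feat = np.random.randn(H, W)
k_spatial = np.random.randn(H)                   # spatial kernel, for checking
out = columnwise_freq_conv(feat, np.fft.fft(k_spatial))

# reference: explicit circular convolution along every column
ref = np.zeros_like(feat)
for j in range(W):
    for i in range(H):
        ref[i, j] = sum(k_spatial[m] * feat[(i - m) % H, j] for m in range(H))
assert np.allclose(out, ref)
```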

Figure 4: FFT-ONN-v2 [19] on frequency-domain CNNs.

Proposed Power-Aware Optimization: The manually designed optical FFT module is a fixed structure with limited reconfigurability, and it can be sub-optimal in terms of area and power consumption. To empower our ONN with joint learnability, the fixed FFT module is relaxed into a general frequency-domain projection with a trainable butterfly structure. All programmable devices in the transform are jointly optimized during training to learn the best transform pairs automatically. We also employ hardware-aware fine-grained structured pruning and progressively sparsify the phase shifters in the learnable transform to cut down the power consumption and the circuit footprint.
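As a rough illustration of the progressive structured pruning idea, the sketch below zeroes out the groups of phase shifters with the smallest magnitude and ramps up the pruned fraction over training; the grouping, schedule, and sparsity target are placeholders rather than the settings used in FFT-ONN-v2.

```python
import numpy as np

def progressive_structured_prune(phases, step, total_steps, final_sparsity=0.8):
    """Zero out the weakest groups (columns) of phase shifters; the pruned
    fraction grows linearly with the training step (illustrative schedule)."""
    current = final_sparsity * min(1.0, step / total_steps)
    group_norms = np.linalg.norm(phases, axis=0)        # one norm per group
    n_prune = int(current * phases.shape[1])
    if n_prune == 0:
        return phases
    pruned_cols = np.argsort(group_norms)[:n_prune]     # weakest groups first
    out = phases.copy()
    out[:, pruned_cols] = 0.0                           # idle phase shifters
    return out

phases = np.random.randn(8, 16)        # toy butterfly-mesh phase array
for step in range(1, 101):
    phases = progressive_structured_prune(phases, step, total_steps=100)
```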

Overall Contributions: We propose a hardware-efficient ONN architecture to break through the ONN scalability and efficiency bottleneck. It is the first time that neural computing has been mapped to a general frequency domain for integrated optics. We move beyond the traditional FFT-based NN design paradigm and propose an ultra-compact frequency-domain ONN architecture with algorithmic and hardware insights. We also adopt a hardware-aware pruning method to further improve its area and power. Our compact and highly parallel FFT-ONN family achieves ∼10× area reduction and 10× device-tuning power saving compared with previous ONNs.

3.2 ROQ: Noise-Aware Quantization Scheme for Robust ONNs
As analog computing platforms, ONNs inevitably encounter robustness issues due to process variations and a non-ideal environment. The limited device control resolution, e.g., typically 4-bit, is another practical factor that potentially induces undesired accuracy degradation. However, prior ONN model optimization methods lack effective noise modeling, such that the deployed ONN model suffers from severe accuracy drops or even complete malfunction [1, 11, 15]. Instead of hardware-unaware ONN optimization, we focus on two aspects: 1) how to find quantized device configurations that honor control resolution limits while maintaining model expressivity, and 2) how to efficiently consider device variations during training to find a more noise-resilient solution via hardware-software co-design [11, 12].

Figure 5: ROQ [12] for robust ONNs under low-bit controls.

Proposed Differentiable ONN Quantization Flow: We target the general photonic tensor core design based on MZI arrays [1], the most challenging design without readily applicable prior quantization methods. MZI mesh simulation is performed during hardware-aware training to emulate the impact of low control resolution on the unitary transfer matrix. The biggest challenge is the prohibitive cost of propagating gradients through the discretized phase decomposition [1]. To efficiently tackle this optimization challenge, we propose a differentiable phase-domain quantization scheme, shown in Fig. 5. In our proposed coarse gradient approximation algorithm, the training engine directly bypasses the upstream gradients through the entire discretized decomposition procedure, so that we can efficiently estimate the gradients w.r.t. the unitary matrices. Each gradient descent step pushes the unitary matrices off the unitary manifold, which is illegal for the subsequent unitary decomposition, so we adopt a projected gradient descent optimizer to enable efficient optimization in the discretized unitary space.
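A minimal NumPy sketch of these two ingredients is shown below, assuming a plain uniform phase quantizer and a Frobenius-nearest unitary projection; the actual ROQ flow operates on the full MZI phase decomposition, so this is only an illustration of the coarse-gradient and projection steps.

```python
import numpy as np

def quantize_phase(phi, bits=4):
    """Uniform b-bit quantization of control phases in [0, 2*pi)."""
    step = 2 * np.pi / (2 ** bits)
    return np.round(np.mod(phi, 2 * np.pi) / step) * step

def project_to_unitary(M):
    """Project a matrix back onto the unitary manifold after a gradient step
    (closest unitary in Frobenius norm, via the polar/SVD decomposition)."""
    W, _, Vh = np.linalg.svd(M)
    return W @ Vh

# Coarse gradient approximation: treat the quantization/decomposition step as
# identity in the backward pass (straight-through), use the gradient w.r.t. the
# unitary weight directly, then re-project after the update.
U = project_to_unitary(np.random.randn(4, 4) + 1j * np.random.randn(4, 4))
grad_U = np.random.randn(4, 4) + 1j * np.random.randn(4, 4)   # stand-in gradient
U = project_to_unitary(U - 1e-2 * grad_U)                      # projected GD step
assert np.allclose(U.conj().T @ U, np.eye(4), atol=1e-8)

# low-bit control levels that the decomposed MZI phases must snap to
phi_q = quantize_phase(np.random.uniform(0, 2 * np.pi, size=8), bits=4)
```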
Proposed Protective Noise Injection: Optical devices inevitably have non-ideal parameter drifts due to process variations and manufacturing errors. To model this circuit noise in our quantization flow, we propose a protective Group Lasso (PGL) regularization technique to explicitly perform noise-adaptive optimization. Random device drifts estimated from foundry PDKs are injected to emulate the variation. A block-wise robustness estimator assigns a robustness score to each block based on the induced errors, and highly sensitive weight blocks are suppressed to protect the ONN from accumulated errors.
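The sketch below gives one hedged reading of this regularizer: a group-norm penalty whose per-block weight comes from a stand-in sensitivity estimator. Gaussian drift replaces the foundry-PDK statistics, and the penalty form is illustrative rather than the exact PGL formulation.

```python
import numpy as np

def protective_group_lasso(blocks, sensitivity_scores, lam=1e-3):
    """Illustrative protective Group-Lasso-style penalty: each weight block
    contributes its group (Frobenius) norm, scaled by a sensitivity score,
    so noise-sensitive blocks are suppressed more strongly during training."""
    return lam * sum(s * np.linalg.norm(W)
                     for W, s in zip(blocks, sensitivity_scores))

def estimate_sensitivity(W, x, noise_std=0.02, n_samples=16):
    """Stand-in block-wise robustness estimator: average output error induced
    by random device drift (Gaussian here, in place of PDK-derived drifts)."""
    base = W @ x
    errs = [np.linalg.norm((W + noise_std * np.random.randn(*W.shape)) @ x - base)
            for _ in range(n_samples)]
    return float(np.mean(errs))

blocks = [np.random.randn(4, 4) for _ in range(3)]
x = np.random.randn(4)
scores = [estimate_sensitivity(W, x) for W in blocks]
penalty = protective_group_lasso(blocks, scores)   # added on top of the task loss
```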

Overall Contributions: Our noise-aware quantization scheme introduces coarse gradient approximation and unitary projection algorithms to enable differentiable ONN optimization with non-ideality modeling. Our protective noise injection method efficiently incorporates device noise modeling during training to improve the ONN noise tolerance. Our proposed algorithm-circuit co-optimization methodology ROQ shows much better noise tolerance under low-bit device controls and practical circuit variations, enabling general photonic tensor accelerators with low control complexity and high robustness.

3.3 FLOPS: Efficient On-Chip ONN Learning
Besides inference acceleration, efficient on-device training is another critical step in intelligent edge AI solutions, which requires efficient learning protocols, especially for resource-limited edge devices. However, traditional software-based ONN training suffers from expensive hardware mapping and inaccurate variation modeling. Previous on-chip learning protocols are either based on brute-force tuning [1] or evolutionary algorithms [16], which fail to leverage the self-learning capability of ONNs due to algorithmic inefficiency and poor variation-robustness. To enable ultra-fast training acceleration on self-learnable photonic neural engines, we propose an efficient on-chip learning framework, FLOPS [21, 22], to resolve the scalability challenges using stochastic zeroth-order optimization.

Proposed Zeroth-Order Learning Framework: As shown in Fig. 6(a), ONN chips are naturally ultra-fast DNN forward accelerators. Hence, we directly train optical devices on chip to achieve efficient in situ optimization without costly gradient backpropagation. We propose a stochastic zeroth-order gradient estimator to efficiently query the ONN oracle for variance-reduced gradient descent. WDM techniques are utilized to enable fully parallel evaluation on a mini-batch of training data, which considerably reduces the ONN query complexity. By leveraging the ONN chip itself as an accurate in-situ variation model, we can perform on-device training without expensive noise simulation and efficiently recover the accuracy with high noise robustness under post-deployment non-ideality.
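For intuition, a generic forward-difference zeroth-order estimator is sketched below; every call to the loss function stands in for one ONN forward query, and the toy quadratic oracle, sample count, and step size are assumptions for illustration rather than the FLOPS settings, which additionally use variance reduction and WDM-parallel batching.

```python
import numpy as np

def zeroth_order_grad(loss_fn, theta, n_samples=8, delta=1e-2):
    """Stochastic zeroth-order gradient estimate: perturb the device
    configuration along random directions and use finite differences of the
    queried loss, so no backpropagation through the photonic chip is needed."""
    base = loss_fn(theta)
    grad = np.zeros_like(theta)
    for _ in range(n_samples):
        u = np.random.randn(*theta.shape)          # random perturbation direction
        grad += (loss_fn(theta + delta * u) - base) / delta * u
    return grad / n_samples

# toy "ONN oracle": quadratic loss over a vector of device phases theta
target = np.random.randn(16)
loss_fn = lambda th: np.sum((th - target) ** 2)

theta = np.zeros(16)
for step in range(200):
    theta -= 0.05 * zeroth_order_grad(loss_fn, theta)
print(loss_fn(theta))   # decreases toward 0 as theta approaches the target
```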

Figure 6: (a) ONN on-chip training framework and (b) sparse mixed-training strategy [21, 22].

Proposed Sparse Mixed-Training Strategy: On resource-constrained edge platforms, the critical barriers to on-device learnability are the energy budget and resource limits. We propose a mixed-training strategy with two-level sparsity to improve the ONN training efficiency, shown in Fig. 6(b). We partition the optical devices into passive and active regions and only allow active devices to be trainable. Such parameter-level sparsity considerably reduces the device programming power and the inter-device crosstalk without degrading the model reconfigurability. During each optimization step, we also exploit gradient-level sparsity by randomly selecting a small subset of devices to update their configurations. We further apply lightweight dynamic pruning to explicitly optimize the learning energy cost by randomly removing exploration steps with extra power cost, leading to order-of-magnitude training power reduction without accuracy drop.
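One illustrative update step combining the two sparsity levels and the dynamic step pruning is sketched below; the masks, ratios, and the toy gradient oracle are placeholders rather than the published FLOPS settings.

```python
import numpy as np

def sparse_mixed_training_step(theta, grad_fn, active_mask, sample_ratio=0.2,
                               lr=0.05, skip_prob=0.1):
    """One sparse mixed-training update:
    - parameter-level sparsity: only devices flagged in `active_mask` may train;
    - gradient-level sparsity: a random subset of active devices is updated;
    - dynamic pruning: some exploration steps are skipped to save power."""
    if np.random.rand() < skip_prob:                 # pruned exploration step
        return theta
    update_mask = active_mask & (np.random.rand(theta.size) < sample_ratio)
    grad = grad_fn(theta)                            # e.g., a zeroth-order estimate
    theta = theta.copy()
    theta[update_mask] -= lr * grad[update_mask]
    return theta

theta = np.zeros(64)
active_mask = np.zeros(64, dtype=bool)
active_mask[:32] = True                              # devices in the active region
target = np.random.randn(64)
grad_fn = lambda th: 2 * (th - target)               # stand-in gradient oracle
for _ in range(300):
    theta = sparse_mixed_training_step(theta, grad_fn, active_mask)
```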

Overall Contributions: With the proposed on-chip mixed-training framework applied, we can outperform previous on-chip learning methods with 3-8× faster training speed, 3-5% higher robustness, >20× better scalability, and over 90% power reduction.

4 RESULTS AND CONTRIBUTIONS
4.1 FFT-ONN: Hardware-Efficient ONN Architecture
To validate the functionality and efficiency of our proposed ONN architecture, we first perform optical simulation using the commercial tool Lumerical INTERCONNECT. Our model shows good fidelity, with <1.2% maximum relative error compared with theoretical results.

Figure 7: Area and power comparison with prior ONNs [1, 10].

We further compare the accuracy and hardware cost with prior state-of-the-art (SOTA) ONNs [1, 10]. On several optical MLPs and CNNs with various benchmarks, e.g., MNIST and FashionMNIST, our FFT-based optical MLP saves 2.2×-3.7× device usage compared with prior ONNs [1, 10]. With our learnable transform and hardware-aware pruning, the frequency-domain ONN architecture considerably boosts the area efficiency by nearly 7× with 80-90% device programming power reduction, and our microdisk-based frequency-domain CNN engines are 5.6-11.6× smaller than MZI-ONNs with similar accuracy (within ±0.5%). We also compare the noise-robustness under practical device variations. Our pruned butterfly structure demonstrates superior robustness with much fewer noise sources: with 80% structured sparsity, FFT-ONN-v2 maintains over 97% accuracy on MNIST, while prior MZI-ONNs suffer from complete malfunction. Our ONN designs also have a photonic neural chip tape-out with AMF for measurement and prototype demonstration.
4.2 ROQ: Noise-Aware Quantization Scheme for Robust ONNs
We evaluate the effectiveness of our proposed quantization scheme on a four-layer ONN with the MNIST dataset and compare it with naive hardware-unaware training and a baseline iterative quantization method. Based on a pre-trained full-precision ONN model with 97.6% accuracy, we quantize the device control signals with 3- to 6-bit precision and inject practical process variations.

Figure 8: Robustness comparison between the proposed scheme ROQ [12] and baseline methods.

Figure 8 shows that our proposed ROQ and protective regularization technique (PGL) achieve the highest accuracy with the lowest variance across all settings. With large device noise and only 3-bit resolution, our method can boost the accuracy from ∼20% (baseline) to 80%. Our proposed ROQ effectively tackles the non-ideal issues of ONNs via co-design and provides a low-overhead model optimization approach towards noise-resilient ONN accelerators.

4.3 FLOPS: Efficient On-Chip ONN Learning
We compare the training efficiency with prior SOTA on-chip training protocols in terms of the number of ONN forward queries, recovered accuracy, and learning power consumption.

Figure 9: (a) Learning curve comparison [21]. (b) Accuracy under device variations [21]. (c) Learning curves with different mixed-training sparsity α [22]. (d) Deployed accuracy (DAcc), recovered accuracy (RAcc), and power reduction [22].

Figures 9(a) and 9(b) show that our FLOPS achieves 3-4× higher learning efficiency with 3% higher robustness than prior arts [1, 16]. With our two-level sparse mixed-training strategy, the learning efficiency is boosted by >7× compared with the baseline method [1], as shown in Fig. 9(c). Our optimizer can handle ONN training with 2500 MZIs, showing >20× higher scalability than prior protocols [1, 16]. With our dynamic power optimization, 96%-98% of the training power is saved compared to our DAC version [21], with marginal overhead and negligible accuracy drop, as shown in Fig. 9(d). Our proposed on-chip learning solution enables scalable and fast on-device training to facilitate intelligent and adaptive photonic accelerators.

4.4 Research Impacts
As the major focus of my Ph.D. research, this study has led to 7 first-authored publications [12, 18, 19, 21-24] in premier EDA/CAD/ML journals and conferences, such as TCAD, ASP-DAC, DATE, DAC, and AAAI. In addition, it has led to 2 invited papers at ICCAD 2019 [11] and CLEO 2021 [25], and 11 co-authored high-impact SPIE/Nature journal and conference papers [2, 20, 26-28]. The proposed design methodologies are built on advanced machine learning algorithms and solid physical optical modeling and facilitate the entire ONN design stack. We pioneer the research in this area: our proposed FFT-ONN [18] received the Best Paper Award at ASP-DAC 2020, and the proposed FLOPS [21] was selected as one of 6 Best Paper Finalists at DAC 2020. Our endeavor on optics-AI integration is well recognized by academia, and we received the Gold Medal in the ACM/SIGDA Student Research Competition and the Best Poster Award at the NSF Hardware Workshop 2020.

REFERENCES
[1] Y. Shen, N. C. Harris, S. Skirlo, M. Prabhu, T. Baehr-Jones, M. Hochberg, X. Sun, S. Zhao, H. Larochelle, D. Englund, and M. Soljačić, "Deep learning with coherent nanophotonic circuits," Nature Photonics, 2017.
[2] Z. Ying, C. Feng, Z. Zhao, S. Dhar, H. Dalir, J. Gu, Y. Cheng, R. Soref, D. Z. Pan, and R. T. Chen, "Electronic-photonic arithmetic logic unit for high-speed computing," Nature Communications, 2020.
[3] Q. Cheng, J. Kwon, M. Glick, M. Bahadori, L. P. Carloni, and K. Bergman, "Silicon Photonics Codesign for Deep Learning," Proceedings of the IEEE, 2020.
[4] G. Wetzstein, A. Ozcan, S. Gigan, S. Fan, D. Englund, M. Soljačić, C. Denz, D. A. B. Miller, and D. Psaltis, "Inference in artificial intelligence with deep optics and photonics," Nature, 2020.
[5] M. A. Nahmias, T. F. de Lima, A. N. Tait, H. Peng, B. J. Shastri, and P. R. Prucnal, "Photonic multiply-accumulate operations for neural networks," JSTQE, 2020.
[6] B. J. Shastri, A. N. Tait, T. F. de Lima, W. H. P. Pernice, H. Bhaskaran, C. D. Wright, and P. R. Prucnal, "Photonics for artificial intelligence and neuromorphic computing," Nature Photonics, 2021.
[7] D. Brunner, M. C. Soriano, C. R. Mirasso, and I. Fischer, "Parallel photonic information processing at gigabyte per second data rates using transient states," Nature Communications, 2013.
[8] A. N. Tait, M. A. Nahmias, B. J. Shastri, and P. R. Prucnal, "Broadcast and weight: An integrated network for scalable photonic spike processing," J. Light. Technol., 2014.
[9] D. Rosenbluth, K. Kravtsov, M. P. Fok et al., "A high performance photonic pulse processing device," Opt. Express, vol. 17, no. 25, Dec 2009.
[10] Z. Zhao, D. Liu, M. Li, Z. Ying, L. Zhang, B. Xu, B. Yu, R. T. Chen, and D. Z. Pan, "Hardware-software co-design of slimmed optical neural networks," in Proc. ASPDAC, 2019.
[11] Z. Zhao, J. Gu, Z. Ying, C. Feng, R. T. Chen, and D. Z. Pan, "Design technology for scalable and robust photonic integrated circuits," in Proc. ICCAD, 2019.
[12] J. Gu, Z. Zhao, C. Feng, H. Zhu, R. T. Chen, and D. Z. Pan, "ROQ: A noise-aware quantization scheme towards robust optical neural networks with low-bit controls," in Proc. DATE, 2020.
[13] A. N. Tait, T. F. de Lima, E. Zhou, A. X. Wu, M. A. Nahmias, B. J. Shastri, and P. R. Prucnal, "Neuromorphic photonic networks using silicon photonic weight banks," Sci. Rep., 2017.
[14] W. Liu, W. Liu, Y. Ye, Q. Lou, Y. Xie, and L. Jiang, "Holylight: A nanophotonic accelerator for deep learning in data centers," in Proc. DATE, 2019.
[15] Y. Zhu, G. L. Zhang, B. Li, X. Yin, C. Zhuo, H. Gu, T. Y. Ho, and U. Schlichtmann, "Countering Variations and Thermal Effects for Accurate Optical Neural Networks," in Proc. ICCAD, 2020.
[16] T. Zhang, J. Wang, Y. Dan, Y. Lanqiu, J. Dai, X. Han, X. Sun, and K. Xu, "Efficient training and design of photonic neural network through neuroevolution," Optics Express, 2019.
[17] T. W. Hughes, M. Minkov, Y. Shi, and S. Fan, "Training of photonic neural networks through in situ backpropagation and gradient measurement," Optica, 2018.
[18] J. Gu, Z. Zhao, C. Feng, M. Liu, R. T. Chen, and D. Z. Pan, "Towards area-efficient optical neural networks: an FFT-based architecture," in Proc. ASPDAC, 2020.
[19] J. Gu, Z. Zhao, C. Feng, Z. Ying, M. Liu, R. T. Chen, and D. Z. Pan, "Towards Hardware-Efficient Optical Neural Networks: Beyond FFT Architecture via Joint Learnability," IEEE TCAD, 2020.
[20] C. Feng, J. Gu, Z. Ying, Z. Zhao, R. T. Chen, and D. Z. Pan, "Scalable fast-Fourier-transform-based (FFT-based) integrated optical neural network for compact and energy-efficient deep learning," in SPIE Photonics West, 2021.
[21] J. Gu, Z. Zhao, C. Feng, W. Li, R. T. Chen, and D. Z. Pan, "FLOPS: Efficient On-Chip Learning for Optical Neural Networks Through Stochastic Zeroth-Order Optimization," in Proc. DAC, 2020.
[22] J. Gu, C. Feng, Z. Zhao, Z. Ying, R. T. Chen, and D. Z. Pan, "Efficient On-Chip Learning for Optical Neural Networks Through Power-Aware Sparse Zeroth-Order Optimization," in Proc. AAAI, 2021.
[23] J. Gu, Z. Zhao, C. Feng, Z. Ying, R. T. Chen, and D. Z. Pan, "O2NN: Optical Neural Networks with Differential Detection-Enabled Optical Operands," in Proc. DATE, Feb. 2021.
[24] J. Gu, C. Feng, Z. Zhao, Z. Ying, M. Liu, R. T. Chen, and D. Z. Pan, "SqueezeLight: Towards Scalable Optical Neural Networks with Multi-Operand Ring Resonators," in Proc. DATE, Feb. 2021.
[25] J. Midkiff, A. Rostamian, K. M. Yoo, A. Asghari, C. Wang, C. Feng, Z. Ying, J. Gu, H. Mei, C.-W. Chang, J. Fang, A. Huang, J.-D. Shin, X. Xu, M. Bukshtab, D. Z. Pan, and R. T. Chen, "Integrated Photonics for Computing, Interconnects and Sensing," in Proc. CLEO, May 2021.
[26] M. Miscuglio, Z. Hu, S. Li, J. Gu, A. Babakhani, P. Gupta, C.-W. Wong, D. Pan, S. Bank, H. Dalir, and V. J. Sorger, "Million-channel parallelism Fourier-optic convolutional filter and neural network processor," in Proc. CLEO, 2020.
[27] C. Feng, Z. Zhao, Z. Ying, J. Gu, D. Z. Pan, and R. T. Chen, "Compact design of On-chip Elman Optical Recurrent Neural Network," in Proc. CLEO, 2020.
[28] C. Feng, Z. Ying, Z. Zhao, J. Gu, D. Z. Pan, and R. T. Chen, "Wavelength-division-multiplexing (WDM)-based integrated electronic–photonic switching network (EPSN) for high-speed data processing and transportation," Nanophotonics, 2020.