
FLOPS: EFficient On-Chip Learning for OPtical Neural Networks Through Stochastic Zeroth-Order Optimization

Jiaqi Gu1*, Zheng Zhao1, Chenghao Feng1, Wuxi Li2, Ray T. Chen1, and David Z. Pan1†
1ECE Department, The University of Texas at Austin; 2Xilinx Inc.
*[email protected]; [email protected]

Abstract— Optical neural networks (ONNs) have attracted extensive attention due to their ultra-high execution speed and low energy consumption. Traditional software-based ONN training, however, suffers from expensive hardware mapping and inaccurate variation modeling, while current on-chip training methods fail to leverage the self-learning capability of ONNs due to algorithmic inefficiency and poor variation-robustness. In this work, we propose an on-chip learning method to resolve the aforementioned problems that impede ONNs' full potential for ultra-fast forward acceleration. We directly optimize optical components on-chip using stochastic zeroth-order optimization, avoiding the traditional high-overhead back-propagation, matrix decomposition, or in situ device-level intensity measurements. Experimental results demonstrate that the proposed on-chip learning framework provides an efficient solution for training integrated ONNs with 3∼4× fewer ONN forward passes, higher inference accuracy, and better variation-robustness than previous works.

I. INTRODUCTION

As Moore's Law winds down, it becomes challenging for digital electronics to meet the escalating computational demands of machine learning applications. In recent years, the optical neural network (ONN) has been demonstrated to be an emerging neuromorphic platform with ultra-low latency, high bandwidth, and high energy efficiency, making it a promising alternative to its traditional electronic counterpart. Computationally-intensive operations in neural networks, e.g., matrix multiplication, can be efficiently realized by optics at the speed of light [1]–[4]. Shen et al. [1] demonstrated a classical integrated fully-optical neural network platform that implements a multi-layer perceptron. Based on matrix singular value decomposition (SVD) and unitary parametrization [5], [6], the weight matrices are mapped onto cascaded Mach-Zehnder interferometer (MZI) arrays to realize ultra-fast neural network inference with over 100 GHz photo-detection rate and near-zero energy consumption [7].

However, training methodologies for integrated ONNs still have limited efficiency and performance. The mainstream approach is to offload the training process to pure software engines, obtain pre-trained weight matrices using the classical back-propagation (BP) algorithm, and then map the trained model onto photonic hardware through matrix decomposition and unitary parametrization [1], [2]. This traditional method benefits from off-the-shelf deep learning toolkits for easy training of weight matrices in a noise-free environment. Nevertheless, it inevitably bears several downsides in efficiency, performance, and robustness. First, pure software-based ONN training is still bottlenecked by the performance of digital computers, which are inefficient at emulating actual ONN architectures given the expensive computational cost of matrix decomposition and parametrization. Hence, there is great potential in fully utilizing ultra-fast photonic chips to accelerate the training process as well. Second, photonic chips are exposed to various non-ideal effects [1], [8]–[10], e.g., device manufacturing variations, limited phase-encoding precision, and thermal cross-talk; thus an ONN model trained by pure software methods can suffer severe performance degradation and poor variation-robustness for lack of accurate non-ideality modeling. Simply mapping the back-propagation algorithm onto photonic chips is conceptually straightforward but may not fully exploit the self-learning capability of ONN chips: back-propagation is technically difficult to implement on chip, especially considering the expensive hardware overhead and time-consuming matrix decomposition procedure. Also, decoupling the ONN hardware from the software training process precludes potential online learning applications.

A brute-force phase tuning algorithm is proposed and adopted in [1], [11] to perform ONN on-chip training. Each MZI phase is individually perturbed by a small value and then updated based on the function evaluation results. This greedy parameter-search algorithm is intractable when tuning a large number of phases and may not be robust enough to handle non-ideal variations. To mitigate the inefficiency of this brute-force algorithm, an in situ adjoint variable method (AVM) [12] has been applied to ONN on-chip training. This inverse-design method directly computes the gradient w.r.t. MZI phases by propagating through the chip back and forth several times. However, it strictly assumes the photonic system is fully observable and requires light-field intensity measurements on each device, which is technically challenging to scale to larger systems. Evolutionary algorithms, e.g., genetic algorithm (GA) and particle swarm optimization (PSO), have also been applied to train ONNs by emulating the biological evolution process [13] with a large population.

The above on-chip training protocols have only been demonstrated on small photonic systems with <100 MZIs. In this work, we propose a novel framework FLOPS that extends on-chip learning to ∼1000 MZIs with higher training efficiency, higher inference accuracy, and better robustness to non-ideal thermal variations. Compared with the previous state-of-the-art methods [1], [11], [13], our on-chip learning framework FLOPS has the following advantages.

• Efficiency: our learning method leverages stochastic zeroth-order optimization with a parallel mini-batch-based gradient estimator and achieves 3∼4× fewer ONN forward passes than previous methods.
• Accuracy: our optimization algorithm is extended with a light-weight second-stage learning procedure, SparseTune, that performs sparse phase tuning, achieving a further accuracy boost while maintaining the efficiency advantage.
• Robustness: our method is demonstrated to improve the test accuracy of ONNs under thermal cross-talk and produces better variation-robustness than previous methods.

The remainder of this paper is organized as follows. Section II introduces the ONN architecture and previous training methods. Section III discusses details of our proposed framework. Section IV analyzes FLOPS in the presence of thermal variations. Section V evaluates the accuracy, efficiency, and robustness improvements of FLOPS over previous works, followed by the conclusion in Section VI.
Fig. 1: Schematic of an MZI triangular array and a closeup view of the MZI structure.

II. PRELIMINARIES

In this section, we introduce the architecture of integrated ONNs, current ONN training methods, and background on stochastic zeroth-order optimization with gradient estimation.

A. ONN Architecture and Training Methods

The ONN is a photonic implementation of artificial neural networks. It implements matrix multiplication with optical interference units (OIUs) consisting of cascaded MZIs, shown in Fig. 1. Each MZI is a reconfigurable photonic component that produces interference of input light signals as follows [1],

\[
\begin{bmatrix} y_1 \\ y_2 \end{bmatrix} =
\begin{bmatrix} \cos\phi & \sin\phi \\ -\sin\phi & \cos\phi \end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \end{bmatrix}. \tag{1}
\]

The phase shift φ produced in the coupling region of an MZI can be achieved by an optical phase shifter. The triangular structure of an MZI array shown in Fig. 1 is proven to realize a unitary matrix based on unitary group parametrization [1], [5],

\[
U(n) = D \prod_{i=n}^{2} \prod_{j=1}^{i-1} R_{ij}, \tag{2}
\]

where the 2-dimensional planar rotator R_ij is an identity matrix except that the entries at (i, i), (i, j), (j, i), (j, j) are cos φ, sin φ, −sin φ, cos φ, respectively, and D is a diagonal matrix whose main-diagonal entries are −1 or 1. Each R_ij can be implemented by an MZI in the OIU. We denote all parametrized phases as a phase vector Φ. An m × n weight matrix W ∈ ℝ^{m×n} can be decomposed via singular value decomposition (SVD),

\[
W = U \Sigma V^{*}, \tag{3}
\]

where U ∈ ℝ^{m×m} and V* ∈ ℝ^{n×n} are square unitary matrices and Σ is an m × n non-negative diagonal matrix, so that the unitary factors can be realized by MZI arrays of the form in Eq. (2) together with a diagonal scaling stage for Σ. The brute-force on-chip training method [1], [11] perturbs and updates each MZI phase individually, costing one function evaluation (ONN forward propagation) per phase before all phases are updated once. Adjoint variable method (AVM) [12] is a recently proposed inverse-design method that computes the exact gradient based on in situ intensity measurements; it is intrinsically unscalable as it relies on expensive device-level signal detection. Evolutionary algorithms, e.g., PSO, have also been applied to search for optimal MZI configurations in the phase space [13]. However, a high-dimensional parameter space leads to optimization inefficiency and unsatisfactory optimality, as these methods typically require a large population and inevitably face premature convergence.

B. Optimization with Zeroth-Order Gradient Estimation

Zeroth-order optimization (ZOO) has received much attention recently [14], [15] as it enables effective optimization when the explicit gradient of the objective function is infeasible, e.g., in reinforcement learning and black-box adversarial attacks on DNNs. ZOO has gained momentum because it can handle higher-dimensional problems than traditional methods, e.g., Bayesian optimization, and can be integrated with extensive state-of-the-art first-order optimization algorithms, with the true gradient replaced by the approximated zeroth-order gradient [14], [15]. ZOO uses a Gaussian function approximation to estimate the local value of the original function f(θ) as follows,

\[
f_{\sigma}(\theta) = \mathbb{E}_{\Delta\theta}\big[ f(\theta + \Delta\theta) \big]
= \frac{1}{\sigma \sqrt{(2\pi)^{d}}} \int f(\theta + \Delta\theta)\, e^{-\frac{\|\Delta\theta\|_2^2}{2\sigma^2}}\, d\Delta\theta, \tag{4}
\]

where Δθ is a perturbation sampled from the Gaussian distribution N(0, σ²). Nesterov et al. [16] prove a bound on the difference between the true gradient and its Gaussian approximation if the original function f(θ) is β-smooth,

\[
\|\nabla f_{\sigma}(\theta) - \nabla f(\theta)\| \le \frac{\sigma}{2}\, \beta\, (d+3)^{3/2}, \quad \forall\, \theta \in \mathbb{R}^{d}. \tag{5}
\]

Based on this, we can approximate the true gradient ∇f(θ) by estimating the surrogate gradient ∇f_σ(θ) within the above error bound. To estimate ∇f_σ(θ), a symmetric difference quotient is typically used,

\[
\hat{\nabla} f(\theta) = \frac{1}{2\sigma^{2}} \big( f(\theta + \Delta\theta) - f(\theta - \Delta\theta) \big)\, \Delta\theta. \tag{6}
\]

Such a sampling-based first-order oracle is an unbiased estimator of ∇f_σ(θ).
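As a concrete illustration, the NumPy sketch below implements the symmetric-difference estimator of Eq. (6). It is a minimal sketch rather than the paper's actual training kernel: the callable f stands in for a black-box objective evaluated by ONN forward propagation, and the averaging over several sampled perturbations is a standard variance-reduction step we assume here, not something prescribed by Eq. (6) itself.

```python
import numpy as np

def zo_gradient(f, theta, sigma=0.05, n_samples=10, rng=None):
    """Zeroth-order gradient estimate of a black-box objective f at theta.

    Implements the symmetric difference quotient of Eq. (6); each sampled
    perturbation costs two function queries (two ONN forward passes), and
    the single-sample estimates are averaged over n_samples draws.
    """
    rng = np.random.default_rng() if rng is None else rng
    grad = np.zeros_like(theta)
    for _ in range(n_samples):
        # Delta-theta ~ N(0, sigma^2) elementwise, as in Eq. (4)
        delta = rng.normal(0.0, sigma, size=theta.shape)
        grad += (f(theta + delta) - f(theta - delta)) * delta / (2.0 * sigma**2)
    return grad / n_samples
```

The returned estimate can then drive any first-order update rule (plain SGD, Adam, etc.) with the true gradient replaced by the zeroth-order one, which is exactly the substitution described above.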
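The same machinery applies to the phase vector Φ of Section II-A. The sketch below composes a real unitary from planar rotators following Eq. (2) and tunes its phases with the estimator above on a toy objective. The helper names, the 4×4 size, and the target-vector loss are all illustrative assumptions; the paper's actual FLOPS procedure (parallel mini-batch estimation plus the second-stage SparseTune) is more involved.

```python
def rotator(n, i, j, phi):
    """Planar rotator R_ij of Eq. (2): identity except at rows/columns i, j."""
    R = np.eye(n)
    R[i, i] = R[j, j] = np.cos(phi)
    R[i, j] = np.sin(phi)
    R[j, i] = -np.sin(phi)
    return R

def unitary(phases, n, d_signs):
    """Compose U(n) = D * prod_{i=n..2} prod_{j=1..i-1} R_ij, Eq. (2), 0-indexed."""
    U = np.diag(d_signs).astype(float)
    k = 0
    for i in range(n - 1, 0, -1):          # i = n, ..., 2 in Eq. (2)
        for j in range(i):                 # j = 1, ..., i-1 in Eq. (2)
            U = U @ rotator(n, i, j, phases[k])
            k += 1
    return U

# Toy on-chip-style loop: steer a fixed input x toward a target t using only
# scalar objective evaluations, i.e., no analytical gradient of U is formed.
n = 4
rng = np.random.default_rng(0)
phases = rng.uniform(0.0, 2.0 * np.pi, n * (n - 1) // 2)   # one phase per MZI
x = np.array([1.0, 0.0, 0.0, 0.0])
t = np.array([0.0, 0.0, 0.0, 1.0])
loss = lambda p: float(np.sum((unitary(p, n, np.ones(n)) @ x - t) ** 2))

for _ in range(200):                       # plain SGD on the ZO gradient
    phases -= 0.1 * zo_gradient(loss, phases, sigma=0.05, n_samples=10, rng=rng)
print(f"final loss: {loss(phases):.4f}")
```

Because the objective is queried only as a scalar, this loop mirrors the on-chip setting, where gradients with respect to Φ are never computed analytically but inferred from perturbed forward evaluations.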