Large-Scale Optical Neural-Network Accelerators based on Coherent Detection

Ryan Hamerly, Alex Sludds, Liane Bernstein, Marin Soljacic, Dirk R. Englund
Research Laboratory of Electronics, MIT, 50 Vassar St, Cambridge, MA 02139

May 10, 2019

Applications of deep learning:

• Images & video games (Chess, Go, DOTA, ...)
• Control / autonomous vehicles / language processing
• Many other applications: marketing, finance, healthcare, fraud detection, counterintelligence, ...

Deep Learning = Deep Neural Networks

x(0) → A(1), f → x(1) → A(2), f → x(2) → ... → A(K), f → x(K)

A deep neural network consists of K steps, each containing: (1) a linear matrix-vector multiplication (synaptic connections); (2) a nonlinear activation function f, e.g. ReLU, sigmoid, or tanh.

def run_deepnet(x_in):
    x(0) = x_in
    for i in range(1, K+1):
        # synaptic connections
        y(i) = dot(A(i), x(i-1))
        # activation function
        x(i) = f(y(i))
    return x(K)

Hardware Accelerators for Deep Learning
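The pseudocode above can be made runnable; a minimal NumPy sketch (layer count, shapes, and random weights are illustrative only):

```python
import numpy as np

def relu(y):
    return np.maximum(y, 0.0)

def run_deepnet(x_in, A_list, f=relu):
    """Forward pass: K linear layers (matrix-vector products)
    interleaved with an elementwise nonlinearity f."""
    x = x_in
    for A in A_list:
        y = A @ x   # synaptic connections: O(N^2) MACs
        x = f(y)    # activation function: O(N) ops
    return x

# Example with K = 3 random layers (illustrative only)
rng = np.random.default_rng(0)
A_list = [rng.standard_normal((16, 16)) * 0.1 for _ in range(3)]
x_out = run_deepnet(rng.standard_normal(16), A_list)
```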

Google, Intel, Xilinx, Nvidia; Chen et al. (2016)

Three important facts guide the design of accelerators:
1. The key metric is power consumption (finite chip power limits performance).
   - State of the art is ~1 pJ/MAC (multiply-and-accumulate).
2. Linear matrix operations (synaptic connections) dominate: O(N²) MACs vs. O(N) nonlinear ops per layer.
   - Computation time / energy for the nonlinear NN step is negligible.
3. Data movement & memory access (rather than processing) dominate.
[Keckler, IEEE Micro 2011; Horowitz, ISSCC 2014; Sze, Proc. IEEE 2017]

Goal: build a photonic NN accelerator that beats CMOS (~1 pJ/MAC).

Previous work: Beamsplitter Mesh

Collaboration with Marin Soljacic Group, MIT


Advantages:

• Performs O(N²) MACs at an energy cost of O(N) for I/O
• Can be very fast, low latency
• Extendable to quantum applications

Problems:

• Chip area of the phase shifters for each MZI
• Number of phase shifters is O(N²)
• Error propagation limits depth of NN

Result: difficult to scale beyond ~100s of optical neurons

Y. Shen*, N. C. Harris*, S. Skirlo, M. Prabhu, T. Baehr-Jones, M. Hochberg, X. Sun, S. Zhao, H. Larochelle, D. Englund, and M. Soljačić, Nature Photonics 11 (2017). *equal authors

Goal: build a scalable photonic neural network

R. Hamerly, A. Sludds, L. Bernstein, M. Soljacic, and D. Englund, Phys. Rev. X (to appear) [1812.07614]

Optical multiplication (photoelectric, via free-space propagation) plus time multiplexing gives multiply-and-accumulate: a dot product.

Beamsplitter: inputs x, y → outputs (x+y)/√2, (x−y)/√2
Top PD:    I₊ = (x+y)²/2
Bottom PD: I₋ = (x−y)²/2
Difference: I₊ − I₋ = 2xy

Time multiplexing: send pulse trains x₁...x_N, y₁...y_N and integrate the difference current.
Output: Σ_n x_n y_n = dot(x, y). Works if (x, y) are non-binary too.
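A numerical check of the multiplier above (idealized NumPy model; field amplitudes treated as real numbers, detectors noiseless):

```python
import numpy as np

def photoelectric_dot(x, y):
    """Homodyne multiply-and-accumulate: a 50/50 beamsplitter produces
    (x+y)/sqrt(2) and (x-y)/sqrt(2); each photodiode measures intensity
    (field squared); the integrated difference current is sum_n 2*x_n*y_n,
    i.e. twice the dot product."""
    plus = (x + y) / np.sqrt(2)
    minus = (x - y) / np.sqrt(2)
    I_plus = plus**2    # top detector, per pulse
    I_minus = minus**2  # bottom detector, per pulse
    return np.sum(I_plus - I_minus) / 2  # (integrator output)/2 = dot(x, y)

x = np.array([0.5, -1.0, 2.0])
y = np.array([1.0, 0.25, -0.5])
```

As the slide notes, nothing here requires binary inputs: any real amplitudes work.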

The Photoelectric Effect

arXiv:1812.07614

Light incident on a metal generates a photocurrent: I(t) ~ |E(t)|². Light is an EM wave E(t); electrons are particles. Evidence of wave-particle duality; led to quantum mechanics. (A. Einstein, Nobel Prize 1921, "for his services to theoretical physics, especially for his discovery of the law of the photoelectric effect".)

This allows "squaring" of an optical field, since the photocurrent scales as |E(t)|²: a photodiode maps x → |x|². Combined with a beamsplitter (in: x, y → out: x+y, x−y), this gives "quantum photoelectric multiplication".

Input encoding: 0 → phase 0, 1 → phase π. Output encoding: 0 → +V, 1 → −V.

Quantum Photoelectric Multiplier

arXiv:1812.07614

Homodyne detector → optoelectronic multiplier. Time-encoded data → vector dot-product.

Extremely low power possible (limited by photodetector sensitivity).

Paradox: nonlinear optics is hard, and interference is linear; how can this compute a nonlinear function (a product)? Resolution: the scheme is opto-electronic, not all-optical. The photodetector's input-output relation is nonlinear.

(Same multiplier figure as the previous slide.)

Application to Optical Neural Network

(Figure: (a) the K-layer network x(0) → A(1), f → ... → A(K), f → x(K); (b) one optical layer: a fan-out makes N′ copies of the input x(k); weight pulses A_1(k), ..., A_N′(k) (amplitude-modulated) and the input copies meet at a beamsplitter combiner; each multiplier + integrator accumulates Σ_j A_ij x_j; a nonlinear function f and conversion back to optical produce the output x(k+1).)

Matrix-vector multiplication: y (N′×1) = A (N′×N) · x (N×1). N×N′ multiply-and-accumulate (MAC) ops performed.

Energy efficiency: energy cost scales as O(N) + O(N′) (I/O cost), i.e. O(1/N) + O(1/N′) per MAC.
Scalability: number of neurons limited by # detectors, potentially >10⁶. (Nobody uses fully-connected layers this big, but 10²-10⁴ is common in deep learning.)
Simplicity: only O(N) components needed (not O(N²)). No error-propagation issues.

Quantum Limits to Optical Analog Processing
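The layer above is just N′ homodyne dot products run in parallel; a noiseless NumPy sketch (shapes illustrative):

```python
import numpy as np

def optical_matvec(A, x):
    """Each row of A meets the fanned-out copy of x at a beamsplitter;
    one multiplier + integrator per output computes that row's dot
    product (idealized, noiseless model)."""
    outputs = []
    for row in A:                     # N' multiplier/integrator units
        plus = (row + x) / np.sqrt(2)   # beamsplitter outputs
        minus = (row - x) / np.sqrt(2)
        outputs.append(np.sum(plus**2 - minus**2) / 2)  # integrated diff.
    return np.array(outputs)

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 8))   # N' = 4, N = 8
x = rng.standard_normal(8)
y = optical_matvec(A, x)
```

Note the cost accounting: N×N′ MACs happen in the optical domain, while only N + N′ values cross the electro-optic boundary, which is the source of the O(1/N) + O(1/N′) energy per MAC.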

arXiv:1812.07614

• Another revelation of the photoelectric effect: quantum mechanics is stochastic. ("God does not play dice with the Universe." (A. Einstein))
• Emission of photoelectrons is a Poisson process; the number of photoelectrons follows a Poisson distribution:

    Q_e = Poisson(|E|²) ≈ |E|² + |E| × N(0, 1)

• This gives rise to quantum-limited "shot noise" in photodetectors, which leads to a Gaussian noise term in the neural network.
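A quick Monte-Carlo check of the Poisson/Gaussian shot-noise approximation above (photon number chosen arbitrarily for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)
n_mean = 400.0  # mean photon number |E|^2 per detection (arbitrary)
samples = rng.poisson(n_mean, size=100_000)

# Poisson(|E|^2) has mean |E|^2 and standard deviation |E| = sqrt(|E|^2),
# so Q_e ≈ |E|^2 + |E| * N(0, 1) for large photon numbers.
mean, std = samples.mean(), samples.std()

# Signal-to-noise ratio therefore grows as sqrt(photons per MAC):
snr = mean / std
```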

Quantum Limits to Optical NN

A noisy analog system: SNR is set by the number of photons per MAC, giving a quantum limit to performance from finite SNR: ~100 zJ/op (~10 exa-ops/W).

(Figure: benchmark with the MNIST dataset, input = image. Classification accuracy vs. photons per MAC: with many photons the SNR is large and accuracy approaches the digital NN limit; with too few photons it falls toward random guessing.)

10/14 Other Limits to Energy Consumption

If limited by I/O energy. 102-103 improve- State-of-the-art ~pJ/neuron reasonable ment vs. CMOS pos- CMOS electronics in near term sible with near-term (TPU, ASIC: technology ~pJ/MAC)

Thermodynamic limit for digital Useful Problems: processing (kTlog 2 N = 100’s-1000’s per gate, ~1000 gates/MAC)

Much more aggres- sive I/O estimate Quantum limit from modulator switching energies.

Optical GEMM

Simple extension: optical GEMM (matrix-matrix multiply) based on coherent detection. Up to 10³ × 10³ matrices possible via "patching" (most deep NNs are convolutional).

Machine learning applications:
• In-situ training / back-propagation
• Efficient calculation of convolutions

Complementary benefits of integrated & free-space optics:
• Integrated: phase stability, components on-chip (transmitters), large-scale detector arrays
• Free-space: 3rd dimension for optical fan-out, routing

(Figure: (a) inference Y = M₁M₂; (c) training/back-propagation, computing ∇_A L from X and ∇_Y L; (b, d) convolution as GEMM: the input image (C channels) is unrolled into a patch matrix X of size KₓK_yC × W′H′, the C′ kernels form a kernel matrix K of size C′ × KₓK_yC, and the convolution output is Y = K X. Hardware: pulse train, DAC, amplitude and phase modulators, grating couplers in 1D & 2D arrays.)

(Rui Tang, collaboration with AIM / AFRL)

GEMM for Training and Convolutions
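The patch-matrix construction in panel (d) is the standard im2col trick; a minimal NumPy sketch (stride 1, no padding; function name and shapes are illustrative):

```python
import numpy as np

def im2col_conv(image, kernels):
    """Convolution as GEMM: unroll image patches into a matrix X of shape
    (Kx*Ky*C, W'*H'), stack kernels into K of shape (C', Kx*Ky*C);
    the whole convolution is then a single matrix product Y = K @ X."""
    C, H, W = image.shape
    Cp, _, Kx, Ky = kernels.shape        # (C', C, Kx, Ky)
    Hp, Wp = H - Kx + 1, W - Ky + 1      # output size H' x W'
    cols = []
    for i in range(Hp):
        for j in range(Wp):
            cols.append(image[:, i:i+Kx, j:j+Ky].ravel())
    X = np.stack(cols, axis=1)           # patch matrix, (Kx*Ky*C, H'*W')
    K = kernels.reshape(Cp, -1)          # kernel matrix, (C', Kx*Ky*C)
    Y = K @ X                            # one GEMM (the optical step)
    return Y.reshape(Cp, Hp, Wp)

rng = np.random.default_rng(2)
img = rng.standard_normal((3, 8, 8))     # C=3 channels, 8x8 image
ker = rng.standard_normal((4, 3, 3, 3))  # C'=4 kernels, 3x3 each
out = im2col_conv(img, ker)
```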

• Many NNs have a large number of convolutional layers.
• AlexNet on the optical NN: simulations show the quantum limit is ~few aJ/MAC.

(Figure: AlexNet energy per MAC, layer by layer, CONV and FC layers (axis ticks ~2-3 aJ/MAC); other panels repeat the inference / training / patching schematic from the previous slide.)

New application: Optical Ising Machines

An optical system to solve combinatorial optimization problems. Beats the D-Wave quantum annealer by >10⁶× (dense MAX-CUT).

(Figure: a pulsed laser drives SHG (χ⁽²⁾, PPLN) to pump a fiber OPO carrying time-multiplexed signal pulses a₁ ... a_N (pump 2ω → signal ω+ω); measurement feedback via FPGA with intensity and phase modulators. Below threshold: squeezed vacuum. At threshold: bifurcation. Above threshold: oscillation with phase 0 or π, encoding Ising spins.)

Scalable to ~10,000s of neurons. Problems:
• Phase stability over ≥ km of fiber
• FPGA coupling not scalable, power-hungry

Use optical NN hardware as an Ising machine instead?

R. Hamerly, T. Inagaki, P. McMahon et al., Science Advances (in press) [arXiv:1805.05217]
C. R-Carmes, Y. Shen, C. Zanoci et al., submitted [arXiv:1811.02705]
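For context, the MAX-CUT-to-Ising mapping such machines exploit, checked by brute force on a toy graph (illustrative only; real machines search this landscape physically, not exhaustively):

```python
import itertools
import numpy as np

def maxcut_by_ising(J):
    """MAX-CUT on a graph with weight matrix J maps to minimizing the
    Ising energy E(s) = sum_{i<j} J_ij s_i s_j over spins s_i = ±1
    (each spin says which side of the cut its vertex is on)."""
    n = len(J)
    best_spins, best_energy = None, np.inf
    for spins in itertools.product([-1, 1], repeat=n):
        s = np.array(spins)
        energy = np.sum(np.triu(J, 1) * np.outer(s, s))
        if energy < best_energy:
            best_spins, best_energy = s, energy
    cut = sum(J[i][j] for i in range(n) for j in range(i + 1, n)
              if best_spins[i] != best_spins[j])
    return best_spins, cut

# 4-cycle graph: max cut = 4 (alternate the vertices)
J = np.array([[0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0]])
spins, cut = maxcut_by_ising(J)
```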

Conclusion

Time-Multiplexed Optical Neural Networks

• Growing need for NN accelerators
  - Energy: CMOS ASIC state of the art = 1 pJ/MAC
  - Nanophotonic solutions hard to scale
• Photoelectric multiplication
  - Time-multiplexed data
  - Opto-electronic nonlinearity
• Limits to performance
  - Standard quantum limit: ~100 zJ/MAC
  - Beating the Landauer limit
  - Realistic near-term technology: ~fJ/MAC

Acknowledgements: (R.H.) IC Postdoctoral Research Fellowship at MIT, administered by ORISE through U.S. DOE and ODNI. (L.B.) NSERC Doctoral Fellowship. (A.S.) NSF Graduate Research Fellowship. (D.E.) U.S. ARO / ISN no. W911NF-18-2-0048.

Papers:
• R. Hamerly*, A. Sludds, L. Bernstein et al., "Large-Scale Optical Neural Networks based on Photoelectric Multiplication", Phys. Rev. X (in press), arXiv:1812.07614
• R. Hamerly*, T. Inagaki, P. McMahon et al., "Experimental investigation of performance differences between Coherent Ising Machines and a Quantum Annealer", Science Advances (in press), arXiv:1805.05217

*[email protected]