Large-Scale Optical Neural-Network Accelerators based on Coherent Detection
Ryan Hamerly, Alex Sludds, Liane Bernstein, Marin Soljacic, Dirk R. Englund Research Laboratory of Electronics, MIT, 50 Vassar St, Cambridge, MA 02139
May 10, 2019

Deep Learning
Images & video games (chess, Go, DOTA...) [ImageNet]
Control / autonomous vehicles; speech recognition / language processing
Many other applications: marketing, finance, healthcare, fraud detection, counterintelligence...

1/14 Deep Learning = Deep Neural Networks
[Figure: deep network diagram, x(0) → x(1) → x(2) → ... → x(K), alternating matrix multiplications A(1) ... A(K) with nonlinearities f]
A deep neural network consists of K steps, each step containing:
(1) Linear matrix-vector multiplication (gray) — the synaptic connections
(2) Nonlinear activation function f(y) (red), e.g. ReLU, sigmoid, tanh

def run_deepnet(x_in):
    x(0) = x_in
    for i in range(1, K+1):
        y(i) = dot(A(i), x(i-1))   # synaptic connections
        x(i) = f(y(i))             # activation function
    return x(K)

2/14 Hardware Accelerators for Deep Learning
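The forward pass sketched above can be written as runnable NumPy; this is a minimal sketch with hypothetical layer sizes, not the accelerator itself.

```python
import numpy as np

def relu(y):
    return np.maximum(y, 0)

def run_deepnet(x_in, A_list, f=relu):
    """Forward pass: alternate matrix-vector products (synapses)
    with an elementwise nonlinearity (activation)."""
    x = x_in
    for A in A_list:        # one iteration per layer, k = 1..K
        y = A @ x           # linear step: y(k) = A(k) x(k-1)
        x = f(y)            # nonlinear step: x(k) = f(y(k))
    return x

# Hypothetical layer sizes: 4 -> 3 -> 2
rng = np.random.default_rng(1)
A_list = [rng.standard_normal((3, 4)), rng.standard_normal((2, 3))]
out = run_deepnet(rng.standard_normal(4), A_list)
```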
[Logos: Google, Intel, Xilinx, Nvidia; Chen et al. (2016)]

Three important facts guide the design of accelerators:
1. Key metric is power consumption (limits performance due to finite chip power).
   - State of the art is ~1 pJ/MAC (multiply-and-accumulate).
2. Linear matrix operations (synaptic connections) dominate: O(N²) MACs vs. O(N) activations.
   - Computation time / energy for the nonlinear step (activation function) is negligible.
3. Data movement & memory access (rather than processing) dominate.
[Keckler, IEEE Micro 2011; Horowitz, ISSCC 2014; Sze, Proc. IEEE 2017]

Goal: build a photonic NN accelerator that beats CMOS (~1 pJ/MAC).

3/14 Previous work: Beamsplitter Mesh
Collaboration with Marin Soljacic Group, MIT
[Figure: mesh of Mach-Zehnder interferometers mapping optical inputs to outputs]

Advantages:
• Performs O(N²) MACs at an energy cost O(N) for I/O
• Can be very fast, low latency
• Extendable to quantum applications

Problems:
• Chip area of phase shifters for each MZI
• Number of phase shifters is O(N²)
• Error propagation limits depth of NN

Result: difficult to scale beyond ~100s of optical neurons
Y Shen*, N C. Harris*, S Skirlo, M Prabhu, T Baehr-Jones, M Hochberg, X Sun, S Zhao, H Larochelle, D Englund, and M Soljacić, Nature Photonics 11 (2017). *equal authors 4/14 Goal: build scalable photonic neural network
R. Hamerly, A. Sludds, L. Bernstein, M. Soljacic, and D. Englund, Phys. Rev. X (to appear) [arXiv:1812.07614]
Optical multiplication → multiply-and-accumulate → dot product, via time multiplexing and free-space optical propagation.

A 50/50 beamsplitter takes inputs x and y to outputs (x+y)/√2 and (x−y)/√2.
Top photodiode: I₊ = (x+y)²/2
Bottom photodiode: I₋ = (x−y)²/2
Difference: I₊ − I₋ = 2xy
Time-multiplexing the pairs (xₙ, yₙ) and integrating the difference current gives the dot product: Σₙ xₙyₙ = dot(x, y). Works if (x, y) are non-binary too.
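A minimal numerical sketch of this scheme (function name hypothetical): interfere x and y on a balanced beamsplitter, square at each photodiode, and integrate the difference current.

```python
import numpy as np

def photoelectric_dot(x, y):
    """Balanced-homodyne dot product: interfere x and y on a 50/50
    beamsplitter, detect both outputs, integrate the difference."""
    plus = (x + y) / np.sqrt(2)      # beamsplitter output 1
    minus = (x - y) / np.sqrt(2)     # beamsplitter output 2
    i_plus = plus**2                 # photocurrent ~ |E|^2
    i_minus = minus**2
    # per time slot: i_plus - i_minus = 2 x_n y_n
    return np.sum(i_plus - i_minus) / 2.0

x = np.array([0.3, -1.2, 0.7, 2.0])
y = np.array([1.1, 0.4, -0.5, 0.9])
```

Note that the scheme works for arbitrary (non-binary) amplitudes, exactly as the slide claims.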
5/14 The Photoelectric Effect
arXiv:1812.07614
Light incident on a metal generates a photocurrent. Light is an EM wave E(t); electrons are particles. Evidence of wave-particle duality; led to quantum mechanics.

[Figure: field E(t) ejecting photoelectrons; photocurrent I(t) ~ |E(t)|²]

A. Einstein (Nobel Prize 1921 "for his services to theoretical physics, especially for his discovery of the law of the photoelectric effect")

This allows "squaring" of an optical field, since the photocurrent scales as |E(t)|²: "quantum photoelectric multiplication".
Input encoding: 0 → phase 0, 1 → phase π. Photodiode: in x, out |x|².
Beamsplitter: in (x, y), out (x+y, x−y). Output encoding: 0 → +V, 1 → −V.

6/14 Quantum Photoelectric Multiplier
arXiv:1812.07614
Homodyne detector → optoelectronic multiplier
Time-encoded data → vector dot-product
Extremely low power possible (limited by photodetector sensitivity)
Paradox: nonlinear optics is hard, and interference is linear, so how does this compute a nonlinear function (the product)? Resolution: the system is opto-electronic, not all-optical; the photodetector's input-output relation is nonlinear.
[Figure repeated from slide 5: beamsplitter multiplication and time-multiplexed dot product]
7/14 Application to Optical Neural Network
[Figure: (a) network diagram, alternating weight matrices A(1) ... A(K) with nonlinearities f, mapping x(1) to the output x(K); (b) optical implementation: source, fan-out producing copies of the input x(k), amplitude modulation by the weight rows A₁(k) ... A_N'(k), combiner (beamsplitter), multiplier/integrator, nonlinear function, conversion back to optical for the next layer]
Matrix-vector multiplication y(N'×1) = A(N'×N) · x(N×1): N×N' multiply-and-accumulate (MAC) ops performed.
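This layer amounts to repeating the homodyne dot product once per output neuron; a sketch assuming ideal, noiseless detection (function name hypothetical):

```python
import numpy as np

def optical_matvec(A, x):
    """Matrix-vector product via per-row balanced homodyne detection.
    Each of the N' receivers interferes the fanned-out input x with
    one weight row of A and integrates the difference photocurrent."""
    y = np.zeros(A.shape[0])
    for i, row in enumerate(A):
        i_plus = (row + x)**2 / 2      # top photodiode: (a_n + x_n)^2 / 2
        i_minus = (row - x)**2 / 2     # bottom photodiode: (a_n - x_n)^2 / 2
        y[i] = np.sum(i_plus - i_minus) / 2   # integrator: sum of a_n x_n
    return y

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 5))   # N' = 3 outputs, N = 5 inputs
x = rng.standard_normal(5)
```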
Energy efficiency: energy cost scales as O(N) + O(N') (I/O cost), i.e. O(1/N) + O(1/N') per MAC.
Scalability: number of neurons limited by # detectors, potentially >10⁶. (Nobody uses fully-connected layers this big, but 10²-10⁴ is common in deep learning.)
Simplicity: only O(N) components needed (not O(N²)). No error-propagation issues.

8/14 Quantum Limits to Optical Analog Processing
arXiv:1812.07614
• Another revelation of the photoelectric effect: quantum mechanics is stochastic.
• Emission of photoelectrons is a Poisson process; the number of photoelectrons follows a Poisson distribution:

  Q = Poisson(|E|²) ≈ |E|² + |E| × N(0, 1)

• This gives rise to quantum-limited "shot noise" in photodetectors, leading to a Gaussian noise term in the neural network.

"God does not play dice with the Universe." — A. Einstein
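A quick Monte-Carlo check of the Gaussian approximation above (assuming an ideal detector with unit quantum efficiency and a hypothetical mean of 400 photons per window): draw photoelectron counts from Poisson(|E|²) and compare their mean and spread against |E|² and |E|.

```python
import numpy as np

rng = np.random.default_rng(0)
mean_photons = 400.0                 # |E|^2, photons per detection window
E = np.sqrt(mean_photons)            # field amplitude, |E| = 20

# Photoelectron counts are Poisson-distributed with mean |E|^2
counts = rng.poisson(mean_photons, size=100_000)

# Gaussian approximation: Q ~ |E|^2 + |E| * N(0, 1)
print(counts.mean())   # ~ 400 = |E|^2
print(counts.std())    # ~ 20  = |E|   (shot noise)
```

The signal-to-noise ratio is thus |E|²/|E| = |E|, i.e. SNR grows as the square root of the photon number, which is what sets the quantum limit on the next slide.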
9/14 Quantum Limits to Optical NN
Noisy analog system: SNR is set by the number of photons per MAC. Finite SNR sets a quantum limit to performance:

~100 zJ/op (~10 exa-ops/W)

Benchmark with MNIST dataset (input: image).
[Plot: classification accuracy vs. photons per MAC; random guess at few photons, approaching the digital-NN limit at many photons (large SNR)]
10/14 Other Limits to Energy Consumption
State-of-the-art CMOS electronics (TPU, ASIC): ~pJ/MAC.
If limited by I/O energy: ~pJ/neuron is reasonable in the near term; a 10²-10³ improvement vs. CMOS is possible with near-term technology.
Thermodynamic limit for digital processing (kT log 2 per bit; real gates dissipate ~100s-1000s of kT; ~1000 gates/MAC).
Quantum limit; a much more aggressive I/O estimate follows from modulator switching energies.
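For scale, the thermodynamic floor above can be computed directly; a back-of-envelope sketch assuming room temperature and the deck's rough ~1000 gates/MAC figure:

```python
import math

k_B = 1.380649e-23        # Boltzmann constant, J/K
T = 300.0                 # room temperature, K
gates_per_mac = 1000      # rough digital gate count per MAC (slide estimate)

# Landauer limit: kT log 2 dissipated per bit erased
landauer_per_bit = k_B * T * math.log(2)        # ~2.9e-21 J = ~2.9 zJ
floor_per_mac = landauer_per_bit * gates_per_mac  # ~2.9e-18 J = ~3 aJ/MAC

print(f"{landauer_per_bit:.2e} J per bit erased")
print(f"{floor_per_mac:.2e} J per MAC")
```

Since the analog optical scheme performs no per-gate bit erasures, its ~100 zJ/MAC quantum limit can sit below this digital floor, which is the point of the "Beating the Landauer Limit" claim in the conclusion.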
11/14 Optical GEMM
Simple extension: optical GEMM (matrix-matrix multiply) based on coherent detection. Up to 10³ × 10³ matrices possible.

Machine learning applications:
• In-situ training / back-propagation
• Efficient calculation of convolutions via "patching" (most deep NNs are convolutional)

Complementary benefits of integrated & free-space optics:
• Integrated: phase stability, many components on-chip (transmitters)
• Free-space: 3rd dimension for optical fan-out, routing; large-scale detector arrays
[Figure: (a) inference as a matrix product Y = M₁X (m × k); (c) training/back-propagation as matrix products involving Mᵀ, ∇_Y L, ∇_X L, ∇_A L; (b) patching: input image (H × W × C) rearranged into a patch matrix X (patch 1, patch 2, ... patch W'H'), kernels (kernel 1 ... kernel C', each Kx × Ky × C) flattened into a kernel matrix K so that Y = K X; (d) hardware: pulse train, DAC, amplitude and phase modulators, grating couplers (1D & 2D arrays), routing, light in]

(Rui Tang, collaboration with AIM / AFRL)

12/14 GEMM for Training and Convolutions
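The patching scheme reduces convolution to a single GEMM, Y = K X; this is the standard im2col trick. A minimal single-channel sketch (sizes hypothetical, stride 1, no kernel flip):

```python
import numpy as np

def im2col(image, k):
    """Rearrange all k x k patches of a 2D image into the columns
    of a matrix, one column per output pixel."""
    H, W = image.shape
    cols = []
    for i in range(H - k + 1):
        for j in range(W - k + 1):
            cols.append(image[i:i+k, j:j+k].ravel())
    return np.array(cols).T          # shape (k*k, H'*W')

rng = np.random.default_rng(2)
image = rng.standard_normal((6, 6))
kernel = rng.standard_normal((3, 3))

# Convolution (valid mode) as one matrix-matrix multiply: Y = K X
X = im2col(image, 3)                 # patch matrix, (9, 16)
K = kernel.ravel()[None, :]          # kernel matrix, one row per kernel
Y = (K @ X).reshape(4, 4)            # 4 x 4 output feature map
```

With C' kernels, K simply gains C' rows, and the whole convolutional layer is still one GEMM, which is what the optical matrix-matrix multiplier accelerates.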
• Many NNs consist mostly of convolutional layers
• AlexNet on the optical NN: simulations show the quantum limit is ~a few aJ/MAC
[Figure repeated from slide 12: (a) inference, (c) training/back-propagation, (b) patching, (d) hardware; plot of aJ/MAC across AlexNet CONV and FC layers]

13/14 New application: Optical Ising Machines
An optical system to solve combinatorial optimization problems. Beats the D-Wave quantum annealer by >10⁶× (dense MAX-CUT).
[Figure: pulsed laser pump (2ω → ω+ω via SHG) drives an OPO (χ⁽²⁾ PPLN) producing signal pulses a₁ ... a_N with phases 0/π; measurement-feedback loop via IM/PM modulators and an FPGA. Phase-space diagrams (Im[aᵢ] vs. Re[aᵢ]) with increasing pump power: below threshold, squeezed vacuum; at threshold, bifurcation; above threshold, oscillation]
Scalable to ~10,000s of neurons. Problems:
• Phase stability over ≥km of fiber
• FPGA coupling not scalable, power-hungry
Use optical NN hardware as an Ising machine instead?

R. Hamerly, T. Inagaki, P. McMahon et al., Science Advances (in press) [arXiv:1805.05217]
C. R-Carmes, Y. Shen, C. Zanoci et al., submitted [arXiv:1811.02705]
14/14 Conclusion
Time-Multiplexed Optical Neural Networks Acknowledgements
• Growing need for NN accelerators
  - Energy: CMOS ASIC state of the art = 1 pJ/MAC
  - Nanophotonic solutions hard to scale
• Photoelectric multiplication
  - Time-multiplexed data
  - Opto-electronic nonlinearity
• Limits to performance
  - Standard quantum limit: ~100 zJ/MAC
  - Beating the Landauer limit
  - Realistic near-term technology: ~fJ/MAC

Acknowledgements:
(R.H.) IC Postdoctoral Research Fellowship at MIT, administered by ORISE through U.S. DOE and ODNI. (L.B.) NSERC Doctoral Fellowship. (A.S.) NSF Graduate Research Fellowship. (D.E.) U.S. ARO / ISN no. W911NF-18-2-0048.

Papers:
• R. Hamerly*, A. Sludds, L. Bernstein et al., "Large-Scale Optical Neural Networks based on Photoelectric Multiplication", Phys. Rev. X (in press), arXiv:1812.07614
• R. Hamerly*, T. Inagaki, P. McMahon et al., "Experimental investigation of performance differences between Coherent Ising Machines and a Quantum Annealer", Science Advances (in press), arXiv:1805.05217

[Figure repeated from slide 8: time-multiplexed optical neural network architecture]

*[email protected]