Large-Scale Optical Neural-Network Accelerators based on Coherent Detection

Ryan Hamerly, Alex Sludds, Liane Bernstein, Marin Soljacic, Dirk R. Englund
Research Laboratory of Electronics, MIT, 50 Vassar St, Cambridge, MA 02139

May 10, 2019

Applications of deep learning:

• Images & video games (Chess, Go, DOTA, ...)
• Control / autonomous vehicles / language processing
• Many other applications: marketing, finance, healthcare, fraud detection, counterintelligence, ...

Deep Learning = Deep Neural Networks

x(0) → A(1), f → x(1) → A(2), f → x(2) → ... → A(K), f → x(K)

A deep neural network consists of K steps, each containing: (1) a linear matrix-vector multiplication (synaptic connections); (2) a nonlinear activation function f, e.g. ReLU, sigmoid, or tanh.

def run_deepnet(x_in):
    x(0) = x_in
    for i in range(1, K+1):
        # synaptic connections
        y(i) = dot(A(i), x(i-1))
        # activation function
        x(i) = f(y(i))
    return x(K)

Hardware Accelerators for Deep Learning
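The pseudocode above can be made runnable; a minimal NumPy sketch (layer count, shapes, and random weights are illustrative only):

```python
import numpy as np

def relu(y):
    return np.maximum(y, 0.0)

def run_deepnet(x_in, A_list, f=relu):
    """Forward pass: K linear layers (matrix-vector products)
    interleaved with an elementwise nonlinearity f."""
    x = x_in
    for A in A_list:
        y = A @ x   # synaptic connections: O(N^2) MACs
        x = f(y)    # activation function: O(N) ops
    return x

# Example with K = 3 random layers (illustrative only)
rng = np.random.default_rng(0)
A_list = [rng.standard_normal((16, 16)) * 0.1 for _ in range(3)]
x_out = run_deepnet(rng.standard_normal(16), A_list)
```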

Google, Intel, Xilinx, Nvidia; Chen et al. (2016)

Three important facts guide the design of accelerators:
1. The key metric is power consumption (finite chip power limits performance).
   - State of the art is ~1 pJ/MAC (multiply-and-accumulate).
2. Linear matrix operations (synaptic connections) dominate: O(N²) MACs vs. O(N) nonlinear ops per layer.
   - Computation time / energy for the nonlinear NN step is negligible.
3. Data movement & memory access (rather than processing) dominate.
[Keckler, IEEE Micro 2011; Horowitz, ISSCC 2014; Sze, Proc. IEEE 2017]

Goal: build a photonic NN accelerator that beats CMOS (~1 pJ/MAC).

Previous work: Beamsplitter Mesh

Collaboration with Marin Soljacic Group, MIT


Advantages:

• Performs O(N²) MACs at an energy cost of O(N) for I/O
• Can be very fast, low latency
• Extendable to quantum applications

Problems:

• Chip area of the phase shifters for each MZI
• Number of phase shifters is O(N²)
• Error propagation limits depth of NN

Result: difficult to scale beyond ~100s of optical neurons

Y. Shen*, N. C. Harris*, S. Skirlo, M. Prabhu, T. Baehr-Jones, M. Hochberg, X. Sun, S. Zhao, H. Larochelle, D. Englund, and M. Soljačić, Nature Photonics 11 (2017). *equal authors

Goal: build a scalable photonic neural network

R. Hamerly, A. Sludds, L. Bernstein, M. Soljacic, and D. Englund, Phys. Rev. X (to appear) [1812.07614]

Optical multiplication (photoelectric, via free-space propagation) plus time multiplexing gives multiply-and-accumulate: a dot product.

Beamsplitter: inputs x, y → outputs (x+y)/√2, (x−y)/√2
Top PD:    I₊ = (x+y)²/2
Bottom PD: I₋ = (x−y)²/2
Difference: I₊ − I₋ = 2xy

Time multiplexing: send pulse trains x₁...x_N, y₁...y_N and integrate the difference current.
Output: Σ_n x_n y_n = dot(x, y). Works if (x, y) are non-binary too.
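A numerical check of the multiplier above (idealized NumPy model; field amplitudes treated as real numbers, detectors noiseless):

```python
import numpy as np

def photoelectric_dot(x, y):
    """Homodyne multiply-and-accumulate: a 50/50 beamsplitter produces
    (x+y)/sqrt(2) and (x-y)/sqrt(2); each photodiode measures intensity
    (field squared); the integrated difference current is sum_n 2*x_n*y_n,
    i.e. twice the dot product."""
    plus = (x + y) / np.sqrt(2)
    minus = (x - y) / np.sqrt(2)
    I_plus = plus**2    # top detector, per pulse
    I_minus = minus**2  # bottom detector, per pulse
    return np.sum(I_plus - I_minus) / 2  # (integrator output)/2 = dot(x, y)

x = np.array([0.5, -1.0, 2.0])
y = np.array([1.0, 0.25, -0.5])
```

As the slide notes, nothing here requires binary inputs: any real amplitudes work.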

The Photoelectric Effect

arXiv:1812.07614

Light incident on a metal generates a photocurrent: I(t) ~ |E(t)|². Light is an EM wave E(t); electrons are particles. Evidence of wave-particle duality; led to quantum mechanics. (A. Einstein, Nobel Prize 1921, "for his services to theoretical physics, especially for his discovery of the law of the photoelectric effect".)

This allows "squaring" of an optical field, since the photocurrent scales as |E(t)|²: a photodiode maps x → |x|². Combined with a beamsplitter (in: x, y → out: x+y, x−y), this gives "quantum photoelectric multiplication".

Input encoding: 0 → phase 0, 1 → phase π. Output encoding: 0 → +V, 1 → −V.

Quantum Photoelectric Multiplier

arXiv:1812.07614

Homodyne detector → optoelectronic multiplier. Time-encoded data → vector dot-product.

Extremely low power possible (limited by photodetector sensitivity).

Paradox: nonlinear optics is hard, and interference is linear; how can this compute a nonlinear function (a product)? Resolution: the scheme is opto-electronic, not all-optical. The photodetector's input-output relation is nonlinear.

(Same multiplier figure as the previous slide.)

Application to Optical Neural Network

(Figure: (a) the K-layer network x(0) → A(1), f → ... → A(K), f → x(K); (b) one optical layer: a fan-out makes N′ copies of the input x(k); weight pulses A_1(k), ..., A_N′(k) (amplitude-modulated) and the input copies meet at a beamsplitter combiner; each multiplier + integrator accumulates Σ_j A_ij x_j; a nonlinear function f and conversion back to optical produce the output x(k+1).)

Matrix-vector multiplication: y (N′×1) = A (N′×N) · x (N×1). N×N′ multiply-and-accumulate (MAC) ops performed.

Energy efficiency: energy cost scales as O(N) + O(N′) (I/O cost), i.e. O(1/N) + O(1/N′) per MAC.
Scalability: number of neurons limited by # detectors, potentially >10⁶. (Nobody uses fully-connected layers this big, but 10²-10⁴ is common in deep learning.)
Simplicity: only O(N) components needed (not O(N²)). No error-propagation issues.

Quantum Limits to Optical Analog Processing
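The layer above is just N′ homodyne dot products run in parallel; a noiseless NumPy sketch (shapes illustrative):

```python
import numpy as np

def optical_matvec(A, x):
    """Each row of A meets the fanned-out copy of x at a beamsplitter;
    one multiplier + integrator per output computes that row's dot
    product (idealized, noiseless model)."""
    outputs = []
    for row in A:                     # N' multiplier/integrator units
        plus = (row + x) / np.sqrt(2)   # beamsplitter outputs
        minus = (row - x) / np.sqrt(2)
        outputs.append(np.sum(plus**2 - minus**2) / 2)  # integrated diff.
    return np.array(outputs)

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 8))   # N' = 4, N = 8
x = rng.standard_normal(8)
y = optical_matvec(A, x)
```

Note the cost accounting: N×N′ MACs happen in the optical domain, while only N + N′ values cross the electro-optic boundary, which is the source of the O(1/N) + O(1/N′) energy per MAC.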

arXiv:1812.07614

• Another revelation of the photoelectric effect: quantum mechanics is stochastic. ("God does not play dice with the Universe." (A. Einstein))
• Emission of photoelectrons is a Poisson process; the number of photoelectrons follows a Poisson distribution:

    Q_e = Poisson(|E|²) ≈ |E|² + |E| × N(0, 1)

• This gives rise to quantum-limited "shot noise" in photodetectors, which leads to a Gaussian noise term in the neural network.
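A quick Monte-Carlo check of the Poisson/Gaussian shot-noise approximation above (photon number chosen arbitrarily for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)
n_mean = 400.0  # mean photon number |E|^2 per detection (arbitrary)
samples = rng.poisson(n_mean, size=100_000)

# Poisson(|E|^2) has mean |E|^2 and standard deviation |E| = sqrt(|E|^2),
# so Q_e ≈ |E|^2 + |E| * N(0, 1) for large photon numbers.
mean, std = samples.mean(), samples.std()

# Signal-to-noise ratio therefore grows as sqrt(photons per MAC):
snr = mean / std
```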

Quantum Limits to Optical NN

A noisy analog system: SNR is set by the number of photons per MAC, giving a quantum limit to performance from finite SNR: ~100 zJ/op (~10 exa-ops/W).

(Figure: benchmark with the MNIST dataset, input = image. Classification accuracy vs. photons per MAC: with many photons the SNR is large and accuracy approaches the digital NN limit; with too few photons it falls toward random guessing.)

10/14 Other Limits to Energy Consumption

If limited by I/O energy. 102-103 improve- State-of-the-art ~pJ/neuron reasonable ment vs. CMOS pos- CMOS electronics in near term sible with near-term (TPU, ASIC: technology ~pJ/MAC)

Thermodynamic limit for digital Useful Problems: processing (kTlog 2 N = 100’s-1000’s per gate, ~1000 gates/MAC)

Much more aggres- sive I/O estimate Quantum limit from modulator switching energies.

Optical GEMM

Simple extension: optical GEMM (matrix-matrix multiply) based on coherent detection. Up to 10³ × 10³ matrices possible via "patching" (most deep NNs are convolutional).

Machine learning applications:
• In-situ training / back-propagation
• Efficient calculation of convolutions

Complementary benefits of integrated & free-space optics:
• Integrated: phase stability, components on-chip (transmitters), large-scale detector arrays
• Free-space: 3rd dimension for optical fan-out, routing

(Figure: (a) inference Y = M₁M₂; (c) training/back-propagation, computing ∇_A L from X and ∇_Y L; (b, d) convolution as GEMM: the input image (C channels) is unrolled into a patch matrix X of size KₓK_yC × W′H′, the C′ kernels form a kernel matrix K of size C′ × KₓK_yC, and the convolution output is Y = K X. Hardware: pulse train, DAC, amplitude and phase modulators, grating couplers in 1D & 2D arrays.)

(Rui Tang, collaboration with AIM / AFRL)

GEMM for Training and Convolutions
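The patch-matrix construction in panel (d) is the standard im2col trick; a minimal NumPy sketch (stride 1, no padding; function name and shapes are illustrative):

```python
import numpy as np

def im2col_conv(image, kernels):
    """Convolution as GEMM: unroll image patches into a matrix X of shape
    (Kx*Ky*C, W'*H'), stack kernels into K of shape (C', Kx*Ky*C);
    the whole convolution is then a single matrix product Y = K @ X."""
    C, H, W = image.shape
    Cp, _, Kx, Ky = kernels.shape        # (C', C, Kx, Ky)
    Hp, Wp = H - Kx + 1, W - Ky + 1      # output size H' x W'
    cols = []
    for i in range(Hp):
        for j in range(Wp):
            cols.append(image[:, i:i+Kx, j:j+Ky].ravel())
    X = np.stack(cols, axis=1)           # patch matrix, (Kx*Ky*C, H'*W')
    K = kernels.reshape(Cp, -1)          # kernel matrix, (C', Kx*Ky*C)
    Y = K @ X                            # one GEMM (the optical step)
    return Y.reshape(Cp, Hp, Wp)

rng = np.random.default_rng(2)
img = rng.standard_normal((3, 8, 8))     # C=3 channels, 8x8 image
ker = rng.standard_normal((4, 3, 3, 3))  # C'=4 kernels, 3x3 each
out = im2col_conv(img, ker)
```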

• Many NNs have a large number of convolutional layers.
• AlexNet on the optical NN: simulations show the quantum limit is ~few aJ/MAC.

(Figure: AlexNet energy per MAC, layer by layer, CONV and FC layers (axis ticks ~2-3 aJ/MAC); other panels repeat the inference / training / patching schematic from the previous slide.)

New application: Optical Ising Machines

An optical system to solve combinatorial optimization problems. Beats the D-Wave quantum annealer by >10⁶× (dense MAX-CUT).

(Figure: a pulsed laser drives SHG (χ⁽²⁾, PPLN) to pump a fiber OPO carrying time-multiplexed signal pulses a₁ ... a_N (pump 2ω → signal ω+ω); measurement feedback via FPGA with intensity and phase modulators. Below threshold: squeezed vacuum. At threshold: bifurcation. Above threshold: oscillation with phase 0 or π, encoding Ising spins.)

Scalable to ~10,000s of neurons. Problems:
• Phase stability over ≥ km of fiber
• FPGA coupling not scalable, power-hungry

Use optical NN hardware as an Ising machine instead?

R. Hamerly, T. Inagaki, P. McMahon et al., Science Advances (in press) [arXiv:1805.05217]
C. R-Carmes, Y. Shen, C. Zanoci et al., submitted [arXiv:1811.02705]
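For context, the MAX-CUT-to-Ising mapping such machines exploit, checked by brute force on a toy graph (illustrative only; real machines search this landscape physically, not exhaustively):

```python
import itertools
import numpy as np

def maxcut_by_ising(J):
    """MAX-CUT on a graph with weight matrix J maps to minimizing the
    Ising energy E(s) = sum_{i<j} J_ij s_i s_j over spins s_i = ±1
    (each spin says which side of the cut its vertex is on)."""
    n = len(J)
    best_spins, best_energy = None, np.inf
    for spins in itertools.product([-1, 1], repeat=n):
        s = np.array(spins)
        energy = np.sum(np.triu(J, 1) * np.outer(s, s))
        if energy < best_energy:
            best_spins, best_energy = s, energy
    cut = sum(J[i][j] for i in range(n) for j in range(i + 1, n)
              if best_spins[i] != best_spins[j])
    return best_spins, cut

# 4-cycle graph: max cut = 4 (alternate the vertices)
J = np.array([[0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0]])
spins, cut = maxcut_by_ising(J)
```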

Conclusion

Time-Multiplexed Optical Neural Networks

• Growing need for NN accelerators
  - Energy: CMOS ASIC state of the art = 1 pJ/MAC
  - Nanophotonic solutions hard to scale
• Photoelectric multiplication
  - Time-multiplexed data
  - Opto-electronic nonlinearity
• Limits to performance
  - Standard quantum limit: ~100 zJ/MAC
  - Beating the Landauer limit
  - Realistic near-term technology: ~fJ/MAC

Acknowledgements: (R.H.) IC Postdoctoral Research Fellowship at MIT, administered by ORISE through U.S. DOE and ODNI. (L.B.) NSERC Doctoral Fellowship. (A.S.) NSF Graduate Research Fellowship. (D.E.) U.S. ARO / ISN no. W911NF-18-2-0048.

Papers:
• R. Hamerly*, A. Sludds, L. Bernstein et al., "Large-Scale Optical Neural Networks based on Photoelectric Multiplication", Phys. Rev. X (in press), arXiv:1812.07614
• R. Hamerly*, T. Inagaki, P. McMahon et al., "Experimental investigation of performance differences between Coherent Ising Machines and a Quantum Annealer", Science Advances (in press), arXiv:1805.05217

*[email protected]