Digital Electronics and Analog Photonics for Convolutional Neural Networks (DEAP-CNNs)

Viraj Bangari,∗ Bicky A. Marquez, Heidi B. Miller, and Bhavin J. Shastri† Department of Physics, Engineering Physics & Astronomy, Queen's University, Kingston, ON K7L 3N6, Canada

Alexander N. Tait National Institute of Standards and Technology (NIST), Boulder, Colorado 80305, USA

Mitchell A. Nahmias, Thomas Ferreira de Lima, Hsuan-Tung Peng, and Paul R. Prucnal
Department of Electrical Engineering, Princeton University, Princeton, NJ 08544, USA
(Dated: July 3, 2019)

Convolutional Neural Networks (CNNs) are powerful and highly ubiquitous tools for extracting features from large datasets for applications such as computer vision and natural language processing. However, a convolution is a computationally expensive operation in digital electronics. In contrast, neuromorphic photonic systems, which have experienced a surge of interest over the last few years, promise higher bandwidths and energy efficiencies for neural network training and inference. Neuromorphic photonics exploits the advantages of optics, including the ease of analog processing and of busing multiple signals on a single waveguide at the speed of light. Here, we propose a Digital Electronic and Analog Photonic (DEAP) CNN hardware architecture that has the potential to be 2.8 to 14 times faster while maintaining the same power usage as current state-of-the-art GPUs.

arXiv:1907.01525v1 [eess.SP]
∗ [email protected]
† [email protected]

I. INTRODUCTION

The success of CNNs for large-scale image recognition has stimulated research in developing faster and more accurate algorithms for their use. However, CNNs are computationally intensive and therefore result in long processing latency. One of the primary bottlenecks is computing the matrix multiplication required for forward propagation. In fact, over 80% of the total processing time is spent on the convolution [1]. Therefore, techniques that improve the efficiency of even forward-only propagation are in high demand and researched extensively [2, 3].

In this work, we present a complete digital electronic and analog photonic (DEAP) architecture capable of performing highly efficient CNNs for image recognition. The competitive MNIST handwriting dataset [4] is used as a benchmark test for our DEAP CNN. At first, we train a standard two-layer CNN offline, after which the network parameters are uploaded to the DEAP CNN. Our scope is limited to forward propagation, but includes power and speed analyses of our proposed architecture.

Due to their speed and energy efficiency, photonic neural networks have been widely investigated from different approaches that can be grouped into three categories: (1) reservoir computing [5–8]; reconfigurable architectures based on (2) ring resonators [9–12]; and (3) Mach-Zehnder interferometers [13, 14]. Reservoir computing in the discrete photonic domain successfully implements neural networks for fast information processing; however, the predefined random weights of their hidden layers cannot be modified [8].

An alternative approach uses silicon photonics to design fully programmable neural networks [15], using a so-called broadcast-and-weight protocol [10–12]. This protocol is capable of implementing reconfigurable, recurrent and feedforward neural network models, using a bank of tunable silicon microring resonators (MRRs) that recreate on-chip synaptic weights. Therefore, such a protocol allows one to emulate physical neurons. Mach-Zehnder interferometers have also been used to model synaptic-like connections of physical neurons [14]. The advantage of the former approach over the latter is that it has already demonstrated fan-in, inhibition, time-resolved processing, and autaptic cascadability [12]. The DEAP CNN design is therefore compatible with mainstream silicon photonic device platforms. This approach leverages the advances in silicon photonics, which have recently progressed to the level of sophistication required for large-scale integration. Furthermore, the proposed architecture allows the implementation of multi-layer networks to realize the deep learning framework.

Inspired by the work of Mehrabian et al. [16], which lays out a potential architecture for photonic CNNs with DRAM, buffers, and microring resonators, our design goes a step further by considering a specific input representation, as well as an example of how an algorithm for tasks such as MNIST handwritten digit recognition can be mapped to photonics. Moreover, we consider the summation of multi-channel inputs, multi-dimensional kernels, the limitations on weights being between 0 and 1, and the architecture for the depth of kernels or inputs.

This work is divided into five sections. Following this introduction, in Section (II) we describe convolutions as used in the field of signal processing, and then introduce silicon photonic devices to perform convolutions in photonics. Section (III) introduces a hardware-inspired algorithm to perform such full photonic convolutions. In Section (IV), we utilize our previously described architecture to build a two-layer DEAP CNN for MNIST handwritten digit recognition. Finally, in Section (V), we show an energy-speed benchmark test, where we compare the performance of DEAP with the empirical dataset DeepBench [17]. Note that we have made the high-level simulator and mapping tool for the DEAP architecture publicly available [18].

II. CONVOLUTIONS AND PHOTONICS

II.1. Convolutions Background

A convolution of two discrete domain functions f and g is defined by:

(f ∗ g)[t] = Σ_{τ=−∞}^{∞} f[τ] g[t − τ],  (1)

where (f ∗ g) represents a weighted average of the function f[τ] when it is weighted by g[−τ] shifted by t. The weighting function g[−τ] emphasizes different parts of the input function f[τ] as t changes.

In digital image processing, a similar process is followed. The convolution of an image A with a kernel F produces a convolved image O. An image is represented as a matrix of numbers with dimensionality H × W, where H and W are the height and width of the image, respectively. Each element of the matrix represents the intensity of a pixel at that particular spatial location. A kernel is a matrix of real numbers with dimensionality R × R. The value of a particular convolved pixel is defined by:

O_{i,j} = Σ_{k=1}^{R} Σ_{l=1}^{R} F_{k,l} A_{i+k,j+l}.  (2)

Using matrix slicing notation, Eq. (2) can be represented as a dot product of two vectorized matrices:

O_{i,j} = vec(F)^T · vec((A_{m,n})_{m∈[i,i+R], n∈[j,j+R]})^T.  (3)

A convolution reduces the dimensionality of the input image to (H − R + 1) × (W − R + 1), so a padding of zero values is normally applied around the edges of the input image to counteract this. A schematic illustration of a convolution in digital image processing is shown at the top of Fig. 1.

Figure 1. Schematic illustration of a convolution. At the top of the figure, an input image is represented as a matrix of numbers with dimensionality H × W × D, where H, W and D are the height, width and depth of the image, respectively. Each element A_{i,j} of A represents the intensity of a pixel at that particular spatial location. The kernel F is a matrix with dimensionality R × R × D, where each element F_{i,j} is a real number. The kernel is slid over the image using a stride S equal to one. As the image has multiple channels (or depth) D, the same kernel is applied to each channel. Assuming H = W, the overall output dimensionality is (H − R + 1)². The bottom of the figure shows how a convolution operation is generalized into a single matrix-matrix multiplication, where the kernel F is transformed into a vector F with DR² elements, and the image A is transformed into a matrix A of dimensionality DR² × (H − R + 1)². Therefore, the output is represented by a vector with (H − R + 1)² elements.

When convolutions are used to perform parallel matrix multiplications in neural networks such as CNNs, a convolution operation is defined as:

O_{i,j} = vec(F)^T · vec((A_{m,n,k})_{m∈[iS,iS+R], n∈[jS,jS+R], k∈[1,D]})^T,  (4)

where the input A has dimensionality H × W × D, the kernel F has dimensionality R × R × D, and D refers to the number of channels within the input image. The additional parameter S is referred to as the "stride" of the convolution. This convolution is similar to Eq. (3), except that the outputs from each channel are summed together in the end, and that the stride parameter is always equal to 1 in image processing. The dimensionality of the output feature is:

⌈(H − R)/S + 1⌉ × ⌈(W − R)/S + 1⌉ × K,  (5)

where K is the number of different kernels applied to an image, and ⌈·⌉ is the ceiling function. Table (I) contains a summary of all the convolutional parameters described so far.

Table I. Summary of Convolutional Parameters

Parameter  Meaning
N          Number of input images
H          Height of input image including padding
W          Width of input image including padding
D          Number of input channels
R          Edge length of kernel
K          Number of kernels
S          Stride

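As a concreteness check, Eqs. (2) and (3) can be verified numerically. The sketch below (our illustration, not part of the published simulator; sizes are arbitrary) evaluates a single-channel convolution directly and then as the single matrix product shown at the bottom of Fig. 1.

```python
import numpy as np

rng = np.random.default_rng(1)
H = W = 6
R = 3
A = rng.uniform(size=(H, W))   # input image
F = rng.uniform(size=(R, R))   # kernel

# Direct evaluation of Eq. (2): one dot product per output pixel.
O = np.zeros((H - R + 1, W - R + 1))
for i in range(H - R + 1):
    for j in range(W - R + 1):
        O[i, j] = np.sum(F * A[i:i + R, j:j + R])

# The same convolution as one matrix product (bottom of Fig. 1):
# each column of Amat is a vectorized R x R patch of A.
cols = [A[i:i + R, j:j + R].ravel()
        for i in range(H - R + 1) for j in range(W - R + 1)]
Amat = np.stack(cols, axis=1)   # shape R^2 x (H - R + 1)^2
o = F.ravel() @ Amat            # Eq. (3) applied to every pixel at once
assert np.allclose(o, O.ravel())
```

The matrix form is what makes the operation amenable to parallel hardware, since all output pixels become one inner-product sweep.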
One of the challenges with convolutions is that they are computationally intensive operations, taking up 86% to 94% of execution time for CNNs [1]. For heavy workloads, convolutions are typically run on graphics processing units (GPUs), as they are able to perform many mathematical operations in parallel. A GPU is a specialized hardware unit that is capable of performing a single mathematical operation on large amounts of data at once. This parallelization allows GPUs to compute matrix-matrix multiplications at speeds much higher than a CPU [19]. The convolution operation can be generalized into a single matrix-matrix multiplication [20]. This is shown at the bottom of Fig. 1, where the kernel F is transformed into a vector F with dimensionality KDR² × 1, and the image is transformed into a matrix A of dimensionality KDR² × (⌈(H − R)/S⌉ + 1)(⌈(W − R)/S⌉ + 1)K. Therefore, the output is represented by a vector with (⌈(H − R)/S⌉ + 1)(⌈(W − R)/S⌉ + 1)K elements; in this particular case K = 1, S = 1 and H = W.

II.2. Silicon Photonics Background

An emerging alternative to GPU computing is optical computing using silicon photonics for ultrafast information processing. Silicon photonics is a technology that allows for the implementation of photonic circuits by using the existing complementary metal-oxide-semiconductor (CMOS) platform for electronics [21]. In recent years, the silicon photonic based broadcast-and-weight architecture has been shown to perform multiply-accumulate operations at frequencies up to five times faster than conventional electronics [22]. Therefore, there is motivation to explore how photonics can be used to perform convolutions, and how it compares to GPU-based implementations.

MRRs are the essential devices of our approach. An MRR is a circular waveguide that is coupled with either one or two waveguides. Such silicon waveguides can be manufactured to have a width of 500 nm and a thickness of 220 nm. These waveguides have a bend radius of 5 µm and can support TE- and TM-polarized wavelengths between 1.5 µm and 1.6 µm [21]. The single-waveguide configuration is called an all-pass MRR, see Fig. 2(a). The light from the waveguide is transferred into the ring via a directional coupler and then recombined. The effective index of refraction between the waveguide and the MRR and the circumference of the MRR cause the recombined wave to have a phase shift, thereby interfering with the intensity of the original light. The transfer function of the intensity of the light coming out of the through port with respect to the light going into the input port of the all-pass resonator is described by:

T_n(φ) = (a² − 2ra cos(φ) + r²) / (1 − 2ra cos(φ) + (ar)²).  (6)

The parameter r is the self-coupling coefficient, and a defines the propagation loss from the ring and the directional coupler. The phase φ depends on the wavelength λ of the light and the radius d of the MRR [23]:

φ = 4π² d n_eff / λ,  (7)

where n_eff is the effective index of refraction between the ring and waveguide. The value of n_eff can be modified to indirectly change the resonance peak. Such tuning is usually done by applying a current to the ring proportional to the variable A_i. This process heats the ring, yielding a shift of the resonance peak. Figure 2(b) shows an example of such tuning: the orange curve represents the Lorentzian line shape described by Eq. (6), centered at the initial phase of the ring resonator, indicating that the MRR is in resonance with the incoming light. The blue triangle curve shows how this phase can be modified by heating the MRR.

Figure 2. (a) All-pass MRR and (b) transfer function: the orange curve represents the Lorentzian line shape described by Eq. (6), centered at the initial phase where the MRR is in resonance with the incoming light. The blue triangle curve shows how this phase can be modified by heating the MRR via the application of a current through A_i.

The phase for an all-pass resonator corresponding to a particular intensity modulation value can be computed by inverting Eq. (6):

φ_i = arccos( [A_i(1 + (ar)²) − a² − r²] / [2ra(A_i − 1)] ),  (8)

resulting in a modulated intensity equal to A_i:

I_mod = T_n(φ_i)|E_0|² = A_i,  (9)

where E_0 is the amplitude of the electric field.

An alternative double-waveguide configuration is called the add-drop MRR. The transfer function of the through-port light intensity with respect to the input light is:

T_p(φ) = ((ar)² − 2r²a cos(φ) + r²) / (1 − 2r²a cos(φ) + (r²a)²);  (10)

and the transfer function of the drop-port light intensity with respect to the input light is:

T_d(φ) = (1 − r)²a / (1 − 2r²a cos(φ) + (r²a)²).  (11)

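The modulation scheme of Eqs. (6), (8) and (9) can be checked numerically. The sketch below assumes r = a = 0.99 (the values quoted later in Sec. IV) and uses the sign convention in Eq. (8) under which the inversion of Eq. (6) is exact; it is an illustration of the relations above, not the published simulator.

```python
import numpy as np

r = a = 0.99  # self-coupling and loss coefficients (values from Sec. IV)

def T_n(phi):
    """All-pass through-port transmission, Eq. (6)."""
    return ((a**2 - 2*r*a*np.cos(phi) + r**2)
            / (1 - 2*r*a*np.cos(phi) + (a*r)**2))

def drive_phase(A_i):
    """Phase realizing a target modulated intensity A_i, Eq. (8)."""
    num = A_i * (1 + (a*r)**2) - a**2 - r**2
    return np.arccos(num / (2*r*a*(A_i - 1)))

# Round trip: the tuned phase reproduces the requested intensity, Eq. (9).
for target in (0.2, 0.5, 0.8):
    assert np.isclose(T_n(drive_phase(target)), target)
```

Because Eq. (8) is an exact algebraic inversion of Eq. (6), the round trip holds to floating-point precision for any reachable intensity.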
In the case where the coupling losses are negligible, a ≈ 1, the relationship between the add-drop through and drop transfer functions is T_p = 1 − T_d. In addition, if we connect the through and drop ports to a balanced photodiode and a TIA as in Fig. 3(a), we get an effective transfer function of g(T_d − T_p), where g is the gain of the TIA. Therefore, we get a modulation of:

I_mod = g(T_d(φ_i) − T_p(φ_i))|E_0|² = A_i.  (12)

At the output of the balanced photodiode, the transfer function T_d − T_p is shown by the blue triangle curve in Fig. 3(b). The orange circle and green curves are Lorentzian line shapes, centered at the initial phase where the MRR is in resonance with the incoming light, described by Eqs. (11) and (10), respectively. In contrast, Figs. 3(c) and (d) are centered at a modified phase (φ + 0.2), according to a specific value of the current A_i. Here we aim to demonstrate how to represent positive and negative kernel values in analog photonics. This can be achieved by incorporating a balanced PD at the output of the add-drop MRR. In panels (c) and (d), the blue curves show such positive and negative kernel values from the drop and through outputs, respectively. The orange triangle curves show the TIA transfer function g(T_d − T_p), where g amplifies T_d − T_p by a factor of two.

Figure 3. (a) Add-drop configuration with O/E conversion and amplification. (b) Output of the balanced photodiode, the transfer function T_d − T_p. Orange circle and green curves are the drop and through ports, described by Eqs. (11) and (10), respectively. In panels (c) and (d), the phase-shifted (φ + 0.2) blue curves show positive and negative kernel values from the drop and through outputs, respectively. The orange triangle curves show how those values can be amplified by a factor of two using a TIA at the output of the balanced photodiode. The phase shifts are achieved by the application of a current through A_i.

II.3. Dot Products with Photonics

The fundamental operation of a convolution is the dot product of two vectorized matrices. Therefore, one needs to understand how to compute a vector dot product using photonics before proposing an architecture capable of performing convolutions.

A wavelength-multiplexed signal consists of k electromagnetic waves, each with angular frequency ω_i, i = 1, …, k. If it is assumed that each wave has an amplitude of E_0 and a power enveloping function µ_i whose modulation frequency is significantly smaller than ω_i, then the slowly varying envelope approximation and a short-time Fourier transform can be used to derive an expression for the multiplexed signal in the frequency domain:

E_mux(ω) = E_0 Σ_{i=1}^{k} √µ_i δ(ω − ω_i),  (13)

where δ(ω − ω_i) is the Dirac delta function and µ_i ≥ 0, since power envelopes are not negative. If the enveloping function is prevented from amplifying the electric field, µ_i can further be restricted to the domain 0 ≤ µ_i ≤ 1. Next, we introduce tunable linear filters H⁺(ω) and H⁻(ω) such that when they interact with multiple fields, the following weighted signals are created:

E_w⁻(ω) = H⁻(ω) E_mux(ω),
E_w⁺(ω) = H⁺(ω) E_mux(ω).  (14)

Assuming that the two signals are fed into a balanced photodiode (balanced PD) with spectral response R(ω), the induced photocurrent is described by:

i_PD = ∫_{−∞}^{∞} dω R(ω) (|E_w⁺(ω)|² − |E_w⁻(ω)|²)
     = ∫_{−∞}^{∞} dω R(ω) (|H⁺(ω)|² − |H⁻(ω)|²) |E_mux(ω)|²
     = Σ_{i=1}^{k} R(ω_i) (|H⁺(ω_i)|² − |H⁻(ω_i)|²) E_0 µ_i.  (15)

Assuming that R(ω) is roughly constant in the area of spectral interest, one can set A_i = E_0 R_0 µ_i and F*_i = |H⁺(ω_i)|² − |H⁻(ω_i)|², resulting in a photocurrent equal to:

i_PD = Σ_{i=1}^{k} A_i F*_i = A⃗ · F⃗*.  (16)

The through and drop ports of an MRR can be used to implement the linear filters H⁺ and H⁻ such that |H⁺|² = T_d and |H⁻|² = T_p. Knowing that T_p = 1 − T_d with minimal losses, we can set a particular weight using:

F*_i = 2T_d(φ_i) − 1.  (17)

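A back-of-the-envelope simulation of this weighting scheme, assuming ideal lossless rings so that T_p = 1 − T_d and taking E_0 R_0 = 1 (our simplifications, for illustration only), reproduces the dot product of Eq. (16):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.uniform(0, 1, 9)    # input intensities A_i (normalized pixels)
F = rng.uniform(-1, 1, 9)   # desired weights F*_i in [-1, 1]

Td = (F + 1) / 2            # drop-port transmission chosen per Eq. (17)
Tp = 1 - Td                 # through port in the ideal case a ~ 1

# The balanced PD subtracts the two detected powers, Eqs. (15)-(16):
i_pd = np.sum(A * (Td - Tp))
assert np.isclose(i_pd, A @ F)
```

The subtraction at the balanced PD is what extends the achievable weight range from [0, 1] (a single filter) to [−1, 1].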
The phase φ_i can be obtained by using Eqs. (10) and (11) to get:

φ_i = arccos( −(1/(2r²a)) [ 2(1 − r)²a/(F*_i + 1) − 1 − (r²a)² ] ),  (18)

from which we can see that F*_i can be between −1 and 1, since T_d is a filter that only represents values between 0 and 1. In order to perform a dot product with a weight vector w⃗ whose components are not limited to the range −1 to 1, a gain g_TIA can be applied to the photocurrent such that:

A⃗ · F⃗ = g_TIA A⃗ · F⃗* = g_TIA Σ_{i=1}^{k} A_i F*_i,  (19)

if:

g_TIA = max_{1≤i≤k} |F_i|,  (20)

then:

F⃗ = g_TIA F⃗*,  (21)

assuming that each φ_i corresponds to a weighting of w*_i. This electronic gain can be achieved using a transimpedance amplifier (TIA), which can be manufactured in a standard CMOS process [24] and packaged or integrated with the photonic chip [21]. A diagram of the electro-optic architecture described in this section is presented in Fig. 4. From now on, this amalgamation of electronic and optical components is referred to as a photonic weight bank (PWB). PWBs similar to the one in Fig. 4 have been successfully implemented in the past [11, 25, 26].

Figure 4. An electro-optic architecture that performs dot products. A_i (i = 1, …, k) are input elements encoded in intensities, multiplexed by a WDM and linked to the weight banks via a silicon waveguide. F_i are filter values that modulate the MRRs in the PWB. Drop and through output ports are connected to a balanced PD, where the matrix multiplication is performed, followed by an amplifying TIA.

We can represent negative inputs between −1 and 1 by modifying the power enveloping function to µ_i = (x_i + 1)/2. If the same set of derivations is followed, we can modify Eq. (19) to be:

x⃗ · w⃗ = g( Σ_{i=1}^{k} A_i F*_i + Σ_{i=1}^{k} E_0 R_0 F*_i ).  (22)

The second term in this sum is a predictable bias current term that can conceptually be subtracted before feeding into the TIA. This is a disadvantage of supporting negative inputs, as additional optical or electronic control circuitry would need to be designed. Another trade-off is a loss in precision due to a larger range of inputs needing to be represented, analogous to the loss in precision with signed integers in classical computing.

III. PERFORMING CONVOLUTIONS USING PHOTONICS

The goal of this section is to present a photonic architecture capable of performing convolutions for CNNs. This new architecture is called DEAP.

For a maximum number of input channels D_m and a maximum kernel edge length R_m as bounding parameters for DEAP, we represent the range of convolutional parameters that a particular implementation of DEAP can support. If a convolutional parameter described in Table (I) does not have a complementary bounding parameter, the DEAP architecture can support arbitrary values of said convolutional parameter.

III.1. Producing a Single Convolved Pixel

First, we consider an architecture that can produce one convolved pixel at a time. To handle convolutions for kernels with dimensionality up to R_m × R_m × D_m, we will require R_m² lasers with unique wavelengths, since a particular convolved pixel can be represented as the dot product of two 1 × R_m² vectors. To represent the values of each pixel, we require D_m R_m² modulators (one per kernel value), where each modulator keeps the intensity of the corresponding carrier wave proportional to the normalized input pixel value. The R_m² lasers are multiplexed together using wavelength division multiplexing (WDM), and the multiplexed signal is then split into D_m separate lines. On every line, there are R_m² all-pass MRRs, resulting in D_m R_m² MRRs in total. Each WDM line will modulate the signals corresponding to a subset of R_m² pixels on channel k, meaning that the modulated wavelengths on a particular line correspond to the pixel inputs (A_{m,n,k})_{m∈[i,i+R_m], n∈[j,j+R_m]}, where k ∈ [1, D_m].

The D_m WDM lines will then be fed into an array of D_m PWBs. Each PWB will contain R_m² MRRs with the weights corresponding to the kernel values at a particular channel. For example, the PWB on line k should contain the vectorized weights for the kernel (F_{m,n,k})_{m∈[1,R_m], n∈[1,R_m]}. Each MRR within a PWB should be

tuned to a unique resonant wavelength within the multiplexed signal. The outputs of the weight bank array are electrical signals, each proportional to the dot product (F_{m,n,k})_{m∈[1,R_m], n∈[1,R_m]} · (A_{p,q,k})_{p∈[i,i+R_m], q∈[j,j+R_m]}. Finally, the signals from the weight banks need to be added together. This can be achieved using a passive voltage adder. The output from this adder will therefore be the value of a single convolved pixel. Fig. 5 shows a complete picture of what such an architecture would look like.

To perform a convolution with a kernel edge length less than R_m, one can set (F_{m,n,k})_{m∈[R+1,R_m], n∈[R+1,R_m]} to zero. Similarly, if the number of input channels is less than D_m, then the modulators (A_{m,n,k})_{m∈[1,H], n∈[1,W]} should also be set to zero, with k ∈ [D + 1, D_m] in this case.

Figure 5. Photonic architecture for producing a single convolved pixel. Input images are encoded in intensities A_{l,h}, where the pixel inputs A_{m,n,k} with m ∈ [i, i + R_m], n ∈ [j, j + R_m], k ∈ [1, D_m] are represented as A_{l,h}, l = 1, …, D and h = 1, …, R². Considering the bounding parameters, we set D = D_m and R = R_m. Likewise, the filter values F_{m,n,k} are represented as F_{l,h} under the same conditions. We use an array of R² lasers with different wavelengths λ_h to feed the MRRs. The input and kernel values A_{l,h} and F_{l,h} modulate the MRRs via electrical currents proportional to those values. Once the parallel matrix multiplications are performed, the voltage adder sums all signals from the weight banks; here, R are resistance values. The output is then the convolved feature.

Algorithm 1 Convolutions for CNNs using DEAP
1: A is the input image
2: F is the kernel
3: R is the edge length of the kernel
4: O is a memory block to store the convolution
5: S is the stride
6: H and W are the height and width of the input image
7: function convolve(A, F, R, O, S, H, W)
8:   for (k = 1; k ≤ K; k = k + 1) do
9:     load kernel weights from F[:,:,:,k]
10:    for (h = 1; h ≤ H − R + 1; h = h + S) do
11:      for (w = 1; w ≤ W − R + 1; w = w + S) do
12:        load inputs from A[h:min(h+R,H), w:min(w+R,W), :]
13:        perform convolution
14:        store results in O[h/S, w/S, k]
15:      end for
16:    end for
17:  end for
18: end function
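Algorithm 1 translates almost line for line into NumPy. The sketch below is a software model only (0-based indexing; the function name mirrors the pseudocode but is otherwise our choice), not a model of the photonic datapath timing:

```python
import numpy as np

def convolve(A, F, S=1):
    """Software model of Algorithm 1.

    A: input image of shape (H, W, D); F: kernels of shape (R, R, D, K);
    S: stride. Returns the convolved output O."""
    H, W, D = A.shape
    R, _, _, K = F.shape
    Ho = (H - R) // S + 1
    Wo = (W - R) // S + 1
    O = np.zeros((Ho, Wo, K))
    for k in range(K):                                      # line 8: per kernel
        Fk = F[:, :, :, k]                                  # line 9: load weights
        for i, h in enumerate(range(0, H - R + 1, S)):      # line 10
            for j, w in enumerate(range(0, W - R + 1, S)):  # line 11
                patch = A[h:h + R, w:w + R, :]              # line 12: load inputs
                O[i, j, k] = np.sum(patch * Fk)             # lines 13-14
    return O
```

In the hardware, line 12 corresponds to re-driving the modulator array and line 13 to the analog weight-bank/adder path; only the loop bookkeeping remains digital.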

III.2. Performing a Full Convolution

In the previous section, we discussed how DEAP can produce a single convolved pixel. In order to perform a convolution of arbitrary size, one would need to stride along the input image and readjust the modulation array. Since the same kernel is applied across the set of inputs, the weight banks do not need to be modified until a new kernel is applied. Fig. 6(a) demonstrates this process on an input with S = 1. To handle S > 1, the inputs being passed into DEAP should also be strided accordingly. In this approach, the inputs should have been zero-padded before being passed into DEAP. In pseudocode, performing a convolution with K filters can be implemented as shown in Algorithm 1.

The DEAP architecture also allows for parallelization by treating the photonic architecture proposed in the previous section as a single-output "convolutional unit". By creating n_conv instances of these convolutional units, one can produce n_conv pixels per cycle by passing the next set of inputs to each unit. This is demonstrated in Fig. 6(b) for n_conv = 2. The computation of output pixels can be distributed across the convolutional units, resulting in a runtime complexity of O(KHW / (S² n_conv)).
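Under these assumptions, the cycle count of a full convolution can be sketched with a small helper (a hypothetical function of ours, counting one output pixel per unit per cycle):

```python
import math

def n_cycles(H, W, K, R, S, n_conv):
    """Cycles to produce all output pixels with n_conv convolutional units."""
    Ho = (H - R) // S + 1
    Wo = (W - R) // S + 1
    return math.ceil(K * Ho * Wo / n_conv)

# MNIST-sized first layer from Sec. IV: 28x28 input, eight 5x5 kernels.
assert n_cycles(28, 28, 8, 5, 1, 1) == 4608   # 8 * 24 * 24 pixels
assert n_cycles(28, 28, 8, 5, 1, 2) == 2304   # two units halve the cycle count
```

This makes the O(KHW / (S² n_conv)) scaling concrete: doubling n_conv halves the number of cycles, at the cost of duplicating the modulator and weight-bank hardware.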


Figure 6. (a) Cycling through a convolution using DEAP. (b) Performing a convolution with two convolutional units.

IV. PHOTONIC CONVOLUTIONAL NEURAL NETWORKS

In this section, we show how DEAP can be used to run a CNN. CNNs are a type of neural network that was developed for image recognition tasks. A CNN consists of some combination of convolutional, nonlinear, pooling and fully connected layers [27], see Fig. 7(a). As introduced previously, convolutions perform a highly efficient and parallel matrix multiplication using kernels [3]. Furthermore, since kernels are typically smaller than the input images, the feature extraction operation allows efficient edge detection, thereby reducing the amount of memory required to store those features. CNNs are suitable for implementation in photonic hardware since they demand fewer resources for matrix multiplication and memory usage. The linear operation performed by convolutions allows single feature extraction per kernel. Hence, many kernels are required to extract as many features as possible. For this reason, kernels are usually applied in blocks, allowing the network to extract many different features all at once and in parallel.

In feed-forward networks, it is typical to use a rectified linear unit (ReLU) activation function. Since ReLUs are piecewise linear functions that model an overall nonlinearity, they allow CNNs to be easily optimized during training. The pooling layer introduces a stage where a set of neighboring pixels is encompassed in a single operation. Typically, such an operation consists of applying a function that determines the maximum value among neighboring values; an average operation can be implemented likewise. These approaches describe max and average pooling, respectively. This statistical operation allows for a direct down-sampling of the image, since the dimensions of the object are reduced by a factor of two. With this step, we aim to make our network invariant and robust to small translations of the detected features.

The triplet convolution-activation-pooling is usually repeated several times for different kernels, keeping the pooling and activation functions invariant. Once all possible features are detected, the addition of a fully connected layer is required for the classification stage. This layer prepares and shows the solutions of the task.

CNNs are trained by changing the values of the kernels, analogous to how feed-forward neural networks are trained by changing the weighted connections [28]. The estimated kernel and weight values are required in the testing stage. In this work, this stage is performed by our on-chip DEAP CNN. Figure 7(b) shows a high-level overview of the proposed on-chip testing architecture. Here, the testing input values stored in the PC modulate the intensities of a group of lasers with identical powers but unique wavelengths. These modulated inputs would


be sent into an array of photonic weight banks, which would then perform the convolution for each channel. The kernels obtained in the training step are used to modulate these weight banks. Finally, the outputs of the weight banks would be summed using a voltage adder, which produces the convolved feature. The DEAP simulator works using the transfer functions of the MRRs, the through-port and drop-port summing equations at the balanced PDs, and the TIA gain term to simulate a convolution. The simulator assumes that the MRRs can only be controlled with 7 bits of precision, as has been empirically observed in a lab setting. The MRR self-coupling coefficient is set equal to the loss, r = a = 0.99 [29], in Eqs. (6), (10) and (11).

Figure 7. Block diagrams describing: (a) a typical CNN, which contains convolutions, activation functions, pooling and fully connected layers; in this case we exemplify the diagram with an MNIST-based recognition task that predicts the number 5; and (b) the DEAP architecture. The input image, kernel weights, and convolved features are stored in the computer (PC), along with the commands to implement the activation function off-chip. The input image, kernel weights and convolved features are transferred to the chip via DACs from the SDRAM. Then, the convolution is performed on-chip. Finally, the output is digitized via an ADC and stored in an SDRAM connected to the computer.

The interfacing of optical components with electronics would be facilitated by the use of digital-to-analog converters (DACs) and analog-to-digital converters (ADCs), while the storage of outputs and retrieval of inputs would be achieved using GDDR SDRAM memories. The SDRAM is connected to a computer, where the information is already in a digital representation. There, the implementation of the ReLU nonlinearity and the reuse of the convolved feature to perform the next convolution can be carried out. The idea is to use the same architecture to implement the triplet convolution-activation-pooling in hardware.

In this work, we trained the CNN to perform image recognition on the MNIST dataset. The training stage uses the ADAM optimizer and the back-propagation algorithm to compute the gradient function. The optimized parameters to solve MNIST can be categorized in two groups: (i) two 5 × 5 × 8 different kernels, and (ii) two fully connected layers of dimensions 128 × 800 and 128 × 10, with their respective bias terms. These kernels are then defined by eight 5 × 5 different filters.

In the following, we use our DEAP CNN simulator to recognize new input images, obtained from a set of 500 images intended for the test step. Our simulator only works at the transfer level and does not simulate noise or distortion from analog components. The process of feature extraction performed by the DEAP CNN is illustrated in Fig. 8(a). As can be seen in the illustration, a 28 × 28 input image from the test dataset is filtered by the first 5 × 5 × 8 kernel, using stride one. The output of this process is a 24 × 24 × 8 convolved feature, with the ReLU activation function already applied. Following the same process, the second group of filters is applied to the convolved feature to generate the second output, i.e. a 20 × 20 × 8 convolved feature.

After the second ReLU is applied to the output, average pooling is utilized for invariance and down-sampling of the convolved features. The average pooling is implemented by a 2 × 2 kernel whose elements are all 1/4. However, the stride of one was kept; therefore the pooled feature has dimensionality 19 × 19 × 8. The down-sampling is implemented offline: from the 19 × 19 × 8 output, a simple algorithm extracts the elements that have even indexes. The result of this process is a 10 × 10 × 8 pooled output. Finally, the first fully connected layer is fed with the flattened version of the pooled object. The resultant vector feeds the last fully connected layer, where the result of the MNIST classification appears.

The results of the MNIST task solved by our simulated DEAP CNN are shown in Fig. 8(b).
The results of the MNIST task solved by our simulated DEAP CNN are shown in Fig. 8(b). For a test set of 500 images, we obtained an overall accuracy of 98%. This performance was compared to the results obtained using a standard two-layer CNN that includes a max pooling layer.


Figure 8. (a) An illustrative block diagram of the two-layer DEAP CNN solving MNIST. (b) Results of the MNIST task using a simulated DEAP CNN.

We found that this standard network achieves an overall accuracy of 98.6%. Therefore, we can conclude that our simulator is sufficiently robust despite the 7-bit precision assumed in the DEAP CNN simulation.
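The effect of this limited precision can be reproduced with a simple uniform quantizer. This is a hedged sketch; the exact quantization scheme used by the simulator may differ:

```python
import numpy as np

def quantize(w, bits=7):
    """Uniformly quantize an array to 2**bits levels over its own [min, max] range,
    mimicking an MRR weight that can only be set with `bits` bits of precision."""
    w = np.asarray(w, dtype=float)
    lo, hi = w.min(), w.max()
    if hi == lo:                      # constant array: nothing to quantize
        return w.copy()
    step = (hi - lo) / (2**bits - 1)
    return lo + np.round((w - lo) / step) * step
```

With 7 bits the worst-case rounding error is half a step, i.e. (max − min)/254, which is consistent with the simulated accuracy (98%) staying close to the full-precision baseline (98.6%).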

V. ENERGY AND SPEED ANALYSES

V.1. Energy Estimation

The energy used by a single DEAP convolutional unit depends on the R and D parameters. The 100-wavelength limitation for MRRs constrains the maximum R to be 10, as each multiplexed waveguide will carry R² signals. The number of MRRs used in the modulator array is equal to R²D, meaning that only certain D and R² values are allowed for a finite number of MRRs. Assuming that a maximum of 1024 MRRs can be manufactured in the modulator array, a convolutional unit can support a large kernel size with a limited number of channels (R = 10, D = 12) or a small kernel size with a large number of channels (R = 3, D = 113). We will consider both edge cases to get a range of energy consumption values. For the smaller convolution size, we will have R² lasers, R² MRRs and DACs in the modulator array, R²D MRRs and D TIAs in the weight bank array, and one ADC to convert back into a digital signal.

With 100 mW per laser, 19.5 mW per MRR, 26 mW per DAC, 17 mW per TIA [30] and 76 mW per ADC, we get a power usage of 112 W for the large kernel size and 95 W for the smaller kernel size. Therefore, we estimate a single convolution unit to use around 100 W when 1024 modulators are used to represent inputs.
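The MRR budget constraint and the power tally can be captured in a few lines. This is a hedged sketch: the component counts follow the simplified description above, so the totals are order-of-magnitude figures rather than a reproduction of the 95-112 W estimates, which depend on details of the full architecture:

```python
MRR_BUDGET = 1024  # assumed maximum number of MRRs in the modulator array

def max_channels(R, budget=MRR_BUDGET):
    """Largest channel count D such that the R**2 * D modulator MRRs fit the budget."""
    return budget // R**2

def unit_power_watts(R, D):
    """Rough power tally for one convolutional unit; per-component powers in mW
    are taken from the text, component counts are a simplified reading of it."""
    mw = (R**2 * 100             # lasers
          + R**2 * (19.5 + 26)   # modulator-array MRRs and their DACs
          + R**2 * D * 19.5      # weight-bank MRRs
          + D * 17               # TIAs
          + 76)                  # one ADC
    return mw / 1000.0
```

For R = 3, `max_channels` recovers the D = 113 edge case quoted above.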

V.2. DEAP Performance

The time it takes for light to propagate from the WDM input to just before the balanced PDs is estimated by the following equation:

t_prop = k · 2πr_MRR / c,    (23)

where c is the speed of light, 2πr_MRR is the circumference of the MRR and k is the number of MRRs. Assuming 100 MRRs with a radius of 10 μm [11, 31], the PWB has a propagation time of around 21 ps and a throughput of 1/t_prop ≈ 50 GS/s. The bottlenecks come from the fact that the balanced PDs have a throughput of 25 GS/s [30] and the TIA has a throughput of 10 GS/s [32]. An individual MRR can be modulated at speeds of 128 GS/s [31], meaning that the modulation frequency of the MRRs does not bottleneck the throughput of the PWB. The throughput of a PWB is around 5 GS/s. The DACs [33] and ADCs [34] both operate at 5 GS/s and support up to 7 bits. The GDDR6 SDRAM operates at 16 Gb/s per pin with a 256-bit bus [35]. Consequently, the speed of the system is limited by the throughput of the DACs/ADCs, resulting in DEAP producing a single convolved pixel at 5 GS/s, or t = 200 ps.

DeepBench [17] is an empirical dataset that records how long various types of GPUs took to perform a convolution for a given set of convolutional parameters. Table II contains the parameters used for each of these benchmarks, and Table III contains the power consumption of the benchmarked GPUs. The speeds of the various GPUs were taken directly from Ref. [17], while the speed of the DEAP convolution was estimated using the following equation:

t_runtime = 200 ps × (NK / n_conv) × (⌊(H − R)/S⌋ + 1)(⌊(W − R)/S⌋ + 1),    (24)

where n_conv is the number of DEAP convolutional units used in parallel. In some of the benchmarks the kernel edge lengths were not equal, hence the parameters Rw and Rh, which correspond to the width and height of the kernels. For each of the selected benchmarks, R²D ≤ 1024, meaning that the convolutional network is compatible with a DEAP implementation.

Table II. Benchmarking parameters for DEAP (W, H: input width and height; D: input channels; N: batch size; K: number of kernels; Rw, Rh: kernel width and height; S: stride)
W    H    D    N   K    Rw  Rh  S
700  161  1    4   32   5   20  2
112  112  64   8   128  3   3   1
7    7    832  16  256  1   1   1
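Eqs. (23) and (24) are easy to check numerically. The helper names below are hypothetical, and the 200 ps figure is the DAC/ADC-limited time per convolved pixel derived above:

```python
import math

C = 2.998e8  # speed of light, m/s

def t_prop(k=100, r_mrr=10e-6):
    """Eq. (23): propagation time through k MRRs of radius r_mrr (seconds)."""
    return k * 2 * math.pi * r_mrr / C

def t_runtime(W, H, D, N, K, Rw, Rh, S, n_conv=1, t_pixel=200e-12):
    """Eq. (24), generalized to unequal kernel edges Rw x Rh (seconds)."""
    out_pixels = ((H - Rh) // S + 1) * ((W - Rw) // S + 1)
    return t_pixel * (N * K / n_conv) * out_pixels
```

For 100 MRRs of 10 μm radius, `t_prop()` gives the ≈21 ps quoted above, and doubling `n_conv` halves the estimated runtime.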

Table III. Benchmarked GPUs with power consumption
GPU                      Power Usage (W)
AMD Vega FE [36]         375
AMD MI25 [37]            300
Tesla P100 [38]          250
NVIDIA GTX 1080 Ti [39]  250

Figure 9. Estimated DEAP convolutional runtime compared to actual GPU runtimes from DeepBench benchmarks.

The estimated DEAP runtimes using one and two convolutional units were plotted against the actual DeepBench runtimes in Fig. 9. From this, we can see that using two convolutional units performs slightly better than all the GPU benchmarks. While the mean GPU power consumption is 295 W, DEAP with a single convolutional unit uses about 110 W. Therefore, DEAP can perform convolutions between 1.4 and 7.0 times faster than the mean GPU runtime while using 0.37 times the energy. Using two convolutional units doubles the speed of DEAP, meaning that DEAP can be between 2.8 and 14 times faster than a conventional GPU while using almost 0.75 times the energy. DEAP with a single unit is expected to perform at a speed somewhat similar to the GPUs.

VI. CONCLUSION

We have proposed a photonic network, DEAP, suited for convolutional neural networks. DEAP was estimated to perform convolutions between 2.8 and 14 times faster than a GPU while using roughly 0.75 times the energy. A linear increase in processing speed corresponds to a linear increase in energy consumption, allowing DEAP to be as scalable as electronics. High-level software simulations have shown that DEAP is theoretically capable of performing a convolution, and we demonstrated that our DEAP CNN can solve the MNIST handwritten-digit recognition task with an overall accuracy of 98%. The largest bottleneck is the I/O interfacing with digital systems via DACs and ADCs; if photonic DACs [40] and ADCs [41] can be built with higher bit precisions, replacing the electronic components with optical ones could significantly decrease the runtime and make the speedup over GPUs even higher.

In order to realize a physical implementation, a number of issues still need to be solved. Packaging a silicon photonic chip with an electronic chip with a high I/O count is a challenging RF engineering task, but it is a central thrust in the roadmap for silicon photonic foundries [21]. There also needs to be control circuitry that routes the outputs of the SDRAM into the relevant DACs and from the ADCs into the SDRAM. Since we assume that the control circuitry can operate significantly faster than a memory access, we believe it will have a negligible impact on the overall throughput. Another issue is that DEAP processes data in the analog domain, whereas GPUs perform floating-point arithmetic. Though floating-point arithmetic does have some degree of error due to rounding in the mantissa, its errors are deterministic and predictable. The errors from photonics, on the other hand, are due to stochastic shot, spectral, Johnson-Nyquist and flicker noises, as well as quantization noise in the ADC and distortion from the RF signals applied to the modulators. However, artificially adding random noise to CNNs has been shown to reduce over-fitting [42], meaning that some degree of stochastic behaviour is tolerable in the domain of machine learning problems. Finally, MRRs have only been shown to have up to 7 bits of precision, which is significantly smaller than the range and precision supported by even half-precision (16-bit) floating-point representations.

In conclusion, photonics has the potential to perform convolutions at speeds faster than top-of-the-line GPUs while having a lower energy consumption. Moving forward, the greatest challenges to overcome have to do with increasing the precision of photonic components so that they are comparable to classical floating-point representations. Overall, silicon photonics has the potential to outperform conventional electronic hardware for convolutions while having the ability to scale up in the future.

ACKNOWLEDGMENT

Funding for B.J.S., B.A.M., H.B.M., and V.B. was provided by the Natural Sciences and Engineering Research Council of Canada (NSERC) and the Queen's Research Initiation Grant (RIG).

[1] X. Li, G. Zhang, H. H. Huang, Z. Wang, and W. Zheng, in 2016 45th International Conference on Parallel Processing (ICPP) (2016) pp. 67–76.
[2] M. Jaderberg, A. Vedaldi, and A. Zisserman, CoRR abs/1405.3866 (2014), arXiv:1405.3866.
[3] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning (The MIT Press, 2016).
[4] Y. LeCun and C. Cortes, (2010).
[5] F. Duport, A. Smerieri, A. Akrout, M. Haelterman, and S. Massar, Scientific Reports 6, 22381 (2016).
[6] D. Brunner, M. C. Soriano, C. R. Mirasso, and I. Fischer, Nature Communications 4, 1364 (2013).
[7] K. Vandoorne, P. Mechet, T. Van Vaerenbergh, M. Fiers, G. Morthier, D. Verstraeten, B. Schrauwen, J. Dambre, and P. Bienstman, Nature Communications 5, 3541 (2014).
[8] L. Larger, M. C. Soriano, D. Brunner, L. Appeltant, J. M. Gutierrez, L. Pesquera, C. R. Mirasso, and I. Fischer, Optics Express 20, 3241 (2012).
[9] P. R. Prucnal and B. J. Shastri, Neuromorphic Photonics (CRC Press, Taylor & Francis Group, Boca Raton, FL, USA, 2017).
[10] A. N. Tait, M. A. Nahmias, B. J. Shastri, and P. R. Prucnal, Journal of Lightwave Technology 32, 4029 (2014).
[11] A. N. Tait, A. X. Wu, T. F. de Lima, E. Zhou, B. J. Shastri, M. A. Nahmias, and P. R. Prucnal, IEEE Journal of Selected Topics in Quantum Electronics 22, 312 (2016).
[12] A. N. Tait, T. Ferreira de Lima, M. A. Nahmias, H. B. Miller, H.-T. Peng, B. J. Shastri, and P. R. Prucnal, arXiv e-prints, arXiv:1812.11898 [physics.app-ph] (2018).
[13] T. W. Hughes, M. Minkov, Y. Shi, and S. Fan, Optica 5, 864 (2018).
[14] Y. Shen, N. C. Harris, S. Skirlo, M. Prabhu, T. Baehr-Jones, M. Hochberg, X. Sun, S. Zhao, H. Larochelle, D. Englund, and M. Soljačić, Nat. Photonics 11, 441 (2017).
[15] T. F. de Lima, H. Peng, A. N. Tait, M. A. Nahmias, H. B. Miller, B. J. Shastri, and P. R. Prucnal, Journal of Lightwave Technology 37, 1515 (2019).
[16] A. Mehrabian, Y. Al-Kabani, V. J. Sorger, and T. A. El-Ghazawi, CoRR abs/1807.08792 (2018), arXiv:1807.08792.
[17] Baidu Research, DeepBench.
[18] V. Bangari, B. Marquez, H. Miller, and B. J. Shastri, DEAP, https://github.com/Shastri-Lab/DEAP (2019).
[19] G. Tan, L. Li, S. Triechle, E. Phillips, Y. Bao, and N. Sun, in Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC '11 (ACM, New York, NY, USA, 2011) pp. 35:1–35:11.
[20] S. Chetlur, C. Woolley, P. Vandermersch, J. Cohen, J. Tran, B. Catanzaro, and E. Shelhamer, CoRR abs/1410.0759 (2014), arXiv:1410.0759.
[21] A. Rahim, T. Spuesens, R. Baets, and W. Bogaerts, Proceedings of the IEEE 106, 2313 (2018).
[22] M. A. Nahmias, B. J. Shastri, A. N. Tait, T. F. de Lima, and P. R. Prucnal, Opt. Photon. News 29, 34 (2018).
[23] W. Bogaerts, P. De Heyn, T. Van Vaerenbergh, K. De Vos, S. Kumar Selvaraja, T. Claes, P. Dumon, P. Bienstman, D. Van Thourhout, and R. Baets, Laser & Photonics Reviews 6, 47 (2012), arXiv:1208.0765.
[24] H. Zheng, R. Ma, and Z. Zhu, Analog Integrated Circuits and Signal Processing 90, 217 (2017).
[25] M. Lipson, Journal of Lightwave Technology 23, 4222 (2005).
[26] A. N. Tait, H. Jayatilleka, T. F. de Lima, P. Y. Ma, M. A. Nahmias, B. J. Shastri, S. Shekhar, L. Chrostowski, and P. R. Prucnal, Opt. Express 26, 26422 (2018).
[27] K. O'Shea and R. Nash, CoRR abs/1511.08458 (2015), arXiv:1511.08458.
[28] K. Mehrotra, C. K. Mohan, and S. Ranka, Elements of Artificial Neural Networks (MIT Press, Cambridge, MA, USA, 1997).
[29] Y. Tan and D. Dai, Journal of Optics 20, 054004 (2018).
[30] Z. Huang, C. Li, D. Liang, K. Yu, C. Santori, M. Fiorentino, W. Sorin, S. Palermo, and R. G. Beausoleil, Optica 3, 793 (2016).
[31] J. Sun, R. Kumar, M. Sakib, J. B. Driscoll, H. Jayatilleka, and H. Rong, Journal of Lightwave Technology 37, 110 (2019).
[32] M. Atef and H. Zimmermann, Analog Integr. Circuits Signal Process. 76, 367 (2013).
[33] B. Sedighi, M. Khafaji, and J. C. Scheytt, International Journal of Microwave and Wireless Technologies 4, 275 (2012).
[34] J. Fang, S. Thirunakkarasu, X. Yu, F. Silva-Rivas, C. Zhang, F. Singor, and J. Abraham, IEEE Transactions on Circuits and Systems I: Regular Papers 64, 1673 (2017).
[35] Micron Technology, Inc., GDDR6 SGRAM MT61K256M32 8Gb: 2 channels x16/x8 GDDR6 SGRAM.
[36] Advanced Micro Devices, Inc., Radeon Vega Frontier Edition (liquid-cooled).
[37] Advanced Micro Devices, Inc., Radeon Instinct MI25 accelerator.
[38] NVIDIA Corporation, NVIDIA Tesla P100 GPU accelerator.
[39] NVIDIA Corporation, GeForce GTX 1080 Ti.
[40] F. Zhang, B. Gao, X. Ge, and S. Pan, Optical Engineering 55, 031115 (2015).
[41] M. A. Piqueras, P. Villalba, J. Puche, and J. Martí, in 2011 IEEE International Conference on Microwaves, Communications, Antennas and Electronic Systems (COMCAS 2011) (2011) pp. 1–6.
[42] Z. You, J. Ye, K. Li, and P. Wang, CoRR abs/1805.08000 (2018), arXiv:1805.08000.