
Articles https://doi.org/10.1038/s41928-019-0270-x

A fully integrated reprogrammable memristor–CMOS system for efficient multiply–accumulate operations

Fuxi Cai 1,3, Justin M. Correll 1,3, Seung Hwan Lee 1,3, Yong Lim 1,2, Vishishtha Bothra1, Zhengya Zhang1, Michael P. Flynn1 and Wei D. Lu 1*

Memristors and memristor crossbar arrays have been widely studied for neuromorphic and other in-memory computing applications. To achieve optimal system performance, however, it is essential to integrate memristor crossbars with peripheral and control circuitry. Here, we report a fully functional, hybrid memristor chip in which a passive crossbar array is directly integrated with custom-designed circuits, including a full set of mixed-signal interface blocks and a digital processor for reprogrammable computing. The memristor crossbar array enables online learning and forward and backward vector-matrix operations, while the integrated interface and control circuitry allow mapping of different algorithms on chip. The system supports charge-domain operation to overcome the nonlinear I–V characteristics of memristor devices through pulse width modulation and custom analogue-to-digital converters. The integrated chip offers all the functions required for operational neuromorphic computing hardware. Accordingly, we demonstrate a perceptron network, a sparse coding algorithm and principal component analysis with an integrated classification layer using the system.

Memristors are two-terminal resistive devices that have a conductance state that depends on one or more internal state variables and can be modulated by the history of external stimulation1–5. Due to their compact device structure and ability to both store and process information at the same physical location, memristors and memristor crossbar arrays have been explored for neuromorphic computing, machine learning and edge computing applications5–8. These include single-layer perceptron9–11 and multi-layer perceptron networks12,13, image transformation14,15, sparse coding16, reservoir computing17 and principal component analysis (PCA)18. In addition, novel approaches using multiple memristor devices or mixed-precision architectures have been developed to effectively mitigate device non-idealities and allow the devices to be used in high-precision computing and training tasks19–24.

Although the key matrix operations can be performed efficiently with memristor crossbar arrays14,15,25, previous implementations have largely relied on external printed-circuit boards to provide the required interface and control circuitry10,14,16, or discrete parameter analysers to generate and collect signals9,12,15. In cases where memristor arrays are integrated with periphery circuitry, the circuit's function has been limited to providing access devices (for example, in the form of 1T1R arrays10,13,14,25) or address decoding purposes26,27. Demonstrating the potential of memristor-based computing hardware requires the development of fully functional systems, where the memristor crossbars are integrated with necessary analogue interface circuitry (including analogue-to-digital converters (ADCs) and digital-to-analogue converters (DACs)), digital buses and ideally a programmable processor to control the digital and analogue components. Integrating all the necessary functions on chip will be key to enabling the practical implementation of memristor-based computing systems and allowing the prototypes to be scaled to larger systems.

In this Article, we report a fully integrated, functional, reprogrammable memristor chip, including a passive memristor crossbar array directly integrated with all the necessary interface circuitry, digital buses and an OpenRISC processor to form a complete hardware system on chip. Thanks to the re-programmability of the memristor crossbar and the integrated complementary metal–oxide–semiconductor (CMOS) circuitry, the system is highly flexible and can be programmed to implement different computing models and network structures. Three widely used models (a perceptron network, a sparse coding algorithm and a bilayer PCA system with an unsupervised feature extraction layer and a supervised classification layer) are demonstrated experimentally on the same chip.

Fully integrated reprogrammable neuromorphic chip
A memristor crossbar is very efficient at vector-matrix multiplication (VMM), because the values in the matrix can be stored as the analogue device conductances of the crossbar array. When an input vector is applied as voltage pulses with different pulse amplitudes or different pulse widths to the rows of the crossbar, the currents or charges collected at the columns of the crossbar correspond to the resulting VMM outputs, following Ohm's law and Kirchhoff's current law14,16,25,28. This approach allows direct computing of this data-intensive task both in memory and in parallel5,29 in a single step, that is O(1), and has attracted broad interest for implementing neuromorphic computing and machine learning models28,30,31.
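For concreteness, the forward and transpose VMM operations can be summarized in a short numerical sketch. The following is a minimal NumPy illustration, not the chip's firmware; the conductance range, pulse-width encoding and variable names are illustrative assumptions.

```python
import numpy as np

# Minimal sketch of charge-domain VMM on a memristor crossbar (assumed values).
# The stored matrix W is encoded as device conductances G; inputs are encoded
# as pulse widths t_i at a fixed read amplitude, so the charge on column j is
# Q_j = sum_i V_read * G[i, j] * t[i] (Ohm's law plus Kirchhoff's current law).
rng = np.random.default_rng(0)
ROWS, COLS = 54, 108                        # crossbar size used in this work
V_READ = 0.6                                # 600 mV read pulses (see Methods)
G = rng.uniform(1e-6, 1e-4, (ROWS, COLS))   # conductances in siemens (illustrative)

def forward_vmm(t_row):
    """Forward pass x^T W: drive rows with pulse widths, integrate column charge."""
    return V_READ * (t_row @ G)

def transpose_vmm(t_col):
    """Backward pass a W^T: drive columns, integrate charge on the rows."""
    return V_READ * (G @ t_col)

x = rng.integers(0, 64, ROWS) * 1e-6        # 6 bit pulse widths, 0-63 time units
q_columns = forward_vmm(x)                  # the whole product completes in one 'step'
q_rows = transpose_vmm(rng.integers(0, 64, COLS) * 1e-6)
```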

1Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, MI, USA. 2Present address: Samsung Electronics, Yongin, Korea. 3These authors contributed equally: Fuxi Cai, Justin M. Correll, Seung Hwan Lee. *e-mail: [email protected]

The results from the crossbar are then used to update the neuron outputs, which typically require efficient ADC circuits that convert the analogue current or charge signals collected at the columns to digital signals that can be processed by the artificial neurons. Similarly, neuron outputs need to be supplied to the devices in the crossbar, typically through DAC circuits feeding the rows. Controllers are needed to convert the input signals to pulse amplitude or width, and to implement weight update rules during training. To reduce latency and power consumption, all these components need to be integrated together with the crossbar array on a single chip, instead of using discrete components on a board. Integrating a processor on chip also allows the neuron functions and network structures to be reprogrammed through simple software changes, enabling different models to be mapped on the same hardware platform.

In this study we have designed and fabricated a complete, integrated memristor/CMOS system, with a memristor crossbar array integrated on top of CMOS circuits consisting of a full set of ADCs, DACs, digital bus, memory and a processor, allowing the integrated system to take advantage of the efficiency of the matrix operations provided by the memristor crossbar and the flexibility offered by the CMOS system to successfully implement different algorithms on chip. The system architecture is shown in Supplementary Figs. 1 and 2.

We operate the crossbar array in the charge domain to minimize multiplication error due to memristor device I–V nonlinearity (Supplementary Note 1). Our approach also simplifies the interface circuit design as the input pulses to the crossbar have fixed amplitude and variable widths. The charge-domain technique is enabled by a current-integrating hybrid ADC design and a pulse-mode DAC scheme (Supplementary Figs. 3 and 4).

Our design features both DACs and ADCs at each row and each column for flexibility and to allow bidirectional and full transpose operation of the crossbar. For example, inputs x can be applied to the rows and the charges collected at the columns correspond to the forward operation $x^T W$. If the output neurons' activities are then applied to the columns, the charges collected at the rows then represent the reconstructed signal, corresponding to multiplication of the output neurons' activities a and the transpose of the weight matrix, $a W^T$. With DACs and ADCs at each row and column, both forward and backward operations can be obtained with the same crossbar array16. Furthermore, our design allows all DACs and ADCs at each row/column to operate in parallel, thereby supporting high-throughput parallel VMM and transpose VMM operations. More details of the system architecture are provided in Supplementary Fig. 1 and Supplementary Note 2.

The mixed-signal interface is controlled by an on-chip OpenRISC processor that allows the implementation of different computing operations, as shown in Supplementary Fig. 2 and Supplementary Note 2. The processor provides fine-grained programmability, allowing the operation of any set of rows and columns, in either read or write mode and forward or backward direction (Supplementary Figs. 5–8 and Supplementary Note 3). The design also supports online dictionary learning with higher-voltage write DACs to program or update the memristor conductance (Supplementary Figs. 7 and 8).

A 54 × 108 WOx-based memristor crossbar is fabricated on top of the CMOS circuits. Figure 1a shows a photograph of the integrated chip after packaging, along with the testing set-up. Figure 1b,c are top-view images of the chip, showing the memristor array integrated on top of the chip surface, at different magnifications. Each row and column of the crossbar array is connected to a specific landing pad left open during the CMOS fabrication process, and then connected to the interface circuitry through internal CMOS wiring (Fig. 1d; see also Supplementary Figs. 9 and 10 and Methods for fabrication details). Each row or column of the array is connected to two write DACs, one read DAC and a 13 bit ADC (Fig. 1e). The memristors can be successfully programmed by controlling the number of applied write/erase pulses, then read out by the integrated interface circuitry and controller, even without any current-limiting transistors or external current compliance (Supplementary Fig. 11). Figure 1f shows the on/off ratio distribution of devices in the integrated array after a weight update operation. The relatively uniform distribution of devices within the array allows the system to implement a number of computing tasks on chip. We note that device variations can still be observed. This effect is exacerbated by line resistance effects (Supplementary Figs. 12 and 13). Further device optimizations and the use of a monolithic integration technique that reduces the line resistance can lead to better array and network performance (Supplementary Fig. 14 and Supplementary Note 4).

In the following, we discuss different models and neural network structures implemented in the same chip, by reprogramming the on-chip processor and the functions of the memristor array and the interface circuits.

Training and classification using a single-layer perceptron
A feed-forward single-layer perceptron (SLP) network was first implemented to verify the operation of the integrated chip; 5 × 5 binary patterns were used in the SLP training and testing. The SLP has 26 inputs (corresponding to the 25 pixels in the image and a bias term) and 5 outputs, with the input and output neurons fully connected with 26 × 5 = 130 synaptic weights, where the neuron with the highest output is identified as the winner and used to classify the corresponding class, as schematically shown in Fig. 2a (see Methods).

In our implementation, the original binary input patterns are converted into input voltage pulses through the integrated processor and DAC circuitry and are fed to the rows of the memristor array. Specifically, when a white pixel is present, a pulse is applied to the corresponding row, while black pixels correspond to no pulse. The bias term is fixed at a constant value of 1 (treated as a white pixel) and is applied as an extra input. All the input pulses have the same duration and amplitude in this test, as illustrated in Fig. 2b. Each synaptic weight $w_{ij}$ is implemented with two memristors representing a positive weight and a negative weight, $G_{ij}^+$ and $G_{ij}^-$, respectively, using the positive memristor conductance values (see equation (10) in Methods).

The integrated chip allows us to perform online learning. Specifically, the synaptic weights are updated during the training phase using the batch rule:

$$\Delta w_{ij} = \eta \sum_{n=1}^{N} \left( t_j^{(n)} - y_j^{(n)} \right) x_i^{(n)} \quad (1)$$

where $x^{(n)}$ is the nth training sample of the input dataset, $y^{(n)}$ is the network output and $t^{(n)}$ is the corresponding label. η is the learning rate. The update value $\Delta w_{ij}$ for the ith element in the jth class is then implemented in the memristors by applying programming pulses through the write DACs with a pulse width proportional to the desired weight change (quantized within the range of 0–63 time steps, corresponding to 6 bit precision).

We trained and tested the SLP with noisy 5 × 5 Greek letter patterns, for five distinct classes: 'Ω', 'M', 'Π', 'Σ', 'Φ'. For each Greek letter, we flip one of the 25 pixels of the original image and generate 25 noisy images. Together with the original image they form a set of 26 images for each letter. We randomly select 16 images from the set of each class for training and use the other 10 images for testing (Fig. 2c). Examples of the training set and testing set are shown in Supplementary Figs. 15 and 16. This approach guarantees that the training set and the testing set do not overlap, and therefore improves the robustness of our testing results.

Training and testing results from the experimentally implemented SLP are shown in Fig. 2d,e. After five online training epochs the SLP can already achieve 100% classification accuracy for both the training and testing sets. The average activation of the correct neuron during training is also clearly separated from the others, and the difference in neuron outputs between the winning neuron and the other neurons improves during training, as shown in Fig. 2d, verifying that online learning has been reliably implemented in the experimental set-up.
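As a concrete illustration of this training loop, the following sketch applies the batch rule of equation (1) to a differential conductance pair, with updates quantized to 6 bit pulse-width steps as on chip. The learning rate and conductance step size are assumptions for illustration only.

```python
import numpy as np

# Sketch of one training epoch for the SLP using the batch rule of equation (1).
# Each weight w_ij = G_plus[i, j] - G_minus[i, j]; positive updates are written
# to G+, negative updates to G-, quantized to 0-63 pulse-width steps (6 bit).
ETA = 0.1       # learning rate (assumed)
G_STEP = 1e-7   # conductance change per programming step (assumed)

def batch_update(G_plus, G_minus, X, T, Y):
    """X: (N, 26) inputs; T, Y: (N, 5) labels and outputs; G+/G-: (26, 5)."""
    dW = ETA * X.T @ (T - Y)                               # eq. (1), summed over batch
    steps = np.clip(np.round(np.abs(dW) / G_STEP), 0, 63)  # 6 bit quantization
    dG = np.sign(dW) * steps * G_STEP
    G_plus += np.maximum(dG, 0.0)    # weight increases: program G+
    G_minus += np.maximum(-dG, 0.0)  # weight decreases: program G-
    return G_plus, G_minus
```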

Fig. 1 | Fully integrated memristor/CMOS chip. a, Integrated chip wire-bonded on a pin-grid array package. Inset, testing set-up used to power and test the integrated memristor/CMOS chip. b, Optical microscopic image of a memristor crossbar array integrated on the chip, with portions of the CMOS circuitry visible underneath the chip surface. c, Magnified image of the integrated 54 × 108 memristor array region. Inset, scanning electron microscopy (SEM) image of the WOx memristor crossbar. Scale bar, 50 μm. d, Cross-section schematic of the integrated chip, showing connections of the memristor array with the CMOS circuitry through extension lines and internal CMOS wiring. Inset, cross-section of the WOx device. e, Schematic of the mixed-signal interface to the 54 × 108 crossbar array, with two write DACs, one read DAC and one ADC for each row and column. f, Distribution map for a 20 × 20 subarray of the memristor crossbar. Colour represents the on/off ratio measured after 50 consecutive programming pulses (1.8 V, 82 µs; see Supplementary Note 1 for more details).

Sparse coding algorithm implementation
The same hardware system was then used to implement a sparse coding algorithm (see Methods). Sparse coding aims at representing the original data with the activity of a small set of neurons, and can be traced to models of data representation in the visual cortex32,33. Sparse coding is an efficient method for feature extraction and information compression, and allows classification to be performed in the compressed domain34.

Following our previous work implemented at the board level16, we mapped the locally competitive algorithm (LCA)35 on our integrated memristor/CMOS chip. In this approach, the membrane potential of an output neuron is determined by the input, a leakage term and an inhibition term whose strength is proportional to the similarity of the neurons' features; that is, an active neuron will try to inhibit neurons having similar features. It can be shown mathematically that lateral neuron inhibition can be achieved in the memristor crossbar by removing the reconstructed signal from the input (see equation (13) in Methods). With this approach, the LCA algorithm can be implemented in an iterative process through two VMM operations: in the forward direction to obtain neuron activations and in the backward direction to obtain the reconstructed input. The residue term is then obtained by removing the reconstructed input from the original input, and is then fed to the network. The process is repeated until the network stabilizes. Figure 3a illustrates the iterative forward and backward processes employed in the LCA implementation.

The bidirectional operation of the memristor array in the integrated memristor/CMOS chip allowed us to experimentally implement the sparse coding algorithm on chip. Similar to the SLP case, we use the crossbar array to perform VMM operations, here in both forward and backward directions. Given that the chip offers full flexibility to implement different algorithms by re-programming the integrated processor, the LCA algorithm was implemented in the same chip used in the SLP study, through simple software changes.

We used 4 × 4 inputs to test the experimental implementation of the LCA algorithm. By using linear combinations of horizontal and vertical bar patterns, the input dimension is reduced to 7. To satisfy the over-completeness requirement of the LCA algorithm, a dictionary containing 14 features of horizontal and vertical bar patterns is used, as shown in Fig. 3b. This set-up produces a 2× over-complete dictionary35 that enables the network to find a sparse, optimal solution out of several possible solutions.

The LCA algorithm was mapped to a 16 × 14 subarray in the memristor/CMOS chip, using the corresponding interface circuitry and the processor that provide the neuron functions. An example of the LCA network operation is shown in Fig. 3c,d. The experimentally implemented network correctly reconstructs the input image while minimizing the number of activated neurons. For example, it identifies the optimized solution with two neurons (6 and 13), instead of using three neurons (2, 4 and 6) in this case. The dynamics of the LCA network operation can also be correctly captured, as shown in Fig. 3e, where the effects of lateral neuron inhibition that lead to a more sparse solution than the initial solution can be clearly observed (see Methods).

To verify the system's performance for other input patterns, an exhaustive test of all 24 possible patterns consisting of two horizontal bars and one vertical bar was performed using the on-chip memristor network, resulting in a 100% success rate (Fig. 3f), measured by the network's ability to correctly identify the sparse solutions.
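For reference, the iterative forward/backward loop described above can be written compactly in software. The NumPy sketch below follows equations (13) and (14) in Methods; the time constant and iteration count are illustrative choices (the threshold λ = 18 matches the value used in Fig. 3).

```python
import numpy as np

# Sketch of the LCA loop: a forward VMM gives the neuron drive r^T D,
# thresholding gives the activities a, and a backward (transpose) VMM
# reconstructs the input; the residue r = x - x_hat drives the next iteration.
def lca(x, D, lam=18.0, tau=10.0, n_iter=30):
    """x: (m,) input; D: (m, n) dictionary, one feature per column."""
    u = np.zeros(D.shape[1])                 # membrane potentials
    for _ in range(n_iter):
        a = np.where(u > lam, u, 0.0)        # hard threshold, eq. (13b)
        x_hat = D @ a                        # backward pass: reconstruction
        r = x - x_hat                        # residue removed from the input
        u += (1.0 / tau) * (-u + r @ D + a)  # membrane update, eq. (14)
    a = np.where(u > lam, u, 0.0)
    return a, D @ a                          # sparse code and reconstruction
```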


Fig. 2 | Experimental demonstration of the single-layer perceptron on the integrated memristor chip. a, Schematic of the single-layer perceptron for classification of 5 × 5 images. b, Implementation of the SLP using a 26 × 10 memristor subarray through the integrated chip. Input data (for example, the Greek letter ‘Ω’) are converted to voltage pulses of Vread or 0 through the on-chip circuitry, depending on the pixel value. c, Samples from the training data set. d, Evolution of the output neuron signals during training. Each data point represents the average output value for a specific neuron during a training epoch for a given input class. e, Classification error of the training and testing data set as a function of training epochs. The network can achieve 100% classification for both the training set and the testing set after five training epochs.

Implementation of a multi-layer neural network
Finally, we demonstrate a bilayer neural network with the same integrated chip, using two subarrays in the memristor crossbar implementing unsupervised and supervised online learning to achieve feature extraction and classification functions, respectively. The bilayer network is used to analyse and classify data obtained from breast cancer screening based on PCA. Specifically, the first layer of the system is a 9 × 2 network that performs PCA of the original data, which reduces the nine-dimensional (9D) raw input data to a 2D space based on the learned principal components (PCs). The second layer is a 3 × 1 SLP layer (with differential weights and a bias term), which performs classification using the reduced data in the 2D space for the two classes (benign or malignant). The schematic and crossbar implementation of the bilayer network are shown in Fig. 4a,b.

PCA reduces data dimensionality by projecting data onto lower dimensions along the PCs, with the goal of finding the best summary of the data using a limited number of PCs36. The conventional approach to PCA is to solve the eigenvectors of the covariance matrix of the input data, which can be computationally expensive in hardware. A more hardware-friendly approach is to find the PCs through unsupervised, online learning.

Specifically, following our previous study18, Sanger's rule, also known as the generalized Hebbian algorithm, is implemented in the integrated chip to obtain the PCs (Supplementary Note 5). The desired weight change for the jth PC is determined by

$$\delta g_{ij} = \eta y_j \left( x_i - \sum_{k=1}^{j} g_{ik} y_k \right) \quad (2)$$

In the experiment, the weights of the first and second PCs, $g_{ij}$, are mapped onto the memristor conductances through a linear transformation18 (Supplementary Note 5). The network is trained online, using a subset of the original database consisting of 100 data points. During the training process, the 9D breast cancer data are converted into input voltage pulses with pulse widths proportional to the data values, within the range of 0–63 time units. The output charge collected at column j then corresponds to the dot product of the input vector and the conductance vector stored in column j, projecting data from the original 9D space to a 2D output space (when only two PCs are used). During training, the weights are then updated following equation (2), using programming voltage pulses generated through the write DACs with pulse widths proportional to $\delta g_{ij}$.

Initially, the weights of the first and second PCs are randomized in the memristor array (Fig. 4c). Projection of the input along these vectors leads to severe overlapping of the benign and malignant cases in the 2D space, as shown in Fig. 4d,e. After 30 training epochs (an epoch is defined as a training cycle through the 100 training data), the PCs are correctly learned (Fig. 4f) and the 2D projected data can be clearly separated into two clusters (Fig. 4g,h). Note that the ground truth (benign or malignant) is not used in the PCA training or clustering process.
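A compact software rendering of this online PCA update may help make equation (2) concrete. The sketch below implements Sanger's rule per sample; the learning rate, initialization and placeholder data are assumptions, and the conductance mapping used on the real chip is omitted.

```python
import numpy as np

# Sketch of online PCA with Sanger's rule (equation (2)).
def sanger_update(g, x, eta=0.01):
    """g: (9, 2) PC weight matrix; x: one 9D screening sample."""
    y = x @ g                                  # project onto current PC estimates
    for j in range(g.shape[1]):
        # subtract the part of x already explained by PCs 1..j (k <= j)
        x_res = x - g[:, : j + 1] @ y[: j + 1]
        g[:, j] += eta * y[j] * x_res          # dg_j = eta * y_j * x_res
    return g

rng = np.random.default_rng(0)
g = rng.normal(scale=0.1, size=(9, 2))         # randomized initial PCs
for _ in range(30):                            # 30 epochs over 100 samples
    for x in rng.uniform(1, 10, (100, 9)):     # placeholder data; real scores are 1-10
        g = sanger_update(g, x)
```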


Fig. 3 | Experimental demonstration of sparse coding using the integrated memristor chip. a, Schematic of the experimental implementation of the LCA algorithm using on-chip memristor arrays. In each iteration a forward pass is performed to update the membrane potential u of the output neurons based on the inputs x, followed by a backward pass to update the residual r based on the neuron activities a. The residual r becomes the input for the next iteration. b, Dictionary elements based on horizontal and vertical bars. c, Original test image. d, Experimentally reconstructed image based on the neuron activities from the memristor chip. e, Experimentally obtained neuron membrane potentials as a function of iteration number during LCA analysis. The horizontal red dashed line marks the threshold parameter λ. f, Additional examples of original input images (O) and reconstructed images (R). The same threshold λ = 18 is used in all experiments in d–f.

It is included in the plots (represented as blue and red colours in Fig. 4g,h) only to highlight the effectiveness of the clustering before and after learning the PCs.

The PCA layer separates the original data into clusters, but does not classify them. To achieve classification, we implement a second layer, an SLP, in the same hardware system. The SLP processes outputs from the PCA layer and generates a label (benign or malignant). Given that there are only two classes to distinguish, the SLP is trained online using logistic regression. A 3 × 2 subarray is used in the second layer to account for the two inputs, the bias term and the differential weights (see Methods).

After learning the PCs in the PCA layer, the original 9D data are fed through the PCA layer, and the clustered 2D data are used as inputs for the SLP layer. The same 100 training data used for the PCA layer training are used for the SLP layer training (Fig. 5a) in a supervised fashion, using the ground truth (the label associated with the original data, Supplementary Note 6). Training is completed after 30 epochs.

Afterwards, the 500 test data not included in the training set are applied to the network, passing first through the PCA layer then as 2D data into the SLP layer. After online training of the PCA and the SLP layers, the experimentally implemented two-layer network can achieve 94% and 94.6% classification accuracy during training and testing, respectively (Fig. 5b,c). The values are slightly lower than those obtained from software implementation (95% during training and 96.8% during testing, Fig. 5d,e), due to the non-ideality in the memristor weight update that results in a decision boundary that differs from that obtained from software (which assumes ideal linear weight updates) after the online training process (Supplementary Figs. 17 and 18). Other statistical parameters, including the sensitivity, specificity and accuracy for the experimentally implemented PCA + classifier network, were calculated to be 93.1%, 99.0% and 94.6%, respectively, along with excellent receiver operating characteristic and F1 scores (Supplementary Fig. 19 and Supplementary Note 7).

Performance analysis of the integrated memristor chip
An important design goal in this study is to have the complete algorithms run on-chip, without having to access off-chip storage for weights and instructions. To achieve this goal, we first program all necessary instructions in the algorithm in C code, and compile the C code into binary machine code. This process only needs to be performed once for each application. Afterwards, the binary code is loaded into the static random-access memory on chip by a bootloader, and all operations are performed on chip. During training and inference, the binary program instructions are executed through the OpenRISC processor without the need to access external controllers.
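For concreteness, a minimal sketch of the bilayer inference path described above (PCA projection followed by the logistic-regression SLP, with two inputs plus a bias as in the 3 × 2 subarray) is given below; the folded weight vector, the 0.5 decision threshold and the class coding are illustrative assumptions, not the on-chip implementation.

```python
import numpy as np

# Sketch of the bilayer inference path: PCA projection followed by the
# logistic-regression SLP. w_slp folds the differential weights into a
# single 3-element vector (two PC inputs plus a bias term).
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bilayer_predict(x9, g_pca, w_slp):
    """x9: (9,) input; g_pca: (9, 2) learned PCs; w_slp: (3,) weights."""
    y2 = x9 @ g_pca                          # layer 1: project to the 2D space
    z = np.concatenate([y2, [1.0]]) @ w_slp  # layer 2: weighted sum with bias
    return int(sigmoid(z) > 0.5)             # 1 = malignant, 0 = benign (assumed)
```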


Fig. 4 | Experimental demonstration of PCA using the integrated chip. a, Schematic of the bilayer neural network for PCA analysis and classification. b, The bilayer network is mapped onto the integrated memristor chip, using a 9 × 2 subarray for the PCA layer and a 3 × 2 subarray for the classification layer. c, Initial weights for the two PCs in the network. d,e, Before training, linear separation is not possible in the projected 2D space, for both training (d) and testing (e) data. f, Weights for the two PCs after unsupervised online training obtained from the memristor network, using Sanger’s learning rule and 30 training cycles. g,h, Clear separation can be observed in the 2D space for both training (g) and testing (h) data after projection along the trained PCs. Note that the colour labels (representing the ground truth) are used only to highlight the effect of clustering in the plots, and are not used in the PC training or PCA analysis.

The inputs to the memristor array are supplied through the DACs following the instructions, and the VMM results are read as charge values from the ADCs and processed by the OpenRISC processor for batch gradient descent calculations or for running other algorithms. The results (for example, batch update values) are then sent to the on-chip registers, and the DACs are reconfigured for the next operations. After the process is complete, the final output can be transferred to a computer and accessed by the users. Details of the data path are provided in Supplementary Fig. 20 and Supplementary Note 8.

The integrated chip suggests that different computing tasks can be efficiently mapped on the memristor-based computing platform by taking advantage of the bidirectional VMM operations in the memristor crossbars and the flexibility of the CMOS interface and control circuitry. In our prototype, the supporting analogue interfaces, as well as digital control and the OpenRISC processor, are implemented in 180 nm CMOS technology. The entire mixed-signal interface with independent ADCs and DACs supporting the 54 × 108 crossbar and operating at the maximum frequency of 148 MHz consumes 64.4 mW, obtained from experimental measurements. This corresponds to an energy consumption of 6.53 nJ per inner product or 1.12 pJ per operation for the mixed-signal interface, where an operation is defined as the multiply–accumulate (MAC) process of a 4 bit input with a stored analogue weight in the memristor array. At the maximum operating frequency of 148 MHz, the OpenRISC core consumes 235.3 mW and the system supports 9.87 M VMMs per second, corresponding to a throughput of 57.5 GOPS. Along with an average power consumption of 7 mW experimentally measured from the 54 × 108 crossbar, this results in a total system power of 306.7 mW and a power efficiency of 187.62 GOPS per W for the experimentally demonstrated memristor/CMOS chip based on 180 nm CMOS technology. Simply scaling the design to a more advanced process node such as 40 nm CMOS technology would reduce the total system power consumption to 42.1 mW, corresponding to a power efficiency of 1.37 TOPS per W (Supplementary Fig. 21 and Supplementary Note 9). Further optimizations of the system design, for example by replacing the generic processor with a custom-designed controller or field-programmable gate array, and by replacing the fast and high-precision 13 bit ADC with simpler interface circuits (for example, 8 bit neuron activations may be sufficient for common neural networks) and more optimized ADC designs, along with memristor device optimizations that reduce power consumption in the memristor crossbar, can further improve the system's performance and power efficiency (see Supplementary Figs. 22–27 and Supplementary Note 10 for details about circuit optimizations).

During batch training, the memory needed to store the batch information may become a bottleneck as the task becomes complex and the pattern size increases. Batch gradient descent has also been associated with overfitting due to the low stochasticity of the process. For complicated tasks and large datasets, mini-batch training37,38 may be an attractive option by splitting the training dataset into small batches, with typical batch sizes between 2 and 32. Additionally, recently proposed hybrid techniques that utilize CMOS circuits to store the lower-significance weight updates could also provide a realistic solution to the batch information storage challenge22.
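The power budget quoted above can be checked with simple arithmetic; the small difference from the quoted 187.62 GOPS per W comes from rounding of the reported throughput.

```python
# Sanity check of the reported system power and efficiency figures.
interface_mw = 64.4   # mixed-signal interface at 148 MHz
core_mw = 235.3       # OpenRISC core at 148 MHz
crossbar_mw = 7.0     # average 54 x 108 crossbar power
total_mw = interface_mw + core_mw + crossbar_mw   # 306.7 mW

throughput_gops = 57.5                            # from 9.87 M VMMs per second
efficiency_gops_per_w = throughput_gops / (total_mw / 1000.0)
print(f"{total_mw:.1f} mW total, {efficiency_gops_per_w:.1f} GOPS per W")  # ~187.5
```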


Fig. 5 | Classification results in the bilayer network. a, Evolution of the classification error during the online training process, from the experimentally implemented bilayer network on chip. b,c, Classification results experimentally obtained from the memristor chip, for the training (b) and testing (c) data. Blue and red colours represent the predicted benign and malignant data, respectively. The incorrectly classified results are marked as open circles. Classification rates of 94% and 94.6% are obtained for the training and testing data, respectively. d,e, Classification results of the bilayer network implemented in software. Blue and red colours represent the predicted benign and malignant data, respectively. Incorrectly classified results are marked as open circles. Classification rates of 95% and 96.8% are obtained for the training (d) and testing (e) data in software, respectively.

To achieve high accuracy with more complex and larger datasets, the memristor device properties should be further improved. The WOx memristor device can be reliably programmed over 10⁷ times. Although this level of endurance can support certain online training algorithms, longer endurance may be desirable. Additionally, the cycle-to-cycle variation during the 10⁷ programming cycles is ~3.4–4.2% (Supplementary Fig. 28a) and the device-to-device variability is ~4.5% (Supplementary Fig. 28b). The small models we implemented in the current study can tolerate these cycle-to-cycle and device-to-device variations; however, device nonlinearity and variability need to be reduced to implement larger networks39. To this end, future device optimizations that can improve device uniformity and weight update linearity13,40, along with architecture innovations such as hybrid non-volatile memory (NVM)–CMOS neural-network implementations22, mixed-precision19, multi-memristive device architectures21 and other precision extension techniques24, can be employed. To scale up the system for larger networks, rather than simply increasing the crossbar size, a promising approach may be to tile small crossbars together in a modular fashion24,41,42, as schematically shown in Supplementary Figs. 29 and 30. In this approach, each tile is a self-contained, integrated memristor–CMOS unit (macro); these units are then tiled together using digital interfaces to construct larger systems5 (Supplementary Note 11).

Conclusions
We have reported the design and fabrication of a fully functional, programmable neuromorphic computing chip, in which a passive memristor crossbar array is integrated with a complete set of analogue and digital components and an on-chip processor. The integrated chip allows mapping of different neuromorphic and machine learning algorithms on chip through simple software changes. Three different and commonly used models, perceptron, sparse coding and PCA with an integrated classification layer, have been demonstrated. A classification accuracy of 100% was achieved for 5 × 5 noisy Greek letters in the SLP implementation, reliable sparse coding analysis was obtained from an exhaustive test set using 4 × 4 bar patterns, and a 94.6% classification rate was experimentally obtained from the breast cancer screening dataset using the same integrated chip.

The integrated memristor–CMOS systems potentially offer efficient hardware solutions for different network sizes and applications6–10,13,14,16,19,20,22,25,43. An initial application of such systems may be edge computing, for example as used in the Internet of Things (IoT) to process data near its source, allowing real-time data processing with high speed and low energy consumption44,45. Continued device, circuit and architecture innovations as discussed above, along with algorithm advances such as quantized neural networks46,47, can allow the system to be scaled up for more complex and more demanding tasks.

Methods
Crossbar array fabrication and integration. The memristor crossbar array used in this work was directly fabricated on top of the CMOS circuits. First, bottom electrode (BE) patterns with 500 nm width were defined by electron-beam lithography; the 80-nm-thick Au BEs were then deposited (with a Ni/Cr adhesion layer underneath) by electron-beam evaporation and lift-off processes.

Next, 300 nm of SiO2 was deposited by plasma-enhanced chemical vapour deposition, followed by a reactive-ion etch-back process to form a spacer structure along the sidewalls of the BEs. The spacer structure allows better step coverage for the WOx switching layer and the top electrodes (TEs), and also restricts the active device regions to the flat exposed top surface of the BEs, as shown in Fig. 1d. To prevent leakage through the switching layer among adjacent devices, the switching layer was only deposited at the crosspoint regions through electron-beam-lithography-defined patterns. The switching layer was formed by first depositing 20-nm-thick W by d.c. sputtering and lift-off processes in the electron-beam patterned regions, then through rapid thermal annealing of the patterned W islands with oxygen gas at 400 °C for 60 s to form the WOx switching material. Afterwards, TEs (Pd (40 nm)/Au (90 nm)) with 500 nm width were patterned and deposited by electron-beam lithography, electron-beam evaporation and lift-off processes. Finally, metallization processes were performed by photolithography to connect the crossbar electrodes with the CMOS landing pads that are left open during the CMOS circuit fabrication process. An in situ etch process was performed to remove the native oxide on the CMOS landing pads, followed by deposition of 800-nm-thick Al with d.c. sputtering and lift-off processes to ensure step coverage of the deeply recessed landing pads.

Experimental set-up for chip measurement. The integrated chip was wire-bonded onto a pin-grid-array package and mounted on a printed circuit board (PCB). The PCB provides the needed power signals and the global system clock for the integrated chip. No active circuitry (DACs, ADCs, matrix switches and so on) is implemented on the PCB, as these functions are all provided directly on chip. A UART-to-serial (UART, universal asynchronous receiver–transmitter) board was used to convert the input/output data from the chip into serial data and communicate with a desktop computer through a USB cable.

On-chip analogue-to-digital and digital-to-analogue interfaces. A charge-based mixed-signal interface was used to support the VMM and programming operations of the integrated crossbar array. We operated the memristor array in the charge domain instead of the voltage/current domain to avoid the nonlinear multiplication issue (for example, 2 × voltage does not produce 2 × current because of the nonlinear current–voltage characteristics of the device). Specifically, we applied a discrete-time pulse-train input and measured the accumulated charge from each column (row). During VMM operations the row (column) DACs applied 6 bit programmable pulses at fixed amplitude (600 mV with respect to the virtual ground) to the crossbar. The integrating ADCs provided a virtual ground (at 1.2 V) and measured the collected charge over the input period.

A hybrid incremental charge-integrating ADC was used to tackle the challenge of digitizing a broad range of column (row) outputs while minimizing the area and energy consumption. The required large number of ADCs (162), charge-domain inputs and large dynamic ranges all present unique design challenges. Supplementary Fig. 3 shows the 13 bit ADC design used for this study, which balances performance and area consumption. Accurate and linear time-domain DACs were implemented with a pulsed, return-to-zero DAC architecture. A simple time-domain DAC is inherently nonlinear because of the finite rise and fall times of the output waveform (Supplementary Fig. 4). Instead, we applied duty-cycled (75%) pulsed inputs where the number of pulses represented the input amplitude. The return-to-zero 'read' DACs toggled between 1.8 V and 1.2 V (virtual ground), creating 600 mV pulses for VMM operations without changing the memristor conductance. Two additional 'write' DACs were connected to each column and row to create 1.8 V amplitude pulses for weight updates during online learning.

Logistic regression. A supervised learning algorithm, logistic regression, was used to train the SLP layer of the bilayer network to classify benign and malignant cells. Logistic regression is commonly used for the classification of two classes. Suppose an N-sample training dataset $x = (x^{(1)}, \ldots, x^{(N)})^T$ with labels $t = (t^{(1)}, \ldots, t^{(N)})^T$, where the nth training sample can be written as $x^{(n)}$ and the nth label as $t^{(n)} \in \{0, 1\}$.

A cross-entropy energy function can be defined as

$$E(w) = -\sum_{n=1}^{N} \left\{ t^{(n)} \ln y^{(n)} + \left(1 - t^{(n)}\right) \ln \left(1 - y^{(n)}\right) \right\} \quad (3)$$

where $y^{(n)} = \sigma(z^{(n)})$ and $z^{(n)} = w^T x^{(n)}$. σ(z) is the logistic sigmoid function defined by

$$\sigma(z) = \frac{1}{1 + \exp(-z)} \quad (4)$$

The likelihood of a training sample $x^{(n)}$ belonging to class $t^{(n)} = 1$ is determined by the output $y^{(n)}$, with larger $y^{(n)}$ meaning $x^{(n)}$ is more likely to belong to the class. Taking the gradient of the error function leads to

$$\nabla E(w) = \sum_{n=1}^{N} \left( y^{(n)} - t^{(n)} \right) x^{(n)} \quad (5)$$

To minimize the energy function, the network is trained using batch gradient descent, defined as48

$$w = w - \eta \sum_{n=1}^{N} \left( y^{(n)} - t^{(n)} \right) x^{(n)} \quad (6)$$

Softmax regression. Softmax regression is a generalized logistic regression for multi-class classification, usually used when more than two classes need to be classified. The Softmax function is defined as

$$y_j^{(n)} = \sigma\left(z^{(n)}\right)_j = \frac{\exp\left(z_j^{(n)}\right)}{\sum_k \exp\left(z_k^{(n)}\right)} \quad (7)$$

where $z_k^{(n)}$ is given by $z_k^{(n)} = w_k^T x^{(n)}$, representing the likelihood of training data $x^{(n)}$ belonging to class $C_k$ ($t^{(n)} = (0, \ldots, 1, \ldots, 0)$, with the 1 in the kth position).

The network ends up with a similar form of gradient as in the logistic regression case:

$$\nabla_{w_j} E\left(w_1, \ldots, w_K\right) = \sum_{n=1}^{N} \left( y_j^{(n)} - t_j^{(n)} \right) x^{(n)} \quad (8)$$

To minimize the energy function, the network is trained using batch gradient descent, defined as48

$$w_j = w_j - \eta \nabla_{w_j} E = w_j - \eta \sum_{n=1}^{N} \left( y_j^{(n)} - t_j^{(n)} \right) x^{(n)} \quad (9)$$

In general, the weighted sums of the inputs can be anything ranging from −∞ to +∞. To bound the neuron output values and perform proper classifications, activation functions play a key role. Nonlinear activation functions such as the Sigmoid and the Softmax functions are among the most widely used for binary and multi-class classification, respectively. In this case, we chose the Softmax function for the multi-class classification task, which allows us to compute the probabilities for all classes. The Softmax function produces neuron output values in the range 0–1, where the neuron with the highest output is identified as the winner and used to classify the corresponding class.

In the memristor crossbar array, the charge collected at an output neuron j is

$$Q_j = \sum_i w_{ij} x_i = \sum_i V G_{ij} t_i = \sum_i V \left( G_{ij}^+ - G_{ij}^- \right) t_i = Q_j^+ - Q_j^- \quad (10)$$

where $x_i$ is the input at row i and is represented by a voltage pulse with amplitude V and width $t_i$. The charges are measured at the output columns and digitized by the ADCs, then converted to the neuron output $y_j$ through the Softmax function:

$$y_j\left(Q_j\right) = \frac{\exp\left(\beta Q_j\right)}{\sum_k \exp\left(\beta Q_k\right)} \quad (11)$$

where β is a scaling factor of the ADC output and k represents the output neuron index.

Compared with earlier SLP implementations9 that used the Manhattan rule and required on average 23 epochs to achieve perfect classification for a similar database, the batch gradient descent rule used here considers not only the direction of the weight update (which represents the Manhattan rule) but also the value of the weight update, so much faster training convergence can be obtained.
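A short sketch of the charge-domain Softmax neuron of equations (10) and (11) is given below; the value of the ADC scaling factor β is an assumption.

```python
import numpy as np

# Sketch of the charge-domain Softmax neuron, equations (10)-(11).
def softmax_neurons(q_plus, q_minus, beta=1.0):
    """q_plus, q_minus: per-class charges from the G+ and G- columns."""
    q = q_plus - q_minus          # eq. (10): differential charge Q_j
    z = beta * q                  # beta: ADC output scaling factor (assumed)
    e = np.exp(z - z.max())       # numerically stable exponentials
    return e / e.sum()            # eq. (11): class probabilities; argmax wins
```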

LCA. The LCA is a sparse coding algorithm that uses a dictionary of feature vectors to encode an input signal with a small number of output coefficients, while minimizing the reconstruction error.

The concept of sparse coding is as follows. Given an input signal x and a dictionary of features D, sparse coding aims to represent x as a linear combination of features from D using a set of sparse coefficients a, with a minimum number of features. Mathematically, the objective of sparse coding can be summarized as minimizing an energy function containing both a reconstruction error term and a sparsity penalty term, defined as

$$\min_a \left( \left| x - D a^T \right|_2^2 + \lambda \left| a \right|_0 \right) \quad (12)$$

where $|\cdot|_2$ and $|\cdot|_0$ are the L2- and L0-norm, respectively, and λ is a sparsity parameter that determines the relative weights of the reconstruction error (first term) and the sparsity penalty (the number of neurons used, second term).

The mathematical form of the LCA can be expressed as follows: x is an m-element (m × 1) input vector, D is an m × n matrix, where each column of D represents an m-element feature vector (that is, a dictionary element), and a is an n-element (1 × n) row vector representing the neuron activity coefficients, where the ith element of a corresponds to the activity of the ith neuron. After feeding input x to the network and allowing the network to stabilize through lateral inhibition, a reconstruction of x can be obtained as $Da^T$, that is, the linear combination of the neuron activities and the corresponding neurons' feature vectors.

In a sparse representation, only a few elements in a are non-zero, while the other neurons' activities are suppressed to be precisely zero.

The neuron dynamics during LCA analysis can be summarized by the following equation35:

$$\frac{du}{dt} = \frac{1}{\tau} \left( -u + x^T D - a \left( D^T D - I_n \right) \right) \quad (13a)$$

$$a_i = \begin{cases} u_i & \text{if } u_i > \lambda \\ 0 & \text{otherwise} \end{cases} \quad (13b)$$

where $u_i$ is the membrane potential of neuron i, τ is a time constant and $I_n$ is the n × n identity matrix.

Implementing the inhibition effect $D^T D$ can be very computation-intensive. To implement the algorithm in memristor hardware, the original equation (13a) can be rewritten as

$$\frac{du}{dt} = \frac{1}{\tau} \left( -u + \left( x - \hat{x} \right)^T D + a \right) \quad (14)$$

where $\hat{x} = D a^T$ is the reconstructed signal. Equation (14) shows that the inhibition term between neurons can be interpreted as a neuron removing its feature from the input when it becomes active, thus suppressing the activity of other neurons having similar features. The matrix–matrix operation $D^T D$ in equation (13a) is thus reduced to two sequential matrix–vector operations, one used to calculate $\hat{x} = D a^T$ and the other used to calculate the neuron activity from the updated input $r^T D$, where $r = x - \hat{x}$ is the residue term. This approach allows us to implement the LCA in memristor crossbars without physical inhibitory synaptic connections between the neurons.

The dynamics of the LCA network operation are correctly captured by the memristor chip, as shown in Fig. 3e. In this example, all neurons are charging up in the first four iterations. At the fifth iteration, neuron 13 first crosses the threshold, as it consists of two horizontal bars, which results in a larger output value in the membrane potential update. As a result, the lateral inhibition effect in the system suppresses the membrane potentials of other neurons (2 and 4) sharing part of the features of neuron 13, even though they also represent features of the input. Meanwhile neuron 6, which represents the vertical bar feature, continues to charge up. At iteration 11, the membrane potential of neuron 6 crosses the threshold, and all other neurons' membrane potentials are suppressed below the threshold, leading to the optimal solution. The neurons' membrane potentials continue to evolve, but those of neurons 6 and 13 remain above the threshold and those of other neurons continue to decrease due to the inhibition term and the leaky term in the membrane potential equation. The solution from the network was read out after 30 iterations.

Breast cancer dataset and classification. A standard breast cancer dataset from the University of Wisconsin Hospital49 was used as the input for the PCA network, which is available through the University of California, Irvine Machine Learning Repository50.

The dataset consists of breast cell mass properties measured in nine categories; each property is scored from 1 to 10. Each input to the memristor network is thus a 9D vector consisting of scores from the nine measurements. The bilayer network was trained using 100 training data (containing 50 benign and 50 malignant cases) from the dataset, and tested with 500 data (containing 312 benign and 188 malignant cases) not in the training set.

Data availability
The data that support the plots within this paper and other findings of this study are available from the corresponding author upon reasonable request.

Received: 18 October 2018; Accepted: 12 June 2019; Published online: 15 July 2019

References
1. Chua, L. Memristor-the missing circuit element. IEEE Trans. Circuit Theory 18, 507–519 (1971).
2. Strukov, D. B., Snider, G. S., Stewart, D. R. & Williams, R. S. The missing memristor found. Nature 453, 80–83 (2008).
3. Jo, S. H. et al. Nanoscale memristor device as synapse in neuromorphic systems. Nano Lett. 10, 1297–1301 (2010).
4. Yu, S., Wu, Y., Jeyasingh, R., Kuzum, D. & Wong, H. S. P. An electronic synapse device based on metal oxide resistive switching memory for neuromorphic computation. IEEE Trans. Electron Devices 58, 2729–2737 (2011).
5. Zidan, M. A., Strachan, J. P. & Lu, W. D. The future of electronics based on memristive systems. Nat. Electron. 1, 22–29 (2018).
6. Krestinskaya, O., James, A. P. & Chua, L. O. Neuro-memristive circuits for edge computing: a review. Preprint at https://arxiv.org/abs/1807.00962 (2018).
7. Xia, Q. & Yang, J. J. Memristive crossbar arrays for brain-inspired computing. Nat. Mater. 18, 309–323 (2019).
8. Ielmini, D. & Wong, H.-S. P. In-memory computing with resistive switching devices. Nat. Electron. 1, 333–343 (2018).
9. Prezioso, M. et al. Training and operation of an integrated neuromorphic network based on metal-oxide memristors. Nature 521, 61–64 (2015).
10. Yao, P. et al. Face classification using electronic synapses. Nat. Commun. 8, 15199 (2017).
11. Alibart, F., Zamanidoost, E. & Strukov, D. B. Pattern classification by memristive crossbar circuits using ex situ and in situ training. Nat. Commun. 4, 2072 (2013).
12. Bayat, F. M. et al. Implementation of multilayer perceptron network with highly uniform passive memristive crossbar circuits. Nat. Commun. 9, 2331 (2018).
13. Li, C. et al. Efficient and self-adaptive in-situ learning in multilayer memristor neural networks. Nat. Commun. 9, 2385 (2018).
14. Li, C. et al. Analogue signal and image processing with large memristor crossbars. Nat. Electron. 1, 52–59 (2017).
15. Gao, L., Chen, P.-Y. & Yu, S. Demonstration of convolution kernel operation on resistive cross-point array. IEEE Electron Device Lett. 37, 870–873 (2016).
16. Sheridan, P. M. et al. Sparse coding with memristor networks. Nat. Nanotechnol. 12, 784–789 (2017).
17. Du, C. et al. Reservoir computing using dynamic memristors for temporal information processing. Nat. Commun. 8, 2204 (2017).
18. Choi, S., Shin, J. H., Lee, J., Sheridan, P. & Lu, W. D. Experimental demonstration of feature extraction and dimensionality reduction using memristor networks. Nano Lett. 17, 3113–3118 (2017).
19. Le Gallo, M. et al. Mixed-precision in-memory computing. Nat. Electron. 1, 246–253 (2018).
20. Burr, G. W. et al. Experimental demonstration and tolerancing of a large-scale neural network (165,000 synapses) using phase-change memory as the synaptic weight element. IEEE Trans. Electron Devices 62, 3498–3507 (2015).
21. Boybat, I. et al. Neuromorphic computing with multi-memristive synapses. Nat. Commun. 9, 2514 (2018).
22. Ambrogio, S. et al. Equivalent-accuracy accelerated neural-network training using analogue memory. Nature 558, 60–67 (2018).
23. Gao, B. et al. Ultra-low-energy three-dimensional oxide-based electronic synapses for implementation of robust high-accuracy neuromorphic computation systems. ACS Nano 8, 6998–7004 (2014).
24. Zidan, M. A. et al. A general memristor-based partial differential equation solver. Nat. Electron. 1, 411–420 (2018).
25. Hu, M. et al. Memristor-based analog computation and neural network classification with a dot product engine. Adv. Mater. 30, 1705914 (2018).
26. Xia, Q. et al. Memristor–CMOS hybrid integrated circuits for reconfigurable logic. Nano Lett. 9, 3640–3645 (2009).
27. Kim, K.-H. et al. A functional hybrid memristor crossbar-array/CMOS system for data storage and neuromorphic applications. Nano Lett. 12, 389–395 (2012).
28. Yang, J. J., Strukov, D. B. & Stewart, D. R. Memristive devices for computing. Nat. Nanotechnol. 8, 13–24 (2013).
29. Chen, B. et al. Efficient in-memory computing architecture based on crossbar arrays. In Proceedings of 2015 IEEE International Electron Devices Meeting (IEDM) 17.5.1–17.5.4 (IEEE, 2015).
30. Pershin, Y. V. & Di Ventra, M. Neuromorphic, digital and quantum computation with memory circuit elements. Proc. IEEE 100, 2071–2080 (2012).
31. Jeong, D. S. & Hwang, C. S. Nonvolatile memory materials for neuromorphic intelligent machines. Adv. Mater. 30, 1704729 (2018).
32. Olshausen, B. A. & Field, D. J. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature 381, 607–609 (1996).
33. Olshausen, B. A. & Field, D. J. Sparse coding with an overcomplete basis set: a strategy employed by V1? Vis. Res. 37, 3311–3325 (1997).
34. Sheridan, P. M., Du, C. & Lu, W. D. Feature extraction using memristor networks. IEEE Trans. Neural Netw. Learn. Syst. 27, 2327–2336 (2016).
35. Rozell, C. J., Johnson, D. H., Baraniuk, R. G. & Olshausen, B. A. Sparse coding via thresholding and local competition in neural circuits. Neural Comput. 20, 2526–2563 (2008).
36. Lever, J., Krzywinski, M. & Altman, N. Points of significance: principal component analysis. Nat. Methods 14, 641–642 (2017).
37. Masters, D. & Luschi, C. Revisiting small batch training for deep neural networks. Preprint at https://arxiv.org/abs/1804.07612 (2018).
38. Keskar, N. S., Mudigere, D., Nocedal, J., Smelyanskiy, M. & Tang, P. T. P. On large-batch training for deep learning: generalization gap and sharp minima. Preprint at https://arxiv.org/abs/1609.04836 (2016).
39. Chen, P.-Y., Peng, X. & Yu, S. NeuroSim: a circuit-level macro model for benchmarking neuro-inspired architectures in online learning. IEEE Trans. Comput. Des. Integr. Circuits Syst. 37, 3067–3080 (2018).
40. Choi, S. et al. SiGe epitaxial memory for neuromorphic computing with reproducible high performance based on engineered dislocations. Nat. Mater. 17, 335–340 (2018).


41. Zidan, M. A. et al. Field-programmable crossbar array (FPCA) for reconfigurable computing. IEEE Trans. Multi-scale Comput. Syst. 4, 698–710 (2017).
42. Mikhailenko, D., Liyanagedera, C., James, A. P. & Roy, K. M2CA: modular memristive crossbar arrays. In Proceedings of 2018 IEEE International Symposium on Circuits and Systems (ISCAS) 1–5 (IEEE, 2018).
43. Xu, X. et al. Scaling for edge inference of deep neural networks. Nat. Electron. 1, 216–222 (2018).
44. Shafiee, A. et al. ISAAC: a convolutional neural network accelerator with in-situ analog arithmetic in crossbars. In Proceedings of 43rd International Symposium on Computer Architecture 14–26 (IEEE, 2016).
45. Gokmen, T. & Vlasov, Y. Acceleration of deep neural network training with resistive cross-point devices: design considerations. Front. Neurosci. 10, 333 (2016).
46. Hubara, I., Courbariaux, M., Soudry, D., El-Yaniv, R. & Bengio, Y. Quantized neural networks: training neural networks with low precision weights and activations. Preprint at https://arxiv.org/abs/1609.07061 (2016).
47. Jacob, B. et al. Quantization and training of neural networks for efficient integer-arithmetic-only inference. Preprint at https://arxiv.org/abs/1712.05877 (2017).
48. Bishop, C. M. Pattern Recognition and Machine Learning Vol. 4 (Springer, 2006).
49. Mangasarian, O. L., Street, W. N. & Wolberg, W. H. Breast cancer diagnosis and prognosis via linear programming. Oper. Res. 43, 570–577 (1995).
50. Dheeru, D. & Karra Taniskidou, E. UCI Machine Learning Repository (Univ. California–Irvine, 2017).

Acknowledgements
The authors acknowledge inspiring discussions with C. Liu, T. Chou, P. Brown, M.A. Zidan and P.M. Sheridan. This work was supported in part by the Defense Advanced Research Projects Agency (DARPA) through award HR0011-13-2-0015, the National Science Foundation (NSF) through awards CCF-1617315 and 1734871, and the Applications Driving Architectures (ADA) Research Centre, a JUMP Centre co-sponsored by SRC and DARPA.

Author contributions
F.C., J.M.C., S.H.L., Z.Z., M.P.F. and W.D.L. conceived the project and constructed the research frame. S.H.L. prepared the memristor arrays and performed device integration. J.M.C., Y.L., V.B., Z.Z. and M.P.F. designed the CMOS chip. F.C. and J.M.C. prepared the test hardware and software platform. F.C. and S.H.L. performed the network measurements and software simulations. W.D.L. directed the project. F.C., J.M.C., S.H.L., Z.Z., M.P.F. and W.D.L. analysed the experimental data and wrote the manuscript. All authors discussed the results and implications and commented on the manuscript at all stages.

Competing interests
The authors declare no competing interests.

Additional information
Supplementary information is available for this paper at https://doi.org/10.1038/s41928-019-0270-x.
Reprints and permissions information is available at www.nature.com/reprints.
Correspondence and requests for materials should be addressed to W.D.L.
Publisher's note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© The Author(s), under exclusive licence to Springer Nature Limited 2019
