
MEMRISTOR BASED LOW POWER HIGH THROUGHPUT CIRCUITS AND

SYSTEMS DESIGN

Dissertation

Submitted to

The School of Engineering of the

UNIVERSITY OF DAYTON

In Partial Fulfillment of the Requirements for

The Degree of

Doctor of Philosophy in Engineering

By

Md Raqibul Hasan, M.S.

UNIVERSITY OF DAYTON

Dayton, Ohio

May, 2016

MEMRISTOR BASED LOW POWER HIGH THROUGHPUT CIRCUITS AND

SYSTEMS DESIGN

Name: Hasan, Md Raqibul

APPROVED BY:

Tarek M. Taha, Ph.D.
Advisory Committee Chairman
Associate Professor, Electrical and Computer Engineering

Vijayan K. Asari, Ph.D.
Committee Member
Professor, Electrical and Computer Engineering

John S. Loomis, Ph.D.
Committee Member
Professor Emeritus, Electrical and Computer Engineering

Zhongmei Yao, Ph.D.
Committee Member
Associate Professor

John G. , Ph.D.
Associate Dean
School of Engineering

Eddy M. Rojas, Ph.D., M.A., P.E.
Dean
School of Engineering


© Copyright by

Md Raqibul Hasan

All rights reserved

2016


ABSTRACT

MEMRISTOR BASED LOW POWER HIGH THROUGHPUT CIRCUITS

AND SYSTEMS DESIGN

Name: Hasan, Md Raqibul
University of Dayton

Advisor: Dr. Tarek M. Taha

Power density and device reliability issues have been driving energy efficient, fault tolerant architecture designs in recent years. With the emergence of big data applications, low power, high throughput architectures are receiving more interest. Neural networks have diverse uses in areas including big data analysis, sensor processing, and classification applications. The memristor is a novel device with a large, adjustable resistance range. Physical memristors can be laid out in a high density grid known as a crossbar. A memristor crossbar can evaluate many multiply-add operations, the dominant operation in neural network applications, in parallel in the analog domain. The objective of this thesis is to examine memristor based extremely low power neuromorphic architectures for signal and big data processing applications.

This thesis examines in-situ training of memristor based multi-layer neural networks, where the entire crossbar is updated in four steps for each training instance.

Existing training approaches update a crossbar serially, column by column. Training of memristor based deep neural networks is examined using autoencoders for layer-wise pre-training of the networks. We propose a novel technique for ex-situ training of memristor based neural networks which takes sneak-path currents into consideration.

Multicore architectures based on memristor neural cores are developed, and their system level area and power are compared with those of traditional computing systems. Results show that the memristor neural network based architectures could be about five orders of magnitude more energy efficient than traditional computing systems.


ACKNOWLEDGEMENTS

I have enjoyed my study and research very much since joining the research group led by Dr. Tarek M. Taha. My greatest gratitude therefore goes to my advisor, Dr. Tarek M. Taha, for his constant support and valuable instruction. I would also like to express my gratitude to my other committee members, Prof. Vijayan K. Asari and Prof. John Loomis, for their support.

Last, but not least, I am grateful for the support of my wife and my parents, who have continuously encouraged me.


TABLE OF CONTENTS

ABSTRACT…………….………………………………………………………………..iv

ACKNOWLEDGEMENTS………………………………………………………………vi

LIST OF FIGURES…………………………………………………………………...... xiii

LIST OF TABLES……………………………………………………………………....xxi

CHAPTER I

INTRODUCTION ……………….……….……………………………………..………1

1.1 Power Wall ........................................................................................... 1
1.2 Memory Wall ........................................................................................ 2
1.3 Approximate Computing ...................................................................... 2
1.4 Emerging Memristor Devices ............................................................... 3
1.5 Emerging Applications ......................................................................... 4
1.6 Neural Networks ................................................................................... 5
1.7 Contributions ........................................................................................ 6

CHAPTER II

RELATED WORK …….…………………..……………………………………..……10

2.1 Neural Networks for General Purpose Applications .......................... 10
2.2 Approximate Computing .................................................................... 11
2.3 Digital Neuromorphic Systems ........................................................... 12
2.4 Memristor Neural Network Circuits ................................................... 16
2.5 Memristor Based Multicore Systems .................................................. 17


CHAPTER III

ON-CHIP TRAINING OF MULTI-LAYER NEURAL NETWORKS.….…………18

3.1 Introduction ......................................................................................... 18
3.2 Memristor Crossbar Based Neuron Circuit and Linear Separator Design ..... 19
    3.2.1 A Neuron in a Neural Network ................................................... 20
    3.2.2 Neuron Circuits ........................................................................... 20
    3.2.3 Synaptic Weight Precision .......................................................... 23
    3.2.4 Linearly Separable Classifier Designs ........................................ 24
3.3 System 1: Memristor Crossbar Based Multi-layer Neural Network for BP ..... 25
    3.3.1 Multi-layer Circuit Design .......................................................... 25
    3.3.2 Training ....................................................................................... 26
    3.3.3 Circuit Implementation of the Training Algorithm ..................... 27
    3.3.4 Writing to Memristor Crossbars .................................................. 32
3.4 System 2: Memristor Crossbar Based Multi-layer Neural Network for the Proposed Training Algorithm ......................................................................... 35
    3.4.1 Proposed Training Algorithm ...................................................... 35
    3.4.2 Training Algorithm Comparison ................................................. 37
    3.4.3 Circuit Implementation of the Proposed Training Algorithm ..... 39
    3.4.4 Error Back Propagation ............................................................... 40
    3.4.5 Memristor Weight Update Approach .......................................... 42
3.5 Experimental Setup ............................................................................. 48
3.6 Results ................................................................................................. 50
    3.6.1 SPICE Training Results for BP Algorithm ................................. 50
    3.6.2 SPICE Training Results for the Proposed Algorithm ................. 50
3.7 Summary ............................................................................................. 54

CHAPTER IV

ON-CHIP TRAINING OF DEEP NEURAL NETWORKS.………..…………….…55

4.1 Introduction…….……………………………....…………………………….55


4.2 Deep Neural Networks ........................................................................ 56
4.3 Memristor Crossbar Based Multi-layer Neural Network Design ........ 59
    4.3.1 Neuron Circuit ............................................................................. 59
    4.3.2 Synaptic Weight Precision .......................................................... 61
    4.3.3 Memristor Based Neural Network Implementation .................... 61
4.4 Back-propagation Training Circuit ..................................................... 62
    4.4.1 The Training Algorithm .............................................................. 62
    4.4.2 Circuit Implementation of the Back-propagation Training Algorithm ..... 63
4.5 Deep Network Training Circuit .......................................................... 66
    4.5.1 Overall System Architecture ....................................................... 66
    4.5.2 Unsupervised Layer-wise Pre-training ........................................ 67
    4.5.3 Supervised Full Network Training .............................................. 68
4.6 Large Memristor Crossbars ................................................................. 69
    4.6.1 Large Crossbar Simulations ........................................................ 69
    4.6.2 Minimizing Sneak-path Impact ................................................... 72
4.7 Experimental Setup ............................................................................. 74
4.8 Results ................................................................................................. 75
    4.8.1 Training ....................................................................................... 76
    4.8.2 Supervised Training .................................................................... 79
    4.8.3 Deep Network Training ............................................................... 80
    4.8.4 Discussion ................................................................................... 82
4.9 Summary ............................................................................................. 83

CHAPTER V

EX-SITU TRAINING OF LARGE MEMRISTOR CROSSBARS………..………..84

5.1 Introduction ......................................................................................... 84
5.2 Memristor Crossbar Based Neural Network ....................................... 86
5.3 Large Crossbar Ex-situ Training Challenges ...................................... 86
    5.3.1 Traditional Ex-situ Training Approaches .................................... 86
    5.3.2 Large Crossbar Simulations ........................................................ 88
    5.3.3 Impact of Sneak-paths in a Large Crossbar ................................. 89
    5.3.4 Impact of Sneak-paths on Ex-situ Training ................................. 90
5.4 Proposed Ex-situ Training Process ..................................................... 91
5.5 Experimental Evaluations ................................................................... 93
    5.5.1 Datasets ....................................................................................... 93
    5.5.2 Memristor Model ......................................................................... 94
    5.5.3 Results ......................................................................................... 94
    5.5.4 Area Savings ................................................................................ 96
    5.5.5 Impact of Device Variation ......................................................... 97
5.6 Summary ............................................................................................. 98

CHAPTER VI

NEURAL NETWORK BASED MULTICORE PROCESSORS……..………..……99

6.1 Introduction ......................................................................................... 99
6.2 Multicore Architecture ...................................................................... 101
    6.2.1 SRAM Digital Neural Core ....................................................... 102
    6.2.2 On-chip Routing ........................................................................ 103
    6.2.3 Integration with a ...................................................................... 104
    6.2.4 I/O .............................................................................................. 105
6.3 Memristor Cores ................................................................................ 105
    6.3.1 Memristor Based Neuron Circuit and Neural Network ............. 105
    6.3.2 Memristor Neural Core ............................................................. 107
6.4 Programming the Memristor Cores ................................................... 108
6.5 Evaluation of Proposed Architectures ............................................... 109
    6.5.1 On-chip Training Result ............................................................ 109
    6.5.2 Ex-situ Training Result ............................................................. 111
    6.5.3 Application Description ............................................................ 111
    6.5.4 Mapping Neural Networks to Cores ......................................... 113
    6.5.5 RISC Core Configuration .......................................................... 114
6.6 Results ............................................................................................... 115
    6.6.1 Bit Width and Activation .......................................................... 115
    6.6.2 Design Space Exploration of Neural Cores .............................. 116
    6.6.3 Results for Real Time Applications .......................................... 120
    6.6.4 Discussion ................................................................................. 123
6.7 Summary ........................................................................................... 131

CHAPTER VII

LOW POWER HIGH THROUGHPUT STREAMING ARCHITECTURE FOR

BIG DATA PROCESSING……………………..…………………………………….132

7.1 Introduction ....................................................................................... 132
7.2 System Overview .............................................................................. 133
7.3 Autoencoder ...................................................................................... 136
7.4 Heterogeneous Cores ........................................................................ 137
    7.4.1 Memristor Neural Core ............................................................. 137
    7.4.2 Digital Clustering Core ............................................................. 138
7.5 Experimental Setup ........................................................................... 140
    7.5.1 Applications and Datasets ......................................................... 140
    7.5.2 Mapping Neural Networks to Cores ......................................... 140
    7.5.3 Area Power Calculations ........................................................... 141
7.6 Results ............................................................................................... 142
    7.6.1 Supervised Training Result ....................................................... 142
    7.6.2 Unsupervised Training Result ................................................... 143
    7.6.3 Anomaly Detection ................................................................... 144
    7.6.4 Impact of System Constraints on Application Accuracy .......... 145
    7.6.5 Single Core Area and Power ..................................................... 146
    7.6.6 System Level Evaluations ......................................................... 147
7.7 Summary ........................................................................................... 149

CHAPTER VIII

MEMRISTOR CROSSBAR BASED STRING MATCHING CIRCUIT.………...151

8.1 String Matching Circuit .................................................................... 151
8.2 Scaling for Matching Long String .................................................... 153

8.3 Summary …………………………………………………………………...154

CHAPTER IX

CONCLUSION……….………...………………………..…………………………....155

BIBLIOGRAPHY……………………………………………………………………....157


LIST OF FIGURES

Figure 1.1. (a) Memristor crossbar schematic and (b) memristor crossbar layout…….….4

Figure 1.2. A multi-layer neural network………………………………………..……….5

Figure 3.1. Neuron block diagram………………………………………………………20

Figure 3.2. Memristor-based neuron circuit. A, B, C are the inputs and yj is the output...21

Figure 3.3. Plot of functions f(x) and h(x)……………………………………………….22

Figure 3.4. (a) The CMOS inverter and (b) the inverter transfer curve. V(op2) is the neuron output and V(op1) is the inverted neuron output………………………………...23

Figure 3.5. Memristor based neuron circuits showing crossbar wire resistance………..23

Figure 3.6. Training graph of 8 three input minterms utilizing neuron circuit shown in

Figure 3.2(a)……………………………………………………………………………...25

Figure 3.7. Training graph of 8 three input minterms utilizing neuron circuit shown in

Figure 3.2(b)……………………………………………………………………………..25

Figure 3.8. Two layer network for learning three input functions……………………….26

Figure 3.9. Schematic of the neural network shown in Figure 3.8 for forward pass utilizing neuron circuit in Figure 3.2(a)………………………………………………….26

Figure 3.10. Schematic of the neural network shown in Figure 3.8 for back propagating errors to layer 1…………………………………………………………………………..29


Figure 3.11. Implementing back propagation phase multiplexing error generation circuit…………………………………………………………………………………….29

Figure 3.12. Training pulse generation module. Inputs to the circuit are mentioned for the scenario shown in the first row of Table 3.3…………………………………..…………30

Figure 3.13. Plot showing V∆1, Va and training pulse duration VaT∆1…………..………..32

Figure 3.14. Demonstration of the weight update operation in the first column of a crossbar: (a) increasing phase and (b) decreasing phase. Upward arrow indicates increase of conductance and downward arrow indicates decrease of conductance…………….....34

Figure 3.15. (a) Training signals while x1=0.2 V and x4=0.4 V and (b) voltage across some memristors for the operation in Figure 3.14(a). Vmi,j is the voltage difference across the memristor mi,j……………………………...…………………………………34

Figure 3.16. Plot comparing g(x) and gth(x)………………………………………...……36

Figure 3.17. Mean squared errors vs. training epoch for both the proposed training algorithm and a traditional BP algorithm……………………………………………...…38

Figure 3.18. Testing error for BP based training and the proposed approach using (a)

Wine dataset (51 test patterns) (b) the Iris dataset (60 test patterns) and (c) the MNIST dataset (10000 test patterns)……………………………………………………………...39

Figure 3.19. Schematic of the neural network shown in Figure 3.8 utilizing the neuron circuit in Figure 3.2(b)………………………………………………………...…………40

Figure 3.20. Schematic of the neural network shown in Figure 3.19 for back propagating errors to layer 1………………………………………………………………...………...41


Figure 3.21. Memristor crossbar weight (conductance) update scenarios. Upward arrow indicates increase of conductance and downward arrow indicates decrease of conductance………………………………………………………………………………43

Figure 3.22. Training pulse generation module for the proposed training algorithm. Inputs to the circuit are mentioned for the scenario shown in the first row of Table 3.6……….44

Figure 3.23. Updating multiple memristors in single step. Upward arrow indicates increase of conductance and downward arrow indicates decrease of conductance……...45

Figure 3.24. Plot showing V∆2, Vcj and training pulse duration T∆2(1-aVcj/2)………...….46

Figure 3.25. Simulation results displaying the input voltage and current waveforms for the memristor model [53] that was based on the device in [9]. The following parameter values were used in the model to obtain this result: Vp=1.3V, Vn=1.3V, Ap=5800,

An=5800, xp=0.9995, xn=0.9995, αp=3, αn=3, a1=0.002, a2=0.002, b=0.05, x0=0.001...…49

Figure 3.26. SPICE training results for BP algorithm for both cases: without considering memristor device variation & stochasticity (no device var.) and considering device variation & stochasticity (device var.)………………………………...…………………51

Figure 3.27. SPICE training results for the proposed algorithm for both cases: without considering memristor device variation & stochasticity (no device var.) and considering device variation &stochasticity (device var.)………………………………...…………..52

Figure 4.1. Block diagram of a deep network……………………………………...…….56

Figure 4.2. Two layer network having four inputs, three hidden neurons and four output neurons…………………………………………………………………………………...58

Figure 4.3. Training process of a deep network………………………………...………..59

Figure 4.4. Memristor-based neuron circuit. A, B, C are the inputs and yj is the output...60


Figure 4.5. Schematic of the neural network shown in Figure 4.2…………………..…..62

Figure 4.6. Schematic of the neural network shown in Figure 4.2 for forward pass…….64

Figure 4.7. Schematic of the neural network shown in Figure 4.2 for back propagating errors to layer 1…………………………………………………………………...……...65

Figure 4.8. Implementing back propagation phase multiplexing error generation circuit.65

Figure 4.9. Schematic of a memristor crossbar deep network system………………..….66

Figure 4.10. Training process in a memristor based deep network system………..…….67

Figure 4.11. Schematic of a M×N crossbar implementing a layer of N/2 neurons……...71

Figure 4.12. Potential across the memristors in crossbars of different sizes obtained through SPICE simulations and the proposed MATLAB framework based simulations.72

Figure 4.13. Proposed design for implementing a layer of memristor based neurons..….73

Figure 4.14. Voltage drop across the memristors in a 784×400 crossbar when the design is based on Figure 4.11: (a) for all the memristors in the crossbar (b) for the 1st and the

784-th rows of the crossbar………………………………………………………………74

Figure 4.15. Voltage drop across the memristors in a 784×400 crossbar when the design is based on Figure 4.13: (a) for all the memristors in the crossbar (b) for the 1st, 392-th,

393-th, and 784-th rows of the crossbar………………………………………………….74

Figure 4.16. Feature extraction results for Wisconsin dataset………………………..….77

Figure 4.17. Feature extraction results for Iris dataset…………………………...………78

Figure 4.18. Feature extraction results for Wine dataset…………………………...……78

Figure 4.19. Software training results and the SPICE training results of the memristor based neural networks for both cases: without considering variation in the memristor devices (no device var.) and considering variation in the devices (device var.)…..…….79


Figure 4.20. Training results of the memristor based deep neural networks for both cases: without considering variation in the memristor devices (no device var.) and considering variation in the devices (device var.)………………………………...…………………..81

Figure 5.1. Demonstration of the write operation in a crossbar: (a) increasing memristor conductance (upward arrow) and (b) decreasing memristor conductance (downward arrow)…………………………………………………………………………………….87

Figure 5.2. Accessing a memristor in a 1T1M crossbar……………………………..…..87

Figure 5.3. Reading a particular memristor from a 0T1M crossbar. In an ideal case, the figure in right is functionally equivalent to the figure in left…………………...………..88

Figure 5.4. Schematic of a M×N crossbar implementing a layer of N/2 neurons…….…89

Figure 5.5. Voltage drop across the memristors in a 617×400 crossbar: (a) for all the memristors in the crossbar (b) for the 1st and the 617-th rows of the crossbar………..…90

Figure 5.6. Circuit used to program a single memristor to a target resistance. Here ∆ determines the programming precision…………………………………………………..93

Figure 5.7. Demonstration of memristor programming through ex-situ process……..…96

Figure 6.1. Proposed multicore system with several neural cores (NC) connected through a 2-D mesh routing network. (R: router)………………………………………..……...101

Figure 6.2. Sensor chip on top of the neural processing system is shown in this figure.

Input data are transferred to the neural chip utilizing through silicon via (TSV)………102

Figure 6.3. Proposed digital neural core architecture………………………..…………103

Figure 6.4. SRAM based static routing switch. Each blue circle in the left part of the figure represents the 8x8 SRAM based switch shown in the middle (assuming an 8-bit network bus)……………………………………………..……………………………...104


Figure 6.5. Circuit diagram for a single memristor-based neuron…………………..….106

Figure 6.6. Two layer network……………………...……………...... 106

Figure 6.7. Memristor crossbar based implementation of the neural network………....106

Figure 6.8. Memristor crossbar based neural core architecture……………………..….108

Figure 6.9. Neural core having DACs for processing the first neuron layer of a network

…………………………………………………………………………………………..108

Figure 6.10. Simulation results displaying the input voltage and current waveforms for the memristor model [53] that was based on the device in [68]. The following parameter values were used in the model to obtain this result: Vp=4V, Vn=4V, Ap=816000, An=816000, xp=0.9897, xn=0.9897, αp=0.2, αn=0.2, a1=1.6×10⁻⁴, a2=1.6×10⁻⁴, b=0.05, x0=0.001 ………………………………………………………………………………..110

Figure 6.11. Learning curve for the classification based on Iris dataset……………..…111

Figure 6.12. Multiple layers of neurons on a core……………………………..……….114

Figure 6.13. Splitting a neuron into multiple smaller neurons…………………..……..114

Figure 6.14. Error for different precisions (with the same number of neurons). Here Sig stands for sigmoid, Flt for floating point, and Th for threshold…………………..…..116

Figure 6.15. Computation-communication timing for (a) memristor cores, (b) digital cores…………………………………………………………………………………….118

Figure 6.16. 0T1M system area and power…………………………...………………...119

Figure 6.17. 1T1M system area and power……………………………………...……...119

Figure 6.18. Digital system area and power……………………………………..……..119

Figure 6.19. 128×64 memristor core powers for the resistance ranges shown in Table

6.15……………………………………………………………………………………...130


Figure 6.20. Power efficiencies of the memristor systems over digital systems utilizing different memristor devices…………………………………………...………………..130

Figure 7.1. Heterogeneous architecture including memristor based multicore system as one of the processing components. Proposed multicore system with several neural cores

(NC), one clustering core connected through a 2-D mesh routing network. (R: router).

…………………………………………………………………………………………..134

Figure 7.2. SRAM based static routing switch. Each blue circle in the left part of the figure represents the 8x8 SRAM based switch shown in the middle (assuming an 8-bit network bus)…………………………………………………..………………………...136

Figure 7.3. Single memristor neural core architecture…………………………...……..138

Figure 7.4. Digital clustering core design implementing k-means clustering algorithm.139

Figure 7.5. Splitting a neuron into multiple smaller neurons……………..……………141

Figure 7.6. Learning curve for the classification based on Iris dataset……..………….143

Figure 7.7. Distribution of the data of different classes in the feature space……..…….144

Figure 7.8. Distance between original data and reconstructed data for normal packets..145

Figure 7.9. Distance between original data and reconstructed data for attack packets...145

Figure 7.10. Anomaly detection rate for the test dataset for different decision parameters………………………………………………………………………………145

Figure 7.11. Impact of memristor system constraints (3 bits neuron output and 8 bits neuron error precision) on application accuracy……………………...………………...146

Figure 7.12. Application speedup over GPU for training…………………………...….148

Figure 7.13. Energy efficiency of the proposed system over GPU for training………..148

Figure 7.14. Application speedup over GPU for recognition…………………..………148


Figure 7.15. Energy efficiency of the proposed system over GPU for evaluation of a new input…………………………………………………………………………………….148

Figure 8.1. Neuron implementing minterm AB̅C̅D……………………………..………152

Figure 8.2. Length of the string vs. voltage across load resistor when string s1 is applied to the crossbar…………………………………..………………………………….…...153

Figure 8.3. Length of the string vs. voltage across load resistor when string s2 is applied to the crossbar…………………………………………………………………………..154


LIST OF TABLES

Table 3.1: Synaptic weight precision………………………………………………...…..24

Table 3.2: 8 three-input minterms used for training……………………………..………25

Table 3.3: Input to the training module for different scenarios………………..…….…..31

Table 3.4: Neural network configurations……………………………………..…….…..38

Table 3.5: Weight update scenarios………………………………………..…….………43

Table 3.6: Inputs to the training module (circuit in Figure 3.22(b)) for different scenarios of the proposed training algorithm…………………………………………..…….……..44

Table 3.7: Simulation parameters…………………………………………..…….……...49

Table 3.8: Recognition error on test data for different datasets and SPICE training approaches………………………………………………………………………………..52

Table 3.9: Comparison between the two proposed training approaches………….…...... 53

Table 4.1: Neural network configurations……………………..…………………….…..76

Table 4.2: Recognition error on test data for different datasets…………………...……..80

Table 4.3: Recognition error on test data for MNIST and KDD datasets…………..……82

Table 5.1: Recognition error on test data………………………………..…………….…96

Table 5.2: Area savings in 0T1M crossbar systems over 1T1M crossbar systems for implementing neural networks of different configurations……………………..……….97

Table 6.1: Area and power of different cores………………………………..…………120


Table 6.2: Deep Network………………………………………………….…..………..121

Table 6.3: Edge Detection………………………………………………………………121

Table 6.4: Motion Estimation………………………………………….…………….....122

Table 6.5: Object Recognition………………………………………….…..…………..122

Table 6.6: Optical Character Recognition………………………………………...……122

Table 6.7: Input output data rates…………………………………………….…..…….123

Table 6.8: RISC core area power breakdown…………………………………..………124

Table 6.9: Processing power breakdown in µW for deep network application………...127

Table 6.10: Processing power breakdown in µW for edge detection application…..….127

Table 6.11: Processing power breakdown for motion detection………………………..128

Table 6.12: Processing power breakdown for object recognition……………………....128

Table 6.13: Processing power breakdown in µW for OCR application………………..128

Table 6.14: Idle fraction………………………………………………………………...129

Table 6.15: Resistance range of trained memristors…………………...……………….130

Table 7.1: Neural network configurations…………………………………..……….…140

Table 7.2: Memristor core timing and power for different execution steps………..…..147

Table 7.3: Number of cores used, time, and energy for a single training input in the proposed architecture…………………..……………….…………………………..149

Table 7.4: Recognition time and energy for one input in the proposed architecture..….149


CHAPTER I

INTRODUCTION

General purpose computing systems are used for a large variety of applications.

The extensive support for flexibility in these systems limits their energy efficiency. A large number of applications (e.g., big data and real time embedded processing) require low power, high throughput computing systems. Several factors are driving low power, high throughput computer architecture research; the major issues among them are discussed below:

1.1 Power Wall

Power consumption has become a major limiting factor in high performance processor designs. Esmaeilzadeh et al. [1] demonstrated that with continued dimension scaling, power limitations will curtail the usable fraction of a chip (leaving over 50% of it unusable at the 8 nm process node). They also argued that neither CPU-like nor GPU-like multicore designs are sufficient to achieve the expected performance speedup levels. Radical micro-architectural innovations are necessary to deliver speedups consistent with Moore’s Law.


1.2 Memory Wall

Off-chip memory is utilized to store large amounts of data which cannot be accommodated by available on-chip memory. Technology trends show that off-chip memory bandwidth is not increasing at the same rate as the number of devices per chip.

The term "memory wall" refers to the disparity in speed between the processor and off-chip memory access. In the last few decades, processor speed improved at a rate of 55% per year while memory speed improved at only about 10% per year [2]. As a result of these trends, memory latency is expected to become an overwhelming bottleneck in computer performance.
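To see how these growth rates compound into a "wall", the short calculation below (a minimal sketch using the 55% and 10% annual rates quoted above, with a 10-year horizon chosen purely for illustration) estimates how much the processor-memory speed gap widens:

    # Compounding of the processor-memory speed gap (annual rates quoted from [2]).
    proc_growth = 1.55   # processor speed: +55% per year
    mem_growth = 1.10    # memory speed: +10% per year
    years = 10           # illustrative horizon

    gap = (proc_growth ** years) / (mem_growth ** years)
    print(f"Relative processor-memory speed gap after {years} years: {gap:.0f}x")  # ~31x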

1.3 Approximate Computing

Several research groups are investigating the design of energy efficient computing systems from different aspects including approximate computing techniques [3,4].

Approximate computing is the use of lower power, inexact hardware to evaluate an application, where the output can have slight variations, but is of acceptable quality to the end user. A large set of existing and emerging application domains are able to tolerate inexactness or approximations in their underlying computation kernels and still generate acceptable outputs. The factors contributing to this intrinsic resilience are: (i) the application inputs often come from real world noisy sources, (ii) the algorithms utilized within the applications tend to minimize errors from approximations, and (iii) the end users of the applications find small variations in the outputs to be acceptable. Sensor processing, image processing, and classification applications are good candidates for approximate computing [5].


The techniques typically utilized for approximate computing are: (i) algorithmic modifications to utilize less complex computations at the expense of output precision, (ii) new assembly instructions and functional units that allow the hardware to dynamically choose higher or lower precision functional units, and (iii) circuit level modifications that reduce power, such as using lower voltage, variation prone circuits for lower order bits within a bus or functional unit [6]. These systems typically result in about 1.2 to 4 times energy reduction [6,7] in exchange for a slight loss in output quality.

1.4 Emerging Memristor Devices

The memristor device was first theorized in 1971 by Dr. Leon Chua [8]. Several research groups have demonstrated memristive behavior using several different materials.

One such device, composed of layers of HfOx and AlOx [9], has a high on-state resistance (RON≈10kΩ) and a very high resistance ratio (ROFF/RON≈1000). In general, a certain energy (or threshold voltage, Vth) is required to enable a state change in a memristive device [9,10]. When the electrical excitation through a memristor exceeds the threshold, i.e., V(t)>Vth, the resistance of the device changes. Otherwise, a memristor behaves like a resistor. The device characterized in [9] has a threshold voltage of about 1.3V. Physical memristors can be laid out in a high density grid known as a crossbar. The schematic and layout of a memristor crossbar can be seen in Figure 1.1. A memristor in the crossbar structure occupies an area of 4F² (where F is the device feature size). This is 36 times smaller than an SRAM cell. A memristor crossbar can perform many multiply-add operations in parallel, and the conductances of multiple memristors in a crossbar can be updated in parallel [11]. Multiply-add operations are the dominant operations in neural networks, and training of neural networks requires iterative updates of the synaptic weights.


As a consequence, memristors have great potential as synaptic elements in neural network based system designs.
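As a simple illustration of the threshold behavior described above, the Python sketch below models a device whose internal state (and hence conductance) changes only while the applied voltage magnitude exceeds Vth and otherwise behaves as a fixed resistor. The linear state-update rule and every parameter except the 1.3 V threshold and the RON value quoted above are simplifying assumptions, not the characterized behavior of the device in [9].

    # Minimal sketch of threshold-gated memristor switching (illustrative model only).
    G_MIN = 1.0 / 10e6    # conductance at R_OFF = 10 MOhm (assumed)
    G_MAX = 1.0 / 10e3    # conductance at R_ON  = 10 kOhm (quoted above)
    V_TH  = 1.3           # threshold voltage in volts (quoted above)
    RATE  = 1e4           # state change per volt-second above threshold (assumed)

    def step(x, v, dt):
        """Advance the normalized device state x in [0, 1] by one time step under voltage v."""
        if abs(v) > V_TH:                               # switching only above threshold
            x += RATE * (abs(v) - V_TH) * dt * (1.0 if v > 0 else -1.0)
            x = min(1.0, max(0.0, x))
        return x                                        # below threshold: acts as a resistor

    def conductance(x):
        return G_MIN + x * (G_MAX - G_MIN)

    x = 0.2
    x = step(x, 1.5, 1e-6)     # 1.5 V programming pulse: above V_TH, the state moves
    x = step(x, 0.5, 1e-6)     # 0.5 V read pulse: below V_TH, the state is unchanged
    print(conductance(x))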

Figure 1.1. (a) Memristor crossbar schematic and (b) memristor crossbar layout. Each cross-point device occupies an area of 4F² (F: feature size).

1.5 Emerging Applications

Intel has described Recognition, Mining and Synthesis (RMS) applications as the key application drivers of the future [12]. Chen et al. show that RMS applications can be approximated using neural networks [13]. Neural networks also have uses in areas including image processing and the recognition of remote sensing data [14-16].

Big data applications are one of the most significant emerging applications. Big data analytics is the process of examining huge volumes of continuously growing data to uncover hidden patterns, unknown correlations, and other useful information that can be used to make better decisions. High-performance processing systems are needed to process these huge data volumes. Traditionally, GPUs, clusters of computers, and supercomputers have been used to process big data, and they consume a significant amount of power.


1.6 Neural Networks

Neural networks are commonly utilized for pattern recognition and classification tasks. Several neural models have been proposed [17-19], including multi-layer perceptrons, recurrent networks, cellular neural networks, and spiking neural networks.

Of these, the most commonly used is the multi-layer perceptron, an example of which is shown in Figure 1.2. Each neuron in a feed forward neural network performs two types of operations: a dot product of its inputs and weights (see Eq. (1.1)), and evaluating a non-linear function (see Eq. (1.2)). If the inputs to a neuron are xi and the corresponding synaptic weights are wj,i, the dot product of the inputs and weights is given by:

$dp_j = \sum_i x_i w_{j,i}$    (1.1)

while the neuron output is given by:

$y_j = f(dp_j)$    (1.2)
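As a concrete illustration of Eqs. (1.1) and (1.2), the minimal Python sketch below evaluates a single neuron: a dot product of the inputs with the weights followed by an activation function. The particular weights, inputs, and choice of a logistic sigmoid are illustrative assumptions only, not values used elsewhere in this thesis.

    import math

    def neuron(x, w, f=lambda z: 1.0 / (1.0 + math.exp(-z))):
        """Evaluate one neuron: Eq. (1.1) dot product, then Eq. (1.2) activation."""
        dp = sum(xi * wi for xi, wi in zip(x, w))   # dp_j = sum_i x_i * w_j,i
        return f(dp)                                # y_j = f(dp_j)

    # Illustrative example: a three-input neuron (values are arbitrary).
    x = [0.2, -0.4, 0.7]
    w = [0.5, 0.3, -0.8]
    print(neuron(x, w))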

Figure 1.2. A multi-layer neural network.

In Eq. (1.2), f is the activation function of the neurons. In a multi-layer feed forward neural network, a non-linear differentiable activation function (e.g. tan⁻¹(x)) is desired. Neural networks typically need to be trained through a learning algorithm.

Generally, learning algorithms are of two types: supervised and unsupervised. Supervised training utilizes a training dataset which has a label for each input pattern in the dataset.


There are some scenarios where training data are not labeled. In this type of problem, unsupervised training is used and the goal is to unravel the underlying similarities among the data.

1.7 Contributions

There is strong demand for low power, high throughput computing architectures.

This thesis examines memristor crossbar based low power, high throughput computing circuits and systems design. It examines ex-situ (off-chip) and in-situ (on-chip) training of the memristor crossbar based neural networks. We have also examined memristor crossbar based multicore architectures for neural network applications.

The key contributions of this thesis are mentioned below:

i) Online memristor training: Chapter III presents on-chip training circuits for

multi-layer neural networks implemented using a single crossbar per layer and

two memristors per synapse. Using two memristors per synapse provides double

the synaptic weight precision when compared to a design that uses only one

memristor per synapse. Two on-chip training systems are examined, where the

first utilizes the traditional back propagation (BP) algorithm, and the second

utilizes a novel variant of the BP to reduce both circuit area and training time. In

our first approach, weight updates within the crossbar are done serially column by

column, while in the second approach four steps are enough to update all

memristors in a crossbar in parallel. This work would enable the design of high

throughput, energy efficient, and compact neuromorphic processing systems.

ii) Deep network training: Chapter IV designs on-chip training circuits for

memristor based deep neural networks utilizing unsupervised and supervised

learning methods. On chip training circuits would allow the training algorithm to

account for device variability and faults in these circuits. We have utilized

autoencoders for layer-wise pre-training of the deep networks and utilized the

back-propagation algorithm for supervised fine tuning. Our design utilizes two

memristors per synapse for higher precision of weights. Techniques to reduce the

impact of sneak-paths in a large memristor crossbar and for high speed

simulations of large crossbars were proposed. This work would enable the design

of high throughput, energy efficient, and compact systems.

iii) Ex-situ training: Neural networks need to be trained prior to use. Ex-situ training

is one of the approaches for training a neural network where the weights trained

by a software implementation are programmed into the system. Existing ex-situ

training approaches for memristor crossbars do not consider sneak-path currents

and may work only for small crossbar based neural networks. Ex-situ training in

large crossbars, without considering sneak-paths, reduces the application

recognition accuracy significantly. Chapter V proposes ex-situ training

approaches for both 0T1M (crossbar having no isolation transistors at the cross-points), and

1T1M (crossbar having an isolation transistor at each cross-point) crossbars

considering crossbar sneak-paths. The proposed ex-situ training approach is able

to tolerate the stochasticity in memristor devices. The results show that the 0T1M

crossbar based systems trained were about 17% to 83% smaller in area than the

1T1M crossbar based systems.

iv) Multicore Design: Chapter VI examines the design of several novel specialized

multicore neural processors that would be used primarily to process data directly

coming from sensors. Systems based on memristor crossbars are examined

through detailed circuit level simulations. These show that two types of memristor

circuits could be used, where processors either have or do not have on-chip

training hardware. We have designed these circuits and have examined their

system level impact on multicore neural processors. Additionally, we have

examined the design of SRAM based neural processors. Full system evaluation of

the multicore processors based on these specialized cores is performed, which

takes I/O and routing circuits into consideration, and area and power benefits are

compared with traditional multicore RISC processors. The presented memristor

based architectures can provide an energy efficiency between three and five

orders of magnitude greater than that of RISC processors for the benchmarks

examined. We have also examined the factors contributing to the energy

efficiencies of the specialized architectures.

v) Architecture for high throughput applications: In chapter VII we propose a

multicore heterogeneous architecture for big data processing. This system has the

capability to process key algorithms such as deep neural

network, autoencoder, and k-means clustering. Memristor crossbars are utilized to

provide low power, high throughput execution of neural networks. The system

has both training and recognition (evaluation of new input) capabilities. The

proposed system could be used for classification, unsupervised clustering,

dimensionality reduction, feature extraction, and anomaly detection applications.

The system level area and power benefits of the specialized architecture are

compared with the NVIDIA Tesla K20 GPGPU. Our experimental evaluations

show that the proposed architecture can provide four to six orders of magnitude

more energy efficiency over GPGPUs for big data processing.

vi) Memristor based string matching circuit: A memristor based string matching

circuit is examined in chapter VIII. It has potential application in network

intrusion detection systems, text mining and bioinformatics.

The applications of the proposed works would be very broad. They will enable extremely energy efficient processing architectures that can be integrated into a whole host of platforms including mobile SoCs (such as cell phone processors), low power sensors, robotics, bio-medical devices, and high performance computing systems. In particular, the on-line learning capabilities will be highly applicable to a range of areas such as big data, cybersecurity, and any type of adaptive processing system (e.g. an adaptive control system).

Recently, the White House has started promoting the development of a new type of computer that can proactively interpret and learn from data [80]. The system will be able to solve unfamiliar problems using what it has learned. One of the important design constraints is that the system should operate with the energy efficiency of the human brain. The proposed architecture should be able to meet the requirements of this White House target for a computing system quite satisfactorily.


CHAPTER II

RELATED WORK

Related work in this thesis area can be categorized as follows: i) mapping general purpose applications to neural network form to enable acceleration on specialized architectures, ii) approximate computing, iii) specialized digital architectures for neural network applications, iv) specialized mixed signal circuits for neural network applications, and v) memristor based multicore systems.

2.1 Neural Networks for General Purpose Applications

Several recent studies have examined the implementation of traditional computing applications and kernels using neural networks. Esmaeilzadeh et al. [1] presented a compilation approach where a user would annotate a code segment for conversion to a neural network. The application would be run, and the inputs to and outputs of the code segment would be recorded. These inputs and outputs would then be used to train a neural network to learn the behavior of the annotated code segment. Finally, a transformed code would be generated where the annotated code segment would be replaced with a call to a neural processing unit to run the equivalent neural network. This study examined the transformation of several kernels including FFT, JPEG, K-means, and the Sobel edge detector into equivalent neural networks. Only the feed-forward multi-layer perceptron class of neural networks was examined.


Chen et al. [13] developed neural network implementations of 5 of the 12 applications in the PARSEC benchmark suite. This suite falls under the Recognition,

Mining, and Synthesis (RMS) class of applications that are considered to be among the most important emerging categories of high performance applications [12]. They utilized several classes of neural networks including multi-layer perceptrons, convolutional neural networks, and cellular neural networks.

A recent publication from IBM [20], demonstrated the implementation of several complex applications onto a large collection of neural cores. They developed the neural network transformation of a library of functions. Applications were broken down into a set of subtasks and each subtask was further subdivided until they could be represented with a large collection of the library of functions they had examined. Thus an application could potentially be represented by tens of thousands of simple neural networks that were connected together. As an example, a collision avoidance system application they implemented utilized over 21 thousand hardware neural cores (where each core implemented a single neural network with up to 256 neurons, each having up to 256 inputs).

2.2 Approximate Computing

Workloads from several prevalent and emerging application domains possess the ability to produce outputs of acceptable quality in the presence of inexactness or approximations in a large fraction of their underlying computations. In approximate computing lower power inexact hardware is used to evaluate such applications, where the output can have slight variations, but is of acceptable quality to the end user. The techniques typically utilized for approximate computing are: i) algorithmic modifications

to utilize less complex computations at the expense of output precision [21], ii) new assembly instructions and functional units that allow the hardware to dynamically choose higher or lower precision functional units [5], and iii) circuit level modifications that reduce power, such as using lower voltage, variation prone circuits for lower order bits within a bus or functional unit [7]. These systems typically result in about 1.2 to 4 times energy reduction.

Venkataramani et al. proposed quality programmable processors for approximate computation [5]. The ISA of a quality programmable processor contains instructions associated with quality fields to specify the accuracy level that must be met during their execution. They designed a system with a 3-tiered hierarchy of processing elements –

Approximate Processing Elements (APE), Mixed Accuracy Processing Elements

(MAPE), and Completely Accurate Processing Element (CAPE) that provide distinctly different energy vs. quality trade-offs. They proposed hardware mechanisms based on precision scaling with error monitoring and compensation to facilitate quality-configurable execution on these processing elements and demonstrate significant energy benefits. Their results demonstrate that leveraging quality-programmability leads to

1.05X-1.7X savings in energy for virtually no loss (< 0.5%) in application output quality, and 1.18X-2.1X energy savings for modest impact (<2.5%) on output quality.

2.3 Digital Neuromorphic Systems

High performance computing platforms can simulate a large number of neurons, but they are very expensive from a power consumption standpoint. Several neural network emulation projects are underway around the world, including GPU, FPGA, CMP, and ASIC based systems [22-27]. Han et al. [22] and Nageswaran et al. [23] demonstrate GPU acceleration of spiking neural models. Acceleration on high performance clusters [24] and FPGAs [25] has also been studied. Since GPU and FPGA systems are relatively high power and area consuming, they are not favorable for embedded applications.

The SpiNNaker [26] architecture, highlighted on the cover of the August 2012 issue of IEEE Spectrum [28], is a specialized multi-core RISC architecture geared towards neural simulations. Developed by a consortium of British Universities, one SpiNNaker chip contains 18 ARM9 cores with local SRAM and a specialized router with a routing table. Each chip is packaged with a stacked 128MB SDRAM chip to hold connectivity information for up to 16 million synaptic connections. The designers of this system have developed a board with 48 SpiNNaker chips and plan to integrate enough such boards to reach about a million processors. The SpiNNaker project did not customize the ARM9 core for their application domain.

IBM and Cornell [29,30] recently developed a fully digital synaptic core in 45 nm

SOI technology. The system utilizes asynchronous logic to reduce power consumption. A core models 256 integrate and fire neurons, each with 1024 pre-synaptic inputs. Each synapse is represented through a single bit in a 1024×256 SRAM crossbar memory.

Synaptic weights are added based on input directly from each bitline in the SRAM array.

Rows within the SRAM array are activated depending on which input pre-synaptic connections fired. They proposed that multiple cores would communicate asynchronously using an address-event representation (AER) for spikes. In their current study, training was carried out offline.


The European FACETS project consists of a large number of ASICs containing analog neuron and synapse circuits [31]. Synapses are implemented with groups of

DenMem (Dendrite Membrane) circuits. A hybrid analog/digital solution is used for the synapses, and a hybrid of address encoding and separate signal lines is used for communication. Each DenMem can receive as many as 224 pre-synaptic inputs based on a 6-bit address sent via a layer 1 channel. The synaptic weight is represented in a 4-bit

SRAM with a 4-bit DAC. The post-synaptic signal is encoded as a current pulse proportional to the synapse weight, and can be excitatory or inhibitory. Neuron circuits integrate the DenMem signals. A digital control circuit implements STDP based on temporal correlation between pre- and post-synaptic signals, updating the synaptic weight. For neuron processing, each synaptic weight needs to be converted from digital to analog form. The DAC circuits needed for this are area and power consuming.

IBM's TrueNorth chip consists of 5.4 billion transistors [32]. It has 4,096 neurosynaptic cores interconnected via an intra-chip network that integrates one million programmable spiking neurons and 256 million configurable synapses. The basic building block is a core, a self-contained neural network with 256 input lines (axons), and

256 outputs (neurons) connected via 256×256 directed, programmable synaptic connections. With a 400×240-pixel video input at 30 frames-per-second, the chip consumes 63mW. Key differences between this system and our system are that

TrueNorth uses spiking neurons, has asynchronous communications, and allows the

SRAM arrays to enter ultra low leakage modes. We examine the impact of using leakage-less SRAM arrays in our study.


DaDianNao and PuDianNao [33,34] are accelerators for deep neural networks

(DNN) and convolutional neural networks (CNN). These systems are based on a fully digital design. In PuDianNao [34] neuron synaptic weights are stored in off-chip memory which requires data movement during training back and forth between the off-chip memory and the processing chip. In DaDianNao [33], neuron synaptic weights are stored in eDRAM and are later brought into a neural functional unit for execution. Input data are initially stored in a central eDRAM. This system utilized dynamic wormhole routing for data transfers. It was able to achieve 150× energy reduction over GPUs. ShiDianNao is an accelerator only for CNNs. As CNNs utilize shared weights, a small on-chip SRAM cache was able to store the whole network’s weights [35].

St. Amant et al. [36] presented a compilation approach where a user would annotate a code segment for conversion to a neural network. This study examined transformation of several kernels including FFT, JPEG, K-means, and the Sobel edge detector into equivalent neural networks. In this work they utilized analog neural cores to accelerate a RISC processor for general purpose codes. Our approach varies from this in that we are examining standalone neural cores for embedded applications, and using memristors for the processing.

Belhadj et al. [37] proposed multicore architectures for spiking neurons and evaluated them for embedded signal processing applications. They stored synaptic weights in digital form and used digital to analog converters to process the neurons.

There were significant differences in the neuron circuit designs and neural algorithm compared to our system.


2.4 Memristor Neural Network Circuits

Several research efforts have examined memristor based synapse, neuron, and neural network circuit designs. Zamarreño-Ramos et al. [38] examined how a memristor grid can implement a highly dense synaptic array used for visual image processing.

They examined STDP training to implement spiking neural networks. Chabi et al. [39] examined the implementation of linearly separable Boolean functions in Neural Logic

Blocks (NLBs). They focused on robustness studies of the NLBs using probabilistic predictive models of defective memristors. Alibart et al. [40], and Starzyk et al. [41] demonstrated pattern classification using a single layer perceptron network implemented utilizing a memristive crossbar circuit. This crossbar was trained using the perceptron learning rule for both ex-situ and in-situ methods. Nonlinearly separable problems were not studied in these works.

Memristor bridge circuits have been proposed [42,43] where small groups of memristors (either 4 or 5) are used to store a synaptic weight. One of the advantages of these bridge circuits is that either a positive or negative weight can be stored based on the sensed voltage. The use of these circuits in a crossbar structure for large scale pattern recognition has not yet been studied.

Soudry et al. [11] proposed gradient descent based learning on a memristor crossbar neural network. This system utilizes two transistors and one memristor per synapse. The synaptic weight precision of our proposed implementations is two times that of this design. The hardware implementation of their approach would require ADCs and a set of multi-bit buffers. An error back propagation step would require multipliers and a set of DACs. For a layer with m inputs and n neurons, a weight update requires O(m+n) operations in [11], while our proposed training approach requires O(c) operations (c is a constant).

Work in [44] proposed using two crossbars for the same weight values. One would be used to propagate forward and another transposed version would be used to propagate errors backward. Since the switching characteristic of a memristor often contains some degree of stochasticity, it would be difficult to store an exact copy of a memristor crossbar without a complex feedback write mechanism. Our proposed work is able to apply a variable pulse width during the weight update, which is not possible in the system in [44].

Training a multi-layer neural network requires the output layer error to be back propagated to the hidden layer neurons. Works [45,46] examined training of multi-layer neural networks using a training algorithm named “Manhattan Rule”. They did not detail the error back propagation step and design of the training pulse generation circuitry.

2.5 Memristor Based Multicore Systems

The most recent results for memristor based neuromorphic systems can be seen in [47,48]. Work in [47] did not explain how negative synaptic weights would be implemented in their single memristor per synapse design. Such neurons, capable of representing only positive weights, have very limited capability. No detailed programming technique was described. Liu [48] examined memristor based neural networks utilizing four memristors per synapse, while the proposed systems utilize two memristors per synapse. They proposed the deployment of their system as a neural accelerator alongside a RISC processor. In contrast, the proposed system processes data coming directly from a 3D stacked sensor chip.


CHAPTER III

ON-CHIP TRAINING OF MULTI-LAYER NEURAL NETWORKS

3.1 Introduction

Memristor crossbar based neural networks provide low power, high throughput execution. It is necessary to have an efficient training system for memristor neural network based systems. On-chip training has the advantage that it can take into account variations between devices. This chapter presents on-chip training circuits for multi-layer neural networks implemented using a single crossbar per layer and two memristors per synapse. Using two memristors per synapse provides double the synaptic weight precision when compared to a design that uses only one memristor per synapse. Two on- chip training systems are examined, where the first utilizes the traditional back propagation (BP) algorithm, and the second utilizes a novel variant of BP to reduce both circuit area and training time.

In our first approach, weight updates within the crossbar are done serially column by column, while in the second approach four steps are enough to update all memristors in a crossbar in parallel. The proposed training algorithm can train nonlinearly separable functions with a slight loss in accuracy compared to training with the traditional BP algorithm. We evaluated the training of both systems with some nonlinearly separable

datasets through detailed SPICE simulations which take crossbar wire resistance and sneak-paths into consideration.

Existing work on on-chip training circuits for memristor crossbars includes [11] and [44]. Soudry et al. [11] examined on-chip training of memristor crossbars with a single memristor per synapse. They did not consider the training of systems with two memristors per synapse. Boxun et al. [44] examined on-chip training of crossbar systems with two memristors per synapse. They utilized a pair of crossbars per layer, with one for the forward pass and the second for the backward pass.

In their design, the second crossbar needs to be an exact transposed copy of the first. The key disadvantage of this design is that the variability and stochastic switching characteristics of memristors would make it difficult to create an exact copy of a memristor crossbar without complex feedback write mechanisms.

The rest of the chapter is organized as follows: section 3.2 describes memristor devices and our memristor crossbar based neuron circuit designs. Section 3.3 examines training of memristor based multi-layer neural networks through the BP algorithm.

Section 3.4 describes the proposed training algorithm and its hardware implementation.

Sections 3.5 and 3.6 demonstrate the experimental setup and results respectively. Finally section 3.7 concludes the chapter.

3.2 Memristor Crossbar Based Neuron Circuit and Linear Separator Design

This section describes the operations of a neuron in a neural network, and explains the proposed memristor based implementations.


3.2.1 A Neuron in a Neural Network

Figure 3.1 shows a block diagram of a neuron. The neuron performs two types of operations, (i) a dot product of the inputs x1,…,xn and the weights w1,…,wn, and (ii) the evaluation of an activation function. The dot product operation can be seen in Eq. (3.1).

The activation function of the neuron is shown in Eq. (3.2). In a multi-layer feed forward neural network, a nonlinear differentiable activation function is desired (e.g. tan⁻¹(x)).

Figure 3.1. Neuron block diagram.

$DP_j = \sum_{i=1}^{n} x_i w_{ij}$    (3.1)

$y_j = f(DP_j)$    (3.2)

3.2.2 Neuron Circuits

Two circuits for implementing memristor based neurons are shown in Figure 3.2.

Each of the two neuron circuits has three inputs and one bias input (β). In each case, a synapse in these circuits is represented by a pair of memristors.

In the circuit in Figure 3.2(a), each data input is connected to two virtually grounded op-amps (operational amplifiers) through a pair of memristors. For a given row, if the conductance of the memristor connected to the first column (σA+) is higher than the conductance of the memristor connected to the second column (σA−), then the pair of memristors represents a positive synaptic weight. In the inverse situation, when σA+ < σA−, the memristor pair represents a negative synaptic weight.


Figure 3.2. Memristor-based neuron circuit. A, B, C are the inputs and yj is the output.

In Figure 3.2(a), the currents through the first and second columns are $A\sigma_{A+} + \cdots + \beta\sigma_{\beta+}$ and $A\sigma_{A-} + \cdots + \beta\sigma_{\beta-}$ respectively. Assume that

$$DP_j = 4R_f\left[\{A\sigma_{A+} + \cdots + \beta\sigma_{\beta+}\} - \{A\sigma_{A-} + \cdots + \beta\sigma_{\beta-}\}\right] = 4R_f\left[A(\sigma_{A+} - \sigma_{A-}) + \cdots + \beta(\sigma_{\beta+} - \sigma_{\beta-})\right]$$

(here $4R_f$ is a constant).

The output of the opamp connected directly with the second column represents the neuron output. When VDD and VSS are set to 0.5V and -0.5V respectively, the neuron circuit implements the activation function h(x) as in Eq. (3.3) where 푥 = 4푅푓[퐴(휎퐴+ −

휎퐴−) + ⋯ + 훽(휎훽+ − 휎훽−)]. This implies, the neuron output can be expressed as h(DPj).

Figure 3.3 shows that h(x) closely approximates the activation function $f(x) = \frac{1}{1+e^{-x}} - 0.5$. The values of VDD and VSS are chosen such that no memristor gets a voltage greater than Vth across it during evaluation.


$h(x) = \begin{cases} 0.5 & \text{if } x > 2 \\ x/4 & \text{if } |x| < 2 \\ -0.5 & \text{if } x < -2 \end{cases}$    (3.3)

Figure 3.3. Plot of functions f(x) and h(x).

Figure 3.2(b) shows the second neuron circuit. For each input, both the data and its complemented form are applied to the neuron circuit. In this circuit, a complemented input (such as Ā) has the same magnitude but opposite polarity as the original input (A). For the input pair A and Ā in Figure 3.2(b), if σA+ > σA−, then the synapse corresponding to input A has a positive weight. Likewise, if σA+ < σA−, then this synapse has a negative weight. This applies to each of the inputs. The output of the inverter pair at the bottom of the circuit represents the neuron output. Assume that the conductances of the memristors in Figure 3.2(b) from top to bottom are σA+, σA−, …, σβ+, σβ−. In an ideal case, the potential at the first inverter input (DPj) can be expressed by Eq. (3.4), which indicates that this circuit is essentially carrying out a set of multiply-add operations in parallel in the analog domain. In Eq. (3.4) the denominator is always greater than zero and works as a scaling factor.

$DP_j = \dfrac{A(\sigma_{A+} - \sigma_{A-}) + \cdots + \beta(\sigma_{\beta+} - \sigma_{\beta-})}{\sigma_{A+} + \sigma_{A-} + \cdots + \sigma_{\beta+} + \sigma_{\beta-}}$    (3.4)
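A small numerical sketch of Eq. (3.4), assuming each synapse is stored as a conductance pair (σ+, σ−) and that every input is applied together with its complement. The helper and variable names below are ours, and the conductance range is the 10 kΩ-10 MΩ device range used later in Table 3.7.

```python
import numpy as np

def neuron_dp(inputs, sigma_pos, sigma_neg):
    """Ideal inverter-input voltage DP_j of Eq. (3.4) for the complemented-input neuron."""
    num = np.sum(inputs * (sigma_pos - sigma_neg))   # weighted conductance difference
    den = np.sum(sigma_pos) + np.sum(sigma_neg)      # always-positive scaling factor
    return num / den

rng = np.random.default_rng(0)
sigma_pos = rng.uniform(1e-7, 1e-4, 4)   # conductances on uncomplemented input rows (S)
sigma_neg = rng.uniform(1e-7, 1e-4, 4)   # conductances on complemented input rows (S)
x = np.array([0.2, -0.3, 0.4, 0.5])      # inputs A, B, C and the bias beta (V)
print(neuron_dp(x, sigma_pos, sigma_neg))
```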

For the neuron circuit in Figure 3.2(b), we approximate the activation function using a pair of CMOS inverters as shown in Figure 3.4(a). The power rails to the inverters are VDD = 0.5 V and VSS = -0.5 V, leading to the transfer curve shown in Figure 3.4(b). In this plot, V(op2) is the neuron output and V(op1) is the inverted neuron output. This approach provides a very efficient way of implementing the activation function in terms of power, speed, and circuit components. This circuit essentially provides a bipolar neuron output with a value of either 0.5 or -0.5. Our experimental evaluations consider memristor crossbar wire resistance. The schematics of the memristor based neuron circuits considering wire resistance are shown in Figure 3.5.

Figure 3.4. (a) The CMOS inverter and (b) the inverter transfer curve. V(op2) is the neuron output and V(op1) is the inverted neuron output.

Figure 3.5. Memristor based neuron circuits showing crossbar wire resistance.

3.2.3 Synaptic Weight Precision

The precision of memristor based synaptic weights depends on the number of memristors used for each synapse and the resistance (or conductance) range of the memristor device. Assume that the maximum conductance of the memristor device is σmax and the minimum conductance is σmin. For a design using only a single memristor per synapse, a fixed reference conductance σ defines the separation between positive and negative weights [11]. Table 3.1 shows that the range of synaptic weights when using two memristors per synapse is two times that of a single memristor per synapse design.

Table 3.1: Synaptic weight precision.

                   Two memristors per synapse    One memristor per synapse
Maximum weight     σmax − σmin                   σmax − σ
Minimum weight     σmin − σmax                   σmin − σ
Range              2(σmax − σmin)                σmax − σmin
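As a quick numeric check of Table 3.1, using the device bounds quoted later in Table 3.7 (RON = 10 kΩ, ROFF = 10 MΩ); this is a sketch for intuition only, and the single-memristor case assumes weights are measured against a fixed reference conductance.

```python
# Conductance bounds for the device assumed in this chapter (see Table 3.7).
sigma_max = 1 / 10e3     # 1e-4 S  (RON  = 10 kOhm)
sigma_min = 1 / 10e6     # 1e-7 S  (ROFF = 10 MOhm)

# Two memristors per synapse: w = sigma_plus - sigma_minus.
range_two = (sigma_max - sigma_min) - (sigma_min - sigma_max)   # = 2*(sigma_max - sigma_min)

# One memristor per synapse: w = sigma - sigma_ref (fixed reference).
range_one = sigma_max - sigma_min

print(range_two / range_one)   # -> 2.0: double the representable weight range
```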

3.2.4 Linearly Separable Classifier Designs

This subsection demonstrates the functionality of the neuron circuits in Figure 3.2 by implementing a set of linearly separable three-input Boolean functions. There are 256 three input Boolean functions and among them 104 are linearly separable. It is cumbersome to examine the implementation of all 104 of these linearly separable functions. Therefore, we have implemented the 8 minterms that exist within the set of all

3 input logic functions. These functions are listed in Table 3.2. Each of the 8 minterms was implemented by a separate neuron and we utilized a single memristor crossbar to implement and train all the minterms simultaneously. For the neuron circuit in Figure

3.2(a) we utilized a crossbar of size 4×16 and for the neuron in Figure 3.2(b), the crossbar size was 8×8 (the additional input is for the bias).


Table 3.2: 8 three-input minterms used for training.

m0 = A'B'C'   m1 = A'B'C   m2 = A'BC'   m3 = A'BC
m4 = AB'C'    m5 = AB'C    m6 = ABC'    m7 = ABC

We have utilized the single layer perceptron learning algorithm [49] for training the memristor crossbars. The process used to apply the training pulses to the memristor crossbar is explained in subsection 3.3.4. A detailed SPICE simulation of the crossbar that considered crossbar wire resistance showed that the circuits were able to correctly classify each of the linearly separable functions. Figures 3.6 and 3.7 show the training graphs for the neuron circuits in Figures 3.2(a) and 3.2(b) respectively. In both cases, the mean square error (MSE) becomes zero, providing evidence for the functionality of the proposed memristor crossbar neuron circuits as linear separators.

Figure 3.6. Training graph of 8 three-input minterms utilizing the neuron circuit shown in Figure 3.2(a).

Figure 3.7. Training graph of 8 three-input minterms utilizing the neuron circuit shown in Figure 3.2(b).
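The single layer perceptron training used for the minterm experiment can be summarized as in the sketch below, with each weight modeled as a conductance pair (w = g+ − g−) in normalized units. The pulse-based hardware update and the bounded device conductance range are deliberately omitted, so this is an assumption-laden software stand-in rather than the circuit procedure itself.

```python
import numpy as np

def train_perceptron(X, T, epochs=100, eta=0.1):
    """Single layer perceptron rule [49] on bipolar (+/-0.5) inputs and targets.

    Weights are kept as a pair per synapse (w = g_pos - g_neg), mimicking the
    two-memristor synapse; a bias column is appended to the inputs.
    """
    Xb = np.hstack([X, 0.5 * np.ones((X.shape[0], 1))])
    g_pos = np.random.uniform(0.0, 0.01, (Xb.shape[1], T.shape[1]))
    g_neg = np.random.uniform(0.0, 0.01, (Xb.shape[1], T.shape[1]))
    for _ in range(epochs):
        mistakes = 0
        for x, t in zip(Xb, T):
            y = np.where(x @ (g_pos - g_neg) >= 0, 0.5, -0.5)   # bipolar neuron outputs
            e = t - y
            g_pos += eta * np.outer(x, e) / 2                   # split the update across the pair
            g_neg -= eta * np.outer(x, e) / 2
            mistakes += np.count_nonzero(e)
        if mistakes == 0:                                       # MSE has reached zero
            break
    return g_pos - g_neg

# The 8 three-input minterms of Table 3.2, encoded with +/-0.5 levels.
bits = np.array([[(i >> b) & 1 for b in (2, 1, 0)] for i in range(8)]) - 0.5
targets = np.where(np.eye(8, dtype=bool), 0.5, -0.5)
W = train_perceptron(bits, targets)
```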

3.3 System 1: Memristor Crossbar Based Multi-layer Neural Network for BP Algorithm

3.3.1 Multi-layer Circuit Design

The implementation of a nonlinear classifier requires a multi-layer neural network. Figure 3.8 shows a simple two layer feed forward neural network with three inputs, two outputs, and six hidden layer neurons. Figure 3.9 shows a memristor crossbar based implementation of the neural network in Figure 3.8, utilizing the neuron circuit shown in Figure 3.2(a). There are two memristor crossbars in this circuit, each representing a layer of neurons.

Figure 3.8. Two layer network for learning three input functions.

Figure 3.9. Schematic of the neural network shown in Figure 3.8 for forward pass utilizing the neuron circuit in Figure 3.2(a).

In Figure 3.9, the first layer of neurons is implemented using a 4×12 crossbar. The second layer of two neurons is implemented using a 7×4 memristor crossbar, where 6 of the inputs are coming from the 6 neuron outputs of the first crossbar. The additional input is used as a bias. When the inputs are applied to a crossbar, the entire crossbar is processed in parallel within one cycle.

3.3.2 Training

In order to provide proper functionality, a multi-layer neural network needs to be trained using a training algorithm. Back-propagation (BP) [50] and variants of the BP algorithm are widely used for training such networks. The stochastic BP algorithm was used to train the memristor based multi-layer neural network and is described below (a software sketch follows the listed steps):

1) Initialize the memristors with high random resistances.

2) For each input pattern x:

   i) Apply the input pattern x to the crossbar circuit and evaluate the DPj values and outputs (yj) of all neurons (hidden neurons and output neurons).

   ii) For each output layer neuron j, calculate the error, δj, between the neuron output (yj) and the target output (tj):

      $\delta_j = (t_j - y_j)$    (3.5)

   iii) Back propagate the error for each hidden layer neuron j:

      $\delta_j = \sum_k \delta_k w_{k,j}$    (3.6)

      where neuron k is connected to the previous layer neuron j.

   iv) Determine the amount, Δw, that each neuron's synapses should be changed (2η is the learning rate and f is the activation function):

      $\Delta w_j = 2\eta \times \delta_j \times f'(DP_j) \times x$    (3.7)

3) If the error in the output layer has not converged to a sufficiently small value, go to step 2.
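For reference, the listed steps map onto a few lines of Python. The sketch below is our software rendering only: it uses the sigmoid-like f(x) of Figure 3.3 and its analytic derivative in place of the analog circuit behaviour, and trains the 2 → 5 → 1 XOR network of Table 3.4.

```python
import numpy as np

def f(x):                                  # activation approximated by the circuit (Figure 3.3)
    return 1.0 / (1.0 + np.exp(-x)) - 0.5

def f_prime(x):                            # derivative of f
    s = f(x) + 0.5
    return s * (1.0 - s)

def bp_step(x, t, W1, W2, eta=0.1):
    """One stochastic BP update (Eqs. (3.5)-(3.7)); W1, W2 include a bias row."""
    xb = np.append(x, 0.5)                 # bias input
    dp1 = xb @ W1                          # hidden-layer dot products DP_j
    h = np.append(f(dp1), 0.5)             # hidden outputs plus bias
    dp2 = h @ W2
    y = f(dp2)

    d2 = t - y                                        # Eq. (3.5)
    d1 = W2[:-1] @ d2                                 # Eq. (3.6), bias row excluded
    W2 += 2 * eta * np.outer(h, d2 * f_prime(dp2))    # Eq. (3.7)
    W1 += 2 * eta * np.outer(xb, d1 * f_prime(dp1))
    return np.mean(d2 ** 2)

rng = np.random.default_rng(1)
W1 = rng.uniform(-0.5, 0.5, (3, 5))        # 2 inputs + bias -> 5 hidden neurons
W2 = rng.uniform(-0.5, 0.5, (6, 1))        # 5 hidden + bias -> 1 output neuron
X = np.array([[-.5, -.5], [-.5, .5], [.5, -.5], [.5, .5]])
T = np.array([[-.5], [.5], [.5], [-.5]])   # 2-input XOR with +/-0.5 encoding
for epoch in range(3000):
    mse = np.mean([bp_step(x, t, W1, W2) for x, t in zip(X, T)])
print(mse)
```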

3.3.3 Circuit Implementation of the Training Algorithm

The operations in the training process for the neural network in Figure 3.9 can be broken down into the following three major steps:

1. Apply inputs to layer 1 and determine the layer 2 neuron errors.

2. Back-propagate the layer 2 errors through the second layer weights and record the layer 1 errors.

3. Update the synaptic weights for both layers of neurons.


We will refer to the system based on the neuron circuit in Figure 3.2(a) and trained through the BP algorithm as system1. The circuit implementations of the above three steps for system1 are detailed below:

Step 1: A set of inputs is applied to the layer 1 neurons, and for each neuron j of the network, the neuron outputs (yj) are measured. This process is shown in Figure 3.9.

The output layer neuron errors (δL2,1 and δL2,2) are evaluated based on the difference between the observed outputs (yj) and the expected outputs (tj). The error values are discretized using an ADC and are stored in buffers.

Step 2: The layer 2 errors (δL2,1 and δL2,2) are applied to the layer 2 weights after conversion from digital to analog form, as shown in Figure 3.10, to generate the layer 1 errors (δL1,1 to δL1,6). The memristor crossbar in Figure 3.10 is the same as the layer 2 crossbar in Figure 3.9. Assume that the synaptic weight associated with input i and neuron j (a second layer neuron) is wij = σij+ − σij− for i = 1,2,…,6 and j = 1,2. In the backward phase we want to evaluate

$\delta_{L1,i} = \sum_j w_{ij}\,\delta_{L2,j} = \sum_j(\sigma_{ij}^{+} - \sigma_{ij}^{-})\,\delta_{L2,j} = \sum_j \sigma_{ij}^{+}\,\delta_{L2,j} - \sum_j \sigma_{ij}^{-}\,\delta_{L2,j}$ for i = 1,2,…,6.    (3.8)

The circuit in Figure 3.10 is essentially evaluating the same operations as Eq. (3.8), applying both δL2,j and -δL2,j to the crossbar columns for j = 1,2. Back propagated errors are discretized using ADCs and are stored in buffers. To reduce the training circuit overhead, we can multiplex the back propagated error generation circuit as shown in Figure 3.11. By enabling the appropriate pass transistors, back propagated errors are sequentially generated and stored in buffers. Access to the pass transistors will be controlled by a shift register. In this approach the complexity of the back propagation step will be O(m), where m is the number of inputs to the layer of neurons.

Figure 3.10. Schematic of the neural network shown in Figure 3.8 for back propagating errors to layer 1.

Figure 3.11. Implementing back propagation phase multiplexing error generation circuit.
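Numerically, the backward pass of Figure 3.10 is just the transposed multiply of Eq. (3.8), realized as two column-current sums that are then subtracted. A short sketch with made-up values:

```python
import numpy as np

rng = np.random.default_rng(2)
# Layer 2 synapses as conductance pairs: w_ij = sigma_pos[i, j] - sigma_neg[i, j]
sigma_pos = rng.uniform(1e-7, 1e-4, (6, 2))   # 6 rows (layer 1 neurons), 2 output neurons
sigma_neg = rng.uniform(1e-7, 1e-4, (6, 2))
delta_L2 = np.array([0.3, -0.1])              # output layer errors

# Eq. (3.8): apply +delta to one column of each pair and -delta to the other,
# then sum the resulting row currents.
delta_L1 = sigma_pos @ delta_L2 - sigma_neg @ delta_L2
assert np.allclose(delta_L1, (sigma_pos - sigma_neg) @ delta_L2)
print(delta_L1)
```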

Step 3: The weight update procedures for both layers are similar. They take neuron inputs, errors, and intermediate neuron outputs (DPj) to generate a set of training pulses. The training unit essentially implements Eq. (3.7). To update a synaptic weight by an amount Δw, the conductance of the memristor connected to the first column of the corresponding neuron will be updated by the amount Δw/2 (where Δw/2 = η × δj × f′(DPj) × xi) and the memristor connected to the second column of the corresponding neuron by the amount -Δw/2. We will describe the weight update procedure for the first columns of the neurons (odd crossbar columns in Figure 3.9). The weight update procedure for the second columns of the neurons (even columns) is similar to the procedure for the first columns, except that we need to multiply the neuron errors (δj) by -1.

In Eq. (3.7) we need to evaluate the derivative of the activation function at the dot product of the neuron inputs and weights (DPj). The DPj value of neuron j is essentially the difference of the currents through the two columns implementing the neuron (see section 3.2.2). Applying the inputs to the neuron again, the difference of the column currents is stored in a buffer after conversion into digital form. The derivative of the activation function (f′) is evaluated using a lookup table. A digital multiplier is utilized to multiply the neuron error (δj) and f′(DPj). The product δj×f′(DPj) is converted into analog form and is applied to the training pulse generation unit (see Figure 3.12(b)). For training, pulses of variable amplitude and variable duration are produced. The amplitude of the training signal is modulated by the neuron input (xi) and is applied to the row wire connecting the desired memristor (see Figure 3.12(a)). The duration of the training pulse is modulated by η×δj×f′(DPj) and is applied to the column wire connecting the desired memristor (see Figure 3.12(b)). The combined effect of the two voltages applied across the memristor will update the conductance by an amount proportional to η × δj × f′(DPj) × xi.

Figure 3.12. Training pulse generation module. Inputs to the circuit are mentioned for the scenario shown in the first row of Table 3.3.

Figure 3.12 shows the training circuit for the case when δj>0 and xi>0. Table 3.3 shows the inputs to the training module for each of the combinations of sign(xi) and sign(δj). The inputs are taken such that during a weight increase Vwai>0 and Vwd=VSS for the training period, and during a weight decrease Vwai<0 and Vwd=VDD for the training period. The inputs are normalized such that max{|xi|}=0.5 V and min{|xi|}=0 V. In the circuit in Figure 3.12(a) the amplitude of Vwai can be expressed as Vth−Vb+xi. The circuit in Figure 3.12(b) uses a triangular wave, V∆1, which is described in Eq. (3.9). This circuit has VDD=Vb and VSS=−Vb. The value of Vb is taken as 1.2 V, and the reasoning behind this choice is explained in the next subsection. When Va>V∆1 the amplitude of Vwd will be VSS, whose value is −Vb, where Va=δj×f′(DPj). Va is greater than V∆1 for the duration T∆1×Va, where T∆1 is the duration of the triangular wave (see Figure 3.13). The value of T∆1 determines the learning rate, η. The time T∆1×δj×f′(DPj), which determines the effective training pulse duration, is consistent with the training rule (Eq. (3.7)). Soudry et al. [11] used a similar approach for training, but utilized one memristor per synapse.

Table 3.3: Inputs to the training module for different scenarios.

Sign of δj   Sign of xi   Weight update   ip1    ip2           ip3    ip4
+            +            increase        -xi    -(Vth-Vb)     δj     V∆1
+            -            decrease        -xi    -(-Vth+Vb)    -δj    -V∆1
-            +            decrease        xi     -(-Vth+Vb)    δj     -V∆1
-            -            increase        xi     -(Vth-Vb)     -δj    V∆1

$V_{\Delta 1}(t) = \begin{cases} 1 - \dfrac{2t}{T_{\Delta 1}} & \text{if } 0 \le t \le T_{\Delta 1}/2 \\[4pt] \dfrac{2t}{T_{\Delta 1}} - 1 & \text{if } T_{\Delta 1}/2 < t \le T_{\Delta 1} \\[4pt] 1 & \text{otherwise} \end{cases}$    (3.9)


Figure 3.13. Plot showing V∆1, Va and the training pulse duration VaT∆1.
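The pulse-width modulation can be checked numerically. With the V-shaped carrier of Eq. (3.9) as reconstructed above, the fraction of the period for which Va exceeds V∆1 equals Va, so the effective pulse length is T∆1×Va; the helper name and the 10 ns period below are ours.

```python
import numpy as np

def v_delta1(t, T):
    """Triangular carrier of Eq. (3.9): falls from 1 to -1, then rises back to 1."""
    t = np.asarray(t)
    return np.where(t <= T / 2, 1 - 2 * t / T, 2 * t / T - 1)

T_d1 = 10e-9                              # carrier period (example value)
Va = 0.35                                 # Va = delta_j * f'(DP_j)
t = np.linspace(0, T_d1, 200001)
measured = np.mean(Va > v_delta1(t, T_d1)) * T_d1
print(measured, Va * T_d1)                # both approximately 3.5 ns
```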

3.3.4 Writing to Memristor Crossbars

Recall that the voltage across a memristor needs to surpass a threshold voltage (Vth) in order to change the conductance of a given memristor [10] (for the device considered, Vth=1.3 V [9]). Several studies [9,51,52] examined the change in memristor conductance for positive and negative voltage pulses of variable pulse height and pulse width (duration). They reported a strong correlation between the change in conductance, the width of the applied pulse, and the applied voltage amplitude. It is reasonable to assume that a generated training pulse, determined according to the training rule, will be able to update memristor conductance values accordingly.

The physical layout of the memristor devices is assumed as in Figure 3.12. When Vwai−Vwd>Vth the conductance of the memristor is increased. Alternatively, when Vwai−Vwd<−Vth the conductance of the memristor is decreased. In the update process, two memristors in the same row or same column of a crossbar cannot have their conductances changed in different directions (one increase, another decrease) simultaneously. As a result, the conductances of the memristors in the crossbar will be updated column by column. For a crossbar column, the conductances of the memristors will be updated in two steps: first the memristors requiring a conductivity increase will be updated, then the memristors requiring a conductivity decrease will be updated. In this process, the circuit shown in Figure 3.12(a) will be required for each row of the crossbar. For an entire crossbar, one circuit of the type shown in Figure 3.12(b) is required, which will be used for each column one by one.

For the scenarios mentioned in the first and fourth rows of Table 3.3, the conductance of the memristors will be updated in the increasing phase. Figure 3.14(a) shows the weight update operation on the first crossbar column in the increasing phase. In this phase, voltages Vwa1 and Vwa4, produced by circuits similar to Figure 3.12(a), will be applied to the first and fourth rows respectively, where we want to increase conductance. The training pulse Vwd, generated by the circuit in Figure 3.12(b), will be applied to the column of the memristors we want to update (in this case the first column). The remaining rows and columns of the crossbar will be set to 0 V. The potential difference across the memristors to be updated will be Vwai−Vwd for i=1,4. For the training duration those potential differences will be Vth−Vb+|xi|−(−Vb), or Vth+|xi|. As Vth+|xi|>Vth, the conductance of the selected memristors will be changed. The potential difference across the other memristors will be 0 V, Vb, Vth−Vb+|xi|, or Vth−2Vb+|xi| for i=1,4. To ensure the other memristors do not change their conductance, we need to select Vb such that max{|xi|} < Vb < Vth. We used the value Vb=1.2 V. Figure 3.15 shows the plot of the training signals and the potential difference across some memristors for this weight update step. Only the memristors m1,1 and m4,1 see a potential across them greater than Vth for the training duration (2 ns to 8 ns). Apart from these, no memristor sees a potential across it greater than Vth at any time.


Figure 3.14. Demonstration of the weight update operation in the first column of a crossbar: (a) increasing phase and (b) decreasing phase. An upward arrow indicates an increase of conductance and a downward arrow indicates a decrease of conductance.

Figure 3.15. (a) Training signals while x1=0.2 V and x4=0.4 V and (b) voltage across some memristors for the operation in Figure 3.14(a). Vmi,j is the voltage difference across the memristor mi,j.

For the scenarios mentioned in the second and third rows of Table 3.3, the conductance of the memristors will be updated in the decreasing phase in a procedure similar to the one used for the increasing phase (see Figure 3.14(b)). The training pulse generation technique was described in the previous subsection. The complexity of the weight update operations for a layer of neurons will be O(n) where n is the number of neurons in the layer. The training results on some nonlinearly separable datasets are shown in section 3.6.


3.4 System 2: Memristor Crossbar Based Multi-layer Neural Network for the Proposed Training Algorithm

3.4.1 Proposed Training Algorithm

The hardware implementation of the exact BP algorithm is expensive, as it requires ADCs, DACs, a lookup table (to evaluate the derivative of the activation function), and a multiplier. In this section a variant of the stochastic BP algorithm [50] is proposed for a low cost hardware implementation. For a layer of neurons, a weight update can be done in four steps, as opposed to two steps per crossbar column (used for system1). When implementing back propagation training in software, a commonly used activation function is one that is based on the inverse tangent, as seen in Eq. (3.10), with its derivative in Eq. (3.11).

$f(x) = \frac{1}{\pi}\tan^{-1}(x)$    (3.10)

$\frac{df(x)}{dx} = \frac{1}{\pi} \times \frac{1}{1+x^2} = \frac{1}{\pi}\,g(x)$    (3.11)

[where $g(x) = 1/(1+x^2)$]

The proposed hardware implementation approximates these functions using analog hardware. This allows for an implementation with higher speed, lower area, and lower energy. The neuron activation function is implemented using the double inverter circuit in Figure 3.4, and can be modelled using a threshold function as in Eq. (3.12).

$f_{th}(x) = \begin{cases} 0.5, & x \ge 0 \\ -0.5, & x < 0 \end{cases}$    (3.12)

To simplify the weight update equation for a low cost hardware implementation, g(x) in Eq. (3.11) was approximated as the piecewise linear function gth(x) (see Eq.

(3.13)). Figure 3.16 shows a comparison between g(x), and the approximated function gth(x).


$g_{th}(x) = \begin{cases} 1 - |x|/2, & |x| < 1.90 \\ 0.05, & \text{else} \end{cases}$    (3.13)

Figure 3.16. Plot comparing g(x) and gth(x).

The proposed training algorithm, a variant of the back-propagation algorithm, is stated below:

1) Initialize the memristors with high random resistances.

2) For each input pattern x:

   i) Apply the input pattern x to the crossbar circuit and evaluate the DPj values and outputs (yj) of all neurons (hidden neurons and output neurons).

   ii) For each output layer neuron j, calculate the error, δj, between the neuron output (yj) and the target output (tj):

      $\delta_j = \mathrm{sgn}(t_j - y_j)$    (3.14)

   iii) If $\sum_{j \in (\text{output neurons})} |\delta_j| == 0$ then go to step 2 for the next input pattern.

   iv) Back propagate the error for each hidden layer neuron j:

      $\delta_j = \mathrm{sgn}\!\left(\sum_k \delta_k w_{k,j}\right)$    (3.15)

      where neuron k is connected to the previous layer neuron j.

   v) Determine the amount, Δw, that each neuron's synapses should be changed (2ηπ is the learning rate):

      $\Delta w_j = 2\eta \times \delta_j \times g_{th}(DP_j) \times x$    (3.16)

3) If the error in the output layer has not converged to a sufficiently small value, go to step 2.

In this algorithm, equations (3.14)-(3.16) are different from the traditional back-propagation algorithm. Instead of evaluating the actual error, we evaluate a sign function of the error (a bipolar value). It is much simpler to store a bipolar value than an analog value, and this allows for a simpler hardware implementation. Additionally, we do not need expensive ADCs, DACs, lookup tables, or multipliers in this approach. Furthermore, for a layer of neurons, the weight update can be done in four steps, which enables faster training (detailed in subsection 3.4.3). A software sketch of this variant follows.
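The sketch below renders the sign-based variant in Python, mirroring the earlier BP sketch but using fth and gth from Eqs. (3.12)-(3.13); it is our simplified software model, not the pulse-based circuit.

```python
import numpy as np

def f_th(x):                     # bipolar neuron output, Eq. (3.12)
    return np.where(x >= 0, 0.5, -0.5)

def g_th(x):                     # piecewise linear derivative surrogate, Eq. (3.13)
    return np.where(np.abs(x) < 1.9, 1 - np.abs(x) / 2, 0.05)

def proposed_step(x, t, W1, W2, eta=0.07):
    """One update of the proposed variant (Eqs. (3.14)-(3.16)); W1, W2 include bias rows."""
    xb = np.append(x, 0.5)
    dp1 = xb @ W1
    h = np.append(f_th(dp1), 0.5)
    dp2 = h @ W2
    y = f_th(dp2)

    d2 = np.sign(t - y)                       # Eq. (3.14): values in {-1, 0, +1}
    if np.all(d2 == 0):                       # step (iii): no output error, skip update
        return 0.0
    d1 = np.sign(W2[:-1] @ d2)                # Eq. (3.15)

    W2 += 2 * eta * np.outer(h, d2 * g_th(dp2))    # Eq. (3.16)
    W1 += 2 * eta * np.outer(xb, d1 * g_th(dp1))
    return np.mean((t - y) ** 2)
```

Only the sign of each error needs to be stored, which is what makes the four-step crossbar update described in the following subsections possible.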

3.4.2 Training Algorithm Comparison To examine the functionality of the proposed training algorithm we have trained neural networks for different nonlinearly separable datasets: (a) 2 input XOR function,

(b) 3 input odd parity function, (c) 4 input odd parity function, (d) Wine classification, (e) classification based on the Iris dataset and, (f) classification based on the MNIST dataset.

We have trained the neural networks for these datasets in Matlab utilizing the configurations shown in Table 3.4. A neural network configuration is described as x → y → z, where the network has x inputs, y hidden neurons, and z output neurons. We also trained these datasets using the traditional BP algorithm for comparison. For each dataset both training approaches utilized the same network configurations, learning rates, and maximum epoch counts. Figure 3.17 shows the mean squared error (MSE) obtained for different datasets using the proposed algorithm as well as the traditional back-propagation algorithm. We observe that the proposed algorithm converges to the minimum specified error.


Table 3.4: Neural network configurations.

Dataset              Network configuration   Number of training data   Learning rate
2 input XOR          2 → 5 → 1               4                         0.1
3 input odd parity   3 → 7 → 1               8                         0.07
4 input odd parity   4 → 12 → 1              16                        0.07
Wine                 13 → 20 → 3             118                       0.05
Iris                 4 → 15 → 3              99                        0.05
MNIST                784 → 800 → 10          50,000                    0.07

Figure 3.17. Mean squared errors vs. training epoch for both the proposed training algorithm and a traditional BP algorithm: (a) 2 input XOR, (b) 3 input odd parity, (c) 4 input odd parity, (d) Wine, (e) Iris, (f) MNIST.

For the 2 input XOR, 3 input odd parity, and 4 input odd parity functions, 100% recognition accuracy was achieved with both the proposed algorithm and the traditional BP approach. Figure 3.18 shows the recognition error on the test datasets for the proposed algorithm and the back-propagation training approach. The proposed training algorithm can train nonlinearly separable functions with a slight loss in accuracy compared to training with the BP algorithm. For the MNIST dataset, the recognition error on the test set was 4.9% for the proposed algorithm and 3.5% for the BP algorithm. The training curves for the proposed algorithm (Figures 3.17(d), (e) and (f)) are not as smooth as the curves obtained when using the BP algorithm because the former utilizes only the sign of the errors.

Figure 3.18. Testing error for BP based training and the proposed approach using (a) the Wine dataset (51 test patterns), (b) the Iris dataset (60 test patterns) and (c) the MNIST dataset (10000 test patterns).

3.4.3 Circuit Implementation of the Proposed Training Algorithm

The implementation of the forward pass of the proposed algorithm is similar to the design for system1, except that the neuron circuit in Figure 3.2(b) is used. We will refer to the system based on the neuron circuit in Figure 3.2(b) and trained through the proposed training algorithm as system2. Figure 3.19 shows a memristor crossbar based implementation of the neural network in Figure 3.8 utilizing the neuron circuit shown in Figure 3.2(b). In the forward pass the ctrl1 signal is high. The error back propagation step and weight update step for system2 are different from the designs for system1.

Figure 3.19. Schematic of the neural network shown in Figure 3.8 utilizing the neuron circuit in Figure 3.2(b).

3.4.4 Error Back Propagation

The output layer (layer 2) neuron errors are discretized to the values 0, 1, and -1 and stored in registers. The layer 2 errors (δL2,1 and δL2,2) are applied to the layer 2 weights as shown in Figure 3.20 to generate the layer 1 errors (δL1,1 to δL1,6). Assume that the synaptic weight associated with input i and neuron j (a second layer neuron) is wij = σij+ − σij− for i = 1,2,…,6 and j = 1,2. For the proposed algorithm, in the backward phase we want to evaluate

$\delta_{L1,i} = \mathrm{sgn}\!\left(\sum_j w_{ij}\,\delta_{L2,j}\right) = \mathrm{sgn}\!\left(\sum_j(\sigma_{ij}^{+} - \sigma_{ij}^{-})\,\delta_{L2,j}\right) = \mathrm{sgn}\!\left(\sum_j \sigma_{ij}^{+}\,\delta_{L2,j} - \sum_j \sigma_{ij}^{-}\,\delta_{L2,j}\right)$ for i = 1,2,…,6.    (3.17)

The circuit in Figure 3.20 is essentially evaluating the same operations as Eq. (3.17). Alibart et al. utilized a similar design for neuron circuits to implement linear separators [40]. In this approach the complexity of the back propagation step will be O(1).

Figure 3.20. Schematic of the neural network shown in Figure 3.19 for back propagating errors to layer 1.

Figure 3.19 shows the overall neural network circuit, combining the circuits for the forward and backward passes. Pass transistors have been added to the second layer memristor crossbar that are controlled by signals ctrl1 and ctrl2. The ctrl1 signal enables forward propagation through the layer 2 memristor crossbar to generate neuron outputs, while ctrl2 enables error back-propagation to generate error values for the earlier layer neurons. In the second step, ctrl1 is set to 0 to isolate the second layer crossbar from the first layer crossbar to enable error back propagation.


3.4.5 Memristor Weight Update Approach

The training units use the inputs, errors, and intermediate neuron outputs (DPj) to generate a set of training pulses. The training unit essentially implements Eq. (3.16). To update a synaptic weight by an amount Δw, the conductance of the memristor connected to the corresponding uncomplemented input will be updated by the amount Δw/2 (where Δw/2 = η × δj × gth(DPj) × xi) and the memristor connected to the complemented input by the amount -Δw/2.

For design simplicity the error term δj, in the training rule, is discretized to a value of either 1, -1 or 0. The key impact of δj is to determine the direction (by conductance increase or decrease) of the weight update along with input xi. The magnitude of the weight update is determined by the values η×gth(DPj)×xi. Recall that for making the desirable changes to memristor conductance values, we need to apply a voltage of appropriate magnitude and polarity for a suitable duration across the memristor.

Similar to the previous section, the training circuit is able to generate pulses of variable amplitude and of variable duration so that memristor conductance can be updated according to the learning rule in Eq. (3.16). We are determining training pulse magnitude on the basis of neuron input and duration on the basis of ηgth(DPj) for a neuron j. For the assumed layout of the memristor crossbar, conductance increase requires potential across the memristor (row to column) greater than Vth and conductance decrease requires potential less than –Vth. As a result, in the same row or same column conductance of different memristors cannot be increased and decreased simultaneously.

Furthermore, no two of the four weight update scenarios in Figure 3.21 can be performed simultaneously. For example, if we update M1 and M4 simultaneously, M2 and M3 will also be updated in the undesired direction (increase as opposed to decrease). For each combination of sign(δj) and sign(xi) shown in Table 3.5, one exclusive write step is required. This implies that an entire memristor crossbar implementing a layer of neurons can be updated in four steps.

Table 3.5: Weight update scenarios.

         sign(δj)   sign(xi)   Weight update
case1    +          +          increase
case2    +          -          decrease
case3    -          +          decrease
case4    -          -          increase

Figure 3.21. Memristor crossbar weight (conductance) update scenarios. An upward arrow indicates an increase of conductance and a downward arrow indicates a decrease of conductance.
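Procedurally, the schedule of Table 3.5 and Figure 3.21 selects, in each of four passes, the rows whose inputs share one sign and the columns whose errors share one sign, so every selected memristor pair moves in the same direction. The sketch below is our software description of that schedule, with clipping of the conductances to the device range omitted.

```python
import numpy as np

def four_step_update(G_pos, G_neg, x, delta, g_dp, eta=0.05):
    """Update a crossbar of conductance pairs in four sign-selected passes.

    x     : neuron inputs, one per input row pair
    delta : discretized neuron errors in {-1, 0, +1}, one per neuron column
    g_dp  : g_th(DP_j) for each neuron, which sets the pulse duration
    """
    for sx in (+1, -1):                   # sign of x_i
        for sd in (+1, -1):               # sign of delta_j  -> case1..case4 of Table 3.5
            rows = np.where(np.sign(x) == sx)[0]
            cols = np.where(np.sign(delta) == sd)[0]
            if rows.size == 0 or cols.size == 0:
                continue
            # Delta_w = 2*eta*delta_j*g_th(DP_j)*x_i, split evenly across the pair.
            dw = 2 * eta * np.outer(np.abs(x[rows]), g_dp[cols]) * (sx * sd)
            G_pos[np.ix_(rows, cols)] += dw / 2
            G_neg[np.ix_(rows, cols)] -= dw / 2
    return G_pos, G_neg
```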

The circuit in Figure 3.22 will be used to pulse the memristor crossbar during training when the signs of (xi×δj) and DPj are positive. The output of neuron j essentially produces the sign of DPj. For each combination of the signs of xiδj and DPj, the inputs ip3 and ip4 to the training pulse generation circuit in Figure 3.22(b) are shown in Table 3.6. Inputs ip1 and ip2 to the circuit shown in Figure 3.22(a) are the same as in Table 3.3. The inputs are taken such that during a weight increase Vwai>0 and Vwdj=VSS for the training period, and during a weight decrease Vwai<0 and Vwdj=VDD for the training period. Inputs (xi) are normalized in the same way as mentioned in the previous section. The amplitude of Vwai can be expressed as Vth−Vb+xi (see Figure 3.22(a)). The duration of Vwdj in Figure 3.22(b) determines the effective training pulse duration (detailed later in this subsection).

Figure 3.22. Training pulse generation module for the proposed training algorithm. Inputs to the circuit are mentioned for the scenario shown in the first row of Table 3.6.

Table 3.6: Inputs to the training module (circuit in Figure 3.22(b)) for different scenarios of the proposed training algorithm.

Sign of δjxi   Sign of DPj   Weight update   ip3     ip4     VDD    VSS
+              +             increase        Vcj     V∆2     0 V    -Vb
+              -             increase        -V∆2    Vcj     0 V    -Vb
-              +             decrease        V∆2     Vcj     Vb     0 V
-              -             decrease        Vcj     -V∆2    Vb     0 V


Now we will describe the conductance update procedure for all the memristors corresponding to positive inputs (xi>0) and positive neuron errors (δj>0). This step will update all the memristors belonging to case1 (updating memristors residing in different rows and columns simultaneously). In Figure 3.23, memristors M1 and M5 will be updated in this step. This weight update approach is different from the approach used for system1, where each crossbar column was updated in two steps. Vwai, produced by circuits similar to Figure 3.22(a), will be applied to the crossbar rows corresponding to inputs xi>0, and 0 V will be applied to the rest of the rows. Vwdj, produced by circuits similar to Figure 3.22(b), will be applied to the crossbar columns corresponding to errors δj>0, and 0 V will be applied to the rest of the columns. The technique used to produce Vwdj is detailed in the following paragraphs. The weight update procedures for the other three cases in Table 3.5 are similar to the procedure used for case1.

Figure 3.23. Updating multiple memristors in a single step. An upward arrow indicates an increase of conductance and a downward arrow indicates a decrease of conductance.

The training pulse Vwdj is determined by the DPj value of each neuron. To determine the DPj value of a neuron we need to access the crossbar column implementing the neuron (the input of the inverter pair in the neuron circuit, Figure 3.2(b)). We cannot access the DPj values of the neurons and update the weights of the neurons at the same time. In this circuit we propose to store the DPj values in capacitors as Vcj (after applying the input to the network again) and use them to determine training pulses of appropriate durations. Because of the input normalization method, Vcj will be in the range [-0.5 V, 0.5 V]. Assume that a=1.9/0.5.

The training pulse duration is modulated by the term η×gth(DPj) in the training rule (Eq. (3.16)). This is implemented using the voltage signal V∆2 in Figure 3.22(b), a triangular wave (Eq. (3.18)). In this circuit, when δjxi>0 and DPj>0, VDD=0 V and VSS=−Vb (see Table 3.6). Here Vb plays the same role as in system1. When Vcj>V∆2, the amplitude of Vwdj will be VSS, whose value is −Vb. Vcj is greater than V∆2 for the duration T∆2(1−aVcj/2), where T∆2 is the duration of the triangular wave and a=3.8 (see Figure 3.24). The value of T∆2 determines the learning rate, η. The time duration T∆2(1−aVcj/2) is consistent with the term ηgth(DPj) in the training rule (Eq. (3.16)).

$V_{\Delta 2}(t) = \begin{cases} \dfrac{4t}{aT_{\Delta 2}} & \text{if } 0 \le t \le T_{\Delta 2}/2 \\[4pt] \dfrac{2}{a} - \dfrac{4}{aT_{\Delta 2}}\left(t - \dfrac{T_{\Delta 2}}{2}\right) & \text{if } T_{\Delta 2}/2 < t \le T_{\Delta 2} \\[4pt] 0 & \text{otherwise} \end{cases}$    (3.18)

Figure 3.24. Plot showing V∆2, Vcj and the training pulse duration T∆2(1-aVcj/2).
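For the positive-DPj case (first row of Table 3.6), the relation between the stored capacitor voltage and the resulting pulse length can be checked directly: with a = 1.9/0.5, the duration T∆2(1−aVcj/2) is exactly T∆2·gth(aVcj), which is how the circuit realizes the η·gth(DPj) factor of Eq. (3.16). A small sketch (the period value is ours):

```python
import numpy as np

a = 1.9 / 0.5              # scaling between the stored voltage Vcj and the g_th argument
T_d2 = 10e-9               # triangular wave duration (example value)

def g_th(x):               # Eq. (3.13)
    return np.where(np.abs(x) < 1.9, 1 - np.abs(x) / 2, 0.05)

def pulse_duration(Vcj):
    """Training pulse length T_delta2 * (1 - a*Vcj/2) for the positive-DP_j case."""
    return T_d2 * (1 - a * Vcj / 2)

for Vcj in (0.0, 0.1, 0.25, 0.5):
    print(Vcj, pulse_duration(Vcj), T_d2 * g_th(a * Vcj))   # the two durations coincide
```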


The proposed four step weight update operation for a layer of neurons would require the circuit shown in Figure 3.22(a) for each input and the circuit shown in Figure 3.22(b) for each neuron. This approach could also be applied to system1, but the hardware cost would be extensive: a large number of DACs, ADCs, lookup tables (to evaluate f′(DPj)), and multipliers would be required. Recall that we utilized capacitors to store the neuron DPj values in system2. The required capacitance for each such capacitor is 5 fF, which requires an area of 0.3178 um² if TiO2 is used as the dielectric material. This area is slightly greater than one SRAM memory cell assuming a 45 nm process is used.

In summary, the steps of each training iteration are:

1) Apply the input pattern to the crossbar.

2) Determine the output layer neuron outputs.

3) Determine the error in the output layer.

4) Back-propagate the errors to earlier layers and store the polarity of these errors for each neuron.

5) For each layer from the output layer to the first hidden layer (update the output layer neuron weights and then gradually update the earlier layer neuron weights):

   a) Apply the input pattern to the crossbar again and store the DPj value of each neuron j in a capacitor.

   b) Apply pulses to the crossbar in four steps to update the weights.

For a layer of neurons the weight update requires O(c) operations, where c is a constant. The training units are only active in the training phase. When evaluating new inputs, only the forward pass will be executed and there will be no power consumed by the training units, comparators, or op-amps used in the error back propagation step.

3.5 Experimental Setup

The memristor crossbar circuits were simulated in SPICE so that the memristor grid could be evaluated very accurately, considering the crossbar sneak-paths and wire resistances. A wire resistance of 5 Ω between memristors in the crossbar is considered in these simulations. Each attribute of the input was mapped into the [-Vread, Vread] voltage range. We have performed simulations of the memristor crossbar based neural networks both with and without considering memristor device variation and stochasticity. We assumed a maximum 30% deviation of the memristor device response due to device variation and stochasticity. That is, when we want to update a memristor conductance by an amount x, the corresponding training pulse would update the conductance by a value randomly taken from the interval [0.7x, 1.3x]. Table 3.7 shows the simulation parameters. The resistances of the memristors in the crossbars were randomly initialized between 0.909 MΩ and 10 MΩ.
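The ±30% device variation and stochasticity model is straightforward to express in code: every intended conductance change is scaled by a factor drawn uniformly from [0.7, 1.3] and the result is clipped to the device range of Table 3.7. A sketch with our own helper names:

```python
import numpy as np

rng = np.random.default_rng(0)
SIGMA_MIN, SIGMA_MAX = 1 / 10e6, 1 / 10e3      # 10 MOhm .. 10 kOhm device range (S)

def apply_update(g, intended_delta, deviation=0.3):
    """Apply a conductance update with a stochastic device response.

    The realized change equals the intended change scaled by a random factor in
    [1 - deviation, 1 + deviation], clipped to the physical conductance range.
    """
    factor = rng.uniform(1 - deviation, 1 + deviation, size=np.shape(intended_delta))
    return np.clip(g + intended_delta * factor, SIGMA_MIN, SIGMA_MAX)

g = np.full((4, 4), 1e-6)                      # initial conductances (~1 MOhm)
g = apply_update(g, 5e-7 * np.ones((4, 4)))    # nominal +0.5 uS step, realized within +/-30%
print(g)
```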

Simulation of the memristor device used an accurate model of the device published in [53]. The memristor device simulated in this chapter was published in [9], and the switching characteristics for the model are displayed in Figure 3.25. This device was chosen for its high minimum resistance value and large resistance ratio. According to the data presented in [9], this device has a minimum resistance of 10 kΩ, a resistance ratio of 10³, and the full resistance range of the device can be switched in 20 μs by applying 2.5 V across the device.


Figure 3.25. Simulation results displaying the input voltage and current waveforms for the memristor model [53] that was based on the device in [9]. The following parameter values were used in the model to obtain this result: Vp=1.3V, Vn=1.3V, Ap=5800, An=5800, xp=0.9995, xn=0.9995, αp=3, αn=3, a1=0.002, a2=0.002, b=0.05, x0=0.001.

Table 3.7: Simulation parameters.

Memristor RON                                                          10 kΩ
Memristor ROFF                                                         10 MΩ
Maximum read voltage, Vread                                            0.5 V
Threshold voltage                                                      1.3 V
Memristor switching time for a 2.5 V write voltage                     20 μs
Crossbar wire segment resistance                                       5 Ω
Maximum deviation of memristor device response due to device
variation & stochasticity                                              30%
Value of Rf in Figure 3.2(a)                                           14 MΩ
Learning rate                                                          5 ns - 8 ns

MATLAB (R2014a) and SPICE (LTspice IV) were used to develop a simulation framework for applying the training algorithm to a multi-layered neuron circuit. SPICE was mainly used for detailed analog simulation of the memristor crossbar array and

MATLAB was used to simulate the rest of the system. The circuits were trained by applying input patterns one by one until the errors were below the desired levels.


3.6 Results

We have examined the training of the memristor crossbar arrays in Systems 1 and

2 using SPICE for five of the datasets in Table 3.4. We do not show the SPICE training result for the MNIST dataset as the simulation of a memristor crossbar large enough for this application would take a very long time.

3.6.1 SPICE Training Results for BP Algorithm

The size of the memristor crossbar used to implement a layer of n neurons, each having m inputs, was (m+1)×2n, as the neuron circuit in Figure 3.2(a) was used (the (m+1)th row is for the bias input). Figure 3.26 shows the training graphs obtained from the SPICE simulations for different datasets utilizing the training process described in section 3.3.

The results show that the neural networks were able to learn each classification application in both cases: without considering device variation & stochasticity and considering device variation & stochasticity.

3.6.2 SPICE Training Results for the Proposed Algorithm

The size of the memristor crossbar used to implement a layer of n neurons, each having m inputs, was 2(m+1)×n, as the neuron circuit in Figure 3.2(b) was used. Figure 3.27 shows the training graphs obtained from the SPICE simulations for different datasets utilizing the training process described in section 3.4. The results show that the neural networks were able to learn the desired classification applications in both cases: without considering device variation & stochasticity and considering device variation & stochasticity.


Figure 3.26. SPICE training results for the BP algorithm (2 input XOR, 3 input odd parity, 4 input odd parity, Iris, and Wine) for both cases: without considering memristor device variation & stochasticity (no device var.) and considering device variation & stochasticity (device var.).

Table 3.8 shows the recognition errors of system1 and system2 on the test data for the Iris and Wine datasets. Test errors with and without considering memristor device variation & stochasticity are close to the errors obtained through software implementations (Figure 3.18).


Figure 3.27. SPICE training results for the proposed algorithm (2 input XOR, 3 input odd parity, 4 input odd parity, Iris, and Wine) for both cases: without considering memristor device variation & stochasticity (no device var.) and considering device variation & stochasticity (device var.).

Table 3.8: Recognition error (%) on test data for different datasets and SPICE training approaches.

Dataset   System     no device var.   device var.
Iris      system1    3.92             2.61
Iris      system2    3.92             6.5
Wine      system1    1.34             1.93
Wine      system2    5.22             5.55


Table 3.9: Comparison between the two proposed training approaches.

                                    System1                                         System2
Neuron activation function          Utilizes two op-amps                            Utilizes two inverters
Complemented input                  Not needed                                      Needed
Training algorithm                  BP                                              Variant of BP; training rule utilizes sign of error
Evaluation time for new input       One cycle                                       One cycle
Neuron error                        Multiple bits                                   Single bit
ADCs and DACs                       Required                                        Not required
Multiplier                          Required, to multiply the back propagated       Not required
                                    error and the derivative of the activation
                                    function
Lookup table                        Used to determine the derivative of the         Not required
                                    activation function
Capacitor                           Not required                                    One capacitor for each neuron
Weight update on a layer of m
inputs, n neurons for a training
instance (data)                     O(m+n) operations                               O(c) operations, where c is a constant

In this chapter we have examined two systems for training memristor crossbar based multi-layer neural networks. System1 examines a closer implementation of the BP training algorithm utilizing two memristors per synapse. System2 enables faster weight update operations through a low cost hardware implementation of the proposed variant of the BP training algorithm. A comparison between the two proposed designs is given in Table 3.9.


3.7 Summary

This chapter described the design of on-chip training systems for memristor based multi-layer neural networks utilizing two memristors per synapse. We have examined training of the memristor based neural networks utilizing the BP algorithm and a proposed variant of the BP algorithm. Implementation of the proposed training algorithm does not require the set of ADCs and DACs which are needed for the closer implementation of the BP algorithm. For a training instance, the weight update operation of an entire memristor crossbar in system2 is done in four steps, which enables faster training. For both systems we have demonstrated successful training of some nonlinearly separable datasets through detailed SPICE simulations which take crossbar wire resistance and sneak-paths into consideration. The proposed training algorithm can train nonlinearly separable functions with a slight loss in accuracy compared to training with the traditional BP algorithm.


CHAPTER IV

ON-CHIP TRAINING OF DEEP NEURAL NETWORKS

4.1 Introduction

Recently deep neural networks (or deep networks) have gained significant attention because of their superior performance in classification and recognition applications. This chapter presents on-chip training circuits for memristor based deep neural networks utilizing unsupervised and supervised learning methods. Memristor crossbar circuits enable compact storage of synaptic weights, provide high throughput execution, and have low energy consumption. On-chip training circuits would allow the training algorithm to account for device variability and faults in these circuits. Furthermore, on-chip training enables deployment of the system for online learning.

To the best of our knowledge, no existing work has examined a memristor based deep neural network training system. The training of a deep network is different from the training of a multi-layer neural network. A deep network has many layers of neurons, and for effective training, neuron weights need to be pre-trained using an unsupervised learning method before supervised training [54]. We have utilized autoencoders for layer-wise pre-training of the deep networks and utilized the back-propagation algorithm for supervised fine tuning. The proposed design utilizes two memristors per synapse, which provides double the synaptic weight precision compared to a design using a single memristor per synapse.

In this chapter, we also propose a technique to reduce the impact of sneak-paths in a large memristor crossbar. We present a novel method to accurately simulate large crossbars at high speed. This work would enable the design of high throughput, energy efficient, and compact neuromorphic processing systems.

The rest of the chapter is organized as follows: section 4.2 presents a background on deep networks. Section 4.3 describes memristor based neural network design. Sections

4.4 and 4.5 describe the circuit implementation of the training algorithm and the deep network training approach respectively. Section 4.6 describes our large memristor crossbar simulation method and approach to minimize sneak-path impacts in large crossbars. Sections 4.7 and 4.8 demonstrate the experimental setup and results respectively. Finally section 4.9 concludes the chapter.

4.2 Deep Neural Networks

Deep neural networks have become highly popular recently for classification and recognition applications. Figure 4.1 shows a block diagram of a deep network. There are strong similarities between the architectures of multi-layer neural networks and deep networks. A deep neural network has multiple hidden layers and a large number of neurons per layer, which allows it to learn significantly more complex functions.

Figure 4.1. Block diagram of a deep network.

Neurons are the building blocks of a deep network. A neuron in a deep network performs two types of operations: (i) a dot product of the inputs x1,…,xn and the weights w1,…,wn, and (ii) the evaluation of an activation function. The dot product operation can be seen in Eq. (3.1). The activation function evaluation is shown in Eq. (3.2). In a deep network, a nonlinear differentiable activation function is desired (such as tan-1(x)).

Training of Deep Networks: The training of deep networks is different from the training of multi-layer neural networks. Unlike multi-layer neural networks, a deep network does not perform well if it is trained using only a supervised learning algorithm on the entire network. Since deep networks have many layers of neurons, these networks are typically trained in a two step process: an unsupervised layer-wise pre-training step followed by a supervised training step of the entire network [54,55].

There are several approaches for unsupervised layer-wise pre-training of a deep network. The two most popular approaches for pre-training are autoencoders and restricted Boltzmann machines. In this thesis we developed circuits to implement autoencoder based layer-wise unsupervised pre-training for memristor based deep networks. We utilized a back-propagation based training circuit for both the layer-wise pre-training and the supervised fine tuning of the whole network.

The autoencoder [55] is a popular method for dimensionality reduction and feature extraction that automatically learns features from unlabeled data. Hence the autoencoder can be used for unsupervised training (data labels are not needed). The architecture of an autoencoder is similar to a multi-layer neural network, as shown in

Figure 4.2. An autoencoder tries to learn a function hW,b(x) ≈ x. That is, it tries to learn an approximation to the identity function, such that the network's output x' is similar to the input x. By placing constraints on the network, such as by limiting the number of hidden units, we can discover useful patterns within the data. Gradient descent training is generally utilized for training an autoencoder.

Figure 4.2. Two layer network having four inputs, three hidden neurons and four output neurons.

In the first training step of deep networks, autoencoders are used sequentially to pre-train the layers, one layer at a time, from the first hidden layer to the layer before the output layer. When pre-training layer i using an autoencoder, a temporary layer (ti) of neurons is added in front of layer i. The number of neurons in the temporary layer (ti) is equal to the number of inputs to layer i. The number of inputs to layer ti is equal to the number of outputs of layer i. The output of layer i-1 is applied as the input to layer i (where the neuron weights of layers 1 to i-1 are already pre-trained). The training of each autoencoder is similar to a two layer neural network training. An autoencoder is trained such that the temporary layer (ti) outputs are the same as the inputs applied to layer i. In the second step, supervised training of the entire network is performed using the back-propagation training approach. The entire training process for a deep network is illustrated in Figure 4.3.


Figure 4.3. Training process of a deep network: (a) a deep network having n layers of neurons; (b) pre-training layer 1 using autoencoder1; (c) pre-training layer 2 using autoencoder2; (d) supervised training of the entire network using the pre-trained weights.
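The schedule of Figure 4.3 can be summarized in software as below. This is only a schematic of the training order (the crossbar circuits of the following sections implement the same steps); the logistic-style activation, helper names and toy data are ours.

```python
import numpy as np

def f(x):                                   # sigmoid-like activation (Figure 3.3)
    return 1.0 / (1.0 + np.exp(-x)) - 0.5

def bp_step(x, t, W1, W2, eta):
    """One stochastic BP update on a two-layer (encoder/decoder) network."""
    xb = np.append(x, 0.5)
    dp1 = xb @ W1
    h = np.append(f(dp1), 0.5)
    dp2 = h @ W2
    y = f(dp2)
    s1, s2 = f(dp1) + 0.5, y + 0.5          # logistic values for the derivative s*(1-s)
    d2 = t - y
    d1 = W2[:-1] @ d2
    W2 += 2 * eta * np.outer(h, d2 * s2 * (1 - s2))
    W1 += 2 * eta * np.outer(xb, d1 * s1 * (1 - s1))

def pretrain_layer(X, n_hidden, epochs=50, eta=0.05):
    """Autoencoder pre-training of one layer: learn to reproduce the input,
    keep the encoder weights, discard the temporary decoder layer t_i."""
    n_in = X.shape[1]
    W_enc = np.random.uniform(-0.1, 0.1, (n_in + 1, n_hidden))
    W_dec = np.random.uniform(-0.1, 0.1, (n_hidden + 1, n_in))
    for _ in range(epochs):
        for x in X:
            bp_step(x, x, W_enc, W_dec, eta)        # target = input
    return W_enc

def encode(X, W):
    Xb = np.hstack([X, 0.5 * np.ones((X.shape[0], 1))])
    return f(Xb @ W)

# Greedy layer-wise schedule of Figure 4.3 on stand-in unlabeled data:
X = np.random.uniform(-0.5, 0.5, (20, 4))
W1 = pretrain_layer(X, 3)          # pre-train layer 1
H1 = encode(X, W1)                 # its outputs feed the next layer
W2 = pretrain_layer(H1, 3)         # pre-train layer 2, and so on;
# supervised BP fine-tuning of the whole stack follows, using labeled data.
```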

4.3 Memristor Crossbar Based Multi-layer Neural Network Design

4.3.1 Neuron Circuit

In this chapter we have utilized memristors as neuron synaptic weights. The circuit in Figure 4.4(a) shows the memristor based neuron circuit design utilized in this chapter. A detailed description of the circuit can be found in chapter III.


Figure 4.4. Memristor-based neuron circuit. A, B, C are the inputs and yj is the output.

When the power rails of the op-amps, VDD and VSS, are set to 0.5 V and -0.5 V respectively, the neuron circuit implements the activation function h(x) as in Eq. (4.1), where $x = 4R_f[A(\sigma_{A+} - \sigma_{A-}) + \cdots + \beta(\sigma_{\beta+} - \sigma_{\beta-})]$.

$h(x) = \begin{cases} 0.5 & \text{if } x > 2 \\ x/4 & \text{if } |x| < 2 \\ -0.5 & \text{if } x < -2 \end{cases}$    (4.1)

Figure 3.3 shows that h(x) closely approximates the sigmoid activation function $f(x) = \frac{1}{1+e^{-x}} - 0.5$. The values of VDD and VSS are chosen such that no memristor gets a voltage greater than Vth across it during evaluation. Our experimental evaluations consider memristor crossbar wire resistance. The schematic of a memristor based neuron circuit considering wire resistance is shown in Figure 4.4(b).


4.3.2 Synaptic Weight Precision

Several recent studies have utilized a single memristor per synaptic weight, thus simplifying the neuron circuit, but at the same time reducing the precision of the synaptic weights that can be represented. The precision of memristor based synaptic weights depends on the number of memristors used for each synapse and the resistance (or conductance) range of the memristor device. Chapter III shows that the range of synaptic weights when using two memristors per synapse is two times that of a single memristor per synapse design.

4.3.3 Memristor Based Neural Network Implementation

The structures of a multi-layer neural network and an autoencoder are similar.

Both can be viewed as feed forward neural networks. Figure 4.2 shows a simple two layer feed forward neural network with four inputs, four outputs, and three hidden layer neurons. Figure 4.5 shows a memristor crossbar based circuit that can be used to evaluate the neural network in Figure 4.2. There are two memristor crossbars in this circuit, each representing a layer of neurons. Each crossbar utilizes the neuron circuit shown in Figure

4.4(a).

In Figure 4.5, the first layer of the neurons is implemented using a 5×6 memristor crossbar. The second layer of four neurons is implemented using a 4×8 memristor crossbar, where three of the inputs are coming from the three output neurons of the first crossbar. One additional input is used as the bias. A key benefit of this circuit is that by applying all the inputs to a crossbar, an entire layer of neurons can be processed in parallel within one cycle.


Figure 4.5. Schematic of the neural network shown in Figure 4.2.

4.4 Back-propagation Training Circuit

4.4.1 The Training Algorithm

Both the layer-wise unsupervised pre-training (using autoencoders) and the supervised training utilized in a deep network are built on the back-propagation (BP) algorithm [50]. In this chapter we utilized the stochastic BP algorithm, where the network weights are updated after each input is applied. For autoencoders, the inputs to the autoencoder are used as the targets for the final layer of the autoencoder. The training algorithm utilized in this chapter is described below:

1) Initialize the memristors with high random resistances.

2) For each input pattern x:

   i) Apply the input pattern x to the crossbar circuit and evaluate the DPj values and outputs (yj) of all neurons (hidden neurons and output neurons).

   ii) For each output layer neuron j, calculate the error, δj, based on the neuron output (yj) and the target output (tj). Here f is the neuron activation function.

      $\delta_j = (t_j - y_j)\,f'(DP_j)$    (4.2)

   iii) Back propagate the errors for each hidden layer neuron j:

      $\delta_j = \left(\sum_k \delta_k w_{k,j}\right) \times f'(DP_j)$    (4.3)

      where neuron k is connected to the previous layer neuron j.

   iv) Determine the amount, Δw, that each neuron's synapses should be changed (2η is the learning rate):

      $\Delta w_j = 2\eta \times \delta_j \times x$    (4.4)

3) If the error in the output layer has not converged to a sufficiently small value, go to step 2.

4.4.2 Circuit Implementation of the Back-propagation Training Algorithm

Without loss of generality we will describe the circuit implementation of the back-propagation training algorithm for the neural network shown in Figure 4.2. The implementation of the training circuit can be broken down into the following major steps:

1. Apply inputs to layer 1 and record the layer 2 neuron output errors.

2. Back-propagate layer 2 errors through the second layer weights and record the

layer 1 errors.

3. Update the synaptic weights.

The circuit implementations of these steps are detailed below:

Step 1: A set of inputs is applied to the layer 1 neurons, and both the layer 1 and layer 2 neurons are processed. In Eqs. (4.2) and (4.3) we need to evaluate the derivative of the activation function at the dot product of the neuron inputs and weights (DPj). The DPj value of neuron j is essentially the difference of the currents through the two columns implementing the neuron (this can be approximated based on yj in Figure 4.4(a)). The DPj value of each neuron j is discretized and f′(DPj) is evaluated using a lookup table. The f′(DPj) value of each neuron is stored in a buffer. The layer 2 neuron errors are evaluated based on the neuron outputs (yj), the corresponding targets (tj) and f′(DPj). First (tj−yj) is evaluated and discretized using an ADC (analog to digital converter). Then (tj−yj) and f′(DPj) are multiplied using a digital multiplier and the evaluated δj value is stored in a register.

Figure 4.6. Schematic of the neural network shown in Figure 4.2 for the forward pass.

Step 2: The layer 2 errors (δL2,1,.., δL2,4) are applied to the layer 2 weights after conversion from digital to analog form, as shown in Figure 4.7, to generate the layer 1 errors (δL1,1 to δL1,3). The memristor crossbar in Figure 4.7 is the same as the layer 2 crossbar in Figure 4.5. Assume that the synaptic weight associated with input i and neuron j (a second layer neuron) is wij = σ+ij − σ-ij for i=1,2,3 and j=1,2,..,4. In the backward phase we want to evaluate the layer 1 errors

δL1,i = (Σj wij δL2,j) f'(DPL1,i)    for i=1,2,3 and j=1,2,..,4
      = (Σj (σ+ij − σ-ij) δL2,j) f'(DPL1,i)
      = (Σj σ+ij δL2,j − Σj σ-ij δL2,j) f'(DPL1,i)    (4.5)

The circuit in Figure 4.7 is essentially evaluating the same operations as Eq. (4.5), applying both δL2,j and -δL2,j to the crossbar columns for j=1,2,..,4. The back propagated

(layer 1) errors are stored in buffers for updating crossbar weights in step 3. To reduce the training circuit overhead, we can multiplex the back propagated error generation circuit as shown in Figure 4.8. In this circuit, by enabling the appropriate pass transistors, back propagated errors are sequentially generated and stored in buffers. Access to the pass transistors will be controlled by a shift register. The same multiplexing approach could also be used for the layer 2 error generation. In this approach the time complexity of the back propagation step will be O(m), where m is the number of inputs in a layer of neurons.
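As a point of reference, the differential evaluation in Eq. (4.5) can be sketched in a few lines of NumPy; G_plus and G_minus below stand in for the paired conductance columns of the layer 2 neurons and are illustrative names only.

    import numpy as np

    # Backward pass of Eq. (4.5): applying +delta and -delta to the paired
    # columns of the layer 2 crossbar yields the weighted error sum for each
    # layer 1 neuron, which is then scaled by f'(DP) of that neuron.
    def backprop_errors(G_plus, G_minus, delta_L2, fprime_DP_L1):
        # G_plus, G_minus: (inputs x neurons) conductance matrices of layer 2.
        weighted = G_plus @ delta_L2 - G_minus @ delta_L2
        return weighted * fprime_DP_L1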

Figure 4.7. Schematic of the neural network shown in Figure 4.2 for back propagating errors to layer 1.

Figure 4.8. Implementation of the back propagation phase with a multiplexed error generation circuit.

Step 3: The weight update procedure in the memristor crossbar is similar to the approach described in chapter III. The amplitude of the training signal is modulated by the neuron input (xi) and the duration of the training pulse is modulated by η×δj. The combined effect of the two voltages applied across the memristor will update the conductance by an amount proportional to η × δj × xi.
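A behavioural model of this amplitude/duration modulation is sketched below; k_device, t_unit and v_unit are hypothetical scaling constants used only for illustration, not values taken from the device characterization.

    # Illustrative model of the step 3 update: the pulse width is set by
    # eta*|delta_j| and the pulse amplitude by x_i, so the conductance change
    # of the memristor at (i, j) is roughly proportional to eta * delta_j * x_i.
    def conductance_update(g, x_i, delta_j, eta, k_device=1e-6, t_unit=1e-6, v_unit=1.0):
        pulse_width = abs(eta * delta_j) * t_unit   # duration ~ eta * |delta_j|
        amplitude = x_i * v_unit                    # amplitude ~ x_i (signed)
        sign = 1.0 if delta_j >= 0 else -1.0        # polarity of the column pulse
        return g + sign * k_device * amplitude * pulse_width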

4.5 Deep Network Training Circuit

4.5.1 Overall System Architecture

Figure 4.9 shows the overall architecture of the proposed deep network training system. Each layer in the network is implemented using a two-crossbar module as shown in the figure. The inputs to each module (layer i) come from the previous layer module (layer i-1), while the outputs go to the following layer module (layer i+1). This system is able to implement both unsupervised layer-wise pre-training and supervised training of the entire network. Both of these training steps utilize the back propagation training circuits outlined in the previous section.

Figure 4.9. Schematic of a memristor crossbar deep network system.

Figure 4.10. Training process in a memristor based deep network system: (a) pre-training layer 1, (b) pre-training layer 2, (c) supervised training of the whole network.

4.5.2 Unsupervised Layer-wise Pre-training

Recall that in this chapter, autoencoders are used sequentially to pre-train the layers of a deep network, one layer at a time, from the first hidden layer to the layer before the output layer. To pre-train layer i of a deep network using an autoencoder, a temporary layer (ti) of neurons needs to be added in front of the layer i. Both of these layers of neurons are built into the layer i module using a pair of memristor crossbars as shown in Figure 4.9. A back-propagation training circuit is used for training the two layer network within the module. As we are training an autoencoder, the errors calculated for the layer ti are the difference of the applied inputs to the layer i and the neuron outputs of layer ti.


Given that both the inputs to layer i and the outputs of layer ti (see Figure 4.9) are used to calculate the autoencoder errors, having the outputs of layer ti physically close to the inputs to layer i would reduce wiring overhead. Using the concept that multiple memristor crossbars could be stacked on top of one another [56], if crossbar ti is flipped at the dashed line in the middle of the module, it will fall on top of the crossbar i. As a result, the layer i inputs and the layer ti neuron outputs will be physically closer and the error calculation circuit would require less wiring. For the supervised training of a deep network, we do not necessarily need to have the layer i inputs and the layer i+1 output neurons closer, as there is no calculation that requires both of these terms at the same time.

After layer i has been pre-trained using the autoencoder, the system will move on to layer i+1 for autoencoder based pre-training. At this point layer i and earlier layer modules would deactivate their temporary neural networks (layer tj for j=1,2,..,i), and simply pass the output of the layer j crossbar to the inputs of the layer j+1 for j=1,2,..,i.

This process is illustrated in Figure 4.10 (a) and (b).

4.5.3 Supervised Full Network Training

After the pre-training of the layers is done, supervised fine tuning of the pre-trained weights is required. This step does not need the temporary layers (ti layers) of neurons used during layer-wise pre-training. In this step, each module uses only the first crossbar, and passes the outputs of layer i to layer i+1 as input (see Figure 4.10(c)).

In the supervised training step, output layer neuron errors are evaluated based on the circuit generated outputs and the corresponding label of the applied input. The back-propagation algorithm based training procedure described in the previous section is also utilized in this training step. To facilitate both pre-training and supervised fine tuning, the layer i neuron outputs are connected as input to both layer ti and layer i+1 (see Figure 4.9).
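The overall training schedule of Figures 4.9 and 4.10 can be summarized by the short sketch below. It assumes that a train_autoencoder() and a train_supervised() routine (for example, built on the BP sketch given earlier) are available; both names are placeholders rather than parts of the actual circuit.

    # High-level sketch of the deep network training schedule: layer-wise
    # autoencoder pre-training followed by supervised fine tuning.
    def train_deep_network(layers, X, T, train_autoencoder, train_supervised):
        activations = X
        for layer in layers[:-1]:                     # pre-train the hidden layers
            # A temporary layer t_i reconstructs the layer inputs (autoencoder).
            train_autoencoder(layer, activations)
            activations = layer.forward(activations)  # feed the next module
        train_supervised(layers, X, T)                # supervised fine tuning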

4.6 Large Memristor Crossbars

In deep networks, the layers can be very wide, having a large number of inputs and outputs. This means that large memristor crossbars would be needed in the layer modules in Figure 4.9. In large memristor crossbars sneak-path currents make training challenging. Thus a larger layer of neurons may need to be split into multiple smaller layers to be mapped to crossbars of manageable sizes. This is beyond the scope of this study and needs to be investigated as future work.

A large memristor crossbar has more sneak-path currents than a smaller one.

SPICE level accurate simulations of the crossbars are needed to correctly model these sneak-paths. But the SPICE simulations of large crossbars are very time consuming – requiring about a day per iteration. This section tackles two issues. We first look at a faster approach to accurately simulate large crossbars so that sneak-path currents are modeled. We also look at how to minimize the impact of sneak-paths in the large crossbars (using the new simulation approach for this study).

4.6.1 Large Crossbar Simulations

The SPICE simulation of large memristor crossbars is very time consuming (about a day per iteration). We have designed a MATLAB framework for large memristor crossbar simulation which is very fast compared to SPICE (less than a minute per iteration) and is as accurate as the SPICE simulations. Our approach takes crossbar wire resistances and driver internal resistances into account.


Consider the M×N crossbar in Figure 4.11. There are MN memristors in this crossbar circuit, 2MN wire segments, and M input drivers. For any given set of crossbar input voltages, we need to determine the 2MN terminal (node) voltages across the memristors. The Jacobi method of solving systems of linear equations was used to determine these unknown voltages. All the nodes on the crossbar rows were initialized to the applied input voltages, while all the nodes on the crossbar columns were initialized to zero volts. We then repeatedly calculated currents through the crossbar memristors and updated the node voltages until convergence. The node voltages are updated based on the voltage drop across the crossbar wire resistances considering the currents through the memristors.
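A minimal Python/NumPy sketch of this iterative node-voltage solver is given below. It assumes one driver (with internal resistance Rdrv) at the left of each row and a virtual ground below each column, and relaxes Kirchhoff's current law at every node in Jacobi fashion; the boundary conditions and parameter values are assumptions for illustration, not the exact MATLAB framework.

    import numpy as np

    def solve_crossbar(Vin, G, Rw=1.5, Rdrv=1.0, iters=20000, tol=1e-9):
        # Vin : length-M driver voltages applied at the left of each row.
        # G   : MxN memristor conductance matrix (1/ohm).
        # Rw  : wire segment resistance; Rdrv: assumed driver internal resistance.
        # Returns the MxN matrix of voltages across the memristors.
        M, N = G.shape
        gw, gd = 1.0 / Rw, 1.0 / Rdrv
        Vin = np.asarray(Vin, dtype=float)
        Vr = np.tile(Vin[:, None], (1, N))      # row-side nodes
        Vc = np.zeros((M, N))                   # column-side nodes
        for _ in range(iters):
            # Row nodes: left/right wire segments (driver at the left edge,
            # open at the right edge) plus the memristor down to the column node.
            left = np.hstack([Vin[:, None], Vr[:, :-1]])
            right = np.hstack([Vr[:, 1:], Vr[:, -1:]])
            g_left = np.full((M, N), gw);  g_left[:, 0] = gd
            g_right = np.full((M, N), gw); g_right[:, -1] = 0.0
            Vr_new = (g_left * left + g_right * right + G * Vc) / (g_left + g_right + G)
            # Column nodes: wire segments above/below (open at the top, virtual
            # ground below the last row) plus the memristor up to the row node.
            up = np.vstack([Vc[:1, :], Vc[:-1, :]])
            down = np.vstack([Vc[1:, :], np.zeros((1, N))])
            g_up = np.full((M, N), gw); g_up[0, :] = 0.0
            g_down = np.full((M, N), gw)
            Vc_new = (g_up * up + g_down * down + G * Vr) / (g_up + g_down + G)
            err = max(np.abs(Vr_new - Vr).max(), np.abs(Vc_new - Vc).max())
            Vr, Vc = Vr_new, Vc_new
            if err < tol:
                break
        return Vr - Vc

For example, solve_crossbar(np.ones(100), np.full((100, 100), 5.5e-7)) mimics, under these assumptions, the 100×100 experiment reported below.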

We have compared the crossbar simulation results obtained from SPICE and the MATLAB framework for crossbars of sizes 100×100, 200×200, and 300×300. In these three crossbars 1 V is applied as input to the leftmost point of each row, while the bottoms of the crossbar columns are grounded. The conductance of each memristor in the crossbars was set to 5.5×10⁻⁷ Ω⁻¹. Figure 4.12 shows the voltage drop across the memristors in the crossbar from both simulation approaches: SPICE and the MATLAB framework. In these graphs, the voltage across the memristors is plotted from the 1st row to the last row, and for each row from left to right. The corresponding results from both simulations match each other. These results also show that sneak-paths were properly evaluated in the MATLAB simulations.


Figure 4.11. Schematic of an M×N crossbar implementing a layer of N/2 neurons.

Figure 4.12. Potential across the memristors in crossbars of different sizes (100×100, 200×200, and 300×300) obtained through SPICE simulations and the proposed MATLAB framework based simulations.

4.6.2 Minimizing Sneak-path Impact

Large deep networks require large memristor crossbars. The training in a large crossbar is challenging due to driver internal resistances, crossbar wire resistances, and sneak-paths. This is illustrated using a large crossbar example in Figure 4.11. Assume that in Figure 4.11, M=784, N=400, and the inputs are Vi=1 V for i=1,2,..,784. The bottoms of the columns are virtually grounded as they are connected to the op-amp circuits in Figure 4.4(a). Further assume that the conductance of each memristor in the crossbar is 5.5×10⁻⁷ Ω⁻¹. Figure 4.14 shows the voltage drop across the memristors in the crossbar for the applied input voltages. In this graph, the voltage across the memristors is plotted from the 1st row to the 784-th row, and for each row from left to right. In an ideal circuit, the voltage across each memristor would be 1 V. Due to the sneak-paths in the crossbar, the memristors see an actual potential drop of less than 1 V. In fact, the minimum potential across a memristor in the crossbar is 0.73 V. Thus sneak-paths in this memristor crossbar make training challenging.

We are proposing a design technique to minimize the impact of sneak-paths in a large memristor crossbar. Figure 4.13 shows the schematic of the proposed design technique. Instead of applying the inputs at the edge of the crossbar rows, inputs are

applied in the middle of the crossbar rows. Additionally, instead of connecting the neuron circuits at the edge of the crossbar columns, the neuron circuits are connected in the middle.

Figure 4.15 shows the voltage drop across the memristors in the crossbar, assuming the same crossbar size, same input voltages, and same memristor conductances as in the previous paragraph. In this figure, the voltage across the memristors is plotted from the 1st row to the 784-th row, and for each row from left to right. The minimum potential across any memristor in this case is 0.9 V. This implies that the design approach in Figure 4.13 reduces the impact of sneak-paths significantly. We utilized this approach in the rest of this study.

Figure 4.13. Proposed design for implementing a layer of memristor based neurons.


Figure 4.14. Voltage drop across the memristors in a 784×400 crossbar when the design is based on Figure 4.11: (a) for all the memristors in the crossbar, (b) for the 1st and the 784-th rows of the crossbar.

Figure 4.15. Voltage drop across the memristors in a 784×400 crossbar when the design is based on Figure 4.13: (a) for all the memristors in the crossbar, (b) for the 1st, 392-th, 393-th, and 784-th rows of the crossbar.

4.7 Experimental Setup

Small memristor crossbar circuits (sizes up to 50×50) were simulated in SPICE while larger crossbars were simulated in MATLAB. Both simulations consider the crossbar sneak-paths and wire resistances within the crossbar. A wire resistance of 1.5 Ω between memristors in the crossbar is considered in these simulations [57]. Each attribute of the input was mapped within a voltage range of [-Vread, Vread]. Two sets of crossbar simulations were carried out: one considering memristor device variation and stochasticity, and the other not considering these. We assumed a maximum deviation of 30% in memristor device responses due to device variations and stochasticity. That is, when we want to update a memristor conductance by an amount x, the corresponding training pulse will update the conductance by a value drawn randomly from the interval [0.7x, 1.3x]. Table 3.7 shows the simulation parameters. The resistance of the memristors in the crossbars was randomly initialized between 0.909 MΩ and 10 MΩ.
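In code form, the stochastic update model used in these simulations can be expressed as the short sketch below; the conductance bounds g_min and g_max follow from the device range quoted in the next paragraph and are otherwise assumptions.

    import numpy as np

    rng = np.random.default_rng(1)

    # Stochastic device model: a requested conductance update of x actually
    # lands anywhere in [0.7x, 1.3x] (up to 30% deviation), clipped to the
    # device conductance range.
    def apply_update(g, x, g_min=1e-7, g_max=1e-4, max_dev=0.3):
        actual = x * rng.uniform(1.0 - max_dev, 1.0 + max_dev)
        return float(np.clip(g + actual, g_min, g_max))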

An accurate memristor model [53] of a HfOx and AlOx memristor device [9] was used for simulations. The switching characteristics for the model are displayed in Figure 3.25. This device was chosen for its high minimum resistance value and large resistance ratio. According to the data presented in [9] this device has a minimum resistance of 10 kΩ and a resistance ratio of 10³. The full resistance range of the device can be switched in 20 μs by applying a 2.5 V pulse across the device.

For small crossbars, MATLAB (R2014a) and SPICE (LTspice IV) were used to develop a simulation framework for applying the training algorithm to a memristor based neural network circuit. SPICE was mainly used for detailed analog simulation of the smaller memristor crossbar arrays and MATLAB was used to simulate the rest of the system. Larger crossbars were simulated in MATLAB. The circuits were trained by applying input patterns one by one until the errors were below the desired levels.

4.8 Results

Two training methods were utilized in a deep network training system: an unsupervised autoencoder training and a supervised training. We evaluated each of these methods individually and then in combination to evaluate the overall deep network training. The individual training methods (autoencoder and supervised training) were tested with smaller networks simulated in SPICE. The individual systems for autoencoder and supervised training were trained with the following datasets: Wisconsin [58], Wine [59] and Iris [60]. For deep networks, the MNIST dataset and the KDD intrusion detection dataset were used. The MNIST and KDD datasets are too big to allow easy plotting of the autoencoder results. Table 4.1 lists the network configurations examined. Due to the long SPICE simulations needed for the deep networks, we used the accurate MATLAB simulation approach for them.

Table 4.1: Neural network configurations.

  Training                Dataset      Configuration
  Autoencoder Training    Wisconsin    9→10→2→9
                          Wine         13→10→2→13
                          Iris         4→15→2→4
  Supervised Training     Wisconsin    9→10→1
                          Wine         13→10→3
                          Iris         4→15→3
  Deep Network Training   MNIST        784→200→100→10
                          KDD          41→100→50→20→1

4.8.1 Autoencoder Training

Autoencoders were trained for dimensionality reduction and feature extraction for the Wisconsin, Iris, and Wine datasets considering the network configurations listed in Table 4.1. In these networks, the number of output layer neurons was the same as the number of inputs to the networks. The networks were trained such that, for a training sample, the same values appear at the output layer neurons. After training, the hidden layer neurons learn an abstract model of the data. Figures 4.16-4.18 show the distribution of the data in the reduced dimension space (output of the second layer) for both software and memristor based implementations of the training. We can observe that data belonging to the same class appear closely in the feature space and could potentially be linearly separated. Results obtained from the memristor based designs are as effective as the software implementation results (MATLAB). When memristor device variation and stochasticity are considered (indicated as “device var.” in the figures), the memristor based autoencoders still work effectively. All the memristor crossbar simulations in this subsection were carried out in SPICE.

Figure 4.16. Feature extraction results for the Wisconsin dataset: before training, MATLAB training, SPICE training (no device var.), and SPICE training (device var.).


Figure 4.17. Feature extraction results for the Iris dataset: before training, MATLAB training, SPICE training (no device var.), and SPICE training (device var.).

Figure 4.18. Feature extraction results for the Wine dataset: before training, MATLAB training, SPICE training (no device var.), and SPICE training (device var.).


4.8.2 Supervised Training

We have examined supervised training of memristor based two layer neural networks for the Wisconsin, Iris, and Wine datasets. Figure 4.19 shows the training graphs obtained from the SPICE simulations utilizing the training process described in section 4.4, as well as results from the software implementations for different datasets.

The results show that the neural networks were able to learn the desired classification applications. Table 4.2 shows the recognition errors on test data for different datasets.

Test errors considering variation in the memristor devices and without considering device variation are close to the errors obtained through software implementations.

Figure 4.19. Software training results and SPICE training results of the memristor based neural networks for the Wisconsin, Iris, and Wine datasets, both without considering variation in the memristor devices (no device var.) and considering variation in the devices (device var.).


Table 4.2: Recognition error on test data for different datasets.

                             Recognition error (%)
  Dataset                    no device var.   device var.   s/w impl.
  Iris (51 test data)        3.92             2.61          1.66
  Wisconsin (60 test data)   6.66             6.66          3.33
  Wine (60 test data)        1.34             1.93          1.11

4.8.3 Deep Network Training

Memristor based deep networks were examined for the MNIST dataset [61] and the KDD intrusion detection dataset [62]. The networks were trained in a two step process: autoencoders were used for layer-wise pre-training, followed by supervised training of the whole network. The impact of device variation and stochasticity in the memristor crossbars was examined. Figure 4.20 shows the supervised training graphs obtained from the MATLAB framework based circuit simulations as well as from the software implementations for both of the datasets. The results show that the deep networks were able to learn the desired classification applications.


Figure 4.20. Training results of the memristor based deep neural networks for the KDD and MNIST datasets, both without considering variation in the memristor devices (no device var.) and considering variation in the devices (device var.).

Table 4.3 shows the recognition errors on test data for both datasets. Test errors considering variation in the memristor devices and without considering device variation are close to the errors obtained through software implementations. Our test errors for the MNIST dataset are worse than the published best recognition error because we did not train the network on the entire training dataset. The MNIST dataset has 60,000 training samples and simulation of the training with this entire dataset takes a very long time. Our experiments used 10,000 training samples and 5,000 test samples for this application.


Table 4.3: Recognition error on test data for MNIST and KDD datasets.

                             Recognition error (%)
  Dataset                    no device var.   device var.   s/w impl.
  KDD (5000 test data)       3.36             3.2           3.32
  MNIST (5000 test data)     8.06             8.04          6.84

4.8.4 Discussion

On-chip training of memristor crossbar based deep neural network systems can provide high throughput training at lower energy consumption than an ex-situ training system. Large deep networks would require large memristor crossbars. In a large crossbar, driver internal resistances, crossbar wire resistances, and sneak-paths make training challenging. Our future work will examine memristor crossbar scaling limits and approaches to train large deep networks utilizing multiple smaller crossbars.

In this chapter the maximum crossbar size utilized was 784×400. When reading/writing the memristors in this crossbar, the maximum degradation of the read/write potential for a memristor was 10% due to the above mentioned factors. That is, when a read/write pulse is applied for this memristor, only 90% of the applied pulse amplitude will appear across it. Our experimental results show that the proposed on-chip training approach was able to train the deep network for the MNIST and KDD datasets considering the degradation of the memristor read/write potentials.

Memristor based circuits are prone to device variations and faults. On-chip training systems are more tolerant to device variations and faults compared to an ex-situ training based system. Our experimental results show that the systems were able to learn the desired functionalities even in the presence of up to 30% variations in the device responses.

4.9 Summary

In this chapter we have designed an on-chip training system for memristor based deep neural networks utilizing two memristors per synapse. We have utilized autoencoders for layer-wise pre-training of the networks and the back-propagation algorithm for supervised fine tuning. Techniques to reduce the impact of sneak-paths in a large memristor crossbar and for high speed simulation of large crossbars were proposed. We performed a detailed evaluation of the training circuits, taking crossbar wire resistance and sneak-paths into consideration, on several nonlinearly separable datasets. We also demonstrated successful training of memristor based deep networks for the MNIST digit classification and KDD intrusion detection datasets. This work would enable the design of high throughput, energy efficient, and compact deep learning systems.


CHAPTER V

EX-SITU TRAINING OF LARGE MEMRISTOR CROSSBARS

5.1 Introduction

Memristor crossbar arrays carry out multiply-add operations in parallel in the analog domain, and so can enable neuromorphic systems with high throughput at low energy and area consumption. Neural networks need to be trained prior to use. Ex-situ training is one of the approaches for training a neural network where the weights trained by a software implementation are programmed into the system.

Programming a memristor crossbar is challenging because there is a significant amount of variation between devices [9]. This means that identical voltage pulses may not induce identical amounts of resistance change in different memristors within a crossbar. As a result, an iterative feedback process is needed, where at each iteration step the new device state is measured and a new adjustment pulse is applied.

Two possible types of crossbars that can be used for ex-situ training are 0T1M (0 transistor, 1 memristor) and 1T1M (1 transistor, 1 memristor). In a 1T1M crossbar, there is an isolation transistor at each cross-point, thus allowing any individual memristor to be accessed and read. In a 0T1M crossbar, the lack of isolation devices means that a read from the crossbar would be based on multiple memristors due to sneak-paths. Sneak current paths in such a crossbar make the reading of an individual memristor from the crossbar challenging. This makes ex-situ training of 0T1M crossbar based neural networks challenging.

Existing ex-situ training approaches do not consider sneak-path currents and may work only for small crossbar based neural networks. Ex-situ training in large crossbars without considering sneak-paths reduces the application recognition accuracy significantly due to the large number of sneak-paths. Large memristor crossbars are affected more by sneak-path currents than smaller crossbars. It is important to have a fast simulation tool to study such large memristor crossbars thoroughly. SPICE simulation of large crossbars takes a very long time (about a day). We have developed a framework for accurate and fast simulation of large memristor crossbars.

This thesis proposes ex-situ training approaches for both 0T1M and 1T1M crossbars considering crossbar sneak-paths. These approaches would be very effective for deep networks, which require large memristor crossbars. The proposed ex-situ training approach is able to tolerate the stochasticity in memristor devices. The results show that the 0T1M systems trained were about 17% to 83% smaller in area than the 1T1M systems, including the area for the training circuits. We have also examined the impact of device variation on the proposed ex-situ training approach.

The rest of the chapter is organized as follows: section 5.2 describes the memristor based neuron circuit and neural network designs. Section 5.3 demonstrates the challenges in programming large memristor crossbar based neural networks. The proposed ex-situ training process is elaborated in section 5.4. Section 5.5 presents the experimental results, and finally section 5.6 summarizes the work.


5.2 Memristor Crossbar Based Neural Network

Figure 4.2 shows a block diagram of a two layer feed-forward neural network.

Each neuron in a feed forward neural network performs two types of operations: a dot product of its inputs and weights (see Eq. 3.1), and the evaluation of a nonlinear function (see Eq. 3.2). Deep neural networks [55] are variants of multi-layer perceptrons having a large number of layers and a large number of neurons per layer. Networks with a large number of layers and a high neuron count per layer enable construction of complex nonlinear separators that allow the classification of complex datasets. Figure 4.5 shows a memristor crossbar based circuit that can be used to evaluate the neural network in Figure 4.2.

5.3 Large Crossbar Ex-situ Training Challenges

5.3.1 Traditional Ex-situ Training Approaches

The objective of the ex-situ training process is to set each memristor in the crossbar to a specific resistance. Because the memristor devices have stochastic behavior, multiple pulses may be required to set the memristors to a target state while reading the previous state between these write pulses. This is essentially a feedback write process which requires being able to read the resistance of each individual memristor.

Writing to a single memristor in a crossbar is relatively straightforward compared to reading. To write, we need to make sure that only the desired memristor gets the full write pulse, Vw (greater than Vwt) [10]. Figure 5.1 shows the write operation used to change a single memristor while keeping the conductance of the remaining memristors in the crossbar unchanged.
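A sketch of the half-select biasing shown in Figure 5.1 is given below: the target row and column receive ±Vw/2 so that only the target cell sees the full Vw, while half-selected cells stay below the write threshold Vwt. The function is an idealized illustration, not the driver circuit itself.

    import numpy as np

    # Half-select write scheme: only the cell at (row, col) sees the full
    # write voltage Vw; every other cell sees at most Vw/2 (< Vwt).
    def write_bias(M, N, row, col, Vw, increase=True):
        s = 1.0 if increase else -1.0            # polarity sets the update direction
        v_rows = np.zeros(M); v_cols = np.zeros(N)
        v_rows[row] = s * Vw / 2.0
        v_cols[col] = -s * Vw / 2.0
        # In this ideal view, the voltage across cell (i, j) is v_rows[i] - v_cols[j].
        return v_rows, v_cols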


Figure 5.1. Demonstration of the write operation in a crossbar: (a) increasing memristor conductance and (b) decreasing memristor conductance.

In the ex-situ training process, reading individual memristor resistance levels in a crossbar is challenging due to sneak-paths. Placing a transistor at each cross-point (Figure 5.2) will ensure that only the resistance of the target memristor is impacting the column current during the read process. This is essentially a 1T1M crossbar [63].

Figure 5.2. Accessing a memristor in a 1T1M crossbar.


Figure 5.3. Reading a particular memristor from a 0T1M crossbar. In an ideal case, the circuit on the right is functionally equivalent to the circuit on the left.

Ex-situ training in 0T1M crossbars has a significant area benefit over 1T1M crossbars due to the elimination of the cross-point transistors. Figure 5.3 shows the circuit used to read a particular memristor from a 0T1M crossbar. An op-amp is utilized at the bottom of the corresponding crossbar column and a read voltage, –VR, is applied to the corresponding row. The remaining rows and columns are set to 0 V. In an ideal circuit, the op-amp output can be expressed as Vo = VR·Rf/R1, which is a function of only R1 as Rf and VR are constant. But in a real circuit the voltage Vo is essentially a function of all the memristors in the crossbar due to sneak-path currents.

5.3.2 Large Crossbar Simulations

Large deep networks require large memristor crossbars. The SPICE simulation of large memristor crossbars is very time consuming (about a day per iteration). We have designed a MATLAB framework for large memristor crossbar simulation which is very fast compared to SPICE (less than a minute per iteration) and is as accurate as the SPICE simulations. Our approach takes crossbar wire resistances and driver internal resistances into account.


Consider the M×N crossbar in Figure 5.4. There are MN memristors in this crossbar circuit, 2MN wire segments, and M input drivers. For any given set of crossbar input voltages, we need to determine the 2MN terminal (node) voltages across the memristors. The Jacobi method of solving systems of linear equations was used to determine these unknown voltages. All the nodes on the crossbar rows were initialized to the applied input voltages, while all the nodes on the crossbar columns were initialized to zero volts. We then repeatedly calculated currents through the crossbar memristors and updated the node voltages until convergence. The node voltages are updated based on the voltage drop across the crossbar wire resistances considering the currents through the memristors.

Figure 5.4. Schematic of an M×N crossbar implementing a layer of N/2 neurons.

5.3.3 Impact of Sneak-paths in a Large Crossbar

A layer of neurons having a large number of inputs and outputs requires a large memristor crossbar. The training in a large crossbar is challenging due to driver internal resistances, crossbar wire resistances, and sneak-paths. This is illustrated using a large crossbar example in Figure 5.4. Assume that in Figure 5.4, M=617, N=400, and the inputs are Vi=1 V for i=1,2,..,617. The bottoms of the columns are virtually grounded as they are connected to the op-amp circuits. Further assume that the conductance of each memristor in the crossbar is 5.5×10⁻⁷ Ω⁻¹. Figure 5.5 shows the voltage drop across the memristors in the crossbar for the applied input voltages. In this graph, the voltage across the memristors is plotted from the 1st row to the 617-th row, and for each row from left to right. In an ideal circuit, the voltage across each memristor would be 1 V. Due to the sneak-paths in the crossbar, the memristors see an actual potential drop of less than 1 V. In fact, the minimum potential across a memristor in the crossbar is 0.55 V.

Figure 5.5. Voltage drop across the memristors in a 617×400 crossbar: (a) for all the memristors in the crossbar, (b) for the 1st and the 617-th rows of the crossbar.

5.3.4 Impact of Sneak-paths on Ex-situ Training

Existing ex-situ training approaches train a neural network in software without considering crossbar sneak-paths. Then they program the trained conductances into the crossbar. We have examined ex-situ training for the ISOLET dataset [64] using the existing approach (which does not consider crossbar sneak-path currents).

The ISOLET dataset has 6238 training samples and 1559 test samples belonging to 26 classes. Each data sample has 617 attributes. We considered a neural network configuration of 617→200→26 (617 inputs, 200 hidden neurons and 26 output neurons). We considered a memristor crossbar wire segment resistance of 5 Ω. The recognition error on the test dataset in software was 9.49%. After the trained weights were programmed into 1T1M crossbars, the recognition error increased to 13.08%. This is due to the presence of sneak-paths in the recognition phase.

5.4 Proposed Ex-situ Training Process

We have examined ex-situ training on both 0T1M and 1T1M crossbars considering crossbar sneak-paths. The neural networks trained in software essentially mimic memristor crossbars with wire resistances. After training, we have the trained memristor conductances which need to be programmed into the crossbars. We quantized the conductances into 8-bit representations before programming the memristors.

Reading a particular memristor from a 1T1M crossbar is straightforward as it eliminates sneak-paths. As a result, once the software training is done considering sneak-paths, programming memristor conductances into a 1T1M crossbar is straightforward too. During evaluation/recognition of a new input, all the transistors in a 1T1M crossbar will be turned on. Thus the recognition phase has sneak current paths. This is not a problem, however, as the memristor conductances were determined in software considering the crossbar sneak-paths.

Reading a single memristor from a 0T1M crossbar is complex due to sneak-paths, as stated in the previous section. In Figure 5.3, Vo is essentially a function of all the memristors in the crossbar due to sneak-paths. If the other memristor conductances are known, we can determine the expected value of Vo for a target value of R1 from the software simulation of the circuit.


For programming memristors in a 0T1M crossbar we propose to initialize all the memristors in the crossbar to the lowest conductance value of the device. The memristors in the crossbar will be programmed one by one. Thus when programming a particular memristor in the crossbar, we know the conductance of the rest of the memristors in the crossbar. During programming, the expected read voltage for each memristor will be evaluated in software, based on the initialized conductances (lowest conductance) and the already programmed conductances. The evaluation of these expected read voltages considers the sneak-paths. The target conductance will be programmed in the physical memristor crossbar based on the corresponding expected read voltages. Thus the ex-situ training process takes sneak-paths into consideration for the 0T1M crossbar.
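The programming order and the role of the software crossbar model are sketched below. The helper functions simulate_read() (a crossbar simulation that includes sneak-paths, such as the solver sketched in section 5.3.2) and program_device() (the iterative feedback write described next) are hypothetical interfaces used only for illustration.

    import numpy as np

    # Sketch of the proposed 0T1M ex-situ flow: initialize every device to the
    # lowest conductance, then program row by row, left to right, computing the
    # expected read voltage for each device from a software model of the crossbar.
    def exsitu_program_0t1m(G_target, G_min, simulate_read, program_device):
        M, N = G_target.shape
        G_model = np.full((M, N), G_min)          # assumed initial crossbar state
        for i in range(M):
            for j in range(N):
                G_model[i, j] = G_target[i, j]    # model state after this device is set
                v_expected = simulate_read(G_model, i, j)   # includes sneak-paths
                program_device(i, j, v_expected)  # feedback write to hit v_expected
        return G_model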

Recognition accuracy for both types of crossbar (0T1M and 1T1M) based designs would be about the same as both approaches take sneak-paths into consideration. But the 0T1M systems will be more area efficient than the 1T1M systems.

Training Circuits Needed: We assumed that the ex-situ programming process will be coordinated by an off-chip system, and so added only the necessary hardware needed to allow the external system to program the crossbars. These include two D-to-A converters, along with a set of buffers and control circuits per crossbar. The use of two D-to-A converters per crossbar will serialize the programming process for each crossbar. We assume this is not a problem as this system will be programmed once and then deployed for use. A proposed circuit for the training is shown in Figure 5.6. Since memristors will be programmed one at a time, only one of these circuits is required per crossbar.


Figure 5.6. Circuit used to program a single memristor to a target resistance. Here ∆ determines the programming precision.

The target conductance of each memristor will be determined from the trained software neural network weights according to the approach stated earlier. This leads to conductances that are floating point numbers. Before programming, these target conductance values will be quantized into discrete values. This significantly reduces the D-to-A size required to program the memristors. Our experimental results show that 8-bit precision weights provide acceptable recognition accuracy. An iterative feedback write process is used since memristors typically possess some degree of stochastic behavior. A pulse width known to be much smaller than that required to switch the memristor is used to program the memristors.
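The feedback write loop realized by the circuit in Figure 5.6 can be summarized as follows; read_voltage() and apply_pulse() are hypothetical interfaces to the crossbar hardware, and delta corresponds to the precision window ∆ in the figure.

    # Iterative feedback write: short pulses are applied until the measured read
    # voltage falls inside [x - delta/2, x + delta/2], where x is the expected
    # read voltage for the target conductance.
    def feedback_write(read_voltage, apply_pulse, x, delta, max_pulses=1000):
        for pulses in range(1, max_pulses + 1):
            v = read_voltage()
            if abs(v - x) <= delta / 2.0:
                return pulses                     # programmed within precision
            # A read voltage below the target means the conductance is still too
            # low, so the next pulse should increase it (and vice versa).
            apply_pulse(increase=(v < x))
        return max_pulses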

5.5 Experimental Evaluations

5.5.1 Datasets

We have examined the proposed ex-situ training approach on three different datasets. Each of these datasets requires a nonlinear separator for classification and hence a multi-layer neural network.


Iris Dataset: We have utilized the Iris dataset [60] consisting of 150 patterns of three classes (Iris Setosa, Iris Versicolour, and Iris Virginica). Each pattern consists of 4 attributes/features which were normalized such that the maximum attribute magnitude is 1. We have trained a two layer neural network having 15 neurons in the hidden layer and three in the output layer.

KDD Dataset: We have examined a four layer deep network for the KDD dataset [62]. We utilized 20,000 samples for training and 5,000 samples for testing. The utilized neural network configuration was 41→100→50→20→1.

ISOLET Dataset: Description is in section 5.3.

5.5.2 Memristor Model

An accurate memristor model [53] of a HfOx and AlOx memristor device [9] was used for simulations. The switching characteristics for the model are displayed in Figure 3.25. This device was chosen for its high minimum resistance value and large resistance ratio. According to the data presented in [9] this device has a minimum resistance of 10 kΩ and a resistance ratio of 10³. The full resistance range of the device can be switched in 20 μs by applying a 2.5 V pulse across the device.

5.5.3 Results

To demonstrate the programming flow, consider the programming of the memristor at the 1st row and 398-th column for the ISOLET dataset. The conductance of the memristor obtained from software simulation was 9.2265×10⁻⁷ Ω⁻¹. For the 1T1M crossbar the read current would be Vr/[(618+400)Rw + 1/(9.22×10⁻⁷)], or 9.18×10⁻⁷ A, where Vr = 1 V is the applied read voltage and Rw = 5 Ω is the crossbar wire segment resistance.

For the 0T1M crossbar we are assuming that the memristors will be programmed serially, one by one, from the 1st row to the last row, and within a row from left to right. When we are programming the memristor at the 1st row and 398-th column of the 0T1M crossbar, the memristors in the 1st to the 397-th columns of this row are already programmed. The conductances of the memristors in the other rows are the lowest conductance of the device. The desired read current for this memristor is 6.7477×10⁻⁷ A when the applied read voltage is 1 V (as evaluated by the software implementation of the crossbar). Note that the two read currents for the two types of crossbars are different.
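The 1T1M read-current figure quoted above can be reproduced with the following short calculation (618 and 400 are the row and column segment counts used in the expression in the text):

    # Reproducing the 1T1M read-current estimate for the 1st row, 398-th column.
    Vr = 1.0            # applied read voltage (V)
    Rw = 5.0            # wire segment resistance (ohm)
    g = 9.2265e-7       # target conductance (1/ohm)
    I = Vr / ((618 + 400) * Rw + 1.0 / g)
    print(I)            # about 9.18e-7 A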

We have examined the impact of memristor device stochasticity by allowing random deviations of the programmed memristor conductance when a write pulse is applied. The maximum deviation could be 50% of the desired update. We assumed a programming pulse width of 2 ns, and thus needed several pulses to achieve the intended final resistance level. Figure 5.7 shows the iterative change in conductance of the memristor during the programming process. We can observe that after about 82 iterations the memristor was programmed to the desired conductance within the programming precision boundaries. Table 5.1 shows the recognition error on test data for different datasets obtained from both the software implementations and the proposed ex-situ training approach. Recognition accuracy for both types of crossbar (0T1M and 1T1M) based designs is about the same as both approaches take sneak-paths into consideration.


Figure 5.7. Demonstration of memristor programming through the ex-situ process: (a) change in read current, (b) change in conductance.

Table 5.1: Recognition error on test data.

                Test error (%)
  Dataset       Training of the ckt. in s/w   Ex-situ training
  Iris          3.92                          3.92
  KDD           2.4                           2.42
  ISOLET        9.49                          10

5.5.4 Area Savings

We have examined the area savings of 0T1M crossbar based neural network implementations over 1T1M crossbar based designs. We assumed memristors were fabricated on top of the transistor layer for both the 0T1M and 1T1M designs. We have approximated the area of each system in terms of the number of transistors used. In the area calculations, we have accounted for both the circuits contributing to the neural network evaluation and the memristor programming operations.


Table 5.2: Area savings in 0T1M crossbar systems over 1T1M crossbar systems for implementing neural networks of different configurations.

                Area (transistors)
  Dataset       1T1M        0T1M       Area reduction in 0T1M
  Iris          1212        1002       17%
  KDD           28530       8290       71%
  ISOLET        268950      45578      83%

Table 5.2 shows the area savings of 0T1M crossbars over 1T1M crossbars for implementing neural networks of different configurations. We can observe that 0T1M crossbar based systems provided between 17% and 83% area reductions over the 1T1M crossbar based systems. Larger crossbars provide greater area reductions.

5.5.5 Impact of Device Variation

Due to device variations, the conductance ranges of different memristors in a crossbar may not be the same. It would not be possible to program a memristor beyond its conductance range. Prezioso et al. [51] examined variations in the lowest conductances of different memristors in a crossbar. To cope with this issue, software training could be restricted such that the trained conductances are within the conductance range of any memristor in a physical crossbar. Thus the impact of device variation in ex-situ training of 1T1M crossbars could be minimized.

In ex-situ training of a 0T1M crossbar we assumed all the memristor conductances will be initialized to the minimum device conductance. In a physical crossbar not all the memristors will be at the same conductance value due to device variations. We have examined the impact of this issue on ex-situ training of 0T1M crossbars. From [51] it can be seen that the lowest conductances of different memristors in a crossbar are normally distributed and the standard deviation of the distribution is 0.7×10⁻⁷ Ω⁻¹. For the device considered in this study, the mean of the lowest conductances is 1×10⁻⁷ Ω⁻¹. The recognition error for the ISOLET dataset considering device variation in ex-situ training of 0T1M crossbars was 10.5%. This is slightly higher than in the case without device variation, but still lower than in the case where sneak-path currents are not considered (section 5.3).

5.6 Summary

Existing ex-situ training approaches for memristor crossbars do not consider sneak-paths and may work only for small crossbar based neural networks. Ex-situ training in large crossbars without considering sneak-paths reduces the application recognition accuracy significantly due to the large number of sneak-paths. This chapter proposes ex-situ training approaches for both 0T1M and 1T1M crossbars considering crossbar sneak-paths. A framework for fast and accurate simulation of large memristor crossbars was developed. The ex-situ training approach is able to tolerate the stochasticity in memristor devices. The results show that the 0T1M systems trained are about 17% to 83% smaller in area than the 1T1M systems, including the area for the training circuits.


CHAPTER VI

NEURAL NETWORK BASED MULTICORE PROCESSORS

6.1 Introduction

This study examines the design of several novel specialized multicore neural processors that would be used primarily to process data directly from sensors. Systems based on memristor crossbars are examined through detailed circuit level simulations.

These show that two types of memristor circuits could be used, where processors either have or do not have on-chip training hardware. We have designed these circuits and have examined their system level impact on multicore neural processors. Additionally, we have examined the design of SRAM based neural processors.

A full system evaluation of the multicore processors based on these specialized cores was performed, taking the I/O and routing circuits into consideration, and the area and power benefits were compared with traditional multicore RISC processors. Our results indicate that the memristor based architectures can provide between three and five orders of magnitude better energy efficiency than RISC processors for the selected benchmarks.

Additionally, they can be up to 401 times more energy efficient than the SRAM neural cores.

The most recent results for memristor based neuromorphic systems can be seen in [47,48]. The work in [47] did not explain how negative synaptic weights would be implemented in their single memristor per synapse design. Such neurons, capable of representing only positive weights, have very limited capability. No detailed programming technique was described. Liu et al. [48] examined memristor based neural networks utilizing four memristors per synapse, whereas the proposed systems utilize two memristors per synapse. They proposed deployment of the memristor based system as a neural accelerator alongside a RISC system, while the proposed systems are standalone embedded processing architectures (which process data coming directly from a 3-D stacked sensor chip).

The novel contributions of this work are:

1) This study examines the detailed design of memristor based multicore processors. The timing and power data for both cores are based on detailed SPICE simulations.

2) The I/O interface, bandwidth and energy are considered in system level evaluations. We compared the area-power benefits of memristor based systems with multicore digital SRAM and RISC systems.

3) The factors contributing to the energy efficiencies of the specialized systems are identified.

4) The on-chip and off-chip training circuits were verified with detailed SPICE simulations.

The rest of the chapter is organized as follows: section 6.2 describes the overall architecture and the digital neural core design. Section 6.3 describes the memristor neural cores and section 6.4 describes how to program these cores. Sections 6.5 and 6.6 describe our experimental setup and results, respectively. Finally, in section 6.7 we conclude our work.


6.2 Multicore Architecture

In this study we have examined applications implemented using multi-layer neural networks [17]. Each neuron in such a neural network performs two types of operations: (i) a dot product of the inputs x1,…,xn and the weights w1,…,wn, and (ii) the evaluation of an activation function. These operations are shown in Eqs. (3.1) and (3.2). In a multi-layer feed forward neural network, a nonlinear activation function is desired (e.g. f(x) = tan⁻¹(x)).

Multicore architectures are widely used to exploit task level parallelism. We assumed a multicore neural architecture for neural network applications, as shown in Figure 6.1, with an on-chip routing network to connect the cores. This system processes data coming directly from the sensor chip, which resides on top of the system utilizing 3-D stacking technology [65] (see Figure 6.2).

Figure 6.1. Proposed multicore system with several neural cores (NC) connected through a 2-D mesh routing network (R: router).


Figure 6.2. Sensor chip on top of the neural processing system. Input data are transferred to the neural chip utilizing through silicon vias (TSVs).

6.2.1 SRAM Digital Neural Core

Figure 6.3 shows a block diagram of the digital neural core. Each core processes a collection of N neurons, with each neuron having up to M input axons. The neuron synaptic weights (Wi,j) are stored in a memory array. These synaptic values are multiplied with the pre-synaptic input values (xi) and are summed into an accumulator. Once the final neuron outputs are generated, they are sent to other cores for processing, after going through a look-up table implementing an activation function. Input and output buffers store the pre-synaptic inputs and post-synaptic outputs respectively. The memory array shown in Figure 6.3 can be developed using several different memory technologies. In this study, we assumed it was implemented using SRAM. In the rest of the chapter the system in Figure 6.3 will be referred to as the digital system or SRAM system.

We utilize 8 bits to store a synaptic weight in the SRAM neural core. Every neuron input and output is also 8 bits wide. Inputs belonging to an input pattern (vector) are evaluated one by one. When one input (one component of the input vector) is applied to the core, all the neurons in the core access the corresponding synapses, multiply them with the input, and sum the products in their accumulators simultaneously. In a single SRAM core we have one lookup table for implementing the nonlinear activation function to keep the area and power overhead lower. In this system execution and routing are overlapped (see Figure 6.15(b)). When the core is executing input pattern n, it is sending outputs for pattern n-1 through the routing network serially (8 bits at a time). As a result, a single lookup table per core is sufficient to evaluate the neuron activation function. Subsection 6.7.2 examines the determination of a near-optimum SRAM neural core size.
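A behavioural sketch of this datapath is given below: one input component is applied per cycle and all accumulators update in parallel, with a shared activation lookup applied as the outputs are streamed out. The act_lut argument is a placeholder for the core's activation lookup table.

    import numpy as np

    # Behavioural model of the SRAM digital neural core (8-bit weights and inputs).
    def sram_core_forward(weights, x, act_lut):
        # weights: N x M array of 8-bit synaptic weights; x: length-M 8-bit inputs.
        acc = np.zeros(weights.shape[0], dtype=np.int64)
        for i, xi in enumerate(x):                 # one input component per cycle
            acc += weights[:, i].astype(np.int64) * int(xi)
        # Shared activation LUT, applied as outputs are serialized to the router.
        return np.array([act_lut(a) for a in acc], dtype=np.uint8)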

Figure 6.3. Proposed digital neural core architecture.

6.2.2 On-chip Routing

An on-chip routing network is needed to transfer neuron outputs among cores in a multicore system. In a feed-forward neural network, the outputs of a neuron layer are sent to the following layer after every iteration (as opposed to a spiking network, where outputs are sent only if a neuron fires). This means that the communication between neurons is deterministic and hence a static routing network can be used for the core-to-core communications. In this study, we assumed a static routing network as this would be lower power consuming than a dynamic routing network. The network is statically time multiplexed between cores for exchanging multiple neuron outputs.

SRAM based static routing is utilized to facilitate re-programmability of the routing network [66]. Figure 6.4 shows the routing switch design. Note that the switch allows outputs from a core to be routed back into the core to implement recurrent networks or multi-layer networks all within the same core.

Figure 6.4. SRAM based static routing switch. Each blue circle in the left part of the figure represents the 8×8 SRAM based switch shown in the middle (assuming an 8-bit network bus).

6.2.3 Integration with a Processor

The proposed architecture will process data coming directly from sensors in a 3D stack. In the context of embedded systems (e.g., smartphones), the need to go from sensors to memory, and back to the processor, is a major source of energy inefficiency [37]. The neural accelerator has the potential to remove that memory energy bottleneck by introducing sophisticated processing at the level of the sensors, i.e., accelerators between sensors and processors. It can also significantly reduce the amount of data sent to the processor and/or the memory by acting as a preprocessor. For example, some applications might require the location of a particular object in a large frame to do additional processing. The neural processing system will identify the target object in the image and will send only the position of the object to the processor for more elaborate processing. The outputs generated by the neural system will be sent to the processor memory for further processing, or the processor will read those outputs directly from an on-chip buffer adjacent to the neural system (see Figure 6.1). In this study we are particularly focusing on the design and impact of the neural processing system.

6.2.4 I/O

We propose integration of the neural chip with the sensor chip through 3-D stacking. This will provide the required high input bandwidth. Input data would need to travel through fewer on-chip paths to reach the appropriate neural core. As a result, the on-chip routing resource requirements and signaling energy will be minimized. High throughput applications require more off-chip bandwidth (about 20 Gbps). 3D stacking will enable off-chip data transfer at the required high bandwidth at relatively low power.

6.3 Memristor Cores

6.3.1 Memristor Based Neuron Circuit and Neural Network

The schematic in Figure 6.5 shows the memristor based neuron circuit used in this study. A description of the neuron circuit can be seen in chapter III. Figure 6.7 shows the memristor based implementation of the neural network shown in Figure 6.6.


Figure 6.5. Circuit diagram for a single memristor-based neuron (each synapse consists of a positive-weight and a negative-weight memristor whose conductances determine the effective weight).

Figure 6.6. Two layer network.
Figure 6.7. Memristor crossbar based implementation of the neural network.

The advantage of the memristor crossbar based neuron circuit in Figure 6.7 is that all computations related to a single neural network layer can be evaluated in one step by the crossbar. Additionally, computing in the analog domain eliminates the need for multipliers, adders, and accumulators. This leads to area and power efficiency along with high computation throughput. The non-volatile nature of memristors allows the circuit to be turned off when not in use, thus reducing the static power consumption.

A nonlinear activation function is typically used to generate a neuron's output (given by Eq. (3.2)). To enable parallel operation of the circuit in Figure 6.7, the activation function circuit has to be reproduced for each neuron. To keep the area of the neuron circuit low, a threshold activation function is used in this study, consisting of a pair of inverters, as shown in Figure 6.5. More complex activation functions (such as the sigmoid) would require a costly analog-to-digital converter to be placed at each neuron output, adding significant area (about 3000 transistors per neuron for an 8-bit neuron output) and power overheads.
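To make the analog evaluation concrete, a small MATLAB sketch of the computation one crossbar performs in a single step is shown below. The conductance matrices Gp and Gn (one column pair per neuron, following the two-memristor synapse of Figure 6.5), the input vector, and the scaling factor beta are placeholder values for illustration, not parameters of the actual circuit.

% Functional sketch of one crossbar evaluation step (placeholder values).
x    = [0.4; -0.2; 0.7];        % analog input voltages on the crossbar rows
Gp   = rand(3, 2);              % conductances of the positive-weight columns
Gn   = rand(3, 2);              % conductances of the negative-weight columns
beta = 1;                       % neuron circuit scaling factor
DP   = beta * (Gp - Gn)' * x;   % all neuron dot products, evaluated together
y    = double(DP > 0);          % threshold activation (the inverter pair)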

6.3.2 Memristor Neural Core

Figure 6.8 shows the memristor based single neural core architecture. It consists of a memristor crossbar of a certain capacity, input and output buffers, and a control unit. The control unit manages the input and output buffers and interacts with the routing switch associated with the core. The control unit will be implemented as a finite state machine and thus will have low overhead. This is very similar to the control unit in the digital SRAM core. A near-optimum core size is examined in subsection 6.6.2.


Figure 6.8. Memristor crossbar based neural core architecture.
Figure 6.9. Neural core having DACs for processing the first neuron layer of a network.

As the inputs come from a separate sensor chip to the neural processing chip, they should be in digital form for ease of transmission. In this study we assume each input is represented by 8 bits. Before being applied to the memristor neuron circuit, these inputs need to be converted to analog form. Figure 6.9 shows the neural core implementing the first hidden layer neurons of the network, which utilizes digital-to-analog converters (DACs). Neural cores of both types (with and without DACs) are distributed uniformly over the chip.

6.4 Programming the Memristor Cores

To implement a desired functionality, a memristor crossbar neural network needs to be programmed. The memristor crossbars in Figure 6.7 are programmed using voltage pulses applied to the rows and columns to update the resistances [11]. In this study we propose two different memristor systems: a 1T1M (1-Transistor 1-Memristor) system programmed based on off-chip trained weights, and a high density 0T1M (0-Transistor 1-Memristor) system with no cross-point isolation for on-chip training. The first approach requires an off-chip training system to determine the resistance values, which are then programmed directly into the memristor crossbar. The second approach requires on-chip training circuitry that iteratively updates the memristor resistances in the crossbar based on the training dataset. Details about on-chip and off-chip training can be found in chapters III and V respectively.

After each of these crossbar systems (1T1M and 0T1M) is programmed, it operates in an evaluation mode where input data no longer induce weight changes.

This is achieved by using read voltage pulses below the memristor write threshold (which must be surpassed for a resistance change to occur). A layer of neurons is evaluated by applying an input pattern to the crossbar inputs. Since this is a parallel analog computation, all crosspoint transistors in the 1T1M system are turned on. Therefore, both the 0T1M and 1T1M systems perform the same processing in the evaluation phase. An earlier study shows that 7 bits of precision is achievable from a single memristor [67].

Memristor based neuron circuits utilize two memristors per synapse, which provides a combined synaptic weight precision of about 8 bits. In our on-chip training approach we apply training pulses of variable width and of both positive and negative polarities, allowing the effective precision of the weight represented by a memristor to be higher than 8 bits.
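For the 1T1M (ex-situ) flow, one plausible way to map a trained software weight onto a target conductance pair is sketched below. The device bounds follow the 125 kΩ minimum resistance and 1000× resistance ratio quoted for the device in [68], while the linear scaling and the weight range are assumptions for illustration, not the exact programming rule of chapter V.

% Map a signed software weight onto a (sigma_p, sigma_n) conductance pair.
sigma_max = 1 / 125e3;                  % assumed on-state conductance (125 kOhm)
sigma_min = 1 / 125e6;                  % assumed off-state conductance (125 MOhm)
w = -0.37;                              % example trained weight in [-1, 1]
delta = abs(w) * (sigma_max - sigma_min);
if w >= 0
    sigma_p = sigma_min + delta;  sigma_n = sigma_min;
else
    sigma_p = sigma_min;          sigma_n = sigma_min + delta;
end
% The effective synaptic weight is proportional to (sigma_p - sigma_n).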

6.5 Evaluation of Proposed Architectures

6.5.1 On-chip Training Result

We have performed detailed simulations of the memristor crossbar based neural network systems. MATLAB (R2014a) and SPICE (LTspice IV) were used to develop a simulation framework for applying the training algorithm to a multi-layer neural network circuit. SPICE was mainly used for detailed analog simulation of the memristor crossbar array, and MATLAB was used to simulate the rest of the system. A circuit was trained by applying input patterns one by one until the errors were below the desired levels.

Simulation of the memristor device used an accurate model of the device published in [53]. The memristor device simulated in this study was published in [68], and the switching characteristics for the model are displayed in Figure 6.10. This device was chosen for its high minimum resistance value and large resistance ratio. According to the data presented in [68], this device has a minimum resistance of 125 kΩ, a resistance ratio of 1000, and the full resistance range of the device can be switched in 80 ns by applying 4.25 V across the device.

Figure 6.10. Simulation results displaying the input voltage and current waveforms for the memristor model [53] that was based on the device in [68]. The following parameter values were used in the model to obtain this result: Vp=4 V, Vn=4 V, Ap=816000, An=816000, xp=0.9897, xn=0.9897, αp=0.2, αn=0.2, a1=1.6×10^-4, a2=1.6×10^-4, b=0.05, x0=0.001.

Figure 6.11 shows the Matlab-SPICE training simulation on the Iris dataset. A two layer network with four inputs, 15 hidden neurons and three output neurons was utilized. The graph shows that the neural network was able to learn the desired classifiers.

Recognition error on the test dataset was 3.92%.


Figure 6.11. Learning curve (MSE vs. epoch) for the classification based on the Iris dataset.

6.5.2 Ex-situ Training Result

We trained the same two layer neural network off-line for the Iris dataset classification. The trained software weights were imported into an equivalent memristor crossbar based neural network. The recognition error of the memristor based neural network was the same (3.92%) as that of the software neural network.

6.5.3 Application Description

We have selected the following five applications for our system level evaluations: edge detection, motion estimation, deep networks, object recognition, and optical character recognition (OCR). These are described in detail below.

Edge (edge detection): Changes or discontinuities in image attributes such as luminance are fundamentally important primitive characteristics of an image because they often indicate the physical extent of objects within the image. We implemented the Sobel edge detection algorithm, which takes a 3×3 pixel window as input to generate one output pixel. On the RISC processing cores, the application was implemented algorithmically (not using a neural network), ensuring that the best algorithm was used for the RISC system. In the SRAM neural cores, the algorithm was approximated using a neural network with configuration 9→20→1 (9 inputs, 20 neurons in the hidden layer, and 1 output neuron). For the memristor systems we utilized four neural networks of configurations 9→20→15, 24→20→15, 15→10→4, and 15→10→4. These extra networks were needed to generate the multi-bit outputs for the application. The network was trained using input-output pairs collected from the RISC implementation of the algorithm.
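To illustrate how such training data can be produced, the sketch below collects (3×3 patch, edge magnitude) pairs from a reference Sobel implementation; the image here is a random placeholder and the normalization details are assumptions rather than the exact data collection used for the hardware networks.

% Collect (3x3 patch, edge magnitude) training pairs from a Sobel reference.
img = rand(64, 64);                          % placeholder grayscale image
Kx  = [-1 0 1; -2 0 2; -1 0 1];              % Sobel kernels
Ky  = Kx';
nPatches  = (size(img,1)-2) * (size(img,2)-2);
pairs_in  = zeros(nPatches, 9);              % 9 inputs per training pair
pairs_out = zeros(nPatches, 1);              % 1 target output per pair
k = 0;
for r = 2:size(img,1)-1
    for c = 2:size(img,2)-1
        k = k + 1;
        patch = img(r-1:r+1, c-1:c+1);
        pairs_in(k,:) = patch(:)';
        pairs_out(k)  = sqrt(sum(sum(Kx.*patch))^2 + sum(sum(Ky.*patch))^2);
    end
end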

Deep: Deep neural networks have become popular for face, object, and pattern recognition tasks. We developed a small scale deep network to process images from the MNIST dataset [69]. This dataset contains 60,000 images of handwritten digits, with each image consisting of 28×28 grayscale pixels. We utilized a neural network with configuration 784→200→100→10. The network was trained with 50,000 images from the dataset. The same network was used in both the RISC processor and the neural processors.

Motion: Two images are compared to determine the degree of motion within the images. The algorithm estimates the degree of motion in increments of 5% from 0% to 50%. It evaluates motion through the temporal derivative of the luminance of each pixel to determine pixel-level variations. The absolute value of each derivative is computed, and then the absolute values over all pixels are averaged. Finally, the average is compared against a threshold to determine if motion occurred within the image. For an m×n frame size, to detect motion we determined pixel deviations in 8×8 grids and accumulated those deviations. For the memristor system, we utilized neural networks of configuration 64(2→1), 64→10, and 20→10. For the SRAM system, the neural network configurations utilized are 64(2→1), 64→1, and 2→1. The SRAM system network is different as it has multi-bit outputs. The application was implemented for the RISC system on the basis of direct calculation of pixel deviations (not in neural network form).
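A rough MATLAB sketch of the pixel-deviation computation described above is given below; the frame contents, frame size, and the decision threshold are placeholders for illustration only.

% Motion estimate from per-pixel temporal derivatives, accumulated over 8x8 grids.
frame1 = rand(240, 320);   frame2 = rand(240, 320);   % placeholder luminance frames
deriv  = abs(frame2 - frame1);                        % |temporal derivative| per pixel
blockSum = zeros(size(deriv,1)/8, size(deriv,2)/8);
for i = 1:size(blockSum, 1)
    for j = 1:size(blockSum, 2)
        blockSum(i,j) = sum(sum(deriv(8*i-7:8*i, 8*j-7:8*j)));
    end
end
avgDeviation   = mean(blockSum(:)) / 64;              % average per-pixel deviation
motionDetected = avgDeviation > 0.05;                 % placeholder threshold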


Object Recognition: We have examined an object recognition task on the CIFAR-10 dataset [70]. This dataset consists of 60,000 color images of size 32×32 belonging to 10 classes, including airplanes, automobiles, birds, cats, deer, dogs, frogs, horses, ships, and trucks. For the desired classification we utilized a two layer neural network with configuration 3072→100→10.

OCR: The Optical Character Recognition application deals with the recognition and classification of printed characters. There are numerous algorithms that perform this task, each with its own strengths and weaknesses. In this study, we implemented a neural network based optical character recognition system. The neural network was trained using the Chars74K dataset [71], consisting of 128×128 pixel images. We subsampled the character images and used 50×50 images in our experiment. We utilized a network of configuration 2500→60→26 on all the systems.

6.5.4 Mapping Neural Networks to Cores

The neural hardware is not able to time-multiplex neurons because their synaptic weights are stored directly within the neural circuits. Hence a neural network's structure may need to be modified to fit into a neural core. In cases where the networks were significantly smaller than the neural core memory, multiple neural layers were mapped to a core as shown in Figure 6.12. In this case, the layers executed in a pipelined manner, where the outputs of layer 1 were fed back into layer 2 on the same core through the adjacent routing switch.


Figure 6.12. Multiple layers of neurons on a core.

When a network layer was too large to fit into a core (either because it needed too many inputs or it had too many neurons), the network layer was split amongst multiple cores. Splitting a layer across multiple cores due to a large number of output neurons is trivial. When there were too many inputs per neuron for a core, each neuron was split into multiple smaller neurons and then combined through a higher level neuron, as shown in Figure 6.13. When splitting a neuron, the network needs to be trained to account for the new network topology. As the network topology is determined prior to training (based on the neural hardware architecture), the split neuron weights are trained correctly.

Figure 6.13. Splitting a neuron into multiple smaller neurons.
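The sketch below illustrates the partitioning that Figure 6.13 depicts: a neuron whose fan-in exceeds the core's input count is replaced by smaller sub-neurons whose outputs feed a higher-level combining neuron. The weights here are random placeholders; in the actual flow the split topology is fixed first and then trained, since the sub-neuron outputs also pass through the hardware activation.

% Split a 6-input neuron into two 3-input sub-neurons plus a combining neuron.
x  = rand(6, 1);                   % inputs i1..i6 (placeholder values)
w  = randn(6, 1);                  % sub-neuron weights (placeholder, later trained)
wc = randn(2, 1);                  % combining-neuron weights (placeholder)
p1 = double(w(1:3)' * x(1:3) > 0); % sub-neuron 1 output (threshold activation)
p2 = double(w(4:6)' * x(4:6) > 0); % sub-neuron 2 output (threshold activation)
y  = double(wc' * [p1; p2] > 0);   % higher-level neuron combines the partial results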

6.5.5 RISC Core Configuration

We compared the power and performance of the memristor based systems with a traditional RISC system. The examined ARM processor is a single issue, in-order, 6 stage pipelined system operating at a 1 GHz clock. The L1 instruction cache size is 16 kB and the L1 data cache is 16 kB. We assumed a main memory access latency of 1 cycle to mimic prefetching from 3D stacked main memory. The reason for choosing the single issue in-order processor over a superscalar out-of-order processor is to achieve overall power efficiency for the real time applications. A complex out-of-order processor provides more throughput than a simple single issue in-order processor, but consumes significantly more power. We obtained the area and power of the RISC core utilizing McPat [72] assuming a 45 nm process. The area of the core is 0.524 mm2 and it consumes 0.087 W of power. The performance of the RISC core was evaluated utilizing the SimpleScalar [73] simulator.

6.6 Results

6.6.1 Bit Width and Activation Function

The bit width of the synaptic weights and the neuron activation function, f in Eq. (3.2), will impact the output accuracy of the computations. In our study we utilized different activation functions for each of the computational circuits, based on the cost of implementation. Figure 6.14 shows the error due to different bit widths and two activation functions: sigmoid and thresholding.

A key reason for the low power of the memristor cores is that each crossbar evaluates all its neurons and synapses in parallel (Figure 6.15(a)) and latches the outputs into a buffer. As the neuron outputs are analog, the cheapest way to latch them is to generate a binary output using a threshold activation function. To enable the parallel operation, the activation function circuit has to be reproduced for each neuron. Hence implementing the sigmoid activation function would require a more complex analog-to-digital converter to be placed at each neuron output, adding significant area (about 3000 transistors per neuron for an 8-bit neuron output) and power costs.


In the SRAM and RISC cores, we used the sigmoid activation function. The cost of implementing this is low on both architectures. In the RISC architecture it was observed that the runtime reduced by less than 2% when converting from the sigmoid to the threshold activation function. In the SRAM architecture, the activation function is implemented using a lookup table. Since the neuron outputs are transmitted to the on-chip router serially in the SRAM architecture, only one lookup table is needed per core. The area and power overheads of using the lookup table were 1% and 0.3% respectively for the 256×128 (inputs×outputs) digital core.

As shown in Figure 6.14, a synaptic bit width of 8 bits results in an average loss in accuracy of less than 1% and 3% for the sigmoid and threshold activation functions respectively. Therefore both the 1T1M and the SRAM architectures utilized the 8 bit precision for weights. As the 0T1M architecture is trained through on-chip training circuits, it can achieve higher precision in the trained weights.

Figure 6.14. Error for different precisions (with the same number of neurons). Here Sig denotes sigmoid, Flt floating point, and Th threshold.

6.6.2 Design Space Exploration of Neural Cores

We evaluated the area, power, and throughput of multicore systems having different neural core configurations (neurons/core and synapses/neuron) to obtain a near-optimal neural core configuration. We assumed all systems used a 45 nm process for this study.

For the digital system, a 16-core version of the system was implemented on an Altera Stratix IV FPGA to understand and verify the components. Static routing was utilized for the communications between the cores. Each core utilized 8 bit synaptic weights and implemented the sigmoid activation function. The five applications described above were mapped to the system and produced the correct outputs. The FPGA system was programmed in Verilog and ran at 87.5 MHz.

The area, power, and timing of the SRAM array were calculated using CACTI [74] with the low operating power transistor option. Components of a typical cache that would not be needed in the neural core (such as the tag array and tag comparator) were not included in the calculations. The components of the digital neural core, such as the number of multipliers, adders, and registers, were determined using the FPGA design. The power of these basic components was determined through SPICE simulations.

The routing link power was calculated using Orion [75] (assuming 8 bits per link). A frequency of 200 MHz was assumed for the digital system to keep power consumption low. Each row of the synaptic memory in the digital core is evaluated in one cycle, thus requiring several cycles to generate all the neural outputs. We assumed that the routing of one set of outputs could be overlapped with the calculation of the next set of outputs in a core. Figure 6.15(b) shows the timing breakdown of a 256×128 (inputs×outputs) digital neural core.


For the memristor cores, detailed SPICE simulations were used for power and timing calculations of the analog circuits (drivers, crossbar, and activation function circuits). These simulations considered the wire resistance and capacitance within the crossbar as well. The results show that the crossbar required 10 ns for processing. As the memristor crossbars evaluate all neurons in one step, the majority of time in these systems is spent transferring neuron outputs between cores through the routing network. We assumed that the routing would run at a 200 MHz clock, resulting in two cycles being allotted for crossbar processing. Figure 6.15(a) shows the timing breakdown of a 128×64 (inputs×outputs) memristor neural core.

Figure 6.15. Computation-communication timing for (a) memristor cores, (b) digital cores.

For each of the three neural core types, we varied the memory/crossbar array size to examine the impact on area and power, as shown in Figures 6.16 to 6.18. Each of the applications was mapped to the multicore system based on the method in subsection 6.5.4. Off-chip I/O energy was also considered as described in Section 6.2. Data transfer energy via TSV was assumed to be 0.05 pJ/bit [76]. Image sizes of 2500×2500 were used in this design space exploration.


Figure 6.16. 0T1M system area and power (normalized) for different core configurations.

Figure 6.17. 1T1M system area and power (normalized) for different core configurations.

Figure 6.18. Digital system area and power (normalized) for different core configurations.

For both the 0T1M and 1T1M configurations, we picked the 128×64 (inputs×outputs) memristor crossbar core configuration, as this had the lowest average normalized area and power across the different applications. For the digital core, the optimum size was 256×128 synapses in the on-core memory array; this corresponds to a memory size of 256×128 bytes (8 bits per synapse). Due to the 8-bit outputs, the lookup table needed 256 bytes of memory to implement the activation function. Table 6.1 shows the area and power of the different cores.

Table 6.1: Area and power of different cores.
Core                                         Area (mm2)   Total power (mW)   Leakage power (mW)   Core processing time (sec.)
RISC (1 neuron, 784 synapses)                0.524        87                 54                   3.97×10^-5
Digital (128 neurons, 256 synapses/neuron)   0.208        24.2               6.94                 1.28×10^-6
0T1M (64 neurons, 128 synapses/neuron)       0.00745      0.0865             0.0112               9×10^-8
1T1M (64 neurons, 128 synapses/neuron)       0.0082       0.0888             0.0118               9×10^-8

6.6.3 Results for Real Time Applications

To compare the three neural architectures against RISC processing cores, we examined the processing of real time application loads:

- Deep network/character recognition: process 100,000 characters per second. For the deep network, inputs are 28×28 pixel handwritten digits, while for character recognition they are 50×50 printed characters.
- Edge detection and motion estimation: process a 1280×1080 image stream at 60 frames per second.

Tables 6.2 to 6.6 show the number of cores, area, and power for the different architectures to process these applications with the specified real time processing requirements. The results show that the digital neural processor is about 14 to 952 times more efficient than the RISC cores, while the memristor architectures are about 5,641 to 186,892 times more efficient. We assumed that during the idle time, the memristor neural cores would not consume significant static power.


In the RISC system, the edge detection and motion detection applications are executed using traditional algorithmic procedures (not in neural network form) to make sure that the best algorithms are executed on this system. In the specialized architectures, these two applications are executed in neural network form. The neural network representations of these two applications increase the number of operations to be performed compared to the original RISC operations. As a result, the power efficiencies for these two applications on the specialized architectures are not as high as for the actual neural network applications (deep, object recognition, OCR).

Table 6.2: Deep Network
          Number of cores   Area (mm2)   Power (mW)    Power efficiency over RISC
RISC      902               472.65       78474.00      1
Digital   9                 1.88         82.40         952
0T1M      31                0.23         0.40          196892
1T1M      31                0.25         0.42          187064

Table 6.3: Edge Detection
          Number of cores   Area (mm2)   Power (mW)    Power efficiency over RISC
RISC      240               125.76       20880.00      1
Digital   18                3.75         433.16        48
0T1M      16                0.12         1.37          15187
1T1M      16                0.13         1.41          14813


Table 6.4: Motion Estimation
          Number of cores   Area (mm2)   Power (mW)    Power efficiency over RISC
RISC      7                 3.67         609.00        1
Digital   2                 0.42         42.57         14
0T1M      2                 0.01         0.11          5731
1T1M      2                 0.02         0.11          5641

Table 6.5: Object Recognition
          Number of cores   Area (mm2)   Power (mW)    Power efficiency over RISC
RISC      1358              711.59       118146.00     1
Digital   17                3.54         148.55        795
0T1M      68                0.51         0.90          131833
1T1M      68                0.56         0.94          125430

Table 6.6: Optical Character Recognition
          Number of cores   Area (mm2)   Power (mW)    Power efficiency over RISC
RISC      825               432.30       71775.00      1
Digital   13                2.71         119.08        603
0T1M      31                0.23         0.47          153601
1T1M      31                0.25         0.49          147012

The area, power, and throughput of the multicore RISC system were determined by scaling up the performance of a single core. This essentially means that the latency and energy for core-to-core communication were ignored. The energy and time for core-to-core communication were, however, accounted for in the neural systems. In the RISC systems, the main memory access time was set to 1 cycle (to mimic prefetching from 3D stacked DRAM), and the energy and area of this memory were ignored, to further improve the RISC system performance numbers.

In this study we assumed 3D stacking of the sensors with the processors, with the sensor data transmitted over through-silicon vias (TSVs) to the processors. Each TSV can transmit data at a rate of 200 Mbps. The 200 Mbps rate is due to the selected clock frequency (200 MHz) and is much lower than the peak bandwidth of 8 Gbps shown in recent studies [76]. Table 6.7 shows that the total input and output bandwidths (over all the TSVs) for the benchmark applications are generally quite low.

Table 6.7: Input output data rates.
               Input bps    Output bps
Deep           6.27×10^8    1.00×10^6
Edge           6.64×10^8    6.64×10^8
Motion         1.33×10^9    60.00
Object recog.  1.99×10^9    8.10×10^5
OCR            2.00×10^9    2.60×10^6

6.6.4 Discussion

RISC vs. Digital Systems: In the RISC system, the neuron weights have to be fetched from the cache, whereas in the specialized neural systems these data already reside in the cores. This gives a significant energy saving to the neural systems. Additionally, the parallelism and hardware specialization of the neural cores add further efficiency. Table 6.8 shows the RISC core area and power breakdown. The RISC core has an instruction fetch unit, a data forwarding unit, and a register file, whose combined power consumption is about 67% of the total core. The specialized neural cores do not have analogous components.

Table 6.8: RISC core area power breakdown
                   Area (mm2)   Power (W)
Inst. fetch unit   0.149        0.044
Forwarding unit    0.053        0.010
Data cache         0.156        0.012
Register file      0.011        0.005
(unlabeled)        0.155        0.016
Total              0.524        0.087

The digital neural core has an SRAM array to store synaptic weights and 8 bit adders and multipliers to perform the neuron operations. The RISC system has a 32 bit ALU and a multiplier in the execution unit. The 32 bit ALU and multiplier consume 6.95×10^-12 J and 1.39×10^-11 J of energy respectively. An 8 bit adder is about 116 times less energy consuming than the 32 bit ALU, while an 8 bit multiplier is about 33 times less energy consuming than the 32 bit multiplier. The power consumption of the processing components (128 adders, 128 multipliers) in the digital core is 1.3 times less than the execution unit power of the RISC core because of the different clock frequencies, data bit widths, and activity factors of the execution units.

The specialized digital neural core brings in 128 synapses of 128 neurons in a single SRAM array access and processes them simultaneously. In the RISC core, processing each synapse requires about 10 data cache accesses on average (considering both cache reads and writes). This implies that memory access in the digital neural core could be about 1280 times more energy efficient than in the RISC core. The SRAM array in the digital core is 2 times bigger than the data cache in the RISC core. The digital and RISC systems operate at different clock frequencies, and single synapse processing in the RISC system takes multiple cycles (about 50 cycles). The data array dynamic power of the digital neural core is about 1.47 times more than the RISC core data cache power, and the static power is about 1.4 times more in the digital core. When peripheral circuit components and routing circuits are considered for the digital neural core, the total power of the core becomes 3.6 times less than the RISC core power.

We acknowledge that the comparison between the RISC system and the specialized systems is not entirely fair because of the 32-bit vs. 8-bit data widths. However, the precision difference alone would make the RISC system only about 4 times less energy efficient, not orders of magnitude less. The average power of these systems can be expressed as

P_total = IO_power + number_of_cores_used × (active_fraction × total_core_power + idle_fraction × leakage_power)    (6.1)
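As a rough cross-check, the short MATLAB snippet below applies Eq. (6.1) to the 0T1M system running the deep network; the inputs are taken from Tables 6.1, 6.2, 6.9, and 6.14, and the result reproduces the 0T1M power reported in Table 6.2 to within rounding.

% Cross-check of Eq. (6.1) for the 0T1M system running the deep network.
IO_power        = 31.4e-6;              % W, I/O power (Table 6.9)
num_cores       = 31;                   % 0T1M cores used (Table 6.2)
idle_fraction   = 0.991;                % idle fraction (Table 6.14)
active_fraction = 1 - idle_fraction;
core_power      = 86.5e-6;              % W, total 0T1M core power (Table 6.1)
leakage_power   = 11.2e-6;              % W, 0T1M core leakage power (Table 6.1)
P_total = IO_power + num_cores * ...
          (active_fraction*core_power + idle_fraction*leakage_power);
% P_total comes to roughly 4.0e-4 W (0.40 mW), matching Table 6.2 to within rounding.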

I/O power in the RISC system is insignificant compared to the overall system power. Table 6.1 shows the RISC and SRAM core powers. The RISC core is essentially a Von Neumann architecture, whereas the SRAM neural core is an ASIC for neural network acceleration, which makes execution in the SRAM core faster than in the RISC core (see Table 6.1). The maximum achievable throughput of a digital core is about 1296 times more than the RISC core throughput. As a result, for a given throughput requirement the digital system is active only for a small fraction of the time, while the RISC system requires more cores, which are busy almost all the time, to meet the real time throughput requirements.

To process 100,000 digits using the deep network, the RISC system requires 902 cores and the SRAM system requires 9 cores. To achieve the target throughput, the SRAM system is active only 12.8% of the time. Even though the digital cores are active for only a fraction of the time, we cannot utilize fewer than 9 cores. This is due to the requirement of a dedicated hardware neuron for each software neuron. For the other applications, Tables 6.3-6.6 show the number of cores used in the different systems, and Table 6.14 shows the idle fraction in the digital systems. The specialized, non-Von Neumann organization of the SRAM system enables parallel operation at low power consumption, which makes it orders of magnitude more energy efficient than the RISC system.

Digital System vs. Memristor System: From Table 6.1 it can be seen that the total power consumption of the memristor core is 280 times less than that of the SRAM core, mainly because of the analog computation in the memristor core. In the SRAM core, the synaptic weight memory is the main component contributing to the core leakage power. Leakage power in the memristor core is 1050 times less than in the SRAM core, mainly because of the non-volatile nature of the memristor crossbar based synaptic array. The throughput of the memristor system is higher because of the parallel analog operation in the memristor crossbar.

Tables 6.9-6.13 compare the power breakdowns of the digital systems and the 0T1M memristor systems for the application runs in Tables 6.2 to 6.6. It is seen that synaptic memory accesses in the SRAM system consume a significant amount of the overall power. Computation power and leakage power in the memristor based systems are significantly less than in the digital systems. In the digital systems, I/O power is below 0.2% of the overall power consumption. The I/O buffer power in the digital system is greater than in the memristor system because in the former the neuron outputs are 8 bits wide and the neuron inputs are processed serially. Serial operation on the inputs requires shift operations in the input buffer of the core. Recall that in the digital core, synapse i of all the neurons in the core is processed in parallel, whereas in the memristor core all the synapses and neurons are processed in parallel. This also contributes to greater control unit power in the digital system. Routing power in the digital system is more than in the memristor system because in the former the neuron outputs are 8 bits while in the latter they are a single bit.

Table 6.9: Processing power breakdown in µW for the deep network application.
               Digital (µW)   Memristor (µW)
Memory         13585.1        0
Compute        14155.8        15
I/O buffer     69.1           2.3
Control unit   17.9           2.2
Clk dis.       32.4           1.3
Routing        12             3.6
I/O            31.4           31.4
Sub total      27903.6        55.7
Leakage        54496.1        343
Total          82399.6        398.8

Table 6.10: Processing power breakdown in µW for the edge detection application.
               Digital (µW)   Memristor (µW)
Memory         210616         0
Compute        219465         803.5
I/O buffer     1070.6         123.4
Control unit   277.9          116.2
Clk dis.       502.1          67.7
Routing        185.3          190.6
I/O            69.7           69.7
Sub total      432187         1371.1
Leakage        971.5          11.9
Total          433158.4       1383


Table 6.11: Processing power breakdown in µW for motion detection.
               Digital (µW)   Memristor (µW)
Memory         19562.5        0
Compute        20384.3        12.6
I/O buffer     99.4           1.9
Control unit   25.8           1.8
Clk dis.       46.6           1.1
Routing        17.2           3
I/O            66.4           66.4
Sub total      40202.2        86.7
Leakage        2368.7         19.7
Total          42571          106.4

Table 6.12: Processing power breakdown in µW for object recognition.
               Digital (µW)   Memristor (µW)
Memory         20785.1        0
Compute        21658.3        26.7
I/O buffer     105.7          4.1
Control unit   27.4           3.9
Clk dis.       49.5           2.2
Routing        18.3           6.3
I/O            99.6           99.6
Sub total      42744          142.8
Leakage        105808         753.7
Total          148552         896.5

Table 6.13: Processing power breakdown in µW for the OCR application.
               Digital (µW)   Memristor (µW)
Memory         19622.9        0
Compute        20447.2        15
I/O buffer     99.7           2.3
Control unit   25.9           2.2
Clk dis.       46.8           1.3
Routing        17.3           3.6
I/O            100.1          100.1
Sub total      40359.9        124.4
Leakage        78716.5        343
Total          119076         467.5


The total power of the specialized systems can be calculated according to Eq. (6.1). Table 6.14 shows the idle fraction of the neural cores for the application throughputs considered in the previous subsection. From Tables 6.2-6.6 and 6.14 we can observe that when cores are idle for little of the time (the edge detection application), the ratio of the total core powers essentially determines the relative energy efficiency of the specialized systems. When cores are idle most of the time (deep, object recognition, OCR), the ratios of both the core dynamic powers and the leakage powers determine the relative energy efficiency of the specialized systems.

Table 6.14: Idle fraction
               Digital idle fraction   Memristor idle fraction
Deep           0.872                   0.991
Edge           0.0078                  0.0669
Motion         0.170                   0.883
Object recog.  0.896                   0.992
OCR            0.872                   0.991

The power consumption of a memristor core depends on the resistance range of the trained memristors. The memristor device utilized in this study has a maximum resistance of 125 MΩ. Using this device, the resistances of the trained memristors for the neural networks were between 10 MΩ and 125 MΩ. The memristor core power for this resistance range is shown in Table 6.1. Different memristor devices have different resistance ranges, which will produce trained memristor resistances in different ranges. Figure 6.19 shows the power consumption of a 128×64 memristor core (0T1M) for different trained resistance ranges. Higher trained resistances enable a lower power consuming core. Figure 6.20 shows the energy efficiencies of the memristor systems over the SRAM systems for different resistance ranges of the memristor cores. Systems utilizing low resistances for the memristors provide very little power benefit over SRAM based systems. To achieve highly energy efficient execution, a memristor device having a high ROFF needs to be utilized. In our proposed systems we utilized memristor resistances similar to core3.

Table 6.15: Resistance range of trained memristors.
        Resistance range
core1   100 kΩ to 1 MΩ
core2   1 MΩ to 10 MΩ
core3   10 MΩ to 100 MΩ
core4   100 MΩ to 1000 MΩ

Figure 6.19. 128×64 memristor core powers for the resistance ranges shown in Table 6.15.

Figure 6.20. Power efficiencies of the memristor systems over digital systems utilizing different memristor devices.


Comparison with GPU: GPUs are popular for data parallel applications to achieve high throughput. We executed the deep neural network application on an Nvidia Tesla K20 GPU and were able to get 7.8×10^3 times more throughput than the RISC processor examined in this chapter. The power consumption of the GPU is 2.73×10^3 times more than the RISC processor power. This implies that the GPU is about 2.85 times more energy efficient than the RISC system when executing neural network applications in real time. We can therefore estimate that the specialized digital systems will be about 5 to 334 times more energy efficient than the GPU, and the memristor systems will be about three to five orders of magnitude more energy efficient, for the selected benchmark applications.

6.7 Summary

In this study we have performed a full system evaluation of multicore systems based on memristor neural cores. On-chip routing requirements and the I/O interface were examined for these systems. We performed a design space exploration of the specialized neural cores and determined optimum neural core configurations. Synaptic memory accesses in the digital system consume a significant portion of the overall system power. The non-volatile synaptic array of the memristor crossbar based systems consumes very little leakage power. Furthermore, in the memristor based systems, data processing takes place at the physical location of the data. The parallel analog operation of the memristor crossbar does not require adders and multipliers to perform the neuron operations. Specialized architectures for neural networks provide higher throughput than the RISC architecture. Furthermore, the parallel analog operation of the memristor based systems provides dramatic throughput and power efficiencies over the digital systems.


CHAPTER VII

LOW POWER HIGH THROUGHPUT STREAMING ARCHITECTURE FOR BIG DATA PROCESSING

7.1 Introduction

General purpose computing systems are used for a large variety of applications. The extensive support for flexibility in these systems limits their energy efficiency. Given that big data applications are among the main emerging workloads for computing systems, specialized architectures for big data processing are needed to enable low power and high throughput execution. Several big data applications are particularly focused on classification and clustering tasks. In this study we propose a multicore heterogeneous architecture for big data processing. This system has the capability to process key machine learning algorithms such as deep neural networks, the autoencoder, and k-means clustering.

Memristor crossbars are utilized to provide low power, high throughput execution of neural networks. The system has both training and recognition (evaluation of new input) capabilities. The proposed system could be used for classification, unsupervised clustering, dimensionality reduction, feature extraction, and anomaly detection applications. The system level area and power benefits of the specialized architecture are compared with the NVIDIA Tesla K20 GPGPU. Our experimental evaluations show that the proposed architecture can provide four to six orders of magnitude more energy efficiency than GPGPUs for big data processing.

The rest of the chapter is organized as follows: section 7.2 presents the overall heterogeneous multicore architecture. Section 7.3 describes the memristor device, the neuron circuit design, the autoencoder, and the memristor based autoencoder design. Section 7.4 describes the designs of the heterogeneous cores. Sections 7.5 and 7.6 describe the experimental setup and results respectively. Finally, in section 7.7 we summarize our work.

7.2 System Overview

In this study we propose an energy efficient multicore heterogeneous architecture to accelerate both supervised and unsupervised machine learning applications. The architecture implements the training algorithm, the autoencoder, deep neural networks, and the k-means clustering algorithm. These are some of the fundamental algorithms in deep learning and are used for big data processing.

A common approach for unsupervised learning is clustering which uses dimensionality reduction and feature extraction as preprocessing. Dimensionality reduction allows the data of large dimension to be represented using a smaller set of features, and thus allow a simpler clustering algorithm to process an unlabeled dataset. In this study, we implement the commonly used autoencoder algorithm for dimensionality reduction and the k-means algorithm for clustering. The autoencoder is based on multilayered neural networks and thus is implemented using memristor based neural processing cores. As the input data can be of high dimensionality (such as a high resolution image), multiple neural cores are needed to implement the autoencoder.


We utilize the back-propagation algorithm for training the autoencoder. Due to the use of the autoencoder, a reduced dimension k-means implementation is sufficient, and hence we utilize a single digital core to implement k-means clustering. Deep neural network applications with supervised training could also be implemented in the proposed system. For deep networks, the autoencoder is used for unsupervised layer-wise pre-training, and supervised fine-tuning is performed on the pre-trained weights utilizing the back-propagation algorithm. The system has both training and inference capabilities. It can be used for classification, unsupervised clustering, dimensionality reduction, feature extraction, and anomaly detection applications.

Figure 7.1. Heterogeneous architecture including the memristor based multicore system as one of the processing components. The proposed multicore system has several neural cores (NC) and one clustering core connected through a 2-D mesh routing network (R: router).

Figure 7.1 shows the overall system architecture, which includes memristor crossbar based neural cores (NC) and a digital unit for clustering. The memristor crossbar neural cores are utilized to provide low power, high throughput execution of neural networks. The cores in the system are connected through an on-chip routing network (R). The neural cores receive their inputs from the main memory holding the training data, through the on-chip routers and a buffer between the routing network and main memory, as shown in Figure 7.1.

Since the proposed architecture is geared towards machine learning applications (including big data processing), the training data will be used multiple times during the training phase. As a result, fast access to a main memory system is needed, and thus we propose the use of 3D stacked DRAM. Using through-silicon vias reduces memory access time and energy, thereby reducing overall system power. Efficient access to the main memory is provided by a DMA controller that is initialized by a RISC processing core. A single issue pipelined RISC core is used to reduce power consumption.

An on-chip routing network is needed to transfer neuron outputs among cores in a multicore system. In feed-forward neural networks, the outputs of a neuron layer are sent to the following layer after every iteration (as opposed to a spiking network, where outputs are sent only if a neuron fires). This means that the communication between neurons is deterministic, and hence a static routing network can be used for the core-to-core communications. In this study, we assumed a static routing network as it consumes less power than a dynamic routing network. The network is statically time multiplexed between cores for exchanging multiple neuron outputs.

SRAM based static routing (the same as in chapter VI) is utilized to facilitate re-programmability in the switches. Figure 7.2 shows the routing switch design. Note that the switch allows outputs from a core to be routed back into the core to implement recurrent networks or multi-layer networks where all the neuron layers are within the same core.

Figure 7.2. SRAM based static routing switch. Each blue circle in the left part of the figure represents the 8×8 SRAM based switch shown in the middle (assuming an 8-bit network bus).

7.3 Autoencoder

Several big data applications are particularly focused on classification and clustering tasks. The robustness of such systems depends on how well they can extract important features from the raw data. For big data processing we are interested in a generic feature extraction mechanism that will work for a variety of applications. The autoencoder [77] is a popular dimensionality reduction and feature extraction approach that automatically learns features from unlabeled data; that is, it is an unsupervised approach and requires no supervision. The architecture of the autoencoder is similar to a multi-layer neural network, as shown in Figure 4.2. A detailed description of the autoencoder is presented in section 4.2.


In order to provide proper functionality, a multi-layer neural network needs to be trained using a training algorithm. Back-propagation (BP) [50] and the variants of the BP algorithm are widely used for training such networks. The stochastic BP algorithm is used to train the memristor based multi-layer neural network and is described in chapter IV.
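For reference, a minimal software-level sketch of stochastic back-propagation on a single-hidden-layer autoencoder (reconstruction target equal to the input) is shown below; the layer sizes, sigmoid activation, learning rate, and random data are illustrative assumptions, and biases are omitted for brevity. This is not the hardware-constrained training of chapter IV.

% Tiny autoencoder (n -> h -> n) trained with stochastic back-propagation.
n = 4;  h = 2;  eta = 0.1;                  % sizes and learning rate (assumed)
W1 = 0.1*randn(h, n);  W2 = 0.1*randn(n, h);
sig = @(z) 1 ./ (1 + exp(-z));
X = rand(n, 100);                           % placeholder training samples (columns)
for epoch = 1:200
    for k = 1:size(X, 2)
        x  = X(:, k);
        a1 = sig(W1 * x);                   % hidden code: reduced-dimension features
        a2 = sig(W2 * a1);                  % reconstruction of the input
        d2 = (a2 - x) .* a2 .* (1 - a2);    % output error term
        d1 = (W2' * d2) .* a1 .* (1 - a1);  % back-propagated hidden error
        W2 = W2 - eta * d2 * a1';           % stochastic weight updates
        W1 = W1 - eta * d1 * x';
    end
end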

7.4 Heterogeneous Cores

The proposed system has a RISC core, a digital core for clustering and memristor neural cores. These cores are connected through an on-chip routing network. The memristor neural cores could be used to implement autoencoders and feed forward neural networks of different configurations.

7.4.1 Memristor Neural Core

Figure 7.3 shows the memristor based single neural core architecture. It consists of a memristor crossbar of size 400×200, input and output buffers, a training unit, and a control unit. Our objective was to use as large a crossbar as possible, because this enables more computations to be done in parallel. Experimenting with different crossbar sizes, we observed that a 400×200 crossbar is affected very little by sneak paths for the memristor device considered (high resistance values). The control unit manages the input and output buffers and interacts with the specific routing switch associated with the core. The control unit will be implemented as a finite state machine and thus will have low overhead. Processing in the core is analog, and the entire core processes an input in one cycle. The neural cores communicate with each other in digital form, as it is expensive to exchange and latch analog signals. Neuron outputs are discretized using a three-bit ADC for each neuron and are stored in the output buffer for routing.


Inputs come to the neural core through the routing network in digital form and are stored in the input buffer. They are converted into analog form before being applied to the memristor crossbar.

Figure 7.3. Single memristor neural core architecture.

7.4.2 Digital Clustering Core

We have designed a digital core for performing k-means clustering on the feature representation output of the autoencoder. The core can be configured to generate up to 32 clusters, and the maximum input dimension is 32 after dimensionality reduction using the autoencoder. This system performs clustering based on Manhattan distance calculations.

Assume that our data dimension is n and the number of clusters is m. For an element of the data sample, the corresponding Manhattan distances to the current m cluster centers are evaluated in parallel and are accumulated in the dist. registers (Figure 7.4). If a subtractor output (in the first row of subtractors) is negative, it is subtracted from the corresponding dist. register; otherwise it is added. After n iterations (one for each element of the input sample), the Manhattan distances between the input sample and the current cluster centers will be in the dist. registers. In the next m cycles, the minimum distance value and the corresponding cluster center index are evaluated using the circuit shown in the right part of the figure.

The system has another set of registers for each cluster center (Figure 7.4) to store the accumulated values of the data samples belonging to that cluster. One counter is maintained for each cluster to keep track of how many data samples belong to that cluster. The data sample is accumulated in the center accumulator register corresponding to the minimum distance cluster. To minimize hardware costs and processing time, this operation is overlapped with the distance calculation step for the next data pattern in Figure 7.4.

When all the training data samples have been assigned to clusters based on the minimum distances, the new cluster centers are evaluated by dividing the center accumulator registers by the corresponding sample counter values. After this, the cluster centers are updated and the operations for the next epoch are performed based on the new centers.
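The operation of the core corresponds to the following software sketch of k-means with Manhattan distance; the data, cluster count, initialization, and epoch count are placeholders, while the accumulator, counter, and division steps mirror the hardware description above.

% k-means with Manhattan (L1) distance, mirroring the clustering core's steps.
X = rand(1000, 16);                              % placeholder samples (rows), n = 16
m = 8;                                           % number of clusters (up to 32 in hardware)
centers = X(randperm(size(X,1), m), :);          % initial cluster centers
for epoch = 1:10
    acc = zeros(m, size(X,2));  cnt = zeros(m, 1);
    for k = 1:size(X, 1)
        d = sum(abs(bsxfun(@minus, centers, X(k,:))), 2);  % L1 distance to each center
        [~, idx] = min(d);                       % closest center (minimum-distance search)
        acc(idx,:) = acc(idx,:) + X(k,:);        % center accumulator register
        cnt(idx)   = cnt(idx) + 1;               % per-cluster sample counter
    end
    nz = cnt > 0;                                % avoid dividing empty clusters
    centers(nz,:) = bsxfun(@rdivide, acc(nz,:), cnt(nz));  % new centers for next epoch
end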

Figure 7.4. Digital clustering core design implementing the k-means clustering algorithm.


7.5 Experimental Setup

7.5.1 Applications and Datasets

We have examined the MNIST [61] and ISOLET [64] datasets for classification and clustering tasks. The classification applications utilized deep networks, where the autoencoder was used for unsupervised layer-wise pre-training of the networks. For clustering applications, dimensionality reduction and feature extraction are performed using the autoencoder; after that, the k-means clustering algorithm is applied to the data generated by the autoencoder. Autoencoder based anomaly detection was examined on the KDD dataset [62]. Table 7.1 shows the neural network configurations for the different datasets and applications.

Table 7.1: Neural network configurations.
Application                Dataset   Configuration
Anomaly detection          KDD       41→15→41
Classification             MNIST     784→300→200→100→10
Classification             ISOLET    617→2000→1000→500→250→26
Dimensionality reduction   MNIST     784→300→200→100→20
Dimensionality reduction   ISOLET    617→2000→1000→500→250→20

7.5.2 Mapping Neural Networks to Cores

The neural hardware is not able to time-multiplex neurons because their synaptic weights are stored directly within the neural circuits. Hence a neural network's structure may need to be modified to fit into a neural core. In cases where the networks were significantly smaller than the neural core memory, multiple neural layers were mapped to a core. In this case, the layers executed in a pipelined manner, where the outputs of layer 1 were fed back into layer 2 on the same core through the core's routing switch.

When a software network layer was too large to fit into a core (either because it needed too many inputs or it had too many neurons), the network layer was split amongst multiple cores. Splitting a layer across multiple cores due to a large number of output neurons is trivial. When there were too many inputs per neuron for a core, each neuron was split into multiple smaller neurons as shown in Figure 7.5. When splitting a neuron, the network needs to be trained based on the new network topology. As the network topology is determined prior to training (based on the neural hardware architecture), the split neuron weights are trained correctly.

Figure 7.5. Splitting a neuron into multiple smaller neurons.

7.5.3 Area Power Calculations

The area, power, and timing of the SRAM array used in the clustering core were calculated using CACTI [74] with the low operating power transistor option. We assumed a 45 nm process in our area and power calculations. Components of a typical cache that would not be needed in the neural core (such as the tag array and tag comparator) were not included in the calculations. The clustering core also requires adders, registers, multiplexers, and a control unit. The power of these basic components was determined through SPICE simulations. A frequency of 200 MHz was assumed for the digital system to keep power consumption low.

For the memristor cores, detailed SPICE simulations were used for power and timing calculations of the analog circuits (drivers, crossbar, and activation function circuits). These simulations considered the wire resistance and capacitance within the crossbar as well. The results show that the crossbar required 20 ns to be evaluated. As the memristor crossbars evaluate all neurons in one step, the majority of time in these systems is spent transferring neuron outputs between cores through the routing network. We assumed that the routing would run at a 200 MHz clock, resulting in 4 cycles being allotted for crossbar processing. The routing link power was calculated using Orion [75] (assuming 8 bits per link). Off-chip I/O energy was also considered as described in section 7.2. Data transfer energy via TSV was assumed to be 0.05 pJ/bit [76].

7.6 Results

7.6.1 Supervised Training Result

We have performed detailed simulations of the memristor crossbar based neural network systems. MATLAB (R2014a) and SPICE (LTspice IV) were used to develop a simulation framework for applying the training algorithm to a multi-layer neural network circuit. SPICE was mainly used for detailed analog simulations of the memristor crossbar array and MATLAB was used to simulate the rest of the system. The circuits were trained by applying input patterns one by one until the errors were below the desired levels.

Simulation of the memristor device used an accurate model of the device published in [53]. The memristor device simulated for this circuit was published in [9], and the switching characteristics for the model are displayed in Figure 3.25. This device was chosen for its high minimum resistance value and large resistance ratio. According to the data presented in [9], this device has a minimum resistance of 10 kΩ, a resistance ratio of 10^3, and the full resistance range of the device can be switched in 20 μs by applying 2.5 V across the device.

SPICE simulations of large memristor crossbars are very time consuming. Therefore, we examined the on-chip supervised training approach with the Iris dataset [60], which utilizes crossbars of manageable sizes. A two layer network with four inputs, ten hidden neurons and one output neuron was utilized. Figure 7.6 shows the Matlab-SPICE training simulation on the Iris dataset. It shows that the neural network was able to learn the desired classifiers.

Figure 7.6. Learning curve (MSE vs. epoch) for the classification based on the Iris dataset.

7.6.2 Unsupervised Training Result

Unsupervised training of the autoencoder was performed on the Iris dataset. The network configuration utilized was 4→12→2→4. After training, the two hidden neuron outputs give the feature representation of the corresponding input data in a reduced dimension of two. Figure 7.7 shows the distribution of the data of the three different classes (setosa, versicolor, virginica) in the feature space. We can observe that data belonging to the same class appear close together in the feature space and could potentially be linearly separated.

Figure 7.7. Distribution of the data of different classes (setosa, versicolor, virginica) in the feature space.

7.6.3 Anomaly Detection

We have examined autoencoder based anomaly detection on the KDD dataset. The network configuration for this application was 39→15→39. As the SPICE simulation of this network size takes a very long time, we performed this experiment in MATLAB. During training, the autoencoder was trained using only normal data packets. It is expected that during evaluation, the differences between the input and its reconstruction will be smaller for normal data packets than for attack packets. The network was trained with only 5,292 normal packet records (no anomalous packets were used during training).
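A minimal sketch of the detection rule is shown below, assuming the 39→15→39 autoencoder has already been trained on normal packets only; the weights, the normalized packet record, and the threshold (the decision parameter swept in Figure 7.10) are placeholders for illustration.

% Flag a packet as anomalous if its reconstruction distance exceeds a threshold.
sig = @(z) 1 ./ (1 + exp(-z));
W1 = 0.1*randn(15, 39);  W2 = 0.1*randn(39, 15);  % placeholder trained weights
x  = rand(39, 1);                                 % one normalized packet record
xr = sig(W2 * sig(W1 * x));                       % reconstruction of the packet
dist        = norm(x - xr);                       % input-reconstruction distance
threshold   = 0.5;                                % placeholder decision parameter
isAnomalous = dist > threshold;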

Figures 7.8 and 7.9 show the distributions of the distances between inputs and the corresponding reconstructions for the normal packets and the attack packets respectively. From Figure 7.10 it is seen that the system can detect about 96.6% of the anomalous packets with a 4% false detection rate. Similar approaches could be used for other big data applications where the objective is to detect anomalous patterns from large volumes of continuously streaming data in real time at low power.

Figure 7.8. Distance between original data and reconstructed data for normal packets.

Figure 7.9. Distance between original data and reconstructed data for attack packets.

Figure 7.10. Anomaly detection rate for the test dataset for different decision parameters (detection rate vs. false detection rate).

7.6.4 Impact of System Constraints on Application Accuracy

The hardware neural network training circuit differs from the software implementations in the following aspects:

- Limited precision of the discretized neuron errors and DPj values.

- Limited precision of the discretized neuron output values.

- Each neuron can have a maximum of 400 synapses.

Figure 7.11 compares the accuracies obtained from MATLAB implementations of the applications with the hardware system constraints enforced against implementations without those constraints. It is seen that even with the system constraints enforced, the applications still give competitive performance.
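The constrained runs can be mimicked in software by quantizing the relevant signals, for example as in the sketch below; the value ranges and the quantization helper are illustrative assumptions rather than the exact procedure used for Figure 7.11.

% Quantize neuron outputs to 3 bits and neuron errors to 8 bits (assumed ranges).
quantize = @(v, lo, hi, bits) lo + round((min(max(v, lo), hi) - lo) ...
    ./ (hi - lo) * (2^bits - 1)) * (hi - lo) / (2^bits - 1);
y_raw = rand(10, 1);                      % unconstrained neuron outputs (placeholder)
y_q   = quantize(y_raw, 0, 1, 3);         % 3-bit discretized neuron outputs
e_raw = 0.1*randn(10, 1);                 % back-propagated neuron errors (placeholder)
e_q   = quantize(e_raw, -1, 1, 8);        % 8-bit discretized neuron errors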

Figure 7.11. Impact of the memristor system constraints (3-bit neuron output and 8-bit neuron error precision) on application accuracy.

7.6.5 Single Core Area and Power

The area of the digital clustering core is 0.039 mm² and its power consumption is 1.36 mW. Training 1000 samples for one epoch in this core takes 0.32 μs.

The memristor neural core configuration is 400×100, i.e. it can take a maximum of 400 inputs and can process a maximum of 100 neurons. The area of a single memristor neural core is 0.0163 mm². Table 7.2 shows the timing and power of a single memristor core at different steps of execution. The RISC core is used only for configuring the cores, routers, and DMA engine; we therefore assume that the RISC core is turned off during the actual training and evaluation phases.


Table 7.2: Memristor core timing and power for different execution steps.

                               Time (μs)   Power (mW)
  Forward pass (recognition)   0.27        0.794
  Backward pass                0.80        0.706
  Weight update                1.00        6.513
  Control unit                 --          0.0004

7.6.6 System Level Evaluations

Total system area: The whole multicore system includes 144 memristor neural cores, one digital clustering core, one RISC core, one DMA controller, 4 kB of input buffer, and 1 kB of output buffer. The RISC core area was evaluated using McPAT [72] and came to 0.52 mm². The total system area was 2.94 mm².
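As a quick arithmetic check, the per-component areas quoted above can be summed directly; attributing the small remainder to the routers, DMA controller, and I/O buffers is our inference rather than a breakdown stated in the text:

    # Area check (mm^2) using the figures quoted in Sections 7.6.5 and 7.6.6.
    memristor_cores = 144 * 0.0163   # 144 memristor neural cores
    clustering_core = 0.039          # digital clustering core
    risc_core = 0.52                 # RISC core area from McPAT
    subtotal = memristor_cores + clustering_core + risc_core
    print(round(subtotal, 3))        # ~2.906 mm^2
    # The remaining ~0.03 mm^2 of the reported 2.94 mm^2 total would be the DMA
    # controller, routers, and the 4 kB / 1 kB input and output buffers.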

Energy efficiency: We have compared the throughput and energy benefits of the proposed architecture over an Nvidia Tesla K20 GPU. The GPU consumes 225 W and has a die area of 561 mm² in a 28 nm process. Figures 7.12 and 7.13 show the throughput and energy efficiencies, respectively, of the proposed system over the GPU for different applications during training. For training, the proposed architecture provides up to 30× speedup and four to six orders of magnitude better energy efficiency than the GPU.


Figure 7.12. Application speedup over GPU for training.

Figure 7.13. Energy efficiency of the proposed system over GPU for training.

Figures 7.14 and 7.15 show the throughput and energy efficiencies, respectively, of the proposed system over the GPU for different applications during evaluation of new inputs. For recognition, the proposed architecture provides up to 50× speedup and five to six orders of magnitude better energy efficiency than the GPU.

Figure 7.14. Application speedup over GPU for recognition.

Figure 7.15. Energy efficiency of the proposed system over GPU for evaluation of a new input.

Table 7.3 shows the number of cores used and the time and energy for a single training data item in one iteration. Table 7.4 shows the evaluation time and energy for a single test input. In both the training and recognition phases, the computation energy dominated the total system energy consumption.


Table 7.3: Number of memristor cores used, time, and energy for a single training input in the proposed architecture.

                   # of cores   Time (μs)   Compute energy (J)   IO energy (J)   Total energy (J)
  Mnist_class      57           7.29        4.18E-07             8.48E-09        4.26E-07
  Mnist_AE         57           17.99       8.37E-07             8.57E-09        8.45E-07
  Mnist_kmeans     1            0.42        9.67E-10             4.47E-12        9.71E-10
  Isolate_AE       132          24.41       1.97E-06             2.68E-08        1.99E-06
  Isolate_kmeans   1            0.42        9.67E-10             4.47E-12        9.71E-10
  Isolet_class     132          8.86        9.67E-07             2.67E-08        9.94E-07
  KDD_anomaly      1            4.15        7.33E-09             4.51E-09        1.18E-08

Table 7.4: Recognition time and energy for one input in the proposed architecture.

                   Time (μs)   Compute energy (J)   IO energy (J)   Total energy (J)
  Mnist_class      0.77        1.42E-08             8.43E-09        2.26E-08
  Mnist_AE         0.77        1.42E-08             8.43E-09        2.26E-08
  Mnist_kmeans     0.32        8.89E-10             3.69E-12        8.93E-10
  Isolate_AE       0.77        3.28E-08             2.67E-08        5.94E-08
  Isolate_kmeans   0.32        8.89E-10             3.69E-12        8.93E-10
  Isolet_class     0.77        3.28E-08             2.67E-08        5.94E-08
  KDD_anomaly      0.77        2.48E-10             4.48E-09        4.73E-09

7.7 Summary

In this chapter we have proposed a multicore heterogeneous architecture for big data processing. The system can execute key machine learning algorithms such as deep neural networks, autoencoders, and k-means clustering, and it supports both training and recognition (evaluation of new inputs). We utilized memristor crossbars to implement energy efficient neural cores. Our experimental evaluations show that the proposed architecture can provide four to six orders of magnitude better energy efficiency than GPUs for big data processing.


CHAPTER VIII

MEMRISTOR CROSSBAR BASED STRING MATCHING CIRCUIT

String searching or string matching algorithms are an important class of algorithms that find the places where one or several strings (or patterns) occur within a larger string or text. String matching has applications in several areas including network intrusion detection, bioinformatics, plagiarism detection, information security, general pattern recognition, document matching, and text mining [78].

8.1 String Matching Circuit

Circuits for binary string matching and for minterms of Boolean variables are similar. First we demonstrate the implementation of minterms using discrete memristor states, and later we utilize this circuit to implement the string matching circuit. The neuron circuit used here is similar to the circuit presented in Figure 3.2(b). Recall that in our design logic low is represented by -1 V and logic high by 1 V. To implement any four input minterm, the weights associated with normal (uncomplemented) variables in the minterm are set to the maximum positive value (w = 1/Ron - 1/Roff) and the weights associated with complemented variables are set to the maximum negative value (-w). For instance, to implement the minterm ab'c'd, the weights for inputs a and d will be w and the weights for inputs b and c will be -w. Then for the applied input (a=1, b=-1, c=-1, d=1) the sum of the inputs multiplied by the corresponding weights is 4w. If any input other than the one satisfying the minterm is applied, the sum of the inputs multiplied by the weights is less than or equal to 2w. We set the bias inputs and weights such that the neuron gives a logic high output when this weighted sum exceeds 3w, and a logic low output otherwise. Figure 8.1 shows the neuron implementing the minterm ab'c'd. Other four input minterms can be implemented similarly.

Figure 8.1. Neuron implementing minterm AB̅C̅D.
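The threshold argument above can be checked with a small numerical sketch (Python; the ±1 encoding stands in for the ±1 V inputs, the abstract weight w stands in for the crossbar conductances, and the helper name is ours):

    def minterm_output(inputs, pattern, w=1.0):
        # `inputs`  : tuple of +1/-1 values for (a, b, c, d)
        # `pattern` : +1 for a normal literal, -1 for a complemented literal
        # Weights are +w for normal literals and -w for complemented ones, so the
        # weighted sum reaches 4w only for the single matching input combination.
        weighted_sum = sum(x * p * w for x, p in zip(inputs, pattern))
        return 1 if weighted_sum > 3 * w else -1  # logic high only on exact match

    # Minterm ab'c'd corresponds to the pattern (+1, -1, -1, +1):
    print(minterm_output((1, -1, -1, 1), (1, -1, -1, 1)))  # 1  (match, sum = 4w)
    print(minterm_output((1, 1, -1, 1), (1, -1, -1, 1)))   # -1 (one bit off, sum = 2w)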

In general, to match an m-bit string we will have 4m-2 inputs to the crossbar. Among them, 2m inputs correspond to the m bits we want to match (m normal inputs and the corresponding m complemented inputs). If we want to match logic high for a particular input, the memristor associated with that input is set to Ron and the memristor associated with the corresponding complemented input is set to Roff, and vice versa. We will also have m-1 constant logic high inputs and m-1 constant logic low inputs; the memristors associated with the high inputs are set to Roff and those associated with the low inputs are set to Ron.
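This construction can be summarized as a small sketch that assigns one of the two discrete resistance states to each crossbar row for a given target string (Python; the row labels, function name, and symbolic Ron/Roff values are ours):

    R_ON, R_OFF = "Ron", "Roff"  # the two discrete memristor resistance levels

    def string_match_column(target_bits):
        # An m-bit target uses 4m-2 crossbar rows: m normal inputs, m complemented
        # inputs, m-1 constant logic high inputs, and m-1 constant logic low inputs.
        m = len(target_bits)
        rows = []
        for i, bit in enumerate(target_bits):
            if bit:  # match logic high: normal input -> Ron, complement -> Roff
                rows.append(("x%d" % i, R_ON))
                rows.append(("x%d_bar" % i, R_OFF))
            else:    # match logic low: polarities swapped
                rows.append(("x%d" % i, R_OFF))
                rows.append(("x%d_bar" % i, R_ON))
        rows += [("bias_high", R_OFF)] * (m - 1)  # inputs tied to logic high (1 V)
        rows += [("bias_low", R_ON)] * (m - 1)    # inputs tied to logic low (-1 V)
        assert len(rows) == 4 * m - 2
        return rows

    # Example: one column configured to fire only for the 4-bit pattern 1 0 0 1
    for label, state in string_match_column([1, 0, 0, 1]):
        print(label, state)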

8.2 Scaling for Matching Long String

It is important to know the maximum number of bits we can match in a single memristor crossbar column. Consider two strings s1 and s2 of the same length. Assume all the bits of s1 are high and all the bits of s2 except the first are high. Figures 8.2 and 8.3 show that as we increase the length of the string to be matched, the magnitude of the voltage across the load resistor at the bottom of the circuit decreases. Our studies show that a single crossbar column can match up to 112 bits (or 14 characters) successfully.

In a single memristor crossbar, each column can examine a separate string, allowing us to match multiple strings concurrently. To match longer strings, we can match parts of the string in multiple crossbar columns and AND the results. The AND function can be implemented using a memristor crossbar in the same way we implement minterms of Boolean variables.
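A short sketch of this decomposition, assuming the 112-bit per-column limit found above (Python; the names are illustrative), is given below:

    CHUNK_BITS = 112  # maximum bits matched reliably in one crossbar column

    def split_into_columns(bit_string):
        # Split a long bit string into pieces of at most 112 bits, one column each.
        return [bit_string[i:i + CHUNK_BITS]
                for i in range(0, len(bit_string), CHUNK_BITS)]

    def long_match(column_results):
        # AND the per-column match outputs; in hardware this AND is itself
        # implemented with a memristor crossbar column, as noted above.
        return all(column_results)

    # Example: a 300-bit pattern occupies three columns (112 + 112 + 76 bits),
    # and the overall match is the AND of the three column outputs.
    print(len(split_into_columns([1] * 300)))  # 3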

Figure 8.2. Length of the string vs. voltage across the load resistor when string s1 is applied to the crossbar.


Figure 8.3. Length of the string vs. voltage across the load resistor when string s2 is applied to the crossbar.

8.3 Summary

Systems based on memristor string matching circuits could provide low power, high throughput execution for document matching, text mining, and bioinformatics applications. The proposed design uses only two discrete memristor resistance levels. It does not require any on-chip training circuit, and hence the circuit overhead is reduced significantly.


CHAPTER IX

CONCLUSION

This thesis examined in-situ training of memristor based multi-layer neural networks in which the entire crossbar is updated in four steps for a training instance (data). Existing training approaches update a crossbar serially, column by column. Training of memristor based deep neural networks was examined using autoencoders for layer-wise pre-training. We proposed a novel technique for ex-situ training of memristor based neural networks which takes sneak-path currents into consideration. Multicore architectures based on memristor neural cores were developed and system level area and power were compared with traditional computing systems. Results showed that the memristor neural network based architectures could be about five orders of magnitude more energy efficient than traditional computing systems.

The applications of the proposed work are very broad. It will enable extremely energy efficient processing that can be integrated into a wide range of architectures, including mobile SoCs (such as cell phone processors), low power sensors, robotics, bio-medical devices, and high performance computing systems. In particular, the on-line (in-situ) learning capability is highly applicable to areas such as big data, cybersecurity, and adaptive processing systems (e.g. adaptive control systems). Our future research will examine the fabrication and testing of memristor neural network based processing systems. We would also examine training of memristor based deep networks for larger and more challenging datasets such as CIFAR-10 and ImageNet [79].


BIBLIOGRAPHY

[1] H. Esmaeilzadeh, E. Blem, R. Amant, K. Sankaralingam, and D. Burger, "Dark silicon and the end of multicore scaling," in Proceedings of the 38th Annual International Symposium on Computer Architecture, pp. 365-376, 2011.
[2] B. M. Rogers, A. Krishna, G. B. Bell, K. Vu, X. Jiang, and Y. Solihin, "Scaling the bandwidth wall: challenges in and avenues for CMP scaling," ACM SIGARCH Computer Architecture News, vol. 37, no. 3, pp. 371-382, 2009.
[3] J. Han and M. Orshansky, "Approximate computing: An emerging paradigm for energy-efficient design," 18th IEEE European Test Symposium (ETS), pp. 1-6, 27-30 May 2013.
[4] H. Esmaeilzadeh, A. Sampson, L. Ceze, and D. Burger, "Architecture support for disciplined approximate programming," in Proceedings of the Seventeenth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS XVII), ACM, New York, NY, USA, 2012.
[5] V. K. Chippa, S. T. Chakradhar, K. Roy, and A. Raghunathan, "Analysis and characterization of inherent application resilience for approximate computing," 50th ACM/EDAC/IEEE Design Automation Conference (DAC), pp. 1-9, May 29-June 7, 2013.
[6] V. K. Chippa, D. Mohapatra, A. Raghunathan, K. Roy, and S. T. Chakradhar, "Scalable effort hardware design: exploiting algorithmic resilience for energy efficiency," in Proceedings of the 47th Design Automation Conference, ACM, New York, NY, USA, 2010.
[7] S. Venkataramani, V. K. Chippa, S. T. Chakradhar, K. Roy, and A. Raghunathan, "Quality programmable vector processors for approximate computing," in Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture, USA, 2013.
[8] L. O. Chua, "Memristor—The Missing Circuit Element," IEEE Transactions on Circuit Theory, vol. 18, no. 5, pp. 507-519, 1971.
[9] S. Yu, Y. Wu, and H.-S. P. Wong, "Investigating the switching dynamics and multilevel capability of bipolar metal oxide resistive switching memory," Applied Physics Letters, vol. 98, 103514, 2011.
[10] X. Dong, C. Xu, Y. Xie, and N. P. Jouppi, "NVSim: A Circuit-Level Performance, Energy, and Area Model for Emerging Nonvolatile Memory," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 31, no. 7, pp. 994-1007, July 2012.
[11] D. Soudry, D. D. Castro, A. Gal, A. Kolodny, and S. Kvatinsky, "Memristor-Based Multilayer Neural Networks With Online Gradient Descent Training," IEEE Transactions on Neural Networks and Learning Systems, issue 99, 2015.
[12] P. Dubey, "Recognition, mining and synthesis moves computers to the era of tera," Technology@Intel Magazine, Feb. 2005.
[13] T. Chen, Y. Chen, M. Duranton, Q. Guo, A. Hashmi, M. Lipasti, A. Nere, S. Qiu, M. Sebag, and O. Temam, "BenchNN: On the Broad Potential Application Scope of Hardware Neural Network Accelerators," IEEE International Symposium on Workload Characterization (IISWC), November 2012.
[14] S. A. K. Al-Omari, P. Sumari, S. A. Al-Taweel, and A. J. A. Husain, "Digital Recognition using Neural Network," Journal of Computer Science, vol. 5, pp. 427-434, 2009.
[15] M. Egmont-Petersen, D. de Ridder, and H. Handels, "Image processing with neural networks: a review," Pattern Recognition, vol. 35, no. 10, pp. 2279-2301, October 2002.
[16] R. D. Dony and S. Haykin, "Neural network approaches to image compression," Proceedings of the IEEE, vol. 83, no. 2, pp. 288-303, Feb. 1995.
[17] D. W. Ruck, S. K. Rogers, M. Kabrisky, M. E. Oxley, and W. B. Suter, "The multilayer perceptron as an approximation to a Bayes optimal discriminant function," IEEE Transactions on Neural Networks, vol. 1, no. 4, pp. 296-298, Dec. 1990.
[18] E. M. Izhikevich, "Simple Model of Spiking Neurons," IEEE Transactions on Neural Networks, vol. 14, pp. 1569-1572, 2003.
[19] B. Zineddin, Z. Wang, and X. Liu, "Cellular neural networks, the Navier-Stokes equation, and microarray image reconstruction," IEEE Transactions on Image Processing, vol. 20, no. 11, pp. 3296-3301, 2011.
[20] S. Esser, A. Andreopoulos, R. Appuswamy, P. Datta, D. Barch, A. Amir, J. Arthur, A. Cassidy, M. Flickner, P. Merolla, S. Chandra, N. Basilico, S. Carpin, T. Zimmerman, F. Zee, R. Alvarez-Icaza, J. Kusnitz, T. Wong, W. Risk, E. McQuinn, T. Nayak, R. Singh, and D. Modha, "Cognitive Computing Systems: Algorithms and Applications for Networks of Neurosynaptic Cores," International Joint Conference on Neural Networks, Dallas, Texas, 2013.
[21] H. Esmaeilzadeh, A. Sampson, L. Ceze, and D. Burger, "Neural Acceleration for General-Purpose Approximate Programs," International Symposium on Microarchitecture (MICRO), 2012.
[22] B. Han and T. M. Taha, "Acceleration of spiking neural network based pattern recognition on NVIDIA graphics processors," Journal of Applied Optics, 2010.
[23] J. M. Nageswaran, N. Dutt, J. L. Krichmar, A. Nicolau, and A. Veidenbaum, "Efficient simulation of large-scale spiking neural networks using CUDA graphics processors," in Proceedings of the International Joint Conference on Neural Networks (IJCNN), pp. 3201-3208, NJ, USA, 2009.
[24] T. M. Taha, P. Yalamanchili, M. Bhuiyan, R. Jalasutram, C. Chen, and R. Linderman, "Neuromorphic algorithms on clusters of PlayStation 3s," International Joint Conference on Neural Networks (IJCNN), pp. 1-10, 18-23 July 2010.
[25] H. Hellmich and H. Klar, "An FPGA based simulation acceleration platform for spiking neural networks," The 47th Midwest Symposium on Circuits and Systems (MWSCAS), vol. 2, pp. 389-392, July 2004.
[26] S. B. Furber, S. Temple, and A. D. Brown, "High-Performance Computing for Systems of Spiking Neurons," Proceedings of the AISB'06 Workshop on GC5: Architecture of Brain and Mind, vol. 2, pp. 29-36, Bristol, April 2006.
[27] J. Schemmel, J. Fieres, and K. Meier, "Wafer-Scale Integration of Analog Neural Networks," IEEE International Joint Conference on Neural Networks (IJCNN), 2008.
[28] http://spectrum.ieee.org/magazine/2012/August
[29] P. Merolla, J. Arthur, F. Akopyan, N. Imam, R. Manohar, and D. S. Modha, "A digital neurosynaptic core using embedded crossbar memory with 45pJ per spike in 45nm," IEEE Custom Integrated Circuits Conference (CICC), pp. 1-4, 19-21 Sept. 2011.
[30] J. V. Arthur, P. A. Merolla, F. Akopyan, R. Alvarez, A. Cassidy, S. Chandra, S. K. Esser, N. Imam, W. Risk, D. B. D. Rubin, R. Manohar, and D. S. Modha, "Building block of a programmable neuromorphic substrate: A digital neurosynaptic core," The International Joint Conference on Neural Networks (IJCNN), pp. 1-8, June 2012.
[31] J. Schemmel, J. Fieres, and K. Meier, "Wafer-Scale Integration of Analog Neural Networks," IEEE International Joint Conference on Neural Networks (IJCNN), 2008.
[32] P. A. Merolla, J. V. Arthur, R. Alvarez-Icaza, A. S. Cassidy, J. Sawada, F. Akopyan, B. L. Jackson, N. Imam, C. Guo, Y. Nakamura, B. Brezzo, I. Vo, S. K. Esser, R. Appuswamy, B. Taba, A. Amir, M. D. Flickner, W. P. Risk, R. Manohar, and D. S. Modha, "A million spiking-neuron integrated circuit with a scalable communication network and interface," Science, vol. 345, no. 6197, pp. 668-673, 2014.
[33] Y. Chen, T. Luo, S. Liu, S. Zhang, L. He, J. Wang, L. Li, T. Chen, Z. Xu, N. Sun, and O. Temam, "DaDianNao: A Machine-Learning Supercomputer," in Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-47), IEEE Computer Society, Washington, DC, USA, pp. 609-622, 2014.
[34] D. Liu, T. Chen, S. Liu, J. Zhou, S. Zhou, O. Temam, X. Feng, X. Zhou, and Y. Chen, "PuDianNao: A Polyvalent Machine Learning Accelerator," in Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '15), ACM, New York, NY, USA, pp. 369-381, 2015.
[35] Z. Du, R. Fasthuber, T. Chen, P. Ienne, L. Li, T. Luo, X. Feng, Y. Chen, and O. Temam, "ShiDianNao: shifting vision processing closer to the sensor," in Proceedings of the 42nd Annual International Symposium on Computer Architecture, pp. 92-104, ACM, 2015.
[36] R. S. Amant, A. Yazdanbakhsh, J. Park, B. Thwaites, H. Esmaeilzadeh, A. Hassibi, L. Ceze, and D. Burger, "General-purpose code acceleration with limited-precision analog computation," in Proceedings of the 41st Annual International Symposium on Computer Architecture (ISCA '14), IEEE Press, Piscataway, NJ, USA, pp. 505-516, 2014.
[37] B. Belhadj, A. J. L. Zheng, R. Héliot, and O. Temam, "Continuous real-world inputs can open up alternative accelerator designs," SIGARCH Computer Architecture News, vol. 41, no. 3, June 2013.
[38] C. Zamarreño-Ramos, L. A. Camuñas-Mesa, J. A. Pérez-Carrasco, T. Masquelier, T. Serrano-Gotarredona, and B. Linares-Barranco, "On spike-timing-dependent-plasticity, memristive devices, and building a self-learning visual cortex," Frontiers in Neuroscience, vol. 5, Article 26, pp. 1-22, Mar. 2011.
[39] D. Chabi, W. Zhao, D. Querlioz, and J.-O. Klein, "Robust neural logic block (NLB) based on memristor crossbar array," in Proc. NANOARCH, pp. 137-143, 2011.
[40] F. Alibart, E. Zamanidoost, and D. B. Strukov, "Pattern classification by memristive crossbar circuits with ex-situ and in-situ training," Nature Communications, 2013.
[41] J. A. Starzyk and Basawaraj, "Memristor Crossbar Architecture for Synchronous Neural Networks," IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 61, no. 8, pp. 2390-2401, Aug. 2014.
[42] M. Pd. Sah, C. Yang, H. Kim, and L. O. Chua, "Memristor Circuit for Artificial Synaptic Weighting of Pulse Inputs," IEEE ISCAS, 2012.
[43] S. P. Adhikari, C. Yang, H. Kim, and L. O. Chua, "Memristor Bridge Synapse-Based Neural Network and Its Learning," IEEE Transactions on Neural Networks and Learning Systems, vol. 23, no. 9, pp. 1426-1435, 2012.
[44] B. Li, Y. Wang, Y. Wang, Y. Chen, and H. Yang, "Training itself: Mixed-signal training acceleration for memristor-based neural network," 19th Asia and South Pacific Design Automation Conference (ASP-DAC), 20-23 Jan. 2014.
[45] E. Zamanidoost, F. M. Bayat, D. Strukov, and I. Kataeva, "Manhattan Rule Training for Memristive Crossbar Circuit Pattern Classifiers," IEEE International Joint Conference on Neural Networks, 2015.
[46] E. Zamanidoost, M. Klachko, D. Strukov, and I. Kataeva, "Low area overhead in-situ training approach for memristor-based classifier," 2015 IEEE/ACM International Symposium on Nanoscale Architectures (NANOARCH), pp. 139-142, 8-10 July 2015.
[47] X. Liu, M. Mao, H. Li, Y. Chen, H. Jiang, J. J. Yang, Q. Wu, and M. Barnell, "A heterogeneous computing system with memristor-based neuromorphic accelerators," IEEE High Performance Extreme Computing Conference (HPEC), pp. 1-6, 2014.
[48] X. Liu, M. Mao, B. Liu, H. Li, Y. Chen, B. Li, Y. Wang, H. Jiang, M. Barnell, Q. Wu, and J. Yang, "RENO: A high-efficient reconfigurable neuromorphic computing accelerator design," 52nd ACM/EDAC/IEEE Design Automation Conference (DAC), pp. 1-6, 8-12 June 2015.
[49] S. Theodoridis and K. Koutroumbas, Pattern Recognition, Fourth Edition, Academic Press, 2008.
[50] S. Russell and P. Norvig, Artificial Intelligence: A Modern Approach (2nd Edition), Prentice Hall, 2002, ISBN-13: 978-01379039555.
[51] M. Prezioso, F. Merrikh-Bayat, B. D. Hoskins, G. C. Adam, K. K. Likharev, and D. B. Strukov, "Training and operation of an integrated neuromorphic network based on metal-oxide memristors," Nature, vol. 521, no. 7550, pp. 61-64, 2015.
[52] S. H. Jo, T. Chang, I. Ebong, B. B. Bhadviya, P. Mazumder, and W. Lu, "Nanoscale memristor device as synapse in neuromorphic systems," Nano Letters, vol. 10, no. 4, pp. 1297-1301, 2010.
[53] C. Yakopcic, T. M. Taha, G. Subramanyam, and R. E. Pino, "Memristor SPICE Model and Crossbar Simulation Based on Devices with Nanosecond Switching Time," IEEE International Joint Conference on Neural Networks (IJCNN), August 2013.
[54] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle, "Greedy layer-wise training of deep networks," Advances in Neural Information Processing Systems, vol. 19, issue 153, 2007.
[55] H. Larochelle, Y. Bengio, J. Louradour, and P. Lamblin, "Exploring strategies for training deep neural networks," The Journal of Machine Learning Research, vol. 10, pp. 1-40, 2009.
[56] D. B. Strukov and R. S. Williams, "Four-dimensional address topology for circuits with stacked multilayer crossbar arrays," Proceedings of the National Academy of Sciences, vol. 106, no. 48, pp. 20155-20158, 2009.
[57] Y. Gao, D. C. Ranasinghe, S. F. Al-Sarawi, O. Kavehei, and D. Abbott, "Memristive crypto primitive for building highly secure physical unclonable functions," Scientific Reports, vol. 5, 2015.
[58] https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)
[59] https://archive.ics.uci.edu/ml/datasets/Wine
[60] https://archive.ics.uci.edu/ml/datasets/Iris
[61] http://yann.lecun.com/exdb/mnist/
[62] http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html
[63] C. Yakopcic, R. Hasan, and T. M. Taha, "Memristor Based Neuromorphic Circuit for Ex-Situ Training of Multi-Layer Neural Network Algorithms," IEEE IJCNN, 2015.
[64] https://archive.ics.uci.edu/ml/datasets/ISOLET
[65] G. Loh, "3D-Stacked Memory Architectures for Multi-core Processors," 35th International Symposium on Computer Architecture (ISCA '08), Beijing, pp. 453-464, 2008.
[66] P. Chow, S. O. Seo, J. Rose, K. Chung, G. Páez-Monzón, and I. Rahardja, "The design of an SRAM-based field-programmable gate array. I. Architecture," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 7, no. 2, pp. 191-197, 1999.
[67] F. Alibart, L. Gao, B. D. Hoskins, and D. B. Strukov, "High precision tuning of state for memristive devices by adaptable variation-tolerant algorithm," Nanotechnology, vol. 23, no. 7, 075201, 2012.
[68] W. Lu, K.-H. Kim, T. Chang, and S. Gaba, "Two-terminal resistive switches (memristors) for memory and logic applications," in Proc. 16th Asia and South Pacific Design Automation Conference, pp. 217-223, 2011.
[69] http://yann.lecun.com/exdb/mnist/
[70] https://www.kaggle.com/c/cifar-10
[71] http://www.ee.surrey.ac.uk/CVSSP/demos/chars74k/
[72] S. Li, J. H. Ahn, R. D. Strong, J. B. Brockman, D. M. Tullsen, and N. P. Jouppi, "McPAT: an integrated power, area, and timing modeling framework for multicore and manycore architectures," in Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture, pp. 469-480, 2009.
[73] D. Burger and T. M. Austin, "The SimpleScalar tool set, version 2.0," SIGARCH Computer Architecture News, vol. 25, no. 3, pp. 13-25, 1997.
[74] N. Muralimanohar, R. Balasubramonian, and N. Jouppi, "Optimizing NUCA Organizations and Wiring Alternatives for Large Caches with CACTI 6.0," in Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 40), Washington, DC, USA, pp. 3-14, 2007.
[75] A. B. Kahng, B. Li, L. S. Peh, and K. Samadi, "ORION 2.0: A fast and accurate NoC power and area model for early-stage design space exploration," Design, Automation & Test in Europe Conference & Exhibition, pp. 423-428, 20-24 April 2009.
[76] J. Gorner, S. Hoppner, D. Walter, M. Haas, D. Plettemeier, and R. Schuffny, "An energy efficient multi-bit TSV transmitter using capacitive coupling," 21st IEEE International Conference on Electronics, Circuits and Systems, pp. 650-653, 7-10 Dec. 2014.
[77] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P. A. Manzagol, "Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion," The Journal of Machine Learning Research, vol. 11, pp. 3371-3408, 2010.
[78] V. SaiKrishna, A. Rasool, and N. Khare, "String Matching and its Applications in Diversified Fields," International Journal of Computer Science Issues, vol. 9, no. 1, pp. 219-226, 2012.
[79] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet Classification with Deep Convolutional Neural Networks," Advances in Neural Information Processing Systems, pp. 1-9, 2012.
[80] https://www.whitehouse.gov/blog/2015/10/15/nanotechnology-inspired-grand-challenge-future-computing