
MEMRISTIVE CROSSBAR ARRAYS FOR SYSTEMS

A thesis submitted to the University of Manchester for the degree of Master of Philosophy in the Faculty of Science and Engineering by MANU V NAIR

2015

School of Electrical and Electronic Engineering

Contents

Abstract 7

Declaration 8

Copyright 9

Acknowledgements 10

The Author 11

1 Introduction 12

2 Hardware neural networks 15
2.1 Introduction...... 15
2.2 Neural Networks...... 15
2.3 Digital versus Analogue Neural Networks...... 18
2.4 Hardware Neural Network Architectures...... 20
2.4.1 Pulse-stream arithmetic based networks...... 20
2.4.2 TrueNorth...... 24
2.4.3 SpiNNaker...... 26
2.4.3.1 Node architecture...... 28
2.4.3.2 Event driven operation...... 29
2.4.3.3 Network communication...... 29
2.4.3.4 Neuron and Synapse model...... 30
2.4.4 Neurogrid...... 31
2.4.4.1 Shared dendrite structure...... 31
2.4.4.2 Neuron and Synapse...... 32


2.4.4.3 Communication...... 33
2.5 Programming neuromorphic hardware...... 35
2.6 Other noteworthy neuromorphic systems...... 35
2.7 Concluding Remarks...... 37

3 Memristive Learning 39
3.1 Memristors...... 39
3.1.1 Boundary condition model...... 41
3.2 Crossbar Arrays...... 43
3.3 Memristive circuits and systems...... 45
3.3.1 Crossbars...... 45
3.3.2 Memristor programming schemes...... 46
3.3.2.1 Unregulated write...... 46
3.3.2.2 Regulated write...... 48
3.3.3 STDP-based learning in memristive crossbar arrays...... 51
3.3.4 Back-propagation...... 53
3.3.5 Dynamical systems and other circuit applications...... 54

4 Gradient-descent in crossbar arrays 55
4.1 Introduction...... 55
4.2 Gradient descent algorithm...... 56
4.3 Gradient descent for linear classifiers...... 58
4.4 Gradient descent in crossbar arrays...... 59
4.5 Unregulated step descent...... 61
4.6 USD versus other methods...... 62
4.7 Implementation...... 63
4.8 Simulations...... 65
4.8.1 Simulation Setup...... 65
4.8.2 Initial condition analysis...... 67
4.8.3 Effect of device parameters on performance...... 69
4.8.4 Effect of variability on performance...... 72
4.8.5 Comparison against floating point implementation...... 73
4.8.6 Performance on the MNIST database...... 74

4.9 USD for other algorithms...... 75
4.9.1 Matrix Inversion...... 75
4.9.2 Auto-encoders...... 76
4.9.3 Restricted Boltzmann Machines...... 77
4.10 Concluding remarks...... 78

5 Conclusion 80

Bibliography 83

A Neural Networks and related algorithms 93
A.1 Introduction...... 93
A.2 Rosenblatt's Perceptron...... 93
A.3 Hopfield network...... 95
A.4 Boltzmann Machines...... 96
A.5 The self-organizing map (SOM)...... 98
A.6 Spiking Neural Networks (SNN)...... 100

List of Figures

2.1 Pulse stream neuron [13] © IEEE...... 22
2.2 A transconductance multiplier [8] © IEEE...... 22
2.3 Self-timed asynchronous communication scheme [8] © IEEE...... 23
2.4 Input pulse probability versus output pulse probability...... 24
2.5 TrueNorth architecture [15] © IEEE...... 25
2.6 A SpiNNaker node [10] © IEEE...... 28
2.7 The SpiNNaker machine [10] © IEEE...... 30
2.8 A network of 4 neurons and 16 synapses with 1-to-1 connectivity [11] © IEEE...... 31
2.9 A shared dendrite network of 4 neurons and 4 synapses [11] © IEEE...... 32
2.10 A Neurogrid neuron [11] © IEEE...... 33
2.11 Block diagram of a Neurogrid tree [11] © IEEE...... 34

3.1 The new circuit element: Memristor [3] © IEEE...... 40
3.2 Tree and Crossbar architecture used in Teramac [43] © IEEE...... 43
3.3 Crossbar array of memristors: 3-D and 2-D representation...... 44
3.4 A simple circuit schematic for (a) Unregulated writing into memristor and (b) Reading the state of the memristor...... 47
3.5 Schematic of a continuous feedback write scheme...... 48
3.6 Sneak paths in a crossbar array...... 49
3.7 Memristor-based analogue memory/computing unit [49] © IEEE...... 50
3.8 A 1T1M crossbar array access to the top left element [33] © IEEE...... 50
3.9 Segmented crossbar architecture [50] © IEEE...... 51
3.10 Voltage across the synapse ξ(∆T) for various action potential shapes [51] © Frontiers of Neuroscience...... 52


4.1 Gradient descent for a 2-dimensional objective function. F(w0,w1) in the figure is the same as F(w,x,y)...... 57
4.2 Block diagram of training module [68]...... 59
4.3 Convergence of USD algorithm when finding the minima of a paraboloid function...... 62
4.4 4-phase training scheme. The voltage levels in the pulsing scheme can only take three levels: +Vdrive, 0, and −Vdrive [68]...... 64
4.5 Convergence behaviour for different weight initializations...... 67
4.6 Weight updates versus iterations when initialized close to 0...... 68
4.7 Weight updates versus iterations when initialized to Gratio/2...... 68
4.8 Weight updates versus iterations when initialized close to 1...... 69
4.9 Weight updates versus iterations when initialized to uniformly distributed random values between 0 and 1...... 70
4.10 Evolution of classification error with iterations [68]...... 71
4.11 Classification error versus Gratio. The bands straddle the maximum and minimum classification error because of settling error at convergence [68]...... 71
4.12 Settling time (Niters) versus Gratio for different values of α (Alpha) [68]...... 72
4.13 Classification Error (Error) versus σ (sigma)...... 73
4.14 Number of training iterations (Niters) versus σ (Sigma). Training time increases as variability increases [68] © IEEE...... 73
4.15 Effect of variability on the spread of Gon (y-axis) and Goff (x-axis) values, shown using 1000 memristive device samples with mean Goff = 0.003 S and mean Gratio = 100. Note that both the x and y-axis are in log-scale [68] © IEEE...... 74
4.16 Performance of the stochastic USD implementation in comparison to floating-point for various values of pe [68] © IEEE...... 75

A.1 The perceptron...... 94
A.2 A Hopfield network...... 95
A.3 Boltzmann machines and Restricted Boltzmann machines...... 97
A.4 A self-organizing map...... 99
A.5 A synapse. The blocks shown are state-dependent and characterized by various dynamical effects leading to different types of spikes...... 100
A.6 IPSP, EPSP, and action potential...... 101

A.7 Data-types and symbols used in the LIF neuron model [9] © IEEE...... 102
A.8 STDP waveform for the action potential shown in Figure A.6...... 103

Abstract

This thesis is a study of specialized circuits and systems targeted towards machine learning algorithms. These systems operate on a computing paradigm that is different from traditional Von Neumann architectures and can potentially reduce power consumption and improve performance over traditional computers when running specialized tasks. To study them, case studies covering implementations such as TrueNorth, SpiNNaker, Neurogrid, pulse-stream-based neural networks, and memristor-based systems were carried out. The use of memristive crossbar arrays for machine learning was found particularly interesting and was chosen as the primary focus of this work. This thesis presents an Unregulated Step Descent (USD) algorithm that can be used to train memristive crossbar arrays to run algorithms based on gradient-descent learning. It describes how the USD algorithm can address hardware limitations such as variability, poor device models, and the complexity of training architectures. The linear classifier algorithm was primarily used in the experiments designed to study these features. This algorithm was chosen because its crossbar architecture can easily be extended to larger networks. More importantly, using a simple algorithm makes it easier to draw inferences from experimental results. Datasets used for these experiments included randomly generated data and the MNIST digits dataset. The results indicate that the performance of crossbar arrays trained using the USD algorithm is reasonably close to that of the corresponding floating-point implementation. These experimental observations also provide a blueprint of how training and device parameters affect the performance of a crossbar array and how it might be improved. The thesis also covers how other machine learning algorithms, such as logistic regression, multi-layer perceptrons, and restricted Boltzmann machines, may be implemented on crossbar arrays using the USD algorithm.

Declaration

No portion of the work referred to in the thesis has been submitted in support of an application for another degree or qualification of this or any other university or other institute of learning.

Copyright

1. The author of this thesis (including any appendices and/or schedules to this thesis) owns any copyright in it (the Copyright) and he has given The University of Manchester the right to use such Copyright for any administrative, promotional, educational and/or teaching purposes.

2. Copies of this thesis, either in full or in extracts, may be made only in accordance with the regulations of the John Rylands University Library of Manchester. Details of these regulations may be obtained from the Librarian. This page must form part of any such copies made.

3. The ownership of any patents, designs, trade marks and any and all other intellectual property rights except for the Copyright (the Intellectual Property Rights) and any reproductions of copyright works, for example graphs and tables (Reproductions), which may be described in this thesis, may not be owned by the author and may be owned by third parties. Such Intellectual Property Rights and Reproductions cannot and must not be made available for use without the prior written permission of the owner(s) of the relevant Intellectual Property Rights and/or Reproductions.

4. Further information on the conditions under which disclosure, publication and exploitation of this thesis, the Copyright and any Intellectual Property Rights and/or Reproductions described in it may take place is available from the Head of School of Electrical and Electronic Engineering (or the Vice-President) and the Dean of the Faculty of Life Sciences, for Faculty of Life Sciences candidates.

Acknowledgements

When I started this course, I came to Dr Piotr Dudek with a vague idea of what I would like to do. He gave me the freedom and resources to explore problems that interested me. His guidance and openness to my (often eccentric) ideas were and continue to be extremely motivating. I have never been as consistently driven as I have been in the past one and a half years. I would like to thank him for having me as his student and for being such a great supervisor. My super-cool parents shielded me from worrying about funding and the myriad other small things that come up when pursuing a program away from the motherland. For this and countless other things, they will have my eternal gratitude. I also have to mention my girlfriend, who was understanding and patient while I gave up a job and stayed away to do arcane research. Thank you, and I promise to make it up to you.

The Author

I completed my Bachelor of Technology in Electronics and Communication Engineering at the National Institute of Technology Karnataka, India in 2010. Subsequently, I worked as a Design Engineer at Analog Devices Inc (ADI) at their Bangalore design centre for 3.5 years. At ADI, I was involved in the development of mixed-signal custom SoCs for low-power and bio-medical applications. I worked on the design and development of both analogue and digital blocks, such as a reconfigurable cryptography module and a low-power band-gap reference, as well as on mixed-signal verification. I started the MPhil program at the University of Manchester in January 2014. My area of research is primarily the development of novel computational hardware. In particular, I am interested in designing systems for machine learning algorithms. My work explores the use of mixed-signal circuitry and memristors for tasks that have generally been the domain of general-purpose digital computers.

Chapter 1

Introduction

The story of human civilization is also the story of computation. It took humanity hundreds of thousands of years to invent zero (formalized in the 7th century AD by the Indian mathematician Brahmagupta [1]). However, in the roughly 1500 years that followed, humanity progressed at a pace incomparable to any known age before it. This time also saw the development of increasingly sophisticated computational tools such as the abacus, slide rules, mechanical calculators, and difference engines. For a computer engineer, the evolution of these "computers" is no less fascinating than human evolution itself. The machines built until the 1940s, such as abacuses, arithmometers, and Curta calculators, were dedicated computational devices meant to perform large calculations, incapable of conditional statements or branching. Starting from the 1940s, Turing-complete [2] machines started being built. These were general-purpose digital computers capable of running any algorithm given enough time and resources. In the next 70-odd years, they would take over and transform the world. However, researchers realized that being Turing-complete did not make these computers the most efficient or fastest computational architecture for many tasks. This realization spawned new architectures and designs such as graphics processors, FPGAs, and DSP cores containing FFT and MAC units. These systems were significantly more efficient and faster than a general-purpose computer for their targeted applications. However, in terms of learning capability and efficiency, they are all inferior to the human brain. Even before the age of computing, the philosophical question of consciousness and intelligence has instigated many thinkers. The development of digital computers gave this field of study a set of sophisticated tools that it lacked earlier. Advances in learning algorithms

such as neural networks coincided with this development. More recently, the proliferation of data-intensive statistical models and algorithms has given a new push to this area of research. While the study of neural networks has not had the glamorous impact that digital computers have had, they are becoming increasingly potent tools for developing intelligent machines. We have barely scratched the surface of the problem, and yet it has revealed a plethora of techniques that can be used to build better machines. The goal of the work presented in this thesis was to explore computational architectures that are better tuned for machine learning algorithms belonging to the neural network family. These algorithms are used in many real-world inference, prediction, and classification applications. Current implementations use millions of artificial neurons and synapses that run on computer farms spanning several thousand cores. This approach is highly inefficient and does not scale well. It also precludes low-power portable devices from harnessing the power of these algorithms. It is therefore important that we explore computational architectures and techniques that are at least an order of magnitude better in terms of computational power and efficiency. Several attempts have been made by various research groups to build such machines, such as SpiNNaker, TrueNorth, and Neurogrid. Understanding these systems requires a good understanding of circuit design as well as machine learning algorithms. The thesis does not attempt to revisit the algorithms in detail. However, a brief overview of a few important algorithms is provided in the appendix for reference. This document does cover a fairly detailed analysis of the hardware aspects of some important neuromorphic systems. These systems are essentially massive hardware neural networks whose primary goal is to model the brain.
It must be noted that some of these systems are not capable of running non-spiking neural networks such as multi-layer perceptrons, restricted Boltzmann machines, and convolutional neural networks. Nevertheless, they incorporate design ideas that are also relevant for any type of neural network implementation. This work also dealt with an exciting new family of devices called memristors. This is a new high-density non-volatile memory technology that was first identified and demonstrated as a memristor by HP Labs [3]. In addition to studying the properties of these devices, we also surveyed the work done by various research groups in utilizing these devices for different applications. Finally, the original contribution of this thesis is a method to use the gradient-descent optimization algorithm to train memristive crossbar arrays to run machine learning algorithms efficiently. It proposes novel approximations and implementation techniques in order to achieve that goal. In addition to discussing the algorithmic modifications and their effect on the learning performance of the array, the study also covers a thorough analysis of the effect of device non-linearities and variability on computational performance. This work was published in a paper titled "Gradient-descent-based learning in memristive crossbar arrays" and presented at the International Joint Conference on Neural Networks 2015 (IJCNN-15). Another journal paper is currently being written. This thesis is structured as follows:

1. Introduction - Contains a summary of the work presented in this thesis.

2. Hardware neural networks - Describes various ideas and problems associated with designing and building some of the most advanced hardware neural networks.

3. Memristive learning - Covers an introduction to memristors and crossbar arrays followed by a description of how these devices are being used for machine learning and neuromorphic applications.

4. Gradient-descent in crossbar arrays - This chapter contains a description of the Unregulated Step Descent algorithm, which is an original contribution of this work. The algorithm allows designers to program memristive crossbar arrays to run machine learning and neural network algorithms. We study how well the algorithm performs and how device variations affect computational performance.

5. Conclusion - Summarizes all the findings and ideas, and muses over the future direction of research in this field.

6. Appendix - The appendix contains a brief description of the fundamental ideas underlying algorithms such as Rosenblatt's Perceptron, Restricted Boltzmann Machines, Self-organizing maps, and Spiking neural networks.

Chapter 2

Hardware neural networks

2.1 Introduction

This chapter is a survey of ideas used in some of the most successful hardware neural networks available today. It does not attempt to be an exhaustive collection of all existing systems and only dwells on ideas that are of relevance to the work presented here. It starts with a general discussion of neural networks. The rest of the chapter is an overview of various challenges and considerations in designing hardware neural networks, using a few selected systems, namely pulse-stream networks, TrueNorth, SpiNNaker, BrainScaleS, and Neurogrid, for reference.

2.2 Neural Networks

The human brain is unparalleled in computational efficiency and capability. The field of neural networks owes its origins to our desire to build computers that are as capable as the human brain. However, even after decades of study, there is only a rough understanding of how the brain actually does any computation. While we do have rough models and theories based on observations and experimental data, none of them explain the computational architecture of a brain satisfactorily. From a computational point of view, some of the interesting observations are as follows:

1. The human brain is made of around 100 billion neurons, each of them connected to about 10000 other neurons via synapses.


2. It is difficult to assign a specific role to a particular neuron or synapse. However, it is observed that neuronal activity increases in a particular region for a particular stimulus. This observation led to the famous Hebbian learning principle [4].

3. The brain is highly non-linear and dynamic.

4. There appears to be some redundancy in the brain architecture and the network can recover from erratic neurons.

5. The communication appears to be in the form of spikes. It is widely believed that the information is encoded in the temporal properties of the spike trains and that the precise voltage levels are unimportant.

6. The brain is significantly more power efficient than the best computer we have today for pattern recognition and inference tasks.

7. The brain is highly plastic, with new connections being formed and old connections broken continually.
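Observation 2 above alludes to Hebbian learning [4], whose simplest rate-based form strengthens a synapse in proportion to the product of pre- and post-synaptic activity. The sketch below is a textbook caricature of that rule; the learning rate and activity values are arbitrary, not drawn from this thesis:

```python
def hebbian_update(w, pre, post, eta=0.1):
    """Basic rate-based Hebbian rule: dw = eta * (pre activity) * (post activity)."""
    return w + eta * pre * post

# Correlated pre- and post-synaptic activity strengthens the connection.
w = 0.0
for _ in range(10):
    w = hebbian_update(w, pre=1.0, post=1.0)
# After 10 correlated events, w = 10 * 0.1 * 1.0 * 1.0 = 1.0
```

Note that this plain rule only ever grows the weight; practical variants add decay or normalization to keep weights bounded.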

It is quite clear that the brain is an excellent pattern recognition machine. But, in spite of all our work, a unifying theory that explains exactly what happens in the brain is still far out of reach. However, even with a crude understanding of the brain, designers have been able to build some very impressive networks and systems. The explosion in Internet usage has generated data in astronomical quantities. This phenomenon is called Big Data. The inhuman effort required to work with such data has resulted in a huge demand for intelligent systems whose goal is to find and identify patterns in raw data. These systems run on algorithms from the domains of statistical inference and machine learning. As it stands today, the most successful techniques being used on these problems are based on neural networks. These techniques have revolutionized virtually every business and research activity that deals with Big Data by endowing them with excellent tools for prediction and analysis. A simplistic definition of a neural network would be a network of neurons and synapses. But there is much more to it. The neurons and synapses themselves can be treated as simple pulses and connections or as complex function generators and channels. Within a single layer, neurons can be connected in several configurations, and each configuration has its own unique set of properties. The interactions between different layers also depend heavily on how the connections are made. One of the most pertinent questions for a scientist studying the brain is how nature has made these connections. One of the most influential ideas in the early days of neural networks was that of the perceptron. It modelled the brain as a linear combination of inputs followed by a non-linear operation. Several such layers connected to form multi-layer perceptrons or Madaline [5]. Additionally, several other interesting ideas emerged.
One of them was the auto-associative memory based on Hebbian learning, such as Hopfield networks [6]. There were others too, such as Boltzmann machines, auto-encoders, and convolutional neural networks. While these algorithms borrowed liberally from neuroscience and its understanding of the brain, they also incorporated a large set of ideas from mathematics and physics. Some of these neural network algorithms have deviated significantly from our understanding of the brain and are an interesting class of networks in their own right. The computational power of neural networks did not escape the attention of circuit designers. As a general rule, most circuit designers grapple with the inevitability of the device scaling wall. The problem is that as devices scale down, they become increasingly erratic. The other issue arising from scaling is higher power density. Alternatives to silicon are extremely unreliable or hard to work with. The observation that the brain can compute with highly unreliable devices at very low power levels is, therefore, extremely intriguing. If one could incorporate these features in artificial neural networks, it might result in computational devices that are more power efficient and fast. Neural networks have been well known for a while now. However, hardware neural networks never became mainstream because they could never beat general-purpose processors running neural network algorithms. MOS technology has been more resilient than originally expected, and circuit designers have been able to push the scaling wall all the way to 10 nm. However, the limits are near, as can be seen from the slowdown in the development of faster and denser processors. Current microprocessor designs attempt to overcome these limits by increasing the number of processor cores. Alternative computational architectures are also being explored. Interest in hardware neural networks is therefore picking up again.
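The perceptron mentioned earlier (a linear combination of inputs followed by a non-linear operation) is compact enough to sketch directly. The hard threshold used here as the non-linearity and the hand-picked weights implementing logical AND are illustrative choices, not taken from [5]:

```python
import numpy as np

def perceptron(x, w, b):
    """One perceptron unit: a weighted sum of the inputs plus a bias,
    passed through a hard-threshold non-linearity."""
    return 1 if np.dot(w, x) + b > 0 else 0

# A perceptron implementing logical AND (weights chosen by hand):
# the output fires only when both inputs are active.
w, b = np.array([1.0, 1.0]), -1.5
out = perceptron(np.array([1, 1]), w, b)  # 1
```

In a multi-layer perceptron, the outputs of one such layer of units become the inputs of the next.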
There is an abundance of literature on hardware neural networks (HNNs) written over the last few decades. [7] provides a comprehensive review of many successful implementations. In general, all HNNs grapple with similar issues:

1. Inter-connect density: A fully-connected neural network involving physical connections between neurons can only be built in 3-D space. It becomes prohibitively expensive to create a topologically equivalent connectivity using current 2-D fabrication methods. Instead, circuit designers incorporate multiplexing schemes.

2. Memory: Unlike the general-purpose Von Neumann architecture, HNNs work best when each synapse has access to its weights locally. Storing them in a separate dedicated memory is not ideal because these weights have to be accessed frequently. This increases the power and memory bandwidth requirements dramatically as the network scales up.

3. Device variability: Device variability is an important issue, particularly when working with analogue HNNs. While the algorithms for training neural networks are partially tolerant to it, good design is essential to achieve better performance.

The reason memristive crossbar arrays have grabbed a lot of attention is that they might solve the first two problems listed above. Before discussing such systems, it is worthwhile to study some of the existing hardware neural networks. Designing HNNs is a complex task with several factors to consider. The first major decision to be made is the choice of learning algorithm or neural network itself. There are several choices, such as MLPs, RBMs, CNNs, auto-encoders, and SOMs. This choice sets constraints on the architecture and performance of the various building blocks. A brief summary of these algorithms is covered in the appendix. The designer then has to choose between a digital or analogue implementation for the building blocks. The interconnection architecture is another factor to consider. There are broadly two choices here: a synchronous or an asynchronous event-driven model. In many applications, it is not feasible to put all the blocks in a single chip. In such cases, inter-chip connectivity schemes have to be designed, and timing considerations become more critical. The discussion in this chapter covers these issues by using some of the successful state-of-the-art implementations as examples.

2.3 Digital versus Analogue Neural Networks

The arguments for a partial or fully-digital VLSI implementation tend to be quite similar to the arguments for why digital signal processing is more advantageous than analogue signal processing. A list of advantages of adopting the digital method is as follows:

1. Design methodology is standardized and abstracted. Once the network is designed, the actual synthesis and fabrication steps will be quite similar to any well-established digital VLSI design flow.

2. The noise margins are high because digital signals operate at only two levels.

3. It is easier to interface digital blocks to standard digital systems such as FPGAs or SoCs. It does not require any converters or voltage scaling operations which makes the system cheaper.

4. Programmable weights are easy to implement using digital memories. This was perhaps the strongest advantage that the digital option had over analogue. Perhaps the best option for analogue designers before the advent of memristors was the floating-gate MOSFET, which was (and still is) expensive.

On the other hand, analogue computation also has some advantages over digital computation.

1. Digital NNs tend to suffer from a performance bottleneck arising from frequent memory accesses. Every weight or parameter in the neural network has to be fetched from memory, processed, and updated if necessary. This costs time and power. On the other hand, the memory elements in analogue NNs double up as computational units, eliminating the need to move the data around, improving computational speed and reducing power consumption.

2. Digital circuits take up more transistors than their analogue counterparts. However, this is a questionable advantage because digital CMOS circuits scale much better.

The difficulty in designing analogue NNs has traditionally outweighed their benefits. The building blocks used in analogue NNs are generally more complicated to build than their digital counterparts. These designs do not lend themselves well to large-scale automated design and synthesis. One also needs to consider factors such as noise, parasitics, and device variability. In order to be competitive with digital NN implementations, the devices used in analogue NNs should be smaller, which is difficult. In CMOS or bipolar technologies, these issues make it intractable to build large analogue hardware neural networks consisting of millions or even thousands of neurons and synapses. The most compelling reason memristive crossbar arrays are of interest is the potential ease of design and fabrication they bring to analogue neural networks. However, these devices come with a baggage of design problems that have not yet been fully addressed. One of the outcomes of the work presented in this thesis is a method to overcome some of those issues.

2.4 Hardware Neural Network Architectures

In this section, we discuss the following:

1. VLSI neural networks using pulse-stream arithmetic [8]

2. TrueNorth [9]

3. SpiNNaker [10]

4. Neurogrid [11]

This is by no means an exhaustive list of all existing hardware neural networks, but they form a good representative set to illustrate the various problems associated with their design, and they have also been highly influential. A detailed discussion of the low-level circuit descriptions of the various blocks in these networks is beyond the scope of this document. Only those aspects of the design that highlight issues of relevance to this thesis are discussed here; references are provided to the original works. A good reference for the fundamental building blocks of these circuits is [12].

2.4.1 Pulse-stream arithmetic based networks

Pulse stream arithmetic based networks try to combine the best properties of analogue and digital VLSI systems using clever circuit and statistical techniques. Specifically, these techniques try to overcome analogue circuit shortcomings using digital circuits and vice versa.

1. Analogue circuits are susceptible to noise and variability, while digital signals have large noise margins and are much less affected by noise.

2. Analogue signals are sensitive to errors during transmission of data unlike digital on- off signals.

3. Digital multiplication and addition are area-intensive and power-hungry, unlike their analogue counterparts. On the other hand, digital signal processing circuits provide much better accuracy than their analogue counterparts.

4. Unlike digital memories, analogue memory technologies such as charge-coupled devices (CCD), metal-nitride-oxide-silicon (MNOS) memories, flash memories, and floating-gate memories are difficult to program and handle. However, this is one of the shortcomings that might go away with the development of memristive crossbar arrays.

The method that pulse-stream networks adopt is to combine digital memory storage with analogue computation. Instead of using converters to translate between analogue and digital signals, the information is encoded as pulses. Different types of encoding schemes have been proposed and implemented such as:

1. Pulse amplitude modulation

2. Pulse width modulation

3. Pulse delay modulation

4. Pulse frequency modulation

A pulse-stream neuron is essentially a voltage-controlled oscillator that collects pulse-stream inputs and uses them to modulate the frequency of its output pulses. As shown in Figure 2.1, inputs arriving from other connected neurons are integrated and used to model the neuronal activity of the node. Note that part or all of the information is encoded in the timing of the pulses. This means that, to obtain good results, the system has to monitor a sequence of pulses. The other point of interest is that although there are some structural similarities between spiking neural networks (Appendix) and pulse-stream techniques, pulse-stream networks are based on different operating principles and pre-date them. Implementing a pulse-stream synapse is trickier because the signals have to be weighted according to the strength of the connection. One of the most interesting circuits here is the transconductance multiplier. In the circuit shown in Figure 2.2, the two transistors,

Figure 2.1: Pulse-stream neuron [13] © IEEE

M1 and M2, are biased in the linear (triode) region. The idea is to size the devices such that the 2nd-order term in the output current I3 is cancelled. It can be shown that the net output current is given by:

I3 = µ Cox (W1/L1) (VGS1 − VGS2) VDS1    (2.1)

The output of the circuit is a stream of current pulses whose magnitude is proportional to Tij and whose frequency is proportional to Sj. The circuit shown in Figure 2.2 is highly susceptible to variability and mismatch, and schemes have been suggested to overcome these [8].
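Equation (2.1) can be evaluated numerically. The sketch below is illustrative only: the device parameters (mobility, oxide capacitance, geometry) are assumptions, not values from the original paper.

```python
# Sketch: evaluating the transconductance-multiplier output of Eq. (2.1),
# I3 = mu * Cox * (W1 / L1) * (V_GS1 - V_GS2) * V_DS1.
# All device parameters below are illustrative assumptions.

def multiplier_current(v_gs1, v_gs2, v_ds1,
                       mu=350e-4,       # carrier mobility, m^2/Vs (assumed)
                       cox=3.45e-3,     # oxide capacitance per area, F/m^2 (assumed)
                       w1=10e-6, l1=2e-6):  # device geometry, m (assumed)
    """Ideal output current of the triode-region multiplier."""
    return mu * cox * (w1 / l1) * (v_gs1 - v_gs2) * v_ds1

# The output is bilinear: doubling the gate-voltage difference doubles I3.
i_a = multiplier_current(1.2, 1.0, 0.1)
i_b = multiplier_current(1.4, 1.0, 0.1)
print(i_b / i_a)  # ratio of the two input differences, 0.4 / 0.2
```

The bilinearity is the whole point of cancelling the 2nd-order term: the pulse magnitude then scales cleanly with the stored weight.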

Figure 2.2: A transconductance multiplier [8] © IEEE

The next structure required for completing the connections between neurons and synapses is the connection fabric. Given the requirement for a large number of neurons, it is not always feasible to connect distant neurons directly. To solve that problem, the designers [8] proposed communication schemes that generally fall into two categories:

1. Synchronous time-division multiplexed (TDM) transmission

2. Asynchronous self-timed transmission

The synchronous TDM scheme is fairly self-explanatory. It assigns a predefined slot to each neuron within a given window. If a pulse is generated by a neuron, a 1 is transmitted in the neuron's time slot; if no pulse is generated, a 0 is transmitted. Implementing such a scheme requires a global clock. However, designing the clock tree can be tricky when the number of neurons is large. One of the solutions proposed to overcome this problem was the self-timed scheme illustrated in Figure 2.3. In this scheme, the receiver and the transmitter use the ready-to-transmit (RTT) and ready-to-receive (RTR) lines to establish an asynchronous handshaking mechanism. Once the communication link is established, the transmitter cycles through the incoming streams of data and encodes pulse information, such as the time between pulses, using a pulse-width modulation scheme. The receiver knows the address of the source neuron because of the preliminary handshake. The similarity between this scheme and the address-event representation (AER) scheme, described in the next section, may be noted.
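The synchronous TDM scheme above amounts to giving each neuron a fixed bit-slot per frame. The following sketch shows that idea; the frame layout is an illustrative assumption, not the exact hardware format.

```python
# Sketch of the synchronous TDM scheme: each neuron owns one bit-slot per
# frame; a 1 in a slot means that neuron fired in this window.

def encode_frame(fired, n_neurons):
    """Pack the set of firing neurons into one TDM frame (list of bits)."""
    return [1 if i in fired else 0 for i in range(n_neurons)]

def decode_frame(frame):
    """Recover which neurons fired from their fixed slot positions."""
    return {i for i, bit in enumerate(frame) if bit == 1}

frame = encode_frame({0, 3}, n_neurons=8)
print(frame)                 # [1, 0, 0, 1, 0, 0, 0, 0]
print(decode_frame(frame))   # {0, 3}
```

Because the slot position itself identifies the source neuron, no address needs to be transmitted, but every neuron consumes a slot in every frame whether it fires or not, which is why a global clock and a fixed window are unavoidable here.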

Figure 2.3: Self-timed asynchronous communication scheme [8] © IEEE

Pulse signals incoming from multiple sources are summed up using a summer circuit. There are two distinct implementation choices for this block:

1. Voltage pulse addition

2. Current pulse addition

One of the methods proposed to sum multiple pulse streams was the OR logic operation. If pi is the probability of a pulse arriving at the i-th input of an OR gate, then the probability of an output pulse, pout, can be computed as:

pout = 1 − (1 − p0)(1 − p1)...(1 − pN)    (2.2)

However, if we were to use a normal adder instead, the output would be Σ pi. As the number of inputs or the magnitudes of the probabilities increase, pout increasingly deviates from this value. This means that the addition accuracy worsens as the frequency of pulses increases. While this can be thought of as a disadvantage, it is not necessarily one. In most neural network implementations, the summer is followed by a sigmoid non-linearity. As shown in Figure 2.4, the probability of an output pulse from the OR adder looks a lot like a sigmoid, which eliminates the need for a dedicated sigmoid generator. However, there are two problems with this approach. One is that the shape of the sigmoid depends on the number of inputs. The other is that as the number of inputs increases, the output saturates faster.
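The behaviour described above is easy to check numerically: for sparse pulse streams the OR output tracks the linear sum, while for dense streams it saturates. A minimal sketch, with the probabilities chosen purely for illustration:

```python
# Comparing OR-gate pulse summation, Eq. (2.2), with an ideal adder.
# pout = 1 - prod(1 - pi) saturates (sigmoid-like) as sum(pi) grows.

from functools import reduce

def or_sum(probs):
    """Probability of an output pulse from an OR gate with independent inputs."""
    return 1.0 - reduce(lambda acc, p: acc * (1.0 - p), probs, 1.0)

def ideal_sum(probs):
    """Output of an ideal adder (no saturation)."""
    return sum(probs)

low = [0.01] * 4    # sparse pulses: OR output is close to the linear sum
high = [0.4] * 8    # dense pulses: OR output saturates well below sum(pi)

print(or_sum(low), ideal_sum(low))    # ~0.0394 vs 0.04
print(or_sum(high), ideal_sum(high))  # < 1.0 vs 3.2
```

This also makes the two drawbacks visible: the saturation curve shifts with the number of inputs, and with many inputs the output pins near 1 for even moderate input rates.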

Figure 2.4: Input pulse probability versus output pulse probability

While pulse-based neural network solutions are easier to design and implement than purely analogue implementations, they are still expensive and do not scale well for large neural networks consisting of thousands or millions of neurons and synapses. They eventually run into issues such as area, connectivity density, mismatch, design complexity, etc. Timing and connectivity are other aspects that need to be considered. Using the encoding schemes described in this section does help alleviate the issue, but as the number of neurons increases, communication speed becomes a bottleneck. Lowering the pulse widths makes the design more complex, but that is probably not a critical issue with current technology.

2.4.2 TrueNorth

TrueNorth is a purely digital neuromorphic processor architecture built by IBM [14]. As shown in Figure 2.5, the design is based on a crossbar array of SRAM memory cells. As in most neuromorphic systems, the basic building blocks are neurons, axons, and synapses. The chip consists of 4096 cores, each containing 256 integrate-and-fire neurons and a 1024 × 256 SRAM crossbar memory for synapses. The axons and neurons are connected by crossbar synapses, of which there are about a million. Each of the axons is assigned a type Gj, which is stored in memory as shown in Figure 2.5. Gj is the synaptic weight and can take different values, corresponding to inhibitory or excitatory types with different efficacies. The crossbar synapses merely indicate the presence or absence of a connection between an axon and a neuron.

Figure 2.5: TrueNorth architecture [15] © IEEE

The operation of a TrueNorth neuron is based on the leaky integrate-and-fire model described in the Appendix. The use of fixed-point computation to model the neuron is where this design stands out from most other approaches. An important thing to note is that, while the computation at each node involves multi-bit calculations, the spike itself is modelled by a single bit. Every spike event is encoded by an output encoder into an address-event representation (AER), which encodes binary activity by sending the address of the spiking neuron over a multiplexed channel [16]. The advantage of the AER method is that it enables asynchronous transmission of spiking activity, which results in power savings. Processing within a core occurs in two time steps: an integration phase and a time-step synchronization phase. In the integration phase, the decoder calculates the source addresses of incoming spikes from the incoming AER-encoded pulses. All the axonal inputs are gathered by the input decoders and sent to their respective neurons. In the synchronization phase, a Sync event is sent to all the neurons in the system. The neurons use the information that arrived in the integration phase to update their membrane potentials. In this phase, the neurons may also fire if their membrane potentials are higher than the programmed threshold values. If a spike is produced, the membrane potential is reset to 0. While the transmission of spikes is asynchronous, the design ensures that there is some crude global synchronization. This is important to ensure that the hardware and the algorithm run in tandem. Because of the statistical nature of the spiking neurons and the variability in delays, it is possible that a given spike will miss its time window. This problem is addressed at the algorithmic level using techniques that are tolerant of a few misfiring neurons. Finally, it has to be noted that the system is not capable of any training.
The parameters are computed off-line and programmed onto the chip. The creators have been able to successfully port some simple applications to this system, such as an autonomous virtual-robot driver, a Pong player, MNIST digit recognition, and auto-association [15]. It must be noted that current state-of-the-art spiking neural networks use a much larger number of neurons and synapses than is supported by a single TrueNorth chip. For example, [17] uses more than 7000 neurons and more than 5 million synapses to implement a two-layer network. Hierarchical multilayer networks that use a much larger number of neurons and synapses have also been reported [18]. A single TrueNorth chip already has 5.4 billion transistors; therefore, it might be impractical to build larger networks on a single die. One potential avenue of research on this front is to use the same chip to run different chunks of a network sequentially. Another is to use several chips and integrate them at the board level. Some approaches, such as the HICANN project, do this via wafer-level integration (HICANN is briefly discussed later). However, these approaches re-introduce the problems of memory bandwidth and power consumption whose reduction was the objective of the TrueNorth chip. The current implementation of the design is a chip that consumes only 45 pJ per spike [14].
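The two-phase core operation described above (integrate, then synchronize-and-fire) can be sketched as follows. This is a behavioural illustration only: the fixed-point arithmetic of the real chip is omitted, and the connectivity, weights, and threshold are made-up values.

```python
# Sketch of TrueNorth-style two-phase processing: an integration phase that
# accumulates axon inputs into membrane potentials, and a Sync phase that
# thresholds, fires, and resets. All numbers are illustrative assumptions.

def integrate(potentials, weights, active_axons, connected):
    """Integration phase: add the weight of every connected, active axon."""
    for axon in active_axons:
        for neuron, _ in enumerate(potentials):
            if connected[axon][neuron]:          # crossbar bit: connection present?
                potentials[neuron] += weights[axon]
    return potentials

def sync(potentials, threshold):
    """Sync phase: fire neurons at/above threshold and reset them to 0."""
    spikes = [v >= threshold for v in potentials]
    return [0 if s else v for v, s in zip(potentials, spikes)], spikes

# 2 axons, 3 neurons; axon 0 excitatory (+3), axon 1 inhibitory (-1).
connected = [[True, True, False],
             [False, True, True]]
weights = {0: 3, 1: -1}
pots = integrate([0, 0, 0], weights, active_axons=[0, 1], connected=connected)
pots, spikes = sync(pots, threshold=3)
print(pots, spikes)  # [0, 2, -1] [True, False, False]
```

Note how the crossbar entry is a single connection bit, while the per-axon type selects the weight, mirroring the separation described in the text.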

2.4.3 SpiNNaker

Like IBM’s TrueNorth, SpiNNaker is a digital neuromorphic system. The purpose of the SpiNNaker project is to build a supercomputer capable of simulating large (one billion neurons) spiking neural networks operating on the biological time-scale (packet latency can be up to 1 ms) [10]. SpiNNaker is a massively parallel multi-core computing platform consisting of about 57,000 compute nodes. Each node contains 18 ARM9 cores, bringing the total number of ARM processors in the complete system to more than a million. Each core is provided with 32 kB of instruction memory and 64 kB of data memory, bringing the total RAM in the system to 7 TBytes. For simple neuron models, each ARM9 core is capable of modelling about 1000 neurons, which means that the total number of neurons that can be simulated simultaneously by this system is approximately a billion. While SpiNNaker and TrueNorth share some similarities, their design objectives and capabilities are vastly different. However, a contrastive study might help in understanding the design trade-offs. IBM's TrueNorth uses dedicated fixed-function digital blocks to operate as neurons and synapses. Although there is some level of programmability in the neuronal functionality, the programmer is stuck with the leaky integrate-and-fire model that is hard-coded in the system. While this was essential in order to lower power consumption and chip area, it also makes it harder to radically alter the functionality of the built-in neurons. Unlike IBM's chip, SpiNNaker uses ARM9 cores to model neurons. This provides the programmer with complete flexibility over the models used in their algorithms, limited only by the 32 kB instruction memory provided to each core. The data memory associated with each processor also implies that, unlike in TrueNorth, SpiNNaker synaptic weights can be trained on-line. Another difference between SpiNNaker and TrueNorth is in the routing mechanism.
The crossbar architecture provides each neuron in TrueNorth with the ability to connect directly to a target node. SpiNNaker, on the other hand, transfers information packets over a routing fabric to communicate between the different nodes of the system, which makes connectivity in SpiNNaker completely reconfigurable. It is of interest to note the similarities between a computer network and SpiNNaker. However, the cost of this reconfigurability is higher power consumption. This can be appreciated from the fact that a single TrueNorth chip consumes about 70 mW on a 4.3 cm2 die, while a single SpiNNaker node consumes 1 W on a 101.64 mm2 die. However, when we consider the fact that a single TrueNorth chip models 256 neurons and 1024 axons, versus a SpiNNaker node that models 16 × 1000 neurons, the difference is not so stark.

Figure 2.6: A SpiNNaker node [10] © IEEE

2.4.3.1 Node architecture

A block diagram of a SpiNNaker node is provided in Figure 2.6. Each node in SpiNNaker is a network of 18 ARM9 cores and is part of a larger network of similar nodes. On start-up, one of the 18 processors is selected as the Monitor Processor and used for housekeeping tasks on the chip. Sixteen processors are used to model neurons and are independently configurable. One processor is reserved for fault tolerance and manufacturing yield enhancement. SpiNNaker uses two Networks-on-Chip (NoCs): a Communications NoC to establish inter-processor communication, and a System NoC to handle processor-peripheral communication. The interesting thing to note here is that even though two processors might share the same node, the Communications NoC handles any communication between them using the same protocol it would use for a transfer between two processors on different nodes. The System NoC, on the other hand, allows the 18 processor cores in a node to access peripherals on the node such as SRAM, ROM, timers, DMA, etc.

2.4.3.2 Event driven operation

For a system as large as this, it is impractical to design a synchronous communication scheme. An elegant feature of spiking neural networks is their tolerance to delays, which allowed the architects of the system to do away with synchronization. The general principle here is that a packet is launched in the earliest available window. However, this means that packets may arrive out of order, or delayed because of congestion. While the algorithms are tolerant of reasonable delays, the delays cannot be indefinitely large. To handle this, the system cycles through global time-phase values that must be consistent across the system. If a node receives a packet that is older than two time phases, the packet is taken out of circulation and sent to the Monitor Processor. Note that the latency in packet arrival can be of the order of 1 ms in biological systems. It can be shown that a router operating at 100 MHz can handle the traffic in SpiNNaker under normal operating conditions [10]. The unpredictability in packet arrival time makes interrupt-based mechanisms the logical way to communicate. In normal operation, the arrival of a packet triggers an interrupt to the designated core. The core accepts the packet, does the requisite processing, transmits new packets if necessary, and returns to its previous state. As a power-saving feature, the cores go to a sleep state when not in operation.
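The time-phase check described above can be sketched as follows. The phase arithmetic (a small cyclic counter, modulo comparison) is an illustrative assumption, not the exact SpiNNaker mechanism.

```python
# Sketch of the global time-phase staleness check: a packet stamped with a
# phase more than 2 steps behind the node's current phase is pulled from
# normal routing and handed to the Monitor Processor.

N_PHASES = 4  # assumed size of the cyclic time-phase counter

def is_stale(packet_phase, node_phase, max_age=2):
    """True if the packet is older than max_age phases (modulo wrap-around)."""
    age = (node_phase - packet_phase) % N_PHASES
    return age > max_age

def route(packet_phase, node_phase):
    return "monitor" if is_stale(packet_phase, node_phase) else "deliver"

print(route(packet_phase=2, node_phase=3))  # age 1 -> deliver
print(route(packet_phase=0, node_phase=3))  # age 3 -> monitor
```

The point of the cyclic counter is that only relative age matters: no global absolute time needs to be distributed, only a coarse phase that all nodes agree on.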

2.4.3.3 Network communication

Much like the Internet, communication in the SpiNNaker system is packet-based. However, unlike Internet packets, SpiNNaker packets consist of only 40 to 72 bits of data. Every packet contains a control byte that specifies routing information such as the packet type, parity, payload indicator (length of the packet), etc. The system supports four packet types, corresponding to four different communication methods:

1. Nearest neighbour - communication between adjacent nodes. The packet is always delivered to the monitor core of a neighbouring node.

2. Point-to-point - a packet launched by any core and delivered to the monitor core on the node containing the target core.

3. Multicast - the only type that permits core-to-core communication. Packets are duplicated as necessary to ensure that all target cores receive them.

Figure 2.7: The SpiNNaker machine [10] © IEEE

4. Fixed route - carries information such as error-detection and time-stamp data, emergency routing, etc.

One of the biggest advantages of a packet scheme is the ability to configure the network in any topology. For example, the designers have proposed the connectivity scheme shown in Figure 2.7. In order to build such network topologies, it is essential that the routers on each node contain information about the addresses of, and routing paths to, the other cores in the system. This is done with routing tables in each node. Computing the optimal routing tables is a non-trivial problem, and they need to be programmed before the system can start operation [10].
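The idea of table-driven routing with packet duplication (as in the multicast type above) can be sketched as follows. The keys, link names, and table contents are illustrative assumptions, not the real SpiNNaker table format.

```python
# Sketch of table-driven multicast routing: each node holds a routing table
# mapping a packet's routing key to the set of outgoing links (or local
# cores) the packet should be copied onto.

def route_packet(key, tables, node):
    """Return the links/cores a packet with this key is duplicated onto."""
    table = tables[node]
    return table.get(key, ["default_link"])  # unmatched keys take a default route

tables = {
    "node_A": {0x10: ["east", "core_3"],   # duplicate: onward hop + a local core
               0x11: ["north"]},
}
print(route_packet(0x10, tables, "node_A"))  # ['east', 'core_3']
print(route_packet(0x99, tables, "node_A"))  # ['default_link']
```

This also shows why computing the tables off-line is non-trivial: every multicast tree in the network has to be decomposed into per-node entries like these before operation starts.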

2.4.3.4 Neuron and Synapse model

SpiNNaker is best used with algorithms that use the point-neuron model [19]. The point-neuron model ignores the effect of dendritic structure on the input pulses; incoming spikes are treated as if they were applied directly to the body of the neuron. While there is a chance that this is too simplistic a model of the brain, most other neuromorphic systems make similar assumptions. However, the programmer does have some freedom in controlling the data content of each packet, so it might be theoretically possible to model properties such as dendritic delays to some extent. At this point it is easy to appreciate that, while SpiNNaker is a neuromorphic supercomputer, it liberally borrows ideas from VLSI and computer networks. The event-driven model used here is somewhat similar to the brain, but the communication and computation infrastructure is completely different. One of the biggest challenges facing scientists in this field is the inability to simulate large neural networks and architectures in real time, and that is exactly the problem SpiNNaker is best positioned to solve.

2.4.4 Neurogrid

Neurogrid is a mixed-signal multi-chip neuromorphic system designed by the Brains in Silicon lab at Stanford University. Like SpiNNaker, described in the previous section, Neurogrid aims to provide a neuromorphic simulation platform capable of handling large-scale brain simulations in real time. While the problems faced by the designers of Neurogrid were similar to those of TrueNorth, they made several interesting and divergent design choices in its implementation. The following are of particular interest [11]:

1. Use of analogue CMOS circuits to model the neurons and synapses.

2. Adopting a shared dendrite structure to reduce the implementation cost.

3. A binary tree-based network to establish communication between the various nodes.

2.4.4.1 Shared dendrite structure

One of the biggest challenges in building neuronal circuitry is the quadratically increasing number of connections. For example, a fully-connected network of N neurons with one-to-one connectivity requires N2 synapse elements, as shown in Figure 2.8.

Figure 2.8: A network of 4 neurons and 16 synapses with 1-to-1 connectivity [11] © IEEE

Clearly, this architecture does not scale well. One of the methods proposed to reduce the number of synapses to N is to use a time-multiplexing scheme in which dedicated time slots are assigned to each neuron to transmit its pulse. The synaptic weights are stored in a memory. When a neuron generates a spike, the respective synaptic weight is fetched and applied to the spike before passing it on to the target neuron. A block diagram of the scheme is presented in Figure 2.9.

Figure 2.9: A shared dendrite network of 4 neurons and 4 synapses [11] © IEEE

The blocks marked transmitter and receiver encode the source and destination information using an asynchronous AER scheme to transmit the spike to the correct destination. The figure also illustrates the shared-dendrite structure, represented by resistors connecting adjacent neurons. This mechanism mimics the structure of the visual cortex, where the neurons in a "minicolumn" within the same "hypercolumn" share connections and are connected to each other by lateral inhibitory connections [20]. One limitation of this structure is that it does not allow the programmer to implement a synaptic plasticity scheme. If such a feature is to be supported, it would be necessary to reprogram the synaptic weights to model the appropriate behaviour.

2.4.4.2 Neuron and Synapse

A single Neurogrid chip, called a Neurocore, consists of 256 × 256 neurons, a transmitter, a receiver, and two RAMs. The neuronal circuits are made of different functional blocks, which perform the roles of soma, dendrite, synapse-population, and ion-channel-population circuits. The blocks are implemented using dimensionless models, which simplifies design without any loss of accuracy. For example, consider a circuit equation:

C V̇ = −G1 (V − Vconstant) + Iconstant

Constants such as G1, Vconstant, and Iconstant can be absorbed into normalized variables τ and ν, resulting in equations of the form:

τ ν̇ = −ν + u
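The normalization can be made explicit. Assuming a voltage scale $V_0$ (my notation; the text does not name one), dividing the circuit equation by $G_1$ and substituting dimensionless variables gives:

```latex
C\dot{V} = -G_1\,(V - V_{\mathrm{constant}}) + I_{\mathrm{constant}}
\quad\Longrightarrow\quad
\frac{C}{G_1}\,\dot{V} = -(V - V_{\mathrm{constant}}) + \frac{I_{\mathrm{constant}}}{G_1}.

% Defining the dimensionless quantities
\tau = \frac{C}{G_1}, \qquad
\nu  = \frac{V - V_{\mathrm{constant}}}{V_0}, \qquad
u    = \frac{I_{\mathrm{constant}}}{G_1 V_0},

% the equation collapses to the stated form:
\tau\,\dot{\nu} = -\nu + u .
```

All device-specific constants are thus folded into a single time constant and a single dimensionless input, which is what makes the circuit blocks reusable across parameter regimes.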

A dimensionless equation of this form is relatively easy to implement if care is taken to ensure that the values are scaled suitably.

Figure 2.10: A Neurogrid neuron [11] © IEEE

A block diagram of the various components in a Neurogrid neuron is shown in Figure 2.10. Each neuron consists of a soma block, a dendrite block, four synapse-population blocks, and four gating-variable blocks. The synapse-population block refers to the resistive grid connection illustrated in Figure 2.9. The gating-variable block is used to model the ion-channel behaviour observed in neurons; a simple explanation of the role of gating in the behaviour of the neuron is provided by the Hodgkin-Huxley model. Circuit-level details of the different building blocks are given in [11].

2.4.4.3 Communication

As mentioned earlier, whenever a neuron generates a spike, the information is AER-encoded and transmitted to the appropriate neuron. The target neuron does not necessarily lie in the same Neurocore. Therefore, it was essential to design a scheme that was fast enough for real-time processing, not area-intensive, and highly scalable. The choice of transporting information digitally was made for these reasons. The transmitter consists of two 256-input arbiters (similar in functionality to a multiplexer), one associated with the row index and the other with the column index. When a neuron fires, each of the arbiters generates an 8-bit address. The receiver is a 2048 × 256 arbiter where 8 lines are allotted to each row in order to select one (out of four) shared synapses, to sample one (out of three) analogue signals, and one to disable a neuron. While this scheme is fairly ordinary, the interesting idea in Neurogrid communication is the arrangement of the Neurocores in a binary tree, which makes communication highly bandwidth-efficient. A block diagram of the scheme is shown in Figure 2.11.

Figure 2.11: Block diagram of a Neurogrid tree [11] © IEEE

Each of the triangular blocks shown in the figure corresponds to a single Neurocore. Packets are transmitted in two phases: a point-to-point phase, shown by black arrows, and a branching phase, shown by purple arrows. In the point-to-point phase, the tree uses the path information encoded in the packet to decide the direction of the packet's movement (U: up, D: down, R: right, L: left). When the stop code (S) is reached, the point-to-point phase is complete. In the next phase, the branching phase, the packets are sent to every child of the parent node all the way down to the leaf nodes. In Figure 2.11, transmission from node 4 to node 3 is point-to-point, and that from node 3 is branching. It is of interest to note that the SpiNNaker system could easily be configured to adopt the same communication structure. Additionally, it should not be too difficult to model a similar shared-dendrite structure on SpiNNaker. Taking only simulation capability into consideration, SpiNNaker appears to be more reconfigurable, flexible, and precise. However, the real advantage of Neurogrid could be its power efficiency and lower cost of implementation.
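The two-phase delivery above can be sketched as follows. The tree here is a tiny made-up example, not a real Neurogrid map; only the mechanism (follow the route string until S, then flood the subtree) follows the description in the text.

```python
# Sketch of Neurogrid-style two-phase tree delivery: a point-to-point phase
# that follows the packet's route codes until the stop code S, then a
# branching phase that floods every descendant of the stop node.

def deliver(route, tree, start):
    """Follow U/L/R codes from `start`; on 'S', branch to all leaves below."""
    node = start
    for step in route:
        if step == "S":                      # stop code: branching phase begins
            return sorted(collect_leaves(tree, node))
        node = tree[node][step]              # point-to-point phase
    return [node]

def collect_leaves(tree, node):
    children = [c for c in (tree[node].get("L"), tree[node].get("R")) if c]
    if not children:
        return [node]
    leaves = []
    for c in children:
        leaves.extend(collect_leaves(tree, c))
    return leaves

# Parent 'p' with two leaf children 'a' and 'b'.
tree = {
    "p": {"L": "a", "R": "b"},
    "a": {"U": "p", "L": None, "R": None},
    "b": {"U": "p", "L": None, "R": None},
}
print(deliver("US", tree, start="a"))  # up to 'p', then branch: ['a', 'b']
```

The bandwidth efficiency comes from the fact that a packet traverses each shared link only once before fanning out, rather than being sent separately to every destination.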

2.5 Programming neuromorphic hardware

Programming large neuromorphic systems such as SpiNNaker, Neurogrid, or BrainScaleS is a non-trivial problem. The complexity is higher when the specifications have to be translated into bias values for analogue circuitry. All these systems endeavour to present the user with a relatively simple interface that can translate a neural network specification into low-level circuit parameters. One of the popular high-level languages prescribed for this task is PyNN [21]. This thesis does not dwell much on the different programming methodologies adopted by the various systems, since the focus is on hardware architecture. However, the importance of a simple and adaptable interface cannot be overstated if neuromorphic computation is to be brought to the mainstream.

2.6 Other noteworthy neuromorphic systems

Two challenges that need to be addressed when building any large-scale neuromorphic system are as follows:

1. Designing high-density circuits for neurons and synapses that are power-efficient and re-programmable.

2. Communication infrastructure to transmit billions of spikes in real time between the neurons while maintaining some level of global synchronization.

The four systems reviewed in this chapter were chosen because they form a good representative set of the different approaches that have been used to build large hardware neural networks. However, this list is by no means exhaustive. There are several other neuromorphic systems, such as the European BrainScaleS project [22], BioRC [23], MIT's Silicon Synapse [24], Intel's neuromorphic hardware [25], the spiking neuromorphic processor at INI Zurich [26], etc.

The most interesting observation one might make after studying these systems is that all of them have adopted distinctly different approaches to solving similar problems. For example, the BrainScaleS project led to the creation of the HICANN (High Input Count Analogue Neural Network) system, where interconnections are implemented at the wafer level. Instead of cutting the neural network dies out of the wafer, the designers created mechanisms to build interconnects directly on the wafer. This approach led to a new set of issues. One interesting issue was that using an AER scheme increased the power consumption of the system to unacceptable levels, so the designers developed a novel asynchronous low-voltage signalling scheme for the communication task [22]. The BioRC project at the University of Southern California aims to build analogue circuits using carbon-nanotube transistors that emulate synapses, neurons, and neural networks. The MIT Silicon Synapse is a fully analogue system that models ion channels in neurons to a high level of precision. Intel is attempting to use spin devices, where lateral spin valves act as neurons and memristors as synapses. Neural networks have been around for a while, but they have never been as attractive as they are today. The current interest in building large hardware neural networks is probably due to two main reasons:

1. We have much greater processing power at our disposal today than a few decades ago. This has led to the development of increasingly large neural networks (primarily in software) that have pushed the boundaries of artificial intelligence. However, these networks run on hardware that is not optimized for their patterns of memory access or computation. Building larger and larger networks using traditional Von Neumann style computers is increasingly non-viable because of power and performance considerations. Therefore, it makes economic sense to explore alternative computational styles.

2. The tremendous interest in, and benefit of, understanding how the human brain works: the brain has remained one of the most enduring mysteries in human history. Powerful simulators to test new theories are absolutely essential to make any progress on this front. As described in the previous item, traditional computers are ill-suited to handle this load.

2.7 Concluding Remarks

A point to be highlighted at this stage is that an overwhelming majority of VLSI implementations available today tend to focus on spiking-neural-network-based systems. In general, one of the primary objectives of these systems is to model the brain. The other is to use spiking neural networks to solve multi-modal pattern recognition problems in real time, such as multiple-face recognition in streaming video. Spiking neural networks appear to have immense potential to solve these problems. They have theoretically been shown to be capable of encoding more information than other neural networks for some tasks [27]. They have been called the third generation of neural networks, and rightly so. One of the reasons that hardware based on spiking neurons is so attractive is probably that it is relatively easier to implement in VLSI than RBMs, CNNs, auto-encoders, etc. This is because of the difference in the nature of information flow in these systems. Unlike other neural networks, spiking neural networks only transmit pulses between neurons, which is a big simplification when designing massive networks. However, it needs to be understood that, in many instances, the designs do not necessarily try to exploit the theoretical advantages held by spiking neural systems over other artificial neural networks. Building an analogue VLSI system for neural networks such as RBMs would require the accurate transmission of multi-bit signals, which is extremely difficult. A digital VLSI implementation could potentially be limited by memory bandwidth requirements; the computations involved in each node are also more complicated. Transmitting more data between the neuronal computation block and the memory increases the power consumption significantly and can erase any power advantage to be had by moving away from traditional computers. As will be elaborated in the subsequent chapter, memristive crossbar arrays have opened new avenues that might address some of these issues.
While the research community has shown tremendous interest in using memristive crossbar arrays for various applications, including spiking neural networks, there has been surprisingly little work on networks such as MLPs, RBMs, etc. There might be several reasons for this. These devices exhibit extreme variability in device parameters and properties. Their internal mechanics are also not completely understood. There are many candidate memristive devices, and it is difficult to build a precise model that fits them all. This makes circuit design all the more difficult, particularly in applications where precision is key.

The biggest contribution made in this work is the discovery that, by a smart choice of algorithm, it is possible to obtain higher precision with memristive devices even when they are updated very imprecisely. The algorithm used in this study, gradient descent, happens to be at the heart of the training procedures used in almost all neural networks. By using this algorithm, we demonstrate that it is possible to design analogue hardware that can be programmed to run a much wider class of neural networks efficiently. This is a highly relevant result considering the explosion of neural network applications today. A subsequent chapter covers this topic in more detail.

Chapter 3

Memristive Learning

3.1 Memristors

Memristors, first proposed conceptually as the missing circuit element by Chua [28], have seen a dramatic upsurge in interest from the research and industrial community since the memristive effect was observed in a device fabricated at HP Labs in 2008 [3]. The theoretical memristor connects the flux and the charge via the relation:

dφ = M(q)dq

It was further shown in [28] that:

1. A flux-charge memristor is passive if and only if its incremental resistance M(q) is non-negative;

2. A one-port containing only flux-charge memristors is equivalent to a flux-charge memristor.

3. Any network containing only flux-charge memristors with positive incremental memristances has one, and only one, solution.

The memristor equation and the above theorems establish the inter-relationships between the quartet of circuit elements as shown in Figure 3.1.


In 1976, Chua and Kang [29] proposed that any nth order system that followed the behaviour prescribed by the following equations was memristive:

v = R(w,i,t)i

w˙ = f (w,i,t)

where w is an nth order state variable and ẇ is its time-derivative. They also noted that

Figure 3.1: The new circuit element: the memristor [3] © IEEE

such systems were capable of modelling thermistors, Hodgkin-Huxley nerve axon membranes, etc. While the dynamics of a memristive system are well-defined by these mathematical equations, they do not explain the physical phenomenon that gives rise to memristance. This is evidenced by the wide variety of materials that display the memristive property. While they all exhibit the memristive behaviour, the physical constants and device properties are different. After HP Labs demonstrated the practical memristor, several systems that satisfy the requirements of a memristive system were identified by research groups across the world, such as phase change memories [30], neuronal axon membranes [31], etc. For many of these devices, good models that can be used for precise circuit design do not exist. The complex dependence on internal state parameters makes it extremely difficult to precisely program them to a desired state without resorting to feedback or read-write schemes [32][33]. Memristive devices are also seen to exhibit stochastic behaviour [34][35]. Currently, the most common commercial application for memristive devices is as a replacement for flash memory, where they are referred to as resistive random-access memory (RRAM). RRAMs have a significantly smaller footprint when compared to flash memories and are potentially the way forward for non-volatile memories. These devices have other potential applications in designing systems for spiking neural networks [17], dynamical circuits [36], neuromorphic modelling [37], etc. One of the most important structures used in most of these applications is the crossbar array, which will be discussed in a later section.

3.1.1 Boundary condition model

Since the discovery of TiO2-based memristive devices, several memristive devices have been proposed, such as molecular and ionic thin film memristive systems, spin-based and magnetic memristive systems, phase change memristive systems, etc. All these devices share the memristive behaviour although the underlying physics is different. Therefore, in order to study the behaviour of memristors in various circuits, it is useful to have a device-agnostic model that captures the electrical behaviour of memristors. Such models are generally ill-suited for high-precision designs. However, by using techniques that compensate for model deficiencies during circuit design, such generic models could be used for design purposes. One of the outcomes of this work is one such training algorithm, which will be discussed in the next chapter. Several device models have been reported in the literature [38][39][40][41][42]. In this work, we chose the boundary condition model (BCM) [41] for our simulations for the following reasons:

1. Although the BCM model is relatively simple, it provides closed-form equations for all the device parameters, making simulations faster. When simulating networks containing thousands of devices, this is a very useful property.

2. It was computationally easy to calculate various device parameters using this model. This allowed us to simulate larger networks and to interpret the simulation results clearly without being bogged down by the physics or inter-relationships between various device parameters.

3. With the current manufacturing capability, memristive devices display a large variability. This makes it difficult for any model to precisely capture the dynamics of all the devices used in the circuit. Therefore, using more complex models does not necessarily result in more accurate simulations.

In the BCM model, the memristor is modelled as a thin oxide film that consists of a highly-conducting layer of dimensionless length x (obtained by dividing the length of the highly-conductive layer by the total length of the device). The remaining 1 − x length of the memristor has a low conductivity. The hysteretic behaviour of the memristor is assumed to arise from the movement of the conducting layer under the influence of an applied electric field. For a given value of x, Ohm's law applied to a memristor gives:

i = M(x) · v (3.1)

where M(x) is the memristor's conductance when the conducting film is of length x. The memristor equations that control x are given by the following two equations:

dx(t)/dt = (η/i0) · M(x(t)) · v(t) · F(x(t), ηv(t), p) (3.2)

i(t) = M(x(t)) · v(t) (3.3)

where the state variable x(t) models the fact that x changes with time in the presence of an electric stimulus, ηv(t) is the normalized input voltage, η is the polarity term and nominally has magnitude 1, F(...) is a window function that models the boundary conditions of device behaviour, and p controls the extent of non-linearity in the window function.

Since M(x(t)) is modelled as a series combination of an insulating and a conducting material, it can be expressed as:

M(x(t)) = Gon · Goff / (Gon − ∆G · x(t)) (3.4)

where ∆G = Gon − Goff, and Gon and Goff denote the conductance of the device when x = 1 (conducting layer covers the entire length of the memristor) and x = 0 (insulating layer covers the entire length of the memristor), respectively.

The equations described so far are generally applicable to most of the memristor models referenced in this section. The BCM model gets its name from the method by which it defines the window function F(...). Three conditions are defined:

C1 := x(t) ∈ (0,1), or (x(t) = 0 and ηv(t) > vth,0), or (x(t) = 1 and ηv(t) < −vth,1) (3.5)

C2 := x(t) = 0 and ηv(t) ≤ vth,0 (3.6)

C3 := x(t) = 1 and ηv(t) ≥ −vth,1 (3.7)

where vth,0 and vth,1 are the high and low threshold voltage magnitudes, respectively. The window function F(...) is then computed as:

F(x, ηv, p) = Fℓ(x, ηv) = 1 if C1; 0 if C2 or C3 (3.8)

In this window function, the parameter p is unused. It is now possible to analytically integrate the memristor equations to obtain the following closed-form expression for x(t):

x(t) = Gon/∆G − √[(Gon/∆G)² + x(ti)² − 2(Gon/∆G)x(ti) − 2·Gon·Goff·(φ(t) − φ(ti))/(∆G·i0)] if C1 (3.9)

x(t) = 0 ∀ t if C2 (3.10)

x(t) = 1 ∀ t if C3 (3.11)

where φ(t) − φ(ti) = ∫ from ti to t of v(θ)dθ, and ti is the time when programming began.
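The BCM equations above can be exercised numerically. The following sketch integrates Eq. (3.2) with a simple Euler step; all parameter values (Gon, Goff, i0, thresholds) are hypothetical, chosen only to make the example run, and are not fitted to any real device:

```python
# Numerical sketch of the BCM model described above. Parameter values are
# hypothetical, for illustration only.
G_on, G_off = 1e-3, 1e-6   # conductances at x = 1 and x = 0 (siemens)
dG = G_on - G_off
i0 = 1e-4                  # normalising current in Eq. (3.2)
v_th0, v_th1 = 0.2, 0.2    # threshold voltage magnitudes
eta = 1.0                  # polarity term, |eta| = 1

def M(x):
    """Memristor conductance for state x (Eq. 3.4)."""
    return G_on * G_off / (G_on - dG * x)

def F(x, v):
    """Binary window function (Eq. 3.8): 1 under C1, 0 under C2 or C3."""
    if x <= 0.0 and eta * v <= v_th0:    # C2: pinned at x = 0
        return 0.0
    if x >= 1.0 and eta * v >= -v_th1:   # C3: pinned at x = 1
        return 0.0
    return 1.0                           # C1: state free to move

def step(x, v, dt):
    """One Euler step of Eq. (3.2): dx/dt = (eta/i0) * M(x) * v * F."""
    x += dt * (eta / i0) * M(x) * v * F(x, v)
    return min(max(x, 0.0), 1.0)         # enforce the boundary conditions

# A sustained positive voltage drives x toward 1 and M(x) toward G_on.
x = 0.5
for _ in range(2000):
    x = step(x, v=0.5, dt=1e-2)
print(x, M(x))
```

Note how the binary window function makes the state sticky at the boundaries: once x reaches 0 or 1, only a voltage beyond the relevant threshold can move it again.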

3.2 Crossbar Arrays

The architectural benefits of the crossbar structure were first demonstrated by HP Labs [43] using a computer called the Teramac. The architectural design of this computer was driven by the observation that, as devices shrink to the nano-scale regime, it becomes increasingly difficult to attain high fabrication yields. Additionally, as devices shrink and circuits become denser, cost and complexity shift to wiring and interconnections. It was therefore essential to rethink the network connectivity and computational schemes used in traditional computing systems.

Figure 3.2: Tree and crossbar architecture used in Teramac [43] © IEEE

The solution they proposed was to use the so-called fat tree architecture, wherein a large amount of redundancy was added to the interconnections between different nodes. Each of these interconnects can be programmed depending on the algorithmic requirements and the defect-status of the device at each junction. The physical manifestation of a single layer is shown in Figure 3.2. The block marked as memory is envisaged to be a three-terminal nano-electronic device that controls the interconnections between the green and red dots depending on the input combinations. In the Teramac system, this was implemented as a 6-input, 1-output look-up table. The architectural idea was to stack several such layers of crossbar arrays. It was shown that such networks are highly tolerant to defects. For example, in the Teramac system, it was observed that the system worked perfectly well even when 10% of the LUTs were non-functional and 10% of the interconnect signals were unreliable.

Figure 3.3: Crossbar array of memristors: 3-D and 2-D representation

Clearly, systems based on these ideas become economically viable only under the circumstances assumed by the developers, i.e., cheap, dense, and highly-unreliable devices. Additionally, the architecture assumed the existence of three-terminal nano-devices, which is still fodder for research. However, memristive devices have partially created a scenario wherein it might be attractive to consider building crossbar arrays inspired by the original work by Heath [43]. Although memristors are only two-terminal devices, they are mostly very compact and extremely prone to variability [3]. Researchers have proposed building simpler nano-wire crossbars as shown in Figure 3.3. Using some clever layering techniques discussed by researchers such as Likharev and Strukov [44], it is now possible to create fairly dense memristive memories using existing fabrication technologies. There is significant interest in such arrays in the industry and some companies are also close to commercialization [45]. While building general purpose computers modelled like the Teramac is beyond the scope of such arrays, they are attractive for use in several applications such as neural networks, dynamical circuits, etc. The rest of the chapter is a discussion of some of the systems that use memristors.

3.3 Memristor circuits and systems

This section discusses how state-of-the-art designs take advantage of memristors' unique properties and try to address the issues discussed in this chapter. The topics discussed in this section include memristor programming schemes, STDP-based training, the back-propagation algorithm, dynamical systems, and other circuit ideas.

3.3.1 Crossbars

Memristive crossbar arrays are made of layers of orthogonal wires sandwiching nano-scale memristor devices lying at the cross-over points between the layers [3][46]. Crossbar arrays have several attractive features. They can be made extremely small and dense, which is useful for the development of high-capacity memories and computational hardware. Additionally, they can be laid out on top of traditional CMOS ICs, where the CMOS circuits handle the complex and precise signal conditioning and processing tasks, and the crossbar array performs simpler, but parallel, operations [47]. This configuration has a lot of potential for use in hardware acceleration of computation. It can lead to the development of ultra-fast, low-power neural networks and other machine learning algorithms containing thousands or millions of parameters. Memristive crossbar arrays have the following advantages that make them attractive for use in various systems:

1. Dense programmable memory - Memristive devices have one of the lowest footprints amongst all the memory devices available in the market today. They are non-volatile, and programming them is significantly easier than programming Flash memories.

2. They can be fabricated as arrays and layered on top of CMOS circuits [44]. This makes adoption of the technology easier because they do not have to be incorporated into the CMOS technology flow.

3. Several types of nano-devices exhibit the memristive behaviour. This is exciting because it hints at the possibility of designing memristor-based circuits for non-CMOS applications too.

However, designing with memristors can also be challenging for the following reasons:

1. Variability - The current fabrication technologies result in devices that exhibit variation in parameters spread over orders of magnitude. Designing with such devices is challenging.

2. Device Models - Memristive behaviour is exhibited by several types of devices. In many instances, the underlying principles of their operation are not fully understood or modelled.

3. CMOS integration - Although some memristors are CMOS compatible, integrating CMOS circuits with memristors is still quite difficult on a commercial scale.
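The "simpler, but parallel, operations" performed by the array amount, in the ideal case, to an analogue vector-matrix multiplication: each column current is the sum of row voltages weighted by the device conductances. Below is a sketch under the assumption of an ideal array, with no sneak paths, wire resistance, or variability; the array dimensions and values are hypothetical:

```python
import numpy as np

# Idealised crossbar read-out: with row voltages V applied and columns held
# at virtual ground, each column current is the dot product of V with that
# column's conductances, I = G^T V. Values are hypothetical.
rng = np.random.default_rng(0)

rows, cols = 4, 3
G = rng.uniform(1e-6, 1e-3, size=(rows, cols))  # device conductances (siemens)
V = np.array([0.10, 0.00, 0.20, 0.05])          # read voltages on the rows (volts)

I = G.T @ V          # column currents via Kirchhoff's current law
print(I)             # one analogue multiply-accumulate per column, in parallel
```

This is why crossbars are attractive for neural networks: a whole layer's weighted sum is computed in one analogue step, with the weights stored in place as conductances.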

3.3.2 Memristor programming schemes

Broadly speaking, there are two categories of memristor programming schemes:

1. Unregulated write - In this method the programming pulse is driven into the device without monitoring the change in conductance of the device. This technique is generally suitable when using memristors as two-state digital memories. High variability and poor modelling make this a challenging approach when dealing with multi-bit or analogue memristors.

2. Regulated write - The state of the device is updated using a sequence of programming pulses. Each programming pulse is followed by a read to check if the conductance has attained the desired value. This approach generally requires a reference and a distinct circuit configuration for each memristor. Implementing this on crossbar arrays is tricky and not economically viable in the current state of technology.

3.3.2.1 Unregulated write

The circuit in Figure 3.4 shows that when a vbias is applied to turn on a transistor, vwrite causes a current to flow into the memristor. In order to program the memristor to the desired

Figure 3.4: A simple circuit schematic for (a) unregulated writing into a memristor and (b) reading the state of the memristor

on or off state, the polarity of vwrite is modulated. The simple circuit shown in Figure 3.4 is not capable of high-precision writes for several reasons:

1. Strength of the programming pulse is affected by MOSFET variability.

2. The high variability shown by memristors implies that similar programming pulses can produce very different results for different devices from the same batch.

3. For most memristor devices, high-quality device models, which are essential to decide the programming pulse, are lacking.

4. Even if precise models are available, there are practical issues to contend with. The change in conductance for a given pulse is affected by the starting state of the device. In order to program precisely, the correct sequence would be a read, followed by a

complex calculation to determine the level and width of the programming pulse vwrite, and finally, application of the computed programming pulse. Even this might not be possible for some devices, where the conductance of the device does not uniquely define its internal state.

The issues listed above are not critical when the goal is to drive the memristance to its Gon or Goff boundary states. This is because applying a high voltage pulse with the appropriate polarity would take the device to the desired state.
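The boundary-driving behaviour described above can be sketched as follows. The state model, rate constant, and function name are hypothetical, chosen only to illustrate that a sufficiently strong pulse leaves the device at the same boundary state regardless of its unknown starting point:

```python
# Behavioural sketch of an unregulated binary write: a strong pulse of the
# right polarity saturates the device at a boundary state. The linear state
# model and rate constant are hypothetical.
def unregulated_write(x, v_pulse, width, rate=5.0):
    """Move normalised state x toward 1 (v > 0) or 0 (v < 0), with no read-back."""
    x += rate * v_pulse * width          # crude open-loop state update
    return min(max(x, 0.0), 1.0)         # device saturates at its boundary

# Any initial state ends at the same boundary when the pulse is strong enough.
for x0 in (0.0, 0.3, 0.9):
    assert unregulated_write(x0, v_pulse=+1.0, width=1.0) == 1.0
    assert unregulated_write(x0, v_pulse=-1.0, width=1.0) == 0.0
```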

3.3.2.2 Regulated write

In regulated write schemes, the memristor state is actively monitored while it is being programmed to ensure that the desired state is attained. Several schemes have been proposed in order to do this [36][33][32][48]. In general, the schemes fall into two categories:

1. Continuous feedback scheme: A circuit diagram representing this method is illustrated in Figure 3.5.

Figure 3.5: Schematic of a continuous feedback write scheme

The circuit implements the idea that a programming pulse Vprog is applied to a memristor till it attains the desired state. The comparator monitors whether the target state is attained by using a reference resistance. Note that the block diagram is a simple figure illustrating the idea behind the continuous write; other methods are viable and might even be more suitable [32], for example, using a current comparator and a DAC to generate the reference voltage levels. The continuous monitoring scheme is clearly an expensive circuit requiring tight integration of the memristors and the programming circuitry. This is not feasible with the current state of technology. The two-phase monitoring scheme simplifies the circuitry significantly.

2. Two-phase monitoring scheme: The idea here is to separate the programming and the monitoring phase. Such an approach simplifies the circuitry necessary to do this task significantly. Several designs have been proposed to do this such as [33][36][48]. In general, the approach is as follows:

(a) Check if the current state of the memristor is the desired state. If yes, stop, else continue.

(b) Drive a programming pulse into the device based on the error in the device state.

(c) Stop the pulse and read new state.
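The three steps above can be sketched as a program-and-verify loop. The device model, the `read_conductance`/`apply_pulse` interface, the step gain, and the tolerance are all hypothetical stand-ins for the circuitry described in the cited designs:

```python
# Sketch of the two-phase (program-then-read) regulated write loop. The toy
# device's pulse response is deliberately imprecise; the loop converges by
# feedback rather than by modelling the device.
def regulated_write(device, g_target, tol=1e-6, max_pulses=100):
    for _ in range(max_pulses):
        g = device.read_conductance()               # (a)/(c) monitor phase
        error = g_target - g
        if abs(error) <= tol:                       # (a) already at target?
            return g
        device.apply_pulse(polarity=1 if error > 0 else -1,
                           width=abs(error))        # (b) error-scaled pulse
    return device.read_conductance()

class ToyMemristor:
    """Toy device: each pulse moves the conductance by an unmodelled gain."""
    def __init__(self, g=1e-4):
        self.g = g
    def read_conductance(self):
        return self.g
    def apply_pulse(self, polarity, width):
        self.g += polarity * 0.5 * width            # imprecise device response

dev = ToyMemristor()
final = regulated_write(dev, g_target=5e-4)
print(final)
```

Even though the toy device only realises half of each requested update, the read-back loop still lands within tolerance, which is the essential advantage of regulated over unregulated writes.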

It can be seen that all the devices in the crossbar array are a part of the network connecting any given input and output in the array. These devices create undesirable sneak paths in the array. They make it very difficult to precisely program a device without affecting the devices lying on the sneak paths. A diagram representing the sneak paths is shown in Figure 3.6 which illustrates the effect using the blue coloured traces. The larger the array, the stronger the effect of sneak paths on the output values.

Figure 3.6: Sneak paths in a crossbar array

A partial solution to this problem was presented in [49], where a technique called cyclical programming is employed. In this method, a high impedance is applied across the untargeted devices (Figure 3.7). The other switches shown in the figure are to control the op-amp configuration during the programming and read phases.

The untargeted devices are disabled by applying a high impedance Z across the corresponding nodes (V1,...,V6). Such an arrangement also permits simple arithmetic operations such as addition or subtraction, as detailed in the paper. However, it must be noted that this scheme is not suitable for memristor arrays with multiple columns because of the parallel sneak paths that are created in the presence of high-impedance contacts in a crossbar arrangement.

In order to solve this problem, [33] proposes a design which involves the use of a transistor in crossbar arrays as shown in Figure 3.8. Each memory element is now made of

Figure 3.7: Memristor-based analogue memory/computing unit [49] © IEEE

Figure 3.8: A 1T1M crossbar array access to the top left element [33] © IEEE

a transistor and a memristor. The voltage level Vdrive is chosen as a small value (less than the threshold) during the read phase and a large value during the programming phase. While this simplifies the architecture, it still needs a transistor embedded in the crossbar array. Another problem with this approach is that the devices have to be trained sequentially. This will slow down the processing when dealing with large memory arrays.

Another interesting approach to solve this problem has been described in [50], where the authors propose a segmented crossbar architecture as shown in Figure 3.9. The AC blocks shown in the figure are used to block currents emerging from the unselected crossbar segments and only enable the desired segments. While the segmented architecture does not eliminate the sneak paths completely, it limits the number of paths to a single segment. This can be a useful technique when large crossbar memory arrays

are required.

Figure 3.9: Segmented crossbar architecture [50] © IEEE

3.3.3 STDP-based algorithms in memristive crossbar arrays

STDP is a learning algorithm for neural networks based on the principle of Hebbian learning. The appendix contains a brief overview of the fundamental principle behind STDP-based training algorithms. The fundamental idea is that the synaptic weights change as a strong function of the temporal relationship between the pre-synaptic and post-synaptic spikes or voltage levels. One can express the change in synaptic weights as:

∆w = F(ξ(∆T)) (3.12) where, ∆T is the time difference between the pre- and post-synaptic spikes. The exact magnitude of ∆w depends on two factors:

1. ξ(∆T), which is the potential across the synapse as a function of the time difference between the pre- and post-synaptic pulses. This is affected by the shape of the action potential generated by the neuron.

2. F(ξ(∆T)), the response of the memristor to a given voltage pulse. This is affected by device fabrication technology and variability.
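As an illustration of Eq. (3.12), a common choice in the STDP literature, though not necessarily the specific F or spike shapes used in the cited designs, is the exponential pair-based window; the amplitudes and time constant below are generic examples:

```python
import math

# Illustrative exponential pair-based STDP window for Eq. (3.12).
# Constants are generic examples, not taken from the cited designs.
def stdp_dw(dT, a_plus=0.10, a_minus=0.12, tau=20e-3):
    """Weight change for spike-time difference dT = t_post - t_pre (seconds)."""
    if dT > 0:       # pre before post: potentiation
        return a_plus * math.exp(-dT / tau)
    elif dT < 0:     # post before pre: depression
        return -a_minus * math.exp(dT / tau)
    return 0.0

# Sign follows causality, and closely spaced pairs change the weight more.
print(stdp_dw(5e-3), stdp_dw(-5e-3), stdp_dw(40e-3))
```

In a memristive implementation, this window emerges physically: overlapping pre- and post-synaptic pulses produce a net voltage ξ(∆T) across the device, and the device's own threshold and non-linearity realise F.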

The effect of spike shapes on the potential generated across a synapse (ξ(∆T)) is shown in Figure 3.10.

Figure 3.10: Voltage across the synapse ξ(∆T) for various action potential shapes [51] © Frontiers in Neuroscience

The parallel between STDP and the memristor is that when a large positive (negative) voltage is applied across a memristor, its conductance increases (decreases). This observation has led researchers to believe that the memristive behaviour of the brain is key to understanding its operation. It also led to the development of a plethora of memristive crossbar arrays trained using STDP-based training rules. One of the most interesting applications of this observation is in using memristive crossbar arrays to mimic the orientation-selective property of the V1 layer of the visual cortex [52]. In this work, each column of a crossbar array terminates with a neuron and the rows correspond to the input features. The training data comes from a special image sensor that generates an AER-encoded output whenever the change in incident photo-intensity levels exceeds a threshold. At the end of the training process, orientation maps similar to those in the visual cortex were observed in the crossbar array. This result, although useful, is not surprising since the ability of STDP learning rules to capture orientation maps was already well-known. The interesting take-away is that memristor crossbar arrays are exceptionally well-suited to model synapses. Several other researchers have built systems that use STDP-based training rules, such as [51][53][54][55]. All these designs use a crossbar array with memristors as synaptic elements. The difference lies in the complexity of pulse generation, adaptation of the thresholds used in the neurons, encoding schemes for the input data, etc. These systems are still actively being explored by a large number of researchers world-wide. It must also be noted that STDP-based learning methods have not yet attained the performance metrics exhibited by networks trained using back-propagation or the gradient-descent algorithm.
The paper [17] has a table illustrating the performance of various STDP-based networks. Neural networks based on gradient-descent learning have been able to attain much superior performance (for example, LeNet [56]). Today's large-scale neural networks based on gradient-descent, such as GoogLeNet, AlexNet [57], etc., are being used for larger and more complex tasks. In fact, these networks have attained near-human performance on datasets provided in contests such as ImageNet. This difference in performance is the primary motivation behind the efforts to use back-propagation based learning on memristive crossbar arrays.

3.3.4 Back-propagation algorithm

A detailed description of the gradient-descent or back-propagation implementation issues is provided in the next chapter. This section merely reflects upon the existing ideas in order to complete the topic of memristive crossbar applications. While STDP-based learning schemes are an extremely popular area of research, back-propagation based networks have drawn much less attention [58][59][60]. This is incongruous because most of the commercially used neural networks today are trained using this algorithm. (A brief overview of the back-propagation algorithm is provided in the Appendix.) In [60], Soudry et al. discuss an implementation that involves a grid of synapses. However, each synapse is made of two transistors and a memristor. The work presented in [58] also implements the gradient-descent algorithm on a transistor-free crossbar array. Both these papers demonstrate the performance of their networks on the Wisconsin Breast Cancer Dataset. In both these works, the problem of imprecise programming is unanswered. The designs generally attempt to drive training pulses that induce changes in the device conductances that are proportional to the update magnitude. For reasons described earlier, this is not practically feasible. In [59], Fabien et al. train a simple linear classifier. While they do not explicitly refer to the training technique as the back-propagation algorithm, the approaches are equivalent. This work is interesting because they worked with real devices. Secondly, the approach does not use any transistors within the crossbar array. Finally, they use an efficient 4-phase training scheme which allows for parallel programming of the devices. In addition to these, [61][62][63] discuss gradient-descent algorithmic ideas for implementation on crossbar arrays that are similar to the ideas presented in the next chapter of this thesis.
The authors of [64] argue that implementing the back-propagation pulses might be infeasible owing to the micro-voltage pulse accuracy required for the training process. However, the paper is not clear about the effect of modulating the width of the training pulses, which is the approach presented in this thesis. We use a pulse-width modulation based training algorithm to train purely memristive crossbar arrays. These ideas are covered in the next chapter.

3.3.5 Dynamical systems and other circuit applications

While memristors have been a subject of significant interest to the neuromorphic research community, they have also been used in a variety of other applications. In fact, even before the connection to STDP was found, these devices were used in dynamical circuits to design noise generators, oscillators, etc. [65][66]. Unlike resistors, inductors, or capacitors, memristors exhibit a hysteretic loop which results in non-linear electrical behaviour. Another interesting idea is to use memristors for programmable analogue circuits [67]. These designs take advantage of the re-programmability of memristors in order to correct or modify the behaviour of analogue circuits. The authors recognize that there are issues such as poor models, variability, etc. and also propose ideas to mitigate them. Unfortunately, these designs are currently not feasible for production on an industrial scale because of manufacturing difficulties. They are, however, of interest because they demonstrate how circuits and systems stand to benefit by incorporating memristive behaviour. This thesis does not delve much into these designs because the focus is on hardware neural networks. Chapter 4

Gradient-descent in crossbar arrays

This chapter covers techniques to implement gradient-descent based learning on memristive crossbar arrays. The key idea behind the implementation techniques discussed in this chapter is an approximation to the standard gradient-descent algorithm, called the Unregulated Step Descent (USD) algorithm. This algorithm is an original and key outcome of this research work. This chapter describes the reasoning behind the USD approximation and how it addresses various hardware implementation issues such as the effect of device parameters and variability. The performance of the algorithm is analysed by testing it on artificially generated and real-world datasets. The chapter also discusses how the USD algorithm results in a simpler training architecture for crossbar arrays by enabling a simple 4-phase training procedure. This procedure makes it possible to update all the crossbar memory elements in parallel, reducing the hardware cost, time, and complexity of the training architecture significantly. Some of the material presented in this chapter has been published by the author of this thesis in [68].

4.1 Introduction

The discussion in the previous chapter highlighted some of the limitations of memristors, one of which was high device variability [69]. Given the incessant focus on higher device densities and the physical limitations of small devices, this problem is unlikely to vanish. The chapter also described how the lack of precise device models affects memristive circuit design.


In spite of these difficulties, the potential benefits of memristors have generated a lot of interest in developing memristor-based computational circuits [46][70]. Neuromorphic systems consisting of dynamic circuits for neurons and using memristors to mimic the plasticity rules seen in biological synapses are especially popular [52][53][54][71]. The training rules for these systems are generally based on Hebbian STDP (Spike Timing Dependent Plasticity) learning, as described in the previous chapter. These systems have been demonstrated to work reasonably and are also tolerant of high device variability. However, the learning performance of these systems is rather limited when pitted against the state-of-the-art systems that use deep learning techniques. The deep learning systems use neural networks such as Restricted Boltzmann Machines (RBMs), Convolutional Neural Networks (CNNs), auto-encoders, etc., and are trained using the gradient-descent algorithm. A concise summary of these networks is provided in the Appendix. Simple memristive hardware based on some of these neural network architectures, performing simple binary pattern classification tasks, has been demonstrated [59]. As it stands today, these neural networks perform better and use fewer training parameters than spiking systems [17]. Therefore, an investigation of how memristive crossbar arrays can implement gradient-descent based learning is of practical relevance.

4.2 Gradient descent algorithm

Gradient descent is a first-order optimization method for finding a local minimum of a function. For convex functions, it converges to the global minimum. The method is used for a variety of optimization tasks such as matrix inversion and objective function minimization in several statistical inference problems. Gradient descent is also the principle underlying back-propagation and similar training algorithms used to train a wide variety of artificial neural networks such as RBMs, CNNs, auto-encoders, etc. In order to train a network using the gradient-descent algorithm for a given training dataset (X (input vectors), Y (outputs)), an objective or cost function F(w,X,Y) with parameters w is defined. The goal of the training procedure is generally to find the parameters that minimize or maximize the objective function. The gradient-descent algorithm uses the gradient of the objective function with respect to w in order to do this. The technique is guaranteed to converge to the global minimum if the objective function is convex. In normal gradient descent, the gradient is approximated by averaging over all the training samples. The update rule for the parameters is as follows:

w(t + 1) := w(t) − (α/M) Σ_{i=1}^{M} ∇w F(w(t), xi, yi)    (4.1)

Figure 4.1: Gradient descent for a 2-dimensional objective function. F(w0,w1) in the figure is the same as F(w,x,y).

where xi is the i-th training example, M is the number of training samples, and α is a parameter that controls the update magnitude. Some of the common objective functions used in machine learning are the least-square error, the Kullback–Leibler divergence, and the log-likelihood estimate. The back-propagation algorithm for multi-layer neural networks involves computing the gradient with respect to every weight in the network. This calculation is greatly simplified by using the property that the gradient with respect to a weight in a particular layer can reuse the gradient values computed for the higher layers [72]. Almost every popular neural network implementation uses this technique. There are several variants of gradient descent. One variant, called stochastic gradient descent [73], is particularly interesting. In this variant, the weights are updated for every training sample, and the update rule simplifies to:

w(t + 1) := w(t) − α · ∇w F(w(t), xi, yi)    (4.2)

Note that in all the following equations the step indices are omitted for brevity.

The idea is that when the training samples are chosen at random, the net effect of all the weight updates averages out to give the same result as the non-stochastic gradient-descent algorithm. The total time taken for full convergence using stochastic gradient descent is higher than for the non-stochastic version. On the other hand, the stochastic version takes much less time to reach a predefined error measure [73], which is of interest when the training set is large. Updating the weights for every training sample also makes the weight updates noisy. One way to mitigate this effect is to gradually decrease the learning rate α with iterations. Another technique used in some neural networks is called mini-batch stochastic gradient descent, where the error is averaged over a small batch of training samples. This approach tries to reduce the noise created by single-sample stochastic descent while retaining the benefits accruing from stochastic descent. However, as we discuss in this chapter, device limitations make it difficult to directly implement steepest gradient descent in memristive crossbar arrays. Therefore, we propose an approximate gradient-descent rule that simplifies the hardware used to train memristive crossbar arrays. We also investigate the effect of device parameters and variability on the performance of the algorithm.
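As an illustration of the two update rules, the following sketch compares the batch update of Equation 4.1 with the stochastic update of Equation 4.2 on a small least-squares problem. The quadratic cost and the synthetic dataset here are illustrative assumptions, not part of the thesis experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative least-squares problem: y = w_true . x + noise
N, M = 3, 200
w_true = np.array([0.5, -1.0, 2.0])
X = rng.uniform(-1.0, 1.0, size=(M, N))
y = X @ w_true + 0.01 * rng.standard_normal(M)

def grad(w, x, t):
    """Gradient of the squared error 0.5 * (t - w.x)^2 with respect to w."""
    return -(t - w @ x) * x

# Batch gradient descent (Equation 4.1): average the gradient over all
# M training samples before taking a step.
w_batch = np.zeros(N)
alpha = 0.1
for _ in range(500):
    g = sum(grad(w_batch, X[i], y[i]) for i in range(M)) / M
    w_batch -= alpha * g

# Stochastic gradient descent (Equation 4.2): one random sample per step.
w_sgd = np.zeros(N)
for _ in range(5000):
    i = rng.integers(M)
    w_sgd -= alpha * grad(w_sgd, X[i], y[i])

print(w_batch, w_sgd)   # both approach w_true = [0.5, -1.0, 2.0]
```

The stochastic trace is noisier from step to step, but it makes many more parameter updates per pass over the data, which is what makes it attractive for large training sets.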

4.3 Gradient descent for linear classifiers

One of the simplest networks that can be trained by gradient descent is a linear classifier, or perceptron. These networks are generally trained using the method of least-square minimization. In this method, the objective function quantifies the mean square error generated by a predictive model hw(x) when it is used to fit the training samples (yi, xi), i ∈ 1,...,M. Each training sample consists of a two-tuple of an observed output yi and an input feature vector xi = [xi,1, xi,2, ..., xi,N]^T. The objective function can be expressed as:

F(w) = (1/M) Σ_{i=1}^{M} (yi − hw(xi))²    (4.3)

For linear predictive models of the form hw(x) = w^T x, the update rule for w = [w1, w2, ..., wN] in each iteration is given by the method of steepest descent as:

wk := wk − (α/M) Σ_{i=1}^{M} (yi − hw(xi)) xi,k    (4.4)

The update rule for the stochastic version of this equation is expressed as:

wk := wk − α · δ · xi,k    (4.5)

where the prediction error is

δ = yi − hw(xi)    (4.6)

This stochastic rule introduces more noise into the convergence behaviour, but it converges faster than the steepest-descent technique for large datasets [73].

Figure 4.2: Block diagram of training module [68]

4.4 Gradient descent in crossbar arrays

From an implementation perspective, the gradient-descent algorithm consists of two passes per iteration: a forward predictive pass and a reverse update pass. The forward pass is equivalent to a matrix multiplication operation. One method to implement this on crossbar arrays is shown in Figure 4.2. Here, by appropriate scaling, the weights w are treated as conductances g, the inputs x as voltages v, and the outputs hw(x) as currents i. During the scaling operation, care must be taken to ensure that the rescaled values are within the dynamic range of the relevant circuit blocks. Additionally, it must be ensured that the output currents ik are sunk into virtual grounds to eliminate the effect of loading on the current outputs. This is a common feature of neurons designed in analogue VLSI.

The parameters of the predictive model are updated in the reverse pass. Factors such as the memristive device characteristics, the dynamic range of the input, and variability must be considered when designing such a system. The operating principle is fairly simple: the conductance of a memristive device changes when it is subjected to a non-zero flux. Therefore, the goal is to effect the desired change in memristor conductance by modulating the amplitude and/or duration of the training pulse β applied to the device. The weight update rule can now be expressed as:

wk := wk − λk(β, ωk)    (4.7)

where λk(β, ωk) is the change in device conductance when a training pulse β is applied to a device with state parameters ωk. In theory, any weight update can be implemented in this manner. For example, Equation 4.7 is equivalent to Equation 4.5 if λk(β, ωk) is made equal to α · δ · xi,k. However, such precise control over the memristor states is impractical for reasons discussed earlier. The designer can only control the direction of the updates.

The function λk is typically monotonically increasing with respect to β, and positive (negative) pulses result in an increase (decrease) of memristor conductance. However, the exact magnitude of the change in conductance is a complex function of the device state. For example, a larger change in conductance is observed at a higher conductance state than at a lower conductance state. Additionally, the internal state parameters ωk of the device might not be directly observable. In most instances, the device models are either too simplistic or too computationally expensive to compute λk to the desired precision. The memristor behaviour may also be stochastic. This makes it very difficult to stimulate a precisely controlled change in device conductance by simple schemes. Some of the methods to precisely program memristors have already been discussed in an earlier chapter. These methods are based on look-up tables (LUTs) or feedback schemes. However, they have several implementation issues. Memristive devices are characterized by high variability (spread across orders of magnitude), making it impossible to use a shared LUT for the entire array. Using a dedicated LUT for each device is clearly impractical because it would increase the cost and require a lot of memory. The feedback schemes typically incorporate CMOS devices within the crossbar array, which makes fabrication more complex and expensive. Such schemes also require dedicated control circuitry and do not permit the devices to be trained in parallel, because each device has to be precisely regulated. Sequential training is unattractive because high-dimensional inputs result in a large number of devices, making the training process infeasibly slow. Therefore, in order to enable the use of memristive crossbar arrays in practical computational hardware, better training mechanisms are essential.
It must be noted that the algorithms used for pattern recognition tasks do not try to precisely compute a particular set of parameter values for the prediction model; that task is impractical for larger datasets. Instead, these algorithms try to minimize or maximize the objective function. Inspired by this feature of the training algorithms, we propose an approximate gradient-descent rule that allows the devices to settle to a state that minimizes the least-square error on the training data without actively trying to regulate each device state to a predetermined value at every step.

4.5 Unregulated step descent

In order to address the issues discussed in the previous section, we propose to approximate the training pulse as:

β ∝ α · sign(δ · xi,k)    (4.8)

Using β from Equation 4.8 in Equation 4.7 is the primary idea behind the Unregulated Step Descent (USD) algorithm. The motivation is similar to that of the Manhattan rule [74]. The programmer chooses α ignoring the exact magnitude of δ · xi,k. The intuition behind Equation 4.8 is not to regulate the update step size, which is impractical in memristive crossbar arrays for reasons discussed earlier. Instead, the training rule shifts the weights in the general direction that lowers the objective function. The learning algorithm corrects the error introduced by this approximation by taking small steps and recalibrating the direction of descent in each iteration. One limitation of the USD approximation is that it is not always easy to choose a learning rate that ensures convergence. As shown in Figure 4.3, a good choice can improve the rate of convergence for weights whose gradients are large. It can also be seen in Figure 4.3 that the direction of descent differs from that of the steepest-descent method. In addition, when the solution approaches a local minimum, the ability of the training rule to converge is limited by α. This effect is seen as noise, or settling error, that can be reduced by adaptively shrinking α as the solution converges. Note that this figure was generated with update step sizes that are the same for all the input weights; this condition does not typically hold when training a memristor crossbar array.
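A software sketch helps make Equations 4.7 and 4.8 concrete. The `pulse_response` function below is a hypothetical stand-in for the state-dependent device response λk (larger changes at higher conductance, as described above), not a calibrated memristor model; negative effective weights are realized with a differential pair, as in the simulation setup described later in this chapter. The pulse direction follows sign(δ · xi,k), applied so as to reduce the prediction error.

```python
import numpy as np

rng = np.random.default_rng(1)

def pulse_response(w, step):
    """Hypothetical state-dependent response (lambda_k): a pulse changes
    the conductance more when the device is in a higher-conductance
    state. This is an illustrative law, not a calibrated device model."""
    return step * (0.1 + w)

# Illustrative linearly separable task. Negative effective weights come
# from a differential pair (wplus - wminus) of conductances in [0, 1].
N, M = 5, 400
w_true = rng.uniform(-1, 1, N)
X = rng.uniform(-1, 1, (M, N))
y = np.sign(X @ w_true)

wplus = np.full(N, 0.5)          # conductances start mid-range
wminus = np.full(N, 0.5)
alpha = 0.05

for _ in range(2000):
    i = rng.integers(M)
    h = np.sign(X[i] @ (wplus - wminus))     # forward (read) pass
    delta = y[i] - h                         # prediction error
    # USD: only the sign of delta * x picks the pulse polarity; the
    # magnitude of the conductance change is left to the device.
    direction = np.sign(delta * X[i])
    wplus = np.clip(wplus + direction * pulse_response(wplus, alpha), 0, 1)
    wminus = np.clip(wminus - direction * pulse_response(wminus, alpha), 0, 1)

error = np.mean(np.sign(X @ (wplus - wminus)) != y)
print(error)
```

Note that the controller never reads or regulates the individual device states; it only chooses the pulse polarity, which is the essence of the USD approximation.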

Figure 4.3: Convergence of the USD algorithm when finding the minimum of a paraboloid function

It should be noted that other inexact gradient-descent methods, such as stochastic gradient descent, are based on a similar idea to USD: an approximation of the actual gradient of the objective function is used to update the weights.

4.6 USD versus other methods

The parallel weight perturbation (PWP) algorithm [75][76] has been proposed as an approximation of the gradient-descent algorithm suitable for analogue hardware. However, we argue that the USD approximation is better suited to crossbar arrays. In the PWP algorithm, the weights are perturbed such that the magnitude of the change is uniformly distributed. The resulting change in the objective function is computed and used to approximate the gradient, which is then used to update the weights. Note that a uniformly distributed perturbation is important for the PWP algorithm to work well [76]. Multiple perturbations are sometimes performed to improve the convergence behaviour. Implementing PWP on crossbar arrays is difficult for the following reasons:

1. Every update involves two device programming steps: a perturbation step and a correction step. If multiple perturbations are performed, the number of programming steps increases further. On the other hand, the USD implementation involves only a single programming step in each iteration.

2. It is impossible to perturb all the devices in parallel in a manner that ensures that the perturbations are uniformly distributed (because of the state dependence in memristors). One alternative is to perturb the devices sequentially, with the pulse widths controlled to effect the desired changes. However, as discussed earlier, it is practically impossible to control the state of each device precisely. Even if it were possible to precisely control the updates, cycling through each device individually makes training time scale with the number of devices. The USD algorithm is not subject to such limitations.

3. Finally, the USD algorithm does not need a uniform perturbation signal generator. This simplifies the design of the circuit.

In parallel with this work, a few other research groups have been working on similar techniques. An approach similar to the USD algorithm was used to train small but real memristor arrays in [62]. In [61], the authors propose two methods of crossbar training: a fixed-amplitude training method similar to USD, and a variable-amplitude training method. In the variable-amplitude method, logarithmically scaled voltage pulses are applied to the crossbar array. A voltage proportional to the update magnitude in Equation 4.5 is applied across a device by applying voltages of log(δ) and log(xi,k) to its two terminals. When applied to a memristor characterized by an exponential relationship between change in conductance and applied voltage, this results in the desired net change in conductance. The authors indicate that the variable-amplitude method can out-perform the fixed-amplitude training procedure. However, this mechanism assumes an exponential relationship in the memristor equations, which is not necessarily always true. The approach proposed in this thesis to improve accuracy is to shrink the training pulses as the classifier converges. It would be useful to experimentally compare the two approaches and evaluate their performance.
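The variable-amplitude scheme rests on the identity exp(log δ + log xi,k) = δ · xi,k: if the change in conductance is exponential in the applied voltage, then driving the two terminals with log-scaled voltages produces a net update proportional to δ · xi,k. A minimal numerical sketch, in which the exponential device law is an assumed illustration:

```python
import numpy as np

def conductance_change(v, k=1.0):
    """Assumed device law: the change in conductance grows exponentially
    with the voltage across the device (illustrative, not fitted)."""
    return k * np.exp(v)

delta, x = 0.3, 0.7     # prediction-error and input-feature magnitudes

# Drive log(delta) into one terminal and -log(x) into the other, so the
# device sees log(delta) + log(x) across it.
v_across = np.log(delta) + np.log(x)

dg = conductance_change(v_across)
print(dg, delta * x)    # both are 0.21: exp(log d + log x) = d * x
```

Signs are handled separately by the pulse polarity; the logarithms only encode magnitudes.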

4.7 Implementation

Most of the commonly used memristive device models [35][41][77] define a threshold voltage vth for memristors. This is the voltage across the device below which the conductance does not change, or changes very little. This property is used in the forward pass of each iteration, where it is ensured that the voltages used for reading the array are much lower than vth. This restriction minimizes changes in the device state caused by read operations. The gradient of the objective function is approximately computed using the sign of the input and the resultant prediction error, as suggested by the USD approximation. The computed values are then used to train the devices in the update pass using the 4-phase scheme shown in Figure 4.4.

Figure 4.4: 4-phase training scheme. The voltage levels in the pulsing scheme can take only three values: +Vdrive, 0, and −Vdrive [68]

A similar scheme was used in [59] to train a binary perceptron. As discussed in the memristive learning chapter, [59] experimentally demonstrates simple pattern recognition tasks using a single-layer perceptron network in a memristive crossbar array. In this scheme, as illustrated in Figure 4.4, four-phase pulse patterns are driven into the inputs (rows) and outputs (columns) of the crossbar array. The pattern is chosen based on the sign of the input feature and the output error, as shown in Figure 4.4. In each phase, only a subset of the devices sees a voltage difference that is large enough to effect a change in conductance. For example, in the first phase, memristors that are driven by positive inputs and belong to columns that generate positive output errors see a voltage of 2Vdrive, while the other devices see only 0 V or Vdrive. Vdrive should be chosen such that Vdrive < vth < 2Vdrive.

This guarantees that the untargeted devices do not see a voltage difference greater than vth.

Additionally, keeping the pulse widths the same in all the phases is necessary to eliminate 2Vdrive sneak paths that can corrupt the state of the untargeted devices. Note that the simplification given by Equation 4.8 is essential for the 4-phase scheme to work; otherwise, each device would require a unique training pulse corresponding to its weight update.
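The phase-selection logic can be sketched as follows. The assignment of (input sign, error sign) pairs to phases is our reading of Figure 4.4 and should be checked against the actual pulse diagram; the sketch only verifies the property stated in the text, namely that targeted devices see 2·Vdrive across them while untargeted devices never exceed Vdrive.

```python
import numpy as np

V = 1.0  # Vdrive, chosen so that Vdrive < v_th < 2 * Vdrive

def phase_voltages(sign_x, sign_err, phase):
    """Row/column drive voltages for one of the four phases. Each phase
    targets one (input sign, error sign) combination; only the targeted
    devices see the full 2*Vdrive across them, with the polarity set by
    the desired update direction."""
    tx, te = [(+1, +1), (+1, -1), (-1, +1), (-1, -1)][phase]
    d = tx * te                              # update direction for targets
    v_row = np.where(sign_x == tx, d * V, 0.0)
    v_col = np.where(sign_err == te, -d * V, 0.0)
    return v_row, v_col

sign_x = np.array([+1, -1, +1])      # signs of the input features (rows)
sign_err = np.array([+1, -1])        # signs of the column errors

for phase in range(4):
    v_row, v_col = phase_voltages(sign_x, sign_err, phase)
    v_dev = v_row[:, None] - v_col[None, :]  # voltage across device (r, c)
    tx, te = [(+1, +1), (+1, -1), (-1, +1), (-1, -1)][phase]
    targeted = (sign_x[:, None] == tx) & (sign_err[None, :] == te)
    assert np.all(np.abs(v_dev[targeted]) == 2 * V)   # above threshold
    assert np.all(np.abs(v_dev[~targeted]) <= V)      # below threshold
```

After all four phases, each device has seen exactly one super-threshold pulse whose polarity matches its own sign(δ · xi,k), which is what allows the whole array to be updated in parallel.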

One potential optimization is to scale the pulse amplitudes in proportion to δ and xi,k.

Such a scheme can help reduce the settling error and settling time. This study does not pursue the idea because we expect significant variability in vth, which is not modelled by the BCM model used in our simulations. Experimental evidence and device models also suggest that the threshold voltage is a soft boundary, below which small changes in device conductance do occur [3]. This can result in drift in the weights of the non-targeted devices. It should be noted that the 4-phase scheme partially corrects for this drift by ensuring that all the devices effectively see a net flux only in the desired update direction. The training algorithm is also self-correcting, which makes the effect of drift during training considerably less serious.

4.8 Simulations

4.8.1 Simulation Setup

Simulations to test the algorithm were run using the Boundary Condition Model (BCM) for memristors that was described in an earlier chapter. The device parameters are listed in Table 4.1.

Table 4.1: Device parameters for the BCM model used in the simulations

Property   Description                              Value
Goff       Mean off-state conductance               6.25e-4 S
Gon        Mean on-state conductance                Goff × Gratio
µv         Mobility of oxygen ions                  1e-14 m² s⁻¹ V⁻¹
D          Length of thin-oxide film                10 nm
η          Mean value of polarity coefficient       1
vmax       Maximum voltage applied to the devices   1 V
t0         Time normalization factor                10 ms
|vth|      Threshold voltage magnitude              1 V

The BCM model provides a fair approximation of the electrical properties of a memristor [77] even though it does not model the threshold voltage. Note that the vth described in the BCM model is different from the threshold voltage being discussed here. The lack of threshold modelling is not an issue when the array is trained using the USD rule, because the amplitudes of the training pulses are fixed and much larger than the voltage threshold vth of the devices in the array. The effect of variability was tested by randomly generating the on-state conductance Gon, off-state conductance Goff, and polarity coefficient η values.

Variability was controlled using the control parameter σ: for any device property with mean value p, sample values were generated such that the standard deviation of the distribution was σ · p. Based on empirical observations [69][51], the log-normal distribution was used for Gon and Goff; η was sampled from a normal distribution. Idealized blocks were used to model the peripheral circuit elements such as the sigmoid function, scaling, logic, and pulse generation. Note that the model implicitly assumes that the memristive conductance is continuous; therefore, this study does not concern itself with the effect of finite resolution. This is a realistic assumption, as can be seen in [78]. To study the effect of device parameters on the performance of the array, tests were run on several datasets of size M = 1000 samples and dimensionality N = 20. The linear classifier was chosen to test the USD rule because it uses gradient descent and is simple enough to develop intuition for the performance of the technique. Additionally, the crossbar architecture for the linear classifier is similar to that of more complicated machine learning algorithms such as polynomial regression or classification, MLPs, RBMs, etc. Negative weights are implemented by driving input features of the same magnitude but opposite polarity into the array. A bias term is learnt by using a constant input as an added dimension of the input vector. For example, a column with 22 memristors can be used to train a linear classifier on a ten-dimensional training set. Alternatively, two columns of 11 devices each, with a summation circuit, can be used for the same training set. In this work, the simulations assume ideal peripheral circuitry; under these conditions, both structures are equivalent.
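The device-sampling procedure can be sketched as follows. The moment-matched parameterization of the log-normal (so that the samples have mean p and standard deviation σ · p) is our assumption about how σ was applied; the parameter values are taken from Table 4.1.

```python
import numpy as np

rng = np.random.default_rng(42)

def sample_lognormal(mean, sigma_rel, size):
    """Draw log-normal samples with the given mean and a standard
    deviation of sigma_rel * mean (moment-matched parameterization)."""
    var = (sigma_rel * mean) ** 2
    mu = np.log(mean**2 / np.sqrt(var + mean**2))
    s = np.sqrt(np.log(1 + var / mean**2))
    return rng.lognormal(mu, s, size)

Goff_mean, Gratio, sigma = 6.25e-4, 100, 0.2
n_devices = 1000

Goff = sample_lognormal(Goff_mean, sigma, n_devices)
Gon = sample_lognormal(Goff_mean * Gratio, sigma, n_devices)
eta = rng.normal(1.0, sigma * 1.0, n_devices)   # polarity coefficient

print(Goff.mean(), Goff.std())   # close to 6.25e-4 and 0.2 * 6.25e-4
```

Because the log-normal is skewed, even σ = 0.2 produces a long tail of high-conductance devices, which is consistent with the wide spread shown in Figure 4.15.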

The roles of the initial conditions and the device parameters Gratio (= Gon/Goff), σ, and α in the performance of the crossbar array are presented in the following sections. The training pulses used in all the simulations have an amplitude Vdrive = 1 V and duration α · t0. The learning capability of the array is analysed in terms of settling time and settling error. Classification error is computed from the performance of the array on the entire training dataset in each iteration. Settling error is defined as the asymptotic mean error rate observed on the training dataset. Since the USD algorithm is noisy in nature, the mean classification error rate was computed by averaging the classification error over the preceding 10000 iterations. Settling time is defined as the number of iterations taken until no improvement is seen in the averaged settling error. The training sets used to obtain the plots in Figures 4.10–4.12 and 4.13–4.14 were generated by sampling from a uniformly distributed N-dimensional random variable and splitting the sample space by a randomly chosen hyperplane. A random subset of the training samples was misclassified (by assigning the label corresponding to the wrong side of the dividing hyperplane) with probability pe, which unless stated otherwise is set at 5%.
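The synthetic training sets just described can be generated as in the sketch below; the sampling range and the hyperplane-through-the-origin parameterization are assumptions where the text leaves the details open.

```python
import numpy as np

rng = np.random.default_rng(7)

def make_dataset(M=1000, N=20, p_e=0.05):
    """Uniform N-dimensional samples split by a random hyperplane, with
    a fraction p_e of the labels flipped to simulate label noise."""
    X = rng.uniform(-1, 1, (M, N))
    normal = rng.standard_normal(N)   # normal vector of the hyperplane
    y = np.sign(X @ normal)
    flip = rng.random(M) < p_e        # mislabel with probability p_e
    y[flip] *= -1
    return X, y

X, y = make_dataset()
print(X.shape, np.unique(y))
```

With p_e = 0.05, about 5% of the samples lie on the wrong side of the true boundary, so even a perfect linear classifier cannot reach zero training error on these sets.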

4.8.2 Initial condition analysis

Purely from an algorithmic perspective, gradient descent treats all the training parameters equally. However, on crossbar arrays, initialization affects the convergence behaviour and performance of the algorithm, as described in this section. This arises because of the peculiarities of the device dynamics. Figure 4.5 shows that initializing the weights of the array close to 0 results in very slow convergence. In fact, the improvement in classification error in the first few iterations is almost imperceptible. However, the noise in the corresponding settling error trace is much smaller. Unfortunately, this might not be of much use practically, because when the weights are initialized to small values, they settle to values that are also very small and similar, as shown in Figure 4.6.

Figure 4.5: Convergence behaviour for different weight initializations

This is not a problem when we have ideal peripheral circuitry with no resolution or dynamic-range limitations. However, with real circuits, it will result in an increase in settling time, and such a system is also more likely to be affected by noise. The algorithm will nevertheless converge because of the tendency of gradient descent to find a minimum. It can be seen in Figure 4.5 that the settling time improves as the initialization weights are made higher. The random initialization test-case simulates the situation where the weights are initialized to values uniformly distributed between 0 and 1. In this case, the settling time lies between the low- and high-initialization cases. The settling-time behaviour of the mid- and high-initialization cases looks similar because, even at mid-conductance values, the change in memristive conductance is much higher than when initialized to the lowest-conductance state (for the BCM memristor model).

Figure 4.6: Weight updates versus iterations when initialized close to 0.

Figures 4.6 and 4.8 display a behaviour that can potentially be a cause for concern: the values attained by the weights are not uniformly spread between 0 and 1 at convergence. This can result in a situation similar to the low-initialization case described earlier.

Figure 4.7: Weight updates versus iterations when initialized to Gratio/2.

From a circuit-design perspective, it is better to have a higher dynamic range in the conductances of the devices. The likelihood of this increases if the weights are spread across the entire range [0, 1]. Not having a good spread might result in poor performance because of the dynamic-range limitations of the peripheral circuitry. This will worsen the settling time and may sometimes affect the settling error.

Figure 4.8: Weight updates versus iterations when initialized close to 1.

Interestingly, the random initialization case resulted in a good spread of weights over the entire range between 0 and 1 at convergence. Although interesting, such an initialization scheme might not be feasible in practical applications. Another observation from Figures 4.6–4.9 is that, as the iterations increase, smaller weights tend to remain unchanged while there is a significant amount of activity in the larger weights. This is because pulses applied to devices in low-conductance states trigger only small conductance changes, and vice versa. This is seen as settling error in Figure 4.5. One good method to address this is to gradually lower the learning rate; as illustrated in Figure 4.3, such a procedure results in lower settling error. It can also be seen that when the weights are initialized close to 1, a large number of the devices tend to saturate. This is compensated by the negative- or positive-weight counterpart of the array. Saturation of the weights does not appear to be a significant problem in our experiments.

4.8.3 Effect of device parameters on performance

The effect of the device parameters was studied by initializing the memristor devices to Gratio/2. Each data point was generated by averaging over 40 runs with suitably constrained random datasets and device properties. The effects of Gratio and α on settling error and settling time are shown in Figures 4.10, 4.11, and 4.12. Figure 4.10 shows the change in classification error as the algorithm converges; Figures 4.11 and 4.12 plot the effect of varying Gratio and α on the settling behaviour.

Figure 4.9: Weight updates versus iterations when initialized to uniformly distributed random values between 0 and 1.

Gratio affects the settling time, the settling error, and the design specifications for the peripheral circuitry. As Gratio increases, the algorithm converges faster but displays higher settling error. This can be seen in Figure 4.11, where the coloured bands straddle the maximum and minimum settling error for the corresponding values of Gratio and α. As Gratio and α increase, the bands get wider (Figure 4.11). This effect is caused by the dynamics of memristive behaviour: a simple analysis shows that larger Gratio and α values effect a greater change in device conductance. On the other hand, from a design point of view, it is beneficial to have a greater Gratio because that results in a greater dynamic range in the output current (the difference between the maximum- and minimum-state currents is larger). This makes the design of signal-processing blocks such as the threshold function simpler and less susceptible to variability and noise. This trade-off between larger Gratio or α values and learning performance is a constant feature in our simulations. It should also be noted that the mean error of the algorithm increases only modestly with Gratio and α. The noisy behaviour resulting from larger Gratio values can be compensated by averaging over multiple predictors or by lowering α.

Figure 4.10: Evolution of classification error with iterations [68]

Figure 4.11: Classification error versus Gratio. The bands straddle the maximum and minimum classification error because of settling error at convergence [68]

There are several studies on how the learning rate α affects the performance of a floating-point gradient-descent implementation, and most of those results are directly applicable here too. Our experiments show that a larger α results in faster convergence (Figure 4.12). However, larger values of α also make settling noisier (Figure 4.11). A simple technique to improve settling time and error is to start with a large α and progressively shrink it; recall that this effect was also highlighted earlier when discussing the effect of initialization. It should be noted that even though more iterations are required when α is reduced, the time taken by each iteration decreases too (because α determines the duration of the training pulse). Therefore, the total training time does not necessarily increase for smaller α in an actual hardware implementation. The optimal choice of α is difficult to determine analytically; it is easier to determine empirically, taking into account factors such as the distribution of weights, the training time, Gratio, and the dataset.

Figure 4.12: Settling time (Niters) versus Gratio for different values of α (Alpha). [68]
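The start-large-then-shrink schedule for α can be sketched as a simple geometric decay; the decay factor and interval below are illustrative choices, not values used in the thesis. Since α also sets the training-pulse duration (α · t0), shrinking it shortens each iteration as well.

```python
def decayed_alpha(alpha0, iteration, decay=0.5, interval=5000, alpha_min=1e-3):
    """Halve the learning rate every `interval` iterations, with a floor
    at alpha_min so updates never vanish entirely."""
    return max(alpha0 * decay ** (iteration // interval), alpha_min)

print(decayed_alpha(0.1, 0))        # 0.1
print(decayed_alpha(0.1, 12000))    # 0.025 (two halvings)
print(decayed_alpha(0.1, 10**6))    # clamped at 0.001
```

A schedule keyed to the measured settling error (shrink α only when the averaged error stops improving) would track convergence more closely, at the cost of extra monitoring circuitry.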

4.8.4 Effect of variability on performance

The effect of variability was also studied by initializing the memristor devices to Gratio/2. Figures 4.13 and 4.14 show the effect of device variability on settling error and settling time. The log-normal distribution results in a wide variation in the device properties (Figure 4.15). Higher device variability results in a higher deviation of the input and output currents from the expected mean value, which can potentially result in saturation of the peripheral circuitry. However, the gradient-descent rule partially compensates for this by adjusting the unsaturated components of the system. The compensation mechanism does not appear to affect the mean convergence time significantly, but the variance in settling time increases with increasing device-parameter variability. The effect of device variability can be mitigated by increasing the dynamic range of the peripheral circuits. Arrays with larger Gratio values find it easier to compensate for variability because there is more room to adjust. This can be seen in the smaller spread of settling time (Niters) in Figure 4.14. The effect of variability can also be reduced by training several columns with the same dataset and different initialization conditions; a weighted average of all the predictions can then be used to make the final decision. This technique is similar to the boosting schemes that have been successfully used in several machine learning algorithms. An important requirement for boosting is to have uncorrelated predictors [103]. This is satisfied by the implementation described in this section because the columns are trained independently of each other and are also initialized differently.

Figure 4.13: Classification error (Error) versus σ (sigma).

Figure 4.14: Number of training iterations (Niters) versus σ (Sigma). Training time increases as variability increases. [68] © IEEE

4.8.5 Comparison against floating point implementation

Figure 4.16 shows how the performance of the crossbar array trained by the USD algorithm compares against a floating-point implementation of logistic regression for various values of pe. It can be seen that smaller values of α improve the mean error rate, an effect seen in many of the experiments described earlier. It arises because the devices are updated in every iteration even when they are very close to their ideal states (unless the error is precisely zero, which is practically impossible). Using smaller values of α therefore prevents the final state from changing radically.

Figure 4.15: Effect of σ on the spread of Gon (y-axis) and Goff (x-axis) values, shown using 1000 memristive device samples with mean Goff = 0.003 S and mean Gratio = 100. Note that both the x- and y-axes are in log scale. [68] © IEEE

4.8.6 Performance on the MNIST database

To test the USD rule on a more realistic and complex problem, it was used to train a (784 × 2) rows × 10 columns crossbar array on the complete MNIST handwritten digits dataset [79]. This experiment used soft-max (multinomial) logistic regression [80] as the classification algorithm. Implementing this algorithm on a crossbar array is straightforward and only requires changes in the peripheral circuitry. The crossbar array requires as many column outputs as there are classes; in this example, a crossbar array with ten column outputs was chosen. The output nodes are connected to a winner-take-all circuit that selects the column with the highest output. Other aspects of the design are the same as in the linear classifier case. While the mathematical form of the gradient-descent rule applicable here differs from that of a linear classifier, the final weight update equation is very similar. Therefore, the array can be trained by the USD algorithm using a 4-phase training procedure, effectively the same as that discussed earlier.
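The soft-max arrangement can be modelled in software as follows. The toy three-class dataset below is a hypothetical stand-in for MNIST; one weight column per class is trained with a USD-style sign update, and the winner-take-all readout is simply an argmax over the column outputs.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical 3-class toy problem standing in for MNIST.
n_feat, n_class = 6, 3
W_true = rng.normal(size=(n_feat, n_class))
X = rng.normal(size=(400, n_feat))
y = np.argmax(X @ W_true, axis=1)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# One crossbar column per class, trained with USD-style sign updates.
W = np.zeros((n_feat, n_class))
alpha = 0.01
for _ in range(5000):
    i = rng.integers(len(X))
    p = softmax(X[i] @ W)            # column outputs after soft-max
    t = np.eye(n_class)[y[i]]        # one-hot target
    grad = np.outer(X[i], p - t)     # soft-max regression gradient
    W -= alpha * np.sign(grad)       # fixed-magnitude (USD) step

# Winner-take-all readout: pick the column with the largest output.
pred = np.argmax(X @ W, axis=1)
accuracy = (pred == y).mean()
```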

We used a wide range of devices, with Gratio ranging between 32 and 1000 and a σ of 0.2. As seen in Figure 4.15, a σ of 0.2 results in very large device variations. A floating-point software implementation (soft-max regression using steepest-gradient descent) gives a test-set error of about 5%.

Figure 4.16: Performance of the stochastic USD implementation in comparison to floating-point logistic regression for various values of pe. [68] © IEEE

An interesting observation made in these experiments was that the validation-set error and test-set error hovered around 7.5% and 10%, respectively, for all the devices. One possible explanation is that the number of input features was larger in the MNIST training set, so the USD algorithm had much more flexibility to correct for the bad devices. This observation is useful because it shows that as the training feature size grows, the algorithm may find it easier to correct the inaccuracies arising from device variability, non-linearities, and algorithmic approximations.

4.9 USD for other algorithms

In this section, we discuss how the USD approximation may be applied to a variety of tasks and networks, such as matrix inversion, Principal Component Analysis (PCA), and Restricted Boltzmann Machines (RBMs). We also discuss how the crossbar array can be set up for these tasks.

4.9.1 Matrix Inversion

One can use the gradient-descent algorithm to compute the inverse of a matrix. Consider that the matrix to be inverted is A. While the procedure outlined here is applicable to a matrix of any shape, the method is illustrated using a square matrix of rank N. For rectangular matrices, the procedure results in left- or right-inverses depending on how the training is set up. The matrix A can be written as [a1, a2, ..., aN]^T, where ai is the i-th row of A. Clearly, if an inverse exists, then A^-1 is also N × N. The training uses an N × N crossbar array. The motivation behind this procedure is the observation that the i-th column of the final array defines a hyperplane that outputs 1 when multiplied by the i-th row of A and 0 otherwise. The training procedure is as follows:

1. Apply the i-th row of the matrix A to the crossbar array. This results in an output Y.

2. The goal is to minimize the output error using the least-squares minimization procedure.

3. Compute the error in output Y, using the target output H = [0, 0, ..., 1, ..., 0], where H[i] = 1 and all other entries are 0.

4. Compute the gradient of the squared-error function.

5. Apply the USD approximation.

6. Repeat the procedure for all the rows of A until convergence.

The final weight matrix of the array is A^-1. Interestingly, the resulting update equation is very similar to Equation 4.5. In this training procedure, the matrix inversion problem is simply converted into an optimization problem. For well-behaved matrices, the local optimum also corresponds to the exact inverse. For matrices with rank less than the number of columns or rows, the method finds one of the best possible solutions. Using a large learning rate at the start of the training procedure and progressively shrinking it helps compute better solutions to the problem.
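A software sketch of steps 1 to 6 is shown below using the exact least-squares gradient; in the crossbar implementation, the USD approximation would replace this gradient with fixed-magnitude steps in the same directions. The test matrix and hyper-parameters are hypothetical, and the learning rate is shrunk progressively as suggested above.

```python
import numpy as np

rng = np.random.default_rng(3)

N = 4
A = rng.normal(size=(N, N)) + N * np.eye(N)  # well-conditioned test matrix
I = np.eye(N)

W = np.zeros((N, N))          # crossbar state; converges towards inv(A)
alpha = 0.02
for epoch in range(2000):
    for i in range(N):
        yout = A[i] @ W                   # 1. drive the i-th row of A
        e = yout - I[i]                   # 3. error against the i-th unit vector
        W -= alpha * np.outer(A[i], e)    # 4. squared-error gradient step
    alpha *= 0.999                        # progressively shrink the learning rate

max_err = np.abs(W - np.linalg.inv(A)).max()
```

Because a zero-error solution exists for an invertible A, the cyclic row-by-row gradient steps contract the error towards the exact inverse.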

4.9.2 Auto-encoders

Auto-encoders are neural networks that learn an internal representation of the training samples in such a manner that the input can be reconstructed from this internal representation. The number of hidden layers is not fixed; however, at training time, the output layer generally has the same number of nodes as the input. Auto-encoders share some similarities with RBMs and are sometimes used instead of RBMs in deep learning networks because they are simpler to train [81]. It can be shown that a three-layered auto-encoder network with a hidden layer of k linear nodes, trained using the least-squares error minimization objective, learns to project the data onto the first k principal components of the training data [82]. This property is contingent on the absence of non-linearities in the hidden layer. This is good news if the goal is to compute the principal components of some data, because the absence of non-linearities significantly reduces the cost of the hardware. Using non-linear hidden nodes adds the ability to capture multi-modal aspects of the input distribution. The training procedure is simply the back-propagation algorithm, and the gradient back-propagation is further simplified when the hidden layers are linear. For example, in order to perform PCA, two layers of crossbar arrays are required: the first with N rows and K columns, and the second with K rows and N columns. When K < N and the hidden layers are linear, the crossbar array is set up for Singular Value Decomposition (SVD). When K > N, the array learns to capture useful internal representations of the training data that can be used in building deep networks.
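The PCA property of a linear auto-encoder is easy to check numerically. The sketch below trains a two-layer linear network (an N × K encoder followed by a K × N decoder) with least-squares back-propagation on hypothetical low-rank data, then compares its reconstruction error against the optimal rank-K (PCA) reconstruction; all sizes and learning rates are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic data lying close to a 2-D subspace (hypothetical example).
n, N, K = 500, 6, 2
Z = rng.normal(size=(n, K)) * np.array([2.0, 1.0])   # latent factors
M = rng.normal(scale=0.5, size=(K, N))               # mixing matrix
X = Z @ M + 0.05 * rng.normal(size=(n, N))
X -= X.mean(axis=0)                                  # centre the data

# Two crossbar layers: encoder (N x K) and decoder (K x N); hidden layer is linear.
W1 = 0.1 * rng.normal(size=(N, K))
W2 = 0.1 * rng.normal(size=(K, N))
alpha = 0.01
for _ in range(5000):
    H = X @ W1                    # linear hidden layer (encoder output)
    E = H @ W2 - X                # reconstruction error
    gW2 = H.T @ E / n             # back-propagated least-squares gradients
    gW1 = X.T @ (E @ W2.T) / n
    W2 -= alpha * gW2
    W1 -= alpha * gW1
ae_err = np.mean((X @ W1 @ W2 - X) ** 2)

# Optimal rank-K linear reconstruction (PCA, via SVD) for comparison.
U, S, Vt = np.linalg.svd(X, full_matrices=False)
pca_err = np.mean((X - X @ Vt[:K].T @ Vt[:K]) ** 2)
```

Since the trained network applies a rank-K linear map, its error can never beat the PCA optimum; after training it should approach it.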

4.9.3 Restricted Boltzmann Machines

Boltzmann machines are the stochastic counterpart of Hopfield networks. Restricted Boltzmann machines are Boltzmann machines with some connectivity restrictions. A brief overview of these networks is provided in the Appendix. One of the most popular methods to train an RBM is the contrastive divergence procedure [83]. The contrastive divergence algorithm proceeds in two phases. For a two-layer network with N input nodes and K hidden-layer nodes, it proceeds as follows:

1. Sampling phase - The training data X = [x1, x2, ..., xN] is driven into the array and the outputs from the hidden layer are sampled: H = [h1, h2, ..., hK].

2. Free-running phase - H is driven back from the hidden layer towards the input layer and the free-running value X′ = [x′1, x′2, ..., x′N] is computed. Then X′ is used to generate the free-running hidden-layer output H′ = [h′1, h′2, ..., h′K].

Updating the weights by contrastive divergence can be expressed by the following equa- tions:

wij ← wij + α(hjxi − h′jx′i)   (4.9)

where wij is the weight of the connection between the i-th input xi and the j-th hidden term, and hj is the sampled output from the j-th hidden term in the sampling phase. x′i and h′j are the corresponding values in the free-running phase. The best aspect of this training rule is that all the terms except α are already 0 or 1. The training-architecture simplification garnered by such an approximation is essentially the same hardware benefit that the USD approximation provides. The added cost in training a memristive crossbar array using contrastive divergence is the need to sample the crossbar outputs. This requires building blocks such as sampling and noise-generator circuits; several memristor-based and CMOS noise-generator circuits are available in the literature. The other aspect that needs to be studied is the effect of different magnitudes of α on different weights. If we use the 4-phase scheme described earlier in this chapter, different weights of the array would be updated by different magnitudes, albeit in the right direction. The effect of this behaviour needs to be analysed thoroughly. A good metric to demonstrate functionality is to study the reconstruction ability of the network, as done in [84].
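A minimal software model of this procedure is given below for a tiny RBM without bias terms, trained by CD-1 on two hypothetical binary prototypes. In a crossbar implementation, the same outer-product update would be applied through the sampled 0/1 signals; here x, h, and x′ are 0/1 samples, while h′ keeps its probabilities, a common CD-1 choice.

```python
import numpy as np

rng = np.random.default_rng(5)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30.0, 30.0)))

N, K = 8, 4                          # visible and hidden units
W = 0.1 * rng.normal(size=(N, K))    # crossbar weights (biases omitted)
alpha = 0.05

# Two hypothetical binary prototypes to memorize.
patterns = np.array([[1, 1, 1, 1, 0, 0, 0, 0],
                     [0, 0, 0, 0, 1, 1, 1, 1]], dtype=float)

for _ in range(2000):
    x = patterns[rng.integers(2)]
    # Sampling phase: sample the hidden units given the data.
    h = (sigmoid(x @ W) > rng.random(K)).astype(float)
    # Free-running phase: reconstruct the visible layer, then the hidden layer.
    x1 = (sigmoid(W @ h) > rng.random(N)).astype(float)
    h1 = sigmoid(x1 @ W)
    # Contrastive-divergence update (Equation 4.9).
    W += alpha * (np.outer(x, h) - np.outer(x1, h1))

# Reconstruction check, as suggested in [84]: average error over prototypes.
recon_err = 0.0
for x in patterns:
    h = (sigmoid(x @ W) > 0.5).astype(float)
    recon_err += np.mean((sigmoid(W @ h) - x) ** 2) / len(patterns)
```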

4.10 Concluding remarks

In this chapter, we presented the USD approximation, an approximation to the standard gradient-descent algorithm. It can be applied to several gradient-descent-based algorithms, such as linear/polynomial classification, soft-max regression, the back-propagation algorithm for MLPs, contrastive divergence for RBMs, matrix inversion, Principal Component Analysis (PCA), and auto-encoders. Training a crossbar array using the standard gradient-descent technique requires fine control over the internal states of all the memristive devices. This normally requires complex feedback schemes and sequential training of the individual devices, which is unattractive for larger arrays. It also increases the training time for large datasets with high-dimensional input features. The USD approximation solves these design problems, making it feasible to use memristive crossbar arrays for applications where power and speed are important considerations. The USD gradient-descent rule compensates for errors arising from inaccurate weight changes, non-linearities, variability, etc. These benefits accrue from the self-correcting nature of the gradient-descent rule. Additionally, using the USD rule on a crossbar array permits the training architecture to use a simple 4-phase scheme to train all the devices in the array simultaneously. Such schemes are hardware-efficient and reduce the complexity of generating the array-training signals. They also simplify fabrication of crossbar-array-based machine learning systems because no CMOS devices need to be embedded within the crossbar array. In order to understand the behaviour of the USD rule, we used it to train a crossbar array as a classifier using the logistic regression and soft-max algorithms. These algorithms were chosen because they share several algorithmic and structural similarities with more complex neural networks.
Our experiments used artificially constructed datasets and the MNIST database. The results are promising: it might even be argued that training memristive crossbar arrays using techniques based on the USD gradient-descent approximation could be more attractive than STDP-, LUT-, feedback-, or PWP-based mechanisms. Even with a simple linear classifier, the performance obtained using the proposed USD approximation is comparable to that of spiking neural networks [53][17]. Another interesting observation is that these algorithms use fewer devices than spiking neural network implementations [17]. The USD approximation also enables training architectures that are faster and simpler than LUT- or feedback-based schemes. Finally, our experiments provide an understanding of how crossbar array device and training parameters such as weight initialization, α, Gratio, and σ affect the learning capability of the crossbar array. These insights can also be useful when using memristive crossbar arrays for non-learning applications such as memories. The experimental results presented in this chapter indicate that using the USD approximation to implement machine learning algorithms in memristive crossbar arrays may result in practical, low-power, and fast designs. This is encouraging and motivates further analysis to test the performance of the USD training rule in larger and deeper crossbar array networks based on auto-encoders, RBMs, CNNs, etc.

Chapter 5

Conclusion

The purpose of this work was to explore techniques to improve computational efficiency for statistical inference and machine learning algorithms. This is of interest because these algorithms are increasingly being deployed in consumer, industrial, and research applications. With the increasing proliferation of smart devices, in the form of portable gadgets and sensor networks, the need to compute more for less power will only grow. Traditional von Neumann architectures have a memory-access bottleneck that puts an artificial limit on their computational capability. Eliminating this bottleneck can lead to great performance gains, as has been demonstrated by some of the neuromorphic and other systems built by various research groups. We studied some of the most successful systems designed for neuromorphic simulations and neural networks, such as pulse-stream computation based networks, TrueNorth, SpiNNaker, and Neurogrid. Several aspects of these systems were analysed, including design challenges, architectural innovations, strengths, and weaknesses. It was noted that while all of them address similar problems, such as synapse design (lack of suitable analogue memory), communication bottlenecks (when multi-bit signals are transmitted during gradient descent), and noise margins in analogue circuitry, they do so with radically different approaches. It was also noted that the bulk of the designs are based on spiking neural networks. There are relatively fewer systems dedicated to neural networks such as RBMs and CNNs. The reasons for this were identified as the increased complexity of the training algorithm in comparison to STDP-based learning and the resultant complexity of the hardware infrastructure. We then noted that some of the challenges in building large neural networks may be surmounted by using memristive crossbar arrays as synapses. Memristive crossbar arrays are highly dense, provide analogue memory storage, and obey Ohm's law for small potential differences. Memristive behaviour also resembles synaptic behaviour. Used intelligently, these devices can yield lower power consumption, better synaptic modelling, and fewer data-transfer operations. These properties have resulted in a proliferation of STDP-based learning systems that use memristors. However, there are relatively fewer designs that use memristive crossbar arrays for neural networks such as RBMs and CNNs. Considering that such neural networks are used in a wide variety of practical applications and are known to work well, we identified this topic as a useful area of research. The most popular training method for neural networks is the back-propagation algorithm, which uses the gradient-descent method to iteratively adjust the network parameters. This approach works well in software because of the high precision offered by floating-point computation. However, it is intractable when building hardware neural networks because of the lack of precision offered by analogue memories (other than memristors) as synapses, while digital synapses run into performance bottlenecks because of memory transfers. Using memristors as analogue weights and the crossbar to perform fast multiplication is potentially an excellent method to address these issues. The problem, however, is that memristors exhibit extreme variability and precise device models are lacking. One approach to solving this problem is to design algorithms that can correct for device and model imprecisions. We propose the Unregulated Step Descent (USD) algorithm to do this. As discussed in this thesis, the USD algorithm is the result of making approximations to the steepest-gradient-descent algorithm that make circuit design easier.
We studied the effect of these approximations on the learning capabilities of neural networks implemented on memristive crossbar arrays. Our analysis included the effect of device variability and other training parameters relevant to circuit design. The tests were conducted using several artificial and real-world test cases. An extensive study of the USD algorithm was done for simple linear classifiers and regression models, and some experiments were also conducted with neural networks such as RBMs and MLPs. This provided several useful insights on how memristive crossbar arrays may be used for machine learning tasks. However, the analysis was done for relatively small networks (hundreds of neurons). It will be worthwhile to test whether the learnings from these experiments scale to larger and deeper networks (thousands or millions of neurons). These simulations are expensive and will require the development of better simulation techniques. Currently, the simulations are run on custom-designed simulators that use Nvidia GPUs and Python scripts. Additionally, experiments on real memristors should be conducted in collaboration with fabrication labs and other groups to validate the simulation results. Finally, methods to implement other machine learning algorithms, such as Bayesian inference and random forests, need to be explored.

Bibliography

[1] Ifrah, Georges. ”The Universal History of Numbers: From Prehistory to the Invention of the Computer, translated by David Vellos, EF Harding, Sophie Wood and Ian Monk.” (2000).

[2] Turing, Alan M. ”Computing machinery and intelligence.” Mind (1950): 433-460.

[3] D B Strukov, et al. ”The missing memristor found.” Nature 453.7191 (2008)

[4] Hebb, Donald Olding. The organization of behavior: A neuropsychological theory. Psychology Press, 2005.

[5] Widrow, Bernard, and Michael A. Lehr. ”30 years of adaptive neural networks: perceptron, madaline, and backpropagation.” Proceedings of the IEEE 78.9 (1990): 1415-1442.

[6] Hopfield, John J. ”Neurons with graded response have collective computational properties like those of two-state neurons.” Proceedings of the National Academy of Sciences 81.10 (1984): 3088-3092.

[7] Misra, Janardan, and Indranil Saha. ”Artificial neural networks in hardware: A survey of two decades of progress.” Neurocomputing 74.1 (2010): 239-255.

[8] Hamilton, Alister, et al. ”Integrated pulse stream neural networks: results, issues, and pointers.” Neural Networks, IEEE Transactions on 3.3 (1992): 385-393.

[9] Cassidy, Andrew S., et al. ”Cognitive Computing Building Block: A Versatile and Efficient Digital Neuron Model for Neurosynaptic Cores”, IJCNN 2013. IEEE.

[10] Furber, Steve B., et al. ”Overview of the SpiNNaker system architecture.” Computers, IEEE Transactions on 62.12 (2013): 2454-2467


[11] Benjamin, Ben Varkey, et al. ”Neurogrid: A mixed-analog-digital multichip system for large-scale neural simulations.” Proceedings of the IEEE 102.5 (2014): 699-716.

[12] Liu, Shih-Chii, ed. Analog VLSI: circuits and principles. MIT press, 2002.

[13] Murray, Alan F., and Anthony VW Smith. ”Asynchronous VLSI neural networks using pulse-stream arithmetic.” Solid-State Circuits, IEEE Journal of 23.3 (1988): 688-697.

[14] Merolla, Paul A., et al. ”A million spiking-neuron integrated circuit with a scalable communication network and interface.” Science 345.6197 (2014): 668-673.

[15] Merolla, Paul, et al. ”A digital neurosynaptic core using embedded crossbar memory with 45pJ per spike in 45nm.” Custom Integrated Circuits Conference (CICC), 2011 IEEE. IEEE, 2011.

[16] Imam, Nabil, and Rajit Manohar. ”Address-event communication using token-ring mutual exclusion.” Asynchronous Circuits and Systems (ASYNC), 2011 17th IEEE International Symposium on. IEEE, 2011.

[17] Diehl, Peter U., and Matthew Cook. ”Unsupervised Learning of Digit Recognition Using Spike-Timing-Dependent Plasticity.” Frontiers in computational neuroscience 9 (2015).

[18] Beyeler, Michael, Nikil D. Dutt, and Jeffrey L. Krichmar. ”Categorization and decision-making in a neurobiologically plausible spiking network using a STDP-like learning rule.” Neural Networks 48 (2013): 109-124.

[19] Furber, Steve, and Steve Temple. ”Neural systems engineering.” Journal of the Royal Society interface 4.13 (2007): 193-206.

[20] Hashmi, Atif, et al. ”A case for neuromorphic ISAs.” ACM SIGPLAN Notices 47.4 (2012): 145-158.

[21] Davison, Andrew P., et al. ”PyNN: a common interface for neuronal network simulators.” Frontiers in neuroinformatics 2 (2008).

[22] Schemmel, Johannes, Johannes Fieres, and Karlheinz Meier. ”Wafer-scale integration of analog neural networks.” Neural Networks, 2008. IJCNN 2008. (IEEE World Congress on Computational Intelligence). IEEE International Joint Conference on. IEEE, 2008.

[23] Zhang, Jialu, et al. ”A biomimetic fabricated carbon nanotube synapse for prosthetic applications.” 2011 IEEE/NIH Life Science Systems and Applications Workshop (LiSSA). 2011.

[24] Rachmuth, Guy, et al. ”A biophysically-based neuromorphic model of spike rate- and timing-dependent plasticity.” Proceedings of the National Academy of Sciences 108.49 (2011): E1266-E1274.

[25] Sharad, Mrigank, et al. ”Proposal for neuromorphic hardware using spin devices.” arXiv preprint arXiv:1206.3227 (2012).

[26] Qiao, Ning, et al. ”A reconfigurable on-line learning spiking neuromorphic processor comprising 256 neurons and 128K synapses.” Frontiers in neuroscience 9 (2015).

[27] Maass, W. (1995b). On the computational complexity of networks of spiking neurons. In Advances in neural information processing systems, Vol. 7 (183-190). Cambridge: MIT Press.

[28] Chua, Leon O. ”Memristor-the missing circuit element.” Circuit Theory, IEEE Transactions on 18.5 (1971): 507-519.

[29] Chua, Leon O., and Sung Mo Kang. ”Memristive devices and systems.” Proceedings of the IEEE 64.2 (1976): 209-223.

[30] Lacaita, A. L. ”Phase change memories: State-of-the-art, challenges and perspectives.” Solid-State Electronics 50.1 (2006): 24-31.

[31] Chua, Leon, Valery Sbitnev, and Hyongsuk Kim. ”Hodgkin-Huxley axon is made of memristors.” International Journal of Bifurcation and Chaos 22.03 (2012).

[32] Yi, Wei, et al. ”Feedback write scheme for memristive switching devices.” Applied Physics A 102.4 (2011): 973-982.

[33] Manem, Harika, and Garrett S. Rose. ”A read-monitored write circuit for 1T1M multi-level memristor memories.” Circuits and Systems (ISCAS), 2011 IEEE International Symposium on. IEEE, 2011.

[34] R. Berdan, T. Prodromakis, F. P. Diaz, E. Vasilaki, A. Khiat, I. Salaoru and C. Toumazou, Temporal Processing with Volatile Memristors, IEEE International Symposium on Circuits and Systems, May 2013.

[35] Guan, Ximeng, Shimeng Yu, and H-S. P. Wong. ”On the switching parameter variation of metal-oxide RRAM, Part I: Physical modeling and simulation methodology.” Electron Devices, IEEE Transactions on 59.4 (2012): 1172-1182.

[36] Lehtonen, Eero. ”Memristive Computing.” University of Turku (2013): 1-157.

[37] Zamarreño-Ramos, Carlos, et al. ”On spike-timing-dependent-plasticity, memristive devices, and building a self-learning visual cortex.” Frontiers in neuroscience 5 (2011).

[38] Joglekar, Yogesh N., and Stephen J. Wolf. ”The elusive memristor: properties of basic electrical circuits.” European Journal of Physics 30.4 (2009): 661.

[39] Biolek, Zdenek, Dalibor Biolek, and Viera Biolkova. ”SPICE model of memristor with nonlinear dopant drift.” Radioengineering 18.2 (2009): 210-214.

[40] Prodromakis, Themistoklis, et al. ”A versatile memristor model with nonlinear dopant kinetics.” Electron Devices, IEEE Transactions on 58.9 (2011): 3099-3105.

[41] Corinto, Fernando, and Alon Ascoli. ”A boundary condition-based approach to the modeling of memristor nanostructures.” Circuits and Systems I: Regular Papers, IEEE Transactions on 59.11 (2012): 2713-2726.

[42] Kvatinsky, Shahar, et al. ”TEAM: ThrEshold adaptive memristor model.” Circuits and Systems I: Regular Papers, IEEE Transactions on 60.1 (2013): 211-221.

[43] Heath, James R., et al. ”A defect-tolerant computer architecture: Opportunities for nanotechnology.” Science 280.5370 (1998): 1716-1721.

[44] Strukov, Dmitri B., and Konstantin K. Likharev. ”CMOL FPGA: a reconfigurable architecture for hybrid digital circuits with two-terminal nanodevices.” Nanotechnology 16.6 (2005): 888.

[45] Lee, Hyung Dong, et al. ”Integration of 4F2 selector-less crossbar array 2Mb ReRAM based on transition metal oxides for high density memory applications.” VLSI Technology (VLSIT), 2012 Symposium on. IEEE, 2012.

[46] J. Joshua Yang, Dmitri B. Strukov, and Duncan R. Stewart. ”Memristive devices for computing.” Nature nanotechnology 8.1 (2013): 13-24.

[47] E Lehtonen, J. H. Poikonen, and Mika Laiho. ”Two memristors suffice to compute all Boolean functions.” Electronics letters 46.3 (2010): 230.

[48] Kim, Hyongsuk, et al. ”Memristor-based multilevel memory.” Cellular Nanoscale Networks and Their Applications (CNNA), 2010 12th International Workshop on. IEEE, 2010.

[49] Laiho, Mika, and Eero Lehtonen. ”Arithmetic operations within memristor-based analog memory.” Cellular Nanoscale Networks and Their Applications (CNNA), 2010 12th International Workshop on. IEEE, 2010.

[50] Yakopcic, Chris, and Tarek M. Taha. ”Energy efficient perceptron pattern recognition using segmented memristor crossbar arrays.” Neural Networks (IJCNN), The 2013 International Joint Conference on. IEEE, 2013.

[51] Serrano-Gotarredona, Teresa, et al. ”STDP and STDP variations with memristors for spiking neuromorphic learning systems.” Frontiers in neuroscience 7 (2013).

[52] C. Zamarreno-Ramos et al. ”On spike-timing-dependent-plasticity, memristive devices, and building a self-learning visual cortex”, Front. Neurosci., vol. 5, no. 26, pp. 1-36, 2011.

[53] Querlioz, Damien, Olivier Bichler, and Christian Gamrat. ”Simulation of a memristor-based spiking neural network immune to device variations.” Neural Networks (IJCNN), The 2011 International Joint Conference on. IEEE, 2011.

[54] Snider, Greg S. ”Spike-timing-dependent learning in memristive nanodevices.” Nanoscale Architectures, 2008. NANOARCH 2008. IEEE International Symposium on. IEEE, 2008.

[55] Afifi, A., A. Ayatollahi, and F. Raissi. ”Implementation of biologically plausible spiking neural network models on the memristor crossbar-based CMOS/nano circuits.” Circuit Theory and Design, 2009. ECCTD 2009. European Conference on. IEEE, 2009.

[56] Y. LeCun, L. Bottou, Y. Bengio and P. Haffner: Gradient-Based Learning Applied to Document Recognition, Proceedings of the IEEE, 86(11):2278-2324, November 1998.

[57] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. ”Imagenet classification with deep convolutional neural networks.” Advances in neural information processing systems. 2012.

[58] Hasan, Raqibul, and Tarek M. Taha. ”Enabling back propagation training of memristor crossbar neuromorphic processors.” Neural Networks (IJCNN), 2014 International Joint Conference on. IEEE, 2014.

[59] Alibart, Fabien, Elham Zamanidoost, and Dmitri B. Strukov. ”Pattern classification by memristive crossbar circuits using ex situ and in situ training.” Nature communications 4 (2013).

[60] Soudry, Daniel, et al. ”Memristor-based multilayer neural networks with online gra- dient descent training.” (2015).

[61] Kataeva, Irina, et al. Efficient Training Algorithms for Neural Networks Based on Memristive Crossbar Circuits, IJCNN (2015)

[62] Prezioso, Mirko, et al. ”Training and operation of an integrated neuromorphic network based on metal-oxide memristors.” Nature 521.7550 (2015): 61-64.

[63] Hu, Miao, et al. ”Memristor crossbar-based neuromorphic computing system: A case study.” Neural Networks and Learning Systems, IEEE Transactions on 25.10 (2014): 1864-1878.

[64] Rothganger, Fred, et al. ”Training neural hardware with noisy components.” Neural Networks (IJCNN), 2015 International Joint Conference on. IEEE, 2015.

[65] Muthuswamy, Bharathwaj. ”Implementing memristor based chaotic circuits.” International Journal of Bifurcation and Chaos 20.05 (2010): 1335-1350.

[66] Itoh, Makoto, and Leon O. Chua. ”Memristor oscillators.” International Journal of Bifurcation and Chaos 18.11 (2008): 3183-3206.

[67] Pershin, Yuriy V., and Massimiliano Di Ventra. ”Practical approach to programmable analog circuits with memristors.” Circuits and Systems I: Regular Papers, IEEE Transactions on 57.8 (2010): 1857-1864.

[68] Nair, Manu V., and Dudek, Piotr. ”Gradient-descent based learning in memristive crossbar arrays.” Neural Networks (IJCNN), 2015 International Joint Conference on. IEEE, 2015.

[69] Medeiros-Ribeiro G et al. Lognormal switching times for titanium dioxide bipolar memristors: origin and resolution, Nanotechnology. 2011 Mar 4

[70] G. Snider, ”Self-organized computation with unreliable, memristive nanodevices,” Nanotechnology 18, 10 August 2007

[71] Linares-Barranco, Bernabé, and Teresa Serrano-Gotarredona. ”Memristance can explain spike-time-dependent-plasticity in neural synapses.” Nature precedings (2009): 1-4.

[72] Rojas, Raúl. Neural networks: a systematic introduction. Springer Science & Business Media, 1996.

[73] Bottou, Léon. ”Stochastic gradient descent tricks.” Neural Networks: Tricks of the Trade. Springer Berlin Heidelberg, 2012. 421-436.

[74] Leen, Todd K. and Moody, John E., Stochastic Manhattan learning: Time-evolution operator for the ensemble dynamics, Physical Review E, Vol 56, Num 1, July 1997

[75] Cauwenberghs, Gert. ”A fast stochastic error-descent algorithm for supervised learning and optimization.” Advances in neural information processing systems (1993): 244-244.

[76] Alspector, Joshua, et al. ”A parallel gradient descent method for learning in analog VLSI neural networks.” Advances in neural information processing systems. 1993.

[77] A Ascoli et al. ”Memristor model comparison.” Circuits and Systems Magazine, IEEE 13.2 (2013): 89-105

[78] Abu Sebastian, Manuel Le Gallo, and Daniel Krebs. ”Crystal growth within a phase change memory cell.” Nature communications 5 (2014).

[79] MNIST handwritten digits database. http://yann.lecun.com/exdb/mnist/

[80] So, Y. ”A tutorial on logistic regression.” SAS White Papers (1995).

[81] Bengio, Yoshua. ”Learning deep architectures for AI.” Foundations and trends in Machine Learning 2.1 (2009): 1-127.

[82] H. Bourlard and Y. Kamp, Auto-association by multilayer perceptrons and singular value decomposition, Biological Cybernetics, vol. 59, pp. 291-294, 1988.

[83] Hinton, Geoffrey. ”Training products of experts by minimizing contrastive divergence.” Neural computation 14.8 (2002): 1771-1800.

[84] Fischer, Asja, and Christian Igel. ”Training restricted Boltzmann machines: An introduction.” Pattern Recognition 47.1 (2014): 25-39.

[85] Rosenblatt, Frank. ”The perceptron: a probabilistic model for information storage and organization in the brain.” Psychological review 65.6 (1958): 386.

[86] McCulloch, Warren S., and Walter Pitts. ”A logical calculus of the ideas immanent in nervous activity.” The bulletin of mathematical biophysics 5.4 (1943): 115-133.

[87] Minsky, M., and S. Papert. ”Perceptrons (expanded edition 1988).” (1968).

[88] Hopfield, John J. ”Neural networks and physical systems with emergent collective computational abilities.” Proceedings of the National Academy of Sciences 79.8 (1982): 2554-2558.

[89] Wen, Ue-Pyng, Kuen-Ming Lan, and Hsu-Shih Shih. ”A review of Hopfield neural networks for solving mathematical programming problems.” European Journal of Operational Research 198.3 (2009): 675-687.

[90] Hinton, Geoffrey E., Terrence J. Sejnowski, and David H. Ackley. Boltzmann machines: Constraint satisfaction networks that learn. Pittsburgh, PA: Carnegie-Mellon University, Department of Computer Science, 1984.

[91] Kirkpatrick, Scott. ”Optimization by simulated annealing: Quantitative studies.” Journal of statistical physics 34.5-6 (1984): 975-986.

[92] LeCun, Yann, et al. ”A tutorial on energy-based learning.” Predicting structured data 1 (2006): 0.

[93] Kohonen, Teuvo. ”Self-organized formation of topologically correct feature maps.” Biological cybernetics 43.1 (1982): 59-69.

[94] Maass, Wolfgang. ”Analog computations on networks of spiking neurons.” Proc. of the 7th Italian Workshop on Neural Nets, World Scientific Press. 1995.

[95] Maass, Wolfgang. ”Lower bounds for the computational power of networks of spiking neurons.” Neural computation 8.1 (1996): 1-40.

[96] Maass, Wolfgang. ”Networks of spiking neurons: the third generation of neural network models.” Neural networks 10.9 (1997): 1659-1671.

[97] Song, G., V. Chaudhry, and C. Batur. ”Precision tracking control of shape memory alloy actuators using neural networks and a sliding-mode based robust controller.” Smart materials and structures 12.2 (2003): 223.

[98] Rao, Rajesh PN, and Terrence J. Sejnowski. ”Spike-timing-dependent Hebbian plasticity as temporal difference learning.” Neural computation 13.10 (2001): 2221-2237.

[99] Ebong, Idongesit E., and Pinaki Mazumder. ”CMOS and memristor-based neural network design for position detection.” Proceedings of the IEEE 100.6 (2012): 2050-2060.

[100] Jo, Sung Hyun, Kuk-Hwan Kim, and Wei Lu. ”Programmable resistance switching in nanoscale two-terminal devices.” Nano Letters 9.1 (2008): 496-500.

[101] SH Jo et al. ”Nanoscale memristor device as synapse in neuromorphic systems.” Nano letters 10.4 (2010): 1297-1301.

[102] Alibart, Fabien, Elham Zamanidoost, and Dmitri B. Strukov. ”Pattern classification by memristive crossbar circuits using ex situ and in situ training.” Nature communications 4 (2013).

[103] Freund, Yoav, Robert Schapire, and N. Abe. ”A short introduction to boosting.” Journal-Japanese Society For Artificial Intelligence 14.771-780 (1999): 1612.

[104] Merkel, Cory E., et al. ”Reconfigurable n-level memristor memory design.” Neural Networks (IJCNN), The 2011 International Joint Conference on. IEEE, 2011.

Appendix A

Neural Networks and related algorithms

A.1 Introduction

This appendix summarizes the key algorithms referred to in this thesis.

A.2 Rosenblatt’s perceptron

Rosenblatt's seminal work on the perceptron [85] set the trend for neural network research well into the 90s and remains influential to date. His model was inspired by Hebb's physiological theory of learning [4] and other quantitative theories of the time. In his own words, his model went beyond the existing theories of learning on three counts: parsimony, verifiability, and generalization. Rosenblatt's perceptron is essentially an algorithm for training a McCulloch-Pitts neuron [86] as a linear classifier. The training rule is inspired by the principle of Hebbian learning, often paraphrased as 'neurons that fire together, wire together'. Mathematically, the perceptron is modelled as:

y = g(Σ_{i=1}^{N} w_i x_i) = g(w · x)    (A.1)


Figure A.1: The perceptron

where,

g(θ) = activation function = 1 if θ ≥ 0; 0 if θ < 0

y = output of the perceptron

w = [w_0, w_1, ..., w_N]^T = weight vector

x = [x_0, x_1, ..., x_N]^T = input vector

Note that the inputs and output of Rosenblatt's perceptron were assumed to be two-state, providing an all-or-nothing response. It is also assumed that x_0 = 1. The learning rule for the perceptron, based on the principle of Hebbian learning, is formulated as:

w_i := w_i + η(o − y)x_i   ∀ i ∈ {0, 1, ..., N}    (A.2)

where,

η = learning rate

o = desired output

y = observed output
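Equations A.1 and A.2 translate directly into code. The following is a minimal NumPy sketch; the function name, learning rate, and the AND task used for illustration are assumptions of this example, not taken from the thesis:

```python
import numpy as np

def perceptron_train(X, targets, eta=0.1, epochs=20):
    """Train a Rosenblatt perceptron using the update rule of Eq. A.2."""
    X = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend bias input x0 = 1
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x, o in zip(X, targets):
            y = 1 if np.dot(w, x) >= 0 else 0      # g(w . x), Eq. A.1
            w += eta * (o - y) * x                 # Hebbian-style update, Eq. A.2
    return w

# Learn the linearly separable AND function
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
targets = np.array([0, 0, 0, 1])
w = perceptron_train(X, targets)
pred = [1 if np.dot(w, np.r_[1, x]) >= 0 else 0 for x in X]
```

Since AND is linearly separable, the perceptron convergence theorem guarantees that the loop settles on a separating weight vector within these iteration counts.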

A good discussion of the shortcomings of the perceptron is available in [87].

A.3 Hopfield network

The Hopfield network [88], a landmark in the field of neural networks, is an associative memory model. It is a recurrent network consisting of fully-connected McCulloch-Pitts type neurons in which the connection matrix, T, is constrained to have only zero diagonal entries. Sometimes the condition that T be symmetric is also enforced. While this condition is not essential for learning, it makes analysis simpler. For example, in order to store

Figure A.2: A Hopfield network

the patterns V^s ∀ s ∈ {1, ..., n}, the following connection matrix can be used:

T_ij = Σ_s (2V_i^s − 1)(2V_j^s − 1)    (A.3)

T_ii = 0    (A.4)

Hopfield also discusses a continuous version of the model in [6]. The defining aspect of both models is the association of an energy metric with the network states. When permitted to evolve over time, the network moves along the energy landscape (over the network states) and eventually settles to the state corresponding to an energy minimum. Using principles based on Hebbian learning, it is possible to sculpt the energy landscape such that the patterns to be memorized are associated with the energy minima. The training mechanism is as follows. Consider a Hopfield network with k nodes. The patterns to be learnt, V^s = [V_1^s, ..., V_k^s] ∀ s ∈ {1, ..., n}, are also k-bit vectors. During the training process, the goal is to tune the weights of the connection matrix such that

T_ij = (1/n) Σ_{s=1}^{n} V_i^s V_j^s    (A.5)

A common technique for doing this is to drive the patterns sequentially into the Hopfield network. For each input pattern, the weights are updated according to the learning rule:

T_ij := T_ij + η V_i^s V_j^s    (A.6)

Over the course of several iterations, the weights of the connection matrix approach the desired values. The advantage of the Hopfield network is that once training is complete, the stored patterns are associated with the minima of the energy function. This means that when a trained Hopfield network is stimulated by a new pattern, it settles to the state corresponding to the memorized pattern that is closest to the input in its energy landscape. Several implementation questions arise when building a Hopfield network, in particular the existence of cycles, the sequencing of node updates, convergence criteria, etc. These have been well studied and documented by Hopfield and others [89].
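As a concrete illustration, the storage rule of Eq. A.5 and asynchronous recall can be sketched in a few lines of NumPy. This sketch assumes ±1-coded patterns (the 2V − 1 coding of Eq. A.3); the helper names and the 8-bit example patterns are illustrative:

```python
import numpy as np

def train_hopfield(patterns):
    """Build the connection matrix of Eq. A.5 from rows of +/-1 patterns."""
    T = patterns.T @ patterns / patterns.shape[0]
    np.fill_diagonal(T, 0)                  # enforce T_ii = 0 (Eq. A.4)
    return T

def recall(T, state, n_sweeps=10):
    """Asynchronous node updates; the state descends the energy landscape."""
    state = state.copy()
    for _ in range(n_sweeps):
        for i in range(len(state)):
            state[i] = 1 if T[i] @ state >= 0 else -1
    return state

# Store two orthogonal 8-bit patterns, then recover one from a corrupted probe
patterns = np.array([[1, 1, 1, 1, -1, -1, -1, -1],
                     [1, -1, 1, -1, 1, -1, 1, -1]])
T = train_hopfield(patterns)
probe = np.array([1, 1, 1, 1, -1, -1, -1, 1])      # last bit flipped
out = recall(T, probe)
```

The probe differs from the first stored pattern in a single bit; recall settles back to the stored pattern, illustrating the associative-memory behaviour described above.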

A.4 Boltzmann Machines

Boltzmann machines [90], invented by Geoffrey Hinton and Terry Sejnowski, are the stochastic counterparts to Hopfield networks. Hopfield networks exploit the presence of local minima in the energy landscape to learn various patterns; they count on being able to settle to local optima. Training a Boltzmann machine, however, is a global optimization problem. The goal is to build a network that best captures the probability distribution of the training patterns. The training algorithm searches for the set of weights that minimizes the difference between the statistical properties of the training patterns and the patterns generated by the network. To do this, a variation of the simulated annealing algorithm [91] is used. Simulated annealing is an optimization technique inspired by a problem in condensed matter physics: finding the ground states of matter. Like the Hopfield network, Boltzmann machines assign a scalar energy value to each state of the system.

E = −Σ_{i<j} T_ij V_i^s V_j^s    (A.7)

Switching the value of the k-th component of V^s results in an energy change given by:

∆E_k = Σ_i T_ki V_i^s    (A.8)

Any change that reduces the energy is accepted. If the change increases the energy of the system, it is accepted with a probability p_k determined by the Boltzmann function:

p_k = 1 / (1 + e^{−∆E_k / T})    (A.9)

where T, the temperature, is simply a scaling factor applied to the energy term. It is kept high initially and gradually lowered to simulate the annealing process used in metallurgy. Using this distribution results in the property that the ratio between the probabilities of attaining two states, say A and B, is given by:

p_A / p_B = e^{−(E_A − E_B)/T}    (A.10)

This implies that the log-likelihood ratio between two states is directly proportional to their energy difference.
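The acceptance rule of Eqs. A.8 and A.9 can be sketched as one sweep of stochastic unit updates over binary (0/1) units. The function below is a hypothetical helper, not code from the thesis; in practice it would be called repeatedly while the temperature is lowered according to an annealing schedule:

```python
import numpy as np

rng = np.random.default_rng(0)

def anneal_step(T_mat, state, temperature):
    """One sweep of stochastic updates for binary (0/1) units.

    Each unit k is set to the on state with probability
    p_k = 1 / (1 + exp(-delta_E_k / temperature))   (Eq. A.9),
    where delta_E_k = sum_i T_ki * V_i              (Eq. A.8).
    """
    for k in range(len(state)):
        delta_e = T_mat[k] @ state
        p_on = 1.0 / (1.0 + np.exp(-delta_e / temperature))
        state[k] = 1 if rng.random() < p_on else 0
    return state

# Two strongly coupled units at a very low temperature stay on together
T_mat = np.array([[0.0, 5.0], [5.0, 0.0]])
state = anneal_step(T_mat, np.array([1, 1]), temperature=0.001)
```

At high temperature the same rule accepts many energy-increasing flips, which is what lets the annealing process escape poor local optima before the temperature is lowered.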

Figure A.3: Boltzmann machines and Restricted Boltzmann machines

A common feature of Boltzmann machines is the presence of several hidden units that are not driven by the training patterns. A popular variant, the Restricted Boltzmann Machine (RBM), is a network in which connections are not permitted between units in the same layer, as shown in Figure A.3. The hidden units give the network the flexibility to adapt to and model various distributions. Several energy-based learning algorithms exist [92]. The optimization technique used in [90] was minimization of the Kullback-Leibler divergence, G, which provides a measure of how well the network models the target distribution:

G = Σ_s p(V^s) ln [ p(V^s) / p'(V^s) ]    (A.11)

where

p(V^s) = probability of generation of V^s by the target distribution

p'(V^s) = probability of generation of V^s by the current state of the Boltzmann machine

The goal of the optimization problem is to minimise G. This is equivalent to maximization of the log-likelihood function [90]. Consider the equation:

G = Σ_s p(V^s) ln [ p(V^s) / p'(V^s) ] = Σ_s p(V^s) ln(p(V^s)) − Σ_s p(V^s) ln(p'(V^s))    (A.12)

The first term in this equation is determined by the clamped probability distribution over the visible units and is therefore independent of T_ij. The only term that depends on T_ij is Σ_s p(V^s) ln(p'(V^s)), the log-likelihood function. This implies that the problem of minimizing G can also be treated as a log-likelihood maximization problem. It can be shown that [90]:

δG/δT_ij = −(1/T)(p_ij − p'_ij)    (A.13)

Where,

p_ij = probability of units i and j both being in the on state when the visible units are clamped

p'_ij = probability of units i and j both being in the on state when the visible units are unclamped

Computing the summation terms in these equations is computationally intensive and becomes infeasible as the dimensionality increases. Therefore, in almost all applications, p'_ij is estimated by sampling from the Boltzmann machine using a Markov chain Monte Carlo algorithm called Gibbs sampling. Even running Gibbs sampling to convergence can be expensive, so a faster approximation, called contrastive divergence [83], was developed; it has been shown to achieve similar performance.
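A single contrastive-divergence (CD-1) update can be sketched as follows. This is a simplification of the algorithm in [83]: the RBM here has no bias terms, and the function and variable names are illustrative assumptions of this example:

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(W, v0, eta=0.1):
    """One contrastive-divergence (CD-1) step for a bias-free RBM.

    The positive phase estimates the clamped statistics p_ij; one Gibbs
    step provides a cheap stand-in for the unclamped statistics p'_ij
    (cf. the gradient in Eq. A.13).
    """
    h0_prob = sigmoid(v0 @ W)                         # P(h = 1 | v0)
    h0 = (rng.random(h0_prob.shape) < h0_prob) * 1.0  # sample hidden units
    v1_prob = sigmoid(h0 @ W.T)                       # one Gibbs step back
    v1 = (rng.random(v1_prob.shape) < v1_prob) * 1.0  # sample reconstruction
    h1_prob = sigmoid(v1 @ W)
    W += eta * (np.outer(v0, h0_prob) - np.outer(v1, h1_prob))
    return W

W = rng.normal(0.0, 0.1, (4, 3))     # 4 visible units, 3 hidden units
v = np.array([1.0, 0.0, 1.0, 0.0])
W = cd1_update(W, v)
```

Replacing the converged Gibbs chain with a single reconstruction step is exactly the approximation that makes CD-1 so much cheaper than the sampling procedure described above.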

A.5 The self-organizing map (SOM)

The self-organizing map [93] is a competitive learning algorithm that mimics the brain in that the internal representation of information generated by the algorithm has a spatial, or topographical, organization. Kohonen describes two essential effects that result in spatially organized maps [93]:

1. Spatial concentration of the neuronal response to a stimulus

2. Sensitization of the best-matching cell and its topological neighbours to the stimulus

Figure A.4: A self-organizing map

The map consists of a two-dimensional arrangement of nodes (N × M neurons). If the input vector to be trained is x = [x_1, x_2, ..., x_K]^T, then each node (i, j) is associated with a K-dimensional weight vector, w_ij = [w_ij1, w_ij2, ..., w_ijK]^T ∀ (i, j) ∈ {1, 2, ..., N} × {1, 2, ..., M}, that is randomly initialized before training. For each training sample, a measure of the similarity between the weights at each node and the input sample is computed. This can be done using different measures, such as the inner product, Euclidean distance, etc. The neuron with the weight vector closest to the input training sample is selected as the winner. The neurons connected to the winning neuron, c, that lie within a pre-determined distance D are considered to belong to the neighbourhood set N_c:

i ∈ N_c   if ‖r_i − r_c‖ < D    (A.14)

During the learning phase, the winning neuron and its topological neighbours are updated according to the following rule:

w_i(t+1) = w_i(t) + α(t)[x(t) − w_i(t)]   if i ∈ N_c
w_i(t+1) = w_i(t)                         if i ∉ N_c    (A.15)

An alternative method is to introduce a scalar kernel function h_ci and update all the weights according to the equation:

w_i(t+1) = w_i(t) + h_ci(t)[x(t) − w_i(t)]    (A.16)

One of the most popular kernels is the Gaussian function:

h_ci = h_0 · exp(−‖r_i − r_c‖² / σ²)    (A.17)

where r_i is the vector pointing to the i-th neuron in the map and r_c is the vector pointing to the winning neuron c. When training is complete, the statistical properties of the neural weights resemble those of the training dataset. A topographical ordering will also be observed, as shown in Figure A.4, where the values of the weights are represented as colours. Weight vectors that are close to each other tend to agglomerate. Several variations of the self-organizing map have been proposed, but the fundamental idea remains the same as first described by Kohonen in 1981.
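The training loop of Eqs. A.16 and A.17 can be sketched as follows. The map size, learning parameters, and function name are illustrative choices, and for simplicity the kernel width σ is held fixed rather than shrunk over time as is common in practice:

```python
import numpy as np

rng = np.random.default_rng(2)

def train_som(data, n=8, m=8, epochs=200, h0=0.5, sigma=2.0):
    """Train an n x m self-organizing map with the Gaussian kernel of Eq. A.17."""
    K = data.shape[1]
    w = rng.random((n, m, K))                    # random initial weights
    grid = np.stack(np.meshgrid(np.arange(n), np.arange(m),
                                indexing="ij"), axis=-1)
    for _ in range(epochs):
        x = data[rng.integers(len(data))]        # pick a training sample
        dist = np.linalg.norm(w - x, axis=-1)    # Euclidean similarity measure
        c = np.unravel_index(np.argmin(dist), dist.shape)  # winning neuron
        r2 = ((grid - np.array(c)) ** 2).sum(axis=-1)      # map distance to winner
        h = h0 * np.exp(-r2 / sigma ** 2)        # kernel h_ci, Eq. A.17
        w += h[..., None] * (x - w)              # update, Eq. A.16
    return w

w = train_som(np.array([[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]]))
```

After training, nodes near each winning position carry weights close to one of the two training vectors, giving the agglomeration of similar weight vectors described above.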

A.6 Spiking Neural Networks (SNN)

In comparison to MLPs or Boltzmann machines, spiking neural networks [94][95] are a relatively new class of algorithms that originated from the study of the mammalian brain and its ability to learn highly complex patterns. Theoretically, they can approximate any continuous function and have more computational power than the networks discussed so far [96]. Like the brain, these networks consist of spiking neurons connected through synapses. Information is encoded in the timing of the spikes generated by the neurons. The structure of an artificial synapse is also heavily inspired by the biological neuron. A simple depiction is shown in Figure A.5.

Figure A.5: A synapse. The blocks shown are state-dependent and characterized by various dynamical effects leading to different types of spikes

There are two types of synaptic junctions, inhibitory and excitatory, producing inhibitory and excitatory post-synaptic potentials (IPSPs and EPSPs) respectively. A stimulus arriving at an inhibitory junction lowers the post-synaptic potential, and vice versa. Each input is weighed by a weight function that modulates the strength of the connection, much like in the McCulloch-Pitts neuron. The neuron fires an action potential when the net post-synaptic potential exceeds a threshold θ_v. The action potential is a short and dramatic increase in the membrane potential that travels along the axon of the neuron to the next neuron or the target muscle. The action potential triggers an IPSP or an EPSP depending on the synaptic junction. After a neuron fires, it experiences a refractory period during which it does not fire; sometimes this is modelled as an increase in the threshold. Artificial spiking neurons model this behaviour by defining approximate action potential and post-synaptic potential waveforms. Typical response functions for the IPSP, the EPSP, and the action potential are shown in Figure A.6.

Figure A.6: IPSP, EPSP, and action potential

A widely used model of the spike-generation mechanism is the integrate-and-fire model, in which a spike is a dramatic increase in the membrane potential emitted when the post-synaptic potential exceeds a threshold value. The integrate-and-fire neuron used in TrueNorth is a versatile example of this model, described by the following equations (a list of the variables and terms used is provided in Figure A.7):

SYNAPTIC INTEGRATION

V_j(t) = V_j(t−1) + Σ_{i=0}^{255} A_i(t) w_{i,j} [(1 − b_j^{G_i}) s_j^{G_i} + b_j^{G_i} · F(s_j^{G_i}, ρ_{i,j}) · sign(s_j^{G_i})]    (A.18)

LEAK INTEGRATION

Ω = (1 − ε_j) + ε_j · sign(V_j(t))    (A.19)
V_j(t) = V_j(t) + Ω · [(1 − c_j^λ) λ_j + c_j^λ · F(λ_j, ρ_j^λ) · sign(λ_j)]

Figure A.7: Data-types and symbols used in the LIF neuron model [9] © IEEE

THRESHOLD, FIRE, RESET

η_j = ρ_j^T & M_j

if V_j(t) ≥ α_j + η_j:
    V_j(t) = δ(γ_j) R_j + δ(γ_j − 1)(V_j(t) − (α_j + η_j)) + δ(γ_j − 2) V_j(t)    (A.20)
else if V_j(t) < −[β_j κ_j + (β_j + η_j)(1 − κ_j)]:
    V_j(t) = −β_j κ_j + [−δ(γ_j) R_j + δ(γ_j − 1)(V_j(t) + (β_j + η_j)) + δ(γ_j − 2) V_j(t)](1 − κ_j)
end if
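The TrueNorth equations above are heavily parameterized. The sketch below strips the model down to its essential loop: integrate inputs, apply a constant leak, and fire-and-reset at a threshold. It is a deliberate simplification, not the TrueNorth model itself, and all constants are illustrative:

```python
def lif_simulate(input_current, threshold=1.0, leak=0.25, v_reset=0.0):
    """Simulate a minimal leaky integrate-and-fire neuron."""
    v, spikes, trace = 0.0, [], []
    for i_t in input_current:
        v = max(v + i_t - leak, 0.0)   # synaptic integration + leak
        if v >= threshold:             # threshold, fire, reset
            spikes.append(1)
            v = v_reset
        else:
            spikes.append(0)
        trace.append(v)
    return spikes, trace

# A constant drive of 0.5 per tick crosses the threshold every fourth tick
spikes, trace = lif_simulate([0.5] * 10)
```

A constant input thus produces a regular spike train whose rate grows with the drive, which is the basis of the rate-based encoding discussed below in this section.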

There are generally two categories of training algorithms for spiking neural networks:

1. STDP-based unsupervised learning

2. Rate-based learning

Spike-timing-dependent plasticity (STDP) training algorithms for SNNs are unsupervised techniques based on Hebbian learning [97][98]. STDP refers to the biological process by which the synaptic strength increases or decreases depending on the relative timing of the input and output action potentials at the synapse. This was illustrated for different spike shapes in Figure 3.10. For the action potential shown in Figure A.6, the resulting STDP update function is shown in Figure A.8.

Figure A.8: STDP waveform for the action potential shown in Figure A.6

Before training begins, all synaptic weights are initialized to random values. During training, each input sample is transformed into a collection of impulses that are driven into the inputs of the network. The input neurons are connected to other neurons of the network according to pre-set rules, decided by the designer based on criteria such as the complexity of the training data and the simulation or hardware cost. When an input training sample is driven into the network, it causes some of the neurons to trigger action potentials, which are critical to the learning process. Many mathematical variants of the exact update rule exist in the literature [17][97]. In general, the effect of the update mechanism is that synaptic junctions activated slightly before the neuron fires are strengthened, and those activated slightly after the neuron fires are weakened.
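One common pair-based variant of the STDP rule, matching the general shape of the update function in Figure A.8, can be sketched as follows. The exponential form and all parameter values are illustrative assumptions; as noted above, many variants exist:

```python
import numpy as np

def stdp_update(w, t_pre, t_post, a_plus=0.05, a_minus=0.05,
                tau=20.0, w_max=1.0):
    """Pair-based STDP: potentiate when the pre-spike precedes the
    post-spike, depress otherwise, decaying with the timing difference."""
    dt = t_post - t_pre
    if dt >= 0:
        w += a_plus * np.exp(-dt / tau)    # pre before post: strengthen
    else:
        w -= a_minus * np.exp(dt / tau)    # post before pre: weaken
    return float(np.clip(w, 0.0, w_max))   # keep the weight bounded

w_pot = stdp_update(0.5, t_pre=10.0, t_post=15.0)  # potentiation
w_dep = stdp_update(0.5, t_pre=15.0, t_post=10.0)  # depression
```

Clipping the weight to [0, w_max] is a practical detail; without a bound, repeated potentiation would let weights grow without limit.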

Over the course of several iterations, different sets of neurons learn to respond to different classes of inputs. After training is complete, they are assigned class labels depending on their firing behaviour. There is no clear way to define convergence in these networks; it is generally determined by the designer based on the problem. Rate-based SNN algorithms encode the input vector as spiking rates and use gradient-descent-based algorithms to alter those rates. The important difference between rate-based SNN training algorithms and a traditional neural network with the same architecture lies in the encoding scheme; the other mathematical details remain the same.