SIMULATING LARGE SCALE MEMRISTOR BASED CROSSBAR FOR NEUROMORPHIC APPLICATIONS

Thesis

Submitted to

The School of Engineering of the

UNIVERSITY OF DAYTON

In Partial Fulfillment of the Requirements for

The Degree of

Master of Science in Electrical Engineering

By

Roshni Uppala

UNIVERSITY OF DAYTON

Dayton, Ohio

May, 2015

SIMULATING LARGE SCALE MEMRISTOR BASED CROSSBAR FOR

NEUROMORPHIC APPLICATIONS

Name: Uppala, Roshni

APPROVED BY:

Tarek M. Taha, Ph.D.
Advisor Committee Chairman
Associate Professor, Department of Electrical and Computer Engineering

Vijayan K. Asari, Ph.D.
Committee Member
Professor, Department of Electrical and Computer Engineering

Guru Subramanyam, Ph.D.
Committee Member
Chair, Professor, Department of Electrical and Computer Engineering

John G. Weber, Ph.D.
Associate Dean, School of Engineering

Eddy M. Rojas, Ph.D., M.A., P.E.
Dean, School of Engineering

© Copyright by

Roshni Uppala

All rights reserved

2015

ABSTRACT

SIMULATING LARGE SCALE MEMRISTOR BASED CROSSBAR FOR NEUROMORPHIC

APPLICATIONS

Name: Uppala, Roshni
University of Dayton

Advisor: Dr. Tarek M. Taha

The memristor is a novel nano-scale device discovered in 2008. Memristors are basically non-volatile variable resistors. Various breakthroughs in memristive devices have shown the potential of memristive crossbar designs for ultra-high density, low-power memory. Initial studies have shown that memristor based neuromorphic processors could potentially consume 300,000 times less power than a traditional Xeon processor for neural network applications. These neuromorphic processors require large memristor arrays for evaluating neural networks. A key problem in designing these processors is simulating the large arrays of thousands of memristors, as these simulations require extremely long times (sometimes weeks to months). Additionally, most existing simulation tools are unable to handle the large number of computations seen in these simulations. The simulation of these large arrays has not been examined in the literature. One of the key objectives of this work is to evaluate large scale memristor crossbars that allow a high density layout of synapses and thus enable the building of highly capable neuromorphic systems. To achieve this, this study utilized a newly released parallel SPICE simulator, Xyce, developed by Sandia National Labs. Although it is 4 times faster than the existing SPICE simulators, Xyce is still too slow for evaluating the capabilities of large crossbar arrays. Therefore, this thesis also examines the development of an equivalent mathematical representation of the circuit level memristor based neuromorphic circuits for faster computation of the crossbars without the use of SPICE. This system is then used in an offline training approach to approximate the SPICE circuit; such training requires thousands of iterations over the crossbar array. This thesis examined the training of memristor-based crossbars for neural network based applications such as image classification at the circuit level. Finally, the analysis presented in this work will be crucial in understanding the future of memristor-based crossbars in developing highly reliable and extremely low power processors and neuromorphic systems.

Dedicated to my mother, sisters and all the women.

“Life is not easy for any of us. But what of that? We must have perseverance and above all confidence in ourselves. We must believe that we are gifted for something and that this thing must be attained.” - Marie Skłodowska-Curie

“The woman who follows the crowd will usually go no further than the crowd. The woman who walks alone is likely to find herself in places no one has ever been before.” - Albert Einstein

ACKNOWLEDGMENTS

I express my sincere gratitude to my advisor Dr. Tarek M. Taha for his time and consideration throughout my thesis. I would also like to thank my committee members Dr. Vijayan K. Asari and Dr. Guru Subramanyam for their support and encouragement. I would also like to thank Christopher Yakopcic for helping me throughout my thesis and letting me build my work on his. I would also like to extend my thanks to Maureen Schlangen and all the University of Dayton Roesch Library staff for their endless support and strength, and for funding me throughout my study here at UD in the form of student employment and a graduate assistantship. Lastly, I would like to thank Binu Nair for his trust and belief in me and for being a pillar of strength all through this. I finally thank my family for providing me the right kind of assistance.

TABLE OF CONTENTS

ABSTRACT ...... iii

DEDICATION ...... v

ACKNOWLEDGMENTS ...... vi

LIST OF FIGURES ...... ix

LIST OF TABLES ...... xii

I. INTRODUCTION ...... 1

1.1 Evolution of Memristors - Mathematical Model ...... 1
1.2 Re-discovery of the Memristor in its Physical Form ...... 2
1.3 Memristors in Neuromorphic Computing ...... 3
1.4 Contributions ...... 4
1.5 Overview of the Thesis ...... 4

II. RELATED WORK ...... 5

2.1 Insights into Memristors ...... 5
2.2 Neuromorphic Computing ...... 7
2.2.1 Early Architectures ...... 8
2.2.2 Modern Architectures ...... 10
2.2.3 Memristor based Architectures ...... 13

III. MEMRISTOR MODEL ...... 16

3.1 University of Dayton (UD) Memristor Model ...... 17
3.2 Memristor as Neurons ...... 19
3.3 Memristor based Neuromorphic Crossbar ...... 22
3.3.1 Write Operation ...... 22
3.3.2 Read Operation ...... 24

IV. LARGE SCALE CIRCUIT SIMULATIONS ...... 25

4.1 Need for Parallel Circuit Simulation ...... 26
4.2 Xyce ...... 27
4.2.1 Parallel Simulation using Xyce ...... 28
4.2.2 Installation of Xyce [1] ...... 30
4.3 Memristor based Neuromorphic Circuit for SPICE Simulation ...... 38
4.3.1 Finite drivers ...... 40
4.3.2 Training of SPICE Circuit ...... 43
4.4 Results ...... 44
4.4.1 System Setup ...... 45
4.4.2 Analysis ...... 46

V. ALTERNATIVE APPROACH IN MODELING MEMRISTOR BASED CROSSBAR FOR NEUROMORPHIC APPLICATIONS ...... 51

5.1 Traditional Single Layer Perceptron ...... 52
5.2 Single Layer Perceptron with Memristor based Crossbars ...... 56
5.2.1 Approximate Solution to a Circuit Level Memristive based Crossbar ...... 56
5.2.2 Training Single Layer MATLAB based Memristor Crossbar ...... 60
5.3 Testing of Memristor based Crossbar as a Single Layer Perceptron in Xyce ...... 63
5.4 Results ...... 65
5.4.1 MNIST Digit Database ...... 66
5.4.2 MIT-CBCL Face Database ...... 71

VI. CONCLUSION ...... 77

BIBLIOGRAPHY ...... 79

LIST OF FIGURES

2.1 Insights into memristor...... 6

2.2 A multi-chip add-in board with Ni1000 Recognition Accelerators where the CPU manages the flow of data through these accelerators [2]...... 9

2.3 SYNAPSE-1 system architecture [3]...... 10

2.4 Benchmarks for mapping kernels to neural networks [4]...... 11

2.5 IBM Neurosynaptic Core [5, 6]...... 12

2.6 DARPA SyNAPSE chipboard and its TrueNorth core [7, 8]...... 13

2.7 The mrFPGA architecture [9]. (a) Overview of mrFPGA, and (b) Detailed design of the connection blocks and switch blocks...... 14

2.8 SRAM array replaced by a memristor crossbar [10]...... 15

2.9 Efficiency of memristors over other high performance architectures [10]...... 15

3.1 Input voltage and current simulation waveforms of the model in [11] based on the device in [12]. The memristor model parameters are: Vp = 4 V, Vn = 4 V, Ap = 816000, An = 816000, xp = 0.985, xn = 0.985, αp = 0.1, αn = 0.1, a1 = 1.6 × 10⁻⁴, a2 = 1.6 × 10⁻⁴, b = 0.05, x0 = 0.01. This simulation result has been taken from [13]. ...... 17

3.2 Illustration of Memristor as Synapse ...... 20

3.3 Weight Distribution [14]...... 21

3.4 Memristor as a synapse and its neuro-morphic crossbar representation...... 23

3.5 Writing and reading into memristors in a crossbar...... 24

4.1 Memristor ...... 39

4.2 Complimentary CMOS based finite driver circuit ...... 40

4.3 Memristor based neuromorphic crossbar ...... 41

4.4 Training memristor based crossbar in Xyce...... 43

4.5 Effect of total training time in Xyce with change in number of cores to train a linearly separable 2-input logic function using level-3 Xyce MOSFET in the driver circuit...... 46

4.6 Effect of total training time in Xyce with change in number of cores to train a lin- early separable 2-input logic function using Mosis-180nm in the driver circuit...... 47

4.7 Convergence error curve from training 2-bit input linearly separable logic functions with 10,096 memristors...... 48

4.8 Effect of total training time in Xyce with change in number of cores to train a linearly separable 3-bit input logic function using level-3 Xyce MOSFET and Mosis 180nm MOSFET in the driver circuit...... 50

5.1 Single layer perceptron ...... 53

5.2 Matlab Memristive system ...... 56

5.3 Effective resistances for computation of voltage V1...... 57

5.4 Effective resistances for computation of voltage V2...... 58

5.5 Memristor based neuromorphic crossbar for testing offline trained weights for out- put voltage...... 65

5.6 MNIST digit dataset...... 67

5.7 Error curves obtained with the traditional perceptron with two different learning rates. 69

5.8 Error curves obtained with Matlab memristor crossbar during training with two dif- ferent learning rates...... 69

5.9 Comparisons of accuracies with traditional, MATLAB memristor crossbar and Xyce implementation...... 70

5.10 MIT-CBCL Face dataset...... 73

5.11 Error curves obtained with traditional perceptron during training with two different learning rates for CBCL-MIT Face dataset...... 74

5.12 Error curves obtained with Matlab memristor crossbar during training with two dif- ferent learning rates for CBCL-MIT Face dataset...... 75

5.13 Comparisons of accuracies with traditional, MATLAB memristor crossbar and Xyce implementation for CBCL-MIT Face dataset...... 76

LIST OF TABLES

4.1 2-input linearly separable logic functions...... 45

4.2 Linearly separable 3-bit input logic functions using two different MOSFETs in the driver circuit with 2 processor cores...... 50

CHAPTER I

INTRODUCTION

1.1 Evolution of Memristors - Mathematical Model

Memristors were initially a mathematical postulation by Leon Chua [15], who showed the existence of an independent differential relation coupling the charge q with the flux Φ in a circuit, i.e. dΦ = M · dq. This was mathematically distinct from the non-linear resistance coupling the voltage v to the current i, i.e. dv = R · di. Through a rigorous study of the properties of this possible non-linear circuit element, it was found that this hypothetical device was a resistor with memory, whose resistance depends on the amount of charge that has flowed through it. This hypothetical non-linear circuit element was further generalized by Chua and Kang [16], who introduced a state variable w to describe its physical properties. The generalized memristor is then characterized by two equations: one relating the voltage across the device to the current through it at a particular time, i.e. v = R(w, i) · i, and the other a state equation of the form dw/dt = f(w, i). The I−V characteristic obtained from this hypothetical non-linear circuit element shows a pinched hysteresis loop, which was interpreted as the device remembering its history through a changing resistance, without storing charge or energy like a capacitor. These hypotheses and mathematical formulations can be applied to any circuit element showing a pinched hysteresis loop; such elements are referred to as memristive devices. Therefore the memristor, with respect to the general mathematical model, is a circuit element that shows a pinched hysteresis loop in its I−V characteristics, independent of the physical mechanism that causes this effect. Moreover, the memristor can be considered a fourth fundamental circuit element: its equivalent circuit cannot be constructed from the three basic linear circuit elements (the resistor, capacitor and inductor). Often, the memristor represents an independent component for building other passive non-linear circuits. Thus, due to this rigorous mathematical formulation and analysis, the behavior of a large number of devices exhibiting such pinched hysteresis loop characteristics can be predicted.
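As an illustration of the generalized memristive system above, the following MATLAB sketch integrates dw/dt = f(w, i) alongside v = R(w, i) · i and reproduces the pinched hysteresis loop. The resistance law and state equation here are illustrative assumptions chosen only for demonstration, not a published device model.

% Minimal sketch of Chua and Kang's generalized memristive system:
% v = R(w)*i with dw/dt = f(w,i). R and f are illustrative assumptions.
Ron = 1e2; Roff = 1e4;             % assumed limiting resistances
R = @(w) Ron*w + Roff*(1-w);       % resistance interpolated by state w in [0,1]
f = @(w,i) i*w*(1-w);              % assumed state equation, damped at the boundaries

dt = 1e-4; t = 0:dt:0.2; w = 0.5;  % forward-Euler integration of the state
i = 1e-2*sin(2*pi*50*t);           % sinusoidal current drive
v = zeros(size(t));
for k = 1:numel(t)
    v(k) = R(w)*i(k);              % memristive Ohm's law at this instant
    w = w + dt*f(w,i(k));          % state update
    w = min(max(w,0),1);           % keep the state within its physical bounds
end
plot(v,i); xlabel('v'); ylabel('i');  % the i-v loop is pinched at the origin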

1.2 Re-discovery of the Memristor in its Physical Form

Before the advent of the physical fabrication of this non-linear circuit element, the research group at HP Labs was focused on building nanoscale devices as alternatives to CMOS for logic and memory. This led to research into the use of molecular systems [17] and later inorganic materials, with a great deal of focus on devices whose I−V characteristics showed the pinched hysteresis loop. Using Leon Chua's mathematical model, a direct relation was found between the time-derivative of the state variable in the dynamical state equation and the drift velocity of the oxygen vacancies in a titanium dioxide resistive switch [18]. This formed the basis of the physical fabrication of the memristor, for which a quantitative mathematical model based on a physical mechanism was now available. Since this breakthrough, and with the limited scaling of CMOS based SRAM circuits, there has been an urgent need for power-efficient devices. Memristors are a strong candidate for SRAM replacement in level 2 and higher caches. Since memristors are believed to have the ability to both store and process information, it has also been shown that they resemble the synapse of a biological neuron [19, 20, 21]. Thus neuromorphic computing with memristor crossbars has become a major research interest in the academic community, and a large set of neuromorphic computing researchers are investigating what can be developed around these devices.

1.3 Memristors in Neuromorphic Computing

Researchers have tried to explore large scale neural networks with the use of powerful compute clusters, but with a significant trade-off: these clusters consume a large amount of power, reducing the power efficiency of the system. In recent research, it was shown that memristor-based crossbar processors can enable neuromorphic computing that consumes about 300,000 times less power than a traditional processor (such as an Intel Xeon processor) [10]. This is mainly because the computations are done in the analog domain, based on the actual characteristics of the memristor devices. The simulation of such large memristor crossbars therefore becomes essential to properly understand how these processors would behave.

A key problem in the current approach in the research community is modeling large arrays of memristors (or crossbar arrays) with over 1000 memristors. These need to be modeled in low level circuit simulation tools (such as SPICE) to capture all the low level circuit behaviors. Unfortunately no group has been able to do this as of yet, and almost all studies are done using high level tools (such as MATLAB). These high level tools do not capture all circuit level behaviors and so are not adequate to properly analyze the capabilities of memristor crossbars. We believe that if such large scale memristor crossbars are designed and tested, they can provide much better efficiency in terms of power, speed, and area compared to other core designs.

A breakthrough in this modeling capability came with a new circuit level device model developed at the University of Dayton. This model allows large circuits to be simulated, while the existing SPICE models would typically cause math convergence errors that crash simulations of circuits with more than 10 memristors. Modeling large circuits with this model is still problematic because the simulations take a very long time on a single computer. Using multiple computers has not been an option in the past, as parallel SPICE simulators have not been available. In January 2014, Sandia National Laboratories released the first ever parallel SPICE simulator, called Xyce, that is able to run on a cluster of computers [22]. We have used this simulator and have been able to model significantly sized memristor-based crossbars successfully.

1.4 Contributions

The contributions presented in this thesis are as follows:

(1) Parallel simulation of large scale memristor-based crossbars of about 8000−10000 memristors by configuring Xyce on Linux-based clusters.

(2) Design of large-scale memristor-based crossbars with finite drivers for neuromorphic applications.

(3) Development and implementation of a mathematical model of memristors for offline training of large scale memristor-based crossbars.

(4) Investigation of the effect of comparators on the performance of SPICE level memristor-based crossbar simulations using the offline trained weights.

1.5 Overview of the Thesis

This manuscript is organized into six chapters (including this chapter). Chapter II provides an insight into the memristor device and a literature survey of the different kinds of neuromorphic architectures. The memristor model used in this work is described in Chapter III, which also describes its representation as a neuron and its implementation as a crossbar for neuromorphic computing. Chapter IV explains the need for a parallel circuit simulator and the configuration steps of Xyce; it also describes the memristor-based crossbar using a finite driver at the circuit level. Chapter V presents an approximate model of the circuit-level memristor-based crossbar and its application in image classification using two datasets. The last chapter gives the conclusions and future work.

CHAPTER II

RELATED WORK

2.1 Insights into Memristors

According to Moore's law, the number of transistors on integrated circuits doubles approximately every two years. As transistor counts per chip have increased, it has become apparent that transistors cannot keep shrinking indefinitely. Unless there are alternatives to silicon based transistors or new architectural developments, there is a significant chance of processor performance remaining static. Researchers have been exploring different possibilities to overcome such bottlenecks, and Hewlett Packard's (HP) recent discovery of the memristor device may help improve processor performance. The three basic electrical passive devices are the resistor, inductor and capacitor. Leon Chua from the University of California, Berkeley [15] was the first to conceive the idea of a missing nonlinear fundamental passive circuit element. Chua coined the name memristor (memory + resistor) since the device demonstrated the hysteresis property of ferromagnetic core memory (remembering its magnetic history) as well as the dissipative characteristics of a resistor. Therefore the memristor can be thought of as a reconfigurable resistor with memory, relating the electric charge and the magnetic flux as shown in Figure 2.1a. It is basically a two-terminal element characterized by a relation g(f, q) = 0, where the device is charge- or flux-controlled if the relation can be expressed as a single function of q or f. The voltage across the device is V(t) = M(q(t))I(t),

(a) Passive circuit elements. (b) I−V characteristics [24]. (c) Memristor charge flow.

Figure 2.1: Insights into memristor.

where the memristance (M) is given in Equation 2.1.

M = \frac{d\Phi}{dq} = \frac{d\Phi/dt}{dq/dt} \qquad (2.1)

It can be inferred from the above relation that the memristance is simply a charge dependent resistance. On applying an alternating voltage at the terminals of the device, the I−V curve is characterized by a pinched hysteresis loop that passes through the origin, as illustrated in Figure 2.1b. In the initial memristor fabrication design developed by HP [23], a titanium dioxide (TiO2) layer and an oxygen deficient (TiO2−x) layer of width D are sandwiched between two platinum electrodes, as shown in Figure 2.1c. The electric current through the memristor shifts the oxygen vacancies, causing a gradual change in the electrical resistance. When a positive voltage is applied across the memristor, the oxygen vacancies spread, creating a thicker TiO2−x layer and lowering the resistance through the device. When a negative voltage is applied, the oxygen vacancies move closer to one end of the device, which leads to a thicker TiO2 layer and increases the resistance through the device.

Since the discovery of memristors, several researchers have been involved in developing memristor-based SPICE models. One of the best models developed so far is the University of Dayton memristor model given in [25, 24]. It has the capability of modeling different memristor characterizations and also captures device responses for linearly increasing and sinusoidal inputs. Hence, this model is used in this work and is explained in detail in Chapter III. Research on memristors has been growing ever since the actual physical device was published. Memristors are tiny devices that can be densely packed in a crossbar. A crossbar is basically an array of perpendicular wires with a switch, such as a memristor, at the junction of each pair of intersecting wires. Such a crossbar layout with memristors provides up to 5.2 times the memory density of an STT-MRAM system with low energy consumption [13]. Before we describe the neuromorphic architecture using memristors, we explore some of the earlier techniques used for neuromorphic computing.

2.2 Neuromorphic Computing

Engineering intelligent machines is one of the exciting and challenging frontiers of modern hardware design. The field of machine learning has achieved remarkable progress in many classes of problems such as pattern recognition, natural language processing and time series prediction. For realistic tasks, such algorithms perform significantly better when massive computational power is available. This computational intensity has limited their usability in large scale applications due to area and power requirements. New machine learning oriented hardware design approaches must therefore be developed to overcome these limitations.

Many neuromorphic computing algorithms utilize large matrices of values such as synaptic weights. These matrices are continuously updated during operation of the system and are constantly being used to interpret new data. Implementing such algorithms on conventional general purpose digital hardware (like microprocessors) is highly inefficient. A prime reason for this is the physical separation between the memory arrays used to store the values of the synaptic weights and the arithmetic modules used to compute the update rule. To overcome the inefficiency of general purpose hardware, numerous dedicated hardware designs based on CMOS technology have been proposed in the past two decades [26, 27, 28, 29, 30].

Neuromorphic computing, also known as neuromorphic engineering, describes the use of very large scale integration (VLSI) systems containing different electronic devices to mimic the behavior of biological neural architectures in the nervous system. There has been a significant amount of research in this discipline to understand how different neuron circuits affect information processing, robustness, power consumption, and speed.

2.2.1 Early Architectures

Research and development of neuromorphic architectures dates back to the 1980s. One such architecture is the ETANN chip, developed by Intel [31, 32, 33]: an entirely analog chip designed for feed forward artificial neural network operation. ETANN stands for Electronically Trainable Analog Neural Network. It is an electronically trainable parallel data processor with 64 neurons and 10,240 synapses that performs artificial neural network [34] functions. Although it is a general purpose analog neural chip with 64 fully connected neurons, it does not support on-chip learning, and it has limited resolution for storing the synapse weights. ETANN chips can be cascaded to form a network of up to 1024 neurons with up to 81,920 weights. A similar approach was used in the Mod2 Neurocomputer [35], where 12 such ETANN chips were used for real-time image processing.

In 1996, Nestor Inc. and Intel developed the Ni1000 Recognition Accelerator [2] for high performance pattern recognition applications. This accelerator supports classification speeds of up to 33,000 patterns per second, with real-time adaptation. The chip is compatible with custom algorithms such as Radial Basis Functions (RBF), Probabilistic RCE (RCE), Probabilistic Neural Networks (PNN), etc. It can be added on a board with a host CPU, where the CPU manages the flow of data, as shown in Figure 2.2. This accelerator accepts input vectors with a maximum of 256 feature dimensions, each with 32 levels of resolution, and outputs up to 64 classes or probabilities, making it more powerful than ETANN. The system however has limited, narrower functionality.

Figure 2.2: A multi-chip add-in board with Ni1000 Recognition Accelerators where the CPU man- ages the flow of data through these accelerators [2].

SYNAPSE-1, which stands for Synthesis of Neural Algorithms on a Parallel Systolic Engine, was developed by Siemens [3]. It is a modular system that uses the Siemens MA16 neurochip as its main building block. Multiple MA16 chips are cascaded to form 2D systolic arrays, along with weight memories, data units, and a control unit (as shown in Figure 2.3), to form this neuro-chip. The transfer functions of neurons are calculated off-chip using table look-ups. Although SYNAPSE-1 was able to map several networks such as back-propagation and Hopfield networks, its complex processing elements and 2D systolic array structure hinder simple programming.

Figure 2.3: SYNAPSE-1 system architecture [3].

Similarly, many other digital and analog hardware neural network chips have been designed over the past few years; they are summarized in surveys such as [36, 37, 38].

2.2.2 Modern Architectures

Some examples of modern neuromorphic architectures are Neurogrid and FACETS. Neurogrid [39] is a multi-chip system developed by the Brains in Silicon group at Stanford University, with the objective of emulating large numbers of neurons using a 4x4 array of Neurocores. Each Neurocore contains a 256x256 array of neuron circuits with up to 6,000 synapse connections. FACETS [40] stands for Fast Analog Computing with Emergent Transient States and was developed by a group of scientists and researchers in 2005. It was reported that this chip had 200,000 neuron circuits connected with 50 million synapses.

Several researchers have been engaged in obtaining powerful processing capabilities by simulating large scale networks with the help of high performance compute clusters [41]. With the use of such powerful clusters, however, there has been a significant trade off in terms of electrical power, space, and cost, limiting the application of such large neural networks.

Intel has envisioned the potential use of RMS (Recognition, Mining and Synthesis) in future developments and applications [42]. Recognition allows computers to examine data and construct mathematical models based on what they identify, such as a person's face in a single picture. Data mining extracts one or more instances of a specific model from massive amounts of environmental data, such as finding a person's face occurring in large numbers of pictures with various backgrounds, resolutions, etc. Synthesis constructs new instances of the models, allowing what-if scenarios.

Work is required in developing highly scalable, multi-core architectures with reliable on-chip networks. Recent studies in [4, 43] have shown the use of neural networks for applications relating to RMS and for kernels such as JPEG, FFT, etc., as described in Figure 2.4.

Figure 2.4: Benchmarks for mapping kernels to neural networks [4].

In 2011, IBM [5, 6] demonstrated a 256 neuron, 64k/256k-synapse neurosynaptic core in 45 nm silicon with multiple real time applications. This is illustrated in Figure 2.5. This SRAM based architecture was able to achieve more power efficiency than traditional processors and is designed to scale as a multi-core system. IBM also proposed a new programming paradigm which uses a corelet language for the neurosynaptic core [44]. The neurosynaptic chip requires each core to be programmed separately. Therefore, with such cognitive computing programming, the team created composable, reusable building blocks of these cores known as corelets. Each corelet has a particular function, and corelets can be put together in different configurations to create new applications. For example, one corelet might include all the individual cores that perform arithmetic operations, and another the cores that perform Fast Fourier transform operations. These corelets allow applications to be created without programming individual neurosynaptic cores.

Figure 2.5: IBM Neurosynaptic Core [5, 6].

In 2014, the IBM team developed the TrueNorth neuromorphic CMOS chip [7, 8] under the DARPA SyNAPSE program. Each TrueNorth chip contains about 5.4 million transistors wired into an array of 1 million neurons and 256 million synapses, i.e., each neuron has 256 programmable synapses which convey the signals between neurons. It is a very energy-efficient chip, with a power consumption of 70 milliwatts and about 1/10,000th the power density of conventional microprocessors. Figure 2.6a shows the DARPA SyNAPSE 16-chip board, and Figure 2.6b shows the TrueNorth chip core array.

(a) 16 TrueNorth chip board (b) TrueNorth chip core array

Figure 2.6: DARPA SyNAPSE chipboard and its TrueNorth core [7, 8].

In 2012, the Advanced Processor Technologies Research Group (APT) at the School of Computer Science at the University of Manchester developed a specialized multi-core RISC architecture for neural simulations called SpiNNaker [45, 46], which stands for Spiking Neural Network Architecture. One SpiNNaker chip contains 18 ARM9 cores with local SRAM and a specialized router. Each chip holds the connectivity information for up to 16 million synaptic connections and is packaged with a stacked 128MB SDRAM chip. Forty-eight such SpiNNaker chips were then organized on a board, with the potential to scale to much larger systems.

2.2.3 Memristor based Architectures

Memristors have become very popular with neuromorphic engineers and scientists due to their extremely small size, low power consumption and capability to store information. This has provided an impetus for researchers to invest heavily in memristor-based neuromorphic computing chips.

One significant memristor-based architecture, the mrFPGA, was developed by [9] and is shown in Figure 2.7. Here, the memristor based FPGA relies on CMOS fabrication and uses memristors and metal wires as programmable interconnects, so that the interconnects can be fabricated over the logic blocks, resulting in a significant reduction of overall area and interconnect delays. This paper also showed that the architecture achieved about 5.18x area savings, 2.28x speedup and 1.63x power savings on the 20 largest MCNC benchmark circuits.

Figure 2.7: The mrFPGA architecture [9]. (a) Overview of mrFPGA, and (b) Detailed design of the connection blocks and switch blocks.

A recent study [10] examined memristor based digital and analog multi-core neural processors, as shown in Figure 2.8. The results showed that these multi-core processors with memristors provide significant area, power and speed efficiencies over current high performance compute platforms, as shown in Figure 2.9. The memristor based cores take up less die area, achieving a reduction from 179 Xeon six-core processor chips to 1 memristor based multi-core chip and a reduction in power from 17kW to about 0.07W. The main reason for the low efficiency of SRAM based cores is that they remain idle for long periods of time while still suffering leakage energy loss. The memristor based core thus provides not only higher power efficiency but also a significant reduction in chip area. Though several studies have shown the potential of memristor crossbars at larger scales [47], such systems have not yet been explored. This is due to the fact that large scale designs consume a huge amount of computation

Figure 2.8: SRAM array replaced by a memristor crossbar [10].

Figure 2.9: Efficiency of memristors over other high performance architectures [10].

time. After a certain size of the design, the computation capacity needed is beyond the capability of today’s existing simulation tools.

CHAPTER III

MEMRISTOR MODEL

This chapter presents the memristor model that is used in the analysis of large scale memristor-based neuromorphic crossbars. The SPICE based memristor model presented in [25] is the first model that was quantitatively correlated to multiple devices for both sinusoidal and repetitive sweeping inputs. Simulation results in Figure 3.1, taken from [13], show that this memristor model matches very closely the characterization data of the device published in [12]. The device has a large ROFF/RON ratio on the order of 10⁶ while still retaining a relatively low switching time of about 10 ns. It also has a large on-state resistance of 125 kΩ. From Figure 3.1, it is observed that applying a pulse of +7 V to the device switches it into a low resistance state, and applying a pulse of −7 V drives the memristor into a high resistance state. Because of such reliable operation, we have used this model over the other existing models [48, 49, 50] for large scale simulations of memristors as synapses/neurons for pattern recognition, neural network and high performance neuromorphic computing applications.

This chapter is organized as follows: Section 3.1 describes the memristor model used in our analysis, given by [24]. Section 3.2 describes how a memristor can be correlated to a synapse in a neural network. Section 3.3 describes the read/write operation of the memristive crossbar [13]; this crossbar configuration forms the basis of our analysis.

Figure 3.1: Input voltage and current simulation waveforms of the model in [11] based on the device in [12]. The memristor model parameters are: Vp = 4 V, Vn = 4 V, Ap = 816000, An = 816000, xp = 0.985, xn = 0.985, αp = 0.1, αn = 0.1, a1 = 1.6 × 10⁻⁴, a2 = 1.6 × 10⁻⁴, b = 0.05, x0 = 0.01. This simulation result has been taken from [13].

3.1 University of Dayton (UD) Memristor Model

The SPICE based memristor model proposed in [24] is a generalized model which satisfies the three main properties of the physical memristor device: voltage threshold, non-linear drift and electron tunneling. The voltage threshold is the minimum voltage required to change the state of the memristor from an 'off' state to an 'on' state. When a voltage greater than this threshold is applied, the boundary within the memristor is shifted by the current; when this voltage is removed, the boundary enters a state of drift in which its movement becomes non-linear, hence the name non-linear drift. Electron tunneling is the effect created when a small current is induced between two electrodes separated by a thin insulating film. This effect can be seen in memristors: when a positive bias voltage is applied to the electrodes of the memristor, the positively charged oxygen vacancies drift towards the TiO2 layer. These three properties, first observed on the physical device fabricated by HP Labs [51], were not all encompassed in other published mathematical models [48, 52]. In this respect, the UD generalized memristor model has an advantage over the other models: it not only represents these properties but can also be tuned with variable parameters to accurately match several published device characterizations. Moreover, it has been evaluated and shown that this generalized memristor model provides accurate circuit simulation for a wide range of device structures and voltages [24]. This memristor model [24, 25] is used in our analysis and is modified for use in Xyce as shown below.

* UD memristor model [24, 25], modified for Xyce; fitting parameters
* correspond to Equations 3.1-3.5
.SUBCKT UD_memristor TE BE xsv
+ PARAMS: a1=0.00016 a2=0.00016 b=0.05 Vp=4 Vn=4 Ap=816000 An=816000
+ xp=0.985 xn=0.985 alphap=0.1 alphan=0.1 xo=0.01
* Multiplicative functions to ensure zero state variable motion at the memristor boundaries
.FUNC wn(V) {V/(1-xn)}
.FUNC wp(V) {(xp-V)/(1-xp)+1}
* Function G(V(t)) - describes the device threshold
.FUNC G(V) {IF(V <= Vp, IF(V >= -Vn, 0, -An*(exp(-V)-exp(Vn))), Ap*(exp(V)-exp(Vp)))}
* Function F(V(t),x(t)) - describes the state variable motion
.FUNC F(V1,V2) {IF(V1 >= 0, IF(V2 >= xp, exp(-alphap*(V2-xp))*wp(V2), 1), IF(V2 <= (1-xn), exp(alphan*(V2+xn-1))*wn(V2), 1))}
* I-V relationship - hyperbolic sine due to MIM structure (a2 branch applies for V < 0)
.FUNC IVRel(V1,V2) {IF(V1 >= 0, a1*V2*sinh(b*V1), a2*V2*sinh(b*V1))}
* Capacitor integrates the state variable; the current source drives its motion
Csv xsv 0 {1}
Gsv 0 xsv value={F(V(TE,BE),V(xsv,0))*G(V(TE,BE))}
* Current source that generates the memristor I-V response
Gmem TE BE value={IVRel(V(TE,BE),V(xsv,0))}
.ENDS UD_memristor
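As a usage sketch (the source values, node names and analysis settings below are illustrative assumptions, not taken from the thesis), the subcircuit can be instantiated in a small Xyce test bench placed in the same netlist:

* Hypothetical test bench for the UD_memristor subcircuit above
Vin TE 0 SIN(0 7 500K)
XMEM TE 0 xs UD_memristor
.TRAN 10n 10u
.PRINT TRAN V(TE) I(Vin)
.END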

For this generalized memristor model, the I-V characteristic equation is based on the hyperbolic sine function, which accounts for the metal-insulator-metal (MIM) junction of the device. Because of this, a superlinear increase in conductivity is observed as the input voltage grows. As shown in Equation 3.1, there are three parameters (a1, a2, b) whose values determine a specific type of memristor device. (a1, a2) correspond to the thickness of the dielectric layer in the memristor and account for the electron tunneling effect, while the parameter b is related to the conductivity of the device.

I(t) = \begin{cases} a_1\, x(t) \sinh(b V(t)), & V(t) \geq 0 \\ a_2\, x(t) \sinh(b V(t)), & V(t) < 0 \end{cases} \qquad (3.1)

In this model, the state of the memristor depends on two functions, g(V(t)) and f(x). The first function, g(V(t)), given in Equation 3.2, is responsible for setting the minimum threshold voltage required to change the value of the state of the memristor. The exponential constant subtracted in each branch ensures that g(V(t)) starts from zero once the voltage threshold is crossed. (Vp, Vn) are the voltage thresholds at the positive and negative leads of the memristor model. The second function, f(x), given in Equations 3.3 and 3.4, models the non-linear drift phenomenon when a voltage greater than the threshold is applied to the memristor device. During the non-linear dopant drift, the state variable motion slows down at the boundaries; this function is therefore modeled as an exponential decay. Here, the parameters (αp, αn) reflect the damping of the state variable motion.

g(V(t)) = \begin{cases} A_p\,(e^{V(t)} - e^{V_p}), & V(t) > V_p \\ -A_n\,(e^{-V(t)} - e^{V_n}), & V(t) < -V_n \\ 0, & -V_n \leq V(t) \leq V_p \end{cases} \qquad (3.2)

f(x) = \begin{cases} e^{-\alpha_p (x - x_p)}\, w_p(x, x_p), & x \geq x_p \\ 1, & x < x_p \end{cases} \qquad w_p(x, x_p) = \frac{x_p - x}{1 - x_p} + 1 \qquad (3.3)

f(x) = \begin{cases} e^{\alpha_n (x + x_n - 1)}\, w_n(x, x_n), & x \leq 1 - x_n \\ 1, & x > 1 - x_n \end{cases} \qquad w_n(x, x_n) = \frac{x}{1 - x_n} \qquad (3.4)

\frac{dx}{dt} = g(V(t))\, f(x) \qquad (3.5)
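The following MATLAB sketch implements Equations 3.1-3.5 directly, using the parameter values from Figure 3.1 with a coarse forward-Euler integration. The drive waveform and time step are assumptions for illustration; the SPICE subcircuit above remains the reference implementation.

% Sketch of the UD model, Equations 3.1-3.5 (parameters from Figure 3.1)
a1=1.6e-4; a2=1.6e-4; b=0.05; Vp=4; Vn=4; Ap=816000; An=816000;
xp=0.985; xn=0.985; alphap=0.1; alphan=0.1; x=0.01;    % x0 = 0.01

g  = @(V) (V> Vp).*( Ap*(exp( V)-exp(Vp))) + ...       % Equation 3.2;
          (V<-Vn).*(-An*(exp(-V)-exp(Vn)));            % zero between the thresholds
wp = @(x) (xp-x)/(1-xp) + 1;                           % Equation 3.3
wn = @(x) x/(1-xn);                                    % Equation 3.4
f  = @(x,V) (V>=0).*((x>=xp)  .*exp(-alphap*(x-xp))  .*wp(x) + (x<xp))   + ...
            (V< 0).*((x<=1-xn).*exp( alphan*(x+xn-1)).*wn(x) + (x>1-xn));

dt=1e-10; t=0:dt:4e-6; V=7*sin(2*pi*5e5*t); I=zeros(size(t));
for k=1:numel(t)
    I(k) = (V(k)>=0)*a1*x*sinh(b*V(k)) + (V(k)<0)*a2*x*sinh(b*V(k)); % Eq. 3.1
    x = x + dt*g(V(k))*f(x,V(k));                                    % Eq. 3.5
    x = min(max(x,0),1);                                             % clamp state
end
plot(V,I);   % pinched hysteresis loop of the modeled device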

3.2 Memristor as Neurons

Figure 3.2: Illustration of Memristor as Synapse

A 'neural network' is a mathematical representation of the information processing that happens in biological systems. In computer science and engineering, these neural network based mathematical representations for organizing data have been widely used for pattern recognition applications. Here, a neuron in a neural network is a single unit whose output is a function of the weighted linear combination of its inputs, as shown in Figure 3.2. A synapse can then be defined as the connection between two neurons, with a weight associated with it. With respect to one neuron, there can be multiple neurons connected to it with corresponding weights, and these are known as synaptic weights. The highly dense, large scale connections between multiple neurons in different layers (visible, hidden and output layers) through a set of synaptic weights form the neural network, which can be designed to suit different pattern recognition applications. Thus, if this network is trained with a certain set of input data and output data, the synaptic weights will then correspond to the relationship between the two sets, and this is stored as information or memory. This information is learned and controlled between two neurons in a network. As seen in Chapter II, large scale neuromorphic cores based on neural networks are growing rapidly to provide direct on-chip processing capability to analyze big data for multiple pattern recognition and cognitive computing applications. However, devices such as the transistors used in these architectures have reached a limit beyond which they cannot be reduced in size, and this limits the capacity of the core for neural computation. This motivates the use of memristors as a building block for large scale neuromorphic and cognitive computing applications. Due to their extremely small size and low power consumption, memristor based circuits are ideal for neural network architectures.

Figure 3.3: Weight Distribution [14].

As shown in Figure 3.2, a pair of memristors, each with its own state, is considered as a synapse in a neural network. When the charge or flux through a memristor is controlled, its conductance changes, and it retains that state for a long time even when no potential is applied across it. This non-volatile nature was demonstrated by HP Labs, where the resistance/conductance of the device showed a decay of only 0.5% of its original state even after three years [53]. Thus the state of the memristor depends on voltage, conductance and time, and this enables the use of programmable memristive devices. A memristor can therefore be viewed as a device similar to a resistor/varactor whose resistance can not only be changed by applying a potential but is also retained at its last value when the potential is removed. This pertains to the notion of 'memory', or information storage, in memristors. Figure 3.3 explains the effective synaptic weight that can be determined using a pair of memristors. Here, one memristor is considered to have a 'positive-polarity' conductance σA+ and the other a 'negative-polarity' conductance σA−. The difference between the two conductances determines the synaptic value, in terms of the weight wA and its polarity. In other words, if σA+ is greater than σA−, there is a positive polarity impact on the synaptic weight, which can be interpreted as having the weight change in a specific direction. If σA+ is less than σA−, then there is a negative polarity impact which pulls the synaptic weight in the opposite direction.
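A short MATLAB sketch of this paired-memristor weighting follows; the conductance values and input voltages are assumed purely for illustration.

% Effective synaptic weight from a conductance pair (Figure 3.3)
sigmaApos = 1/125e3;          % assumed 'positive-polarity' conductance (S)
sigmaAneg = 1/1e6;            % assumed 'negative-polarity' conductance (S)
wA = sigmaApos - sigmaAneg;   % effective weight; its sign gives the polarity

% For a column of such pairs, the neuron sums the current contributions:
Gpos = [1/125e3; 1/500e3];    % assumed conductances of the 'positive' devices
Gneg = [1/1e6;   1/200e3];    % assumed conductances of the 'negative' devices
Vin  = [0.5; -0.5];           % voltages applied on the input rows
Iout = (Gpos - Gneg)' * Vin;  % net current ~ weighted sum of the inputs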

3.3 Memristor based Neuromorphic Crossbar

Memristors have the potential to act as memory devices within analog neural circuits [21]. Crossbar based memristor synaptic networks can potentially offer a connectivity and function density that makes them highly suitable for neuromorphic applications [54, 55, 56, 57, 58]. Also, due to their low power and extremely small size, memristors lend themselves to large scale fabrication of synapses in crossbar arrays, where a memristor exists between each pair of overlapping wires. As explained in the earlier section, two memristors are required to represent a single synaptic weight, allowing an input and its synaptic weight to have either a positive or a negative effect on the output of a neuron. Figure 3.4a shows a single memristor based neuron circuit; such circuits can be stacked together to form a large crossbar as in Figure 3.4b. Each output of a neuron corresponds to a function output. To read or write values into the memristors in these crossbars, the specific technique described in [13] is used.

3.3.1 Write Operation

The write operation is a 2-step process that writes the data into an entire row of memristors. When a voltage greater than the threshold voltage of the memristor is applied, data can be written into it in the form of its resistance. In other words, the memristor state changes from 'on' to 'off' or vice versa only when the applied voltage exceeds the memristor threshold voltage. The following are the steps involved in writing a binary data pattern to the crossbar array of memristors; a MATLAB sketch of the resulting voltage scheme is given after the list.

(1) Consider Vw as the write voltage applied to the crossbar for writing data into the memristors, and VMT as the memristor threshold voltage.

(a) Illustration of a memristor based neuron. (b) Horizontally stacked memristor based neurons forming a crossbar.

Figure 3.4: Memristor as a synapse and its neuro-morphic crossbar representation.

(2) Apply Vw/2 to the set of target rows (as in Figure 3.5a) and ground the other rows. Here, the target rows contain those memristors which require a state change based on the input data. Apply −Vw/2 to the set of target columns and Vw/2 to the other columns. The target memristors will then have an effective voltage of (Vw/2 − (−Vw/2)) = Vw across them, which induces a change in the conductance of the target memristors.

(3) Next, apply −Vw/2 to the target rows, keeping the columns in the same configuration as above. The effective voltage across the non-target memristors is now −Vw, which is below the negative threshold, and they are therefore set to the value 0.
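The two-phase voltage scheme above can be summarized in a small MATLAB sketch; the array size, write voltage and target masks are assumptions for illustration.

% Two-phase write voltages for an R x C crossbar (Figure 3.5a)
Vw = 8; R = 4; C = 4;                         % assumed write voltage and array size
targetRows = logical([1 0 0 0]);              % rows holding memristors to change
targetCols = logical([1 1 0 0]);              % columns holding memristors to change

% Phase 1: target rows at +Vw/2, others grounded; target columns at -Vw/2,
% other columns at +Vw/2. Only the target cells see the full Vw.
vRow = zeros(R,1); vRow(targetRows) = Vw/2;
vCol = (Vw/2)*ones(1,C); vCol(targetCols) = -Vw/2;
Veff1 = vRow - vCol;                          % Vw only at target row/column crossings

% Phase 2: target rows flipped to -Vw/2 with the columns unchanged,
% driving -Vw across the cells that must be set to 0.
vRow2 = zeros(R,1); vRow2(targetRows) = -Vw/2;
Veff2 = vRow2 - vCol;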

23 (a) Write operation (b) Read operation

Figure 3.5: Writing and reading into memristors in a crossbar.

3.3.2 Read Operation

In this mode, the target memristor states have to be read, and care must be taken not to overwrite their existing values. Therefore, a voltage less than the threshold voltage of the memristors is applied for reading. The steps are as follows (a sketch of the read computation is given after the list):

(1) Consider Vr as the read voltage applied to the crossbar and VMT as the threshold voltage of the memristor.

(2) Apply a voltage Vr < VMT to the selected row of the crossbar from which the data needs to be read, as shown in Figure 3.5b.

(3) The read-enable transistors at the target memristor columns are activated, and the memristor values can then be sensed.
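A corresponding MATLAB sketch of the read computation follows; the conductance matrix and read voltage are assumed for illustration.

% Reading one crossbar row with a sub-threshold voltage (Figure 3.5b)
Vr  = 1;                          % assumed read voltage, below VMT
G   = 1e-6*[9 1 5 2;              % assumed 4x4 conductance matrix (S)
            3 8 1 7;
            2 2 6 4;
            5 1 1 9];
row = 2;                          % row selected for reading
vRow = zeros(4,1); vRow(row) = Vr;
Icol = G' * vRow;                 % sensed column currents encode row 2's states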

CHAPTER IV

LARGE SCALE CIRCUIT SIMULATIONS

This chapter presents a way to perform large scale memristor-based circuit level simulations. Many researchers in the past have been able to simulate only 1000-3000 memristors at a time, using simulators that can only run on one processor core [14, 24]. To simulate memristor-based crossbars at a large scale, on the order of 8000 memristors, we need a parallel circuit simulator such as Xyce. The memristor crossbar design used in the analysis of this chapter is from [13]. This circuit is slightly modified to incorporate finite drivers and is tested with two different MOSFET configurations, namely the level-3 Xyce MOSFET and the low power Mosis MOSFET. Section 4.1 describes the need for and importance of parallel circuit simulators such as Xyce. Section 4.2 explains the working of the Xyce parallel circuit simulator provided and released by Sandia National Laboratories [22]. This section also explains the hardware and software requirements used to install Xyce. To make debugging and understanding of convergence issues easier, we opted to build Xyce from source code rather than use the executable version. Section 4.3 describes the memristor based neuromorphic crossbar with the two different driver models used in our analysis and its SPICE training circuit. The results obtained are explained in Section 4.4.

4.1 Need for Parallel Circuit Simulation

Large scale transistor level chip designs are massively complex and include many computationally intensive operations. Before such chips are fabricated, they have to be verified with circuit simulation tools using a comprehensive set of tests covering a variety of scenarios. These simulations are conducted on specific models, and only a few tests are applied to the full chip design, which increases the likelihood of bugs reaching silicon. Some of today's circuit simulation approaches can reduce the cost at full-chip scale, run within the same time budget and keep bugs out of silicon, but these analog style SPICE-accurate simulators are becoming a bottleneck because they fail to scale to large circuit sizes.

With the growth of technology and research, the number of transistors per square inch of an integrated circuit has been increasing exponentially, and time consumption is becoming the major hurdle in simulating such huge transistor level circuits. Today's circuit simulation tools, such as the traditional Berkeley SPICE, HSpice and LTSpice [59], do not have the capability of using modern parallel computing resources. Therefore, they are incapable of quickly simulating circuits beyond thousands of unknowns.

So, with the spread of multi-core CPUs, GPUs and inexpensive cluster computing systems, a need arose for parallelism [60, 61] in the world of circuit simulation. Several approaches [62, 63, 64, 65] have been developed to provide parallel circuit simulation. Some of them used circuit level partitioning methods with approximations that break down in certain circumstances.

Originally, a full chip design is flattened into a netlist that is run on SPICE. At the core of any circuit simulator is a linear system solver which operates on a set of differential algebraic equations (DAEs) describing the circuit. While the system is sparse (circuit elements can only be connected to so many neighbors), run-time quickly explodes with the size of the circuit. To avoid performing flat SPICE simulation, many modern simulators make hierarchical optimizations, first simulating smaller sub-circuits and emitting a macro-model for each sub-circuit, which is then used at the next level of the hierarchy.

A hierarchical simulator known as HSIM, developed by Synopsys, can simulate full chip designs by simulating the sub-modules and avoiding flattening the netlist of the circuit. This simulator re-uses solutions obtained from sub-circuits and applies them to similar modules of the same circuit on the chip. Although HSIM reduces the total problem size and is multi-threaded, it still runs on a single machine only.

A non-volatile memory (NVM) simulator [66] was also developed, in which NVM devices such as the memristor, phase-change memory and the spin-transfer torque magnetic tunneling junction (STT-MTJ) are described with their equivalent circuits. Although this SPICE simulator is capable of simulating memristor arrays up to a size of 32x32 with improved simulation times, it fails to handle the step functions, integral functions and if statements in the memristor model sub-circuit definitions published in [67, 48, 24, 11]. Therefore, the NVM Spice memristor model can only be referred to as a basic model.

4.2 Xyce

Xyce was initially developed in 2011 by a team at Sandia National Laboratories [22]. It is a parallel, high-performance analog electronic circuit simulator which has the capability of simulating and solving extremely large circuits using large-scale parallel computing resources such as multi-core processing machines and cluster computers. This simulator is unique in providing the capability to simulate circuits with millions of devices in parallel. The team at Sandia designed Xyce from the ground up in C++, and it is SPICE-compatible. Parallel simulation in Xyce is mainly achieved by using the Trilinos scientific library and the Message Passing Interface (MPI) to manage parallel workloads and threading. MPI is a specification for a standard message passing library, initially defined by a group of parallel computing vendors and specialists. It is a language independent communications protocol used for programming parallel computers, and it provides high performance, scalability and portability [68].

Unlike traditional SPICE, which uses a direct solver, Xyce is designed to use iterative solvers. Solvers are the part of the mathematical software that solves the underlying mathematical problem in simulator tools, and their effectiveness is directly related to the efficiency of the simulator. As seen before, parallel simulators require the integration of large and small scale parallelism throughout the circuit simulation. This is achieved by the iterative solvers, enabling Xyce to simulate large circuits in parallel. Xyce is reported to scale to on the order of 100s of computing nodes.

4.2.1 Parallel Simulation using Xyce

The Xyce electronic circuit simulator depends heavily on MPI for parallel simulation of the circuit. To run a circuit with the Xyce parallel simulator, an optimal number of cores must be selected so that the overhead from communication between processors is minimized. This optimum value is found by comparing the simulation time (which also includes the inter-processor communication) as the number of processor cores is increased.

Parallel circuit simulation in Xyce is provided in two stages: device evaluation and the solver. Device evaluation is the process of evaluating all the devices in the circuit into equations to compute the residual vector and Jacobian matrices. Once the devices are distributed among several processors, the device evaluation process becomes faster compared to evaluating them on a single processor. Direct linear solvers work well enough when the system has tens to thousands of unknowns but fail beyond that. To allow the simulation of circuits beyond thousands of unknowns, an iterative solver is used in Xyce.

Xyce provides three modes of operation depending on the size of the circuit: serial load serial solve, parallel load serial solve, and parallel load parallel solve.

1. Serial load serial solve: In this mode of operation the system does not use MPI; as the name implies, both the device load and the solve are serial. It uses only one processor to evaluate the circuit devices and solves the linear system on that processor. This mode is best suited to systems with hundreds of unknowns in the circuit.

2. Parallel load serial solve: In this mode of operation, the system uses MPI to load the circuit in parallel for device evaluation. The residual and Jacobian loads are distributed across the processors, and the resulting matrix problem is solved in serial on one processor with a direct solver. This is the best option when the problem is small enough for direct solvers to manage; usually circuits with 10³ to 10⁴ unknowns can be simulated using parallel load serial solve. When a parallel build is run with multiple processors, Xyce will automatically use this option and solve with the KLU direct solver. Xyce also provides other solver packages that can be used, such as AztecOO.

3. Parallel load parallel solve: In this mode of operation, the system uses MPI not only for device evaluation but also to solve the linear system. The residual and Jacobian loads are distributed across processors, and the resulting matrix problem is solved in parallel using iterative solvers. This approach is taken to simulate very large circuits with 10⁵ or more unknowns, which are too large to solve using direct solvers on a single processor. If a very large circuit is fed to Xyce with multiple processors available, Xyce automatically uses this mode to solve the circuit (see the usage sketch after this list).
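As a usage sketch of these modes (the netlist name is hypothetical; the invocation follows the standard Xyce/MPI launcher pattern):

$ Xyce crossbar.cir
(serial load, serial solve on a single processor)

$ mpirun -np 4 Xyce crossbar.cir
(parallel build: Xyce selects parallel load serial solve or parallel load parallel solve based on the problem size)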

29 4.2.2 Installation of Xyce [1]

Hardware used

The building guide of Xyce mentions the platforms on which Xyce can be built [1], one of which is Red Hat Enterprise Linux. Therefore, we used CentOS 6 (Community Enterprise Operating System), a free Enterprise-class Linux distribution that is functionally compatible with Red Hat Enterprise Linux (RHEL). Xyce can be installed in two ways: from source code or using an executable. To take full advantage of Xyce, we built it from its source code. Note that the executable currently cannot be used on RHEL 7 (CentOS 7) systems, because the dynamically linked executables depend heavily on system shared libraries, and RHEL binaries are not portable to the next release.

To take advantage of the multi-core parallel processing capability of this parallel circuit simulator, we first tried to use the Oakley Cluster at the Ohio Supercomputer Center, which contains 8300+ cores of HP Intel Xeon machines. Although we were able to build Xyce on an OSC cluster node, it was still unable to use multiple nodes due to software issues; this issue is still being resolved with the help of the OSC and Xyce teams. Therefore, we built our own 4-core Linux cluster with two HP Intel machines, where each machine contains two cores and the processors communicate through the Message Passing Interface.

Software Requirements

Building Xyce Parallel from source requires several libraries. The libraries and versions that were used to build Xyce Parallel are as follows:

• BLAS - the Basic Linear Algebra Subprograms package, usually available with the CentOS 6 system compilers.

• LAPACK - the Linear Algebra PACKage, usually available with the CentOS 6 system compilers.

• Bison - required by Xyce for its chemical reaction parser; Bison 3.0 was used.

• Flex - also required by Xyce for its chemical reaction parser; version 2.5.34 was used.

• CMake - a tool to build software packages and control the software compilation process using compiler-independent configuration files, from which it generates makefiles and workspaces for the compiler environment. CMake 2.8 was used to build Xyce Parallel.

• FFTW - a C subroutine library for computing the discrete Fourier transform (DFT), required by Xyce for harmonic balance analysis. Xyce can use either the Intel Math Kernel Library (as provided on the Oakley Cluster at the Ohio Supercomputer Center) or FFTW3. FFTW3 was used to build Xyce on our own cluster, whereas the Intel Math Kernel Library was used on the Oakley Cluster.

• UMFPACK - a set of routines for solving sparse linear systems of the form Ax = b, usually available with the SuiteSparse package. We built UMFPACK separately with AMD; versions UMFPACK-5.2.0 and UFconfig-3.1.0 were used.

• AMD - also part of the SuiteSparse package, but we built it separately; version AMD-2.2.0 was used.

• gcc/g++/gfortran - the GNU C, C++ and Fortran compilers.

• OpenMPI - an open-source Message Passing Interface implementation; the openmpi-x86_64 version was used.

• ParMETIS - used by Xyce for graph partitioning; ParMetis 3.4 was used.

• Trilinos - a collection of open-source software packages/libraries for scientific applications. Xyce requires Trilinos for linear algebra and solver services. When building Xyce Parallel, Trilinos must be built with MPI enabled. We used version Trilinos-11.4.3.

Steps for Installing Xyce

1. Installing UMFPACK and AMD or SUITESPARSE package

i. Download the packages from www.cise.ufl.edu into /home/me/src/tar using the following commands:

$ wget http://www.cise.ufl.edu/research/sparse/umfpack/UMFPACK-5.2.0.tar.gz
$ wget http://www.cise.ufl.edu/research/sparse/UFconfig/UFconfig-3.1.0.tar.gz
$ wget http://www.cise.ufl.edu/research/sparse/amd/AMD-2.2.0.tar.gz

ii. Check that they downloaded to the appropriate locations and unpack them into /home/me:

$ tar zxvf UMFPACK-5.2.0.tar.gz -C /home/me
$ tar zxvf UFconfig-3.1.0.tar.gz -C /home/me
$ tar zxvf AMD-2.2.0.tar.gz -C /home/me

iii. Open the folder UFconfig-3.1.0 containing the file UFconfig.mk and make the following changes (note: this must be compiled with BLAS):

CC = gcc
CFLAGS = -O3

The following two lines should not be commented:

BLAS = -lblas -lgfortran -lgfortranbegin
LAPACK = -llapack

Check the line #UMFPACK_CONFIG = -DNBLAS; this line should be disabled, i.e., commented out.

iv. From the terminal, open the UMFPACK folder and build it. The Makefile in this directory is read and executed.

$ cd UMFPACK
$ make

v. Once the build finishes, the AMD and UMFPACK libraries can be found in AMD/Lib and UMFPACK/Lib respectively.

vi. Copy libamd.a from AMD/Lib and libumfpack.a from UMFPACK/Lib into the folder XyceLib/Serial (if you are building Xyce Serial) or XyceLib/Parallel (if you are building Xyce Parallel).

$ cd ~/AMD/Lib
$ cp libamd.a ~/XyceLib/Serial
$ cp libamd.a ~/XyceLib/Parallel
$ cd ~/UMFPACK/Lib
$ cp libumfpack.a ~/XyceLib/Serial
$ cp libumfpack.a ~/XyceLib/Parallel

2. Installing ParMetis package

i. Download ParMetis from http://glaros.dtc.umn.edu/gkhome/metis/parmetis/download into /home/me/src/tar.

ii. Unpack the tar file into /home/me:

$ tar zxvf parmetis.tar.gz -C /home/me

iii. Make sure the OpenMPI package is installed on the system. To check whether it is in the path, run:

echo $LD_LIBRARY_PATH

If OpenMPI is not shown in the path, then run:

module load openmpi-x86_64

If OpenMPI is installed but still not shown in the path, export the path manually:

export LD_LIBRARY_PATH=/usr/lib64/openmpi:$LD_LIBRARY_PATH
export PATH=/usr/local/openmpi/bin:$PATH

The module load (or manual export) sets up your environment to use OpenMPI and must be executed in every shell from which you want to run MPI commands or programs. All it does is set your PATH and LD_LIBRARY_PATH variables so that MPI works.

iv. cd into the Metis folder located inside the ParMetis folder and edit metis.h with your system type.

v. To configure ParMetis, cd into the ParMetis folder and run:

$ cd ~/parmetis
For building Xyce Serial:
$ make config CC=mpicc CXX=mpicxx --prefix=/home/me/XyceLibs/Serial
For building Xyce Parallel:
$ make config CC=mpicc CXX=mpicxx --prefix=/home/me/XyceLibs/Parallel

vi. To configure Metis, cd into the Metis folder and run:

$ cd /home/me/src/parmetis/metis
For building Xyce Serial:
$ make config CC=mpicc CXX=mpicxx --prefix=/home/me/XyceLibs/Serial
For building Xyce Parallel:
$ make config CC=mpicc CXX=mpicxx --prefix=/home/me/XyceLibs/Parallel

3. Installing Trilinos package

i. Download the Trilinos package from http://trilinos.sandia.gov/download/trilinos-11.4.html into /home/me/src/tar and unpack it into /home/me/src.

ii. Create the following directory in /home/me:

$ mkdir -p Trilinos/TrilinosParallel

iii. Write the following shell script for Trilinos into the file trilinos_parallel.sh:

#!/bin/bash
SRCDIR=/home/me/trilinos-11.4.3-Source
ARCHDIR=/home/me/XyceLibs/Parallel
FLAGS="-O3 -fPIC"
cmake \
-G "Unix Makefiles" \
-DCMAKE_C_COMPILER=mpicc \
-DCMAKE_CXX_COMPILER=mpic++ \
-DCMAKE_Fortran_COMPILER=mpif77 \
-DCMAKE_CXX_FLAGS="$FLAGS" \
-DCMAKE_C_FLAGS="$FLAGS" \
-DCMAKE_Fortran_FLAGS="$FLAGS" \
-DCMAKE_INSTALL_PREFIX=$ARCHDIR \
-DCMAKE_MAKE_PROGRAM="make" \
-DTrilinos_ENABLE_NOX=ON \
-DTrilinos_ENABLE_LOCA=ON \
-DTrilinos_ENABLE_EpetraExt=ON \
-DEpetraExt_BUILD_BTF=ON \
-DEpetraExt_BUILD_EXPERIMENTAL=ON \
-DEpetraExt_BUILD_GRAPH_REORDERINGS=ON \
-DTrilinos_ENABLE_TrilinosCouplings=ON \
-DTrilinos_ENABLE_Ifpack=ON \
-DTrilinos_ENABLE_ShyLU=ON \
-DTrilinos_ENABLE_Isorropia=ON \
-DTrilinos_ENABLE_AztecOO=ON \
-DTrilinos_ENABLE_Belos=ON \
-DTrilinos_ENABLE_Teuchos=ON \
-DTrilinos_ENABLE_Amesos=ON \
-DAmesos_ENABLE_KLU=ON \
-DAmesos_ENABLE_UMFPACK=ON \
-DTrilinos_ENABLE_Sacado=ON \
-DTrilinos_ENABLE_Zoltan=ON \
-DTrilinos_ENABLE_ALL_OPTIONAL_PACKAGES=OFF \
-DTPL_ENABLE_AMD=ON \
-DAMD_LIBRARY_DIRS="/usr/AMD/Lib" \
-DTPL_AMD_INCLUDE_DIRS="/usr/include/suitesparse" \
-DTPL_ENABLE_UMFPACK=ON \
-DUMFPACK_LIBRARY_DIRS="/usr/UMFPACK/Lib" \
-DTPL_UMFPACK_INCLUDE_DIRS="/usr/include/suitesparse" \
-DTPL_ENABLE_BLAS=ON \
-DTPL_ENABLE_LAPACK=ON \
-DTPL_ENABLE_ParMETIS=ON \
-DParMETIS_LIBRARY_DIRS="/home/me/XyceLibs/Parallel/lib" \
-DParMETIS_INCLUDE_DIRS="/home/me/XyceLibs/Parallel/include" \
-DTPL_ENABLE_MPI=ON \
-DTPL_MPI_LIBRARIES="" \
$SRCDIR

iv. Copy the trilinos_parallel.sh script into the TrilinosParallel directory.

$ cp trilinos_parallel.sh ˜/Trilinos/TrilinosParallel

v. Build the Trilinos package:

$ cd Trilinos
$ ./TrilinosParallel/trilinos_parallel.sh

Once it configures without any errors, install the package as follows:

$ make prefix=/home/me/XyceLibs/Parallel
$ make install

4. Installing Xyce package

i. Make a new build directory: $ mkdir XyceParallelBuild

ii. Open the directory XyceParallelBuild and configure Xyce.

/path/to/Xyce/configure \
CXXFLAGS="-O3" \
LDFLAGS="-L$HOME/XyceLibs/Parallel/lib" \
CPPFLAGS="-I/usr/include/suitesparse -I$HOME/XyceLibs/Parallel/include" \
--enable-mpi \
CXX=mpiCC \
CC=mpicc \
F77=mpif77

iii. Once the configure script completes, run make and then make install.

Installing Xyce on Ohio Supercomputer Center Clusters

Installing Xyce Parallel on the Ohio Supercomputer Center (OSC) machines differs mainly in how the Trilinos and Xyce Parallel packages are built, because the OSC cluster's Intel MKL libraries and other environment variables must be used. The Trilinos build script on the OSC cluster is as follows:

#!/bin/bash
SRCDIR=$HOME/trilinos-11.6.1-Source
ARCHDIR=$HOME/Xycelibs_intel/Parallel
FLAGS="-O3 -fPIC"
cmake \
-G "Unix Makefiles" \
-DCMAKE_C_COMPILER=mpicc \
-DCMAKE_CXX_COMPILER=mpicxx \
-DCMAKE_Fortran_COMPILER=mpif90 \
-DCMAKE_CXX_FLAGS="$MKL_CFLAGS" \
-DCMAKE_C_FLAGS="$MKL_CFLAGS" \
-DCMAKE_Fortran_FLAGS="$MKL_FFLAGS" \
-DCMAKE_INSTALL_PREFIX=$ARCHDIR \
-DCMAKE_MAKE_PROGRAM="make" \
-DTrilinos_ENABLE_NOX=ON \
-DTrilinos_ENABLE_LOCA=ON \
-DTrilinos_ENABLE_EpetraExt=ON \
-DEpetraExt_BUILD_BTF=ON \
-DEpetraExt_BUILD_EXPERIMENTAL=ON \
-DEpetraExt_BUILD_GRAPH_REORDERINGS=ON \
-DTrilinos_ENABLE_TrilinosCouplings=ON \
-DTrilinos_ENABLE_Ifpack=ON \
-DTrilinos_ENABLE_ShyLU=ON \
-DTrilinos_ENABLE_Isorropia=ON \
-DTrilinos_ENABLE_AztecOO=ON \
-DTrilinos_ENABLE_Belos=ON \
-DTrilinos_ENABLE_Teuchos=ON \
-DTrilinos_ENABLE_Amesos=ON \
-DAmesos_ENABLE_KLU=ON \
-DAmesos_ENABLE_UMFPACK=ON \
-DTrilinos_ENABLE_Sacado=ON \
-DTrilinos_ENABLE_Zoltan=ON \
-DTrilinos_ENABLE_ALL_OPTIONAL_PACKAGES=OFF \
-DTPL_ENABLE_AMD=ON \
-DAMD_LIBRARY_DIRS="$HOME/Xycelibs_intel/Parallel/lib" \
-DTPL_AMD_INCLUDE_DIRS="$HOME/Xycelibs_intel/Parallel/include" \
-DTPL_ENABLE_UMFPACK=ON \
-DUMFPACK_LIBRARY_DIRS="$HOME/Xycelibs_intel/Parallel/lib" \
-DTPL_UMFPACK_INCLUDE_DIRS="$HOME/Xycelibs_intel/Parallel/include" \
-DTPL_ENABLE_BLAS=ON \
-DBLAS_LIBRARY_DIRS="/usr/local/intel/composer_xe_2011_sp1.6.233/mkl/lib/intel64" \
-DBLAS_LIBRARY_NAMES:STRING="mkl_blas95_lp64" \
-DTPL_ENABLE_LAPACK=ON \
-DLAPACK_LIBRARY_DIRS="/usr/local/intel/composer_xe_2011_sp1.6.233/mkl/lib/intel64" \
-DLAPACK_LIBRARY_NAMES:STRING="mkl_lapack95_lp64" \
-DTPL_ENABLE_ParMETIS=ON \
-DParMETIS_LIBRARY_DIRS="$HOME/Xycelibs_intel/Parallel/lib" \
-DParMETIS_INCLUDE_DIRS="$HOME/Xycelibs_intel/Parallel/include" \
-DTPL_ENABLE_MPI=ON \
-DTPL_MPI_LIBRARIES="" \
$SRCDIR

The Xyce configuration script is as follows:

./configure --prefix=$HOME/xyce \
CXXFLAGS="-O3" \
LDFLAGS="-L$HOME/Xycelibs_intel/Parallel/lib -L/usr/local/fftw3/3.3-intel/lib -L/usr/local/intel/composer_xe_2011_sp1.6.233/mkl/lib/intel64 -lmkl_intel_ilp64 -lmkl_intel_thread -lmkl_core -liomp5 -lm" \
CPPFLAGS="-I$HOME/Xycelibs_intel/Parallel/include -I/usr/local/fftw3/3.3-intel/include -I/usr/local/intel/composer_xe_2011_sp1.6.233/mkl/include" \
--enable-mpi \
CXX=mpicxx \
CC=mpicc \
F77=mpif77

Testing Xyce

To test whether Xyce Parallel is installed correctly, the test suite provided by the Xyce team, the Xyce regression suite, is used. The steps to execute it are as follows:

i. Download the Xyce regression suite from https://xyce.sandia.gov/downloads/source_code.html and unpack it.

ii. Run the following code from $HOME/Build_Xyce_Parallel

$HOME/Xyce_Regression-6.1/TestScripts/run_xyce_regression \
--output=`pwd`/Xyce_Test --xyce_test="$HOME/Xyce_Regression-6.1" \
--taglist="+parallel+nightly?noverbose-klu-verbose-fft" \
--resultfile=`pwd`/parallel_results \
"mpirun -np 2 `pwd`/src/Xyce"

If all tests pass, Xyce Parallel is correctly installed.

4.3 Memristor based Neuromorphic Circuit for SPICE Simulation

In Chapter III, it is explained how a pair of memristors can represent a synaptic weight in its effective crossbar representation. Extending that analysis, Figure 4.1 shows how the voltage for each column of a memristor synapse is obtained. As explained in Chapter III, each pair of memristors has a positive and a negative effect on its total synaptic weight. The effective current (I_sum^+, I_sum^-) through each column of the neuron, as given in Figure 4.1, is the sum of the currents through all memristors in that column. The voltage at each column end, i.e. (V_+, V_-), is sensed and applied as the input to a comparator, which then outputs a binary 1 or 0 depending on its input voltages. In Figure 4.1, let σ_A^+ and σ_A^- be the conductances of the memristor pair at input A. The currents through them are given in Equations 4.1 and 4.2. The total current at the end of every column can then be computed as in Equation 4.3, and its respective voltage is given in Equation 4.4. The final output of the comparator depends on the difference between the two voltages V_+ and V_- of a neuron, as given in Equations 4.5 and 4.6.

I_A^+ = A σ_A^+    (4.1)

I_A^- = A σ_A^-    (4.2)

I_sum^+ = A σ_A^+ + B σ_B^+ + C σ_C^+ + β σ_β^+    (4.3)

V_+ = I_sum^+ Ω_β^+    (4.4)

V_diff = V_+ - V_-    (4.5)

V_O = 1 if V_diff > V_T, 0 if V_diff < V_T    (4.6)

Figure 4.1: Voltages at each column of a memristor synapse.
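The sketch below is a minimal numeric illustration of Equations 4.1-4.6 for one neuron (one column pair); the conductance values, the sensing resistance, and the threshold V_T = 0 are hypothetical placeholders, not values from the thesis circuit.

% Minimal MATLAB sketch of Equations 4.1-4.6 with placeholder values.
A = 1; B = 0; C = 1; beta = 1;             % binary inputs plus bias
x = [A; B; C; beta];
sigmaPos = [2e-4; 1e-4; 3e-4; 1.5e-4];     % sigma^+ of each memristor (assumed)
sigmaNeg = [1e-4; 2e-4; 1e-4; 2.5e-4];     % sigma^- of each memristor (assumed)
IsumPos = sum(x .* sigmaPos);              % Equation 4.3
IsumNeg = sum(x .* sigmaNeg);
Rsense  = 1e3;                             % sensing resistance (assumed)
Vplus  = IsumPos * Rsense;                 % Equation 4.4
Vminus = IsumNeg * Rsense;
Vdiff  = Vplus - Vminus;                   % Equation 4.5
VO = double(Vdiff > 0);                    % Equation 4.6, assuming VT = 0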

Using this memristor based synaptic representation, the crossbar circuit for neuromorphic applications is designed as described in Chapter III. For a more realistic representation, this circuit is extended to incorporate finite drivers.

Figure 4.2: Complementary CMOS based finite driver circuit

4.3.1 Finite drivers

Finite driver circuits allow only a finite amount of current to be drawn and prevent large currents from being supplied to the crossbar. For our analysis we used MOSFET-based CMOS circuits. CMOS circuits are widely known to be best suited for very- or ultra-large-scale integration, and MOSFETs are most commonly used in them. The CMOS advantage is that the output of a CMOS inverter can swing as high as the power supply voltage and as low as ground. This large voltage swing and the steep transition between logic levels yield large operating margins and therefore a high circuit yield. The driver circuit chosen for our analysis consists of complementary CMOS stages with a pMOSFET and an nMOSFET in each, as shown in Figure 4.2. It consists of two PMOS and two NMOS transistors, where a PMOS and an NMOS are connected at their drains and their output drives the second PMOS/NMOS pair. The layout of these drivers in the memristor based neuromorphic circuit can be seen in Figure 4.3. Incorporating these finite drivers makes memristor based crossbars more realistic, and simulations performed on them provide a more accurate analysis of the actual circuit.
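To make the driver topology concrete, the sketch below shows how the four transistors of Figure 4.2 could be emitted into a netlist from MATLAB; the node names and W/L values are hypothetical, while the model names cmosp and cmosn match the MOSIS model cards listed later in this section.

% Minimal sketch: write the two-inverter complementary CMOS driver of
% Figure 4.2 into a netlist file. Node names and sizes are assumptions.
fid = fopen('driver.cir', 'w');
fprintf(fid, 'Mp1 mid in  vdd vdd cmosp L=0.18u W=0.72u\n'); % stage 1 pull-up
fprintf(fid, 'Mn1 mid in  0   0   cmosn L=0.18u W=0.36u\n'); % stage 1 pull-down
fprintf(fid, 'Mp2 out mid vdd vdd cmosp L=0.18u W=0.72u\n'); % stage 2 pull-up
fprintf(fid, 'Mn2 out mid 0   0   cmosn L=0.18u W=0.36u\n'); % stage 2 pull-down
fclose(fid);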

To test the proposed finite driver model, we first used the level-3 Xyce MOSFET device for the ideal simulation. This device is one of the basic MOSFETs available in Xyce and was initially utilized for expedited analysis of the memristor-based crossbar with the finite driver.

Figure 4.3: Memristor based neuromorphic crossbar

We also used a more realistic MOSFET model that correlates more closely with actual wafer characteristics. For this, the PMOS and NMOS device models were taken from MOSIS data, which provide the wafer device characteristics, and are shown below. The MOSIS PMOS and NMOS models are for a 0.18 micron feature size and therefore operate at low power and avoid passing high currents into the crossbar circuit. The MOSIS PMOS and NMOS device models are as follows:

.MODEL cmosn NMOS LEVEL = 49
+VERSION = 3.1 TNOM = 27 TOX = 4.1E-9
+XJ = 1E-7 NCH = 2.3549E17 VTH0 = 0.3694303
+K1 = 0.5789116 K2 = 1.110723E-3 K3 = 1E-3
+K3B = 0.0297124 W0 = 1E-7 NLX = 2.037748E-7
+DVT0W = 0 DVT1W = 0 DVT2W = 0
+DVT0 = 1.2953626 DVT1 = 0.3421545 DVT2 = 0.0395588
+U0 = 293.1687573 UA = -1.21942E-9 UB = 2.325738E-18
+UC = 7.061289E-11 VSAT = 1.676164E5 A0 = 2
+AGS = 0.4764546 B0 = 1.617101E-7 B1 = 5E-6
+KETA = -0.0138552 A1 = 1.09168E-3 A2 = 0.3303025
+RDSW = 105.6133217 PRWG = 0.5 PRWB = -0.2
+WR = 1 WINT = 2.885735E-9 LINT = 1.715622E-8
+XL = 0 XW = -1E-8 DWG = 2.754317E-9
+DWB = -3.690793E-9 VOFF = -0.0948017 NFACTOR = 2.186006
+CIT = 0 CDSC = 2.4E-4 CDSCD = 0
+CDSCB = 0 ETA0 = 2.665034E-3 ETAB = 6.028975E-5
+DSUB = 0.0442223 PCLM = 1.746064 PDIBLC1 = 0.3258185
+PDIBLC2 = 2.701992E-3 PDIBLCB = -0.1 DROUT = 0.9787232
+PSCBE1 = 4.494778E10 PSCBE2 = 3.672074E-8 PVAG = 0.0122755
+DELTA = 0.01 RSH = 7 MOBMOD = 1
+PRT = 0 UTE = -1.5 KT1 = -0.11
+KT1L = 0 KT2 = 0.022 UA1 = 4.31E-9
+UB1 = -7.61E-18 UC1 = -5.6E-11 AT = 3.3E4
+WL = 0 WLN = 1 WW = 0
+WWN = 1 WWL = 0 LL = 0
+LLN = 1 LW = 0 LWN = 1
+LWL = 0 CAPMOD = 2 XPART = 0.5
+CGDO = 8.58E-10 CGSO = 8.58E-10 CGBO = 1E-12
+CJ = 9.471097E-4 PB = 0.8 MJ = 0.3726161
+CJSW = 1.905901E-10 PBSW = 0.8 MJSW = 0.1369758
+CJSWG = 3.3E-10 PBSWG = 0.8 MJSWG = 0.1369758
+CF = 0 PVTH0 = -5.105777E-3 PRDSW = -1.1011726
+PK2 = 2.247806E-3 WKETA = -5.071892E-3 LKETA = 5.324922E-4
+PU0 = -4.0206081 PUA = -4.48232E-11 PUB = 5.018589E-24
+PVSAT = 2E3 PETA0 = 1E-4 PKETA = -2.090695E-3

.MODEL cmosp PMOS LEVEL = 49
+VERSION = 3.1 TNOM = 27 TOX = 4.1E-9
+XJ = 1E-7 NCH = 4.1589E17 VTH0 = -0.3823437
+K1 = 0.5722049 K2 = 0.0219717 K3 = 0.1576753
+K3B = 4.2763642 W0 = 1E-6 NLX = 1.104212E-7
+DVT0W = 0 DVT1W = 0 DVT2W = 0
+DVT0 = 0.6234839 DVT1 = 0.2479255 DVT2 = 0.1
+U0 = 109.4682454 UA = 1.31646E-9 UB = 1E-21
+UC = -1E-10 VSAT = 1.054892E5 A0 = 1.5796859
+AGS = 0.3115024 B0 = 4.729297E-7 B1 = 1.446715E-6
+KETA = 0.0298609 A1 = 0.3886886 A2 = 0.4010376
+RDSW = 199.1594405 PRWG = 0.5 PRWB = -0.4947034
+WR = 1 WINT = 0 LINT = 2.93948E-8
+XL = 0 XW = -1E-8 DWG = -1.998034E-8
+DWB = -2.481453E-9 VOFF = -0.0935653 NFACTOR = 2
+CIT = 0 CDSC = 2.4E-4 CDSCD = 0
+CDSCB = 0 ETA0 = 3.515392E-4 ETAB = -4.804338E-4
+DSUB = 1.215087E-5 PCLM = 0.96422 PDIBLC1 = 3.026627E-3
+PDIBLC2 = -1E-5 PDIBLCB = -1E-3 DROUT = 1.117016E-4
+PSCBE1 = 7.999986E10 PSCBE2 = 8.271897E-10 PVAG = 0.0190118
+DELTA = 0.01 RSH = 8.1 MOBMOD = 1
+PRT = 0 UTE = -1.5 KT1 = -0.11
+KT1L = 0 KT2 = 0.022 UA1 = 4.31E-9
+UB1 = -7.61E-18 UC1 = -5.6E-11 AT = 3.3E4
+WL = 0 WLN = 1 WW = 0
+WWN = 1 WWL = 0 LL = 0
+LLN = 1 LW = 0 LWN = 1
+LWL = 0 CAPMOD = 2 XPART = 0.5
+CGDO = 7.82E-10 CGSO = 7.82E-10 CGBO = 1E-12
+CJ = 1.214428E-3 PB = 0.8461606 MJ = 0.4192076
+CJSW = 2.165642E-10 PBSW = 0.8 MJSW = 0.3202874
+CJSWG = 4.22E-10 PBSWG = 0.8 MJSWG = 0.3202874
+CF = 0 PVTH0 = 5.167913E-4 PRDSW = 9.5068821
+PK2 = 1.095907E-3 WKETA = 0.0133232 LKETA = -3.648003E-3
+PU0 = -1.0674346 PUA = -4.30826E-11 PUB = 1E-21
+PVSAT = 50 PETA0 = 1E-4 PKETA = -1.822724E-3

4.3.2 Training of SPICE Circuit

Figure 4.4: Training memristor based crossbar in Xyce.

The circuit in Figure 4.3 is trained on the two-input linearly separable logic functions, which comprise 14 output functions. The circuit is built in MATLAB and simulated using the parallel circuit simulator Xyce. The neuron outputs obtained from the Xyce simulation are then analyzed in MATLAB to obtain the updated weights. Figure 4.4 illustrates the technique used to call Xyce and train the finite driver memristor crossbar to learn the logic functions. Since Xyce can only be run from the terminal in Linux, bash must be invoked to execute the netlist. Therefore, a bash script was written that is called from MATLAB to execute the netlist in Xyce. The script also sets up a few environment variables, such as those for OpenMPI and Xyce Parallel, that are required for the circuit simulation. The script is shown below:

#!/bin/bash
# callXyce: for running Xyce Parallel from MATLAB
export PATH=/usr/lib64/openmpi/bin:$PATH
export LD_LIBRARY_PATH=/usr/lib64/openmpi/lib:$LD_LIBRARY_PATH
export PATH=/usr/local/Xyce-Release-6.1.0-OPENMPI-OPENSOURCE/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/Xyce-Release-6.1.0-OPENMPI-OPENSOURCE/lib:$LD_LIBRARY_PATH
mpirun -np $1 /usr/local/Xyce-Release-6.1.0-OPENMPI-OPENSOURCE/bin/Xyce -l $2 $3
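From the MATLAB side, the bash script can then be invoked with a system call, as in the minimal sketch below (the variable values are hypothetical):

% Minimal sketch of calling the wrapper script above from MATLAB.
nProcs  = 2;                               % argument $1: MPI processes
logFile = 'xyce_run.log';                  % argument $2: Xyce log file
netlist = 'crossbar.cir';                  % argument $3: circuit netlist
[status, output] = system(sprintf('bash callXyce.sh %d %s %s', ...
    nProcs, logFile, netlist));            % status == 0 on a successful run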

4.4 Results

In this section, we evaluate the circuit-level memristor-based crossbar simulation in Xyce in terms of the total training time with respect to the number of function outputs and the number of cores. We examine three main analyses:

(1) The effect of the number of cores on the total simulation training time, thereby obtaining the optimal number of cores for a particular crossbar configuration.

(2) The effect of the type of finite driver, specifically a comparison between the level-3 MOSFET device in Xyce and the Mosis-180nm MOSFET device.

(3) The effect of increasing the number of logic inputs from 2 to 3.

4.4.1 System Setup

Inputs | Outputs
A B    | F1 F2 F3 F4 F5 F6 F7 F8 F9 F10 F11 F12 F13 F14
0 0    | 0  0  0  0  0  0  0  1  1  1   1   1   1   1
0 1    | 0  0  0  0  1  1  1  0  0  0   1   1   1   1
1 0    | 0  0  1  1  0  0  1  0  1  1   0   0   1   1
1 1    | 0  1  0  1  0  1  1  0  0  1   0   1   0   1

Table 4.1: 2-input linearly separable logic functions.

Here, the memristor-based crossbar configuration given in Figure 4.3 is used for our analysis.

This system is trained and tested on the computer cluster described in the hardware section of the Xyce installation. In parallel simulation, the cores communicate through the Message Passing Interface (MPI). For training this neuromorphic circuit, the weights are initialized randomly between (0.0010, 0.0011). We train 2-bit-input and 3-bit-input logic functions using finite drivers. The linearly separable 2-input logic functions are listed in Table 4.1, where each output function F1-F14 corresponds to a specific condition on the two inputs A and B. We exclude the output functions corresponding to the XOR and XNOR operations since they are not linearly separable. Similarly, a 3-input logic function has 8 different input combinations, each of which corresponds to a certain output; the total number of output functions for a 3-bit input is 104. The number of rows in the memristor crossbar corresponds to the number of inputs, and the number of columns corresponds to twice the number of output functions. Therefore, for a 2-bit input with bias, the memristor-crossbar configuration consists of 4 rows (2 inputs and 2 biases) and 2 · 14 = 28 columns; for a 3-bit input with bias, it consists of 5 rows (3 inputs and 2 biases) and 2 · 104 = 208 columns. For every input pattern, two biases of value 1 and 0 are appended. The sizing arithmetic is illustrated in the sketch below.
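% Crossbar sizing: rows = inputs + 2 biases; columns = 2 per output
% function (one memristor pair per synaptic weight).
nInputs = 3; nFuncs = 104;                 % the 3-bit input case
rows = nInputs + 2;                        % 5 rows
cols = 2 * nFuncs;                         % 208 columns
nMemristors = rows * cols;                 % 1040 memristors in the crossbar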

Figure 4.5: Effect of total training time in Xyce with change in number of cores to train linearly separable 2-input logic functions using the level-3 Xyce MOSFET in the driver circuit.

4.4.2 Analysis

In Figure 4.5, we show the variation in the total simulation training time in Xyce with respect to the number of processors and the number of function outputs for the 2-input linearly separable logic functions. To evaluate performance for a large number of output functions, we scale the memristor crossbar horizontally by stacking multiple copies of the 14 output functions. The level-3 Xyce MOSFET finite drivers are used at the input side.

Here, points a and b correspond to the same number of output functions (80 functions and 640 memristors), but point a refers to using only 1 processor while point b refers to using 2 processors. We see a significant reduction, by about 9 times, in the total training time when more than 75 output functions (600 memristors) are implemented. With a single processor, the training takes about 16-17 hours, while with two processors it takes less than 1 hour. The same effect can be seen by observing points c and d, both of which correspond to 290 output functions (2,320 memristors); point c refers to using 2 processors while point d refers to using 3 processors.

Figure 4.6: Effect of total training time in Xyce with change in number of cores to train linearly separable 2-input logic functions using Mosis-180nm MOSFETs in the driver circuit.

Again, we see a large reduction in the overall training time: using two processors for 290 output functions takes around 18 hours, while using three processors takes around 7.5 hours. From these observations, we can see that we are able to simulate the training of large scale memristor crossbars with finite drivers more efficiently, and with a significant reduction in training time, using a parallel simulator such as Xyce.

Now, when we consider points e and f, we see that one processor outperforms two processors when there are fewer output functions, around 56 functions. Point e, which refers to training with one processor, takes around 30 minutes, while point f, which refers to training with two processors, takes around 90 minutes; the simulation time with two processors is thus three times that with one processor. This increase in training time with two processors is mainly due to the inter-processor communication overhead, which dominates the total simulation time. Therefore, a crossbar of this size (4 rows and 56 · 2 = 112 columns, 448 memristors) requires only one processor.

Figure 4.7: Convergence error curve from training 2-bit input linearly separable logic functions with 10,096 memristors.

This can also be seen at points g and h, where g refers to training with three processors and h refers to training with two processors. Here also, the communication overhead between the processors dominates the total simulation time as the number of processing cores increases.

From the above analysis, we conclude that smaller memristor crossbars require fewer processors, since adding extra processors increases the communication overhead between them, which dominates the total simulation time. However, as the memristor crossbar size grows, there is a limit on the load that a given number of processors can take, beyond which additional processors are required. This phenomenon can be seen at points T1 and T2, which define the limits on crossbar size when using one and two processors respectively. As shown, a single processor can handle crossbar sizes of 75-80 output functions, beyond which the time taken rises sharply. When two processors are used, the simulation time drops drastically, as shown at point b, and the inter-processor communication is no longer an overhead. Likewise, point T2 defines the limit of crossbar sizes with two processors, beyond which an additional processor is required.

We have also trained the memristor-based crossbar on the 2-input logic functions using the Mosis-180nm wafer characteristics of the MOSFETs in the finite driver system. As shown in Figure 4.6, when the size of the crossbar is increased, faster simulation can again be attained by scaling beyond a single processor. This shows that the large scale memristor-based crossbar works with a realistic driver system and can still be trained and scaled across multiple processors. We also successfully trained this memristor based crossbar with finite drivers in Xyce for 1,262 linearly separable 2-bit-input logic functions, with about 10,096 memristors in the crossbar (as shown in Figure 4.7).

A similar analysis can be shown for a 3-bit input with 104 linearly separable output functions using both the level-3 Xyce MOSFET and the Mosis-180nm MOSFET. It can be seen that using the level-3 MOSFET drivers gives faster simulation, which helps in evaluating large scale memristor-based crossbars for different neuromorphic applications, and that its performance closely tracks that of the actual Mosis-180nm MOSFET drivers. The performance statistics for the two types of MOSFET drivers are provided in Table 4.2.

Figure 4.8: Effect of total training time in Xyce with change in number of cores to train linearly separable 3-bit input logic functions using the level-3 Xyce MOSFET and the Mosis-180nm MOSFET in the driver circuit.

Number of function outputs | Time taken (Level-3 Xyce MOSFET) | Time taken (Mosis-180nm MOSFET)
104                        | 6.5 hrs                          | 11.6 hrs
208                        | 17.25 hrs                        | 21.6 hrs
312                        | 61.35 hrs                        | 65.21 hrs

Table 4.2: Time taken to train the linearly separable 3-bit input logic functions using two different MOSFETs in the driver circuit with 2 processor cores.

CHAPTER V

ALTERNATIVE APPROACH IN MODELING MEMRISTOR BASED CROSSBAR FOR NEUROMORPHIC APPLICATIONS

In the previous chapter, we explained the need for a parallel circuit simulator with which large scale memristor crossbars, simulated at the circuit level, can be analyzed more effectively and quickly. Recent work on simulating memristor crossbars at large scale used today's simulation tools, such as LTSpice and PSpice, but was limited to only 3000 memristors because these tools ran into convergence issues [14]. Moreover, these tools take more than 3 days to train large scale memristor crossbars, as they are non-parallel simulators that utilize only a single core of a machine. Therefore, we require a parallel simulator that can simulate large scale memristor crossbars in a much shorter time. Chapter IV described the parallel simulator Xyce, released by Sandia, which can use multiple cores in parallel to evaluate a circuit. The results of Chapter IV also show that the total simulation time of a particular crossbar in Xyce is 1/4th of the total simulation time obtained with a non-parallel circuit simulator such as LTSpice.

Since memristor-based crossbars have great potential as a building block that can be modeled to suit different neuromorphic applications, we need a system in which such highly dense memristor crossbars can be simulated and analyzed faster. Although Xyce is so far the best approach at the circuit level, it is still not sufficient for fast analysis and development of such large scale crossbars. Therefore, this chapter proposes an approximate software-based mathematical model of a memristor crossbar that can be used in any programming language or environment, such as MATLAB, Python, or C. This system can be used as an ex-situ approach to develop and analyze large scale memristor crossbars for a variety of neuromorphic applications.

This chapter is divided into the following sections: Section 5.1 briefly describes the traditional single layer perceptron implemented in software; Section 5.2 describes in detail the implementation of a single layer perceptron using memristor crossbars as an offline approach to its circuit design; and Section 5.3 explains how the offline trained weights obtained in Section 5.2 can be tested against the circuit level representation in Xyce.

5.1 Traditional Single Layer Perceptron

This section briefly discusses the single layer perceptron, which is later extended to the memristor-based neuromorphic circuit. The single layer perceptron is one of the earliest models that uses the concept of neurons and synapses for pattern recognition. The perceptron model, illustrated in Figure 5.1, consists of two layers, an input layer and an output layer, with uni-directional weights from the input to the output. These weights correspond to the synaptic weights of a neuron. Given a set of input images and a set of corresponding class labels, the network is trained to obtain optimal weights that associate every input with its particular group or class. The number of units in the input layer is determined by the dimensionality of the input patterns. In our analysis, the input pattern corresponds to an image of size N × N, which is reshaped to form a vector of dimension N². Thus an input pattern is represented as x = [x_1, x_2, x_3, ..., x_{N²}].

Figure 5.1: Single layer perceptron

In the perceptron input layer, along with the N² inputs, there is also an additional input known as the bias. The number of units in the input layer is then N² + 1, and a value of 1 is appended to the input pattern x, i.e. x = [x, 1]. In the output layer, the number of units depends on the application for which the network is trained. Here, we consider an M-class problem where each input pattern belongs to one of classes 1 to M. For example, in a 10-class image classification problem such as the MNIST dataset, the number of output units is 10 and each input image is classified among the ten groups. A single weight connects each input neuron to each output neuron, similar to a synaptic connection between two neurons. Therefore, if there are N² + 1 input units and M output units, the weights of the perceptron are given by W ∈ R^((N²+1)×M).

The training scheme to obtain the optimal weights W for the task of image classification using a traditional single layer perceptron is summarized in Algorithm 1. Our objective is to obtain the optimal weights such that y = f(x · W) gives the best estimate of the class of an image x in the testing set. For training, we have a set of P images in vectorized form, X_train = [x_1, ..., x_P], and a corresponding set of class labels d = [d_1, d_2, ..., d_P]. The weights are initialized to random values between (0.0001, 0.001). We use an iterative scheme to constantly adjust the weights in the right direction based on the error between the class estimate y and the ground truth class label d for every input image. For each image in an iteration, we first compute the Net value, which is a linear operation of the input pattern and the weights, i.e. Net = W · x. From this Net value we compute the function outputs y, which represent the estimate of the class of the input image under the current weights; the output is obtained by thresholding Net, as y ← (Net > 0.5) in Algorithm 1. The error δ between the class estimate and the true class label is then computed by taking their difference. Using δ and a learning rate parameter η, we update the weights W in the direction that minimizes this error for the next set of images. This procedure is repeated for all input images within a single iteration, and the errors are accumulated. At the end of the iteration we check the convergence criterion, namely whether the error has gone to 0; if it has not, we continue with the next iteration. We repeat this iterative process until convergence is reached, at which point we obtain the optimal weights.

In the testing scheme, shown in Algorithm 2, the net output at every output neuron is the dot product of the input test image vector (X_test) and the weights W obtained from the training phase in Algorithm 1. The total number of misclassified images is calculated from the difference between the actual test label and the obtained class label.

Understanding the single layer perceptron and its training algorithm is essential for its implementation as a neuromorphic memristor-based crossbar, where the learning rule for the memristor states correlates with the update rule of the traditional network. The evaluation of the memristor-based crossbar is also compared against this traditional network, where we check for similarity in their performance.

Algorithm 1 Algorithm to train the traditional single layer perceptron.
Require: P ← Number of training images; n × n ← Dimension of the P-th image;
  X_train ∈ R^(P×n²); d ← True class labels for each pattern; η ← Learning rate;
  numIter ← Number of iterations; M ← Number of classes
1: procedure TRAINING PHASE(P, M, η, n, X_train, d)
2:   X_train ← [X_train, ones(P, 1)]  ▷ Append bias of 1 to every P-th image.
3:   N ← (n² + 1)  ▷ N represents the total dimensions of the image in vector form.
4:   W ← rand(M, N)  ▷ Initialize weight matrix of size M × N to random values.
5:   i ← 1
6:   while i < numIter do  ▷ For each iteration.
7:     E ← 0  ▷ Set error to zero.
8:     for every P-th image do
9:       Net ← (W · X_train(P, :))  ▷ Compute the Net function of all output neurons.
10:      y ← (Net > 0.5)  ▷ Compute y for all output neurons.
11:      δ ← (d(P, :) − y)  ▷ Error (δ) between true class and estimated class label.
12:      ∆W ← η · δ · X_train(P, :)^T  ▷ ∆W is the change in weights.
13:      W ← W + ∆W  ▷ Update rule for weights (W).
14:      E ← E + Σ(δ²)  ▷ Accumulate the errors for every P-th pattern in E.
15:    end for
16:    E ← E/(M · P)  ▷ Average error.
17:    i ← i + 1  ▷ Next iteration.
18:    if E == 0 then  ▷ Check for convergence where average error is zero.
19:      break
20:    end if
21:  end while
22:  return W
23: end procedure

Algorithm 2 Algorithm to test the traditional single layer perceptron.
Require: P ← Number of test images; n × n ← Dimension of the P-th image;
  X_test ∈ R^(P×n²); d ← True class labels for each pattern;
  W ← Obtained from training phase in Algorithm 1
1: procedure TESTING PHASE(P, η, n, X_test, d, W)
2:   X_test ← [X_test, ones(P, 1)]  ▷ Append bias of 1 to every P-th image.
3:   N ← (n² + 1)  ▷ N represents the total dimensions of the image in vector form.
4:   incorImg ← 0  ▷ Set the count of incorrectly classified images to zero.
5:   for every P-th image do
6:     Net ← (W · X_test(P, :))  ▷ Compute the Net function of all output neurons.
7:     y ← (Net > 0.5)  ▷ Compute y for all output neurons.
8:     if sum(d(P, :) − y) = 0 then
9:       Test image is classified correctly.
10:    else
11:      incorImg ← incorImg + 1  ▷ Count the number of incorrectly classified images.
12:    end if
13:  end for
14: end procedure

Figure 5.2: MATLAB memristive system

5.2 Single Layer Perceptron with Memristor based Crossbars

In this section, we describe the representation and implementation of the memristor crossbar in a mathematical model using a resistance-based approximate circuit.

5.2.1 Approximate Solution to a Circuit Level Memristive based Crossbar.

The conductance of each memristor in a crossbar is represented by σ, which can also be written as 1/R, where R is its resistance. Therefore, to approximate the memristor crossbar, its conductivity is represented with equivalent resistance values as shown in Figure 5.2. Let the input voltage V_DD be 1 V, and consider the input nodes A = 0, B = 1 and the biases β and β̄ equal to 1 and 0 respectively. To approximately model this crossbar, the memristors with low input nodes (A and β̄) are represented as resistances connected to ground. The objective is then to find the effective input voltages V_1 and V_2 to the comparator.

In this arrangement, to find the voltage V_1 we need to consider four resistances: R_31 and R_21 are in parallel and connected to V_DD = 1 V, while R_41 and R_11 are in parallel and connected to ground (V = 0). This is illustrated in Figure 5.3. Thus the effective resistance R^eff_{31,21} can be computed from R_31 and R_21, and the effective resistance R^eff_{41,11} from R_41 and R_11, as given in Equations 5.1 and 5.2 respectively.

Figure 5.3: Effective resistances for computation of voltage V_1.

1/R^eff_{21,31} = 1/R_31 + 1/R_21    (5.1)

1/R^eff_{41,11} = 1/R_41 + 1/R_11    (5.2)

As shown in Figure 5.3, the voltage V_1 is then the voltage drop across the effective resistance R^eff_{41,11}, as given in Equations 5.3 and 5.4.

V_1 = V_DD · R^eff_{41,11} / (R^eff_{41,11} + R^eff_{31,21})    (5.3)

V_1 = V_DD · [1/(1/R_41 + 1/R_11)] / [1/(1/R_41 + 1/R_11) + 1/(1/R_21 + 1/R_31)]    (5.4)

Figure 5.4: Effective resistances for computation of voltage V_2.

If we substitute each memristor resistance with its equivalent conductance σ = 1/R, the effective voltage drop V_1 can be derived as follows (Equations 5.5-5.8):

V_1 = V_DD · [1/(σ_41 + σ_11)] / [1/(σ_41 + σ_11) + 1/(σ_21 + σ_31)]    (5.5)

V_1 = V_DD · [1/(σ_41 + σ_11)] / [(σ_21 + σ_31 + σ_41 + σ_11) / ((σ_41 + σ_11)(σ_21 + σ_31))]    (5.6)

V_1 = V_DD · (σ_21 + σ_31) / (σ_21 + σ_31 + σ_41 + σ_11)    (5.7)

V_1 = V_DD · (Σ_{i=2}^{3} σ_i1) / (Σ_{i=1}^{4} σ_i1)    (5.8)
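As a quick sanity check of this derivation, the sketch below (with arbitrary resistance values) evaluates the voltage-divider form of Equations 5.3-5.4 and the conductance form of Equation 5.8 and confirms that they agree:

% Numeric check of Equations 5.1-5.8 with arbitrary resistances.
VDD = 1;
R11 = 1e4; R21 = 2e4; R31 = 5e3; R41 = 8e3;
ReffTop = 1/(1/R21 + 1/R31);                 % Equation 5.1
ReffBot = 1/(1/R41 + 1/R11);                 % Equation 5.2
V1_div  = VDD * ReffBot/(ReffBot + ReffTop); % Equations 5.3-5.4
s = 1 ./ [R11 R21 R31 R41];                  % conductances sigma = 1/R
V1_cond = VDD * (s(2) + s(3)) / sum(s);      % Equation 5.8
% V1_div and V1_cond agree up to floating-point rounding.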

To find the voltage V_2, we consider another set of four resistances, R_32, R_22, R_42 and R_12, as shown in Figure 5.4. Their effective combinations are computed in Equations 5.9 and 5.10.

1/R^eff_{22,32} = 1/R_32 + 1/R_22    (5.9)

1/R^eff_{42,12} = 1/R_42 + 1/R_12    (5.10)

The voltage V_2 in terms of the effective resistances can be computed as shown in Figure 5.4 and in Equations 5.11 and 5.12.

V_2 = V_DD · R^eff_{42,12} / (R^eff_{42,12} + R^eff_{32,22})    (5.11)

V_2 = V_DD · [1/(1/R_42 + 1/R_12)] / [1/(1/R_42 + 1/R_12) + 1/(1/R_22 + 1/R_32)]    (5.12)

As in the derivation of V_1, we substitute each resistance with its conductance (σ = 1/R) and derive the voltage V_2 as shown in Equations 5.13-5.16.

V_2 = V_DD · [1/(σ_42 + σ_12)] / [1/(σ_42 + σ_12) + 1/(σ_22 + σ_32)]    (5.13)

V_2 = V_DD · [1/(σ_42 + σ_12)] / [(σ_22 + σ_32 + σ_42 + σ_12) / ((σ_42 + σ_12)(σ_22 + σ_32))]    (5.14)

V_2 = V_DD · (σ_22 + σ_32) / (σ_22 + σ_32 + σ_42 + σ_12)    (5.15)

V_2 = V_DD · (Σ_{i=2}^{3} σ_i2) / (Σ_{i=1}^{4} σ_i2)    (5.16)

To generalize the voltage drop across the two input terminals of a comparator in the memristor based crossbar, consider indices (i, j) referring to the i-th row and j-th column, and let the input voltage at the i-th row be V_i^in, which takes the value 0 or 1 depending on the binary input pattern. The voltage at terminal j is then the sum over rows of the products of input voltage and conductance in column j, normalized by the sum of all conductances in column j. This is given in Equation 5.17.

V_j = (Σ_{i=1}^{N} σ_ij V_i^in) / (Σ_{i=1}^{N} σ_ij)    (5.17)

Using this generalized form of the output voltages of a memristor-based neuron, we can now model a large crossbar for neuromorphic applications without circuit-level training. The training of this approximate memristor based crossbar in MATLAB is explained in the following sub-section.

5.2.2 Training Single Layer MATLAB based Memristor Crossbar.

In this section, we describe the procedure for training a large memristor based crossbar using the mathematical model explained in the previous section; the procedure is summarized in Algorithm 3. Using the generalized formula of Equation 5.17, we can find the voltage at every output column of a memristor based crossbar. The mathematical model of Figure 5.2 is implemented in MATLAB and trained as a memristive single layer perceptron, as given in Algorithm 3.

Consider a set of training images X_train, with P the number of images, M the number of classes, and η the learning rate of the network. The memristor states are assumed to be represented by the weights of the network. These weights/memristor states are transformed into equivalent memristor conductances using the device's inherent I-V characteristic, given in Equations 5.18 and 5.19.

I_mem = a_1 · W · sinh(b_1 · V_mem)    (5.18)

σ_mem = I_mem / V_mem    (5.19)

Here, we set the two constants to a_1 = 0.00016 and b_1 = 0.05, where a_1 is related to the thickness of the dielectric layer in the memristor and b_1 is the factor influencing conduction in the device. The memristor voltage V_mem is set to 1 V.
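The sketch below shows a minimal MATLAB forward pass of this approximate model, combining Equations 5.18-5.19 with the column voltages of Equation 5.17; the weight range and the random input pattern are placeholders consistent with the setup described later in this chapter.

% Minimal forward-pass sketch of the approximate crossbar model.
a1 = 0.00016; b1 = 0.05; Vmem = 1;           % device constants from the text
W  = 0.0010 + 0.0001*rand(20, 785);          % memristor states (2M x N), assumed range
x  = [double(rand(784,1) > 0.5); 1];         % binary input pattern plus bias (placeholder)
sigma = (a1 .* W .* sinh(b1*Vmem)) ./ Vmem;  % Equations 5.18-5.19
V = (sigma * x) ./ sum(sigma, 2);            % Equation 5.17: one voltage per column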

For every input image, we first obtain the memristor conductances σ_mem from the weights W using Equations 5.18 and 5.19. We then compute the Net output at every output neuron using the generalized voltage equation given in Equation 5.17. The final output y of each output neuron is a function of the Net outputs: if the odd column voltage of a neuron is greater than its even column voltage, the output y is high (1); otherwise it is low (0). As in the traditional single layer perceptron, the weights/memristor states are updated based on the error computed between the true class labels d and the estimated output labels y. This process is repeated over multiple iterations until the convergence criterion is satisfied; once the error reaches zero, the optimal memristor states W are obtained.

Algorithm 3 Algorithm to train the memristive single layer perceptron.
Require: P ← Number of training images; n × n ← Dimension of the P-th image;
  X_train ∈ R^(P×n²); d ← True class labels for each pattern; η ← Learning rate;
  numIter ← Number of iterations; M ← Number of classes;
  a_1 ← Thickness of dielectric layer; b_1 ← Factor influencing conduction in device;
  V_mem ← Memristor voltage of 1 V;
  memState ← Matrix of size 2·M × N which holds all memristor states in the crossbar.
1: procedure MEMRISTORCONDUCTANCE(W)
2:   memState ← W
3:   I_mem ← a_1 · memState · sinh(b_1 · V_mem)
4:   σ_mem ← I_mem / V_mem
5:   return σ_mem
6: end procedure
7: procedure TRAINING PHASE(P, M, η, n, X_train, d)
8:   X_train ← [X_train, ones(P, 1)]  ▷ Append bias of 1 to every P-th image.
9:   N ← (n² + 1)  ▷ N represents the total dimensions of the image in vector form.
10:  W ← rand(2·M, N)  ▷ Initialize weights of size 2·M × N to random values.
11:  i ← 1
12:  while i < numIter do  ▷ For each iteration.
13:    E ← 0  ▷ Set error to zero.
14:    for every P-th image do
15:      σ_mem ← MemristorConductance(W)
16:      Net_o ← Σ(σ_mem(1:2:end, :) · X_train(P, :)) / Σ(σ_mem(1:2:end, :))  ▷ Net output of odd columns (Equation 5.17).
17:      Net_e ← Σ(σ_mem(2:2:end, :) · X_train(P, :)) / Σ(σ_mem(2:2:end, :))  ▷ Net output of even columns (Equation 5.17).
18:      y ← (Net_o > Net_e)  ▷ Compute y for all output neurons.
19:      δ ← (d(P, :) − y)  ▷ Error (δ) between true class and estimated class label.
20:      ∆W ← η · δ · X_train(P, :)^T  ▷ ∆W is the change in weights.
21:      W ← W + ∆W  ▷ Update rule for weights (W).
22:      E ← E + Σ(δ²)  ▷ Accumulate the errors for every P-th pattern in E.
23:    end for
24:    E ← E/(M · P)  ▷ Average error.
25:    i ← i + 1  ▷ Next iteration.
26:    if E == 0 then  ▷ Check for convergence where average error is zero.
27:      break
28:    end if
29:  end while
30:  return W
31: end procedure

To test the memristor based approximate crossbar model in MATLAB, we employ a testing strategy similar to the one used for the traditional single layer perceptron. A set of test images X_test is considered and a bias of 1 is appended. As in the training phase, the net output at each column of a neuron is computed from the memristor conductances and the input test images, as given in Equation 5.17. The final output y of a neuron is high (1) if the Net output in its odd column is greater than the Net output in its even column. A test image is correctly classified if the difference between its true class labels and the computed output y is 0. This testing strategy is summarized in Algorithm 4.

Algorithm 4 Algorithm to test the memristive single layer perceptron.
Require: P ← Number of testing images; n × n ← Dimension of the P-th image;
  X_test ∈ R^(P×n²); d ← True class labels for each pattern;
  a_1 ← Thickness of dielectric layer; b_1 ← Factor influencing conduction in device;
  V_mem ← Memristor voltage of 1 V; M ← Number of classes;
  memState ← Matrix of size 2·M × N which holds all memristor states in the crossbar;
  W ← Obtained from training phase in Algorithm 3
1: procedure MEMRISTORCONDUCTANCE(W)
2:   memState ← W
3:   I_mem ← a_1 · memState · sinh(b_1 · V_mem)
4:   σ_mem ← I_mem / V_mem
5:   return σ_mem
6: end procedure
7: procedure TESTING PHASE(P, M, W, n, X_test, d)
8:   X_test ← [X_test, ones(P, 1)]  ▷ Append bias of 1 to every P-th image.
9:   N ← (n² + 1)  ▷ N represents the total dimensions of the image in vector form.
10:  incorImg ← 0  ▷ Set the count of incorrectly classified images to zero.
11:  for every P-th image do
12:    σ_mem ← MemristorConductance(W)
13:    Net_o ← Σ(σ_mem(1:2:end, :) · X_test(P, :)) / Σ(σ_mem(1:2:end, :))  ▷ Net output of odd columns (Equation 5.17).
14:    Net_e ← Σ(σ_mem(2:2:end, :) · X_test(P, :)) / Σ(σ_mem(2:2:end, :))  ▷ Net output of even columns (Equation 5.17).
15:    y ← (Net_o > Net_e)  ▷ Compute y for all output neurons.
16:    if sum(d(P, :) − y) = 0 then
17:      Test image is classified correctly.
18:    else
19:      incorImg ← incorImg + 1  ▷ Count the number of incorrectly classified images.
20:    end if
21:  end for
22: end procedure

5.3 Testing of Memristor based Crossbar as a Single Layer Perceptron in Xyce

In the previous section, we discussed a mathematical approach to train a memristor-based crossbar as a single layer perceptron for pattern classification. This system can be considered an offline training approach in which we obtain the optimal memristor states for a single layer perceptron, which can then be used to test the corresponding circuit level model in Xyce. This can potentially reduce the computation time required for the training phase and enable faster analysis of memristor-based crossbars for neuromorphic applications. The process of testing the memristor based crossbar at circuit level in Xyce using the offline trained weights is explained in this section and summarized in Algorithm 5.

Algorithm 5 Algorithm to test the memristor neuromorphic crossbar in Xyce using offline trained weights, with and without a comparator.
Require: P ← Number of testing images; n × n ← Dimension of the P-th image;
  X_test ∈ R^(P×n²); d ← True class labels for each pattern; M ← Number of classes;
  W ← Obtained from training phase in Algorithm 3;
  flagComparator ← True if a comparator is used, else false;
  (rows, cols) ← Number of rows and columns in the memristor crossbar
1: procedure TESTING PHASE WITH XYCE(X_test, M, W)
2:   X_test ← [X_test, ones(P, 1)]  ▷ Append bias of 1 to every P-th image.
3:   N ← (n² + 1)  ▷ N represents the total dimensions of the image in vector form.
4:   incorImg ← 0  ▷ Set the count of incorrectly classified images to zero.
5:   rows ← N; cols ← 2·M
6:   for p ← 1, P do
7:     Set the input voltage of the crossbar to the input pattern, i.e. readVoltage ← input pattern.
8:     Build the circuit netlist using rows, cols and readVoltage.
9:     Simulate the netlist in Xyce using the bash script as explained in Chapter IV.
10:    Obtain the outputs Net of the crossbar from the Xyce log file.
11:    for every output voltage j from the crossbar do
12:      if flagComparator then
13:        if Net(j) > 0.5 then
14:          y(j) ← 1
15:        else
16:          y(j) ← 0
17:        end if
18:      else
19:        if Net(2·j − 1) > Net(2·j) then
20:          y(j) ← 1
21:        else
22:          y(j) ← 0
23:        end if
24:      end if
25:    end for
26:    if sum(d(p, :) − y) = 0 then
27:      Test image is classified correctly.
28:    else
29:      incorImg ← incorImg + 1  ▷ Count the number of incorrectly classified images.
30:    end if
31:  end for
32: end procedure

The testing approach incorporated here is similar to the one explained in Chapter IV. For every input image X_test, a bias of 1 is added and the image is then supplied as the input voltage (readVoltage) to the rows of the memristor based crossbar circuit. Here the number of rows in the crossbar corresponds to the total dimension of the image plus the bias, i.e. n² + 1. The state of every memristor in the crossbar is set to its corresponding value from the offline trained weights. The circuit netlist is built and simulated in Xyce by calling a bash script from MATLAB. The output Net of every neuron in the crossbar is evaluated in two ways: the first without a comparator in the circuit, as shown in Figure 5.5a, and the second with a comparator, as shown in Figure 5.5b. We consider these two approaches to show that incorporating a comparator at the circuit level to decide the final output y of a neuron does in fact affect the testing performance of the system compared to operating without one.
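A minimal sketch of the two decision rules, applied to hypothetical crossbar outputs, is shown below:

% Minimal sketch of the two output-decision rules of Algorithm 5.
Net = rand(1, 20);                        % hypothetical column voltages (2 per neuron)
yNoComp = Net(1:2:end) > Net(2:2:end);    % (a) without comparator: odd vs. even column
NetComp = rand(1, 10);                    % hypothetical comparator outputs (1 per neuron)
yComp   = NetComp > 0.5;                  % (b) with comparator: threshold at 0.5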

Figure 5.5: Memristor based neuromorphic crossbar for testing offline trained weights: (a) without comparator; (b) with comparator.

5.4 Results

In this section, we evaluate the performance of the memristor-based perceptron network simulated in Xyce with offline trained weights for the task of pattern classification. Performance is measured as the number of images correctly classified in a given test set for two well-known image datasets: the MNIST digit recognition dataset [69] and the CBCL-MIT face dataset [70] for face image classification. Three kinds of analysis are performed on each dataset:

(1) Comparison of the approximate mathematical memristor-based perceptron in MATLAB with the traditional network. This checks whether the memristor-based perceptron performs similarly to the traditional perceptron using the same learning rates and training set.

(2) Evaluation of the circuit-level memristor-based crossbar in Xyce with the weights obtained from the MATLAB memristor-based perceptron. Here, we check how well the performance of the Xyce simulation corresponds to its approximate model in MATLAB.

(3) The effect of the comparator on the performance of the circuit-level memristor-based perceptron for the task of pattern classification with the offline trained weights.

5.4.1 MNIST Digit Database

MNIST is a dataset of handwritten digits from 0 to 9 and contains around 60,000 training samples and 10,000 test samples. It is one of the most widely used standard datasets for testing neural network and pattern recognition algorithms to date. We use this dataset to test the large scale memristor crossbar implementation in Xyce, as it provides a platform not only to test the algorithm but also to check whether the large Xyce implementation performs close to the traditional algorithm. The digit images are grayscale, size-normalized, and centered in a fixed-size image of 28 × 28. An illustration of the MNIST digit dataset is shown in Figure 5.6a.

Challenges and Variability in Dataset

The challenges in the dataset are given below.

(1) Non-uniformity in the orientation and thickness of the lines comprising the digits.

(2) Variation in the aspect ratio, out-of-plane rotation, and scale of the digit in the image.

(3) Non-uniform variation in handwriting style across digits.

Due to these factors, the dataset is not linearly separable. This can be illustrated by observing the distribution of the images in N²-dimensional space, where N × N is the size of one image. Here N = 28, so in 784-dimensional space the images are points that follow a certain distribution. Since N² is very large, the distribution can be visualized by applying Principal Component Analysis (PCA) to the images and observing the variation along the three most significant eigenvectors. This type of analysis is widely used for evaluating image and pattern recognition algorithms. PCA finds the directions of greatest variation in the data, and when the images are represented in the axes given by the eigenvectors, the non-linearity can clearly be seen [71].

Figure 5.6: MNIST digit dataset: (a) illustration of the MNIST dataset with sample digit images; (b) PCA on the MNIST dataset.

System Setup and Analysis

Every grayscale digit image is first normalized and converted to a binary image taking values −1 and 1. The images of size 28 × 28 are then reshaped to form patterns of size 784 × 1. We use a set of 1000 patterns to train both the traditional network and the approximate mathematical memristor-based perceptron, using very low learning rates of 0.0001 and 0.00005. Note that the same learning rates are used for the traditional and memristor-based networks so that an appropriate comparison can be made between them.

In both the approximate memristor-based model and the traditional single layer perceptron, we initialize the weights between (0.0010, 0.0011), which is more consistent with the learning algorithm of the memristor-based perceptron. Such low weights and learning rates are required because of the large scaling that occurs when the memristor conductances are computed from the weights; we therefore use the same range of weights and the same low learning rates to train the traditional network as well. Note also that both the approximate memristor model used for offline training and the circuit level memristor crossbar tested in Xyce contain a crossbar with 785 rows (including a bias) and 2 · 10 = 20 columns, and therefore 15,700 memristors.
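A minimal preprocessing sketch for these images is shown below; the file name and the 0.5 binarization threshold are assumptions.

% Minimal MNIST preprocessing sketch: normalize, binarize to -1/1, vectorize.
img = double(imread('digit.png')) / 255;   % hypothetical grayscale digit image
imgBin = 2*(img > 0.5) - 1;                % binary image with values -1, 1 (assumed threshold)
x = reshape(imgBin, [], 1);                % 28 x 28 -> 784 x 1 pattern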

Figures 5.7 and 5.8 show the error curves obtained when training the traditional and the approximate mathematical memristor-based perceptron respectively. Using the two different learning rates, the training of the approximate memristor model converges to zero, similar to the error curve obtained with the traditional algorithm using the same range of weights and learning rates. However, the error curve is not as smooth as the traditional one. This is because two memristor states are associated with each synaptic connection, and during each iteration the weights in the odd and even columns of the crossbar often push and pull in different directions to reach the optimal value, whereas in the traditional algorithm only one state/weight is associated with each synaptic connection.

Figure 5.7: Error curves obtained with the traditional perceptron: (a) with learning rate 0.0001; (b) with learning rate 0.00005.

Figure 5.8: Error curves obtained with the MATLAB memristor crossbar during training: (a) with learning rate 0.0001; (b) with learning rate 0.00005.

Figure 5.9: Comparison of accuracies of the traditional perceptron, the MATLAB memristor crossbar, and the Xyce implementation: (a) with learning rate 0.0001; (b) with learning rate 0.00005.

Once the traditional and the approximate memristor-based perceptron networks are trained, these systems are tested with varying numbers of test images. As with the training images, the test images are normalized and converted to binary images taking values in {−1, 1}. To test the performance of the memristor-based crossbars in Xyce, the optimal weights obtained from the approximate model are loaded as device states, as done in Algorithm 5, and used to classify the test patterns. Two variations of circuit-level testing are used: one without comparators and the other with comparators.
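A hedged sketch of this weight-to-state loading step is shown below; the conductance bounds G_OFF and G_ON and the linear state mapping are assumptions for illustration, since the exact device parameters are those of the Xyce memristor model.

    import numpy as np

    G_OFF, G_ON = 1e-6, 1e-3   # assumed low/high conductance bounds of the device (S)

    def weights_to_states(w):
        """Map offline-trained weights to normalized device states in [0, 1]."""
        g = np.clip(w, G_OFF, G_ON)              # conductance implied by each weight
        return (g - G_OFF) / (G_ON - G_OFF)      # state value written into the netlist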

As shown in Figure 5.9, the numbers of test images used in the evaluation are 10, 20, 50, 100, and 200. To allow an accurate comparison of image classification accuracy among the traditional perceptron, the approximate memristor model, and the circuit-level simulation with and without comparators, we use the same set of test images for all four. We see that the learning rate η = 0.00005 performs much better and has a higher chance of classifying digit images accurately. Across the different numbers of input test images (along the x-axis of Figure 5.9), the circuit-level memristor crossbars in Xyce and the approximate memristor-based perceptron classify the same number of test images correctly. As we increase the number of input test images, the percentage of correctly classified patterns remains the same. However, if we use the comparator to determine the output of the circuit-level memristor crossbars, we notice a drop in the number of test images correctly classified. This is because the comparators at the circuit level add resistance that changes the net output voltage. Therefore, with this approach of using the approximate memristor-based model for offline training, we can successfully train almost 15,700 memristors.

5.4.2 MIT-CBCL Face Database

This is a database of faces and non-faces, as illustrated in Figure 5.10a, and has been extensively used in the Center for Biological and Computational Learning at the Massachusetts Institute of Technology (MIT). It contains around 2,429 face images and 4,548 non-face images in the training subset, and a total of 472 face images and 23,573 non-face images in the testing subset. This dataset has been extensively used in evaluating face detection algorithms and is considered a standard for testing any platform trained to classify an image as a face or non-face [70, 72, 73, 74]. The images are 19 × 19 grayscale images in PGM format. Testing face detection in addition to digit classification, on both the large-scale memristor crossbar and the MATLAB memristor neural network, further validates our proposed large-scale memristor-based perceptron algorithm.

Challenges and Variability in Dataset

Some of the challenges of this dataset are as follows:

(1) The image data is highly non-linear, as the face and non-face images are not linearly separable.

(2) There is substantial variability in the face images due to the pose of the person, facial feature variations across different identities, and the size of the face.

(3) The non-face images have even more variation, as they are selected as the negative set of this dataset and are highly non-uniform.

Similar to the MNIST dataset, we also apply PCA, as in Figure 5.10b, to visualize the non-linear distribution of the face images [75].

Figure 5.10: MIT-CBCL Face dataset. (a) Illustration of the MIT-CBCL Face dataset with sample face and non-face images. (b) PCA on the MIT-CBCL Face dataset.

System Setup and Analysis

Similar to the MNIST analysis, every grayscale face image is first normalized and converted to a binary image taking values in {−1, 1}. Then, each 19 × 19 image is reshaped into a pattern of size 361 × 1. We use a set of 1,000 patterns, containing equal numbers of face and non-face images, to train both the traditional network and the approximate mathematical memristor-based perceptron using very low learning rates of 0.0001 and 0.00005. As with the MNIST dataset, the same learning rates and the same weight initialization in the range (0.0010, 0.0011) are used for both the traditional and the memristor-based perceptrons. Also note that the crossbar used in both the approximate training model and the SPICE-based model contains 362 rows (including a bias) and 2 · 2 = 4 columns, with 1,448 memristors.
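The crossbar dimensions for both datasets follow the same rule (one row per pixel plus a bias, and two columns per output class), which the short Python sketch below makes explicit; the function name is illustrative.

    def crossbar_size(n_pixels, n_classes):
        """Rows = pixels + bias; columns = two per class (one memristor pair per synapse)."""
        rows, cols = n_pixels + 1, 2 * n_classes
        return rows, cols, rows * cols

    print(crossbar_size(28 * 28, 10))  # (785, 20, 15700) -- MNIST digits
    print(crossbar_size(19 * 19, 2))   # (362, 4, 1448)   -- MIT-CBCL face/non-face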

Figure 5.11: Error curves obtained with the traditional perceptron during training with two different learning rates for the CBCL-MIT Face dataset: (a) 0.0001; (b) 0.00005.

As expected of the traditional perceptron, the error curve shown in Figure 5.11 converges smoothly to zero and the optimal traditional weights are obtained. However, the error curve obtained with the approximate memristor model, shown in Figure 5.12, is noisy with many spikes. This can be attributed to two main factors: first, two memristor states are associated with each synaptic connection, as explained earlier; second, there is high non-linearity in the face/non-face patterns. As shown, the approximate memristor-based perceptron succeeds in learning this highly non-linear separation between the face and non-face patterns by making large weight adjustments to each column of the memristor crossbar after every iteration. We can also see that the error curve corresponding to the learning rate η = 0.00005 decays faster and with a steeper slope.

Figure 5.12: Error curves obtained with the MATLAB memristor crossbar during training with two different learning rates for the CBCL-MIT Face dataset: (a) 0.0001; (b) 0.00005.

As before, we compare the performance (number of images correctly classified) among the traditional perceptron, the approximate memristor-based perceptron in MATLAB, and the circuit-level memristor-based crossbars with and without comparators at the output. The networks trained with the learning rate η = 0.00005 provide better performance, classifying more test images correctly. Moreover, the numbers of test images classified correctly by the traditional perceptron, the approximate memristor crossbar, and the circuit-level memristor-based crossbar with offline-trained weights are almost the same, with a discrepancy of only one image. Again, if a comparator is used at the circuit level to obtain the output, the number of images correctly classified drops by almost 10. This is because the comparator adds resistance at the output, thereby altering the output voltage.

Figure 5.13: Comparisons of accuracies with the traditional perceptron, MATLAB memristor crossbar, and Xyce implementation for the CBCL-MIT Face dataset: (a) learning rate 0.0001; (b) learning rate 0.00005.

CHAPTER VI

CONCLUSION

In this thesis, we have proposed approaches for simulating large-scale memristor-based crossbars for neuromorphic computing at the circuit level using a parallel simulator. In these simulations, we investigated the effect of multiple processor cores in training 2-bit and 3-bit inputs for the classification of linearly separable output functions, and we observed a linear relationship between the size of the crossbar and the number of cores used in the simulation. This system has also been successfully tested with a finite driver circuit incorporating two different MOSFETs: a level-3 Xyce basic MOSFET and a more realistic wafer model with MOSIS 180 nm feature size. This provides a low-power memristor-based crossbar circuit for large-scale circuit simulation. Using this approach, we successfully trained around 10,096 memristors in 6 days for a 2-bit input with linearly separable output functions using 3 processors in parallel. The training time can potentially be reduced by increasing the number of processors; this is still under investigation, pending access to the OSC clusters.

Also, to speed up the analysis of such crossbars for designing different machine learning algorithms, an approximate mathematical model replicating the memristor-based crossbar has been investigated. This model serves as an offline training approach, and the offline-trained weights provide performance similar to the actual circuit-level simulation in Xyce. With this approach, we have successfully tested memristor-based crossbars in Xyce with offline-trained weights for 15,700 memristors classifying MNIST digits and 1,448 memristors classifying face and non-face images from the CBCL-MIT Face dataset.

BIBLIOGRAPHY

[1] “Xyce building guide,” https://xyce.sandia.gov/documentation/BuildingGuide.html.

[2] “Ni1000 recognition accelerator,” http://www.warthman.com/images/Ni1000ds.pdf.

[3] U. Ramacher, W. Raab, N. Bruls, M. Wesseling, E. Sicheneder, J. Glass, A. Wurz, and R. Männer, “Synapse-1: a high-speed general purpose parallel neurocomputer system,” in Parallel Processing Symposium, 1995. Proceedings., 9th International, Apr 1995, pp. 774–781.

[4] T. Chen, Y. Chen, M. Duranton, Q. Guo, A. Hashmi, M. Lipasti, A. Nere, S. Qiu, M. Sebag, and O. Temam, “Benchnn: On the broad potential application scope of hardware neural network accelerators,” in Workload Characterization (IISWC), 2012 IEEE International Symposium on, Nov 2012, pp. 36–45.

[5] P. Merolla, J. Arthur, F. Akopyan, N. Imam, R. Manohar, and D. Modha, “A digital neurosynaptic core using embedded crossbar memory with 45pj per spike in 45nm,” in Custom Integrated Circuits Conference (CICC), 2011 IEEE, Sept 2011, pp. 1–4.

[6] J. Arthur, P. Merolla, F. Akopyan, R. Alvarez, A. Cassidy, S. Chandra, S. Esser, N. Imam, W. Risk, D. Rubin, R. Manohar, and D. Modha, “Building block of a programmable neuromorphic substrate: A digital neurosynaptic core,” in Neural Networks (IJCNN), The 2012 International Joint Conference on, June 2012, pp. 1–8.

[7] P. A. Merolla, J. V. Arthur, R. Alvarez-Icaza, A. S. Cassidy, J. Sawada, F. Akopyan, B. L. Jackson, N. Imam, C. Guo, Y. Nakamura, B. Brezzo, I. Vo, S. K. Esser, R. Appuswamy, B. Taba, A. Amir, M. D. Flickner, W. P. Risk, R. Manohar, and D. S. Modha, “A million spiking-neuron integrated circuit with a scalable communication network and interface,” Science, vol. 345, no. 6197, pp. 668–673, 2014.

[8] “Ibm truenorth chip,” http://www.research.ibm.com/articles/brain-chip.shtml.

[9] J. Cong and B. Xiao, “mrfpga: A novel fpga architecture with memristor-based reconfiguration,” in Nanoscale Architectures (NANOARCH), 2011 IEEE/ACM International Symposium on, June 2011, pp. 1–8.

[10] T. Taha, R. Hasan, C. Yakopcic, and M. McLean, “Exploring the design space of specialized multicore neural processors,” in Neural Networks (IJCNN), The 2013 International Joint Conference on, Aug 2013, pp. 1–8.

[11] C. Yakopcic, T. Taha, G. Subramanyam, and R. Pino, “Memristor spice modeling,” in Advances in Neuromorphic Memristor Science and Applications, ser. Springer Series in Cognitive and Neural Systems, R. Kozma, R. E. Pino, and G. E. Pazienza, Eds. Springer Netherlands, 2012, vol. 4, pp. 211–244.

[12] W. Lu, K.-H. Kim, T. Chang, and S. Gaba, “Two-terminal resistive switches (memristors) for memory and logic applications,” in Design Automation Conference (ASP-DAC), 2011 16th Asia and South Pacific, Jan 2011, pp. 217–223.

[13] C. Yakopcic, T. M. Taha, and G. Subramanyam, “Hybrid crossbar architecture for a memristor based cache,” CoRR, vol. abs/1302.6515, 2013.

[14] C. Yakopcic, R. Hasan, and T. Taha, “Tolerance to defective memristors in a neuromorphic learning circuit,” in Aerospace and Electronics Conference, NAECON 2014 - IEEE National, June 2014, pp. 243–249.

[15] L. Chua, “Memristor-the missing circuit element,” Circuit Theory, IEEE Transactions on, vol. 18, no. 5, pp. 507–519, Sep 1971.

[16] L. Chua and S. M. Kang, “Memristive devices and systems,” Proceedings of the IEEE, vol. 64, no. 2, pp. 209–223, Feb 1976.

[17] C. P. Collier, E. W. Wong, M. Belohradský, F. M. Raymo, J. F. Stoddart, P. J. Kuekes, R. S. Williams, and J. R. Heath, “Electronically configurable molecular-based logic gates,” Science, vol. 285, no. 5426, pp. 391–394, 1999. [Online]. Available: http://www.sciencemag.org/content/285/5426/391.abstract

[18] M. D. Pickett, D. B. Strukov, J. L. Borghetti, J. J. Yang, G. S. Snider, D. R. Stewart, and R. S. Williams, “Switching dynamics in titanium dioxide memristive devices,” Journal of Applied Physics, vol. 106, no. 7, 2009.

[19] M. Sah, C. Yang, R. Budhathoki, and H. Kim, “Features of memristor emulator-based artificial neural synapses,” in Circuits and Systems (ISCAS), 2013 IEEE International Symposium on, May 2013, pp. 421–424.

[20] O. Kavehei, Y.-S. Kim, A. Iqbal, K. Eshraghian, S. Al-Sarawi, and D. Abbott, “The fourth element: Insights into the memristor,” in Communications, Circuits and Systems, 2009. ICCCAS 2009. International Conference on, July 2009, pp. 921–927.

[21] S. H. Jo, T. Chang, I. Ebong, B. B. Bhadviya, P. Mazumder, and W. Lu, “Nanoscale memristor device as synapse in neuromorphic systems,” Nano Letters, vol. 10, no. 4, pp. 1297–1301, 2010.

[22] E. Keiter, H. Thornquist, R. Hoekstra, T. Russo, R. Schiek, and E. Rankin, “Parallel transistor-level circuit simulation,” in Simulation and Verification of Electronic and Biological Systems, P. Li, L. M. Silveira, and P. Feldmann, Eds. Springer Netherlands, 2011, pp. 1–21.

[23] R. Williams, “How we found the missing memristor,” Spectrum, IEEE, vol. 45, no. 12, pp. 28–35, Dec 2008.

[24] C. Yakopcic, T. Taha, G. Subramanyam, and R. Pino, “Generalized memristive device spice model and its application in circuit design,” Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on, vol. 32, no. 8, pp. 1201–1214, Aug 2013.

[25] C. Yakopcic, T. Taha, G. Subramanyam, R. Pino, and S. Rogers, “A memristor device model,” Electron Device Letters, IEEE, vol. 32, no. 10, pp. 1436–1438, Oct 2011.

[26] Y. Ota and B. Wilamowski, “Cmos architecture of synchronous pulse-coupled neural network and its application to image processing,” in Industrial Electronics Society, 2000. IECON 2000. 26th Annual Conference of the IEEE, vol. 2, 2000, pp. 1213–1218 vol.2.

[27] T. Serrano-Gotarredona, T. Prodromakis, and B. Linares-Barranco, “A proposal for hybrid memristor-CMOS spiking neuromorphic learning systems,” Circuits and Systems Magazine, IEEE, vol. 13, no. 2, pp. 74–88, Secondquarter 2013.

[28] M. Payvand, J. Rofeh, A. Sodhi, and L. Theogarajan, “A cmos-memristive self-learning neural network for pattern classification applications,” in Nanoscale Architectures (NANOARCH), 2014 IEEE/ACM International Symposium on, July 2014, pp. 92–97.

[29] H. Mostafa, F. Corradi, F. Stefanini, and G. Indiveri, “A hybrid analog/digital spike-timing dependent plasticity learning circuit for neuromorphic vlsi multi-neuron architectures,” in Circuits and Systems (ISCAS), 2014 IEEE International Symposium on, June 2014, pp. 854–857.

[30] D. Chabi, Z. Wang, W. Zhao, and J.-O. Klein, “On-chip rule for ultra high density neural crossbar using memristor for synapse and neuron,” in Nanoscale Architectures (NANOARCH), 2014 IEEE/ACM International Symposium on, July 2014, pp. 7–12.

[31] M. Holler, S. Tam, H. Castro, and R. Benson, “An electrically trainable artificial neural network (etann) with 10240 ’floating gate’ synapses,” in Neural Networks, 1989. IJCNN., International Joint Conference on, 1989, pp. 191–196 vol.2.

[32] L. Kern, “Design and development of a real-time neural processor using the intel 80170nx etann,” in Neural Networks, 1992. IJCNN., International Joint Conference on, vol. 2, Jun 1992, pp. 684–689 vol.2.

[33] J. Calvin, S. K. Rogers, D. R. Zahirniak, D. W. Ruck, and M. E. Oxley, “Characterization of the 80170nx (etann) chip sigmoidal transfer function for a device vgain=3.3v,” vol. 1965, 1993, pp. 654–661.

[34] S.-C. Wang, “Artificial neural network,” in Interdisciplinary Computing in Java Programming, ser. The Springer International Series in Engineering and Computer Science. Springer US, 2003, vol. 743, pp. 81–100.

[35] M. Mumford, D. Andes, and L. Kern, “The mod 2 neurocomputer system design,” Neural Networks, IEEE Transactions on, vol. 3, no. 3, pp. 423–433, May 1992.

[36] J. Misra and I. Saha, “Artificial neural networks in hardware: A survey of two decades of progress,” Neurocomputing, vol. 74, no. 1–3, pp. 239–255, 2010.

[37] P. Ienne, T. Cornu, and G. Kuhn, “Special-purpose digital hardware for neural networks: An architectural survey,” Journal of VLSI signal processing systems for signal, image and video technology, vol. 13, no. 1, pp. 5–25, 1996.

[38] D. Hammerstrom, A Survey of Bio-Inspired and Other Alternative Architectures. Wiley-VCH Verlag GmbH and Co. KGaA, 2010.

[39] S. Joseph, “Neurogrid: Semantically routing queries in peer-to-peer networks,” in Revised Papers from the NETWORKING 2002 Workshops on Web Engineering and Peer-to-Peer Computing. London, UK, UK: Springer-Verlag, 2002, pp. 202–214. [Online]. Available: http://dl.acm.org/citation.cfm?id=647080.714934

[40] “Facets,” http://facets.kip.uni-heidelberg.de/images/4/48/Public–FACETS-15879-Summary- flyer.pdf.

[41] P. Yalamanchili, S. Mohan, R. Jalasutram, and T. Taha, “Acceleration of hierarchical bayesian network based cortical models on multicore architectures,” Parallel Comput., vol. 36, no. 8, pp. 449–468, Aug. 2010.

[42] “Ni1000 recognition accelerator,” http://pcl.intel-research.net/publications/RMSConvergence-May08.pdf.

[43] H. Esmaeilzadeh, A. Sampson, L. Ceze, and D. Burger, “Neural acceleration for general-purpose approximate programs,” in Microarchitecture (MICRO), 2012 45th Annual IEEE/ACM International Symposium on, Dec 2012, pp. 449–460.

[44] A. Amir, P. Datta, W. Risk, A. Cassidy, J. Kusnitz, S. Esser, A. Andreopoulos, T. Wong, M. Flickner, R. Alvarez-Icaza, E. Mcquinn, B. Shaw, N. Pass, and D. Modha, “Cognitive computing programming paradigm: A corelet language for composing networks of neurosynaptic cores,” in Neural Networks (IJCNN), The 2013 International Joint Conference on, Aug 2013, pp. 1–10.

[45] S. Furber, F. Galluppi, S. Temple, and L. Plana, “The SpiNNaker project,” Proceedings of the IEEE, vol. 102, no. 5, pp. 652–665, May 2014.

[46] X. Jin, S. Furber, and J. Woods, “Efficient modelling of spiking neural networks on a scalable chip multiprocessor,” in Neural Networks, 2008. IJCNN 2008. (IEEE World Congress on Computational Intelligence). IEEE International Joint Conference on, June 2008, pp. 2812–2819.

[47] C. Yakopcic, R. Hasan, T. Taha, M. McLean, and D. Palmer, “Efficacy of memristive crossbars for neuromorphic processors,” in Neural Networks (IJCNN), 2014 International Joint Conference on, July 2014, pp. 15–20.

[48] H. Abdalla and M. Pickett, “Spice modeling of memristors,” in Circuits and Systems (ISCAS), 2011 IEEE International Symposium on, May 2011, pp. 1832–1835.

[49] T. Chang, S.-H. Jo, K.-H. Kim, P. Sheridan, S. Gaba, and W. Lu, “Synaptic behaviors and modeling of a metal oxide memristive device,” Applied Physics A, vol. 102, no. 4, pp. 857– 863, 2011.

[50] M. Laiho, E. Lehtonen, A. Russell, and P. Dudek, “Memristive synapses are becoming reality,” 2010.

[51] D. B. Strukov, G. S. Snider, D. R. Stewart, and R. S. Williams, “The missing memristor found,” Nature, vol. 453, pp. 80–83, 2008.

[52] J. G. Simmons, “Generalized formula for the electric tunnel effect between similar electrodes separated by a thin insulating film,” Journal of Applied Physics, vol. 34, no. 6, pp. 1793–1803, 1963.

[53] G. Snider, “Cortical computing with memristive nanodevices,” SciDAC Review, vol. 10, pp. 58–65, 2008.

[54] A. Afifi, A. Ayatollahi, and F. Raissi, “Implementation of biologically plausible models on the memristor crossbar-based cmos/nano circuits,” in Circuit Theory and Design, 2009. ECCTD 2009. European Conference on, Aug 2009, pp. 563–566.

[55] T. Raja and S. Mourad, “Digital logic implementation in memristor-based crossbars - a tutorial,” in Electronic Design, Test and Application, 2010. DELTA ’10. Fifth IEEE International Symposium on, Jan 2010, pp. 303–309.

[56] K.-H. Kim, S. Gaba, D. Wheeler, J. M. Cruz-Albrecht, T. Hussain, N. Srinivasa, and W. Lu, “A functional hybrid memristor crossbar-array/cmos system for data storage and neuromorphic applications,” Nano Letters, vol. 12, no. 1, pp. 389–395, 2012.

[57] C. Yakopcic, T. Taha, G. Subramanyam, R. Pino, and S. Rogers, “Analysis of a memristor based 1t1m crossbar architecture,” in Neural Networks (IJCNN), The 2011 International Joint Conference on, July 2011, pp. 3243–3247.

[58] S. H. Jo, K.-H. Kim, T. Chang, S. Gaba, and W. Lu, “Si memristive devices applied to memory and neuromorphic circuits,” in Circuits and Systems (ISCAS), Proceedings of 2010 IEEE International Symposium on, May 2010, pp. 13–16.

[59] “Ltspice,” http://www.linear.com/designtools/software.

[60] K. Gulati, J. Croix, S. Khatri, and R. Shastry, “Fast circuit simulation on graphics processing units,” in Design Automation Conference, 2009. ASP-DAC 2009. Asia and South Pacific, Jan 2009, pp. 403–408.

[61] J. Power, J. Hestness, M. Orr, M. Hill, and D. Wood, “gem5-gpu: A heterogeneous cpu-gpu simulator,” Computer Architecture Letters, vol. PP, no. 99, pp. 1–1, 2014.

[62] “Cadence ultrasim,” http://www.cadence.com/products/cic/UltraSimfullchip.

[63] “Synopsys hsim,” http://www.synopsys.com/Tools/Verification/AMSVerification/CircuitSimulation/HSIM.

[64] A. Newton and A. Sangiovanni-Vincentelli, “Relaxation-based electrical simulation,” Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on, vol. 3, no. 4, pp. 308–331, October 1984.

[65] F. Manne and R. Bisseling, “A parallel approximation algorithm for the weighted maximum matching problem,” in Parallel Processing and Applied Mathematics, ser. Lecture Notes in Computer Science, R. Wyrzykowski, J. Dongarra, K. Karczewski, and J. Wasniewski, Eds. Springer Berlin Heidelberg, 2008, vol. 4967, pp. 708–717.

[66] “Nvmspice,” http://www.nvmspice.org.

[67] R. Pino, J. Bohl, N. McDonald, B. Wysocki, P. Rozwood, K. Campbell, A. Oblea, and A. Timilsina, “Compact method for modeling and simulation of memristor devices: Ion conductor chalcogenide-based memristor devices,” in Nanoscale Architectures (NANOARCH), 2010 IEEE/ACM International Symposium on, June 2010, pp. 1–4.

[68] W. Gropp, E. Lusk, N. Doss, and A. Skjellum, “A high-performance, portable implementation of the mpi message passing interface standard,” Parallel Comput., vol. 22, no. 6, pp. 789–828, Sep. 1996.

[69] Y. Lecun and C. Cortes, “The MNIST database of handwritten digits.” [Online]. Available: http://yann.lecun.com/exdb/mnist/

[70] M. Alvira and R. Rifkin, “An empirical comparison of snow and svms for face detection,” Center for Biological and Computational Learning, MIT, Cambridge, MA, A.I. memo 2001-004, 2001.

[71] G. E. Hinton and R. R. Salakhutdinov, “Reducing the dimensionality of data with neural networks,” Science, vol. 313, no. 5786, pp. 504–507, 2006.

[72] B. Heisele, T. Poggio, and M. Pontil, “Face detection in still gray images,” Center for Biological and Computational Learning, MIT, Cambridge, MA, A.I. memo 1687, 2000.

[73] K.-K. Sung, “Learning and example selection for object and pattern recognition,” Ph.D. dissertation, MIT, Artificial Intelligence Laboratory and Center for Biological and Computational Learning, Cambridge, MA, 1996.

[74] H. A. Rowley, S. Baluja, and T. Kanade, “Neural network-based face detection,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 1, pp. 23–38, 1998.

[75] J. Foytik and V. K. Asari, “A two-layer framework for piecewise linear manifold-based head pose estimation,” Int. J. Comput. Vision, vol. 101, no. 2, pp. 270–287, Jan. 2013.
