
DISS. ETH NO. 21337

Robust online learning in neuromorphic systems with spike-based, distributed synapses

A dissertation submitted to

ETH ZURICH

for the degree of Doctor of Sciences

presented by

Fabio Stefanini
Institute of Neuroinformatics
Laurea di Dottore in Fisica, Università degli Studi di Roma “La Sapienza”

born April 19th, 1983

citizen of Italy

accepted on the recommendation of

Prof. Dr. Rodney Douglas, examiner

Prof. Dr. Giacomo Indiveri, co-examiner

Prof. Dr. Stefano Fusi, co-examiner

2013

To my wife.

Abstract

Neuromorphic Very Large Scale Integration (VLSI) hardware offers a low-power and compact electronic substrate for implementing distributed plastic synapses for artificial systems. However, the technology used for constructing analog neuromorphic circuits has limited resolution and high intrinsic variability, leading to large circuit mismatch. Consequently, neuromorphic synapse circuits are imprecise and thus the mapping of pre-defined models onto neural processing systems built with such components is practically unfeasible. This problem of variability can be avoided by off-loading learning to ad-hoc digital resources external to the synapse, but this work-around introduces communication bottlenecks and so compromises compactness and power efficiency. Here I propose a more direct solution using a system composed of aggregated classifiers and distributed, stochastic learning to exploit the intrinsic variability and mismatch of the substrate. The system consists of a feed-forward neural network with one hidden layer of neurons randomly connected with fixed weights and a pool of binary classifiers with stochastic learning. To demonstrate the system, I developed software procedures to configure the hardware classifier and used them to test its ability to learn to identify a target class. The method follows earlier works where the circuit parameters are first configured to obtain the desired behavior and then the plastic synapses are characterized by imposing prescribed pre- and post-synaptic mean activities. The classifier is then included in a simulated environment to compare the complete system to theoretical and computational models. In particular, a well-known dataset consisting of images of hand-written digits has been used as a benchmark to demonstrate with experimental results that by aggregating the responses of imprecise classifiers trained with the feed-forward network it is possible to achieve a performance which is comparable to the state of the art. The intrinsic noise present in the neuromorphic hardware and the type of distributed plastic synapses used result in an effective combination of classifiers, as described by recent machine learning techniques. This work is a crucial step towards enabling online learning through synaptic plasticity on imprecise hardware, thus providing a novel, resource-efficient substrate for machine learning. It demonstrates how a “bottom-up” approach that exploits the properties of the substrate to obtain a functioning neuromorphic system can elegantly overcome the complications due to the calibration of imprecise circuits with limited configurability, thus extending the value and feasibility of analog VLSI techniques. The biological plausibility of the system links to relevant issues in neuroscience, such as the role of intrinsic variability in probabilistic computation in the brain, and suggests possible design strategies for future emerging technologies.


Compendio

Neuromorphic hardware built with VLSI technology offers a compact, low-power substrate for implementing distributed plastic synapses in artificial systems. However, the technology currently used to realize neuromorphic circuits has low resolution and high intrinsic variability, and therefore large circuit inhomogeneity, i.e. mismatch. For this reason the neuromorphic circuits that implement the synapses are imprecise, and mapping pre-defined models onto systems built from such components becomes a practically unfeasible solution. The problem of variability can be avoided by dedicating specialized resources to the learning mechanisms, separate from the neuro-synaptic circuit block, but this option introduces communication bottlenecks and compromises circuit compactness and energy efficiency. My proposal consists of a more direct solution that uses a system composed of aggregated classifiers and distributed stochastic learning in order to exploit the intrinsic variability and mismatch of the substrate. The system consists of a feed-forward neural network with an intermediate layer of randomly connected neurons with fixed synaptic weights, and a pool of binary classifiers with stochastic learning. To demonstrate the functionality of the system, I developed the software procedures needed to configure the hardware classifiers and used them to test their ability to learn to identify a given class of stimuli. This method follows earlier works in which the circuit parameters of the neuron are first configured to obtain the desired behavior, and a characterization of the plastic synapses is then obtained by imposing prescribed pre- and post-synaptic activities. The single classifier was then included in a simulation to compare the complete classification system with the results predicted by theory and by models. In particular, a dataset of hand-written digits commonly used in computer science was used as a benchmark to demonstrate, with experimental results, that it is possible to obtain a classification performance comparable to the state of the art by combining the responses of imprecise classifiers trained with the feed-forward network. The intrinsic noise of the neuromorphic hardware and the type of synapses employed result in an effective combination of classifiers, as found in certain recent descriptions of machine learning techniques. This work represents a step forward towards the possibility of realizing online learning systems exploiting synaptic plasticity on low-precision hardware, and thus proposes a novel, resource-efficient substrate for machine learning. The work demonstrates how a “bottom-up” approach that exploits the properties of the substrate to obtain a functioning neuromorphic system can elegantly overcome the complications due to the accurate calibration of imprecise hardware with limited configurability, thereby extending the validity and practicality of analog VLSI. The biological plausibility of the system links to interesting topics in neuroscience and suggests possible design strategies for future emerging technologies.

Disclaimer

I hereby declare that the work in this thesis is that of the candidate alone, except where indicated in the text and as described below. Chapter 3 is a modified version of the paper [Chicca et al., 2013]. Chapter 4 is a modified version of the paper [Sheik et al., 2011]. The results in the appendix summarize the work carried out in collaboration with Michael Beyeler and Rahel Von Rohr during their Master and Semester projects in the Neuroscience program of the University of Zurich and ETH. The network model in Chapter 5 and the classification results in Chapter 6 have been developed in collaboration with Mattia Rigotti (New York University and Columbia University, New York City). The use of “we” in the thesis refers to the aforementioned people in the relevant sections.

Publications arising from this thesis

The work of this thesis, or part of it, has been published in journals and conference proceedings as listed below. These publications are also mentioned in the text where relevant.

• Sheik, S. and Stefanini, F. and Neftci, E. and Chicca, E. and Indiveri, G., Systematic configuration and automatic tuning of neuromorphic systems, Circuits and Systems (ISCAS), 2011 IEEE International Symposium on, 873-876 (2011)

• Indiveri, G. and Stefanini, F. and Chicca, E., Spike-based learning with a generalized integrate and fire silicon neuron, Circuits and Systems (ISCAS), Proceedings of 2010 IEEE International Symposium on, 1951-1954 (2010)

• Beyeler, M. and Stefanini, F. and Proske, H. and Galizia, G. and Chicca, E., Exploring olfactory sensory networks: simulations and hardware emulation, Biomedical Circuits and Systems Conference (BioCAS), 2010 IEEE, 270-273 (2010)

• E. Neftci, S. Sheik, F. Stefanini, G. Indiveri, A Python package for accessing, configuring, and applying electronic spiking neural networks, Frontiers in Neuroinformatics, Python In Neuroscience, Research Topic, June, 2013, Accepted

• S. Sheik, M. Pfeiffer, F. Stefanini, G. Indiveri, Spatio-temporal spike pattern classification in neuromorphic systems, International Conference on Biomimetic and Biohybrid Systems, August, 2013, Accepted

• E. Chicca, F. Stefanini, G. Indiveri, Neuromorphic electronic circuits for building autonomous cognitive systems, Proceedings of the IEEE, Submitted


Acknowledgments

Several people and circumstances have been very influential on this work and I would like to acknowledge them here. Obviously the location and the context in which I worked were critical factors. I had a great time at INI and I’m proud of having occupied the noisiest desk in the Institute, located in probably one of the noisiest spots in Bau55: it was in front of the workshop and in front of the entrance, along the aisle, right next to the Master’s room. I thank INI for having given me the opportunity to learn to classify around 100 different walking patterns and to distinguish probably 20 different English accents. In addition, I’m probably second in the rankings based on the average number of unique people I exchanged at least one word with every day. This made my stay at INI greatly enjoyable. In addition, this work has benefited from the continuous practice of giving demonstrations to some of the most prestigious scientists in this field (I can count around 20), which has been possible thanks to the INI series of Colloquia and the thorough work of its organizers. I thank the organizers of the series of Workshops held in Capo Caccia, where I spent a great time every year, acquiring knowledge, working day and night with my colleagues, and feeling part of something cool all the time. Each workshop had a strong impact on my scientific views in this field. I thank Prof. Kwabena Boahen for pointing me to the works of Coates, Prof. Ernst Niebur for the journey into the world of criticality in neural networks, and Hyunsurk Eric Ryu for insightful discussions on consumer electronics. Among all the people that contributed to making these 4.5 years of work a joyful period of my life I’d like to mention Roman and his summarizing skills – not to mention the fact that he introduced me to the practice of Wing Chun Kung Fu, Marco and his joking skills, Saber and his technical skills, Mohammad and his imitation skills, and all the “toeggli” pals and their absence of skills in playing the fusball properly. I hope I taught you something. Jonathan Binas and Harshawardhan Ramachandran contributed to proof-reading almost unreadable versions of this thesis, and to them goes my appreciation. Elisabetta had a profound influence on my choice of INI as the home for my PhD in the first place, but she also had an important role throughout the first months of my journey. I thank her for her timeless supporting presence despite the logistics, and for the stimulating conversations. I’m thankful to Emre, who has been my personal mentor for a long time and dealt with my first hassles with the hardware and the software. He not only helped me with the first steps in this whole world of neuromorphic (system) engineering at INI but has also been a charismatic model that I used as a reference throughout my studies. His coding skills had a strong impact on the course of this thesis. I thank Rodney for helping me determine key passages of my work and for having taught me the meaning of the phrase “to get to the point”. I hope to carry this concept with me for a long time through my scientific career but also in my personal life. I’m indebted to Mattia and Stefano for having kindly supported my work and shared inspirational conversations, aka delirium. Many of the ideas that make my thesis valuable come as a result of their enthusiastic way of doing and sharing great science. I’m thankful to Prof. Daniel J.
Amit, who introduced me to the world of neural networks a long time ago and wrote the book that has inspired, motivated, and energized me in every dip I got stuck in along my path. If I “sit on the shoulders of giants”, then these guys are the giants. This work would have been a fifth of what it is (both in volume and value) if Sadique hadn’t spent one hour a day (on average, with large deviations) with me in discussions about the deepest issues in neuroscience and neuromorphic engineering as well as the most superficial ones, with his honesty and his open-minded way of destroying my visions while being the source of inspiration for new ones¹. I hated him at times – and I’m sure the feeling was returned – but I always needed him to clarify my ideas, to improve them, to focus on the core of the issue we were facing and not on its appearance, to put a brake on false hopes derived from suspicious results and to construct new paths to challenge my beliefs. As in the best friendships, our fights (intellectual and physical) became the strength of our union, both professionally and personally, and this made him the ideal companion for my endeavour. I’m happy if I was able to return to him even just half of what he gave me. I’m finally thankful to my supervisor and mentor Giacomo Indiveri, whose patience, respect, moderation, temperance, appreciation for every bit of my work, positive attitude and respectful consideration towards the community I have admired. I learned from him to recognize the value of basic science in engineering and how it is possible to achieve great results and get inspirational insights on fundamental issues by keeping a humble attitude towards research. I thank him for his perseverance in following the neuromorphic approach, which convinced me to pursue the goals that are described in this work. This work would not exist without the support of a great woman who decided to follow me to another country, stood beside me every day of my studies, listened patiently to my monologues, shared with me the exciting moments as well as the tough ones and built the great family I have now. Finally, I wouldn’t be here without the efforts of my parents who put me at the starting point of this long journey in March 2009.

¹The typical location for the discussions has been the 3rd row of seats on bus 69 – I thank ZVV for the great venue and all the passengers for their patience.


Contents

1 Introduction 1
  1.1 Building self-behaving robots 1
  1.2 Neuromorphic cognitive systems 2
  1.3 Neural-inspired hardware 4
  1.4 Learning algorithms and neural networks for pattern recognition 5
  1.5 Randomly connected neurons 6
  1.6 Synaptic plasticity in real-time hardware 7
  1.7 Motivation of this thesis 8
  1.8 Thesis overview 8

2 Neuromorphic VLSI 9
  2.1 Introduction 10
  2.2 Neural dynamics in analog VLSI 10
  2.3 Silicon neurons 12
  2.4 Silicon synapses 14
    2.4.1 Homeostatic plasticity: synaptic scaling 15
  2.5 Synaptic plasticity: spike-based learning circuits 15
  2.6 From circuits to networks 18
    2.6.1 Recurrent neural networks 18
    2.6.2 Distributed multi-chip networks 20
  2.7 Experimental results 22
    2.7.1 Synaptic and neural dynamics 22
    2.7.2 Spike-based learning 23
  2.8 Discussion 25
  2.9 Conclusions 27

3 The software ecosystem 29
  3.1 Introduction 30
  3.2 A software front-end for real-time neuromorphic systems 31


  3.3 Asynchronous communication back-end 32
  3.4 Neural network definition front-end 34
  3.5 Mapping models into spiking hardware 36
  3.6 The pyTune tool-set 37
  3.7 Methods 37
    3.7.1 Chip functional blocks 38
    3.7.2 Neural network topology 38
    3.7.3 Automatically tuning system parameters 39
  3.8 Results 40
  3.9 Discussion 40
    3.9.1 Towards a neuromorphic kernel 42
    3.9.2 A black-box approach to configuration 42
  3.10 Conclusions 42

4 Configuring the hardware 45
  4.1 Introduction 46
  4.2 Neuron dynamics 47
    4.2.1 Statistical properties of the VLSI hardware neuron 47
    4.2.2 Depolarization distribution 47
  4.3 Synaptic dynamics and plasticity 48
    4.3.1 Measuring transition probabilities 51
  4.4 Exponential dynamics of learning synapses: bug or feature? 52
  4.5 Discussion 54
  4.6 Conclusions 56

5 The neural network model 57
  5.1 Introduction 58
  5.2 The feed-forward network with random projections 59
  5.3 Ensembles of hardware classifiers: theory and simulations 60
    5.3.1 Abstract model of stochastic learning synapses 61
    5.3.2 Phenomenological model of the spiking implementation 62
    5.3.3 The interplay between synaptic scaling and stop-learning 64
    5.3.4 Examples from simulations 65
    5.3.5 Combinations of weak classifiers 69
  5.4 Building the neural network in VLSI 70
    5.4.1 Random projections with a probabilistic mapper 70
    5.4.2 Testing the VLSI perceptrons 71
  5.5 Discussion 73
    5.5.1 Aggregating techniques and the stochastic learning 73
    5.5.2 The use of high-level simulations to calibrate low-level parameters 73
    5.5.3 Similarities with boosted SVMs 74
    5.5.4 Performance scaling with the number of aggregated predictors 74
  5.6 Conclusions 75

6 Classification 77
  6.1 Introduction 77
  6.2 Aggregating the linear classifiers 78
  6.3 Effects of circuit mismatch 79
  6.4 Improved predictions by corrupting the decision thresholds 81
  6.5 Discussion 81
    6.5.1 Competition emerging from lateral inhibition 82
    6.5.2 Introducing readout noise for improving readouts 83
  6.6 Conclusions 83

7 Discussion 85
    7.0.1 Classification systems in neuromorphic hardware 85
  7.1 What is the best pre-processing? 86
    7.1.1 Towards low-power cognitive hardware 87
  7.2 Computational costs 88
    7.2.1 Support-vector-machines 89
  7.3 Learning and representational power 89
  7.4 Learning attractor networks 90
  7.5 Spatio-temporal patterns 90
  7.6 Limitations of modern computers 91
  7.7 Hardware synapses and learning 92
  7.8 Biological plausibility 93
    7.8.1 Evidence for randomly connected neurons 93
    7.8.2 Sensory fusion and reward in the thalamus 93
  7.9 The de-coupling role of stochasticity and mismatch 94

8 Conclusions 95
  8.1 Outlook 96
  8.2 Lessons learned 97

Appendices 99

A Other experiments 101
  A.1 Sharpening tuning curves 101
  A.2 Neuromorphic olfaction 102
  A.3 Considerations on STDP 104
    A.3.1 Studying coincidences 105


CHAPTER 1

Introduction

To complicate is easy, to simplify is hard.

Bruno Munari

1.1 Building self-behaving robots

By formalizing the rules of thinking into the mathematical framework of logic, scientists and philosophers of all times have aimed at describing human behaviour through a set of basic mathematical operations. The development of computing hardware has made it possible to translate these formal descriptions into fast operating machines simulating behaviour. The joint efforts of engineers, mathematicians, and programmers have pushed the parallel development of computers and algorithms in a positive-feedback loop that laid the foundation for the birth of the community of computer scientists and the “artificial intelligence” field. Artificial intelligence involves computer vision, image, sound or video processing, pattern recognition, data mining, robotics, and information retrieval, and aims at the automation of tasks traditionally regarded as requiring some sort of intelligence, in order to replace human intervention. The traditional approach of artificial intelligence has faced issues concerning the construction of intelligent agents behaving in real-world, unpredictable situations where the requirements on power consumption, compactness, algorithmic complexity, and real-time capabilities are taken to the extreme. The implementation of algorithms using general purpose hardware needs large memories, fast computation, and rapid transmission of data between the memory and the Central Processing Unit (CPU). The typical engineering trend is thus to evaluate the specific needs of the application and to trade algorithmic complexity against prior knowledge of the problem accordingly, in order to reduce the overall power consumption. This approach has led to the development of a vast repertoire of intelligent devices that are rapidly changing our modern life-style. On the other hand, nature has found ways to optimize the mentioned trade-off with very different solutions that we could emulate by looking at the properties of real nervous systems.


Several researchers have proposed in the past to take inspiration from the biology of nervous systems to build better intelligent machines. The goal requires understanding the “organizing principles” of the brain [Mead, 1990] and designing techniques to embed intelligent hardware with those principles. For example, the communication between the different elements of a neural network in the brain is asynchronous, as opposed to computers where operations only happen in step with the clock cycles. This line of research led to the definition of neuromorphic engineering, a field which aims at reproducing basic properties of the biology of nervous systems in VLSI hardware by exploiting the physical properties of the electronic substrate. The neuromorphic engineering community has successfully demonstrated the realization of VLSI devices with circuits adhering to those organizing principles at different levels of faithfulness using standard Complementary Metal Oxide Semiconductor (CMOS) technology, with analog, digital and hybrid analog/digital solutions. The current trend in this field of research is to map existing models of neural networks onto reconfigurable hardware, thus exploiting the advantages of the neuromorphic approach. The attempt is to eventually demonstrate behavior in an artificial system which arises from the emulated biological processes. However, this goal has encountered several obstacles in the past due to the large variability of the electronic substrate. The latter imposes calibration procedures that are often unfeasible for large, partially observable neuromorphic systems with limited reconfigurability and has represented a fundamental limitation of the VLSI approach. One of the main challenges in the field of neuromorphic engineering consists in finding a design strategy for the realization of robust cognitive systems that employ imprecise components.

1.2 Neuromorphic cognitive systems

Several approaches have been recently proposed for building custom hardware, large-scale, brain-like neural processing architectures [Jin et al., 2010, Schemmel et al., 2008, Silver et al., 2007] but the vast majority of them are aimed at implementing fast simulations of large-scale neural networks [Jin et al., 2010, Schemmel et al., 2008, Izhikevich and Edelman, 2008, Djurfeldt et al., 2008]. The SpiNNaker project aims at building a multi-chip digital system for large scale simulations of neural networks using multi-core chips and efficient asynchronous communication. Each ARM-based CPU is designed to emulate about 1000 neurons communicating through event packets. A SpiNNaker board with 18 ARM968 processors has been recently demonstrated with a peak power consumption of 1 W [Painkras et al., 2013]. The nominal power consumption, as well as the total number of neurons and synapses that can be allocated on the hardware, depend on the specific implementation of the neural network and the total network activity, thus only rough estimates are provided. For example, an implementation of the standard Spike-Timing Dependent Plasticity (STDP) algorithm would require about 425 cycles per incoming spike per neuron, for a total computation time of about 23 ms for each synaptic update for 1000 neurons receiving spike-trains at 10 Hz [Davies, 2013]. The flexibility of the ARM-based hardware and the software infrastructure permits the definition of custom neuronal and synaptic models as well as plasticity rules, with the only constraint being the networking and processing capabilities of each core. The FACETS system and the recently announced BrainScaleS project that follows it aim at building a neuromorphic platform for fast, large-scale neural network emulations. Instead

of packaging different chips, the solution proposed by the Heidelberg group is to build a full wafer system with multiple analog neuromorphic multi-neuron cores communicating using digital, asynchronous protocols [Schemmel et al., 2008]. The system runs at an accelerated time scale (from approximately 10³ up to 10⁵ times faster than real time) and was not designed with robotic applications in mind. Calibration procedures and control tools have been implemented using a graph model [Wendt et al., 2008] that maps neural network models onto the FACETS hardware, but their applicability is restricted to that specific neuromorphic system [Ehrlich et al., 2010, Brüderle et al., 2011]. Learning capabilities are obtained through STDP, which involves local computation of spike-time differences and delegation of the synaptic update to additional circuitry external to the neuro-synaptic core. The large variability of the neuro-synaptic core has limited the mapping of functioning models even on the prototyped hardware, and learning has been demonstrated only under severely controlled conditions or applied through off-line procedures [Pfeil et al., 2013]. The Neurogrid project [Silver et al., 2007] uses programmable multi-neuron VLSI chips each emulating up to 65000 neurons. A board comprising 16 chips has been assembled and used in various experiments. Recently [Choudhary et al., 2012], the board has been used to emulate 4000 neurons with 16 M feed-forward and recurrent synaptic connections. The neural network was constructed to perform a mathematical operation over the input to demonstrate the possibility of carrying out arbitrary operations on neuromorphic hardware. The network was constructed using the Neural Engineering Framework (NEF) [Eliasmith and Anderson, 2004], a theoretical framework to translate equations and differential equations into neural network dynamics, while the Neurogrid hardware has no means to implement online learning. The ALAVLSI project in Rome resulted recently in the demonstration of working memory in a neuromorphic, analog VLSI system with asynchronous, digital communication [Giulioni et al., 2011]. The hardware system used comprises a multi-neuron chip with 128 neurons and 128 plastic synapses each [Giulioni et al., 2008]. The digital communication permits reconfiguring the system to accommodate a variety of network architectures. Using the framework of mean-field theory [Fusi and Mattia, 1999] the authors were able to realize a self-constructing attractor neural network in the neuromorphic hardware and fully characterize its properties. This work represented a crucial step in the realization of a self-constructing memory system, a fundamental building block for implementing numerous cognitive functions, such as working memory, attentional selection, choice behaviour and others [Amit and Brunel, 1997, Wang, 2002, Deco and Rolls, 2005]. (See also [Misra and Saha, 2010] and [Cattell and Parker, 2012] for an extended survey of the progress of hardware emulations during the last two decades.) These systems can be very useful tools for neuroscience modeling, e.g. by accelerating the simulation of complex computational neuroscience models. However, this work focuses on an alternative approach aimed at the realization of a neurally-inspired learning system for robotic applications, where real-world conditions are hard to predict and thus to simulate.
To this end, the system should consist of compact, real-time, and energy efficient computational devices that directly emulate the style of computation of the brain, i.e., using the physics of Silicon to reproduce the bio-physics of the neural tissue. In addition, it should include learning abilities to adapt its behavior to novel, previously unseen environments and learn relevant task-rule

associations to implement the required cognitive flexibility. This approach leads, on one hand, to the implementation of compact and power-efficient behaving systems ranging from brain-machine interfaces to autonomous robotic agents, and on the other it serves as an additional basic research instrument for exploring the dynamical and computational properties of the neural systems they emulate, to gain a better understanding of their underlying operational principles. These ideas are not new: they follow the original vision of Mead [Mead, 1990], Mahowald [Mahowald, 1992], and colleagues [Douglas et al., 1995b]. Indeed, neuromorphic circuits of the types described in this thesis have already been employed in the past to build real-time sensory-motor systems and robotic demonstrators of neural computing architectures [Horiuchi and Koch, 1999, Indiveri and Douglas, 2000, Indiveri, 2001, Lewis et al., 2003, Serrano-Gotarredona et al., 2009]. However, the devices and systems built to date following this approach have been synthesized using ad-hoc methods to implement very specific sensory-motor mappings or functionalities. The challenge that remains open is to bridge the gap from designing these types of reactive artificial neural modules to building complete neuromorphic behaving systems that are endowed with cognitive abilities.

1.3 Neural-inspired hardware

While fully custom implementations make it possible to optimize the design for the specific purpose of emulating neural networks and adhere to the original spirit of neuromorphic engineering [Mead, 1990], it is also important to evaluate the feasibility of efficient neural network simulations using available hardware. During the last few decades, several researchers have realized neurally-inspired hardware using traditional and unusual technologies. Some of the main issues with hardware implementations using commercially available technologies, such as Digital Signal Processors (DSPs) [Jung and Kim, 2007], Field Programmable Gate Arrays (FPGAs) [Cassidy and Andreou, 2008, Wood et al., 2012, Maguire et al., 2007, Wang et al., 2013] or Graphical Processing Units (GPUs) [Nageswaran et al., 2009], are the limited on-chip resources, e.g. memory, the low bandwidth to communicate spikes between the computing and memory locations or off-chip, the power requirements to run real-time computation, cumbersome packaging, and programming and configuration overhead. These difficulties challenge the realization of large-scale networks on traditional hardware, such that only simplified networks have effectively been possible. For example, in FPGA implementations, simplification strategies include simulating small networks or networks with only sparse neuronal activities, simplifying the neuronal models to dedicate the on-chip memory to synaptic computation only, or serializing the computation to simplify the design. In addition, none of the works to date have effectively implemented on-chip, online learning in real-time, due to the additional heavy memory and computation requirements of the learning mechanisms [Cheung et al., 2012, Wood et al., 2012]. Recently an FPGA implementation of a spiking neural network with delay adaptation has been proposed [Wang et al., 2013]. DARPA’s SyNAPSE initiative, started in 2008, has pushed the development of digital and hybrid analog/digital implementations of neural networks using custom designs for CMOS and nanotechnologies. A fully digital neuro-synaptic core [Arthur et al., 2012] has been demonstrated on a 0.18 µm chip simulating the activity of 256 neurons with 1024 weighted input synapses each. It makes use of fully asynchronous communication [Manohar, 2000] to optimize power

consumption, and it is not capable of modifying the synaptic weights due to the lack of any learning capability. A second neuro-synaptic core has also been proposed by IBM recently [Seo et al., 2011]. An architecture for object recognition using the above technologies has been proposed [Nere et al., 2012]. Using a synchronous technology, the core consists of 256 neurons with 256 input synapses each through a crossbar architecture, which is modified by an STDP learning rule implemented on chip. A hybrid digital/analog neural architecture recently proposed uses a memristor-based crossbar [Minkovich et al., 2012] for storing synaptic values and implementing the STDP learning, but digital overhead is needed to control the crossbar and multiplex the network’s activity [Srinivasa and Cruz-Albrecht, 2012]. The growth of the community of researchers exploring the possibility to implement neural networks in hardware, using either available technologies or customized devices, demonstrates the attraction that the simulation of brain-inspired computation has gained since the pioneering works of Carver Mead and colleagues at Caltech.

1.4 Learning algorithms and neural networks for pattern recognition

The technologies cited above can be combined with mathematical methodologies to obtain hardware implementations of intelligent systems. These two aspects of artificial intelligence are coming to an incredibly fruitful convergence; it is thus important to gain a full understanding of both the possibilities of the hardware and the required functionalities, combining theory and engineering, to succeed in the construction of intelligent machines. Support Vector Machines (SVMs) [Vapnik, 1995] are a class of learning models that have been widely used in several machine-learning problems, from isolated handwritten digit recognition [Cortes and Vapnik, 1995, Schölkopf and Burges, 1999] to face detection in images [Osuna et al., 1997], time series prediction tests [Müller et al., 1997] and others. Non-linear SVMs can perform classification of non-linearly separable data through what is known as the “kernel trick” [Aizerman et al., 1964, Cortes and Vapnik, 1995], a mathematical artifice to map a general set of data from its space into a high-dimensional space (feature space) where the data has a simpler structure, thus allowing classification. In this space the mapping reduces to computing a function of scalar products (a kernel) acting on the input space [Vapnik, 1998], so that an explicit solution for the map is not needed. Thus in machine learning problems, the task typically consists in finding the kernel function that results in the best generalization performance. Typical kernels include Radial Basis Functions (RBFs) and polynomial kernels, and are used to perform Principal Component Analysis (PCA) [Hoffmann, 2007], Fisher’s Linear Discriminant Analysis (LDA) [Mika et al., 1999] and other methods. Though very effective on difficult problems, e.g. classification with small training sets, SVMs typically require a large number of computational elements and memory [Bengio and LeCun, 2007] and cannot realistically be used on larger datasets due to the number of scalar products to compute using the kernel method. Using SVMs, it is possible to get to less than a 1% error rate [LeCun et al., 1998, Decoste and Schölkopf, 2002] on the MNIST dataset, a dataset of isolated handwritten digits commonly used as a benchmark for machine learning algorithms. Several datasets have been tried in the past obtaining good performance without much parameter tuning [Meyer

et al., 2003], a result that also shows the potential of SVM approaches. However, the typical computational cost of SVMs is very high, about 14 million multiply-adds per recognition [LeCun et al., 1998], but the Reduced Set Support Vector Machine (RS-SVM) can reduce the computational cost (about 650,000 multiply-adds per recognition) while still maintaining good classification performance [Schölkopf and Burges, 1999]. SVMs can be combined with other methods in deep learning techniques. Deep learning is a subfield of machine learning that involves using several layers of representation corresponding to layers of features. High-level features are constructed from low-level features, such that the goal of designing a deep architecture corresponds to optimizing the feature selection for the specific task, e.g. face recognition [Bengio, 2009]. Because of the “vanishing gradient” problem [Hochreiter et al., 2001], unsupervised learning techniques [Hinton et al., 2006] or Long Short Term Memory (LSTM) [Hochreiter and Schmidhuber, 1997] are typically used to reduce the space of features to be detected and eventually obtain convergent learning. The state of the art with deep learning methods is currently reached by exploiting the processing power of GPUs to solve several pattern classification tasks despite the problems of backpropagation.
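To make the kernel trick concrete, the short example below trains an RBF-kernel SVM on a small set of hand-written digit images. It is a minimal sketch assuming the scikit-learn library and its bundled 8×8 digits dataset as a stand-in for MNIST; neither the library nor the parameter values are prescribed by this thesis.

# Minimal kernel-SVM example (assumes scikit-learn; illustrative parameters only).
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

digits = load_digits()  # 1797 grey-scale 8x8 images of hand-written digits
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.3, random_state=0)

# The RBF kernel k(x, x') = exp(-gamma * ||x - x'||^2) implicitly maps inputs
# into a high-dimensional feature space; only kernel evaluations between pairs
# of input vectors are ever computed.
clf = SVC(kernel="rbf", gamma=0.001, C=10.0)
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))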

1.5 Randomly connected neurons

There is currently increasing interest in shallow architectures for several reasons. Shallow architectures are easier to train [LeCun et al., 1998] and less computationally demanding [Bengio and LeCun, 2007]. An example of a widely used shallow architecture, which can be used to extract relevant information for solving recognition tasks, consists of a single layer of non-linear units, e.g. Linear-Threshold Units (LTUs), which receive randomized input from the sensory layer. In contrast to cascades of feature extracting layers in deep architectures [Riesenhuber and Poggio, 1999], it is not straightforward to extrapolate associations between the activities of the randomly connected neurons and meaningful properties of the input stimuli. On the other hand, random connections provide an easy-to-build tool to encode input variables and stimuli and have recently been used in several works to build compact networks emulating complex behaviour. For example, the NEF [Eliasmith and Anderson, 2004] provides a mathematical methodology to encode arbitrary variables and functions of variables using large populations of randomly connected neurons with diverse response properties, and it has successfully been used in combination with neuromorphic and digital hardware to construct neural simulations and emulations of complex behaviour [Galluppi et al., 2012, Choudhary et al., 2012]. In a recent work, Tapson and van Schaik [Tapson and van Schaik, 2013] used a randomly connected layer of sigmoid units to preprocess input stimuli used for classification. The authors developed an online learning rule to calculate the synaptic weights needed to solve the classification problem. Their results show that one such layer was enough to reach a performance comparable to the state of the art, depending on the number of units in the hidden layer, with a maximum they tried of 7840 (10 times the size of the hand-written digits used as input patterns). Finally, randomly connected neurons are a particular case of Extreme Learning Machines (ELMs) [Huang et al., 2006]. This type of network involves a single layer of hidden units (not necessarily randomly connected) and, similar to the NEF, provides an analytical solution for determining the output weights. The advantage of being able to describe relevant quantities and variables with fixed randomly

connected neurons is not only practical but also theoretical. It has been shown theoretically [Cho and Saul, 2010] that SVMs that use kernels mimicking the computation of large neural networks with one layer of hidden units yield state-of-the-art results in some benchmark problems, beating not only other SVMs but also deep-belief networks.
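The sketch below illustrates the shallow architecture described in this section: a fixed, random projection onto linear-threshold units followed by a linear readout whose weights are obtained with an ELM-style closed-form (regularized least-squares) solution. The layer sizes, the regularization constant and the random data standing in for input patterns are illustrative assumptions, not values used in this thesis.

import numpy as np

rng = np.random.default_rng(0)

# Fixed random projection: the hidden weights are drawn once and never trained,
# as for the randomly connected hidden layer described above.
n_features, n_hidden = 64, 1000
W_hidden = rng.standard_normal((n_features, n_hidden))

def hidden_activity(X, threshold=0.0):
    """Linear-threshold units (LTUs) driven by the fixed random projection."""
    return (X @ W_hidden > threshold).astype(float)

def train_readout(H, Y, reg=1e-3):
    """ELM-style closed-form output weights: regularized least squares on the
    hidden-layer activity."""
    return np.linalg.solve(H.T @ H + reg * np.eye(H.shape[1]), H.T @ Y)

# Toy usage: random patterns stand in for input stimuli, one-hot vectors for labels.
X = rng.standard_normal((200, n_features))
Y = np.eye(10)[rng.integers(0, 10, size=200)]
W_out = train_readout(hidden_activity(X), Y)
predictions = (hidden_activity(X) @ W_out).argmax(axis=1)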

1.6 Synaptic plasticity in real-time hardware

A great deal of research in neuromorphic engineering has focused on the implementation of synaptic plasticity rules in analog VLSI hardware for real-time learning. Previous works have come to a state where it has been possible to emulate basic weight-change mechanisms observed in biology, e.g. STDP. One of the oldest attempts to implement synaptic plasticity consisted in reproducing STDP in hardware using subthreshold circuits [Häfliger et al., 1997]. The plasticity was obtained by using a capacitor storing the temporal correlation between pre- and post-synaptic spikes and a differential pair to adjust the synaptic weight accordingly. A weight-dependent STDP circuit has been used to implement synchrony detection with bimodal synapses [i Petit and Murray, 2004] using a similar circuit design. A 32×32 array of excitatory neurons with 21 STDP synapses each has been implemented in 0.25 µm CMOS technology [Arthur and Boahen, 2006]. The chip was used to show plasticity-enhanced phase-coding, a mechanism proposed as a model for place cell formation and computation in the hippocampus. The STDP circuit involved two capacitors integrating the pre- and post-synaptic spiking activity and a control part changing the state of the bi-stable synapse at every pre- or post-synaptic spike according to the value of the integration with respect to a threshold. Circuits enabling STDP plasticity on CMOS synapses typically require integration over long time constants as required by the learning rule; however, it is possible to obtain STDP-like properties without this integration. A voltage-dependent, spike-based synaptic plasticity rule shows these properties and has also been implemented in several generations of VLSI chips at the Institute of Neuroinformatics (INI) in Zurich [Indiveri et al., 2006, Chicca et al., 2013]. CMOS synapses are typically easy to configure and to design to obtain low-power memory arrays. However, the circuits for plasticity usually require large capacitors to implement the long time-constants required by the plasticity rule. More compact solutions have been recently proposed [Seo et al., 2011] using fully digital, synchronous architectures. A Floating Gate MOSFET (FGMOS) is a field-effect transistor in which the charge stored on the insulated gate can be changed by exploiting the electron tunneling effect through the insulator [Kahng and Sze, 1967]. FGMOSs have been proposed as compact models of synapses due to their ability to store weights and implement synaptic plasticity rules [Harrison et al., 1998] using a single transistor. One of the drawbacks of FGMOSs is the need for specific additional circuitry to control the weight changes through tunneling and to perform fine-tuned calibrations to compensate for the heavy effects of transistor mismatch. The use of FGMOSs thus basically involves outsourcing the synaptic plasticity mechanism from the synaptic array to some other parts of the chip. A different technology using similar architectural principles has gained much attention during the last decade. The memristor [Chua, 1971, Chua and Kang, 1976] is a two-terminal device that forms at the point of contact of two nano-wires in a Silicon substrate. The resistance of a memristor can be changed by applying certain voltages at the nodes and can span several orders of magnitude. It has thus been proposed as a replacement for CMOS synapses

due to its compactness and its configurability [Jo et al., 2010]. However, the device physics of the memristor is still poorly understood and the challenge of obtaining reliable control of large memristor arrays has seen only a few success stories. Some successful approaches involve reduction of the full range of possible resistivities into a small number of ranges (typically 4 or 5), continuous control of the synaptic weights through online STDP by bulky external circuitry, and fast store/restore cycles (typically 100 per second) of the synaptic matrix into a local memory [Minkovich et al., 2012, Pershin and Ventra, 2010].
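For reference, the snippet below sketches the basic pair-based STDP rule that the circuits discussed in this section emulate in various forms: an exponential weight change whose sign depends on the temporal order of the pre- and post-synaptic spikes. The window amplitudes and time constants are hypothetical values chosen only for illustration; the hardware rules described in this thesis differ in their details.

import numpy as np

# Hypothetical STDP window parameters (illustration only).
A_PLUS, A_MINUS = 0.01, 0.012        # potentiation / depression amplitudes
TAU_PLUS, TAU_MINUS = 20e-3, 20e-3   # window time constants [s]
W_MIN, W_MAX = 0.0, 1.0              # weight bounds

def stdp_update(w, t_pre, t_post):
    """Pair-based STDP: potentiate when the pre-synaptic spike precedes the
    post-synaptic one (causal pairing), depress otherwise."""
    dt = t_post - t_pre
    if dt > 0:
        w += A_PLUS * np.exp(-dt / TAU_PLUS)
    else:
        w -= A_MINUS * np.exp(dt / TAU_MINUS)
    return float(np.clip(w, W_MIN, W_MAX))

# Example: a causal pair (pre at 10 ms, post at 15 ms) potentiates the synapse.
w_new = stdp_update(0.5, t_pre=0.010, t_post=0.015)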

1.7 Motivation of this thesis

Recent advancements in neuromorphic engineering suggest that it will be possible in the near future to reproduce in hardware neural networks of the size of a small animal’s cortex, a target often referred to as “brain-scale”. In order to allow such large systems to show complex behaviour while interacting with the environment, it is important to equip the system with synaptic plasticity and to map robust neural network models onto the hardware. While circuits for synaptic plasticity have been demonstrated on analog VLSI hardware, fully operational systems solving typical machine learning tasks, for example, have not been shown to date. The purpose of this thesis is to propose a general framework to assess the learning capabilities of neuromorphic hardware and to propose a neural network based on those capabilities. To this end, we identified as a key goal the demonstration of real-time pattern recognition using neuromorphic hardware with online learning capabilities. The challenges addressed by this thesis are twofold. On one hand, configuration procedures and corresponding software tools have to be implemented in order to assess the hardware capabilities and performance and to set its internal parameters to the desired regime. On the other hand, these procedures have to be used to validate a specific neural network model that can be supported by the neuromorphic hardware while showing enough robustness to be practically employed in real, out-of-the-lab scenarios.

1.8 Thesis overview

This thesis is structured as follows. Chapter 2 gives an overview of the neuromorphic hardware used in this work and partially characterized during my studies. Chapter 3 presents the software tools used to control and communicate with the neuromorphic hardware. Chapter 4 describes the procedures used to configure the neuromorphic circuits implementing synaptic plasticity, addressing the problem of spike-based learning in hardware that is needed to build the neuromorphic classifier. Chapter 5 describes the neural network model and characterizes the properties of ensembles of linear classifiers that use the spike-based plasticity. In Chapter 6 the neural network is used to classify hand-written digits. Chapter 7 provides an extensive discussion of the work and Chapter 8 concludes the thesis. The Appendix collects summarized results from several side projects that have been addressed during the time of this thesis.

CHAPTER 2

Neuromorphic VLSI

Be like water making its way through cracks. Do not be assertive, but adjust to the object, and you shall find a way around or through it. If nothing within you stays rigid, outward things will disclose themselves.

Bruce Lee

The neuromorphic engineering approach aims at emulating fundamental biophysical properties of the nervous system, termed organizing principles, in VLSI hardware by exploiting the physical properties of the Silicon substrate [Mead, 1989]. This approach leads to low-power, compact devices that reproduce neuronal [Indiveri et al., 2011] and synaptic dynamics [Bartolozzi and Indiveri, 2004], as well as several forms of synaptic plasticity required for memory formation, such as long-term modifications [Fusi et al., 2000] and adaptation [Bartolozzi and Indiveri, 2006]. While neuromorphic implementations of such functionalities can be used for reproducing aspects of cortical computation at large scales with considerable speed-ups compared to traditional computer simulations [FACETS], they can instead be optimized for robotic applications, where real-time interaction with sensory systems is a requirement, and compactness and power consumption are the main issues. In this Chapter, a set of building blocks that comprises circuits for spiking neurons, dynamic synapses, synaptic plasticity and event-based communication is presented. These elements are integrated in a reconfigurable neuromorphic system to implement several models of cortical-like computation. The implementation of such models, which has been possible with the contribution of the work presented in this thesis, represents the success of several decades of research in the field aiming towards the construction of bio-inspired, self-adaptive behaving systems. In particular we summarize here the work of the last 5 to 7 years of research at INI.


2.1 Introduction

Machine simulation of cognitive functions has been a challenging research field since the advent of digital computers. However, despite the large efforts and resources dedicated to this field, humans, mammals, and many other animal species including insects still outperform the most powerful computers in relatively routine functions such as sensory processing, motor control and pattern recognition. The disparity between conventional computing technologies and biological nervous systems is even more pronounced for tasks involving autonomous real-time interactions with the environment, especially in the presence of noisy and uncontrolled sensory input. One important aspect is that the computational principles followed by the nervous system are fundamentally different from those of present day computers. Rather than using Boolean logic, precise digital representations and clocked operations, nervous systems carry out robust and reliable computation using unreliable hybrid analog/digital processing elements; they emphasize distributed, event-driven, collective, and massively parallel mechanisms and make extensive use of adaptation, self-organization and learning. The circuits described in this chapter comprise a set of fundamental building blocks for the realization of this type of neuromorphic cognitive system. We show how these building blocks, based on dynamic synapse circuits, hardware models of spiking neurons, and spike-based plasticity circuits, can be integrated to form multi-chip spiking recurrent and Winner-Take-All neural networks, which in turn have been proposed as neural models for explaining pattern recognition [Senn and Fusi, 2005, Brader et al., 2007], working memory [Giulioni et al., 2011, Renart et al., 2003], decision making [Deco and Rolls, 2005, Schöner and Dineva, 2007] and state-dependent computation [Rutishauser and Douglas, 2009, Rigotti et al., 2010a] in the brain.

2.2 Neural dynamics in analog VLSI

In order to interact with the environment in real-time and process real-world sensory signals efficiently, neuromorphic behaving systems must use circuits that have biologically plausible time constants (i.e. of the order of tens of milliseconds). In this way, they are well matched to the signals they process, and are inherently synchronized with real-world events. This constraint is not easy to satisfy using analog VLSI technology. Standard analog circuit design techniques either lead to bulky and silicon-area expensive solutions [Rachmuth et al., 2011] or fail to meet this condition, resorting to modeling neural dynamics at “accelerated”, unrealistic time-scales [Wijekoon and Dudek, 2008, Schemmel et al., 2007]. One elegant method to alleviate this problem is to use current-mode design techniques [Tomazou et al., 1990] and log-domain subthreshold circuits [Liu et al., 2002, Mitra et al., 2010]. When Metal Oxide Semiconductor Field Effect Transistors (MOSFETs) are operated in the subthreshold domain, the main mechanism of carrier transport is diffusion, as it is for ions flowing through protein channels across neuron membranes. As a consequence, MOSFETs have an exponential relationship between gate-to-source voltage and drain current, and produce currents that range from femto- to nano-Amperes. In this domain it is therefore possible to integrate relatively small capacitors in VLSI circuits, to implement temporal filters that are both compact and have biologically realistic time constants, ranging from less than 10 to hundreds of milliseconds. A very compact subthreshold log-domain circuit that can reproduce biologically plausible

Figure 2.1: Log-domain DPI circuit diagram. Red arrows show the translinear loop considered in the analysis.

temporal dynamics is the Differential Pair Integrator (DPI) circuit [Bartolozzi and Indiveri, 2007, Bartolozzi et al., 2006], shown in Fig. 2.1. The equations that characterize this circuit are:

\[
I_{out} = I_0\, e^{\kappa V_C / U_T}, \qquad I_C = C\,\frac{d}{dt} V_C, \qquad I_{in} = I_1 + I_2, \qquad I_2 = I_\tau + I_C \qquad (2.1)
\]

where the term I0 represents the transistor dark current, UT represents the thermal voltage and κ the subthreshold slope factor [Liu et al., 2002]. Thanks to its log-domain characteristics, it is possible to derive the DPI circuit response properties by applying the translinear principle [Gilbert, 1996]: if we take into account the loop made by the arrows in Fig. 2.1, in which the sum of voltage-differences is zero, we can write: Ith · I1 = I2 · Iout. Then, by replacing I1 and expanding I2 from eq. (2.1) we get:

\[
I_{th} \cdot (I_{in} - I_\tau - I_C) = (I_\tau + I_C) \cdot I_{out} \qquad (2.2)
\]

Thanks to the properties of exponential functions, we can express IC as a function of Iout:

\[
I_C = \frac{C\, U_T}{\kappa\, I_{out}}\, \frac{d}{dt} I_{out} \qquad (2.3)
\]

By replacing IC from this equation and dividing everything by Iτ in eq. (2.2), we get:

\[
\tau \left(1 + \frac{I_{th}}{I_{out}}\right) \frac{d}{dt} I_{out} + I_{out} = \frac{I_{th}\, I_{in}}{I_\tau} - I_{th} \qquad (2.4)
\]

where τ ≜ C·UT/(κ·Iτ). This is a first-order non-linear differential equation that cannot be solved analytically. However, in the case of sufficiently large input currents (i.e. Iin ≫ Iτ) the term −Ith on the right side of eq. (2.4) can be neglected. Furthermore, under this assumption and starting from an initial condition Iout = 0, Iout will increase monotonically and eventually the condition Iout ≫ Ith will be met. In this case also the term Ith on the left side of eq. (2.4) can be neglected. So the full Iout non-linear equation (2.4) reduces to:

\[
\tau\, \frac{d}{dt} I_{out} + I_{out} = \frac{I_{in}\, I_{th}}{I_\tau} \qquad (2.5)
\]

This is a first-order linear differential equation which is essential for reproducing faithfully the

dynamics of synaptic transmission observed in biological synapses [Destexhe et al., 1998]. Note that because of the circuit symmetry, the translinear principle used to derive eq. (2.5) can be applied without making any approximations: e.g., if n- and p-MOSFETs have the same κ factors, these terms cancel out when making the analogy between sum of voltages and product of currents. This is not the case for alternative log-domain integrator circuits proposed in the literature, where the κ term cannot be eliminated, and introduces non-idealities [van Schaik et al., 2010b,a, Yu and Cauwenberghs, 2010], or where special “isolated-well” pFET devices are required to remove it [Arthur and Boahen, 2006]. It is therefore possible to design compact layouts and complementary versions of the circuit (e.g., swapping n- with pFET devices) directly, for example to easily implement inhibitory or excitatory synapse designs. As we will show in the next sections, this non-linear log-domain filter is extremely useful for implementing the relevant dynamics in both silicon neurons and silicon synapses.
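A simple way to see the behaviour implied by eq. (2.5) is to integrate it numerically: the DPI acts as a first-order low-pass filter whose output relaxes exponentially, with time constant τ = C·UT/(κ·Iτ), towards Iin·Ith/Iτ. The sketch below does this with forward-Euler integration; all parameter values are illustrative and are not measured from any particular chip.

import numpy as np

# Illustrative circuit parameters (not measured values).
C, U_T, KAPPA = 1e-12, 25e-3, 0.7     # capacitance [F], thermal voltage [V], slope factor
I_TAU, I_TH = 10e-12, 100e-12         # leak and threshold currents [A]
tau = C * U_T / (KAPPA * I_TAU)       # filter time constant, here a few milliseconds

dt, T = 1e-5, 0.2                     # Euler step and total duration [s]
steps = int(T / dt)
I_in = np.zeros(steps)
I_in[steps // 4:] = 1e-9              # step input of 1 nA after 50 ms

# Forward-Euler integration of  tau * dI_out/dt + I_out = I_in * I_th / I_tau  (eq. 2.5)
I_out = np.zeros(steps)
for k in range(1, steps):
    dI = (I_in[k] * I_TH / I_TAU - I_out[k - 1]) / tau
    I_out[k] = I_out[k - 1] + dt * dI  # exponential rise towards I_in * I_th / I_tau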

2.3 Silicon neurons

Several VLSI implementations of conductance-based models of neurons have been proposed in the past [Mahowald and Douglas, 1991, Dupeyron et al., 1996, Alvado et al., 2004, Simoni et al., 2004]. Given their complexity, these circuits require significant silicon real-estate and a large number of bias voltages or currents to configure the circuit properties. Simplified Integrate-and-Fire (I&F) models would require far fewer transistors and parameters but would fail at reproducing the rich repertoire of behaviors of more complex ones [Izhikevich, 2003, Brette and Gerstner, 2005]. Recently proposed generalized I&F models, however, have been shown to capture many of the properties of biological neurons and require fewer and simpler differential equations compared to the standard Hodgkin-Huxley (HH) model [Brette and Gerstner, 2005, Mihalas and Niebur, 2009]. Their computational simplicity and compactness make them valuable options for VLSI implementations [Folowosele et al., 2009, Wijekoon and Dudek, 2008, Livi and Indiveri, 2009]. We describe here a generalized I&F neuron circuit originally presented in [Livi and Indiveri, 2009], which makes use of the DPI circuit described in the previous Section and which represents an optimal compromise between circuit complexity and computational power: the circuit is compact, both in terms of transistor count and layout size; it is low-power; it has biologically realistic time constants; and it implements refractory period and spike-frequency adaptation, which are key ingredients for producing resonances and oscillatory behaviors often emphasized in more complex models [Izhikevich, 2003, Mihalas and Niebur, 2009]. The circuit schematic is shown in Fig. 2.2. It comprises an input DPI circuit used as a low-pass

filter (ML1−3), a spike-event generating amplifier with current-based positive feedback (MA1−6), a spike reset circuit with refractory period functionality (MR1−6) and a spike-frequency adaptation mechanism implemented by an additional DPI low-pass filter (MG1−6). The input DPI models the neuron’s leak conductance, producing exponential sub-threshold dynamics in response to constant input currents; the integrating capacitor Cmem models the neuron’s membrane capacitance; the positive-feedback circuits in the spike-generation amplifier model both Sodium channel activation and inactivation dynamics; the reset and refractory period circuit models the Potassium conductance functionality; and the spike-frequency adaptation DPI circuit models the neuron’s

Calcium conductance that produces the after-hyper-polarizing current Iahp, proportional to the neuron’s mean firing rate.



Figure 2.2: Adaptive exponential I&F neuron circuit schematic. The input DPI circuit (ML1−3) models the neuron’s leak conductance. A spike event generation amplifier (MA1−6) implements current-based positive feedback (modeling both sodium activation and inactivation conductances) and produces address-events at extremely low-power operation. The reset block (MR1−6) resets the neuron and keeps it in a resting state for a refractory period, set by the Vref bias voltage. An additional low-pass filter (MG1−6) integrates the spikes and produces a slow after hyper-polarizing current Iahp responsible for spike-frequency adaptation.

By applying the current-mode analysis of Section 2.2 to both the input and the spike-frequency adaptation DPI circuits, we derive the complete equation that describes the neuron’s subthreshold behavior:

\left(1 + \frac{I_{th}}{I_{mem}}\right) \tau \frac{d}{dt} I_{mem} + I_{mem}\left(1 + \frac{I_{ahp}}{I_{\tau}}\right) = I_{mem_{\infty}} + f(I_{mem})
\qquad
\tau_{ahp} \frac{d}{dt} I_{ahp} + I_{ahp} = I_{ahp_{\infty}}\, u(t) \qquad (2.6)

where Imem is the sub-threshold current that represents the neuron’s membrane potential variable,

Iahp is the slow variable responsible for the spike-frequency adaptation mechanisms, and u(t) is a step function that is unity for the period in which the neuron spikes and null in other periods.

The term f(Imem) is a function that depends on both the membrane potential variable Imem and the positive-feedback current Ia of Fig. 2.2:

f(I_{mem}) = \frac{I_{a}}{I_{\tau}}\,(I_{mem} + I_{th}) \qquad (2.7)

In [Indiveri et al., 2010] the authors measured Imem experimentally and showed how f(Imem) could be fitted with an exponential function of Imem. The other parameters of eq. (2.6) are defined as:

\tau \triangleq \frac{C\,U_T}{\kappa I_{\tau}}, \qquad \tau_{ahp} \triangleq \frac{C_{p}\,U_T}{\kappa I_{\tau_{ahp}}}, \qquad I_{\tau} \triangleq I_0\, e^{\kappa V_{lk}/U_T}, \qquad I_{\tau_{ahp}} \triangleq I_0\, e^{\kappa V_{lkahp}/U_T},
I_{mem_{\infty}} \triangleq \frac{I_{th}}{I_{\tau}}\,(I_{in} - I_{ahp} - I_{\tau}), \qquad I_{ahp_{\infty}} \triangleq \frac{I_{th_{ahp}}}{I_{\tau_{ahp}}}\, I_{Ca}

where Ith and Ithahp represent currents through n-type MOSFETs not present in Fig. 2.2, defined as I_{th} \triangleq I_0\, e^{\kappa V_{thr}/U_T} and I_{th_{ahp}} \triangleq I_0\, e^{\kappa V_{thrahp}/U_T} respectively. In addition to emulating the calcium-dependent after-hyperpolarization Potassium currents observed in real neurons [Connors et al., 1982], the spike-frequency adaptation block MG1−6 reduces


power consumption and bandwidth usage in networks of these neurons. For values of Iin ≫ Iτ we can make the same simplifying assumptions made in Section 2.2. Under these assumptions, and ignoring the adaptation current Iahp, eq. (2.6) reduces to:

\tau \frac{d}{dt} I_{mem} + I_{mem} = \frac{I_{th}}{I_{\tau}}\, I_{in} + f(I_{mem}) \qquad (2.8)

where f(I_{mem}) \approx \frac{I_{a}}{I_{\tau}}\, I_{mem}. So under these conditions, the circuit of Fig. 2.2 implements a generalized I&F neuron model [Jolivet et al., 2004], which has been shown to be extremely versatile and capable of faithfully reproducing the action potentials measured from real cortical neurons [Badel et al., 2008, Naud et al., 2009]. Indeed, by changing the biases that control the neuron’s time-constants, refractory period, and spike-frequency adaptation dynamics, this circuit can produce a wide range of spiking behaviors ranging from regular spiking to bursting (see Section 2.7). While this circuit can express dynamics with time constants of hundreds of milliseconds, it is also compatible with fast asynchronous digital circuits (e.g., < 100 nanosecond pulse widths), which are required to build large spiking neural network architectures (see the /REQ and /ACK signals of Fig. 2.2 and Section 2.6). This allows us to integrate multiple neuron circuits in event-based VLSI devices and construct large distributed re-configurable neural networks.
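As an illustration of the reduced dynamics of eq. (2.8), the sketch below integrates the membrane equation with an exponential positive-feedback term (the text above notes that f(Imem) is well fitted by an exponential of Imem), a hard spike threshold, reset, and an absolute refractory period. All numerical values are assumptions chosen only to produce plausible firing; they are not fits to the silicon neuron.

# Euler integration of tau*dImem/dt + Imem = (Ith/Itau)*Iin + f(Imem) (eq. 2.8),
# with f(Imem) modeled as an assumed exponential positive-feedback term.
import numpy as np

dt, T = 1e-5, 0.5              # time step and duration [s]
tau = 20e-3                    # membrane time constant [s]
Ith, Itau = 1e-9, 0.5e-9       # hypothetical bias currents [A]
Iin = 5e-9                     # constant input current [A]
I_spk, dI = 100e-9, 5e-9       # spike threshold and feedback slope [A]
t_refr = 4e-3                  # absolute refractory period [s]

Imem, refr, spikes = 1e-12, 0.0, []
for k in range(int(T / dt)):
    if refr > 0:               # hold the neuron at reset during the refractory period
        refr -= dt
        continue
    f = Ith * np.exp(Imem / dI)                      # assumed exponential feedback
    Imem += dt * (-Imem + (Ith / Itau) * Iin + f) / tau
    if Imem > I_spk:                                 # spike emission and reset
        spikes.append(k * dt)
        Imem, refr = 1e-12, t_refr

print("mean firing rate: %.1f Hz" % (len(spikes) / T))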

2.4 Silicon synapses

Synapses are fundamental elements for computation and information transfer in both real and artificial neural systems, and play a crucial role in neural coding and learning. While modeling the non-linear properties and the dynamics of real synapses can be extremely onerous for Software (SW) simulations (e.g., in terms of computational power and simulation time), neuromorphic Hardware (HW) can faithfully reproduce synaptic dynamics using pulse (spike) integrators. As synapse circuits integrate their corresponding input spikes in a parallel manner, the neural network emulation time does not depend on the number of synapses involved, and the synapse and network response always happen in real-time. The resulting fully-parallel computation, a primitive also observed in biological systems, allows the system to process input information at a speed which is independent of the dimensionality of the input. An example of a full excitatory synapse circuit is shown in Fig. 2.1. This circuit, based on the DPI circuit described in Section 2.2, produces biophysically realistic Excitatory Post Synaptic Currents (EPSCs), and can express short term plasticity, N-Methyl-D-Aspartate (NMDA) voltage gating, and conductance-based behaviors. The input spike (the voltage pulse Vin) is applied to both MD3 and MS3. The output current Isyn, sourced from MD6 and through MG2, rises and decays exponentially with time. The temporal dynamics are implemented by the DPI block

MD1−6. The circuit time constant is set by Vτ while the synaptic efficacy, which determines the EPSC amplitude, depends on both Vw0 and Vthr [Bartolozzi and Indiveri, 2007].

2.4.1 Homeostatic plasticity: synaptic scaling

Synaptic scaling is a homeostatic mechanism active in biological neural systems to stabilize the network’s activity. It operates by globally scaling the neuron’s synaptic weights to keep its firing

rate within a functional range, in the face of chronic changes of its activity level, while preserving the relative differences between individual synapses [Turrigiano et al., 1998]. In VLSI, synaptic scaling is an appealing, biologically inspired means to address technological issues such as mismatch, temperature drifts or long-lasting dramatic changes in the input activity level.

Thanks to its independent controls on synaptic efficacy set by Vw and Vthr, the DPI synapse of Fig. 2.1 is compatible with both conventional spike-based learning rules and homeostatic synaptic scaling mechanisms. Specifically, while learning circuits can be designed to locally change the synaptic weight by acting on the Vw of each individual synapse (e.g., see Section 2.5), it is possible to implement adaptive circuits that act on the Vthr of all the synapses connected to a given neuron to keep its firing rate within desired control boundaries. This strategy has recently been demonstrated in [Bartolozzi and Indiveri, 2009].
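The following sketch abstracts this synaptic-scaling strategy at a purely behavioral level: a slow controller adjusts one global gain factor (playing the role of the shared Vthr bias) so that the output rate of a crude rate-model neuron stays inside a target range, while the relative differences between the individual weights are untouched. The rate model and all numbers are illustrative assumptions, not a model of the circuit.

# Homeostatic scaling sketch: only the global 'gain' is adapted, never the w_i.
import numpy as np

rng = np.random.default_rng(0)
w = rng.uniform(0.2, 1.0, size=64)       # individual (plastic) synaptic weights
rates_in = rng.uniform(5.0, 50.0, 64)    # pre-synaptic mean rates [Hz]
gain = 1.0                               # global scaling factor (homeostatic variable)
target_lo, target_hi = 20.0, 40.0        # desired output-rate range [Hz]
eta = 1e-3                               # slow adaptation rate

def output_rate(gain):
    # crude linear-threshold rate model of the post-synaptic neuron (assumed)
    return max(0.0, 0.05 * gain * np.dot(w, rates_in) - 10.0)

for _ in range(20000):
    r = output_rate(gain)
    if r > target_hi:        # too active: scale all synapses down
        gain -= eta * gain
    elif r < target_lo:      # too quiet: scale all synapses up
        gain += eta * gain

print("final rate %.1f Hz, gain %.3f" % (output_rate(gain), gain))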

2.5 Synaptic plasticity: spike-based learning circuits

One of the key properties of biological synapses is their ability to exhibit different forms of plasticity. The type of plasticity circuits that we describe here model Long Term Potentiation (LTP) and Long Term Depression (LTD) [Abbott and Nelson, 2000]. These plasticity mechanisms produce long-term changes in the synaptic strength of individual synapses in order to form memories and learn about the statistics of the input stimuli. In neuromorphic VLSI chips, implementations of long-term plasticity mechanisms allow us to implement learning algorithms and set synaptic weights automatically, without requiring dedicated external read and write access to each individual synapse. However, when considering physical implementations of synapses, both biological and electronic, it is important to realize that synaptic weights are always bounded (i.e. they cannot grow indefinitely or assume negative values) and that it is technically impossible to store synaptic weight values with infinite precision for indefinitely long times. These facts impose strong constraints on the capacity of networks of neurons, and on their ability to form new memories (learn to recognize new patterns) while preserving memories already stored in their synaptic weights [Fusi, 2002]. It has been demonstrated that the optimal strategy for protecting previously stored memories in neural networks with spike-based plasticity, and therefore maximizing their capacity, is to use synapses with a discrete number of stable states and with stochastic transitions from one stable state to the other [Fusi and Abbott, 2007]. Specifically, it was demonstrated that by modifying only a random subset of the network synapses with a small probability, memory lifetimes increase by a factor inversely proportional to the probability of synaptic modification [Fusi, 2002]. In addition, the probability of synaptic transitions can be used as a free parameter to set the trade-off between the speed of learning and the memory capacity. This optimal strategy can be implemented very efficiently in VLSI if one uses synapse circuits with two stable states and exploits the variability of the input spike trains as the source of stochasticity for the transition of the synaptic weights to the LTD or LTP stable states [Mitra et al., 2009]. The circuit shown in Fig. 2.3a implements such a weight-update mechanism. The circuit comprises three main blocks: an input stage MI1−2, a spike-triggered weight update block ML1−4, and a bi-stability weight storage/refresh block (see transconductance amplifier in Fig. 2.3a). The input stage receives spikes from pre-synaptic neurons and triggers increases or decreases in weights, depending on the two signals VUP and VDN generated downstream by the

post-synaptic plasticity circuits (described below). The bi-stability weight refresh circuit is a positive-feedback amplifier with very small “slew-rate” that compares the weight voltage Vw to a set threshold Vthw and slowly drives it toward one of the two rails Vwhi or Vwlo, depending on whether Vw > Vthw or Vw < Vthw respectively. This bi-stable drive is continuous and its effect is superimposed on that of the spike-triggered weight update circuit. The analog, bi-stable, synaptic weight voltage Vw is then used to set the amplitude of the EPSC generated by the synapse circuit (e.g., the circuit shown in Fig. 2.1).

The two signals VUP and VDN that determine whether to increase or decrease the synaptic weight are shared globally among all synapses afferent to a neuron. They can be produced by a single post-synaptic circuit that, for example, sets their values depending on the relative timing of the pre- and post-synaptic spikes, following a standard STDP prescription. STDP mechanisms [Abbott and Nelson, 2000] that update the synaptic weight values based on the relative timing of pre- and post-synaptic spikes can be implemented very effectively in VLSI technology [Indiveri et al., 2006, Bofill-i-Petit and Murray, 2004, Fusi et al., 2000, Häfliger et al., 1997]. However, while standard STDP mechanisms can be effective in learning to classify spatio-temporal spike patterns [Gütig and Sompolinsky, 2006, Arthur and Boahen, 2006], these algorithms and circuits are not suitable for encoding information represented both in a spike-correlation code and in a mean-rate code without spike correlations [Senn, 2002]. For this reason, rather than implementing the classical form of STDP based solely on the timing of pre- and post-synaptic spikes, we implemented a plasticity circuit that depends on the membrane potential of the post-synaptic neuron and on a slow “Calcium” variable computed by integrating the post-synaptic neuron spikes, in addition to the timing of the pre-synaptic spikes [Mitra et al., 2009]. This implementation is based on the plasticity mechanism proposed in [Brader et al., 2007], which has been shown to be able to classify patterns of mean firing rates, to capture the rich phenomenology observed in neurophysiological experiments on synaptic plasticity, and to reproduce the classical STDP phenomenology. This algorithm can be used to implement unsupervised and supervised learning protocols, and to train neurons to act as perceptrons or binary classifiers [Senn and Fusi, 2005]. Typically, input patterns are encoded as sets of spike trains that stimulate the neuron’s input synapses with different mean frequencies, while the neuron’s output firing rate represents the binary classifier output. This circuit is shown in Fig. 2.3b. The spikes produced by the post-synaptic neuron are integrated by the DPI circuit MD1−5 to produce the VCa voltage. This signal represents the neuron’s Calcium concentration and is a measure of the recent spiking activity of the neuron.

Three current-mode winner-take-all circuits [Lazzaro et al., 1989] MW1−19 are used to compare VCa to the three thresholds Vthk1, Vthk2, and Vthk3. In parallel, the neuron’s membrane potential Vmem is compared to a fixed threshold Vthm by a voltage comparator. The outcomes of these comparisons set VUP and VDN such that, whenever a pre-synaptic spike Vspk reaches the synapse weight-update block:

V_w \rightarrow V_w + \Delta w \quad \text{if } V_{mem} > V_{thm} \text{ and } V_{thk1} < V_{Ca} < V_{thk3}
V_w \rightarrow V_w - \Delta w \quad \text{if } V_{mem} < V_{thm} \text{ and } V_{thk1} < V_{Ca} < V_{thk2} \qquad (2.9)

where ∆w is a factor that depends on V∆w of Fig. 2.3b, and is gated by the eligibility traces VUP or VDN. If none of the conditions above are met, ∆w is set to zero by setting VUP = Vdd and VDN = 0.



Figure 2.3: Spike-based learning circuits. (a) Pre-synaptic weight-update module (present at each synapse). (b) Post-synaptic stop-learning control circuits (present at the soma).


Figure 2.4: Silicon neuron diagram. This is a schematic representation of a typical circuit block comprising a Soma (e.g., the I&F circuit described in Sec. 2.3) receiving input from several Synapses (e.g., the DPI and learning circuits described in Sec. 2.4 and Sec. 2.5 respectively). In addition to the synaptic currents, an external input current can be directly injected into the neuron Soma. Adaptation and learning mechanisms can coexist both at the level of the single synapse (e.g., local STDP mechanisms), and at the level of the whole neuron (e.g., intrinsic plasticity and homeostatic plasticity mechanisms).


The conditions on VCa implement a “stop-learning” mechanism that gives the system optimal generalization performance by preventing over-fitting when the input pattern has already been learned [Senn and Fusi, 2005]. For example, when the pattern stored in the synaptic weights and the input pattern are highly correlated, the post-synaptic neuron will fire with a high rate and

VCa will rise such that VCa > Vthk3, and no more synapses will be modified. In [Mitra et al., 2009] we show how such circuits can be used to carry out classification tasks with a supervised learning protocol, and characterize the performance of these types of VLSI learning systems. Additional experimental results from these circuits are presented in Section 2.7.
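A behavioral abstraction of this stochastic, bi-stable weight dynamics is sketched below: on each pre-synaptic spike the weight jumps up or down according to the stop-learning conditions of eq. (2.9), evaluated here on surrogate membrane and Calcium states, and between spikes a continuous drift consolidates the weight toward one of its two stable states once it crosses the threshold θ. All probabilities and parameters are made-up values for illustration; they do not describe the measured circuit statistics.

# Stochastic bi-stable synapse sketch (abstraction of Fig. 2.3, assumed values).
import numpy as np

rng = np.random.default_rng(1)
dt, T = 1e-3, 2.0                  # time step and trial duration [s]
theta, w_lo, w_hi = 0.5, 0.0, 1.0  # bistability threshold and stable states
dw, drift = 0.08, 0.2              # spike-driven jump and bistability drift
pre_rate = 60.0                    # pre-synaptic Poisson rate [Hz]

w = 0.8                            # start from the potentiated side
for _ in range(int(T / dt)):
    # surrogate post-synaptic state (assumed probabilities for a depressing regime)
    V_above = rng.random() < 0.3   # P(Vmem > Vthm)
    Ca_ok_up = rng.random() < 0.5  # P(Vthk1 < VCa < Vthk3)
    Ca_ok_dn = rng.random() < 0.5  # P(Vthk1 < VCa < Vthk2)
    if rng.random() < pre_rate * dt:          # a pre-synaptic spike arrives
        if V_above and Ca_ok_up:
            w += dw
        elif (not V_above) and Ca_ok_dn:
            w -= dw
    w += dt * drift * (1.0 if w > theta else -1.0)   # continuous bistability drive
    w = min(max(w, w_lo), w_hi)

print("final weight state:", "potentiated" if w > theta else "depressed")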

2.6 From circuits to networks

The silicon neuron, synapse, and plasticity circuits presented in the previous Sections can be combined together to form full networks of spiking neurons. Typical spiking neural network chips have the elements described in Fig. 2.4. These elements can be hard-wired on chip to create specific (and fixed) network topologies [Chicca, 1999] (see Fig. 2.5a), and/or arranged in a cross-bar architecture as in Fig. 2.5b [Indiveri et al., 2006, Arthur et al., 2012].

2.6.1 Recurrent neural networks

In the most general Recurrent Neural Networks (RNN) each neuron is connected to every other neuron (fully recurrent network). Unlike feed-forward networks, the response of RNNs depends not only on the external input but also on their internal dynamics, which in turn are determined by the connectivity profile. Thus, specific changes in connectivity, for example through learning, can tune the RNN behavior, which corresponds to the storage of internal representations of different external stimuli. This property makes RNNs suitable for implementing, among other functions, associative memories [Amit and Fusi, 1994], working memory [Mongillo et al., 2003], and context-dependent decision making [Rigotti et al., 2010a]. There is reason to believe that, despite significant variation across cortical areas, the pattern of connectivity between cortical neurons is similar throughout neocortex. This would imply that the remarkably wide range of capabilities of the cortex is the result of the specialization of different areas with similar structures to the various tasks [Douglas et al., 1989, Douglas and Martin, 2004]. An intriguing hypothesis about how computation is carried out by the brain is the existence of a


finite set of computational primitives used throughout the cerebral cortex. If we could identify these computational primitives and understand how they are implemented in hardware, then we would make a significant step toward understanding how to build brain-like processors. There is an accumulating body of evidence suggesting that one potential computational primitive consists of a RNN with a well defined excitatory/inhibitory connectivity pattern [Douglas and Martin, 2004], typically referred to as a soft Winner-Take-All (sWTA) network. In sWTA neural networks, groups of neurons compete with each other in response to an input stimulus. The neurons with the highest response suppress all other neurons to win the competition. Competition is achieved through a recurrent pattern of connectivity involving both excitatory and inhibitory connections. Cooperation between neurons with similar response properties (e.g., close receptive fields or stimulus preference) is mediated by excitatory connections. Competition and cooperation make the output of an individual neuron depend on the activity of all neurons in the network and not just on its own input [Douglas et al., 1995a]. As a result, sWTAs perform not only common linear operations but also complex non-linear operations [Douglas and Martin, 2007]. The linear operations include analog gain (linear amplification of the feed-forward input, mediated by the recurrent excitation and/or common mode input), and locus invariance [Hansel and Sompolinsky, 1998]. The non-linear operations include non-linear selection [Amari and Arbib, 1977, Dayan and Abbott, 2001, Hahnloser et al., 2000], signal restoration [Dayan and Abbott, 2001, Douglas et al., 1995b], and multi-stability [Amari and Arbib, 1977, Hahnloser et al., 2000]. The computational abilities of these types of networks are of great importance in tasks involving feature extraction, signal restoration and pattern classification [Maass, 2000]. For example, localized competitive interactions have been used to detect elementary image features (e.g., orientation) [Ben-Yishai et al., 1995, Somers et al., 1995]. In these networks, each neuron represents one feature (e.g., vertical or horizontal orientation); when a stimulus is presented, the neurons cooperate and compete to enhance the response to the features they are tuned to and to suppress background noise. When sWTA networks are used for solving classification tasks, common features of the input space can be learned in an unsupervised manner. Indeed, it has been shown that competition supports unsupervised learning because it enhances the firing rate of the neurons receiving the strongest input, which in turn triggers learning on those neurons [Bennett, 1990]. sWTA networks implemented using the dynamic synapses and spiking neurons described in the previous Sections also offer the possibility to explore in real-time different network and transient dynamics (e.g., by changing connection weights, time constants or other circuit parameters), thus serving as useful computational neuroscience tools for the exploration of neural processing and learning mechanisms, while implementing real-time behaving models of complex neural processing systems such as the olfactory systems of the locust and zebra-fish [Rabinovich et al., 2008].
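The cooperation/competition mechanism described above can be summarized with a simple rate-based sketch: excitatory units receive recurrent excitation from themselves and their nearest neighbours and are all inhibited by a single global inhibitory unit, so the units receiving the strongest input win the competition. The weights, gains and inputs below are illustrative values, not parameters of the VLSI implementation.

# Rate-based soft winner-take-all sketch (not the spiking circuit model).
import numpy as np

N, steps, dt, tau = 16, 2000, 1e-3, 0.02
w_exc, w_inh, w_gain = 0.6, 1.2, 1.5     # recurrent excitation, inhibition, input gain
x = np.zeros(N)                          # excitatory unit rates
inh = 0.0                                # global inhibitory unit rate
inp = np.ones(N)
inp[5], inp[6] = 2.0, 1.8                # two units receive slightly stronger input

def relu(v):
    return np.maximum(v, 0.0)

for _ in range(steps):
    rec = w_exc * (x + 0.5 * (np.roll(x, 1) + np.roll(x, -1)))   # local cooperation
    x += dt * (-x + relu(w_gain * inp + rec - w_inh * inh)) / tau
    inh += dt * (-inh + x.sum()) / tau                           # global competition

print("winning unit:", int(np.argmax(x)), "rates:", np.round(x, 2))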

2.6.2 Distributed multi-chip networks

The modularity of the cortex described in the theoretical works and suggested by the experimental observations mentioned above constitutes a property of great importance for the scalability of the system. If we understood the principles by which such computational modules are arranged together, and what type of connectivity allows for efficient communication across large distances, we would be able to build scalable systems, i.e. systems whose properties are qualitatively reproduced at all scales.



Figure 2.5: sWTA network topology. (a) Schematic representation of the connectivity pattern of the sWTA network. These connections are implemented by synapses with hardwired connections to pre- and post-synaptic neurons. Empty circles represent excitatory neurons and the filled circle represents the global inhibitory neuron. Solid/dashed lines represent excitatory/inhibitory connections. Connections with arrowheads are mono-directional, all the others are bidirectional. Only 8 excitatory neurons are shown for simplicity. (b) Chip architecture. Squares represent excitatory (E) and inhibitory (I) synapses, small unlabeled trapezoids represent I&F neurons. The I&F neurons transmit their spikes off-chip and/or to locally connected synapses implementing the network topology depicted in (a). Adapted from [Chicca et al., 2007].


The idea of modularity poses some technological questions as to how the communication between the systems should be implemented. Large VLSI networks of I&F neurons can already be implemented on single chips using today’s technology. However, implementations of pulse-based neural networks on multi-chip systems offer greater computational power and higher flexibility than single-chip systems, and constitute a tool for exploring the scalability properties of neuromorphic systems. Because inter-chip connectivity is limited by the small number of input-output connections available with standard chip packaging technologies, it is necessary to adopt time-multiplexing schemes for constructing large multi-chip networks. This scheme should also allow for an asynchronous type of communication, where information is transmitted only when available and computation is performed only when needed, in a distributed, non-clocked manner.

In recent years, we have witnessed the emergence of a new asynchronous communication standard that allows analog VLSI neurons to transmit their activity across chips using pulse-frequency modulated signals (in the form of events, or spikes). This standard is based on the Address Event Representation (AER) communication protocol [Mahowald, 1992]. In AER, input and output signals are real-time asynchronous digital events (spikes) that carry analog information in their temporal relationships (inter-spike intervals). Since the activity of the VLSI neurons is sparse and has typical firing rates that range from a few spikes per second to a few hundred spikes per second, the speed of digital buses (tens of megahertz) allows the outputs of many VLSI neurons firing at these biologically typical rates to be multiplexed over one AER bus. To handle cases in which multiple sending nodes attempt to transmit their addresses at exactly the same time (event collisions), on-chip arbitration schemes have been developed [Mahowald, 1992, Boahen, 2000, 2004].
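As a rough, purely illustrative estimate (the numbers below are assumptions, not measurements of a specific bus), the number of neurons that can share a single AER bus scales as the ratio between the event rate the bus sustains and the mean firing rate per neuron:

N_{\mathrm{neurons}} \approx \frac{R_{\mathrm{bus}}}{\bar{f}_{\mathrm{neuron}}} = \frac{10 \times 10^{6}\ \mathrm{events/s}}{100\ \mathrm{spikes/s}} = 10^{5},

ignoring arbitration and hand-shaking overhead.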

The AER solution provides the possibility to implement arbitrary custom multi-chip architectures, with flexible connectivity schemes. Address events can encode the address of the sending node (the spiking neuron) or of the receiving one (the destination synapse). The connectivity between different nodes can be set by using external digital devices and is typically defined as a look-up table with source and destination address pairs. This asynchronous digital solution permits flexibility in the configuration (and re-configuration) of the network topology, while keeping the computation analog and low-power at the neuron and synapse level.
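The role of such a look-up table is illustrated by the sketch below, in which each source address event arriving on the bus is translated into one or more destination synapse events. The addresses and the routing function are hypothetical place-holders for illustration, not the actual mapper firmware or its data format.

# Mapper look-up table sketch: source neuron address -> destination synapse addresses.
from collections import defaultdict

lut = defaultdict(list)
lut[0x0102].extend([0x2001, 0x2002])   # one source neuron drives two synapses
lut[0x0103].append(0x2003)

def route(event_stream):
    # translate (timestamp, source_address) events into destination events,
    # preserving the order in which they arrive on the bus
    for t, src in event_stream:
        for dst in lut.get(src, []):
            yield (t, dst)

events = [(0.001, 0x0102), (0.002, 0x0103), (0.003, 0x0102)]
print(list(route(events)))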

In order to promptly explore the computational properties of different types of large-scale multi-chip computational architectures, we developed a dedicated HW and SW infrastructure, which allows a convenient, user-friendly way to define, configure, and control in real-time the properties of the HW spiking neural networks, as well as a way to monitor in real-time their spiking and non-spiking activity. The custom HW aspects of this infrastructure are described in [Fasnacht et al., 2008, Fasnacht and Indiveri, 2011], while the SW ones are presented in [Sheik et al., 2011]. In addition to configuring and controlling the neuromorphic multi-chip setups, the SW ecosystem developed provides dynamic parameter estimation methods [Neftci et al., 2011, 2012a] as well as automatic methods for measuring and setting circuit-level parameters and user-defined network-level cost-functions [Sheik et al., 2011].


Figure 2.6: Membrane potential of I&F neuron in response to a 50 Hz pre-synaptic input spike train for different values of short-term depression adaptation rate, which is controlled by Vstd bias (see Fig. 2.1). The dashed trace in background corresponds to the response without STD. Black dots correspond to input spike-times.

2.7 Experimental results

The circuits and architectures described in this chapter have been designed and developed over the course of several years. Therefore the experimental data presented in this Section has been collected from multiple neuromorphic VLSI devices and systems. The results presented demonstrate the correct behavior of the circuits described in the previous Sections.

2.7.1 Synaptic and neural dynamics

To show the combined effect of synaptic and neural dynamics, we stimulated a silicon neuron via an excitatory DPI synapse circuit, while sweeping different Short-Term Depression (STD) parameter settings. The typical phenomenology of STD manifests as a reduction of EPSC amplitude with each presentation of a pre-synaptic spike, with a slow (e.g., of the order of 100 ms) recovery time [Markram and Tsodyks, 1996]. In Fig. 2.6 we plot the neuron’s membrane potential

Vmem during the stimulation of one of its excitatory synapses with a regular pre-synaptic input spike train of 50 Hz, for different STD adaptation settings. Small settings of the STD bias voltage have little or no effect. But for larger settings of this bias voltage the effect of STD is prominent: the synaptic efficacy decreases with multiple input spikes to the point where the net input current to the soma becomes lower than the neuron’s leak current, thus making the neuron’s membrane potential decrease, rather than increase, over time. Another important adaptation mechanism discussed in Section 2.3 is that of spike-frequency adaptation. To show the effect of this mechanism, we set the relevant bias voltages appropriately, stimulated the silicon neuron with a constant input current, and measured its membrane potential.

Figure 2.7 shows an example response to the step input current, in which Vlkahp = 0.05 V, Vthrahp = 0.14 V, Vahp = 2.85 V. As shown, we were able to tune the adaptation circuits so as to produce bursting behavior. This was achieved by simply increasing the gain of the negative-feedback adaptation mechanism (Vthrahp > 0). This is equivalent to going from an asymptotically stable regime to a marginally stable one that produces ringing in the adaptation current Iahp, which in turn produces bursts in the neuron’s output firing rate. This was possible due to the flexibility of the DPI circuits, which allow us to take advantage of the extra control parameter

Vthrahp, in addition to the adaptation rate parameter Vahp, and the possibility of exploiting its non-linear transfer properties as described in Section 2.4, without requiring extra circuits or dedicated resources that alternative neuron models have to use [Folowosele et al., 2009, Mihalas and Niebur, 2009, Wijekoon and Dudek, 2008].


Figure 2.7: Silicon neuron response to a step input current, with spike frequency adaptation mechanism enabled and parameters tuned to produce bursting behavior. The figure inset represents a zoom of the data showing the first 6 spikes.

Figure 2.8: Stochastic transitions in synaptic states. The non-plastic synapse is stimulated with a Poisson distributed spike train. The neuron fires at an average rate of 30 Hz. The pre-synaptic input (Vpre) is stimulated with Poisson distributed spike trains with a mean firing rate of 60 Hz. The updates in the synaptic weight produced an LTD transition that remains consolidated. VH and VL show the potentiated and depressed levels respectively, while w denotes the synaptic weight and θ the bi-stability threshold. Adapted from [Mitra et al., 2009].

2.7.2 Spike-based learning

In this section we present measurements from the circuits implementing the spike-based learning mechanism described in Section 2.5. To stimulate the synapses we generated pre-synaptic input spike trains with Poisson distributions. Similarly, the post-synaptic neuron was driven by a current produced via a non-plastic synapse (a DPI circuit with a constant synaptic weight bias voltage) stimulated by software-generated Poisson spike trains. These latter inputs are used to drive the I&F neuron towards different activity regimes which regulate the probabilities of synaptic transitions [Fusi and Mattia, 1999, Brader et al., 2007], effectively modulating the learning rate in unsupervised learning conditions, or acting as teacher signals in supervised learning conditions. The Poisson nature of the spike-trains used in this way represents the main source of variability required for implementing stochastic learning, as described in [Brader et al., 2007]. In Fig. 2.8 we show measurements from a stochastic learning experiment in which the neuron is driven

to a regime where both potentiation and depression are possible but depression has a higher probability of occurring. As shown, the weight voltage undergoes both positive and negative changes, depending on the timing of the input spikes and the state of the post-synaptic neuron (as explained in Section 2.5). In addition, the weight voltage is slowly driven toward one of the two stable states, depending on whether it is above or below the threshold θ (where θ corresponds to the voltage

Vthw of Fig. 2.3a). Long-term transitions typically occur only when the weight crosses θ. In the case of the experiment of Fig. 2.8 an LTD transition has occurred, after a 400 ms presentation of the input stimulus. This mechanism ensures that only a fraction of the stimulated synapses undergo long-term modifications and, more generally, that synapses are not modified when not in the right regime (e.g., during spontaneous activity), making the circuits robust to noise. In Fig. 2.9a we show the results of another stochastic learning experiment in which we stimulated the post-synaptic neuron with a high-frequency Poisson-like spike train through a non-plastic excitatory input synapse, in order to produce Poisson-like firing statistics in the output. The dashed line on the Vmem plot represents the learning threshold voltage Vthm of Fig. 2.3b. The VUP (active low) and VDN (active high) signals are the same shown in Fig. 2.3b and represent the currents that change the synaptic values when triggered by pre-synaptic spikes. They can be considered as eligibility traces that enable the weight update mechanism when they are active. In Fig. 2.9b we show the results of an experiment where we trained a matrix of 28 × 124 = 3472 plastic synapses, constituting the dendritic tree of a neuron, with multiple presentations of the same input pattern representing the “INI” acronym. Initially all the neuron’s input synaptic weights are set to their low state (black pixels). Then the post-synaptic neuron is driven by a teacher signal that brings the neuron’s membrane potential distribution into the right regime, while input synapses are stimulated according to the image pattern: in the input image (top left image), each white pixel represents a Poisson spike train of approximately 55 Hz, sent to the corresponding synapse. Similarly, each black pixel represents a low rate spike train (6 Hz) which is transmitted to its corresponding synapse. By design, elements of the input matrix that correspond to a white pixel have a high probability of making a transition to the potentiated state, while black pixels do not tend to produce any changes in the synaptic matrix. By repeating the presentation of the initial input pattern multiple times, this pattern gets gradually “stored” in the synaptic matrix. The bottom left image of Fig. 2.9b represents the synaptic matrix at the end of the experiment. Note that the “stop-learning” mechanism described in Section 2.5, together with the stochastic nature of the input patterns, changes only a random subset of the synapses, leaving the rest of them available for storing different patterns. The above experiments demonstrate the properties of the learning circuits implemented in the VLSI chips. In a feed-forward configuration, the neuron can be controlled by an external spiking teacher signal, which indirectly controls the transition probabilities. This “perceptron-like” configuration allows the realization of supervised learning protocols for building real-time classification engines. But, as opposed to conventional perceptron-like learning rules, the spike-triggered weight updates implemented by these circuits overcome the need for explicit control (e.g., using error back-propagation) of every individual synapse. In “Hopfield-network”-like RNN configurations the same neuron and plasticity circuits can implement Attractor Neural Network (ANN) learning schemes [Giulioni et al., 2008, 2011], exploiting the neural network dynamics to form memories through stochastic synaptic updates, without the need for explicit random generators at each synapse.



Figure 2.9: Stochastic learning and stop-learning. (b) An “INI” pattern (the 3472 active inputs are represented by white pixels in the top left plot) is repeatedly presented to the synaptic matrix while the output neuron is firing at ∼ 50 Hz. Because of the stochastic nature of learning, a random subset of weights are updated. As the stimulus is presented over and over again, more and more synapses get potentiated (see right panels, from top to bottom). White (black) pixels represent high (low) synaptic values after learning.

2.8 Discussion

In this chapter we presented a set of low-power hybrid analog/digital circuits that can be used as basic building blocks for constructing adaptive, fully-parallel, real-time neuromorphic architectures. While several other projects have already developed dedicated hardware implementations of spiking neural networks, using analog [Wijekoon and Dudek, 2008], digital [Furber and Temple, 2007] and mixed-mode analog/digital [Schemmel et al., 2008] approaches, few [Horiuchi and Koch, 1999, Boahen, 2005, Sarpeshkar, 2006, Hynna and Boahen, 2009] follow the neuromorphic approach originally proposed in the early nineties [Mead, 1990]. The foundations of this neuromorphic approach were established by pointing out that the implementation of efficient hardware models of biological systems requires the use of transistors in the subthreshold analog domain and the exploitation of the physics of the VLSI medium. We argue that the circuits and architectures presented here adhere to this approach, and can therefore be used to build efficient biophysically realistic real-time neural processing architectures and autonomous behaving systems. One common criticism of this subthreshold analog VLSI design approach is that systems built this way have a high degree of noise and device mismatch. On one hand, this criticism is not accurate, as subthreshold current-mode circuits have lower noise energy (noise power times bandwidth), and superior energy efficiency (bandwidth over power), than above-threshold

ones [Sarpeshkar et al., 1993, Shi, 2009]. On the other hand, while in principle it would be possible to minimize the effect of device mismatch following standard electrical engineering approaches and appropriate analog VLSI design techniques, this might not be the best strategy to follow for neuromorphic systems. Standard mismatch reduction techniques would lead to very large transistor designs, which would in turn significantly reduce the number of neurons and synapses integrated onto a single chip (see for example [Rachmuth et al., 2011], where a whole device was used to implement a single synapse). Rather than attempting to reduce mismatch using brute-force engineering techniques at the chip design level, we propose to take advantage of the digital asynchronous AER communication infrastructure, such as the one described in Section 2.6.2, to implement efficient event-based mismatch reduction techniques [Neftci and Indiveri, 2010]. In addition, we argue that the adaptation mechanisms and learning circuits described in Sections 2.3-2.5 can be used to design neural architectures that either exploit the variability present in the system [Sheik et al., 2012, Merolla and Boahen, 2006], or compensate for the inhomogeneities in the circuits. Work in this direction has already been carried out: for example it has been shown how spike-based learning and plasticity can be used to reduce the effects of device mismatch [Cameron and Murray, 2008]. Similarly, synaptic scaling homeostatic mechanisms compatible with the DPI synapse of Section 2.4 have been proposed to compensate for drift or for the presence of large differences among the circuits (e.g., due to faults, or changing of components in a multi-chip system) [Bartolozzi and Indiveri, 2009]. In decision-making studies, synaptic plasticity mechanisms analogous to the ones described in Section 2.5 have been proposed as a potential solution for compensating for the cellular and synaptic inhomogeneities present in the nervous system [Fusi et al., 2007]. The use of these strategies and mechanisms in VLSI systems allows neuromorphic engineers to build efficient computing systems by integrating very compact (but inaccurate) circuits into large dense arrays, rather than designing systems based on small numbers of very precise (but large) computing elements. This strategy of combining large numbers of imprecise circuits and computing elements to carry out robust computation is also compatible with a wide set of traditional machine learning approaches which work on the principle of combining the output of multiple inaccurate computational modules with slightly different properties, to optimize classification performance and achieve or even beat the performance of single accurate and complex learning systems [Jacobs et al., 1991, Breiman, 2001]. A set of similar theoretical studies showed that the coexistence of multiple time-scales of synaptic plasticity (e.g., present due to mismatch in the time-constants of the DPI synapse circuits) can dramatically improve the memory performance of ANNs [Fusi et al., 2005]. The coexistence of slow and fast learning processes has been shown to be crucial for reproducing the flexible behavior of animals in context-dependent decision-making tasks and the corresponding single cell recordings in a neural network model [Fusi et al., 2007].
More generally, intrinsic variability and diverse activation patterns are often identified as fundamental aspects of neural computation for information maximization and transmission [Maass et al., 2002, Rigotti et al., 2010a, Shew et al., 2011, Schneidman et al., 2003]. These strategies are naturally implemented in both real biological systems and in the physical devices and systems that we proposed, and they represent a fundamental organizing primitive shared by these two worlds.
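A toy numerical illustration of this ensemble argument (unrelated to any hardware data) is sketched below: combining many weak, mismatched binary classifiers by majority vote yields a far more accurate aggregate decision than any individual unit.

# Majority voting over independent, weakly accurate binary classifiers.
import numpy as np

rng = np.random.default_rng(2)
n_classifiers, n_trials = 31, 10000
p_correct = rng.uniform(0.55, 0.70, n_classifiers)   # each unit is only weakly accurate

# each classifier is right with its own probability, independently on every trial
votes = rng.random((n_trials, n_classifiers)) < p_correct
majority_correct = votes.sum(axis=1) > n_classifiers / 2

print("mean single-classifier accuracy: %.2f" % p_correct.mean())
print("majority-vote accuracy:         %.2f" % majority_correct.mean())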

Building cognitive systems using noisy and inhomogeneous subthreshold analog VLSI circuits might appear as a daunting task. But we argue that the neural circuits and architectures presented

in this chapter represent an essential set of building blocks that pave the way toward this goal: the circuits we proposed, and analogous neuromorphic circuits proposed in the literature [Indiveri et al., 2011], have already been used to build efficient, low-power, scalable computing systems that can interact with the environment [Silver et al., 2007, Neftci et al., 2010, Choudhary et al., 2012], learn about the input signals they have been designed to process [Mitra et al., 2009], and exhibit adaptive abilities analogous to those of the biological systems they model [Indiveri, 2003, Bartolozzi and Indiveri, 2009, Mill et al., 2011]. We showed in this chapter how the sWTA networks and circuits presented can implement models of working memory and decision making (thanks to their selective amplification and reverberating activity properties), which are often associated with high-level cognitive abilities. Multi-chip systems employing these architectures can reproduce the results of a diverse set of theoretical studies based on models of sWTA and ANN to demonstrate cognitive properties: for example, Schöner and Sandamirskaya [Schöner and Dineva, 2007, Sandamirskaya and Schöner, 2010] link the types of neural dynamics described in Section 2.6 to cognition by applying similar network architectures to sensory motor processes and sequence generation; Rutishauser and Douglas [Rutishauser and Douglas, 2009] show how the sWTA networks described in this chapter can be configured to implement finite state machines and conditional branching between behavioral states [Neftci et al., 2012b]; Rigotti and colleagues [Rigotti et al., 2010a,b] describe neural principles, compatible with the ones implemented by the circuits described in Section 2.5, for constructing recurrent neural networks able to produce context-dependent behavioral responses; Giulioni and colleagues [Giulioni et al., 2011] demonstrate working memory in a spiking neural network implemented using the same type of silicon neuron circuits and plasticity mechanisms [Giulioni et al., 2008] described in Sections 2.3 and 2.5.

2.9 Conclusions

The construction of neuromorphic systems able to solve cognitive tasks in real-world environments requires the development of a hardware infrastructure which is capable of implementing the desired neural network models while providing enough flexibility to switch models for testing their functionality in real-world scenarios. The solution proposed here includes analog-domain circuits for neuronal and synaptic dynamics and digital communication for constructing the desired neural network. The analog dynamics of the neurons make it possible to obtain low-power, compact designs emulating some bio-physical properties of biological models of neurons and synapses. The DPI circuit, a current-based, sub-threshold, log-domain CMOS circuit, has been adopted to reproduce synaptic and neuronal dynamics while obtaining compact, low-power circuits with the long time-constants needed to match biological time-scales. Similarly, synaptic plasticity with stochastic dynamics is obtained by using bi-stable synapses and spike-triggered, current-based circuits. The firing activity in the neural network with weakly connected neurons is exploited as the source of stochasticity required for the slow-learning mechanism; thus random generators are not required, which results in compact designs for the synapse circuits. Because of the use of digital communication, Commercial Off-The-Shelf (COTS) devices available from the consumer market could be used to implement reconfigurable connectivity, thus reducing total costs and workload. The neuromorphic chips are compatible with an asynchronous AER communication scheme, such that eventually custom devices for optimizing the communication

could be introduced in the future as a replacement for the inexpensive but non-optimal, synchronous communication implemented with COTS devices. All the described elements have been integrated in the neuromorphic system used to carry out the work of this thesis.

CHAPTER 3

A software ecosystem for the spiking neuromorphic hardware

All the work that I have done in my life will be obsolete by the time I am 50. This is a field where one does not write a principia, which holds up for two hundred years.

Steve Jobs

A wide range of electronic spiking neural networks are currently being developed to implement fast, low-power, and compact alternatives to software simulations run on general purpose computers. Typically the software used to interact with these systems and provide a higher-level abstraction useful for experimentation is written on a case-by-case basis. However, many components of such software are common to most of the implementations proposed in the literature. One possible solution to this inconvenience is to identify the commonalities between the different design principles of the proposed systems, in order to provide hardware descriptions and permit rapid and seamless interfacing between the different setups. We present pyNCS, a python package that provides a framework to facilitate the configuration of, and flexible experimentation with, hardware implementations of spiking neural networks. The pyNCS package assumes that the electronic neural network is composed of a series of modules that can be connected to each other to form the complete neural processing system. These modules are typically represented by different types of VLSI chips (e.g. comprising sensors, different types of analog and/or digital multi-neuron chips). But they could also be multiple “cores” in a multi-core device, multiple dedicated processors in a digital implementation, or combinations of the above. pyNCS is based on an API that uses an abstraction of the neural hardware resources to provide parameter configuration, connectivity configuration and communication functions. This abstraction, achieved on the basis of an eXtensible Markup Language (XML) description of the system, permits the description of the neural system in terms of populations and connections. Like pyNN, a common programming interface to multiple software simulators, pyNCS enables a procedural description of a hardware

neural network and the prototyping of experiments that can be reused for several neuromorphic chips. But unlike pyNN, the main goal of pyNCS is to provide online interaction with real-time spike-based multi-chip systems. Here the software is described in detail and its functionalities are demonstrated in conjunction with the neuromorphic multi-chip system. We demonstrate pyNCS in controlling an analog VLSI neuromorphic chip and show the use of pyNCS as a pyNN module. pyNCS is open-source software and available from http://inincs.github.com/pyNCS/

3.1 Introduction

The rapid development of neuromorphic hardware platforms and devices poses important issues regarding the software and hardware tools used to map models onto the reconfigurable hardware implementations. These tools should provide full interaction with the system to explore different configurations and control the communication between devices to construct multi-chip setups that fulfill the requirements of the desired neural network topology. The typical approach is to hard-code the specific functionalities of the hardware in the software tools. In the case of digital hardware using general purpose computers [Furber and Temple, 2007], the mapping consists in the compilation of the neural network definition code into machine code, e.g. assembly. This compilation layer is intermediate between the network definition and the actual instantiation of the neural network on the hardware. In the case of large-scale, hybrid analog-digital neuromorphic systems [Schemmel et al., 2008, Silver et al., 2007], the compilation of the code is not possible. Instead, the mapping of the neural network definition onto the hardware substrate consists of interpreting the requirements of the defined model and obtaining the desired functionality by considering the specific hardware capabilities (e.g., network topology constraints, parameter model constraints, neuronal placement and routing) and informing the custom driver interfaces about the actual configuration to put in place. This design strategy optimizes the mapping in terms of speed and accuracy, because it relies on a detailed knowledge of the hardware constraints, but a dedicated software module has to be re-written for every new neuromorphic device, even though this module will most probably share many parts with previously written code. Given the experimental nature of neuromorphic devices in academic research, it is hard to define design standards upon which to design new neuromorphic devices. The differentiation of spiking multi-neuron chips helps exploration of the possibilities of the technology but naturally leads to coding overhead due to the need for a specific control infrastructure for each device. A way to reduce this overhead is to define design standards while allowing the differentiation, in a way similar to commercially available technology, e.g., graphic cards for computers. An alternative, valuable possibility is to realize a modular software infrastructure that separates the front-end interface to the user from the back-end interface to the drivers. In this case, only minimal parts of the code have to be adjusted to control a new neuromorphic device. The specifications related to the neuromorphic hardware can be collected in a simple format, e.g., a text file using a mark-up language, which is platform independent and can be read and filled in by the actual hardware designers. These types of choices are the basis of pyNCS, a collection of software tools and Application Programming Interfaces (APIs) to control the parameters of the spiking neuromorphic hardware and monitor its activity. Here we describe the core parts of the software tools and provide some examples. Our software tool-chain has already allowed a

rapid integration of new devices with existing multi-chip setups at the INI. Its code will soon be released as open-source under the GPL license. A design philosophy similar to the one of pyNCS has been successfully adopted in the development of pyNN [Davison et al., 2008], a simulator-independent language for the definition of neuronal network models. The modular structure of pyNN permits separating the neural network definition APIs from the actual simulator. This structure permits choosing the optimal simulator according to the specific needs of the simulation, e.g., a fast but not memory efficient simulator or, conversely, a slower but more efficient one. The practical advantage of pyNN is thus the possibility of re-using existing code describing the neural network model, i.e., neuron models and topologies. A Hardware Abstraction Layer (HAL) has been implemented to permit control of hardware simulators using the pyNN interface [Brüderle et al., 2009]. The main functionality of PyHAL is encapsulated by a hardware access class which implements the exchange layer between the front-end objects and the low-level hardware objects. The hardware access layer performs the translation from biological parameters like reversal potentials, leakages, synaptic time constants and weights to the available set of hardware configuration parameters. This abstraction layer has been designed mainly around the particular philosophy of a hardware infrastructure used for neural network simulations. For example, the event-based communication with the hardware is limited to a run() function. The use of real-time neuromorphic hardware, on the other hand, requires online interaction with the system, and thus continuous communication should be allowed.

3.2 A software front-end for real-time neuromorphic systems

By developing a neuromorphic ecosystem, i.e., a generic infrastructure for interfacing neuromorphic devices and providing real-time interaction with them, it is possible to re-use large branches of existing code to interface newly developed neuromorphic chips. The coding strategies for such an infrastructure depend mainly on the trade-off between generality and specificity. A generic infrastructure would allow setting up the communication with and between qualitatively different neuromorphic chips with minimal effort (as in plug-and-play functionality) but its implementation might be unfeasible in a highly experimental scenario such as an academic one. On the other hand, a specific infrastructure would need large programming efforts every time a new chip is fabricated and incorporated into the existing ecosystem. In order to maximize the generality of the infrastructure, i.e., the possibility to interface to a large variety of devices, while maintaining a low level of coding complexity, we wrote a Python package based on API modules that interface to the low-level drivers by exploiting a standard server/client architecture for both the communication and the configuration. The front-end of the software consists of a set of classes that permit the definition of a neural network similarly to common simulator interfaces, such as pyNN [Davison et al., 2008], the control of the hardware parameters and the monitoring of the spiking activity in real-time. The back-end extracts the relevant hardware specifications, encapsulated in XML files, to manage the communication of data and commands with the low-level drivers actually controlling the hardware interfaces. Both the AER communication and the configuration APIs are implemented on a client/server network architecture. At the practical level, the network-based architecture allows remote access to the setups. The pyNCS APIs are generalized bindings to the custom API drivers controlling

the specific neuromorphic hardware. The communication and the configuration APIs are specified in an XML file with the multi-chip setup specification, e.g., the address space description, the IP network addresses of the servers, and other information (see also the next section).

3.3 Asynchronous communication back-end

An example of an AER-based multi-chip architecture that can be controlled using the pyNCS software is described in what follows. The choice of using COTS devices in some parts of the communication infrastructure heavily compromises the energy efficiency of the overall system but it permits fast and flexible re-programmability of the system. Custom, efficient, scalable, fully asynchronous solutions are described in the literature [Manohar, 2000, Choi et al., 2005, Arthur et al., 2012, Painkras et al., 2013] and their possible adoption is independent of the programming issues of the software infrastructure described here.

The AER loop

A local AER bus connects in a loop several boards on which different types of neuromorphic chips are mounted. Along this bus, AER events are transmitted from, e.g., spiking neurons on the chips in the form of source digital addresses, translated into destination addresses by a “mapper” device through a look-up table, and accepted by destination boards. The addressing scheme of the AER loop is generated by pyST, a sub-module of pyNCS.

Non-hierarchical address spaces

Address encoders and decoders that create the AER events from the neuron spiking activity and stimulate corresponding synaptic circuits are located on-chip at the borders of the neuro-synaptic cores. These events only contain local information, thus their address space is restricted to the size of the single chips. The address space is then expanded by micro-controllers on the chip boards, which are programmed to assign unique address ranges to each chip-board pair.

Address specifications

Address specifications describe the addressing scheme within each chip and among chips in the AER loop. The address specifications are also used by the high-level front-end to translate physical address representations into more readable address representations, e.g., for plotting needs. Address specifications at the chip level are described in the chip NHML file. A stripped-down version of the address specification of one chip is reported here.

[Chip address-specification listing (abridged): the X dimension spans [0, ..., 127] and is encoded on pins X0–X6; the Y dimension spans [0, ..., 31] and is encoded on pins Y0–Y4.]

The text-based representation of the pin layout permits arbitrary configurations to easily account for each custom implementation. The specifications describing the multi-chip setup address configuration are also reported in an XML-based text file. A brief example of a 4-bit multi-chip setup that uses a 32-bit AER bus follows (listing abridged). The channelAddressing field specifies which bits represent the AER loop addressing. These are the bits that the communication interfaces read and write to route the AER events along the setup according to the neural network topology. Each slot section specifies the address range dedicated to each device. The actual setup is defined by a second text file describing which particular chips are connected to which slot and which APIs are used for communication and configuration of the neuromorphic chips.
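To make the role of the channel bits concrete, the following minimal Python sketch shows one hypothetical way a 32-bit AER word could be split into channel (slot) bits and an on-chip address; the field widths and positions are assumptions made for illustration, not the actual pyNCS/pyST encoding.

# Hypothetical decomposition of a 32-bit AER word into a 4-bit channel
# (selecting the chip/board slot) and a 28-bit on-chip address.
# The bit positions below are illustrative assumptions only.
CHANNEL_SHIFT = 28                     # channel stored in the top 4 bits
ADDR_MASK = (1 << CHANNEL_SHIFT) - 1

def encode(channel, local_address):
    """Pack a slot number and an on-chip address into one AER word."""
    return (channel << CHANNEL_SHIFT) | (local_address & ADDR_MASK)

def decode(aer_word):
    """Split an AER word back into (channel, on-chip address)."""
    return aer_word >> CHANNEL_SHIFT, aer_word & ADDR_MASK

# A mapper would use the channel to select the destination slot's address
# range, as specified in the setup XML file.
word = encode(channel=3, local_address=0x1A2B)
assert decode(word) == (3, 0x1A2B)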


[Setup-file listing (abridged): for each slot it reports the network addresses of the AER communication and configuration servers (e.g., 172.21.36.26, 127.0.0.1) and related parameters.]

Each chip slot contains a pointer to the Neuromorphic Hardware Mark-up Language (NHML) file with the chip specification and the name and parameters of the Python API module used to configure the chip hardware parameters. The setup file also specifies the Python module containing the APIs used to allow the AER communication. The virtualchip block is used to allow off-line use of pyNCS, i.e., without any physical chip connected to the setup. The communicator and configurator blocks contain the Python API modules and parameters for monitoring the AER events and for sending commands to configure the hardware parameters through the front-end interface.

3.4 Neural network definition front-end

In order to carry out experiments using the neuromorphic hardware, a neural network definition layer is necessary. In other words, it is necessary to provide the user with a set of tools (APIs) that permit access to the neuron ensembles on the hardware, the definition of the network topology through a connectivity diagram, and the configuration of the low-level parameters of the chips (e.g., synaptic time-constants). The structure of the front-end has been deliberately kept similar to that of software simulators typically used by neuroscientists. It is possible to instantiate groups of neurons, define the connectivity between them, configure low-level parameters, and send and receive AER events on the AER bus.

Example

Here is an example of the definition of a randomly connected recurrent network that receives sparse inputs from a Dynamic Vision Sensor (DVS) chip, a spiking imaging sensor [Lichtsteiner and Delbruck, 2005].


Figure 3.1: Neuromorphic system configuration framework. Low-level drivers interface custom chips to workstations (e.g., using USB connections). The pyNCS tool-set abstracts the chip-specific characteristics, defines the setup, and the chip's functional blocks (e.g., populations of neurons). The pyTune tool-set performs the calibration of high-level parameters and optimization of cost functions, using the optimization algorithms in the base sub-module.

import pyNCS

# Load the multi-chip setup from its specification files (arguments elided
# in the original listing).
nsetup = pyNCS.NeuroSetup('', '')
nsetup.load(...)

# Recurrent population on the 'ifslwta' chip, fully connected to itself
# through the plastic ('learning') synapses.
recurrent_neurons = pyNCS.Population('', '')
recurrent_neurons.populate_all(nsetup, 'ifslwta', ...)
recurrent_C = pyNCS.Connections(recurrent_neurons, recurrent_neurons,
                                'learning', 'all2all')

# Input population taken from the DVS, sparsely (p = 0.2) connected to the
# recurrent population through excitatory synapses.
input_neurons = pyNCS.Population('', '')
input_neurons.populate_all(nsetup, 'dvs', ...)
input_C = pyNCS.Connections(input_neurons, recurrent_neurons,
                            'excitatory', 'random_all2all', {'p': 0.2})


The Population class is a container of addresses of neurons and synapses which are addressable through AER on the hardware. The addresses are created using the multi-chip system specifications contained in the NeuroSetup object and the given arguments, i.e., the number and type of required neurons. Connection classes contain the pairings between neuron addresses that represent the neural network topology. These pairings constitute the look-up table which is used to map neuron AER events into synaptic events. Several methods to construct the neural network topologies have been implemented. In the preceding example a sparse connectivity map is created.
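As an illustration of the mapping step, the following sketch builds such a look-up table from (pre, post) address pairs and uses it to translate a source event into a list of destination synapse addresses; the dictionary-based structure and the numeric addresses are assumptions made for clarity, not the actual pyNCS data structures.

from collections import defaultdict

# Connection pairings: (source neuron address, destination synapse address).
# The numeric addresses are placeholders.
pairings = [(0x0101, 0x2001), (0x0101, 0x2002), (0x0102, 0x2003)]

# Build the look-up table used by the mapper device: one source address
# maps to all the synapse addresses it projects to.
lut = defaultdict(list)
for pre, post in pairings:
    lut[pre].append(post)

def route(event_address):
    """Translate a source AER event into the destination synaptic events."""
    return lut.get(event_address, [])

print(route(0x0101))  # the two synapse addresses targeted by source 0x0101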

3.5 Mapping models into spiking hardware

While computational neuroscience models simulate neurons and synapses using parameters directly related to their biological characteristics (such as leak conductance, time constants, etc.), neuromorphic VLSI systems emulate them using circuits that can be configured by setting bias voltages and currents. The biases in these circuits are often only indirectly related to the parameters of computational neuroscience models. More generally, the relationship between parameters in theoretical models, software simulations, and hardware emulations of spiking neural networks is highly non-linear, and no systematic methodology exists for establishing it automatically. Current automated methods for mapping the VLSI circuits' bias voltages to neural-network parameters are based on heuristics and result in ad-hoc, custom-made calibration routines. For example, in [Brüderle et al., 2009] the authors perform an exhaustive search of the parameter space to calibrate their hardware neural networks, using the simulator-independent description language "PyNN" [Davison et al., 2008]. This type of brute-force approach is possible because of the accelerated nature of the hardware used, but it becomes intractable for real-time hardware or for very large systems, due to the massive amount of data that must be measured and analyzed to carry out the calibration procedure. An alternative, model-based approach is proposed in [Neftci et al., 2010], where the authors fit data from experimental measurements with equations from transistor, circuit, and computational models to map the bias voltages of VLSI spiking neuron circuits to the parameters of the corresponding software neural network. This approach does not require extensive parameter-space searches, but new models and mappings need to be formulated every time a new circuit or chip is used, making its application quite laborious.

We recently proposed a systematic and modular framework for the tuning of parameters on multi-chip neuromorphic systems [Sheik et al., 2011]. On one side, the modularity of the framework allows the definition of a wide range of generic (network, neural, synapse, circuit) models that can be used in the parameter translation routines; on the other side, the framework does not require detailed knowledge of the hardware/circuit properties, and can optimize the search and evaluate the effectiveness of the parameter translations by measuring experimentally the behavior of the hardware neural network. We implemented this framework using the Python programming language, making strong use of its object-oriented features. Indeed, Python's recent popularity in the neuroscience community [FINS-Python], combined with its platform independence and the ease of extending it with other programming languages, makes it the natural choice for this framework.


The framework consists of two software modules: pyNCS and pyTune. The pyNCS tool-set allows the user to interface the hardware to a workstation, to access and modify the VLSI chip bias settings, and to define the functional circuit blocks of the hardware system as abstract software modules. The abstracted components represent computational neuroscience relevant entities (e.g., synapses, neurons, populations of neurons, etc.) which do not depend directly on the chip's specific circuit details, and provide a framework that is independent of the hardware used. The pyTune tool-set allows users to define abstract high-level parameters of these computational neuroscience relevant entities as functions of other high- or low-level parameters (such as circuit bias settings). This tool-set can then be used to automatically calibrate the properties of the corresponding hardware components (neurons, synapses, conductances, etc.), or to determine the optimal set of high- and low-level parameters that minimize arbitrarily defined cost-functions. Using this framework, neuromorphic hardware systems can be automatically configured to reach a desired configuration or state, and parameters can be tuned to maintain the system in the optimal state.

3.6 The pyTune tool-set

The pyTune tool-set is a Python module which automatically calibrates user-defined high-level parameters and optimizes user-defined cost-functions. The parameters are defined using a dependency tree that specifies lower-level sub-parameters in a recursive, hierarchical way. An example of such a dependency tree is shown in Fig. 3.2. This hierarchical scheme allows the definition of arbitrarily complex parameters and related cost-functions. For example, synaptic efficacies in neural network models can be related to the bias voltages in neuromorphic chips which control the gain of synaptic circuits. In the DPI synapse [Bartolozzi and Indiveri, 2007], there are three bias voltages that simultaneously affect the synaptic gain in a non-linear fashion. Using the pyTune tool-set it is possible to automatically search the space of these bias voltages and set a desired synaptic efficacy by measuring the neuron's response properties (e.g., mean output rate) from the chip. The automated parameter search can be applied to more complex scenarios to optimize high-level parameters related to network properties. For example, the user can specify the mapping between low-level parameters and the gain of a winner-take-all network [Yuille and Geiger, 2003], or the error of a learning algorithm [Hertz et al., 1991]. Furthermore, the pyTune tool-set is not restricted to neuromorphic chip setups: it can be used in pure software simulation scenarios in which it is necessary to optimize cost-functions that involve complex or abstract parameters, as a function of direct or low-level model parameters.
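To illustrate the dependency-tree idea without relying on the actual pyTune API (the class and method names below are assumptions made for this sketch), a high-level parameter can be represented as a node with a getValue measurement routine and a list of low-level sub-parameters that an optimizer is allowed to vary:

# Minimal sketch of a parameter dependency tree; the real pyTune
# interface may differ.
class Bias:
    """Lowest-level parameter: directly sets a chip bias (here a plain float)."""
    def __init__(self, name, value):
        self.name, self.value = name, value
    def setValue(self, v):
        self.value = v          # on hardware this would program a DAC
    def getValue(self):
        return self.value

class Parameter:
    """High-level parameter defined by a measurement routine and sub-parameters."""
    def __init__(self, name, measure, subparams):
        self.name, self.measure, self.subparams = name, measure, subparams
    def getValue(self):
        return self.measure([p.getValue() for p in self.subparams])

# Example: a 'synaptic efficacy' that depends non-linearly on two biases.
w_bias, tau_bias = Bias('weight', 0.3), Bias('tau', 0.5)
efficacy = Parameter('efficacy',
                     measure=lambda b: b[0] * (1.0 - b[1]) * 10.0,  # toy model
                     subparams=[w_bias, tau_bias])
print(efficacy.getValue())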

3.7 Methods

Here we describe the steps required to integrate, configure, and calibrate a custom neuromorphic system using the pyNCS and pyTune tool-sets. They involve the creation of files describing the chip’s functional blocks (excitatory synapses, inhibitory synapses, neurons, etc.), the experimental setup (e.g., how many chips and what measurement instruments are used), and the topology of the neural network to be emulated. In addition it is necessary to define low- and high-level parameters that characterize the network, using the hierarchical scheme described in Fig. 3.2.


Figure 3.2: Example of parameter definition. A hierarchical dependency tree specifies which lower-level parameters affect the parameter being defined. The bias parameters directly change the bias voltages on the chip(s). Each parameter contains a definition of a getValue and (optionally) a setValue function. These functions specify how to measure and set the chip signals that define the parameter. If the setValue function is not defined, the system traverses the tree following the specified optimization algorithm until it finds the lowest-level parameters that affect the chip biases.

3.7.1 Chip functional blocks

All the details of the chip are collected in a single file in which its functional blocks are defined and the chip's pins are related to their parameters. The file specifies how to change the biases of the functional blocks (e.g., via Digital-to-Analog Converters (DACs)). In addition, input and output signals are defined according to the AER protocol [Boahen, 1998]. The chip is represented in pyNCS as a Chip object. Once this object is instantiated, chip bias voltages are treated as ordinary variables: any operation that uses these variables actually reads the signals from the pins, while any operation that modifies them actually sets the corresponding DAC value.

3.7.2 Neural network topology

The topology of the neural network is described in terms of populations of neurons (Population) and inter-connections thereof (Mapping). The properties of the connections are defined by the synapses used to make these connections (e.g., excitatory, inhibitory or plastic). pyNCS provides convenient methods to build such topologies on the neuromorphic setups. The following code snippet demonstrates how to use the pyNCS tool-set in a simple example. In this example two populations of neurons are defined. They are physically placed on two different neuromorphic chips ('chip1' and 'chip2') that are contained in the setup object. Both pop1 and pop2 are first defined as neural populations and then populated with I&F neurons on the two chips. This operation sets up the actual communication between the software abstraction layer and the hardware by communicating to the system which AER addresses are dedicated to the instantiated objects. Finally, a mapping instance is created and one-to-one connectivity is established. Internally, this operation generates the look-up table that corresponds to the desired connectivity profile. Several other possibilities (all-to-all, sparse connectivity, etc.) are available through the front-end.

import pyNCS

# Initialize setup
setup = pyNCS.NeuroSetup('setup.xml')

# Populate 5 LIF neurons from 'chip1'
pop1 = pyNCS.Population(id='pop1', description='Pop 1')
pop1.populate_by_number(setup, chip='chip1', type='IF_leaky', N=5)

# Populate 10 LIF neurons from 'chip2'
pop2 = pyNCS.Population(id='pop2', description='Pop 2')
pop2.populate_by_number(setup, chip='chip2', type='IF_leaky', N=10)

# Connect pop1 to pop2 with one-to-one excitatory connections
mapping = pyNCS.Mapping('network')
mapping.connect_one2one(pop1, pop2, type='excitatory')

When the populations are “populated”, the physical addresses corresponding to the desired neurons are selected from the available address space. This means that no automated placing scheme is applied and specific addresses can always be explicitly selected. As soon as the connectivity is effectively uploaded into the system via the communication API, the AER events are routed to the specified destinations. If no further configuration is required, the client computer can be disconnected from the boards and the neuromorphic system will act as a stand-alone entity.

3.7.3 Automatically tuning system parameters

The pyTune tool-set relies on the translation of the problem into parameter dependencies. The user defines each parameter by its measurement routine (getValue function) and its sub-parameter dependencies. At the lowest level, the parameters are defined only by their interaction with the hardware, i.e., they represent biases of the circuits. The user can choose a minimization algorithm from those available in the package, or can define custom methods, to perform the optimization that sets the parameter's value. Optionally, one can also define a specific cost function that needs to be minimized. By default, the cost function is computed as (p − p_desired)², where p is the current measured value of the parameter and p_desired is the desired value. Explicit options (such as maximum tolerance for the desired value, maximum number of iteration steps, etc.) can also be passed as arguments to the optimization function. Finally, the sub-parameters' methods are mapped by the appropriate plug-in onto the corresponding driver calls, in the case of a hardware system, or onto method calls and variables in the case of a system simulated in software. Each system-specific mapping has to be separately implemented and included in pyTune as a plug-in.

3.8 Results

In this section we demonstrate how pyTune is used in conjunction with pyNCS to set the mean firing rate of a population of neurons on a multi-neuron chip.


The setup comprises a multi-neuron chip, a 10 mm² prototype VLSI device implemented in a standard 0.35 µm CMOS technology. The chip comprises an array of 128 integrate-and-fire neurons and 4096 adaptive synapses with biologically plausible temporal dynamics [Bartolozzi and Indiveri, 2007]. Each of the 128 neurons receives input from 32 synaptic circuits, which are subdivided into sets of excitatory non-plastic synapses, inhibitory non-plastic synapses, and excitatory plastic ones with on-chip learning capabilities [Mitra et al., 2009]. The multi-neuron chip biases can be modified by a board comprising a series of DACs and interfaced via USB to the workstation. Input and output spikes are sent to/from the chip using the AER protocol.

In this example, we created a population of 5 neurons and defined the Rate parameter as the mean firing rate of the population. Built-in functions of pyNCS are used to stimulate the neurons, monitor their output spiking activity and compute the population mean firing rate. In principle, the population's mean firing rate can depend on several parameters (e.g., injected currents, recurrent connectivity, external inputs, time-constants). In order to simply illustrate how pyTune handles the dependencies, we define the parameter as dependent on two sub-parameters, the Injection current to the neurons and the Leak current of the membrane. These are in fact biases of the chip, i.e., voltages on the gates of the transistors generating the injection current (p-type transistor) and the leak current (n-type) [Indiveri et al., 2006].

To visualize the dependence of Rate on its two sub-parameters, we carried out a two-dimensional sweep across the parameter space. The points in the 3D plot of Fig. 3.3 represent the values of Rate, which lie on a non-linear surface because of the exponential relationship between the biases and their respective currents. The default cost function is (r − r_desired)², where r is the value of Rate measured from the hardware system and r_desired = 50 Hz is the target value. In this example, pyTune minimizes the cost function using an implementation of the Truncated Newton (TN) algorithm [Nash, 2000] provided by SciPy's optimization module [Jones et al., 2001–]. The blue path shown in Fig. 3.3 connects the points measured by pyTune while setting the Rate. Figure 3.4 shows the progress of the optimization algorithm (cost versus iteration step). After 10 iterations, the algorithm converges to a mean rate of 48.3 Hz, which is compatible with the target rate of 50 Hz given the tolerance of 3 Hz passed as an argument to the algorithm.
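The essence of this calibration loop can be sketched as follows; measure_rate stands in for the actual hardware measurement (here replaced by a toy exponential model), and the use of SciPy's truncated-Newton routine mirrors, but does not reproduce, the pyTune internals.

import numpy as np
from scipy.optimize import minimize

R_TARGET = 50.0   # desired mean firing rate in Hz

def measure_rate(biases):
    """Placeholder for the hardware measurement: stimulate the population,
    record spikes, return the mean rate. Here a toy exponential model."""
    v_inj, v_lk = biases
    return 200.0 * np.exp(-5.0 * v_lk) * (1.0 - np.exp(-5.0 * v_inj))

def cost(biases):
    # Default pyTune-style cost: squared distance from the target rate.
    return (measure_rate(biases) - R_TARGET) ** 2

# Truncated-Newton minimization over the two bias voltages, within bounds.
result = minimize(cost, x0=[0.5, 0.5], method='TNC',
                  bounds=[(0.0, 1.0), (0.0, 1.0)],
                  options={'maxfun': 100})
print(result.x, measure_rate(result.x))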

The code used to produce this data is available online (http://ncs.ethz.ch/publications/examples-iscas-2011).

3.9 Discussion

The pyNCS software shares similarities with the kernel of a traditional operating system architecture. The function of the kernel is to implement a collection of facilities of "universal applicability" and "absolute reliability" by which an arbitrary set of operating system facilities and policies can be conveniently, flexibly, efficiently, and reliably constructed. Moreover, however the flexibility may be constrained at any instant, it should be possible for an arbitrary number of systems created from these facilities to co-exist simultaneously [Wulf et al., 1974].


Figure 3.3: A 3D wire-frame plot of the parameter space for the experiment described in Sec. 3.8. Each dot on the wire-frame nodes is a measure of the Rate value. The surface represents the non-linear dependence of the parameter to be optimized (Rate) on the two sub-parameters (Leak and Injection). The sub-parameters are voltages on the gate of the n/p-transistor which controls the leak/injection current to the neuron circuit [Indiveri et al., 2006]. The blue line shows the algorithm's path during the optimization process (the red star represents the final point at 48.3 Hz). The error bars on each point show the standard deviation of the rate measurement computed over the population, mainly due to transistor mismatch.

Figure 3.4: The cost function is minimized to set the Rate parameter to a desired value of 50 Hz in the example experiment of Sec. 3.8. Error bars represent standard deviations. The final value of the population rate is 48.3 Hz, with a deviation from the desired value below the tolerance we passed to the algorithm. See text for details.


3.9.1 Towards a neuromorphic kernel

In computers, a kernel is a bridge between applications using abstract resources and the actual data processing done at the hardware level, through the management of the system's resources. In neuromorphic systems, a similar role is played by the pyNCS APIs, which provide access to the hardware resources through the low-level abstraction layer. As in the case of kernels, where different processors, interfaces, and I/O devices are typically supported through specific drivers (as in monolithic kernels such as the Linux kernel) or servers (as in microkernels), the neuromorphic kernel should manage the configuration of and communication between different neuromorphic devices. The definition of a communication protocol and the introduction of a set of specifications for the description of the low-level functionality are one possible strategy to fulfill this purpose. The pyNCS software uses a widely adopted digital communication framework, the AER [Boahen, 1998], introduces a mark-up language for the specifications of the low-level hardware, and uses server/client architectures for the communication between the different hardware resources and the management from the front-end. It is currently being tested on different types of hardware and collaboratively improved by several research groups around the world. Future development of a complete kernel-like framework for neuromorphic devices should facilitate the development of cognitive systems and the assessment of their capabilities.

3.9.2 A black-box approach to configuration

The possibility of configuring a large neuromorphic system typically depends on a detailed knowledge of the system components. However, since an increasing engineering community is now able to access facilities for the production of custom neuromorphic chips based on well-established circuits, there is an increasing need for a shared, system-level software infrastructure to test the newly developed chips, configure them and include them in the existing setups. To that end, we argue that a black-box type of approach is appropriate because it provides an abstraction from the detailed knowledge of the electronics and maximizes the re-use of tools that have already been developed. In other words, the modular structure of the software we described simplifies the interfacing between the hardware designers' tool-chain and the software designers' and users' needs. This abstraction is obtained through the definition of hardware specification mark-up language files, which can be easily composed by hardware designers. The definition of parameter dependencies is also based on a mark-up language description and permits linking high-level parameters, e.g., synaptic time constants, to the hardware configuration, e.g., through current biases on the circuits. We believe that these tools can simplify the development of neuromorphic hardware platforms for various purposes in this highly heterogeneous research community.

3.10 Conclusions

The vast majority of neuromorphic systems developed in past decades adopted a digital communication scheme where spike events are encoded into strings that represent either the sources of the spikes (neurons) or the destinations to target (synapses). By exploiting this representation we were able to construct a modular software platform that can be easily integrated with existing systems and that provides the required flexibility for expanding them with newly developed devices with limited coding overhead. A modular and expandable, platform-independent, Python-based framework to control, calibrate and tune custom neuromorphic systems has been realized following this method and has been presented in this Chapter. The Python software tools can be used to automatically or manually configure neuromorphic hardware setups for emulating spiking neural networks. The modularity consists in the possibility of integrating a wide range of additional modules or interfaces through the definition of simple API modules. These modules range from drivers for interfacing the tools to custom VLSI chips, to programs for controlling measurement instruments and acquiring data from the hardware setups, but also include optimization routines for finding the optimal set of parameters that produce a desired behavior in the hardware setup. In this respect, the strength of the work proposed is not in the specific methods used to solve the parameter mapping and calibration problems, but in providing a framework in which existing methods can be easily integrated and adapted for use in conjunction with the hardware. The software infrastructure makes it possible to experiment with different architectures, different neuromorphic chips and sensors, to carry out a variety of experiments and to explore the different possibilities. Thus, the pyNCS software represents a fundamental component for the construction of a solid neuromorphic ecosystem for experimenters, modelers and engineers, and has been an important milestone in the development of this thesis.


CHAPTER 4

Procedures for the configuration of the neuromorphic chip

A fighter receives a punch, gives three back.

Chinese proverb

In the previous chapters I presented the neuromorphic hardware system and the software tools developed to configure it and monitor the spiking activity. In this chapter I describe the experimental procedures used to characterize the learning circuits and configure them to obtain a fully functional binary classifier in hardware. The idea is to first configure the neuron circuit parameters following earlier works on I&F VLSI neurons [see, for example, Fusi and Mattia, 1999, Fusi et al., 2000], and then to globally characterize the plastic synapse circuits by imposing determined pre- and post-synaptic mean activities. Thanks to these procedures, it is possible to characterize the statistical properties of the post-synaptic neuron activity relevant for the training and to measure the mean value and standard deviation of the synaptic states' transition probability. A useful feature is the possibility of reducing the total amount of synaptic current injected into the post-synaptic neuron without interfering with the synaptic states, while non-plastic synapses can be used to drive the neuron with arbitrary currents to impose any desired correlation on the pre- and post-synaptic activities.

The procedures developed are necessary to obtain a detailed characterization of the neuromorphic chip and verify its functionality, and can be applied to future generations of chips thanks to a system-level approach that does not rely on detailed knowledge of the electronic circuits. The particular chip used in the presented work consists of a 32×128 single neuro-synaptic core array, which includes 128 neurons, 28×128 plastic synapses, and 4×128 synapses with fixed weights. A space-multiplexer circuit was digitally configured to redirect the synaptic lines such that the first neuron receives currents from all the synapses of the array, thus obtaining a perceptron-like architecture.


4.1 Introduction

One of the current challenges in neuromorphic engineering consists in the adoption of an appropriate level of abstraction for the models to implement in hardware. By modeling the finest biological processes behind, for example, synaptic dynamics, one can obtain a faithful emulation in hardware, but this choice compromises area consumption, power efficiency and, finally, computational capabilities, because only a few synapses can be implemented on a single chip [Rachmuth et al., 2011]. The choice of the model to implement in the VLSI hardware and the corresponding circuit design choices also have to be considered in the development of the tools needed for calibrating the circuits.

Current neuromorphic systems implement a large variety of spiking neuronal models. In software simulations, there are no particular restrictions on the implementable models apart from computational and communication bandwidth constraints [Furber and Temple, 2007, Galluppi et al., 2012]. Conversely, in hardware systems, the neural model is chosen once [Indiveri et al., 2011, Seo et al., 2011, Schemmel et al., 2007, Silver et al., 2007], prior to fabrication. A simplified I&F neuron model such as the one described in Chapter 2 can be implemented with a small number of transistors, but it does not reproduce the large repertoire of behaviours observed in biological neurons [Izhikevich, 2003, Brette and Gerstner, 2005]. In large spiking networks, the choice of I&F neurons can be advantageous because of the compactness of the implementation, which permits integrating many neurons in single chips, and the power of the analytical tools used to describe the network behaviour [Amit, 1992, Fusi et al., 2000, Camilleri et al., 2007, Giulioni et al., 2011, Indiveri et al., 2011]. Note, however, that the neuronal model determines the sub-threshold behaviour, which in turn could affect the dynamics of the network, e.g., through the synaptic plasticity.

Memories in a neural network are stored as synaptic connections between neurons. In neuromorphic VLSI hardware, the connectivity can be realized physically or virtually. Physical synapses can be implemented in standard CMOS using capacitors, floating gates or digital bits to store synaptic weights and to implement the synaptic dynamics [Bartolozzi and Indiveri, 2006], or using new technologies such as memristors [Chua, 1971, Jo et al., 2010, Kim et al., 2011]. The connectivity is realized by physically wiring neurons to the synapses [Camilleri et al., 2007], e.g., using a crossbar architecture [Seo et al., 2011], or through the AER decoder for receiving spikes from external sources [Silver et al., 2007, Chicca et al., 2013]. Virtual connectivity is typically realized by storing the weights in a digital device, e.g., a RAM. The stored weights are transmitted together with the AER packets and are encoded and decoded locally at the source and destination sites, e.g., at the level of the physical circuits. Virtual and physical synapses can also be combined, exploiting the advantages of both worlds, to minimize synaptic integration while maximizing connectivity [Goldberg et al., 2001] or to allow space- and time-multiplexing of the neural network emulation [Bailey and Hammerstrom, 1988, Goldberg et al., 2001, Minkovich et al., 2012]. The storage of memories in neuromorphic systems with distributed synapses is realized by changes in synaptic connectivity, i.e., synaptic weights.
In an off-line learning scenario, synaptic weights are obtained from theoretical models or simulations and then uploaded into the hardware neural network. In an online learning scenario, instead, synapses continuously change depending on the pre- and post-synaptic activities, and thus analysis and calibration of the learning parameters are needed. Synaptic changes can be realized either by using a specific technology, e.g., memristors [Jo et al., 2010, Serrano-Gotarredona et al., 2013] or FGMOS [Ramakrishnan et al., 2012], or by emulating a bio-physical model in CMOS [Indiveri et al., 2006]. In the former case, the main challenge is the realization of a physical device with the properties of biological synapses for storage and maintenance of the memories. Much research effort is put into understanding the device physics, eventually compensating for mismatch through extensive calibrations when necessary. In the case of CMOS, instead, synaptic weights are stored as charges on large capacitors and simple current-based circuits modify the synaptic weights. The typical approach is to take advantage of a theoretical model and let the network approach the theoretical model through the plasticity [Giulioni et al., 2011], for example by using mean-field theory to obtain attractor dynamics [Amit, 1992]. This is mainly because, when using distributed analog synapses, the probing of each synaptic value is not possible and typically global signals, i.e., biases, control the learning behaviour of large blocks of synapses and eventually the entire array, even though the synaptic weights are independent for each synapse. A similar system-level approach has also been adopted here.

4.2 Neuron dynamics

The neuron circuits are presented in Chapter 2. They reproduce a current-based, leaky I&F model [Fusi and Mattia, 1999, Indiveri et al., 2006]. Here, the statistical properties of the hardware model, i.e., the statistics of the neuron’s firing activity in response to stationary but noisy input currents, are described.

4.2.1 Statistical properties of the VLSI hardware neuron

The synaptic state transitions depend on the dynamics of the depolarization variable Vmem. It can be shown that an I&F neuron subjected to a noisy input current expresses different types of dynamics in the high and low spike-frequency regimes [Fusi and Mattia, 1999]. The input current is modeled as a stochastic variable with a Gaussian distribution centered around a mean µ with variance σ². With the linear, fixed-weight synapses implemented in the VLSI chip it is possible to produce such a current using high-rate spike-trains as input and specific parameters for the synapses. In the Noise-Driven (ND) regime (low µ/σ), the statistics of the output spikes is determined by the fluctuations of the depolarization variable, which dominate over the average drift. As a consequence, the Inter-Spike Interval (ISI) distribution is broad, reflecting the Poisson-like statistics of spike generation [Ben Dayan Rubin et al., 2004]. In the Signal-Driven (SD) regime instead (high µ/σ), spikes are generated at more regular times and the ISI distribution shows a peak around the 1/r value, where r is the mean firing rate (Fig. 4.1a).
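As a rough numerical illustration of the two regimes (a minimal sketch in arbitrary units, not the circuit model: a discretized leaky I&F neuron driven by a Gaussian current), the coefficient of variation of the ISIs is large when fluctuations dominate (low µ/σ) and drops when the mean drift dominates (high µ/σ):

import numpy as np

def isi_cv(mu, sigma, theta=1.0, tau=0.02, dt=1e-4, t_max=20.0, seed=0):
    """Simulate a leaky I&F neuron driven by Gaussian current; return ISI CV."""
    rng = np.random.default_rng(seed)
    v, last_spike, isis = 0.0, 0.0, []
    for k in range(int(t_max / dt)):
        t = k * dt
        v += (-v / tau + mu) * dt + sigma * np.sqrt(dt) * rng.standard_normal()
        v = max(v, 0.0)                 # reflecting barrier at the resting level
        if v >= theta:                  # spike and reset
            isis.append(t - last_spike)
            last_spike, v = t, 0.0
    isis = np.array(isis)
    return isis.std() / isis.mean()

print('noise-driven  CV ~', isi_cv(mu=20.0, sigma=8.0))   # low mu/sigma
print('signal-driven CV ~', isi_cv(mu=120.0, sigma=2.0))  # high mu/sigma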

4.2.2 Depolarization distribution

The above considerations about the statistics of the spiking events are reflected in the distribution of the depolarization variable. In the SD regime, the strong afferent current forces the membrane potential to travel the range from the resting level to the spiking threshold with a strong drift. In this situation the membrane potential is broadly distributed over the whole range between resting state and firing threshold.



Figure 4.1: (a) ISI distribution for different output firing rates. The spikes are recorded from the VLSI hardware. The output neuron is stimulated through its excitatory synapse using computer-generated Poisson spike trains of different input frequencies. When the I&F neuron is driven towards the SD regime, the spiking behaviour becomes more regular and the ISI distribution becomes peaked around 1/r, where r is the mean firing rate. (b) CV of the ISIs for different output firing rates.

In the ND regime instead, the depolarization variable tends to concentrate at the resting level Vmem = 0 and reaches the firing threshold by means of large fluctuations of the stochastic input current. Thus, the distribution tends to be peaked around low values (Fig. 4.3). Such asymmetry between the two regimes is used to implement a synaptic plasticity rule that has a dependence on the activity of the post-synaptic neuron and uses only information that is local in time and space [Fusi, 2002]. If the pre-synaptic spikes are not correlated with the post-synaptic activity, every pre-synaptic spike effectively samples the distribution of the depolarization variable with respect to a certain threshold, and this mechanism is used to either potentiate or depress the synapse.

4.3 Synaptic dynamics and plasticity

In the current implementation of the distributed plastic synapse array there is no means of directly accessing the synaptic states, nor of reading out the internal variables of the synapses. In order to characterize the parameters of the learning circuits, I designed a set of procedures to measure the dynamics of the internal variables in an indirect way (see also [Fusi et al., 2000]). The plastic synapses are initially reset to the low state. The synapses are then stimulated with a train of regularly-spaced spikes with a certain ISI. The parameters of the learning circuits are set in such a way that every pre-synaptic spike tends to potentiate the synapse by increasing its value by a fixed amount a. In between the pre-synaptic spikes the synapse tends to be restored to the original low value by the refresh current implementing the bi-stability mechanism, modeled by the speed parameter α. For a given set of parameters, ISI and number of pre-synaptic spikes, the state of each synapse is read after the experiment. In the ideal case, there is a critical pair of ISI and number of pre-synaptic spikes that potentiates the synaptic variables up to the bi-stability threshold, and this pair is the same for all the synapses.
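The following sketch mimics this protocol in software (a toy model with made-up parameter values, not the circuit itself): the internal variable jumps by a at every pre-synaptic spike, drifts back towards the low state at rate α between spikes, and we count how many regularly-spaced spikes are needed to cross the bi-stability threshold θ.

def spikes_to_cross(a, alpha, isi, theta, max_spikes=1000):
    """Number of regularly-spaced pre-synaptic spikes needed to reach theta.

    The internal variable starts at the low state (0), jumps by `a` at each
    spike and decays linearly at rate `alpha` during each ISI, without going
    below the low state. Returns None if theta is never reached.
    """
    w = 0.0
    for n in range(1, max_spikes + 1):
        w += a                           # jump due to the pre-synaptic spike
        if w >= theta:
            return n
        w = max(0.0, w - alpha * isi)    # drift back towards the low state
    return None

# Toy values: jump height, drift speed, threshold (arbitrary units).
a, alpha, theta = 0.12, 0.5, 1.0
for isi in (0.05, 0.1, 0.2):
    print(isi, spikes_to_cross(a, alpha, isi, theta))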


Figure 4.2: Trace of the neuron's membrane potential during learning. A train of Poisson-distributed spikes of high rate is sent to the non-plastic excitatory synapse. The parameters of the soma circuits are such that the neuron has a strong leak (Vlk ∼ 0.3) to compensate for the strong input current and to generate output spikes only for strong fluctuations of the input. The green and pink traces are the UP (active low) and DN (active high) signals, which enable synaptic updates on the learning circuits (see Sec. 2.5).

Figure 4.3: Distribution of Vmem for different mean firing rates. Experimental protocol as in Fig. 4.1a


Figure 4.4: Cartoon explanation of the measurement procedure for a and α on the VLSI hardware. See text for details.

Figure 4.5: Measurement of a and α on the VLSI hardware. Half of the plastic synapses (a random set) is initially reset to the low state, and then an independently chosen random set of synapses is stimulated with a regular spike-train with a fixed number of spikes separated by a given ISI (average values across the entire synaptic array of 3472 synapses). The parameters are set in such a way that every pre-synaptic spike tends to potentiate the synapse.

Instead, because of the device mismatch, each synapse is affected by updates governed by different parameters a and α, and this results in a diversity of critical pairs. Fig. 4.5 reports the average number of potentiated synapses for a set of ISIs and numbers of spikes. The resulting curves are smoothed versions of step-functions, as an effect of the above considerations. The two mentioned parameters determine the resilience of the synapse against long-term potentiation [Fusi et al., 2000]. The synaptic dynamics can be described as a generalized Takács process [Takács, 1955] and the transition times can be estimated as a function of a and α. Hence, the synaptic state transition probabilities, on which the success of the supervised learning eventually depends, are determined by those parameters. The parameters can be determined numerically by repeating the measurements previously described. What is needed is a pair of critical points where the synapse just crosses the potentiation threshold [Fusi et al., 2000].


Figure 4.6: Estimation of the synaptic update parameters. The number of synaptic "jumps" needed to obtain a transition is measured on all the synapses with regular spike-trains of different ISIs. From the values obtained, the internal variables of the jump height a and weight leak α are estimated with a fit (see inset).

Indeed, the transition occurs when the following condition is met:

n a − (n − 1) T α ≥ θ    (4.1)

where n is the minimal number of pre-synaptic spikes needed to cross the threshold θ and T is the ISI. By measuring two of these points, i and j, one can derive the parameters [Fusi et al., 2000]:

α = θ (n_j − n_i) / [n_i (n_j − 1) T_j − n_j (n_i − 1) T_i]    (4.2)

a = θ [(n_j − 1) T_j − (n_i − 1) T_i] / [n_i (n_j − 1) T_j − n_j (n_i − 1) T_i]    (4.3)

Analogously, the parameters of the synaptic dynamics can also be determined by a graphical method. By re-arranging the threshold-crossing condition of eq. 4.1 (taken at equality), one can write the minimal number of pre-synaptic spikes needed to obtain a synaptic transition as a function of the ISI:

n = (θ − α ∆t) / (a − α ∆t)    (4.5)

where ∆t is the ISI.
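A small numerical sketch of this estimation follows directly from eqs. (4.2)–(4.3); the two critical pairs (n_i, T_i) and (n_j, T_j) used below are hypothetical numbers chosen for illustration only.

def estimate_a_alpha(ni, Ti, nj, Tj, theta=1.0):
    """Estimate the jump height a and drift speed alpha from two critical
    pairs (minimal spike count, ISI), following eqs. (4.2)-(4.3)."""
    den = ni * (nj - 1) * Tj - nj * (ni - 1) * Ti
    alpha = theta * (nj - ni) / den
    a = theta * ((nj - 1) * Tj - (ni - 1) * Ti) / den
    return a, alpha

# Hypothetical measured critical pairs: 11 spikes at 50 ms ISI,
# 45 spikes at 200 ms ISI.
a, alpha = estimate_a_alpha(ni=11, Ti=0.05, nj=45, Tj=0.2)
print(a, alpha)
# Consistency check with eq. (4.1): n*a - (n-1)*T*alpha equals theta
# at both critical points.
print(11 * a - 10 * 0.05 * alpha, 45 * a - 44 * 0.2 * alpha)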

4.3.1 Measuring transition probabilities

The transition probabilities are measured as follows. First, the synaptic states are randomly reset to either high or low values with 50% probability. Then, Poisson spike-trains of a given mean firing rate and 500 ms duration are generated and used to stimulate the synaptic circuits. At the same time, a Poisson spike-train is sent to the non-plastic synapse of the post-synaptic neuron, imposing an irregular but stationary mean firing activity. After the stimulation, the state of the synapses is read through a specific stimulation protocol and a single value is obtained for the synapses that potentiated from the low state, as well as one value for the synapses that depressed from the high state.
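In software, the per-trial estimates reduce to two fractions computed from the synaptic states before and after stimulation; the following sketch (with randomly generated states standing in for the hardware read-out) shows the bookkeeping:

import numpy as np

rng = np.random.default_rng(0)
n_syn = 3472                              # size of the plastic array

# States read out before and after one stimulation trial (0 = low, 1 = high).
# Here they are randomly generated; on the chip they come from the read-out
# protocol described in the text.
before = rng.integers(0, 2, n_syn)
after = np.where(rng.random(n_syn) < 0.15, 1 - before, before)  # toy flips

low = before == 0
high = before == 1
p_ltp = np.mean(after[low] == 1)    # fraction of initially-low synapses that potentiated
p_ltd = np.mean(after[high] == 0)   # fraction of initially-high synapses that depressed
print(p_ltp, p_ltd)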


Figure 4.7: Probabilities of LTP and LTD as a function of the post-synaptic neuron firing rate. Each data point corresponds to an experiment with the protocol described in Sec. 4.3.1. Dashed lines connect binned averages and bars represent standard deviations within bins.

The experiment is repeated several times, then binned average values and standard deviations are computed.

4.4 Exponential dynamics of learning synapses: bug or feature?

While assessing the capabilities of one of the VLSI chips used to carry out the learning experiments, several non-idealities of the un-optimized neuromorphic system have emerged. Most of these non-idealities have been traced to their source at the circuit level, and the circuits have already been modified in the next generation of neuromorphic chips. Here, I present one example that dramatically affects the dynamics of learning.

The dynamics of learning relies on a smooth dependence of the transition probabilities on the synaptic changes. The correlation between the input pattern and the connectivity reflects linearly on the output rate because of the linear summation of synaptic currents. This provides a mechanism for the post-synaptic neuron to "sense" the connectivity, activate stop-learning when needed and prevent over-training. The theory prescribes the existence of an internal variable that represents the state of the synapse, so that weight transitions (high/low) happen together with state transitions (active/inactive). The learning circuits implemented on the 'ifslwta' chip combine the internal variable of the synapse with its weight. In fact, there is no internal variable as such, and changes triggered by pre-synaptic spikes are directly applied to the synaptic weight. This weight is kept as a voltage Vw on the capacitor storing the weight of the synapse. The synaptic current Iw is generated by a transistor whose gate is controlled by Vw, which makes the effective synaptic efficacy depend exponentially on this value, since Iw ∝ exp(Vw). The consequence of this design is that the post-synaptic activity will be affected mostly by small weight changes instead of state changes, and this implies another constraint on the synaptic parameters. If the constraint is not met, the learning dynamics could depend only on Vw changes and not on synaptic state changes, making learning impossible. The situation is depicted as a cartoon in Fig. 4.9.


Figure 4.8: Learning the "INI" pattern. The image on the top is converted into Poisson spike-trains with rates proportional to the intensity level of the corresponding pixels (50 Hz for white and 5 Hz for black pixels). The pattern of spike-trains obtained is presented 10 times to the plastic synapses of a neuron driven with a teacher signal into a region of weak potentiation. After each presentation the synaptic levels are read with a particular stimulation procedure (see text) and the results are reported on the side. White pixels correspond to potentiated synapses. All the synapses started in the low state (black pixels). The image on the bottom corresponds to the fifth stimulation. The experiment shows the independent, stochastic transitions of the bi-stable synapses due to the independently generated Poisson spike-trains in input.

In practice, all the synapses which are in a high state will send zero current as soon as their value is depressed, before the state transition happens. This condition is highly likely because it can be triggered by a single spike received by a potentiated synapse when depression is allowed. There are two possible solutions to this problem. The first one is to reduce the dynamic range of weight changes (Vhigh closer to Vlow), so as to reduce the influence of Vw on the post-synaptic activity. This solution is effectively impractical for two reasons: one being that by reducing Vhigh one also reduces the effects of synaptic state changes, i.e., the slope of the line in Fig. 4.9; the other being that mismatch and the large number of synapses will cause some of them to be completely ineffective and some to be strongly effective. The second solution comprises a redesign of the synaptic circuits that decouples the weight and the internal state of the synapses. We analyzed a modified version of the VLSI circuits that comprises a diff-pair circuit to implement this functionality: the relationship between the state of the synapse and its weight becomes sigmoidal, and a common threshold for low/high weight and active/inactive state is imposed. In this way, synaptic state changes are reflected in synaptic weight changes with a large dynamic range. The design we propose is similar to what has been proposed in [Chicca et al., 2003], where inverters are adopted in order to have discrete weights. However, a smoother version of the weight, as in the solution we propose, might have technical advantages, such as power consumption and layout, and computational implications given the programmability of the parameters of the diff-pair. These arguments will be investigated in future generations of multi-neuron chips. For some experiments in this thesis, the synaptic current has been shut down by choosing a practically infinitesimal synaptic time-constant.


Figure 4.9: Synaptic current (y-axis) results from changes in synaptic connectivity (x-axis). Synaptic states and synaptic weights coincide in the current VLSI implementation, hence even when synaptic states do not change, the total synaptic current is strongly affected by changes in synaptic weights (Iw ∝ exp(Vw), in red). Changes in synaptic current affect the dynamics of learning through the transition probabilities, shown along the y-axis (Long-Term Potentiation and Long-Term Depression).

In order to recover the dependence of the total synaptic current on the state changes, an additional constant current proportional to the actual state of the synaptic array is introduced at every pattern presentation.
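The impact of the exponential weight-to-current relation, and of the proposed sigmoidal alternative, can be illustrated numerically (a toy comparison with made-up voltage values and scales, not the actual circuit transfer functions):

import numpy as np

v_low, v_high, v_thr = 0.2, 0.8, 0.5    # illustrative weight voltages (V)
ut = 0.025                               # thermal-voltage-like scale (assumption)

def i_exp(vw):
    """Exponential weight-to-current mapping, as in the current circuits."""
    return np.exp(vw / ut)

def i_sigmoid(vw, gain=40.0):
    """Sigmoidal mapping around the common threshold (diff-pair-like)."""
    return 1.0 / (1.0 + np.exp(-gain * (vw - v_thr)))

# A small depression of a 'high' synapse (e.g. 0.8 V -> 0.7 V) collapses the
# exponential current by roughly two orders of magnitude, even though the
# binary state (above threshold) has not changed; the sigmoidal mapping
# barely moves.
for vw in (0.8, 0.7):
    print(vw, i_exp(vw) / i_exp(v_high), i_sigmoid(vw))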

4.5 Discussion

To calibrate and fine-tune circuits in a mixed analog/digital neuromorphic VLSI system it is in principle necessary to have direct access to each circuit node in the distributed system. This is not possible in a highly distributed system such as the one used in our experiments. The method proposed in this Chapter exploits the distributed communication of the AER and a few reconfigurable output nodes of the post-synaptic neurons, e.g., the membrane potential voltage output. One alternative solution to the problem of calibration is to derive detailed equations of the VLSI circuits and use them to estimate the hardware parameters, i.e., circuit biases, from the quantities of interest of the model, e.g., synaptic time constants [Neftci and Indiveri, 2010]. However, the equations typically depend on the specific circuit implementation, which makes this approach practically unfeasible to generalize to a multi-chip, experimental scenario. Furthermore, it typically needs extensive calibration procedures to obtain accurate estimations of the parameters of the substrate or to compensate for temperature changes that might compromise the experiments. These calibrations might in general require direct voltage measurements on circuit nodes not available in the output pin array. The mentioned disadvantages can be worked around with the black-box approach I described. Indeed, it relies on an indirect derivation, at every call of a "set-parameter" function, of the quantities of interest from the measurement of high-level parameters available through the distributed AER system or from configurable analog output signals of the post-synaptic circuits.

Obviously, the black-box approach is much slower at configuration time compared to an analytical parameter estimation method, especially in systems operating in real-time. For example, the measurement of a time-constant estimated around 500 ms needs at least this amount of time to be carried out. However, the method is not aimed at a precise calibration of the internal parameters, so the measurement procedures are typically operated once. Rather, they have been used to estimate the variability of the properties of the system, e.g., the speed of learning, information that led to further investigations in software simulations (see Chapter 6). The idea there is to consider intrinsic variability as a property of the substrate rather than a flaw of the CMOS technology to be compensated for. This method can indeed be extended to future emerging technologies to provide novel substrates for neuromorphic cognitive systems.

How does our synaptic core compare to other neuromorphic systems? FGMOS devices [Diorio et al., 1996, Brink et al., 2013] can be used to store analog synaptic weights for very long times. The advantage of using these devices lies in their compactness, because only a single transistor is required. However, since the value of the synaptic weight is determined only by the charge stored on the gate capacitance of that single transistor, FGMOS synapses are typically affected by large mismatch, which imposes fine-tuned calibration procedures and external resources to configure the weights or to implement the plasticity mechanism [Brink et al., 2013]. In view of an embedded solution, this problem would require one to store all the calibration parameters and make them accessible at any point in time to the synaptic array [Bailey and Hammerstrom, 1988]. Hence, FGMOS synapses lead to communication bottlenecks in the architecture.

The most successful neuromorphic system that uses memristor devices to store synaptic weights is composed of 10⁶ neurons and 10¹⁰ synapses [Minkovich et al., 2012]. Instead of using a crossbar architecture, the memristor-based system exploits a time-multiplexing scheme to minimize neuron and synaptic integration while maintaining a high degree of short- and long-term connections. This solution relies on a continuous refresh of the synaptic weights on the memristor array, due to the unreliability of the memristor devices [Kim et al., 2011]. Compared to a CMOS-based solution with bi-stable synapses, the memristor array with time-multiplexing is less power efficient because it cannot maintain memories for a long time unless an external, digitally controlled resource constantly updates the synaptic array.

Systems relying on large cores with distributed synapses pose important practical issues with respect to their configuration. For example, large arrays of memristor synapses require large currents to be configured, thus compromising the overall system's power efficiency. In addition, most memristor arrays require constant updates, realized through time-multiplexed systems [Minkovich et al., 2012], to compensate for the fact that weights decay due to the physical properties of the nano-device. This approach shares a fundamental similarity with traditional Von Neumann architectures, namely that the memory (the synaptic states of the network) and the computation (the neuro-synaptic core) are distinct physical entities. Conversely, the CMOS approach and recent studies on memristor devices follow the intent of using networks with distributed synapses as fundamentally different computational and memory systems, as originally proposed. One typically refers to the first kind of system as a reconfigurable synaptic array, as opposed to a plastic synaptic array.


4.6 Conclusions

The fine-tuned calibration of a neuromorphic system with highly distributed analog circuits, e.g., with distributed plastic synapses, is made unfeasible by the typical time requirements of a brute-force approach and by the difficulty of accessing all the necessary nodes of the VLSI circuits. However, a high-level approach that abstracts from the detailed knowledge of the circuits can be used to derive quantities that are relevant at the global level, e.g., by measuring only a few relevant internal nodes and extracting quantities from the recorded spiking activity. These quantities usually have a direct correspondence to the theoretical models, such as the long-term transition probabilities of the binary synapses in the system presented in the preceding chapters. In addition, it is still possible, in several cases, to obtain global estimates of the parameter variability of the hardware implementation and relate them to the particular circuit implementation. Here we provided one example relating to the currents generated by single synaptic weight updates. One important element for the success of the method is to set the system in experiment-like conditions, for example by keeping the number of excited synapses as close as possible to the one that will be used in the classification tasks. A very similar approach has also proven successful for the configuration of attractor dynamics in recurrent neural networks in neuromorphic hardware.

The work described in this Chapter represented an important milestone of the whole project because, inspired by the observations reported therein, I devised the strategies to obtain a powerful classifier using sub-optimal systems and components. In addition, the routines that implement the measurement procedures can be re-used for future generations of neuromorphic chips because they do not depend on the specific hardware implementation. Rather, they use the abstraction provided by the software tools to obtain direct measures of the relevant quantities.

CHAPTER 5

A generic neural network for pattern recognition

Simplicity is the ultimate sophistication.

Leonardo da Vinci

Neural networks have been used extensively in the past to model associative memory and to solve simple pattern recognition tasks. The “neocognitron” by Fukushima [Fukushima, 1980] was probably one of the first examples. Based on the idea that the human cortex is organized in a hierarchical structure, scientists proposed more advanced neural network models, e.g., for solving face recognition [Brunelli and Poggio, 1993, Riesenhuber and Poggio, 1999]. The key point of most of these very effective techniques is to separate the problem of extracting relevant features from the actual classification. Feature extraction models can be carefully designed for the specific task to solve in order to simplify the classification, reduce the computational requirements, improve the convergence of the actual learning. In this framework, some degree of prior knowledge on the task is typically required, as well as much computational power and large memory. These properties make the architectural design unfeasible for several applications. Here we exploit the distributed, highly parallel computation operated by the learning synapses using a single feed-forward hidden layer of Randomly Connected Neurons (RCNs) with fixed weights to build a neuromorphic classifier. This solution can be fairly easily implemented in neuromorphic hardware and operates a simple transformation that is very effective and scalable to more complex tasks. The theoretical framework in which the network is designed permits to estimate the proper choices for several practical applications. The model is mapped on compact, low-power but highly imprecise neuromorphic hardware using time-multiplexed AER and compact VLSI circuits with distributed synapses that change wit to a stochastic learning mechanism. Several classifiers are finally combined to improve the performance of the system. This chapter summarizes the theoretical and technical aspects of the implementation of the network using software simulations and the neuromorphic hardware. These results demonstrate the possibility to implement effective classifiers using low-power, imprecise neuromorphic hardware in real-time. The testing and characterization of the network behaviour on a typical machine learning dataset


5.1 Introduction

The idea of using neurons to solve classification tasks dates back to Rosenblatt’s perceptron model [Rosenblatt, 1958]. The perceptron is a binary classifier that maps an input vector into an output through a weighted sum of the inputs. For a long time the perceptron was considered the landmark of intelligence, and several cognitive abilities have been attributed to the ability of the perceptron to “learn its inputs”, i.e., to modify its weights so as to respond differently to different inputs. Even though the fascination with the perceptron model faded after the famous work by Marvin Minsky and Seymour Papert [Minsky and Papert, 1969], which proved the impossibility of the single-layer perceptron to learn the XOR function, several neuroscientists have borrowed from it the idea of imposing on an output unit, e.g., a neuron, a desired activity, e.g., a certain mean firing-rate, in response to one class of input patterns. The input-output function operated by such neurons has to depend on the overall activity of the network in order to produce a large repertoire of behaviours. These input-output functions can be seen as parametric pattern recognition tasks where neurons are required to respond to certain patterns of activation of the input. In view of hardware implementations, it is important to assess not only the theoretical problem of assigning a large repertoire of input-output functions to the neurons in the network but also the implementability of the model and its robustness in real, out-of-lab conditions, by making realistic assumptions on the hardware substrate. These aspects often share similarities with the biological substrate [Amit and Fusi, 1994]. The strategy we adopted to address the problem of robust learning in unknown environments is to distribute memory storage (through synaptic plasticity) and maintenance (memory elements, i.e., robust synapses) over the network. The problem of classifying input patterns is typically separated into a feature extraction stage and a classification stage. To extract relevant features of the input domain, a neural coding is introduced, i.e., a way for neurons to encode input information. Neural coding can be learned, as in the case of unsupervised learning [Hinton and Sejnowski, 1999, Bishop, 2006]. For example, unsupervised learning can be used to improve convergence and stability of deep-learning networks [Erhan et al., 2010]. Spike-codings have been proposed to impose on a neuron a certain spiking activity in response to a given spatio-temporal pattern of spikes [Guyonneau et al., 2006, Gütig and Sompolinsky, 2006, Dhoble et al., 2012]. Most of these models aim at reproducing the biological processes that support spike-based computation in the brain while exploring its computational advantages using information theory [Rieke, 1997, Izhikevich, 2006]. An alternative approach instead considers the information that is available to single readout neurons and uses a simple transformation to allow the readout neurons to show enough diversity in their responses, i.e., to implement a large number of input-output functions as required by high-level cognitive functions [Rigotti et al., 2010b, Barak et al., 2013, M.Rigotti et al., 2013]. One such simple transformation is the one operated by RCNs in a feed-forward network, where neurons receive connections from randomly weighted input units.
These neurons respond to complex combinations of the parameters characterizing the different sources of information, thus they are said to show “mixed selectivity”. This type of behavior can explain high-level cognitive functions in relation to the diversity of responses observed in regions of the cortex correlated with highly cognitive functions [Asaad et al., 1998, Miller and Cohen, 2001, M.Rigotti et al., 2013].

In machine learning, SVMs use the “kernel-trick” to project the input into a very high-dimensional space, similarly to the linear projection operated by the RCNs [Aizerman et al., 1964]. Using RCNs in machine learning tasks thus seems to be a reasonable choice on both theoretical and biological grounds, and in addition it has practical advantages that will become clearer later in the chapter. The system presented here exploits the use of distributed learning synapses. A system with distributed learning can be scaled up if the synapses use local information, as in the case of Calcium-based models [Fusi, 2002, Clopath et al., 2010, Graupner and Brunel, 2012]. Also in the case of the VLSI circuits of the neuromorphic system used in this thesis, the synapse stores only its internal state, whereas synaptic changes depend on signals from the post-synaptic soma and are activated by pre-synaptic spikes. Several alternative approaches exist and have been extensively explored in the literature. The precision of analog CMOS synapses is limited by device mismatch [Shockley, 1961]. Consequently, the circuits are highly variable and lead to undesired behaviours. One elegant, partial solution to the problem of variability is to consider synapses that have stable states, eventually binary [Amit and Fusi, 1994, Fusi et al., 2005]. These types of synapses do not require precise weight-updates, are more robust to spurious modifications due to spontaneous activity than continuous synapses, and can store memories for long times. Online learning using bounded synapses can be obtained using stochastic state-updates [Fusi et al., 2000]. State transitions in deterministic, bi-stable synapses can be imposed stochastically by exploiting the variability of the spike-trains in the network [Chicca and Fusi, 2001, Brader et al., 2007]. Memories based on bounded synapses and stochastic learning have the palimpsest property, i.e., new memories slowly erase older ones, which is a useful property for a continuously learning system since training and testing phases are not distinguished. Furthermore, the convergence of the learning algorithm can be mathematically proven [Senn and Fusi, 2005]. In conclusion, bi-stable synapses have important theoretical properties and are a feasible model for physical synapses. However, as shown in the preceding chapters, the high variability of the CMOS circuits implementing the synaptic state updates can have strong effects on the dynamics of learning.

5.2 The feed-forward network with random projections

A hidden layer of RCNs can be used as a generic pre-processing stage. The only requirement for the neurons is to have some form of non-linearity, similarly to the requirements of SVMs for non-separable problems [Bishop, 2005]. Such a pre-processing stage has the property of being robustly implementable in hardware in a compact design since no learning is required. Random projections not only have technical advantages but can also be shown to maximize the classification margin on several typical SVM problems, i.e., by choosing an SVM kernel that corresponds to a single layer of RCNs [Cho and Saul, 2010]. Furthermore, the use of random projections has several practical advantages. By choosing the threshold for the non-linearity of the neurons it is possible to fix the average level of activity of the hidden layer, i.e., the coding level [Rigotti et al., 2010b], without the need for normalization of the input data or feed-forward inhibition from the input layer [Senn and Fusi, 2005, Brader et al., 2007]. This parameter will in turn regulate the sparsity of the representations, which affects the generalization properties of the classifier [Barak et al., 2013]. The computational advantage of using a large pool of RCNs is well known [Vapnik, 1998, Bishop, 2006, Rigotti et al., 2010b] and consists of realizing a dimensionality expansion. By fixing the coding level of the hidden layer it is possible to estimate analytically and empirically the generalization properties of the classifiers using the theoretical framework of the perceptron [Rosenblatt, 1958, Barak et al., 2013]. These aspects ensure that the linear classifiers of the output layer can successfully learn to detect the correct input patterns, i.e., the convergence of the output classifier is guaranteed. As we shall see later in the thesis, it is not very important to obtain perfect classification from each classifier, as better-than-chance performance levels will suffice (see for example [Breiman, 1996, Ji and Ma, 1997]). One necessary condition, though, is that the classifiers are independent, a condition that will be granted by a combination of stochastic learning and circuit mismatch. A single linear classifier can be trained to detect representatives of a given target class of input stimuli. The input stimuli are projected to the hidden layer of RCNs to solve non-linearly separable problems. The projection is implemented by multiplying the activity levels of the input patterns, e.g., the grey-scale levels of the image in the case of hand-written digits, by a matrix composed of random values drawn from a normal distribution with mean µ = 0 and σ = 0.2. The patterns of activity of the RCNs are then converted into Poisson spike trains with fixed mean firing rate proportional to the activity level.


Figure 5.1: Schematic diagram of the feed-forward network. Input images taken from the dataset are transformed in software using random projections onto binary neurons (black: inactive; white: active). Random weights are represented as arrows with shades of grey. The activities of the binary neurons are converted into Poisson spike-trains with firing rates proportional to them. The spike-trains are fed to the VLSI bi-stable synapses (red and green spots), from which output spikes and the neuron’s membrane potential can be recorded. Thus, the output neuron operates a weighted sum of its inputs collected along its (dimension-less) dendritic branch.
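The random projection and thresholding described above can be summarized in a few lines of numpy. The sketch below is illustrative only; the function name rcn_layer and its defaults (which follow the numbers quoted in the text: 3472 RCNs, zero-mean weights with σ = 0.2, threshold θ = 0) are assumptions for this example, not the actual thesis software.

```python
import numpy as np

def rcn_layer(x, n_rcn=3472, sigma=0.2, theta=0.0, seed=0):
    """Project an input vector onto a layer of randomly connected neurons
    (RCNs) with fixed random weights and binarize with a fixed threshold."""
    rng = np.random.default_rng(seed)
    W = rng.normal(0.0, sigma, size=(n_rcn, x.size))  # fixed random weights
    h = W @ x                                         # linear projection
    return (h > theta).astype(int)                    # binary RCN activity

# With theta = 0 and zero-mean weights, roughly half of the RCNs are
# active, i.e. the coding level is close to f = 1/2.
x = np.random.rand(784)          # a toy 28x28 "image"
xi = rcn_layer(x)
print("coding level:", xi.mean())
```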

5.3 Ensembles of hardware classifiers: theory and simulations

The plastic synapses of the output classifiers are bi-stable and undergo long-term transitions on a stochastic basis due to the stochasticity of the incoming spike trains. One of the implications of the presence of stochasticity in the synaptic state transitions is that, for any given pattern, independent classifiers will learn different representations of the input stimuli. For example, a single layer of linear classifiers with stochastic, bi-stable synapses has been used to learn the XOR problem with a single layer of neurons [Senn and Fusi, 2005]. Several imprecise classifiers can be aggregated using simple rules, such as a majority rule, to obtain a global, improved classifier [Breiman, 1996]. The reason for the improvement lies in the independence of the single classifiers. In other words, there would be no gain in accumulating evidence from identical classifiers, since no additional information would be carried by each new unit introduced. Thus, in order to obtain a successful classification, the single classifiers have to have better-than-chance performance and have to be independent from each other. Hence, learning for the single classifiers can be imprecise as long as multiple classifiers can be introduced in the system [Ji and Ma, 1997].


Figure 5.2: A pool of weak linear classifiers can perform better than a single good, sub-optimal, linear classifier. The ellipses represent two clouds of patterns for two classes to be separated. (1): the optimal linear classifier (hyperplane represented as a red line). (2): a pool of weak classifiers, all similar to each other. (3): a pool of very weak classifiers, all different from each other. (4): graphical representation of the majority rule, where the red line delimits the region in which the majority of classifiers are active. The majority-rule based classifier approximates the optimal classifier when the number of weak classifiers is large.

5.3.1 Abstract model of stochastic learning synapses

In order to obtain an empirical estimation of the final synaptic state to be compared with the one obtained from the hardware, software simulations of the abstract rule [Senn and Fusi, 2005] have been implemented in Python using the numpy numerical libraries [Walt et al., 2011]. The implementation of the abstract rule is particularly straightforward in a flexible programming language such as Python, and can be optimized easily using the available software tools to compile vector and matrix operations from numpy into plain C, for example with Cython [Behnel et al., 2011]. The network model consists of N input neurons connected to M output neurons. An output neuron is active, $\phi_i^t = 1$, if a stimulus elicits a response higher than a threshold $\theta$, otherwise $\phi_i^t = 0$.

61 Classification in VLSI

The total synaptic current $h_i^t$ is the weighted sum of the input stimulus combined with a global inhibitory signal $g_I$. During learning, the synaptic weights $J_{ij}$ connecting neuron $i$ to neuron $j$ change on a stochastic basis according to a supervised learning mechanism. With this mechanism, the output neurons are assigned to target classes and their activity is set to the desired output configuration for each stimulus presentation in the training phase. The synapses undergo stochastic long-term modifications at each pattern presentation with the following rule:

$$
J_{ij}(t+1) =
\begin{cases}
J_{ij}(t) + \zeta_j^{+}\,(1 - J_{ij}(t)), & \text{if } \phi_i^t = 1,\ h_i^t \le \theta + \delta \\
J_{ij}(t) - \zeta_j^{-}\,J_{ij}(t), & \text{if } \phi_i^t = 0,\ h_i^t \ge \theta - \delta,
\end{cases}
\tag{5.1}
$$

where $\zeta_j^{\pm}$ are binary variables which are 1 with probability $q^{\pm}$ and 0 with probability $1 - q^{\pm}$. Similarly to the perceptron, the goal of learning is to find the set of synaptic weights for which a stimulus taken from the Target (Null) class elicits an activity higher (lower) than the threshold $\theta$ with some given margin $\delta$. The learning converges to this condition under fairly weak assumptions on the parameters of the learning dynamics and in a finite number of steps that scales like $1/N$ [Senn and Fusi, 2005].
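A minimal numpy sketch of one presentation under the abstract rule of eq. 5.1 is given below. The function name, the assumption that only synapses driven by active inputs are eligible for a transition, and the toy usage values are illustrative choices, not the actual simulation code of the thesis.

```python
import numpy as np

rng = np.random.default_rng(1)

def abstract_update(J, xi, phi_target, h, theta, delta, q_plus, q_minus):
    """One pattern presentation under the abstract stochastic rule (eq. 5.1).

    J          : binary synaptic states (values in {0, 1})
    xi         : binary input pattern (active inputs select eligible synapses)
    phi_target : desired output (1 for Target, 0 for Null)
    h          : total synaptic current evoked by the pattern
    theta      : neuron threshold; delta : learning margin
    q_plus, q_minus : LTP / LTD transition probabilities
    """
    J = J.copy()
    if phi_target == 1 and h <= theta + delta:
        # candidate potentiations, each accepted with probability q_plus
        zeta = (rng.random(J.size) < q_plus) & (xi == 1)
        J[zeta] = 1
    elif phi_target == 0 and h >= theta - delta:
        # candidate depressions, each accepted with probability q_minus
        zeta = (rng.random(J.size) < q_minus) & (xi == 1)
        J[zeta] = 0
    return J

# Toy usage with arbitrary values: a Target presentation below threshold
# potentiates a small random subset of the active synapses.
J = rng.integers(0, 2, size=1000)
xi = rng.integers(0, 2, size=1000)
J_new = abstract_update(J, xi, phi_target=1, h=0.2, theta=0.5, delta=0.1,
                        q_plus=0.01, q_minus=0.01)
print("potentiated synapses:", int(np.sum(J_new > J)))
```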

5.3.2 Phenomenological model of the spiking implementation

The abstract rule can be implemented in a biologically plausible model that introduces a soft stop-learning mechanism for the long-term modifications, based on the activity of the post-synaptic neuron [Fusi, 2002, Brader et al., 2007]. The formal neurons and their states $\phi_i^t$ are substituted by mean rates $\nu_i^t$, and the stochastic variables $\zeta^{\pm}$ by variables $\Gamma^{\pm}$ that depend on the post-synaptic rates. The stop-learning condition is implemented through a dependence of the synaptic modifications on the activity of the post-synaptic neuron. For example, a slowly integrating variable can be introduced to gate the long-term modifications. This variable acts as a measure of the post-synaptic activity, similarly to an internal Calcium concentration [Abarbanel et al., 2002, Shouval et al., 2002]. By introducing thresholds on the values of the Calcium variable, long-term modifications can be blocked for very low or very high firing rates, which are both situations in which the synaptic states are highly correlated with the input patterns. For intermediate values, the synapses are modified with probabilities that are rescaled linearly by the pre-synaptic rates. This type of dependence reflects the electronic implementation of the bi-stable synapses. The learning rule of eq. 5.1 is modified into the following:

$$
J_{ij}(t+1) =
\begin{cases}
J_{ij}(t) + \nu_{\mathrm{pre}}\,\Gamma^{+}\,(1 - J_{ij}(t)) \\
J_{ij}(t) - \nu_{\mathrm{pre}}\,\Gamma^{-}\,J_{ij}(t),
\end{cases}
\tag{5.2}
$$

where the $\Gamma^{\pm}$ contain the dependence on the post-synaptic Calcium variable. In the case we will consider, for example, they are drawn from a Gaussian distribution. In the rate-based model, the synaptic connectivity and the input patterns are represented by binary vectors. The output neuron is represented by a threshold-linear unit:

$$
\nu_t(\xi^{\mu}) = \theta\!\left(\vec{\xi}^{\,\mu} \cdot \vec{w}_t + \nu^{T}\right), \tag{5.3}
$$


where $\theta(x) = x$ if $x > 0$ and 0 otherwise, $\xi^{\mu}$ is a pattern of class $\mu$, and $\vec{w}_t$ is the synaptic connectivity at time $t$. The term $\nu^{T}$ represents the effect of the input current given by the external teacher signal. The learning protocol proceeds as follows. The synaptic values are initialized at random, with a probability of 1/2 for a synapse to start in the high state. This condition is maintained at all times (on average) in the case of balanced learning. Each binary classifier is assigned to one class. We will refer to this class as the Target class, while the patterns of all the other classes will be considered as belonging to a single Null class. Each unit is trained with a supervised learning protocol. Patterns from all classes are selected at random each time and a teacher signal is delivered to the neuron together with the input pattern. If a pattern that belongs to the Target class is presented, the teacher signal “high” is chosen, whereas if the presented pattern belongs to the Null class the teacher signal “low” is chosen. When the learning process starts, the difference in output rates ν upon the presentation of a pattern of the Target class and one of the Null class is given mostly by the teacher signal. This has to be such that for a pattern of the Null class the output rate ν corresponds to a region where the probability of LTD dominates, and vice versa, for a pattern of the Target class the probability of LTP dominates. As a consequence of this choice, the learning process will slowly select the synapses that represent the Target class and not the Null class. The following rules for a successful convergence of the classifiers can be derived empirically using the software simulations and compared with the theoretical analysis in the literature [Amit and Fusi, 1994, Fusi et al., 2000, Brunel et al., 1998, Brader et al., 2007]; a minimal simulation sketch of the rate-based rule is given after the list.

• The learning has to be slow. When the transition probabilities are too high, each pattern presentation erases the memory of the previous patterns. To avoid this, the learning process has to be slowed down.

• The learning has to be smooth. When a pattern is presented, a random set of synapses gets potentiated or depressed, depending on whether the pattern belongs to the Target class or the Null class. Depending on the overlap between the classes, a number of these modified synapses represent both classes. If too many of these synapses are modified, an unbalanced situation is reached, where the teacher signal is not strong enough to overcome the current given by the plastic synapses. To avoid this problem, the pattern presentations have to be balanced between Target and Null classes and the number of modified synapses at each presentation has to be small.

• The stop-learning effect has to be strong enough. As an effect of the stop-learning mechanism, each classifier does not maximize the margin on the training set, causing a diversity in the outputs of the pool of classifiers. This process is reminiscent of a Bootstrap Aggregating (Bagging) technique, where random samples of the training set are generated [Breiman, 1996]. The diversity is needed in order for the majority rule over the pool to be effective.

• The stop-learning must not be too strong. In order for the stop-learning to be effective, the output rate of the neuron has to change as an effect of the learning process. The change is due to the number of synapses that become correlated with the Target class, as an effect of learning, and their total current depends on the synaptic weight parameter. By increasing the synaptic weight it is possible to control the number of synapses that will represent the class. For example, by increasing it, a smaller number of synapses will be needed to drive the output neuron into the stop-learning region. However, stronger synapses cause larger variations in the output rate during learning, increasing the risk of compromising the smoothness of the learning mentioned above.

• The transition probabilities have to be balanced, which can be ensured by balancing the presentation of Target and Null classes.
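As announced above, the following sketch illustrates one presentation under the rate-based rule of eq. 5.2 with a deliberately crude Calcium-gated stop-learning mechanism. The thresholds ca_low and ca_high, the rate normalization nu_max and the split between LTP and LTD regions are illustrative assumptions, not the parameters of the VLSI implementation.

```python
import numpy as np

def rate_based_update(J, xi, nu_post, gamma_plus, gamma_minus,
                      ca_low=0.1, ca_high=0.9, nu_max=100.0):
    """One presentation under the rate-based rule (eq. 5.2) with a crude,
    Calcium-gated stop-learning mechanism.

    J           : bounded synaptic states in [0, 1]
    xi          : binary input pattern; the pre-synaptic rate is proportional to it
    nu_post     : post-synaptic firing rate during the presentation
    gamma_plus, gamma_minus : LTP / LTD amplitudes (e.g. Gaussian samples)
    ca_low, ca_high : stop-learning thresholds on the Calcium-like variable
    """
    ca = nu_post / nu_max            # Calcium proxy: rescaled output rate
    if not (ca_low < ca < ca_high):  # stop-learning: no modification at all
        return J.copy()
    nu_pre = xi.astype(float)        # pre-synaptic rates rescale the step
    if ca > 0.5:                     # high post-synaptic activity -> LTP
        J = J + nu_pre * gamma_plus * (1.0 - J)
    else:                            # low post-synaptic activity -> LTD
        J = J - nu_pre * gamma_minus * J
    return np.clip(J, 0.0, 1.0)

# Toy usage with arbitrary values.
rng = np.random.default_rng(2)
J = rng.random(1000)
xi = rng.integers(0, 2, size=1000)
J_ltp = rate_based_update(J, xi, nu_post=70.0, gamma_plus=0.05, gamma_minus=0.05)
print("mean weight change:", float(np.mean(J_ltp - J)))
```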

5.3.3 The interplay between synaptic scaling and stop-learning

The goal of the learning is to impose the desired output by means of stochastic synaptic modifications. Obviously these modifications are not random. The abstract rule imposes modifications by “clamping” the output neurons to desired values [Senn and Fusi, 2005]. The rate-based model adopted in the VLSI hardware is a modification of that model which is biologically more plausible and has a simpler implementation in VLSI [Chicca et al., 2003, Brader et al., 2007]. It requires considering separately the number of active synapses and the total synaptic current. This choice has important consequences for the electronic hardware implementation and requires a short explanation. The abstract model considers a teacher signal that imposes the correct response on the post-synaptic neuron almost regardless of the total number of active synapses. The minimum number of synapses needed to have at least some level of activity is determined by the stop-learning threshold. This threshold does not require fine tuning, but learning tightly separated patterns requires a small value. Under this condition, only the current due to the “correct” synapses will survive the learning process, whereas the part due to the synapses that do not correctly encode the input will be removed. The former component is referred to as the signal component relative to the task, while the latter is a noise component. Note that this is the only possibility for a successful learning with bounded synapses. Hence, as in the case of the perceptron with unbounded synapses, the learning increases the Signal-to-Noise Ratio (SNR) until the classification margin is reached and correct classification is obtained. Unfortunately, the type of teacher signal used in the rate-based model does not permit the above dynamics. Supposing we fix the $\Gamma^{\pm}$ distributions for the long-term modifications, the requirements of the abstract rule translate into the requirement of very strong synaptic weights. However, this compromises the stability of the learning process, which in turn cannot be controlled by a constant teacher signal. For example, if fluctuations in the change of synaptic states bring the output neuron into a region where LTP dominates when patterns of either class are presented, the learning will never converge (see also [Senn and Fusi, 2005]). The natural solution to the problem of unstable learning is to keep the two stop-learning regions, the one that stops LTD at low rates and the one that stops LTP at high rates, very distant from each other. The constant teacher signal is used to set the neuron to one of the two regions in accordance with the presented label. Because of the distance between the two regions, instability will never happen. However, this condition requires very large regions of permitted LTP and LTD, which is impractical in hardware, or prior knowledge of the type of input patterns, their average intra-class variability, their average inter-class separability, and their average coding level.


5.3.4 Examples from simulations

One can observe the dynamics of learning by measuring the output firing rate and the overlap between the state of the synapses and the prototypes of each of the two classes. While these cannot be used as actual measures of the memory storage, they can be used as qualitative tracers for parameter optimization [Brunel et al., 1998]. Here, we compute the overlap m with a class C as a non-normalized scalar product between the synaptic states array W and the average of the binary input patterns of one class:

$$
m_C = \vec{W} \cdot \langle \xi_C \rangle \tag{5.4}
$$

The prototype is calculated by averaging the patterns of the class and applying a 0.5 threshold to obtain the binary pattern representing the prototype of the class. In the following pages we report a collection of three panels showing the dependence of the learning dynamics on an increasing synaptic-weight limit value. In each panel, two classifiers are trained to recognize the digit 2 from the MNIST database against all the other digits. The order of the presentations during learning is kept the same in all three experiments. The color code is blue for the Null and green for the Target class. Each panel consists of: top-left, LTP and LTD probability functions; top-center, output rates as a function of the overlap; top-right, evolution of the overlap as a function of the pattern presentation; bottom-left, histograms of the output rates in Hz; bottom-center, output rates as a function of the number of pattern presentations; bottom-right, overlap difference between the Target, t, and the Null, n, classes: $m_t - m_n$. Scattered data and lines in each plot correspond to two classifiers trained in parallel. The synaptic weight limit is increased in each successive panel, until this value causes fluctuations in the output rates that lead to a breakdown of the dynamics (last panel). The global inhibition weight is also increased to rescale the total synaptic current. Increasing the strength of the synapses results in an amplification of the fluctuations due to synaptic transitions and to the variations of the input stimuli. If the fluctuations are amplified too much, the output neuron can cross the regions of correct synaptic transitions determined by the shape of the transition probability functions. In this case, the synaptic transitions become incoherent with the supervised learning.
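The overlap of eq. 5.4 and the binary prototype construction translate directly into numpy. The helper functions below are a sketch under the definitions given above; the names and the toy data are illustrative only.

```python
import numpy as np

def class_prototype(patterns):
    """Binary prototype of a class: average the binary patterns of the
    class and threshold the mean at 0.5, as described in the text."""
    return (np.mean(patterns, axis=0) > 0.5).astype(int)

def overlap(W, patterns):
    """Non-normalized overlap m_C = W . <xi_C> of eq. 5.4 between the
    synaptic state vector W and the class average."""
    return float(np.dot(W, np.mean(patterns, axis=0)))

# Toy usage with random binary data.
rng = np.random.default_rng(3)
patterns = rng.integers(0, 2, size=(100, 500))   # 100 patterns of one class
W = rng.integers(0, 2, size=500)                 # binary synaptic states
print(overlap(W, patterns), class_prototype(patterns).sum())
```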


Figure 5.3: Aggregating random weak classifiers. Two classes of synthetic data are generated (red and blue scattered data). Each weak classifier is represented by a line (2 parameters) and a direction (third parameter, not shown) that separates the plane and assigns prediction labels to the data (circles are correctly classified, crosses are incorrectly classified data-points). The linear classifiers are randomly generated and dropped if their performance is lower than chance on the test set by some margin. Combinations of such classifiers, for example by majority vote, are then used to classify the training data. The shade of green in the bottom-left plot represents the accumulation of plane-separations by the classifiers. The generation is stopped at N = 200 classifiers. The error rate in the bottom-right plot is computed on the test set.

5.3.5 Combinations of weak classifiers

The dynamics due to the stochastic learning leads to highly heterogeneous classifiers, all having similar performance but each resulting in a different subset of misclassified patterns. This is equivalent to training each classifier on a different subset of the training set. In this case, it is possible to combine the responses of the single classifiers to obtain a single, improved classifier. A multitude of machine learning algorithms use similar techniques, for example by explicitly training each predictive model on chosen subsets of the training data (bootstrap) and then obtaining “aggregated” predictions [Breiman, 1996, Schapire and Freund, 2012]. The type of model used for the single classifiers does not make much difference as long as they show enough variability, i.e., if perturbing the training set causes significant differences in the resulting classifier, then aggregation helps. In the extreme case, random predictors can be constructed that have just better-than-chance performance on the test set, avoiding any training of the individual predictors and thus reducing the overall training time [Ji and Ma, 1997]. The instability of the stochastic learning process depends mainly on the probabilities of long-term modifications. In the slow-learning condition, the probabilities are low and each classifier evolves toward the “average” stimulus pattern [Brunel et al., 1998]. If the probabilities are high, each stimulus presentation imposes large modifications on the synaptic states. One can illustrate this mechanism by training an ensemble of classifiers with different transition probabilities, maintaining the ratio of LTPs and LTDs for balanced learning.


(a) $p^{+}_{\max} = 0.1$; (b) $p^{+}_{\max} = 0.01$; (c) $p^{+}_{\max} = 0.001$

Figure 5.4: Instability of learning for different levels of maximal transition probabilities. Black and white pixels represent the responses of each classifier (y-axis) for each pattern presentation (x-axis) after 10000 presentations of a subset of 100 patterns from the training set. The red lines correspond to the presentations of the target class, the digit 2, which is the same for all the classifiers. Low probabilities ensure convergence but require long training; aggregation can thus be exploited to speed up the training.

The results for 20 classifiers are shown in Fig. 5.4.
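The effect of the majority rule can be reproduced with a toy numpy experiment in which independent weak classifiers are correct with a fixed probability; the 0.6 accuracy and the (odd) pool sizes below are arbitrary illustrative choices, not measurements from the system.

```python
import numpy as np

def majority_vote(responses):
    """Aggregate the binary responses of a pool of classifiers with a
    simple majority rule.

    responses : array of shape (n_classifiers, n_patterns) with 0/1 entries
    """
    return (responses.mean(axis=0) > 0.5).astype(int)

# Independent weak classifiers, each correct with probability 0.6 on every
# pattern: the aggregated error drops quickly with the pool size, in line
# with the argument in the text.
rng = np.random.default_rng(4)
truth = rng.integers(0, 2, size=1000)
for n in (1, 5, 21, 101):
    correct = rng.random((n, truth.size)) < 0.6
    responses = np.where(correct, truth, 1 - truth)
    err = float(np.mean(majority_vote(responses) != truth))
    print(f"{n:4d} classifiers -> error {err:.3f}")
```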

5.4 Building the neural network in VLSI

5.4.1 Random projections with a probabilistic mapper

Random projections can be implemented very efficiently in neuromorphic hardware. By exploiting the linearity of the synapse circuits, random weights can be associated with an all-to-all connectivity from the input to the hidden layer while using one single synaptic circuit per neuron. The different synaptic weights are realized by introducing probabilistic synaptic transmission [Goldberg et al., 2001]. Each synaptic connection is associated with a release probability proportional to its weight. For every spike generated by a pre-synaptic physical or simulated neuron, the digital device that realizes the connectivity of the neural network transmits the spike to the target excitatory synapse with the probability associated with that connection. Thus, this technique effectively rescales the generated Poisson spike-trains. Random probabilities taken from a Gaussian distribution are associated with each one of the all-to-all connections from the input to the hidden layer. Obviously this technique generates only positive weights; however, the leak of the neurons is used to center the generated synaptic current around the non-linearity of the neurons’ activation function.


Figure 5.5: Intersection is defined as the ratio between the number of neurons that are active for any mixture of the input stimuli and the coding level. An optimal situation is obtained in the case of dense representation.

Changes in the leak of the neurons, a global parameter, effectively lead to a change in the number of active neurons in the hidden layer, i.e., the coding level. A useful effect is that, because of the random connections and the neurons’ non-linearity, the coding level is roughly constant across the different input stimuli, irrespective of the actual coding level of the stimulus. The coding level of the hidden layer has important implications for the coding of the input classes and the learning. By changing the coding level it is possible to regulate the generalization/discrimination trade-off of the input classes [Barak et al., 2013] and to influence the number of classes that can be correctly classified [Rigotti et al., 2010b]. In addition, the VLSI hardware has to be calibrated in order to produce the expected output for the given coding level, since the output firing rates that activate LTP and LTD are restricted to certain regions. Fig. 5.5 shows results taken with a neuromorphic chip with 2048 neurons with non-plastic, excitatory synapses and a probabilistic mapper for the random connections. The coding level is set by changing the leak of the I&F neurons. The bias Vlk is a global bias that controls the leak of all the neurons by generating a constant current that continuously discharges the capacitance of each neuron. For the experiment, 4 different non-overlapping spiking patterns are generated. Each pattern consists of 4 spike-trains generated from a homogeneous Poisson process at a high firing rate (90 Hz) and 12 spike-trains at a low rate (10 Hz). Using the excitatory synapses with probabilistic release, each spike-train is projected into a population of 2048 neurons. The activity of these neurons is recorded for all 4 patterns and the selectivity of each RCN is computed. Neurons that are selective for any two combinations of input patterns are accumulated in a measure of “intersection”, similar to a mixed selectivity measure [Rigotti et al., 2010b].
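The probabilistic mapper can be modelled in software as a simple thinning of the pre-synaptic spike train. The sketch below is illustrative (the rate, duration and release probability are arbitrary) and does not reproduce the digital hardware interface.

```python
import numpy as np

rng = np.random.default_rng(5)

def probabilistic_mapper(spike_times, p_release):
    """Forward every pre-synaptic spike to the target synapse with a fixed
    release probability, which effectively rescales the Poisson rate."""
    keep = rng.random(spike_times.size) < p_release
    return spike_times[keep]

# Example: a 90 Hz Poisson train thinned with p = 0.3 behaves, on average,
# like a ~27 Hz train at the synapse.
rate, duration = 90.0, 10.0
spikes = np.sort(rng.uniform(0.0, duration, rng.poisson(rate * duration)))
thinned = probabilistic_mapper(spikes, 0.3)
print(spikes.size / duration, "Hz ->", thinned.size / duration, "Hz")
```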

5.4.2 Testing the VLSI perceptrons

The VLSI perceptron is finally tested using simulated patterns and patterns from the MNIST dataset. Each stimulus pattern, represented as grey-scale levels, is projected onto a pool of 3472 formal neurons. To do that, a random matrix $W_{ij}$ is generated by sampling random numbers from a Gaussian distribution with mean $\mu = 0$ and $\sigma = 0.2$.


Figure 5.6: Accordance between non-spiking simulations and neuromorphic hardware.

The binary patterns $\xi_i$ used to train the VLSI classifiers are obtained by applying a threshold $\theta = 0$ to the projection vector:

$$
\vec{\xi} = \max(W_{ij} \cdot \vec{x},\, \theta) \tag{5.5}
$$

By choosing a threshold of $\theta = 0$, we fix the coding level of the projected patterns to $f = 1/2$; thus, on average $N/2 = 3472/2 = 1736$ synapses receive a high-rate spike-train. Each presentation consists of a Poisson spike-pattern of 500 ms duration whose firing rates correspond to the binary representation: a value of 1 corresponds to a firing rate of 50 Hz, a value of 0 to a rate of 5 Hz. Patterns are chosen randomly from the training set. One of the classes is chosen as the target class, and whenever a pattern from that class is presented, a “teacher” signal consisting of a high-rate Poisson train is multiplexed together with the pattern, stimulating the excitatory non-plastic synapse of the output neuron. This teacher signal has been calibrated prior to the experiment and is able to drive the neuron to a post-synaptic activity where LTP dominates. In the absence of the teacher signal, the neuron fires at a rate which corresponds to a higher chance of LTD. Measurements from the VLSI hardware and from a non-spiking simulation of the abstract rule are shown in Fig. 5.6. To compare the simulation and the hardware measurements, the overlap between the average pattern of the two classes and the synaptic matrix is shown [Brunel et al., 1998]. Absolute numbers, instead of ratios, are deliberately reported because, due to technical issues, it is practically convenient to have a sense of the number of synapses that are actually contributing to the dynamics of the output neuron. As can be seen from the figure, for the chosen benchmark, a very small number of synapses actually contributed appreciable changes in the output firing rate to make the classifier converge to the final solution, i.e., to activate the stop-learning condition. For the hardware, an even smaller number of synapses contributes, because of circuit mismatch. Notice that these numbers depend on how tight the separation between the two classes is; thus, fine tuning of the relative strength between the excitatory synapses, the global inhibition and the teacher signal was required [Senn and Fusi, 2005].
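The conversion of a binary RCN pattern into Poisson spike trains for one presentation can be sketched as follows. The values (500 ms, 50 Hz and 5 Hz, 3472 synapses) follow the text, while the function name and the usage example are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(6)

def pattern_to_spiketrains(xi, t_pres=0.5, rate_high=50.0, rate_low=5.0):
    """Convert a binary RCN pattern into one Poisson spike train per synapse
    for a single presentation (500 ms, 50 Hz for active units, 5 Hz for
    inactive ones, as quoted in the text)."""
    rates = np.where(xi == 1, rate_high, rate_low)
    trains = []
    for r in rates:
        n_spikes = rng.poisson(r * t_pres)
        trains.append(np.sort(rng.uniform(0.0, t_pres, n_spikes)))
    return trains

# Toy usage: a pattern with coding level ~1/2 over 3472 synapses.
xi = rng.integers(0, 2, size=3472)
trains = pattern_to_spiketrains(xi)
active_counts = [len(t) for t, a in zip(trains, xi) if a == 1]
print("mean spike count on active synapses:", float(np.mean(active_counts)))
```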

5.5 Discussion

The type of network introduced here and the synapse model implemented in the VLSI hardware are a generic but reliable solution to the problem of static pattern recognition. As opposed

to feature-extraction based networks, here there are no assumptions on the dimensionality or modality of the input stimuli. Such a property is important in view of the construction of a generic recognition system but is also of biological relevance. In many cases, indeed, neurons in the brain receive input from different encoding sources, e.g., in sensory integration or in combinations of internal states and external stimuli. The neural network presented here can in principle be used in all these cases regardless of the specific input domain, thus representing a fundamental building block for a generic learning system.

5.5.1 Aggregating techniques and the stochastic learning

The empirical analysis presented in this chapter concerning the properties of the aggregated stochastic learners is similar to the analysis of [Amit and Blanchard, 2001]. The authors describe the ergodic nature of a given algorithm, i.e., the algorithm converges to the same solution, on the same dataset, regardless of the randomized initial conditions. One method to study how this property helps is to compute the average conditional margin and the average conditional covariance of two classifiers drawn independently from the distribution of possible classifiers, and to use them in a character recognition task with a limited set of properties [Amit and Mascaro, 2001]. Boosting is a particular case of such an ergodic algorithm. It would be interesting to analyze the convergence of aggregated learners that use synapses with stochastic learning using the same formalism. For example, even though it converges similarly to a deterministic algorithm and thus always reaches similar classification performance, the stochastic learning might always produce different internal synaptic structures. These structures might preserve certain properties of the input relevant for subsequent stages of the network, increasing the flexibility of the network.

5.5.2 The use of high-level simulations to calibrate low-level parameters

It should be noted that in the simulations it has been assumed that VLSI mismatch is the only cause of the large deviations in the transition probabilities observed in Chap. 4. However, those deviations also contain a component that depends on the single trial, the outcome of which is determined by the single realization of the “noisy” post-synaptic activity. Problems might arise when the state transitions are strongly correlated with the post-synaptic activity, which can happen with a wrong choice of parameters. This type of dependence can be easily introduced in the simulations, for example by introducing a global rescaling of the transition probabilities for each pattern presentation. What has just been described is an example of how a high-level, phenomenological description of a complicated hardware implementation can be used to easily explore the links between low-level parameters and the final outcome of the learning. For example, the strong dependence on the single realization of the post-synaptic activity can be used as an additional non-linear, temporally distributed modulating mechanism on top of the supervised learning.

5.5.3 Similarities with boosted SVMs

The use of a large, randomly connected, feed-forward hidden layer of neurons in the network is similar to an SVM approach. In addition, the readout neurons undergo independent realizations of a stochastic learning on the same data, and as a result their classification responses can be

aggregated to obtain a single, improved classifier. This operation is equivalent to training the classifiers on different datasets, i.e., Bootstrap AGGregatING (bagging) [Breiman, 1996]. These similarities with machine learning techniques are implemented by the feed-forward neural network very efficiently in terms of power consumption, memory requirements and speed of operation. It is then reasonable to ask whether the proposed network can be implemented with the same efficiency using existing techniques from machine learning. The results presented in this Chapter suggest a negative answer. The typical implementation of SVMs requires the definition of a “kernel”, a mathematical device to treat the data as if they were projected into a space of indefinitely large dimensionality. While mathematically powerful, the training of SVMs often requires considerable computational effort, e.g., quadratic programming algorithms. Boosting is a simple technique to improve the classifier while limiting the computational costs of the classification. However, only a few attempts can be found in the literature that combine SVMs and boosting. More common classifiers used with boosting are random trees [Chen et al., 2004, Breiman, 2001] or simple linear classifiers [Schapire and Freund, 2012]. In [Kim et al., 2002], the authors use bagging and boosting in combination with SVMs and obtain an improvement over the state-of-the-art at that time. However, the application of plain bagging techniques requires the introduction of an external source of noise to generate the bootstrap samples for each SVM. By exploiting the stochasticity of the spike-trains, multiple classifiers with stochastic learning can instead naturally operate in parallel.

5.5.4 Performance scaling with the number of aggregated predictors

The improvement obtained by aggregating multiple classifiers is similar to the improvement obtained by plain boosting, even though in boosting each new predictor is trained on an explicitly defined dataset. It is remarkable to find that a similar scaling in performance is obtained by introducing uncorrelated, stochastic learners. This is shown empirically in what follows. A synthetic dataset of N = 2, 10, 50 random binary patterns with a coding level of f = 1/2 is generated. Sets of patterns are generated by adding independent Gaussian noise to each component of the class prototypes. Three datasets with 10%, 20%, 50% noise levels are created, where the noise level is the standard deviation of the Gaussian distribution used for generating the noise. Several experiments are run with an increasing number of perceptrons to test the aggregation. A scaling similar to that of the boosting algorithm is observed (Fig. 5.7), where the error rate can be shown to decrease exponentially fast with the number of aggregated classifiers [Schapire and Freund, 2012]:

$$
\mathrm{err} \le \exp(-2\gamma^2 N), \tag{5.6}
$$

Here, γ quantifies by how much each classifier is better than chance. Notice that in the case of boosting, each classifier is trained on a particular dataset, namely the subset of patterns that have been previously misclassified, and the classifiers are then weighted according to their test performance [Schapire and Freund, 2012]. In the case of the feed-forward network, instead, the classifiers are independent from each other and all have the same weight in the final decision. For this reason, a direct comparison with boosting is not appropriate here. The synthetically generated classes, however, can be compared to the noisy representations generated by the spread in dimensionality operated by the large input layer of RCNs [Barak et al., 2013]. Sparse representations lead to loosely correlated patterns, corresponding to the case of noisy representations that require many aggregated perceptrons. Conversely, dense representations maintain strong intra-class correlations and thus, as shown here, require fewer aggregated classifiers for low error rates.
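A quick numerical reading of the bound of eq. 5.6, for a few illustrative values of γ and N, shows how fast the bound drops even for a modest per-classifier advantage over chance; the chosen values are arbitrary and only meant to illustrate the scaling.

```python
import numpy as np

# Error bound of eq. 5.6: err <= exp(-2 * gamma**2 * N).
for gamma in (0.05, 0.1, 0.2):
    for N in (10, 50, 200):
        bound = np.exp(-2.0 * gamma**2 * N)
        print(f"gamma={gamma:.2f}, N={N:4d}: bound={bound:.3f}")
```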


Figure 5.7: Classification rates using random, uncorrelated patterns with different levels of noise. Each class is generated from a prototype pattern by adding Gaussian noise to each component. Even in the extreme case of 50% noise on each component, the performance scales with the number of output units introduced, a result which is in accordance with aggregating-predictor techniques (see text).

5.6 Conclusions

Averaging several independent, imperfect models trained on some dataset is a provably efficient strategy to obtain successful classifiers on typical machine learning problems. A similar strategy can be adopted in the hardware realization of a feed-forward neural network consisting of randomly connected neurons for the hidden layer and pools of simple linear classifiers as the output layer, the latter trained with stochastic learning. However, additional parameters related to the specific hardware implementation of the learning rule for the binary synapses have to be considered for a successful realization of such a model. Here we used software simulations to characterize the system and proceeded in steps, from the purely abstract learning rule producing random classifiers, to a biologically plausible but non-spiking implementation of the stochastic learning rule, and finally to the spiking hardware implementation. The basic building blocks for the hardware realization of the proposed models have been realized, and future work can go in the direction of integrating these elements into a fully functional classification system. The results in this Chapter show that by applying machine learning techniques to imperfect classifiers obtained with the stochastic learning rule it is possible to obtain optimal classifiers. The simple transformation operated by the RCNs is effective in creating the representations needed for correct classification while having a simple mapping onto the neuromorphic, spike-based, imprecise hardware. Random projections can be realized with probabilistic synapses simulated

with the digital communication instead of using single, calibrated analog parameters for each connection. Instead of optimizing the single, weak hardware classifier, the possibility of combining several weak classifiers with stochastic, binary synapses to obtain a single optimal classifier has been assessed in simulations by testing the system on synthetic data with a varying number of independent classifiers. The next Chapter will present results of the application of the network to an isolated handwritten digits dataset, using software simulations that include the inhomogeneities of the analog VLSI implementation and that take advantage of the combination of several weak classifiers.

CHAPTER 6

Classification of hand-written digits with imprecise electronics

Anyone who attempts to generate random numbers by deterministic means is, of course, living in a state of sin.

John von Neumann

A single hidden layer of randomly connected neurons applies a simple but effective transformation to the input features. An ensemble of readout neurons trained on these representations can be used to perform robust classification even in the presence of strong system noise, e.g., circuit mismatch in a VLSI implementation. In this Chapter the performance of the simulated network on the classification of the MNIST dataset, a dataset of hand-written digits widely adopted in machine learning for testing pattern recognition algorithms, is presented. The resulting performance is compared to the state-of-the-art and the efficiency of the system is discussed. Mismatch is then added as quenched noise on the probabilities of long-term modifications and on the synaptic weights. With the addition of this noise, the performance of the system is not significantly affected up to large levels of noise. This behavior is unexpected, since one would expect a smooth, increasing deterioration of performance at all levels. This result reflects the robustness of the network against the hardware’s system noise, i.e., the intrinsic variability of the components due to the fabrication process. This observation is in line with recent machine learning techniques that introduce decorrelation in collections of classifiers to improve the performance of the overall classifier. Similarities with standard techniques are discussed.

6.1 Introduction

The performance of machine learning algorithms for pattern recognition is typically tested on benchmark datasets. A dataset typically consists of a set of patterns used during the training phase and a separate set of patterns used during the testing phase. The classification task

consists of using the patterns of the training set and the corresponding class labels to define the parameters of a model with which to assign class labels to patterns of the test set. Datasets differ in the number of input classes and the total number of patterns. Unlabeled data is also provided in some cases. Machine learning algorithms are typically tested on a variety of datasets and every method has its advantages and disadvantages. There is a vast literature on static pattern recognition tasks using machine learning approaches. Due to the increasing power of computers and the development of powerful machine learning methods, the typical performance on known datasets is very high, with a ratio of correct hits typically higher than 95%. The MNIST dataset [LeCun et al., 1998] is one of the most widely known datasets in the machine learning community. It consists of a large set of labeled images of segmented hand-written digits. It is widely recognized as a benchmark for testing machine learning algorithms. A large database of tested algorithms and neural networks is constantly updated on Yann LeCun’s website [mnist]. Performance on the test set can be improved by enhancing the training data with transformed images [Cireşan et al., 2010], by adding knowledge about spatial transformations using convolutional neural networks [LeCun et al., 1998], or by using pre-training to extract useful features from the training images [Hinton and Salakhutdinov, 2006]. Without using any of these tricks, the best published result for a standard feed-forward neural network is 1.6% errors on the test set. The feed-forward network proposed in this thesis is functionally equivalent to a combination of SVMs and aggregating techniques. An SVM is typically implemented by introducing the “kernel” trick, a mathematical tool to compute patterns in the output domain without the need to explicitly define the mapping through the hidden layer [Aizerman et al., 1964]. This mathematical artifice is computationally heavy to implement and typically requires a good choice of the type of kernel used. However, state-of-the-art performance can be obtained on a large number of machine learning benchmarks. Boosting algorithms, instead, exploit the use of aggregated weak classifiers, trained on particular subsets of the training data and then weighted according to their test errors. Such a combination typically leads to very good performance with few computational resources and shows that any weak learner can be transformed into a strong learner with boosting. However, it is not trivial to implement boosting in an online fashion because it requires the storage of at least part of the dataset at all times [Schapire and Freund, 2012]. Boosting and aggregating techniques in general can be applied to any algorithm, but the best improvements are obtained when weak classifiers are used as single classifiers. This has a theoretical reason (more information is added by aggregation if each classifier is sub-optimal) but also a practical one, since training good classifiers at each time step would require long computing times, whereas the aggregation leads to improvements as long as the classifiers have better-than-chance predictions. Probably for these reasons, it is hard to find any attempt to boost SVMs.
In the few cases found, bagging or boosting independently trained SVMs always helps, but unfortunately no information about the computing time and power is reported [Kim et al., 2002, Dong and Han, 2005].

6.2 Aggregating the linear classifiers

The performance of the network mainly depends on the dimension of the randomly connected layer and on the number of output classifiers used [Rigotti et al., 2010b]. Because the main limiting factor of the neuromorphic hardware is typically the number of synapses and the input bandwidth, we fix the number of randomly connected neurons to match the number of excitable synapses, in our case 3472.


Figure 6.1: Scaling of the classification performance with the number of independent output units per class.

Since the MNIST dataset consists of images of size 28 × 28 = 784 pixels, the randomly connected neurons apply an expansion of dimensionality of roughly 4 times the input dimension. The output classifiers are trained using the binary representations obtained from the hidden layer of randomly connected neurons. The classifiers are trained in parallel and independently from each other. In particular, the random variables used for the stochastic weight updates are independent and identically distributed. Notice that the assumption of i.i.d. variables can be maintained in the case of a spike-based hardware system if one exploits the variability of the spike-trains in a weakly-coupled network of I&F neurons [Chicca and Fusi, 2001]. If the transition probabilities are small, i.e., if the slow-learning condition is met, the final state of the synapses will be more correlated with the prototype of each class than with any other pattern of the class [Brunel et al., 1998]. Hence, each classifier will respond to a different “internal representation” of the input class stored as synaptic weights. By combining the outputs of several perceptrons per class and applying a majority voting scheme, it is possible to improve the performance of the overall classifier.
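One plausible way to turn the per-class pools of binary classifiers into a single digit prediction is to take the class whose pool casts the largest fraction of positive votes. The sketch below is an assumption about the readout, not necessarily the exact decision rule used in the experiments; the function name and the toy data are illustrative.

```python
import numpy as np

def predict_digit(votes_per_class):
    """Readout over per-class pools of binary classifiers: the predicted
    label is the class whose pool casts the largest fraction of votes.

    votes_per_class : array of shape (n_classes, n_units_per_class) with
                      the binary responses of the output units to one pattern
    """
    return int(np.argmax(votes_per_class.mean(axis=1)))

# Toy example: 10 digit classes, 20 output units per class, with the pool
# assigned to digit "3" responding more strongly than the others.
rng = np.random.default_rng(7)
votes = (rng.random((10, 20)) < 0.3).astype(int)
votes[3] = (rng.random(20) < 0.8).astype(int)
print("predicted digit:", predict_digit(votes))
```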

6.3 Effects of circuit mismatch

The VLSI hardware is affected by intrinsic variability due to the fabrication process. The effects of this intrinsic variability, or mismatch, can be modeled in terms of local changes of the transition probabilities and of the synaptic weights. Since mismatch does not vary with time, the amount of this change can be assigned once and maintained throughout the learning and testing phases.


Figure 6.2: The performance of the network is not affected by simulated mismatch. Each data point corresponds to a single simulation. The dashed line represents the chance level.

In order to keep the model simple and yet capture the main effects of hardware mismatch on the learning dynamics, we introduce two noise sources consisting of random variables with Gaussian distributions. The first one is centered around 1 and has a multiplicative effect on the probability of long-term modification of each synapse, computed at each pattern presentation. This noise models the mismatch of the transistors regulating the refresh current of the VLSI synapses and the weight changes (see Ch. 2). The second source of noise has the effect of modifying the weight of each synapse, to model the mismatch of the transistors generating the Excitatory Post Synaptic Potentials (EPSPs) for each synapse. It has both a multiplicative and an additive effect, to represent the mismatch on each of the two levels of the bi-stable synapses. The levels of mismatch are controlled by choosing the variance of the Gaussian distributions from which the mismatch factors are drawn. Since both the probabilities and the synaptic state levels are normalized between 0 and 1, we can assume that a variance of, for example, 0.2 corresponds to a hardware mismatch level of 20%. The following expressions summarize the above definitions:

$$
p^{\pm} \rightarrow \eta_p\, p^{\pm} \tag{6.1}
$$
$$
J_{ij} \rightarrow \eta_J^{m}\, J_{ij} + \eta_J^{a}, \tag{6.2}
$$

where $\eta_p$ and $\eta_J^{m,a}$ are the different noise sources introduced above, $p^{\pm}$ is the probability of synaptic transitions (+ for LTP, − for LTD) and $J_{ij}$ are the synaptic states (potentiated or depressed).
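The quenched mismatch model of eqs. 6.1 and 6.2 can be sketched in numpy as follows; the helper names and the 20% default level are illustrative, and the per-synapse factors are drawn once and then kept fixed, as described above.

```python
import numpy as np

rng = np.random.default_rng(8)

def quenched_mismatch(n_synapses, level=0.2):
    """Draw per-synapse mismatch factors once ("quenched" noise) and keep
    them fixed through training and testing, following eqs. 6.1 and 6.2."""
    eta_p = rng.normal(1.0, level, n_synapses)       # multiplicative, on p+/-
    eta_J_mult = rng.normal(1.0, level, n_synapses)  # multiplicative, on J
    eta_J_add = rng.normal(0.0, level, n_synapses)   # additive, on J
    return eta_p, eta_J_mult, eta_J_add

def apply_mismatch(p, J, eta_p, eta_J_mult, eta_J_add):
    """Effective transition probability and weight seen by each synapse."""
    p_eff = np.clip(eta_p * p, 0.0, 1.0)
    J_eff = eta_J_mult * J + eta_J_add
    return p_eff, J_eff

# Toy usage: nominal probability 0.01 and potentiated weights.
eta_p, eta_m, eta_a = quenched_mismatch(1000, level=0.2)
p_eff, J_eff = apply_mismatch(0.01, np.ones(1000), eta_p, eta_m, eta_a)
print(float(p_eff.mean()), float(J_eff.std()))
```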


6.4 Improved predictions by corrupting the decision thresholds

Once the single classifiers have been trained, the performance of the aggregated classifier can be further improved by optimally choosing the decision boundary. The decision boundary is chosen offline by analysing the output of the classifiers on the training set. This optimization step can be considered as part of the training and is performed offline. However, in a real application, the output of the classifiers has to be read out by some other system, for example to activate a specific motor output. Here we address the question of whether it is important to optimize the decision threshold and whether the readout system has to be reliable. Conversely, one could ask whether it is important that all the classifiers converge to pre-defined mean firing rates, or whether the output histograms for the Target and Null classes can be slightly shifted. For example, such shifts could result from variations in the total synaptic current generated by the independently trained weight configurations. One can observe the effect of different decision boundaries by artificially introducing noise on the decision thresholds. Results from simulations are shown in Fig. 6.3. In this experiment, after having trained the linear classifiers, the decision boundaries are optimized with a Receiver Operating Characteristic (ROC) curve analysis using the training set and the corresponding output histograms [Hanley and McNeil, 1982]. On top of the values obtained with the standard procedure, random Gaussian noise is added to the threshold of each classifier. The process is repeated multiple times and each time a new predictor, resulting from the combination of the “noisy” predictors, is collected. Ensembles of 5, 10, 20 such predictors are then created. The performance of the aggregated classifier on the test set is calculated for different levels of noise added to the thresholds. The noise level corresponds to the variance of the Gaussian distribution, with µ = 0, used to generate the noise. An improvement in performance is observed for high levels of noise, up to values that are comparable to the separation of the output histograms (∼ 40 Hz).
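The threshold-perturbation experiment can be sketched as follows. The midpoint threshold used here is a simple stand-in for the ROC-based optimization described above, and all names and values are illustrative assumptions rather than the actual analysis code.

```python
import numpy as np

rng = np.random.default_rng(9)

def threshold_from_training(rates_target, rates_null):
    """Pick a per-classifier decision threshold from the training-set output
    histograms; the midpoint of the two means is a simple stand-in for the
    ROC-based optimization described in the text."""
    return 0.5 * (np.mean(rates_target) + np.mean(rates_null))

def noisy_ensemble_prediction(rates, thresholds, noise_std, n_repeats=10):
    """Perturb every classifier's threshold with Gaussian noise, combine the
    resulting "noisy" classifiers by majority vote, and repeat to build an
    ensemble of such aggregated predictors (final decision again by majority)."""
    aggregated = []
    for _ in range(n_repeats):
        noisy_th = thresholds + rng.normal(0.0, noise_std, thresholds.size)
        votes = (rates > noisy_th).astype(int)      # one vote per classifier
        aggregated.append(int(votes.mean() > 0.5))  # majority over classifiers
    return int(np.mean(aggregated) > 0.5)           # majority over predictors

# Toy usage: 20 classifiers responding to one Target pattern.
rates = rng.normal(60.0, 10.0, size=20)
thresholds = np.full(20, 40.0)
print(noisy_ensemble_prediction(rates, thresholds, noise_std=20.0))
```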

6.5 Discussion

In this Chapter, a feed-forward network has been applied to the MNIST database of isolated handwritten digits, obtaining a simulated performance of ∼ 4% error rate without mismatch. This number has to be compared to the state-of-the-art classifier, which uses a deep Convolutional Neural Network (CNN) with dropout and obtains a 1.6% error rate. There are several possible reasons for this discrepancy. However, we point out the simplicity of the proposed approach compared to the standard approach of, for example, Hinton et al. 2012, which requires not only the definition of the CNN or equivalent base functions for the specific dataset (e.g., sounds), but also the tuning of several additional parameters such as the renormalization factor for the weight vector and the learning-rate decay factor. Additionally, we were able to switch from an image classification task to an auditory discrimination task by simply switching to a different dataset while keeping all network parameters the same. The data consisted of spiking data generated by a Silicon cochlea, represented as “cochleograms” in Fig. 6.4, listening to spoken digits [Liu et al., 2010]. Such a data switch typically requires a considerable amount of parameter tuning to match specific properties of the dataset using traditional approaches, e.g., the use of specific kernels for the auditory data or Spectro-Temporal Receptive Fields (STRFs) [Mesgarani et al., 2006, Sivaram et al., 2010, Povey et al., 2011].


Figure 6.3: Performance of a sub-optimal simulation as a function of the random noise level on the single read-out thresholds. The classification improves for several values of the noise level as an effect of the generation of a random, independent set of classifiers from the ones obtained from learning.

6.5.1 Competition emerging from lateral inhibition

The best performance of the feed-forward network is obtained when each classifier is trained using the output of the preceding one as an additional bias. The reason behind this result lies in the additional de-coupling that one obtains with the biasing configuration. When one classifier has correctly learned the task, i.e., has stopped learning, the others will be influenced by its output and reach the stop-learning condition before learning the task themselves. However, these influenced classifiers will still activate learning every time an ambiguous pattern is presented. In other words, they will “focus” their learning on ambiguous patterns. The process can be repeated in cascade or operated in parallel through lateral inhibition. The case of sequential learning is similar to boosting, where new classifiers are trained on the misclassified patterns of the preceding ones and then weighted accordingly [Schapire and Freund, 2012]. However, the cascade process compromises online learning. Boosting can instead be realized by introducing lateral inhibition as a form of competition [Xie et al., 2002]. In this case, there will also be an interaction between the teacher signal of the supervised learning and the local connectivity. Similar behaviour is reported in the Appendix “Sharpening tuning curves”, in which we used the lateral connectivity hard-wired in the neuromorphic chip to drive the post-synaptic activity, eventually influencing long-term plasticity. See Chapter 5 for more general considerations on stochastic learning related to boosting.
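A toy sketch of the sequential biasing idea is given below: each perceptron receives the summed output of the already-trained classifiers as a bias, so its updates concentrate on patterns the earlier classifiers leave ambiguous (the function name, the margin-based stop-learning criterion and the learning rate are illustrative assumptions, not the hardware learning rule):

import numpy as np

rng = np.random.default_rng(2)

def train_sequential_pool(X, y, n_classifiers=5, epochs=10, lr=0.1, margin=1.0):
    """X: (n_samples, n_features) input patterns; y: (n_samples,) 0/1 labels."""
    n, d = X.shape
    W = np.zeros((n_classifiers, d))
    for k in range(n_classifiers):
        # bias from the already-trained classifiers (zero for the first one)
        bias = X @ W[:k].sum(axis=0) if k > 0 else np.zeros(n)
        for _ in range(epochs):
            for i in rng.permutation(n):
                t = 1.0 if y[i] else -1.0
                s = X[i] @ W[k] + bias[i]
                if t * s < margin:          # update only while the margin is not reached
                    W[k] += lr * t * X[i]   # perceptron-like update ("stop-learning" otherwise)
    return W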


Figure 6.4: Examples of cochleograms from a Silicon cochlea and their predicted labels using the simulated neural network. Each cochleogram represents the average spike-rate of a certain channel of the cochlea (y-axis) at a certain time (x-axis). The cochleograms, treated as 60×54 pixel images, are converted into spike trains in the same way as the MNIST images and fed to the classification network. Left column: Two cochleograms corresponding to the spoken digit “9” are classified as “7” and “9”. Right column: Two cochleograms corresponding to digit “7” are classified as “7” and “2”. The time length of each digit is rescaled to fit each segmented data sample.

6.5.2 Introducing readout noise for improving readouts

As one would expect, the performance of the aggregated classifier benefits from the noise added to the decision boundaries. This effect is due to the fact that each new noisy predictor is created independently from the others, as if it were the result of a completely independent learning process run on a different sampling of the dataset (as in bagging, where random bootstraps are drawn from the training dataset). Explicitly introducing noise from external sources is one of the common methods for generating aggregated predictors [Dietterich, 2000]. Here, the process doesn’t require any training, but obviously one doesn’t know a priori what a good value for the noise level is. A safe choice is a value lower than the distance between the peaks of the output histograms for the Target and Null classes. The distributions of the noisy decision boundaries have to be centered around the ideal decision boundaries. Instead of using the ROC curve analysis, which is impractical for online learning, one could safely assume the optimal decision boundaries to be defined roughly by the stop-learning decision boundaries [Senn and Fusi, 2005]. Noise can then be added to those values. In conclusion, the ensemble of decision thresholds should “probe”, in an uncorrelated manner, the prediction distributions for the Target and Null classes, but the noise cannot be too large, otherwise any information about the real decision boundaries is lost.

6.6 Conclusions

The feed-forward network proposed in the previous chapters is a practical solution for the hardware implementation of classification systems. The random connections of the pre-processing hidden layer, the learning synapses and the possibility to aggregate several imperfect classifiers result in a robust system that in fact exploits the properties of the Silicon substrate. This method overcomes several technical limitations of typical approaches aiming at the application of neuromorphic systems to real-world tasks, which either rely on precise components or require pre-established knowledge of the task to solve, e.g., image recognition. The results presented in this Chapter suggest instead that a single hidden layer of RCNs and a pool of imprecise classifiers is a valuable, effective and elegant solution for the application of neuromorphic systems to typical machine learning tasks.

CHAPTER 7

Discussion

The darkest places in hell are reserved for those who maintain their neutrality in times of moral crisis.

Dante Alighieri

This thesis describes a neuromorphic system for online learning with spike-based synapse circuits distributed on the physical neural network. The system is based on a model that is robust to the high intrinsic variability of the electronic substrate and thus is not susceptible to large circuit mismatch. This work suggests possible directions for the development of future emerging technologies on unreliable substrates. Here I discuss particular aspects of my work related to the broad field of neuromorphic cognitive systems and compare them to the state-of-the-art.

7.0.1 Classification systems in neuromorphic hardware

Hardware realizations of neural networks for pattern recognition introduce specific requirements in addition to the need for good generalization performance. For example, for robotic applications, biological time-constants are needed to process the data from the sensors in real-time [Chicca et al., 2013]. Alternatively, fast but time-multiplexed architectures can be adopted, but this option requires large memories, e.g., Static Random Access Memories (SRAMs), and compromises the compactness of the system [Minkovich et al., 2012, Cassidy et al., 2013]. Other neuromorphic solutions are proposed as platforms for generic simulations of large neural networks [Schemmel et al., 2008, Painkras et al., 2013] and thus are not particularly concerned with low power consumption and compactness. Parallel processing technologies, e.g., GPUs, can be used in conjunction with dedicated implementations of machine learning algorithms to obtain fast computing systems [Hinton et al., 2012, Brette and Goodman, 2012], but the energy requirements are typically prohibitive for embedded systems. A comparison between the different hardware solutions depends also on the specific task to realize. Most often, the comparison is based on a generic simulation of “a” neural network and considers the speed of simulation (overall number of events produced or processed per second), the power consumption per spike, and the size of the network (number of neurons, number of synapses). However, it is important to consider the requirements of the full system in view of the final application. For example, Misra and Saha 2010 compare several hardware systems developed in the last decade but explicitly mention that for practical purposes a fully operational system demands many more components than the hardware neural network alone, and thus disregard this aspect in their analysis. For practical purposes, a fully-functional system can be analyzed under the following aspects:

• generalization: the ability of the classifier to maintain, after training, a consistent response when tested on unseen, noisy representations of the input classes.

• computational requirements: the number of operations needed for each presentation and parameter update during learning and testing.

• memory requirements: the amount of memory required to store the learned parameters, to use the input data for updating them, and to obtain label predictions from the network during the test phase (e.g., to compute all the required scalar products).

• pre-processing: the amount of computation, parameters and prior knowledge used to obtain representations of the input data that could, for example, simplify the final stage of classification, stabilize learning, improve performance, etc.

• speed: how long it takes to process one input pattern, to learn by updating the internal model, or to just produce a prediction.

• control: whether the system can run without supervision, i.e., without separated training and testing phases.

Depending on the goal and the type of application targeted, the aforementioned parameters are differently weighted. For example, mobile applications have limited computational power and energy resources, as opposed to home appliances.

7.1 What is the best pre-processing?

The comparison between different machine learning algorithms for classification is of great importance to get a sense of the capabilities of a certain algorithm. The typical approach adopted in machine learning has been to choose a well-defined problem, generate the corresponding set of data, run the different algorithms on the same data and obtain a number related to the capacity of the algorithm to solve the classification, i.e. obtain the generalization performance. The MNIST dataset [LeCun et al., 1998], for example, has become a benchmark in this type of study and its properties have been well established. The benchmark methodology is, however, fundamentally insufficient for comparing different learning systems that are not necessarily in the form of machine learning algorithms, or that are not restricted to any particular task or dataset. It cannot capture, for example, the amount of prior knowledge needed to configure a certain algorithm to solve particular tasks, a process which in general might be easier for some algorithms (e.g. changing a few parameters) than for others (e.g. re-defining the dimensionality of the symbolic representations). We will refer to a work where different algorithms used for “pre-training” of single layers of deep architectures are compared in a systematic manner [Coates et al., 2010]. The results show empirically that a large number of input nodes and dense feature extraction are critical components of the feature extraction step to reach high performance, more so than the number of hidden layers of the network, and thus its representational power, or the learning algorithm itself [Coates et al., 2010]. In this sense, simpler methods such as K-means clustering can be preferred to more complex ones, which in general require more parameter tuning and are computationally more demanding. The feed-forward network that we propose is in line with the results described above. The randomly connected neurons in fact give the best representation for learning internal representations, i.e. the one that allows the most task flexibility [Rigotti et al., 2010b]. Random connections don’t need to be learned and can be considered the easiest way to perform feature extraction without any prior knowledge of the system; we can reasonably consider this the minimal pre-processing. The linear classifiers used in the output layer are similar to those typically used as output layers of more complex networks.
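A minimal sketch of this “minimal pre-processing” stage is shown below: a fixed random projection followed by a threshold non-linearity, whose binary outputs can then be fed to any linear classifier (the weight distribution, the threshold value and the function name are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(3)

def random_hidden_layer(X, n_rcn=1000, theta=0.5):
    """X: (n_samples, n_inputs) normalized input patterns.
    Returns binary responses of n_rcn randomly connected neurons (RCNs)."""
    n_inputs = X.shape[1]
    W = rng.choice([-1.0, 1.0], size=(n_inputs, n_rcn))       # fixed random weights, never learned
    return (X @ W / np.sqrt(n_inputs) > theta).astype(float)   # threshold non-linearity

The resulting features play the role of the hidden-layer activity that the output perceptrons learn to read out.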

7.1.1 Towards low-power cognitive hardware

The high goal of building autonomous, behaving robots doesn’t necessarily require a full understanding of the pattern recognition problem. Indeed, there is an accumulation of experimental evidence around the idea that certain brain regions, e.g. the Pre-Frontal Cortex (PFC), are involved in several higher-level cognitive tasks requiring not only recognition capabilities but also context-dependent behavior and memory formation, to cite a few. More specifically, the PFC is involved in “top-down” processing, i.e. in tasks where actions are instructed by mental representations, rules, thoughts or decisions, or when the mapping between inputs and actions is not established or is rapidly changing [Miller and Cohen, 2001]. Models of the PFC have been proposed in the past mostly using ANNs as a basic component, reproducing persistent activity in the brain which can be related to the mental representations of the relevant quantities for the particular cognitive task [Amit and Mongillo, 2003, Rigotti et al., 2010b]. Several studies have shown how to construct ANNs in VLSI hardware. The main components of such systems are analog integrate-and-fire neurons with plastic excitatory synapses and inhibitory synapses. In these experiments, the patterns have been imposed as driving forces from an external source, whereas online learning requires establishing the robustness of the network to the variability of the internal components as opposed to well-defined, calibrated input stimuli. The feed-forward neural network presented in this thesis could serve as the basis for robust autonomous learning of the attractors, without the need for an explicit external source, in general tasks where the mapping between the stimulus and the consequent action to be taken has to be established [Rigotti et al., 2010b]. Understanding the fundamental properties of such structures at the basis of intelligence is key to defining future directions for the construction of behaving agents. For example, since all the computation that the randomly connected neurons and their synapses are required to carry out is input integration, and no plasticity is needed, very compact but slightly more power-hungry implementations might be preferred to larger and more efficient ones. This would increase the number of states and transitions that the network would be able to learn. On the other hand, many learning synapses per neuron are required, such that analog implementations using crossbars and event broadcasts might scale better to large networks compared to digital, memory-limited solutions.

7.2 Computational costs

One of the goals of machine learning is to realize methods that can be applied to highly complex, unpredictable situations, with minimal external intervention and minimal prior knowledge. The feed-forward network we propose gets closer to those prescriptions, compared to other more commonly adopted methods, while also minimizing the requirements for a hardware implementation. We analyze the above aspects of the proposed network and compare them to SVMs and deep neural networks. A characteristic that has received much interest in recent years is the depth of the network. In a multi-layer neural network, for example, the depth is the number of layers of the network. A fixed SVM is considered to have a depth of 2. Finding the optimal depth of a network, if any, for a specific task is both a theoretical issue and an engineering one, because by fixing the depth one can optimize the hardware design of the corresponding network. Shallow architectures typically involve more neurons per layer, thus requiring hardware implementations that are optimized for parallel computation. Deep architectures, on the other hand, typically involve fewer units per layer and fewer units in total, thanks to their representational power [Bengio and Delalleau, 2011, Bengio and LeCun, 2007]. Interestingly, it has been shown that an SVM with a kernel that corresponds to a randomly connected feed-forward network leads to competitive performance in several benchmarks [Cho and Saul, 2010]. Hence, though deep networks require a smaller number of neurons than shallow architectures to represent any given function, it seems that for some tasks a much simpler function might be sufficient. For example, flexible behaviour, the ability to identify relevant cues to learn a rule-based decision task, is enabled by the introduction of a layer of randomly connected neurons [Rigotti et al., 2010b] which solves the non-linear separability problem. Deep networks can be efficiently trained to obtain complex neural representations and useful feature extraction [Hinton, 2007]. However, the implementation of such networks is often very demanding in terms of computational power and memory, such that dedicated hardware has to be used in order to parallelize the operations and get significant speed-ups. The GPU used to simulate the neural network currently holding the record performance on a famous benchmark for hand-written digit classification [Hinton et al., 2012] is declared to have a maximum card power of 244 W and requires to be installed on a system with a minimum power requirement of 600 W. These numbers are prohibitive for any realistic robotic application, e.g. for the autonomous exploration of unknown environments for prolonged times. In addition, GPU-based solutions require writing dedicated programs in order to fully exploit the parallel computation provided by the GPU, and the mapping of the sequential programs of existing algorithms onto equivalent parallel implementations is not necessarily optimal. Compared to deep networks, the price to pay for the flexibility of our network is in the number of neurons of the randomly connected layer. This number depends linearly on the complexity of the task to be solved [Rigotti et al., 2010b, Barak et al., 2013].
For pattern recognition tasks, where the number of classes is typically of the order of 10 and never larger than 1000, we can expect the number of randomly connected neurons needed to be small enough to be embedded on single neuromorphic chips without large area requirements per neuron [Indiveri et al., 2011].


7.2.1 Support-vector-machines

Support Vector Machines (SVMs) [Vapnik, 1995] are a class of learning models that has been widely used in several machine-learning problems, from isolated handwritten digit recognition [Cortes and Vapnik, 1995, Schölkopf and Burges, 1999] to face detection in images [Osuna et al., 1997], time series prediction tests [Müller et al., 1997] and others. Non-linear SVMs can classify non-linearly separable data through what is known as the “kernel trick” [Aizerman et al., 1964, Cortes and Vapnik, 1995], a mathematical artifice that maps a general set of data from its space into a high-dimensional space (the feature space) where the data has a simpler structure, thus allowing classification. In this case the mapping reduces to computing a function of scalar products (a kernel) acting on the input space [Vapnik, 1998], so that an explicit solution for the map is not needed. Thus, in machine learning problems the task typically consists in finding the kernel function that results in the best generalization performance. Typical kernels include Radial Basis Functions (RBFs) and polynomial kernels, and are used to perform Principal Component Analysis (PCA) [Hoffmann, 2007], Fisher’s Linear Discriminant Analysis (LDA) [Mika et al., 1999] and other methods. Though very effective on difficult problems, e.g. classification with small training sets, SVMs typically require a large number of computational elements and a large memory [Bengio and LeCun, 2007] and cannot realistically be used on larger datasets due to the number of scalar products to compute with the kernel method. In the feed-forward neural network, instead, the amount of computation doesn’t change with experience because the memory, in the form of synaptic states, keeps track of past experience, so that basically only two scalar products are required per pattern, independent of the point in time at which the pattern is presented.
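The different scaling of the prediction cost can be made explicit with a small sketch: a kernel SVM readout needs one kernel evaluation per stored support vector, while the random-layer network needs a fixed number of scalar products regardless of how many patterns have been seen (function names and signatures below are illustrative assumptions):

import numpy as np

def svm_readout(x, support_vectors, alphas, kernel):
    """Kernel SVM prediction: cost grows with the number of support vectors."""
    return sum(a * kernel(sv, x) for a, sv in zip(alphas, support_vectors))

def rcn_readout(x, W_random, w_out, theta=0.5):
    """Random-layer network prediction: one fixed random projection plus one
    learned scalar product, independent of the amount of past experience."""
    h = (W_random @ x > theta).astype(float)
    return float(w_out @ h)

# Example kernel for the SVM case (an RBF), only to make the sketch runnable.
rbf = lambda u, v, gamma=0.1: np.exp(-gamma * np.sum((u - v) ** 2))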

7.3 Learning and representational power

The primary advantage of deep networks is that they can compactly represent a significantly larger set of functions than shallow networks. Formally, one can show that there are functions which a k-layer network can represent compactly (with a number of hidden units that is polynomial in the number of inputs) but which a (k − 1)-layer network cannot represent unless it has an exponentially large number of hidden units [Bengio and Delalleau, 2011]. In this work, a shallow network consisting of only one hidden layer has been used. While the representational power of such a network is less than that of deep networks, it provided sufficient representational power to easily train perceptrons with a supervised learning algorithm. While important in theory, the representational power of deep networks comes at the price of very difficult learning, such that, for example, greedy layer-wise training has to be introduced. A more realistic scenario in the brain might be that of shallow architectures because of their learning capabilities. In [Rigotti et al., 2010b], randomly connected neurons are used to learn rule-based classification tasks reproducing flexible behaviour with an off-line learning prescription. In the Neural Engineering Framework (NEF), randomly connected neurons are used to map high-level variables into neural activities and vice versa [Eliasmith et al., 2012], which gives a formal description of these two steps (termed encoding and decoding). In this thesis, we used randomly connected neurons as a pre-processing stage for the static input patterns. Because of the convergence of the perceptron rule, as soon as a few units encode for the correct class to learn, the task can be solved by linear classifiers [Senn and Fusi, 2005].


7.4 Learning attractor networks

Recurrent neural networks have been used in the past to describe experimental observations in regions of cortex highly correlated with memory maintenance and formation for decision making, attentional selection, motor planning and other cognitive functions [Amit, 1992, Amit and Brunel, 1997, Deco and Rolls, 2005, Wang, 2002]. While a number of neuromorphic or neuro-computing systems have been used to demonstrate attractor networks in hardware performing computational tasks relevant to those cognitive functions [Giulioni et al., 2011, Seo et al., 2011, Choudhary et al., 2012, Neftci et al., 2010, Massoud and Horiuchi, 2011], a surprisingly limited number of them has addressed how such systems can self-construct and adapt to highly complex, continuously changing task-rule associations, as one expects from real-world, unknown scenarios. In those works, the main focus has been primarily to realize hardware systems able to store limited numbers of abstract memory representations and to maintain those representations through the configured or learned synaptic connectivity within the network. This thesis proposes to combine those advancements in the construction of neuromorphic electronic systems with methods from theoretical neuroscience to tackle the problem of how physical neural networks can learn task-relevant representations and use them to operate in real-world scenarios. We used a theoretical framework based on the idea that the behavioural task can be decomposed into a combination of classification tasks, where learning consists in finding the task-related associations between a neuron’s activity and the pre-synaptic pattern that the neuron receives at a certain point in time [M. Rigotti et al., 2013, Barak et al., 2013]. The pre-synaptic patterns are the result of a simple transformation operated by a large pool of RCNs, which produces the mixed selectivity required for flexible behavior, i.e., the ability to learn to solve new tasks and adapt to arbitrary task-rule changes. These abilities have already been shown in theory and simulations [Rigotti et al., 2010a,b] and we envision the realization of hardware systems with those abilities by using the know-how in neuromorphic engineering and theoretical neuroscience described above.

7.5 Spatio-temporal patterns

All the components of the feed-forward neural network used in this thesis are described in a rate-based framework. Spikes are only used in the hardware as a medium for the implementation of the stochastic plasticity, while non-spiking simulations are sufficient to describe the dynamics of the network. However, understanding spatio-temporal processing can be important for a more efficient encoding of variables using time representations [Izhikevich, 2006], e.g. for non-static stimuli, and eventually for spatio-temporal pattern recognition [Masquelier et al., 2009, Gütig and Sompolinsky, 2009]. In fact, a random network can also be used for spatio-temporal pattern recognition if one considers the random delays introduced by the neurons of the hidden layer, or some other sort of non-linearity due to dendritic computation [M. Rigotti et al., 2013]. A similar architecture has been considered in a model of thalamo-cortical projections to implement recognition of sounds that sweep in frequency [Sheik et al., 2011]. Analogously, Reichardt detectors introduce delay elements to detect moving patterns (CITE). With the randomly connected network, the heterogeneity of the network is elegantly exploited both in space (random weights) and in time (random integration time-constants) to discriminate spatio-temporal patterns.
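A minimal sketch of how random per-connection delays let a fixed random layer separate spatio-temporal patterns is given below (the discrete-time formulation, the delay range and the threshold non-linearity are assumptions chosen only for illustration):

import numpy as np

rng = np.random.default_rng(4)

n_in, n_hidden, n_steps = 16, 200, 100                # input channels, hidden units, time steps
W = rng.normal(size=(n_hidden, n_in))                 # fixed random weights
delays = rng.integers(0, 10, size=(n_hidden, n_in))   # fixed random delays (in time steps)

def hidden_activity(x):
    """x: (n_in, n_steps) input rates. Each hidden unit sums randomly delayed,
    randomly weighted inputs and applies a threshold non-linearity."""
    h = np.zeros((n_hidden, n_steps))
    for j in range(n_hidden):
        for i in range(n_in):
            d = delays[j, i]
            h[j, d:] += W[j, i] * x[i, :n_steps - d]
    return (h > 0).astype(float)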


7.6 Limitations of modern computers

Is the execution of an automated process a form of intelligence? This problem dates back to the question raised by Alan Turing in his famous work “Can machines think?” [Turing, 1950] and its analysis would go beyond the scope of this thesis. However, the present work is related to this question. By considering learning as the process by which a neuron can implement a desired function, and by studying one physical realization of its mechanisms, this work aims at constructing a solid basis for the realization of learning machines. As opposed to modern computers, learning machines must be able to autonomously define the program for the specific task they are required to solve. The philosopher Searle [Searle, 1980] describes the need for “causal powers” similar to those of the brain in relation to the possibility of building machines endowed with intentionality. Unfortunately, the intrinsic flexibility necessary for such intentionality would preclude us from being able to describe its internal processes, i.e., most probably it wouldn’t be possible to easily associate symbolic meanings to the internal representations in a direct manner. Recent studies on recordings in cortical regions associated with higher cognitive functions highlight the importance of such internal diversity as a key component enabling flexible behavior [M. Rigotti et al., 2013]. Hence, the ability to impose desired functions on single neurons without external intervention, i.e., without an explicit program, and with minimal prior knowledge of the task, seems to be the key component that would distinguish an intelligent machine from a modern computer or any other pre-designed brain-inspired machine. Neural networks can be simulated with modern computers. The largest neural network ever simulated has been distributed across 1000 machines (16,000 cores) and has processed 10 million 200×200 images over three days [Le et al., 2012]. The work showed the surprising ability of the system to obtain feature-selective neurons, e.g., face detectors, with unsupervised learning from unlabeled data. Thus, modern computers are not limited in computational ability when it comes to realizing systems with practical uses. However, the general-purpose nature of modern computer architectures results in inefficient implementations of artificial intelligence algorithms, because a large part of the low-level, internal operations are dedicated to managing the flow of data between the memory and the CPU [Backus, 1978] rather than to executing the relevant operations only. While this aspect seems to be a purely engineering one, it has been pointed out that it has an impact on the way intelligent machines are designed and conceived. It boils down to the presence of a von Neumann bottleneck. The idea of the von Neumann bottleneck was introduced by John Backus [Backus, 1978] and refers to the channel used for the communication between the Central Processing Unit (CPU) and the memory. This channel imposes several programming constraints, mainly related to the fact that a large part of the traffic in the bottleneck is not useful data but merely names of data, as well as operations and data used only to compute such names. Thus,

[... ] programming is basically planning and detailing the enormous traffic of words through the von Neumann bottleneck, and much of that traffic concerns not significant data itself but where to find it.

Obviously many operations on irrelevant data translate into many irrelevant transistor switches, hence reducing bottlenecks seems to be a necessary engineering philosophy to get the minimal power overhead and, as a by-product, the minimal design overhead. The work of this thesis is a tiny step in the direction of building machines where computation and memory are distributed and co-localized, and has adopted a multidisciplinary, “neuromorphic” approach to investigate the above issues.

7.7 Hardware synapses and learning

Several neuromorphic hardware implementations consider the use of digital components for different parts of the system. While digital communication has been broadly adopted, both analog and digital implementations of neurons and synapses have their advantages. However, several of the proposed digital implementations seem to share fundamental similarities with traditional von Neumann architectures, which introduces unnecessary computational overhead. The neuro-synaptic core presented in [Merolla et al., 2011] is extremely compact compared to analog hardware, where the slow dynamics of the synapses require area-consuming capacitors. In the SpiNNaker hardware, an SDRAM is used to locally store the synaptic weights, a choice that constitutes an information bottleneck. For example, weights can only be accessed at the time of the pre-synaptic spikes, which introduces a limitation for the plasticity algorithm. Analog synapses can be implemented compactly and have all the advantages of local storage of memory. Because of the linearity of the DPI circuit integrated in the synapse, a single circuit for synaptic dynamics can be responsible for producing all the post-synaptic currents generated along the dendritic tree. With this trick, digital weights can be adopted also in the analog implementation, resulting in more compact designs [Moradi and Indiveri, 2011] with low-power synapse dynamics. In addition, synaptic weights are always accessible to the plasticity mechanism. By storing one or a few bits as digital weights for the synapses, one has sufficient functionality for several applications in a compact design. However, a major limitation of these synapses is the absence of any mechanism for plasticity. In other words, a specific configuration has to be installed for each specific application, because the system itself is not equipped with any mechanism that would automatically produce such a configuration for the corresponding task. How to learn such a synaptic configuration offline is a problem that these systems don’t solve. Moreover, the need to deliver the synaptic configuration to every point of a large-scale system might impose serious scalability issues. In addition, learning is typically a process that strongly depends on the specific context. In this sense, non-learning solutions share with computers the conceptual limitation concerning the need for a program to be defined and installed on the machine. The program, which in the case of neural network hardware is represented by the topology of the network, cannot be defined without some prior knowledge of the problem to be solved. A solution for online, continuous learning, on the other hand, seems to be the only possibility for the construction of behaving agents that adapt their behaviour through experience and that make use of large-scale neuromorphic devices. There are two possibilities to reintroduce plasticity and gain the necessary functionality for a complete self-behaving system. One is to equip each synaptic cell with local plasticity mechanisms, thus sacrificing silicon real estate. A second possibility [Arthur et al., 2012] is to place the circuits for synaptic plasticity at the end of the dendritic line, in the proximity of the neuronal circuits. By using a single element responsible for the synaptic plasticity, one can shrink the dimensions of the synaptic core and aim at large-scale implementations.
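As an illustration of the linearity argument above, the following sketch superposes digitally weighted input spikes onto a single first-order linear filter, a software stand-in for one DPI circuit serving a whole dendritic tree (the time constant, the time step and the function name are assumptions made for illustration):

import numpy as np

def dendritic_current(spike_times, weights, tau=0.02, dt=1e-3, t_max=1.0):
    """spike_times: list of arrays of spike times (s), one per input synapse;
    weights: matching list of digital weights. Returns the summed post-synaptic
    current produced by one shared linear synaptic-dynamics circuit."""
    n_steps = int(t_max / dt)
    drive = np.zeros(n_steps)
    for st, w in zip(spike_times, weights):
        idx = (np.asarray(st) / dt).astype(int)
        np.add.at(drive, idx[idx < n_steps], w)   # weighted events superpose linearly
    i_syn = np.zeros(n_steps)
    for t in range(1, n_steps):
        # first-order low-pass dynamics shared by all inputs of the dendrite
        i_syn[t] = i_syn[t - 1] + dt * (-i_syn[t - 1] / tau) + drive[t]
    return i_syn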
Though compelling, the latter approach seems to borrow one crucial aspect of von Neumann architectures, which is the presence of a communication bottleneck [Backus, 1978]. Since synaptic plasticity is a non-linear process, i.e. synapses are independent from each other, it is impossible to concentrate the synaptic dynamics on a single computing element unless:

• time is split in frames where synaptic dynamics is computed for all the synapses serially and plasticity is applied to all synapses pre-synaptically and post-synaptically,

• information about the recent history of the synapse is delivered to the plasticity core and corresponding updates are delivered back to each synapse,

• events arriving at the input are buffered into a memory while the neuro-synaptic core is busy processing the previous data.

Similar solutions have been adopted for the construction of a digital neuro-synaptic core with 256 synapses and 256 neurons where two independent clocks are responsible for the IO (slow clock, ∼ 1 kHz) and the neuro-synaptic dynamics (fast clock, ∼ 1 MHz). In the SpiNNaker system, a large buffer, local to the digital cores, collects all the incoming spikes and is used to compare the pre-synaptic activity with the post-synaptic activity at run-time, thus compromising the real-time capabilities.

7.8 Biological plausibility

7.8.1 Evidence for randomly connected neurons

Many cells, especially in higher order brain areas, such as the PFC, show complex patterns of responses that are often difficult to associate with simple event or object representation rules [Asaad et al., 1998, Miller and Cohen, 2001, M.Rigotti et al., 2013]. The activities of these neurons can be more closely reproduced by modeling randomly connected neurons to implement “mixed selectivity”, i.e., the property of such neurons of representing complex mixtures of multiple aspects of the stimuli [Rigotti et al., 2010b]. The diversity of neural responses is related to the high dimensionality of the internal representations and can predict behavioural performances [M.Rigotti et al., 2013]. One advantage of using high dimensional representation is the possibility to employ simple classifiers as read-outs [Barak et al., 2013, Coates et al., 2010].

7.8.2 Sensory fusion and reward in the thalamus

A situation where several representations converge is the fusion of different sensory modalities. The thalamus [Jones et al., 1985] plays an important role as a relay of sensory experiences to the cortex and in cortico-cortical communication [Guillery, 1995, Sherman, 2007]. Interestingly, responses related to reward-event associations, i.e., retrospective and prospective coding, are independent of sensory modality and are modulated by the value and timing of the reward [Komura et al., 2001]. The non-linearity of randomly connected neurons can be used also in these cases as a normalizing mechanism which acts irrespective of the sensory modality. Furthermore, the importance of the role of feed-forward thalamocortical connectivity is supported by anatomical evidence [Kaas, 1999]. The case of the thalamus is, however, a far more complicated one, since the thalamic feedback from the cortex seems to explain much of the plasticity observed in the thalamus [Krupa et al., 1999], suggesting that the highly diverse connectivity that projects into such relay structures is highly dynamic. However, the scaling properties of randomly connected neurons in the intermediate layer are similar to those of carefully chosen or learned connections [Barak et al., 2013]. Lastly, the non-linearity of cortical responses due to interactions with reward signals or metabolic state could be studied at the level of the thalamus using the theoretical framework of randomly connected neurons.

7.9 The de-coupling role of stochasticity and mismatch

Multiple, independent output classifiers learn different representations of the target class due to the stochasticity introduced in the long-term transitions. By aggregating the responses of such classifiers one can obtain an aggregated, improved classifier along the same lines as the aggregation techniques described in machine learning. The stochastic plasticity is an implicit implementation of a bagging technique [Breiman, 1996]. In bagging, a pool of classifiers is trained on different subsets of the training data. The idea is to create diverse models of the prior distributions and then combine them to estimate the overall data distribution. The result of the training process is a pool of diverse, “weak” classifiers whose individual performance is likely to be weaker than that of a single classifier trained on the whole dataset. However, by means of a majority rule applied to the set of outputs, the overall performance increases significantly over the single classifier. A requirement for the aggregation to be successful is the instability of the learning, i.e. the fact that any i.i.d. choice of the subset of data yields i.i.d. classifiers. In the case of stochastic plasticity with stop-learning through the post-synaptic Calcium variable, the instability is due to the stochastic process of selecting a subset of synapses to store the input pattern. In fact, by aggregating such classifiers one doesn’t even need to solve the non-linear separability of the dataset, since multiple, independent linear classifiers can be sufficient to obtain the desired, non-linear partition of the input space [Senn and Fusi, 2005]. By adding mismatch in the transition probabilities of the learning rule and in the bound values of the synaptic weights, the level at which stop-learning is activated doesn’t depend only on the level of storage of the input pattern but also on the particular subspace learned. For example, because of the mismatch on the synaptic weights, certain regions of an image might have a stronger influence on the stop-learning condition of one of the classifiers. Thus, mismatch can be seen as a way of de-coupling classifiers. This idea is in line with recent advancements in machine learning techniques that explicitly introduce noise in the system by randomly silencing a certain percentage of neurons in the hidden layer at each pattern presentation [Hinton et al., 2012].
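A minimal sketch of the majority-rule aggregation described above is shown below, with the stochastic selection of synapses imitated by training each weak classifier on a random bootstrap of the data (the bootstrap stand-in, the perceptron learner and the function names are assumptions; in the hardware the diversity comes from the stochastic synaptic transitions and mismatch, not from explicit resampling):

import numpy as np

rng = np.random.default_rng(5)

def train_weak(X, y, epochs=10, lr=0.1):
    """A simple perceptron used as the 'weak' binary classifier."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(y)):
            t = 1.0 if y[i] else -1.0
            if t * (X[i] @ w) <= 0:
                w += lr * t * X[i]
    return w

def bagged_predict(X_train, y_train, X_test, n_classifiers=10):
    """Train each weak classifier on a bootstrap sample, then take a majority vote."""
    votes = []
    for _ in range(n_classifiers):
        idx = rng.integers(0, len(y_train), size=len(y_train))   # bootstrap resample
        w = train_weak(X_train[idx], y_train[idx])
        votes.append((X_test @ w) > 0)
    return np.mean(votes, axis=0) > 0.5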

CHAPTER 8

Conclusions and outlook

Small projects need more help than great.

Dante Alighieri

Neural network models are a powerful mathematical tool to realize machines that solve typical machine learning tasks while being very economical in terms of resources. In this thesis, I proposed to use a simple feed-forward architecture [Barak et al., 2013, Tapson and van Schaik, 2013] with a single hidden layer of randomly connected neurons serving as a data “pre-processing” stage, and an output layer of aggregated, weak classifiers [Breiman, 1996]. While it is known that projections into large-dimensional spaces are a powerful mechanism to solve non-linear tasks [Vapnik, 1995], here this choice also satisfies practical needs. This is because the random weights are fixed, so one can use very compact synapse circuits to implement the weighted sum and multiplex the AER events in time to minimize resources [Goldberg et al., 2001]. I have shown how such a network can be implemented in electronic hardware despite the high variability of the system due to device mismatch. The network is tested on a machine learning benchmark of isolated handwritten digits using a simulation that considers the hardware inhomogeneities. This work is the result of a bottom-up approach by which the properties of the physical substrate are explored in relation to the task. Thus, instead of trying to compensate for the mismatch of the VLSI in order to optimize the precision of the neural network simulation, the inhomogeneities of the hardware have been considered as having a functional role of their own, i.e., in de-coupling the responses of the weak classifiers. The implications of this work are both theoretical and practical, since the results suggest a way to realize powerful classifiers with imprecise electronics, a possibility that is supported by recent theories of machine learning. This hypothesis has been validated empirically in this thesis and can be further verified using statistical methods in the future. Many years have passed since the first publication on neuromorphic electronic systems [Mead, 1990], and substantial progress has been made by the small but vibrant neuromorphic engineering community. For example, we are now able to build efficient real-time sensory-motor systems by combining neuromorphic circuits and systems of the type described in this thesis with the new generations of event-based sensors. More importantly, significant progress has also been made in computational neuroscience, machine learning, and robotics. We are now at a point where it is foreseeable to reach the goal of building real-time, autonomous, cognitive behaving systems, but to do so it is important to combine interdisciplinary research in all these fields while pursuing the neuromorphic approach originally proposed in [Mead, 1990], and to exploit the progress made so far in both theory and hardware development, e.g., by adopting solutions such as those proposed in this thesis. The approach adopted in this thesis has made it possible to analyze important issues related to the realization of a hardware neural network for classification built using “neuromorphic” components. However, this goal goes beyond the mere proposal of “using” the neural elements, e.g., neurons and synapses, to “build” models of neural networks in hardware, and I believe the value of this aspect to be of great importance. We are witnessing a proliferation of neuromorphic systems that make use of neural components, i.e., neurons and synapses, to realize large-scale models on physical substrates. Circuits and models for neurons, synapses and plasticity are so widely developed that they are hard to improve. A reasonable number of such elements can nowadays be integrated in analog or digital devices. Devices can easily be combined to construct large networks thanks to the development of reliable software and hardware infrastructures and theoretical frameworks [Wijekoon and Dudek, 2012, Arthur et al., 2012, Imam et al., 2012, Cassidy et al., 2013, Pfeil et al., 2013, Painkras et al., 2013, Choudhary et al., 2012, Brink et al., 2013]. These advancements make it possible, for example, to easily set up a brand new working environment with real, working hardware to carry out experiments. How can brain-like computation be obtained from these hardware systems? I see two possible alternatives. The first is a direct mapping of the theoretical models by calibrating and configuring the system to operate in the desired regime. A different, bottom-up approach consists of exploiting the principles of computation that can be observed at a system level and that at the same time can describe cognitive behaviour. Such a system would be robust by construction because it is consistent with the nature of its components. This approach would aim at defining the “organizing principles” of the brain at a cognitive level. In this thesis we identified two such principles, one concerning how information is represented in a useful way with simple, robust components, and the second assuming online learning to be a fundamental component for endowing machines with autonomous behaviour. At this point in the history of neuromorphic engineering, the possibility to extract the biologically and computationally relevant high-level properties that emerge from a system of distinct components, such as hardware neurons and synapses, is remarkable and can inspire the next generation of neuromorphic engineers.

8.1 Outlook

Not only does this work have a practical, short-term impact, but also a longer-term one. One of the challenges of modern VLSI is to build electronic devices that are fault-tolerant and robust to the inherent noise of the substrate, which is the source of device mismatch and parameter variability. Both current technologies (e.g., CMOS) and future nano-scale technologies (e.g., memristors) are strongly affected by these properties. The variability, stochasticity, and in general the reliability issues that are starting to represent serious limiting factors for advanced computing technologies do not seem to affect biological computing systems. Indeed, the brain is a highly stochastic system that operates using noisy and unreliable nano-scale elements. Rather than attempting to minimize the effect of variability in nano-technologies, one alternative strategy, compatible with the neuromorphic approach, is to embrace variability and stochasticity and exploit these “features” to carry out robust, brain-inspired probabilistic computation. This can be done, for example, by adopting distributed computing strategies that accept and possibly exploit variability to improve their performance. Such strategies are now becoming relevant in the computational neuroscience and machine learning fields. With this thesis, I suggest using a theoretical framework that accepts variability as a property of the substrate instead of an annoying “bug” to compensate for. Using this approach, imprecise, unreliable hardware classifiers can be combined together, or “aggregated” [Breiman, 1996, Schapire and Freund, 2012], using a feed-forward random network [Rigotti et al., 2010b, Tapson and van Schaik, 2013] and distributed spike-based learning strategies [Fusi et al., 2000, Indiveri and Fusi, 2007], to obtain a reliable pattern recognition system. Further investigation in this direction is required to successfully map these architectures onto a fully functional neuromorphic hardware system and extend their capabilities to the case of rapidly changing spatio-temporal patterns of spikes. Work in this direction is already being carried out in my Institute, combining the tools developed throughout the realization of this thesis with novel techniques and newly fabricated hardware devices. In the past decade, we have witnessed the development of a variety of neuromorphic systems to a state where fully functional, large-scale hardware neural networks can be configured as state-dependent computational modules and run in real-time [Neftci et al., 2012a]. However, most of the proposed approaches are designed to solve very specific tasks, such as speech recognition in well-controlled, pre-determined environments. With this thesis I propose to investigate the possibility of realizing general-purpose neuro-computational hardware modules. These modules would exploit the properties of random neural networks and attractor neural networks to autonomously extract task-relevant information from the environment and adapt the state-dependent computation to the contextual needs. As shown, this idea is supported by experimental findings [Asaad et al., 1998, Miller and Cohen, 2001, M. Rigotti et al., 2013], theories of computational neuroscience [Amit, 1992, Rigotti et al., 2010b,a, Barak et al., 2013] and the success of several machine learning techniques such as SVMs [Cortes and Vapnik, 1995, Cho and Saul, 2010], bagging [Breiman, 1996] and dropout [Hinton et al., 2012]. Such techniques can be exploited to implement highly cognitive functions on compact, low-power devices. This approach suggests that a shift of attention from highly engineered, pre-constructed systems towards self-adaptive systems deeply embedded in the environment is at least possible in principle. Work in this direction has already been planned as a continuation of this thesis, with the goal of combining sensory processing and state-dependent computation in mobile devices.

8.2 Lessons learned

The neuromorphic engineering community is a vibrant community of passion-driven researchers coming from various fields of research. The different goals of the many active projects carried out by the community revolve around the common desire to unravel the principles of brain computation. How can we learn about brain computation using today’s technologies? Once we understand it, or parts of it, by what means can we replicate it? What details of the model are important, e.g. spikes? These types of questions traditionally animate the debate in neuromorphic engineering.


Answering most of these questions could allow us to realize efficient, neural-inspired machines; however, the future challenges in this field, i.e., the construction of large-scale, fully functional cognitive neuromorphic systems, need a fundamental shift of attention towards a system-level approach. In fact, while biologically inspired practical solutions already exist to address specific engineering issues, such as optimizing and distributing communication to obtain relatively low-power systems, there is still a long way to go to get a full understanding of the strategies for computation at a higher level of abstraction, which is needed to develop feasible large-scale models of behaving systems. By carrying out this work, I learned to recognize the value of neuromorphic engineering as a tool to study the principles underlying brain computation with a system-level approach. While large-scale simulators of pre-constructed models are valuable tools to understand information and neural processing, compact, low-power devices that can run in real-time with online, distributed learning are valuable and scalable tools to investigate the basic principles of neurally-driven behaviour, while making it possible to embed these devices in real-world tasks. Obviously the road that leads to the realization of intelligent, autonomous machines endowed with animal-like behavior is long and touches all the aspects of cognitive science, neuroscience and engineering. Neuromorphic engineering is a unique multidisciplinary field where the goal of building artificial brains is the driving force to explore models and understand their functions by combining ideas from different scientific fields. Working in the neuromorphic engineering community provides inspiration for future investigation in this quest of understanding how the brain works for building artificial brains and possibly better brains.

Appendices


APPENDIX A

Other experiments

The tools and computational models developed throughout the realization of this thesis have been supported by several side-projects within the master program of the INI. Here I summarize the main results of these works as a reference for possible future developments.

A.1 Sharpening tuning curves

Tuning curves are a common tool used to characterize the response properties of sensory neurons to relevant stimuli. They are obtained by plotting the mean firing rate of the neuron as a function of a stimulus parameter. One of the most famous examples of tuned neural responses is orientation selectivity, by which neurons in the primary visual cortex (V1) of higher mammals (also called simple cells) exhibit tuned responses to oriented edges and gratings. The feed-forward input to simple cells is provided by the lateral geniculate nucleus (LGN), whose neurons do not show orientation preference. According to the original model of Hubel and Wiesel [Hubel and Wiesel, 1965], orientation selectivity of simple cells arises mainly from the feed-forward, anisotropic, specific connectivity patterns from LGN neurons to V1 simple cells. Subsequent studies, however, suggested that recurrent cortical connections also play an important role in shaping orientation selectivity [Somers et al., 1995]. To model the effect of recurrent connectivity observed in the cortex, cooperative-competitive networks, or soft-WTA networks of neurons, have been proposed [Ben-Yishai et al., 1995]. Furthermore, it has also been shown recently that although the basic structure of the connectivity patterns underlying selectivity is innate, visual experience is essential for enhancing its specific features and for maintaining the responsiveness and selectivity of cortical neurons [Crair et al., 1998]. This evidence suggests that both fixed WTA architectures and plasticity or learning mechanisms are required for reproducing the robust feature selectivity properties of cortical neurons. In this work we explored the interaction between these two mechanisms and its effect on tuning curves, using both software simulations and a neuromorphic VLSI system comprising a network of silicon neurons with synaptic circuits that exhibit bio-physically realistic dynamics and spike-driven plasticity [Fusi et al., 2000, Mitra et al., 2009].


Figure A.1: Lateral connectivity on the output layer enhances synaptic plasticity and can drive learning.

The theoretical basis for this work is a model where soft-WTA competition supports unsupervised learning, as it enhances the firing rate of the most active neurons, hence increasing the probability of inducing learning on the afferent synapses. In our experiments we implemented a network of Integrate-And-Fire (I&F) neurons with a Winner-Take-All (WTA) connectivity profile [Chicca et al., 2007], using synapses that update their weights according to the spike-timing dependent learning algorithm described in [Brader et al., 2007]. The pre-specified initial conditions of the network we implemented, expressed in its feed-forward connectivity structure between the LGN and V1 neurons, give rise to coarse feature selectivity and broad tuning curves: uniform inputs induce a weakly localized activity at the output which reflects the nature of the particular connectivity profile used. The uniform input represents the activity of all the neurons in LGN which are tuned to the same feature of the stimulus. Our results show how the network’s WTA mechanism amplifies and sharpens the response profile, acting as a teacher signal for the neurons receiving the highest inputs and thus driving the synaptic plasticity. As the network robustly learns to sharpen the pre-built broad selectivity profile, in both the bit-precise software simulations and the noisy hardware implementation, we propose that the WTA architecture used plays an important role in shaping the coarse innate feature selectivity on the basis of experience.
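A minimal rate-based sketch of the sharpening effect of soft-WTA competition is given below (the local self-excitation, the global inhibition proportional to the mean rate and the saturation level are illustrative assumptions, not the parameters of the VLSI network):

import numpy as np

def soft_wta(ff_input, n_iter=50, w_exc=1.1, w_inh=0.9, r_max=100.0):
    """Iterate a rate-based soft-WTA on a broad feed-forward input profile.
    Each unit receives its input plus self-excitation, minus inhibition
    proportional to the mean network rate; rates are rectified and saturated."""
    r = ff_input.copy()
    for _ in range(n_iter):
        r = ff_input + w_exc * r - w_inh * r.mean()
        r = np.clip(r, 0.0, r_max)
    return r

# Example: a broad Gaussian tuning profile gets sharpened by the competition.
x = np.linspace(-1.0, 1.0, 64)
broad = 10.0 * np.exp(-x**2 / 0.5)
sharp = soft_wta(broad)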

A.2 Neuromorphic olfaction

Olfactory stimuli are represented in a high-dimensional space by neural networks of the olfactory system. A great deal of research in olfaction has focused on this representation within the first processing stage, the olfactory bulb (vertebrates) or antennal lobe (insects) glomeruli. In particular the mapping of chemical stimuli onto olfactory glomeruli and the relation of this mapping to perceptual qualities have been investigated. While a number of studies have illustrated the importance of inhibitory networks within the olfactory bulb or the antennal lobe for the shaping and processing of olfactory information, it is not clear how exactly these inhibitory networks are organized to provide filtering and contrast enhancement capabilities. In this work the aim is to study the topology of the proposed networks by using software simulations and hardware implementation.

102 APPENDIX A. OTHER EXPERIMENTS

Figure A.2: Angle separation distributions between different odor representations for the linear simula- tion, the spiking simulation and the VLSI emulation. implementation. While we can study the dependence of the activity on each parameter of the theoretical models with the simulations, it is important to understand whether the models can be used in robotic applications for real-time odor recognition. We present the results of a linear simulation, a spiking simulation with I&F neurons and a real-time hardware emulation using neuromorphic VLSI chips. We used an input data set of neurophysiological recordings from olfactory receptive neurons of insects, especially Drosophila. In this work we developed a feed-forward model based on anatomical evidences on the Antennal Lobe (AL) of insects [Vosshall et al., 2000] and Calcium imaging studies [Wang et al., 2003] to study the role of feed-forward inhibition in odor discriminability. We used a subset of the DoOR odorant response database [Galizia et al., 2010] as input to a linear simulation, a spiking simulation and a VLSI emulation of the network. In the database, each odorant is represented as an ORN activation vector of mean-rate activity. We excluded odorants with less than 7 data points and receptor neurons with less than 80 data points. The remaining input matrix contains data for 137 odorants in 23 glomeruli. The empty cells in the matrix (ca. 15%) were filled with estimations of the receptor neurons’ spontaneous activity. All values in the input matrix were globally normalized. We measured the distributions of angle separations between different odor representations. The table in Fig. A.2 shows the distribution of angles (com- puted for all possible odor pairs) for the three simulations (rows) and for three values of inhibition strength (columns). When inhibition is disabled (left column) the PNs angle histogram is identical to the input (ORN) angle histogram for the linear simulation (top graph). Small variations due to the noise introduced by the Poisson statistic of the spike trains are observed in the spiking simulation; more noise is observed in the VLSI emulation because of device mismatch. When inhibition is enabled (center column) an average increase in angles between odors is observed in the three models. This network effect can be increased by increasing the strength of inhibition (right column). These results show that inhibition could be used by the AL to increase angles between odor pairs and therefore improve odor discriminability. The three models show comparable results (see [Beyeler et al., 2010] for a more complete description of the work).
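The effect of this kind of feed-forward inhibition on the angles between odor representations can be sketched in a few lines of Python. The snippet below is a hypothetical illustration: it uses synthetic ORN activation vectors of the same size as the DoOR subset (137 odorants, 23 glomeruli) rather than the database itself, and it models the inhibition as a simple subtraction of a fraction of the mean ORN activity from every glomerulus, followed by rectification.

import numpy as np

# Synthetic ORN activation vectors (not the DoOR data) and a subtractive
# global-inhibition model of the antennal lobe, used only to illustrate how
# inhibition can increase the pairwise angles between odor representations.
rng = np.random.default_rng(0)
n_odors, n_glomeruli = 137, 23
orn = rng.gamma(shape=2.0, scale=1.0, size=(n_odors, n_glomeruli))

def pairwise_angles(x):
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    cosines = np.clip(x @ x.T, -1.0, 1.0)
    i, j = np.triu_indices(len(x), k=1)                 # all odor pairs
    return np.degrees(np.arccos(cosines[i, j]))

for g_inh in (0.0, 0.5, 0.8):                           # inhibition strength
    pn = np.maximum(orn - g_inh * orn.mean(axis=1, keepdims=True), 0.0)
    print(f"inhibition {g_inh:.1f}: mean angle = {pairwise_angles(pn).mean():.1f} deg")

In this toy model the mean pairwise angle grows with the inhibition strength, qualitatively reproducing the trend reported above.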


A.3 Considerations on STDP

It is known that spike timing can carry much more information than mean-rate codes. This suggests that neurons that learn to read out information from spike times are needed (e.g. the tempotron). However, this is not the case when a large population of neurons with heterogeneous parameters is involved. For an Inter-aural Time Difference (ITD) problem, Vasilkov and Tikidji-Hamburyan [Vasilkov and Tikidji-Hamburyan, 2012] show how a rate code can be used to detect sub-millisecond ITDs with a precision of up to ±2 µs/neuron, with the rate computed across a population of 10000 neurons. Interestingly, a fundamental ingredient is the heterogeneity of parameters across the population of neurons. Following the above considerations, a sound recognition system could use distributed representations to classify complex sounds that change on sub-millisecond time-scales.

• By increasing the weight of the synapses one shortens the effective integration time constant of the output neurons (see also Thorpe / earlier spikes. . . ). The pattern recognition problem then reduces to finding the shortest time constant that makes the neuron fire only during the presentation of the pattern.

• Bounded synapses in this type of algorithm limit the time resolution: more input spikes are needed to build up V_mem, hence a lower probability of coincidences within ∆t, where ∆t is the minimum time interval needed to build output spikes (∼ V_leak?).

• No matter what learning rule you have, no matter the shape of the STDP you have, as long as you can implement coincidence detection you can implement pattern recognition as in Gütig and Sompolinsky 2007 / Thorpe and colleagues / Pfeiffer and co-workers.

• This type of pattern recognition is not really pattern recognition: a whole dendritic tree is wasted on a single class of patterns.

• This type of pattern recognition does not implement classification: you cannot arbitrarily choose your classes. A class is defined once you choose a single pattern, because it consists only of the patterns that can be selected by that pattern ± jitter.

• In addition, the jitter has to be uncorrelated; otherwise, even for very small amounts of jitter, the recognition will fail.

• The recognition works only in very limited situations: sparse activity in space/time, short patterns to be recognized, and single spikes triggering synaptic transitions.

• The traditional STDP shape only shifts the recognition to the earliest meaningful coincidence (cf. Thorpe), whereas the tempotron chooses the time itself. Other STDP shapes would just shift the meaningful coincidences back and forth in time.

• Once the pattern is learned, you can remove the entire pattern and keep only the part where the meaningful coincidences happen.


A.3.1 Studying coincidences

• Generate 100 Poisson spike trains with a certain mean rate (50 Hz). The ensemble of these will be the pattern to be recognized (call it p); duration 50 ms or so.

• Generate n similar trains which will be the background; duration not important, mean rate 50 Hz + noise, where the noise could be 10 Hz. Call these trains B_i.

• Generate n other trains at the noise rate (10 Hz), with the same duration as p, and overlap them with p.

• Concatenate the B_i and (p + noise) segments into a BpB sequence.

• Compute the instantaneous firing rate (IFR) across the population of 100 trains. Compute it for different time-bin values.

• There is going to be a time bin for which a peak aligned with the pattern arises.

• For very short time bins, the IFR is a sequence of deltas occurring at the times where one of the trains has a spike. For time bins longer than 1/(f N) (f mean rate, e.g. 50 Hz; N number of neurons, e.g. 100), the IFR is flat. For intermediate values. . .

• For those intermediate values one can also see (obviously) a shrinkage of the standard deviation of the IFR. A minimal software sketch of this procedure is given below.
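A minimal NumPy sketch of the procedure listed above follows. Where the notes leave parameters open (number of repetitions, background duration, candidate time bins), the values below are arbitrary choices.

import numpy as np

# Coincidence study: a frozen 100-train Poisson pattern p embedded in background
# activity, and the population IFR computed for several candidate time bins.
rng = np.random.default_rng(1)
n_trains, rate, noise = 100, 50.0, 10.0                 # Hz
dt = 1e-4                                               # 0.1 ms raster resolution
t_bg, t_p = 0.2, 0.05                                   # background / pattern duration (s)

def poisson_block(duration, hz):
    """Binary spike raster of shape (n_trains, n_steps)."""
    return rng.random((n_trains, int(round(duration / dt)))) < hz * dt

pattern = poisson_block(t_p, rate)                      # the frozen pattern p
blocks = []
for _ in range(10):                                     # B-p-B-p-... sequence
    blocks.append(poisson_block(t_bg, rate + noise))    # background block B_i
    blocks.append(pattern | poisson_block(t_p, noise))  # pattern plus added noise spikes
raster = np.concatenate(blocks, axis=1)

for bin_ms in (0.2, 1.0, 5.0, 20.0):                    # candidate time bins
    n_bins = max(1, round(bin_ms * 1e-3 / dt))          # raster steps per bin
    n_steps = raster.shape[1] // n_bins * n_bins
    counts = raster[:, :n_steps].reshape(n_trains, -1, n_bins).sum(axis=(0, 2))
    ifr = counts / (n_trains * bin_ms * 1e-3)           # population IFR in Hz
    print(f"bin {bin_ms:5.1f} ms: peak/mean IFR = {ifr.max() / ifr.mean():.2f}")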

Figure A.3: Average firing rate over 10 repetitions of a BpB sequence. Left: averages for different time-bin widths. Right: shortest time bin, with the standard deviation over the 10 realizations shaded. The peak is caused by the coincidence of a number of spikes across the population of 100 neurons. Other peaks can also be seen for longer time bins (see left panel), but their height is comparable to the noise level across realizations (see standard deviation in the right panel).

Spike-timing dependent plasticity in hardware

The plasticity rule for the synaptic updates considers the value of the post-synaptic variable at the time when a pre-synaptic spike hits the synapse. The direction of the synaptic change is determined by this value with respect to a fixed threshold. This particular choice combines two important aspects of synaptic plasticity. The first is that the single synaptic updates are not deterministic, because they depend on the individual occurrences of the pre- and post-synaptic spikes, both being noisy processes. The second is that there exists a dependence on the time difference between pre- and post-synaptic spikes without the need for a dedicated time counter. The effect of the bi-stability of the synaptic states is that it is only possible to define the tendency [Brader et al., 2007] of the synapses to potentiate or depress, given a certain mean rate and time correlation between the pre- and post-synaptic spikes.

Figure A.4: Comparison between the theoretical STDP curve from the simulations of Brader et al. [2007] (bottom) and the spiking hardware (top). Both figures are centered around a post-synaptic spike. Each white pixel of the top figure represents a synapse potentiated after the presentation of a 50 Hz Poisson stimulus. The synaptic state changes in hardware follow the predictions of the simulations. See text for details.
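As a rough software analogue of the update rule just described (a sketch of the behaviour, not of the circuit), the snippet below reads a proxy of the post-synaptic depolarisation at the arrival of each pre-synaptic spike and moves a bistable synaptic variable up or down accordingly, with a slow drift towards the nearest stable state between updates. All thresholds, jump sizes and rates are arbitrary, and the calcium-dependent stop-learning conditions of Brader et al. [2007] are omitted for brevity.

import numpy as np

# Rough software analogue of the spike-driven update rule described above
# (behavioural sketch only, not the circuit; all parameters are arbitrary and
# the calcium "stop-learning" conditions of Brader et al. [2007] are omitted).
rng = np.random.default_rng(2)

theta_v, theta_x = 0.5, 0.5        # post-synaptic read threshold, bistability threshold
a, b, drift = 0.1, 0.1, 0.02       # up jump, down jump, drift per time step
dt, duration = 1e-3, 2.0           # time step and run length (s)
pre_rate, post_rate = 50.0, 60.0   # imposed mean pre- and post-synaptic rates (Hz)

x = 0.3                            # internal synaptic variable, bounded in [0, 1]
v_post = 0.0                       # proxy for the post-synaptic depolarisation

for _ in range(int(duration / dt)):
    # Post-synaptic variable: leaky trace of the neuron's own Poisson spikes.
    v_post += -v_post * dt / 0.02 + (rng.random() < post_rate * dt)
    v_post = min(v_post, 1.0)

    if rng.random() < pre_rate * dt:              # a pre-synaptic spike arrives:
        x += a if v_post > theta_v else -b        # jump direction set by v_post
    else:                                         # no spike: drift to a stable state
        x += drift if x > theta_x else -drift
    x = float(np.clip(x, 0.0, 1.0))

print("synapse ends", "potentiated" if x > theta_x else "depressed", f"(x = {x:.2f})")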

Bibliography

H.D.I. Abarbanel, R. Huerta, and M.I. Rabinovich. Dynamical model of long-term synaptic plasticity. Proceedings of the National Academy of Sciences, 99(15):10132–10137, 2002.

L.F. Abbott and S.B. Nelson. Synaptic plasticity: taming the beast. Nature Neuroscience, 3: 1178–1183, November 2000.

A. Aizerman, E.M. Braverman, and L.I. Rozoner. Theoretical foundations of the potential function method in pattern recognition learning. Automation and remote control, 25:821–837, 1964.

L. Alvado, J. Tomas, S. Saighi, S. Renaud-Le Masson, T. Bal, A. Destexhe, and G. Le Masson. Hardware computation of conductance-based neuron models. Neurocomputing, 58–60:109–115, 2004.

S. Amari and M.A. Arbib. Competition and cooperation in neural nets. In J. Metzler, editor, Systems Neuroscience, pages 119–165. Academic Press, 1977.

D. Amit and N. Brunel. Model of global spontaneous activity and local structured activity during delay periods in the cerebral cortex. Cerebral Cortex, 7:237–252, 1997.

D.J. Amit. Modeling brain function: The world of attractor neural networks. Cambridge University Press, 1992.

D.J. Amit and S. Fusi. Dynamic learning in neural networks with material synapses. Neural Computation, 6:957, 1994.

D.J. Amit and G. Mongillo. Spike-driven synaptic dynamics generating working memory states. Neural Computation, 15(3):565–596, 2003.

Y. Amit and G. Blanchard. Multiple randomized classifiers: Mrcl. Technical report, University of Chicago, 2001. URL http://galton.uchicago.edu/~amit/Papers/mrcl.pdf.

Y. Amit and M. Mascaro. Attractor networks for shape recognition. Neural Computation, 13(6): 1415–1442, 2001.


J. Arthur and K. Boahen. Learning in silicon: Timing is everything. In Y. Weiss, B. Schölkopf, and J. Platt, editors, Advances in Neural Information Processing Systems 18. MIT Press, Cambridge, MA, USA, 2006.

J. V. Arthur, P. A. Merolla, F. Akopyan, R. Alvarez, A. Cassidy, A. Chandra, S. K. Esser, N. Imam, W. Risk, D. B. D. Rubin, R. Manohar, and D. S. Modha. Building block of a programmable neuromorphic substrate: A digital neurosynaptic core. In International Joint Conference on Neural Networks, IJCNN 2012, pages 1946–1953. IEEE, Jun 2012. doi: 10.1109/IJCNN.2012.6252637.

W.F. Asaad, G. Rainer, and E.K. Miller. Neural activity in the primate prefrontal cortex during associative learning. Neuron, 21(6):1399–1407, 1998.

J. Backus. Can programming be liberated from the von neumann style?: a functional style and its algebra of programs. Communications of the ACM, 21(8):613–641, 1978.

L. Badel, S. Lefort, R. Brette, C.C.H. Petersen, W. Gerstner, and M.J.E. Richardson. Dynamic i-v curves are reliable predictors of naturalistic pyramidal-neuron voltage traces. Journal of Neurophysiology, 99:656–666, 2008.

J. Bailey and D. Hammerstrom. Why vlsi implementations of associative vlcns require connection multiplexing. In Neural Networks, 1988., IEEE International Conference on, pages 173–180. IEEE, 1988.

O. Barak, M. Rigotti, and S. Fusi. The sparseness of mixed selectivity neurons controls the generalization–discrimination trade-off. The Journal of Neuroscience, 33(9):3844–3856, 2013.

C. Bartolozzi and G. Indiveri. A neuromorphic selective attention architecture with dynamic synapses and integrate-and-fire neurons. In Brain Inspired Cognitive Systems, BICS 2004, volume BIS2.2, pages 1–6, Aug 2004. URL http://ncs.ethz.ch/pubs/pdf/Bartolozzi_ Indiveri04.pdf.

C. Bartolozzi and G. Indiveri. A selective attention multi–chip system with dynamic synapses and spiking neurons. In B. Schölkopf, J.C. Platt, and T. Hofmann, editors, Advances in Neural Information Processing Systems 19, pages 113–120, Cambridge, MA, USA, Dec 2006. Neural Information Processing Systems Foundation, MIT Press. URL http://ncs.ethz.ch/pubs/ pdf/Bartolozzi_Indiveri06c.pdf.

C. Bartolozzi and G. Indiveri. Synaptic dynamics in analog VLSI. Neural Computation, 19(10): 2581–2603, Oct 2007. doi: 10.1162/neco.2007.19.10.2581. URL http://ncs.ethz.ch/pubs/ pdf/Bartolozzi_Indiveri07b.pdf.

C. Bartolozzi and G. Indiveri. Global scaling of synaptic efficacy: Homeostasis in silicon synapses. Neurocomputing, 72(4–6):726–731, Jan 2009. doi: 10.1016/j.neucom.2008.05.016. URL http://ncs.ethz.ch/pubs/pdf/Bartolozzi_Indiveri09.pdf.

C. Bartolozzi, S. Mitra, and G. Indiveri. An ultra low power current–mode filter for neuromorphic systems and biomedical signal processing. In Biomedical Circuits and Systems Conference, (BioCAS), 2006, pages 130–133. IEEE, 2006. doi: 10.1109/BIOCAS.2006.4600325. URL http://ncs.ethz.ch/pubs/pdf/Bartolozzi_etal06.pdf.


S. Behnel, R. Bradshaw, C. Citro, L. Dalcin, D.S. Seljebotn, and K. Smith. Cython: The best of both worlds. Computing in Science & Engineering, 13(2):31–39, 2011.

D. Ben Dayan Rubin, E. Chicca, and G. Indiveri. Characterizing the firing properties of an adaptive analog VLSI neuron. In Auke Jan Ijspeert, Masayuki Murata, and Naoki Wakamiya, editors, Biologically Inspired Approaches to Advanced Information Technology: First International Workshop, BioADIT 2004, Lausanne, Switzerland, January 29-30, 2004, Revised Selected Papers, volume 3141/2004 of LNCS, pages 189–200. Springer-Verlag Heidelberg, 2004. URL http://ncs.ethz.ch/pubs/pdf/Ben-Dayan-Rubin_etal04.pdf.

R. Ben-Yishai, R. Lev Bar-Or, and H. Sompolinsky. Theory of orientation tuning in visual cortex. Proceedings of the National Academy of Sciences of the USA, 92(9):3844–3848, April 1995.

Y. Bengio. Learning deep architectures for ai. Foundations and Trends in Machine Learning, 2(1):1–127, 2009.

Y. Bengio and O. Delalleau. On the expressive power of deep architectures. In Algorithmic Learning Theory, pages 18–36. Springer, 2011.

Y. Bengio and Y. LeCun. Scaling learning algorithms towards ai. Large-Scale Kernel Machines, 34, 2007.

A. Bennett. Large competitive networks. Network, 1:449–62, 1990.

M. Beyeler, F. Stefanini, H. Proske, G. Galizia, and E. Chicca. Exploring olfactory sensory networks: Simulations and hardware emulation. In Biomedical Circuits and Systems Conference (BioCAS), 2010, pages 270–273. IEEE, 2010. doi: 10.1109/BIOCAS.2010.5709623. URL http://ncs.ethz.ch/pubs/pdf/Beyeler_etal10.pdf.

C.M. Bishop. Neural networks for pattern recognition. Oxford Univ Pr, 2005.

C.M. Bishop. Pattern recognition and machine learning. Springer New York, 2006.

K.A. Boahen. Communicating neuronal ensembles between neuromorphic chips. In T.S. Lande, editor, Neuromorphic Systems Engineering, pages 229–259. Kluwer Academic, Norwell, MA, 1998.

K.A. Boahen. Point-to-point connectivity between neuromorphic chips using address-events. IEEE Transactions on Circuits and Systems II, 47(5):416–34, 2000.

K.A. Boahen. A burst-mode word-serial address-event link – I: Transmitter design. IEEE Transactions on Circuits and Systems I, 51(7):1269–80, 2004.

K.A. Boahen. Neuromorphic microchips. Scientific American, 292(5):56–63, May 2005.

J. Brader, W. Senn, and S. Fusi. Learning real world stimuli in a neural network with spike-driven synaptic dynamics. Neural Computation, 19:2881–2912, 2007.

L. Breiman. Bagging predictors. Machine Learning, 24:123–140, 1996. ISSN 0885-6125. doi: 10.1007/BF00058655.


L. Breiman. Random forests. Machine learning, 45(1):5–32, 2001.

R. Brette and W. Gerstner. Adaptive exponential integrate-and-fire model as an effective description of neuronal activity. Journal of Neurophysiology, 94:3637–3642, 2005.

R. Brette and D.F.M. Goodman. Simulating spiking neural networks on gpu. Network: Compu- tation in Neural Systems, 23(4):167–182, 2012.

S. Brink, S. Nease, and P. Hasler. Computing with networks of spiking neurons on a biophysically motivated floating-gate based neuromorphic integrated circuit. Neural Networks, 2013.

D. Brüderle, E. Müller, A. Davison, E. Muller, J. Schemmel, and K. Meier. Establishing a novel modeling tool: a python-based interface for a neuromorphic hardware system. Frontiers in Neuroinformatics, 4, 2009. ISSN 1662-5196. doi: 10.3389/neuro.11.017.2009. URL http://frontiersin.org/Journal/Abstract.aspx?s=752&name=neuroinformatics&ART_DOI=10.3389/neuro.11.017.2009.

D. Brüderle, M.A. Petrovici, B. Vogginger, M. Ehrlich, T. Pfeil, S. Millner, A. Grübl, K. Wendt, E. Müller, M.-O. Schwartz, D.H. de Oliveira, S. Jeltsch, J. Fieres, M. Schilling, P. Müller, O. Breitwieser, V. Petkov, L. Muller, A.P. Davison, P. Krishnamurthy, J. Kremkow, M. Lundqvist, E. Muller, J. Partzsch, S. Scholze, L. Zühl, C. Mayr, A. Destexhe, M. Diesmann, T.C. Potjans, A. Lansner, R. Schüffny, J. Schemmel, and K. Meier. A comprehensive workflow for general-purpose neural modeling with highly configurable neuromorphic hardware systems. Biological cybernetics, 104(4):263–296, 2011.

N. Brunel, F. Carusi, and S. Fusi. Slow stochastic hebbian learning of classes of stimuli in a recurrent neural network. Network: Computation in Neural Systems, 9(1):123–152, 1998. doi: 10.1088/0954-898X_9_1_007. URL http://informahealthcare.com/doi/abs/10.1088/ 0954-898X_9_1_007.

R. Brunelli and T. Poggio. Face recognition: Features versus templates. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 15(10):1042–1052, 1993.

K. Cameron and A. Murray. Minimizing the effect of process mismatch in a neuromorphic system using spike-timing-dependent adaptation. Neural Networks, IEEE Transactions on, 19(5): 899–913, May 2008. doi: 10.1109/TNN.2007.914192.

P. Camilleri, M. Giulioni, V. Dante, D. Badoni, G. Indiveri, B. Michaelis, J. Braun, and P. Del Giudice. A neuromorphic aVLSI network chip with configurable plastic synapses. In Hybrid Intelligent Systems, 2007, HIS 2007, pages 296–301, Los Alamitos, CA, USA, 2007. IEEE Computer Society. doi: 10.1109/HIS.2007.60. URL http://ncs.ethz.ch/pubs/pdf/ Camilleri_etal07.pdf. (Best paper award).

A. Cassidy and A.G. Andreou. Dynamical digital silicon neurons. In Biomedical Circuits and Systems Conference, (BioCAS), 2008, pages 289–292. IEEE, Nov. 2008. doi: 10.1109/BIOCAS. 2008.4696931.

A.S. Cassidy, J. Georgiou, and A.G. Andreou. Design of silicon brains in the nano-cmos era: Spiking neurons, learning synapses and neural architecture optimization. Neural Networks, 2013. doi: 10.1016/j.neunet.2013.05.011. URL http://www.sciencedirect.com/science/article/pii/S0893608013001597.

R. Cattell and A. Parker. Challenges for brain emulation: why is building a brain so difficult. Natural Intelligence: the INNS Magazine, 1(3):17–31, 2012.

C. Chen, A. Liaw, and L. Breiman. Using random forest to learn imbalanced data. Technical report, University of California, Berkeley, 2004.

K. Cheung, S.R. Schultz, and W. Luk. A large-scale spiking neural network accelerator for fpga systems. In Artificial Neural Networks and Machine Learning–ICANN 2012, pages 113–120. Springer, 2012.

E. Chicca. A VLSI neuromorphic device with 128 neurons and 3000 synapses: area optimization and design. Master’s thesis, University of Rome 1, “La Sapienza”, 1999. In Italian.

E. Chicca and S. Fusi. Stochastic synaptic plasticity in deterministic aVLSI networks of spiking neurons. In Frank Rattay, editor, Proceedings of the World Congress on Neuroinformatics, ARGESIM Reports, pages 468–477, Vienna, 2001. ARGESIM/ASIM Verlag.

E. Chicca, D. Badoni, V. Dante, M. D’Andreagiovanni, G. Salina, L. Carota, S. Fusi, and P. Del Giudice. A VLSI recurrent network of integrate–and–fire neurons connected by plastic synapses with long term memory. IEEE Transactions on Neural Networks, 14(5):1297–1307, September 2003. doi: 10.1109/TNN.2003.816367. URL http://ncs.ethz.ch/pubs/pdf/Chicca_etal03. pdf.

E. Chicca, G. Indiveri, and R.J. Douglas. Context dependent amplification of both rate and event-correlation in a VLSI network of spiking neurons. In B. Schölkopf, J.C. Platt, and T. Hofmann, editors, Advances in Neural Information Processing Systems 19, pages 257–264, Cambridge, MA, USA, Dec 2007. Neural Information Processing Systems Foundation, MIT Press. URL http://ncs.ethz.ch/pubs/pdf/Chicca_etal07.pdf.

E. Chicca, F. Stefanini, and G. Indiveri. Neuromorphic electronic circuits for building autonomous cognitive systems. Proceedings of IEEE, 2013. (Submitted, under review).

Y. Cho and L.K. Saul. Large-margin classification in infinite neural networks. Neural computation, 22(10):2678–2697, 2010.

T.Y.W. Choi, P.A. Merolla, J.V. Arthur, K.A. Boahen, and B.E. Shi. Neuromorphic implementation of orientation hypercolumns. IEEE Transactions on Circuits and Systems I, 52(6):1049–1060, 2005.

S. Choudhary, S. Sloan, S. Fok, A. Neckar, E. Trautmann, P. Gao, T. Stewart, C. Eliasmith, and K. Boahen. Silicon neurons that compute. In A. Villa, W. Duch, P. Érdi, F. Masulli, and G. Palm, editors, Artificial Neural Networks and Machine Learning – ICANN 2012, volume 7552 of Lecture Notes in Computer Science, pages 121–128. Springer Berlin / Heidelberg, 2012. ISBN 978-3-642-33268-5. doi: 10.1007/978-3-642-33269-2_16.

L. Chua. Memristor-the missing circuit element. Circuit Theory, IEEE Transactions on, 18(5): 507–519, 1971.


L.O. Chua and S.M. Kang. Memristive devices and systems. Proceedings of the IEEE, 64(2): 209–223, 1976.

D.C. Cireşan, U. Meier, L.M. Gambardella, and J. Schmidhuber. Deep, big, simple neural nets for handwritten digit recognition. Neural computation, 22(12):3207–3220, 2010.

C. Clopath, L. Büsing, E. Vasilaki, and W. Gerstner. Connectivity reflects coding: a model of voltage-based STDP with homeostasis. Nature Neuroscience, 13(3):344–352, 2010.

A. Coates, H. Lee, and A.Y. Ng. An analysis of single-layer networks in unsupervised feature learning. Ann Arbor, 1001:48109, 2010.

B.W. Connors, M.J. Gutnick, and D.A. Prince. Electrophysiological properties of neocortical neurons in vitro. Jour. of Neurophysiol., 48(6):1302–1320, 1982.

C. Cortes and V. Vapnik. Support-vector networks. Machine learning, 20(3):273–297, 1995.

M.C. Crair, D.C. Gillespie, and M.P. Stryker. The role of visual experience in the development of columns in cat visual cortex. Science, 279(5350):566–570, 1998. doi: 10.1126/science.279.5350. 566.

S. Davies. Learning in Spiking Neural Networks. Ph.D. thesis, School of Computer Science, The University of Manchester, Kilburn Building, Oxford Road, M13 9PL, Manchester, UK, Feb 2013.

A.P. Davison, D. Brüderle, J. Eppler, J. Kremkow, E. Muller, D. Pecevski, L. Perrinet, and P. Yger. PyNN: a common interface for neuronal network simulators. Frontiers in Neuroinformatics, 2:11, 2008. doi: 10.3389/neuro.11.011.2008.

P. Dayan and L.F. Abbott. Theoretical Neuroscience: Computational and Mathematical Modeling of Neural Systems. MIT Press, Cambridge, MA, USA, 2001. ISBN 9780262541855.

G. Deco and E. Rolls. Neurodynamics of biased competition and cooperation for attention: a model with spiking neurons. Journal of Neurophysiology, 94:295–313, 2005.

D. Decoste and B. Schölkopf. Training invariant support vector machines. Machine Learning, 46 (1-3):161–190, 2002.

A. Destexhe, Z.F. Mainen, and T.J. Sejnowski. Methods in Neuronal Modelling, from ions to networks, chapter Kinetic Models of Synaptic Transmission, pages 1–25. MIT Press, 1998.

K. Dhoble, N. Nuntalid, G. Indiveri, and N. Kasabov. Online spatio-temporal pattern recognition with evolving spiking neural networks utilising address event representation, rank order, and temporal spike learning. In International Joint Conference on Neural Networks, IJCNN 2012, pages 554–560. IEEE, 2012. URL http://ncs.ethz.ch/pubs/pdf/Dhoble_etal12.pdf.

T.G. Dietterich. Ensemble methods in machine learning. In Multiple classifier systems, pages 1–15. Springer, 2000.

C. Diorio, P. Hasler, B.A. Minch, and C. Mead. A single-transistor silicon synapse. IEEE Transactions on Electron Devices, 43(11):1972–1980, 1996.


M. Djurfeldt, M. Lundqvist, C. Johansson, M. Rehn, Ö. Ekeberg, and A. Lansner. Brain-scale simulation of the neocortex on the ibm blue gene/l supercomputer. IBM J. Res. Dev., 52(1/2): 31–41, 2008. ISSN 0018-8646.

Y.S. Dong and K.S. Han. Boosting svm classifiers by ensemble. In Special interest tracks and posters of the 14th international conference on World Wide Web, pages 1072–1073. ACM, 2005.

R.J. Douglas and K. Martin. Recurrent neuronal circuits in the neocortex. Current Biology, 17 (13):R496–R500, 2007.

R.J. Douglas and K.A.C. Martin. Neural circuits of the neocortex. Annual Review of Neuroscience, 27:419–51, 2004.

R.J. Douglas, K.A.C. Martin, and D. Whitteridge. A canonical microcircuit for neocortex. Neural Computation, 1:480–488, 1989.

R.J. Douglas, C.K. Koch, M.A. Mahowald, K.A.C. Martin, and H.H. Suarez. Recurrent excitation in neocortical circuits. Science, 269:981–985, 1995a.

R.J. Douglas, M.A. Mahowald, and C. Mead. Neuromorphic analogue VLSI. Annu. Rev. Neurosci., 18:255–281, 1995b.

D. Dupeyron, S. Le Masson, Y. Deval, G. Le Masson, and J.-P. Dom. A BiCMOS implementation of the Hodgkin-Huxley formalism. In Proceedings of the Fifth International Conference on Microelectronics for Neural, Fuzzy and Bio-inspired Systems; Microneuro’96, pages 311–316, Los Alamitos, CA, February 1996. MicroNeuro, IEEE Computer Society Press.

M. Ehrlich, K. Wendt, L. Zühl, R. Schüffny, D. Brüderle, E. Müller, and B. Vogginger. A software framework for mapping neural networks to a wafer-scale neuromorphic hardware system. Proceedings of ANNIIP, pages 43–52, 2010.

C. Eliasmith and C.H. Anderson. Neural engineering: Computation, representation, and dynamics in neurobiological systems. The MIT Press, 2004.

C. Eliasmith, T.C. Stewart, X. Choo, T. Bekolay, T. DeWolf, Y. Tang, and D. Rasmussen. A large-scale model of the functioning brain. Science, 338(6111):1202–1205, 2012. doi: 10.1126/ science.1225266. URL http://www.sciencemag.org/content/338/6111/1202.abstract.

D. Erhan, Y. Bengio, A. Courville, P.A. Manzagol, P. Vincent, and S. Bengio. Why does unsupervised pre-training help deep learning? The Journal of Machine Learning Research, 11: 625–660, 2010.

FACETS. Fast analog computing with emergent transient states in neural architectures (FACETS). FP6-2005-015879 EU Grant, 2005–2009.

D.B. Fasnacht and G. Indiveri. A PCI based high-fanout AER mapper with 2 GiB RAM look-up table, 0.8 µs latency and 66 mhz output event-rate. In Conference on Information Sciences and Systems, CISS 2011, pages 1–6, Johns Hopkins University, March 2011. doi: 10.1109/CISS. 2011.5766102. URL http://ncs.ethz.ch/pubs/pdf/Fasnacht_Indiveri11.pdf.


D.B. Fasnacht, A.M. Whatley, and G. Indiveri. A serial communication infrastructure for multi- chip address event system. In International Symposium on Circuits and Systems, (ISCAS), 2008, pages 648–651. IEEE, May 2008. doi: 10.1109/ISCAS.2008.4541501. URL http://ncs. ethz.ch/pubs/pdf/Fasnacht_etal08.pdf.

FINS-Python. Special topic: Python in neuroscience, 2009. URL http://www.frontiersin. org/neuroinformatics/specialtopics/python_in_neuroscience/8.

F. Folowosele, R. Etienne-Cummings, and T.J. Hamilton. A CMOS switched capacitor implementation of the Mihalas-Niebur neuron. In Biomedical Circuits and Systems Conference, (BioCAS), 2009, pages 105–108. IEEE, Nov. 2009. doi: 10.1109/BIOCAS.2009.5372048.

K. Fukushima. Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological cybernetics, 36(4):193–202, 1980.

S. Furber and S. Temple. Neural systems engineering. Journal of the Royal Society interface, 4 (13):193–206, 2007.

S. Fusi. Hebbian spike-driven synaptic plasticity for learning patterns of mean firing rates. Biological Cybernetics, 87:459–470, 2002.

S. Fusi and L.F. Abbott. Limits on the memory storage capacity of bounded synapses. Nature Neuroscience, 10:485–493, 2007.

S. Fusi and M. Mattia. Collective behavior of networks with linear (VLSI) integrate and fire neurons. Neural Computation, 11:633–52, 1999.

S. Fusi, M. Annunziato, D. Badoni, A. Salamon, and D.J. Amit. Spike–driven synaptic plasticity: theory, simulation, VLSI implementation. Neural Computation, 12:2227–58, 2000.

S. Fusi, P.J. Drew, and L.F. Abbott. Cascade models of synaptically stored memories. Neuron, 45:599–611, 2005.

S. Fusi, W.F. Asaad, E.K. Miller, and X-J. Wang. A neural circuit model of flexible sensori-motor mapping: learning and forgetting. Neuron, 2007. In press.

C.G. Galizia, D. Münch, M. Strauch, A. Nissler, and S. Ma. Integrating heterogeneous odor response data into a common response model: A DoOR to the complete olfactome. Chemical Senses, 35(7):551–563, 2010. doi: 10.1093/chemse/bjq042. URL http://chemse.oxfordjournals.org/content/35/7/551.abstract.

F. Galluppi, K. Brohan, S. Davidson, T. Serrano-Gotarredona, J.P. Carrasco, B. Linares-Barranco, and S. Furber. A real-time, event-driven neuromorphic system for goal-directed attentional selection. In Neural Information Processing, pages 226–233. Springer, 2012.

B. Gilbert. Translinear circuits: An historical review. Analog Integrated Circuits and Signal Processing, 9(2):95–118, March 1996.

M. Giulioni, P. Camilleri, V. Dante, D. Badoni, G. Indiveri, J. Braun, and P. Del Giudice. A VLSI network of spiking neurons with plastic fully configurable “stop-learning” synapses. In International Conference on Electronics, Circuits, and Systems, ICECS 2008, pages 678–681. IEEE, 2008. doi: 10.1109/ICECS.2008.4674944. URL http://ncs.ethz.ch/pubs/pdf/Giulioni_etal08.pdf.

M. Giulioni, P. Camilleri, M. Mattia, V. Dante, J. Braun, and P. Del Giudice. Robust working memory in an asynchronously spiking neural network realized in neuromorphic VLSI. Frontiers in Neuroscience, 5, 2011. ISSN 1662-453X. doi: 10.3389/fnins.2011.00149. URL http://www.frontiersin.org/Journal/Abstract.aspx?s=755&name=neuromorphic_ engineering&ART_DOI=10.3389/fnins.2011.00149.

D.H. Goldberg, G. Cauwenberghs, and A.G. Andreou. Probabilistic synaptic weighting in a reconfigurable network of VLSI integrate-and-fire neurons. Neural Networks, 14(6–7):781–793, Sep 2001.

M. Graupner and N. Brunel. Calcium-based plasticity model explains sensitivity of synaptic changes to spike pattern, rate, and dendritic location. Proceedings of the National Academy of Sciences, 2012. doi: 10.1073/pnas.1109359109. URL http://www.pnas.org/content/early/ 2012/02/21/1109359109.abstract.

R.W. Guillery. Anatomical evidence concerning the role of the thalamus in corticocortical communication: a brief review. Journal of Anatomy, 187(Pt 3):583, 1995.

R. Gütig and H. Sompolinsky. The tempotron: a neuron that learns spike timing–based decisions. Nature Neuroscience, 9:420–428, 2006. doi: 10.1038/nn1643.

R. Gütig and H. Sompolinsky. Time-warp-invariant neuronal processing. PLoS Biology, 7(7): e1000141, July 2009. doi: 10.1371/journal.pbio.1000141.

R. Guyonneau, H. Kirchner, and S.J. Thorpe. Animals roll around the clock: The rotation invariance of ultrarapid visual processing. Journal of Vision, 6(10):1008–1017, 2006. doi: 10.1167/6.10.1. URL http://journalofvision.org/6/10/1/.

P. Häfliger, M. Mahowald, and L. Watts. A spike based learning neuron in analog VLSI. In M.C. Mozer, M.I. Jordan, and T. Petsche, editors, Advances in neural information processing systems, volume 9, pages 692–698. MIT Press, 1997.

R. Hahnloser, R. Sarpeshkar, M.A. Mahowald, R.J. Douglas, and S. Seung. Digital selection and analog amplification co-exist in an electronic circuit inspired by neocortex. Nature, 405(6789): 947–951, 2000.

J.A. Hanley and N.J. McNeil. The meaning and use of the area under a receiver operating characteristic (roc) curve. Radiology, 143(1):29–36, 1982.

D. Hansel and H. Sompolinsky. Methods in Neuronal Modeling, chapter Modeling Feature Selectivity in Local Cortical Circuits, pages 499–567. MIT Press, Cambridge, MA, USA, 1998.

R.R. Harrison, P. Hasler, and B.A. Minch. Floating gate CMOS analog memory array. In Proc. IEEE Intl. Symp. on Circuits and Systems, volume 2, pages 404–407, Monterey, CA., 1998.


J. Hertz, A. Krogh, and R.G. Palmer. Introduction to the Theory of Neural Computation. Addison-Wesley, Reading, MA, 1991.

G. Hinton and T.J. Sejnowski. Unsupervised learning: foundations of neural computation. The MIT press, 1999.

G.E. Hinton. Learning multiple layers of representation. Trends in cognitive sciences, 11(10): 428–434, 2007.

G.E. Hinton and R.R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.

G.E. Hinton, S. Osindero, and Y.W. Teh. A fast learning algorithm for deep belief nets. Neural computation, 18(7):1527–1554, 2006.

G.E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R.R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. Submitted on 03 Jul. 2012, 2012.

S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.

S. Hochreiter, Y. Bengio, P. Frasconi, and J. Schmidhuber. Gradient flow in recurrent nets: the difficulty of learning long-term dependencies, 2001.

H. Hoffmann. Kernel pca for novelty detection. Pattern Recognition, 40(3):863–874, 2007.

T.K. Horiuchi and C. Koch. Analog VLSI-based modeling of the primate oculomotor system. Neural Computation, 11(1):243–265, January 1999.

G.B. Huang, Q.Y. Zhu, and C.K. Siew. Extreme learning machine: theory and applications. Neurocomputing, 70(1):489–501, 2006.

D.H. Hubel and T.N. Wiesel. Receptive fields and functional architecture in the two nonstriate visual areas (18 and 19) of the cat. Jour. Neurophysiol., 160:106–154, 1965.

KM Hynna and KA Boahen. Nonlinear influence of T-channels in an in silico relay neuron. Biomedical Circuits and Systems, IEEE Transactions on, 56(6):1734, 2009.

A. Bofill i Petit and A.F. Murray. Synchrony detection and amplification by silicon neurons with STDP synapses. IEEE Transactions on Neural Networks, 15(5):1296–1304, September 2004.

N. Imam, F. Akopyan, J. Arthur, P. Merolla, R. Manohar, and D.S. Modha. A digital neurosynaptic core using event-driven qdi circuits. In Asynchronous Circuits and Systems (ASYNC), 2012 18th IEEE International Symposium on, pages 25–32, May 2012. doi: 10.1109/ASYNC.2012.12.

G. Indiveri. A neuromorphic VLSI device for implementing 2-D selective attention systems. IEEE Transactions on Neural Networks, 12(6):1455–1463, November 2001. doi: 10.1109/72.963780. URL http://ncs.ethz.ch/pubs/pdf/Indiveri01b.pdf.


G. Indiveri. A low-power adaptive integrate-and-fire neuron circuit. In International Symposium on Circuits and Systems, (ISCAS), 2003, pages IV–820–IV–823. IEEE, May 2003. doi: 10. 1109/ISCAS.2003.1206342. URL http://ncs.ethz.ch/pubs/pdf/Indiveri03b.pdf.

G. Indiveri and R.J. Douglas. ROBOTIC VISION: Neuromorphic vision sensor. Science, 288: 1189–1190, May 2000.

G. Indiveri and S. Fusi. Spike-based learning in VLSI networks of integrate-and-fire neurons. In International Symposium on Circuits and Systems, (ISCAS), 2007, pages 3371–3374. IEEE, 2007. doi: 10.1109/ISCAS.2007.378290. URL http://ncs.ethz.ch/pubs/pdf/Indiveri_ Fusi07.pdf.

G. Indiveri, E. Chicca, and R.J. Douglas. A VLSI array of low-power spiking neurons and bistable synapses with spike–timing dependent plasticity. IEEE Transactions on Neural Networks, 17 (1):211–221, Jan 2006. doi: 10.1109/TNN.2005.860850. URL http://ncs.ethz.ch/pubs/pdf/ Indiveri_etal06.pdf.

G. Indiveri, F. Stefanini, and E. Chicca. Spike-based learning with a generalized integrate and fire silicon neuron. In International Symposium on Circuits and Systems, (ISCAS), 2010, pages 1951–1954. IEEE, 2010. doi: 10.1109/ISCAS.2010.5536980. URL http://ncs.ethz.ch/pubs/ pdf/Indiveri_etal10.pdf.

G. Indiveri, B. Linares-Barranco, T.J. Hamilton, A. van Schaik, R. Etienne-Cummings, T. Delbruck, S.-C. Liu, P. Dudek, P. Häfliger, S. Renaud, J. Schemmel, G. Cauwenberghs, J. Arthur, K. Hynna, F. Folowosele, S. Saighi, T. Serrano-Gotarredona, J. Wijekoon, Y. Wang, and K. Boahen. Neuromorphic silicon neuron circuits. Frontiers in Neuroscience, 5:1–23, 2011. ISSN 1662-453X. doi: 10.3389/fnins.2011.00073. URL http://www.frontiersin.org/Neuromorphic_Engineering/10.3389/fnins.2011.00073/abstract.

E. Izhikevich and G. Edelman. Large-scale model of mammalian thalamocortical systems. Proceedings of the National Academy of Science, 105:3593–3598, 2008. doi: 10.1073/pnas. 0712231105.

E.M. Izhikevich. Simple model of spiking neurons. IEEE Transactions on Neural Networks, 14 (6):1569–1572, 2003.

E.M. Izhikevich. Dynamical systems in neuroscience: The geometry of excitability and bursting. The MIT press, 2006.

R.A. Jacobs, M.I. Jordan, S.J. Nowlan, and G.E. Hinton. Adaptive mixtures of local experts. Neural computation, 3(1):79–87, 1991.

C. Ji and S. Ma. Combinations of weak classifiers. IEEE Transactions on Neural Networks, 8(1): 32–42, jan 1997. ISSN 1045-9227. doi: 10.1109/72.554189.

X. Jin, M Lujan, L.A. Plana, S. Davies, S. Temple, and S. Furber. Modeling spiking neural networks on SpiNNaker. Computing in Science & Engineering, 12(5):91–97, September-October 2010.


S.H. Jo, T. Chang, I. Ebong, B.B. Bhadviya, P. Mazumder, and W. Lu. Nanoscale memristor device as synapse in neuromorphic systems. Nano letters, 10(4):1297–1301, 2010.

R. Jolivet, T.J. Lewis, and W. Gerstner. Generalized integrate-and-fire models of neuronal activity approximate spike trains of a detailed model to a high degree of accuracy. Journal of neurophysiology, 92:959–976, 2004.

E.G. Jones, M. Steriade, and D. McCormick. The thalamus. Plenum Press New York, 1985.

Eric Jones, Travis Oliphant, Pearu Peterson, et al. SciPy: Open source scientific tools for Python, 2001–. URL http://www.scipy.org/.

S. Jung and S. Su Kim. Hardware implementation of a real-time neural network controller with a dsp and an fpga for nonlinear systems. Industrial Electronics, IEEE Transactions on, 54(1): 265–271, 2007.

J.H. Kaas. Is most of neural plasticity in the thalamus cortical? Proceedings of the National Academy of Sciences, 96(14):7622–7623, 1999.

D. Kahng and S.M. Sze. A floating-gate and its applications to memory devices. The Bell System Technical Journal, XLVI(6):1288–1295, July-August 1967.

H.C. Kim, S. Pang, H.M. Je, D. Kim, and S.Y. Bang. Pattern classification using support vector machine ensemble. In Pattern Recognition, 2002. Proceedings. 16th International Conference on, volume 2, pages 160–163. IEEE, 2002.

K.H. Kim, S. Gaba, D. Wheeler, J.M. Cruz-Albrecht, T. Hussain, N. Srinivasa, and W. Lu. A functional hybrid memristor crossbar-array/cmos system for data storage and neuromorphic applications. Nano letters, 12(1):389–395, 2011.

Y. Komura, R. Tamura, T. Uwano, H Nishijo, K. Kaga, and T. Ono. Retrospective and prospective coding for predicted reward in the sensory thalamus. Nature, 412(6846):546–549, 2001.

D. Krupa, A.A. Ghazanfar, and M.A.L. Nicolelis. Immediate thalamic sensory plasticity depends on corticothalamic feedback. Proceedings of the National Academy of Sciences, 96(14):8200–8205, 1999.

J. Lazzaro, S. Ryckebusch, M.A. Mahowald, and C.A. Mead. Winner-take-all networks of O(n) complexity. In D.S. Touretzky, editor, Advances in neural information processing systems, volume 2, pages 703–711, San Mateo - CA, 1989. Morgan Kaufmann.

Q.V. Le, M.A. Ranzato, R. Monga, M. Devin, K. Chen, G.S. Corrado, J. Dean, and A.Y. Ng. Building high-level features using large scale unsupervised learning. Last revised 12 Jul 2012 (v5), 2012.

Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

M.A. Lewis, R. Etienne-Cummings, M. Hartmann, A.H. Cohen, and Z.R. Xu. An in silico central pattern generator: silicon oscillator, coupling, entrainment, physical computation and biped mechanism control. Biological Cybernetics, 88(2):137–151, 2003.


P. Lichtsteiner and T. Delbruck. A 64x64 AER logarithmic temporal derivative silicon retina. In Research in Microelectronics and Electronics, 2005 PhD, volume 2, pages 202–205, July 2005.

S.-C. Liu, J. Kramer, G. Indiveri, T. Delbruck, and R.J. Douglas. Analog VLSI: Circuits and Principles. MIT Press, 2002. URL http://ncs.ethz.ch/pubs/pdf/Liu_etal02b.pdf.

S.C. Liu, A. van Schaik, B.A. Minch, and T. Delbruck. Event-based 64-channel binaural silicon cochlea with q enhancement mechanisms. In Circuits and Systems (ISCAS), Proceedings of 2010 IEEE International Symposium on, pages 2027–2030. IEEE, 2010.

P. Livi and G. Indiveri. A current-mode conductance-based silicon neuron for address-event neuromorphic systems. In International Symposium on Circuits and Systems, (ISCAS), 2009, pages 2898–2901. IEEE, May 2009. doi: 10.1109/ISCAS.2009.5118408. URL http://ncs.ethz. ch/pubs/pdf/Livi_Indiveri09.pdf.

W. Maass. On the computational power of winner-take-all. Neural Computation, 2000.

W. Maass, T. Natschläger, and H. Markram. Real-time computing without stable states: A new framework for neural computation based on perturbations. Neural Computation, 14(11): 2531–2560, 2002.

L.P. Maguire, T. M. McGinnity, B. Glackin, A. Ghani, A. Belatreche, and J. Harkin. Challenges for large-scale implementations of spiking neural networks on fpgas. Neurocomputing, 71(1): 13–29, 2007.

M. Mahowald and R.J. Douglas. A silicon neuron. Nature, 354:515–518, 1991.

M.A. Mahowald. VLSI analogs of neuronal visual processing: a synthesis of form and function. PhD thesis, Department of Computation and Neural Systems, California Institute of Technology, Pasadena, CA., 1992.

Rajit Manohar. A case for asynchronous computer architecture. Web, 2000. URL http: //citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.43.949.

H. Markram and M. Tsodyks. Redistribution of synaptic efficacy between neocortical pyramidal neurons. Nature, 382:807–10, 1996.

Timothée Masquelier, Rudy Guyonneau, and Simon J. Thorpe. Competitive stdp-based spike pattern learning. Neural Computation, 21(5):1259–1276, 2009. doi: 10.1162/neco.2008.06-08-804.

T.M. Massoud and T.K. Horiuchi. A neuromorphic vlsi head direction cell system. Circuits and Systems I: Regular Papers, IEEE Transactions on, 58(1):150–163, 2011.

C. Mead. Neuromorphic electronic systems. Proceedings of the IEEE, 78(10):1629–36, 1990.

C.A. Mead. Analog VLSI and Neural Systems. Addison-Wesley, Reading, MA, 1989.

P. Merolla and K. Boahen. Dynamic computation in a recurrent network of heterogeneous silicon neurons. In International Symposium on Circuits and Systems, (ISCAS), 2006, pages 4539–4542. IEEE, May 2006.


P. Merolla, J. Arthur, F. Akopyan, N. Imam, R. Manohar, and D.S. Modha. A digital neurosynaptic core using embedded crossbar memory with 45pj per spike in 45nm. In Custom Integrated Circuits Conference (CICC), 2011 IEEE, pages 1–4, Sept. 2011. doi: 10.1109/CICC.2011. 6055294.

N. Mesgarani, M. Slaney, and S.A. Shamma. Discrimination of speech from nonspeech based on multiscale spectro-temporal modulations. Audio, Speech, and Language Processing, IEEE Transactions on, 14(3):920–930, 2006.

D. Meyer, F. Leisch, and K. Hornik. The support vector machine under test. Neurocomputing, 55 (1):169–186, 2003.

S. Mihalas and E. Niebur. A generalized linear integrate-and-fire neural model produces diverse spiking behavior. Neural Computation, 21:704–718, 2009.

S. Mika, G. Ratsch, J. Weston, B. Schölkopf, and K.R. Mullers. Fisher discriminant analysis with kernels. In Neural Networks for Signal Processing IX, 1999. Proceedings of the 1999 IEEE Signal Processing Society Workshop., pages 41–48. IEEE, 1999.

R. Mill, S. Sheik, G. Indiveri, and S. Denham. A model of stimulus-specific adaptation in neuromorphic analog VLSI. Biomedical Circuits and Systems, IEEE Transactions on, 5(5): 413–419, 2011. doi: 10.1109/TBCAS.2011.2163155. URL http://ncs.ethz.ch/pubs/pdf/ Mill_etal11.pdf.

E.K. Miller and J.D. Cohen. An integrative theory of prefrontal cortex function. Annual review of neuroscience, 24(1):167–202, 2001.

K. Minkovich, N. Srinivasa, J.M. Cruz-Albrecht, Y. Cho, and A. Nogin. Programming time-multiplexed reconfigurable hardware using a scalable neuromorphic compiler. Neural Networks and Learning Systems, IEEE Transactions on, 23(6):889–901, 2012.

M.L. Minsky and S.A. Papert. Perceptrons: An Introduction to Computational Geometry. MIT Press, Cambridge, Mass, 1969.

J. Misra and I. Saha. Artificial neural networks in hardware: A survey of two decades of progress. Neurocomputing, 74(1):239–255, 2010.

S. Mitra, S. Fusi, and G. Indiveri. Real-time classification of complex patterns using spike-based learning in neuromorphic VLSI. Biomedical Circuits and Systems, IEEE Transactions on, 3(1): 32–42, Feb. 2009. doi: 10.1109/TBCAS.2008.2005781. URL http://ncs.ethz.ch/pubs/pdf/ Mitra_etal09.pdf.

S. Mitra, G. Indiveri, and R. Etienne-Cummings. Synthesis of log-domain integrators for silicon synapses with global parametric control. In International Symposium on Circuits and Systems, (ISCAS), 2010, pages 97–100. IEEE, 2010. doi: 10.1109/ISCAS.2010.5537019. URL http://ncs.ethz.ch/pubs/pdf/Mitra_etal10.pdf.

mnist. The MNIST database of handwritten digits. Yann LeCun’s web-site, May 2012. URL http://yann.lecun.com/exdb/mnist/.


G. Mongillo, D.J. Amit, and N. Brunel. Retrospective and prospective persistent activity induced by hebbian learning in a recurrent cortical network. European Journal of Neuroscience, 18(7): 2011–2024, 2003.

S. Moradi and G. Indiveri. A VLSI network of spiking neurons with an asynchronous static random access memory. In Biomedical Circuits and Systems Conference (BioCAS), 2011, pages 277–280. IEEE, 2011. doi: 10.1109/BioCAS.2011.6107781. URL http://ncs.ethz.ch/pubs/ pdf/Moradi_Indiveri11.pdf.

M. Rigotti, O. Barak, M.R. Warden, X.J. Wang, N.D. Daw, E.K. Miller, and S. Fusi. The importance of mixed selectivity in complex cognitive tasks. Nature, May 2013. ISSN 1476-4687. doi: 10.1038/nature12160. URL http://dx.doi.org/10.1038/nature12160.

K.R. Müller, A.J. Smola, G. Rätsch, B. Schölkopf, J. Kohlmorgen, and V. Vapnik. Predicting time series with support vector machines. In Artificial Neural Networks, ICANN 1997, pages 999–1004. Springer, 1997.

J.M. Nageswaran, N. Dutt, J.L. Krichmar, A. Nicolau, and A.V. Veidenbaum. A configurable simulation environment for the efficient simulation of large-scale spiking neural networks on graphics processors. Neural Networks, 22(5-6):791–800, 2009.

Stephen G. Nash. A survey of truncated-Newton methods. Journal of Computational and Applied Mathematics, 124(1-2):45–59, 2000. ISSN 0377-0427. doi: 10.1016/S0377-0427(00)00426-X. URL http://www.sciencedirect.com/science/article/B6TYH-41MJ0RK-4/2/e9e5a91a3219fd5d4bdce9ac15e92dd5.

Richard Naud, Thomas Berger, Wulfram Gerstner, Brice Bathellier, and Matteo Carandini. Quantitative Single-Neuron Modeling: Competition 2009. Frontiers in Neuroinformatics, pages 1–8, 2009. ISSN 1662-5196. doi: 10.3389/conf.neuro.11.2009.08.106. URL http://www.frontiersin.org/conferences/individual_abstract_listing.php?conferid=155&pap=2139&ind_abs=1&q=98.

E. Neftci and G. Indiveri. A device mismatch compensation method for VLSI spiking neural networks. In Biomedical Circuits and Systems Conference (BioCAS), 2010, pages 262–265. IEEE, 2010. doi: 10.1109/BIOCAS.2010.5709621. URL http://ncs.ethz.ch/pubs/pdf/ Neftci_Indiveri10.pdf.

E. Neftci, E. Chicca, M. Cook, G. Indiveri, and R.J. Douglas. State-dependent sensory processing in networks of VLSI spiking neurons. In International Symposium on Circuits and Systems, (ISCAS), 2010, pages 2789–2792. IEEE, 2010. doi: 10.1109/ISCAS.2010.5537007. URL http://ncs.ethz.ch/pubs/pdf/Neftci_etal10.pdf.

E. Neftci, E. Chicca, G. Indiveri, and R.J. Douglas. A systematic method for configuring VLSI networks of spiking neurons. Neural Computation, 23(10):2457–2497, Oct. 2011. doi: 10.1162/NECO_a_00182. URL http://ncs.ethz.ch/pubs/pdf/Neftci_etal11.pdf.

E. Neftci, B. Toth, G. Indiveri, and H. Abarbanel. Dynamic state and parameter estimation applied to neuromorphic systems. Neural Computation, 24(7):1669–1694, July 2012a. doi: 10.1162/NECO_a_00293. URL http://ncs.ethz.ch/pubs/pdf/Neftci_etal12.pdf.


Emre Neftci, Jonathan Binas, Elisabetta Chicca, Giacomo Indiveri, and Rodney Douglas. Systematic construction of finite state automata using VLSI spiking neurons. In Tony Prescott, Nathan Lepora, Anna Mura, and Paul Verschure, editors, Biomimetic and Biohybrid Systems, volume 7375 of Lecture Notes in Computer Science, pages 382–383. Springer Berlin / Heidelberg, 2012b. ISBN 978-3-642-31524-4. doi: 10.1007/978-3-642-31525-1_52. URL http://ncs.ethz.ch/pubs/pdf/Neftci_etal12b.pdf.

A. Nere, U. Olcese, D. Balduzzi, and G. Tononi. A neuromorphic architecture for object recognition and motion anticipation using burst-stdp. PLoS One, 7(5):e36958, 2012.

E. Osuna, R. Freund, and F. Girosit. Training support vector machines: an application to face detection. In Computer Vision and Pattern Recognition, 1997. Proceedings., 1997 IEEE Computer Society Conference on, pages 130–136. IEEE, 1997.

E. Painkras, L.A. Plana, J. Garside, S. Temple, F. Galluppi, C. Patterson, D.R. Lester, A.D. Brown, and S.B. Furber. SpiNNaker: A 1-W 18-core system-on-chip for massively-parallel neural network simulation. IEEE Journal of Solid-State Circuits, 48(8):–, August 2013. ISSN 0018-9200. doi: 10.1109/JSSC.2013.2259038.

Y.V. Pershin and M. Di Ventra. Experimental demonstration of associative memory with memristive neural networks. Neural Networks, 23:881–886, 2010.

T. Pfeil, A. Grübl, S. Jeltsch, E. Müller, P. Müller, M. Petrovici, M. Schmuker, D. Brüderle, J. Schemmel, and K. Meier. Six networks on a universal neuromorphic computing substrate. Frontiers in Neuroscience, 7, 2013.

D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, J. Silovsky, G. Stemmer, and K. Vesely. The kaldi speech recognition toolkit. In IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. IEEE Signal Processing Society, 2011. IEEE Catalog No.: CFP11SRW-USB.

M. Rabinovich, R. Huerta, and G. Laurent. Transient dynamics for neural processing. Science, 321:48–50, Jul 2008. doi: 10.1126/science.1155564. URL http://www.pubmed.org/18599763.

Guy Rachmuth, Harel Z. Shouval, Mark F. Bear, and Chi-Sang Poon. A biophysically-based neuromorphic model of spike rate- and timing-dependent plasticity. Proceedings of the National Academy of Science, 108(49):E1266–E1274, December 2011. doi: 10.1073/pnas.1106161108.

S. Ramakrishnan, R. Wunderlich, and P. Hasler. Neuron array with plastic synapses and programmable dendrites. In Biomedical Circuits and Systems Conference, (BioCAS), 2012, pages 400–403. IEEE, Nov. 2012. doi: 10.1109/BioCAS.2012.6418412.

A. Renart, P. Song, and X.-J. Wang. Robust spatial working memory through homeostatic synaptic scaling in heterogeneous cortical networks. Neuron, 38:473–485, May 2003.

F. Rieke. Spikes: reading the neural code. The MIT Press, 1997.

M. Riesenhuber and T. Poggio. Hierarchical models of object recognition in cortex. Nature Neuroscience, 2(11):1019–25, 1999.


M. Rigotti, D.D. Ben Dayan Rubin, S.E. Morrison, C.D. Salzman, and S. Fusi. Attractor concretion as a mechanism for the formation of context representations. NeuroImage, 52 (3):833–847, 2010a. ISSN 1053-8119. doi: 10.1016/j.neuroimage.2010.01.047. URL http: //www.sciencedirect.com/science/article/pii/S1053811910000698.

M. Rigotti, D.D. Ben Dayan Rubin, X.-J. Wang, and S. Fusi. Internal representation of task rules by recurrent dynamics: the importance of the diversity of neural responses. Frontiers in Computational Neuroscience, 4(0), 2010b. ISSN 1662-5188. doi: 10.3389/fncom.2010.00024. URL http://www.frontiersin.org/Journal/Abstract.aspx?s=237&name=computationalneuroscience&ART_DOI=10.3389/fncom.2010.00024.

F. Rosenblatt. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65(6):386–408, Nov 1958. doi: 10.1037/h0042519.

Ueli Rutishauser and Rodney Douglas. State-dependent computation using coupled recurrent networks. Neural Computation, 21:478–509, 2009.

Y. Sandamirskaya and G. Schöner. An embodied account of serial order: How instabilities drive sequence generation. Neural Networks, 23(10):1164–1179, 2010. doi: 10.1016/j.neunet.2010.07.012.

R Sarpeshkar. Brain power – borrowing from biology makes for low power computing – bionic ear. IEEE Spectrum, 43(5):24–29, May 2006.

R. Sarpeshkar, T. Delbruck, and C.A. Mead. White noise in MOS transistors and resistors. IEEE Circuits and Devices Magazine, 9(6):23–29, November 1993.

R.E. Schapire and Y. Freund. Boosting: Foundations and Algorithms. MIT Press (MA), 2012.

J. Schemmel, D. Brüderle, K. Meier, and B. Ostendorf. Modeling synaptic plasticity within networks of highly accelerated I&F neurons. In International Symposium on Circuits and Systems, (ISCAS), 2007, pages 3367–3370. IEEE, 2007.

J. Schemmel, J. Fieres, and K. Meier. Wafer-scale integration of analog neural networks. In Proceedings of the IEEE International Joint Conference on Neural Networks, 2008.

E. Schneidman, W. Bialek, and M.J. Berry II. Synergy, redundancy, and independence in population codes. The Journal of Neuroscience, 23(37):11539–11553, 2003.

C. Schölkopf and J.C. Burges. Advances in Kernel Methods. MIT Press, 1999.

G. Schöner and E. Dineva. Dynamic instabilities as mechanisms for emergence. Developmental Science, 10(1):69–74, 2007.

J.R. Searle. Minds, brains, and programs. Behavioral and brain sciences, 3(3):417–457, 1980.

W. Senn. Beyond spike timing: the role of nonlinear plasticity and unreliable synapses. Biol. Cybern., 87:344–355, 2002.

W. Senn and S. Fusi. Learning Only When Necessary: Better Memories of Correlated Patterns in Networks with Bounded Synapses. Neural Computation, 17(10):2106–2138, 2005. URL http://neco.mitpress.org/cgi/content/abstract/17/10/2106.


J. Seo, B. Brezzo, Y. Liu, B.D. Parker, S.K. Esser, R.K. Montoye, B. Rajendran, J. Tierno, L. Chang, and D.S. Modha. A 45nm cmos neuromorphic chip with a scalable architecture for learning in networks of spiking neurons. In Custom Integrated Circuits Conference (CICC), 2011 IEEE, pages 1–4. IEEE, 2011.

R. Serrano-Gotarredona, M. Oster, P. Lichtsteiner, A. Linares-Barranco, R. Paz-Vicente, F. Gómez-Rodriguez, L. Camunas-Mesa, R. Berner, M. Rivas-Perez, T. Delbruck, S.-C. Liu, R. Douglas, P. Häfliger, G. Jimenez-Moreno, A. Civit-Ballcels, T. Serrano-Gotarredona, A.J. Acosta-Jiménez, and B. Linares-Barranco. CAVIAR: A 45k neuron, 5M synapse, 12G connects/s AER hardware sensory–processing–learning–actuating system for high-speed visual object recognition and tracking. IEEE Transactions on Neural Networks, 20(9):1417–1438, September 2009. doi: 10.1109/TNN.2009.2023653.

Teresa Serrano-Gotarredona, Timothée Masquelier, Themistoklis Prodromakis, Giacomo Indiveri, and Bernabe Linares-Barranco. STDP and STDP variations with memristors for spiking neuromorphic learning systems. Frontiers in Neuroscience, 7(2), 2013. ISSN 1662-453X. doi: 10.3389/fnins.2013.00002. URL http://www.frontiersin.org/neuroscience/10.3389/ fnins.2013.00002/full.

S. Sheik, F. Stefanini, E. Neftci, E. Chicca, and G. Indiveri. Systematic configuration and automatic tuning of neuromorphic systems. In International Symposium on Circuits and Systems, (ISCAS), 2011, pages 873–876. IEEE, May 2011. doi: 10.1109/ISCAS.2011.5937705. URL http://ncs.ethz.ch/pubs/pdf/Sheik_etal11.pdf.

S. Sheik, E. Chicca, and G. Indiveri. Exploiting device mismatch in neuromorphic VLSI systems to implement axonal delays. In International Joint Conference on Neural Networks, IJCNN 2012, pages 1940–1945. IEEE, 2012. URL http://ncs.ethz.ch/pubs/pdf/Sheik_etal12b.pdf.

S Murray Sherman. The thalamus is more than just a relay. Current opinion in neurobiology, 17 (4):417, 2007.

W.L. Shew, H. Yang, S. Yu, R. Roy, and D. Plenz. Information capacity and transmission are maximized in balanced cortical networks with neuronal avalanches. The Journal of Neuroscience, 31(1):55–63, 2011.

B. E. Shi. The effect of mismatch in current- versus voltage-mode resistive grids. International Journal of Circuit Theory and Applications, 37:53–65, 2009.

W. Shockley. Problems related to p-n junctions in silicon. Solid-State Electronics, 2(1):35–67, 1961.

Harel Z Shouval, Mark F Bear, and Leon N Cooper. A unified model of NMDA receptor-dependent bidirectional synaptic plasticity. Proc Natl Acad Sci USA, 99(16):10831–10836, Aug 2002. doi: 10.1073/pnas.152343099. URL http://dx.doi.org/10.1073/pnas.152343099.

R. Silver, K. Boahen, S. Grillner, N. Kopell, and K.L. Olsen. Neurotech for neuroscience: unifying concepts, organizing principles, and emerging tools. Journal of Neuroscience, 27(44):11807, 2007.


M.F. Simoni, G.S. Cymbalyuk, M.E. Sorensen, R.L. Calabrese, and S.P. DeWeerth. A multiconductance silicon neuron with biologically matched dynamics. IEEE Transactions on Biomedical Engineering, 51(2):342–354, February 2004.

G.S.V.S. Sivaram, S.K. Nemala, M. Elhilali, T.D. Tran, and H. Hermansky. Sparse coding for speech recognition. In Acoustics Speech and Signal Processing (ICASSP), 2010 IEEE International Conference on, pages 4346–4349. IEEE, 2010.

D.C. Somers, S.B. Nelson, and M. Sur. An emergent model of orientation selectivity in cat visual cortical simple cells. The Journal of Neuroscience, 15:5448–65, 1995.

N. Srinivasa and J.M. Cruz-Albrecht. Neuromorphic adaptive plastic scalable electronics: Analog learning systems. Pulse, IEEE, 3(1):51–56, 2012.

L. Takács. Investigation of waiting time problems by reduction to Markov processes. Acta Mathematica Hungarica, 6(1):101–129, 1955.

J. Tapson and A. van Schaik. Learning the pseudoinverse solution to network weights. Neural Networks, 2013. ISSN 0893-6080. doi: 10.1016/j.neunet.2013.02.008. URL http://www.sciencedirect.com/science/article/pii/S089360801300049X.

C. Toumazou, F.J. Lidgey, and D.G. Haigh, editors. Analogue IC design: the current-mode approach. Peregrinus, Stevenage, Herts., UK, 1990.

A.M. Turing. Computing machinery and intelligence. Mind, 59(236):433–460, 1950.

G.G. Turrigiano, K.R. Leslie, N.S. Desai, L.C. Rutherford, and S.B. Nelson. Activity-dependent scaling of quantal amplitude in neocortical neurons. Nature, 391:892–896, February 1998.

A. van Schaik, C. Jin, and T.J. Hamilton. A log-domain implementation of the Izhikevich neuron model. In International Symposium on Circuits and Systems, (ISCAS), 2010, pages 4253–4256. IEEE, 2010a.

A. van Schaik, C. Jin, T.J. Hamilton, S. Mihalas, and E. Niebur. A log-domain implementation of the Mihalas-Niebur neuron model. In International Symposium on Circuits and Systems, (ISCAS), 2010, pages 4249–4252. IEEE, 2010b.

V. Vapnik. The nature of statistical learning theory. Springer-Verlag, 1995.

V. Vapnik. Statistical learning theory. Wiley, 1998.

V. Vasilkov and R.A. Tikidji-Hamburyan. Accurate detection of interaural time differences by a population of slowly integrating neurons. Physical Review Letters, 108(13):138104, 2012.

L.B. Vosshall, A.M. Wong, and R. Axel. An olfactory sensory map in the fly brain. Cell, 102(2): 147–159, 2000.

S. van der Walt, S.C. Colbert, and G. Varoquaux. The NumPy array: a structure for efficient numerical computation. Computing in Science & Engineering, 13(2):22–30, 2011.

J.W. Wang, A.M. Wong, J. Flores, L.B. Vosshall, and R. Axel. Two-photon calcium imaging reveals an odor-evoked map of activity in the fly brain. Cell, 112(2):271–282, 2003.


R. Wang, G. Cohen, K.M. Stiefel, T.J. Hamilton, J. Tapson, and A. van Schaik. An FPGA implementation of a polychronous spiking neural network with delay adaptation. Frontiers in Neuroscience, 7, 2013.

X.J. Wang. Probabilistic decision making by slow reverberation in cortical circuits. Neuron, 36(5):955–968, 2002.

K. Wendt, M. Ehrlich, and R. Schüffny. A graph theoretical approach for a multistep mapping software for the FACETS project. In Proceedings of the 2008 WSEAS international conference on computer engineering and applications (CEA), pages 189–194, 2008.

J.H.B. Wijekoon and P. Dudek. Compact silicon neuron circuit with spiking and bursting behaviour. Neural Networks, 21(2–3):524–534, March–April 2008.

J.H.B. Wijekoon and P. Dudek. VLSI circuits implementing computational models of neocortical circuits. Journal of Neuroscience Methods, 210(1):93–109, 2012.

Richard Wood, Alex McGlashan, Jay Yatulis, Peter Mascher, and Ian Bruce. Digital implementation of a neural network for imaging. In Photonics North 2012, pages 84121H–84121H. International Society for Optics and Photonics, 2012.

W. Wulf, E. Cohen, W. Corwin, A. Jones, R. Levin, C. Pierson, and F. Pollack. Hydra: The kernel of a multiprocessor operating system. Communications of the ACM, 17(6):337–345, 1974.

X. Xie, R.H.R. Hahnloser, and H.S. Seung. Selectively grouping neurons in recurrent networks of lateral inhibition. Neural Computation, 14(11):2627–2646, 2002.

T. Yu and G. Cauwenberghs. Analog VLSI biophysical neurons and synapses with programmable membrane channel kinetics. Biomedical Circuits and Systems, IEEE Transactions on, 4(3): 139–148, June 2010.

A.L. Yuille and D. Geiger. Winner-take-all networks. In The Handbook of Brain Theory and Neural Networks, pages 1228–1231. MIT Press, Cambridge, MA, USA, 2003.

Fabio Stefanini

Personal Data

Nationality: Italian
Marital Status: Married
Place and Date of Birth: Italy — 19 April 1983
Phone: +41 44 6353046
Fax: +41 44 6353053
email: [email protected]
Address: Institute of Neuroinformatics, University of Zurich and ETH Zurich, Winterthurerstrasse 190, CH-8057 Zurich, Switzerland

Current Position: 5th-year PhD Student at Institute of Neuroinformatics, University of Zurich and ETH Zurich

Scientific profile: Stefanini obtained a “Laurea Triennale” degree (BSc) and a “Laurea Magistrale” degree (MSc) in Physics from La Sapienza University of Rome, Rome, Italy, in 2008. He has been a Research Collaborator at the Institute for Complex Systems, CNR-INFM, Rome, Italy, developing experimental, software, and theoretical methods for the study of collective behaviour in flocking birds. His main research interests are in attractor neural networks, learning systems, and neuromorphic engineering.

Education:
• 2009-present: PhD Student at Institute of Neuroinformatics, UZH and ETH Zurich
• 26/02/2009: MSc in Physics at La Sapienza University of Rome, Rome, Italy
• 02/10/2006: BSc in Physics at La Sapienza University of Rome, Rome, Italy

Research experience:
• 10/2006-03/2007: Research Collaborator at the Institute for Complex Systems, CNR-INFM, Rome, Italy

Fabio Stefanini’s selected list of publications

[1] Cavagna, A., Queirós, S.M.D., Giardina, I., Stefanini, F., and Viale, M. Diffusion of individual birds in starling flocks. Proceedings of the Royal Society B: Biological Sciences, 280(1756), 2013.
[2] Sheik, S., Stefanini, F., Neftci, E., Chicca, E., and Indiveri, G. Systematic configuration and automatic tuning of neuromorphic systems. Circuits and Systems (ISCAS), 2011 IEEE International Symposium on, 873–876, 2011.
[3] Cavagna, A., Cimarelli, A., Giardina, I., Parisi, G., Santagati, R., Stefanini, F., and Viale, M. Scale-free correlations in starling flocks. Proceedings of the National Academy of Sciences, 107(26):11865–11870, 2010.
[4] Cavagna, A., Cimarelli, A., Giardina, I., Parisi, G., Santagati, R., Stefanini, F., and Tavarone, R. From empirical data to inter-individual interactions: Unveiling the rules of collective animal behavior. Mathematical Models and Methods in Applied Sciences, 20(supp01):1491–1510, 2010.

[5] Cavagna, A., Cimarelli, A., Giardina, I., Orlandi, A., Parisi, G., Procaccini, A., Santagati, R., and Stefanini, F. New statistical tools for analyzing the structure of animal groups. Mathematical Biosciences, 214(1):32–37, 2008.
[6] Stefanini, F. La fisica degli stormi di storni in volo. Matematica e cultura 2010, 185–194, Springer Milan, 2010.

[7] Indiveri, G. and Stefanini, F. and Chicca, E., Spike-based learning with a generalized integrate and fire silicon neuron Circuits and Systems (ISCAS), Proceedings of 2010 IEEE International Symposium on, 1951-1954 (2010) [8] Beyeler, M. and Stefanini, F. and Proske, H. and Galizia, G. and Chicca, E., Exploring olfactory sensory networks: simulations and hardware emulation, Biomedical Circuits and Systems Confer- ence (BioCAS), 2010 IEEE, 270-273 (2010)